
Token Filters

On this page

  • asciiFolding
  • daitchMokotoffSoundex
  • edgeGram
  • icuFolding
  • icuNormalizer
  • length
  • lowercase
  • nGram
  • regex
  • reverse
  • shingle
  • snowballStemming
  • stopword
  • trim

Token Filters always require a type field, and some take additional options as well.

"tokenFilters": [
{
"type": "<token-filter-type>",
"<additional-option>": <value>
}
]

Atlas Search supports the following token filters:

asciiFolding

The asciiFolding token filter converts alphabetic, numeric, and symbolic Unicode characters that are not in the Basic Latin Unicode block to their ASCII equivalents, if available. It has the following attributes:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be asciiFolding.

originalTokens (string, optional; default: omit)
  String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following:

  • include - include the original tokens with the converted tokens in the output of the token filter. We recommend this value if you want to support queries on both the original tokens as well as the converted forms.

  • omit - omit the original tokens and include only the converted tokens in the output of the token filter. Use this value if you want to query only on the converted forms of the original tokens.

Example

The following index definition example uses a custom analyzer named asciiConverter. It uses the standard tokenizer with the asciiFolding token filter to index the fields in the example collection and convert the field values to their ASCII equivalent.

{
  "analyzer": "asciiConverter",
  "searchAnalyzer": "asciiConverter",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "asciiConverter",
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "asciiFolding"
        }
      ]
    }
  ]
}

The following query searches the first_name field for names using their ASCII equivalent.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "Sian",
        "path": "page_updated_by.first_name"
      }
    }
  },
  {
    $project: {
      "_id": 1,
      "page_updated_by.last_name": 1,
      "page_updated_by.first_name": 1
    }
  }
])

Atlas Search returns the following results:

[
  {
    _id: 1,
    page_updated_by: { last_name: 'AUERBACH', first_name: 'Siân' }
  }
]

daitchMokotoffSoundex

The daitchMokotoffSoundex token filter creates tokens for words that sound the same based on the Daitch-Mokotoff Soundex phonetic algorithm. This filter can generate multiple encodings for each input, where each encoded token is a six-digit number.

Note

Don't use the daitchMokotoffSoundex token filter in synonym or autocomplete mapping definitions.

It has the following attributes:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be daitchMokotoffSoundex.

originalTokens (string, optional; default: include)
  String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following:

  • include - include the original tokens with the encoded tokens in the output of the token filter. We recommend this value if you want queries on both the original tokens as well as the encoded forms.

  • omit - omit the original tokens and include only the encoded tokens in the output of the token filter. Use this value if you want to query only on the encoded forms of the original tokens.

Example

The following index definition example uses a custom analyzer named dmsAnalyzer. It uses the standard tokenizer with the daitchMokotoffSoundex token filter to index both the original tokens and their encoded forms, so that you can query for words that sound the same.

{
  "analyzer": "dmsAnalyzer",
  "searchAnalyzer": "dmsAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "dmsAnalyzer",
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "daitchMokotoffSoundex",
          "originalTokens": "include"
        }
      ]
    }
  ]
}

The following query searches for terms that sound similar to AUERBACH in the page_updated_by.last_name field of the minutes collection.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "AUERBACH",
        "path": "page_updated_by.last_name"
      }
    }
  },
  {
    $project: {
      "_id": 1,
      "page_updated_by.last_name": 1
    }
  }
])

The query returns the following results:

{ "_id" : 1, "page_updated_by" : { "last_name" : "AUERBACH" } }
{ "_id" : 2, "page_updated_by" : { "last_name" : "OHRBACH" } }

Atlas Search returns documents with _id: 1 and _id: 2 because the terms in both documents are phonetically similar, and are encoded using the same six-digit code, 097500.

edgeGram

The edgeGram token filter tokenizes input from the left side, or "edge", of a text input into n-grams of configured sizes. You can't use the edgeGram token filter in synonym or autocomplete mapping definitions. It has the following attributes:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be edgeGram.

minGram (integer, required)
  Number that specifies the minimum length of generated n-grams. Value must be less than or equal to maxGram.

maxGram (integer, required)
  Number that specifies the maximum length of generated n-grams. Value must be greater than or equal to minGram.

termNotInBounds (string, optional; default: omit)
  String that specifies whether to index tokens shorter than minGram or longer than maxGram. Accepted values are:

  • include

  • omit

  If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed.

Example

The following index definition example uses a custom analyzer named englishAutocomplete. It performs the following operations:

  • Tokenizes with the standard tokenizer.

  • Applies the following token filters:

    • icuFolding

    • shingle

    • edgeGram

{
  "analyzer": "englishAutocomplete",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "englishAutocomplete",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "icuFolding"
        },
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        },
        {
          "type": "edgeGram",
          "minGram": 1,
          "maxGram": 10
        }
      ]
    }
  ]
}
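
The following query sketch shows how you might run an autocomplete-style search against fields indexed with the englishAutocomplete analyzer. It is a minimal sketch, assuming this index is named default on the minutes collection and that the title field contains the word meeting; the field name and document contents are illustrative assumptions, not part of the index definition above.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "mee",          // partial term; edge grams such as m, me, mee are indexed
        "path": "title"          // assumed field name
      }
    }
  },
  {
    $project: {
      "_id": 1,
      "title": 1
    }
  }
])

Because the edgeGram token filter indexes left-edge n-grams of one to ten characters, the partial term mee can match a title that contains meeting.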

Tip

See also:

The shingle token filter for a sample index definition and query.

icuFolding

The icuFolding token filter applies character folding from Unicode Technical Report #30. It has the following attribute:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be icuFolding.

Example

The following index definition example uses a custom analyzer named diacriticFolder. It uses the keyword tokenizer with the icuFolding token filter to apply foldings from UTR#30 Character Foldings. Foldings include accent removal, case folding, canonical duplicates folding, and many others detailed in the report.

{
  "analyzer": "diacriticFolder",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "diacriticFolder",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "icuFolding"
        }
      ]
    }
  ]
}
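
The following query sketch shows how folded tokens match unaccented input. It is a minimal sketch, assuming this index is named default on the minutes collection, whose page_updated_by.first_name field contains the value Siân.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "Sian",         // unaccented query term
        "path": "page_updated_by.first_name"
      }
    }
  },
  {
    $project: {
      "_id": 1,
      "page_updated_by.first_name": 1
    }
  }
])

Because icuFolding folds away diacritics and case, the unaccented term Sian should match the stored value Siân.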

icuNormalizer

The icuNormalizer token filter normalizes tokens using a standard Unicode Normalization Mode. It has the following attributes:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be icuNormalizer.

normalizationForm (string, optional; default: nfc)
  Normalization form to apply. Accepted values are:

  • nfd (Canonical Decomposition)

  • nfc (Canonical Decomposition, followed by Canonical Composition)

  • nfkd (Compatibility Decomposition)

  • nfkc (Compatibility Decomposition, followed by Canonical Composition)

  To learn more about the supported normalization forms, see Section 1.2: Normalization Forms, UTR#15.

Example

The following index definition example uses a custom analyzer named normalizer. It uses the whitespace tokenizer, then normalizes tokens by Canonical Decomposition, followed by Canonical Composition.

{
  "analyzer": "normalizer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "normalizer",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "icuNormalizer",
          "normalizationForm": "nfc"
        }
      ]
    }
  ]
}
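
The following query sketch illustrates the effect of normalization. It is a minimal sketch, assuming this index is named default on the minutes collection and that a hypothetical message field stores the word café with a decomposed accent (NFD); because both the indexed text and the query term are normalized to NFC, the precomposed query term should still match.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "café",         // uses the precomposed é; normalization makes the forms equivalent
        "path": "message"        // hypothetical field name
      }
    }
  }
])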

length

The length token filter removes tokens that are too short or too long. It has the following attributes:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be length.

min (integer, optional; default: 0)
  Number that specifies the minimum length of a token. Value must be less than or equal to max.

max (integer, optional; default: 255)
  Number that specifies the maximum length of a token. Value must be greater than or equal to min.

Example

The following index definition example uses a custom analyzer named longOnly. It uses the length token filter to index only tokens that are at least 20 UTF-16 code units long after tokenizing with the standard tokenizer.

{
  "analyzer": "longOnly",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "longOnly",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "length",
          "min": 20
        }
      ]
    }
  ]
}
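
The following query sketch shows the effect of the min setting. It is a minimal sketch, assuming this index is named default on the minutes collection and that the text field contains the 20-character term internationalization; the document contents are an illustrative assumption.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "internationalization",   // 20 code units, so the token survives the length filter
        "path": "text"
      }
    }
  }
])

Terms shorter than 20 code units are dropped at index time, so queries for them with this analyzer should return no matches.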

lowercase

The lowercase token filter normalizes token text to lowercase. It has the following attribute:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be lowercase.

Example

The following index definition example uses a custom analyzer named lowercaser. It uses the standard tokenizer with the lowercase token filter to lowercase all tokens.

{
  "analyzer": "lowercaser",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "lowercaser",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ]
    }
  ]
}
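
The following query sketch shows case-insensitive matching with the lowercaser analyzer. It is a minimal sketch, assuming this index is named default on the minutes collection, whose page_updated_by.last_name field contains the value AUERBACH.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "auerbach",     // lowercase query term
        "path": "page_updated_by.last_name"
      }
    }
  },
  {
    $project: {
      "_id": 1,
      "page_updated_by.last_name": 1
    }
  }
])

Because both the indexed value and the query term are lowercased, auerbach should match AUERBACH.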

Tip

See also:

The regex token filter for a sample index definition and query.

nGram

The nGram token filter tokenizes input into n-grams of configured sizes. You can't use the nGram token filter in synonym or autocomplete mapping definitions. It has the following attributes:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be nGram.

minGram (integer, required)
  Number that specifies the minimum length of generated n-grams. Value must be less than or equal to maxGram.

maxGram (integer, required)
  Number that specifies the maximum length of generated n-grams. Value must be greater than or equal to minGram.

termNotInBounds (string, optional; default: omit)
  String that specifies whether to index tokens shorter than minGram or longer than maxGram. Accepted values are:

  • include

  • omit

  If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed.

Example

The following index definition example uses a custom analyzer named persianAutocomplete. It functions as an autocomplete analyzer for Persian and other languages that use the zero-width non-joiner character. It performs the following operations:

  • Applies the persian character filter.

  • Tokenizes with the whitespace tokenizer.

  • Applies the following token filters:

    • icuNormalizer

    • shingle

    • nGram

{
  "analyzer": "persianAutocomplete",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "persianAutocomplete",
      "charFilters": [
        {
          "type": "persian"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "icuNormalizer",
          "normalizationForm": "nfc"
        },
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        },
        {
          "type": "nGram",
          "minGram": 1,
          "maxGram": 10
        }
      ]
    }
  ]
}
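
The following query sketch shows how a partial term could match with the persianAutocomplete analyzer. It is a minimal sketch: the index name default, the message field, and the assumption that a document contains the Persian word کتاب are all illustrative and not part of the index definition above.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "کتا",          // partial term
        "path": "message"        // hypothetical field name
      }
    }
  }
])

Because the nGram token filter indexes substrings of one to ten characters, the partial term کتا shares n-grams with کتاب and can match it.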

regex

The regex token filter applies a regular expression to each token, replacing matches with a specified string. It has the following attributes:

type (string, required)
  Human-readable label that identifies this token filter. Value must be regex.

pattern (string, required)
  Regular expression pattern to apply to each token.

replacement (string, required)
  Replacement string to substitute wherever a matching pattern occurs.

matches (string, required)
  Acceptable values are:

  • all

  • first

  If matches is set to all, replace all matching patterns. Otherwise, replace only the first matching pattern.

Example

The following index definition uses a custom analyzer named emailRedact for indexing the page_updated_by.email field in the minutes collection. It uses the keyword tokenizer so that each field value is emitted as a single token. It first applies the lowercase token filter to turn uppercase characters in the field to lowercase, and then finds strings that look like email addresses and replaces them with the word redacted.

{
  "analyzer": "lucene.standard",
  "mappings": {
    "dynamic": false,
    "fields": {
      "page_updated_by": {
        "type": "document",
        "fields": {
          "email": {
            "type": "string",
            "analyzer": "emailRedact"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "emailRedact",
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "matches": "all",
          "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$",
          "replacement": "redacted",
          "type": "regex"
        }
      ]
    }
  ]
}

The following query searches for the term example in the page_updated_by.email field of the minutes collection.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "example",
        "path": "page_updated_by.email"
      }
    }
  }
])

Atlas Search doesn't return any results for the query because the page_updated_by.email field doesn't contain any instances of the word example that aren't in an email address. Atlas Search replaces strings that match the regular expression provided in the custom analyzer with the word redacted.
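
As a contrast, the following query sketch searches for the replacement term itself. It is a minimal sketch that assumes the same index and collection as the preceding example; because every indexed value that looked like an email address was replaced with redacted, this query should match those documents. The stored field values are unchanged; only the indexed tokens are redacted.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "redacted",     // the replacement string configured in the regex token filter
        "path": "page_updated_by.email"
      }
    }
  },
  {
    $project: {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])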

reverse

The reverse token filter reverses each string token. It has the following attribute:

type (string, required)
  Human-readable label that identifies this token filter. Value must be reverse.

Example

The following index definition example for the minutes collection uses a custom analyzer named keywordReverse. It performs the following operations:

  • Tokenizes with the keyword tokenizer.

  • Reverses each token with the reverse token filter.

{
  "analyzer": "lucene.keyword",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "keywordReverse",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "reverse"
        }
      ]
    }
  ]
}

The following query uses the wildcard operator to search the page_updated_by.email field in the minutes collection for addresses that end with @example.com. Because the indexed tokens are reversed, the reverse token filter can speed up this leading wildcard query.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "wildcard": {
        "query": "*@example.com",
        "path": "page_updated_by.email",
        "allowAnalyzedField": true
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])

For the preceding query, Atlas Search applies the custom analyzer to the wildcard query to transform the query as follows:

moc.elpmaxe@*

Atlas Search then runs the query against the indexed tokens, which are also reversed. The query returns the following documents:

[
  { _id: 1, page_updated_by: { email: 'auerbach@example.com' } },
  { _id: 2, page_updated_by: { email: 'ohrback@example.com' } },
  { _id: 3, page_updated_by: { email: 'lewinsky@example.com' } },
  { _id: 4, page_updated_by: { email: 'levinski@example.com' } }
]

shingle

The shingle token filter constructs shingles (token n-grams) from a series of tokens. You can't use the shingle token filter in synonym or autocomplete mapping definitions. It has the following attributes:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be shingle.

minShingleSize (integer, required)
  Minimum number of tokens per shingle. Must be less than or equal to maxShingleSize.

maxShingleSize (integer, required)
  Maximum number of tokens per shingle. Must be greater than or equal to minShingleSize.

Example

The following index definition example uses two custom analyzers, emailAutocompleteIndex and emailAutocompleteSearch, to implement autocomplete-like functionality. Atlas Search uses the emailAutocompleteIndex analyzer during index creation to:

  • Replace @ characters in a field with AT

  • Create tokens with the whitespace tokenizer

  • Shingle tokens

  • Create edge grams of those shingled tokens

Atlas Search uses the emailAutocompleteSearch analyzer during a search to:

  • Replace @ characters in the query term with AT

  • Create tokens with the whitespace tokenizer

{
  "analyzer": "lucene.keyword",
  "mappings": {
    "dynamic": true,
    "fields": {
      "page_updated_by": {
        "type": "document",
        "fields": {
          "email": {
            "type": "string",
            "analyzer": "emailAutocompleteIndex",
            "searchAnalyzer": "emailAutocompleteSearch"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "emailAutocompleteIndex",
      "charFilters": [
        {
          "mappings": {
            "@": "AT"
          },
          "type": "mapping"
        }
      ],
      "tokenizer": {
        "maxTokenLength": 15,
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "maxShingleSize": 3,
          "minShingleSize": 2,
          "type": "shingle"
        },
        {
          "maxGram": 15,
          "minGram": 2,
          "type": "edgeGram"
        }
      ]
    },
    {
      "name": "emailAutocompleteSearch",
      "charFilters": [
        {
          "mappings": {
            "@": "AT"
          },
          "type": "mapping"
        }
      ],
      "tokenizer": {
        "maxTokenLength": 15,
        "type": "whitespace"
      }
    }
  ]
}

The following query searches for an email address in the page_updated_by.email field of the minutes collection:

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "auerbach@ex",
        "path": "page_updated_by.email"
      }
    }
  }
])

The query returns the following results:

{
  "_id" : 1,
  "page_updated_by" : {
    "last_name" : "AUERBACH",
    "first_name" : "Siân",
    "email" : "auerbach@example.com",
    "phone" : "123-456-7890"
  },
  "title": "The weekly team meeting",
  "text" : "<head> This page deals with department meetings. </head>"
}

snowballStemming

The snowballStemming token filter stems tokens using a Snowball-generated stemmer. It has the following attributes:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be snowballStemming.

stemmerName (string, required)
  The following values are valid:

  • arabic

  • armenian

  • basque

  • catalan

  • danish

  • dutch

  • english

  • finnish

  • french

  • german

  • german2 (Alternative German language stemmer. Handles the umlaut by expanding ü to ue in most contexts.)

  • hungarian

  • irish

  • italian

  • kp (Kraaij-Pohlmann stemmer, an alternative stemmer for Dutch.)

  • lithuanian

  • lovins (The first-ever published "Lovins JB" stemming algorithm.)

  • norwegian

  • porter (The original Porter English stemming algorithm.)

  • portuguese

  • romanian

  • russian

  • spanish

  • swedish

  • turkish

Example

The following example index definition uses a custom analyzer named frenchStemmer. It uses the lowercase token filter and the standard tokenizer, followed by the french variant of the snowballStemming token filter.

{
  "analyzer": "frenchStemmer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "frenchStemmer",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "snowballStemming",
          "stemmerName": "french"
        }
      ]
    }
  ]
}
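
The following query sketch shows how you might search with the frenchStemmer analyzer. It is a minimal sketch, assuming this index is named default on the minutes collection and that a text field contains the French plural réunions; both the field contents and the stemming behavior described here are illustrative assumptions.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "réunion",      // singular form of the assumed indexed term "réunions"
        "path": "text"
      }
    }
  }
])

Because the french Snowball stemmer should reduce réunion and réunions to the same stem, the singular query term can also match documents that contain only the plural.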

stopword

The stopword token filter removes tokens that correspond to the specified stop words. This token filter doesn't analyze the specified stop words. It has the following attributes:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be stopword.

tokens (array of strings, required)
  List that contains the stop words that correspond to the tokens to remove. Value must be one or more stop words.

ignoreCase (boolean, optional; default: true)
  Flag that indicates whether to ignore the case of stop words when filtering the tokens to remove. The value can be one of the following:

  • true - ignore case and remove all tokens that match the specified stop words

  • false - be case-sensitive and remove only tokens that exactly match the specified case

  If omitted, defaults to true.

Example

The following index definition example uses a custom analyzer named stopwordRemover. It uses the stopword token filter after the whitespace tokenizer to remove the tokens that match the defined stop words is, the, and at. The token filter is case-insensitive and removes all tokens that match the specified stop words.

{
  "analyzer": "stopwordRemover",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "stopwordRemover",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "stopword",
          "tokens": ["is", "the", "at"]
        }
      ]
    }
  ]
}
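
The following query sketch shows that removed stop words cannot be matched. It is a minimal sketch, assuming this index is named default on the minutes collection and that the text field is indexed with the stopwordRemover analyzer.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "the",          // one of the configured stop words
        "path": "text"
      }
    }
  }
])

Because the is one of the configured stop words, it is removed at both index time and query time, so this query should return no documents.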

trim

The trim token filter trims leading and trailing whitespace from tokens. It has the following attribute:

type (string, required)
  Human-readable label that identifies this token filter type. Value must be trim.

Example

The following index definition example uses a custom analyzer named tokenTrimmer. It uses the trim token filter after the keyword tokenizer to remove leading and trailing whitespace in the tokens created by the keyword tokenizer.

"analyzer": "tokenTrimmer",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "tokenTrimmer",
"charFilters": [],
"tokenizer": {
"type": "keyword"
},
"tokenFilters": [
{
"type": "trim"
}
]
}
]
}
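
The following query sketch shows how you might query a field indexed with the tokenTrimmer analyzer. It is a minimal sketch, assuming this index is named default on the minutes collection, whose title field contains the value The weekly team meeting. Because the keyword tokenizer emits the entire field value as a single token, the query must match the whole value; the trim filter ensures that stray leading or trailing whitespace in the stored value would not prevent the match.

db.minutes.aggregate([
  {
    $search: {
      "index": "default",
      "text": {
        "query": "The weekly team meeting",   // must match the entire field value
        "path": "title"
      }
    }
  },
  {
    $project: {
      "_id": 1,
      "title": 1
    }
  }
])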