Token Filters
Token Filters always require a type field, and some take additional options as well.
"tokenFilters": [ { "type": "<token-filter-type>", "<additional-option>": <value> } ]
Atlas Search supports the following token filters:
- asciiFolding
- daitchMokotoffSoundex
- edgeGram
- icuFolding
- icuNormalizer
- length
- lowercase
- nGram
- regex
- reverse
- shingle
- snowballStemming
- stopword
- trim
asciiFolding
The asciiFolding
token filter converts alphabetic, numeric, and
symbolic Unicode characters that are not in the Basic Latin Unicode
block
to their ASCII equivalents, if available. It has the following
attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be asciiFolding. | yes | |
originalTokens | string | String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following: include, to include the original tokens with the converted tokens, or omit, to include only the converted tokens. | no | omit |
The following index definition example uses a custom analyzer
named asciiConverter
. It uses the standard tokenizer with the asciiFolding
token filter to
index the fields in the minutes
collection and convert the field values to their ASCII equivalents.
{ "analyzer": "asciiConverter", "searchAnalyzer": "asciiConverter", "mappings": { "dynamic": true }, "analyzers": [ { "name": "asciiConverter", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "asciiFolding" } ] } ] }
The following query searches the first_name
field for names
using their ASCII equivalent.
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "Sian", "path": "page_updated_by.first_name" } } }, { $project: { "_id": 1, "page_updated_by.last_name": 1, "page_updated_by.first_name": 1 } } ])
Atlas Search returns the following results:
[ { _id: 1, page_updated_by: { last_name: 'AUERBACH', first_name: 'Siân'} } ]
daitchMokotoffSoundex
The daitchMokotoffSoundex
token filter creates tokens for words
that sound the same based on the Daitch-Mokotoff Soundex
phonetic algorithm. This filter can generate multiple encodings for
each input, where each encoded token is a six-digit number.
Don't use the daitchMokotoffSoundex token filter in:
- Synonym or autocomplete mapping definitions.
- Operators where fuzzy is enabled. Atlas Search supports the fuzzy option for the text and autocomplete operators.
It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be daitchMokotoffSoundex. | yes | |
originalTokens | string | String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following: include, to include the original tokens with the encoded tokens, or omit, to include only the encoded tokens. | no | include |
The following index definition example uses a custom analyzer
named dmsAnalyzer. It uses the standard tokenizer with the daitchMokotoffSoundex
token filter to index words in their encoded forms so that you can
query for words that sound the same.
{ "analyzer": "dmsAnalyzer", "searchAnalyzer": "dmsAnalyzer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "dmsAnalyzer", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "daitchMokotoffSoundex", "originalTokens": "include" } ] } ] }
The following query searches for terms that sound similar to
AUERBACH
in the page_updated_by.last_name
field of
the minutes
collection.
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "AUERBACH", "path": "page_updated_by.last_name" } } }, { $project: { "_id": 1, "page_updated_by.last_name": 1 } } ])
The query returns the following results:
{ "_id" : 1, "page_updated_by" : { "last_name" : "AUERBACH" } } { "_id" : 2, "page_updated_by" : { "last_name" : "OHRBACH" } }
Atlas Search returns documents with _id: 1 and _id: 2
because the terms in both documents are phonetically similar
and are encoded using the same six-digit code, 097500.
edgeGram
The edgeGram
token filter tokenizes input from the left side, or
"edge", of a text input into n-grams of configured sizes. You can't use
the edgeGram token filter in synonym or autocomplete
mapping definitions. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be edgeGram. | yes | |
minGram | integer | Number that specifies the minimum length of generated n-grams. Value must be less than or equal to maxGram. | yes | |
maxGram | integer | Number that specifies the maximum length of generated n-grams. Value must be greater than or equal to minGram. | yes | |
termNotInBounds | string | String that specifies whether to index tokens shorter than minGram or longer than maxGram. Value can be one of the following: include or omit. If include is specified, Atlas Search indexes those out-of-bounds tokens as-is; if omit is specified, Atlas Search doesn't index them. | no | omit |
The following index definition example uses a custom analyzer named
englishAutocomplete
. It performs the following operations:
- Tokenizes with the standard tokenizer.
- Applies token filtering with the following filters:
  - icuFolding
  - shingle
  - edgeGram
{ "analyzer": "englishAutocomplete", "mappings": { "dynamic": true }, "analyzers": [ { "name": "englishAutocomplete", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "icuFolding" }, { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 }, { "type": "edgeGram", "minGram": 1, "maxGram": 10 } ] } ] }
See the shingle token filter for a sample index definition and query.
icuFolding
The icuFolding
token filter applies character folding from Unicode
Technical Report #30.
It has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be icuFolding. | yes |
The following index definition example uses a custom analyzer named
diacriticFolder
. It uses the keyword tokenizer with the icuFolding
token filter to
apply foldings from UTR#30 Character Foldings. Foldings include accent
removal, case folding, canonical duplicates folding, and many others
detailed in the report.
{ "analyzer": "diacriticFolder", "mappings": { "dynamic": true }, "analyzers": [ { "name": "diacriticFolder", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "icuFolding" } ] } ] }
icuNormalizer
The icuNormalizer
token filter normalizes tokens using a standard
Unicode Normalization Mode. It
has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be icuNormalizer. | yes | |
normalizationForm | string | Normalization form to apply. Accepted values are: nfc (canonical decomposition, followed by canonical composition), nfd (canonical decomposition), nfkc (compatibility decomposition, followed by canonical composition), and nfkd (compatibility decomposition). To learn more about the supported normalization forms, see Section 1.2: Normalization Forms, UTR#15. | no | nfc |
The following index definition example uses a custom analyzer named
normalizer
. It uses the whitespace tokenizer, then normalizes
tokens by Canonical Decomposition, followed by Canonical Composition.
{ "analyzer": "normalizer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "normalizer", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "icuNormalizer", "normalizationForm": "nfc" } ] } ] }
length
The length
token filter removes tokens that are too short or too
long. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be length. | yes | |
min | integer | Number that specifies the minimum length of a token. Value must be less than or equal to max. | no | 0 |
max | integer | Number that specifies the maximum length of a token. Value must be greater than or equal to min. | no | 255 |
The following index definition example uses a custom analyzer named
longOnly
. It uses the length
token filter to index only
tokens that are at least 20 UTF-16 code units long after tokenizing
with the standard tokenizer.
{ "analyzer": "longOnly", "mappings": { "dynamic": true }, "analyzers": [ { "name": "longOnly", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "length", "min": 20 } ] } ] }
lowercase
The lowercase
token filter normalizes token text to lowercase. It
has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be lowercase. | yes |
The following index definition example uses a custom analyzer named
lowercaser
. It uses the standard tokenizer with the lowercase
token filter to
lowercase all tokens.
{ "analyzer": "lowercaser", "mappings": { "dynamic": true }, "analyzers": [ { "name": "lowercaser", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" } ] } ] }
See the regex token filter for a sample index definition and query.
nGram
The nGram
token filter tokenizes input into n-grams of configured
sizes. You can't use the nGram token filter in
synonym or autocomplete mapping definitions. It has the
following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be nGram. | yes | |
minGram | integer | Number that specifies the minimum length of generated n-grams. Value must be less than or equal to maxGram. | yes | |
maxGram | integer | Number that specifies the maximum length of generated n-grams. Value must be greater than or equal to minGram. | yes | |
termNotInBounds | string | String that specifies whether to index tokens shorter than minGram or longer than maxGram. Value can be one of the following: include or omit. If include is specified, Atlas Search indexes those out-of-bounds tokens as-is; if omit is specified, Atlas Search doesn't index them. | no | omit |
The following index definition example uses a custom analyzer named
persianAutocomplete
. It functions as an autocomplete analyzer for
Persian and other languages that use the zero-width non-joiner
character. It performs the following operations:
- Normalizes zero-width non-joiner characters with the persian character filter.
- Tokenizes by whitespace with the whitespace tokenizer.
- Applies a series of token filters:
  - icuNormalizer
  - shingle
  - nGram
{ "analyzer": "persianAutocomplete", "mappings": { "dynamic": true }, "analyzers": [ { "name": "persianAutocomplete", "charFilters": [ { "type": "persian" } ], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "icuNormalizer", "normalizationForm": "nfc" }, { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 }, { "type": "nGram", "minGram": 1, "maxGram": 10 } ] } ] }
regex
The regex
token filter applies a regular expression to each token,
replacing matches with a specified string. It has the following
attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter. Value must be regex. | yes | |
pattern | string | Regular expression pattern to apply to each token. | yes | |
replacement | string | Replacement string to substitute wherever a matching pattern occurs. | yes | |
matches | string | String that specifies whether to replace all matching patterns or only the first. Acceptable values are: all and first. If matches is set to all, all matching patterns are replaced; otherwise, only the first matching pattern is replaced. | yes | |
The following index definition uses a custom analyzer named
emailRedact
for indexing the page_updated_by.email
field in the minutes
collection. It uses the
standard tokenizer. It first
applies the lowercase token filter to turn uppercase
characters in the field to lowercase and then finds strings
that look like email addresses and replaces them with the word
redacted
.
{ "analyzer": "lucene.standard", "mappings": { "dynamic": false, "fields": { "page_updated_by": { "type": "document", "fields": { "email": { "type": "string", "analyzer": "emailRedact" } } } } }, "analyzers": [ { "charFilters": [], "name": "emailRedact", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "matches": "all", "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$", "replacement": "redacted", "type": "regex" } ] } ] }
The following query searches for the term example
in the
page_updated_by.email
field of the minutes
collection.
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "example", "path": "page_updated_by.email" } } } ])
Atlas Search doesn't return any results for the query because the
page_updated_by.email
field doesn't contain any instances
of the word example
that aren't in an email address.
Atlas Search replaces strings that match the regular expression
provided in the custom analyzer with the word redacted
.
reverse
The reverse
token filter reverses each string token. It has the
following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter. Value must be reverse. | yes |
The following index definition example for the minutes collection uses a custom analyzer named
keywordReverse
. It performs the following operations:
- Uses dynamic mapping
- Tokenizes with the keyword tokenizer
- Applies the reverse token filter to tokens
{ "analyzer": "lucene.keyword", "mappings": { "dynamic": true }, "analyzers": [ { "name": "keywordReverse", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "reverse" } ] } ] }
The following query searches the page_updated_by.email
field in
the minutes
collection using the wildcard operator to
match any characters preceding the characters @example.com
in
reverse order. The reverse
token filter can speed up leading
wildcard queries.
db.minutes.aggregate([ { "$search": { "index": "default", "wildcard": { "query": "*@example.com", "path": "page_updated_by.email", "allowAnalyzedField": true } } }, { "$project": { "_id": 1, "page_updated_by.email": 1, } } ])
The query returns the following documents:
[ { _id: 1, page_updated_by: { email: 'auerbach@example.com' } }, { _id: 2, page_updated_by: { email: 'ohrback@example.com' } }, { _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }, { _id: 4, page_updated_by: { email: 'levinski@example.com' } } ]
shingle
The shingle
token filter constructs shingles (token n-grams) from a
series of tokens. You can't use the shingle
token filter in
synonym or autocomplete mapping definitions. It has the
following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be shingle. | yes | |
minShingleSize | integer | Minimum number of tokens per shingle. Must be less than or equal to maxShingleSize. | yes | |
maxShingleSize | integer | Maximum number of tokens per shingle. Must be greater than or equal to minShingleSize. | yes |
The following index definition example uses two custom
analyzers, emailAutocompleteIndex
and
emailAutocompleteSearch
, to implement autocomplete-like
functionality. Atlas Search uses the emailAutocompleteIndex
analyzer during index creation to:
- Replace @ characters in a field with AT
- Create tokens with the whitespace tokenizer
- Shingle tokens
- Create edgeGrams of those shingled tokens
Atlas Search uses the emailAutocompleteSearch
analyzer during a
search to:
- Replace @ characters in a field with AT
- Create tokens with the whitespace tokenizer
{ "analyzer": "lucene.keyword", "mappings": { "dynamic": true, "fields": { "page_updated_by": { "type": "document", "fields": { "email": { "type": "string", "analyzer": "emailAutocompleteIndex", "searchAnalyzer": "emailAutocompleteSearch", } } } } }, "analyzers": [ { "name": "emailAutocompleteIndex", "charFilters": [ { "mappings": { "@": "AT" }, "type": "mapping" } ], "tokenizer": { "maxTokenLength": 15, "type": "whitespace" }, "tokenFilters": [ { "maxShingleSize": 3, "minShingleSize": 2, "type": "shingle" }, { "maxGram": 15, "minGram": 2, "type": "edgeGram" } ] }, { "name": "emailAutocompleteSearch", "charFilters": [ { "mappings": { "@": "AT" }, "type": "mapping" } ], "tokenizer": { "maxTokenLength": 15, "type": "whitespace" } } ] }
The following query searches for an email address in the
page_updated_by.email
field of the minutes
collection:
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "auerbach@ex", "path": "page_updated_by.email" } } } ])
The query returns the following results:
{ "_id" : 1, "page_updated_by" : { "last_name" : "AUERBACH", "first_name" : "Siân", "email" : "auerbach@example.com", "phone" : "123-456-7890" }, "title": "The weekly team meeting", "text" : "<head> This page deals with department meetings. </head>" }
snowballStemming
The snowballStemming token filter stems tokens using a
Snowball-generated stemmer. It has the
following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be snowballStemming. | yes | |
stemmerName | string | Snowball-generated stemmer to use. Value must be a valid Snowball stemmer language name, such as english or french. | yes |
The following example index definition uses a custom analyzer named
frenchStemmer
. It uses the lowercase
token filter and the
standard tokenizer, followed
by the french
variant of the snowballStemming
token filter.
{ "analyzer": "frenchStemmer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "frenchStemmer", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "type": "snowballStemming", "stemmerName": "french" } ] } ] }
stopword
The stopword
token filter removes tokens that correspond to the
specified stop words. This token filter doesn't analyze the specified
stop words. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be stopword. | yes | |
tokens | array of strings | List that contains the stop words that correspond to the tokens to remove. Value must be one or more stop words. | yes | |
ignoreCase | boolean | Flag that indicates whether to ignore the case of stop words when filtering the tokens to remove. Value can be one of the following: true, to ignore case and remove all tokens that match a stop word, or false, to remove only tokens that exactly match the case of a stop word. If omitted, defaults to true. | no | true |
The following index definition example uses a custom analyzer named
stopwordRemover. It uses the stopword token filter after the
whitespace tokenizer to remove the tokens that match the defined
stop words is, the, and at. The token filter is case-insensitive
and will remove all tokens that match the specified stop words.
{ "analyzer": "tokenTrimmer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "stopwordRemover", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "stopword", "tokens": ["is", "the", "at"] } ] } ] }
trim
The trim
token filter trims leading and trailing whitespace from
tokens. It has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be trim. | yes |
The following index definition example uses a custom analyzer named
tokenTrimmer
. It uses the trim
token filter after the
keyword tokenizer to remove leading
and trailing whitespace in the tokens created by the keyword
tokenizer.
"analyzer": "tokenTrimmer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "tokenTrimmer", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "trim" } ] } ] }