Token Filters
Token Filters always require a type field, and some take additional options as well.
"tokenFilters": [ { "type": "<token-filter-type>", "<additional-option>": <value> } ]
Atlas Search supports the following token filters:
- asciiFolding
- daitchMokotoffSoundex
- edgeGram
- icuFolding
- icuNormalizer
- length
- lowercase
- nGram
- regex
- reverse
- shingle
- snowballStemming
- stopword
- trim
asciiFolding
The asciiFolding
token filter converts alphabetic, numeric, and
symbolic Unicode characters that are not in the Basic Latin Unicode
block
to their ASCII equivalents, if available. It has the following
attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be asciiFolding. | yes | |
originalTokens | string | String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following: include, to include the original tokens with the converted tokens, or omit, to include only the converted tokens. | no | omit |
The following index definition example uses a custom analyzer
named asciiConverter
. It uses the standard tokenizer with the asciiFolding
token filter to
index the fields in the minutes
collection and convert the field values to their ASCII equivalents.
{ "analyzer": "asciiConverter", "searchAnalyzer": "asciiConverter", "mappings": { "dynamic": true }, "analyzers": [ { "name": "asciiConverter", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "asciiFolding" } ] } ] }
The following query searches the first_name
field for names
using their ASCII equivalent.
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "Sian", "path": "page_updated_by.first_name" } } }, { $project: { "_id": 1, "page_updated_by.last_name": 1, "page_updated_by.first_name": 1 } } ])
Atlas Search returns the following results:
[ { _id: 1, page_updated_by: { last_name: 'AUERBACH', first_name: 'Siân'} } ]
daitchMokotoffSoundex
The daitchMokotoffSoundex
token filter creates tokens for words
that sound the same based on the Daitch-Mokotoff Soundex
phonetic algorithm. This filter can generate multiple encodings for
each input, where each encoded token is a six-digit number.
Don't use the daitchMokotoffSoundex token filter in:
- Synonym or autocomplete mapping definitions.
- Operators where fuzzy is enabled. Atlas Search supports the fuzzy option for the text and autocomplete operators.
It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be daitchMokotoffSoundex. | yes | |
originalTokens | string | String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following: include, to include the original tokens with the encoded tokens, or omit, to include only the encoded tokens. | no | include |
The following index definition example uses a custom analyzer
named dmsAnalyzer. It uses the standard tokenizer with the daitchMokotoffSoundex
token filter to index words in their encoded forms so that you can
query for words that sound the same.
{ "analyzer": "dmsAnalyzer", "searchAnalyzer": "dmsAnalyzer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "dmsAnalyzer", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "daitchMokotoffSoundex", "originalTokens": "include" } ] } ] }
The following query searches for terms that sound similar to
AUERBACH
in the page_updated_by.last_name
field of
the minutes
collection.
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "AUERBACH", "path": "page_updated_by.last_name" } } }, { $project: { "_id": 1, "page_updated_by.last_name": 1 } } ])
The query returns the following results:
{ "_id" : 1, "page_updated_by" : { "last_name" : "AUERBACH" } } { "_id" : 2, "page_updated_by" : { "last_name" : "OHRBACH" } }
Atlas Search returns documents with _id: 1 and _id: 2
because the terms in both documents are phonetically similar
and are encoded using the same six-digit code, 097500.
edgeGram
The edgeGram
token filter tokenizes input from the left side, or
"edge", of a text input into n-grams of configured sizes. You can't use
the edgeGram token filter in synonym or autocomplete
mapping definitions. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be edgeGram. | yes | |
minGram | integer | Number that specifies the minimum length of generated n-grams. Value must be less than or equal to maxGram. | yes | |
maxGram | integer | Number that specifies the maximum length of generated n-grams. Value must be greater than or equal to minGram. | yes | |
termNotInBounds | string | String that specifies whether to index tokens shorter than minGram or longer than maxGram. Value can be one of the following: include or omit. If include is specified, Atlas Search indexes those out-of-bounds tokens as-is; if omit is specified, Atlas Search doesn't index them. | no | omit |
The following index definition example uses a custom analyzer named
englishAutocomplete
. It performs the following operations:
- Tokenizes with the standard tokenizer.
- Applies token filtering with the following filters:
  - icuFolding
  - shingle
  - edgeGram
{ "analyzer": "englishAutocomplete", "mappings": { "dynamic": true }, "analyzers": [ { "name": "englishAutocomplete", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "icuFolding" }, { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 }, { "type": "edgeGram", "minGram": 1, "maxGram": 10 } ] } ] }
See the shingle token filter for a sample index definition and query.
icuFolding
The icuFolding
token filter applies character folding from Unicode
Technical Report #30.
It has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be icuFolding. | yes |
The following index definition example uses a custom analyzer named
diacriticFolder
. It uses the keyword tokenizer with the icuFolding
token filter to
apply foldings from UTR#30 Character Foldings. Foldings include accent
removal, case folding, canonical duplicates folding, and many others
detailed in the report.
{ "analyzer": "diacriticFolder", "mappings": { "dynamic": true }, "analyzers": [ { "name": "diacriticFolder", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "icuFolding" } ] } ] }
icuNormalizer
The icuNormalizer
token filter normalizes tokens using a standard
Unicode Normalization Mode. It
has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be icuNormalizer. | yes | |
normalizationForm | string | Normalization form to apply. Accepted values are: nfc (canonical decomposition, followed by canonical composition), nfd (canonical decomposition), nfkc (compatibility decomposition, followed by canonical composition), and nfkd (compatibility decomposition). To learn more about the supported normalization forms, see Section 1.2: Normalization Forms, UTR#15. | no | nfc |
The following index definition example uses a custom analyzer named
normalizer
. It uses the whitespace tokenizer, then normalizes
tokens by Canonical Decomposition, followed by Canonical Composition.
{ "analyzer": "normalizer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "normalizer", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "icuNormalizer", "normalizationForm": "nfc" } ] } ] }
length
The length
token filter removes tokens that are too short or too
long. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be length. | yes | |
min | integer | Number that specifies the minimum length of a token. Value must be less than or equal to max. | no | 0 |
max | integer | Number that specifies the maximum length of a token. Value must be greater than or equal to min. | no | 255 |
The following index definition example uses a custom analyzer named
longOnly
. It uses the length
token filter to index only
tokens that are at least 20 UTF-16 code units long after tokenizing
with the standard tokenizer.
{ "analyzer": "longOnly", "mappings": { "dynamic": true }, "analyzers": [ { "name": "longOnly", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "length", "min": 20 } ] } ] }
lowercase
The lowercase
token filter normalizes token text to lowercase. It
has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be lowercase. | yes |
The following index definition example uses a custom analyzer named
lowercaser
. It uses the standard tokenizer with the lowercase
token filter to
lowercase all tokens.
{ "analyzer": "lowercaser", "mappings": { "dynamic": true }, "analyzers": [ { "name": "lowercaser", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" } ] } ] }
See the regex token filter for a sample index definition and query.
nGram
The nGram
token filter tokenizes input into n-grams of configured
sizes. You can't use the nGram token filter in
synonym or autocomplete mapping definitions. It has the
following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be nGram. | yes | |
minGram | integer | Number that specifies the minimum length of generated n-grams. Value must be less than or equal to maxGram. | yes | |
maxGram | integer | Number that specifies the maximum length of generated n-grams. Value must be greater than or equal to minGram. | yes | |
termNotInBounds | string | String that specifies whether to index tokens shorter than minGram or longer than maxGram. Value can be one of the following: include or omit. If include is specified, Atlas Search indexes those out-of-bounds tokens as-is; if omit is specified, Atlas Search doesn't index them. | no | omit |
The following index definition example uses a custom analyzer named
persianAutocomplete
. It functions as an autocomplete analyzer for
Persian and other languages that use the zero-width non-joiner
character. It performs the following operations:
- Normalizes zero-width non-joiner characters with the persian character filter.
- Tokenizes by whitespace with the whitespace tokenizer.
- Applies a series of token filters:
  - icuNormalizer
  - shingle
  - nGram
{ "analyzer": "persianAutocomplete", "mappings": { "dynamic": true }, "analyzers": [ { "name": "persianAutocomplete", "charFilters": [ { "type": "persian" } ], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "icuNormalizer", "normalizationForm": "nfc" }, { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 }, { "type": "nGram", "minGram": 1, "maxGram": 10 } ] } ] }
regex
The regex
token filter applies a regular expression to each token,
replacing matches with a specified string. It has the following
attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter. Value must be regex. | yes | |
pattern | string | Regular expression pattern to apply to each token. | yes | |
replacement | string | Replacement string to substitute wherever a matching pattern occurs. | yes | |
matches | string | String that specifies whether to replace all matching patterns or only the first. Acceptable values are: all and first. If matches is set to all, all matching patterns are replaced; otherwise, only the first matching pattern is replaced. | yes | |
The following index definition uses a custom analyzer named
emailRedact
for indexing the page_updated_by.email
field in the minutes
collection. It uses the
standard tokenizer. It first
applies the lowercase token filter to turn uppercase
characters in the field to lowercase and then finds strings
that look like email addresses and replaces them with the word
redacted
.
{ "analyzer": "lucene.standard", "mappings": { "dynamic": false, "fields": { "page_updated_by": { "type": "document", "fields": { "email": { "type": "string", "analyzer": "emailRedact" } } } } }, "analyzers": [ { "charFilters": [], "name": "emailRedact", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "matches": "all", "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$", "replacement": "redacted", "type": "regex" } ] } ] }
The following query searches for the term example
in the
page_updated_by.email
field of the minutes
collection.
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "example", "path": "page_updated_by.email" } } } ])
Atlas Search doesn't return any results for the query because the
page_updated_by.email
field doesn't contain any instances
of the word example
that aren't in an email address.
Atlas Search replaces strings that match the regular expression
provided in the custom analyzer with the word redacted
.
reverse
The reverse
token filter reverses each string token. It has the
following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter. Value must be reverse. | yes |
The following index definition example for the minutes collection uses a custom analyzer named
keywordReverse
. It performs the following operations:
- Uses dynamic mapping
- Tokenizes with the keyword tokenizer
- Applies the reverse token filter to tokens
{ "analyzer": "lucene.keyword", "mappings": { "dynamic": true }, "analyzers": [ { "name": "keywordReverse", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "reverse" } ] } ] }
The following query searches the page_updated_by.email
field in
the minutes
collection using the wildcard operator to
match any characters preceding the characters @example.com
in
reverse order. The reverse
token filter can speed up leading
wildcard queries.
db.minutes.aggregate([ { "$search": { "index": "default", "wildcard": { "query": "*@example.com", "path": "page_updated_by.email", "allowAnalyzedField": true } } }, { "$project": { "_id": 1, "page_updated_by.email": 1, } } ])
The query returns the following documents:
[ { _id: 1, page_updated_by: { email: 'auerbach@example.com' } }, { _id: 2, page_updated_by: { email: 'ohrback@example.com' } }, { _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }, { _id: 4, page_updated_by: { email: 'levinski@example.com' } } ]
shingle
The shingle
token filter constructs shingles (token n-grams) from a
series of tokens. You can't use the shingle
token filter in
synonym or autocomplete mapping definitions. It has the
following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be shingle. | yes | |
minShingleSize | integer | Minimum number of tokens per shingle. Must be less than or equal to maxShingleSize. | yes | |
maxShingleSize | integer | Maximum number of tokens per shingle. Must be greater than or equal to minShingleSize. | yes |
The following index definition example uses two custom
analyzers, emailAutocompleteIndex
and
emailAutocompleteSearch
, to implement autocomplete-like
functionality. Atlas Search uses the emailAutocompleteIndex
analyzer during index creation to:
- Replace @ characters in a field with AT
- Create tokens with the whitespace tokenizer
- Shingle tokens
- Create edgeGrams of those shingled tokens
Atlas Search uses the emailAutocompleteSearch
analyzer during a
search to:
- Replace @ characters in a field with AT
- Create tokens with the whitespace tokenizer
{ "analyzer": "lucene.keyword", "mappings": { "dynamic": true, "fields": { "page_updated_by": { "type": "document", "fields": { "email": { "type": "string", "analyzer": "emailAutocompleteIndex", "searchAnalyzer": "emailAutocompleteSearch", } } } } }, "analyzers": [ { "name": "emailAutocompleteIndex", "charFilters": [ { "mappings": { "@": "AT" }, "type": "mapping" } ], "tokenizer": { "maxTokenLength": 15, "type": "whitespace" }, "tokenFilters": [ { "maxShingleSize": 3, "minShingleSize": 2, "type": "shingle" }, { "maxGram": 15, "minGram": 2, "type": "edgeGram" } ] }, { "name": "emailAutocompleteSearch", "charFilters": [ { "mappings": { "@": "AT" }, "type": "mapping" } ], "tokenizer": { "maxTokenLength": 15, "type": "whitespace" } } ] }
The following query searches for an email address in the
page_updated_by.email
field of the minutes
collection:
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "auerbach@ex", "path": "page_updated_by.email" } } } ])
The query returns the following results:
{ "_id" : 1, "page_updated_by" : { "last_name" : "AUERBACH", "first_name" : "Siân", "email" : "auerbach@example.com", "phone" : "123-456-7890" }, "title": "The weekly team meeting", "text" : "<head> This page deals with department meetings. </head>" }
snowballStemming
The snowballStemming token filter stems tokens using a
Snowball-generated stemmer. It has the
following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be snowballStemming. | yes | |
stemmerName | string | Snowball-generated stemmer to use. Value must be a valid Snowball stemmer language name, such as english or french. | yes |
The following example index definition uses a custom analyzer named
frenchStemmer
. It uses the lowercase
token filter and the
standard tokenizer, followed
by the french
variant of the snowballStemming
token filter.
{ "analyzer": "frenchStemmer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "frenchStemmer", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "type": "snowballStemming", "stemmerName": "french" } ] } ] }
stopword
The stopword
token filter removes tokens that correspond to the
specified stop words. This token filter doesn't analyze the specified
stop words. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be stopword. | yes | |
tokens | array of strings | List that contains the stop words that correspond to the tokens to remove. Value must be one or more stop words. | yes | |
ignoreCase | boolean | Flag that indicates whether to ignore the case of stop words when filtering the tokens to remove. Value can be one of the following: true, to ignore case and remove all tokens that match a stop word, or false, to remove only tokens that exactly match the case of a stop word. If omitted, defaults to true. | no | true |
The following index definition example uses a custom analyzer named
stopwordRemover. It uses the stopword token filter after the
whitespace tokenizer to remove the tokens that match the defined
stop words is, the, and at. The token filter is case-insensitive
and will remove all tokens that match the specified stop words.
{ "analyzer": "tokenTrimmer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "stopwordRemover", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "stopword", "tokens": ["is", "the", "at"] } ] } ] }
trim
The trim
token filter trims leading and trailing whitespace from
tokens. It has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this token filter type. Value must be trim. | yes |
The following index definition example uses a custom analyzer named
tokenTrimmer
. It uses the trim
token filter after the
keyword tokenizer to remove leading
and trailing whitespace in the tokens created by the keyword
tokenizer.
"analyzer": "tokenTrimmer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "tokenTrimmer", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "trim" } ] } ] }