
Tokenizers

On this page

  • edgeGram
  • keyword
  • nGram
  • regexCaptureGroup
  • regexSplit
  • standard
  • uaxUrlEmail
  • whitespace

A custom analyzer's tokenizer determines how Atlas Search splits up text into discrete chunks for indexing. Tokenizers require a type field, and some take additional options as well.

"tokenizer": {
"type": "<tokenizer-type>",
"<additional-option>": "<value>"
}

Atlas Search supports the tokenizers described in the following sections.

The edgeGram tokenizer tokenizes input from the left side, or "edge", of a text input into n-grams of given sizes. You can't use the edgeGram tokenizer in synonym or autocomplete mapping definitions. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be edgeGram.
  • minGram (integer, required): Number of characters to include in the shortest token created.
  • maxGram (integer, required): Number of characters to include in the longest token created.
Example

The following index definition example uses a custom analyzer named edgegramShingler. It uses the edgeGram tokenizer to create tokens between 2 and 5 characters long, starting from the first character of the text input, together with the shingle token filter.

{
  "analyzer": "edgegramShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "edgegramShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "edgeGram",
        "minGram": 2,
        "maxGram": 5
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
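To make the tokenizer's output concrete, the following Python sketch approximates how an edgeGram tokenizer with minGram 2 and maxGram 5 breaks up a single word. It is an illustration only, not Atlas Search or Lucene code, and the function name and sample input are invented for this example.

def edge_grams(text, min_gram=2, max_gram=5):
    """Approximate edgeGram tokenization: emit prefixes of the input
    whose lengths run from min_gram to max_gram characters."""
    return [text[:n] for n in range(min_gram, max_gram + 1) if n <= len(text)]

print(edge_grams("search"))
# ['se', 'sea', 'sear', 'searc']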

The keyword tokenizer tokenizes the entire input as a single token. It has the following attribute:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be keyword.
Example

The following index definition example uses a custom analyzer named keywordTokenizingIndex. It uses the keyword tokenizer and a regular expression token filter that redacts email addresses.

{
  "analyzer": "keywordTokenizingIndex",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "keywordTokenizingIndex",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "regex",
          "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$",
          "replacement": "redacted",
          "matches": "all"
        }
      ]
    }
  ]
}
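Because the keyword tokenizer emits the entire field value as one token, the regex token filter above sees the whole string at once, so the anchored pattern redacts values that consist solely of an email address. The following Python sketch approximates that combined effect; it is an illustration only, and the function name and sample inputs are invented.

import re

# The same pattern used in the index definition above.
EMAIL_PATTERN = r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,5})$"

def analyze(text):
    # keyword tokenizer: the entire input becomes a single token.
    token = text
    # regex token filter with "matches": "all": replace matches with "redacted".
    return re.sub(EMAIL_PATTERN, "redacted", token)

print(analyze("jane.doe@example.com"))  # redacted
print(analyze("no email here"))         # no email here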

The nGram tokenizer tokenizes into text chunks, or "n-grams", of given sizes. You can't use the nGram tokenizer in synonym or autocomplete mapping definitions. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be nGram.
  • minGram (integer, required): Number of characters to include in the shortest token created.
  • maxGram (integer, required): Number of characters to include in the longest token created.
Example

The following index definition example uses a custom analyzer named ngramShingler. It uses the nGram tokenizer to create tokens between 2 and 5 characters long, together with the shingle token filter.

{
  "analyzer": "ngramShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "ngramShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "nGram",
        "minGram": 2,
        "maxGram": 5
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
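In contrast to edgeGram, an nGram tokenizer emits substrings of every configured length starting at every position, not just at the left edge. The Python sketch below approximates that behavior for minGram 2 and maxGram 5; the exact ordering of tokens in Atlas Search may differ, and the function name and sample input are invented for illustration.

def n_grams(text, min_gram=2, max_gram=5):
    """Approximate nGram tokenization: emit every substring whose
    length falls between min_gram and max_gram characters."""
    tokens = []
    for start in range(len(text)):
        for length in range(min_gram, max_gram + 1):
            if start + length <= len(text):
                tokens.append(text[start:start + length])
    return tokens

print(n_grams("atlas"))
# ['at', 'atl', 'atla', 'atlas', 'tl', 'tla', 'tlas', 'la', 'las', 'as']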

The regexCaptureGroup tokenizer matches a regular expression pattern to extract tokens. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be regexCaptureGroup.
  • pattern (string, required): Regular expression to match against.
  • group (integer, required): Index of the character group within the matching expression to extract into tokens. Use 0 to extract all character groups.
Example

The following index definition example uses a custom analyzer named phoneNumberExtractor. It uses the regexCaptureGroup tokenizer to create a single token from the first US-formatted phone number present in the text input.

{
  "analyzer": "phoneNumberExtractor",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "phoneNumberExtractor",
      "charFilters": [],
      "tokenizer": {
        "type": "regexCaptureGroup",
        "pattern": "^\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b$",
        "group": 0
      },
      "tokenFilters": []
    }
  ]
}
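The sketch below approximates what this tokenizer does with group set to 0, which extracts the entire match rather than an individual capture group. It is an illustration only, not the Lucene implementation, and the helper name and sample inputs are invented.

import re

# The same pattern and group used in the index definition above.
PATTERN = r"^\b\d{3}[-.]?\d{3}[-.]?\d{4}\b$"
GROUP = 0  # 0 extracts the whole match; 1, 2, ... extract individual capture groups.

def extract_tokens(text):
    return [m.group(GROUP) for m in re.finditer(PATTERN, text)]

print(extract_tokens("555-867-5309"))       # ['555-867-5309']
print(extract_tokens("call 555-867-5309"))  # [] (in this sketch the anchored pattern only matches a bare phone number)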

The regexSplit tokenizer splits tokens with a regular-expression based delimiter. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be regexSplit.
  • pattern (string, required): Regular expression to match against.
Example

The following index definition example uses a custom analyzer named dashSplitter. It uses the regexSplit tokenizer to create tokens from hyphen-delimited input text.

{
  "analyzer": "dashSplitter",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "dashSplitter",
      "charFilters": [],
      "tokenizer": {
        "type": "regexSplit",
        "pattern": "[-]+"
      },
      "tokenFilters": []
    }
  ]
}
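Splitting works in the opposite direction from matching: the pattern identifies the delimiters, and the text between matches becomes the tokens. A minimal Python approximation of the dashSplitter behavior follows; the function name and sample inputs are invented for illustration.

import re

# The same delimiter pattern used in the index definition above.
DELIMITER = r"[-]+"

def split_tokens(text):
    # Text between delimiter matches becomes the tokens; empty strings are dropped.
    return [t for t in re.split(DELIMITER, text) if t]

print(split_tokens("2023-01-15"))      # ['2023', '01', '15']
print(split_tokens("sku--1234--red"))  # ['sku', '1234', 'red']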

The standard tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be standard.
  • maxTokenLength (integer, optional; default 255): Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens.
Example

The following index definition example uses a custom analyzer named standardShingler. It uses the standard tokenizer and the shingle token filter.

{
  "analyzer": "standardShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "standardShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "standard",
        "maxTokenLength": 10
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
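As a rough illustration of word breaking and the maxTokenLength option, the sketch below splits on non-alphanumeric characters and then chops any token longer than 10 characters into 10-character pieces. The real tokenizer follows the full Unicode Text Segmentation algorithm, so this is only an approximation; the function name and sample input are invented.

import re

def standard_like_tokens(text, max_token_length=10):
    """Very rough stand-in for standard tokenization: split on
    non-alphanumeric characters, then break any token longer than
    max_token_length into max_token_length-sized pieces."""
    words = [w for w in re.split(r"[^0-9A-Za-z]+", text) if w]
    tokens = []
    for word in words:
        for i in range(0, len(word), max_token_length):
            tokens.append(word[i:i + max_token_length])
    return tokens

print(standard_like_tokens("internationalization matters"))
# ['internatio', 'nalization', 'matters']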
Tip
See also:

The regex token filter for a sample index definition and query.

The uaxUrlEmail tokenizer tokenizes URLs and email addresses. Although the uaxUrlEmail tokenizer follows the word break rules from the Unicode Text Segmentation algorithm, we recommend using it only when the indexed field value includes URLs and email addresses. For fields that don't include URLs or email addresses, use the standard tokenizer to create tokens based on word break rules. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be uaxUrlEmail.
  • maxTokenLength (integer, optional; default 255): Maximum number of characters in one token.
Example

The following index definition example uses a custom analyzer named emailUrlExtractor. It uses the uaxUrlEmail tokenizer to create tokens of up to 200 characters each from the input text, including email addresses and URLs, and converts all tokens to lowercase using the lowercase token filter.

{
  "analyzer": "emailUrlExtractor",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "emailUrlExtractor",
      "charFilters": [],
      "tokenizer": {
        "type": "uaxUrlEmail",
        "maxTokenLength": 200
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ]
    }
  ]
}
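The value of this tokenizer is easiest to see by contrast with plain word breaking: email addresses and URLs stay intact as single tokens instead of being split at @, ., and /. The Python sketch below is a simplified approximation of that idea, not the actual Lucene tokenizer; the regular expressions, function name, and sample input are invented for illustration.

import re

# Very rough stand-ins: the real tokenizer follows Unicode segmentation
# plus URL/email grammar; these patterns exist only for illustration.
EMAIL_OR_URL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+|https?://[^\s]+"

def uax_like_tokens(text):
    tokens = []
    pos = 0
    for m in re.finditer(EMAIL_OR_URL, text):
        # Word-break the plain text before the email/URL...
        tokens += re.findall(r"[0-9A-Za-z]+", text[pos:m.start()])
        # ...but keep the email/URL itself as one token.
        tokens.append(m.group())
        pos = m.end()
    tokens += re.findall(r"[0-9A-Za-z]+", text[pos:])
    # lowercase token filter, as in the analyzer above.
    return [t.lower() for t in tokens]

print(uax_like_tokens("Email Support@Example.com or visit https://example.com/help"))
# ['email', 'support@example.com', 'or', 'visit', 'https://example.com/help']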

The whitespace tokenizer tokenizes based on occurrences of whitespace between words. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be whitespace.
  • maxTokenLength (integer, optional; default 255): Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens.
Example

The following index definition example uses a custom analyzer named whitespaceLowerer. It uses the whitespace tokenizer and a token filter that lowercases all tokens.

{
  "analyzer": "whitespaceLowerer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "whitespaceLowerer",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ]
    }
  ]
}
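The combined effect of this analyzer is straightforward to approximate: split on runs of whitespace, then lowercase each token. The following Python sketch is illustrative only; the function name and sample input are invented. Note that, unlike the standard tokenizer, punctuation stays attached to the surrounding token.

def whitespace_lowerer(text):
    # whitespace tokenizer: split on runs of whitespace.
    tokens = text.split()
    # lowercase token filter: normalize each token.
    return [t.lower() for t in tokens]

print(whitespace_lowerer("Quick  Brown-Fox JUMPED"))
# ['quick', 'brown-fox', 'jumped']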
Tip
See also:

The shingle token filter for a sample index definition and query.
