Tokenizers
A custom analyzer's tokenizer determines how Atlas Search splits up text into discrete chunks for indexing. Tokenizers require a type field, and some take additional options as well.
"tokenizer": { "type": "<tokenizer-type>", "<additional-option>": "<value>" }
Atlas Search supports the following tokenizer options:
edgeGram
The edgeGram tokenizer tokenizes input from the left side, or "edge", of the text into n-grams of given sizes. You can't use the edgeGram tokenizer in synonym or autocomplete mapping definitions. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this tokenizer type. Value must be edgeGram. | yes | |
minGram | integer | Number of characters to include in the shortest token created. | yes | |
maxGram | integer | Number of characters to include in the longest token created. | yes | |
The following index definition example uses a custom analyzer named edgegramShingler. It uses the edgeGram tokenizer to create tokens between 2 and 5 characters long, starting from the first character of the text input, and the shingle token filter.
{ "analyzer": "edgegramShingler", "mappings": { "dynamic": true }, "analyzers": [ { "name": "edgegramShingler", "charFilters": [], "tokenizer": { "type": "edgeGram", "minGram": 2, "maxGram": 5 }, "tokenFilters": [ { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 } ] } ] }
keyword
The keyword
tokenizer tokenizes the entire input as a single token.
It has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this tokenizer type. Value must be keyword. | yes | |
The following index definition example uses a custom analyzer named
keywordTokenizingIndex
. It uses the keyword
tokenizer and a
regular expression token filter that redacts email addresses.
{ "analyzer": "keywordTokenizingIndex", "mappings": { "dynamic": true }, "analyzers": [ { "name": "keywordTokenizingIndex", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "regex", "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$", "replacement": "redacted", "matches": "all" } ] } ] }
nGram
The nGram tokenizer tokenizes input into text chunks, or "n-grams", of given sizes. You can't use the nGram tokenizer in synonym or autocomplete mapping definitions. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this tokenizer type. Value must be nGram. | yes | |
minGram | integer | Number of characters to include in the shortest token created. | yes | |
maxGram | integer | Number of characters to include in the longest token created. | yes | |
The following index definition example uses a custom analyzer named ngramShingler. It uses the nGram tokenizer to create tokens between 2 and 5 characters long, and the shingle token filter.
{ "analyzer": "ngramShingler", "mappings": { "dynamic": true }, "analyzers": [ { "name": "ngramShingler", "charFilters": [], "tokenizer": { "type": "nGram", "minGram": 2, "maxGram": 5 }, "tokenFilters": [ { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 } ] } ] }
regexCaptureGroup
The regexCaptureGroup
tokenizer matches a regular expression
pattern to extract tokens. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this tokenizer type. Value must be regexCaptureGroup. | yes | |
pattern | string | Regular expression to match against. | yes | |
group | integer | Index of the character group within the matching expression to extract into tokens. Use 0 to extract all character groups. | yes | |
The following index definition example uses a custom analyzer named phoneNumberExtractor. It uses the regexCaptureGroup tokenizer to create a single token from the first US-formatted phone number present in the text input.
{ "analyzer": "phoneNumberExtractor", "mappings": { "dynamic": true }, "analyzers": [ { "name": "phoneNumberExtractor", "charFilters": [], "tokenizer": { "type": "regexCaptureGroup", "pattern": "^\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b$", "group": 0 }, "tokenFilters": [] } ] }
regexSplit
The regexSplit tokenizer splits tokens using a regular-expression-based delimiter. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this tokenizer type. Value must be regexSplit. | yes | |
pattern | string | Regular expression to match against. | yes | |
The following index definition example uses a custom analyzer named
dashSplitter
. It uses the regexSplit
tokenizer
to create tokens from hyphen-delimited input text.
{ "analyzer": "dashSplitter", "mappings": { "dynamic": true }, "analyzers": [ { "name": "dashSplitter", "charFilters": [], "tokenizer": { "type": "regexSplit", "pattern": "[-]+" }, "tokenFilters": [] } ] }
standard
The standard
tokenizer tokenizes based on word break rules from the
Unicode Text Segmentation algorithm.
It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this tokenizer type. Value must be standard. | yes | |
maxTokenLength | integer | Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. | no | 255 |
The following index definition example uses a custom analyzer named
standardShingler
. It uses the standard
tokenizer and the
shingle token filter.
{ "analyzer": "standardShingler", "mappings": { "dynamic": true }, "analyzers": [ { "name": "standardShingler", "charFilters": [], "tokenizer": { "type": "standard", "maxTokenLength": 10, }, "tokenFilters": [ { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 } ] } ] }
uaxUrlEmail
The uaxUrlEmail tokenizer tokenizes URLs and email addresses. Although the uaxUrlEmail tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm, we recommend using it only when the indexed field value includes URLs and email addresses. For fields that don't include URLs or email addresses, use the standard tokenizer to create tokens based on word break rules. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this tokenizer type. Value must be uaxUrlEmail. | yes | |
maxTokenLength | integer | Maximum number of characters in one token. | no | 255 |
The following index definition example uses a custom analyzer named emailUrlExtractor. It uses the uaxUrlEmail tokenizer to create tokens up to 200 characters long for all text in the input, including email addresses and URLs. It converts all tokens to lowercase using the lowercase token filter.
{ "analyzer": "emailUrlExtractor", "mappings": { "dynamic": true }, "analyzers": [ { "name": "emailUrlExtractor", "charFilters": [], "tokenizer": { "type": "uaxUrlEmail", "maxTokenLength": "200" }, "tokenFilters": [ { "type": "lowercase" } ] } ] }
whitespace
The whitespace
tokenizer tokenizes based on occurrences of
whitespace between words. It has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | Human-readable label that identifies this tokenizer type. Value must be whitespace. | yes | |
maxTokenLength | integer | Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. | no | 255 |
The following index definition example uses a custom analyzer named
whitespaceLowerer
. It uses the whitespace
tokenizer and a
token filter that lowercases all tokens.
{ "analyzer": "whitespaceLowerer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "whitespaceLowerer", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "lowercase" } ] } ] }