Custom Analyzers¶
Overview¶
An Atlas Search analyzer prepares a set of documents to be indexed by performing a series of operations to transform, filter, and group sequences of characters. You can define a custom analyzer to suit your specific indexing needs.
Syntax¶
A custom analyzer has the following syntax:
"analyzers": [ { "name": "<name>", "charFilters": [ <list-of-character-filters> ], "tokenizer": { "type": "<tokenizer-type>" }, "tokenFilters": [ <list-of-token-filters> ] } ]
Attributes¶
A custom analyzer has the following attributes:
Attribute | Type | Description | Required? |
---|---|---|---|
name | string | Name of the custom analyzer. Names must be unique within an index, and must not start with lucene., builtin., or mongodb. | yes |
charFilters | list of objects | Array containing zero or more character filters. See Usage for more information. | no |
tokenizer | object | Tokenizer to use to create tokens. See Usage for more information. | yes |
tokenFilters | list of objects | Array containing zero or more token filters. See Usage for more information. | no |
Usage¶
To use a custom analyzer when indexing a collection, include the
following in the analyzers field of your index definition:
- Optional. Specify one or more character filters. Character filters examine text one character at a time and perform filtering operations.
- Required. Specify the tokenizer. An analyzer uses a tokenizer to split chunks of text into groups, or tokens, for indexing purposes. For example, the whitespace tokenizer splits text fields into individual words based on where whitespace occurs.
- Optional. Specify one or more token filters. After the tokenization step, the resulting tokens can pass through one or more token filters. A token filter performs operations such as:
  - Stemming, which reduces related words, such as "talking", "talked", and "talks", to their root word "talk".
  - Redaction, the removal of sensitive information from public documents.
The text passes through character filters first, then a tokenizer, and then the token filters.
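For example, an analyzers entry that combines all three stages might look like the following. This is an illustrative sketch; the analyzer name htmlLowercaser is not from this page, and the filter choices are only one possible combination.
"analyzers": [ { "name": "htmlLowercaser", "charFilters": [ { "type": "htmlStrip" } ], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" } ] } ]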
Example Collection¶
This page contains sample index definitions and query examples for
character filters, tokenizers, and token filters. These examples use a
sample minutes
collection with the following documents:
{ "_id": 1, "page_updated_by": { "last_name": "AUERBACH", "first_name": "Siân", "email": "auerbach@example.com", "phone": "123-456-7890" }, "text" : "<head> This page deals with department meetings. </head>" } { "_id": 2, "page_updated_by": { "last_name": "OHRBACH", "first_name": "Noël", "email": "ohrbach@example.com", "phone": "123-456-0987" }, "text" : "The head of the sales department spoke first." } { "_id": 3, "page_updated_by": { "last_name": "LEWINSKY", "first_name": "Brièle", "email": "lewinsky@example.com", "phone": "123-456-9870" }, "text" : "<body>We'll head out to the conference room by noon.</body>" } { "_id": 4, "page_updated_by": { "last_name": "LEVINSKI", "first_name": "François", "email": "levinski@example.com", "phone": "123-456-8907" }, "text" : "<body>The page has been updated with the items on the agenda.</body>" }
Character Filters¶
Character filters always require a type field, and some take additional options as well.
"charFilters": [ { "type": "<filter-type>", "<additional-option>": <value> } ]
Atlas Search supports four types of character filters:
Type | Description |
---|---|
htmlStrip | Strips out HTML constructs. |
icuNormalize | Normalizes text with the ICU Normalizer. Based on Lucene's ICUNormalizer2CharFilter. |
mapping | Applies user-specified normalization mappings to characters. Based on Lucene's MappingCharFilter. |
persian | Replaces instances of zero-width non-joiner with ordinary space. Based on Lucene's PersianCharFilter. |
htmlStrip¶
The htmlStrip
character filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this character filter. Must be htmlStrip . | yes | |
ignoredTags | array of strings | A list of HTML tags to exclude from filtering. | no |
The following example index definition uses a custom analyzer
named htmlStrippingAnalyzer
. It uses the htmlStrip
character filter to remove all HTML tags from the text except
the a
tag in the minutes
collection. It uses the
standard tokenizer and no
token filters.
{ "analyzer": "htmlStrippingAnalyzer", "mappings": { "dynamic": true }, "analyzers": [{ "name": "htmlStrippingAnalyzer", "charFilters": [{ "type": "htmlStrip", "ignoredTags": ["a"] }], "tokenizer": { "type": "standard" }, "tokenFilters": [] }] }
The following search operation looks for occurrences of the
string head
in the text
field of the minutes
collection.
db.minutes.aggregate([ { $search: { text: { query: "head", path: "text" } } }, { $project: { "_id": 1, "text": 1 } } ])
The query returns the following results:
{ "_id" : 2, "text" : "The head of the sales department spoke first." } { "_id" : 3, "text" : "<body>We'll head out to the conference room by noon.</body>" }
The document with _id: 1 is not returned because the string head appears only inside the HTML tag <head>, which the analyzer strips out. The document with _id: 3 also contains HTML tags, but the string head appears outside of them, so the document is a match.
icuNormalize¶
The icuNormalize
character filter has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this character filter. Must be icuNormalize . | yes |
The following example index definition uses a custom analyzer named
normalizingAnalyzer
. It uses the icuNormalize
character
filter, the whitespace tokenizer
and no token filters.
{ "analyzer": "normalizingAnalyzer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "normalizingAnalyzer", "charFilters": [ { "type": "icuNormalize" } ], "tokenizer": { "type": "whitespace" }, "tokenFilters": [] } ] }
mapping¶
The mapping
character filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this character filter. Must be mapping . | yes | |
mappings | object | An object containing a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format <original> : <replacement> . | yes | |
The following example index definition uses a custom analyzer named
mappingAnalyzer
. It uses the mapping
character filter to
replace instances of \\
with /
. It uses the keyword
tokenizer and no token filters.
{ "analyzer": "mappingAnalyzer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "mappingAnalyzer", "charFilters": [ { "type": "mapping", "mappings": { "\\": "/" } } ], "tokenizer": { "type": "keyword" }, "tokenFilters": [] } ] }
See the shingle token filter for a sample index definition and query.
persian¶
The persian
character filter has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this character filter. Must be persian . | yes |
The following example index definition uses a custom analyzer named
persianCharacterIndex
. It uses the persian
character filter,
the whitespace
tokenizer and no token filters.
{ "analyzer": "persianCharacterIndex", "mappings": { "dynamic": true }, "analyzers": [ { "name": "persianCharacterIndex", "charFilters": [ { "type": "persian" } ], "tokenizer": { "type": "whitespace" }, "tokenFilters": [] } ] }
Tokenizers¶
A custom analyzer's tokenizer determines how Atlas Search splits up text into discrete chunks for indexing.
Tokenizers always require a type field, and some take additional options as well.
"tokenizer": { "type": "<tokenizer-type>", "<additional-option>": "<value>" }
Atlas Search supports the following tokenizer options:
Name | Description |
---|---|
standard | Tokenize based on word break rules from the Unicode Text Segmentation algorithm. |
keyword | Tokenize the entire input as a single token. |
whitespace | Tokenize based on occurrences of whitespace between words. |
nGram | Tokenize into text chunks, or "n-grams", of given sizes. You can't use the nGram tokenizer in synonym or autocomplete mapping definitions. |
edgeGram | Tokenize input from the left side, or "edge", of a text input into n-grams of given sizes. You can't use the edgeGram tokenizer in synonym or autocomplete mapping definitions. |
regexCaptureGroup | Match a regular expression pattern to extract tokens. |
regexSplit | Split tokens with a regular-expression based delimiter. |
uaxUrlEmail | Tokenize URLs and email addresses. Although the uaxUrlEmail tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm, we recommend using the uaxUrlEmail tokenizer only when the indexed field value includes URLs and email addresses. For fields that do not include URLs or email addresses, use the standard tokenizer to create tokens based on word break rules. |
standard¶
The standard
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be standard . | yes | |
maxTokenLength | integer | Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. | no | 255 |
The following example index definition uses a custom analyzer named
standardShingler
. It uses the standard
tokenizer and the
shingle token filter.
{ "analyzer": "standardShingler", "mappings": { "dynamic": true }, "analyzers": [ { "name": "standardShingler", "charFilters": [], "tokenizer": { "type": "standard", "maxTokenLength": 10, }, "tokenFilters": [ { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 } ] } ] }
See the regex token filter for a sample index definition and query.
keyword¶
The keyword
tokenizer has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be keyword . | yes |
The following example index definition uses a custom analyzer named
keywordTokenizingIndex
. It uses the keyword
tokenizer and a
regular expression token filter that redacts email addresses.
{ "analyzer": "keywordTokenizingIndex", "mappings": { "dynamic": true }, "analyzers": [ { "name": "keywordTokenizingIndex", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "regex", "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$", "replacement": "redacted", "matches": "all" } ] } ] }
whitespace¶
The whitespace
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be whitespace . | yes | |
maxTokenLength | integer | Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. | no | 255 |
The following example index definition uses a custom analyzer named
whitespaceLowerer
. It uses the whitespace
tokenizer and a
token filter that lowercases all tokens.
{ "analyzer": "whitespaceLowerer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "whitespaceLowerer", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "lowercase" } ] } ] }
See the shingle token filter for a sample index definition and query.
nGram¶
The nGram
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be nGram . | yes | |
minGram | integer | Number of characters to include in the shortest token created. | yes | |
maxGram | integer | Number of characters to include in the longest token created. | yes |
The following example index definition uses a custom analyzer named
ngramShingler
. It uses the nGram
tokenizer to create tokens
between 2 and 5 characters long and the shingle token filter.
{ "analyzer": "ngramShingler", "mappings": { "dynamic": true }, "analyzers": [ { "name": "ngramShingler", "charFilters": [], "tokenizer": { "type": "nGram", "minGram": 2, "maxGram": 5 }, "tokenFilters": [ { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 } ] } ] }
edgeGram¶
The edgeGram
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be edgeGram . | yes | |
minGram | integer | Number of characters to include in the shortest token created. | yes | |
maxGram | integer | Number of characters to include in the longest token created. | yes |
The following example index definition uses a custom analyzer named
edgegramShingler
. It uses the edgeGram
tokenizer to create tokens
between 2 and 5 characters long, starting from the first character of
the text input, and the shingle token filter.
{ "analyzer": "edgegramShingler", "mappings": { "dynamic": true }, "analyzers": [ { "name": "edgegramShingler", "charFilters": [], "tokenizer": { "type": "edgeGram", "minGram": 2, "maxGram": 5 }, "tokenFilters": [ { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 } ] } ] }
regexCaptureGroup¶
The regexCaptureGroup
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be regexCaptureGroup . | yes | |
pattern | string | A regular expression to match against. | yes | |
group | integer | Index of the character group within the matching expression to extract into tokens. Use 0 to extract all character groups. | yes | |
The following example index definition uses a custom analyzer named
phoneNumberExtractor
. It uses the regexCaptureGroup
tokenizer to create a single token from the first US-formatted phone
number present in the text input.
{ "analyzer": "phoneNumberExtractor", "mappings": { "dynamic": true }, "analyzers": [ { "name": "phoneNumberExtractor", "charFilters": [], "tokenizer": { "type": "regexCaptureGroup", "pattern": "^\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b$", "group": 0 }, "tokenFilters": [] } ] }
regexSplit¶
The regexSplit
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be regexSplit . | yes | |
pattern | string | A regular expression to match against. | yes |
The following example index definition uses a custom analyzer named
dashSplitter
. It uses the regexSplit
tokenizer
to create tokens from hyphen-delimited input text.
{ "analyzer": "dashSplitter", "mappings": { "dynamic": true }, "analyzers": [ { "name": "dashSplitter", "charFilters": [], "tokenizer": { "type": "regexSplit", "pattern": "[-]+" }, "tokenFilters": [] } ] }
uaxUrlEmail¶
The uaxUrlEmail
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be uaxUrlEmail . | yes | |
maxTokenLength | int | The maximum number of characters in one token. | no | 255 |
The following example index definition uses a custom analyzer named
emailUrlExtractor
. It uses the uaxUrlEmail
tokenizer to
create tokens up to 200 characters long for all text in the input,
including email addresses and URLs. It converts all tokens to
lowercase using the lowercase token filter.
{ "analyzer": "emailUrlExtractor", "mappings": { "dynamic": true }, "analyzers": [ { "name": "emailUrlExtractor", "charFilters": [], "tokenizer": { "type": "uaxUrlEmail", "maxTokenLength": "200" }, "tokenFilters": [ { "type": "lowercase" } ] } ] }
Token Filters¶
Token filters always require a type field, and some take additional options as well.
"tokenFilters": [ { "type": "<token-filter-type>", "<additional-option>": <value> } ]
Atlas Search supports the following token filters:
Name | Description |
---|---|
asciiFolding | Converts alphabetic, numeric, and symbolic Unicode characters that are not in the Basic Latin Unicode block to their ASCII equivalents, if available. |
daitchMokotoffSoundex | Creates tokens for words that sound the same based on the Daitch-Mokotoff Soundex phonetic algorithm. This filter can generate multiple encodings for each input, where each encoded token is a 6 digit number. Note: Don't use the daitchMokotoffSoundex token filter in synonym or autocomplete mapping definitions. |
lowercase | Normalizes token text to lowercase. |
length | Removes tokens that are too short or too long. |
icuFolding | Applies character folding from Unicode Technical Report #30. |
icuNormalizer | Normalizes tokens using a standard Unicode Normalization Mode. |
nGram | Tokenizes input into n-grams of configured sizes. You can't use the nGram token filter in synonym or autocomplete mapping definitions. |
edgeGram | Tokenizes input from the left side, or "edge", of a text input into n-grams of configured sizes. You can't use the edgeGram token filter in synonym or autocomplete mapping definitions. |
shingle | Constructs shingles (token n-grams) from a series of tokens. You can't use the shingle token filter in synonym or autocomplete mapping definitions. |
regex | Applies a regular expression to each token, replacing matches with a specified string. |
snowballStemming | Stems tokens using a Snowball-generated stemmer. |
stopword | Removes tokens that correspond to the specified stop words. This token filter doesn't analyze the specified stop words. |
trim | Trims leading and trailing whitespace from tokens. |
asciiFolding¶
The asciiFolding
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be asciiFolding . | yes | |
originalTokens | string | Specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following: include, to include the original tokens with the converted tokens, or omit, to omit the original tokens. | no | omit |
The following example index definition uses a custom analyzer
named asciiConverter
. It uses the standard tokenizer with the asciiFolding
token filter to
index the fields in the example
collection and convert the field values to their ASCII equivalent.
{ "analyzer": "asciiConverter", "searchAnalyzer": "asciiConverter", "mappings": { "dynamic": true }, "analyzers": [ { "name": "asciiConverter", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "asciiFolding" } ] } ] }
The following query searches the first_name
field for names
using their ASCII equivalent.
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "Sian", "path": "page_updated_by.first_name" } } }, { $project: { "_id": 1, "page_updated_by.last_name": 1, "page_updated_by.first_name": 1 } } ])
Atlas Search returns the following results:
[ { _id: 1, page_updated_by: { last_name: 'AUERBACH', first_name: 'Siân'} } ]
daitchMokotoffSoundex¶
The daitchMokotoffSoundex
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be daitchMokotoffSoundex . | yes | |
originalTokens | string | Specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following: include, to include the original tokens with the encoded tokens, or omit, to omit the original tokens. | no | include |
The following example index definition uses a custom analyzer
named dmsAnalyzer
. It uses the standard tokenizer with the daitchMokotoffSoundex
token filter to index and query for words that sound the same
as their encoded forms.
{ "analyzer": "dmsAnalyzer", "searchAnalyzer": "dmsAnalyzer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "dmsAnalyzer", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "daitchMokotoffSoundex", "originalTokens": "include" } ] } ] }
The following query searches for terms that sound similar to
AUERBACH
in the page_updated_by.last_name
field of
the minutes
collection.
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "AUERBACH", "path": "page_updated_by.last_name" } } }, { $project: { "_id": 1, "page_updated_by.last_name": 1 } } ])
The query returns the following results:
{ "_id" : 1, "page_updated_by" : { "last_name" : "AUERBACH" } } { "_id" : 2, "page_updated_by" : { "last_name" : "OHRBACH" } }
Atlas Search returns the documents with _id: 1 and _id: 2 because the terms in both documents are phonetically similar and are encoded to the same six-digit code, 097500.
lowercase¶
The lowercase
token filter has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be lowercase . | yes |
The following example index definition uses a custom analyzer named
lowercaser
. It uses the standard tokenizer with the lowercase
token filter to
lowercase all tokens.
{ "analyzer": "lowercaser", "mappings": { "dynamic": true }, "analyzers": [ { "name": "lowercaser", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" } ] } ] }
See the regex token filter for a sample index definition and query.
length¶
The length
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be length . | yes | |
min | integer | The minimum length of a token. Must be less than or equal to max. | no | 0 |
max | integer | The maximum length of a token. Must be greater than or equal to min. | no | 255 |
The following example index definition uses a custom analyzer named
longOnly
. It uses the length
token filter to index only
tokens that are at least 20 UTF-16 code units long after tokenizing
with the standard tokenizer.
{ "analyzer": "longOnly", "mappings": { "dynamic": true }, "analyzers": [ { "name": "longOnly", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "length", "min": 20 } ] } ] }
icuFolding¶
The icuFolding
token filter has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be icuFolding . | yes |
The following example index definition uses a custom analyzer named
diacriticFolder
. It uses the keyword tokenizer with the icuFolding
token filter to
apply foldings from UTR#30 Character Foldings. Foldings include accent
removal, case folding, canonical duplicates folding, and many others
detailed in the report.
{ "analyzer": "diacriticFolder", "mappings": { "dynamic": true }, "analyzers": [ { "name": "diacriticFolder", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "icuFolding" } ] } ] }
icuNormalizer¶
The icuNormalizer
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be icuNormalizer . | yes | |
normalizationForm | string | Normalization form to apply. Accepted values are: nfc (Canonical Decomposition, followed by Canonical Composition), nfd (Canonical Decomposition), nfkc (Compatibility Decomposition, followed by Canonical Composition), and nfkd (Compatibility Decomposition). For more information about the supported normalization forms, see Section 1.2: Normalization Forms, UTR#15. | no | nfc |
The following example index definition uses a custom analyzer named
normalizer
. It uses the whitespace tokenizer, then normalizes
tokens by Canonical Decomposition, followed by Canonical Composition.
{ "analyzer": "normalizer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "normalizer", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "icuNormalizer", "normalizationForm": "nfc" } ] } ] }
nGram¶
The nGram
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be nGram . | yes | |
minGram | integer | The minimum length of generated n-grams. Must be less than or equal to maxGram. | yes | |
maxGram | integer | The maximum length of generated n-grams. Must be greater than or equal to minGram. | yes | |
termNotInBounds | string | Accepted values are: include or omit. If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed. | no | omit |
The following example index definition uses a custom analyzer named
persianAutocomplete
. It functions as an autocomplete analyzer for
Persian and other languages that use the zero-width non-joiner
character. It performs the following operations:
- Normalizes zero-width non-joiner characters with the persian character filter.
- Tokenizes by whitespace with the whitespace tokenizer.
- Applies a series of token filters:
  - icuNormalizer
  - shingle
  - nGram
{ "analyzer": "persianAutocomplete", "mappings": { "dynamic": true }, "analyzers": [ { "name": "persianAutocomplete", "charFilters": [ { "type": "persian" } ], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "icuNormalizer", "normalizationForm": "nfc" }, { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 }, { "type": "nGram", "minGram": 1, "maxGram": 10 } ] } ] }
edgeGram¶
The edgeGram
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be edgeGram . | yes | |
minGram | integer | The minimum length of generated n-grams. Must be less than or equal to maxGram. | yes | |
maxGram | integer | The maximum length of generated n-grams. Must be greater than or equal to minGram. | yes | |
termNotInBounds | string | Accepted values are: include or omit. If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed. | no | omit |
The following example index definition uses a custom analyzer named
englishAutocomplete
. It performs the following operations:
- Tokenizes with the standard tokenizer.
- Token filtering with the following filters:
  - icuFolding
  - shingle
  - edgeGram
{ "analyzer": "englishAutocomplete", "mappings": { "dynamic": true }, "analyzers": [ { "name": "englishAutocomplete", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "icuFolding" }, { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 }, { "type": "edgeGram", "minGram": 1, "maxGram": 10 } ] } ] }
See the shingle token filter for a sample index definition and query.
shingle¶
The shingle
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be shingle . | yes | |
minShingleSize | integer | Minimum number of tokens per shingle. Must be less than or equal to maxShingleSize. | yes | |
maxShingleSize | integer | Maximum number of tokens per shingle. Must be greater than or equal to minShingleSize. | yes | |
The following example index definition uses two custom
analyzers, emailAutocompleteIndex
and
emailAutocompleteSearch
, to implement autocomplete-like
functionality. Atlas Search uses the emailAutocompleteIndex
analyzer during index creation to:
- Replace
@
characters in a field withAT
- Create tokens with the whitespace tokenizer
- Shingle tokens
- Create edgeGram of those shingled tokens
Atlas Search uses the emailAutocompleteSearch
analyzer during a
search to:
- Replace
@
characters in a field withAT
- Create tokens with the whitespace tokenizer
{ "analyzer": "lucene.keyword", "mappings": { "dynamic": true, "fields": { "page_updated_by": { "type": "document", "fields": { "email": { "type": "string", "analyzer": "emailAutocompleteIndex", "searchAnalyzer": "emailAutocompleteSearch", } } } } }, "analyzers": [ { "name": "emailAutocompleteIndex", "charFilters": [ { "mappings": { "@": "AT" }, "type": "mapping" } ], "tokenizer": { "maxTokenLength": 15, "type": "whitespace" }, "tokenFilters": [ { "maxShingleSize": 3, "minShingleSize": 2, "type": "shingle" }, { "maxGram": 15, "minGram": 2, "type": "edgeGram" } ] }, { "name": "emailAutocompleteSearch", "charFilters": [ { "mappings": { "@": "AT" }, "type": "mapping" } ], "tokenizer": { "maxTokenLength": 15, "type": "whitespace" } } ] }
The following query searches for an email address in the
page_updated_by.email
field of the minutes
collection:
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "auerbach@ex", "path": "page_updated_by.email" } } } ])
The query returns the following results:
{ "_id" : 1, "page_updated_by" : { "last_name" : "AUERBACH", "first_name" : "Siân", "email" : "auerbach@example.com", "phone" : "123-456-7890" }, "text" : "<head> This page deals with department meetings. </head>" }
regex¶
The regex
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be regex . | yes | |
pattern | string | Regular expression pattern to apply to each token. | yes | |
replacement | string | Replacement string to substitute wherever a matching pattern occurs. | yes | |
matches | string | Acceptable values are: all or first. If matches is set to all, replace all matching patterns. Otherwise, replace only the first matching pattern. | yes | |
The following index definition uses a custom analyzer named
emailRedact
for indexing the page_updated_by.email
field in the minutes
collection. It uses the
standard tokenizer. It first
applies the lowercase token filter to turn uppercase
characters in the field to lowercase and then finds strings
that look like email addresses and replaces them with the word
redacted
.
{ "analyzer": "lucene.standard", "mappings": { "dynamic": false, "fields": { "page_updated_by": { "type": "document", "fields": { "email": { "type": "string", "analyzer": "emailRedact" } } } } }, "analyzers": [ { "charFilters": [], "name": "emailRedact", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "matches": "all", "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$", "replacement": "redacted", "type": "regex" } ] } ] }
The following query searches for the term example
in the
page_updated_by.email
field of the minutes
collection.
db.minutes.aggregate([ { $search: { "index": "default", "text": { "query": "example", "path": "page_updated_by.email" } } } ])
Atlas Search doesn't return any results for the query because the
page_updated_by.email
field does not contain any instances
of the word example
that are not in an email address.
Atlas Search replaces strings that match the regular expression
provided in the custom analyzer with the word redacted
.
snowballStemming¶
The snowballStemming
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be snowballStemming . | yes | |
stemmerName | string | The Snowball-generated stemmer to use, specified by language name, for example english or french. | yes | |
The following example index definition uses a custom analyzer named
frenchStemmer
. It uses the lowercase
token filter and the
standard tokenizer, followed
by the french
variant of the snowballStemming
token filter.
{ "analyzer": "frenchStemmer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "frenchStemmer", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "type": "snowballStemming", "stemmerName": "french" } ] } ] }
stopword¶
The stopword
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be stopword . | yes | |
tokens | array of strings | The list of stop words that correspond to the tokens to remove. Value must be one or more stop words. | yes | |
ignoreCase | boolean | The flag that indicates whether or not to ignore the case of stop words when filtering the tokens to remove. The value can be one of the following: true, to ignore case and remove tokens that match a stop word in any case, or false, to remove only tokens that exactly match the case of a stop word. If omitted, defaults to true. | no | true |
The following example index definition uses a custom analyzer named
stopwordRemover. It uses the stopword token filter after the
whitespace tokenizer to remove the tokens that match the defined
stop words is, the, and at. The token filter is case-insensitive and
will remove all tokens that match the specified stop words.
{ "analyzer": "tokenTrimmer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "stopwordRemover", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "stopword", "tokens": ["is", "the", "at"] } ] } ] }
trim¶
The trim
token filter has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be trim . | yes |
The following example index definition uses a custom analyzer named
tokenTrimmer
. It uses the trim
token filter after the
keyword tokenizer to remove leading
and trailing whitespace in the tokens created by the keyword
tokenizer.
"analyzer": "tokenTrimmer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "tokenTrimmer", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "trim" } ] } ] }