# Character Filters
Character filters require a `type` field, and some take additional options as well.

```
"charFilters": [
  {
    "type": "<filter-type>",
    "<additional-option>": <value>
  }
]
```
Atlas Search supports four types of character filters:
## htmlStrip

The `htmlStrip` character filter strips out HTML constructs. It has the following attributes:
| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| `type` | string | Human-readable label that identifies this character filter type. Value must be `htmlStrip`. | yes | |
| `ignoredTags` | array of strings | List that contains the HTML tags to exclude from filtering. | no | |
The following index definition example uses a custom analyzer named `htmlStrippingAnalyzer`. It uses the `htmlStrip` character filter to remove all HTML tags from the text except the `a` tag in the `minutes` collection. It uses the `standard` tokenizer and no token filters.
```json
{
  "analyzer": "htmlStrippingAnalyzer",
  "mappings": { "dynamic": true },
  "analyzers": [
    {
      "name": "htmlStrippingAnalyzer",
      "charFilters": [
        {
          "type": "htmlStrip",
          "ignoredTags": ["a"]
        }
      ],
      "tokenizer": { "type": "standard" },
      "tokenFilters": []
    }
  ]
}
```
The following search operation looks for occurrences of the string `head` in the `text` field of the `minutes` collection.
```javascript
db.minutes.aggregate([
  {
    $search: {
      text: {
        query: "head",
        path: "text"
      }
    }
  },
  {
    $project: {
      "_id": 1,
      "text": 1
    }
  }
])
```
The query returns the following results:
```json
{ "_id" : 2, "text" : "The head of the sales department spoke first." }
{ "_id" : 3, "text" : "<body>We'll head out to the conference room by noon.</body>" }
```
The document with `_id: 1` is not returned because the string `head` is part of the HTML tag `<head>`, which the character filter strips out. The document with `_id: 3` also contains HTML tags, but the string `head` appears outside of any tag, so the document is a match.
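To see why stripping happens before matching, the filter's effect can be approximated in a few lines of Python. This is only an illustrative sketch, not Lucene's actual `HTMLStripCharFilter` (which is offset-preserving and fully HTML-aware); the regex-based `html_strip` helper here is a hypothetical stand-in that drops any tag whose name is not in `ignored_tags`.

```python
import re

def html_strip(text, ignored_tags=()):
    """Simplified sketch: remove HTML tags except those in ignored_tags."""
    keep = "|".join(re.escape(t) for t in ignored_tags) or r"(?!)"
    # Match an opening or closing tag whose name is NOT an ignored tag.
    return re.sub(rf"</?(?!(?:{keep})\b)[a-zA-Z][^>]*>", "", text)

# The string "head" inside the <head> tag disappears with the tag itself...
print(html_strip("<head>Section 1</head>"))
# ...but "head" in ordinary text survives, even when other tags are stripped.
print(html_strip("<body>We'll head out to the conference room by noon.</body>"))
```

With `ignored_tags=["a"]`, as in the `htmlStrippingAnalyzer` example above, `<a>` tags would pass through the filter untouched while every other tag is removed.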
## icuNormalize

The `icuNormalize` character filter normalizes text with the ICU Normalizer. It is based on Lucene's ICUNormalizer2CharFilter. It has the following attribute:
| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| `type` | string | Human-readable label that identifies this character filter type. Value must be `icuNormalize`. | yes | |
The following index definition example uses a custom analyzer named `normalizingAnalyzer`. It uses the `icuNormalize` character filter, the `whitespace` tokenizer, and no token filters.
```json
{
  "analyzer": "normalizingAnalyzer",
  "mappings": { "dynamic": true },
  "analyzers": [
    {
      "name": "normalizingAnalyzer",
      "charFilters": [
        {
          "type": "icuNormalize"
        }
      ],
      "tokenizer": { "type": "whitespace" },
      "tokenFilters": []
    }
  ]
}
```
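The kind of normalization ICU performs can be approximated with the Python standard library. This is an assumption-laden sketch: Lucene's ICUNormalizer2CharFilter defaults to NFKC case-folding, which the hypothetical `icu_like_normalize` helper below imitates with `unicodedata.normalize("NFKC", ...)` plus `str.casefold()`; the real ICU implementation handles more cases.

```python
import unicodedata

def icu_like_normalize(text):
    # Rough stand-in for ICU's NFKC case-fold normalization:
    # compatibility-decompose and recompose, then case-fold.
    return unicodedata.normalize("NFKC", text).casefold()

# Full-width letters and the "fi" ligature fold to plain ASCII.
print(icu_like_normalize("ＭｏｎｇｏＤＢ ﬁle"))  # → mongodb file
```

This is why queries against a field indexed with `normalizingAnalyzer` can match text typed with full-width characters, ligatures, or mixed case.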
## mapping

The `mapping` character filter applies user-specified normalization mappings to characters. It is based on Lucene's MappingCharFilter. It has the following attributes:
| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| `type` | string | Human-readable label that identifies this character filter type. Value must be `mapping`. | yes | |
| `mappings` | object | Object that contains a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format `<original> : <replacement>`. | yes | |
The following index definition example uses a custom analyzer named `mappingAnalyzer`. It uses the `mapping` character filter to replace instances of `\\` with `/`. It uses the `keyword` tokenizer and no token filters.
```json
{
  "analyzer": "mappingAnalyzer",
  "mappings": { "dynamic": true },
  "analyzers": [
    {
      "name": "mappingAnalyzer",
      "charFilters": [
        {
          "type": "mapping",
          "mappings": { "\\": "/" }
        }
      ],
      "tokenizer": { "type": "keyword" },
      "tokenFilters": []
    }
  ]
}
```
See the shingle token filter for a sample index definition and query.
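Conceptually, the `mapping` filter performs plain substring substitution on the text before the tokenizer sees it. A minimal Python sketch (the `apply_mappings` helper is hypothetical; Lucene's MappingCharFilter additionally supports multi-character patterns with longest-match semantics):

```python
def apply_mappings(text, mappings):
    # Substitute each <original> -> <replacement> pair before tokenization.
    # Note: with overlapping patterns, real MappingCharFilter uses
    # longest-match rules; sequential str.replace is a simplification.
    for original, replacement in mappings.items():
        text = text.replace(original, replacement)
    return text

# Same mapping as the mappingAnalyzer example: backslash -> forward slash.
print(apply_mappings("C:\\network\\share", {"\\": "/"}))  # → C:/network/share
```

Because the example analyzer uses the `keyword` tokenizer, the whole rewritten string is indexed as a single token.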
## persian

The `persian` character filter replaces instances of the zero-width non-joiner character with an ordinary space. It is based on Lucene's PersianCharFilter. It has the following attribute:
| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| `type` | string | Human-readable label that identifies this character filter type. Value must be `persian`. | yes | |
The following index definition example uses a custom analyzer named `persianCharacterIndex`. It uses the `persian` character filter, the `whitespace` tokenizer, and no token filters.
```json
{
  "analyzer": "persianCharacterIndex",
  "mappings": { "dynamic": true },
  "analyzers": [
    {
      "name": "persianCharacterIndex",
      "charFilters": [
        {
          "type": "persian"
        }
      ],
      "tokenizer": { "type": "whitespace" },
      "tokenFilters": []
    }
  ]
}
```
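The substitution itself is simple to sketch. The `persian_char_filter` helper below is an illustrative approximation of the filter's core behavior, replacing each zero-width non-joiner (U+200C) with a space so that a downstream `whitespace` tokenizer splits on it:

```python
ZWNJ = "\u200c"  # zero-width non-joiner, common in Persian compound words

def persian_char_filter(text):
    # Replace each ZWNJ with an ordinary space; combined with the
    # whitespace tokenizer, the joined parts become separate tokens.
    return text.replace(ZWNJ, " ")

word = "می\u200cخواهم"  # a Persian compound written with a ZWNJ
print(persian_char_filter(word).split())  # two separate tokens
```

Without this filter, a whitespace tokenizer would treat the ZWNJ-joined compound as a single token, since U+200C is not whitespace.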