Character Filters

Character filters require a type field, and some take additional options as well.

"charFilters": [
  {
    "type": "<filter-type>",
    "<additional-option>": <value>
  }
]
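
As a rough illustration of that shape, the following sketch checks that every entry in a charFilters array carries a type naming one of the four filter types documented on this page. The checkCharFilters helper and the KNOWN_TYPES list are hypothetical names invented for this example; they are not part of any Atlas Search API.

```javascript
// Hypothetical helper for illustration only; not part of Atlas Search.
const KNOWN_TYPES = ["htmlStrip", "icuNormalize", "mapping", "persian"];

// Returns a list of human-readable problems; an empty list means the
// array has the basic shape described above.
function checkCharFilters(charFilters) {
  const problems = [];
  charFilters.forEach((filter, i) => {
    if (typeof filter.type !== "string") {
      problems.push(`charFilters[${i}] is missing the required "type" field`);
    } else if (!KNOWN_TYPES.includes(filter.type)) {
      problems.push(`charFilters[${i}] has unknown type "${filter.type}"`);
    }
  });
  return problems;
}

const problems = checkCharFilters([
  { type: "htmlStrip", ignoredTags: ["a"] }, // well-formed
  { ignoredTags: ["b"] },                    // no type field
]);
console.log(problems);
```

A check like this only catches a missing or misspelled type; it does not validate the filter-specific options.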

Atlas Search supports four types of character filters: htmlStrip, icuNormalize, mapping, and persian.

The htmlStrip character filter strips out HTML constructs. It has the following attributes:

| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| type | string | Human-readable label that identifies this character filter type. Value must be htmlStrip. | yes | |
| ignoredTags | array of strings | List that contains the HTML tags to exclude from filtering. | no | |

Example

The following index definition example uses a custom analyzer named htmlStrippingAnalyzer on the minutes collection. The analyzer uses the htmlStrip character filter to remove all HTML tags from the text except the a tag. It uses the standard tokenizer and no token filters.

{
  "analyzer": "htmlStrippingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [{
    "name": "htmlStrippingAnalyzer",
    "charFilters": [{
      "type": "htmlStrip",
      "ignoredTags": ["a"]
    }],
    "tokenizer": {
      "type": "standard"
    },
    "tokenFilters": []
  }]
}

The following search operation looks for occurrences of the string head in the text field of the minutes collection.

db.minutes.aggregate([
  {
    $search: {
      text: {
        query: "head",
        path: "text"
      }
    }
  },
  {
    $project: {
      "_id": 1,
      "text": 1
    }
  }
])

The query returns the following results:

{ "_id" : 2, "text" : "The head of the sales department spoke first." }
{ "_id" : 3, "text" : "<body>We'll head out to the conference room by noon.</body>" }

The document with _id: 1 is not returned, because the string head is part of the HTML tag <head>, which the character filter strips before the text is tokenized. The document with _id: 3 also contains HTML tags, but the string head appears in the text outside those tags, so the document is a match.
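
To make that behavior concrete, here is a minimal sketch that imitates the effect with a regular expression. It is only an approximation: the real filter is implemented in Lucene and handles far more of HTML than this does. The stripHtml and containsTerm functions, the word-splitting rule, and the first sample document are all assumptions made for illustration; only the second sample text comes from the results above.

```javascript
// Rough approximation of htmlStrip; the real filter is not a regex.
function stripHtml(text, ignoredTags = []) {
  const keep = new Set(ignoredTags.map((t) => t.toLowerCase()));
  // Match opening and closing tags, capturing the tag name.
  return text.replace(/<\/?([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g, (tag, name) =>
    keep.has(name.toLowerCase()) ? tag : ""
  );
}

// Simplified stand-in for tokenizing: lowercase, split on non-word characters.
function containsTerm(text, term) {
  return text.toLowerCase().split(/[^a-z0-9']+/).includes(term);
}

const docs = [
  // Hypothetical text in which "head" occurs only as a tag name.
  { _id: "tagOnly", text: "<head> Department meeting notes </head>" },
  // Text taken from the query results above.
  { _id: "bodyText", text: "<body>We'll head out to the conference room by noon.</body>" },
];

const matches = docs
  .filter((d) => containsTerm(stripHtml(d.text, ["a"]), "head"))
  .map((d) => d._id);
console.log(matches);
```

Without the stripping step, the word-splitting rule in this sketch would treat the tag name in the first document as a term, which is why the character filter changes which documents match.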

The icuNormalize character filter normalizes text with the ICU Normalizer. It is based on Lucene's ICUNormalizer2CharFilter. It has the following attribute:

| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| type | string | Human-readable label that identifies this character filter type. Value must be icuNormalize. | yes | |

Example

The following index definition example uses a custom analyzer named normalizingAnalyzer. It uses the icuNormalize character filter, the whitespace tokenizer, and no token filters.

{
  "analyzer": "normalizingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "normalizingAnalyzer",
      "charFilters": [
        {
          "type": "icuNormalize"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": []
    }
  ]
}
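
For a feel of what this kind of normalization does to text, the sketch below uses JavaScript's built-in String.prototype.normalize. It assumes the filter behaves like the default mode of Lucene's ICUNormalizer2CharFilter (NFKC normalization with case folding); this page does not state which mode Atlas Search uses. NFKC followed by lowercasing is only an approximation of ICU case folding, and approxIcuNormalize is a hypothetical name.

```javascript
// Approximation only: NFKC plus lowercasing stands in for ICU's NFKC_Casefold.
function approxIcuNormalize(text) {
  return text.normalize("NFKC").toLowerCase();
}

// Full-width "Full" followed by "file" spelled with the U+FB01 "fi" ligature.
const raw = "\uFF26\uFF55\uFF4C\uFF4C \uFB01le";
const normalized = approxIcuNormalize(raw);
console.log(normalized); // full file
```

Because compatibility variants collapse to one form before tokenization, a query typed with plain ASCII letters can match text stored with full-width or ligature characters.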

The mapping character filter applies user-specified normalization mappings to characters. It is based on Lucene's MappingCharFilter. It has the following attributes:

| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| type | string | Human-readable label that identifies this character filter type. Value must be mapping. | yes | |
| mappings | object | Object that contains a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format <original> : <replacement>. | yes | |

Example

The following index definition example uses a custom analyzer named mappingAnalyzer. It uses the mapping character filter to replace each backslash (\, written as \\ inside a JSON string) with a forward slash (/). It uses the keyword tokenizer and no token filters.

{
  "analyzer": "mappingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "mappingAnalyzer",
      "charFilters": [
        {
          "type": "mapping",
          "mappings": {
            "\\": "/"
          }
        }
      ],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": []
    }
  ]
}
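
The effect of a mappings object can be sketched in a few lines of JavaScript. The applyMappings function below is a hypothetical illustration, not the filter's actual code. It follows the greedy rule documented for Lucene's MappingCharFilter, where the longest matching pattern at a given position wins, and the sample path string is made up.

```javascript
// Illustrative sketch: at each position, apply the longest matching key.
function applyMappings(text, mappings) {
  const keys = Object.keys(mappings)
    .filter((k) => k.length > 0)
    .sort((a, b) => b.length - a.length); // longest keys first
  let out = "";
  let i = 0;
  while (i < text.length) {
    const key = keys.find((k) => text.startsWith(k, i));
    if (key === undefined) {
      out += text[i]; // no mapping here, copy the character through
      i += 1;
    } else {
      out += mappings[key]; // substitute and skip past the matched key
      i += key.length;
    }
  }
  return out;
}

// Same mapping as the index definition above: one backslash becomes "/".
const mapped = applyMappings("C:\\reports\\2022", { "\\": "/" });
console.log(mapped); // C:/reports/2022
```

Note that "\\" in the JavaScript source, as in the JSON definition, denotes a single backslash character.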

Tip: See also the shingle token filter for a sample index definition and query.

The persian character filter replaces instances of the zero-width non-joiner (U+200C) with an ordinary space. It is based on Lucene's PersianCharFilter. It has the following attribute:

| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| type | string | Human-readable label that identifies this character filter type. Value must be persian.| yes | |

Example

The following example index definition uses a custom analyzer named persianCharacterIndex. It uses the persian character filter, the whitespace tokenizer, and no token filters.

{
  "analyzer": "persianCharacterIndex",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "persianCharacterIndex",
      "charFilters": [
        {
          "type": "persian"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": []
    }
  ]
}
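
The replacement this filter performs is simple enough to sketch directly. The persianCharFilter function below is a hypothetical stand-in written for this example, and the sample word was chosen for illustration. Splitting on whitespace afterward only loosely mimics the whitespace tokenizer.

```javascript
const ZWNJ = "\u200C"; // zero-width non-joiner

// Illustrative stand-in: replace every ZWNJ with an ordinary space.
function persianCharFilter(text) {
  return text.split(ZWNJ).join(" ");
}

// A Persian word written with a ZWNJ between its prefix and stem.
const word = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645";
const filtered = persianCharFilter(word);

// Splitting on whitespace now yields two tokens instead of one.
const tokens = filtered.split(/\s+/);
console.log(tokens.length); // 2
```

Without the filter, splitting on whitespace would keep the whole word as a single token, because the zero-width non-joiner is not matched by \s.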
© 2022 MongoDB, Inc.
