# Character Filters
Character filters require a `type` field, and some take additional options as well.

```
"charFilters": [
  {
    "type": "<filter-type>",
    "<additional-option>": <value>
  }
]
```
Atlas Search supports four types of character filters:
## htmlStrip

The `htmlStrip` character filter strips out HTML constructs. It has the following attributes:
| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| `type` | string | Human-readable label that identifies this character filter type. Value must be `htmlStrip`. | yes | |
| `ignoredTags` | array of strings | List that contains the HTML tags to exclude from filtering. | no | |
The following index definition example uses a custom analyzer named `htmlStrippingAnalyzer`. It uses the `htmlStrip` character filter to remove all HTML tags from the text except the `a` tag in the `minutes` collection. It uses the `standard` tokenizer and no token filters.
```json
{
  "analyzer": "htmlStrippingAnalyzer",
  "mappings": { "dynamic": true },
  "analyzers": [
    {
      "name": "htmlStrippingAnalyzer",
      "charFilters": [
        {
          "type": "htmlStrip",
          "ignoredTags": ["a"]
        }
      ],
      "tokenizer": { "type": "standard" },
      "tokenFilters": []
    }
  ]
}
```
The following search operation looks for occurrences of the string `head` in the `text` field of the `minutes` collection.
```javascript
db.minutes.aggregate([
  {
    $search: {
      text: {
        query: "head",
        path: "text"
      }
    }
  },
  {
    $project: {
      "_id": 1,
      "text": 1
    }
  }
])
```
The query returns the following results:
```json
{ "_id" : 2, "text" : "The head of the sales department spoke first." }
{ "_id" : 3, "text" : "<body>We'll head out to the conference room by noon.</body>" }
```
The document with `_id: 1` is not returned because the string `head` is part of the HTML tag `<head>`, which the character filter strips out. The document with `_id: 3` also contains HTML tags, but the string `head` appears outside of any tag, so the document is a match.
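To see why stripping happens before matching, the filter's effect can be approximated in a few lines of Python. This is only an illustrative sketch, not Lucene's actual `HTMLStripCharFilter` (which is offset-preserving and fully HTML-aware); the regex-based `html_strip` helper here is a hypothetical stand-in that drops any tag whose name is not in `ignored_tags`.

```python
import re

def html_strip(text, ignored_tags=()):
    """Simplified sketch: remove HTML tags except those in ignored_tags."""
    keep = "|".join(re.escape(t) for t in ignored_tags) or r"(?!)"
    # Match an opening or closing tag whose name is NOT an ignored tag.
    return re.sub(rf"</?(?!(?:{keep})\b)[a-zA-Z][^>]*>", "", text)

# The string "head" inside the <head> tag disappears with the tag itself...
print(html_strip("<head>Section 1</head>"))
# ...but "head" in ordinary text survives, even when other tags are stripped.
print(html_strip("<body>We'll head out to the conference room by noon.</body>"))
```

With `ignored_tags=["a"]`, as in the `htmlStrippingAnalyzer` example above, `<a>` tags would pass through the filter untouched while every other tag is removed.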
## icuNormalize

The `icuNormalize` character filter normalizes text with the ICU Normalizer. It is based on Lucene's ICUNormalizer2CharFilter. It has the following attribute:
| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| `type` | string | Human-readable label that identifies this character filter type. Value must be `icuNormalize`. | yes | |
The following index definition example uses a custom analyzer named `normalizingAnalyzer`. It uses the `icuNormalize` character filter, the `whitespace` tokenizer, and no token filters.
```json
{
  "analyzer": "normalizingAnalyzer",
  "mappings": { "dynamic": true },
  "analyzers": [
    {
      "name": "normalizingAnalyzer",
      "charFilters": [
        {
          "type": "icuNormalize"
        }
      ],
      "tokenizer": { "type": "whitespace" },
      "tokenFilters": []
    }
  ]
}
```
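The kind of normalization ICU performs can be approximated with the Python standard library. This is an assumption-laden sketch: Lucene's ICUNormalizer2CharFilter defaults to NFKC case-folding, which the hypothetical `icu_like_normalize` helper below imitates with `unicodedata.normalize("NFKC", ...)` plus `str.casefold()`; the real ICU implementation handles more cases.

```python
import unicodedata

def icu_like_normalize(text):
    # Rough stand-in for ICU's NFKC case-fold normalization:
    # compatibility-decompose and recompose, then case-fold.
    return unicodedata.normalize("NFKC", text).casefold()

# Full-width letters and the "fi" ligature fold to plain ASCII.
print(icu_like_normalize("ＭｏｎｇｏＤＢ ﬁle"))  # → mongodb file
```

This is why queries against a field indexed with `normalizingAnalyzer` can match text typed with full-width characters, ligatures, or mixed case.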
## mapping

The `mapping` character filter applies user-specified normalization mappings to characters. It is based on Lucene's MappingCharFilter. It has the following attributes:
| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| `type` | string | Human-readable label that identifies this character filter type. Value must be `mapping`. | yes | |
| `mappings` | object | Object that contains a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format `<original> : <replacement>`. | yes | |
The following index definition example uses a custom analyzer named `mappingAnalyzer`. It uses the `mapping` character filter to replace instances of `\\` with `/`. It uses the `keyword` tokenizer and no token filters.
```json
{
  "analyzer": "mappingAnalyzer",
  "mappings": { "dynamic": true },
  "analyzers": [
    {
      "name": "mappingAnalyzer",
      "charFilters": [
        {
          "type": "mapping",
          "mappings": { "\\": "/" }
        }
      ],
      "tokenizer": { "type": "keyword" },
      "tokenFilters": []
    }
  ]
}
```
See the shingle token filter for a sample index definition and query.
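Conceptually, the `mapping` filter performs plain substring substitution on the text before the tokenizer sees it. A minimal Python sketch (the `apply_mappings` helper is hypothetical; Lucene's MappingCharFilter additionally supports multi-character patterns with longest-match semantics):

```python
def apply_mappings(text, mappings):
    # Substitute each <original> -> <replacement> pair before tokenization.
    # Note: with overlapping patterns, real MappingCharFilter uses
    # longest-match rules; sequential str.replace is a simplification.
    for original, replacement in mappings.items():
        text = text.replace(original, replacement)
    return text

# Same mapping as the mappingAnalyzer example: backslash -> forward slash.
print(apply_mappings("C:\\network\\share", {"\\": "/"}))  # → C:/network/share
```

Because the example analyzer uses the `keyword` tokenizer, the whole rewritten string is indexed as a single token.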
## persian

The `persian` character filter replaces instances of the zero-width non-joiner character with an ordinary space. It is based on Lucene's PersianCharFilter. It has the following attribute:
| Name | Type | Description | Required? | Default |
| --- | --- | --- | --- | --- |
| `type` | string | Human-readable label that identifies this character filter type. Value must be `persian`. | yes | |
The following index definition example uses a custom analyzer named `persianCharacterIndex`. It uses the `persian` character filter, the `whitespace` tokenizer, and no token filters.
```json
{
  "analyzer": "persianCharacterIndex",
  "mappings": { "dynamic": true },
  "analyzers": [
    {
      "name": "persianCharacterIndex",
      "charFilters": [
        {
          "type": "persian"
        }
      ],
      "tokenizer": { "type": "whitespace" },
      "tokenFilters": []
    }
  ]
}
```
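The substitution itself is simple to sketch. The `persian_char_filter` helper below is an illustrative approximation of the filter's core behavior, replacing each zero-width non-joiner (U+200C) with a space so that a downstream `whitespace` tokenizer splits on it:

```python
ZWNJ = "\u200c"  # zero-width non-joiner, common in Persian compound words

def persian_char_filter(text):
    # Replace each ZWNJ with an ordinary space; combined with the
    # whitespace tokenizer, the joined parts become separate tokens.
    return text.replace(ZWNJ, " ")

word = "می\u200cخواهم"  # a Persian compound written with a ZWNJ
print(persian_char_filter(word).split())  # two separate tokens
```

Without this filter, a whitespace tokenizer would treat the ZWNJ-joined compound as a single token, since U+200C is not whitespace.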