
Tokenizers

On this page

  • edgeGram
  • keyword
  • nGram
  • regexCaptureGroup
  • regexSplit
  • standard
  • uaxUrlEmail
  • whitespace

A custom analyzer's tokenizer determines how Atlas Search splits up text into discrete chunks for indexing. Tokenizers require a type field, and some take additional options as well.

"tokenizer": {
"type": "<tokenizer-type>",
"<additional-option>": "<value>"
}

Atlas Search supports the following tokenizer options:

edgeGram

The edgeGram tokenizer tokenizes input from the left side, or "edge", of a text input into n-grams of given sizes. You can't use the edgeGram tokenizer in synonym or autocomplete mapping definitions. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be edgeGram.
  • minGram (integer, required): Number of characters to include in the shortest token created.
  • maxGram (integer, required): Number of characters to include in the longest token created.

Example

The following index definition example uses a custom analyzer named edgegramShingler. It uses the edgeGram tokenizer to create tokens between 2 and 5 characters long, starting from the first character of the text input, followed by the shingle token filter.

{
  "analyzer": "edgegramShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "edgegramShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "edgeGram",
        "minGram": 2,
        "maxGram": 5
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
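
For illustration only (the sample input below is an assumption, not part of the index definition above), suppose the indexed field value is the single word "message". With minGram set to 2 and maxGram set to 5, the edgeGram tokenizer alone, before the shingle token filter runs, would emit tokens similar to the following:

Input:  "message"
Tokens: ["me", "mes", "mess", "messa"]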

keyword

The keyword tokenizer tokenizes the entire input as a single token. It has the following attribute:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be keyword.

Note

Atlas Search won't index string fields that exceed 32766 characters using the keyword tokenizer.

Example

The following index definition example uses a custom analyzer named keywordTokenizingIndex. It uses the keyword tokenizer and a regular expression token filter that redacts email addresses.

{
  "analyzer": "keywordTokenizingIndex",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "keywordTokenizingIndex",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "regex",
          "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$",
          "replacement": "redacted",
          "matches": "all"
        }
      ]
    }
  ]
}
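
For example (the sample value below is an assumption, not from the index definition above), if a field contains only the address jane.doe@example.com, the keyword tokenizer emits the entire value as one token, and the regex token filter then replaces that token because the whole string matches the email pattern:

Input:                   "jane.doe@example.com"
After keyword tokenizer: ["jane.doe@example.com"]
After regex filter:      ["redacted"]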

nGram

The nGram tokenizer tokenizes input into text chunks, or "n-grams", of given sizes. You can't use the nGram tokenizer in synonym or autocomplete mapping definitions. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be nGram.
  • minGram (integer, required): Number of characters to include in the shortest token created.
  • maxGram (integer, required): Number of characters to include in the longest token created.

Example

The following index definition example uses a custom analyzer named ngramShingler. It uses the nGram tokenizer to create tokens between 2 and 5 characters long, followed by the shingle token filter.

{
  "analyzer": "ngramShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "ngramShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "nGram",
        "minGram": 2,
        "maxGram": 5
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
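
For illustration only (the sample input below is an assumption, not part of the index definition above), suppose the indexed field value is the word "cafe". With minGram set to 2 and maxGram set to 5, the nGram tokenizer alone would emit roughly the following tokens; the exact order may vary:

Input:  "cafe"
Tokens: ["ca", "caf", "cafe", "af", "afe", "fe"]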

regexCaptureGroup

The regexCaptureGroup tokenizer matches a regular expression pattern to extract tokens. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be regexCaptureGroup.
  • pattern (string, required): Regular expression to match against.
  • group (integer, required): Index of the character group within the matching expression to extract into tokens. Use 0 to extract all character groups.

Example

The following index definition example uses a custom analyzer named phoneNumberExtractor. It uses the regexCaptureGroup tokenizer to create a single token from the first US-formatted phone number present in the text input.

{
  "analyzer": "phoneNumberExtractor",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "phoneNumberExtractor",
      "charFilters": [],
      "tokenizer": {
        "type": "regexCaptureGroup",
        "pattern": "^\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b$",
        "group": 0
      },
      "tokenFilters": []
    }
  ]
}
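
Because the pattern above is anchored with ^ and $, it only matches when the entire field value is a US-formatted phone number. For illustration only (the sample value below is an assumption), a matching value produces a single token containing the full match, since group is set to 0:

Input:  "414-555-1234"
Tokens: ["414-555-1234"]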

regexSplit

The regexSplit tokenizer splits tokens with a regular-expression-based delimiter. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be regexSplit.
  • pattern (string, required): Regular expression to match against.

Example

The following index definition example uses a custom analyzer named dashSplitter. It uses the regexSplit tokenizer to create tokens from hyphen-delimited input text.

{
  "analyzer": "dashSplitter",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "dashSplitter",
      "charFilters": [],
      "tokenizer": {
        "type": "regexSplit",
        "pattern": "[-]+"
      },
      "tokenFilters": []
    }
  ]
}
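
For illustration only (the sample value below is an assumption), a hyphen-delimited value is split wherever one or more hyphens occur, and the hyphens themselves are discarded:

Input:  "2022-01-15"
Tokens: ["2022", "01", "15"]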

standard

The standard tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be standard.
  • maxTokenLength (integer, optional, default 255): Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens.

Example

The following index definition example uses a custom analyzer named standardShingler. It uses the standard tokenizer and the shingle token filter.

{
  "analyzer": "standardShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "standardShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "standard",
        "maxTokenLength": 10
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
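
For illustration only (the sample sentence below is an assumption), the standard tokenizer splits the input on word boundaries and preserves the original case; the shingle token filter then combines adjacent tokens into shingles of two and three tokens:

Input:           "It was the best of times"
After tokenizer: ["It", "was", "the", "best", "of", "times"]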

Tip

See also:

The regex token filter for a sample index definition and query.

uaxUrlEmail

The uaxUrlEmail tokenizer tokenizes URLs and email addresses. Although the uaxUrlEmail tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm, we recommend using it only when the indexed field value includes URLs and email addresses. For fields that don't include URLs or email addresses, use the standard tokenizer to create tokens based on word break rules. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be uaxUrlEmail.
  • maxTokenLength (integer, optional, default 255): Maximum number of characters in one token.

Example

The following index definition example uses a custom analyzer named emailUrlExtractor. It uses the uaxUrlEmail tokenizer to create tokens of up to 200 characters each from all text in the input, including email addresses and URLs. It converts all text to lowercase using the lowercase token filter.

{
  "analyzer": "emailUrlExtractor",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "emailUrlExtractor",
      "charFilters": [],
      "tokenizer": {
        "type": "uaxUrlEmail",
        "maxTokenLength": 200
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ]
    }
  ]
}
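
For illustration only (the sample value below is an assumption), the uaxUrlEmail tokenizer keeps an email address intact as a single token instead of splitting it on punctuation, and the lowercase token filter then lowercases every token:

Input:           "Email Jane.Doe@example.com"
After tokenizer: ["Email", "Jane.Doe@example.com"]
After lowercase: ["email", "jane.doe@example.com"]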

whitespace

The whitespace tokenizer tokenizes based on occurrences of whitespace between words. It has the following attributes:

  • type (string, required): Human-readable label that identifies this tokenizer type. Value must be whitespace.
  • maxTokenLength (integer, optional, default 255): Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens.

Example

The following index definition example uses a custom analyzer named whitespaceLowerer. It uses the whitespace tokenizer and a token filter that lowercases all tokens.

{
  "analyzer": "whitespaceLowerer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "whitespaceLowerer",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ]
    }
  ]
}
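
For illustration only (the sample value below is an assumption), the whitespace tokenizer splits only on whitespace, so punctuation stays attached to its word; the lowercase token filter then lowercases each token:

Input:           "Try Before You Buy."
After tokenizer: ["Try", "Before", "You", "Buy."]
After lowercase: ["try", "before", "you", "buy."]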

Tip

See also:

The shingle token filter for a sample index definition and query.
