Tokenization rules used in FTS


Tokenization is the process of breaking text into discrete tokens so that the tokens can be inserted into an index and matched during a search.

See the following sections for the basic rules of tokenization that are used in FTS for indexing and searching:

Tokenization rules for literal fields

For literal fields, the content of the field is treated as a single token with no modification. For example, x-y=z becomes x-y=z (one word). For more information, see Literal FTS Index.

For literal fields, the following rules apply:

  • The entire field content is stored as a single word or token in the index. It can be found in a search only by supplying the exact matching string. That is, searching for y does not find a match, but searching for x-y=z does find a match.
  • Words are split at some of the special characters, and the special character is removed.
    • someone@bmc.com becomes someone bmc.com (two words).
      However, if you enter someone@bmc.com as the search term, the exact match gets the highest rank.
    • pqr=hij becomes pqr hij (two words).
      The exact match gets the highest rank.
    • Alpha#Omega becomes Alpha Omega (two words).
      The exact match gets the highest rank.
    • Smith&Brown becomes Smith Brown (two words).
      The exact match gets the highest rank.
  • Words are split at hyphens. The hyphen is treated as a tokenizer.
    • abc-def=xyz becomes abc def xyz (three words).
      The exact match gets the highest rank.
    • KX-13AF9 becomes KX 13AF9 (two words).
      The exact match gets the highest rank.
  • Certain special characters are considered non-tokenizers and the word remains a single word.
    • one:two becomes one:two (one word).
      The search term one does not match the text, but the search term one:two does.
    • www.bmc.com becomes www.bmc.com (one word).
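
The split rules in the list above can be sketched in a few lines of Python. This is only an illustration and not the FTS engine's code: the function name split_literal_words is hypothetical, and the set of split characters is an assumption limited to the characters that appear in the examples (@, =, #, &, and the hyphen). The colon and the dot are left alone, as in the examples. Ranking of exact matches is not modeled.

```python
import re

# Assumption: only the split characters shown in the examples are modeled.
# The real engine may split at other special characters as well.
_SPLIT_RE = re.compile(r"[@=#&\-]+")


def split_literal_words(value):
    """Split a literal field value at the assumed tokenizer characters.

    Characters such as ':' and '.' are treated as non-tokenizers,
    so they stay inside the word.
    """
    return [word for word in _SPLIT_RE.split(value) if word]
```

Under these assumptions, split_literal_words("abc-def=xyz") returns ['abc', 'def', 'xyz'], while split_literal_words("one:two") returns ['one:two'].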

Tokenization rules for non-literal fields

For non-literal fields, the following rules apply:

  • Words are split at punctuation characters, and punctuation is removed. However, a dot that is not followed by white space is considered part of a token.
    • one:two becomes one two (two words).
    • Alpha#Omega becomes Alpha Omega (two words).
    • x.y.z becomes x.y.z (one word).
  • Words are split at hyphens unless the token contains a number, in which case the whole token is interpreted as a product number and is not split.
    • x-y=z becomes x y z (three words).
    • KX-13AF9 becomes KX-13AF9 (one word).
  • Email addresses and Internet host names are recognized as one token.
    • someone@bmc.com becomes someone@bmc.com (one word).
    • www.bmc.com becomes www.bmc.com (one word).
  • In words with no spaces, the ampersand (&) is retained.
    • Smith&Brown becomes Smith&Brown (one word).
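
The non-literal rules above can be approximated with a short Python sketch. It is a rough model under stated assumptions, not the actual FTS tokenizer: the function name tokenize_non_literal and all regular expressions are hypothetical, only ASCII letters and digits are treated as word characters, and the email and product-number patterns are deliberately simple.

```python
import re

# All patterns below are illustrative assumptions, not the engine's grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+")
HYPHENATED_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+")
AMPERSAND_RE = re.compile(r"[A-Za-z0-9]+(?:&[A-Za-z0-9]+)+")
# Split at anything that is not a letter, a digit, or a dot.
SPLIT_RE = re.compile(r"[^A-Za-z0-9.]+")


def tokenize_non_literal(text):
    """Approximate the non-literal tokenization rules."""
    tokens = []
    for chunk in text.split():
        # A dot (or other punctuation) followed by white space is dropped.
        chunk = chunk.rstrip(".,;:!?")
        if not chunk:
            continue
        has_digit = any(ch.isdigit() for ch in chunk)
        if EMAIL_RE.fullmatch(chunk):
            tokens.append(chunk)      # email address: one token
        elif has_digit and HYPHENATED_RE.fullmatch(chunk):
            tokens.append(chunk)      # product number: not split at hyphens
        elif AMPERSAND_RE.fullmatch(chunk):
            tokens.append(chunk)      # ampersand inside a word is retained
        else:
            # Split at punctuation; embedded dots stay in the token,
            # which also keeps host names such as www.bmc.com whole.
            tokens.extend(t for t in SPLIT_RE.split(chunk) if t)
    return tokens
```

Under these assumptions, tokenize_non_literal("x-y=z KX-13AF9") returns ['x', 'y', 'z', 'KX-13AF9'].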
