Tokenization rules used in FTS
Tokenization is the process of breaking text into discrete tokens so that the tokens can be inserted into an index and searched. The following are the basic rules of tokenization that FTS uses for indexing and searching.
For literal fields, the content of the field is treated as a single token with no modification. For example, x-y=z becomes x-y=z (one word). All the text constitutes the token or word stored in the index. It can be found in a search only by supplying the exact matching string. That is, searching for y does not find a match, but searching for x-y=z does find a match.
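The literal-field behavior can be illustrated with a minimal Python sketch. The function names `index_literal` and `search_literal` are hypothetical and are not part of FTS; the sketch only assumes, as described above, that the whole field value is stored as one token and that a search must supply the exact string.

```python
def index_literal(value: str) -> set[str]:
    """Index a literal field: the entire value is stored as a single token."""
    return {value}


def search_literal(index: set[str], term: str) -> bool:
    """A literal field matches only when the term equals the stored token."""
    return term in index


index = index_literal("x-y=z")
print(search_literal(index, "y"))      # False: a fragment does not match
print(search_literal(index, "x-y=z"))  # True: the exact string matches
```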
For non-literal fields, the following rules apply:
- Words are split at punctuation characters, and punctuation is removed. However, a dot that is not followed by white space is considered part of a token.
- one:two becomes one two (two words).
- Alpha#Omega becomes Alpha Omega (two words).
- x.y.z becomes x.y.z (one word).
- Words are split at hyphens unless the token contains a number, in which case the whole token is interpreted as a product number and is not split.
- x-y=z becomes x y z (three words).
- KX-13AF9 becomes KX-13AF9 (one word).
- Email addresses and internet host names are recognized as one token.
- firstname.lastname@example.org becomes firstname.lastname@example.org (one word).
- www.bmc.com becomes www.bmc.com (one word).
- In words with no spaces, the ampersand (&) is retained.
- Smith&Brown becomes Smith&Brown (one word).
- Words are split at some of the special characters, and the special character is removed.
- someone@bmc.com becomes someone bmc.com (two words). However, if you enter someone@bmc.com as the search term, the exact match gets the highest rank.
- pqr=hij becomes pqr hij (two words). Exact match gets the highest rank.
- Alpha#Omega becomes Alpha Omega (two words). Exact match gets the highest rank.
- Smith&Brown becomes Smith Brown (two words). Exact match gets the highest rank.
- Words are split at hyphens; the hyphen is treated as a tokenizer.
- abc-def=xyz becomes abc def xyz (three words). Exact match gets the highest rank.
- KX-13AF9 becomes KX 13AF9 (two words). Exact match gets the highest rank.
- Certain special characters are treated as non-tokenizers, and the word remains a single word.
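The list above describes two groups of behavior for non-literal fields: a first group in which product numbers, email addresses, host names, and ampersand-joined words stay whole, and a second group in which @, =, #, &, and the hyphen all act as tokenizers while an exact match is ranked highest. The following Python sketch approximates both groups so that the examples can be checked. It is not the FTS implementation: the function names, the email pattern, and the 2/1/0 scoring in `rank` are assumptions made purely for illustration.

```python
import re

# Assumed pattern for an email address; the rules above do not define one.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def tokenize_first_group(text: str) -> list[str]:
    """Sketch of the rules up to and including the ampersand rule."""
    tokens = []
    for word in text.split():
        # Punctuation at the end of a word (for example, a dot followed
        # by white space) is dropped.
        word = word.rstrip(".,;:!?")
        if not word:
            continue
        if EMAIL.fullmatch(word):
            tokens.append(word)  # email address: kept as one token
        elif "-" in word and any(ch.isdigit() for ch in word):
            tokens.append(word)  # treated as a product number: not split
        else:
            # Split at everything except letters, digits, underscore,
            # dot, and ampersand; those characters stay inside the token.
            tokens.extend(p for p in re.split(r"[^\w.&]+", word) if p)
    return tokens


def tokenize_second_group(text: str) -> list[str]:
    """Sketch of the later rules: @, =, #, & and - all split the word."""
    return [p for p in re.split(r"[\s@=#&-]+", text) if p]


def rank(document: str, query: str) -> int:
    """Hypothetical scoring: 2 for an exact match of the raw term,
    1 if every token of the query occurs in the document, otherwise 0."""
    if query in document.split():
        return 2
    doc_tokens = set(tokenize_second_group(document))
    if all(t in doc_tokens for t in tokenize_second_group(query)):
        return 1
    return 0


print(tokenize_first_group("x-y=z"))         # ['x', 'y', 'z']
print(tokenize_first_group("KX-13AF9"))      # ['KX-13AF9']
print(tokenize_second_group("KX-13AF9"))     # ['KX', '13AF9']
print(tokenize_second_group("abc-def=xyz"))  # ['abc', 'def', 'xyz']
print(rank("contact someone@bmc.com today", "someone@bmc.com"))  # 2
print(rank("someone at bmc.com", "someone@bmc.com"))             # 1
```

The `rank` function only shows one way an exact match could outrank a token-level match; how FTS actually computes relevance is not specified in the rules above.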