A query string is passed through various filters that change its final form.
The analysis process consists of 2 parts, the Tokenisation and the Filter process.
The Tokenisation process takes a given text, "Good boy Pluto" for example, and splits it into tokens according to the chosen "Tokenizer".
Whitespace tokenizer would split the above query on every whitespace, producing the following 3 tokens.
There are 3 Tokenizer options for you to choose from that should cover all use cases.
Word Boundaries will the text "Easy-peasy lemon squeezy" and produce the tokens below.
On the other hand, using the
Whitespace tokenizer will take "Easy-peasy" as a single token, and produce 3 tokens.
Pattern option allows you to write a Regular Expression pattern to handle any edge cases that you may have.
For example, you can specify
, as a pattern to split the "Donald,Mickey, Goofy" into 3 tokens.
Once your text is tokenized, filters come to play and transform the tokens to allow better matching between them.
For example, if a user for any reason writes "mickey mickey Mouse" the tokenizer will give us 3 tokens
Unique filter will remove the double token
mickey and we will get
This way, the double
mickey word won't affect the scoring.
Users many times paste large texts into Search boxes and tokenize big texts is a quite expensive operation that won't help us find better matches.
Using Token limit, we can set a limit on how many tokens are created.
For example, the query "Where did Dory go ?" with a Token limit of 3 will only create the 3 tokens bellow
and ignore the rest.
Removing whitespaces from the begging and from the end of any token is always good practice.
Trim filter will change the
- " Goofy "
The Decimal Digit filter converts all digits to 0-9.
For example, the Bengali numeral
৩ will change to
The Unique filter removes any duplicate tokens.
For example, the tokens:
will change to
removing the double "Pluto" token.
Strip HTML filter removes any HTML from even before it even tokenized.
So from the HTML
<div> <p> The Lion King is a 1994 American animated musical drama film directed by Roger Allers and Rob Minkoff </p> </div>
will remain only the text
The Lion King is a 1994 American animated musical drama film directed by Roger Allers and Rob Minkoff
<p> tags will disappear.
Same as the HTML filter, the char filter manipulates the text before it's tokenized.
Use character mapping filter if you are indexing for example, user comments that tend to have emojis within them.
You can map all happy like
:D emojis to
By doing this, the text "To Infinity and Beyond :D" will change to "To Infinity and Beyond happy".
Now when the user searches for "happy" he get's the Buzz Lightyear's famous line.