Application
Indices
A space where your Documents live.
You can think of an Index as an SQL Table: a space that contains Documents.
Unlike traditional SQL Tables, where you have to define the columns and their types before inserting a row, an Index is highly flexible, and new attributes can be added on the fly.
If you aren't a technical user, think of an Index as a magic box where you can put different stuff inside without worrying whether it fits into the box.
Typically you save the same types of Documents into the same Index.
For example, if you have an online shop, you would save all your products in one Index. But if your online shop contains products in different languages, you would create an Index per language.
This allows you to make different search configurations for each Index.
New Indices come with some default settings, but the real power comes when you make the Index settings match the Documents it contains.
What you configure influences the analysis process.
Analysis
Text processing is necessary to efficiently find a word among thousands of Documents, and it's quite an expensive task.
Still, this is the key that allows us to provide a powerful Search experience.
Let's see how it works.
Each time you add a new Document to your Index, the analysis process takes the text attributes and analyzes them.
Analyzing means passing the text through various filters and splitting it into tokens.
Let's say we have the following JSON Document, and we save it in our Index.
{
"name": "Lady and the Tramp"
}
If we choose to tokenize on whitespace and use the lowercase filter, the analysis process will split the name attribute into the following 4 tokens.
- lady
- and
- the
- tramp
Let's explore why.
The "Whitespace" tokenizer takes the name
attribute and produces 3 tokens. One for each encountered whitespace. So we end up with those 4 tokens:
- Lady
- and
- the
- Tramp
Next, each token is passed to the "Lowercase" filter, which lowercases "Lady" to "lady" and "Tramp" to "tramp", producing the final result.
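To make these two steps concrete, here is a minimal sketch of the pipeline in Python. It only illustrates the idea; it isn't Sigmie's actual implementation.

def analyze(text):
    tokens = text.split(" ")                      # "Whitespace" tokenizer
    return [token.lower() for token in tokens]    # "Lowercase" filter

print(analyze("Lady and the Tramp"))
# ['lady', 'and', 'the', 'tramp']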
Now, if a Query comes in, the same happens again for the Query string.
The user types the Query
- LADY TRAMP
because they forgot to turn off their caps lock.
"LADY TRAMP" passes through the same analysis and changes to:
- lady
- tramp
Now we can find those 2 tokens in the Document that we indexed before, despite the user's caps lock.
Keep in mind that a Document's attributes are analyzed when you create or update it.
Queries, on the other hand, are analyzed in real time, when they come in.
Index update
Because of the analysis process, updating an Index with new settings can take some time depending on the Document count, since each Document needs to be analyzed again.
Tokenization
The first and most important choice we make is where we want to split the incoming Query into tokens.
Word Boundaries
Choosing to tokenize the incoming Query on Word Boundaries is the easiest and safest choice for most cases.
It splits the example Query "Timon said Hakuna-Matata" into these 4 tokens.
- Timon
- said
- Hakuna
- Matata
Whitespaces
If you prefer to keep the "Hakuna-Matata" part of the above Query as a single token, use the "Whitespace" option instead.
The "Whitespace" option takes the above Query and tokenizes it as follows.
- Timon
- said
- Hakuna-Matata
Pattern
The "Word Boundaries" and "Whitespace" options cover the 90 percent of the tokenization cases, but there is also the Pattern option for the edge cases. It allows you to use a Regular Expressions to create a token for every matching occurrence.
Filters
Filters are a set of rules that modify the incoming text before it reaches your Documents.
Trim
The "Trim" filter will remove any leading or trailing whitespace from the tokens.
So the token " red " will change to "red".
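As a minimal sketch, the "Trim" filter behaves like Python's strip applied to every token:

def trim(tokens):
    # remove leading and trailing whitespace from every token
    return [token.strip() for token in tokens]

print(trim([" red "]))  # ['red']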
ASCII folding
"ASCII folding" converts alphabetic, numeric, and symbolic characters to their ASCII equivalent if one exists.
For example, the filter changes açaí à la carte to acai, a, la, carte.
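A rough approximation of ASCII folding can be sketched with Unicode decomposition; note that the real filter covers far more characters than this:

import unicodedata

def ascii_fold(text):
    # decompose accented characters, then drop the non-ASCII marks
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

print(ascii_fold("açaí à la carte"))  # 'acai a la carte'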
Decimal digit
The "Decimal digit" takes all digits and changes them to numbers from 0 to 9.
For example, the Bengali numeral ৩ will change to 3.
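Conceptually, the filter maps every Unicode decimal digit to its ASCII counterpart, along the lines of this sketch:

import unicodedata

def to_ascii_digits(text):
    # unicodedata.decimal() knows the numeric value of digits in any script
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.decimal(ch, None) is not None else ch
        for ch in text
    )

print(to_ascii_digits("৩"))  # '3'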
Unique
The "Unique" filter removes duplicate tokens.
For example, "Dory can't tell where dory went" will change to
- Dory
- can't
- tell
- where
- went
Note here that the second occurrence of the word "dory" is missing.
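Here is a sketch of the "Unique" filter. Judging from the example above, duplicates seem to be compared case-insensitively, so that's assumed here:

def unique(tokens):
    seen, result = set(), []
    for token in tokens:
        key = token.lower()  # assumed: case-insensitive comparison
        if key not in seen:
            seen.add(key)
            result.append(token)
    return result

print(unique(["Dory", "can't", "tell", "where", "dory", "went"]))
# ['Dory', "can't", 'tell', 'where', 'went']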
Character mapping
"Character mapping" is a useful filter to manipulate the text even before it's tokenized.
Think of it as translations; We could use it to normalize the German letter ü
to ue
.
So using this filter we transform the word überraschung
to ueberraschung
.
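A minimal sketch of such a mapping, applied to the raw text before tokenization:

char_map = {"ü": "ue"}  # the German example from above

def map_characters(text):
    for source, target in char_map.items():
        text = text.replace(source, target)
    return text

print(map_characters("überraschung"))  # 'ueberraschung'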
Strip HTML
There are also cases where you scrape webpages and want to make them searchable.
You can use the "Strip HTML" filter in this case.
This filter removes the HTML tags from your text.
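A naive version of such a filter can be sketched with a regular expression that drops anything that looks like a tag; real HTML stripping is more involved:

import re

def strip_html(text):
    return re.sub(r"<[^>]+>", "", text)

print(strip_html("<p>Finding <b>Nemo</b></p>"))  # 'Finding Nemo'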
Text attributes
Keep in mind that those filters are applied to the incoming queries, and also to all text attributes of your documents.
Mapping
Mappings are probably the most critical settings when configuring your Index.
They tell Sigmie which fields are text fields so that it can analyze them.
Types
Text
In our JSON example from above, we have the name field, which we map as Text so that it gets analyzed.
{
"name": "The Sword and the Store"
}
Boolean
Choose the Boolean type for fields like active that contain true or false.
{
"active": true
}
Number
The correct option for Floats and Integers is the Number type.
{
"price": 33.34
}
Date
Not all string fields are texts that need analysis. Dates are one example, and since JSON has no standard Date type, it's better to choose the Date option here to avoid analyzing them.
{
"created_at": "2020-12-29 10:55:13"
}
Keywords
The Keyword type is for attributes that don't require analysis but aren't dates.
For example, a category attribute would be a perfect match for the Keyword option.
{
"category": "winter"
}
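Putting the examples together, the mapping for such Documents could conceptually look like the following. This is a hypothetical illustration of the field-to-type assignment, not Sigmie's exact mapping syntax.

{
    "name": "Text",
    "active": "Boolean",
    "price": "Number",
    "created_at": "Date",
    "category": "Keyword"
}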
Filtering
To use filtering on a field, it has to be mapped as Keyword and not as Text, because Keyword fields aren't analyzed.
Language
Another powerful filter set that can make your results more relevant is the language filters.
If your Index's Documents are in one of the supported languages, you can enjoy powerful filters created for that specific language.
If your Index's language isn't supported yet, you can create your own custom language rules.
Stopwords
All languages have words that are more important than others when searching. Also, there are words without any Search value.
An example from the English language is the "and" word.
The "Stopwords" filter allows to create a list of words that should be ignored.
Synonyms
Some words are different, but they mean the same thing. The "Synonyms" filter is for these cases.
Using the synonyms filter, you can say that "treasure" and "diamond" should be treated the same.
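One common way to sketch this is to map every synonym to a single canonical token, so both words produce the same term at search time:

synonyms = {"diamond": "treasure"}  # treat "diamond" the same as "treasure"

def apply_synonyms(tokens):
    return [synonyms.get(token, token) for token in tokens]

print(apply_synonyms(["a", "hidden", "diamond"]))
# ['a', 'hidden', 'treasure']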
Stemming
Stemming is the process of reducing a word to its root form. In the stemming section of your Index Language settings, you can define your own stemming rules.
For example, the words "went" and "going" could both be stemmed to the word "go".
So if all variants of a word are reduced to the same root form, they will match when searching.
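As a sketch, custom stemming rules can be thought of as a lookup table from word variants to their root form:

stemming_rules = {"went": "go", "going": "go"}  # the example rules from above

def stem(tokens):
    return [stemming_rules.get(token, token) for token in tokens]

print(stem(["keep", "going"]))  # ['keep', 'go']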
Search
Every time a new Index is created, a Search with the same name is created as well, making the Index directly searchable.