[Elasticsearch] Building a simple spell corrector with elasticsearch
Note: A Java implementation of the same approach is available on GitHub.
Let’s build a simple spell corrector using Elasticsearch.
Users commonly make typos while searching in web applications. If your application implements search, it should detect typos and either correct them or suggest the right words (like Google’s “Did you mean” feature) to improve the user experience. So, how can you achieve this? Elasticsearch’s term suggester to the rescue.
Let’s see how the term suggester solves our problem. It uses edit distance to find the indexed words closest to the misspelled word and suggests them as replacements. So, how does it know which words are correct? That depends on the data you have indexed into Elasticsearch: if it finds a close enough word in your data, it suggests that word as an alternative to the misspelled one.
For example: if the user types “midium”, it can suggest “medium”, provided the word “medium” is present in your index.
NOTE: If there is no data in your index, Elasticsearch cannot suggest any words. It can only suggest words drawn from the data set present in your index.
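To make the edit distance idea concrete, here is a minimal Python sketch. It uses plain Levenshtein distance over a small in-memory word list, which is a simplification and not Elasticsearch's actual implementation. The names `edit_distance`, `closest_words`, and `max_edits` are hypothetical, chosen only for this illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def closest_words(word, indexed_words, max_edits=2):
    """Return indexed words within max_edits of word, nearest first.
    Exact matches (distance 0) are skipped: they need no correction."""
    candidates = sorted((edit_distance(word, w), w) for w in indexed_words)
    return [w for distance, w in candidates if 0 < distance <= max_edits]


# Stand-in for the words present in the index (illustrative data).
indexed_words = ["medium", "disappoint", "ecstasy"]
print(closest_words("midium", indexed_words))  # ['medium']
```

Note that an empty `indexed_words` list yields no suggestions at all, which mirrors the point above: suggestions can only come from indexed data.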
Now, let’s try to implement the solution.
First, let’s create the index with its settings and mapping.
PUT test
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "data": {
      "properties": {
        "my_field": {
          "type": "text"
        }
      }
    }
  }
}
We created an index named test with a mapping called data, and defined a field named my_field that stores text.
Let’s insert some data into our index. I did a quick Google search for commonly misspelled words and am indexing a few of them into the test index.
PUT test/data/1
{
  "my_field": "disappoint"
}

PUT test/data/2
{
  "my_field": "ecstasy"
}

PUT test/data/3
{
  "my_field": "embarrass"
}
Now that we have inserted the data, let’s try searching with a wrong spelling. Below is the search query.
GET test/_search
{
  "query": {
    "match": {
      "my_field": "dissappoint"
    }
  }
}
In the above query we searched for “dissappoint”, a misspelled word, and got no results. When a search returns no results, a spelling mistake by the user is a likely cause, and you can use the term suggester to suggest new words to your users.
Now let’s see how to use the term suggester to suggest a new word for the user.
POST test/_search
{
  "suggest": {
    "mytermsuggester": {
      "text": "dissappoint",
      "term": {
        "field": "my_field"
      }
    }
  }
}
The above request asks for terms that are close to the input text. Here I asked the term suggester for words close to the misspelled word “dissappoint”. I named my suggester mytermsuggester (you can name it anything) and told it to draw suggestions from the field my_field. Now, let’s see the result of the above query.
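If you want to issue the same request from application code instead of a console, a sketch using only Python's standard library could look like this. The base URL `http://localhost:9200` and the helper name `term_suggest_request` are assumptions for illustration. The function only builds the request and does not send it.

```python
import json
import urllib.request


def term_suggest_request(text, field, base_url="http://localhost:9200",
                         index="test", name="mytermsuggester"):
    """Build (but do not send) the HTTP request for a term suggestion.
    base_url is an assumed local Elasticsearch address."""
    body = {"suggest": {name: {"text": text, "term": {"field": field}}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = term_suggest_request("dissappoint", "my_field")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Against a running cluster, the request could then be sent with `urllib.request.urlopen(request)` and the response parsed with `json.load`.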
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "suggest": {
    "mytermsuggester": [
      {
        "text": "dissappoint", // User input or misspelled word.
        "offset": 0,
        "length": 11,
        "options": [
          {
            "text": "disappoint", // This is the suggested word.
            "score": 0.9,
            "freq": 1
          }
        ]
      }
    ]
  }
}
Aha! We got the word “disappoint” as a suggestion, and it is present in our index. Now you can offer this word to the user, or apply the correction yourself and show the results for the corrected word. Each option carries a score and a freq along with the word. The score reflects how close the suggestion is to the input in terms of edit distance. The freq is the number of documents in your index that contain the suggested word.
There are lots of other options available for the term suggester; refer to the Elasticsearch documentation for the full list. Some of the important ones are:
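As a rough sketch of the "correct the word yourself" path, the Python function below reads a response shaped like the one above and rewrites the query using the first option of each entry. The function name `did_you_mean` is hypothetical, and the sketch assumes the query string is exactly the text that was sent to the suggester, so that `offset` and `length` line up with it.

```python
def did_you_mean(query, response, suggester_name="mytermsuggester"):
    """Return a corrected query, or None if the search already found
    documents or the suggester offered nothing."""
    if response.get("hits", {}).get("hits"):
        return None  # The original query matched; leave it alone.
    entries = response.get("suggest", {}).get(suggester_name, [])
    corrected = query
    # Replace from right to left so earlier offsets stay valid.
    for entry in sorted(entries, key=lambda e: e["offset"], reverse=True):
        if entry["options"]:
            start = entry["offset"]
            end = start + entry["length"]
            best = entry["options"][0]["text"]  # options arrive sorted
            corrected = corrected[:start] + best + corrected[end:]
    return corrected if corrected != query else None


# Trimmed copy of the response shown above.
sample_response = {
    "hits": {"total": 0, "max_score": 0, "hits": []},
    "suggest": {
        "mytermsuggester": [
            {
                "text": "dissappoint",
                "offset": 0,
                "length": 11,
                "options": [{"text": "disappoint", "score": 0.9, "freq": 1}],
            }
        ]
    },
}

print(did_you_mean("dissappoint", sample_response))  # disappoint
```

Whether to apply the correction silently or show it as a suggestion is a product decision; returning `None` when nothing changes keeps both options open for the caller.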
- min_doc_freq: The minimum number of documents a word must appear in to be suggested. For example, if the value is 5, the word has to occur in at least 5 different documents.
- max_term_freq: The maximum number of documents the input word itself may appear in and still be spell-checked. Words above this threshold are usually spelled correctly, so they are skipped. The value can be an absolute count or a fraction of documents, such as 0.01.
- sort: How to order the suggested words. If the value is score, suggestions are ordered by score first; if the value is frequency, they are ordered by document frequency first.
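To show where these options go, here is a small Python sketch that builds the suggest request body with them placed inside the `term` object, next to `field`. The helper name `term_suggest_body` is hypothetical. Note that Elasticsearch's documented values for `sort` are `score` and `frequency`, which is what the sketch validates.

```python
def term_suggest_body(text, field, name="mytermsuggester",
                      min_doc_freq=None, max_term_freq=None, sort=None):
    """Build a term suggester request body; options left as None are
    omitted so Elasticsearch falls back to its defaults."""
    term = {"field": field}
    if min_doc_freq is not None:
        term["min_doc_freq"] = min_doc_freq
    if max_term_freq is not None:
        term["max_term_freq"] = max_term_freq
    if sort is not None:
        if sort not in ("score", "frequency"):
            raise ValueError("sort must be 'score' or 'frequency'")
        term["sort"] = sort
    return {"suggest": {name: {"text": text, "term": term}}}


body = term_suggest_body("dissappoint", "my_field",
                         min_doc_freq=1, sort="frequency")
print(body)
```

The resulting dictionary can be serialized with `json.dumps` and sent as the body of a `POST test/_search` request.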
Hope you enjoyed it ! :)