Elasticsearch [Search for word with space or without space (new york or newyork)]

Mallikarjuna J S
Apr 2, 2017

I was using Elasticsearch and ran into a specific problem in one of the applications I was working on. Numerous English terms are written as two separate words in some contexts and as a single word in others.

Example: New York can appear as newyork or new york. Suppose your dataset contains newyork (without a space); when you search for new york (with a space), you will end up with no results for the search you made.

In order to solve this problem we can make use of Elasticsearch tokenizers and filters.

So let us solve the problem. Elasticsearch index settings consist of three main analysis components (analyzers, tokenizers, and filters), along with other index-related settings.

Let us create the settings required for our search. We will define a custom analyzer that can be mapped to a field when creating mappings for our index.

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "bigram_combiner": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_shingle",
            "my_char_filter"
          ]
        }
      },
      "filter": {
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": true
        },
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": " ",
          "replacement": ""
        }
      }
    }
  }
}

So, the trick here is the combination of custom_shingle and my_char_filter.

The shingle filter emits combinations of adjacent words, and my_char_filter (despite its name, a pattern_replace token filter, not a character filter) removes the space inside each shingle, turning it into a single combined word. Let's analyze what our custom analyzer bigram_combiner does.

POST test/_analyze
{
  "analyzer": "bigram_combiner",
  "text": "new york"
}

Result:

{
  "tokens": [
    {
      "token": "new",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "newyork",
      "start_offset": 0,
      "end_offset": 8,
      "type": "shingle",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "york",
      "start_offset": 4,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

So, if we use bigram_combiner as the analyzer for our field, we accomplish what we need: new york is analyzed into three tokens: new, york, and newyork.
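To build intuition for what the analyzer chain emits, here is a rough Python sketch of the same pipeline (an approximation for illustration only, not the actual Lucene implementation; the real standard tokenizer also handles punctuation and tracks offsets):

```python
def bigram_combiner(text, min_shingle=2, max_shingle=3):
    # rough stand-in for the standard tokenizer + lowercase filter
    words = text.lower().split()
    # output_unigrams: true keeps the original single words
    tokens = list(words)
    # the shingle filter joins runs of 2..3 adjacent words with a space;
    # the pattern_replace filter then strips that space
    for size in range(min_shingle, max_shingle + 1):
        for start in range(len(words) - size + 1):
            shingle = " ".join(words[start:start + size])
            tokens.append(shingle.replace(" ", ""))
    return tokens

print(bigram_combiner("new york"))  # ['new', 'york', 'newyork']
```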

Now searching the field for either new york or newyork yields the result we wanted.

Let us see how to use this analyzer in our index. Now that we have applied the settings above, let us add the corresponding mapping. We will create a mapping type named cities with a field called city, which uses the bigram_combiner analyzer.

PUT test/_mapping/cities
{
  "properties": {
    "city": {
      "type": "text",
      "analyzer": "bigram_combiner"
    }
  }
}

Now let's insert the data:

PUT test/cities/1
{
  "city": "new york"
}
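Before searching, you can double-check that the field actually picked up the custom analyzer by analyzing against the field name instead of the analyzer name (the exact response shape may vary by Elasticsearch version):

```
POST test/_analyze
{
  "field": "city",
  "text": "new york"
}
```

This should return the same three tokens (new, newyork, york) as the earlier analyzer test.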

The index is ready and the data is inserted; it is time to search and see the magic.

Query

GET test/_search
{
  "query": {
    "match": {
      "city": "newyork"
    }
  }
}

Here is the output from the above query.

Output

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.9181343,
    "hits": [
      {
        "_index": "test",
        "_type": "cities",
        "_id": "1",
        "_score": 0.9181343,
        "_source": {
          "city": "new york"
        }
      }
    ]
  }
}
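The reverse direction works too: because a match query runs the search text through the field's analyzer, a query with the space also produces the combined newyork token and matches the same document:

```
GET test/_search
{
  "query": {
    "match": {
      "city": "new york"
    }
  }
}
```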

Finally, we achieved what we wanted: we searched for newyork without a space and still got back the document containing new york as two separate words.

We achieved this simply by combining filters inside a custom analyzer. Elasticsearch tokenizers and filters give you much more power over search when used in combination. There are many other filters available in ES, which you can find in the Elasticsearch documentation.

Be careful when using these settings on your data: the shingle filter produces combinations of all adjacent words, so fields containing more than two words generate many extra tokens and consume more memory in your index. Hope this is helpful.
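To put a rough number on that growth: with output_unigrams enabled and shingle sizes 2 to 3, an n-word value produces n unigrams, n-1 bigram shingles, and n-2 trigram shingles. A small sketch of that arithmetic (my own back-of-the-envelope count, not an Elasticsearch API):

```python
def token_count(n_words, min_shingle=2, max_shingle=3):
    # unigrams (output_unigrams: true) ...
    total = n_words
    # ... plus one shingle per starting position, for each shingle size
    for size in range(min_shingle, max_shingle + 1):
        total += max(n_words - size + 1, 0)
    return total

for n in (2, 5, 20):
    print(n, "words ->", token_count(n), "tokens")
```

A two-word city name is just 3 tokens, but a 20-word field already produces 57, so this analyzer is best kept on short fields like names.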
