Spelling mistakes during a search are common. We may at times search for a word with an incorrect or missing letter; for example, searching for “rama movies” instead of “drama movies”. The search can correct this query and return “drama” movies instead of failing. The principle behind this type of query is called fuzziness, and Elasticsearch employs fuzzy queries to forgive spelling mistakes.
Fuzziness is a process of searching for similar terms based on the Levenshtein distance algorithm (also referred to as the edit distance). The Levenshtein distance is the number of single-character changes (insertions, deletions, or substitutions) needed to turn one word into another. For example, searching for “cake” can fetch “take”, “bake”, “lake”, “make”, and others if a fuzzy query is employed with fuzziness (edit distance) set to 1. The following query should return all the drama genres because applying a fuzziness of 1 to “rama” results in “drama”:
Listing : A fuzzy query in action
GET movies/_search
{
  "query": {
    "fuzzy": {
      "genre": {
        "value": "rama",
        "fuzziness": 1
      }
    }
  }
}
In this example, we use an edit distance of 1 (one character) to fetch similar words. You can also try dropping a different character, such as one from the middle or the end of the word (dama or dram, for example). These all return results when fuzziness is set to 1.
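To make the edit distance concrete, here is a minimal Python sketch of the plain Levenshtein distance. It is only an illustration, not how Elasticsearch implements fuzzy matching: the function names `levenshtein` and `fuzzy_matches` and the small genre list are hypothetical, and the sketch does not count a swap of two adjacent letters as a single edit, which Elasticsearch can do through the fuzzy query's `transpositions` option.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or
    substitutions needed to turn a into b."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(value: str, terms: list, fuzziness: int) -> list:
    """Return the terms within the given edit distance of value."""
    return [t for t in terms if levenshtein(value, t) <= fuzziness]


# Hypothetical genre terms, made up for this illustration.
genres = ["drama", "comedy", "crime", "action"]

print(fuzzy_matches("rama", genres, 1))  # one letter missing
print(fuzzy_matches("dama", genres, 1))  # letter missing in the middle
print(fuzzy_matches("ama", genres, 1))   # two letters missing
```

With this sample list, the first two calls find “drama” at a distance of 1, while “ama” is two edits away and finds nothing until the allowed distance is raised to 2.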
If you drop one more letter (for example, "value": "ama" with fuzziness set to 1), the fuzzy query in the listing shown above does not return any results. Because we are missing two letters, we need to set the edit distance to 2 to solve this issue. The listing below shows this approach:
Listing : Fuzzy query with two letters missing in a word
GET movies/_search
{
  "query": {
    "fuzzy": {
      "genre": {
        "value": "ama",
        "fuzziness": 2
      }
    }
  }
}
This might be a clumsy way of handling things because you often don’t know whether the user has mistyped one letter or several. This is the reason Elasticsearch provides a default setting for fuzziness: the AUTO setting. If the fuzziness attribute is not supplied, the default setting of AUTO is assumed. The AUTO setting deduces the edit distance from the length of the word: with the default thresholds, words of up to 2 characters must match exactly, words of 3 to 5 characters allow one edit, and words longer than 5 characters allow two edits.
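The length-based rule behind AUTO can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Elasticsearch code: the function name `auto_fuzziness` is hypothetical, and the `low` and `high` defaults of 3 and 6 are taken from the thresholds the Elasticsearch documentation gives for AUTO (exact match below 3 characters, one edit for 3 to 5 characters, two edits beyond that).

```python
def auto_fuzziness(term: str, low: int = 3, high: int = 6) -> int:
    """Return the edit distance an AUTO-style rule would allow for term.

    low and high are assumed to mirror the documented defaults (3 and 6).
    """
    length = len(term)
    if length < low:
        return 0  # very short terms must match exactly
    if length < high:
        return 1  # medium-length terms allow one edit
    return 2      # longer terms allow two edits


for word in ["ab", "ama", "rama", "drama", "dramas"]:
    print(word, auto_fuzziness(word))
```

Under these thresholds a three-letter value such as “ama” is allowed only one edit, so a query that needs two edits to reach “drama” would still require an explicit fuzziness of 2.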
Sticking to the default setting of AUTO for the fuzziness attribute is preferred unless you know the use case well enough to choose an explicit edit distance. That’s pretty much it for the term-level queries.
Note: Fuzzy vs. wildcard query. Unlike a wildcard query, which relies on a wildcard operator such as * or ?, a fuzzy query doesn’t use operators; instead, it fetches similar words using the Levenshtein edit distance algorithm.
These short articles are condensed excerpts taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository.