Fuzzy queries – Chocolate Minds

Spelling mistakes during a search are common. We may at times search for a word with an incorrect letter or letters; for example, searching for rama movies instead of drama movies. The search can correct this query and return “drama” movies instead of failing. The principle behind this type of query is called fuzziness, and Elasticsearch employs fuzzy queries to forgive spelling mistakes.

Fuzziness is the process of searching for similar terms based on the Levenshtein distance algorithm (also referred to as the edit distance). The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one word into another. By default, Elasticsearch also counts a swap of two adjacent characters as a single edit. For example, searching for “cake” can fetch “take”, “bake”, “lake”, “make”, and others if a fuzzy query is employed with fuzziness (edit distance) set to 1. The following query should return all the movies in the drama genre because applying a fuzziness of 1 to “rama” results in “drama”:

Listing : A fuzzy query in action

GET movies/_search
{
  "query": {
    "fuzzy": {
      "genre": {
        "value": "rama",
        "fuzziness": 1
      }
    }
  }
}

In this example, we use an edit distance of 1 (one character) to fetch similar words. You can also try dropping a character from the middle or the end of the word, such as “dama” or “dram”. These all return results when fuzziness is set to 1.
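To make the edit distance concrete, here is a minimal Python sketch of the classic Levenshtein calculation. It is an illustration only, not Elasticsearch's implementation: the function name `levenshtein` is made up for this example, and the sketch counts insertions, deletions, and substitutions but not adjacent-character swaps.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] is the distance between the already processed prefix
    # of source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char from source
                current[j - 1] + 1,      # insert t_char into source
                previous[j - 1] + cost,  # substitute, or keep if the characters match
            ))
        previous = current
    return previous[-1]


# Distances from a few mistyped values to the indexed term "drama".
for typed in ("rama", "dama", "dram", "ama"):
    print(typed, levenshtein(typed, "drama"))
```

The first three values are one edit away from “drama”, so a fuzziness of 1 reaches them, whereas “ama” is two edits away.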

If you drop one more letter (for example, “value”: “ama” with fuzziness set to 1), the fuzzy query in the listing above returns no results. Because two letters are now missing, we need to set the edit distance to 2 to solve this issue. The listing below shows this approach:

Listing : Fuzzy query with two letters missing in a word

GET movies/_search
{
  "query": {
    "fuzzy": {
      "genre": {
        "value": "ama",
        "fuzziness": 2
      }
    }
  }
}

Hard-coding the edit distance can be clumsy because you rarely know whether the user has mistyped one letter or several. This is the reason Elasticsearch provides a default setting for fuzziness: the AUTO setting. If the fuzziness attribute is not supplied, AUTO is assumed. The AUTO setting deduces the edit distance from the length of the word: terms of 1 or 2 characters must match exactly, terms of 3 to 5 characters allow one edit, and terms longer than 5 characters allow two edits.
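The length-based rule behind AUTO can be sketched in a few lines of Python. This is a rough illustration, assuming the documented default thresholds of 3 and 6 characters (the same values as writing AUTO:3,6). The helper names `auto_fuzziness` and `fuzzy_query` are hypothetical and not part of any Elasticsearch client.

```python
def auto_fuzziness(term: str, low: int = 3, high: int = 6) -> int:
    """Edit distance AUTO would allow for a term, assuming thresholds low=3, high=6."""
    length = len(term)
    if length < low:
        return 0  # very short terms must match exactly
    if length < high:
        return 1  # terms of 3 to 5 characters allow one edit
    return 2      # longer terms allow two edits


def fuzzy_query(field: str, value: str) -> dict:
    """Build a fuzzy query body that states the AUTO setting explicitly."""
    return {"query": {"fuzzy": {field: {"value": value, "fuzziness": "AUTO"}}}}


for term in ("ab", "ama", "rama", "comedy"):
    print(term, auto_fuzziness(term))
```

Under this rule both “rama” and “ama” are allowed a single edit, so with AUTO the value “ama” would still not reach “drama”, which is two edits away.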

Sticking to the default setting of AUTO for the fuzziness attribute is preferred unless your use case gives you a specific reason to fix the edit distance. That’s pretty much it for the term-level queries.

Note: Fuzzy vs. wildcard query – A wildcard query relies on an operator such as * or ? placed in the search term. A fuzzy query uses no operators; it simply fetches similar words using the Levenshtein edit distance algorithm.
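The wildcard side of that comparison can be illustrated with Python's standard `fnmatch` module, used here purely as a stand-in: its * (zero or more characters) and ? (exactly one character) behave like the two wildcard operators mentioned above. The `terms` list and the `wildcard_matches` helper are made up for this sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical indexed terms, for illustration only.
terms = ["drama", "comedy", "panorama"]


def wildcard_matches(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates that match a shell-style wildcard pattern."""
    return [term for term in candidates if fnmatchcase(term, pattern)]


print(wildcard_matches("?rama", terms))  # ['drama']
print(wildcard_matches("*rama", terms))  # ['drama', 'panorama']
print(wildcard_matches("rama", terms))   # []
```

Without an operator, the plain value “rama” matches nothing here, which is exactly the gap a fuzzy query fills: the user does not have to know where the mistake is.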

These short articles are condensed excerpts taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository.
