Just Elasticsearch: 6/n Term Level Queries

This is the sixth article in a series explaining Elasticsearch as simply as possible with practical examples. The articles are condensed versions of topics from my upcoming book, Just Elasticsearch. My aim is to create a simple, straight-to-the-point, example-driven book that one can read over a weekend to get started with Elasticsearch.

Just Elasticsearch By Madhusudhan Konda

Articles in the series:

  1. Introducing Elasticsearch
  2. Architecture
  3. Indexing Operations
  4. Document Operations
  5. Search Basics

All the code snippets, examples, and datasets related to this series of articles are available on my GitHub Wiki.

Overview

Term-level search is structured search: the queries return only the documents that match exactly. It targets structured data such as dates, numbers, and ranges. The results don’t reflect how well a document matches (that is, how relevant it is to the query); a document is either returned because it matches or it is not. A term-level query therefore produces a binary yes/no outcome, similar to a database’s WHERE clause.

Although the documents have a score associated with them, the scores really don’t matter.

One important characteristic of term-level queries is that they are not analyzed (unlike full-text queries). That is, the search terms are matched against the inverted index as-is, without first being run through an analyzer. That means the search words must exactly match the terms stored in the inverted index.

Term-level queries are hence suitable for keyword searches, not text field searches. Like keywords, numerics, booleans, ranges, and so on are not analyzed either; they are indexed as-is.

Term-Level Queries

Elasticsearch exposes a handful of term-level queries, including the term, terms, IDs, exists, range, wildcard, prefix, and fuzzy queries, among others. As it is impossible to fit every type of query into this article, I have picked out some important or popular ones, which we go over in the following sections.

Indexing Sample Movie Data

We will create some movie test data for this chapter along with movie mappings. We don’t want Elasticsearch to deduce the field types. Instead, we will provide the relevant data type for each field as mappings when we create the index (especially the release_date and duration fields, which can’t be text fields).

Both the movie mappings and the sample data file are available on my GitHub here.

Now that we have primed our Elasticsearch server with data, let’s roll up our sleeves and start querying using term-level queries.

Term Query

The term query fetches the documents that exactly match a given value for a field. The search value is not analyzed; instead, it is matched as-is against the inverted index. Using our movie dataset, if we want to search for movies with a 12A rating, we can write a term query as shown below:

GET movies/_search
{
  "query": {
    "term": {
      "rating": "12A"
    }
  }
}

Sure enough, you get all the 12A movies (The Dark Knight) in the results.

While we are here, change the rating value in the query to 12a (with a lowercase a) and rerun it.

Did you notice that you didn’t get any results? Can you guess the reason?

Well, remember the rating field is a keyword type, so its value never goes through the analysis process, and the search value is matched directly against the contents of the inverted index. When the document was indexed, 12A was not tokenized or processed by any filters (as it is a keyword type field), so it was inserted into the inverted index as-is. When we search for a rating of 12a, that value isn’t in the index (12A was indexed, not 12a), hence no results.
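To make the 12A versus 12a behaviour concrete, here is a minimal Python sketch of a keyword-style index. It is only a toy model, not how Elasticsearch is implemented; the function names and the second sample document are made up for illustration.

```python
# Toy model of a keyword field: each value is stored as ONE term, unchanged.
# Illustration only; build_keyword_index and term_query are hypothetical names.
from collections import defaultdict


def build_keyword_index(docs, field):
    """Map each exact field value to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        index[doc[field]].add(doc_id)
    return index


def term_query(index, value):
    """Look the value up as-is: no lowercasing, no tokenizing."""
    return sorted(index.get(value, set()))


# Hypothetical sample documents (not the article's full dataset).
docs = {
    1: {"title": "The Dark Knight", "rating": "12A"},
    2: {"title": "Pulp Fiction", "rating": "18"},
}
rating_index = build_keyword_index(docs, "rating")

print(term_query(rating_index, "12A"))  # [1]
print(term_query(rating_index, "12a"))  # []
```

The lookup is a plain dictionary access, which is why a difference in letter case is enough to miss the document.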

Terms Query

The terms query is the big sister of the term query: it searches for multiple words against a single field. Say we want to search for all movies with any of several content ratings, like PG, 12A, 15, etc. We use the terms query for this purpose:

GET movies/_search
{
  "query": {
    "terms": {
      "rating": ["12A", "15", "PG"]
    }
  }
}

The terms query expects a list of search words to be queried against a field, passed in as an array to the terms object. The above example returns all the movies with a 12A, 15, or PG rating.

IDs Query

The IDs query will fetch the documents given a set of document IDs. It’s a much simpler way to fetch the documents using a list of document IDs in one go:

GET movies/_search
{
  "query": {
    "ids": {
      "values": [10, 8, 6, 4]
    }
  }
}

The above query returns the four documents with those IDs.

Exists Query

As the name suggests, the exists query fetches the documents in which a given field exists. For example, if you run the following query, you’ll get documents back because we know that the documents have a title field:

GET movies/_search
{
  "query": {
    "exists": {
      "field": "title"
    }
  }
}

If the field doesn’t exist in any document, the response will have an empty hits array (hits[]).
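What counts as "existing" is worth spelling out. The Python sketch below models the rules as I understand them from the Elasticsearch documentation: a field that is absent, null, an empty array, or an array holding only nulls is treated as missing, while an empty string still counts as a value. The helper name field_exists is hypothetical, and the sketch ignores mapping-related cases such as unindexed fields.

```python
# Toy model of the exists check; field_exists is a hypothetical helper name.
def field_exists(doc, field):
    """Return True if the field holds at least one non-null value."""
    value = doc.get(field)  # an absent field behaves like null
    values = value if isinstance(value, list) else [value]
    return any(v is not None for v in values)


print(field_exists({"title": "The Godfather"}, "title"))  # True
print(field_exists({"title": ""}, "title"))               # True: empty string is a value
print(field_exists({"title": None}, "title"))             # False
print(field_exists({"title": []}, "title"))               # False
print(field_exists({"title": [None, "x"]}, "title"))      # True
print(field_exists({}, "title"))                          # False
```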

Range Query

The range query returns the documents whose field value falls within a given range. For example, if we want to fetch all the movies with a user_rating between 9.0 and 9.5, we execute the following range query:

GET movies/_search
{
  "query": {
    "range": {
      "user_rating": {
        "gte": 9.0,
        "lte": 9.5
      }
    }
  }
}

The above query will fetch the movies whose rating falls within the specified bracket (gte is short for the greater-than-or-equal-to operator, lte for less-than-or-equal-to, and so on).

If you wish to fetch all the movies released since the start of 1970, you write the query as:

GET movies/_search
{
  "query": {
    "range": {
      "release_date": {
        "gte": "01-01-1970"
      }
    }
  },
  "sort": [
    {
      "release_date": {
        "order": "asc"
      }
    }
  ]
}

We are also sorting the movies in ascending order of release date using the sort attribute on the query, so the movies are returned from oldest to newest.
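The filter-then-sort behaviour of this query can be sketched in a few lines of Python. This is a rough model under stated assumptions: the movie entries are invented placeholders, the date string is assumed to be in day-month-year order, and range_query is a hypothetical helper rather than anything Elasticsearch provides.

```python
from datetime import date, datetime


def parse_dd_mm_yyyy(text):
    """Parse a day-month-year string such as '01-01-1970' (assumed format)."""
    return datetime.strptime(text, "%d-%m-%Y").date()


def range_query(docs, field, gte=None, lte=None):
    """Keep documents whose field value lies within the inclusive bounds."""
    hits = []
    for doc in docs:
        value = doc[field]
        if gte is not None and value < gte:
            continue
        if lte is not None and value > lte:
            continue
        hits.append(doc)
    return hits


# Placeholder documents, not the article's dataset.
movies = [
    {"title": "Movie A", "release_date": date(1994, 10, 14)},
    {"title": "Movie B", "release_date": date(1957, 4, 10)},
    {"title": "Movie C", "release_date": date(1972, 3, 24)},
]

hits = range_query(movies, "release_date", gte=parse_dd_mm_yyyy("01-01-1970"))
hits.sort(key=lambda doc: doc["release_date"])  # ascending: oldest first
print([doc["title"] for doc in hits])  # ['Movie C', 'Movie A']
```

Both bounds are optional, which mirrors the open-ended query above where only gte is supplied.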

Wildcard Query

At times we wish to use wildcards in a search, for example, to find all movies with a title word ending in “father” (“*father”), starting with “god” (“god*”), or even with a single unknown character, like “god?ather”. This is where we use a wildcard query. The wildcard query accepts an asterisk (*) or a question mark (?) in the search word. The asterisk matches zero or more characters, while the ? matches exactly one character.

Let’s search for documents where the movie title starts with “god”:

GET movies/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "god*"
      }
    }
  }
}

We should see both movies (Godfather and Godfather II) returned for the above wildcard query. We can tweak the queries by placing wildcards anywhere in the word.

The ? wildcard is used when exactly one character is expected to be matched. Replace the value in the query with “value”: “god?ather” and rerun it to fetch the results. You can use multiple ? characters if you wish, like “g???ather”.
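The two wildcard characters can be tried out locally with Python's standard fnmatch module, whose * and ? behave the same way. This is only an approximation for experimenting with patterns: fnmatch also gives special meaning to square brackets, which the wildcard query does not, and the list of terms below is a made-up stand-in for the lowercase tokens an analyzed title field might hold.

```python
from fnmatch import fnmatchcase


def wildcard_query(terms, pattern):
    """Return the terms that match the pattern as a whole, case-sensitively."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# Hypothetical index terms, assuming titles were lowercased and tokenized.
terms = ["the", "godfather", "part", "ii", "dark", "knight"]

print(wildcard_query(terms, "god*"))       # ['godfather']
print(wildcard_query(terms, "*father"))    # ['godfather']
print(wildcard_query(terms, "god?ather"))  # ['godfather']
print(wildcard_query(terms, "g???ather"))  # ['godfather']
print(wildcard_query(terms, "go?ather"))   # []: one ? cannot stand in for two letters
```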

Prefix Query

At times we wish to query for words by their prefix, like “red” for “redemption” or “red letter day”, or “god” for “godfather” or “Godzilla”. The following prefix query does exactly that, fetching The Shawshank Redemption when you search for the prefix “red”:

GET movies/_search
{
  "query": {
    "prefix": {
      "title": "red"
    }
  }
}
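Why does the prefix “red” find a title that begins with “The”? A likely explanation, assuming title is mapped as a text field, is that the title was split into lowercase tokens at index time and the prefix is compared against each token. The Python sketch below models that; analyze is a crude stand-in for a standard-style analyzer and both function names are hypothetical.

```python
import re


def analyze(text):
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def prefix_query(titles, prefix):
    """Match a title if ANY of its tokens starts with the prefix, taken as-is."""
    return [
        title
        for title in titles
        if any(token.startswith(prefix) for token in analyze(title))
    ]


titles = ["The Shawshank Redemption", "The Godfather", "The Dark Knight"]

print(prefix_query(titles, "red"))  # ['The Shawshank Redemption']
print(prefix_query(titles, "god"))  # ['The Godfather']
print(prefix_query(titles, "Red"))  # []: the prefix itself is not lowercased
```

The last call shows the term-level nature of the query: the indexed tokens were lowercased, but the search prefix is used exactly as typed.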

Fuzzy Query

We may at times search for a word with an incorrect or missing letter, for example, users searching for “rama” movies instead of “drama” movies, or “Lava” instead of “Java”. The search should correct this query and return “drama” movies instead of failing. The principle behind this type of query is called fuzziness, which is based on the Levenshtein edit distance.

The edit distance is the number of single-character changes (insertions, deletions, or substitutions) needed to turn one word into another. For example, searching for “cake” can fetch “take”, “bake”, “lake”, “make”, and others if a fuzzy query is employed with a fuzziness (edit distance) set to 1.

The following query should return all the movies in the “drama” genre, because applying a fuzziness of 1 to the word “rama” yields “drama”.

GET movies/_search
{
  "query": {
    "fuzzy": {
      "genre": {
        "value": "rama",
        "fuzziness": 1
      }
    }
  }
}

In this example, we are using the edit distance of one (one character) to fetch similar words.

If you drop one more letter (“ama”), unfortunately, the fuzzy query will not return results as the edit distance was set to 1. Of course, you can set the edit distance as 2 to solve this issue. This might be a clumsy way of handling things as sometimes you wouldn’t know if the user has mistyped one letter or a few.
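The distances quoted above can be checked with a short Python implementation of the classic Levenshtein distance. This is a plain textbook version for illustration; as far as I know, Elasticsearch by default also counts swapping two adjacent characters as a single edit, which this sketch does not model.

```python
def levenshtein(a, b):
    """Count the insertions, deletions, and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("rama", "drama"))  # 1: within a fuzziness of 1
print(levenshtein("ama", "drama"))   # 2: needs a fuzziness of 2
print(levenshtein("cake", "take"))   # 1
```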

This is why Elasticsearch provides a default fuzziness setting, AUTO, which picks the allowed edit distance based on the length of the search term. If the fuzziness attribute is not supplied, AUTO is assumed. Sticking with AUTO is preferred unless you know the exact needs of the use case at hand.
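As documented, AUTO chooses the edit distance from the length of the search term: terms of up to two characters must match exactly, terms of three to five characters allow one edit, and longer terms allow two. A small Python sketch of that rule follows; the function name auto_fuzziness is hypothetical, and its low and high parameters mirror the default thresholds of 3 and 6.

```python
def auto_fuzziness(term, low=3, high=6):
    """Return the edit distance AUTO would allow for a term of this length."""
    length = len(term)
    if length < low:
        return 0  # very short terms must match exactly
    if length < high:
        return 1  # medium terms tolerate one edit
    return 2      # longer terms tolerate two edits


print(auto_fuzziness("go"))         # 0
print(auto_fuzziness("ama"))        # 1
print(auto_fuzziness("rama"))       # 1
print(auto_fuzziness("godfather"))  # 2
```

Note that “ama” is only three characters long, so under this rule AUTO would still allow a single edit for it, which is not enough to reach “drama”.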

Summary

Elasticsearch has almost a dozen term-level queries out of the box. I strongly advise going over the documentation here to work through some of the other term-level queries. In the next article, we will learn about full-text search queries.

All the code snippets and datasets are available at my GitHub Wiki