Just Elasticsearch: 5/n. Search Basics

This is the fifth article in a series explaining Elasticsearch as simply as possible with practical examples. The articles are condensed versions of the topics from my upcoming book Just Elasticsearch. My aim is to create a simple, straight-to-the-point, example-driven book that one can read over a weekend to get started with Elasticsearch.

All the code snippets, examples, and datasets related to this series of articles are available on my GitHub Wiki.

Introduction to Search

So far we’ve looked at priming Elasticsearch with data. It is time to start working on the core part of Elasticsearch: the search.

Elasticsearch has a rich set of search APIs to support a multitude of search features. We will dedicate this and the next couple of articles to getting the basics right and working with full-text and term-level search capabilities. In this article, we will go over the fundamentals of search, which will help us work with the full-text and term-level queries in the following articles.

Full Text and Structured Search

There are two variants of search — full-text and structured search. Any modern search engine including Elasticsearch caters to both categories of searches. Elasticsearch is big on full-text and comes with lots of bells and whistles in the full-text search department.

Structured Search

Structured search queries return results only for exact matches. A structured search returns results without worrying about how relevant they are to the question asked. That is, the search is only concerned with finding the documents that match the search criteria, not with how well they match. This is the type of query we usually employ in relational database systems. The results are binary in nature: a document is returned if it meets the condition and left out if it does not.

There are no scores attached to these results, so they are not sorted by relevancy. And because the results are not scored, the server can cache them, gaining a performance benefit should the same query be rerun.

Full-text Search

On the other hand, full-text (unstructured) queries try to find results that are relevant to the query. Most search engines employ a relevancy algorithm to provide results close to what a user is searching for. The returned results are listed according to their relevancy score, with the highest at the top of the list. This is where search products like Elasticsearch excel: finding relevant results using full-text queries.

Elasticsearch employs a similarity algorithm to generate a relevance score for full-text queries. The score is a positive floating-point number attached to each result, and the document with the highest score is the most relevant to the query criteria. Elasticsearch applies different relevance scoring algorithms for various queries and also allows us to customize the scoring algorithm.

There’s another fundamental concept that we need to learn — the execution context. Elasticsearch internally uses an execution context when running the searches — a filter context or query context, the topic of the next section.

Filter or Query Context

The structured or unstructured search is executed in an execution context by Elasticsearch — filter or query context respectively.

A structured search will result in a binary yes/no answer, hence Elasticsearch uses a filter context for this. There are no relevance scores expected for these results, hence filter context is the appropriate one.

Queries on full-text search fields, on the other hand, run in a query context, as each matched document must have a score associated with it.
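As a small preview, both contexts can appear in a single request. The sketch below builds a hypothetical bool query as a Python dictionary: the must clause runs in query context and is scored, while the filter clause runs in filter context and only answers yes or no. The field names are taken from the books sample data used later in this article; the combination itself is an illustration, not a query from the series.

```python
import json

# Hypothetical request body: field names come from the books sample data.
search_body = {
    "query": {
        "bool": {
            # Query context: the match clause is scored for relevance.
            "must": [{"match": {"title": "Java"}}],
            # Filter context: a yes/no check that does not affect the score.
            "filter": [{"term": {"edition": 3}}],
        }
    }
}

# The JSON printed here is what would be sent as the body of books/_search.
print(json.dumps(search_body, indent=2))
```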

We will look at some examples to demonstrate these contexts in action in the upcoming chapters, but in the meantime let’s find out how we can access the Elasticsearch search APIs and endpoints.

Search APIs

Elasticsearch exposes the Search API via its _search endpoint. There are two ways of accessing the search endpoint:

  • URI Search Request: In this method, we pass in the search query parameters alongside the endpoint as params to the query.

For example:

GET books/_search?q=author:Bloch
  • Query DSL: Elasticsearch has implemented a domain-specific language (DSL) for search. The criteria are passed in as a JSON object as the payload when using Query DSL.

For example, fetching the books authored by Joshua Bloch:

GET books/_search
{
"query": {
"match": {
"author": "Joshua"
}
}
}

Note that the endpoint for accessing search is the same in either case:

GET <index_name>/_search

You can use the POST method too, in addition to the GET method.

When searching across multiple indices, use comma-separated index names, like

GET <index1>,<index2>,<index3>/_search 

Wildcards can be used too:

GET <index1>,<ind*>/_search
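When calling Elasticsearch from code rather than Kibana, the path can be assembled with a small helper. This is only a sketch: the function name search_endpoint and the mov* pattern are made up for illustration.

```python
def search_endpoint(*indices):
    """Build the _search path for one or more index names or wildcard patterns."""
    if not indices:
        # With no index named, the search runs across all indices.
        return "/_search"
    return "/" + ",".join(indices) + "/_search"


print(search_endpoint("books"))          # /books/_search
print(search_endpoint("books", "mov*"))  # /books,mov*/_search
```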

URI Search

The URI Search method is an easy way to run simple queries, but it is not suitable for complex ones. We pass the query parameters attached to the URL, in the format shown below:

POST <index>/_search?q=<name:value> AND|OR <name:value>

Sample Data

Before we start searching for data using the URI Search method, we surely need our Elasticsearch to be primed with some sample data.

We will be using the books dataset for this purpose, so head over to my GitHub and copy the books dataset and execute the command in Kibana as shown below:

POST _bulk
{"index":{"_index":"books","_id":"1"}}
{"title": "Core Java Volume I - Fundamentals","author": "Cay S. Horstmann","edition": 11, "synopsis": "Java reference …","amazon_rating": 4.6,"release_date": "2018-08-27","tags": ["Programming Languages, Java Programming"]}
{"index":{"_index":"books","_id":"2"}}
{"title": "Effective Java","author": "Joshua Bloch", "edition": 3,"synopsis": "A must-have book …", "amazon_rating": 4.7, "release_date": "2017-12-27", "tags": ["Object Oriented Software Design"]}

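We are using the bulk API to index these books into our new books index, as the sample above demonstrates. Keep in mind that the bulk format is newline-delimited: each action line and each document must sit on its own line.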
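If you generate the bulk payload from code, the newline-delimited layout is easy to get wrong. The sketch below shows one way to produce it with the standard library; to_bulk_body is a hypothetical helper, and the two documents are trimmed versions of the sample books.

```python
import json


def to_bulk_body(index_name, docs):
    """Turn {id: document} pairs into the newline-delimited body the _bulk API expects."""
    lines = []
    for doc_id, doc in docs.items():
        # Action line first, then the document source on the next line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


books = {
    "1": {"title": "Core Java Volume I - Fundamentals", "author": "Cay S. Horstmann", "edition": 11},
    "2": {"title": "Effective Java", "author": "Joshua Bloch", "edition": 3},
}
bulk_body = to_bulk_body("books", books)
print(bulk_body)
```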

URI Search Requests

Let’s go with a simple example — search all the books written by Joshua. We can write the query as shown below:

POST books/_search?q=author:Joshua

The URL is composed of the endpoint followed by the query, represented by the letter q.

This query returns the books written by Joshua:

"hits" : [
{
...
"_source" : {
"title" : "Effective Java",
"author" : "Joshua Bloch",
...
}
},
{
...
"_source" : {
"title" : "Java Concurrency in Practice",
"author" : "Brian Goetz with Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and Doug Lea"
}
}
]

If you want to search for more than one author, add the additional values separated by spaces. For example, the following query searches for all the titles written by Joshua, Herbert, or Brian:

POST books/_search?q=author:Joshua Herbert Brian

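The query uses the OR operator by default: it finds titles written by Joshua OR Herbert OR Brian, so you may get a good number of results. (Strictly speaking, only the first value is bound to the author field here, and the remaining values are matched against the default search fields; write author:(Joshua Herbert Brian) to scope all of them to author.) However, if you wish to find titles written by all three together, insert AND between the values: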

POST books/_search?q=author:Joshua AND Herbert AND Goetz

This will not return any results, as there is no book co-authored by all three, yet!

Instead of repeating the AND operator as above, you can simply append the default_operator param to the query:

POST books/_search?q=author:Joshua Herbert Goetz&default_operator=AND

If we need to add another attribute in addition to the author, we follow the same principle. The query below searches for Joshua’s first-edition books tagged as tech:

POST books/_search?q=author:Joshua edition:1 tags:tech

The search query returns a result even though there is no book tagged with tech. The reason, as mentioned earlier, is the default OR operator: a document matching any one of the conditions is returned. Append default_operator=AND if you wish to change this boolean condition:

POST books/_search?q=author:Joshua edition:1 tags:tech&default_operator=AND

This query will return no results, as one of the three conditions (tags:tech) is not met.

Finally, you can pass a few more parameters in the URL, for example from and size for pagination and sort for sorting by edition. You can also ask Elasticsearch to explain how it calculated the score by using the explain=true param. The following query combines all these parameters:

POST books/_search?q=author:Joshua edition:1 tags:tech&from=3&size=10&explain=true&sort=edition
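Kibana lets us type spaces and colons straight into the URL, but most HTTP clients expect them to be encoded. Here is a minimal sketch of building the same kind of path with Python's standard library; uri_search_path is a made-up helper name, and the parameter values are only examples.

```python
from urllib.parse import quote, urlencode


def uri_search_path(index, query, params=None):
    """Build a URL-encoded URI search path for the given index and query."""
    all_params = {"q": query, **(params or {})}
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"/{index}/_search?{urlencode(all_params, quote_via=quote)}"


path = uri_search_path(
    "books",
    "author:Joshua edition:1 tags:tech",
    {"default_operator": "AND", "from": 0, "size": 10, "sort": "edition"},
)
print(path)
```

The extra parameters go in a dictionary because from is a reserved word in Python and cannot be used as a keyword argument.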

While the URL method is simple and easy to code up, it becomes error-prone as the complexity of the query criteria grows. We can use it for quick testing but relying on it for complicated and in-depth queries might be asking for trouble.

We can alleviate the issue to some extent by using Query DSL’s query_string query (we will touch on Query DSL in the next section):

POST books/_search
{
"query": {
"query_string": {
"default_field": "author",
"query": "Joshua Goetz",
"default_operator": "AND"
}
}
}

Query DSL

Elasticsearch developed a purpose-built search language and syntax called Query DSL (domain-specific language) for querying data. Query DSL is a sophisticated, powerful, and expressive language for creating a multitude of queries, ranging from simple and basic to complex and nested ones. It can also be extended to analytical queries. It is a JSON-based query language that lets us construct deeply nested queries for both search and analytics. A search request takes the following general form:

GET books/_search
{ //your request body goes here
"query": {
"match": {}
}
}

Or for aggregations, the query DSL looks like this:

GET books/_search
{
"aggs": {
"average_rating": {
"avg": {
"field": "amazon_rating"
}
}
}
}
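Outside Kibana, a Query DSL request is just an HTTP call with a JSON payload. The sketch below prepares such a request with Python's standard library without sending it. The address http://localhost:9200 assumes a local cluster with security disabled, and build_search_request is a hypothetical helper; the body combines the two shapes shown above.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: local cluster, security disabled


def build_search_request(index, body):
    """Prepare (but do not send) a search request carrying a Query DSL body."""
    return Request(
        url=f"{ES_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        # POST, because many HTTP clients cannot send a body with GET.
        method="POST",
    )


body = {
    "query": {"match": {"author": "Joshua"}},
    "aggs": {"average_rating": {"avg": {"field": "amazon_rating"}}},
}
request = build_search_request("books", body)
print(request.get_method(), request.full_url)
# To run it against a live cluster: urllib.request.urlopen(request)
```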

Full-text and Term Level Queries

While working with search queries, the full-text and term-level queries will always pop up. Let’s understand the basics of these queries and where to use them and the differences between them.

To understand these queries, we need to go to the beginnings of data indexing.

We have learned that during the indexing of documents, the data is analysed before being persisted to the data store. The choice of analysis depends on the data type of the attribute: for example, text fields get analysed, while keyword-typed data is persisted without any text analysers being applied to it.

The same approach applies on the search side as well: when a text field is searched, the search criteria are analysed using exactly the same analyser that was used for indexing that field. For example, if a text field of a document was analysed using the standard analyser, searching on that field will also use the standard analyser.

The non-text fields, when searched, are matched against the data stored in the inverted index without any analysers being applied to the search criteria.

So, the takeaway is that text fields are analysed and stored in the inverted index in multiple forms, while the keyword family of types is stored in the inverted index as-is.
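To make the difference concrete, here is a toy sketch in Python. The analyze_text function is only a rough stand-in for the standard analyser (it splits on non-alphanumeric characters and lowercases), not the real implementation, and both function names are made up.

```python
import re


def analyze_text(value):
    """Rough stand-in for the standard analyser: split into words and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", value)]


def analyze_keyword(value):
    """Keyword fields are not analysed: the whole value is a single term."""
    return [value]


author = "Joshua Bloch"
text_terms = analyze_text(author)        # what a text field would index
keyword_terms = analyze_keyword(author)  # what a keyword field would index

print(text_terms)     # ['joshua', 'bloch']
print(keyword_terms)  # ['Joshua Bloch']

# A full-text query analyses the search string too, so "JOSHUA" still matches.
print(analyze_text("JOSHUA")[0] in text_terms)  # True
# A term-level lookup uses the search string as-is, so it needs the exact value.
print("Joshua" in keyword_terms)                # False
```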

Elasticsearch provides a handful of full-text queries, such as match (and its variants), intervals, common_terms, query_string, simple_query_string, and others, plus a dozen or so term-level queries such as term, terms, prefix, range, ids, and others.

The next two articles are dedicated to full-text and term-level queries, but first let’s understand what relevancy is all about.

Relevancy

When you search on Google, you most likely pick a result from the top 5. The reason is that Google analysed your query and found the results best suited to your search criteria; in other words, the most relevant results. These results are ranked from high to low, so the most relevant results, with the highest scores, appear at the top.

The relevancy score is a positive floating-point number that determines the ranking of the search results. Modern search engines do not just return results that match your query’s criteria; they also analyse them and return the most relevant ones first. Relevancy is how well the returned documents match the search criteria.

If you are searching for “Java” in the title of a book, a document containing more than one occurrence of the word “Java” in the title is more relevant than documents whose title has a single occurrence, or none at all, with the word appearing only in other fields like description or review.

Usually, there are three variables that make up the relevance on a high level:

  • Term frequency: The term frequency (TF) is a measure of how frequently the search term appears in the document.
  • Inverse document frequency: The inverse document frequency (IDF) is a measure of how rare the search term is across all documents; the more documents contain it, the lower the IDF.
  • Field length norm: The field length norm states that a search term found in a shorter field is more relevant than one found in a longer field.

Elasticsearch uses the Okapi BM25 relevancy algorithm for scoring the returned results, so the client can expect relevant results.

Term Frequency (TF) is a measure of how frequently the word appears in the document. The more often the word appears, the higher the TF and hence the relevancy. For example, if a scientist is searching for “SARS-CoV-2 (Coronavirus)” in the content of a stash of medical research documents, a document in which the search word appears many times in the body is considered more relevant than one in which it appears only in the comments, summary, or other parts of the document.

The Inverse Document Frequency (IDF), on the other hand, is based on the number of documents in the whole index that contain the word. The more common the word is across all the documents, the lower its contribution to relevance (hence inverse document frequency).

Going with the same example, if “SARS-CoV-2 (Coronavirus)” is found in most of the documents in the index, the term does little to distinguish one document from another, so each match counts for less. What about stop words, words such as “the”, “a”, “an”, and “and”, which appear frequently in a document as well as across all documents? Because they are so common, their IDF is very low and they contribute little to the relevancy calculation. Setups based on classic TF/IDF often dropped stop words altogether during analysis, whereas BM25 copes better with keeping them.

In addition to TF/IDF, the field-length norm is another feature taken into consideration when calculating the score. The idea is simple: a word appearing in a shorter field is more relevant than a word appearing in a longer field.

For example, if “Java” appears twice in a 10-word title as opposed to 10 times in a 100-word synopsis, the former (the document with the match in the title field) is considered more relevant.

Best Matching 25 (BM25) is the default scoring algorithm in Elasticsearch. It builds on TF/IDF and the field-length norm and adds a couple of refinements, such as better handling of stop words.
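To see how the three ingredients interact, here is a sketch of the textbook Okapi BM25 formula for a single term in Python. It is an illustration only: the bm25_score function and all the input numbers are made up, Elasticsearch's actual implementation may differ in constant factors, and the real engine gathers its statistics per field. The values k1=1.2 and b=0.75 are the usual defaults.

```python
import math


def bm25_score(term_freq, field_len, avg_field_len, total_docs, docs_with_term,
               k1=1.2, b=0.75):
    """Score one term for one document with the textbook Okapi BM25 formula."""
    # IDF: rarer terms (small docs_with_term) get a larger weight.
    idf = math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Field-length norm: fields longer than average are penalised.
    length_norm = 1 - b + b * (field_len / avg_field_len)
    # TF with saturation: each extra occurrence helps less than the one before.
    tf = (term_freq * (k1 + 1)) / (term_freq + k1 * length_norm)
    return idf * tf


short_field = bm25_score(term_freq=2, field_len=10, avg_field_len=50,
                         total_docs=1000, docs_with_term=50)
long_field = bm25_score(term_freq=2, field_len=100, avg_field_len=50,
                        total_docs=1000, docs_with_term=50)
common_term = bm25_score(term_freq=2, field_len=10, avg_field_len=50,
                         total_docs=1000, docs_with_term=900)

print(short_field > long_field)   # True: same matches, shorter field scores higher
print(short_field > common_term)  # True: the rarer term scores higher
```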

Common Features

A few features are common across the Search APIs, such as highlighting search results, pagination, sorting, etc. Rather than introducing them for every query, I’ll explain some of them here so that when you are working with the actual queries, you know how to use these features.

Highlighting Search Results

Elasticsearch provides the capability of highlighting search results.

In Query DSL, we can add a highlight object at the same level as the top-level query object, as demonstrated below, to receive our results with highlights:

GET books/_search
{
"query": {..},
"highlight": {
"fields": {
"title": {}
}
}
}

The highlight object expects a fields block, in which we list the fields we want highlighted (title in the above query).

The response will now consist of the highlighted fields enclosed in <em> tags (short for emphasis):

"hits" : [
{
...,
"highlight" : {
"title" : [
"<em>Java</em>: A Beginner’s <em>Guide</em>"]
}
},
{
...,
"highlight" : {
"title" : [
"<em>Java</em> - The <em>Complete</em> Reference"]
}
}
]

The words that matched the criteria are tagged with <em>, as you can see in the above response (for example: <em>Java</em>).
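On the client side, the highlighted fragments can be pulled out of the response with a few lines of code. The sketch below works on a trimmed-down response dictionary shaped like the one above; extract_highlights and strip_tags are hypothetical helper names.

```python
import re


def extract_highlights(response, field):
    """Collect the highlighted fragments for one field from a search response."""
    fragments = []
    for hit in response.get("hits", {}).get("hits", []):
        fragments.extend(hit.get("highlight", {}).get(field, []))
    return fragments


def strip_tags(fragment):
    """Remove the <em> markers to recover the plain text."""
    return re.sub(r"</?em>", "", fragment)


# Trimmed-down response containing only the parts used here.
sample_response = {
    "hits": {"hits": [
        {"highlight": {"title": ["<em>Java</em> - The <em>Complete</em> Reference"]}},
    ]}
}

for fragment in extract_highlights(sample_response, "title"):
    print(strip_tags(fragment))  # Java - The Complete Reference
```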

Pagination

We can build pagination in Elasticsearch by using a couple of query parameters: size and from. The size parameter sets how many results are fetched in one batch. By default the size is 10, so you’d expect to get the top 10 documents for each query, if available.

The from parameter, on the other hand, is the number of documents to skip from the top of the result list before the batch starts. The following query fetches the data in batches of 100, starting from the third batch (that is, skipping the first 200 documents):

POST prices/_search
{
"size": 100,
"from": 200,
"query": {
"term": {
"company": "ELASTIC"
}
}
}
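Since from counts documents rather than pages, user interfaces that think in page numbers need a small conversion. A minimal sketch, with page_to_offsets as a made-up helper name:

```python
def page_to_offsets(page, page_size=10):
    """Translate a 1-based page number into the from/size pair for a search request."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # from is the number of documents to skip, not a page number.
    return {"from": (page - 1) * page_size, "size": page_size}


print(page_to_offsets(1))       # {'from': 0, 'size': 10}
print(page_to_offsets(3, 100))  # {'from': 200, 'size': 100}
```

Keep in mind that, by default, Elasticsearch rejects requests where from plus size goes beyond 10,000, so this approach suits the first pages of a result set rather than deep scrolling.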

Let’s wrap it up as this article is becoming stretched. Now that we have an understanding of the search basics, let’s work on term level queries, the topic of the next article!

All the code snippets and datasets are available at my GitHub Wiki