This is the second article in a series explaining Elasticsearch as simply as possible with practical examples. The articles are condensed versions of topics from my upcoming book Just Elasticsearch. My aim is to create a simple, straight-to-the-point, example-driven book that one can read over a weekend to get started on Elasticsearch.
All the code snippets, examples, and datasets related to this series of articles are available on my GitHub Wiki.
Overview
Elasticsearch is a server-side distributed application. When you launch Elasticsearch on your machine, it boots up a single-node instance, ready to serve clients. When an additional node is started, it joins the first to form a cluster.
The diagram below depicts a common architecture of Elasticsearch in an enterprise. An organization would ideally build a Search Service that exposes a subset of the Elasticsearch server's endpoints to end clients. The Search Service interacts with Elasticsearch as well as with internal applications.
A cluster is a group of one or more nodes. A node is a running instance of Elasticsearch, hosting one or more indices.
An index is a logical collection of documents. Every index is made up of one or more shards (and replicas). Any number of nodes can join a cluster as long as they are configured with the same cluster name, set via the cluster.name property in the node configuration.
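For example, a minimal sketch of the relevant settings in a node's elasticsearch.yml (the cluster and node names here are hypothetical); any node started with the same cluster.name joins the same cluster:
cluster.name: movies-cluster
node.name: node-1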
The basic unit of data is a document, represented in JSON format. Clients send documents to Elasticsearch for indexing; the documents eventually reach the shards, where the data is stored in segments before being flushed to the file system.
A shard is an instance of Apache Lucene. A complex process of reading and writing happens on these shards, handled by Lucene under the hood. Each document is housed in exactly one primary shard, chosen by a routing algorithm. The same algorithm is used to retrieve documents by their identifiers.
When it comes to search, a search request can land on any node in the cluster, as all nodes can serve client requests. The node that receives the request takes on the coordinator role: it sends the query out to all the shards involved and returns the collated results.
Elasticsearch builds a data structure called an inverted index for each full-text field during the indexing phase. At a high level, the inverted index is a dictionary-like data structure that maps words to the documents in which they appear.
This inverted index is the key to fast retrieval of documents during full-text search; it is the backbone of full-text queries. For numeric and geo-shape fields, BKD trees are used as the underlying data structure instead.
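As a simplified illustration with two hypothetical documents, if document 1 has the title "Johnny English" and document 2 has the title "Johnny Bravo", the inverted index for the title field would look roughly like this:
johnny  -> 1, 2
english -> 1
bravo   -> 2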
Before we delve into the moving parts, let’s understand the components that make up Elasticsearch.
Document
A document is a unit of information passed to Elasticsearch for storage. The process of inserting a document is called indexing. During indexing, Elasticsearch analyzes the document’s individual fields for faster search and analytics.
For example, the following JSON document represents a movie:
{
"title":"Johnny English",
"synopsis":"",
"release_date":"2020-05-01",
"rating":"PG"
}
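A minimal sketch of indexing this document over the REST API (the index name movies and the document ID 1 are assumptions):
PUT movies/_doc/1
{
"title":"Johnny English",
"synopsis":"",
"release_date":"2020-05-01",
"rating":"PG"
}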
If you are familiar with relational databases, the above document is equivalent to a row in a table. In a relational database, the schema and tables must be created upfront, and we must alter the table structure if we want to change it. Elasticsearch, like other NoSQL databases, is just the opposite: it lets you insert documents without a predefined schema. That said, Elasticsearch is not entirely constraint-free; individual fields still need to be mapped to the correct data types. We will revisit this concept when we discuss Mappings in the coming articles.
Documents are inserted into a logical collection called an index, discussed in the next section.
Index
An index is a collection of documents. Just as we keep our paperwork in a file cabinet, Elasticsearch keeps its documents in an index. It is a logical collection of documents, backed by shards. We use REST APIs to work with indices: creating and deleting them, changing their settings, closing and opening them, reindexing data, and other operations; a few of these are sketched below.
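A quick illustration of some of these index APIs (the index name movies is an assumption):
PUT movies
POST movies/_close
POST movies/_open
DELETE movies
These requests create, close, reopen, and finally delete the movies index, respectively.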
An index has properties such as mappings, settings, and aliases. Mappings define the schema of the documents' fields, while settings configure things like the number of shards and replicas. Aliases are alternate names given to a single index or a set of indices. Some settings, such as the number of replicas, can be changed dynamically on a running index. However, the number of shards cannot be changed once the index is open and live.
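A minimal sketch, assuming an index named movies: the shard and replica counts are set at creation time, and the replica count can later be changed on the live index:
PUT movies
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

PUT movies/_settings
{
  "number_of_replicas": 2
}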
Shards
Shards are the smallest data-holding units in Elasticsearch. An index is made up of one or more shards, each of which is essentially a Lucene index. A Lucene index is split into immutable segments. Shards are allocated during the index creation process. Replicas, as the name suggests, are copies of shards. They provide redundancy, and they can also serve read requests, distributing the load during peak times.
Nodes
Earlier we saw that a node is simply an instance of the Elasticsearch server. When you start the application, Elasticsearch starts a single node by default. We can start any number of nodes; if you are running several on the same machine for development purposes, make sure each node is started with its own path.data and path.logs settings, as sketched below.
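As a rough sketch, two nodes on the same development machine could be started with configurations like the following (the names and paths are hypothetical), so that they do not overwrite each other's data and log directories:
# node 1
node.name: node-1
path.data: /var/lib/elasticsearch/node-1
path.logs: /var/log/elasticsearch/node-1

# node 2
node.name: node-2
path.data: /var/lib/elasticsearch/node-2
path.logs: /var/log/elasticsearch/node-2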
A node has multiple roles to play:
- Master
- Data
- Ingest
- Machine-learning
- Transform
- Coordination
A master node is involved in high-level operations such as creating and deleting indices, tracking nodes, and other administrative work related to managing the cluster. Master nodes are usually lightweight processes and do not participate in CRUD operations on documents. A cluster has at least one master node. Although the master node does not work on documents directly, it still knows where they are located; the actual indexing and search operations on documents are delegated to data nodes.
A data node is where the actual indexing, searching, deleting, and other document-related operations happen. Once an indexing request is received, the data node jumps into action and saves the document to its shard by invoking a writer on the Lucene segment.
An ingest node handles ingest operations: documents that arrive via an ingest pipeline are transformed and enriched before indexing, for example by extracting content from Word or PDF documents.
A machine-learning node is predominantly used for running machine-learning jobs such as anomaly detection. This is a commercial feature, so an X-Pack license is required to enable machine-learning capabilities. A transform node is the latest addition to the stack, used for creating aggregated summaries of data.
While the roles above are explicitly assigned to a node, coordination is the default role held by every node. The coordinating node is the main point of contact from a client's perspective: any incoming request lands on one. It acts as a work manager, distributing incoming requests to the appropriate nodes. For a search query, it knows which nodes hold the relevant shards and routes the request to them; it then collects the results from those nodes, collates them, and sends the response back to the client.
By default, the master, data, and ingest roles are enabled on every node. Coordination has no special flag to enable or disable, as it is the default role. To enable or disable a role, set the corresponding flag in the elasticsearch.yml config file to true or false:
node.master: true
node.data: true
node.ingest: false
Cluster
A cluster can be as small as a single node running on a laptop or as large as a multi-node deployment with hundreds of nodes. You can scale Elasticsearch either horizontally or vertically. Simply adding more nodes with the appropriate network settings makes them join the existing cluster and become ready to serve.
Horizontal scaling has the advantage of being very easy to set up and run. Alternatively, you can buy bigger machines and run multiple nodes on each of them, scaling the cluster vertically.
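Once the nodes are up, a quick way to check that they have formed a cluster is via the cluster APIs, for example:
GET _cluster/health
GET _cat/nodes?v
The first returns the overall cluster status (green, yellow, or red), and the second lists the individual nodes along with their roles.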
How Elasticsearch Works
Understanding the internals of how indexing and search work will cement our grip on Elasticsearch. Every full-text field of a document is analyzed and stored by the engine during indexing.
Analysis plays a big part in Elasticsearch and its performance. During this analysis phase, each document's text is tokenized into individual words, which are stored in the inverted index against the corresponding document numbers.
The documents undergo a process in which they are not only tokenized but also normalized: reduced to root words, expanded with synonyms, and so on.
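You can see analysis in action with the _analyze API; a minimal sketch using the built-in english analyzer (the sample text is made up):
POST _analyze
{
  "analyzer": "english",
  "text": "Johnny English is searching"
}
The response contains lowercased, stemmed tokens (roughly johnni, english, search), with the stop word "is" removed.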
Once documents are indexed, they typically become searchable within one second. Elasticsearch is a near-real-time search engine: newly indexed documents are made visible to search by a refresh operation, which runs every second by default.
The search queries undergo the same analysis process as indexing.
That is, if a field was configured with an English analyzer, the same analyzer is used during the search phase. This ensures that search terms are matched against the same token forms that were written into the inverted index at indexing time.
Full-text search results are assigned a score called the relevancy score.
Relevancy is a positive floating-point number that determines the ranking of the search results. Elasticsearch uses the BM25 algorithm to score the returned results, so the client can expect the most relevant documents first. We will look at relevancy algorithms in detail in the upcoming articles.
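As a sketch, here is a full-text query against a hypothetical movies index; each hit in the response carries a _score field that drives the ranking:
GET movies/_search
{
  "query": {
    "match": { "title": "johnny english" }
  }
}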
Routing Algorithm
Let’s quickly discuss how Elasticsearch uses a routing algorithm to distribute documents to the underlying shards when indexing. Each document is indexed into one, and only one, primary shard. Routing is the process of allocating a home for a document on a particular shard. Retrieving the document later is just as easy, because the same routing function is used to find the shard that holds it.
The routing algorithm is a simple formula where Elasticsearch deduces the shard for a document during indexing or searching:
shard = hash(id) % number_of_shards
The hash function expects a unique value, generally the document ID, though a custom routing value can be provided by the user. Because the hash distributes documents evenly across the shards, no single shard ends up overloaded.
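As a worked example with made-up numbers: with 3 primary shards and a document whose ID hashes to 14, the document lands on shard 14 % 3 = 2, and the same calculation locates it again at read time. This is also why the number of shards cannot be changed on a live index: a different shard count would change the result of the modulo, and existing documents would no longer be found on their original shards.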
Summary
In this article, we looked at the high-level architecture of Elasticsearch. We also learned at a high level how search works, discovered the inverted index data structure, and glanced over the routing algorithm. We briefly went over the moving parts of the server, such as clusters, nodes, indices, and documents.
All the code snippets and datasets are available on my GitHub Wiki.
In the next article, we will go over the indexing operations. Stay tuned!