Just Elasticsearch: 3/n Basics of Indexing

This is the third article in a series explaining Elasticsearch as simply as possible, with practical examples. The articles are condensed versions of topics from my upcoming book, Just Elasticsearch. My aim is to create a simple, straight-to-the-point, example-driven book that one can read over a weekend to get started with Elasticsearch.

Previous article:

Just Elasticsearch: 2/n Architecture

In this article we will learn how to create, read, update, and delete an index (or indices), that is, the CRUD operations. We will also learn how to manage indices effectively and alter their settings.

Let’s jump right into it.

An index is a logical collection of our data, represented as documents. Documents of a similar shape (for example, employees, orders, login audit data, news stories by region, and so on) are held in their own index.

Each index is distributed across shards and replicas. A newly created index is associated with a set number of shards and replicas.

If you come from a relational database background, an index may sound similar to a table schema. The data that you persist in the table in the form of rows is similar to our documents.

Creating indices is a relatively straightforward job. If you are playing with Elasticsearch on a personal laptop or working in an organization’s development environment, you probably have the luxury of creating indices on the fly, dynamically. However, predefining them with shard/replica settings and mappings is a best practice.

Elasticsearch provides a set of Index APIs to manage indices: we can create, delete and reindex them using these APIs. These APIs are accessible using REST over HTTP, just like any other API across the Elastic ecosystem.

Every index is associated with three sets of configurations: settings, mappings, and aliases. 

Let me go over what these are:

We use the settings configuration to set the number of shards and replicas, among other properties required for the index. The shards and replicas allow scaling and high availability of the data.

Aliases are alternate names given to an index or a set of indices. Aliases make querying across multiple indices easy and allow reindexing data with zero downtime.

The mappings define the schema of our data: the data type of each and every field that is stored and searched.
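As a rough illustration, each of the three configurations can be fetched on its own. The sketch below assumes an index named cars already exists; the name is only a placeholder.

```
// Settings (shards, replicas and other properties)
GET cars/_settings

// Mappings (the schema: fields and their data types)
GET cars/_mapping

// Aliases (alternate names pointing at this index)
GET cars/_alias
```

A plain GET cars returns all three sections in a single response.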

Creating an Index

An index can be created in two ways:

  • Implicitly, when indexing a document using the document APIs: the index is created automatically (that is, dynamically) without us explicitly asking the server to do so. We already had a taste of this route when we issued a command for indexing a document: we added a new student earlier and the server created the students index dynamically. The server applies default settings and configurations to the index, with the number of shards and replicas each set to 1 by default.
  • Explicitly, using the Index APIs: here we create indices beforehand with all the required settings and configurations. The server will configure them as per our instructions. The default settings might be OK for development purposes, but I think you should create custom settings when running the application in production rather than depending on defaults.
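The two routes might look like this in practice. This is a minimal sketch: the document fields, the make field, the vehicles alias and the shard and replica counts are all illustrative values, not recommendations.

```
// Implicit: indexing a document creates the students index if it does not exist yet
PUT students/_doc/1
{
  "name": "John Doe",
  "age": 21
}

// Explicit: create the cars index up front with our own configuration
PUT cars
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "make": { "type": "keyword" }
    }
  },
  "aliases": {
    "vehicles": {}
  }
}
```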

Index Settings

Every index can be instantiated with some properties – whether default or custom ones – called settings.

Index settings come in two variants:

  • Dynamic Settings: These are the settings that can be modified on a live index. For example, properties like the number of replicas, allowing/disallowing writes, refresh intervals, etc. can be changed while the index is in operation. We use the _settings API to update properties on the live index.
  • Static Settings: The static settings can only be applied during the process of index creation, like the number of shards, codec, and a couple of others. None of these settings can be changed while the index is open. Of course, if you wish to change a static setting of a live index, you can close the index and re-apply the setting, or re-create the index. The number of shards is the exception: it is fixed at creation time and cannot be changed even on a closed index, so changing it means re-creating the index.
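The difference between the two variants can be sketched as follows, assuming an existing index named cars. The values chosen here (2 replicas, a 30s refresh interval, the best_compression codec) are only examples.

```
// Dynamic settings: can be changed while the index is live
PUT cars/_settings
{
  "index": {
    "number_of_replicas": 2,
    "refresh_interval": "30s"
  }
}

// Static setting: close the index, apply the change, then reopen it
POST cars/_close

PUT cars/_settings
{
  "index": {
    "codec": "best_compression"
  }
}

POST cars/_open
```

While the index is closed it cannot serve reads or writes, so this route suits maintenance windows rather than everyday tuning.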

Settings can also be applied globally, at the cluster level. When you have dozens or hundreds of indices (which a production system is usually expected to have), tweaking each and every one of them with appropriate settings is a humongous task. You may want a switch-all button for setting properties across all the indices in your estate. Here’s where global settings come into play. Elasticsearch groups these global settings under the Indices Modules settings. These are cluster-level settings such as index recovery, field data cache, index buffer, node query cache and others.
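As one hedged example of such a global setting, the index recovery throttle can be changed for the whole cluster through the cluster settings API. The 50mb value below is purely illustrative.

```
// Applies cluster-wide, not to any single index
PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}
```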

Index Templates

Copying the same settings across various indices, especially one by one, is a tedious job. Ideally, a predefined settings schema would exist, so that any new index is implicitly moulded from it. Every new index would then follow the same settings and hence be homogeneous across the organization. DevOps also wouldn’t need to advocate the optimal settings to individual teams over and over. One use case might be to create a set of templates based on environments: say, indices in a dev environment should have 3 shards and 2 replicas, while PROD must have 5 shards and 5 replicas.

This is where index templates come into the picture. We can create a template with predefined index patterns, so that when a new index is created and its name matches a pattern, the template is applied. The patterns are glob patterns, so they can use wildcards as prefixes, suffixes and so on. We can create a set of templates, each with its own index patterns and predefined settings.

[NOTE]

A glob (short for global command) pattern is a common wildcard pattern used in computer software. For example, we regularly use globs in our programs to search for all files ending with txt, java or log: *.txt, *.java, *.log, etc.

We use the _template endpoint to create such templates. Let’s create a template for index names that match a cars pattern:

PUT /_template/dev_template
{
  "index_patterns":["*_cars","cars*"],
  "settings":{
    "number_of_shards":5,
    "number_of_replicas":2,
    "blocks":{
      "read_only_allow_delete":true
    }
  }
}

The above command creates an index template with the settings provided in the body of the request. The index_patterns property takes a list of glob patterns, which can use wildcards as prefixes or suffixes; here the list holds two patterns, *_cars and cars*. Note that the read_only_allow_delete block in this template makes any matching index read-only (deleting it is still allowed) until the block is removed.

Now that we have a template in place, any new index whose name matches one of the index_patterns defined in the template will pick up the templated settings. For example, old_cars, family_cars, carsdealers and carsnew all match.

// Creating old_cars - matches the first pattern: *_cars
PUT old_cars

// Creating carsdealers - matches the second pattern: cars*
PUT carsdealers

To check whether Elasticsearch honoured our template while creating these indices, simply fetch the index with GET old_cars and check that number_of_shards is 5.
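One way to narrow the check down is to ask only for the settings in question. The sketch below relies on the get-settings endpoint accepting a setting name with a wildcard, and uses the old_cars index created above.

```
// Returns only index.number_of_shards and index.number_of_replicas
GET old_cars/_settings/index.number_of_*
```

If the template was applied, the response should report the shard and replica counts from dev_template rather than the defaults.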

Do note that templates do not work retrospectively, that is, any pre-existing indices will not be altered.

Use the GET /_template/dev_template command to fetch the persisted template.
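Coming back to the environment-based use case mentioned earlier, a pair of templates could be sketched along these lines. The template names and the dev_* and prod_* naming convention are assumptions made for this example; the shard and replica counts are the ones quoted above.

```
// Any index whose name starts with dev_ gets 3 shards and 2 replicas
PUT /_template/dev_env_template
{
  "index_patterns": ["dev_*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}

// Any index whose name starts with prod_ gets 5 shards and 5 replicas
PUT /_template/prod_env_template
{
  "index_patterns": ["prod_*"],
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 5
  }
}
```

With these in place, creating dev_orders or prod_orders would pick up the matching settings without anyone passing them explicitly.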

Aliases

Aliases are alternate names given to indices for various purposes such as:

  • Aggregating data from multiple indices (as a single alias) for easy searching
  • Enabling zero downtime during re-indexing

Once we have an alias created, we can use it for indexing, querying and all other purposes as if it were an index.

Aliases are a handy tool during development as well as in production. We can group multiple indices and assign an alias to them, so one can write queries against a single alias rather than a dozen indices!
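Both purposes can be sketched with the _aliases endpoint. All index and alias names below (orders_uk, orders_us, all_orders, employees_v1, employees_v2, employees) are hypothetical and assume those indices already exist.

```
// Group two indices under a single alias
POST _aliases
{
  "actions": [
    { "add": { "index": "orders_uk", "alias": "all_orders" } },
    { "add": { "index": "orders_us", "alias": "all_orders" } }
  ]
}

// Search both indices through the alias
GET all_orders/_search

// Zero-downtime switch: after reindexing employees_v1 into employees_v2,
// move the alias in one atomic request
POST _aliases
{
  "actions": [
    { "remove": { "index": "employees_v1", "alias": "employees" } },
    { "add": { "index": "employees_v2", "alias": "employees" } }
  ]
}
```

Because the actions in a single request are applied together, clients querying employees never see a moment where the alias points at nothing. For writes through an alias that points at more than one index, one of the indices has to be marked with is_write_index.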

Deleting an Index

Deleting an existing index is straightforward: simply issue the DELETE <index_name> command:

// Delete cars index
DELETE cars

// Response
{
  "acknowledged" : true
}