This is part 1/n series on Mapping:

Data is like a rainbow — it comes multi-coloured. Business data comes in various shapes and forms, usually represented as textual information, dates, numbers, inner objects, booleans, geo-locations, IP addresses, and so on.

In Elasticsearch, we model and index data as JSON documents. Each document consists of a number of fields, and every field has a certain type of data. For example, a movie document consists of a title and synopsis represented as textual data, release_date as a date, gross_earnings as floating-point data, and so on.

When we index a document, we need not bother (ideally) about the data type of the fields. That’s because Elasticsearch derives these types implicitly by looking at the field and the type of information in it.

Elasticsearch created a schema for us without us having to do any upfront work unlike in a relational database where it is mandatory to have the table schema defined. This schema-free feature helps developers get up and running with the system from day one. The best practice is to develop a schema upfront rather than letting Elasticsearch define it for us unless your requirements dictate not to have one.

Elasticsearch, however, expects us to provide clues as to how it can treat a field when indexing the data. These clues are either provided by us in the form of a schema definition while creating the index or derived by the engine implicitly if we allow it to do so.

Overview of mapping

Mapping is a process of defining and developing a schema definition representing the document’s data fields and their associated data types.

Mapping tells the engine the shape and form of the data that’s being indexed. Being a document-oriented database, Elasticsearch expects a single mapping definition per index.

Every field is treated as per the mapping rule. For example, a string field is treated as a text field, a number field is stored as an integer, a date field is indexed as a date to allow for date-related operations, and so on. Accurate and error-free mapping allows Elasticsearch to analyze the data faultlessly, aiding in search-related functionalities, sorting, filtering, and aggregation.

Mapping definition

A document consists of a set of fields representing business data, and every field has a specific data type (or multiple datatypes) associated with it. The mapping definition is the schema of the fields of the document and their data types.

Based on the data type, each of these fields is stored and indexed in a specific way. This helps Elasticsearch to support a multitude of search queries such as full-text, fuzzy, term, geo, and so on. Take a look at the example of a student document and the data types associated with it:

A student data represented as JSON document

In programming languages, we represent data with specific data types (strings, dates, numbers, objects, and so on). It is a common practice to let the system know about the types of variables during compilation.

Elasticsearch understands the data types of the fields while indexing the documents and, accordingly, stores the fields into appropriate data structures (for example, an inverted index for text fields or BKD trees for numericals) for data retrieval. The indexed data with precisely formed data types leads to accurate search results as well as helping to sort and aggregate the data.

Indexing a document for the first time

Let’s take an example and find out what happens if we index a document without creating the schema upfront. Say we have a movie document that we want to index, as demonstrated in the listing shown below.

#A movie document
PUT movies/_doc/1
{  
  "title":"Godfather",  
  "rating":4.9,  
  "release_year":"1972/08/01"
}

This would be our first document sent to Elasticsearch to get indexed. Remember, we didn’t create an index ( movies ) or the schema for this document data prior to the document ingestion. Here’s what happens when this document hits the engine:

A new index ( movies ) is created automatically with default settings.
A new schema is created for the movies index with the data types (we learn about data types shortly) deduced from this document’s fields. For example, title is set to a text and keyword types, rating to a float, and release_year to a date type.
The document gets indexed and stored in the Elasticsearch data store.
Subsequent documents get indexed without undergoing the previous steps as Elasticsearch consults the newly created schema for further indexing.

Elasticsearch does a bit of groundwork behind the scenes when performing these steps to create a schema definition with fields names and additional information. We can fetch the schema definition that Elasticsearch created dynamically using a mapping API. The response from invokingGET movies/_mapping command is shown below:

Fetching the mapping schema of the movies index

Each field has a specific data type defined, for example, the ratings field was declared to be a float, release_year a date type, and so on. Note that the individual field can be composed of other fields too — these data types are called multi data types. for example, the title field is not only defined as text but also has a keyword type associated with it.

Elasticsearch uses a feature called dynamic mapping to deduce the data types of the fields when a document is indexed for the first time by looking at the field values and deriving these types. During this process, it managed to derive the type of the name field to be text, based on the string value it has (” John Doe “).

As the field is stamped as a text type, all full-text related queries can be performed on this field. We can issue search words like john, doe, John Doe, and so forth to search for the documents consisting of that name in a variety of ways.

In addition to creating the name field a text type, Elasticsearch also did something extra for us: using the fields object, it created an additional type called keyword type for the name field, thus making the name field a multi-typed field. Multi typed fields can be associated with multiple data types.

In our example, by default, the name field was mapped totext as well as keyword type. The keyword fields are used for exact value searches. These data type fields are untouched and hence they won’t go through an analysis phase. That is, they are neither tokenized nor synonymized nor stemmed.

Fields declared as keyword data types use a special analyzer called noop (the no-operation analyzer), thus untouching the keyword data during the indexing process. This keyword analyzer spits out the entire field as one big token. By default, the keyword fields will not be normalized as the normalizer property on the keyword type is set to null.

While dynamic mapping is intelligent and convenient, be aware that it can get the schema definitions wrong too. Elasticsearch can only go so far when deriving the schema based on our document’s field values. It might make incorrect assumptions, leading to erroneous index schemas that can produce incorrect search results.

In the next article, we will go over the Dynamic Mapping in detail, stay tuned!

These short articles are condesed excerpts taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository.

Overview of mapping

Mapping definition

Indexing a document for the first time

You Might Also Like

Percolator Queries

Loading PDFs into Elasticsearch (3/3)

Introducing and Registering snapshots (1/3)