This is part 5 of a 5-part series on mapping.

In the last article, we learned about some of the core and common data types to represent our data fields. In this article, we look at advanced and specialized data types.

The Geopoint (`geo_point`) data type

Most of us may have used a smart device to find the location of the nearest restaurant or asked for GPS directions to our mother-in-law’s house during Christmas. Elasticsearch developed a specialized data type geo_point for capturing the location of a place.

Location data is expressed as a geo_point datatype, which represents longitude and latitude. We can use this to pinpoint an address for a restaurant, a school, a golf course, and others.

The code listing shown below demonstrates the schema definition of an index called restaurants. It hosts restaurants with names and addresses. The notable point is that the address field is defined as a geo_point datatype:

# A restaurants index with address declared as geo_point
PUT restaurants
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "address": {
        "type": "geo_point"
      }
    }
  }
}

Now that we have an index, let’s index a sample restaurant (London based fictitious Sticky Fingers) with its location provided as longitude and latitude (listing given below):

# Indexing a restaurant - the location is provided as lon and lat
PUT restaurants/_doc/1
{
  "name": "Sticky Fingers",
  "address": {
    "lon": "0.1278",
    "lat": "51.5074"
  }
}

In the above code snippets, the address of the restaurant is provided in the form of longitude (lon) and latitude (lat) pairs. There are other ways to provide these inputs too, which we look at shortly.

We can now search and fetch the restaurants within a given perimeter. We can use a geo_bounding_box query for searching data involving geographical addresses. It takes top_left and bottom_right points as inputs to create a boxed-up area around our point of interest, as the figure below demonstrates:

Geo bounding box of a location in central London

We provide the upper and lower bounds to this query using the lon (longitude) and lat (latitude) pairs (the address location points to London).

We write the geo_bounding_box query providing the address in the form of a rectangle with top_left and bottom_right coordinates provided as latitude and longitude, as the listing below shows:

# Listing to fetch the restaurants around a geographical location
GET restaurants/_search
{
  "query": {
    "geo_bounding_box": {
      "address": {
        "top_left": {
          "lon": "0",
          "lat": "52"
        },
        "bottom_right": {
          "lon": "1",
          "lat": "50"
        }
      }
    }
  }
}

This query fetches our restaurant because the geo bounding box encompasses our restaurant:

# We found the Sticky Fingers restaurant, yay!
"hits" : [
      {
        "_index" : "restaurants",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "Sticky Fingers",
          "address" : {
            "lon" : "0.1278",
            "lat" : "51.5074"
          }
        }
      }
    ]
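The containment test behind this result can be sketched in a few lines of Python. This is only a model of the idea, not Elasticsearch's implementation: the function name in_bounding_box is hypothetical, the coordinates are written as numbers rather than strings, and boxes that cross the antimeridian are ignored.

```python
def in_bounding_box(point, top_left, bottom_right):
    """Return True when the point lies inside the box, edges included."""
    # Latitude must sit between the bottom and the top edge of the box.
    lat_inside = bottom_right["lat"] <= point["lat"] <= top_left["lat"]
    # Longitude must sit between the left and the right edge of the box.
    lon_inside = top_left["lon"] <= point["lon"] <= bottom_right["lon"]
    return lat_inside and lon_inside


# Values copied from the listings above (assumed to be plain decimal degrees).
sticky_fingers = {"lon": 0.1278, "lat": 51.5074}
top_left = {"lon": 0.0, "lat": 52.0}
bottom_right = {"lon": 1.0, "lat": 50.0}

print(in_bounding_box(sticky_fingers, top_left, bottom_right))
```

A point with a longitude of 2, for instance, would fall outside the right edge and fail the check.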

As I mentioned earlier, we can provide the location information in various formats, not just latitude and longitude: for example, an array or a string. The table below provides the ways of creating location data and examples:
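To make the ordering conventions concrete, here is a small Python sketch that normalises four of the accepted input shapes to a (lat, lon) pair. The helper to_lat_lon is hypothetical and only models the conventions: an object with lat and lon keys, a "lat,lon" string, a [lon, lat] array, and a well-known-text POINT (lon lat). Geohash input is left out to keep the sketch short.

```python
import re

# Matches well-known text such as "POINT (0.1278 51.5074)": longitude first.
WKT_POINT = re.compile(r"\s*POINT\s*\(\s*([-\d.]+)\s+([-\d.]+)\s*\)\s*")


def to_lat_lon(value):
    """Normalise a geo_point-style input to a (lat, lon) tuple of floats."""
    if isinstance(value, dict):
        # Object form: {"lat": ..., "lon": ...}
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, (list, tuple)):
        # Array form: [lon, lat], note the longitude comes first.
        lon, lat = value
        return float(lat), float(lon)
    if isinstance(value, str):
        wkt = WKT_POINT.fullmatch(value)
        if wkt:
            # WKT form: POINT (lon lat)
            return float(wkt.group(2)), float(wkt.group(1))
        # String form: "lat,lon", note the latitude comes first.
        lat, lon = value.split(",")
        return float(lat), float(lon)
    raise TypeError(f"unsupported geo_point input: {value!r}")


inputs = [
    {"lat": "51.5074", "lon": "0.1278"},
    "51.5074,0.1278",
    [0.1278, 51.5074],
    "POINT (0.1278 51.5074)",
]
for item in inputs:
    print(to_lat_lon(item))
```

The detail worth remembering is that the string form puts the latitude first, while the array and POINT forms put the longitude first.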

The object data type

Data is often hierarchical. For example, an email consists of top-level fields such as subject, to and from, as well as an inner object that holds the attachments, as the snippet below demonstrates:

  "to": "johndoe@johndoe.com",
  "subject": "Testing Object Type",
  "attachments": {
    "filename": "file1.txt",
    "filetype": "confidential"
  }

JSON allows us to create such hierarchical objects: an object wrapped up in other objects. Elasticsearch has a special data type to represent such a hierarchy: the object type. In the above example, as attachments holds other attributes, we classify it as an object itself, and thus of an object type. The two properties of the attachments object, filename and filetype, can both be modelled as text fields. With this information at hand, we can create a mapping definition as demonstrated in the listing given below:

# Defining attachments as an object type. Although we can set the type to object explicitly,
# Elasticsearch deduces the object type when it sees hierarchical data, so the declaration can be omitted
PUT emails
{
  "mappings": {
    "properties": {
      "to": {
        "type": "text"
      },
      "subject": {
        "type": "text"
      },
      "attachments": {
        "type":"object",
        "properties": {
          "filename": {
            "type": "text"
          },
          "filetype": {
            "type": "text"
          }
        }
      }
    }
  }
}

The attachments field is an object as it encapsulates the two other fields. Though we have explicitly mentioned the type as an object, Elasticsearch doesn’t expect us to do so. It sets the field’s datatype as an object whenever it encounters the fields with hierarchical data.

Once the schema is created successfully, we can retrieve it by invoking the GET emails/_mapping command (listing given below):

# The mapping schema for emails
# The attachments type is not listed (inferred as object by Elasticsearch!)
{
  "emails" : {
    "mappings" : {
      "properties" : {
        "attachments" : {
          "properties" : {
            "filename" : {"type" : "text"},
            "filetype" : {"type" : "text"}
          }
        },
        "subject" : {"type" : "text"},
        "to" : {"type" : "text"}
      }
    }
  }
}

While all other fields show their associated data types, attachments does not. The object type of an inner object is inferred by Elasticsearch by default. Let’s index an email document. The listing given below shows the request:

#Indexing an email document
PUT emails/_doc/1
{
  "to": "johndoe@johndoe.com",
  "subject": "Testing Object Type",
  "attachments": {
    "filename": "file1.txt",
    "filetype": "confidential"
  }
}

Now that we have primed our emails index with a document, we can issue a match search query (we learn about search queries in the upcoming articles) on the inner object fields to fetch the relevant documents (and prove our point), shown in the listing below:

# Searching for an email based on the attachment name
GET emails/_search
{
  "query": {
    "match": {
      "attachments.filename": "file1.txt"
    }
  }
}

This will return the document from our store as the filename matches that of the document we have in store.

While object types are pretty straightforward, they carry one limitation: the inner objects are flattened out and not stored as individual documents. The downside is that the relationship between the fields of each object is lost when the objects are indexed from an array. The good news is that we have another data type, called nested, to solve this problem.

Unfortunately, I can’t cover the object’s limitation here due to space constraints — you will find the full notes in the accompanying book’s chapter.

The nested data type

A nested datatype is a specialized form of an object type where the relationship between the arrays of objects in a document is maintained.

Going with the same example of our emails and attachments, this time let’s define the attachments field as a nested data type, rather than letting Elasticsearch derive it as an object type. This calls for a schema that declares the attachments field as nested. The schema is shown in the listing given below:

# Creating the attachments field as nested datatype
PUT emails_nested
{
  "mappings": {
    "properties": {
      "attachments": {
        "type": "nested",
        "properties": {
          "filename": {
            "type": "keyword"
          },
          "filetype": {
            "type": "text"
          }
        }
      }
    }
  }
}

We have a schema definition created, so all we need to do is index a document. The listing given below does this exactly:

# Indexing a document with attachments 
PUT emails_nested/_doc/1
{
  "attachments": [
    {
      "filename": "file1.txt",
      "filetype": "confidential"
    },
    {
      "filename": "file2.txt",
      "filetype": "private"
    }
  ]
}

Once this document is successfully indexed, the final piece of the jigsaw is the search. The listing below demonstrates a search query for emails with an attachment whose file name is file1.txt and whose classification type is private. This combination doesn’t exist, so the results must be empty. With the object type, by contrast, the values of all inner objects are pooled together, so the same criteria would match across different attachments and return a false positive.

# This query shouldn't return results as no attachment has the file name "file1.txt" and the type "private" (look at the document above)
GET emails_nested/_search
{
  "query": {
    "nested": {
      "path": "attachments",
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "attachments.filename": "file1.txt"
              }
            },
            {
              "match": {
                "attachments.filetype": "private"
              }
            }
          ]
        }
      }
    }
  }
}

The query in the listing above searches for a file named file1.txt with a private classification, which doesn’t exist (look at the document we indexed earlier). No documents are returned for this query, which is exactly what we should expect: the classification of file1.txt is confidential, not private, so it didn’t match. When a nested type represents an array of inner objects, each individual object is stored and indexed as a hidden document.
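The difference between the two types can be modelled in plain Python. The sketch below is an illustration under simplifying assumptions (exact string comparison, no text analysis), and the function names are hypothetical. The object-style model pools the values of each inner field into one list, while the nested-style model checks each attachment on its own.

```python
attachments = [
    {"filename": "file1.txt", "filetype": "confidential"},
    {"filename": "file2.txt", "filetype": "private"},
]


def flatten_as_object(items, prefix="attachments"):
    """Model of the object type: one multi-valued list per inner field."""
    flat = {}
    for item in items:
        for key, value in item.items():
            flat.setdefault(f"{prefix}.{key}", []).append(value)
    return flat


def object_match(items, filename, filetype):
    """Both values only need to exist somewhere in the pooled lists."""
    flat = flatten_as_object(items)
    return (filename in flat["attachments.filename"]
            and filetype in flat["attachments.filetype"])


def nested_match(items, filename, filetype):
    """Both values must belong to the same inner object."""
    return any(item["filename"] == filename and item["filetype"] == filetype
               for item in items)


print(flatten_as_object(attachments))
print(object_match(attachments, "file1.txt", "private"))   # criss-cross match
print(nested_match(attachments, "file1.txt", "private"))   # no single attachment fits
```

The pooled model reports a match for file1.txt and private because each value exists somewhere in the array, which is the false positive the nested type avoids.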

The nested datatypes are pretty good at honouring the associations and relationships, so if we ever need to create an array of objects where each of the objects must be treated as an individual object, the nested datatype will become our friend.

No array types.

While we are on the subject of arrays, interestingly, there is no array data type in Elasticsearch. We can, however, set any field with more than one value, thus representing the field as an array. For example, a document with one name field can be changed from a single value to an array: "name": "John Doe" to "name": ["John Smith", "John Doe"] by simply adding a list of data values to the field. There’s one important point you must consider when creating arrays: you cannot mix up the array with various types. For example, you cannot declare the name field like this: "name": ["John Smith", 13, "Neverland"]. This is illegal as the field consists of multiple types and is not permitted.

Flattened (flattened) data type

So far we’ve looked at indexing the individual fields parsed from a JSON document. Each of the fields is treated as an individual and independent field when analyzing and storing it. However, sometimes we may not need to index all the subfields as individual fields thus letting them go through the analysis process. Think of a stream of chat messages on a chat system, running commentary during a live football match, a doctor taking notes on his patient’s ailments and so on. We can load this kind of data as one big blob rather than declaring each of the fields explicitly (or derived dynamically). Elasticsearch provides a special data type called flattened for this purpose.

A flattened datatype holds information in the form of one or more subfields, with each subfield’s value indexed as a keyword. That is, none of the values is treated as a text field, so none of them undergoes the text analysis process.

Let’s consider an example of a doctor taking running notes about a patient during a consultation. The mapping consists of two fields: patient_name and doctor_notes. The doctor_notes field is declared as a flattened type. The listing given below provides the mapping:

# Listing for Creating a mapping with flattened data type
PUT consultations
{
  "mappings": {
    "properties": {
      "patient_name": {
        "type": "text"
      },
      "doctor_notes": {
        "type": "flattened"
      }
    }
  }
}

Any field (and its subfields) that’s declared as flattened will not get analyzed (we will learn about text analysis in Chapter 7). That is, all the values are indexed as keywords. Let’s create a patient consultation document (listing below) and index it:

# The consultation document with doctor’s notes
PUT consultations/_doc/1
{
  "patient_name": "John Doe",
  "doctor_notes": {
    "temperature": 103,
    "symptoms": [
      "chills",
      "fever",
      "headache"
    ],
    "history": "none",
    "medication": [
      "Antibiotics",
      "Paracetamol"
    ]
  }
}

As you can see, the doctor_notes holds a lot of information but remember we did not create these inner fields in our mapping definition. As the doctor_notes is a flattened type, all the values are indexed as keywords.
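As a rough mental model of what "indexed as keywords" means here, the Python sketch below walks the doctor_notes object and collects every leaf value as an unanalysed string. The helper flattened_keywords is hypothetical and ignores how Elasticsearch stores the keys. It only illustrates that values are kept verbatim, including their case.

```python
def flattened_keywords(value):
    """Collect every leaf value of a JSON-like structure as a keyword string."""
    if isinstance(value, dict):
        return [kw for inner in value.values() for kw in flattened_keywords(inner)]
    if isinstance(value, list):
        return [kw for inner in value for kw in flattened_keywords(inner)]
    # Leaf value: kept as-is (numbers become strings), with no text analysis.
    return [str(value)]


doctor_notes = {
    "temperature": 103,
    "symptoms": ["chills", "fever", "headache"],
    "history": "none",
    "medication": ["Antibiotics", "Paracetamol"],
}

keywords = flattened_keywords(doctor_notes)
print(keywords)
print("Paracetamol" in keywords)   # exact value is present
print("paracetamol" in keywords)   # lower-cased form is not
```

Because no analysis takes place, a search term has to match the stored value exactly, so the capitalisation of Paracetamol matters.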

Finally, we search the index using any of the keywords from the doctor notes, as the listing given below demonstrates:

Listing for Searching through the flattened data type field

# Searching for patients prescribed Paracetamol
GET consultations/_search
{
  "query": {
    "match": {
      "doctor_notes": "Paracetamol"
    }
  }
}

Searching for Paracetamol will return John Doe’s consultation document. You can experiment by changing the match query to any of the other values (for example, "doctor_notes": "chills") or even write a complex query like the one shown below:

# An advanced query to fetch patients based on multiple search criteria
# Search for non-diabetic patients with a headache who are prescribed antibiotics
GET consultations/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"doctor_notes": "headache"}},
        {"match": {"doctor_notes": "Antibiotics"}}
      ],
      "must_not": [
        {"term": {"doctor_notes": {"value": "diabetics"}}}
      ]
    }
  }
}

In the query, we check for headaches and antibiotics, but the patient shouldn’t be diabetic. The query returns John Doe as he isn’t diabetic but has headaches and is on antibiotics (get well soon, Doe!).

The flattened data types come in handy especially when we are expecting a lot of fields on an ad-hoc basis and having to define the mapping definitions for all of them beforehand isn’t feasible. Be mindful that the subfields of a flattened field are always keyword types.

The Join (join) data type

If you come from the relational database world, you will know the relationships between data (the joins) that enable parent-child relationships. In Elasticsearch, however, every document that gets indexed is independent and maintains no relationship with any other document in that index. Elasticsearch de-normalizes the data to gain speed and performance during indexing and search operations. It does, however, provide a join datatype to model parent-child relationships should we need them.

Consider an example of a doctor-patient (one-to-many) relationship: one doctor can have multiple patients and each patient is assigned to one doctor.

Let’s create a doctors index with a schema consisting of the definition of the relationship. To work with parent-child relationships using the join datatype, we need to

create a field of join type and
add additional information via relations object mentioning the relationship (for example, doctor-patient relationship in the current context)

The query here is preparing the doctors index with a schema definition:

# Creating an index with a join datatype - the join field must declare a "relations" attribute
PUT doctors
{
  "mappings": {
    "properties": {
      "relationship": {
        "type": "join",
        "relations": {
          "doctor": "patient"
        }
      }
    }
  }
}

Once we have the schema ready, we index two types of documents: one representing the doctor (parent) and the other the patient (child). Here is the doctor’s document, with the relationship set to doctor:

#Indexing a doctor - make sure the relationship field is set to doctor type
PUT doctors/_doc/1
{
  "name": "Dr Mary Montgomery",
  "relationship": {
    "name": "doctor"
  }
}

A notable point from the above snippet is that the relationship object declares the type of the document as doctor. The name attribute must be the parent value (doctor) as declared under the relations attribute in the mapping schema. Now that Dr Mary Montgomery is in place, the next step is to associate two patients with her. The following requests (listing below) do that:

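# Listing for creating two patients for our doctor
# The doctor was indexed without a routing value, so it is routed by its ID (1);
# using routing=1 here keeps both patients on the same shard as the doctor
PUT doctors/_doc/2?routing=1
{
  "name": "John Doe",
  "relationship": {
    "name": "patient",
    "parent": 1
  }
}

PUT doctors/_doc/3?routing=1
{
  "name": "Mrs Doe",
  "relationship": {
    "name": "patient",
    "parent": 1
  }
}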

The relationship object should have the value set as patient (remember the parent-child portion of the relations attribute in the schema?) and parent should be assigned with a document identifier of the associated doctor (ID 1 in our example).

There’s one more thing we need to understand when working with parent-child relationships. The parents and associated children will be indexed into the same shard to avoid the multi-shard search overheads. And as the documents should co-exist, we need to use a mandatory routing parameter in the URL. Routing is a function that would determine the shard where the document will reside.
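The idea behind routing can be sketched as "hash the routing value, then take the remainder by the number of shards". The Python below is a simplified model: the function shard_for is hypothetical, CRC32 merely stands in for the hash Elasticsearch actually uses, and the shard count of 5 is an arbitrary assumption.

```python
import zlib


def shard_for(routing_value, number_of_shards):
    """Pick a shard number from a routing value (CRC32 is a stand-in hash)."""
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % number_of_shards


NUMBER_OF_SHARDS = 5  # assumed for illustration only

# Two documents indexed with the same routing value always share a shard.
routing_value = "some-routing-value"  # hypothetical value
first_child_shard = shard_for(routing_value, NUMBER_OF_SHARDS)
second_child_shard = shard_for(routing_value, NUMBER_OF_SHARDS)
print(first_child_shard == second_child_shard)
```

The property that matters is determinism: equal routing values always resolve to the same shard. For the join to work, the parent and its children must therefore be indexed with routing values that resolve to one and the same shard.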

Finally, it’s time to search for patients belonging to a doctor with ID 1. The query in listing below searches for all the patients associated with Dr Montgomery:

# Searching for all patients of Dr Montgomery
GET doctors/_search
{
  "query": {
    "parent_id": {
      "type": "patient",
      "id": 1
    }
  }
}

When we wish to fetch the patients belonging to a doctor, we use a search query called parent_id that would expect the child type (patient) and the parent’s ID (Dr Montgomery document ID is 1). This query will return Dr Montgomery’s patients – Mr and Mrs Doe.

Working with join datatypes isn’t straightforward, as we are asking a non-relational data store engine to work with relationships. That is kinda asking too much, so use joins only if you must.

Implementing parent-child relationships in Elasticsearch has performance implications. As we touched on in the first chapter, Elasticsearch may not be the right tool if you are considering document relationships, so use this feature judiciously.

Search as you type data type

Most search engines suggest words and phrases as we type in a search bar. This feature is commonly known under a few names: search-as-you-type, typeahead, autocomplete or suggestions. Elasticsearch provides a convenient data type, search_as_you_type, to support this feature. Behind the scenes, Elasticsearch works very hard to make sure the fields tagged as search_as_you_type are indexed to produce n-grams, which we will see in action in this section.

N-grams are sequences of characters of a given size taken from a word. For example, if the word is “action”, the 3-grams (n-grams of size 3) are ["act", "cti", "tio", "ion"], the bi-grams (size 2) are ["ac", "ct", "ti", "io", "on"], and so on.

Edge n-grams, on the other hand, are n-grams of every word, where the start of the n-gram is anchored to the beginning of the word. Considering the “action” word as our example, the edge n-gram produces: ["a","ac","act","acti","actio","action"].

Shingles, on the other hand, are word n-grams. For example, the sentence “Elasticsearch in Action” will output: ["Elasticsearch", "Elasticsearch in", "Elasticsearch in Action", "in", "in Action", "Action"]
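The three definitions above are easy to reproduce in Python. The functions below are illustrative helpers (the names char_ngrams, edge_ngrams and shingles are hypothetical), not the analyzers Elasticsearch ships, and the shingle helper assumes a maximum size of three words to match the example.

```python
def char_ngrams(word, size):
    """All contiguous character sequences of the given size."""
    return [word[i:i + size] for i in range(len(word) - size + 1)]


def edge_ngrams(word):
    """Character sequences of growing length, anchored at the start of the word."""
    return [word[:end] for end in range(1, len(word) + 1)]


def shingles(sentence, max_size=3):
    """Word n-grams of one to max_size words, grouped by starting word."""
    words = sentence.split()
    result = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            if start + size <= len(words):
                result.append(" ".join(words[start:start + size]))
    return result


print(char_ngrams("action", 3))
print(char_ngrams("action", 2))
print(edge_ngrams("action"))
print(shingles("Elasticsearch in Action"))
```

Running the four calls reproduces the lists given in the three paragraphs above.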

Say we are asked to support typeahead queries on a books index, i.e., when users start typing a book’s title letter by letter in a search bar, we should be able to suggest the books that match the letters typed so far.

First, we need to create a schema with the field in question to be of search_as_you_type datatype. The listing below provides this mapping schema:

# Mapping schema for technical books with the title defined as search_as_you_type datatype
PUT tech_books
{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type"
      }
    }
  }
}

We now index a few books:

# Indexing a few documents
PUT tech_books/_doc/1
{
  "title": "Elasticsearch in Action"
}

PUT tech_books/_doc/2
{
  "title": "Elasticsearch for Java Developers"
}

PUT tech_books/_doc/3
{
  "title": "Elastic Stack in Action"
}

As the title field is of the search_as_you_type data type, Elasticsearch creates a set of n-gram subfields (title._2gram, title._3gram and title._index_prefix) in addition to the root field (title), as shown in the table below:

**Table showing the subfields created automatically by the engine**

As these fields are additionally created for us, searching on the field is expected to return the typeahead suggestions as the n-grams help produce them effectively.

Let’s create the search query as shown in the listing below.

# Searching in a search_as_you_type field and its subfields
GET tech_books/_search
{
  "query": {
    "multi_match": {
      "query": "in",
      "type": "bool_prefix",
      "fields": [
        "title",
        "title._2gram",
        "title._3gram"
      ]
    }
  }
}

This query should return the Elasticsearch in Action and Elastic Stack in Action books. We use a multi-match query because we are searching for a value across multiple fields: title, title._2gram and title._3gram. The bool_prefix type treats the last term of the query as a prefix, which is what lets a partially typed word match.
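The matching behaviour can be approximated in Python to see why only two of the three books come back. This is a deliberately simplified model and not how Elasticsearch executes the query: the helper bool_prefix_match is hypothetical, it lower-cases and splits on whitespace instead of running an analyzer, it ignores the shingle subfields, and it assumes the clauses are combined with OR.

```python
def bool_prefix_match(title, query):
    """Earlier query terms match whole words; the last term matches as a prefix."""
    words = title.lower().split()
    *terms, last = query.lower().split()
    term_hit = any(term in words for term in terms)
    prefix_hit = any(word.startswith(last) for word in words)
    return term_hit or prefix_hit


titles = [
    "Elasticsearch in Action",
    "Elasticsearch for Java Developers",
    "Elastic Stack in Action",
]

matches = [title for title in titles if bool_prefix_match(title, "in")]
print(matches)
```

The second title has no word starting with "in", so it drops out, while the other two contain the word "in" itself.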

Phew, that was a long one! In this article, we learned about advanced datatypes such as object, nested and flattened, as well as others like geo_point, join and search_as_you_type. Please refer to the book’s chapter for further details on the other datatypes, in-depth discussions and code examples.

These short articles are condensed excerpts taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository.
