This is part 5 of a 5-part series on Mapping:
- 1. Overview of Mapping
- 2. Dynamic Mapping
- 3. Explicit Mapping
- 4. Core data types
- 5. Advanced data types
In the last article, we learned about some of the core and common data types to represent our data fields. In this article, we look at advanced and specialized data types.
The Geopoint (geo_point) data type
Most of us have used a smart device to find the nearest restaurant or asked for GPS directions to our mother-in-law’s house during Christmas. Elasticsearch provides a specialized data type, geo_point, for capturing the location of a place.
Location data is expressed as a geo_point datatype, which represents a latitude and longitude pair. We can use it to pinpoint the address of a restaurant, a school, a golf course, and so on.
The code listing below shows the schema definition of an index called restaurants, which holds restaurants with names and addresses. The notable point is that the address field is defined as a geo_point datatype:
# A restaurants index with address declared as geo_point
PUT restaurants
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"address": {
"type": "geo_point"
}
}
}
}
Now that we have an index, let’s index a sample restaurant (the fictitious, London-based Sticky Fingers) with its location provided as longitude and latitude (listing given below):
# Indexing a restaurant - the location is provided as lon and lat
PUT restaurants/_doc/1
{
"name": "Sticky Fingers",
"address": {
"lon": "0.1278",
"lat": "51.5074"
}
}
In the above snippet, the address of the restaurant is provided as a longitude (lon) and latitude (lat) pair. There are other ways to provide these inputs too, which we look at shortly.
We can now search for restaurants within a given perimeter. We can use a geo_bounding_box query for searching data involving geographical addresses. It takes top_left and bottom_right points as inputs to create a boxed-up area around our point of interest, as the figure below demonstrates:
We provide the upper and lower bounds to this query using lon (longitude) and lat (latitude) pairs (the address location points to London).
We write the geo_bounding_box query by describing the area as a rectangle whose top_left and bottom_right coordinates are given as latitude and longitude, as the listing below shows:
# Listing to fetch the restaurants around a geographical location
GET restaurants/_search
{
"query": {
"geo_bounding_box": {
"address": {
"top_left": {
"lon": "0",
"lat": "52"
},
"bottom_right": {
"lon": "1",
"lat": "50"
}
}
}
}
}
This query fetches our restaurant because the geo bounding box encompasses our restaurant:
# We found the Sticky Fingers restaurant, yay!
"hits" : [
{
"_index" : "restaurants",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"name" : "Sticky Fingers",
"address" : {
"lon" : "0.1278",
"lat" : "51.5074"
}
}
}
]
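To make the matching rule concrete, here is a minimal pure-Python sketch of the containment test that a bounding-box search boils down to. It is an illustration only, not how Elasticsearch implements the query: the GeoPoint class and the in_bounding_box function are hypothetical names, and the sketch assumes the box does not cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoPoint:
    lat: float
    lon: float


def in_bounding_box(point: GeoPoint, top_left: GeoPoint, bottom_right: GeoPoint) -> bool:
    """Return True if point lies inside the rectangle (boundaries included).

    Simplified on purpose: assumes the box does not cross the antimeridian.
    """
    lat_ok = bottom_right.lat <= point.lat <= top_left.lat
    lon_ok = top_left.lon <= point.lon <= bottom_right.lon
    return lat_ok and lon_ok


# Coordinates taken from the listings above
restaurant = GeoPoint(lat=51.5074, lon=0.1278)
top_left = GeoPoint(lat=52.0, lon=0.0)
bottom_right = GeoPoint(lat=50.0, lon=1.0)

inside = in_bounding_box(restaurant, top_left, bottom_right)

# A point well outside the box (hypothetical second location)
outside = in_bounding_box(GeoPoint(lat=48.8566, lon=2.3522), top_left, bottom_right)
```

Here inside is True because the restaurant's latitude falls between 50 and 52 and its longitude between 0 and 1, which is why the query returns it.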
As I mentioned earlier, we can provide the location information in various formats, not just latitude and longitude: for example, an array or a string. The table below provides the ways of creating location data and examples:
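One detail worth keeping in mind is that the coordinate order differs between formats: the string form is "lat,lon", whereas the array form is [lon, lat]. The sketch below is a hypothetical helper (to_lat_lon is not part of any Elasticsearch client) that normalizes the object, string, and array forms to a single (lat, lon) tuple. It deliberately ignores the other accepted formats, such as geohashes and well-known text, and assumes every string is in the "lat,lon" form.

```python
from typing import Tuple, Union

GeoInput = Union[dict, str, list]


def to_lat_lon(value: GeoInput) -> Tuple[float, float]:
    """Normalize three common geo_point input shapes to a (lat, lon) tuple."""
    if isinstance(value, dict):
        # Object form: {"lat": ..., "lon": ...}
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, str):
        # String form: "lat,lon" (assumed; geohash strings are not handled here)
        lat, lon = value.split(",")
        return float(lat), float(lon)
    if isinstance(value, (list, tuple)):
        # Array form: [lon, lat] -- note the reversed order
        lon, lat = value
        return float(lat), float(lon)
    raise TypeError(f"unsupported geo_point input: {value!r}")


as_object = to_lat_lon({"lon": "0.1278", "lat": "51.5074"})
as_string = to_lat_lon("51.5074,0.1278")
as_array = to_lat_lon([0.1278, 51.5074])
```

All three calls describe the same location, so the three results are equal.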
The object data type
We often find data in a hierarchical form. For example, an email consists of top-level fields such as subject, to, and from, as well as an inner object that holds attachments, as the snippet below demonstrates:
"to":"[email protected]",
"subject":"Testing Object Type",
"attachments":{
"filename":"file1.txt",
"filetype":"confidential"
}
JSON allows us to create such hierarchical objects: an object wrapped up in other objects. To represent such a hierarchy, Elasticsearch has a special data type: the object type. In the above example, as attachments holds other attributes, we classify it as an object itself, thus of the object type. The two properties filename and filetype in the attachments object can both be modelled as text fields. With this information at hand, we can create a mapping definition as demonstrated in the listing given below:
# Defining the attachments field as an object type. Although we can set the type to object explicitly,
# Elasticsearch is clever enough to deduce the object type when it sees hierarchical data, so we can omit the declaration
PUT emails
{
"mappings": {
"properties": {
"to": {
"type": "text"
},
"subject": {
"type": "text"
},
"attachments": {
"type":"object",
"properties": {
"filename": {
"type": "text"
},
"filetype": {
"type": "text"
}
}
}
}
}
}
The attachments field is an object as it encapsulates the two other fields. Though we have explicitly declared the type as object, Elasticsearch doesn’t expect us to do so. It sets the field’s datatype to object whenever it encounters a field with hierarchical data.
Once the schema is created successfully, we can retrieve it by invoking the GET emails/_mapping command (listing given below):
# The mapping schema for emails
# The attachments type is not listed (inferred as object by Elasticsearch!)
{
"emails" : {
"mappings" : {
"properties" : {
"attachments" : {
"properties" : {
"filename" : {"type" : "text"},
"filetype" : {"type" : "text"}
}
},
"subject" : {"type" : "text"},
"to" : {"type" : "text"}
}
}
}
}
While all other fields show their associated data types, attachments doesn’t. The object type of an inner object is inferred by Elasticsearch by default. Let’s index an email document; the listing given below shows the request:
#Indexing an email document
PUT emails/_doc/1
{
"to": "[email protected]",
"subject": "Testing Object Type",
"attachments": {
"filename": "file1.txt",
"filetype": "confidential"
}
}
Now that we have primed our emails index with a document, we can issue a match query (we learn about search queries in the upcoming articles) on the inner object’s fields to fetch the relevant documents (and prove our point), as shown in the listing below:
# Searching for an email based on the attachment name
GET emails/_search
{
"query": {
"match": {
"attachments.filename": "file1.txt"
}
}
}
This returns the document from our store, as the filename matches that of the document we indexed.
While object types are pretty straightforward, they carry one limitation: the inner objects are flattened out and not stored as individual documents. The downside is that the relationship between the objects indexed from an array is lost. The good news is that we have another data type, called nested, to solve this problem.
Unfortunately, I can’t cover the object type’s limitation in detail here due to space constraints; you will find the full notes in the accompanying book’s chapter.
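The gist of the limitation can still be illustrated in a few lines. The following is a rough Python sketch, not Elasticsearch's actual indexing code: flatten_object_array is a hypothetical helper that mimics how an array of inner objects is collapsed into multi-valued flat fields, using a made-up email with two attachments.

```python
from collections import defaultdict


def flatten_object_array(field: str, objects: list) -> dict:
    """Mimic how an array of inner objects ends up as multi-valued flat fields."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{field}.{key}"].append(value)
    return dict(flat)


# Hypothetical email with two attachments
attachments = [
    {"filename": "file1.txt", "filetype": "confidential"},
    {"filename": "file2.txt", "filetype": "private"},
]

flat_doc = flatten_object_array("attachments", attachments)

# "file1.txt" and "private" never appear together in one attachment,
# yet both conditions hold on the flattened document: a false positive.
false_positive = (
    "file1.txt" in flat_doc["attachments.filename"]
    and "private" in flat_doc["attachments.filetype"]
)
```

After flattening, nothing records which filename belonged to which filetype, so a search combining values from two different attachments still matches the document.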
The nested data type
A nested datatype is a specialized form of the object type in which the relationship between the objects in an array is maintained.
Going with the same example of emails and attachments, this time let’s define the attachments field as a nested data type, rather than letting Elasticsearch derive it as an object type. This calls for a schema that declares the attachments field as nested. The schema is shown in the listing given below:
# Creating the attachments field as nested datatype
PUT emails_nested
{
"mappings": {
"properties": {
"attachments": {
"type": "nested",
"properties": {
"filename": {
"type": "keyword"
},
"filetype": {
"type": "text"
}
}
}
}
}
}
We have a schema definition created, so all we need to do is index a document. The listing given below does exactly this:
# Indexing a document with attachments
PUT emails_nested/_doc/1
{
"attachments": [
{
"filename": "file1.txt",
"filetype": "confidential"
},
{
"filename": "file2.txt",
"filetype": "private"
}
]
}
Once this document is successfully indexed, the final piece of the jigsaw is the search. The listing below shows a query that fetches emails with an attachment whose file name is file1.txt and whose classification type is private. This combination doesn’t exist, so the results must be empty, unlike in the case of an object type, where the values are matched across the inner objects, returning false-positive results.
# This query shouldn't return results, as no attachment has both the file name "file1.txt" and the type "private" (look at the document above)
GET emails_nested/_search
{
"query": {
"nested": {
"path": "attachments",
"query": {
"bool": {
"must": [
{
"match": {
"attachments.filename": "file1.txt"
}
},
{
"match": {
"attachments.filetype": "private"
}
}
]
}
}
}
}
}
The query in the listing above searches for a file named file1.txt with a private classification, which doesn’t exist (look at the document we indexed earlier). No documents are returned for this query, which is exactly what we should expect. The classification of file1.txt is confidential, not private, hence it didn’t match. So, when a nested type represents an array of inner objects, each individual object is stored and indexed as a hidden document.
The nested datatype is good at honouring associations and relationships, so if we ever need to create an array of objects where each object must be treated as an individual object, the nested datatype will become our friend.
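As a mental model, a nested query checks all of its conditions against one inner object at a time. The sketch below captures that rule in plain Python; nested_match is a hypothetical function written for illustration, uses exact equality rather than analyzed matching, and says nothing about how Elasticsearch stores the hidden documents.

```python
def nested_match(objects: list, criteria: dict) -> bool:
    """True only if a single inner object satisfies every criterion."""
    return any(
        all(obj.get(key) == expected for key, expected in criteria.items())
        for obj in objects
    )


# Same attachments as in the indexed document above
attachments = [
    {"filename": "file1.txt", "filetype": "confidential"},
    {"filename": "file2.txt", "filetype": "private"},
]

# No single attachment is both file1.txt and private
mismatch = nested_match(attachments, {"filename": "file1.txt", "filetype": "private"})

# file2.txt really is private, so this combination matches
match = nested_match(attachments, {"filename": "file2.txt", "filetype": "private"})
```

Because each attachment is evaluated on its own, mismatch is False while match is True, mirroring the empty result of the query above.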
No array types
While we are on the subject of arrays, interestingly, there is no array data type in Elasticsearch. We can, however, set any field with more than one value, thus representing the field as an array. For example, a document with one name field can be changed from a single value to an array, from "name": "John Doe" to "name": ["John Smith", "John Doe"], by simply providing a list of values for the field. There’s one important point to consider when creating arrays: you cannot mix types within an array. For example, you cannot declare the name field like this: "name": ["John Smith", 13, "Neverland"]. This is illegal, as the field consists of multiple types.
Flattened (flattened) data type
So far we’ve looked at indexing the individual fields parsed from a JSON document. Each field is treated as an individual and independent field when it is analyzed and stored. However, sometimes we may not need to index all the subfields as individual fields and put each of them through the analysis process. Think of a stream of chat messages on a chat system, running commentary during a live football match, a doctor taking notes on a patient’s ailments, and so on. We can load this kind of data as one big blob rather than declaring each of the fields explicitly (or deriving them dynamically). Elasticsearch provides a special data type called flattened for this purpose.
A flattened datatype holds information in the form of one or more subfields, with each subfield’s value indexed as a keyword. That is, none of the values is treated as a text field, so they do not undergo the text analysis process.
Let’s consider an example of a doctor taking running notes about a patient during a consultation. The mapping consists of two fields: patient_name and doctor_notes. The doctor_notes field is declared as a flattened type. The listing given below provides the mapping:
# Listing for Creating a mapping with flattened data type
PUT consultations
{
"mappings": {
"properties": {
"patient_name": {
"type": "text"
},
"doctor_notes": {
"type": "flattened"
}
}
}
}
Any field (and its subfields) declared as flattened will not get analyzed (we will learn about text analysis in Chapter 7). That is, all the values are indexed as keywords. Let’s create a patient consultation document (listing below) and index it:
# The consultation document with doctor’s notes
PUT consultations/_doc/1
{
"patient_name": "John Doe",
"doctor_notes": {
"temperature": 103,
"symptoms": [
"chills",
"fever",
"headache"
],
"history": "none",
"medication": [
"Antibiotics",
"Paracetamol"
]
}
}
As you can see, doctor_notes holds a lot of information, but remember that we did not create these inner fields in our mapping definition. As doctor_notes is a flattened type, all the values are indexed as keywords.
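To picture what "indexed as keywords" means for this document, the sketch below walks the doctor_notes object and collects every leaf value as an unanalyzed string. It is a simplified illustration under my own assumptions (flattened_keywords is a hypothetical helper, and the real field also tracks which key each value came from), not a description of Elasticsearch internals.

```python
def flattened_keywords(value) -> list:
    """Collect every leaf value of a JSON-like structure as an unanalyzed string."""
    if isinstance(value, dict):
        return [kw for child in value.values() for kw in flattened_keywords(child)]
    if isinstance(value, list):
        return [kw for child in value for kw in flattened_keywords(child)]
    # Leaf: kept verbatim as a string (no lowercasing, no tokenizing)
    return [str(value)]


# The doctor_notes object from the consultation document above
doctor_notes = {
    "temperature": 103,
    "symptoms": ["chills", "fever", "headache"],
    "history": "none",
    "medication": ["Antibiotics", "Paracetamol"],
}

keywords = flattened_keywords(doctor_notes)
```

Since the values are kept verbatim, a search term has to match a value exactly, including its case: "Paracetamol" is in the list, whereas "paracetamol" is not.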
Finally, we search the index using any of the keywords from the doctor’s notes, as the listing given below demonstrates:
# Listing for searching through the flattened data type field
# Searching for patients prescribed paracetamol
GET consultations/_search
{
"query": {
"match": {
"doctor_notes": "Paracetamol"
}
}
}
Searching for Paracetamol will return John Doe’s consultation document. You can experiment by changing the match query to any of the other values, for example doctor_notes:chills, or even write a complex query like the one shown below:
# An advanced query to fetch patients based on multiple search criteria
# Search for non-diabetic patients with a headache who were prescribed antibiotics
GET consultations/_search
{
"query": {
"bool": {
"must": [{"match": {"doctor_notes": "headache"}},
{"match": {"doctor_notes": "Antibiotics"}}],
"must_not": [{"term": {"doctor_notes": {"value": "diabetics"}}}]
}
}
}
In the query, we check for headaches and antibiotics, but the patient shouldn’t be diabetic. The query returns John Doe, as he isn’t diabetic but has headaches and is on antibiotics (get well soon, Doe!).
The flattened data type comes in handy especially when we expect a lot of fields on an ad-hoc basis and defining the mappings for all of them beforehand isn’t feasible. Be mindful that the subfields of a flattened field are always keyword types.
The Join (join) data type
If you come from the relational database world, you will know about relationships between data, the joins, that enable parent-child relationships. In Elasticsearch, however, every document that gets indexed is independent and maintains no relationship with any other in that index. Elasticsearch de-normalizes the data to gain performance during indexing and search operations. Elasticsearch provides a join datatype to model parent-child relationships should we need them.
Consider an example of a doctor-patient (one-to-many) relationship: one doctor can have multiple patients and each patient is assigned to one doctor.
Let’s create a doctors index with a schema that defines the relationship. To work with parent-child relationships using the join datatype, we need to
- create a field of join type, and
- add additional information via a relations object describing the relationship (for example, the doctor-patient relationship in the current context)
The request below prepares the doctors index with a schema definition:
# Creating an index with a join datatype - make sure the join field (relationship) declares a "relations" object
PUT doctors
{
"mappings": {
"properties": {
"relationship": {
"type": "join",
"relations": {
"doctor": "patient"
}
}
}
}
}
Once we have the schema ready, we index two types of documents: one representing the doctor (parent) and the other the patient (child). Here’s the doctor’s document, with the relationship set to doctor:
#Indexing a doctor - make sure the relationship field is set to doctor type
PUT doctors/_doc/1?routing=mary
{
"name": "Dr Mary Montgomery",
"relationship": {
"name": "doctor"
}
}
A notable point from the above snippet is that the relationship object declares the type of the document as doctor. The name attribute must be the parent value (doctor) as declared in the mapping schema under the relations property. Once we have our resident doctor, Dr Mary Montgomery, ready, the next step is to associate two patients with her. The following request (listing below) does that:
# Listing for Creating two patients for our doctor
PUT doctors/_doc/2?routing=mary
{
"name": "John Doe",
"relationship": {
"name": "patient",
"parent": 1
}
}
PUT doctors/_doc/3?routing=mary
{
"name": "Mrs Doe",
"relationship": {
"name": "patient",
"parent": 1
}
}
The relationship object should have its name set to patient (remember the parent-child portion of the relations attribute in the schema?) and parent should be assigned the document identifier of the associated doctor (ID 1 in our example).
There’s one more thing we need to understand when working with parent-child relationships. Parents and their associated children must be indexed into the same shard to avoid multi-shard search overheads. As the documents must co-exist, we need to use a mandatory routing parameter in the URL. Routing is a function that determines the shard where a document will reside.
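Conceptually, routing hashes the routing value and takes the result modulo the number of primary shards, so documents that share a routing value always land on the same shard. The sketch below shows that idea only: shard_for is a hypothetical function, zlib.crc32 is a stand-in hash chosen for the example (Elasticsearch uses its own hash function and a more involved formula), and the shard count of 5 is made up.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick a shard from the routing value: hash(routing) % number_of_shards.

    zlib.crc32 is a stand-in hash used for this sketch only.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


NUM_SHARDS = 5  # hypothetical number of primary shards

# Both patients were indexed with routing=mary in the listing above
john_shard = shard_for("mary", NUM_SHARDS)
mrs_doe_shard = shard_for("mary", NUM_SHARDS)
```

Because the function is deterministic, john_shard and mrs_doe_shard are always equal, which is the property the join datatype relies on: every document in one family must be indexed with the same routing value.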
Finally, it’s time to search for patients belonging to a doctor with ID 1. The query in listing below searches for all the patients associated with Dr Montgomery:
# Searching for all patients of Dr Montgomery
GET doctors/_search
{
"query": {
"parent_id": {
"type": "patient",
"id": 1
}
}
}
When we wish to fetch the patients belonging to a doctor, we use a search query called parent_id, which expects the child type (patient) and the parent’s ID (Dr Montgomery’s document ID is 1). This query returns Dr Montgomery’s patients, Mr and Mrs Doe.
Working with the join datatype isn’t straightforward, as we are asking a non-relational data store to work with relationships, which is asking a lot, so use joins only if you must.
Implementing parent-child relationships in Elasticsearch has performance implications. As we touched on in the first chapter, Elasticsearch may not be the right tool if you are considering document relationships, so use this feature judiciously.
Search as you type data type
Most search engines suggest words and phrases as we type in a search bar. This feature goes by a few names: search-as-you-type, typeahead, autocomplete, or suggestions. Elasticsearch provides a convenient data type, search_as_you_type, to support this feature. Behind the scenes, Elasticsearch works very hard to make sure the fields tagged as search_as_you_type are indexed to produce n-grams, which we will see in action in this section.
N-grams are contiguous sequences of characters of a given size taken from a word. For example, if the word is “action”, the tri-grams (n-grams of size 3) are ["act", "cti", "tio", "ion"], the bi-grams (size 2) are ["ac", "ct", "ti", "io", "on"], and so on.
Edge n-grams, on the other hand, are n-grams anchored to the beginning of the word. Taking “action” as our example, the edge n-grams are ["a", "ac", "act", "acti", "actio", "action"].
Shingles, in turn, are word n-grams. For example, the sentence “Elasticsearch in Action” will output ["Elasticsearch", "Elasticsearch in", "Elasticsearch in Action", "in", "in Action", "Action"].
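The three definitions above are easy to reproduce in a few lines of Python. This is a standalone sketch for building intuition, with hypothetical function names (char_ngrams, edge_ngrams, shingles); it does not call Elasticsearch's analyzers, and it splits sentences on whitespace only.

```python
def char_ngrams(word: str, size: int) -> list:
    """All contiguous character sequences of the given size."""
    return [word[i:i + size] for i in range(len(word) - size + 1)]


def edge_ngrams(word: str) -> list:
    """N-grams anchored to the start of the word, from 1 character up to the full word."""
    return [word[:end] for end in range(1, len(word) + 1)]


def shingles(sentence: str, max_size: int) -> list:
    """Word n-grams of 1..max_size words, grouped by starting word."""
    words = sentence.split()
    return [
        " ".join(words[start:start + size])
        for start in range(len(words))
        for size in range(1, max_size + 1)
        if start + size <= len(words)
    ]


trigrams = char_ngrams("action", 3)
bigrams = char_ngrams("action", 2)
edges = edge_ngrams("action")
word_shingles = shingles("Elasticsearch in Action", 3)
```

Running it on the word “action” and the sentence “Elasticsearch in Action” reproduces the lists given in the text.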
Say we are asked to support typeahead queries on an index of technical books: when the user starts typing a book’s title letter by letter in a search bar, we should be able to suggest the books matching the letters typed so far.
First, we need to create a schema in which the field in question is of the search_as_you_type datatype. The listing below provides this mapping schema:
# Mapping schema for technical books with the title defined as search_as_you_type datatype
PUT tech_books
{
"mappings": {
"properties": {
"title": {
"type": "search_as_you_type"
}
}
}
}
We now index a few books:
# Indexing a few documents
PUT tech_books/_doc/1
{
"title": "Elasticsearch in Action"
}
PUT tech_books/_doc/2
{
"title":"Elasticsearch for Java Developers"
}
PUT tech_books/_doc/3
{
"title":"Elastic Stack in Action"
}
As the title field is of the search_as_you_type data type, Elasticsearch creates a set of n-gram subfields in addition to the root field (title), as shown in the table below:
Because these subfields are created for us automatically, searching on the field returns typeahead suggestions, as the n-grams help produce them effectively.
Let’s create the search query as shown in the listing below.
# Searching in a search_as_you_type field and its subfields
GET tech_books/_search
{
"query": {
"multi_match": {
"query": "in",
"type": "bool_prefix",
"fields": [
"title",
"title._2gram",
"title._3gram"
]
}
}
}
This query should return the Elasticsearch in Action and Elastic Stack in Action books. We use a multi-match query because we are searching for a value across multiple fields: title, title._2gram, and title._3gram (a further subfield, title._index_prefix, is also created but is not listed in this query).
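The bool_prefix type treats the last term of the input as a prefix and every earlier term as a complete word, combining them as optional clauses. The sketch below approximates that behaviour over the three titles in plain Python. It is a deliberate simplification under my own assumptions: bool_prefix_match is a hypothetical function, it lowercases and splits on whitespace instead of running an analyzer, and it ignores scoring and the n-gram subfields.

```python
def bool_prefix_match(title: str, query: str) -> bool:
    """Earlier query terms match whole words; the last term matches as a prefix.

    Clauses are optional (OR), so one matching clause is enough.
    """
    words = title.lower().split()
    terms = query.lower().split()
    if not terms:
        return False
    *complete, last = terms
    whole_word_hit = any(term in words for term in complete)
    prefix_hit = any(word.startswith(last) for word in words)
    return whole_word_hit or prefix_hit


# The three titles indexed above
titles = [
    "Elasticsearch in Action",
    "Elasticsearch for Java Developers",
    "Elastic Stack in Action",
]

suggestions = [title for title in titles if bool_prefix_match(title, "in")]
partial = [title for title in titles if bool_prefix_match(title, "elas")]
```

With the input "in", only the two titles containing a word that starts with "in" are suggested, while a shorter input such as "elas" matches all three.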
Phew, that was a long one! In this article, we learned about advanced datatypes such as object, nested, and flattened, as well as others like geo_point and search_as_you_type. Please refer to the book’s chapter for further details on the other datatypes, in-depth discussions, and code examples.
These short articles are condensed excerpts taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository.