Just Elasticsearch: 4/n. Document APIs

This is the fourth article in a series explaining Elasticsearch as simply as possible, with practical examples. The articles are condensed versions of topics from my upcoming book Just Elasticsearch. My aim is to create a simple, straight-to-the-point, example-driven book that one can read over a weekend to get started with Elasticsearch.

All the code snippets, examples, and datasets related to this series of articles are available on my GitHub Wiki.

Overview

It is time to start seeding Elasticsearch with some data. We need to index documents so we can carry out searches, run analytics, build alerting, and so on. This article is dedicated to the document APIs: operations such as creating, retrieving, updating, and deleting documents.

Documents

A document is a unit of information that is sent to the Elasticsearch server for storage.

Indexing is the act of persisting documents into Elasticsearch. It’s the data going into Elasticsearch for query and analytical purposes. While indexing and storing a document, Elasticsearch does its magic so the data is ready for searching.

We create documents in JSON, a human-readable format where data is defined as key-value pairs. We can index JSON documents with simple structures as well as deeply nested objects, as per our data modeling needs.

Let’s say we have a requirement to build a search catalog of technical books. We create an index called books for this purpose. We represent each book, in JSON format, as shown here:

{
  "title":"Elasticsearch for Java Developers",
  "author":"Madhusudhan Konda",
  "synopsis":"A hands-on book for Java developers",
  "release_date":"2020-08-01",
  "price":9.99
}

The document is self-explanatory, but let me go over a couple of points related to mapping.

Each of the fields has a type associated with it. For example, the title, author, and synopsis fields are text fields — represented as a string data type in Elasticsearch, while the release_date is a date field and price is a number. Note we are not creating mappings upfront for the index. We are letting Elasticsearch derive or deduce the schema from the first document that we index.
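
Once this first document is indexed, you can ask Elasticsearch for the schema it deduced using the mapping API (GET books/_mapping). The exact response varies by version, but for our book document it should look roughly like this:

{
  "books" : {
    "mappings" : {
      "properties" : {
        "author" : {
          "type" : "text",
          "fields" : {
            "keyword" : { "type" : "keyword", "ignore_above" : 256 }
          }
        },
        "price" : { "type" : "float" },
        "release_date" : { "type" : "date" },
        ...
      }
    }
  }
}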

We can index a document using either the PUT or the POST method. Both these HTTP methods do the honor of inserting a new document into Elasticsearch, albeit with a subtle difference. Let’s check out indexing documents using both methods in the following sections.

Creating a Document Using PUT

We use Kibana’s DevTools to index this document using the HTTP PUT method. Head over to Kibana and write the following request:

PUT books/_doc/1
{
  "title":"Elasticsearch for Java Developers",
  "author":"Madhusudhan Konda"
}

The books in the command indicates the index, while _doc indicates the type of the document. The digit 1 is the identifier the document gets once it’s inserted. After executing the above script, check the right-hand side for the response:

{
  "_index" : "books",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}

The response is a JSON document with the _index, _type, and _id values we provided in our request. In addition to these, the response carries a few additional metadata attributes, like _version, result, and others.

The _version attribute indicates the current version of this document; a value of 1 means it’s the first version. The number will, of course, be incremented if we modify the document and reindex it. The result attribute indicates the outcome of the operation: “created” here confirms it is indeed a new document. We will visit the _shards information down the line, but that’s pretty much what you see as a response for creating a document.
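
For instance, if we re-issue the same PUT books/_doc/1 request (perhaps with a corrected author name), the response should come back with the version bumped and the result flagged as an update, roughly like this:

{
  "_index" : "books",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "result" : "updated",
  ...
}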

Creating a Document Using POST

In the last section, we used the PUT method to index a document; in this section, we index a document using the POST method.

The main difference when using POST as opposed to PUT is that POST doesn’t expect an ID to be passed in. Instead, Elasticsearch automatically generates a random UUID for the document. This is the fundamental difference between the two methods when creating a document.

Let’s index the following document, in Kibana:

POST books/_doc
{
  "title":"All About Java 8 Lambdas",
  "author":"Madhusudhan Konda"
}

The _id value in the response will be a randomly generated ID, as shown below:

{
  "_index" : "books",
  "_type" : "_doc",
  "_id" : "jOj6d3ABTpmmoDQOlC6_",
  ...
}

Except for the _id field, the rest of the information is no different from the earlier PUT request.

When To Use PUT or POST

If you want to control the identifiers, or if you already know the IDs of your documents, use the PUT method to index them. This is typically the case for domain objects that have a pre-defined identity strategy you want to adhere to.

On the other hand, assigning identifiers to streaming data or time-series events doesn’t make much sense. Imagine price quotes originating from a pricing server, system alerts from an AWS service, tweets, notifications, and so on. For these, a randomly generated UUID is good enough.
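
As a quick illustration (the system_alerts index and its fields here are invented for this example), such an event can simply be POSTed and Elasticsearch will take care of the identifier:

POST system_alerts/_doc
{
  "service":"payments",
  "severity":"critical",
  "message":"CPU usage above 90%",
  "created_at":"2020-08-01T10:15:00Z"
}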

Reading Documents

Reading documents from Elasticsearch can be carried out by any client that supports REST APIs, such as Postman/Postwoman, Advanced REST Client, cURL, or even Kibana.

We use HTTP GET to fetch a document. The format is:

GET <index_name>/<doc_type>/<id>

For example, executing the GET books/_doc/1 command on the Kibana console will fetch the previously indexed document with ID 1:

{
  "_index" : "books",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 3,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "title" : "Elasticsearch for Java Developers",
    "author" : "Madhusudhan Konda",
    ...
  }
}

The JSON response you see in the above snippet has two parts — metadata and the original document under the _source attribute. We have been through these attributes many times, so let’s not go over them again.
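
By the way, the same fetch works from any of the REST clients mentioned earlier. As a minimal sketch with cURL, assuming Elasticsearch is running locally on the default port 9200:

curl -X GET "localhost:9200/books/_doc/1?pretty"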

When retrieving a document, if it is not found, you’ll get a response with the found attribute set to false:

{
  "_index": "books",
  "_type": "_doc",
  "_id": "99",
  "found": false
}

Manipulating Responses

The response returned for a query may contain a lot of information, and the client may not be interested in receiving all of it. There are ways of manipulating these responses before they are sent to the client, as the next sections describe.

Suppressing Source Fields

The notable attribute you see in the above response is _source. This is nothing but the original input document you’ve indexed. Should you wish to fetch just the source (the original document) with no metadata, simply issue a GET <index>/_source/<id> command.

The GET books/_source/1 command returns:

{
  "title": "Elasticsearch for Java Developers",
  "author": "Madhusudhan Konda",
  ...
}

There’s no metadata in the response; it’s just the original document we had indexed.

There may be instances where the source document is heavy or contains sensitive information, and you don’t want the source to be returned when the document is retrieved. In that case, you can suppress the source by setting the _source request parameter to false in your query:

GET books/_doc/1?_source=false

The above command returns the document without the _source attachment. Not fetching the source saves some bandwidth, too.

Including/Excluding Fields

In addition to suppressing the _source field, as we saw in the last section, we can also include and/or exclude individual fields when retrieving documents.

Say we wish to fetch the title, synopsis, and price fields and suppress the others; we execute the following command:

GET books/_doc/1?_source_includes=title,synopsis,price

In a similar vein, the following command includes all prices (say we have three prices set on the book document: price_usd, price_gbp, price_eur) but excludes price_usd:

GET books/_doc/1?_source_includes=price_usd,price_gbp,price_eur&_source_excludes=price_usd

The final result will contain price_gbp and price_eur, thus removing the USD price.

We can also fetch fields using wildcards. Say a document has multiple price fields (price_usd, price_gbp, price_eur, and so on) sharing the same prefix; we can fetch all of them with a single pattern:

GET books/_doc/1?_source_includes=price*

Document Existence

How do we check whether a document exists in Elasticsearch? You can use the HTTP HEAD method to check whether a document actually exists in the Elasticsearch store.

Issuing HEAD books/_doc/1 returns 200 OK if the document exists. As per the HTTP standards, if the document is unavailable in the store, a 404 Not Found error is returned to the client.
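
In Kibana’s console, the check is a one-liner:

HEAD books/_doc/1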

Deleting Documents

There are essentially two ways to delete documents: using an ID or using a query. In the former case, we delete just a single document, while the latter deletes multiple documents in one go. When deleting by query, we can set filter criteria, for example, deleting documents whose status field is unpublished, or documents from last month, and so on. Let’s see both of these methods in action.

Delete With an ID

We can delete a document from Elasticsearch, given its identifier, using the HTTP DELETE method, specifying the index, the default type, and, of course, the ID of the document, in the form of:

DELETE <index>/<type>/<id>

The document with ID 1 gets deleted when you execute this command: DELETE books/_doc/1

The response will have a result attribute set to deleted to let the client know the document has been deleted. Note that the _version is incremented when the delete is carried out.
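
The response should look roughly like this (note the result attribute and the bumped version):

{
  "_index" : "books",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "result" : "deleted",
  ...
}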

Deleting by Query

Deleting a single document is easy and straightforward, as we’ve seen in the last section. There’s another variant, called _delete_by_query, for when you wish to delete documents in bulk. You can provide various filter criteria to this _delete_by_query endpoint. It expects a query body as a JSON document, sent with the POST method.

Here’s the usage of the _delete_by_query API, deleting all books where the author is me 🙁

POST books/_delete_by_query
{
  "query":{
    "match":{
      "author":"Madhusudhan Konda"
    }
  }
}

The body of this POST uses the Query DSL (Domain Specific Language), which accepts a variety of queries, like term, range, match, and others, similar to the search queries. We will learn more about search queries in the next few articles; for now, note that _delete_by_query is a powerful endpoint with sophisticated delete criteria. Let’s see a few examples.

Delete By Range Query

The following query uses a range query to delete books whose price is greater than $9:

POST books/_delete_by_query
{
  "query":{
    "range":{
      "price_usd":{
        "gt":9
      }
    }
  }
}

Delete All Documents

You can delete a whole set of documents from an index, too, using the match_all query (do note this query will delete all your data in the index, so execute it with caution):

POST books/_delete_by_query
{
  "query":{
    "match_all":{}
  }
}

Delete operations are irreversible, so be cautious before hitting your Elasticsearch with a delete query.

Deleting Documents Across Multiple Indices

So far, we have been running delete operations against a single index. We can also delete documents across multiple indices by simply providing a comma-separated list of indices in the API URL.

So, deleting across three book-related indices goes as shown below:

POST old_books,tech_books,classic_books/_delete_by_query
{
  "query":{
    "match_all":{}
  }
}

Updating Documents

Documents are expected to be updated; at times, even the entire document may need to be replaced. Elasticsearch provides us with an _update API for updating documents. Before we start actioning updates, there’s something we need to know about how they work.

When an update is carried out by Elasticsearch, a sequence of steps is executed:

  • The document is fetched for the given ID
  • The document is updated with the relevant fields
  • The document is re-indexed (essentially it’s replaced with the new document we are sending in).

If you think we could do the GET, modify, and re-index steps individually ourselves, you are absolutely right. Behind the scenes, this is what Elasticsearch actually does: it fetches the document, creates a new document with the existing fields plus the updated fields, and finally re-indexes it. Done manually, these would be three different calls, each resulting in a round trip between the client and the server. Elasticsearch avoids this by executing all three operations on the shard that holds the document, thus avoiding the to-and-fro traffic. Using the _update API therefore saves network calls and bandwidth.
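
To make that concrete, here’s a sketch of the manual route that the single _update call replaces; say we want to add a pages attribute (we’ll do exactly this with the _update API in a moment):

// Step 1: fetch the current document (one round trip)
GET books/_doc/1

// Step 2: re-index the merged document from the client (another round trip)
PUT books/_doc/1
{
  "title":"Elasticsearch for Java Developers",
  "author":"Madhusudhan Konda",
  "pages":200
}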

Update Scenarios

There are at least three scenarios in which updates come into the picture:

  • Adding additional fields to an existing document
  • Modifying fields of an existing document
  • Replacing the whole document

All of these operations are performed using the _update API; the format goes like this:

POST <index_name>/_update/<id>

NOTE: You can use the POST <index_name>/_doc/<id>/_update style of API too, but as types are being deprecated and are expected to be removed in version 8, avoid using this style.

Our book document is missing a pages attribute, so let’s update the document to add it. We use the _update API and, in the body of the request, wrap the new fields in a doc object:

POST books/_update/1
{
  "doc":{
    "pages":200
  }
}

To update an existing field, all we have to do is provide the new value for that field. The following snippet updates the value of the pages attribute from 200 to a new value of 350:

POST books/_update/1
{
  "doc": {
    "pages":350
  }
}

Scripted Updates

Scripted updates allow us to update a document based on conditions, such as setting a reduced price on a book during a promotion, adjusting the stock count, and so on. We use the same _update endpoint, but we write the conditional updates in a script object, which takes source as its key. Inside the source, a context variable named ctx gives us access to the document’s fields via ctx._source.<field>, as demonstrated below for updating pages:

POST books/_update/1
{
  "script":{
    "source": "ctx._source.pages = 295"
  }
}

We can add new fields this way, too. In the following example, the price_usd attribute is added to the existing document:

POST books/_update/1
{
  "script":{
    "source": "ctx._source.price_usd = 9.99"
  }
}

We can update multiple fields in one go, like the following:

POST books/_update/1
{
  "script": {
    "source": """
      ctx._source.release_year = 2020;
      ctx._source.price_usd = 9.99
    """
  }
}

The notable thing is that multi-line updates are written in a triple-quote block, with the statements separated by semicolons.

We can implement a bit more complicated logic in the script block, too. Let’s say we want to stamp the book as a top_seller if its overall rating is greater than 4.5. We can express this logic in the script shown below:

POST books/_update/1
{
  "script" : {
    "source": """
      if (ctx._source.ratings > params.ratings) {
        ctx._source.top_seller = true
      } else {
        ctx._source.top_seller = false
      }
    """,
    "lang": "painless",
    "params" : {
      "ratings" : 4.5
    }
  }
}

Elasticsearch uses a scripting language called Painless for decoding the logic and executing these scripts. You can specify other languages, like mustache or expression, if you wish to use scripts written in a language of your choice.

Replacing Documents

In the next use case, we need to replace an existing document with a new one. This is super easy: just use the same PUT request we became familiar with when indexing a new document in an earlier section.

// PUT a new document with an ISBN but with the same ID
PUT books/_doc/1
{
  "isbn":123456
}

The existing document (with this same ID) will be replaced outright by the new data after executing the above command.
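
A quick GET books/_doc/1 confirms the replacement; the _source now holds only the new field, roughly like this:

{
  "_index" : "books",
  "_type" : "_doc",
  "_id" : "1",
  ...
  "found" : true,
  "_source" : {
    "isbn" : 123456
  }
}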

Upsert

Upsert is an operation that either updates an existing document or indexes (creates) a brand-new one with the data provided. When updating a document, we can tell Elasticsearch to insert it as a new document if it doesn’t already exist.

Upsert is short for update and insert

The following example illustrates the upsert in action.

POST books/_update/100
{
  "script":{
    "source":"ctx._source.title='Just Spring 2e'"
  },
  "upsert":{
    "title":"Just Spring 2e",
    "author":"Madhusudhan Konda"
  }
}

The request body has two parts to it. The first part is what the script should do when it finds the document with ID 100 (in other words, if the document is available): in this case, the title of that document is set to “Just Spring 2e” (my upcoming updated version of the Just Spring book).

The second part of the JSON request is the interesting bit: the upsert block consists of the fields that will constitute a new document. So, when we execute this query and no document with that ID is available, the upsert block gets executed, creating a new document with the fields in that block.

If you re-run the same query, this time the script part gets executed, as there is already a document available. The title will simply be set to Just Spring 2e again.

Summary

This article is all about working with documents in Elasticsearch. We looked at indexing documents, retrieving them, and manipulating the attributes of the response. We then moved on to deleting documents, updating fields with new values, and scripted updates. Finally, we saw how to replace whole documents and how a scripted upsert works in action.

In the next article, we will go over the search fundamentals. Stay tuned!

All the code snippets and datasets are available at my GitHub Wiki.