Introducing Ingest Pipelines (1/3)

Part 1/3 of the Ingest Pipelines mini-article series:

  1. Introduction to Ingest Pipelines
  2. Mechanics of Ingest Pipelines
  3. Loading PDFs into Elasticsearch

Overview

Data destined for Elasticsearch often needs to undergo transformation and manipulation before it is indexed. Consider loading millions of legal documents, stored as PDF files, into Elasticsearch for searching. Bulk loading the raw files is one option, but on its own it is inadequate, cumbersome, and error prone.

If ETL (extract-transform-load) tools come to mind for such data manipulation tasks, you are absolutely right. There is a plethora of such tools, including Logstash. Logstash can certainly manipulate our data before indexing it into Elasticsearch or persisting it to a database or some other destination. However, it is not lightweight, and it needs an elaborate (if not complex) setup, preferably on a separate machine.

Ingest pipelines are a lighter alternative: a pipeline is a set of processors, declared in the same JSON syntax as the Query DSL, that is applied to incoming data to carry out the ETL work. The workflow is straightforward:

  • Create one or more pipelines encapsulating the required logic: whatever transformations, enhancements, or enrichments the business requirements dictate (see the sketch after this list)
  • Invoke the pipelines on the incoming data; the data passes through the pipeline’s processors in series, getting manipulated at every stage
  • Index the processed data
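
To make the first step concrete, here is a minimal sketch of creating a pipeline through the _ingest API, shown in Kibana Dev Tools console syntax. The pipeline name legal-docs-pipeline and the fields category and title are made up for illustration; set and uppercase are two of the built-in processors:

  # store a pipeline named legal-docs-pipeline (hypothetical name)
  PUT _ingest/pipeline/legal-docs-pipeline
  {
    "description": "Tags and normalises incoming legal documents",
    "processors": [
      { "set": { "field": "category", "value": "legal" } },
      { "uppercase": { "field": "title" } }
    ]
  }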

The figure below shows the workings of two independent pipelines:

Figure: Two independent pipelines, each with its own set of processors operating on the data

Here, we have created two independent pipelines with different sets of processors. These pipelines are created on, and executed by, an ingest node.
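
Every Elasticsearch node carries the ingest role by default, so pipelines work out of the box; a node can also be dedicated to pipeline work. A sketch of the relevant elasticsearch.yml setting, assuming a version that uses node.roles (7.9 or later):

  # dedicate this node to running ingest pipelines
  node.roles: [ ingest ]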

The data gets massaged as it passes through these processors before being indexed. We can invoke these pipelines during a bulk load or when indexing individual documents, as shown below.
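
Both invocation styles use the pipeline query parameter. A minimal sketch, reusing the hypothetical pipeline and index names from above:

  # index a single document through the pipeline
  PUT legal-docs/_doc/1?pipeline=legal-docs-pipeline
  { "title": "merger agreement" }

  # or run an entire bulk load through it
  POST _bulk?pipeline=legal-docs-pipeline
  { "index": { "_index": "legal-docs", "_id": "2" } }
  { "title": "lease deed" }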

A processor is a software component that performs one transformation activity on the incoming data; a pipeline is made of a series of these processors. Each processor is dedicated to a single task: it takes the input, processes it based on its logic, and spits out the processed data for the next stage. We can chain as many of these processors as the requirements dictate, and Elasticsearch provides over three dozen processors out of the box.
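
A convenient way to watch a chain of processors at work, without indexing anything, is the _simulate endpoint: it runs a pipeline definition against sample documents and returns the transformed result. A minimal sketch using the same two built-in processors as before:

  # dry-run a pipeline definition against a sample document
  POST _ingest/pipeline/_simulate
  {
    "pipeline": {
      "processors": [
        { "set": { "field": "category", "value": "legal" } },
        { "uppercase": { "field": "title" } }
      ]
    },
    "docs": [
      { "_source": { "title": "merger agreement" } }
    ]
  }

The response echoes each document after processing, with category set to "legal" and title uppercased to "MERGER AGREEMENT".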

That’s a gentle introduction to ingest pipelines. In the next article, we will go over the mechanics of these pipelines and how their processors work together to manipulate the data.

Part 1/3 of the Ingest Pipelines mini-article series:

  1. Introduction to Ingest Pipelines
  2. Mechanics of Ingest Pipelines
  3. Loading PDFs into Elasticsearch

Me @ Medium || LinkedIn || Twitter || GitHub