Loading PDFs into Elasticsearch (3/3)

Part 3/3 of the Ingest Pipelines mini-article series:

Loading PDF files

Let’s say our business requirement is to load PDF files (imagine a batch of legal documents or medical journals in PDF format) into Elasticsearch so that clients can search them. Elasticsearch lets us index PDF files using a dedicated ingest processor called the attachment processor.

The attachment processor, like any ingest processor, is used in an ingest pipeline to load attachments such as PDF files, Word documents, emails and so on. It uses the Apache Tika (https://tika.apache.org) library to extract the file data. The source data must be base64-encoded before it is loaded into the pipeline. Let’s see it in action.

If we continue with our MI5 example, we are expected to load all the secret data, which comes in the form of PDF files, into Elasticsearch. The following steps outline the process:

Define a pipeline with an attachment processor; the processor reads the base64 content of the file from a field (for our example, we name the field secret_file_data).
Convert the PDF file content into bytes and feed it to a Base64 encoding utility (using any of the tools you have at your disposal).
Invoke the pipeline for the incoming data so that the attachment processor processes it.

Let’s create a pipeline with an attachment processor; the listing below shows its definition:

PUT _ingest/pipeline/confidential_pdf_files_pipeline
{
  "description": "Pipeline to load PDF documents",
  "processors": [
    {
      "set": {
        "field": "category",
        "value": "confidential"
      }
    },
    {
      "attachment": {
        "field": "secret_file_data"
      }
    }
  ]
}

After executing the above code, the confidential_pdf_files_pipeline gets created on the cluster. The attachment processor expects the base64-encoded data of a file to be set on the secret_file_data field when a document passes through the pipeline.

Now that the pipeline is created, let’s test it. Suppose the file data is “Sunday Lunch at Konda’s” (the code language to get rid of Mr. Konda 😉) and we run a base64 encoder over it to produce the encoded form. I’ll leave it up to you to pick an encoder; in this case, U3VuZGF5IEx1bmNoIGF0IEtvbmRhJ3M= stands in for the base64-encoded PDF file carrying the one secret message (“Sunday Lunch at Konda’s”).
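If you would rather not hunt for an encoder, here is a minimal Python sketch of the encoding step using only the standard library. The helper names encode_bytes and encode_file are my own, purely illustrative; the sample encodes the message as plain text, which is what the test data in this article amounts to.

```python
import base64
from pathlib import Path


def encode_bytes(raw: bytes) -> str:
    """Return the base64 text form of raw bytes (hypothetical helper)."""
    return base64.b64encode(raw).decode("ascii")


def encode_file(path: str) -> str:
    """Read a file (a PDF, for example) as bytes and base64-encode it."""
    return encode_bytes(Path(path).read_bytes())


# The secret message from the article, with a straight apostrophe.
message = "Sunday Lunch at Konda's"
encoded = encode_bytes(message.encode("utf-8"))
print(encoded)  # U3VuZGF5IEx1bmNoIGF0IEtvbmRhJ3M=
```

For a real document, encode_file("some_document.pdf") (the path is a placeholder) would produce the string to place in the secret_file_data field.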

Testing the pipeline

POST _ingest/pipeline/confidential_pdf_files_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "op_name": "Op Konda",
        "secret_file_data":"U3VuZGF5IEx1bmNoIGF0IEtvbmRhJ3M="
      }
    }
  ]
}

As you can see, the secret_file_data field is manually set to the base64-encoded string, which is then fed to the pipeline. As per the pipeline definition, the attachment processor expects the secret_file_data field to hold the encoded data. The snippet below shows the response from the above test:

...     
      "doc": {
        ..
        
        "_source": {
          "op_name": "Op Konda",
          "category": "confidential",
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "et",
            "content": "Sunday Lunch at Konda's",
            "content_length": 24
          },
          "secret_file_data": "U3VuZGF5IEx1bmNoIGF0IEtvbmRhJ3M="
        },
        "_ingest": {
          "timestamp": "2022-11-04T23:19:05.772094Z"
        }
      }
...
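The same round trip can be sketched in Python: build the _simulate request body from raw bytes, then pull the extracted text out of a response. This is an illustration under assumptions, not the only way to do it. The function names are hypothetical, the sample response is a trimmed copy of the one above, and the top-level "docs" array (elided in the snippet) is assumed from the shape the simulate API returns. Sending the body to the cluster is left to whichever HTTP client you use.

```python
import base64
import json


def build_simulate_body(file_bytes: bytes, op_name: str) -> dict:
    """Wrap base64-encoded file bytes in the shape the _simulate API expects."""
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {
        "docs": [
            {"_source": {"op_name": op_name, "secret_file_data": encoded}}
        ]
    }


def extracted_contents(simulate_response: dict) -> list:
    """Collect attachment.content from every simulated document."""
    return [
        item["doc"]["_source"]["attachment"]["content"]
        for item in simulate_response["docs"]
    ]


body = build_simulate_body(b"Sunday Lunch at Konda's", "Op Konda")
print(json.dumps(body, indent=2))

# Trimmed copy of the response shown above, used here as sample input.
sample_response = {
    "docs": [
        {
            "doc": {
                "_source": {
                    "op_name": "Op Konda",
                    "category": "confidential",
                    "attachment": {"content": "Sunday Lunch at Konda's"},
                }
            }
        }
    ]
}
print(extracted_contents(sample_response))
```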

The response contains an additional object called attachment with a few fields, content being the text extracted from the file. The attachment also carries metadata such as content_length, language and others, and the original encoded data is still available in the secret_file_data field. We can choose which fields we wish to persist as part of the metadata.
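As a rough sketch of that last point, the attachment processor accepts a properties option listing which extracted fields to store, and a separate remove processor can drop the original base64 field once extraction is done. The Python below only assembles such a pipeline body; the build_pipeline helper and its arguments are my own illustration, and the choice of fields to keep is an assumption, so adjust it to your needs.

```python
import json


def build_pipeline(keep_properties: list, drop_encoded_field: bool) -> dict:
    """Assemble a pipeline body that stores only the chosen attachment fields."""
    processors = [
        {"set": {"field": "category", "value": "confidential"}},
        {
            "attachment": {
                "field": "secret_file_data",
                # Only these extracted properties are stored on the document.
                "properties": keep_properties,
            }
        },
    ]
    if drop_encoded_field:
        # Discard the original base64 payload after extraction.
        processors.append({"remove": {"field": "secret_file_data"}})
    return {
        "description": "Pipeline to load PDF documents",
        "processors": processors,
    }


pipeline_body = build_pipeline(["content", "content_type"], drop_encoded_field=True)
print(json.dumps(pipeline_body, indent=2))
```

The printed JSON can be used as the body of the PUT _ingest/pipeline/confidential_pdf_files_pipeline request shown earlier.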

Now that the PDF data is loaded into Elasticsearch, it is easy to search through the PDF content (which is out of scope for this article series).

Elasticsearch provides over 40 ingest processors, which cover a wide range of requirements. As you can imagine, going over all of them in this article is impractical. My advice is to check out the documentation on the topic and experiment with the code. Here’s the link to Elasticsearch’s documentation on the available ingest processors: