Prefix Queries

At times, we want to search for words by their prefix, such as Leo for Leonardo or Mar for Marlon Brando, Mark Hamill, or Martin Balsam. Elasticsearch provides a prefix query that fetches documents whose field value begins with a given string (the prefix), as the following listing shows:

GET movies/_search
{
  "query": {
    "prefix": {
      "actors.original": {
        "value": "Mar"
      }
    }
  }
}

The query fetches three movies with the actors Marlon, Mark, and Martin when we search for the prefix Mar. Note that we run the prefix query on the actors.original field, which is a keyword data type.
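As a rough mental model of what the query does on a keyword field, the following Python sketch applies a "starts with" test to a handful of exact values. The prefix_match function and the sample list are hypothetical and for illustration only; Elasticsearch consults its own term dictionary rather than scanning a list like this.

```python
# Conceptual model only: the matching rule is a "starts with" test on the
# exact value stored in the keyword field. This is not how the engine is
# implemented internally.

def prefix_match(values, prefix):
    """Return the stored values that begin with the given prefix."""
    return [v for v in values if v.startswith(prefix)]

# Hypothetical sample of exact values held in actors.original.
actors = [
    "Marlon Brando",
    "Mark Hamill",
    "Martin Balsam",
    "Leonardo DiCaprio",
    "Al Pacino",
]

matches = prefix_match(actors, "Mar")
print(matches)
```

The sketch compares the strings as stored, so a lowercase prefix such as mar finds nothing in this sample.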

Note : The prefix query is expensive to run and can destabilize the cluster at times.

We don’t need to wrap the value in an object at the field level. Instead, we can use a shortened form, as the next listing demonstrates:

Listing : Shortened form of the prefix query

GET movies/_search
{
  "query": {
    "prefix": {
      "actors.original": "Leo"
    }
  }
}
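The short and long forms describe the same query. To make the equivalence concrete, here is a small Python sketch that rewrites the short form into the long one. The expand_prefix_clause helper is hypothetical and is not part of any Elasticsearch client.

```python
def expand_prefix_clause(clause):
    """Rewrite {"field": "text"} as {"field": {"value": "text"}}.

    A clause already in the long form is returned unchanged.
    """
    return {
        field: spec if isinstance(spec, dict) else {"value": spec}
        for field, spec in clause.items()
    }

# The body of the "prefix" block in each form.
short_form = {"actors.original": "Leo"}
long_form = {"actors.original": {"value": "Leo"}}

expanded = expand_prefix_clause(short_form)
print(expanded)
```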

To see which fields matched, we can add a highlight block to the prefix query. It accentuates one or more matching fields, as the following listing shows.

Listing : Prefix search with highlighting

GET movies/_search
{
  "_source": false,
  "query": {
    "prefix": {
      "actors.original": {
        "value": "Mar"
      }
    }
  },
  "highlight": {
    "fields": {
      "actors.original": {}
    }
  }
}

We suppress the source in the response ("_source": false), so the results in the following snippet show only the highlights where the prefix matched:

"hits" : [{
  ..
  "highlight" : {
    "actors.original" : ["<em>Marlon Brando</em>"]
  }
},
{
  ..
  "highlight" : {
    "actors.original" : ["<em>Martin Balsam</em>"]
  }
},
{
  ..
  "highlight" : {
    "actors.original" : ["<em>Mark Hamill</em>"]
  }
}]
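On the client side, we usually want the matched values without the markup. The Python sketch below pulls the highlighted fragments out of a trimmed copy of the hits above and strips the em tags. The highlighted_values function is a hypothetical helper, and the fields elided with ".." in the response are left out of the sample.

```python
import re

# Trimmed copy of the hits shown above; elided fields are omitted.
hits = [
    {"highlight": {"actors.original": ["<em>Marlon Brando</em>"]}},
    {"highlight": {"actors.original": ["<em>Martin Balsam</em>"]}},
    {"highlight": {"actors.original": ["<em>Mark Hamill</em>"]}},
]

def highlighted_values(hits, field):
    """Collect the highlighted fragments for a field, with <em> tags removed."""
    values = []
    for hit in hits:
        # A hit without a highlight for this field contributes nothing.
        for fragment in hit.get("highlight", {}).get(field, []):
            values.append(re.sub(r"</?em>", "", fragment))
    return values

names = highlighted_values(hits, "actors.original")
print(names)
```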

As discussed earlier, prefix queries put extra computational strain on the cluster. Fortunately, there is a way to speed them up, discussed in the next section.

Speeding up prefix queries

Prefix queries are slow because the engine has to derive the results from a prefix (the leading letters of a word) at query time. There is, however, a mechanism to speed them up: the index_prefixes parameter on the field.

We can set the index_prefixes parameter on a field when developing the mapping schema. For example, the mapping definition in the following listing sets index_prefixes on the title field (remember, title is a text data type) of a new index, boxoffice_hit_movies, that we create for this exercise.

Listing : A new index with the index_prefixes parameter

PUT boxoffice_hit_movies
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "index_prefixes": {}
      }
    }
  }
}

As the listing shows, the sole title property includes an additional property, index_prefixes. This tells the engine to build the field’s prefixes during indexing and store them. For example, we can index a new document as shown in the following snippet:

PUT boxoffice_hit_movies/_doc/1
{
  "title": "Gladiator"
}

Because we set index_prefixes on the title field in the mapping above, Elasticsearch indexes the prefixes with a minimum character size of 2 and a maximum character size of 5 by default. This way, when we run the prefix query, it doesn’t need to calculate the prefixes. Instead, it picks them up from storage.
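To picture which prefixes the default sizes yield for Gladiator, consider the following Python sketch. It is an illustration under two assumptions: the text field uses the default analyzer, which lowercases the term, and both size bounds are inclusive. The index_prefixes function here is a hypothetical stand-in, not the engine’s code.

```python
def index_prefixes(term, min_chars=2, max_chars=5):
    """List the prefixes of a term with a length from min_chars to max_chars, inclusive."""
    longest = min(max_chars, len(term))
    return [term[:n] for n in range(min_chars, longest + 1)]

# Assumption: the default analyzer lowercases "Gladiator" before indexing.
term = "Gladiator".lower()

prefixes = index_prefixes(term)
print(prefixes)  # ['gl', 'gla', 'glad', 'gladi']
```

A prefix query for gla on this field can then be answered from the stored prefixes rather than computed at search time.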

Of course, we can change the default minimum and maximum sizes of the prefixes that Elasticsearch creates during indexing. We do so by setting min_chars and max_chars on the index_prefixes object, as the following listing demonstrates.

Listing : Custom character length settings for index_prefixes

PUT boxoffice_hit_movies_custom_prefix_sizes
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "index_prefixes": {
          "min_chars": 4,
          "max_chars": 10
        }
      }
    }
  }
}

In the listing, we ask the engine to precreate prefixes with a minimum length of 4 characters and a maximum length of 10 characters. Note that min_chars must be greater than 0, and max_chars must be less than 20. This way, we can customize the prefixes that Elasticsearch creates during indexing.
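If we generate such mappings from application code, it can help to check the two limits before sending the request. The following Python sketch builds the mapping body and rejects out-of-range sizes. The prefix_mapping helper is hypothetical, and its check that min_chars does not exceed max_chars is a sanity check added by this sketch rather than a rule stated above.

```python
def prefix_mapping(field, min_chars=2, max_chars=5):
    """Build a mappings body for a text field with index_prefixes."""
    if min_chars <= 0:
        raise ValueError("min_chars must be greater than 0")
    if max_chars >= 20:
        raise ValueError("max_chars must be less than 20")
    if min_chars > max_chars:
        # Sanity check added by this sketch, not a rule quoted in the text.
        raise ValueError("min_chars must not exceed max_chars")
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "index_prefixes": {
                        "min_chars": min_chars,
                        "max_chars": max_chars,
                    },
                }
            }
        }
    }

body = prefix_mapping("title", min_chars=4, max_chars=10)
print(body["mappings"]["properties"]["title"]["index_prefixes"])
```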

These short articles are condensed excerpts taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository.
