ElasticSearch 0.90.1: Updates in bulk API

Now that ElasticSearch 0.90.0 has been released, it is time to look at the features that will become part of the upcoming 0.90.1 and the big 1.0 release. The first thing that caught our attention is the possibility of including partial document updates in a Bulk API request. So, in addition to the standard index and delete commands, ElasticSearch 0.90.1 will introduce the update command.

Example Data

Let’s use a simplified version of the data that we used when we looked at the Rescore functionality (we store it in the bulk.json file):

{ "index": {"_index": "library", "_type": "book", "_id": "1"}}
{ "title": "All Quiet on the Western Front","otitle": "Im Westen nichts Neues","author": "Erich Maria Remarque","year": 1929,"tags": ["novel"],"copies": 1, "available": true, "section" : 3}
{ "index": {"_index": "library", "_type": "book", "_id": "2"}}
{ "title": "Catch-22","author": "Joseph Heller","year": 1961,"tags": ["novel"],"copies": 6, "available" : false, "section" : 1}
{ "index": {"_index": "library", "_type": "book", "_id": "3"}}
{ "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"tags": [],"copies": 0, "available" : false, "section" : 12}
{ "index": {"_index": "library", "_type": "book", "_id": "4"}}
{ "title": "Crime and Punishment","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true}

We’ll use the example data to illustrate how updates in bulk requests can be used, but for now let’s use the following command to index the above data:

$curl -s -XPOST 'localhost:9200/_bulk' --data-binary @bulk.json
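If the data comes from an application rather than from a hand-written file, the same bulk body can be generated. The following is only a minimal sketch: the build_bulk_index_body function and the shortened book documents are our own illustration and are not part of ElasticSearch or any client library.

```python
import json

def build_bulk_index_body(index, doc_type, docs):
    """Return a newline-delimited bulk body that indexes each (id, source) pair."""
    lines = []
    for doc_id, source in docs:
        # First line of the pair: the action and the document's coordinates.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Second line of the pair: the document itself.
        lines.append(json.dumps(source))
    # Every line of a bulk body, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"

# Shortened versions of two of the example books (illustrative only).
books = [
    ("1", {"title": "All Quiet on the Western Front", "year": 1929}),
    ("2", {"title": "Catch-22", "year": 1961}),
]
body = build_bulk_index_body("library", "book", books)
print(body, end="")
```

The resulting string can be written to bulk.json and sent with the curl command shown above.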

Updating documents

Let’s assume that the book documents in our library change from time to time and that until now we handled those changes with the Update API. However, the updates were sent one by one, which took some time. With the new Bulk API extension in 0.90.1 we can prepare the following bulk request (we store it in the bulk_update.json file):

{ "update": {"_index": "library", "_type": "book", "_id": "2"}}
{ "doc":{ "updated": true }}
{ "update": {"_index": "library", "_type": "book", "_id": "3"}}
{ "doc":{ "updated": true }}

As with any other Bulk API operation, each update consists of two lines. The first line tells ElasticSearch what type of operation will be performed (update in our case) and which document should be updated: which index it belongs to, which type it has and what its identifier is.

The second line carries the information about the fields we want to add to our documents. In our case we want to add the field named updated with the value of true to the documents with identifiers 2 and 3. In order to do that we send the following command:

$curl -s -XPOST 'localhost:9200/_bulk' --data-binary @bulk_update.json
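The two-line structure described above is easy to generate programmatically. Here is a small sketch, assuming a hypothetical helper named build_bulk_update_body (our own name, not an ElasticSearch API) that takes a mapping from document identifier to the fields we want to add:

```python
import json

def build_bulk_update_body(index, doc_type, partial_docs):
    """Build a bulk body with one partial-document update per identifier."""
    lines = []
    for doc_id, fields in partial_docs.items():
        # Action line: operation type plus index, type and identifier.
        lines.append(json.dumps(
            {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        # Payload line: the partial document wrapped in the "doc" object.
        lines.append(json.dumps({"doc": fields}))
    return "\n".join(lines) + "\n"

update_body = build_bulk_update_body(
    "library", "book", {"2": {"updated": True}, "3": {"updated": True}})
print(update_body, end="")
```

The output has the same structure as the contents of the bulk_update.json file.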

In order to check if our documents were updated, let’s run the following request:

$curl -XGET 'localhost:9200/library/book/3'

The response would be as follows:

{
 "_index" : "library",
 "_type" : "book",
 "_id" : "3",
 "_version" : 2,
 "exists" : true, "_source" : {"title":"The Complete Sherlock Holmes","author":"Arthur Conan Doyle","year":1936,"tags":[],"copies":0,"available":false,"section":12,"updated":true}
}

As you can see the _version of the document was increased and the _source contains the newly introduced field.
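To make the behaviour easier to reason about, we can model it locally. The sketch below is only an approximation of what we observed in the response: apply_partial_update is a made-up function, and it performs a shallow merge, which is enough for the flat fields used in this example.

```python
def apply_partial_update(stored, partial):
    """Model a partial update: merge the new fields and bump the version."""
    merged = dict(stored["_source"])
    merged.update(partial)  # shallow merge, sufficient for flat fields
    return {"_version": stored["_version"] + 1, "_source": merged}

# Shortened version of the stored document with identifier 3.
stored = {
    "_version": 1,
    "_source": {"title": "The Complete Sherlock Holmes", "available": False},
}
result = apply_partial_update(stored, {"updated": True})
print(result["_version"], result["_source"])
```

The original fields stay in place, the new field is added and the version grows by one, just like in the response above.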

Updating document fields

What if we would like to change the values of existing document fields? That is also possible, in the same way as with the Update API: by using a script. For example, let’s set the available field to true for the documents we updated above. We can do it by sending a bulk request with the following contents:

{ "update": {"_index": "library", "_type": "book", "_id": "2"}}
{ "script" : "ctx._source.available = param1", "params" : {"param1" : true}}
{ "update": {"_index": "library", "_type": "book", "_id": "3"}}
{ "script" : "ctx._source.available = param1", "params" : {"param1" : true}}

As you can see instead of the doc object we’ve used the script one. We’ve provided the script to change the available field and in addition to that we’ve provided parameters in the params object.
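Keeping the values in params means that the script text stays the same for every document and only the parameters change. A minimal sketch of generating such a request follows; build_scripted_updates is a hypothetical helper of ours, not something ElasticSearch provides:

```python
import json

def build_scripted_updates(index, doc_type, script, params_by_id):
    """Build a bulk body that runs one script with per-document parameters."""
    lines = []
    for doc_id, params in params_by_id.items():
        lines.append(json.dumps(
            {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        # The script text is shared; only the params object differs per document.
        lines.append(json.dumps({"script": script, "params": params}))
    return "\n".join(lines) + "\n"

script_body = build_scripted_updates(
    "library", "book",
    "ctx._source.available = param1",
    {"2": {"param1": True}, "3": {"param1": True}},
)
print(script_body, end="")
```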

In order to check if our documents were updated, let’s run the following request:

$curl -XGET 'localhost:9200/library/book/3'

The response would be as follows:

{
 "_index" : "library",
 "_type" : "book",
 "_id" : "3",
 "_version" : 3,
 "exists" : true, "_source" : {"title":"The Complete Sherlock Holmes","author":"Arthur Conan Doyle","year":1936,"tags":[],"copies":0,"available":true,"section":12,"updated":true}
}

Again, the _version of the document was increased and the _source contains the updated available field.

Upsert

Just like with the standard Update API, if a document we want to update doesn’t exist we can use the upsert object to create it. For example, if we would like ElasticSearch to create a document with the identifier 5 and the title Mastering ElasticSearch when no such document exists, we would send a bulk request with the following contents:

{ "update": {"_index": "library", "_type": "book", "_id": "5"}}
{ "script" : "ctx._source.available = param1", "params" : {"param1" : false}, "upsert": {"title": "Mastering ElasticSearch", "available": false}}

Let’s do a simple test before running the above bulk request. We run the following command:

$curl -XGET 'localhost:9200/library/book/5'

The response will show us that the document doesn’t exist:

{
 "_index" : "library",
 "_type" : "book",
 "_id" : "5",
 "exists" : false
}

Now we run the above bulk request and then run the following command again:

$curl -XGET 'localhost:9200/library/book/5'

And the response to the above call would be as follows:

{
 "_index" : "library",
 "_type" : "book",
 "_id" : "5",
 "_version" : 1,
 "exists" : true, "_source" : {"title":"Mastering ElasticSearch","available":false}
}

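As we can see, the document was created. Because it didn’t exist before, ElasticSearch indexed the contents of the upsert object as the new document and the script was not executed.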
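The decision ElasticSearch makes here can be sketched with a small local model. Everything in it is our own illustration: update_or_upsert is a made-up function, a plain dictionary stands in for the index and a Python function stands in for the script.

```python
def update_or_upsert(store, doc_id, script, params, upsert):
    """Run the script on an existing document, or create it from upsert."""
    if doc_id in store:
        script(store[doc_id], params)
        return "updated"
    # Missing document: the upsert contents become the new document
    # and the script is not run.
    store[doc_id] = dict(upsert)
    return "created"

def set_available(source, params):
    # Python stand-in for the script "ctx._source.available = param1".
    source["available"] = params["param1"]

library = {}
upsert_doc = {"title": "Mastering ElasticSearch", "available": False}
first = update_or_upsert(library, "5", set_available, {"param1": False}, upsert_doc)
second = update_or_upsert(library, "5", set_available, {"param1": True}, upsert_doc)
print(first, second, library["5"])
```

The first call creates the document from the upsert object, while the second one finds it and applies the script.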

Retrying on conflicts

In addition to all that, we can add the _retry_on_conflict property to the first line of an update (the one that specifies the index, type and identifier of the document) to tell ElasticSearch how many times it should retry the update when a version conflict occurs. For example, look at the following bulk request contents:

{ "update": {"_index": "library", "_type": "book", "_id": "2", "_retry_on_conflict": 2}}
{ "doc":{ "updated": true }}
{ "update": {"_index": "library", "_type": "book", "_id": "3"}}
{ "doc":{ "updated": true }}

If the update of the document with identifier 2 failed because of a version conflict, ElasticSearch would retry it at most two times.
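To illustrate what retrying on a version conflict means, here is a simplified local model of the read, modify and write cycle. It is only a sketch under our own assumptions and not the way ElasticSearch is implemented: the InMemoryDoc class, the VersionConflict exception and the before_write hook (used to simulate a concurrent writer) are all invented for this example.

```python
class VersionConflict(Exception):
    """Raised when the stored version differs from the version that was read."""

class InMemoryDoc:
    def __init__(self, source):
        self.version = 1
        self.source = dict(source)

    def read(self):
        return self.version, dict(self.source)

    def write(self, source, expected_version):
        if expected_version != self.version:
            raise VersionConflict()
        self.source = source
        self.version += 1
        return self.version

def update_with_retries(doc, change, retry_on_conflict, before_write=None):
    """Apply change to the document, retrying after version conflicts."""
    retries = 0
    while True:
        version, source = doc.read()
        change(source)
        if before_write is not None:
            before_write(retries)  # hook that lets us simulate another writer
        try:
            return doc.write(source, expected_version=version)
        except VersionConflict:
            if retries >= retry_on_conflict:
                raise  # retry budget used up, report the conflict
            retries += 1

doc = InMemoryDoc({"title": "Catch-22"})

def interfere(retries):
    # Another writer modifies the document during the first attempt only.
    if retries == 0:
        doc.write(dict(doc.source, copies=5), expected_version=doc.version)

final_version = update_with_retries(
    doc, lambda source: source.update(updated=True), 2, before_write=interfere)
print(final_version, doc.source)
```

In this model the first attempt fails because the document changed after it was read, and the retry reads the fresh version and succeeds. With retry_on_conflict set to 0 the same conflict would be reported as a failure.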
