When updating a document with the index API, we read the original document, make our changes, and then reindex the whole document in one go. The most recent indexing request wins: whichever document was indexed last is the one stored in Elasticsearch. If somebody else had changed the document in the meantime, their changes would be lost.
Many times, this is not a problem. Perhaps our main data store is a relational database, and we just copy the data into Elasticsearch to make it searchable. Perhaps there is little chance of two people changing the same document at the same time. Or perhaps it doesn’t really matter to our business if we lose changes occasionally.
But sometimes losing a change has serious consequences. Imagine that we’re using Elasticsearch to store the number of widgets that we have in stock in our online store. Every time that we sell a widget, we decrement the stock count in Elasticsearch.
One day, management decides to have a sale. Suddenly, we are selling several widgets every second. Imagine two web processes, running in parallel, both processing the sale of one widget each, as shown in "Consequence of no concurrency control".
The change that web_1 made to the stock_count has been lost because web_2 is unaware that its copy of the stock_count is out-of-date. The result is that we think we have more widgets than we actually do, and we’re going to disappoint customers by selling them stock that doesn’t exist.
The more frequently that changes are made, or the longer the gap between reading data and updating it, the more likely it is that we will lose changes.
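The lost update can be reproduced without Elasticsearch at all. The following is a minimal sketch, assuming a plain dict stands in for the stored document and that both reads happen before either write. The names store, web_1_copy, and web_2_copy are illustrative and not part of any API.

```python
# A plain dict stands in for the document stored in Elasticsearch
# (an assumption for illustration; no real cluster is involved).
store = {"stock_count": 100}

# Both web processes read the document before either one writes it back.
web_1_copy = dict(store)
web_2_copy = dict(store)

# web_1 sells one widget and reindexes the whole document.
web_1_copy["stock_count"] -= 1
store = web_1_copy

# web_2 does the same, but starts from its stale copy.
web_2_copy["stock_count"] -= 1
store = web_2_copy

# Two widgets were sold, yet the stored count dropped by only one.
print(store["stock_count"])  # prints 99, not 98
```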
In the database world, two approaches are commonly used to ensure that changes are not lost when making concurrent updates:
- Pessimistic concurrency control: Widely used by relational databases, this approach assumes that conflicting changes are likely to happen and so blocks access to a resource in order to prevent conflicts. A typical example is locking a row before reading its data, ensuring that only the thread that placed the lock is able to make changes to the data in that row.
- Optimistic concurrency control: Used by Elasticsearch, this approach assumes that conflicts are unlikely to happen and doesn’t block operations from being attempted. However, if the underlying data has been modified between reading and writing, the update will fail. It is then up to the application to decide how it should resolve the conflict. For instance, it could reattempt the update, using the fresh data, or it could report the situation to the user.
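To make the optimistic approach concrete, here is a small sketch of a versioned store that never locks anything and simply rejects a write whose expected version is stale. VersionedStore and VersionConflict are hypothetical names invented for this example; they imitate the behavior described above and are not Elasticsearch classes.

```python
class VersionConflict(Exception):
    """Raised when the caller's version no longer matches the stored one."""


class VersionedStore:
    """Toy in-memory store; a stand-in for illustration, not Elasticsearch."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Nothing is locked; the write is rejected only if the data moved on.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"current [{current_version}], provided [{expected_version}]"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(source))
        return new_version


store = VersionedStore()
store.put("widget", {"stock_count": 100})  # creates version 1

version_a, doc_a = store.get("widget")  # first writer reads version 1
version_b, doc_b = store.get("widget")  # second writer reads version 1

# The first write matches version 1 and succeeds, producing version 2.
store.put("widget", {"stock_count": doc_a["stock_count"] - 1}, version_a)

try:
    store.put("widget", {"stock_count": doc_b["stock_count"] - 1}, version_b)
    conflict = False
except VersionConflict:
    conflict = True  # the stale write is rejected instead of silently winning
```

The second writer now knows its copy was out of date and can decide whether to retry with fresh data or report the problem.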
Elasticsearch is distributed. When documents are created, updated, or deleted, the new version of the document has to be replicated to other nodes in the cluster. Elasticsearch is also asynchronous and concurrent, meaning that these replication requests are sent in parallel, and may arrive at their destination out of sequence. Elasticsearch needs a way of ensuring that an older version of a document never overwrites a newer version.
When we discussed index, get, and delete requests previously, we pointed out that every document has a _version number that is incremented whenever a document is changed. Elasticsearch uses this _version number to ensure that changes are applied in the correct order. If an older version of a document arrives after a new version, it can simply be ignored.
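The idea of ignoring an older version that arrives late can be sketched as follows. This is a deliberate simplification, assuming a dict plays the role of a replica; apply_replicated is a hypothetical helper and does not reflect how Elasticsearch implements replication internally.

```python
def apply_replicated(replica, doc_id, version, source):
    """Apply a replicated write unless the replica already holds a newer version."""
    current = replica.get(doc_id)
    if current is not None and current["_version"] >= version:
        return False  # stale or duplicate change: ignore it
    replica[doc_id] = {"_version": version, "_source": source}
    return True


replica = {}
# Version 2 overtakes version 1 on the network and arrives first.
applied_v2 = apply_replicated(replica, "1", 2, {"text": "second edit"})
applied_v1 = apply_replicated(replica, "1", 1, {"text": "first edit"})
```

After both calls the replica still holds version 2, because the late arrival of version 1 was discarded.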
We can take advantage of the _version number to ensure that conflicting changes made by our application do not result in data loss. We do this by specifying the version number of the document that we wish to change. If that version is no longer current, our request fails.
Let’s create a new blog post:
PUT /website/blog/1/_create
{
"title": "My first blog entry",
"text": "Just trying this out..."
}
The response body tells us that this newly created document has _version number 1. Now imagine that we want to edit the document: we load its data into a web form, make our changes, and then save the new version.
First we retrieve the document:
GET /website/blog/1
The response body includes the same _version number of 1:
{
"_index" : "website",
"_type" : "blog",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"title": "My first blog entry",
"text": "Just trying this out..."
}
}
Now, when we try to save our changes by reindexing the document, we specify the version to which our changes should be applied:
PUT /website/blog/1?version=1 (1)
{
"title": "My first blog entry",
"text": "Starting to get the hang of this..."
}
(1) We want this update to succeed only if the current _version of this document in our index is version 1.
This request succeeds, and the response body tells us that the _version has been incremented to 2:
{
"_index": "website",
"_type": "blog",
"_id": "1",
"_version": 2,
"created": false
}
However, if we were to rerun the same index request, still specifying version=1, Elasticsearch would respond with a 409 Conflict HTTP response code, and a body like the following:
{
"error": {
"root_cause": [
{
"type": "version_conflict_engine_exception",
"reason": "[blog][1]: version conflict, current [2], provided [1]",
"index": "website",
"shard": "3"
}
],
"type": "version_conflict_engine_exception",
"reason": "[blog][1]: version conflict, current [2], provided [1]",
"index": "website",
"shard": "3"
},
"status": 409
}
This tells us that the current _version number of the document in Elasticsearch is 2, but that we specified that we were updating version 1.
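An application may want to pull those two numbers out of the error body, for instance to log them. The sketch below parses the reason string exactly as it appears in the response above. The wording of that string may differ between Elasticsearch versions, so the pattern is an assumption, and conflict_versions is a hypothetical helper.

```python
import re

# Pattern matching the "reason" string shown above. The exact wording may
# differ between Elasticsearch versions, so treat this as an assumption.
CONFLICT_RE = re.compile(r"current \[(\d+)\], provided \[(\d+)\]")


def conflict_versions(error_body):
    """Return (current, provided) from a 409 response body, or None."""
    if error_body.get("status") != 409:
        return None
    match = CONFLICT_RE.search(error_body["error"]["reason"])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


body = {
    "error": {
        "type": "version_conflict_engine_exception",
        "reason": "[blog][1]: version conflict, current [2], provided [1]",
    },
    "status": 409,
}
print(conflict_versions(body))  # prints (2, 1)
```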
What we do now depends on our application requirements. We could tell the user that somebody else has already made changes to the document, and ask them to review those changes before trying to save again. Alternatively, as in the case of the widget stock_count previously, we could retrieve the latest document and try to reapply the change.
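The second option, re-reading and reapplying the change, might look like the loop below. It is a sketch under stated assumptions: get_doc and put_doc are hypothetical callables standing in for the GET request and the versioned PUT request, the in-memory docs dict replaces a real cluster, and the max_attempts limit is an arbitrary choice.

```python
def decrement_stock(get_doc, put_doc, doc_id, max_attempts=5):
    """Decrement stock_count, re-reading and retrying after each conflict.

    get_doc(doc_id) returns {"_version": int, "_source": dict}.
    put_doc(doc_id, source, version) returns an HTTP-style status code.
    Both callables are hypothetical stand-ins for real requests.
    """
    for _ in range(max_attempts):
        doc = get_doc(doc_id)
        source = dict(doc["_source"])
        source["stock_count"] -= 1
        if put_doc(doc_id, source, doc["_version"]) != 409:
            return source["stock_count"]
        # 409 Conflict: somebody else changed the document; loop and re-read.
    raise RuntimeError(f"gave up after {max_attempts} conflicting attempts")


# Tiny in-memory fake so the sketch runs without a cluster.
docs = {"widget": {"_version": 1, "_source": {"stock_count": 100}}}


def get_doc(doc_id):
    doc = docs[doc_id]
    return {"_version": doc["_version"], "_source": dict(doc["_source"])}


def put_doc(doc_id, source, version):
    if docs[doc_id]["_version"] != version:
        return 409  # the caller's version is stale
    docs[doc_id] = {"_version": version + 1, "_source": source}
    return 200


remaining = decrement_stock(get_doc, put_doc, "widget")
```

Capping the number of attempts keeps a heavily contended document from trapping a process in an endless retry loop.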
All APIs that update or delete a document accept a version parameter, which allows you to apply optimistic concurrency control to just the parts of your code where it makes sense.
A common setup is to use some other database as the primary data store and Elasticsearch to make the data searchable, which means that all changes to the primary database need to be copied across to Elasticsearch as they happen. If multiple processes are responsible for this data synchronization, you may run into concurrency problems similar to those described previously.
If your main database already has version numbers (or a value such as timestamp that can be used as a version number), then you can reuse these same version numbers in Elasticsearch by adding version_type=external to the query string. Version numbers must be integers greater than zero and less than about 9.2e+18, a positive long value in Java.
The way external version numbers are handled is a bit different from the internal version numbers we discussed previously. Instead of checking that the current version is the same as the one specified in the request, Elasticsearch checks that the current version is less than the specified version. If the request succeeds, the external version number is stored as the document’s new _version.
External version numbers can be specified not only on index and delete requests, but also when creating new documents.
For instance, to create a new blog post with an external version number of 5, we can do the following:
PUT /website/blog/2?version=5&version_type=external
{
"title": "My first external blog entry",
"text": "Starting to get the hang of this..."
}
In the response, we can see that the current _version number is 5:
{
"_index": "website",
"_type": "blog",
"_id": "2",
"_version": 5,
"created": true
}
Now we update this document, specifying a new version number of 10:
PUT /website/blog/2?version=10&version_type=external
{
"title": "My first external blog entry",
"text": "This is a piece of cake..."
}
The request succeeds and sets the current _version to 10:
{
"_index": "website",
"_type": "blog",
"_id": "2",
"_version": 10,
"created": false
}
If you were to rerun this request, it would fail with the same conflict error we saw before, because the specified external version number is not higher than the current version in Elasticsearch.
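The difference between the two checks can be summarized in a few lines. This is a sketch of the rules as described in this section, not of Elasticsearch's implementation; accepts_write and MAX_VERSION are hypothetical names, and the upper bound assumes the limit is the largest positive Java long.

```python
MAX_VERSION = 2**63 - 1  # largest positive Java long, roughly 9.2e+18


def accepts_write(current_version, provided_version, version_type="internal"):
    """Decide whether a versioned write should be accepted.

    current_version is None when the document does not exist yet.
    """
    if not 0 < provided_version <= MAX_VERSION:
        raise ValueError("version must be a positive long")
    if version_type == "external":
        # External: the provided version must be strictly higher than the current one.
        return current_version is None or current_version < provided_version
    # Internal: the provided version must match the current one exactly.
    return current_version == provided_version


print(accepts_write(None, 5, "external"))  # True: creating with version 5
print(accepts_write(5, 10, "external"))    # True: 5 is less than 10
print(accepts_write(10, 10, "external"))   # False: a rerun is not higher
print(accepts_write(2, 1))                 # False: internal version mismatch
```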