Ever since NoSQL databases came into vogue, we hear more and more about eventual consistency. I want to to try and explain not only the difficulties that eventually consistent databases raise, but also why in some cases they can’t be avoided.
To simplify the discussion, I’ll narrow it down to key-value databases, i.e., databases that support (at least) two operations: read(k) – which returns the value identified by “k”, and write(k, v) which stores that value “v” so that it is identified by “k”. Pretty much every database has some sort of support for key-value in it (including most NoSQL databases and all SQL databases).
Arguably the most important property we expect a database to provide is “durability”, i.e., when something is written to the database we expect it to be there. That is, if we executed a write(k, v) operation and afterwards we executed a read(k) operation, we expect the result to be “v”. More precisely, we expect read(k) to return the value “v” from the last write(k, v) operation that was executed successfully.
A single machine DB can implement this property easily: write the data to disk, and serve it from disk (possibly with caching). For a multi-machine DB it is much more complicated to achieve such behavior, which is one of the reasons that “eventual consistency” started to appear in database systems.
Eventual consistency means that if a write(k, v) operation is completed successfully, then eventually all read(k) operations will return the value “v” (assuming that no new write(k, v’) operations are executed in between). However, until this happens, read(k) operations may return previous values. For example, suppose the following sequence of operations is executed:
- write(7, “A”)
- write(7, “B”)
In an eventual consistent database, the result of the last operation can be either “A” or “B” (or even an older value, if one exists). In a strongly consistent database, there is only one valid result, which is “B”.
Let’s consider an additional read(7), so the total sequence is:
- write(7, “A”)
- write(7, “B”)
In an eventual consistent database, all four possible results are legal. I.e.,
- line 3 returning “A” and line 4 returning “A”
- line 3 returning “A” and line 4 returning “B”
- line 3 returning “B” and line 4 returning “A”
- line 3 returning “B” and line 4 returning “B”
The most surprising of these possibilities is when line 3 returns “B” and line 4 returns “A”. It is surprising because we expect the database to be durable: if we write something, then it should stay there. We accept the fact that it might take time until it is there, but if we performed a read, and received the value, how could it suddenly disappear? (as a side note, the above example is why I think that the terms strong / eventual consistency are slightly misleading. It would might be better to term them strong / eventual durability, or perhaps something totally different).
Why Eventual Consistency?
Clearly, it is much easier to work with strongly consistent databases (when your write finishes, the data is there!), so why do we even have databases with eventual consistency? The answer is simple: eventual consistent databases can implement their read/write operations at a lower latency than strongly consistent databases.
One of the main reasons to have scalable databases is to support the load that can be created by high volumes of internet traffic. Usually such traffic volume requires many concurrent low latency database reads. These two properties (many concurrent requests and low latency) inevitably lead to eventual consistency.
To support the high load, we’ll eventually have to have more than one machine. Let’s call our two machines MA and MB. Moreover, since we have many concurrent read requests, they will have to be split between MA and MB; that is, MA can’t handle all the read requests by itself. In addition, requiring low latency means that if a machine receives a read operation, it should answer that read operation immediately. In other words, if MA receives a read operation it can’t contact MB before responding (same goes for MB, which can’t contact MA).
So what do we have so far? We have two machines, serving read requests. Once a read request is being processed, the processing machine can’t contact any other machine, to ensure the request is answered as quickly as possible. Now consider that at the beginning the value of key “k” is “v1” (i.e., the write operation(k, v1) has been executed), and that a new write operation is executed – write(k, v2). Eventually, both MA and MB need to be updated with the fact that key “k” is now associated with value “v2”. This can require some exchange of messages, but there is some “last message” MA receives which syncs MA’s key “k” value to be “v2”; let’s call that message msgA. Similarly, there is such message msgB, which convinces MB that it should update key “k” to “v2”. Notice that msgA and msgB can be very simple messages; for example, the protocol could be “when receiving write(x, y) notify the other machines of x and y”, and for such a protocol msgA/msgB would just be the pair (x, y).
Now we get to the punch line: if MA receives a read(k) request before it received msgA, it must respond “v1” and if it receives read(k) after it received msgA, it must respond “v2”. The same goes to MB and msgB.
If we have two read(k) requests (even from the same client), the network delays could be arranged so that each of the four possible outcomes will manifest itself. For example, to receive “v2” for the first read(k) and “v1” for the second read(k) – let’s assume MA receives the first read(k) and MB receives the second one – the network delays should be such that MA receives msgA before it receives the read(k) request, and MB should receive the read(k) request before it receives msgB. Similarly, any other combination is possible.
Notice that the above behavior is independent of the protocol used to write the data (i.e., it would even occur in advanced protocols such as Paxos, or Two Phase Commit), so long as the reading protocol has two properties: a) more than one machine can respond to read requests, and b) the responding machine can’t contact other machines before it finishes responding.
To sum up, in any system where we have more than one machine processing read requests, and while processing a read request the machine can’t contact other machines – the best we can hope for is eventual consistency. In other words, if high scale and very low latency are both critical, you’ll have to build your system so that it handles eventual consistency properly.
Academic Side Tour
(This section can be skipped, without affecting the understand of the post). A different formulation – from the distributed algorithms community – considers a basic building block called “shared register”. A shared register is an abstraction that supports two operations: read and write(v), where write(v) writes “v” into the register and read returns the value from the register. Such a register can be thought of as a key-value store with a single key, which is omitted. Alternatively, you can think of a key-value store as composed of a register per key-value pair.
Using the shared register abstraction, there are three different kinds of registers:
- Safe: if a read is performed without any write overlapping it, then it returns the value from the last write. However, if the read overlaps a write, there is no guarantee what the result will be (it could be anything).
- Regular: same as safe, with the additional requirement that if a read overlaps a write, then the result is either the result of the current write, of the previous write.
- Atomic: same as regular, with the additional requirement that if a read returned some value, then all future reads will return that value or a newer one. In other words, if two reads overlap a write, it is not possible that the first one will return the newer value, and the second one will return the older value.
There is a theoretical result (informally) stating that a a read operation on an atomic register must also write. That is, you can’t ensure an atomic read without an additional round of communication (a write).
Using the above definitions, eventual consistency is similar to a regular register, and strong consistency is similar to an atomic register. Pushing this analogy a bit further, the read-must-write proposition informally means that strong consistency must have an additional round of communications. I.e., it can’t process a read request without contacting additional machines.