The first time I heard of the Paxos algorithm was during my bachelor’s degree way back in 2004, when I participated in a Distributed Algorithms course. In the past few years Paxos came up multiple times, usually in the context of a robust implementation of some scalable storage system. It is almost always uttered in awe, as to say “that unbelievably complex algorithm, the Paxos. Beware!”. I’ve decided to reread the original paper, and try to explain what Paxos does, and how it does it. In a nutshell, Paxos solves the problem of resiliently replicating an increasingly growing (ordered) list of items.
The original Paxos paper discuses the legislative parliament of the island of Paxos. Paxos’ parliament holds its meetings in a chamber with very bad acoustics, so the legislators communicate using messengers. Both the legislators and the messengers are merchants, so they might need to leave the parliament chamber at any point of time. Each legislator can propose a decree to be passed, and if it is passed then it is added to the notebook each legislator keeps. A decree that was passed has an index, i.e., “137: Olive tax is 5% per ton”, which states that decree #137 is “Olive tax is 5% per ton”.
Behind the story of Paxos’ parliament hides a distributed log (sometimes called journal) resilient to node failures and message drops (legislators and messengers leaving the chamber). Paxos processes requests to add items to the log (pass a decree); an item is added at a sequentially increasing index (i.e., item A at index 1, item B at index 2, etc.), and all the nodes participating in the Paxos algorithm will eventually know that each item was added and the index it was added at (the notebook each legislator keeps).
Another way to look at Paxos’ functionality, is by considering a possible API formalization:
- add(item) – tries to add item to the log, and returns the index at which the item was added, or -1 if it wasn’t added
- get(index) – returns the item at the given index, or null if it wasn’t added yet
- an event triggered each time an item is added, providing the item and its index as parameters
Notice that the default implementation of the get(index) operation is eventually consistent; it can be made to be strongly consistent, if additional logic is added. An example of an eventually consistent implementation of Paxos is Google’s HRD.
Suppose you want to store data in a fault resilient manner. You know that two nodes isn’t enough, because of the split brain problem. Thus, you have three nodes: X, Y and Z. Since you want to continue working if a single node fails, when you process a write (or read) you must succeed if one node is down. Let’s assume X is down, and you write something. Y and Z process it. But, just as Y and Z are almost done committing the value, the maintenance gal brings back X and (mistakenly) brings down Y. What is the state of the value written? Is it there? Can we be sure that X will sync with Z and get the value correctly? This might seem like a situation with a very low probability, but when you have a large system with many nodes, and many messages, things with low probabilities happen all the time.
That’s where Paxos comes in. Paxos ensures that no matter what (I’ll specify the exact limitations shortly) if data is written, it will eventually propagate to all the nodes. Moreover, there will never be different nodes that think that some index contains different values (i.e., conflicting log records). These guarantees are called the “safety” property of Paxos.
Paxos assumes that:
- the content of messages aren’t changed after they’re sent
- nodes run the algorithm as instructed (i.e., no bugs, hackers, etc.)
- nodes have persistent storage (i.e., if they crash, the storage isn’t deleted)
Paxos ensures “safety” even if:
- messages are dropped or duplicated
- some (or all) of the nodes crash or restart
However, if the system is really “crazy” (e.g., all the nodes are down), Paxos can’t ensure that writes (or reads) actually occur; but if enough (a majority) of the nodes are up and communicating, then Paxos can ensure that write and reads are processed correctly. This property is called “liveness”.
To sum this section up, Paxos always guarantees “safety”, while ensuring “liveness” if a majority of the nodes are communicating with each other. (That’s why Paxos needs three nodes to support a single failure, and five nodes to support two failures).
A Bird’s Overview of Paxos
Paxos has three main building blocks: a) leader election protocol – to decide which node is running the show, b) consensus on a single log entry (called Synod) that preserves “safety” and c) a protocol for managing the entire log, i.e., what entries should be added, etc.
A bird’s view of the algorithm looks something like:
- continuously (every X seconds) run the leader election algorithm, so all nodes know who’s in charge
- when the leader receives a request to add a log item ITEM, it selects the next empty log index INDEX, and initiates a consensus on adding (INDEX, ITEM) to the log
- when a consensus is initiated to add (INDEX, ITEM), all nodes participate in it, eventually agreeing that item ITEM was added to the log at index INDEX
The leader election algorithm is meant to select a single node that everyone will agree that it is the leader. There are many different leader election algorithms, and Paxos doesn’t specify which one to use. A simple one could be to assign an id to each node, then have them broadcast their ids to everyone, and the node with the highest id is the leader.
It is possible – due to network partitioning – that more than one leader will be elected. (For example, in the above simple leader election algorithm, if messages from the node with highest id are dropped then it is possible that both the highest and second highest nodes will be selected as leaders). In a case where more than one node is the leader, the structure of the Synod algorithm (the consensus algorithm) ensures that at most a single item is selected for each log entry. I’ll go into the details about the Synod algorithm in the following post; for now, it is sufficient to say that if two different leaders initiate a consensus on (INDEX, ITEM1) and (INDEX, ITEM2) – two different items for the same index – then only one of them will get selected.
We’re left with talking a bit about managing the entire log: assume that there is a single leader. When the leader is elected, it goes over the log and sees if there are any “holes” in it. I.e., log entries that there isn’t yet a consensus on their value. For example, let’s say that at index 17 there is a hole. The leader initiates the Synod algorithm with the value (17, EMPTY). Due to the properties of the consensus (Synod) algorithm, it will either reach consensus on the value (17, EMPTY) or on (17, VALUE) for some previously proposed value. In both cases, the hole is filled.
Once there are no more holes, the leader can start processing writes: for each request to add an item to the log, the leader initiates the consensus algorithm with (INDEX, ITEM) where INDEX is the next empty slot and ITEM is the log item to write to the log. Notice that the leader can add multiple items to the log simultaneously and even add items while filling in the holes from before. So long as there is a single leader, items will be added to the log; if there is more than one leader, it is possible that the algorithm will pause, until there is a single leader again.
To sum up, Paxos is a highly resilient algorithm providing an abstraction of a distributed log. It assumes some mechanism of electing a leader. Paxos ensures that the data in the distributed log is consistent, even if there is no leader or there is more than one leader. In the usual case, when there is exactly one leader, data is inserted into the log in a consistent and timely fashion.
The resiliency of Paxos includes withstanding nodes crashing and restarting, as well as messages dropping and network partitioning. If more than half of the nodes can communicate properly, then Paxos will be able to perform writes. If less than half of the nodes can communicate (either because they are down, or because of network partitioning), then Paxos guarantees consistency, but might fail to be available (i.e., writes will not succeed).
All in all, if you favor CP over AP (in the CAP theorem), then Paxos is a very good solution for you.