The previous post gave a general overview of the Paxos algorithm. Here’s a quick recap: Paxos implements a resilient distributed log, such that items can be added and each item is assigned a unique (and increasing) index. The algorithm can be split into three main blocks: a leader election, a consensus on a single item (also called the Synod algorithm) and managing the entire log. In this post I want to go into more depth about the Synod algorithm.
Background
A quick disclaimer before we begin: this post describes the Paxos algorithm in its most basic form, as it was presented in the original paper. Different optimizations exist, but I think that going over them will distract us from the big picture.
The Synod algorithm is the center piece of the Paxos algorithm. It is executed separately for each log index, i.e., one Synod execution for the first item in the log, another Synod execution for the second item in the log, and so on. Multiple Synod algorithms can be executed concurrently, without affecting each other’s results. For the rest of the post, I’ll use the term “Synod instance” to mean an execution of the Synod algorithm for a specific index.
In a nutshell, the Synod algorithm is an agreement (consensus) algorithm which guarantees that: a) all the nodes participating have the same output value, b) the output value is proposed by some node (not just an arbitrary value), and c) if enough nodes can communicate with each other, eventually the algorithm will finish. The algorithm is constructed such that any node can propose a value, and the dynamics of the algorithm ensure that only one of those values is chosen. However, it is possible that different nodes will propose values at different times such that the algorithm runs forever. To expedite the termination of the Synod algorithm, we limit the nodes that can propose values to the Paxos leader node only. When there is a single leader, the algorithm will finish quickly. However, since it is built to support any node proposing a value, then even if there are multiple leaders, there will never be inconsistencies in the agreed output values.
The Synod Algorithm – A Bird’s View Description
For simplicity’s sake, let’s ignore the fact that for each log index a different Synod algorithm is executed, and instead consider a single Synod instance. The algorithm is composed of ballots, each one consisting of three phases: initialization, voting and resolution. Each ballot has a unique number associated with it and is managed by a single node; the numbering of the ballots defines an order among them, s.t. ballot A with number 5 is lower than ballot B with number 6.
Let’s go a bit more into the details of each phase. In the initialization phase, the managing node notifies all the nodes about the new ballot, and waits to hear that they agree to participate in the ballot (they might be participating in a higher ballot, and thus refuse to participate in the lower one). In addition to saying if they agree to be part of the ballot, the nodes also tell the manager what value should be selected as the ballot’s result. The voting phase starts once the managing node received a majority of nodes that will participate in the ballot. The manager then selects a value from the suggestions those nodes sent and sends this value to the participants. Each of the participating nodes casts its vote in favor of this suggestion, if it hasn’t already started a newer ballot. If it did start a newer ballot, it just ignores the old ballot’s messages. Lastly, in the resolution phase, if the managing node received a green light from all the participating nodes, the manager tells all the nodes in the system of the decided value.
Before going even more into details, let’s analyze the above description. It is possible that different ballots are running concurrently (note: this is different from the fact that different Synod instances and be executed concurrently), and therefore a ballot might never end. E.g., if a ballot started, and reached the voting phase, then another ballot started, it is possible that the first ballot will never finish the resolution phase, since the participating nodes will join the newer ballot, and ignore the older ballot in its voting phase.
According to the above, each ballot has a managing node. The detailed algorithm is such that consistency is preserved no matter which node is managing which ballot (so long as a single node manages each ballot). However, to ensure that a decision is actually made, only the leader of the Paxos algorithm acts as a manager of a ballot. Therefore, so long as there is a single leader of the Paxos algorithm, there will be only one manager in the ballots, and thus only one ballot (because that node will only start a single ballot), and thus the Synod algorithm will terminate quickly. If there is more than one node that thinks it is the Paxos leader, then infinitely many ballots might be initiated; however, consistency will always be preserved. I.e., the Synod algorithm will either not terminate, or terminate such that all nodes agree on the same value.
Last point before we delve into the gory details; consider the following situation: ballot X is running, and in the voting phase node U sent some value. Before ballot X finishes, a new ballot is started, and node U is requested to participate in it. Assuming the new ballot is higher than ballot X, our node U will participate in the new ballot. Let’s assume that the new ballot finishes before ballot X does. That is, a decision is made by the managing node of the new ballot. If ballot X ever finishes, it must finish with the same decision value as the one made by the new ballot, otherwise the consistency property of the Synod algorithm will be invalidated. Bellow we’ll investigate what mechanism ensures that ballots have the same decision value, but for now let’s just be aware that this property is required (and ensuring it, is – in my opinion – the main point of the Synod algorithm).
The Synod Algorithm – A Detailed Description
In the previous section we overviewed the three phases of each ballot in the Synod algorithm: initialization, voting and resolution. I now want to go into more details of these phases, along with the state that each node keeps. Each node holds four variables: lastTried, maxBallot, prevVote and prevBallot. At any node X, lastTried holds the last ballot number that X tried to initiate, maxBallot holds the highest ballot number that X participated in, and prevVote holds the vote cast by X in ballot prevBallot, which is the number of the last ballot X voted in.
The idea behind lastTried and maxBallot is simple – it is to ensure that the managing node and the participating nodes are all talking about the same ballot. The managing node marks the number of the ballot in lastTried and validates each message it receives, ensuring it is marked for that ballot. On the other hand, each participating node marks in maxBallot the highest ballot it is participating in, and responds to messages from the manager only if they are for that ballot. In a normal operation, once a ballot is initialized, all nodes’ maxBallot variables will be equal to the manager’s lastTried variable.
For brevity, the following description assumes that when the manager sends a message to everyone, it also sends a message to itself (and also responds to itself); it makes the presentation (and understanding) of the algorithm much easier.
- Initialization:
- The managing node selects a new number n for the ballot, such that n > lastTried, sets lastTried=n and sends NextBallot(n) to everyone.
- Upon receiving NextBallot(n) s.t. n > maxBallot, the receiving node responds with LastVote(n, prevVote, prevBallot) to the manager of the ballot, and sets maxBallot=n.
- Voting:
- Let’s say that if the managing node receives a message LastVote(n, v, m) s.t. n==lastTried from node X, then node X is considered as participating in the current ballot. Using this terminology, if a majority of the nodes are participating in the current ballot, then the managing node can go on to the voting phase, by sending BeginBallot(n, Value) to all nodes participating in the ballot. Value is chosen to be the v with the maximal m among all the LastVote(n, v, m) messages from the participating nodes. (Basically, each participating node sends to the manager its vote from the latest ballot it voted in, and the manager takes the latest vote among them).
- upon receiving BeginBallot(n, v) s.t. n==maxBallot, the receiving node sets prevVote=v, prevBallot=n and responds with Voted(n) to the manager.
- Resolution:
- If the manager received a message Voted(n) s.t. n==lastTried from each participating node, then it considers the ballot as done, and the agreed output value of the Synod algorithm is Value, as selected in the voting phase. The manager also sends Success(Value) to all nodes. (Note that there is no need to send the ballot number here, a detailed explanation is given bellow).
- Upon receiving Success(v) from anyone, the receiving node considers the ballot to be done, with the agreed output value v.
A few remarks about the above algorithm. First, notice that in some situations a node doesn’t have to reply (e.g., if it received a NextBallot(n) message while it is participating in a higher ballot). In such a setting, the manager might “get stuck” in the ballot, and not continue. However, this can happen due to another node starting a newer ballot, and thus a newer ballot will continue running instead of the stuck one. To make the algorithm even more robust, the manager can set a timer, and if it doesn’t receive enough messages in the allotted time, the manager will abandon the current ballot, and start a newer one. This timer mechanism might cause a race condition, in which two different managers each starts a new ballot before the other finishes – a situations which is solved when there is a single leader in the Paxos algorithm, and thus a single manager in the Synod algorithm.
Second, the ballot numbers play a very important role in the algorithm, and if different managers start a ballot with the same number, the algorithm will break. To solve this issue, each manager must have its own set of unique numbers that it will use – something that can be ensured by, for example, allotting 1, 4, 7, … to the first node, 2, 5, 8, … to the second node and 3, 6, 9, … to the third node (assuming a three-node Paxos).
Lastly, once a value V is voted (i.e., enough nodes voted for it in the voting phase), if another ballot finishes, it will always finish with the same value V (we’ll discuss this property in the next section). Thus, once a value V is voted, it can be sent to all nodes, and we don’t need to consider which ballot it came from. That’s the reason that the Success message doesn’t need to contain a ballot number.
Why Does This Work?
The main point that needs convincing is that the Synod algorithm can’t produce two different outputs. A value is the output of the Synod algorithm only if it is voted by enough nodes in the voting phase. Therefore, we need to show that if V is voted by enough nodes in the voting phase, then any other ballot which reaches the resolution phase will also have V voted.
Let’s consider an execution of the Synod algorithm in which for some ballot number n, the value V was voted in the voting phase, and the ballot reached the resolution stage. Since reaching the resolution phase requires that all participating nodes vote, and since reaching the voting phase requires a majority of the nodes to respond in the initialization phase, we conclude that a majority of the nodes have voted for V. That is, in the voting phase they have all set the value of prevVote to V and prevBallot to n.
Now, consider the following ballot (=the minimal ballot m s.t. m > n) that passes the initialization phase and goes on to the voting phase. It must receive a majority of nodes that respond with a LastVote message. If we have two sets, each containing more than half of the nodes, there must be a node that is in both of these sets. Thus, there is a node that voted for V in ballot n which sent a LastVote(m, V, n) in ballot m; and since V is the value from the highest ballot (recall that n is the highest ballot that isn’t m), it will be the value selected by the manager of ballot m, and therefore, if any node updates its value of prevVote it will set it to be V. Therefore, if another future ballot finishes, it must finish with the same value of ballot n, i.e., it will finish with value V.
TL;DR
Ok, that was a mouth full. Let’s recap. The Paxos algorithm runs a single Synod instance for each entry in the distributed log. I.e., a Synod instance on the value of the first entry, another instance for the second entry, and so on. The Synod algorithm reaches an agreement on a single value, and it does so by running ballots. Each ballot has a manager, which starts by trying to get a majority of nodes to agree to participate in the ballot, it collects suggested values from the participating nodes, selects a single value and asks the nodes to vote on it. If the manager receives enough votes, then the value is selected as the output value of the Synod algorithm.
The crux of the Synod algorithm is to ensure that if multiple ballots are executed concurrently, then the output is still guaranteed to be the same. I.e., if ballot X is executed concurrently with ballot Y, and ballot X finishes with some value V, then if ballot Y finishes, it must finish also with value V.
Since multiple ballots might be executed concurrently, it is possible that they will interfere with each one, and cause a race condition which will lead to an infinite sequence of ballots. However, if the Paxos algorithm has a single leader, then in each Synod instance only a single node will manage the ballots, thus ensuring that only one ballot is executed at each point in time, and therefore voiding the race condition and the infinite execution.
Pingback: Understanding Paxos | thoughts…
Pingback: A New Key-value Consensus Algorithm | Distributed Thoughts