I think fault tolerance is the most important aspect of distributed algorithms, for two reasons: 1) in practice, things break, and you want your data and your system to keep working; 2) if you assume there are no failures, things get “easy”: many, many problems become trivially solvable, and that’s just boring!
So you’re convinced that fault tolerance is important, but why dedicate a whole post to it? Because fault tolerance is not only important, it is also very hard to achieve. In fact, quite a few problems simply can’t be solved in the presence of failures. (Can’t be solved, that is, until we strengthen our assumptions or change something in the setting.)
To see why fault tolerance is hard, let’s consider an example from a nearby field: RAID 5 storage. In RAID 5 you have an array of n disks, and you read/write data to them. To survive the failure of a single disk, when you want to write n-1 bits, you write one bit to each of the first n-1 disks, and to the nth disk you write the XOR of all n-1 bits. That way, if some disk dies, its content can be recovered by XOR-ing the data of the surviving disks.
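To make the parity idea concrete, here’s a minimal Python sketch of the XOR scheme just described (the function names are mine, for illustration only, not any real RAID implementation):

```python
from functools import reduce

def write_stripe(data_bits):
    """Store n-1 data bits plus their XOR as a parity bit on the nth disk."""
    parity = reduce(lambda a, b: a ^ b, data_bits)
    return data_bits + [parity]

def recover(disks, failed_index):
    """Reconstruct the bit on the failed disk by XOR-ing all the survivors."""
    survivors = [b for i, b in enumerate(disks) if i != failed_index]
    return reduce(lambda a, b: a ^ b, survivors)

stripe = write_stripe([1, 0, 1])        # parity disk holds 1 ^ 0 ^ 1 = 0
assert recover(stripe, 1) == stripe[1]  # lose disk 1; XOR of the rest restores it
assert recover(stripe, 3) == stripe[3]  # works for the parity disk too
```

This works because XOR is its own inverse: XOR-ing all n disks together always yields 0, so XOR-ing any n-1 of them yields the missing one.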
RAID 5 is a cool idea, and it seems pretty straightforward to implement. But there are two assumptions underlying the above description that make everything much easier: 1) when you write data to a disk, you either succeed or fail, and you know immediately which one happened; 2) there is a centralized controller that ensures at most one write occurs at any given moment. In essence, this means the failures you’re dealing with are less “distributed” in nature.
Now, let’s turn to a distributed system with n servers, each storing data for read/write access. We want the system to be fault tolerant, such that everything continues to work if a single server fails. How can we implement the RAID 5 idea in such a system? Suppose that each time we want to write n-1 bits, we send each bit to a different server and send the XOR of all the bits to the nth server.
But what if two different clients try to write different values to the same location? In the original RAID 5 setting, we had a controller to serialize the writes. If we add a controller server, then we have a single point of failure: if the controller fails, the system grinds to a halt. If we don’t add a controller, then it is possible that the n-1 bits of one value will be written, together with the XOR of the bits of the second value.
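Here is a small, hypothetical simulation of that bad interleaving (the bit values are mine, chosen so the corruption is visible): client A’s data bits reach the data servers, but client B’s parity reaches the parity server, and XOR recovery then silently returns a wrong bit.

```python
from functools import reduce

def xor(bits):
    return reduce(lambda a, b: a ^ b, bits)

# Two clients write concurrently, with no coordinator to serialize them.
a_bits = [1, 0, 1]  # client A's data bits (their XOR parity is 0)
b_bits = [1, 1, 1]  # client B's data bits (their XOR parity is 1)

# Bad interleaving: A's data lands on the n-1 data servers,
# but B's parity lands on the nth (parity) server.
servers = a_bits + [xor(b_bits)]  # mixed state: [1, 0, 1, 1]

# Now data server 0 fails. XOR recovery returns a bit that is
# wrong for A's value -- and nothing signals that it is wrong.
recovered = xor(servers[1:])
assert recovered != a_bits[0]  # silent corruption
```

Note that no component reported an error anywhere: every individual write succeeded, yet the system as a whole is inconsistent.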
Even in a simpler setting, in which the same bit is written to all the servers (i.e., instead of writing n-1 different bits of data to n servers, the same single bit is written to all of them), we still have the same problem. Either we have a central coordinator (creating a single point of failure) or we need to handle the case where multiple writes occur concurrently. Actually, we don’t even need multiple writes. Suppose a client starts writing its bit to the n servers, and after writing to half of the servers, the client crashes. Now another client tries to read the value from the servers. What result should be returned?
A possible solution would be to say something like “read the value from all the servers, and return what the majority says”. Such a solution is very unstable. Suppose half of the servers received the “write 1” command, while the other half didn’t. In such a setting, a single server’s failure can change the answer to “what data is written in the system?”. It can actually get worse: if one of the servers doesn’t crash but instead drops messages, different clients can observe different majorities, and thus disagree on whether a 1 was written or not.
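The instability is easy to demonstrate with a toy sketch (again, the setup is mine: six servers, a writer that crashed halfway, and two readers that each lose one response to a message drop):

```python
from collections import Counter

def majority_read(responses):
    """Return the most common value among the responses a client received."""
    return Counter(responses).most_common(1)[0][0]

# A writer crashed after flipping exactly half of six servers from 0 to 1.
servers = [1, 1, 1, 0, 0, 0]

# Two readers each lose one (different) message due to drops, so each
# sees only five responses -- and they compute opposite majorities.
reader_a = majority_read(servers[:5])  # missed server 5: sees [1, 1, 1, 0, 0]
reader_b = majority_read(servers[1:])  # missed server 0: sees [1, 1, 0, 0, 0]
assert reader_a == 1 and reader_b == 0  # the two readers disagree
```

So with a plain majority vote, the system’s answer to “what is stored?” depends on which single message happened to be dropped, which is exactly the kind of fragility a fault-tolerant design must rule out.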
To sum up, I hope I’ve managed to convince you that fault tolerance in distributed systems is both important and hard. I also hope you find the problem as interesting as I do; in the next few posts I’ll try to dive a bit deeper into this area.