It’s been quite a while since I last published a post. I’ve started a new job at Elastifile at the beginning of 2014, and it kinda took most of my spare time 🙂 I’ve resolved to get back to blogging, and I hope I succeed…
At Elastifile, we’ve built a distributed file-system that is truly scalable, and I’d like to share with you why creating such a file-system is interesting, challenging, and ultimately a lot of fun!
Let’s start by considering what exactly we expect from a distributed file-system. One of the basic expectations is to be able to access the file-system from different computers, and to have a consistent view of the file-system. For example, if I create a file “foo.txt” on one computer, I’d expect to be able to read this file from any other computer.
Time in a Distributed File-system
Consider we have a directory that has two files in it “a.txt” and “b.txt”. Suppose I run “ls -lt” within the directory. It should list the content of the directory, ordered by last-modified time. Here’s an example.
$ ls -lt total 0 -rw-r--r-- 1 ezra wheel 0 Nov 7 21:48 a.txt -rw-r--r-- 1 ezra wheel 0 Nov 7 21:47 b.txt
Notice that the later modified file appears first. I can change the last modified time of “b.txt”, and it will appear first:
$ touch b.txt $ ls -lt total 0 -rw-r--r-- 1 ezra wheel 0 Nov 7 21:49 b.txt -rw-r--r-- 1 ezra wheel 0 Nov 7 21:48 a.txt
What’s going on behind the scenes? Each file has a last-modified timestamp attached to it, and when the file is updated (e.g., by executing “touch b.txt”) that timestamp is updated.
Where does this timestamp value originate from? It shouldn’t be based on the client’s clock, because clients’ clocks can be wrong. So it should be based on the server’s clock. However, which server? If we’re a truly scalable file-system, we’ll need to be able to scale out, by adding additional servers. This inherently means we have more than one server, and thus more than one clock (each server has its own clock).
We could decide that timestamps are taken from one server only, but that has two main drawbacks: a) there is a single-point-of-failure issue; b) a single server has a limit of how many operations it can handle concurrently, and again we have a scaling issue.
So in any scalable solution, timestamps will be taken by different servers. This isn’t too bad. We can say file “a.txt” has its timestamp taken by server X, while file “b.txt” has its timestamp taken by server Y. That sounds pretty good.
However, now we have an oddity:
/tmp/bar$ touch a.txt /tmp/bar$ touch b.txt /tmp/bar$ ls -lt total 0 -rw-r--r-- 1 ezra wheel 0 Nov 7 22:04 a.txt -rw-r--r-- 1 ezra wheel 0 Nov 7 22:03 b.txt
The timestamp of “a.txt” should be earlier than “b.txt”, but it isn’t! How can this happen? If server X and server Y have slightly different times, then even though “a.txt” was modified before “b.txt”, it could have receive a timestamp that is after “b.txt”.
The interesting aspect of this problem is that we can never guarantee that both servers have exactly the same time, even if they have really accurate clocks. On the other hand, any attempt to serialize these timestamps will inevitably limit the scalability of the file-system. However, there are a few things that can be done – I’ll let you enjoy thinking about them.
Directory Structure in a Distributed File-system
Another example of the complexity of a distributed file-system is the directory structure. Directories can contain within them files and/or other directories. Consider the following example:
$ mkdir bar $ mkdir foo $ mv foo bar $ ll bar total 0 drwxr-xr-x 2 ezra wheel 68B Nov 7 22:15 foo
We created two directories, “foo” and “bar”, then placed “foo” under the directory “bar”.
We can now try and place “bar” under “foo”:
$ mv bar bar/foo mv: rename bar to bar/foo/bar: Invalid argument
Luckily, something prevented us from doing that. Otherwise, we’d have an infinite loop of “foo” being a sub-dir of “bar” being a sub-dir of “foo”, and so on. This example is a simple one. There are much more elaborate cycles possible (e.g., moving a parent directory to a sub-dir that is nested 20 levels deep).
Let’s consider what’s required to detect such invalid operations: We need to traverse the directory structure, and guarantee that there are no loops created due to this “mv” operation. Moreover, we need to guarantee this while other “mv” operations might be concurrently executing, changing the structure of the directory tree as we traverse it. Lastly, since our file-system is distributed and scalable, we can’t keep all directories on a single server, so we must allow different servers to handle different directories. Therefore, the aforementioned traversal is going to be a distributed traversal. Sounds like a lot of fun, right?
To sum up, for a distributed file-system to be truly scalable, it must allow for multiple servers to handle different files and/or directories. Once multiple servers handle different entities, guaranteeing specific relationships between these entities (such as having a sequence of timestamps, or no loops in the directory structure) becomes a distributed task.
Interesting? Yup. Challenging? Definitely. Fun? I think so 🙂