Sunday, May 5, 2019

OneFile and wait-free balanced trees

OneFile is the world's first wait-free STM and as such, it allows to create wait-free trees.
As it so happens, there is a hand-made wait-free tree in the literature but it is not a "balanced", which means that our sequential implementation of a red-black tree protected by OneFile is the first wait-free balanced tree ever made... and did I mention that it has wait-free memory reclamation?  :)

Hand-made lock-free trees are not new. A fellow researcher, Trevor Brown, has been particularly prolific in this area and made several robust trees, with integrated memory reclamation and several with lock-free progress (just a small sample of his work on this topic):

Back in 2013, Aravind Natarajan, Lee Savoie and Neeraj Mittal presented a hande-made wait-free tree. Their tree does not have memory reclamation, which means that as you do a removal of a node from the tree, the node will will not be reclaimed/deleted, thus leaking. Eventually, your application will crash due to running out of memory. I remember I tried once with Andreia to add memory reclamation to their code but it was just too hard, even for epoch-based reclamation.
Moreover, the keys have to be 64 bit variables, from which a couple of bits are stolen to store some metadata. Perhaps there is a way to extend it to generic keys, but it's non-trivial.
This tree is not balanced. What this means is that if you do insertions of keys in sequential order (1,2,3,4,5...) the tree will not re-balance the nodes and what you end up with is just a "linked list" of nodes. Obviously, if you create a tree with 1 million keys, the latency for any operation will be around hours. Yes, hours per insertion or removal or lookup because that's how long it takes to do 1 million traversals of a linked list with 1 million nodes.
This tree is interesting more as a theoretical construct: it showed for the first time that wait-free trees are possible!

We took a sequential implementation of a Red-Black Tree (a balanced tree) and we added the annotation to work with OneFile. The result is a fully functional wait-free balanced tree. Also, it has wait-free memory reclamation for its nodes and the keys are generic (a template parameter).
This isn't the first wait-free balanced tree, that honor falls on CX, a universal construct that is capable of taking any sequential implementation of a data structure and making a wait-free data structure, without even needing annotation of any sort. It even works with std::set.
One day I'll talk more about CX and what it can do, but for the moment let's just say that OneFile is better than CX when it comes to tail latency for writes.

Because OneFile is a generic technique to make concurrent data structures, it means that adding other operations to the underlying data structure is as easy/hard as adding those operations as sequential code. This means it's easy to add things like range queries, do joins/splits, range deletions, range updates, add key A if key B is present, and pretty much anything you can express in C++ code and place inside a lambda. All with linearizable consistency and wait-free progress.
The source code for this wait-free balanced tree can be obtained here:
All you'll need to run it is the header containing the OneFile STM and a C++ compiler.

OneFile is an STM and as such, allows for dynamic transactions. These transactions can cover multiple objects or multiple data structure instances. This means that we can have a wait-free linearizable operation which removes one key from a tree and inserts into another tree, or another data structure. In other words, these transactions are dynamic, have Atomicity, Isolation and Consistency (are linearizable).
... if only we could have Durability... but ohhh wait, there is this thing called "Persistent Memory" (PM) which can give byte-addressable durability.

We have made a variant of the OneFile that is meant for Persistent Memory and gives ACID wait-free transactions.
Next week we'll talk about Persistent Memory and how OneFile is actually the first ever wait-free database storage engine.

No comments:

Post a Comment