Wednesday, July 10, 2019

Hydra Conference

The Hydra Conference will start in less than 12 hours and the program is super cool.
Maurice Herlihy will be there, Cliff Click, Michael Scott, Leslie Lamport and many many more. You can check the live feed at this link:
https://hydraconf.com/onlinefree/

Saturday, May 18, 2019

OneFile and wait-free memory reclamation

Last week we talked about OneFile and how it is capable of doing wait-free allocation and de-allocation. This week we'll give a brief intro on how it is capable of doing wait-free memory reclamation.
This isn't the first wait-free memory reclamation scheme (I know of at least one more, by Maged Michael, based on LL/SC), but OneFile's is the first that is generic, in the sense that it works for any data structure (although it requires DCAS). One thing's for sure: the optimistic technique we use in OneFile PTM is the first lock-free/wait-free memory reclamation technique ever shown for Persistent Memory (PM).

Let's start with the volatile memory first.
In a multi-threaded application, pointers to dynamically allocated objects (in the heap) are typically shared among multiple threads. In a lock-free data structure, because another thread may still de-reference a pointer to an object, we cannot immediately delete/free the object when we no longer need it.
Instead, we need to use a memory reclamation technique like Pass The Buck, RCU, Hazard Pointers, or Hazard Eras.

In OneFile, user objects are allocated using tmNew() and de-allocated using tmDelete(). Internally, tmNew() calls 'new' and creates a new object that is tracked using Hazard Eras. When tmDelete() is called by the user, the object is placed in a retire list until it is safe to delete.
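As a rough illustration of the retire-until-safe pattern just described (the tmDelete name follows the paper, but the "is it safe yet?" check is simplified here to a caller-supplied predicate instead of a real Hazard Eras scan):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy sketch of a retire list: objects passed to tmDelete() are not freed
// immediately; they wait on the retired list until a scan determines it is
// safe to delete them. All names besides tmDelete are illustrative.
template<typename T>
struct Reclaimer {
    std::vector<T*> retired;

    void tmDelete(T* obj) { retired.push_back(obj); }  // don't free yet

    // Free every retired object the predicate says is safe; keep the rest.
    void scan(const std::function<bool(T*)>& safeToFree) {
        std::vector<T*> still;
        for (T* p : retired)
            if (safeToFree(p)) delete p; else still.push_back(p);
        retired.swap(still);
    }
};
```

In the real scheme, the predicate is replaced by scanning the eras published by other threads.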
This technique is not new: Hazard Pointers and pretty much every pointer-based memory reclamation mechanism work like this. The innovation is that, because OneFile is an STM, every access to a word of an object goes through a load interposing method (which we named pload()). In this pload(), if we see that the sequence associated with the word we're trying to read has been modified and is more recent than the timestamp (curTx) of when we started the transaction, then it might be an inconsistent read, and we abort the transaction, never returning the (possibly incorrect) value at that memory address.
Explained another way, pload() will never return a value that may be inconsistent (it will throw/catch an exception and abort the current transaction), because that's part of how the STM works. A consequence of this is that the memory reclamation scheme will never de-reference an incorrect pointer.
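A minimal single-threaded sketch of this load-interposing idea; the names pload, curTx and tmtype are modelled after the OneFile paper, but everything else (the packing of value and sequence into one word, the DCAS-based stores) is omitted:

```cpp
#include <cassert>
#include <cstdint>
#include <exception>

// Thrown by pload() when a read might be inconsistent; the transaction
// catches it and restarts.
struct AbortedTx : std::exception {};

// Stand-in for OneFile's tmtype: a value plus the sequence (timestamp) of
// the transaction that last wrote it. The real type packs both in 128 bits.
template<typename T>
struct tmtype {
    T value{};
    uint64_t seq{0};
};

// Timestamp taken when the current transaction started (illustrative global).
static uint64_t curTx = 0;

template<typename T>
T pload(const tmtype<T>& t) {
    // If this word was written by a transaction more recent than ours, the
    // value may be inconsistent: abort instead of returning it.
    if (t.seq > curTx) throw AbortedTx{};
    return t.value;
}
```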

The really cool thing about this is that it was _not_ possible to make this approach practical until the invention of Hazard Eras. Hazard Eras is the first atomics-only, lock-free memory reclamation scheme that uses epochs instead of pointers. Because of this, a single era can protect all the objects alive at a given instant in time.
Doing something similar with a pointer-based scheme like Hazard Pointers would require one published pointer per object traversed. The first difficulty is that we would have to know to which object a given address belongs, a highly non-trivial problem in word-based STMs.
The second difficulty, and a far more serious one, is that each de-referenced pointer (each node traversed in a data structure) needs to be protected first, which with Hazard Pointers means publishing that pointer and executing an mfence (on x86). It also means that each pointer needs its own publishing slot. For example, to traverse a linked list with 1000 nodes we would need to publish 1000 distinct pointers, which other threads would have to scan before de-allocating a single object. Clearly, the cost of doing 1000 mfences would drown out any chance of making this a suitable approach, not to mention that to delete a single object we would have to scan 1000 memory locations to check for a match.
_This_ is the reason why, until now, there has been no lock-free STM with lock-free memory reclamation!
Thanks to Hazard Eras, a single era protects all the objects that the transaction needs to touch, until the transaction commits or aborts.
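The core rule of Hazard Eras can be sketched in a few lines (a simplified, single-threaded view; the real algorithm publishes eras in shared memory with the appropriate fences): a retired object may be freed only if no published era falls within its lifetime.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified Hazard Eras reclamation check. Each object records the era in
// which it was allocated and the era in which it was retired; each reader
// publishes a single era for its whole transaction. Names are illustrative.
struct Retired {
    uint64_t allocEra;
    uint64_t retireEra;
};

// An object can be freed only if no published era overlaps its lifetime,
// i.e. no reader that might still hold a pointer to it is active.
bool canFree(const Retired& obj, const std::vector<uint64_t>& publishedEras) {
    for (uint64_t e : publishedEras)
        if (e >= obj.allocEra && e <= obj.retireEra) return false;
    return true;
}
```

Note how one published era protects every object alive during the transaction, which is what makes long traversals cheap.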
In a certain sense, OneFile is a showcase for the power of Hazard Eras.



But it doesn't stop there. Let's take a look at the algorithm for Persistent Memory (PM).
In the wait-free PM variant of OneFile we use the same Hazard Eras algorithm described above to track the std::function objects where the transactions are stored (the lock-free variant doesn't share transactions among threads), but the user-allocated objects in PM are handled differently.
For the dynamic objects allocated by the user, we rely on an optimistic technique. Accessing a de-allocated object in the mmap()ed region in PM does not trigger a segmentation fault (the object is still within the mmap()ed region) and therefore doesn't have the constraints that using the system allocator has for the volatile heap.
However, it does have the issue that we can _never_ reset the sequence on a tmtype<T>. This sequence must be ever-increasing (it takes about 100k years to wrap, so it should be fine), changing with every modification made by a transaction.
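A back-of-the-envelope check of that "100k years" figure, assuming (hypothetically) a sustained rate of 5 million sequence increments per second:

```cpp
#include <cassert>

// How long until a 64-bit, ever-increasing sequence wraps around, at a
// given sustained increment rate? The 5M/s rate used in the test below is
// an assumption for illustration, not a measured OneFile number.
double yearsToWrap(double incrementsPerSecond) {
    const double maxSeq = 18446744073709551615.0;   // 2^64 - 1
    const double secondsPerYear = 365.25 * 24 * 3600;
    return maxSeq / incrementsPerSecond / secondsPerYear;
}
```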
The proof is in the paper https://github.com/pramalhe/OneFile/blob/master/OneFile-2019.pdf but the upshot is that the memory reclamation scheme becomes almost trivial, in the sense that its implementation is nearly non-existent, or so it may seem when you look at the code.
Appearances are deceiving: although the never-reset-the-sequence-of-a-tmtype constraint is carefully handled throughout the code, the reason _why_ this is enough is super hard.
As Andreia always says, we make things that look so easy, that people sometimes mistake them for being trivial. Believe me, it's not, and it was not trivial to get there. Andreia, Nachshon, Pascal and I had to work hard to get there.

Until today, there has been no lock-free (much less wait-free) memory reclamation scheme presented for PM, and certainly none as generic or as fast as the one we use in OneFile PTM. The fact that we don't have to keep a retired/zombie list in PM is a major breakthrough, because persisting such a list has a high cost in clwb/clflushopt/sfence instructions. Our optimistic reclamation scheme doesn't have any zombie list: the destructor is called immediately during the transaction (and this is safe even if the transaction later aborts, because those changes will themselves be reverted) and the object is immediately de-allocated, all from within the transaction. This guarantees there are no leaks and no allocator metadata corruption in the event of a failure.
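To see why it is safe to run the destructor and free the object from inside the transaction, recall that the allocator's metadata updates are transactional too. The sketch below models this with a toy undo log (OneFile itself uses a redo log, where aborting simply means never applying the writes; all names here are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Toy transaction: every write records the old value, so an abort can roll
// everything back, including writes to allocator metadata. This is why a
// de-allocation performed inside an aborted transaction leaves no trace.
struct Tx {
    std::vector<std::pair<uint64_t*, uint64_t>> undo;  // (addr, old value)

    void write(uint64_t* addr, uint64_t v) {
        undo.push_back({addr, *addr});
        *addr = v;
    }
    void abort() {   // restore old values in reverse order
        for (auto it = undo.rbegin(); it != undo.rend(); ++it)
            *it->first = it->second;
        undo.clear();
    }
    void commit() { undo.clear(); }
};
```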

Hope you enjoy OneFile.


That's the end of this series for now. I'll come back when we have more cool stuff to talk about.

Tuesday, May 14, 2019

OneFile is also a database engine

In this post we'll talk about OneFile, the world's first wait-free Software Transactional Memory that works on Persistent Memory, and why it's actually a (wait-free) database storage engine.

An STM (Software Transactional Memory) is a library which provides dynamic transactions over multiple objects with linearizable consistency. In effect, an STM provides Atomicity, Consistency and Isolation, but not Durability, i.e. an STM gives ACI transactions.
A Persistent Transactional Memory (PTM) is just like an STM except it also gives Durability, or as I like to call it "failure-resilience". PTMs are capable of providing fully ACID transactions, as long as they're used with byte-addressable Persistent Memory (PM).

Examples of Persistent Memory are the Optane Persistent Memory NV-DIMMS:
https://www.intel.com/content/www/us/en/architecture-and-technology/optane-dc-persistent-memory.html
or the battery backed NV-DIMMS by HP or Viking:
https://www.hpe.com/nl/en/servers/persistent-memory.html
https://www.vikingtechnology.com/products/nvdimm/

Not all PTMs provide ACID with serializable isolation.
In the concurrency literature there are two major distinctions for the consistency of durable transactions:
- durable linearizable: pretty much equivalent to what the database community calls "serializable isolation" or full ACID transactions;
- buffered durable linearizable: transactions are linearizable, however, in the event of a failure, an unspecified number of transactions may be lost, up to and including the current one. The system is guaranteed to be restored to a consistent state from the past, just not necessarily to the last committed transaction.
https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=30821

All the other cases in between, which the database folks have names for such as "dirty reads", "non-repeatable reads", "phantom reads", we just call those "wrong" or "bugs" because that's how most users think of them.

Last year we made a PTM with (blocking) durable linearizable mutative transactions and wait-free read-only transactions called RomulusLR:
https://dl.acm.org/citation.cfm?id=3210392
This PTM has ACID transactions and it's pretty fast due to its combining technique, but it does serialize all mutative transactions. Other PTMs like Mnemosyne don't have this limitation, but they have other limitations.
A pretty well-known PTM is libpmemobj in PMDK, by the Intel guys, who understandably use it as a way to sell more Optane persistent memory ;)
libpmemobj only provides durable transactions and leaves the concurrency part up to the user, either through a global lock or through fine-grained locking (good luck with that!).
https://pmem.io/
PMDK is an awesome piece of software because it has many tools that help enable the usage of persistent memory, not to mention all the effort that the Intel guys have put in in defining standards so that it all works with the PM products from different vendors.

We made a variant of OneFile for PM. In this variant the memory reclamation is done in a different way. We'll leave those details to its own blog post, but the cool thing about OneFile for PM is that it comes with its own memory allocator for persistent memory:
https://github.com/pramalhe/OneFile/blob/master/ptms/OneFilePTMWF.hpp
Even if you don't have persistent memory, you can mmap() a region of your DRAM and use that. This gives you durability across process failure but not across system failure. In other words, if the process that is doing a transaction gets killed, the information will be recovered correctly, but if the machine dies (gets unplugged or reboots) then the data is lost (unless you have a persistent memory module in your machine).

The allocator that comes with OneFile is very simple; it doesn't do coalescing of blocks or anything like that. We call it EsLoco (Extremely Simple Allocator).
Other allocators can be used, as long as they are annotated by wrapping each struct/class member of the allocator metadata in the tmtype template.
Because the changes to allocator metadata are part of the transaction, there will not be any leaks in the event of a failure, and all the operations are wait-free. This means that this allocator is capable of allocating and de-allocating blocks of data in a wait-free way.
In summary, by using OneFile, we've made the world's first wait-free allocator!
So ok, the bad news is that it serializes all operations, whether allocations or de-allocations. Don't expect any scalability from it, but on the other hand, it will be hard to find anything better in a multi-threaded environment when it comes to tail latency.
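The annotation idea from the paragraphs above can be sketched as a toy bump-pointer allocator whose metadata is wrapped in tmtype. This is only a cartoon of EsLoco (the real code is in the OneFile repository), with the STM interposing elided:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for OneFile's tmtype annotation: in the real STM, reads and
// writes of .v go through pload()/pstore() and are part of the transaction.
template<typename T> struct tmtype { T v; };

// Cartoon of an EsLoco-style bump allocator. The only metadata is "top",
// and because it is a tmtype, allocations commit or abort with the
// enclosing transaction (no leaks on failure).
struct EsLocoSketch {
    static constexpr size_t kSize = 1 << 20;
    char pool[kSize];
    tmtype<size_t> top{0};

    void* malloc(size_t bytes) {
        if (top.v + bytes > kSize) return nullptr;
        void* p = pool + top.v;
        top.v += bytes;   // in OneFile this write goes through the STM
        return p;
    }
};
```

A real allocator would also track free blocks; the point here is only that the metadata writes are ordinary transactional writes.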

Another interesting aspect is the number of fences.
When doing a durable transaction on block storage (disk), the typical trick is to do two fsync() or two fdatasync() calls. The cost of an fsync() is significant because it's on the order of the latency of a round trip to the storage device, i.e. the disk's latency.
In PM, the equivalent is to do two pfences (SFENCE instructions on x86). The cost of a pfence is also the cost of a round trip to the device, but this is sub-microsecond, so it's not that costly.
Libpmemobj in PMDK uses an undo-log technique, which means it executes two SFENCEs per modified range/object in a transaction. Even though a pfence is about 1000x cheaper than calling fsync(), doing too many of them becomes a performance bottleneck. In OneFile we did our best to minimize the number of pfences, reducing them to 2 per transaction, regardless of the number of objects modified during the transaction.
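The two-fence commit described above can be sketched as follows. Here clwb and the pfence are mocked (a no-op and a counter); on real hardware they would be CLWB/CLFLUSHOPT followed by SFENCE, and the log and commit flag would live in PM:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mocked persistence primitives, for illustration only.
static int fenceCount = 0;
static void clwb(const void*) {}        // cache-line write-back (no-op here)
static void pfence() { ++fenceCount; }  // persistence fence (SFENCE on x86)

struct RedoEntry { uint64_t* addr; uint64_t val; };

// Redo-log commit with exactly two fences, no matter how many words the
// transaction modified: one fence after flushing the log, one after
// flushing the commit mark. Applying in place needs no fence before the
// next transaction's commit.
void commitTx(std::vector<RedoEntry>& log, bool& committedFlag) {
    for (auto& e : log) clwb(&e);   // flush every log entry (no fence yet)
    pfence();                        // 1st fence: the log is durable
    committedFlag = true;
    clwb(&committedFlag);
    pfence();                        // 2nd fence: the commit mark is durable
    for (auto& e : log) { *e.addr = e.val; clwb(e.addr); }  // apply in place
    log.clear();
}
```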

In effect, OneFile is the world's first wait-free database engine. When coupled with persistent memory, it is capable of providing transactions with serializable isolation.
A relevant question would be:
Like all other databases, it has to have a recovery mechanism for when a failure occurs... right?
Actually, no it doesn't.
OneFile is the first database engine with null-recovery. Null-recovery is a property of certain lock-free algorithms which was first introduced in a paper by Izraelevitz et al. in 2016
https://www.cs.rochester.edu/u/scott/papers/2016_DISC_persistence.pdf
Basically it says that some lock-free (and wait-free) algorithms have this property of simply continuing where they left off after a failure, meaning that they don't need a specific recovery procedure to recover after a crash.
This property is theoretically appealing, but no database engine or PTM had so far been able to achieve it. With OneFile we've shown a practical implementation of it, and we showed that database recovery can be done at the cost of applying the last incomplete transaction (it's actually possible to do even less).
The database research community has entire papers where they have complicated techniques to improve the time it takes to recover a database: https://research.fb.com/wp-content/uploads/2016/11/fast-database-restarts-at-facebook.pdf?
With OneFile + PM, we've reduced recovery time to the time of a single transaction!

... and did I mention that transactions in OneFile PTM are wait-free and have serializable isolation?  :)


Next week we talk about wait-free memory reclamation

Sunday, May 5, 2019

OneFile and wait-free balanced trees

OneFile is the world's first wait-free STM and, as such, it allows us to create wait-free trees.
As it so happens, there is a hand-made wait-free tree in the literature, but it is not balanced, which means that our sequential implementation of a red-black tree protected by OneFile is the first wait-free balanced tree ever made... and did I mention that it has wait-free memory reclamation?  :)

Hand-made lock-free trees are not new. A fellow researcher, Trevor Brown, has been particularly prolific in this area and has made several robust trees with integrated memory reclamation, several of them with lock-free progress (here is just a small sample of his work on this topic):
https://arxiv.org/abs/1708.04838
https://www.cs.tau.ac.il/~mad/publications/atc2018-bst.pdf

Back in 2013, Aravind Natarajan, Lee Savoie and Neeraj Mittal presented a hand-made wait-free tree. Their tree does not have memory reclamation, which means that when you remove a node from the tree, the node will not be reclaimed/deleted, thus leaking. Eventually, your application will crash due to running out of memory. I remember once trying, with Andreia, to add memory reclamation to their code, but it was just too hard, even for epoch-based reclamation.
Moreover, the keys have to be 64 bit variables, from which a couple of bits are stolen to store some metadata. Perhaps there is a way to extend it to generic keys, but it's non-trivial.
This tree is not balanced. What this means is that if you insert keys in sequential order (1,2,3,4,5...) the tree will not re-balance the nodes and what you end up with is effectively a "linked list" of nodes. Obviously, if you build such a tree with 1 million keys, the total work amounts to 1 million traversals of a linked list with up to 1 million nodes, which takes hours.
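The quadratic blow-up behind that claim is easy to quantify: inserting n keys in sorted order into an unbalanced BST walks past every existing node on each insertion, for n(n-1)/2 node visits in total.

```cpp
#include <cassert>
#include <cstdint>

// Total node visits when inserting keys 1..n in sorted order into an
// unbalanced BST: insertion i walks past the i-1 existing nodes, giving
// 0 + 1 + ... + (n-1) = n(n-1)/2. A balanced tree would do ~n*log2(n).
uint64_t sortedInsertVisits(uint64_t n) {
    return n * (n - 1) / 2;
}
```

For 1 million keys that is roughly 5 * 10^11 node visits just to build the tree.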
This tree is interesting more as a theoretical construct: it showed for the first time that wait-free trees are possible!

We took a sequential implementation of a Red-Black Tree (a balanced tree) and added the annotation needed to make it work with OneFile. The result is a fully functional wait-free balanced tree. It also has wait-free memory reclamation for its nodes, and the keys are generic (a template parameter).
Strictly speaking, this isn't the first wait-free balanced tree; that honor falls to CX, a universal construct capable of taking any sequential implementation of a data structure and turning it into a wait-free data structure, without even needing annotation of any sort. It even works with std::set.
https://github.com/pramalhe/CX
One day I'll talk more about CX and what it can do, but for the moment let's just say that OneFile is better than CX when it comes to tail latency for writes.

Because OneFile is a generic technique to make concurrent data structures, adding other operations to the underlying data structure is as easy (or as hard) as writing those operations as sequential code. This means it's easy to add things like range queries, joins/splits, range deletions, range updates, "add key A if key B is present", and pretty much anything else you can express in C++ code and place inside a lambda. All with linearizable consistency and wait-free progress.
The source code for this wait-free balanced tree can be obtained here:
https://github.com/pramalhe/OneFile/blob/master/datastructures/treemaps/OFWFRedBlackTree.hpp
All you'll need to run it is the header containing the OneFile STM and a C++ compiler.
https://github.com/pramalhe/OneFile/blob/master/stms/OneFileWF.hpp

OneFile is an STM and, as such, allows for dynamic transactions. These transactions can cover multiple objects or multiple data structure instances. This means we can have a single wait-free linearizable operation which removes one key from a tree and inserts it into another tree, or into another data structure. In other words, these transactions are dynamic and have Atomicity, Isolation and Consistency (they are linearizable).
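Here is the flavor of such a cross-structure transaction. The updateTx wrapper is only a stand-in for OneFile's transaction API (here it just runs the lambda sequentially), so treat the names as assumptions:

```cpp
#include <cassert>
#include <map>
#include <utility>

// Stand-in for an STM transaction wrapper: in OneFile the lambda's loads
// and stores would be interposed and the whole thing would commit (or
// abort) atomically; here it simply executes the lambda.
template<typename F>
auto updateTx(F&& f) { return f(); }

// Atomically move a key from one map to another: with a real STM, either
// both the remove and the insert happen, or neither does.
bool moveKey(std::map<int,int>& from, std::map<int,int>& to, int key) {
    return updateTx([&]() {
        auto it = from.find(key);
        if (it == from.end()) return false;
        to[key] = it->second;
        from.erase(it);
        return true;
    });
}
```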
... if only we could have Durability... but ohhh wait, there is this thing called "Persistent Memory" (PM) which can give byte-addressable durability.


We have made a variant of OneFile that is meant for Persistent Memory and gives ACID wait-free transactions.
Next week we'll talk about Persistent Memory and how OneFile is actually the first ever wait-free database storage engine.

Monday, April 22, 2019

OneFile and Tail Latency

Last week we talked about OneFile, the world's first wait-free STM, and in this post we're going to explain the importance of wait-free progress.
https://github.com/pramalhe/OneFile/

Lock-based concurrency mechanisms (locks) have several difficulties in practice:
1) if we use more than one lock, we're subject to having deadlock issues;
2) if we use priority locks, we're subject to having priority inversion issues;
3) if we use a lock without starvation-freedom guarantees (such as a spinlock), we're subject to starvation and live-lock;
4) using locks is prone to convoying effects;
5) mutual exclusion locks don't scale for read-only operations; it takes a reader-writer lock to get some scalability for read-only operations, and even then we either execute read-only operations or a single write, but never both at the same time. To this day, there is no known efficient reader-writer lock with starvation-freedom guarantees;


In concurrency, progress comes in different flavors http://concurrencyfreaks.blogspot.com/2013/05/lock-free-and-wait-free-definition-and.html
Wait-free algorithms provide more guarantees than lock-free algorithms, which in turn provide more guarantees than blocking algorithms (with the one exception of blocking starvation-free algorithms).

A lock-free algorithm guarantees that at least one thread is making progress (whatever "progress" may mean). But it also implies that in the event of a failure of a thread/process, at least one of the other threads/processes will continue to execute. This is an extremely strong guarantee when it comes to failure-resilience and I'll cover it on an upcoming blog post.

Blocking algorithms don't have such a guarantee: if the thread/process holding the lock dies, then none of the other threads/processes will ever be able to do any work. There is a catch here, though: if the algorithm is blocking starvation-free, then given enough execution time (quantum) all the threads/processes will eventually complete. This means that, when it comes to completing an operation, starvation-freedom guarantees the operation will execute in a finite amount of time, as long as the system has enough cores to run the threads/processes (or the scheduler is somewhat fair) and no failure occurs. In the latency distribution, this gives a "cutoff" beyond which all operations are completed.
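The classic example of a blocking starvation-free algorithm is a ticket lock, which serves threads strictly in arrival order; that FIFO order is what gives the latency distribution its cutoff (assuming a fair scheduler and no failures):

```cpp
#include <atomic>
#include <cassert>

// Minimal ticket lock: each thread takes a ticket and spins until it is
// being served. No thread can be overtaken, so waiting time is bounded by
// the number of threads ahead of it -- starvation-free, but still blocking
// (if the holder dies, everyone behind it is stuck).
struct TicketLock {
    std::atomic<unsigned> next{0};     // next ticket to hand out
    std::atomic<unsigned> serving{0};  // ticket currently being served

    void lock() {
        unsigned my = next.fetch_add(1);
        while (serving.load(std::memory_order_acquire) != my) { /* spin */ }
    }
    void unlock() {
        serving.fetch_add(1, std::memory_order_release);
    }
};
```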

Lock-free algorithms don't have this cut-off. Instead, they have a "fat tail" in the latency distribution. Being lock-free implies the possibility of starvation, and in that sense (and only in that sense) starvation-free is stronger than lock-free.

Wait-free bounded progress goes one step above all this: it guarantees that, given a time slice of execution (quantum), all threads make progress. This is true even in the presence of failures: if N-1 threads/processes fail, the remaining thread/process will still be able to complete. If not, then the algorithm is not wait-free. Moreover, wait-freedom implies freedom from starvation, which means that it also has a cutoff in the latency distribution.

In summary, wait-free (bounded) provides the strongest of all progress guarantees:
- Progress for all threads/processes;
- Progress in the presence of failures;
- Deterministic cutoff on the tail of the latency distribution;


We have implemented two variants of OneFile, one with lock-free progress and the other with wait-free progress. The wait-free implementation is slightly slower in terms of throughput, because giving such strong guarantees implies executing more synchronization; however, when it comes to tail latency, the wait-free variant of OneFile is better than any other (blocking) STM by orders of magnitude (lower is better):



At the tail of the latency distribution, at the 99.99 percentile (P9999), the improvements of OneFile over the blocking STMs are huge and speak for themselves. The dark blue line is the wait-free OneFile, and lower times are better. Notice that this is a log-log plot, with each horizontal line separating an order of magnitude, meaning that the difference to the lock-free variant (light blue) and the blocking STMs is at least 1000x.

Wait-freedom is a vital characteristic when working in low-latency scenarios such as networking devices, high-frequency trading, or control systems (autopilot cars/planes/rockets). Sure, there are other ways to deal with latency, such as keeping tight control over the scheduler, and that is what the hardcore real-time people do anyway, but with OneFile we get strong latency guarantees without having control over the scheduler, and that is in itself an important change for multi-threaded embedded systems development.


Next week we're going to talk about how OneFile allows us to create wait-free balanced trees.