Wednesday, November 6, 2019

CX - Wait-Free Universal Construction

After many years, we've decided to make public the source code for the CX Universal Construction and explain what it is. Source code can be obtained here:
and paper here:

CX is the first wait-free Universal Construction that can actually be used in practice.
It can take lambdas for its operations, it has integrated (wait-free) memory reclamation, it was extensively tested and it doesn't leak.
CX allows you to take a sequential data structure or object (single-threaded code) and make a wait-free concurrent implementation of the same object by simply encapsulating the methods in calls to updateTx() or readTx(). The resulting data structures have linearizable operations.

Transforming a sequential data structure with CX is not trivial, but it is simple enough that a developer fresh out of college can do it. There are a few rules specific to wait-free UCs that need to be followed (no I/O, no exceptions, no side effects, deterministic code only) but these are easy enough to adhere to.

Unlike handmade wait-free and lock-free data structures, you can have any kind of type in your container (there are limitations on the return values though). All lock-free data structures I've seen so far were meant for integers only, or something equal or smaller than 64bits. Moreover, with CX it's easy to modify the data structure later and add more functionality.

Unlike STMs, CX does not need to have interposing and therefore, we can use CX to wrap data structures in libraries to which we don't have access to the source code. For example, you can use CX to wrap STL containers (we have!). Here is an example on how to make a wait-free std::set
or even easier, just create the instance and do the operations on the fly:

All you need is a copy constructor for the underlying data structure you want to protect. As long as your sequential data structure has a copy constructor (or something logically equivalent) and deterministic operations, you should be able to convert it to a wait-free data structure.

We've used CX to make the world's first wait-free balanced binary tree, with integrated wait-free memory reclamation, capable of supporting multiple types. How did we do that? Well, we wrapped a std::set in CX and BOOM, out comes the world's first wait-free balanced BST  :)

Several years ago, when Andreia first came up with the idea that we could use a reader-writer lock to achieve wait-freedom, I though that was crazy, but cool. After joining up forces with Pascal and a lot of work, we were able to turn that into a universal construction, CX.
Sure, it has its limitations, but it opens new venues for research and is evidence of some contrarian concepts:
- wait-freedom can be generic and fast (at least for read-mostly workloads);
- wait-free memory reclamation can be lightweight;
- wait-freedom can be achieved with locks;

Now for the bad news: just because it's wait-free, doesn't mean it scales.
CX serializes all mutative operations which means it's flat at best if the workload consists of only mutations on the sequential object.
It also consumes larges amounts of memory and in the very unlikely worst case, may require 2x MAX_THREADS replicas of the sequential object. If the object is a data structure with 1000 nodes, then it doesn't make a dent, but if we're talking about a container with multiple millions of keys and tens of threads writing to it, then it's going to chew up with DRAM pretty fast.
This high memory usage means CX likely won't be your favorite concurrency approach when deploying in production.

The good news is that it scales really well for read-only operations. This means that as long as your workload is read-mostly, it will be able to scale.
Synchronization for read-only operations in CX is very low, typically a few acquire-loads and a single seq-cst fence instruction and beyond that it's just regular loads.
Yes, that's right, the user-code you place in the lambdas is being executed without having to do any load acquires or heavy synchronization fences, just like when you execute code inside a lock!
This is what makes CX be very fast for read-mostly workloads.

In fact, it's so fast that when we submitted the paper to conferences, a few reviewers complained that it was too good to be true and it couldn't possibly be correct. That's how fast it is  ;)
Hand-made lock-free data structures can take months (or years) of work by expert researchers to complete, therefore it's normal be greeted with disbelief when we show them a technique capable of transforming generic sequential code into wait-free safe code. Something a young developer can do in minutes. If on top of that we say that it has wait-free memory reclamation and on top of it we show plots where it beats multiple handmade data structures in read-mostly scenarios, then they go ballistic and think it's all a bunch of BS.
It's understandable, we all have our biases. Mine is to think that it's a good idea to trade off large amounts of memory for a generic construct that scales in read-mostly workloads.What's yours?

Hey, nobody's perfect!

Monday, November 4, 2019

Is Left-Right a generic concurrency control?

Spoiler alert, yes, Left-Right is a generic concurrency control!

We mentioned Left-Right many times before, but here it goes again.
Left-Right is a technique where two replicas of your data are kept up to date. Mutative operations execute in one replica while the other replica can be read by in-flight readers and then the role of the replicas toggles for the write to apply the same operation on the opposite replica while the new readers go on the freshly modified replica.

Through the careful usage of atomics and RCU, this can be done in such a way as to have wait-free progress for the readers and that's what Left-Right is about.
There is only one writer at a time (we can use a global lock) but it can use flat combining to aggregate requests from multiple writers and thus have data locality without causing too much contention on the writer lock. This data locality allows for a "flat" throughput of the write operations, and in the case of Persistent Memory (Romulus) is can even have high positive scaling due to saving writes to PM.

If you want to see more details on Left-Right you check out this ppt:
or see my cppcon video on this topic:

Because Left-Right is a generic concurrency control, it can be used to make a Universal Construction
or an STM
both approaches imply blocking yet starvation-free writes, and wait-free population oblivious reads.
Read-only operations can execute in parallel with the write operation because they're accessing opposite replicas.

When a user decides to utilize the Left-Right UC, all it has to do is place the user-code in a lambda and pass it to the Left-Right UC.
The code in the lambda has some minor restrictions, shared in common with all other generic approaches that give wait-freedom for reads:
- No side effects in the lambda: can't modify global variables or thread-locals;
- Deterministic code: no access to random number generator devices or random functions, unless it's a pseudo-random generator and the seed is passed to the lambda;
- No I/O; no writing to files, no sending/receiving of network packets, no hw registers being written... actually all of this can be done but then you have to use an STM version (like Romulus) which needs load and store interposing
- No exceptions: any exceptions must be catch inside the lambda (user-code). For the read-only operations it can be made such that this is not a problem (implementation detail) but for the mutative operations this is an algorithmic constraint;
Operations in Left-Right are irrevocable (they don't abort-retry) which is a nice feature for most user code.

The bad news: it uses twice as much memory. Like all good things, there are trade-offs and the wait-freedom for readers has a cost, the memory usage.

Remember the issue about cross-dependencies (write skews) from the previous post ? Left-Right doesn't have such an issue because the synchronization is implicitly a global lock. Yes, yes, it serializes mutative operations, but remember, read-only operations are wait-free and go in parallel with the mutations. As long as your workload is read-dominated, Left-Right is a good trade-off.

Another nice feature is that memory reclamation works just like inside a lock: you just call malloc/free/new/delete and it works. No need to use epoch-based reclamation explicitly (it's already implicit in the usage of the userspace RCU inside Left-Right).

On the next post we're going to talk about a fully wait-free Universal Construction.