Wednesday, November 6, 2019

CX - Wait-Free Universal Construction

After many years, we've decided to make public the source code for the CX Universal Construction and explain what it is. Source code can be obtained here:
and paper here:

CX is the first wait-free Universal Construction that can actually be used in practice.
It can take lambdas for its operations, it has integrated (wait-free) memory reclamation, it was extensively tested and it doesn't leak.
CX allows you to take a sequential data structure or object (single-threaded code) and make a wait-free concurrent implementation of the same object by simply encapsulating the methods in calls to updateTx() or readTx(). The resulting data structures have linearizable operations.

Transforming a sequential data structure with CX is not trivial, but it is simple enough that a developer fresh out of college can do it. There are a few rules specific to wait-free UCs that need to be followed (no I/O, no exceptions, no side effects, deterministic code only) but these are easy enough to adhere to.

Unlike handmade wait-free and lock-free data structures, you can have any kind of type in your container (there are limitations on the return values though). All lock-free data structures I've seen so far were meant for integers only, or something equal or smaller than 64bits. Moreover, with CX it's easy to modify the data structure later and add more functionality.

Unlike STMs, CX does not need to have interposing and therefore, we can use CX to wrap data structures in libraries to which we don't have access to the source code. For example, you can use CX to wrap STL containers (we have!). Here is an example on how to make a wait-free std::set
or even easier, just create the instance and do the operations on the fly:

All you need is a copy constructor for the underlying data structure you want to protect. As long as your sequential data structure has a copy constructor (or something logically equivalent) and deterministic operations, you should be able to convert it to a wait-free data structure.

We've used CX to make the world's first wait-free balanced binary tree, with integrated wait-free memory reclamation, capable of supporting multiple types. How did we do that? Well, we wrapped a std::set in CX and BOOM, out comes the world's first wait-free balanced BST  :)

Several years ago, when Andreia first came up with the idea that we could use a reader-writer lock to achieve wait-freedom, I though that was crazy, but cool. After joining up forces with Pascal and a lot of work, we were able to turn that into a universal construction, CX.
Sure, it has its limitations, but it opens new venues for research and is evidence of some contrarian concepts:
- wait-freedom can be generic and fast (at least for read-mostly workloads);
- wait-free memory reclamation can be lightweight;
- wait-freedom can be achieved with locks;

Now for the bad news: just because it's wait-free, doesn't mean it scales.
CX serializes all mutative operations which means it's flat at best if the workload consists of only mutations on the sequential object.
It also consumes larges amounts of memory and in the very unlikely worst case, may require 2x MAX_THREADS replicas of the sequential object. If the object is a data structure with 1000 nodes, then it doesn't make a dent, but if we're talking about a container with multiple millions of keys and tens of threads writing to it, then it's going to chew up with DRAM pretty fast.
This high memory usage means CX likely won't be your favorite concurrency approach when deploying in production.

The good news is that it scales really well for read-only operations. This means that as long as your workload is read-mostly, it will be able to scale.
Synchronization for read-only operations in CX is very low, typically a few acquire-loads and a single seq-cst fence instruction and beyond that it's just regular loads.
Yes, that's right, the user-code you place in the lambdas is being executed without having to do any load acquires or heavy synchronization fences, just like when you execute code inside a lock!
This is what makes CX be very fast for read-mostly workloads.

In fact, it's so fast that when we submitted the paper to conferences, a few reviewers complained that it was too good to be true and it couldn't possibly be correct. That's how fast it is  ;)
Hand-made lock-free data structures can take months (or years) of work by expert researchers to complete, therefore it's normal be greeted with disbelief when we show them a technique capable of transforming generic sequential code into wait-free safe code. Something a young developer can do in minutes. If on top of that we say that it has wait-free memory reclamation and on top of it we show plots where it beats multiple handmade data structures in read-mostly scenarios, then they go ballistic and think it's all a bunch of BS.
It's understandable, we all have our biases. Mine is to think that it's a good idea to trade off large amounts of memory for a generic construct that scales in read-mostly workloads.What's yours?

Hey, nobody's perfect!

Monday, November 4, 2019

Is Left-Right a generic concurrency control?

Spoiler alert, yes, Left-Right is a generic concurrency control!

We mentioned Left-Right many times before, but here it goes again.
Left-Right is a technique where two replicas of your data are kept up to date. Mutative operations execute in one replica while the other replica can be read by in-flight readers and then the role of the replicas toggles for the write to apply the same operation on the opposite replica while the new readers go on the freshly modified replica.

Through the careful usage of atomics and RCU, this can be done in such a way as to have wait-free progress for the readers and that's what Left-Right is about.
There is only one writer at a time (we can use a global lock) but it can use flat combining to aggregate requests from multiple writers and thus have data locality without causing too much contention on the writer lock. This data locality allows for a "flat" throughput of the write operations, and in the case of Persistent Memory (Romulus) is can even have high positive scaling due to saving writes to PM.

If you want to see more details on Left-Right you check out this ppt:
or see my cppcon video on this topic:

Because Left-Right is a generic concurrency control, it can be used to make a Universal Construction
or an STM
both approaches imply blocking yet starvation-free writes, and wait-free population oblivious reads.
Read-only operations can execute in parallel with the write operation because they're accessing opposite replicas.

When a user decides to utilize the Left-Right UC, all it has to do is place the user-code in a lambda and pass it to the Left-Right UC.
The code in the lambda has some minor restrictions, shared in common with all other generic approaches that give wait-freedom for reads:
- No side effects in the lambda: can't modify global variables or thread-locals;
- Deterministic code: no access to random number generator devices or random functions, unless it's a pseudo-random generator and the seed is passed to the lambda;
- No I/O; no writing to files, no sending/receiving of network packets, no hw registers being written... actually all of this can be done but then you have to use an STM version (like Romulus) which needs load and store interposing
- No exceptions: any exceptions must be catch inside the lambda (user-code). For the read-only operations it can be made such that this is not a problem (implementation detail) but for the mutative operations this is an algorithmic constraint;
Operations in Left-Right are irrevocable (they don't abort-retry) which is a nice feature for most user code.

The bad news: it uses twice as much memory. Like all good things, there are trade-offs and the wait-freedom for readers has a cost, the memory usage.

Remember the issue about cross-dependencies (write skews) from the previous post ? Left-Right doesn't have such an issue because the synchronization is implicitly a global lock. Yes, yes, it serializes mutative operations, but remember, read-only operations are wait-free and go in parallel with the mutations. As long as your workload is read-dominated, Left-Right is a good trade-off.

Another nice feature is that memory reclamation works just like inside a lock: you just call malloc/free/new/delete and it works. No need to use epoch-based reclamation explicitly (it's already implicit in the usage of the userspace RCU inside Left-Right).

On the next post we're going to talk about a fully wait-free Universal Construction.

Wednesday, October 30, 2019

How about RLU? Is RLU a generic concurrency control?

On the previous post we went over why RCU is not a generic concurrency control. This time we'll talk about RLU (it's an L, not a C):

RLU is a synchronization technique which contains inside it an efficient userspace RCU.
It supports multi-atomic updates, making them visible in one-shot to the readers, without having to make an replica of the entire data, like on RCU.

The idea of RLU is that during a write operation, each modified object is first copied into a per-thread log, then it makes the modifications to the object in the log. When all modifications have been made, it increments the global clock, thus making all modifications atomically visible to readers. I'm over-simplifying, but it's ok for this post.

RLU is certainly more generic and efficient than RCU but it does not allow for generic code to be used. In fact, RLU does write-read synchronization but it does not handle write-write synchronization and the the paper itself mentions this explicitly:
"The basic idea of RLU described above provides read-write synchronization for object accesses. It does not however ensure write-write synchronization, which must be managedby the programmer if needed (as with RCU)".

To better understand why RLU is not a generic approach to concurrency, let's consider the canonical example of two threads 1 and 2 with cross dependencies. They are attempting to execute the following operations, where x and y start both at false:
Thread 1:
  if (!x) y = true;

Thread 2:
  if (!y) x = true;

As I'm sure you're aware, any serializable execution of these two operations will either result in x and y being {true,false} or {false,true} but never {true,true}. One of the operations has to "happen before" the other. If T1 executes before T2 then the end result will be {false,true} and if T2 executes before T1 then the end result will be {true,false}.
If you run this code without modification on RLU, it can happen that T1 reads x and sees false while T2 reads y and sees false, then T1 acquires the lock on y while T2 acquires the lock on x and each write to their log that the variables will be true and each advance the global clock and commit. In case you're wondering, this scenario is not linearizable.

In the database world, this kind of problem is called a write skew anomaly (see section 2.4 of the MV-RLU paper for a great explanation in the context of RLU). In the context of concurrent data structures, we call these bugs!
Handling cases like this is what an STM or a Universal Construction are supposed to do because an STM/UC must handle any sequential piece code (with some exceptions).
Granted, if you're really careful how you write the code, you can use RLU as long as you avoid writing sequential code that has this kind of cross dependencies. The problem is, in practice, this is very hard to do and even when you do it, you're never really sure it's actually done correctly (did I miss some cross dependency?). If you write user-code that has one of these problems, it will be hard to find it and likely hard to fix it.

There is an easy solution for RLU and that is to use a global lock, but then it looses some of its appeal because it no longer scales for writes.
Keep in mind though that RLU scales for read-only transactions!

STMs like TinySTM or TL2 have time-based algorithms that keep a read-set of all loads done within the transaction and do validation of the read-set at commit time exactly to handle these kind of cross dependencies.
If you're wondering how frequent are cross dependencies in real code, then even an ordered linked list is prone to these kind of issues (depending on how it is implemented), which helps to show that it's very hard to get away from this problem in practice.

If you ever read a paper where they say that they used RLU to make an STM, be very very suspicious: an STM must be able to handle cross dependencies and provide linearizable (globally serializable) operations regardless of what the underlying sequential user-code. If it works only for some small subset of user programs, then it's not an STM... sure it's a very valuable tool, but it's not an STM.

Monday, October 28, 2019

Is RCU a generic concurrency control?

In this post we're going to talk about RCU and we're going to start by what it means.
When people talk about RCU they usually mean two different things combined: a synchronization mechanism, which I'll call RCU-sync, and a make-copy-of-object-and-modify-copy, which I'll call Copy-On-Write (COW).

Let's start with Copy-On-Write.
This is a technique used in concurrency and even on durability (see previous post) where instead of modifying an object in-place, you make a copy of the object, modify the new object and then toggle a pointer to point from the old object to the new object. The name RCU comes from Read-Copy-Update as a reference to this technique.
When used in (shared memory) concurrency this creates the problem of what to do with the old object: is it safe to delete it immediately? What if there are other threads currently accessing it?
Solving this problem of object lifetime tracking is what the synchronization mechanism of RCU (RCU-sync) does.

There are many different algorithms that implement an RCU-sync. Some examples of RCU-sync algorithms can be seen in Paul McKenney's excellent book or in our paper about fast userspace RCUs. The thing about an RCU-sync is that it's a concurrency construct, like a mutual exclusion lock or a reader-writer lock. For example, a mutual exclusion lock implements at least the following API:

while a reader-writer lock implements at least:

and an RCU-sync implements at least:

The difference between RCU and locks is that locks provide a generic way of doing concurrency, while RCU does not.

The rcu_read_lock() and rcu_read_unlock() methods must guarantee that code within a critical section defined by these two functions must not escape outside the critical section. An RCU-sync must also guarantee that a call to rcu_synchronize() will wait for all threads currently executing a block of code in a rcu_read_lock/unlock() block to complete (call rcu_read_unlock) before rcu_synchronize() returns to the caller. This guarantee is called quiescence and it's extremely useful in shared memory concurrency. Note that the actual semantics are much more formal that what I've described here, but for the purpose of this post, it's enough to get an idea of how an RCU-sync behaves.

As mentioned above, RCU-sync provides quiescence, not mutual exclusion.
A thread calling rcu_synchronize() will know that by the time the call to rcu_synchronize() returns, everything done before this call is now visible to other threads (as long as the other threads call rcu_read_lock/unlock to access the associated data). However, there is nothing preventing other threads from modyfing the same data at the same time (data race behavior). This means that even when using RCU-synch we still need some extra mechanism to provide mutual exclusion or prevent races in some other way.

At its essence, RCU-sync is mechanism for a single writer to synchronize with multiple readers. It doesn't solve the problem of multiple writers and it also doesn't solve the problem of the readers having a consistent view of the data.
Let me explain a bit more about what I mean with "consistent view". Suppose we have two variables, "a" and "b" initially at zero and a single writer thread and multiple reader threads. The writer code could be:
writer() {
    a = 1;
    b = 1;

and the reader code:
reader() {
    printf("a=%d\n", a);
    printf("b=%d\n", b);

A reader thread could print "a=1,b=0", even if "a" and "b" are std::atomic with memory_order_consume, or even memory_order_seq_cst.
RCU-sync does not provide atomicity for multiple variables. In order to achieve this, we need to aggregate all those variables within a single object and make a snapshopt of this object using Copy-On-Write. But then, the concurrency technique really is COW, not RCU-sync. With such an approach, the RCU-sync is used only for memory reclamation of the old objects and if we're in Java then we don't even need that because the GC will do it for us.
And that is why whenever they use copy-on-write on java.util.concurrent they call it copy-on-write and not RCU  ;)

Let me repeat it again: RCU-sync is extremely useful for memory reclamation and as a component of more complex concurrency techniques, but it is not generic. RCU-sync does not provide atomicity, nor does it prevent write-write races, not even read-write races. All it does it give quiescence, which personally I think it's awesome! It's all a matter of expectations.

Just to go back to COW briefly, COW can be used to solve write-write races and read-write races by creating a new object on every modification of every sub-object. Basically, if you want full consistency across your data, you need a god object that contains all your data and any modification to the data (or just a single variable) will require a full copy of the entire data.
So yes, COW can be generic, it's just that nobody does it like this because it would be too slow.

Both COW and RCU-sync are valuable tools that researchers and concurrency library implementers can use to make sophisticated concurrent data structures or other synchronization mechanisms, but calling them generic concurrency controls is going too far.
In other words, there certainly are database engines and STMs (software transactional memory) which use COW or RCU-sync internally, but you'll never see a DBMS or STM whose concurrency control is RCU. It just doesn't make sense to have a DBMS where any mutative operation causes a copy of the entire DB to be made. Because, that would be the only way to provide generic transactions on that DBMS or STM.
In the DBMS context, a concurrency control is something like MVCC (multi-version concurrency control) or 2PL (two-phase locking) or OCC (optimistic concurrency control).

RCU is not a generic concurrency control.