Monday, July 17, 2017

What is Eventual Consistency ?

On this post we're going to talk about Eventual Consistency in the context of lock-free and wait-free data structures, with focus on the C++ memory model, although most (or all) of what we're going to cover also applies to other memory models likes the ones in D, Java , Rust, C11, <insert language here>.
If you're looking for an explanation on what Eventual Consistency is in the context of databases then this (may or) may not be what you're looking for!

Before we explain what is eventual consistency in C++, we need to explain a little of what shared memory is and a what is a memory model.
Computers do many things: arithmetic computations, decide different things based on logical expressions, sending and receiving packets, putting pixels on a screen, etc. For this post, we care about reading from main memory and writing to main memory. This means "loads" and "stores".
The whole C++ memory model is about what happens when you do loads and stores from different threads, and all the other stuff that the computer does is pretty much out of the picture.
Obviously, stores are used to store a value in a certain memory location, and loads are used to read a value from a given memory location:;
    value = x.load();

Also, from the memory model's point of view, all loads are indistinguishable, and so are all stores. CPU vendors have peculiarities, like some stores may re-order but others not (think stores on the same cache line), but the memory model doesn't really care about that, so for this post, we won't care either.

Ok, so this was really important. If you're not convinced that everything is about loads and stores then you better go and read the memory model of your favorite language and then come back.
Let me say it once more: It's all about loads and stores.

Combinatorically, if we have two different operations (loads and stores) we can define ordering constraints based on four cases:
1. A load may not reorder with a subsequent load;
2. A store may not reorder with a subsequent store;
3. A load may not reorder with a subsequent store;
4. A store may not reorder with a subsequent load;

When none of the four constraints above are set, and if we want to share data among threads or processes (remember, our context is lock-free and wait-free data structures) then in C++ this is called memory_order_relaxed. In this memory ordering any loads can reorder with any stores. Whatever the order of the code you wrote, the compiler may decide to reorder the loads and stores in whatever way it thinks is the best.
This has reduced synchronization and gives a lot of leeway to the compiler, thus resulting in high performance code. The downside is that most lock-free and wait-free algorithms were designed with a specific order in mind, and when this order is not respected, the algorithm no longer works.

To enforce constraint 1, the first load has to be done with a memory_order_acquire or memory_order_seq_cst. In the C++ memory model, loads with memory_order_acquire can not be reordered with any subsequent load or store. Most of today's CPUs enforce this constraint by default already, which means that it needs no synchronization as well.

To enforce constraint 2, the second store has to be done with a memory_order_release or memory_order_seq_cst. In the C++ memory model, stores with a memory_order_release can not be reordered with any previous load or store. CPUs with Total-Store-Order (TSO) like x86, enforce this constraint by default already, which means it needs no explicit synchronization. In practice the CPU internally has synchronization to achieve this strong ordering, but that's out of scope.

To enforce constraint 3, the first load has to be done with memory_order_acquire or memory_order_seq_cst. x86 CPUs provide this guarantee without explicit synchronization (barriers/fences).

To enforce constraint 4, both the load and the store have to be memory_order_seq_cst. No practical CPU provides this by default (such a CPU would be too slow) and to have such a constraint requires costly synchronization, typically a full fence or at least a store-load fence.
Explaining why this requires such high synchronization would require an entire (large) post about this topic, so let's just accept it as it is.

To sum it up, if we want all four constraints, we have to use sequentially consistent memory ordering, while if we want only the first three constraints, then the acquire-release memory ordering is enough.
Notice that stores and loads are done at the word level, 64 bits let's say. Therefore, when we talk about sequential consistency we are talking about it at the word level. Any object bigger than one word and this guarantee disappears because it will take multiple stores (or multiple loads) to modify (or read) such an object.

So what about eventual consistency?
Roughly speaking, eventual consistency guarantees if no new updates (stores) are made to a data item (word), then eventually all accesses (loads) to that item (word) will return the last updated value.
In other words, if we stop doing stores in a certain variable, eventually, when we go and read it from another thread, we will see the last store that was made. Pop quiz: which memory ordering is the weakest that gives this guarantee?
Answer: It's memory_order_relaxed (memory_order_acq_rel and memory_order_seq_cst also provide this guarantee but at a greater cost).
You see, something I didn't say about memory_order_relaxed is that the end result must look like the code you wrote, which means that eventually it will have to keep whatever was the last store you wrote in your code. This is highly non-obvious from reading the specs, but that's how it works, otherwise it would let the compiler produce incorrect programs, even in single-threaded code.

Ok, so now that we know that in C++ memory_order_relaxed provides eventual consistency (at the word level), what can we do with that?
Well, the answer to that depends on how useful you think that eventual consistency is.
The term eventual consistency is meant as a consistency property of an item, or object in C++ parlance. If the object is the size of a word (64 bits or less) then just make it atomic<> and all stores and loads to/from it must be changed to use memory_order_relaxed.
If the object is larger than a word then things start to get more complicated because then we could have concurrent stores touching different words of the object and it would leave it in an inconsistent state. Example below:
struct myobject {
    atomic<uint64_t> a {0};  // Consistency for this object implies that a is equal to b
    atomic<uint64_t> b {0};

myobject obj;

Thread 1:, memory_order_relaxed);, memory_order_relaxed);
Thread 2:, memory_order_relaxed);  // May occur before in thread 1, memory_order_relaxed);  // May occur after in thread 1
Thread 3:
    while (true) {
        if (obj.a.load() == obj.b.load()) break; // This may loop forever

In the above example, we defined the object to be consistent when having a and b members equal. The stores with Thread 1 may interleave with the stores in Thread 2, causing the object to be left in a inconsistent state permanently, with a=1 and b=2, which is not eventually consistent because the stores on the object have stopped but thread 3 will be in an infinite loop waiting for the object to be consistent.

Even without going into the topic of whether or when is eventual consistency useful, we still have to care about a subtle but extremely important detail in the definition of eventual consistency: it refers to an object.
Let me say this again because it's important: Eventual Consistency requires you to define the item (or object) on where this consistency is being applied.
This means that what you define to be the object affects the consistency.
Saying that you provide eventual consistency in your application is completely irrelevant unless you define at which granularity level you're providing such a guarantee.
If you're providing eventual consistency at the word level, then it's easy to implement, just make all accesses memory_order_relaxed;
If it's eventual consistent at the C++ object level, then each object must itself be atomic<>, which means that the underlying implementation is using a mutual exclusion lock to protect it, thus denying any benefit from doing eventual consistency. You might has well say you're providing strong consistency because you are in fact providing linearizability (at the object level);

In the context of data structures, if we say that a data structure is eventually consistent, it still needs to be atomic itself and all its nodes/whatever constitutes it. This means that the data structure itself needs to be consistent, or atomic. This is extremely hard to achieve without going into a fully linearizable (or at least sequentially consistent) algorithm, so perhaps this is not very useful in practice.

However, if we have multiple data structure instances, even if each of them is linearizable, doing operations or two or more of these data structures is not trivial to make them appear atomic (unless we're protecting everything with a global lock, or using a transactional memory). In such case, we can simply say that we're providing eventual consistency at the data structure level, and it would be correct to say so, because in each data structure instance we provide atomicity (linearizable consistency) but across data structure instances we only guarantee that eventually the modifications will be seen over all the instances.

If you take anything away from this post it should be this:
Eventual Consistency only makes sense when we define what is the "object" where this consistency is being applied.

Monday, July 10, 2017

Left-Right and C-RW-WP with Flat Combining

Flat Combining is a cool technique that allows multiple threads to help each other.
It was designed to be used with mutual exclusion locks but it has been used with other things,
such as Hardware Transactional Memory (see "Transactional Lock Elision Meets Combining" in PODC 2017).
We have used it to improve the throughput of C-RW-WP and Left-Right.

C-RW-WP can be seen here:
and Left-Right here:

Here is how it works for C-RW-WP:
- Writers publish their operation (a std::function) using a publishing mechanism, in our case it's just an array of atomic pointers to std::function. Each std::function typically stores a lambda on which we captured whatever parameters are needed to make things simpler;
- A writer spins until it gets the exclusive lock or until its entry in the publishing array goes to nullptr, thus meaning that some other writer has completed its operation and has placed the result in the corresponding entry on the array of results;
- If the writer gets the exclusive lock, then it is his responsibility to execute the other thread's operations that are in the publishing array, save the results in the results array, and set the pointer in the publishing array back to nullptr;
- Readers don't need to publish its operation in the publishing array, unless the lock is already in exclusive mode. In that case, they publish the operation and wait for the lock to be released, or for the entry in the publishing array to become nullptr;

Example code can be seen here:

There are a couple of minor tricks:
1st, the writer holding the lock will execute each operation at a time
and only then will it set the entry to nullptr with a store release
this way, another thread is guaranteed to see the result there if it sees its own entry at nullptr

2nd, the writers don't have a starvation-free cohort lock, it's a simple spin-lock, but because the Flat Combining technique is starvation-free, the writers end up being starvation-free (among themselves), a very good property to have without having to make a more complex or strick-fairness lock, such as a ticket lock.

3rd, the readers attempt to get the lock in shared mode first, and only if it is not available (a writer is already there) then they publish their operations
A reader's operation is not mutative (by definition) but the result needs to be known, so it's up to the writer (who doesn't care if the operation is mutative or not) to apply the operation and place the result in the results array.
The consequence of this is that in low contention, C-RW-WP works as usual (which is blazingly fast), but when there is more contention and some threads are writers, then it goes on the "flat combining path" which is starvation-free.

This means that the end result is a technique that is fully starvation-free, i.e. readers don't starve writers, writers don't starve readers, readers don't starve other readers, and writers don't starve other writers.
I don't know about you, but I think it's pretty cool to take something that has a large propensity for starvation like C-RW-WP (writers starve readers and writers will starve other writers if the cohort is a spinlock), apply Flat Combining to it, and get something that is fully starvation-free, and on top of that has lower tail latency and higher throughput on most workloads  :-O

So how about applying Flat Combining to Left-Right ?
This was so obvious that even back in 2012 Andreia was speaking of doing this (before we even read the Flat combining paper) but it was too simplistic and we wanted full wait-freedom so she dropped it. But now, here it comes again  :)

Adding Flat Combining to Left-Right is a bit trickier than adding Flat Combining to C-RW-WP, but not by much.

The first difference is that readers are wait-free in Left-Right, so there is no point for them to add their operations to the publishing array. Flat Combining is great but it provides at best starvation-free progress, while Left-Right provides wait-freedom for readers.
In Left-Right neither readers starve writers nor writers starve readers, so there isn't much of an advantage in progress of adding FC to Left-Right likes there was for C-RW-WP, but there is still the advantage of using a spin lock as the writersMutex and still providing starvation-freedom amongst writers, so we did that

The second difference is that we don't want the writer currently holding the writersMutex lock, to apply operations from other writers in one of the instances but not the other, because it would leave the instances in an inconsistent state. To solve this, we first copy the entire publishing array (it's just pointers) to a local (stack-allocated) array:
and only then do we start applying the mutations from the local array
After the toggle, we apply the same mutations on the opposite instances
and only then do we set the corresponding entries in the publishing array to nullptr

Just for fun, here are some plots.
Unfortunately I don't have the same plots with/without Flat Combining, but with FC is generally better:

For mutative only workloads (100% writes) using a global lock is better than Left-Right because on Left-Right we have to apply the same mutation twice (once on each instance). As soon as the number of readers increases, Left-Right takes the upper hand (regardless of using Flat Combining or not) and start to be better.

This kind of benchmark uses "mixed-ratio" threads and without over-subscription, so we don't really see the effects of starvation-freedom or wait-freedom, but it's good to know that Left-Right can be combined with Flat Combining and still come out ahead  :)