Tuesday, September 23, 2014

The two sides of Wait-Free Population Oblivious

The strongest of all progress conditions is wait-free population oblivious (WFPO), and yet, there are two very different things that are called WFPO. One of them scales, and the other one doesn't.
You see, the WFPO definition requires that the number of operations be bounded by a constant that does not depend on the number of threads, but it doesn't take into account that a single hardware instruction may take an amount of time proportional to the number of threads/cores working on the same variable (or, to be more precise, on the same cache line).
One example of this is atomic_fetch_add() in C11/C++1x, which on x86 is implemented as a single XADD instruction. From a progress condition point of view, a call to atomic_fetch_add() executes a single operation (the XADD instruction) and is therefore WFPO, but the time it takes to complete the XADD is proportional to the number of threads/cores working on the same variable, as shown in this wonderful post by Martin Thompson (in his duration plots, lower is better):

I've repeated some of his work in C++14, on an Opteron with 32 cores. The results are in the plot below, showing the number of operations per time unit as a function of the number of threads (higher is better):
There are three different lines for three different scenarios:
- All threads do XADD on the same variable;
- Each thread does XADD on a different variable in an array, but they are adjacent, and therefore, on the same cache line(s);
- Each thread does XADD on a different variable in an array, with padding so that each touches a different cache line to avoid false sharing;

C++ source code for this benchmark is available on github at this link:

As you can see, having multiple threads write to the same variable (or cache line) creates a scalability bottleneck, even though all three cases have the same progress condition WFPO... in fact, all three plots are doing exactly the same instruction, it's just that different things are being done by the cache-coherence system.
When all threads are writing to the same variable using XADD, a heavy synchronization must be done between the cores to figure out which one is going to get which number, so that the XADD returns a unique (sequential) number to each of the cores/threads. It doesn't really matter how sophisticated CPUs (or cache-coherence protocols) become in the future, there will always be a bound on the speed of communication between the cores, namely, the speed of light. It is not possible to transmit information faster than the speed of light and, to add to the pain, the more cores share the same memory, the more messages have to be passed to achieve consensus, and the more physically apart these cores will be from each other due to physical placement and heat dissipation.
This means that the time it takes for a CPU with 64 cores to complete a fully synchronized instruction like a XADD is more than 64/4 = 16 times the time it takes to complete the same kind of operation on a CPU with just 4 cores. The reason is that there are more messages to pass, and over longer distances, as shown in the schematic below:

CPU with 4 cores. Core 4 is reachable from core 1 with 2 links:

CPU with 64 cores. Core 64 is reachable from core 1 in 14 links:

It is important to understand that this limitation is a physical one, bounded by the laws of physics. Even if future CPU architects figure out a cheap and efficient way to stack cores in some kind of three-dimensional structure inside a CPU, the distance between them times the number of cores will still grow more than linearly, and no matter how smart they are, they won't be able to push the speed of the communication channels between the cores above the speed of light... it's simple Physics!
Even if there was a way to make the duration of the instruction grow only linearly with the number of cores, it would still become a scalability bottleneck.

The takeaway from this is that there are scalable WFPO algorithms and non-scalable WFPO algorithms. In order to complete, scalable WFPO algorithms/methods do not need to modify a shared variable, or do so with a low probability of contention (for example, doing XADD on one of a large number of variables, selected randomly).
Non-scalable WFPO algorithms/methods require the usage of heavy synchronization instructions, where multiple threads modify the same variable (i.e. lots of contention on one or a few variables).

At first thought, it could seem as if only synchronized modifying operations influence scalability, and that read-only operations would always scale well, but we have already seen that even when you're only reading a variable while a single other thread is modifying it, scalability can suffer, as shown in this post http://concurrencyfreaks.com/2013/10/immutable-data-structures-are-not-as.html
Having a single thread modifying a variable (with a release barrier) that is read by multiple threads (with an acquire barrier) can imply a scalability bottleneck due to the cache misses on the cache line of that particular variable.
The reasons behind this effect have, at their root, the same physics we mentioned above. When a core modifies a variable, the modification of that cache line must be transmitted to all the other cores that have it cached. Having a single thread modifying a variable while all the other ones only read it is not as bad as having multiple threads modifying the same variable, but even so, all threads must obey the laws of physics.

WFPO is therefore, not enough to distinguish between the truly scalable methods and the ones that are not scalable.
To conclude, let's make up some definitions (emphasis on the make up):
- Scalable Wait-Free Population Oblivious: A method whose number of operations is bounded, does not depend on the number of threads (or cores), and that has no or very little contention on the shared variables that the method accesses;
- Non-Scalable Wait-Free Population Oblivious: A method whose number of operations is bounded, does not depend on the number of threads (or cores), and that has some or high contention on one or several of the shared variables that the method accesses;

Notice that these are just empirical rules of thumb, and hopefully one day someone will be able to formally prove (using Information Theory or Queueing Theory) that there are in fact two different kinds of WFPO, and what their precise properties are.

Wednesday, September 17, 2014

Slides from CPPCON 2014

I was supposed to give a talk about wait-free and lock-free data structures at CppCon this year, but unfortunately I had some health issues a few weeks before and on the week of the conference I was getting surgery, so there was no way to attend  :(
I'm getting better now, but I feel really sad for missing some of the talks, especially the ones from Herb Sutter. His slides are already online, and the video should be online in a month (hopefully):

Lock-Free Programming (or Juggling Razor Blades)

Some other interesting slides:
- This one, mostly about RCU: C++ Memory Model meets High-Update-Rate Data Structures
- This one about a lock-free queue: Lock-Free by Example
- One about transactional memory in C++: What did C++ do for Transactional Memory
- Influence of cache misses and cache lines on performance: Data Oriented Design and C++
- The differences between parallelism and concurrency: Overview of Parallel Programming in C++

Sunday, September 14, 2014

Left-Right Memory usage

We've talked before about Left-Right, a technique that can allow for read-only wait-free access to any data structure or dataset by creating two instances, and also talked about Double Instance Locking which is a simpler technique that allows for lock-free instead of wait-free read-only access.

One of the first things that people tell us when they first look at the Left-Right pattern (or the Double Instance Locking) is that it has the disadvantage of using twice the memory. This is not completely accurate, because there is more than one way to use the Left-Right pattern in a multi-threaded application, and for the most common approach the increase in memory usage should be small. Let us start from the beginning:

What is a data structure?

A data structure is a way of organizing your data. Your data, henceforth named dataset, can have multiple organizations, i.e. data structures.
A concrete example would be banking software, where the dataset consists of the records for each account, containing the name of the account holder, address, telephone number, amount deposited, and other info. This dataset can be organized by multiple data structures, for example, a list-based-set containing all the high net-worth accounts, or a tree-based-map containing all the customers ordered by name, or ordered by account id as shown in the schematic below, etc.

Left-Right on a data structure

Suppose you want the internal bank application to check the value of an account. If you need wait-free guarantees for this, you can use the Left-Right pattern around the data structure of your choice. For example, suppose you have a TreeMap ordered by customer name so that you can easily find the account record for a given customer; the Left-Right pattern allows you to do this, provided you create two TreeMap instances. Notice, however, that there will be one and only one instance of each account record (dataset), and if concurrent access to it is needed, some mutual exclusion should be used, or a reader-writer lock.

For most practical applications, the contention occurs on the data structure itself, so it's enough to use the Left-Right on the data structure. This approach means that we are only duplicating the memory usage of the data structure (because there are two instances of it), but there is only one instance of the dataset and, for most applications, the size of the data structure in memory is small when compared to the size of the dataset, which means that having twice the data structure does not significantly increase the total memory usage of the application.

Left-Right on the dataset

Alternatively, instead of applying the Left-Right to the data structure, we can apply it to the dataset. This means that if a data structure is used to access the dataset, it must be a concurrent data structure (and thus provide whatever progress conditions that particular data structure provides), but the read-only access to the dataset itself will be wait-free, and there will be two instances in memory of the dataset.

A single Left-Right pattern applied to the entire dataset has the advantage that you can provide (blocking) linearizable transactions between the records of the dataset, while at the same time providing wait-free read-only access to any of the records, or even iterate through them in a wait-free and linearizable way. This approach will typically consume much more memory (almost twice as much) than the previous approach because the dataset is duplicated in memory, but it can provide a very strong consistency model (linearizability), which is sometimes a requirement in applications, or at least the safest bet when you don't know how strong the consistency model should be for that particular piece of the application.

Left-Right on the dataset and on the data structure

I expect that the least common pattern of all is the usage of the Left-Right for both the data structures and the dataset. Usually the contention or scalability/latency bottleneck is due to either one or the other, not both. Even so, it is possible to wrap both the dataset and the data structures, each with its own Left-Right pattern.

Left-right on the dataset per datum

In the two previous cases we had a single Left-Right protecting the entire dataset, but it is possible to have one Left-Right per datum of the dataset (i.e. per account record). This option will use even more memory, and it no longer allows for wait-free read-only iteration over the dataset, or transactions with simultaneous wait-free read-only access to a datum.
Total memory usage will be twice the size of the data structure, plus twice the size of the dataset, plus the size of a Left-Right object times the number of Left-Right objects, which is the number of records in the dataset.


When you design or adapt an existing multi-threaded application to use the Left-Right pattern or Double Instance Locking or any other concurrency pattern, the first thing you need to do is identify the points where the contention is the highest, and design accordingly.
For most situations, this means identifying one or two data structures on which there is a lot of contention, implying that you need to either use a lock-free data structure or use a "single-threaded" data structure and wrap it with the Left-Right pattern. Using a mutex or a reader-writer lock is rarely an option for these situations.
I would expect that there are some situations where transactions need to be done between different records of the dataset in a linearizable way, or where the dataset must be iterated while still providing linearizability as a consistency model; in those cases you need to use the Left-Right on the dataset itself (and maybe on the data structure as well).
Plan carefully, and measure the contention both in a realistic scenario and in an extreme one, to verify the application's needs in terms of scalability and latency. In the end, you'll need to measure; even if it's just some prototype code in some stress-test scenario, you must measure.

And remember, the Double Instance Locking pattern can be used in any version of C or C++, but if you want to use the Left-Right you will need a language with a well defined memory model, like: C11, C++1x, D, Java, Scala, or Go.