Thursday, May 30, 2013

Queues just aren't scalable

Queues don't scale with the number of threads. Most of you who live in the land of multi-core programming know this already, but now it has been proven in this paper.
Notice that the list of authors includes Nir Shavit, one of the co-authors of "The Art of Multiprocessor Programming" (so that's what he's been up to lately!).

For those of you not mathematically inclined (or who don't feel like reading all 18 pages of it), here is my short interpretation of what this means:
These guys have proven that Queues have a time complexity that is, at best, of the order of the number of threads concurrently accessing the Queue, and as if that wasn't enough, they proved it also for Stacks and a few other data structures that have some implicit sequentiality.
What this means for us common mortals is that, even if someone discovers a fully Wait-Free or Wait-Free Population-Oblivious Queue (assuming such a thing is possible), its performance will still degrade proportionally to the number of threads adding/removing items in the queue; in other words, it will not be scalable.

So, Queues aren't scalable and they never will be... ever!

But don't despair, there is some hope hidden behind this. Namely, you can't have a Queue that is fully scalable, but that doesn't mean that all its operations are equally non-scalable.
Just like different methods of an algorithm may have different progress-condition guarantees, perhaps they may also have different time complexities. Assuming such a thing is possible, it could allow for the creation of a Queue that has constant time complexity for insertions, and quadratic or exponential time complexity for removals (or the other way around).

With a queue like that, you would be able to insert into the queue very quickly, with little or no impact on performance no matter how many threads were inserting, but doing the removals could take a substantial amount of time, which would not improve by adding more threads to do removals, and could in fact worsen.

What possible purpose could a queue like that serve? Well, in real-time systems it is usually desirable (or even required) to handle a large number of events in a constrained amount of time, or some kind of similar bursty behavior. Imagine, for example, a software process in a router that has a fixed time to read packets before more packets arrive and fill up the hardware queues; as long as it is able to put these packets in a queue to be processed later by some other threads, it is ok.

Either someone will prove such a thing is not possible, or someone will come up with a Queue with these kinds of properties. Until then, whenever your application gives crappy scalability, just blame it on the queues   :)

Tuesday, May 21, 2013

How to micro-benchmark concurrent data structures?

One of the issues with benchmarking of concurrent algorithms is to choose the right distribution of tasks among the threads.

For many of our tests we used equal-capability threads, where each thread can be doing either a Read operation or a Write operation according to some probability (for example: 50%, 10%, or 0.1% Writes).
This is fine for testing Blocking algorithms like Locks or lock-based data structures, but for data structures with Wait-Free capabilities it just doesn't work.
The reason is that, when we use the equal-capability-threads approach, there is an implicit dependency between the Reads and the Writes: eventually we can (and will) end up with all Reads finished and only threads attempting to Write still waiting, or perhaps the other way around, depending on whether there is some kind of reader-preference or writer-preference. Either way, we create a "blocking" dependency between the two kinds of operations, which can bias the micro-benchmark results significantly.
Here is a schematic that shows what the issue looks like:

In this example we have three threads, each doing 50% Writes (one Read operation, followed by one Write operation, followed by another Read, and so on).
Most of the time of each thread is spent waiting for the Write to finish, because this is usually the slowest operation. This shows that in practice we are "blocking" the Read operations, even if they are Wait-Free, because each thread will not be able to progress to the following Read until it completes its Write, thus causing a blocking dependency.
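To make the setup concrete, here is a rough Java sketch of what one of these equal-capability threads does (the class and variable names are my own illustration, not our actual benchmark code; the shared atomic stands in for the data structure under test):

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

public class MixedTaskThread implements Runnable {
    // Stand-in for the concurrent data structure under test
    static final AtomicLong sharedVar = new AtomicLong(0);
    final double writeRatio;  // e.g. 0.5 for 50% Writes
    final int numOps;
    long readsDone = 0, writesDone = 0;

    public MixedTaskThread(double writeRatio, int numOps) {
        this.writeRatio = writeRatio;
        this.numOps = numOps;
    }

    @Override
    public void run() {
        final Random rnd = new Random();
        for (int i = 0; i < numOps; i++) {
            if (rnd.nextDouble() < writeRatio) {
                sharedVar.incrementAndGet(); // the Write, usually the slow operation
                writesDone++;
            } else {
                sharedVar.get();             // the next Read cannot start until this
                readsDone++;                 // thread's previous Write has finished
            }
        }
    }
}
```

Because each thread interleaves both kinds of operations in a single loop, a slow Write delays that same thread's next Read, which is exactly the implicit dependency described above.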

The solution is to use dedicated threads: some threads are Writers, and the remaining threads are Readers.
This way there is no blocking between the tasks of reading and writing to the data structure, and as many of them as can be executed will be. It has the added advantage that you also get to test the "fairness" of the data structure when concurrent Reads and Writes are done.
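A minimal sketch of the dedicated-threads approach in Java (the harness, class name, and the shared atomic standing in for the data structure are all illustrative, not our actual benchmark code):

```java
import java.util.concurrent.atomic.AtomicLong;

public class DedicatedThreadsBenchmark {
    // Stand-in for the concurrent data structure under test
    static final AtomicLong sharedVar = new AtomicLong(0);
    static volatile boolean running = true;

    // Runs dedicated Writer and Reader threads for a fixed time window and
    // returns the totals {writes, reads}. Reads never wait behind Writes in
    // the harness itself, because no thread mixes the two tasks.
    public static long[] run(int numWriters, int numReaders, long millis) {
        final long[] writes = new long[numWriters];
        final long[] reads  = new long[numReaders];
        Thread[] threads = new Thread[numWriters + numReaders];
        for (int i = 0; i < numWriters; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                while (running) { sharedVar.incrementAndGet(); writes[id]++; }
            });
        }
        for (int i = 0; i < numReaders; i++) {
            final int id = i;
            threads[numWriters + i] = new Thread(() -> {
                while (running) { sharedVar.get(); reads[id]++; }
            });
        }
        for (Thread t : threads) t.start();
        try { Thread.sleep(millis); } catch (InterruptedException e) { }
        running = false;
        for (Thread t : threads) { try { t.join(); } catch (InterruptedException e) { } }
        long totalWrites = 0, totalReads = 0;
        for (long w : writes) totalWrites += w;
        for (long r : reads)  totalReads  += r;
        return new long[]{ totalWrites, totalReads };
    }
}
```

Comparing the two totals as you vary the number of Reader threads gives a rough picture of the data structure's fairness under contention.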

We are currently finishing up a paper on a concurrent pattern which we call the "Left-Right pattern", and it has the property that Writes are Blocking and Reads are Wait-Free Population-Oblivious. With this kind of guarantees, we decided to make tests where we run one or two Writer threads, and all the other threads are Readers.
The reasoning behind it is that choosing two writer threads makes sure that at any given time there is at least one Writer waiting to execute, and then adding Reader threads will show how it scales under Read operations. This is a good test of the overall "fairness" of the data structure under high contention. A "fair" data structure should provide proportional or constant performance for each of the operations (whether they are Writes or Reads) as we scale the other. In the case of our "Left-Right pattern", we scale up the number of Reader threads and check how the number of Write operations changes as a function of the number of Readers.

In conclusion, mixed-task threads are fine for testing Blocking data structures, but for the ones with Wait-Free properties, it is better to use threads with dedicated tasks, in order to avoid blocking dependencies between the tasks in the test framework itself.

Sunday, May 19, 2013

Lock-Free and Wait-Free, definition and examples

After searching around on Google, I found that there is no single place that properly explains the different Progress Conditions that a concurrent algorithm or data structure may provide. Even "The Art of Multiprocessor Programming" has these definitions spread a bit around the book, and most of them are one-line sentences that are way too cryptic for common mortals like myself, so I decided to put together what I know on the subject, and add some examples for each case.

We start with a list of the different classifications for Progress Conditions separated into the four main groups Blocking, Obstruction-Free, Lock-Free, and Wait-Free:

    1. Blocking
    2. Starvation-Free
    3. Obstruction-Free
    4. Lock-Free (LF)
    5. Wait-Free (WF)
    6. Wait-Free Bounded (WFB)
    7. Wait-Free Population Oblivious (WFPO)

Before we dive into the definition of each of these terms, let us look at some of the details behind them.

From what's written in section 3.6 of "The Art of Multiprocessor Programming", it is difficult to understand whether the definitions for Wait-Free Bounded and Wait-Free Population Oblivious are orthogonal or not. Personally, I think that an algorithm being Wait-Free Population-Oblivious implies that it is bounded and, therefore, also Wait-Free Bounded. This means that a function (or algorithm, or class) can be WF, or WFB, or WFPO, this last one being the "best" of all the Progress Conditions.

The examples we give here all concern the progress condition of a particular function (of an algorithm, or class), and not the full algorithm. We do this because different functions in a given algorithm may have very different progress guarantees; for example, the Write operations may be Blocking while the Read operations are Wait-Free Population Oblivious, something that is not obvious when you first look at Lock-Free and Wait-Free data structures.

Yes, you read it correctly: a data structure may have some of its functions Blocking, others Lock-Free, and others even Wait-Free. On the other hand, when we say that a particular data structure is Lock-Free, it means that all its operations are Lock-Free or better.
There is no standard strict ranking in the literature of which one is better, but it is usually implied that:
Wait-Free is better than Lock-Free, which is better than Blocking.

The reason for this is that there aren't that many (practical) data structures that are Lock-Free or Wait-Free, so the issue of properly classifying them doesn't usually show up. Andreia and I have come up with so many different variations of concurrent algorithms with mixed Progress Conditions that we had to start naming them properly and "sorting" them, to keep a handle on it all, and to figure out which ones were worth researching.
Our order is the one described in the beginning of the post, where item 1. Blocking is the worst, and 7. WFPO is the best possible kind of guarantee you can get.

Something else that is a common cause of confusion is that "Wait-Free implies Lock-Free" (but not the other way around). This means that a function that is said to be Wait-Free gives the same (and more) progress guarantees as a function that is Lock-Free.
Here is a Venn diagram that might explain better what we mean by it:

This diagram represents sets of algorithms, where an algorithm that is WFPO is also part of the algorithms that are Lock-Free.

1. Blocking

This one is the most commonly known. Basically, almost everything with a lock is Blocking. An unexpected delay by one thread can prevent others from making progress. In the worst case, a thread holding the lock may be put to sleep, thus blocking every other thread waiting on that lock and preventing them from making any progress.
Definition: A function is said to be Blocking if it is unable to progress in its execution until some other thread releases a resource.
Example: A simple CAS in a loop for a two state variable:

AtomicInteger lock = new AtomicInteger(0);
public void funcBlocking() {
         // Spin until the lock is acquired; if the thread holding the
         // lock halts, this loop never terminates
         while (!lock.compareAndSet(0, 1)) { }
}

2. Starvation-Free

This is sometimes called Lockout-Free. It is a dependent property, in the sense that progress occurs only if the underlying platform/OS provides certain guarantees.
Definition: Every thread that wants to enter the critical section will eventually succeed, provided that each thread holding the lock eventually releases it and the scheduler lets every thread run.
Example: An exclusive lock with strict fairness is usually Starvation-Free.
A Ticket Lock on x86 is Starvation-Free because each thread uses a single atomic_fetch_add() to place itself in the virtual queue of waiting threads.
Code for the Ticket Lock can be seen here.
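A minimal Java sketch of such a Ticket Lock (the class and field names are my own, not the linked code): each thread takes a ticket with a single fetch-and-add, which places it in FIFO order, and then spins until its ticket is being served.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TicketLock {
    private final AtomicInteger ticket = new AtomicInteger(0); // next ticket to hand out
    private final AtomicInteger turn   = new AtomicInteger(0); // ticket currently being served

    public void lock() {
        // One atomic step puts us in FIFO order on the "virtual queue"
        final int myTicket = ticket.getAndIncrement();
        while (turn.get() != myTicket) {
            Thread.yield(); // wait for our turn
        }
    }

    public void unlock() {
        turn.incrementAndGet(); // serve the next ticket in line
    }
}
```

The FIFO ordering of tickets is what gives the Starvation-Freedom: no thread can overtake another that took its ticket earlier.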

3. Obstruction-Free

A non-blocking property. For more details on Obstruction-Free and Starvation-Free take a look at page 60 of "The Art of Multiprocessor Programming" (revised edition).
Definition: A function is Obstruction-Free if, from any point after which it executes in isolation, it finishes in a finite number of steps.
Example: The only example I know of is the Double-ended Queue by Maurice Herlihy, Mark Moir, and Victor Luchangco.

4. Lock-Free

The Lock-Free property guarantees that at least one thread is making progress on its work. In theory this means that a method may take an infinite number of operations to complete, but in practice it takes a short amount of time, otherwise it wouldn't be very useful.
Definition: A method is Lock-Free if it guarantees that infinitely often some thread calling this method finishes in a finite number of steps.
Example: Incrementing an atomic integer in a CAS loop:

AtomicInteger atomicVar = new AtomicInteger(0);
public void funcLockFree() {
        int localVar = atomicVar.get();
        // Retry until the CAS succeeds; a failed CAS means some
        // other thread made progress in the meantime
        while (!atomicVar.compareAndSet(localVar, localVar+1)) {
               localVar = atomicVar.get();
        }
}
Another notable example of Lock-Free is the ConcurrentLinkedQueue in java.util.concurrent, whose add() and remove() operations are Lock-Free.
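For instance, a trivial usage of ConcurrentLinkedQueue (the wrapper class name here is just for illustration):

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class ClqExample {
    // Adds one item and removes it again; both operations are Lock-Free
    public static Integer addAndPoll(int value) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        queue.add(value);    // Lock-Free insertion at the tail (CAS loop)
        return queue.poll(); // Lock-Free removal from the head (CAS loop)
    }
}
```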

5. Wait-Free

The Wait-Free property guarantees that any given thread, provided with a time-slice, will be able to make some progress and eventually complete. It guarantees that the number of steps is finite, but in practice it may be extremely large and depend on the number of active threads, which is why there aren't many practical Wait-Free data structures.
Definition: A method is Wait-Free if it guarantees that every call finishes its execution in a finite number of steps.
Example: This paper has an example of an algorithm that is Wait-Free (unbounded)
"Wait-Freedom vs. Bounded Wait-Freedom in Public Data Structures"

6. Wait-Free Bounded

Any function that is Wait-Free Bounded is also Wait-Free (but not necessarily Wait-Free Population Oblivious).
Definition: A method is Wait-Free Bounded if it guarantees that every call finishes its execution in a finite and bounded number of steps. This bound may depend on the number of threads.
Example: A function that scans/writes a value in an array whose size depends on the number of threads. If the number of operations per entry in the array is constant, then it is easy to see that it is Wait-Free Bounded, but not Population Oblivious, because the size of the array depends on the number of threads:

AtomicIntegerArray intArray = new AtomicIntegerArray(MAX_THREADS);
public void funcWaitFreeBounded() {
         // The number of steps is bounded, but the bound grows with MAX_THREADS
         for (int i = 0; i < MAX_THREADS; i++) {
                intArray.set(i, 1);
         }
}

7. Wait-Free Population Oblivious

These functions complete in a number of instructions that does not depend on the number of active threads. Any function that is Wait-Free Population Oblivious is also Wait-Free Bounded.
Definition: A Wait-Free method whose performance does not depend on the number of active threads is called Wait-Free Population Oblivious.
Example: The simplest example is incrementing an atomic variable using a fetch-and-add primitive (the XADD instruction on x86 CPUs). This can be achieved with the C11/C++11 atomic function fetch_add():

atomic<int> counter;
void funcWaitFreeBoundedPopulationOblivious() {
    // A single fetch-and-add, regardless of how many threads are running
    counter.fetch_add(1);
}
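In Java, a similar sketch (class and method names are mine) can be written with AtomicInteger.getAndIncrement(), which modern JVMs typically compile down to XADD on x86:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WfpoCounter {
    private final AtomicInteger counter = new AtomicInteger(0);

    // A single fetch-and-add: the number of steps does not depend
    // on the number of active threads
    public int increment() {
        return counter.getAndIncrement(); // returns the previous value
    }
}
```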

Closing thoughts

This isn't the whole story. There are two very important points we're missing here to get the full picture:

1st, if your function needs to allocate memory, then the progress guarantee it can give is bounded in practice by the progress condition of the memory allocation mechanism. In my view, we should have a different sub-classification to distinguish between the functions that need to allocate memory and those that don't. You can check out this post for more details, but the basic idea is that there is not much point in creating a Wait-Free Population Oblivious function that always needs to allocate memory using a Blocking mechanism.

2nd, the whole concept of Progress Conditions is to classify algorithms and functions in terms of time guarantees, yet the definitions are made in terms of numbers of operations. This is done under the assumption that the time it takes for an operation to complete is the same regardless of the number of active threads, which is a correct assumption for single-threaded code, but an invalid one when it comes to multi-threaded programs.
Because of the way the CPU's cache-coherence mechanisms work, contention on (atomic) variables that are being accessed by multiple threads/cores can cause an operation/instruction to take much longer in certain situations (due to cache-misses). If you don't believe me, take a look at this post on the Mechanical Sympathy blog.
This is one of the main reasons why many implementations of Wait-Free data structures are much slower than Lock-Free implementations of the same data structures: although they have a stronger guarantee on the total number of operations, each operation may take much longer to complete because it causes more contention... and because they may have a larger average number of operations.

In practice, there is little difference between "Wait-Free" and "Wait-Free Bounded". This paper has a good explanation on it: "Wait-Freedom vs. Bounded Wait-Freedom in Public Data Structures"

Another detail is that there is no "Wait-Free Bounded Population Oblivious", because being Population Oblivious implies an upper bound on the worst-case number of operations it takes to complete.

Hope that this post has helped you understand the differences between Lock-Free and Wait-Free.