Sunday, May 31, 2015

The Triple-Blocking problem in Actor Models

Actor Models are a nice new feature in many languages, either through direct support in the language like Clojure, or in a library like Akka for Java and Scala.
Actors are an interesting tool to deal with concurrency, that can be extremely successful at resolving certain kinds of problems, but they are not the "silver bullet".
If you want to see some of its flaws, then take a look at this presentation from Akka Days by Derek Wyatt from Auvik Networks... you can skip to minute 14 which is when the interesting stuff starts, and goes up to minute 50:

Besides the issues Derek points out, there is one more which I call "The Triple-Blocking Problem". Contrary to what many people think, just because it is asynchronous, it doesn't mean it is non-blocking, and in fact, most Actor Models are blocking in three different ways:
  1. The message passing queue is "blocking";
  2. An Actor can block waiting for another Actor;
  3. An actor will block waiting for a Future to have its result ready;

Let's look at the first one.
Several actor implementations use a blocking queue for the message passing mechanism, which means that only one thread/actor at a time can send messages to a given actor.
Perhaps even more worrisome, is that some implementations use a queue by Dmitry Yuokov that is wait-free for enqueueing and blocking for dequeueing, which is non-blocking for the act of enqueueing, but an actor dequeueing messages from its message queue can block indefinitely waiting for an enqueuer to complete, i.e. link the next node to the list.

The second case happens when an actor wants the result of a computation from another actor immediately. The other actor must first process the previous messages in the queue, and then execute the corresponding work/computation, and then provide the result. At any instant, the thread running the actor executing the work can be delayed/preempted/sleep/die, whatever.
The simple act of delegating the work to another actor/thread means that it can inherently block.

The third case is a kind of extension of the second case.
An actor can tell another actor to do some work/computation and put the result in a Future, but regardless of what is done by the first actor, it will need the result, at some point in time. When that happens, if the result is not yet ready, the first actor will block. It's as simple as that.

Having said all of this, remember that, "just because it's blocking, doesn't mean it's bad"!
Even blocking methods/actors can scale well under the right circumstances, so there is hope for the Actor Model, but don't expect actors to solve all your concurrency problems.

The actor model is at its essence a message passing mechanism with a Multi-Producer-Single-Consumer-wait-free-for-producers-blocking-for-consumers-queue (at best), and saying that we want to solve all problems in concurrency with a single type of queue is just as naive as saying we want to solve all problems in concurrency with a single type of lock. To attack concurrent problems we will always need multiple techniques: Locks, RW-Locks, Left-Right, Copy-On-Write, Lock-Free data structures, Wait-Free data structures, etc...

Saturday, May 23, 2015

The quest for the perfect load/store lock (part 2)

60 years ago we didn't even know if it was possible to provide mutual exclusion with a software-only solution, and then came Dekker (and Dijkstra) and showed it was possible. Knuth, Peterson, Leslie Lamport, and many others followed with other solutions throughout the years. Some were real innovations, others were just refinements of previous ideas.
It's funny that some of the solutions keep on getting "re-discovered"... I guess it's the old adage of "The Book" by Erdos, that the real good solutions have always been there and just need to be "discovered", as opposed to "invented".
One concrete example is Lamport's "One Bit Solution" that Andreia re-discovered earlier this year. Mostly because it is a simple solution and one of the fastest (although not starvation-free). By the time we realized that Lamport had come up with it a long time ago, we had already made several "upgrades" on it with the intent to provide starvation-freedom, so we decided to push on a bit more and make a paper about the two best solutions we had came up with for the N-thread mutual exclusion problem:

These two algorithms were designed having in mind the ideas from the previous post, about what an ideal lock should be.
One of the algorithms CRTurn, uses global spinning, where all threads are spinning on the same variable, named turn.
The other algorithm, CRHandover, uses local spinning, where each thread is (mostly) spinning on a different variable (and cache line).
The main innovation is that because they're based on Lamport's One Bit Solution, they can enter the critical section (locking) doing a single store in shared memory, i.e. an atomic_store() for the C11/C++1x folks or volatile store for the Java/JVM developers. Same thing when exiting the critical section (unlocking), a single store in shared memory is required. However, unlike Lamport's solution, they're both starvation-free.
If you saw the "Atomic Weapons" presentation by Herb Sutter then I'm sure you know that in such memory models the most expensive thing is doing a release-store (a store in shared memory), as opposed to doing an acquire-load or a regular load/store, this means, an ideal algorithm should have the smallest possible number of release-stores in order to be as fast as possible (under low contention).

Why do we care about the low contention scenario, and not so much about high contention one?
Well, imagine you have an application where you have many lock instances, and you do some profiling in your app for latency/scalability/throughput, and you find that one of the lock instances is a big bottleneck where many (or all) threads get stuck waiting for that particular resource or critical section... now what do you do?
If the resource under contention is a data structure, then this would be an ideal place to put a lock-free variant of the said data structure, or at least a variant that has finer grain locking. If it's an object or set of objects, then you can still have gains by using techniques such as Left-Right, or Copy-On-Write, or even one (or multiple) Reader-Writer Lock. Worst comes to worse, you can always re-design the application so as to reduce contention in this single point.
What this means in practice, is that any healthy application will not have high contention on a single lock systematically. Don't get me wrong, it can have bursts of high contention occasionally, and when that happens, it better be a starvation-free lock to ensure better (lower) latency and fairness, but the point is that most of the time, when a thread is attempting to acquire a lock (any lock) in your app, it will be uncontended. This means that the total throughput will be dominated by the behavior of the lock under low contention, i.e. a single thread attempting to acquire it, and therefore, reducing the number of release-stores in that case is a very desirable goal.

Now, this is not to say that locks that were designed with high contention in mind should not be taken seriously, it's just that it's IMO a rare case, but as I said on the previous post, an ideal lock should have a good behavior on both cases.

More on this in a future post.

Saturday, May 16, 2015

The quest for the perfect load/store lock (part 1)

This is the first part of a two-post digression into what it takes to make an optimum starvation-free mutual exclusion lock with only loads and stores.
If you don't know what such algorithms are, take a look at the software solutions to the mutual exclusion problem.

Let us start by considering what should be the properties of an idealized starvation-free algorithm for  mutual exclusion, with loads and stores, that works between N threads, under the C11/C++1x memory model.
Two distinct behaviors should be analyzed, namely, what happens when under high contention, where all N threads are concurrently competing for the lock, and what happens under low contention, when a single thread is attempting to acquire the lock.

For the high contention case, on the lock() method, an ideal algorithm should do local spinning on its own cache line when waiting for the current thread to release the lock, and then immediately enter the critical section.
Ideally, before it enters the spinning loop, the thread can spend some time reaching a consensus with the other waiting threads to order themselves relative to each other (there are usually N-1 waiting threads in this high contention scenario), so that later they don't waste time reaching consensus on which one is the next to get the lock.
By doing the relative ordering of the waiting threads, the algorithm gains time because otherwise, the waiting threads would just be spinning waiting for the lock to be released.

On the unlock() side, as soon as the critical section is over, it should write in the variable that the next thread is waiting on, and this way, pass the lock to the next thread. After doing this, the thread
previously holding the lock may still do other work, for example, to reset some variables back to their default state.

A template of what the pseudo-code would look like for a mutual exclusion lock algorithm that works well under high contention can be seen below:

void lock()
    while (atomic_load(&localSpinVariable) != MINE) Pause();
    // Enter Critical Section immediately

void unlock()
    // The next thread is waiting on that variable

For the low contention case, on the lock() method, the ideal algorithm should spend as little time as possible, so as to provide a maximum throughput when a single thread is continuously acquiring and releasing the lock, without any contention from other threads. This means the algorithm should provide a quick way to determine if there are other threads contending for the lock or already holding the lock, and if possible, do a single store to make the intent of acquiring the lock visible to other threads.
Under the C11/C++1x memory model, making the intent to acquire the lock visible to other threads, can be done with a single call to atomic_store(), and determining that there are no other threads contending on the lock requires (at least) anywhere from log2(N-1) (in the case of tournament-based algorithms), to N-1  atomic_load() calls (in the case of array-based algorithms).
Notice that although multiple loads may be required to confirm that the thread is alone in its intent to acquire the load, a single acquire-load is needed, along with a full barrier (acquire and release) at the end of the scan of the states of the remaning threads.

On the unlock() side, the goal is similarly to provide a maximum throughput and, as such, use the minimum possible number of stores and loads. At least one store is always required, i.e. the one that resets/changes the state of the current thread back to its default value of no longer holding the lock or attempting to do so.
A template for a lock that would work well under a low contention scenario can be seen below:

void lock()
    atomic_store(&mystate, WantIn);
    // This is the "fast path"
    if (noOtherThreadsAttemptingToAcquireLock()) return;
    // This is the "slow path"

void unlock()
    atomic_store(&mystate, DontWantIn);

The main conclusion from this analysis is that for the high contention case, an algorithm needs to be able to pass the lock to the next waiting thread as fast as possible, preferably with just one atomic_store(), while for the low contention case, an algorithm should be able to enter the critical section with just one atomic_store() and at most N-1 calls to atomic_load(), and then release the lock, preferably with just one atomic_store().

So what is this all about? In the next post we will go into practical examples of these ideas.