Saturday, May 16, 2015

The quest for the perfect load/store lock (part 1)

This is the first part of a two-post digression into what it takes to make an optimum starvation-free mutual exclusion lock with only loads and stores.
If you don't know what such algorithms are, take a look at the software solutions to the mutual exclusion problem.

Let us start by considering what should be the properties of an idealized starvation-free algorithm for  mutual exclusion, with loads and stores, that works between N threads, under the C11/C++1x memory model.
Two distinct behaviors should be analyzed, namely, what happens when under high contention, where all N threads are concurrently competing for the lock, and what happens under low contention, when a single thread is attempting to acquire the lock.

For the high contention case, on the lock() method, an ideal algorithm should do local spinning on its own cache line when waiting for the current thread to release the lock, and then immediately enter the critical section.
Ideally, before it enters the spinning loop, the thread can spend some time reaching a consensus with the other waiting threads to order themselves relative to each other (there are usually N-1 waiting threads in this high contention scenario), so that later they don't waste time reaching consensus on which one is the next to get the lock.
By doing the relative ordering of the waiting threads, the algorithm gains time because otherwise, the waiting threads would just be spinning waiting for the lock to be released.

On the unlock() side, as soon as the critical section is over, it should write in the variable that the next thread is waiting on, and this way, pass the lock to the next thread. After doing this, the thread
previously holding the lock may still do other work, for example, to reset some variables back to their default state.

A template of what the pseudo-code would look like for a mutual exclusion lock algorithm that works well under high contention can be seen below:

void lock()
    while (atomic_load(&localSpinVariable) != MINE) Pause();
    // Enter Critical Section immediately

void unlock()
    // The next thread is waiting on that variable

For the low contention case, on the lock() method, the ideal algorithm should spend as little time as possible, so as to provide a maximum throughput when a single thread is continuously acquiring and releasing the lock, without any contention from other threads. This means the algorithm should provide a quick way to determine if there are other threads contending for the lock or already holding the lock, and if possible, do a single store to make the intent of acquiring the lock visible to other threads.
Under the C11/C++1x memory model, making the intent to acquire the lock visible to other threads, can be done with a single call to atomic_store(), and determining that there are no other threads contending on the lock requires (at least) anywhere from log2(N-1) (in the case of tournament-based algorithms), to N-1  atomic_load() calls (in the case of array-based algorithms).
Notice that although multiple loads may be required to confirm that the thread is alone in its intent to acquire the load, a single acquire-load is needed, along with a full barrier (acquire and release) at the end of the scan of the states of the remaning threads.

On the unlock() side, the goal is similarly to provide a maximum throughput and, as such, use the minimum possible number of stores and loads. At least one store is always required, i.e. the one that resets/changes the state of the current thread back to its default value of no longer holding the lock or attempting to do so.
A template for a lock that would work well under a low contention scenario can be seen below:

void lock()
    atomic_store(&mystate, WantIn);
    // This is the "fast path"
    if (noOtherThreadsAttemptingToAcquireLock()) return;
    // This is the "slow path"

void unlock()
    atomic_store(&mystate, DontWantIn);

The main conclusion from this analysis is that for the high contention case, an algorithm needs to be able to pass the lock to the next waiting thread as fast as possible, preferably with just one atomic_store(), while for the low contention case, an algorithm should be able to enter the critical section with just one atomic_store() and at most N-1 calls to atomic_load(), and then release the lock, preferably with just one atomic_store().

So what is this all about? In the next post we will go into practical examples of these ideas.

No comments:

Post a Comment