Tuesday, May 27, 2014

CLH Mutex - An alternative to the MCS Lock

PS: It seems this algorithm is just the CLH lock with the previous node removed, and nothing original at all... dope.
Many thanks to David for pointing it out  ;)

So here is an implementation of the CLH lock in C11:

Compared to the MCS lock, in the CLH lock there is no physical queue, but there is a kind of logical queue. Each thread knows only two nodes: the node it created, and the node previous to it.
(slide 45)

Just like the MCS, the CLH is also: "starvation-free"; it is easy to transform into a recursive mutex if needed; and each thread waits/spins on its own cache line which will be written by the previous thread, namely, the succ_must_wait variable of the previous node .
This implementation of the CLH is memory safe, in the sense that all created nodes are safely and correctly deleted at some point in the execution, i.e. it manages its own memory and does not need a GC.
The main different from the MCS lock is that MCS may have to do an Compare-And-Swap in addition to the atomic_store() in the unlock() function.

Let's look at the member variables:

typedef struct clh_mutex_node_ clh_mutex_node_t;

struct clh_mutex_node_
    _Atomic char succ_must_wait;

typedef struct
    clh_mutex_node_t * mynode;
    char padding[64];  // To avoid false sharing with the tail
    _Atomic (clh_mutex_node_t *) tail;
} clh_mutex_t;
Each node contains a single variable, succ_must_wait. When succ_must_wait is zero, it means that the thread that created the next node can now enter and acquire the lock on the mutex. If succ_must_wait is 1, then that thread must spin/wait/yield. The only atomic variable is tail. The mynode variable is only used as a way to pass the pointer from the lock() function to the unlock() and could be discarded if we decided to return it with pass by reference in lock() and then pass them to unlock(), but that would lose some generalizability.

Now, lets look at the code:

void clh_mutex_init(clh_mutex_t * self)
    // We create the first sentinel node unlocked, with succ_must_wait=0
    clh_mutex_node_t * node = clh_mutex_create_node(0);
    self->prev = NULL;
    self->mynode = node;
    atomic_store(&self->tail, node);
The sole purpose of the initialization function is to create a sentinel node with succ_must_wait unlocked (a value of zero).

void clh_mutex_lock(clh_mutex_t * self)
    // Create the new node locked by default, setting succ_must_wait=1
    clh_mutex_node_t *mynode = clh_mutex_create_node(1);
    clh_mutex_node_t *prev = atomic_exchange(&self->tail, mynode);

    // This thread's node is now in the queue, so wait until it is its turn
    char prev_islocked = atomic_load_explicit(&prev->succ_must_wait, memory_order_relaxed);
    if (prev_islocked) {
        while (prev_islocked) {
            prev_islocked = atomic_load(&prev->succ_must_wait);
    // This thread has acquired the lock on the mutex

    self->mynode = mynode;
The lock() function starts by create a new node, mynode, and when it applies it on the tail using an atomic_exchange() it will return the previous node, prev. If the prev->succ_must_wait is 0 then it means that this thread has acquired the lock, otherwise it must spin/yield/wait until succ_must_wait is zero.
Then, we have to store mynode so that the unlock() function knows what is the mynode for the current thread. This could be done with one thread-local, but we can also store it in a member field (with the same name) because we know that between now and the call to unlock() there will be no other thread modifying this variable, except if some other thread is calling unlock() without first calling lock(), but that is a bug in the application and it has undefined behavior.
We can now safely free the memory used by prev with the guarantee that there will be no other thread that can access it. The previous thread no longer has a pointer to that node because the last time it used it was to do the atomic_store(), which allowed the current thread to acquire the lock, and no other thread has a reference to that node.
Notice that each thread is responsible for free()ing the memory of the previous node, not the node that it created.

void clh_mutex_unlock(clh_mutex_t * self)
    atomic_store(&self->mynode->succ_must_wait, 0);
In the unlock() function, notice that the value stored in self->mynode was written by the same thread, so it doesn't need to be atomic or have acquire/release barriers.
All that this does is to set the islocked to zero so that the next thread can acquire the lock.

Here are some animations that may help understand how the algorithm works:

A nice property about the CLH mutex is that it uses the (theoretically) minimum amount of synchronization operations to perform a mutual exclusion lock:
- In the lock() function it has a single synchronized operation atomic_exchange() with one acquire and one release barrier.
- In the unlock() function is has a single synchronized operation, the atomic_store() with a release barrier.

C11 source code on github, along with a powerpoint containing some animations:

You can also check a Java implementation of the original CLH algorithm in "The Art of multiprocessor Programming" section 7.5.2

Tuesday, May 20, 2014

Relaxed Atomics optimizations for the Ticket Lock

The Ticket Lock is a very simple mechanism for implementing a mutual exclusion lock:
It was discovered by John Mellor-Crummey and Michael Scott in 1991 (see section 2.2 of the pdf above).

Thanks to the new memory model for C++11 and for C11, we can use atomic operations on a variable, with or without acquire/release barriers.
When no barrier is used and only the atomicity property is guaranteed, it is called a "relaxed atomic":
Notice that in the C11/C++11 memory model, an atomic variable is not intrinsically "sequentially-consistent" or "relaxed", but instead, we can choose its behavior on a per-access basis.
The only thing that the atomic qualifier gives, is the guarantee of atomicity, i.e. that the variable will be read or written in "one shot". This is different from other languages like Java or Scala, where (almost) all native types provide that guarantee and therefore act as a "relaxed atomic", and all variables declared as sequentially consistent will be so, for the entirety of the program (unless, in the case o Java, you use sun.misc.unsafe). Although this subtle difference may seem unimportant, it has some relevance when the goal is to squeeze the maximum performance out of a synchronization or concurrent algorithm.

Suppose you want to declare an atomic integer in different languages, this is how you would go about:
C11 (can be relaxed or sequentially consistent)
_Atomic int x;

C++11 or C++14 (can be relaxed or sequentially consistent)
atomic<int> x;

Java (sequentially consistent)
volatile int x;

With that info at hand, we can now write a simple implementation of the Ticket Lock in C11. Here's how it would look:

typedef struct
    _Atomic long ingress;
    char padding[64];      // To avoid false sharing between ingress and egress
    _Atomic long egress;
} ticket_mutex_t;

void ticket_mutex_lock(ticket_mutex_t * self)
    long lingress = atomic_fetch_add(&self->ingress, 1);
    while (lingress != atomic_load(&self->egress)) {
    // This thread has acquired the lock on the mutex

void ticket_mutex_unlock(ticket_mutex_t * self)
    atomic_fetch_add(&self->egress, 1);

There are three two non-obvious optimizations we can do to this code by using relaxed atomics. Let's look at them one at a time.

1) In the lock() function, the first time egress is read, it can be done with a relaxed load.
The code for lock() would look like this:

void ticket_mutex_lock(ticket_mutex_t * self)
    long lingress = atomic_fetch_add(&self->ingress, 1);

    // If the ingress and egress match, then the lock as been acquired and
    // we don't even need to do an acquire-barrier.
    if (lingress == atomic_load_explicit(&self->egress, memory_order_relaxed)) return ;

    while (lingress != atomic_load(&self->egress)) {
        sched_yield();  // Replace this with thrd_yield() if you use <threads.h>
    // This thread has acquired the lock on the mutex
We can do this because there is an acquire barrier in atomic_fetch_add(ingress), which means that, not only the atomic_load_explicit(egress, memory_order_relaxed) can never be executed before the atomic_fetch_add(), but it will get the up-to-date value, of at least the same time as when the atomic_fetch_add(ingress) executed. This implies that the value read of  egress will never be higher than the value returned by atomic_fetch_add(ingress).

On x86, this optimization will not give any performance gains because the acquire barrier "comes for free", which means that saving one acquire barrier does not give any gains, but on other architectures, like ARMv8, this may provide some small gain. I'll let you know when I get my hands on an ARMv8 and a gcc with C11 or C++11 support for it.
PS: It turns out this optimization is not possible in a very special case if the CPU does branch prediction and re-orders instructions. Anyways, this first optimization doesn't play well with the C11/C++1x memory model, so better not use it. I've updated the source code accordingly.

2) In the unlock() function, we can replace the atomic_fetch_add() with an atomic_load() and an atomic_store().
There is no need to do the read/addition in one atomic operation in the unlock() function because only one thread at a time is able to call this function. This means that we can implement the increment on egress with an atomic_load() to read the current value, and atomic_store() to write the current value plus one.

3) The previously described atomic_load() can be made relaxed.
Step 2 can be optimized even further by doing a relaxed atomic_load(). There was an acquire barrier done on atomic_fetch_add() on ingress in the lock() function, or worse case, in the atomic_load() of egress inside the while() loop, which guarantees that the current thread has the most up-to-date value of egress. We also have the guarantee that between then and the call to unlock() there was no other thread modifying the egress variable, so this is safe to do.
The final code is the following:

void ticket_mutex_unlock( ticket_mutex_t * self)
    long legress = atomic_load_explicit(&self->egress, memory_order_relaxed);
    atomic_store(&self->egress, legress+1);

These are just a few examples of how to use relaxed atomics to optimize your code.
Now, we should be honest and mention that the improvements of these particular optimizations are very small, and that some of these optimizations can be tricky to understand or even to prove their correctness. The upside is that this code is still cross-platform, i.e., you can run it on  x86, or ppc, or mips, or arm, and with the memory model you have the confidence that the code is still safe (correct).
This is why I like the C11/C++1x memory model so much   :)

Source code for this C11 Ticket Lock is available on github:

Sunday, May 11, 2014

C11 Atomics and MPSC Mutual Exclusion Lock

PS: Many thanks to David Dice for pointing out that this is just a small variation on the CLH lock.
That's what I get for not paying attention to "The Art of Multiprocessor Programming"!

To celebrate the release of GCC 4.9.0, with its full support for the C11 Memory Model and its Atomics, we implemented a new Mutual Exclusion Lock a variant of the CLH lock using C11 Atomics, which you can get from github:

GCC 4.9.0 is not yet widely available, but if you're like me and don't to want to build it yourself, you can get if in Arch Linux by doing pacman -S gcc:

The CLH lock is "fair" and "starvation-free", assuming that the run-time/OS can provide that kind of guarantee, and it works more or less like this:


A new node of the linked list is created, and the head and tail are set to reference it.


1. A thread wishing to acquire the lock starts by creating a new node that it will insert in the tail of the queue, using an atomic_exchange() on the tail.
2. The atomic_exchange() operation will return a pointer to the previous node to which tail was pointing, and now this thread can set the "next" of that node to point to its own node, using an atomic_store().
3. We now loop until the head reaches the node previous to our own.


1. We assume that if unlock() is being called, it is because the current thread is holding the lock, which means that the node to which "head" points to is the one previous to the node created by the current thread, so now all that needs to be done is to advance the head to the next node and free() the memory of the previous which is now inaccessible, and it's "next" field will never be de-referenced by any other thread.