http://web.mit.edu/6.173/www/currentsemester/readings/R06-scalable-synchronization-1991.pdf
It was discovered by John Mellor-Crummey and Michael Scott in 1991 (see section 2.2 of the pdf above).
Thanks to the new memory model in C11 and C++11, we can use atomic operations on a variable, with or without acquire/release barriers.
When no barrier is used and only the atomicity property is guaranteed, it is called a "relaxed atomic":
http://en.cppreference.com/w/cpp/atomic/memory_order
Notice that in the C11/C++11 memory model, an atomic variable is not intrinsically "sequentially-consistent" or "relaxed", but instead, we can choose its behavior on a per-access basis.
The only thing that the atomic qualifier gives is the guarantee of atomicity, i.e. that the variable will be read or written in "one shot". This is different from other languages like Java or Scala, where (almost) all native types provide that guarantee and therefore act as a "relaxed atomic", and where a variable declared as sequentially consistent will remain so for the entirety of the program (unless, in the case of Java, you use sun.misc.Unsafe). Although this subtle difference may seem unimportant, it has some relevance when the goal is to squeeze the maximum performance out of a synchronization or concurrent algorithm.
Suppose you want to declare an atomic integer in different languages, this is how you would go about:
C11 (can be relaxed or sequentially consistent)
_Atomic int x;
C++11 or C++14 (can be relaxed or sequentially consistent)
atomic<int> x;
Java (sequentially consistent)
volatile int x;
With that info at hand, we can now write a simple implementation of the Ticket Lock in C11. Here's how it would look:
typedef struct
{
    _Atomic long ingress;
    char padding[64];   // To avoid false sharing between ingress and egress
    _Atomic long egress;
} ticket_mutex_t;

void ticket_mutex_lock(ticket_mutex_t *self)
{
    long lingress = atomic_fetch_add(&self->ingress, 1);
    while (lingress != atomic_load(&self->egress)) {
        sched_yield();
    }
    // This thread has acquired the lock on the mutex
}

void ticket_mutex_unlock(ticket_mutex_t *self)
{
    atomic_fetch_add(&self->egress, 1);
}
There are a few optimizations we can apply to this implementation:
PS: It turns out this optimization is not possible in a very special case, if the CPU does branch prediction and re-orders instructions. Anyway, this first optimization doesn't play well with the C11/C++1x memory model, so better not to use it. I've updated the source code accordingly.
2) In the unlock() function, we can replace the atomic_fetch_add() with an atomic_load() and an atomic_store().
There is no need to do the read/addition in one atomic operation in the unlock() function because only one thread at a time is able to call this function. This means that we can implement the increment on egress with an atomic_load() to read the current value, and atomic_store() to write the current value plus one.
3) The previously described atomic_load() can be made relaxed.
Step 2 can be optimized even further by doing a relaxed atomic_load(). An acquire barrier was already done on the atomic_fetch_add() on ingress in the lock() function or, in the worst case, on the atomic_load() of egress inside the while() loop, which guarantees that the current thread has the most up-to-date value of egress. We also have the guarantee that between then and the call to unlock() no other thread modified the egress variable, so this is safe to do.
The final code is the following:
void ticket_mutex_unlock(ticket_mutex_t *self)
{
    long legress = atomic_load_explicit(&self->egress, memory_order_relaxed);
    atomic_store(&self->egress, legress+1);
}
These are just a few examples of how to use relaxed atomics to optimize your code.
Now, we should be honest and mention that the performance improvements from these particular optimizations are very small, and that some of them can be tricky to understand or even to prove correct. The upside is that this code is still cross-platform, i.e., you can run it on x86, PowerPC, MIPS, or ARM, and with the memory model you have the confidence that the code is still safe (correct).
This is why I like the C11/C++1x memory model so much :)
Source code for this C11 Ticket Lock is available on github:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/ticket_mutex.h
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/ticket_mutex.c
I believe your optimization (1) is unsound in the C11/C++11 model. Indeed,
while fetch_add is an acquire operation, it only synchronizes on the location
it is operating on. So the fetch_add from the lock cannot synchronize with the
fetch_add from the unlock (ingress is not the same location as egress). So your
code could break provided a sufficiently aggressive compiler and processor (I
suspect Itanium would do the trick, but I don't know its memory model well
enough to say from memory). For example one thread could take the lock,
initialize a variable, release the lock; then another thread could take the lock
and still see the variable uninitialized.
I agree with your optimizations (2) and (3).
Two other possibilities exist:
- use the release/acquire attributes instead of the default of seq_cst (you are
not using its stronger guarantees and it is slower).
- use an atomic_thread_fence(acquire) after the loop in lock(), turning all the
accesses in the loop into relaxed accesses.
This would result in the following code:
void ticket_mutex_lock(ticket_mutex_t *self)
{
    long lingress = atomic_fetch_add(&self->ingress, 1);
    while (lingress != atomic_load_explicit(&self->egress, memory_order_relaxed)) {
        sched_yield();
    }
    atomic_thread_fence(memory_order_acquire);
}
void ticket_mutex_unlock(ticket_mutex_t *self)
{
    long legress = atomic_load_explicit(&self->egress, memory_order_relaxed);
    atomic_store_explicit(&self->egress, legress+1, memory_order_release);
}
I suspect the fetch_add in lock() could be made relaxed too, but it is not
obvious, and maybe wrong.
Best regards,
Robin
Hi Robin,
Thanks for the comment.
I think you may be confusing memory_order_consume with memory_order_seq_cst:
http://en.cppreference.com/w/cpp/atomic/memory_order
If we were doing atomic_fetch_add_explicit(1, std::memory_order_consume), then the reasoning you describe would be correct, because memory_order_consume only acts on the affected memory location. But we are using the default memory order, which is memory_order_seq_cst, which guarantees an "implicit" acquire and release when doing the atomic_fetch_add(), and as such prevents re-ordering with other code (at either the CPU level or the compiler level).
Therefore, I believe (1) to be a valid optimization.
Thanks, but I am not confusing with memory_order_consume. The 'implicit' acquire means that *if* the fetch_add were to synchronize with another access, then every access afterwards will see everything that happened before that other access (which is indeed not even true for consume).
But fetch_add can only synchronize with accesses to the same variable. In the jargon of the standard: "An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from any side effect in the release
sequence headed by A." ( it may be more readable as the 'synchronizes_with' predicate in the mathematical formalization of that part of the standard: http://www.cl.cam.ac.uk/~pes20/cpp/)
So if thread 1 takes the lock, does stuff, and unlocks, and then thread 2 takes the lock, thread 2 will see everything that was done before thread 1 took the lock (thanks to the synchronization between the fetch_add in each lock()), but not necessarily the do_stuff(), as there is no release access to ingress in unlock() that it could synchronize with.
Best regards,
Robin
Hi Robin,
I think I understand what you're saying. You are absolutely correct when you say that "(...) there is no release access to ingress in unlock() that it could synchronize with (...)", but
you are forgetting that there is a happens-before relationship between "ingress" and "egress".
Assuming the scenario you describe with:
Thread 1:
ticket_mutex_lock();
do_stuff();
ticket_mutex_unlock();
Thread 2:
ticket_mutex_lock();
do_stuff();
ticket_mutex_unlock();
Let me do some assertions and then you can say if you disagree with any of these statements (I might be missing something which you find obvious):
a) In ticket_mutex_lock() the value of egress will always be read _after_ the value of ingress, due to the implicit acquire barrier in fetch_add(ingress,1)
b) ticket_mutex_lock() will only return if ingress equals egress, implying that do_stuff() will be called only in such a situation
c) To some thread synchronizing on egress, ticket_mutex_unlock() will always occur _after_ do_stuff()
Summarizing, in ticket_mutex_lock(), the fetch_add(ingress,1) happens-before the atomic_load_explicit(egress, relaxed), and only when
ingress equals egress will the lock be acquired and do_stuff() called.
However, while thinking about your comment, I thought about a different scenario, one which occurs only at the CPU re-ordering level due
to branch prediction; after talking with Andreia for a while, she showed me a rare but possible issue:
Suppose the CPU predicts that the following expression will be true and starts executing part of do_stuff() before it has all the values from memory:
if (lingress == atomic_load_explicit(&self->egress, memory_order_relaxed)) return ;
In this case, even if the prediction turns out to be true (ingress == egress), for a small window of time the CPU was using values for variables in do_stuff() that
could have been simultaneously written by another thread which, in the meantime, called ticket_mutex_unlock() and thus released the lock.
Notice that this scenario only happens at the CPU level if it decides to predict true for this branch AND egress is still not equal to ingress
AND egress is equal to ingress by the time the value of egress arrives.
If the expression in the if() is false, then the code goes through a load with acquire, and would be correct.
This is a very tricky and rare scenario, but it means that optimization (1) is not a valid optimization.
I think this is not the scenario that you were talking about, but thanks for making us think about this Robin!
I will update the post and the code accordingly in the next few days.
Once again, thank you for your comment,
Pedro