C11 and C++11 programmers know it as "atomic", D programmers know it as "shared", and Java programmers know it as "volatile", but semantically speaking, they are very similar. The inner workings of the memory model and the atomics in each of these languages can differ significantly, especially between Java and the others, but if you know the semantics of one of them, then you know the semantics of all of them.
Let's look at the atomics API for the languages C11, C++1x, D, and Java. This is a kind of "Rosetta Stone" of the different languages, which should come in handy for those of you wishing to translate a piece of concurrent code from one language to another.
Due to lack of space, we start by comparing the basic API for C11, C++1x and D:
Two things to notice on this board are that D has more "boilerplate" code than C11 or C++1x, and that there is no atomic addition in D (am I missing something?), so it has to be implemented using a CAS loop.
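For concreteness, here is a minimal sketch of the basic C11 calls being compared (the C++1x equivalents have near-identical names through std::atomic&lt;T&gt;; the demo_atomics function name is mine):

```c
#include <stdatomic.h>

// Each plain call defaults to sequentially-consistent ordering; the
// _explicit variants take a memory_order argument (e.g. relaxed).
long demo_atomics(void) {
    _Atomic long x;
    atomic_init(&x, 0);                                     // initialization
    atomic_store(&x, 1);                                    // atomic store
    long v = atomic_load(&x);                               // atomic load
    long old = atomic_exchange(&x, 2);                      // swap
    long expected = 2;
    atomic_compare_exchange_strong(&x, &expected, 3);       // CAS
    atomic_fetch_add(&x, 1);                                // atomic addition
    (void)v; (void)old;
    return atomic_load_explicit(&x, memory_order_relaxed);  // relaxed load
}
```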
As is to be expected, there are a lot of similarities between these three languages, but when we get to Java it starts to get a bit more complicated. First, there is no single API for atomics, but kind of three APIs:
The first thing that pops out is that using relaxed atomics with the Atomic* classes or volatile is currently not possible without resorting to sun.misc.Unsafe.
Another detail for Java is that I'm not sure what the memory model defines in terms of race conditions for "relaxed atomics" (it doesn't even mention such a thing), whereas in C++1x, C11, and D this is well defined, which matters, as we will see in a few posts.
I'm not an expert in all of these languages, so if you see anything missing or wrong, please let me know in the comments section.
Still on the topic of locks, we decided to use the idea of the CLH lock and adapt it to create a Reader-Writer Lock, which might be new, but we're not sure.
We named this algorithm "CLH Reader Writer Lock" (not very original, I know). We think this is a new algorithm because we don't know of any such approach, but if you find anything similar in the existing literature or on the internet, please send us a link or add it in the comments section. PS: As Dave Dice has pointed out (thanks Dave!), this algorithm is the combination of the CLH lock from a previous post with the C-RW-NP lock described in the paper NUMA-Aware Reader-Writer Locks.
So how does it work?
Similarly to the CLH mutex, each node contains only one field, named succ_must_wait:
typedef struct clh_rwlock_node_ clh_rwlock_node_t;

struct clh_rwlock_node_
{
    _Atomic char succ_must_wait;
};
The lock instance itself contains one extra field, named readers_counter, and some extra padding to reduce false sharing between the three members of clh_rwlock_t:
typedef struct
{
    clh_rwlock_node_t* mynode;
    char padding1[64];
    _Atomic (clh_rwlock_node_t*) tail;
    char padding2[64];
    _Atomic long readers_counter;
} clh_rwlock_t;
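Like the CLH mutex, the queue needs to start with a sentinel node that is already unlocked, so that the first arriving thread sees succ_must_wait at zero. A minimal initialization sketch (the clh_rwlock_init name is my choice):

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct clh_rwlock_node_ clh_rwlock_node_t;
struct clh_rwlock_node_
{
    _Atomic char succ_must_wait;
};

typedef struct
{
    clh_rwlock_node_t* mynode;
    char padding1[64];
    _Atomic (clh_rwlock_node_t*) tail;
    char padding2[64];
    _Atomic long readers_counter;
} clh_rwlock_t;

void clh_rwlock_init(clh_rwlock_t* self)
{
    // The sentinel is "unlocked" so the first thread to enqueue proceeds at once
    clh_rwlock_node_t* sentinel = malloc(sizeof(clh_rwlock_node_t));
    atomic_init(&sentinel->succ_must_wait, 0);
    self->mynode = NULL;
    atomic_init(&self->tail, sentinel);
    atomic_init(&self->readers_counter, 0);
}
```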
The basic idea is that each thread does an atomic_exchange() on the tail, regardless of whether it is a Reader or a Writer.
To allow multiple Readers simultaneously, when a Reader acquires the read-lock it immediately sets succ_must_wait=0, but before doing so, it increments readers_counter by one. When a Reader exits, it decrements readers_counter by one.
The Writer waits for its turn to acquire the write-lock, spinning on the previous node's succ_must_wait, and once that is zero, it spins again waiting for readers_counter to reach zero. This allows multiple Readers to execute concurrently, while preventing a Writer from entering before its turn.
The following PowerPoint presentation shows an animation of how this works. The code for the Readers looks like this:
void clh_rwlock_readlock(clh_rwlock_t* self)
{
    // Create the new node locked by default, setting succ_must_wait=1
    clh_rwlock_node_t* mynode = malloc(sizeof(clh_rwlock_node_t));
    atomic_store(&mynode->succ_must_wait, 1);
    // Enqueue at the tail, then wait until the previous node is released
    clh_rwlock_node_t* prev = atomic_exchange(&self->tail, mynode);
    while (atomic_load(&prev->succ_must_wait) != 0) { }
    // Incrementing the readers_counter will prevent a Writer from going in
    atomic_fetch_add(&self->readers_counter, 1);
    // This will allow the next thread to go in, but only if it is a Reader
    atomic_store(&mynode->succ_must_wait, 0);
    // This thread has acquired the lock and it is now safe to
    // cleanup the memory of the previous node.
    free(prev);
}
void clh_rwlock_readunlock(clh_rwlock_t* self)
{
    atomic_fetch_add(&self->readers_counter, -1);
}
Notice that the Reader must first increment readers_counter and only afterwards set succ_must_wait to zero. Otherwise, a Writer could see succ_must_wait at zero, immediately check readers_counter, see that it is zero as well, and enter the critical section, immediately followed by a Reader that has just incremented readers_counter, which would be incorrect. Also, notice that the Reader free()s the memory of the previous node in the lock() function.
The code for the Writers looks like this:
void clh_rwlock_writelock(clh_rwlock_t* self)
{
    // Create the new node locked by default, setting succ_must_wait=1
As can be seen from the code above, each Writer may have to spin twice: first to wait for its turn, spinning until the succ_must_wait of the previous node becomes zero, and a second time, waiting for any unfinished Readers to complete.
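To make the Writer's two-phase wait concrete, here is a minimal self-contained sketch of the writer's path; the node-allocation helper and the writeunlock() function are my reconstruction, following the pattern of the CLH mutex from the previous post:

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct clh_rwlock_node_ clh_rwlock_node_t;
struct clh_rwlock_node_ { _Atomic char succ_must_wait; };

typedef struct {
    clh_rwlock_node_t* mynode;
    char padding1[64];
    _Atomic (clh_rwlock_node_t*) tail;
    char padding2[64];
    _Atomic long readers_counter;
} clh_rwlock_t;

static clh_rwlock_node_t* clh_rwlock_node_create(char locked) {
    clh_rwlock_node_t* node = malloc(sizeof(clh_rwlock_node_t));
    atomic_init(&node->succ_must_wait, locked);
    return node;
}

void clh_rwlock_writelock(clh_rwlock_t* self) {
    // Create the new node locked by default, setting succ_must_wait=1
    clh_rwlock_node_t* mynode = clh_rwlock_node_create(1);
    // Enqueue at the tail, exactly like a Reader would
    clh_rwlock_node_t* prev = atomic_exchange(&self->tail, mynode);
    // First spin: wait for our turn in the queue
    while (atomic_load(&prev->succ_must_wait) != 0) { }
    // Second spin: wait for any in-flight Readers to drain
    while (atomic_load(&self->readers_counter) != 0) { }
    // Our turn: reclaim the previous node and remember ours for unlock
    free(prev);
    self->mynode = mynode;
}

void clh_rwlock_writeunlock(clh_rwlock_t* self) {
    // Let the successor (Reader or Writer) proceed
    atomic_store(&self->mynode->succ_must_wait, 0);
}
```

Note that only the Writer stores its node in self->mynode, which is safe because at most one Writer holds the lock at any time; Readers release their nodes inside the read-lock function itself.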
One very important point is that this Reader-Writer Lock is strictly fair and starvation-free.
Unlike simple spin-lock based RW-Locks, a Writer will never be forced to wait for a late-coming Reader (or another Writer). Once the Writer or Reader executes the atomic_exchange() on the tail, it will only have to wait for threads that executed the atomic_exchange() before it, never for threads that arrive afterwards.
Also, notice that we make use of relaxed atomics in several places, which shows, once again, that relaxed atomics are useful for creating practical implementations of synchronization mechanisms.
It is important to emphasize that the clh_rwlock might not scale well for Readers. We could replace readers_counter with a scalable counter, like the Distributed Cache Line Counter, a LongAdder, a SNZI, or some other kind of readIndicator mechanism, which should increase performance for Readers, but there would still be a bottleneck for Readers when doing the atomic_exchange() on the tail. Maybe there is a way to overcome this obstacle, but I haven't figured it out yet.
If you really need a lock that is scalable for Readers, then it is better to go with one of the variants of the Scalable RW-Lock (named C-RW-WP in this paper), which we covered in previous posts.
While designing locks, whether they are mutual exclusion or reader-writer locks, one of the questions that often comes up is "How efficient can a lock be?".
When that lock is implemented in a language with acquire/release semantics, like C11 or C++1x, part of that question becomes "What is the minimum number of barriers required to implement a lock?" (or "fences" if you prefer). I came to a partial answer to this question, which I'm going to describe below.
Obviously, we weren't the first ones to think of this, and I'm pretty confident that experts in this field have reached the same (or more advanced) conclusions a very long time ago. People like Leslie Lamport, Maurice Herlihy, Nir Shavit, Michael Scott, David Dice, have been thinking about these issues for a long time. Perhaps they never thought it was too important, or they didn't bother to talk about it, or maybe they mentioned it briefly in one of their papers and we just missed it.
It could also be that, until recently, this question wasn't all that relevant, the reason being that only recently did programming languages adopt a memory model with explicit acquire/release semantics.
Nowadays, with C++11/C++14 and C11, the memory model defines the synchronization operations in terms of acquire and release barriers, so it becomes important to design a lock that does as little synchronization as possible, and for that, we must first figure out what the minimum is.
Keep in mind that having an algorithm and implementation that provides the minimum number of barriers is just one of the effects contributing to a lock's performance. We can even argue that it is a small effect when compared to other things, like false sharing, contention, cache misses, etc. Still, I think it is a question worth posing, and answering!
This answer I'm proposing is by no means rigorous, and I hope that in the future someone will give a more rigorous proof using Information Theory or something like it, but for the moment, a simple logical narrative should suffice:
Suppose you have an object L that is the lock, whose state may be LOCKED or UNLOCKED. The object L has two functions: lock() and unlock().
For a given thread to determine the current state of L, it must do an acquire barrier. For the same thread to get hold of the lock, it must change the state of L from UNLOCKED to LOCKED which means doing at least a write with a release barrier.
The actual algorithm that is used can be much more complicated, and in fact can even be reversed: first set the state to LOCKED with a release barrier, and then check if there are other threads doing the same, which needs an acquire barrier. It can also be done with a single atomic operation, for example when doing a spinlock with a Compare-And-Swap (CAS) or Exchange, both of which have implicit acquire and release barriers.
For the thread to release the lock, the minimum it needs to do is to inform the other threads/cores of the change of state from LOCKED to UNLOCKED, and that means doing at least one release barrier.
There are two exceptions to this rule however:
The first concerns "Recursive Locks" (or "Reentrant" if you prefer). On a recursive lock, the minimum number of barriers is zero. When a thread attempts to acquire a lock that already belongs to the current thread, it can check a specific state in L (a thread-local variable, for example). Because it was this same thread that wrote to it before, either to set it to LOCKED or UNLOCKED, there is no need to do an acquire or release barrier. The same is true for the unlock() function: a regular write without any barriers is enough, as long as this is not the last unlock() of L by this thread.
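To make the zero-barrier claim concrete, here is a hypothetical sketch (the names and the atomic_flag-based base lock are mine): once the thread owns the lock, re-acquisition and any non-final release touch only a thread-local counter that this same thread wrote, so that path needs no barriers at all.

```c
#include <stdatomic.h>

typedef struct { atomic_flag flag; } base_lock_t;

static _Thread_local int recursion_count = 0;  // per-thread ownership count

void recursive_lock(base_lock_t* l) {
    if (recursion_count > 0) {  // plain read: this thread wrote it, no barrier
        recursion_count++;      // plain write: no barrier
        return;
    }
    // First acquisition pays the usual price: an acquire barrier on the
    // read-modify-write that takes the lock
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire)) { }
    recursion_count = 1;
}

void recursive_unlock(base_lock_t* l) {
    if (--recursion_count > 0) return;  // not the last unlock(): no barrier
    // Last unlock(): one release barrier to publish the critical section
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}
```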
The second is for optimistic locks. An example of an optimistic lock is Java's StampedLock tryOptimisticRead(). Notice that optimistic locks are only possible for read-only operations, so they are used in the context of Reader-Writer Locks; I've never seen them used with mutexes (I don't know if there is even a use-case for that).
An optimistic lock needs three barriers: one acquire barrier in the lock() function, and another acquire barrier in the unlock() to check if the state of the lock has changed during the (read-only) operation, followed by a release barrier also in the unlock(), to prevent code from escaping outside the critical section. The first acquire barrier is used to read the current state of the lock, i.e., if it is currently being held by a Writer, and the second acquire barrier to check if there was one or more Writers acquiring the lock for the duration of the Reader's operation.
In summary, for pessimistic non-recursive locks, the minimum number of barriers needed for the lock() function is 1 acquire and 1 release, while for the unlock() function it is 1 release.
Examples of locks with this minimum property are the spinlock with a CAS, or the Ticket Lock we described in a previous post, but there are many others.
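As a concrete illustration of that minimum, here is a sketch of such a spinlock in C11 (the names are mine): the exchange in lock() carries the acquire and release barriers, and unlock() is a single release store.

```c
#include <stdatomic.h>

typedef struct { _Atomic int state; } spinlock_t;  // 0 = UNLOCKED, 1 = LOCKED

void spin_lock(spinlock_t* l) {
    // One acquire + one release, both implicit in the atomic exchange
    while (atomic_exchange_explicit(&l->state, 1, memory_order_acq_rel) == 1) { }
}

void spin_unlock(spinlock_t* l) {
    // One release barrier: publish the change from LOCKED to UNLOCKED
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```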