Monday, June 22, 2015

Maurice Herlihy on Transactional Memory

Here is a nice talk by Maurice Herlihy:

[embedded video of the talk]
It's mostly about Transactional Memory, but I like the part about Memory Reclamation (starting at around 31 minutes).

... and a nice dissertation about "The Future of Synchronization on Multicores"

Monday, June 15, 2015

An example of atomic_thread_fence()

In the previous post, we saw how to implement Lamport's One Bit Solution using C11 sequentially consistent atomics, and how to improve the implementation using memory_order_acquire and memory_order_release.
In this post, we are going to go one step further and use memory_order_relaxed plus a few calls to atomic_thread_fence(), thus showing a practical application of such fences.

We left off in the previous post with code like this:

int st = UNLOCKED;
atomic_store(&states[id*PADRATIO], LOCKED);
for ( int j = 0; j < id; j += 1 )
    if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;
if (st == LOCKED) {
    atomic_store(&states[id*PADRATIO], UNLOCKED);
    while (1) {
        for ( int j = 0; j < N; j += 1 )
            if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;
        if (st == UNLOCKED) break;
        Pause();
    }
    continue;
}
for ( int j = id + 1; j < N; j += 1 )
    while (atomic_load_explicit(&states[j*PADRATIO], memory_order_acquire) == LOCKED) { Pause(); }
CriticalSection( id );               // critical section
atomic_store_explicit(&states[id*PADRATIO], UNLOCKED, memory_order_release);

The thing about the optimizations applied above is that they don't actually introduce any re-ordering that memory_order_seq_cst would not allow; it's just that, by replacing it with memory_order_acquire or memory_order_release, the compiler may be able to use fewer hardware fences when generating the code (depending on the CPU architecture in question).

Now it becomes interesting to pose the question: how many and what kind of re-orderings are allowed in Lamport's One Bit solution while still maintaining its correctness?

This is a hard question. It's likely that Leslie Lamport himself would have to think about it for a while before giving a definitive answer. You see, he probably never had to worry about hardware fences or anything like them back in those days, and just assumed sequential consistency (unless he came back to this problem more recently?). But nowadays, with some CPUs having very relaxed memory models (like PowerPC, ARM, or Tilera), the more relaxed we can make the algorithm, the more performance we will gain.

So, which atomic operations can be re-ordered and still get a correct algorithm?

1. We have to guarantee that the first atomic_store() that sets the state to LOCKED is the first atomic operation. Without this, there is no point in checking the states of the other threads, because there would be no Happens-Before (HB) relation anymore (see the litmus-test sketch after this list). We could do the check again afterwards, but that would waste some time, and it would be a different algorithm (see Burns and Lynch).

2. The atomic_loads() of the states[] array can all be re-ordered with each other, even the ones to the left with the ones to the right of the current thread's id, and vice-versa. The order of these loads is not important and the HB relation will hold, as long as they are done after the current thread's state has been set to LOCKED.

3. The atomic_loads() of the states[] array cannot be re-ordered with the Critical Section (CS), otherwise we would be executing pieces of code of the CS before even knowing whether the lock is ours or not, which would break mutual exclusion.

4. Same logic as above applies to the last store (the one after the CS), which means it must remain a memory_order_release.
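To make guideline 1 more concrete, here is a hypothetical two-thread litmus test (a sketch of mine, not code from this post; the names stateA, stateB, threadA and threadB are made up). With memory_order_relaxed and no fence, nothing forces each thread's store of LOCKED to become visible before its load of the other thread's state, so both loads can return UNLOCKED and both threads can enter the critical section, which is exactly the Happens-Before guarantee that guideline 1 protects. Making the store and the load sequentially consistent, or placing an atomic_thread_fence(memory_order_seq_cst) between them as we do below, forbids that outcome.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

#define UNLOCKED 0
#define LOCKED   1

atomic_int stateA, stateB;           // both zero-initialized, i.e. UNLOCKED

int threadA(void *arg) {
    (void)arg;
    // Claim our own entry first, then check the other thread's entry.
    atomic_store_explicit(&stateA, LOCKED, memory_order_relaxed);
    // No fence here: the store above and the load below may effectively be
    // re-ordered, so this check can pass on both threads at the same time.
    if (atomic_load_explicit(&stateB, memory_order_relaxed) == UNLOCKED)
        printf("A entered the critical section\n");
    return 0;
}

int threadB(void *arg) {
    (void)arg;
    atomic_store_explicit(&stateB, LOCKED, memory_order_relaxed);
    if (atomic_load_explicit(&stateA, memory_order_relaxed) == UNLOCKED)
        printf("B entered the critical section\n");
    return 0;
}

int main(void) {
    thrd_t a, b;
    thrd_create(&a, threadA, NULL);
    thrd_create(&b, threadB, NULL);
    thrd_join(a, NULL);
    thrd_join(b, NULL);
    return 0;    // both lines can be printed: mutual exclusion is broken
}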

With these guidelines in mind, we can now start to write code.
The first change is to modify the atomic_loads() of states[] to be memory_order_relaxed. These loads must not be re-ordered with the first atomic_store(), nor with the CS, so we need to prevent that by using atomic_thread_fence(memory_order_seq_cst). This kind of fence is a full fence, in the sense that no operations above it will be re-ordered with operations below it (and vice-versa).
Before the CS we have to disallow re-ordering of the loads, and for that we can use atomic_thread_fence(memory_order_acquire), which prevents code below it from moving above the fence and loads above it from moving below it, and therefore the loads cannot be re-ordered with the CS.
The code then looks like this:


int st = UNLOCKED;
atomic_store(&states[id*PADRATIO], LOCKED);
atomic_thread_fence(memory_order_seq_cst);
for ( int j = 0; j < id; j += 1 )
    if ( (st = atomic_load_explicit(&states[j*PADRATIO], memory_order_relaxed)) == LOCKED ) break;
if (st == LOCKED) {
    atomic_store(&states[id*PADRATIO], UNLOCKED);
    while (1) {
        for ( int j = 0; j < N; j += 1 )
            if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;
        if (st == UNLOCKED) break;
        Pause();
    }
    continue;
}
for ( int j = id + 1; j < N; j += 1 )
    while (atomic_load_explicit(&states[j*PADRATIO], memory_order_relaxed) == LOCKED) { Pause(); }
atomic_thread_fence(memory_order_acquire);
CriticalSection( id );               // critical section
atomic_store_explicit(&states[id*PADRATIO], UNLOCKED, memory_order_release);


This is a good example of how/when to use atomic_thread_fence() for the two different cases. If you want to dig deeper into this, I recommend looking at page 1145, section Fences of the C++ draft.

We can still do a little bit better. If we look carefully at the first atomic_store(), we notice that the full fence of atomic_thread_fence(memory_order_seq_cst) also prevents this store from being re-ordered with operations after the fence, which means we can make the store use memory_order_relaxed. One might think that, when running this in a loop, it could be re-ordered with the last atomic_store() (the one after the CS), but because the two atomic stores act on the same memory location (the same variable) they will not be re-ordered, due to write-write coherence.
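As a small illustration of that coherence argument (a sketch, not code from the post), consider the unlock store at the end of one iteration followed by the lock store at the start of the next, both acting on the same entry of states[]:

atomic_store_explicit(&states[id*PADRATIO], UNLOCKED, memory_order_release);  // unlock, iteration i
atomic_store_explicit(&states[id*PADRATIO], LOCKED, memory_order_relaxed);    // lock, iteration i+1
// Write-write coherence: these two stores hit the same atomic variable, so no
// thread can first observe LOCKED (the later store) and afterwards observe
// UNLOCKED (the earlier one), regardless of the memory_order used.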

The final code looks like this:


int st = UNLOCKED;
atomic_store_explicit(&states[id*PADRATIO], LOCKED, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
for ( int j = 0; j < id; j += 1 )
    if ( (st = atomic_load_explicit(&states[j*PADRATIO], memory_order_relaxed)) == LOCKED ) break;
if (st == LOCKED) {
    atomic_store(&states[id*PADRATIO], UNLOCKED);
    while (1) {
        for ( int j = 0; j < N; j += 1 )
            if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;
        if (st == UNLOCKED) break;
        Pause();
    }
    continue;
}
for ( int j = id + 1; j < N; j += 1 )
    while (atomic_load_explicit(&states[j*PADRATIO], memory_order_relaxed) == LOCKED) { Pause(); }
atomic_thread_fence(memory_order_acquire);
CriticalSection( id );               // critical section
atomic_store_explicit(&states[id*PADRATIO], UNLOCKED, memory_order_release);




Just how much of a difference do all of these optimizations make?
Let's look at some numbers, with three different implementations:
  • seq/cst: All atomics use memory_order_seq_cst, as shown at the beginning of the previous post;
  • acq/rel: A few atomics use memory_order_acquire or memory_order_release instead of memory_order_seq_cst;
  • fence/relax: A few atomics use memory_order_relaxed and we put atomic_thread_fence() where needed.

N = 8 (higher is better)

x86
seq/cst:     187906607
acq/rel:     215770541
fence/relax: 216936469

PowerPC      
seq/cst:      63647516
acq/rel:      75852650
fence/relax:  88879098


The results above are for N=8 with minimum contention, i.e., just one thread at a time is trying to get the lock.
For x86, changing from seq/cst to acq/rel makes a big difference (almost 15%) because it gets rid of one MFENCE instruction, which is slow. Going from acq/rel to fence/relax makes a minute difference, if any, because on x86 a relaxed load and an acquire load are the same instruction (MOV from memory), and a relaxed store and a release store are the same instruction (MOV to memory), so the only difference (if any) arises from the extra optimizations that the compiler may do.
For PowerPC, there is a boost in performance of 19% when going from seq/cst to acq/rel, and then another increase of 17% going to fence/relax. This happens because with each optimization we remove more HWSYNC and LWSYNC fence instructions.
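As a side note, the exact benchmark harness is not shown here; a minimal sketch of the kind of loop that produces counts like the ones above could look as follows (the names onebit_lock(), onebit_unlock(), stopFlag, totalOps and worker() are assumptions of mine, not the actual code used for these numbers):

#include <stdatomic.h>

extern void onebit_lock(int id);     // the entry protocol shown in this post
extern void onebit_unlock(int id);   // the release store after the CS
extern void CriticalSection(int id);

atomic_int stopFlag;                 // set to 1 by the main thread when the
                                     // fixed measurement period is over
_Atomic long totalOps;               // iterations summed over all threads

int worker(void *arg) {
    const int id = *(int *)arg;
    long ops = 0;
    while (atomic_load_explicit(&stopFlag, memory_order_relaxed) == 0) {
        onebit_lock(id);
        CriticalSection(id);
        onebit_unlock(id);
        ops++;                       // one completed lock/unlock pair
    }
    atomic_fetch_add(&totalOps, ops);
    return 0;
}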

In conclusion, relaxed atomics are cool because they can make your code run faster!

Friday, June 5, 2015

Lamport's One bit solution in C11

As we mentioned before, Lamport's One Bit Solution is one of the fastest (if not the fastest) software solutions to the mutual exclusion problem that use only loads and stores.
Here is the original algorithm:

[figure: Lamport's One Bit algorithm, as presented in the original paper]

Andreia came up with an almost identical solution earlier this year, with only a small variation on the "backoff strategy": in Lamport's case a waiting thread spins on each entry of the array until all have been traversed, while in Andreia's variant the waiting thread just starts over. This gives a (small) statistical advantage so that the threads don't trample over each other, but the basic idea of the algorithm is the same: threads to the left of your entry in the array have priority over you, and you have priority over the threads to the right.
What this implies is that the thread with entry zero in the array has priority over all other threads and can easily starve them. This algorithm is not starvation-free (if you want to see a variation of this algorithm that is starvation-free, then check out CRTurn in our paper).

Here's what Andreia's variant looks like in C11 using memory_order_seq_cst:
while (1) {
    int st = UNLOCKED;
    atomic_store(&states[id*PADRATIO], LOCKED);
    for ( int j = 0; j < id; j += 1 )
        if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;
    if (st == LOCKED) {
        atomic_store(&states[id*PADRATIO], UNLOCKED);
        while (1) {
            for ( int j = 0; j < N; j += 1 )
                if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;
            if (st == UNLOCKED) break;
            Pause();
        }
        continue;
    }
    break;                           // no thread to our left is contending; exit the entry loop
}
for ( int j = id + 1; j < N; j += 1 )
    while (atomic_load(&states[j*PADRATIO]) == LOCKED) { Pause(); }
CriticalSection( id );
atomic_store(&states[id*PADRATIO], UNLOCKED);


where states[] is an array of atomic_int with N*PADRATIO entries (N being the maximum number of threads), id is the current thread's index into states[], and PADRATIO is the number of atomic_ints in a cache line, so that each thread's entry sits on its own cache line.
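For completeness, a minimal sketch of the declarations the snippet above assumes could be (the exact cache-line size and constants are assumptions of mine, not taken from the post):

#include <stdatomic.h>

#define UNLOCKED    0
#define LOCKED      1
#define N           8                                  // maximum number of threads
#define CACHE_LINE  64                                 // bytes, typical on x86
#define PADRATIO    (CACHE_LINE / sizeof(atomic_int))  // atomic_ints per cache line

// Only entry id*PADRATIO is used by thread id; the remaining slots in each
// cache line are padding, so that threads spinning on a neighbour's entry do
// not cause false sharing with our own entry.
static atomic_int states[N * PADRATIO];                // zero-initialized == all UNLOCKED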

In this implementation we are using only sequentially consistent atomics, and this is fine and correct, but it isn't terribly fast. Can we relax some of the atomic operations and still provide a correct algorithm?
Yes!

The most obvious improvement is to change the last atomic_store() to use memory_order_release. This has no effect on the algorithm itself, because none of the instructions in the Critical Section (CS) can be re-ordered with the atomic_store(), seeing as they are above it and a release store does not let prior operations move below it. Even when running in a loop, it will not be re-ordered with the first atomic_store() of the next iteration, because stores to the same variable are never re-ordered with each other.
This may seem like a small improvement, but on x86 it means we've just saved an MFENCE instruction, which improves performance dramatically, at least for the low contention scenario:
[table: cost of C11/C++11 atomic operations mapped to x86, ARM, and PowerPC instructions]

The table above was taken from this PhD thesis, which I recommend if you want to do a deep dive into the C11/C++1x memory model.


A second optimization that does not change the algorithm is to replace the sequentially consistent loads in the while() loop before the CS with memory_order_acquire. Loads with memory_order_acquire will never be re-ordered with each other or with any instructions below them, which means that they are "stuck in place", and no instruction of the CS can be re-ordered with these loads.
Notice that they could be re-ordered with an atomic_store() immediately above them, but there is always at least one sequentially consistent atomic_load() above them (either from the first for() loop or from the second for() loop), and above those, the atomic_store() is also sequentially consistent (either the first atomic_store() or the second atomic_store()), and because atomic_stores() and atomic_loads() using memory_order_seq_cst cannot be re-ordered, everything is safe and correct.
As you can see from the table above, this optimization will make no difference whatsoever on x86, but on PowerPC and ARM, it will save an hwsync or a dmb, both expensive instructions.

There is one more optimization we could do, but in practice it doesn't save much, so maybe I'll talk about it some other time.

Here's what the optimized code looks like:

while (1) {
    int st = UNLOCKED;
    atomic_store(&states[id*PADRATIO], LOCKED);
    for ( int j = 0; j < id; j += 1 )
        if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;
    if (st == LOCKED) {
        atomic_store(&states[id*PADRATIO], UNLOCKED);
        while (1) {
            for ( int j = 0; j < N; j += 1 )
                if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;
            if (st == UNLOCKED) break;
            Pause();
        }
        continue;
    }
    break;                           // no thread to our left is contending; exit the entry loop
}
for ( int j = id + 1; j < N; j += 1 )
    while (atomic_load_explicit(&states[j*PADRATIO], memory_order_acquire) == LOCKED) { Pause(); }
CriticalSection( id );
atomic_store_explicit(&states[id*PADRATIO], UNLOCKED, memory_order_release);