Concurrency Freaks: An example of atomic_thread

One the previous post, we saw how to implement Lamport's One Bit Solution using C11 sequentially consistent atomics, and how to improve the implementation using memory_order_acquire and memory_order_release.
On this post, we are going to go one step further and use memory_order_relaxed and a few calls to atomic_thread_fence(), thus showing a practical application of such a fence.

We left of on the previous post with the code like this:

int st = UNLOCKED;

atomic_store(&states[id*PADRATIO], LOCKED);

for ( int j = 0; j < id; j += 1 )

if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;

if (st == LOCKED) {

atomic_store(&states[id*PADRATIO], UNLOCKED);

while (1) {

for ( int j = 0; j < N; j += 1 )

if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;

if (st == UNLOCKED) break;

Pause();

}

continue;

}

for ( int j = id + 1; j < N; j += 1 )

while (atomic_load_explicit(&states[j*PADRATIO], memory_order_acquire) == LOCKED) { Pause(); }

CriticalSection( id ); // critical section

atomic_store_explicit(&states[id*PADRATIO], UNLOCKED, memory_order_release);

The thing about the optimizations applied above, is that they don't actually do any re-order that the memory_order_seq_cst would not do, it's just that by replacing it with memory_order_acquire or memory_order_release, the compiler may be able to use less hardware fences when generating the code (depending on the CPU architecture in question).

Now, it becomes interesting to pose the question: how much and what kind of re-orderings are allowed on Lamport's One Bit solution, and still maintain its correctness?

This is a hard question. It's likely that Leslie Lamport himself would have to think about it for while before giving a definitive answer. You see, he probably never had to worry about hardware fences or anything like it back in those days, and just assumed sequential consistency (unless he came back to this problem more recently?). But nowadays, with some CPUs having very relaxed models (like PowerPC, ARM, or Tilera), the more relaxed we can make the algorithm, the more performance we will gain.

So, which atomic operations can be re-ordered and still get a correct algorithm?

1. We have to guarantee that the first atomic_store() that sets the state to LOCKED is the first atomic operation. Without this, there is no point in checking the states of the other threads because there would be no Happens-Before (HB) relation anymore. We could however do the check again, but it would waste some time, and that is a different algorithm (see Burns and Lynch).

2. The atomic_loads() of the states[] array can all be re-ordered with each other, even the ones to the left with the ones to the right of the current thread's id, and vice-versa. The order of these loads is not important and the HB relation will hold, as long as they are done after the current thread's state has been set to LOCKED.

3. The atomic_loads() of the states[] array can not be re-ordered with the Critical Section (CS), otherwise we would be executing pieces of code of the CS before even knowing if the lock is ours or not, which would break mutual exclusivity.

4. Same logic as above applies to the last store (the one after the CS), which means it must remain a memory_order_release.

With these guidelines in mind, we can now start to write code.
The first change, is to modify the atomic_loads() of state[] to be memory_order_relaxed. But this implies that they may not be re-ordered with the first atomic_store(), or with the CS, so we need to prevent it by using atomic_thread_fence(memory_order_seq_cst). This kind of fence is a full fence, in the sense that no operations above it will be re-ordered with operations below it (and vice-versa.
Before the CS we have to disallow re-ordering of the loads, and for that, we can use atomic_thread_fence(memory_order_acquire), which allows no code below it to move above the fence, and no loads above it from moving below it, and therefore, not be re-ordered with the CS.
The code then looks like this:

int st = UNLOCKED;

atomic_store(&states[id*PADRATIO], LOCKED);

atomic_thread_fence(memory_order_seq_cst);

for ( int j = 0; j < id; j += 1 )

if ( (st = atomic_load_explicit(&states[j*PADRATIO], memory_order_relaxed)) == LOCKED ) break;

if (st == LOCKED) {

atomic_store(&states[id*PADRATIO], UNLOCKED);

while (1) {

for ( int j = 0; j < N; j += 1 )

if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;

if (st == UNLOCKED) break;

Pause();

}

continue;

}

for ( int j = id + 1; j < N; j += 1 )

while (atomic_load_explicit(&states[j*PADRATIO], memory_order_relaxed) == LOCKED) { Pause(); }

atomic_thread_fence(memory_order_acquire);

CriticalSection( id ); // critical section

atomic_store_explicit(&states[id*PADRATIO], UNLOCKED, memory_order_release);

This is a good example of how/when to use atomic_thread_fence() for the two different cases. If you want to dig deeper into this, I recommend looking at page 1145, section Fences of the C++ draft.

We can still do a little bit better. If we look carefully at the first atomic_store(), we notice that the full fence of atomic_thread_fence(memory_order_seq_cst) will also prevent this store from being re-ordered with operations after the fence, which means we could make the store use memory_order_relaxed. We could think that if we are running this in a loop, it can be re-ordered with the last atomic_store() (the one after the CS), but because the two atomic stores are acting on the same memory location (the same variable) they will not be re-ordered due to the write-write coherency.

The final code looks like this:

int st = UNLOCKED;

atomic_store_explicit(&states[id*PADRATIO], LOCKED, memory_order_relaxed);

atomic_thread_fence(memory_order_seq_cst);

for ( int j = 0; j < id; j += 1 )

if ( (st = atomic_load_explicit(&states[j*PADRATIO], memory_order_relaxed)) == LOCKED ) break;

if (st == LOCKED) {

atomic_store(&states[id*PADRATIO], UNLOCKED);

while (1) {

for ( int j = 0; j < N; j += 1 )

if ( (st = atomic_load(&states[j*PADRATIO])) == LOCKED ) break;

if (st == UNLOCKED) break;

Pause();

}

continue;

}

for ( int j = id + 1; j < N; j += 1 )

while (atomic_load_explicit(&states[j*PADRATIO], memory_order_relaxed) == LOCKED) { Pause(); }

atomic_thread_fence(memory_order_acquire);

CriticalSection( id );

atomic_store_explicit(&states[id*PADRATIO], UNLOCKED, memory_order_release);

Just how much of a difference do all of these optimizations make?
Let's look at some numbers, with three different implementations:

seq/cst: All atomics use memory_order_seq_cst as shown at the beginning of the previous post;
acq/rel: A few atomics have memory_order_acquire or memory_order_release instead of memory_order_seq_cst;
fence/relax; A few atomics use memory_order_relaxed and we put atomic_thread_fence() where needed;

N = 8

x86
seq/cst:     187906607
acq/rel:     215770541
fence/relax: 216936469

PowerPC
seq/cst:      63647516
acq/rel:      75852650
fence/relax: 88879098

The results above are for N=8 with minimum contention, i.e., just one thread at a time is trying to get the lock.
For x86, changing from seq/cst to acq/rel is a big difference (almost 15%) because it gets rid of one MFENCE instruction, which is slow. Going from acq/rel to fence/relax is a minute difference, if any, because on x86 having a relaxed-load or an acquire-load is the same instruction (MOV from memory), and having a relaxed-store or release-store is the same instruction (MOV to memory), so the only difference (if any) arises from the extra optimizations that the compiler may do.
For PowerPC, there is a boost in performance of 19% in going from seq/cst to acq/rel, and then another increase of 17% going to fence/relax. This happens because at each optimization we remove more HWSYNC and LWSYNC fence instructions.

In conclusion, relaxed atomics are cool because they can make your code run faster!

Concurrency Freaks

Monday, June 15, 2015

An example of atomic_thread_fence()

1 comment:

Contributors