http://web.mit.edu/6.173/www/currentsemester/readings/R06-scalable-synchronization-1991.pdf
It was discovered by John Mellor-Crummey and Michael Scott in 1991 (see section 2.2 of the pdf above).
Thanks to the new memory model in C11 and C++11, we can use atomic operations on a variable, with or without acquire/release barriers.
When no barrier is used and only the atomicity property is guaranteed, it is called a "relaxed atomic":
http://en.cppreference.com/w/cpp/atomic/memory_order
Notice that in the C11/C++11 memory model, an atomic variable is not intrinsically "sequentially-consistent" or "relaxed", but instead, we can choose its behavior on a per-access basis.
The only thing that the atomic qualifier gives is the guarantee of atomicity, i.e. that the variable will be read or written in "one shot". This is different from other languages like Java or Scala, where (almost) all native types provide that guarantee and therefore act as a "relaxed atomic", and where a variable declared as sequentially consistent will remain so for the entirety of the program (unless, in the case of Java, you use sun.misc.Unsafe). Although this subtle difference may seem unimportant, it has some relevance when the goal is to squeeze the maximum performance out of a synchronization or concurrent algorithm.
Suppose you want to declare an atomic integer in different languages, this is how you would go about:
C11 (can be relaxed or sequentially consistent)
_Atomic int x;
C++11 or C++14 (can be relaxed or sequentially consistent)
atomic<int> x;
Java (sequentially consistent)
volatile int x;
With that info at hand, we can now write a simple implementation of the Ticket Lock in C11. Here's how it would look:
typedef struct
{
    _Atomic long ingress;
    char padding[64];   // To avoid false sharing between ingress and egress
    _Atomic long egress;
} ticket_mutex_t;

void ticket_mutex_lock(ticket_mutex_t *self)
{
    long lingress = atomic_fetch_add(&self->ingress, 1);
    while (lingress != atomic_load(&self->egress)) {
        sched_yield();
    }
    // This thread has acquired the lock on the mutex
}

void ticket_mutex_unlock(ticket_mutex_t *self)
{
    atomic_fetch_add(&self->egress, 1);
}
There are a few optimizations we can apply to this implementation:
PS: It turns out this optimization is not possible in a very special case, if the CPU does branch prediction and re-orders instructions. Anyway, this first optimization doesn't play well with the C11/C++1x memory model, so better not to use it. I've updated the source code accordingly.
2) In the unlock() function, we can replace the atomic_fetch_add() with an atomic_load() and an atomic_store().
There is no need to do the read/addition in one atomic operation in the unlock() function because only one thread at a time is able to call this function. This means that we can implement the increment on egress with an atomic_load() to read the current value, and atomic_store() to write the current value plus one.
3) The previously described atomic_load() can be made relaxed.
Step 2 can be optimized even further by doing a relaxed atomic_load(). An acquire barrier was already done on the atomic_fetch_add() on ingress in the lock() function or, in the worst case, on the atomic_load() of egress inside the while() loop, which guarantees that the current thread has the most up-to-date value of egress. We also have the guarantee that between then and the call to unlock() no other thread modified the egress variable, so this is safe to do.
The final code is the following:
void ticket_mutex_unlock(ticket_mutex_t *self)
{
    long legress = atomic_load_explicit(&self->egress, memory_order_relaxed);
    atomic_store(&self->egress, legress+1);
}
These are just a few examples of how to use relaxed atomics to optimize your code.
Now, we should be honest and mention that the performance improvements from these particular optimizations are very small, and that some of them can be tricky to understand or even to prove correct. The upside is that this code is still cross-platform, i.e., you can run it on x86, PowerPC, MIPS, or ARM, and with the memory model you have the confidence that the code is still safe (correct).
This is why I like the C11/C++1x memory model so much :)
Source code for this C11 Ticket Lock is available on github:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/ticket_mutex.h
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/ticket_mutex.c
I believe your optimization (1) is unsound in the C11/C++11 model. Indeed,
while fetch_add is an acquire operation, it only synchronizes on the location
it is operating on. So the fetch_add from the lock cannot synchronize with the
fetch_add from the unlock (ingress is not the same location as egress). So your
code could break provided a sufficiently aggressive compiler and processor (I
suspect Itanium would do the trick, but I don't know its memory model well
enough to say from memory). For example one thread could take the lock,
initialize a variable, release the lock; then another thread could take the lock
and still see the variable uninitialized.
I agree with your optimizations (2) and (3).
Two other possibilities exist:
- use the release/acquire attributes instead of the default of seq_cst (you are
not using its stronger guarantees and it is slower).
- use an atomic_thread_fence(acquire) after the loop in lock(), turning all the
accesses in the loop into relaxed accesses.
This would result in the following code:
void ticket_mutex_lock(ticket_mutex_t *self)
{
    long lingress = atomic_fetch_add(&self->ingress, 1);
    while (lingress != atomic_load_explicit(&self->egress, memory_order_relaxed)) {
        sched_yield();
    }
    atomic_thread_fence(memory_order_acquire);
}
void ticket_mutex_unlock(ticket_mutex_t *self)
{
    long legress = atomic_load_explicit(&self->egress, memory_order_relaxed);
    atomic_store_explicit(&self->egress, legress+1, memory_order_release);
}
I suspect the fetch_add in lock() could be made relaxed too, but it is not
obvious, and maybe wrong.
Best regards,
Robin
Hi Robin,
Thanks for the comment.
I think you may be confusing memory_order_consume with memory_order_seq_cst:
http://en.cppreference.com/w/cpp/atomic/memory_order
If we were doing atomic_fetch_add_explicit(1, std::memory_order_consume), then the reasoning you describe would be correct, because memory_order_consume only acts on the affected memory location. But we are using the default memory order, which is memory_order_seq_cst, which guarantees an "implicit" acquire and release when doing the atomic_fetch_add(), and as such prevents re-ordering with other code (at either the CPU level or the compiler level).
Therefore, I believe (1) to be a valid optimization.
Thanks, but I am not confusing with memory_order_consume. The 'implicit' acquire means that *if* the fetch_add were to synchronize with another access, then every access afterwards will see everything that happened before that other access (which is indeed not even true for consume).
But fetch_add can only synchronize with accesses to the same variable. In the jargon of the standard: "An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from any side effect in the release
sequence headed by A." ( it may be more readable as the 'synchronizes_with' predicate in the mathematical formalization of that part of the standard: http://www.cl.cam.ac.uk/~pes20/cpp/)
So if thread 1 takes the lock, does stuff, and unlocks, and then thread 2 takes the lock, thread 2 will see everything that was done before thread 1 took the lock (thanks to the synchronization between the fetch_add in each lock()), but not necessarily the do_stuff(), as there is no release access to ingress in unlock() that it could synchronize with.
Best regards,
Robin
Hi Robin,
I think I understand what you're saying. You are absolutely correct when you say that "(...) there is no release access to ingress in unlock() that it could synchronize with (...)", but
you are forgetting that there is a happens-before relationship between "ingress" and "egress".
Assuming the scenario you describe with:
Thread 1:
ticket_mutex_lock();
do_stuff();
ticket_mutex_unlock();
Thread 2:
ticket_mutex_lock();
do_stuff();
ticket_mutex_unlock();
Let me do some assertions and then you can say if you disagree with any of these statements (I might be missing something which you find obvious):
a) In ticket_mutex_lock() the value of egress will always be read _after_ the value of ingress, due to the implicit acquire barrier in fetch_add(ingress,1)
b) ticket_mutex_lock() will only return if ingress equals egress, implying that do_stuff() will be called only in such a situation
c) To some thread synchronizing on egress, ticket_mutex_unlock() will always occur _after_ do_stuff()
Summarizing, in ticket_mutex_lock(), the fetch_add(ingress,1) happens-before the atomic_load_explicit(egress, relaxed), and only when
ingress equals egress will the lock be acquired and do_stuff() called.
However, while thinking about your comment, I thought about a different scenario, one which occurs only at the CPU re-ordering level due
to branch prediction; after talking with Andreia for a while, she showed me a rare but possible issue:
Suppose the CPU predicts that the following expression will be true and starts executing part of do_stuff() before it has all the values from memory:
if (lingress == atomic_load_explicit(&self->egress, memory_order_relaxed)) return ;
In this case, even if the prediction turns out to be true (ingress == egress), for a small window of time the CPU was using values for variables in do_stuff() that
could have been simultaneously written by another thread which, in the meantime, called ticket_mutex_unlock() and thus released the lock.
Notice that this scenario only happens at the CPU level if it decides to predict true for this branch AND egress is still not equal to ingress
AND egress is equal to ingress by the time the value of egress arrives.
If the expression in the if() is false, then the code goes through a load with acquire, and would be correct.
This is a very tricky and rare scenario, but it means that optimization (1) is not a valid optimization.
I think this is not the scenario that you were talking about, but thanks for making us think about this Robin!
I will update the post and the code accordingly in the next few days.
Once again, thank you for your comment,
Pedro