Saturday, May 23, 2015

The quest for the perfect load/store lock (part 2)

60 years ago we didn't even know if it was possible to provide mutual exclusion with a software-only solution, and then came Dekker (and Dijkstra) and showed it was possible. Knuth, Peterson, Leslie Lamport, and many others followed with other solutions throughout the years. Some were real innovations, others were just refinements of previous ideas.
It's funny that some of the solutions keep on getting "re-discovered"... I guess it's the old adage of "The Book" by Erdos, that the real good solutions have always been there and just need to be "discovered", as opposed to "invented".
One concrete example is Lamport's "One Bit Solution" that Andreia re-discovered earlier this year. Mostly because it is a simple solution and one of the fastest (although not starvation-free). By the time we realized that Lamport had come up with it a long time ago, we had already made several "upgrades" on it with the intent to provide starvation-freedom, so we decided to push on a bit more and make a paper about the two best solutions we had came up with for the N-thread mutual exclusion problem:

These two algorithms were designed having in mind the ideas from the previous post, about what an ideal lock should be.
One of the algorithms CRTurn, uses global spinning, where all threads are spinning on the same variable, named turn.
The other algorithm, CRHandover, uses local spinning, where each thread is (mostly) spinning on a different variable (and cache line).
The main innovation is that because they're based on Lamport's One Bit Solution, they can enter the critical section (locking) doing a single store in shared memory, i.e. an atomic_store() for the C11/C++1x folks or volatile store for the Java/JVM developers. Same thing when exiting the critical section (unlocking), a single store in shared memory is required. However, unlike Lamport's solution, they're both starvation-free.
If you saw the "Atomic Weapons" presentation by Herb Sutter then I'm sure you know that in such memory models the most expensive thing is doing a release-store (a store in shared memory), as opposed to doing an acquire-load or a regular load/store, this means, an ideal algorithm should have the smallest possible number of release-stores in order to be as fast as possible (under low contention).

Why do we care about the low contention scenario, and not so much about high contention one?
Well, imagine you have an application where you have many lock instances, and you do some profiling in your app for latency/scalability/throughput, and you find that one of the lock instances is a big bottleneck where many (or all) threads get stuck waiting for that particular resource or critical section... now what do you do?
If the resource under contention is a data structure, then this would be an ideal place to put a lock-free variant of the said data structure, or at least a variant that has finer grain locking. If it's an object or set of objects, then you can still have gains by using techniques such as Left-Right, or Copy-On-Write, or even one (or multiple) Reader-Writer Lock. Worst comes to worse, you can always re-design the application so as to reduce contention in this single point.
What this means in practice, is that any healthy application will not have high contention on a single lock systematically. Don't get me wrong, it can have bursts of high contention occasionally, and when that happens, it better be a starvation-free lock to ensure better (lower) latency and fairness, but the point is that most of the time, when a thread is attempting to acquire a lock (any lock) in your app, it will be uncontended. This means that the total throughput will be dominated by the behavior of the lock under low contention, i.e. a single thread attempting to acquire it, and therefore, reducing the number of release-stores in that case is a very desirable goal.

Now, this is not to say that locks that were designed with high contention in mind should not be taken seriously, it's just that it's IMO a rare case, but as I said on the previous post, an ideal lock should have a good behavior on both cases.

More on this in a future post.

No comments:

Post a Comment