Sunday, April 27, 2014

Multi-Producer-Single-Consumer Queue

While reading a post on the concurrency-interest mailing list, and then the queue that is currently used for message passing between the actors in Akka, I found this post with an amazing queue algorithm by Dmitry Vyukov:

It seems that the current queue implementation in Akka uses this algorithm. It took me some time to understand how it works, and there are a few interesting things:

First, this queue is wait-free for push() and pop() operations.
The push() function does one XCHG and one store with release barrier, and no retries are needed.
The pop() function does stores, and that's it.
It's true that this queue is single-consumer only, but you should try to come up with a MPSC queue that is wait-free (or even lock-free) to see how hard it is. Yep, it's hard, and this guy did it, so I'm definitely impressed, and so should you!
Just to give it some perspective, the typical Michael & Scott queue which is implemented in Java's ConcurrentLinkedQueue, needs at least two CAS to insert a new node... both of which may fail and will have to be retried.

The disadvantage is, that this queue is not linearizable. In fact, it's not even sequentially consistent, or weakly consistent, or even has an happens-before relation between calls of push() and pop(). It's consistency model is serializable.
As surprising as it may seem, this is enough to get the actor model properly working and passing messages from actor to actor. I'm guessing this happens because each actor needs to see the received messages in FIFO order, but it is ok to not see those messages immediately.
It's the joy of distributed systems  :)

Later note:
After some comments (thanks Nitsan!) I realized that some details of this post were a bit confusing, namely, there are two variants of this MPSC queue and they have a slightly different pop().
In the implementation by Vyukov the pop()'s Progress Condition is Lock-Free and its Consistency Model is Serializable.
In the implementation in Akka, the pop()'s Progress Condition is Blocking, but its Consistency Model is Linearizable.
The Akka implementation is Blocking because it may spin eternally if the thread inserting the node blocks/crashes/sleeps before setting the next of the previous node, but Linearizability is a much stronger consistency model than Serializability, so it's a trade-off (both implementation are correct).

Sunday, April 20, 2014

Native Code Performance on Modern CPUs

...and speaking of BUILD 2014, Eric Brumer did a cool presentation on native code performance on modern CPUs.

He covered some new things like fused-multiply, AVX2, vectorization, store buffers, store load forwarding, etc.
My favourite slide out of the entire presentation is the one of item #2, where he shows that performance on modern CPUs is memory bound (should be true for most workloads): 
This is something most people on the concurrency world are already aware, but it seems that this is true even for single-threaded coded, mostly (but not only) due to vectorization.

I think he distilled a very good idea out of it: that we should pay attention to the where the loads and stores are located in our code, which is not an easy task for a developer, but a necessary one for those of us wishing to write high-performance code.

Thursday, April 17, 2014

Arrays vs Linked Lists

...and speaking of Herb Sutter, two weeks ago he made a presentation at BUILD 2014
were he talked about C++, but he actually spent a lot of time on the performance
of Arrays vs Linked Lists. This is something that Bjarn Stroustrup had already
covered about a couple years ago at GoingNative 2012.

As is typical of Herb Sutter's talks, this one is quite enlightening. Very cool stuff.