Tuesday, October 18, 2016

CRTurn Queue - The first MPMC memory-unbounded wait-free queue with memory reclamation

A few days ago I turned 40.
It's a big event, and to celebrate it, Andreia and I decided to share some of the work we've been doing recently.
Yes that's right, it's my birthday but you get the gifts!

We officially present the CRTurn queue, the first (correct) memory-unbounded multi-producer-multi-consumer wait-free queue for C++ that does its own memory reclamation, and does it in a wait-free way.

If you just care about the C++ code, then follow this link
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/papers/crturnqueue/include/CRTurnQueue.hpp
If it's Java you're interested in, then we have an implementation with self-linking at the URL below
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/queues/CRTurnQueue.java
And if you care about academic papers with invariant proofs and microbenchmark plots then look here
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/crturnqueue-2016.pdf

Otherwise, just keep on reading for the blog post  ;)


When it comes to memory-unbounded multi-producer-multi-consumer (MPMC) queues, there are three known approaches in the literature, all based on singly-linked lists, plus a handful of generic wait-free constructions that can be applied to queues, but those are too slow to be a match for the "hand-written" queues. You can see the respective paper for each of the three queues here:
KP http://www.cs.technion.ac.il/~erez/Papers/wf-methodology-ppopp12.pdf
FK http://www.cs.uoi.gr/tech_reports/publications/TR2011-01.pdf
YMC http://chaoran.me/assets/pdf/wfq-ppopp16.pdf
Of these three algorithms: KP is for Java only, which means the memory reclamation is done by the GC, and since the JVM's GC implementation is blocking, this queue isn't truly wait-free, at least not when it comes to latency; still, IMO it's the best of all previous work, and it's the only one for which there is a correct implementation.
FK does no memory reclamation and has errors in the implementation.
YMC has a memory reclamation scheme that is flawed by design, and even if fixed it would not be wait-free, and it seems to have some errors in the implementation (to their credit, they were the first to realize the importance of doing memory reclamation).
This means that CRTurn queue is the first to reclaim its own memory, and therefore the first that can be implemented in C, C++, or any other language without a GC. Not only does it reclaim the memory of the dequeued nodes, it does so in a bounded wait-free way, a feat which is not easy (to say the least).

The CRTurn queue has other interesting properties. For example, it does no memory allocation except for the creation of the node that is inserted into the linked list, and even that can be pre-allocated if desired.
In CRTurn, the enqueue and dequeue algorithms are isolated, which means you can use just one of them and combine it with a single-threaded queue algorithm (singly-linked-list based) to make a wait-free SPMC or MPSC queue. Even better, the enqueue() algorithm is very short, so it is tempting to attach it to the dequeue() of the Michael-Scott queue to create a simple MPMC queue with a wait-free enqueue() and a lock-free dequeue().
Such a queue has a small number of lines of code and is relatively easy to reason about and convince yourself of its correctness.
Other properties include a fast wait-free consensus we call the "Turn" consensus (think Lamport's Bakery, but better), and the fact that it achieves bounded wait-freedom using only CAS (compare-and-swap), which is nice because not all CPUs have FAA (fetch-and-add).

This queue was designed for low latency at the tail of the latency distribution, i.e. for real-time systems. Sure, it does a new/delete for each node that is enqueued/dequeued, but you can plug in your custom allocator if you want.
Surprising as it may seem, in the uncontended scenario this queue isn't far behind the Michael-Scott queue, i.e. the best-known memory-unbounded MPMC lock-free queue. It's nice to see that we don't have to pay a very high price for wait-free guarantees.
We tested this code heavily to make it production-ready and aimed to provide code that is as simple as possible, but keep in mind that this is a wait-free queue, and wait-free algorithms are rarely simple.

This queue is a big deal for us because Andreia and I worked really hard to get here. We spent late nights working on this, unending discussions on how to solve the different problems, how to proceed with the design, writing code, writing tests, running benchmarks, analyzing the results, figuring out the best way to measure tail latency, writing more tests, and more experimental code, and more discussions, until we got to what is being shown as CRTurn queue. Along the way, we created a new wait-free consensus protocol, we had to figure out how to apply Hazard Pointers in a completely wait-free way, how to use the least amount of stores for Hazard Pointers to keep the throughput high, and all of that without doing any heap allocation.
It was a long and hard journey, particularly the memory reclamation part, and we learned a lot along the way.
I don't know if we'll get published in a top tier conference, but we are sharing this fruit of our labor with everyone, and we hope you enjoy it  ;)

More to come...

12 comments:

  1. C'tor needs to throw when maxThreads out of range. Calls to enqueue and dequeue should fail when tid out of range. Checking for those values being negative can be avoided by changing parameters from int to unsigned...
    Jack Goral

    ReplyDelete
    Replies
    1. Hi Jack,
We don't check boundaries for tid because there is an even bigger constraint/assumption: the tids must be unique. This is very difficult to check in practice, so we just don't do it and it's up to the user/caller to guarantee it. If the user can't even choose a positive tid lower than MAX_THREADS, then it's unlikely that he'll choose unique tids, and he won't be able to use this queue anyway.

      Delete
    2. I think you can check as much as you can and leave to the user what you can't. In my test of your queue I had to check for positive tid numbers, tid <= MAX_THREADS, and store the mapping for generated tids in std::unordered_map tids_;
      All this can be done inside the queue and the user should not see/use tids at all.
      Jack

      Delete
    3. Huhh, I'm afraid I don't understand how it is possible to achieve unique tids using std::unordered_map (without using a mutex).
      Perhaps there is a trick I'm missing?
      If so, feel free to drop me an email with more details at pramalhe gmail com, or better yet, submit a pull request on github :)

      PS: I hope you checked for tid < MAX_THREADS (instead of tid <= MAX_THREADS), otherwise there will be errors due to "array out of bounds".

      Delete
    4. See my problem, too :-). Jack

      Delete
  2. I've just created a wait-free mpmc queue in 100+ lines of C++11 code:
    https://github.com/MengRao/WFMPMC

    Can you help check?

    Thanks,
    Meng

    ReplyDelete
  3. The performance of this queue is very good. Can anyone tell me how I can go about modifying this queue for IPC/Shared memory?

    ReplyDelete
    Replies
    1. You would need an allocator for shared memory, something like Doug Lea's allocator. It wouldn't be wait-free (or even lock-free).
      How about using LCRQ? It needs double-word CAS (which exists on x86), and you keep a single node and re-use it, except in some special circumstances.
      https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/queues/LCRQueue.hpp
      We have a few non-blocking memory bounded queues but we haven't published the paper yet, so we can't put up the code online.

      Delete
    2. This comment has been removed by the author.

      Delete
    3. Thank you for the pointer. I was successfully able to modify the LCRQ implementation to shared memory using Boost interprocess shared memory allocator. I know that kinda makes it locking because the allocator is locking, but I can live with that trade off.

      I needed something for IPC in a 3 process simulation. Benchmarks here - https://github.com/r10a/threads_exp/tree/LCRQueue

      I am also in the process of converting it into a block-when-necessary queue so the threads can sleep when the queue is empty to save CPU power.

      Delete
  4. Hi, as I know, the cache line size of L1 and L2 is 64 bytes. Why do you use 128 in the code 'CLPAD = 128/sizeof(std::atomic);'?

    ReplyDelete
    Replies
    1. This is a trick I learned from Dave Dice (I hope I don't misquote him): when a cache line is requested on x86, sometimes the pre-fetcher will request the next adjacent cache line. This means that the next cache line also becomes a source of false sharing. As such, to fully eliminate false-sharing, the padding should be two cache lines => 128 bytes.

      Notice that alignas(128) doesn't really guarantee that the compiler will respect the 128-byte alignment, but it's an effort.

      Delete