Tuesday, October 18, 2016

CRTurn Queue - The first MPMC memory-unbounded wait-free queue with memory reclamation

A few days ago I turned 40.
It's a big event, and to celebrate it, Andreia and I decided to share some of the work we've been doing recently.
Yes that's right, it's my birthday but you get the gifts!

We officially present the CRTurn queue, the first (correct) memory-unbounded multi-producer-multi-consumer wait-free queue for C++ that does its own memory reclamation, and does it in a wait-free way.

If you just care about the C++ code, then follow this link
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/papers/crturnqueue/include/CRTurnQueue.hpp
If it's Java you're interested in, then we have an implementation with self-linking at the URL below
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/queues/CRTurnQueue.java
And if you care about academic papers with invariant proofs and microbenchmark plots then look here
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/crturnqueue-2016.pdf

Otherwise, just keep on reading for the blog post  ;)


When it comes to memory-unbounded multi-producer-multi-consumer (MPMC) queues, there are three known approaches in the literature, all based on singly-linked lists, plus a handful of generic wait-free constructions that can be applied to queues, but those are too slow to be a match for the "hand-written" queues. You can see the respective paper for each of the three queues here:
KP http://www.cs.technion.ac.il/~erez/Papers/wf-methodology-ppopp12.pdf
FK http://www.cs.uoi.gr/tech_reports/publications/TR2011-01.pdf
YMC http://chaoran.me/assets/pdf/wfq-ppopp16.pdf
Of these three algorithms: KP is for Java only, which means the memory reclamation is done by the GC, and since the JVM's GC implementation is blocking, this queue isn't truly wait-free, at least not when it comes to latency; still, IMO it's the best of all previous work, and it's the only one for which there is a correct implementation.
FK does no memory reclamation and has errors in the implementation.
YMC has a memory reclamation scheme that is flawed by design, and even if fixed it would not be wait-free, and it seems to have some errors in the implementation (to their credit, they were the first to realize the importance of doing memory reclamation).
This means that CRTurn queue is the first to reclaim its own memory, and therefore the first that can be implemented in C, C++, or any other language without a GC. Not only does it reclaim the memory of the dequeued nodes, it does so in a bounded wait-free way, a feat which is not easy (to say the least).

The CRTurn queue has other interesting properties. For example, it does no memory allocation except for the creation of the node that is inserted into the linked list, and even that can be pre-allocated if desired.
In CRTurn, the enqueue and dequeue algorithms are isolated, which means you can use just one of them and combine it with a single-threaded queue algorithm (singly-linked-list based) to make a wait-free SPMC or MPSC queue. Even better, the enqueue() algorithm is very short, so it is tempting to attach it to the dequeue() of the Michael-Scott queue to create a simple MPMC queue with a wait-free enqueue() and a lock-free dequeue().
Such a queue has a small number of lines of code and is relatively easy to reason about and convince yourself of its correctness.
Other properties include a fast wait-free consensus we call the "Turn" consensus (think Lamport's Bakery, but better), and the fact that it achieves bounded wait-freedom using only CAS (compare-and-swap), which is nice because not all CPUs have FAA (fetch-and-add).

This queue was designed for low latency at the tail of the latency distribution, i.e. for real-time systems. Sure, it does a new/delete for each node that is enqueued/dequeued, but you can plug in your custom allocator if you want.
Surprising as it may seem, in the uncontended scenario this queue isn't far behind the Michael-Scott queue, i.e. the best-known memory-unbounded MPMC lock-free queue. It's nice to see that we don't have to pay a very high price for wait-free guarantees.
We tested this code heavily to make it production-ready and aimed to provide code that is as simple as possible, but keep in mind that this is a wait-free queue, and wait-free algorithms are rarely simple.

This queue is a big deal for us because Andreia and I worked really hard to get here. We spent late nights working on this, unending discussions on how to solve the different problems, how to proceed with the design, writing code, writing tests, running benchmarks, analyzing the results, figuring out the best way to measure tail latency, writing more tests, and more experimental code, and more discussions, until we got to what is being shown as CRTurn queue. Along the way, we created a new wait-free consensus protocol, we had to figure out how to apply Hazard Pointers in a completely wait-free way, how to use the least amount of stores for Hazard Pointers to keep the throughput high, and all of that without doing any heap allocation.
It was a long and hard journey, particularly the memory reclamation part, and we learned a lot along the way.
I don't know if we'll get published in a top tier conference, but we are sharing this fruit of our labor with everyone, and we hope you enjoy it  ;)

More to come...

12 comments:

  1. C'tor needs to throw when maxThreads out of range. Calls to enqueue and dequeue should fail when tid out of range. Checking for those values being negative can be avoided by changing parameters from int to unsigned...
    Jack Goral

    ReplyDelete
    Replies
    1. Hi Jack,
We don't check boundaries for tid because there is an even bigger constraint/assumption: the tids must be unique. This is very difficult to check in practice, so we just don't do it and it's up to the user/caller to guarantee it. If the user can't even choose a positive tid lower than MAX_THREADS, then it's unlikely that he'll choose unique tids, and he won't be able to use this queue anyway.

      Delete
    2. I think you can check as much as you can and leave to the user what you can't. In my test of your queue I had to check for positive tid numbers, tid <= MAX_THREADS, and store the mapping for generated tids in std::unordered_map tids_;
      All this can be done inside the queue and the user should not see/use tids at all.
      Jack

      Delete
    3. Huhh, I'm afraid I don't understand how it is possible to achieve unique tids using std::unordered_map (without using a mutex).
      Perhaps there is a trick I'm missing?
      If so, feel free to drop me an email with more details at pramalhe gmail com, or better yet, submit a pull request on github :)

      PS: I hope you checked for tid < MAX_THREADS (instead of tid <= MAX_THREADS), otherwise there will be errors due to "array out of bounds".

      Delete
    4. See my problem, too :-). Jack

      Delete
  2. I've just created a wait-free mpmc queue in 100+ lines of C++11 code:
    https://github.com/MengRao/WFMPMC

    Can you help check?

    Thanks,
    Meng

    ReplyDelete
  3. The performance of this queue is very good. Can anyone tell me how I can go about modifying this queue for IPC/Shared memory?

    ReplyDelete
    Replies
    1. You would need an allocator for shared memory, something like Doug Lea's allocator. It wouldn't be wait-free (or even lock-free).
      How about using LCRQ? It needs double-word CAS (which exists on x86), and you keep a single node and re-use it, except in some special circumstances.
      https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/queues/LCRQueue.hpp
      We have a few non-blocking memory bounded queues but we haven't published the paper yet, so we can't put up the code online.

      Delete
    2. This comment has been removed by the author.

      Delete
    3. Thank you for the pointer. I was successfully able to modify the LCRQ implementation to shared memory using Boost interprocess shared memory allocator. I know that kinda makes it locking because the allocator is locking, but I can live with that trade off.

      I needed something for IPC in a 3 process simulation. Benchmarks here - https://github.com/r10a/threads_exp/tree/LCRQueue

      I am also in the process of converting it into a block-when-necessary queue so the threads can sleep when the queue is empty to save CPU power.

      Delete
  4. Hi, as I know, the cache line size of L1 and L2 is 64 bytes. Why do you use 128 in the code 'CLPAD = 128/sizeof(std::atomic);'?

    ReplyDelete
    Replies
    1. This is a trick I learned from Dave Dice (I hope I don't misquote him): when a cache line is requested on x86, sometimes the pre-fetcher will request the next adjacent cache line. This means that the next cache line also becomes a source of false sharing. As such, to fully eliminate false-sharing, the padding should be two cache lines => 128 bytes.

      Notice that alignas(128) doesn't really guarantee that the compiler will respect the 128-byte alignment, but it's an effort.

      Delete