Back in 2011, Alex Kogan and Erez Petrank presented a multi-producer-multi-consumer (MPMC) wait-free queue.
To the best of our knowledge, theirs is the only reasonably useful and correctly implemented memory-unbounded MPMC wait-free queue (that's a mouthful).
Yes, there are generic wait-free construction techniques, but applying them to a queue is just too slow; and yes, there are two other MPMC wait-free algorithms, but their implementations have bugs and are CPU-specific.
Even the algorithm by Kogan and Petrank relies on having an automatic Garbage Collector (GC), so they provided their implementation in Java. As the authors themselves mention in their paper, no wait-free GC currently exists, either in production or in the literature, which means this queue is never truly wait-free. But let's leave that discussion for another post and focus on latency.
How good is this "wait-free" queue when it comes to tail latency?
Pretty good, as it turns out. We showed some plots in a previous post.
Unfortunately, as soon as you start "hammering on it" with lots of contention, a problem appears.
The problem is that the GC introduces pauses, really, really long pauses.
This queue is based on a singly-linked list and, like all data structures built on singly-linked lists, when used with a GC the unused nodes can't simply be unlinked from the list: they need to be self-linked, by pointing their next pointer at the node itself.
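In code, self-linking is just one extra store after a node is retired. Here is a minimal sketch of the idea (my own illustrative names, not the queue's actual code):

```java
// Minimal sketch of self-linking: a retired node's next pointer is made to
// point at the node itself, instead of being left pointing into the live list.
final class Node<E> {
    E value;
    Node<E> next;
    Node(E value) { this.value = value; }
}

final class SelfLinkDemo {
    public static void main(String[] args) {
        Node<String> a = new Node<>("a");
        Node<String> b = new Node<>("b");
        a.next = b;       // 'a' precedes 'b' in the list
        // After 'a' is removed from the list, we don't leave a.next pointing
        // at 'b' (which could keep old-generation-to-young-generation
        // references alive for the GC to chase); we self-link it instead:
        a.next = a;       // self-link
        System.out.println(a.next == a); // prints "true"
    }
}
```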
Why this is important has to do with how GCs work, and there is a very good explanation in this presentation, starting at minute 23:
Life of a Twitter jvm engineer - The garbage keeps coming... by Tony Printezis
which, by the way, I recommend watching in its entirety. But for the purposes of this post, if you don't know what this self-linking business is all about, go watch ten minutes of it starting at minute 23, then come back and read the rest of this post.
Do you understand now why self-linking is important?
This is a genuinely non-obvious detail of GCs, and I first learned about it several years ago from Doug Lea.
The following code is pretty much what was written in the original paper, but with some sun.misc.Unsafe added to it:
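As a hedged sketch of its core (not the paper's actual listing; class and field names are my own, and AtomicReferenceFieldUpdater stands in for the original's sun.misc.Unsafe so the sketch compiles on any JVM), the node type with a CAS on its next pointer might look like this:

```java
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

// Illustrative sketch of the node at the heart of a Kogan-Petrank-style
// queue: an immutable value, the id of the enqueuing thread (used by the
// helping mechanism), and a CAS-able next pointer.
final class KPNode<E> {
    final E value;
    final int enqTid;            // thread id of the enqueuer, used for helping
    volatile KPNode<E> next;

    @SuppressWarnings("rawtypes")
    private static final AtomicReferenceFieldUpdater<KPNode, KPNode> NEXT =
        AtomicReferenceFieldUpdater.newUpdater(KPNode.class, KPNode.class, "next");

    KPNode(E value, int enqTid) {
        this.value = value;
        this.enqTid = enqTid;
    }

    // Atomically swing next from 'expect' to 'update'.
    boolean casNext(KPNode<E> expect, KPNode<E> update) {
        return NEXT.compareAndSet(this, expect, update);
    }
}
```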
and in our benchmarks it has GC pauses that go up to 70 seconds.
Yes, that's right: seventy seconds during which our 32-core machine is stuck doing nothing except running the GC :-O
Imagine you are reading this blog post on your laptop or tablet, you click on a link, and everything freezes for 70 seconds because of the GC... does that sound reasonable in any way?
You're going to say that this is the GC's fault, and that using a JVM with a decent GC, like Zing, would fix it. I can't say for sure, because I don't have access to Zing. I do believe another GC would improve matters, but I seriously doubt it would be as effective as doing self-linking.
For a deep dive on this topic, check out the paper "A Performance Study of Java Garbage Collectors on Multicore Architectures", particularly figure 1 and table 3.
The trick to help the GC is to self-link nodes after dequeueing. However, we can't self-link the node where your value is: that node may still be reachable through head, and its next must remain valid because the next thread calling deq() will use it to obtain its "value". Instead, we self-link the node that points to it. Here is the variant we did, with self-linking of unused nodes:
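To make the rule concrete, here is a hedged, simplified sketch (a synchronized sentinel-based queue, not the wait-free KPQueue itself; names are mine) showing which node gets self-linked in deq():

```java
// Simplified illustration of the self-linking rule: the value being returned
// lives in head.next, and that node becomes the new sentinel (still reachable
// through head), so the node we retire and self-link is the *predecessor*,
// i.e. the old head.
final class SelfLinkingQueue<E> {
    static final class Node<E> {
        E value;
        Node<E> next;
        Node(E value) { this.value = value; }
    }

    private Node<E> head = new Node<>(null); // sentinel node
    private Node<E> tail = head;

    synchronized void enq(E v) {
        Node<E> n = new Node<>(v);
        tail.next = n;
        tail = n;
    }

    synchronized E deq() {
        Node<E> first = head.next;
        if (first == null) return null;    // queue is empty
        E v = first.value;
        Node<E> retired = head;
        head = first;            // 'first' is the new sentinel: keep its next valid
        first.value = null;      // drop the value reference to help the GC
        retired.next = retired;  // self-link the retired predecessor, NOT 'first'
        return v;
    }
}
```

The key line is the last assignment in deq(): the retired node is pointed at itself, while the node that held the value keeps a valid next because a later dequeuer will traverse it.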
and in our benchmarks it has GC pauses that go up to 3 milliseconds.
Keep in mind that these are exactly the same algorithm, apart from an extra check in help_finish_enq() and the self-linking in deq(), and yet they have completely different latency profiles.
Below is one example run with GC logs. In this example, the JVM ran out of memory during the benchmark and had to run a full GC that took 17 seconds, followed by another that took 42 seconds. At the end of the benchmark we call System.gc() and sleep for a while to trigger another GC run, and that one took more than 45 seconds.
Those 45 seconds are not accounted for in the benchmark, which is unfair: the GC is cleaning up the garbage the queue produced, so that work should be counted in the benchmark as well. But anyway, we don't care much about throughput; for us it's mostly about latency:
##### KPNoSLQueue #####
[GC (Allocation Failure) 3932481K->221057K(15073280K), 17.3218925 secs]
[GC (Allocation Failure) 4153217K->547969K(15073280K), 42.6884304 secs]
Number of enqueues/dequeues per sec = 2235
[GC (System.gc()) 1324375K->601121K(15073280K), 45.8331662 secs]
[Full GC (System.gc()) 601121K->8434K(15073280K), 0.0763649 secs]
When we run the same benchmark on the self-linking version, there is still one GC allocation failure during the benchmark, but it takes only about 1 millisecond to complete, and the throughput is significantly higher because there is very little waiting on the GC:
##### KPQueue ##### Starting run...
[GC (Allocation Failure) 3932485K->805K(15073280K), 0.0013195 secs]
Number of enqueues/dequeues per sec = 3216
[GC (System.gc()) 2541778K->453K(15073280K), 0.0012654 secs]
[Full GC (System.gc()) 453K->327K(15073280K), 0.0142511 secs]
This is the problem with not reclaiming unused memory (nodes) yourself and leaving it all up to the GC: once you go down that path, you need to understand what the GC is doing and make friends with it, so that it behaves the way you want it to.
Long story short: if you're going to implement a queue in Java based on a singly-linked list, you'd better do self-linking.