Thursday, March 16, 2017

MultiList - An MPSC Wait-Free Queue

I was going to talk about BitNextLazyHead but didn't have the time to write a decent post about it, so instead I'm just going to post a link to one of our recent queues, named MultiList.

MultiList is based on an idea that Andreia had of making a queue with multiple linked lists instead of a single linked list. The idea by itself is not new, but she was the first to come up with an approach that is linearizable and lock-free. In fact, it's not just lock-free, it's wait-free for enqueuing. Even more, in its simplest form it is a wait-free Multi-Producer-Single-Consumer queue... and it doesn't need a memory reclamation technique.

Yes, that's right, as crazy as it may seem, the MPSC variant does not need any kind of memory reclamation like RCU or Hazard Pointers or things like that. How awesome is that?
The paper is short and the code even shorter.
Link to the paper here: https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/multilist-2017.pdf

Monday, February 27, 2017

BitNext - A Lock-Free queue

This post is about BitNext, a new lock-free queue that Andreia and I designed, based on the lock-free linked list by Tim Harris.

Before we start, if you don't know what Tim Harris's lock-free linked list is, then go and read his paper:
https://timharris.uk/papers/2001-disc.pdf

Done?
Ok, so BitNext is an MPMC queue with lock-free progress for enqueue() and dequeue(), linearizable consistency, and it is memory-unbounded (node-based).
It is easy to add Hazard Pointers to it and the C++ code we're going to show has them embedded.
Here is the link to our paper on BitNext:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/bitnext-2016.pdf

If you're a Java coder then sorry, but this post won't be of much use to you. Not that BitNext isn't implementable in Java, it's just that, similarly to Harris's linked list, it requires some modifications to the algorithm to get it working on systems with a GC (like Java, Scala, D, etc). Such modifications are based on RTTI or AtomicMarkableReference and would take an entire post of their own, but if you're interested, you can take a look at this post: http://concurrencyfreaks.blogspot.com/2014/03/harris-with-atomicmarkablereference-add.html


Before removing a node/key from Harris's list, we need to "mark" its next pointer by setting a bit in it. Touching bits in references is a no-no in GC systems and that's why this is a "native-only" approach.
The main trick in the Harris list is that setting this bit will cause the CAS of an insertion on the following node to fail.
The Harris list allows insertions to occur anywhere on the list, and in a lock-free way. This is better than the Michael-Scott list, which only allows insertions at the end (tail) of the list, thus creating a lot of contention on the tail node and the tail pointer.
The thing about the Michael-Scott algorithm is that it was designed to be a queue, so we started wondering whether it would be possible to make a queue with the Harris list as well. Obviously it is possible, by always inserting a new node at the tail of the list, but the real question was:

"Can we insert new nodes not after the last node and still maintain FIFO order with linearizable consistency"?
and the answer is, yes, using the BitNext algorithm.
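
BitNext reuses Harris's bit-marking trick on the next pointer, so to make the snippets below concrete, here is a minimal sketch of the node and of the usual mark helpers. The names (Node, isMarked, getMarked, getUnmarked) and the surrounding context are illustrative, not necessarily the exact code from the paper:

#include <atomic>
#include <cstdint>

// Assumed context (illustrative): everything below lives inside a
// template<typename T> class BitNextQueue, with members:
//   std::atomic<Node*> head, tail;  // both start at a sentinel node,
//                                   // created with next = getMarked(nullptr)
//   HazardPointers<Node> hp;        // a generic Hazard Pointers API
//   static const int maxThreads;    // maximum number of threads
struct Node {
    T* item;
    std::atomic<Node*> next;
    Node(T* it) : item{it}, next{nullptr} {}
};

// The lowest bit of 'next' is the mark bit. Nodes are word-aligned,
// so bit 0 of a real pointer is always zero.
static bool isMarked(Node* node) {
    return reinterpret_cast<uintptr_t>(node) & 0x1;
}
static Node* getMarked(Node* node) {
    return reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(node) | 0x1);
}
static Node* getUnmarked(Node* node) {
    return reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(node) & ~uintptr_t(0x1));
}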

In BitNext, dequeuers start by advancing head only if its next is marked. If next is not marked, the dequeuer will attempt to mark the bit with a CAS.
A node with a marked bit is said to be logically dequeued. If the CAS to set the marked bit on node.next fails, it means another dequeuer was successful, or an enqueuer inserted a new node immediately after node.
Here is what the dequeue() looks like in C++:
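What follows is a minimal sketch assembled from the description above, not the exact listing from the paper: it assumes the helpers from the previous snippet, and hp.protect()/hp.clear()/hp.retire() and kHpHead are illustrative Hazard Pointers names.

T* dequeue(const int tid) {
    while (true) {
        // Protect the node we read as head with a hazard pointer
        Node* lhead = hp.protect(kHpHead, head, tid);
        Node* lnext = lhead->next.load();
        if (isMarked(lnext)) {
            // lhead is already logically dequeued
            if (getUnmarked(lnext) == nullptr) {
                hp.clear(tid);
                return nullptr;          // no node after it: queue is empty
            }
            // Help advance head past the logically dequeued node
            if (head.compare_exchange_strong(lhead, getUnmarked(lnext)))
                hp.retire(lhead, tid);
            continue;
        }
        // Try to logically dequeue lhead by setting the mark bit on its next.
        // If this CAS fails, either another dequeuer won the race, or an
        // enqueuer inserted a new node immediately after lhead: retry.
        if (lhead->next.compare_exchange_strong(lnext, getMarked(lnext))) {
            T* item = lhead->item;
            hp.clear(tid);
            return item;
        }
    }
}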

For enqueuers there are two scenarios, both starting with the enqueuer reading the current tail into the local variable ltail.
When ltail is the last node on the list, ltail.next is null, and then, similarly to the Michael-Scott queue, the enqueuer will attempt to insert its node by doing a CAS on ltail.next from null to its node.
When ltail is not the last node on the list, the enqueuer will advance the tail to the next node, ltail.next, and then attempt to insert its node in ltail.next with a CAS.
This attempt will be done at most maxThreads-1 times, after which either the CAS has succeeded, or the ltail node has been logically dequeued, thus disallowing an insertion in ltail.next. If the enqueuer were to insert a node in ltail.next even when it has the marked bit, this would still provide linearizable behavior as long as the next node had not been logically dequeued as well. Given that we can't provide such a guarantee, we forbid the enqueuer from inserting a node after a logically dequeued node, unless node.next is null, in which case there is no node after it.
Here is what the enqueue() looks like in C++:
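Again a sketch under the same assumptions as the dequeue() above. The choice to preserve the mark bit when inserting after a logically dequeued last node is my reading of the rule described in the previous paragraph:

void enqueue(T* item, const int tid) {
    Node* newNode = new Node(item);
    while (true) {
        // Protect the node we read as tail with a hazard pointer
        Node* ltail = hp.protect(kHpTail, tail, tid);
        Node* lnext = ltail->next.load();
        if (getUnmarked(lnext) != nullptr) {
            // ltail is not the last node: advance the tail by one node...
            Node* tmp = ltail;
            tail.compare_exchange_strong(tmp, getUnmarked(lnext));
            // ...and try to insert right after ltail, at most maxThreads-1
            // times. A marked lnext means ltail was logically dequeued,
            // which forbids the insertion.
            for (int i = 0; i < maxThreads-1 && !isMarked(lnext); i++) {
                newNode->next.store(getUnmarked(lnext));
                if (ltail->next.compare_exchange_strong(lnext, newNode)) {
                    hp.clear(tid);
                    return;
                }
                // on failure the CAS re-read lnext: loop and check again
            }
            continue;   // logically dequeued or too contended: re-read tail
        }
        // ltail is the last node: MS-style insertion. Keep the mark bit as
        // it was, so a logically dequeued last node stays logically dequeued.
        newNode->next.store(nullptr);
        Node* desired = isMarked(lnext) ? getMarked(newNode) : newNode;
        if (ltail->next.compare_exchange_strong(lnext, desired)) {
            tail.compare_exchange_strong(ltail, newNode);  // help advance tail
            hp.clear(tid);
            return;
        }
    }
}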

Let's walk through the enqueue() code to understand it a bit better.
We start by protecting the node we read as tail with a hazard pointer (the hp.protect() call at the top of the loop).
Then, we advance the tail by one node if it is not the last node (the first CAS on tail). Notice that we advance by a single node even though the tail may be lagging behind multiple nodes (at most maxThreads-1 nodes behind).
If the tail is already the last node, then we use an algorithm similar to the Michael-Scott queue insertion (the last branch of the loop).
Then, it's a matter of inserting our node after the node we saw as tail (the CAS on ltail->next in the inner retry loop). Notice that ltail may no longer be the last node on the list, but that's ok: if that node was already dequeued, its "next" would have the marked bit set and the insertion would fail. Because the only way to observe order on the queue is to dequeue items from it, and because dequeue() always marks the bit on next to take the item/node, we know that an insertion in the middle of the list can only succeed if the node's next is not marked, and therefore it's ok to insert in the middle.
This is however not a sufficient condition for linearizability: we must guarantee that calls to enqueue() which do not overlap in time (i.e. have a happens-before sequence) have a corresponding global order, and that is why each enqueuer reads the tail and advances it by one node (if it is not already the last).

It's a funny trick but it works really well.
The end result is that we get better throughput and better (lower) latency in the tail of the distribution when the queue is under high contention, mostly because BitNext has a natural tendency to spread contention over multiple nodes, instead of all threads hammering on the last node attempting to insert their node.

In the next post we will look at a variant of BitNext which relaxes the advance of the head, improving the dequeue() as well.
For now, here is the C++ code for BitNext:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/queues/BitNextQueue.hpp

Friday, February 17, 2017

Two posters at PPoPP 2017

We got two posters accepted at PPoPP 2017, yupiii!!!

One was for "Poor Man's URCU", which we talked about in a post long ago.
http://dl.acm.org/citation.cfm?id=3019021
It's a very simple way to implement a Userspace RCU using reader-writer locks, in a lock-free way. Yes, you can use the trylock() API in reader-writer locks to build lock-free constructs, with URCU being an example.

The other one was for the Turn Queue, the world's fastest (so far) wait-free linearizable MPMC queue with wait-free memory reclamation, which we talked about in a post a couple of months ago.
http://dl.acm.org/citation.cfm?id=3019022
The full paper is called "A Wait-Free Queue with Wait-Free Memory Reclamation" and you can get a draft here if you want to.

Tuesday, February 7, 2017

Cliff Click - A JVM Does That ???

There are a few people that are really knowledgeable about Java, and Cliff Click is certainly one of them. One of his latest talks was at Devoxx and it's named "A JVM Does That???". YouTube link below.

https://www.youtube.com/watch?v=-vizTDSz8NU

This talk is so rich with content that it's hard to explain what it is about, so I'm just going to mention my favorite parts, but please watch it all if you haven't done so yet.

The explanation on the endless loop of GC cycles when you have a cache (data structure) near the limit is just divine:
https://www.youtube.com/watch?v=-vizTDSz8NU&t=24m40s

Or how about the illusion that locks are fast (and fair):
https://www.youtube.com/watch?v=-vizTDSz8NU&t=12m55s
hence the importance of starvation-free locks, which I mentioned in my CppCon talk, and the reason Andreia and I spend some of our time coming up with new starvation-free lock algorithms like Tidex:
http://dl.acm.org/citation.cfm?id=2851171
http://concurrencyfreaks.com/2014/12/tidex-mutex.html

The part about thread priorities is a nice demonstration of why lock-free and wait-free algorithms are so important (you don't need to care about thread priorities with lock-free and wait-free):
https://www.youtube.com/watch?v=-vizTDSz8NU&t=28m10s

But my favorite is the fact that System.currentTimeMillis() has happens-before guarantees across cores on the HotSpot JVM, because this guarantee allowed us to make a blazing fast queue (which we will show in a future post):
https://www.youtube.com/watch?v=-vizTDSz8NU&t=14m22s

Learning is fun with Cliff Click!

Thursday, February 2, 2017

Nitsan Wakart on Lock-Free Queues

About a year ago, Nitsan Wakart gave a very nice presentation about lock-free queues.
Here is the link on Vimeo: https://vimeo.com/131688512

Nitsan works for Azul Systems and he has a blog named "Psychosomatic, Lobotomy, Saw".

If you're into non-blocking queues, then make sure not to miss this video, because it has lots of interesting facts.
Some of the topics he covers are:
  • The difficulty in running proper benchmarks for lock-free queues. IMO the hardest to obtain is determinism.
  • What is throughput on a queue? Single-enqueue-single-dequeue, or bursts?
  • Throughput vs Latency.
  • Benchmarking empty queues vs non-empty.
  • Warmup on benchmarks.
  • Memory Bounded (circular array or ring buffer) vs Memory Unbounded (singly-linked list).
  • When can we discard the volatile keyword in Java queues? (hint: very rarely) Even some of the masters have made this mistake.
  • False Sharing.

Friday, January 13, 2017

DoubleLink - A Low-Overhead Lock-Free Queue

In this post we're going to talk about a new lock-free queue that Andreia and I invented.
It's called DoubleLink Queue and it's a memory-unbounded Multi-Producer-Multi-Consumer linearizable lock-free queue.
Example code is in C++ with Hazard Pointers but we have a Java version as well, both available on github:
C++
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/queues/CRDoubleLinkQueue.hpp
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/queues/HazardPointersDL.hpp

Java
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/queues/CRDoubleLinkQueue.java

Before we start, a short intro

There are many, many memory-unbounded linearizable MPMC lock-free queues (and a few wait-free ones as well). However, they are all based on the enqueueing and dequeueing mechanism designed by Maged Michael and Michael Scott, or at least all of the interesting lock-free memory-unbounded MPMC queues are based on it.
The only exception to this that I know of is SimQueue, by Panagiota Fatourou and Nikolaos Kallimanis (more on this later).
This is just to show that although there are many ways to make a queue, there is only one simple way to do it, i.e. there is one easy way to do the enqueueing and dequeueing in a singly-linked list using CAS operations.
... well, actually this isn't true, because the queue we're going to talk about today has a different way of doing it  ;)


Motivation

Why another lock-free queue?
Well, because we can, but mostly, because we were trying to answer the question:
Is it possible to insert a node in a linked list using fewer than 2 CASes?
You see, the Michael-Scott queue (MS from now on) needs at least 2 CAS operations to insert a new node/item: one CAS to link the new node to the last node's "next", and another CAS to advance the tail. For the dequeue, a single CAS is needed to advance the head; however, when using hazard pointers it also needs two seq-cst stores, one to protect the current head node and another to protect the next node, because it is the next node that contains the pointer to the item we're trying to dequeue.
Summing up in a nice table, this is what the MS queue does, as well as all other queues based on the same mechanism:

                 CAS    seq-cst stores (for HP)
    enqueue()     2               1
    dequeue()     1               2


2 links or !2 links, that is the question

The first difference between DoubleLink and the MS queue is that DoubleLink is based on a doubly-linked list, while MS is based on a singly-linked list. And, as strange as it may seem, it's this tiny difference that allows DoubleLink to do an insertion with a single CAS operation.
In MS, a new node is linked to the list in two steps (sketched below):
  1. CAS on the last node's next;
  2. CAS to advance the tail;
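
For contrast, here is a minimal sketch of those two steps (hazard pointers elided; Node holds an item and an atomic next):

// MS-style insertion: two CASes to link newNode at the end of the list
while (true) {
    Node* ltail = tail.load();
    Node* lnext = ltail->next.load();
    if (lnext != nullptr) {
        // the tail is lagging behind: help advance it and retry
        tail.compare_exchange_strong(ltail, lnext);
        continue;
    }
    if (ltail->next.compare_exchange_strong(lnext, newNode)) {  // step 1
        tail.compare_exchange_strong(ltail, newNode);           // step 2
        break;
    }
}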



On DoubleLink this works differently. First, we create a new node whose "prev" pointer points to the current last node on the list. Then we do a CAS on the tail to swing the pointer to this new node, and finally we link the previous last node to the new last node. The steps are:
  1. Create a new node linking its prev to the current last node;
  2. CAS to advance the tail;
  3. Link the previous last node to the current last node;



For those of you paying attention, you may be wondering "How can doing something in three steps be any simpler or faster than doing it in two steps?", and you're right to think that. The simplicity part is arguable, but the trick is that steps 1 and 3 of the DoubleLink queue need no synchronization, i.e. they are done with relaxed stores, while on the MS queue both steps 1 and 2 are a CAS, which is a heavy synchronization instruction.


Code for enqueue()

Let's take a look at the enqueue() code in C++:
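The snippet below is a minimal sketch of the idea rather than the exact listing from CRDoubleLinkQueue.hpp; the Node layout and the hp.protect()/hp.clear() Hazard Pointers names are illustrative:

// Assumed context (illustrative): inside a template<typename T> class,
// with members:
//   std::atomic<Node*> head, tail;  // both start at the same sentinel
//   HazardPointers<Node> hp;        // a generic Hazard Pointers API
struct Node {
    T* item;
    std::atomic<Node*> prev;
    std::atomic<Node*> next;
    Node(T* it) : item{it}, prev{nullptr}, next{nullptr} {}
};

void enqueue(T* item, const int tid) {
    Node* newNode = new Node(item);
    while (true) {
        Node* ltail = hp.protect(kHpTail, tail, tid);
        // Regular (relaxed) store: prev is "immutable after visible"
        newNode->prev.store(ltail, std::memory_order_relaxed);
        // Race (safely) with other enqueuers to make sure the previous
        // node is linked to the current last node. lprev has no hazard
        // pointer: the modified retire() shown further below keeps it alive.
        Node* lprev = ltail->prev.load(std::memory_order_relaxed);
        if (lprev != nullptr && lprev->next.load(std::memory_order_relaxed) != ltail)
            lprev->next.store(ltail, std::memory_order_relaxed);
        // Advance the tail: the only synchronized operation
        if (tail.compare_exchange_strong(ltail, newNode)) {
            // Link the previous last node to the new last node
            ltail->next.store(newNode, std::memory_order_release);
            hp.clear(tid);
            return;
        }
    }
}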

As can be seen in the sketch, the store on the new node's prev is just a regular (relaxed) store, because there are no races on prev: once the node is visible to other threads (with the CAS on the tail), the prev member will never change again, a concept sometimes referred to as "immutable after visible".
Then, we make sure that the previous node is linked to the current last node; this time it is a store on the atomic next member, but a relaxed store is enough.
Afterwards, we advance the tail with a CAS, the only synchronized operation, and finally we link the previous last node to the new last node with a release (or relaxed) store on its next.

This approach of setting the "back" pointer first and then advancing the tail is not new; it was first shown by Treiber, who used it to make the world's first (and simplest) lock-free data structure, a lock-free stack: https://en.wikipedia.org/wiki/Treiber_Stack
The innovation in DoubleLink is that we can use this same trick in a doubly-linked list, and then have the enqueuers do a store race on lprev.next to make sure it is linked to the last node in the list before advancing the tail to the new last node. This race is safe according to the memory model because it is done on an atomic variable.
Personally, I find this approach super simple and very intuitive, but I'm biased  ;)
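
For reference, here is roughly what that pattern looks like in the Treiber stack's push(), sketched with a plain next pointer:

// Treiber stack push: link "back" first with a plain store, then
// publish the new node with a single CAS on top
void push(Node* node) {
    Node* ltop = top.load();
    do {
        node->next = ltop;   // link to the current top, no synchronization
    } while (!top.compare_exchange_weak(ltop, node));  // publish with one CAS
}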


What about SimQueue?

Remember when I said earlier in this post that pretty much all the other queues in this category (memory-unbounded, lock-free, linearizable, MPMC) use the MS algorithm underneath, except SimQueue? Well, a bit of discovery history is in order here:
Andreia and I had the idea for DoubleLink and its implementation for some time, and we had a working implementation before we read the paper on SimQueue, and long before we understood the algorithm used in SimQueue. As it so happens, the enqueue method in SimQueue uses a trick that is not too far from this one, except ours is much simpler.
In SimQueue, the EnqState object has a link to the old tail, the beginning of the list, and the new tail. The first two are called link_a and link_b in the original paper, if I'm not mistaken... anyways, they have multiple threads (dequeuers as well) do the link between the nodes link_a and link_b, which is pretty much the same idea we have of having all the enqueuers (and only the enqueuers) race on lprev->next to set it to the current tail using a relaxed store. One more reason to go back to the post on SimQueue and read it carefully; this is just one of its many hidden gems: http://concurrencyfreaks.com/2016/12/sim-queue-in-java-part-12.html
Their idea is more complex than what we show in DoubleLink because they need it to be wait-free while we were aiming just for lock-free, but I want to make it clear that we didn't copy their idea: we didn't even know about it, and it's not the same thing because ours is simpler, and simplicity is elegance. It's far easier to make something more complex with extra functionality than it is to make something simpler with the same or better functionality ;)


Code for dequeue()

The dequeue() method is pretty much the same as in the MS queue, so nothing new to see here. The only difference is that we traverse a doubly-linked list instead of a singly-linked list:
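A minimal sketch again, with the same Node and Hazard Pointers names assumed in the enqueue() sketch above:

T* dequeue(const int tid) {
    while (true) {
        Node* lhead = hp.protect(kHpHead, head, tid);
        if (lhead == tail.load()) {
            hp.clear(tid);
            return nullptr;              // empty queue
        }
        // lnext is dereferenced without a hazard pointer of its own: the
        // modified retire() will not delete it while our hazard pointer
        // is on lhead, which is lnext's prev
        Node* lnext = lhead->next.load();
        if (lnext == nullptr) continue;  // link not yet visible: retry
        if (head.compare_exchange_strong(lhead, lnext)) {
            T* item = lnext->item;
            hp.clear(tid);
            hp.retire(lhead, tid);
            return item;
        }
    }
}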



There is one last trick in DoubleLink, and it's related to memory reclamation

Unless you're in Java or something with a Garbage Collector, you need some memory reclamation technique if you want to use MS or DoubleLink, or any other memory-unbounded lock-free queue.
DoubleLink is a lock-free queue and, as such, it needs a lock-free (or wait-free) memory reclamation technique. This rules out things like Epoch-based reclamation, RCU and Userspace RCU, and pretty much leaves pointer-based techniques like Hazard Pointers. Yes, there are other lock-free (pointer-based) memory reclamation techniques, but they are usually more complex to deploy, and since we're aiming for simplicity, let's stick with HP for now.
As we saw at the beginning of the post, the dequeue() in MS needs two sequentially consistent stores for the Hazard Pointers, but it would be nice if we could do it with just one... and indeed it is possible, but not for MS.
The trick is that we can make a variant of HP where the retire() scans not just for the node it is attempting to delete, but also for threads using the prev and next of that node.

In our variant of HP, we scan the other threads' hazard pointers for a match on the node we want to delete, and we also scan for a match on the next and prev of that node.
If another thread is using the previous node and attempting a dequeue() operation, it may dereference that node's next (the lhead->next dereference in dequeue()), which is the node we're trying to delete, and therefore we can not delete it yet. If another thread is using the next node and attempting an enqueue() operation, it may dereference that node's prev (the store on lprev->next in enqueue()), which is the node we're trying to delete, and therefore we can not delete it yet.

Each thread looks for usages of the pointers in the next and prev of the nodes in its own retired list, and it does not dereference the next or prev of the nodes in the list/array of hazardous pointers; doing so would be incorrect and could lead to a crash.
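
Here is a sketch of what the scan of this variant could look like, assuming each thread publishes its hazard pointers in a shared hazardPtrs array and keeps a private retiredList (all names are illustrative):

// Assumed members (illustrative):
//   std::atomic<Node*> hazardPtrs[maxThreads][kMaxHPs];
//   std::vector<Node*> retiredList[maxThreads];
void scanAndDelete(const int tid) {
    auto& rlist = retiredList[tid];
    for (auto it = rlist.begin(); it != rlist.end();) {
        Node* node = *it;
        // Safe to read node->prev and node->next: node is in OUR retired
        // list, so no other thread can delete it. We never dereference
        // the nodes published in other threads' hazard pointers.
        Node* nprev = node->prev.load();
        Node* nnext = node->next.load();
        bool inUse = false;
        for (int i = 0; i < maxThreads && !inUse; i++) {
            for (int ihp = 0; ihp < kMaxHPs && !inUse; ihp++) {
                Node* hptr = hazardPtrs[i][ihp].load();
                // A match on node keeps it alive, as in standard HP. A match
                // on node->prev keeps it alive because that thread may still
                // reach node in dequeue() (through lhead->next), and a match
                // on node->next keeps it alive because that thread may still
                // reach node in enqueue() (through ltail->prev).
                if (hptr == node || hptr == nprev || hptr == nnext) inUse = true;
            }
        }
        if (inUse) ++it;
        else { it = rlist.erase(it); delete node; }
    }
}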

This simple trick of scanning for occurrences of node.prev to protect lnext in dequeue(), and scanning for node.next to protect the lprev pointer in enqueue(), saves one seq-cst store in each method. Unfortunately there is no gain in applying this HP variant to the MS queue, because its enqueue() already requires only one seq-cst store (only the tail node needs to be protected), and in its dequeue() the next node can not be protected this way because in a singly-linked list there is no way of knowing which node is the previous one.

The whole reason for going through this trouble is that a seq-cst store is typically a heavyweight synchronization operation; for example, on x86 it implies an MFENCE instruction, and therefore any reduction in the number of seq-cst stores will show up as a throughput gain.



Comparison

The comparison table of MS and DoubleLink is shown below, and as you can see, DoubleLink does fewer synchronized operations for both enqueueing and dequeueing:

                 MS queue                    DoubleLink
    enqueue()    2 CAS + 1 seq-cst store     1 CAS + 1 seq-cst store
    dequeue()    1 CAS + 2 seq-cst stores    1 CAS + 1 seq-cst store


Unfortunately, this reduction in synchronized instructions doesn't translate into large throughput gains, because when uncontended most of the throughput is determined by cache locality, and when highly contended it is determined by the contention itself... it's all about the cache.
Anyways, we ran some benchmarks comparing MS and DoubleLink.

The throughput gains go up to 30%, but not much more than that. Still, it's a good improvement considering that we didn't make the code more complex.
The main drawback when compared to the MS queue is that each node in DoubleLink needs two pointers, prev and next, while in MS each node needs just the next, which means DoubleLink uses more memory for a non-empty queue. This isn't important in the land of plentiful RAM, like servers nowadays, but in mobile and IoT land it can be a bit annoying... nobody's perfect.


30% may not seem like a lot, but when you take into consideration that you can now take any of the other lock-free or wait-free queues and replace the MS algorithm with DoubleLink to get an extra throughput gain, it becomes interesting.
In the end, we set out to see if it was possible to make a lock-free memory-unbounded queue using a single synchronized instruction for the insertion and a single synchronized instruction to protect the node, and it is! I would say this is pretty hard to beat, but I know for a fact that Andreia has a queue that beats this, and on top of that it is wait-free, so yes, we can do better (with other limitations though)... a topic for a future post.
In the meantime, we hope you enjoy the DoubleLink algorithm and use it to make even better queues!

You can get a short academic paper on the DoubleLink Queue here:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/doublelink-2016.pdf