Monday, February 27, 2017

BitNext - A Lock-Free queue

This post is about BitNext, a new lock-free queue that Andreia and I designed, based on the lock-free linked list by Tim Harris.

Before we start, if you don't know what Tim Harris's lock-free linked list is, then go and read his paper:
https://timharris.uk/papers/2001-disc.pdf

Done?
Ok, so BitNext is an MPMC queue, with lock-free progress for enqueue() and dequeue(), linearizable consistency, and memory unbounded (node based).
It is easy to add Hazard Pointers to it and the C++ code we're going to show has them embedded.
Here is the link to our paper on BitNext:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/bitnext-2016.pdf

If you're a Java coder then sorry, but this post won't be of much use to you, not that BitNext is not implementable in Java, it's just that similarly to Harris's linked list, it requires some modifications to the algorithm to get it working in systems with a GC (like Java, Scala, D, etc). Such modifications are based on RTTI or AtomicMarkableReference and would take an entire post on itself, but if you're interested, you can take a look at this post http://concurrencyfreaks.blogspot.com/2014/03/harris-with-atomicmarkablereference-add.html


Before removing a node/key from Harris's list, we need to "mark" the next pointer by setting a bit in it. Touching bits in references is a no-no in GC systems and that's why this is a "native-only" approach.
The main trick in the Harris list is that setting this pointer will cause an insertion on the following node to have its CAS fail.
Harris list allows insertions to occurs anywhere on the list, and in a lock-free way. This is better than the Michael-Scott list which only allows insertions at the end (tail) of the list, thus creating a lot of contention on the tail node and tail pointer.
The thing about the Michael-Scott algorithm is that it was designed to be a queue, so we started wondering whether it would be possible to make a queue with Harris list as well. Obviously, it is possible, by always inserting a new node at the tail of the list, but the real question was:

"Can we insert new nodes not after the last node and still maintain FIFO order with linearizable consistency"?
and the answer is, yes, using the BitNext algorithm.

In BitNext, dequeuers start by advancing head only if its next is marked. If next is not marked, the dequeuer will attempt to mark the bit with a CAS.
A node with a marked bit is said to be logically dequeued. If the CAS to set the marked bit on node.next fails, it means another dequeuer was successful, or an enqueuer inserted a new node immediately after node.
Here is what the dequeue() looks like in C++:

For enqueuers there are two scenarios, both starting with the enqueuer reading the current tail into the local variable ltail.
When the ltail is the last node on the list, ltail.next is null, then, similarly to the Michael-Scott queue, the enqueuer will attempt to insert its node by doing a CAS on ltail.next from null to its node.
When the ltail is not the last node on the list, the enqueuer will advance the tail to the next node, ltail.next, and then attempt to insert its node in ltail.next with a CAS.
This attempt will be done at most maxThreads-1 times, after which, either the CAS is successful, or the ltail node has been logically dequeued, thus disallowing an insertion in ltail.next. If the enqueuer were to insert a node in ltail.next even if it had a marked bit, this would provide linearizable behavior as long as the next node had not been logically dequeued. Given that we can't provide such a guarantee, we forbid the enqueuer from inserting a node after a logically dequeued node, unless node.next is null, in which case there is no node after it.
Here is what the enqueue() looks like in C++:

Let's walk through the enqueue() code to understand a bit better.
We start by protecting the node we read as tail with an hazard pointer (lines 5 and 6).
Then, we advance the tail by one node if it is not the last node (lines 8 and 9). Notice that we advance by a single node even though the tail may be lagging behind multiple nodes (at most maxThreads-1 nodes behind).
If tail is already the last node, then we use an algorithm similar to Michael-Scott queue insertion (lines 12 and 21).
Then, it's a matter of inserting our node after the node we saw as tail (line 29). Notice that ltail may not be the last node on the list, but that's ok because if this node was already dequeued then it would have its "next" with a marked bit, and the insertion would fail. Because the only way to observer order on the queue is to dequeue items from it, and because the dequeue always marks the bit on next to dequeue the item/node, we know that that an insertion in the middle of the list is only successful if the node's next is not marked, and therefore, it's ok to insert in the middle.
This is however not a sufficient condition to have linearizability: we must guarantee that calls to enqueue() which do not overlap in time (i.e. have an happens-before sequence) have a corresponding global order, and that is why each enqueuer reads the tail and advances it by one (if it is not already the last).

It's a funny trick but it works really well.
The end result is that we get better throughput and better (lower) latency in the tail of the distribution when the queue is under high contention, mostly because BitNext has a natural tendency to spread contention over multiple nodes, instead of all threads hammering on the last node attempting to insert their node.

One the next post we will look at a variant of BitNext which relaxes the advance of the head, improving the dequeue() as well.
For now, here is the C++ code for BitNext:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/queues/BitNextQueue.hp

Friday, February 17, 2017

Two posters at PPoPP 2016

We got two posters accepted at PPoPP 2016, yupiii!!!

One was for "Poor Man's URCU" which we talked about on a post long ago.
http://dl.acm.org/citation.cfm?id=3019021
It's a very simple way to implement a Userspace RCU using reader-writer locks, in a lock-free way. Yes, you can use the trylock() API in reader-writer locks to build lock-free constructs, with URCU being an example.

The other one was the Turn Queue, the world's fastest (so far) wait-free linearizable MPMC queue with wait-free memory reclamation, which we talked about on a post a couple months ago.
http://dl.acm.org/citation.cfm?id=3019022
The full paper is called "A Wait-Free Queue with Wait-Free Memory Reclamation" and you can get a draft here if you want to.

Tuesday, February 7, 2017

Cliff Click - A JVM Does That ???

There are a few people that are really knowleadgeable about Java, and Cliff Click is certainly one of them. One of his latest talks was at Devoxx and it's named "A JVM Does That???". Youtube link below.




This talk is so rich with content that it's hard to explain what it is about, so I'm just going to mention my favorite parts, but please watch it all if you haven't done so yet.

The explanation on the endless loop of GC cycles when have a cache (data structure) near the limit is just divine:
https://www.youtube.com/watch?v=-vizTDSz8NU&t=24m40s

Or how about the illusion that locks are fast (and fair)
https://www.youtube.com/watch?v=-vizTDSz8NU&t=12m55s
hence the importance of starvation-free locks that I mentioned in my CppCon talk, and the reason behind Andreia and I spending some of our time coming up with new starvation-free lock algorithms like Tidex:
http://dl.acm.org/citation.cfm?id=2851171
http://concurrencyfreaks.com/2014/12/tidex-mutex.html

The part about thread priorities is a nice demonstration of why lock-free and wait-free algorithms are so important (you don't need to care about thread priorities with lock-free and wait-free):
https://www.youtube.com/watch?v=-vizTDSz8NU&t=28m10s

But my favorite is the fact that System.currentTimeMillis() has happens-before guarantees across cores on the HotSpot JVM, because this guarantee allowed us to make a blazing fast queue (which we will show in a future post):
https://www.youtube.com/watch?v=-vizTDSz8NU&t=14m22s

Learning is fun with Cliff Click!

Thursday, February 2, 2017

Nitsan Wakart on Lock-Free Queues

About a year ago Nitsan Wakart gave a very nice presentation about lock-free queues
Here is the link to vimeo:  https://vimeo.com/131688512

Nitsan works for Azul Systems and he has a blog named Psy-Lobotomy-Saw.

If you're into non-blocking queues then make sure not to miss out this video because it has lots of interesting facts.
Some of the topics he covers are:
  • The difficulty in running proper benchmarks for lock-free queues. IMO the hardest to obtain is determinism.
  • What is throughput on a queue? single-enqueue-single-dequeue or bursts?
  • Throughput vs Latency.
  • Benchmarking empty queues vs non-empty.
  • Warmup on benchmarks.
  • Memory Bounded (circular array or ring buffer) vs Memory Unbounded (singly-linked list).
  • When can we discard the volatile keyword in Java Queues (hint: very rarely), and even some of the masters have made this mistake.
  • False Sharing.