Friday, February 17, 2017

Two posters at PPoPP 2016

We got two posters accepted at PPoPP 2016, yupiii!!!

One was for "Poor Man's URCU" which we talked about on a post long ago.
http://dl.acm.org/citation.cfm?id=3019021
It's a very simple way to implement a Userspace RCU using reader-writer locks, in a lock-free way. Yes, you can use the trylock() API in reader-writer locks to build lock-free constructs, with URCU being an example.

The other one was the Turn Queue, the world's fastest (so far) wait-free linearizable MPMC queue with wait-free memory reclamation, which we talked about on a post a couple months ago.
http://dl.acm.org/citation.cfm?id=3019022
The full paper is called "A Wait-Free Queue with Wait-Free Memory Reclamation" and you can get a draft here if you want to.

Tuesday, February 7, 2017

Cliff Click - A JVM Does That ???

There are a few people that are really knowleadgeable about Java, and Cliff Click is certainly one of them. One of his latest talks was at Devoxx and it's named "A JVM Does That???". Youtube link below.




This talk is so rich with content that it's hard to explain what it is about, so I'm just going to mention my favorite parts, but please watch it all if you haven't done so yet.

The explanation on the endless loop of GC cycles when have a cache (data structure) near the limit is just divine:
https://www.youtube.com/watch?v=-vizTDSz8NU&t=24m40s

Or how about the illusion that locks are fast (and fair)
https://www.youtube.com/watch?v=-vizTDSz8NU&t=12m55s
hence the importance of starvation-free locks that I mentioned in my CppCon talk, and the reason behind Andreia and I spending some of our time coming up with new starvation-free lock algorithms like Tidex:
http://dl.acm.org/citation.cfm?id=2851171
http://concurrencyfreaks.com/2014/12/tidex-mutex.html

The part about thread priorities is a nice demonstration of why lock-free and wait-free algorithms are so important (you don't need to care about thread priorities with lock-free and wait-free):
https://www.youtube.com/watch?v=-vizTDSz8NU&t=28m10s

But my favorite is the fact that System.currentTimeMillis() has happens-before guarantees across cores on the HotSpot JVM, because this guarantee allowed us to make a blazing fast queue (which we will show in a future post):
https://www.youtube.com/watch?v=-vizTDSz8NU&t=14m22s

Learning is fun with Cliff Click!

Thursday, February 2, 2017

Nitsan Wakart on Lock-Free Queues

About a year ago Nitsan Wakart gave a very nice presentation about lock-free queues
Here is the link to vimeo:  https://vimeo.com/131688512

Nitsan works for Azul Systems and he has a blog named Psy-Lobotomy-Saw.

If you're into non-blocking queues then make sure not to miss out this video because it has lots of interesting facts.
Some of the topics he covers are:
  • The difficulty in running proper benchmarks for lock-free queues. IMO the hardest to obtain is determinism.
  • What is throughput on a queue? single-enqueue-single-dequeue or bursts?
  • Throughput vs Latency.
  • Benchmarking empty queues vs non-empty.
  • Warmup on benchmarks.
  • Memory Bounded (circular array or ring buffer) vs Memory Unbounded (singly-linked list).
  • When can we discard the volatile keyword in Java Queues (hint: very rarely), and even some of the masters have made this mistake.
  • False Sharing.

Friday, January 13, 2017

DoubleLink - A Low-Overhead Lock-Free Queue

On this post we're going to talk about a new lock-free queue that Andreia and I invented.
It's called DoubleLink Queue and it's a memory-unbounded Multi-Producer-Multi-Consumer linearizable lock-free queue.
Example code is in C++ with Hazard Pointers but we have a Java version as well, both available on github:
C++
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/queues/CRDoubleLinkQueue.hpp
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/queues/HazardPointersDL.hpp

Java
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/queues/CRDoubleLinkQueue.java

Before we start, a short intro

There are many, many, memory unbounded linearizable MPMC lock-free (and a few wait-free as well). However, they are all based on the enqueueing and dequeueing mechanism designed by Maged Michael and Michael Scott, or at least all of the interesting lock-free memory-unbounded MPMC queues are based on it.
The only exception to this that I know of, is SimQueue by Panagiota Fatourou and Nikolaos Kallimanis (more on this later).
This is just to show that although there are many ways to make queues, there is only one simple way to do it, i.e. there is one easy way to do enqueueing and dequeueing in a singly-linked list using CAS operations.
... well, actually this isn't true, because the queue we're going to talk about today has a different way of doing it  ;)


Motivation

Why another lock-free queue?
Well, because we can, but mostly, because we were trying to answer the question:
Is it possible to insert a node in a linked list using less than 2 CAS ?
You see, the Michael-Scott queue (MS from now on) needs at least 2 CAS operations to insert a new node/item: one CAS to link the node to the last node's "next", and another CAS to advance the tail. For the dequeue, a single CAS is needed which advances the head, however, it needs two seq-cst stores when using hazard pointers, one to protect the current head node and another to protect the next node, because it's the next node that contains the pointer to the item we're trying to dequeue.
Summing up in a nice table, this is what the MS queue does, as well as all other queues based on the same mechanism:




2 links or !2 links, that is the question

The first difference between DoubleLink and MS queue is that DoubleLink is based on a doubly-linked list, while MS is based on a singly linked list. And, as strange as it may seem, its this tiny difference that allows it to do an insertion with a single CAS operation.
In MS, a new node is linked to the list in two steps:
  1. CAS on the last node's next;
  2. CAS to advance the tail;



On DoubleLink this work differently. First, we create a new node whose "prev" pointer is pointing to the current last node on the list. Then we do a CAS on the tail to swing the pointer to this new node, and finally we link the previous last node to the current last node. The steps are:
  1. Create a new node linking its prev to the current last node;
  2. CAS to advance the tail;
  3. Link the previous last node to the current last node;



For those of you paying attention, you may be wondering "How can doing something in three steps be any simpler or faster than doing it in two steps?", and you're right in thinking that. The simplicity part is arguable, but the trick here is that steps 1 and 3 of DoubleLink queue have no synchronization, i.e. they are done with relaxed stores, while on the MS queue both steps 1 and 2 have a CAS which is a heavy synchronization instruction.


Code for enqueue()

Let's take a look at the enqueue() code in C++:

As can be seen in line 7, the store on the new node's prev is just a regular store because there are no races on prev: once the node is visible to other threads (with the CAS on the tail) the prev member will never change again, a concept sometimes referred to as "immutable after visible".
Then, in line 9 we make sure that the previous node is linked to the current last node, this time it's an atomic store in the next member, but it's a relaxed store.
Afterwards, in line 10 we advance the tail with a CAS, the only synchronized operation, and we link the previous node to the current last node using a release (or relaxed) store (line 11).

This kind of approach of setting the pointer "back" first and then advancing the tail is not new, and was first shown by Treiber, which used it to make the world's first (and simplest) lock-free data structure, namely, a lock-free stack: https://en.wikipedia.org/wiki/Treiber_Stack
The innovation in DoubleLink is that we can use this same trick in a doubly-linked list, and then have the enqueuers do a store race on lprev.next to make sure it is linked to the last node in the list before advancing the tail to the new last node. This race is safe according to the memory model because it is done on an atomic variable.
Personally, I find this approach super simple and very intuitive, but I'm biased  ;)


What about SimQueue?

Remember when I said earlier on this post that pretty much all other queues in this category (memory-unbounded, lock-free, linearizable, MPMC) use the MS algorithm underneath, except SimQueue? Well, a bit of history of discovery is in order here:
Andreia and I have had the idea for DoubleLink and its implementation for some time now, and we had a working implementation before we read the paper on the SimQueue, and long before we understood the algorithm used in the SimQueue. As it so happens, the enqueue method in SimQueue uses a trick that is not too far from this one, except ours is way more simple.
In SimQueue, the EnqState object has a link to the old tail, the beginning of the list, and the new tail. The first two are called link_a and link_b in the original paper if I'm not mistaken... anyways, they have multiple threads (dequeuers as well) do the link between the nodes link_a and link_b, which is pretty much the same idea we have for having all the enqueuers (and only the enqueuers) race on lprev->next to set it to the current tail using a relaxed store. One more reason to go back to the post on SimQueue and read it carefully, it's just one of its many hidden gems: http://concurrencyfreaks.com/2016/12/sim-queue-in-java-part-12.html
Their idea is way more complex than what we show in DoubleLink because they need it to be wait-free and we were just aiming for lock-free, but I just want to make it clear that we didn't copy their idea because we didn't even know about it, and it's not the same thing because ours is simpler, and simplicity is elegance. It's far easier to make something more complex with extra functionality than it is to make something simpler with the same or better functionality  ;)


Code for dequeue()

The dequeue method is pretty much the same as MS queue, so nothing new to see here. The only difference is that we have a doubly-linked list instead of a singly linked list:



There is one last trick in DoubleLink, and it's related to memory reclamation

Unless you're in Java or something with a Garbage Collector, you need some memory reclamation technique if you want to use MS or DoubleLink, or any other memory-unbounded lock-free queue.
DoubleLink is a lock-free queue and as such, it needs a lock-free (or wait-free) memory reclamation technique. This eliminates things like Epoch-based reclamation, RCU and Userspace RCU, and pretty much leaves pointer-based techniques like Hazard Pointers. Yes, there are other lock-free (pointer-bsaed) memory reclamation techniques, but they are usually more complex to deploy and we're aiming for simplicity, so lets stick to HP for now.
As we saw on the beginning of the post, the dequeue() in MS needs two sequentially consistent stores for the Hazard Pointers, but it would be nice if we could do it using just one... and indeed it is possible, but not for MS.
The trick is that we can make a variant of HP where the retire() scans not just for the current node that is attempting to delete but also for threads using the prev and next objects of the node.

In our variant of HP, we scan the other thread's hazard pointers for a match of the node we want to delete and we scan also for the next and prev of the node we want to delete.
If another thread is using the previous node and attempting a dequeue() operation, it may dereference the next node (line 4 of dequeue()), which is the node we're trying to delete, and therefore, we can not delete it yet. If another thread is using the next node and attempting an enqueue() operation, it may dereference the previous node (line 9 of enqueue()), which is the node we're trying to delete, and therefore, we can not delete it yet.

Each thread is looking for usages of the pointers in the next and prev of the nodes in its own retired list, and does not dereference the next or prev of the nodes in the list/array of hazardous pointers. Such behavior would be incorrect and lead to a crash.

This simple trick of scanning for occurrences of node.prev to protect lnext in dequeue(), and scanning for node.next to protect the lprev pointer in enqueue(), reduces one seq-cst store in each method. Unfortunately there is no gain in applying this HP variant to the MS queue because its enqueue() already requires one seq-cst store (only the tail node needs to be protected), and for the dequeue() the next node can not be protected because there is no way of knowing what is the previous node.

The whole logic of going through this trouble is because a seq-cst store is typically a heavyweight synchronization operation, for example, on x86 it implies one MFENCE instruction, and therefore, any reduction in the amount of seq-cst stores will reflect itself in a throughput gain.



Comparison

The comparison table of MS and DoubleLink is shown below and as you can see, DoubleLink does less synchronized operations for both enqueueing and dequeueing:




Unfortunately, this reduction in synchronized instructions doesn't reflect itself in large throughput gains because most throughput when uncontended is related to cache locality, and when highly contented is related to the contention... it's all about the cache.
Anyways, here are some comparison plots between MS and DoubleLink:






The throughput gains go up to 30% but not much more than this. Still, a good improvement having into consideration that we didn't make the code more complex.
The main drawback when compared to MS queue is that each node in DoubleLink needs two pointers: the prev and the next, while on MS needs just the next, which means DoubleLink uses more memory for a non-empty queue. This isn't important in the land of plenty of RAM like servers nowadays, but in mobile and IoT land it can be a bit annoying... nobody's perfect.


30% may not seem like a lot, but when you take into consideration that now you can go to any of the other lock-free or wait-free queue and replace the MS algorithm with DoubleLink to get an extra throughput gain, then it becomes interesting.
In the end, we set out to see if it was possible to make a lock-free memory unbounded queue using a single synchronized instruction for the insertion and a single synchronized instruction to protect the node, and it is! I would say this is pretty hard to beat, but I know for a fact Andreia has a queue that beats this, and on top of it is wait-free, so yes, we can do better (with other limitations though).... a topic for a future post.
In the meantime, we hope you enjoy the DoubleLink algorithm and use it to make even better queues!

You can get a short academic paper on the DoubleLink Queue here:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/doublelink-2016.pdf