
Monday, November 14, 2016

The Encapsulator Queue (part 3)

In the previous two posts we talked about The Encapsulator queue, a concurrent queue that is wait-free for enqueues and lock-free for dequeues:
http://concurrencyfreaks.blogspot.com/2016/11/the-encapsulator-queue-wait-freelock.html
http://concurrencyfreaks.blogspot.com/2016/11/the-encapsulator-queue-part-2.html
This post will complete the series with some microbenchmarks made in Java.

Benchmarking queues is not an easy task, and many have ranted about bad benchmarking, so let's hope the benchmarks in this post aren't too bad.
Just keep in mind that this is Java, and in Java the GC can kick in at unexpected times and distort the results. Even worse, it may not run during the benchmark at all, and only when the benchmark ends does it finally clean up all the garbage objects created by a queue, which distorts the results by giving an unfair advantage to data structures that create a lot of temporary objects.
That's why we like C++ way better for doing this kind of stuff, but we couldn't be bothered to implement The Encapsulator in C++ with memory reclamation, so Java it is  :)

These benchmarks are very similar to the ones described in our paper on the wait-free CRTurn queue.
We start with the usual single-enqueue-single-dequeue benchmark, where each thread alternately does an enqueue followed by a dequeue.
This benchmark is simple to implement and is surprisingly deterministic. I say "surprisingly" because at any given time we can have all the threads doing enqueues, all the threads doing dequeues, half doing enqueues while the other half does dequeues, just one doing enqueues and N-1 doing dequeues, and so on. All these combinations create different contention on the head/tail (and possibly other) variables of the queue, which can produce varying results, but we do 5 runs and choose the median of those 5 to display, and the results are stable.
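For reference, here is a minimal sketch of what such a benchmark loop can look like in Java. The queue (a ConcurrentLinkedQueue stand-in), the thread count, and the iteration count are placeholders, not our exact harness:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

public class SingleEnqueueSingleDequeueBench {
    static final int NUM_THREADS = 8;          // hypothetical thread count
    static final int ITERATIONS = 1_000_000;   // pairs of operations per thread

    public static void main(String[] args) throws Exception {
        final Queue<Integer> queue = new ConcurrentLinkedQueue<>(); // stand-in for any MPMC queue
        final CyclicBarrier barrier = new CyclicBarrier(NUM_THREADS + 1);
        Thread[] workers = new Thread[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; i++) {
            workers[i] = new Thread(() -> {
                try { barrier.await(); } catch (Exception e) { throw new RuntimeException(e); }
                for (int iter = 0; iter < ITERATIONS; iter++) {
                    queue.offer(iter);   // enqueue one item...
                    queue.poll();        // ...then immediately dequeue one
                }
            });
            workers[i].start();
        }
        barrier.await();                       // release all threads at (mostly) the same time
        long start = System.nanoTime();
        for (Thread t : workers) t.join();
        long ns = System.nanoTime() - start;
        System.out.println("ops/sec = " + (2L * NUM_THREADS * ITERATIONS * 1_000_000_000L) / ns);
    }
}
```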
Here is the plot comparing the Michael-Scott (lock-free) queue, the CRTurn (wait-free) queue, and The Encapsulator:

The plot on the right is the same as the one on the left, but normalized with Michael-Scott as the baseline.
As you can see, The Encapsulator has some overhead, probably from creating an array for each node, but it scales, and at 8 threads it starts beating the Michael-Scott queue. It beats the CRTurn queue in every scenario, but keep in mind that it creates a lot of trash for the GC to clean up (one array per node plus all the Encap objects) and that dequeue() in The Encapsulator is lock-free, while the CRTurn queue is wait-free for both enqueues and dequeues.



The other set of microbenchmarks is made of bursts: one burst of enqueues, then one burst of dequeues, then another burst of enqueues, and so on.
This is a very deterministic benchmark, as long as we can start all threads at mostly the same time, or the burst is large enough to dampen the effect of the threads not all starting at exactly the same instant. The burst has to be large enough to avoid the situation where, by the time the last thread starts, the first one has already finished, which wouldn't be a proper representation of the contention on the queue.
For this reason, we chose bursts of 1 million operations, i.e. one million enqueues or one million dequeues.
To keep memory allocation and cache effects comparable across runs, we divide the burst by the number of threads of the current data point. This means that a single thread does all 1M enqueues, but with 32 threads each thread does 1M/32 = 31250 enqueues.
At the end of each burst the threads wait for each other before starting the next one, with a burst of dequeues following a burst of enqueues and vice versa.
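Here is a similarly hedged sketch of the burst loop, with a CyclicBarrier separating the bursts; the sizes and the stand-in queue are again placeholders:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

public class BurstBench {
    static final int NUM_THREADS = 32;           // hypothetical thread count
    static final int BURST = 1_000_000;          // total operations per burst
    static final int NUM_BURST_PAIRS = 10;       // enqueue burst + dequeue burst pairs

    public static void main(String[] args) throws Exception {
        final Queue<Integer> queue = new ConcurrentLinkedQueue<>(); // stand-in for any MPMC queue
        final CyclicBarrier barrier = new CyclicBarrier(NUM_THREADS);
        final int opsPerThread = BURST / NUM_THREADS; // e.g. 1M/32 = 31250
        Thread[] workers = new Thread[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; i++) {
            workers[i] = new Thread(() -> {
                try {
                    for (int b = 0; b < NUM_BURST_PAIRS; b++) {
                        barrier.await();                      // all threads start the enqueue burst together
                        for (int op = 0; op < opsPerThread; op++) queue.offer(op);
                        barrier.await();                      // wait for everyone before switching to dequeues
                        for (int op = 0; op < opsPerThread; op++) queue.poll();
                    }
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
    }
}
```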

Here are the plots for the enqueues with the number of operations and the ratio to Michael-Scott:


and the plots for dequeues:



With one thread, Michael-Scott has a big advantage, but as soon as contention increases, The Encapsulator starts to pull ahead.

That's all on The Encapsulator. In the next post we will cover a new lock-free queue.

Tuesday, November 8, 2016

The Encapsulator Queue (part 2)

In the previous post we talked about The Encapsulator, an MPMC queue that is wait-free for enqueue() and lock-free for dequeue(), and in this post we're going to cover it in a bit more detail.

In The Encapsulator there is one Encap object for each item. Each node has one or more pointers to Encap objects:


There is an items array in each node with up to maxThreads entries, pointing to different Encap objects:


Although on a per-node basis each entry in items is unique (points to a different Encap), different nodes can point to the same Encap objects, i.e. there are "repetitions":


Dequeuing an item consists of doing a CAS on the pointer inside the Encap. This means that if one thread goes through nodeA and dequeues z, another thread going through nodeC will either see item z as already removed or fail its own CAS attempt to dequeue it.
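To make this concrete, here is a rough sketch of what the Encap and node structures and the CAS-based logical removal can look like. The field layout and helper names are our illustration, assuming a fixed maxThreads, and not necessarily the exact implementation:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

// One Encap per item: dequeuing does a CAS on 'item' from non-null to null,
// so even if several nodes reference this Encap, only one dequeuer can win.
class Encap<E> {
    final AtomicReference<E> item;
    Encap(E item) { this.item = new AtomicReference<>(item); }
}

// Each node carries up to maxThreads pointers to Encap objects.
class Node<E> {
    final AtomicReferenceArray<Encap<E>> items;
    final AtomicReference<Node<E>> next = new AtomicReference<>(null);
    Node(int maxThreads) { this.items = new AtomicReferenceArray<>(maxThreads); }
}

// Logical removal: scan the items of the current node and try to take one.
// Returns the dequeued item, or null if every Encap in this node was taken.
class DequeueSketch {
    static <E> E tryDequeueFrom(Node<E> node) {
        for (int i = 0; i < node.items.length(); i++) {
            Encap<E> encap = node.items.get(i);
            if (encap == null) continue;
            E item = encap.item.get();
            // CAS from the item to null: succeeds for exactly one dequeuer,
            // no matter how many nodes still point to this Encap.
            if (item != null && encap.item.compareAndSet(item, null)) return item;
        }
        return null; // all Encaps in this node were already dequeued elsewhere
    }
}
```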



One of the main issues with making a (single-list based) queue that is wait-free for enqueues is for every thread to know whether a certain item has already been enqueued (is the request open or closed?). Different queue algorithms solve this particular problem in different ways.
In the CRTurn queue and the queue by Kogan and Petrank, this is solved in a "lock-step" way, where each operation is done one step at a time to make sure it becomes visible to all threads. The SimQueue by Fatourou and Kallimanis uses a super cool technique that takes a snapshot of the requests plus one extra bit that is XOR'ed... a non-trivial but really smart approach.
In The Encapsulator we solve this problem by allowing an item to be enqueued multiple times, i.e. when in doubt, enqueue it again.
Normally this would be a big problem: the same item could be inserted multiple times (duplicate entries in the queue), which would mean it is no longer a correct FIFO queue. Thanks to the Encap, however, we guarantee that each item is dequeued once and only once.
There may be different nodes holding a pointer to the same Encap object, but once the item is dequeued, this becomes visible to all dequeuers, and one and only one of them will execute a successful CAS that logically removes the item from the queue.
Funnily enough, it's always the first node that does it, but it could be done from a different node and it would still be linearizable... muahahah
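As a hedged illustration of the "when in doubt, enqueue it again" idea, here is a sketch that builds on the Encap/Node classes above. It assumes a shared enqueuers[] array where each thread publishes its pending Encap; this is our simplification of the real algorithm, not the actual enqueue() code:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch only: each enqueuer publishes its Encap in enqueuers[tid], and any
// thread that creates a new node copies all currently published Encaps into
// node.items. An Encap may end up in several nodes, but the CAS-to-null in
// the dequeue makes each item removable exactly once.
class EnqueueSketch<E> {
    final int maxThreads;
    final AtomicReferenceArray<Encap<E>> enqueuers;

    EnqueueSketch(int maxThreads) {
        this.maxThreads = maxThreads;
        this.enqueuers = new AtomicReferenceArray<>(maxThreads);
    }

    Node<E> makeNodeWithOpenRequests() {
        Node<E> node = new Node<>(maxThreads);
        for (int i = 0; i < maxThreads; i++) {
            Encap<E> encap = enqueuers.get(i);
            // We may not know whether this request was already satisfied by a
            // previous node; if in doubt, add it again. Duplicates are harmless
            // because only one CAS on encap.item can ever succeed.
            if (encap != null && encap.item.get() != null) node.items.set(i, encap);
        }
        return node;
    }
}
```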

Every rose has its thorns, and The Encapsulator is no different: by allowing an item to be enqueued multiple times, we use extra (wasted) memory in the items arrays, and in the worst-case scenario (however unlikely) we could end up with a very large number of nodes holding duplicate pointers to Encaps that have long been dequeued.

There is a way to overcome this: the trick is to give each thread an array where it stores the last entry it has seen in enqueuers[i] for every other thread. If the entry is still the same on the next call to enqueue(), it will not be placed in the node.items[] array.
Remember that the Encap objects are unique, so we can distinguish them even if they encapsulate the same object, i.e. The Encapsulator correctly handles the same item being enqueued by multiple threads.
With this per-thread array we create a bound on the number of times a given Encap can be added to a node in the queue: the bound is maxThreads, reached if each thread adds it once.
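A minimal sketch of that filter, extending the enqueue sketch above with a per-thread lastSeen array (again our naming, not the original code):

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch only: each thread keeps a private lastSeen[] of the Encaps it has
// already copied from enqueuers[]. If enqueuers[i] has not changed since the
// last call, the request was already handled by this thread and is skipped,
// bounding the appearances of one Encap in the queue to maxThreads.
class FilteredEnqueueSketch<E> {
    final int maxThreads;
    final AtomicReferenceArray<Encap<E>> enqueuers;
    // One lastSeen array per thread; plain arrays because each row is
    // accessed only by its owning thread.
    final Encap<E>[][] lastSeen;

    @SuppressWarnings("unchecked")
    FilteredEnqueueSketch(int maxThreads) {
        this.maxThreads = maxThreads;
        this.enqueuers = new AtomicReferenceArray<>(maxThreads);
        this.lastSeen = (Encap<E>[][]) new Encap[maxThreads][maxThreads];
    }

    Node<E> makeNodeWithOpenRequests(int tid) {
        Node<E> node = new Node<>(maxThreads);
        for (int i = 0; i < maxThreads; i++) {
            Encap<E> encap = enqueuers.get(i);
            if (encap == null || encap == lastSeen[tid][i]) continue; // already handled by us
            node.items.set(i, encap);
            lastSeen[tid][i] = encap;  // remember it so we never add it twice
        }
        return node;
    }
}
```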

Simple trick, and efficient. The disadvantage is that with this approach even an empty queue is chewing up maxThreads^2 array entries, due to having one array of size maxThreads per thread, which is not too nice if you have a lot of threads and a lot of queue instances.
... there are always trade-offs.

In the next post we'll look at some benchmarks in Java.