Monday, November 14, 2016

The Encapsulator Queue (part 3)

On the previous two posts we talked about The Encapsulator queue, a concurrent queue that is wait-free for enqueues and lock-free for dequeues.
This post will complete the series with some microbenchmarks made in Java.

Benchmarking queues is not an easy task, and many have ranted about bad benchmarking, so let's hope the benchmarks on this post aren't too bad.
Just keep in mind that this is Java, and in Java we can always have the GC enter at unexpected times and distort the results, or even worse, not enter during the benchmark and when the benchmark ends the GC finally enters and has to clean up all the garbage objects that were created by a queue, which distorts the results giving an unfair advantage to data structures that create a lot of temporary objects.
That's why we like C++ way better for doing this kind of stuff, but we couldn't bother to implement The Encapsulator in C++ with memory reclamation so Java it is  :)

These benchmarks are very similar to the ones described in our paper on the wait-free CRTurn queue.
We start with the usual benchmark of doing a single-enqueue-single-dequeue, where each thread alternatively does an enqueue followed by a dequeue.
This benchmark is simple to implement and is surprisingly deterministic. I say "surprisingly" because at any given time we can have all the threads doing enqueues, or all the threads doing dequeues, or half doing enqueues with the other half doing dequeues, or just one doing enqueue and N-1 doing dequeues, etc. All these combinatorics create different contention on the head/tail (and possibility other) variables of the queue, which can produce varying results, but we do 5 runs and chose the median of those 5 to display the results and the results are stable.
Here is the plot comparing the Michael-Scott (lock-free) queue, the CRTurn (wait-free) queue, and The Encapsulator:

The plot on the right-side is the same as the left-side, but normalized to Michael-Scott has a baseline.
As you can see, The Encapsulator has an overhead, probably from creating an array for each node, but it scales and at 8 threads is starts beating the Michael-Scott queue. It beats CRTurn queue in every scenario, but keep in mind that it creates a lot of trash for the GC to clean up (one array per node plus all the Encap objects) and that the dequeue() in The Encapsulator is lock-free, while the CRTurn queue is wait-free for both enqueues and dequeues.

The other set of microbenchmarks are made of bursts, with one burst of enqueues, then one burst of dequeues, then again a burst of enqueues, and so on.
This is a very deterministic benchmark, as long as we can start all threads at mostly the same time, or the burst is large enough to reduce the effect that not all threads start the burst at exactly the same time. The burst has to be large enough so as to avoid that by the time the last thread starts the first starting has already completed, which wouldn't be a proper representation of the contention on the queue.
For this reason, we choose bursts with 1 million operation, i.e. one million enqueues or 1 million dequeues.
To prevent memory allocation and cache issues, we divide the burst by the number of threads on the current entry. This means that if it is a single thread doing the operations, then it's going to do 1M enqueues, but if it's 32 threads, then each thread will do 1/32M = 31250 enqueues.
At the end of each burst the threads wait for each other before starting the next burst, with a burst of dequeues following a burst of enqueues and vice-versa.

Here are the plots for the enqueues with the number of operations and the ratio to Michael-Scott:

and the plots for dequeues:

With one thread, Michael-Scott has a big advantage, but as soon as contention increases, The Encapsulator starts to show its advantage.

That's all on The Encapsulator. Next post we will cover a new lock-free queue.

No comments:

Post a Comment