Thursday, November 27, 2014

Coroutines with await 2.0

Another presentation from CppCon 2014, by Gor Nishanov, this time about a new proposal to C++11 named "Await 2.0" with Stackless Resumable Functions.
I'm not a big fan of coroutines but they do have their usage, and the first 10 minutes of the presentation are well worth the time because he provides the best explanation of the multiple variants of coroutines in existance.

He also provides several examples, which help to understand some of the concepts related to coroutines.

Saturday, November 22, 2014

Left-Right GT variant

On this post we're going to cover a newly discovered variant of the Left-Right pattern: Left-Right GT.

The story

Before we go into the actual algorithm a bit of background history is in order.
Last week there was a blog post by Gil Tene (the CTO of Azul) about a new synchronization mechanism he called WriterReaderPhaser. Gil has some cool talks and I've posted one of them here about Latency, which is well worth watching.
I looked at the description of the algorithm and then at the code, and to me (and Andreia also) it looks like a variant of the Left-Right pattern, in which a new ReadIndicator was devised (by Gil) and then merged with the versionIndex. I had some discussion with him on the comments section of his post and I'm afraid we didn't get him to agree that this is a variant of the Left-Right, probably because he doesn't know what the Left-Right is. Regardless of who's right on that account, either Gil has discovered a new synchronization mechanism and we used the idea to make a new variant of Left-Right, or Gil has discovered a new variant of Left-Right and then we did a modification to it to make it more scalable. The end result is the same to me: there is a new Left-Right variant  ;)

Notice that the WriterReaderPhaser does mutations inside the arrive()/depart() blocks and then a read-only operation on the synchronized part, which is opposite of what the Left-Right was designed to do, but you can use a Reader-Writer Lock "backwards" and it will still be a Reader-Writer Lock, or as Shakespeare would put it "A Rose by any other name would smell as sweet". In this post I'm going to explain the Left-Right GT variant, not the details of the WriterReaderPhaser (I'm not the best qualified person to do that). If you want to understand what is the WriterReaderPhaser and how it works, then better go on to Gil's post.
hint: it's the Left-Right pattern with roles (Reader or Writer) reversed :)

The algorithm

The example code shown below is going to be in Java, but it is easy to "translate" to C11/C++1x atomics, and it doesn't need a GC.
The main innovation is a new kind of ReadIndicator that Gil has come up with, which consists of three counters:

private final AtomicLong startEpoch = new AtomicLong(0);
private final AtomicLong posEndEpoch = new AtomicLong(0);
private final AtomicLong negEndEpoch = new AtomicLong(Long.MIN_VALUE);

where the arrive()/depart() methods are the following:

private long arrive() {
    return startEpoch.getAndIncrement();
private void depart(final long localVI) {
    if (localVI < 0) {
    } else {
This is reminiscent of the Ingress/Egress technique described on the previous post, but there are some differences, with the first one being that the ingress counter is shared by both versions of the versionIndex (it's the one named startEpoch), where the versionIndex is encoded as the sign bit of startEpoch. The second difference is in the way that Writer modifies the versionIndex, which in this variant is done with a getAndSet() (named atomic_exchange() in C/C++ terminology), and this is the crucial point because atomic_exchange() is a wait-free atomic operation, unlike a CAS loop which is lock-free.

The real difference comes in toggleVersionAndScan(), which in the GT variant only has to do one scan (like on the RV and NV variants):

private void toggleVersionAndScan() {
    final long localVI = startEpoch.get();
    if (localVI < 0) {
        // This version is negative, so next versionIndex is positive.
        // Reset counter
        // Toggle versionIndex and count the number of arrives()
        final long localStartValue = startEpoch.getAndSet(0);
        // Wait for the readers that did arrive() with a negative versionIndex to depart()
        while (localStartValue != negEndEpoch.get()) Thread.yield();
    } else {
        // This version is positive, so next versionIndex is negative.
        // Reset counter
        // Toggle versionIndex and count the number of arrives()
        final long localStartValue = startEpoch.getAndSet(Long.MIN_VALUE);
        // Wait for the readers that did arrive() with a positive versionIndex to depart()
        while (localStartValue != posEndEpoch.get()) Thread.yield();

The comments on the code pretty much say it all, but there are some subtleties that shouldn't be missed.
The first one is that we can immediately reset the egress counter for the opposite versionIndex because we know that there are no Readers there, and we know this because on the previous toggle we waited for them, so we know they aren't going to increase any more.
The second detail is that some of the Readers that incremented the startEpoch before the toggle, have already seen the most up-to-date leftRight variable, and still we're going to wait for them, but that's "business as usual".

Improving Scalability

One of the first things that Andreia noticed about the GT variant used in the WriterReaderPhaser just by reading the code is that it wouldn't scale as the number of threads calling arrive()/depart() increases. She pointed out that we can safely replace the egress counters (negEndEpoch and posEndEpoch) with arrays of counters like DistributedCacheLineCounter or LongAdder, but the startEpoch counter is trickier because of the need for the atomicity of getAndSet().
After a few days of mulling it over, we realized that the startEpoch counter can also be replaced with an array of counters, it's just that we need to make our own changes to getAndIncrement() to sum() and to getAndSet().
I'm not going to go into a lot of details right now because it would make this post too long, but you can check out the inner class of to see how it is done.

The basic premise is that it's ok for some Readers to see an older versionIndex and some others to see the new versionIndex during a certain time window. This time window occurs during the toggle operation, but once it is done, all Readers are guaranteed to see the new versionIndex.
There is some extra trickiness due to the encoding of the versionIndex in the sign of each counter of startEpoch (which is now an array of counters), but we can handle that by subtracting Long.MIN_VALUE at the appropriate places.
The getAndSet() operation has to return the sum of the deltas of each of the counters in the array of counters, and that's the value that we need to check for when reading the array of counters of the corresponding egress.

One thing you might ask, is why don't we see the bottleneck on the startEpoch.getIncrement() in the WriterReaderPhaser? The answer is the bottleneck is there as well, it's just that the histogram classes where it is used are doing a few AtomicLong.getAndIncrement() on global counters, which is the actual bottleneck in terms of performance, such that for that particular application you won't even notice the difference between using the GT or the ScalableGT.

Performance Results

So here are the results of our microbenchmark when we apply these techniques to a Java TreeSet, and run it on a 32-core AMD Opteron:
This system has 32 cores
Filling up trees with 100000 elements...
----- writePerMil tests numThreads=32  Writes=50.0%  numElements=100000 -----
##### LRScalableTreeSet              #####  Total Ops/ms = 710
##### LRTreeSetGT                    #####  Total Ops/ms = 821
##### LRTreeSetScalableGT            #####  Total Ops/ms = 146

Filling up trees with 100000 elements...
----- writePerMil tests numThreads=32  Writes=10.0%  numElements=100000 -----
##### LRScalableTreeSet              #####  Total Ops/ms = 2966
##### LRTreeSetGT                    #####  Total Ops/ms = 2545
##### LRTreeSetScalableGT            #####  Total Ops/ms = 669

Filling up trees with 100000 elements...
----- writePerMil tests numThreads=32  Writes=1.0%  numElements=100000 -----
##### LRScalableTreeSet              #####  Total Ops/ms = 18422
##### LRTreeSetGT                    #####  Total Ops/ms = 3680
##### LRTreeSetScalableGT            #####  Total Ops/ms = 6022

Filling up trees with 100000 elements...
----- writePerMil tests numThreads=32  Writes=0.1%  numElements=100000 -----
##### LRScalableTreeSet              #####  Total Ops/ms = 74787
##### LRTreeSetGT                    #####  Total Ops/ms = 5379
##### LRTreeSetScalableGT            #####  Total Ops/ms = 26621

Filling up trees with 100000 elements...
----- writePerMil tests numThreads=32  Writes=0.0%  numElements=100000 -----
##### LRScalableTreeSet              #####  Total Ops/ms = 414741
##### LRTreeSetGT                    #####  Total Ops/ms = 2769
##### LRTreeSetScalableGT            #####  Total Ops/ms = 48537

As I'm sure you've noticed, the LRScalableTreeSet is the best in almost all scenarios, this is the Classical Left-Right with a ReadIndicator composed of a CLQ+Array+finalizers. In LRScalableTreeSet the arrive() and depart() are just simple stores with release, and each thread is writing on its own cache line (courtesy of @Contended), so it's pretty hard to beat that. Moreover, the Writer only has to do loads with acquire to check for the states of the readers, while on the GT variant it has to do a GetAndSet(), and on the ScalableGT it has to do multiple GetAndSet(), which explains why the ScalableGT is so much slower when there are a lot of write operations.

I was expecting a little bit more of performance out of ScalableGT but I guess it all boils down to how expensive are the getAndSet() operations compared to loads with acquire, and the getAndIncrement() compared to stores with release... huhhh on second thought, it was unreasonable of me to expect the LRScalableTreeSet to not remain the best when it comes to raw performance under scale  ;)

As always, source code is available on github: