Sunday, December 14, 2014

Relaxed atomics on array of counters

It's almost Christmas, and this year we don't have anything big (I mean, we do, but it won't be ready by Christmas), but we do have several small goodies to give away!

The first one of these goodies, is an optimization technique where we reduce the number of acquire-loads to a minimum when reading an array of counters (or list of counters). This is useful not just for counters themselves, but also when using counters as ReadIndicators, and as you may already know, counter-based ReadIndicators are used in the Left-Right and in Scalable RW-Locks (or C-RW-WP and the other algorithms described in NUMA-Aware Reader-Writer Locks).

To give a specific example, we're going to take the case of DCLC when used with C-RW-WP, but keep in mind that this optimization can be used for other ReadIndicators and for other algorithms.

If you look at the algorithm for the Writer on the scalable reader-writer lock (C-RW-WP) you will see that once the ReadIndicator.isEmpty() returns true, the thread will enter the write-critical-section:



so lets write out the operations in order of execution from top to bottom:



in the schematic above, we are using an array of atomic counters (like DCLC) and we're using C11/C++1x terminology, but further ahead in this post we'll look at how it is done in Java (hint, it's very similar). In this particular example, the array has 8 entries and all operations are loads with acquires which means that they can not be re-ordered with each other. The last load with acquire (on array[7]) prevents any code from being re-ordered from below to above, which means that there can be no code from the write-critical-section entering the isEmpty() method.


In order to reduce the number of acquire-loads, a naive approach would be to use relaxed atomic loads in the middle of the array. If we use relaxed loads in the middle of the isEmpty(), they can’t be re-ordered to be above the load-acquire of array[0], but they can be re-ordered to be below the load-acquire array[7], which would place them in the middle of the write-critical-section, which would be incorrect.



An example of why it wrong, is if the relaxed load on array[6] is done after the regular store on y and by the time the load is done, it is indeed zero (empty) but it wasn’t so if it had been read before the load-acquire of array[7] (the load of array[6] was a speculative load) which means we no longer have mutual exclusion!



The correct way to do this is the following: If we add a release fence before the last load with acquire, we prevent the relaxed loads from descending, and the last load acquire prevents the regular code of the critical section to ascend:






Notice that a release and acquire fences can not be re-ordered with each other in the std::memory_order_seq_cst memory model.


How about Java?
Well, Java doesn't have "releases" but it does sun.misc.unsafe.loadFence() which works just as well, because all the operation we want to prevent re-ordering are loads.



If we add a loadFence() before the last relaxed load, we prevent the relaxed loads from descending, and the loadFence() itself prevents the regular code of the critical section from ascending http://openjdk.java.net/jeps/171.
Notice that we are assuming that Unsafe.loadFence() will not be reordered to be below a regular/relaxed load in the Java Memory Model. If this was not a valid optimization, then StampedLock.validate() would be broken.


So what kind of results does this give?
We ran the benchmarks on an Intel i7-3740QM with 4 cores (8 hyper-threads), on an AMD Opteron with 32 cores (two nodes), and on a PowerPC 8S instance on RunAbove. The results (in Java) show that there is some but usually small gains with this on x86, as is to be expected, but on PowerPC there are some significant gains. I'm not sure all of these gains are due to the removal of the volatile loads, and it seems at least some of it to be due to not going through Unsafe.getLong() or Unsafe.getLongVolatile() and instead reading directly from an array of longs... ohh well further tests will be done, and we'll test with C++1x to see how it goes.







 




In case it wasn't obvious, this technique can be applied to other counters such as LongAdder, and to other ReadIndicators such as CLQ+Array+Finalizers. More on that on future posts.
In the meantime, here is the code in Java:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/counter/DistributedCacheLineCounter.java
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/counter/DistributedCacheLineCounterRelax.java
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/counter/TestCounters.java
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Presentations/DCLCRelax.pptx



No comments:

Post a Comment