Showing posts with label LongAdder.
Saturday, January 18, 2014
A dissertation on the progress conditions of LongAdder
On a more "academic" note, I have some minor doubts about whether LongAdder::increment() may actually be blocking. The reasoning is as follows:
Suppose that a thread has decided to create a new, larger cells[] and, as such, has set cellsBusy to 1 and created a new cells[] with the first half of the entries copied from the old one and the second half still at null. Then, for some reason, this thread goes to sleep for a very long time, leaving cellsBusy at 1.
Some other thread doing an increment() may end up on one of the new entries for which there is no Cell instance yet, and because cellsBusy is 1 it can't install its own Cell instance, so it will call "h = advanceProbe(h)" to get another index into cells[] through "a = cells[(n-1) & h]".
The advanceProbe() function uses a xorshift algorithm with a very long period (2^32 - 1 for this kind of 32-bit xorshift), but we only use the last few bits of h to index into cells[]. One could theoretically conceive of a seed h for which every call to advanceProbe() returns an h whose least significant bits always land on one of the cells that is not yet allocated, and in that case longAccumulate() would be blocking.
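To make the scenario concrete, here is a small sketch (illustrative names, not the actual JDK source) of the xorshift probe advance and of how only the low bits of h pick a cell index:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch in the spirit of Striped64's probe advance (illustrative, not
// the actual JDK source). advanceProbe() is a Marsaglia xorshift step;
// only the low bits of the probe ever select an index into cells[].
public class ProbeSketch {
    static int advanceProbe(int probe) {
        probe ^= probe << 13;   // xorshift with period 2^32 - 1
        probe ^= probe >>> 17;  // over the non-zero 32-bit integers
        probe ^= probe << 5;
        return probe;
    }

    public static void main(String[] args) {
        final int n = 8;            // cells.length, always a power of two
        int h = 0x9E3779B9;         // some non-zero starting probe
        final Set<Integer> visited = new HashSet<>();
        for (int i = 0; i < 64; i++) {
            visited.add((n - 1) & h);   // the index a thread would try
            h = advanceProbe(h);
        }
        // The theoretical worry: could some seed visit only unallocated
        // indices forever? In practice the low bits spread out quickly.
        System.out.println("distinct indices visited: " + visited.size());
    }
}
```

Running this with a handful of seeds shows the low bits cycling over many different indices, which is why the pathological seed, if it exists at all, would be an extraordinarily unlikely event.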
Even if such a thing is possible, it would certainly be a very unlikely event, and taking into consideration the definition of lock-free from "The Art of Multiprocessor Programming":
"A method is Lock-Free if it guarantees that infinitely often some thread calling this method finishes in a finite number of steps", where the key part is the "infinitely often", I'm still leaning towards saying that LongAdder::increment() is Lock-Free.
So why is this long and boring incursion into the (theoretical) progress conditions of LongAdder?
Well, LongAdder is one of the possible readIndicator implementations for the Left-Right technique, but the problem is that LongAdder is (at best) Lock-Free, so if we use it we lose the Wait-Free progress condition on the read operations provided by the Left-Right mechanism... what we gain is code simplicity and low memory usage (kind of), because most of the complexity of the readIndicator is abstracted away in the code for LongAdder.
If you don't know what LongAdder is you can check it out here:
http://download.java.net/lambda/b78/docs/api/java/util/concurrent/atomic/LongAdder.html
http://concurrencyfreaks.com/2013/09/longadder-is-not-sequentially-consistent.html
Like I said, this is mostly academic ;)
Tuesday, October 1, 2013
Even more on Counters
A few weeks ago I made a post about the Distributed Cache Line Counters. This is not something new as you can see from this very detailed post which also looks at the LongAdder from JDK:
http://psy-lob-saw.blogspot.co.uk/2013/06/java-concurrent-counters-by-numbers.html
One of the links is to an implementation by Cliff Click:
https://github.com/boundary/high-scale-lib/blob/master/src/main/java/org/cliffc/high_scale_lib/ConcurrentAutoTable.java
and already by then (back in 2011) he noticed that it could be used to make a Reader-Writer lock... but maybe he didn't know how, or didn't bother about it because he was busy with other stuff.
This just shows once more, that good ideas tend to cluster in time and are discovered from many different sources simultaneously ;)
If you don't know who Cliff Click is then take a look at this post:
http://concurrencyfreaks.blogspot.co.uk/2013/07/the-future-of-jvm.html
Sunday, September 22, 2013
Combining the StampedLock and LongAdder to make a new RW-Lock
In the previous post we showed that LongAdder is Sequentially Consistent (SC) if only the functions increment() and sum() are used, but not SC if decrement() is also used. In this post we will see what we can do with that knowledge, namely, how to make a reader-writer lock using LongAdder and StampedLock.
One of the main components of a Reader-Writer Lock is the mechanism that "counts" Readers. The simplest of all is a synchronized variable (a volatile in Java, or atomic<> in C++) that is incremented or decremented as a Reader acquires the read-lock.
When this counter reaches zero, it means that all Readers have completed, and a Writer may now enter, after it has "blocked" new Readers somehow. There are many variants of this mechanism but they all use an SC counter. The problem with using a single variable as a counter is that it creates a lot of contention on a single cache line when the multiple Readers increment/decrement the counter, so a much better solution is to use some kind of "distributed counter"... the main requirement being that it must be sequentially consistent.
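As a baseline, here is a minimal sketch of that naive single-variable scheme (the class and method names are my own, for illustration only; this is not a complete lock):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the naive single-variable reader count described
// above (illustrative only, not a complete lock): every Reader hits the
// same cache line, which is exactly the contention problem.
class SingleCounterIndicator {
    private final AtomicLong readers = new AtomicLong();

    void arrive() { readers.incrementAndGet(); }  // Reader acquires
    void depart() { readers.decrementAndGet(); }  // Reader releases

    // A Writer may enter only when this returns true (after having
    // blocked new Readers by some other mechanism).
    boolean isEmpty() { return readers.get() == 0; }
}
```

Correct, simple, and sequentially consistent, but every arrive()/depart() is a CAS on the same cache line, so it collapses under many concurrent Readers.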
The simplest approach is to use something like the DistributedCacheLineCounter (DCLC) that we talked about in a previous post:
http://concurrencyfreaks.blogspot.co.uk/2013/08/concurrency-pattern-distributed-cache.html
There are two problems with this approach, though:
First, the number of collisions will follow a statistical distribution (think hashtables); the chance of a collision is not too high, but it will happen, and a collision of two threads is enough to kill the performance for those two threads. If more than two threads collide, it is even worse.
Second, it consumes a non-negligible amount of memory. To have any decent performance under high contention, you'll need an array with at least as many entries as the number of cores, and even if you have just 32 cores, multiply that by the padding needed so that each entry is on a different cache line, and you get 2 kB per lock instance. That may not be much, but it depends on your application: if your app has a thousand lock instances, it grows to 2 MB, and what if you have a thousand cores, or millions of locks?... I've never seen millions of locks in a code base, but tens of thousands of lock instances seems commonplace to me.
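To make the memory argument concrete, here is the arithmetic behind those numbers, assuming the typical 64-byte cache line:

```java
// Back-of-the-envelope check of the memory numbers above, assuming
// one padded entry per core and a 64-byte cache line.
public class DclcMemoryCost {
    public static void main(String[] args) {
        final int entries = 32;       // one entry per core
        final int cacheLine = 64;     // bytes of padding per entry
        final int perLock = entries * cacheLine;    // 2048 bytes, i.e. 2 kB
        final int thousandLocks = perLock * 1000;   // roughly 2 MB
        System.out.println(perLock + " bytes per lock, "
                + thousandLocks + " bytes for 1000 lock instances");
    }
}
```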
LongAdder addresses the disadvantages mentioned above, not totally, but in terms of memory savings it is better, so it is worth thinking for a bit about whether we can come up with a way to use LongAdder in a lock instead of the DCLC. The problem is that LongAdder is not SC... unless we only use sum() and increment(), so the question is:
Can we build a reader-writer lock where the mechanism that counts Readers can only do increment() and sum()?
The answer is yes!
The trick is to use two counters, which we named readersIngress and readersEgress.
readersIngress is incremented whenever a Reader is attempting to acquire the shared lock, and readersEgress is incremented whenever a Reader is releasing the shared lock. The number of active Readers is zero if the value of readersEgress.sum() equals readersIngress.sum(). Notice that the order in which the two counters are read is very important: readersEgress must be read before readersIngress, otherwise, by the time readersEgress is being read, a new Reader may have acquired the read-lock and incremented readersIngress.
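A minimal sketch of just this read indicator (the arrive()/depart() method names are my own, for illustration; the real lock adds the interplay with the StampedLock):

```java
import java.util.concurrent.atomic.LongAdder;

// Minimal sketch of the ingress/egress read indicator described above.
// Both counters are only ever incremented, which is what keeps the
// LongAdder usage sequentially consistent.
class IngressEgressIndicator {
    final LongAdder readersIngress = new LongAdder();
    final LongAdder readersEgress  = new LongAdder();

    void arrive() { readersIngress.increment(); }   // Reader acquires
    void depart() { readersEgress.increment(); }    // Reader releases

    boolean isEmpty() {
        // Read egress first: every departure counted in egressSum has
        // its arrival already counted in ingressSum, so equality really
        // means no Reader is still in between arrive() and depart().
        final long egressSum  = readersEgress.sum();
        final long ingressSum = readersIngress.sum();
        return egressSum == ingressSum;
    }
}
```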
We named this new lock LongAdderRWLock and you can get the source code from the Concurrency Freaks Library here:
https://sourceforge.net/projects/ccfreaks/files/java/src/com/concurrencyfreaks/locks/LongAdderRWLock.java/download
In addition, we did a variant named LongAdderStampedRWLock that combines the LongAdder with a fallback to the StampedLock in case there are a lot of Writers.
https://sourceforge.net/projects/ccfreaks/files/java/src/com/concurrencyfreaks/locks/LongAdderStampedRWLock.java/download
Performance Plots
We ran our usual microbenchmark on a 32-core Opteron and compared it against the StampedLock, ScalableRWLock, and ScalableStampedRWLock. Here are the results for different workloads of Writers vs. Readers:
In terms of performance it doesn't lag behind the "Scalable" versions, and it uses a lot less memory in scenarios with a large number of threads, because the memory consumption of the "Scalable" locks is O(NReaderThreads), while for the "LongAdder" locks it is at most O(NCores).
Code Simplicity
The thing I really like about the "LongAdder" locks is that the code is very simple, because the complexity is hidden away in the LongAdder counter itself, which is part of JDK 8 and has therefore been thoroughly tested.
Don't believe me when I say it is simple? Then here is the code to acquire/release the read-only lock for the LongAdderStampedRWLock:
    public void sharedLock() {
        while (true) {
            readersIngress.increment();
            if (!stampedLock.isWriteLocked()) {
                // Acquired lock in read-only mode
                return;
            } else {
                // Rollback logical counter to avoid blocking a Writer
                readersEgress.increment();
                // If there is a Writer, wait until it is gone
                while (stampedLock.isWriteLocked()) {
                    Thread.yield();
                }
            }
        }
    }

    public void sharedUnlock() {
        readersEgress.increment();
        return;
    }
and here is the code to acquire/release the write lock:
    public void exclusiveLock() {
        // Try to acquire the stampedLock in write-mode
        stampedLock.writeLock();
        while (true) {
            // Order is _very_ important here
            final long egressSum = readersEgress.sum();
            final long ingressSum = readersIngress.sum();
            if (egressSum == ingressSum) return;
            Thread.yield();
        }
    }

    public void exclusiveUnlock() {
        if (!stampedLock.isWriteLocked()) {
            throw new IllegalMonitorStateException();
        }
        stampedLock.asWriteLock().unlock();
    }
It doesn't get any simpler than that!
Conclusions
We've shown a new technique for Reader-Writer locks that combines the StampedLock and the LongAdder to make a lock with a low memory footprint and high performance. In case you missed it in the plots above, its performance can be much better than the standalone StampedLock, which is shown as the cyan line at the bottom of all plots, except the "50% Writes" one.
Of the two variants, the LongAdderRWLock seems to be the best overall, but because it is based on C-RW-WP, it has the disadvantage of a slight Writer-Preference. If you look at the code, you'll see that both of them use a StampedLock, but the LongAdderRWLock uses it only to serialize the Writers, while the LongAdderStampedRWLock uses it as a fallback mechanism when many Writers exist, which means it inherits most of the fairness properties provided by the StampedLock.
We're still running tests, and Andreia is working on some interesting variants, so let's see what comes out of it.
All of this will be part of the upcoming release of the Concurrency Freaks Library 0.5