Friday, March 29, 2013

Bjarne Stroustrup on C++ Style at "Going Native"

We've been very busy lately with lots of changes to the Scalable-Read-Write-Lock so not too many updates. To keep you busy, here is a very nice talk by Bjarne Stroustrup at the conference "Going Native" from 2012
He talks about very interesting stuff (though not too much about concurrency) with some nice insights on linked-lists and arrays:

Monday, March 25, 2013

Optimizing exclusiveLock() in ScalableRWLock

Upon looking at exclusiveLock() for the first time, you may be temped to think that iterating through the ConcurrentLinkedQueue (CLQ) is not a very efficient way to scan the states of the Readers. A much better and obvious optimization is to use ConcurrentLinkedQueue.toArray() and then scan the array. Alas, this is not true unless some very clever manipulation is done.

As a first approach you could consider doing it on the fly and calling toArray() every time exclusiveLock() is called, but this means that you are allocating memory in a code path that is meant to be ran many times and fast. This is a big no-no because it causes the Garbage Collector to trash too much and makes the lock slower. How slow, depends on how good is the GC of the VM you're using, but don't expect that approach to improve performance.

A second approach is to have a reference to an array (lets call it readersStateArrayRef)  that is updated whenever a new thread does a read-lock or is terminated, i.e. whenever addState() or removeState() are called. This is also not a good approach because you need to maintain consistency between the CLQ and the array, and for that you need a lock. Apart from the problem of wanting to implement a lock with another lock (which sounds like an infinite recursive loop), there is the issue that removeState() is called from finalize() and having a lock in it may cause other threads to block... or the GC to block? I would say it's not a good idea at all.

There is a third approach which actually works, that Andreia thought of yesterday. The reasons why it works are complex and subtle, and I'll try to enumerate them here, but first lets describe the how .
We start by creating an atomic reference to an array of AtomicInteger (the states of each reader) and a static dummy array just so we have a reference we can differentiate from null and from other valid array references:

private final static AtomicInteger[] dummyArray = new AtomicInteger[0];    

    private final AtomicReference<AtomicInteger[]> readersStateArrayRef;

Now, we make it so that every change to the CLQ readersStateList is followed by a readersStateArrayRef.set() to null, and in this way mark the array as being out-of-sync with the CLQ. This means that that in addState() and removeState() we need to call readersStateArrayRef.set() after adding or removing from the CLQ.

On the exclusiveLock() function, we check if the readersStateArrayRef is null or not. If it is null, then we need to re-make the array using ConcurrentLinkedQueue.toArray(), and if it is not null then it means it is in-sync with the CLQ and we can use the array to scan the states of the Readers.

We have to check for null only after the writerOwner has been acquired by the current lock, because after that, even if the another thread adds to the CLQ, the state that it adds will always be SRWL_STATE_NOT_READING because it will see that there is a Writer ongoing and not be able to transition to SRWL_STATE_READING  (or only temporarily). If the thread is terminating, then there can be a leftover reference to its state (AtomicInteger) on the array, but that's fine because the state will be SRWL_STATE_NOT_READING (worst-case scenario it was set by removeState() ).
This means that it is safe to have temporary inconsistency between the array and the CLQ as long as it happens after the writerOwner have been set... an important an non trivial detail.

We could imagine that something like this would be enough:

// Try to acquire the lock in write-mode
while (!writerOwner.compareAndSet( SRWL_INVALID_TID, tid_self))
AtomicInteger[] localReadersStateArray = readersStateArrayRef.get();
if (localReadersStateArray == null) {
    // Convert the CLQ to an array
    localReadersStateArray = 
        readersStateList.toArray(new AtomicInteger[readersStateList .size()]);
// Scan the array of Reader states
for (AtomicInteger readerState : localReadersStateArray) {
    while (readerState.get() == SRWL_STATE_READING) Thread.yield();

but it is not, and in fact, it's wrong!
The reason it doesn't work is because we can have a race condition with a new thread calling addState() after the call to toArray() but before the call of readersStateArrayRef.set(localReadersStateArray), which would not impact the current running exclusiveLock() but it would cause the next call to exclusiveLock() to believe that that readersStateArray was in sync with the CLQ (because it was non-null), when in fact it was out of sync.
This could cause some reader states to be permanently left out of the scan by the writers, which would mean the lock would not be working properly.

The way to overcome this is to use the dummyArray we defined previously, by setting the readersStateArrayRef to the dummyArray before calling CLQ.toArray(), and afterwards doing a compareAndSet() from dummyArray to the newly created array.
If the CAS is successful then the array is in-sync with the CLQ and the next exclusiveLock() will be able to scan only the array.
If the CAS fails, there was another thread calling addState() or removeState() and the CLQ is most likely out-of-sync with the array. This is not an issue for the currently running exclusiveLock() but the next call will see the null and re-make the array by calling again toArray().
In fact, we don't even need to check the outcome of the CAS, but it needs to be a CAS because if the old reference was not dummyArray we can not overwrite it with the reference to the new array, only the next call of exclusiveLock() will be able to do that.

The working code looks like this:

// Try to acquire the lock in write-mode
while (!writerOwner.compareAndSet( SRWL_INVALID_TID, tid_self))
    Thread. yield();
AtomicInteger[] localReadersStateArray = readersStateArrayRef.get();
if (localReadersStateArray == null) {
    // Set to dummyArray before scanning the readersStateList
    readersStateArrayRef.set(dummyArray );
    // Copy the CLQ to an array
        readersStateList.toArray(new AtomicInteger[readersStateList.size()]);
    readersStateArrayRef.compareAndSet(dummyArray, localReadersStateArray);
// Scan the array of Reader states
for (AtomicInteger readerState : localReadersStateArray) {
    while (readerState.get() == SRWL_STATE_READING) Thread.yield();

This approach can give a 10% to 20% improvement for a large number of Writes (50% and 20% Writes), and has no impact when the number of Writes is low, as is to be expected.
We're going to run a few more tests before adding it to the next release of the Concurrency Freaks Library   :)

Sunday, March 24, 2013

ScalableReentrantRWLock Performance Plots

Just a quick post on the latest performance plots of ScalableReentrantRWLock, the version that will be in ConcurrencyFreaks Library 0.3

Tests were done on an AWS 8 core instance (c1.xlarge), where we read or increment an array with 16 int variables, depending on whether we're locking in read mode or write mode.

The vertical axis represents the total number of operations (read operations + write operations) per milisecond, and the horizontal axis the number of threads.
Several configurations were tried, with 50%, 20%, 10%, 2%, 0.1%,  and 0% Writes. For example, a configuration of 5 threads with 20% Writes means that each thread will do 4 read operations followed by one write operation and repeat. This has important implications, one of them is that, on average, for the case mentioned before (5 threads with 20% Writes) there is always a thread attempting a write operation.


As you can see from the plots, there is only a clear advantage for the ScalableReentrantRWLock if the number of Writes is reduced (at 10% Writes the StampedLock is much better). But these results change from system to system, for example, on my i7 720 Q laptop with 4 cores (8 threads) here is the plot for 10% Writes:

Conclusion, if you have a very low number of writes and high-contention, then you will probably see a significant performance improvement using ScalableReentrantRWLock, but if you have a high number of writes, then most likely not... the best is to try and measure with all the different locks for your specific application.