Friday, August 2, 2013

ScalableStampedRWLock on 32-core opteron

We got a new 32-core AMD Opteron machine (two Opteron 6272 CPUs with 16 cores each), so we ran the rw-lock benchmarks on it to compare against the measurements we had previously done on the 80-core machine with Intel Xeon processors. Here is what we got on the new machine:

The ScalableStampedRWLock is a variant of our original ScalableRWLock, which in turn is similar to the lock described in the paper "NUMA-Aware Reader-Writer Locks" by David Dice et al. The main change in this new lock is in the combination technique, also similar to the one described in that paper: the CAS loop in exclusiveLock() is replaced with a call to StampedLock.writeLock(), and in sharedLock(), instead of yielding when there is a Writer, the Reader goes directly to StampedLock.readLock().
With this change, we keep much of the performance provided by the ScalableRWLock when the number of Writers is low, while at the same time gaining most of the advantages that the StampedLock offers: strong fairness, the park()/unpark() queuing mechanism, spinning for a random time before calling park(), etc.
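To make the combination technique concrete, here is a hedged, heavily simplified sketch in the spirit of the description above; it is not the library's actual code, and the class and field names are illustrative. Readers publish themselves in a per-thread slot (padded against false sharing in the real lock), writers serialize on a StampedLock instead of a CAS loop, and a reader that sees an active writer rolls back and queues on StampedLock.readLock():

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.locks.StampedLock;

// Illustrative sketch only -- not the Concurrency Freaks implementation.
public class ScalableStampedRWLockSketch {
    private static final int MAX_THREADS = 128;
    // One slot per reader thread (the real lock pads these to avoid false sharing).
    private final AtomicIntegerArray readerStates = new AtomicIntegerArray(MAX_THREADS);
    private final StampedLock stamped = new StampedLock();
    private long writeStamp;  // stamp held by the current writer

    private static final AtomicInteger nextId = new AtomicInteger();
    private static final ThreadLocal<Integer> myId =
            ThreadLocal.withInitial(nextId::getAndIncrement);
    // 0 means "fast path"; otherwise the stamp returned by stamped.readLock()
    private final ThreadLocal<Long> myReadStamp = ThreadLocal.withInitial(() -> 0L);

    public void sharedLock() {
        final int i = myId.get();
        readerStates.set(i, 1);                   // announce this reader
        if (stamped.isWriteLocked()) {            // a Writer is active:
            readerStates.set(i, 0);               // roll back the announcement
            myReadStamp.set(stamped.readLock());  // and queue on the StampedLock
        }
    }

    public void sharedUnlock() {
        final long s = myReadStamp.get();
        if (s != 0L) {
            myReadStamp.set(0L);
            stamped.unlockRead(s);
        } else {
            readerStates.set(myId.get(), 0);
        }
    }

    public void exclusiveLock() {
        writeStamp = stamped.writeLock();         // replaces the original CAS loop
        // Wait for fast-path readers still inside their critical sections.
        for (int i = 0; i < MAX_THREADS; i++) {
            while (readerStates.get(i) != 0) Thread.yield();
        }
    }

    public void exclusiveUnlock() {
        stamped.unlockWrite(writeStamp);
    }
}
```

In the common (writer-free) case a reader only touches its own slot, which is what gives the read-side scalability; all the writer-vs-writer and writer-vs-slow-reader arbitration is delegated to the StampedLock.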

The performance plots aren't as impressive as the ones from the 80-core machine we had tested previously. This may simply be because the 80-core machine used Intel Xeons while this one uses AMD Opterons, which have different cache-coherence mechanisms and different overall performance. More testing would be required to figure out exactly what is happening, but the performance differences are noticeable on both machines :)

One thing to keep in mind is that the Write ratios indicated in the plots may be a bit misleading. When we say the test was done with a "Write ratio of 10%", it means that out of every 10 operations done by each thread, one was a "write operation". The detail hidden from view is that a "write" usually takes much longer than a "read". For example, suppose a "write" takes 10x longer to complete than a "read" (and I think I'm being conservative here): each thread then spends more than half of its time inside a write, which implies that with two or more threads there is, on average, always an ongoing "write", and that has important implications for the performance of "reads". As the number of threads increases, the effect becomes even stronger.
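A quick back-of-the-envelope check of that claim, using the same assumed numbers (10% write ratio, writes 10x the cost of reads; both figures are illustrative, not measurements from the benchmark):

```java
// Estimates how much of each thread's time goes to writes, and from that
// the expected number of in-flight writes for N threads.
public class WriteRatioEstimate {
    public static void main(String[] args) {
        double writeRatio = 0.10;  // 1 write per 10 operations
        double writeCost = 10.0;   // assumption: a write takes 10x a read
        double readCost = 1.0;
        // Fraction of each thread's time spent inside a write:
        double writing = (writeRatio * writeCost)
                       / (writeRatio * writeCost + (1 - writeRatio) * readCost);
        // writing = 10/19, i.e. roughly 53% of each thread's time
        System.out.printf("time spent writing per thread = %.0f%%%n", writing * 100);
        // With N threads, the expected number of in-flight writes is N * writing,
        // which already exceeds 1 for N >= 2: "always an ongoing write, on average".
        for (int n : new int[]{2, 4, 32}) {
            System.out.printf("N=%d -> expected concurrent writes = %.1f%n",
                              n, n * writing);
        }
    }
}
```

So even a "10% write" workload behaves, time-wise, like a write-dominated one under these assumptions, which is why the read side suffers as threads are added.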
In my experience, most accesses to data structures and critical code paths are read-many-write-few, or even read-many-write-rarely-but-in-a-burst, and for those access patterns the ScalableStampedRWLock (and the ScalableRWLock) can provide interesting performance gains.

You can get the ScalableStampedRWLock and its reentrant version, ScalableStampedReentrantRWLock, from the latest Concurrency Freaks library on SourceForge:
