Sunday, September 22, 2013

Combining the StampedLock and LongAdder to make a new RW-Lock

In the previous post we showed that LongAdder is Sequentially Consistent (SC) as long as only the functions increment() and sum() are used, but not SC if decrement() is also used. In this post we will see what we can do with that knowledge, namely, how to build a reader-writer lock using LongAdder and StampedLock.

One of the main components of a Reader-Writer Lock is the mechanism that "counts" Readers. The simplest of all is a single synchronized variable (a volatile in Java, or an atomic<> in C++) that is incremented as a Reader acquires the read-lock and decremented as it releases it.
When this counter reaches zero, it means that all Readers have completed, and a Writer may now enter, after it has somehow "blocked" new Readers. There are many variants of this mechanism, but they all use an SC counter. The problem with using a single variable as a counter is that it creates a lot of contention on a single cache line when multiple Readers increment/decrement it, so a much better solution is to use some kind of "distributed counter"... the main requirement being that it must be sequentially consistent.
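
To make the contention point concrete, here is a minimal sketch of the naive single-counter approach (a hypothetical class, for illustration only, not part of the library):

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicInteger;

    public class NaiveCounterRWLock {
        // Every Reader hits these two variables, so all Readers contend
        // on the same cache line(s).
        private final AtomicInteger readers = new AtomicInteger(0);
        private final AtomicBoolean writerPresent = new AtomicBoolean(false);

        public void sharedLock() {
            while (true) {
                readers.incrementAndGet();          // the contended hot spot
                if (!writerPresent.get()) return;   // no Writer: read-lock acquired
                readers.decrementAndGet();          // roll back so the Writer can go in
                while (writerPresent.get()) Thread.yield();
            }
        }

        public void sharedUnlock() {
            readers.decrementAndGet();              // the same hot spot again
        }

        public void exclusiveLock() {
            // Block new Readers, then wait for the current ones to leave
            while (!writerPresent.compareAndSet(false, true)) Thread.yield();
            while (readers.get() != 0) Thread.yield();
        }

        public void exclusiveUnlock() {
            writerPresent.set(false);
        }
    }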

The simplest approach is to use something like the DistributedCacheLineCounter (DCLC) that we talked about in a previous post:
http://concurrencyfreaks.blogspot.co.uk/2013/08/concurrency-pattern-distributed-cache.html

There are two problems with this approach though:
First, the number of collisions follows a statistical distribution (think hash tables). The chance of a collision is not high, but it will happen, and a collision between two threads is enough to kill the performance for those two threads; if more than two threads collide, it is even worse.
Second, it consumes a non-negligible amount of memory. To get decent performance under high contention, you need an array with at least as many entries as there are cores, and even with just 32 cores, multiplying by the padding needed to keep each entry on a different cache line gives 2KB per lock instance. That may not be much, but it depends on your application: if your app has a thousand lock instances, it grows to 2MB, and what if you have a thousand cores, or millions of locks?... I've never seen millions of locks in a code base, but tens of thousands of lock instances seems commonplace to me. The sketch below shows where the 2KB figure comes from.
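
As a rough sketch of that layout (hypothetical code, not the actual DCLC class from the earlier post):

    public class PaddedCounterArray {
        static final int CACHE_LINE_BYTES = 64;
        static final int NUM_ENTRIES = 32;  // e.g. one entry per core on a 32-core machine
        // Each entry is one long padded out to a full cache line
        // (64 / 8 = 8 longs of spacing), so entry i lives at counters[i * 8]
        // and no two entries share a line: 32 entries * 64 bytes = 2KB per lock.
        final long[] counters = new long[NUM_ENTRIES * (CACHE_LINE_BYTES / Long.BYTES)];
    }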

LongAdder takes care of the disadvantages mentioned above, not totally, but it is better in terms of memory savings, so it is worth thinking about it for a bit to see if we can come up with a way to use LongAdder in a lock instead of the DCLC. The problem is that LongAdder is not SC... unless we only use sum() and increment(), so the question is:
Can we build a reader-writer lock where the mechanism that counts Readers only ever does increment() and sum()?

The answer is yes!
The trick is to use two counters, which we named readersIngress and readersEgress.
readersIngress is incremented whenever a Reader is attempting to "acquire" the shared-lock, and readersEgress is incremented whenever a Reader is "releasing" the shared-lock. The number of active Readers is zero if readersEgress.sum() equals readersIngress.sum(). Notice that the order in which the two counters are read is very important: readersEgress must be read before readersIngress, otherwise, by the time readersEgress is read, a new Reader may have acquired the read-lock and incremented readersIngress, making the two sums match even though a Reader is still active.

We named this new lock LongAdderRWLock, and you can get the source code from the Concurrency Freaks Library here:
https://sourceforge.net/projects/ccfreaks/files/java/src/com/concurrencyfreaks/locks/LongAdderRWLock.java/download
In addition, we made a variant named LongAdderStampedRWLock that combines the LongAdder with a fallback to the StampedLock in case there are a lot of Writers:
https://sourceforge.net/projects/ccfreaks/files/java/src/com/concurrencyfreaks/locks/LongAdderStampedRWLock.java/download
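
For reference, the state of these locks is small: two LongAdder counters plus a StampedLock. A minimal sketch of the declarations (assuming the java.util.concurrent classes of JDK 8):

    import java.util.concurrent.atomic.LongAdder;
    import java.util.concurrent.locks.StampedLock;

    public class LongAdderStampedRWLock {
        // Incremented by a Reader when acquiring the read-lock
        private final LongAdder readersIngress = new LongAdder();
        // Incremented by a Reader when releasing the read-lock
        private final LongAdder readersEgress = new LongAdder();
        // Serializes Writers (and is the fallback when Writers dominate)
        private final StampedLock stampedLock = new StampedLock();
        // ... sharedLock()/sharedUnlock()/exclusiveLock()/exclusiveUnlock() shown below ...
    }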


Performance Plots

We ran our usual microbenchmark on a 32-core Opteron and compared these locks against the StampedLock, ScalableRWLock, and ScalableStampedRWLock. Here are the results for different workloads of Writers vs. Readers:
[Plots omitted: microbenchmark throughput for each Writers-vs-Readers workload; the standalone StampedLock appears as the cyan line at the bottom of every plot except "50% Writes".]
In terms of performance it doesn't lag behind the "Scalable" versions, and it uses a lot less memory in scenarios with a large number of threads, because the memory consumption of the "Scalable" locks is O(number of Reader threads), while for the "LongAdder" locks it is at most O(number of cores).

Code Simplicity

The thing I really like about the "LongAdder" locks is that the code is very simple, because the complexity is hidden away in the LongAdder counter itself, which is part of JDK 8 and has therefore been thoroughly tested.
Don't believe me when I say it is simple? Then here is the code to acquire/release the read-only lock of the LongAdderStampedRWLock:


    public void sharedLock() {
        while (true) {
            readersIngress.increment();
            if (!stampedLock.isWriteLocked()) {
                // Acquired lock in read-only mode
                return;
            } else {
                // Rollback logical counter to avoid blocking a Writer
                readersEgress.increment();
                // If there is a Writer, wait until it is gone
                while (stampedLock.isWriteLocked()) {
                    Thread.yield();
                }
            }
        }
    }

    public void sharedUnlock() {
        readersEgress.increment();
    }

and here is the code to acquire/release the write lock:


    public void exclusiveLock() {
        // Try to acquire the stampedLock in write-mode
        stampedLock.writeLock();
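        // Note: the stamp returned by writeLock() is discarded, because
        // exclusiveUnlock() releases through the asWriteLock() view,
        // which does not need a stamp.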
        while (true) {
            // Order is _very_ important here: egress must be read before ingress
            final long egressSum = readersEgress.sum();
            final long ingressSum = readersIngress.sum();
            if (egressSum == ingressSum) return;
            Thread.yield();
        }
    }
    public void exclusiveUnlock() {
        if (!stampedLock.isWriteLocked()) {
            throw new IllegalMonitorStateException();
        }
        stampedLock.asWriteLock().unlock();
    }
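
A quick usage sketch (a hypothetical snippet, following the usual try/finally idiom):

    final LongAdderStampedRWLock rwlock = new LongAdderStampedRWLock();
    // Reader:
    rwlock.sharedLock();
    try {
        // ... read the shared data ...
    } finally {
        rwlock.sharedUnlock();
    }
    // Writer:
    rwlock.exclusiveLock();
    try {
        // ... modify the shared data ...
    } finally {
        rwlock.exclusiveUnlock();
    }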


It doesn't get any simpler than that!


Conclusions

We've shown a new technique for Reader-Writer locks that combines the StampedLock and the LongAdder to make a lock with a low memory footprint and high performance. In case you missed it in the plots above, its performance can be much better than that of the standalone StampedLock, which shows up as the cyan line at the bottom of every plot except "50% Writes".

Of the two variants, the LongAdderRWLock seems to be the best overall, but because it is based on C-RW-WP, it has the disadvantage of a slight Writer-Preference. If you look at the code, you'll see that both of them use a StampedLock, but the LongAdderRWLock uses it only to serialize the Writers, while the LongAdderStampedRWLock uses it as a fallback mechanism when there are a lot of Writers, which means the latter inherits most of the fairness properties provided by the StampedLock.

We're still running tests, and Andreia is working on some interesting variants, so let's see what comes out of it.
All of this will be part of the upcoming release of the Concurrency Freaks Library 0.5.







6 comments:

  1. Hello Pedro,

    I have updated my RWLock to version 1.1; I have added another
    version of my RWLock, please read the following description:

    Description:

    A fast, scalable, and lightweight multiple-Readers-exclusive-Writer lock, called LW_RWLock, that is portable and works across processes and threads; and a fast and scalable multiple-Readers-exclusive-Writer lock, called RWLock, that is portable, does not use a spin-wait but instead an event object and my SemaMonitor, consumes less CPU than the lightweight version, and now processes the Writers in FIFO order, which is also important; it works across threads.

    You can take a look at it on the following website:

    http://pages.videotron.com/aminer/


    Thank you,
    Amine Moulay Ramdane.


  2. Hello Pedro,

    I have used my SemaMonitor inside my RWLock and benchmarked it
    against the Windows Semaphore, and I have found that it is faster
    than the Windows Semaphore. You will find my SemaCondvar and
    my SemaMonitor on the following website:

    http://pages.videotron.com/aminer/


    Thank you,
    Amine Moulay Ramdane.


    1. Hello Pedro,

      Here is my new algorithm for a distributed priority and
      non-priority LOCK; this algorithm reduces the cache-coherence
      traffic a lot and is good, so follow me please...

      First you allocate an array with the same number of elements as
      the number of cores, and the elements in the array must be 64-byte
      cache-aligned and must be a record containing one element that is
      the flag for the first CAS, which will be used as a critical section.

      You first initialize the distributed priority LOCK by setting the
      flag for the second CAS, which will also be used as a critical
      section, to 0, and you set the flag of the first CAS to 1 (to permit
      the first thread to enter the lock()).

      The lock() function will look like this:

      Every thread that enters the lock() must acquire its processor
      number by using the GetProcessorNumber() function, but this function
      will be optimized to amortize the number of calls to it, and that's
      easy to do.

      You enter the lock() function by pushing the thread's processor
      number onto a queue or priority queue that will be optimized; this
      queue will use a CAS on the push() and no CAS on the pop(), since we
      don't need one on the pop(): only one thread will pop() from the
      unlock() function. After that you enter a repeat loop where you test,
      with the first CAS and with the second CAS, the flag of the
      corresponding element of the array, using the thread's processor
      number as an index. The flag for the second CAS will be set to 1, so
      the first thread will enter the lock(), and the other threads will
      keep testing in a repeat loop for the first CAS and the second CAS;
      their flags will have been set to zero, so they will wait.

      After that the first thread will arrive at the unlock() function,
      pop() the processor number from the optimized priority or
      non-priority queue, and set the flag for the first CAS to 1 for the
      corresponding processor core; this will allow a thread to enter the
      lock. If there are no elements in the queue, the thread will set the
      flag of the second CAS to 1, and this will allow a thread to enter
      the lock.

      As you will have noticed, my algorithm is also efficient because,
      when there are threads waiting, the cache-coherence traffic will be
      reduced a lot, since we are using local variables in each element of
      the array, aligned to 64 bytes.

      So if you have noticed, in fact I am using 2 CASes, so my algorithm
      is good.

      This was my algorithm for a distributed priority and non-priority
      LOCK that is efficient, and I will code it as soon as possible.

      Thank you,
      Amine Moulay Ramdane.

    2. Hello,

      Be smart please: to reduce the cache-coherence traffic a lot, you
      have to choose a queue that reduces the cache-coherence traffic,
      and for that you have to code the queue as a linked list (similar
      to the MCS queue), which reduces the cache-coherence traffic a lot;
      but you have to avoid lock-free queues, because they will increase
      the cache-coherence traffic. This is how you have to code my
      distributed LOCK, and it will be efficient and good.

      Thank you,
      Amine Moulay Ramdane.

    3. Hello,

      I have come upon an interesting subject about lock-free queues.
      If you take a look at those lock-free queues, like in transactional
      memory, when the data has changed the lock-free queue retries and
      loops again, but this (lock-free) mechanism generates a lot of
      cache-coherence traffic, so I would not advise this lock-free
      mechanism. So how can we do it? You can take a look at the MCS
      queue lock; I think they are doing it like a wait-free linked list
      that doesn't spin and that reduces the cache-coherence traffic a
      lot. Other than that, if your memory manager is not optimal and
      uses a lock-free mechanism that generates a lot of cache-coherence
      traffic, you have to use a freelist to lower the cache-coherence
      traffic.

      So be smart please and think about that.

      Thank you,
      Amine Moulay Ramdane.

    4. Hello,

      I wrote this about my distributed priority and non-priority LOCK:

      "So if you have noticed, in fact I am using 2 CASes, so my
      algorithm is good."

      You will say that I will in fact use 3 atomic operations, 2 CASes
      and one inside the wait-free queue; but be smart please: when there
      are threads waiting and spinning, one of the CASes will be very
      cheap because its variable will be inside the L2 cache, while the
      other CAS will be expensive because the threads must load its
      variable from the remote caches. So it is as if we had a single
      CAS, since one of the two is very cheap, which makes a total of
      2 CASes if we count the CAS for the wait-free queue.

      This was my algorithm for a distributed priority and non-priority
      LOCK that is efficient, and I will code it as soon as possible.

      Thank you,
      Amine Moulay Ramdane.

