Sunday, September 1, 2013

Distributed Cache-Line Counter Scalable RW-Lock

In this post, we're going to talk about a variant of the Scalable RW-Lock that uses a Distributed Cache-Line Counter (DCLC) to track the number of active Readers.

For those of you who don't know what the Scalable RW-Lock is: it's a Reader-Writer lock that Andreia and I discovered, and which was simultaneously discovered by a team at Oracle Labs, who published it under the name "NUMA-Aware Reader-Writer Locks". And if you're wondering what a DCLC is, you can take a look at this post:
http://concurrencyfreaks.blogspot.com/2013/08/concurrency-pattern-distributed-cache.html

Compared to the "classic" mechanism of the ScalableRWLock, this technique needs no finalize() methods, no ConcurrentLinkedQueue, and no ThreadLocals... well, the reentrant version of the lock does need a ThreadLocal, but even then we don't need to register a finalize() for it.

The idea is to use the DCLC to count the number of active Readers. The Writer that wins the election among the competing Writers sets its state to WRITING, then scans the array of the DCLC and waits for every entry to reach zero. Notice that it doesn't need to restart from the first position of the array when one of the intermediate positions is non-zero: it just waits until that position reaches zero, and then advances to the next one.
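
To make this concrete, here is a minimal sketch of both paths in Java. The names, the modulo-based hash, and the absence of cache-line padding are simplifications for illustration; the actual DCLCRWLock differs in these details:

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicIntegerArray;

    // Minimal sketch of a DCLC-based Reader-Writer lock (illustration only).
    // In a real implementation each entry of 'counters' would sit on its own
    // cache line; the padding is omitted here for brevity.
    public class DCLCRWLockSketch {
        private static final int NUM_COUNTERS = 32;
        private final AtomicIntegerArray counters = new AtomicIntegerArray(NUM_COUNTERS);
        private final AtomicInteger writerState = new AtomicInteger(0); // 1 = WRITING

        private static int hash2entry() {
            // Stand-in hash: map the thread's id to one of the DCLC entries.
            return (int)(Thread.currentThread().getId() % NUM_COUNTERS);
        }

        public void sharedLock() {
            final int idx = hash2entry();
            while (true) {
                counters.getAndIncrement(idx);        // announce this Reader
                if (writerState.get() == 0) return;   // no Writer: acquired
                counters.getAndDecrement(idx);        // Writer active: roll back
                while (writerState.get() != 0) Thread.yield();
            }
        }

        public void sharedUnlock() {
            counters.getAndDecrement(hash2entry());
        }

        public void exclusiveLock() {
            // The election among competing Writers is omitted; assume this
            // thread won it. Announce WRITING so that new Readers back off.
            writerState.set(1);
            // Single left-to-right scan: spin on each non-zero entry until it
            // drains. Late-arriving Readers see WRITING and roll back, so
            // there is no need to restart the scan from the first position.
            for (int i = 0; i < NUM_COUNTERS; i++) {
                while (counters.get(i) != 0) Thread.yield();
            }
        }

        public void exclusiveUnlock() {
            writerState.set(0);
        }
    }

Notice how the Reader's roll-back is what makes the Writer's single scan safe: any counter that becomes non-zero after the Writer has passed it belongs to a Reader that will see WRITING, decrement it again, and wait.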

We named this new reader-writer lock variant DCLCRWLock. One of its advantages is that it is very easy to implement in any language, because it needs neither ThreadLocals nor a ConcurrentLinkedQueue. Another advantage is that Reader threads don't need to "register" the first time they try to acquire the lock, which makes it ideal for applications where each thread seldom acquires the lock.


Disadvantages:
- Not reentrant, but it can be made reentrant with the use of a ThreadLocal
- As with all hash-function-based algorithms, if the IDs of the threads trying to acquire the lock in shared mode end up in the same bucket, there will be a lot of contention on a single entry of the DCLC array and the lock won't scale. Statistically, though, the number of collisions should be small: for example, with 8 Reader threads hashing into a 32-entry array (assuming a reasonably uniform hash), the expected number of colliding pairs is C(8,2)/32 ≈ 0.9, so typically at most two threads share an entry. The contention should therefore be low, which implies high scalability with the number of Readers.

Advantages:
- No need for ThreadLocals
- No need for a ConcurrentLinkedQueue
- Works nicely for applications with a lot of thread destruction/creation (or threadpools)
- Easy to implement in several languages, as long as they have a Sequentially-Consistent Data-Race-Free Memory Model and an atomics API.
- Unlike the other variants of the Scalable RW-Lock, this one doesn't need a GC.

Performance:
Here are some performance plots for the Java version compared with the ScalableStamped, Stamped, and ScalableRWLock locks.
As you can see, its results are similar to the ScalableRWLock's, and in some cases even seem better (though that could be variation in the tests, since we did just one run). In a few isolated cases the performance drops significantly; my guess is that this is due to collisions of the hash function, as was to be expected.

Source Code:
Here is the link to the source code; it will also be part of the next release of the Concurrency Freaks library on SourceForge.
DCLCRWLock.java

5 comments:

  1. You can also take a look at my scalable RWLock at:

    http://pages.videotron.com/aminer/

I wrote it in Object Pascal, but Object Pascal is easy to understand.


    Thank you,
    Amine Moulay Ramdane.

    1. Hi Amine,
Indeed, it seems to be the same algorithm. Your implementation takes advantage of GetCurrentProcessorNumber() to be "NUMA aware" and to identify which entry in the array to increment, which means it is closer to the original paper than our implementation (the paper is called "NUMA-Aware Reader-Writer Locks" and you can get it from Dave Dice's blog).

      Unfortunately, in Java (or in portable C/C++) there is no functionality similar to GetCurrentProcessorNumber, so we had to "work around it" and create the technique with the hash2tid.
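
      Just to illustrate the idea, the hash2tid mapping can be sketched in Java along these lines (the mixing constant and the exact signature here are our own illustration, not necessarily what DCLCRWLock does):

          // Illustrative sketch: map the current thread's id to one of the
          // DCLC entries, mixing the bits so that sequential ids spread
          // across the array instead of clustering in adjacent entries.
          class Hash2TidSketch {
              static int hash2tid(int numCounters) {
                  final long tid = Thread.currentThread().getId();
                  final long h = tid * 0x9E3779B97F4A7C15L; // Fibonacci hashing
                  return (int)((h >>> 32) % numCounters);
              }
          }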


    2. Hello Pedro Ramalhete,

I think my scalable RWLock and the DV and the "NUMA-Aware Reader-Writer Lock" all suffer from the following problems; read the following
from the paper "NUMA-Aware Reader-Writer Locks":

      "However, every writer incurs the overhead of acquiring write permission
      for the RW lock of every node, potentially generating significant coherence
      traffic. Thus, the performance of DV plummets with increased writer activity"

      "most multi-core multi-chip systems available today
      have relatively small number of NUMA nodes (4 in the system used
      in our experiments), and we believe the overhead on the writer’s
      execution path is not a major performance concern on these systems.
      Future NUMA systems with larger numbers of nodes may
      pose a problem, but we leave the exploration of possible solutions
      to future work."


Does your Reader-Writer lock suffer from those same problems?


      Thank you,
      Amine Moulay Ramdane.

    3. Hi Amine,

Be careful not to confuse the DV lock with the Cohort-RW lock (more specifically, the C-RW-WP) that was published in "NUMA-Aware Reader-Writer Locks".
From what I could tell of your code, the algorithm you have implemented is very similar to the C-RW-WP, and the algorithms that Andreia and I have implemented, although not NUMA-aware, also have the same state machines as the C-RW-WP.

For the DV lock, yes, it seems there will be an impact due to cache-coherence traffic, because there is a lock associated with every node/core, and the Writer has to acquire all the locks in a pre-determined sequence (to avoid deadlocks between Writers).
For the C-RW-WP, the issue is the array where Readers publish their state, which grows with the number of nodes, but the impact is expected to be on the Writer's side, not on the Reader's side. For the Writer, I would expect the number of cache lines it has to read to increase with O(N_cores), and any NUMA architecture that can't handle that kind of load in a linear way doesn't seem to be a very useful architecture. Time will tell how future computers will behave.

      As always in Engineering, there is no "silver bullet", and some solutions are better for some cases, while others are better for other scenarios. The C-RW-WP lock has extremely fast Readers, at the cost of having slower Writers.

Btw, Andreia looked at your code and noticed that you're using a state machine with 3 operations for the Readers (get()/set()/get()), which with high probability can be "optimized" down to a set()/get(). You can take a look at this post for more details, "What's the minimum number of steps to do a read-only lock?":
www.concurrencyfreaks.com/2013/02/whats-minimum-number-of-steps-to-do.html
The RLock procedure should start by doing LockedExchangeAdd(FCount1^[myid].fcount1,1); and then test FCount3^.fcount3 = 1; if it is at 1, then roll back the counter with FCount1^[myid].fcount1 and proceed as usual.
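
      In Java terms, the suggested fast path would look roughly like this (readerCounts and writerFlag are hypothetical stand-ins for FCount1 and FCount3):

          import java.util.concurrent.atomic.AtomicInteger;
          import java.util.concurrent.atomic.AtomicIntegerArray;

          // Sketch of the suggested set()/get() Reader fast path.
          class RLockFastPathSketch {
              final AtomicIntegerArray readerCounts = new AtomicIntegerArray(64);
              final AtomicInteger writerFlag = new AtomicInteger(0);

              void rlock(int myid) {
                  readerCounts.getAndIncrement(myid);     // set(): announce first
                  while (writerFlag.get() == 1) {         // get(): Writer active?
                      readerCounts.getAndDecrement(myid); // roll back the counter
                      while (writerFlag.get() == 1) Thread.yield();
                      readerCounts.getAndIncrement(myid); // retry the announce
                  }
                  // In the common (uncontended) case this costs exactly one
                  // set() and one get().
              }
          }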

      Cheers,
      Pedro


    4. Thank you Pedro.


      Amine Moulay Ramdane.
