Tuesday, November 18, 2014

A Catalog of Read Indicators

ReadIndicators are one of the most used concurrency constructs in synchronization mechanisms, even if you never hear of them being named as so. Today we're going to spend some time talking about them.
ReadIndicators are usually seen in the wild as part of more complex synchronization mechanisms, such as Locks, Reader-Writer Locks, the Left-Right Pattern, etc. They can be NUMA-aware or not, they can scale well with contention, or not at all.


API

A ReadIndicator object has three methods:

arrive():
Semantically speaking, it works like lock.acquire() and it should be called before entering the critical section. Depending on the particular implementation, its progress condition may range from Wait-Free Population Oblivious (WFPO) to Lock-Free.

depart():
Semantically speaking, it works like lock.release() and it should be called after leaving the critical section. Its progress condition ranges from WFPO to Lock-Free.

isEmpty():
This function returns true if there are any threads in the critical section, i.e. that have called arrive() and have not yet called depart(). Its progress condition ranges from WFPO to Lock-Free.



The Catalog

We will now go into a flyby of the currently known implementations of ReadIndicators, so if you just want the jist of it, skip directly to the end of this post to see a comparison table of the different techniques.

Atomic Counter:

The simplest implementation of a ReadIndicator is to use a (sequentially consistent) atomic variable that counts the number of threads currently in the critical section. In Java it would look like this:
public class RIAtomicCounter implements ReadIndicator {
    private final AtomicLong counter = new AtomicLong(0);
   
    public void arrive() {
        counter.getAndIncrement();
    }
   
    public void depart() {
        counter.getAndDecrement();
    }
   
    public boolean isEmpty() {
        return counter.get() > 0;
    }
}


On x86, the getAndIncrement() and getAndDecrement() are implemented as a XADD instruction, which means that all three methods are WFPO. On CPUs where the getAndIncrement() and getAndDecrement() get transformed into a CAS loop, then the methods arrive() and depart() will be Lock-Free.
The main disadvantage of this technique is scalability. Obviously, if you're going to call arrive() and depart() a lot, it will create high contention on the counter variable, which will bog down your synchronization mechanism or data structure.
Regarding memory usage, this is the best if all techniques because it uses a single integer, making it ideal if you have a lot of instances of the synchronization mechanism, for example, to be used in an application with lots of instances of an object where each of these objects is protected with a Reader-Writer Lock that uses a ReadIndicator.


Ingress/Egress Counters:

In this technique we use two counters, which can be single atomics or an array of atomics as shown in DCLC. One of the counters (ingress) is incremented upon a call to arrive() and the other one (egress) upon a call to depart(). This is in a way very similar to how a Ticket Lock works.
The first thing to notice is that as long as ingress and egress are on different cache lines, you've just reduced contention by two when compared to a single counter, but this isn't such a big advantage.
The first time we saw this technique was on the paper "NUMA aware reader-writer locks" but I have no idea if it was discovered by the Oracle Scalable Synchronization team or not.


public class RIIngressEgressSingleAtomic implements ReadIndicator {
    private final AtomicLong ingress = new AtomicLong(0);
    private final AtomicLong egress = new AtomicLong(0);
   
    public void arrive() {
        ingress.getAndIncrement();
    }
   
    public void depart() {
        egress.getAndIncrement();
    }
   
    public boolean isEmpty() {
        // Order is _very_ important here
        return egress.get() == ingress.get();
    }
}


The two advantages of this technique is that it can use distributed counters (arrays of sequentially consistent atomics) and that it can be made NUMA-aware. Even better, we can use a class like Java's LongAdder, which isn't sequentially consistent if used as a single counter but it is Linearizable when used in this way. A few of our Reader-Writer Locks use it, like LongAdderRWLock or LongAdderExtRWLock.


Pre-Assigned Per-thread entry:

In this technique we simply pre-assign entries in an array to each thread. Obviously, this doesn't work very well if the number of threads in the application varies, or if you have thread creation/destruction.
The main advantage is that we only need a store with release on the arrive() and another one on the depart() method, and if the array is cache-line padded to prevent false-sharing, then there won't be contention among the different threads doing arrive()/depart() thus providing good scalability.


public class RIStaticPerThread implements ReadIndicator {
    private final AtomicLongArray perThreadState;
    private final int maxNumThreads;
   
    private static final int STATE_NOT_READING = 0;
    private static final int STATE_READING = 0;
   
    public RIStaticPerThread(int maxNumThreads) {
        this.maxNumThreads = maxNumThreads;
        this.perThreadState = new AtomicLongArray(maxNumThreads);
    }
   
    public void arrive() {
        perThreadState.set((int)Thread.currentThread().getId(), STATE_READING);
    }
   
    public void depart() {
        perThreadState.set((int)Thread.currentThread().getId(), STATE_NOT_READING);
    }
   
    public boolean isEmpty() {
        for (int tid = 0; tid < maxNumThreads; tid++) {
            if (perThreadState.get(tid) == STATE_READING) return false;
        }
        return true;
    }
}


A slightly better way is to register each thread by doing a CAS, which is what we did for https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/locks/ScalableRWLockS.java

Dynamically Assigned Per-thread entry (Scalable Mechanism):

This is a slightly complicated one. It combines a ConcurrentLinkedQueue (CLQ), an array, and finalizers (yuck, I know, who needs finalizers?).
Each thread should register itself upon startup, or we check on every call to arrive() if it is already registered. The registration mechanism uses a CLQ to add new thread entries to the list, and generates a new array with references to each entry to allow for faster lookups on isEmpty(). Notice that once the registration has been done, the arrive() and depart() consist of a simple store with release, making this technique the only one that we know of that adjusts to a variable number of threads in your application and that is WPFO for arrive() and depart(). The isEmpty() is Wait-Free Bounded (by the number of threads) because the size of the CLQ list (and mirroring array) depends on the number of threads calling arrive()/depart().
The example code is too big to paste in this blog post, so just give a look here for some example usage:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/locks/ScalableRWLock.java
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/locks/ScalableStampedRWLock.java

As far as I can tell, Andreia and I discovered this technique. It is not a particularly difficult feat to think that a CLQ can be used for registering/deregistering threads and then patching it up with a finalizer. What is slightly non-trivial is to realize that we can make an array that mirrors the CLQ for fast lookups in isEmpty() and that it still provides Linearizability among all three methods arrive()/depart()/isEmpty().
This technique is WFPO for arrive() and depart() on any CPU, and Wait-Free Bounded (by the number of threads) for isEmpty(). It is and highly scalable (the @Contended guarantees no false sharing effects) and its main problem is the memory usage which can be high because it depends on the number of threads, which means if you have thousands of threads or rapid creation and destruction of threads you'll encounter some performance issues, but who does that anyways?



SNZI - Scalable Non-Zero Indicators

This is a lock-free scalable technique for ReadIndicators. I'm not going to show code on it, and you can download a good powerpoint presentation here if you're interested. It has good scalability and low memory usage properties, but the fact that it is lock-free for all three methods arrive()/depart()/isEmpty() can put it into a big disadvantage in relation to the other techniques.


Others

I'm sure many more algorithms for ReadIndicators exist, and on the Left-Right paper we even talk about two others: No Version and Reader's Version.



Comparison Table



Which one is better depends on the application. How much scalability do you want? What kind of progress conditions do you need for arrive()/depart() and for isEmpty() ?
Use the table above and decide for yourself.

If you know of other ReadIndicators, feel free to tell us about them in the comments section.