Saturday, December 3, 2016

LCRQ and FAAArrayQueue are not good for all scenarios

In the previous post we presented a new lock-free linearizable queue named FAAArrayQueue, which is currently the fastest lock-free queue, or at least the fastest portable lock-free queue.
FAAArrayQueue uses a Fetch-And-Add (FAA) instruction to assign an entry to each enqueuer and each dequeuer, similarly to LCRQ, a fast lock-free queue invented by Adam Morrison and Yehuda Afek back in 2013, which additionally requires a double-word CAS (CAS2).

In case you missed it, there is a non-obvious problem with queues that use a fetch-and-add in this way: the enqueuer may be "bumped out" by a faster dequeuer that took the same ticket when doing the FAA. Typically this is not an issue, because the enqueuer is fast enough to do the FAA and place its item in the designated position before a matching dequeuer does an FAA, sees that there is no item in its position, and bumps out the enqueuer.
There is however a scenario where this happens a lot, or at least enough times to cause a significant performance drop. It occurs when we have one or just a few producers (enqueuers) and multiple consumers (dequeuers). This is sometimes called a "multiplexer pattern" or "sprayer pattern" and it is often used in situations where one thread wants to delegate work to other threads.

To show this issue in action, we created a benchmark with one thread being a dedicated producer and the other N threads being dedicated consumers.
At the end we show how many enqueues were done by the producer and how many non-null dequeues were done in total by all the consumers. For a well-behaved queue we would expect the number of dequeues to remain more or less constant, or perhaps even to scale, at least up to the number of enqueues, which itself should remain the same regardless of the number of consumers (there is always one and only one producer).
Unfortunately the results show a very different picture for LCRQ and FAAArrayQueue.
Here are the results of the benchmark comparing four different queues as the number of dedicated consumers increases:

The tests are in C++ and run for just 4 seconds on a 32-core AMD machine, and all four lock-free queues do their own memory reclamation using the same Hazard Pointers class.

The plots show that as the number of dedicated consumers increases, the throughput decreases dramatically for LCRQ and FAAArrayQueue, even for the enqueuer. This happens because of the "bump out" effect described above, which forces the enqueuer to do multiple fetch-and-adds until it finds a vacant entry. Obviously this also hurts the tail latency a lot, but we're not even going to go into that.

The conclusion from all of this is that different queues have different properties, and fast and balanced queues are hard to come by.
Anyways, if you have more dequeuers than enqueuers, particularly with dedicated enqueuers and dequeuers (a more realistic scenario than having the same thread act as both enqueuer and dequeuer), then the LazyIndexArrayQueue (or others) is probably a better choice than LCRQ or FAAArrayQueue.

Sunday, November 27, 2016

FAAArrayQueue - MPMC lock-free queue (part 4 of 4)

In this post we're going to show what is currently the fastest lock-free queue on the planet, or to be more precise, the fastest portable linearizable MPMC memory-unbounded lock-free queue. This is the fourth and last of the four linked-list-of-arrays-based queues with lock-free enqueues and dequeues.

If you want to jump directly to the code, you can get it on github in C++ (with memory reclamation):
or in Java:


Saying that FAAArrayQueue is the fastest lock-free queue on the planet is a bold claim, but from what everybody tells us, the fastest lock-free queue is the LCRQ made by Adam Morrison and Yehuda Afek in 2013,
and as it so happens, FAAArrayQueue is sometimes slightly slower and sometimes slightly faster than LCRQ, as you can see in the benchmarks below, all in C++, and all with arrays of 1024 entries per node (except Michael-Scott, which has a single item per node):

The reason why FAAArrayQueue is a bit below LCRQ in the Enq-Deq tests is cache locality: in an enqueue-dequeue benchmark LCRQ keeps re-using the same array, which means better cache locality, no new nodes created and, therefore, better throughput.
We're using an array of 1024 entries per node; if we used a larger array this difference would fade away, but I won't do that because I think it's silly to have a linked-list-based queue where each node has a huge array (who uses that in practice?).
On the burst benchmark, the bursts are much larger than 1024, so it forces LCRQ to create new nodes, which means that it won't be able to re-use the array, and in that scenario, the FAAArrayQueue is equally good for dequeues and even better for enqueues.

LCRQ is non-portable because it requires a double-word compare-and-swap (CAS2), which means it can only be used on x86 and only in native code (no Java, sorry). FAAArrayQueue gets close to the performance of LCRQ and is portable because it needs just a regular CAS, i.e. you can implement it in Java, or in C++ for any architecture using std::atomic<>.

How does FAAArrayQueue work?

It has some similarities with LazyIndexArrayQueue. For example, there is an enqidx and a deqidx in each node, but here they have a different purpose: to provide a unique entry in the items array through the use of a fetch-and-add (FAA) instruction (example code in C++).
    struct Node {
        std::atomic<int>   deqidx;
        std::atomic<T*>    items[BUFFER_SIZE];
        std::atomic<int>   enqidx;
        std::atomic<Node*> next;

        // Start with the first entry pre-filled and enqidx at 1
        Node(T* item) : deqidx{0}, enqidx{1}, next{nullptr} {
            items[0].store(item, std::memory_order_relaxed);
            for (long i = 1; i < BUFFER_SIZE; i++) {
                items[i].store(nullptr, std::memory_order_relaxed);
            }
        }

        bool casNext(Node* cmp, Node* val) {
            return next.compare_exchange_strong(cmp, val);
        }
    };
This idea has been used before, for example by LCRQ itself, or by Yang and Mellor-Crummey (YMC queue) in "A Wait-Free Queue as Fast as Fetch-And-Add".
The YMC queue uses this FAA approach to achieve high throughput and wait-free (unbounded) progress, and the basic idea is even described in Listing 1 of their paper, but that basic queue is only obstruction-free, while FAAArrayQueue is lock-free.

In the LazyIndexArrayQueue, the enqueue would start by looking for an empty entry and, after a successful CAS, would update the enqidx. In FAAArrayQueue the logic is inverted: first we do an FAA on enqidx to obtain an entry of the array, and then we do a CAS on that particular entry:
    void enqueue(T* item, const int tid) {
        if (item == nullptr) throw std::invalid_argument("item can not be nullptr");
        while (true) {
            Node* ltail = hp.protect(kHpTail, tail, tid);
            const int idx = ltail->enqidx.fetch_add(1);
            if (idx > BUFFER_SIZE-1) { // This node is full
                if (ltail != tail.load()) continue;
                Node* lnext = ltail->next.load();
                if (lnext == nullptr) {
                    Node* newNode = new Node(item);
                    if (ltail->casNext(nullptr, newNode)) {
                        casTail(ltail, newNode);
                        hp.clear(tid);
                        return;
                    }
                    delete newNode;
                } else {
                    casTail(ltail, lnext);
                }
                continue;
            }
            T* itemnull = nullptr;
            if (ltail->items[idx].compare_exchange_strong(itemnull, item)) {
                hp.clear(tid);
                return;
            }
        }
    }

The whole node logic of inserting a new node and advancing tail and head is still the classical Michael-Scott algorithm, just like on the other 3 array-based queues we presented, and just like on LCRQ, or on YMC. Nothing new here.

For the dequeue, the logic is again inverted relative to LazyIndexArrayQueue: first we do the FAA to get the index of the entry to dequeue, and then we do a CAS to obtain the item, or in this particular implementation, an atomic_exchange, because it is supposed to be slightly faster on x86:
    T* dequeue(const int tid) {
        while (true) {
            Node* lhead = hp.protect(kHpHead, head, tid);
            if (lhead->deqidx.load() >= lhead->enqidx.load() &&
                lhead->next.load() == nullptr) break;
            const int idx = lhead->deqidx.fetch_add(1);
            if (idx > BUFFER_SIZE-1) { // Node has been drained, check if there is another one
                Node* lnext = lhead->next.load();
                if (lnext == nullptr) break;  // No more nodes in the queue
                if (casHead(lhead, lnext)) hp.retire(lhead, tid);
                continue;
            }
            // 'taken' is a reserved sentinel pointer that marks the entry as consumed
            T* item = lhead->items[idx].exchange(taken);
            if (item == nullptr) continue;
            hp.clear(tid);
            return item;
        }
        hp.clear(tid);
        return nullptr;
    }

In the uncontended case, an enqueue does 1 FAA, plus 1 CAS, plus 1 hazard pointer store (seq-cst), and a dequeue does a similar number of operations.
On x86, the FAA is a fast operation even under contention (as explained in the LCRQ paper), and both the CAS and the hazard pointer store are done with no contention, just like in LCRQ, hence the high throughput.

The comparison between FAAArrayQueue and LCRQ is not completely fair because they are slightly different beasts:
1. LCRQ is capable of re-using the entries in the items array and will not allocate a new node (with a new items array) unless the current one is already full. FAAArrayQueue is incapable of re-using entries in the items array. This is a clear advantage for LCRQ because it allows re-use of the same array (good cache locality) as long as the queue is not too full. However, if memory usage is a concern for you, keep in mind that the array of items in LCRQ uses 128 bytes per entry, while in FAAArrayQueue it uses just 8 bytes per entry (on a 64-bit machine).

2. LCRQ requires a double-word Compare-And-Swap (CAS2) instruction, which only exists on x86 and is not part of any language's memory model or atomics API (that I'm aware of). Any LCRQ implementation is therefore x86-specific and must be done in a native language (i.e. C/C++/D are ok, Java is a no-no). FAAArrayQueue uses a regular CAS, which means it can be implemented on any CPU architecture (x86, PowerPC, ARM, etc.) and in any language with a memory model and atomics (C11, C++1x, D, Java, etc.). Here the advantage is clearly on FAAArrayQueue's side.

If you want a high throughput, memory unbounded, MPMC, linearizable, portable lock-free queue, you'll have a hard time finding something better than FAAArrayQueue.