Saturday, January 7, 2017

LCRQ in C++ with Hazard Pointers

LCRQ is a very fast Multi-Producer-Multi-Consumer lock-free linearizable queue.
It was designed by Adam Morrison and Yehuda Afek in 2013 and you can get the academic paper here:

The basic idea is to use the Michael-Scott lock-free queue but instead of having one item per node we have an array of items in each node. C++ code examples:
static const int RING_POW = 10;
static const uint64_t RING_SIZE = 1ull << RING_POW;

struct Cell {
    std::atomic<T*>       val;
    std::atomic<uint64_t> idx;
    uint64_t pad[14];
} __attribute__ ((aligned (128)));

struct Node {
    std::atomic<int64_t> head  __attribute__ ((aligned (128)));
    std::atomic<int64_t> tail  __attribute__ ((aligned (128)));
    std::atomic<Node*> next    __attribute__ ((aligned (128)));
    Cell array[RING_SIZE];

    Node() {
        for (unsigned i = 0; i < RING_SIZE; i++) {
            array[i], std::memory_order_relaxed);
            array[i], std::memory_order_relaxed);
        }, std::memory_order_relaxed);, std::memory_order_relaxed);, std::memory_order_relaxed);

This saves up the allocation, deallocation, and tracking of a large number of nodes. It may not seem like a big gain but it is, particularly when using hazard pointers, because now we don't incurr the cost of multiple hazard pointer stores. There is one store per node, and one node holds RING_SIZE items.
That's not all though, LCRQ is capable of re-using entries in the array of a node if they get dequeued in the meantime, which means we may not even need to allocate new nodes, so along as the capacity of the node is not exceeded, i.e. we don't place more items in the queue than there are entries in the array. This saves further allocation/deallocation and provides good cache locality, and who doesn't like cache locality?!?

So much for the good news, now for the bad news:
LCRQ uses CAS2 (double-word compare-and-swap), and for those of you not familiar with it, it's a 128 bit instruction that only exists on x86.
There is no CAS2 or equivalent instruction in ARM or PowerPC, which means that there is no CAS2 in the C11 or C++ or Java memory model, and probably there never will be. Unlike Fetch-And-Add (FAA) which can be "simulated" with a regular CAS or a LL/SC, the CAS2 instruction can not be simulated with other atomic instructions (except transactional ones).

Even more, CAS2 works on aligned memory, which means that every entry in the array of items must be aligned, which wastes a bit of memory, but ok, it's not too bad... unless you decide to use a node where there are 2^16 items, in which case it's 2^16 x 128 bits = 1 MegaByte of memory per node, and an empty queue has one node (the last used or the sentinel node), which means that for every empty queue in your app you are using 1MB of memory... and the queue is empty. Granted, you don't get much more performance out of a 2^16 array when compared to using an array with 32 entries (see Figure 9 of the academic paper), so you might as well use 32

The last piece of bad news is that even on x86, neither the C11 nor C++ memory model define CAS2 or how the other atomics interact with it, which makes it hard to reason about it. Nothing to worry about if you're used to writting concurrent code without a memory model. I'm not, but hey, I've been told I'm weird like that  ;)

Unlike previous queues, I'm not going to go into detail about this queue because although we have a C++ implementation of it, with Hazard Pointers, it's pretty much the code provided by the authors, so there is nothing new we made, so you're better off reading their paper if you want to understand this queue.
Here is the source code in C++ with Hazard Pointers for our implementation:

If you want a queue that is just as fast (or faster) and that can be implemented in any CPU or language (C11/C++1x/Java/JVM) then take a look at FAA Array Queue that we talked about on a previous post. It doesn't do re-usage, but it's just as fast and uses only FAA and CAS:

No comments:

Post a Comment