On the previous post we presented a new lock-free linearizable queue named FAAArrayQueue which is currently the fastest lock-free queue, if not, the fastest portable lock-free queue.
FAAArrayQueue uses a Fetch-And-Add (FAA) instruction to assign an entry to each enqueuer and each dequeuer, similarly to LCRQ, a fast lock-free queue invented by Adam Morrison and Yehuda Afek back in 2013 that also needs a double-word CAS (CAS2).
For the more distracted of you, there is a non-obvious problem with queues that use a fetch-and-add with this approach, where the enqueuer may be "bumped out" by a faster dequeuer that took the same ticket when doing the FAA. Typically, this is not an issue because the enqueuer is fast enough to do the FAA and place its item in the designated position before a matching dequeuer does an FAA and sees that there is no item in its position and bumps out the enqueuer.
There is however a scenario where this happens a lot, or at least enough times to cause a significant performance drop. It occurs when we have one or just a few producers (enqueuers) and multiple consumers (dequeuers). This is sometimes called a "multiplexer pattern" or "sprayer pattern" and it is often used in situations where one thread wants to delegate work to other threads.
To show this issue in action, we created a benchmark with one thread being a dedicated producer and the other N threads being dedicated consumers.
In the end we show how much enqueues were done by the enqueuer and how many non-null dequeues were done in total by all the consumers. For a well-behaved queue we would expect the number of dequeues to remain more or less constant, or perhaps even to scale, at least up to the number of enqueues, which itself should remain the same regardless of the number of consumers (there is always one and only one producer).
Unfortunately the results show a very different picture for LCRQ and FAAArrayQueue.
Here are the results of the benchmark comparing four different queues as the number of dedicated consumers increase:
The tests are in C++, they run for just 4 seconds, on a 32-core AMD machine, and all these four lock-free queues are doing their own memory reclamation using the same Hazard Pointer's class.
The plots show that as the number of dedicated consumers increase, the throughput decreases dramatically for the LCRQ and FAAArrayQueue, even for the enqueuers. This happens because of the "bump out" effect described above, that causes the enqueuer to have to do multiple fetch-and-add until it can find a vacant entry. Obviously this affects the tail latency a lot, but we're not even going to into that.
The conclusion from all of this is that different queues have different properties, and fast and balanced queues are hard to come by.
Anyways, if you have more dequeuers than enqueuers, particularly under dedicated enqueuers or dequeuers (a more realistic scenario than having the same thread being both enqueuer and dequeuer on the same thread), then the LazyIndexArrayQueue (or others) is probably a better choice than the LCRQ or FAAArrayQueue.