Before we start explaining the dequeue procedure, we need to look at DeqState, which holds the atomic snapshot of the data needed by the dequeuers:
It is composed of:
- head: A pointer to the first node of the queue;
- turn: The thread's turn, i.e. thread id. Not needed if you don't use the Turn consensus;
- items: An array with one pointer per dequeuer. Each thread that can dequeue has a unique entry in this array, where the pointer to the last item it dequeued is stored. Threads helping a dequeuer will dequeue the node/item on its behalf and place the item here, so that the dequeuer can find it;
- applied: An array of bits to be matched against the dequeuers array. If the bits are the same then the dequeue request has been granted and the corresponding item is in the thread's id position in the items array. If they don't match, then the dequeue of the item is yet to be made;
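To make the description above concrete, here is a hedged sketch of what such a snapshot could look like in Java. The field names follow the list above, but the class and `Node` type are assumptions, not the actual implementation.

```java
// Sketch of a queue node: the item plus a link to the next node.
final class Node<E> {
    final E item;
    volatile Node<E> next;
    Node(E item) { this.item = item; }
}

// Sketch of the DeqState snapshot described above.
final class DeqState<E> {
    final Node<E> head;      // first node of the queue
    final int turn;          // the thread's turn (Turn consensus only)
    final E[] items;         // last item dequeued on behalf of each thread
    final boolean[] applied; // bits matched against the dequeuers array
    DeqState(Node<E> head, int turn, E[] items, boolean[] applied) {
        this.head = head;
        this.turn = turn;
        this.items = items;
        this.applied = applied;
    }
}
```

All the fields are final: a DeqState instance is never mutated in place, only replaced wholesale (for example via a `compareAndSet()` on an `AtomicReference<DeqState<E>>`), which is what makes the snapshot atomic.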
Here's what the dequeue looks like:
The code may not look simple, but it's not too bad compared with other MPMC wait-free queues.
The first step is to publish the request to dequeue, which is done by toggling the bit on the dequeuers array (lines 6 and 7).
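The bit-toggling publication could be sketched like this in Java (class and method names are assumptions; the real code toggles the bit inline):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of the dequeuers array: one request bit per thread.
final class Dequeuers {
    private final AtomicIntegerArray bits;
    Dequeuers(int maxThreads) { bits = new AtomicIntegerArray(maxThreads); }

    // Flip tid's bit: while it differs from applied[tid] in the current
    // DeqState, the request is open; helpers grant it by copying the bit.
    void publish(int tid) { bits.set(tid, 1 - bits.get(tid)); }

    int get(int tid) { return bits.get(tid); }
}
```

Note that the bit carries no payload: it is merely a "my request has changed" signal that helpers compare against the applied array of the current snapshot.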
Then, we start helping other dequeuers, starting from the current turn (could be from zero) and skipping the closed requests (line 20). In our local items array we start assigning the items of the consecutive nodes (line 27) starting from the current head (line 16) and up until all open requests have been granted, including our own request, or until there are no more nodes to dequeue (line 24).
Contrary to what you may think, we cannot stop once we find a node whose next is null. We have to keep trying for all the other requests, because a request could have appeared later, after the insertion of a new node; if we handed it a null even though by that time there was already a new node (and item) in the queue, that would not be linearizable behavior.
Yes, that one is a bit tricky I'm afraid, so re-read the sentence and think about it carefully.
The last step is to construct the new DeqState instance with the new arrays and variables (line 34) and do a CAS to replace all of these variables in one atomic snapshot (line 34). Once again the Copy-On-Write pattern comes to the rescue here... this is why COW is so important!
Like Herlihy's Universal Wait-Free Construct, we have to do this twice (line 8), because one pass may not be enough to grant our own request when another thread wins the CAS on the DeqState.
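The whole flow described above can be sketched as the much-simplified class below. All names are assumptions; the enqueue is a plain non-thread-safe append only so the sketch can be exercised, the turn field is omitted, and memory reclamation is left to the GC, exactly as the text assumes. It is a sketch of the technique, not the author's implementation.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.atomic.AtomicReference;

final class SimQueueSketch<E> {
    static final class Node<T> {
        final T item;
        volatile Node<T> next;
        Node(T item) { this.item = item; }
    }

    static final class DeqState<T> {
        final Node<T> head;     // first node of this snapshot
        final Object[] items;   // last item dequeued on behalf of each thread
        final int[] applied;    // matched against the dequeuers bits
        DeqState(Node<T> head, Object[] items, int[] applied) {
            this.head = head; this.items = items; this.applied = applied;
        }
    }

    final int maxThreads;
    final AtomicIntegerArray dequeuers;          // one request bit per thread
    final AtomicReference<DeqState<E>> deqState;
    volatile Node<E> tail;

    SimQueueSketch(int maxThreads) {
        this.maxThreads = maxThreads;
        this.dequeuers = new AtomicIntegerArray(maxThreads);
        Node<E> sentinel = new Node<>(null);
        this.tail = sentinel;
        this.deqState = new AtomicReference<>(
            new DeqState<>(sentinel, new Object[maxThreads], new int[maxThreads]));
    }

    // Single-threaded append, only so the dequeue sketch can be exercised.
    void enqueue(E item) {
        Node<E> node = new Node<>(item);
        tail.next = node;
        tail = node;
    }

    @SuppressWarnings("unchecked")
    E dequeue(int tid) {
        // Publish our request by toggling our bit in the dequeuers array.
        dequeuers.set(tid, 1 - dequeuers.get(tid));
        // Two passes, like Herlihy's construct: one pass may not be enough
        // if another thread wins the CAS on the DeqState.
        for (int round = 0; round < 2; round++) {
            DeqState<E> cur = deqState.get();
            if (cur.applied[tid] == dequeuers.get(tid)) break; // already granted
            Node<E> node = cur.head;
            Object[] items = cur.items.clone();      // Copy-On-Write
            int[] applied = cur.applied.clone();
            for (int t = 0; t < maxThreads; t++) {   // help every open request
                if (applied[t] == dequeuers.get(t)) continue; // request closed
                // We re-check node.next for every open request, so a request
                // that arrived after a new insertion still gets an item.
                if (node.next != null) { node = node.next; items[t] = node.item; }
                else items[t] = null;                // queue empty for this request
                applied[t] = dequeuers.get(t);       // grant the request
            }
            // Swap in the whole snapshot with a single CAS.
            deqState.compareAndSet(cur, new DeqState<>(node, items, applied));
        }
        return (E) deqState.get().items[tid];
    }
}
```

Notice that the only loops are the two rounds and the bounded helping loop over maxThreads, which is where the wait-free bound comes from.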
Closing thoughts
As you can see, in both enqueue() and dequeue() we only have for() loops with a constant bound, which means this queue is wait-free bounded (by the number of threads).
I'm hiding away a lot of the complexity here because we have a GC and we don't re-use the DeqState instances or the nodes. Re-using them means we are subject to ABA problems, and that is a whole new world of pain (but it can be done). Not using a GC means using Hazard Pointers, in a wait-free way, which is a whole universe of pain... a subject for another post altogether.
Just to pull the veil a little bit and show how much pain it is, think about how you would be able to return deqstate.items[tid] using Hazard Pointers in a wait-free way. What if the deqstate variable keeps changing, thus invalidating the hazard pointer? You can't de-reference the contents of deqstate because they may have been reclaimed and you'd get a crash... how do you solve that problem? Hint: the solution is in the academic paper.
I'll try to show some performance plots in an upcoming post, but if you want, you can take a look at the authors' paper, which has some plots: http://thalis.cs.uoi.gr/tech_reports/publications/TR2011-01.pdf
SimQueue is an amazing queue and it deserves the time it takes to understand it. It's full of cool ideas and tricks, so dig in, and learn what you can!
Java implementation here: