After making the previous post, I went back to the code of the Michael-Scott persistent queue and figured out there is one psync and one pwb calls that can be removed in the enqueue() method, namely, the ones before the return statement.
This is possible because, for enqueue() to return, it means the CAS on tail has been done, and although the value of tail may not be persisted, the CAS on tail guarantees that the value of the node->next is persisted. This means that if a crash occurs (in the scenario shown on the previous post) and a_is_persisted is persisted and the change to tail occurring on the enqueue is not, it's still ok because the node->next is persisted, which will allow the recover() of the queue to advance the tail and persist it as well.
Ok I know, that sentence was long and confusing, but if I were to make a formal proof, it would be waaayyy more confusing ;)
Unfortunately, for the dequeue(), no such trick is possible on the head, therefore, we really do need the PWB(&head) and PSYNC() before returning from dequeue().
The source code on github has been updated:
The performance increase is barely noticeable so I won't even bother to show the new plot.
This is nice illustration of the similarities between the C++/C11/Java sequentially-consistent atomics, and the pwb/pfence/psync model described here. Sure, we can do the algorithms with everything seq-cst, and everything full persistence fences, but to get better performance, we need to understand the algorithm to reduce the number of fences to a minimum, regardless of whether those fences are for synchronization (concurrency) or for persistency.