Thursday, February 8, 2018

Can't protect a pointer without a store-load fence

This post is about memory reclamation algorithms in shared memory concurrency. That's things like Hazard Pointers, Drop the Anchor, epoch-based reclamation (User-space RCUs) and of course, Hazard Eras. There's an insight about memory reclamation, I had a long time ago, that I want to share. This insight probably isn't new, but it was to me when I thought about it  ;)


Let's start from the beginning:

Many years ago, Leslie Lamport made a short paper where he showed that to have mutual exclusion on shared memory system, you need at least one store-load fence. Not sure this is the right reference but it's close enough to this topic:
https://www.microsoft.com/en-us/research/publication/make-multiprocessor-computer-correctly-executes-multiprocess-programs/
In other words, for two threads to have mutually exclusive access to a certain data, this data must be protected with an algorithm (a lock) and to implement this algorithm you will need at least one store and one load, whose visibility must not be re-ordered to the other thread.
It's probably easier to think about this in the context of Peterson's algorithm, or Dekker's, or one of our 10 algorithms which do the same kind of thing:
https://en.wikipedia.org/wiki/Dekker%27s_algorithm
https://en.wikipedia.org/wiki/Peterson%27s_algorithm
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/cr2t-2016.pdf

If you get in the context of these algorithms, you can think about the store as being a way of signaling the intent to acquire the lock, and the load as a check that the other thread doesn't already have the lock. If the effects or visibility of these two operations could be changed (by the compiler or CPU or cache-coherence system) then the algorithm would no longer be correct.
Placing a store-load fence between the store and the load prevents re-ordering from happening, making the algorithm correct, at the cost of adding synchronization.

Another analogy, is to think about a store as a mechanism to send a message to a certain centralized location, and the load as a mechanism to read the message at that centralized location. This is easier for us humans to reason about.
In the end it's all physics, transmitting information takes time, and it involves waiting which by definition is blocking or at least involves some kind of synchronization, in this case, the store-load fence.

As an aside, there are no explicit store-load fences on the C11/C++ memory model, therefore we must use a seq-cst store followed by a seq-cst load, or a relaxed store and relaxed load but with an atomic_thread_fence(memory_order_seq_cst) in the middle.


Waaaaiiit, you said you were going to talk about memory reclamation, but you're talking about mutual exclusion?!? You tricked us!!!

Not exactly, you see, the two-thread mutual exclusion problem and the memory reclamation problem have a lot in common.
A lock algorithm allows two threads to access an object, one a time. Each thread has the guarantee that when it is accessing the object, the other thread is not accessing the object.
A memory reclamation algorithm allows two threads (the reader and the reclaimer) to access an object, with the reclaimer having the guarantee that at a certain point in time the reader will never access the object again.
Notice the difference?

For mutual exclusion a thread needs to know if the other thread is not currently accessing the object, while for memory reclamation, a thread (the reclaimer) needs to know if the other thread (the reader) is not currently accessing the object nor will it ever try to access it again.

In summary, memory reclamation can be seen as mutual exclusion with a stronger requirement.
Which means, that if to do mutual exclusion we need a store-load fence, then we also need (at least) a store-load fence for doing memory reclamation.
To be more precise, a reader needs one store-load fence to signal its intent to protect an object to the reclaimer.

Now, I never proved this formally, this is just a sequence of logical deductions. So let me substantiate this bold statement with some examples.

Start by looking at figure 3 of our Hazard Eras paper:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/hazarderas-2017.pdf



Looking at the left side, to protect a pointer in Hazard Eras we do a seq-cst store of this pointer on a shared array, and then we check that the pointer is still valid. There is an implicit store-load fence placed in by the compiler when we use seq-cst atomic store followed by a load.
In Hazard Pointers, at least one store-load fence is needed for every object needing protection.

Looking at the right side, to protect a pointer with URCU or Epoch-based reclamation, we do a seq-cst store of an epoch and then load whatever pointer we want to use. Again here there is an implicit store-load fence so that the store of the epoch will not be re-ordered with any of the loads of the object that will be used in the read-side critical section.
In Epoch-based reclamation, a single store-load fence is needed to protect all objects, now and until the read-side critical section ends.

Looking at the middle code, to protect a pointer with Hazard Eras, we do a seq-cst store of an era, load whatever pointers we need, and then check that the era has not changed. Again here, there is an implicit store-load fence between the store of the era and the subsequent loads. In fact, there is also a constraint that the load of the pointer(s) and the following load of the era can not be re-ordered, but that's another story.


Do you see the "pattern" now?
All three classes of algorithms require one store-load fence.
The difference between the three is that:
- Hazard Pointers does one fence to protect one object;
- Epoch-based does one fence to protect all objects (until the end of the read-side critical section);
- Hazard Eras does one fence to protect all currently live objects;
and of course, Epoch-based reclamation is blocking, while Hazard Pointers and Hazard Eras are lock-free.


Having said this, there are some apparent exceptions to this, but I would argue that the synchronization of the store-load fence is always there, it just might be hidden from sight.
For example, we can use hardware transactional memory to do memory reclamation, and no explicit store-load fence is needed... because the hardware does the corresponding synchronization internally.
http://www.cs.technion.ac.il/~erez/Papers/teleportation.pdf
It's also possible to use optimistic memory reclamation, but then we're relaxing the premises of the memory reclamation problem I stated above and that may not always be possible.
http://www.cs.technion.ac.il/~erez/Papers/oa-spaa-15.pdf
There are also other approaches that make use of services provided by the Operating System, just like there are mutual exclusion locks that are OS provided. Ultimately, they will need synchronization or a store-load fence, maybe it's the kernel that's doing it instead of the user code explicitly, but the synchronization is there somewhere, hidden behind all the OS/Kernel layers.

The final insight is the following: If it does correct and safe memory reclamation in a shared memory concurrency system, it needs at least one store-load fence, or an equivalent form of synchronization!

5 comments:

  1. Hello, I'm a regular reader of your posts, and I love those, please keep on posting.
    You say c++11 has no store-load fence and we must use a seq-cst model.
    Why does the sequence
    x.load()
    y.store(v, release)
    is not enough ?

    ReplyDelete
    Replies
    1. Hi David,
      The short answer, watch these to understand the difference between a seq-cst store and release store:
      http://concurrencyfreaks.com/2013/02/the-new-memory-model-in-c11-and-c11.html

      The long answer is, the problem is not the
      x.load()
      y.store(v, release)
      the problem is the recheck after the store:
      v = x.load();
      y.store(v, release);
      if (v == x.load()) return x;

      If the second x.load() is executed before the y.store(v) is visible to other threads, then it would break the Hazard Pointers algorithm because you could have a pointer to an object which got reclaimed before the store was visible to other threads. Doing the store as seq-cst prevents that from happening. The exact details are non-obvious and require some intimate knowledge of how the HP algorithm works. You can check the original HP paper by Maged Michael if you're interested in the details
      https://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf

      Delete
    2. I missed the problem was on the second load. Tx !

      Delete
    3. Don't worry, this stuff is never obvious.
      Glad you enjoyed the post :)

      Delete
  2. Interesting article. Thanks for sharing your insights and thoughts.

    ReplyDelete