Tuesday, May 2, 2017

Hazard Eras - Non-Blocking Memory Reclamation

We are proud to present Hazard Eras, a new memory reclamation technique that is lock-free/wait-free. Hazard Eras uses the same API as Hazard Pointers but can be up to 5x faster.
A brief announcement will appear in SPAA 2017:
https://spaa.acm.org/2017/program.html

The full paper can be obtained here:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/hazarderas-2017.pdf

Hazard Eras uses just atomics and the C++ memory model, which means it is easily portable across a multitude of architectures. It's kind of a hybrid between epoch-based and pointer-based memory reclamation but, unlike previous approaches, Hazard Eras is non-blocking, i.e. lock-free, and in some cases even wait-free (just like Hazard Pointers, it depends on how it's used).
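
To give a concrete idea of what "the same API" means, here is a minimal sketch of the interface in the spirit of Hazard Pointers (the member names below are illustrative and may differ slightly from the ones in our implementation):

#include <atomic>

template<typename T>
class HazardEras {
public:
    // Publish an era that protects the object 'atom' currently points to,
    // and return a pointer that is then safe to dereference.
    T* get_protected(int index, const std::atomic<T*>& atom, int tid);
    // Clear all eras published by thread 'tid' (done when leaving the
    // critical region, like clearing hazard pointers).
    void clear(int tid);
    // Retire 'ptr'; it is deleted once no published era can still cover it.
    void retire(T* ptr, int tid);
};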

6 comments:

  1. I see your implementation uses the tid as an index. Here are some practical questions:
    How do you deal with the situation where the number of threads is dynamic?
    How do you map thread ids to indexes in a fixed-size array without collisions?

    1. Hi Xinjing,

      We wanted to show the simplest possible implementation so we went with tid.

      It is relatively easy to change it to support a variable number of threads: just use the same technique as Hazard Pointers where, instead of an array with one entry per thread, there is a linked list with one node per thread. Each node has an "active" flag that reader threads use to register themselves with a CAS. The address of the node can then be kept in a control block structure (like HP does it) or in a thread-local variable.
      Registration with this mechanism is wait-free bounded because the number of threads is bounded, which implies that the insertion of a new node can fail at most a bounded number of times.
      See Fig. 4 of the Hazard Pointers paper http://researchweb.watson.ibm.com/people/m/michael/ieeetpds-2004.pdf
      or the folly implementation by Maged Michael himself:
      https://github.com/facebook/folly/blob/master/folly/experimental/hazptr/hazptr-impl.h#L219

      Btw, if it were me, I would use an array of size MAX_THREADS (assuming some bound on the maximum number of simultaneous threads in the application is known), where each entry in the array has MAX_ERAS hazard-era entries (64-bit atomic integers) plus one "active" entry (an atomic boolean) which readers CAS to make their reservation. When the CAS succeeds, the index of the reserved entry in the array can be kept in a thread-local variable. A sketch of this scheme is below.
      By having an array instead of a linked list you get better scan speed for the reclaimer threads, at the cost of using more memory... memory is cheap, right? ;)
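
      Here is a minimal sketch of that array-based registration scheme (MAX_THREADS, MAX_ERAS and the struct/class names here are illustrative, not from the paper):

      #include <atomic>
      #include <cstdint>

      constexpr int MAX_THREADS = 128;  // assumed bound on simultaneous threads
      constexpr int MAX_ERAS = 4;       // hazard-era entries per thread

      struct alignas(128) ThreadEntry {
          std::atomic<bool> active{false};
          std::atomic<uint64_t> eras[MAX_ERAS] = {};  // published eras, 0 = none
      };

      class ThreadRegistry {
          ThreadEntry entries[MAX_THREADS];
      public:
          // Wait-free bounded: with at most MAX_THREADS simultaneous threads,
          // the CAS can fail at most a bounded number of times before a free
          // entry is found.
          int registerThread() {
              for (int i = 0; i < MAX_THREADS; i++) {
                  bool expected = false;
                  if (entries[i].active.compare_exchange_strong(expected, true))
                      return i;  // keep this index in a thread-local variable
              }
              return -1;  // bound exceeded
          }
          void deregisterThread(int i) { entries[i].active.store(false); }
      };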

  2. I tried to implement your Hazard Eras algorithm and compare its performance with Hazard Pointers on LCRQ. I ran into a couple of issues:
    A. In Algorithm 3, line 49, you are using CLPAD (presumably standing for cache-line padding). However, you did not define it, nor do you use it consistently when accessing retiredList.
    B. I am unsure this is going to work for MAX_HES > 1, since you are storing eras rather than pointers in the hazard list, and the retired list does not contain an index.
    C. My version of the code ran slower than HP (Linux Red Hat, GCC 4.8.1, 21 x Xeon 4890). I wonder if I made a mistake. Is your code available for comparison?

    1. Hello Ofer,

      A. That's a typo in the paper. Just remove CLPAD. In our implementation we use CLPAD as "cache-line padding" to reduce false sharing.
      I forgot to make our implementation available online. Here it is:
      https://github.com/pramalhe/ConcurrencyFreaks/tree/master/CPP/papers/hazarderas

      B. It works :)
      Just check the example usage in the Maged-Harris lock-free linked list, which uses three hazard eras:
      https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/papers/hazarderas/MagedHarrisLinkedListHE.hpp

      C. See A for our version.
      Actually, applying Hazard Eras to a queue or stack is the worst possible scenario: you will gain nothing compared to Hazard Pointers, and may even lose some performance.
      HE is only valuable for data structures where there is "traversal" of nodes, which means that the same hazardous pointer is re-used multiple times.
      For lock-free queues like LCRQ, Hazard Pointers are not the bottleneck (well, it depends on the size of the ring buffer in each node), and every dequeue() publishes a new hazardous pointer and does a retire(). Having this many retires in HE means that the clockEra keeps advancing, which invalidates that cache line and forces HE readers onto the slow path.
      In summary, for queues/stacks and other highly mutative data structures without any traversal, it's better to use Hazard Pointers. For linked lists, trees, or other data structures where a lot of node traversal is done, use Hazard Eras.
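
      To illustrate the difference, here is a hedged sketch of a get_protected()-style read in the spirit of the paper's algorithm (the details are illustrative): during a traversal, the published era only needs to be re-written when the global era has advanced, so protecting the next node is usually just a load and a comparison, with no store and no fence. Hazard Pointers, by contrast, must publish the address of every node visited.

      #include <atomic>
      #include <cstdint>

      constexpr int MAX_THREADS = 128;
      constexpr int MAX_HES = 3;

      std::atomic<uint64_t> eraClock{1};               // global era clock
      std::atomic<uint64_t> he[MAX_THREADS][MAX_HES];  // published eras

      template<typename T>
      T* get_protected(int index, const std::atomic<T*>& atom, int tid) {
          uint64_t prevEra = he[tid][index].load(std::memory_order_relaxed);
          while (true) {
              T* ptr = atom.load(std::memory_order_acquire);
              uint64_t era = eraClock.load(std::memory_order_acquire);
              if (era == prevEra) return ptr;  // fast path: era unchanged, no store
              he[tid][index].store(era);       // slow path: re-publish the era
              prevEra = era;
          }
      }

      In a queue like LCRQ, almost every operation does a retire() that advances the era clock, so the fast path almost never hits and every protection pays the store plus the invalidated cache line.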

      Thanks for finding the CLPAD typo in the paper!

    2. Thanks for posting your code!
      A couple of things:
      1. There is a difference between your code and the paper: in Algorithm 3, line 59, you are using a full memory fence -
      const auto era = he[tid][ihe].load();
      while in the code base you are using acquire - line 179 -
      const auto era = he[tid][ihe].load(std::memory_order_acquire);

      2. I noticed you are padding to a size of 128 on an AMD Opteron, despite the cache line size being 64. I would very much appreciate an explanation (when I experimented with an implementation of LCRQueue, I saw a benefit from size 128 compared to 64, but have no idea why).

      3. In the code base, lines 144, 145:
      he[tid][index].store(era, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_seq_cst);

      Is there any reason you are not using the following?
      he[tid][index].store(era);

    3. Hello Ofer,

      1. Yes, the difference is fine (both are correct).
      We can relax the load to acquire instead of seq-cst, which may give a minor throughput increase on ARM and PowerPC (no difference will be seen on x86).

      2. Here is the unofficial explanation: cache lines on x86 are 64 bytes, but the prefetcher has a tendency to fetch the next cache line as well, thus causing false-sharing effects unless the variables are separated by an extra cache line.
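
      In code this is just an alignment choice, e.g. (a hedged sketch; 128 is an empirical value, not something the C++ standard guarantees):

      #include <atomic>
      #include <cstdint>

      // Pad/align to two 64-byte cache lines so the adjacent-line prefetcher
      // cannot pull a neighboring hot variable into the same pair of lines.
      struct alignas(128) PaddedAtomic {
          std::atomic<uint64_t> value{0};
      };

      PaddedAtomic counters[4];  // each element gets its own 128-byte block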

      3. This is a tricky one :)
      It's related to _how_ protectPtr() is used in the algorithm where it is being deployed. If you're using seq-cst atomics in your algorithm, then it's fine (and preferable) to use .store(era); however, if you are using acquire or relaxed loads in your algorithm and you have some expectations about the semantics of protectPtr(), then it may result in incorrect code that does not properly protect the object for that pointer.
      In other words, a subsequent seq-cst load will not be re-ordered with the seq-cst store, but an acquire or relaxed load may be, and if your algorithm has some kind of expectation about re-orderings then the end result may break those expectations.
      With the approach of a relaxed store followed by a seq-cst full fence, this issue no longer exists: a subsequent acquire or relaxed load will not be re-ordered with the store of the era, and therefore it works in every scenario.
      Maged Michael himself implements it for Hazard Pointers with the full barrier to be on the safe side:
      https://github.com/facebook/folly/blob/master/folly/experimental/hazptr/hazptr-impl.h#L236
      though (theoretically) the approach with the seq-cst store should be slightly faster on some CPUs.
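
      To make the two variants concrete, here is a hedged side-by-side sketch (publishedEra and the function names are illustrative); the reload of the source pointer is the load that must not be allowed to move above the publication of the era:

      #include <atomic>
      #include <cstdint>

      std::atomic<uint64_t> publishedEra{0};

      // Variant 1: seq-cst store. Only a subsequent *seq-cst* load is
      // guaranteed not to be re-ordered before the store.
      template<typename T>
      T* publishAndReloadSC(uint64_t era, const std::atomic<T*>& src) {
          publishedEra.store(era);                     // seq-cst store
          return src.load(std::memory_order_seq_cst);  // must also be seq-cst
      }

      // Variant 2: relaxed store + seq-cst fence. The fence keeps the store
      // ordered before *any* subsequent load, even an acquire or relaxed one.
      template<typename T>
      T* publishAndReloadFence(uint64_t era, const std::atomic<T*>& src) {
          publishedEra.store(era, std::memory_order_relaxed);
          std::atomic_thread_fence(std::memory_order_seq_cst);
          return src.load(std::memory_order_acquire);  // acquire is now enough
      }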
