Wednesday, August 30, 2017

Why is memory reclamation so important?

As it so happens, object lifetime tracking (or Memory Reclamation, as I prefer to call it) is currently the most difficult open problem in shared-memory concurrency.
If you want to design or implement a lock-free or wait-free data structure, the hardest part isn't designing/implementing the data structure itself. The hardest task is adding memory reclamation to it without reducing the progress guarantees or throughput of the data structure you designed/implemented.

Huhhhh wait, are you saying that implementing a lock-free data structure is easy when compared to memory reclamation? Wwwhhaaaat?!?
No, I'm only saying that memory reclamation is even harder  :)
Let me try to explain why; there are four reasons behind it:

Reason number 1, progress

So let's say you designed/implemented/found-on-github a non-blocking data structure that serves your needs, but it comes without memory reclamation. Unless you're in a language with a Garbage Collector, you now come face to face with the problem of adding memory reclamation to it.
Think of it this way, what is the purpose of going through all the trouble of doing lock-free/wait-free stuff if then you end up having to slap a blocking mechanism on top of it?
Obviously, it would be kind of fruitless to put a blocking memory reclamation on top of a non-blocking data structure because the end result would be blocking.
This implies that whatever memory reclamation you're going to use on your favorite data structure needs to have at least the same progress guarantees as the data structure itself. If not, all the work that went into designing/implementing a lock-free data structure goes down the drain, and you might as well go back to using a global lock, which has low overhead and provides safe memory reclamation (just grab the lock before calling delete/free on nodes of the data structure).
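To make that baseline concrete, here's a minimal sketch of global-lock reclamation in C++ (the Node type and all names are illustrative, not taken from any real code): because every operation holds the lock, no other thread can be holding a pointer to a node at the moment we delete it.

```cpp
#include <mutex>

// Illustrative node type; names are hypothetical.
struct Node {
    int   value;
    Node* next;
};

std::mutex listLock;      // one global lock protecting the whole list
Node*      head = nullptr;

// With a global lock, reclamation is trivial: while we hold the lock,
// no other thread can be traversing the list, so calling delete is safe.
void removeFirst() {
    std::lock_guard<std::mutex> guard(listLock);
    Node* curr = head;
    if (curr != nullptr) {
        head = curr->next;
        delete curr;      // safe: everyone else is blocked on listLock
    }
}
```

Simple and safe, but every access serializes on listLock, which is exactly the scalability problem that lock-free data structures exist to avoid.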

Btw, even if you have a GC, no known GC is lock-free (much less wait-free), so you don't really have non-blocking guarantees of progress. Just take a look at the previous post with Erez Petrank's presentation.

Reason number 2, throughput

When we use a non-blocking data structure, whether lock-free or wait-free, one of the goals is to achieve more performance. By performance I mean either more throughput, or lower tail latency, or both.
Typically, as developers, what we want from these data structures is higher throughput as the number of threads working on the data structure increases, and maybe we also care that the tail latency stays low. If we didn't care about one or both of these things, then we could just protect whatever single-threaded data structure we needed with one global lock and be done with it, but as everyone knows, this doesn't scale.

Progress is not enough: if the lock-free data structure you're using is fast, then you need a memory reclamation technique that is at least as fast, so that it doesn't become the bottleneck.
What is the purpose of having a super-duper fast lock-free data structure if you then slap on top of it a lock-free-but-horribly-slow memory reclamation technique that drags down performance? Well ok, you may still have the progress guarantee, which will give you low tail latency for wait-free data structures with wait-free memory reclamation, but really... if it's slow, will you use it?

Reason number 3, simplicity

When we apply memory reclamation on top of a pre-existing non-blocking data structure, we want it to be as simple as possible. The simpler it is, the less likely we are to make mistakes. Unfortunately, most memory reclamation techniques are hard to deploy on pre-existing data structures.
I previously wrote an entire post explaining just how hard it can be for Hazard Pointers and how much easier it is for URCUs:

The gist of it is that it requires expert knowledge about the data structure where the memory reclamation is being applied, so expert that most data structure designers/researchers don't bother to add memory reclamation to their data structures. They claim it's "an implementation detail", but I know better: it's because it's really hard. Don't take it the wrong way, it's not that they aren't smart enough, it's just that this kind of work is not valued in academia, so there is no incentive for them to spend precious space in their articles explaining how to make a data structure that is actually deployable in a production setting. Not to mention spending brain power thinking about it (it's true, it always gives me a headache).
And if you're convinced that things like Hazard Pointers can be tricky to use, well then, techniques like Drop The Anchor can be even more complex.

Reason number 4, resilience

Another reason to go with lock-free/wait-free data structures is that they are resilient to failures. On a shared memory system with multiple processes accessing the same data structure, even if one of the processes dies, the others will be able to make progress in their work. This is the true gem of lock-free data structures: progress in the presence of failure.
Blocking data structures (typically) do not have this property (there are exceptions though).
If we add a blocking memory reclamation (like URCU) to a lock-free/wait-free data structure, we are losing this resilience, because one dead process will prevent further memory reclamation and eventually bring down the whole system.
There goes the resilience advantage out the window.

The three families of memory reclamation

Manual memory reclamation comes in three flavors: atomic reference counting, quiescent-based, and pointer-based.
All memory reclamation techniques belong to (at least) one of these families. Many memory reclamation algorithms exist but we'll focus only on the ones that can be implemented using the C++ memory model and atomics.

Atomic Reference Counting is just what the name says: a "reference counting" technique, but with atomics. The details are a bit trickier than your usual reference counting (aka smart pointers), but it's still graspable for most. These techniques can be implemented in a wait-free way (on x86), but they have two main drawbacks: they're slow and they aren't universal.
They're slow because whenever we have to traverse a list of nodes we need to atomically increment a counter in one node and decrement a counter in another... even though we're just reading.
They're not universal for several reasons, most of them described in section 9.1 of Paul McKenney's excellent book.
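To see where the cost comes from, here is a sketch of one traversal step with per-node atomic counters (the Node layout and all names are my own illustration; real lock-free refcounting schemes need extra machinery to avoid touching the counter of a node that may already have been freed):

```cpp
#include <atomic>

// Illustrative node with an embedded atomic reference count.
struct Node {
    int              value;
    std::atomic<int> refcount{1};
    Node*            next{nullptr};
};

// One hop of a read-only traversal still performs two atomic
// read-modify-writes: acquire the next node, then release the current
// one. This is why reference counting is slow for list traversals.
Node* advance(Node* curr) {
    Node* next = curr->next;
    if (next != nullptr)
        next->refcount.fetch_add(1, std::memory_order_acquire);
    curr->refcount.fetch_sub(1, std::memory_order_release);
    return next;
}
```

Two contended read-modify-writes per node visited, even though the traversal writes nothing the application actually cares about.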

Quiescent-based is where Epoch-Based-Reclamation (EBR) and all Userspace RCUs belong. The fastest memory reclamation is here (for readers). These methods tend to have very little synchronization, are easy to deploy in nearly all algorithms (i.e. data structures) and they're easy to understand.
When everything goes well, these are the fastest way of doing memory reclamation.
The downside is that they're blocking for memory reclamation. This isn't a simple limitation of the current algorithms that someone will one day be able to crack. Nope, it's something intrinsic to this approach. Reclamation with EBR or URCU is, and always will be, inherently blocking.
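To illustrate why, here is a toy epoch sketch for a fixed number of threads (the names, the fixed-size array, and the memory orderings are my own simplifications, not any particular URCU or EBR implementation). The reclaimer cannot free anything until every reader that entered before the epoch bump has left, so a single stalled reader halts reclamation indefinitely:

```cpp
#include <atomic>
#include <cstdint>

constexpr int      kMaxThreads = 4;
constexpr uint64_t kNotReading = ~0ULL;   // sentinel: thread is quiescent

std::atomic<uint64_t> globalEpoch{0};
std::atomic<uint64_t> readerEpoch[kMaxThreads];

// Must be called once before any reader or reclaimer runs.
void initReaders() {
    for (auto& r : readerEpoch) r.store(kNotReading);
}

// Reader side: publish the epoch we entered in (very cheap, no RMW).
void readLock(int tid) {
    readerEpoch[tid].store(globalEpoch.load(std::memory_order_acquire),
                           std::memory_order_seq_cst);
}

void readUnlock(int tid) {
    readerEpoch[tid].store(kNotReading, std::memory_order_release);
}

// Reclaimer side: bump the epoch, then check every reader. Nodes retired
// before 'target' may be freed only once this returns true; if a reader
// is asleep inside its critical section, we wait forever. Hence: blocking.
bool tryAdvanceAndCheck() {
    uint64_t target = globalEpoch.fetch_add(1) + 1;
    for (int i = 0; i < kMaxThreads; i++) {
        uint64_t e = readerEpoch[i].load(std::memory_order_acquire);
        if (e != kNotReading && e < target)
            return false;   // a pre-bump reader is still inside
    }
    return true;
}
```

Notice that the read side is nearly free, which is why this family is the fastest for readers; all the waiting lives on the reclamation side.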

Pointer-based is where Hazard Pointers, Pass The Buck, Drop The Anchor and Hazard Eras belong. Pointer-based, well, they're not too hard to implement, but they can be extremely tricky to deploy. Their main problem is they're slow (though faster than atomic reference counting).
They can be completely wait-free, for readers and reclaimers. In case you didn't notice, this makes them the only family that can be used generically while providing lock-free/wait-free progress.
Hazard Eras are the fastest of these, approaching (and in some scenarios surpassing) the best URCUs, but they do it at a cost in memory usage. It's the typical trade-off memory vs CPU.
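For reference, the core read-side idiom of the pointer-based family is a publish-then-validate loop, sketched below as a simplified toy with one hazard slot per thread and hypothetical names (not Michael's full algorithm):

```cpp
#include <atomic>

struct Node { int value; std::atomic<Node*> next{nullptr}; };

constexpr int kMaxThreads = 4;
std::atomic<Node*> hazard[kMaxThreads] = {};  // one published pointer per thread
std::atomic<Node*> head{nullptr};

// Publish-then-validate: re-reading 'head' after the store closes the
// window where the node is unlinked between our load and our publish.
// This extra store and fence per dereference is why the family is slower.
Node* protectHead(int tid) {
    Node* p;
    do {
        p = head.load(std::memory_order_acquire);
        hazard[tid].store(p, std::memory_order_seq_cst);
    } while (p != head.load(std::memory_order_acquire));
    return p;   // safe to dereference until hazard[tid] is cleared
}

// Reclaimer side: a retired node may be freed only if no thread has it
// published. A dead or sleeping reader delays only the nodes it holds,
// not reclamation as a whole: this is where the resilience comes from.
bool canFree(Node* retired) {
    for (int i = 0; i < kMaxThreads; i++)
        if (hazard[i].load(std::memory_order_acquire) == retired)
            return false;
    return true;
}
```

Compare with the quiescent-based family: here a stuck reader pins a bounded number of nodes instead of stopping all reclamation.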

You may notice that most (ok, almost all) academic papers on non-blocking data structures don't mention memory reclamation anywhere, or just briefly mention it in one sentence. This is because they would need many pages to detail all the intricacies of adapting such an algorithm to have safe memory reclamation. Their typical approach is to say "That's not my problem, I'm just designing the algorithm. Memory Reclamation is an implementation detail."
... an "implementation detail", yeah, right.

Going back to our four reasons why memory reclamation is a hard problem (progress, throughput, simplicity, resilience), if we look at the three families and compare some of the techniques in each family, we see their advantages/disadvantages.

In the end there is no one-size-fits-all solution, and it's highly unlikely that there ever will be. As always in engineering (software or not), some solutions are better in some scenarios and others in different scenarios. Though I have to say that Hazard Eras works well for scenarios with traversal of linked lists.

Wrapping up

The point I'm trying to make is the following: the fact that the memory reclamation technique needs to have the same progress (or better) as the data structure where you're using it means that for fully lock-free or wait-free data structures the only true solution is pointer-based memory reclamation. Moreover, if we want the data structure to be fast, the memory reclamation needs to be just as fast (on top of having the same progress guarantees).

Let me be super clear on this. As of this writing, and likely for all eternity, the biggest bottleneck on deploying a lock-free or wait-free data structure is the memory reclamation technique.
If you think about it, a memory reclamation technique that is wait-free must itself be composed of one or several data structures, which themselves need wait-free progress, so we're stuck in a catch-22: we want to make wait-free data structures faster, but we need (other) wait-free data structures to protect those wait-free data structures.

To end on a positive note, yes, it's sad that we likely can't make memory reclamation any faster than URCU or Hazard Eras, but we can certainly make it easier to use, and that's one of the things Andreia and I are working on. More on this in a future post  ;)


  2. Great write-up! A few notes worth mentioning (IMO) below:

    Garbage collection:
    This is only true in absence of garbage collection. If you're paying garbage collection cost to begin with, this is not an issue. Also, note things such as

    This only applies to dynamic lock-free data structures:
    Or, data structures requiring memory allocation. If you're using bounded buffers and don't require memory allocation, this isn't an issue.

    Not all passive schemes are quiescent-state-based. In QSBR, quiescent points are a first-class entity while that is *not* the case in EBR (you have only 3 distinct states). Absent extensions you are unable to differentiate one quiescent point from another, which has real implications on being able to implement things like deferred / non-blocking reclamation efficiently. There are some advantages to this, a while ago Paul Khuong contributed a reference counting scheme to epoch reclamation allowing for overlapping protected sections (bounded epoch makes reference counting practical, something you can't do with unbounded quiescent points). It's pretty great and note, it doesn't require a strict notion of quiescence! You'll find this in Concurrency Kit.

    It is also worth noting that HP, etc... (pointer-based schemes) can be used to implement QSBR.

    For these reasons, I prefer to classify these techniques into two primary families (based on the semantics of the interface rather than the implementation): "passive schemes" and "active schemes". Passive schemes do not require cooperation from the algorithms requiring safe memory reclamation (EBR, QSBR, etc...) while "active schemes" do (HP, PTB, etc... in their textbook form require modification to the algorithm).

    On blocking for passive schemes:
    It is worth noting that QSBR, EBR and other "passive schemes" do have "non-blocking" (informally) interfaces (call_rcu, text-book EBR utilizes limbo lists, etc...) such that writers do not have to block on the fast path, but of course, there is the cost of memory accumulating until a grace period is detected (so, not non-blocking in the formal sense but if you've the memory to spare and sufficient forward progress, becomes a non-issue). In my implementations, I typically use a dedicated thread for forward progress / epoch detection, ensuring the writer has forward progress.

    You have much higher granularity and the lock-free progress guarantee in hazard pointers, because it is pointer-based (tied to the forward progress guarantees of the underlying algorithm).

    Under-appreciated recent development:
    An interesting thing to note here is there are schemes such as which do not necessitate the same heavy-weight barriers in the presence of invariant TSC+TSO and with a sufficiently high granularity TSC, can provide the same forward progress guarantees as hazard pointers.

    On hazard pointers being slow:
    One interesting thing to note is hazard pointers can also be used to implement "passive" schemes such as proxy collection, to give similar performance properties as EBR and QSBR.

    On real world implementations:
    It is worth mentioning :-P

    1. Hi sbahra,

      Thanks for the compliment.
      Here are some comments:

      - ThreadScan shown at SPAA is _blocking_, ask the authors if you don't believe me.

      - Sure, if we don't do memory allocation then there is no need for memory reclamation. Unfortunately there aren't that many (useful) lock-free data structures with pre-allocated nodes.

      - Sure, you can name them "Passive schemes" and "Active schemes", but that's not what everyone else in the field of concurrent data structures does. If you want to give it your own naming, that's your choice :)

      - Quiescent Based reclamation is _blocking_, period.
      This includes RCU and all EBR I've seen so far, with the only exception being Hazard Eras which is Lock-Free and uses "epochs/eras" but it's actually a pointer-based reclamation technique.

      - It is very easy to mistake a memory reclamation technique with "unbounded deferral" with one that is lock-free. In EBR and RCU we can use "unbounded deferral" by placing undeletable objects on a list. If a "reader" thread gets stuck/sleeps, it can prevent _any_ further memory reclamation, which means that this "unbounded deferral" list will grow until all memory in the system is depleted.
      This is _not_ lock-free, therefore implying that RCU and EBR and any other mechanism that does unbounded deferral is not lock-free.
      If the deferral is bounded, then it means that when the bound is reached there has to be some kind of "waiting" for the reader threads (a call to synchronize_rcu() or equivalent).
      Unless we get a server with an unbounded amount of memory (I never saw anything like that), there has to be a bound on the deferral for the technique to even be "correct". A bounded deferral implies waiting, therefore implying blocking.
      All quiescent-based reclamation implies "waiting", which implies the possibility of blocking, even if in most practical cases it might not block, because it defers its work (what you call the "fast path").

      - Could you point out a link to code or paper that shows what you mean by "hazard pointers can also be used to implement "passive" schemes such as proxy collection, to give similar performance properties as EBR and QSBR." ?
      My guess is that when used that way then they stop being lock-free... just a guess though ;)

      Good that there are other people interested in this kind of stuff.
      Did you see the talks by Erez Petrank on this topic? He is one of the world's leading experts in memory reclamation and GCs:

    2. Hi Pedro!

      I just wanted to highlight a few cases where the safe memory reclamation isn't an issue. To the uninitiated reader, they may think this problem applies in all cases, scaring them away from using valuable tools and techniques. I think your document does address this in present form though, thanks!

      - re:Taxonomy. Yeah, totally. However, just a few years ago, even QSBR and EBR were distinguished in the literature. Regardless of what "everyone else in the field of concurrent data structures does", does this taxonomy actually make sense? When you can implement EBR without a strict notion of quiescence, is it really QSBR? Is QSBR anything involving the detection of some form of "forward progress" (amortized), or is quiescence an actual part of it? (Noting that forward progress and quiescence don't go hand-in-hand necessarily)

      - ThreadScan is blocking, yes. I believe you :-P

      - Regarding blocking. Apologies here for using informal terminology (as noted in the parenthesis). No question or debate about the fact that the "QSBR" schemes are blocking. I am referring to the deferral and delegation mechanisms. As you know (and point out in your papers), this means reclamation blocks but not necessarily write-side progress. In the real world, this means RCU and other techniques can be applicable for even write-heavy workloads as long as you can ensure a bound on memory usage and an upper bound on critical section size. This actually ends up being very nice for performance if you couple it with cheap allocation schemes (bump pointer, etc...). Not all memory will be depleted because we are guaranteed to have enough memory until a grace period. Just like we don't have unbounded memory, we don't have unbounded compute; a program can only service so many requests a second. The constraint has to be a bound on read-side "protected"/critical section size though.

      - re: Hazard pointers can be used to implement QSBR and EBR. There is no paper for this. I think your intuition is in the right direction though, think proxy collection style (amortized reference counting to proxy object). Proxy object can be protected using a hazard pointer. Yes, you lose lock-free progress guarantee.

      - re: Erez, yes! I loved the survey talk especially. Since Hart's thesis, there hasn't been a decent survey. I'd love to see an updated comparative analysis though (the original paper had bugs in the implementation, utilized global locks and unfortunately, did not take into account grace period detection latency which is an important thing to measure!). Do you know of any efforts to do this?

      - What are your thoughts on ParSec? Utilizing invariant TSC for implementing SMR.

    3. Hi sbahra
      Just some more comments:

      a) Regarding Taxonomy, yes it can be confusing.
      Bottom line, all memory reclamation techniques that are in the group of EBR I've seen so far, use Quiescence, and quiescence is implicitly "blocking".
      I would say that the answer to your question is that "No, we can't implement EBR without Quiescence", but I never saw a formal proof of this ;)

      b) Yes, great performance advantage can be gained on the write-side if the reclamation is deferred or delegated.
      I don't understand what you say by "The constraint has to be a bound on read-side "protected"/critical section size". Could you elaborate?
      Be aware that bounded deferral is blocking for the write-side because (by definition) it defers until a certain bound, and then blocks when the bound on the number of deferred objects is reached.
      Unbounded deferral is worse than blocking because it requires an unbounded amount of RAM to be correct.

      c) Nope, unfortunately I haven't seen a good comparison recently.
      For manual memory reclamation most papers compare with Hazard Pointers (it's what we did for Hazard Eras). For Garbage Collectors I don't know, I'm not in the field.

      d) When you talk about ParSec you mean this?
      Didn't know about it.
      ps_quiesce() will block reclamation when it returns zero. This is actually an URCU that uses the TSC for synchronization. Very interesting!

      If you're ever in Switzerland do let me know, it would be easier to discuss this stuff in person :)

  3. For my Objective-C runtime, which is based on concurrent data structures, I needed to write a portable ABA reclamation technique.

    What I have now is surely not pointer-based, and it's not atomic reference counting either (as described). So that would leave quiescent-based. But where am I blocking?

    I don't see it. Threads only "soft lock" at the beginning and end of their lifetime. And AFAIK that's not even blocking. Arguably the memory allocation is finally done via malloc/free, which could block, but that's not part of the algorithm itself. (and following)

    It's not too complicated to understand and the randomized test code runs for days on end on ARM and x86_64 without croaking. So to me the approach appears valid. It would be interesting to get an opinion on this.

  4. Thank you for the inspiration. This article is very helpful and informative. Good job!

  9. Thanks for explaining this in so much detail. I never heard of the term memory reclamation before; I understand now.