Wednesday, August 30, 2017

Why is memory reclamation so important?

As it so happens, object lifetime tracking (or Memory Reclamation, as I prefer to call it) is currently the most difficult open problem in shared-memory concurrency.
If you want to design or implement a lock-free or wait-free data structure, the hardest part isn't designing/implementing the data structure itself. The hardest task is adding memory reclamation to it without reducing the progress guarantees or throughput of the data structure you designed/implemented.

Huhhhh wait, are you saying that implementing a lock-free data structure is easy when compared to memory reclamation? Wwwhhaaaat?!?
No, I'm only saying that memory reclamation is even harder  :)
Let me try to explain why; there are four reasons behind it:

Reason number 1, progress

So let's say you designed/implemented/found-on-github a non-blocking data structure that serves your needs, but it comes without memory reclamation. Unless you're in a language with a Garbage Collector, you now come face to face with the problem of adding memory reclamation to it.
Think of it this way, what is the purpose of going through all the trouble of doing lock-free/wait-free stuff if then you end up having to slap a blocking mechanism on top of it?
Obviously, it would be kind of fruitless to put a blocking memory reclamation on top of a non-blocking data structure because the end result would be blocking.
This implies that whatever memory reclamation you're going to use on your favorite data structure needs to have at least the same progress guarantees as the data structure itself. If not, all the work that went into designing/implementing a lock-free data structure goes down the drain, and you might as well go back to using a global lock, which has low overhead and provides safe memory reclamation (just grab the lock before calling delete/free on nodes of the data structure).
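To make that baseline concrete, here's a minimal sketch of global-lock reclamation in C++ (the Node type and all names are illustrative, not taken from any real code): because every operation holds the lock, no other thread can be holding a pointer to a node at the moment we delete it.

```cpp
#include <mutex>

// Illustrative node type; names are hypothetical.
struct Node {
    int   value;
    Node* next;
};

std::mutex listLock;      // one global lock protecting the whole list
Node*      head = nullptr;

// With a global lock, reclamation is trivial: while we hold the lock,
// no other thread can be traversing the list, so calling delete is safe.
void removeFirst() {
    std::lock_guard<std::mutex> guard(listLock);
    Node* curr = head;
    if (curr != nullptr) {
        head = curr->next;
        delete curr;      // safe: everyone else is blocked on listLock
    }
}
```

Simple and safe, but every access serializes on listLock, which is exactly the scalability problem that lock-free data structures exist to avoid.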

Btw, even if you have a GC, no known GC is lock-free (much less wait-free), so you don't really have non-blocking guarantees of progress. Just take a look at the previous post with Erez Petrank's presentation.

Reason number 2, throughput

When we use a non-blocking data structure, whether lock-free or wait-free, one of the goals is to achieve more performance. By performance I mean either more throughput, or lower tail latency, or both.
Typically, as developers, what we want from these data structures is higher throughput as the number of threads working on the data structure increases, and maybe we also care that the tail latency stays low. If we didn't care about one or both of these things, then we could just protect whatever single-threaded data structure we needed with one global lock and be done with it, but as everyone knows, this doesn't scale.

Progress is not enough: if the lock-free data structure you're using is fast, then you need a memory reclamation technique that is at least as fast, so that it doesn't become the bottleneck.
What is the purpose of having a super-duper fast lock-free data structure if you then slap on top of it a lock-free-but-horribly-slow memory reclamation technique that drags down performance? Well ok, you may still have the progress guarantee, which will give you low tail latency for wait-free data structures with wait-free memory reclamation, but really... if it's slow, will you use it?

Reason number 3, simplicity

When we apply memory reclamation on top of a pre-existing non-blocking data structure, we want it to be as simple as possible. The simpler it is, the less likely we are to make mistakes. Unfortunately, most memory reclamation techniques are hard to deploy on pre-existing data structures.
I previously wrote an entire post explaining just how hard it can be for Hazard Pointers and how much easier it is for URCUs:

The gist of it is that it requires expert knowledge about the data structure where the memory reclamation is being applied, so expert that most data structure designers/researchers don't bother to add memory reclamation to their data structures. They claim it's "an implementation detail", but I know better: it's because it's really hard. Don't take it the wrong way, it's not that they aren't smart enough, it's just that this kind of work is not valued in academia, so there is no incentive for them to spend precious space in their articles explaining how to make a data structure that is actually deployable in a production setting. Not to mention spending brain power thinking about it (it's true, it always gives me a headache).
And if you're convinced that things like Hazard Pointers can be tricky to use, well then, techniques like Drop The Anchor can be even more complex.

Reason number 4, resilience

Another reason to go with lock-free/wait-free data structures is that they are resilient to failures. On a shared memory system with multiple processes accessing the same data structure, even if one of the processes dies, the others will be able to make progress in their work. This is the true gem of lock-free data structures: progress in the presence of failure.
Blocking data structures (typically) do not have this property (there are exceptions though).
If we add a blocking memory reclamation (like URCU) to a lock-free/wait-free data structure, we are losing this resilience, because one dead process will prevent further memory reclamation and eventually bring down the whole system.
There goes the resilience advantage out the window.

The three families of memory reclamation

Manual memory reclamation comes in three flavors: atomic reference counting, quiescent-based, and pointer-based.
All memory reclamation techniques belong to (at least) one of these families. Many memory reclamation algorithms exist but we'll focus only on the ones that can be implemented using the C++ memory model and atomics.

Atomic Reference Counting is just what the name says: a "reference counting" technique, but with atomics. The details are a bit trickier than your usual reference counting (aka smart pointers), but it's still graspable for most. These techniques can be implemented in a wait-free way (on x86), but they have two main drawbacks: they're slow and they aren't universal.
They're slow because whenever we have to traverse a list of nodes we need to atomically increment a counter in one node and decrement a counter in another... even though we're just reading.
They're not universal for several reasons, most of them described in section 9.1 of Paul McKenney's excellent book.
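To see where the cost comes from, here is a sketch of one traversal step with per-node atomic counters (the Node layout and all names are my own illustration; real lock-free refcounting schemes need extra machinery to avoid touching the counter of a node that may already have been freed):

```cpp
#include <atomic>

// Illustrative node with an embedded atomic reference count.
struct Node {
    int              value;
    std::atomic<int> refcount{1};
    Node*            next{nullptr};
};

// One hop of a read-only traversal still performs two atomic
// read-modify-writes: acquire the next node, then release the current
// one. This is why reference counting is slow for list traversals.
Node* advance(Node* curr) {
    Node* next = curr->next;
    if (next != nullptr)
        next->refcount.fetch_add(1, std::memory_order_acquire);
    curr->refcount.fetch_sub(1, std::memory_order_release);
    return next;
}
```

Two contended read-modify-writes per node visited, even though the traversal writes nothing the application actually cares about.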

Quiescent-based is where Epoch-Based-Reclamation (EBR) and all Userspace RCUs belong. The fastest memory reclamation is here (for readers). These methods tend to have very little synchronization, are easy to deploy in nearly all algorithms (i.e. data structures) and they're easy to understand.
When everything goes well, these are the fastest way of doing memory reclamation.
The downside is that they're blocking for memory reclamation. This isn't a simple limitation of the current algorithms that someone will one day be able to crack. Nope, it's something intrinsic to this approach. Reclamation with EBR or URCU is, and always will be, inherently blocking.
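To illustrate why, here is a toy epoch sketch for a fixed number of threads (the names, the fixed-size array, and the memory orderings are my own simplifications, not any particular URCU or EBR implementation). The reclaimer cannot free anything until every reader that entered before the epoch bump has left, so a single stalled reader halts reclamation indefinitely:

```cpp
#include <atomic>
#include <cstdint>

constexpr int      kMaxThreads = 4;
constexpr uint64_t kNotReading = ~0ULL;   // sentinel: thread is quiescent

std::atomic<uint64_t> globalEpoch{0};
std::atomic<uint64_t> readerEpoch[kMaxThreads];

// Must be called once before any reader or reclaimer runs.
void initReaders() {
    for (auto& r : readerEpoch) r.store(kNotReading);
}

// Reader side: publish the epoch we entered in (very cheap, no RMW).
void readLock(int tid) {
    readerEpoch[tid].store(globalEpoch.load(std::memory_order_acquire),
                           std::memory_order_seq_cst);
}

void readUnlock(int tid) {
    readerEpoch[tid].store(kNotReading, std::memory_order_release);
}

// Reclaimer side: bump the epoch, then check every reader. Nodes retired
// before 'target' may be freed only once this returns true; if a reader
// is asleep inside its critical section, we wait forever. Hence: blocking.
bool tryAdvanceAndCheck() {
    uint64_t target = globalEpoch.fetch_add(1) + 1;
    for (int i = 0; i < kMaxThreads; i++) {
        uint64_t e = readerEpoch[i].load(std::memory_order_acquire);
        if (e != kNotReading && e < target)
            return false;   // a pre-bump reader is still inside
    }
    return true;
}
```

Notice that the read side is nearly free, which is why this family is the fastest for readers; all the waiting lives on the reclamation side.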

Pointer-based is where Hazard Pointers, Pass The Buck, Drop The Anchor and Hazard Eras belong. Pointer-based, well, they're not too hard to implement, but they can be extremely tricky to deploy. Their main problem is they're slow (though faster than atomic reference counting).
They can be completely wait-free, for readers and reclaimers. In case you didn't notice, this makes them the only family that can be used generically while providing lock-free/wait-free progress.
Hazard Eras are the fastest of these, approaching (and in some scenarios surpassing) the best URCUs, but they do it at a cost in memory usage. It's the typical trade-off memory vs CPU.
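For reference, the core read-side idiom of the pointer-based family is a publish-then-validate loop, sketched below as a simplified toy with one hazard slot per thread and hypothetical names (not Michael's full algorithm):

```cpp
#include <atomic>

struct Node { int value; std::atomic<Node*> next{nullptr}; };

constexpr int kMaxThreads = 4;
std::atomic<Node*> hazard[kMaxThreads] = {};  // one published pointer per thread
std::atomic<Node*> head{nullptr};

// Publish-then-validate: re-reading 'head' after the store closes the
// window where the node is unlinked between our load and our publish.
// This extra store and fence per dereference is why the family is slower.
Node* protectHead(int tid) {
    Node* p;
    do {
        p = head.load(std::memory_order_acquire);
        hazard[tid].store(p, std::memory_order_seq_cst);
    } while (p != head.load(std::memory_order_acquire));
    return p;   // safe to dereference until hazard[tid] is cleared
}

// Reclaimer side: a retired node may be freed only if no thread has it
// published. A dead or sleeping reader delays only the nodes it holds,
// not reclamation as a whole: this is where the resilience comes from.
bool canFree(Node* retired) {
    for (int i = 0; i < kMaxThreads; i++)
        if (hazard[i].load(std::memory_order_acquire) == retired)
            return false;
    return true;
}
```

Compare with the quiescent-based family: here a stuck reader pins a bounded number of nodes instead of stopping all reclamation.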

You may notice that most (ok, almost all) academic papers on non-blocking data structures don't mention memory reclamation anywhere, or just briefly mention it in one sentence. This is because they would need many pages to detail all the intricacies of adapting such an algorithm to have safe memory reclamation. Their typical approach is to say "That's not my problem, I'm just designing the algorithm. Memory Reclamation is an implementation detail."
... an "implementation detail", yeah, right.

Going back to our four reasons why memory reclamation is a hard problem (progress, throughput, simplicity, resilience), if we look at the three families and compare some of the techniques in each family, we see their advantages/disadvantages.

In the end there is no one-size-fits-all solution, and it's highly unlikely that there ever will be. As always in engineering (software or not), some solutions are better in some scenarios and others in different scenarios. Though I have to say that Hazard Eras works well for scenarios with traversal of linked lists.

Wrapping up

The point I'm trying to make is the following: the fact that the memory reclamation technique needs to have the same progress (or better) as the data structure where you're using it means that for fully lock-free or wait-free data structures the only true solution is pointer-based memory reclamation. Moreover, if we want the data structure to be fast, the memory reclamation needs to be just as fast (on top of having the same progress guarantees).

Let me be super clear on this. As of this writing, and likely for all eternity, the biggest bottleneck on deploying a lock-free or wait-free data structure is the memory reclamation technique.
If you think about it, a memory reclamation technique that is wait-free must itself be composed of one or several data structures, which themselves need wait-free progress, so we're stuck in a catch-22: we want to make wait-free data structures faster, but we need (other) wait-free data structures to protect those wait-free data structures.

To end on a positive note, yes, it's sad that we likely can't make memory reclamation any faster than URCU or Hazard Eras, but we can certainly make it easier to use, and that's one of the things Andreia and I are working on. More on this in a future post  ;)


  2. Great write-up! A few notes worth mentioning (IMO) below:

    Garbage collection:
    This is only true in absence of garbage collection. If you're paying garbage collection cost to begin with, this is not an issue. Also, note things such as

    This only applies to dynamic lock-free data structures:
    Or, data structures requiring memory allocation. If you're using bounded buffers and don't require memory allocation, this isn't an issue.

    Not all passive schemes are quiescent-state-based. In QSBR, quiescent points are a first-class entity while that is *not* the case in EBR (you have only 3 distinct states). Absent extensions you are unable to differentiate one quiescent point from another, which has real implications on being able to implement things like deferred / non-blocking reclamation efficiently. There are some advantages to this, a while ago Paul Khuong contributed a reference counting scheme to epoch reclamation allowing for overlapping protected sections (bounded epoch makes reference counting practical, something you can't do with unbounded quiescent points). It's pretty great and note, it doesn't require a strict notion of quiescence! You'll find this in Concurrency Kit.

    It is also worth noting that HP, etc... (pointer-based schemes) can be used to implement QSBR.

    For these reasons, I prefer to classify these techniques into two primary families (based on the semantics of the interface rather than the implementation): "passive schemes" and "active schemes". Passive schemes do not require cooperation from the algorithms requiring safe memory reclamation (EBR, QSBR, etc...) while "active schemes" do (HP, PTB, etc... in their textbook form require modification to the algorithm).

    On blocking for passive schemes:
    It is worth noting that QSBR, EBR and other "passive schemes" do have "non-blocking" (informally) interfaces (call_rcu, text-book EBR utilizes limbo lists, etc...) such that writers do not have to block on the fast path, but of course, there is the cost of memory accumulating until a grace period is detected (so, not non-blocking in the formal sense but if you've the memory to spare and sufficient forward progress, becomes a non-issue). In my implementations, I typically use a dedicated thread for forward progress / epoch detection, ensuring the writer has forward progress.

    You have much higher granularity and the lock-free progress guarantee in hazard pointers, because it is pointer-based (tied to the forward progress guarantees of the underlying algorithm).

    Under-appreciated recent development:
    An interesting thing to note here is there are schemes such as which do not necessitate the same heavy-weight barriers in the presence of invariant TSC+TSO and with a sufficiently high granularity TSC, can provide the same forward progress guarantees as hazard pointers.

    On hazard pointers being slow:
    One interesting thing to note is hazard pointers can also be used to implement "passive" schemes such as proxy collection, to give similar performance properties as EBR and QSBR.

    On real world implementations:
    It is worth mentioning :-P

    1. Hi sbahra,

      Thanks for the compliment.
      Here are some comments:

      - ThreadScan shown at SPAA is _blocking_, ask the authors if you don't believe me.

      - Sure, if we don't do memory allocation then there is no need for memory reclamation. Unfortunately there aren't that many (useful) lock-free data structures with pre-allocated nodes.

      - Sure, you can name them "Passive schemes" and "Active schemes", but that's not what everyone else in the field of concurrent data structures does. If you want to give it your own naming, that's your choice :)

      - Quiescent Based reclamation is _blocking_, period.
      This includes RCU and all EBR I've seen so far, with the only exception being Hazard Eras which is Lock-Free and uses "epochs/eras" but it's actually a pointer-based reclamation technique.

      - It is very easy to mistake a memory reclamation technique with "unbounded deferral" with one that is lock-free. In EBR and RCU we can use "unbounded deferral" by placing undeletable objects on a list. If a "reader" thread gets stuck/sleeps, it can prevent _any_ further memory reclamation, which means that this "unbounded deferral" list will grow until all memory in the system is depleted.
      This is _not_ lock-free, therefore implying that RCU and EBR and any other mechanism that does unbounded deferral is not lock-free.
      If the deferral is bounded, then it means that when the bound is reached there has to be some kind of "waiting" for the reader threads (a call to synchronize_rcu() or equivalent).
      Unless we get a server with an unbounded amount of memory (I never saw anything like that), there has to be a bound on the deferral for the technique to even be "correct". A bounded deferral implies waiting, therefore implying blocking.
      All quiescent-based reclamation implies "waiting", which implies the possibility of blocking, even if in most practical cases it might not block, because it defers its work (what you call the "fast path").

      - Could you point out a link to code or paper that shows what you mean by "hazard pointers can also be used to implement "passive" schemes such as proxy collection, to give similar performance properties as EBR and QSBR." ?
      My guess is that when used that way then they stop being lock-free... just a guess though ;)

      Good that there are other people interested in this kind of stuff.
      Did you see the talks by Erez Petrank on this topic? He is one of the world's leading experts in memory reclamation and GCs:

    2. Hi Pedro!

      I just wanted to highlight a few cases where the safe memory reclamation isn't an issue. To the uninitiated reader, they may think this problem applies in all cases, scaring them away from using valuable tools and techniques. I think your document does address this in present form though, thanks!

      - re:Taxonomy. Yeah, totally. However, just a few years ago, even QSBR and EBR were distinguished in the literature. Regardless of what "everyone else in the field of concurrent data structures does", does this taxonomy actually make sense? When you can implement EBR without a strict notion of quiescence, is it really QSBR? Is QSBR anything involving the detection of some form of "forward progress" (amortized), or is quiescence an actual part of it? (Noting that forward progress and quiescence don't go hand-in-hand necessarily)

      - ThreadScan is blocking, yes. I believe you :-P

      - Regarding blocking. Apologies here for using informal terminology (as noted in the parenthesis). No question or debate about the fact that the "QSBR" schemes are blocking. I am referring to the deferral and delegation mechanisms. As you know (and point out in your papers), this means reclamation blocks but not necessarily write-side progress. In the real world, this means RCU and other techniques can be applicable for even write-heavy workloads as long as you can ensure a bound on memory usage and an upper bound on critical section size. This actually ends up being very nice for performance if you couple it with cheap allocation schemes (bump pointer, etc...). Not all memory will be depleted because we are guaranteed to have enough memory until a grace period. Just like we don't have unbounded memory, we don't have unbounded compute; a program can only service so many requests a second. The constraint has to be a bound on read-side "protected"/critical section size though.

      - re: Hazard pointers can be used to implement QSBR and EBR. There is no paper for this. I think your intuition is in the right direction though, think proxy collection style (amortized reference counting to proxy object). Proxy object can be protected using a hazard pointer. Yes, you lose lock-free progress guarantee.

      - re: Erez, yes! I loved the survey talk especially. Since Hart's thesis, there hasn't been a decent survey. I'd love to see an updated comparative analysis though (the original paper had bugs in the implementation, utilized global locks and unfortunately, did not take into account grace period detection latency which is an important thing to measure!). Do you know of any efforts to do this?

      - What are your thoughts on ParSec? Utilizing invariant TSC for implementing SMR.

    3. Hi sbahra
      Just some more comments:

      a) Regarding Taxonomy, yes it can be confusing.
      Bottom line, all memory reclamation techniques that are in the group of EBR I've seen so far, use Quiescence, and quiescence is implicitly "blocking".
      I would say that the answer to your question is that "No, we can't implement EBR without Quiescence", but I never saw a formal proof of this ;)

      b) Yes, great performance advantage can be gained on the write-side if the reclamation is deferred or delegated.
      I don't understand what you say by "The constraint has to be a bound on read-side "protected"/critical section size". Could you elaborate?
      Be aware that bounded deferral is blocking for the write-side because (by definition) it defers until a certain bound, and then blocks when the bound on the number of deferred objects is reached.
      Unbounded deferral is worse than blocking because it requires an unbounded amount of RAM to be correct.

      c) Nope, unfortunately I haven't seen a good comparison recently.
      For manual memory reclamation most papers compare with Hazard Pointers (it's what we did for Hazard Eras). For Garbage Collectors I don't know, I'm not in the field.

      d) When you talk about ParSec you mean this?
      Didn't know about it.
      ps_quiesce() will block reclamation when it returns zero. This is actually an URCU that uses the TSC for synchronization. Very interesting!

      If you're ever in Switzerland do let me know, it would be easier to discuss this stuff in person :)

  3. For my Objective-C runtime, which is based on concurrent data structures, I needed to write a portable ABA reclamation technique.

    What I have now is surely not pointer-based, and it's not atomic reference counting either (as described). So that would leave quiescent-based. But where am I blocking?

    I don't see it. Threads only "soft lock" at the beginning and end of their lifetime. And AFAIK that's not even blocking. Arguably the memory allocation is finally done via malloc/free, which could block, but that's not part of the algorithm itself. (and following)

    It's not too complicated to understand and the randomized test code runs for days on end on ARM and x86_64 without croaking. So to me the approach appears valid. It would be interesting to get an opinion on this.

  4. Thank you for the inspiration. This article is very helpful and informative. Good job!

  9. Thanks for explaining this in so much detail. I never heard of the term memory reclamation before; I understand now.