Saturday, August 27, 2016

Hazard Pointers vs RCU

Hazard Pointers are hard, I mean, they're really really hard.
I don't mean they're hard to implement, they're not easy either, but ok, I can confidently say it's not hard to implement your own Hazard Pointer class.
What I mean is that it is really hard to deploy hazard pointers in your own data structure or algorithm, and even harder if you have to do it in someone else's data structure.
Anybody who says that hazard pointers are easy is just saying bullsh1t or has no idea what they're talking about...  unless it's Maged Michael saying it, then ok, I can believe it's easy for him because he invented the bloody thing  :)

Let me (try to) justify why I say this:
When you deploy the Hazard Pointers (HP) technique in a data structure, there are several steps and questions that you need to answer. For example:

- What are the pointers that are shared between threads?
The more shared pointers (and objects) there are, the more hazard pointers you will need.

- Do we dereference all of these pointers?
Maybe some pointers are shared but we don't dereference them. Maybe we don't need an hazard pointer for those.

- Do we need to worry about ABA issues for the pointers that are not dereferenced?
If yes, then we need an hazard pointer to protect them anyhow.

- When does the lifetime of the local pointer ends?
When it's no longer needed we have to clear the corresponding entry of the array of hazard pointers to allow the object to be deleted.

- Which pointers go with which indexes?
Some pointers are never used at the same time so they can "share" the same index in the array of hazard pointers.  Otherwise, a given pointer should have a unique bound to an entry of the hp array.

- How many hazard pointers are needed?
The more pointers are held at the same time, the less entries in the hp array can be shared.  If we're traversing a list, typically we won't need more than three, but if we're traversing a tree, we may need one for each level of the tree.

- For the validation step, is it ok to use the same shared variable or not?
Most of the times, validating an hazard pointer is just a matter of reading an atomic variable, storing it in the hp array, and then checking that  the same atomic variable has not changed. Sometimes, this is not enough to ensure safety and we need to read some other variable.
Figuring out what that other variable is, depends on the invariants of the data structure where you're trying to deploy HPs.

- When is it safe to retire an object/node?
We can retire a node only when that node is no longer accessible by other threads, and it's not always trivial to figure out in the code where to do it.

... and there's more, but I'll spare you.

Ohhhh wait, you thought this "Hazard Pointer thing" was just a matter of publishing some pointers so that those objects don't get deleted?
ooops, looks like it's not that simple, now does it?   :)

Memory reclamation techniques like Hazard Pointers or RCU have two sides to it. There's the "readers" which just read objects/nodes but don't delete anything, and there's the "reclaimers" which retire and delete object/nodes (and also read).
In terms of progress conditions, HP is lock-free for the readers and wait-free bounded for the reclaimers.
There are some wait-free data structures where the "helping" mechanism can be used to make the HP readers become wait-free, but this is a specially tricky thing to do and deserves its own blog post.
On the other hand, RCU (and by RCU I mean Userspace-RCU) can be wait-free population oblivious for the readers, and is (at best) blocking with starvation-freedom for the reclaimers.

Perhaps more important than their progress conditions, is the fact that in HP, each (reader) thread has to do one sequentially-consistent atomic store for every node/object that is used (an MFENCE on x86), while on RCU it has to do it two or three times, regardless of the number of number of nodes/objects that are used.
Just in case you missed it, it's two or three times for RCU versus HP doing it a million times if there are a million nodes in a list!

How about RCU, how easy is to deploy it?
It's just a matter of calling rcu_read_lock() at the beginning of each method, and rcu_read_unlock() in every exit path, and in remove() it would need a synchronize_rcu() after the rcu_read_unlock() of a successful remove, followed by a free()/delete of the node.
Obviously, such an approach makes the remove() blocking because it has to wait for all "readers".

Huhhh so what are the questions you need to ask when you want to deploy RCU?
- Where do the functions start?
Huhhh immediately after the curly braces. We can place the call to rcu_read_lock() at the beginning, or anywhere before a shared variable is read.

- Where do the functions return?
Huhhh immediately before a "return" statement or at the last curly brace if the function returns void. We can place the call to rcu_read_unlock() before  any exit path or as soon as no shared pointers are used anymore.

- When is it safe to retire an object/node?
We can retire a node only when that node is no longer accessible by other threads.
For the methods that need to delete an object/node, call synchronize_rcu() after the rcu_read_unlock() and then delete node.
That's it!
Did you notice any difference between the list of questions for RCU and the ones for Hazard Pointers?

This is where RCU's true magic comes from: its simplicity of deployment.
Sure, it's blocking for reclaimers (but it can scale by sharing grace periods).
RCU is wait-free for readers (low latency), has little synchronization overhead (high throughput), and it is easy to deploy... what more do you want?

FYI, there aren't that many concurrency constructs that give you high throughput and low latency.
Sure, it's just for the readers, and the reclaimers get blocked, but there are no silver bullets, only engineering solutions that fit better depending on the scenario, and RCU is the best fit in more scenarios than you may think  ;)

Friday, August 19, 2016

Userspace RCU in GlusterD

Here is a great example of how RCU can be used in a project to improve scalability and throughput, namely in GlusterD:

Atin Mukherjee talks about several interesting details on how to use RCU.
My only quirk is on slide 8 "Different Locking Primitives", RCU is not a locking primitive in the sense that it doesn't provide mutual exclusion by itself, but it's hard to explain exactly what it does, so it's ok to put it on the same slide.

He even talks about a bug in their code where they did something like:
    // do some stuff
    synchronize_rcu();   // This will deadlock

This happens because we can't call synchronize_rcu() from within a block of code protected with rcu_read_lock().

It's interesting that they went with the Bullet-Proof RCU variant, which just shows that most software engineers are willing to sacrifice a small amount of performance if it gives them more flexibility and ease-of-use (myself included).

Friday, August 12, 2016

Concurrency talks at CppCon 2016

The program for CppCon 2016 is out and there are some really interesting talks coming up.
On the topic of Concurrency, I'm looking forward to these three talks:
- JF Bastien on "No Sane Compiler would optimize atomics"
- Hans Boehm on "Using weakly ordered C++ atomics correctly"
- Paul McKenney on "A lock-free concurrency toolkit for deferred reclamation and optimistic speculation"

JF Bastien has some really good insights into optimizing code with atomics and the different memory orders, so I really want to see what he's going to say this year.
If you don't know who he is, then take a look at his interview from last year's CppCon or from CppCast:
and a preview of what his presentation will look like:
and if you want to see some funny comments then check the last 10 minutes of this talk where JF Bastien gives some comments to the speaker:

As for Hans Boehm, well, if you're a regular reader from this blog I shouldn't need to introduce him, but in case you just landed here by accident, then he's one of the contributors to the C++ Memory Model... and the Java Memory Model.
This year he's at CppCon and he's going to talk about common mistakes with non seq-cst atomics, so I definitely want to see his talk.
Here is some of the content he's going to go over:
Funny thing, I was considering giving a talk this year precisely about this topic but I'm not able to go to CppCon so I didn't even submit the proposal. Obviously, I wouldn't expect my talk to be as good as his is going to be, I was thinking of focusing more on examples and what kind of optimizations are safe to do when writing an algorithm using C++ atomics.
... ohh well, maybe I can still propose it next year as a more "hands-on-approach".

And finally, Paul McKenney is going to to give two talks, with the first one being about the Issaquah challenge (which he talked about a couple years ago so I hope he's going to go a bit more in depth this time)
and the second talk is going to be about safe memory reclamation in concurrency.
It looks like Paul, Michael Wong and Maged Michael are trying to add techniques for safe memory reclamation to the C++ standard, namely, adding RCU and Hazard Pointers.
For those of you who don't know what this means, let me just say that most experts in Concurrency (shared memory concurrency) say that Memory Reclamation is the toughest problem in Concurrency. This includes people like Maurice Herlihy, Erez Petrank, and many others.
If Memory Reclamation is the toughest problem in Concurrency, and Concurrency is the toughest way of programming, then Memory Reclamation must be the hardest thing you can do in Programming.

There are three main ways to do memory reclamation in concurrent (non-blocking) applications: Atomic Reference Counters, RCU, and Hazard Pointers (HP).
They have different properties, like different progress conditions for Readers and Reclaimers. By "Readers" I mean the threads calling methods that don't delete anything in the data structure, and "Reclaimers" are the ones doing the deletions.

Atomic Reference Counters are already in the next standard and should be part of C++17.
Herb Sutter was the one that proposed them and I'm glad he did, but when it comes to concurrent data structures, it is pretty much useless because it is only non blocking in certain situations, and it is the slowest of the three methods for Readers because when you're traversing a list with nodes you need to do two atomic_fetch_add() per node that is traversed.
If you want to understand what situations are those, then take a look at chapter 9.1 of Paul Mckenney's book in this link

RCU is better known for its implementation in the Linux Kernel, but there are many userspace implementations of RCU. We used some of them in the Left-Right technique.
RCU is wait-free (population oblivious) and super fast for Readers, but it is blocking for Reclaimers which means that if you want high throughput or low latency even when doing removal of nodes/objects from a data structure, this is isn't going to give you what you want.
Even with delegation, where the last reader will delete the node/object, it doesn't change the progress condition for the Reclaimers, and it even makes the Readers worse off because now they may have to delete an unbounded number of objects, thus making the Readers wait-free unbounded, which has much worse tail latency.
RCU is the simplest to use of the three techniques because its semantics for the Readers are similar to those of a Reader-Writer Lock.

Hazard Pointers are typically lock-free for the Readers (wait-free in some situations) and wait-free bounded for the Reclaimers.
The main drawback is that they're slow for Readers because we have to do one sequentially consistent store for every node that is traversed.
It's not as slow as Atomic Reference Counter is for Readers, but it is slower than RCU which pretty much has no synchronization apart from the initial call to rcu_read_lock() and the final call to rcu_read_unlock().
Of the three techniques, this is the hardest to use, not because it is hard per-say, but because it requires a very deep knowledge of the algorithm where you're applying hazard pointers to. If you want to make it wait-free then the difficulty goes up a notch, and it typically requires the algorithm to be designed with HPs in mind.
One of the untold advantages of HPs is that they are memory bounded. This means that at any given time you can not have more than (MAX_THREADS x MAX_HPS) nodes/objects in the retired lists waiting to be deleted (for an 'R' factor of zero). RCU has no such bound and if you have memory constraints this can become problematic and it can impact your tail latency for the Reclaimers.

There are many other methods for safe memory reclamation, but they are either variations on these ideas or they rely on some architecture/OS specific functionality, like needing Posix signals, or some CPU instructions that only exist on some hardware. Those are "outside" of the C++ Memory Model, i.e. you can't implement them using only atomics and the Memory Model and therefore, their implementations are not portable.

Paul McKenney is one of the authors of RCU.
Hazard Pointers are the brain-child of Maged Michael.
Obviously, there is a bit of friendly rivalry between the authors of the two best methods for non-blocking memory reclamation  :)
Having the two of them together in this endeavor to provide safe memory reclamation in C++, is like having Professor Xavier and Magneto joining forces to bring down Apocalypse  :)
Yes, it's really _that_ cool!

Sure, I'm a fan boy and I'm exaggerating. Just like Andreia was saying the other day, you can implement yourself Hazard Pointers and RCU, efficiently, in C++, and you don't really need library support for that, and in fact _we_ have done it.
However, I'm not going to say it was easy (it wasn't, specially the hazard pointers stuff) and most people just want to use a pre-existing library/API which they know is correct and fast for most usages, and that's why having Paul, Maged Michael, and Michael Wong behind this is a great thing for C++ and its community.
If you care about Concurrency and if you're at CppCon this year, make sure to attend this talk, it should be epic!

Saturday, August 6, 2016

Throughput vs Latency and Lock-Free vs Wait-Free

On the previous post we showed an excellent presentation by Cliff Click, and one of the things he mentions is that there is performance for throughput and performance for latency, which really resonated with me.

When people talk about performance, they're usually just interested in throughput. This is fine, but throughput is just one of the vectors of performance, it's just a number (for a particular workload).
Having a number is nice, it means we can quantify, we can measure, and therefore, we can compare.

You know what is much better than having a number?
Having a distribution, namely, the latency distribution.

How do you know if a certain technique or algorithm X is better than algorithm Y?
Typically, you measure, and typically, you do it by measuring the throughput of X and Y in different scenarios or workloads.
If you're really sophisticated, then you might even do several runs, and choose the median of those runs, and/or compute some error bars based on the standard deviation or the min/max of the sample of runs.
This is fine (it's already one step ahead of just guessing which algorithm is better), but we can do more:
Instead of showing a number (plus or minus an error) we can show a distribution. And when it comes to concurrent algorithms, the best is to show the latency distribution.

And why is that, you may ask?
Because the inverse of the mean of the latency distribution is...
... I'll give you a few seconds to think about it...
... yes you guessed it, the throughput!

Looking at a latency distribution has way more information than looking at the throughput, the same way that looking at a statistical distribution has way more information than the mean of the same distribution, why, because it's just a number.
You know the old saying of "one picture is worth a thousand words"? Well then, one latency distribution plot is worth a thousand throughput measurements, and this last statement is literally true if the plot is an histogram with 1000 entries... lol

Why is this important?
For two reasons:

First, if you care about throughput, and only about throughput, then you care about the latency distribution but in a very specific way, namely, you want to know only the mean of the distribution. That's fine, but then how do you compare two algorithms, you are going to do several measures, right?
What is the error in this measurement?
How to compute error bars?
Do you use a standard deviation or something else?
Are the error bars assuming a Normal distribution?
Is it safe to assume a Normal distribution?
Why now just show the latency distribution?

Second, there is typically a tradeoff when it comes to throughput vs latency (not always, but typically). Going back to our example, perhaps algorithm X provides better throughput than algorithm Y, but Y has a better tail latency than X.
Which one is "better"?
There is no right answer, it depends on how you define better: If you just want raw throughput and you don't care that 1% of the cases will take an enormous amount of time to complete, then X is better. If you're willing to sacrifice throughput to have more fairness and better (lower) latency at the tail of the latency distribution, then algorithm Y is better.

One concrete example is when you look at a lock-free queue (like the one by Maged Michael and Michael Scott) vs a wait-free queue (like the one by Alex Kogan and Erez Petrank):

In these plots, lower is better.
The plot for the median (this is the median, not the mean) shows that the Michael-Scott queue is better than Kogan-Petrank (about 10x better), but when it comes to the tail latency, namely at the 99.9%, the Kogan-Petrank is better.
By 99.9% we mean that 99.9% of the calls to enqueue() take x microseconds or less to complete, and therefore, 0.1% of the calls take more than x microseconds to complete, where x is the number on the vertical axis of the plots.

Just like it doesn't make sense to say that Throughput is better than (tail) Latency, it is also meaningless to say that a Lock-Free algorithm that has higher throughput than a similar Wait-Free algorithm is better. The only exception is if they both have the same throughput or the lock-free algorithm has a lower throughput than the wait-free one, in that case, by definition, the wait-free is better, because it's better in all metrics.

Fedor Pikus is going to talk about lock-free again this year at CppCon, so let's see what he has to say about this topic.
I talked about latency at the beginning of my presentation at CppCon last year in case you want to learn more.