Saturday, August 27, 2016

Hazard Pointers vs RCU

Hazard Pointers are hard, I mean, they're really really hard.
I don't mean they're hard to implement (they're not exactly easy either), but ok, I can confidently say it's not hard to implement your own Hazard Pointer class.
What I mean is that it is really hard to deploy hazard pointers in your own data structure or algorithm, and even harder if you have to do it in someone else's data structure.
Anybody who says that hazard pointers are easy is just saying bullsh1t or has no idea what they're talking about...  unless it's Maged Michael saying it, then ok, I can believe it's easy for him because he invented the bloody thing  :)

Let me (try to) justify why I say this:
When you deploy the Hazard Pointers (HP) technique in a data structure, there are several steps and questions that you need to answer. For example:

- What are the pointers that are shared between threads?
The more shared pointers (and objects) there are, the more hazard pointers you will need.

- Do we dereference all of these pointers?
Maybe some pointers are shared but we never dereference them. Maybe we don't need a hazard pointer for those.

- Do we need to worry about ABA issues for the pointers that are not dereferenced?
If yes, then we need a hazard pointer to protect them anyway.

- When does the lifetime of the local pointer end?
When it's no longer needed, we have to clear the corresponding entry in the array of hazard pointers to allow the object to be deleted.

- Which pointers go with which indexes?
Some pointers are never used at the same time so they can "share" the same index in the array of hazard pointers. Otherwise, a given pointer should be bound to its own entry of the hp array.

- How many hazard pointers are needed?
The more pointers are held at the same time, the fewer entries in the hp array can be shared. If we're traversing a list, typically we won't need more than three, but if we're traversing a tree, we may need one for each level of the tree.

- For the validation step, is it ok to re-read the same shared variable, or not?
Most of the time, validating a hazard pointer is just a matter of reading an atomic variable, storing it in the hp array, and then checking that the same atomic variable has not changed (see the sketch right after this list). Sometimes this is not enough to ensure safety and we need to read some other variable.
Figuring out what that other variable is depends on the invariants of the data structure where you're trying to deploy HPs.

- When is it safe to retire an object/node?
We can retire a node only when that node is no longer accessible by other threads, and it's not always trivial to figure out in the code where to do it.

... and there's more, but I'll spare you.
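
To make the basic pattern concrete, here is a minimal sketch (in C++, my own illustration, not Maged Michael's reference code) of the publish-and-validate step for a single hazard pointer; the Node type and the hp parameter (this thread's slot in the hazard pointer array) are assumptions made up for the example:

    #include <atomic>

    struct Node {
        int key;
        std::atomic<Node*> next;
    };

    // Publish the pointer read from 'src' in this thread's hazard pointer slot 'hp'
    // and validate it. Loops (lock-free, not wait-free) until the published value
    // matches what is currently in 'src', at which point it is safe to dereference.
    Node* protect(std::atomic<Node*>& hp, std::atomic<Node*>& src) {
        Node* ptr = src.load();
        while (true) {
            hp.store(ptr);                 // publish: seq-cst store (an MFENCE on x86)
            Node* again = src.load();      // validate: re-read the same shared variable
            if (ptr == again) return ptr;  // unchanged => the node is now protected
            ptr = again;                   // changed => publish the new pointer and retry
        }
    }

And this is only the simple case where validating against the same shared variable is enough.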

<sarcasm>
Ohhhh wait, you thought this "Hazard Pointer thing" was just a matter of publishing some pointers so that those objects don't get deleted?
ooops, looks like it's not that simple, now does it?   :)
</sarcasm>

Memory reclamation techniques like Hazard Pointers or RCU have two sides to them. There are the "readers", which just read objects/nodes but don't delete anything, and the "reclaimers", which retire and delete objects/nodes (and also read).
In terms of progress conditions, HP is lock-free for the readers and wait-free bounded for the reclaimers.
There are some wait-free data structures where the "helping" mechanism can be used to make the HP readers become wait-free, but this is an especially tricky thing to do and deserves its own blog post.
On the other hand, RCU (and by RCU I mean Userspace-RCU) can be wait-free population oblivious for the readers, and is (at best) blocking with starvation-freedom for the reclaimers.

Perhaps more important than their progress conditions is the fact that with HP, each (reader) thread has to do one sequentially consistent atomic store (an MFENCE on x86) for every node/object that is used, while with RCU it has to do it only two or three times per operation, regardless of the number of nodes/objects that are used.
Just in case you missed it, it's two or three times for RCU versus HP doing it a million times if there are a million nodes in a list!

How about RCU, how easy is it to deploy?
It's just a matter of calling rcu_read_lock() at the beginning of each method and rcu_read_unlock() on every exit path; remove() additionally needs a synchronize_rcu() after the rcu_read_unlock() of a successful removal, followed by a free()/delete of the node.
Obviously, such an approach makes the remove() blocking because it has to wait for all "readers".
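
Here is what that looks like in code, a hedged sketch using the userspace-RCU (liburcu) calls mentioned above; the Node type and the findAndUnlink() helper are made up for the example and are not part of liburcu, and with this flavor each reader thread also needs to call rcu_register_thread() at startup (omitted here):

    #include <urcu.h>   // liburcu: rcu_read_lock(), rcu_read_unlock(), synchronize_rcu()

    struct Node { int key; Node* next; };

    // Hypothetical helper: searches the list and unlinks the matching node (using
    // whatever writer-side synchronization the data structure already has),
    // returning it, or nullptr if the key was not found. Other readers may still
    // be traversing the unlinked node at this point.
    Node* findAndUnlink(int key);

    bool remove(int key) {
        rcu_read_lock();                  // read-side critical section starts
        Node* node = findAndUnlink(key);
        rcu_read_unlock();                // read-side critical section ends
        if (node == nullptr) return false;
        synchronize_rcu();                // wait for all pre-existing readers to finish
        delete node;                      // no reader can still hold a pointer to it
        return true;
    }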



Huhhh so what are the questions you need to ask when you want to deploy RCU?
- Where do the functions start?
Huhhh, immediately after the opening curly brace. We can place the call to rcu_read_lock() at the beginning, or anywhere before a shared variable is read.

- Where do the functions return?
Huhhh immediately before a "return" statement or at the last curly brace if the function returns void. We can place the call to rcu_read_unlock() before  any exit path or as soon as no shared pointers are used anymore.

- When is it safe to retire an object/node?
We can retire a node only when that node is no longer accessible by other threads.
For the methods that need to delete an object/node, call synchronize_rcu() after the rcu_read_unlock() and then delete node.
 
That's it!
Did you notice any difference between the list of questions for RCU and the ones for Hazard Pointers?

This is where RCU's true magic comes from: its simplicity of deployment.
Sure, it's blocking for reclaimers (but it can scale by sharing grace periods).
RCU is wait-free for readers (low latency), has little synchronization overhead (high throughput), and it is easy to deploy... what more do you want?



FYI, there aren't that many concurrency constructs that give you high throughput and low latency.
Sure, it's just for the readers, and the reclaimers get blocked, but there are no silver bullets, only engineering solutions that fit better depending on the scenario, and RCU is the best fit in more scenarios than you may think  ;)

Friday, August 19, 2016

Userspace RCU in GlusterD

Here is a great example of how RCU can be used in a project to improve scalability and throughput, namely in GlusterD:


Atin Mukherjee talks about several interesting details on how to use RCU.
My only quibble is with slide 8 "Different Locking Primitives": RCU is not a locking primitive in the sense that it doesn't provide mutual exclusion by itself, but it's hard to explain exactly what it does, so it's ok to put it on the same slide.


He even talks about a bug in their code where they did something like:
    rcu_read_lock();
    // do some stuff
    synchronize_rcu();   // This will deadlock
    rcu_read_unlock();

This happens because we can't call synchronize_rcu() from within a block of code protected by rcu_read_lock(): synchronize_rcu() waits for all pre-existing read-side critical sections to complete, including the one it was called from, so it never returns.
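
A minimal sketch of a corrected ordering (my own illustration, not necessarily the exact fix they applied) is to close the read-side critical section before waiting for the grace period:

    rcu_read_lock();
    // do some stuff
    rcu_read_unlock();
    synchronize_rcu();   // OK now: the caller's read-side critical section is already closed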

It's interesting that they went with the Bullet-Proof RCU variant, which just shows that most software engineers are willing to sacrifice a small amount of performance if it gives them more flexibility and ease-of-use (myself included).

Friday, August 12, 2016

Concurrency talks at CppCon 2016

The program for CppCon 2016 is out and there are some really interesting talks coming up.
On the topic of Concurrency, I'm looking forward to these three talks:
- JF Bastien on "No Sane Compiler would optimize atomics"
- Hans Boehm on "Using weakly ordered C++ atomics correctly"
- Paul McKenney on "A lock-free concurrency toolkit for deferred reclamation and optimistic speculation"

JF Bastien has some really good insights into optimizing code with atomics and the different memory orders, so I really want to see what he's going to say this year.
If you don't know who he is, then take a look at his interview from last year's CppCon or from CppCast:
http://concurrencyfreaks.blogspot.pt/2015/11/interviews-for-channel9.html
http://cppcast.com/2015/07/jf-bastien/
and a preview of what his presentation will look like:
https://github.com/boostcon/cppnow_presentations_2016/blob/master/03_friday/no_sane_compiler_would_optimize_atomics.pdf
and if you want to see some funny comments then check the last 10 minutes of this talk where JF Bastien gives some comments to the speaker:
https://www.youtube.com/watch?v=k12BJGSc2Nc

As for Hans Boehm, well, if you're a regular reader from this blog I shouldn't need to introduce him, but in case you just landed here by accident, then he's one of the contributors to the C++ Memory Model... and the Java Memory Model.
This year he's at CppCon and he's going to talk about common mistakes with non seq-cst atomics, so I definitely want to see his talk.
Here is some of the content he's going to go over:
https://docs.google.com/presentation/d/1Eapu4G6QcetO9mOeZSQJAuvxB8kK3pHQwxWbHNOLk8w/pub?start=false&loop=false&delayms=3000#slide=id.p
Funny thing, I was considering giving a talk this year precisely about this topic but I'm not able to go to CppCon so I didn't even submit the proposal. Obviously, I wouldn't expect my talk to be as good as his is going to be; I was thinking of focusing more on examples and on what kind of optimizations are safe to do when writing an algorithm using C++ atomics.
... ohh well, maybe I can still propose it next year as a more "hands-on-approach".

And finally, Paul McKenney is going to give two talks, with the first one being about the Issaquah challenge (which he talked about a couple of years ago, so I hope he's going to go a bit more in depth this time)
https://www.youtube.com/watch?v=1Q-RH2tiyt0
and the second talk is going to be about safe memory reclamation in concurrency.
It looks like Paul, Michael Wong and Maged Michael are trying to add techniques for safe memory reclamation to the C++ standard, namely, adding RCU and Hazard Pointers.
For those of you who don't know what this means, let me just say that most experts in Concurrency (shared memory concurrency) say that Memory Reclamation is the toughest problem in Concurrency. This includes people like Maurice Herlihy, Erez Petrank, and many others.
If Memory Reclamation is the toughest problem in Concurrency, and Concurrency is the toughest way of programming, then Memory Reclamation must be the hardest thing you can do in Programming.

There are three main ways to do memory reclamation in concurrent (non-blocking) applications: Atomic Reference Counters, RCU, and Hazard Pointers (HP).
They have different properties, like different progress conditions for Readers and Reclaimers. By "Readers" I mean the threads calling methods that don't delete anything in the data structure, and "Reclaimers" are the ones doing the deletions.

Atomic Reference Counters are already in the next standard and should be part of C++17.
Herb Sutter was the one who proposed them and I'm glad he did, but when it comes to concurrent data structures they are pretty much useless, because they are only non-blocking in certain situations, and they are the slowest of the three methods for Readers: when you're traversing a list you need to do two atomic_fetch_add() per node that is traversed.
If you want to understand what those situations are, take a look at chapter 9.1 of Paul McKenney's book at this link https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e1.pdf
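
To see where those two atomic operations come from, here is a rough cost sketch (my own illustration with a made-up Node/refcount, not the API being proposed for C++); it shows only the counting cost and glosses over the extra tricks needed to make acquiring the next node safe against concurrent reclamation:

    #include <atomic>

    struct Node {
        int key;
        std::atomic<long>  refcount;
        std::atomic<Node*> next;
    };

    // Advance from 'curr' to the next node during a traversal: one atomic increment
    // to acquire the next node plus one atomic decrement to release the current one,
    // i.e. two atomic read-modify-write operations per node traversed.
    Node* advance(Node* curr) {
        Node* next = curr->next.load();
        if (next != nullptr) next->refcount.fetch_add(1);  // acquire next
        curr->refcount.fetch_sub(1);                       // release curr
        return next;
    }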

RCU is better known for its implementation in the Linux Kernel, but there are many userspace implementations of RCU. We used some of them in the Left-Right technique.
RCU is wait-free (population oblivious) and super fast for Readers, but it is blocking for Reclaimers, which means that if you want high throughput or low latency even when removing nodes/objects from a data structure, this isn't going to give you what you want.
Even with delegation, where the last reader will delete the node/object, it doesn't change the progress condition for the Reclaimers, and it even makes the Readers worse off because now they may have to delete an unbounded number of objects, thus making the Readers wait-free unbounded, which has much worse tail latency.
RCU is the simplest to use of the three techniques because its semantics for the Readers are similar to those of a Reader-Writer Lock.

Hazard Pointers are typically lock-free for the Readers (wait-free in some situations) and wait-free bounded for the Reclaimers.
The main drawback is that they're slow for Readers because we have to do one sequentially consistent store for every node that is traversed.
It's not as slow as Atomic Reference Counter is for Readers, but it is slower than RCU which pretty much has no synchronization apart from the initial call to rcu_read_lock() and the final call to rcu_read_unlock().
Of the three techniques, this is the hardest to use, not because it is hard per se, but because it requires a very deep knowledge of the algorithm you're applying hazard pointers to. If you want to make it wait-free then the difficulty goes up a notch, and it typically requires the algorithm to be designed with HPs in mind.
One of the untold advantages of HPs is that they are memory bounded. This means that at any given time you cannot have more than (MAX_THREADS x MAX_HPS) nodes/objects in the retired lists waiting to be deleted (for an 'R' factor of zero). RCU has no such bound, and if you have memory constraints this can become problematic and it can impact your tail latency for the Reclaimers.

There are many other methods for safe memory reclamation, but they are either variations on these ideas or they rely on some architecture/OS specific functionality, like needing Posix signals, or some CPU instructions that only exist on some hardware. Those are "outside" of the C++ Memory Model, i.e. you can't implement them using only atomics and the Memory Model and therefore, their implementations are not portable.

Paul McKenney is one of the authors of RCU.
Hazard Pointers are the brain-child of Maged Michael.
Obviously, there is a bit of friendly rivalry between the authors of the two best methods for non-blocking memory reclamation  :)
Having the two of them together in this endeavor to provide safe memory reclamation in C++, is like having Professor Xavier and Magneto joining forces to bring down Apocalypse  :)
Yes, it's really _that_ cool!

Sure, I'm a fanboy and I'm exaggerating. Just like Andreia was saying the other day, you can implement Hazard Pointers and RCU yourself, efficiently, in C++, and you don't really need library support for that; in fact _we_ have done it.
However, I'm not going to say it was easy (it wasn't, especially the hazard pointers stuff), and most people just want to use a pre-existing library/API which they know is correct and fast for most usages, and that's why having Paul, Maged Michael, and Michael Wong behind this is a great thing for C++ and its community.
If you care about Concurrency and if you're at CppCon this year, make sure to attend this talk, it should be epic!

Saturday, August 6, 2016

Throughput vs Latency and Lock-Free vs Wait-Free

In the previous post we showed an excellent presentation by Cliff Click, and one of the things he mentions is that there is performance for throughput and performance for latency, which really resonated with me.

When people talk about performance, they're usually just interested in throughput. This is fine, but throughput is just one of the vectors of performance, it's just a number (for a particular workload).
Having a number is nice, it means we can quantify, we can measure, and therefore, we can compare.

You know what is much better than having a number?
Having a distribution, namely, the latency distribution.

How do you know if a certain technique or algorithm X is better than algorithm Y?
Typically, you measure, and typically, you do it by measuring the throughput of X and Y in different scenarios or workloads.
If you're really sophisticated, then you might even do several runs, and choose the median of those runs, and/or compute some error bars based on the standard deviation or the min/max of the sample of runs.
This is fine (it's already one step ahead of just guessing which algorithm is better), but we can do more:
Instead of showing a number (plus or minus an error) we can show a distribution. And when it comes to concurrent algorithms, the best is to show the latency distribution.

And why is that, you may ask?
Because the inverse of the mean of the latency distribution is...
... I'll give you a few seconds to think about it...
... yes you guessed it, the throughput!

Looking at a latency distribution gives way more information than looking at the throughput, the same way that looking at a statistical distribution gives way more information than looking at its mean, which is just a single number.
You know the old saying that "one picture is worth a thousand words"? Well then, one latency distribution plot is worth a thousand throughput measurements, and this last statement is literally true if the plot is a histogram with 1000 entries... lol
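
As a small illustration (made-up numbers, my own sketch) of why the distribution subsumes the throughput: the per-thread throughput is just the inverse of the mean of the latency samples, while the tail is read straight off the sorted samples.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        // One latency sample per operation, in microseconds (made-up values)
        std::vector<double> lat = {1.2, 1.1, 1.3, 1.4, 1.2, 1.3, 1.1, 1.2, 1.3, 250.0};
        double sum = 0;
        for (double v : lat) sum += v;
        double meanUs = sum / lat.size();
        double opsPerSec = 1e6 / meanUs;   // throughput (per thread) = 1 / mean latency
        std::sort(lat.begin(), lat.end());
        double p999 = lat[(size_t)(0.999 * (lat.size() - 1))];
        std::printf("mean=%.2f us  throughput=%.0f ops/s  p99.9=%.1f us\n",
                    meanUs, opsPerSec, p999);
        return 0;
    }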

Why is this important?
For two reasons:

First, if you care about throughput, and only about throughput, then you care about the latency distribution but in a very specific way, namely, you only want to know the mean of the distribution. That's fine, but then how do you compare two algorithms? You are going to do several measurements, right?
What is the error in this measurement?
How to compute error bars?
Do you use a standard deviation or something else?
Are the error bars assuming a Normal distribution?
Is it safe to assume a Normal distribution?
Why not just show the latency distribution?


Second, there is typically a tradeoff when it comes to throughput vs latency (not always, but typically). Going back to our example, perhaps algorithm X provides better throughput than algorithm Y, but Y has a better tail latency than X.
Which one is "better"?
There is no right answer, it depends on how you define better: If you just want raw throughput and you don't care that 1% of the cases will take an enormous amount of time to complete, then X is better. If you're willing to sacrifice throughput to have more fairness and better (lower) latency at the tail of the latency distribution, then algorithm Y is better.

One concrete example is when you look at a lock-free queue (like the one by Maged Michael and Michael Scott) vs a wait-free queue (like the one by Alex Kogan and Erez Petrank):





In these plots, lower is better.
The plot for the median (this is the median, not the mean) shows that the Michael-Scott queue is better than Kogan-Petrank (about 10x better), but when it comes to the tail latency, namely at the 99.9%, the Kogan-Petrank is better.
By 99.9% we mean that 99.9% of the calls to enqueue() take x microseconds or less to complete, and therefore, 0.1% of the calls take more than x microseconds to complete, where x is the number on the vertical axis of the plots.


Just like it doesn't make sense to say that Throughput is better than (tail) Latency, it is also meaningless to say that a Lock-Free algorithm that has higher throughput than a similar Wait-Free algorithm is better. The only exception is when they both have the same throughput, or the lock-free algorithm has a lower throughput than the wait-free one; in that case, by definition, the wait-free one is better, because it's better in all metrics.


Fedor Pikus is going to talk about lock-free again this year at CppCon, so let's see what he has to say about this topic.
I talked about latency at the beginning of my presentation at CppCon last year in case you want to learn more.

Saturday, July 30, 2016

Cliff Click on VM Design Choices

Whenever I see a presentation online by Cliff Click I just jump at it. There is just so much content in a single presentation, maybe it's because he speaks really fast, but maybe, just maybe, it's because this guy is really good.  :-O


If you ever wondered how to make your own JVM, how HotSpot works, how a SafePoint is implemented, some little known truths about Garbage Collection in the JVM (did you know there is a lock on the thread's stack that the GC/thread acquire?), even how to make a scalable lock, and more, much much more, then watch this video!
I'm pretty sure you'll learn something, at least some detail about the inner workings of a JVM... huhhh unless of course you are Cliff Click yourself, or you worked with him on this stuff, or you watched one of these presentations live, in which case you already know everything that's being said. lol

The part about GCs was amazing, and so was the one about inlining.

One thing I really like is that he's one of the very few people who make a distinction between performance for throughput and performance for latency (Andreia and I just call them throughput and latency).

I had to watch it twice and a third time will be needed to get all the details. And if you missed some details, try watching the rerun of the same presentation here:



Really cool stuff

Friday, July 22, 2016

The Dangers of Optimistic Concurrency

We talked about optimistic concurrency previously, more specifically, in the context of StampedLock tryOptimisticRead().

In this post we're going to give a few more examples of what can go wrong when using tryOptimisticRead() and optimistic concurrency in general.
Don't get me wrong, I think that the fact that there is an RW-Lock in Java that provides this kind of API is a great addition, it's just that most developers are not aware of the dangers of using this API.
I personally find optimistic concurrency extremely useful when used in a highly controlled environment, i.e. inside the code of a data structure that you designed and implemented and that nobody else is going to touch. This is unlikely if you work for a large organization (as I do), where your code is public and there will be other people changing it over time, and that's problematic for optimistic concurrency because it is very hard to reason about, which means it is very easy to make a small change that causes a nasty bug.



One of the arguments that the users of optimistic concurrency give (at least in Java) is that you can catch all kinds of exceptions and retry the operation in case an exception is thrown. This has its problems, because the code being called between tryOptimisticRead() and validate() may actually throw exceptions as part of its normal behavior, and then you don't know how to distinguish an "expected" exception from an "unexpected" exception caused by some weird re-ordering of code:

   Result example1() { // A read-only (optimistic) method
     long stamp = sl.tryOptimisticRead();
     Result result = null;
     try {
         result = mayThrowAnException(); // throwing is a "normal" behavior
     } catch (Exception e) {
         // Ooops, an exception was thrown... should we recompute result or not?
     }
     if (!sl.validate(stamp)) {
        stamp = sl.readLock();
        try {
           result = mayThrowAnException();
        } finally {
           sl.unlockRead(stamp);
        }
     }
     return result;
   }



But let's say it's not a completely generic piece of code and you have some control over it. Then there is still the issue that some re-ordering may cause the code to enter an infinite loop. Now what?
Some reorderings of the variable reads can cause an infinite loop, like in the following example, where x starts at 0 and y starts at 1, and both are incremented by another thread inside the write-lock. The reads (or the writes) of x and y can be re-ordered, delta ends up being zero, and the for() loop goes into an infinite cycle.
   void incrementXY() { // an exclusively locked method
     long stamp = sl.writeLock();
     try {
       sharedX++;  // Starts at zero. Can be re-ordered with increment below
       sharedY++;  // Starts at one. Can be re-ordered with increment above
     } finally {
       sl.unlockWrite(stamp);
     }
   }
  
   void example2() { // A read-only (optimistic) method
     long stamp = sl.tryOptimisticRead();
     long x = sharedX;                    // Starts at 0
     long y = sharedY;                    // Starts at 1
     long delta = y - x;                  // Ooops, this can be zero, or negative, or whatever
     for (long i = 0; i < 1000; i += delta) {
         doSomeReading();                 // To infinity and beyond!
     }
     if (!sl.validate(stamp)) {
        stamp = sl.readLock();
        try {
          doSomeReading();
        } finally {
           sl.unlockRead(stamp);
        }
     }
   }

Notice that using a regular read-lock would mean that delta would always be 1 and there would be no infinite loop.

Well, to fix this, you can wait for a certain timeout. This raises the questions of how long to wait for, and whether it depends on the machine you're running on. You went through all this trouble because someone told you that tryOptimisticRead()/validate() was faster than using readLock()/unlockRead(), and now your code ends up having to wait for some randomly chosen (hard-coded) timer to expire... and you think that is going to be faster? .... yeah, right.


Ok, but let's say you're willing to take the hit and add to your code the boilerplate to handle all the exceptions, and to time out if the read-only critical section goes into an infinite loop (both of which are a major pain, and it would make my eyes bleed to write such code, so I'm not going to). After all this hassle, you've got safe optimistic read-only code... right?
... nope, not really.

Suppose in the example above the doSomeReading() method is actually allocating memory.
And suppose that there is no re-ordering but the thread calling example3() goes to sleep after reading sharedX, and in the meantime, sharedY is incremented multiple times... like one million times:
   double example3() { // A read-only (optimistic) method
     long stamp = sl.tryOptimisticRead();
     double result = 0;
     long x = sharedX;                    // Reads this to be 1
     // Thread sleeps here
     long y = sharedY;                    // Reads this to be 1 million
     long delta = y - x;                  // Ooops, this can be a really large number
     Widget array[] = new Widget[(int)delta];  // Suppose there is enough memory for this
     for (long i = 0; i < delta; i++) {
         array[(int)i] = new Widget();    // There might not be enough memory for 1M Widget instances... out of memory?
     }
     if (!sl.validate(stamp)) {
        stamp = sl.readLock();
        try {
          result = someReadOnlyFunctionThatThrowsAnException();
        } finally {
           sl.unlockRead(stamp);
        }
     }
     return result;
   }


By allocating a large number of Widget instances, you've just consumed all the memory in the JVM (or the system), and although you're catching exceptions and aborting, and the GC will then clean up this mess, it might be too late, because now you may have failures in other threads of this application (or even in other applications running on the same JVM) because there is not enough memory to allocate anything...
your whole app goes down in a blaze of glory, maybe bringing down with it all the other apps in the JVM, and in an extreme situation, causing the OS to go down as well (unlikely in the case of Java/JVM, but not so unlikely in other languages that have a GC but run natively, like the D language).

As far as I know, there is no way to protect efficiently against this issue, but if you know a way, please use the comments below this post  ;)



In case you think this example is contrived (it is), there are examples of this kind of thing happening in production systems where optimistic concurrency is being used, like in this presentation at CppCon last year by Brent Hall.

In this particular case, it was a transactional memory with optimistic reads that had an issue very similar to the example 3 above.



And the biggest reason of all not to use optimistic concurrency is: none of the above!
It's reasoning.

Concurrency is hard, everybody knows that. The whole point of doing things like immutability and (pessimistic) transactional memory and having linearizable data structures is to have something that is easy(er) to reason about.
Optimistic concurrency requires us to reason in terms of Relaxed Consistency (like relaxed atomics in the C++ Memory Model), and our pitiful linear brains are not made for that, and anyone saying otherwise is either a genius (I mean, like above-Einstein level) or has no idea what they're talking about.
Most software developers work in groups (companies, startups, universities, research labs, whatever), and even if you can reason about such code, it is unlikely that everyone else in your team will be able to understand it and change it in the future.
You have to ask yourself these questions:
Even if the code does what you think it does, how are you going to convince the rest of the team of the correctness of your code?
Is everyone in your team (i.e. the weakest team member) able to understand this code well enough that they can safely change it?

Unless the code is extremely simple and your team is made of a group of outstanding individuals, this is unlikely.
But hey, it's your code, so do what you think is best  ;)

Monday, July 18, 2016

Angelika Langer at GOTO 2014

If you want a soft intro to some of the concurrency features introduced in Java 8, here is a video by Angelika Langer at GOTO 2014


I don't know much about the Futures stuff, but the part about StampedLock and Contended is well explained, with intuitive examples as well.
In fact, she makes it look so easy to use StampedLock, and particularly the optimistic API of tryOptimisticRead()/validate(), that someone who is new to it may think it is easy... it's not.

IMO using optimistic concurrency is like juggling knives, and I've ranted about it before and will rant about it again in my next post, but I am happy that people are talking about it and explaining it in such detail... too bad they didn't give her enough time to explain all the issues that can occur with tryOptimisticRead()/validate(), but at these conferences time is always short so you never get to present everything you want.