Thursday, October 27, 2016

Are Hazard Pointers Lock-Free or Wait-Free ?

I've seen some people refer to Hazard Pointers (HP) as lock-free and others as wait-free, so which one is it?

The simple answer is: they're lock-free

The complex answer is: it depends

If you go and take a look at the original HP paper by Maged Michael, you'll see there is an implicit mention of "lock-free":
I mean, the title itself has the term "lock-free" in it: "Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects"

So it's lock-free, right?

But then you start reading the text and in page 492 you have this:
(...) It is wait-free [8], i.e., progress is guaranteed for active threads individually, not just collectively; thus, it is also applicable to wait-free algorithms without weakening their progress guarantee. It allows reclaimed memory to be returned to the operating system. It does not require any special support from the kernel or the scheduler. (...)

The first sentence is clearly mentioning that although this was designed for usage in lock-free algorithms, it can be used in wait-free algorithms, and the HP itself will be wait-free.
The last sentence is just to say that you don't need kernel/scheduler support like you do on the Linux Kernel's RCU. Of course, nowadays there are good userspace RCU implementations that don't need that either and can be implemented with just atomics, i.e. they're as generic as HPs.

Huhhh, so it's wait-free then?

Not so fast. The question you want to pose is:
Is it true that we can we apply HPs to a wait-free algorithm and keep it wait-free? For every wait-free algorithm?

Andreia and I know for a fact that answer to this question is "No".
Not every wait-free algorithm can be adapted to use Hazard Pointers and still maintain its wait-free progress. The best example is the list by Tim Harris (later modified by Maged Michael himself) which can be made wait-free for lookups, i.e. the contains() method, but when using HP it will revert back to being lock-free (more details on this below).

The Hazard Pointers technique has three procedures in its API:
- Publish a pointer: publish(ptr)
- Clear a pointer: clear(ptr)
- Retire an object: retire(ptr)    (named retireNode() in the original paper)

The retire() method is called when reclaiming memory, for example, from within a remove() method in the Harris-Maged list after a node has been successfully marked and unlinked from the list. The publish()/clear() are used by all methods contains()/add()/remove() because any of them can dereference pointers that another thread may be attempting to delete through a call to retire().

One of the rarely mentioned advantages of HP is that they provide a memory bound. In other words, they give the guarantee that out of all the objects/pointers for which retire(object) was called, there is a maximum number of objects that have not yet been deleted. For an 'R' factor of zero (page 492 of the HP paper) this bound is MAX_THREADS x MAX_HPS, which is actually very low, and this is in the extremely unlikely worst-case scenario.
This low bound on the memory usage can be vital on embedded systems where memory is scarce.
This bound on the number of object/nodes to retire has one other implication, it means that a call to Scan() (figure 3 of the HP paper) which is typically called from inside retire(), will have a bounded number of nodes to scan, and therefore, retire() is a wait-free bounded method.

As it so happens, publishing a pointer is just a matter of doing a seq-cst store of the pointer, and
clearing a pointer consists of doing a seq-cst store of nullptr on the same shared memory location (the hazard pointers array), and both of these operations are wait-free population oblivious.
In summary, the progress conditions of the HP API are:
- publish(ptr): wait-free population oblivious
- clear(ptr): wait-free population oblivious
- retire(ptr): wait-free bounded

Hummmm, so HPs are wait-free bounded for doing memory reclamation?
Yes, this they can be made wait-free bounded for calls to retire(), subject to implementation details.

And HPs are wait-free population oblivious for calls to publish() and clear(), so they must be wait-free?
No, not always. The answer is related to the way you call publish() and clear().

How do you adapt a wait-free algorithm to (wait-free) hazard pointers?

Here is what a lock-free method looks like when using hazard pointers:
std::atomic<Node*> ptr; // Global shared pointer

void lockFreeMethod() {
  Node* lptr = ptr.load();
  do {
  } while (lptr != ptr.load());
}      // May need infinite steps to complete

and here is what a wait-free usage looks like:

void waitFreeBoundedMethod() {
  for (int istep=0; istep < maxSteps; istep++) {
    Node* lptr = ptr.load();
    if (lptr != ptr.load()) continue;

}    // Completes in maxSteps

The difference between the lock-free and the wait-free version is in how you call the publish() method.
If you have a wait-free algorithm where you want to incorporate HPs and you are able to re-write it such that is looks like the second form, then it will be wait-free.
However, if you can only write it in the first form, then there is the possibility (no matter how unlikely) that the loop will go on indefinitely, meaning that it's lock-free.
In which situations can you write an algorithm in the second form? That's a subject for another post.

Unfortunately, when using HPs, the contains() method in the Harris-Maged list can only be written in a lock-free way.
The reasons for this are related to the invariants on the Harris-Maged list itself, and they would require a separate post to explain in detail, but the rough idea is that re-checking the pointer may cause an invalidation the publishing due to some other thread calling add()/remove() changing the pointer. This can happen an infinite number of times, so it's lock-free. And we can't just give up on it and skip to the next node because the next node may have already been free()/deleted.
Yes, it can happen that the next node wasn't just retired, it was really deleted, and then... crash!

If yes, then I just proved the point on my previous post: Hazard Pointers are hard to use, not because they are complex themselves, but because they require deep knowledge of the algorithm where they are being deployed.

In summary, doing memory reclamation with hazard pointers is always wait-free, but using hazard pointers for reading is typically lock-free and although it can be made wait-free it is never easy and sometimes not possible.

Tuesday, October 18, 2016

CRTurn Queue - The first MPMC memory-unbounded wait-free queue with memory reclamation

A few days ago I turned 40.
It's a big event, and to celebrate it, Andreia and I decided to share some of the work we've been doing recently.
Yes that's right, it's my birthday but you get the gifts!

We officially present CRTurn queue, the first (correct) memory-unbounded multi-producer-multi-consumer wait-free queue for C++, that does its own memory reclamation, and does it in a wait-free way.

If you just care about the C++ code, then follow this link
If it's Java you're interested, then we have an implementation with self-linking on the url below
And if you care about academic papers with invariant proofs and microbenchmark plots then look here

Otherwise, just keep on reading for the blog post  ;)

When it comes to memory unbounded multi-producer-multi-consumer (MPMC) queues there are three known approaches in the literature, all based on singly-linked lists, plus a handful of generic wait-free mechanisms that can be applied to queues, but they're too slow to be a match for the "hand-written" queues. You can see each respective paper for those three queues here:
Of these three algorithms: KP is for Java only, which means the memory reclamation is done by the GC, and since the JVM's implementation of a GC is blocking, this queue isn't truly wait-free, at least not when it comes to latency, but IMO it's the best of all previous work, and it's the only one for which there is a correct implementation;
FK does no memory reclamation and has errors in the implementation;
YMC has a memory reclamation that is flawed by design, and even if fixed would not be wait-free, and seems to have some errors in the implementation (at least they were the first ones to realize the importance of doing memory reclamation).
This means that CRTurn queue is the first to reclaim its own memory, and therefore, the first that can be implemented in C or C++ or something without a GC. Not only does it reclaim the memory of the dequeued nodes, but it does so in a wait-free bounded way, a feat which is not easy (to say the least).

The CRTurn queue has other interesting properties, for example, it does no memory allocation except for the creation of the node that is inserted in the linked list, and even that can be pre-allocated if desired.
In CRTurn, the enqueueing and dequeueing algorithms are isolated, which means you can use just one of them and plug it with a single-threaded queue algorithm (singly-linked list based) to make a wait-free SPMC or MPSC queue. Or even better, the enqueue() algorithm is very short, so it is very tempting to attach it to the dequeue() of the Michael-Scott queue to create a simple queue that is MPMC with wait-free enqueue() and lock-free dequeue().
Such a queue has a small number of lines of code and is relatively easy to reason about and convince yourself of its correctness.
Other properties are, using a fast wait-free consensus we called the "Turn" consensus (think Lamport's Bakery, but better), and it achieves bounded wait-free with just the use of CAS (compare-and-swap), which is nice because not all CPUs have a FAA (fetch-and-add).

This queue was designed for low latency at the tail of the latency distribution, i.e. for real-time systems.  Sure, it does a new/delete for each node that is enqueued/dequeue but you can plugin your custom allocator if you want.
Surprisingly as it may seem, for the uncontended scenario, this queue isn't far behind from the Michael-Scott queue, i.e. the best known memory-unbounded mpmc lock-free queue. It's nice to see that we don't have to pay a very high price for having wait-free guarantees.
We tested this code heavily to make it production-ready and aimed to provide a code that is as simple as possible, but keep in mind that this is a wait-free queue and wait-free algorithms are rarely simple.

This queue is a big deal for us because Andreia and I worked really hard to get here. We spent late nights working on this, unending discussions on how to solve the different problems, how to proceed with the design, writing code, writing tests, running benchmarks, analyzing the results, figuring out the best way to measure tail latency, writing more tests, and more experimental code, and more discussions, until we got to what is being shown as CRTurn queue. Along the way, we created a new wait-free consensus protocol, we had to figure out how to apply Hazard Pointers in a completely wait-free way, how to use the least amount of stores for Hazard Pointers to keep the throughput high, and all of that without doing any heap allocation.
It was a long and hard journey, particularly the memory reclamation part, and we learned a lot along the way.
I don't know if we'll get published in a top tier conference, but we are sharing this fruit of our labor with everyone, and we hope you enjoy it  ;)

More to come...

Friday, October 14, 2016

Dope! I should have thought of that!

What's one of the biggest compliments someone can give me?
It's when I explain an idea or algorithm and someone says: Dope! I should have thought of that!

I think this a great compliment, and let me explain why.

First, when someone says something like that it means that they think the idea is almost obvious or at least simple (to them), which implies that I was able to explain the idea to make it sound simple.
This may seem a trivial task, after all, explaining simple ideas is easy, right?
Well, I'm not always the most articulate person and have a (bad) tendency to overly complicate things, which you might have already picked up on if you 're reading this blog. In my defense, I do make an effort to explain concepts in the simplest way I can think of, too bad that is sometimes a really overly complex way.
Andreia is much better than I at this game. She has this seemingly innate skill to dismantle complex ideas in its basic constituents, and she helps me with it  ;)
Nobody's perfect, but luckily, I don't have this issue when it comes to writing code... the problem with code is that is explains the how but it doesn't explain the why.
So, when someone says "I should have thought of that!" it gets translated in my brain into something like "You made it sound so simple that I wonder why I didn't think of it myself"  lol

Second, no matter how smart you are, if the idea is complex, you're not going to be able to explain it in a simple way. If someone thinks your idea is simple, then there is at least one person in the world to whom your idea is simple, and simplicity is elegance! Most people can make something useful and complicated, but only a few can make something useful and simple.
I'm not a big fan of Steve Jobs, but there is quote attributed to him that I think is perfect for this situation:
When you start looking at a problem and it seems really simple,  
you don’t really understand the complexity of the problem.  Then 
you get into the problem,  and you see that it’s really complicated,  
and you come up with all these convoluted solutions.  That’s sort of 
the middle,  and that’s where most people stop… But the really 
great person will keep on going and find the key, the underlying 
principle of the problem — and come up with an elegant,  really 
beautiful solution that works.

... or from C.A.R. Hoare, The 1980 ACM Turing Award Lecture:
There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.

It's really hard to create something and then go the extra mile to make it as simple as possible.
It takes time, it takes energy, it takes money (aren't these three equivalent anyways?).
This has value not just due to aesthetics but because making something as simple as possible means it is much less likely to have bugs or flaws, and it's easier to propagate to other people, a kind of meme. It becomes viral and infects other people's brains!
Sometimes is goes to the point that other people think they have invented it themselves:

Left-Right is certainly one of these ideas. It can be implemented in less than 10 lines of code and explained in just a few sentences:
The thing is, we spent nearly three years working on it. From its first inception until it was distilled into the simple idea it is today, there was a long journey, with lots of longs nights of discussions and coding.
The how it works may be deceivingly simple, but the why it works this way (and not some other way) is definitely not simple.

So, when someone tells me that Left-Right (or some other idea Andreia and I came up with) is so simple that they should have thought of it, I smile and say thanks, because that's a huge complement, whether they know it or not  ;)

Friday, October 7, 2016

Self-linking and Latency + Life of a Twitter jvm engineer

Today we're going to take a look at the effects of doing self-linking in singly-list based data structures in Java. More specifically, how it changes tail latency in singly-list based queues like the one by Alex Kogan and Erez Petrank.

Back in 2011 Alex Kogan and Erez Petrank showed a multi-producer-multi-consumer wait-free queue.
To the best of our knowledge, they provided the only reasonably useful and correctly implemented memory-unbounded MPMC wait-free queue (that's a mouthful).
Yes, there are generic wait-free techniques, but using them on a queue is just too slow, and yes, there two other MPMC wait-free algorithms but their implementation has bugs and it's CPU specific.
Even the algorithm by Kogan and Petrank relies on having an automatic Garbage Collector (GC) so they provided their implementation in Java. As the authors themselves mention in their paper, there is currently no wait-free GC, neither implemented in production nor in the literature, which means that this queue is never really wait-free, but let's leave that for another post, and just focus on latency.

How good is this "wait-free" queue when it comes to tail latency?

Pretty good as it turns out. We showed some plots in a previous post:

Unfortunately, as soon as you start "hammering on it" with lots of contention, a problem appears.
The problem is that the GC will do pauses, really really long pauses.

This queue is based on a singly-linked list, and like all singly-list based data structures when used with a GC, the unused nodes can't just be unlinked from the list, they need to be self-linked by pointing the next pointer to the node itself.
The why this is important is related to how GCs work, and there is a very good explanation in this presentation starting at minute 23
Life of a Twitter jvm engineer - The garbage keeps coming... by Tony Printezis

which by the way I recommend watching in its entirety, but for the purpose of this post, if you don't know what this self-linking stuff is all about, then go watch ten minutes of it starting at minute 23 and come back and read the rest of this post.

Do you understand now why self-linking is important?
This is really a non-obvious detail of GCs, and the first time I learned about this was several years ago from Doug Lea.

The following code is pretty much what was written in the original paper but with some sun.misc.unsafe added to it:
and in our benchmarks it has GC pauses that go up to 70 seconds.
yes, that's right, seventy seconds during which our 32 core machine is stuck doing nothing except running the GC  :-O
Imagine you are reading this blog post on your laptop/tablet and then you click on a link and everything freezes for 70 seconds due to the GC... does that sound reasonable in any way?
You're going to say that this is the GC's fault and that using a JVM with a decent GC like Zing would fix it. I don't know because I don't have access to a JVM with Zing, and I believe that using another GC would improve, but I seriously doubt it will be as effective as doing self-linking.
For a deep-dive on this topic, check out the paper named "A Performance Study of Java Garbage Collectors on Multicore Architectures", particularly figure 1 and table 3:

The trick to help the GC is to do self-linking of the nodes after dequeueing, but we can't self-link the node where your value is, instead, we self-link the node that points to it, because the node where the value is may still be accessible through head, and we need its next to be valid because it will be used by the next thread calling deq() to obtain its "value". Here is the variant we did, with self-linking of unused nodes:
and in our benchmarks it has GC pauses that go up to 3 milliseconds.
Keep in mind that they are exactly the same algorithm, apart from an extra check in help_finish_enq() and for the self-linking in deq() and yet, they have completely different latency profiles.

Below is one example we got with GC logs. In this example the JVM ran out of memory during the benchmark and had to run a full GC which took 17 seconds and then another that took 42 seconds. At the end of the benchmark we call System.gc() and sleep for a while to trigger another GC run and that one took more than 45 seconds.
These 45 seconds are not accounted for in the benchmark which is unfair because the GC is cleaning up the garbage that the queue is producing, so that work should be taken into account when doing benchmarks as well, but anyways, we don't care much about throughput, it's more about latency for us:
##### KPNoSLQueue                          ##### 
Starting run...

[GC (Allocation Failure)  3932481K->221057K(15073280K), 17.3218925 secs]
[GC (Allocation Failure)  4153217K->547969K(15073280K), 42.6884304 secs]

Ending run
Number of enqueues/dequeues per sec = 2235
[GC (System.gc())  1324375K->601121K(15073280K), 45.8331662 secs]
[Full GC (System.gc())  601121K->8434K(15073280K), 0.0763649 secs]

When we run the same benchmark for the self-linking version, there is still one GC allocation failure during the benchmark, but it takes only 1 millisecond to complete, and the throughput is significantly faster because there is very little waiting for the GC:

##### KPQueue                              #####  Starting run...
[GC (Allocation Failure)  3932485K->805K(15073280K), 0.0013195 secs]
Ending run

Number of enqueues/dequeues per sec = 3216
[GC (System.gc())  2541778K->453K(15073280K), 0.0012654 secs]
[Full GC (System.gc())  453K->327K(15073280K), 0.0142511 secs]

This is the problem with not reclaiming unused memory (nodes) and leaving it up to the GC. Once you go down that path, you need to understand what the GC is doing to somehow make friends with the GC so that it will behave the way you want it to.

Long story short, if you're going to implement a singly-list based queue in Java, you better do self-linking.