Friday, September 30, 2016

Wait-free Bounded vs Wait-free Unbounded

Within the progress condition of wait-free there are two major groups:
- methods/algorithms that provide wait-free bounded guarantees;
- methods/algorithms that provide wait-free unbounded guarantees;

The difference between the two is subtle but very important.
Wait-free bounded means that there is a known bound on the number of steps, which itself has a strong but hidden implication: that there is a bound on the number of objects to be tracked, and therefore, to be deleted.
Wait-free unbounded means there is no known bound, and therefore, there may or may not exist a bound on the number of objects to be deleted.
This is confusing but let's consider a concrete example:

This year at PPoPP there was a wait-free queue presented by Chaoran Yang and John Mellor-Crummey.
John Mellor-Crummey is the MC in the MCS lock, so you may have "heard of him" before  ;)  
This queue uses fetch-and-add (FAA) and is wait-free unbounded. In the meantime, we designed a few algorithms based on fetch-and-add that are also wait-free unbounded (yes, yes, we did some wait-free queues).
The problem with this queue (and the wait-free unbounded queues we did as well) is that the dequeue() may have to traverse an unbounded number of nodes to find a node to dequeue. The basic idea is that you read the current head and then do a FAA on a certain monotonic counter to take a ticket and then traverse from head until you find a node that matches your ticket. All nodes have a monotonically increasing and unique ticket.
The problem with this approach is that between reading the head and doing the FAA the thread may go to sleep, and by the time it wakes up, there could be a very large number of nodes that were dequeued (and enqueued) and the counter in the FAA is for a ticket that is very far from the head.

This is bad enough on itself for latency at the tail of the distribution because now we have to traverse all these nodes which can take a long time, but it gets worse.
The fact that we have to traverse these nodes to find the one containing our ticket (with the item we need to return) means that all those nodes need to be live. In other words, we can't delete any of those nodes.
Do you see the problem with that?
That's ok, Chaoran and John didn't seem to see it either and they know a lot about concurrency.
If you can't delete those nodes even though they have been dequeued, how much memory do you need to keep them alive?
1Mb ? 1Gb? 1Tb? Even more?
Notice that the number of nodes that needs to be traversed is finite, but we don't known how large it is because it's unbounded.
So, how much memory do we need to store an unknown amount of nodes?
See the problem now?

Quoting from Harald Hefgott which is one of the top mathematicians of present times: "Computers today are very fast and can also perform calculations in parallel. But the memory is still limited,"

Maybe you think this is purely theoretical.
In that case, consider what happens if we have oversubscription of threads to cores. Let's say we have 8 cores on our machine and we fire up 80 threads to dequeue from the queue, and maybe another 80 threads to enqueue just to keep it balanced. Only 8 threads at a time will be scheduled, and depending on the OS, each thread may be scheduled for few hundred milliseconds (sometimes called the "quantum").
It will take 20 "cycles of scheduling" until a thread is scheduled again, and during that time a lot of dequeues can occur.
If there is a thread calling dequeue() and reading the head and goes to sleep just before doing the FAA, then from that moment on, no further nodes in the queue can be deleted.
Let me repeat this last part because it's important: No further nodes can be deleted, not newly inserted nodes, nor already existing nodes, none, zero, nada, zilch, niente.

The 20 cycles take about 100ms x 20 = 2 seconds, and on average half the time we dequeue (the other half we enqueue) so, how many nodes can we dequeue in one second?
Modern computers can do a few million such operations per second and perhaps, per core, depending on contention, but let's say the answer is Z.
Each node in a queue takes at least 3x64 bits, with 64bits for the next pointer, 64bits for the value pointer, and 64bits for ticket.
We need to keep in memory at least 24 x Z bytes. If Z is a really large number then this may be more memory than there is available on the machine, and Z can increase with more oversubscription, or higher quantums, or and unfair thread scheduler, making the problem worse.

In practice this is a dangerous approach.
In theory, the fact that the program can crash due to running out of memory, kind of implies that these algorithms are not wait-free, which is a bit of a contradiction.
Wait-freedom implies resilience to faults, i.e. a thread is suppose to make progress and complete its work in a finite number of steps, regardless of what the other threads are doing (even sleeping).
If having a single sleeping thread can crash any of the other threads and even the entire application due to memory exhaustion (an extreme case, but quite possible), is this really wait-free?
I don't think so.

In other words, we have one thread calling dequeue() and it goes to sleep for a while, and this simple fact causes any of the other threads to fail, whether enqueues or dequeues or even threads doing something else in the application which will run out of memory and stop working correctly.
What kind of wait-free queue doesn't give a decent latency guarantee at the tail, and it can cause program crashes for no particular reason? These wait-free unbounded queues are not really wait-free.

What would you think if I came up to you and said:
Hey! I invented this really cool wait-free (unbounded) queue. It's great because it's wait-free... it has one small quirk though, it can sometimes cause your application to crash!

Whether you think it's wait-free or not, I doubt you would be willing to deploy such a queue in production if you heard me say it like that  ;)

Are all wait-free unbounded queues and data structures affected by this issue?
I don't think so. If the "unbounded" part of the wait-free is due to some calculation that has an unbounded number of steps and not due to traversing an "unbounded" number of nodes, then it should be safe.
Unfortunately, I don't know of any other wait-free unbounded data structure to provide an example. If you know one, please use the comments section!

No comments:

Post a Comment