Monday, April 22, 2019

OneFile and Tail Latency

Last week we talked about OneFile, the world's first wait-free STM, and in this post we're going to explain the importance of wait-free progress.

Lock-based concurrency mechanisms (locks) have several difficulties in practice:
1) if we use more than one lock, we're subject to having deadlock issues;
2) if we use priority locks, we're subject to having priority inversion issues;
3) if we use a lock without starvation-freedom guarantees (such as a spinlock), we're subject to starvation and live-lock;
4) using locks is prone to convoying effects;
5) mutual exclusion locks don't scale for read-only operations. It takes a reader-writer lock to get some scalability for read-only operations, and even then we either execute read-only operations or one write, but never both at the same time. To this day, there is no known efficient reader-writer lock with starvation-freedom guarantees.

In concurrency, progress comes in different flavors.
Wait-free algorithms provide more guarantees than lock-free algorithms, which in turn provide more guarantees than blocking algorithms (with one exception for blocking starvation-free).

A lock-free algorithm guarantees that at least one thread is making progress (whatever "progress" may mean). It also implies that in the event of a failure of a thread/process, at least one of the other threads/processes will continue to execute. This is an extremely strong guarantee when it comes to failure resilience, and I'll cover it in an upcoming blog post.

Blocking algorithms don't have such a guarantee: if the thread/process holding the lock dies, then none of the other threads/processes will ever be able to do any work. There is a catch here, though: if the algorithm is blocking starvation-free, it means that given enough execution time (quanta), all the threads/processes will eventually complete. In other words, when it comes to completing an operation, starvation-freedom guarantees that the operation will execute in a finite amount of time, as long as the system has enough cores to run the threads/processes or the scheduler is somewhat fair, and no failure occurs. In the latency distribution, this gives a "cutoff" beyond which all operations have completed.

Lock-free algorithms don't have this cutoff. Instead, they have a "fat tail" in the latency distribution. Being lock-free implies the possibility of starvation, and in that sense (and only in that sense) starvation-free is stronger than lock-free.

Wait-free bounded progress goes one step above all this: it guarantees that, given a time slice of execution (quantum), all threads make progress. This is true even in the presence of failures: if N-1 threads/processes fail, the remaining thread/process will still be able to complete. If not, then the algorithm is not wait-free. Moreover, wait-freedom implies starvation-freedom, which means it also has a cutoff in the latency distribution.

In summary, wait-free (bounded) provides the strongest of all progress guarantees:
- Progress for all threads/processes;
- Progress in the presence of failures;
- Deterministic cutoff on the tail of the latency distribution;

We have implemented two variants of OneFile, one with lock-free progress and the other with wait-free progress. The wait-free variant is slightly slower in terms of throughput, because providing these strong guarantees requires extra synchronization. However, when it comes to tail latency, the wait-free variant of OneFile is better than any other (blocking) STM by orders of magnitude (lower is better):

At the tail of the latency distribution, at the 99.99% percentile (P9999), the improvement of OneFile over the blocking STMs is huge and speaks for itself. The dark blue line is OneFile wait-free, and lower times are better. Notice that this is a log-log plot with each horizontal line separating an order of magnitude, meaning that the wait-free variant is at least 1000x better than the lock-free (light blue) and blocking STMs.

Wait-freedom is a vital characteristic in low-latency scenarios such as networking devices, high-frequency trading, or control systems (autopilot cars/planes/rockets). Sure, there are other ways to deal with latency, such as having tight control over the scheduler, and that is what the hardcore real-time people do anyway. But with OneFile we get strong latency guarantees without having control over the scheduler, and that is in itself an important change for multi-threaded embedded systems development.

Next week we're going to talk about how OneFile allows us to create wait-free balanced trees.