Saturday, August 6, 2016

Throughput vs Latency and Lock-Free vs Wait-Free

In the previous post we showed an excellent presentation by Cliff Click, and one of the things he mentions is that there is performance for throughput and performance for latency, a distinction that really resonated with me.

When people talk about performance, they're usually interested only in throughput. This is fine, but throughput is just one vector of performance: a single number (for a particular workload).
Having a number is nice, it means we can quantify, we can measure, and therefore, we can compare.

You know what is much better than having a number?
Having a distribution, namely, the latency distribution.

How do you know if a certain technique or algorithm X is better than algorithm Y?
Typically, you measure, and typically, you do it by measuring the throughput of X and Y in different scenarios or workloads.
If you're really sophisticated, you might even do several runs, choose the median of those runs, and/or compute error bars based on the standard deviation or the min/max of the runs.
This is fine (it's already one step ahead of just guessing which algorithm is better), but we can do more:
Instead of showing a number (plus or minus an error), we can show a distribution. And when it comes to concurrent algorithms, the best distribution to show is the latency distribution.
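As a concrete sketch of the several-runs approach, here is one way to summarize a handful of throughput measurements with a median and min/max error bars (all the run values below are invented for illustration):

```java
import java.util.Arrays;

public class ThroughputSummary {
    // Median of a set of throughput measurements (ops/sec)
    static double median(double[] runs) {
        double[] s = runs.clone();
        Arrays.sort(s);
        int n = s.length;
        return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        // Five hypothetical runs of the same benchmark, in ops/sec
        double[] runs = {0.97e6, 1.02e6, 0.98e6, 1.05e6, 1.01e6};
        double[] s = runs.clone();
        Arrays.sort(s);
        System.out.printf("median = %.0f ops/sec, min/max error bars = [%.0f, %.0f]%n",
                          median(runs), s[0], s[s.length - 1]);
    }
}
```

This gives one number plus an error, which is exactly the summary the next paragraphs argue is too coarse.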

And why is that, you may ask?
Because the inverse of the mean of the latency distribution is...
... I'll give you a few seconds to think about it...
... yes you guessed it, the throughput!
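To make that relationship explicit: if you record the latency of every operation on a thread, that thread's throughput is the number of operations divided by the total time, which is exactly the inverse of the mean latency. A minimal sketch (the sample latencies are invented):

```java
public class LatencyToThroughput {
    // Per-thread throughput (ops/sec) computed from individual operation
    // latencies (in seconds). This is exactly 1.0 / meanLatency.
    static double throughput(double[] latenciesSec) {
        double total = 0;
        for (double l : latenciesSec) total += l;
        return latenciesSec.length / total;
    }

    public static void main(String[] args) {
        // Two operations taking 0.25s and 0.75s: mean latency is 0.5s,
        // so the throughput is 1/0.5 = 2.0 ops/sec
        double[] lat = {0.25, 0.75};
        System.out.println(throughput(lat)); // prints 2.0
    }
}
```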

Looking at a latency distribution gives far more information than looking at the throughput, in the same way that looking at a full statistical distribution gives far more information than looking at its mean, which is, after all, just a number.
You know the old saying, "a picture is worth a thousand words"? Well then, one latency distribution plot is worth a thousand throughput measurements, and this last statement is literally true if the plot is a histogram with 1000 entries... lol

Why is this important?
For two reasons:

First, if you care about throughput, and only about throughput, then you care about the latency distribution in a very specific way, namely, you want to know only the mean of the distribution. That's fine, but then how do you compare two algorithms? You are going to do several measurements, right?
What is the error in this measurement?
How to compute error bars?
Do you use a standard deviation or something else?
Are the error bars assuming a Normal distribution?
Is it safe to assume a Normal distribution?
Why not just show the latency distribution?

Second, there is typically a tradeoff when it comes to throughput vs latency (not always, but typically). Going back to our example, perhaps algorithm X provides better throughput than algorithm Y, but Y has a better tail latency than X.
Which one is "better"?
There is no right answer, it depends on how you define better: If you just want raw throughput and you don't care that 1% of the cases will take an enormous amount of time to complete, then X is better. If you're willing to sacrifice throughput to have more fairness and better (lower) latency at the tail of the latency distribution, then algorithm Y is better.

One concrete example is when you look at a lock-free queue (like the one by Maged Michael and Michael Scott) vs a wait-free queue (like the one by Alex Kogan and Erez Petrank):

In these plots, lower is better.
The plot for the median (the median, not the mean) shows that the Michael-Scott queue is better than the Kogan-Petrank queue (about 10x better), but when it comes to tail latency, namely at the 99.9th percentile, the Kogan-Petrank queue is better.
By 99.9% we mean that 99.9% of the calls to enqueue() take x microseconds or less to complete, and therefore 0.1% of the calls take more than x microseconds to complete, where x is the value on the vertical axis of the plots.
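A percentile like this can be read off directly from the recorded latency samples. A minimal sketch using the nearest-rank method (one common convention among several; the sample values are invented, not measurements of either queue):

```java
import java.util.Arrays;

public class TailLatency {
    // Value below which a fraction p of the samples fall (nearest-rank method)
    static double percentile(double[] samples, double p) {
        double[] s = samples.clone();
        Arrays.sort(s);
        int idx = (int) Math.ceil(p * s.length) - 1;
        return s[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // 1000 made-up enqueue() latencies in microseconds: 99% of the
        // calls take 1us, while the slowest 1% take 500us
        double[] lat = new double[1000];
        Arrays.fill(lat, 1.0);
        for (int i = 0; i < 10; i++) lat[i] = 500.0;
        System.out.println("median = " + percentile(lat, 0.5));   // prints 1.0
        System.out.println("p99.9  = " + percentile(lat, 0.999)); // prints 500.0
    }
}
```

The median looks excellent here, yet the 99.9th percentile is 500x worse, which is exactly the kind of behavior a single throughput number hides.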

Just like it doesn't make sense to say that throughput is better than (tail) latency, it is also meaningless to say that a lock-free algorithm with higher throughput is better than a similar wait-free algorithm. The only exception is when the wait-free algorithm has the same or higher throughput than the lock-free one: in that case the wait-free algorithm is, by definition, better, because it wins on every metric.

Fedor Pikus is going to talk about lock-free again this year at CppCon, so let's see what he has to say about this topic.
I talked about latency at the beginning of my presentation at CppCon last year, in case you want to learn more.

