Monday, March 11, 2013

The End of multi-core Scalability is near

In 1988, Maurice Herlihy proved in his seminal paper "Impossibility and universality results for wait-free synchronization" that Wait-Free consensus can be solved for any number of threads using CAS. He didn't explain how it could be done for an arbitrary algorithm, but in practice this means that you can use CAS to write concurrent Wait-Free algorithms, as long as you don't put the CAS in a retry loop. Avoiding the retry loop is a necessary condition for wait-freedom, but not a sufficient one.
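
To make "CAS without a retry loop" concrete, here is a small sketch of my own (not code from Herlihy's paper): a one-shot consensus object where a single CAS is enough for any number of threads to agree on one value.

```java
import java.util.concurrent.atomic.AtomicReference;

// One-shot consensus: every thread proposes a value, and all threads
// agree on the proposal of whichever thread's CAS won. Each thread
// executes at most one CAS and no retry loop, so every call completes
// in a bounded number of steps, i.e. it is Wait-Free.
class Consensus<T> {
    private final AtomicReference<T> decision = new AtomicReference<>(null);

    public T decide(T proposal) {
        decision.compareAndSet(null, proposal); // only the first CAS ever succeeds
        return decision.get();                  // everyone returns the winner's value
    }
}
```

No matter how the threads interleave, every `decide()` call returns the same value: whichever proposal landed first.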

Assuming you figure out how to implement whatever algorithm it is you want to make in a Wait-Free way, there is still a problem: you usually want it to be Scalable, or have Low-Latency, or both. If you're only interested in Low-Latency, then you're safe and the rest of this post doesn't apply, but if you need the algorithm to be Scalable, then you have a big problem, which can be summarized by the plot below and has been extensively described in this post, and this:


You see, to be truly scalable with the number of cores/threads, an algorithm has to be Wait-Free-Population-Oblivious (WFPO), or very close to it. Creating such an algorithm is usually an insurmountable difficulty on its own, but even if you invent one, or copy from someone who has done it for you, there is still a problem, and the problem is related to the scalability of the CAS instruction on current CPUs.
The CAS instruction enforces cache-coherency on the variable that is being modified and its corresponding cache-line. This means that if several cores are contending for that cache-line, the hardware must arbitrate ownership of it, and this arbitration is done by a cache-coherence protocol, typically MESI or one of its variants.
The problem with this approach is that it causes the time required to ensure cache-line consistency to grow linearly with the number of cores competing for the cache-line. This means that the time required to complete a CAS operation grows linearly (or nearly) with the number of threads contending on the modified variable.

Let me repeat it, because it's the single most important idea in this post: "The more threads you throw at an algorithm that uses CAS on a single variable, the slower it will be to complete each CAS operation".

For example, let's say you have a system with 32 cores and you write an application that uses 4 threads, each capable of doing 100M CAS operations per second when they are all competing on the same variable. Then you can expect that if you run the same code with 8 threads, each thread will complete about 50M CAS per second; with 16 threads, no more than about 25M CAS per second per thread; and so on as you increase the number of threads, until you run out of cores.

The math is pretty simple, but let's spell it out here:
 4 threads x 100M CAS = 400M CAS per second
 8 threads x  50M CAS = 400M CAS per second
16 threads x  25M CAS = 400M CAS per second
Yes, they all do the same overall number of operations even though the first one is occupying only 4 cores and the third one is using up 16 cores.

Don't believe me? Then run this code snippet with a batchSize large enough to finish in a few seconds, and then repeat with a different number of threads and compare the times:

            for (int i = 0; i < batchSize/2; i++) {
                counterLong.compareAndSet(0, 1);  // every thread hammers the
                counterLong.compareAndSet(1, 0);  // same variable's cache-line
            }
Or you can download this test at SourceForge.
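
For convenience, here is a minimal self-contained harness around that snippet. It is my own sketch of the surrounding code (the post only shows the inner loop): I'm assuming counterLong is a shared AtomicLong and that every thread runs the same loop. Run it with different thread counts and compare the wall-clock times.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal harness for the CAS-contention snippet above. All threads
// hammer the same AtomicLong, so the cache-line holding it bounces
// between cores; expect total throughput to stay roughly flat as
// threads are added, i.e. per-thread throughput drops.
class CasBench {
    static final AtomicLong counterLong = new AtomicLong(0);

    public static void main(String[] args) throws InterruptedException {
        final int numThreads = args.length > 0 ? Integer.parseInt(args[0]) : 4;
        final long batchSize = 2_000_000L; // increase until a run takes a few seconds
        Thread[] threads = new Thread[numThreads];
        final long start = System.nanoTime();
        for (int t = 0; t < numThreads; t++) {
            threads[t] = new Thread(() -> {
                for (long i = 0; i < batchSize / 2; i++) {
                    counterLong.compareAndSet(0, 1);
                    counterLong.compareAndSet(1, 0);
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(numThreads + " threads took "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```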

Why does this happen? Well, the details are long and obscure (at least to me), but it all comes down to cache-coherency protocols like MESI, MOESI, and MESIF.

What can you do about it?
Nothing... unless you can come up with a cache-coherence protocol whose time to complete doesn't grow linearly with the number of cores.

Summarizing: even if we keep inventing new WFPO algorithms, the limit on multi-core scalability is still there, and it is quite low. That makes all our work somewhat in vain, because the algorithm we so carefully designed to always finish in a finite number of steps now relies on an instruction whose time to complete grows with the number of threads, and is therefore not scalable at all.

I don't know about you, but to me, this suckz  :(
As such, I would like to address an open letter to the CPU architects and engineers at Intel, ARM, and AMD:

Dear CPU engineers,

Please design a cache-coherence protocol that scales with the number of cores reading/writing on a cache-line.
I know that what I'm asking of you might be impossible, but unless one of you comes up with a protocol where the time to access a dirty cache-line doesn't grow with the number of cores, pretty soon the companies you work for will not be able to sell new CPUs with more cores, because no one will be able to use those extra cores for anything useful.

Think about it, really. I mean, how many cores can you have on your smartphone/tablet/PC ? 

8? 16? 32?
After a certain point it won't matter, because there won't be enough different applications running at the same time on the device; we will lose the edge given by parallelization, and the bottleneck of concurrency will show up. It's true that many people install a gazillion apps on their smartphones, but they aren't running them all simultaneously.
Parallelization will only take you so far, and at some point in the future, unless you can come up with a way to allow developers to write scalable concurrent algorithms, you, or most of you, will be out of a job, because no one will buy the new CPUs your company makes.

Look at it from this perspective: Your job depends on it, so better start working on it now!

Thanks, and good luck.

Yes, I know, this is about as much good as writing a letter to Santa Claus asking for a real-size, latest-model Ferrari, but hey, you gotta try.

I want to end on a slightly more positive note, and say that not all hope is lost in the land of scalable algorithms and data-structures:

First, in the more distant future, someone might come up with a better computer architecture than the current von Neumann one, without the inherent physical limitations we see today when the CPU accesses memory. Perhaps something more akin to neural networks, or something even stranger, that will change the way we think about algorithms and computation in general. Whatever it may be, it is still a long way away, and no matter how smart the people who come up with it are, they still have to follow the laws of physics (particularly the 2nd law of thermodynamics).

Second, we already have things like Hadoop that can scale algorithms almost without limit. I know it's a trick: they're turning concurrency into parallelization, and not all algorithms can be transformed in such a way, with anything requiring a queue being a good counter-example. Still, for those algorithms that can be "map-reduced", the future (and present) looks promising, with lots of room for improvement and growth.
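
To make that concurrency-into-parallelization trick concrete, here is a small sketch of my own (not from Hadoop): each thread counts into its own private slot, and the partial results are only combined once at the end, so no CAS ever contends on a shared variable.

```java
// Sketch of turning a contended shared counter into per-thread partial
// counts that are merged at the end. Each thread touches only its own
// data while working, so there is no cache-line ping-pong on the hot path.
class PartitionedCount {
    static long countInParallel(int nThreads, long perThread) throws InterruptedException {
        long[] partial = new long[nThreads];       // one slot per thread
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                long local = 0;                    // thread-private: no contention
                for (long i = 0; i < perThread; i++) local++;
                partial[id] = local;               // a single write at the end
            });
            workers[t].start();
        }
        long total = 0;                            // the "reduce" step
        for (int t = 0; t < nThreads; t++) {
            workers[t].join();
            total += partial[t];
        }
        return total;
    }
}
```

Any algorithm that fits this shape scales almost freely; the ones that need the threads to coordinate mid-flight (a queue, for example) don't.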

Third, we still have some more years until the scalability ceiling on Lock-Free and Wait-Free algorithms starts to be a real problem, so until then, enjoy the ride and the sunny weather, and let "Future You" take care of it   ;)


  1. Looks like CPU designers are working around the problem by coming up with alternative ways of using all the cores they can now put on a CPU: for instance, the new eight-core Samsung Exynos 5, which has four low-power/efficient cores and four high-performance ones. Only four of them can be used at the same time, depending on the workload. For the time being we will probably see more and more similar approaches, with heterogeneous and specialized cores.
