Monday, December 16, 2013

The Rise and Fall of x86

This post is about a prediction, and as far as predictions go, please take it with a pinch of salt.
I've been wanting to write about this for a very long time, but it wasn't really possible without talking about the Left-Right mechanism, and you'll see why below.

The prediction is:
x86 has reached its peak, and it is downhill from here

Some of you are thinking: Baaahhh that's nothing new, other guys have been saying that for quite a while... it's even on Reuters (kind of):
and indeed, they have, and they are right, but that's not what this post is about.
Certainly, the mobile/tablet/smartphone market is completely dominated by ARM, and x86 has a smaller and smaller piece of the pie. And if you start thinking about IoT (Internet of Things) then the role of x86 in the future, looks to be smaller and smaller:

This is all true, and it does make the title of the post kind of obvious, but there is a missing piece of the puzzle that most people haven't seen yet, and it will be the death touch on x86: The Performance (per Watt) Wars of the Data Center.

Ooohhhhh is there a "performance war" going on?
Yes, yes there is. Just take a look at the first 10 minutes of this presentation to understand what I'm talking about "All your cores are belong to us"
The summary of it is, that the big cloud and data center folks like Google, Amazon, IBM, are buying truck loads of machines (x86), and even though these machines have more cores than the previous ones, the improvement in performance is not that big... duhhhhh!
For the moment, they're buying x86, but some are already playing around with ARM:

The data center is the current battle front between x86 and ARM, or better yet, it is the deathmatch of ordered vs. relaxed memory architectures, and now we're getting to the real topic of this post.

As Herb Sutter predicted back in 2005 on his famous article "The Free Lunch is Over", we can't make practical CPUs with higher clock speed than ~3 GHz:
but we can increase the number of cores per die, or even the number of dies. Other techniques have allowed single-core performance to increase over the years, but even that may be reaching its limit:

x86's popularity in the PC and Server markets is due to many reasons, but one of the technical ones is its ease of programming in a multi-core environment, due to its nearly strict ordering. There are two good presentations to understand this:
and the second one from Herb Sutter is just brutal (start at minute 30 of the second video in this link):

The key takeaway from these presentations is the following:
On x86, there are no special instructions to force ordering of loads, which means that on x86 we always pay a penalty on synchronization for every load!
The sentence above is the single most important concept in this post, so read it again until you understand the ramifications...
... have you read the sentence above in bold? Do you understand what it means?

Let me try to explain in my own words why I think this is important.
On x86, every time one of the cores executes a load instruction, it must make sure that no other core has written into the cache line associated with the memory location that is being read, which requires heavy synchronization... for EVERY load!
Now, the folks at Intel have been doing this kind of stuff for many years, and they're pretty good at it, so good in fact, that the price to pay is small. The real problem starts when you increase the number of cores, because as it increases, the synchronization has to be done with more cores, and the price grows higher and higher, meaning that each core won't be able to be as fast as theoretically possible (on a relaxed memory architecture).
The more cores or CPUs you connect together, the higher will be the advantage from CPUs with a relaxed memory model. This is pure physics, you can't go faster than the speed of light, so the less you need to communicate between the cores, the faster the architecture will be.
There is a nice post from Martin Thompson on how caches work on modern microprocessors that might shed more light for those of you wondering how a cache works:

C++ and C, or more specifically, C++11 and C11 have their own well defined Memory Model, and even Java has a more or less well defined Memory Model. The thing about these Memory Models is that there are no current commercial CPUs that follow this model exactly. ARMv8 is the first of its kind because it will match this Memory Model, usually named Sequentially-Consistent for Data-Race-Free programs (SC-DRF).
ARMv8 will be the first commercial CPU to be able to squeeze out all the (theoretical) performance out of multi-threaded programs written in these languages.

In practice, the fact that x86 gives more guarantees than needed by the C++/C/Java memory models, is not really something to worry about because, this is only noticeable in concurrent data structures, and as you know, most concurrent data structures use atomic<> (volatile for Java) variables which have implicit synchronization i.e. acquire/release barriers.

... or at least it was so until recently.
You see, there are new algorithms and data structures that require little synchronization, and can take advantage of "relaxed atomics". My favorite example is the Left-Right mechanism
but others exist, with NUMA-awareness for example, and Andreia has a new one that is super cool.

For example, if you look at a lock-free implementation of a linked list, like the ConcurrentLinkedQueue (CLQ) in java.util.concurrent, you will see that each node traversal "" does an acquire-barrier, because the "next" member is volatile. If you use the Left-Right mechanism with two (non concurrent) linked-lists, then the result will be a logical linked list that is wait-free for read operations, and only a few synchronized operations are done at the beginning of the list traversal and end. Notice that the Left-Right (LR) mechanism is unfortunately blocking for write operations.
This means that on a traversal of a linked list with N nodes, the number of synchronized operations is N for the CLQ, and constant for the LR.
On x86 there are no differences between the two, because all load instructions have an implicit acquire-barrier. On the other hand, on ARMv8, only the instructions that have an explicit acquire-barrier will do that kind of synchronization, otherwise the load will be done from cache, assuming it is available in cache L1/L2/L3, and that no other core has "touched" the same cache line.
CLQ list = O(N) acquire-barriers
LR list = O(1) acquire-barriers

In summary, discoveries of new lock-free and wait-free algorithms that use relaxed atomics, are giving more and more and advantage to CPUs with relaxed memory ordering, which means that to get the maximum performance per Watt, the big players will have to move to these kind of algorithms and CPUs, to reap the full benefits of the best available hardware.

It is kind of ironic that one of the main features that gave x86 its popularity (strong ordering) is the one that will ultimately bring its demise.
I believe microprocessors with strong ordering (like x86) are doomed to be superseded by relaxed memory ordering (like ARM), and that is a strong prediction.

1 comment:

  1. You state: "On x86, every time one of the cores executes a load instruction, it must make sure that no other core has written into the cache line associated with the memory location that is being read, which requires heavy synchronization... for EVERY load!"

    Let me state that that's simply hogwash. I'm no x86 apologist, but I must point out that that is simply false.

    What x86 does require is sequential *consistency* for loads. That is, if you write a program that says "Load X; Load Y", then reading from Y must _appear_ to happen after reading from X. This does not require heavy system-wide synchronization for every load.

    Consider this example. At the start of time, X = old_X and Y = old_Y. On CPU 0, we execute:

    Y = new_Y;
    X = new_X;

    On CPU 1 we execute:

    R0 = X;
    R1 = Y;

    For an architecture with sequentially consistent loads and stores, the permissible outcomes for R0 and R1 are:

    R0 = new_X; R1 = new_Y;
    R0 = old_X; R1 = new_Y;
    R0 = old_X; R1 = old_Y;

    The outcome "R0 = new_X; R1 = old_Y;" is not allowed. Note that CPU 1 is allowed to read memory in any order. All that's required is that this relationship be maintained.

    One way to do this is to batch up write notifications from one processor to all others, and deliver writes in a batch to all other processors. All other processors might only see "old_X; old_Y" followed by "new_X; new_Y", without ever seeing the intermediate state "new_Y; old_X". Since it never sees this intermediate state, it could execute its loads to X and Y in any order without violating sequential consistency.

    That's just one trivial example. But, hopefully it demonstrates that there are other ways to preserve the semantic of load-load ordering without requiring a system-wide barrier between each load. If x86 truly required an actual system-wide barrier between any two loads, it would have been ditched long ago.

    Instead, modern processors use coherence ordering queues and similar structures to establish a memory system order, and synchronize between processors only coarsely to limit the number of ordering interactions between them.