Thursday, October 30, 2014

Relaxed Atomics Linked List on PowerPC


I'm really excited about this one! It's very rare to write a data structure that can give 15x performance gains over the current implementations  :-O

After a long waiting time to get my hands on a non-x86 machine with many cores to test the relaxed atomics linked list we discovered, I finally found RunAbove (many thanks to Yann for pointing it out!).
RunAbove have currently two different setups for PowerPC, one with 8 cores named Power8 S, and one with 176 (hw) threads named Power8 2XL. I registered myself for it, got a 32$ voucher (which is awesome, compared to AWS price ripoff for a single 16 core machine), and started running benchmarks on it.

The microbenchmarks compare three different linked lists in Java:
CLLElectedUnlink - A Lock-Free linked list that uses relaxed atomics for list traversal;
CLLElectedUnlinkVolatile - Exactly the same algorithm as CLLElectedUnlink but using a volatile (sequentially consistent atomic) for the Node.next;
ConcurrentLinkedQueue - This is from java.util.concurrent with Doug Lea's implementation of Michael and Scott's algorithm;

The source code for the lock-free lists we created and the benchmark itself can be found on github:
CLLElectedUnlink.java
CLLElectedUnlinkVolatile.java
BenchmarkListRelaxedvsVolatile.java
and you can try it out on RunAbove to see for yourself.

Barrier wise, when doing traversals of the lists, it's important to note the following:
- CLLElectedUnlink has a fixed amount of acquire-barriers (volatile loads), usually just one to read the head and then another to check that the last node's next is really null;
- CLLElectedUnlinkVolatile has one acquire-barrier per node, when reading the Node.next;
- ConcurrentLinkedQueue has two acquire-barriers per node, one for reading Node.next and one for reading Node.item;
The thing about loads with acquire (volatile loads) is that on x86 they are usually just regular loads, which means that using relaxed atomics (java's regular loads) for traversing the list has no gain when compared to using sequentially consistent atomics (java's volatile loads). On x86 there won't be any impact on the performance, neither good nor bad, but on other CPUs with a more relaxed architecture like PowerPC or ARM, this can make a difference, in fact, it can make a 15x difference!

Let's look at a plot of the microbenchmark comparing the three data structures, in a scenario where a thread can either do a write operation (a remove() followed by an add()) with a 10% probability, or a read-only operation (two contains()) with a 90% probability.
On an x86 Intel Core i7-3740QM we get the following results:



On the ppc Power8 S instance of RunAbove we get a significant different throughput:


yep, huge increase in throughput when traversing the list!

Unfortunately it's not such a rosy picture when we run on the PowerPC 176 VPC instance, but I suspect that it is due to the number of actual cores being much lower than 176, and the oversusbscription of threads causing some non-trivial effect. Even so, the performance gains are quite noticeable (at least 2x)



This means we finally showed that relaxed atomics can be used to make a really fast lock-free linked list!

Sunday, October 26, 2014

How Ubisoft develops games for Multicore at CppCon

Another presentation from CppCon, this time by Jeff Preshing who is a Technical Architect at Ubisoft Montreal

The talk is divided into two sections, and I found the first part nice to watch because there were some different patterns for concurrency that are particular to video games, and to see the approaches that are used for those situations was really interesting. Video games are cool application because they have strong time constraints (they have to push out a frame every 1/30 or 1/60 of a second), so they care about latency and scalability in multicore environments.

The presenter talks about a wait-free queue which, for the record, was created by Herlihy and Wing, as shown in The Art of Multiprocessor Programming, Fig 3.17

The second part is about C++11 atomics and it's not so interesting, better watch Herb Sutter talk on the subject. It is cool that they intend to use C++11 atomics because they develop for several different platforms (x86, PowerPC, ARM) which means that using these atomics makes their life easier in terms of writing portable code across these platforms.

There is also an interesting comment near the end by Michael Wong about std::memory_order_consume.

In the middle of the talk there is a question also from Michael Wong about relaxed atomics and whether they (the Ubisoft guys) use it for lock-free data structures, to which he answers no. Well, as it so happens, Andreia and I have discovered a lock-free linked list that uses relaxed atomics as we talked about on a previous post, and as far as I know, this is the only lock-free linked list using relaxed atomics currently in existence. Unfortunately, we only published a Java version, and the C++11 version will have to wait because we need to polish it a bit more.

Btw, if you don't know what they do at Ubisoft, one example is Assassin's Creed.

Thursday, October 23, 2014

The line between Research and Engineering

Here is a good post from Joe Duffy, that pretty much says the same I've been thinking about the difference between Research and Engineering:
http://joeduffyblog.com/2013/07/13/blurring-the-line-between-research-and-engineering/

The basic message he's trying to pass is that there is not much point in having ideas if they're not practical to implement, which means that to be a good researcher you have understand what is practical and what is not, which implies some degree of exposure to the real world, and more specifically for Computer Science, a good knowledge of the tools, systems, and languages required to implement the idea.

Personally, I go even further, and see the Engineering / Research divide as a Cooking analogy.
A good engineer is like a good chef: given a recipe book and good ingredients, she can make wonderful dishes, and sometimes even with crappy ingredients and crappy recipes she can make good dishes as well (that's the difference between an excellent engineer and a good engineer).
A good researcher is more akin to a Master Chef or one of those guys running restaurants with two Michelin Stars. They may not even be the top best executioners of a recipe (they do have to be pretty good though), but they make their own recipes, they're the ones that write the cookbooks. They experiment with new combinations and sometimes come up with totally unique flavors. Their skillset is a merge of Creativity with Technical Expertise, and you won't be a Master Chef without both.

Perhaps the name "Researcher" is not appropriate to what I'm describing because in academia pretty much every one is called a "researcher"... hummm a more suitable name would be "Innovator" (although Master Chef does sound pretty cool because it reminds me of Master Chief)

Every IT company (and job applicant) is interested in the differences between good engineers and excellent engineers, but these "Innovators" are the unspoken elements, mostly because people with such a high degree of technical expertise are rare on itself, but out of those then select the ones whose creativity allows them to make up new stuff instead of just being really good at "following recipes", and then you're left with a very small subset of individuals, so small a set, that each of them is close to unique in the way he/she innovates, and therefore, there is no point in trying to come up with generalizations to identify them (apart from the fact they are really good technically, and creatively, and they come up with ideas that are actually useful).
You see, the problem isn't coming up with ideas, we have entire Universities and research centers full of PhDs, Post-Docs, and even companies with very bright engineers, all of them with plenty of ideas to go around. In the Universisites, most of the ideas are focused on stuff that can "get a PhD thesis" or "get a grant for another year", and unfortunately that's not the kind of goal that is always innovative or useful to the field of study or the progress of Mankind as a whole. On Companies, many good and excellent engineers have ideas all the time, but these are usually very narrow focused to solve specific problems (at least those are actually useful for something).
Again the cooking analogy applies here: the excellent chefs can take a recipe and adapt it to the circumstances, applying their creative skills, they can take or add a few ingredients or processes and improve an already know recipe to almost perfection. The Master Chefs on the other hand go way beyond changing known recipes.

I guess that what I want to close with is that, we need it all: we need the "Research" and the "Engineering", we need a lot of "good researchers" and "good engineers", we need plenty "excellent researchers" and "excellent engineers", and every once in a while, we also need a few "Master Chief".

Sunday, October 19, 2014

Tony Van Eerd on "Lock-Free by Example"

One of the talks on CppCon 2014 was given by Tony Van Eerd on the subject of lock-free queues.

The title of the talk is a bit misleading because although the techniques he describes are the ones used on lock-free algorithms and data structures, the queue shown on the presentation is not actually lock-free for either pop() or push().

The interesting thing about his talk is that he focuses on the thought process needed to arrive at a lock-free queue. Most papers and talks just give us the "end result" with the algorithm as it should be, but they don't explain why it has to be the way it is. His talk goes into the common pitfalls of designing a lock-free queue, which makes it particularly instructive for people new to the field.

There was one statement that caught my attention, that lock-free algorithms need to maintain invariants at all times. This is sooooo true!
The main reason why designing lock-free (and wait-free) data structures is hard, is due to the fact that the invariants must hold at every step of the way, for every method of the data structure, and every shared (atomic) variable. The more methods you have on a lock-free data structure, the more complex the problem becomes. The more variables are shared among those methods, the greater the complexity and possible interleavings, and higher number of states, where each one must still uphold all the invariants.
Designing data structures with locks is much easier because, the code within a lock()/unlock() can temporarily break invariants without worry, as long as the invariants are restored before the call to unlock().

Sunday, October 12, 2014

Concurrency Track of Eurosys 2014

Eurosys 2014 had a Concurrency Track and they recorded the videos for the four presentations.
Here is the link to the recording:
http://av-media.vu.nl/VUMedia/Play/568ed363278f4ab48e15c36b4a23e1a01d?catalog=55e11845-c6ed-4a29-a4bd-9f082393c2a6
http://av-media.vu.nl/VUMedia/Play/568ed363278f4ab48e15c36b4a23e1a01d?catalog=55e11845-c6ed-4a29-a4bd-9f082393c2a6


The recording includes:
- Callisto: Co-Scheduling  Parallel Runtime Systems   (0h:0m)
- StackTrack: An automated transactional approach to concurrent memory reclamation  (0h:29m)
- Using Restricted Transactional Memory to build a scalable in-memory database   (0h:54m)
- Algorithmic improvement for fast concurrent cuckoo hashing       (1h:36m)

I found StackTrack particularly interesting.... and it has Maurice and Nir as co-authors.