Thursday, October 30, 2014
Relaxed Atomics Linked List on PowerPC
I'm really excited about this one! It's very rare to write a data structure that can give 15x performance gains over the current implementations :-O
After a long waiting time to get my hands on a non-x86 machine with many cores to test the relaxed atomics linked list we discovered, I finally found RunAbove (many thanks to Yann for pointing it out!).
RunAbove have currently two different setups for PowerPC, one with 8 cores named Power8 S, and one with 176 (hw) threads named Power8 2XL. I registered myself for it, got a 32$ voucher (which is awesome, compared to AWS price ripoff for a single 16 core machine), and started running benchmarks on it.
The microbenchmarks compare three different linked lists in Java:
CLLElectedUnlink - A Lock-Free linked list that uses relaxed atomics for list traversal;
CLLElectedUnlinkVolatile - Exactly the same algorithm as CLLElectedUnlink but using a volatile (sequentially consistent atomic) for the Node.next;
ConcurrentLinkedQueue - This is from java.util.concurrent with Doug Lea's implementation of Michael and Scott's algorithm;
The source code for the lock-free lists we created and the benchmark itself can be found on github:
and you can try it out on RunAbove to see for yourself.
Barrier wise, when doing traversals of the lists, it's important to note the following:
- CLLElectedUnlink has a fixed amount of acquire-barriers (volatile loads), usually just one to read the head and then another to check that the last node's next is really null;
- CLLElectedUnlinkVolatile has one acquire-barrier per node, when reading the Node.next;
- ConcurrentLinkedQueue has two acquire-barriers per node, one for reading Node.next and one for reading Node.item;
The thing about loads with acquire (volatile loads) is that on x86 they are usually just regular loads, which means that using relaxed atomics (java's regular loads) for traversing the list has no gain when compared to using sequentially consistent atomics (java's volatile loads). On x86 there won't be any impact on the performance, neither good nor bad, but on other CPUs with a more relaxed architecture like PowerPC or ARM, this can make a difference, in fact, it can make a 15x difference!
Let's look at a plot of the microbenchmark comparing the three data structures, in a scenario where a thread can either do a write operation (a remove() followed by an add()) with a 10% probability, or a read-only operation (two contains()) with a 90% probability.
On an x86 Intel Core i7-3740QM we get the following results:
On the ppc Power8 S instance of RunAbove we get a significant different throughput:
yep, huge increase in throughput when traversing the list!
Unfortunately it's not such a rosy picture when we run on the PowerPC 176 VPC instance, but I suspect that it is due to the number of actual cores being much lower than 176, and the oversusbscription of threads causing some non-trivial effect. Even so, the performance gains are quite noticeable (at least 2x)
This means we finally showed that relaxed atomics can be used to make a really fast lock-free linked list!