Wednesday, November 5, 2014

Improved ConcurrentLinkedQueue for PowerPC (and ARM?)

There is lock-free list/queue in Java under the java.util.concurrent package named ConcurrentLinkedQueue. It uses a slightly modified version of Michael and Scott's algorithm, and it has the following main methods:
add()/offer(): Adds a new item to the end of the list;
poll(): Removes the first item in the list;
contains(): Verifies whether a given item exists in the list or not;
remove(): Removes an item anywhere in the list if it is present;

We talked on a previous post about CLLElectedUnlink that can use an optimized version with relaxed list traversals to improve the throughout in CPU architectures with weak memory ordering, like PowerPC and ARM, providing an increase in performance that can go up to 15x.
As it so happens, we figured out a way to use the same kind of optimizations on Java's ConcurrentLinkedQueue (CLQ) with minimum code modifications, and we named it ConcurrentLinkedQueueRelaxed.

Yes, you read it well, it's not a typo! This version of CLQ has a throughput increase on PowerPC that can be 15x than the one currently on JDK 8:

You can try for yourself on one of the PowerPC instances of RunAbove, or if you have a PowerPC machine.
We would like to hear about the results of your own application, or your own benchmarks, on either PowerPC or ARM (ARMv7 or ARMv8).

A more detailed description of the optimizations can be found on this presentation:

The basic idea is that we traverse the list and read the value of item without using a non-volatile load, until the item we're looking for is found (it's a bit more trickier than that, but this is the main concept).

We modified contains() to use read the reference to the next node using a non-volatile load (unless it is null), and to read the reference to the item also as a non-volatile load (unless it is null):

public boolean contains(Object o) {
    if (o == null) return false;
    for (Node<E> p = first(); p != null; p = succRelaxed(p)) {
        E item = p.getRelaxedItem();
        if (item != null && o.equals(item) && p.item != null)
            return true;
    return false;

A similar modification was done for remove():
public boolean remove(Object o) {
    if (o == null) return false;
    Node<E> pred = null;
    for (Node<E> p = first(); p != null; p = succRelaxed(p)) {
        E item = p.getRelaxedItem();
        if (item != null &&
            o.equals(item) &&
            p.casItem(item, null)) {
            Node<E> next = succ(p);
            if (pred != null && next != null)
                pred.casNext(p, next);
                return true;
            pred = p;
        return false;

The results speak for themselves:

The 100% Writes plot means that 50% of the operations are add(), the other 50% remove(), and there are no contains() being done.

Notice that as Herb Sutter mentioned on the second part of his presentation on the C++1x memory model, and Doug Lea on several presentations, using volatile loads or relaxed loads on x86 is the same, which means that these optimizations have absolutely no effect on x86.
It's only on PowerPC (and maybe ARM) that we see its true potential, because a volatile load on PowerPC implies two dcs instructions per node that is traversed, so by cutting those, we have the observed performance gains.
And what's more impressive is that you get the performance improvement even if you're using a single thread  :-O


  1. Hi Pedro,

    Great work!
    Could you please share results for 50% writes - it is most used mode when using CLQ as Queue.
    Is it possible to measure latency for exchange between queues on Power PC platform (both for original and modified)?


  2. Hi Andriy,
    Thanks, good to know our work is appreciated ;)

    We don't have the plot for 50% Writes, but I added the 100% Writes on the post above and you can "extrapolate" between the 10% and the 100%.

    Regarding latency measurements, we didn't do it but we have some benchmarks that we can adapt to this, the problem is always the choice of which scenario to measure?
    - 100%, 10%, 1%, or 0% ?
    - How big should the list be? 100, 1k? 10k? 1M?
    - Should we have dedicated threads to add()/remove() and others for contains()?
    - Should we measure the latency for contains() or for remove(), or both?
    - How many threads to run with?

    Latency measurements are very sensitive on the application where it's run. Are you sure it will be useful to see latency measurements for a microbenchmark?


  3. Hi Pedro,

    Thank you for sharing 100% Write results and they looks very impressive too.

    I see that it is impossible to carefully measure distribution of sub-microsecond latencies on JVM due nanotrusting nanotime:

    I'm interested in CLQ because it was used widely for message exchanging in concurrent libraries and frameworks (e. g. actors libraries in Akka & Scalaz).

    Couple of years ago I found non-intrusive MPSC node-based queue, described by Dmitriy Vyukov:
    Then I implemented alternative versions of actors message queues for Akka & Scalaz using this queue:
    This repo also has number of benchmarks that tests throughput and avg. latency of message exchanging in different modes and env. configuration.

    In last commits I tried to follow your approach by relaxing of message dequeueing using UNSAFE.getObject(...) call before volatile read.
    Now I looking for acceess to non x86 hardware to test if these changes are worth.

    It will be great if you will share your expirience of testing on PowerPC with RunAbove service:
    - how to sign-in and setup environment (JDK, Maven/Groove/SBT etc.);
    - which tools or commands are available to help in performance testing;
    - is it possible to setup some CI server to run benchmark on per commit basis?

    Best regards,

    1. Hi Andriy,

      I didn't try Maven/Groove/SBT so I have no idea, but getting a shell to the VM instance is very similar to Amazon AWS: you setup the ssh key, and then connect with username "admin" and your keyphrase, using ssh client or putty or something like it.
      Once you're in, you can install new packages using:
      sudo su -
      yum install
      yum install gcc unzip java wget tar

      I don't know of any performance tools available, but if there are any on Fedora/Ubuntu you can install them, otherwise you can pull them from some other website, using wget for example (to pull from sourceforge/github etc). We used our own java benchmarks.

      You can setup CI but it requires having the VM always on, which may not be that cheap (the 8 core instance is 0.05$ per hour). It depends on where you are running your CI server.