Saturday, July 5, 2014

CLLElectedUnlink - A Lock-Free List with Relaxed Atomics (part 2)

This is the second part of the post on the CLLElectedUnlink, a lock-free linked list that uses relaxed atomics for node traversal, and the Elected Pattern to physically remove nodes from the list.

In the previous post we left off at the unlinking explanation. Now let's look at memory management.

Memory Management

What about memory management?
Well, the bad news is that the CLLElectedUnlink really needs an automatic GC, and it won't work without one. The original Michael and Scott queue doesn't need any memory management technique because list traversals are not allowed, but the version implemented in Java's CLQ could be adapted to C/C++ by using a memory management technique like Hazard Pointers.
The CLLElectedUnlink, by contrast, can't be used with Hazard Pointers or any other currently available memory management technique, because those techniques require knowing when a node has been physically removed from the list (i.e. is no longer reachable by another thread), and this is very hard to guarantee with certainty due to the usage of relaxed atomics.
Even if the unlinking procedure were done such that the stores had a release barrier, a thread starting the list traversal slightly before the node is unlinked could still see the node after it has been unlinked from the list. Remember that, at the start of the list traversal, there is an acquire barrier done when the head is read.
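To make the reclamation problem concrete, here is a small deterministic sketch, with a single thread standing in for a concurrent reader. The Node class here is a hypothetical simplification, not the real one: a traversal that has already obtained a reference to a node can still dereference it after it is unlinked, so the node's memory can only be reclaimed once no such reader exists, which is exactly what a GC tracks for us.

```java
// Hypothetical minimal node; in the real list "next" would be read with
// relaxed atomics, which is what makes it impossible to know when no
// traversal can still reach an unlinked node.
class Node {
    final String item;
    Node next;
    Node(String item) { this.item = item; }
}

public class UnlinkDemo {
    public static void main(String[] args) {
        // Build the list A -> B -> C
        Node a = new Node("A"), b = new Node("B"), c = new Node("C");
        a.next = b;
        b.next = c;

        // A "reader" got hold of B before the unlink (e.g. mid-traversal)
        Node reader = b;

        // Unlink B: new traversals from A no longer see it...
        a.next = c;

        // ...but the stalled reader can still dereference B and keep going.
        // Freeing B's memory at this point would be a use-after-free in
        // C/C++; a GC keeps B alive for as long as "reader" references it.
        System.out.println("reader still sees unlinked node: " + reader.item);
        System.out.println("and can continue to: " + reader.next.item);
    }
}
```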


The add method uses the algorithm by Michael & Scott:

public boolean add(final E item) {
    final Node<E> newNode = new Node<E>(item);
    while (true) {
        final Node<E> localTail = tail;
        final Node<E> node = localTail.getAcqBarrierNext();
        if (localTail == tail) {
            if (node == null) {
                // It seems this is the last node, so add the newNode here
                // and try to move the tail to the newNode
                if (localTail.casNext(null, newNode)) {
                    casTail(localTail, newNode); // Failure is OK.
                    return true;
                }
            } else {
                casTail(localTail, node);
            }
        }
    }
}
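The add() above relies on a Node class providing getAcqBarrierNext() and casNext(). The original code most likely used sun.misc.Unsafe for the relaxed and acquire accesses; on modern Java the same idea can be sketched with VarHandle. This is an assumed reconstruction, not the post's actual class:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Assumed sketch of the Node class implied by add()/contains():
// a plain read of "next" acts as the relaxed load, while the
// VarHandle provides the acquire-barrier read and the CAS.
class Node<E> {
    static final int INUSE = 0, REMOVED = 1;

    final E item;
    volatile int state = INUSE;
    Node<E> next;  // a plain read of this field is the relaxed load

    static final VarHandle NEXT;
    static {
        try {
            NEXT = MethodHandles.lookup()
                    .findVarHandle(Node.class, "next", Node.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    Node(E item) { this.item = item; }

    @SuppressWarnings("unchecked")
    Node<E> getAcqBarrierNext() {
        return (Node<E>) NEXT.getAcquire(this); // load with acquire barrier
    }

    boolean casNext(Node<E> expect, Node<E> update) {
        return NEXT.compareAndSet(this, expect, update);
    }
}
```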

In this implementation of the CLLElectedUnlink, we have three functions to do removals:

- markNode(): Traverses the list and marks a node as logically removed. Does not need to "get the lock" of the Elected Pattern.
- markAndUnlinkOne(): Traverses the list, marks the node, and unlinks it. Uses the Elected Pattern.
- markOneAndUnlinkAll(): Traverses the list, unlinking all the nodes that are marked for removal. Will also mark and unlink the node it is searching for, if it finds it. Uses the Elected Pattern.
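As a rough sketch of the first of these, markNode() only needs to traverse and CAS the node's state from INUSE to REMOVED; it never touches any next pointer, which is why it doesn't need the Elected Pattern's guard. The names and structure below are assumptions based on the snippets in this post, not the real implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of markNode(): logical removal only.
class MarkSketch<E> {
    static final int INUSE = 0, REMOVED = 1;

    static class Node<E> {
        final E item;
        final AtomicInteger state = new AtomicInteger(INUSE);
        volatile Node<E> next;
        Node(E item) { this.item = item; }
    }

    volatile Node<E> head;

    public boolean markNode(E item) {
        for (Node<E> node = head; node != null; node = node.next) {
            if (item.equals(node.item) && node.state.get() == INUSE) {
                // Logical removal; the physical unlink is left to whichever
                // thread wins the guard in the Elected Pattern.
                if (node.state.compareAndSet(INUSE, REMOVED)) return true;
            }
        }
        return false;
    }
}
```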

public boolean remove(final E item) {
    if (unlinkGuard == NO_GUARD && casUnlinkGuard(NO_GUARD, GUARDED)) {
        try {
            // We got the hold on the guard, now figure out if unlinking is
            // needed for other nodes or just this one.
            if (unlinkNeeded == NEED_UNLINK) {
                return markOneAndUnlinkAll(item);
            } else {
                return markAndUnlinkOne(item);
            }
        } finally {
            unlinkGuard = NO_GUARD;
        }
    } else {
        // Didn't get the hold of the guard, so mark the node and don't
        // do any unlinking.
        return markNode(item);
    }
}

The contains() function traverses the list using relaxed loads. As soon as it sees a null next, it must re-read it using a load with an acquire barrier, or a similar technique:

public boolean contains(final E item) {
    if (item == null) return false;
    Node<E> node = head;
    while (node != null) {
        if (item.equals(node.item) && node.state == INUSE) {
            return true;
        }
        // No need for acquire-barriers unless we see null
        final Node<E> nnext = node.next; // relaxed load
        node = (nnext == null) ? node.getAcqBarrierNext() : nnext;
    }
    return false;
}

Progress Conditions

Like I said at the beginning, this is a lock-free data structure, with all three public methods being lock-free: add(), remove() and contains().
As in Michael & Scott's algorithm, the add() function is lock-free.
Similarly, the contains() function will loop until it finds a node with a next of null (after doing an acquire barrier), and because there may be threads continuously adding new nodes to the end of the list, contains() is lock-free but not wait-free. The same logic applies to remove().

Consistency Model

Although it may not seem so at first glance, this linked list is Linearizable.
For the add(), the linearization point is the same as Michael & Scott's algorithm, because it is pretty much the same algorithm.
For the remove(), the logical removal of a node becomes visible to other threads doing remove() or contains() at the CAS operation that changes the Node.state to REMOVED. Thus, the linearization point for remove() is the place in the code where that CAS is successful, while for contains() it is the place where it reads the state variable.
For both contains() and remove(), their linearization point when searching for a node is the place where they read a next and it is non-null, because that is when a node inserted by an add() becomes visible to them, and then the item must match. At most, the same next will have to be re-read with an acquire barrier to make it visible, but the linearization point is the same.

On a side note, more of a wacky idea than anything else, I think it is possible to remove the acquire barrier on the next when it is null, but then there would only be one acquire barrier: the one when the head is read. I don't think this data structure would be Linearizable anymore, but it seems to me it may still be Sequentially Consistent. The reason why it would (probably) still be Sequentially Consistent is a bit tricky to explain, but it is related to three things:
First, there would always be at least one acquire barrier, the one from the head, so you could always use that as the point where you gain visibility over all the latest changes.
Second, any time a check of the state is done, it implies an acquire barrier, which provides visibility on all modifications (up to that point), so there can't be an issue of not seeing an earlier remove() but seeing a later remove().
Third, the add() operations are always done in sequence, which means that a node B inserted later than a node A must be, by definition, later in the list; therefore, if A is not seen, then B will not be seen either.
Like I said, it's just a crazy idea, so better to stick with the trick of having the acquire barrier when null is seen; this way we keep linearizability, which is a very nice property to have.

What is this good for

So what can we use this data structure for?
Well, pretty much anything you can use a concurrent linked list for. In addition, with some modifications, it can be transformed into an unordered set, by implementing an addIfAbsent() function that starts from the head searching for a matching item and, if it doesn't find it, inserts a new node with the item at the tail of the list.
If we change the state in Node to a value of type V, where a null value represents a node that has been marked as removed, and rename item to key of type K, then we can make an unordered map out of this. Obviously, the performance of such an unordered map is very lousy when compared to a hash-table based implementation, but it is possible to do it like this.
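A hedged sketch of how such an addIfAbsent() could look. The names are hypothetical, the tail update is simplified to a plain store, and a production lock-free version would also have to handle the race where another thread inserts the same item between the search and the append:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the set transformation described above.
class SetSketch<E> {
    static class Node<E> {
        final E item;
        final AtomicReference<Node<E>> next = new AtomicReference<>();
        Node(E item) { this.item = item; }
    }

    final Node<E> head = new Node<>(null); // sentinel node
    volatile Node<E> tail = head;

    public boolean addIfAbsent(E item) {
        // 1. Search from head for a matching item.
        for (Node<E> n = head.next.get(); n != null; n = n.next.get()) {
            if (item.equals(n.item)) return false; // already present
        }
        // 2. Not found: append at the tail, Michael & Scott style.
        final Node<E> newNode = new Node<>(item);
        while (true) {
            final Node<E> localTail = tail;
            final Node<E> node = localTail.next.get();
            if (node == null) {
                if (localTail.next.compareAndSet(null, newNode)) {
                    tail = newNode; // simplification: a CAS in the real code
                    return true;
                }
            } else {
                tail = node; // help advance a lagging tail
            }
        }
    }
}
```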

Disadvantages when compared with Java 8 ConcurrentLinkedQueue:

- On the CLLElectedUnlink, a thread doing a traversal may have to traverse many nodes that have already been marked for removal but not yet unlinked. This can also happen on the CLQ, although more rarely;
- Uses more memory because it needs an extra variable per node for the state;
- The CLLElectedUnlink isn't good as a queue;
- etc;

Advantages when compared with Java 8 ConcurrentLinkedQueue:

- Removals may be slightly faster than on the CLQ because they only need to do one CAS (as long as there are no retries) and no stores;
- It uses relaxed atomics, which is cool;
- Might be faster on relaxed architectures (i.e. non-x86)... emphasis on the might;

Yeah I know, the CLLElectedUnlink isn't exactly useful. As it is, it's more of a proof-of-concept, or research tool, than an actual data structure to be used in your day-to-day life. If you're looking for useful stuff, then you'd better go read about Double Instance Locking, or the Left-Right Pattern, or Scalable Reader Writer Locks.


Closing Thoughts

Now, I have to say that, although inspired by Java's CLQ, which uses the Michael & Scott queue algorithm, this data structure is a new idea that Andreia had. She actually came up with it as part of a much more complex data structure for which we have a paper already written, and are trying to get accepted at a conference. She was kind enough to let me publish this small piece here and share it with you, so that I could show an example of why relaxed atomics are so cool. My small contribution to this data structure was to realize that the code could be easily transformed to allow relaxed atomics while traversing the list.


Once again, Java code is available on github:
and in the meantime I've added an implementation in the D language, using relaxed atomics from D, and D's Garbage Collector:

If you happen to have an ARM, Cavium, Tilera, or PPC processor where this runs, please let us know how it compares against Java's CLQ when you fill it up with a few thousand items and then do contains() on them. I'm curious whether any noticeable performance difference will be seen, or if the cache-misses will always be the major bottleneck.
