Wednesday, October 1, 2014

Linked Lists with wait-free reads in C++ (LRLinkedList and LRLinkedListSingle)

Today Andreia gave birth to a healthy baby boy. To celebrate this happy event, we are releasing the C++14 code for not one, but two concurrent linked lists, both Wait-Free Population Oblivious (WFPO) for read-only operations.

The first one, named LROrderedLinkedList (LROLL), is just a linked-list-set wrapped in the Left-Right pattern.
The second one, named LROrderedLinkedListSingle (LROLLS), is slightly more complex, but it is faster for modifications (insertions and removals).
These two data structures are interesting as examples of using and implementing the Left-Right pattern in a scalable way in C++14, and, as far as I know, they are the only linked-list sets in C++ that are wait-free for read-only operations.

Yes, I know about Harris's linked list set, which is lock-free for all operations (lookups, insertions, and removals), and I know about its variant with wait-free lookups (wait-free bounded by the number of keys in the set), but there is a detail about that variant: it implicitly requires a system with a Garbage Collector. On a system without an automatic GC (like C11 or C++1x) we can't use Harris's linked list by itself without adding some kind of memory management, i.e. Hazard Pointers.
There are two problems with using Hazard Pointers (HP) in Harris's list:
First, HPs are not very fast, because they require at least one store with a release barrier for each node of the list that is traversed, even though the lookup function is expected to be a read-only operation.
Second, the HP technique is lock-free, not wait-free, which means that using it in the lookup function downgrades the progress condition of the lookups from wait-free bounded to lock-free.
In practice, all this means that in C11/C++1x there is no linked list set that is wait-free, AFAIK (if you happen to know of one, please add a link to it in the comments section).
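To make the per-node cost concrete, here is a hedged sketch of an HP-protected traversal. This is illustrative only, not the full Hazard Pointers algorithm: it omits the re-validation step a complete implementation needs, and the `hp` slot name and node layout are my own assumptions.

```cpp
#include <atomic>

struct HPNode {
    int key;
    std::atomic<HPNode*> next;
    HPNode(int k, HPNode* n) : key{k}, next{n} {}
};

// One release store per VISITED node: that is the cost this post refers to.
bool hpContains(std::atomic<HPNode*>& head, std::atomic<HPNode*>& hp, int key) {
    HPNode* curr = head.load();
    while (curr != nullptr) {
        // Publish the node BEFORE dereferencing it, so a concurrent reclaimer
        // scanning the hazard pointers will not free it under our feet.
        hp.store(curr, std::memory_order_release);
        // (A complete implementation must re-validate that 'curr' is still
        //  reachable after publishing; omitted here for brevity.)
        if (curr->key == key) {
            hp.store(nullptr, std::memory_order_release);
            return true;
        }
        curr = curr->next.load(std::memory_order_acquire);
    }
    hp.store(nullptr, std::memory_order_release);
    return false;
}
```

Notice that the store into `hp` happens once per traversed node, which is why a read-only lookup pays a synchronization cost proportional to the length of the list.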
If you have no idea what Harris's linked list is, or what Hazard Pointers are, here are the links to them:
http://timharris.uk/papers/2001-disc.pdf
http://concurrencyfreaks.com/2014/03/harriss-linked-list.html
http://researchweb.watson.ibm.com/people/m/michael/ieeetpds-2004.pdf

This is where LROLL comes in. LROLL is the Left-Right pattern applied to two linked list sets. The basic idea is that read-only operations run on one of the lists while a single modifying thread inserts/removes a node on the other list; the writer then redirects new lookups to the recently modified list, waits for unfinished lookups on the old one, and applies the same modification to the opposite list.
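As a rough illustration of that flow, here is a minimal sketch of the Left-Right pattern wrapped around two set instances. All names (`LRSet`, `toggleVersionAndWait`, etc.) are mine, the readIndicator is a deliberately naive single counter per version, and `std::set` stands in for the linked list set; the real LROLL uses a more scalable readIndicator (e.g. DCLC).

```cpp
#include <atomic>
#include <mutex>
#include <set>

class LRSet {
    std::set<int> inst[2];                  // the two instances
    std::atomic<int> leftRight{0};          // which instance Readers should use
    std::atomic<int> versionIndex{0};       // which readIndicator Readers mark
    std::atomic<long> readers[2]{{0}, {0}}; // naive readIndicator
    std::mutex writersMutex;                // serializes all modifications

    // Redirect new Readers to the other versionIndex and wait for the
    // Readers still marked on the previous one to depart.
    void toggleVersionAndWait() {
        const int prevVI = versionIndex.load();
        const int nextVI = 1 - prevVI;
        while (readers[nextVI].load() != 0) { } // drain leftovers on nextVI
        versionIndex.store(nextVI);
        while (readers[prevVI].load() != 0) { } // drain Readers on prevVI
    }

    template<typename F> void modify(F func) {
        std::lock_guard<std::mutex> guard(writersMutex);
        const int lr = leftRight.load();
        func(inst[1 - lr]);              // 1) modify the unused instance
        leftRight.store(1 - lr);         // 2) redirect new Readers to it
        toggleVersionAndWait();          // 3) wait for old Readers to leave
        func(inst[lr]);                  // 4) apply the same change again
    }
public:
    bool contains(int key) {             // read-only, constant sync cost
        const int vi = versionIndex.load();
        readers[vi].fetch_add(1);        // arrive
        const bool found = inst[leftRight.load()].count(key) != 0;
        readers[vi].fetch_sub(1);        // depart
        return found;
    }
    void add(int key)    { modify([&](std::set<int>& s) { s.insert(key); }); }
    void remove(int key) { modify([&](std::set<int>& s) { s.erase(key); }); }
};
```

The point of the sketch is the shape of `modify()`: every writer pays for two modifications plus a wait, while `contains()` does a constant number of atomic operations regardless of how many threads are running.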



Notice that there is only one instance of each key but two nodes per key, and that LROLL does its own memory management, without needing an extra memory-management scheme.
The big disadvantage is that the Left-Right pattern has a global lock that serializes modifications in the list (insertions/removals). This means that both insertions and removals are blocking, which by itself is not too bad, but the problem is that it's not scalable due to the serialization of the modifying operations. Moreover, two lookups are needed, and the modification has to be applied on two lists, which doubles the pain.
However, there are three big advantages:
First, the WFPO progress condition on the lookups means that the number of synchronized atomic operations is constant (depends on the readIndicator used, but typically just two or three stores with release). This provides excellent performance for lookups and very low latency guarantees.
Second, both HPs and Harris's linked list are patented: HPs by IBM, and Harris's list by Sun/Oracle. The Left-Right pattern is free for use :)
https://docs.google.com/viewer?url=patentimages.storage.googleapis.com/pdfs/US20040107227.pdf
https://docs.google.com/viewer?url=patentimages.storage.googleapis.com/pdfs/US7117502.pdf
Third, and this is non-obvious: no other linked list implementation provides linearizability for read-only iterators, with the exception of protecting a linked list with a mutex or reader-writer lock, and that approach is blocking for all operations. Both Harris's linked list and the Michael & Scott linked list provide only weak consistency for iterators of the list, while LROLL and LROLLS provide linearizable read-only iterators (a property sometimes called snapshot consistency).


Just how important is it to have linearizable read-only iterators? Well, it depends on your application, so let me give an example.
Suppose you have a banking application that uses a linked list set of bank accounts. If we use a LROLL, adding a new bank account or removing a closed account are both blocking procedures. Doing a transfer from one account to another is also a blocking operation. Checking the current amount in an account is WFPO (and linearizable).
How about determining the total amount of cash available in all the accounts, i.e. all the money in the bank? Well, that is also a WFPO operation, and linearizable, because it is a read-only iteration over all the accounts. Notice that while this operation is ongoing, any modifying operation has to wait for it to finish (and for any other ongoing read-only operations).
The Harris and Michael & Scott lock-free lists either don't allow this kind of operation, or allow it but not in a linearizable way. In practice this means that a function that walks over all the accounts and sums them will most likely return a value very different from the actual amount in the bank, because an in-flight transfer can be counted twice (or multiple times), or never counted at all. For example, suppose there are only two accounts, and while one thread is summing their values, another thread is doing a transfer from the first to the second by withdrawing 1M $ from the first and then crediting 1M $ to the second. If the summing thread reads the first account after the 1M $ has been withdrawn and reads the second account before it is credited, it will see 1M $ less than the actual amount of money in the bank.
The above situation cannot happen with LROLL.


LROLLS extends the idea of LROLL a bit further. Instead of having two instances of the linked list, it has a single physical linked list but two logical lists as shown below:



Each node has two "next" members, one of the logical left list (nextL) and another for the logical right list (nextR):
template<typename T> class Node {
    T* key;
    Node* nextL;  // next node in the logical Left list
    Node* nextR;  // next node in the logical Right list
};


The first benefit of this approach is saving one pointer to the key per node, thus saving some memory, but the main advantage is that an insertion or removal in the middle of the list does not have to start over and do a second lookup to find the node to insert after (or the node to be removed), because it is the same physical node. In LROLLS it's just a matter of linking/unlinking the new node on the same physical node, for both nextL and nextR.

In LROLLS, an insertion does one lookup to find the node after which the new node will be inserted, links the new node there (say, through nextL), waits for the left-right toggle, and then links the new node on the same physical predecessor but through the opposite field (say, nextR). In LROLL this last step requires a second lookup from the beginning of the list, but in LROLLS we already know where the node is, so no second lookup is needed. For small lists this isn't much of an advantage, because most of the time goes into the synchronization overhead of the Left-Right pattern, but for large lists this can make insertions and removals in LROLLS nearly 2x faster than in LROLL.
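The two phases of such an insertion can be sketched like this (a simplified sketch with illustrative names, not the actual LROLLS code; the Left-Right toggle between the two phases is elided):

```cpp
struct Node {
    int key;
    Node* nextL;  // link for the logical Left list
    Node* nextR;  // link for the logical Right list
};

// Phase 1: link 'node' after 'prev' in the logical list the Readers are
// NOT currently using (say, the Left one).
void insertLeft(Node* prev, Node* node) {
    node->nextL = prev->nextL;
    prev->nextL = node;
}

// Phase 2: after the Left-Right toggle (new Readers redirected, old Readers
// drained), link the SAME physical node through the opposite field.
// No second lookup: 'prev' is the same physical predecessor.
void insertRight(Node* prev, Node* node) {
    node->nextR = prev->nextR;
    prev->nextR = node;
}
```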

For example, in LROLLS the removal of the key Y from the set would work like this:
1. The Writer starts from the head of the logical list opposite to the one the Readers are using and unlinks the corresponding next field.


2. The Writer re-directs new Readers to the currently modified logical list, waits for older Readers to finish on the other logical list, and then unlinks the other next field.


3. The node with the key Y can now be deleted/freed




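The three removal steps above can be sketched as follows (illustrative names, not the actual LROLLS code; the redirect/wait between steps 1 and 2 is elided):

```cpp
struct Node {
    int key;
    Node* nextL;  // link for the logical Left list
    Node* nextR;  // link for the logical Right list
};

// Step 1: unlink 'victim' from the logical list the Readers are NOT using
// (say, the Right one).
void unlinkRight(Node* prev, Node* victim) {
    prev->nextR = victim->nextR;
}

// Step 2: after redirecting new Readers and waiting for the old ones,
// unlink the same physical node from the other logical list.
void unlinkLeft(Node* prev, Node* victim) {
    prev->nextL = victim->nextL;
}

// Step 3: no Reader can still hold a reference to 'victim',
// so it can now be deleted/freed.
```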
All these examples use a linked-list set, but the idea applies to any other data structure (except for LROLLS, which is specific to linked-list-based data structures), i.e. you can use the same kind of ideas to create concurrent versions of other single-threaded data structures.

As always, source code is available on github:
LinkedListSet.h - A single-threaded linked list set. The Left-Right pattern uses two instances of this class to create a set with WFPO read-only operations
LROrderedLinkedList.h - LROLL: two LinkedListSet instances protected with the Left-Right pattern using a DCLC readIndicator
LROrderedLinkedListSingle.h - LROLLS: Less memory usage than LROLL and 2x faster insertions and removals
RWLockLinkedListDCLC.h - Uses a DCLCRWLock (based on C-RW-WP) to provide concurrent (blocking) access to a LinkedListSet
RWLockLinkedListPT.h - Uses a pthread_rwlock_t to provide concurrent (blocking) access to a LinkedListSet
RWLockLinkedListSM.h - Uses a std::shared_timed_mutex to provide concurrent (blocking) access to a LinkedListSet


PS: Ohhhhh, I almost forgot: if you're on C99 or C++98 (no memory model) then you can use Double Instance Locking instead of the Left-Right in LROLL and LROLLS and obtain a lock-free progress condition for read-only operations.
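For illustration, here is a hedged sketch of the Double Instance Locking idea (all names are mine; the PS targets C99/C++98, where you would use plain rwlocks, but the sketch uses C++14 types only to stay self-contained):

```cpp
#include <mutex>
#include <set>
#include <shared_mutex>

class DILSet {
    std::set<int> inst[2];          // two instances of the set
    std::shared_timed_mutex rw[2];  // one reader-writer lock per instance
    std::mutex writersMutex;        // serializes writers
public:
    // Lock-free (not wait-free) reads: the writer write-locks at most one
    // instance at a time, so one of the two try_lock_shared() calls
    // eventually succeeds.
    bool contains(int key) {
        for (int i = 0; ; i = 1 - i) {  // alternate between the instances
            if (rw[i].try_lock_shared()) {
                const bool found = inst[i].count(key) != 0;
                rw[i].unlock_shared();
                return found;
            }
        }
    }
    // Blocking writes: apply the modification to each instance in turn.
    void add(int key) {
        std::lock_guard<std::mutex> guard(writersMutex);
        for (int i = 0; i < 2; i++) {
            std::lock_guard<std::shared_timed_mutex> wl(rw[i]);
            inst[i].insert(key);
        }
    }
};
```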

7 comments:

  1. Sorry, not buying this. You write "On a system without an automatic GC (like C11 or C++1x) we can't use Harris's linked list by itself without adding some kind of memory management, i.e. Hazard Pointers. " and then proceed to assume that Hazard pointers are the only possible memory management.

    In practice, there are many, many methods of handling these, ranging from the trivial to the slightly less so.

    Reducing the read cost to one release store plus one un-barriered store per read operation (_not_ per node) is trivial. Any flavour of Epoch GC will do this. (This is essentially reference counting the versions in-use, noting that we can block the writers to ensure that there are at most 2 live versions).

    A brief search found some code: http://hwegc.cvs.sourceforge.net/viewvc/hwegc/hw_ebr.hpp?view=markup&pathrev=HEAD
    (I'm not claiming it's decent code, just something that illustrates the notion. This code is some distance from optimal).

    If we're allowed some OS support:

    We can reduce the cost to two _non-atomic_ stores by storing the current version in a thread-local during read operations, and requiring writers to signal every thread to read that thread-local.

    (There's a similar version for CPU local).

    Replies
    1. Hi,
      As far as I know, Hazard Pointers are the best currently known (lock-free) memory management technique, apart perhaps from newer stuff like "Drop The Anchor" or "StackTrack"
      http://www.cs.technion.ac.il/~sakogan/papers/spaa13.pdf
      http://msr-waypoint.com/pubs/211690/htm_gc.pdf

      If you go into GC territory then, as you said, you need OS support for it, which many OSes don't have. In fact, if you have access to the OS, then you're better off with something like RCU, which seems more deterministic in latency (and scalability) than most (or all?) GCs.
      The main advantage of the Left-Right technique is that it doesn't need a GC, an OS with GC support, or RCU support, and can therefore be used in pretty much any system, preferably one with a nice memory model to make it easier to implement ;)

    2. Alas, I clearly obscured my point.

      Reader ref counting schemes do not require OS support.
      If you add OS support, you can make them even faster, but they work fine without it.

      I can't find a handy concise reference, so here's a sketch.

      global CurrentVer = 0;
      global counts[N] = {0,};

      Reader:
      do {
          acquire load e0 = CurrentVer
          release increment count[e0]
          acquire load e1 = CurrentVer
          if (e0 == e1) break;
          release decrement count[e0];
      } while (e1 != e0);

      // Start of critical section. Follow pointers with impunity.
      // Do read operation and so on.

      // End of critical section, release reference count.
      release decrement count[e0];

      Writer:
      take writer lock;

      modify linked list as per harris et al;
      // Assume it unlinks element M from the list which
      // needs to be deleted.

      acquire load e0 = CurrentVer
      release store CurrentVer = (e0 + 1) mod N;
      // Wait for all readers to advance to the new epoch.
      do {
          acquire load c = counts[e0];
          if (c == 0) break;
          sleep or similar;
      } while (1);

      // All readers that could have visibility of M have all left
      // the critical section, so it's now safe to delete.
      delete M;

      This is WFPO for readers.
      Writers in this trivial implementation have a tougher time.
      (Speeding up the writers is trivial, albeit at the cost of making the readers not strictly WFPO: The do{}while() loop in the Reader would no longer have a strict bound).

      If the cost of the release-stores seem bothersome, change counts[] to be per-thread or per-cpu variables to taste (with the obvious writer costs).

      Key points:
      #1. Readers are WFPO
      #2. There's no OS support required.
      #3. Reader cost is two release stores per read operation (not per node).

      Re RCU: This is just terminology. For any lock free code that permits memory reclaim, the reader population must be tracked _somehow_. It can be eager (any one of a wide variety of ref counting schemes) or lazy (any one of a wide variety of state examination schemes) or some hybrid along the way.

      The ref counting scheme above happens to be near the eager end of the spectrum. Adding a delay list per version makes it a bit lazier. We could make it lazier again with a stop-the-world GC. They're just different ways to trade off the underlying cost of knowing what the heck the readers are doing.

    3. Hi,
      Thank you for your comment, with the example you provide I was able to clearly understand the idea!
      You are correct when you say that many memory management techniques (like Reference Counting schemes) do not need OS support.
      What is incorrect is to say that the Reader in the code you show is Wait-Free Population Oblivious, or even Wait-Free.
      In the Ref Counting scheme you describe, the Reader method is Lock-Free. The fact that it has a while loop is usually a good tell-tale sign that it is not wait-free. You can check the definitions of Lock-Free, Wait-Free, and WFPO in the following post, or in the book "The Art of Multiprocessor Programming":
      http://concurrencyfreaks.com/2013/05/lock-free-and-wait-free-definition-and.html

      I didn't mention this technique in the post because it is only lock-free, and because the Writers are blocking, which is the main disadvantage of the Left-Right when compared with Hazard Pointers used per linked-list node, or with Reference Counting also done per linked-list node. If you use HPs or Ref Count with a global lock, then all their advantages over the Left-Right technique are lost, because the Left-Right is also blocking (starvation-free) for Writers but WFPO for Readers.
      In addition, you can also use Hazard Pointers in pretty much the same way as you describe, which should be (slightly) better than using Ref Count, because with HPs we do two stores with release, while with Ref Count we do two increments with release.

      In summary, the reason why you can't find a link to a Reference Counting scheme that is wait-free for read-only operations is because there isn't one ;)

  2. Firstly, congratulations! Thanks for giving birth to new ideas.

    Replies
    1. Thanks Mark!
      For the "baby stuff" it was mostly Andreia, but for the Left-Right stuff it was the both of us ;)
