Thursday, September 24, 2015

Poor Man's URCU

This post is about a new way of implementing the core API of RCU using only locks.
You can get the detailed paper here:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/poormanurcu-2015.pdf
and benchmark code (which I'm not particularly proud of, but it works) here:
https://github.com/pramalhe/ConcurrencyFreaks/tree/master/CPP/papers/poormansurcu

RCU is many things, but at its essence it allows you to introduce quiescent points in your program. As a subset of this, we can say that "RCU allows a thread to make a modification and then wait until it is certain that this modification is visible to all other threads".
This last feature may seem trivial, but it is incredibly useful, from safe memory reclamation to building constructs like Left-Right.

RCU has three core APIs, namely rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu(). Typical implementations have wait-free progress for rcu_read_lock() and rcu_read_unlock(), and are blocking by design on synchronize_rcu().
You can see the main implementations in liburcu (userspace RCU), and many toy implementations in Section 9.3.5, "Toy" RCU Implementations, of perfbook by Paul McKenney:
https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

To use RCU in your application you need to link with the URCU library, or implement one of the toy URCU implementations mentioned in perfbook. However, this approach requires a compiler and language with access to atomic operations. An alternative is to use a lock-based implementation of RCU, like the one described in Section 9.3.5.1 of perfbook:
static spinlock_t rcu_gp_lock;  // single global lock shared by readers and updaters

static void rcu_read_lock(void) {
    spin_lock(&rcu_gp_lock);
}

static void rcu_read_unlock(void) {
    spin_unlock(&rcu_gp_lock);
}

void synchronize_rcu(void) {
    // Acquiring and immediately releasing the lock waits for any pre-existing reader
    spin_lock(&rcu_gp_lock);
    spin_unlock(&rcu_gp_lock);
}

The approach above is blocking for rcu_read_lock() and not scalable at all. A slightly better approach would be to use a Reader-Writer lock instead of a spinlock, but it would still be blocking.
We propose an alternative way, using only locks, that is lock-free for rcu_read_lock().

Threads that call rcu_read_lock() and rcu_read_unlock() (Arrivers, or Readers) attempt to acquire the read-lock on the first of the reader-writer locks (rwlock1) with a trylock() operation; if that fails, they do a trylock() on the second reader-writer lock (rwlock2), and if that also fails, they try the first rwlock again, and so on.
This procedure could go on indefinitely, but it can only keep failing if another thread is making progress, namely a thread acquiring the write-lock on one of the reader-writer locks, which means the operation is lock-free.

A thread that calls synchronize_rcu() (Toggler or Writer) starts by acquiring the mutual exclusion lock, to guarantee that only one Toggler is present at a time.
The Toggler then acquires the write-lock on the second reader-writer lock and releases it as soon as it has obtained it, followed by doing the same on the first reader-writer lock. This is enough to introduce a linearization point: it is now safe for the Toggler to free/delete the memory of the object that is meant to be reclaimed (or perform whatever other task it needs to do), because it has the guarantee that no ongoing Reader can still hold a pointer to that object.

Here's what it looks like when using pthreads:

typedef struct {
  pthread_mutex_t  togglerMutex;
  pthread_rwlock_t rwlock1;
  pthread_rwlock_t rwlock2;
} poorman_urcu_t;

// The return value should be passed as arg 'whichone' of rcu_read_unlock()
int rcu_read_lock(poorman_urcu_t * self) {
  // Keep trying both rwlocks. A tryrdlock() can only fail while a Toggler
  // holds (or is acquiring) the corresponding write-lock, i.e. while another
  // thread is making progress, which is what makes this loop lock-free.
  while (1) {
    if (pthread_rwlock_tryrdlock(&self->rwlock1) == 0) {
      return 1;
    }
    if (pthread_rwlock_tryrdlock(&self->rwlock2) == 0) {
      return 2;
    }
  }
}

void rcu_read_unlock(poorman_urcu_t * self, int whichone) {
  if (whichone == 1) {
    pthread_rwlock_unlock(&self->rwlock1);
  } else {    // whichone == 2
    pthread_rwlock_unlock(&self->rwlock2);
  }
}

void synchronize_rcu(poorman_urcu_t * self) {
  pthread_mutex_lock(&self->togglerMutex);   // only one Toggler at a time
  // Wait for any Readers currently holding the read-lock on rwlock2 to drain
  pthread_rwlock_wrlock(&self->rwlock2);
  pthread_rwlock_unlock(&self->rwlock2);
  // Wait for any Readers currently holding the read-lock on rwlock1 to drain
  pthread_rwlock_wrlock(&self->rwlock1);
  pthread_rwlock_unlock(&self->rwlock1);
  pthread_mutex_unlock(&self->togglerMutex);
}


Notice that in this implementation the rcu_read_lock() API returns which reader-writer lock was acquired, in the form of an integer, which is then passed to rcu_read_unlock() as an argument so that it knows which reader-writer lock to unlock. An alternative implementation could use a thread-local variable to pass this information.
An example of how to use this API for memory reclamation can be seen here https://lwn.net/Articles/262464/#Wait%20For%20Pre-Existing%20RCU%20Readers%20to%20Complete
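Here's a minimal sketch of that pattern using the API above. The names config_t, gc, reader_thread, and update_config are just for illustration (they are not part of the API), and the shared pointer itself is read and swapped with GCC's __atomic builtins; on a platform without any atomics you would have to protect the pointer publication some other way.

#include <stdlib.h>

typedef struct { int a; int b; } config_t;

poorman_urcu_t  gc;                     // assumed to be initialized elsewhere
config_t       *shared_config = NULL;   // read by many threads, replaced by one

void reader_thread(void) {
  int which = rcu_read_lock(&gc);
  // Equivalent of rcu_dereference(), approximated with a GCC builtin
  config_t *cfg = __atomic_load_n(&shared_config, __ATOMIC_CONSUME);
  if (cfg != NULL) {
    // ... read cfg->a, cfg->b ...
  }
  rcu_read_unlock(&gc, which);          // after this, a Toggler may free cfg
}

void update_config(config_t *new_cfg) {
  // Equivalent of rcu_assign_pointer(), also returning the old value
  config_t *old = __atomic_exchange_n(&shared_config, new_cfg, __ATOMIC_RELEASE);
  synchronize_rcu(&gc);                 // wait for all pre-existing Readers
  free(old);                            // safe: no Reader can still see 'old'
}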

We compared against the Bullet-Proof URCU implementation and it's not as good (obviously), but that's not too bad, because in most scenarios the bottleneck won't be the RCU calls themselves but some other piece of code in your application.



One interesting detail is that if the underlying Reader-Writer lock is starvation-free for Writers, then synchronize_rcu() will also be starvation-free, which is a desirable property in low-latency settings.
Alternatively, a Reader-Writer lock with writer preference should be used, to avoid starving the threads calling synchronize_rcu() (the Togglers).
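For example, on Linux/glibc a pthread_rwlock_t can be asked to prefer writers when it is initialized. Here's a sketch of an init function for the poorman_urcu_t struct shown above; the name poorman_urcu_init is just for illustration, and pthread_rwlockattr_setkind_np() is a glibc-specific (non-portable) call that requires _GNU_SOURCE.

// Sketch: initialize the two rwlocks with writer preference (glibc-specific).
// In glibc, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP is the kind that
// actually gives writer preference.
int poorman_urcu_init(poorman_urcu_t * self) {
  pthread_rwlockattr_t attr;
  pthread_rwlockattr_init(&attr);
  pthread_rwlockattr_setkind_np(&attr, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
  pthread_mutex_init(&self->togglerMutex, NULL);
  pthread_rwlock_init(&self->rwlock1, &attr);
  pthread_rwlock_init(&self->rwlock2, &attr);
  pthread_rwlockattr_destroy(&attr);
  return 0;
}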

Another interesting detail is that the order of acquisition/release of the rwlocks in synchronize_rcu() is not important for correctness. The algorithm would be correct regardless of whether the exclusive lock is acquired/released first on rwlock1 and then on rwlock2, or vice-versa.
Although this doesn't matter for correctness, it may matter for performance. For example, consider a scenario where read operations are typically very short, and the first lock the Readers try is the last lock the Togglers take (as in the code shown above). After the Toggler unlocks rwlock1, by the time it unlocks the togglerMutex and the next waiting Toggler acquires the togglerMutex, most of the old Readers will have finished their work, released rwlock2, and moved on to rwlock1 (the first one they try). This means the next Toggler can acquire the exclusive lock on rwlock2 immediately and, probabilistically, ends up waiting only for rwlock1.

And speaking of performance: obviously, this technique's throughput is bounded by the underlying rwlock implementation. If a pthread_rwlock_t is used, don't expect much out of it, but if something like C-RW-WP (a scalable reader-writer lock) is used instead, then Readers will scale very well, and I would expect this technique to come close to the kind of scalability provided by the mainstream URCU implementations.
Our goal here wasn't performance anyway; we were more concerned with an RCU that is easy to deploy, has a low memory footprint, and gives low latency guarantees, and that works on top of pthreads, because we have some old, small embedded systems where RCU can be useful (think Internet of Things).

A different use-case for this is in scenarios where linking with RCU isn't easy, like on the JVM. It's true that most use-cases for RCU are related to safe memory reclamation, and on the JVM we have a GC, so it's somewhat pointless for that purpose, but there are some special use-cases: for example, when we need to track the lifetime of complex objects and want to safely re-use them in a concurrent setting, with the certainty that there are no threads still referencing them... but using RCU in this way is a topic for a post of its own.


To wrap up: although not particularly efficient, and with a limited API, our "Poor Man's URCU" algorithm is highly portable across operating systems and programming languages, does not require linking with an external library, does not require kernel/OS support, and has no licensing constraints.
We hope that the simplicity and ease-of-use of our algorithm can foster the usage of RCU in domains where it was previously impractical.

Thursday, September 3, 2015

Fast lookups with memory_order_consume

In this post we're going to explain how to make your lock-free/wait-free data structures do faster lookups by using std::memory_order_consume (at least on PowerPC).

Motivation

Many known concurrent data structures with lock-free and wait-free progress use some kind of node traversal when doing lookups (the Harris Linked List, the Michael & Scott queue, Fast Concurrent Lock-Free Binary Trees, etc.).
From the point of view of the C11/C++11 memory model, this traversal is done with an acquire-load each time the pointer to the next node is read, using memory_order_acquire or memory_order_seq_cst. Doing this ensures that the contents of the node will be properly initialized, and otherwise consistent, including the key associated with the node.
http://en.cppreference.com/w/cpp/atomic/memory_order

The problem with using memory_order_acquire or memory_order_seq_cst is that they do a memory barrier on CPUs with weak ordering like PowerPC.

(Source: The C11 and C++11 Concurrency Model)
One of the available orderings in the C11/C++11 memory model is memory_order_consume, which, when used to read the pointer to the next node, guarantees that the contents of the next node and its dependent variables will be up to date. This includes the pointer to the key associated with the node, but not necessarily the key's contents.
A good example (pointed out to us by Martin Buchholz) is when the key contains a pointer to a global (static) variable, for which the dependency can not be guaranteed, and as such, it may not be properly initialized, and no Happens-Before relation is provided.

The motivation behind using memory_order_consume is that on most architectures (DEC Alpha being the exception, due to its reordering of dependent loads) it does not require a memory barrier, and therefore provides better throughput when traversing the nodes.
A desirable algorithm would allow any kind of user-provided key, regardless of its contents, to be used in a node-based data structure, with the nodes traversed using memory_order_consume as much as possible, and with loads using memory_order_acquire eliminated or reduced to a minimum.


Algorithm

Our technique is very simple: it requires adding a new member field to each node, keyhash, which stores a hash of the key. Notice that this field can be a regular variable, i.e. there is no need to use std::atomic<>.
Example code in C++11 looks like this:

#include <atomic>       // std::atomic, std::memory_order_*, std::atomic_thread_fence
#include <cstddef>      // std::size_t
#include <functional>   // std::hash

template<typename K>
struct Node {
  std::atomic<K*>       key;
  std::size_t           keyhash;   // regular (non-atomic) variable holding the hash of *key
  std::atomic<Node<K>*> next;
};

template<typename K>
Node<K>* lookup(Node<K>* head, K* lookup_key) {
  static std::hash<K> hashFunc;
  const std::size_t lookup_hash = hashFunc(*lookup_key);
  Node<K>* node = head->next.load();   // seq_cst on the first load (see notes below)
  while (node != nullptr) {
    if (node->keyhash == lookup_hash) {
      // Hashes match: only now do we look at the key itself
      K* node_key = node->key.load(std::memory_order_acquire);
      if (*node_key == *lookup_key) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return node;
      }
    }
    node = node->next.load(std::memory_order_consume);
  }
  std::atomic_thread_fence(std::memory_order_acquire);
  return nullptr;
}


The idea is that we do a memory_order_consume load on each node->next and check whether the keyhash member of the node is the same as the hash of the key we're searching for. Only when there is a match do we look at the key, using memory_order_acquire; otherwise we continue to the next node without ever looking at the key. This memory_order_acquire will provide consistency for the key and make sure its fields are properly initialized, even for members that are, or point to, global variables.

A few things to notice in the code above:
  • The first load, on head->next, is done with std::memory_order_seq_cst so as to prevent stores from moving inside the loop. Depending on the particularities of the data structure you may or may not need this. The rule of thumb is: if you need it when using memory_order_acquire, then you certainly need it with memory_order_consume;
  • Before returning, whether the key was found or not, we do an atomic_thread_fence(std::memory_order_acquire) to prevent the loads from "escaping". Again, if you needed it for memory_order_acquire then you also need it for memory_order_consume. Doing the loads of node->next with memory_order_seq_cst would not need any atomic_thread_fence() at the end, but then it would be much slower on PowerPC;
  • If you're using RCU, then it may provide acquire/release semantics (depending on the implementation) and you won't need the atomic_thread_fence() calls.
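One thing the lookup() code doesn't show is the writer side: whoever inserts a node must fully initialize it (key, keyhash, and next) before publishing it with a release store, otherwise there is nothing for the consume/acquire loads to synchronize with. Here's a minimal sketch, assuming a single writer (or a writer already serialized by a lock); the name insert_at_head is just for illustration.

template<typename K>
void insert_at_head(Node<K>* head, K* key) {
  static std::hash<K> hashFunc;
  Node<K>* node = new Node<K>();
  // Fully initialize the node before it becomes reachable by other threads
  node->key.store(key, std::memory_order_relaxed);
  node->keyhash = hashFunc(*key);
  node->next.store(head->next.load(std::memory_order_relaxed),
                   std::memory_order_relaxed);
  // Publish with release: this pairs with the consume/acquire loads in lookup().
  // A truly lock-free insert would use a compare_exchange loop here instead.
  head->next.store(node, std::memory_order_release);
}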


Ordered data structures

The technique above works well for unordered node-based data structures, but what about ordered node-based data structures, like the Harris Linked List or some kind of lock-free binary tree?
We can use the same approach, but with an extra trick: instead of ordering on the key, we order on the key's hash, and only when there is a match do we start ordering on the key itself.
What this means in practice is that we can create an internal "comparator" method that calls the user's comparator; in C++ it would work more or less like this:
// 'comparator' is the user-provided comparator over keys
template<typename K>
int internal_comparator(Node<K>* node, K* lookup_key, std::size_t lookup_hash) {
    // Order on the hash first; only on a hash match do we touch the key itself
    if (lookup_hash > node->keyhash) return 1;
    if (lookup_hash < node->keyhash) return -1;
    K* node_key = node->key.load(std::memory_order_acquire);
    return comparator(lookup_key, node_key);
}

This way, the logic of the lock-free/wait-free data structure requires little modification: it's just a matter of calling internal_comparator() instead of comparator(), and using memory_order_consume for the node traversal.
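As an illustration, here's how the lookup in a sorted linked list could combine internal_comparator() with a memory_order_consume traversal. This is a read-only sketch, assuming the list is sorted in ascending (keyhash, key) order; it ignores insertion/removal and marked pointers à la Harris, and the name sorted_lookup is just for illustration.

template<typename K>
Node<K>* sorted_lookup(Node<K>* head, K* lookup_key) {
  static std::hash<K> hashFunc;
  const std::size_t lookup_hash = hashFunc(*lookup_key);
  Node<K>* node = head->next.load();              // seq_cst on the first load
  while (node != nullptr) {
    const int cmp = internal_comparator(node, lookup_key, lookup_hash);
    if (cmp == 0) {                               // hashes matched and keys are equal
      std::atomic_thread_fence(std::memory_order_acquire);
      return node;
    }
    if (cmp < 0) break;                           // node is already "greater": key not present
    node = node->next.load(std::memory_order_consume);
  }
  std::atomic_thread_fence(std::memory_order_acquire);
  return nullptr;
}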

Neat, huh?   :)


Microbenchmarks

So what do we gain by all of this?
Speeeeeed, that's what!
We ran microbenchmarks on a PowerPC, Power8E with 8 cores, Ubuntu Linux 14.04 64 bit, GCC 4.9.2.

The plots below show the mean for 5 runs of the number of operations per second as a function of the number of threads on a microbenchmark that iterates for 20 seconds. On each iteration, we randomly pick a key and traverse a linked list of 100, 1000, or 10000 keys until we find the one we're looking for.

To make it more realistic, before starting the lookup we call rcu_read_lock(), and we call rcu_read_unlock() when the lookup finishes. This causes a fixed overhead that hurts short lookups (on small lists). Your mileage may vary when using other memory reclamation techniques like reference counting or hazard pointers, but as Andreia pointed out, it doesn't seem possible to get performance gains from this technique when using reference counting or HPs. For example, hazard pointers do a (sequentially consistent) store to a hazard pointer and then a load of the node's next pointer, which could (but should not) be re-ordered if the load is memory_order_consume or memory_order_acquire, but will not be re-ordered if it is memory_order_seq_cst.
The source code for the microbenchmark is not important because it's not really production-ready, but it's on github if you want to take a look.

The benchmark executes three different scenarios:
  • Scenario 1: pointers to the next node are read with memory_order_seq_cst and, therefore, the key can be read as a regular variable;
  • Scenario 2: pointers to the next node are read with memory_order_acquire, and the key is also read as a regular variable;
  • Scenario 3: we used the code shown in lookup() where the pointer to the next node is read with memory_order_consume, and when the hashes of the keys match, the key associated with the node is read with memory_order_acquire.
The plots show the ratios obtained from the results of third scenario over the second (consume/acquire), and for the third scenario over the first (consume/seq_cst).

[Plots: ops/sec ratios, consume/acquire (first plot) and consume/seq_cst (second plot), for lists of 100, 1000, and 10000 keys]
Analysis of Results

From the first plot we can see that the advantage of using memory_order_consume over memory_order_acquire is modest, but it can go up to an 80% increase.
The second plot shows that the gain of memory_order_consume over memory_order_seq_cst (the default ordering for atomics in C11 and C++1x) is significant and can provide up to a 7x throughput gain, which is definitely worth the small effort of implementing this "hash trick" with memory_order_consume.

How about x86?
Well, on x86 there is no significant difference between the three scenarios, because, as you can see in the C++11 atomics mappings, on x86 all loads translate to a MOV instruction without any memory barrier, regardless of whether it is a load with memory_order_seq_cst, memory_order_acquire, or memory_order_consume. So, don't bother using this technique if you're only going to run on x86... you won't gain anything.

Now, you may be wondering: huh, I thought memory_order_consume was "broken" on GCC 4.9?!
Yeah. Just so you know, we're actually using memory_order_relaxed as a replacement for memory_order_consume in our benchmarks. Please do NOT do this in your production code; we did it only because there was no other practical alternative. For our purposes, consume and relaxed will be (theoretically) the same on PowerPC, so it's ok (kind of), and it gives a good proxy of what the performance will be with the "actual" memory_order_consume, once it gets properly implemented in some future GCC version.
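Concretely, the substitution amounts to a compile-time switch like the one below (the macro name is just for illustration). GCC 4.9 treats memory_order_consume as memory_order_acquire, which is why relaxed is used as a stand-in to estimate what a real consume load would cost:

// Benchmark-only stand-in: do NOT ship this in production code.
#ifdef BENCH_FAKE_CONSUME
    node = node->next.load(std::memory_order_relaxed);   // proxy for a "real" consume
#else
    node = node->next.load(std::memory_order_consume);   // promoted to acquire by GCC 4.9
#endif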
For more on this topic, make sure to watch Paul McKenney's talk at CppCon 2015, "C++ Atomics: The Sad Story of memory_order_consume: A Happy Ending at Last?", and Michael Wong's presentation "C++11/14/17 Atomics the Deep dive: the gory details, before the story consumes you!"
http://cppcon.org/2015program/


Conclusion

Of course, it's not all roses, there's a thorn: we're adding a new 64-bit (or 32-bit) member to the node to store the key's hash.
It may seem like a lot, considering that in a simple linked list there are only two other 64-bit variables (the pointer to the key and the pointer to the next node), so we're increasing the memory usage of this data structure by 50%... but keep in mind that for every node there is an associated key, and the key can be quite large, to the point where having one extra 64-bit variable (per key) is negligible.

There are other ways to use memory_order_consume in lock-free data structures, and if we have some time we'll go over a few of them in a future post, but this technique is the most generic, easy to understand, and easy to implement... and in my book, simplicity counts a lot!

Who would have thought that std::memory_order_consume is actually useful ?!?  ;)