Today's post is about Userspace RCU... I know, I know... what, again with this URCU stuff?
Well, as it so happens, this is the fastest URCU currently known, which makes it the fastest (manual) way to do object lifetime tracking in a multi-threaded application, so I'm giving it nothing more than the attention it's due!
I wrote before about URCUs, namely the two new algorithms Andreia and I designed, which we named GraceVersion and TwoPhase. Both of these URCUs have grace sharing, a term used by Paul McKenney to describe what happens when multiple threads simultaneously call synchronize_rcu(): for these two algorithms, a single grace period may be used by multiple threads. On top of that, calls to synchronize_rcu() don't block each other; they only have to wait for readers, as every URCU must (otherwise it wouldn't be an RCU).
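To give a feeling for the mechanism, here is a minimal sketch of the version-based idea: readers publish the global version they saw, and synchronize_rcu() bumps the version and waits only for readers that started before the bump. All names below are illustrative, and this is a simplified sketch of the general approach, not the actual GraceVersion implementation (which has more careful memory ordering and re-checks).

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Illustrative sketch of a version-based URCU (not the real GraceVersion code).
constexpr int kMaxThreads = 8;
constexpr uint64_t kNotReading = UINT64_MAX;  // sentinel: thread is not reading

std::atomic<uint64_t> gVersion{1};

struct alignas(128) ReaderState {             // padded to avoid false sharing
    std::atomic<uint64_t> version{kNotReading};
};
ReaderState gReaders[kMaxThreads];

void rcu_read_lock(int tid) {
    // Announce the read-side critical section with the current version.
    gReaders[tid].version.store(gVersion.load(), std::memory_order_seq_cst);
}

void rcu_read_unlock(int tid) {
    gReaders[tid].version.store(kNotReading, std::memory_order_release);
}

void synchronize_rcu() {
    // Advance the global version. Multiple concurrent callers may end up
    // waiting on the same version bump -- that is the grace sharing: two
    // synchronize_rcu() calls never have to wait for each other, only for
    // readers that started before the bump.
    const uint64_t waitFor = gVersion.fetch_add(1) + 1;
    for (int i = 0; i < kMaxThreads; i++) {
        // Reader i is safe once it is idle (kNotReading) or started at a
        // version >= waitFor; only pre-existing readers are waited for.
        while (gReaders[i].version.load(std::memory_order_acquire) < waitFor)
            std::this_thread::yield();
    }
}
```

Notice that synchronize_rcu() never blocks on another writer: it only scans the reader states, which is exactly the property the benchmarks below stress.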
We compared the algorithms and implementations of our URCUs against the current state of the art, as implemented in the userspace-rcu library.
But sometimes this isn't enough to see the difference. Sometimes we need to apply it to some (semi-)realistic scenario to understand just how much difference these new URCUs make.
As such, we took a lock-free linked list by Maged Michael and Tim Harris (the Maged-Harris linked list), applied these URCUs on top of it, and ran some benchmarks.
Before we dig into the plots, I need to provide some details. This linked list actually implements a "set", with three main methods: contains(), remove(), and add(). All three methods are lock-free, but once we add a URCU, the remove() method has to do memory reclamation, and as you all should know by now, every URCU is blocking for reclamation (i.e. for calls to synchronize_rcu()). Therefore, the remove() method becomes blocking.
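The reclamation pattern looks roughly like this. To keep the sketch short and runnable, the list below is a plain singly linked list and the URCU calls are no-op stand-ins (the real Maged-Harris list is lock-free with marked pointers, and the real URCU calls actually wait); the point is only to show where remove() becomes blocking.

```cpp
#include <atomic>

// No-op stand-ins for a real URCU implementation, here only to mark the
// places where the real calls would go.
static void rcu_read_lock()   { /* announce a read-side critical section */ }
static void rcu_read_unlock() { /* leave the read-side critical section */ }
static void synchronize_rcu() { /* wait for all pre-existing readers */ }

struct Node {
    int key = 0;
    std::atomic<Node*> next{nullptr};
};
struct Set {
    std::atomic<Node*> head{nullptr};
};

// contains() stays non-blocking: readers only bracket the traversal.
bool set_contains(Set& s, int key) {
    rcu_read_lock();
    for (Node* n = s.head.load(); n != nullptr; n = n->next.load()) {
        if (n->key == key) { rcu_read_unlock(); return true; }
    }
    rcu_read_unlock();
    return false;
}

bool set_add(Set& s, int key) {
    Node* n = new Node;
    n->key = key;
    n->next.store(s.head.load());
    s.head.store(n);  // the real add() CASes the predecessor's next pointer
    return true;
}

// remove() is where the URCU makes the method blocking: after unlinking the
// node we must wait for readers still traversing it before freeing it.
bool set_remove(Set& s, int key) {
    Node* prev = nullptr;
    Node* cur = s.head.load();
    while (cur != nullptr && cur->key != key) {
        prev = cur;
        cur = cur->next.load();
    }
    if (cur == nullptr) return false;
    Node* succ = cur->next.load();
    if (prev == nullptr) s.head.store(succ);
    else prev->next.store(succ);
    synchronize_rcu();  // BLOCKING: drain pre-existing readers
    delete cur;         // now no reader can still hold a pointer to this node
    return true;
}
```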
In our benchmarks the list is particularly small, just 100 keys, which means it has about 100 nodes (perhaps momentarily a few less). We chose such a small list to emphasize the scenarios where our URCUs perform best, i.e. scenarios with high contention on calls to synchronize_rcu().
For those of you who are wrinkling your noses and thinking that a list/set with 100 nodes isn't very realistic, consider instead a well-dimensioned lock-free hashmap/hashset based on an array of buckets, each containing a Maged-Harris list. This is a simple way to make a lock-free hashmap, and when there are few collisions, the list at each bucket will be small, likely with just a few nodes. Any lookup/insertion/removal has to traverse far fewer than 100 nodes (unless the hash function is unsuitable for the key set). Compared to such a scenario, a list with 100 nodes is actually "large" in terms of how long it takes on average to find a key in the set.
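The bucket layout is simple enough to sketch. The sequential toy below shows only the structure (array of buckets, each a short list); in the real lock-free version each bucket would hold a Maged-Harris list protected by the URCU, and all names here are illustrative.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

// Sequential sketch of a hashset as an array of buckets, each a short list.
// In the lock-free version each bucket is a Maged-Harris list.
struct HashSet {
    std::vector<std::list<int>> buckets;

    explicit HashSet(std::size_t nbuckets) : buckets(nbuckets) {}

    std::size_t idx(int key) const {
        return std::hash<int>{}(key) % buckets.size();
    }
    bool contains(int key) const {
        // With few collisions, this traverses only a handful of nodes.
        for (int k : buckets[idx(key)])
            if (k == key) return true;
        return false;
    }
    bool add(int key) {
        if (contains(key)) return false;  // "set" semantics: no duplicates
        buckets[idx(key)].push_back(key);
        return true;
    }
    bool remove(int key) {
        auto& b = buckets[idx(key)];
        for (auto it = b.begin(); it != b.end(); ++it) {
            if (*it == key) { b.erase(it); return true; }
        }
        return false;
    }
};
```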
We show four different scenarios in our plots. The top-left plot shows 100% updates: all threads attempt a remove() of a random key followed by an add() of the same key (thus one memory reclamation of a node). The bottom-right plot shows 0% updates: all threads call contains() on a random key. The other two plots mix the two scenarios, where the decision to do a remove()/add() or a contains() is made randomly, with the average shown in the plot's title, i.e. 10% or 1% updates.
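The per-thread workload mix can be sketched as follows. The operations are passed in as generic callables so the sketch stands on its own; the function and parameter names are illustrative, not the actual benchmark harness.

```cpp
#include <random>

// Illustrative sketch of the per-thread benchmark loop: with probability
// updateRatio do a remove()+add() of the same random key, otherwise do a
// read-only contains().
template <class RemoveFn, class AddFn, class ContainsFn>
void run_workload(double updateRatio, int numOps, int keyRange,
                  RemoveFn doRemove, AddFn doAdd, ContainsFn doContains,
                  unsigned seed = 1234) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<int> keys(0, keyRange - 1);
    for (int i = 0; i < numOps; i++) {
        const int key = keys(rng);
        if (coin(rng) < updateRatio) {
            doRemove(key);    // triggers one reclamation (synchronize_rcu)
            doAdd(key);       // re-insert the same key
        } else {
            doContains(key);  // read-only lookup under the rcu read lock
        }
    }
}
```

With updateRatio at 1.0 this is the top-left plot's workload, and at 0.0 it is the bottom-right one.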
The 100% updates scenario is where our URCUs really shine, due to their low overhead and grace sharing. As contention increases, they're about 5x faster than MemoryBarrier and nearly 50x faster than BulletProof. As the ratio of read-only operations increases and the updates decrease, MemoryBarrier catches up, eventually at 1% being pretty much the same as our algorithms. BulletProof never catches up unless we go into a purely read-only scenario, which is clearly unrealistic: if we knew with absolute certainty that there would be no updates, there would be no point in using an RCU, because nothing would ever be removed.
Perhaps even more important than showing just how good our URCUs are, these plots are an excellent example of the importance of choosing the right memory reclamation technique when deploying a lock-free or wait-free data structure. No matter how fast the lock-free data structure, if the memory reclamation technique isn't fast enough to keep up with it, it's going to become the bottleneck, and negate any scalability benefits that the lock-free data structure was meant to provide. I hope these plots give a better feeling of just how important that is.
Source code for our GraceVersion URCU and TwoPhase URCU can be found here.