Today's post is about Userspace RCU... I know, I know... what, again with this URCU stuff?
Well, as it so happens, this is the fastest URCU currently known, which makes it the fastest (manual) way to do object lifetime tracking in a multi-threaded application, so I'm giving it nothing less than the attention it's due!
I wrote before about URCUs, namely two new algorithms Andreia and I designed, which we named GraceVersion and TwoPhase. Both of these URCUs have grace sharing, a term used by Paul McKenney to describe what happens when multiple threads simultaneously call synchronize_rcu(): for these two algorithms, a single grace period may be used by multiple threads. On top of that, calls to synchronize_rcu() don't block each other; they have to wait only for readers, like every URCU must (otherwise it wouldn't be an RCU).
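To make this concrete, here is a minimal sketch of a version-based URCU in the spirit of GraceVersion. This is not our actual implementation (the names, the fixed MAX_THREADS, and the seq_cst-everywhere memory ordering are all simplifications): readers publish the global version they are running under, and a writer waits only for readers that published an older version, so writers never block each other.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

constexpr int MAX_THREADS = 128;
constexpr uint64_t NOT_READING = ~0ULL;

std::atomic<uint64_t> globalVersion{1};
std::atomic<uint64_t> readerVersion[MAX_THREADS];  // one slot per thread (padded in real code)

void urcu_init() {
    for (auto& rv : readerVersion) rv.store(NOT_READING);
}

void rcu_read_lock(const int tid) {
    // Publish the version this reader runs under; re-check in case a
    // writer advanced the version while we were publishing.
    uint64_t v = globalVersion.load();
    while (true) {
        readerVersion[tid].store(v);
        const uint64_t again = globalVersion.load();
        if (again == v) break;
        v = again;
    }
}

void rcu_read_unlock(const int tid) {
    readerVersion[tid].store(NOT_READING);
}

void synchronize_rcu() {
    // Advance the version and wait only for readers that published an
    // older one. Concurrent writers each do this independently (no mutual
    // blocking); the real GraceVersion additionally detects when another
    // writer's grace period already covers ours, so that a single grace
    // period can serve multiple threads.
    const uint64_t waitFor = globalVersion.fetch_add(1) + 1;
    for (int i = 0; i < MAX_THREADS; i++) {
        while (readerVersion[i].load() < waitFor) std::this_thread::yield();
    }
}
```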
We compared the algorithms and implementations, pitting our URCUs against the current state of the art as implemented in the userspace-rcu library:
http://concurrencyfreaks.blogspot.com/2017/05/battle-of-userspace-rcus-graceversion.html
But sometimes this isn't enough to see the difference. Sometimes we need to apply it to some (semi-)realistic scenario to understand just how much of a difference these new URCUs make.
As such, we picked up a lock-free linked list by Maged Michael and Tim Harris (the Maged-Harris linked list), applied these URCUs on top of it, and ran some benchmarks.
Before we dig into the plots, I need to provide some details. This linked list actually implements a "set", with three main methods: contains(), remove(), and add(). These three methods are lock-free, but when we add an URCU to them, we have to do reclamation in the remove() method, and as you all should know by now, all URCUs are blocking for reclamation (calls to synchronize_rcu()), therefore, the remove() method becomes blocking.
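To show where the blocking call sits, here is a hedged sketch of the set's methods on top of the URCU sketched above. The harrisMichaelContains()/harrisMichaelRemove() functions are hypothetical stand-ins for the lock-free list operations (bodies elided); the point is only that contains() stays non-blocking while remove() must call synchronize_rcu() before freeing the unlinked node.

```cpp
#include <atomic>

struct Node {
    int key;
    std::atomic<Node*> next;
};

// Stand-ins for the Maged-Harris list operations (implementations elided):
bool  harrisMichaelContains(int key);  // lock-free search
Node* harrisMichaelRemove(int key);    // lock-free mark + unlink, returns the node or nullptr

bool contains(int key, int tid) {
    rcu_read_lock(tid);                     // protect the traversal
    const bool found = harrisMichaelContains(key);
    rcu_read_unlock(tid);
    return found;                           // still lock-free
}

bool remove(int key, int tid) {
    rcu_read_lock(tid);
    Node* node = harrisMichaelRemove(key);  // lock-free unlink
    rcu_read_unlock(tid);
    if (node == nullptr) return false;      // key wasn't there
    synchronize_rcu();  // wait for pre-existing readers: the blocking step
    delete node;        // now no reader can still hold a reference
    return true;
}
```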
In our benchmarks the list used is particularly small, just 100 keys, which means the list has about 100 nodes (perhaps temporarily a few less). We chose such a small list to emphasize the scenarios where our URCUs behave best, i.e. scenarios with high contention on calls to synchronize_rcu().
For those of you who are wrinkling your noses and thinking that a list/set with 100 nodes isn't very realistic, consider instead a well-dimensioned lock-free hashmap/hashset based on an array of buckets, each containing a Maged-Harris list. This is a simple way to make a lock-free hashmap, and when there are few collisions, the list at each bucket will be small, with likely just a few nodes. Any lookup/insertion/removal has to traverse far fewer than 100 nodes (unless the hash function is unsuitable for the key set). Compared to such a scenario, a list with 100 nodes is actually "large" in terms of how long it takes on average to find a key in the set.
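A hedged sketch of that bucket construction, with HarrisMichaelList as a hypothetical stand-in for the per-bucket lock-free list:

```cpp
#include <cstddef>
#include <functional>

template<typename K>
class HarrisMichaelList {  // stand-in interface, implementation elided
public:
    bool add(const K& key);
    bool remove(const K& key);
    bool contains(const K& key);
};

template<typename K>
class LockFreeHashSet {
    static constexpr std::size_t NUM_BUCKETS = 1024;
    HarrisMichaelList<K> buckets[NUM_BUCKETS];

    std::size_t bucketOf(const K& key) const {
        return std::hash<K>{}(key) % NUM_BUCKETS;
    }
public:
    // With few collisions, each call traverses just a few nodes
    bool add(const K& key)      { return buckets[bucketOf(key)].add(key); }
    bool remove(const K& key)   { return buckets[bucketOf(key)].remove(key); }
    bool contains(const K& key) { return buckets[bucketOf(key)].contains(key); }
};
```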
We show four different scenarios in our plots. The top-left plot shows 100% updates: all threads are attempting a remove() of a random key followed by an add() of the same key (thus one memory reclamation of a node). The bottom-right plot shows 0% updates: all threads are calling contains() on a random key. The other two plots mix these two scenarios, where the decision of whether to do a remove()/add() or a contains() is made randomly, with the average shown in the plot's title, i.e. 10% or 1%.
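For illustration, here is a sketch of what one benchmark thread's loop looks like under these rules (our actual harness differs in details such as timing, key distribution, and thread coordination):

```cpp
#include <random>

template<typename SET>
void benchmarkThread(SET& set, double updateRatio, long numOps, unsigned seed) {
    std::mt19937 rng{seed};
    std::uniform_int_distribution<int> keyDist{0, 99};      // 100 possible keys
    std::uniform_real_distribution<double> coin{0.0, 1.0};
    for (long i = 0; i < numOps; i++) {
        const int key = keyDist(rng);
        if (coin(rng) < updateRatio) {  // e.g. 1.0, 0.1, 0.01, or 0.0
            set.remove(key);            // unlink + synchronize_rcu() + free
            set.add(key);               // re-insert the same key
        } else {
            set.contains(key);          // read-side critical section only
        }
    }
}
```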
The 100% updates scenario is where our URCUs really shine, due to their low overhead and grace sharing. As contention increases, they're about 5x faster than MemoryBarrier and nearly 50x faster than BulletProof. As the ratio of read-only operations increases and updates decrease, MemoryBarrier catches up, eventually becoming pretty much the same as our algorithms at 1%. BulletProof never catches up unless we go to a purely read-only scenario, which is clearly unrealistic: if we knew with absolute certainty that there would be no updates, there would be no point in using an RCU, because there would certainly be no removals.
Perhaps even more important than showing just how good our URCUs are, these plots are an excellent example of the importance of choosing the right memory reclamation technique when deploying a lock-free or wait-free data structure. No matter how fast the lock-free data structure, if the memory reclamation technique isn't fast enough to keep up with it, it's going to become the bottleneck, and negate any scalability benefits that the lock-free data structure was meant to provide. I hope these plots give a better feeling of just how important that is.
Source code for our GraceVersion URCU and TwoPhase URCU can be found here.
Hello,
> all URCUs are blocking for reclamation (calls to synchronize_rcu()), therefore, the remove() method becomes blocking.
I do agree that synchronize_rcu() is, by design, blocking. However, what would prevent an URCU implementation from providing a non-blocking alternative to synchronize_rcu(), like defer_rcu() or something like that? Basically, you would register a callback that will be called to reclaim your memory (or whatever) once all pre-existing readers have exited.
With EBR, this could be implemented by detecting that a particular epoch has been vacated and is not visible anymore. Currently, I'm not aware of any existing URCU implementation that provides such behavior, but this is definitely possible, isn't it?
Hi Mathieu,
Yes, work can be deferred, even if that work is "memory reclamation".
Hidden behind the deferral is the implicit consequence that if a thread called rcu_read_lock() and went to sleep for a very long time (or got stuck somehow, or even died), then no other thread will be able to do the work of "memory reclamation".
Therefore, this "work" will be deferred continuously. Other threads may create new objects and then attempt to delete them, but they cannot reclaim the memory for those objects because of the sleepy/dead reader.
The deferral procedure implies a list formed by the objects waiting to be deleted. The number of times the deferral can occur is theoretically unbounded, and therefore the size of the list is unbounded too.
This poses two problems: first, what is the progress condition of the traversal/processing of an unbounded list, knowing that all these objects will eventually have to be reclaimed by a thread, which will have to process this list; second, to store a list of unbounded size, we would need a computer with an unbounded amount of RAM, which is clearly unrealistic, and IMO incorrect to assume.
RCU has an API with the name call_rcu(), created for the purpose of deferring work to a later time, including deferral of memory reclamation.
More info here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0461r1.pdf
https://github.com/urcu/userspace-rcu/blob/a8e7c8d9eabfecc5017053754f6b446a95b00010/doc/rcu-api.md
https://lwn.net/Articles/264090/
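To make that pattern concrete, here is a hedged sketch of call_rcu() usage against the userspace-rcu API documented above (names like myNode and removeAndDefer are illustrative, and the exact headers vary across liburcu versions/flavors):

```cpp
// Header names may vary with the liburcu version/flavor
#include <urcu.h>           // default urcu flavor; brings in caa_container_of()
#include <urcu-call-rcu.h>  // call_rcu()
#include <stdlib.h>

struct myNode {
    int key;
    struct rcu_head rcuHead;  // embedded so the callback can recover the node
};

// Executed by a call_rcu worker after a grace period has elapsed
static void freeNodeCallback(struct rcu_head *head) {
    struct myNode *node = caa_container_of(head, struct myNode, rcuHead);
    free(node);
}

// Non-blocking removal: the caller returns immediately and reclamation is
// deferred. Note: if a reader never leaves its read-side critical section,
// these pending callbacks pile up unboundedly, which is exactly the
// problem described above.
void removeAndDefer(struct myNode *node) {
    /* ... logically remove 'node' from the data structure first ... */
    call_rcu(&node->rcuHead, freeNodeCallback);
}
```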
All EBRs suffer from the same limitations.
This is the main problem that we were trying to solve with Hazard Eras ;)
If you want to learn more on this topic, I strongly suggest the presentation by Erez Petrank about memory reclamation:
https://www.youtube.com/watch?v=aedEe0Zx_g0
https://www.youtube.com/watch?v=BCXrG1M65HU
Cheers,
Pedro