Sunday, December 11, 2016

Two-Phase Grace Sharing Userspace RCU

This week we interrupt our regular topic of queues to talk about userspace RCU.

Andreia and I came up with two new algorithms for doing simple Userspace RCU (URCU) with "grace sharing", using only the C11/C++1x memory model and its atomics API.
Grace sharing is an important feature of URCU because without it the scalability of calls to synchronize_rcu() will be at best flat.
We wrote a brief announcement describing the two algorithms, and you can get it here:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/gracesharingurcu-2016.pdf

The first and simplest is "Grace Version", which we called "Readers Version" in a previous post:
http://www.concurrencyfreaks.com/2016/09/a-simple-userspace-rcu-in-java.html
The second one, which we never talked about before, is called "Two-Phase".


Why have two algorithms for doing the same thing, Grace Version and Two-Phase?

Well, because Grace Version requires each thread to have a unique id so that it has a unique entry in an array. This means that in your application you need to know how many threads are doing read-side critical sections, which isn't always easy in practice; in those cases you can use the Two-Phase algorithm instead. The main advantage of the Two-Phase algorithm is that it can use any kind of ReadIndicator.
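To make the Grace Version idea concrete, here is a minimal C++11 sketch, assuming a fixed maximum number of registered reader threads and that each reader knows its own unique index tid. The class name, the CAS-based version advance, and the default (sequentially consistent) atomics are illustrative simplifications; see the paper and the repository for the exact implementation and memory orderings.

#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical sketch of the "Grace Version" URCU (not the paper's exact code)
class GraceVersionURCUSketch {
    static const int MAX_THREADS = 128;                 // assumed upper bound on readers
    static const uint64_t NOT_READING = UINT64_MAX;     // sentinel: thread is outside a read-side section
    std::atomic<uint64_t> updaterVersion {0};
    std::atomic<uint64_t> readersVersion[MAX_THREADS];  // one entry per registered reader thread

public:
    GraceVersionURCUSketch() {
        for (int i = 0; i < MAX_THREADS; i++) readersVersion[i].store(NOT_READING);
    }

    void rcu_read_lock(const int tid) {
        // Publish the version this reader saw on entry
        const uint64_t rv = updaterVersion.load();
        readersVersion[tid].store(rv);
        // Re-check: if an updater advanced the version meanwhile, publish the newer one
        const uint64_t nrv = updaterVersion.load();
        if (rv != nrv) readersVersion[tid].store(nrv);
    }

    void rcu_read_unlock(const int tid) {
        readersVersion[tid].store(NOT_READING);
    }

    void synchronize_rcu() {
        const uint64_t waitForVersion = updaterVersion.load() + 1;
        // Try to advance the version; if the CAS fails, another updater already
        // advanced it, and its grace period is shared with us
        uint64_t expected = waitForVersion - 1;
        updaterVersion.compare_exchange_strong(expected, waitForVersion);
        // Wait until every reader is either not reading or entered at waitForVersion or later
        for (int i = 0; i < MAX_THREADS; i++) {
            while (readersVersion[i].load() < waitForVersion) std::this_thread::yield();
        }
    }
};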
I've blabbered about ReadIndicators before, and if you want to know more about them you can look at my CppCon talk from last year:
https://www.youtube.com/watch?v=FtaD0maxwec
and in this post:
http://www.concurrencyfreaks.com/2014/11/a-catalog-of-read-indicators.html

In the Two-Phase algorithm any ReadIndicator can be used; in particular, one of our favourites: an array of atomic counters where each reader does a fetch-and-add to increment the counter chosen by a hash of its thread's id when entering the read-side critical section, and decrements the same counter when leaving it.
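A minimal sketch of such a ReadIndicator is below; the class and method names are just illustrative, and the padding needed to avoid false sharing between counters is omitted for brevity.

#include <atomic>
#include <cstdint>
#include <thread>
#include <functional>

// Hypothetical counters-array ReadIndicator: readers hash their thread id onto
// one of the counters, increment it on arrive() and decrement it on depart()
class RICountersArray {
    static const int NUM_COUNTERS = 64;     // assumed fixed size
    std::atomic<uint64_t> counters[NUM_COUNTERS];

    static int idx() {
        const auto h = std::hash<std::thread::id>{}(std::this_thread::get_id());
        return (int)(h % NUM_COUNTERS);
    }

public:
    RICountersArray() {
        for (int i = 0; i < NUM_COUNTERS; i++) counters[i].store(0);
    }
    void arrive() { counters[idx()].fetch_add(1); }   // enter read-side critical section
    void depart() { counters[idx()].fetch_sub(1); }   // leave read-side critical section
    bool isEmpty() {                                  // used only by updaters
        for (int i = 0; i < NUM_COUNTERS; i++) {
            if (counters[i].load() != 0) return false;
        }
        return true;
    }
};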
The two-phase approach itself isn't new: Bullet-Proof URCU uses it, and so does our Left-Right technique:
https://github.com/urcu/userspace-rcu
http://www.concurrencyfreaks.com/2013/12/left-right-concurrency-control.html
For such ReadIndicators no thread registration is needed, which makes for a more versatile and friendlier API when deploying URCU in your application.
However, in our Two-Phase algorithm the grace period can be shared by multiple threads calling synchronize_rcu() simultaneously, which provides better scalability for the "updaters" (the threads calling synchronize_rcu()). We never thought about this approach before because it doesn't make sense for Left-Right to use an URCU like that, given that in Left-Right there can only be one updater (writer) at any given time.


How does the Two-Phase algorithm work?

The best way to understand it is to read the paper, but the main idea is simple: start from the usual two-phase protocol with two ReadIndicators (like classical Left-Right) and then exit each of the waiting loops as soon as the version has advanced enough to guarantee that some other updater has already seen the opposite ReadIndicator as empty.
If none of this makes sense, then just go and take a look at the Left-Right paper to see where these ideas came from  ;)
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/left-right-2014.pdf
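To give a feel for that starting point, here is a hedged sketch of the classic two-phase wait over two ReadIndicators (reusing the RICountersArray above), serialized with a mutex. The actual Two-Phase algorithm in the paper drops the mutex and adds to each waiting loop an early exit that fires when the version counter shows another updater has already observed the required emptiness; see URCUTwoPhase.hpp for the precise conditions and memory orderings.

#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical sketch: the classic two-phase protocol that the Two-Phase URCU starts from
class TwoPhaseURCUSketch {
    std::atomic<uint64_t> version {0};
    RICountersArray ri[2];        // any ReadIndicator works here
    std::mutex updaterMutex;      // the paper's algorithm does NOT need this

public:
    int rcu_read_lock() {
        // Arrive at the ReadIndicator selected by the current version's parity
        const int idx = (int)(version.load() & 1);
        ri[idx].arrive();
        return idx;               // remember which side we arrived at
    }

    void rcu_read_unlock(const int idx) {
        ri[idx].depart();
    }

    void synchronize_rcu() {
        std::lock_guard<std::mutex> lock(updaterMutex);
        const uint64_t v = version.load();
        // Phase 1: drain the opposite ReadIndicator.
        // (Grace sharing in the paper: also leave this loop when `version` has
        // advanced enough, meaning another updater already saw it empty.)
        while (!ri[(v + 1) & 1].isEmpty()) std::this_thread::yield();
        // Toggle: new readers now arrive at the other ReadIndicator
        version.store(v + 1);
        // Phase 2: drain the ReadIndicator used by readers that entered before the toggle.
        // (The same early-exit idea applies here in the paper's grace-sharing variant.)
        while (!ri[v & 1].isEmpty()) std::this_thread::yield();
    }
};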


Benchmarking:

In the first plot we compare pure readers: how fast can you do rcu_read_lock()/unlock(). The results show that our Grace Version and the Two-Phase with one entry per thread have the same (nearly linear) scalability, with the Two-Phase using an array of counters slightly below. Notice that neither the Two-Phase CountersArray nor Bullet-Proof needs thread registration, while the other implementations do.



The second plot shows what happens when we're just calling synchronize_rcu() without any ongoing readers. This scenario is interesting because it exposes the synchronization overhead.
Our three implementations behave more or less the same, while the pre-existing URCU implementations are way below.
This isn't a completely fair comparison because URCU Bullet-Proof and URCU Default do other things, like handling call_rcu(), which incurs extra overhead. Still, if you don't need anything besides rcu_read_lock()/unlock() and synchronize_rcu(), our implementations will do great.


The third plot shows the effect of grace sharing: there are two reader threads and the rest are updaters. The read-side critical sections are long, at least long enough that the difference in overhead between the algorithms (seen in the plot above) is not noticeable.
The main effect here is the sharing of the grace period, and it's easy to see that Bullet-Proof is the only one incapable of sharing it.



In essence, if all you need is an URCU with the APIs rcu_read_lock(), rcu_read_unlock() and synchronize_rcu(), then our algorithms are the best in all tested scenarios, so feel free to copy them into your code base, or use one of the implementations we have on GitHub:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/papers/gracesharingurcu/URCUTwoPhase.hpp
https://github.com/pramalhe/ConcurrencyFreaks/tree/master/CPP/papers/gracesharingurcu
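As a usage sketch (written against the hypothetical TwoPhaseURCUSketch class above, since the classes in the repository have their own names and signatures), the typical read/reclaim pattern looks like this:

#include <atomic>

struct Config { int value; };

TwoPhaseURCUSketch urcu;                                 // or any of the other URCUs
std::atomic<Config*> currentConfig { new Config{42} };   // shared, RCU-protected pointer

// Reader: dereferencing currentConfig is safe until rcu_read_unlock()
int readConfigValue() {
    const int idx = urcu.rcu_read_lock();
    Config* cfg = currentConfig.load();
    const int v = cfg->value;
    urcu.rcu_read_unlock(idx);
    return v;
}

// Updater: swap in the new object, wait for pre-existing readers, then reclaim the old one
void updateConfig(int newValue) {
    Config* newCfg = new Config{newValue};
    Config* old = currentConfig.exchange(newCfg);
    urcu.synchronize_rcu();    // no reader can still hold a reference to `old` after this
    delete old;
}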


And now, back to queues...
