Wednesday, May 17, 2017

Battle of the Userspace RCUs: GraceVersion and TwoPhase versus BulletProof and Memory Barrier

We've talked before about Userspace RCU, particularly about two new algorithms we've designed, named GraceVersion and TwoPhase, that allow grace periods to be shared among updaters.

We submitted the brief announcement to a major conference. Although it wasn't accepted, we got one amazing review (if you're the reviewer and you're reading this, thank you for your kind words!), so we decided to invest a little more time in it and explain a few more details about these two URCUs, which use just atomics.
We ended up writing a proof of correctness and even a comparison with previous URCU algorithms, all of which you can get here:

The comparison is probably interesting to a larger audience because it shows an example where simpler lock-free code can beat complex blocking code. So let's dig into it.

Up until now, the most versatile Userspace RCU implementations out there were the ones provided by the userspace-rcu library.
This library has several URCUs, but two of them use only atomic instructions and fences, without needing anything special (like POSIX signals). These two URCUs are "Bullet Proof" and "Memory Barrier".


Source code for BulletProof can be seen here:

BulletProof is the more versatile of the URCUs in the userspace-rcu library: it can be used without the user explicitly doing prior reader thread registration, because it registers readers internally when needed.
In applications that create and destroy threads, however, it is imperative that each thread de-registers itself before terminating, otherwise the list of threads grows indefinitely, dragging down the throughput of synchronize_rcu() and needlessly consuming memory.

Our TwoPhase URCU with RICountersArray as a ReadIndicator, which requires no prior thread registration, has no such issue. When using RICountersArray, TwoPhase is the most versatile of all the URCUs we've seen so far because no thread registration or de-registration is required. All it takes is a pre-allocated array or list with preferably enough entries to cover the number of cores, plus a buffer to avoid too many hash collisions. We chose 2x the number of cores as the default for our implementation, but the exact size can be tweaked at the user's discretion.
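To make the idea concrete, here is a minimal sketch of such a counters-array ReadIndicator. The class and member names are ours for illustration, not necessarily the ones in the paper's code; the key point is that readers hash their thread id into a pre-allocated array, so no registration or de-registration is ever needed:

```cpp
#include <atomic>
#include <thread>
#include <functional>
#include <vector>

// Hedged sketch of a counters-array ReadIndicator. Each reader hashes its
// thread id to one entry; 2x the number of cores is our suggested default
// size, chosen only to keep hash collisions rare.
class RICountersArray {
    struct alignas(128) PaddedCounter {       // padded to avoid false sharing
        std::atomic<long> count{0};
    };
    std::vector<PaddedCounter> counters;
public:
    explicit RICountersArray(int numEntries)
        : counters(numEntries > 0 ? numEntries : 8) {}

    int arrive() {    // wait-free if fetch_add() is wait-free on this CPU
        const size_t idx =
            std::hash<std::thread::id>{}(std::this_thread::get_id())
            % counters.size();
        counters[idx].count.fetch_add(1);
        return (int)idx;
    }

    void depart(int idx) {
        counters[idx].count.fetch_add(-1);
    }

    bool isEmpty() {  // scanned by synchronize_rcu() to detect ongoing readers
        for (auto& e : counters)
            if (e.count.load() != 0) return false;
        return true;
    }
};
```

Notice that a reader never touches any registry: it only increments on arrive() and decrements on depart(), which is why threads can come and go freely.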

Another thing to notice in BulletProof is that rcu_read_lock() will internally call rcu_bp_register() to register a reader thread if it is not registered already. This requires holding a mutex named rcu_registry_lock:
Sure, it's a one-time call, but it's there, which means rcu_read_lock() is "blocking" by definition.
Our TwoPhase URCU is non-blocking for readers, more specifically, it is wait-free for CPUs where fetch_add() is wait-free, otherwise it is lock-free.

In BulletProof, the call to rcu_read_lock() has potentially two full fences (MFENCE instruction on x86), as can be seen in lines 149 and 170 of urcu-bp.h.
In both of our URCU algorithms there is one single fence in rcu_read_lock(), implemented as either a full fence or as an atomic increment (XADD on x86).

In BulletProof, the call to rcu_read_unlock() has two full fences:
In our GraceVersion algorithm, there is no fence on rcu_read_unlock(),
while in the TwoPhase algorithm there is no fence when using RIEntryPerThread, and a single fence in the form of an atomic increment when using a ReadIndicator with counters.
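The fence asymmetry between read-lock and read-unlock can be sketched for a GraceVersion-style reader. This is our rough simplification (the real implementation keeps one shared slot per reader in a registry so that synchronize_rcu() can scan them; here a thread_local stands in for that slot):

```cpp
#include <atomic>
#include <cstdint>

// Sentinel meaning "this thread is not inside a read-side critical section".
static const uint64_t NOT_READING = ~uint64_t(0);

std::atomic<uint64_t> updaterVersion{0};
thread_local std::atomic<uint64_t> readerVersion{NOT_READING};

void rcu_read_lock() {
    // Publish the grace-period version we observed. This seq_cst store is
    // the one and only fence in the read-lock.
    readerVersion.store(updaterVersion.load());
    // Re-announce if an updater advanced the version in the meantime, so no
    // updater ever waits on a critical section that started after it.
    const uint64_t uv = updaterVersion.load();
    if (readerVersion.load(std::memory_order_relaxed) != uv)
        readerVersion.store(uv, std::memory_order_relaxed);
}

void rcu_read_unlock() {
    // A release store compiles to a plain MOV on x86: no fence at all.
    readerVersion.store(NOT_READING, std::memory_order_release);
}
```

The read-unlock is just one relaxed-cost store on x86, versus the two full fences in BulletProof's rcu_read_unlock().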

This difference in fences may seem minor, but it is responsible for the performance difference observed in this benchmark plot:

Regarding the synchronize_rcu() implementation in BulletProof, it grabs two mutual exclusion locks, named rcu_gp_lock and rcu_registry_lock (lines 271 and 273), releasing them only near the end of the call.

Notice that rcu_registry_lock is the same lock required by the internal thread registration called from rcu_read_lock(), meaning that, in practice, a thread calling synchronize_rcu() can block readers calling rcu_read_lock() for the first time, i.e. the call to rcu_read_lock() in BulletProof is blocking.
None of our URCU algorithms grab any mutual exclusion lock in their calls to synchronize_rcu().

Besides the synchronization required to acquire two mutual exclusion locks, the synchronize_rcu() method in BulletProof executes 4 full fences (lines 281, 295, 312 and 330 of urcu-bp.c).
Both the GraceVersion and TwoPhase URCUs have a single synchronized instruction in their calls to synchronize_rcu(), when executing the compare_exchange_strong().
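To see why a single compare_exchange_strong() suffices, here is a self-contained sketch of a version-based synchronize_rcu(). This is our simplification for illustration, not the library's actual code; readers are assumed to announce the version they observed in a slot of readerVersion, with NOT_READING meaning "not inside a read-side critical section":

```cpp
#include <atomic>
#include <cstdint>

static const uint64_t NOT_READING = ~uint64_t(0);
static const int MAX_READERS = 4;  // registry size, arbitrary for this sketch

std::atomic<uint64_t> updaterVersion{0};
std::atomic<uint64_t> readerVersion[MAX_READERS] =
    {NOT_READING, NOT_READING, NOT_READING, NOT_READING};

void synchronize_rcu() {
    const uint64_t waitForVersion = updaterVersion.load() + 1;
    uint64_t expected = waitForVersion - 1;
    // The single synchronized instruction. If the CAS fails, another updater
    // has already advanced the version, and its grace period covers ours:
    // this is what lets concurrent updaters share a grace period.
    updaterVersion.compare_exchange_strong(expected, waitForVersion);
    for (int i = 0; i < MAX_READERS; i++) {
        // Wait only for readers still inside a critical section entered
        // before the version advanced (their announced version is too low;
        // NOT_READING is the maximum value, so idle slots pass immediately).
        while (readerVersion[i].load() < waitForVersion) { /* spin */ }
    }
}
```

Whether our CAS succeeds or another updater's does, the version has advanced past our snapshot, so every updater waiting on the same version shares one grace period.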

Because most of the code in synchronize_rcu() in BulletProof is executed with a mutual exclusion lock held, only one thread at a time effectively executes it, preventing grace periods from being shared.
Moreover, the BulletProof algorithm was not designed to share grace periods and would not work correctly without a mutual exclusion lock.
This inability to share grace periods is the reason why the data points for BulletProof remain flat in this benchmark plot:

Without sharing the grace period, each thread calling synchronize_rcu() must first wait for the other callers of synchronize_rcu() to complete before it can itself wait for the (then) current ongoing readers.

The GraceVersion and TwoPhase URCUs are capable of sharing grace periods, and therefore multiple threads calling synchronize_rcu() can work simultaneously, without having to wait for each other, waiting only for the current ongoing readers. The more threads we add, the more work completes, hence the linear scaling in the plot above for these two algorithms.
This is why a call to synchronize_rcu() in GraceVersion can provide a throughput that is 1000x higher than in BulletProof.

If I have some time I'll compare with the MemoryBarrier URCU in a few weeks. In the meantime, you can take a look at the code and try to understand why it is slower than GraceVersion.



  1. It would have been great if you also compared these implementations to ck_epoch from - Any plans to include it?

    1. Also, I've published results comparing to URCU under "Slides" (fast path makes a big difference in real world).

    2. Hi Sbahra,
      An URCU can be used to do memory reclamation, however, an epoch-based memory reclamation (EBR) is not necessarily an URCU. Our paper is about URCUs so no, we can't compare the two.

      Having said that, do you think that it is possible to convert ck_epoch to be an URCU? That means, something with an API which at least matches rcu_read_lock(), rcu_read_unlock() and synchronize_rcu() ?

      From looking at the code, it seems that ck_epoch_synchronize() is equivalent to synchronize_rcu(), but I couldn't find direct equivalents for rcu_read_lock/unlock().

      If you think the algorithm in that EBR can be used to make a new URCU, then _that_ is interesting and worth of at least a small research paper (brief announcement) ;)
      Ping me back if you think yes!

    3. Btw, I'm assuming that what you call the "fast-path" is a deferral of the work to a later time. Yes I agree, it makes a big difference, but it is still "blocking".
      Just to be clear, being asynchronous and being non-blocking are two different things. An asynchronous method can still result in "blocking" behavior.

  2. In this case, the read_lock call is called `ck_epoch_begin` and the `read_unlock` call is `ck_epoch_end`. These are inline functions in ck_epoch.h.

    Here's a very important factor that's pretty damn cool:
    Core to EBR (compared to QSBR) is that the global epoch can only make forward progress once the current value has been observed. This means that, for any thread, the epoch it sees is either the previous or the current one. This allows us to apply local reference counting to epoch counters, which means we can trivially ensure forward progress *in the absence of* quiescent states. These are just regular increment / store operations, no atomics. This is what "ck_epoch_section_t" is for: it is a local reference count. I give credit to Paul Khuong for this nice extension.

    This is crucial for using SMR in evented systems in an easy, composable fashion, where every thread has thousands of events (it doesn't make sense to have an epoch record for every event, so we amortize them).

    So, what's cool about this? Besides unlocking a use-case that is otherwise not feasible with other passive mechanisms today, it also allows us to forego the memory barrier (on TSO, no barriers at all; on non-TSO, no full barrier is needed). We only pay the barrier the first time going from idle to active, but if there are, say, 1000 events to process in an I/O loop, those are amortized across 1 barrier.