Wednesday, May 17, 2017

Battle of the Userspace RCUs: GraceVersion and TwoPhase versus BulletProof and Memory Barrier

We've talked before about Userspace RCU, particularly two new algorithms we've designed named GraceVersion and TwoPhase that allow grace periods to be shared among updaters.

We submitted the brief announcement to a major conference. Although it wasn't accepted, we got one amazing review (if you're the reviewer and you're reading this, thank you for your kind words!), so we decided to invest a little more time in it and explain a few more details about these two URCUs that use only atomics.
We ended up writing a proof of correctness and even a comparison with previous URCU algorithms, all of which you can get here:

The comparison is probably interesting to a larger audience because it shows an example where simpler lock-free code can beat complex blocking code. So let's dig into it.

Up until now, the most versatile Userspace RCU implementations out there were the ones provided by the userspace-rcu library.
This library has several URCUs, but two of them use only atomic instructions and fences, without requiring any other special mechanisms (such as POSIX signals). These two URCUs are "Bullet Proof" and "Memory Barrier".


Source code for BulletProof can be seen here:

BulletProof is the more versatile of the URCUs in the userspace-rcu library because it can be used without the user explicitly doing prior reader thread registration: it performs the registration internally when needed.
However, in applications with thread creation and destruction, it is imperative that each thread does its own de-registration before terminating, otherwise the list of registered threads will grow indefinitely, dragging down the throughput of synchronize_rcu() and needlessly consuming memory.

Our TwoPhase URCU with RICountersArray as a ReadIndicator requires no prior thread registration and therefore has no such issue. When using RICountersArray, TwoPhase is the most versatile of all the URCUs we've seen so far because no thread registration or deregistration is required. All it takes is a pre-allocated array (or list) with enough entries to cover the number of cores, plus some slack to avoid too many hash collisions. We chose 2x the number of cores as the default in our implementation, but the exact size can be tweaked at the user's discretion.
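To make the idea concrete, here is a minimal sketch of a counter-array ReadIndicator in the spirit of RICountersArray. The class name, sizing, and hashing scheme shown here are illustrative assumptions, not the exact implementation from our paper: readers hash their thread id into a fixed array of counters and do a single fetch_add on arrive and depart, so no registration is ever needed.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical sketch of a counter-array ReadIndicator (RICountersArray-like).
// A fixed array of counters, indexed by a hash of the thread id, sized at
// roughly 2x the number of cores to reduce hash collisions.
class RICountersArray {
    static const int kPadding = 16;     // 16*8 = 128 bytes, avoids false sharing
    const int numEntries;
    std::atomic<uint64_t>* counters;

    int hashTid() const {
        // Cheap hash of the current thread id into the array
        auto tid = std::hash<std::thread::id>{}(std::this_thread::get_id());
        return (int)(tid % numEntries);
    }

public:
    RICountersArray()
      : numEntries(2 * (int)std::thread::hardware_concurrency() + 1),
        counters(new std::atomic<uint64_t>[numEntries * kPadding]) {
        for (int i = 0; i < numEntries; i++) counters[i * kPadding].store(0);
    }
    ~RICountersArray() { delete[] counters; }

    // arrive/depart are one fetch_add each: wait-free where XADD is wait-free
    void arrive() { counters[hashTid() * kPadding].fetch_add(1); }
    void depart() { counters[hashTid() * kPadding].fetch_sub(1); }

    // true iff no reader is currently inside a read-side critical section
    bool isEmpty() const {
        for (int i = 0; i < numEntries; i++)
            if (counters[i * kPadding].load() != 0) return false;
        return true;
    }
};
```

Because two threads may hash to the same entry, the counters can only be interpreted collectively (empty/non-empty), which is exactly what a ReadIndicator needs.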

Another thing to notice in BulletProof is that rcu_read_lock() will internally call rcu_bp_register() to register a reader thread if it is not registered already. This requires holding a mutex lock named rcu_registry_lock:
Sure, it's a one-time call, but it's there, which means it is "blocking" by definition.
Our TwoPhase URCU is non-blocking for readers, more specifically, it is wait-free for CPUs where fetch_add() is wait-free, otherwise it is lock-free.

In BulletProof, the call to rcu_read_lock() has potentially two full fences (MFENCE instruction on x86), as can be seen in lines 149 and 170 of urcu-bp.h.
In both of our URCU algorithms there is a single fence in rcu_read_lock(), implemented either as a full fence or as an atomic increment (XADD on x86).

In BulletProof, the call to rcu_read_unlock() has two full fences:
In our GraceVersion algorithm there is no fence in rcu_read_unlock(), while in the TwoPhase algorithm there is no fence when using the RIEntryPerThread ReadIndicator, and a single fence (in the form of an atomic increment) when using a counter-based ReadIndicator.
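This fence asymmetry can be sketched in a few lines. The following is a simplified, version-based read-side in the style of GraceVersion, with one entry per reader thread; the names (NOT_READING, readersVersion, updaterVersion) and the fixed-size array are our illustrative assumptions, not the exact published code. The lock side does one seq-cst store (the single fence, typically an MFENCE or XCHG on x86), while the unlock side is a release store, a plain MOV on x86, hence no fence.

```cpp
#include <atomic>
#include <cstdint>

static const int MAX_THREADS = 128;                 // illustrative fixed bound
static const uint64_t NOT_READING = UINT64_MAX;     // sentinel: not in a critical section

std::atomic<uint64_t> updaterVersion{0};            // global grace-period version
std::atomic<uint64_t> readersVersion[MAX_THREADS];  // one entry per reader thread

void init_readers() {
    for (int i = 0; i < MAX_THREADS; i++) readersVersion[i].store(NOT_READING);
}

void rcu_read_lock(int tid) {
    // The one and only read-side fence: a seq-cst store publishing which
    // grace-period version this reader has observed.
    readersVersion[tid].store(updaterVersion.load());
}

void rcu_read_unlock(int tid) {
    // Release store only: a plain MOV on x86, i.e. no fence in unlock.
    readersVersion[tid].store(NOT_READING, std::memory_order_release);
}
```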

This difference in fences may seem minor, but it is responsible for the performance difference observed on this benchmark plot:

Regarding the synchronize_rcu() implementation in BulletProof, it grabs two mutual exclusion locks, named rcu_gp_lock and rcu_registry_lock (lines 271 and 273), releasing them only near the end of the call.

Notice that rcu_registry_lock is the same lock required by the internal thread registration called from rcu_read_lock(), meaning that in practice a thread calling synchronize_rcu() can block readers calling rcu_read_lock() for the first time, i.e. the call to rcu_read_lock() in BulletProof is blocking.
None of our URCU algorithms grab any mutual exclusion lock in their calls to synchronize_rcu().

Besides the synchronization required to acquire two mutual exclusion locks, the synchronize_rcu() method in BulletProof executes 4 full fences (lines 281, 295, 312 and 330 of urcu-bp.c).
Both the GraceVersion and TwoPhase URCUs have a single synchronized instruction in their calls to synchronize_rcu(), when executing the compare_exchange_strong().

Because most of the code in synchronize_rcu() in BulletProof is executed with a mutual exclusion lock held, only one thread at a time is effectively executing it, preventing sharing of the grace period.
Moreover, the BulletProof algorithm was not designed to share grace periods and would not work correctly without a mutual exclusion lock.
This inability to share grace periods is the reason why the data points for BulletProof remain flat in this benchmark plot:

Without sharing the grace period, each thread calling synchronize_rcu() must first wait for the other callers of synchronize_rcu() to complete before it can itself wait for the then-current ongoing readers.

The GraceVersion and TwoPhase URCUs are capable of sharing grace periods, and therefore multiple threads calling synchronize_rcu() can work simultaneously, without having to wait for each other, waiting only for the current ongoing readers. The more threads we add, the more work can complete, thus the linear scaling in the plot above for these two algorithms.
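The grace-period sharing can be sketched as follows. This is a simplified version-based synchronize_rcu() in the spirit of GraceVersion, not the exact published code: it assumes readers publish the version they observed in a per-thread readersVersion entry (NOT_READING when outside a critical section, both names ours). The key point is the single compare_exchange_strong and the absence of any mutual exclusion lock: if the CAS fails, another updater has already advanced the version, and this caller simply shares that grace period by waiting for the same version.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

static const int MAX_THREADS = 128;                 // illustrative fixed bound
static const uint64_t NOT_READING = UINT64_MAX;     // sentinel: not in a critical section

std::atomic<uint64_t> updaterVersion{0};            // global grace-period version
std::atomic<uint64_t> readersVersion[MAX_THREADS];  // version each reader observed

void init_readers() {
    for (int i = 0; i < MAX_THREADS; i++) readersVersion[i].store(NOT_READING);
}

void synchronize_rcu() {
    const uint64_t waitForVersion = updaterVersion.load() + 1;
    uint64_t expected = waitForVersion - 1;
    // Single synchronized instruction. Failure means another updater has
    // already advanced the version: the grace period is shared, not retried.
    updaterVersion.compare_exchange_strong(expected, waitForVersion);
    // Wait until every reader has either left its critical section or
    // started a new one after the version was advanced.
    for (int i = 0; i < MAX_THREADS; i++) {
        while (true) {
            const uint64_t rv = readersVersion[i].load();
            if (rv == NOT_READING || rv >= waitForVersion) break;
            std::this_thread::yield();
        }
    }
}
```

Because every concurrent caller computes the same waitForVersion, all of them wait for the same set of ongoing readers in parallel, which is precisely why throughput keeps growing as updaters are added.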
This is why a call to synchronize_rcu() in GraceVersion can provide a throughput that is 1000x higher than in BulletProof.

If I have some time I'll compare with the MemoryBarrier URCU in a few weeks. In the meantime, you can take a look at the code and try to understand why it is slower than GraceVersion.