Tuesday, October 18, 2016

CRTurn Queue - The first MPMC memory-unbounded wait-free queue with memory reclamation

A few days ago I turned 40.
It's a big event, and to celebrate it, Andreia and I decided to share some of the work we've been doing recently.
Yes that's right, it's my birthday but you get the gifts!

We officially present CRTurn queue, the first (correct) memory-unbounded multi-producer-multi-consumer wait-free queue for C++, that does its own memory reclamation, and does it in a wait-free way.

If you just care about the C++ code, then follow this link
If it's Java you're interested, then we have an implementation with self-linking on the url below
And if you care about academic papers with invariant proofs and microbenchmark plots then look here

Otherwise, just keep on reading for the blog post  ;)

When it comes to memory unbounded multi-producer-multi-consumer (MPMC) queues there are three known approaches in the literature, all based on singly-linked lists, plus a handful of generic wait-free mechanisms that can be applied to queues, but they're too slow to be a match for the "hand-written" queues. You can see each respective paper for those three queues here:
KP http://www.cs.technion.ac.il/~erez/Papers/wf-methodology-ppopp12.pdf
FK http://www.cs.uoi.gr/tech_reports/publications/TR2011-01.pdf
YMC http://chaoran.me/assets/pdf/wfq-ppopp16.pdf
Of these three algorithms: KP is for Java only, which means the memory reclamation is done by the GC, and since the JVM's implementation of a GC is blocking, this queue isn't truly wait-free, at least not when it comes to latency, but IMO it's the best of all previous work, and it's the only one for which there is a correct implementation;
FK does no memory reclamation and has errors in the implementation;
YMC has a memory reclamation that is flawed by design, and even if fixed would not be wait-free, and seems to have some errors in the implementation (at least they were the first ones to realize the importance of doing memory reclamation).
This means that CRTurn queue is the first to reclaim its own memory, and therefore, the first that can be implemented in C or C++ or something without a GC. Not only does it reclaim the memory of the dequeued nodes, but it does so in a wait-free bounded way, a feat which is not easy (to say the least).

The CRTurn queue has other interesting properties, for example, it does no memory allocation except for the creation of the node that is inserted in the linked list, and even that can be pre-allocated if desired.
In CRTurn, the enqueueing and dequeueing algorithms are isolated, which means you can use just one of them and plug it with a single-threaded queue algorithm (singly-linked list based) to make a wait-free SPMC or MPSC queue. Or even better, the enqueue() algorithm is very short, so it is very tempting to attach it to the dequeue() of the Michael-Scott queue to create a simple queue that is MPMC with wait-free enqueue() and lock-free dequeue().
Such a queue has a small number of lines of code and is relatively easy to reason about and convince yourself of its correctness.
Other properties are, using a fast wait-free consensus we called the "Turn" consensus (think Lamport's Bakery, but better), and it achieves bounded wait-free with just the use of CAS (compare-and-swap), which is nice because not all CPUs have a FAA (fetch-and-add).

This queue was designed for low latency at the tail of the latency distribution, i.e. for real-time systems.  Sure, it does a new/delete for each node that is enqueued/dequeue but you can plugin your custom allocator if you want.
Surprisingly as it may seem, for the uncontended scenario, this queue isn't far behind from the Michael-Scott queue, i.e. the best known memory-unbounded mpmc lock-free queue. It's nice to see that we don't have to pay a very high price for having wait-free guarantees.
We tested this code heavily to make it production-ready and aimed to provide a code that is as simple as possible, but keep in mind that this is a wait-free queue and wait-free algorithms are rarely simple.

This queue is a big deal for us because Andreia and I worked really hard to get here. We spent late nights working on this, unending discussions on how to solve the different problems, how to proceed with the design, writing code, writing tests, running benchmarks, analyzing the results, figuring out the best way to measure tail latency, writing more tests, and more experimental code, and more discussions, until we got to what is being shown as CRTurn queue. Along the way, we created a new wait-free consensus protocol, we had to figure out how to apply Hazard Pointers in a completely wait-free way, how to use the least amount of stores for Hazard Pointers to keep the throughput high, and all of that without doing any heap allocation.
It was a long and hard journey, particularly the memory reclamation part, and we learned a lot along the way.
I don't know if we'll get published in a top tier conference, but we are sharing this fruit of our labor with everyone, and we hope you enjoy it  ;)

More to come...

Friday, October 14, 2016

Dope! I should have thought of that!

What's one of the biggest compliments someone can give me?
It's when I explain an idea or algorithm and someone says: Dope! I should have thought of that!

I think this a great compliment, and let me explain why.

First, when someone says something like that it means that they think the idea is almost obvious or at least simple (to them), which implies that I was able to explain the idea to make it sound simple.
This may seem a trivial task, after all, explaining simple ideas is easy, right?
Well, I'm not always the most articulate person and have a (bad) tendency to overly complicate things, which you might have already picked up on if you 're reading this blog. In my defense, I do make an effort to explain concepts in the simplest way I can think of, too bad that is sometimes a really overly complex way.
Andreia is much better than I at this game. She has this seemingly innate skill to dismantle complex ideas in its basic constituents, and she helps me with it  ;)
Nobody's perfect, but luckily, I don't have this issue when it comes to writing code... the problem with code is that is explains the how but it doesn't explain the why.
So, when someone says "I should have thought of that!" it gets translated in my brain into something like "You made it sound so simple that I wonder why I didn't think of it myself"  lol

Second, no matter how smart you are, if the idea is complex, you're not going to be able to explain it in a simple way. If someone thinks your idea is simple, then there is at least one person in the world to whom your idea is simple, and simplicity is elegance! Most people can make something useful and complicated, but only a few can make something useful and simple.
I'm not a big fan of Steve Jobs, but there is quote attributed to him that I think is perfect for this situation:
When you start looking at a problem and it seems really simple,  
you don’t really understand the complexity of the problem.  Then 
you get into the problem,  and you see that it’s really complicated,  
and you come up with all these convoluted solutions.  That’s sort of 
the middle,  and that’s where most people stop… But the really 
great person will keep on going and find the key, the underlying 
principle of the problem — and come up with an elegant,  really 
beautiful solution that works.

... or from C.A.R. Hoare, The 1980 ACM Turing Award Lecture:
There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.

It's really hard to create something and then go the extra mile to make it as simple as possible.
It takes time, it takes energy, it takes money (aren't these three equivalent anyways?).
This has value not just due to aesthetics but because making something as simple as possible means it is much less likely to have bugs or flaws, and it's easier to propagate to other people, a kind of meme. It becomes viral and infects other people's brains!
Sometimes is goes to the point that other people think they have invented it themselves:

Left-Right is certainly one of these ideas. It can be implemented in less than 10 lines of code and explained in just a few sentences:
The thing is, we spent nearly three years working on it. From its first inception until it was distilled into the simple idea it is today, there was a long journey, with lots of longs nights of discussions and coding.
The how it works may be deceivingly simple, but the why it works this way (and not some other way) is definitely not simple.

So, when someone tells me that Left-Right (or some other idea Andreia and I came up with) is so simple that they should have thought of it, I smile and say thanks, because that's a huge complement, whether they know it or not  ;)

Friday, October 7, 2016

Self-linking and Latency + Life of a Twitter jvm engineer

Today we're going to take a look at the effects of doing self-linking in singly-list based data structures in Java. More specifically, how it changes tail latency in singly-list based queues like the one by Alex Kogan and Erez Petrank.

Back in 2011 Alex Kogan and Erez Petrank showed a multi-producer-multi-consumer wait-free queue.
To the best of our knowledge, they provided the only reasonably useful and correctly implemented memory-unbounded MPMC wait-free queue (that's a mouthful).
Yes, there are generic wait-free techniques, but using them on a queue is just too slow, and yes, there two other MPMC wait-free algorithms but their implementation has bugs and it's CPU specific.
Even the algorithm by Kogan and Petrank relies on having an automatic Garbage Collector (GC) so they provided their implementation in Java. As the authors themselves mention in their paper, there is currently no wait-free GC, neither implemented in production nor in the literature, which means that this queue is never really wait-free, but let's leave that for another post, and just focus on latency.

How good is this "wait-free" queue when it comes to tail latency?

Pretty good as it turns out. We showed some plots in a previous post:

Unfortunately, as soon as you start "hammering on it" with lots of contention, a problem appears.
The problem is that the GC will do pauses, really really long pauses.

This queue is based on a singly-linked list, and like all singly-list based data structures when used with a GC, the unused nodes can't just be unlinked from the list, they need to be self-linked by pointing the next pointer to the node itself.
The why this is important is related to how GCs work, and there is a very good explanation in this presentation starting at minute 23
Life of a Twitter jvm engineer - The garbage keeps coming... by Tony Printezis

which by the way I recommend watching in its entirety, but for the purpose of this post, if you don't know what this self-linking stuff is all about, then go watch ten minutes of it starting at minute 23 and come back and read the rest of this post.

Do you understand now why self-linking is important?
This is really a non-obvious detail of GCs, and the first time I learned about this was several years ago from Doug Lea.

The following code is pretty much what was written in the original paper but with some sun.misc.unsafe added to it:
and in our benchmarks it has GC pauses that go up to 70 seconds.
yes, that's right, seventy seconds during which our 32 core machine is stuck doing nothing except running the GC  :-O
Imagine you are reading this blog post on your laptop/tablet and then you click on a link and everything freezes for 70 seconds due to the GC... does that sound reasonable in any way?
You're going to say that this is the GC's fault and that using a JVM with a decent GC like Zing would fix it. I don't know because I don't have access to a JVM with Zing, and I believe that using another GC would improve, but I seriously doubt it will be as effective as doing self-linking.
For a deep-dive on this topic, check out the paper named "A Performance Study of Java Garbage Collectors on Multicore Architectures", particularly figure 1 and table 3:

The trick to help the GC is to do self-linking of the nodes after dequeueing, but we can't self-link the node where your value is, instead, we self-link the node that points to it, because the node where the value is may still be accessible through head, and we need its next to be valid because it will be used by the next thread calling deq() to obtain its "value". Here is the variant we did, with self-linking of unused nodes:
and in our benchmarks it has GC pauses that go up to 3 milliseconds.
Keep in mind that they are exactly the same algorithm, apart from an extra check in help_finish_enq() and for the self-linking in deq() and yet, they have completely different latency profiles.

Below is one example we got with GC logs. In this example the JVM ran out of memory during the benchmark and had to run a full GC which took 17 seconds and then another that took 42 seconds. At the end of the benchmark we call System.gc() and sleep for a while to trigger another GC run and that one took more than 45 seconds.
These 45 seconds are not accounted for in the benchmark which is unfair because the GC is cleaning up the garbage that the queue is producing, so that work should be taken into account when doing benchmarks as well, but anyways, we don't care much about throughput, it's more about latency for us:
##### KPNoSLQueue                          ##### 
Starting run...

[GC (Allocation Failure)  3932481K->221057K(15073280K), 17.3218925 secs]
[GC (Allocation Failure)  4153217K->547969K(15073280K), 42.6884304 secs]

Ending run
Number of enqueues/dequeues per sec = 2235
[GC (System.gc())  1324375K->601121K(15073280K), 45.8331662 secs]
[Full GC (System.gc())  601121K->8434K(15073280K), 0.0763649 secs]

When we run the same benchmark for the self-linking version, there is still one GC allocation failure during the benchmark, but it takes only 1 millisecond to complete, and the throughput is significantly faster because there is very little waiting for the GC:

##### KPQueue                              #####  Starting run...
[GC (Allocation Failure)  3932485K->805K(15073280K), 0.0013195 secs]
Ending run

Number of enqueues/dequeues per sec = 3216
[GC (System.gc())  2541778K->453K(15073280K), 0.0012654 secs]
[Full GC (System.gc())  453K->327K(15073280K), 0.0142511 secs]

This is the problem with not reclaiming unused memory (nodes) and leaving it up to the GC. Once you go down that path, you need to understand what the GC is doing to somehow make friends with the GC so that it will behave the way you want it to.

Long story short, if you're going to implement a singly-list based queue in Java, you better do self-linking.

Friday, September 30, 2016

Wait-free Bounded vs Wait-free Unbounded

Within the progress condition of wait-free there are two major groups:
- methods/algorithms that provide wait-free bounded guarantees;
- methods/algorithms that provide wait-free unbounded guarantees;

The difference between the two is subtle but very important.
Wait-free bounded means that there is a known bound on the number of steps, which itself has a strong but hidden implication: that there is a bound on the number of objects to be tracked, and therefore, to be deleted.
Wait-free unbounded means there is no known bound, and therefore, there may or may not exist a bound on the number of objects to be deleted.
This is confusing but let's consider a concrete example:

This year at PPoPP there was a wait-free queue presented by Chaoran Yang and John Mellor-Crummey. http://dl.acm.org/citation.cfm?id=2851168
John Mellor-Crummey is the MC in the MCS lock, so you may have "heard of him" before  ;)  
This queue uses fetch-and-add (FAA) and is wait-free unbounded. In the meantime, we designed a few algorithms based on fetch-and-add that are also wait-free unbounded (yes, yes, we did some wait-free queues).
The problem with this queue (and the wait-free unbounded queues we did as well) is that the dequeue() may have to traverse an unbounded number of nodes to find a node to dequeue. The basic idea is that you read the current head and then do a FAA on a certain monotonic counter to take a ticket and then traverse from head until you find a node that matches your ticket. All nodes have a monotonically increasing and unique ticket.
The problem with this approach is that between reading the head and doing the FAA the thread may go to sleep, and by the time it wakes up, there could be a very large number of nodes that were dequeued (and enqueued) and the counter in the FAA is for a ticket that is very far from the head.

This is bad enough on itself for latency at the tail of the distribution because now we have to traverse all these nodes which can take a long time, but it gets worse.
The fact that we have to traverse these nodes to find the one containing our ticket (with the item we need to return) means that all those nodes need to be live. In other words, we can't delete any of those nodes.
Do you see the problem with that?
That's ok, Chaoran and John didn't seem to see it either and they know a lot about concurrency.
If you can't delete those nodes even though they have been dequeued, how much memory do you need to keep them alive?
1Mb ? 1Gb? 1Tb? Even more?
Notice that the number of nodes that needs to be traversed is finite, but we don't known how large it is because it's unbounded.
So, how much memory do we need to store an unknown amount of nodes?
See the problem now?

Quoting from Harald Hefgott which is one of the top mathematicians of present times: "Computers today are very fast and can also perform calculations in parallel. But the memory is still limited,"

Maybe you think this is purely theoretical.
In that case, consider what happens if we have oversubscription of threads to cores. Let's say we have 8 cores on our machine and we fire up 80 threads to dequeue from the queue, and maybe another 80 threads to enqueue just to keep it balanced. Only 8 threads at a time will be scheduled, and depending on the OS, each thread may be scheduled for few hundred milliseconds (sometimes called the "quantum").
It will take 20 "cycles of scheduling" until a thread is scheduled again, and during that time a lot of dequeues can occur.
If there is a thread calling dequeue() and reading the head and goes to sleep just before doing the FAA, then from that moment on, no further nodes in the queue can be deleted.
Let me repeat this last part because it's important: No further nodes can be deleted, not newly inserted nodes, nor already existing nodes, none, zero, nada, zilch, niente.

The 20 cycles take about 100ms x 20 = 2 seconds, and on average half the time we dequeue (the other half we enqueue) so, how many nodes can we dequeue in one second?
Modern computers can do a few million such operations per second and perhaps, per core, depending on contention, but let's say the answer is Z.
Each node in a queue takes at least 3x64 bits, with 64bits for the next pointer, 64bits for the value pointer, and 64bits for ticket.
We need to keep in memory at least 24 x Z bytes. If Z is a really large number then this may be more memory than there is available on the machine, and Z can increase with more oversubscription, or higher quantums, or and unfair thread scheduler, making the problem worse.

In practice this is a dangerous approach.
In theory, the fact that the program can crash due to running out of memory, kind of implies that these algorithms are not wait-free, which is a bit of a contradiction.
Wait-freedom implies resilience to faults, i.e. a thread is suppose to make progress and complete its work in a finite number of steps, regardless of what the other threads are doing (even sleeping).
If having a single sleeping thread can crash any of the other threads and even the entire application due to memory exhaustion (an extreme case, but quite possible), is this really wait-free?
I don't think so.

In other words, we have one thread calling dequeue() and it goes to sleep for a while, and this simple fact causes any of the other threads to fail, whether enqueues or dequeues or even threads doing something else in the application which will run out of memory and stop working correctly.
What kind of wait-free queue doesn't give a decent latency guarantee at the tail, and it can cause program crashes for no particular reason? These wait-free unbounded queues are not really wait-free.

What would you think if I came up to you and said:
Hey! I invented this really cool wait-free (unbounded) queue. It's great because it's wait-free... it has one small quirk though, it can sometimes cause your application to crash!

Whether you think it's wait-free or not, I doubt you would be willing to deploy such a queue in production if you heard me say it like that  ;)

Are all wait-free unbounded queues and data structures affected by this issue?
I don't think so. If the "unbounded" part of the wait-free is due to some calculation that has an unbounded number of steps and not due to traversing an "unbounded" number of nodes, then it should be safe.
Unfortunately, I don't know of any other wait-free unbounded data structure to provide an example. If you know one, please use the comments section!

Thursday, September 22, 2016

URCU ReadersVersion in C++

On the previous post we showed a new Userspace RCU (URCU) algorithm that can be implemented with just a atomics and the Java/C11/C++1x memory model.
Here is the implementation in C++:

Not only is this code simple enough to just copy/paste into your codebase, but it has a pretty good throughput and scales well.
This seems to be the best basic implementation of an URCU (with just atomics) in all possible scenarios. There may be other implementations that surpass it, but they would definitely have to use some non-portable trickery, which URCUs usually do, stuff like using POSIX signals, or some CPU special instructions, or some Kernel API.
Anyways, if you're interested in portable low-overhead memory reclamation for C/C++ that is wait-free population oblivious for readers and blocking for updaters, yet capable of aggregating grace periods, then this should do the trick for you!

We did a simple benchmark against two of the URCU implementations in the userspace urcu lib http://liburcu.org/ and we'll make the benchmark available in a couple weeks, but in the meantime here are some plots below.
We compared against the Bullet-Proof URCU and the default URCU. More details in this readme https://github.com/urcu/userspace-rcu

The first plot shows just readers, i.e. calls to rcu_read_lock()/rcu_read_unlock(), where the read-side critical section is just an atomic load on a global variable. This means the read-side critical section is as short as possible, and that if there is any contention is caused by the overhead in rcu_read_lock()/rcu_read_unlock(). As you can see from the plots, all three scale well, but the ReadersVersion is the one that scales better.

The second plot shows just calls to synchronize_rcu() and none of the methods scale (more on this in another post) but at least they don't dropoff as the number of threads increases, and ReadersVersion is about 10x faster than the other two URCUs.

The third plot shows again calls to synchronize_rcu() but this time there are two other thread that are continuously doing rcu_read_lock/rcu_read_unlock. The read-side critical section is very long, it consists of reading an array with 100000 entries, which means the time is dominated by the waiting for the readers and not by the synchronization mechanism inside synchronize_rcu().
Again here, ReadersVersion scales well because it aggregates grace periods, and so does the default urcu but doesn't have as good scalability as that. Bullet-Proof has no mechanism for sharing grace periods, and therefore, has no scalability on this kind of scenario.

Wednesday, September 14, 2016

A simple Userspace RCU in Java

In this post we're going to talk about a simple RCU implementatio, and provide an example in Java, but it's easy to port it to C++.

So, what is RCU again?
Yeah, well, RCU is many things and it has been talked about it detail by one of its original authors here.

... no, I mean, like, what is RCU, like _really_?
In its most basic form, RCU is a concurrency construct to provide on-demand quiescence.

By concurrency construct I mean it's a building block to use in a concurrency setting, the same way that a mutual exclusion lock is a concurrency construct, or a semaphore, or a reader-writer lock, etc. And by on-demand quiescence I mean that you can introduce quiescence anywhere in your application, as needed (more on that later).

RCU's basic API is:
  • rcu_read_lock()
  • rcu_read_unlock()
  • synchronize_rcu()

with the first two being called in a block, i.e. for every rcu_read_lock() there must be a later rcu_read_unlock(). Also, we can not call synchronize_rcu() from within a block of code enclosed by rcu_read_lock() / rcu_read_unlock().

Now, unlike the examples I gave of locks and semaphores, RCU does not provide any kind of mutual exclusivity property, at least not directly, but it does provides two very interesting guarantees:
First, the effects of any operation that happens before synchronize_rcu() in one thread (T1), will for sure be visible in another thread (T2) after rcu_read_lock(), if the call to synchronize_rcu() in T1 returned before the call to rcu_read_lock() in T2;
Second, the effects of any operation that happens after synchronize_rcu() in one thread (T1), will for sure not be visible in another thread (T2) before rcu_read_unlock(), if the call to rcu_read_lock() in T2 started before the call to synchronize_rcu() in T1.

... yes, it's confusing when you read it for the first time, but the idea is pretty simple. Here is a schematic that may help to understand:

T1 did some stuff (stuff before) then it called synchronize_rcu() and then it did some other stuff (stuff after). Time flows from top to bottom, i.e. execution order.
The other threads, T2, T3, T4 and T5 are doing rcu_read_lock()/rcu_read_unlock() and we show which events are seen in each situation.

The trickier one here is T5 for which rcu_read_lock() started during the call of synchronize_rcu() (if there was an absolute clock), but because synchronize_rcu() terminated before the call for rcu_read_unlock(), it means that T1 considered that rcu_read_lock() occurred after and so it did have to wait for the call to rcu_read_unlock(), and therefore, T5 will see the stuff that happened before on T1 and may or may not see that stuff that happened after.

Still confused?
That's ok, let's try another way then: The semantics of rcu_read_lock()/rcu_read_unlock() work just like the semantics of a read-lock in a reader-writer lock (only the semantics are similar, the effects are very different). And the semantics for synchronize_rcu(), well, there is no good analogy but you can think of it as having a lock()/unlock() of a writer lock... it's the closest thing I can come up with.

What does an RCU implementation looks like in Java?

static class RCU {

    final static long NOT_READING = Long.MAX_VALUE;

    final static int MAX_THREADS = 128;

    final AtomicLong reclaimerVersion = new AtomicLong(0);

    final AtomicLongArray readersVersion = new AtomicLongArray(MAX_THREADS);


    public RCU() {

        for (int i=0; i < MAX_THREADS; i++) readersVersion.set(i, NOT_READING);



    public static int getTID() {

        return (int)(Thread.currentThread().getId() % MAX_THREADS);



    public void read_lock(final int tid) {  // rcu_read_lock()

        final long rv = reclaimerVersion.get();

        readersVersion.set(tid, rv);

        final long nrv = reclaimerVersion.get();

        if (rv != nrv) readersVersion.lazySet(tid, nrv);



    public void read_unlock(final int tid) { // rcu_read_unlock()

        readersVersion.set(tid, NOT_READING);



    public void synchronize_rcu() {

        final long waitForVersion = reclaimerVersion.incrementAndGet();

        for (int i=0; i < MAX_THREADS; i++) {

            while (readersVersion.get(i) < waitForVersion) { } // spin




In this algorithm above, each thread calling synchronize_rcu() does a incrementAndGet() on an atomic variable named reclaimersVersion and then spins while waiting for all the ongoing readers to complete by scanning the array of their versions named readersVersion.
Upon calling rcu_read_lock(), a (reader) thread will read the current value in reclaimersVersion and set its uniquely assigned entry in the readersVersion array to the same value, and then re-check that the version has not changed and if so, update the readersVersion with the new entry.
When the read operation is done, it will call rcu_read_unlock() which will set its entry in readersVersion back to NOT_READING, a constant with a special value that means that the reader is not currently active.

It's hard to get any simpler than this  ;)

The advantages of this algorithm are:
- Very light in synchronization on the reader's side: Only two sequentially consistent loads and one seq-cst store on rcu_read_lock() and a release store in rcu_read_unlock();
- Wait-free for readers: No loops or retries on either rcu_read_lock() or rcu_read_unlock();
- Starvation-free for reclaimers: Calls to synchronize_rcu() are free from starvation... non-obvious why, but true;
- Multiple reclaimers can make progress simultaneously: Unlike some of the other Userspace RCU implementations that have a mutual exclusion lock inside synchronize_rcu(), this one does not, which means it scales well for multiple reclaimers;
- Code is short and sweet: There can't be many hidden bugs in something that takes less than 20 lines of code;

The only true subtle detail about this algorithm is in the fact that rcu_read_lock() does a load, a store, a load and then a relaxed store. The first load and store would suffice for the algorithm itself, but the thing about volatile/seq-cst stores is that they allow regular code to be reordered from below to above. This means that we could have code being executed before the readersVersion.set(tid, rv) which would be incorrect from the point of view of the expected semantics of RCU, and therefore, we need some kind of barrier to prevent code from traveling up. The easiest way to accomplish this is to do a volatile/seq-cst load, and since we're doing that, we might as well check if it changed and if it did, then we might as well update it with a lazy (or relaxed) store.
Like I said, it's subtle and not too important to understand how the algorithm works, it's more of an implementation detail.

Why would anyone want use RCU in Java? RCU is for memory reclamation, and Java has a GC, right?
Like I mentioned before, RCU is not (just) about memory reclamation, it's about using quiescence in your application to do whatever you want with quiescence, and as amazing as it may seem, there is a non-negligible amount of things you can do with it!

For example, suppose you have a pool of complex objects that are too costly to create, so you want to re-use by returning to the pool. But this pool is concurrent, so you have to make sure that some thread that is lagging behind isn't holding a reference to the object anymore.
After returning the object to the pool, when is it safe to re-use the object?
The answer is RCU: Protect all accesses to the objects from that pool with an rcu_read_lock()/rcu_read_unlock() and only reuse from the pool after calling synchronize_rcu().
Another example is Relativistic Programming which you can see more of in this paper:

A funny thing about RCU is that one single RCU instance can be used for as many purposes as desired, for example, one RCU instance for all the pools (if there are more than one), or one per pool, as you prefer.

We'll prepare a C++ implementation for a future post.