Friday, September 30, 2016

Wait-free Bounded vs Wait-free Unbounded


Within the progress condition of wait-free there are two major groups:
- methods/algorithms that provide wait-free bounded guarantees;
- methods/algorithms that provide wait-free unbounded guarantees;

The difference between the two is subtle but very important.
Wait-free bounded means that there is a known bound on the number of steps, which itself has a strong but hidden implication: that there is a bound on the number of objects to be tracked, and therefore, to be deleted.
Wait-free unbounded means there is no known bound, and therefore, there may or may not exist a bound on the number of objects to be deleted.
This is confusing but let's consider a concrete example:

This year at PPoPP there was a wait-free queue presented by Chaoran Yang and John Mellor-Crummey. http://dl.acm.org/citation.cfm?id=2851168
John Mellor-Crummey is the MC in the MCS lock, so you may have "heard of him" before  ;)  
This queue uses fetch-and-add (FAA) and is wait-free unbounded. In the meantime, we designed a few algorithms based on fetch-and-add that are also wait-free unbounded (yes, yes, we did some wait-free queues).
The problem with this queue (and the wait-free unbounded queues we did as well) is that the dequeue() may have to traverse an unbounded number of nodes to find a node to dequeue. The basic idea is that you read the current head and then do a FAA on a certain monotonic counter to take a ticket and then traverse from head until you find a node that matches your ticket. All nodes have a monotonically increasing and unique ticket.
The problem with this approach is that between reading the head and doing the FAA the thread may go to sleep, and by the time it wakes up, there could be a very large number of nodes that were dequeued (and enqueued) and the counter in the FAA is for a ticket that is very far from the head.

This is bad enough on itself for latency at the tail of the distribution because now we have to traverse all these nodes which can take a long time, but it gets worse.
The fact that we have to traverse these nodes to find the one containing our ticket (with the item we need to return) means that all those nodes need to be live. In other words, we can't delete any of those nodes.
Do you see the problem with that?
No?
That's ok, Chaoran and John didn't seem to see it either and they know a lot about concurrency.
If you can't delete those nodes even though they have been dequeued, how much memory do you need to keep them alive?
1Mb ? 1Gb? 1Tb? Even more?
Notice that the number of nodes that needs to be traversed is finite, but we don't known how large it is because it's unbounded.
So, how much memory do we need to store an unknown amount of nodes?
See the problem now?

Quoting from Harald Hefgott which is one of the top mathematicians of present times: "Computers today are very fast and can also perform calculations in parallel. But the memory is still limited,"

Maybe you think this is purely theoretical.
In that case, consider what happens if we have oversubscription of threads to cores. Let's say we have 8 cores on our machine and we fire up 80 threads to dequeue from the queue, and maybe another 80 threads to enqueue just to keep it balanced. Only 8 threads at a time will be scheduled, and depending on the OS, each thread may be scheduled for few hundred milliseconds (sometimes called the "quantum").
It will take 20 "cycles of scheduling" until a thread is scheduled again, and during that time a lot of dequeues can occur.
If there is a thread calling dequeue() and reading the head and goes to sleep just before doing the FAA, then from that moment on, no further nodes in the queue can be deleted.
Let me repeat this last part because it's important: No further nodes can be deleted, not newly inserted nodes, nor already existing nodes, none, zero, nada, zilch, niente.

The 20 cycles take about 100ms x 20 = 2 seconds, and on average half the time we dequeue (the other half we enqueue) so, how many nodes can we dequeue in one second?
Modern computers can do a few million such operations per second and perhaps, per core, depending on contention, but let's say the answer is Z.
Each node in a queue takes at least 3x64 bits, with 64bits for the next pointer, 64bits for the value pointer, and 64bits for ticket.
We need to keep in memory at least 24 x Z bytes. If Z is a really large number then this may be more memory than there is available on the machine, and Z can increase with more oversubscription, or higher quantums, or and unfair thread scheduler, making the problem worse.

In practice this is a dangerous approach.
In theory, the fact that the program can crash due to running out of memory, kind of implies that these algorithms are not wait-free, which is a bit of a contradiction.
Wait-freedom implies resilience to faults, i.e. a thread is suppose to make progress and complete its work in a finite number of steps, regardless of what the other threads are doing (even sleeping).
If having a single sleeping thread can crash any of the other threads and even the entire application due to memory exhaustion (an extreme case, but quite possible), is this really wait-free?
I don't think so.

In other words, we have one thread calling dequeue() and it goes to sleep for a while, and this simple fact causes any of the other threads to fail, whether enqueues or dequeues or even threads doing something else in the application which will run out of memory and stop working correctly.
What kind of wait-free queue doesn't give a decent latency guarantee at the tail, and it can cause program crashes for no particular reason? These wait-free unbounded queues are not really wait-free.

What would you think if I came up to you and said:
Hey! I invented this really cool wait-free (unbounded) queue. It's great because it's wait-free... it has one small quirk though, it can sometimes cause your application to crash!

Whether you think it's wait-free or not, I doubt you would be willing to deploy such a queue in production if you heard me say it like that  ;)


Are all wait-free unbounded queues and data structures affected by this issue?
I don't think so. If the "unbounded" part of the wait-free is due to some calculation that has an unbounded number of steps and not due to traversing an "unbounded" number of nodes, then it should be safe.
Unfortunately, I don't know of any other wait-free unbounded data structure to provide an example. If you know one, please use the comments section!

Thursday, September 22, 2016

URCU ReadersVersion in C++

On the previous post we showed a new Userspace RCU (URCU) algorithm that can be implemented with just a atomics and the Java/C11/C++1x memory model.
Here is the implementation in C++:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/papers/gracesharingurcu/URCUGraceVersion.hpp

Not only is this code simple enough to just copy/paste into your codebase, but it has a pretty good throughput and scales well.
This seems to be the best basic implementation of an URCU (with just atomics) in all possible scenarios. There may be other implementations that surpass it, but they would definitely have to use some non-portable trickery, which URCUs usually do, stuff like using POSIX signals, or some CPU special instructions, or some Kernel API.
Anyways, if you're interested in portable low-overhead memory reclamation for C/C++ that is wait-free population oblivious for readers and blocking for updaters, yet capable of aggregating grace periods, then this should do the trick for you!

We did a simple benchmark against two of the URCU implementations in the userspace urcu lib http://liburcu.org/ and we'll make the benchmark available in a couple weeks, but in the meantime here are some plots below.
We compared against the Bullet-Proof URCU and the default URCU. More details in this readme https://github.com/urcu/userspace-rcu

The first plot shows just readers, i.e. calls to rcu_read_lock()/rcu_read_unlock(), where the read-side critical section is just an atomic load on a global variable. This means the read-side critical section is as short as possible, and that if there is any contention is caused by the overhead in rcu_read_lock()/rcu_read_unlock(). As you can see from the plots, all three scale well, but the ReadersVersion is the one that scales better.




The second plot shows just calls to synchronize_rcu() and none of the methods scale (more on this in another post) but at least they don't dropoff as the number of threads increases, and ReadersVersion is about 10x faster than the other two URCUs.



The third plot shows again calls to synchronize_rcu() but this time there are two other thread that are continuously doing rcu_read_lock/rcu_read_unlock. The read-side critical section is very long, it consists of reading an array with 100000 entries, which means the time is dominated by the waiting for the readers and not by the synchronization mechanism inside synchronize_rcu().
Again here, ReadersVersion scales well because it aggregates grace periods, and so does the default urcu but doesn't have as good scalability as that. Bullet-Proof has no mechanism for sharing grace periods, and therefore, has no scalability on this kind of scenario.


Wednesday, September 14, 2016

A simple Userspace RCU in Java

In this post we're going to talk about a simple RCU implementatio, and provide an example in Java, but it's easy to port it to C++.

So, what is RCU again?
Yeah, well, RCU is many things and it has been talked about it detail by one of its original authors here.

... no, I mean, like, what is RCU, like _really_?
In its most basic form, RCU is a concurrency construct to provide on-demand quiescence.

By concurrency construct I mean it's a building block to use in a concurrency setting, the same way that a mutual exclusion lock is a concurrency construct, or a semaphore, or a reader-writer lock, etc. And by on-demand quiescence I mean that you can introduce quiescence anywhere in your application, as needed (more on that later).

RCU's basic API is:
  • rcu_read_lock()
  • rcu_read_unlock()
  • synchronize_rcu()

with the first two being called in a block, i.e. for every rcu_read_lock() there must be a later rcu_read_unlock(). Also, we can not call synchronize_rcu() from within a block of code enclosed by rcu_read_lock() / rcu_read_unlock().


Now, unlike the examples I gave of locks and semaphores, RCU does not provide any kind of mutual exclusivity property, at least not directly, but it does provides two very interesting guarantees:
First, the effects of any operation that happens before synchronize_rcu() in one thread (T1), will for sure be visible in another thread (T2) after rcu_read_lock(), if the call to synchronize_rcu() in T1 returned before the call to rcu_read_lock() in T2;
Second, the effects of any operation that happens after synchronize_rcu() in one thread (T1), will for sure not be visible in another thread (T2) before rcu_read_unlock(), if the call to rcu_read_lock() in T2 started before the call to synchronize_rcu() in T1.

... yes, it's confusing when you read it for the first time, but the idea is pretty simple. Here is a schematic that may help to understand:


T1 did some stuff (stuff before) then it called synchronize_rcu() and then it did some other stuff (stuff after). Time flows from top to bottom, i.e. execution order.
The other threads, T2, T3, T4 and T5 are doing rcu_read_lock()/rcu_read_unlock() and we show which events are seen in each situation.

The trickier one here is T5 for which rcu_read_lock() started during the call of synchronize_rcu() (if there was an absolute clock), but because synchronize_rcu() terminated before the call for rcu_read_unlock(), it means that T1 considered that rcu_read_lock() occurred after and so it did have to wait for the call to rcu_read_unlock(), and therefore, T5 will see the stuff that happened before on T1 and may or may not see that stuff that happened after.


Still confused?
That's ok, let's try another way then: The semantics of rcu_read_lock()/rcu_read_unlock() work just like the semantics of a read-lock in a reader-writer lock (only the semantics are similar, the effects are very different). And the semantics for synchronize_rcu(), well, there is no good analogy but you can think of it as having a lock()/unlock() of a writer lock... it's the closest thing I can come up with.

What does an RCU implementation looks like in Java?



static class RCU {

    final static long NOT_READING = Long.MAX_VALUE;

    final static int MAX_THREADS = 128;

    final AtomicLong reclaimerVersion = new AtomicLong(0);

    final AtomicLongArray readersVersion = new AtomicLongArray(MAX_THREADS);

   

    public RCU() {

        for (int i=0; i < MAX_THREADS; i++) readersVersion.set(i, NOT_READING);

    }

   

    public static int getTID() {

        return (int)(Thread.currentThread().getId() % MAX_THREADS);

    }

   

    public void read_lock(final int tid) {  // rcu_read_lock()

        final long rv = reclaimerVersion.get();

        readersVersion.set(tid, rv);

        final long nrv = reclaimerVersion.get();

        if (rv != nrv) readersVersion.lazySet(tid, nrv);

    }

   

    public void read_unlock(final int tid) { // rcu_read_unlock()

        readersVersion.set(tid, NOT_READING);

    }

   

    public void synchronize_rcu() {

        final long waitForVersion = reclaimerVersion.incrementAndGet();

        for (int i=0; i < MAX_THREADS; i++) {

            while (readersVersion.get(i) < waitForVersion) { } // spin

        }

    }

}


In this algorithm above, each thread calling synchronize_rcu() does a incrementAndGet() on an atomic variable named reclaimersVersion and then spins while waiting for all the ongoing readers to complete by scanning the array of their versions named readersVersion.
Upon calling rcu_read_lock(), a (reader) thread will read the current value in reclaimersVersion and set its uniquely assigned entry in the readersVersion array to the same value, and then re-check that the version has not changed and if so, update the readersVersion with the new entry.
When the read operation is done, it will call rcu_read_unlock() which will set its entry in readersVersion back to NOT_READING, a constant with a special value that means that the reader is not currently active.

It's hard to get any simpler than this  ;)

The advantages of this algorithm are:
- Very light in synchronization on the reader's side: Only two sequentially consistent loads and one seq-cst store on rcu_read_lock() and a release store in rcu_read_unlock();
- Wait-free for readers: No loops or retries on either rcu_read_lock() or rcu_read_unlock();
- Starvation-free for reclaimers: Calls to synchronize_rcu() are free from starvation... non-obvious why, but true;
- Multiple reclaimers can make progress simultaneously: Unlike some of the other Userspace RCU implementations that have a mutual exclusion lock inside synchronize_rcu(), this one does not, which means it scales well for multiple reclaimers;
- Code is short and sweet: There can't be many hidden bugs in something that takes less than 20 lines of code;

The only true subtle detail about this algorithm is in the fact that rcu_read_lock() does a load, a store, a load and then a relaxed store. The first load and store would suffice for the algorithm itself, but the thing about volatile/seq-cst stores is that they allow regular code to be reordered from below to above. This means that we could have code being executed before the readersVersion.set(tid, rv) which would be incorrect from the point of view of the expected semantics of RCU, and therefore, we need some kind of barrier to prevent code from traveling up. The easiest way to accomplish this is to do a volatile/seq-cst load, and since we're doing that, we might as well check if it changed and if it did, then we might as well update it with a lazy (or relaxed) store.
Like I said, it's subtle and not too important to understand how the algorithm works, it's more of an implementation detail.



Why would anyone want use RCU in Java? RCU is for memory reclamation, and Java has a GC, right?
Like I mentioned before, RCU is not (just) about memory reclamation, it's about using quiescence in your application to do whatever you want with quiescence, and as amazing as it may seem, there is a non-negligible amount of things you can do with it!

For example, suppose you have a pool of complex objects that are too costly to create, so you want to re-use by returning to the pool. But this pool is concurrent, so you have to make sure that some thread that is lagging behind isn't holding a reference to the object anymore.
After returning the object to the pool, when is it safe to re-use the object?
The answer is RCU: Protect all accesses to the objects from that pool with an rcu_read_lock()/rcu_read_unlock() and only reuse from the pool after calling synchronize_rcu().
Another example is Relativistic Programming which you can see more of in this paper:
http://web.cecs.pdx.edu/~walpole/papers/haskell2015.pdf

A funny thing about RCU is that one single RCU instance can be used for as many purposes as desired, for example, one RCU instance for all the pools (if there are more than one), or one per pool, as you prefer.


We'll prepare a C++ implementation for a future post.