
Monday, October 28, 2019

Is RCU a generic concurrency control?

In this post we're going to talk about RCU, and we're going to start with what the term actually means.
When people talk about RCU they usually mean two different things combined: a synchronization mechanism, which I'll call RCU-sync, and a make-a-copy-of-the-object-and-modify-the-copy technique, which I'll call Copy-On-Write (COW).

Let's start with Copy-On-Write.
This is a technique used in concurrency and even in durability (see previous post) where, instead of modifying an object in place, you make a copy of the object, modify the copy, and then swing a pointer from the old object to the new one. The name RCU comes from Read-Copy-Update, a reference to this technique.
When used in (shared memory) concurrency this creates the problem of what to do with the old object: is it safe to delete it immediately? What if there are other threads currently accessing it?
Solving this problem of object lifetime tracking is what the synchronization mechanism of RCU (RCU-sync) does.
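
To make the pattern concrete, here is a minimal sketch of the copy-then-swing step in Java. The Config class and its field are made up for illustration; any heap object would do:

import java.util.concurrent.atomic.AtomicReference;

class CowExample {
    // Hypothetical object, used only to illustrate the pattern.
    static class Config {
        int timeoutMs;
        Config() {}
        Config(Config other) { this.timeoutMs = other.timeoutMs; }
    }

    static final AtomicReference<Config> current = new AtomicReference<>(new Config());

    static void updateTimeout(int newTimeout) {
        Config oldObj, newObj;
        do {
            oldObj = current.get();
            newObj = new Config(oldObj);    // make a copy of the object...
            newObj.timeoutMs = newTimeout;  // ...modify the copy...
        } while (!current.compareAndSet(oldObj, newObj));  // ...swing the pointer
        // In Java the GC reclaims oldObj once no reader can reach it.
        // In C/C++ we have to answer the lifetime question above ourselves,
        // and that is exactly the job of RCU-sync.
    }
}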

There are many different algorithms that implement an RCU-sync. Some examples of RCU-sync algorithms can be seen in Paul McKenney's excellent book or in our paper about fast userspace RCUs. The thing about an RCU-sync is that it's a concurrency construct, like a mutual exclusion lock or a reader-writer lock. For example, a mutual exclusion lock implements at least the following API:
  lock()
  unlock()

while a reader-writer lock implements at least:
  read_lock()
  read_unlock()
  write_lock()
  write_unlock()

and an RCU-sync implements at least:
  rcu_read_lock()
  rcu_read_unlock()
  rcu_synchronize()

The difference between RCU and locks is that locks provide a generic way of doing concurrency, while RCU does not.

The rcu_read_lock() and rcu_read_unlock() methods must guarantee that the memory accesses within the critical section they delimit do not escape outside it. An RCU-sync must also guarantee that a call to rcu_synchronize() will wait for all threads currently executing a rcu_read_lock()/rcu_read_unlock() block to complete it (i.e. call rcu_read_unlock()) before rcu_synchronize() returns to the caller. This guarantee is called quiescence and it's extremely useful in shared memory concurrency. Note that the actual semantics are much more formal than what I've described here, but for the purpose of this post, it's enough to get an idea of how an RCU-sync behaves.
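
To make this more concrete, here is a minimal (and deliberately naive) sketch of one way an RCU-sync can be implemented in Java, in the spirit of the grace-period algorithms in the book and paper mentioned above. The names and the fixed per-thread slots are my own simplifications, and nested read-side critical sections are not supported:

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

class ToyRCU {
    private static final long NOT_READING = Long.MAX_VALUE;
    private final AtomicLong globalVersion = new AtomicLong(0);
    private final AtomicLongArray readerVersion;  // one slot per reader thread

    ToyRCU(int maxThreads) {
        readerVersion = new AtomicLongArray(maxThreads);
        for (int i = 0; i < maxThreads; i++) readerVersion.set(i, NOT_READING);
    }

    void rcuReadLock(int tid) {
        // Publish the version this reader is entering on.
        readerVersion.set(tid, globalVersion.get());
    }

    void rcuReadUnlock(int tid) {
        readerVersion.set(tid, NOT_READING);
    }

    void rcuSynchronize() {
        // Start a new grace period...
        long newVersion = globalVersion.incrementAndGet();
        // ...and wait for every reader that entered before it to leave.
        for (int i = 0; i < readerVersion.length(); i++) {
            while (readerVersion.get(i) < newVersion) Thread.onSpinWait();
        }
    }
}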

As mentioned above, RCU-sync provides quiescence, not mutual exclusion.
A thread calling rcu_synchronize() will know that by the time the call to rcu_synchronize() returns, everything done before this call is now visible to other threads (as long as the other threads call rcu_read_lock/unlock to access the associated data). However, there is nothing preventing other threads from modifying the same data at the same time (data race behavior). This means that even when using RCU-sync we still need some extra mechanism to provide mutual exclusion, or prevent races in some other way.
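
As a sketch of what that usually looks like in practice, here is a hypothetical pattern combining a lock (for writer-vs-writer mutual exclusion) with the ToyRCU above (for quiescence before reclaiming the old object). All names are illustrative:

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

class RcuProtectedBox<T> {
    private final AtomicReference<T> ptr;
    private final ReentrantLock writerLock = new ReentrantLock();
    private final ToyRCU rcu;

    RcuProtectedBox(T initial, ToyRCU rcu) {
        this.ptr = new AtomicReference<>(initial);
        this.rcu = rcu;
    }

    // Readers must touch the object only inside the read-side critical section.
    <R> R read(int tid, Function<T, R> fn) {
        rcu.rcuReadLock(tid);
        try {
            return fn.apply(ptr.get());
        } finally {
            rcu.rcuReadUnlock(tid);
        }
    }

    void update(T newObj) {
        writerLock.lock();  // RCU-sync does not prevent write-write races; the lock does
        try {
            T old = ptr.getAndSet(newObj);
            rcu.rcuSynchronize();  // after this returns, no reader can still hold 'old'
            // In C/C++ this is where we would free 'old'; in Java the GC does it for us.
        } finally {
            writerLock.unlock();
        }
    }
}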

At its essence, RCU-sync is a mechanism for a single writer to synchronize with multiple readers. It doesn't solve the problem of multiple writers, and it also doesn't solve the problem of the readers having a consistent view of the data.
Let me explain a bit more what I mean by "consistent view". Suppose we have two variables, "a" and "b", both initially zero, with a single writer thread and multiple reader threads. The writer code could be:
writer() {
    a = 1;
    rcu_synchronize();   // waits for all pre-existing readers to finish
    b = 1;
    rcu_synchronize();   // waits for all pre-existing readers to finish
}

and the reader code:
reader() {
    rcu_read_lock();
    printf("a=%d\n", a);
    printf("b=%d\n", b);
    rcu_read_unlock();
}


A reader thread could print "a=1,b=0", even if "a" and "b" are std::atomic with memory_order_consume, or even memory_order_seq_cst.
RCU-sync does not provide atomicity for multiple variables. In order to achieve this, we need to aggregate all those variables within a single object and take a snapshot of that object using Copy-On-Write. But then the concurrency technique really is COW, not RCU-sync. With such an approach, the RCU-sync is used only for memory reclamation of the old objects, and if we're in Java we don't even need that, because the GC will do it for us.
And that is why, whenever they use copy-on-write in java.util.concurrent, they call it copy-on-write and not RCU  ;)
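
Here is a minimal sketch of that aggregation in Java: both variables live in one immutable snapshot object, so readers always see "a" and "b" change together, and the GC takes care of the old snapshots. The class names are made up:

import java.util.concurrent.atomic.AtomicReference;

class CowPair {
    static final class Snapshot {
        final int a, b;
        Snapshot(int a, int b) { this.a = a; this.b = b; }
    }

    private final AtomicReference<Snapshot> ref = new AtomicReference<>(new Snapshot(0, 0));

    // Readers always get "a" and "b" from the same snapshot: no torn views.
    Snapshot read() { return ref.get(); }

    // Writer: copy, modify the copy, publish with a CAS (retry on contention).
    void setA(int newA) {
        Snapshot old, next;
        do {
            old = ref.get();
            next = new Snapshot(newA, old.b);  // copy "b", modify "a"
        } while (!ref.compareAndSet(old, next));
    }
}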


Let me repeat it again: RCU-sync is extremely useful for memory reclamation and as a component of more complex concurrency techniques, but it is not generic. RCU-sync does not provide atomicity, nor does it prevent write-write races, or even read-write races. All it gives is quiescence, which personally I think is awesome! It's all a matter of expectations.

Just to go back to COW briefly: COW can be used to solve write-write races and read-write races by creating a new object on every modification of every sub-object. Basically, if you want full consistency across your data, you need a god object that contains all your data, and any modification to the data (even to just a single variable) will require a full copy of the entire data.
So yes, COW can be generic, it's just that nobody does it like this because it would be too slow.

Both COW and RCU-sync are valuable tools that researchers and concurrency library implementers can use to make sophisticated concurrent data structures or other synchronization mechanisms, but calling them generic concurrency controls is going too far.
In other words, there certainly are database engines and STMs (software transactional memory) which use COW or RCU-sync internally, but you'll never see a DBMS or STM whose concurrency control is RCU. It just doesn't make sense to have a DBMS where every mutative operation causes a copy of the entire DB to be made, and that would be the only way to provide generic transactions on that DBMS or STM.
In the DBMS context, a concurrency control is something like MVCC (multi-version concurrency control) or 2PL (two-phase locking) or OCC (optimistic concurrency control).

RCU is not a generic concurrency control.

Monday, July 11, 2016

John Cairns Presents - Smashing Atomics

If you're into Java and queues, here is a light intro talk by John Cairns. There's nothing in it that you haven't already seen in other talks posted here (except for his Disruptor variant), but it's always nice to see how a SeqLock can work, and if you're a newbie to the area, this is not a bad place to start:

Btw, there are a few slides that mention RCU, but just replace that with Copy-On-Write because that's what is being talked about, not the RCU synchronization technique.

Also, the slide on "Lock Freedom" has a few incorrect statements so just ignore it.
Namely, the third statement, which says "All threads will make progress in a finite time", is incorrect. Lock-free methods do not give such a guarantee. To have all threads make progress in a finite number of steps we need wait-freedom, and there is no progress condition that gives guarantees about "time": they are all defined in terms of number of steps.
The fourth statement in the "Lock Freedom" slide is also incorrect. There are many lock-free data structures with some methods that are not linearizable. The best known is the Michael-Scott queue (implemented in java.util.concurrent.ConcurrentLinkedQueue), which is not linearizable for iterations. Linearizability is a consistency model and lock-freedom is a progress condition, and although there is some faint relationship between the two concepts, they are mostly orthogonal.

Saturday, October 17, 2015

Talk at CppCon 2015

Last year I was supposed to give a talk at CppCon, but one week before it I had surgery, so I couldn't go. This year I was able to attend and give a talk, not just about Left-Right, but also about progress conditions and how they affect latency.
If you care about wait-free data structures (for reads) and about low-latency software development, then I strongly recommend this talk... but I'm totally biased   ;)



Yes, that's me in the video... and no, your sound is not muted, I just speak low, so you'll have to increase the volume  :)

Presentation slides available on github
https://github.com/CppCon/CppCon2015/tree/master/Presentations/How%20to%20make%20your%20data%20structures%20wait-free%20for%20reads



PS: Big thanks to those of you who dropped by at the end of the talk to give some words of encouragement. Don't worry, Andreia and I are still cooking up new algorithms and data structures and will give more info soon  ;)

Friday, March 14, 2014

Harris with AtomicMarkableReference - add

Still on the topic of Harris's linked list, today we will cover the variant using AtomicMarkableReference (AMR), which is described in "The Art of Multiprocessor Programming", section 9.8.

As shown in the previous post, the basic Harris algorithm has one node per key, where each node has a "next" member with one bit reserved for the "marked" variable:


We also talked about the fact that managed languages (like Java) do not allow bit manipulation of references, which means that the classic Harris algorithm cannot be implemented in Java. We can instead implement a variant, which is HarrisAMR.
The source code for HarrisAMR can be seen in section 9.8 of "The Art of Multiprocessor Programming". The problem is that the first edition of the book has a few errors in the code. The revised edition has the correct code, but here are the modifications needed in case you only have the first edition:

Figure 9.26, line 27 should be:
    snip = curr.next.compareAndSet(succ, succ, false, true);

Figure 9.27, line 40 should be:
    curr = curr.next.getReference();

If you have neither the first edition nor the revised first edition, then here is a link to our own implementation:

http://sourceforge.net/projects/ccfreaks/files/java/src/com/concurrencyfreaks/lists/HarrisAMRLinkedList.java

The AtomicMarkableReference class in java.util.concurrent uses the Copy-On-Write pattern to modify its member variables. It is composed of a "reference" and a boolean flag named "mark", both of which are final. Whenever one (or both) of these variables needs to be modified, a new instance must be created and replace the old one using a Compare-And-Swap (CAS). This technique is lock-free for modifications, which means it can be applied to Harris's linked list without weakening its progress guarantees.
In fact, in Java there is an extra indirection, caused by the existence of the class Pair inside AtomicMarkableReference, as can be seen in the source code here. The Pair is the class that actually contains the "reference" and the "mark", but for simplicity we will talk about the AMR as being everything put together, otherwise the next schematics would have been (even more) insane. In HarrisAMR, the "reference" in the AMR is a reference to the next node in the list.
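
To see what that pattern looks like, here is a simplified sketch of an AMR-like class. The real java.util.concurrent implementation follows the same idea, with some extra optimizations:

import java.util.concurrent.atomic.AtomicReference;

class ToyMarkableReference<V> {
    // The reference and the mark live together in an immutable Pair.
    private static final class Pair<V> {
        final V reference;
        final boolean mark;
        Pair(V reference, boolean mark) { this.reference = reference; this.mark = mark; }
    }

    private final AtomicReference<Pair<V>> pair;

    ToyMarkableReference(V ref, boolean mark) {
        pair = new AtomicReference<>(new Pair<>(ref, mark));
    }

    V getReference() { return pair.get().reference; }
    boolean isMarked() { return pair.get().mark; }

    boolean compareAndSet(V expectedRef, V newRef, boolean expectedMark, boolean newMark) {
        Pair<V> current = pair.get();
        return expectedRef == current.reference
            && expectedMark == current.mark
            // Copy-On-Write: a new Pair is allocated on every successful change.
            && pair.compareAndSet(current, new Pair<>(newRef, newMark));
    }
}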

This technique uses more memory and creates a slightly more complex linked list where the true nodes and the AMR instances alternate with each other, as shown in this figure:


A higher memory occupancy and slower traversal are not the only disadvantages of HarrisAMR. To understand why, let's start by looking at what happens when we do an insertion in the list. Suppose, for example, we want to add the key B to the set in the following configuration, starting by creating Node B:



Now, a new AMR instance is created inside Node B, which references Node D and has its mark set to false:


On Node A, we must now replace the AMR instance that is pointing to Node D with a new instance that points to Node B. Remember that these instances have an immutable (final) reference and mark, so they can't be changed in place: a new instance must be created to replace the old one, using a CAS:
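
In code, the steps above look roughly like this. This is a sketch in the spirit of the add() from the book, with the search and retry loop omitted:

import java.util.concurrent.atomic.AtomicMarkableReference;

class HarrisAmrInsertSketch {
    static final class Node {
        final int key;
        AtomicMarkableReference<Node> next;
        Node(int key) { this.key = key; }
    }

    // Insert a new node with newKey between pred (Node A) and curr (Node D).
    static boolean insertBetween(Node pred, Node curr, int newKey) {
        Node newNode = new Node(newKey);                            // 1st new object: Node B
        newNode.next = new AtomicMarkableReference<>(curr, false);  // 2nd: the AMR referencing Node D
        // The CAS allocates a 3rd object internally: a new instance pointing to
        // Node B replaces the one pointing to Node D, which becomes garbage.
        return pred.next.compareAndSet(curr, newNode, false, false);
    }
}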



All that's left is for the Garbage Collector to clean up the old instance that is pointing to Node D:

This whole procedure not only causes churn on the GC, it also causes cache flushing, because the "old" AMR instance that is going to be collected by the GC no longer needs to be in cache, and four new pointers have to be brought into the cache: from "next" to AMR B, from AMR B to Node B, from Node B to AMR D, and from AMR D to Node D.

So, for a single add(), we had to create three new objects in memory and clean up an old instance... and we didn't even look at the case where the add() has to retry because another thread did an add() in front of it; in that case, even more (unused) objects will be created!