Wednesday, October 30, 2019

How about RLU? Is RLU a generic concurrency control?

On the previous post we went over why RCU is not a generic concurrency control. This time we'll talk about RLU (it's an L, not a C):

RLU is a synchronization technique which contains inside it an efficient userspace RCU.
It supports multi-atomic updates, making them visible in one-shot to the readers, without having to make an replica of the entire data, like on RCU.

The idea of RLU is that during a write operation, each modified object is first copied into a per-thread log, then it makes the modifications to the object in the log. When all modifications have been made, it increments the global clock, thus making all modifications atomically visible to readers. I'm over-simplifying, but it's ok for this post.

RLU is certainly more generic and efficient than RCU but it does not allow for generic code to be used. In fact, RLU does write-read synchronization but it does not handle write-write synchronization and the the paper itself mentions this explicitly:
"The basic idea of RLU described above provides read-write synchronization for object accesses. It does not however ensure write-write synchronization, which must be managedby the programmer if needed (as with RCU)".

To better understand why RLU is not a generic approach to concurrency, let's consider the canonical example of two threads 1 and 2 with cross dependencies. They are attempting to execute the following operations, where x and y start both at false:
Thread 1:
  if (!x) y = true;

Thread 2:
  if (!y) x = true;

As I'm sure you're aware, any serializable execution of these two operations will either result in x and y being {true,false} or {false,true} but never {true,true}. One of the operations has to "happen before" the other. If T1 executes before T2 then the end result will be {false,true} and if T2 executes before T1 then the end result will be {true,false}.
If you run this code without modification on RLU, it can happen that T1 reads x and sees false while T2 reads y and sees false, then T1 acquires the lock on y while T2 acquires the lock on x and each write to their log that the variables will be true and each advance the global clock and commit. In case you're wondering, this scenario is not linearizable.

In the database world, this kind of problem is called a write skew anomaly (see section 2.4 of the MV-RLU paper for a great explanation in the context of RLU). In the context of concurrent data structures, we call these bugs!
Handling cases like this is what an STM or a Universal Construction are supposed to do because an STM/UC must handle any sequential piece code (with some exceptions).
Granted, if you're really careful how you write the code, you can use RLU as long as you avoid writing sequential code that has this kind of cross dependencies. The problem is, in practice, this is very hard to do and even when you do it, you're never really sure it's actually done correctly (did I miss some cross dependency?). If you write user-code that has one of these problems, it will be hard to find it and likely hard to fix it.

There is an easy solution for RLU and that is to use a global lock, but then it looses some of its appeal because it no longer scales for writes.
Keep in mind though that RLU scales for read-only transactions!

STMs like TinySTM or TL2 have time-based algorithms that keep a read-set of all loads done within the transaction and do validation of the read-set at commit time exactly to handle these kind of cross dependencies.
If you're wondering how frequent are cross dependencies in real code, then even an ordered linked list is prone to these kind of issues (depending on how it is implemented), which helps to show that it's very hard to get away from this problem in practice.

If you ever read a paper where they say that they used RLU to make an STM, be very very suspicious: an STM must be able to handle cross dependencies and provide linearizable (globally serializable) operations regardless of what the underlying sequential user-code. If it works only for some small subset of user programs, then it's not an STM... sure it's a very valuable tool, but it's not an STM.

Monday, October 28, 2019

Is RCU a generic concurrency control?

In this post we're going to talk about RCU and we're going to start by what it means.
When people talk about RCU they usually mean two different things combined: a synchronization mechanism, which I'll call RCU-sync, and a make-copy-of-object-and-modify-copy, which I'll call Copy-On-Write (COW).

Let's start with Copy-On-Write.
This is a technique used in concurrency and even on durability (see previous post) where instead of modifying an object in-place, you make a copy of the object, modify the new object and then toggle a pointer to point from the old object to the new object. The name RCU comes from Read-Copy-Update as a reference to this technique.
When used in (shared memory) concurrency this creates the problem of what to do with the old object: is it safe to delete it immediately? What if there are other threads currently accessing it?
Solving this problem of object lifetime tracking is what the synchronization mechanism of RCU (RCU-sync) does.

There are many different algorithms that implement an RCU-sync. Some examples of RCU-sync algorithms can be seen in Paul McKenney's excellent book or in our paper about fast userspace RCUs. The thing about an RCU-sync is that it's a concurrency construct, like a mutual exclusion lock or a reader-writer lock. For example, a mutual exclusion lock implements at least the following API:

while a reader-writer lock implements at least:

and an RCU-sync implements at least:

The difference between RCU and locks is that locks provide a generic way of doing concurrency, while RCU does not.

The rcu_read_lock() and rcu_read_unlock() methods must guarantee that code within a critical section defined by these two functions must not escape outside the critical section. An RCU-sync must also guarantee that a call to rcu_synchronize() will wait for all threads currently executing a block of code in a rcu_read_lock/unlock() block to complete (call rcu_read_unlock) before rcu_synchronize() returns to the caller. This guarantee is called quiescence and it's extremely useful in shared memory concurrency. Note that the actual semantics are much more formal that what I've described here, but for the purpose of this post, it's enough to get an idea of how an RCU-sync behaves.

As mentioned above, RCU-sync provides quiescence, not mutual exclusion.
A thread calling rcu_synchronize() will know that by the time the call to rcu_synchronize() returns, everything done before this call is now visible to other threads (as long as the other threads call rcu_read_lock/unlock to access the associated data). However, there is nothing preventing other threads from modyfing the same data at the same time (data race behavior). This means that even when using RCU-synch we still need some extra mechanism to provide mutual exclusion or prevent races in some other way.

At its essence, RCU-sync is mechanism for a single writer to synchronize with multiple readers. It doesn't solve the problem of multiple writers and it also doesn't solve the problem of the readers having a consistent view of the data.
Let me explain a bit more about what I mean with "consistent view". Suppose we have two variables, "a" and "b" initially at zero and a single writer thread and multiple reader threads. The writer code could be:
writer() {
    a = 1;
    b = 1;

and the reader code:
reader() {
    printf("a=%d\n", a);
    printf("b=%d\n", b);

A reader thread could print "a=1,b=0", even if "a" and "b" are std::atomic with memory_order_consume, or even memory_order_seq_cst.
RCU-sync does not provide atomicity for multiple variables. In order to achieve this, we need to aggregate all those variables within a single object and make a snapshopt of this object using Copy-On-Write. But then, the concurrency technique really is COW, not RCU-sync. With such an approach, the RCU-sync is used only for memory reclamation of the old objects and if we're in Java then we don't even need that because the GC will do it for us.
And that is why whenever they use copy-on-write on java.util.concurrent they call it copy-on-write and not RCU  ;)

Let me repeat it again: RCU-sync is extremely useful for memory reclamation and as a component of more complex concurrency techniques, but it is not generic. RCU-sync does not provide atomicity, nor does it prevent write-write races, not even read-write races. All it does it give quiescence, which personally I think it's awesome! It's all a matter of expectations.

Just to go back to COW briefly, COW can be used to solve write-write races and read-write races by creating a new object on every modification of every sub-object. Basically, if you want full consistency across your data, you need a god object that contains all your data and any modification to the data (or just a single variable) will require a full copy of the entire data.
So yes, COW can be generic, it's just that nobody does it like this because it would be too slow.

Both COW and RCU-sync are valuable tools that researchers and concurrency library implementers can use to make sophisticated concurrent data structures or other synchronization mechanisms, but calling them generic concurrency controls is going too far.
In other words, there certainly are database engines and STMs (software transactional memory) which use COW or RCU-sync internally, but you'll never see a DBMS or STM whose concurrency control is RCU. It just doesn't make sense to have a DBMS where any mutative operation causes a copy of the entire DB to be made. Because, that would be the only way to provide generic transactions on that DBMS or STM.
In the DBMS context, a concurrency control is something like MVCC (multi-version concurrency control) or 2PL (two-phase locking) or OCC (optimistic concurrency control).

RCU is not a generic concurrency control.

Thursday, October 24, 2019

Hydra 2019 talks are now online

If you're interested in distributed systems or shared memory concurrency, the Hydra 2019 talks are now online.

Here are a few of my favorites:

Cliff Click on HTM:

Cliff Click on H20 (a distributed system that simulates a shared memory)

Maurice Herlihy on blockchains and distributed computing:

and mine on wait-free data structures and wait-free transactions:

Sunday, October 20, 2019

Why use a PTM in your application?

When adapting or creating new applications that use Persistent Memory (PM), developers are faced with two main choices: using a PTM or manually placing flushes and fences.

If they chose hand-placed fences, they have to worry about the correct placement of the clwbs, sfences and, in which situations are non-temporal stores a better choice performance-wise.
Placing these fences correctly can be a difficult task that even expert researchers will occasionally get wrong.
One incorrect fence or flush is enough the prevent the recovery of the application in the event of a crash, thus losing or even corrupting data in PM permanently.
Moreover, dealing with concurrency must also be done by hand, yet another non-trivial task.
Finding expert developers that are capable of dealing with durability and concurrency is likely the reason why database engines are a costly piece of software that requires a team of experts to assemble.

If the application developer chooses instead to use a PTM, concerns of how to make the code durable and concurrent, disappear.
The developer's task now becomes the identification of which blocks of code and data must be accessed in an atomic way, and encapsulate those inside a transaction. Notice that identifying atomicity of data is a task required even when choosing the hand-placement of flushes and fences.
Another advantage of using a PTM is that any modifications to allocator metadata during allocation or de-allocation, becomes part of the transaction, implying that there will be no persistent memory leaks due to loss of metadata when a crash occurs.
For the end-developer, the code within a transaction block can be reasoned about as if it were sequential.
This shift of complexity from the application developer to the PTM library implementer, tremendously increases development speed for the application developer, reduces bug count and improves code maintainability.

Monday, October 14, 2019

What is a concurrent algorithm without order?

I've heard Leslie Lamport mention on multiple occasions: "An algorithm is not a program!"
and I fully agree with him and would like to add that "A concurrent or durable algorithm without ordering is not an algorithm!"

What Leslie says makes sense because code is just a particular implementation or expression of an algorithm in a specific language. Code is bounded by the constructs that form the language in which that code is written.
An algorithm has no such constraints and is the mathematical expression of an idea.

An algorithm is precise, yet, universal.
A program works only for a particular language.

When an algorithm is incorrect, a program that uses such an algorithm will be incorrect as well.
When an algorithm is correct, a program that uses such an algorithm may be incorrect because it implemented the algorithm incorrectly, but the correctness of the underlying algorithm in unaffected.

An algorithm has mathematical beauty.
A well made program can have craftsmanship and be appreciated as beautiful, but rarely (if ever) on the mathematical sense.

So what do I mean by "An algorithm without ordering is not an algorithm"?
In the context of concurrent and/or durable algorithms order is vital for correctness. In concurrency papers, most people assume sequential consistency of the (atomic) steps that the algorithm performs and indeed, the default memory model in C11 and C++11 is memory_order_seq_cst exactly for that reason. In practice, if you were to implement a concurrent algorithm in such a way, it would likely be slow because each store (and load on non-TSO CPUs) would require a fence to guarantee ordering.
Researchers writing papers assume this strong ordering because they focus on "the algorithm" as being a series of steps and leave the dependencies of those steps (ordering) as an implementation detail.
If performance (throughput/scalability/latency) was not an issue, then we would still be using single-core CPUs and single-threaded applications and would not have to deal with all the complexity that concurrent algorithms entail! The amount of fences are (not always, but) usually what dictates the throughput of a concurrent algorithm. This means that the placement and number of fences (ordering constraints) are vital to the algorithm, not just for correctness but also for performance reasons.

The same logic applies to durable algorithms, where the ordering constraints are typically the main performance indicator. A durable algorithm without ordering does not guarantee durability and therefore, it becomes useless. However, not all steps impose a strong ordering among each other. It becomes important when we describe the algorithm to explicitly mention this dependency of steps.

Most of us don't realize the importance of "order" because for the majority of algorithms this order is implicit. For example, consider a sequential algorithm for inserting a new node in the middle of a linked-list:
Step 1) Create a new node;
Step 2) Make the "next" pointer in the node point to the successor node in the list;
Step 3) Make The "next" pointer in the predecessor node of the list point to the new node;

This is trivially simple, but if you try use this algorithm in a system where ordering is not guaranteed, these steps may become re-ordered (from the point of view of another process/thread) and the algorithm becomes incorrect. And this is one of the main issues of concurrent algorithms and even distributed systems.

Going back to Leslie, he was the first to show that to have mutual exclusion on a concurrent system we need to have at least a store-load fence. In other words, there is a minimum amount of ordering that is needed to make an algorithm that is mutually exclusive.
A similar result exists for durable algorithms: atomic durability requires at least two constraints, one for ordering and one for synchronization (round-trip), see this post for more info.
The similarity is such that Paxos, an algorithm discovered by Leslie Lamport and used in distributed systems, can be used not just to provide failure-resilience in concurrent/distributed systems but also to provide failure-resilience for durability. And the reason behind it is somewhat related to ordering.

Unfortunately, there is no easy way or commonly accepted way of expressing ordering constraints in an algorithm and because of that, most people think that orderings constraints are "an implementation detail". Likely this happens because our human brains are so used to thinking in sequential order that we expect the steps (code) that we write in a certain order to be executed in that exact order, because that's how the physical world around us typically behaves.
The problem is, concurrent algorithms don't follow these rules and because of that, ordering must be part of a concurrent algorithm.

IMO, a concurrent algorithm without a specification of the order is an incomplete algorithm. Same thing for durable algorithms. This also implies that an optimal algorithm is one that has the least amount of ordering, or in other words, a good concurrent/durable algorithm is an algorithm with a small number of fences.

Sunday, October 6, 2019

Recovery on a hand-made durable data structure versus recovery on a Persistent Transactional Memory

A durable data structure is a data structure in Persistent Memory (PM) or Non-Volatile Memory, if you prefer.
These data structures have to be resilient to failures. In other words, if there is a power cut during an operation on the data structure, the data structure must be correctly recovered after the system restarts.
For the rest of this post, and likely for all other posts you read in this site, when I say "durable data structure" I mean a data structure that has "atomic durability": In the event of a failure, the side effects of all completed operations will be visible upon restart and recovery.

The are two main approaches to having a durable data structure in PM:
1) Write an algorithm for a new data structure. This algorithm has to be failure-resilient;
2) Write a regular data structure and wrap it in a PTM;

Making a new data structure is typically the subject of an entire research paper, while using a PTM is not. Researchers tend to lean on doing new data structures because it's more fun (we can't get a paper accepted where we describe "using" a data structure). Which approach is better depends on what you mean by "better", but my favorite is 2).

However, set's say you decide to use a hand-made durable data structure that someone else made. Now how do you use it in your program?

Most applications have many data structure instances. For an application that needs durability (or persistence, if you prefer) not all data needs to be persistent, which means that not all data structures need to be persistent either. However, let's assume it is more than 1 data structure instance.
When a crash occurs, it could have occurred during an operation on *any* of these data structures. This means we need to call the recovery() method for each of these data structure instances, just in case.

This by itself creates a nuisance, because now you need to have an explicit list of all the durable data structures in your application and iterate through all in your recovery code:
void recovery() {   // called after a restart

If you forget to add one of the data structures to the main recovery method, you risk doing operations on an inconsistent data structure and further corrupting your data. In PM, when you corrupt the data it's gone forever... unless you had backups, and even then you're going to lose all data from the last non-corrupt backup.

Another difficulty is that sometimes the number of durable data structures may be dynamic because the application creates and destroys data structures during execution, to store some data. This means that it's not possible to add each data structure instance in the recovery() method because they may not exist when the program restarts. This typically means you will need a registration mechanism where newly created data structure instances are added and later de-registered when the data structure is destroyed. Also, you need to save the root pointer to each of those data structure instances.
PTMs handle this transparently because the transaction has the all-or-nothing semantics. If new data structure instances are created or destroyed, you can use the root pointer API (in OneFile and Romulus we called it get_object() and put_object()). When there is a crash, any modifications to ongoing transactions will be reverted, including allocating or de-allocating a new data structure instance and including adding or removing a root pointer. By having a PTM whose transactions can encompass the modifications on the data structure and the allocation/de-allocation of the data structure and the adding/removing of the instance as a root pointer, we give transactions that are easy to reason about and make your code consistent at all times.

Ok, ok, so it's not that bad for the hand-made durable data structures. We can do automatic checks for these scenarios by adding registration mechanisms in the constructor of the data structure, or even some kind of static check or compile-time assertion to make sure everything is as it should be.
But the thing is, if you use a PTM, you only have to call the recovery method of the PTM, one time, regardless of the number of data structure instances you have in your application!
In the case of OneFile, it has null recovery which means there is no recovery method: the PTM will simply continue execution where it left of.

This is yet another thing where PTMs are better than an hand-made durable data structures.

Tuesday, October 1, 2019

Atomic Durability - How do databases recover from a crash ?

In this post we're going to talk about the four different ways of having durable transactions. If you want to know how databases and file systems guarantee correct data recovery after a power failure, then keep reading!

Writing to a durable/persistent media can be a complex task. If there is a failure half-way through, you may end up with corrupted data.
By durable media I mean a disk drive, an SD card, a USB stick, a Persistent Memory module (NV-DIMM), a network attached NAS, and many other like these.

Suppose you're changing a customer address in a database. The current address is "Bag End, Shire" and we want to change it to "Rivendell".
What happens of there is a crash half-way through the write?
Suppose the first four bytes were written before the crash occurred. Upon restart, the address in the database is now "Rivend, Shire"... clearly an incorrect address. What should we do?

This is a typical problem encountered by the designers of DBMS (Database Management Systems) and file systems.
Generally speaking, there are four solutions to this problem and they are:
1) Undo-Log;
2) Redo-Log;
3) Copy-On-Write (sometimes called Shadow Copy or Shadow Paging);
4) Romulus;

Before we explain how each of these works, we need to introduce two things which we're going to call "ordering fence" and "synchronization fence".
An "ordering fence" is something that guarantees that previous writes will reach the durable media before subsequent writes. For example:


implies that "a" will always be written to the durable media before "b" is written.

A "synchronization fence" is something that guarantees that previously written data is now durable. And by durable I mean it's persistent in the media and in the event of a crash, this data will not be lost. For example:


if you heard a beep and immediately after there is a power cut, you know for sure that the media contains the data "a" because the beep occurred after the sync_fence() which was after the write(a) was persisted.

The actual semantics depend on what hardware we're talking about, but the idea is pretty much the same: you need order_fence() and you need sync_fence().
When writing to disk, both the order_fence() and sync_fence() have to be implemented as fsync() or fdatasync(). When writing to PM they're made up of a combination of CLWB and SFENCE instructions.

Now back to the actual algorithms

1) Undo-Log

This technique is at least 30 years old and dates back to AIRES.
As the name indicates, it's an undo-log technique. Before writing to the durable media, we copy the original contents to a log, then we indicate that the log is consistent, then write the new values on the actual data, and finally reset the log:

1:  write(dst, src, size) {
2:    memcpy(src, log, size);   // copy original data to log
3:    order_fence();
4:    recover_from_log = true;
5:    order_fence();
6:    memcpy(dst, src, size);   // write new data
7:    order_fence();
8:    recover_from_log = false;
9:    sync_fence();
10: }

Let's examine what happens if there is a crash in:
- Lines 1 to 4: no data has been over written, nothing happens;
- Lines 4 to 6: 'recover_from_log' is seen as 'true' and the recovery will overwrite the contents of 'src' with the contents of 'log'. They are the same, therefore, no change;
- Lines 6 to 8: the recovery method will overwrite the new data with the contents of the log, restoring the old values, as it should be;
- After line 8: assuming 'recover_from_log' is durable, nothing will be reverted, the write() has committed successfully;

Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing (transaction) semantics, the algorithm changes slightly:

1:  write(dst, src, size) {
2:    memcpy(src, &log[entries].data, size);   // copy original data to log entry
3:    log[entries].size = size;                // save the size of this write()
4:    log[entries].addr = dst;                 // save the destination address
3:    order_fence();
4:    entries++;
5:    order_fence();
6:    memcpy(dst, src, size);                  // write new data
7:  }
8:  commit() {
9:    sync_fence();
10:   entries = 0;
11:   order_fence();
12: }

When used in this way, the number of fences is 2 per call to write() plus 2 at commit time. The higher the number of write()s in the transaction, the more fences will be executed.

Disadvantage 1) Two fences per write(). Can be very slow;
Advantage 1) Reading and writing is done in place;

In the context of Persistent Memory, undo-log is what libpmemobj does. This is the library in PMDK that provides durable transactions.
In the context of databases, AIRES was the first to use undo-log.

2) Redo-Log

In the redo-log technique the approach is almost opposite from the undo-log: we write the new data on the log, make the log consistent and after, copy the new contents from the log to the data:

1:  write(dst, src, size) {
2:    memcpy(log, src, size);   // copy new data to log
3:    order_fence();
4:    recover_from_log = true;
5:    order_fence();
6:    memcpy(dst, log, size);   // copy new data from log
7:    sync_fence();
8:    recover_from_log = false;
9:  }

If there is a crash after the log has been made consistent, the modifications in the log will be re-applied.
Let's examine what happens if there is a crash in:
- Lines 1 to 4: recovery does nothing;
- Lines 4 to 8: recovery method will re-apply the contents of the log to the data (redo) overwriting any old values that may have been left incomplete (in case the crash occurred during line 6);
- After line 8: recovery does nothing;

Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), the algorithm changes slightly:

1:  write(dst, src, size) {
2:    memcpy(&log[entries].data, src, size);   // copy new data to log entry
3:    log[entries].size = size;                // save the size of this write()
4:    log[entries].addr = dst;                 // save the destination address
5:    non_durable_entries++;
6:  }
7:  commit() {
8:    order_fence();
9:    entries = non_durable_entries;           // make the log consistent
10:   sync_fence();
11:   for (i : entries) memcpy(log[i].addr, &log[i].data, log[i].size);
12:   order_fence();
13:   entries = 0;
14:   order_fence();
15: }

When used for transactions, a redo-log always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction. No matter how many write()s are done in the transaction, it's always 4 fences. This number can be brought down to 2 fences, but the algorithm becomes more complex.

Advantage 1) Only four fences per transaction;
Disadvantage 1) Reading data requires a lookup on the log, so as to cover the read-after-write scenarios. Without it, we would be reading stale data and break invariants.

In the context of Persistent Memory, redo-log is what Mnemosyne and Onefile do. Mnemosyne was the first blocking PTM and OneFile was the first wait-free PTM.
In the context of databases, Microsoft Hekaton, MySQL InnoDB engine, PostgreSQL and OracleDB all use a redo-log technique to provide transactions, while IBM DB2 uses a combination of undo and redo logs.

3) Copy-On-Write

Copy-On-Write (COW) is sometimes named "shadow copying" or "shadow paging".
This technique doesn't allow for transactional semantics. For that, it needs to be coupled with either undo-log or redo-log.
The COW technique is always applied to a block or an object. First, a new block/object is reserved, which is a replica of the block/object that is to be modified, then the modifications are executed on the new block/object and finally, a single pointer swap makes the new block/object accessible (durable):

1:  modify(obj_ptr, mutation()) {
2:    new_obj_ptr = malloc(sizeof(*obj_ptr));          // allocate a new object (2 fences or more)
3:    memcpy(new_obj_ptr, obj_ptr, sizeof(*obj_ptr));  // copy the contents of the old to the new object
4:    new_obj_ptr->mutation();                         // apply the modification on the new object
5:    order_fence();
6:    old_obj_ptr = obj_ptr;
7:    obj_ptr = new_obj_ptr;                           // pointer swap (atomic)
8:    sync_fence();
9:    free(obj_ptr);                                   // de-allocate old object (2 fences or more)
10: }

Disadvantage 1) Every modification requires the allocation of a new block/object and the de-allocation of the old one. This can lead to high "fragmentation" on the durable media and it has a performance penalty because the act of allocating a new block/object requires two extra fences to change allocator metadata. This means that COW executes 4 fences per modified object (1 order + 1 sync for the allocator plus 1 order + 1 sync for the modification and pointer change);
Disadvantage 2) The entire block/object has to be copied, even if a single byte has changed. Depending on the usage pattern, this can create high write amplification which in turn affects performance and possibly media endurance (think flash and PM);
Disadvantage 3) Needs and undo-log or redo-log to provide transactional semantics, which can mean additional fences;
Disadvantage 4) Write amplification may be close to zero (just the pointer swap and allocator metadata changes), or it may be close to infinity (a single byte change on a very large object requires a full copy of that object, and allocator metadata changes and the pointer swap);

To be properly durable, COW techniques will always execute *at least* two fences per modified block/object (1 order + 1 sync). Extra fences may be needed for the allocation and de-allocation of the block/object (2 or more fences).

In the context of Persistent Memory, there is no PTM that uses COW but there are hand-made data structures that use this technique extensively, examples being the ones described in the MOD paper (although their algorithm requires two fences per operation to be correct, plus whatever the allocator does).
When it comes to databases, SAP HANA uses a combination of COW and redo-log to provide durable transactions.

4) Romulus

Since last year (2018), there is now a fourth way to achieve durability and it's called Romulus.
In Romulus, two twin replicas of the data are maintained at all times, with the modifications first executed in the first replica 'main' and after, applied to the other replica 'back':

1:  write(dst, src, size) {
2:    state = MUTATING;
3:    order_fence();
4:    memcpy(dst_main, src, size);   // copy new data to 'main'
5:    order_fence();
6:    state = COPYING;
7:    sync_fence();
8:    memcpy(dst_back, src, size);   // copy new data to 'back'
9:    order_fence();
10:   state = IDLE;
11: }

Let's examine what happens if there is a crash in:
- Lines 2 to 4: the recovery method see state at 'MUTATING' and copies the entire contents of 'back' to 'main' (no change);
- Lines 4 to 6: the recovery method copies 'back' to 'main', undoing the possibly incomplete modifications in 'main';
- Lines 6 to 10: assuming state is durable at 'COPYING', the recovery method will copy the contents of 'main' to back, effectively acting as a redo and duplicating the committed modifications;
- After line 10: recovery does nothing;

Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), then it is important to have an optimization where a non-durable log is kept with the addresses and ranges of the modified data:

1:  write(dst, src, size) {
2:    state = MUTATING;
3:    order_fence();
4:    memcpy(dst_main, src, size);   // copy new data to 'main'
5:    log[non_durable_entries].addr = dst_main;
6:    log[non_durable_entries].size = size;
7:    non_durable_entries++;
8:  }
9:  commit() {
10:   order_fence(); 
11:   state = COPYING;
12:   sync_fence();
13:   // 'offset' is the delta between 'back' and 'main'
14:   for (i : non_durable_entries) {

15:     memcpy(log[i].addr+offset, log[i].addr, log[i].size);
16:   }
17:   order_fence();
18:   state = IDLE;
19: }

When used for transactions, romulus always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction. No matter how many write()s are done in the transaction. Similarly to redo-log, this number can be brought down to 2 fences, but the algorithm becomes more complex.

Advantage 1) Reading and writing is done in place, directly from the data;
Advantage 2) Transactions are unbounded in size. The volatile log can grow dynamically and if it grows too much, we can just stop using it and copy the entire 'main' to 'back' at the end of the transaction. No other technique supports transactions of unbounded size;
Advantage 3) Write amplification is 1;
Advantage 4) Four fences per transaction;
Disadvantage 1) Only half the capacity of the media is available to the user. The redo-log and undo-log need to pre-reserve space for their logs but rarely do applications reserve half the capacity of the media to the logs, giving the advantage to undo/redo/cow.

Whether we're talking about filesystems, DBMS or some other kind of application that requires durability with resilience to failures, it must use one or a combination/variant of these four techniques. Each of them has different trade-offs in terms of performance, memory usage, ease-of-implementing and ease-of-use. In the end, it's an engineering choice which of these four algorithms is best suited for a particular application.