Concurrency Freaks: Durability

Showing posts with label Durability. Show all posts

Friday, March 12, 2021

The 4 laws of Durability

When it comes to having durable data, there are four ways to do it: undo log, redo log, shadow copy and shadow data.

Let's start with the preliminaries.

So what do we mean by "durable"?

Well, durable means that whatever data you're trying to save, has reached your storage device in a consistent way. It means that when you write to storage you want it to be "permanent", whether that storage device is a USB key, a CD ROM, an hard drive, and SSD, or a non-volatile memory DIMM like Intel's Optane DC PM.

For any of these storage devices, the algorithms are always the same: you have to use one of the four mentioned above.

Keep in mind that these are needed if the data needs to be consistent, i.e. you want to see the whole data before the storage or none of the data. I mean, if we were ok with having garbled data, then why would we bother saving it in permanent storage? The whole point of making data durable is because it has important information and therefore, it implies consistency.

Now that the basics are out of the way, what are exactly these four algorithms?

I'm going to focus on these in the context of transactions, but they don't have to be necessarily about that.

Undo log is technique where we write to durable storage a log entry before each write is done to storage. It allows multiple independent (non-atomic) writes to become durable in an all-or-nothing way, like a transaction, or a checkpoint.

In the context of persistent memory, libpmemobj in PMDK is an example of a transactional system that uses undo log.

In Redo log we write the log with multiple entries to storage before writing the actual data. The difference between redo and undo is that undo log does one entry in the log at a time followed by one modification, while the redo log does all entries in the log in one shot and then all the modifications in one shot.

Mnemosyne and OneFile are examples of transactional systems that utilize redo log.

Shadow copy, sometimes called Copy-On-Write (COW) creates a new replica of the data and writes the new data along with the unchanged contents to durable storage, before swapping some kind of pointer to indicate the this is the new object/data and the old one can be discarded. COW can't really be used by itself for transactions over multiple objects, but it can be combined with redo log to make it more efficient.

One example is SAP HANA which uses redo log with COW.

Shadow data can sometimes be confused with COW but it is not the same thing. In shadow data two (or more) replicas of the entire data are kept in durable storage and they both are updated with the modifications, one at a time. First one replica, then a logical pointer and then the second replica. On the next set of atomic writes the recently updated replica is the first to be updated.

Examples of shadow data transactional systems are Romulus, RedoDB and Trinity to some extent.

We though long and hard at the similarities and differences between these four algorithms for durable transactions, and we found they possess four common characteristics, regardless of the underlying storage media for which they are intended.

Each one of these characteristics reveals an important insight into the concept of durability and we believe these to be empirical rules to which all durable techniques abide. These rules are:

There must be a replica of the data;
There must be a durable state indicating which of the replicas is consistent;
All algorithms require at least one ordering constraint of the writes to durable storage;
A modification is durable only after a round-trip fence to the storage hardware;

The first key insight regarding durable transactions is that a consistent and durable replica of the data must exist at all times. This replica may be a full copy of the data, such as on shadow data, or it may be a logical replica, such as on undo log and redo log.

Intuitively, there has to be a consistent replica of the data, so that there is a way to recover data to its original consistent state in the event of a failure. Shadow data keeps a full replica of the data thus incurring a high permanent usage of the durable media (space amplification), while the undo log and redo log approaches have to write in durable storage, not just the new data but also, encoded information about the location and size of the modification (write amplification).

There's clearly an important trade-off here: log-based algorithms will increase (amortized) write amplification but shadow-data-based algorithms will increase space amplification.

The second empirical rule implies that the algorithm must ensure that, irrespective of when a failure occurs, there is a way for the recovery procedure to determine which of the replicas is consistent.

Shadow data like Romulus uses a two-bit variable to determine which of the two replicas is consistent, while redo log and undo log can use the size of the log (zero or non-zero) to indicate if the log is consistent.

By itself, there is no significant difference in any of the approaches however, the exact mechanics, will influence the number of ordering constraints in the algorithm.

This leads us to the third insight, that data consistency is possible only through ordering of some of the writes.

For shadow-copying, the modifications on the new block must be made durable before the pointer swap, otherwise a failure occurring after the pointer swap is made durable, would leave the pointer referencing an inconsistent block. This means that apart from block allocation and de-allocation details, shadow-copying has a single ordering constraint, or in other words, a single ordering fence.

Shadow data like Romulus uses a two-bit state (though one bit would suffice) to indicate which of the two replicas is the consistent one, or whether both are consistent. If the state variable indicating which replica is the consistent one becomes durable before or after the modifications on either replica and a crash occurs, upon recovery it may be referencing the inconsistent replica. For this algorithm, three ordering constraints exist: one to prevent the state from changing to COPYING before the modifications in main replica are done; another to prevent the modifications in back replica from being done before the state changes to COPYING; and another one to prevent the state change to IDLE before the changes on back replica are durable.

The undo log technique has two constraints per modified object/range: the log entry must contain the old value before the entry is added to the log; and the entry must be added to the log before the modification is done on the data. Undo log has one extra constraint per transaction, requiring the last modification to be durable before the log is reset.

The Redo log technique has three constraints per transaction: all the log entries must be durable before the log size is set; the log size must be set before the modifications are done on the data; the modifications on the data must be durable before the log is reset.

The fourth and final rule addresses the need for a round-trip synchronization mechanism to the storage domain, such that the hardware can guarantee that it contains, in stable durable storage, all the previously written data. The cost of such a fence is typically of the order of the storage device's latency.

Fast devices like PM implement this round-trip fence orders of magnitude faster than slower devices, like hard drives.

Without such a mechanism, it is not possible to have durable operations, even if ordering constraints are set: in the event of a failure, the ordering constraints impose a temporal sequence of which the writes will be made durable, but there is no guarantee on durability.

A corollary of this is that all algorithms require one and only one such fence, strategically placed.

Notice that the ordering constraints may be replaced by such synchronous fences, at the detriment of performance, and in fact, many storage systems make no distinction between the two. Ordering is typically achieved with an asynchronrous fence and it relates to the order to which certain writes will be made durable in the storage media.

On block based storage, this is typically implemented with fsync() or fdatasync().

In Persistent Memory (PM) ordering can be achieved through the combination of flushes (clwb) and fences (sfence) or by writing to the same cache line. The round-trip guarantee of durability is given by a synchronous fence, either fsync()/fdatasync() on block storage, or sfence on PM storage.

In case you haven't noticed, the fact that all algorithms require one round-trip fence to the device (psync), but may require multiple ordering fences (pfence) has implications in performance. This is specially true given that the psync has inescapable physical implications: it is not possible to have all-or-nothing consistent durability without a psync that physically does a round trip to the storage device (or at least the storage domain) and therefore the latency cost of this single round trip is inescapable.

However, different algorithms may have different ordering constraints (pfences) and these may have different costs.

Yes, fsync() is used for both sync and ordering on block devices, and the sfence instruction is also used for both in PM, however, there are tricks. In PM, writes to the same cache line are guaranteed to be ordered and therefore, no sfence is needed to order them, as long as store with memory_order_release is used.

Seen as these round trips are typically the bottleneck when doing random writes to PM, the fact that we can create an algorithm with a lower number of psyncs means we can have a performance gain that is nearly proportional to the reduction in the number of such fences.

This is exactly what we've done with Trinity.

Trinity is a novel durability technique that needs just two fences per transaction and reduces the number of flushes when doing random writes. It consumes more memory than the other previous techniques but it has significant higher performance.

Moreover, we combined it with our own variant of TL2 for highly scalable durable linearizable transactions, and we used that to make a K/V store, which is likely that fastest K/V store on the planet with full transactions (though you need Optane Persistent Memory to be able to run it).

If you want to see the video, it's here:

https://youtu.be/vdInrf_kk1w

If you want the source code, it's here:

https://github.com/pramalhe/durabletx/blob/master/ptms/trinity/TrinityVRTL2.hpp

Monday, October 14, 2019

What is a concurrent algorithm without order?

I've heard Leslie Lamport mention on multiple occasions: "An algorithm is not a program!"
and I fully agree with him and would like to add that "A concurrent or durable algorithm without ordering is not an algorithm!"

What Leslie says makes sense because code is just a particular implementation or expression of an algorithm in a specific language. Code is bounded by the constructs that form the language in which that code is written.
An algorithm has no such constraints and is the mathematical expression of an idea.

An algorithm is precise, yet, universal.
A program works only for a particular language.

When an algorithm is incorrect, a program that uses such an algorithm will be incorrect as well.
When an algorithm is correct, a program that uses such an algorithm may be incorrect because it implemented the algorithm incorrectly, but the correctness of the underlying algorithm in unaffected.

An algorithm has mathematical beauty.
A well made program can have craftsmanship and be appreciated as beautiful, but rarely (if ever) on the mathematical sense.

So what do I mean by "An algorithm without ordering is not an algorithm"?
In the context of concurrent and/or durable algorithms order is vital for correctness. In concurrency papers, most people assume sequential consistency of the (atomic) steps that the algorithm performs and indeed, the default memory model in C11 and C++11 is memory_order_seq_cst exactly for that reason. In practice, if you were to implement a concurrent algorithm in such a way, it would likely be slow because each store (and load on non-TSO CPUs) would require a fence to guarantee ordering.
Researchers writing papers assume this strong ordering because they focus on "the algorithm" as being a series of steps and leave the dependencies of those steps (ordering) as an implementation detail.
If performance (throughput/scalability/latency) was not an issue, then we would still be using single-core CPUs and single-threaded applications and would not have to deal with all the complexity that concurrent algorithms entail! The amount of fences are (not always, but) usually what dictates the throughput of a concurrent algorithm. This means that the placement and number of fences (ordering constraints) are vital to the algorithm, not just for correctness but also for performance reasons.

The same logic applies to durable algorithms, where the ordering constraints are typically the main performance indicator. A durable algorithm without ordering does not guarantee durability and therefore, it becomes useless. However, not all steps impose a strong ordering among each other. It becomes important when we describe the algorithm to explicitly mention this dependency of steps.

Most of us don't realize the importance of "order" because for the majority of algorithms this order is implicit. For example, consider a sequential algorithm for inserting a new node in the middle of a linked-list:
Step 1) Create a new node;
Step 2) Make the "next" pointer in the node point to the successor node in the list;
Step 3) Make The "next" pointer in the predecessor node of the list point to the new node;
This is trivially simple, but if you try use this algorithm in a system where ordering is not guaranteed, these steps may become re-ordered (from the point of view of another process/thread) and the algorithm becomes incorrect. And this is one of the main issues of concurrent algorithms and even distributed systems.

Going back to Leslie, he was the first to show that to have mutual exclusion on a concurrent system we need to have at least a store-load fence. In other words, there is a minimum amount of ordering that is needed to make an algorithm that is mutually exclusive.
A similar result exists for durable algorithms: atomic durability requires at least two constraints, one for ordering and one for synchronization (round-trip), see this post for more info.
The similarity is such that Paxos, an algorithm discovered by Leslie Lamport and used in distributed systems, can be used not just to provide failure-resilience in concurrent/distributed systems but also to provide failure-resilience for durability. And the reason behind it is somewhat related to ordering.

Unfortunately, there is no easy way or commonly accepted way of expressing ordering constraints in an algorithm and because of that, most people think that orderings constraints are "an implementation detail". Likely this happens because our human brains are so used to thinking in sequential order that we expect the steps (code) that we write in a certain order to be executed in that exact order, because that's how the physical world around us typically behaves.
The problem is, concurrent algorithms don't follow these rules and because of that, ordering must be part of a concurrent algorithm.

IMO, a concurrent algorithm without a specification of the order is an incomplete algorithm. Same thing for durable algorithms. This also implies that an optimal algorithm is one that has the least amount of ordering, or in other words, a good concurrent/durable algorithm is an algorithm with a small number of fences.

Sunday, October 6, 2019

Recovery on a hand-made durable data structure versus recovery on a Persistent Transactional Memory

A durable data structure is a data structure in Persistent Memory (PM) or Non-Volatile Memory, if you prefer.
These data structures have to be resilient to failures. In other words, if there is a power cut during an operation on the data structure, the data structure must be correctly recovered after the system restarts.
For the rest of this post, and likely for all other posts you read in this site, when I say "durable data structure" I mean a data structure that has "atomic durability": In the event of a failure, the side effects of all completed operations will be visible upon restart and recovery.

The are two main approaches to having a durable data structure in PM:
1) Write an algorithm for a new data structure. This algorithm has to be failure-resilient;
2) Write a regular data structure and wrap it in a PTM;

Making a new data structure is typically the subject of an entire research paper, while using a PTM is not. Researchers tend to lean on doing new data structures because it's more fun (we can't get a paper accepted where we describe "using" a data structure). Which approach is better depends on what you mean by "better", but my favorite is 2).

However, set's say you decide to use a hand-made durable data structure that someone else made. Now how do you use it in your program?

Most applications have many data structure instances. For an application that needs durability (or persistence, if you prefer) not all data needs to be persistent, which means that not all data structures need to be persistent either. However, let's assume it is more than 1 data structure instance.
When a crash occurs, it could have occurred during an operation on *any* of these data structures. This means we need to call the recovery() method for each of these data structure instances, just in case.

This by itself creates a nuisance, because now you need to have an explicit list of all the durable data structures in your application and iterate through all in your recovery code:
void recovery() { // called after a restart
ds1.recovery();
ds2.recovery();
ds3.recovery();
...
}
If you forget to add one of the data structures to the main recovery method, you risk doing operations on an inconsistent data structure and further corrupting your data. In PM, when you corrupt the data it's gone forever... unless you had backups, and even then you're going to lose all data from the last non-corrupt backup.

Another difficulty is that sometimes the number of durable data structures may be dynamic because the application creates and destroys data structures during execution, to store some data. This means that it's not possible to add each data structure instance in the recovery() method because they may not exist when the program restarts. This typically means you will need a registration mechanism where newly created data structure instances are added and later de-registered when the data structure is destroyed. Also, you need to save the root pointer to each of those data structure instances.
PTMs handle this transparently because the transaction has the all-or-nothing semantics. If new data structure instances are created or destroyed, you can use the root pointer API (in OneFile and Romulus we called it get_object() and put_object()). When there is a crash, any modifications to ongoing transactions will be reverted, including allocating or de-allocating a new data structure instance and including adding or removing a root pointer. By having a PTM whose transactions can encompass the modifications on the data structure and the allocation/de-allocation of the data structure and the adding/removing of the instance as a root pointer, we give transactions that are easy to reason about and make your code consistent at all times.

Ok, ok, so it's not that bad for the hand-made durable data structures. We can do automatic checks for these scenarios by adding registration mechanisms in the constructor of the data structure, or even some kind of static check or compile-time assertion to make sure everything is as it should be.
But the thing is, if you use a PTM, you only have to call the recovery method of the PTM, one time, regardless of the number of data structure instances you have in your application!
In the case of OneFile, it has null recovery which means there is no recovery method: the PTM will simply continue execution where it left of.

This is yet another thing where PTMs are better than an hand-made durable data structures.

Tuesday, October 1, 2019

Atomic Durability - How do databases recover from a crash ?

In this post we're going to talk about the four different ways of having durable transactions. If you want to know how databases and file systems guarantee correct data recovery after a power failure, then keep reading!

Writing to a durable/persistent media can be a complex task. If there is a failure half-way through, you may end up with corrupted data.
By durable media I mean a disk drive, an SD card, a USB stick, a Persistent Memory module (NV-DIMM), a network attached NAS, and many other like these.

Suppose you're changing a customer address in a database. The current address is "Bag End, Shire" and we want to change it to "Rivendell".
What happens of there is a crash half-way through the write?
Suppose the first four bytes were written before the crash occurred. Upon restart, the address in the database is now "Rivend, Shire"... clearly an incorrect address. What should we do?

This is a typical problem encountered by the designers of DBMS (Database Management Systems) and file systems.
Generally speaking, there are four solutions to this problem and they are:
1) Undo-Log;
2) Redo-Log;
3) Copy-On-Write (sometimes called Shadow Copy or Shadow Paging);
4) Romulus;

Before we explain how each of these works, we need to introduce two things which we're going to call "ordering fence" and "synchronization fence".
An "ordering fence" is something that guarantees that previous writes will reach the durable media before subsequent writes. For example:

write(a);
order_fence();
write(b);
implies that "a" will always be written to the durable media before "b" is written.

A "synchronization fence" is something that guarantees that previously written data is now durable. And by durable I mean it's persistent in the media and in the event of a crash, this data will not be lost. For example:

write(a);
sync_fence();
computer_beeps();
if you heard a beep and immediately after there is a power cut, you know for sure that the media contains the data "a" because the beep occurred after the sync_fence() which was after the write(a) was persisted.

The actual semantics depend on what hardware we're talking about, but the idea is pretty much the same: you need order_fence() and you need sync_fence().
When writing to disk, both the order_fence() and sync_fence() have to be implemented as fsync() or fdatasync(). When writing to PM they're made up of a combination of CLWB and SFENCE instructions.

Now back to the actual algorithms

1) Undo-Log

This technique is at least 30 years old and dates back to AIRES.
As the name indicates, it's an undo-log technique. Before writing to the durable media, we copy the original contents to a log, then we indicate that the log is consistent, then write the new values on the actual data, and finally reset the log:

1: write(dst, src, size) {
2:    memcpy(src, log, size);   // copy original data to log
3:    order_fence();
4:    recover_from_log = true;
5:    order_fence();
6:    memcpy(dst, src, size);   // write new data
7:    order_fence();
8:    recover_from_log = false;
9:    sync_fence();
10: }
Let's examine what happens if there is a crash in:
- Lines 1 to 4: no data has been over written, nothing happens;
- Lines 4 to 6: 'recover_from_log' is seen as 'true' and the recovery will overwrite the contents of 'src' with the contents of 'log'. They are the same, therefore, no change;
- Lines 6 to 8: the recovery method will overwrite the new data with the contents of the log, restoring the old values, as it should be;
- After line 8: assuming 'recover_from_log' is durable, nothing will be reverted, the write() has committed successfully;

Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing (transaction) semantics, the algorithm changes slightly:

1: write(dst, src, size) {
2:    memcpy(src, &log[entries].data, size);   // copy original data to log entry
3:    log[entries].size = size;                // save the size of this write()
4:    log[entries].addr = dst;                 // save the destination address
3:    order_fence();
4:    entries++;
5:    order_fence();
6:    memcpy(dst, src, size);                  // write new data
7: }
8: commit() {
9:    sync_fence();
10:   entries = 0;
11:   order_fence();
12: }

When used in this way, the number of fences is 2 per call to write() plus 2 at commit time. The higher the number of write()s in the transaction, the more fences will be executed.

Pros/Cons:
Disadvantage 1) Two fences per write(). Can be very slow;
Advantage 1) Reading and writing is done in place;

In the context of Persistent Memory, undo-log is what libpmemobj does. This is the library in PMDK that provides durable transactions.
In the context of databases, AIRES was the first to use undo-log.

2) Redo-Log

In the redo-log technique the approach is almost opposite from the undo-log: we write the new data on the log, make the log consistent and after, copy the new contents from the log to the data:

1: write(dst, src, size) {
2:    memcpy(log, src, size);   // copy new data to log
3:    order_fence();
4:    recover_from_log = true;
5:    order_fence();
6:    memcpy(dst, log, size);   // copy new data from log
7:    sync_fence();
8:    recover_from_log = false;
9: }

If there is a crash after the log has been made consistent, the modifications in the log will be re-applied.
Let's examine what happens if there is a crash in:
- Lines 1 to 4: recovery does nothing;
- Lines 4 to 8: recovery method will re-apply the contents of the log to the data (redo) overwriting any old values that may have been left incomplete (in case the crash occurred during line 6);
- After line 8: recovery does nothing;

Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), the algorithm changes slightly:

1: write(dst, src, size) {
2:    memcpy(&log[entries].data, src, size);   // copy new data to log entry
3:    log[entries].size = size;                // save the size of this write()
4:    log[entries].addr = dst;                 // save the destination address
5:    non_durable_entries++;
6: }
7: commit() {
8:    order_fence();
9:    entries = non_durable_entries;           // make the log consistent
10:   sync_fence();
11:   for (i : entries) memcpy(log[i].addr, &log[i].data, log[i].size);
12:   order_fence();
13:   entries = 0;
14:   order_fence();
15: }

When used for transactions, a redo-log always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction. No matter how many write()s are done in the transaction, it's always 4 fences. This number can be brought down to 2 fences, but the algorithm becomes more complex.

Pros/Cons:
Advantage 1) Only four fences per transaction;
Disadvantage 1) Reading data requires a lookup on the log, so as to cover the read-after-write scenarios. Without it, we would be reading stale data and break invariants.

In the context of Persistent Memory, redo-log is what Mnemosyne and Onefile do. Mnemosyne was the first blocking PTM and OneFile was the first wait-free PTM.
In the context of databases, Microsoft Hekaton, MySQL InnoDB engine, PostgreSQL and OracleDB all use a redo-log technique to provide transactions, while IBM DB2 uses a combination of undo and redo logs.

3) Copy-On-Write

Copy-On-Write (COW) is sometimes named "shadow copying" or "shadow paging".
This technique doesn't allow for transactional semantics. For that, it needs to be coupled with either undo-log or redo-log.
The COW technique is always applied to a block or an object. First, a new block/object is reserved, which is a replica of the block/object that is to be modified, then the modifications are executed on the new block/object and finally, a single pointer swap makes the new block/object accessible (durable):

1: modify(obj_ptr, mutation()) {
2:    new_obj_ptr = malloc(sizeof(*obj_ptr));          // allocate a new object (2 fences or more)
3:    memcpy(new_obj_ptr, obj_ptr, sizeof(*obj_ptr)); // copy the contents of the old to the new object
4:    new_obj_ptr->mutation();                         // apply the modification on the new object
5:    order_fence();
6:    old_obj_ptr = obj_ptr;
7:    obj_ptr = new_obj_ptr;                           // pointer swap (atomic)
8:    sync_fence();
9:    free(obj_ptr);                                   // de-allocate old object (2 fences or more)
10: }

Pros/Cons:
Disadvantage 1) Every modification requires the allocation of a new block/object and the de-allocation of the old one. This can lead to high "fragmentation" on the durable media and it has a performance penalty because the act of allocating a new block/object requires two extra fences to change allocator metadata. This means that COW executes 4 fences per modified object (1 order + 1 sync for the allocator plus 1 order + 1 sync for the modification and pointer change);
Disadvantage 2) The entire block/object has to be copied, even if a single byte has changed. Depending on the usage pattern, this can create high write amplification which in turn affects performance and possibly media endurance (think flash and PM);
Disadvantage 3) Needs and undo-log or redo-log to provide transactional semantics, which can mean additional fences;
Disadvantage 4) Write amplification may be close to zero (just the pointer swap and allocator metadata changes), or it may be close to infinity (a single byte change on a very large object requires a full copy of that object, and allocator metadata changes and the pointer swap);

To be properly durable, COW techniques will always execute *at least* two fences per modified block/object (1 order + 1 sync). Extra fences may be needed for the allocation and de-allocation of the block/object (2 or more fences).

In the context of Persistent Memory, there is no PTM that uses COW but there are hand-made data structures that use this technique extensively, examples being the ones described in the MOD paper (although their algorithm requires two fences per operation to be correct, plus whatever the allocator does).
When it comes to databases, SAP HANA uses a combination of COW and redo-log to provide durable transactions.

4) Romulus

Since last year (2018), there is now a fourth way to achieve durability and it's called Romulus.
In Romulus, two twin replicas of the data are maintained at all times, with the modifications first executed in the first replica 'main' and after, applied to the other replica 'back':

1: write(dst, src, size) {
2:    state = MUTATING;
3:    order_fence();
4:    memcpy(dst_main, src, size);   // copy new data to 'main'
5:    order_fence();
6:    state = COPYING;
7:    sync_fence();
8:    memcpy(dst_back, src, size);   // copy new data to 'back'
9:    order_fence();
10:   state = IDLE;
11: }

Let's examine what happens if there is a crash in:
- Lines 2 to 4: the recovery method see state at 'MUTATING' and copies the entire contents of 'back' to 'main' (no change);
- Lines 4 to 6: the recovery method copies 'back' to 'main', undoing the possibly incomplete modifications in 'main';
- Lines 6 to 10: assuming state is durable at 'COPYING', the recovery method will copy the contents of 'main' to back, effectively acting as a redo and duplicating the committed modifications;
- After line 10: recovery does nothing;

Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), then it is important to have an optimization where a non-durable log is kept with the addresses and ranges of the modified data:

1: write(dst, src, size) {
2:    state = MUTATING;
3:    order_fence();
4:    memcpy(dst_main, src, size);   // copy new data to 'main'
5:    log[non_durable_entries].addr = dst_main;
6:    log[non_durable_entries].size = size;
7:    non_durable_entries++;
8: }
9: commit() {
10:   order_fence();
11:   state = COPYING;
12:   sync_fence();
13:   // 'offset' is the delta between 'back' and 'main'
14:   for (i : non_durable_entries) {
15:     memcpy(log[i].addr+offset, log[i].addr, log[i].size);
16: }
17:   order_fence();
18:   state = IDLE;
19: }

When used for transactions, romulus always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction. No matter how many write()s are done in the transaction. Similarly to redo-log, this number can be brought down to 2 fences, but the algorithm becomes more complex.

Pros/Cons:
Advantage 1) Reading and writing is done in place, directly from the data;
Advantage 2) Transactions are unbounded in size. The volatile log can grow dynamically and if it grows too much, we can just stop using it and copy the entire 'main' to 'back' at the end of the transaction. No other technique supports transactions of unbounded size;
Advantage 3) Write amplification is 1;
Advantage 4) Four fences per transaction;
Disadvantage 1) Only half the capacity of the media is available to the user. The redo-log and undo-log need to pre-reserve space for their logs but rarely do applications reserve half the capacity of the media to the logs, giving the advantage to undo/redo/cow.

Whether we're talking about filesystems, DBMS or some other kind of application that requires durability with resilience to failures, it must use one or a combination/variant of these four techniques. Each of them has different trade-offs in terms of performance, memory usage, ease-of-implementing and ease-of-use. In the end, it's an engineering choice which of these four algorithms is best suited for a particular application.