Friday, March 12, 2021

The 4 laws of Durability

When it comes to having durable data, there are four ways to do it: undo log, redo log, shadow copy and shadow data.


Let's start with the preliminaries.

So what do we mean by "durable"?

Well, durable means that whatever data you're trying to save, has reached your storage device in a consistent way. It means that when you write to storage you want it to be "permanent", whether that storage device is a USB key, a CD ROM, an hard drive, and SSD, or a non-volatile memory DIMM like Intel's Optane DC PM.

For any of these storage devices, the algorithms are always the same: you have to use one of the four mentioned above.

Keep in mind that these are needed if the data needs to be consistent, i.e. you want to see the whole data before the storage or none of the data. I mean, if we were ok with having garbled data, then why would we bother saving it in permanent storage? The whole point of making data durable is because it has important information and therefore, it implies consistency.


Now that the basics are out of the way, what are exactly these four algorithms?

I'm going to focus on these in the context of transactions, but they don't have to be necessarily about that.


Undo log is technique where we write to durable storage a log entry before each write is done to storage. It allows multiple independent (non-atomic) writes to become durable in an all-or-nothing way, like a transaction, or a checkpoint.

In the context of persistent memory, libpmemobj in PMDK is an example of a transactional system that uses undo log.


In Redo log we write the log with multiple entries to storage before writing the actual data. The difference between redo and undo is that undo log does one entry in the log at a time followed by one modification, while the redo log does all entries in the log in one shot and then all the modifications in one shot.

Mnemosyne and OneFile are examples of transactional systems that utilize redo log.


Shadow copy, sometimes called Copy-On-Write (COW) creates a new replica of the data and writes the new data along with the unchanged contents to durable storage, before swapping some kind of pointer to indicate the this is the new object/data and the old one can be discarded. COW can't really be used by itself for transactions over multiple objects, but it can be combined with redo log to make it more efficient.

One example is SAP HANA which uses redo log with COW.


Shadow data can sometimes be confused with COW but it is not the same thing. In shadow data two (or more) replicas of the entire data are kept in durable storage and they both are updated with the modifications, one at a time. First one replica, then a logical pointer and then the second replica. On the next set of atomic writes the recently updated replica is the first to be updated.

Examples of shadow data transactional systems are Romulus, RedoDB and Trinity to some extent.



We though long and hard at the similarities and differences between these four algorithms for durable transactions, and we found they possess four common characteristics, regardless of the underlying storage media for which they are intended.

Each one of these characteristics reveals an important insight into the concept of durability and we believe these to be empirical rules to which all durable techniques abide. These rules are:

  1.  There must be a replica of the data;
  2.  There must be a durable state indicating which of the replicas is consistent;
  3. All algorithms require at least one ordering constraint of the writes to durable storage; 
  4. A modification is durable only after a round-trip fence to the storage hardware;



The first key insight regarding durable transactions is that a consistent and durable replica of the data must exist at all times. This replica may be a full copy of the data, such as on shadow data, or it may be a logical replica, such as on undo log and redo log.

Intuitively, there has to be a consistent replica of the data, so that there is a way to recover data to its original consistent state in the event of a failure. Shadow data keeps a full replica of the data thus incurring a high permanent usage of the durable media (space amplification), while the undo log and redo log approaches have to write in durable storage, not just the new data but also, encoded information about the location and size of the modification (write amplification).

There's clearly an important trade-off here: log-based algorithms will increase (amortized) write amplification but shadow-data-based algorithms will increase space amplification.


The second empirical rule implies that the algorithm must ensure that, irrespective of when a failure occurs, there is a way for the recovery procedure to determine which of the replicas is consistent.

Shadow data like Romulus uses a two-bit variable to determine which of the two replicas is consistent, while redo log and undo log can use the size of the log (zero or non-zero) to indicate if the log is consistent.

By itself, there is no significant difference in any of the approaches however, the exact mechanics, will influence the number of ordering constraints in the algorithm.


This leads us to the third insight, that data consistency is possible only through ordering of some of the writes.

For shadow-copying, the modifications on the new block must be made durable before the pointer swap, otherwise a failure occurring after the pointer swap is made durable, would leave the pointer referencing an inconsistent block. This means that apart from block allocation and de-allocation details, shadow-copying has a single ordering constraint, or in other words, a single ordering fence.

Shadow data like Romulus uses a two-bit state (though one bit would suffice) to indicate which of the two replicas is the consistent one, or whether both are consistent. If the state variable indicating which replica is the consistent one becomes durable before or after the modifications on either replica and a crash occurs, upon recovery it may be referencing the inconsistent replica. For this algorithm, three ordering constraints exist: one to prevent the state from changing to COPYING before the modifications in main replica are done; another to prevent the modifications in back replica from being done before the state changes to COPYING; and another one to prevent the state change to IDLE before the changes on back replica are durable.

The undo log technique has two constraints per modified object/range: the log entry must contain the old value before the entry is added to the log; and the entry must be added to the log before the modification is done on the data. Undo log has one extra constraint per transaction, requiring the last modification to be durable before the log is reset.

The Redo log technique has three constraints per transaction: all the log entries must be durable before the log size is set; the log size must be set before the modifications are done on the data; the modifications on the data must be durable before the log is reset.


The fourth and final rule addresses the need for a round-trip synchronization mechanism to the storage domain, such that the hardware can guarantee that it contains, in stable durable storage, all the previously written data. The cost of such a fence is typically of the order of the storage device's latency.

Fast devices like PM implement this round-trip fence orders of magnitude faster than slower devices, like hard drives.

Without such a mechanism, it is not possible to have durable operations, even if ordering constraints are set: in the event of a failure, the ordering constraints impose a temporal sequence of which the writes will be made durable, but there is no guarantee on durability.

A corollary of this is that all algorithms require one and only one such fence, strategically placed.

Notice that the ordering constraints may be replaced by such synchronous fences, at the detriment of performance, and in fact, many storage systems make no distinction between the two. Ordering is typically achieved with an asynchronrous fence and it relates to the order to which certain writes will be made durable in the storage media.

On block based storage, this is typically implemented with fsync() or fdatasync().

In Persistent Memory (PM) ordering can be achieved through the combination of flushes (clwb) and fences (sfence) or by writing to the same cache line. The round-trip guarantee of durability is given by a synchronous fence, either fsync()/fdatasync() on block storage, or sfence on PM storage.


In case you haven't noticed, the fact that all algorithms require one round-trip fence to the device (psync), but may require multiple ordering fences (pfence) has implications in performance. This is specially true given that the psync has inescapable physical implications: it is not possible to have all-or-nothing consistent durability without a psync that physically does a round trip to the storage device (or at least the storage domain) and therefore the latency cost of this single round trip is inescapable.

However, different algorithms may have different ordering constraints (pfences) and these may have different costs.


Yes, fsync() is used for both sync and ordering on block devices, and the sfence instruction is also used for both in PM, however, there are tricks. In PM, writes to the same cache line are guaranteed to be ordered and therefore, no sfence is needed to order them, as long as store with memory_order_release is used.

Seen as these round trips are typically the bottleneck when doing random writes to PM, the fact that we can create an algorithm with a lower number of psyncs means we can have a performance gain that is nearly proportional to the reduction in the number of such fences.


This is exactly what we've done with Trinity.

Trinity is a novel durability technique that needs just two fences per transaction and reduces the number of flushes when doing random writes. It consumes more memory than the other previous techniques but it has significant higher performance.

Moreover, we combined it with our own variant of TL2 for highly scalable durable linearizable transactions, and we used that to make a K/V store, which is likely that fastest K/V store on the planet with full transactions (though you need Optane Persistent Memory to be able to run it).


If you want to see the video, it's here:


If you want the source code, it's here:

No comments:

Post a Comment