Monday, October 14, 2019

What is a concurrent algorithm without order?

I've heard Leslie Lamport mention on multiple occasions: "An algorithm is not a program!"
and I fully agree with him and would like to add that "A concurrent or durable algorithm without ordering is not an algorithm!"

What Leslie says makes sense because code is just a particular implementation or expression of an algorithm in a specific language. Code is bounded by the constructs that form the language in which that code is written.
An algorithm has no such constraints and is the mathematical expression of an idea.

An algorithm is precise, yet, universal.
A program works only for a particular language.

When an algorithm is incorrect, a program that uses such an algorithm will be incorrect as well.
When an algorithm is correct, a program that uses such an algorithm may be incorrect because it implemented the algorithm incorrectly, but the correctness of the underlying algorithm in unaffected.

An algorithm has mathematical beauty.
A well made program can have craftsmanship and be appreciated as beautiful, but rarely (if ever) on the mathematical sense.


So what do I mean by "An algorithm without ordering is not an algorithm"?
In the context of concurrent and/or durable algorithms order is vital for correctness. In concurrency papers, most people assume sequential consistency of the (atomic) steps that the algorithm performs and indeed, the default memory model in C11 and C++11 is memory_order_seq_cst exactly for that reason. In practice, if you were to implement a concurrent algorithm in such a way, it would likely be slow because each store (and load on non-TSO CPUs) would require a fence to guarantee ordering.
Researchers writing papers assume this strong ordering because they focus on "the algorithm" as being a series of steps and leave the dependencies of those steps (ordering) as an implementation detail.
If performance (throughput/scalability/latency) was not an issue, then we would still be using single-core CPUs and single-threaded applications and would not have to deal with all the complexity that concurrent algorithms entail! The amount of fences are (not always, but) usually what dictates the throughput of a concurrent algorithm. This means that the placement and number of fences (ordering constraints) are vital to the algorithm, not just for correctness but also for performance reasons.

The same logic applies to durable algorithms, where the ordering constraints are typically the main performance indicator. A durable algorithm without ordering does not guarantee durability and therefore, it becomes useless. However, not all steps impose a strong ordering among each other. It becomes important when we describe the algorithm to explicitly mention this dependency of steps.

Most of us don't realize the importance of "order" because for the majority of algorithms this order is implicit. For example, consider a sequential algorithm for inserting a new node in the middle of a linked-list:
Step 1) Create a new node;
Step 2) Make the "next" pointer in the node point to the successor node in the list;
Step 3) Make The "next" pointer in the predecessor node of the list point to the new node;

This is trivially simple, but if you try use this algorithm in a system where ordering is not guaranteed, these steps may become re-ordered (from the point of view of another process/thread) and the algorithm becomes incorrect. And this is one of the main issues of concurrent algorithms and even distributed systems.

Going back to Leslie, he was the first to show that to have mutual exclusion on a concurrent system we need to have at least a store-load fence. In other words, there is a minimum amount of ordering that is needed to make an algorithm that is mutually exclusive.
A similar result exists for durable algorithms: atomic durability requires at least two constraints, one for ordering and one for synchronization (round-trip), see this post for more info.
The similarity is such that Paxos, an algorithm discovered by Leslie Lamport and used in distributed systems, can be used not just to provide failure-resilience in concurrent/distributed systems but also to provide failure-resilience for durability. And the reason behind it is somewhat related to ordering.

Unfortunately, there is no easy way or commonly accepted way of expressing ordering constraints in an algorithm and because of that, most people think that orderings constraints are "an implementation detail". Likely this happens because our human brains are so used to thinking in sequential order that we expect the steps (code) that we write in a certain order to be executed in that exact order, because that's how the physical world around us typically behaves.
The problem is, concurrent algorithms don't follow these rules and because of that, ordering must be part of a concurrent algorithm.

IMO, a concurrent algorithm without a specification of the order is an incomplete algorithm. Same thing for durable algorithms. This also implies that an optimal algorithm is one that has the least amount of ordering, or in other words, a good concurrent/durable algorithm is an algorithm with a small number of fences.

Sunday, October 6, 2019

Recovery on a hand-made durable data structure versus recovery on a Persistent Transactional Memory

A durable data structure is a data structure in Persistent Memory (PM) or Non-Volatile Memory, if you prefer.
These data structures have to be resilient to failures. In other words, if there is a power cut during an operation on the data structure, the data structure must be correctly recovered after the system restarts.
For the rest of this post, and likely for all other posts you read in this site, when I say "durable data structure" I mean a data structure that has "atomic durability": In the event of a failure, the side effects of all completed operations will be visible upon restart and recovery.

The are two main approaches to having a durable data structure in PM:
1) Write an algorithm for a new data structure. This algorithm has to be failure-resilient;
2) Write a regular data structure and wrap it in a PTM;

Making a new data structure is typically the subject of an entire research paper, while using a PTM is not. Researchers tend to lean on doing new data structures because it's more fun (we can't get a paper accepted where we describe "using" a data structure). Which approach is better depends on what you mean by "better", but my favorite is 2).

However, set's say you decide to use a hand-made durable data structure that someone else made. Now how do you use it in your program?

Most applications have many data structure instances. For an application that needs durability (or persistence, if you prefer) not all data needs to be persistent, which means that not all data structures need to be persistent either. However, let's assume it is more than 1 data structure instance.
When a crash occurs, it could have occurred during an operation on *any* of these data structures. This means we need to call the recovery() method for each of these data structure instances, just in case.

This by itself creates a nuisance, because now you need to have an explicit list of all the durable data structures in your application and iterate through all in your recovery code:
void recovery() {   // called after a restart
  ds1.recovery();
  ds2.recovery();
  ds3.recovery();
  ...
}

If you forget to add one of the data structures to the main recovery method, you risk doing operations on an inconsistent data structure and further corrupting your data. In PM, when you corrupt the data it's gone forever... unless you had backups, and even then you're going to lose all data from the last non-corrupt backup.

Another difficulty is that sometimes the number of durable data structures may be dynamic because the application creates and destroys data structures during execution, to store some data. This means that it's not possible to add each data structure instance in the recovery() method because they may not exist when the program restarts. This typically means you will need a registration mechanism where newly created data structure instances are added and later de-registered when the data structure is destroyed. Also, you need to save the root pointer to each of those data structure instances.
PTMs handle this transparently because the transaction has the all-or-nothing semantics. If new data structure instances are created or destroyed, you can use the root pointer API (in OneFile and Romulus we called it get_object() and put_object()). When there is a crash, any modifications to ongoing transactions will be reverted, including allocating or de-allocating a new data structure instance and including adding or removing a root pointer. By having a PTM whose transactions can encompass the modifications on the data structure and the allocation/de-allocation of the data structure and the adding/removing of the instance as a root pointer, we give transactions that are easy to reason about and make your code consistent at all times.

Ok, ok, so it's not that bad for the hand-made durable data structures. We can do automatic checks for these scenarios by adding registration mechanisms in the constructor of the data structure, or even some kind of static check or compile-time assertion to make sure everything is as it should be.
But the thing is, if you use a PTM, you only have to call the recovery method of the PTM, one time, regardless of the number of data structure instances you have in your application!
In the case of OneFile, it has null recovery which means there is no recovery method: the PTM will simply continue execution where it left of.


This is yet another thing where PTMs are better than an hand-made durable data structures.

Tuesday, October 1, 2019

Atomic Durability - How do databases recover from a crash ?

In this post we're going to talk about the four different ways of having durable transactions. If you want to know how databases and file systems guarantee correct data recovery after a power failure, then keep reading!

Writing to a durable/persistent media can be a complex task. If there is a failure half-way through, you may end up with corrupted data.
By durable media I mean a disk drive, an SD card, a USB stick, a Persistent Memory module (NV-DIMM), a network attached NAS, and many other like these.

Suppose you're changing a customer address in a database. The current address is "Bag End, Shire" and we want to change it to "Rivendell".
What happens of there is a crash half-way through the write?
Suppose the first four bytes were written before the crash occurred. Upon restart, the address in the database is now "Rivend, Shire"... clearly an incorrect address. What should we do?

This is a typical problem encountered by the designers of DBMS (Database Management Systems) and file systems.
Generally speaking, there are four solutions to this problem and they are:
1) Undo-Log;
2) Redo-Log;
3) Copy-On-Write (sometimes called Shadow Copy or Shadow Paging);
4) Romulus;


Before we explain how each of these works, we need to introduce two things which we're going to call "ordering fence" and "synchronization fence".
An "ordering fence" is something that guarantees that previous writes will reach the durable media before subsequent writes. For example:

write(a);
order_fence();
write(b);

implies that "a" will always be written to the durable media before "b" is written.

A "synchronization fence" is something that guarantees that previously written data is now durable. And by durable I mean it's persistent in the media and in the event of a crash, this data will not be lost. For example:

write(a);
sync_fence();
computer_beeps();

if you heard a beep and immediately after there is a power cut, you know for sure that the media contains the data "a" because the beep occurred after the sync_fence() which was after the write(a) was persisted.


The actual semantics depend on what hardware we're talking about, but the idea is pretty much the same: you need order_fence() and you need sync_fence().
When writing to disk, both the order_fence() and sync_fence() have to be implemented as fsync() or fdatasync(). When writing to PM they're made up of a combination of CLWB and SFENCE instructions.

Now back to the actual algorithms


1) Undo-Log

This technique is at least 30 years old and dates back to AIRES.
As the name indicates, it's an undo-log technique. Before writing to the durable media, we copy the original contents to a log, then we indicate that the log is consistent, then write the new values on the actual data, and finally reset the log:

1:  write(dst, src, size) {
2:    memcpy(src, log, size);   // copy original data to log
3:    order_fence();
4:    recover_from_log = true;
5:    order_fence();
6:    memcpy(dst, src, size);   // write new data
7:    order_fence();
8:    recover_from_log = false;
9:    sync_fence();
10: }

Let's examine what happens if there is a crash in:
- Lines 1 to 4: no data has been over written, nothing happens;
- Lines 4 to 6: 'recover_from_log' is seen as 'true' and the recovery will overwrite the contents of 'src' with the contents of 'log'. They are the same, therefore, no change;
- Lines 6 to 8: the recovery method will overwrite the new data with the contents of the log, restoring the old values, as it should be;
- After line 8: assuming 'recover_from_log' is durable, nothing will be reverted, the write() has committed successfully;



Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing (transaction) semantics, the algorithm changes slightly:

1:  write(dst, src, size) {
2:    memcpy(src, &log[entries].data, size);   // copy original data to log entry
3:    log[entries].size = size;                // save the size of this write()
4:    log[entries].addr = dst;                 // save the destination address
3:    order_fence();
4:    entries++;
5:    order_fence();
6:    memcpy(dst, src, size);                  // write new data
7:  }
8:  commit() {
9:    sync_fence();
10:   entries = 0;
11:   order_fence();
12: }


When used in this way, the number of fences is 2 per call to write() plus 2 at commit time. The higher the number of write()s in the transaction, the more fences will be executed.

Pros/Cons:
Disadvantage 1) Two fences per write(). Can be very slow;
Advantage 1) Reading and writing is done in place;

In the context of Persistent Memory, undo-log is what libpmemobj does. This is the library in PMDK that provides durable transactions.
In the context of databases, AIRES was the first to use undo-log.



2) Redo-Log


In the redo-log technique the approach is almost opposite from the undo-log: we write the new data on the log, make the log consistent and after, copy the new contents from the log to the data:

1:  write(dst, src, size) {
2:    memcpy(log, src, size);   // copy new data to log
3:    order_fence();
4:    recover_from_log = true;
5:    order_fence();
6:    memcpy(dst, log, size);   // copy new data from log
7:    sync_fence();
8:    recover_from_log = false;
9:  }


If there is a crash after the log has been made consistent, the modifications in the log will be re-applied.
Let's examine what happens if there is a crash in:
- Lines 1 to 4: recovery does nothing;
- Lines 4 to 8: recovery method will re-apply the contents of the log to the data (redo) overwriting any old values that may have been left incomplete (in case the crash occurred during line 6);
- After line 8: recovery does nothing;


Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), the algorithm changes slightly:

1:  write(dst, src, size) {
2:    memcpy(&log[entries].data, src, size);   // copy new data to log entry
3:    log[entries].size = size;                // save the size of this write()
4:    log[entries].addr = dst;                 // save the destination address
5:    non_durable_entries++;
6:  }
7:  commit() {
8:    order_fence();
9:    entries = non_durable_entries;           // make the log consistent
10:   sync_fence();
11:   for (i : entries) memcpy(log[i].addr, &log[i].data, log[i].size);
12:   order_fence();
13:   entries = 0;
14:   order_fence();
15: }


When used for transactions, a redo-log always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction. No matter how many write()s are done in the transaction, it's always 4 fences. This number can be brought down to 2 fences, but the algorithm becomes more complex.

Pros/Cons:
Advantage 1) Only four fences per transaction;
Disadvantage 1) Reading data requires a lookup on the log, so as to cover the read-after-write scenarios. Without it, we would be reading stale data and break invariants.


In the context of Persistent Memory, redo-log is what Mnemosyne and Onefile do. Mnemosyne was the first blocking PTM and OneFile was the first wait-free PTM.
In the context of databases, Microsoft Hekaton, MySQL InnoDB engine, PostgreSQL and OracleDB all use a redo-log technique to provide transactions, while IBM DB2 uses a combination of undo and redo logs.



3) Copy-On-Write

Copy-On-Write (COW) is sometimes named "shadow copying" or "shadow paging".
This technique doesn't allow for transactional semantics. For that, it needs to be coupled with either undo-log or redo-log.
The COW technique is always applied to a block or an object. First, a new block/object is reserved, which is a replica of the block/object that is to be modified, then the modifications are executed on the new block/object and finally, a single pointer swap makes the new block/object accessible (durable):

1:  modify(obj_ptr, mutation()) {
2:    new_obj_ptr = malloc(sizeof(*obj_ptr));          // allocate a new object (2 fences or more)
3:    memcpy(new_obj_ptr, obj_ptr, sizeof(*obj_ptr));  // copy the contents of the old to the new object
4:    new_obj_ptr->mutation();                         // apply the modification on the new object
5:    order_fence();
6:    old_obj_ptr = obj_ptr;
7:    obj_ptr = new_obj_ptr;                           // pointer swap (atomic)
8:    sync_fence();
9:    free(obj_ptr);                                   // de-allocate old object (2 fences or more)
10: }




Pros/Cons:
Disadvantage 1) Every modification requires the allocation of a new block/object and the de-allocation of the old one. This can lead to high "fragmentation" on the durable media and it has a performance penalty because the act of allocating a new block/object requires two extra fences to change allocator metadata. This means that COW executes 4 fences per modified object (1 order + 1 sync for the allocator plus 1 order + 1 sync for the modification and pointer change);
Disadvantage 2) The entire block/object has to be copied, even if a single byte has changed. Depending on the usage pattern, this can create high write amplification which in turn affects performance and possibly media endurance (think flash and PM);
Disadvantage 3) Needs and undo-log or redo-log to provide transactional semantics, which can mean additional fences;
Disadvantage 4) Write amplification may be close to zero (just the pointer swap and allocator metadata changes), or it may be close to infinity (a single byte change on a very large object requires a full copy of that object, and allocator metadata changes and the pointer swap);

To be properly durable, COW techniques will always execute *at least* two fences per modified block/object (1 order + 1 sync). Extra fences may be needed for the allocation and de-allocation of the block/object (2 or more fences).

In the context of Persistent Memory, there is no PTM that uses COW but there are hand-made data structures that use this technique extensively, examples being the ones described in the MOD paper (although their algorithm requires two fences per operation to be correct, plus whatever the allocator does).
When it comes to databases, SAP HANA uses a combination of COW and redo-log to provide durable transactions.



4) Romulus

Since last year (2018), there is now a fourth way to achieve durability and it's called Romulus.
In Romulus, two twin replicas of the data are maintained at all times, with the modifications first executed in the first replica 'main' and after, applied to the other replica 'back':

1:  write(dst, src, size) {
2:    state = MUTATING;
3:    order_fence();
4:    memcpy(dst_main, src, size);   // copy new data to 'main'
5:    order_fence();
6:    state = COPYING;
7:    sync_fence();
8:    memcpy(dst_back, src, size);   // copy new data to 'back'
9:    order_fence();
10:   state = IDLE;
11: }


Let's examine what happens if there is a crash in:
- Lines 2 to 4: the recovery method see state at 'MUTATING' and copies the entire contents of 'back' to 'main' (no change);
- Lines 4 to 6: the recovery method copies 'back' to 'main', undoing the possibly incomplete modifications in 'main';
- Lines 6 to 10: assuming state is durable at 'COPYING', the recovery method will copy the contents of 'main' to back, effectively acting as a redo and duplicating the committed modifications;
- After line 10: recovery does nothing;




Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), then it is important to have an optimization where a non-durable log is kept with the addresses and ranges of the modified data:

1:  write(dst, src, size) {
2:    state = MUTATING;
3:    order_fence();
4:    memcpy(dst_main, src, size);   // copy new data to 'main'
5:    log[non_durable_entries].addr = dst_main;
6:    log[non_durable_entries].size = size;
7:    non_durable_entries++;
8:  }
9:  commit() {
10:   order_fence(); 
11:   state = COPYING;
12:   sync_fence();
13:   // 'offset' is the delta between 'back' and 'main'
14:   for (i : non_durable_entries) {

15:     memcpy(log[i].addr+offset, log[i].addr, log[i].size);
16:   }
17:   order_fence();
18:   state = IDLE;
19: }


When used for transactions, romulus always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction. No matter how many write()s are done in the transaction. Similarly to redo-log, this number can be brought down to 2 fences, but the algorithm becomes more complex.

Pros/Cons:
Advantage 1) Reading and writing is done in place, directly from the data;
Advantage 2) Transactions are unbounded in size. The volatile log can grow dynamically and if it grows too much, we can just stop using it and copy the entire 'main' to 'back' at the end of the transaction. No other technique supports transactions of unbounded size;
Advantage 3) Write amplification is 1;
Advantage 4) Four fences per transaction;
Disadvantage 1) Only half the capacity of the media is available to the user. The redo-log and undo-log need to pre-reserve space for their logs but rarely do applications reserve half the capacity of the media to the logs, giving the advantage to undo/redo/cow.



Whether we're talking about filesystems, DBMS or some other kind of application that requires durability with resilience to failures, it must use one or a combination/variant of these four techniques. Each of them has different trade-offs in terms of performance, memory usage, ease-of-implementing and ease-of-use. In the end, it's an engineering choice which of these four algorithms is best suited for a particular application.