Tuesday, October 1, 2019

Atomic Durability - How do databases recover from a crash?

In this post we're going to talk about the four different ways of having durable transactions. If you want to know how databases and file systems guarantee correct data recovery after a power failure, then keep reading!

Writing to durable/persistent media can be a complex task. If there is a failure half-way through, you may end up with corrupted data.
By durable media I mean a disk drive, an SD card, a USB stick, a Persistent Memory module (NV-DIMM), network-attached storage (NAS), and many others like these.

Suppose you're changing a customer address in a database. The current address is "Bag End, Shire" and we want to change it to "Rivendell".
What happens if there is a crash half-way through the write?
Suppose the first four bytes were written before the crash occurred. Upon restart, the address in the database is now "RiveEnd, Shire"... clearly an incorrect address. What should we do?

This is a typical problem encountered by the designers of DBMS (Database Management Systems) and file systems.
Generally speaking, there are four solutions to this problem and they are:
1) Undo-Log;
2) Redo-Log;
3) Copy-On-Write (sometimes called Shadow Copy or Shadow Paging);
4) Romulus.


Before we explain how each of these works, we need to introduce two things which we're going to call "ordering fence" and "synchronization fence".
An "ordering fence" is something that guarantees that previous writes will reach the durable media before subsequent writes. For example:

write(a);
order_fence();
write(b);

implies that "a" will always be written to the durable media before "b" is written.

A "synchronization fence" is something that guarantees that previously written data is now durable. And by durable I mean it's persistent in the media and in the event of a crash, this data will not be lost. For example:

write(a);
sync_fence();
computer_beeps();

if you heard a beep and immediately after there is a power cut, you know for sure that the media contains the data "a", because the beep occurred after the sync_fence(), and the sync_fence() only returned once write(a) was persisted.


The actual semantics depend on what hardware we're talking about, but the idea is pretty much the same: you need order_fence() and you need sync_fence().
When writing to disk, both the order_fence() and sync_fence() are typically implemented with fsync() or fdatasync(). When writing to PM (Persistent Memory) they're made up of a combination of CLWB and SFENCE instructions.
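As a concrete illustration, here is a minimal sketch in C of what the two fences might look like on an x86 machine with Persistent Memory (the helper names and the explicit address/size parameters are illustrative, and it assumes a platform where CLWB plus SFENCE is enough to reach the persistence domain); the disk variants simply delegate to fdatasync():

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>      // fdatasync()
#include <immintrin.h>   // _mm_clwb(), _mm_sfence()  (compile with -mclwb)

// Write back every cache line in [start, start+size) towards the media.
static void flush_range(const void *start, size_t size) {
    uintptr_t addr = (uintptr_t)start & ~(uintptr_t)63;   // align down to cache line
    for (; addr < (uintptr_t)start + size; addr += 64)
        _mm_clwb((void *)addr);
}

// PM: earlier writes reach the media before any later write.
static void order_fence_pm(const void *start, size_t size) {
    flush_range(start, size);
    _mm_sfence();
}

// PM: earlier writes are durable before execution continues (same instructions
// on ADR platforms, but the intent is different).
static void sync_fence_pm(const void *start, size_t size) {
    flush_range(start, size);
    _mm_sfence();
}

// Disk: both fences collapse into fdatasync() on the file backing the data.
static void order_fence_disk(int fd) { fdatasync(fd); }
static void sync_fence_disk(int fd)  { fdatasync(fd); }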

Now, back to the actual algorithms.


1) Undo-Log

This technique is at least 30 years old and dates back to ARIES.
As the name indicates, before writing to the durable media we copy the original contents to a log, then indicate that the log is consistent, then write the new values on the actual data, and finally reset the log:

1:  write(dst, src, size) {
2:    memcpy(log, dst, size);   // copy original data to log
3:    order_fence();
4:    recover_from_log = true;
5:    order_fence();
6:    memcpy(dst, src, size);   // write new data
7:    order_fence();
8:    recover_from_log = false;
9:    sync_fence();
10: }

Let's examine what happens if there is a crash in:
- Lines 1 to 4: no data has been overwritten, nothing happens;
- Lines 4 to 6: 'recover_from_log' is seen as 'true' and the recovery will overwrite the contents of 'dst' with the contents of 'log'. They are the same, therefore, no change;
- Lines 6 to 8: the recovery method will overwrite the new data with the contents of the log, restoring the old values, as it should be;
- After line 8: assuming 'recover_from_log = false' is durable, nothing will be reverted, the write() has committed successfully;
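
For completeness, a recovery routine for this single-write undo-log could look like the sketch below; it assumes the log also records the destination address and size (fields the simplified pseudocode above leaves implicit):

recover() {
  if (recover_from_log) {
    memcpy(log.addr, log.data, log.size);  // restore the original contents
    order_fence();
    recover_from_log = false;
  }
  sync_fence();
}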



Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing (transaction) semantics, the algorithm changes slightly:

1:  write(dst, src, size) {
2:    memcpy(&log[entries].data, dst, size);   // copy original data to log entry
3:    log[entries].size = size;                // save the size of this write()
4:    log[entries].addr = dst;                 // save the destination address
5:    order_fence();
6:    entries++;
7:    order_fence();
8:    memcpy(dst, src, size);                  // write new data
9:  }
10: commit() {
11:   sync_fence();
12:   entries = 0;
13:   order_fence();
14: }


When used in this way, the number of fences is 2 per call to write() plus 2 at commit time. The higher the number of write()s in the transaction, the more fences will be executed.
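
As a small (hypothetical) usage example, a transfer between two accounts issues one write() per modified field followed by a single commit(); if there is a crash before the commit takes effect, recovery restores the old balances from the log, otherwise both new balances survive:

// move 50 from account 'a' to account 'b', all-or-nothing
new_a = accounts[a] - 50;
new_b = accounts[b] + 50;
write(&accounts[a], &new_a, sizeof(new_a));
write(&accounts[b], &new_b, sizeof(new_b));
commit();   // after commit() returns, both new balances are durable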

Pros/Cons:
Disadvantage 1) Two fences per write(). Can be very slow;
Advantage 1) Reading and writing is done in place;

In the context of Persistent Memory, undo-log is what libpmemobj does. This is the library in PMDK that provides durable transactions.
In the context of databases, ARIES was one of the first algorithms to use an undo-log.



2) Redo-Log


In the redo-log technique the approach is almost the opposite of the undo-log: we write the new data to the log, make the log consistent, and only afterwards copy the new contents from the log to the actual data:

1:  write(dst, src, size) {
2:    memcpy(log, src, size);   // copy new data to log
3:    order_fence();
4:    recover_from_log = true;
5:    order_fence();
6:    memcpy(dst, log, size);   // copy new data from log
7:    sync_fence();
8:    recover_from_log = false;
9:  }


If there is a crash after the log has been made consistent, the modifications in the log will be re-applied.
Let's examine what happens if there is a crash in:
- Lines 1 to 4: recovery does nothing;
- Lines 4 to 8: recovery method will re-apply the contents of the log to the data (redo) overwriting any old values that may have been left incomplete (in case the crash occurred during line 6);
- After line 8: recovery does nothing;
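
The corresponding recovery routine is symmetric to the undo-log one, except that here the log holds the *new* data, so re-applying it acts as a redo (a minimal sketch, with the same illustrative 'addr'/'size' fields as before):

recover() {
  if (recover_from_log) {
    memcpy(log.addr, log.data, log.size);  // re-apply the new data (redo)
    sync_fence();
    recover_from_log = false;
  }
  order_fence();
}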


Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), the algorithm changes slightly:

1:  write(dst, src, size) {
2:    memcpy(&log[entries].data, src, size);   // copy new data to log entry
3:    log[entries].size = size;                // save the size of this write()
4:    log[entries].addr = dst;                 // save the destination address
5:    non_durable_entries++;
6:  }
7:  commit() {
8:    order_fence();
9:    entries = non_durable_entries;           // make the log consistent
10:   sync_fence();
11:   for (i : entries) memcpy(log[i].addr, &log[i].data, log[i].size);
12:   order_fence();
13:   entries = 0;
14:   order_fence();
15: }


When used for transactions, a redo-log always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction. No matter how many write()s are done in the transaction, it's always 4 fences. This number can be brought down to 2 fences, but the algorithm becomes more complex.

Pros/Cons:
Advantage 1) Only four fences per transaction;
Disadvantage 1) Reading data requires a lookup in the log, so as to cover read-after-write scenarios. Without it, we would read stale data and break invariants (see the sketch after this list).
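
To make the read-after-write issue concrete, here is a rough sketch of what a transactional read has to do with a redo-log; it is deliberately naive (a linear scan that only handles reads fully contained in a single log entry):

read(dst, src, size) {
  // newest entries win, so scan the log backwards for a matching write
  for (i = non_durable_entries-1; i >= 0; i--) {
    if (src >= log[i].addr && src+size <= log[i].addr+log[i].size) {
      memcpy(dst, &log[i].data[src - log[i].addr], size);  // read from the log
      return;
    }
  }
  memcpy(dst, src, size);   // not written in this transaction: read in place
}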


In the context of Persistent Memory, redo-log is what Mnemosyne and OneFile do. Mnemosyne was the first blocking PTM and OneFile was the first wait-free PTM.
In the context of databases, Microsoft Hekaton, MySQL InnoDB engine, PostgreSQL and OracleDB all use a redo-log technique to provide transactions, while IBM DB2 uses a combination of undo and redo logs.



3) Copy-On-Write

Copy-On-Write (COW) is sometimes named "shadow copying" or "shadow paging".
On its own, this technique doesn't provide transactional semantics; for that, it needs to be coupled with either an undo-log or a redo-log.
The COW technique is always applied to a block or an object. First, a new block/object is reserved and the contents of the block/object to be modified are copied into it, then the modifications are executed on the new block/object, and finally a single pointer swap makes the new block/object the accessible (durable) one:

1:  modify(obj_ptr, mutation()) {
2:    new_obj_ptr = malloc(sizeof(*obj_ptr));          // allocate a new object (2 fences or more)
3:    memcpy(new_obj_ptr, obj_ptr, sizeof(*obj_ptr));  // copy the contents of the old to the new object
4:    new_obj_ptr->mutation();                         // apply the modification on the new object
5:    order_fence();
6:    old_obj_ptr = obj_ptr;
7:    obj_ptr = new_obj_ptr;                           // pointer swap (atomic)
8:    sync_fence();
9:    free(old_obj_ptr);                               // de-allocate old object (2 fences or more)
10: }




Pros/Cons:
Disadvantage 1) Every modification requires the allocation of a new block/object and the de-allocation of the old one. This can lead to high "fragmentation" on the durable media and it has a performance penalty because the act of allocating a new block/object requires two extra fences to change allocator metadata. This means that COW executes 4 fences per modified object (1 order + 1 sync for the allocator plus 1 order + 1 sync for the modification and pointer change);
Disadvantage 2) The entire block/object has to be copied, even if a single byte has changed. Depending on the usage pattern, this can create high write amplification which in turn affects performance and possibly media endurance (think flash and PM);
Disadvantage 3) Needs an undo-log or redo-log to provide transactional semantics, which can mean additional fences;
Disadvantage 4) Write amplification varies wildly: it can be negligible (when most of the block/object changes, the extra writes are just the pointer swap and the allocator metadata), or it can be enormous (a single byte change on a very large object requires a full copy of that object, plus allocator metadata changes and the pointer swap);
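
As a rough worked example (the numbers are illustrative): changing a single 8-byte field of a 4 KiB object with COW still writes a full 4 KiB replica plus the pointer swap and allocator metadata, a write amplification in the order of 4096/8 = 512x; changing the entire 4 KiB object costs essentially the same, so the amplification is then close to 1.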

To be properly durable, COW techniques will always execute *at least* two fences per modified block/object (1 order + 1 sync). Extra fences may be needed for the allocation and de-allocation of the block/object (2 or more fences).

In the context of Persistent Memory, there is no PTM that uses COW but there are hand-made data structures that use this technique extensively, examples being the ones described in the MOD paper (although their algorithm requires two fences per operation to be correct, plus whatever the allocator does).
When it comes to databases, SAP HANA uses a combination of COW and redo-log to provide durable transactions.



4) Romulus

Since last year (2018) there is a fourth way to achieve durability, called Romulus.
In Romulus, two twin replicas of the data are maintained at all times, with the modifications first executed on one replica, 'main', and afterwards applied to the other replica, 'back':

1:  write(dst, src, size) {
2:    state = MUTATING;
3:    order_fence();
4:    memcpy(dst_main, src, size);   // copy new data to 'main'
5:    order_fence();
6:    state = COPYING;
7:    sync_fence();
8:    memcpy(dst_back, src, size);   // copy new data to 'back'
9:    order_fence();
10:   state = IDLE;
11: }


Let's examine what happens if there is a crash in:
- Lines 2 to 4: the recovery method sees state as 'MUTATING' and copies the entire contents of 'back' to 'main' (no change);
- Lines 4 to 6: the recovery method copies 'back' to 'main', undoing the possibly incomplete modifications in 'main';
- Lines 6 to 10: assuming state is durable as 'COPYING', the recovery method will copy the contents of 'main' to 'back', effectively acting as a redo and duplicating the committed modifications into 'back';
- After line 10: recovery does nothing;
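
Putting the cases above together, the recovery routine is just a state check followed by a full copy between the two replicas (a minimal sketch; 'main', 'back' and 'REGION_SIZE' are illustrative names for the replicas and their size):

recover() {
  if (state == MUTATING) {
    memcpy(main, back, REGION_SIZE);   // 'main' may be half-written: restore it from 'back'
  } else if (state == COPYING) {
    memcpy(back, main, REGION_SIZE);   // 'main' is committed: re-apply it to 'back'
  }
  order_fence();
  state = IDLE;
  sync_fence();
}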




Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), then it is important to have an optimization where a non-durable log is kept with the addresses and ranges of the modified data:

1:  write(dst, src, size) {
2:    state = MUTATING;
3:    order_fence();
4:    memcpy(dst_main, src, size);   // copy new data to 'main'
5:    log[non_durable_entries].addr = dst_main;
6:    log[non_durable_entries].size = size;
7:    non_durable_entries++;
8:  }
9:  commit() {
10:   order_fence(); 
11:   state = COPYING;
12:   sync_fence();
13:   // 'offset' is the delta between 'back' and 'main'
14:   for (i : non_durable_entries) {
15:     memcpy(log[i].addr+offset, log[i].addr, log[i].size);
16:   }
17:   order_fence();
18:   state = IDLE;
19: }


When used for transactions, Romulus always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction or the number of write()s done in it. Similarly to the redo-log, this number can be brought down to 2 fences, but the algorithm becomes more complex.

Pros/Cons:
Advantage 1) Reading and writing is done in place, directly from the data;
Advantage 2) Transactions are unbounded in size. The volatile log can grow dynamically and if it grows too much, we can just stop using it and copy the entire 'main' to 'back' at the end of the transaction. No other technique supports transactions of unbounded size;
Advantage 3) Write amplification is 1;
Advantage 4) Four fences per transaction;
Disadvantage 1) Only half the capacity of the media is available to the user. The undo-log and redo-log also need to pre-reserve space for their logs, but applications rarely reserve half the capacity of the media for the logs, which gives the advantage to undo/redo/COW.



Whether we're talking about file systems, DBMS or some other kind of application that requires durability with resilience to failures, it must use one of these four techniques, or a combination/variant of them. Each has different trade-offs in terms of performance, memory usage, ease of implementation and ease of use. In the end, it's an engineering choice which of the four algorithms is best suited for a particular application.
