If I were to make a concurrent map data structure whose operations are not linearizable and then put it on a library and give it to my users, any user of that library would come back to me and say that my library has a bug.
They may not be able to tell me exactly what it is, but they'll understand that there is a bug in it.
However, if I take the same concurrent map data structure, and put in an application and call that application a "Key-Value store" or a "Database" (DBMS) and give it to the typical database users, it seems they may certainly use it for a several decades without ever complaining that this "Database" is not linearizable (or serializable as the DB folks call it).
If this sounds far-fetched, then just go and read this post: http://jepsen.io/analyses/postgresql-12.3
It seems that the Postgresql users really didn't care that postgresql isn't serializable when told to be so and in fact, isn't even read committed by default, which should be the default. And a similar thing happened to MongDB last month, so it's not something specific to Postgresql.
I find this interesting because it's an extreme case of managing expectations: When you take a library of concurrent data structures, you expect a certain kind of consistency, namely, you expect linearizability (just go and read discussion on the Java concurrency mailing list if you don't believe me).
However, if you take a DBMS like Postgresql, you no longer expect linearizability to be the default behavior. In fact, you expect read committed as the default behavior, which turns out Postgresql doesn't even give read-committed and instead gives snapshot isolation.
A user of a library of concurrent data structures would call read committed a bug. She would also call snapshot isolation a bug, and pretty much everything that is not linearizable would be a bug to her.
The reason it would be a bug, is because it is very hard to reason about anything that is not linearizable. In the post by Jepsen you can even see that there is no exact definition of read committed, so I guess that's one of the reasons why nobody complained about it before.
I can easily imagine discussions of the type:
DBMS User: This behavior is weird!
DBMS Developer: No, it's not weird, it's "read committed", you just have to go and learn about what it means!
DBMS User: The definition of "read committed" is so fluffly that it can mean any weird thing... I can't even understand if the weird thing I'm observing is "read committed" or not.
DBMS Developer: See!?! I was right, this is "read committed"!
DBMS User: Ok, I'll keep using your DBMS because all the other DBMS work the same way.
I could have understood if it was a distributed database, because the cost of a serializable transaction over a distributed database is likely proportional to the latency of the furthest node in the database, while for read-committed it may be lower (who knows?). But the scenario that Jepsen describes isn't even about distributed databases. It's a bug found running on a single-node database. There's no "distributed-databases-are-hard" excuse here (which they are, it's true).
It makes me wonder how did the DBMS folks got their users so well trained that they don't complain about consistency bugs in the DBMS?!?
On one side, I'm super envious because I secretly wish I could be as dismissive to my users as the DBMS folks are to theirs, lol.
But seriously, how could this have happened for so long?!?
I see two possible reasons:
1st, the users don't really care about consistency. They're running businesses. As long as the DBMS is fast enough and has the features that they need, they'll continue to spew out cash for it. Correct consistency is not an issue for the 99% use-cases of databases, as long as the data doesn't get corrupted or lost, everything's fine.
2nd, it's always been like that. Everybody accepts it works like that, and if you want something better, you have to go for a niche DBMS (not that easy to find). Read-committed, snapshot isolation and other strange names as such, are just the "status quo" and nobody wants to change the status quo.
3rd, the DBMS folks hide behind the wall of complexity that is the DBMS. It's a common scenario in IT. They would say something like "Ooooohhhhh this DBMS is too complicated for mere mortals to question! It takes many years of work and millions of lines of code! Here be dragons! Oooohhhh".
If you think of other reasons behind this, I would like to hear about them in the comments section.
Anyways, with the advent of Persistent Memory (PM) and Software Transactional Memory for PM, this game is changing.
One example close to my heart is RedoDB.
RedoDB is a "key-value store" but it supports linearizable transactions over any C++ data type (needs to be type-annotated though). Not only that, but these transactions are wait-free.
That's right, you heard it well: RedoDB provides durable linearizable/serializable wait-free dynamic transactions.
No only does it do that, but is does it slightly faster than RocksDB.
The downside? Consumes vast amounts of memory, though it won't be any worse than most implementations of Multi-Version-Concurrency-Control. At least in RedoDB there is a bound on memory usage.
Anyways, our goal was to show that wait-free DBMS are feasible and can be made efficiently. We weren't aiming for a commercial product.
Ohh and did I say that this DB is about 3k lines of code, as opposed to the several hundred thousands LOC for other DBMS?
You can checkout the paper for RedoDB here: https://dl.acm.org/doi/abs/10.1145/3342195.3387515
and the source code here: https://github.com/pramalhe/RedoDB
Showing posts with label DBMS. Show all posts
Showing posts with label DBMS. Show all posts
Friday, June 12, 2020
Monday, October 28, 2019
Is RCU a generic concurrency control?
In this post we're going to talk about RCU and we're going to start by what it means.
When people talk about RCU they usually mean two different things combined: a synchronization mechanism, which I'll call RCU-sync, and a make-copy-of-object-and-modify-copy, which I'll call Copy-On-Write (COW).
Let's start with Copy-On-Write.
This is a technique used in concurrency and even on durability (see previous post) where instead of modifying an object in-place, you make a copy of the object, modify the new object and then toggle a pointer to point from the old object to the new object. The name RCU comes from Read-Copy-Update as a reference to this technique.
When used in (shared memory) concurrency this creates the problem of what to do with the old object: is it safe to delete it immediately? What if there are other threads currently accessing it?
Solving this problem of object lifetime tracking is what the synchronization mechanism of RCU (RCU-sync) does.
There are many different algorithms that implement an RCU-sync. Some examples of RCU-sync algorithms can be seen in Paul McKenney's excellent book or in our paper about fast userspace RCUs. The thing about an RCU-sync is that it's a concurrency construct, like a mutual exclusion lock or a reader-writer lock. For example, a mutual exclusion lock implements at least the following API:
lock()
unlock()
while a reader-writer lock implements at least:
read_lock()
read_unlock()
write_lock()
write_unlock()
and an RCU-sync implements at least:
rcu_read_lock()
rcu_read_unlock()
rcu_synchronize()
The difference between RCU and locks is that locks provide a generic way of doing concurrency, while RCU does not.
The rcu_read_lock() and rcu_read_unlock() methods must guarantee that code within a critical section defined by these two functions must not escape outside the critical section. An RCU-sync must also guarantee that a call to rcu_synchronize() will wait for all threads currently executing a block of code in a rcu_read_lock/unlock() block to complete (call rcu_read_unlock) before rcu_synchronize() returns to the caller. This guarantee is called quiescence and it's extremely useful in shared memory concurrency. Note that the actual semantics are much more formal that what I've described here, but for the purpose of this post, it's enough to get an idea of how an RCU-sync behaves.
As mentioned above, RCU-sync provides quiescence, not mutual exclusion.
A thread calling rcu_synchronize() will know that by the time the call to rcu_synchronize() returns, everything done before this call is now visible to other threads (as long as the other threads call rcu_read_lock/unlock to access the associated data). However, there is nothing preventing other threads from modyfing the same data at the same time (data race behavior). This means that even when using RCU-synch we still need some extra mechanism to provide mutual exclusion or prevent races in some other way.
At its essence, RCU-sync is mechanism for a single writer to synchronize with multiple readers. It doesn't solve the problem of multiple writers and it also doesn't solve the problem of the readers having a consistent view of the data.
Let me explain a bit more about what I mean with "consistent view". Suppose we have two variables, "a" and "b" initially at zero and a single writer thread and multiple reader threads. The writer code could be:
writer() {
a = 1;
rcu_synchronize();
b = 1;
rcu_synchronize();
}
and the reader code:
reader() {
rcu_read_lock();
printf("a=%d\n", a);
printf("b=%d\n", b);
rcu_read_unlock();
}
A reader thread could print "a=1,b=0", even if "a" and "b" are std::atomic with memory_order_consume, or even memory_order_seq_cst.
RCU-sync does not provide atomicity for multiple variables. In order to achieve this, we need to aggregate all those variables within a single object and make a snapshopt of this object using Copy-On-Write. But then, the concurrency technique really is COW, not RCU-sync. With such an approach, the RCU-sync is used only for memory reclamation of the old objects and if we're in Java then we don't even need that because the GC will do it for us.
And that is why whenever they use copy-on-write on java.util.concurrent they call it copy-on-write and not RCU ;)
Let me repeat it again: RCU-sync is extremely useful for memory reclamation and as a component of more complex concurrency techniques, but it is not generic. RCU-sync does not provide atomicity, nor does it prevent write-write races, not even read-write races. All it does it give quiescence, which personally I think it's awesome! It's all a matter of expectations.
Just to go back to COW briefly, COW can be used to solve write-write races and read-write races by creating a new object on every modification of every sub-object. Basically, if you want full consistency across your data, you need a god object that contains all your data and any modification to the data (or just a single variable) will require a full copy of the entire data.
So yes, COW can be generic, it's just that nobody does it like this because it would be too slow.
Both COW and RCU-sync are valuable tools that researchers and concurrency library implementers can use to make sophisticated concurrent data structures or other synchronization mechanisms, but calling them generic concurrency controls is going too far.
In other words, there certainly are database engines and STMs (software transactional memory) which use COW or RCU-sync internally, but you'll never see a DBMS or STM whose concurrency control is RCU. It just doesn't make sense to have a DBMS where any mutative operation causes a copy of the entire DB to be made. Because, that would be the only way to provide generic transactions on that DBMS or STM.
In the DBMS context, a concurrency control is something like MVCC (multi-version concurrency control) or 2PL (two-phase locking) or OCC (optimistic concurrency control).
RCU is not a generic concurrency control.
When people talk about RCU they usually mean two different things combined: a synchronization mechanism, which I'll call RCU-sync, and a make-copy-of-object-and-modify-copy, which I'll call Copy-On-Write (COW).
Let's start with Copy-On-Write.
This is a technique used in concurrency and even on durability (see previous post) where instead of modifying an object in-place, you make a copy of the object, modify the new object and then toggle a pointer to point from the old object to the new object. The name RCU comes from Read-Copy-Update as a reference to this technique.
When used in (shared memory) concurrency this creates the problem of what to do with the old object: is it safe to delete it immediately? What if there are other threads currently accessing it?
Solving this problem of object lifetime tracking is what the synchronization mechanism of RCU (RCU-sync) does.
There are many different algorithms that implement an RCU-sync. Some examples of RCU-sync algorithms can be seen in Paul McKenney's excellent book or in our paper about fast userspace RCUs. The thing about an RCU-sync is that it's a concurrency construct, like a mutual exclusion lock or a reader-writer lock. For example, a mutual exclusion lock implements at least the following API:
lock()
unlock()
while a reader-writer lock implements at least:
read_lock()
read_unlock()
write_lock()
write_unlock()
and an RCU-sync implements at least:
rcu_read_lock()
rcu_read_unlock()
rcu_synchronize()
The difference between RCU and locks is that locks provide a generic way of doing concurrency, while RCU does not.
The rcu_read_lock() and rcu_read_unlock() methods must guarantee that code within a critical section defined by these two functions must not escape outside the critical section. An RCU-sync must also guarantee that a call to rcu_synchronize() will wait for all threads currently executing a block of code in a rcu_read_lock/unlock() block to complete (call rcu_read_unlock) before rcu_synchronize() returns to the caller. This guarantee is called quiescence and it's extremely useful in shared memory concurrency. Note that the actual semantics are much more formal that what I've described here, but for the purpose of this post, it's enough to get an idea of how an RCU-sync behaves.
As mentioned above, RCU-sync provides quiescence, not mutual exclusion.
A thread calling rcu_synchronize() will know that by the time the call to rcu_synchronize() returns, everything done before this call is now visible to other threads (as long as the other threads call rcu_read_lock/unlock to access the associated data). However, there is nothing preventing other threads from modyfing the same data at the same time (data race behavior). This means that even when using RCU-synch we still need some extra mechanism to provide mutual exclusion or prevent races in some other way.
At its essence, RCU-sync is mechanism for a single writer to synchronize with multiple readers. It doesn't solve the problem of multiple writers and it also doesn't solve the problem of the readers having a consistent view of the data.
Let me explain a bit more about what I mean with "consistent view". Suppose we have two variables, "a" and "b" initially at zero and a single writer thread and multiple reader threads. The writer code could be:
writer() {
a = 1;
rcu_synchronize();
b = 1;
rcu_synchronize();
}
and the reader code:
reader() {
rcu_read_lock();
printf("a=%d\n", a);
printf("b=%d\n", b);
rcu_read_unlock();
}
A reader thread could print "a=1,b=0", even if "a" and "b" are std::atomic with memory_order_consume, or even memory_order_seq_cst.
RCU-sync does not provide atomicity for multiple variables. In order to achieve this, we need to aggregate all those variables within a single object and make a snapshopt of this object using Copy-On-Write. But then, the concurrency technique really is COW, not RCU-sync. With such an approach, the RCU-sync is used only for memory reclamation of the old objects and if we're in Java then we don't even need that because the GC will do it for us.
And that is why whenever they use copy-on-write on java.util.concurrent they call it copy-on-write and not RCU ;)
Let me repeat it again: RCU-sync is extremely useful for memory reclamation and as a component of more complex concurrency techniques, but it is not generic. RCU-sync does not provide atomicity, nor does it prevent write-write races, not even read-write races. All it does it give quiescence, which personally I think it's awesome! It's all a matter of expectations.
Just to go back to COW briefly, COW can be used to solve write-write races and read-write races by creating a new object on every modification of every sub-object. Basically, if you want full consistency across your data, you need a god object that contains all your data and any modification to the data (or just a single variable) will require a full copy of the entire data.
So yes, COW can be generic, it's just that nobody does it like this because it would be too slow.
Both COW and RCU-sync are valuable tools that researchers and concurrency library implementers can use to make sophisticated concurrent data structures or other synchronization mechanisms, but calling them generic concurrency controls is going too far.
In other words, there certainly are database engines and STMs (software transactional memory) which use COW or RCU-sync internally, but you'll never see a DBMS or STM whose concurrency control is RCU. It just doesn't make sense to have a DBMS where any mutative operation causes a copy of the entire DB to be made. Because, that would be the only way to provide generic transactions on that DBMS or STM.
In the DBMS context, a concurrency control is something like MVCC (multi-version concurrency control) or 2PL (two-phase locking) or OCC (optimistic concurrency control).
RCU is not a generic concurrency control.
Tuesday, October 1, 2019
Atomic Durability - How do databases recover from a crash ?
In this post we're going to talk about the four different ways of having durable transactions. If you want to know how databases and file systems guarantee correct data recovery after a power failure, then keep reading!
Writing to a durable/persistent media can be a complex task. If there is a failure half-way through, you may end up with corrupted data.
By durable media I mean a disk drive, an SD card, a USB stick, a Persistent Memory module (NV-DIMM), a network attached NAS, and many other like these.
Suppose you're changing a customer address in a database. The current address is "Bag End, Shire" and we want to change it to "Rivendell".
What happens of there is a crash half-way through the write?
Suppose the first four bytes were written before the crash occurred. Upon restart, the address in the database is now "Rivend, Shire"... clearly an incorrect address. What should we do?
This is a typical problem encountered by the designers of DBMS (Database Management Systems) and file systems.
Generally speaking, there are four solutions to this problem and they are:
1) Undo-Log;
2) Redo-Log;
3) Copy-On-Write (sometimes called Shadow Copy or Shadow Paging);
4) Romulus;
Before we explain how each of these works, we need to introduce two things which we're going to call "ordering fence" and "synchronization fence".
An "ordering fence" is something that guarantees that previous writes will reach the durable media before subsequent writes. For example:
write(a);
order_fence();
write(b);
implies that "a" will always be written to the durable media before "b" is written.
A "synchronization fence" is something that guarantees that previously written data is now durable. And by durable I mean it's persistent in the media and in the event of a crash, this data will not be lost. For example:
write(a);
sync_fence();
computer_beeps();
if you heard a beep and immediately after there is a power cut, you know for sure that the media contains the data "a" because the beep occurred after the sync_fence() which was after the write(a) was persisted.
The actual semantics depend on what hardware we're talking about, but the idea is pretty much the same: you need order_fence() and you need sync_fence().
When writing to disk, both the order_fence() and sync_fence() have to be implemented as fsync() or fdatasync(). When writing to PM they're made up of a combination of CLWB and SFENCE instructions.
Now back to the actual algorithms
As the name indicates, it's an undo-log technique. Before writing to the durable media, we copy the original contents to a log, then we indicate that the log is consistent, then write the new values on the actual data, and finally reset the log:
1: write(dst, src, size) {
2: memcpy(src, log, size); // copy original data to log
3: order_fence();
4: recover_from_log = true;
5: order_fence();
6: memcpy(dst, src, size); // write new data
7: order_fence();
8: recover_from_log = false;
9: sync_fence();
10: }
Let's examine what happens if there is a crash in:
- Lines 1 to 4: no data has been over written, nothing happens;
- Lines 4 to 6: 'recover_from_log' is seen as 'true' and the recovery will overwrite the contents of 'src' with the contents of 'log'. They are the same, therefore, no change;
- Lines 6 to 8: the recovery method will overwrite the new data with the contents of the log, restoring the old values, as it should be;
- After line 8: assuming 'recover_from_log' is durable, nothing will be reverted, the write() has committed successfully;
Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing (transaction) semantics, the algorithm changes slightly:
1: write(dst, src, size) {
2: memcpy(src, &log[entries].data, size); // copy original data to log entry
3: log[entries].size = size; // save the size of this write()
4: log[entries].addr = dst; // save the destination address
3: order_fence();
4: entries++;
5: order_fence();
6: memcpy(dst, src, size); // write new data
7: }
8: commit() {
9: sync_fence();
10: entries = 0;
11: order_fence();
12: }
When used in this way, the number of fences is 2 per call to write() plus 2 at commit time. The higher the number of write()s in the transaction, the more fences will be executed.
Pros/Cons:
Disadvantage 1) Two fences per write(). Can be very slow;
Advantage 1) Reading and writing is done in place;
In the context of Persistent Memory, undo-log is what libpmemobj does. This is the library in PMDK that provides durable transactions.
In the context of databases, AIRES was the first to use undo-log.
In the redo-log technique the approach is almost opposite from the undo-log: we write the new data on the log, make the log consistent and after, copy the new contents from the log to the data:
1: write(dst, src, size) {
2: memcpy(log, src, size); // copy new data to log
3: order_fence();
4: recover_from_log = true;
5: order_fence();
6: memcpy(dst, log, size); // copy new data from log
7: sync_fence();
8: recover_from_log = false;
9: }
If there is a crash after the log has been made consistent, the modifications in the log will be re-applied.
Let's examine what happens if there is a crash in:
- Lines 1 to 4: recovery does nothing;
- Lines 4 to 8: recovery method will re-apply the contents of the log to the data (redo) overwriting any old values that may have been left incomplete (in case the crash occurred during line 6);
- After line 8: recovery does nothing;

Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), the algorithm changes slightly:
1: write(dst, src, size) {
2: memcpy(&log[entries].data, src, size); // copy new data to log entry
3: log[entries].size = size; // save the size of this write()
4: log[entries].addr = dst; // save the destination address
5: non_durable_entries++;
6: }
7: commit() {
8: order_fence();
9: entries = non_durable_entries; // make the log consistent
10: sync_fence();
11: for (i : entries) memcpy(log[i].addr, &log[i].data, log[i].size);
12: order_fence();
13: entries = 0;
14: order_fence();
15: }
When used for transactions, a redo-log always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction. No matter how many write()s are done in the transaction, it's always 4 fences. This number can be brought down to 2 fences, but the algorithm becomes more complex.
Pros/Cons:
Advantage 1) Only four fences per transaction;
Disadvantage 1) Reading data requires a lookup on the log, so as to cover the read-after-write scenarios. Without it, we would be reading stale data and break invariants.
In the context of Persistent Memory, redo-log is what Mnemosyne and Onefile do. Mnemosyne was the first blocking PTM and OneFile was the first wait-free PTM.
In the context of databases, Microsoft Hekaton, MySQL InnoDB engine, PostgreSQL and OracleDB all use a redo-log technique to provide transactions, while IBM DB2 uses a combination of undo and redo logs.
This technique doesn't allow for transactional semantics. For that, it needs to be coupled with either undo-log or redo-log.
The COW technique is always applied to a block or an object. First, a new block/object is reserved, which is a replica of the block/object that is to be modified, then the modifications are executed on the new block/object and finally, a single pointer swap makes the new block/object accessible (durable):
1: modify(obj_ptr, mutation()) {
2: new_obj_ptr = malloc(sizeof(*obj_ptr)); // allocate a new object (2 fences or more)
3: memcpy(new_obj_ptr, obj_ptr, sizeof(*obj_ptr)); // copy the contents of the old to the new object
4: new_obj_ptr->mutation(); // apply the modification on the new object
5: order_fence();
6: old_obj_ptr = obj_ptr;
7: obj_ptr = new_obj_ptr; // pointer swap (atomic)
8: sync_fence();
9: free(obj_ptr); // de-allocate old object (2 fences or more)
10: }

Pros/Cons:
Disadvantage 1) Every modification requires the allocation of a new block/object and the de-allocation of the old one. This can lead to high "fragmentation" on the durable media and it has a performance penalty because the act of allocating a new block/object requires two extra fences to change allocator metadata. This means that COW executes 4 fences per modified object (1 order + 1 sync for the allocator plus 1 order + 1 sync for the modification and pointer change);
Disadvantage 2) The entire block/object has to be copied, even if a single byte has changed. Depending on the usage pattern, this can create high write amplification which in turn affects performance and possibly media endurance (think flash and PM);
Disadvantage 3) Needs and undo-log or redo-log to provide transactional semantics, which can mean additional fences;
Disadvantage 4) Write amplification may be close to zero (just the pointer swap and allocator metadata changes), or it may be close to infinity (a single byte change on a very large object requires a full copy of that object, and allocator metadata changes and the pointer swap);
To be properly durable, COW techniques will always execute *at least* two fences per modified block/object (1 order + 1 sync). Extra fences may be needed for the allocation and de-allocation of the block/object (2 or more fences).
In the context of Persistent Memory, there is no PTM that uses COW but there are hand-made data structures that use this technique extensively, examples being the ones described in the MOD paper (although their algorithm requires two fences per operation to be correct, plus whatever the allocator does).
When it comes to databases, SAP HANA uses a combination of COW and redo-log to provide durable transactions.
In Romulus, two twin replicas of the data are maintained at all times, with the modifications first executed in the first replica 'main' and after, applied to the other replica 'back':
1: write(dst, src, size) {
2: state = MUTATING;
3: order_fence();
4: memcpy(dst_main, src, size); // copy new data to 'main'
5: order_fence();
6: state = COPYING;
7: sync_fence();
8: memcpy(dst_back, src, size); // copy new data to 'back'
9: order_fence();
10: state = IDLE;
11: }
Let's examine what happens if there is a crash in:
- Lines 2 to 4: the recovery method see state at 'MUTATING' and copies the entire contents of 'back' to 'main' (no change);
- Lines 4 to 6: the recovery method copies 'back' to 'main', undoing the possibly incomplete modifications in 'main';
- Lines 6 to 10: assuming state is durable at 'COPYING', the recovery method will copy the contents of 'main' to back, effectively acting as a redo and duplicating the committed modifications;
- After line 10: recovery does nothing;
Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), then it is important to have an optimization where a non-durable log is kept with the addresses and ranges of the modified data:
1: write(dst, src, size) {
2: state = MUTATING;
3: order_fence();
4: memcpy(dst_main, src, size); // copy new data to 'main'
5: log[non_durable_entries].addr = dst_main;
6: log[non_durable_entries].size = size;
7: non_durable_entries++;
8: }
9: commit() {
10: order_fence();
11: state = COPYING;
12: sync_fence();
13: // 'offset' is the delta between 'back' and 'main'
14: for (i : non_durable_entries) {
15: memcpy(log[i].addr+offset, log[i].addr, log[i].size);
16: }
17: order_fence();
18: state = IDLE;
19: }
When used for transactions, romulus always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction. No matter how many write()s are done in the transaction. Similarly to redo-log, this number can be brought down to 2 fences, but the algorithm becomes more complex.
Pros/Cons:
Advantage 1) Reading and writing is done in place, directly from the data;
Advantage 2) Transactions are unbounded in size. The volatile log can grow dynamically and if it grows too much, we can just stop using it and copy the entire 'main' to 'back' at the end of the transaction. No other technique supports transactions of unbounded size;
Advantage 3) Write amplification is 1;
Advantage 4) Four fences per transaction;
Disadvantage 1) Only half the capacity of the media is available to the user. The redo-log and undo-log need to pre-reserve space for their logs but rarely do applications reserve half the capacity of the media to the logs, giving the advantage to undo/redo/cow.
Whether we're talking about filesystems, DBMS or some other kind of application that requires durability with resilience to failures, it must use one or a combination/variant of these four techniques. Each of them has different trade-offs in terms of performance, memory usage, ease-of-implementing and ease-of-use. In the end, it's an engineering choice which of these four algorithms is best suited for a particular application.
Writing to a durable/persistent media can be a complex task. If there is a failure half-way through, you may end up with corrupted data.
By durable media I mean a disk drive, an SD card, a USB stick, a Persistent Memory module (NV-DIMM), a network attached NAS, and many other like these.
Suppose you're changing a customer address in a database. The current address is "Bag End, Shire" and we want to change it to "Rivendell".
What happens of there is a crash half-way through the write?
Suppose the first four bytes were written before the crash occurred. Upon restart, the address in the database is now "Rivend, Shire"... clearly an incorrect address. What should we do?
This is a typical problem encountered by the designers of DBMS (Database Management Systems) and file systems.
Generally speaking, there are four solutions to this problem and they are:
1) Undo-Log;
2) Redo-Log;
3) Copy-On-Write (sometimes called Shadow Copy or Shadow Paging);
4) Romulus;
Before we explain how each of these works, we need to introduce two things which we're going to call "ordering fence" and "synchronization fence".
An "ordering fence" is something that guarantees that previous writes will reach the durable media before subsequent writes. For example:
write(a);
order_fence();
write(b);
implies that "a" will always be written to the durable media before "b" is written.
A "synchronization fence" is something that guarantees that previously written data is now durable. And by durable I mean it's persistent in the media and in the event of a crash, this data will not be lost. For example:
write(a);
sync_fence();
computer_beeps();
if you heard a beep and immediately after there is a power cut, you know for sure that the media contains the data "a" because the beep occurred after the sync_fence() which was after the write(a) was persisted.
The actual semantics depend on what hardware we're talking about, but the idea is pretty much the same: you need order_fence() and you need sync_fence().
When writing to disk, both the order_fence() and sync_fence() have to be implemented as fsync() or fdatasync(). When writing to PM they're made up of a combination of CLWB and SFENCE instructions.
Now back to the actual algorithms
1) Undo-Log
This technique is at least 30 years old and dates back to AIRES.As the name indicates, it's an undo-log technique. Before writing to the durable media, we copy the original contents to a log, then we indicate that the log is consistent, then write the new values on the actual data, and finally reset the log:
1: write(dst, src, size) {
2: memcpy(src, log, size); // copy original data to log
3: order_fence();
4: recover_from_log = true;
5: order_fence();
6: memcpy(dst, src, size); // write new data
7: order_fence();
8: recover_from_log = false;
9: sync_fence();
10: }
Let's examine what happens if there is a crash in:
- Lines 1 to 4: no data has been over written, nothing happens;
- Lines 4 to 6: 'recover_from_log' is seen as 'true' and the recovery will overwrite the contents of 'src' with the contents of 'log'. They are the same, therefore, no change;
- Lines 6 to 8: the recovery method will overwrite the new data with the contents of the log, restoring the old values, as it should be;
- After line 8: assuming 'recover_from_log' is durable, nothing will be reverted, the write() has committed successfully;
Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing (transaction) semantics, the algorithm changes slightly:
1: write(dst, src, size) {
2: memcpy(src, &log[entries].data, size); // copy original data to log entry
3: log[entries].size = size; // save the size of this write()
4: log[entries].addr = dst; // save the destination address
3: order_fence();
4: entries++;
5: order_fence();
6: memcpy(dst, src, size); // write new data
7: }
8: commit() {
9: sync_fence();
10: entries = 0;
11: order_fence();
12: }
When used in this way, the number of fences is 2 per call to write() plus 2 at commit time. The higher the number of write()s in the transaction, the more fences will be executed.
Pros/Cons:
Disadvantage 1) Two fences per write(). Can be very slow;
Advantage 1) Reading and writing is done in place;
In the context of Persistent Memory, undo-log is what libpmemobj does. This is the library in PMDK that provides durable transactions.
In the context of databases, AIRES was the first to use undo-log.
2) Redo-Log
In the redo-log technique the approach is almost opposite from the undo-log: we write the new data on the log, make the log consistent and after, copy the new contents from the log to the data:
1: write(dst, src, size) {
2: memcpy(log, src, size); // copy new data to log
3: order_fence();
4: recover_from_log = true;
5: order_fence();
6: memcpy(dst, log, size); // copy new data from log
7: sync_fence();
8: recover_from_log = false;
9: }
If there is a crash after the log has been made consistent, the modifications in the log will be re-applied.
Let's examine what happens if there is a crash in:
- Lines 1 to 4: recovery does nothing;
- Lines 4 to 8: recovery method will re-apply the contents of the log to the data (redo) overwriting any old values that may have been left incomplete (in case the crash occurred during line 6);
- After line 8: recovery does nothing;
Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), the algorithm changes slightly:
1: write(dst, src, size) {
2: memcpy(&log[entries].data, src, size); // copy new data to log entry
3: log[entries].size = size; // save the size of this write()
4: log[entries].addr = dst; // save the destination address
5: non_durable_entries++;
6: }
7: commit() {
8: order_fence();
9: entries = non_durable_entries; // make the log consistent
10: sync_fence();
11: for (i : entries) memcpy(log[i].addr, &log[i].data, log[i].size);
12: order_fence();
13: entries = 0;
14: order_fence();
15: }
When used for transactions, a redo-log always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction. No matter how many write()s are done in the transaction, it's always 4 fences. This number can be brought down to 2 fences, but the algorithm becomes more complex.
Pros/Cons:
Advantage 1) Only four fences per transaction;
Disadvantage 1) Reading data requires a lookup on the log, so as to cover the read-after-write scenarios. Without it, we would be reading stale data and break invariants.
In the context of Persistent Memory, redo-log is what Mnemosyne and Onefile do. Mnemosyne was the first blocking PTM and OneFile was the first wait-free PTM.
In the context of databases, Microsoft Hekaton, MySQL InnoDB engine, PostgreSQL and OracleDB all use a redo-log technique to provide transactions, while IBM DB2 uses a combination of undo and redo logs.
3) Copy-On-Write
Copy-On-Write (COW) is sometimes named "shadow copying" or "shadow paging".This technique doesn't allow for transactional semantics. For that, it needs to be coupled with either undo-log or redo-log.
The COW technique is always applied to a block or an object. First, a new block/object is reserved, which is a replica of the block/object that is to be modified, then the modifications are executed on the new block/object and finally, a single pointer swap makes the new block/object accessible (durable):
1: modify(obj_ptr, mutation()) {
2: new_obj_ptr = malloc(sizeof(*obj_ptr)); // allocate a new object (2 fences or more)
3: memcpy(new_obj_ptr, obj_ptr, sizeof(*obj_ptr)); // copy the contents of the old to the new object
4: new_obj_ptr->mutation(); // apply the modification on the new object
5: order_fence();
6: old_obj_ptr = obj_ptr;
7: obj_ptr = new_obj_ptr; // pointer swap (atomic)
8: sync_fence();
9: free(obj_ptr); // de-allocate old object (2 fences or more)
10: }
Pros/Cons:
Disadvantage 1) Every modification requires the allocation of a new block/object and the de-allocation of the old one. This can lead to high "fragmentation" on the durable media and it has a performance penalty because the act of allocating a new block/object requires two extra fences to change allocator metadata. This means that COW executes 4 fences per modified object (1 order + 1 sync for the allocator plus 1 order + 1 sync for the modification and pointer change);
Disadvantage 2) The entire block/object has to be copied, even if a single byte has changed. Depending on the usage pattern, this can create high write amplification which in turn affects performance and possibly media endurance (think flash and PM);
Disadvantage 3) Needs and undo-log or redo-log to provide transactional semantics, which can mean additional fences;
Disadvantage 4) Write amplification may be close to zero (just the pointer swap and allocator metadata changes), or it may be close to infinity (a single byte change on a very large object requires a full copy of that object, and allocator metadata changes and the pointer swap);
To be properly durable, COW techniques will always execute *at least* two fences per modified block/object (1 order + 1 sync). Extra fences may be needed for the allocation and de-allocation of the block/object (2 or more fences).
In the context of Persistent Memory, there is no PTM that uses COW but there are hand-made data structures that use this technique extensively, examples being the ones described in the MOD paper (although their algorithm requires two fences per operation to be correct, plus whatever the allocator does).
When it comes to databases, SAP HANA uses a combination of COW and redo-log to provide durable transactions.
4) Romulus
Since last year (2018), there is now a fourth way to achieve durability and it's called Romulus.In Romulus, two twin replicas of the data are maintained at all times, with the modifications first executed in the first replica 'main' and after, applied to the other replica 'back':
1: write(dst, src, size) {
2: state = MUTATING;
3: order_fence();
4: memcpy(dst_main, src, size); // copy new data to 'main'
5: order_fence();
6: state = COPYING;
7: sync_fence();
8: memcpy(dst_back, src, size); // copy new data to 'back'
9: order_fence();
10: state = IDLE;
11: }
Let's examine what happens if there is a crash in:
- Lines 2 to 4: the recovery method see state at 'MUTATING' and copies the entire contents of 'back' to 'main' (no change);
- Lines 4 to 6: the recovery method copies 'back' to 'main', undoing the possibly incomplete modifications in 'main';
- Lines 6 to 10: assuming state is durable at 'COPYING', the recovery method will copy the contents of 'main' to back, effectively acting as a redo and duplicating the committed modifications;
- After line 10: recovery does nothing;
Above, we saw the algorithm for a single write. If we want to have multiple writes with all-or-nothing semantics (transaction), then it is important to have an optimization where a non-durable log is kept with the addresses and ranges of the modified data:
1: write(dst, src, size) {
2: state = MUTATING;
3: order_fence();
4: memcpy(dst_main, src, size); // copy new data to 'main'
5: log[non_durable_entries].addr = dst_main;
6: log[non_durable_entries].size = size;
7: non_durable_entries++;
8: }
9: commit() {
10: order_fence();
11: state = COPYING;
12: sync_fence();
13: // 'offset' is the delta between 'back' and 'main'
14: for (i : non_durable_entries) {
15: memcpy(log[i].addr+offset, log[i].addr, log[i].size);
16: }
17: order_fence();
18: state = IDLE;
19: }
When used for transactions, romulus always does 4 fences (3 orders plus 1 sync), regardless of the size of the transaction. No matter how many write()s are done in the transaction. Similarly to redo-log, this number can be brought down to 2 fences, but the algorithm becomes more complex.
Pros/Cons:
Advantage 1) Reading and writing is done in place, directly from the data;
Advantage 2) Transactions are unbounded in size. The volatile log can grow dynamically and if it grows too much, we can just stop using it and copy the entire 'main' to 'back' at the end of the transaction. No other technique supports transactions of unbounded size;
Advantage 3) Write amplification is 1;
Advantage 4) Four fences per transaction;
Disadvantage 1) Only half the capacity of the media is available to the user. The redo-log and undo-log need to pre-reserve space for their logs but rarely do applications reserve half the capacity of the media to the logs, giving the advantage to undo/redo/cow.
Whether we're talking about filesystems, DBMS or some other kind of application that requires durability with resilience to failures, it must use one or a combination/variant of these four techniques. Each of them has different trade-offs in terms of performance, memory usage, ease-of-implementing and ease-of-use. In the end, it's an engineering choice which of these four algorithms is best suited for a particular application.
Subscribe to:
Posts (Atom)