Tuesday, May 14, 2019

OneFile is also a database engine

In this post we'll talk about OneFile, the world's first wait-free Software Transactional Memory that works on Persistent Memory, and why it's actually a (wait-free) database storage engine.

An STM (Software Transactional Memory) is a library which provides dynamic transactions over multiple objects with linearizable consistency. In effect, an STM provides Atomicity, Consistency and Isolation, but not Durability, i.e. and STM gives ACI transactions.
A Persistent Transactional Memory (PTM) is just like an STM except it also gives Durability, or as I like to call it "failure-resilience". PTMs are capable of providing fully ACID transactions, as long as they're used with byte-addressable Persistent Memory (PM).

Examples of Persistent Memory are the Optane Persistent Memory NV-DIMMS:
or the battery backed NV-DIMMS by HP or Viking:

Not all PTMs provide ACID with serializable isolation.
In the concurrency literature there are two major distinctions for the consistency of durable transactions:
- durable linearizable: pretty much equivalent to what the databse community calls "serializable isolation" or full ACID transactions;
- buffered durable linearizable: means that transactions are linearizable however, in the event of a failure, a non-specified amount of transactions will be lost up to the current one. The system is guaranteed to be restored to a consistent state in the past, just not necessarily the last committed transaction.

All the other cases in between, which the database folks have names for such as "dirty reads", "non-repeatable reads", "phantom reads", we just call those "wrong" or "bugs" because that's how most users think of them.

Last year we made a PTM with (blocking) durable linearizable mutative transactions and wait-free read-only transactions called RomulusLR:
This PTM has ACID transactions and it's pretty fast due to its combining technique, but it does serialize all mutative transactions. Other PTMs like Mnemosyne don't have this limitation, but they have other limitations.
A pretty well know PTM is libpmemobj in PMDK by the Intel guys, which understandably use it as a way to sell more Optane persistent memory  ;)
libpmemobj only has durable transactions and leaves the concurrency part up to the user, either by using a global lock, or fine grained locking (good luck with that!).  
PMDK is an awesome piece of software because it has many tools that help enable the usage of persistent memory, not to mention all the effort that the Intel guys have put in in defining standards so that it all works with the PM products from different vendors.

We made a variant of OneFile for PM. In this variant the memory reclamation is done in a different way. We'll leave those details to its own blog post, but the cool thing about OneFile for PM is that it comes with its own memory allocator for persistent memory:
Even if you don't have persistent memory, you can use a region of your DRAM memory and mmap on it. This will give you durability for process failure but not for system failure. In other words, if the process that is doing a transaction gets killed, the information will be recovered correctly, but if the machine dies (gets unpluged or reboots) then the data is lost (unless you have a persistent memory module in your machine).

The allocator that comes with OneFile is very simple, doesn't do coalescence of blocks or anything like that. We call it EsLoco (Extremely Simple Allocator).
Other allocators can be used, as long as annotation is added to them by adding the template tmtype to each struct/class member of the allocator metadata.
Because the changes to allocator metadata are part of the transaction, there will not be any leaks in the event of a failure and all the operations are wait-free. This means that this allocator is capable of allocating and de-allocating blocks of data in a wait-free way.
In summary, by using OneFile, we've made the world's first wait-free allocator!
So ok, the bad news is that it serializes all operations, whether allocations or de-allocations. Don't expect any scalability from it, but on the other hand, it will be hard to find anything better in a multi-threaded environment when it comes to tail latency.

Another interesting aspect is the number of fences.
When doing a durable transaction in block storage (disk) the typical trick is to do two fsync() or two fdatasync(). The cost of the fsync() is significant because it's about the order of the latency of a round trip to the media device, i.e. the disk's latency.
In PM, the equivalent is to do two pfence (SFENCE instructions on x86). The cost of the pfence is also the cost of a round trip to the device, but this is sub-microsecond so it's not that costly.
Libpmemobj in PMDK uses an undo-log technique, which means it executes two SFENCE per modified range/object in a transaction. Even though the pfence are 1000x cheaper than calling fsync,() if we do too many of them it becomes a performance bottleneck. In OneFile we did our best to reduce the number of pfence to a minimum and reduced them to 2 per transaction, regardless of the number of objects modified during the transaction.

In effect, OneFile is the world's first wait-free database engine. When coupled with persistent memory, it is capable of providing transactions with serializable isolation.
A relevant question would be:
Like all other databases, it has to have a recovery mechanism for when a failure occurs... right?
Actually, no it doesn't.
OneFile is the first database engine with null-recovery. Null-recovery is a property of certain lock-free algorithms which was first introduced in a paper by Izraelevitz et al. in 2016
Basically it says that some lock-free (and wait-free) algorithms have this property of simply continuing where they left off after a failure, meaning that they don't need a specific recovery procedure to recover after a crash.
This property is theoretically appealing but no database engine or PTM had so far been able to do it. With OneFile we've shown a practical implementation of it and we showed that database recovery can be done with the cost of applying the last incomplete transaction (it's actually possible to do even less).
The database research community has entire papers where they have complicated techniques to improve the time it takes to recover a database: https://research.fb.com/wp-content/uploads/2016/11/fast-database-restarts-at-facebook.pdf?
With OneFile + PM, we've reduced recovery time to the time of a single transaction!

... and did I mention that transactions in OneFile PTM are wait-free and have serializable isolation?  :)

Next week we talk about wait-free memory reclamation

No comments:

Post a Comment