Monday, September 23, 2019

Leaks in Persistent Memory

To continue on last week's topic, today we're going to go over some leaks in persistent memory.

Suppose you are using a persistent allocator, i.e. one whose metadata is kept in persistent memory AND that is resilient to failures.
Now suppose your durable data structure (a queue for simplicity) has the following code:
1:  newNode = (Node*)malloc(sizeof(newNode));
2:  newNode->item = Z;
3:  tail->next = newNode;

What happens if there is crash between lines 1 and 2?
The node has been allocated but not yet put on the queue. This means that it is permanently lost, i.e. leaked.
In volatile memory this isn't much of problem, if the application leaks memory, just restart it every once in a while and the leaks will disappear, but in PM that leaked node is now part of the durable (permanent) data and is there forever.
And when I say "forever" I mean until you throw away your data, or maybe you can figure out a way to export your data into a file, reset the PM and re-import your data, hopefully using an allocator that doesn't leak.

Generally speaking, in PM, leaks are there for all eternity (the lifetime of the data).

Having a crash is typically a rare event however, a multithreaded application may have multiple ongoing operations, one on each threads.
Theoretically, we could have each thread leaking one object during a crash, thus multiplying the number of leaks per crash by the number of threads in the apllication, in the worst-case scenario.
If we conside that each object is a node in a queue, then this isn't much to worry about.... chances are that crashes are rare and leaking a few hundred bytes or even KB per crash is not something to worry about.

What if the crashes aren't that rare, or what if the the allocations are for very large objects?
Then you start to sum up KB fast, or even MB, and this can have an impact in the total amount of memory.

If you had an application that leaked files to the disk every once in a while, would you find it acceptable?
The answer to this is likely to be "yes, if I can remove those files". This is exactly what happens when you go in Windows and delete temporary files created by your browser

The problem is that when using an allocator in PM, there is no simple way to know if the objects are still in use or not (referenced by other objects or one of the root pointers) and this means that a leak is not something easy to identify (just ask any C/C++ programmer).
Even tools like Address Sanitizer (ASAN) and Valgrind only identify the leaks when the (volatile) program ends. Doing it for PM is even harder.

Using a standalone allocator leaves the door open for leaks in the event of a crash. Is there a way around it?
Yes, use a PTM instead.

If you re-write the code such that the modifications are part of the PTM transaction along with the modifications on the allocator metadata, then these kind of leaks will no longer occur. The code will become something like:
1:  beginTx();
2:  newNode = (Node*)malloc(sizeof(newNode));
3:  newNode->item = Z;
4:  tail->next = newNode;
5:  endTx();

Until endTx() commits, the transaction will not become visible and therefore, in the event of a crash, any modifications will be reverted (details depend on whether it's an undo-log or redo-log).
Even if there is a crash between lines 2 and 4, the PTM guarantees the transactional semantics, reverting all modifications to the allocator metadata and therefore reverting the allocation done in 2, safely and correctly.
This is what PTMs like Romulus and OneFile do  :)

Ultimately, leaks in application code are likely to be the worst issue for long running data in PM and for that, there is no easy solution. There are application-side approaches to deal with this problem, but so far there is no really efficient generic solution.
Maybe one day someone will do a nice Garbage Collector for PM, but GCs aren't all the wide spread for C/C++ volatile applications, doing one for PM seems an even thougher challenge. And yes, the JVM has a GC, but the JVM and PM don't play along together all that well (yet).

Friday, September 20, 2019

Can I use a volatile allocator in Persistent Memory?

Spoiler alert, the answer is NO!

In this post I'm going to try to explain why "allocators" are such a vital part of using persistent memory. And when I talk about persistent memory (PM) I'm talking about NV-DIMMs like Intel Optane DC Persistent Memory.

When we run an application (in volatile memory) the allocator is just a passing thought. If the application does a lot of heap allocation/de-allocation, then having a fast allocator is important, but performance is about the only consideration. The truth is that the stock allocator that comes with windows/linux/iOS is fine for the majority of applications. For the exceptions, there are things like jemalloc.

When we run an application in persistent memory (PM) then it's a whole different issue!
Applications running in PM need to be failure-resilient. This means they have to recover correctly after a crash (think databases and filesystems). This requirement also applies to the allocator, and volatile allocators do not provide failure-resilience.

Suppose you have a durable queue in PM and you allocate a block for a new node. You allocate a node, place item A inside the node and them insert the node in the queue. Some time later you allocate a new node for item B and place the node inside the queue. Afterward, you allocate a new node, place item C inside the node and just before you complete the linking of the new node, the system crashes!

Upon restart, you call the recovery() method for the durable queue to recover any inconsistency in the queue, which in this case may decide to append node C or disregard it (dependig on specific of the queue algorithm), but nodes A and B will for sure be in the queue, assuming this queue has a consistentcy equivalent to "durable linearizability" (that's what I mean when I talk about a "durable queue").

Notice that the allocator metadata has been reset.

Ok, so now you have a consistent queue with items A and B.

What happens if you were using a volatile allocator?
Well, all the metadata of the allocator will disappear during the crash, which means that when you restart the application, the allocator will be "reset".
Everything will be fine, until the first allocation you do, at which point the allocator will decide to give you a block which had already given you before.
For example, if you then decide to continue the operation and insert item C, when you request a new node, the allocator will most likely give you the same block of memory that gave you the first time, where you stored item A. This is in effect a "corruption" because now the application is going to place item C into this node, overwriting the contents of the old node (already in the queue) and making A disappear.
Your queue is now "head -> node C -> node B -> tail"

Even though you're using a durable queue, you've just corrupted your data because you decided to use a volatile allocator.

There are however many research papers that take this approach and in my view this is non-sensical.
I understand that they want to show their algorithms' performance irrespective of the shortcomings of whatever allocator they're using, but the thing is, an algorithm for a durable data structure can not be correctly deployed without a persistent allocator.
When we use a PTM (Persistent Transactional Memory), the allocation is done inside the transaction along with all the modifications to any user data or data structure. Because all of this is done durably with all-or-nothing semantics (and ACID transaction) PTMs have no such corruption issues.

The gist of it is that, to have a "correct" durable data structure you need to use a "correct" durable allocator.
If you're using a PTM, then you get both of these together and when you measure the performance you're measruing the performance of both the data structure and the allocator.
If you're using a standalone durable data structure then you need to add a durable allocator and account for it as well, because one day, when the user deploys that data structure he/she will have to do it with a durable allocator which means that that durable allocator has to be taken into account when doing any kind of benchmark.
What's the purpose of having a super-duper-fast durable data structure that's not correct?!?After all, users don't typically want to use durable data structures that get corrupted after a restart  ;)

This kind of corrupting issue when using a volatile allocator is just the tip of the iceberg.

Next, comes leakage.
Let's talk about it next week.

PS: Both Romulus and OneFile deal with these kind of issues correctly. Any data structure you place within these PTMs will have correct recovery after a crash and this includes the allocations/de-allocations.