
Sunday, May 17, 2020

Dealing with bugs in Concurrency and Durability

Everyone knows that concurrency is hard; in fact, it's NP-hard: the total number of interleavings typically grows exponentially with the number of lines of code (LOC) in a program. Determining whether any of these interleavings causes incorrect behavior becomes problematic for anything but very small programs.
This has never prevented anyone from writing large multi-threaded programs, and the usual trick is to use locks. A code block surrounded by a lock can not interleave with other code blocks protected by the same lock, and the problem becomes "tractable" again.
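
Just to make this concrete, here is a minimal sketch (my own illustration) of why locks make the reasoning tractable: the increments inside the critical section below can never interleave with each other, so we can treat the whole block as a single step.

#include <mutex>
#include <thread>

std::mutex lock;
long counter = 0;

// The critical section below can not interleave with itself in another thread,
// so we reason about it as one atomic step instead of reasoning about every
// possible ordering of its individual instructions.
void addOne() {
    std::lock_guard<std::mutex> guard(lock);  // acquire
    counter++;                                // no interleavings in here
}                                             // release

int main() {
    std::thread t1(addOne), t2(addOne);
    t1.join();
    t2.join();
    return 0;
}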

Unfortunately, we can't use this trick for lock-free and wait-free code. Lock-free algorithms are typically built from atomic instructions that can interleave with each other.
Verifying correctness of these algorithms is a difficult problem.
There are static verification tools like SPIN, JSF, TLA+ and others, but such tools only work for a small number of threads (no more than 6) and small programs (less than 100 LOC). As if this wasn't bad enough, these tools force the user to write the program in a particular language, apart from JSF, which is almost Java. The program then has to be re-written in the target language it is meant for, and during this "translation" errors can be introduced.

Now, don't get me wrong, software has bugs. It's just part of life as a software developer to deal with them and try to minimize them.
Entire industry trends like Test Driven Development (TDD) and Test Automation revolve around the concept of testing to reduce the amount of bugs.
In other words, there's this idea that the more time you invest in tests, the higher the quality of the final program. IMO this is true, but it's obviously not a linear effort: beyond a certain number of tests, the number of new bugs you'll detect will be small compared to the effort of maintaining those tests.

Concurrency bugs can, and should, also be approached in this manner, but this is far from sufficient.
Moreover, testing a concurrent program requires a different set of techniques and mindset than testing a sequential program: instead of exercising input-output responses, we should test program invariants; instead of mocking components, we should stress the code by increasing the number of threads or changing timings to exercise different interleavings and code paths.
This is where run-time verification can be of great help. Tools like valgrind and Address Sanitizer become must-haves if you're writing C/C++ multi-threaded applications, and there is a good deal of research being done in this field, so I'm sure that more capable and more efficient tools will show up in the future. However...
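
To give an idea of what I mean by testing invariants instead of input-output responses, here is a minimal sketch of such a stress test. The data structure under test is just a mutex-protected std::set standing in for whatever concurrent set you actually want to test, and the invariant checked at the end is that every key a thread inserted is still present, no matter how the operations interleaved.

#include <cassert>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Stand-in for the data structure under test (replace with the lock-free set).
struct SetUnderTest {
    std::mutex m;
    std::set<long> s;
    void insert(long k) { std::lock_guard<std::mutex> g(m); s.insert(k); }
    bool contains(long k) { std::lock_guard<std::mutex> g(m); return s.count(k) != 0; }
};

int main() {
    const int numThreads = 8;          // crank this up to exercise more interleavings
    const long keysPerThread = 10000;
    SetUnderTest set;
    std::vector<std::thread> threads;
    // Each thread inserts a disjoint range of keys.
    for (int t = 0; t < numThreads; t++) {
        threads.emplace_back([&, t] {
            for (long k = 0; k < keysPerThread; k++) set.insert(t*keysPerThread + k);
        });
    }
    for (auto& th : threads) th.join();
    // Invariant: every inserted key is present, regardless of the interleaving.
    for (long k = 0; k < numThreads*keysPerThread; k++) assert(set.contains(k));
    return 0;
}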

Suppose I have a new lock-free algorithm and I implement it. Then I bother to write some tests for it (which most researchers won't do), and run it with Address Sanitizer and whatever other run-time tools I can get my hands on. None of my tests or run-time tools finds any error. Now what? Is my algorithm ready for publication? How will the peer reviewers know that it is correct? How will I convince them of that?
And here lies the big problem in researching concurrent algorithms that people don't usually talk about: "How do I know if the lock-free algorithm in this paper is correct?"

As I said before, it's ok to have bugs in your code, all software has bugs, and software developers are used to this inevitability.
What is not ok is to have bugs in your algorithm. It's very hard to prove the correctness of a lock-free algorithm, although it's usually simple to verify that a bug exists once someone tells you the sequence of steps to reproduce it.

This can lead to surreal scenarios that typically look like this:
Me: Read the latest paper on a novel lock-free algorithm. Spend weeks thinking about it. Come up with a scenario where the algorithm doesn't work. Send an email to authors describing the scenario.
Authors: Spend a week thinking if the scenario is actually a bug and how to fix it, propose a slight variation of the algorithm.
Me: Spend another week thinking about it, come up with a different scenario where there is a race condition.
Authors: Spend another week thinking about it and propose a modified algorithm.
Me: Spend another week thinking about this different algorithm. Get to the conclusion that the algorithm is no longer lock-free. Send a new email explaining why.
Authors: Spend another week thinking about it and propose yet another version of the algorithm.
Me: Spend another week thinking about it, only to reach the conclusion that this new algorithm is not even linearizable.
Authors: We weren't aiming for linearizability anyways...
Me: Lose all faith in the work done in the field of concurrent algorithms, and in mankind in general... at least for a couple of weeks, then I read another paper that looks interesting and the cycle restarts.

This may sound bad, but it's a description of when things turn out well. Other times I send the email and the authors just ignore me, or they answer in a way that shows they don't even understand there is a bug, or there is no bug at all and the whole thing was a result of me missing some vital detail in the algorithm.
Ultimately, either I shame the authors by pointing out their mistakes, or I shame myself for being too dumb to understand their paper. Never a good outcome, and it's not a good way of "making friends". Unfortunately I can't help myself: I'm in it because I want to learn, and you learn the most when you're willing to make a fool of yourself ;)

This is not about incompetence, this is about the inherent difficulty in writing lock-free algorithms. The authors are not sure that their algorithms are correct, the reviewers are not sure that the algorithms are correct and the readers of the papers are not sure that the algorithms are correct.  And don't get me started on the actual "implementations" of these algorithms.

It's a mess, and it creates a lot of distrust in the field, especially during peer reviewing.
Papers end up being accepted only if there is an expert reviewer who really knows this stuff and decides to spend the several weeks of personal time it takes to fully understand the algorithm, and then comes back to the other reviewers and manages to convince them that it is correct.
Most of the time, academic papers get accepted just because they're not obviously wrong. But that may simply be because the algorithm is so complex that no one is willing to "dig into it". I've seen examples of papers accepted into major conferences where the basic algorithm was wrong (with no obvious way of fixing it), and the authors remain convinced that the approach is valid and correct, write follow-up papers that build on that algorithm, and those papers continue to be accepted in major conferences, when it's all crap.
Luckily, this is rare. Most of the bugs I've seen in lock-free algorithms are things that can be fixed without a major re-design of the algorithm.

There is also the effect that if the algorithm is too simple (there aren't many of those, but still) then the reviewers may think the work doesn't have enough value, but if the algorithm is too complex, then they won't be able to tell whether it is correct or not.
This kind of creates an incentive for papers to present incremental algorithms based on prior published work, or several simple algorithms as opposed to a single complex one.

I've noticed that when it comes to durability (persistence), the problem is similar.
Durable algorithms are not NP-hard: a failure can occur at any single point in the code, which means the complexity is linearly proportional to the LOC, as opposed to concurrent algorithms, where the complexity grows exponentially with the LOC.
However, the pattern of me reading papers on durable algorithms, then sending emails, and then the authors fixing but not really fixing, is pretty much the same.

It's still a hard problem, particularly now with Optane DC Persistent Memory that has a special set of rules for data to be correctly persisted.
On Persistent Memory it is very easy to forget a flush or a persistence fence. This means the implementation is incorrect; however, in practice it may never show an issue, because the probability of having a crash exactly in the place where the fence is missing is very low.
On the other hand, if there are fences missing all over the place, or if something is being used that simply doesn't make sense, then no matter where the program crashes, the data will end up corrupted. So there are different shades of grey here.
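
To give a feel for those rules, here is a minimal sketch of what a single persist looks like on x86 with clwb and sfence (assuming a CPU and compiler with CLWB support; this shows the general pattern only, not a complete recipe for any particular algorithm).

#include <immintrin.h>   // _mm_clwb(), _mm_sfence(); needs CLWB support (e.g. -mclwb)
#include <cstdint>

// Minimal sketch of a "persist": store, flush the cache line, then fence.
// Forgetting either the clwb or the sfence is exactly the kind of bug that
// only shows up if the crash lands in that small window.
static inline void persist(void* addr) {
    _mm_clwb(addr);      // ask for the cache line to be written back to PM
    _mm_sfence();        // order the write-back before subsequent stores
}

void storeAndPersist(uint64_t* pmAddr, uint64_t value) {
    *pmAddr = value;     // store to a location in persistent memory
    persist(pmAddr);     // without this, the value may not be in PM when a crash hits
}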

Here, again, the problem repeats itself when it comes to getting a paper accepted at a conference: unless one of the reviewers really knows their stuff and is willing to invest the time it takes to do a deep dive into the algorithm presented in the paper, getting a "correct" durable algorithm accepted is just a game of luck. The best that can be done is to write a convincing proof, or at least explain some invariants.

I believe that as time goes by, there will be better tools to help write and verify correct durable algorithms. The thing is, by the time that happens (years from now), all or most of the interesting durable algorithms will already have been published, which kind of defeats the purpose. Moreover, for the complicated stuff we can always use transactional durability techniques (undo-log, redo-log, copy-on-write, shadow-data) or a universal construction, like the one we presented at EuroSys 2020: link here.
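
For reference, here is a minimal single-threaded sketch of the undo-log idea (my own illustration; a real PTM also has to deal with concurrency, log capacity and allocator metadata). The persistRange() helper is a placeholder for a flush-plus-fence sequence over a range of bytes.

#include <cstddef>
#include <cstdint>

// Placeholder: on real PM this would be clwb over the range followed by sfence.
static inline void persistRange(void* addr, size_t len) { (void)addr; (void)len; }

// Minimal undo-log: before modifying a word in PM, save its old value in a
// persistent log. After a crash, recover() replays the log backwards to undo
// the partial transaction, giving all-or-nothing semantics.
struct UndoEntry { uint64_t* addr; uint64_t oldVal; };

struct UndoLog {
    UndoEntry entries[1024];   // the log itself must live in PM
    uint64_t  size = 0;

    void logAndStore(uint64_t* addr, uint64_t newVal) {
        entries[size] = { addr, *addr };               // save the old value first...
        persistRange(&entries[size], sizeof(UndoEntry));
        size++;
        persistRange(&size, sizeof(size));
        *addr = newVal;                                // ...only then modify in place
        persistRange(addr, sizeof(uint64_t));
    }

    void commit() { size = 0; persistRange(&size, sizeof(size)); }  // empty log == committed

    void recover() {                                   // called after a restart
        for (uint64_t i = size; i > 0; i--) {          // undo in reverse order
            *entries[i-1].addr = entries[i-1].oldVal;
            persistRange(entries[i-1].addr, sizeof(uint64_t));
        }
        commit();
    }
};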

For concurrency it's really black and white: you miss a fence, your algorithm is wrong, and you WILL see a bug happen eventually.
I don't see any real solution for concurrent algorithms coming anytime soon. And that's a shame, because as long as things remain the way they are, it will be very hard to distinguish the good (correct) algorithms from the bad (incorrect) ones. It becomes a kind of "reputation game": papers from people who have made correct algorithms in the past will likely be correct later as well, but this is not really reliable, seeing as we're all human and bound to make mistakes at some point. For now, all we can do is test our implementations of these algorithms as extensively as possible so as to minimize the chances of bugs, but it will never be 100% certain.

This discussion may seem a bit boring: software will always have bugs, so what's the problem with having buggy algorithms?
My argument is that it's a big deal. Actually, it's a very BIG deal! I've seen many papers that build on previous papers, and when I go look at the previous work they're based on, that work is incorrect. We need solid foundations on which to build, to be able to progress further and faster.
If we don't give proper value to the correct algorithms and point out the incorrect ones, then we're creating the wrong incentive: researchers will not bother to spend the time testing or validating their algorithms. This creates the climate of suspicion that we're in today.

Ultimately, designing lock-free algorithms will never be "easy", but designing lock-free and persistent data structures is easy when you use universal constructions or a lock-free software transactional memory. This means all these problems are an issue for the researchers of concurrent algorithms, but not really a problem for people who just need a persistent or lock-free data structure.

Sunday, October 20, 2019

Why use a PTM in your application?

When adapting or creating new applications that use Persistent Memory (PM), developers are faced with two main choices: using a PTM or manually placing flushes and fences.

If they choose hand-placed fences, they have to worry about the correct placement of the clwbs and sfences, and about the situations in which non-temporal stores are a better choice performance-wise.
Placing these fences correctly can be a difficult task that even expert researchers will occasionally get wrong.
One incorrect fence or flush is enough to prevent the recovery of the application in the event of a crash, thus losing or even permanently corrupting data in PM.
Moreover, dealing with concurrency must also be done by hand, yet another non-trivial task.
The difficulty of finding expert developers capable of dealing with both durability and concurrency is likely the reason why database engines are costly pieces of software that require a team of experts to assemble.

If the application developer chooses instead to use a PTM, the concerns of how to make the code durable and concurrent disappear.
The developer's task now becomes identifying which blocks of code and data must be accessed atomically, and encapsulating those inside a transaction. Notice that identifying the atomicity of the data is a task required even when choosing hand-placement of flushes and fences.
Another advantage of using a PTM is that any modifications to allocator metadata during allocation or de-allocation become part of the transaction, implying that there will be no persistent memory leaks due to loss of metadata when a crash occurs.
For the end-developer, the code within a transaction block can be reasoned about as if it were sequential.
This shift of complexity from the application developer to the PTM library implementer tremendously increases development speed for the application developer, reduces bug count and improves code maintainability.
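
To illustrate what this looks like for the end-developer, here is a minimal sketch with a hypothetical lambda-based PTM API, loosely inspired by the style of OneFile and Romulus; the names and the lock-based stub are assumptions for illustration only, since a real PTM puts the logging, flushes and fences behind updateTx().

#include <cstdint>
#include <mutex>

// Hypothetical PTM interface. The stub below only provides atomicity through a
// global lock; a real PTM also provides durability (logging, flushes, fences)
// and, depending on the PTM, lock-free progress.
namespace ptm {
    std::mutex gLock;
    template<typename F> void updateTx(F&& txBody) {
        std::lock_guard<std::mutex> g(gLock);
        txBody();
    }
}

// Persistent application state (with a real PTM this would live in PM,
// reachable from a root pointer).
struct Account { uint64_t balance; };

// The developer only identifies what must be atomic and wraps it in a
// transaction; concurrency and durability are the PTM's problem.
void transfer(Account& from, Account& to, uint64_t amount) {
    ptm::updateTx([&] {
        from.balance -= amount;
        to.balance   += amount;
    });  // with a real PTM: either both stores become durable, or neither does
}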

Sunday, October 6, 2019

Recovery on a hand-made durable data structure versus recovery on a Persistent Transactional Memory

A durable data structure is a data structure in Persistent Memory (PM) or Non-Volatile Memory, if you prefer.
These data structures have to be resilient to failures. In other words, if there is a power cut during an operation on the data structure, the data structure must be correctly recovered after the system restarts.
For the rest of this post, and likely all other posts you read on this site, when I say "durable data structure" I mean a data structure that has "atomic durability": in the event of a failure, the side effects of all completed operations will be visible upon restart and recovery.

There are two main approaches to having a durable data structure in PM:
1) Write an algorithm for a new data structure. This algorithm has to be failure-resilient;
2) Write a regular data structure and wrap it in a PTM;

Designing a new data structure is typically the subject of an entire research paper, while using a PTM is not. Researchers tend to lean towards new data structures because it's more fun (we can't get a paper accepted where we describe "using" a data structure). Which approach is better depends on what you mean by "better", but my favorite is 2).

However, let's say you decide to use a hand-made durable data structure that someone else made. Now, how do you use it in your program?

Most applications have many data structure instances. For an application that needs durability (or persistence, if you prefer), not all data needs to be persistent, which means that not all data structures need to be persistent either. However, let's assume there is more than one durable data structure instance.
When a crash occurs, it could have occurred during an operation on *any* of these data structures. This means we need to call the recovery() method for each of these data structure instances, just in case.

This by itself creates a nuisance, because now you need to keep an explicit list of all the durable data structures in your application and iterate through all of them in your recovery code:
void recovery() {   // called after a restart
  ds1.recovery();
  ds2.recovery();
  ds3.recovery();
  ...
}

If you forget to add one of the data structures to the main recovery method, you risk doing operations on an inconsistent data structure and further corrupting your data. In PM, when you corrupt the data it's gone forever... unless you have backups, and even then you're going to lose all data since the last non-corrupted backup.

Another difficulty is that the number of durable data structures may be dynamic, because the application creates and destroys data structures during execution to store some data. This means it's not possible to list each data structure instance in the recovery() method, because they may not exist when the program restarts. You will typically need a registration mechanism where newly created data structure instances are registered and later de-registered when they are destroyed. Also, you need to save a root pointer to each of those data structure instances.
PTMs handle this transparently because transactions have all-or-nothing semantics. If new data structure instances are created or destroyed, you can use the root pointer API (in OneFile and Romulus we called it get_object() and put_object()). When there is a crash, any modifications made by ongoing transactions will be reverted, including the allocation or de-allocation of a data structure instance and the adding or removal of a root pointer. Because a PTM transaction can encompass the modifications to the data structure, the allocation/de-allocation of that data structure, and the adding/removing of the instance as a root pointer, transactions are easy to reason about and keep your code consistent at all times.
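
As a rough sketch of how this looks in code, using the get_object()/put_object() names mentioned above (treat the exact signatures, and the tmNew() allocation call, as assumptions for illustration rather than the real API):

// Sketch only: hypothetical PTM calls (updateTx, tmNew, put_object, get_object)
// in the spirit of the OneFile/Romulus APIs; names and signatures are assumptions.
struct PQueue { /* some durable queue */ };

void createQueue() {
    ptm::updateTx([&] {
        PQueue* q = ptm::tmNew<PQueue>();   // allocate the instance in PM, inside the tx
        ptm::put_object(3, q);              // register it as root pointer number 3
    });  // a crash anywhere in here reverts both the allocation and the registration
}

void afterRestart() {
    // No per-instance recovery() to call: just re-attach through the root pointer.
    PQueue* q = ptm::get_object<PQueue>(3);
    (void)q;
}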

Ok, ok, so it's not that bad for the hand-made durable data structures. We can do automatic checks for these scenarios by adding registration mechanisms in the constructor of the data structure, or even some kind of static check or compile-time assertion to make sure everything is as it should be.
But the thing is, if you use a PTM, you only have to call the recovery method of the PTM, one time, regardless of the number of data structure instances you have in your application!
In the case of OneFile, it has null recovery, which means there is no recovery method: the PTM simply continues execution where it left off.


This is yet another thing where PTMs are better than hand-made durable data structures.

Friday, September 20, 2019

Can I use a volatile allocator in Persistent Memory?

Spoiler alert, the answer is NO!

In this post I'm going to try to explain why "allocators" are such a vital part of using persistent memory. And when I talk about persistent memory (PM) I'm talking about NV-DIMMs like Intel Optane DC Persistent Memory.

When we run an application (in volatile memory), the allocator is just a passing thought. If the application does a lot of heap allocation/de-allocation, then having a fast allocator is important, but performance is about the only consideration. The truth is that the stock allocator that comes with Windows/Linux/iOS is fine for the majority of applications. For the exceptions, there are things like jemalloc: https://github.com/jemalloc/jemalloc

When we run an application in persistent memory (PM) then it's a whole different issue!
Applications running in PM need to be failure-resilient. This means they have to recover correctly after a crash (think databases and filesystems). This requirement also applies to the allocator, and volatile allocators do not provide failure-resilience.

Suppose you have a durable queue in PM and you allocate a block for a new node. You allocate a node, place item A inside the node and then insert the node in the queue. Some time later you allocate a new node for item B and place that node in the queue. Afterward, you allocate a new node, place item C inside it, and just before you complete the linking of the new node, the system crashes!
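
As a sketch of where that crash lands, this is roughly what the enqueue path looks like (simplified and ignoring concurrency; persist() and pmAlloc() are hypothetical helper names, the former being the usual flush+fence and the latter a persistent allocator):

#include <cstddef>

// Hypothetical helpers: a persistent allocator and a flush+fence persist.
template<typename T> T* pmAlloc();
void persist(void* addr, size_t len);

struct Item  { char data[32]; };
struct Node  { Item item; Node* next; };
struct Queue { Node* head; Node* tail; };

// Simplified, single-threaded sketch of the enqueue path described above.
// The crash in the story for item C lands after step 2, before steps 3/4 finish.
void enqueue(Queue* q, const Item& item) {
    Node* node = pmAlloc<Node>();            // 1) allocate the node (persistent allocator!)
    node->item = item;
    node->next = nullptr;
    persist(node, sizeof(Node));             // 2) make the node's contents durable
    q->tail->next = node;                    // 3) link it after the current tail
    persist(&q->tail->next, sizeof(Node*));
    q->tail = node;                          // 4) swing the tail to the new node
    persist(&q->tail, sizeof(Node*));
}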

Upon restart, you call the recovery() method of the durable queue to recover any inconsistency in the queue. The recovery may decide to append node C or disregard it (depending on the specifics of the queue algorithm), but nodes A and B will for sure be in the queue, assuming this queue has a consistency equivalent to "durable linearizability" (that's what I mean when I talk about a "durable queue").

Notice that the allocator metadata has been reset.

Ok, so now you have a consistent queue with items A and B.

What happens if you were using a volatile allocator?
Well, all the metadata of the allocator disappears during the crash, which means that when you restart the application, the allocator will be "reset".
Everything will be fine until the first allocation you do, at which point the allocator may give you a block that it had already given you before.
For example, if you then decide to continue the operation and insert item C, when you request a new node the allocator will most likely give you the same block of memory it gave you the first time, where you stored item A. This is in effect a "corruption", because now the application is going to place item C into this node, overwriting the contents of the old node (already in the queue) and making A disappear.
Your queue is now "head -> node C -> node B -> tail"

Even though you're using a durable queue, you've just corrupted your data because you decided to use a volatile allocator.
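
Here is a toy simulation of that failure (my own illustration): the arena below stands in for PM and survives the "restart", but the allocator metadata, a simple bump offset, is volatile, so after the reset the allocator hands out node A's address again and item C overwrites it.

#include <cstddef>
#include <cstdint>
#include <cstdio>

uint8_t arena[1024];       // stands in for PM: its contents survive the "crash"
size_t  bumpOffset = 0;    // volatile allocator metadata: lost on the "crash"

void* volatileAlloc(size_t sz) {
    void* p = &arena[bumpOffset];
    bumpOffset += sz;
    return p;
}

int main() {
    // Before the crash: nodes for items A and B.
    char* nodeA = (char*)volatileAlloc(64);  *nodeA = 'A';
    char* nodeB = (char*)volatileAlloc(64);  *nodeB = 'B';
    // Simulated crash and restart: the arena survives, the allocator metadata does not.
    bumpOffset = 0;
    // The first allocation after restart returns node A's address again, so writing
    // item C overwrites node A, which is still linked in the queue.
    char* nodeC = (char*)volatileAlloc(64);  *nodeC = 'C';
    std::printf("node A now holds '%c'\n", *nodeA);   // prints 'C' -> corruption
    (void)nodeB; (void)nodeC;
    return 0;
}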


There are, however, many research papers that take this approach, and in my view this is nonsensical.
I understand that they want to show their algorithm's performance irrespective of the shortcomings of whatever allocator they're using, but the thing is, an algorithm for a durable data structure cannot be correctly deployed without a persistent allocator.
When we use a PTM (Persistent Transactional Memory), the allocation is done inside the transaction, along with all the modifications to any user data or data structure. Because all of this is done durably, with all-or-nothing semantics (an ACID transaction), PTMs have no such corruption issues.

The gist of it is that, to have a "correct" durable data structure you need to use a "correct" durable allocator.
If you're using a PTM, then you get both of these together, and when you measure the performance you're measuring the performance of both the data structure and the allocator.
If you're using a standalone durable data structure, then you need to add a durable allocator and account for it as well, because one day, when a user deploys that data structure, he/she will have to do it with a durable allocator, which means that the durable allocator has to be taken into account in any kind of benchmark.
What's the purpose of having a super-duper-fast durable data structure that's not correct?!? After all, users don't typically want to use durable data structures that get corrupted after a restart ;)

This kind of corruption issue when using a volatile allocator is just the tip of the iceberg.

Next, comes leakage.
Let's talk about it next week.

PS: Both Romulus and OneFile deal with these kinds of issues correctly. Any data structure you place within these PTMs will have correct recovery after a crash, and this includes allocations/de-allocations.