Concurrency Freaks: 2021

Friday, December 24, 2021

The importance of Correctness in concurrent algorithms

My work is in the area of concurrent algorithms; therefore, this post is going to be largely biased by my experience in this field. Having said that, I suspect that most of what I'm going to say is applicable to many other fields in Computer Science and maybe to other disciplines as well.

To me, the creative process in the field of Concurrent Algorithms is composed of three main steps: The Idea, The Algorithm, and The Implementation.

The Idea

Before we start doing anything we need an idea. Sometimes this is a well-formed idea, other times it's just an insight or trick that we feel can be used to solve a particular problem in a way that no one has done before. Maybe it's a way to solve the same problem but faster, or do it using less memory. Whatever it is, it gives you an advantage somehow.

Most people are capable of having ideas, but most ideas are useless, either because they have already been discovered, or because they are wrong, or because they lack understanding of important details. The latter is the most common case I have observed in my experience.

Novices in a field usually have ideas that fall in this category: they are simple and based on well-known concepts, or concepts from other fields, and they show a lack of understanding of the fundamentals of this field (concurrency) and a lack of knowledge about the problem at hand. These ideas don't actually work, but to the uninitiated, they look like they're good.

Then there is the class of people who occasionally have a decent idea. Once they have it, they can get very attached to the idea as this is precious to them and they likely won't be able to come up with anything better. This is the best they can come up with, therefore, they're going to try to sell it as the-best-thing-since-bread-came-sliced.

You have to understand that they worked hard to get there, therefore it's only normal that they want to value this idea, fighting for it tooth and nail, if need be. It can be very hard to reason with these people and convince them that there are better ways to attack the problem.

The people in this category are not novices. They typically have some knowledge of the field. Sometimes they call themselves experts, perhaps because they have been working in the field for a long time.

There is a third group of people, the ones that have lots of ideas. This is a small group because to get to this group you need to not only be good at having ideas, but also need to have a lot of knowledge about a particular field. My conjecture is that the kind of mindset it takes to learn about a field in depth is almost the opposite of the mindset you need to innovate in a field. The former is a mindset where we need to memorize and accept a lot of what has been done by others previously in this field, while the latter is about asking questions about everything that is being taught to us.

Obviously, the two mindsets are not mutually exclusive, but let's just say that only a small subset of the researchers in a particular field have both. But I digress.

The main characteristic of this third group of people is that they have lots of ideas. Most of these ideas are below par, a couple are decent, and every once in a while, one idea will be great.

Because they have lots of ideas, researchers in this group are not attached to their ideas. They know that other ideas will come (to themselves or to other researchers) and those ideas will have slightly better trade-offs and be better in slightly different metrics. They understand that there is no perfect solution: the concept of perfection depends on the particularities of the problem at hand.

The Algorithm

Once you have an idea, the next step is to transform it into an Algorithm.

A lot of people confuse an idea with an algorithm. They are not the same thing.

An idea is a vague description of a possible solution to a problem, with an emphasis on how the multiple pieces work together, or what the main trick behind it is. You can think of the Idea as the elevator pitch, or the drawing on the whiteboard.

An algorithm is an accurate description of the steps taken by this solution. It's a pseudo description of the way we solve the problem or execute a computation; it's a description of the program. It doesn't need to be written in a programming language. The steps can be in English, but they need to be descriptive.

In my experience, this confusion of the Idea with the Algorithm causes a lot of friction when dealing with novices. Novices have trouble seeing the difference between these two concepts. To them, the idea is enough to move to the Implementation. If there are parts missing, they'll figure it out during implementation and leave those as implementation details. The fact that both the Idea and the Algorithm can be described in English, while the Implementation is described as source code (i.e. a programming language) makes the distinction between idea and algorithm even harder for novices.

Perhaps the best way to think about this is to put it in the context of Mathematics. In fields of theoretical Mathematics, there is no "Implementation stage", only the Idea and the Algorithm. Yet, to mathematicians, there is a clear distinction between an idea and an algorithm. The algorithm can be proven correct (or incorrect) while the idea cannot (unless it is obviously wrong).

It helps them that ideas are described in English, while algorithms are described in mathematical notation.

It also helps them that they have formal proofs to support the algorithm, but more on this later in the post.

This mathematical mindset helps me a lot in my research on concurrent algorithms. When describing an idea, I do an elevator pitch, or a drawing on the whiteboard, or maybe a couple of PowerPoint animations. When describing an algorithm, I try to use pseudo-code or, if using English sentences, itemize each step and try to be as descriptive as possible.

It's not perfect, but it helps to make the distinction between the two stages.

The Implementation

In the field of concurrent algorithms, we usually do an implementation of the algorithm.

An implementation is made in a programming language. If the goal is a research paper, then the language can be whatever the researchers doing this work are comfortable with. If the goal is to solve an actual problem in a company, then the language will be dictated by the particular problem or by whatever programming language/tools the group supporting this feature needs to use.

Why do we do an implementation?

First, because we can.

Second, because having an implementation helps us understand the performance of the solution, even if it's a proof-of-concept (POC) implementation.

Third, because in the Industry, an implementation is what we ultimately need, because it's the way to actually solve the problem we're trying to solve.

Fourth, because an implementation can be tested and passing these tests helps us have confidence that the algorithm is correct. More on this in the next section.

How do we know it's correct?

When designing or reading a new algorithm, we always wonder whether it is correct or not.

There are two ways to address this concern: formal proofs and stress tests. The best is to do both, but this is not always possible or a useful investment of time.

In the academic setting, when writing a research paper on a novel concurrency algorithm, the correctness proof is the preferred approach. When small, the proof can be placed in the paper itself, otherwise, it's put in an accompanying document or appendix.

Some theoretical conferences and journals don't even care about having an implementation of the algorithm. All they care about is the proof: whether it is well constructed and really proves what it sets out to prove.

The more practical conferences (in the field of concurrency) usually don't care much about stress tests either. They give importance to the fact that an implementation was made and to the benchmarks made with this implementation, but not to any supporting code that tests the correctness of the implementation. And because the reviewers don't value this, the researchers submitting papers don't spend time doing tests either. It gets to the point where, of all my colleagues working in this field, I only know of one who consistently spends time writing stress tests for his work.

The majority of my fellow research colleagues don't stress test their work and instead prefer to spend some time writing a formal proof of correctness. The incentive to write tests just isn't there.

Writing research papers takes time. It takes time to come up with an idea worthy of a paper, it takes time to turn it into an algorithm, it takes time to implement it, it takes time to benchmark it, it takes time to write about it in a way that pleases reviewers at conferences, and if on top of that we have to write stress tests, it's just too much.

Proofs are made by humans, usually the same humans who designed the algorithm that is being proven correct, and therefore, it's possible for the same fallacies that induced an error in the algorithm to creep into the proof. This depends on the kind of proof, for example, we can argue that invariant proofs are less prone to this nefarious effect, but there is no way to be immune to it.

In the software industry the mindset is nearly opposite.

There, we don't care about formal proofs because each proof is tightly associated with the algorithm and implementation. Changing a single line of code can invalidate the corresponding proof. This is particularly true of lock-free algorithms, but it holds for concurrent algorithms in general.

Business needs can change, which means code can have new functionality, which means the implementation will change as time goes by. It's not practical to write a new proof every time someone changes a line of code.

Yes, there are tools like TLA+, but then we need to keep the proof and the implementation in sync, which is its own pain.

On the other hand, if we write tests to cover the implementation, every time we change a line of code we can just re-run the tests. It won't give a 100% guarantee that you didn't break anything, but it will help to catch many mistakes.

Furthermore, in the Industry we actually want implementations that work. What good is an algorithm that has been proven correct, if the implementation of this algorithm is full of bugs?

Not really much, I'm afraid. It's a much better use of the engineers' time to have tests that assert (some) correctness of the implementation than to have them write formal proofs or TLA+ proofs.

Tests imply re-usability

My intellect is not capable of holding a large concurrent algorithm. I know of other researchers who can, and Andreia is one example.

If you present me with a novel concurrent algorithm that is large(ish), I may not be able to convince myself that it is correct, because I won't be able to stuff it in my head as a whole. I won't be able to go through all the possible interleavings of the steps nor reason about its correctness.

It's like the chess players who can see 7 moves ahead on the board, and the chess players who can only see 3 moves ahead. Yes, it's a skill that you can practice, but after a certain time you get old, you have other stuff in your head, and frankly, I don't even want to invest the time it takes to grasp such large algorithms.

Because of this limitation, I need to restrict myself to small algorithms. And even for the small algorithms, I do stress tests.

If we have tests to cover the behavior of a queue, then any new queue we design and implement can be tested with the same tests. If a particular queue passes all the tests we have designed over the years, then this gives me good confidence that this is a correct queue implementation and, as a consequence, that the algorithm of the queue is likely correct.
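As a rough illustration of what such a reusable queue test could look like, here is a minimal C++ sketch. Everything in it is an assumption made for the example: the `MutexQueue` stand-in, the `enqueue()`/`dequeue()` interface, and the `stressQueue()` name are hypothetical and not taken from any particular code base. It checks two invariants: each consumer sees the items of a given producer in order, and every item is dequeued exactly once.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Stand-in queue (hypothetical): any type exposing enqueue(uint64_t) and
// bool dequeue(uint64_t&) can be plugged into stressQueue() below.
class MutexQueue {
    std::mutex mtx;
    std::queue<uint64_t> items;
public:
    void enqueue(uint64_t v) {
        std::lock_guard<std::mutex> lock(mtx);
        items.push(v);
    }
    bool dequeue(uint64_t& out) {
        std::lock_guard<std::mutex> lock(mtx);
        if (items.empty()) return false;
        out = items.front();
        items.pop();
        return true;
    }
};

// Producers enqueue (producerId << 32 | sequenceNumber). Consumers check:
//  1. each consumer sees the items of a given producer in increasing order;
//  2. every item is dequeued exactly once (none lost, none duplicated).
template <typename Q>
bool stressQueue(Q& q, uint32_t numProducers, uint32_t numConsumers,
                 uint32_t itemsPerProducer) {
    const uint64_t total = uint64_t(numProducers) * itemsPerProducer;
    std::vector<std::atomic<uint32_t>> timesSeen(total);
    for (auto& c : timesSeen) c.store(0);
    std::atomic<uint32_t> producersDone{0};
    std::atomic<bool> ok{true};
    std::vector<std::thread> threads;

    for (uint32_t p = 0; p < numProducers; p++) {
        threads.emplace_back([&, p] {
            for (uint32_t seq = 0; seq < itemsPerProducer; seq++) {
                q.enqueue((uint64_t(p) << 32) | seq);
            }
            producersDone.fetch_add(1);
        });
    }
    for (uint32_t c = 0; c < numConsumers; c++) {
        threads.emplace_back([&] {
            std::vector<int64_t> lastSeq(numProducers, -1);
            bool finishing = false;
            for (;;) {
                uint64_t v = 0;
                if (!q.dequeue(v)) {
                    if (finishing) break;  // still empty after all producers finished
                    finishing = (producersDone.load() == numProducers);
                    std::this_thread::yield();
                    continue;
                }
                const uint64_t p = v >> 32;
                const uint64_t seq = v & 0xFFFFFFFFu;
                if (p >= numProducers || seq >= itemsPerProducer) {
                    ok = false;  // value that no producer ever enqueued
                    continue;
                }
                if (int64_t(seq) <= lastSeq[p]) ok = false;  // FIFO order violated
                lastSeq[p] = int64_t(seq);
                timesSeen[p * itemsPerProducer + seq].fetch_add(1);
            }
        });
    }
    for (auto& t : threads) t.join();
    for (auto& c : timesSeen) {
        if (c.load() != 1) ok = false;  // lost or duplicated item
    }
    return ok.load();
}
```

Calling `stressQueue(q, 4, 4, 100000)` on a new queue type only requires that the type offers the same two methods, which is where the re-use comes from. A passing run is evidence, not proof, so it is worth repeating the run with different thread counts.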

Moreover, having a well-tested queue means that when we design other concurrency mechanisms that need a queue, we know that we can use the queue we designed beforehand because when we find issues in the new thing we are designing (which inevitably we will) we will be certain that those issues originate from the algorithm which works on top of the queue and not due to issues in the queue itself.

By stress-testing everything we do, we gain re-usability!

And yes, stress-testing is no 100% guarantee that the implementation is correct, but it sure beats not having anything, and I would argue, it beats proofs too.

You see, a proof is for the algorithm, not the implementation. On the other hand, a stress-test validates correctness for the implementation and the algorithm. Two for one is a better investment of your time.

Not only can the implementation be re-used, but the tests can be re-used on other implementations. When we design a stress test to check invariants of concurrent sets, we can use the same test to check the correctness of any other set implementation.

The tests we do become an investment for the future, which reaps its benefits as the years go by.

My take on correctness

Suppose you came up with a new lock-free data structure and now you want my help with it. There are two different ways I can have confidence that what you did is correct:

1. You make a formal proof of correctness to accompany the algorithm;
2. You make as many stress/invariant tests as you can think of and make your implementation of the algorithm pass those tests.

Ideally, you should have both, but if you're going to do just one, then go for number 2.

Notice that you need an algorithm in both. If you don't have an algorithm, then you have nothing. Some people jump directly from the idea to the implementation stage. This means that the algorithm lives only in their head. Good for them, but the point of a research paper is to share information, and forcing others to reverse-engineer the algorithm from the implementation is not a good way of sharing. And most likely, no one will bother to do so.

So, yeah, without an algorithm, you got a big fat nothing.

Summary

In summary, the reasons to write stress/invariant tests are:

These tests will be re-usable later;
They help prove the correctness of the algorithm and the implementation;
The algorithms validated with these stress tests can then be used as building blocks of more complex algorithms;
If you have tests, you will be able to validate the algorithms of other researchers;
If you make modifications to the algorithm, you will be able to re-check the correctness of the modified algorithm without having to make a new proof from scratch;
Writing tests is fun!

Friday, March 12, 2021

The 4 laws of Durability

When it comes to having durable data, there are four ways to do it: undo log, redo log, shadow copy and shadow data.

Let's start with the preliminaries.

So what do we mean by "durable"?

Well, durable means that whatever data you're trying to save has reached your storage device in a consistent way. It means that when you write to storage you want it to be "permanent", whether that storage device is a USB key, a CD-ROM, a hard drive, an SSD, or a non-volatile memory DIMM like Intel's Optane DC PM.

For any of these storage devices, the algorithms are always the same: you have to use one of the four mentioned above.

Keep in mind that these are needed if the data needs to be consistent, i.e. after a failure you want to see either all of the data that was being stored or none of it. I mean, if we were ok with having garbled data, then why would we bother saving it in permanent storage? The whole point of making data durable is that it holds important information, and therefore durability implies consistency.

Now that the basics are out of the way, what exactly are these four algorithms?

I'm going to focus on these in the context of transactions, but they don't have to be necessarily about that.

Undo log is a technique where we write a log entry to durable storage before each write is done to storage. It allows multiple independent (non-atomic) writes to become durable in an all-or-nothing way, like a transaction, or a checkpoint.

In the context of persistent memory, libpmemobj in PMDK is an example of a transactional system that uses undo log.
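To make the mechanics concrete, here is a minimal in-memory sketch of an undo log. It is a model, not how libpmemobj or any real system is implemented: `UndoLogStore` and its members are hypothetical names, plain vectors stand in for durable storage, and the places where a real implementation would flush and fence are only marked in comments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// In-memory model: `data` and `log` stand for regions in durable storage.
struct UndoLogStore {
    struct Entry { std::size_t index; int64_t oldValue; };
    std::vector<int64_t> data;
    std::vector<Entry> log;  // a non-empty log means a transaction is in flight

    explicit UndoLogStore(std::vector<int64_t> initial) : data(std::move(initial)) {}

    // One log entry, then one in-place modification, repeated for every write.
    void write(std::size_t index, int64_t newValue) {
        log.push_back({index, data[index]});  // the entry records the old value
        // ordering fence here: the entry is durable before the data is touched
        data[index] = newValue;
    }

    void commit() {
        // ordering fence here: the last modification is durable before the reset
        log.clear();
        // round-trip fence here: the transaction is durable once it returns
    }

    // Run after a restart. A non-empty log means the transaction never
    // committed, so the old values are restored in reverse order.
    void recover() {
        for (auto it = log.rbegin(); it != log.rend(); ++it) {
            data[it->index] = it->oldValue;
        }
        log.clear();
    }
};
```

In this model a "crash" is simply the absence of a `commit()` call: calling `recover()` at that point restores every modified slot, which is the all-or-nothing behavior described above.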

In Redo log we write the log with multiple entries to storage before writing the actual data. The difference between redo and undo is that undo log does one entry in the log at a time followed by one modification, while the redo log does all entries in the log in one shot and then all the modifications in one shot.

Mnemosyne and OneFile are examples of transactional systems that utilize redo log.
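The contrast with undo log is easier to see in code. Below is a minimal in-memory sketch of a redo log, under the same caveats: it is not how Mnemosyne or OneFile work internally, `RedoLogStore` and the `Crash` injection points are hypothetical, and the fences are only indicated by comments. The whole log is written first, then a non-zero log size marks it as complete, and only then is the data modified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct RedoLogStore {
    struct Entry { std::size_t index; int64_t newValue; };
    enum class Crash { None, BeforeLogSizeSet, WhileApplying };  // test hooks

    std::vector<int64_t> data;   // durable data
    std::vector<Entry> log;      // durable log entries
    std::size_t logSize = 0;     // durable; non-zero means the log is complete
    std::vector<Entry> pending;  // volatile write-set of the running transaction

    explicit RedoLogStore(std::vector<int64_t> initial) : data(std::move(initial)) {}

    // Writes are only buffered; nothing durable changes until commit().
    void write(std::size_t index, int64_t newValue) {
        pending.push_back({index, newValue});
    }

    void commit(Crash crash = Crash::None) {
        log = pending;  // all entries go to the log in one shot
        pending.clear();
        if (crash == Crash::BeforeLogSizeSet) return;
        // ordering fence here: entries are durable before the size is set
        logSize = log.size();  // commit point
        // ordering fence here: the size is durable before the data is modified
        for (std::size_t i = 0; i < log.size(); i++) {
            data[log[i].index] = log[i].newValue;
            if (crash == Crash::WhileApplying) return;  // only the first write lands
        }
        // ordering fence here: modifications are durable before the log is reset
        logSize = 0;
    }

    // Run after a restart: replay a complete log, ignore an incomplete one.
    void recover() {
        pending.clear();  // volatile state does not survive a crash
        for (std::size_t i = 0; i < logSize; i++) {
            data[log[i].index] = log[i].newValue;
        }
        logSize = 0;
    }
};
```

A crash before the size is set leaves the data untouched, while a crash after it is rolled forward by `recover()`. A real system would also have to let a transaction read its own buffered writes, which this sketch leaves out.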

Shadow copy, sometimes called Copy-On-Write (COW), creates a new replica of the data and writes the new data along with the unchanged contents to durable storage, before swapping some kind of pointer to indicate that this is the new object/data and the old one can be discarded. COW can't really be used by itself for transactions over multiple objects, but it can be combined with redo log to make it more efficient.

One example is SAP HANA which uses redo log with COW.
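On a POSIX file system, one familiar instance of shadow copy is the "write a new file, then rename it over the old one" pattern, where the directory entry plays the role of the pointer. The sketch below illustrates that pattern only; it says nothing about how SAP HANA does it, and the function names and the `.shadow` suffix are made up for the example. It assumes a POSIX system where `rename()` atomically replaces the target.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Writes all bytes to `path` and makes them durable with fsync().
bool writeAndSync(const std::string& path, const std::string& contents) {
    int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    std::size_t off = 0;
    while (off < contents.size()) {
        ssize_t n = ::write(fd, contents.data() + off, contents.size() - off);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, retry
            ::close(fd);
            return false;
        }
        off += static_cast<std::size_t>(n);
    }
    const bool ok = (::fsync(fd) == 0);
    ::close(fd);
    return ok;
}

// fsync() on the parent directory, so that the rename itself is durable.
bool syncDirectoryOf(const std::string& path) {
    std::string dir = ".";
    const std::size_t slash = path.find_last_of('/');
    if (slash == 0) dir = "/";
    else if (slash != std::string::npos) dir = path.substr(0, slash);
    int fd = ::open(dir.c_str(), O_RDONLY);
    if (fd < 0) return false;
    const bool ok = (::fsync(fd) == 0);
    ::close(fd);
    return ok;
}

bool replaceDurably(const std::string& path, const std::string& contents) {
    const std::string shadow = path + ".shadow";  // hypothetical naming scheme
    // The new copy must be durable before the swap, otherwise a crash could
    // leave the name pointing at an incomplete file.
    if (!writeAndSync(shadow, contents)) return false;
    if (std::rename(shadow.c_str(), path.c_str()) != 0) return false;  // the swap
    return syncDirectoryOf(path);
}

// Helper for checking the result.
std::string readAll(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

A reader that opens `path` sees either the complete old contents or the complete new contents, never a mix. As the paragraph above notes, this covers a single object; replacing several files atomically needs something more, such as a log.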

Shadow data can sometimes be confused with COW but it is not the same thing. In shadow data two (or more) replicas of the entire data are kept in durable storage and they both are updated with the modifications, one at a time. First one replica, then a logical pointer and then the second replica. On the next set of atomic writes the recently updated replica is the first to be updated.

Examples of shadow data transactional systems are Romulus, RedoDB and Trinity to some extent.

We thought long and hard about the similarities and differences between these four algorithms for durable transactions, and we found they possess four common characteristics, regardless of the underlying storage media for which they are intended.

Each one of these characteristics reveals an important insight into the concept of durability and we believe these to be empirical rules by which all durable techniques abide. These rules are:

1. There must be a replica of the data;
2. There must be a durable state indicating which of the replicas is consistent;
3. All algorithms require at least one ordering constraint of the writes to durable storage;
4. A modification is durable only after a round-trip fence to the storage hardware.

The first key insight regarding durable transactions is that a consistent and durable replica of the data must exist at all times. This replica may be a full copy of the data, such as on shadow data, or it may be a logical replica, such as on undo log and redo log.

Intuitively, there has to be a consistent replica of the data, so that there is a way to recover data to its original consistent state in the event of a failure. Shadow data keeps a full replica of the data, thus incurring a high permanent usage of the durable media (space amplification), while the undo log and redo log approaches have to write to durable storage not just the new data, but also encoded information about the location and size of the modification (write amplification).

There's clearly an important trade-off here: log-based algorithms will increase (amortized) write amplification but shadow-data-based algorithms will increase space amplification.

The second empirical rule implies that the algorithm must ensure that, irrespective of when a failure occurs, there is a way for the recovery procedure to determine which of the replicas is consistent.

Shadow data techniques like Romulus use a two-bit variable to determine which of the two replicas is consistent, while redo log and undo log can use the size of the log (zero or non-zero) to indicate if the log is consistent.

By itself, this makes no significant difference between the approaches. However, the exact mechanics will influence the number of ordering constraints in the algorithm.

This leads us to the third insight, that data consistency is possible only through ordering of some of the writes.

For shadow-copying, the modifications on the new block must be made durable before the pointer swap, otherwise a failure occurring after the pointer swap is made durable would leave the pointer referencing an inconsistent block. This means that apart from block allocation and de-allocation details, shadow-copying has a single ordering constraint, or in other words, a single ordering fence.

Shadow data techniques like Romulus use a two-bit state (though one bit would suffice) to indicate which of the two replicas is the consistent one, or whether both are consistent. If the state variable indicating which replica is the consistent one becomes durable out of order with respect to the modifications on either replica and a crash occurs, upon recovery it may be referencing the inconsistent replica. For this algorithm, three ordering constraints exist: one to prevent the state from changing to COPYING before the modifications in main replica are done; another to prevent the modifications in back replica from being done before the state changes to COPYING; and another one to prevent the state change to IDLE before the changes on back replica are durable.
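The three constraints are easier to follow next to the state machine they protect. What follows is a minimal in-memory sketch of a two-replica scheme in the spirit of Romulus, not the actual Romulus code. The `TwoReplicaStore` type, the `Crash` injection points, the MUTATING state and the recovery procedure are assumptions added so the example is complete; only the IDLE and COPYING states and the three constraints come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TwoReplicaStore {
    enum class State { IDLE, MUTATING, COPYING };
    enum class Crash { None, WhileMutating, WhileCopying };  // test hooks
    struct Write { std::size_t index; int64_t value; };

    State state = State::IDLE;         // durable state variable
    std::vector<int64_t> mainReplica;  // durable, modified in place
    std::vector<int64_t> backReplica;  // durable, second full copy

    explicit TwoReplicaStore(const std::vector<int64_t>& initial)
        : mainReplica(initial), backReplica(initial) {}

    void update(const std::vector<Write>& writes, Crash crash = Crash::None) {
        state = State::MUTATING;
        // fence (assumed): MUTATING is durable before main is modified
        for (std::size_t i = 0; i < writes.size(); i++) {
            mainReplica[writes[i].index] = writes[i].value;
            if (crash == Crash::WhileMutating) return;
        }
        // constraint 1: main modifications are durable before COPYING
        state = State::COPYING;
        // constraint 2: COPYING is durable before back is modified
        for (std::size_t i = 0; i < writes.size(); i++) {
            backReplica[writes[i].index] = mainReplica[writes[i].index];
            if (crash == Crash::WhileCopying) return;
        }
        // constraint 3: back modifications are durable before IDLE
        state = State::IDLE;
    }

    // Run after a restart: the state says which replica can be trusted.
    void recover() {
        if (state == State::MUTATING) mainReplica = backReplica;  // roll back
        if (state == State::COPYING) backReplica = mainReplica;   // roll forward
        state = State::IDLE;
    }
};
```

If any of the marked fences were missing, the durable state could claim that a half-written replica is the consistent one, and `recover()` would copy garbage over the good data.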

The undo log technique has two constraints per modified object/range: the log entry must contain the old value before the entry is added to the log; and the entry must be added to the log before the modification is done on the data. Undo log has one extra constraint per transaction, requiring the last modification to be durable before the log is reset.

The Redo log technique has three constraints per transaction: all the log entries must be durable before the log size is set; the log size must be set before the modifications are done on the data; the modifications on the data must be durable before the log is reset.

The fourth and final rule addresses the need for a round-trip synchronization mechanism to the storage domain, such that the hardware can guarantee that it contains, in stable durable storage, all the previously written data. The cost of such a fence is typically on the order of the storage device's latency.

Fast devices like PM implement this round-trip fence orders of magnitude faster than slower devices, like hard drives.

Without such a mechanism, it is not possible to have durable operations, even if ordering constraints are set: in the event of a failure, the ordering constraints impose a temporal sequence in which the writes will be made durable, but there is no guarantee on durability.

A corollary of this is that all algorithms require one and only one such fence, strategically placed.

Notice that the ordering constraints may be replaced by such synchronous fences, to the detriment of performance, and in fact, many storage systems make no distinction between the two. Ordering is typically achieved with an asynchronous fence and it relates to the order in which certain writes will be made durable in the storage media.

On block based storage, this is typically implemented with fsync() or fdatasync().

In Persistent Memory (PM) ordering can be achieved through the combination of flushes (clwb) and fences (sfence) or by writing to the same cache line. The round-trip guarantee of durability is given by a synchronous fence, either fsync()/fdatasync() on block storage, or sfence on PM storage.
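As a rough sketch of what the flush-plus-fence combination looks like in code, the helpers below wrap the x86 intrinsics. Several things here are assumptions for the sake of a portable example: the names `pwb`, `persistRange` and `kCacheLineSize` are hypothetical, the cache line is assumed to be 64 bytes, and `_mm_clflush` is used as a stand-in for clwb because clwb needs a dedicated compiler flag and CPU support. On ordinary DRAM this code makes nothing durable; it only shows where the instructions go.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define PM_SKETCH_X86 1
#endif

constexpr std::size_t kCacheLineSize = 64;  // assumption: 64-byte cache lines

// pwb: ask the CPU to write one cache line back towards memory.
inline void pwb(const void* addr) {
#ifdef PM_SKETCH_X86
    _mm_clflush(addr);  // stand-in for clwb in this sketch
#else
    (void)addr;         // no portable equivalent, nothing to do
#endif
}

// pfence: order earlier flushes and stores before later ones.
inline void pfence() {
#ifdef PM_SKETCH_X86
    _mm_sfence();
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// psync: on PM the round-trip guarantee is also given by sfence, so this
// sketch maps it to the same instruction as pfence.
inline void psync() { pfence(); }

// Flush every cache line touched by [addr, addr + len), then wait once.
inline void persistRange(const void* addr, std::size_t len) {
    if (len == 0) return;
    const std::uintptr_t mask = ~(static_cast<std::uintptr_t>(kCacheLineSize) - 1);
    const std::uintptr_t first = reinterpret_cast<std::uintptr_t>(addr) & mask;
    const std::uintptr_t last = reinterpret_cast<std::uintptr_t>(addr) + len - 1;
    for (std::uintptr_t line = first; line <= last; line += kCacheLineSize) {
        pwb(reinterpret_cast<const void*>(line));
    }
    psync();
}
```

The point of `persistRange()` is that many flushes can share a single fence: the flushes are issued back to back and the cost of waiting is paid once at the end.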

In case you haven't noticed, the fact that all algorithms require one round-trip fence to the device (psync), but may require multiple ordering fences (pfence) has implications for performance. This is especially true given that the psync has inescapable physical implications: it is not possible to have all-or-nothing consistent durability without a psync that physically does a round trip to the storage device (or at least the storage domain) and therefore the latency cost of this single round trip is inescapable.

However, different algorithms may have different ordering constraints (pfences) and these may have different costs.

Yes, fsync() is used for both sync and ordering on block devices, and the sfence instruction is also used for both in PM. However, there are tricks. In PM, writes to the same cache line are guaranteed to be ordered and therefore, no sfence is needed to order them, as long as a store with memory_order_release is used.

Seeing as these round trips are typically the bottleneck when doing random writes to PM, the fact that we can create an algorithm with a lower number of psyncs means we can have a performance gain that is nearly proportional to the reduction in the number of such fences.
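Here is a small sketch of how the same-cache-line trick might be used, for example to publish a log entry without a fence between its payload and its valid flag. The `LogEntry` layout, the `publishEntry()` function and the 64-byte line size are assumptions made for the illustration; the ordering guarantee itself is the one stated above for PM and is not something this code can demonstrate on ordinary memory.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// The payload and the flag that validates it share one cache line, so their
// relative order in PM does not need a separate ordering fence.
struct alignas(kCacheLine) LogEntry {
    std::atomic<uint64_t> addr{0};
    std::atomic<uint64_t> oldValue{0};
    std::atomic<uint64_t> valid{0};
};
static_assert(sizeof(LogEntry) == kCacheLine, "entry must fill exactly one line");

inline void publishEntry(LogEntry& e, uint64_t addr, uint64_t oldValue) {
    e.addr.store(addr, std::memory_order_relaxed);
    e.oldValue.store(oldValue, std::memory_order_relaxed);
    // The release store keeps the compiler and CPU from moving the payload
    // stores after the flag. A single flush of this line would follow.
    e.valid.store(1, std::memory_order_release);
}
```

A recovery procedure that finds `valid` set can then trust `addr` and `oldValue`, because all three fields travel to the media in the same line.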

This is exactly what we've done with Trinity.

Trinity is a novel durability technique that needs just two fences per transaction and reduces the number of flushes when doing random writes. It consumes more memory than the previous techniques but it has significantly higher performance.

Moreover, we combined it with our own variant of TL2 for highly scalable durable linearizable transactions, and we used that to make a K/V store, which is likely the fastest K/V store on the planet with full transactions (though you need Optane Persistent Memory to be able to run it).

If you want to see the video, it's here:

https://youtu.be/vdInrf_kk1w

If you want the source code, it's here:

https://github.com/pramalhe/durabletx/blob/master/ptms/trinity/TrinityVRTL2.hpp