Monday, September 11, 2023

How to (not) invent concurrent algorithms

Today I want to talk about correctness and the creative process of designing algorithms.

Let's start with the latter.

 

A couple of weeks ago a link was shared on Hacker News to a blog post by the folks at QuestDB, where they describe a new concurrency technique for a single writer and multiple readers: https://questdb.io/blog/concurrent-lockfree-datastructure-design-walkthrough/

It turns out that their concurrency control is a poor man's version of Left-Right, a universal construction that Andreia and I discovered more than a decade ago and published at a peer-reviewed conference:

https://hal.science/hal-01207881

https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/left-right-2014.pdf

Left-Right provides linearizable, irrevocable execution, with starvation-free progress for the writer and wait-free progress for multiple readers, by keeping two replicas of the data.
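To make the shape of the technique concrete, here is a minimal C++ sketch of the Left-Right pattern. It assumes a single writer (or an external writer lock) and uses plain atomic counters as read-indicators; the published algorithm allows more scalable read-indicators, which is what makes readers wait-free in practice. The names are mine and the sketch is only meant to show how the two replicas, the leftRight indicator and the versionIndex interact, not to be a drop-in replacement for the code in the paper.

#include <atomic>

template<typename T>
class LeftRightSketch {
    T instances[2];                       // the two replicas of the data
    std::atomic<int>  leftRight{0};       // which replica readers read from
    std::atomic<int>  versionIndex{0};    // which read-indicator readers use
    std::atomic<long> readIndicator[2];   // simplistic read-indicators

public:
    LeftRightSketch() { readIndicator[0] = 0; readIndicator[1] = 0; }

    template<typename F>
    auto applyRead(F readFunc) {
        const int vi = versionIndex.load();
        readIndicator[vi].fetch_add(1);                       // arrive
        auto result = readFunc(instances[leftRight.load()]);
        readIndicator[vi].fetch_sub(1);                       // depart
        return result;
    }

    // Must be called by one writer at a time.
    template<typename F>
    void applyWrite(F writeFunc) {
        const int lr = leftRight.load();
        writeFunc(instances[1 - lr]);      // 1. mutate the unused replica
        leftRight.store(1 - lr);           // 2. steer new readers to it
        const int vi = versionIndex.load();
        // 3. wait for readers on the other indicator, toggle the version,
        //    then wait for readers on the old indicator: after this, nobody
        //    can still be reading the replica we are about to modify.
        while (readIndicator[1 - vi].load() != 0) { /* spin */ }
        versionIndex.store(1 - vi);
        while (readIndicator[vi].load() != 0) { /* spin */ }
        writeFunc(instances[lr]);          // 4. replay the mutation on the old replica
    }
};

A reader calls something like applyRead([](const MyMap& m) { return m.size(); }), while all writers funnel through applyWrite(), one at a time.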

 

Obviously, it's a bit sad when due credit is not given where it is deserved, but smart readers have pointed out in the blog's comments and on Hacker News that what is being described is indeed Left-Right (or a simpler, less powerful version of it).

This is ok because the good stuff will be re-discovered several times over, and Left-Right is certainly one of those gold nuggets. In the end, what matters is not my ego being massaged, but that developers get one more great tool in their toolbox to handle concurrency and, as I've said many times, concurrency is hard, which means we need all the tools and help we can get.

Not all developers will read my blog posts or see my CppCon talk about Left-Right, which means there is high value for the community in having an extra entry point in the form of the QuestDB blog, which has a wider audience than my blog or my papers, and explains the technique in a more detailed and accessible way.

Ultimately, what matters is that developers learn something new and useful, not who came up with it first.

 

 

Let me share a funny story on how (our) creative process works.

After discovering Left-Right as a way to solve a particular problem in my day job, Andreia and I started working actively on concurrency. We would dream up new ideas, bounce them off each other and, when we felt they were solid, then and only then, we would look them up in the existing literature. Usually we would end up finding the same idea, or something very similar, in an existing paper. More often than not, that paper was written by Maurice Herlihy, yes, the guy who literally wrote the book on concurrent algorithms https://www.amazon.com/Art-Multiprocessor-Programming-Maurice-Herlihy/dp/0124159508

 

Perhaps other people would be discouraged by this, but we were new to the field, and being able to rediscover something which the great Maurice Herlihy had discovered 20 years before actually felt like "a win", so we kept working, coming up with ideas, and implementing and testing the best ones.

As the months passed by, Andreia and I kept (re-)discovering algorithms, but we noticed a pattern emerge: our latest discoveries had been published by Maurice only a handful of years previously. This meant we were catching up with the state of the art!

 

Eventually, we discovered one of the reader-writer lock algorithms in this paper, just a few months after the paper was published: https://cs.brown.edu/~irina/papers/ppopp13.pdf

and some months after that we discovered other stuff that was actually new. I think CX was one of the first ones:

https://github.com/pramalhe/CX

although it was too complex for most people to understand and therefore was only published much later in PPoPP 2020.

After this, we were on a roll and kept doing stuff in this field https://dblp.uni-trier.de/search?q=pedro+ramalhete

 and we still have a lot of cool work waiting to be published ;)

 

Going through the peer-review process is important for these complex algorithms because it's one of the few ways to validate their correctness. From the start, we spent years adding to and perfecting our correctness tests to make sure that our implementations are bug-free, or at least don't have any bugs that can be observed by invariant testing and stress testing. But testing is not enough; the work needs to be seen and validated by other experts. That is one of the few advantages of the long and painful process that we call peer review.

To this day, I believe there are only a handful of people who really understand how CX works, and the only reason that paper was accepted is that one or two of them were among its reviewers. This is not to say that CX is like "General Relativity" or something, it's just that universal constructions are rare and complex beasts which most people don't care about, or don't have the inclination to spend the time looking into, and that's fine.

 

Although over the years I've become pretty good at finding bugs in concurrent code and concurrent algorithms (see the post further below), I am lucky to have access to one of the best tools ever created to find bugs in concurrent algorithms: Andreia's brain   ;)

I've lost count of how many times I used this powerful tool to validate designs and to find bugs in algorithms which would otherwise have made me lose months or even years of work on a (in retrospect) dumb idea.

 

In summary, it's ok to re-discover existing algorithms: it validates the usefulness and correctness of such algorithms, and it also validates that you, as a researcher, are on the right track to making important discoveries in that field. That is, assuming there are still important discoveries to be made in that field, but that's a topic for a different post.

 

 

But what is not ok is to read other people's papers and code, make a stripped-down version of it, claim it as your own and, after it has been stripped down so much that it is no longer correct, still claim that it is correct and lock-free. And this is where we get to the "Peredvizhnikov Engine".

 

The "Atomic Struct" in the "Peredvizhnikov Engine" is in essence a stripped down version of the OneFile STM (the lock-free) variant: https://github.com/pramalhe/OneFile/

https://ieeexplore.ieee.org/document/8809531

OneFile was published back in 2019 and it was the first wait-free STM.

It's not super novel, in the sense that it uses ideas which already existed:

  • It uses double-word CAS where each data word has an adjoining unique version number to ensure progress and ABA-free execution. This had been used previously in multi-word CAS algorithms;
  • OneFile wait-free uses Herlihy's combining consensus, which Herlihy had discovered almost two decades earlier and which had been used before in the P-Sim universal construction;
  • It uses a variant of Hazard Eras to do memory reclamation in a wait-free way;
  • It shares code between threads using std::functions. Rachid Guerraoui had shown some years before that sharing code among threads is the only way to have wait-free generic execution;
  • etc

But OneFile is novel because it combines these ideas into something that no one had achieved before: a single-replica lock-free/wait-free STM with integrated lock-free/wait-free memory reclamation. Previous wait-free universal constructions keep multiple replicas of the entire data, with CX showing that the minimum worst case is 2N+1, and they may need to make copies of the entire dataset, which can be extremely inefficient.

In short, OneFile uses a redo-log to execute the code of the transaction, saving all modifications in that log. At the end of the transaction it will CAS on a central variable, curTX, to indicate that it wants to commit and, assuming the CAS was successful, it will use 1 bit to indicate that a write-transaction is now in-flight and 8 bits to indicate its thread-id, so that other threads may copy the redo-log and help execute the modifications with double-word CASes.

When one of the threads sees that all the DCASes are done, it will CAS again on the central variable curTX to clear the in-flight bit and increment the sequence number, and a new write-transaction can execute.

Disjoint read-transactions will read the sequence number in curTX and, if all the data they read has versions below this number, then the data is consistent and the transaction doesn't need to abort. Otherwise it needs to abort, help the current in-flight writer, and restart.

This behavior is lock-free, but it's possible to add a wait-free consensus and make it wait-free, the details of which do not matter for this particular post.
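To make the commit flow easier to follow, here is a schematic C++ sketch of the announce/help/close cycle around the central variable. The exact bit layout and the helper names are illustrative assumptions on my part, not the actual OneFile source code, and the redo-log replay is reduced to a stub.

#include <atomic>
#include <cstdint>

// curTX packs: [ sequence | 8-bit thread-id | 1-bit in-flight ]
struct CurTx {
    static uint64_t pack(uint64_t seq, uint64_t tid, bool inFlight) {
        return (seq << 9) | (tid << 1) | (inFlight ? 1 : 0);
    }
    static uint64_t seq(uint64_t c)      { return c >> 9; }
    static uint64_t tid(uint64_t c)      { return (c >> 1) & 0xFF; }
    static bool     inFlight(uint64_t c) { return (c & 1) != 0; }
};

std::atomic<uint64_t> curTX{CurTx::pack(1, 0, false)};

// Placeholder: in OneFile this replays the publisher's redo-log, writing each
// (value, version) pair with a double-word CAS so that racing helpers are benign.
void applyRedoLogWithDCAS(int publisherTid, uint64_t seq) { (void)publisherTid; (void)seq; }

void commitWriteTx(int myTid) {
    while (true) {
        uint64_t c = curTX.load();
        if (CurTx::inFlight(c)) {
            // Another write-transaction is in flight: help it finish first.
            applyRedoLogWithDCAS((int)CurTx::tid(c), CurTx::seq(c));
            curTX.compare_exchange_strong(c, CurTx::pack(CurTx::seq(c) + 1, 0, false));
            continue;
        }
        // Announce our own transaction as the in-flight one.
        uint64_t mine = CurTx::pack(CurTx::seq(c), (uint64_t)myTid, true);
        if (!curTX.compare_exchange_strong(c, mine)) continue;
        applyRedoLogWithDCAS(myTid, CurTx::seq(mine));                            // apply our redo-log
        curTX.compare_exchange_strong(mine, CurTx::pack(CurTx::seq(mine) + 1, 0, false)); // clear in-flight, bump sequence
        return;
    }
}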

 

 

The "Atomic Struct" described here https://github.com/eduard-permyakov/peredvizhnikov-engine/blob/master/docs/lockfree.pdf

uses single-word CAS instead of double-word CAS because it splits the 64 bits into 32 bits of data and 32 bits for the unique version number. This will wrap around faster than using 64+64 bits, but in practice it can be made safe.
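For illustration, the packing trick can be sketched like this (the names are mine, not the engine's); a plain 64-bit CAS then updates the value and its version together:

#include <atomic>
#include <cstdint>

struct VersionedWord {
    std::atomic<uint64_t> word{0};

    static uint64_t pack(uint32_t value, uint32_t version) {
        return (static_cast<uint64_t>(value) << 32) | version;
    }
    static uint32_t value(uint64_t w)   { return static_cast<uint32_t>(w >> 32); }
    static uint32_t version(uint64_t w) { return static_cast<uint32_t>(w); }

    // Install newValue only if the word still carries expectedVersion,
    // tagging it with the committing transaction's sequence number.
    bool casWithVersion(uint32_t newValue, uint32_t expectedVersion, uint32_t newVersion) {
        uint64_t expected = word.load();
        if (version(expected) != expectedVersion) return false;
        return word.compare_exchange_strong(expected, pack(newValue, newVersion));
    }
};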

A write operation will try to install a pointer to its function in a central variable, along with a unique sequence number, and other threads will follow this pointer and execute the same function. Like in OneFile, each write to memory is replaced by a compare-and-exchange operation, so as to prevent ABA.

 

However, in Peredvizhnikov there is no redo-log: threads apply their CASes directly to memory. Unfortunately this is incorrect, i.e. it is not linearizable. Consider the following example:

Suppose there is an operation that reads a memory location into a temporary variable, and later this variable is used to write into a different memory location. Any helping thread will not be able to figure out the original value.

1:   void txn_function() {
2:     int t = a;  // suppose 'a' is initially 0
3:     a = 42;
4:     b = t;      // b must be 0 (otherwise this is not linearizable)
5:   }

In this example, if thread 1 executes line 3 and then thread 2 comes along and executes line 2, thread 2 will see variable 'a' as 42 and there is no way for it to obtain the original value of 0 to place in its own copy of 't', so it cannot correctly help with the write to 'b' in line 4.

The Peredvizhnikov Engine is not lock-free because if thread 1 dies after executing line 3, no other thread will be able to make progress ever again.

 

You could argue that it is lock-free and correct if, and only if, it is applied to functions where this kind of behavior doesn't happen. The problem with that argument is two-fold:

First, you can no longer claim that it is an STM, or a generic concurrency control, or a universal construction. All of these are meant for generic sequential code, which means that if it doesn't work with generic code then it's not an STM/UC/generic-CC.

Second, it's actually really hard to read a block of code and understand whether one such issue exists for that particular block, to the point where only experts can do it. And such experts would rather design their own lock-free code to suit their needs than go and use the Peredvizhnikov engine. I know this is a nuanced argument, but another way to think of it is that this is a tool so hard to use correctly that you would rather do everything "by hand" without its help.

 

What I believe happened, though this is just a guess, is that the author of this work read the OneFile paper and tried to make something better. That is a good mindset to start from, but the problem is that the author didn't understand why OneFile needs a redo-log, or the sharing of the redo-log, and therefore thought it was unnecessary and decided to make a better (incorrect) version of OneFile. This is unfortunately what happens when your ideas don't go through peer review.

I know peer reviewing is awful, I've complained about it many times myself, but there is a reason why it exists: it's so that the information we disseminate is reliable, or at least has been validated by someone who knows a bit about what they're talking about.

 

I blame Hacker News and its sensationalist take on the world of Computer Science, because it gives a voice to anyone who shouts loud enough. I know that's how you get "disruption", which Silicon Valley loves, but the downside is that we're infecting everyone's brains with broken tools that simply don't work, and Peredvizhnikov is one such case.

 

Claiming that an algorithm is correct when it isn't is bad for our community, because it devalues the ones that are correct, the algorithms where people spent many months or years of effort writing tests, writing a formal proof and going through the peer-review process.

Claiming that an algorithm is lock-free when it isn't is even worse, because it devalues how hard it is to make correct lock-free algorithms. It makes it look easy, and it diminishes the work I've done for the past decade. So yes, I'm human, and I feel wronged and sad when other people say their algorithm is lock-free when it isn't.

 

 

But my feelings don't matter; what matters is that, when applied to generic code (or even non-generic code), the technique described in the Peredvizhnikov Engine is neither correct nor lock-free.

Friday, December 24, 2021

The importance of Correctness in concurrent algorithms

My work is in the area of concurrent algorithms, therefore this post is going to be largely biased by my experience in this field. Having said that, I suspect that most of what I'm going to say is applicable to many other fields of Computer Science, and maybe to other disciplines as well.

 

To me, the creative process in the field of Concurrent Algorithms is composed of three main steps: The Idea, The Algorithm, and The Implementation.

 

 

The Idea

Before we start doing anything we need an idea. Sometimes this is a well-formed idea, other times it's just an insight or trick that we feel can be used to solve a particular problem in a way that no one has done before. Maybe it's a way to solve the same problem but faster, or do it using less memory. Whatever it is, it gives you an advantage somehow.

 

Most people are capable of having ideas, but most ideas are useless, either because they have already been discovered, or because they are wrong, or because they lack understanding of important details. The latter is the most common case I have observed in my experience.

Novices in a field usually have ideas that fall in this category: they are simple and based on well-known concepts, or concepts from other fields, and they show a lack of understanding of the fundamentals of this field (concurrency) and a lack of knowledge about the problem at hand. These ideas don't actually work, but to the uninitiated they look like they're good.

 

Then there is the class of people who occasionally have a decent idea. Once they have it, they can get very attached to it, as it is precious to them and they likely won't be able to come up with anything better. This is the best they can come up with, therefore they're going to try to sell it as the-best-thing-since-bread-came-sliced.

You have to understand that they worked hard to get there, therefore it's only natural that they want to value this idea, fighting for it tooth and nail if need be. It can be very hard to reason with these people and convince them that there are better ways to attack the problem.

The people in this category are not novices. They typically have some knowledge of the field. Sometimes they call themselves experts, perhaps because they have been working in the field for a long time.

 

There is a third group of people, the ones that have lots of ideas. This is a small group, because to get into it you need to not only be good at having ideas, but also have a lot of knowledge about a particular field. My conjecture is that the kind of mindset it takes to learn about a field in depth is almost the opposite of the mindset you need to innovate in a field. The former is a mindset where we need to memorize and accept a lot of what has been done by others previously in this field, while the latter is about asking questions about everything that is being taught to us.

Obviously, the two mindsets are not mutually exclusive, but let's just say that only a small subset of the researchers in a particular field have both mindsets, but I digress.

The main characteristic of this third group of people is that they have lots of ideas. Most of these ideas are below grade, a couple are decent, and every once in a while, one idea will be great.

Because they have lots of ideas, researchers in this group are not attached to their ideas. They know that other ideas will come (to themselves or to other researchers) and those ideas will have slightly better trade-offs and be better in slightly different metrics. They understand that there is no perfect solution, the concept of perfection depends on the particularities of the problem at hand.

  

 

The Algorithm

Once you have an idea, the next step is to transform it into an Algorithm.

 

A lot of people confuse an idea with an algorithm. They are not the same thing.

An idea is a vague description of a possible solution to a problem, with an emphasis on how the multiple pieces work together, or on what the main trick behind it is. You can think of the idea as the elevator pitch, or the drawing on the whiteboard.

An algorithm is an accurate description of the steps taken by this solution. It's a step-by-step description of the way we solve the problem or execute a computation; a description of the program. It doesn't need to be written in a programming language, the steps can be in plain English, but they need to be descriptive.

 

In my experience, this confusion of the idea with the algorithm causes a lot of friction when dealing with novices. Novices have trouble seeing the difference between these two concepts. To them, the idea is enough to move to the implementation. If there are parts missing, they'll figure them out during implementation and leave those as implementation details. The fact that both the idea and the algorithm can be described in English, while the implementation is written as source code (i.e. in a programming language), makes the distinction between idea and algorithm even harder for novices.

 

Perhaps the best way to think about this is to put it in the context of Mathematics. In fields of theoretical Mathematics there is no "implementation stage", only the idea and the algorithm. Yet, to mathematicians, there is a clear distinction between an idea and an algorithm. The algorithm can be proven correct (or incorrect), while the idea cannot (unless it is obviously wrong).

It helps them that ideas are described in English, while algorithms are described in mathematical notation.

It also helps them that they have formal proofs to support the algorithm, but more on this later in the post.

 

This mathematical mindset helps me a lot in my research on concurrent algorithms. When describing an idea, I do an elevator pitch, or a drawing on the whiteboard, or maybe a couple of PowerPoint animations. When describing an algorithm, I try to use pseudo-code or, if using English sentences, I itemize each step and try to be as descriptive as possible.

It's not perfect, but it helps to make the distinction between the two stages.

 

 

The Implementation

In the field of concurrent algorithms, we usually do an implementation of the algorithm.

An implementation is made in a programming language. If the goal is a research paper, then the language can be whatever the researchers doing the work are comfortable with. If the goal is to solve an actual problem in a company, then the language will be dictated by the particular problem, or by whatever programming languages and tools the group supporting this feature uses.

 

Why do we do an implementation?

First, because we can.

Second, because having an implementation helps us understand the performance of the solution, even if it's a proof-of-concept (POC) implementation.

Third, because in the Industry, an implementation is what we ultimately need, because it's the way to actually solve the problem we're trying to solve.

Fourth, because an implementation can be tested, and passing these tests helps us have confidence that the algorithm is correct. More on this in the next section.

 

 

How do we know it's correct?

When designing or reading a new algorithm, we always wonder whether it is correct or not.

There are two ways to address this concern: formal proofs and stress tests. The best is to do both, but this is not always possible or a useful investment of time.

 

In the academic setting, when writing a research paper on a novel concurrent algorithm, the correctness proof is the preferred approach. When small, the proof can be placed in the paper itself; otherwise, it's put in an accompanying document or appendix.

Some theoretical conferences and journals don't even care about having an implementation of the algorithm. All they care about is the proof: whether it is well constructed and really proves what it sets out to prove.

The more practical conferences (in the field of concurrency) usually don't care much about stress tests either. They give importance to the fact that an implementation was made and to the benchmarks made with this implementation, but not to any supporting code that tests the correctness of the implementation. And because the reviewers don't value this, the researchers submitting papers don't spend time writing tests either. To the point where, of all my colleagues working in this field, I only know of one who consistently spends time writing stress tests for his work.

 

The majority of my fellow research colleagues don't stress-test their work and instead prefer to spend some time writing a formal proof of correctness. The incentive just isn't there.

Writing research papers takes time. It takes time to come up with an idea worthy of a paper, it takes time to turn it into an algorithm, it takes time to implement it, it takes time to benchmark it, it takes time to write about it in a way that pleases reviewers at conferences, and if on top of that we have to write stress tests, it's just too much.

 

Proofs are made by humans, usually the same humans who designed the algorithm being proven correct, and therefore it's possible for the same fallacies that introduced an error into the algorithm to creep into the proof. This depends on the kind of proof; for example, we can argue that invariant proofs are less prone to this nefarious effect, but there is no way to be immune to it.

 

In the software industry the mindset is nearly opposite.

There, we don't care about formal proofs, because each proof is tightly associated with the algorithm and implementation. Changing a single line of code can invalidate the corresponding proof. This is particularly true of lock-free algorithms, but it's a statement that holds for concurrent algorithms in general.

Business needs can change, which means code can have new functionality, which means the implementation will change as time goes by. It's not practical to write a new proof every time someone changes a line of code.

 

Yes, there are tools like TLA+, but then we need to keep the proof and the implementation in sync, which is its own pain.

On the other hand, if we write tests to cover the implementation, every time we change a line of code we can just re-run the tests. It won't give a 100% guarantee that you didn't break anything, but it will help to catch many mistakes.

 

Furthermore, in the Industry we actually want implementations that work. What good is an algorithm that has been proven correct, if the implementation of this algorithm is full of bugs? 

Not really much, I'm afraid. It's a much better use of the engineer's time to have tests that assert (some) correctness of the implementation than to have them write formal proofs or TLA+ proofs.

 

 

Tests imply re-usability

My intellect is not capable of holding a large concurrent algorithm. I know of other researchers who can, and Andreia is one example.

If you present me with a novel concurrent algorithm that is large(ish), I may not be able to convince myself that it is correct, because I won't be able to stuff it into my head as a whole. I won't be able to go through all the possible interleavings of the steps, nor reason about its correctness.

It's like the chess players who can see 7 moves ahead on the board versus the ones who can only see 3 moves ahead. Yes, it's a skill that you can practice, but after a certain point you get old, you have other stuff in your head, and frankly, I don't even want to invest the time it takes to grasp such large algorithms.

Because of this limitation, I need to restrict myself to small algorithms. And even for the small algorithms, I do stress tests.

 

If we have tests to cover the behavior of a queue, then any new queue we design and implement can be tested with the same tests. If a particular queue passes all the tests we have designed over the years, then this gives me good confidence that it is a correct queue implementation and, by consequence, that the queue's algorithm is likely correct.
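As an illustration of what such a test looks like, here is a minimal invariant/stress test in C++ written against a hypothetical queue with enqueue()/dequeue() methods (a trivial mutex-based queue stands in for whatever implementation is actually under test). The invariant is the simplest and most useful one: every enqueued item is dequeued exactly once, nothing lost, nothing duplicated, nothing invented.

#include <atomic>
#include <cassert>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

template<typename T> class QueueUnderTest {        // stand-in implementation
    std::mutex m; std::queue<T> q;
public:
    void enqueue(T v) { std::lock_guard<std::mutex> g(m); q.push(v); }
    bool dequeue(T& out) {
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return false;
        out = q.front(); q.pop(); return true;
    }
};

int main() {
    const int numThreads = 4, itemsPerThread = 100000;
    QueueUnderTest<long> queue;
    std::vector<std::atomic<int>> seen(numThreads * itemsPerThread);
    for (auto& s : seen) s = 0;

    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; t++) {
        threads.emplace_back([&, t] {
            for (long i = 0; i < itemsPerThread; i++) {
                queue.enqueue(t * (long)itemsPerThread + i);   // produce known items
                long item;
                if (queue.dequeue(item)) seen[item].fetch_add(1);
            }
        });
    }
    for (auto& th : threads) th.join();
    long item;
    while (queue.dequeue(item)) seen[item].fetch_add(1);       // drain leftovers

    for (auto& s : seen) assert(s.load() == 1);  // invariant: each item seen exactly once
    return 0;
}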

 

Moreover, having a well-tested queue means that when we design other concurrency mechanisms that need a queue, we know we can use the queue we designed beforehand, because when we find issues in the new thing we are designing (which we inevitably will), we can be confident that those issues originate from the algorithm that works on top of the queue, and not from issues in the queue itself.

By stress-testing everything we do, we gain re-usability!

And yes, stress-testing is no 100% guarantee that the implementation is correct, but it sure beats not having anything and, I would argue, it beats proofs too.

You see, a proof is for the algorithm, not the implementation. A stress test, on the other hand, validates the correctness of both the implementation and the algorithm. Two for one is a better investment of your time.

 

Not only can the implementation be re-used, but the tests can be re-used on other implementations. When we design stress tests to check the invariants of concurrent sets, we can use the same tests to check the correctness of any other set implementation.

The tests we do become an investment for the future, which reaps its benefits as the years go by.

 

 

My take on correctness

Suppose you came up with a new lock-free data structure and now you want my help with it. There are two different ways I can have confidence that what you did is correct:

  1. You make a formal proof of correctness to accompany the algorithm;
  2. You make as many stress/invariant tests as you can think of and make your implementation of the algorithm pass those tests;

Ideally, you should have both, but if you're going to do just one, then go for number 2.

Notice that you need an algorithm in both cases. If you don't have an algorithm, then you have nothing. Some people jump directly from the idea to the implementation stage. This means that the algorithm lives only in their head. Good for them, but the point of a research paper is to share information, and forcing others to reverse-engineer the algorithm from the implementation is not a good way of sharing. And most likely, no one will bother to do so.

So, yeah, without an algorithm, you got a big fat nothing.

 

 

Summary

In summary, the reasons to write stress/invariant tests are:

  • These tests will be re-usable later;
  • They help prove the correctness of the algorithm and the implementation;
  • The algorithms validated with these stress tests can then be used as building blocks of more complex algorithms;
  • If you have tests, you will be able to validate the algorithms of other researchers;
  • If you make modifications to the algorithm, you will be able to re-check the correctness of the modified algorithm without having to make a new proof from scratch;
  • Writing tests is fun!

Friday, March 12, 2021

The 4 laws of Durability

When it comes to having durable data, there are four ways to do it: undo log, redo log, shadow copy and shadow data.

 

Let's start with the preliminaries.

So what do we mean by "durable"?

Well, durable means that whatever data you're trying to save has reached your storage device in a consistent way. It means that when you write to storage you want it to be "permanent", whether that storage device is a USB key, a CD-ROM, a hard drive, an SSD, or a non-volatile memory DIMM like Intel's Optane DC PM.

For any of these storage devices, the algorithms are always the same: you have to use one of the four mentioned above.

Keep in mind that these are needed if the data needs to be consistent, i.e. you want either the whole data in storage or none of it. I mean, if we were ok with having garbled data, then why would we bother saving it to permanent storage? The whole point of making data durable is that it contains important information and, therefore, durability implies consistency.

 

Now that the basics are out of the way, what are exactly these four algorithms?

I'm going to focus on these in the context of transactions, but they don't necessarily have to be about that.

 

Undo log is a technique where we write a log entry to durable storage before each write is done to storage. It allows multiple independent (non-atomic) writes to become durable in an all-or-nothing way, like a transaction or a checkpoint.

In the context of persistent memory, libpmemobj in PMDK is an example of a transactional system that uses undo log.
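As a rough illustration of the technique (not PMDK's actual API), an undo-log discipline for word-sized writes looks like the sketch below, where pwb() and pfence() are placeholders for a persistent write-back (e.g. clwb) and an ordering fence:

#include <cstdint>

inline void pwb(const void* addr) { (void)addr; }  // placeholder: persist write-back
inline void pfence() { }                           // placeholder: ordering fence

struct UndoEntry { uint64_t* addr; uint64_t oldVal; };

struct UndoLogTx {                 // hypothetical layout, bounds checks omitted
    UndoEntry entries[64];
    uint64_t  numEntries = 0;      // durable log size: non-zero means "in-flight"

    void write64(uint64_t* addr, uint64_t newVal) {
        entries[numEntries] = { addr, *addr };   // save the old value...
        pwb(&entries[numEntries]);
        numEntries++;
        pwb(&numEntries);
        pfence();                  // ...and make the entry durable first
        *addr = newVal;            // only then modify the data in place
        pwb(addr);
    }

    void commit() {
        pfence();                  // last modification durable before the log reset
        numEntries = 0;            // an empty log means the data is consistent
        pwb(&numEntries);
        pfence();
    }
    // Recovery: if numEntries != 0, roll the entries back in reverse order.
};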

 

In redo log we write the log, with multiple entries, to storage before writing the actual data. The difference between redo and undo is that an undo log writes one entry to the log at a time, followed by one modification, while a redo log writes all the entries to the log in one shot and then applies all the modifications in one shot.

Mnemosyne and OneFile are examples of transactional systems that utilize redo log.

 

Shadow copy, sometimes called copy-on-write (COW), creates a new replica of the data and writes the new data, along with the unchanged contents, to durable storage, before swapping some kind of pointer to indicate that this is the new object/data and that the old one can be discarded. COW can't really be used by itself for transactions over multiple objects, but it can be combined with a redo log to make it more efficient.

One example is SAP HANA which uses redo log with COW.

 

Shadow data can sometimes be confused with COW, but it is not the same thing. In shadow data, two (or more) replicas of the entire data are kept in durable storage and both are updated with the modifications, one at a time: first one replica, then a logical pointer, and then the second replica. On the next set of atomic writes, the recently updated replica is the first to be updated.

Examples of shadow data transactional systems are Romulus, RedoDB and Trinity to some extent.

 

 

We thought long and hard about the similarities and differences between these four algorithms for durable transactions, and we found that they possess four common characteristics, regardless of the underlying storage media for which they are intended.

Each one of these characteristics reveals an important insight into the concept of durability, and we believe these to be empirical rules to which all durable techniques adhere. These rules are:

  1. There must be a replica of the data;
  2. There must be a durable state indicating which of the replicas is consistent;
  3. All algorithms require at least one ordering constraint on the writes to durable storage;
  4. A modification is durable only after a round-trip fence to the storage hardware.

 

 

The first key insight regarding durable transactions is that a consistent and durable replica of the data must exist at all times. This replica may be a full copy of the data, such as on shadow data, or it may be a logical replica, such as on undo log and redo log.

Intuitively, there has to be a consistent replica of the data so that there is a way to recover the data to its original consistent state in the event of a failure. Shadow data keeps a full replica of the data, thus incurring a high permanent usage of the durable media (space amplification), while the undo log and redo log approaches have to write to durable storage not just the new data but also encoded information about the location and size of the modification (write amplification).

There's clearly an important trade-off here: log-based algorithms will increase (amortized) write amplification but shadow-data-based algorithms will increase space amplification.

 

The second empirical rule implies that the algorithm must ensure that, irrespective of when a failure occurs, there is a way for the recovery procedure to determine which of the replicas is consistent.

Shadow data like Romulus uses a two-bit variable to determine which of the two replicas is consistent, while redo log and undo log can use the size of the log (zero or non-zero) to indicate if the log is consistent.

By itself, there is no significant difference between any of the approaches; however, the exact mechanics will influence the number of ordering constraints in the algorithm.

 

This leads us to the third insight: data consistency is possible only through ordering of some of the writes.

For shadow copying, the modifications on the new block must be made durable before the pointer swap; otherwise, a failure occurring after the pointer swap is made durable would leave the pointer referencing an inconsistent block. This means that, apart from block allocation and de-allocation details, shadow copying has a single ordering constraint, or in other words, a single ordering fence.

Shadow data like Romulus uses a two-bit state (though one bit would suffice) to indicate which of the two replicas is the consistent one, or whether both are consistent. If the state variable indicating which replica is the consistent one becomes durable before or after the modifications on either replica and a crash occurs, upon recovery it may be referencing the inconsistent replica. For this algorithm, three ordering constraints exist: one to prevent the state from changing to COPYING before the modifications in the main replica are done; another to prevent the modifications in the back replica from being done before the state changes to COPYING; and another one to prevent the state from changing to IDLE before the changes on the back replica are durable.

The undo log technique has two constraints per modified object/range: the log entry must contain the old value before the entry is added to the log; and the entry must be added to the log before the modification is done on the data. Undo log has one extra constraint per transaction, requiring the last modification to be durable before the log is reset.

The redo log technique has three constraints per transaction: all the log entries must be durable before the log size is set; the log size must be set before the modifications are done on the data; and the modifications on the data must be durable before the log is reset.
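Putting the redo-log constraints into code form, a schematic durable commit looks like the sketch below. The pwb()/pfence()/psync() helpers are placeholders (persistent write-back, ordering fence, round-trip fence) and the layout is mine, purely to show where the three ordering constraints and the single round-trip fence of rule 4 sit:

#include <cstdint>

inline void pwb(const void* addr) { (void)addr; }  // placeholder: persist write-back
inline void pfence() { }                           // placeholder: ordering fence
inline void psync()  { }                           // placeholder: round-trip fence

struct RedoEntry { uint64_t* addr; uint64_t newVal; };

struct RedoLogTx {                 // hypothetical layout, bounds checks omitted
    RedoEntry entries[64];
    uint64_t  numPending = 0;      // volatile count while the tx is being built
    uint64_t  logSize = 0;         // durable: non-zero means "log is complete"

    void write64(uint64_t* addr, uint64_t newVal) {
        entries[numPending++] = { addr, newVal };    // just buffer the write
    }

    void commit() {
        for (uint64_t i = 0; i < numPending; i++) pwb(&entries[i]);
        pfence();            // (1) all entries durable before the log size is set
        logSize = numPending;
        pwb(&logSize);
        pfence();            // (2) log size durable before touching the data
        for (uint64_t i = 0; i < numPending; i++) {
            *entries[i].addr = entries[i].newVal;    // apply the modifications
            pwb(entries[i].addr);
        }
        pfence();            // (3) modifications durable before the log is reset
        logSize = 0;
        pwb(&logSize);
        psync();             // rule 4: one round trip to the storage device
        numPending = 0;
    }
};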

 

The fourth and final rule addresses the need for a round-trip synchronization mechanism to the storage domain, such that the hardware can guarantee that all previously written data is in stable, durable storage. The cost of such a fence is typically of the order of the storage device's latency.

Fast devices like PM implement this round-trip fence orders of magnitude faster than slower devices, like hard drives.

Without such a mechanism, it is not possible to have durable operations, even if ordering constraints are set: in the event of a failure, the ordering constraints impose a temporal sequence in which the writes will be made durable, but there is no guarantee of durability.

A corollary of this is that all algorithms require one and only one such fence, strategically placed.

Notice that the ordering constraints may be replaced by such synchronous fences, to the detriment of performance, and in fact many storage systems make no distinction between the two. Ordering is typically achieved with an asynchronous fence and it relates to the order in which certain writes will be made durable in the storage media.

On block based storage, this is typically implemented with fsync() or fdatasync().

In Persistent Memory (PM), ordering can be achieved through the combination of flushes (clwb) and fences (sfence), or by writing to the same cache line. The round-trip guarantee of durability is given by a synchronous fence: either fsync()/fdatasync() on block storage, or sfence on PM.
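For PM on x86, the distinction can be sketched as follows, assuming a CPU with the clwb instruction (compile with -mclwb) and the usual ADR guarantee that data accepted by the memory controller is persistent. The same sfence serves both roles; the pfence/psync split is purely conceptual:

#include <immintrin.h>   // _mm_clwb(), _mm_sfence()

inline void flush(const void* addr) { _mm_clwb(const_cast<void*>(addr)); }
inline void pfence() { _mm_sfence(); }   // ordering between flushed writes
inline void psync()  { _mm_sfence(); }   // durability point (round trip to the persistence domain)

// Make 'x' durable no later than 'ready', then wait until both are durable.
void publishDurably(long* x, long* ready) {
    *x = 42;
    flush(x);
    pfence();
    *ready = 1;
    flush(ready);
    psync();
}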

 

In case you haven't noticed, the fact that all algorithms require one round-trip fence to the device (psync), but may require multiple ordering fences (pfence), has implications for performance. This is especially true given that the psync has inescapable physical implications: it is not possible to have all-or-nothing consistent durability without a psync that physically does a round trip to the storage device (or at least the storage domain), and therefore the latency cost of this single round trip is inescapable.

However, different algorithms may have different ordering constraints (pfences) and these may have different costs.

 

Yes, fsync() is used for both sync and ordering on block devices, and the sfence instruction is also used for both on PM; however, there are tricks. On PM, writes to the same cache line are guaranteed to be ordered and therefore no sfence is needed to order them, as long as stores with memory_order_release are used.

Seeing as these round trips are typically the bottleneck when doing random writes to PM, the fact that we can create an algorithm with a lower number of psyncs means we can have a performance gain that is nearly proportional to the reduction in the number of such fences.

 

This is exactly what we've done with Trinity.

Trinity is a novel durability technique that needs just two fences per transaction and reduces the number of flushes when doing random writes. It consumes more memory than the previous techniques, but it has significantly higher performance.

Moreover, we combined it with our own variant of TL2 for highly scalable durable linearizable transactions, and we used that to make a K/V store, which is likely the fastest K/V store on the planet with full transactions (though you need Optane Persistent Memory to be able to run it).

 

If you want to see the video, it's here:


https://youtu.be/vdInrf_kk1w

 

If you want the source code, it's here:

https://github.com/pramalhe/durabletx/blob/master/ptms/trinity/TrinityVRTL2.hpp

Friday, October 2, 2020

A Relaxed Guide to memory_order_relaxed

Just a quick post to point to a nice presentation on relaxed atomics by Hans Boehm and Paul McKenney:

https://www.youtube.com/watch?v=cWkUqK71DZ0



Very instructive if you are designing locks or lock-free code.

Full paper (PDF) here http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2055r0.pdf

Wednesday, July 1, 2020

HydraConf 2020

The Hydra Conference is almost here again: https://hydraconf.com/
This year the speaker lineup is even more impressive: https://hydraconf.com/2020/msk/people/

Maurice Herlihy is going to talk about Software Transactional Memory, and there are many more interesting talks.

It's going to be online-only, so anybody can attend.
You can buy your tickets here.

Here are some talks from last year to get you going.

Friday, June 12, 2020

You say read-committed, I say buggy-code, potatoe, potato, tomatoe, tomato

If I were to make a concurrent map data structure whose operations are not linearizable, then put it in a library and give it to my users, any user of that library would come back to me and say that my library has a bug.
They may not be able to tell me exactly what it is, but they'll understand that there is a bug in it.

However, if I take the same concurrent map data structure, put it in an application, call that application a "Key-Value store" or a "Database" (DBMS), and give it to typical database users, it seems they may well use it for several decades without ever complaining that this "Database" is not linearizable (or serializable, as the DB folks call it).

If this sounds far-fetched, then just go and read this post: http://jepsen.io/analyses/postgresql-12.3
It seems that the Postgresql users really didn't care that Postgresql isn't serializable when told to be so and, in fact, isn't even read committed by default, which should be the default. And a similar thing happened to MongoDB last month, so it's not something specific to Postgresql.

I find this interesting because it's an extreme case of managing expectations: when you take a library of concurrent data structures, you expect a certain kind of consistency, namely linearizability (just go and read the discussions on the Java concurrency mailing list if you don't believe me).
However, if you take a DBMS like Postgresql, you no longer expect linearizability to be the default behavior. In fact, you expect read committed as the default behavior, and it turns out Postgresql doesn't even give read committed and instead gives snapshot isolation.
A user of a library of concurrent data structures would call read committed a bug. She would also call snapshot isolation a bug, and pretty much anything that is not linearizable would be a bug to her.
The reason it would be a bug is that it is very hard to reason about anything that is not linearizable. In the post by Jepsen you can even see that there is no exact definition of read committed, so I guess that's one of the reasons why nobody complained about it before.

I can easily imagine discussions of the type:
DBMS User: This behavior is weird!
DBMS Developer: No, it's not weird, it's "read committed", you just have to go and learn about what it means!
DBMS User: The definition of "read committed" is so fluffy that it can mean any weird thing... I can't even understand if the weird thing I'm observing is "read committed" or not.
DBMS Developer: See!?! I was right, this is "read committed"!
DBMS User: Ok, I'll keep using your DBMS because all the other DBMS work the same way.

I could have understood it if it were a distributed database, because the cost of a serializable transaction over a distributed database is likely proportional to the latency of the furthest node in the database, while for read committed it may be lower (who knows?). But the scenario that Jepsen describes isn't even about distributed databases. It's a bug found running on a single-node database. There's no "distributed-databases-are-hard" excuse here (which they are, it's true).

It makes me wonder how the DBMS folks got their users so well trained that they don't complain about consistency bugs in the DBMS?!
On one side, I'm super envious because I secretly wish I could be as dismissive to my users as the DBMS folks are to theirs, lol.
But seriously, how could this have happened for so long?!?
I see three possible reasons:
1st, the users don't really care about consistency. They're running businesses. As long as the DBMS is fast enough and has the features they need, they'll continue to spew out cash for it. Correct consistency is not an issue for 99% of database use-cases; as long as the data doesn't get corrupted or lost, everything's fine.
2nd, it's always been like that. Everybody accepts that it works like that, and if you want something better, you have to go for a niche DBMS (not that easy to find). Read committed, snapshot isolation and other strange names like these are just the "status quo", and nobody wants to change the status quo.
3rd, the DBMS folks hide behind the wall of complexity that is the DBMS. It's a common scenario in IT. They would say something like "Ooooohhhhh this DBMS is too complicated for mere mortals to question! It takes many years of work and millions of lines of code! Here be dragons! Oooohhhh".

If you think of other reasons behind this, I would like to hear about them in the comments section.

Anyways, with the advent of Persistent Memory (PM) and Software Transactional Memory for PM, this game is changing.
One example close to my heart is RedoDB.
RedoDB is a "key-value store" but it supports linearizable transactions over any C++ data type (needs to be type-annotated though). Not only that, but these transactions are wait-free.
That's right, you heard it well: RedoDB provides durable linearizable/serializable wait-free dynamic transactions.
Not only does it do that, but it does it slightly faster than RocksDB.

The downside? It consumes vast amounts of memory, though it won't be any worse than most implementations of Multi-Version Concurrency Control. At least in RedoDB there is a bound on memory usage.

Anyways, our goal was to show that wait-free DBMSes are feasible and can be made efficient. We weren't aiming for a commercial product.
Ohh, and did I mention that this DB is about 3k lines of code, as opposed to the several hundred thousand LOC of other DBMSes?

You can checkout the paper for RedoDB here: https://dl.acm.org/doi/abs/10.1145/3342195.3387515
and the source code here:  https://github.com/pramalhe/RedoDB

Sunday, May 17, 2020

Dealing with bugs in Concurrency and Durability

Everyone knows that Concurrency is hard; in fact, it's NP-hard: the total number of interleavings typically grows exponentially with the number of lines of code (LOC) in a program. Determining whether any of these interleavings causes incorrect behavior becomes problematic for anything but very small programs.
This never prevented anyone from writing large multi-threaded programs, and the usual trick is to use locks. A code block surrounded by a lock does not have interleavings with other code blocks protected by the same lock, and the problem becomes "tractable" again.

Unfortunately, we can't use this trick for lock-free and wait-free code. Lock-free algorithms are typically done with atomic instructions that can interleave with each other.
Verifying correctness of these algorithms is a difficult problem.
There are static verification tools like SPIN, JPF, TLA+ and others, but such tools only work for a small number of threads (no more than 6) and small programs (fewer than 100 LOC). As if this wasn't bad enough, these tools force the user to write the program in a particular language, apart from JPF, which is almost Java. The program will have to be re-written later in the target language it is meant for, and during this "translation" errors can be introduced.

Now, don't get me wrong, software has bugs. It's just part of life as a software developer to deal with them and try to minimize them.
Entire industry trends like Test Driven Development (TDD) and Test Automation revolve around the concept of testing to reduce the amount of bugs.
In other words, there's this idea that the more time you invest in tests, the more quality the final program will have. IMO this is true, but obviously the effort is not linear: beyond a certain number of tests, the number of new bugs you'll detect will be small compared to the effort of maintaining those tests.

Concurrency bugs can, and should, also be approached in this manner, but this is far from sufficient.
Moreover, testing a concurrent program requires a different set of techniques and a different mindset than testing a sequential program: instead of exercising input-output responses, we should test program invariants; instead of mocking components, we should stress the code by increasing the number of threads or changing timings to exercise different interleavings and code paths.
This is where run-time verification can be of great help. Tools like valgrind and Address Sanitizer become a must-have if you're writing C/C++ multi-threaded applications, and there is a good deal of research being done in this field, so I'm sure that more capable and more efficient tools will show up in the future. However...

Suppose I have a new lock-free algorithm and I implement it. Then I bother to write some tests for it (which most researchers won't do) and then run it with address sanitizer and whatever other run-time tools I can get my hands on. None of my tests or run-time tools encounter any error. Now what? Is my algorithm ready for publication? How will the peer reviewers know that it is correct? How will I convince them of that?
And here lies the big problem in researching concurrent algorithms that people don't usually talk about: "How do I know if the lock-free algorithm in this paper is correct?"

As I said before, it's ok to have bugs in your code, all software has bugs, and software developers are used to this inevitability.
What is not ok is to have bugs in your algorithm. It's very hard to prove the correctness of a lock-free algorithm, although it's usually simple to verify that a bug exists once someone tells you the sequence of steps to reproduce it.

This can lead to surreal scenarios that typically look like this:
Me: Read the latest paper on a novel lock-free algorithm. Spend weeks thinking about it. Come up with a scenario where the algorithm doesn't work. Send an email to authors describing the scenario.
Authors: Spend a week thinking if the scenario is actually a bug and how to fix it, propose a slight variation of the algorithm.
Me: Spend another week thinking about it, come up with a different scenario where there is a race condition.
Authors: Spend another week thinking about it and propose a modified algorithm.
Me: Spend another week thinking about this different algorithm. Get to the conclusion that the algorithm is no longer lock-free. Send a new email explaining why.
Authors: Spend another week thinking about it and propose yet another version of the algorithm.
Me: Spend another week thinking about it, only to reach the conclusion that this new algorithm is not even linearizable.
Authors: We weren't aiming for linearizability anyways...
Me: Lose all faith in the work done in the field of concurrent algorithms, and lose faith in mankind in general... at least for a couple of weeks, then I read another paper that looks interesting and the cycle restarts.

This may sound bad, but it's a description of when things turn out well. Other times I send the email and the authors just ignore me, or they answer something that shows they don't even understand there is a bug, or there is no bug at all and the whole thing was a result of me missing some vital detail in the algorithm.
Ultimately, either I shame the authors by pointing out their mistakes, or I shame myself for being too dumb to understand their paper. Never a good outcome, and it's not a good way of "making friends". Unfortunately I can't help myself, I'm in it because I want to learn, and you learn the most when you're willing to make a fool of yourself  ;)

This is not about incompetence, this is about the inherent difficulty in writing lock-free algorithms. The authors are not sure that their algorithms are correct, the reviewers are not sure that the algorithms are correct and the readers of the papers are not sure that the algorithms are correct.  And don't get me started on the actual "implementations" of these algorithms.

It's a mess, and it creates a lot of distrust in the field, especially during peer reviewing.
Papers end up being accepted only if there is an expert reviewer who really knows about this stuff and decides to spend the several weeks of their personal time it takes to fully understand the algorithm, and then comes back to the other reviewers and manages to convince them that this stuff is correct.
Most of the time, academic papers get accepted just because they're not obviously wrong. But that may simply be because the algorithm is so complex that no one is willing to "dig into it". I've seen examples of papers accepted into major conferences where the basic algorithm was wrong (with no obvious way of fixing it), and then the authors remain convinced that the approach is valid and correct and write follow-up papers that build on that algorithm, and those papers continue to be accepted at major conferences, when it's all crap.
Luckily, this is rare. Most of the bugs I've seen in lock-free algorithms are things that can be fixed without a major re-design of the algorithm.

There is also the effect that if the algorithm is too simple (there aren't many of those, but still), then the reviewers may think that the work doesn't have enough value. But if the algorithm is too complex, then they won't be able to understand whether it is correct or not.
This kind of creates the incentive for papers to have incremental algorithms based on prior published work, or to have several simple algorithms as opposed to a single complex one.

I've noticed that when it comes to durability (persistence), the problem is similar.
Durable algorithms are not NP-hard: the failure can occur at a single point in the code, which means that the complexity is linearly proportional to the LOC, as opposed to concurrent algorithms, where the complexity grows exponentially with the LOC in the algorithm.
However, the pattern of me reading papers on durable algorithms, then sending emails, and then the authors fixing-but-not-really-fixing is pretty much the same.

It's still a hard problem, particularly now with Optane DC Persistent Memory that has a special set of rules for data to be correctly persisted.
On Persistent Memory it is very easy to forget a flush or a persistence fence. This means the implementation is incorrect; however, in practice it may show no issue, because the probability of a crash happening exactly in the place where the fence is missing is very low.
On the other hand, if there are fences missing all over the place or if there is something being used that simply doesn't make sense, then no matter where the program crashes, the data will end up corrupted. So there are different shades of grey here.

Again, the problem repeats itself when it comes to getting a paper accepted at a conference: unless one of the reviewers really knows their stuff and is willing to invest the time it takes to do a deep dive into the algorithm presented in the paper, getting a "correct" durable algorithm accepted is just a game of luck. The best that can be done is to make a convincing proof, or at least explain some invariants.

I believe that as time goes by, there will be better tools to help write and verify correct durable algorithms. The thing is, by the time that happens (years from now), all or most of the interesting durable algorithms will already have been published, so it kind of defeats the purpose. Moreover, for the complicated stuff we can always use transactional durable techniques (undo-log, redo-log, copy-on-write, shadow-data) or a universal construction, like the one we presented at EuroSys 2020: link here.

For concurrency it's really black and white: you miss a fence, your algorithm is wrong, and you WILL see a bug happen eventually.
I don't see any real solution for concurrent algorithms coming anytime soon. And that's a shame, because as long as things remain the way they are, it will be very hard for the good (correct) algorithms to be distinguished from the bad (incorrect) ones. It becomes a kind of "reputation game", where papers from people who have produced correct algorithms in the past will likely have correct algorithms later as well, but this is not really reliable, seeing as we're all human and bound to make mistakes at some point. For now, all we can do is test our implementations of these algorithms as extensively as possible so as to minimize the chances of bugs, but it will never be 100% certain.

This discussion may seem a bit boring: software will always have bugs, so what's the problem with having buggy algorithms?
My argument is that it's a big deal. Actually, it's a very BIG deal! I've seen many papers that build on previous papers, and then I go look at the previous work on which they're based, and that work is incorrect. We need solid foundations on which to build, to be able to progress further and faster.
If we don't give the proper value to the correct algorithms and point out the incorrect ones, then we're creating the wrong incentive. Researchers will not bother to spend the time testing or validating their algorithms. This creates the climate of suspicion that we're in today.

Ultimately, designing lock-free algorithms will never be "easy", but designing lock-free and persistent data structures is easy when you use universal constructions or a lock-free software transactional memory. This means all these problems are an issue only for researchers of concurrent algorithms, but not really a problem for people who just need a persistent or lock-free data structure.