Saturday, December 3, 2016

LCRQ and FAAArrayQueue are not good for all scenarios

In the previous post we presented a new lock-free linearizable queue named FAAArrayQueue, which is currently the fastest lock-free queue, or at least the fastest portable lock-free queue.
FAAArrayQueue uses a Fetch-And-Add (FAA) instruction to assign an entry to each enqueuer and each dequeuer, similarly to LCRQ, a fast lock-free queue invented by Adam Morrison and Yehuda Afek back in 2013, which additionally requires a double-word CAS (CAS2).

In case you missed it, there is a non-obvious problem with queues that use a fetch-and-add in this way: the enqueuer may be "bumped out" by a faster dequeuer that took the same ticket when doing the FAA. Typically this is not an issue, because the enqueuer is fast enough to do the FAA and place its item in the designated position before a matching dequeuer does an FAA, sees that there is no item in its position, and bumps out the enqueuer.
There is however a scenario where this happens a lot, or at least enough times to cause a significant performance drop. It occurs when we have one or just a few producers (enqueuers) and multiple consumers (dequeuers). This is sometimes called a "multiplexer pattern" or "sprayer pattern" and it is often used in situations where one thread wants to delegate work to other threads.

To show this issue in action, we created a benchmark with one thread being a dedicated producer and the other N threads being dedicated consumers.
At the end we show how many enqueues were done by the producer and how many non-null dequeues were done in total by all the consumers. For a well-behaved queue we would expect the number of dequeues to remain more or less constant, or perhaps even to scale, at least up to the number of enqueues, which itself should remain the same regardless of the number of consumers (there is always one and only one producer).
Unfortunately the results show a very different picture for LCRQ and FAAArrayQueue.
Here are the results of the benchmark comparing four different queues as the number of dedicated consumers increases:

The tests are in C++ and run for just 4 seconds on a 32-core AMD machine, and all four lock-free queues do their own memory reclamation using the same Hazard Pointers class.

The plots show that as the number of dedicated consumers increases, the throughput decreases dramatically for LCRQ and FAAArrayQueue, even for the enqueuer. This happens because of the "bump out" effect described above, which forces the enqueuer to do multiple fetch-and-adds until it finds a vacant entry. Obviously this also hurts the tail latency a lot, but we're not even going to go into that.

The conclusion from all of this is that different queues have different properties, and fast and balanced queues are hard to come by.
Anyways, if you have more dequeuers than enqueuers, particularly with dedicated enqueuers and dequeuers (a more realistic scenario than having the same thread act as both enqueuer and dequeuer), then the LazyIndexArrayQueue (or others) is probably a better choice than LCRQ or FAAArrayQueue.

Sunday, November 27, 2016

FAAArrayQueue - MPMC lock-free queue (part 4 of 4)

In this post we're going to show what is currently the fastest lock-free queue on the planet, or to be more precise, the fastest portable linearizable MPMC memory-unbounded lock-free queue. This is the fourth and last of the four linked-list-of-arrays-based queues with lock-free enqueues and dequeues.

If you want to jump directly to the code, you can get it on github in C++ (with memory reclamation):
or in Java:


Saying that FAAArrayQueue is the fastest lock-free queue on the planet is a bold claim, but from what everybody tells us, the fastest lock-free queue is the LCRQ made by Adam Morrison and Yehuda Afek in 2013,
and as it so happens, FAAArrayQueue is sometimes slightly slower and sometimes slightly faster than LCRQ, as you can see in the benchmarks below, all in C++, and all with arrays of 1024 entries per node (except Michael-Scott, which has a single item per node):

The reason why FAAArrayQueue is a bit below LCRQ in the Enq-Deq tests is cache locality: in an enqueue-dequeue benchmark LCRQ keeps re-using the same array, which means better cache locality, no new nodes created and, therefore, better throughput.
We're using an array of 1024 entries per node; if we used a larger array this difference would fade away, but I won't do that because I think it's silly to have a linked-list-based queue where each node has a huge array (who uses that in practice?).
On the burst benchmark, the bursts are much larger than 1024, so it forces LCRQ to create new nodes, which means that it won't be able to re-use the array, and in that scenario, the FAAArrayQueue is equally good for dequeues and even better for enqueues.

LCRQ is non-portable because it requires a double-word compare-and-swap (CAS2), which means it can only be used on x86 and only in native code (no Java, sorry). FAAArrayQueue gets close to the performance of LCRQ and is portable because it needs just a regular CAS, i.e. you can implement it in Java, or in C++ for any architecture using std::atomic<>.

How does FAAArrayQueue work?

It has some similarities with LazyIndexArrayQueue. For example, there is an enqidx and a deqidx in each node, but here they have a different purpose: to provide a unique entry in the items array through the use of a fetch-and-add (FAA) instruction (example code in C++).
    struct Node {
        std::atomic<int>   deqidx;
        std::atomic<T*>    items[BUFFER_SIZE];
        std::atomic<int>   enqidx;
        std::atomic<Node*> next;

        // Start with the first entry pre-filled and enqidx at 1
        Node(T* item) : deqidx{0}, enqidx{1}, next{nullptr} {
            items[0].store(item, std::memory_order_relaxed);
            for (long i = 1; i < BUFFER_SIZE; i++) {
                items[i].store(nullptr, std::memory_order_relaxed);
            }
        }

        bool casNext(Node* cmp, Node* val) {
            return next.compare_exchange_strong(cmp, val);
        }
    };
This idea has been used before, for example by LCRQ itself, or by Yang and Mellor-Crummey (YMC queue) in "A Wait-Free Queue as Fast as Fetch-And-Add".
The YMC queue uses this FAA approach to achieve high throughput and wait-free (unbounded) progress, and the basic idea is even described in Listing 1 of their paper, but that basic queue is only obstruction-free, while FAAArrayQueue is lock-free.

In the LazyIndexArrayQueue, the enqueue would start by looking for an empty entry and, after a successful CAS, would update the enqidx. In FAAArrayQueue the logic is inverted: first we do an FAA on enqidx to obtain an entry of the array, and then we do a CAS on that particular entry:
    void enqueue(T* item, const int tid) {
        if (item == nullptr) throw std::invalid_argument("item can not be nullptr");
        while (true) {
            Node* ltail = hp.protect(kHpTail, tail, tid);
            const int idx = ltail->enqidx.fetch_add(1);
            if (idx > BUFFER_SIZE-1) { // This node is full
                if (ltail != tail.load()) continue;
                Node* lnext = ltail->next.load();
                if (lnext == nullptr) {
                    Node* newNode = new Node(item);
                    if (ltail->casNext(nullptr, newNode)) {
                        casTail(ltail, newNode);
                        hp.clear(tid);
                        return;
                    }
                    delete newNode;
                } else {
                    casTail(ltail, lnext);
                }
                continue;
            }
            T* itemnull = nullptr;
            if (ltail->items[idx].compare_exchange_strong(itemnull, item)) {
                hp.clear(tid);
                return;
            }
        }
    }

The whole node logic of inserting a new node and advancing tail and head is still the classical Michael-Scott algorithm, just like on the other 3 array-based queues we presented, and just like on LCRQ, or on YMC. Nothing new here.

For the dequeue, the logic is again inverted relative to LazyIndexArrayQueue: first we do the FAA to get the index of the entry to dequeue, and then we do a CAS to obtain the item, or in this particular implementation, an atomic_exchange, because it is supposed to be slightly faster on x86:
    T* dequeue(const int tid) {
        while (true) {
            Node* lhead = hp.protect(kHpHead, head, tid);
            if (lhead->deqidx.load() >= lhead->enqidx.load() &&
                lhead->next.load() == nullptr) break;
            const int idx = lhead->deqidx.fetch_add(1);
            if (idx > BUFFER_SIZE-1) { // Node has been drained, check if there is another one
                Node* lnext = lhead->next.load();
                if (lnext == nullptr) break;  // No more nodes in the queue
                if (casHead(lhead, lnext)) hp.retire(lhead, tid);
                continue;
            }
            // 'taken' is a reserved sentinel pointer that marks the entry as consumed
            T* item = lhead->items[idx].exchange(taken);
            if (item == nullptr) continue;
            hp.clear(tid);
            return item;
        }
        hp.clear(tid);
        return nullptr;
    }

In the uncontended case, an enqueue does 1 FAA, plus 1 CAS, plus 1 hazard pointer store (seq-cst), and a dequeue does a similar number of operations.
On x86, the FAA is a fast operation even under contention (as explained in the LCRQ paper), and both the CAS and the hazard pointer store are done with no contention, just like in LCRQ, hence the high throughput.

The comparison between FAAArrayQueue and LCRQ is not completely fair because they are slightly different beasts:
1. LCRQ is capable of re-using the entries in the items array and will not allocate a new node (with a new items array) unless the current one is already full. FAAArrayQueue is incapable of re-using entries in the items array. This is a clear advantage for LCRQ because it allows re-use of the same array (good cache locality) as long as the queue is not too full. However, if memory usage is a concern for you, keep in mind that the array of items in LCRQ uses 128 bytes per entry, while in FAAArrayQueue it uses just 8 bytes per entry (on a 64-bit machine).

2. LCRQ requires a double-word Compare-And-Swap (CAS2) instruction, which only exists on x86 and is not part of any language's memory model or atomics API (that I'm aware of). Any LCRQ implementation is therefore x86-specific and must be done in a native language (i.e. C/C++/D are ok, Java is a no-no). FAAArrayQueue uses a regular CAS, which means it can be implemented on any CPU architecture (x86, PowerPC, ARM, etc.) and in any language with a memory model and atomics (C11, C++1x, D, Java, etc.). Here the advantage is clearly on FAAArrayQueue's side.

If you want a high throughput, memory unbounded, MPMC, linearizable, portable lock-free queue, you'll have a hard time finding something better than FAAArrayQueue.