Concurrency Freaks: The importance of Correctness in concurrent algorithms

My work is in the area of concurrent algorithms therefore, this post is going to be largely biased by my experience in this field. Having said that, I suspect that most of what I'm going to say is applicable to many other fields in Computer Science and maybe of other disciplines as well.

To me, the creative process in the field of Concurrent Algorithms is composed of three main steps: The Idea, The Algorithm, and The Implementation.

The Idea

Before we start doing anything we need an idea. Sometimes this is a well-formed idea, other times it's just an insight or trick that we feel can be used to solve a particular problem in a way that no one has done before. Maybe it's a way to solve the same problem but faster, or do it using less memory. Whatever it is, it gives you an advantage somehow.

Most people are capable of having ideas, but most ideas are useless, either because they have already been discovered, or because they are wrong, or because they lack understanding of important details. The later being the common case I have observed in my experience.

Novices in a field usually have ideas that fall in this category: they are simple and based on well-known concepts, or concepts from other fields, and they show a lacking of understanding of the fundamentals of this field (concurrency) and a lacking of knowledge about the problem at hand. These ideas don't actually work, but to the uninitiated, they look like they're good.

Then there are the class of people of occasionally have a decent idea. Once they have it, they can get very attached to the idea as this is precious to them and they likely won't be able to come up with anything better. This is the best they can come up with, therefore, they're going to try to sell it as the-best-thing-since-bread-came-sliced.

You have to understand that they worked hard to get there, therefore it's only normal that they want to value this idea, fighting for it with nails and teeth, if need be. It can be very hard to reason with these people and convince them that there are better ways to attack the problem.

The people in this category are not novices. They typically have some knowledge of the field. Sometimes they call themselves experts, perhaps because they have been working on the field for a long time.

There is a third group of people, the ones that have lots of ideas. This is small group because to get to this group you need to not only be good at having ideas, but also need to have a lot of knowledge about a particular field. My conjecture is that the kind of mindset it takes to learn about a field in depth, is almost the opposite to the mindset you need to innovate in a field. The foremost being a mindset where we need to memorize and accept a lot of what has been done by others previously in this field, while the later mindset is about asking questions about everything that is being taught to us.

Obviously, the two mindsets are not mutually exclusive, but let's just say that only a small subset of the researchers in a particular field have both mindsets, but I digress.

The main characteristic of this third group of people is that they have lots of ideas. Most of these ideas are below grade, a couple are decent, and every once in while, one idea will be great.

Because they have lots of ideas, researchers in this group are not attached to their ideas. They know that other ideas will come (to themselves or to other researchers) and those ideas will have slightly better trade-offs and be better in slightly different metrics. They understand that there is no perfect solution, the concept of perfection depends on the particularities of the problem at hand.

The Algorithm

Once you have an idea, the next step is to transform it into an Algorithm.

A lot of people confuse an idea with an algorithm, They are not the same thing.

An idea, is a vague description of a possible solution to a problem, with an emphasis on how the multiple pieces work together, or what is the main trick behind it. You can think of the Idea as the elevator pitch, or the drawing on the whiteboard.

An algorithm is an accurate description of the steps taken by this solution. It's a pseudo description of the way we solve the problem or execute a computation, it's a description of the program. It doesn't need to be written in a programming language, the steps can be in english language, but they need to be descriptive.

In my experience, this confusion of the Idea with the Algorithm causes a lot of friction when dealing with novices. Novices have trouble seeing the difference between these two concepts. To them, the idea is enough to move to the Implementation. If there are parts missing, they'll figure it out during implementation and leave those as implementation details. The fact that both the Idea and the Algorithm can be described in english, while the Implementation is described as source code (i.e. a programming language) makes the distinction between idea and algorithm even harder for novices.

Perhaps the best way to think about this is to put it in the context of Mathematics. In fields of theoretical Mathematics, there is no "Implementation stage", only the Idea and the Algorithm. Yet, to mathematicians, there is a clear distinction between an idea and an algorithm. The algorithm can be proven correct (or incorrect) while the idea cannot (unless it is obviously wrong).

It helps them that ideas are described in english, while algorithms are described in mathematical notation.

It also helps them that they have formal proofs to support the algorithm, but more on this later in the post.

This mathematical mindset helps me a lot in my research on concurrent algorithms. When describing an idea, I do an elevator pitch, or a drawing on the whiteboard, or maybe a couple of powerpoint animations. When describing an algorithm, I try to use pseudo-code or, if using english sentences, itemize each step and try to be as descriptive as possible.

It's not perfect, but it helps to make the distinction between the two stages.

The Implementation

In the field of concurrent algorithms, we usually do an implementation of the algorithm.

An implementation is made in a programming language. If the goal is a research paper, then the language can be whatever the researchers doing this work are comfortable with. If the goal is to solve an actual problem in a company, then the language will be dictated by the particular problem or whatever programming language/tools the group supporting this feature needs to do.

Why do we do an implementation?

First, because we can.

Second, because having an implementation helps us understand the performance of the solution, even if it's a proof-of-concept (POC) implementation.

Third, because in the Industry, an implementation is what we ultimately need, because it's the way to actually solve the problem we're trying to solve.

Fourth, because an implementation can be tested and passing these tests helps us have confidence that the algorithm is correct. More on this on the next section.

How do we know it's correct?

When designing or reading a new algorithm, we always wonder whether it is correct or not.

There are two ways to address this concern, formal proofs and stress tests. The best is to do both, but this is not always possible or a useful investment of time.

In the academic setting, when writing a research paper on a novel concurrency algorithm, the correctness proof is the preferred approach. When small, the proof can be placed in the paper itself, otherwise, it's put on an accompanying document or appendix.

Some theoretical conferences and journals don't even care about having an implementation of the algorithm. All they care is about the proof, whether it well constructed and really proves what it sets out to prove.

The more practical conferences (in the field of concurrency) usually don't care much about stress tests either. They give importance to the fact that an implementation was made and to the benchmarks made with this implementation, but not to any supporting code that tests the correctness of the implementation. And because the reviewers don't value this, the researchers submitting papers don't spend time doing tests either. To the point where of all my colleagues working in this field, I only know of one that consistently spends time writing stress tests for his work.

The majority of my fellow research colleagues don't stress test their work and instead, prefer to spend some time writing a formal proof of correctness. The incentive just isn't there.

Writing research papers takes time. It takes time to come up with an idea worthy of a paper, it takes time to turn it into an algorithm, it takes time to implement it, it takes time to benchmark it, it takes time to write about it in way that pleases reviewers at conferences, and if on top of that we have to write stress tests, it's just too much.

Proofs are made by humans, usually the same humans who designed the algorithm that is being proven correct, and therefore, it's possible to have the same fallacies that inducted an error in the algorithm, creep in into the proof. This depends on the kind of proof, for example, we can argue that invariant proofs are less prone to this nefarious effect, but there is no way to be immune to it.

In the software industry the mindset is nearly opposite.

There, we don't care about formal proofs because each proof is tightly associated with the algorithm and implementation. Changing a single line of code can invalidate the corresponding proof. This is particularly true in lock-free algorithms, but it's a general statement for concurrency algorithms in general.

Business needs can change, which means code can have new functionality, which means the implementation will change as time goes by. It's not practical to write a new proof every time someone changes a line of code.

Yes there are tools like TLA+ but then we need to keep the proof and the implementation in sync, which is its own pain.

On the other hand, if we write tests to cover the implementation, every time we change a line of code we can just re-run the tests. It won't give a 100% guarantee that you didn't break anything, but it will help to catch many mistakes.

Furthermore, in the Industry we actually want implementations that work. What good is an algorithm that has been proven correct, if the implementation of this algorithm is full of bugs?

Not really much, I'm afraid. It's a much better use of the engineers time to have tests that assert (some) correctness of the implementation than to have him write formal proofs or TLA+ proofs.

Tests imply re-usability

My intellect is not capable of holding a large concurrent algorithm. I know of other researchers who can, and Andreia is one example.

If you present me with a novel concurrent algorithm that is large(ish), I may not be able to convince myself that it is correct, because I won't be able to stuff it my head as a whole. I won't be able to go through all the possible interleavings of the steps nor reason about its correctness.

It's like the chess players who can see 7 moves ahead in the board, and the chess players who can only see 3 moves ahead. Yes, it's a skill that you can practice, but after a certain time you get old, you have other stuff in your head, and frankly, I don't even want to invest the time it takes to grasp such large algorithms.

Because of this limitation, I need to restrict myself to small algorithms. And even for the small algorithms, I do stress tests.

If we have tests to cover the behavior of a queue, then any new queue we design and implement can be tested with the same tests. If a particular queue passes all the tests we have designed over the years, then this gives me a good confidence that this is correct queue implementation and, by consequence, the algorithm of the queue is likely correct.

Moreover, having a well-tested queue means that when we design other concurrency mechanisms that need a queue, we know that we can use the queue we designed beforehand because when we find issues in the new thing we are designing (which inevitably we will) we will be certain that those issues originate from the algorithm which works on top of the queue and not due to issues in the queue itself.

By stress-testing everything we do, we gain re-usability!

And yes, stress-testing is no 100% guarantee that the implementation is correct, but it sure beats not having anything, and I would argue, it beats proofs too.

You see, a proof is for the algorithm, not the implementation. On the other hand, a stress-test validates correctness for the implementation and the algorithm. Two for one is a better investment of your time.

Not only the implementation can be re-used, but the tests can be re-used on other implementations. When we design a stress tests to check invariants of concurrent sets, we can use the same tests to check the correctness of any other set implementation.

The tests we do become an investment for the future, which reaps its benefits as the years go by.

My take on correctness

Suppose you came up with a new lock-free data structure and now you want my help with it. There are two different ways I can have confidence that what you did is correct:

You make a formal proof of correctness to accompany the algorithm;
You make as many stress/invariant tests as you can think of and make your implementation of the algorithm pass those tests;

Ideally, you should have both, but if you're going to do just one, then go for number 2.

Notice that you need an algorithm in both. If you don't have an algorithm, then you have nothing. Some people jump directly from the idea to the implementation stage. This means that the algorithm lives only in their head. Good for them, but the point of a research paper is to share information, and forcing others to reverse-engineer the algorithm from the implementation is not a good way of sharing. And most likely, no one will bother to do so.

So, yeah, without an algorithm, you got a big fat nothing.

Summary

In summary, the reasons to write stress/invariant tests are:

These tests will be re-usable later;
They help prove the correctness of the algorithm and the implementation;
The algorithms validated with these stress tests can then be used as building blocks of more complex algorithms;
If you have tests, you will be able to validate the algorithms of other researchers;
If you make modifications to the algorithm, you will be able to re-check the correctness of the modified algorithm without having to make a new proof from scratch;
Writing tests is fun!

Concurrency Freaks

Friday, December 24, 2021

The importance of Correctness in concurrent algorithms