Friday, October 2, 2020

A Relaxed Guide do memory_order_relaxed

 Just a quick post to a nice presentation on relaxed atomics by Hans Boehm and Paul McKenney

Very instructive if you are designing locks or lock-free code.

Full PRD here

Wednesday, July 1, 2020

HydraConf 2020

The Hydra Conference is almost here again:
This year the speaker lineup is even more impressive:

Maurice Herlihy is going to talk about Software Transactional Memory, and many more interesting talks.

It's going to be online-only, so anybody can attend.
You can buy your tickets here.

Here are some talks from last year to get you going.

Friday, June 12, 2020

You say read-committed, I say buggy-code, potatoe, potato, tomatoe, tomato

If I were to make a concurrent map data structure whose operations are not linearizable and then put it on a library and give it to my users, any user of that library would come back to me and say that my library has a bug.
They may not be able to tell me exactly what it is, but they'll understand that there is a bug in it.

However, if I take the same concurrent map data structure, and put in an application and call that application a "Key-Value store" or a "Database" (DBMS) and give it to the typical database users, it seems they may certainly use it for a several decades without ever complaining that this "Database" is not linearizable (or serializable as the DB folks call it).

If this sounds far-fetched, then just go and read this post:
It seems that the Postgresql users really didn't care that postgresql isn't serializable when told to be so and in fact, isn't even read committed by default, which should be the default. And a similar thing happened to MongDB last month, so it's not something specific to Postgresql.

I find this interesting because it's an extreme case of managing expectations: When you take a library of concurrent data structures, you expect a certain kind of consistency, namely, you expect linearizability (just go and read discussion on the Java concurrency mailing list if you don't believe me).
However, if you take a DBMS like Postgresql, you no longer expect linearizability to be the default behavior. In fact, you expect read committed as the default behavior, which turns out Postgresql doesn't even give read-committed and instead gives snapshot isolation.
A user of a library of concurrent data structures would call read committed a bug. She would also call snapshot isolation a bug, and pretty much everything that is not linearizable would be a bug to her.
The reason it would be a bug, is because it is very hard to reason about anything that is not linearizable. In the post by Jepsen you can even see that there is no exact definition of read committed, so I guess that's one of the reasons why nobody complained about it before.

I can easily imagine discussions of the type:
DBMS User: This behavior is weird!
DBMS Developer: No, it's not weird, it's "read committed", you just have to go and learn about what it means!
DBMS User: The definition of "read committed" is so fluffly that it can mean any weird thing... I can't even understand if the weird thing I'm observing is "read committed" or not.
DBMS Developer: See!?! I was right, this is "read committed"!
DBMS User: Ok, I'll keep using your DBMS because all the other DBMS work the same way.

I could have understood if it was a distributed database, because the cost of a serializable transaction over a distributed database is likely proportional to the latency of the furthest node in the database, while for read-committed it may be lower (who knows?). But the scenario that Jepsen describes isn't even about distributed databases. It's a bug found running on a single-node database. There's no "distributed-databases-are-hard" excuse here (which they are, it's true).

It makes me wonder how did the DBMS folks got their users so well trained that they don't complain about consistency bugs in the DBMS?!?
On one side, I'm super envious because I secretly wish I could be as dismissive to my users as the DBMS folks are to theirs, lol.
But seriously, how could this have happened for so long?!?
I see two possible reasons:
1st, the users don't really care about consistency. They're running businesses. As long as the DBMS is fast enough and has the features that they need, they'll continue to spew out cash for it. Correct consistency is not an issue for the 99% use-cases of databases, as long as the data doesn't get corrupted or lost, everything's fine.
2nd, it's always been like that. Everybody accepts it works like that, and if you want something better, you have to go for a niche DBMS (not that easy to find). Read-committed, snapshot isolation and other strange names as such, are just the "status quo" and nobody wants to change the status quo.
3rd, the DBMS folks hide behind the wall of complexity that is the DBMS. It's a common scenario in IT. They would say something like "Ooooohhhhh this DBMS is too complicated for mere mortals to question! It takes many years of work and millions of lines of code! Here be dragons! Oooohhhh".

If you think of other reasons behind this, I would like to hear about them in the comments section.

Anyways, with the advent of Persistent Memory (PM) and Software Transactional Memory for PM, this game is changing.
One example close to my heart is RedoDB.
RedoDB is a "key-value store" but it supports linearizable transactions over any C++ data type (needs to be type-annotated though). Not only that, but these transactions are wait-free.
That's right, you heard it well: RedoDB provides durable linearizable/serializable wait-free dynamic transactions.
No only does it do that, but is does it slightly faster than RocksDB.

The downside? Consumes vast amounts of memory, though it won't be any worse than most implementations of Multi-Version-Concurrency-Control. At least in RedoDB there is a bound on memory usage.

Anyways, our goal was to show that wait-free DBMS are feasible and can be made efficiently. We weren't aiming for a commercial product.
Ohh and did I say that this DB is about 3k lines of code, as opposed to the several hundred thousands LOC for other DBMS?

You can checkout the paper for RedoDB here:
and the source code here:

Sunday, May 17, 2020

Dealing with bugs in Concurrency and Durability

Everyone knows that Concurrency is hard, in fact, it's NP-hard: the total number of interleavings typically grow exponentially with the number of lines of code (LOC) in a program. Determining if any of these interleavings causes incorrect behavior, becomes problematic for anything but very small programs.
This never prevented anyone from writing large multi-threaded programs, and the usual trick is to use locks. A code block surround by a lock does not have interleavings with other code blocks for the same lock, and the problem becomes "tractable" again.

Unfortunately, we can't use this trick for lock-free and wait-free code. Lock-free algorithms are typically done with atomic instructions that can interleave with each other.
Verifying correctness of these algorithms is a difficult problem.
There are static verification tools like SPIN, JSF, TLA+ and others, but such tools only work for a small number of threads (no more than 6 threads) and small-size programs (less than 100 LOC). As if this wasn't bad enough, these tools enforce the user to write the program in a particular language, apart from JSF, which is almost Java. The program will have to be re-written later in the target language that is meant for, and during this "translation" errors can be introduced.

Now, don't get me wrong, software has bugs. It's just part of life as a software developer to deal with them and try to minimize them.
Entire industry trends like Test Driven Development (TDD) and Test Automation revolve around the concept of testing to reduce the amount of bugs.
In other words, there's this idea that the more time you invest in tests, the more quality the final program will have. IMO this is true, but obviously it's not a linear effort: beyond a certain number of tests, the number of new bugs you'll detect will be small compared to the effort of maintaining those tests.

Concurrency bugs can, and should, also be approached in this manner, but this is far from sufficient.
Moreover, testing a concurrent program requires a different set of techniques and mindset than testing a sequential program: instead of exercising input-output responses, we should test program invariants; instead of mocking components, we should stress the code by increasing the number of threads or changing timings to exercise different interleavings and code paths.
This is where run-time verification can provide a good help. Tools like valgrind and Address Sanitizer become a need-to-have if you're writing C/C++ multi-threaded applications and there is a good deal of research being done on this field, so I'm sure that more capabilities and more efficient tools will show up in the future. However...

Suppose I have a new lock-free algorithm and I implement it. Then I bother to write some tests for it (which most researchers won't do) and then run it with address sanitizer and whatever other run-time tools I can get my hands on. None of my tests or run-time tools encounter any error. Now what? Is my algorithm ready for publication? How will the peer reviewers know that it is correct? How will I convince them of that?
And here lies the big problem in researching concurrent algorithms that people don't usually talk about: "How do I know if the lock-free algorithm in this paper is correct?"

As I said before, it's ok to have bugs in your code, all software has bugs, and software developers are used to this inevitability.
What is not ok is to have bugs in your algorithm. It's very hard to prove the correctness of a lock-free algorithm. Although it's usually simple to verify that a bug exists, once someone tells you the sequence of steps to reproduce that bug.

This can lead to surreal scenarios that typically look like this:
Me: Read the latest paper on a novel lock-free algorithm. Spend weeks thinking about it. Come up with a scenario where the algorithm doesn't work. Send an email to authors describing the scenario.
Authors: Spend a week thinking if the scenario is actually a bug and how to fix it, propose a slight variation of the algorithm.
: Spend another week thinking about it, come up with a difference scenario where there is a race condition.
Authors: Spend another week thinking about it and propose a modified algorithm.
Me: Spend another week thinking about this different algorithm. Get to the conclusion that the algorithm is no longer lock-free. Send a new email explaning why.
Authors: Spend another week thinking about it and propose yet another version of the algorithm.
Me: Spend another week thinking about it, only to reach the conclusion that this new algorithm is not even linearizable.
Authors: We weren't aiming for linearizability anyways...
Me: Lose all faith on the work done in the field of concurrent algorithms, and lose faith in mankind in general... at least for a couple of weeks, then I read another paper that looks interesting and the cycle restarts.

This can sound bad but it's a description of when things turn out good. Other times I send the email and the authors just ignore me, or they answer something that shows that they don't even understand there is a bug, or there is no bug and the whole things was a result of me missing some vital detail in the algorithm.
Ultimately, either I shame the authors by pointing out their mistakes or I shame myself for being too dumb to understand their paper. Never a good turnout and it's not a good way of "making friends". Unfortunately I can't help myself, I'm in it because I want to learn, and you learn the most when you're willing to make a fool of yourself  ;)

This is not about incompetence, this is about the inherent difficulty in writing lock-free algorithms. The authors are not sure that their algorithms are correct, the reviewers are not sure that the algorithms are correct and the readers of the papers are not sure that the algorithms are correct.  And don't get me started on the actual "implementations" of these algorithms.

It's a mess, and it creates a lot of distrust in the field, specially during peer revieweing.
Papers end up being accepted only if there is an expert reviewer that really knows about this stuff and decides to spend the several weeks of his personal time it takes to fully understand that algorithm and then comes back to the other reviewers and manages to convince them that this stuff is correct.
Most of the times, academic papers gets accepted just because they're not obviously wrong. But the problem is that may be simply because the algorithm is so complex that no one is willing to "dig into it". I've seen examples of papers accepted into major conferences, where the basic algorithm was wrong (and there is not obvious way of fixing it) and then the authors will continue convinced that approach is valid and correct and write follow up papers that build on that algorithm and those papers continue to be accepted in major conferences, when it's all crap.
Luckily, this is rare. Most of the bugs I've seen in lock-free algorithms are things that can be fixed without a major re-design of the algorithm.

There is also the effect that if the algorithm is too simple (there aren't many of those, but still) then the reviewers may think that the work doesn't have enough value. But if the algorithm is too complex, then they wouldn't be able to understand if it is correct or not.
This kind of creates the incentive for papers to have incremental algorithms based on prior published work, or to have several simple algorithms as opposed to a single complex one.

I've noticed that when it comes to durability (persistence), the problem is similar.
Durable algorithms are not NP-Hard, the failure can occur at a single point in the code, which means that the complexity is linearly proportional to the LOCs, as opposed to concurrency algorithms where the complexity grows exponentially with the LOCs in the algorithm.
However, the pattern of me reading papers on durable algorithms and then sending emails and then the authors fixing but not really fixing is pretty much the same.

It's still a hard problem, particularly now with Optane DC Persistent Memory that has a special set of rules for data to be correctly persisted.
On Persistent Memory it is very easy to forget a flush or a persistent fence. This means the implementation is incorrect, however, in practice it may have no issue, because the probability of having a crash exactly in the place where the fence is missing is very low.
On the other hand, if there are fences missing all over the place or if there is something being used that simply doesn't make sense, then no matter where the program crashes, the data will end up corrupted. So there are different shades of grey here.

Again here the problem repeats itself when it comes to getting a paper accepted at a conference: unless one of the reviewers really knows its stuff and is willing to invest the time it takes to deep dive on the algorithm presented on the paper, getting a "correct" durable algorithm accepted is just a game of luck. The best that can be done is to make a convincing proof or at least explain some invariants.

I believe that as time goes by, there will be better tools that will help writing and verifying correct durable algorithms. The thing is, by the time that happens (years from now), all or most of the interesting durable algorithms will be published by then, so it kind of defeats the purpose. Moreover, for the complicated stuff we can always use transactional durable techniques (undo-log, redo-log, copy-on-write, shadow-data) or a universal construction, like the one we presented at EuroSys2020 : link here.

For concurrency it's really black and white: you miss a fence, you're algorithm is wrong and you WILL see a bug happen eventually.
I don't see any real solution for concurrent algorithms coming anytime soon. And that's a shame because as long as things remain the way they are, it will be very hard for the good (correct) algorithms to be distinguished from the bad (incorrect) ones. It becomes a kind of "reputation game" where papers from people who have made correct algorithms in the past, will likely have correct algorithms later as well, but this is not really reliable seen as we're all humans we're bound to make mistakes at some point. For now, all we can do is test our implementations of these algorithms as extensively as possible so as to minimize the chances of bugs, but it will never be 100% certain.

This discussion may seem a bit boring: software will always have bugs, so what's the problem with having buggy algorithms?
My argument is that it's a big deal, actually, it's a very BIG deal!I've seen many papers that build on previous papers and then I go look at the previous work on which they're based, and those are incorrect. We need solid foundations on which to build, to be able to progress further and faster.
If we don't give the proper value to the correct algorithms and point out the incorrect ones, then we're creating the wrong incentive. Researchers will not bother to spend the time testing or validating their algorithms. This creates the climate of suspicion that we're in today.

Ultimately, designing lock-free algorithms will never be "easy", but designing lock-free and persistent data structures is easy when you use universal constructions or a lock-free software transactional memory. This means all these are problems are an issue only for the researchers of concurrent algorithms, but not really a problem for people that need a peristent or lock-free data structure.

Wednesday, November 6, 2019

CX - Wait-Free Universal Construction

After many years, we've decided to make public the source code for the CX Universal Construction and explain what it is. Source code can be obtained here:
and paper here:

CX is the first wait-free Universal Construction that can actually be used in practice.
It can take lambdas for its operations, it has integrated (wait-free) memory reclamation, it was extensively tested and it doesn't leak.
CX allows you to take a sequential data structure or object (single-threaded code) and make a wait-free concurrent implementation of the same object by simply encapsulating the methods in calls to updateTx() or readTx(). The resulting data structures have linearizable operations.

Transforming a sequential data structure with CX is not trivial, but it is simple enough that a developer fresh out of college can do it. There are a few rules specific to wait-free UCs that need to be followed (no I/O, no exceptions, no side effects, deterministic code only) but these are easy enough to adhere to.

Unlike handmade wait-free and lock-free data structures, you can have any kind of type in your container (there are limitations on the return values though). All lock-free data structures I've seen so far were meant for integers only, or something equal or smaller than 64bits. Moreover, with CX it's easy to modify the data structure later and add more functionality.

Unlike STMs, CX does not need to have interposing and therefore, we can use CX to wrap data structures in libraries to which we don't have access to the source code. For example, you can use CX to wrap STL containers (we have!). Here is an example on how to make a wait-free std::set
or even easier, just create the instance and do the operations on the fly:

All you need is a copy constructor for the underlying data structure you want to protect. As long as your sequential data structure has a copy constructor (or something logically equivalent) and deterministic operations, you should be able to convert it to a wait-free data structure.

We've used CX to make the world's first wait-free balanced binary tree, with integrated wait-free memory reclamation, capable of supporting multiple types. How did we do that? Well, we wrapped a std::set in CX and BOOM, out comes the world's first wait-free balanced BST  :)

Several years ago, when Andreia first came up with the idea that we could use a reader-writer lock to achieve wait-freedom, I though that was crazy, but cool. After joining up forces with Pascal and a lot of work, we were able to turn that into a universal construction, CX.
Sure, it has its limitations, but it opens new venues for research and is evidence of some contrarian concepts:
- wait-freedom can be generic and fast (at least for read-mostly workloads);
- wait-free memory reclamation can be lightweight;
- wait-freedom can be achieved with locks;

Now for the bad news: just because it's wait-free, doesn't mean it scales.
CX serializes all mutative operations which means it's flat at best if the workload consists of only mutations on the sequential object.
It also consumes larges amounts of memory and in the very unlikely worst case, may require 2x MAX_THREADS replicas of the sequential object. If the object is a data structure with 1000 nodes, then it doesn't make a dent, but if we're talking about a container with multiple millions of keys and tens of threads writing to it, then it's going to chew up with DRAM pretty fast.
This high memory usage means CX likely won't be your favorite concurrency approach when deploying in production.

The good news is that it scales really well for read-only operations. This means that as long as your workload is read-mostly, it will be able to scale.
Synchronization for read-only operations in CX is very low, typically a few acquire-loads and a single seq-cst fence instruction and beyond that it's just regular loads.
Yes, that's right, the user-code you place in the lambdas is being executed without having to do any load acquires or heavy synchronization fences, just like when you execute code inside a lock!
This is what makes CX be very fast for read-mostly workloads.

In fact, it's so fast that when we submitted the paper to conferences, a few reviewers complained that it was too good to be true and it couldn't possibly be correct. That's how fast it is  ;)
Hand-made lock-free data structures can take months (or years) of work by expert researchers to complete, therefore it's normal be greeted with disbelief when we show them a technique capable of transforming generic sequential code into wait-free safe code. Something a young developer can do in minutes. If on top of that we say that it has wait-free memory reclamation and on top of it we show plots where it beats multiple handmade data structures in read-mostly scenarios, then they go ballistic and think it's all a bunch of BS.
It's understandable, we all have our biases. Mine is to think that it's a good idea to trade off large amounts of memory for a generic construct that scales in read-mostly workloads.What's yours?

Hey, nobody's perfect!

Monday, November 4, 2019

Is Left-Right a generic concurrency control?

Spoiler alert, yes, Left-Right is a generic concurrency control!

We mentioned Left-Right many times before, but here it goes again.
Left-Right is a technique where two replicas of your data are kept up to date. Mutative operations execute in one replica while the other replica can be read by in-flight readers and then the role of the replicas toggles for the write to apply the same operation on the opposite replica while the new readers go on the freshly modified replica.

Through the careful usage of atomics and RCU, this can be done in such a way as to have wait-free progress for the readers and that's what Left-Right is about.
There is only one writer at a time (we can use a global lock) but it can use flat combining to aggregate requests from multiple writers and thus have data locality without causing too much contention on the writer lock. This data locality allows for a "flat" throughput of the write operations, and in the case of Persistent Memory (Romulus) is can even have high positive scaling due to saving writes to PM.

If you want to see more details on Left-Right you check out this ppt:
or see my cppcon video on this topic:

Because Left-Right is a generic concurrency control, it can be used to make a Universal Construction
or an STM
both approaches imply blocking yet starvation-free writes, and wait-free population oblivious reads.
Read-only operations can execute in parallel with the write operation because they're accessing opposite replicas.

When a user decides to utilize the Left-Right UC, all it has to do is place the user-code in a lambda and pass it to the Left-Right UC.
The code in the lambda has some minor restrictions, shared in common with all other generic approaches that give wait-freedom for reads:
- No side effects in the lambda: can't modify global variables or thread-locals;
- Deterministic code: no access to random number generator devices or random functions, unless it's a pseudo-random generator and the seed is passed to the lambda;
- No I/O; no writing to files, no sending/receiving of network packets, no hw registers being written... actually all of this can be done but then you have to use an STM version (like Romulus) which needs load and store interposing
- No exceptions: any exceptions must be catch inside the lambda (user-code). For the read-only operations it can be made such that this is not a problem (implementation detail) but for the mutative operations this is an algorithmic constraint;
Operations in Left-Right are irrevocable (they don't abort-retry) which is a nice feature for most user code.

The bad news: it uses twice as much memory. Like all good things, there are trade-offs and the wait-freedom for readers has a cost, the memory usage.

Remember the issue about cross-dependencies (write skews) from the previous post ? Left-Right doesn't have such an issue because the synchronization is implicitly a global lock. Yes, yes, it serializes mutative operations, but remember, read-only operations are wait-free and go in parallel with the mutations. As long as your workload is read-dominated, Left-Right is a good trade-off.

Another nice feature is that memory reclamation works just like inside a lock: you just call malloc/free/new/delete and it works. No need to use epoch-based reclamation explicitly (it's already implicit in the usage of the userspace RCU inside Left-Right).

On the next post we're going to talk about a fully wait-free Universal Construction.

Wednesday, October 30, 2019

How about RLU? Is RLU a generic concurrency control?

On the previous post we went over why RCU is not a generic concurrency control. This time we'll talk about RLU (it's an L, not a C):

RLU is a synchronization technique which contains inside it an efficient userspace RCU.
It supports multi-atomic updates, making them visible in one-shot to the readers, without having to make an replica of the entire data, like on RCU.

The idea of RLU is that during a write operation, each modified object is first copied into a per-thread log, then it makes the modifications to the object in the log. When all modifications have been made, it increments the global clock, thus making all modifications atomically visible to readers. I'm over-simplifying, but it's ok for this post.

RLU is certainly more generic and efficient than RCU but it does not allow for generic code to be used. In fact, RLU does write-read synchronization but it does not handle write-write synchronization and the the paper itself mentions this explicitly:
"The basic idea of RLU described above provides read-write synchronization for object accesses. It does not however ensure write-write synchronization, which must be managedby the programmer if needed (as with RCU)".

To better understand why RLU is not a generic approach to concurrency, let's consider the canonical example of two threads 1 and 2 with cross dependencies. They are attempting to execute the following operations, where x and y start both at false:
Thread 1:
  if (!x) y = true;

Thread 2:
  if (!y) x = true;

As I'm sure you're aware, any serializable execution of these two operations will either result in x and y being {true,false} or {false,true} but never {true,true}. One of the operations has to "happen before" the other. If T1 executes before T2 then the end result will be {false,true} and if T2 executes before T1 then the end result will be {true,false}.
If you run this code without modification on RLU, it can happen that T1 reads x and sees false while T2 reads y and sees false, then T1 acquires the lock on y while T2 acquires the lock on x and each write to their log that the variables will be true and each advance the global clock and commit. In case you're wondering, this scenario is not linearizable.

In the database world, this kind of problem is called a write skew anomaly (see section 2.4 of the MV-RLU paper for a great explanation in the context of RLU). In the context of concurrent data structures, we call these bugs!
Handling cases like this is what an STM or a Universal Construction are supposed to do because an STM/UC must handle any sequential piece code (with some exceptions).
Granted, if you're really careful how you write the code, you can use RLU as long as you avoid writing sequential code that has this kind of cross dependencies. The problem is, in practice, this is very hard to do and even when you do it, you're never really sure it's actually done correctly (did I miss some cross dependency?). If you write user-code that has one of these problems, it will be hard to find it and likely hard to fix it.

There is an easy solution for RLU and that is to use a global lock, but then it looses some of its appeal because it no longer scales for writes.
Keep in mind though that RLU scales for read-only transactions!

STMs like TinySTM or TL2 have time-based algorithms that keep a read-set of all loads done within the transaction and do validation of the read-set at commit time exactly to handle these kind of cross dependencies.
If you're wondering how frequent are cross dependencies in real code, then even an ordered linked list is prone to these kind of issues (depending on how it is implemented), which helps to show that it's very hard to get away from this problem in practice.

If you ever read a paper where they say that they used RLU to make an STM, be very very suspicious: an STM must be able to handle cross dependencies and provide linearizable (globally serializable) operations regardless of what the underlying sequential user-code. If it works only for some small subset of user programs, then it's not an STM... sure it's a very valuable tool, but it's not an STM.