Friday, June 12, 2020

You say read-committed, I say buggy-code, potatoe, potato, tomatoe, tomato

If I were to make a concurrent map data structure whose operations are not linearizable, then put it in a library and give it to my users, any user of that library would come back to me and say that my library has a bug.
They may not be able to tell me exactly what it is, but they'll understand that there is a bug in it.

However, if I take the same concurrent map data structure, put it in an application, call that application a "Key-Value store" or a "Database" (DBMS), and give it to typical database users, it seems they may well use it for several decades without ever complaining that this "Database" is not linearizable (or serializable, as the DB folks call it).

If this sounds far-fetched, then just go and read this post:
It seems that PostgreSQL users really didn't care that PostgreSQL isn't serializable when configured to be, and in fact doesn't even provide read committed, despite that being its advertised default isolation level. And a similar thing happened to MongoDB last month, so it's not something specific to PostgreSQL.

I find this interesting because it's an extreme case of managing expectations: when you take a library of concurrent data structures, you expect a certain kind of consistency, namely linearizability (just go and read the discussions on the Java concurrency mailing list if you don't believe me).
However, if you take a DBMS like PostgreSQL, you no longer expect linearizability as the default behavior. In fact, you expect read committed as the default behavior, and it turns out PostgreSQL doesn't even give you read committed, but snapshot isolation instead.
A user of a library of concurrent data structures would call read committed a bug. She would also call snapshot isolation a bug; pretty much everything that is not linearizable would be a bug to her.
The reason is that it is very hard to reason about anything that is not linearizable. In the Jepsen post you can even see that there is no exact definition of read committed, so I guess that's one of the reasons why nobody complained about it before.

I can easily imagine discussions of the type:
DBMS User: This behavior is weird!
DBMS Developer: No, it's not weird, it's "read committed", you just have to go and learn about what it means!
DBMS User: The definition of "read committed" is so fluffy that it can mean any weird thing... I can't even tell whether the weird thing I'm observing is "read committed" or not.
DBMS Developer: See!?! I was right, this is "read committed"!
DBMS User: Ok, I'll keep using your DBMS because all the other DBMS work the same way.

I could have understood it if this were a distributed database, because the cost of a serializable transaction over a distributed database is likely proportional to the latency of the furthest node, while for read committed it may be lower (who knows?). But the scenario that Jepsen describes isn't even about distributed databases. It's a bug found running on a single-node database. There's no "distributed-databases-are-hard" excuse here (which they are, it's true).

It makes me wonder how the DBMS folks got their users so well trained that they don't complain about consistency bugs in the DBMS?!?
On one side, I'm super envious because I secretly wish I could be as dismissive to my users as the DBMS folks are to theirs, lol.
But seriously, how could this have happened for so long?!?
I see three possible reasons:
1st, the users don't really care about consistency. They're running businesses. As long as the DBMS is fast enough and has the features that they need, they'll continue to shell out cash for it. Correct consistency is not an issue for 99% of database use cases; as long as the data doesn't get corrupted or lost, everything's fine.
2nd, it's always been like that. Everybody accepts that it works like that, and if you want something better, you have to go for a niche DBMS (not that easy to find). Read committed, snapshot isolation and other such strange names are just the "status quo", and nobody wants to change the status quo.
3rd, the DBMS folks hide behind the wall of complexity that is the DBMS. It's a common scenario in IT. They would say something like "Ooooohhhhh this DBMS is too complicated for mere mortals to question! It takes many years of work and millions of lines of code! Here be dragons! Oooohhhh".

If you think of other reasons behind this, I would like to hear about them in the comments section.

Anyways, with the advent of Persistent Memory (PM) and Software Transactional Memory for PM, this game is changing.
One example close to my heart is RedoDB.
RedoDB is a "key-value store", but it supports linearizable transactions over any C++ data type (which needs to be type-annotated, though). Not only that, but these transactions are wait-free.
That's right, you heard it well: RedoDB provides durable linearizable/serializable wait-free dynamic transactions.
Not only does it do that, but it does it slightly faster than RocksDB.

The downside? It consumes vast amounts of memory, though no worse than most implementations of Multi-Version Concurrency Control (MVCC). At least in RedoDB there is a bound on memory usage.

Anyways, our goal was to show that wait-free DBMSs are feasible and can be made efficient. We weren't aiming for a commercial product.
Oh, and did I mention that this DB is about 3k lines of code, as opposed to the several hundred thousand LOC of other DBMSs?

You can checkout the paper for RedoDB here:
and the source code here:


  1. Hi,
    I have been interested in this work since the first paper was published.
    When you say that RedoDB is slightly faster than RocksDB, what do you mean? Was it tested on single-threaded or multi-threaded loading?

    1. Hi Andrey,
      Multi-threaded of course! This is a concurrency blog after all ;)
      We ran the db_bench microbenchmarks and performance is slightly higher for write-intensive benchmarks and much higher for the read-only ones.
      On one hand, it's kind of unfair because RocksDB was not optimized or designed for Persistent Memory, but on the other hand, we provide fully ACID transactions by default for all operations, which RocksDB does not.

    2. Thank you :-) I supposed that, of course :-) What is still unclear to me is that, AFAIK, db_bench contains several types of benchmarks that have a write component. Is it fair to assume that RedoDB is slightly faster on all of those benchmarks? Also, if I am not mistaken, db_bench uses 1 as the default number of threads. Is it fair to say that RedoDB is slightly faster than RocksDB for any number of threads?

    3. Hi, yes on all the benchmarks we ran on our machine with Optane (we ran almost all of the ones in db_bench). Here are some sample results for 1M keys (value=100 bytes). Rows are the number of threads and the values are latency (lower is better):
      Threads RedoOptDB-fillrandom RedoOptDB-overwrite RedoOptDB-readrandom RedoOptDB-readwhilewriting
      1 5.432 3.708 1.577 2.064
      2 8.793 10.813 1.563 2.182
      4 19.019 17.663 1.528 2.379
      8 32.933 31.470 1.687 2.413
      16 63.296 61.570 2.147 3.092
      24 95.137 94.830 2.865 3.859
      32 136.139 135.433 3.855 4.859
      40 196.637 197.169 4.810 4.917

      Threads RocksDB-fillrandom RocksDB-overwrite RocksDB-readrandom RocksDB-readwhilewriting
      1 49.882 61.415 4.739 6.104
      2 86.345 87.177 5.677 6.222
      4 85.291 112.555 6.533 7.171
      8 111.229 110.930 8.092 8.764
      16 144.364 159.453 12.907 12.842
      24 169.834 172.795 19.077 18.642
      32 236.561 234.283 25.541 24.115
      40 282.053 281.001 34.453 31.658

      This was for RocksDB on ext4-DAX. If you use NOVA instead then maybe it will show a difference (we didn't have time to experiment with that because NOVA requires kernel changes).

    4. Got it, thank you, the numbers are impressive. I wanted to ask one more question, but I suppose you have already answered it: I wanted to check whether you used emulation or real Optane DC PMem. Prices for Optane are at space level right now, but I am sure that will change eventually. Interesting thought: if, for example, one designs a database engine using a shared-nothing architecture per CPU core, like ScyllaDB does (using green threads), then IMHO the fact that only a single write operation can be performed in a TX could be not a disadvantage but by design. Maybe I am mistaken, of course.

    5. *that only single write operation can be performed - that only single write operation can be performed simultaneously

    6. Yes, we used real Optane DC PM.
      One of the insights of RedoDB is that on Optane a lot of the performance comes from "grouping" the transactions. RocksDB kind of does this from a durability point of view, they call it "group commit", but RedoDB uses Herlihy's combining consensus and several PM-specific optimizations to group even more transactions, from a concurrency and durability point of view. This "grouping of transactions" reduces the number of writes and the number of CLWB flushes to the PM and improves throughput significantly, even if ultimately there is a strong serialization of the transactions (each tx group is placed on a kind-of wait-free queue before being processed).

  2. Hi Pedro,
    If you do not mind, I have one more question. Did you measure the space consumption of RocksDB and RedoDB while you were running the benchmarks? It is clear that it depends on the data and the pattern of operations performed, but did you compare the space consumption of the two during those tests? IMHO in this case it would be more fair to compare the space consumption of RedoDB and PostgreSQL. But if you do not mind, could you provide that data if you have it?

    1. Hi Andrey,
      Yes, we did measure memory usage and it was always below 2x the size of the DB. Notice that having wait-free writes more or less implies a memory bound of at least (t+1)x the size of the DB, where t is the number of threads.
      In RedoDB it is possible to actually enforce the memory usage to be at most 2x, but then the progress guarantee for write-txs becomes "blocking" and only the read-txs remain "wait-free". Basically, it's a tradeoff of memory usage versus progress.

    2. Thank you Pedro! A 2x space overhead (I mean practical, not theoretical) (or memory, in this particular case) is not so big if we talk about the world of databases. I hope to re-implement your approach in Rust using the modern async/await support, with a single carrier thread per core and a shared-nothing architecture. This would obviously be much simpler because of the total ordering of operations and the predictable scheduling points, and hopefully even faster. Those are not "next month" plans, but I do hope to find time to do this exercise.

    3. That should be a very cool Rust project!
      Looking forward to seeing it in action one day :)