Concurrency Freaks: STM

Showing posts with label STM. Show all posts

Wednesday, July 1, 2020

HydraConf 2020

The Hydra Conference is almost here again: https://hydraconf.com/

This year the speaker lineup is even more impressive: https://hydraconf.com/2020/msk/people/

Maurice Herlihy is going to talk about Software Transactional Memory, and many more interesting talks.

It's going to be online-only, so anybody can attend.

You can buy your tickets here.

Here are some talks from last year to get you going.

Sunday, May 17, 2020

Dealing with bugs in Concurrency and Durability

Everyone knows that Concurrency is hard, in fact, it's NP-hard: the total number of interleavings typically grow exponentially with the number of lines of code (LOC) in a program. Determining if any of these interleavings causes incorrect behavior, becomes problematic for anything but very small programs.
This never prevented anyone from writing large multi-threaded programs, and the usual trick is to use locks. A code block surround by a lock does not have interleavings with other code blocks for the same lock, and the problem becomes "tractable" again.

Unfortunately, we can't use this trick for lock-free and wait-free code. Lock-free algorithms are typically done with atomic instructions that can interleave with each other.
Verifying correctness of these algorithms is a difficult problem.
There are static verification tools like SPIN, JSF, TLA+ and others, but such tools only work for a small number of threads (no more than 6 threads) and small-size programs (less than 100 LOC). As if this wasn't bad enough, these tools enforce the user to write the program in a particular language, apart from JSF, which is almost Java. The program will have to be re-written later in the target language that is meant for, and during this "translation" errors can be introduced.

Now, don't get me wrong, software has bugs. It's just part of life as a software developer to deal with them and try to minimize them.
Entire industry trends like Test Driven Development (TDD) and Test Automation revolve around the concept of testing to reduce the amount of bugs.
In other words, there's this idea that the more time you invest in tests, the more quality the final program will have. IMO this is true, but obviously it's not a linear effort: beyond a certain number of tests, the number of new bugs you'll detect will be small compared to the effort of maintaining those tests.

Concurrency bugs can, and should, also be approached in this manner, but this is far from sufficient.
Moreover, testing a concurrent program requires a different set of techniques and mindset than testing a sequential program: instead of exercising input-output responses, we should test program invariants; instead of mocking components, we should stress the code by increasing the number of threads or changing timings to exercise different interleavings and code paths.
This is where run-time verification can provide a good help. Tools like valgrind and Address Sanitizer become a need-to-have if you're writing C/C++ multi-threaded applications and there is a good deal of research being done on this field, so I'm sure that more capabilities and more efficient tools will show up in the future. However...

Suppose I have a new lock-free algorithm and I implement it. Then I bother to write some tests for it (which most researchers won't do) and then run it with address sanitizer and whatever other run-time tools I can get my hands on. None of my tests or run-time tools encounter any error. Now what? Is my algorithm ready for publication? How will the peer reviewers know that it is correct? How will I convince them of that?
And here lies the big problem in researching concurrent algorithms that people don't usually talk about: "How do I know if the lock-free algorithm in this paper is correct?"

As I said before, it's ok to have bugs in your code, all software has bugs, and software developers are used to this inevitability.
What is not ok is to have bugs in your algorithm. It's very hard to prove the correctness of a lock-free algorithm. Although it's usually simple to verify that a bug exists, once someone tells you the sequence of steps to reproduce that bug.

This can lead to surreal scenarios that typically look like this:
Me: Read the latest paper on a novel lock-free algorithm. Spend weeks thinking about it. Come up with a scenario where the algorithm doesn't work. Send an email to authors describing the scenario.
Authors: Spend a week thinking if the scenario is actually a bug and how to fix it, propose a slight variation of the algorithm.
Me: Spend another week thinking about it, come up with a difference scenario where there is a race condition.
Authors: Spend another week thinking about it and propose a modified algorithm.
Me: Spend another week thinking about this different algorithm. Get to the conclusion that the algorithm is no longer lock-free. Send a new email explaning why.
Authors: Spend another week thinking about it and propose yet another version of the algorithm.
Me: Spend another week thinking about it, only to reach the conclusion that this new algorithm is not even linearizable.
Authors: We weren't aiming for linearizability anyways...
Me: Lose all faith on the work done in the field of concurrent algorithms, and lose faith in mankind in general... at least for a couple of weeks, then I read another paper that looks interesting and the cycle restarts.

This can sound bad but it's a description of when things turn out good. Other times I send the email and the authors just ignore me, or they answer something that shows that they don't even understand there is a bug, or there is no bug and the whole things was a result of me missing some vital detail in the algorithm.
Ultimately, either I shame the authors by pointing out their mistakes or I shame myself for being too dumb to understand their paper. Never a good turnout and it's not a good way of "making friends". Unfortunately I can't help myself, I'm in it because I want to learn, and you learn the most when you're willing to make a fool of yourself ;)

This is not about incompetence, this is about the inherent difficulty in writing lock-free algorithms. The authors are not sure that their algorithms are correct, the reviewers are not sure that the algorithms are correct and the readers of the papers are not sure that the algorithms are correct. And don't get me started on the actual "implementations" of these algorithms.

It's a mess, and it creates a lot of distrust in the field, specially during peer revieweing.
Papers end up being accepted only if there is an expert reviewer that really knows about this stuff and decides to spend the several weeks of his personal time it takes to fully understand that algorithm and then comes back to the other reviewers and manages to convince them that this stuff is correct.
Most of the times, academic papers gets accepted just because they're not obviously wrong. But the problem is that may be simply because the algorithm is so complex that no one is willing to "dig into it". I've seen examples of papers accepted into major conferences, where the basic algorithm was wrong (and there is not obvious way of fixing it) and then the authors will continue convinced that approach is valid and correct and write follow up papers that build on that algorithm and those papers continue to be accepted in major conferences, when it's all crap.
Luckily, this is rare. Most of the bugs I've seen in lock-free algorithms are things that can be fixed without a major re-design of the algorithm.

There is also the effect that if the algorithm is too simple (there aren't many of those, but still) then the reviewers may think that the work doesn't have enough value. But if the algorithm is too complex, then they wouldn't be able to understand if it is correct or not.
This kind of creates the incentive for papers to have incremental algorithms based on prior published work, or to have several simple algorithms as opposed to a single complex one.

I've noticed that when it comes to durability (persistence), the problem is similar.
Durable algorithms are not NP-Hard, the failure can occur at a single point in the code, which means that the complexity is linearly proportional to the LOCs, as opposed to concurrency algorithms where the complexity grows exponentially with the LOCs in the algorithm.
However, the pattern of me reading papers on durable algorithms and then sending emails and then the authors fixing but not really fixing is pretty much the same.

It's still a hard problem, particularly now with Optane DC Persistent Memory that has a special set of rules for data to be correctly persisted.
On Persistent Memory it is very easy to forget a flush or a persistent fence. This means the implementation is incorrect, however, in practice it may have no issue, because the probability of having a crash exactly in the place where the fence is missing is very low.
On the other hand, if there are fences missing all over the place or if there is something being used that simply doesn't make sense, then no matter where the program crashes, the data will end up corrupted. So there are different shades of grey here.

Again here the problem repeats itself when it comes to getting a paper accepted at a conference: unless one of the reviewers really knows its stuff and is willing to invest the time it takes to deep dive on the algorithm presented on the paper, getting a "correct" durable algorithm accepted is just a game of luck. The best that can be done is to make a convincing proof or at least explain some invariants.

I believe that as time goes by, there will be better tools that will help writing and verifying correct durable algorithms. The thing is, by the time that happens (years from now), all or most of the interesting durable algorithms will be published by then, so it kind of defeats the purpose. Moreover, for the complicated stuff we can always use transactional durable techniques (undo-log, redo-log, copy-on-write, shadow-data) or a universal construction, like the one we presented at EuroSys2020 : link here.

For concurrency it's really black and white: you miss a fence, you're algorithm is wrong and you WILL see a bug happen eventually.
I don't see any real solution for concurrent algorithms coming anytime soon. And that's a shame because as long as things remain the way they are, it will be very hard for the good (correct) algorithms to be distinguished from the bad (incorrect) ones. It becomes a kind of "reputation game" where papers from people who have made correct algorithms in the past, will likely have correct algorithms later as well, but this is not really reliable seen as we're all humans we're bound to make mistakes at some point. For now, all we can do is test our implementations of these algorithms as extensively as possible so as to minimize the chances of bugs, but it will never be 100% certain.

This discussion may seem a bit boring: software will always have bugs, so what's the problem with having buggy algorithms?
My argument is that it's a big deal, actually, it's a very BIG deal!I've seen many papers that build on previous papers and then I go look at the previous work on which they're based, and those are incorrect. We need solid foundations on which to build, to be able to progress further and faster.
If we don't give the proper value to the correct algorithms and point out the incorrect ones, then we're creating the wrong incentive. Researchers will not bother to spend the time testing or validating their algorithms. This creates the climate of suspicion that we're in today.

Ultimately, designing lock-free algorithms will never be "easy", but designing lock-free and persistent data structures is easy when you use universal constructions or a lock-free software transactional memory. This means all these are problems are an issue only for the researchers of concurrent algorithms, but not really a problem for people that need a peristent or lock-free data structure.

Wednesday, October 30, 2019

How about RLU? Is RLU a generic concurrency control?

On the previous post we went over why RCU is not a generic concurrency control. This time we'll talk about RLU (it's an L, not a C):
http://alchem.usc.edu/courses-ee599/downloads/rlu-sosp15.pdf

RLU is a synchronization technique which contains inside it an efficient userspace RCU.
It supports multi-atomic updates, making them visible in one-shot to the readers, without having to make an replica of the entire data, like on RCU.

The idea of RLU is that during a write operation, each modified object is first copied into a per-thread log, then it makes the modifications to the object in the log. When all modifications have been made, it increments the global clock, thus making all modifications atomically visible to readers. I'm over-simplifying, but it's ok for this post.

RLU is certainly more generic and efficient than RCU but it does not allow for generic code to be used. In fact, RLU does write-read synchronization but it does not handle write-write synchronization and the the paper itself mentions this explicitly:
"The basic idea of RLU described above provides read-write synchronization for object accesses. It does not however ensure write-write synchronization, which must be managedby the programmer if needed (as with RCU)".

To better understand why RLU is not a generic approach to concurrency, let's consider the canonical example of two threads 1 and 2 with cross dependencies. They are attempting to execute the following operations, where x and y start both at false:
Thread 1:
if (!x) y = true;

Thread 2:
if (!y) x = true;

As I'm sure you're aware, any serializable execution of these two operations will either result in x and y being {true,false} or {false,true} but never {true,true}. One of the operations has to "happen before" the other. If T1 executes before T2 then the end result will be {false,true} and if T2 executes before T1 then the end result will be {true,false}.
If you run this code without modification on RLU, it can happen that T1 reads x and sees false while T2 reads y and sees false, then T1 acquires the lock on y while T2 acquires the lock on x and each write to their log that the variables will be true and each advance the global clock and commit. In case you're wondering, this scenario is not linearizable.

In the database world, this kind of problem is called a write skew anomaly (see section 2.4 of the MV-RLU paper for a great explanation in the context of RLU). In the context of concurrent data structures, we call these bugs!
Handling cases like this is what an STM or a Universal Construction are supposed to do because an STM/UC must handle any sequential piece code (with some exceptions).
Granted, if you're really careful how you write the code, you can use RLU as long as you avoid writing sequential code that has this kind of cross dependencies. The problem is, in practice, this is very hard to do and even when you do it, you're never really sure it's actually done correctly (did I miss some cross dependency?). If you write user-code that has one of these problems, it will be hard to find it and likely hard to fix it.

There is an easy solution for RLU and that is to use a global lock, but then it looses some of its appeal because it no longer scales for writes.
Keep in mind though that RLU scales for read-only transactions!

STMs like TinySTM or TL2 have time-based algorithms that keep a read-set of all loads done within the transaction and do validation of the read-set at commit time exactly to handle these kind of cross dependencies.
If you're wondering how frequent are cross dependencies in real code, then even an ordered linked list is prone to these kind of issues (depending on how it is implemented), which helps to show that it's very hard to get away from this problem in practice.

If you ever read a paper where they say that they used RLU to make an STM, be very very suspicious: an STM must be able to handle cross dependencies and provide linearizable (globally serializable) operations regardless of what the underlying sequential user-code. If it works only for some small subset of user programs, then it's not an STM... sure it's a very valuable tool, but it's not an STM.

Monday, October 28, 2019

Is RCU a generic concurrency control?

In this post we're going to talk about RCU and we're going to start by what it means.
When people talk about RCU they usually mean two different things combined: a synchronization mechanism, which I'll call RCU-sync, and a make-copy-of-object-and-modify-copy, which I'll call Copy-On-Write (COW).

Let's start with Copy-On-Write.
This is a technique used in concurrency and even on durability (see previous post) where instead of modifying an object in-place, you make a copy of the object, modify the new object and then toggle a pointer to point from the old object to the new object. The name RCU comes from Read-Copy-Update as a reference to this technique.
When used in (shared memory) concurrency this creates the problem of what to do with the old object: is it safe to delete it immediately? What if there are other threads currently accessing it?
Solving this problem of object lifetime tracking is what the synchronization mechanism of RCU (RCU-sync) does.

There are many different algorithms that implement an RCU-sync. Some examples of RCU-sync algorithms can be seen in Paul McKenney's excellent book or in our paper about fast userspace RCUs. The thing about an RCU-sync is that it's a concurrency construct, like a mutual exclusion lock or a reader-writer lock. For example, a mutual exclusion lock implements at least the following API:
lock()
unlock()
while a reader-writer lock implements at least:
read_lock()
read_unlock()
write_lock()
write_unlock()
and an RCU-sync implements at least:
rcu_read_lock()
rcu_read_unlock()
rcu_synchronize()
The difference between RCU and locks is that locks provide a generic way of doing concurrency, while RCU does not.

The rcu_read_lock() and rcu_read_unlock() methods must guarantee that code within a critical section defined by these two functions must not escape outside the critical section. An RCU-sync must also guarantee that a call to rcu_synchronize() will wait for all threads currently executing a block of code in a rcu_read_lock/unlock() block to complete (call rcu_read_unlock) before rcu_synchronize() returns to the caller. This guarantee is called quiescence and it's extremely useful in shared memory concurrency. Note that the actual semantics are much more formal that what I've described here, but for the purpose of this post, it's enough to get an idea of how an RCU-sync behaves.

As mentioned above, RCU-sync provides quiescence, not mutual exclusion.
A thread calling rcu_synchronize() will know that by the time the call to rcu_synchronize() returns, everything done before this call is now visible to other threads (as long as the other threads call rcu_read_lock/unlock to access the associated data). However, there is nothing preventing other threads from modyfing the same data at the same time (data race behavior). This means that even when using RCU-synch we still need some extra mechanism to provide mutual exclusion or prevent races in some other way.

At its essence, RCU-sync is mechanism for a single writer to synchronize with multiple readers. It doesn't solve the problem of multiple writers and it also doesn't solve the problem of the readers having a consistent view of the data.
Let me explain a bit more about what I mean with "consistent view". Suppose we have two variables, "a" and "b" initially at zero and a single writer thread and multiple reader threads. The writer code could be:
writer() {
    a = 1;
    rcu_synchronize();
    b = 1;
    rcu_synchronize();
}
and the reader code:
reader() {
    rcu_read_lock();
    printf("a=%d\n", a);
    printf("b=%d\n", b);
    rcu_read_unlock();
}

A reader thread could print "a=1,b=0", even if "a" and "b" are std::atomic with memory_order_consume, or even memory_order_seq_cst.
RCU-sync does not provide atomicity for multiple variables. In order to achieve this, we need to aggregate all those variables within a single object and make a snapshopt of this object using Copy-On-Write. But then, the concurrency technique really is COW, not RCU-sync. With such an approach, the RCU-sync is used only for memory reclamation of the old objects and if we're in Java then we don't even need that because the GC will do it for us.
And that is why whenever they use copy-on-write on java.util.concurrent they call it copy-on-write and not RCU ;)

Let me repeat it again: RCU-sync is extremely useful for memory reclamation and as a component of more complex concurrency techniques, but it is not generic. RCU-sync does not provide atomicity, nor does it prevent write-write races, not even read-write races. All it does it give quiescence, which personally I think it's awesome! It's all a matter of expectations.

Just to go back to COW briefly, COW can be used to solve write-write races and read-write races by creating a new object on every modification of every sub-object. Basically, if you want full consistency across your data, you need a god object that contains all your data and any modification to the data (or just a single variable) will require a full copy of the entire data.
So yes, COW can be generic, it's just that nobody does it like this because it would be too slow.

Both COW and RCU-sync are valuable tools that researchers and concurrency library implementers can use to make sophisticated concurrent data structures or other synchronization mechanisms, but calling them generic concurrency controls is going too far.
In other words, there certainly are database engines and STMs (software transactional memory) which use COW or RCU-sync internally, but you'll never see a DBMS or STM whose concurrency control is RCU. It just doesn't make sense to have a DBMS where any mutative operation causes a copy of the entire DB to be made. Because, that would be the only way to provide generic transactions on that DBMS or STM.
In the DBMS context, a concurrency control is something like MVCC (multi-version concurrency control) or 2PL (two-phase locking) or OCC (optimistic concurrency control).

RCU is not a generic concurrency control.

Sunday, May 5, 2019

OneFile and wait-free balanced trees

OneFile is the world's first wait-free STM and as such, it allows to create wait-free trees.
As it so happens, there is a hand-made wait-free tree in the literature but it is not a "balanced", which means that our sequential implementation of a red-black tree protected by OneFile is the first wait-free balanced tree ever made... and did I mention that it has wait-free memory reclamation? :)

Hand-made lock-free trees are not new. A fellow researcher, Trevor Brown, has been particularly prolific in this area and made several robust trees, with integrated memory reclamation and several with lock-free progress (just a small sample of his work on this topic):
https://arxiv.org/abs/1708.04838
https://www.cs.tau.ac.il/~mad/publications/atc2018-bst.pdf

Back in 2013, Aravind Natarajan, Lee Savoie and Neeraj Mittal presented a hande-made wait-free tree. Their tree does not have memory reclamation, which means that as you do a removal of a node from the tree, the node will will not be reclaimed/deleted, thus leaking. Eventually, your application will crash due to running out of memory. I remember I tried once with Andreia to add memory reclamation to their code but it was just too hard, even for epoch-based reclamation.
Moreover, the keys have to be 64 bit variables, from which a couple of bits are stolen to store some metadata. Perhaps there is a way to extend it to generic keys, but it's non-trivial.
This tree is not balanced. What this means is that if you do insertions of keys in sequential order (1,2,3,4,5...) the tree will not re-balance the nodes and what you end up with is just a "linked list" of nodes. Obviously, if you create a tree with 1 million keys, the latency for any operation will be around hours. Yes, hours per insertion or removal or lookup because that's how long it takes to do 1 million traversals of a linked list with 1 million nodes.
This tree is interesting more as a theoretical construct: it showed for the first time that wait-free trees are possible!

We took a sequential implementation of a Red-Black Tree (a balanced tree) and we added the annotation to work with OneFile. The result is a fully functional wait-free balanced tree. Also, it has wait-free memory reclamation for its nodes and the keys are generic (a template parameter).
This isn't the first wait-free balanced tree, that honor falls on CX, a universal construct that is capable of taking any sequential implementation of a data structure and making a wait-free data structure, without even needing annotation of any sort. It even works with std::set.
https://github.com/pramalhe/CX
One day I'll talk more about CX and what it can do, but for the moment let's just say that OneFile is better than CX when it comes to tail latency for writes.

Because OneFile is a generic technique to make concurrent data structures, it means that adding other operations to the underlying data structure is as easy/hard as adding those operations as sequential code. This means it's easy to add things like range queries, do joins/splits, range deletions, range updates, add key A if key B is present, and pretty much anything you can express in C++ code and place inside a lambda. All with linearizable consistency and wait-free progress.
The source code for this wait-free balanced tree can be obtained here:
https://github.com/pramalhe/OneFile/blob/master/datastructures/treemaps/OFWFRedBlackTree.hpp
All you'll need to run it is the header containing the OneFile STM and a C++ compiler.
https://github.com/pramalhe/OneFile/blob/master/stms/OneFileWF.hpp

OneFile is an STM and as such, allows for dynamic transactions. These transactions can cover multiple objects or multiple data structure instances. This means that we can have a wait-free linearizable operation which removes one key from a tree and inserts into another tree, or another data structure. In other words, these transactions are dynamic, have Atomicity, Isolation and Consistency (are linearizable).
... if only we could have Durability... but ohhh wait, there is this thing called "Persistent Memory" (PM) which can give byte-addressable durability.

We have made a variant of the OneFile that is meant for Persistent Memory and gives ACID wait-free transactions.
Next week we'll talk about Persistent Memory and how OneFile is actually the first ever wait-free database storage engine.

Monday, February 19, 2018

The difference between a Universal Construct and a Software Transactional Memory

In this post we're going to talk about Universal Constructs (UC), Software Transactional Memory (STM), the different between the two of them, and why it's not possible to have a UC for persistent memory.

Waaaiiiittt what?!?! It's not possible to have a Universal Construct for non-volatile memory?!? Why not?

The reason is at end of the post, first, like on all voyages, we start at the beginning.
Back in 1991, Maurice Herlihy (yes, it's the guy that wrote "The Art of Multiprocessor Programming" with Nir Shavit) decided he was tired of locks and blocking progress, and tired of dealing with errors in lock-free data structures, and he proposed a thing he called a "Universal Wait-Free Construct".
https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf
You can see an implementation in Java of this UC in chapter 6 of his book.
A UC is a construct (an object) that wraps a sequential implementation of an underlying object (or data structure) and provides safe concurrent access to the underlying object. In the case of a universal wait-free construct, it provides wait-free access to the underlying object.

In other words, you can write a data structure as if it was single-threaded, and then, instead of slapping a lock on top of it (which is blocking), you slap a wait-free UC (which is wait-free) and booom! You get a wait-free data structure.
In hand-made lock-free and wait-free data structures, if you so much as modify a single line of code in them, they're likely going to break apart completely. On the other hand, when using a UC you don't have that problem, as long as the sequential implementation is still correct from a single-threaded point of view. It's as if we're decoupling the concurrency part from the actual object, while providing wait-free progress.

Hummm, so how come I've never heard of this "Universal Construct" thingy?!?

Well, the main reason is that completely wait-free UCs have been until now, extremely slow.
There are partially wait-free UCs, like Left-Right, which provides blocking (starvation-free) progress for writers and wait-free progress for readers. Left-Right is simple and pretty fast, but anything that provides wait-free progress for both readers and writers, is bound to be slow.
The reason behind this slow performance is related to the Copy-On-Write (COW) pattern behind almost all wait-free UCs. Wait-free UCs typically make an entire copy of the underlying object for every mutation. If the object is small, this will be fast, but if the "object" is a data structure with one million nodes, then its going to be super slow.
This means, that UCs are great for theoretical research, but not so great for making non-blocking data structures.
If you want to know more about this topic, then take a look at P-Sim which is probably the best know wait-free UC:
https://dl.acm.org/citation.cfm?id=2664096
or to our own UC named COWMutationQ which is simpler and has a similar performance:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/cowmq-2015.pdf

UCs have a few peculiarities.
First, they can protect a single object, or if you prefer, they have a single root pointer from which all inner objects must be accessible. UCs allow for transactions to be made as long as the objects in that transaction are accessible from the same root pointer, and they must not be accessible from any other root pointer (in case you have other UC instances).
In other words, objects inside the UC have a 1-to-1 connection to the UC (as if it was a lock) and the wait-free UC provides wait-free progress with linearizable consistent for any operation being applied to the underlying object(s).
Second, for multiple objects to be protected by the same UC (as is the case of nodes in a data structure), there has to be a well defined way of copying those objects, what is called in C++ terminology a copy constructor. Without a copy-constructor, or something equivalent, you can't have data structures inside the UC. This pretty much leaves out low-level languages like C from having any practical usage for UCs, ohhh well.
Third, there is no annotation of the sequential implementation.
Let me repeat this one because it's going to be really important later in this post: The user does NOT have to annotate the sequential implementation of the underlying object that is put on the UC. None, Zero, Nada!
The whole point of the "Universal" part of UC is that it can be applied universally to any underlying object, without having to modify it in any way whatsoever.

And what does this have to do with STMs?

It turns out that Maurice is actually a pretty smart guy (should not be a surprise to anyone) and he figured out that if you relax some of these constraints you might be able to improve the performance.
So, only two years later, in 1993, him and JEB Moss coined the term "Transactional Memory" to describe a construct that can do pretty much the same as a UC but without the first and second constraint I mentioned on the paragraph above.
Granted, they originally though TMs were going to be deployed in hardware (HTM) and it's true that it eventually happened, but they also though it was going to be wait-free, and it's currently worse than blocking. For example, Intel's HTM doesn't guarantee any kind of progress.

In 1997, Nir Shavit and D. Touitou showed the first Software Transactional Memory (blocking). From then until now, many, many STMs have been created, but very few with lock-free progress, and I doubt any has been made with wait-free progress. Even if they do exist, they are likely to be just as slow as P-Sim and COWMutationQ.

I'm pretty sure Maurice has been a bit disappointed by this turnout of events. He must have been expecting someone to come up with a lock-free or wait-free STM that had some nice performance. In fact, I asked him about this on a conference once, and he said that "If someone could come up with a lock-free STM, then it would be something that's actually very useful". I could tell his eyes were sparkling as he said that :)

The main disadvantage of STMs is that they require some kind of annotation of the underlying object(s). In other words, you can't just write a sequential implementation of a data structure, slap an STM on top of it and get concurrent access for free. Nope, you need to annotate or even modify (depending on the STM) some code of the sequential implementation.
Granted, the annotations are typically small, such as changing data types or calls to allocator or deallocator, but still, some work needs to be done by the user. The process is not automatic and it's not "universal".

Just to summarize:
- STMs can have multiple root pointers. UCs can have only one root pointer;
- STMs allow transactions between any object (allocated in the TM). Each UC can execute transactions only within the objects it protects;
- STMs don't need copy constructors. UCs need copy constructors or equivalent methods;
- STMs require annotation of the underlying object. UCs require no annotation or change to the sequential implementation;

Just to consolidate the idea: If it needs annotation, it's not a Universal Construct!

What about that stuff you said about not having UCs for persistent memory ?

Now we get to the insightful part.
Any code that modifies data in Non-Volatile Memory (NVM) must (currently) flush the cache lines after the modification. On x86 this is done with a CLFLUSH or CLFLUSHOPT or CLWB instructions.
The only way to do the flushing automatically is to have some kind of annotation, typically in the type of the variables modified. Yes, the compiler can hide that from me, but I still have to have a way of telling the compiler which data is volatile and which data is non-volatile, and that is typically done through type annotation. The compiler can't guess what data is meant to be persisted and what data is meant to be volatile, there is just no way for it to figure that one out because that is related to application logic, and by definition, only the user knows that. The only way around this is to have _everything_ be non-volatile, but that requires hardware which doesn't currently exist, and may not be an interesting programming model anyways.
This means, that the whole concept of "Universal Construct" kind of fades away when it comes to NVM, because the main point of having a UC, is not doing any change to the underlying code, but the only way to have a sequential (non persistent) implementation of an object be persistent, is to annotate it.
Of course, if they ever make an architecture where writing to NVM doesn't need to be flushed, then ok, it should be possible for someone to make a UC with persistence, but that's very unlikely to happen (the no-need-for-flushing-cpu-architecture).

In practice all of this is fine because we still have STMs, and STMs are particularly suited for working in NVM because they were designed to provide transactions. Having NVM is just an added bonus that can make the transactions persistent, which is great because it gives ACID.

Up until recently, STMs only had ACI, there was no Durability, but now with NVM, STMs can provide full ACID support, just like databases... and that is a game-changer.

Ultimately, whether you call it a UC or an STM depends on the interface that is being given to the user. If it needs annotation, it's not a UC. But then again, having to annotate a data structure is typically not too hard, therefore, it's fine to do it if you get some extra performance in return for that extra work.

An update: Actually, there is a way to make a universal construct for NVM, it's just a matter of flushing the entire memory region on each operation. So I guess I was wrong, yes, it is possible to have universal constructs for persistence, it's just that their performance will go down the drain if we have to flush the entire NVM region on every single transaction. Moral of the story, if you're into to persistence, better stick to STMs.

Wednesday, December 20, 2017

Software Transactional Memory is Simple

Still super busy so I'll just post a light talk about an STM that is being used in production at AppNexus. It's not everyday you see production-use of STMs.