Tuesday, February 20, 2018

Why do we need to flush persistent memory?

In the followup of the previous post, here is a presentation by an Intel guy where he explains to some detail exactly why we need to flush stores to NVM. He answers this exact question at the 5min mark:
https://www.youtube.com/watch?v=FcKwqdZU5dk&t=5m

... and Andy Rudoff said the same at 9m30s of this presentation:
https://www.youtube.com/watch?v=rG8h8NPnHbc&t=9m30s

Monday, February 19, 2018

The difference between a Universal Construct and a Software Transactional Memory

In this post we're going to talk about Universal Constructs (UC), Software Transactional Memory (STM), the different between the two of them, and why it's not possible to have a UC for persistent memory.

Waaaiiiittt what?!?! It's not possible to have a Universal Construct for non-volatile memory?!? Why not?

The reason is at end of the post, first, like on all voyages, we start at the beginning.
Back in 1991, Maurice Herlihy (yes, it's the guy that wrote "The Art of Multiprocessor Programming" with Nir Shavit) decided he was tired of locks and blocking progress, and tired of dealing with errors in lock-free data structures, and he proposed a thing he called a "Universal Wait-Free Construct".
https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf
You can see an implementation in Java of this UC in chapter 6 of his book.
A UC is a construct (an object) that wraps a sequential implementation of an underlying object (or data structure) and provides safe concurrent access to the underlying object. In the case of a universal wait-free construct, it provides wait-free access to the underlying object.

In other words, you can write a data structure as if it was single-threaded, and then, instead of slapping a lock on top of it (which is blocking), you slap a wait-free UC (which is wait-free) and booom! You get a wait-free data structure.
In hand-made lock-free and wait-free data structures, if you so much as modify a single line of code in them, they're likely going to break apart completely. On the other hand, when using a UC you don't have that problem, as long as the sequential implementation is still correct from a single-threaded point of view. It's as if we're decoupling the concurrency part from the actual object, while providing wait-free progress.


Hummm, so how come I've never heard of this "Universal Construct" thingy?!?

Well, the main reason is that completely wait-free UCs have been until now, extremely slow.
There are partially wait-free UCs, like Left-Right, which provides blocking (starvation-free) progress for writers and wait-free progress for readers. Left-Right is simple and pretty fast, but anything that provides wait-free progress for both readers and writers, is bound to be slow.
The reason behind this slow performance is related to the Copy-On-Write (COW) pattern behind almost all wait-free UCs. Wait-free UCs typically make an entire copy of the underlying object for every mutation. If the object is small, this will be fast, but if the "object" is a data structure with one million nodes, then its going to be super slow.
This means, that UCs are great for theoretical research, but not so great for making non-blocking data structures.
If you want to know more about this topic, then take a look at P-Sim which is probably the best know wait-free UC:
https://dl.acm.org/citation.cfm?id=2664096
or to our own UC named COWMutationQ which is simpler and has a similar performance:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/cowmq-2015.pdf

UCs have a few peculiarities.
First, they can protect a single object, or if you prefer, they have a single root pointer from which all inner objects must be accessible. UCs allow for transactions to be made as long as the objects in that transaction are accessible from the same root pointer, and they must not be accessible from any other root pointer (in case you have other UC instances).
In other words, objects inside the UC have a 1-to-1 connection to the UC (as if it was a lock) and the wait-free UC provides wait-free progress with linearizable consistent for any operation being applied to the underlying object(s).
Second, for multiple objects to be protected by the same UC (as is the case of nodes in a data structure), there has to be a well defined way of copying those objects, what is called in C++ terminology a copy constructor. Without a copy-constructor, or something equivalent, you can't have data structures inside the UC. This pretty much leaves out low-level languages like C from having any practical usage for UCs, ohhh well.
Third, there is no annotation of the sequential implementation.
Let me repeat this one because it's going to be really important later in this post: The user does NOT have to annotate the sequential implementation of the underlying object that is put on the UC. None, Zero, Nada!
The whole point of the "Universal" part of UC is that it can be applied universally to any underlying object, without having to modify it in any way whatsoever.


And what does this have to do with STMs?

It turns out that Maurice is actually a pretty smart guy (should not be a surprise to anyone) and he figured out that if you relax some of these constraints you might be able to improve the performance.
So, only two years later, in 1993, him and JEB Moss coined the term "Transactional Memory" to describe a construct that can do pretty much the same as a UC but without the first and second constraint I mentioned on the paragraph above.
Granted, they originally though TMs were going to be deployed in hardware (HTM) and it's true that it eventually happened, but they also though it was going to be wait-free, and it's currently worse than blocking. For example, Intel's HTM doesn't guarantee any kind of progress.

In 1997, Nir Shavit and D. Touitou showed the first Software Transactional Memory (blocking). From then until now, many, many STMs have been created, but very few with lock-free progress, and I doubt any has been made with wait-free progress. Even if they do exist, they are likely to be just as slow as P-Sim and COWMutationQ.

I'm pretty sure Maurice has been a bit disappointed by this turnout of events. He must have been expecting someone to come up with a lock-free or wait-free STM that had some nice performance. In fact, I asked him about this on a conference once, and he said that "If someone could come up with a lock-free STM, then it would be something that's actually very useful". I could tell his eyes were sparkling as he said that  :)

The main disadvantage of STMs is that they require some kind of annotation of the underlying object(s). In other words, you can't just write a sequential implementation of a data structure, slap an STM on top of it and get concurrent access for free. Nope, you need to annotate or even modify (depending on the STM) some code of the sequential implementation.
Granted, the annotations are typically small, such as changing data types or calls to allocator or deallocator, but still, some work needs to be done by the user. The process is not automatic and it's not "universal".

Just to summarize:
- STMs can have multiple root pointers. UCs can have only one root pointer;
- STMs allow transactions between any object (allocated in the TM). Each UC can execute transactions only within the objects it protects;
- STMs don't need copy constructors. UCs need copy constructors or equivalent methods;
- STMs require annotation of the underlying object. UCs require no annotation or change to the sequential implementation;

Just to consolidate the idea: If it needs annotation, it's not a Universal Construct!


What about that stuff you said about not having UCs for persistent memory ?

Now we get to the insightful part.
Any code that modifies data in Non-Volatile Memory (NVM) must (currently) flush the cache lines after the modification. On x86 this is done with a CLFLUSH or CLFLUSHOPT or CLWB instructions.
The only way to do the flushing automatically is to have some kind of annotation, typically in the type of the variables modified. Yes, the compiler can hide that from me, but I still have to have a way of telling the compiler which data is volatile and which data is non-volatile, and that is typically done through type annotation. The compiler can't guess what data is meant to be persisted and what data is meant to be volatile, there is just no way for it to figure that one out because that is related to application logic, and by definition, only the user knows that. The only way around this is to have _everything_ be non-volatile, but that requires hardware which doesn't currently exist, and may not be an interesting programming model anyways.
This means, that the whole concept of "Universal Construct" kind of fades away when it comes to NVM, because the main point of having a UC, is not doing any change to the underlying code, but the only way to have a sequential (non persistent) implementation of an object be persistent, is to annotate it.
Of course, if they ever make an architecture where writing to NVM doesn't need to be flushed, then ok, it should be possible for someone to make a UC with persistence, but that's very unlikely to happen (the no-need-for-flushing-cpu-architecture).

In practice all of this is fine because we still have STMs, and STMs are particularly suited for working in NVM because they were designed to provide transactions. Having NVM is just an added bonus that can make the transactions persistent, which is great because it gives ACID.

Up until recently, STMs only had ACI, there was no Durability, but now with NVM, STMs can provide full ACID support, just like databases... and that is a game-changer.

Ultimately, whether you call it a UC or an STM depends on the interface that is being given to the user. If it needs annotation, it's not a UC. But then again, having to annotate a data structure is typically not too hard, therefore, it's fine to do it if you get some extra performance in return for that extra work.

An update: Actually, there is a way to make a universal construct for NVM, it's just a matter of flushing the entire memory region on each operation. So I guess I was wrong, yes, it is possible to have universal constructs for persistence, it's just that their performance will go down the drain if we have to flush the entire NVM region on every single transaction. Moral of the story, if you're into to persistence, better stick to STMs.

Thursday, February 8, 2018

Can't protect a pointer without a store-load fence

This post is about memory reclamation algorithms in shared memory concurrency. That's things like Hazard Pointers, Drop the Anchor, epoch-based reclamation (User-space RCUs) and of course, Hazard Eras. There's an insight about memory reclamation, I had a long time ago, that I want to share. This insight probably isn't new, but it was to me when I thought about it  ;)


Let's start from the beginning:

Many years ago, Leslie Lamport made a short paper where he showed that to have mutual exclusion on shared memory system, you need at least one store-load fence. Not sure this is the right reference but it's close enough to this topic:
https://www.microsoft.com/en-us/research/publication/make-multiprocessor-computer-correctly-executes-multiprocess-programs/
In other words, for two threads to have mutually exclusive access to a certain data, this data must be protected with an algorithm (a lock) and to implement this algorithm you will need at least one store and one load, whose visibility must not be re-ordered to the other thread.
It's probably easier to think about this in the context of Peterson's algorithm, or Dekker's, or one of our 10 algorithms which do the same kind of thing:
https://en.wikipedia.org/wiki/Dekker%27s_algorithm
https://en.wikipedia.org/wiki/Peterson%27s_algorithm
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/cr2t-2016.pdf

If you get in the context of these algorithms, you can think about the store as being a way of signaling the intent to acquire the lock, and the load as a check that the other thread doesn't already have the lock. If the effects or visibility of these two operations could be changed (by the compiler or CPU or cache-coherence system) then the algorithm would no longer be correct.
Placing a store-load fence between the store and the load prevents re-ordering from happening, making the algorithm correct, at the cost of adding synchronization.

Another analogy, is to think about a store as a mechanism to send a message to a certain centralized location, and the load as a mechanism to read the message at that centralized location. This is easier for us humans to reason about.
In the end it's all physics, transmitting information takes time, and it involves waiting which by definition is blocking or at least involves some kind of synchronization, in this case, the store-load fence.

As an aside, there are no explicit store-load fences on the C11/C++ memory model, therefore we must use a seq-cst store followed by a seq-cst load, or a relaxed store and relaxed load but with an atomic_thread_fence(memory_order_seq_cst) in the middle.


Waaaaiiit, you said you were going to talk about memory reclamation, but you're talking about mutual exclusion?!? You tricked us!!!

Not exactly, you see, the two-thread mutual exclusion problem and the memory reclamation problem have a lot in common.
A lock algorithm allows two threads to access an object, one a time. Each thread has the guarantee that when it is accessing the object, the other thread is not accessing the object.
A memory reclamation algorithm allows two threads (the reader and the reclaimer) to access an object, with the reclaimer having the guarantee that at a certain point in time the reader will never access the object again.
Notice the difference?

For mutual exclusion a thread needs to know if the other thread is not currently accessing the object, while for memory reclamation, a thread (the reclaimer) needs to know if the other thread (the reader) is not currently accessing the object nor will it ever try to access it again.

In summary, memory reclamation can be seen as mutual exclusion with a stronger requirement.
Which means, that if to do mutual exclusion we need a store-load fence, then we also need (at least) a store-load fence for doing memory reclamation.
To be more precise, a reader needs one store-load fence to signal its intent to protect an object to the reclaimer.

Now, I never proved this formally, this is just a sequence of logical deductions. So let me substantiate this bold statement with some examples.

Start by looking at figure 3 of our Hazard Eras paper:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/hazarderas-2017.pdf



Looking at the left side, to protect a pointer in Hazard Eras we do a seq-cst store of this pointer on a shared array, and then we check that the pointer is still valid. There is an implicit store-load fence placed in by the compiler when we use seq-cst atomic store followed by a load.
In Hazard Pointers, at least one store-load fence is needed for every object needing protection.

Looking at the right side, to protect a pointer with URCU or Epoch-based reclamation, we do a seq-cst store of an epoch and then load whatever pointer we want to use. Again here there is an implicit store-load fence so that the store of the epoch will not be re-ordered with any of the loads of the object that will be used in the read-side critical section.
In Epoch-based reclamation, a single store-load fence is needed to protect all objects, now and until the read-side critical section ends.

Looking at the middle code, to protect a pointer with Hazard Eras, we do a seq-cst store of an era, load whatever pointers we need, and then check that the era has not changed. Again here, there is an implicit store-load fence between the store of the era and the subsequent loads. In fact, there is also a constraint that the load of the pointer(s) and the following load of the era can not be re-ordered, but that's another story.


Do you see the "pattern" now?
All three classes of algorithms require one store-load fence.
The difference between the three is that:
- Hazard Pointers does one fence to protect one object;
- Epoch-based does one fence to protect all objects (until the end of the read-side critical section);
- Hazard Eras does one fence to protect all currently live objects;
and of course, Epoch-based reclamation is blocking, while Hazard Pointers and Hazard Eras are lock-free.


Having said this, there are some apparent exceptions to this, but I would argue that the synchronization of the store-load fence is always there, it just might be hidden from sight.
For example, we can use hardware transactional memory to do memory reclamation, and no explicit store-load fence is needed... because the hardware does the corresponding synchronization internally.
http://www.cs.technion.ac.il/~erez/Papers/teleportation.pdf
It's also possible to use optimistic memory reclamation, but then we're relaxing the premises of the memory reclamation problem I stated above and that may not always be possible.
http://www.cs.technion.ac.il/~erez/Papers/oa-spaa-15.pdf
There are also other approaches that make use of services provided by the Operating System, just like there are mutual exclusion locks that are OS provided. Ultimately, they will need synchronization or a store-load fence, maybe it's the kernel that's doing it instead of the user code explicitly, but the synchronization is there somewhere, hidden behind all the OS/Kernel layers.

The final insight is the following: If it does correct and safe memory reclamation in a shared memory concurrency system, it needs at least one store-load fence, or an equivalent form of synchronization!