Saturday, May 23, 2015

The quest for the perfect load/store lock (part 2)

60 years ago we didn't even know if it was possible to provide mutual exclusion with a software-only solution, and then came Dekker (and Dijkstra) and showed it was possible. Knuth, Peterson, Leslie Lamport, and many others followed with other solutions throughout the years. Some were real innovations, others were just refinements of previous ideas.
It's funny that some of the solutions keep on getting "re-discovered"... I guess it's the old adage of "The Book" by Erdős: the really good solutions have always been there and just need to be "discovered", as opposed to "invented".
One concrete example is Lamport's "One Bit Solution", which Andreia re-discovered earlier this year, mostly because it is a simple solution and one of the fastest (although not starvation-free). By the time we realized that Lamport had come up with it a long time ago, we had already made several "upgrades" to it with the intent of providing starvation-freedom, so we decided to push on a bit more and write a paper about the two best solutions we had come up with for the N-thread mutual exclusion problem:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/cralgorithm-2015.pdf

These two algorithms were designed with the ideas from the previous post in mind, about what an ideal lock should be.
One of the algorithms, CRTurn, uses global spinning, where all threads spin on the same variable, named turn.
The other algorithm, CRHandover, uses local spinning, where each thread is (mostly) spinning on a different variable (and cache line).
The main innovation is that, because they're based on Lamport's One Bit Solution, they can enter the critical section (locking) doing a single store in shared memory, i.e. an atomic_store() for the C11/C++1x folks, or a volatile store for the Java/JVM developers. The same is true when exiting the critical section (unlocking): a single store in shared memory is required. However, unlike Lamport's solution, they're both starvation-free.
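For reference, here is a minimal sketch of what Lamport's One Bit Solution looks like in C11 (my own illustrative rendering, not the code from the paper), assuming a compile-time number of threads N and an array of flags, one per thread. Notice how the uncontended path of lock() does a single store followed by N-1 loads:

#include <stdatomic.h>
#include <stdbool.h>

#define N 4                        // number of threads (illustrative)

static atomic_bool b[N];           // one flag per thread, all start at false

void one_bit_lock(int i)           // i is this thread's id, in [0, N)
{
retry:
    atomic_store(&b[i], true);     // single store to announce the intent
    for (int j = 0; j < i; j++) {
        if (atomic_load(&b[j])) {
            atomic_store(&b[i], false);   // defer to lower-numbered threads
            while (atomic_load(&b[j])) ;  // spin until it releases
            goto retry;
        }
    }
    for (int j = i + 1; j < N; j++) {
        while (atomic_load(&b[j])) ;      // wait out higher-numbered threads
    }
    // Critical section: on the uncontended path we did 1 store and N-1 loads
}

void one_bit_unlock(int i)
{
    atomic_store(&b[i], false);    // single store to release the lock
}

The starvation problem is visible in the sketch: a thread with a high id can defer forever to lower-numbered threads, which is precisely what our two algorithms fix.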
If you saw the "Atomic Weapons" presentation by Herb Sutter, then I'm sure you know that in such memory models the most expensive operation is a release-store (a store in shared memory), as opposed to an acquire-load or a regular load/store. This means an ideal algorithm should have the smallest possible number of release-stores, in order to be as fast as possible (under low contention).


Why do we care about the low contention scenario, and not so much about the high contention one?
Well, imagine you have an application with many lock instances, and you do some profiling of your app for latency/scalability/throughput, and you find that one of the lock instances is a big bottleneck, where many (or all) threads get stuck waiting for that particular resource or critical section... now what do you do?
If the resource under contention is a data structure, then this would be an ideal place to put a lock-free variant of said data structure, or at least a variant with finer grain locking. If it's an object or set of objects, then you can still have gains by using techniques such as Left-Right, or Copy-On-Write, or even one (or multiple) Reader-Writer Locks. If worst comes to worst, you can always re-design the application to reduce contention on this single point.
What this means in practice is that a healthy application will not systematically have high contention on a single lock. Don't get me wrong, it can have occasional bursts of high contention, and when that happens it had better be a starvation-free lock, to ensure better (lower) latency and fairness. But the point is that most of the time, when a thread attempts to acquire a lock (any lock) in your app, the lock will be uncontended. This means the total throughput will be dominated by the behavior of the lock under low contention, i.e. a single thread attempting to acquire it, and therefore reducing the number of release-stores in that case is a very desirable goal.

Now, this is not to say that locks designed with high contention in mind should not be taken seriously; it's just that, IMO, that's a rare case. As I said in the previous post, an ideal lock should behave well in both scenarios.

More on this in a future post.

Saturday, May 16, 2015

The quest for the perfect load/store lock (part 1)

This is the first part of a two-post digression into what it takes to make an optimal starvation-free mutual exclusion lock using only loads and stores.
If you don't know what such algorithms are, take a look at the software solutions to the mutual exclusion problem.

Let us start by considering the properties of an idealized starvation-free algorithm for mutual exclusion, using loads and stores, that works between N threads, under the C11/C++1x memory model.
Two distinct behaviors should be analyzed: what happens under high contention, where all N threads are concurrently competing for the lock, and what happens under low contention, where a single thread is attempting to acquire the lock.

For the high contention case, on the lock() method, an ideal algorithm should do local spinning on its own cache line while waiting for the current lock holder to release the lock, and then immediately enter the critical section.
Ideally, before it enters the spinning loop, the thread can spend some time reaching a consensus with the other waiting threads to order themselves relative to each other (there are usually N-1 waiting threads in this high contention scenario), so that later they don't waste time reaching consensus on which one is the next to get the lock.
By doing the relative ordering while they wait, the algorithm gains time: otherwise, the waiting threads would just be spinning, doing nothing useful, until the lock is released.

On the unlock() side, as soon as the critical section is over, the thread should write to the variable that the next thread is waiting on, thereby passing the lock to the next thread. After doing this, the thread previously holding the lock may still do other work, for example, resetting some variables back to their default state.

A template of what the pseudo-code would look like for a mutual exclusion lock algorithm that works well under high contention can be seen below:

void lock()
{
    orderThisThreadRelativeToOtherThreads();
    while (atomic_load(&localSpinVariable) != MINE) Pause();
    // Enter Critical Section immediately
}

void unlock()
{
    // The next thread is spinning on its own variable: store the value
    // it is waiting for, thus passing it the lock with a single store
    atomic_store(&nextThreadSpinVariable, MINE);
    resetStateForNextLock();
}
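As a concrete illustration of this template, here is a sketch of the well-known MCS lock in C11 (my rendering, not from any particular paper). Note that MCS uses an atomic exchange and a compare-and-swap to order the waiting threads, so it is not a load/store-only lock, but it shows both the local spinning and the single-store handover described above:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node_t;

typedef struct {
    _Atomic(mcs_node_t *) tail;   // last node in the queue of waiters
} mcs_lock_t;

void mcs_lock(mcs_lock_t *lock, mcs_node_t *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    // Ordering step: atomically append this thread to the queue of waiters
    mcs_node_t *prev = atomic_exchange(&lock->tail, me);
    if (prev != NULL) {
        atomic_store(&prev->next, me);
        // Local spinning: each thread waits on its own node (own cache line)
        while (atomic_load(&me->locked)) { }  // Pause() could go here
    }
    // Enter Critical Section immediately
}

void mcs_unlock(mcs_lock_t *lock, mcs_node_t *me)
{
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        // No known successor: try to swing the tail back to empty
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
            return;
        // A successor is in the middle of linking itself; wait for it
        while ((succ = atomic_load(&me->next)) == NULL) { }
    }
    // Single store that passes the lock to the next waiting thread
    atomic_store(&succ->locked, false);
}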




For the low contention case, on the lock() method, the ideal algorithm should spend as little time as possible, so as to provide maximum throughput when a single thread is continuously acquiring and releasing the lock, without any contention from other threads. This means the algorithm should provide a quick way to determine whether there are other threads contending for the lock or already holding it, and, if possible, do a single store to make the intent of acquiring the lock visible to other threads.
Under the C11/C++1x memory model, making the intent to acquire the lock visible to other threads can be done with a single call to atomic_store(), while determining that there are no other threads contending for the lock requires anywhere from log2(N-1) atomic_load() calls (in the case of tournament-based algorithms) to N-1 atomic_load() calls (in the case of array-based algorithms).
Notice that although multiple loads may be required to confirm that the thread is alone in its intent to acquire the lock, only a single acquire-load is needed, along with a full barrier (acquire and release) at the end of the scan of the states of the remaining threads.

On the unlock() side, the goal is similarly to provide a maximum throughput and, as such, use the minimum possible number of stores and loads. At least one store is always required, i.e. the one that resets/changes the state of the current thread back to its default value of no longer holding the lock or attempting to do so.
A template for a lock that would work well under a low contention scenario can be seen below:

void lock()
{
    atomic_store(&mystate, WantIn);
    // This is the "fast path"
    if (noOtherThreadsAttemptingToAcquireLock()) return;
    // This is the "slow path"
    doNormalConsensusAndWaitForLockToBeAvailable();
}

void unlock()
{
    atomic_store(&mystate, DontWantIn);
}
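As a minimal sketch of what noOtherThreadsAttemptingToAcquireLock() could look like in an array-based algorithm, assuming a states[] array with one entry per thread (the names are illustrative, not from an actual implementation), and using C11's sequentially consistent defaults rather than the weakest orderings discussed above:

#include <stdatomic.h>
#include <stdbool.h>

#define N 4                      // number of threads (illustrative)
enum { DontWantIn, WantIn };

static atomic_int states[N];     // one state per thread, starts at DontWantIn

// Scans the announced state of every other thread: at most N-1 loads
bool noOtherThreadsAttemptingToAcquireLock(int self)
{
    for (int j = 0; j < N; j++) {
        if (j != self && atomic_load(&states[j]) != DontWantIn)
            return false;
    }
    return true;
}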

The main conclusion from this analysis is that for the high contention case, an algorithm needs to be able to pass the lock to the next waiting thread as fast as possible, preferably with just one atomic_store(), while for the low contention case, an algorithm should be able to enter the critical section with just one atomic_store() and at most N-1 calls to atomic_load(), and then release the lock, preferably with just one atomic_store().


So what is this all about? In the next post we will go into practical examples of these ideas.

Wednesday, March 11, 2015

Tidex white paper

We've prepared a short white paper on the Tidex Lock that we discovered some months ago.
It doesn't say much more than the blog post, but if you're interested, you can get it from github:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/tidex-2015.pdf

Just a quick update: we've added padding to the Tidex Lock (and the Ticket Lock), and they both roughly doubled in performance. They now give the same throughput in our benchmarks, and in the plot the two lines are almost indistinguishable.
The source code for these benchmarks is available on github:
https://github.com/pramalhe/ConcurrencyFreaks/tree/master/C11/papers/tidex
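In case you're wondering what the padding trick is: it just keeps the two hot variables of the lock on separate cache lines, so that threads spinning on one don't take cache misses caused by writes to the other (false sharing). A minimal sketch of the idea, assuming 64-byte cache lines (the field names are illustrative):

#include <stdatomic.h>

#define CACHE_LINE 64   // assumed cache-line size in bytes

typedef struct {
    _Atomic long ingress;                          // written by lock()
    char pad[CACHE_LINE - sizeof(_Atomic long)];   // keep egress on its own cache line
    _Atomic long egress;                           // written by unlock()
} padded_lock_t;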

Tuesday, February 10, 2015

Left-Right and the (broken) peer-review system


In early September last year, we finished an updated version of the Left-Right paper. We submitted it (again) to another of the major conferences in the field, and (again) it came back with lots of weak-rejects. Here is the updated paper on github, in case you're interested:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/left-right-2014.pdf

The reviewers were unanimous about the novelty of this technique, and most were encouraging in their statements, suggesting we continue to improve the paper until it is "ready for a conference". The problem is, it will never be "ready for a conference", because they ask for opposite things, or for just too much work that we don't have the time or patience to do. This means the Left-Right paper will most likely never be shown at a major computing conference, and that, my friends, is a big shame, for multiple reasons.


The first reason is time: Most people in academia (i.e. peer reviewing papers for conferences) are either PhD students, or professors at a university with one or several PhD students working under them.

If they're a PhD student, then they have a big slice of their time devoted to working on papers, unless their PhD supervisor is a jackass who suckered them into babysitting his kids or some other task completely unrelated to the topic of the student's PhD thesis. They still have to attend and sometimes give classes, which can suck up a good deal of time, but PhD students aren't expected to have any personal time anyways, so that's life.

If they're a professor at a university, they will probably not have too much time (between classes and going to conferences and committees and whatnot), but they will have one or several PhD students to do the "grunt work" on the papers. All the professor has to do is have a few somewhat original ideas and choose at least one PhD student who's not completely dumb and is capable of doing the work or explaining it to the other PhD students. He then reviews the draft and corrects all the crap the PhD students write, points them in the right direction, adds the right references to previous work, and the final result will be a decent enough paper, which doesn't even have to be very innovative, as long as it has "the right stuff". If he's really lucky, the professor will have a Post-doc who does this last part of the work, reviewing the paper and doing the multiple iterations, and all he has to do is look at the almost finished version and make some minor changes.

A much smaller group consists of people working in companies but still tied to academic life or to research (like in research labs). I guess those have to show some results, and they have a lot more tasks than just writing papers, but getting papers out is one way of "showing work done" and keeping their job, so they do it. I think these folks usually focus on stuff that is more on the practical side, i.e. more "engineering research" and less "pie in the sky research", but that's just my impression, and I'm sure it varies a lot from company to company and research group to research group.

And then there are people like me and Andreia. We don't have a background in this particular field (Concurrency and Synchronization algorithms); Andreia has a master's in Applied Mathematics with a post-grad in Information Systems, and I have a PhD in Physics (but I spent most of my PhD doing non-parametric low-sample statistical analysis and developing new statistical techniques and algorithms), so we're only "amateurs" in this field. But we do have many years of experience in Java/C/C++ and we both like studying algorithms, so we ended up here.
We have day jobs, not really related to these topics, which means that whatever time we devote to Concurrent Algorithms must be taken from our spare/leisure time. Because we enjoy it so much, we do spend a big chunk of it doing research, but with a 4-month-old baby, our spare/leisure time is getting shorter and shorter (as is to be expected).
This means that, no, dear reviewers, we won't do all that insane amount of work you suggest. The Left-Right paper explains the algorithm, has the source code, has microbenchmarks for a specific use-case, and even has a proof of correctness... there's just not much more we can do, even if we wanted to!


The second reason is that they don't get it: The first time we submitted the paper to a conference, there were a few reviewers who thought the Left-Right was some kind of new concurrent tree data structure. I must admit that we're definitely not the best people in the world at writing papers, but obviously those reviewers just skimmed the paper and "didn't get" what it was about.
This time, none of them thought it was a new kind of tree, but almost all suggested that it should be compared against RCU. Saying that is about the same as coming up with a new algorithm for a Reader-Writer Lock and then being told you should compare your Reader-Writer Lock with Hazard Pointers... duhhh. This only shows that they either don't know what Left-Right is, or they don't know what RCU is, or both. I don't blame them for not knowing this, because both are non-trivial concepts, but it does upset me that they call themselves "experts" in the field and then don't know what these things are about.

The Left-Right pattern is a new kind of "synchronization primitive" or "concurrency technique" or whatever you want to call it, and the closest thing to it is indeed a Reader-Writer Lock, as we mention in the paper. Comparing it with a Reader-Writer Lock is a reasonable suggestion, because we can use the Left-Right pattern in almost all the same scenarios where we use a Reader-Writer Lock, and indeed we do this comparison in our microbenchmarks.

Comparing the Left-Right with RCU is not very reasonable, because Left-Right is meant to be used in a very different way. And yes, it can be used in a similar way, and in fact I believe it can even be used to implement a generic RCU API, but that's beside the point... it's like saying that a lock-free queue/list can be used to implement a Reader-Writer Lock (as in the MCS algorithm), and as such, we should compare the two with each other... how would you even do that? It doesn't make sense.

The goal of the Left-Right paper is to explain the Left-Right pattern and the concurrency control algorithm behind it. It's impossible to "predict" all the applications that people will apply it to. The Left-Right pattern can be applied to any data structure, just like you can apply a Reader-Writer Lock to any data structure. If you discover a new algorithm for a Reader-Writer Lock, does that mean that to write a paper about it you have to apply it to ALL the data structures the reviewers know about, and compare it against all the lock-free data structures that exist?
Well then, if that's true, don't expect any other new Reader-Writer Lock to ever be discovered, or at least published.
How's that for stifling innovation?


And that takes me to the third reason: the peer review system is stifling true innovation. Let me exemplify in another way, by telling a personal story.
Last May, I challenged Andreia to come up with a technique for a Reader-Writer Lock that, using a particular kind of algorithm, would be able to scale with readers just as well as C-RW-WP, but be starvation-free all around, i.e. starvation-free for readers-to-writers, writers-to-readers, readers-to-readers, and writers-to-writers. For those of you unaware, this is kind of a "Holy Grail" of Reader-Writer Locks, because it provides good reader scalability but has no preference for either Readers or Writers, which gives good latency guarantees (as good as a Reader-Writer Lock can provide). As it so happens, she "took the bait" and did come up with such an algorithm, which we then expanded into multiple variants, implemented in multiple languages, ran on microbenchmarks, made plots for, and even turned into a nice powerpoint presentation with a few animations that we showed to a couple of people.
We haven't written a paper about it yet, mostly because we haven't had the time: we're busy fixing up the other papers we're trying to submit in order to please reviewers at conferences.

One more example, just to make my point: last year in January (more than a year ago) we had an idea for a Lock-Free linked list that is kind of like Harris's linked list, but easy to implement in any language and easy to explain. We implemented it in 4 different languages, benchmarked it, and even used it as a building block for other more complex data structures... but we didn't write a paper.
Why? Because it would take a considerable amount of time, and most likely it would not be accepted, because of reviewers wanting it to be compared with x and y lock-free hashset or treeset, because hey, it's (also) a "set", so why not compare it against a "hashset"?

I could go on with more examples, but I'll spare you.
There's no point in searching github or the concurrencyfreaks site for the Reader-Writer Lock or Lock-Free linked list mentioned above, because we haven't made them public. The reason for that is related to the time it takes to write the papers, and to reviewers that stifle innovation, but also to the next reason...


The fourth and final reason is someone else stealing your work: Andreia and I work hard on these topics, spending most of our personal time, having long discussions over lunch, dinner, and throughout the night, until we figure out the best way to do a certain algorithm, or to explain why a certain technique gives the results it gives. We write thousands of lines of code until the algorithm is fine-tuned and free of bugs, we write benchmarks and stress tests, we add and remove debugging code, we write new ideas just to test some behavior or hypothesis.
We don't get paid to do this, we don't register patents of our discoveries (yes, they are "discoveries", not "inventions"), we make the source code available in multiple languages, we provide plots with benchmark results done on multiple architectures (as many machines as we have access to).
You're free to use the algorithms we make available to the public for whatever purpose you want, including getting rich from them (if you manage to do so, then you certainly deserve it, lol), and the only thing we ask in return is to be given credit for them. That's not much to ask, is it?

Sometimes, different groups discover the same ideas at almost the same time, and that's normal; whoever publishes first takes the credit for it, and that's the way it should be (one example that happened to us was the C-RW-WP by the Oracle research labs, which we discovered independently as Scalable Reader-Writer Locks). This is not a problem, and it happens more often than we might think, probably because some ideas depend on previous work, and only once that work is made public does it "unlock" the ideas that build on it.

What is problematic is when you come up with a new idea, make it publicly available, and then someone comes along and claims it as their own (like WriterReaderPhaser).
There is a well-known psychological effect that happens when you read/hear about an idea/concept and then kind of forget about it, until later you "come up with it" again, or some small variation of it, and you think it is new. Our brain can trick us that way, because we have a tendency to make good ideas our own, and personally, I'm probably guilty of this sin on more than one occasion, like for example this one, which is just an implementation of the C-RW-NP. But once someone points out that it is an already known idea, we take the time to go over it carefully, to understand if that is really so. The truth is, it's really hard to know everything that has been published, and even harder for stuff that is publicly known but not published in the mainstream, even in a field like this one, which isn't really all that extensive. Because of these factors, it becomes easy to copy someone else's idea (on purpose or inadvertently), or simply to re-invent the wheel.

Of course, we could do like Dmitry Vyukov: he publishes his ideas and algorithms on his website and doesn't waste time trying to please reviewers that don't understand what he's talking about. I'm afraid I don't have the "cojones" he has, because if we put all our stuff on our web site and not in papers, it would be easy for someone else to come along and say it was their idea, and there would not be much we could do about it.
This effect is so strong that some researchers publish papers that are deliberately made complex, so that it is hard to understand all the details or even copy the idea. And let's not even go into making source code available, because apart from a few great exceptions (with The Art of Multiprocessor Programming being the best example of source code sharing), most papers don't come with source code.

What good is it to publish an algorithm if we don't provide the source code? I thought one of the basic tenets of the Scientific Method is Reproducibility, and how can you reproduce a certain result in this field without (at least) the source code?


For the moment, the only way we have found to protect our long and hard work is to write a paper about it and get it published, but the truth is, this is taking so much of our time that all of our great ideas are getting left behind, and it frustrates me a lot. I guess this is not just a critique of the whole peer review system, but also of our own ability to write scientific papers; still, the end result is that innovation is being stifled.
How many engineers who come up with new innovations don't even bother to publish them anywhere, or even to spend time thinking about these things, because they know they won't get any benefit from it, and most likely won't even get the credit for it?
It's all about the Incentives, and in our case, it's mostly about the Passion, because the incentives are nearly zero.

So that's it. If you don't see many updates to this blog in the upcoming times, it doesn't mean we gave up, it just means we're spending our time exploring new ideas or writing papers about them  ;)

Friday, February 6, 2015

Presentation on Scalable Concurrent Data Structures

Here's a nice presentation by Irina Calciu on data structures using delegation (sorry no video, just the pdf):
http://researcher.ibm.com/researcher/files/us-rabbah/plday14-calciu.pdf

The presentation has some nice schematics on how to use these techniques with a skip-list-like implementation of a priority queue.
More details at her home page:
http://cs.brown.edu/~irina/

Sunday, February 1, 2015

Presentation on RCU

Here's a nice presentation on RCU by Paul E. McKenney:
http://www.efficios.com/pub/linuxcon-europe-2013/mdesnoyers-userspace-rcu-tutorial-linuxcon-2013.pdf

The cool thing about it is that it shows some benchmarks where RCU is applied to different data structures.
I find RCU interesting because its usage is semantically very similar to the Left-Right pattern, and they do very similar things, with the disadvantage that Left-Right is blocking for mutations while RCU is not. The problem with RCU is that it's pretty much a Linux-only thing, and although there are user-space implementations of RCU, they're not easily (if at all) portable to other Unix flavors, or Windows, or OSX, or even to the JVM running inside Linux (unless you use JNI... maybe).
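To give an idea of what that usage looks like, here is a minimal sketch using the userspace RCU library (liburcu). This is illustrative only: it glosses over details such as registering each thread with rcu_register_thread() and serializing the writers, and the struct and variable names are made up for the example:

#include <urcu.h>      // userspace RCU (liburcu); Linux-centric, as noted above
#include <stdlib.h>

struct foo { int value; };
static struct foo *shared;            // pointer to the current version (illustrative)

int reader(void)
{
    rcu_read_lock();                  // very cheap: delimits the read-side critical section
    struct foo *p = rcu_dereference(shared);
    int v = (p != NULL) ? p->value : 0;
    rcu_read_unlock();
    return v;
}

void writer(struct foo *new_version)  // writers assumed serialized externally
{
    struct foo *old = shared;
    rcu_assign_pointer(shared, new_version);  // publish the new version
    synchronize_rcu();                // wait for readers still using 'old'
    free(old);
}

The reader/writer split above is also, roughly, how a Left-Right protected data structure is used, which is why the two keep getting compared.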

The area of applicability for RCU is also similar to that of Left-Right.


Obviously, if you can use RCU (i.e. your app is a native Linux build), then by all means go ahead and use it instead of the Left-Right pattern. But if you need portable code and your app is not just for Linux, then RCU won't be an option, and you should take a look at Left-Right to see if it suits your needs.