Saturday, July 30, 2016

Cliff Click on VM Design Choices

Whenever I see a presentation online by Cliff Click I just jump at it. There is just so much content in a single presentation, maybe it's because he speaks really fast, but maybe, just maybe, it's because this guy is really good.  :-O

If you ever wondered how to make your own JVM, how HotSpot works, how a SafePoint is implemented, some little known truths about Garbage Collection in the JVM (did you know there is a lock on the thread's stack that the GC/thread acquire?), even how to make a scalable lock, and more, much much more, then watch this video!
I'm pretty sure you'll learn something, at least some detail about the inner workings of a JVM... unless, of course, you are Cliff Click yourself, or you worked with him on this stuff, or you watched one of these presentations live, in which case you already know everything that's being said. lol

The part about GCs was amazing, and so was the one about inlining.

One thing I really like is that he's one of the very few people who make a distinction between performance for throughput and performance for latency (Andreia and I just call it throughput and latency).

I had to watch it twice and a third time will be needed to get all the details. And if you missed some details, try watching the rerun of the same presentation here:

Really cool stuff

Friday, July 22, 2016

The Dangers of Optimistic Concurrency

We talked about optimistic concurrency previously, more specifically, in the context of StampedLock tryOptimisticRead().

In this post we're going to give a few more examples of what can go wrong when using tryOptimisticRead() and optimistic concurrency in general.
Don't get me wrong, I think that the fact that there is an RW-Lock in Java that provides this kind of API is a great addition, it's just that most developers are not aware of the dangers of using this API.
I personally find optimistic concurrency extremely useful when used in a highly controlled environment, i.e. inside the code of a data structure that you designed and implemented and that nobody else is going to touch. This is unlikely if you work for a large organization (as I do),
where your code is public and other people will change it over time, and that's problematic for optimistic concurrency because it is very hard to reason about, which means it is very easy to make a small change that causes a nasty bug.

One of the arguments that the users of optimistic concurrency give (at least in Java) is that you can catch all kinds of exceptions and retry the operation in case an exception is thrown. This has its problems because the code that is being called inside tryOptimisticRead()/validate() may actually throw exceptions as part of its normal behavior, and then you don't know how to distinguish between an "expected" exception and an "unexpected" exception caused by some weird re-ordering of code:

   double example1() { // A read-only (optimistic) method
     long stamp = sl.tryOptimisticRead();
     double result = 0;
     try {
         result = mayThrowAnException(); // throwing is a "normal" behavior
     } catch (Exception e) {
         // Ooops an exception was thrown... was it a "normal" throw, or was
         // it caused by an inconsistent (torn) read? Should I recompute result or not?
     }
     if (!sl.validate(stamp)) {
        stamp = sl.readLock();
        try {
           result = mayThrowAnException(); // now a throw really is "normal"
        } finally {
           sl.unlockRead(stamp);
        }
     }
     return result;
   }

But let's say it's not a completely generic piece of code and you have some control over it. Then, there is still the issue that some re-ordering will cause the code to enter an infinite loop. Now what?
Some re-orderings of variables can cause an infinite loop, like in the following example, where x starts at 0 and y starts at 1, and both are incremented by another thread inside the write-lock. The reads (or writes) of x and y can be re-ordered, so the variable delta can end up being zero and the for() loop goes into an infinite cycle.
   void incrementXY() { // an exclusively locked method
     long stamp = sl.writeLock();
     try {
       sharedX++;  // Starts at zero. Can be re-ordered with increment below
       sharedY++;  // Starts at one. Can be re-ordered with increment above
     } finally {
       sl.unlockWrite(stamp);
     }
   }

   void example2() { // A read-only (optimistic) method
     long stamp = sl.tryOptimisticRead();
     long x = sharedX;                    // Starts at 0
     long y = sharedY;                    // Starts at 1
     long delta = y - x;                  // Ooops, this can be zero, or negative, or whatever
     for (long i = 0; i < 1000; i += delta) {
         doSomeReading();                 // To infinity and beyond!
     }
     if (!sl.validate(stamp)) {
        stamp = sl.readLock();
        try {
           doSomeReading();               // redo the work, this time consistently
        } finally {
           sl.unlockRead(stamp);
        }
     }
   }

Notice that using a regular read-lock would mean that delta would always be 1 and there would be no infinite loop.
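To make that concrete, here is a minimal self-contained sketch of the pessimistic version (the class and method names are mine, not from any particular library): because the reads happen under readLock(), they can never interleave with a half-finished incrementXY(), so delta is always 1.

```java
import java.util.concurrent.locks.StampedLock;

// Pessimistic version of the example above: readers take the read-lock,
// so they always see sharedX and sharedY in a consistent state.
class XYCounters {
    private final StampedLock sl = new StampedLock();
    private long sharedX = 0;  // starts at zero
    private long sharedY = 1;  // starts at one

    void incrementXY() {       // an exclusively locked method
        long stamp = sl.writeLock();
        try {
            sharedX++;
            sharedY++;
        } finally {
            sl.unlockWrite(stamp);
        }
    }

    long readDelta() {         // a pessimistic read-only method
        long stamp = sl.readLock();
        try {
            return sharedY - sharedX;  // always 1: no torn reads possible
        } finally {
            sl.unlockRead(stamp);
        }
    }
}
```

No matter how many times incrementXY() runs concurrently, readDelta() can only ever return 1, so a loop stepping by delta always terminates.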

Well, to fix this, you can wait for a certain timeout. This raises the question of how long to wait for, and whether it depends on the machine you're running on. You went through all this trouble because someone told you that tryOptimisticRead()/validate() was faster than using readLock()/unlockRead(), but your code ends up having to wait for some randomly chosen (hard-coded) timer to expire... and you think that is going to be faster? .... yeah, right.
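A common alternative to a wall-clock timeout is to bound the number of optimistic attempts and then fall back to the read-lock. Here's a minimal sketch (class and field names are mine, and MAX_TRIES = 3 is an arbitrary choice):

```java
import java.util.concurrent.locks.StampedLock;

// Sketch: bound the number of optimistic attempts instead of using a
// wall-clock timeout; after MAX_TRIES failed validations, fall back to
// the pessimistic read-lock, which is guaranteed to be consistent.
class BoundedOptimisticRead {
    private static final int MAX_TRIES = 3;  // arbitrary choice
    private final StampedLock sl = new StampedLock();
    private long sharedX = 0;
    private long sharedY = 1;

    long readDelta() {
        for (int i = 0; i < MAX_TRIES; i++) {
            long stamp = sl.tryOptimisticRead();
            long x = sharedX;
            long y = sharedY;
            if (sl.validate(stamp)) return y - x;  // consistent snapshot
        }
        long stamp = sl.readLock();  // fallback: consistent by construction
        try {
            return sharedY - sharedX;
        } finally {
            sl.unlockRead(stamp);
        }
    }
}
```

Note the caveat: bounding the retries avoids livelock across attempts, but it does nothing for an infinite loop *inside* a single optimistic attempt (as in the example above, where the loop bound itself comes from torn reads). The only robust fix for that is to not loop on optimistically-read values at all.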

Ok, but let's say you're willing to take the hit and add to your code the boilerplate to handle all the exceptions, and to time out if the read-only critical section goes into an infinite loop (both a major pain, and it would make my eyes bleed to write such code, so I'm not going to). After all this hassle, you've got safe optimistic read-only code... right?
... nope, not really.

Suppose in the example above the doSomeReading() method is actually allocating memory.
And suppose that there is no re-ordering but the thread calling example3() goes to sleep after reading sharedX, and in the meantime, sharedY is incremented multiple times... like one million times:
   double example3() { // A read-only (optimistic) method
     long stamp = sl.tryOptimisticRead();
     double result = 0;
     long x = sharedX;                    // Reads this to be 1
     // Thread sleeps here while sharedY is incremented a million times
     long y = sharedY;                    // Reads this to be 1 million
     long delta = y - x;                  // oops, this can be a really large number
     Widget[] array = new Widget[(int)delta];  // Suppose there is enough memory for this
     for (int i = 0; i < delta; i++) {
         array[i] = new Widget();         // There might not be enough memory for 1M Widget instances... out of memory?
     }
     if (!sl.validate(stamp)) {
        stamp = sl.readLock();
        try {
           result = someReadOnlyFunctionThatThrowsAnException();
        } finally {
           sl.unlockRead(stamp);
        }
     }
     return result;
   }

By allocating a large number of Widget instances, you've just consumed all the memory in the JVM (or the system), and although you're catching exceptions and aborting, and the GC will clean up this mess, it might be too late: you may now have failures in other threads of this application (or even in other applications running on the same JVM) because there is not enough memory to allocate anything...
your whole app goes down in a blaze of glory, maybe bringing down with it all the other apps in the JVM, and in an extreme situation even the OS (unlikely in the case of Java/JVM, but not so unlikely in other languages that have a GC but run natively, like the D language).

As far as I know, there is no way to protect efficiently against this issue, but if you know a way, please use the comments below this post  ;)

In case you think this example is contrived (it is), there are examples of this kind of thing happening in production systems where optimistic concurrency is being used, like in this presentation at CppCon last year by Brent Hall.

In this particular case, it was a transactional memory with optimistic reads that had an issue very similar to the example 3 above.

And the biggest reason of all not to use optimistic concurrency is: none of the above!
It's reasoning.

Concurrency is hard, everybody knows that. The whole point of doing things like immutability and (pessimistic) transactional memory and having linearizable data structures is to have something that is easy(er) to reason about.
Optimistic concurrency requires us to reason in terms of Relaxed Consistency (like relaxed atomics in the C++ Memory Model), and our pitiful linear brains are not made for that, and anyone saying otherwise is either a genius (I mean, like above Einstein level) or has no idea what they're talking about.
Most software developers work in groups (companies, startups, universities, research labs, whatever) and even if you can reason about such code, it is unlikely that everyone else in your team will be able to understand it and change it in the future.
You have to ask yourself these questions:
Even if the code does what you think it does, how are you going to convince the rest of the team of the correctness of your code?
Is everyone in your team (i.e. the weakest team member) able to understand this code well enough that they can safely change it?

Unless the code is extremely simple and your team is made of a group of outstanding individuals, this is unlikely.
But hey, it's your code, so do what you think is best  ;)

Monday, July 18, 2016

Angelika Langer at GOTO 2014

If you want a soft intro to some of the concurrency features introduced in Java 8, here is a video by Angelika Langer at GOTO 2014

I don't know much about the Future's stuff, but the part about StampedLock and Contended is well explained, with intuitive examples as well.
In fact, she makes it look so easy to use StampedLock, and particularly the optimistic API of tryOptimisticRead()/validate(), that someone who is new to it may think it is easy... it's not.

IMO using optimistic concurrency is like juggling knives, and I've ranted about it before and will rant about it again in my next post, but I am happy that people are talking about it and explaining it in such detail... too bad they didn't give her enough time to explain all the issues that can occur with tryOptimisticRead()/validate(), but at these conferences time is always short, so you never get to present everything you want.

Monday, July 11, 2016

John Cairns Presents - Smashing Atomics

If you're into Java and queues, here is a light intro talk by John Cairns. There's nothing new about it that you haven't already seen on other talks posted here (except for his Disruptor variant), but it's always nice to see how a SeqLock can work, and if you're a newbie to the area, this is not a bad place to start:

Btw, there are a few slides that mention RCU, but just replace that with Copy-On-Write because that's what is being talked about, not the RCU synchronization technique.
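Since the talk spends time on how a SeqLock works, here is a minimal single-writer SeqLock sketch in Java (the class and field names are mine, not from the talk): the writer makes the version counter odd while modifying and even when done, and a reader retries until it sees the same even version before and after its reads.

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicLong;

// Minimal single-writer SeqLock sketch. The version is odd while a write
// is in progress and even otherwise; a reader's snapshot is valid only if
// it observes the same even version before and after reading the data.
class SeqLockPoint {
    private final AtomicLong version = new AtomicLong(0);
    private long x, y;  // protected data

    void write(long newX, long newY) {   // single writer assumed
        version.incrementAndGet();       // odd: write in progress
        x = newX;
        y = newY;
        version.incrementAndGet();       // even: write done
    }

    long[] read() {
        while (true) {
            long v1 = version.get();         // acquire read of the version
            if ((v1 & 1) != 0) continue;     // write in progress, retry
            long rx = x, ry = y;
            VarHandle.loadLoadFence();       // keep data loads before the re-check
            if (version.get() == v1) return new long[]{rx, ry};
            // version changed: a write raced with us, retry
        }
    }
}
```

The loadLoadFence() is the subtle part: without it the JMM allows the plain loads of x and y to be re-ordered past the second version read, silently breaking the validation. A multi-writer version would also need a write-lock (or CAS on the version) around write().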

Also, the slide on "Lock Freedom" has a few incorrect statements so just ignore it.
Namely, the third statement that says "All threads will make progress in a finite time" is incorrect. Lock-Free methods do not give such a guarantee. To have all threads make progress in a finite number of steps we need to have Wait-Free, and there is no progress condition that gives guarantees about "time", they're all defined in terms of number of steps.
The fourth statement in the "Lock Freedom" slide is also incorrect. There are many lock-free data structures with some methods that are not linearizable. The most well known is Michael-Scott Queue (implemented in java.util.concurrent.ConcurrentLinkedQueue) which is not linearizable for iterations. Linearizability is a "Consistency Model" and Lock-Freedom is a "Progress Condition", and although there is some faint relationship between the two concepts, they are mostly orthogonal.
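You can see the weakly-consistent (non-linearizable) iteration for yourself with a small sketch (class name is mine):

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedQueue;

// ConcurrentLinkedQueue's iterator is weakly consistent: it never throws
// ConcurrentModificationException, and it may or may not reflect
// modifications made after the iterator was created, so iteration is not
// a linearizable "snapshot" of the queue.
public class WeaklyConsistentIteration {
    public static void main(String[] args) {
        ConcurrentLinkedQueue<Integer> q = new ConcurrentLinkedQueue<>();
        q.add(1);
        q.add(2);
        Iterator<Integer> it = q.iterator(); // iteration "starts" here...
        q.add(3);                            // ...but the queue keeps changing
        int seen = 0;
        while (it.hasNext()) { it.next(); seen++; }
        // Elements 1 and 2 are guaranteed to be seen; 3 may or may not be,
        // and no exception is thrown either way.
        System.out.println("iterator saw " + seen + " elements");
    }
}
```

In a multi-threaded run the iterator can observe a mix of queue states that never existed at any single point in time, which is exactly why the method is not linearizable even though the queue's add/poll operations are.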

Tuesday, July 5, 2016

Martin Kleppmann on “Sequential Consistency versus Linearizability”

Did you ever wonder what Sequential Consistency and Linearizability are, or what the differences between the two are?
If the answer is yes, then you should watch this presentation by Martin Kleppmann:

This proof is interesting, and you have to take into account that this is for a single register. If you're using a reader-writer lock (or the Left-Right technique) and you're only doing reads, then the overhead is that of reading/writing to a couple of variables, which amortized over the entire critical section can become a very low overhead, and therefore make linearizability profitable.
Now you could say that the same argument applies to seq-cst, but the problem with that argument is that there are no seq-cst locks or rw-locks, for the simple reason that seq-cst is not (generically) composable. Therefore, I would argue that, depending on what you're doing, you can get more performance from a linearizable object than from a seq-cst object, simply because you cannot compose regular reads and writes with seq-cst reads and writes and still expect the result to be seq-cst, but you can for a linearizable object like a mutex, a rw-lock, or a Left-Right.