Friday, July 22, 2016

The Dangers of Optimistic Concurrency

We talked about optimistic concurrency previously, more specifically in the context of StampedLock's tryOptimisticRead().

In this post we're going to give a few more examples of what can go wrong when using tryOptimisticRead() and optimistic concurrency in general.
Don't get me wrong, I think that having an RW-lock in Java that provides this kind of API is a great addition; it's just that most developers are not aware of the dangers of using it.
I personally find optimistic concurrency extremely useful when used in a highly controlled environment, i.e. inside the code of a data structure that you designed and implemented and that nobody else is going to touch. This is unlikely if you work for a large organization (like myself), where your code is public and other people will change it over time. That's problematic for optimistic concurrency, because it is very hard to reason about, which means it is very easy to make a small change that causes a nasty bug.



One of the arguments that users of optimistic concurrency give (at least in Java) is that you can catch all kinds of exceptions and retry the operation in case one is thrown. The problem is that the code called between tryOptimisticRead() and validate() may throw exceptions as part of its normal behavior, and then you have no way to distinguish an "expected" exception from an "unexpected" one caused by some weird re-ordering of code:

   Result example1() { // A read-only (optimistic) method
     long stamp = sl.tryOptimisticRead();
     Result result = null;
     try {
         result = mayThrowAnException(); // Throwing is "normal" behavior here
     } catch (Exception e) {
         // Ooops, an exception was thrown... should I recompute result or not?
     }
     if (!sl.validate(stamp)) {
        stamp = sl.readLock();
        try {
           result = mayThrowAnException();
        } finally {
           sl.unlockRead(stamp);
        }
     }
     return result;
   }



But let's say it's not a completely generic piece of code and you have some control over it. Then there is still the issue that some re-ordering will cause the code to enter an infinite loop. Now what?
Certain re-orderings of variable reads can cause an infinite loop, like in the following example, where x starts at 0 and y starts at 1, and another thread increments both inside the write-lock. The reads (or writes) of x and y can be re-ordered, so delta can end up being zero and the for() loop goes into an infinite cycle.
   void incrementXY() { // an exclusively locked method
     long stamp = sl.writeLock();
     try {
       sharedX++;  // Starts at zero. Can be re-ordered with increment below
       sharedY++;  // Starts at one. Can be re-ordered with increment above
     } finally {
       sl.unlockWrite(stamp);
     }
   }
  
   void example2() { // A read-only (optimistic) method
     long stamp = sl.tryOptimisticRead();
     long x = sharedX;                    // Starts at 0
     long y = sharedY;                    // Starts at 1
     long delta = y - x;                  // Ooops, this can be zero, or negative, or whatever
     for (long i = 0; i < 1000; i += delta) {
         doSomeReading();                 // To infinity and beyond!
     }
     if (!sl.validate(stamp)) {
        stamp = sl.readLock();
        try {
          doSomeReading();
        } finally {
           sl.unlockRead(stamp);
        }
     }
   }

Notice that using a regular read-lock would mean that delta would always be 1 and there would be no infinite loop.
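
For comparison, here is the same reader done pessimistically (a sketch; the method name is mine, everything else is as above). The read-lock excludes incrementXY(), so x and y are always read consistently and delta is always 1:
   void example2Pessimistic() { // Same reader, but with the (pessimistic) read-lock
     long stamp = sl.readLock();
     try {
       long x = sharedX;                  // Always consistent with sharedY,
       long y = sharedY;                  // because incrementXY() holds the write-lock
       long delta = y - x;                // Always 1
       for (long i = 0; i < 1000; i += delta) {
           doSomeReading();               // Terminates
       }
     } finally {
       sl.unlockRead(stamp);
     }
   }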

Well, to fix this, you can wait for a certain timeout. But that raises the question of how long to wait, and whether it depends on the machine you're running on. You went through all this trouble because someone told you that tryOptimisticRead()/validate() was faster than readLock()/unlockRead(), and now your code has to wait for some randomly chosen (hard-coded) timer to expire... and you think that is going to be faster? ... yeah, right.
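
For what it's worth, the timeout version ends up looking something like this sketch (the method name and the 10 ms deadline are made up, which is precisely the problem):
   void example2WithTimeout() { // Optimistic read with an arbitrary deadline
     long stamp = sl.tryOptimisticRead();
     long x = sharedX;
     long y = sharedY;
     long delta = y - x;                              // May still be zero or negative
     long deadline = System.nanoTime() + 10_000_000L; // 10 ms... why 10? Exactly.
     for (long i = 0; i < 1000; i += delta) {
         if (System.nanoTime() > deadline) break;     // Bail out of a possible infinite loop
         doSomeReading();
     }
     if (!sl.validate(stamp)) {
        stamp = sl.readLock();
        try {
          doSomeReading();
        } finally {
           sl.unlockRead(stamp);
        }
     }
   }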


Ok, but let's say you're willing to take the hit and add to your code the boilerplate to handle all the exceptions, and to time out if the read-only critical section goes into an infinite loop (both are a major pain, and it would make my eyes bleed to write such code, so I'm not going to). After all these hurdles, you've got safe optimistic read-only code... right?
... nope, not really.

Suppose that in the example above the doSomeReading() method actually allocates memory.
And suppose that there is no re-ordering, but the thread calling example3() goes to sleep after reading sharedX, and in the meantime sharedY is incremented multiple times... like one million times:
   double example3() { // A read-only (optimistic) method
     long stamp = sl.tryOptimisticRead();
     double result = 0;
     long x = sharedX;                    // Reads this to be 1
     // Thread sleeps here
     long y = sharedY;                    // Reads this to be 1 million
     long delta = y - x;                  // Ooops, this can be a really large number
     Widget[] array = new Widget[(int) delta]; // Suppose there is enough memory for this
     for (int i = 0; i < delta; i++) {
         array[i] = new Widget();         // There might not be enough memory for 1M Widget instances... out of memory?
     }
     if (!sl.validate(stamp)) {
        stamp = sl.readLock();
        try {
          result = someReadOnlyFunctionThatThrowsAnException();
        } finally {
           sl.unlockRead(stamp);
        }
     }
     return result;
   }


By allocating a large number of Widget instances, you've just consumed all the memory in the JVM (or the system). And although you're catching exceptions and aborting, and the GC will eventually clean up this mess, it might be too late: other threads of this application (or even other applications running on the same JVM) may now fail because there is not enough memory to allocate anything.
Your whole app goes down in a blaze of glory, maybe bringing down with it all the other apps in the JVM, and in an extreme situation the OS as well (unlikely in the case of Java/JVM, but not so unlikely in other languages that have a GC but run natively, like the D language).
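
A partial band-aid is to call validate() again before the dangerous allocation, as in the sketch below (the method name is mine). It happens to work for this example, at the cost of an extra fence, but in general you would have to re-validate before every side effect, which is not always possible:
   void example3Checked() { // Re-validate before acting on optimistically read data
     long stamp = sl.tryOptimisticRead();
     long x = sharedX;
     long y = sharedY;
     long delta = y - x;
     if (!sl.validate(stamp)) {
        // The reads were inconsistent: redo them under the read-lock
        stamp = sl.readLock();
        try {
          x = sharedX;
          y = sharedY;
          delta = y - x;
        } finally {
           sl.unlockRead(stamp);
        }
     }
     // delta is now consistent (always 1 here), so the allocation is bounded
     Widget[] array = new Widget[(int) delta];
     for (int i = 0; i < delta; i++) {
         array[i] = new Widget();
     }
   }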

As far as I know, there is no way to protect efficiently against this issue, but if you know a way, please use the comments below this post  ;)



In case you think this example is contrived (it is), there are examples of this kind of thing happening in production systems that use optimistic concurrency, like in this presentation at CppCon last year by Brent Hall.

In this particular case, it was a transactional memory with optimistic reads that had an issue very similar to example 3 above.



And the biggest reason of all not to use optimistic concurrency is: none of the above!
It's reasoning.

Concurrency is hard, everybody knows that. The whole point of things like immutability, (pessimistic) transactional memory, and linearizable data structures is to have something that is easi(er) to reason about.
Optimistic concurrency requires us to reason in terms of relaxed consistency (like relaxed atomics in the C++ memory model), and our pitiful linear brains are not made for that; anyone saying otherwise is either a genius (I mean, above-Einstein level) or has no idea what they're talking about.
Most software developers work in groups (companies, startups, universities, research labs, whatever), and even if you can reason about such code, it is unlikely that everyone else in your team will be able to understand it and change it in the future.
You have to ask yourself these questions:
Even if the code does what you think it does, how are you going to convince the rest of the team of the correctness of your code?
Is everyone in your team (i.e. the weakest team member) able to understand this code well enough that they can safely change it?

Unless the code is extremely simple and your team is made of a group of outstanding individuals, this is unlikely.
But hey, it's your code, so do what you think is best  ;)

2 comments:

  1. Why would you do other things in the optimistic read phase other than reading what you need and then work with it outside the lock? Plus, the Javadoc says you should call validate() more often: "Fields read while in optimistic mode may be wildly inconsistent, so usage applies only when you are familiar enough with data representations to check consistency and/or repeatedly invoke method validate(). For example, such steps are typically required when first reading an object or array reference, and then accessing one of its fields, elements or methods."
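
     The pattern the javadoc describes looks roughly like the sketch below (assuming a hypothetical Point sharedRef field guarded by sl): validate after reading the reference, and again after reading through it.
     double readThroughReference() {
       long stamp = sl.tryOptimisticRead();
       Point p = sharedRef;                 // First read the reference...
       if (p == null || !sl.validate(stamp)) {
          // ...and validate it before daring to dereference it
          stamp = sl.readLock();
          try {
            p = sharedRef;
            return (p == null) ? 0.0 : p.x;
          } finally {
             sl.unlockRead(stamp);
          }
       }
       double x = p.x;                      // Now read through the reference...
       if (!sl.validate(stamp)) {           // ...and validate again before trusting x
          stamp = sl.readLock();
          try {
            p = sharedRef;
            x = (p == null) ? 0.0 : p.x;
          } finally {
             sl.unlockRead(stamp);
          }
       }
       return x;
     }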

    Replies
    1. Hi David
      I'm assuming you're referring to doing dynamic memory allocation inside the read-lock. I agree that it may not seem like a common case, but go and look at a few real-life applications and I'm pretty sure you'll see examples of people doing dynamic memory allocation (stack or heap) inside the read-lock of a rw-lock.
      Perhaps the javadoc should mention that with tryOptimisticRead()/validate() we should not do any memory allocation (a limitation of sorts)... but then, who would use it if it had such a warning? lol

      Regarding filling up the code inside the read-lock with validate() checks, a validate() consists of one loadFence() and two volatile loads (for stamp and state):
      public boolean validate(long stamp) {
          U.loadFence();
          return (stamp & SBITS) == (state & SBITS);
      }
      On x86 the loadFence() and the volatile loads come for free, but on ARM or PowerPC they will have a synchronization cost, which means that if we put too many of those, we'll end up making the code slower than using the pessimistic read-lock :(
      Moreover, adding several validate() calls means adding the corresponding checks, which can be tricky to code in terms of control flow: there are no gotos in Java to jump back to the beginning and retry the optimistic path, or to jump to the end and take the pessimistic one instead.
      Or you may not even have control over the code being called: for example, if you call TreeMap.containsKey(x) inside your read-lock, you cannot add validate() calls inside the containsKey() method, and believe me, it needs them to be correct.
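      To make that concrete, here is a sketch (treeMap is a hypothetical field guarded by sl) of the bind you're in: the optimistic attempt traverses a possibly-torn tree, and there is no hook to call validate() in the middle of it:
      boolean optimisticContains(int key) {
          long stamp = sl.tryOptimisticRead();
          boolean found = false;
          try {
              found = treeMap.containsKey(key); // May see a torn tree: NPE, or loop forever
          } catch (RuntimeException e) {
              // Expected, or caused by a torn read? We can't tell
          }
          if (!sl.validate(stamp)) {
             stamp = sl.readLock();
             try {
               found = treeMap.containsKey(key);
             } finally {
                sl.unlockRead(stamp);
             }
          }
          return found;
      }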

      The point I was trying to make is that tryOptimisticRead()/validate() will work 99.99% of the time... the problem is about the 0.01% and what will happen to your application then.
