Sunday, October 25, 2015

Implementing a Reader-Writer Lock using RCU

As surprising as this may seem, it is actually possible to implement a Reader-Writer Lock using the RCU API and the C-RW-WP algorithm (or any other algorithm you prefer). This is not to be confused with using RCU as a replacement for a Reader-Writer Lock, like what is mentioned here:

What we're going to cover in this post is an actual RW-Lock implementation which happens to use the RCU API internally.
Now you may ask:
Why is this interesting?
... or maybe if you're not so nice:
Who in their right mind would want to do that? That's crazy talk!

Well, as it so happens, if the underlying RCU implementation is the kernel one, or the one that uses signals, you can take advantage of this mechanism to build a rw-lock with some interesting performance, and we're always interested in rw-locks with good performance, right?

So how would we implement one in C++1x?

#include <atomic>
#include <mutex>
#include <thread>
#include <urcu.h>   // e.g. userspace RCU (liburcu); with liburcu each thread
                    // must also call rcu_register_thread() before first use

std::atomic<bool> writerWaiting { false };
std::mutex        writersMutex;

// Reader: back off while a Writer has announced itself, then enter an
// RCU read-side critical section and re-check the flag.
void sharedLock() {
    while (true) {
        while (writerWaiting.load()) std::this_thread::yield();
        rcu_read_lock();
        if (!writerWaiting.load(std::memory_order_acquire)) return;
        rcu_read_unlock();
    }
}

void sharedUnlock() {
    rcu_read_unlock();
}

// Writer: announce intent, then wait for all in-flight Readers to leave
// their read-side critical sections.
void exclusiveLock() {
    writersMutex.lock();
    writerWaiting.store(true); // do we need full membar?
    synchronize_rcu();
}

void exclusiveUnlock() {
    writerWaiting.store(false, std::memory_order_release);
    writersMutex.unlock();
}
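
As a quick usage sketch (counter, readCounter() and incrementCounter() are just made-up names for illustration), protecting a plain variable with these four functions would look like this:

int counter = 0;   // data protected by the rw-lock

int readCounter() {
    sharedLock();
    int v = counter;   // multiple Readers may execute here concurrently
    sharedUnlock();
    return v;
}

void incrementCounter() {
    exclusiveLock();
    counter++;         // the Writer runs alone here
    exclusiveUnlock();
}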


All I need now is some time to run some Linux benchmarks to check how good this is...


PS: Fixed some incorrect optimizations in the code thanks to David Goldblatt's comments. Thanks David!

4 comments:

  1. Minor Typo: you need at least std::memory_order_release in exclusiveUnlock.

    Also worth noting is that this can deadlock if multiple of these locks are used; only one rw-lock is possible per RCU domain.

    Replies
    1. Hi David,

      Not a typo. We don't need memory_order_release on exclusiveUnlock() because writersMutex.unlock() has release semantics and will guarantee visibility of the change to writerWaiting.
      Theoretically, there is no "synchronizes-with" between writersMutex.unlock() and the writerWaiting.load() in sharedLock(), but in practice there must be some fence in the unlock() method, and even if there weren't, at most some Readers would wait "too long" in sharedLock(); they would eventually see writerWaiting change to false.


      I don't understand the second comment. Can you go into more detail?
      Is it because synchronize_rcu() cannot be called from multiple threads in kernel space?
      Notice that the call to synchronize_rcu() is protected by the std::mutex, so I don't see any possibility of deadlock in userspace.


      PS: To optimize even further, in sharedLock(), the second load on writerWaiting can be std::memory_order_relaxed. This is possible because rcu_read_lock() has "acquire semantics".

      PPS: ... and in exclusiveLock() we can replace the seq-cst store with a relaxed store:
      writerWaiting.store(true, std::memory_order_relaxed)
      ;)

    2. Hi Pedro,

      Regarding memory order:
      The issue is that you need something like a release fence *before* the writerWaiting store in exclusiveUnlock. An example reordering that could cause a problem is if the writerWaiting store moves up inside the critical section (say, to a point where half of the protected data has been updated but not the other half): then a call to sharedLock() might finish while a Writer is still making modifications to the protected data. (I believe a similar analysis will show the optimization you mention in your PS to be unsafe; the acquire fence has to be *after* the relevant load, not before it. I'm not sure enough about the ordering semantics of (U)RCU to speak to the PPS.)
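
      To make that concrete, here is a sketch of the effective execution if the store gets hoisted (x and y are hypothetical fields guarded by the lock):

      exclusiveLock();             // writerWaiting = true; synchronize_rcu()
      data.x = 1;
      writerWaiting.store(false);  // hoisted up from exclusiveUnlock()
      data.y = 2;                  // a Reader can now pass sharedLock()...
      writersMutex.unlock();       // ...and observe the new x with the old y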

      Regarding deadlock:
      Consider:
      void t1() {
          rwlock1.sharedLock();
      A:
          rwlock2.exclusiveLock();
      }

      void t2() {
          rwlock3.sharedLock();
      B:
          rwlock4.exclusiveLock();
      }

      Suppose thread t1 runs to label A and then thread t2 runs to label B. At that point, neither thread can successfully enter its exclusive lock critical section: each is stuck in a synchronize_rcu() call that will never return, because it is blocked by the other thread's presence in an RCU read-side critical section. (In fact, either thread by itself will deadlock; but this example works even if synchronize_rcu is modified to not require a quiescent state from the calling thread.)

    3. Dope... you're absolutely right about needing memory_order_release in exclusiveUnlock(). Without the release, the store could be re-ordered into the critical section.
      Good catch!

      Regarding the PS, yes you're right, same logic. It needs a memory_order_acquire.


      The deadlock, yes, I get it: you're talking about the kernel implementation.
      If it's a userspace implementation of RCU, like the one we have inside the Classic Left-Right technique (or your own implementation?), then the deadlock issue doesn't occur. And I think SRCU doesn't suffer from this issue either, because we can have more than one SRCU domain... I'll have to dig into it a bit more to be certain.
      Even for the kernel implementation, I think synchronize_rcu() has a mechanism to detect whether rcu_read_lock() has been called from the same thread (I'm not certain this is true, and certainly not for all implementations), and anyway, it would still give inconsistent results if more than one rw-lock instance is used.
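
      For reference, a rough sketch of what the SRCU variant could look like, where each rw-lock instance owns its own domain (an srcu_struct), so one lock's Readers don't block another lock's grace period. This just shows the shape of it, mixing the post's C++ code with the kernel API:

      struct srcu_struct ss;   // one SRCU domain per rw-lock instance

      int sharedLock() {       // returns an index to pass to srcu_read_unlock()
          while (true) {
              while (writerWaiting.load()) std::this_thread::yield();
              int idx = srcu_read_lock(&ss);
              if (!writerWaiting.load(std::memory_order_acquire)) return idx;
              srcu_read_unlock(&ss, idx);
          }
      }

      void exclusiveLock() {
          writersMutex.lock();
          writerWaiting.store(true);
          synchronize_srcu(&ss);   // waits only for Readers of *this* domain
      }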

      Great code review, thanks David!
