Friday, September 29, 2023

50 years later, is Two-Phase Locking the best we can do?

Two-phase locking (2PL) was the first general-purpose concurrency control to be invented that provides Serializability. In fact, 2PL gives more than Serializability: it gives Opacity, a much stronger property.
2PL was published in 1976, which incidentally is the year I was born, and it is likely that Jim Gray and his buddies had this idea long before it was published, which means 2PL first came into existence nearly 50 years ago.

After all that time has passed, is this the best we can do?

Turns out no, we can do better with 2PLSF, but let's start from the beginning.


When I use the term "general-purpose concurrency control" I mean an algorithm which allows accessing multiple objects (or records, or tuples, whatever you want to call them) with all-or-nothing semantics. In other words, an algorithm that lets you do transactions over multiple data items.
Two-Phase Locking has several advantages over the other concurrency controls that have since been invented, but in my view there are two important ones: simplicity and a strong isolation level.

In 2PL, before accessing a record for read or write access, we must first take the lock that protects this record.
During the transaction, we keep acquiring locks for each access, and only at the end of the transaction, when we know that no further accesses will be made, do we release all the locks. Having an instant in time (i.e. the end of the transaction) where all the locks are held on the data that we accessed means that there is a linearization point for our operation (transaction): we see the different records in a consistent state and can write to other records in an equally consistent way. It doesn't get much simpler than this.
Today, this idea may sound embarrassingly obvious, but 50 years ago many database researchers thought that it was ok to release the locks after completing the access to the record. And yes, it is possible to do so, but such a concurrency control is not serializable.
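To make the two phases concrete, here is a minimal sketch in C++ of a transaction under strict 2PL (the Record type and transferTxn() are just my illustration, not code from any real engine):

    #include <mutex>
    #include <vector>

    // Hypothetical record protected by a mutual exclusion lock.
    struct Record {
        std::mutex lock;
        long value{0};
    };

    // Strict 2PL: take each lock before the access (growing phase) and
    // release all of them only when the transaction is done (shrinking
    // phase), so there is an instant where every accessed record is locked.
    void transferTxn(Record& src, Record& dst, long amount) {
        std::vector<std::mutex*> held;
        src.lock.lock(); held.push_back(&src.lock);   // lock before accessing
        dst.lock.lock(); held.push_back(&dst.lock);   // lock before accessing
        src.value -= amount;
        dst.value += amount;
        for (auto* l : held) l->unlock();             // release only at the end
    }
    // Note: two threads calling transferTxn(a,b) and transferTxn(b,a) can
    // deadlock; dealing with such conflicts is the topic further below.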

As for strong isolation, database researchers continue to invent concurrency controls that are not serializable, and to write papers about them, which would suggest that Serializability is not that important for databases. On the other hand, all transactional commercial databases that I know of use 2PL or some combination of it with T/O or MVCC.

Moreover, in the field of concurrent data structures, linearizability is the gold standard, which means 2PL is used heavily. If we need to write to multiple nodes of a data structure in a consistent way, we typically need something like 2PL, at least for the write accesses. The exception to this is lock-free data structures, but hey, that's why (correct) lock-free is hard!


Ok, 2PL is easy to use and has strong isolation, so this means we're ready to go and don't need anything better than 2PL, right?
I'm afraid not. 2PL has a couple of big disadvantages: poor read-scalability and live-lock progress.

The classic 2PL was designed for mutual exclusion locks, which means that when two threads are performing a read-access on the same record, they will conflict and one of them (or both) will abort and restart.
This problem can be solved by replacing the mutual exclusion locks with reader-writer locks, but it's not as simple as this.
Mutual exclusion locks can be implemented with a single bit, representing the state of locked or unlocked.
Reader-writer locks also need this bit and, in addition, need a counter of the number of readers currently holding the lock in read mode. This counter needs enough bits to represent the number of readers. For example, 7 bits lets the counter go up to 127, enough for a system with 127 threads, in case they all decide to acquire the read-lock on a particular reader-writer lock instance.
For such a scenario this implies that each lock takes 1 byte, which may not sound like much, but if you have billions of records in your DB then you will need billions of bytes for those locks. Still reasonable, but now we run into the problem of contention on the counter.
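To make this concrete, a sketch of such a 1-byte reader-writer lock could look like the following (my own illustration, not code from any paper), where every read-lock acquisition does a compare-and-swap on the same byte:

    #include <atomic>
    #include <cstdint>

    // 1-byte reader-writer lock: bit 7 is the writer bit, bits 0-6 count
    // the readers (up to 127). All threads CAS on this same byte.
    class TinyRWLock {
        static constexpr uint8_t WRITER = 0x80;
        std::atomic<uint8_t> state{0};
    public:
        bool tryReadLock() {
            uint8_t s = state.load();
            if ((s & WRITER) || (s & 0x7F) == 0x7F) return false; // writer held or counter full
            return state.compare_exchange_strong(s, s + 1);       // one more reader
        }
        void readUnlock() { state.fetch_sub(1); }
        bool tryWriteLock() {
            uint8_t s = 0;
            return state.compare_exchange_strong(s, WRITER);      // only when idle
        }
        void writeUnlock() { state.store(0); }
    };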

Certain workloads have lots of read accesses on the same data; they are read-non-disjoint. An example of this is the root node of a binary search tree, where all operations need to read the root before they start descending the nodes of the tree.
When using 2PL, each of these accesses to the root node implies a lock acquisition, and even if we're using reader-writer locks, it implies heavy contention on the lock that protects the root node.


Previous approaches have taken a stab at this problem, for example TLRW by Dave Dice and Nir Shavit in SPAA 2010.
By using reader-writer locks they were able to have much better performance than using mutual exclusion locks, but still far from what the optimistic concurrency controls can achieve.
Take the example of the plot below, where we have an implementation similar to TLRW, with each read-access contending on a single variable of the reader-writer lock, applied to a binary search tree, a Rank-based Relaxed AVL. Scalability is flat regardless of whether we're doing mostly write-transactions (left side plot) or just read-transactions (right side plot).


Turns out it is possible to overcome this "read-indicator contention" problem through the usage of scalable read-indicators. Our favorite algorithm is a reader-writer lock where each reader announces its arrival/departure on a separate cache line, thus having no contention on read-lock acquisition. The downside is that the thread taking the write-lock must scan through all those cache lines to ascertain whether the write-lock can be granted, thus incurring a higher cost for write-lock acquisition.
As far as I know, the first reader-writer lock algorithms with this technique were shown in the paper "NUMA-Aware Reader-Writer Locks", of which Dave Dice and Nir Shavit are two of the authors, along with Irina Calciu, Yossi Lev, Victor Luchangco, and Virendra Marathe.
This paper shows three different reader-writer lock algorithms, two with high scalability, but neither is starvation-free.
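To give an idea of the technique, here is a simplified sketch in the spirit of those algorithms (not the actual code from the paper): each reader flips a flag on its own cache line, and the writer pays by scanning all of them:

    #include <atomic>

    constexpr int MAX_THREADS = 128;

    // Reader-writer lock with a distributed read-indicator: readers never
    // touch a shared counter, so read-lock acquisitions don't contend.
    class DistributedRWLock {
        struct alignas(64) ReaderFlag { std::atomic<bool> reading{false}; };
        ReaderFlag readers[MAX_THREADS];          // one cache line per reader
        std::atomic<bool> writerPresent{false};
    public:
        void readLock(int tid) {
            while (true) {
                readers[tid].reading.store(true);
                if (!writerPresent.load()) return;   // no writer: we're in
                readers[tid].reading.store(false);   // back off for the writer
                while (writerPresent.load()) { }     // spin (note: not starvation-free)
            }
        }
        void readUnlock(int tid) { readers[tid].reading.store(false); }
        void writeLock() {
            bool f = false;
            while (!writerPresent.compare_exchange_strong(f, true)) { f = false; }
            for (auto& r : readers)                  // the cost: scan every cache line
                while (r.reading.load()) { }         // wait for readers to drain
        }
        void writeUnlock() { writerPresent.store(false); }
    };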

So what we did was take some of these ideas to make a better reader-writer lock, which also scales well for read-lock acquisition and adds other properties (notably starvation-freedom), and we used it to implement our own concurrency control, which we called Two-Phase Locking Starvation-Free (2PLSF).
The reader-writer locks in 2PLSF have one bit per thread reserved for the read-lock, but these bits are located in their own cache line, along with the bits (read-indicators) of the adjacent locks.


As in the "NUMA-Aware Reader-Writer Locks" paper, the cost shifts to the write-lock acquisition, which needs to scan multiple cache lines to acquire the write-lock. There is no magic here, just trade-offs, but this turns out to be a pretty good trade-off, as most workloads tend to be on the read-heavy side. Even write-intensive workloads spend a good amount of time executing read-accesses, for example during the record lookup phase.
With our improved reader-writer lock the same benchmark shown previously for the binary search tree looks very different:



With this improved reader-writer lock we are able to scale 2PL even on read-non-disjoint workloads, but that still leaves the other major disadvantage: 2PL is prone to live-lock.

There are several variants of the original 2PL; some of these variants aren't even serializable, therefore I wouldn't call them 2PL anymore and won't bother going into them.
For the classical 2PL, there are three variants and they are mostly about how to deal with contention. They're usually named:
    - No-Wait
    - Wait-Or-Die
    - Deadlock-detection

When a conflict is encountered, the No-Wait variant aborts its own transaction (or the other transaction) and retries. This retry can be immediate, or it can come later, based on an exponential backoff scheme. The No-Wait approach has live-lock progress: two transactions, one attempting to modify record A and then record B while the other attempts to modify record B and then record A, may conflict with each other indefinitely and abort-restart without either of them ever being able to commit.
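In code, the No-Wait policy is little more than a retry loop with backoff. A minimal sketch, where runOnce() is a hypothetical callback that executes one attempt of the transaction and returns false if it had to abort itself on a failed trylock:

    #include <chrono>
    #include <thread>

    // No-Wait: on a lock conflict the transaction aborts, releases
    // everything it holds (inside runOnce), backs off exponentially,
    // and restarts from scratch.
    template<typename TxnBody>
    void runNoWaitTxn(TxnBody runOnce) {
        int backoffUs = 1;
        while (!runOnce()) {                       // conflict: txn aborted itself
            std::this_thread::sleep_for(std::chrono::microseconds(backoffUs));
            if (backoffUs < 1024) backoffUs *= 2;  // exponential backoff
            // Note: two txns locking A,B and B,A can keep aborting each
            // other forever, which is the live-lock described above.
        }
    }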




The Deadlock-Detection variant keeps an internal list of threads waiting on a lock and detects cycles (deadlocks).
This is problematic for reader-writer locks because it would require each reader to have its own list, which itself needs a (mutual exclusion) lock to protect it. And detecting the cycles would mean scanning all the readers' lists when the lock is taken in read-lock mode.
Theoretically it should be possible to make this scheme starvation-free, but it would require using starvation-free locks, and as there is no (published) highly scalable reader-writer lock with starvation-free progress, it kind of defeats the purpose. Moreover, having one list per reader may imply high memory usage. Who knows, maybe one day someone will try this approach.
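For illustration, a minimal sketch of the wait-for graph at the core of this variant, assuming integer transaction ids and mutual exclusion locks only (my sketch, not production code):

    #include <mutex>
    #include <set>
    #include <unordered_map>

    // Wait-for graph: an edge T1 -> T2 means "txn T1 waits for a lock
    // owned by txn T2". A cycle in this graph is a deadlock, and one of
    // the txns in the cycle must be aborted.
    class WaitForGraph {
        std::mutex m;                                  // protects the graph itself
        std::unordered_map<int, std::set<int>> edges;  // waiter -> owners
    public:
        // Adds 'waiter -> owner' and returns true if that closes a cycle.
        bool addEdgeAndDetect(int waiter, int owner) {
            std::lock_guard<std::mutex> guard(m);
            edges[waiter].insert(owner);
            std::set<int> visited;
            return reaches(owner, waiter, visited);    // does owner wait on waiter?
        }
        void removeWaiter(int waiter) {
            std::lock_guard<std::mutex> guard(m);
            edges.erase(waiter);
        }
    private:
        bool reaches(int from, int to, std::set<int>& visited) {  // DFS
            if (from == to) return true;
            if (!visited.insert(from).second) return false;       // already explored
            auto it = edges.find(from);
            if (it == edges.end()) return false;
            for (int next : it->second)
                if (reaches(next, to, visited)) return true;
            return false;
        }
    };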




The Wait-Or-Die variant imposes an order on all transactions, typically with a timestamp of when the transaction started, and, when a lock conflict arises, decides whether to wait for the lock or to abort by comparing the timestamp of the transaction with the timestamp of the lock owner. This works fine for mutual exclusion locks, as the owner can be stored in the lock itself using a unique thread identifier, but if we want to do it for reader-writer locks then a thread-id would be needed per reader.
If we want to support 256 threads then this means we need 8 bits x 256 = 256 bytes per reader-writer lock. Using 256 bytes per lock is a hard pill to swallow!
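The decision itself is tiny. A sketch, assuming each transaction carries the timestamp it took when it started (lower timestamp = older) and that an aborted transaction restarts with its original timestamp, so it eventually becomes the oldest:

    #include <cstdint>

    enum class Action { Wait, Die };

    // Wait-Or-Die: an older transaction is allowed to wait for a younger
    // lock owner; a younger transaction dies (aborts and restarts).
    // Because waiting only ever happens in one direction of the age
    // order, no cycle of waiters can form.
    Action onConflict(uint64_t myTs, uint64_t ownerTs) {
        return (myTs < ownerTs) ? Action::Wait : Action::Die;
    }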




But memory usage is not the real obstacle here. The Wait-Or-Die approach implies that all transactions have a unique transaction id so as to order them; for example, they can take a number from an atomic variable using a fetch_and_add() instruction.
The problem with this is that on most modern CPUs you won't be able to do more than 40 million fetch_and_add() operations per second on a contended atomic variable. This may seem like a lot (Visa does about 660 million transactions per day, so doing 40 million per second sounds pretty good), but when it comes to in-memory DBMSs it's not that much, and particularly for concurrent data structures it's a bit on the low side.
Even worse, this atomic fetch_and_add() must be done for all transactions, whether they are write-transactions or read-transactions.
For example, on one of our machines it's not really possible to go above 20 M fetch_and_add() per second, which means that scalability suckz:



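Incidentally, if you want to see where your own machine plateaus, this is easy to measure. A minimal sketch, with all hardware threads hammering a single contended counter for one second:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Measures how many fetch_add()s per second a single contended
    // atomic variable sustains with all hardware threads hammering it.
    int main() {
        std::atomic<unsigned long long> counter{0};
        std::atomic<bool> stop{false};
        const unsigned nThreads = std::thread::hardware_concurrency();
        std::vector<std::thread> threads;
        for (unsigned i = 0; i < nThreads; i++)
            threads.emplace_back([&]{ while (!stop.load()) counter.fetch_add(1); });
        std::this_thread::sleep_for(std::chrono::seconds(1));
        stop.store(true);
        for (auto& t : threads) t.join();
        printf("%llu fetch_add()s/sec with %u threads\n", counter.load(), nThreads);
    }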
To put this in perspective, one of my favorite concurrency controls is TL2, which was invented by (surprise!) none other than Dave Dice and Nir Shavit, together with Ori Shalev.
I hope by now you know who the experts in this stuff are ;)

Anyways, in TL2 the read-transactions don't need to do an atomic fetch_and_add(): they execute optimistic reads, which is faster than any read-lock acquisition you can think of. At least for read-transactions, TL2 can scale to hundreds of millions of transactions per second. By comparison, 2PL with Wait-Or-Die will never be able to go above 40 M tps.
This means if high scalability is your goal, then you would be better off with TL2 than 2PL… except, 2PLSF solves this problem too.

In 2PLSF, only the transactions that run into a conflict need to be ordered, i.e. only those need to do a fetch_and_add() on a central atomic variable. This has two benefits: there is less contention on the central atomic variable that assigns the unique transaction id, and transactions without conflicts are not bounded by the 40 M tps plateau.
This means we can have 200 M tps running without conflicts plus 40 M tps that are conflicting, because the conflicting transactions are the only ones that need to do the fetch_and_add() and are therefore the only ones bounded by the maximum number of fetch_and_add()s the CPU can execute per second.
On top of this, the 2PLSF algorithm provides starvation-freedom.
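The gist of the ordering trick can be sketched in a few lines (a simplification of the idea; the real code is in the repository linked below):

    #include <atomic>
    #include <cstdint>

    // Global clock touched only by transactions that actually hit a conflict.
    std::atomic<uint64_t> gClock{1};

    struct Txn {
        uint64_t ts{0};                      // 0 means "no order assigned yet"
        // Called the first time this txn runs into a conflict: only now
        // does it touch the contended counter. Conflict-free txns never do.
        uint64_t orderOnConflict() {
            if (ts == 0) ts = gClock.fetch_add(1);
            return ts;
        }
    };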



This post is already quite long, so I'm not going to dive into the algorithm itself, but it's rather simple, as far as starvation-free algorithms go. If you're interested, you can take a look at the paper: https://zenodo.org/record/7886718
or the source code: https://github.com/pramalhe/2PLSF/blob/main/stms/2PLSF.hpp


Summary

In this post we saw some of the advantages and disadvantages of 2PL and some of the variants of 2PL.
We explained what it takes to scale 2PL: make a better reader-writer lock.
But the big disadvantage of 2PL is its live-lock progress, which some variants could seemingly resolve, but in practice they don't, because they will not scale, even with a better reader-writer lock.
Then we described 2PLSF, a novel algorithm invented by me, Andreia and Pascal Felber, to address these issues.

In summary, 2PLSF is what 2PL should have been from the start: a concurrency control that scales well even when reads are non-disjoint and that provides starvation-free transactions, the highest form of blocking progress there is.
Moreover, it's pretty good at solving certain kinds of conflicts, which means it can keep scaling even when some conflicts arise. 2PLSF is not perfect, but it's good enough, and certainly better than TL2 when it comes to resolving conflicts.

Despite being two-phase locking, it's as close to 2PL as a jackhammer is to a pickaxe. 

2PLSF is not your grandfather's 2PL


8 comments:

  1. Is this the first post ever here?

  2. Great article! You may find this reader writer lock interesting:

    https://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

    Replies
    1. Right, the SNZI rw-locks, I had forgotten about these, they do indeed scale, and they precede the "NUMA-Aware rwlocks" paper!
      Thanks for pointing this out.

  3. jaypalsopandare@gmail.com

  4. I haven't read the paper thoroughly yet, but a couple points that jumped out at me:

    1. You don't necessarily need FAA on a global counter to provide a total order on txns; you could just as well use, say, RDTSC plus a thread ID.
    2. You don't need a pessimistic fallback to guarantee fairness/starvation-freedom; this can be done entirely with forward-invalidating OCC; see InvalSTM for a POC.

    Replies
    1. 1. Right, I know of at least Cicada which uses RDTSC, though I suspect it has a non-negligible cost in synchronization. But a synchronized clock is not available on all platforms, hence the advantage of 2PLSF in being slightly more generic.
      2. Thanks for the ref. InvalSTM uses global locks to commit, so it kind of is pessimistic (depending on the definition of pessimistic), but it doesn't scale, so it's irrelevant which definition we use. Maybe I should have been more precise and say that "write-scalable" STMs that provide starvation-freedom will inevitably fall back to some variant of a pessimistic approach.

    2. Indeed, InvalSTM uses very coarse-grained synchronization that inhibits write scalability (global commit lock and stalling all readers during a commit), but I don't think this is necessary. In fact, I think "invalidating" or "forward-validating" OCC can be implemented with approximately the same synchronization as ordinary "validating" or "backward-validating" OCC (except for a few additional memory barriers). Let me sketch this approach.

      For "fully invalidating" lazy conflict detection to work, we must eliminate all races between committing writers and active readers. The obvious way to do that is to just pause all the readers while the committer checks their read sets for conflicts with its write set (and that is what InvalSTM does). But this is far too strong a constraint: if the committer has already acquired write locks on its entire write set before validation (just like ordinary OCC), then we know that after that point, no additional conflicts can arise (i.e. no active reader can add an item in the committer's write set to its read set), as long as readers either wait or abort on encountering a write lock. So the committer can just scan all active read sets without worrying about racing with readers.

      We also need to ensure that all possibly conflicting reads are visible to the committing writer before validation begins. Unfortunately, I think this requires additional memory barriers in at least two places: immediately after each read (a seq_cst load of the reader's ABORTED flag to ensure it wasn't aborted by a conflicting writer), and after each update to the read set data structure (maybe a seqlock would work well here since we have a single writer).

      Anyway, despite this small additional synchronization, I think the "invalidation" approach has huge benefits: opacity "for free" (without either incremental revalidation or TL2-like false conflicts), minimizing wasted work by doomed txns (they'll be aborted as soon as they're doomed), and most importantly, the ability to arbitrate conflicts when they occur (which otherwise requires PCC), so provably fair conflict resolution policies (e.g. oldest timestamp wins) and irrevocable txns (for handling arbitrary effects in a txn) can be straightforwardly implemented (which is impossible in ordinary OCC).

  5. I've never seen a rw-lock scale where the number of threads is more than the number of cores. Where you can control the number of threads <= the number of cores it is easy. You can also spin on the local cache line and create a virtual list of waiters; it's in Herlihy's book.
