- CompareAndSet() are slow
- Each Reader only writes in its own cache-line
- No retries when doing a read-lock
- The Writer's state will usually be read from L1 or L2 cache
- The read-lock mechanism would be Wait-Free if there were no Writers
CompareAndSet() are slow
Typical read-write lock implementations use one or two exclusive locks to acquire the read-only lock. Usually there is at least a global lock that is held for a very short time while the thread increments the variable that holds the number of readers and other variables if needed.
Each exclusive lock is implemented with a CompareAndSet() function which is a synchronization primitive that can take several hundred of CPU cycles to complete.
On the ScalableRWLock there is no CompareAndSet() call when doing a read-lock operation.
Each Reader only writes in its own cache-line
Each thread doing a Read will have it's own variable in a separate cache-line to avoid invalidating cache-lines of other Readers. The read-lock operation consists of reading the Writer's state and writing to the Reader's state. The first is done by all Readers but it is a read operation, therefore, cache-line contention is minimal, and the second one implies the modification of a cache-line, but that cache-line only holds one variable (the state of the current Reader) and it will be read only by a Writer (if and when) it scans for the states of the Readers.
No retries when doing a read-lock
Regardless of the number of Readers currently trying to acquire the lock in read-only mode, in the ScalableRWLock algorithm there will not be retries from the Readers. There is a variable that is read (twice) and another variable that is written, and that is all.
For classical read-write locks, like pthread_rwlock_t, when multiple threads attempt a read-lock operation on the same lock, several retries may be required because they are competing for the exclusive global lock that guards the variable with the number of readers, where each retry will incur a penalty of repeating a CompareAndSet() operation.
The Writer's state will usually be read from L1 or L2 cache
If Writes are rare, when the read-lock operation is performed, the variable with the state of the Writer will be read from memory, but its cache-line will not be dirty and usually it can be taken directly from cache L1 or L2. This is even more likely to occur when several read-lock/read-unlock operations of the same lock instances are done in near succession, because the state of the Writer will still be in a close-by cache. One can almost say that, the more you lock/unlock in read-only mode, the faster the ScalableRWLock will become :-O
For classical read-write lock algorithms, the state of the global lock, or the number of readers, will all be in cache-lines that are invalidated by other Readers running concurrently, thus forcing these cache-lines to be taken from main memory, or at best, from L3 cache when that is shared among the different cores of the CPU.
The read-lock mechanism would be Wait-Free if there were no Writers
This is a bit of a silly statement because, if there were no Writers then we wouldn't need a lock in the first place! Still, the point here is that as long as there are no Writers attempting a write-lock operation, the Readers will have a mechanism to reach consensus that is Wait-Free. Not only that, it does not depend on the number of threads (from the Reader's point of view), which means that it is a kind of quasi Wait-Free-Population-Oblivious... again, this is only true as long as there are no Writers, which means in practice that is doesn't follow the strict definition of Wait-Freedom.
The third and fourth reasons described above, apply to the Fetch-And-Add RW-Lock (FAARWLock), and all reasons apply to the ScalableRWLock. Both algorithms are part of the ConcurrencyFreaks library.