When we use pthread_self() to get the thread id, notice that using a pthread_t for anything other than comparison is undefined behavior. In practice this is not much of a problem because we only want the negative of the value, so even on platforms where pthread_t is a pointer to a struct, casting it to a long long and negating it should provide a unique value that is never returned by pthread_self().
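To make that concrete, here is a minimal sketch of the kind of cast we are talking about. The helper name thread_key() is ours, not part of the Tidex code, and the cast only compiles on platforms where pthread_t is an integer or a pointer, which is exactly where this trick is meant to be used:
#include <pthread.h>

// 'thread_key' is an illustrative helper, not part of tidex_mutex.
// On platforms where pthread_t is an integer (Linux) or a pointer (e.g. macOS),
// this cast yields a value whose negation is never returned by pthread_self().
static inline long long thread_key(void) {
    return (long long)pthread_self();
}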
It is true that most of us get a bit stressed when the words "undefined behavior" appear in a mutual exclusion lock algorithm, so to give some peace of mind to those troubled by it, here is a technique that does not need pthread_self() and provides the same performance.
We call it the No Pthread_Self() variant, or Tidex NPS for short.
We start by adding a sequentially consistent atomic variable named globalThreadIndex, which is incremented monotonically with atomic_fetch_add() by each new thread that accesses an instance of the lock. This increment is done once and only once per thread, and the returned value acts as the thread id. To keep this thread id around for later use, we save it in a (non-atomic) thread-local variable.
Here's what the extra code looks like in C11:
/*
 * This variable can even be an 'atomic_short' because it is unlikely that your
 * application will create more than 32767 threads. This also means that
 * both ingress and egress can be of type 'atomic_short', which can save memory.
 *
 * We start at '1' because we want to use the negative of the value as well.
 * Alternatively, we could start at zero but then we would have to advance
 * this index 2 at a time.
 *
 * This is shared by all tidex_nps_mutex_t instances to save memory.
 */
static atomic_long globalThreadIndex = ATOMIC_VAR_INIT(1);

/*
 * The index of the thread is stored in a thread-local variable that is
 * shared by all instances of tidex_nps_mutex_t.
 * If the value is still the initial INVALID_TID (zero) then we need to
 * get a value from globalThreadIndex using atomic_fetch_add(), once and
 * only once per thread.
 */
static _Thread_local long tlThreadIndex = INVALID_TID;
Notice that both of these variables are shared among all instances of the lock, so there is no increase in memory usage. In fact, with this technique we can make ingress and egress 32-bit or even 16-bit integers, which would reduce memory usage.
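As an illustration of that point, here is a hedged sketch of what the lock's fields could look like with 16-bit indices. The struct layout and field names are assumptions based on the fields used by lock() below, not the actual contents of tidex_nps_mutex.h:
#include <stdatomic.h>

// Hypothetical 16-bit variant of the lock's fields (illustrative only).
typedef struct {
    atomic_short ingress;    // index of the last thread that attempted to acquire
    atomic_short egress;     // index stored by the thread that last released the lock
    short        nextEgress; // value to publish in egress on unlock
} tidex_nps_mutex_short_t;

// The global counter and the thread-local index would shrink accordingly:
static atomic_short globalThreadIndexShort = ATOMIC_VAR_INIT(1);
static _Thread_local short tlThreadIndexShort = 0; // 0 == INVALID_TID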
The modifications to lock() are minimal, and unlock() remains unchanged:
void tidex_nps_mutex_lock(tidex_nps_mutex_t * self)
{
    long mytid = tlThreadIndex;
    if (mytid == INVALID_TID) {
        tlThreadIndex = atomic_fetch_add(&globalThreadIndex, 1);
        mytid = tlThreadIndex;
    }
    if (atomic_load_explicit(&self->egress, memory_order_relaxed) == mytid) mytid = -mytid;
    long prevtid = atomic_exchange(&self->ingress, mytid);
    while (atomic_load(&self->egress) != prevtid) {
        sched_yield(); // Replace this with thrd_yield() if you use <threads.h>
    }
    // Lock has been acquired
    self->nextEgress = mytid;
}
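For completeness, here is a minimal sketch of what the unchanged unlock() presumably does, based only on the fields that lock() uses; check tidex_nps_mutex.c linked below for the real thing:
void tidex_nps_mutex_unlock(tidex_nps_mutex_t * self)
{
    // Publish our ticket so that the thread waiting on it can proceed.
    atomic_store(&self->egress, self->nextEgress);
}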
Conceptually, there is not much of a difference between this approach and the one using pthread_self(), because pthread_self() itself returns a value kept in a thread-local variable that contains the thread's id.
The performance numbers remain unchanged, at least on our usual benchmark.
As usual, the C11 code is available on github:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/tidex_nps_mutex.h
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/tidex_nps_mutex.c