
Tuesday, February 7, 2017

Cliff Click - A JVM Does That ???

There are a few people who are really knowledgeable about Java, and Cliff Click is certainly one of them. One of his latest talks, given at Devoxx, is named "A JVM Does That???". YouTube link below.




This talk is so rich in content that it's hard to explain what it is about, so I'm just going to mention my favorite parts, but please watch the whole thing if you haven't done so yet.

The explanation of the endless loop of GC cycles when you have a cache (data structure) near its size limit is just divine:
https://www.youtube.com/watch?v=-vizTDSz8NU&t=24m40s

Or how about the illusion that locks are fast (and fair)?
https://www.youtube.com/watch?v=-vizTDSz8NU&t=12m55s
Hence the importance of the starvation-free locks that I mentioned in my CppCon talk, and the reason Andreia and I spend some of our time coming up with new starvation-free lock algorithms like Tidex:
http://dl.acm.org/citation.cfm?id=2851171
http://concurrencyfreaks.com/2014/12/tidex-mutex.html

The part about thread priorities is a nice demonstration of why lock-free and wait-free algorithms are so important (you don't need to care about thread priorities with lock-free and wait-free):
https://www.youtube.com/watch?v=-vizTDSz8NU&t=28m10s

But my favorite is the fact that System.currentTimeMillis() has happens-before guarantees across cores on the HotSpot JVM, because this guarantee allowed us to make a blazing fast queue (which we will show in a future post):
https://www.youtube.com/watch?v=-vizTDSz8NU&t=14m22s

Learning is fun with Cliff Click!

Wednesday, March 11, 2015

Tidex white paper

We've prepared a short white paper on the Tidex Lock that we discovered some months ago.
It doesn't say much more than the blog post, but if you're interested, you can get it from github:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/tidex-2015.pdf

Just a quick update: we've added padding to the Tidex Lock (and to the Ticket Lock) and their performance has doubled, but both still give the same throughput in our benchmarks; in the plot the two lines are now almost indistinguishable. A sketch of the kind of padding we mean is shown after the link below.
The source code for these benchmarks is available on github:
https://github.com/pramalhe/ConcurrencyFreaks/tree/master/C11/papers/tidex
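
Padding here just means keeping the fields that different threads hammer on in separate cache lines, to avoid false sharing. Below is a minimal sketch of the idea in C11, assuming a 64-byte cache line; the layout and the tidex_mutex_padded_t name are illustrative and not necessarily what the code in the repository does:


#include <stdatomic.h>

#define CACHE_LINE 64   /* assumed cache-line size */

/*
 * Keeping ingress and egress on separate cache lines prevents false sharing:
 * threads spinning on egress no longer invalidate the line holding ingress.
 */
typedef struct {
    _Alignas(CACHE_LINE) atomic_llong ingress;
    _Alignas(CACHE_LINE) atomic_llong egress;
    _Alignas(CACHE_LINE) long long   nextEgress;
} tidex_mutex_padded_t;
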

Wednesday, December 31, 2014

Tidex Mutex - No Pthread_Self variant

Before we wrap up on the Tidex Lock, there is one extra technique that may be useful.
When we use pthread_self() to get the thread id, it should be noted that using a pthread_t for anything other than comparison is undefined behavior. In practice this is not much of a problem, because we just want to use its negative value: even in the cases where pthread_t is a pointer to a struct, casting it to a long long and negating it should produce a unique value that is never returned by pthread_self().
It is true that most of us get a bit stressed when the words "undefined behavior" show up in a mutual exclusion lock algorithm, so, to give some peace of mind to those troubled by it, here is a technique that doesn't need pthread_self() and provides the same performance.
We call it the No Pthread_Self() variant, or Tidex NPS for short.

We start by adding a sequentially consistent atomic variable named globalThreadIndex, which is incremented monotonically with atomic_fetch_add() by each new thread that accesses an instance of the lock. This increment is done once and only once per thread, and the value it returns acts as the thread id. To keep this thread id around for later use, we store it in a (non-atomic) thread-local variable.
Here's what the extra code looks like in C11:


/*
 * This variable can even be an 'atomic_short' because it is unlikely that your
 * application will create more than 32767 threads. This also means that
 * both ingress and egress can be of type 'atomic_short', which can save memory.
 *
 * We start at '1' because we want to use the negative of the value as well.
 * Alternatively, we could start at zero but then we would have to advance
 * this index 2 at a time.
 *
 * This is shared by all tidex_nps_mutex_t instances to save memory.
 */
static atomic_long globalThreadIndex = ATOMIC_VAR_INIT(1);

/*
 * The index of the thread is stored in a thread-local variable that is
 * shared by all instances of tidex_nps_mutex_t.
 * If the value is still the initial INVALID_TID (zero), then we need to
 * get a value from globalThreadIndex using atomic_fetch_add(), once and
 * only once per thread.
 */
static _Thread_local long tlThreadIndex = INVALID_TID;



Notice that both of these variables are shared among all instances of the lock, so there is no increase in memory usage. In fact, with this technique we can make ingress and egress 32-bit or even 16-bit integers, which would reduce memory usage.

The modifications to lock() are minimal, and unlock() remains unchanged:


void tidex_nps_mutex_lock(tidex_nps_mutex_t * self)
{
    long mytid = tlThreadIndex;
    if (mytid == INVALID_TID) {
        // First time this thread uses any tidex_nps lock: get a unique index
        // from globalThreadIndex, once and only once per thread.
        tlThreadIndex = atomic_fetch_add(&globalThreadIndex, 1);
        mytid = tlThreadIndex;
    }
    if (atomic_load_explicit(&self->egress, memory_order_relaxed) == mytid) mytid = -mytid;
    long prevtid = atomic_exchange(&self->ingress, mytid);
    while (atomic_load(&self->egress) != prevtid) {
        sched_yield();  // Replace this with thrd_yield() if you use <threads.h>
    }
    // Lock has been acquired
    self->nextEgress = mytid;
}
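

Since unlock() remains unchanged, it is the same as in the original Tidex Lock (shown in the post further down); here it is with the NPS type name, just so this post is self-contained:


void tidex_nps_mutex_unlock(tidex_nps_mutex_t * self)
{
    atomic_store(&self->egress, self->nextEgress);
}
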


Conceptually, there is not much of a difference between this approach and the one using pthread_self(), because pthread_self() itself returns a value stored in a thread-local variable containing the thread's id.

The performance numbers remain unchanged, at least on our usual benchmark.

As usual, the C11 code is available on github:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/tidex_nps_mutex.h
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/tidex_nps_mutex.c

Sunday, December 28, 2014

Tidex Mutex in C11

We've made a C11 version of the Tidex Lock and made it available on github:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/tidex_mutex.h
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/tidex_mutex.c

We ran some benchmarks to compare the Tidex Lock against a pthread_mutex_t and our own Ticket Lock implementation.

--- Opteron 32 cores ---
1 thread:
pthread_mutex_t =  5583126 ops/second
Ticket Lock     = 18606375 ops/second
Tidex Lock      = 17374496 ops/second

16 threads:
pthread_mutex_t = 1418309 ops/second
Ticket Lock     = 5348964 ops/second
Tidex Lock      = 5322226 ops/second

32 threads:
pthread_mutex_t = 1338859 ops/second
Ticket Lock     = 4004952 ops/second
Tidex Lock      = 3775166 ops/second


--- Intel i7 ---
1 thread:
pthread_mutex_t = 13720671 ops/second
Ticket Lock     = 37627282 ops/second
Tidex Lock      = 39492774 ops/second

4 threads:
pthread_mutex_t =  3418376 ops/second
Ticket Lock     = 18810728 ops/second
Tidex Lock      = 24807009 ops/second

8 threads:
pthread_mutex_t =  4679020 ops/second
Ticket Lock     = 17078066 ops/second
Tidex Lock      = 18310183 ops/second


The first thing to notice is that the results for the Tidex Lock are very similar to those of the Ticket Lock (at least in this benchmark). Another is that the Tidex Lock is at a small disadvantage on the AMD Opteron but at a small advantage on the Intel i7-3740QM. These benchmarks were done without any spinning.
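
For illustration, here is a hypothetical sketch of the kind of micro-benchmark loop behind numbers like these (it is not the actual benchmark code from github): each thread repeatedly acquires the lock, increments a shared counter, and releases it, and the total iteration count is divided by the elapsed time. It calls the lock/unlock functions shown further below; the tidex_mutex_init() function and the header name are assumptions on my part.


#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include "tidex_mutex.h"   // assumed header providing the functions shown below

#define NUM_THREADS    4
#define ITERS_PER_THR  1000000L

static tidex_mutex_t gLock;
static long long gCounter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS_PER_THR; i++) {
        tidex_mutex_lock(&gLock);
        gCounter++;                     // tiny critical section
        tidex_mutex_unlock(&gLock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    struct timespec ts, te;
    tidex_mutex_init(&gLock);           // hypothetical init function
    clock_gettime(CLOCK_MONOTONIC, &ts);
    for (int i = 0; i < NUM_THREADS; i++) pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &te);
    double secs = (te.tv_sec - ts.tv_sec) + (te.tv_nsec - ts.tv_nsec) / 1e9;
    printf("%.0f ops/second\n", NUM_THREADS * (double)ITERS_PER_THR / secs);
    return 0;
}
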

So what does it look like in C11?


void tidex_mutex_lock(tidex_mutex_t * self)
{
    long long mytid = (long long)pthread_self();
    // If our id is currently in egress (we were the last thread to unlock),
    // use its negative so that the value we place in ingress differs from egress.
    if (atomic_load_explicit(&self->egress, memory_order_relaxed) == mytid) mytid = -mytid;
    // Publish our id in ingress and find out which value was there before us.
    long long prevtid = atomic_exchange(&self->ingress, mytid);
    // Wait until the previous thread hands the lock over by storing prevtid in egress.
    while (atomic_load(&self->egress) != prevtid) {
        sched_yield();  // Replace this with thrd_yield() if you use <threads.h>
    }
    // Lock has been acquired
    self->nextEgress = mytid;  // remember what to store in egress when we unlock
}




void tidex_mutex_unlock(tidex_mutex_t * self)
{
    atomic_store(&self->egress, self->nextEgress);
}


As you can see, this lock is very easy to implement (and to understand).
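
For context, here is a minimal sketch of what the lock's state and initialization might look like, using the field names from the code above; the real definition lives in tidex_mutex.h on github, so treat the init function and the assumption that zero is never a valid thread id as illustrative:


typedef struct {
    atomic_llong ingress;     // id published by the most recent thread to call lock()
    atomic_llong egress;      // id of the thread whose turn it currently is
    long long    nextEgress;  // set by the lock holder, written to egress on unlock()
} tidex_mutex_t;

void tidex_mutex_init(tidex_mutex_t * self)
{
    // Start ingress and egress with the same value so that the first call to
    // lock() sees egress == prevtid and acquires immediately (this assumes
    // that zero is never returned by pthread_self()).
    atomic_store(&self->ingress, 0);
    atomic_store(&self->egress, 0);
    self->nextEgress = 0;
}
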