tag:blogger.com,1999:blog-8231772264325864647.post4478805172850287395..comments2024-03-22T19:05:00.088+01:00Comments on Concurrency Freaks: Combining the StampedLock and LongAdder to make a new RW-LockPedro Ramalhetehttp://www.blogger.com/profile/01340437958052998917noreply@blogger.comBlogger7125tag:blogger.com,1999:blog-8231772264325864647.post-83798757294078136662013-09-24T20:47:38.037+02:002013-09-24T20:47:38.037+02:00Hello,
I have wrote this about my distributed pri...Hello,<br /><br />I have written the following about my distributed priority and non-priority LOCK:<br /><br />"So if you have noticed, in fact I am using 2 CASes, so my algorithm is good."<br /><br /><br />You will say that I in fact use 3 atomic operations, 2 CASes plus one inside the wait-free queue, but note that when there are threads waiting and spinning, one of the CASes will be very cheap because its variable will sit in the local L2 cache, while the other CAS will be expensive because the threads must load the variable from remote caches. So it is as if we had a single CAS, since one of the two is very cheap, which makes a total of 2 CASes if we count the CAS of the wait-free queue.<br /><br />This was my algorithm for a distributed priority and non-priority LOCK that is efficient, and I will code it as soon as possible.<br /><br /><br />Thank you,<br />Amine Moulay Ramdane.<br /><br /><br />Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-8231772264325864647.post-58749452684842441762013-09-24T19:10:53.993+02:002013-09-24T19:10:53.993+02:00Hello,
I have come to an interresting subject ab...Hello,<br /><br /><br />I have come to an interesting subject about lock-free queues. Like transactional memory, a lock-free queue retries and loops again when the data has changed, but this lock-free mechanism generates a lot of cache-coherence traffic, so I would not advise it. How can we do better? Take a look at the MCS queue lock: it enqueues with a single wait-free atomic operation, and each thread then spins only on its own node rather than on a shared variable, which greatly reduces cache-coherence traffic. Other than that, if your memory manager is not optimal and uses a lock-free mechanism that generates a lot of cache-coherence traffic, you have to use a freelist to lower that traffic.<br /><br /><br />So be smart please and think about that.<br /><br /><br />Thank you,<br />Amine Moulay Ramdane. <br />Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-8231772264325864647.post-85740161739504175972013-09-24T19:00:21.414+02:002013-09-24T19:00:21.414+02:00Hello,
Be smart please, to reduce a lot the cache...Hello,<br /><br />Be smart please: to greatly reduce cache-coherence traffic, you have to choose a queue that minimizes it, and for that you have to implement the queue as a linked list (similar to the MCS queue) in which each thread waits on its own node. You have to avoid lock-free queues, because their retry loops increase cache-coherence traffic. This is how you have to implement my distributed LOCK, and it will be efficient and good.<br /><br /><br /><br />Thank you,<br />Amine Moulay Ramdane. Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-8231772264325864647.post-48388473711752444622013-09-24T17:49:46.011+02:002013-09-24T17:49:46.011+02:00I correct some typos, please read again...
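The two comments above appeal to the MCS queue lock as the way to keep waiting threads off shared cache lines. A minimal Java sketch of that idea may make the point concrete; the class and field names here are my own illustration, not the commenter's code. Enqueueing takes a single atomic swap, and each waiting thread spins only on a flag in its own node, so waiting generates almost no cache-coherence traffic.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal MCS-style queue lock sketch: waiters form a linked list and
// each thread spins on its OWN node's flag, never on a shared variable.
public final class MCSLock {
    private static final class QNode {
        volatile boolean locked;   // true while this thread must wait
        volatile QNode next;       // successor in the queue, if any
    }

    private final AtomicReference<QNode> tail = new AtomicReference<>();
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

    public void lock() {
        QNode node = myNode.get();
        node.locked = true;
        node.next = null;
        QNode pred = tail.getAndSet(node);       // one atomic op to enqueue
        if (pred != null) {
            pred.next = node;                    // link behind the predecessor
            while (node.locked) {                // spin only on our own node
                Thread.onSpinWait();
            }
        }
    }

    public void unlock() {
        QNode node = myNode.get();
        if (node.next == null) {
            // No visible successor: try to swing the tail back to empty.
            if (tail.compareAndSet(node, null)) return;
            // A successor is in the middle of linking itself in; wait for it.
            while (node.next == null) {
                Thread.onSpinWait();
            }
        }
        node.next.locked = false;                // hand the lock to the successor
    }
}
```

The design choice being illustrated: the only contended atomic operation is the `getAndSet` on `tail`; all spinning afterwards is on a thread-private node.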
Hello,...I correct some typos, please read again...<br /><br />Hello,<br /><br />Here is my new algorithm for a distributed priority and non-priority LOCK. This algorithm greatly reduces cache-coherence traffic and is good, so follow me please...<br /><br /><br />First you allocate an array with the same number of elements as the number of cores; each element of the array must be 64-byte cache-aligned and must be a record containing a single field, the flag for the first CAS, which will be used as a critical section.<br /><br />You first initialise the distributed priority LOCK by setting the flag for the second CAS, which will also be used as a critical section, to 1 (to permit the first thread to enter lock()), and by setting the flags for the first CAS to 0.<br /><br />The lock() function will look like this:<br /><br />Every thread that enters lock() must obtain its processor number using the GetProcessorNumber() function; this function will be optimized to amortize the number of calls to it, and that is easy to do.<br /><br />You enter lock() by pushing the thread's processor number into a queue (or priority queue) that will be optimized. This queue uses a CAS on push() but no CAS on pop(); we don't need one on pop() because only one thread, the one inside unlock(), will ever pop(). After that you enter a repeat loop where you test, with the first CAS, the flag of the corresponding element of the array (using the thread's processor number as an index) and, with the second CAS, the global flag. The flag for the second CAS starts at 1, so the first thread will enter lock(); the other threads will keep testing the first CAS and the second CAS in the repeat loop, and since their flags will have been set to zero, they will wait.<br /><br />After that the first thread will arrive at the unlock() function and it will pop() the processor number from the optimized priority (or non-priority) queue and set the flag for the first CAS to 1 for the corresponding processor core; this will allow one thread to enter the lock. If there are no elements in the queue, the thread will set the flag of the second CAS to 1, and this will allow one thread to enter the lock.<br /><br /><br />So as you have noticed, my algorithm is also efficient because, when there are threads waiting, the cache-coherence traffic is greatly reduced, since each thread spins on a local variable in its own element of the array, aligned to 64 bytes.<br /><br /><br />So if you have noticed, in fact I am using 2 CASes, so my algorithm is good.<br /><br />This was my algorithm for a distributed priority and non-priority LOCK that is efficient, and I will code it as soon as possible.<br /><br /><br /><br />Thank you,<br />Amine Moulay Ramdane.<br />Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-8231772264325864647.post-19413602464840669342013-09-24T17:11:19.188+02:002013-09-24T17:11:19.188+02:00Hello Pedro,
Here is my new algorithm for a dis...Hello Pedro, <br /><br /><br />Here is my new algorithm for a distributed priority and non-priority LOCK. This algorithm greatly reduces cache-coherence traffic and is good, so follow me please...<br /><br /><br />First you allocate an array with the same number of elements as the number of cores; each element of the array must be 64-byte cache-aligned and must be a record containing a single field, the flag for the first CAS, which will be used as a critical section.<br /><br />You first initialise the distributed priority LOCK by setting a flag for the second CAS, which will also be used as a critical section, to 0, and you set the flag of the first CAS to 1 (to permit the first thread to enter lock()).<br /><br />The lock() function will look like this:<br /><br />Every thread that enters lock() must obtain its processor number using the GetProcessorNumber() function; this function will be optimized to amortize the number of calls to it, and that is easy to do.<br /><br />You enter lock() by pushing the thread's processor number into a queue (or priority queue) that will be optimized. This queue uses a CAS on push() but no CAS on pop(); we don't need one on pop() because only one thread, the one inside unlock(), will ever pop(). After that you enter a repeat loop where you test, with the first CAS and with the second CAS, the flag of the corresponding element of the array (using the thread's processor number as an index). The flag for the second CAS will be set to 1, so the first thread will enter lock(); the other threads will keep testing the first CAS and the second CAS in the repeat loop, and since their flags will have been set to zero, they will wait.<br /><br />After that the first thread will arrive at the unlock() function and it will pop() the processor number from the optimized priority (or non-priority) queue and set the flag for the first CAS to 0 for the corresponding processor core; this will allow a thread to enter the lock. If there are no elements in the queue, the thread will set the flag of the second CAS to zero, and this will allow a thread to enter the lock.<br /><br /><br />So as you have noticed, my algorithm is also efficient because, when there are threads waiting, the cache-coherence traffic is greatly reduced, since each thread spins on a local variable in its own element of the array, aligned to 64 bytes.<br /><br /><br />So if you have noticed, in fact I am using 2 CASes, so my algorithm is good.<br /><br />This was my algorithm for a distributed priority and non-priority LOCK that is efficient, and I will code it as soon as possible.<br /><br /><br /><br />Thank you,<br />Amine Moulay Ramdane.<br />Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-8231772264325864647.post-89851328542030567992013-09-23T01:19:47.947+02:002013-09-23T01:19:47.947+02:00
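For concreteness, here is a loose Java sketch of the lock described in the two comments above. It is an assumption-laden reconstruction, not the commenter's code: `ConcurrentLinkedQueue` stands in for the "optimized queue", a round-robin thread-local slot stands in for GetProcessorNumber(), the priority variant is omitted, and all names (`DistributedLock`, `slotFlag`, `free`) are invented for illustration.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of a hand-off lock: each waiter spins on its own cache-padded
// per-slot flag (the "first CAS") or tries the global flag (the "second
// CAS"); unlock() hands the lock to the next queued waiter's slot.
public final class DistributedLock {
    private static final int SLOTS = Runtime.getRuntime().availableProcessors();
    private static final int STRIDE = 16; // 16 ints = 64 bytes: one flag per cache line

    // Per-slot hand-off flags, strided so each one sits on its own cache line.
    private final AtomicIntegerArray slotFlag = new AtomicIntegerArray(SLOTS * STRIDE);
    // Global flag: 1 = lock free, 0 = lock held.
    private final AtomicInteger free = new AtomicInteger(1);
    // Slot ids of waiting threads; only the lock holder ever poll()s it.
    private final ConcurrentLinkedQueue<Integer> waiters = new ConcurrentLinkedQueue<>();

    // Round-robin slot assignment stands in for GetProcessorNumber().
    private final AtomicInteger nextSlot = new AtomicInteger();
    private final ThreadLocal<Integer> mySlot =
            ThreadLocal.withInitial(() -> Math.floorMod(nextSlot.getAndIncrement(), SLOTS));

    public void lock() {
        int slot = mySlot.get();
        waiters.add(slot);                         // announce ourselves first
        for (;;) {
            // First CAS: unlock() handed the lock directly to our slot.
            if (slotFlag.compareAndSet(slot * STRIDE, 1, 0)) return;
            // Second CAS: the lock is globally free (queue was empty at release).
            if (free.compareAndSet(1, 0)) {
                waiters.remove(Integer.valueOf(slot)); // retract our stale entry
                return;
            }
            Thread.onSpinWait();
        }
    }

    public void unlock() {
        Integer next = waiters.poll();
        if (next != null) {
            slotFlag.set(next * STRIDE, 1);        // hand off to that waiter's slot
        } else {
            free.set(1);                           // no waiter: mark the lock free
        }
    }
}
```

Note that, unlike the comments' claim of exactly two CASes, this sketch also pays for the queue's internal atomics, and it makes no fairness guarantee between threads sharing a slot; it is only meant to show the spin-on-a-local-cache-line structure.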
Hello Pedro,
I have used my SemaMonitor inside m...<br />Hello Pedro,<br /><br />I have used my SemaMonitor inside my RWLock and benchmarked it against the Windows Semaphore, and I found that it is faster than the Windows Semaphore. You will find my SemaCondvar and my SemaMonitor on the following website:<br /><br />http://pages.videotron.com/aminer/<br /><br /><br />Thank you,<br />Amine Moulay Ramdane.<br /><br /><br />Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-8231772264325864647.post-9858632488515115672013-09-23T01:10:33.948+02:002013-09-23T01:10:33.948+02:00Hello Pedro,
I have updated my RWLock to version ...Hello Pedro,<br /><br />I have updated my RWLock to version 1.1; I have added another version of my RWLock. Please read the following description:<br /><br />Description:<br /><br />A fast, scalable, and lightweight multiple-readers-exclusive-writer lock called LW_RWLock that is portable and works across processes and threads; and a fast and scalable multiple-readers-exclusive-writer lock called RWLock that is portable, works across threads, and does not spin-wait but instead uses an event object together with my SemaMonitor, so it consumes less CPU than the lightweight version; it now also serves the writers in FIFO order, which is important.<br /><br />You can take a look at it on the following website:<br /><br />http://pages.videotron.com/aminer/<br /><br /><br />Thank you,<br />Amine Moulay Ramdane.<br />Anonymousnoreply@blogger.com
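As a point of comparison only (this is not the commenter's implementation), the behavior described above, multiple readers, one exclusive writer, writers served in FIFO order, blocking rather than spin-waiting, is close to what Java's standard ReentrantReadWriteLock provides when constructed in fair mode:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Fair-mode ReentrantReadWriteLock: many concurrent readers, one exclusive
// writer, and waiting threads (including writers) granted the lock in
// roughly FIFO order; waiters block instead of spinning.
public final class FairRWExample {
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock(true); // true = fair
    private int value;

    public int read() {
        rw.readLock().lock();              // shared: many readers at once
        try { return value; } finally { rw.readLock().unlock(); }
    }

    public void write(int v) {
        rw.writeLock().lock();             // exclusive: writers queue up FIFO
        try { value = v; } finally { rw.writeLock().unlock(); }
    }
}
```

Fair mode trades some throughput for the FIFO ordering of writers that the comment highlights as important.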