Sunday, August 25, 2013

Concurrency Pattern: Distributed Cache-Line Counter

In the previous post we showed a counter that works for single-Writer scenarios; this time we have one that works for multiple Writers and multiple Readers, which we named the Distributed Cache-Line Counter.

The trick behind this counter is that it distributes the Writers among different cache lines, and the Readers then have to sum up the counters in all of the cache lines.
It is sequentially consistent for get() as long as only increment() operations are done. If a clear() is done, then get() will no longer be sequentially consistent.

Usage scenario:
- Multiple Writers that increment often.
- Preferably not too many Reads.
- Any thread can do writes.
- Any thread can do reads.

It has one array of "counters" where each entry is separated from the next by enough padding that consecutive counters end up on different cache lines (with 8-byte longs and 64-byte cache lines, only every 8th entry of the array is actually used).
There is a hashing function that takes the id of the thread and generates an index into the array of counters, so that statistically there is only a small chance of two Writer threads doing an increment() on the same memory location (cache line).
The Reader has to sum up the values of all the counters in the array, which means it can be slow compared to a Reader acting on a single atomic variable.


The DistributedCacheLineCounter has three functions:
increment():
1. Hash the thread id, take the result modulo the number of counters, and multiply by the cache-line stride to get the index into the array (see the worked example after this list)
2. Do a fetch-and-add of +1 on that entry of the array


get():
1. Sum up all the entries in the array

clear():
1. Set each of the entries in the array to zero
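
As a worked example of the index computation in step 1 of increment(), here are some made-up numbers: a hypothetical 4-core machine, so kNumCounters = 4*3 = 12, and COUNTER_CACHE_LINE = 64/8 = 8, matching the constants in the Code section below (the IndexExample name and the hash value 51 are just for illustration):

public class IndexExample {
    public static void main(String[] args) {
        final int kNumCounters = 12;        // hypothetical: 4 cores * 3
        final int COUNTER_CACHE_LINE = 8;   // 64-byte cache line / 8-byte long
        final long hashedTid = 51;          // made-up output of the thread-id hash
        // Slot 51 % 12 = 3, scaled to entry 3 * 8 = 24 of the backing array
        final int idx = (int)((hashedTid % kNumCounters) * COUNTER_CACHE_LINE);
        System.out.println(idx);            // prints 24
    }
}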


Note that if sequential consistency is desired even when clear() operations are performed, you can always protect the counter with a rw-lock, doing a sharedLock() for increment() and get(), and an exclusiveLock() for clear(). This will hurt performance, though, which kind of defeats the purpose of using a "simple" counter. A sketch of this wrapper is shown below.
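
Here is a minimal sketch of that rw-lock wrapper, using java.util.concurrent's ReentrantReadWriteLock (the SeqCstCounter name is ours; it wraps the DistributedCacheLineCounter class shown in the Code section below):

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SeqCstCounter {
    private final DistributedCacheLineCounter counter = new DistributedCacheLineCounter();
    private final ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock();

    public void increment() {
        rwlock.readLock().lock();   // shared: increments can still run concurrently
        try {
            counter.increment();
        } finally {
            rwlock.readLock().unlock();
        }
    }

    public long get() {
        rwlock.readLock().lock();   // shared: get() can run alongside increments
        try {
            return counter.get();
        } finally {
            rwlock.readLock().unlock();
        }
    }

    public void clear() {
        rwlock.writeLock().lock();  // exclusive: no increment() or get() during clear()
        try {
            counter.clear();
        } finally {
            rwlock.writeLock().unlock();
        }
    }
}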

Disadvantages:

- It's not linearizable (but who cares about that?)
- It's only sequentially consistent as long as the decrement() and clear() functions are not used. This is usually not a problem: from a logical point of view there is little sense in performing a clear() at the same time as a get(), so if you do, you'll get an inconsistent state, which is fine for nearly all applications.
- It requires more memory than a simple atomic variable because it needs to reserve (at least) one cache line per core.
- Reads are slower than on an atomic variable.

Advantages:

- Performance for Writes can be much better than that of an atomic counter under high contention.

Performance Plots:

Here are some performance plots comparing an AtomicLong from java.util.concurrent.atomic with the Distributed Cache-Line Counter. As you can see, this new counter scales almost linearly with the number of threads doing increment() (at least on our 32-core machine).
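
The benchmark harness itself isn't shown in this post, but a minimal throughput test along the same lines could look like the sketch below (the CounterBench name and the 10-million-increments-per-thread workload are made up for illustration):

public class CounterBench {
    public static void main(String[] args) throws InterruptedException {
        final int numThreads = Integer.parseInt(args[0]);
        final long opsPerThread = 10_000_000L;
        final DistributedCacheLineCounter dclc = new DistributedCacheLineCounter();
        final Thread[] threads = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            threads[i] = new Thread(() -> {
                for (long op = 0; op < opsPerThread; op++) dclc.increment();
            });
        }
        final long startNs = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        final long elapsedMs = Math.max(1L, (System.nanoTime() - startNs) / 1_000_000L);
        System.out.println((numThreads * opsPerThread) / elapsedMs + " ops/ms");
    }
}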

Code:



import java.util.concurrent.atomic.AtomicLongArray;

public class DistributedCacheLineCounter {
    // Size of the counters[] array: 3x the number of cores, to (statistically) reduce the chance of two threads hashing to the same counter
    private final static int kNumCounters = Runtime.getRuntime().availableProcessors()*3;
   
    // Size of a cache line, measured in longs (64 bytes / 8-byte long)
    private final static int COUNTER_CACHE_LINE = 64/8;
   
    // The array of counters, padded so that each used entry sits on its own cache line
    private final AtomicLongArray counters = new AtomicLongArray(kNumCounters*COUNTER_CACHE_LINE);   
   
    /**
     * An imprecise but fast hash function, based on George Marsaglia's
     * xorshift algorithm. Thread ids are small positive values, so x
     * stays non-negative and the modulo below yields a valid index.
     */
    private int tid2hash() {
        long x = Thread.currentThread().getId();
        x ^= (x << 21);
        x ^= (x >>> 35);
        x ^= (x << 4);
        // Pick one of the kNumCounters slots, then scale by the cache-line stride
        final int idx = (int)((x % kNumCounters)*COUNTER_CACHE_LINE);
        return idx;
    }
   
    public void increment() {
        counters.getAndIncrement(tid2hash());
    }
       
    public long get() {
        long sum = 0;
        for (int idx = 0; idx < kNumCounters*COUNTER_CACHE_LINE; idx += COUNTER_CACHE_LINE) {
            sum += counters.get(idx);
        }
        return sum;
    }
   
    public void clear() {
        for (int idx = 0; idx < kNumCounters*COUNTER_CACHE_LINE; idx += COUNTER_CACHE_LINE) {
            counters.set(idx, 0);
        }
    }
}
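
A quick usage sketch, just exercising the three methods above (the CounterExample name is ours):

public class CounterExample {
    public static void main(String[] args) {
        final DistributedCacheLineCounter counter = new DistributedCacheLineCounter();
        counter.increment();                // writer side: call from any thread
        counter.increment();
        System.out.println(counter.get());  // reader side: sums all the slots, prints 2
        counter.clear();                    // resets every slot to zero
        System.out.println(counter.get());  // prints 0
    }
}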



9 comments:

  1. Very interesting blog, can you please explain a bit about your magical tid2hash function?
    Did you benchmark it against LongAdder, which uses Striped64?

  2. Hi Ashkrit,
    The tid2hash() is just a "randomizer" based on George Marsaglia's algorithm: http://www.javamex.com/tutorials/random_numbers/xorshift.shtml

    No, I haven't compared with LongAdder... in fact, I didn't even know about it until I saw your comment ;)
    I'll re-run the microbenchmark with LongAdder and post the results soon.
    Thanks

    1. Here is the link
      http://concurrencyfreaks.blogspot.co.uk/2013/09/longadder-and-dclc.html

  3. Thanks for sharing your results. I also did a quick prototype for scalable counters, inspired by your blog.

    http://ashkrit.blogspot.sg/2013/09/scalable-counters-for-multi-core.html

    Do you have access to a Xeon-based desktop?
    The results of my test on a Xeon are very surprising: the performance of all the counters is almost the same, even though there are CAS failures for the AtomicLong.
    On the Xeon, CAS failures don't make any difference in the overall timing.
    Later I will use some of the back-off strategies mentioned by Dave Dice in his blog for the counter and test it on the Xeon.

    1. Hi Ashkrit,
      I looked at the source code you have on GitHub and it seems to me that the reason they all have the same performance is that all those implementations are suffering from "false sharing". Adjacent entries in the atomicArray variable of your CoreBaseCounter are usually sharing a cache line, where every 8 entries share the same cache line.
      In addition, some of the big differences in performance are only noticeable when you go up to a large number of cores.
      Even so, after you fix the false-sharing issue, try running with all threads doing only increment() to see how many more ops you get. On my Core i7 I see 65k ops/ms for an AtomicLong and 113k ops/ms for the DCLC when running with 4 threads.

      Btw, I've heard Doug Lea mention a couple of times that even adding "padding" to an array may not be enough to avoid false sharing. The only sure way is to use @sun.misc.Contended

      Cheers

    2. correction: (...) every 16 entries share the same cache line (...)

  4. You were right, it was due to false sharing: after I added another implementation with a padded counter, performance was in the expected range.
    Thanks.

  5. Nice article, but there is a bug I think:
    in get():
    for (int idx = 0; idx < kNumCounters; idx += COUNTER_CACHE_LINE) {
    sum += counters.get(idx);
    }

    should be
    for (int idx = 0; idx < kNumCounters; idx++) {
    sum += counters.get(idx * kNumCounters);
    }

    same thing in clear()

    1. Hi Neo.X,
      Thanks for pointing this out! I had fixed it a long time ago in the code, but completely forgot to edit the post:
      https://sourceforge.net/projects/ccfreaks/files/java/src/com/concurrencyfreaks/counter/DistributedCacheLineCounter.java

      If it weren't for you paying attention, we could have had this bug here for a long time. Thanks for finding it!
