Sunday, August 25, 2013

Concurrency Pattern: Distributed Cache-Line Counter

In the previous post we showed a counter that works for single-Writer scenarios; this time we have one that works for multiple Writers and multiple Readers, which we named the Distributed Cache-Line Counter.

The trick behind this counter is that it distributes the Writers among different cache lines, and the Readers then have to sum up the counters in all of the cache lines.
It is sequentially consistent for get() as long as only increment() operations are done. If a clear() is done, then get() will no longer be sequentially consistent.

Usage scenario:
- Multiple Writers that increment often.
- Preferably not too many Reads.
- Any thread can do writes.
- Any thread can do reads.

It has one array of "counters" where each entry is separated from the next by enough padding that consecutive counters end up on different cache lines (with 8-byte longs and 64-byte cache lines, only every 8th entry of the array is actually used).
There is a hashing function that takes the id of the thread and generates an index into the array of counters, so that statistically there is only a small chance of two Writer threads doing an increment() on the same memory location (cache line).
The Reader has to sum up the values of all the counters in the array, which means it can be slow compared to a Reader acting on a single atomic variable.


The DistributedCacheLineCounter has three functions:
increment():
1. Hash the thread id, take the result modulo the number of counters, and multiply by the cache-line stride to get the index into the array (see the worked example after this list)
2. Do a fetch-and-add of +1 on that entry of the array


get():
1. Sum up all the entries in the array

clear():
1. Set each of the entries in the array to zero
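
As a worked example of the index computation in step 1 of increment(), here are some made-up numbers: a hypothetical 4-core machine, so kNumCounters = 4*3 = 12, and COUNTER_CACHE_LINE = 64/8 = 8, matching the constants in the Code section below (the IndexExample name and the hash value 51 are just for illustration):

public class IndexExample {
    public static void main(String[] args) {
        final int kNumCounters = 12;        // hypothetical: 4 cores * 3
        final int COUNTER_CACHE_LINE = 8;   // 64-byte cache line / 8-byte long
        final long hashedTid = 51;          // made-up output of the thread-id hash
        // Slot 51 % 12 = 3, scaled to entry 3 * 8 = 24 of the backing array
        final int idx = (int)((hashedTid % kNumCounters) * COUNTER_CACHE_LINE);
        System.out.println(idx);            // prints 24
    }
}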


Note that if sequential consistency is desired even when clear() operations are performed, you can always protect the counter with a rw-lock, doing a sharedLock() for increment() and get(), and an exclusiveLock() for clear(). This will hurt performance, though, which kind of defeats the purpose of using a "simple" counter. A sketch of this wrapper is shown below.
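
Here is a minimal sketch of that rw-lock wrapper, using java.util.concurrent's ReentrantReadWriteLock (the SeqCstCounter name is ours; it wraps the DistributedCacheLineCounter class shown in the Code section below):

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SeqCstCounter {
    private final DistributedCacheLineCounter counter = new DistributedCacheLineCounter();
    private final ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock();

    public void increment() {
        rwlock.readLock().lock();   // shared: increments can still run concurrently
        try {
            counter.increment();
        } finally {
            rwlock.readLock().unlock();
        }
    }

    public long get() {
        rwlock.readLock().lock();   // shared: get() can run alongside increments
        try {
            return counter.get();
        } finally {
            rwlock.readLock().unlock();
        }
    }

    public void clear() {
        rwlock.writeLock().lock();  // exclusive: no increment() or get() during clear()
        try {
            counter.clear();
        } finally {
            rwlock.writeLock().unlock();
        }
    }
}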

Disadvantages:

- It's not linearizable (but who cares about that?)
- It's only sequentially consistent as long as the decrement() and clear() functions are not used. This is usually not a problem: from a logical point of view there is little sense in performing a clear() at the same time as a get(), so if you do, you'll get an inconsistent state, which is fine for nearly all applications.
- It requires more memory than a simple atomic variable because it needs to reserve (at least) one cache line per core.
- Reads are slower than on an atomic variable.

Advantages:

- Performance for Writes can be much better than that of an atomic counter under high contention.

Performance Plots:

Here are some performance plots comparing an AtomicLong from java.util.concurrent.atomic with the Distributed Cache-Line Counter. As you can see, this new counter scales almost linearly with the number of threads doing increment() (at least on our 32-core machine).
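
The benchmark harness itself isn't shown in this post, but a minimal throughput test along the same lines could look like the sketch below (the CounterBench name and the 10-million-increments-per-thread workload are made up for illustration):

public class CounterBench {
    public static void main(String[] args) throws InterruptedException {
        final int numThreads = Integer.parseInt(args[0]);
        final long opsPerThread = 10_000_000L;
        final DistributedCacheLineCounter dclc = new DistributedCacheLineCounter();
        final Thread[] threads = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            threads[i] = new Thread(() -> {
                for (long op = 0; op < opsPerThread; op++) dclc.increment();
            });
        }
        final long startNs = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        final long elapsedMs = Math.max(1L, (System.nanoTime() - startNs) / 1_000_000L);
        System.out.println((numThreads * opsPerThread) / elapsedMs + " ops/ms");
    }
}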

Code:



import java.util.concurrent.atomic.AtomicLongArray;

public class DistributedCacheLineCounter {
    // Size of the counters[] array: 3x the number of cores, to (statistically) reduce the chance of two threads hashing to the same counter
    private final static int kNumCounters = Runtime.getRuntime().availableProcessors()*3;
   
    // Size of a cache line, measured in longs (64 bytes / 8-byte long)
    private final static int COUNTER_CACHE_LINE = 64/8;
   
    // The array of counters, padded so that each used entry sits on its own cache line
    private final AtomicLongArray counters = new AtomicLongArray(kNumCounters*COUNTER_CACHE_LINE);   
   
    /**
     * An imprecise but fast hash function, based on George Marsaglia's
     * xorshift algorithm. Thread ids are small positive values, so x
     * stays non-negative and the modulo below yields a valid index.
     */
    private int tid2hash() {
        long x = Thread.currentThread().getId();
        x ^= (x << 21);
        x ^= (x >>> 35);
        x ^= (x << 4);
        // Pick one of the kNumCounters slots, then scale by the cache-line stride
        final int idx = (int)((x % kNumCounters)*COUNTER_CACHE_LINE);
        return idx;
    }
   
    public void increment() {
        counters.getAndIncrement(tid2hash());
    }
       
    public long get() {
        long sum = 0;
        for (int idx = 0; idx < kNumCounters*COUNTER_CACHE_LINE; idx += COUNTER_CACHE_LINE) {
            sum += counters.get(idx);
        }
        return sum;
    }
   
    public void clear() {
        for (int idx = 0; idx < kNumCounters*COUNTER_CACHE_LINE; idx += COUNTER_CACHE_LINE) {
            counters.set(idx, 0);
        }
    }
}
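
A quick usage sketch, just exercising the three methods above (the CounterExample name is ours):

public class CounterExample {
    public static void main(String[] args) {
        final DistributedCacheLineCounter counter = new DistributedCacheLineCounter();
        counter.increment();                // writer side: call from any thread
        counter.increment();
        System.out.println(counter.get());  // reader side: sums all the slots, prints 2
        counter.clear();                    // resets every slot to zero
        System.out.println(counter.get());  // prints 0
    }
}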



9 comments:

  1. Very interesting blog, can you please explain a bit about your magical tid2hash function?
    Did you benchmark it against LongAdder, which uses Striped64?

  2. Hi Ashkrit,
    The tid2hash() is just a "randomizer" based on George Marsaglia's algorithm: http://www.javamex.com/tutorials/random_numbers/xorshift.shtml

    No, I haven't compared with LongAdder... in fact, I didn't even know about it until I saw your comment ;)
    I'll re-run the microbenchmark with LongAdder and post the results soon.
    Thanks

    1. Here is the link
      http://concurrencyfreaks.blogspot.co.uk/2013/09/longadder-and-dclc.html

  3. Thanks for sharing your results. I also did a quick prototype for scalable counters, inspired by your blog.

    http://ashkrit.blogspot.sg/2013/09/scalable-counters-for-multi-core.html

    Do you have access to a Xeon-based desktop?
    The results of my test on a Xeon are very surprising: the performance of all the counters is almost the same, even though there are CAS failures for the AtomicLong.
    On the Xeon, CAS failures don't make any difference in the overall timing.
    Later I will use some of the back-off strategies mentioned by Dave Dice in his blog for the counter and test it on the Xeon.

    1. Hi Ashkrit,
      I looked at the source code you have on GitHub and it seems to me that the reason they all have the same performance is that all those implementations are suffering from "false sharing". Adjacent entries in the atomicArray variable of your CoreBaseCounter are usually sharing a cache line, where every 8 entries share the same cache line.
      In addition, some of the big differences in performance are only noticeable when you go up to a large number of cores.
      Even so, after you fix the false-sharing issue, try running with all threads doing only increment() to see how many more ops you get. On my Core i7 I see 65k ops/ms for an AtomicLong and 113k ops/ms for the DCLC when running with 4 threads.

      Btw, I've heard Doug Lea mention a couple of times that even adding "padding" to an array may not be enough to avoid false sharing. The only sure way is to use @sun.misc.Contended

      Cheers

    2. correction: (...) every 16 entries share the same cache line (...)

  4. You were right, it was due to false sharing: after I added another implementation with a padded counter, performance was in the expected range.
    Thanks.

  5. Nice article, but there is a bug I think:
    in get():
    for (int idx = 0; idx < kNumCounters; idx += COUNTER_CACHE_LINE) {
    sum += counters.get(idx);
    }

    should be
    for (int idx = 0; idx < kNumCounters; idx++) {
    sum += counters.get(idx * kNumCounters);
    }

    same thing in clear()

    1. Hi Neo.X,
      Thanks for pointing this out! I had fixed it a long time ago in the code, but completely forgot to edit the post:
      https://sourceforge.net/projects/ccfreaks/files/java/src/com/concurrencyfreaks/counter/DistributedCacheLineCounter.java

      If it weren't for you paying attention, we could have had this bug here for a long time. Thanks for finding it!
