Thursday, August 22, 2013

Concurrency Pattern: Off-by-X Counter

In this post we are going to talk about counters.
Counters are a widely used construct on many concurrent systems. They allow to measure events, like for example, the number of incoming packets on a network device or server.

When there isn't too much contention, the simplest approach is to use an atomic variable and increment it with a fetch-and-add.
In C11 we would use atomic_fetch_add(), in C++11 atomic<>.fetch_add() and in Java AtomicLong.getAndIncrement(). Notice that in Java the getAndIncrement() is actually doing a compare-and-swap and not a fetch-and-add (there is a bug opened for that on the JDK). Moreover, not all CPUs support native fetch-and-add operations so even C11 and C++11 may use some "trick".

The problem with using an atomic variable is that under high contention you start to have a large number of memory barriers being done to read/write on the variable atomically.
To fix that, we propose here a kind of counter that uses relaxed atomic operations which can improve performance, and is a nifty example of how to make something useful with relaxed atomic operations.
... by the way, if you don't know what relaxed atomics are then you can first look at this presentation from Herb Sutter:

I named this thechnique "Off-by-X Counter". This counter that can be written only from a specific thread, but read from multiple threads. The trick is to use a relaxed atomic and do the release-barrier only once in a while, namely every X increments.

Usage scenario:
- Only one Writer that may increment often
- As many Readers as desired.
- Only a single thread can do writes and it must always be the same.
- Any thread can do reads.

It has three functions:
1. Increment the counter
2. Check if (counter % X) == 0 and if it is then do the release-barrier

1. Do the acquire-barrier
2. Read the value of the counter

1. Set the counter to zero
2. Do Release-barrier

The goal of this is to be used in a high-priority thread where you don't want to spend all the time doing release barriers for a variable that is only going to be read once in a while, and when it is read, it doesn't need to be completely up-to-date.
Notice that even if the release barrier is not done, as long as in some other place in the code there is a release barrier (or lock) the value will be updated, which means that most of the time

- It has undefined behavior if multiple threads call the increment() or clear() functions.
- In x86 CPUs the difference in performance is not impressive because there is no explicit acquire-barrier at the CPU level.

- Writes should be faster than on an atomic variable.

Here's what the code looks like on Java (I'll try to make a C++ version of it soon):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffByXCounter {
    private static final sun.misc.Unsafe UNSAFE;
    static {
        try {
            UNSAFE = getUnsafe();
        } catch (Exception e) {
            throw new Error(e);

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            return (Unsafe)f.get(null);
        } catch (Exception e) {
            throw new Error(e);

    // Member variables
    private long counter = 0;
    private final long theX;
    public OffByXCounter(long theX) {
        this.theX = theX;
    public long get() {
        return counter;
    public void increment() {
        if ((counter % theX) == 0) UNSAFE.storeFence();
    public void clear() {
        counter = 0;


  1. This comment has been removed by the author.

  2. A similar setup I use is via the lazySet method. This does not require a 'magicnumber' and the counter will eventually be published to other threads.

    public class LazyLong {
    private long primitive;
    private final AtomicLong atomicLong= new AtomicLong(primitive);

    public void incrementLazy(){

    public void addLazy(long value){

    public long get(){
    return atomicLong.get();