Wednesday, December 17, 2014

Left-Right Atomic Long No Version variant

For our second holiday gift, we have a new variant of Left-Right (yes, another one, boohoo). We call this one the "Atomic Long No Version" variant. It is not very scalable, but there is a specific use-case where it will be useful to the C11 and C++1x folks, so yes, this goodie is for you C/C++ guys... and the example code will be in C++1x.

The first thing to point out is that this variant uses an atomic_exchange(), like the GT variant does. One thing about the GT variant (that Andreia pointed out) is that we don't actually need a versionIndex, we just need the leftRight variable: the atomic_exchange() resets the ingress counter and toggles the leftRight in one go, which guarantees that every thread wishing to do an arrive()/depart() will increment the ingress for the corresponding leftRight, and thus eliminates the need for a versionIndex. By the way, the WriterReaderPhaser uses an atomic pointer instead of a leftRight variable, but that is equivalent to using a leftRight variable.

In the Atomic Long No Version (ALNV) variant we pack the counters and the leftRight into a single 64-bit atomic variable, with the following layout:
  • 20 bits for the ingress counter
  • 20 bits for the egress left counter
  • 20 bits for the egress right counter
  • 1 bit for the leftRight variable
  • 3 bits unused
We call this variable the leftRightCounters.
And yes, we could use 21 bits for each counter, but then we would either have to deal with the nastiness of having the leftRight bit act as the sign of the integer, or use an unsigned long long type.
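To make the snippets below easier to follow, here is a minimal sketch of what the layout constants and helper functions could look like. The names (BIT_INGRESS, getLeftRight(), composeLRC(), OVERFLOW_COUNTER, etc.) come from the snippets in this post, but the exact bit positions, the argument order of composeLRC(), and the masking are our assumptions here; the real header on github may differ.

#include <atomic>

// Hedged sketch of the layout constants and helpers used by the snippets below.
// Bit positions, argument order, and masking are assumptions; see LeftRightALNV.h for the real thing.
static const int BIT_INGRESS      = 0;    // bits  0..19: ingress counter
static const int BIT_EGRESS_LEFT  = 20;   // bits 20..39: egress left counter
static const int BIT_EGRESS_RIGHT = 40;   // bits 40..59: egress right counter
static const int BIT_LEFT_RIGHT   = 60;   // bit  60:     leftRight (bits 61..63 unused)
static const long long MASK_20_BITS     = (1LL << 20) - 1;
static const long long OVERFLOW_COUNTER = 1LL << 19;      // max_value/2
static const int READS_ON_LEFT  = 0;
static const int READS_ON_RIGHT = 1;

static std::atomic<long long> _leftRightCounters { 0 };   // starts as READS_ON_LEFT with all counters at zero

static long long getIngress(long long lrc)     { return (lrc >> BIT_INGRESS)      & MASK_20_BITS; }
static long long getEgressLeft(long long lrc)  { return (lrc >> BIT_EGRESS_LEFT)  & MASK_20_BITS; }
static long long getEgressRight(long long lrc) { return (lrc >> BIT_EGRESS_RIGHT) & MASK_20_BITS; }
static int getLeftRight(long long lrc)         { return (int)((lrc >> BIT_LEFT_RIGHT) & 1); }

// Packs the fields back into one 64-bit value. The argument order is inferred from the depart()
// snippet below; each field is masked to 20 bits, so a (possibly negative) renormalized egress
// is stored modulo 2^20.
static long long composeLRC(int leftRight, long long egressRight, long long egressLeft, long long ingress) {
    return ((long long)leftRight << BIT_LEFT_RIGHT)
         | ((egressRight & MASK_20_BITS) << BIT_EGRESS_RIGHT)
         | ((egressLeft  & MASK_20_BITS) << BIT_EGRESS_LEFT)
         | ((ingress     & MASK_20_BITS) << BIT_INGRESS);
}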



There are a few tricks in this algorithm.
The first one is that arrive() and depart() are done with atomic_fetch_add(), and that arrive() returns the current leftRight (extracted from the value returned by the atomic_fetch_add()), instead of returning the versionIndex as is done in the Classic Left-Right.



int arrive(void) {
    // Increment the ingress counter and grab the leftRight it was paired with, in one atomic operation
    const long long lrc = _leftRightCounters.fetch_add(1LL << BIT_INGRESS);
    return getLeftRight(lrc);
}

The depart() increments the appropriate egress counter within leftRightCounters, selected by the leftRight value that was returned by arrive(). This is still the same logic as the ingress/egress readIndicators, only now we use the leftRight directly instead of the versionIndex. There is some extra code at the end of depart() to prevent overflowing the counters, which we'll talk about further below.



void depart(int localLeftRight) {
    long long lrc;
    if (localLeftRight == READS_ON_LEFT) {
        lrc = _leftRightCounters.fetch_add(1LL << BIT_EGRESS_LEFT);
    } else {
        lrc = _leftRightCounters.fetch_add(1LL << BIT_EGRESS_RIGHT);
    }
    // Check if we need to handle a possible overflow
    while (true) {
        const long long ingress = getIngress(lrc);
        if (ingress < OVERFLOW_COUNTER) return;
        if (getLeftRight(lrc) == READS_ON_LEFT) {
            const long long egress = getEgressLeft(lrc);
            // Renormalize: reset ingress and keep only the difference in egressLeft
            if (_leftRightCounters.compare_exchange_strong(lrc, composeLRC(READS_ON_LEFT, 0, egress-ingress, 0))) return;
        } else {
            const long long egress = getEgressRight(lrc);
            // Renormalize: reset ingress and keep only the difference in egressRight
            if (_leftRightCounters.compare_exchange_strong(lrc, composeLRC(READS_ON_RIGHT, egress-ingress, 0, 0))) return;
        }
        lrc = _leftRightCounters.load();
    }
}
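
As a side note, here is a hedged sketch of how a reader would typically bracket a read-only operation with this pair of methods. It follows the usual Left-Right reader pattern; _instanceLeft/_instanceRight, KeyT, and the contains() call are illustrative names, not the actual code on github.

// Hedged sketch: a read-only operation in a hypothetical set protected by this variant
bool contains(const KeyT& key) {
    const int lr = arrive();          // increments ingress and tells us which instance to read
    const bool found = (lr == READS_ON_LEFT) ? _instanceLeft->contains(key)
                                             : _instanceRight->contains(key);
    depart(lr);                       // increments the egress of the same side we arrived on
    return found;
}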


The bulk of the core logic is in toggleVersionAndWait(). We use atomic_exchange() to reset all counters, toggle the leftRight, and obtain the previous values of the counters, all in a single atomic instruction... neat, huh?
We then check whether the last seen ingress value (returned by the atomic_exchange()) is the same as the last seen egress for the corresponding leftRight value. If it is, no thread has done arrive() without a matching depart(), which means no thread is accessing the corresponding instance, and we can now modify it safely. If the counters differ, then we need to loop, re-reading the latest egress value, until the sum of that egress value plus the last seen value of the same egress counter matches the last seen ingress (new increments of the ingress are for the opposite instance).
We've just described another one of the tricks: the need to add the two values of the same egress counter, because we reset the egress counter when we did the atomic_exchange(). It is true that we didn't really have to do the reset, but the value could change between doing the atomic_load() and doing the atomic_exchange(), so we would still need some of this logic, and this way we reset all counters in one shot, which is better for preventing overflow of the counters and the troubles associated with it.
The name toggleVersionAndWait() is now misleading because it is not the version that gets toggled but the leftRight, so I guess it should be called toggleLeftRightAndWait(). For consistency with the other variants of the Left-Right algorithm, though, we will keep the name here as well.



void toggleVersionAndWait(void) {
    const long long oldlrc = _leftRightCounters.load();
    if (getLeftRight(oldlrc) == READS_ON_LEFT) {
        // Toggle leftRight and reset all counters in a single atomic instruction
        const long long lastlrc = _leftRightCounters.exchange(composeLRC(READS_ON_RIGHT, 0, 0, 0));
        const long long lastIngress = getIngress(lastlrc);
        const long long lastEgressLeft = getEgressLeft(lastlrc);
        // Wait for egressLeft to catch up with the ingress seen at the toggle
        if (lastEgressLeft != lastIngress) {
            while (!isEmpty(READS_ON_LEFT, lastEgressLeft, lastIngress)) std::this_thread::yield();
        }
    } else {
        // leftRight is READS_ON_RIGHT
        const long long lastlrc = _leftRightCounters.exchange(composeLRC(READS_ON_LEFT, 0, 0, 0));
        const long long lastIngress = getIngress(lastlrc);
        const long long lastEgressRight = getEgressRight(lastlrc);
        // Wait for egressRight to catch up with the ingress seen at the toggle
        if (lastEgressRight != lastIngress) {
            while (!isEmpty(READS_ON_RIGHT, lastEgressRight, lastIngress)) std::this_thread::yield();
        }
    }
}



bool isEmpty(long long localLeftRight, long long localEgressAdd, long long lastIngress) {
    // Add the egress seen at the exchange() to the current egress, and compare against the
    // ingress seen at the exchange(): if they match, no reader is left on that side
    const long long lrc = _leftRightCounters.load();
    const long long egress = (localLeftRight == READS_ON_LEFT) ? getEgressLeft(lrc) : getEgressRight(lrc);
    return ((egress + localEgressAdd) == lastIngress);
}
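
For completeness, here is a hedged sketch of how the writer side drives this, following the usual Left-Right write protocol (it assumes #include <mutex>); _writersMutex, _instanceLeft/_instanceRight, KeyT, and the add() call are illustrative names, not the actual code on github.

// Hedged sketch: a mutation in the same hypothetical set
void add(const KeyT& key) {
    std::lock_guard<std::mutex> lock(_writersMutex);   // one writer at a time
    if (getLeftRight(_leftRightCounters.load()) == READS_ON_LEFT) {
        _instanceRight->add(key);     // mutate the instance readers are not using
        toggleVersionAndWait();       // send new readers to the right side and wait for old left readers
        _instanceLeft->add(key);      // now it is safe to mutate the left instance
    } else {
        _instanceLeft->add(key);
        toggleVersionAndWait();       // send new readers to the left side and wait for old right readers
        _instanceRight->add(key);
    }
}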


The last trick is in the way counter overflow is handled. As you might have guessed, 2^20 is small enough that a thread doing consecutive arrive()/depart() calls can overflow it in a short time span (it takes about 70 milliseconds on my laptop).
We can handle overflow in the depart() after incrementing the egress counter. For that, we check whether the ingress is above a certain threshold, which we define as max_value/2, in this case 2^19, and if the ingress is larger than that, then we trigger a renormalization of the counters. This renormalization procedure consists of subtracting the ingress value from the egress using a compare_exchange_strong() (or compareAndSet() in Java), which keeps the difference between ingress and egress (the number of readers that have arrived but not yet departed) unchanged. Now the thing is, if there is another thread doing arrive() or depart() at the same time, it will change the value of leftRightCounters and the CAS will fail, but that thread will itself end up attempting the same renormalizing CAS, and eventually one of the threads will succeed because there is a finite number of threads. This means that this operation is Wait-Free bounded by the number of threads (i.e., it is not WFPO).
It can also happen that a thread doing toggleVersionAndWait() changes the leftRightCounters, but by doing so it will be resetting the counters, and the threads attempting the renormalization will see that the ingress is now below the threshold and return from depart(), so it does not change the progress condition.


So what does this look like in action?
Well, it's not as bad as I originally thought it would be, but under high arrive()/depart() contention the scalability sucks. Here are some plots comparing the LROLL that we showed in a previous post against this variant protecting the same kind of linked list:
We ran the same microbenchmarks on an AMD Opteron with 32 cores, for a list with 100 items and a list with 1000 items.



The list with 100 items has much higher contention on the read-only operations, so we can really see the impact on performance.


The plots with "100% Writes" represent a run where 50% of the operations were add() and 50% were remove(). The runs with "0% Writes" represent a run where all operations were contains().
Just as a side note: the LROLLS (a single linked list instance) shown in this post is faster than either of these two implementations, if a linked list is what you're looking for.


Like we said before, this technique isn't as good for scalability as the Classic Left-Right (and most of the other variants), but it's still ok for latency: the latency distribution shifts a bit to the right, but the behavior of the tail is still good.

So, what is this good for?
This technique is interesting if you want to save memory, because you can protect an object (a pair of instances) with just one mutex and three 64-bit variables: two pointers to the instances, and one 64-bit integer for leftRightCounters. If you're really stringent about memory, you can replace the std::mutex with an atomic_bool and make a spinlock, thus saving a few extra bytes.
This economy of memory is useful in practice if you have many small objects and you want to protect each of them with its own Left-Right, so as to have wait-free read access to each one of them.
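
As a rough illustration of that footprint, here is a hedged sketch of what the per-object state could look like; the struct and field names are illustrative, and the spinlock is the atomic_bool option mentioned above.

#include <atomic>

// Hedged sketch of the per-object footprint: two pointers, one 64-bit leftRightCounters,
// and an atomic_bool used as a test-and-set spinlock instead of std::mutex
template<typename T>
struct LeftRightALNVBox {
    T* _instanceLeft;                              // first copy of the protected object
    T* _instanceRight;                             // second copy of the protected object
    std::atomic<long long> _leftRightCounters { 0 };
    std::atomic<bool> _writersLock { false };      // writer spinlock

    void writerLock()   { while (_writersLock.exchange(true)) { } }   // spin until acquired
    void writerUnlock() { _writersLock.store(false); }
};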

We'll elaborate on this a bit more on a future post.
In the meantime, here is the source code on github:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/leftright/LeftRightALNV.h
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/lists/LRALNVOrderedLinkedList.h

2 comments:

  1. Pedro, have you thought about having >2 copies in your algorithm?

I've been playing with left-right for a while, and look: although LR is generally branded as read-efficient, with a single writer thread and write batching it could actually be quite write-efficient, because locking could be eliminated and the copy-switching cost amortized. The only thing left is the doubling of each write operation, but even this is not always an issue: there are cases in which the data structure guarded by LR is itself quite small, and it is much more efficient to just copy it as a whole once on LR-switching than to duplicate each write inside the batch (another way to deal with it may be introducing a kind of "dirty marks" which allows copying only the modified part of the guarded data structure).

The only source of delays left in the writer code is waiting for old readers. And this is where >2 copies may help: if you have only 2 copies there is quite a high probability that somebody is reading the copy you're currently supposed to write to, and you're forced to wait. But if you have, say, 3 copies, and switch between them once per batch (i.e. not so often), then most probably all readers will be reading the copy which is becoming hidden right now, and the 3rd copy which needs to be updated (and which was switched on the previous batch) is most probably free of readers, because so much time has passed since it was the readable one. So you could most probably switch without waiting.

I came to this idea while trying to use LR as a guard for monitoring counters. Say, you have a working thread (read: actor) which does some business and also maintains some self-monitoring (tx numbers, latency histograms). And from time to time you need to get the monitoring results out of this thread (read: to a JMX thread). You do not want to introduce heavy synchronization on these counters, since it could slow down your perfectly optimized single-threaded code. But monitoring counters are small, you could copy them with ease, and it is perfectly OK to update the readable version once per many transactions, so LR more-or-less fits, but waiting for readers on the switch may still dirty our latency. And so I came to the idea of having >2 copies...

    What do you think?

    Replies
    1. Hi Ruslan,
      It's good to know somebody uses our work for useful stuff ;)

      Yes, we have thought about algorithms using more than 2 instances but couldn't come up with an interesting use-case for them.

      Yes, writes can be batched, but they'll still be blocking, or if non-blocking, they won't be Linearizable any more.

      Yes, once more you are correct when you say that for small data structures it is better to apply all (batched) mutations on one side, and then just copy the whole data structure to the other side. That idea is closer to what we did for COW MutationQ, which is completely wait-free, for both mutations and read-only operations:
      https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/cowmq-2015.pdf

      Yes, we particularly thought about 3 instances, and I think that if there is any N which may be better than 2, it is 3, but only under very specific circumstances. You said so yourself, that such an approach would reduce (but not eliminate) the probability of waiting for Readers, but generally, having the extra cost of updating one more instance will be disadvantageous. This means it is still Blocking for Writers (even if there's only one).
      Close to this topic, you may want to see the post by David Goldblatt:
      http://dgoldblatt.com/going-from-lock-free-to-wait-free.html

In summary, for the scenario you describe, I would try the COWMutationQ first, and if that doesn't do it in terms of latency, then go for a Left-Right with 3 instances.
