Tuesday, October 8, 2013

Immutable data structures are not as good as you think they are

In this post we are going to talk about "Immutable data structures", also known as "Persistent data structures" or "Functional data structures". We intend to show what their advantages and disadvantages are in relation to other techniques.

Let's start with a very short intro on what an Immutable data structure is.
The most well known is a tree-set data structure that follows an "optimized" kind of Copy-On-Write pattern: instead of copying the full tree, only the nodes that have been "modified" are copied, and all the other ones are re-used in the new "logical tree". This keeps the number of operations in the order of O(ln N), instead of the O(N) that would be required if we had to make a copy of the entire data structure.


A good explanation of the basic workings of this kind of data structure can be found in this post:
http://debasishg.blogspot.co.uk/2010/05/grokking-functional-data-structures.html
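
To make the path-copying idea concrete, here is a minimal sketch of such a tree in Java. The class (ImmutableIntSet) and its layout are our own invention for illustration only; real implementations also keep balance information (red-black colors or AVL heights), which this sketch omits.

// Minimal sketch of a path-copying (persistent) binary search tree.
// Illustrative code only: no balancing is performed.
public final class ImmutableIntSet {
    private static final class Node {
        final int key;
        final Node left, right;
        Node(int key, Node left, Node right) {
            this.key = key; this.left = left; this.right = right;
        }
    }

    private final Node root;

    public ImmutableIntSet() { this(null); }
    private ImmutableIntSet(Node root) { this.root = root; }

    // Returns a new set containing 'key'; the original set is untouched.
    // Only the nodes on the path from the root to the insertion point are
    // copied (about ln N of them in a balanced tree); all other nodes are
    // shared between the old and the new tree.
    public ImmutableIntSet add(int key) {
        return new ImmutableIntSet(insert(root, key));
    }

    private static Node insert(Node node, int key) {
        if (node == null) return new Node(key, null, null);
        if (key < node.key) return new Node(node.key, insert(node.left, key), node.right);
        if (key > node.key) return new Node(node.key, node.left, insert(node.right, key));
        return node; // key already present: the existing subtree is reused as-is
    }

    public boolean contains(int key) {
        Node node = root;
        while (node != null) {
            if (key == node.key) return true;
            node = (key < node.key) ? node.left : node.right;
        }
        return false;
    }
}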

Garbage Collector

The need for a Garbage Collector is the first disadvantage of Immutable data structures. It means that this technique can only be implemented in languages with a GC. Nowadays this isn't a big issue by itself because most modern languages have a GC, but it doesn't mean that the GC is fast or efficient. Most GC mechanisms still work with some kind of "global lock", which can severely hit performance.


Memory constraints

Creating a new instance of an immutable tree set can be a very lengthy procedure. Suppose you want to build a tree with 6 elements, adding one element at a time and creating a new tree from the old one at each step. Schematically it will look like this:

where the nodes shown in red are the ones that have just been allocated, and the ones in blue were allocated on a previous iteration.

In case you didn't notice, this requires O(N ln N) operations, because each insertion needs to copy at least "ln N" nodes, and more than "ln N" for operations that require a rebalance of the tree.
Not only that, but depending on how fast the GC is, it may also require O(N ln N) memory to be available, because that is the sum of all the allocated nodes: five instances of node 4, plus four instances of node 2, plus three instances of node 6... and so on. A sketch of this build-up is shown at the end of this section.
This is a constraint which, in practice, rules these kinds of data structures out of most "big data" applications and out of embedded systems.
Big data apps usually require large data structures, and a tree with a million entries is usually enough to choke most languages/implementations/GCs. For example, we tried Functional Java's immutable tree map (fj.data.TreeMap, linked below) with 1 million entries, and just creating the tree took many hours. This is understandable, given the large amount of memory allocation and freeing that goes on when the tree is first created.
Apps in embedded systems usually have very limited memory, which means that the GC runs often and inefficiently. A good example of this is Android, and a lengthy discussion on this topic can be seen in this article: http://concurrencyfreaks.blogspot.co.uk/2013/07/why-mobile-web-apps-are-slow.html
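
Here is the build-up sketch referred to above, using the illustrative ImmutableIntSet class from the beginning of the post (the class name and the loop are our own, not code from any of the libraries linked below):

import java.util.Random;

public class BuildImmutableSet {
    public static void main(String[] args) {
        // With a balanced implementation, each add() copies roughly ln N
        // nodes, so building a set of N elements allocates on the order of
        // N ln N nodes in total, and every intermediate tree becomes
        // garbage that the GC has to reclaim.
        Random rand = new Random(42);
        ImmutableIntSet set = new ImmutableIntSet();
        for (int i = 0; i < 1_000_000; i++) {
            set = set.add(rand.nextInt()); // the previous 'set' is now garbage
        }
        System.out.println(set.contains(0));
    }
}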


Performance

How about performance? How fast are immutable data structures?

Consider the case of adding a new element to the tree. On an immutable tree, this will require the allocation (and possibly the release) of ln N nodes. On a tree with a lock-based approach, only a single node will be created, and another node will be mutated (the parent node); a sketch of this lock-based case is shown at the end of this section. We're assuming for both cases that no rebalancing is needed.
For the immutable case, the fact that at least ln N new nodes need to be allocated (and ln N old ones may be collected by the GC) is bad enough in itself, but it might not affect performance directly. What does affect performance directly are the cache misses it causes.
On a multi-core system, each core will usually have its own L1/L2 cache, which means it is very possible that most of the tree is already in L2; but when a new set of nodes is created, they must be read into the L2 cache of the remaining cores, which will cause several cache misses.
If you consider the NUMA case, it can be even more painful. The new tree nodes will be allocated in the heap (memory modules) assigned to the particular NUMA node that created them, and if other NUMA nodes need to access them, they will have to go through the interconnect. If a lot of write operations are going on in the tree, this can create a lot of traffic on the interconnect and drag performance down.
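
For contrast, here is the lock-based sketch mentioned above. Again, this is our own illustrative code (not one of the implementations linked at the end of the post): a mutable, unbalanced tree protected by a single lock, where an insertion allocates exactly one node and mutates one field of the parent.

import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of a lock-based mutable binary search tree (no balancing).
// An insertion allocates a single node and mutates a single parent field;
// the rest of the tree stays untouched, in place.
public final class LockedIntSet {
    private static final class Node {
        final int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private final ReentrantLock lock = new ReentrantLock();
    private Node root;

    public boolean add(int key) {
        lock.lock();
        try {
            if (root == null) { root = new Node(key); return true; }
            Node node = root;
            while (true) {
                if (key == node.key) return false; // already present
                if (key < node.key) {
                    if (node.left == null) { node.left = new Node(key); return true; }
                    node = node.left;
                } else {
                    if (node.right == null) { node.right = new Node(key); return true; }
                    node = node.right;
                }
            }
        } finally {
            lock.unlock();
        }
    }
}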

Now consider the case of removing a node from the tree. On an immutable tree, if the operation does not trigger a rebalance, ln N nodes will be created and ln N will have to be cleared by the GC. On the lock-based tree, no new tree nodes will be created, and a single node will be mutated: the parent of the node being removed.
Again, in this case, the lock-based tree may incur a single cache miss, due to the node that has been modified, while the immutable tree will trigger a series of cache misses due to the newly allocated nodes.

The last case is the one where we search for a particular element in the tree. On an immutable tree, this requires a single acquire-barrier on the variable that holds the reference to the root node (or instance) of the tree. On a mutable tree, at the very least two acquire-barriers will be required: one to read some kind of stamp before the operation, and another to check that the stamp hasn't changed at the end of the operation, meaning that no write-modify operation ran during the search. For an example of this, take a look at the tryOptimisticRead() functionality in StampedLock (a sketch of both read paths is shown at the end of this section):
http://cr.openjdk.java.net/~chegar/8005697/ver.00/javadoc//StampedLock.html#tryOptimisticRead%28%29
Other locking mechanisms may require a much larger number of operations and memory barriers to protect the read-only operations.
This means that read-only operations can be much faster on Immutable data structures than on lock-based data structures, as long as very few write/modify operations are performed. This is by far the main advantage of Immutable data structures when compared with other lock-based techniques.
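
Here is the read-path sketch mentioned above. The StampedLock calls (tryOptimisticRead(), validate(), readLock(), unlockRead()) are the ones from the javadoc linked above; the surrounding class, the field names, and the use of a TreeSet are our own illustrative scaffolding.

import java.util.TreeSet;
import java.util.concurrent.locks.StampedLock;

// Sketch of the two read paths discussed above.
public final class ReadPaths {

    // Immutable case: one volatile (acquire) read of the root reference,
    // followed by a plain traversal of nodes that can never change.
    private volatile ImmutableIntSet immutableSet = new ImmutableIntSet();

    public boolean immutableContains(int key) {
        return immutableSet.contains(key); // single acquire-barrier
    }

    // Lock-based case with optimistic reads: one read of the stamp before
    // the traversal and one validation afterwards (at least two barriers),
    // falling back to the read-lock if a writer interfered. A real
    // implementation must also guarantee that the optimistic traversal
    // cannot loop forever or throw when it races with a writer; this
    // sketch glosses over that detail.
    private final StampedLock stampedLock = new StampedLock();
    private final TreeSet<Integer> mutableSet = new TreeSet<Integer>();

    public boolean lockedContains(int key) {
        long stamp = stampedLock.tryOptimisticRead();
        boolean found = mutableSet.contains(key);
        if (!stampedLock.validate(stamp)) { // a writer ran: retry under the read-lock
            stamp = stampedLock.readLock();
            try {
                found = mutableSet.contains(key);
            } finally {
                stampedLock.unlockRead(stamp);
            }
        }
        return found;
    }
}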


Latency

The scenario for Latency is similar to Performance, but with a slight twist.
The lock-based approach should, on average, take less time, but in the worst case it may take an unspecified amount of time, particularly if the lock is not fair. The write operations on the immutable tree can be made Lock-Free, while the write operations on the lock-based tree are always Blocking, and therefore the lock-based approaches give worse latency guarantees in general. But it will depend on how strong the desired latency guarantee is.
For read operations, if only occasional writes occur, the immutable tree will have much better (lower) latency.


Code Examples

Scala's Immutable TreeMap:
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.Map

Functional Java Immutable TreeMap:
https://github.com/functionaljava/functionaljava/blob/master/core/src/main/java/fj/data/TreeMap.java

Java Hand-Over-Hand Lock-based TreeMap implementation:
https://github.com/nbronson/snaptree/blob/master/src/main/java/edu/stanford/ppl/concurrent/SnapTreeMap.java

Scalable RW-Lock that can be used to protect a mutable TreeMap:
http://sourceforge.net/projects/ccfreaks/files/java/src/com/concurrencyfreaks/locks/LongAdderRWLock.java/download
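
As a small usage example for the Functional Java TreeMap linked above, the snippet below builds a map and looks up a key. We believe the method names (TreeMap.empty, set, get, orSome) match the library at the time of writing, but check the javadoc of your version, as the API may differ.

import fj.Ord;
import fj.data.Option;
import fj.data.TreeMap;

// Usage sketch for Functional Java's immutable TreeMap.
public class FjTreeMapExample {
    public static void main(String[] args) {
        TreeMap<Integer, String> m0 = TreeMap.empty(Ord.intOrd);
        TreeMap<Integer, String> m1 = m0.set(1, "one"); // returns a new map
        TreeMap<Integer, String> m2 = m1.set(2, "two"); // m0 and m1 are unchanged

        Option<String> v = m2.get(2);
        System.out.println(v.orSome("missing"));         // prints "two"
        System.out.println(m0.get(2).orSome("missing")); // prints "missing"
    }
}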


Conclusion

In the end, it boils down to a simple question: Do you really need the extra performance or latency guarantees in your application?
- If the answer is no, then you are better off using a single-threaded approach. If you're not interested in the performance gain or stronger latency guarantees, there is no point in increasing the complexity of your app by introducing threads (or actors, or tasks). The pain you're going to get from having to debug a multi-threaded concurrent application is just not worth it.
- If the answer is yes, then it means performance and latency are important metrics for you. In that case, the immutable approach may still be a worthy investment, but only if write operations on the data structure are not very common, your application is written in a language with some functional capability, and, preferably, you have a lot of free memory and a good GC.

More often than not, immutable data structures will not be a practical solution, and lock-based solutions will be preferable.
There are alternatives to the currently known lock-based solutions, but I can't say more about them for the moment  ;)

My personal advice: if you need the extra performance, try the immutable approach, benchmark it against a lock-based solution using your own application, and decide based on that. You might be surprised by the results.
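
If you want a starting point for such a benchmark, the crude single-threaded sketch below compares the build time of the illustrative ImmutableIntSet from earlier in the post with a synchronized java.util.TreeSet. A serious comparison would need multiple threads, warm-up, a realistic mix of reads and writes, and a proper benchmark harness.

import java.util.Collections;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Very crude single-threaded micro-benchmark sketch.
public class CrudeBenchmark {
    public static void main(String[] args) {
        final int N = 100_000;
        final Random rand = new Random(42);
        final int[] keys = new int[N];
        for (int i = 0; i < N; i++) keys[i] = rand.nextInt();

        long start = System.nanoTime();
        ImmutableIntSet immutable = new ImmutableIntSet(); // sketch from earlier
        for (int key : keys) immutable = immutable.add(key);
        System.out.println("immutable build:  " + (System.nanoTime() - start) / 1_000_000 + " ms");

        start = System.nanoTime();
        Set<Integer> locked = Collections.synchronizedSet(new TreeSet<Integer>());
        for (int key : keys) locked.add(key);
        System.out.println("lock-based build: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}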




Comments:

  1. Immutability is more about program correctness. Ignoring the fact that immutable objects are easier to reason about, I would agree that if you don't need concurrency you shouldn't use it and complicate your system.

    Another thing you aren't taking into account is the fact that our current generation of garbage collectors aren't designed with immutable objects in mind; if you know objects aren't changing, you can perhaps optimize garbage collection significantly.

  2. I've been thinking simple thoughts about this. How about: http://qr.ae/MLTf1 ?
