Data Oriented Parallel Value Interner
In this post, I will present a theoretical design for an interner. It should be fast, but there will be no benchmarks as I haven’t implemented the thing. So it might actually be completely broken or super slow for one reason or another. Still, I think there are a couple of neat ideas, which I would love to call out.
The context for the post is this talk by Andrew Kelley, which notices that it’s hard to reconcile interning and parallel compilation. This is something I have been thinking about a lot in the context of rust-analyzer, which relies heavily on pointers, atomic reference counting and indirection to make incremental and parallel computation possible.
And yes, interning (or, more generally, assigning unique identities to things) is a big part of that.
Usually, compilers intern strings, but we will be interning trees today.
Specifically, we will be looking at something like a
Value type from the Zig compiler.
In a simplified RAII style it could look like this:
Such values are individually heap-allocated and in general are held behind pointers. Zig’s compiler adds a couple of extra tricks to this structure, like not overallocating for small enum variants:
But how do we intern this stuff, such that:
values are just
u32rather than full pointers,
- values are deduplicated,
- and this whole construct works efficiently even if there are multiple threads using our interner simultaneously?
Let’s start with concurrent
Segmented list is like
ArrayList with an extra super power that pushing new items does not move/invalidate old ones.
ArrayList, when the backing storage fills up, you allocate a slice twice as long, copy over the elements from the old slice and then destroy it.
SegmentList, you leave the old slice where it is, and just allocate a new one.
Now, as we are writing an interner and want to use
u32 for an index, we know that we need to store
1<<32 items max.
But that means that we’ll need at most 31 segments for our
So we can just “pre-allocate” array of 31 pointers to the segments, hence
If we want to be more precise with types, we can even use a tuple whose elements are nullable pointers to arrays of power-of-two sizes:
Indexing into such an echeloned array is still O(1). Here’s how echelons look in terms of indexes
n echelons hold
2**n - 1 elements.
So, if we want to find the
ith item, we first find the echelon it is in, by computing the nearest smaller power of two of
i + 1, and then index into the echelon with
i - (2**n - 1), give or take a
+1 here or there.
Note that we pre-allocate an array of pointers to segments, but not the segments themselves.
Pointers are nullable, and we allocate new segments lazily, when we actually write to the corresponding indexes.
This structure is very friendly to parallel code.
Reading items works because items are never reallocated.
Lazily allocating new echelons is easy, because the position of the pointer is fixed.
That is, we can do something like this to insert an item at position
- compute the echelon index
if the pointer is null
- allocate the echelon
@cmpxchgStrong(.Acquire, .Release)the pointer
- free the redundant echelon if exchange failed
- insert the item
Notice how we don’t need any locks or even complicated atomics, at the price of sometimes doing a second redundant allocation.
One thing this data structure is bad at is doing bounds checks and tracking which items are actually initialized. For the interner use-case, we will rely on an invariant that we always use indexes provided to use by someone else, such that possession of the index signifies that:
- the echelon holding the item is allocated
- the item itself is initialized
- there’s the relevant happens-before established
If, instead, we manufacture an index out of thin air, we might hit all kinds of nasty behavior without any bullet-proof way to check that.
Okay, now that we have this
SegmentList, how would we use them?
Recall that our simplified value is
Of course we will struct-of-array it now, to arrive at something like this:
Value is now an index.
This index works for two fields of
That is, the index addresses five bytes of payload, which is all that is needed for small values.
For large tags like
data field stores an index into the corresponding payload
That is, every value allocates a
data elements, but only actual
u64s occupy a slot in
So now we can write a
lookup function which takes a value index and reconstructs a value from pieces:
Note that here
ValueFull is non-owning type, it is a reference into the actual data.
Note as well that aggregates now store a slice of indexes, rather than a slice of pointers.
Now let’s deal with creating and interning values.
We start by creating a
ValueFull using data owned by us
(e.g. if we are creating an aggregate, we may use a stack-allocated array as a backing store for
Then we ask
ValueTable to intern the data:
If the table already contains an equal value, its index is returned.
Otherwise, the table copies
ValueFull data such that it is owned by the table itself, and returns a freshly allocated index.
For bookkeeping, we’ll need a hash table with existing values and a counter to use for a fresh index, something like this:
Pay attention to
_count fields — we have
value_count guarding the
index, and separate counts for specific kinds of values, as we don’t want to allocate, e.g. an
u64 for every value.
Our hashmap is actually a set which stores
u32 integers, but uses
ValueFull to do a lookup: when we consider interning a new
ValueFull, we don’t know its index yet.
getOrPutAdapted API provides the required flexibility.
We can use it to compare a
Value (index) and a
ValueFull by hashing a
ValueFull and doing component-wise comparisons in the case of a collision.
Note that, because of interning, we can also hash
As any subvalues in
ValueFull are guaranteed to be already interned, we can rely on shallow hash and hash only child value’s indexes, rather than their data.
This is a nice design for a single thread, but how do we make it thread safe?
The straightforward solution would be to slap a mutex around the logic in
This actually is not as bad as it seems, as we’d need a lock only in
lookup would work without any synchronization whatsoever.
Recall that obtaining an index of a value is a proof that the value was properly published.
Still, we expect to intern a lot of values, and that mutex is all but guaranteed to become a point of contention.
And some amount of contention is inevitable here — if two threads try to intern two identical values, we want them to clash, communicate, and end up with a single, shared value.
There’s a rather universal recipe for dealing with contention — you can shard the data. In our case, rather than using something like
we can do
That is, we create not one, but sixteen hashmaps, and use, e.g., lower 4 bits of the hash to decide which mutex and hashmap to use. Depending on the structure of the hashmap, such locks could even be pushed as far as individual buckets.
This doesn’t solve all our contention problems — now that several threads can simultaneously intern values (as long as they are hashed into different shards) we have to make all
count variables atomic.
So we essentially moved the single global point of contention from a mutex to
value_count field, which is incremented for every interned value.
We can apply the sharding trick again, and shard all our
But that would mean that we have to dedicate some bits from
Value index to the shard number, and to waste some extra space for non-perfectly balanced shards.
There’s a better way — we can amortize atomic increments by allowing each thread to bulk-allocate indexes.
That is, if a thread wants to allocate a new value, it atomically increments
value_cont by, say,
1024, and uses those indexes for the next thousand allocations.
In addition to
ValueTable, each thread now gets its own distinct
An attentive reader would notice a bonus here: in this setup, a thread allocates a contiguous chunk of values. It is reasonable to assume that values allocated together would also be used together, so we potentially increase future spatial locality here.
Putting everything together, the pseudo-code for interning would look like this:
Note that it is important that we don’t release the mutex immediately after assigning the index for a value, but rather keep it locked all the way until we fully copied thee value into the
If we release the lock earlier, a different thread which tries to intern the same value would get the correct index, but would risk accessing partially-initialized data.
This can be optimized a bit by adding value-specific lock (or rather, a
So we use the shard lock to assign an index, then release the shard lock, and use value-specific lock to do the actual (potentially slow) initialization.
And that’s all I have for today! Again, I haven’t implemented this, so I have no idea how fast or slow it actually is. But the end result looks rather beautiful, and builds upon many interesting ideas:
SegmentListallows to maintain index stability despite insertions.
There will be at most 31 echelons in a
SegmentList, so you can put pointes to them into an array, removing the need to synchronize to read an echelon.
With this setup, it becomes easy to initialize a new echelon with a single CAS.
Synchronization is required only when creating a new item. If you trust indexes, you can use them to carry happens-before.
In a struct-of-arrays setup for enums, you can save space by requiring that an array for a specific variant is just as long as it needs to be.
One benefit of interning trees is that hash function becomes a shallow operation.
Optimal interners use hashmaps in a fancy way, where the key is not what you actually store in the hashmap. I have two related posts about that, Fast and Simple Rust Interner and Call Site Dependency Injection.
Sharding is an effective way to reduce contention if you are dealing with something like a shared hashmap.
For counters, one alternative to sharding is batching up the increments.
Discussion on /r/Zig.