Fast Thread Locals In Rust
Rust thread-locals are slower than they could be. This is because they violate the zero-cost abstraction principle, specifically the “you don’t pay for what you don’t use” part.
Rust’s thread-local implementation (1, 2) comes with built-in support for laziness: thread locals are initialized on the first access. Sometimes this overhead is a big deal, as thread locals are a common tool for writing high-performance code. For example, an allocator’s fast path often involves looking into a thread-local heap.
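To make the laziness concrete, here is a minimal sketch. The names INIT_CALLS, VALUE and observe_lazy_init are hypothetical and exist only for this illustration. It counts how often the initializer runs around the first and second access on a fresh thread.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Hypothetical counter: how many times the initializer below has run.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // The initializer block runs lazily, on the first access in each thread.
    static VALUE: Cell<u32> = {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        Cell::new(0)
    };
}

/// On a fresh thread, returns how many initializer runs were caused by
/// the first access and by the second access, respectively.
fn observe_lazy_init() -> (usize, usize) {
    thread::spawn(|| {
        let before = INIT_CALLS.load(Ordering::SeqCst);
        VALUE.with(|v| v.set(v.get() + 1)); // first access: initializes
        let after_first = INIT_CALLS.load(Ordering::SeqCst);
        VALUE.with(|v| v.set(v.get() + 1)); // second access: already initialized
        let after_second = INIT_CALLS.load(Ordering::SeqCst);
        (after_first - before, after_second - after_first)
    })
    .join()
    .unwrap()
}

fn main() {
    let (first, second) = observe_lazy_init();
    println!("initializer runs: first access = {first}, second access = {second}");
}
```

Because the initializer may or may not have run yet, every access has to be prepared to check for it, and that check is the overhead discussed here.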
There’s an unstable #[thread_local] attribute for a zero-cost implementation (see the tracking issue).
Let’s see how much the “is the thread local initialized?” check costs by comparing these two programs:
In this test, we declare an integer thread-local variable, and use it as an accumulator for the summation.
We use a non-trivial summation term, (step * step) ^ step, to prevent LLVM from evaluating the sum at compile time.
If a term of a summation is a polynomial (like 1, step or step * step), then the sum itself is a one degree higher polynomial, and LLVM can figure this out!
We rely on wrapping overflow of unsigned integers in C, and use wrapping_mul and wrapping_add in Rust.
To make sure that both programs are equivalent, we also print the result.
One optimization we specifically don’t protect against is caching thread-local access. That is, instead of doing a billion thread-local loads and stores, the compiler could generate code to compute the sum into a local variable, and do a single store at the end. This is because “can the compiler optimize thread-local access?” is exactly the property we want to measure.
There’s no standard way to get monotonic wall-clock time in C, so the C version is not cross-platform.
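The Rust side of the setup described above can be sketched roughly as follows. This is an illustration under assumptions, not the benchmark's actual source: the names COUNTER, term, sum_via_thread_local and sum_via_local are hypothetical, the accumulator is assumed to be a Cell<u32>, and the step count in main is much smaller than a billion so that it finishes quickly even in a debug build. The second function is the "cached" variant that accumulates into a plain local.

```rust
use std::cell::Cell;
use std::time::Instant;

thread_local! {
    // Lazily initialized accumulator (hypothetical name).
    static COUNTER: Cell<u32> = Cell::new(0);
}

/// The non-polynomial term from the text: (step * step) ^ step.
fn term(step: u32) -> u32 {
    step.wrapping_mul(step) ^ step
}

/// Loads and stores the thread-local on every iteration.
fn sum_via_thread_local(steps: u32) -> u32 {
    COUNTER.with(|c| c.set(0));
    for step in 0..steps {
        COUNTER.with(|c| c.set(c.get().wrapping_add(term(step))));
    }
    COUNTER.with(|c| c.get())
}

/// Accumulates into a plain local, with no thread-local access at all.
fn sum_via_local(steps: u32) -> u32 {
    let mut acc = 0u32;
    for step in 0..steps {
        acc = acc.wrapping_add(term(step));
    }
    acc
}

fn main() {
    let steps = 1_000_000; // assumption: far fewer steps than the real benchmark

    let start = Instant::now();
    let a = sum_via_thread_local(steps);
    println!("thread-local: {a} in {:?}", start.elapsed());

    let start = Instant::now();
    let b = sum_via_local(steps);
    println!("plain local:  {b} in {:?}", start.elapsed());

    // Printing and comparing the results keeps both loops observable.
    assert_eq!(a, b);
}
```

Timings from a sketch like this depend heavily on optimization level and platform, so it should be built with --release before drawing any conclusions.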
This code gives the following results on my machine:
This benchmark doesn’t let us measure the cost of thread-local access per se, but the overall time is about 2x longer for Rust.
Can we make Rust faster? I don’t know how to do that, but I know how to cheat. We can apply a general Rust extension trick — write some C code and link it with Rust!
Let’s implement a simple C library which declares a thread-local and provides access to it:
Link it with Rust:
And use it:
The results are underwhelming:
This is expected — we replaced access to a thread local with a function call. As we are crossing the language boundary, the compiler can’t inline it, which destroys performance. However, there’s a way around that: Rust allows cross-language Link Time Optimization (docs). That is, Rust and C compilers can cooperate, to allow the linker to do inlining across the languages.
This requires manually aligning a bunch of stars:
- The C compiler, the Rust compiler and the linker must use the same version of LLVM. As you might have noticed, this excludes gcc. I had luck with rustc 1.46.0, clang 10.0.0, and LLD 10.0.0.
- -flto=thin in the C compiler flags.
- RUSTFLAGS:
Now, just recompiling the old code gives the same performance for C and Rust:
Interestingly, this is the same performance we get without any thread-locals at all:
So, either the compiler/linker was able to lift thread-local access out of the loop, or its cost is masked by the arithmetic.
Full code for the benchmarks is available at https://github.com/matklad/ftl. Note that this research only scratches the surface of the topic: thread locals are implemented differently on different OSes. Even on a single OS, there are differences depending on compilation flags (dynamic libraries differ from static libraries, for example). Looking at the generated assembly could also be illuminating (code on Compiler Explorer).
Discussion on /r/rust.
Update (2023-12-18): since writing this post, Rust gained the ability to opt out of lazy initialization semantics by using a const block in the thread_local macro:
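A minimal sketch of that syntax, assuming a hypothetical COUNTER accumulator of type Cell<u32> and a hypothetical helper count_on_fresh_thread:

```rust
use std::cell::Cell;
use std::thread;

thread_local! {
    // `const { ... }` marks the initializer as a constant expression,
    // which lets the macro skip the lazy-initialization machinery.
    static COUNTER: Cell<u32> = const { Cell::new(0) };
}

/// Increments the thread-local `n` times on a fresh thread and
/// returns the final value seen by that thread.
fn count_on_fresh_thread(n: u32) -> u32 {
    thread::spawn(move || {
        for _ in 0..n {
            COUNTER.with(|c| c.set(c.get().wrapping_add(1)));
        }
        COUNTER.with(|c| c.get())
    })
    .join()
    .unwrap()
}

fn main() {
    println!("count = {}", count_on_fresh_thread(3));
}
```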
This removes the overhead measured in this article. Note that in this case const { is a feature of the thread_local macro. That is, const { is parsed specifically by the declarative macro machinery; it is not a part of the more general “const block” syntax (unstable at the time of this update).