Spinlocks Considered Harmful
Happy new year 🎉!
In this post, I will be expressing strong opinions about a topic I have relatively little practical experience with, so feel free to roast and educate me in comments (link at the end of the post) :-)
Specifically, I’ll talk about:
spinlocks in Rust with
- priority inversion,
- CPU interrupts,
- and a couple of neat/horrible systemsy Rust hacks.
once_cell crate, which is a synchronization primitive.
std blocking facilities under the hood (specifically,
std::thread::park), and as such is not compatible with
A popular request is to add a spin-lock based implementation for use in
#[no_std] environments: #61.
More generally, this seems to be a common pattern in Rust ecosystem:
A crate uses
Mutexor other synchronization mechanism from
Someone asks for
Mutexis swapped for some variation of spinlock.
For example, the
lazy_static crate does this:
I think this is an anti-pattern, and I am writing this blog post to call it out.
What Is a Spinlock, Anyway?
Spinlock is the simplest possible implementation of a mutex, its general form looks like this:
- To grab a lock, we repeatedly execute compareandswap until it succeeds. The CPU “spins” in this very short loop.
- Only one thread at a time can be here.
- To release the lock, we do a single atomic store.
- Spinning is wasteful, so we use an intrinsic to instruct the CPU to enter a low-power mode.
Why we need
Ordering::Release is very interesting, but beyond the scope of this article.
The key take-away here is that a spinlock is implemented entirely in user space: from OS point of view, a “spinning” thread looks exactly like a thread that does a heavy computation.
An OS-based mutex, like
parking_lot::Mutex, uses a system call to tell the operating system that a thread needs to be blocked. In pseudo code, an implementation might look like this:
The main difference is
park_this_thread — a blocking system call.
It instructs the OS to take current thread off the CPU until it is woken up by an
The kernel maintains a queue of threads waiting for a mutex.
park call enqueues current thread onto this queue, while
unpark dequeues some thread. The
park system call returns when the thread is dequeued.
In the meantime, the thread waits off the CPU.
If there are several different mutexes, the kernel needs to maintain several queues. An address of a lock can be used as a token to identify a specific queue (this is a futex API).
System calls are expensive, so production implementations of
Mutex usually spin for several iterations before calling into OS, optimistically hoping that the
Mutex will be released soon.
However, the waiting always bottoms out in a syscall.
Spinning Just For a Little Bit, What Can Go Wrong?
Because spin locks are so simple and fast, it seems to be a good idea to use them for short-lived critical sections. For example, if you only need to increment a couple of integers, should you really bother with complicated syscalls? In the worst case, the other thread will spin just for a couple of iterations…
Unfortunately, this logic is flawed! A thread can be preempted at any time, including during a short critical section. If it is preempted, that means that all other threads will need to spin until the original thread gets its share of CPU again. And, because a spinning thread looks like a good, busy thread to the OS, the other threads will spin until they exhaust their quants, preventing the unlucky thread from getting back on the processor!
If this sounds like a series of unfortunate events, don’t worry, it gets even worse. Enter Priority Inversion. Suppose our threads have priorities, and OS tries to schedule high-priority threads over low-priority ones.
Now, what happens if the thread that enters a critical section is a low-priority one, but competing threads have high priority? It will likely get preempted: there are higher priority threads after all. And, if the number of cores is smaller than the number of high priority threads that try to lock a mutex, it likely won’t be able to complete a critical section at all: OS will be repeatedly scheduling all the other threads!
No OS, no problem?
But wait! — you would say — we only use spin locks in
#[no_std] crates, so there’s no OS to preempt our threads.
First, it’s not really true: it’s perfectly fine, and often even desirable, to use
#[no_std] crates for usual user-space applications.
For example, if you write a Rust replacement for a low-level C library, like zlib or openssl, you will probably make the crate
#[no_std], so that non-Rust applications can link to it without pulling the whole of the Rust runtime.
Second, if there’s really no OS to speak about, and you are on the bare metal (or in the kernel), it gets even worse than priority inversion.
On bare metal, we generally don’t worry about thread preemption, but we need to worry about processor interrupts. That is, while processor is executing some code, it might receive an interrupt from some periphery device, and temporary switch to the interrupt handler’s code.
And here comes the disaster: if the main code is in the middle of the critical section when the interrupt arrives, and if the interrupt handler tries to enter the critical section as well, we get a guaranteed deadlock! There’s no OS to switch threads after a quant expires. Here are Linux kernel docs discussing this issue.
Let’s trigger priority inversion!
Our victim is the
I don’t pick on
getrandom specifically here: the pattern is pervasive across the ecosystem.
The crate uses spinning in the
LazyUsize utility type:
static instance of
LazyUsize which caches file descriptor for
This descriptor is used when calling
getrandom — the only function that is exported by the crate.
To trigger priority inversion, we will create
1 + N threads, each of which will call
We arrange it so that the first thread has a low priority, and the rest are high priority.
We stagger threads a little bit so that the first one does the initialization.
We also make creating the file descriptor slow, so that the first thread gets preempted while in the critical section.
Here is the implementation of this plan: https://github.com/matklad/spin-of-death.
It uses a couple of systems programming hacks to make this disaster scenario easy to reproduce.
To simulate slow
/dev/random, we want to intercept the
getrandom is using to ensure that there’s enough entropy.
We can use strace to log system calls issued by a program.
I don’t know if strace can be used to make a syscall run slow (now, once I’ve looked at the website, I see that it can in fact be used to tamper with syscalls, sigh), but we actually don’t need to!
getrandom does not use the syscall directly, it uses the
poll function from
We can substitute this function by using
LD_PRELOAD, but there’s an even simpler way!
We can trick the static linker into using a function which we define ourselves:
The name of the function accidentally ( :) ) clashes with a well-known POSIX function.
However, this alone is not enough.
getrandom tries to use
getrandom syscall first, and that code path does not use a spin lock.
We need to fool
getrandom into believing that the syscall is not available.
extern "C" trick wouldn’t have worked if
getrandom literally used the
However, as inline assembly (which you need to issue a syscall manually) is not available on stable Rust,
getrandom goes via
syscall function from
That we can override with the same trick.
However, there’s a wrinkle!
libc API used
errno for error reporting.
That is, on a failure the function would return an single specific invalid value, and set the
errno thread local variable to the specific error code.
syscall follows this pattern.
errno interface is cumbersome to use.
The worst part of
errno is that the specification requires it to be a macro, and so you can only really use it from
C source code.
Internally, on Linux the macro calls
__get_errno_location function to get the thread local, but this is an implementation detail (which we will gladly take advantage of, in this land of reckless systems hacking!). The irony is that the ABI of Linux syscall just returns error codes, so
libc has to do some legwork to adapt to the awkward
So, here’s a strong contender for the most cursed function I’ve written so far:
getrandom believe that there’s no
getrandom syscall, which causes it to fallback to
To set thread priorities, we use thread_priority crate, which is a thin wrapper around
We will be using real time priorities, which require
And here are the results:
Note that I had to kill the program after two minutes. Also note the impressive system time, as well as load average
If we patch
getrandom to use
std::sync::Once instead we get a much better result:
realis half a second, but
sysare small. That’s because we are waiting for 500 milliseconds in our
This is because
Once uses OS facilities for blocking, and so OS notices that high priority threads are actually blocked and gives the low priority thread a chance to finish its work.
If Not a Spinlock, Then What?
First, if you only use a spin lock because “it’s faster for small critical sections”, just replace it with a mutex from
They already do a small amount of spinning iterations before calling into the kernel, so they are as fast as a spinlock in the best case, and infinitely faster in the worst case.
Second, it seems like most problematic uses of spinlocks come from one time initialization (which is exactly what my
once_cell crate helps with). I think it usually is possible to get away without using spinlocks. For example, instead of storing the state itself, the library may just delegate state storing to the user. For
getrandom, it can expose two functions:
It then becomes the user’s problem to cache
For example, std may continue using a thread local (src) while rand, with
std feature enabled, could use a global variable, protected by
Another option, if the state fits into
usize and the initializing function is idempotent and relatively quick, is to do a racy initialization:
Take a second to appreciate the absence of
unsafe blocks and cross-core communication in the above example!
At worst, (EDIT: this is wrong, thanks to /u/pcpthm for pointing this out!).
init will be called
number of cores times
There’s also a nuclear option: parametrize the library by blocking behavior, and allow the user to supply their own synchronization primitive.
Third, sometimes you just know that there’s only a single thread in the program, and you might want to use a spinlock just to silence those annoying compiler errors about
The primary use case here I think is WASM. A solution for this case is to assume that blocking just doesn’t happen, and panic otherwise. This is what std does for
Mutex on WASM, and what is implemented for
once_cell in this PR: #82.
Discussion on /r/rust.
EDIT: If you enjoyed this post, you might also like this one:
Looks like we have some contention here!
EDIT: there’s now a follow up post, where we actually benchmark spinlocks: