Spinlocks Considered Harmful
Happy new year 🎉!
In this post, I will be expressing strong opinions about a topic I have relatively little practical experience with, so feel free to roast and educate me in comments (link at the end of the post) :-)
Specifically, I’ll talk about:
- spinlocks,
- spinlocks in Rust with #[no_std],
- priority inversion,
- CPU interrupts,
- and a couple of neat/horrible systemsy Rust hacks.
Context
I maintain the once_cell crate, which is a synchronization primitive.
It uses std blocking facilities under the hood (specifically, std::thread::park), and as such is not compatible with #[no_std].
A popular request is to add a spin-lock based implementation for use in #[no_std] environments: #61.
More generally, this seems to be a common pattern in the Rust ecosystem:
- A crate uses Mutex or another synchronization mechanism from std.
- Someone asks for #[no_std] support.
- Mutex is swapped for some variation of a spinlock.
For example, the lazy_static crate does this:
github.com/rust-lang-nursery/lazy-static.rs/blob/master/src/core_lazy.rs
I think this is an anti-pattern, and I am writing this blog post to call it out.
What Is a Spinlock, Anyway?
A spinlock is the simplest possible implementation of a mutex; its general form looks like this:
- To grab a lock, we repeatedly execute compare-and-swap until it succeeds. The CPU “spins” in this very short loop.
- Only one thread at a time can be here.
- To release the lock, we do a single atomic store.
- Spinning is wasteful, so we use an intrinsic to instruct the CPU to enter a low-power mode.
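Putting these pieces together, a minimal sketch in Rust might look like this (simplified; real spinlocks differ in details such as the exact compare-and-swap flavor and the spin hint used):

```rust
use core::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> SpinLock {
        SpinLock { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        // Repeatedly try to flip `locked` from false to true (a compare-and-swap).
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            // Spinning is wasteful, so tell the CPU we are in a busy-wait loop.
            core::hint::spin_loop();
        }
        // Only one thread at a time gets past this point.
    }

    pub fn unlock(&self) {
        // Releasing the lock is a single atomic store.
        self.locked.store(false, Ordering::Release);
    }
}
```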
Why we need Ordering::Acquire and Ordering::Release is very interesting, but beyond the scope of this article.
The key take-away here is that a spinlock is implemented entirely in user space: from the OS's point of view, a “spinning” thread looks exactly like a thread that does a heavy computation.
An OS-based mutex, like std::sync::Mutex or parking_lot::Mutex, uses a system call to tell the operating system that a thread needs to be blocked. In pseudo code, an implementation might look like this:
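(Rust-flavored pseudo code; park_this_thread and unpark_some_thread stand in for the OS blocking primitives, and the real std and parking_lot implementations are considerably more involved.)

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-ins for blocking system calls (on Linux these would be futex
// wait/wake operations; see below).
fn park_this_thread() { /* blocking syscall: take this thread off the CPU */ }
fn unpark_some_thread() { /* syscall: wake one thread waiting on this lock */ }

pub struct Mutex {
    locked: AtomicBool,
}

impl Mutex {
    pub fn lock(&self) {
        while self
            .locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            // The lock is taken: ask the OS to put this thread to sleep
            // until somebody releases it.
            park_this_thread();
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
        // Wake up one of the sleeping threads, if any.
        unpark_some_thread();
    }
}
```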
The main difference is park_this_thread — a blocking system call.
It instructs the OS to take the current thread off the CPU until it is woken up by an unpark_some_thread call.
The kernel maintains a queue of threads waiting for a mutex.
The park call enqueues the current thread onto this queue, while unpark dequeues some thread. The park system call returns when the thread is dequeued.
In the meantime, the thread waits off the CPU.
If there are several different mutexes, the kernel needs to maintain several queues. The address of a lock can be used as a token to identify a specific queue (this is the futex API).
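On Linux, a bare-bones version of the park/unpark pair from the pseudo code above could be written against the futex syscall directly (a sketch using the libc crate; real implementations also handle spurious wake-ups, timeouts, and errors):

```rust
use std::sync::atomic::AtomicU32;

// Block the current thread as long as `*lock` still holds `expected`.
// The kernel uses the *address* of `lock` to find the right wait queue.
fn park_this_thread(lock: &AtomicU32, expected: u32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            lock as *const AtomicU32,
            libc::FUTEX_WAIT,
            expected,
            std::ptr::null::<libc::timespec>(),
        );
    }
}

// Wake up at most one thread sleeping on this lock's wait queue.
fn unpark_some_thread(lock: &AtomicU32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            lock as *const AtomicU32,
            libc::FUTEX_WAKE,
            1,
        );
    }
}
```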
System calls are expensive, so production implementations of Mutex usually spin for several iterations before calling into the OS, optimistically hoping that the Mutex will be released soon.
However, the waiting always bottoms out in a syscall.
Spinning Just For a Little Bit, What Can Go Wrong?
Because spin locks are so simple and fast, it seems to be a good idea to use them for short-lived critical sections. For example, if you only need to increment a couple of integers, should you really bother with complicated syscalls? In the worst case, the other thread will spin just for a couple of iterations…
Unfortunately, this logic is flawed! A thread can be preempted at any time, including during a short critical section. If it is preempted, that means that all other threads will need to spin until the original thread gets its share of CPU again. And, because a spinning thread looks like a good, busy thread to the OS, the other threads will spin until they exhaust their time slices, preventing the unlucky thread from getting back on the processor!
If this sounds like a series of unfortunate events, don’t worry, it gets even worse. Enter Priority Inversion. Suppose our threads have priorities, and the OS tries to schedule high-priority threads over low-priority ones.
Now, what happens if the thread that enters a critical section is a low-priority one, but competing threads have high priority? It will likely get preempted: there are higher-priority threads, after all. And, if the number of cores is smaller than the number of high-priority threads that try to lock the mutex, it likely won’t be able to complete the critical section at all: the OS will be repeatedly scheduling all the other threads!
No OS, no problem?
But wait! — you would say — we only use spin locks in #[no_std] crates, so there’s no OS to preempt our threads.
First, it’s not really true: it’s perfectly fine, and often even desirable, to use #[no_std] crates for usual user-space applications.
For example, if you write a Rust replacement for a low-level C library, like zlib or openssl, you will probably make the crate #[no_std], so that non-Rust applications can link to it without pulling in the whole of the Rust runtime.
Second, if there’s really no OS to speak about, and you are on the bare metal (or in the kernel), it gets even worse than priority inversion.
On bare metal, we generally don’t worry about thread preemption, but we need to worry about processor interrupts. That is, while the processor is executing some code, it might receive an interrupt from some peripheral device and temporarily switch to the interrupt handler’s code.
And here comes the disaster: if the main code is in the middle of the critical section when the interrupt arrives, and if the interrupt handler tries to enter the critical section as well, we get a guaranteed deadlock! There’s no OS to switch threads after a time slice expires. Here are the Linux kernel docs discussing this issue.
Practical Applications
Let’s trigger priority inversion!
Our victim is the getrandom crate.
I don’t pick on getrandom specifically here: the pattern is pervasive across the ecosystem.
The crate uses spinning in the LazyUsize utility type:
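The pattern boils down to something like this (a simplified reconstruction, not the crate’s literal source; see the repository for the real thing):

```rust
use core::sync::atomic::{AtomicUsize, Ordering};

// Sentinels: "not initialized yet" and "another thread is initializing right now".
const UNINIT: usize = usize::MAX;
const ACTIVE: usize = usize::MAX - 1;

pub struct LazyUsize(AtomicUsize);

impl LazyUsize {
    pub const fn new() -> Self {
        LazyUsize(AtomicUsize::new(UNINIT))
    }

    pub fn sync_init(&self, init: impl FnOnce() -> usize) -> usize {
        match self.0.compare_exchange(UNINIT, ACTIVE, Ordering::SeqCst, Ordering::SeqCst) {
            // We won the race: run the (potentially slow) initializer.
            Ok(_) => {
                let val = init();
                self.0.store(val, Ordering::SeqCst);
                val
            }
            // Somebody else is initializing: spin until they publish the value.
            Err(_) => loop {
                match self.0.load(Ordering::SeqCst) {
                    ACTIVE => core::hint::spin_loop(),
                    val => return val,
                }
            },
        }
    }
}
```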
There’s a static instance of LazyUsize which caches the file descriptor for /dev/random:
https://github.com/rust-random/getrandom/blob/v0.1.13/src/use_file.rs#L26
This descriptor is used when calling getrandom — the only function that is exported by the crate.
To trigger priority inversion, we will create 1 + N threads, each of which will call getrandom::getrandom.
We arrange it so that the first thread has a low priority, and the rest are high priority.
We stagger threads a little bit so that the first one does the initialization.
We also make creating the file descriptor slow, so that the first thread gets preempted while in the critical section.
Here is the implementation of this plan: https://github.com/matklad/spin-of-death.
It uses a couple of systems programming hacks to make this disaster scenario easy to reproduce.
To simulate a slow /dev/random, we want to intercept the poll syscall getrandom uses to ensure that there’s enough entropy.
We can use strace to log system calls issued by a program.
I don’t know if strace can be used to make a syscall run slow (now, once I’ve looked at the website, I see that it can in fact be used to tamper with syscalls, sigh), but we actually don’t need to!
getrandom does not use the syscall directly; it uses the poll function from libc.
We can substitute this function by using LD_PRELOAD, but there’s an even simpler way!
We can trick the static linker into using a function which we define ourselves:
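Something along these lines does the job (a sketch; the exact body hardly matters, as long as the symbol is named poll and it is slow):

```rust
use std::os::raw::c_int;
use std::time::Duration;

// This definition wins over libc's `poll` at link time.
#[no_mangle]
pub extern "C" fn poll(fds: *mut libc::pollfd, nfds: libc::nfds_t, _timeout: c_int) -> c_int {
    // Pretend that /dev/random takes a long time to gather entropy...
    std::thread::sleep(Duration::from_millis(500));
    // ...and then report that every polled descriptor is ready for reading.
    for i in 0..nfds as usize {
        unsafe { (*fds.add(i)).revents = libc::POLLIN };
    }
    nfds as c_int
}
```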
The name of the function accidentally ( :) ) clashes with a well-known POSIX function.
However, this alone is not enough.
getrandom tries to use the getrandom syscall first, and that code path does not use a spin lock.
We need to fool getrandom into believing that the syscall is not available.
Our extern "C"
trick wouldn’t have worked if getrandom
literally used the syscall
instruction.
However, as inline assembly (which you need to issue a syscall manually) is not available on stable Rust, getrandom
goes via syscall
function from libc
.
That we can override with the same trick.
However, there’s a wrinkle!
Traditionally, the libc API used errno for error reporting.
That is, on failure a function would return a single specific invalid value, and set the errno thread-local variable to the specific error code. syscall follows this pattern.
The errno interface is cumbersome to use.
The worst part of errno is that the specification requires it to be a macro, and so you can only really use it from C source code.
Internally, on Linux the macro calls the __errno_location function to get the thread local, but this is an implementation detail (which we will gladly take advantage of, in this land of reckless systems hacking!). The irony is that the Linux syscall ABI just returns error codes, so libc has to do some legwork to adapt to the awkward errno interface.
So, here’s a strong contender for the most cursed function I’ve written so far:
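It goes roughly like this (a sketch; it leans on glibc’s __errno_location being what hides behind the errno macro):

```rust
use std::os::raw::c_long;

// Overrides the `syscall` function from libc for the whole program.
// getrandom only ever calls it with SYS_getrandom here, so we don't bother
// forwarding anything else; the extra (variadic) arguments are simply ignored.
#[no_mangle]
pub extern "C" fn syscall(number: c_long) -> c_long {
    assert_eq!(number, libc::SYS_getrandom);
    unsafe {
        // Report "no such system call" the way libc would: return -1 and
        // stash the error code into the errno thread local.
        *libc::__errno_location() = libc::ENOSYS;
    }
    -1
}
```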
It makes getrandom believe that there’s no getrandom syscall, which causes it to fall back to the /dev/random implementation.
To set thread priorities, we use the thread_priority crate, which is a thin wrapper around pthread APIs.
We will be using real-time priorities, which require sudo.
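Under the hood that boils down to something like this (a sketch with raw libc calls rather than the crate’s API):

```rust
// Give the current thread a real-time priority under the SCHED_FIFO policy.
// This is the kind of call the thread_priority crate wraps; it fails without
// the right privileges, hence the sudo.
fn make_realtime(priority: i32) {
    let param = libc::sched_param { sched_priority: priority };
    let ret = unsafe {
        libc::pthread_setschedparam(libc::pthread_self(), libc::SCHED_FIFO, &param)
    };
    assert_eq!(ret, 0, "failed to set a real-time priority (are we running under sudo?)");
}
```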
And here are the results:
Note that I had to kill the program after two minutes. Also note the impressive system time, as well as the load average.
If we patch getrandom to use std::sync::Once instead, we get a much better result:
Note how real is half a second, but user and sys are small. That’s because we are waiting for 500 milliseconds in our poll.
This is because Once uses OS facilities for blocking, and so the OS notices that the high-priority threads are actually blocked and gives the low-priority thread a chance to finish its work.
If Not a Spinlock, Then What?
First, if you only use a spin lock because “it’s faster for small critical sections”, just replace it with a mutex from std or parking_lot.
They already do a small number of spinning iterations before calling into the kernel, so they are as fast as a spinlock in the best case, and infinitely faster in the worst case.
Second, it seems like most problematic uses of spinlocks come from one-time initialization (which is exactly what my once_cell crate helps with). I think it usually is possible to get away without using spinlocks. For example, instead of storing the state itself, the library may just delegate state storing to the user. For getrandom, it can expose two functions:
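Something along these lines (the names and signatures are illustrative, not an actual getrandom API):

```rust
// Probes the available entropy sources once; potentially slow.
fn init() -> Result<RandomState, Error>;

// Fills `dest` with random data, using the state prepared by `init`.
fn getrandom(state: &RandomState, dest: &mut [u8]) -> Result<(), Error>;
```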
It then becomes the user’s problem to cache RandomState appropriately.
For example, std may continue using a thread local (src), while rand, with the std feature enabled, could use a global variable protected by Once.
Another option, if the state fits into usize and the initializing function is idempotent and relatively quick, is to do a racy initialization:
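A sketch of how this could look for caching something like a file descriptor (zero stands for “not initialized yet”; the body of init is a placeholder):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Idempotent and reasonably quick; for getrandom this would be
// "open /dev/random and return the descriptor" (placeholder body here).
fn init() -> usize {
    42
}

fn get_state() -> usize {
    static STATE: AtomicUsize = AtomicUsize::new(0); // 0 means "not initialized yet"
    let mut state = STATE.load(Ordering::Relaxed);
    if state == 0 {
        // Several threads may race here and call `init` more than once;
        // that is fine, because `init` is idempotent.
        state = init();
        STATE.store(state, Ordering::Relaxed);
    }
    state
}
```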
Take a second to appreciate the absence of unsafe blocks and cross-core communication in the above example!
At worst, init will be called number-of-cores times (EDIT: this is wrong, thanks to /u/pcpthm for pointing this out!).
There’s also a nuclear option: parametrize the library by blocking behavior, and allow the user to supply their own synchronization primitive.
Third, sometimes you just know that there’s only a single thread in the program, and you might want to use a spinlock just to silence those annoying compiler errors about static mut.
The primary use case here, I think, is WASM. A solution for this case is to assume that blocking just doesn’t happen, and panic otherwise. This is what std does for Mutex on WASM, and what is implemented for once_cell in this PR: #82.
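A sketch of what “panic instead of blocking” can look like for a once-style primitive (simplified; not the actual std or once_cell code):

```rust
use core::sync::atomic::{AtomicU8, Ordering};

const EMPTY: u8 = 0;
const RUNNING: u8 = 1;
const DONE: u8 = 2;

pub struct Once {
    state: AtomicU8,
}

impl Once {
    pub const fn new() -> Once {
        Once { state: AtomicU8::new(EMPTY) }
    }

    pub fn call_once(&self, f: impl FnOnce()) {
        match self.state.compare_exchange(EMPTY, RUNNING, Ordering::Acquire, Ordering::Acquire) {
            Ok(_) => {
                f();
                self.state.store(DONE, Ordering::Release);
            }
            // Already initialized: nothing to do.
            Err(DONE) => {}
            // We would have to block here; since we assume a single thread,
            // this can only be re-entrant initialization, so we just panic.
            Err(_) => panic!("concurrent or re-entrant initialization"),
        }
    }
}
```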
Discussion on /r/rust.
EDIT: If you enjoyed this post, you might also like this one:
Looks like we have some contention here!
EDIT: there’s now a follow up post, where we actually benchmark spinlocks:
https://matklad.github.io/2020/01/04/mutexes-are-faster-than-spinlocks.html