Mutexes Are Faster Than Spinlocks
(at least on commodity desktop Linux with stock settings)
This is a followup to the previous post about spinlocks. The gist of the previous post was that spinlocks have some pretty bad worst-case behaviors, and, for that reason, one shouldn’t blindly use a spinlock if using a sleeping mutex or avoiding blocking altogether is cumbersome.
In the comments, I was pointed to this interesting article, which made me realize that there’s another misconception:
Until today, I haven’t benchmarked any mutexes, so I don’t know for sure. However, what I know in theory about mutexes and spinlocks makes me doubt this claim, so let’s find out.
Where Does The Misconception Come From?
I do understand why people might think that way though.
A simplest mutex just makes
unlock syscalls when entering and exiting a critical section, offloading all synchronization to the kernel.
However, syscalls are slow and so, if the length of critical section is smaller than the length of two syscalls, spinning would be faster.
It’s easy to eliminate the syscall on entry in an uncontended state.
We can try to optimistically
CAS lock to the locked state, and call into kernel only if we failed and need to sleep.
Eliminating syscall on exit is tricky, and so I think historically many implementations did at least one syscall in practice.
Thus, mutexes were, in fact, slower than spinlocks in some benchmarks.
However, modern mutex implementations avoid all syscalls if there’s no contention. The trick is to make the state of the mutex an enum: unlocked, locked with some waiting threads, locked without waiting threads. This way, we only need to call into the kernel if there are in fact waiters.
Another historical benefit of spinlocks is that they are smaller in size.
A state of a spinlock is just a single boolean variable, while for a mutex you also need a queue of waiting threads. But there’s a trick to combat this inefficiency as well.
We can use the address of the boolean flag as token to identify the mutex, and store non-empty queues in a side table.
Note how this also reduces the (worst case) total number of queues from
number of mutexes to
number of threads!
So a modern mutex, like the one in WTF::ParkingLot, is a single boolean, which behaves more or less like a spinlock in an uncontended case but doesn’t have pathological behaviors of the spinlock.
So, let’s check if the theory works in practice! The source code for the benchmark is here:
The interesting bit is reproduced below:
Our hypothesis is that mutexes are faster, so we need to pick a workload which favors spinlocks. That is, we need to pick a very short critical section, and so we will just be incrementing a counter (1).
This is better than doing a dummy lock/unlock. At the end of the benchmark, we will assert that the counter is indeed incremented the correct number of times (2). This has a number of benefits:
- This is a nice smoke test which at least makes sure that we haven’t done an off by one error anywhere.
- As we will be benchmarking different implementations, it’s important to verify that they indeed give the same answer! More than once I’ve made some piece of code ten times faster by accidentally eliminating some essential logic :D
- We can be reasonably sure that compiler won’t outsmart us and won’t remove empty critical sections.
Now, we can just make all the threads hammer a single global counter, but that would only test a situation of extreme contention. We need to structure a benchmark in a way that allow us to vary contention level.
So instead of a single global counter, we will use an array of counters (3).
Each thread will be incrementing random elements of this array.
By varying the size of the array, we will be able to control the level of contention.
To avoid false sharing between neighboring elements of the array we will use crossbeam’s
To make the benchmark more reproducible, we will vendor a simple PRNG (4), which we seed manually.
We are testing
spin::Mutex and a bespoke implementation of spinlock from probablydance article.
We use 32 threads (on 4 core/8 hyperthreads CPU), and each thread increments some counter 10 000 times.
We run each benchmark 100 times and compute average, min and max times (we are primarily measuring throughput, so average makes more sense than median this time).
Finally, we run the whole suite twice, to sanity check that the results are reproducible.
There are several interesting observations here!
First, we reproduce the result that the variance of spinlocks on Linux with default scheduling settings can be huge:
Note that these are extreme results for 100 runs, where each run does
32 * 10_000 lock operations.
That is, individual lock/unlock operations probably have an even higher spread.
Second, the uncontended case looks like I have expected: mutexes and spinlocks are not that different, because they essentially use the same code
Third, under heavy contention mutexes annihilate spinlocks:
Now, this is the opposite of what I would naively expect. Even in heavy contended state, the critical section is still extremely short, so for each thread, the most efficient strategy seems to spin for a couple of iterations.
But I think I can explain why mutexes are so much better in this case. One reason is that with spinlocks a thread can get unlucky and be preempted in the critical section. The other more important reason is that, at any given moment in time, there are many threads trying to enter the same critical section. With spinlocks, all cores can be occupied by threads who compete for the same lock. With mutexes, there is a queue of sleeping threads for each lock, and the kernel generally tries to make sure that only one thread from the group is awake.
This is a funny example of mechanical race to the bottom. Due to the short length of critical section, each individual thread would spend less CPU cycles in total if it were spinning, but it increases the overall cost.
EDIT: simpler and more plausible explanation from the author of Rust’s parking lot is that it does exponential backoff when spinning, unlike the two spinlock implementations.
Fourth, even under heavy contention spin locks can luck out and finish almost as fast as mutexes:
This again shows that a good mutex is roughly equivalent to a spinlock in the best case.
Fifth, the amount of contention required to disrupt spinlocks seems to be small. Even if 32 threads compete for 1 000 locks, spinlocks still are considerably slower:
EDIT: someone on Reddit noticed that the number of threads is significantly higher than the number of cores, which is an unfortunate situation for spinlocks. And, although the number of threads in the benchmark is configurable, it never occurred to me to actually vary it 😅! Lowering the number of threads to four gives a picture similar to the “no contention” situation above: spinlocks a slightly, but not massively, faster. Which makes total sense! as there are more cores than CPUs, there’s no harm in spinning. And, if you can carefully architecture you application such that it runs a small fixed number of threads, ideally pinned to specific CPUs (like in the seastar architecture), using spinlocks might make sense!
As usual, each benchmark exercises only a narrow slice from the space of possible configurations, so it would be wrong to draw a sweeping conclusion that mutexes are always faster. For example, if you are in a situation where preemption is impossible (interrupts are disabled, cooperative multitasking, realtime scheduling, etc), spinlocks might be better (or even the only!) choice. And there’s also a chance the benchmark doesn’t measure what I think it measures :-)
But I find this particular benchmark convincing enough to disprove that “spinlocks are faster then mutexes for short critical sections”. In particular I find the qualitative observation that, under contention mutexes allow for better scheduling even if critical sections are short and not preempted in the middle, enlightening.
Futexes Are Tricky — a paper describing the
futexsyscall used to implement efficient sleeping on Linux.
- Locking in WebKit — a long post, describing a modern mutex implementation.
- Generic Mutex Subsystem — Linux kernel docs about sleeping mutexes.
- Spinlock — Linux kernel docs about spinlocks.
- Do not use spinlocks in user space — Linus explains why user space spinlocks are usually bad.
- Almost all serious locking libraries try to do something exactly like that — Linus explains how good mutex might be implemented instead.
- Effcient Userspace Optimistic Spinning Locks — a presentation about making fast-path spinlocking in futex-based locks even more efficient. The main problem with optimistic spinning is how much of it do you want (that is, tweaking the number of iterations parameter). The proposal solves this in an ingenious self-tweeking way (with the help of the kernel): we spin until the holder of the lock itself goes to sleep.
Discussion on /r/rust.