It’s Not Always ICache
This is a follow-up to the previous post about #[inline] in Rust specifically.
This post is a bit more general, and a bit more ranty.
Reader, beware!
When inlining optimization is discussed, the following is almost always mentioned: “inlining can also make code slower, because inlining increases the code size, blowing through the instruction cache and causing cache misses”.
I myself have seen this repeated in various forums many times. I have also seen a lot of benchmarks where judicious removal of inlining annotations did increase performance. However, not once have I seen the performance improvement traced to ICache specifically. To me at least, this explanation doesn’t seem to be grounded — people know that ICache is to blame because other people say this, not because there’s a benchmark everyone points to. That doesn’t mean that the ICache explanation is wrong — just that I personally don’t have evidence to believe it is better than any other explanation.
Anyway, I’ve decided to look at a specific case where I know #[inline] to cause an observable slowdown, and to understand why that happens.
Note that the goal here is not to explain the real-world impact of #[inline]; the benchmark is artificial.
The goal is, first and foremost, to learn more about the tools to use for explaining such results.
The secondary goal is to either observe ICache effects in practice, or else to provide an alternative hypothesis for why removing inlining can speed things up.
The benchmark is based on my once_cell Rust library. The library provides a primitive, similar to double-checked locking. There’s a function that looks like this:
fn get_or_try_init<F, E>(&self, f: F) -> Result<&T, E>
where
    F: FnOnce() -> Result<T, E>,
{
    if let Some(value) = self.get() {
        // Fast path.
        return Ok(value);
    }
    // Slow path.
    self.0.initialize(f)?;
    Ok(unsafe { self.get_unchecked() })
}
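The benchmark below uses get_or_init, which in once_cell is a thin wrapper over get_or_try_init; roughly like this (a sketch, the library’s actual code may differ in details):

fn get_or_init<F>(&self, f: F) -> &T
where
    F: FnOnce() -> T,
{
    // An uninhabited error type makes the `Err` branch statically impossible.
    enum Void {}
    match self.get_or_try_init(|| Ok::<T, Void>(f())) {
        Ok(value) => value,
        Err(void) => match void {},
    }
}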
I know that performance improves significantly when the initialize
function is not inlined.
It’s somewhat obvious that this is the case (that’s why the benchmark is synthetic — real-world examples are about cases where we don’t know if inline is needed).
But it is unclear why, exactly, inlining initialize
leads to slower code.
For the experiment, I wrote a simple high-level benchmark calling get_or_try_init
in a loop:
use once_cell::sync::OnceCell;

const N_LOOPS: usize = 8;
static CELL: OnceCell<usize> = OnceCell::new();

fn main() {
    for i in 0..N_LOOPS {
        go(i)
    }
}

fn go(i: usize) {
    for _ in 0..100_000_000 {
        let &value = CELL.get_or_init(|| i);
        assert!(value < N_LOOPS);
    }
}
I also added a compile-time toggle to force or forbid inlining, roughly like this (the exact attributes are in the linked commit below):
#[cfg_attr(feature = "inline_always", inline(always))]
#[cfg_attr(feature = "inline_never", inline(never))]
fn initialize() { ... }
You can see the full benchmark in this commit: matklad/once_cell@a741d5f.
Running both versions shows that #[inline(never)]
is indeed measurably faster:
$ cargo run -q --example bench --release --features inline_always
330ms
$ cargo run -q --example bench --release --features inline_never
259ms
How do we explain the difference? The first step is to remove cargo from the equation and make two binaries for comparison:
$ cargo build --example bench --release --features inline_never
$ cp ./target/release/examples/bench never
$ cargo build --example bench --release --features inline_always
$ cp ./target/release/examples/bench always
On Linux, the best tool to quickly assess the performance of any program is perf stat.
It runs the program and shows a bunch of CPU-level performance counters, which might explain what’s going on.
As we suspect that ICache might be to blame, let’s include the counters for caches:
$ perf stat -e instructions,cycles,\
L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,\
L1-icache-loads,L1-icache-load-misses,cache-misses \
./always
348ms
6,396,754,995 instructions:u
1,601,314,994 cycles:u
1,600,621,170 L1-dcache-loads:u
4,806 L1-dcache-load-misses:u
4,402 L1-dcache-prefetches:u
69,594 L1-icache-loads:u
461 L1-icache-load-misses:u
1,928 cache-misses:u
$ perf stat -e instructions,cycles,\
L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,\
L1-icache-loads,L1-icache-load-misses,cache-misses \
./never
261ms
Performance counter stats for './never':
5,597,215,493 instructions:u
1,199,960,402 cycles:u
1,599,404,303 L1-dcache-loads:u
4,612 L1-dcache-load-misses:u
4,290 L1-dcache-prefetches:u
62,268 L1-icache-loads:u
603 L1-icache-load-misses:u
1,675 cache-misses:u
There is some difference in L1-icache-load-misses, but there’s also a surprising difference in instructions.
What’s more, the L1-icache-load-misses difference is hard to interpret, because it’s unclear what exactly L1-icache-loads measures.
As a sanity check, the dcache statistics are the same in both runs, just as we would expect.
While perf
takes the real data from the CPU, an alternative approach is to run the program in a simulated environment.
That’s what the cachegrind tool does.
Fun fact: the primary author of cachegrind is @nnethercote, whose Rust Performance Book we saw in the last post.
Let’s see what cachegrind
thinks about the benchmark.
$ valgrind --tool=cachegrind ./always
10s
I refs: 6,400,577,147
I1 misses: 1,560
LLi misses: 1,524
I1 miss rate: 0.00%
LLi miss rate: 0.00%
D refs: 1,600,196,336
D1 misses: 5,549
LLd misses: 4,024
D1 miss rate: 0.0%
LLd miss rate: 0.0%
LL refs: 7,109
LL misses: 5,548
LL miss rate: 0.0%
$ valgrind --tool=cachegrind ./never
9s
I refs: 5,600,577,226
I1 misses: 1,572
LLi misses: 1,529
I1 miss rate: 0.00%
LLi miss rate: 0.00%
D refs: 1,600,196,330
D1 misses: 5,553
LLd misses: 4,024
D1 miss rate: 0.0%
LLd miss rate: 0.0%
LL refs: 7,125
LL misses: 5,553
LL miss rate: 0.0%
Note that, because cachegrind
simulates the program, it runs much slower.
Here, we don’t see a big difference in ICache misses (I1 — first level instruction cache, LLi — last level instruction cache).
We do see a difference in ICache references.
Note that the number of times the CPU refers to the ICache should correspond to the number of instructions it executes.
Cross-checking the number with perf
, we see that both perf
and cachegrind
agree on the number of instructions executed.
They also agree that the inline_always version executes more instructions.
It’s still hard to say what perf’s L1-icache-loads means.
Judging by the name, it should correspond to cachegrind
’s I refs
, but it doesn’t.
Anyway, it seems there’s one thing that bears further investigation: why does inlining change the number of instructions executed? Inlining doesn’t change the logic of the program, so, naively, the number of executed instructions should stay the same. Let’s look at the asm then! The right tool here is cargo-asm.
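The invocation is along these lines (cargo asm takes the path of the item whose assembly to print; the exact flags differ between cargo-asm versions, and pointing it at an example target may need extra options):
$ cargo asm bench::go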
Again, here’s the function we will be looking at:
fn go(i: usize) {
    for _ in 0..100_000_000 {
        let &value = CELL.get_or_init(|| i);
        assert!(value < N_LOOPS);
    }
}
The call to get_or_init
will be inlined, and the nested call to initialize
will be inlined depending on the flag.
Let’s first look at the inline_never
version:
push r14 ;
push rbx ; prologue
push rax ;
mov qword, ptr, [rsp], rdi
mov ebx, 100000001 ; loop counter
mov r14, rsp
jmp .LBB14_1
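; ---- hot loop: once CELL is initialized, every iteration stays within the next few instructions ----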
.loop:
cmp qword, ptr, [rip, +, CELL+16], 8
jae .assert_failure
.LBB14_1:
add rbx, -1
je .normal_exit
mov rax, qword, ptr, [rip, +, CELL]
cmp rax, 2
je .loop
mov rdi, r14
call once_cell::imp::OnceCell<T>::initialize
jmp .loop
.normal_exit:
add rsp, 8 ;
pop rbx ; epilogue
pop r14 ;
ret ;
.assert_failure:
lea rdi, [rip, +, .L__unnamed_12]
lea rdx, [rip, +, .L__unnamed_13]
mov esi, 35
call qword, ptr, [rip, +, core::panicking::panic@GOTPCREL]
ud2
And then at the inline_always
version:
push rbp ;
push r15 ;
push r14 ;
push r13 ; prologue
push r12 ;
push rbx ;
sub rsp, 24
mov r12, rdi
xor ebx, ebx
mov r13d, 1
lea r14, [rip, +, CELL]
mov rbp, qword, ptr, [rip, +, WaiterQueue::drop@GOTPCREL]
mov r15, qword, ptr, [rip, +, once_cell::imp::wait@GOTPCREL]
jmp .LBB10_1
.LBB10_10:
mov qword, ptr, [rsp, +, 8], r14
mov qword, ptr, [rip, +, CELL+8], 1
mov qword, ptr, [rip, +, CELL+16], r12
mov qword, ptr, [rsp, +, 16], 2
lea rdi, [rsp, +, 8]
call rbp
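; ---- hot loop: same structure as before, but the counter counts upwards and is re-checked with an extra cmp ----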
.loop:
add rbx, 1
cmp qword, ptr, [rip, +, CELL+16], 8
jae .assert_failure
.LBB10_1:
cmp rbx, 100000000
je .normal_exit
mov rax, qword, ptr, [rip, +, CELL]
cmp rax, 2
je .loop
.LBB10_3:
mov rax, qword, ptr, [rip, +, CELL]
.LBB10_4:
test rax, rax
jne .LBB10_5
xor eax, eax
lock cmpxchg, qword, ptr, [rip, +, CELL], r13
jne .LBB10_4
jmp .LBB10_10
.LBB10_5:
cmp rax, 2
je .loop
mov ecx, eax
and ecx, 3
cmp ecx, 1
jne .LBB10_8
mov rdi, r14
mov rsi, rax
call r15
jmp .LBB10_3
.normal_exit:
add rsp, 24 ;
pop rbx ;
pop r12 ;
pop r13 ; epilogue
pop r14 ;
pop r15 ;
pop rbp ;
ret
.assert_failure:
lea rdi, [rip, +, .L__unnamed_9]
lea rdx, [rip, +, .L__unnamed_10]
mov esi, 35
call qword, ptr, [rip, +, core::panicking::panic@GOTPCREL]
ud2
.LBB10_8:
lea rdi, [rip, +, .L__unnamed_11]
lea rdx, [rip, +, .L__unnamed_12]
mov esi, 57
call qword, ptr, [rip, +, core::panicking::panic@GOTPCREL]
ud2
I’ve slightly edited the code, and marked the hot loop (the ; ---- hot loop ---- comments in both listings), which constitutes the bulk of the benchmark.
Looking at the assembly, we can see the following:
- code is much larger — inlining happened!
- function prologue is bigger, compiler pushes more callee-saved registers to the stack
- function epilogue is bigger, compiler needs to restore more registers
- stack frame is larger
- compiler hoisted some of the initialize code to before the loop
- the core loop is very tight in both cases, just a handful of instructions
- the core loop counts upwards rather than downwards, adding an extra cmp instruction
Note that it’s highly unlikely that ICache affects the running code here, as the hot loop is a small bunch of instructions sitting next to each other in memory.
On the other hand, an extra cmp with a large immediate precisely accounts for the extra instructions we observe: the loop runs 8 × 100_000_000 = 800_000_000 times in total, and the instruction counts of the two versions differ by roughly 800_000_000 (6,400,577,147 − 5,600,577,226 in cachegrind’s counts).
Conclusions
It’s hard enough to come up with a benchmark which demonstrates a difference between two alternatives. It’s even harder to explain the difference — there might be many readily available explanations, but they are not necessarily true. Nonetheless, today we have a wealth of helpful tooling. Two notable examples are perf and valgrind. Tools are not always correct — it’s a good idea to sanity-check different tools against each other and against a common-sense understanding of the problem.
For inlining in particular, we found the following reasons why inlining S into C might cause a slowdown:
- Inlining might cause C to use more registers. This means that the prologue and epilogue grow additional push/pop instructions, which also use stack memory. Without inlining, these instructions are hidden in S and are only paid for when C actually calls into S, as opposed to every time C itself is called. (A minimal sketch of the resulting code shape follows this list.)
- Generalizing from the first point, if S is called in a loop or in an if, the compiler might hoist some instructions of S to before the branch, moving them from the cold path to the hot path.
- With more local variables and control flow in the stack frame to juggle, the compiler might accidentally pessimize the hot loop.
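To make the first point concrete, here is a minimal single-threaded sketch of my own (not the library’s actual code) of the shape that wins in this benchmark: the cheap check is inlined into the caller, while the register-hungry initialization stays behind a real call.

use std::cell::Cell;

struct LazyUsize {
    value: Cell<Option<usize>>,
}

impl LazyUsize {
    const fn new() -> LazyUsize {
        LazyUsize { value: Cell::new(None) }
    }

    // The caller only ever inlines this small function...
    #[inline]
    fn get_or_init(&self, f: impl FnOnce() -> usize) -> usize {
        if let Some(value) = self.value.get() {
            return value; // hot path: a load, a branch, a return
        }
        self.init_slow(f) // cold path stays a `call` in the generated code
    }

    // ...while the push/pop-heavy initialization is paid for only when it is
    // actually reached.
    #[cold]
    #[inline(never)]
    fn init_slow(&self, f: impl FnOnce() -> usize) -> usize {
        let value = f();
        self.value.set(Some(value));
        value
    }
}

fn main() {
    let cell = LazyUsize::new();
    assert_eq!(cell.get_or_init(|| 92), 92);
    assert_eq!(cell.get_or_init(|| 62), 92); // already initialized
}

This is essentially the shape the inline_never configuration restores for once_cell’s initialize.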
If you are curious under which conditions ICache does become an issue, there’s this excellent article about one such case.