It’s Not Always ICache
This is a follow-up to the previous post about `#[inline]` in Rust specifically.
This post is a bit more general, and a bit more ranty.
Reader, beware!
When inlining optimization is discussed, the following is almost always mentioned: “inlining can also make code slower, because inlining increases the code size, blowing the instruction cache size and causing cache misses”.
I myself have seen this repeated on various forums many times. I have also seen a lot of benchmarks where judicious removal of inlining annotations did increase performance. However, not once have I seen the performance improvement traced to ICache specifically. To me at least, this explanation doesn’t seem to be grounded — people know that ICache is to blame because other people say so, not because there’s a benchmark everyone points to. This doesn’t mean that the ICache explanation is wrong — just that I personally don’t have evidence to believe it over any other explanation.
Anyway, I’ve decided to look at a specific case where I know `#[inline]` causes an observable slowdown, and to understand why it happens.
Note that the goal here is not to explain the real-world impact of `#[inline]`; the benchmark is artificial.
The goal is, first and foremost, to learn more about the tools to use for explaining the results.
The secondary goal is to either observe ICache effects in practice, or else to provide an alternative hypothesis for why removing inlining can speed things up.
The benchmark is based on my once_cell Rust library. The library provides a primitive, similar to double-checked locking. There’s a function that looks like this:
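The function is not reproduced here; as a stand-in, below is a simplified sketch of its shape, built on the public once_cell API rather than the library's internals. The free functions and their exact signatures are my illustration, not the library's code: a fast path that only reads the cell, and a separate `initialize` function for the slow path.

```rust
use once_cell::sync::OnceCell;

// A simplified sketch of the shape of the function in question, not the
// library's exact source: a fast path that only reads the cell, and a
// separate `initialize` function that does the slow, synchronized part.
fn get_or_try_init_sketch<'a, T, E>(
    cell: &'a OnceCell<T>,
    f: impl FnOnce() -> Result<T, E>,
) -> Result<&'a T, E> {
    // Fast path: the value is already initialized, just hand out a reference.
    if let Some(value) = cell.get() {
        return Ok(value);
    }
    // Slow path: run the initializer under synchronization.
    initialize(cell, f)?;
    Ok(cell.get().expect("initialize stores the value"))
}

// In the real library, `initialize` is the function whose inlining the
// benchmark toggles.
fn initialize<T, E>(cell: &OnceCell<T>, f: impl FnOnce() -> Result<T, E>) -> Result<(), E> {
    cell.get_or_try_init(f).map(drop)
}
```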
I know that performance improves significantly when the `initialize` function is not inlined.
It’s somewhat obvious that this is the case (that’s why the benchmark is synthetic — real-world examples are about cases where we don’t know if `inline` is needed).
But it is unclear why, exactly, inlining `initialize` leads to slower code.
For the experiment, I wrote a simple high-level benchmark calling `get_or_try_init` in a loop:
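The benchmark itself lives in the commit linked below; here is a minimal sketch of its shape. The constants and the closure are my assumptions, sized so that the hot loop runs 800_000_000 times in total, which is the figure the analysis later refers to.

```rust
use once_cell::sync::OnceCell;

static CELL: OnceCell<usize> = OnceCell::new();

// 8 * 100_000_000 = 800_000_000 iterations of the hot loop in total.
const N_OUTER: usize = 8;
const N_INNER: usize = 100_000_000;

fn main() {
    for i in 0..N_OUTER {
        go(i);
    }
}

fn go(i: usize) {
    for _ in 0..N_INNER {
        // After the very first call this always takes the fast path.
        let value = CELL.get_or_try_init(|| Ok::<usize, ()>(i)).unwrap();
        assert!(*value < N_OUTER);
    }
}
```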
I also added a compile-time toggle to force or forbid inlining:
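The real toggle is in the commit linked below; the approach is a pair of Cargo features switched via `cfg_attr`, roughly like this (the feature names and the signature are assumptions on my part):

```rust
// Applied to `initialize` inside the library; exactly one of the two
// (hypothetical) features is enabled per build.
#[cfg_attr(feature = "inline_always", inline(always))]
#[cfg_attr(feature = "inline_never", inline(never))]
fn initialize<T, E>(
    cell: &once_cell::sync::OnceCell<T>,
    f: impl FnOnce() -> Result<T, E>,
) -> Result<(), E> {
    cell.get_or_try_init(f).map(drop)
}
```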
You can see the full benchmark in this commit: matklad/once_cell@a741d5f.
Running both versions shows that `#[inline(never)]` is indeed measurably faster.
How do we explain the difference? The first step is to remove cargo from the equation and make two binaries for comparison:
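The exact commands are not important; something along these lines works, assuming the feature names above and a `bench` binary target (both are my assumptions, not taken from the repository):

```console
$ cargo build --release --features inline_always
$ cp target/release/bench ./inline_always
$ cargo build --release --features inline_never
$ cp target/release/bench ./inline_never
```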
On Linux, the best tool to quickly assess the performance of any program is `perf stat`.
It runs the program and shows a bunch of CPU-level performance counters, which might explain what’s going on.
As we suspect that ICache might be to blame, let’s include the counters for caches:
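An invocation along these lines does the job; the event list is my choice, based on the counters discussed below:

```console
$ perf stat -e instructions,L1-icache-loads,L1-icache-load-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./inline_always
$ perf stat -e instructions,L1-icache-loads,L1-icache-load-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./inline_never
```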
There is some difference in `L1-icache-load-misses`, but there’s also a surprising difference in `instructions`.
What’s more, the `L1-icache-load-misses` difference is hard to interpret, because it’s unclear what `L1-icache-loads` are.
As a sanity check, the statistics for `dcache` are the same, just as we expect.
While `perf` takes the real data from the CPU, an alternative approach is to run the program in a simulated environment.
That’s what the `cachegrind` tool does.
Fun fact: the primary author of cachegrind is @nnethercote, whose Rust Performance Book we saw in the last post.
Let’s see what `cachegrind` thinks about the benchmark.
Note that, because `cachegrind` simulates the program, it runs much slower.
Here, we don’t see a big difference in ICache misses (I1 — first level instruction cache, LLi — last level instruction cache).
We do see a difference in ICache references.
Note that the number of times the CPU refers to the ICache should correspond to the number of instructions it executes.
Cross-checking the number with `perf`, we see that both `perf` and `cachegrind` agree on the number of instructions executed.
They also agree that the `inline_always` version executes more instructions.
It’s still hard to say what perf’s `L1-icache-loads` means.
Judging by the name, it should correspond to `cachegrind`’s `I refs`, but it doesn’t.
Anyway, it seems there’s one thing that bears further investigation: why does inlining change the number of instructions executed? Inlining doesn’t actually change the code the CPU runs, so the number of instructions should stay the same. Let’s look at the asm then! The right tool here is cargo-asm.
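An invocation shaped roughly like this prints the assembly of a single function; the path to the function here is a placeholder, not the real one from the benchmark:

```console
$ cargo asm path::to::the_function
```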
Again, here’s the function we will be looking at:
The call to `get_or_init` will be inlined, and the nested call to `initialize` will be inlined or not depending on the flag.
Let’s first look at the `inline_never` version:
And then at the `inline_always` version:
I’ve slightly edited the code and also highlighted the hot loop which constitutes the bulk of the benchmark.
Looking at the assembly, we can see the following:
- code is much larger — inlining happened!
- function prologue is bigger, the compiler pushes more callee-saved registers to the stack
- function epilogue is bigger, the compiler needs to restore more registers
- stack frame is larger
- the compiler hoisted some of the `initialize` code to before the loop
- the core loop is very tight in both cases, just a handful of instructions
- the core loop counts upwards rather than downwards, adding an extra `cmp` instruction
Note that it’s highly unlikely that ICache affects the running code, as it’s a small bunch of instructions next to each other in memory.
On the other hand, an extra `cmp` with a large immediate precisely accounts for the number of extra instructions we observe (the loop is run 800_000_000 times).
Conclusions
It’s hard enough to come up with a benchmark which demonstrates a difference between two alternatives. It’s even harder to explain the difference — there might be many readily available explanations, but they are not necessarily true. Nonetheless, today we have a wealth of helpful tooling. Two notable examples are perf and valgrind. Tools are not always correct — it’s a good idea to sanity-check different tools against each other and against a common-sense understanding of the problem.
For inlining in particular, we found the following reasons why inlining `S` into `C` might cause a slowdown:
- Inlining might cause `C` to use more registers. This means that the prologue and epilogue grow additional push/pop instructions, which also use stack memory. Without inlining, these instructions are hidden in `S` and are only paid for when `C` actually calls into `S`, as opposed to every time `C` itself is called.
- Generalizing from the first point, if `S` is called in a loop or in an `if`, the compiler might hoist some instructions of `S` to before the branch, moving them from the cold path to the hot path (see the sketch after this list).
- With more local variables and control flow in the stack frame to juggle, the compiler might accidentally pessimize the hot loop.
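To illustrate the second point, here is a hypothetical sketch, entirely separate from once_cell: a cold helper called behind a branch inside a hot loop. Whether the compiler actually hoists anything here depends on the optimizer, so treat this as a picture of the shape rather than a guaranteed reproduction.

```rust
// A hypothetical illustration of the "S called behind a branch inside C" shape;
// none of this is from once_cell itself.
#[inline(never)] // drop this attribute to let the compiler inline the cold helper
fn slow_path(acc: &mut u64, seed: u64) {
    // Work that needs a few extra registers and large constants.
    *acc = acc.wrapping_mul(6364136223846793005).wrapping_add(seed | 1);
}

fn hot_loop(n: u64, seed: u64) -> u64 {
    let mut acc: u64 = 1;
    for i in 0..n {
        if acc == seed {
            // Cold path: rarely (if ever) taken at run time, but the compiler
            // cannot prove that. With `slow_path` inlined, some of its setup
            // and register pressure may end up on the hot path.
            slow_path(&mut acc, seed);
        }
        // Hot path: a tight loop body.
        acc = acc.wrapping_add(i);
    }
    acc
}

fn main() {
    // Usage sketch; the arguments are arbitrary.
    println!("{}", hot_loop(100_000_000, 42));
}
```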
If you are curious under which conditions ICache does become an issue, there’s this excellent article about one such case.