It’s Not Always ICache
This is a follow up to the previous post about
#[inline] in Rust specifically.
This post is a bit more general, and a bit more ranty.
When inlining optimization is discussed, the following is almost always mentioned: “inlining can also make code slower, because inlining increases the code size, blowing the instruction cache size and causing cache misses”.
I myself have seen this repeated on various forms many times. I have also seen a lot of benchmarks where judicious removal of inlining annotations did increase performance. However, not once have I seen the performance improvement being traced to ICache specifically. To me at least, this explanation doesn’t seem to be grounded — people know that ICache is to blame because other people say this, not because there’s a benchmark everyone points to. It doesn’t mean that the ICache explanation is wrong — just that I personally don’t have evidence to believe it is better than any other explanation.
Anyway, I’ve decided to look at a specific case where I know
#[inline] to cause an observable slow down, and understand why it happens.
Note that the goal here is not to explain real-world impact of
#[inline], the benchmark is artificial.
The goal is, first and foremost, to learn more about the tools to use for explaining results.
The secondary goal is to either observe ICache effects in practice, or else to provide an alternative hypothesis for why removing inlining can speed the things up.
The benchmark is based on my once_cell Rust library. The library provides a primitive, similar to double-checked locking. There’s a function that looks like this:
I know that performance improves significantly when the
initialize function is not inlined.
It’s somewhat obvious that this is the case (that’s why the benchmark is synthetic — real world examples are about cases where we don’t know if
inline is needed).
But it is unclear why, exactly, inlining
initialize leads to slower code.
For the experiment, I wrote a simple high-level benchmark calling
get_or_try_init in a loop:
I also added compile-time toggle to force or forbid inlining:
You can see the full benchmark in this commit: matklad/once_cell@a741d5f.
Running both versions shows that
#[inline(never)] is indeed measurably faster:
How do we explain the difference? The first step is to remove cargo from the equation and make two binaries for comparison:
On Linux, the best tool to quickly access the performance of any program is
It runs the program and shows a bunch of CPU-level performance counters, which might explain what’s going on.
As we suspect that ICache might be to blame, let’s include the counters for caches:
There is some difference in
L1-icache-load-misses, but there’s also a surprising difference in
What’s more, the
L1-icache-load-misses difference is hard to estimate, because it’s unclear what
As a sanity check, statistics for
dcache are the same, just as we expect.
perf takes the real data from the CPU, an alternative approach is to run the program in a simulated environment.
cachegrind tool does.
Fun fact: the primary author of cachegrind is @nnethercote, whose Rust Performance Book we saw in the last post.
Let’s see what
cachegrind thinks about the benchmark.
Note that, because
cachegrind simulates the program, it runs much slower.
Here, we don’t see a big difference in ICache misses (I1 — first level instruction cache, LLi — last level instruction cache).
We do see a difference in ICache references.
Note that the number of times CPU refers to ICache should correspond to the number of instructions it executes.
Cross-checking the number with
perf, we see that both
cachegrind agree on the number of instructions executed.
They also agree that
inline_always version executes more instructions.
It’s still hard to say what perf’s
Judging by the name, it should correspond to
I refs, but it doesn’t.
Anyway, it seems there’s one thing that bears further investigation — why inlining changes the number of instructions executed? Inlining doesn’t actually change the code the CPU runs, so the number of instructions should stay the same. Let’s look at the asm then! The right tool here is cargo-asm.
Again, here’s the function we will be looking at:
The call to
get_or_init will be inlined, and the nested call to
initialize will be inlined depending on the flag.
Let’s first look at the
And then at the
I’ve slightly edited the code and also highlighted the hot loop which constitutes the bulk of the benchmark.
Looking at the assembly, we can see the following:
- code is much larger — inlining happened!
- function prologue is bigger, compiler pushes more callee-saved registers to the stack
- function epilogue is bigger, compiler needs to restore more registers
- stack frame is larger
compiler hoisted some of the
initializecode to before the loop
- the core loop is very tight in both cases, just a handful of instructions
the core loop counts upwards rather than downwards, adding an extra
Note that it’s highly unlikely that ICache affects the running code, as it’s a small bunch of instructions next to each other in memory.
On the other hand, an extra
cmp with a large immediate precisely accounts for the amount of extra instructions we observe (the loop is run 800_000_000 times).
It’s hard enough to come up with a benchmark which demonstrate the difference between two alternatives. It’s even harder to explain the difference — there might be many readily available explanations, but they are not necessary true. Nonetheless, today we have a wealth of helpful tooling. Two notable examples are perf and valgrind. Tools are not always correct — it’s a good idea to sanity check different tools against each other and against common-sense understanding of the problem.
For inlining in particular, we found the following reasons why inlining
C might cause a slow down:
Inlining might cause
Cto use more registers. This means that prologue and epilogue grow additional push/pop instructions, which also use stack memory. Without inlining, these instructions are hidden in
Sand are only paid for when
Cactually calls into
S, as opposed to every time
Citself is called.
Generalizing from the first point, if
Sis called in a loop or in an
if, the compiler might hoist some instructions of
Sto before the branch, moving them from the cold path to the hot path.
- With more local variables and control flow in the stack frame to juggle, compiler might accidentally pessimize the hot loop.
If you are curious under which conditions ICache does become an issue, there’s this excellent article about one such case.