It’s Not Always ICache

This is a follow up to the previous post about #[inline] in Rust specifically. This post is a bit more general, and a bit more ranty. Reader, beware!

When inlining optimization is discussed, the following is almost always mentioned: “inlining can also make code slower, because inlining increases the code size, blowing the instruction cache size and causing cache misses”.

I myself have seen this repeated on various forms many times. I have also seen a lot of benchmarks where judicious removal of inlining annotations did increase performance. However, not once have I seen the performance improvement being traced to ICache specifically. To me at least, this explanation doesn’t seem to be grounded — people know that ICache is to blame because other people say this, not because there’s a benchmark everyone points to. It doesn’t mean that the ICache explanation is wrong — just that I personally don’t have evidence to believe it is better than any other explanation.

Anyway, I’ve decided to look at a specific case where I know #[inline] to cause an observable slow down, and understand why it happens. Note that the goal here is not to explain real-world impact of #[inline], the benchmark is artificial. The goal is, first and foremost, to learn more about the tools to use for explaining results. The secondary goal is to either observe ICache effects in practice, or else to provide an alternative hypothesis for why removing inlining can speed the things up.

The benchmark is based on my once_cell Rust library. The library provides a primitive, similar to double-checked locking. There’s a function that looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
fn get_or_try_init<F, E>(&self, f: F) -> Result<&T, E>
where
 F: FnOnce() -> Result<T, E>,
{
  if let Some(value) = self.get() {
    // Fast path.
    return Ok(value);
  }

  // Slow path.
  self.0.initialize(f)?;
  Ok(unsafe { self.get_unchecked() })
}

I know that performance improves significantly when the initialize function is not inlined. It’s somewhat obvious that this is the case (that’s why the benchmark is synthetic — real world examples are about cases where we don’t know if inline is needed). But it is unclear why, exactly, inlining initialize leads to slower code.

For the experiment, I wrote a simple high-level benchmark calling get_or_try_init in a loop:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
const N_LOOPS: usize = 8;
static CELL: OnceCell<usize> = OnceCell::new();

fn main() {
  for i in 0..N_LOOPS {
    go(i)
  }
}

fn go(i: usize) {
  for _ in 0..100_000_000 {
    let &value = CELL.get_or_init(|| i);
    assert!(value < N_LOOPS);
  }
}

I also added compile-time toggle to force or forbid inlining:

1
2
3
#[cfg_attr(feature = "inline_always", inline(always))]
#[cfg_attr(feature = "inline_never", inline(never))]
fn initialize() { ... }

You can see the full benchmark in this commit: matklad/once_cell@a741d5f.

Running both versions shows that #[inline(never)] is indeed measurably faster:

1
2
3
4
5
$ cargo run -q --example bench  --release --features inline_always
330ms

$ cargo run -q --example bench  --release --features inline_never
259ms

Note that we don’t use fancy statistics here. /usr/bin/time is enough to see the difference with a naked eye despite the fact that the effect we are looking for is very low-level. Hence, a general tip: if you are benchmarking relative difference (and not the absolute performance), don’t bother with measuring nanosecond-precision time. Instead, loop the benchmark enough to make the change human-perceptible.

How do we explain the difference? The first step is to remove cargo from the equation and make two binaries for comparison:

1
2
3
4
$ cargo build --example bench --release --features inline_never
$ cp ./target/release/examples/bench never
$ cargo build --example bench --release --features inline_always
$ cp ./target/release/examples/bench always

On Linux, the best tool to quickly access the performance of any program is perf stat. It runs the program and shows a bunch of CPU-level performance counters, which might explain what’s going on. As we suspect that ICache might be to blame, let’s include the counters for caches:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
$ perf stat -e instructions,cycles,\
  L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,\
  L1-icache-loads,L1-icache-load-misses,cache-misses \
  ./always
348ms

 6,396,754,995      instructions:u
 1,601,314,994      cycles:u
 1,600,621,170      L1-dcache-loads:u
         4,806      L1-dcache-load-misses:u
         4,402      L1-dcache-prefetches:u
        69,594      L1-icache-loads:u
           461      L1-icache-load-misses:u
         1,928      cache-misses:u

$ perf stat -e instructions,cycles,\
  L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,\
  L1-icache-loads,L1-icache-load-misses,cache-misses \
  ./never
261ms

 Performance counter stats for './never':

 5,597,215,493      instructions:u
 1,199,960,402      cycles:u
 1,599,404,303      L1-dcache-loads:u
         4,612      L1-dcache-load-misses:u
         4,290      L1-dcache-prefetches:u
        62,268      L1-icache-loads:u
           603      L1-icache-load-misses:u
         1,675      cache-misses:u

There is some difference in L1-icache-load-misses, but there’s also a surprising difference in instructions. What’s more, the L1-icache-load-misses difference is hard to estimate, because it’s unclear what L1-icache-loads are. As a sanity check, statistics for dcache are the same, just as we expect.

While perf takes the real data from the CPU, an alternative approach is to run the program in a simulated environment. That’s what cachegrind tool does. Fun fact: the primary author of cachegrind is @nnethercote, whose Rust Performance Book we saw in the last post. Let’s see what cachegrind thinks about the benchmark.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
$ valgrind --tool=cachegrind ./always
10s
 I   refs:      6,400,577,147
 I1  misses:            1,560
 LLi misses:            1,524
 I1  miss rate:          0.00%
 LLi miss rate:          0.00%

 D   refs:      1,600,196,336
 D1  misses:            5,549
 LLd misses:            4,024
 D1  miss rate:           0.0%
 LLd miss rate:           0.0%

 LL refs:               7,109
 LL misses:             5,548
 LL miss rate:            0.0%

$ valgrind --tool=cachegrind ./never
9s
 I   refs:      5,600,577,226
 I1  misses:            1,572
 LLi misses:            1,529
 I1  miss rate:          0.00%
 LLi miss rate:          0.00%

 D   refs:      1,600,196,330
 D1  misses:            5,553
 LLd misses:            4,024
 D1  miss rate:           0.0%
 LLd miss rate:           0.0%

 LL refs:               7,125
 LL misses:             5,553
 LL miss rate:            0.0%

Note that, because cachegrind simulates the program, it runs much slower. Here, we don’t see a big difference in ICache misses (I1 — first level instruction cache, LLi — last level instruction cache). We do see a difference in ICache references. Note that the number of times CPU refers to ICache should correspond to the number of instructions it executes. Cross-checking the number with perf, we see that both perf and cachegrind agree on the number of instructions executed. They also agree that inline_always version executes more instructions. It’s still hard to say what perf’s sL1-icache-loads means. Judging by the name, it should correspond to cachegrind's I refs, but it doesn’t.

Anyway, it seems there’s one thing that bears further investigation — why inlining changes the number of instructions executed? Inlining doesn’t actually change the code the CPU runs, so the number of instructions should stay the same. Let’s look at the asm then! The right tool here is cargo-asm.

Again, here’s the function we will be locking at:

1
2
3
4
5
6
fn go(tid: usize) {
  for _ in 0..100_000_000 {
    let &value = CELL.get_or_init(|| tid);
    assert!(value < N_THREADS);
  }
}

The call to get_or_init will be inlined, and the nested call to initialize will be inlined depending on the flag.

Let’s first look at the inline_never version:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
  push    r14 ;
  push    rbx ; prologue
  push    rax ;
  mov     qword, ptr, [rsp], rdi
  mov     ebx, 100000001 ; loop counter
  mov     r14, rsp
  jmp     .LBB14_1
 .loop:
  cmp     qword, ptr, [rip, +, CELL+16], 8
  jae     .assert_failure
 .LBB14_1:
  add     rbx, -1
  je      .normal_exit
  mov     rax, qword, ptr, [rip, +, CELL]
  cmp     rax, 2
  je      .loop
  mov     rdi, r14
  call    once_cell::imp::OnceCell<T>::initialize
  jmp     .loop
 .normal_exit:
  add     rsp, 8 ;
  pop     rbx    ; epilogue
  pop     r14a   ;
  ret            ;
 .assert_failure:
  lea     rdi, [rip, +, .L__unnamed_12]
  lea     rdx, [rip, +, .L__unnamed_13]
  mov     esi, 35
  call    qword, ptr, [rip, +, core::panicking::panic@GOTPCREL]
  ud2

And then at the inline_always version:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
  push    rbp  ;
  push    r15  ;
  push    r14  ;
  push    r13  ; prologue
  push    r12  ;
  push    rbx  ;
  sub     rsp, 24
  mov     r12, rdi
  xor     ebx, ebx
  mov     r13d, 1
  lea     r14, [rip, +, CELL]
  mov     rbp, qword, ptr, [rip, +, WaiterQueue::drop@GOTPCREL]
  mov     r15, qword, ptr, [rip, +, once_cell::imp::wait@GOTPCREL]
  jmp     .LBB10_1
 .LBB10_10:
  mov     qword, ptr, [rsp, +, 8], r14
  mov     qword, ptr, [rip, +, CELL+8], 1
  mov     qword, ptr, [rip, +, CELL+16], r12
  mov     qword, ptr, [rsp, +, 16], 2
  lea     rdi, [rsp, +, 8]
  call    rbp
 .loop:
  add     rbx, 1
  cmp     qword, ptr, [rip, +, CELL+16], 8
  jae     .assert_failure
 .LBB10_1:
  cmp     rbx, 100000000
  je      .normal_exit
  mov     rax, qword, ptr, [rip, +, CELL]
  cmp     rax, 2
  je      .loop
 .LBB10_3:
  mov     rax, qword, ptr, [rip, +, CELL]
 .LBB10_4:
  test    rax, rax
  jne     .LBB10_5
  xor     eax, eax
  lock    cmpxchg, qword, ptr, [rip, +, CELL], r13
  jne     .LBB10_4
  jmp     .LBB10_10
 .LBB10_5:
  cmp     rax, 2
  je      .loop
  mov     ecx, eax
  and     ecx, 3
  cmp     ecx, 1
  jne     .LBB10_8
  mov     rdi, r14
  mov     rsi, rax
  call    r15
  jmp     .LBB10_3
 .normal_exit:
  add     rsp, 24 ;
  pop     rbx     ;
  pop     r12     ;
  pop     r13     ; epilogue
  pop     r14     ;
  pop     r15     ;
  pop     rbp     ;
  ret
 .assert_failure:
  lea     rdi, [rip, +, .L__unnamed_9]
  lea     rdx, [rip, +, .L__unnamed_10]
  mov     esi, 35
  call    qword, ptr, [rip, +, core::panicking::panic@GOTPCREL]
  ud2
 .LBB10_8:
  lea     rdi, [rip, +, .L__unnamed_11]
  lea     rdx, [rip, +, .L__unnamed_12]
  mov     esi, 57
  call    qword, ptr, [rip, +, core::panicking::panic@GOTPCREL]
  ud2

I’ve slightly edited the code and also highlighted the hot loop which constitutes the bulk of the benchmark.

Looking at the assembly, we can see the following:

  • code is much larger — inlining happened!

  • function prologue is bigger, compiler pushes more callee-saved registers to the stack

  • function epilogue is bigger, compiler needs to restore more registers

  • stack frame is larger

  • compiler hoisted some of the initialize code to before the loop

  • the core loop is very tight in both cases, just a handful of instructions

  • the core loop counts upwards rather than downwards, adding an extra cmp instruction

Note that it’s highly unlikely that ICache affects the running code, as it’s a small bunch of instructions next to each other in memory. On the other hand, an extra cmp with a large immediate precisely accounts for the amount of extra instructions we observe (the loop is run 800_000_000 times).

Conclusions

It’s hard enough to come up with a benchmark which demonstrate the difference between two alternatives. It’s even harder to explain the difference — there might be many readily available explanations, but they are not necessary true. Nonetheless, today we have a wealth of helpful tooling. Two notable examples are perf and valgrind. Tools are not always correct — it’s a good idea to sanity check different tools against each other and against common-sense understanding of the problem.

For inlining in particular, we found the following reasons why inlining S into C might cause a slow down:

  1. Inlining might cause C to use more registers. This means that prologue and epilogue grow additional push/pop instructions, which also use stack memory. Without inlining, these instructions are hidden in S and are only paid for when C actually calls into S, as opposed to every time C itself is called.

  2. Generalizing from the first point, if S is called in a loop or in an if, the compiler might hoist some instructions of S to before the branch, moving them from the cold path to the hot path.

  3. With more local variables and control flow in the stack frame to juggle, compiler might accidentally pessimize the hot loop.

If you are curious under which conditions ICache does become an issue, there’s this excellent article about one such case.