# Min of Three

How to find a minimum of three `double`

numbers? It may be surprising to you
(it certainly was to me), but there is more than one way to do it, and with big
difference in performance as well. It is possible to make this simple
calculation significantly faster by utilizing
CPU level parallelism.

The phenomenon described in this blog post was observed in this thread of the Rust forum. I am not the one who found out what is going on, I am just writing it down :)

We will be using Rust, but the language is not important, the original program
was in Java. What will turn out to be important is CPU architecture. The laptop
on which the measurements are done has `i7-3612QM`

.

## Test subject

We will be measuring dynamic time warping algorithm. This algorithm
calculates a distance between two real number sequences, `xs`

and `ys`

. It is
very similar to edit distance or Needlemanâ€“Wunsch, because it uses the same
dynamic programming structure.

The main equation is

dtw[i, j] = min(dtw[i-1, j-1], dtw[i, j-1], [i-1, j]) + (xs[i] - ys[i])^{2}

That is, we calculate the distance between each pair of prefixes of `xs`

and `ys`

using the distances from three smaller pairs. This calculation can be represented
as a table where each cell depends on three others:

It is possible to avoid storing the whole table explicitly. Each row depends only on the previous one, so we need to store only two rows at a time.

Here is the Rust code for this version:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34

fn dtw(xs: &[f64], ys: &[f64]) -> f64 {
// assume equal lengths for simplicity
assert_eq!(xs.len(), ys.len());
let n = xs.len();
let mut prev = vec![0f64; n + 1];
let mut curr = vec![std::f64::MAX; n + 1];
curr[0] = 0.0;
for ix in 1..(n + 1) {
std::mem::swap(&mut curr, &mut prev);
curr[0] = std::f64::MAX;
for iy in 1..(n + 1) {
let d11 = prev[iy - 1];
let d01 = curr[iy - 1];
let d10 = prev[iy];
// Find the minimum of d11, d01, d10
// by enumerating all the cases.
let d = if d11 < d01 {
if d11 < d10 { d11 } else { d10 }
} else {
if d01 < d10 { d01 } else { d10 }
};
let cost = {
let t = xs[ix - 1] - ys[iy - 1];
t * t
};
curr[iy] = d + cost;
}
}
curr[n]
}

## Profile first

Is it fast? If we compile it in `--release`

mode with

[build] rustflags = "-C target-cpu=native"

in `~/.cargo/config`

, it takes 435 milliseconds for two
random sequences of length 10000.

What is the bottleneck? Let’s look at the instruction level profile of the main
loop using `perf annotate`

command:

// Find the minimum of three numbers. 0.00 : vmovsd -0x8(%rax,%rsi,8),%xmm1 0.00 : vmovsd (%rax,%rsi,8),%xmm2 0.06 : vminsd %xmm2,%xmm1,%xmm3 9.04 : vminsd %xmm2,%xmm0,%xmm2 0.00 : vcmpltsd %xmm0,%xmm1,%xmm0 22.70 : vblendvpd %xmm0,%xmm3,%xmm2,%xmm0 // Calculate the squared error penalty. 0.00 : vmovsd -0x8(%r12,%r10,8),%xmm1 0.00 : vsubsd -0x8(%r13,%rsi,8),%xmm1,%xmm1 11.01 : vmulsd %xmm1,%xmm1,%xmm1 // Store the result in the `curr` array. // Note how xmm0 is used on the next iteration. 22.81 : vaddsd %xmm1,%xmm0,%xmm0 10.67 : vmovsd %xmm0,(%rdi,%rsi,8)

`perf annotate`

uses AT&T assembly syntax, this means that the destination
registry is on the right.

The `xmm0`

registry holds the value of `curr[iy]`

, which was calculated on the
previous iteration. Values of `prev[iy - 1]`

and `prev[iy]`

are fetched into
`xmm1`

and `xmm2`

. Note that although the original code contained three `if`

expressions, the assembly does not have any jumps and instead uses two `min`

and
one `blend`

instruction to select the minimum. Nevertheless, a significant
amount of time, according to `perf`

, is spent calculating the minimum.

## Optimization

Can we do better? Let’s use `min2`

function to calculate minimum of three
elements recursively:

1
2
3
4
5
6
7
8
9

fn min2(x: f64, y: f64) -> f64 {
if x < y { x } else { y }
}
fn dtw(xs: &[f64], ys: &[f64]) -> f64 {
// ...
let d = min2(min2(d11, d01), d10);
// ...
}

This version completes in 430 milliseconds, which is a nice win of 5 milliseconds over the first version, but is not that impressive. The assembly looks cleaner though:

0.00 : vmovsd -0x8(%rax,%rsi,8),%xmm1 0.28 : vminsd %xmm0,%xmm1,%xmm0 31.14 : vminsd (%rax,%rsi,8),%xmm0,%xmm0 0.06 : vmovsd -0x8(%r12,%r10,8),%xmm1 0.28 : vsubsd -0x8(%r13,%rsi,8),%xmm1,%xmm1 10.61 : vmulsd %xmm1,%xmm1,%xmm1 23.29 : vaddsd %xmm1,%xmm0,%xmm0 11.11 : vmovsd %xmm0,(%rdi,%rsi,8)

Up to this point it was a rather boring blog post about Rust with some assembly thrown in. But let’s tweak the last variant just a little bit …

1
2
3
4
5
6

fn dtw(xs: &[f64], ys: &[f64]) -> f64 {
// ...
// Swap d10 and d01.
let d = min2(min2(d11, d10), d01);
// ...
}

This version takes only 287 milliseconds to run, which is roughly 1.5 times faster than the previous one! However, the assembly looks almost the same …

0.08 : vmovsd -0x8(%rax,%rsi,8),%xmm1 0.17 : vminsd (%rax,%rsi,8),%xmm1,%xmm1 16.40 : vminsd %xmm0,%xmm1,%xmm0 0.00 : vmovsd -0x8(%r12,%r10,8),%xmm1 0.17 : vsubsd -0x8(%r13,%rsi,8),%xmm1,%xmm1 18.24 : vmulsd %xmm1,%xmm1,%xmm1 17.15 : vaddsd %xmm1,%xmm0,%xmm0 15.82 : vmovsd %xmm0,(%rdi,%rsi,8)

The only difference is that two `vminsd`

instructions are swapped.
But it is definitely much faster.

## A possible explanation

A possible explanation is a synergy of CPU level parallelism and speculative execution. It was proposed by @krdln and @vitalyd. I don’t know how to falsify it, but it at least looks plausible to me!

Imagine for a second that instead of `vminsd %xmm0,%xmm1,%xmm0`

instruction
in the preceding assembly there is just `vmovsd %xmm1,%xmm0`

. That is, we don’t
use `xmm0`

from the previous iteration at all! This corresponds to the following
update rule:

The important property of this update rule is that CPU can calculate two cells
simultaneously in parallel, because there is no data dependency between
`curr[i]`

and `curr[i + 1]`

.

We do have `vminsd %xmm0,%xmm1,%xmm0`

, but it is equivalent to ```
vmovsd
%xmm1,%xmm0
```

if `xmm1`

is smaller than `xmm0`

. And this is often the case:
`xmm1`

holds the minimum of upper and diagonal cell, so it is likely to be less
then a single cell to the left. Also, the diagonal path is taken slightly more
often then the two alternatives, which adds to the bias.

So it looks like the CPU is able to speculatively execute `vminsd`

and
parallelise the following computation based on this speculation! Isn’t that
awesome?

## Further directions

It’s interesting that we can make the computation truly parallel if we update the cells diagonally:

This is explored in the second part of this post.

## Conclusion

Despite the fact that Rust is a high level language, there is a strong
correlation between the source code and the generated assembly. Small tweaks to
the source result in the small changes to the assembly with potentially big
implications for performance. Also, `perf`

is great!

That’s all :)