# Min of Three

Mar 12, 2017

How do you find the minimum of three `double` numbers? It may be surprising to you (it certainly was to me), but there is more than one way to do it, and the ways differ significantly in performance as well. It is possible to make this simple calculation significantly faster by utilizing CPU-level parallelism.

The phenomenon described in this blog post was observed in a thread on the Rust forum. I am not the one who figured out what is going on, I am just writing it down :)

We will be using Rust, but the language is not important; the original program was in Java. What will turn out to be important is the CPU architecture. The laptop on which the measurements were done has an `i7-3612QM` CPU.

## Test subject

We will be measuring the dynamic time warping algorithm. This algorithm calculates a distance between two sequences of real numbers, `xs` and `ys`. It is very similar to edit distance or Needleman–Wunsch, because it uses the same dynamic programming structure.

The main equation is

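In the notation below, `dtw(i, j)` is the distance between the length-`i` prefix of `xs` and the length-`j` prefix of `ys` (reconstructed from the description that follows):

```latex
\mathrm{dtw}(i, j) = |xs_i - ys_j| + \min\bigl(\mathrm{dtw}(i-1, j),\ \mathrm{dtw}(i, j-1),\ \mathrm{dtw}(i-1, j-1)\bigr)
```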
That is, we calculate the distance between each pair of prefixes of `xs` and `ys` using the distances for three smaller pairs of prefixes. The calculation can be represented as a table in which each cell depends on three of its neighbors: the one to the left, the one above, and the one on the upper-left diagonal.

It is possible to avoid storing the whole table explicitly. Each row depends only on the previous one, so we need to store only two rows at a time.

Here is the Rust code for this version:

Code on Rust playground
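A self-contained sketch of this version, reconstructed from the description above (the initialization details are assumptions, so the playground code may differ slightly):

```rust
fn dtw(xs: &[f64], ys: &[f64]) -> f64 {
    // Only two rows of the dynamic programming table are stored.
    let mut prev = vec![f64::MAX; ys.len() + 1];
    let mut curr = vec![f64::MAX; ys.len() + 1];
    prev[0] = 0.0;
    for ix in 1..=xs.len() {
        curr[0] = f64::MAX;
        for iy in 1..=ys.len() {
            let d11 = prev[iy - 1]; // diagonal neighbor
            let d01 = curr[iy - 1]; // left neighbor
            let d10 = prev[iy]; // upper neighbor

            // Minimum of three values via a chain of `if` expressions.
            let d = if d11 < d01 {
                if d11 < d10 { d11 } else { d10 }
            } else if d01 < d10 {
                d01
            } else {
                d10
            };

            curr[iy] = d + (xs[ix - 1] - ys[iy - 1]).abs();
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[ys.len()]
}

fn main() {
    let xs = [1.0, 2.0, 3.0];
    let ys = [1.0, 2.0, 2.0, 3.0];
    // These sequences align perfectly, so the distance is 0.
    println!("dtw = {}", dtw(&xs, &ys));
}
```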

## Profile first

Is it fast? If we compile it in `--release` mode with

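(the exact flags here are my reconstruction; `target-cpu=native` lets the compiler use all the instructions the local CPU supports)

```toml
# Assumed configuration:
[build]
rustflags = "-C target-cpu=native"
```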
in `~/.cargo/config`, it takes 435 milliseconds for two random sequences of length 10000.

What is the bottleneck? Let's look at the instruction-level profile of the main loop using the `perf annotate` command:

`perf annotate` uses AT&T assembly syntax, which means that the destination register is on the right.

The `xmm0` register holds the value of `curr[iy]`, which was calculated on the previous iteration. Values of `prev[iy - 1]` and `prev[iy]` are fetched into `xmm1` and `xmm2`. Note that although the original code contained three `if` expressions, the assembly does not have any jumps and instead uses two `min` and one `blend` instruction to select the minimum. Nevertheless, a significant amount of time, according to `perf`, is spent calculating the minimum.

## Optimization

Can we do better? Let's use a `min2` helper function and calculate the minimum of three elements as two pairwise minima:

Code on Rust playground
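A sketch of this variant under the same assumptions as before; only the computation of the minimum changes:

```rust
// Pairwise minimum; `min2(min2(a, b), c)` gives the minimum of three.
fn min2(x: f64, y: f64) -> f64 {
    if x < y { x } else { y }
}

fn dtw(xs: &[f64], ys: &[f64]) -> f64 {
    let mut prev = vec![f64::MAX; ys.len() + 1];
    let mut curr = vec![f64::MAX; ys.len() + 1];
    prev[0] = 0.0;
    for ix in 1..=xs.len() {
        curr[0] = f64::MAX;
        for iy in 1..=ys.len() {
            let d11 = prev[iy - 1]; // diagonal
            let d01 = curr[iy - 1]; // left
            let d10 = prev[iy]; // up

            // The left neighbor, which was computed on the previous
            // loop iteration, participates in the first comparison.
            let d = min2(min2(d11, d01), d10);

            curr[iy] = d + (xs[ix - 1] - ys[iy - 1]).abs();
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[ys.len()]
}

fn main() {
    println!("dtw = {}", dtw(&[1.0, 2.0, 3.0], &[1.0, 2.0, 2.0, 3.0]));
}
```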

This version completes in 430 milliseconds, which is a nice win of 5 milliseconds over the first version, but is not that impressive. The assembly looks cleaner though:

Up to this point it was a rather boring blog post about Rust with some assembly thrown in. But let's tweak the last variant just a little bit:

Code on Rust playground
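As far as I can tell, the tweak amounts to reassociating the minimum so that the two values from the previous row are compared first (again a sketch under the same assumptions):

```rust
fn min2(x: f64, y: f64) -> f64 {
    if x < y { x } else { y }
}

fn dtw(xs: &[f64], ys: &[f64]) -> f64 {
    let mut prev = vec![f64::MAX; ys.len() + 1];
    let mut curr = vec![f64::MAX; ys.len() + 1];
    prev[0] = 0.0;
    for ix in 1..=xs.len() {
        curr[0] = f64::MAX;
        for iy in 1..=ys.len() {
            let d11 = prev[iy - 1]; // diagonal
            let d01 = curr[iy - 1]; // left
            let d10 = prev[iy]; // up

            // `d11` and `d10` both come from the previous row, so the
            // inner `min2` does not depend on the previous loop
            // iteration; the left neighbor joins last.
            let d = min2(min2(d11, d10), d01);

            curr[iy] = d + (xs[ix - 1] - ys[iy - 1]).abs();
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[ys.len()]
}

fn main() {
    println!("dtw = {}", dtw(&[1.0, 2.0, 3.0], &[1.0, 2.0, 2.0, 3.0]));
}
```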

This version takes only 287 milliseconds to run, which is roughly 1.5 times faster than the previous one! However, the assembly looks almost the same:

The only difference is that two `vminsd` instructions are swapped. But it is definitely much faster.

## A possible explanation

A possible explanation is a synergy of CPU-level parallelism and speculative execution. It was proposed by @krdln and @vitalyd. I don't know how to falsify it, but it at least looks plausible to me!

Imagine for a second that instead of the `vminsd %xmm0,%xmm1,%xmm0` instruction in the preceding assembly there is just `vmovsd %xmm1,%xmm0`. That is, we don't use `xmm0` from the previous iteration at all! This corresponds to the following update rule:

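In terms of the code above, with the same cost term as before (reconstructed; the left neighbor `curr[j - 1]` simply drops out):

```latex
\mathit{curr}[j] = \mathrm{cost}(i, j) + \min\bigl(\mathit{prev}[j-1],\ \mathit{prev}[j]\bigr)
```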
The important property of this update rule is that the CPU can calculate two cells simultaneously in parallel, because there is no data dependency between `curr[i]` and `curr[i + 1]`.

We do have `vminsd %xmm0,%xmm1,%xmm0`, but it is equivalent to `vmovsd %xmm1,%xmm0` whenever `xmm1` is smaller than `xmm0`. And this is often the case: `xmm1` holds the minimum of the upper and the diagonal cell, so it is likely to be less than the single cell to the left. Also, the diagonal path is taken slightly more often than the two alternatives, which adds to the bias.

So it looks like the CPU is able to speculatively execute `vminsd` and parallelise the following computation based on this speculation! Isn't that awesome?

## Further directions

It's interesting that we can make the computation truly parallel if we update the cells diagonally: all the cells on an anti-diagonal depend only on the two previous anti-diagonals, so they can be computed independently of each other.

This is explored in the second part of this post.

## Conclusion

Despite the fact that Rust is a high-level language, there is a strong correlation between the source code and the generated assembly. Small tweaks to the source result in small changes to the assembly, with potentially big implications for performance. Also, `perf` is great!

That's all :)