Codegen check

Portable CPU microbenchmarks like Subwoofer are hard to write because they aim to exercise CPUs with tightly controlled instruction streams, without writing all the associated hardware-specific machine code by hand.

This goal can only be achieved through a very careful balance between two opposing forces:

  • On one hand we want to let the compiler produce maximally optimized code for the target CPU, with minimal effort on our side.
  • On the other hand we need to avoid compiler optimizations that change the nature of the benchmarked code, such as turning scalar code into SIMD code or hoisting repeated square root computations out of the benchmark’s inner loop.

If this balance is not struck precisely, we get either artificially bad machine code that is less bottlenecked by floating-point arithmetic than we intended, or artificially “good” machine code that does not measure the hardware performance characteristics that we are interested in.
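To make the hoisting hazard concrete, here is a minimal sketch that is not taken from Subwoofer's source. The function names are hypothetical, and the use of std::hint::black_box as the optimization barrier is an assumption made for illustration; the standard library documents it as a best-effort hint, not a guarantee.

```rust
use std::hint::black_box;

/// Sums `iters` square roots of `x`.
/// `x` is loop-invariant, so the optimizer is allowed to compute the
/// square root once, outside the loop, leaving only additions inside.
fn sqrt_sum_hoistable(x: f32, iters: u32) -> f32 {
    let mut acc = 0.0f32;
    for _ in 0..iters {
        acc += x.sqrt();
    }
    acc
}

/// Same computation, but `x` goes through `black_box` on every
/// iteration. This asks the compiler (best effort) to treat the value
/// as unknown each time, which discourages hoisting the square root.
fn sqrt_sum_guarded(x: f32, iters: u32) -> f32 {
    let mut acc = 0.0f32;
    for _ in 0..iters {
        acc += black_box(x).sqrt();
    }
    acc
}

fn main() {
    // Both variants compute the same value; only the generated
    // machine code is expected to differ.
    let a = sqrt_sum_hoistable(black_box(4.0), 1000);
    let b = sqrt_sum_guarded(black_box(4.0), 1000);
    assert_eq!(a, b);
    println!("hoistable = {a}, guarded = {b}");
}
```

Both functions return the same result, so the difference is only visible in the generated assembly, which is why the inner loop has to be inspected rather than trusted. A barrier like this can also cost something, for example by forcing a value through memory, which is one way to end up with the artificially bad code described above.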

At the time of writing, Subwoofer is known to compile down to optimal machine code on certain x86 CPUs, when using the rustc nightly that is specified by its rust-toolchain file. But there is no guarantee that optimal code will also be generated for other CPUs, or the next time we upgrade to a newer rustc nightly. Any suspicious performance numbers should therefore prompt you to check that the generated machine code is correct for all the benchmarks that you are exercising.

On Linux, the easiest way to perform this validation is to use perf to profile the set of benchmark configurations that you are interested in, running them in --profile-time=1 mode and with a minimal subnormal occurrence frequency resolution (since this resolution does not affect code generation)…

# measure_codegen is like measure, but without the increased freq resolution
cargo bench --features=measure_codegen --bench=f32x08 -- --profile-time=1

…then inspect the assembly of the inner loop of each benchmark that actually got executed, using the “annotate” feature of perf report.

The provided run_attended.sh script is designed to support a more elaborate version of this approach to analyzing the code that is generated by rustc/LLVM, in which…

  • cargo criterion is used instead of cargo bench to speed up compilation
  • cargo build is used before cargo criterion to avoid profiling compilation
  • Benchmarks are compiled with a minimized target-feature set for optimal codegen
  • A separate perf.data file is produced for each benchmarked floating-point type, which makes it easier to detect type-specific issues like autovectorization bugs