Full measurements

To fully assess the impact of subnormal numbers on your CPU’s floating-point arithmetic performance, you will want to run a more extensive set of benchmarks that…

  • Measures the performance of more arithmetic operations
  • Checks CPU register inputs in addition to memory inputs from the L1 cache
  • Benchmarks SIMD floating-point types in addition to scalar ones
  • Increases subnormal share resolution as needed to precisely probe the performance vs subnormal input share curve (e.g. measure the maximal subnormal number processing overhead and the subnormal input share at which this overhead is observed)

The measure Cargo feature can be used to quickly set up a benchmark configuration that has proven suitable for all the hardware against which Subwoofer has been tested so far, at the expense of often being overly precise and thus taking an unnecessarily long time to build and run:

cargo bench --features=measure

If the execution time of this generic configuration is a problem, you may want to deviate from it by reducing the number of benchmarks that are run to the minimum needed to acquire the data that you are interested in. In the remainder of this chapter, we will discuss how this is done.

Arithmetic operation set

By default, the measure configuration enables all supported microbenchmarks. Depending on the results of the previous basic check, this may be overkill and spend a large amount of time re-measuring information that you already know with unnecessary extra precision.

Here’s how to decide which benchmarks you can disable:

  • If one of the add, mul_max, sqrt_positive_max, div_numerator_max and div_denominator_min benchmarks was not affected by subnormals during the basic check, then you can disable it during the full measurement.
  • If the fma_full_max benchmark was not affected by subnormals during the basic check, then you can disable the fma_addend, fma_multiplier and fma_full_max benchmarks during the full measurement.
  • You cannot disable the max benchmark during the full measurement unless no benchmark other than add was affected by subnormals.

If you are in one of those cases, then you may want to stop using the catch-all measure Cargo feature, and instead use finer-grained Cargo features that let you control benchmarks on a case-by-case basis. Please check out the definition of the measure and bench_xyz features in Subwoofer’s Cargo.toml to know which set of Cargo features you should enable in this case. As an easier but slower alternative, you may also disable those benchmarks at runtime using cargo bench’s regex-based benchmark name filter.

For example, on AMD Zen 2/3 CPUs where only the performance of MUL, SQRT and DIV are affected by subnormals, you could restrict the set of benchmarks at compile time like this…

# This is correct as of 2024-12-29, but beware that the set of Cargo features
# covered by the "measure" option may evolve in future versions of Subwoofer
cargo bench --no-default-features \
            --features=bench_max,bench_mul_max,bench_sqrt_positive_max,bench_div_numerator_max,bench_div_denominator_min,cargo_bench_support,register_data_sources,simd

…or, alternatively, restrict it at runtime like this:

cargo bench --features=measure -- '(max|mul_max|sqrt_positive_max|div_numerator_max|div_denominator_min)'
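The selection rules listed above can be applied mechanically. The following is a minimal sketch, not part of Subwoofer: build_filter is a hypothetical helper that takes the names of the benchmarks that the basic check found to be affected by subnormals, and prints a regex suitable for the runtime filter.

```shell
#!/bin/sh
# Hypothetical helper (not part of Subwoofer): print a benchmark name filter
# from the benchmarks that the basic check found to be affected by subnormals.
build_filter() {
    keep=""
    need_max=0
    for bench in "$@"; do
        case "$bench" in
            add)
                # add alone does not require the max benchmark
                keep="$keep|add" ;;
            max)
                need_max=1 ;;
            mul_max|sqrt_positive_max|div_numerator_max|div_denominator_min)
                keep="$keep|$bench"
                need_max=1 ;;
            fma_full_max)
                # The three FMA benchmarks are kept or dropped together
                keep="$keep|fma_addend|fma_multiplier|fma_full_max"
                need_max=1 ;;
            *)
                echo "unknown benchmark: $bench" >&2
                return 1 ;;
        esac
    done
    if [ "$need_max" -eq 1 ]; then
        keep="|max$keep"
    fi
    if [ -z "$keep" ]; then
        echo "nothing was affected, so there is nothing to measure" >&2
        return 1
    fi
    # Strip the leading '|' and wrap the alternatives in a group
    printf '(%s)\n' "${keep#|}"
}

# AMD Zen 2/3 case from the text: MUL, SQRT and DIV are affected
build_filter mul_max sqrt_positive_max div_numerator_max div_denominator_min
```

For the Zen 2/3 inputs, this prints the same filter as the command above. Note that if the filter is matched as an unanchored regular expression, a short alternative such as max also matches every longer name that contains it, so it is worth checking which benchmarks actually run.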

ILP configurations

As discussed in the context of the basic check, we normally expect CPUs to operate at peak floating-point throughput when they are fed with code that has the highest possible amount of instruction-level parallelism. But sometimes CPU frontend limitations get in the way and make the optimal degree of instruction-level parallelism smaller than this.

This is why by default, we run benchmarks at both the maximum possible ILP and half this ILP: in most cases, one of these configurations will be the fastest possible one for your CPU, or at least very close to the performance of the optimal ILP configuration.

If you found out during the basic check that your CPU needs an even more drastic ILP reduction to perform at optimal throughput, you will need to enable the more_ilp_configurations Cargo feature for the full benchmark that we are discussing in this chapter. In this case, consider sending us a bug report: if your CPU model is sufficiently common, we may want to enable more ILP configurations by default, as we aim for a default configuration that Just Works.

Once you’ve found the level of ILP that leads to peak throughput for each benchmark, you can speed up benchmark execution by only enabling this ILP configuration along with the maximally latency-bound chained configuration. Here is an example of applying such an ILP configuration filter when benchmarking SQRT performance on an AMD Zen 2 CPU:

cargo bench --features=measure -- '(max/ilp08|sqrt_positive_max/ilp04)|chained'
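Before committing to a long run, it can help to check which benchmark names a filter selects. The sketch below uses grep -E as a stand-in for the benchmark harness's own regex matching, which is only an approximation. The benchmark names in it are made up for illustration: only their ilpNN and chained components are taken from the filter above.

```shell
#!/bin/sh
# Print the names from stdin that an unanchored extended regex would select.
# grep -E only approximates the regex engine used by the benchmark harness.
preview_filter() {
    grep -E -- "$1"
}

# Hypothetical benchmark names, for illustration only
names='max/ilp16
max/ilp08
sqrt_positive_max/ilp08
sqrt_positive_max/ilp04
sqrt_positive_max/chained'

printf '%s\n' "$names" | preview_filter '(max/ilp08|sqrt_positive_max/ilp04)|chained'
```

With these made-up names, every line except max/ilp16 is selected. The sqrt_positive_max/ilp08 entry is included because it contains max/ilp08 as a substring, which costs some execution time but does not affect the results.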

CPU register inputs

On many common CPUs, the performance of Subwoofer microbenchmarks that operate on data from the L1 data cache is mainly limited by floating-point arithmetic performance, rather than memory subsystem performance. However…

  • This is not true of all benchmarks. For example, fma_full_max on all-normal inputs from the L1 cache is memory-bound on all CPUs that Subwoofer has been tested against so far.
  • It may not be true on all CPUs because the memory operations do consume some CPU frontend and backend resources, and these resources could become the bottleneck on CPU cores that are less well-balanced than those which Subwoofer has been tested against so far.

For this reason, the register_data_sources Cargo feature, which is part of the broader measure feature, enables alternate versions of the microbenchmarks that run against inputs from CPU registers instead of inputs from the L1 cache. This configuration avoids memory subsystem bottlenecks entirely, at the expense of having other drawbacks:

  • Because the input dataset is tiny, a sufficiently smart CPU backend could apply optimizations to its internal subnormal number processing logic that do not apply on a more realistic data scale.
  • Even if the CPU does not apply such optimizations, rustc and LLVM can perform some of them. We apply optimization barriers to prevent them from doing so, but they may sometimes come at the expense of a reduction in generated code quality.

Because these drawbacks normally outweigh the benefits of not putting pressure on the CPU’s memory subsystem, we strongly advise against running only those versions of the benchmarks, and we suggest taking their results with a respectable grain of salt:

  • If they are a bit faster than the benchmarks that operate from the L1 cache (say, 20% faster), it may reflect a genuine hardware performance benefit of avoiding the memory subsystem, and thus a “purer” measurement of the CPU’s subnormal number processing overhead.
  • If they are more than 2x faster or, worse, slower, you should disregard any data that comes out of them as suspicious by default, unless you have the time to carefully analyze the generated code and runtime CPU microarchitecture behavior to prove this default hypothesis wrong.
  • In general, if you have any doubt, your default assumption should be that benchmarks that operate from the L1 cache are “more right” than those that operate from CPU registers, until proven otherwise.
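As an illustration of these rules of thumb, here is a minimal sketch of a check that compares two timings of the same benchmark. The classify_register_result helper is hypothetical, and treating every speedup between 1x and 2x as plausible is an assumption of this sketch: the text above only gives 20% as an example of a credible gain.

```shell
#!/bin/sh
# Hypothetical helper: compare a timing measured on L1 cache inputs with the
# timing of the same benchmark on register inputs (same unit, both > 0).
classify_register_result() {
    awk -v l1="$1" -v reg="$2" 'BEGIN {
        speedup = l1 / reg   # > 1 means the register version is faster
        if (speedup < 1 || speedup > 2)
            print "suspicious"
        else
            print "plausible"
    }'
}

classify_register_result 1.2 1.0   # register version 20% faster
classify_register_result 3.0 1.0   # register version 3x faster
```

The first call prints plausible and the second prints suspicious. Even a plausible verdict does not change the default assumption that the L1 cache results are the reference.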

If you want to disable register inputs to speed up the benchmark’s build and execution, then the most efficient way is to refrain from enabling the register_data_sources Cargo feature.

Sadly, Cargo features cannot be disabled, only enabled, so the only way to enable a subset of the measure feature pack is to look up its definition in Cargo.toml and only enable the subset that you want, as discussed above in the context of controlling the set of benchmarked operations.

An easier but slower alternative is to only disable the execution of benchmarks at runtime using cargo bench’s regex filter, like this:

cargo bench --features=measure -- 'L1cache'

More memory data sources

By default, we only measure the performance of floating-point operations on memory inputs that fit in the L1 data cache, because that’s where the performance impact of subnormal numbers is expected to be the highest. As we access increasingly remote memory inputs, the CPU is expected to spend less time crunching numbers, and more time waiting for the memory subsystem, resulting in a reduction of the relative subnormal number processing overhead.

This is, however, only 100% expected when processing data at the maximal SIMD vector width supported by your CPU architecture. When data is accessed in smaller chunks, the CPU’s spatial prefetcher gets more time to hide the latency of remote memory accesses by predicting future memory accesses before the CPU has even requested them, and as a result performance for e.g. scalar data will often be similar for all layers of the memory hierarchy.

If you are interested in studying this kind of effect, consider adding more_memory_data_sources to the set of Cargo features that you are enabling. This feature is never enabled by default because it falls a bit outside the scope of what Subwoofer normally aims to measure (namely the impact of subnormal numbers on the performance of floating-point arithmetic) and it causes an enormous increase in total benchmark execution time.

SIMD data types

Depending on how the CPU’s subnormals fallback is implemented, its performance may or may not depend on the floating-point data type that is being manipulated. In particular, it may be faster or slower for code that operates on SIMD vectors of numbers, rather than individual “scalar” numbers.

To account for this, the simd Cargo feature, which is part of the broader measure feature, lets you test the performance of all supported SIMD vector types. If you are trying to speed up benchmark runs, we advise narrowing this down to just the widest supported SIMD vector types and scalar data, as it is unlikely that intermediate vector sizes will behave very differently. This can be done using cargo bench’s regex filtering feature:

# Suitable for an x86 CPU whose native vector width is 256-bit AVX
cargo bench --features=measure -- 'f(32|64)(x08)?/'
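For other vector widths, the type filter can be derived from the native vector width. The sketch below assumes that type names follow the pattern visible in the filter above, i.e. f32 or f64 followed by an optional x and a two-digit lane count. Both this naming scheme and the widest_type_filter helper are assumptions, not something confirmed by Subwoofer.

```shell
#!/bin/sh
# Hypothetical helper: print a type filter for scalar types plus the widest
# vector type of each precision, given the native vector width in bits.
# Assumes names of the form f<bits>x<lanes>, with lanes zero-padded to 2 digits.
widest_type_filter() {
    width=$1
    printf '(f32(x%02d)?|f64(x%02d)?)/\n' $((width / 32)) $((width / 64))
}

widest_type_filter 256
```

Under that naming assumption, a 256-bit vector holds eight f32 lanes but only four f64 lanes, so this sketch selects f64x04 where the pattern above would look for f64x08. Check the actual benchmark names before relying on either pattern.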

Another way in which SIMD affects our benchmarks is that we need to apply optimization barriers to prevent the compiler from auto-vectorizing our scalar benchmarks into SIMD code, or our narrow SIMD vector benchmarks into wider SIMD code. Unfortunately, these optimization barriers may come at the expense of a reduction in generated code quality, so it is best to avoid them.

To do this, you will need to compile each type-specific benchmark with the narrowest target-feature set that is…

  1. Needed to process this particular scalar/SIMD floating-point type
  2. Legal on the target CPU architecture

The provided run_attended.sh script applies this approach to optimize codegen on x86 CPUs.
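To give an idea of what this can look like on x86_64, here is a minimal sketch that maps a vector width to rustc's -C target-feature flag. The rustflags_for_width helper is hypothetical and is not necessarily what run_attended.sh does. It also ignores the extra +fma feature that the FMA benchmarks would presumably need.

```shell
#!/bin/sh
# Hypothetical helper: print the rustc flags enabling the narrowest x86_64
# vector extension able to hold one vector of the given width in bits.
# This is not necessarily what run_attended.sh does.
rustflags_for_width() {
    case "$1" in
        32|64|128) echo "" ;;   # SSE2 is part of the x86_64 baseline
        256)       echo "-C target-feature=+avx" ;;
        512)       echo "-C target-feature=+avx512f" ;;
        *)         echo "unsupported width: $1" >&2; return 1 ;;
    esac
}

# Print, rather than run, the command for 256-bit f32 vectors
echo "RUSTFLAGS='$(rustflags_for_width 256)' cargo bench --features=measure -- 'f32x08/'"
```

Each type-specific run then uses its own flags, so that the compiler cannot emit instructions wider than the type under test.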

Subnormal frequency resolution

The last tunable of the Subwoofer microbenchmark suite is the set of subnormal_freq_resolution_1inN Cargo features. It controls the number of subnormal occurrence frequencies that are probed, and thus the horizontal resolution of the subnormal overhead vs occurrence frequency graph in the performance report that each benchmark will generate at the end.

Like many other configurable Subwoofer settings, this is a tradeoff between benchmark execution time and output precision: the higher the frequency resolution, the longer the benchmarks will take to run, but the more precise the output data will be in the end.

Unless you have a good reason to do otherwise, we would advise using a frequency resolution that is sufficient to precisely measure the horizontal and vertical position of the peak of maximum overhead on your hardware, as well as the general shape of the curves on either side of this peak (linear, exponential-like, or logarithm-like?).

Alas, the optimal resolution is hardware-dependent, and can only be found through slow experimentation. But the default setting enabled by the measure Cargo feature is known to be good enough on all CPUs where Subwoofer has been tested so far. We’d like to keep it that way, so if you find a CPU where the default is not precise enough, please consider sending us a bug report.

Maximizing coverage

So far, this chapter has focused on techniques to reduce the execution time of Subwoofer to a minimum, while still measuring the most important data. However, there are situations where benchmark execution time is not as much of a problem, and the largest concern is instead to make sure all important data has been acquired. This is the case, for example, when debugging a new benchmark, or when benchmarking hardware to which you only have temporary access.

In this case, the simplest option is to run benchmarks with the --all-features Cargo option, which will instruct Subwoofer to measure as much data as possible…

cargo bench --all-features

…and then make sure that you remember to copy the contents of the target/criterion directory before you lose access to the hardware or accidentally cause a cargo clean disaster. The price to pay is that this configuration is extremely slow, and will take days to run to completion.
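One way to keep the results safe is to archive them as soon as a run completes. The following is a minimal sketch: backup_results is a hypothetical helper, and the archive naming scheme is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: archive Criterion's output directory under a
# timestamped name, so that a later `cargo clean` cannot destroy it.
backup_results() {
    src="${1:-target/criterion}"
    dest="${2:-.}"
    if [ ! -d "$src" ]; then
        echo "no results found in $src" >&2
        return 1
    fi
    archive="$dest/criterion-$(date +%Y%m%d-%H%M%S).tar.gz"
    # Store paths relative to the parent of the results directory
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" && echo "$archive"
}
```

Calling backup_results without arguments from the project root writes the archive to the current directory and prints its path. Copy that file off the machine before you lose access to it.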

A better latency vs execution time tradeoff can be achieved by using the provided run_attended.sh script. It runs multiple benchmark passes of increasing precision/execution time so that you get basic results early, and full results eventually, at the expense of a negligibly small increase in overall execution time with respect to --all-features.