User guide

Welcome to the documentation of Subwoofer! This project provides a set of microbenchmarks that lets you review how subnormal numbers affect the performance of floating-point arithmetic operations on your CPU’s microarchitecture.

Currently supported arithmetic includes ADD/SUB, MIN/MAX, MUL, FMA, DIV and SQRT of positive numbers, mainly with subnormal inputs and sometimes with subnormal outputs too.
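As a quick refresher, subnormal (aka denormal) numbers are the IEEE 754 values that lie between zero and the smallest normal number, represented at reduced precision. This small Rust program (independent from Subwoofer itself) demonstrates them:

```rust
fn main() {
    // The smallest positive *normal* f32 is about 1.18e-38...
    let normal = f32::MIN_POSITIVE;
    assert!(normal.is_normal());

    // ...but IEEE 754 can represent even smaller nonzero values, at
    // reduced precision, by giving up the implicit leading mantissa
    // bit. These are the subnormal numbers.
    let subnormal = normal / 2.0;
    assert!(subnormal.is_subnormal());
    assert!(subnormal > 0.0);

    // The smallest subnormal f32 is about 1.4e-45. Halving it rounds
    // down to zero: this is where gradual underflow finally ends.
    let smallest = f32::from_bits(1);
    assert!(smallest.is_subnormal());
    assert_eq!(smallest / 2.0, 0.0);
}
```

It is this reduced-precision representation that some CPUs handle via a slow fallback path, which is what Subwoofer measures.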

At the time of writing, this benchmark has only been rigorously checked for correctness on x86_64. But it has been designed with due consideration for other common CPU microarchitectures, so I believe that given a week or two of interactive access to an ARM or RISC-V machine with perf profiler support, I should be able to validate/debug it for those ISAs too.

System setup

In this section, we discuss the steps needed to prepare your computer for a Subwoofer run. Some of these steps are required for the benchmark to build and run at all, while others are mere suggestions to improve the accuracy and reproducibility of the data acquisition process.

Unless specified otherwise, you only need to perform these setup steps once per computer. They do not need to be repeated if you later need to re-run Subwoofer on the same machine, either to test out new Subwoofer developments or to explore new aspects of subnormal arithmetic that you skipped on a previous run.

Requirements

Because this project uses features specific to your CPU model, it is not easily amenable to binary distribution. The recommended way to use it is therefore to roll out an optimized local build for your machine.

For this, you are going to need rustup and the libhwloc C library along with the associated pkg-config file. The latter should either be installed system-wide or be reachable via your PKG_CONFIG_PATH.

On Unices like Linux and macOS, you can install libhwloc and the tooling needed to link it with Rust code by running the following commands…

  • macOS:
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"  \
    && brew install hwloc pkgconf
    
  • Debian/Ubuntu/Mint:
    sudo apt-get update  \
    && sudo apt-get install build-essential libhwloc-dev libudev-dev pkg-config
    
  • Fedora:
    sudo dnf makecache --refresh   \
    && sudo dnf group install c-development   \
    && sudo dnf install hwloc-devel libudev-devel pkg-config
    
  • RHEL/Alma/Rocky:
    sudo dnf makecache --refresh  \
    && sudo dnf groupinstall "Development Tools"  \
    && sudo dnf install epel-release  \
    && sudo /usr/bin/crb enable  \
    && sudo dnf makecache --refresh  \
    && sudo dnf install hwloc-devel libudev-devel pkg-config
    
  • Arch:
    sudo pacman -Syu base-devel libhwloc pkg-config
    
  • openSUSE:
    sudo zypper ref  \
    && sudo zypper in -t pattern devel_C_C++  \
    && sudo zypper in hwloc-devel libudev-devel pkg-config
    

…then you can install rustup and bring it into your current shell’s environment with those commands:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain none  \
&& . "$HOME/.cargo/env"

The required nightly Rust toolchain will then be installed automatically on the first run of one of the cargo bench commands discussed in the data acquisition section.

On Windows, I would recommend using Windows Subsystem for Linux (aka WSL) and following the instructions for Ubuntu above because…

  • WSL offers a much better software developer user experience than native Windows development.
  • Contrary to what you may think, the underlying Linux virtual machine will not get in the way of precise CPU microbenchmarking due to the magic of VT-x/AMD-V.

…but if you really want a native Windows development environment, please take inspiration from these setup instructions that I wrote for a numerical computing course. They will tell you how to set up everything needed for this benchmark, plus HDF5. You can leave out HDF5 for the sake of minimalism, but this will prevent you from using the suggested installation testing procedure.

Suggestions

Like all CPU microbenchmarks, Subwoofer is affected by various external factors including your operating system’s power management strategy and resource-sharing interference with other background processes running on the same machine.

We use Criterion as a robust benchmark harness that tries its best to work around these irrelevant effects, but if you have the time, you can also do your part to improve result quality by following usual CPU microbenchmarking recommendations.

These recommendations are listed below, in rough order of decreasing importance.

  1. Before running a benchmark, you should shut down any background OS process that is not required by the benchmark, and keep background processes to a minimum until the benchmark is done performing timing measurements.
    • If you do not do this, even intermittent background tasks may keep other CPU cores active, which can lead the CPU core on which the benchmark is running to downclock in unpredictable ways under a normal CPU frequency scaling policy. See below for more info on how you can turn that off, and why you may not want to do so.
    • More CPU-intensive background tasks may steal CPU time from the benchmark, which will affect timing measurements more directly. But because this benchmark is single-threaded, you have a fair bit of headroom before this starts to be a problem.
    • Background tasks may also put pressure on resources which are shared between CPU cores, like the CPU-RAM interconnect, which will affect a few specific benchmarks that are not enabled by default.
  2. If the computer you are testing is a laptop, it should be plugged into an electrical outlet.
    • The CPU performance of some laptops has been observed to fluctuate in highly unpredictable ways when operating on battery power. The exact cause for this phenomenon is not fully understood, but it could be related to maximal current draw limitations of the underlying laptop batteries.
  3. OS and vendor performance/powersaving tunables should be set up for maximal performance.
    • If you do not do this, your benchmark results will exhibit some dependence on the powersaving algorithm used by your computer. This is not ideal because the details of this powersaving algorithm may depend on many things: installed hardware and operating system, version of CPU microcode, and all software involved in CPU power management decisions…
  4. If maximal output reproducibility is desired, you can also disable “turbo” frequency scaling and force your CPU to constantly operate at its nominal frequency.
    • On Linux, this used to be easily done with vendor-agnostic tools like cpupower, but modern AMD and Intel CPUs have thrown a wrench into this and vendor-specific tools are now needed. For Intel CPUs, I recommend pstate-frequency. For other CPUs, see this page of the Arch wiki.
    • Note that setting up your CPU like this is a nontrivial tradeoff:
      • As a major benefit, it makes your benchmark output more accurate and reproducible.
      • As a minor benefit, it lets you convert criterion’s time-based measurements into cycle-based measurements, which is appropriate for benchmarks where CPU code execution is the limiting factor. Note that this is not true of all microbenchmarks provided by Subwoofer: a few of them are limited by other resources like the CPU-RAM interconnect, whose clock is not in sync with the CPU clock.
      • In exchange for these benefits, the main drawback of this approach is that you get performance results that are less representative of real-world CPU usage, since computers do not normally run with CPU frequency scaling disabled.
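As an illustration of the time-to-cycles conversion mentioned above, once the CPU frequency is pinned the conversion is a simple multiplication. The numbers below are made up for the sake of the example:

```rust
fn main() {
    // Hypothetical figures: CPU locked at a 3.0 GHz nominal frequency
    // (i.e. 3 cycles elapse per nanosecond), and Criterion reports an
    // average cost of 2.5 ns per operation.
    let freq_ghz = 3.0; // equivalently, cycles per nanosecond
    let time_ns_per_op = 2.5;

    // cycles/op = (ns/op) * (cycles/ns)
    let cycles_per_op = time_ns_per_op * freq_ghz;
    assert_eq!(cycles_per_op, 7.5);
}
```

This arithmetic is only meaningful for benchmarks that are bottlenecked by the CPU core itself, as the caveat above explains.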

Acquiring data

In this section, we will get into how Subwoofer lets you measure your CPU’s subnormal number processing performance. This is where you will learn about the inevitable tradeoff between execution time and output data resolution, and how Subwoofer lets you control this tradeoff in many ways to get the most relevant data in the minimal amount of execution time.

Basic check

Before you perform any quantitative measurements, you should first quickly check if your CPU performance degrades in the presence of subnormal numbers. This can be done by running Subwoofer in its default, rather minimal configuration, which is optimized for relatively fast execution time (<1h) at the expense of exhaustiveness.

Start by downloading the Subwoofer source code and entering its directory, if you have not done so already…

git clone --depth=1 https://github.com/HadrienG2/subwoofer.git
cd subwoofer

…then run the benchmark in its default configuration:

cargo bench

Here is what you should pay attention to at this stage:

  • If the performance of some arithmetic operations does not ever change as the share of subnormal inputs (percentage at the end of the benchmark name) varies, you may¹ not need to measure the performance of this arithmetic operation at all.
    • If this is true of all arithmetic operations that Subwoofer measures, congratulations! It looks like your CPU’s manufacturer did not cut any corners when it comes to subnormal numbers, and made sure that they are processed at the same speed as normal floating-point numbers. In this case, it is likely pointless to run any other benchmark, but you can consider using the time you saved to send your CPU manufacturer our thanks for making the life of numerical computation authors less difficult.
  • For each operation of interest, you should check what degree of instruction-level parallelism (ILP) leads to maximal performance. The following scenarios are expected:
    • Measured performance is maximal when ILP is maximal, as expected. This is the best-case scenario, you have nothing to do in this case.
    • Performance is maximal at an ILP smaller than the probed maximum. In this case, the max-ILP configuration is running into some CPU microarchitecture bottleneck, and you will want to run cargo bench --features=more_ilp_configurations to check more ILP configurations. Take note of the ILP that provides maximal performance in this case, and make sure it is not chained. If it is chained, this suggests a major codegen problem on our side, so you should send us a bug report and refrain from taking any further measurements, as their results will likely be incorrect.
  • If the observed relative performance degradation is the same for f32 and f64, then you may focus on only one of these floating-point precisions in the following data acquisition steps, which means that your benchmarks will run 2x faster.
  • If the runtime performance follows a very simple degradation pattern as the share of subnormal inputs grows (e.g. simple affine increase), you may keep the number of subnormal share data points low, which will greatly speed up benchmark execution.
¹ You should read the next chapter before disabling operation benchmarks, because many of the operations that Subwoofer measures are not elementary hardware arithmetic operations. This means that you need to know the performance of some measured operations, even if they are not affected by subnormals, in order to estimate the performance of other hardware operations that are affected by subnormals.

Full measurements

To fully assess the impact of subnormal numbers on your CPU’s floating-point arithmetic performance, you will want to run a more extensive set of benchmarks that…

  • Measures the performance of more arithmetic operations
  • Checks CPU register inputs in addition to memory inputs from the L1 cache
  • Benchmarks SIMD floating-point types in addition to scalar ones
  • Increases subnormal share resolution as needed to precisely probe the performance vs subnormal input share curve (e.g. measure the maximal subnormal number processing overhead and the subnormal input share at which this overhead is observed)

The measure Cargo feature can be used to quickly set up a benchmark configuration that has proven suitable for all the hardware against which Subwoofer has been tested so far, at the expense of often being overly precise and thus taking an unnecessarily long time to build and run:

cargo bench --features=measure

If the execution time of this generic configuration is a problem, you may want to deviate from it by reducing the number of benchmarks that are run to the minimum needed to acquire the data that you are interested in. In the remainder of this chapter, we will discuss how this is done.

Arithmetic operation set

By default, the measure configuration enables all supported microbenchmarks. Depending on the results of the previous basic check, this may be overkill and spend a large amount of time re-measuring information that you already know with unnecessary extra precision.

Here’s how to decide which benchmarks you can disable:

  • If one of the add, mul_max, sqrt_positive_max, div_numerator_max and div_denominator_min benchmarks was not affected by subnormals during the basic check, then you can disable it during the full measurement.
  • If the fma_full_max benchmark was not affected by subnormals during the basic check, then you can disable the fma_addend, fma_multiplier and fma_full_max benchmarks during the full measurement.
  • You cannot disable the max benchmark during the full measurement unless no benchmark other than add was affected by subnormals.

If you are in one of those cases, then you may want to stop using the catch-all measure Cargo feature, and instead use finer-grained Cargo features that let you control benchmarks on a case-by-case basis. Please check out the definition of the measure and bench_xyz features in Subwoofer’s Cargo.toml to know which set of Cargo features you should enable in this case. As an easier but slower alternative, you may also disable those benchmarks at runtime using cargo bench’s regex-based benchmark name filter.

For example, on AMD Zen 2/3 CPUs where only the performance of MUL, SQRT and DIV is affected by subnormals, you could restrict the set of benchmarks at compile time like this…

# This is correct as of 2024-12-29, but beware that the set of Cargo features
# covered by the "measure" option may evolve in future versions of Subwoofer
cargo bench --no-default-features  \
            --features=bench_max,bench_mul_max,bench_sqrt_positive_max,bench_div_numerator_max,bench_div_denominator_min,cargo_bench_support,register_data_sources,simd

…or, alternatively, restrict it at runtime like this:

cargo bench --features=measure -- '(max|mul_max|sqrt_positive_max|div_numerator_max|div_denominator_min)'

ILP configurations

As discussed in the context of the basic check, we normally expect CPUs to operate at peak floating-point throughput when they are fed with code that has the highest possible amount of instruction-level parallelism. But sometimes CPU frontend limitations get in the way and make the optimal degree of instruction-level parallelism smaller than this.

This is why by default, we run benchmarks at both the maximum possible ILP and half this ILP: in most cases, one of these configurations will be the fastest possible one for your CPU, or at least very close to the performance of the optimal ILP configuration.

If you found out during the basic check that your CPU needs an even more drastic ILP reduction to perform at optimal throughput, you will need to enable the more_ilp_configurations Cargo feature for the full benchmark that we are discussing in this chapter. In this case, consider sending us a bug report: if your CPU model is sufficiently common, we may want to enable more ILP configurations by default, as we aim for a default configuration that Just Works.

Once you’ve found the level of ILP that leads to peak throughput for each benchmark, you can speed up benchmark execution by only enabling this ILP configuration along with the maximally latency-bound chained configuration. Here is an example of applying such an ILP configuration filter when benchmarking SQRT performance on an AMD Zen 2 CPU:

cargo bench --features=measure -- '(max/ilp08|sqrt_positive_max/ilp04)|chained'

CPU register inputs

On many common CPUs, the performance of Subwoofer microbenchmarks that operate on data from the L1 data cache is mainly limited by floating-point arithmetic performance, rather than memory subsystem performance. However…

  • This is not true of all benchmarks: for example, fma_full_max on all-normal inputs from the L1 cache is memory-bound on all CPUs that Subwoofer has been tested against so far.
  • It may not be true on all CPUs because the memory operations do consume some CPU frontend and backend resources, and these resources could become the bottleneck on CPU cores that are less well-balanced than those which Subwoofer has been tested against so far.

For this reason, the register_data_sources Cargo feature, which is part of the broader measure feature, enables alternate versions of the microbenchmarks that run against inputs from CPU registers instead of inputs from the L1 cache. This configuration avoids memory subsystem bottlenecks entirely, at the expense of having other drawbacks:

  • Because the input dataset is tiny, a sufficiently smart CPU backend could apply optimizations to its internal subnormal number processing logic that do not apply on a more realistic data scale.
  • Even if the CPU does not apply such optimizations, rustc and LLVM can perform some of them. We apply optimization barriers to prevent them from doing so, but they may sometimes come at the expense of a reduction in generated code quality.

Because these drawbacks normally outweigh the benefits of not putting pressure on the CPU’s memory subsystem, we strongly advise against running only those versions of the benchmarks, and we suggest taking their results with a respectable grain of salt:

  • If they are a bit faster than the benchmarks that operate from the L1 cache (say, 20% faster), it may reflect a genuine hardware performance benefit of avoiding the memory subsystem, and thus a “purer” measurement of the CPU’s subnormal number processing overhead.
  • If they are >2x faster, or, worse, slower, you should disregard any data that comes out of them as suspicious by default, unless you have the time to carefully analyze the generated code and runtime CPU microarchitecture behavior to prove this default hypothesis wrong.
  • In general, if you have any doubt, your default assumption should be that benchmarks that operate from the L1 cache are “more right” than those that operate from CPU registers, until proven otherwise.

If you want to disable register inputs to speed up the benchmark’s build and execution, then the most efficient way will be to refrain from enabling the register_data_sources Cargo feature.

Sadly, Cargo features cannot be disabled, only enabled, so the only way to enable a subset of the measure feature pack is to look up its definition in Cargo.toml and only enable the subset that you want, as discussed above in the context of controlling the set of benchmarked operations.

An easier but slower alternative is to only disable the execution of benchmarks at runtime using cargo bench’s regex filter, like this:

cargo bench --features=measure -- 'L1cache'

More memory data sources

By default, we only measure the performance of floating-point operations on memory inputs that fit in the L1 data cache, because that’s where the performance impact of subnormal numbers is expected to be the highest. As we access increasingly remote memory inputs, the CPU is expected to spend less time crunching numbers, and more time waiting for the memory subsystem, resulting in a reduction of the relative subnormal number processing overhead.

This is, however, only 100% expected when processing data at the maximal SIMD vector width supported by your CPU architecture. When data is accessed in smaller chunks, the CPU’s spatial prefetcher gets more time to hide the latency of remote memory accesses by predicting future memory accesses before the CPU has even requested them, and as a result performance for e.g. scalar data will often be similar for all layers of the memory hierarchy.

If you are interested in studying those kinds of effects, consider adding more_memory_data_sources to the set of Cargo features that you are enabling. This feature is never enabled by default because it falls a bit outside of the scope of what Subwoofer normally aims to measure (namely the impact of subnormal numbers on the performance of floating-point arithmetic) and it causes an enormous increase of total benchmark execution time.

SIMD data types

Depending on how the CPU’s subnormals fallback is implemented, its performance may or may not depend on the floating-point data type that is being manipulated. In particular, it may be faster or slower for code that operates on SIMD vectors of numbers, rather than individual “scalar” numbers.

To account for this, the simd Cargo feature, which is part of the broader measure feature, lets you test the performance of all supported SIMD vector types. If you are trying to speed up benchmark runs, we advise narrowing this down to just the widest supported SIMD vector types and scalar data, as it is unlikely that intermediate vector sizes will behave very differently. This can be done using cargo bench’s regex filtering feature:

# Suitable for an x86 CPU whose native vector width is 256-bit AVX
cargo bench --features=measure -- 'f(32|64)(x08)?/'

Another way in which SIMD affects our benchmarks is that we need to apply optimization barriers to prevent the compiler from auto-vectorizing our scalar benchmarks into SIMD code, or our narrow SIMD vector benchmarks into wider SIMD code. Unfortunately, these optimization barriers may come at the expense of a reduction in generated code quality, so it is best to avoid them.

To do this, you will need to compile each type-specific benchmark with the narrowest target-feature set that is…

  1. Needed to process this particular scalar/SIMD floating-point type
  2. Legal on the target CPU architecture

The provided run_attended.sh script applies this approach to optimize codegen on x86 CPUs.

Subnormal frequency resolution

The last tunable of the Subwoofer microbenchmark suite is the set of subnormal_freq_resolution_1inN Cargo features. It controls the number of subnormal occurrence frequencies that are probed, and thus the horizontal resolution of the subnormal overhead vs occurrence frequency graph in the performance report that each benchmark will generate at the end.

Like many other configurable Subwoofer settings, this is a tradeoff between benchmark execution time and output precision: the higher the frequency resolution, the longer the benchmarks will take to run, but the more precise the output data will be in the end.

Unless you have a good reason to do so, we would advise using a frequency resolution that is sufficient to precisely measure the horizontal and vertical position of the peak of maximum overhead on your hardware, as well as the general shape of the curves on either side of this peak (linear, exponential-like, or logarithm-like?).

Alas, the optimal resolution is hardware-dependent, and can only be found through slow experimentation. But the default setting enabled by the measure Cargo feature is known to be good enough on all CPUs where Subwoofer has been tested so far. We’d like to keep it that way, so if you find a CPU where the default is not precise enough, please consider sending us a bug report.

Maximizing coverage

So far, this chapter has focused on techniques to reduce the execution time of Subwoofer to a minimum, while still measuring the most important data. However, there are situations where benchmark execution time is not as much of a problem, and the largest concern is instead to make sure that all important data has been acquired. This is for example the case when you are debugging a new benchmark, or benchmarking hardware to which you only have temporary access.

In this case, the simplest option is to run benchmarks with the --all-features Cargo option, which will instruct Subwoofer to measure as much data as possible…

cargo bench --all-features

…and then make sure that you remember to copy the contents of the target/criterion directory before you lose access to the hardware or accidentally cause a cargo clean disaster. The price to pay is that this configuration is extremely slow, and will take days to run to completion.

A better latency vs execution time tradeoff can be achieved by using the provided run_attended.sh script. It runs multiple benchmark passes of increasing precision/execution time so that you get basic results early, and full results eventually, at the expense of a negligibly small increase in overall execution time with respect to --all-features.

Codegen check

Portable CPU microbenchmarks like Subwoofer are hard to write because they aim to exercise CPUs with tightly controlled instruction streams, without writing all the associated hardware-specific machine code by hand.

This goal can only be achieved through a very careful balance between two opposing forces:

  • On one hand we want to let the compiler produce maximally optimized code for the target CPU, with minimal effort on our side.
  • On the other hand we need to avoid compiler optimizations that change the nature of the benchmarked code, such as turning scalar code into SIMD code or hoisting repeated square root computations out of the benchmark’s inner loop.

If this balance is not perfectly mastered, then we get either artificially bad machine code that is not as much bottlenecked by floating-point arithmetic as we intended, or artificially “good” machine code that does not measure the hardware performance characteristics that we are interested in.
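As an illustration of the second force, here is a minimal Rust sketch (not Subwoofer’s actual barrier code) of how `std::hint::black_box` can be used to stop the compiler from hoisting a repeated square root out of a loop:

```rust
use std::hint::black_box;

/// Naive version: LLVM can see that `x.sqrt()` is loop-invariant,
/// hoist it out of the loop, and turn the benchmark into a single
/// square root followed by N additions.
fn hoistable(x: f32, n: u32) -> f32 {
    let mut acc = 0.0;
    for _ in 0..n {
        acc += x.sqrt();
    }
    acc
}

/// Barrier version: black_box makes `x` opaque on every iteration,
/// so the compiler must emit one square root per loop iteration,
/// which is what we actually want to measure.
fn not_hoistable(x: f32, n: u32) -> f32 {
    let mut acc = 0.0;
    for _ in 0..n {
        acc += black_box(x).sqrt();
    }
    acc
}

fn main() {
    // Both versions compute the same result; only the generated
    // machine code differs.
    assert_eq!(hoistable(4.0, 10), not_hoistable(4.0, 10));
    assert_eq!(hoistable(4.0, 10), 20.0);
}
```

The catch is that such barriers are not free: depending on where they are placed, they can pin values to registers or memory and thus degrade the surrounding code, which is the other side of the balance described above.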

At the time of writing, Subwoofer is known to compile down to optimal machine code on certain x86 CPUs, when using the rustc nightly that is specified by its rust-toolchain file. But there is no guarantee that optimal code will also be generated for other CPUs, or the next time we will upgrade to a newer rustc nightly. Any suspicious performance numbers should therefore prompt you to check that the generated machine code is correct for all the benchmarks that you are exercising.

On Linux, the easiest way to perform this validation is to profile the set of benchmark configurations that you are interested in using perf, while executing it in --profile-time=1 mode and with a minimal subnormal occurrence frequency resolution (since it does not affect code generation)…

# measure_codegen is like measure, but without the increased freq resolution
cargo bench --features=measure_codegen --bench=f32x08 -- --profile-time=1

…then check out the assembly of the inner loop of the benchmarks that actually got executed, using the “annotate” feature of perf report.

The provided run_attended.sh script is designed to support a more elaborate version of this approach to analyzing the code that is generated by rustc/LLVM, in which…

  • cargo criterion is used instead of cargo bench to speed up compilation
  • cargo build is used before cargo criterion to avoid profiling compilation
  • Benchmarks are compiled with a minimized target-feature set for optimal codegen
  • A separate perf.data file is produced for each benchmarked floating-point type, which makes it easier to detect type-specific issues like autovectorization bugs

Analysis

At the time of writing, Subwoofer does not produce fully digested data telling you e.g. what slowdown you can expect when multiplying subnormal numbers instead of normal numbers in throughput-bound numerical code that operates from inputs that reside in the L1 CPU cache.

Instead, this information must be obtained through a manual analysis process. In the remainder of this documentation, we will explain to you how this process is performed, then show you examples of the kind of results that you can obtain on the particular CPUs that the authors got access to.

Naming convention

Benchmark names follow a type/op/ilp/source/%subnormals structure where…

  • type is the type of data that is being operated over. Depending on how the CPU’s subnormal fallback path is implemented, subnormal performance might differ for single vs double precision, and for scalar vs SIMD operations.
  • op is the operation that is being benchmarked.
    • Note that we are often not benchmarking only the hardware operation of interest, but rather a combination of this operation with some cheap corrective action that resets the accumulator to a normal state whenever it becomes subnormal. We will explain how to deduce raw hardware arithmetic performance characteristics from these measurements later in this chapter.
  • ilp is the degree of instruction-level parallelism that is present in the benchmark.
    • “chained” corresponds to the ILP=1 special case, which is maximally latency-bound. Higher ILP should increase execution performance until the code becomes throughput-bound and saturates superscalar CPU backend resources, but the highest ILP configurations may not be optimal due to limitations of the CPU microarchitecture (e.g. CPU op cache/loop buffer thrashing), the compiler, or the optimization barriers that we use.
    • If you observe that the highest ILP configuration is slower than the next-highest configuration, I advise re-running the benchmark with more_ilp_configurations added to the set of Cargo features, in order to make sure that your benchmark runs do cover the fastest, most throughput-bound ILP configuration. This will increase execution time.
  • source indicates where the input data comes from.
    • As the CPU accesses increasingly remote input data sources, the relative impact of subnormal operations is expected to decrease, because the CPU will end up spending more of its time waiting for inputs, rather than computing floating-point operations.
    • By default, we only cover the data sources where the impact of subnormals is expected to be the highest. If you want to measure how the impact of subnormals goes down when the code becomes more memory-bound, you can add more_memory_data_sources to the set of Cargo features. But this will greatly increase execution time.
  • %subnormals indicates what percentage of subnormals is present in the input.
    • The proposed measure benchmark configuration has enough percentage resolution to precisely probe the overhead curve of all CPUs tested so far. But your CPU model may have a different overhead curve that can be precisely probed with less percentage resolution (leading to faster benchmark runs) or that requires more percentage resolution for precise analysis (at the expense of slower runs). In that case you may want to look into the various available subnormal_freq_resolution_1inN Cargo features.
    • If you ended up needing more data points than the current measure configuration, please consider submitting a PR that makes this the new default. I would like to keep the measure configuration precise enough on all known hardware, if possibly suboptimal in terms of benchmark execution time.

The presence of leading zeros in numbers within the benchmark name may slightly confuse you. This is needed to ensure that the entries of criterion reports within target/criterion/ are sorted correctly, because criterion sadly does not yet use a name sorting algorithm that handles multi-digit numbers correctly at the time of writing…
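To make the convention concrete, here is how a hypothetical benchmark name decomposes. The exact names emitted on your machine depend on your configuration; check your target/criterion/ directory for the real list:

```rust
fn main() {
    // Hypothetical benchmark name following the documented
    // type/op/ilp/source/%subnormals structure
    let name = "f32x08/mul_max/ilp04/L1cache/050.0%";
    let parts: Vec<&str> = name.split('/').collect();
    assert_eq!(
        parts,
        [
            "f32x08",  // type: 8-wide SIMD vector of f32
            "mul_max", // op: the benchmarked operation
            "ilp04",   // ilp: 4 independent instruction streams
            "L1cache", // source: inputs fit in the L1 data cache
            "050.0%",  // %subnormals, zero-padded for correct sorting
        ]
    );
}
```

Note how the zero-padding of numbers (ilp04, 050.0%) implements the sorting workaround described above.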

Latency and throughput

Modern CPUs are pipelined and superscalar, which allows them to execute multiple instructions at the same time. This means that the performance of a CPU instruction cannot be fully defined by a single figure of merit. Instead, two standard figures of merit are normally used:

  • Latency measures the amount of time that elapses from the moment where a CPU instruction starts executing, to the moment where the result is available and another instruction that depends on it can start executing. It is normally measured in nanoseconds or CPU clock cycles, the latter being more relevant when all execution bottlenecks are internal to the CPU.
  • Throughput measures the maximal rate at which a CPU’s backend can execute an infinite stream of instructions of a certain type, assuming all conditions are met (execution ports are available, inputs are ready, etc). It is normally given in instructions per second or per CPU clock cycle. Sometimes, people also provide reciprocal throughput in average CPU clock cycles per instruction, which is more treacherous to the reader because it looks like a latency.

Depending on the “shape” of its machine code, a certain numerical computation will be more limited by one of these two performance limits:

  • Programs with a single long dependency chain, where each instruction depends on the output of the previous instruction, are normally limited by instruction execution latencies.
  • Programs with many independent instruction streams that do not depend on each other are normally limited by instruction execution throughputs.
  • Outside of these two extreme situations, precise performance characteristics depend on microarchitectural trade secrets that are not well documented, and it is better to empirically measure the performance of code than to try to theoretically predict it. But we do know that observed performance will normally lie somewhere between the two limits of latency-bound and throughput-bound execution.
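To make these two figures of merit concrete, here is a small worked example (all numbers invented for illustration, not measurements from any real CPU) of how one might convert measured execution times into latency and throughput estimates:

```rust
/// Latency estimate in CPU clock cycles, derived from the execution time of
/// a latency-bound (single dependency chain) run of n_ops operations
fn latency_cycles(t_chained_s: f64, n_ops: f64, clock_hz: f64) -> f64 {
    t_chained_s / n_ops * clock_hz
}

/// Throughput estimate in operations per clock cycle, derived from the
/// execution time of the fastest (throughput-bound) configuration
fn throughput_ops_per_cycle(t_best_s: f64, n_ops: f64, clock_hz: f64) -> f64 {
    n_ops / t_best_s / clock_hz
}

fn main() {
    // Invented figures: 1M chained ops take 1 ms, 1M independent ops take
    // 0.25 ms, on a 4 GHz CPU
    let lat = latency_cycles(1.0e-3, 1.0e6, 4.0e9);
    let tp = throughput_ops_per_cycle(0.25e-3, 1.0e6, 4.0e9);
    println!("latency ~ {lat} cycles/op, throughput ~ {tp} ops/cycle");
}
```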

Subwoofer attempts to measure latency and throughput by benchmarking a varying number of identical instruction streams where the output of each operation is the input of the next one.

  • In the chained configurations, there is only one instruction stream, so we expect maximally latency-bound performance. The execution time for a chain of N operations should therefore be N times the execution latency of an individual operation.
  • In one of the configurations of higher Instruction-Level Parallelism (ILP), normally one close to the maximum ILP that is allowed by the CPU ISA, maximal performance will be observed. At this throughput-bound limit, throughput can be estimated as the number of operations that were computed, divided by the associated execution time.
    • More precisely, you should find the degree of ILP that is associated with maximal performance when operating on fully normal floating-point data. That’s because code which is throughput-bound on normal inputs may become latency-bound on subnormal inputs, if the CPU’s subnormal fallback is so inefficient that it messes up the superscalar pipeline and reduces or fully prevents parallel instruction execution.
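As an illustration of these two kinds of configurations, here is a minimal Rust sketch (illustrative code, not Subwoofer's actual implementation) contrasting a single dependency chain with four independent accumulation streams:

```rust
/// One dependency chain: each addition needs the previous result, so this
/// shape is expected to be latency-bound
fn chained(inputs: &[f32]) -> f32 {
    let mut acc = 1.0f32;
    for &x in inputs {
        acc += x;
    }
    acc
}

/// Four independent accumulators: the CPU can overlap their additions, so
/// this shape is expected to be throughput-bound (assuming ILP of 4 is
/// enough to saturate the CPU's floating-point units)
fn ilp4(inputs: &[f32]) -> f32 {
    let mut accs = [1.0f32; 4];
    for chunk in inputs.chunks_exact(4) {
        for (acc, &x) in accs.iter_mut().zip(chunk) {
            *acc += x;
        }
    }
    accs.iter().sum()
}

fn main() {
    let inputs = vec![0.5f32; 1000];
    println!("{} {}", chained(&inputs), ilp4(&inputs));
}
```

Both functions perform the same 1000 additions; only the dependency structure differs, which is what lets the two performance limits be probed separately.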

There is unfortunately one exception to the “chained is latency-bound” general rule, which is the sqrt_positive_max benchmark. This benchmark does not feature an SQRT → SQRT → SQRT… dependency chain, because performing such a sequence of operations while guaranteeing continued subnormal input is impossible as the square root of a subnormal number is a normal number. Therefore, this benchmark cannot currently be used to measure SQRT latency, and its output in chained mode should be ignored for now.

Estimating hardware performance

The raison d’être of Subwoofer is to study how basic hardware floating-point arithmetic operations behave in presence of a stream of data that contains a certain share of subnormal numbers, in both latency-bound and throughput-bound configurations. This is not as easy as it seems because…

  • To study the performance of latency-bound operations, we need long dependency chains made of many copies of the same operation, where each operation takes the output of the previous operation as one of its inputs.
  • To enforce a share of subnormal inputs other than 100%, we must ensure that the output of operation N, which serves as one of the inputs of operation N+1, is a normal number, while other inputs come from a pseudorandom input data stream of well-controlled characteristics.
  • Many IEEE-754 arithmetic operations produce a non-normal number when fed with a normal and subnormal operand. For example, multiplication of a normal number by a subnormal number may produce a subnormal output. We need to turn such non-normal numbers back into normal numbers before we can use them as the input of the next step of the dependency chain, by passing them through an auxiliary operation which is…
    • Safe to perform on both normal and subnormal numbers.
    • As cheap as possible to limit the resulting measurement bias.

To achieve these goals, we often need to chain the hardware operation that we are trying to study with another operation that we also study, chosen to be as cheap as possible so that the impact on measured performance is minimal. Then we subtract the impact of that other operation to estimate the impact of the operation of interest in isolation.

ADD/SUB

Adding or subtracting a subnormal number to a normal number produces a normal number, therefore the overhead of ADD/SUB can be studied in isolation. Because these two operations have identical performance characteristics on all known hardware, we have a single add benchmark that measures ADD only, and it is assumed that SUB (or addition of the negation) has the same performance profile.
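A minimal sketch of what such an add kernel could look like (illustrative code, not Subwoofer's actual implementation):

```rust
/// Sketch of an ADD kernel: adding a (possibly subnormal) input to a normal
/// accumulator yields a normal number, so the result can directly feed the
/// next addition without any corrective operation
fn add_kernel(acc0: f32, inputs: &[f32]) -> f32 {
    let mut acc = acc0;
    for &x in inputs {
        acc += x;
    }
    acc
}

fn main() {
    let subnormal = f32::from_bits(1); // smallest positive subnormal
    let acc = add_kernel(1.0, &[subnormal; 4]);
    assert!(acc.is_normal()); // accumulator never left normal range
    println!("{acc}");
}
```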

MIN/MAX

The maximum of a subnormal and a normal number is a normal number, therefore the overhead of MAX can be studied in isolation. This is done by the max benchmark. Sadly, MIN does not have this useful property, but its performance characteristics are the same as those of MAX on all known hardware, so for now we will just assume that they have the same overhead and that measuring the performance of MAX is enough to know the performance of MIN.
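A minimal sketch of such a max kernel (illustrative code, not Subwoofer's actual implementation):

```rust
/// Sketch of a MAX kernel: the maximum of a normal accumulator and a
/// (possibly subnormal) input is a normal number, so the accumulator never
/// leaves normal range and no corrective operation is needed
fn max_kernel(acc0: f32, inputs: &[f32]) -> f32 {
    let mut acc = acc0;
    for &x in inputs {
        acc = acc.max(x);
    }
    acc
}

fn main() {
    let subnormal = f32::from_bits(1); // smallest positive subnormal
    let acc = max_kernel(1.0, &[subnormal; 4]);
    assert!(acc.is_normal());
    println!("{acc}");
}
```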

MUL

The product of a subnormal and a normal number is a subnormal number, but we can use a MAX to get back into normal range. This is what the mul_max benchmark does. By subtracting the execution time of max from the execution time of mul_max, we get an estimate of the execution time that MUL would have in isolation, which we can use to estimate the latency and throughput of MUL.
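A minimal sketch of the mul_max pattern (illustrative code, not Subwoofer's actual implementation):

```rust
/// Sketch of a mul_max kernel: multiplying by a subnormal input can produce
/// a subnormal result, so a MAX against the smallest positive normal number
/// brings the accumulator back into normal range before the next iteration
fn mul_max_kernel(acc0: f32, inputs: &[f32]) -> f32 {
    let mut acc = acc0;
    for &x in inputs {
        // f32::MIN_POSITIVE is the smallest positive *normal* f32
        acc = (acc * x).max(f32::MIN_POSITIVE);
    }
    acc
}

fn main() {
    let subnormal = f32::from_bits(1); // smallest positive subnormal
    let acc = mul_max_kernel(2.0, &[subnormal, 0.5, 3.0]);
    assert!(acc.is_normal());
    println!("{acc}");
}
```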

SQRT

A square root is a unary operation, and the square root of a subnormal number is a normal number. Therefore, to keep integrating new possibly subnormal inputs into our accumulator, we cannot use just SQRT and must also use a binary operation. Hence we use the acc <- max(acc, SQRT(input)) operation as the basis of the sqrt_positive_max benchmark.
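A minimal sketch of this pattern (illustrative code, not Subwoofer's actual implementation):

```rust
/// Sketch of a sqrt_positive_max kernel: the square root of a subnormal
/// number is normal, and taking the MAX with a normal accumulator keeps the
/// accumulator normal, so no further corrective action is needed
fn sqrt_max_kernel(acc0: f32, inputs: &[f32]) -> f32 {
    let mut acc = acc0;
    for &x in inputs {
        acc = acc.max(x.sqrt());
    }
    acc
}

fn main() {
    let subnormal = f32::from_bits(1); // smallest positive subnormal
    assert!(subnormal.sqrt().is_normal()); // SQRT leaves subnormal range
    let acc = sqrt_max_kernel(0.5, &[subnormal, 0.25]);
    println!("{acc}");
}
```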

Why “positive”, you may ask? Well, for now we only test with positive inputs, for a few reasons:

  • Computing square roots of negative numbers is normally a math error, so well-behaved programs shouldn’t do that in a loop on their hot code path. The performance of the error path therefore isn’t that important.
  • The square root of a negative number is a NaN, and going back from a NaN to a normal number of a reasonable order of magnitude without breaking the dependency chain is a lot messier than going from a subnormal to a normal number (can’t just use a MAX, need to play weird tricks with the exponent bits of the underlying IEEE-754 representation, which may cause unexpected CPU overhead linked to integer/FP domain crossing).
  • The negative argument path of the libm sqrt() function that people actually use is often partially or totally handled in software, so getting at the hardware overhead is difficult, and even if we manage it, the result won’t be representative of typical real-world performance.

DIV

Division is interesting because it is one of the few basic IEEE-754 binary arithmetic operations where the two input operands play a highly asymmetrical role:

  • If we divide possibly subnormal inputs by a normal number, and use the output as the denominator of the next division, then we are effectively doing the same as multiplying a possibly subnormal number by the inverse of a normal number, which is another normal number. As a result, we end up with a pattern that is quite similar to that of the mul_max benchmark, and again we can use MAX as a cheap mechanism to recover from subnormal outputs. This is how the div_numerator_max benchmark works.
  • If we divide a normal number by possibly subnormal inputs, and use the output as the numerator of the next division, then the main IEEE-754 special case that we need to guard against is not subnormal outputs but infinite outputs. This can be done using a MIN that takes infinities back into normal range, and that is what the div_denominator_min benchmark does. As discussed above, to analyze this benchmark, we will assume that MIN has the same performance characteristics as MAX and use the results that we collected for MAX.
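Minimal sketches of these two patterns (illustrative code, not Subwoofer's actual implementation):

```rust
/// Sketch of div_numerator_max: the possibly subnormal input is the
/// numerator, so the output may be subnormal and a MAX brings it back into
/// normal range; the result becomes the denominator of the next division
fn div_numerator_max(acc0: f32, inputs: &[f32]) -> f32 {
    let mut acc = acc0;
    for &x in inputs {
        acc = (x / acc).max(f32::MIN_POSITIVE);
    }
    acc
}

/// Sketch of div_denominator_min: the possibly subnormal input is the
/// denominator, so the output may be infinite and a MIN brings it back into
/// normal range; the result becomes the numerator of the next division
fn div_denominator_min(acc0: f32, inputs: &[f32]) -> f32 {
    let mut acc = acc0;
    for &x in inputs {
        acc = (acc / x).min(f32::MAX);
    }
    acc
}

fn main() {
    let subnormal = f32::from_bits(1); // smallest positive subnormal
    let a = div_numerator_max(2.0, &[subnormal, 1.0]);
    let b = div_denominator_min(2.0, &[subnormal, 1.0]);
    assert!(a.is_normal() && b.is_normal());
    println!("{a} {b}");
}
```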

FMA

Because Fused Multiply-Add (FMA) has three operands that play two different roles (multiplier or addend), we have more freedom in how we set up dependency chains of FMAs with a feedback path from the output of operation N to one of the inputs of operation N+1. Here is what we chose:

  • In benchmark fma_multiplier, the input data is fed to a multiplier argument of the FMA, multiplied by a constant factor, and alternately added to and subtracted from an accumulator. This is effectively the same as the add benchmark, just with a larger or smaller step size, so for this pattern we can study the overhead of FMA with possibly subnormal multipliers in isolation, without taking corrective action to guard against non-normal outputs.

  • In benchmark fma_addend, input data is fed to the addend argument of the FMA, and we add to it the current accumulator multiplied by a constant factor. The result then becomes the next accumulator. Unfortunately, no matter how we pick the constant factor, the accumulator is doomed to eventually overflow or underflow in some input configurations:

    • If the constant factor is >= 1 or sufficiently close to 1, then for a stream of normal inputs the value of the accumulator will experience unbounded growth and eventually overflow.
    • If the constant factor is < 1, then for a stream of subnormal inputs the value of the accumulator will decay and eventually become subnormal.

    To prevent this, we actually alternate between multiplying by the chosen constant factor and its inverse. This should not meaningfully affect the measured performance characteristics.

  • In benchmark fma_full_max, two substreams of the input data are fed to the addend argument of the FMA and to one of its multiplier arguments, with the feedback path taking the other multiplier argument. This configuration allows us to check if the CPU has a particularly hard time with FMAs that produce subnormal outputs. But precisely because we can get subnormal results, we need a MAX to bring the accumulator back to normal range.
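Minimal sketches of these three patterns (illustrative code with invented constants, not Subwoofer's actual implementation):

```rust
/// Sketch of fma_multiplier: the input feeds a multiplier argument, and the
/// product is alternately added to and subtracted from the accumulator
fn fma_multiplier(acc0: f32, inputs: &[f32]) -> f32 {
    let mut acc = acc0;
    let mut factor = 2.0f32; // invented constant factor
    for &x in inputs {
        acc = x.mul_add(factor, acc);
        factor = -factor; // alternate between addition and subtraction
    }
    acc
}

/// Sketch of fma_addend: the input feeds the addend argument, and the
/// accumulator is alternately multiplied by a constant factor and its
/// inverse to avoid unbounded growth or decay
fn fma_addend(acc0: f32, inputs: &[f32]) -> f32 {
    let mut acc = acc0;
    let factors = [2.0f32, 0.5]; // invented factor and its inverse
    for (i, &x) in inputs.iter().enumerate() {
        acc = acc.mul_add(factors[i % 2], x);
    }
    acc
}

/// Sketch of fma_full_max: two input substreams feed the addend and one
/// multiplier, the accumulator feeds the other multiplier; since the output
/// may be subnormal, a MAX brings it back into normal range
fn fma_full_max(acc0: f32, lhs: &[f32], rhs: &[f32]) -> f32 {
    let mut acc = acc0;
    for (&a, &b) in lhs.iter().zip(rhs) {
        acc = acc.mul_add(a, b).max(f32::MIN_POSITIVE);
    }
    acc
}

fn main() {
    let x = [0.25f32, 0.25];
    println!("{}", fma_multiplier(1.0, &x));
    println!("{}", fma_addend(1.0, &x));
    println!("{}", fma_full_max(1.0, &x, &x));
}
```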

Other insight

Data sources

As discussed in the data acquisition section, the suggested measure data acquisition configuration exercises the hardware operation of interest not just with memory inputs taken from the L1 cache, but also with tiny sets of inputs that stay resident in CPU registers.

The latter configuration fully avoids memory subsystem bottlenecks, which are rare at the L1 cache level but do exist in a few corner cases like FMAs with two data inputs. However, it does so at the expense of being a lot less realistic and a lot easier for compiler optimizers and CPU frontends to analyze, which causes all sorts of problems.

As a result, you should generally consider the configuration that operates from the L1 cache as the most reliable reference measurement, but perform a quick comparison with the performance of the configurations that operate from registers:

  • If a configuration that operates from registers is a bit faster (say, ~20% faster), it is likely that the configuration with inputs from the L1 cache was running into a memory subsystem bottleneck, and the configuration with inputs from CPU registers more accurately reflects the performance of the underlying floating-point arithmetic operations.
  • But if the difference is enormous, or the configuration that operates from registers is slower than the one that operates from the L1 cache, you should treat the results obtained against register inputs as suspicious by default, and disregard them until you have taken the time to apply further analysis.

Inputs from memory data sources that are more remote from the CPU than the L1 cache are generally not particularly interesting, as they are normally less affected by the performance of floating-point arithmetic and more heavily bottlenecked by the memory subsystem. This is why the associated performance numbers are not measured by default.

But there is some nuance to the generic statement above (among other things because modern CPUs have spatial prefetchers), so if you are interested in how the performance of subnormal inputs differs when operating from a non-L1 data source, consider trying out the more_memory_data_sources Cargo feature which measures just that. This comes at the expense of, you guessed it, longer benchmark execution times.

Overhead vs %subnormals curve

For those arithmetic operations that are affected by subnormal inputs on a given CPU microarchitecture, one might naively expect the associated overhead to grow linearly as the share of subnormal numbers in the input data stream increases.

If we more rigorously spell out the underlying intuition, it is that processing a normal number has a certain cost c_normal, processing a subnormal number has another cost c_subnormal, and therefore the average floating-point operation cost that we eventually measure should be a weighted mean cost(x) = (1 − x) × c_normal + x × c_subnormal, where x is the share of subnormal numbers in the input data stream.
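This naive linear model can be spelled out as a tiny Rust function (with invented per-operation costs, purely for illustration):

```rust
/// Average cost per operation under the naive linear model: a weighted mean
/// of the per-normal and per-subnormal costs
fn average_cost(c_normal: f64, c_subnormal: f64, subnormal_share: f64) -> f64 {
    (1.0 - subnormal_share) * c_normal + subnormal_share * c_subnormal
}

fn main() {
    // Invented costs: 1 cycle per normal input, 100 cycles per subnormal
    for share in [0.0, 0.25, 0.5, 1.0] {
        println!("{share}: {} cycles/op", average_cost(1.0, 100.0, share));
    }
}
```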

This is indeed one possible hardware behavior, and some Intel CPUs are actually quite close to that performance model. But other CPU behaviors can be observed in the wild. Consider instead a CPU whose floating-point ALUs have two operating modes:

  • In “normal mode”, an ALU processes normal floating-point numbers at optimal speed, but it cannot process subnormal numbers. When a subnormal number is encountered, the ALU must perform an expensive switch to an alternate generic operating mode in order to process it.
  • In this alternate “generic mode”, the ALU can process both normal and subnormal numbers, but operates with reduced efficiency (e.g. it carefully checks the exponent of the number before each multiplication, instead of speculatively computing the normal result and later discarding it if the number turns out to actually be subnormal).
  • Because of this reduced efficiency and the presumed rarity of subnormal inputs, the ALU will spontaneously switch back to the initial “normal mode” after some amount of time has elapsed without any subnormal number showing up in the input data stream.

Assuming such a more complex performance model, the relationship between the average processing time and the share of subnormal data in the input would no longer be a straight line:

  • For small amounts of subnormal numbers in the input data stream, each subnormal input will cause an expensive switch to “generic mode”, followed by a period of time during which mostly-normal numbers are processed at a reduced data rate, then a return to “normal mode”.
    • In this initial phase of low subnormal input density, we expect mostly linear growth of the processing overhead. The initial slope of the overhead(%subnormals) curve is the cost of each ALU normal → generic → normal mode round trip, plus the number of normal numbers that are processed during the generic mode period multiplied by the extra overhead of processing a normal number in generic mode.
  • As the share of subnormal numbers in the input data stream increases, the CPU’s ALUs will see more and more subnormal numbers pass by. Eventually, subnormal numbers will arrive frequently enough that the time between two of them falls below the ALUs’ internal switch-back delay, causing the ALUs to start delaying some generic → normal mode switches.
    • This will result in a reduction of the number of normal → generic → normal mode round trips. Therefore, if said round trips are a lot more expensive than the cost of processing a few normal numbers in generic mode for the duration of the round trip, as initially assumed, the overall processing overhead will start to decrease.
  • Finally, beyond a certain share of subnormal inputs, the ALUs will be constantly operating in generic mode, and the only observable overhead will be that of processing normal numbers in the subnormals-friendly generic mode.
    • It should be noted that the generic-mode overhead that the CPU manufacturer is trying to avoid may not be a mere decrease in throughput, but instead something more subtle like an increase in operation latency or ALU energy consumption.

Such an overhead curve, which grows, then shrinks, and eventually reaches a plateau as the share of subnormal inputs increases, was indeed observed on AMD Zen 2 and Zen 3 CPUs for certain arithmetic operations like MUL. This suggests that a mechanism along the lines of the one described above is at play on those CPUs.

As you can see, studying the dependence of subnormal number processing overhead on the share of subnormal numbers in the input data stream can provide valuable insight into the underlying hardware implementation of subnormal arithmetic.

Data type

Besides individual single and double precision numbers, many CPUs can process data in SIMD batches of one or more widths. On microarchitectures with a sufficiently dark and troubled past, we could expect the subnormal fallback path to behave differently depending on the precision of the floating-point numbers that are being manipulated, or the width of the SIMD batches. Such differences, if any, will reveal more limitations of the CPU’s subnormal processing path.