Superscalar SIMD

By our powers combined…

As we have seen, both instruction-level parallelism and SIMD can be leveraged to achieve extra execution performance. Now it’s time to combine them and see how the benefits add up.

#![allow(unused)]
fn main() {
pub fn simd_ilp<const ILP_WIDTH: usize>(target: u64) -> u64 {
    use safe_arch::m128i;
    const SIMD_WIDTH: usize = 2;
    assert_ne!(ILP_WIDTH, 0, "No progress possible in this configuration");

    // Set up counters and increment
    let mut simd_counters = [m128i::from([0u64; SIMD_WIDTH]); ILP_WIDTH];
    let simd_increment = m128i::from([1u64; SIMD_WIDTH]);

    // Accumulate in parallel
    let full_width = (SIMD_WIDTH * ILP_WIDTH) as u64;
    for _ in 0..(target / full_width) {
        for simd_counter in &mut simd_counters {
            *simd_counter =
                pessimize::hide(safe_arch::add_i64_m128i(*simd_counter, simd_increment));
        }
    }

    // Accumulate remaining pairs of elements
    let mut remainder = (target % full_width) as usize;
    while remainder >= SIMD_WIDTH {
        for simd_counter in simd_counters.iter_mut().take(remainder / SIMD_WIDTH) {
            *simd_counter =
                pessimize::hide(safe_arch::add_i64_m128i(*simd_counter, simd_increment));
            remainder -= SIMD_WIDTH;
        }
    }

    // Merge SIMD accumulators using parallel reduction
    let mut stride = ILP_WIDTH.next_power_of_two() / 2;
    while stride > 0 {
        for i in 0..stride.min(ILP_WIDTH - stride) {
            simd_counters[i] =
                safe_arch::add_i64_m128i(simd_counters[i], simd_counters[i + stride]);
        }
        stride /= 2;
    }
    let simd_counter = simd_counters[0];

    // Merge the SIMD counters into a scalar counter
    let counters: [u64; SIMD_WIDTH] = simd_counter.into();
    let mut counter = counters.iter().sum();

    // Accumulate trailing element, if any, the scalar way
    for _ in 0..remainder {
        counter = pessimize::hide(counter + 1);
    }
    counter
}
}

This code… could be prettier. It certainly shows the need for layers of abstraction when combining multiple layers of hardware parallelism. But since this might be your first exposure to the marriage of SIMD and ILP, seeing it spelled out explicitly once should make it easier to understand.

How fast do we expect this to get? According to uops.info, my Zen 2 CPU can perform 3 SSE 64-bit integer additions per clock cycle, and each SSE addition sums two 64-bit integers. We would therefore expect to reach 6 integer increments per clock cycle, which is indeed what we get: this implementation reaches 25.6 billion integer increments per second on my CPU.
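
Working backwards from the measurement, 25.6 billion increments per second at 6 increments per cycle corresponds to a sustained clock rate of roughly 4.27 GHz, in line with typical Zen 2 boost clocks. Here is that arithmetic as a quick sanity check (the clock figure is inferred from the measurement, not separately measured):

#![allow(unused)]
fn main() {
// Sanity check: relate increments/cycle to the measured increments/second.
// The ~4.27 GHz clock rate is derived from the measurement, not measured.
let increments_per_cycle = 3.0 * 2.0; // 3 SSE additions x 2 u64 lanes each
let measured_increments_per_second = 25.6e9;
let implied_clock_ghz = measured_increments_per_second / (increments_per_cycle * 1e9);
assert!((implied_clock_ghz - 4.27).abs() < 0.01);
}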

Another thing worth noting is that we need fewer ILP streams to achieve this, with peak performance being reached at an ILP_WIDTH of 9. This is good for multiple reasons:

  • Narrower code uses up less instruction cache and fewer CPU registers, thus reducing the risk of slowdown elsewhere in the codebase when this function is called.
  • Narrower code reaches top speed after a smaller number of loop iterations. This implementation beats the 15-way ILP loop at any number of iterations, and it also overtakes the basic for loop’s counting rate as soon as the target exceeds 8, and the basic SIMD loop as soon as it exceeds 16. So not only is it very efficient asymptotically, it also becomes efficient very quickly as the problem size ramps up.

It is also easy to explain why this is the case: with ILP based on scalar increments, our previous code had its counter increments competing for scalar execution resources with the outer loop management instructions. This code only relies on SIMD instructions for counting, which are not used for outer loop management, so as soon as there is enough work to keep the SIMD units fed without outer loop management becoming a bottleneck, we should operate at maximum possible speed.
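
For reference, this is how the tuned configuration would be invoked; the ILP_WIDTH of 9 is the sweet spot measured above, while the counting target is just an arbitrary example value:

#![allow(unused)]
fn main() {
// ILP_WIDTH is a const generic parameter, so the tuned value must be known
// at compile time; the counting target is an arbitrary example.
let count = simd_ilp::<9>(1_000_000_000);
assert_eq!(count, 1_000_000_000);
}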

Superscalar honeymoon?

At this point, it may seem like I am claiming that SIMD is superior to scalar processing and should be used in its place whenever practical. However, for integer computations at least, it does not actually have to be one or the other.

Since my Zen 2 CPU can process at most 5 instructions per cycle, but only 3 SSE instructions per cycle, one has to wonder whether, given suitably shaped code, it could not also perform a pair of scalar integer increments on each CPU cycle, on top of all the SSE work it is already doing. That would raise the theoretical ceiling from 6 to 8 integer increments per cycle (3 SSE additions × 2 lanes, plus 2 scalar additions).

Of course, in doing so, I would need to be careful not to consume ILP resources needed to handle the outer counting loop. But given enough code contortions, that can be taken care of.

And so I did actually try out this hybrid scalar + SIMD approach…

#![allow(unused)]
fn main() {
pub fn extreme_ilp<
    // Number of SIMD operations per cycle
    const SIMD_ILP: usize,
    // Number of scalar iterations per cycle
    const SCALAR_ILP: usize,
    // Number of cycles in the inner loop
    const UNROLL_FACTOR: usize,
    // Number of scalar instructions needed for outer loop maintenance
    const LOOP_INSNS: usize,
>(
    target: u64,
) -> u64 {
    use safe_arch::m128i;
    const SIMD_WIDTH: usize = 2;
    assert_ne!(SIMD_ILP, 0, "No point in this without SIMD ops");
    assert_ne!(SCALAR_ILP, 0, "No point in this without scalar ops");

    // Set up counters and increment
    let mut simd_counters = [m128i::from([0u64; SIMD_WIDTH]); SIMD_ILP];
    let simd_increment = m128i::from([1u64; SIMD_WIDTH]);
    let mut scalar_counters = [0; SCALAR_ILP];

    // Accumulate in parallel
    let unrolled_simd_ops = SIMD_WIDTH * SIMD_ILP * UNROLL_FACTOR;
    let unrolled_scalar_ops = (SCALAR_ILP * UNROLL_FACTOR).saturating_sub(LOOP_INSNS);
    let full_width = unrolled_simd_ops + unrolled_scalar_ops;
    for _ in 0..(target / full_width as u64) {
        for unroll_iter in 0..UNROLL_FACTOR {
            for simd_counter in &mut simd_counters {
                *simd_counter =
                    pessimize::hide(safe_arch::add_i64_m128i(*simd_counter, simd_increment));
            }
            // Spread the scalar increments that remain after leaving room for
            // outer loop maintenance across the unrolled iterations
            // (saturating_sub avoids underflow when LOOP_INSNS > SCALAR_ILP)
            for scalar_counter in scalar_counters
                .iter_mut()
                .take(unrolled_scalar_ops.saturating_sub(unroll_iter * SCALAR_ILP))
            {
                *scalar_counter = pessimize::hide(*scalar_counter + 1)
            }
        }
    }

    // Accumulate remaining elements using as much parallelism as possible
    let mut remainder = (target % full_width as u64) as usize;
    while remainder > SIMD_WIDTH {
        for simd_counter in simd_counters.iter_mut().take(remainder / SIMD_WIDTH) {
            *simd_counter =
                pessimize::hide(safe_arch::add_i64_m128i(*simd_counter, simd_increment));
            remainder -= SIMD_WIDTH;
        }
    }
    while remainder > 0 {
        for scalar_counter in scalar_counters.iter_mut().take(remainder) {
            *scalar_counter = pessimize::hide(*scalar_counter + 1);
            remainder -= 1;
        }
    }

    // Merge accumulators using parallel reduction
    let mut stride = (SIMD_ILP.max(SCALAR_ILP)).next_power_of_two() / 2;
    while stride > 0 {
        for i in 0..stride.min(SIMD_ILP.saturating_sub(stride)) {
            simd_counters[i] =
                safe_arch::add_i64_m128i(simd_counters[i], simd_counters[i + stride]);
        }
        for i in 0..stride.min(SCALAR_ILP.saturating_sub(stride)) {
            scalar_counters[i] += scalar_counters[i + stride];
        }
        stride /= 2;
    }
    let simd_counter = simd_counters[0];
    let mut scalar_counter = scalar_counters[0];

    // Merge the SIMD counters and scalar counter into one
    let counters: [u64; SIMD_WIDTH] = simd_counter.into();
    scalar_counter += counters.iter().sum::<u64>();
    scalar_counter
}
}

…but I observed no performance benefits with respect to the SIMD + ILP version, only a slight performance degradation whose magnitude varied depending on the tuning parameters.
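
For concreteness, an instantiation matching the reasoning above would pair 3 SIMD streams with 2 scalar streams per cycle; the unroll factor and loop-instruction estimate below are illustrative guesses, not the exact configurations that I benchmarked:

#![allow(unused)]
fn main() {
// Hypothetical tuning: 3 SSE + 2 scalar increment streams, an arbitrary
// unroll factor of 4, and an assumed 2 instructions of outer loop upkeep.
let count = extreme_ilp::<3, 2, 4, 2>(1_000_000_000);
assert_eq!(count, 1_000_000_000);
}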

This puzzles me. Obviously I must be hitting some microarchitectural limit, but it’s not clear which. I looked up every source of microarchitectural documentation that I know of and tried to check every relevant AMD performance counter for this code and the previous one. But I found nothing that could explain this lack of ILP improvements.

However, AMD’s CPU performance counters are much less numerous than Intel’s, which suggests that they may have more blind spots. So it may be that by retrying this analysis on an Intel chip, I could find the answer to this enigma. Maybe I’ll do that someday. Or maybe one of you readers will find the answer before I do and tell me about it, so that I can update the book accordingly!

In any case, though, since this approach does not currently provide any measurable improvement and requires even more convoluted code, I am very happy to let it go and stick with the simpler SIMD + ILP approach for the remainder of the current version of this book.

Refactoring

Speaking of keeping code understandable, before we scale this code up further, we need to do something about the cognitive complexity of the SIMD + ILP solution. While less scary than the hybrid scalar + SIMD version, this code still feels too complex, and we’re far from being done with it.

To address this, let me separate concerns of SIMD computation and instruction-level parallelism a bit. First, for SIMD abstraction purposes, let’s introduce the following trait.

#![allow(unused)]
fn main() {
/// Set of integer counters with SIMD semantics
pub trait SimdAccumulator<Counter>: Copy + Eq + pessimize::Pessimize + Sized {
    /// Number of inner accumulators
    const WIDTH: usize = std::mem::size_of::<Self>() / std::mem::size_of::<Counter>();

    /// Set up accumulators, all initialized to 0 or 1
    fn identity(one: bool) -> Self;

    /// Set up accumulators initialized to 0
    #[inline]
    fn zeros() -> Self {
        Self::identity(false)
    }

    /// Set up accumulators initialized to 1
    #[inline]
    fn ones() -> Self {
        Self::identity(true)
    }

    /// Merge another set of accumulators into this one
    fn add(self, other: Self) -> Self;

    /// Add one to every accumulator in the set in a manner that cannot be
    /// optimized out by the compiler
    #[inline]
    fn increment(&mut self) {
        *self = pessimize::hide(Self::add(*self, Self::ones()));
    }

    /// Merge another accumulator into this one
    #[inline]
    fn merge(&mut self, other: Self) {
        *self = Self::add(*self, other);
    }

    /// Doubles the size of Counter if it's not u64 yet, else stays at u64
    type ReducedCounter;

    /// Goes to another SimdAccumulator type that is half as wide if Counter is
    /// u64 and WIDTH is not yet 1, else stays at Self.
    type ReducedAccumulator: SimdAccumulator<Self::ReducedCounter>;

    /// Go to the next step down the reduction pipeline
    fn reduce_step(self) -> [Self::ReducedAccumulator; 2];

    /// Reduce all the way to a 64-bit counter
    /// Must be specialized to the identity function for u64 scalars
    #[inline]
    fn reduce(self) -> u64 {
        let [mut half, half2] = self.reduce_step();
        half.merge(half2);
        half.reduce()
    }
}
}

While mostly straightforward, this trait has a number of generic knobs that we do not need yet but are going to use later on in this book. Let’s go through them briefly:

  • The trait is parametrized on an inner integer type, so that we can use all of the integer widths supported by a given SIMD vector type later on.
  • Reduction from a SIMD vector of counters to a single 64-bit integer is performed through a series of recursive steps (hand-expanded for the SSE case in the sketch after this list):
    • First, we double the inner integer size until we reach 64 bits, by converting each half of the SIMD vector to integers twice as wide and merging the resulting SIMD vectors.
    • Then we recursively split the resulting SIMD vector of 64-bit integers in halves and merge the halves until we get down to a single 64-bit integer.
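
As a concrete illustration, here is what this recursion boils down to for the SSE vector of two 64-bit counters that we will implement below. This hand-expanded helper is for exposition only, since the trait’s default reduce() implementation does the same thing:

#![allow(unused)]
fn main() {
// Hand-expanded equivalent of reduce() for an m128i holding two u64 counters
fn reduce_m128i_by_hand(v: safe_arch::m128i) -> u64 {
    // reduce_step(): split the vector into its two 64-bit halves
    let [low, high]: [u64; 2] = v.into();
    // merge() the halves; the final u64::reduce() is then the identity
    low + high
}
}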

The trait can be easily implemented for both our initial u64 scalar counter and our new m128i SIMD counter as follows…

#![allow(unused)]
fn main() {
impl SimdAccumulator<u64> for safe_arch::m128i {
    #[inline]
    fn identity(one: bool) -> Self {
        Self::from([one as u64; <Self as SimdAccumulator<u64>>::WIDTH])
    }

    #[inline]
    fn add(self, other: Self) -> Self {
        safe_arch::add_i64_m128i(self, other)
    }

    // SSE vectors of 64-bit integers reduce to 64-bit integers
    type ReducedCounter = u64;
    type ReducedAccumulator = u64;
    //
    #[inline]
    fn reduce_step(self) -> [Self::ReducedAccumulator; 2] {
        self.into()
    }
}

impl SimdAccumulator<u64> for u64 {
    #[inline]
    fn identity(one: bool) -> Self {
        one as u64
    }

    #[inline]
    fn add(self, other: Self) -> Self {
        self + other
    }

    // Scalar u64 is the end of the reduction recursion
    type ReducedCounter = u64;
    type ReducedAccumulator = u64;
    //
    #[inline]
    fn reduce_step(self) -> [Self::ReducedAccumulator; 2] {
        [self, 0]
    }
    //
    #[inline]
    fn reduce(self) -> u64 {
        self
    }
}
}

…and then we can write a generic SIMD + ILP implementation that can work in terms of any implementation of SimdAccumulator<u64>, superseding all of our previous _ilp implementations:

#![allow(unused)]
fn main() {
pub fn generic_ilp_u64<const ILP_WIDTH: usize, Simd: SimdAccumulator<u64>>(target: u64) -> u64 {
    assert_ne!(ILP_WIDTH, 0, "No progress possible in this configuration");

    // Set up counters
    let mut simd_accumulators = [Simd::zeros(); ILP_WIDTH];

    // Accumulate in parallel
    let full_width = (Simd::WIDTH * ILP_WIDTH) as u64;
    for _ in 0..(target / full_width) {
        for simd_accumulator in &mut simd_accumulators {
            simd_accumulator.increment();
        }
    }

    // Accumulate remaining SIMD vectors of elements
    let mut remainder = (target % full_width) as usize;
    while remainder >= Simd::WIDTH {
        for simd_accumulator in simd_accumulators.iter_mut().take(remainder / Simd::WIDTH) {
            simd_accumulator.increment();
            remainder -= Simd::WIDTH;
        }
    }

    // Merge SIMD accumulators using parallel reduction
    let mut stride = ILP_WIDTH.next_power_of_two() / 2;
    while stride > 0 {
        for i in 0..stride.min(ILP_WIDTH - stride) {
            simd_accumulators[i].merge(simd_accumulators[i + stride]);
        }
        stride /= 2;
    }
    let simd_accumulator = simd_accumulators[0];

    // Merge the SIMD counters into a scalar counter
    let mut counter = simd_accumulator.reduce();

    // Accumulate trailing element, if any, the scalar way
    for _ in 0..remainder {
        counter.increment();
    }
    counter
}
}
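
As a quick illustration, the specialized versions we wrote earlier now boil down to simple instantiations of this generic function (the wrapper names below are made up for this example, and the ILP widths are the sweet spots measured previously):

#![allow(unused)]
fn main() {
// Hypothetical wrappers mapping the earlier specialized versions onto the
// generic implementation
pub fn simd_ilp_via_generic(target: u64) -> u64 {
    generic_ilp_u64::<9, safe_arch::m128i>(target)
}
pub fn scalar_ilp_via_generic(target: u64) -> u64 {
    generic_ilp_u64::<15, u64>(target)
}
}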

With that, we can stop worrying about ILP for now and offload that concern to generic_ilp_u64, keeping our focus on perfecting our usage of SIMD.