Instruction-Level Parallelism (ILP)

Having reached the end of the previous chapter, you may think that the naive for loop is optimal: it increments the counter on every clock cycle, so it seemingly keeps the CPU busy all the time. However, you would be wrong.

You see, modern CPUs are superscalar, which means that they can do multiple things per clock cycle. In the case of my Zen 2 CPU, you can check over at uops.info that it can actually perform four integer increments per cycle. This is called “Instruction-Level Parallelism”, or ILP for short. The question, then, is why we do not observe this form of parallelism in the simple counting loop.

The problem is that the CPU may only execute multiple instructions at the same time if those instructions are independent of each other. Here, however, each counter value depends on the previous one, so the counter increments are not independent and must wait for each other. To get instruction-level parallelism, we need to maintain multiple independent counters and spread the counting work across them like this:

pub fn ilp<const WIDTH: usize>(target: u64) -> u64 {
    assert_ne!(WIDTH, 0, "No progress possible in this configuration");

    // Accumulate in parallel
    let mut counters = [0u64; WIDTH];
    for _ in 0..(target / WIDTH as u64) {
        for counter in &mut counters {
            *counter = pessimize::hide(*counter + 1);
        }
    }

    // Accumulate remaining elements
    for counter in counters.iter_mut().take(target as usize % WIDTH) {
        *counter = pessimize::hide(*counter + 1);
    }

    // Merge accumulators using parallel reduction
    let mut stride = WIDTH.next_power_of_two() / 2;
    while stride > 0 {
        for i in 0..stride.min(WIDTH - stride) {
            counters[i] += counters[i + stride];
        }
        stride /= 2;
    }
    counters[0]
}

Notice that we need extra code at the end to merge the counters and to handle the odd elements that don’t fit into an integer multiple of the number of counters. This will be a recurring theme in the remainder of this book.

Given that the CPU can do 4 increments per cycle, you might expect optimal instruction-level performance to be achieved with 4 independent counters. In practice, however, performance keeps improving all the way up to 15 counters. This is because the counting loop needs extra instructions for its own bookkeeping, and adding counters forces the compiler to unroll the loop further, which amortizes that overhead. Fifteen is also the maximum supported by the x86_64 architecture before counters start being spilled from registers to RAM.

With those 15 counters, we observe a peak throughput of 3.6 increments per cycle: a little less than the 4 we hoped for, but still pretty close.

In terms of scaling with the number of counting iterations, parallel counting already performs close to sequential counting after a single iteration of the parallel loop (15 increments), and beats it from two iterations onward. Instruction-level parallelism is thus a good way to speed up large loops without sacrificing the performance of smaller loops too much.