Chapter 12. Instruction-Level Parallelism
-
Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
-
by David Spuler, Ph.D.
What is Instruction-Level Parallelism?
Instruction-Level Parallelism (ILP) is a CPU optimization at the lowest levels of machine instruction processing. If you thought parallel programming was about multithreading, SIMD vectorization and GPU kernels, there’s a whole other level deeper down, inside the CPU itself.
Modern CPUs are amazingly advanced, and they have been architected to use various types of extra parallelism. Some of the types of instruction-level parallelism in a modern CPU include:
- Parallel execution units
- Pipelined execution of micro-ops
- Out-of-order execution of instructions
- Prefetching of instructions
- Branch prediction
- Memory data prefetching
Importantly, the CPU has true parallelism across its instruction execution units. In fact, a CPU can typically run four or more machine instructions in parallel in the same clock cycle, by using multiple execution units on different parts of the chip.
Instruction Reordering Optimizations
Instruction reordering is a type of Instruction-Level Parallelism (ILP), and is an optimization performed inside the CPU where it actually runs the machine code instructions out-of-order. The way this works, in simple terms (a small example follows this list), is:
- Delay any opcodes that don’t have the data they need (e.g., from memory).
- Run any instructions that are ready as soon as possible.
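As a small illustration of these two rules, here is a hypothetical fragment with made-up variable names: the first pair of statements forms a dependency chain, while the second pair is independent and can be executed in the same clock cycle.
int example(int x, int y, int z, int p, int q)
{
    int a = x * y;   // must complete before the next line can start
    int b = a + z;   // depends on 'a': a dependency chain
    int c = x * y;   // independent of the line below
    int d = p + q;   // the CPU can execute this alongside the line above
    return b + c + d;
}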
There’s a whole smash of fun to be had researching how this all works in the CPU. There are schedulers and “reservation stations” and various queues and caches. Kudos to all those hardware engineers.
Another special type of fun is for compiler engineers. GCC does a lot of fancy optimizations in its code generation backend to take advantage of instruction ordering.
But what about C++? Is there anything you can do in C++ to optimize your code? Or with inline assembly instructions?
Safety first. Most of the discussion of out-of-order execution and C++ occurs in relation to safety. Problems can arise across multiple threads if the reads and writes from our C++ statements are running out-of-order. I mean, how can it be good to just run my C++ code in any random order that the CPU chooses?
The issue of preventing out-of-order errors involves “memory order,” meaning the std::memory_order arguments on C++ atomic operations. Memory orderings are especially useful for correctly implementing lock-free algorithms with atomics, but they also act as memory barriers that can prevent any undesirable types of out-of-order execution.
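For example, here is a minimal sketch of the classic release/acquire pattern with std::atomic (the producer/consumer function names and the payload value are hypothetical), where the memory orderings act as the barriers just described:
#include <atomic>

std::atomic<bool> ready{false};   // synchronization flag
int payload = 0;                  // ordinary (non-atomic) data

void producer()
{
    payload = 42;                                   // plain write
    ready.store(true, std::memory_order_release);   // earlier writes cannot move after this store
}

int consumer()
{
    while (!ready.load(std::memory_order_acquire))  // later reads cannot move before this load
        ;                                           // spin until the flag is set
    return payload;                                 // guaranteed to observe 42
}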
Speed second. But the goal is to go faster! Rather than stopping the CPU from reordering instructions by using memory barriers, let’s maximize it! There are at least two major ideas:
- Minimize memory-waiting delays
- Exploit out-of-order instructions
The first point is to minimize the slowdowns whereby instructions get delayed. The main culprit is memory accesses, which have well-known solutions such as: cache hit maximization, cache lines, tiled memory accesses, contiguous memory blocks, reducing data sizes, etc.
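As a hedged illustration of one of those techniques (cache-friendly access order), here is the classic sketch of traversing a 2D array in row-major versus column-major order; the array dimensions are hypothetical:
#include <cstddef>

constexpr std::size_t ROWS = 1024, COLS = 1024;   // hypothetical sizes

// Row-major traversal: consecutive accesses touch adjacent memory,
// so each cache line that is fetched gets fully used.
float sum_row_major(const float (&m)[ROWS][COLS])
{
    float sum = 0.0f;
    for (std::size_t r = 0; r < ROWS; ++r)
        for (std::size_t c = 0; c < COLS; ++c)
            sum += m[r][c];
    return sum;
}

// Column-major traversal of the same data: each access jumps a whole
// row ahead, typically missing the cache on every iteration.
float sum_col_major(const float (&m)[ROWS][COLS])
{
    float sum = 0.0f;
    for (std::size_t c = 0; c < COLS; ++c)
        for (std::size_t r = 0; r < ROWS; ++r)
            sum += m[r][c];
    return sum;
}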
Other than cache locality, there’s not a lot of discussion anywhere in books or on the internet about exploiting out-of-order instruction execution to make code run faster. But there’s some discussion of this in Agner Fog’s astounding CPU resources; see (Fog, 2024). The key point is:
Free extra parallelism!
The average CPU has hidden parallelism in terms of its various computation pathways. For example, the CPU can run these two computations in parallel:
- Integer arithmetic — Arithmetic-Logic Unit (ALU)
- Floating-point arithmetic — Floating-Point Unit (FPU)
That’s not the full list. Modern CPUs now have more than one ALU, so they can perform two or more integer additions or comparisons in parallel. Some CPUs can also run different types of integer arithmetic, such as addition and multiplication, on separate pathways. Similarly, some of the SIMD operations run separately from the non-SIMD instructions.
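As a rough sketch of this idea (hypothetical function and data names), a single loop that accumulates both an integer checksum and a floating-point sum gives the CPU an integer chain for the ALU and a floating-point chain for the FPU, which it can overlap within each iteration, subject to the limitations discussed in the next section:
#include <cstdint>
#include <cstddef>

void checksum_and_sum(const std::int32_t ids[], const float values[], std::size_t n,
                      std::int64_t& checksum_out, float& sum_out)
{
    std::int64_t checksum = 0;   // integer dependency chain (ALU)
    float sum = 0.0f;            // floating-point dependency chain (FPU)
    for (std::size_t i = 0; i < n; ++i) {
        checksum += ids[i];      // integer add
        sum += values[i];        // floating-point add, independent of the line above
    }
    checksum_out = checksum;
    sum_out = sum;
}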
Out-of-Order Execution Optimizations
So, you can see the opportunity here, right? Not only can the CPU run the same operations in parallel via SIMD instructions, but it can run two (or more!) different types of computations in parallel.
Unfortunately, the opportunities for huge improvements to your C++ are somewhat limited. For example, if you have a computation with both integer and floating-point computations, can you parallelize them? Yes, but only in limited circumstances, where:
- The two computations don’t depend on each other’s results.
- Neither computation is waiting on memory accesses.
- The operands are values already loaded into CPU registers.
If there’s a dependency, they can’t run in parallel. And if they both require memory requests, that’s the bottleneck regardless of whether the instructions can run in parallel. The data needs to be already loaded from memory into CPU registers to run fast.
That’s quite a list of limitations, but it’s not insurmountable. The optimization methods include:
- Prefetching the memory (e.g., __builtin_prefetch() with GCC); a small sketch follows this list.
- Removing “dependency chains” from the sequence of arithmetic instructions.
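Here is a sketch of the first technique, assuming GCC or Clang; the prefetch distance is a made-up tuning value, and whether an explicit prefetch helps at all is very hardware-dependent (see the prefetching references at the end of this chapter):
// Sums an array, prefetching a fixed distance ahead of the current element.
float sum_with_prefetch(const float data[], int n)
{
    constexpr int PREFETCH_DISTANCE = 16;   // elements ahead; hypothetical tuning value
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n) {
            // Arguments: address, 0 = read, 3 = high temporal locality.
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += data[i];
    }
    return sum;
}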
For the second technique, one common way to remove data dependencies is to use multiple separate variables for intermediate results.
Multiple Accumulator Optimizations
A simple example of exploiting parallel arithmetic computations in the CPU is to use multiple accumulator variables for a vector dot product. Here’s an unrolled version of the dot product:
float vector_dot_product_unroll2_ILP(const float v1[], const float v2[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i += 2) {
        sum += v1[i] * v2[i];
        sum += v1[i+1] * v2[i+1];
    }
    return sum;
}
The problem is there’s a data dependency between the two additions. The two multiplications can run in parallel, if the CPU can do so, but the second “sum+=” operation must await the completion of the first one.
The solution that increases the opportunity for CPU instruction-level parallelism is:
Multiple separate accumulators!
Hence, the code becomes:
float vector_dot_product_unroll2(const float v1[], const float v2[], int n)
{
    float sum = 0.0f, sum2 = 0.0f;   // Two accumulators!
    for (int i = 0; i < n; i += 2) {
        sum += v1[i] * v2[i];
        sum2 += v1[i+1] * v2[i+1];
    }
    return sum + sum2;   // Add the accumulators
}
This new version allows the CPU to use out-of-order execution, and the compiler to apply other instruction-level parallelism optimizations, because the two “+=” operations inside the loop body are now independent.
This function also needs other optimizations applied to it, which are orthogonal to this idea of breaking data dependency chains, such as marking the pointers as “restrict” and using AVX SIMD vectorized instructions; a combined sketch follows.
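Here is a hedged sketch combining those ideas, assuming an x86 CPU with AVX2 and FMA support (and compiler flags such as -mavx2 -mfma): restrict-qualified pointers, SIMD intrinsics, and two independent vector accumulators. This is illustrative only, not a benchmarked implementation.
#include <immintrin.h>

float vector_dot_product_avx2(const float* __restrict v1,
                              const float* __restrict v2, int n)
{
    __m256 acc0 = _mm256_setzero_ps();   // two independent SIMD accumulators
    __m256 acc1 = _mm256_setzero_ps();
    int i = 0;
    for (; i + 16 <= n; i += 16) {       // 16 floats per iteration (2 x 8 lanes)
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(v1 + i),
                               _mm256_loadu_ps(v2 + i), acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(v1 + i + 8),
                               _mm256_loadu_ps(v2 + i + 8), acc1);
    }
    float buf[8];
    _mm256_storeu_ps(buf, _mm256_add_ps(acc0, acc1));
    float sum = buf[0] + buf[1] + buf[2] + buf[3]
              + buf[4] + buf[5] + buf[6] + buf[7];
    for (; i < n; ++i)                   // scalar tail for leftover elements
        sum += v1[i] * v2[i];
    return sum;
}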
References
- Agner Fog, 22 June 2024 (last updated), Optimizing subroutines in assembly language: An optimization guide for x86 platforms, https://www.agner.org/optimize/optimizing_assembly.pdf
- Agner Fog, 28 May 2024 (last updated), The microarchitecture of Intel, AMD, and VIA CPUs: An optimization guide for assembly programmers and compiler makers, https://www.agner.org/optimize/microarchitecture.pdf
- Daniel Lemire, April 2018, Is software prefetching (__builtin_prefetch) useful for performance? https://lemire.me/blog/2018/04/30/is-software-prefetching-__builtin_prefetch-useful-for-performance/
- Johnny’s Software Lab, March 31, 2024, The pros and cons of explicit software prefetching, https://johnnysswlab.com/the-pros-and-cons-of-explicit-software-prefetching/
- Katecpp, Oct 5, 2015, Improve performance with cache prefetching, http://katecpp.github.io/cache-prefetching/