Aussie AI Blog

Vector Dot Product Optimization in C++ with Instruction-Level Parallelism

  • August 29th, 2025
  • by David Spuler, Ph.D.

What is Instruction-Level Parallelism?

Modern CPUs have a hidden level of parallelism that few programmers take advantage of. It's the ability to run two or more machine code instructions in parallel, and it's called Instruction-Level Parallelism (ILP).

Note that this is not referring to SIMD instructions such as x86 AVX or ARM Neon/SVE2 instructions. Those are powerful vectorizing instructions, but ILP applies to any type of basic arithmetic instruction. It's not a new type of instruction; it's simply the fact that better CPUs do more things in parallel.

Also, this isn't referring to running multiple instructions on different cores, which is, you know, multithreading. ILP is parallelism on a single thread, and you can multithread on top of it, to get multiple levels of parallelization.

How Does Instruction-Level Parallelism Work?

In short: different parts of the chip. Basically, modern CPUs are so densely packed with transistors that they often have multiple copies of each type of arithmetic unit.

Parallel Units. If a CPU has multiple copies of a unit, it can do two of those operations in parallel. There are also different types of units for different instructions, and those can also run in parallel. The types of computation units include:

  • Arithmetic Logic Unit (ALU) — basic integer arithmetic: comparisons, addition, multiplication, etc.
  • Floating-point unit (FPU) — addition and multiplication on floating-point numbers (IEEE 754 standard).

Some CPUs can send an addition to a different part of the chip than a multiplication. Others can only separate the integer arithmetic from the floating-point computations. It depends on the chip's capabilities, and how many separate ALUs and FPUs it has.

Instruction Pipelining. Another aspect is the pipelining of operations inside the CPU. This happens at a level below even the machine code instructions, with steps called "micro-operations" or just "micro-ops". Every machine code instruction, like an addition or a multiplication, is broken down into much smaller steps:

  • Load the data from memory
  • Store the data to a register
  • Load the other data operand
  • Store that to another register
  • Add them together
  • Store the result to a third register

There's a lot more to it, but that's the conceptual idea. CPUs are designed to run these micro-ops such that:

  • Instructions may wait for memory (if they need data)
  • Any instructions that are ready can run
  • Out-of-order execution is allowed!

This starts getting to be too much to think about. Show me some code already!

Vector Dot Product Optimization

A vector dot product multiplies each pair of corresponding elements from two vectors, and then adds all of those products together. Here's the basic idea in C++ code:

    float aussie_vecdot_basic(
         const float v1[], const float v2[], int n)   
    {
        // Basic float vector dot product
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += v1[i] * v2[i];
        }
        return sum;
    }

Basic Loop Unrolling of Vector Dot Product

Here's a basic loop-unrolled version with 2 steps per iteration:

    float aussie_vecdot_basic_unroll2(
         const float v1[], const float v2[], int n)  
    {
        // Unrolled float vector dot product
        assert(n % 2 == 0);
        float sum = 0.0;
        for (int i = 0; i < n; i += 2) {
            sum += v1[i] * v2[i];
            sum += v1[i+1] * v2[i+1];
        }
        return sum;
    }

Each iteration does two of the multiply-add sequences, so this is a 2-level loop unrolling. Note that the loop increment is "i+=2" rather than "i++", so there are half as many iterations of the loop, each with double the loop body.

ILP Optimization of Vector Dot Product

How can we take advantage of ILP and out-of-order execution in C++? The answer is surprisingly simple: remove data dependencies. If two instructions have a data dependency, as in the unrolled vector dot product above, then they cannot really run in parallel, nor can they interleave in the pipeline. Here's the code tweak that removes the data dependency on "sum" from the loop-unrolled version:

    float aussie_vecdot_basic_unroll2_ILP(
        const float v1[], const float v2[], int n)  
    {
        // Uses Instruction-Level Parallelism (ILP)
        assert(n % 2 == 0);
        float sum1 = 0.0;
        float sum2 = 0.0;
        for (int i = 0; i < n; i += 2) {
            sum1 += v1[i] * v2[i];
            sum2 += v1[i + 1] * v2[i + 1];
        }
        return sum1 + sum2;
    }

It's only a small change, but it removes the data dependency between the two summations. Each multiply-addition sequence is independent of the other. Hence, they could run in parallel, or at least be partially pipelined over each other.

Speedup Results

Here's a test run of the three options, with a vector size of 1024 and float data.

    FLOAT Vector dot product benchmarks: (N=1024, Iter=1000000)
    Vecdot float basic: 3803 ticks (3.80 seconds)
    Vecdot float basic unroll2: 3237 ticks (3.24 seconds)
    Vecdot float basic unroll2 ILP: 1673 ticks (1.67 seconds)

As you can see, the loop-unrolled version is somewhat faster on a CPU than the naive version. However, just by tweaking the C++ code to use an extra float "sum2" variable, our code ran almost twice as fast as the unrolled version. By making the computations of "sum1" and "sum2" independent, each loop iteration can leverage the power of ILP with pipelining, parallel instruction execution, and possibly even out-of-order execution.

The level of ILP can be further improved by using more unrolled loop bodies, each with its own accumulator variable, adding them all together at the very end. Here's what I got by doing this with 4, 8, 16, and 32 operations per iteration:

    Vecdot float basic unroll4 ILP: 1160 ticks (1.16 seconds)
    Vecdot float basic unroll8 ILP: 986 ticks (0.99 seconds)
    Vecdot float basic unroll16 ILP: 949 ticks (0.95 seconds)
    Vecdot float basic unroll32 ILP: 897 ticks (0.90 seconds)

As you can see, the benefit levels off, and the later improvements probably come mostly from the reduced loop overhead of deeper unrolling, rather than from any extra ILP.
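
For reference, here's a sketch of what the 4-way unrolled ILP version might look like (the function name just follows the naming pattern above and is illustrative; it assumes n is a multiple of 4, and the 8, 16, and 32 versions extend the same pattern):

    float aussie_vecdot_basic_unroll4_ILP(
        const float v1[], const float v2[], int n)
    {
        // 4-way unrolling with four independent accumulators (ILP)
        assert(n % 4 == 0);
        float sum1 = 0.0;
        float sum2 = 0.0;
        float sum3 = 0.0;
        float sum4 = 0.0;
        for (int i = 0; i < n; i += 4) {
            sum1 += v1[i] * v2[i];
            sum2 += v1[i + 1] * v2[i + 1];
            sum3 += v1[i + 2] * v2[i + 2];
            sum4 += v1[i + 3] * v2[i + 3];
        }
        return sum1 + sum2 + sum3 + sum4;
    }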

Note that there are various general C++ optimizations that could be used to further improve this code:

  • Pointer arithmetic (avoiding array index accesses via "i"; see the sketch after this list)
  • Inlining the function call
  • Operator strength reduction (e.g., "n&1" is faster than "n%2")
  • Using a fixed constant vector size known at compile-time
  • Fully unrolling the loop to remove the loop comparison tests
  • Higher compiler optimization settings
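
As an illustration of the first few items, here's a sketch that combines pointer arithmetic, the "n&1" strength reduction, and the 2-way ILP accumulators (the function name is hypothetical, and the actual benefit depends on your compiler and optimization settings):

    float aussie_vecdot_ptr_unroll2_ILP(
        const float v1[], const float v2[], int n)
    {
        // Pointer-arithmetic version of the 2-way ILP dot product
        assert((n & 1) == 0);  // n must be even (strength-reduced "n % 2")
        float sum1 = 0.0;
        float sum2 = 0.0;
        const float* end1 = v1 + n;  // one past the last element of v1
        for (; v1 != end1; v1 += 2, v2 += 2) {
            sum1 += v1[0] * v2[0];
            sum2 += v1[1] * v2[1];
        }
        return sum1 + sum2;
    }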

There are also many more advanced ways to make additional parallelizations of vector dot product:

  • SIMD vectorized CPU instructions — AVX for x86 CPUs or Neon/SVE for Arm CPUs (see the sketch after this list)
  • Multithreading — the ILP optimization can be used in each thread on each core.
  • GPU parallelization with CUDA C++ — that's a whole other story.
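
To give a flavor of the SIMD option, here's a rough AVX sketch that multiplies and accumulates 8 floats per iteration in a single 256-bit register, with a horizontal reduction at the end (illustrative only; the function name is hypothetical, and it assumes an AVX-capable x86 CPU and n a multiple of 8):

    #include <cassert>
    #include <immintrin.h>  // AVX intrinsics

    float aussie_vecdot_avx_sketch(
        const float v1[], const float v2[], int n)
    {
        // AVX dot product: 8 floats per iteration in one 256-bit register
        assert(n % 8 == 0);
        __m256 sums = _mm256_setzero_ps();  // 8 running partial sums
        for (int i = 0; i < n; i += 8) {
            __m256 a = _mm256_loadu_ps(&v1[i]);  // load 8 floats (unaligned)
            __m256 b = _mm256_loadu_ps(&v2[i]);
            sums = _mm256_add_ps(sums, _mm256_mul_ps(a, b));
        }
        // Horizontal reduction: add the 8 partial sums together
        float partial[8];
        _mm256_storeu_ps(partial, sums);
        float sum = 0.0;
        for (int k = 0; k < 8; k++)
            sum += partial[k];
        return sum;
    }

Note that the single "sums" register is itself a data dependency across iterations, so a production version would typically keep several __m256 accumulators, applying the same ILP trick at the SIMD level.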

More AI Research Topics

Read more about:

Aussie AI Advanced C++ Coding Books



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • Multithreading optimizations
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency



Advanced C++ Memory Techniques: Efficiency & Safety:
  • Memory optimization techniques
  • Memory-efficient data structures
  • DIY memory safety techniques
  • Intercepting memory primitives
  • Preventive memory safety
  • Memory reduction optimizations

Get your copy from Amazon: Advanced C++ Memory Techniques



Safe C++: Fixing Memory Safety Issues:
  • The memory safety debate
  • Memory and non-memory safety
  • Pragmatic approach to safe C++
  • Rust versus C++
  • DIY memory safety methods
  • Safe standard C++ library

Get it from Amazon: Safe C++: Fixing Memory Safety Issues



Efficient C++ Multithreading: Modern Concurrency Optimization:
  • Multithreading optimization techniques
  • Reduce synchronization overhead
  • Standard container multithreading
  • Multithreaded data structures
  • Memory access optimizations
  • Sequential code optimizations

Get your copy from Amazon: Efficient C++ Multithreading



Efficient Modern C++ Data Structures:
  • Data structures overview
  • Modern C++ container efficiency
  • Time & space optimizations
  • Contiguous data structures
  • Multidimensional data structures

Get your copy from Amazon: Efficient C++ Data Structures



Low Latency C++: Multithreading and Hotpath Optimizations (advanced coding book):
  • Low Latency for AI and other applications
  • C++ multithreading optimizations
  • Efficient C++ coding
  • Time and space efficiency
  • C++ slug catalog

Get your copy from Amazon: Low Latency C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging