Aussie AI Blog

Vector Dot Product Optimization in C++ with Instruction-Level Parallelism

  • August 29th, 2025
  • by David Spuler, Ph.D.

What is Instruction-Level Parallelism?

Modern CPUs have a hidden level of parallelism that few programmers take advantage of. It's the ability to run two or more machine code instructions in parallel, and it's called Instruction-Level Parallelism (ILP).

Note that this is not referring to SIMD instructions such as x86 AVX or ARM Neon/SVE2 instructions. Those are powerful vectorizing instructions, but ILP applies to any type of basic arithmetic instruction. It's not a new type of instruction; it's simply the fact that better CPUs do more things in parallel.

Also, this isn't referring to running multiple instructions on different cores, which is, you know, multithreading. ILP is parallelism on a single thread, and you can multithread on top of it, to get multiple levels of parallelization.

How Does Instruction-Level Parallelism Work?

In short: different parts of the chip. Basically, modern CPUs are so densely packed with transistors that they often have multiple copies of each type of arithmetic unit.

Parallel Units. If a CPU has multiple copies of a unit, it can do two of those operations in parallel. There are also different types of units for different instructions, and those can also run in parallel. The types of computation units include:

  • Arithmetic Logic Unit (ALU) — basic integer arithmetic: comparisons, addition, multiplication, etc.
  • Floating-point unit (FPU) — addition and multiplication on floating-point numbers (IEEE 754 standard).

Some CPUs can send an addition to a different part of the chip than a multiplication. Others can only separate the integer arithmetic from the floating-point computations. It depends on the chip's capabilities, and how many separate ALUs and FPUs it has.

Instruction Pipelining. Another aspect is the pipelining of operations inside the CPU. This happens at a level below even the machine code instructions, with steps called "micro-operations" or just "micro-ops". Every machine code instruction, like an addition or a multiplication, is broken down into much smaller steps:

  • Load the data from memory
  • Store the data to a register
  • Load the other data operand
  • Store that to another register
  • Add them together
  • Store the result to a third register

There's a lot more to it, but that's the conceptual idea. CPUs are designed to run these micro-ops such that:

  • Instructions may wait for memory (if they need data)
  • Any instructions that are ready can run
  • Out-of-order execution is allowed!

This starts getting to be too much to think about. Show me some code already!

Vector Dot Product Optimization

A vector dot product multiplies each pair of corresponding elements from two vectors, and then adds all of those products together. Here's the basic idea in C++ code:

    float aussie_vecdot_basic(
         const float v1[], const float v2[], int n)   
    {
        // Basic float vector dot product
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += v1[i] * v2[i];
        }
        return sum;
    }

Basic Loop Unrolling of Vector Dot Product

Here's a basic loop-unrolled version with 2 steps per iteration:

    float aussie_vecdot_basic_unroll2(
         const float v1[], const float v2[], int n)  
    {
        // Unrolled float vector dot product
        assert(n % 2 == 0);
        float sum = 0.0;
        for (int i = 0; i < n; i += 2) {
            sum += v1[i] * v2[i];
            sum += v1[i+1] * v2[i+1];
        }
        return sum;
    }

Each iteration does two of the multiply-add sequences, so this is a 2-level loop unrolling. Note that the loop increment is "i+=2" rather than "i++", so there are half as many iterations of the loop, each with double the loop body.

ILP Optimization of Vector Dot Product

How can we take advantage of ILP and out-of-order execution in C++? The answer is surprisingly simple: remove data dependencies. If two instructions have a data dependency, as in the unrolled vector dot product above, then they cannot really run in parallel, nor can they interleave in the pipeline. Here's the code tweak that removes the data dependency on "sum" from the loop-unrolled version:

    float aussie_vecdot_basic_unroll2_ILP(
        const float v1[], const float v2[], int n)  
    {
        // Uses Instruction-Level Parallelism (ILP)
        assert(n % 2 == 0);
        float sum1 = 0.0;
        float sum2 = 0.0;
        for (int i = 0; i < n; i += 2) {
            sum1 += v1[i] * v2[i];
            sum2 += v1[i + 1] * v2[i + 1];
        }
        return sum1 + sum2;
    }

It's only a small change, but it removes the data dependency between the two summations. Each multiply-addition sequence is independent of the other. Hence, they could run in parallel, or at least be partially pipelined over each other.

Speedup Results

Here's a test run of the three options, with a vector size of 1024 and float data.

    FLOAT Vector dot product benchmarks: (N=1024, Iter=1000000)
    Vecdot float basic: 3803 ticks (3.80 seconds)
    Vecdot float basic unroll2: 3237 ticks (3.24 seconds)
    Vecdot float basic unroll2 ILP: 1673 ticks (1.67 seconds)

As you can see, the loop-unrolled version is somewhat faster on a CPU than the naive version. However, just by tweaking the C++ code to use an extra float "sum2" variable, our code ran almost twice as fast as the unrolled version. By making the computations of "sum1" and "sum2" independent, each loop iteration can leverage the power of ILP with pipelining, parallel instruction execution, and possibly even out-of-order execution.

The level of ILP can be further improved by using more unrolled loop bodies, each with its own accumulator variable, adding them all together at the very end. Here's what I got by doing this with 4, 8, 16, and 32 operations per iteration:

    Vecdot float basic unroll4 ILP: 1160 ticks (1.16 seconds)
    Vecdot float basic unroll8 ILP: 986 ticks (0.99 seconds)
    Vecdot float basic unroll16 ILP: 949 ticks (0.95 seconds)
    Vecdot float basic unroll32 ILP: 897 ticks (0.90 seconds)

As you can see, the benefit levels off, and the later improvements probably come mostly from the reduced loop overhead of deeper unrolling, rather than from any extra ILP.
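
For reference, here's a sketch of what the 4-way unrolled ILP version might look like (the function name just follows the naming pattern above and is illustrative; it assumes n is a multiple of 4, and the 8, 16, and 32 versions extend the same pattern):

    float aussie_vecdot_basic_unroll4_ILP(
        const float v1[], const float v2[], int n)
    {
        // 4-way unrolling with four independent accumulators (ILP)
        assert(n % 4 == 0);
        float sum1 = 0.0;
        float sum2 = 0.0;
        float sum3 = 0.0;
        float sum4 = 0.0;
        for (int i = 0; i < n; i += 4) {
            sum1 += v1[i] * v2[i];
            sum2 += v1[i + 1] * v2[i + 1];
            sum3 += v1[i + 2] * v2[i + 2];
            sum4 += v1[i + 3] * v2[i + 3];
        }
        return sum1 + sum2 + sum3 + sum4;
    }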

Note that there are various general C++ optimizations that could be used to further improve this code:

  • Pointer arithmetic (avoiding array index accesses via "i"; see the sketch after this list)
  • Inlining the function call
  • Operator strength reduction (e.g., "n&1" is faster than "n%2")
  • Using a fixed constant vector size known at compile-time
  • Fully unrolling the loop to remove the loop comparison tests
  • Higher compiler optimization settings
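
As an illustration of the first few items, here's a sketch that combines pointer arithmetic, the "n&1" strength reduction, and the 2-way ILP accumulators (the function name is hypothetical, and the actual benefit depends on your compiler and optimization settings):

    float aussie_vecdot_ptr_unroll2_ILP(
        const float v1[], const float v2[], int n)
    {
        // Pointer-arithmetic version of the 2-way ILP dot product
        assert((n & 1) == 0);  // n must be even (strength-reduced "n % 2")
        float sum1 = 0.0;
        float sum2 = 0.0;
        const float* end1 = v1 + n;  // one past the last element of v1
        for (; v1 != end1; v1 += 2, v2 += 2) {
            sum1 += v1[0] * v2[0];
            sum2 += v1[1] * v2[1];
        }
        return sum1 + sum2;
    }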

There are also many more advanced ways to make additional parallelizations of vector dot product:

  • SIMD vectorized CPU instructions — AVX for x86 CPUs or Neon/SVE for Arm CPUs (see the sketch after this list)
  • Multithreading — the ILP optimization can be used in each thread on each core.
  • GPU parallelization with CUDA C++ — that's a whole other story.
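
To give a flavor of the SIMD option, here's a rough AVX sketch that multiplies and accumulates 8 floats per iteration in a single 256-bit register, with a horizontal reduction at the end (illustrative only; the function name is hypothetical, and it assumes an AVX-capable x86 CPU and n a multiple of 8):

    #include <cassert>
    #include <immintrin.h>  // AVX intrinsics

    float aussie_vecdot_avx_sketch(
        const float v1[], const float v2[], int n)
    {
        // AVX dot product: 8 floats per iteration in one 256-bit register
        assert(n % 8 == 0);
        __m256 sums = _mm256_setzero_ps();  // 8 running partial sums
        for (int i = 0; i < n; i += 8) {
            __m256 a = _mm256_loadu_ps(&v1[i]);  // load 8 floats (unaligned)
            __m256 b = _mm256_loadu_ps(&v2[i]);
            sums = _mm256_add_ps(sums, _mm256_mul_ps(a, b));
        }
        // Horizontal reduction: add the 8 partial sums together
        float partial[8];
        _mm256_storeu_ps(partial, sums);
        float sum = 0.0;
        for (int k = 0; k < 8; k++)
            sum += partial[k];
        return sum;
    }

Note that the single "sums" register is itself a data dependency across iterations, so a production version would typically keep several __m256 accumulators, applying the same ILP trick at the SIMD level.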

More AI Research Topics

Read more about:

Aussie AI Advanced C++ Coding Books



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • Multithreading optimizations
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency



Advanced C++ Memory Techniques: Efficiency & Safety:
  • Memory optimization techniques
  • Memory-efficient data structures
  • DIY memory safety techniques
  • Intercepting memory primitives
  • Preventive memory safety
  • Memory reduction optimizations

Get your copy from Amazon: Advanced C++ Memory Techniques



Safe C++: Fixing Memory Safety Issues:
  • The memory safety debate
  • Memory and non-memory safety
  • Pragmatic approach to safe C++
  • Rust versus C++
  • DIY memory safety methods
  • Safe standard C++ library

Get it from Amazon: Safe C++: Fixing Memory Safety Issues



Efficient C++ Multithreading: Modern Concurrency Optimization:
  • Multithreading optimization techniques
  • Reduce synchronization overhead
  • Standard container multithreading
  • Multithreaded data structures
  • Memory access optimizations
  • Sequential code optimizations

Get your copy from Amazon: Efficient C++ Multithreading



Efficient Modern C++ Data Structures:
  • Data structures overview
  • Modern C++ container efficiency
  • Time & space optimizations
  • Contiguous data structures
  • Multidimensional data structures

Get your copy from Amazon: Efficient C++ Data Structures



Low Latency C++: Multithreading and Hotpath Optimizations (advanced coding book):
  • Low Latency for AI and other applications
  • C++ multithreading optimizations
  • Efficient C++ coding
  • Time and space efficiency
  • C++ slug catalog

Get your copy from Amazon: Low Latency C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging