Aussie AI Blog

Interleaved Vector Dot Product Using Mogami Add-as-Integer for Instruction-Level Parallelism

  • March 1st, 2026
  • by David Spuler, Ph.D.

Abstract

Instruction-level parallelism is an optimization in both CPU and GPU processors that takes advantage of low-level arithmetic concurrency in different parts of the silicon. It is a well-known optimization to use multiple accumulators to avoid data dependency in code paths to permit multiple arithmetic operations to be pipelined and computed in parallel. However, such optimizations typically limit their applicability to either the Arithmetic Logic Unit (ALU) for integer arithmetic or the Floating-Point Unit (FPU) for floating-point operations. This paper examines a method of using inherent parallelism in both the ALU and FPU at the same time, with applicability to vector dot product, by combining both integer and floating-point operations. To do so, the add-as-integer method of Mogami (2020) is used to achieve Mitchell's approximate multiplication for some vector elements, which is interleaved with standard FP32 multiplications on other elements. The result is that the operations are split between integer addition and floating-point multiplication, allowing full instruction-level parallelism of both the ALU and FPU.

However, results were largely negative and the modified versions of vector dot product were either approximately the same cost, or significantly worse. The reasons for this are not entirely clear, but may include that: (a) memory access cost overwhelms compute cost, (b) the add-as-integer approach actually requires two integer additions versus a single floating-point multiplication, or (c) the add-as-integer accumulation requires floating-point addition, which causes both the ALU and FPU to be used.

Introduction

The idea of this research is to combine two different optimizations in vector dot product computations, and by extension, into matrix multiplication. The optimizations are:

  • Instruction-level parallelism (ILP)
  • Mogami's add-as-integer optimization for floating-point multiplication

Instruction-level parallelism is extra parallelism hidden in each core of a CPU or GPU. In particular, a CPU can often implement floating-point and integer operations in parallel, because they are executed by different parts of the silicon:

  • Floating-Point Unit (FPU) — floating-point multiplication or floating-point addition
  • Arithmetic Logic Unit (ALU) — integer operations

The idea here for vector dot product is to try to parallelize chunks of the computation so that some is done by the FPU and some by the ALU. Mogami's add-as-integer idea is the trick that allows us to use integer arithmetic instead of floating-point multiplications.

Mogami's Add-as-Integer Optimization

Mogami's method is an approximate multiplication implemented via integer addition. This is a very weird idea and it seems almost magical that it works. The trick is basically to pretend that a 32-bit floating-point number (with its 1 sign bit, 8 exponent bits, and 23 mantissa bits) is actually a signed 32-bit integer, and to add the two operands together as integers. It doesn't do full multiplication, but it does an approximation called Mitchell's approximate multiplication.

Example: Add-as-Int Mogami Approximate Multiplication: The method uses C++ casts to trick the compiler into using the floats as if they were ints, and then subtracts an offset to correct the exponent bias. Let's say we want to try optimizing a basic float multiply:

     float fc = f1 * f2;   // Floating-point multiply

This is slow, so we want to try the Mogami (2020) idea to change it into addition instead. Note that fancy coding is required. A simple version doesn't work:

     int c = (int)f1 + (int)f2;  // Not multiplication!
     float fc = (float)c;

That code isn't tricking the compiler and it isn't doing multiplication at all. It does a full conversion from float to int, with all that entails, and this is nothing like floating point multiplication.

Instead, type casting is required. Assuming that both int and float are 32-bit types, a coded version in C++ looks like:

     int c = *(int*)&(f1) + *(int*)&(f2) - 0x3f800000;  // Mogami(2020)
     float fc = *(float*)&c;

How does this even work? It seems like hocus pocus. The effect is that integer addition on the 8 exponent bits is like doing a multiplication, because the exponents are powers of two, and multiplying two numbers adds their exponents. Adding the 23 mantissa bits together isn't really multiplication, but it's close enough to act as an approximate version of it. Some of the theory of why this works is examined in Kosson & Jaggi (2023). Overall, it works like multiplication on both positive and negative floating-point values, but faster, because it uses integer addition. The accuracy is such that the difference from regular float multiplication (i.e., the error) is less than 15%. In my testing it was usually less than 12%, so it's a very good approximation of multiplication, for a significant speedup in arithmetic calculations.

Note that the temporary integer variable is hard to get rid of in C++, and might require assembler instead. The "+" operator leaves the 32-bit integer result in a register, but I can't find a way to re-interpret that temporary value as a 32-bit float without first storing it to a temporary variable. A simple typecast to float doesn't work in C++:

     float fc = (float) ( *(int*)&(f1) + *(int*)&(f2) - 0x3f800000 );  // Fails...

The above doesn't work because the integer is converted by the float typecast, which is very different from re-interpreting the 32-bit temporary integer as a 32-bit float. In fact, the code above is really just a bug, as I discovered myself. It doesn't really compute anything very meaningful, not even approximately.

Example: Add-as-Integer Vector Dot Product: Here's what it looks like to put Mogami's method into a vector dot product to create an approximate version (but faster):

    float aussie_vecdot_add_as_int_mogami(float v1[], float v2[], int n)   // Add as integer
    {
	float sum = 0.0;
	for (int i = 0; i < n; i++) {
		int c = *(int*)&(v1[i]) + *(int*)&(v2[i]) - 0x3f800000;  // Mogami(2020)
		sum += *(float*)&c;
	}
	return sum;
    }

This is not a fully optimized version. For example, the iterator variable i should be removed via pointer arithmetic.

Why does Mogami's add-as-integer idea even work? There are a few points that help to explain why it is a close approximation:

  • Implicit leading-1 in the mantissa bits. Because neither mantissa field stores the implicit leading 1, the integer addition is not adding those leading bits together. Instead, it only adds the lower-order bits of the mantissa, which helps explain why the error rate is low.
  • Fails for fixed-point representations. The Mogami add-as-integer approach is unlikely to be a useful approximation for fixed-point numbers. Multiplication of fixed-point numbers is already implemented as integer multiplication, and fixed-point formats do not use an implicit leading-1 mantissa bit. The same comments apply to hybrid approaches such as block floating-point, where the add-as-integer approximation is also unlikely to be very accurate.
  • Adjacency of exponent and mantissa bits. When integer addition is used on the mantissa bits, if the highest-order explicit bits of both mantissas are set, a 1 is carried up into the next bit position, which is the lowest exponent bit, since the exponent sits next to the mantissa in the IEEE 754 format. Hence, an extra 1 is added to the exponent in this case. Note that the highest-order explicit mantissa bit is really the second-highest bit of the full significand, because of the implicit leading-1 bit in both mantissas. So, adding the second-highest significand bits, when both are 1, may add 1 to the exponent (in addition to the two exponents being added together).

Instruction-Level Parallelism

Modern CPUs have a hidden level of parallelism that few programmers take advantage of. It's the ability to run two or more machine code instructions in parallel, and it's called Instruction-Level Parallelism (ILP).

Note that this is not referring to SIMD instructions such as x86 AVX or ARM Neon/SVE2 instructions. Those are powerful vectorizing instructions, but ILP applies to any ordinary arithmetic instruction. It's not a new type of instruction; it's the fact that better CPUs do more things in parallel.

Also, this isn't referring to running multiple instructions on different cores, which is, you know, multithreading. ILP is parallelism on a single thread, and you can multithread on top of it, to get multiple levels of parallelization.

How does ILP work? In short: different parts of the chip. Modern CPUs are so densely packed with transistors that they often have multiple copies of each type of arithmetic unit.

Parallel Units. If a CPU has multiple copies of a unit, it can do two things of the same type in parallel. There are also different types of units for different instructions, which can likewise run in parallel. The types of computation units include:

  • Arithmetic Logic Unit (ALU) — basic integer arithmetic, comparisons, addition, multiply, etc.
  • Floating-point unit (FPU) — doing addition and multiplication in floating-point (with the IEEE 754 standard).

Some CPUs can send addition to a different part of the chip than multiplication. Others can only separate integer arithmetic from floating-point computations. It depends on the chip's capabilities, and on how many separate ALUs and FPUs it has.

Instruction Pipelining. Another aspect is the pipelining of different operations inside the CPU. This works at a level below even the machine code instructions, with so-called "micro-operations" or just "micro-ops". Every machine code instruction, like an addition or a multiplication, is separated into much smaller steps:

  • Load the data from memory
  • Store the data to a register
  • Load the other data operand
  • Store that to another register
  • Add them together
  • Store the result to a third register

There's a lot more to it, but that's the conceptual idea. And there's still more. CPUs are designed to run these micro-ops, such that:

  • Instructions may wait for memory (if they need data)
  • Any instructions that are ready can run
  • Out-of-order execution is allowed!

This starts getting to be too much to think about. Show me some code already!

Vector Dot Product Optimization

Vector dot product multiplies each pair of corresponding elements of two vectors, and then adds up all of the products. Here's the basic idea in C++ code:

    float aussie_vecdot_basic(
         const float v1[], const float v2[], int n)   
    {
        // Basic float vector dot product
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += v1[i] * v2[i];
        }
        return sum;
    }

Basic Loop Unrolling of Vector Dot Product

Here's a basic loop-unrolled version with 2 steps per iteration:

    float aussie_vecdot_basic_unroll2(
         const float v1[], const float v2[], int n)  
    {
        // Unrolled float vector dot product
        assert(n % 2 == 0);
        float sum = 0.0;
        for (int i = 0; i < n; i += 2) {
            sum += v1[i] * v2[i];
            sum += v1[i+1] * v2[i+1];
        }
        return sum;
    }

Each iteration does two of the multiply-add sequences, so this is 2-level loop unrolling. Note that the loop update is "i += 2" rather than "i++", so there are half as many iterations of the loop, each with double the loop body.

ILP Optimization of Vector Dot Product

How can we take advantage of ILP and out-of-order execution in C++? The answer is surprisingly simple: remove data dependencies. If we have two instructions with data dependency, as in the above unrolled vector dot product, then they cannot really run in parallel, nor can they interleave with pipelining. Here's the code tweak that removes the data dependency on "sum" from the loop-unrolled version:

    float aussie_vecdot_basic_unroll2_ILP(
        const float v1[], const float v2[], int n)  
    {
        // Uses Instruction-Level Parallelism (ILP)
        assert(n % 2 == 0);
        float sum1 = 0.0;
        float sum2 = 0.0;
        for (int i = 0; i < n; i += 2) {
            sum1 += v1[i] * v2[i];
            sum2 += v1[i + 1] * v2[i + 1];
        }
        return sum1 + sum2;
    }

It's only a small change, but it removes the data dependency between the two summations. Each multiply-addition sequence is independent of the other. Hence, they could run in parallel, or at least be partially pipelined over each other.

Basic Mogami Add-as-Integer Vector Dot Product

We can use add-as-integer for the whole vector dot product, which gives:

float aussie_vecdot_mogami_add_as_integer(const float v1[], const float v2[], int n)   // Mogami Add-as-integer FLOAT vector dot product
{
	float sum = 0.0;
	for (int i = 0; i < n; i++) {
		int c = *(int*)&v1[i] + *(int*)&v2[i] - 0x3f800000;
		sum += *(float*)&c;
	}
	return sum;
}

Loop-Unrolled Mogami Add-as-Integer with Parallel Accumulators

This version uses level-2 loop unrolling and two parallel accumulators with the add-as-integer method.

float aussie_vecdot_mogami_add_as_integer_unroll2_ILP(const float v1[], const float v2[], int n)   // Mogami Add-as-integer FLOAT vector dot product
{
	assert(n % 2 == 0);
	float sum1 = 0.0;
	float sum2 = 0.0;
	for (int i = 0; i < n; i += 2) {
		int c1 = *(int*)&v1[i] + *(int*)&v2[i] - 0x3f800000;
		sum1 += *(float*)&c1;

		int c2 = *(int*)&v1[i+1] + *(int*)&v2[i+1] - 0x3f800000;
		sum2 += *(float*)&c2;
	}
	return sum1 + sum2;
}

Interleaved Mogami and FP32 with Level 2 Loop Unrolling

A first attempt at ILP across the ALU and FPU is to use a 2-iteration unrolling of the loop, where one does normal floating-point multiplication, and the other does integer arithmetic using the add-as-integer trick.

float aussie_vecdot_basic_unroll2_ILP_interleave_Mogami(const float v1[], const float v2[], int n)   // Interleaved FP32/Mogami vector dot product
{
	static_assert(0x3f800000 == (127 << 23), "Mogami Exponent Bias Value");
	static_assert(sizeof(int) == sizeof(float), "int and float same size");
	float sum1 = 0.0;
	float sum2 = 0.0;
	for (int i = 0; i < n; i += 2) {
		sum1 += v1[i] * v2[i];
		// sum2 += v1[i + 1] * v2[i + 1];
		
		// Mogami add-as-integer
		int c = *(int*)&v1[i+1] + *(int*)&v2[i+1] - 0x3f800000;
		sum2 += *(float*)&c;
	}
	return sum1 + sum2;
}

Basic ILP Loop Unrolled 32

This is the basic idea of ILP maximized by using 32 parallel accumulators, to see if a core can parallelize 32 floating-point multiplications at once. There's no add-as-integer in this version.

float aussie_vecdot_basic_unroll32_ILP(const float v1[], const float v2[], int n)   // Basic FLOAT vector dot product
{
	assert(n % 32 == 0);
	float sum1 = 0.0;
	float sum2 = 0.0;
	float sum3 = 0.0;
	float sum4 = 0.0;
	float sum5 = 0.0;
	float sum6 = 0.0;
	float sum7 = 0.0;
	float sum8 = 0.0;

	float sum9 = 0.0;
	float sum10 = 0.0;
	float sum11 = 0.0;
	float sum12 = 0.0;
	float sum13 = 0.0;
	float sum14 = 0.0;
	float sum15 = 0.0;
	float sum16 = 0.0;

	float sum1B = 0.0;
	float sum2B = 0.0;
	float sum3B = 0.0;
	float sum4B = 0.0;
	float sum5B = 0.0;
	float sum6B = 0.0;
	float sum7B = 0.0;
	float sum8B = 0.0;

	float sum9B = 0.0;
	float sum10B = 0.0;
	float sum11B = 0.0;
	float sum12B = 0.0;
	float sum13B = 0.0;
	float sum14B = 0.0;
	float sum15B = 0.0;
	float sum16B = 0.0;


	for (int i = 0; i < n; i += 32) {
		sum1 += v1[i] * v2[i];
		sum2 += v1[i + 1] * v2[i + 1];
		sum3 += v1[i + 2] * v2[i + 2];
		sum4 += v1[i + 3] * v2[i + 3];
		sum5 += v1[i + 4] * v2[i + 4];
		sum6 += v1[i + 5] * v2[i + 5];
		sum7 += v1[i + 6] * v2[i + 6];
		sum8 += v1[i + 7] * v2[i + 7];
		sum9 += v1[i + 8] * v2[i + 8];
		sum10 += v1[i + 9] * v2[i + 9];
		sum11 += v1[i + 10] * v2[i + 10];
		sum12 += v1[i + 11] * v2[i + 11];
		sum13 += v1[i + 12] * v2[i + 12];
		sum14 += v1[i + 13] * v2[i + 13];
		sum15 += v1[i + 14] * v2[i + 14];
		sum16 += v1[i + 15] * v2[i + 15];

		sum1B += v1[i + 16] * v2[i + 16];
		sum2B += v1[i + 1 + 16] * v2[i + 1 + 16];
		sum3B += v1[i + 2 + 16] * v2[i + 2 + 16];
		sum4B += v1[i + 3 + 16] * v2[i + 3 + 16];
		sum5B += v1[i + 4 + 16] * v2[i + 4 + 16];
		sum6B += v1[i + 5 + 16] * v2[i + 5 + 16];
		sum7B += v1[i + 6 + 16] * v2[i + 6 + 16];
		sum8B += v1[i + 7 + 16] * v2[i + 7 + 16];
		sum9B += v1[i + 8 + 16] * v2[i + 8 + 16];
		sum10B += v1[i + 9 + 16] * v2[i + 9 + 16];
		sum11B += v1[i + 10 + 16] * v2[i + 10 + 16];
		sum12B += v1[i + 11 + 16] * v2[i + 11 + 16];
		sum13B += v1[i + 12 + 16] * v2[i + 12 + 16];
		sum14B += v1[i + 13 + 16] * v2[i + 13 + 16];
		sum15B += v1[i + 14 + 16] * v2[i + 14 + 16];
		sum16B += v1[i + 15 + 16] * v2[i + 15 + 16];
	}
	return sum1 + sum2 + sum3 + sum4
	 	 + sum5 + sum6 + sum7 + sum8
		 + sum9 + sum10 + sum11 + sum12
		 + sum13 + sum14 + sum15 + sum16
	     + sum1B + sum2B + sum3B + sum4B
		 + sum5B + sum6B + sum7B + sum8B
		 + sum9B + sum10B + sum11B + sum12B
		 + sum13B + sum14B + sum15B + sum16B;
}

Interleaved Mogami and FP32 Loop Unrolled 32

This version attempts to interleave 16 floating-point operations and 16 add-as-integer operations, to try to get ILP across both the ALU and the FPU.

float aussie_vecdot_basic_unroll32_ILP_MOGAMI(const float v1[], const float v2[], int n)   // Interleaved FP32/Mogami unroll-32 vector dot product
{
	assert(n % 32 == 0);
	float sum1 = 0.0;
	float sum2 = 0.0;
	float sum3 = 0.0;
	float sum4 = 0.0;
	float sum5 = 0.0;
	float sum6 = 0.0;
	float sum7 = 0.0;
	float sum8 = 0.0;

	float sum9 = 0.0;
	float sum10 = 0.0;
	float sum11 = 0.0;
	float sum12 = 0.0;
	float sum13 = 0.0;
	float sum14 = 0.0;
	float sum15 = 0.0;
	float sum16 = 0.0;

	float sum1B = 0.0;
	float sum2B = 0.0;
	float sum3B = 0.0;
	float sum4B = 0.0;
	float sum5B = 0.0;
	float sum6B = 0.0;
	float sum7B = 0.0;
	float sum8B = 0.0;

	float sum9B = 0.0;
	float sum10B = 0.0;
	float sum11B = 0.0;
	float sum12B = 0.0;
	float sum13B = 0.0;
	float sum14B = 0.0;
	float sum15B = 0.0;
	float sum16B = 0.0;


	for (int i = 0; i < n; i += 32) {
		sum1 += v1[i] * v2[i];
		sum2 += v1[i + 1] * v2[i + 1];
		sum3 += v1[i + 2] * v2[i + 2];
		sum4 += v1[i + 3] * v2[i + 3];
		sum5 += v1[i + 4] * v2[i + 4];
		sum6 += v1[i + 5] * v2[i + 5];
		sum7 += v1[i + 6] * v2[i + 6];
		sum8 += v1[i + 7] * v2[i + 7];
		sum9 += v1[i + 8] * v2[i + 8];
		sum10 += v1[i + 9] * v2[i + 9];
		sum11 += v1[i + 10] * v2[i + 10];
		sum12 += v1[i + 11] * v2[i + 11];
		sum13 += v1[i + 12] * v2[i + 12];
		sum14 += v1[i + 13] * v2[i + 13];
		sum15 += v1[i + 14] * v2[i + 14];
		sum16 += v1[i + 15] * v2[i + 15];


		// Mogami add-as-integer
		int c1B = *(int*)&v1[i + 16] + *(int*)&v2[i + 16] - 0x3f800000;
		sum1B += *(float*)&c1B;
		int c2B = *(int*)&v1[i + 1 + 16] + *(int*)&v2[i + 1 + 16] - 0x3f800000;
		sum2B += *(float*)&c2B;
		int c3B = *(int*)&v1[i + 2 + 16] + *(int*)&v2[i + 2 + 16] - 0x3f800000;
		sum3B += *(float*)&c3B;
		int c4B = *(int*)&v1[i + 3 + 16] + *(int*)&v2[i + 3 + 16] - 0x3f800000;
		sum4B += *(float*)&c4B;
		int c5B = *(int*)&v1[i + 4 + 16] + *(int*)&v2[i + 4 + 16] - 0x3f800000;
		sum5B += *(float*)&c5B;
		int c6B = *(int*)&v1[i + 5 + 16] + *(int*)&v2[i + 5 + 16] - 0x3f800000;
		sum6B += *(float*)&c6B;
		int c7B = *(int*)&v1[i + 6 + 16] + *(int*)&v2[i + 6 + 16] - 0x3f800000;
		sum7B += *(float*)&c7B;
		int c8B = *(int*)&v1[i + 7 + 16] + *(int*)&v2[i + 7 + 16] - 0x3f800000;
		sum8B += *(float*)&c8B;
		int c9B = *(int*)&v1[i + 8 + 16] + *(int*)&v2[i + 8 + 16] - 0x3f800000;
		sum9B += *(float*)&c9B;
		int c10B = *(int*)&v1[i + 9 + 16] + *(int*)&v2[i + 9 + 16] - 0x3f800000;
		sum10B += *(float*)&c10B;
		int c11B = *(int*)&v1[i + 10 + 16] + *(int*)&v2[i + 10 + 16] - 0x3f800000;
		sum11B += *(float*)&c11B;
		int c12B = *(int*)&v1[i + 11 + 16] + *(int*)&v2[i + 11 + 16] - 0x3f800000;
		sum12B += *(float*)&c12B;
		int c13B = *(int*)&v1[i + 12 + 16] + *(int*)&v2[i + 12 + 16] - 0x3f800000;
		sum13B += *(float*)&c13B;
		int c14B = *(int*)&v1[i + 13 + 16] + *(int*)&v2[i + 13 + 16] - 0x3f800000;
		sum14B += *(float*)&c14B;
		int c15B = *(int*)&v1[i + 14 + 16] + *(int*)&v2[i + 14 + 16] - 0x3f800000;
		sum15B += *(float*)&c15B;
		int c16B = *(int*)&v1[i + 15 + 16] + *(int*)&v2[i + 15 + 16] - 0x3f800000;
		sum16B += *(float*)&c16B;
	}
	return sum1 + sum2 + sum3 + sum4
		+ sum5 + sum6 + sum7 + sum8
		+ sum9 + sum10 + sum11 + sum12
		+ sum13 + sum14 + sum15 + sum16
		+ sum1B + sum2B + sum3B + sum4B
		+ sum5B + sum6B + sum7B + sum8B
		+ sum9B + sum10B + sum11B + sum12B
		+ sum13B + sum14B + sum15B + sum16B;
}

Results

Windows results on a laptop with x86 CPU (without Visual Studio compiler optimization flags enabled):

    FLOAT Vector dot product benchmarks: (N=10240, Iter=100000)
    Vecdot float basic: 3149 ticks (3.15 seconds)
    Vecdot float basic (REPEAT): 2898 ticks (2.90 seconds)
    Vecdot float with permutation: 2915 ticks (2.92 seconds)
    Vecdot float with interleaved vectors: 2875 ticks (2.88 seconds)
    Vecdot float basic unroll2: 2899 ticks (2.90 seconds)
    Vecdot float basic unroll2 ILP: 1473 ticks (1.47 seconds)
    Vecdot float basic unroll2 ILP mogami interleave simple: 1474 ticks (1.47 seconds)
    Vecdot float basic unroll2 ILP mogami interleave better: 1473 ticks (1.47 seconds)
    Vecdot float basic unroll4 ILP: 1018 ticks (1.02 seconds)
    Vecdot float basic unroll8 ILP: 859 ticks (0.86 seconds)
    Vecdot float basic unroll16 ILP: 793 ticks (0.79 seconds)
    Vecdot float Mogami unroll32 ILP: 942 ticks (0.94 seconds)
    Vecdot float Mogami NO OFFSET unroll32 ILP: 930 ticks (0.93 seconds)
    Vecdot float basic unroll32 ILP: 776 ticks (0.78 seconds)
    Vecdot float basic unroll8 mult/add: 1107 ticks (1.11 seconds)
    Vecdot mogami add-as-int: 2922 ticks (2.92 seconds)
    Vecdot mogami add-as-int unroll2 ILP: 1506 ticks (1.51 seconds)
    Vecdot mogami no-bias: 2928 ticks (2.93 seconds)
    Vecdot float ptr arith: 2868 ticks (2.87 seconds)
    Vecdot AVX1 unroll with DP (4 floats, 128-bits): 1183 ticks (1.18 seconds)
    Vecdot AVX1 unroll with MUL/ADD (4 floats, 128-bits): 1277 ticks (1.28 seconds)
    Vecdot AVX1 unroll with MUL/ADD TWICE (4 floats, 128-bits): 1273 ticks (1.27 seconds)
    Vecdot AVX1 unroll with MUL/ADD TWICE ILP (4 floats, 128-bits): 966 ticks (0.97 seconds)
    Vecdot AVX1 unroll with Mogami MUL/ADD (4 floats, 128-bits): 1305 ticks (1.30 seconds)
    

Linux server results on a much faster data center server using the GCC compiler with "-O3" optimization flags enabled:

    FLOAT Vector dot product benchmarks: (N=10240, Iter=100000)
    Vecdot float basic: 800000 ticks (0.80 seconds)
    Vecdot float basic (REPEAT): 800000 ticks (0.80 seconds)
    Vecdot float with permutation: 800000 ticks (0.80 seconds)
    Vecdot float with interleaved vectors: 810000 ticks (0.81 seconds)
    Vecdot float basic unroll2: 800000 ticks (0.80 seconds)
    Vecdot float basic unroll2 ILP: 400000 ticks (0.40 seconds)
    Vecdot float basic unroll2 ILP mogami interleave simple: 430000 ticks (0.43 seconds)
    Vecdot float basic unroll2 ILP mogami interleave better: 430000 ticks (0.43 seconds)
    Vecdot float basic unroll4 ILP: 230000 ticks (0.23 seconds)
    Vecdot float basic unroll8 ILP: 270000 ticks (0.27 seconds)
    Vecdot float basic unroll16 ILP: 340000 ticks (0.34 seconds)
    Vecdot float Mogami unroll32 ILP: 930000 ticks (0.93 seconds)
    Vecdot float Mogami NO OFFSET unroll32 ILP: 920000 ticks (0.92 seconds)
    Vecdot float basic unroll32 ILP: 460000 ticks (0.46 seconds)
    Vecdot float basic unroll8 mult/add: 280000 ticks (0.28 seconds)
    Vecdot mogami add-as-int: 850000 ticks (0.85 seconds)
    Vecdot mogami add-as-int unroll2 ILP: 430000 ticks (0.43 seconds)
    Vecdot mogami no-bias: 850000 ticks (0.85 seconds)
    Vecdot float ptr arith: 800000 ticks (0.80 seconds)
    

Instruction-Level Parallelism with Parallel Accumulators: The well-known optimization trick to use multiple parallel accumulators was shown to be effective on both CPU platforms, effectively halving the computation time.

    On Windows:
    Vecdot float basic unroll2: 2899 ticks (2.90 seconds)
    Vecdot float basic unroll2 ILP: 1473 ticks (1.47 seconds)
    On Linux:
    Vecdot float basic unroll2: 800000 ticks (0.80 seconds)
    Vecdot float basic unroll2 ILP: 400000 ticks (0.40 seconds)
    

Full Mogami add-as-integer vector dot product: Changing the whole vector dot product computation from floating-point multiplication to integer arithmetic gave these results:

    On Windows:
        Vecdot float basic (REPEAT): 2898 ticks (2.90 seconds)
        Vecdot mogami add-as-int: 2922 ticks (2.92 seconds)
    On Linux:
        Vecdot float basic (REPEAT): 800000 ticks (0.80 seconds)
        Vecdot mogami add-as-int: 850000 ticks (0.85 seconds)
    

This shows that the add-as-integer method is slightly slower than the basic floating-point version. Possible reasons for the disappointing results are that add-as-integer actually requires two integer additions, with the second being the subtraction of 0x3f800000 to correct the exponent bits, and that the FPU is still involved because the accumulation must be floating-point addition (not integer).

Interleaved Floating-Point and Add-as-Integer Method: The idea of an interleaved method is to use loop unrolling and then change half of the unrolled operations from floating-point multiplication to integer addition via the add-as-integer calculation. The results of this approach are:

    On Windows (unrolling level 2):
        Vecdot float basic unroll2 ILP: 1473 ticks (1.47 seconds)
        Vecdot float basic unroll2 ILP mogami interleave simple: 1474 ticks (1.47 seconds)
    
    On Windows (unrolling level 32):
        Vecdot float basic unroll32 ILP: 776 ticks (0.78 seconds)
        Vecdot float Mogami unroll32 ILP: 942 ticks (0.94 seconds)
    
    On Linux (unrolling level 2):
        Vecdot float basic unroll2 ILP: 400000 ticks (0.40 seconds)
        Vecdot float basic unroll2 ILP mogami interleave better: 430000 ticks (0.43 seconds)
    
    On Linux (unrolling level 32):
        Vecdot float basic unroll32 ILP: 460000 ticks (0.46 seconds)
        Vecdot float Mogami unroll32 ILP: 930000 ticks (0.93 seconds)
    

Hence, interleaving to achieve ILP across the FPU and ALU is not successful. The interleaved version is either about the same, or notably slower.

Conclusion

All of the results were negative and the idea is a de-optimization. The interleaving method was not better than standard instruction-level parallelism with multiple parallel accumulators and no Mogami add-as-integer arithmetic.

Negative results occur from all attempts:

  • Full Mogami version is not faster than the basic floating-point multiplication version
  • Interleaved ILP version with multiple accumulators is also not faster than an ILP floating-point version

Possible reasons for the failed optimization:

  • Memory access cost for loading the vector elements vastly outweighs the compute cost of either floating-point operations or integer addition. This seems likely, but doesn't explain why using add-as-integer is a performance penalty.
  • Integer addition is not much faster than floating-point multiplication, so the gain is not seen. Also probably correct, but doesn't explain why it is sometimes slower.
  • Add-as-integer requires not one but two integer addition operations, because of the need to fix the offset in the exponent. However, this theory was tested with a dummy modified C++ version that simply removed the second addition; its computation is totally inaccurate, but its speed can still be measured. It was only marginally faster, with a gain of less than 1%, so the two-additions issue does not seem to be the cause of the slowdown.
  • The C++ code for the interleaved version has 16 standard multiplications and then 16 add-as-integer operations. Should these be better interleaved in the C++ code itself? A version with this reordering of the 32 operations was tested, and made almost no difference to the speed, so this does not seem to be the cause. Presumably the low-level instruction ordering performed by the CPU to queue up work is performing similarly in both versions.
  • Fused-multiply-add (FMA) instructions could also be the cause, since such instructions are standard for vector dot product computations. The add-as-integer method cannot use FMA instructions, because its sequence is integer addition and then floating-point addition, so it needs a "fused add-add" opcode (across ALU and FPU), which does not exist in CPU instruction sets. However, in any case, review of compiler-generated assembly code showed that none of the versions seemed to be automatically using the FMA low-level CPU instructions, although such an optimization inside the CPU itself cannot be ruled out.
  • The add-as-integer optimization still needs to perform floating-point addition for its accumulators, so the compute uses both the ALU and FPU, and interferes with the other FPU computations. This seems the most plausible explanation, since there may be contention in the FPU for doing both the floating-point multiplication in the standard code, and the floating-point addition in the add-as-integer code.

Although an intriguing attempt at a speedup, this is not a lossless optimization, even if it were successful, because the add-as-integer operations are an approximation, with up to 15% error rate. Hence, the use of this approach in matrix multiplications in LLMs is a trade-off between computational cost and model accuracy.

Limitations

Some of the major limitations:

  • GPU version with CUDA C++ not tested here.
  • Not many CPU configurations were tested (only two).
  • Performance under different C++ compiler optimization levels was not extensively tested.
  • Even if it were faster, the vector dot product result is approximate, not exact, since all of the Mogami integer arithmetic is approximate.

Extensions

Some possible extensions to the analysis include:

  • Is there a way to avoid the extra subtraction of 0x3f800000? Can you defer it to the end and then add a bigger offset to the accumulators?
  • Using a non-IEEE-754 floating-point number format that doesn't use a biased exponent, and hence doesn't need the subtraction of 0x3f800000
  • AVX SIMD C++ CPU versions (x86 CPU) to analyze ILP in the AVX instructions.
  • CUDA C++ GPU version
  • Additional C++ code optimizations have not been examined extensively, such as the use of pointer arithmetic, inlined functions, restricted pointer designations, or testing more compiler optimization flags, which could be applied to all code examples.

References

  1. T. Mogami, 2020, Deep neural network training without multiplications, Beyond Backpropagation Workshop at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), https://arxiv.org/abs/2012.03458 (multiplication of floating-point numbers with integer addition, using Mitchell's approximate multiplication)
  2. Lingyun Yao, Martin Trapp, Karthekeyan Periasamy, Jelin Leslin, Gaurav Singh, Martin Andraud, June 2023, Logarithm-Approximate Floating-Point Multiplier for Hardware-efficient Inference in Probabilistic Circuits, Proceedings of the 6th Workshop on Tractable Probabilistic Modeling, https://openreview.net/forum?id=WL7YDLOLfK, PDF: https://openreview.net/pdf?id=WL7YDLOLfK (probabilistic speed improvement; uses Mogami's approximate multiplier)
  3. A. Kosson, M. Jaggi, 2023, Hardware-Efficient Transformer Training via Piecewise Affine Operations, arXiv preprint arXiv:2305.17190, https://arxiv.org/abs/2305.17190, Code: https://github.com/epfml/piecewise-affine-multiplication (uses the Mogami method with neural networks, including multiple components of the model, in both training and inference; also gives a theoretical explanation of why Mogami integer addition works, including its correct handling of sign bits)
  4. X. Li, B. Liu, R. H. Yang, V. Courville, C. Xing, V. P. Nia, 2023, DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2023), https://openaccess.thecvf.com/content/ICCV2023/papers/Li_DenseShift_Towards_Accurate_and_Efficient_Low-Bit_Power-of-Two_Quantization_ICCV_2023_paper.pdf (not a full add-as-integer method, but uses integer addition on the sign and exponent bits of IEEE 754 floats to perform bitshifts on floats for power-of-two quantization of 32-bit floats)
  5. David Spuler, March 2024, Chapter 51. Zero-Multiplication Models, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  6. Lingyun Yao, Martin Trapp, Jelin Leslin, Gaurav Singh, Peng Zhang, Karthekeyan Periasamy, Martin Andraud, 22 May 2024, On Hardware-efficient Inference in Probabilistic Circuits, https://arxiv.org/abs/2405.13639
  7. C. Hakert, K.-H. Chen, J.-J. Chen, 2024, FLInt: Exploiting Floating Point Enabled Integer Arithmetic for Efficient Random Forest Inference, 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), Valencia, Spain, pp. 1-2, doi: 10.23919/DATE58400.2024.10546851, https://ieeexplore.ieee.org/abstract/document/10546851
  8. David Spuler, March 2024, Example: Add-as-int Approximate Multiply, in Generative AI in C++, https://www.aussieai.com/book/ch9-example-add-as-integer
  9. Hongyin Luo, Wei Sun, 2 Oct 2024 (v2), Addition is All You Need for Energy-efficient Language Models, https://arxiv.org/abs/2410.00907 (similar to the Mogami add-as-integer method, but applied only to 3-bit and 4-bit mantissas, with relevance to FP8)
  10. Y. Chen, J. Zou, X. Chen, 2025, April: Accuracy-Improved Floating-Point Approximation For Neural Network Accelerators, 2025 62nd ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, pp. 1-7, doi: 10.1109/DAC63849.2025.11133083, https://ieeexplore.ieee.org/abstract/document/11133083, PDF: https://soldierchen.github.io/assets/pdf/April-DAC25.pdf
  11. Z. Hu, S. Zhu, L. Wang, et al., 2024, A neural network accelerated optimization method for FPGA, Journal of Combinatorial Optimization 47, 84, https://doi.org/10.1007/s10878-024-01117-x
  12. Jiaxiang Zou, Yonghao Chen, Xingyu Chen, Chenxi Xu, Xinyu Chen, 2025, AxCore: A Quantization-Aware Approximate GEMM Unit for LLM Inference, Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 839-853, https://doi.org/10.1145/3725843.3756094, PDF: https://dl.acm.org/doi/pdf/10.1145/3725843.3756094
  13. Yulhwa Kim, Jaeyong Jang, Jehun Lee, Jihoon Park, Jeonghoon Kim, Byeongwook Kim, Baeseong Park, Se Jung Kwon, Dongsoo Lee, Jae-Joon Kim, 2023, Winning Both the Accuracy of Floating Point Activation and the Simplicity of Integer Arithmetic, 11th International Conference on Learning Representations (ICLR 2023)

More AI Research Topics

Read more about:

Aussie AI Advanced C++ Coding Books



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • Multithreading optimizations
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency



Advanced C++ Memory Techniques: Efficiency & Safety:
  • Memory optimization techniques
  • Memory-efficient data structures
  • DIY memory safety techniques
  • Intercepting memory primitives
  • Preventive memory safety
  • Memory reduction optimizations

Get your copy from Amazon: Advanced C++ Memory Techniques



Safe C++: Fixing Memory Safety Issues:
  • The memory safety debate
  • Memory and non-memory safety
  • Pragmatic approach to safe C++
  • Rust versus C++
  • DIY memory safety methods
  • Safe standard C++ library

Get it from Amazon: Safe C++: Fixing Memory Safety Issues



Efficient C++ Multithreading: Modern Concurrency Optimization:
  • Multithreading optimization techniques
  • Reduce synchronization overhead
  • Standard container multithreading
  • Multithreaded data structures
  • Memory access optimizations
  • Sequential code optimizations

Get your copy from Amazon: Efficient C++ Multithreading



Efficient Modern C++ Data Structures:
  • Data structures overview
  • Modern C++ container efficiency
  • Time & space optimizations
  • Contiguous data structures
  • Multidimensional data structures

Get your copy from Amazon: Efficient C++ Data Structures



Low Latency C++: Multithreading and Hotpath Optimizations: advanced coding book:
  • Low Latency for AI and other applications
  • C++ multithreading optimizations
  • Efficient C++ coding
  • Time and space efficiency
  • C++ slug catalog

Get your copy from Amazon: Low Latency C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging