Aussie AI
Chapter 7. Vector Dot Product
Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
by David Spuler, Ph.D.
Vector Dot Product
An example of an operation with aspects of both vertical and horizontal arithmetic is the vector dot product. Its two arithmetic requirements are:
- Multiplication — pairwise on all elements (vertical vectorization).
- Accumulation — summing up all those values (horizontal reduction).
Conceptually, the computation is a vertical multiplication of element pairs, followed by a horizontal summation of those products. In practice, however, these two stages are merged together for a faster method.
Sequential Code
Here is the basic non-vectorized dot product computation without any optimization attempts.
float aussie_vecdot_basic(float v1[], float v2[], int n)
{
    // Basic FLOAT vector dot product
    float sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += v1[i] * v2[i];
    }
    return sum;
}
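As a quick sanity check, here is a minimal usage sketch (the test values and the wrapper function are purely illustrative, not part of the book's test suite):

void aussie_vecdot_basic_demo()
{
    // Illustrative values only: 1*5 + 2*6 + 3*7 + 4*8 = 70
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 5.0f, 6.0f, 7.0f, 8.0f };
    float d = aussie_vecdot_basic(a, b, 4);  // Expect 70.0
    (void)d;  // Suppress unused-variable warnings in this sketch
}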
There are several ways that we could improve this code, such as:
- Loop unrolling
- Pointer arithmetic (a brief sketch follows this list)
- Instruction-level parallelism
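For reference, here is a minimal sketch of the pointer-arithmetic variant (the function name is mine, and this version is not one of the routines benchmarked in this chapter):

float aussie_vecdot_pointer_arith(const float* v1, const float* v2, int n)
{
    // Pointer-arithmetic vector dot product (illustrative sketch)
    float sum = 0.0f;
    const float* endp = v1 + n;   // One past the last element
    while (v1 != endp) {
        sum += *v1++ * *v2++;     // Multiply-accumulate via pointer increments
    }
    return sum;
}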
However, first, let’s look at making it work fast on the CPU using AVX SIMD instructions.
AVX 128-Bit Dot Product
The AVX instruction set has a vector dot product intrinsic that wraps an x86 dot product hardware instruction on the CPU. There are versions of the dot product intrinsic for AVX (128-bit) and AVX-2 (256-bit), but no direct 512-bit equivalent in AVX-512.
For basic AVX (128 bits), this is a full vector dot product of two vectors with 4 x 32-bit float numbers in each vector. One oddity is that although the result is a floating-point scalar (i.e., a single 32-bit float), it's still stored in a 128-bit register, and must be extracted using the “_mm_cvtss_f32” intrinsic. The example code looks like:
float aussie_avx_vecdot_4_floats(float v1[4], float v2[4])
{
    // AVX dot product: 2 vectors of 4x32-bit floats
    __m128 r1 = _mm_loadu_ps(v1);            // Load floats
    __m128 r2 = _mm_loadu_ps(v2);
    __m128 dst = _mm_dp_ps(r1, r2, 0xf1);    // Dot product
    float fret = _mm_cvtss_f32(dst);         // Extract float
    return fret;
}
AVX-2 256-Bit Dot Product
Here is my attempt at the 256-bit version of a vector dot product of 8 float values using AVX-2 instructions, which seems like it should work:
float aussie_avx2_vecdot_8_floats_buggy(float v1[8], float v2[8])
{
    // AVX2 dot product: 2 vectors, 8x32-bit floats
    __m256 r1 = _mm256_loadu_ps(v1);             // Load floats
    __m256 r2 = _mm256_loadu_ps(v2);
    __m256 dst = _mm256_dp_ps(r1, r2, 0xf1);     // Bug!
    float fret = _mm256_cvtss_f32(dst);
    return fret;
}
But it doesn’t!
Instead of working on 8 pairs of float numbers, it does the vector dot product of only 4 pairs of float values, just like the first AVX code. The problem wasn't related to alignment to 256-bit blocks, because I added “alignas(32)” to the arrays passed in. Rather, the “_mm256_dp_ps” intrinsic doesn't actually do a full 256-bit dot product: it behaves like two copies of the 128-bit “_mm_dp_ps” intrinsic, computing a separate 4-float dot product within each 128-bit lane. These intrinsics are based on the VDPPS opcode in the x86 instruction set for 32-bit float values (there is also VDPPD for 64-bit double numbers), and even the 256-bit form of VDPPS works lane-by-lane rather than across the whole register. So the code above only ever extracts the lower lane's result for the first 4 float values.
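Given that lane-by-lane behavior, one possible workaround is to add the two per-lane results together explicitly. Here is a minimal sketch (the function name is mine, and this version is not one of the benchmarked routines below):

float aussie_avx2_vecdot_8_floats_lanes(float v1[8], float v2[8])
{
    // Sketch: _mm256_dp_ps computes one 4-float dot product per 128-bit lane,
    // so extract both lane results and add them to get the full 8-float dot product.
    __m256 r1 = _mm256_loadu_ps(v1);              // Load 8 floats
    __m256 r2 = _mm256_loadu_ps(v2);
    __m256 dst = _mm256_dp_ps(r1, r2, 0xf1);      // Per-lane dot products
    __m128 lo = _mm256_castps256_ps128(dst);      // Lower lane: sum in element 0
    __m128 hi = _mm256_extractf128_ps(dst, 1);    // Upper lane: sum in element 0
    return _mm_cvtss_f32(_mm_add_ss(lo, hi));     // Add the two lane sums
}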
Loop Unrolled Version
To use AVX to vectorize it, we need to unroll the loop first.
Here’s a simple vector dot product with its inner loop unrolled 4 times.
This version assumes that n is a multiple of 4 rather than handling odd cases:
float aussie_vecdot_unroll4_basic(float v1[], float v2[], int n)
{
    // Loop-unrolled Vector dot product
    if (n % 4 != 0) {
        aussie_assert(n % 4 == 0);
        return 0.0; // fail
    }
    float sum = 0.0;
    for (int i = 0; i < n; ) {
        sum += v1[i] * v2[i]; i++;
        sum += v1[i] * v2[i]; i++;
        sum += v1[i] * v2[i]; i++;
        sum += v1[i] * v2[i]; i++;
    }
    return sum;
}
So, now we can change those 4 unrolled multiplications into one AVX computation
of the vector dot product of 4 float numbers.
#include <intrin.h>

float aussie_vecdot_unroll_AVX1(float v1[], float v2[], int n)
{
    // AVX-1 loop-unrolled (4 floats) vector dot product
    if (n % 4 != 0) {
        aussie_assert(n % 4 == 0);
        return 0.0; // fail
    }
    float sum = 0.0;
    for (int i = 0; i < n; i += 4) {
        // AVX1: Vector dot product of 2 vectors
        // ... process 4x32-bit floats in 128 bits
        __m128 r1 = _mm_loadu_ps(&v1[i]);       // Load floats into 128-bits
        __m128 r2 = _mm_loadu_ps(&v2[i]);
        __m128 dst = _mm_dp_ps(r1, r2, 0xf1);   // Dot product
        sum += _mm_cvtss_f32(dst);
    }
    return sum;
}
This basic AVX sequence of code to do the 4 float dot product has been analyzed in a separate chapter. The main dot product computation is the “_mm_dp_ps” AVX intrinsic, which multiplies 4 pairs of 32-bit float numbers and then sums the products, all in a single call. Note that the loop now iterates 4 at a time through the arrays of float values (i.e., “i += 4”), and the AVX intrinsic does the rest.
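As a side note on that third argument: the high nibble of the mask selects which element pairs get multiplied, and the low nibble selects which output elements receive the sum (the others are zeroed), so 0xf1 multiplies all four pairs and stores the result only in element 0. A small illustrative sketch (the demo function and values are mine):

void aussie_dp_mask_demo()
{
    // Dot product of {1,2,3,4} and {5,6,7,8} is 70 (illustrative values only)
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  // a = {1,2,3,4}
    __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);  // b = {5,6,7,8}
    __m128 d1 = _mm_dp_ps(a, b, 0xf1);  // {70, 0, 0, 0}: sum stored only in element 0
    __m128 d2 = _mm_dp_ps(a, b, 0xff);  // {70, 70, 70, 70}: sum broadcast to all 4 elements
    (void)d1; (void)d2;  // Suppress unused-variable warnings in this sketch
}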
Here’s the benchmark analysis showing that the AVX-vectorized version is more than twice as fast:
FLOAT Vector dot product benchmarks:
Time taken: Vecdot basic: 2805 ticks (2.81 seconds)
Time taken: Vecdot AVX1 unroll (4 floats, 128-bits): 1142 ticks (1.14 seconds)
Fused-Multiply-Add (FMA)
The AVX-2 FMA intrinsic takes 3 vectors, each of size 256-bits, multiplies two of them pair-wise, and then adds the third vector. Both the multiplication and addition are done in element-wise SIMD style. At first blush this sounds like doing a vector multiply and then adding a “bias” vector, and hence doesn’t sound like a good optimization for the vector dot product. The SIMD pairwise multiplication is the first step of dot products, but the vector addition seems the opposite of what we want, which is “horizontal” addition of the products that result from the multiplications.
The obvious approach is to do a dot product of 8 float values, then another one, and then add the individual sums together at the end. With that idea, the vertical addition in FMA is not what we want, and it looks like using SIMD multiplication plus an extra horizontal addition would be better than using a single FMA intrinsic.
However, we can make like Superman III...
Reverse it!
If you think about FMA not as a multiplication and then addition, but as “adding multiplications” in the reverse order, then there is a eureka moment: put the addition first. The idea is that we can maintain a vector of running sums, and then only do a single horizontal addition at the very end. It’s kind of mind-bending, but here’s the code:
float aussie_vecdot_FMA_unroll_AVX2(float v1[], float v2[], int n)
{
    // AVX2 vecdot using FMA (Fused Multiply-Add) primitives
    if (n % 8 != 0) {
        aussie_assert(n % 8 == 0);
        return 0.0; // fail
    }
    __m256 sumdst = _mm256_setzero_ps();  // Set accumulators to zero
    for (int i = 0; i < n; i += 8) {
        // AVX2: process 8x32-bit floats in 256 bits
        __m256 r1 = _mm256_loadu_ps(&v1[i]);        // Load floats into 256-bits
        __m256 r2 = _mm256_loadu_ps(&v2[i]);
        sumdst = _mm256_fmadd_ps(r1, r2, sumdst);   // FMA of 3 vectors
    }
    // Add the final 8 accumulators manually
    float* farr = (float*)&sumdst;
    float sum = farr[0] + farr[1] + farr[2] + farr[3]
              + farr[4] + farr[5] + farr[6] + farr[7];
    return sum;
}
How does this work?
Well, we declare “sumdst” as a vector of 8 float numbers that maintains the 8 parallel accumulators, initialized to all-zeros via the “_mm256_setzero_ps” intrinsic. In the main loop, we use “sumdst” to maintain a running sum in all 8 of those parallel accumulators across multiple segments of the vector. One accumulator sums the products at array indices 0, 8, 16, ..., the next accumulator sums the products at indices 1, 9, 17, ..., and so on. We use the FMA intrinsic (“_mm256_fmadd_ps” in AVX-2) to do the SIMD multiplication, but rather than trying to add the 8 resulting products together, we add each product to a separate accumulator. This works very neatly, because the combined AVX-2 FMA intrinsic does all of this in SIMD parallelism. Only at the very end, after the main loop, do we perform a horizontal add of the 8 parallel accumulators to get the final sum.
This idea works surprisingly well, and is gratifying given that the “_mm256_dp_ps” dot product intrinsic wouldn't do a true 256-bit dot product of 8 float values. Here's the benchmarking, which shows that AVX-2 using FMA on 8 float values in parallel runs much faster than the AVX1 unrolled vector dot product using the “_mm_dp_ps” intrinsic with 4 float values.
FLOAT Vector dot product benchmarks: (N=1024, Iter=1000000)
Vecdot basic: 2961 ticks (2.96 seconds)
Vecdot AVX1 unroll (4 floats, 128-bits): 1169 ticks (1.17 seconds)
Vecdot AVX1 FMA (4 floats, 128-bits): 1314 ticks (1.31 seconds)
Vecdot AVX2 FMA (8 floats, 256-bits): 783 ticks (0.78 seconds)
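The benchmark above also lists an AVX1 FMA variant on 4 float values, whose code isn't shown in this excerpt; presumably it looks roughly like the following sketch (reconstructed here, with the function name guessed to match the naming style of the other routines):

float aussie_vecdot_FMA_unroll_AVX1(float v1[], float v2[], int n)
{
    // Sketch of a 128-bit FMA version: 4 parallel accumulators (assumes n % 4 == 0)
    __m128 sumdst = _mm_setzero_ps();             // Set 4 accumulators to zero
    for (int i = 0; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v1[i]);         // Load 4 floats into 128 bits
        __m128 r2 = _mm_loadu_ps(&v2[i]);
        sumdst = _mm_fmadd_ps(r1, r2, sumdst);    // FMA of 3 vectors
    }
    // Add the final 4 accumulators manually
    float* farr = (float*)&sumdst;
    return farr[0] + farr[1] + farr[2] + farr[3];
}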
Note that we can improve on the horizontal addition at the very end. The example code just uses basic C++ with 7 additions and 8 array index computations. This last computation should really use the AVX “hadd” horizontal-pairwise add intrinsics instead, with a little extra care, because “_mm256_hadd_ps” also operates within each 128-bit lane rather than across the whole 256-bit register.
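Here is a minimal sketch of what that final reduction might look like (the helper function name is mine):

float aussie_avx2_horizontal_sum(__m256 v)
{
    // Sketch: reduce the 8 floats in a __m256 down to a single scalar sum
    __m128 lo = _mm256_castps256_ps128(v);      // Lower 4 floats
    __m128 hi = _mm256_extractf128_ps(v, 1);    // Upper 4 floats
    __m128 sum4 = _mm_add_ps(lo, hi);           // 4 partial sums
    __m128 sum2 = _mm_hadd_ps(sum4, sum4);      // 2 partial sums
    __m128 sum1 = _mm_hadd_ps(sum2, sum2);      // Final sum in element 0
    return _mm_cvtss_f32(sum1);                 // Extract the float
}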