Chapter 6. Horizontal Reductions
Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
by David Spuler, Ph.D.
What is a Reduction?
A reduction is an operation that “reduces” a vector down to a single number. It’s called “horizontal” because it operates across the vector, requiring multiple elements from the same vector to be examined. An elementwise operation where each vector element is handled separately, without reference to any other elements of the same vector, is called a “vertical” operation. Here’s a rule of thumb:
- Vertical — two vector operands (e.g., add two vectors).
- Horizontal — one vector operand (e.g., sum of all elements of a single vector).
Some examples of common horizontal reductions on vectors include:
- Max
- Min
- Sum
- Average
These look easy enough. But, no!
Horizontal reductions are actually harder to code than vertical operations, even though the vertical algorithm has two operands. SIMD hardware instructions are much better suited to vertical elementwise operations across two different vectors. The horizontal SIMD instructions are often slower!
Example: AVX Vector Sum Reduction
Let us suppose we need to calculate the sum of all the elements of a vector. This is a “reduction” that has dimensions “vector-to-scalar.” Here is a basic naive C++ version without any optimizations:
float aussie_vector_sum(float v[], int n)   // Summation
{
    float sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum;
}
Horizontal AVX Intrinsics
AVX vector reductions have some issues in the early releases.
Although AVX has SIMD instructions to add two vectors in parallel,
it struggles to do a “reduction” operation like this.
AVX and AVX-2 do have “horizontal add” (“hadd”) intrinsics,
but these only do pairwise additions within the single vector, rather than adding all elements.
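For example, here’s a minimal sketch of what a single 128-bit “hadd” computes:
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // a = [1, 2, 3, 4] (lowest element first)
    __m128 h = _mm_hadd_ps(a, a); // h = [1+2, 3+4, 1+2, 3+4] -- pairs only, not a full sum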
AVX-512 has a “reduce add” intrinsic (“_mm512_reduce_add_ps”) that horizontally adds 16 float numbers, which works a lot better.
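As a minimal sketch of that intrinsic in use (assuming an AVX-512 capable CPU and compiler, with <immintrin.h> included):
    __m512 r = _mm512_loadu_ps(&v[0]); // Load 16 floats into 512 bits
    float sum16 = _mm512_reduce_add_ps(r); // Horizontal sum of all 16 elements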
Parallel Accumulators Trick
For AVX and AVX-2, are we stuck with doing multiple calls to the pairwise “hadd” intrinsics? No, there’s a non-obvious way to use the “vertical add” intrinsics in parallel. We can do “in parallel” squared. It’s almost like we’re doing math inside a computer.
The trick is to use the AVX registers as a set of 4 parallel accumulators (AVX’s 128 bits) or 8 parallel accumulators (AVX-2’s 256 bits). The overall algorithm becomes:
1. Initialize 4 (or 8) accumulators.
2. Scan the whole vector doing 4 parallel additions every iteration.
3. Sum the 4 accumulators (and done).
In this way, we can defer the horizontal addition (“hadd”) until the very end, and since it’s not in the critical loop, its performance hardly matters. Here’s the code for AVX-1 with 128-bit registers:
float aussie_vector_sum_AVX1(float v[], int n)  // Summation (horizontal) of a single vector
{
    if (n % 4 != 0) { // Safety
        aussie_assert(n % 4 == 0);
        return 0.0; // fail
    }
    __m128 sumdst = _mm_setzero_ps(); // Set accumulators to zero
    for (int i = 0; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
        sumdst = _mm_add_ps(r1, sumdst); // SUM = SUM + V
    }
    // Add the final 4 accumulators manually
    float* farr = sumdst.m128_f32;
    float sum = farr[0] + farr[1] + farr[2] + farr[3];
    return sum;
}
The AVX-2 version is faster, because it processes 8 float values at a time.
This uses the same strategy of 8 parallel accumulators
and a loop unrolling factor of 8 (i.e., the loop incrementer is now “i+=8”).
Here’s the C++ code:
float aussie_vector_sum_AVX2(float v[], int n)  // Summation (horizontal) of a single vector
{
    if (n % 8 != 0) { // Safety check (no extra cases)
        aussie_assert(n % 8 == 0);
        return 0.0; // fail
    }
    __m256 sumdst = _mm256_setzero_ps(); // Set 8 accumulators to zero
    for (int i = 0; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]); // Load 8 floats into 256-bits
        sumdst = _mm256_add_ps(r1, sumdst); // SUM = SUM + V
    }
    // Add the final 8 accumulators manually
    float* farr = sumdst.m256_f32;
    float sum = farr[0] + farr[1] + farr[2] + farr[3]
              + farr[4] + farr[5] + farr[6] + farr[7];
    return sum;
}
I’ve been lazy in not bothering to optimize the final horizontal addition.
A small extra speedup is probably available by using the “hadd” intrinsics a few times in a row
to drop it down from 8 accumulators to a single float,
although note that “_mm256_hadd_ps” only adds pairs within each 128-bit lane,
so an extra cross-lane step is also needed.
If this was AVX-512, we could use the horizontal reduction “_mm512_reduce_add_ps” intrinsic for summation at the end
(for adding 16 partial sums of type float).
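For the AVX-2 version, here’s a sketch of one way to finish the final addition (it replaces the manual 8-way sum above; note the extra cross-lane step before the pairwise “hadd” calls):
    // Reduce the 8 partial sums in sumdst to a single float
    __m128 lo4 = _mm256_castps256_ps128(sumdst); // Lower 4 floats (free cast)
    __m128 hi4 = _mm256_extractf128_ps(sumdst, 1); // Upper 4 floats (cross-lane)
    __m128 sum4 = _mm_add_ps(lo4, hi4); // 8 partial sums down to 4
    sum4 = _mm_hadd_ps(sum4, sum4); // 4 down to 2
    sum4 = _mm_hadd_ps(sum4, sum4); // 2 down to 1 (replicated)
    float sum = _mm_cvtss_f32(sum4); // Extract the final scalar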
Loop Peeling Optimization: Another inefficiency with these AVX addition routines is that they needlessly perform an addition with zero in the first iteration. Effectively, we need to do “loop peeling” to handle the first loop iteration differently. This is the slow first iteration of the AVX2 vector sum:
__m256 sumdst = _mm256_setzero_ps(); // Set 8 accumulators to zero
for (int i = 0; i < n; i += 8) {
    // ...
}
Loop peeling replaces the zero initialization with a load of the first 8 values from the vector.
The loop then starts its first iteration at i=8 instead of i=0, skipping what had been the first addition (note that this assumes n is at least 8):
__m256 sumdst = _mm256_loadu_ps(&v[0]); // Get first 8 values
for (int i = 8 /*not 0!*/; i < n; i += 8) {
    // ... same
}
AVX Vector Max and Min Reductions
Finding the minimum or maximum of a vector’s elements
is similar to a summation reduction.
Again, AVX1 and AVX2 don’t have proper “reduction” intrinsics for max or min,
but we can compute them in parallel by keeping a running min or max value
of 4 or 8 float values
(i.e., analogous to parallel accumulators when doing summation).
The AVX intrinsics are:
- MIN: _mm_min_ps, _mm256_min_ps
- MAX: _mm_max_ps, _mm256_max_ps
Here is the AVX1 version of MAX vector reduction:
float aussie_vector_max_AVX1(float v[], int n)
{
    // Maximum (horizontal) of a single vector
    if (n % 4 != 0) {
        aussie_assert(n % 4 == 0);
        return 0.0; // fail
    }
    __m128 sumdst = _mm_loadu_ps(&v[0]); // Initial values
    for (int i = 4 /*not 0*/; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
        sumdst = _mm_max_ps(r1, sumdst); // dst = MAX(dst, r1)
    }
    // Find Max of the final 4 accumulators
    float* farr = sumdst.m128_f32;
    float fmax = farr[0];
    if (farr[1] > fmax) fmax = farr[1];
    if (farr[2] > fmax) fmax = farr[2];
    if (farr[3] > fmax) fmax = farr[3];
    return fmax;
}
And here is the analogous AVX2 version of MAX vector reduction:
float aussie_vector_max_AVX2(float v[], int n)
{
    // Maximum (horizontal) of a single vector
    if (n % 8 != 0) { // Safety check (no extra cases)
        aussie_assert(n % 8 == 0);
        return 0.0; // fail
    }
    __m256 sumdst = _mm256_loadu_ps(&v[0]); // Initial 8 values
    for (int i = 8 /*not 0*/; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]); // Load floats into 256-bits
        sumdst = _mm256_max_ps(r1, sumdst); // dst = MAX(dst, r1)
    }
    // Find Max of the final 8 accumulators
    float* farr = sumdst.m256_f32;
    float fmax = farr[0];
    if (farr[1] > fmax) fmax = farr[1];
    if (farr[2] > fmax) fmax = farr[2];
    if (farr[3] > fmax) fmax = farr[3];
    if (farr[4] > fmax) fmax = farr[4];
    if (farr[5] > fmax) fmax = farr[5];
    if (farr[6] > fmax) fmax = farr[6];
    if (farr[7] > fmax) fmax = farr[7];
    return fmax;
}
The MIN versions are very similar. They use the “min” AVX intrinsics,
and the final steps use “<” not “>” operations.
Here’s the AVX1 version of a MIN vector reduction:
float aussie_vector_min_AVX1(float v[], int n)
{
    // Minimum (horizontal) of a single vector
    if (n % 4 != 0) {
        aussie_assert(n % 4 == 0);
        return 0.0; // fail
    }
    __m128 sumdst = _mm_loadu_ps(&v[0]); // Initial values
    for (int i = 4 /*not 0*/; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
        sumdst = _mm_min_ps(r1, sumdst); // dst = MIN(dst, r1)
    }
    // Find Min of the final 4 accumulators
    float* farr = sumdst.m128_f32;
    float fmin = farr[0];
    if (farr[1] < fmin) fmin = farr[1];
    if (farr[2] < fmin) fmin = farr[2];
    if (farr[3] < fmin) fmin = farr[3];
    return fmin;
}
This is the AVX2 version of a MIN vector reduction:
float aussie_vector_min_AVX2(float v[], int n)
{
    // Minimum (horizontal) of a single vector
    if (n % 8 != 0) { // Safety check (no extra cases)
        aussie_assert(n % 8 == 0);
        return 0.0; // fail
    }
    __m256 sumdst = _mm256_loadu_ps(&v[0]); // Initial 8 values
    for (int i = 8 /*not 0*/; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]); // Load floats into 256-bits
        sumdst = _mm256_min_ps(r1, sumdst); // dst = MIN(dst, r1)
    }
    // Find Min of the final 8 accumulators
    float* farr = sumdst.m256_f32;
    float fmin = farr[0];
    if (farr[1] < fmin) fmin = farr[1];
    if (farr[2] < fmin) fmin = farr[2];
    if (farr[3] < fmin) fmin = farr[3];
    if (farr[4] < fmin) fmin = farr[4];
    if (farr[5] < fmin) fmin = farr[5];
    if (farr[6] < fmin) fmin = farr[6];
    if (farr[7] < fmin) fmin = farr[7];
    return fmin;
}
These versions are not especially optimized.
AVX-512 would allow us to vectorize further to 16 float values,
and it also offers “_mm512_reduce_max_ps” and “_mm512_reduce_min_ps” reduction intrinsics for the final step.
Also, the final computation of the maximum or minimum of 8 float numbers is far from optimal.
Unlike addition, AVX1 and AVX2 have no horizontal min/max intrinsics analogous to “hadd,”
but we can at least shorten the dependency chain
by doing the comparisons pairwise.
Here’s the alternative for AVX1 minimum computation:
// Find Min of the final 4 accumulators
#define FMIN(x,y) ( (x) < (y) ? (x) : (y) )
float* farr = sumdst.m128_f32;
float fmin1 = FMIN(farr[0], farr[1]);
float fmin2 = FMIN(farr[2], farr[3]);
float fmin = FMIN(fmin1, fmin2);
return fmin;
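The AVX2 analog reduces the final 8 values with the same pairwise tree. The total number of comparisons is unchanged, but the shorter dependency chains give the CPU more instruction-level parallelism. Here’s a sketch reusing the FMIN macro above:
    // Find Min of the final 8 accumulators pairwise
    float* farr = sumdst.m256_f32;
    float m01 = FMIN(farr[0], farr[1]);
    float m23 = FMIN(farr[2], farr[3]);
    float m45 = FMIN(farr[4], farr[5]);
    float m67 = FMIN(farr[6], farr[7]);
    float fmin = FMIN(FMIN(m01, m23), FMIN(m45, m67));
    return fmin;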
These functions can also have their main loops further improved.
Other basic optimizations include using pointer arithmetic to remove
the loop index variable “i”
and unrolling the loop body multiple times.
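For instance, here’s a sketch of the pointer arithmetic idea applied to the AVX2 max loop (same assumptions as before, including that n is a nonzero multiple of 8):
    float* endptr = &v[n]; // One beyond the last element
    __m256 sumdst = _mm256_loadu_ps(&v[0]); // Initial 8 values
    for (float* p = &v[8]; p != endptr; p += 8) {
        __m256 r1 = _mm256_loadu_ps(p); // No index variable needed
        sumdst = _mm256_max_ps(r1, sumdst); // dst = MAX(dst, r1)
    }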
Vectorized Sum-of-Squares Reduction
The sum of the squares of a vector’s elements has various applications in our AI Engine. Firstly, it can be used to compute the magnitude of a vector. Secondly, the sum-of-squares is used in various normalization functions, as part of computing the variance from the sum-of-squares of the differences between values and the mean. The RMS factor in RMSNorm is also the square root of the mean of the squares.
The method to add up the sum-of-squares for a vector reduction to a single float
is very similar to a simple summation reduction.
The idea for AVX1 and AVX2 is to keep 4 or 8 running sum accumulators,
and then add them up at the final step.
Here is the AVX1 version of sum-of-squares of a vector:
float aussie_vector_sum_squares_AVX1(float v[], int n)
{
    // Summation of squares of all elements
    if (n % 4 != 0) { // Safety check (no extra cases)
        aussie_assert(n % 4 == 0);
        return 0.0; // fail
    }
    __m128 sumdst = _mm_setzero_ps(); // Zero accumulators
    for (int i = 0; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats
        __m128 sqr = _mm_mul_ps(r1, r1); // Square (V*V)
        sumdst = _mm_add_ps(sqr, sumdst); // SUM = SUM + V*V
    }
    // Add the final 4 accumulators manually
    float* farr = sumdst.m128_f32;
    float sum = farr[0] + farr[1] + farr[2] + farr[3];
    return sum;
}
And here is the AVX2 version of sum-of-squares:
float aussie_vector_sum_squares_AVX2(float v[], int n)
{
    // Summation of squares of all elements
    if (n % 8 != 0) { // Safety check (no extra cases)
        aussie_assert(n % 8 == 0);
        return 0.0; // fail
    }
    __m256 sumdst = _mm256_setzero_ps(); // Zero accumulators
    for (int i = 0; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]); // Load floats
        __m256 sqr = _mm256_mul_ps(r1, r1); // Square (V*V)
        sumdst = _mm256_add_ps(sqr, sumdst); // SUM = SUM + V*V
    }
    // Add the final 8 accumulators manually
    float* farr = sumdst.m256_f32;
    float sum = farr[0] + farr[1] + farr[2] + farr[3]
              + farr[4] + farr[5] + farr[6] + farr[7];
    return sum;
}
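As a usage sketch, the magnitude and RMS factor mentioned earlier fall straight out of this routine (assuming <math.h> or <cmath> for sqrtf):
    float ss = aussie_vector_sum_squares_AVX2(v, n);
    float magnitude = sqrtf(ss); // Vector magnitude (Euclidean norm)
    float rms = sqrtf(ss / (float)n); // RMS factor used in RMSNorm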
Various optimizations can be further applied to these versions.
Like the summation reduction, these loops needlessly add zero in the first iteration,
and loop peeling should be used to split out the first iteration separately.
The final horizontal addition of 4 or 8 float values could also be optimized,
and AVX-512 would allow greater parallelism with 16 float numbers.
Finally, the basic loop optimizations of pointer arithmetic and loop unrolling could be applied.
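For instance, here’s a sketch of the loop peeling fix for the AVX2 version, where the accumulators start out as the squares of the first 8 elements:
    __m256 r0 = _mm256_loadu_ps(&v[0]); // Peel off the first 8 elements
    __m256 sumdst = _mm256_mul_ps(r0, r0); // Accumulators start at V*V
    for (int i = 8 /*not 0!*/; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]); // Load floats
        __m256 sqr = _mm256_mul_ps(r1, r1); // Square (V*V)
        sumdst = _mm256_add_ps(sqr, sumdst); // SUM = SUM + V*V
    }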
Hybrid Vertical-Horizontal Operations
There are also many more complicated reductions, and some one-dimensional vector operations are “hybrid” and don’t fit neatly into the categories of horizontal reductions or vertical elementwise operations. Some examples include:
- Vector dot product
- AI normalization (BatchNorm)
- AI statistical normalization (Softmax)
For example, a vector dot product is like a vertical elementwise multiplication followed by a horizontal summation reduction. Normalization functions in AI, such as BatchNorm or Softmax, are another example, where the whole vector is processed to “normalize” every value, but doing so requires computations both horizontally to compute the scaling factor and vertically to scale every element.
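Here’s a sketch of a hypothetical AVX2 dot product built from the same parallel-accumulator trick (with the usual assumption that n is a multiple of 8):
float aussie_vector_dot_product_AVX2(const float v1[], const float v2[], int n)
{
    // Hypothetical sketch: vertical multiply, then horizontal sum
    __m256 sumdst = _mm256_setzero_ps(); // Zero 8 accumulators
    for (int i = 0; i < n; i += 8) {
        __m256 a = _mm256_loadu_ps(&v1[i]); // Load 8 floats from each vector
        __m256 b = _mm256_loadu_ps(&v2[i]);
        __m256 prod = _mm256_mul_ps(a, b); // Vertical: elementwise multiply
        sumdst = _mm256_add_ps(prod, sumdst); // SUM = SUM + V1*V2
    }
    // Horizontal: add the final 8 accumulators
    float* farr = sumdst.m256_f32;
    return farr[0] + farr[1] + farr[2] + farr[3]
         + farr[4] + farr[5] + farr[6] + farr[7];
}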
Boolean Vector Operations
Weirdly, a lot of Boolean testing operations on vectors are also hybrid vertical-horizontal operations. Some examples include:
- Vector equality (two vectors)
- Vector is zero (one vector)
- Vector contains a value (one vector)
- Vector has a NaN or Inf (floating-point)
All of these seem like they’re elementwise tests, where each vector element can be examined separately. The two-vector comparison does a pairwise comparison of each element separately, and the single-vector operations can examine one element at a time. That would seem to make them all meet the criteria for vertical elementwise operations, with very fast implementations via AVX SIMD instructions.
Alas, no, it’s not that simple.
A more accurate way to think about them is that the first phase creates a Boolean vector (as a temporary), where each element of this interim vector has the test result for each element. Hence, there’s actually a second phase where the many Boolean results from analyzing each vector element need to be “reduced” to a single Boolean value as the final answer. This last phase is like a horizontal reduction over a Boolean vector using a logical “or” or “and” operator.
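As a sketch of this two-phase structure, here’s a hypothetical AVX2 vector equality test. The compare intrinsic builds the Boolean vector, and “movemask” plus an integer test performs the horizontal reduction (again assuming n is a multiple of 8):
bool aussie_vector_equal_AVX2(const float v1[], const float v2[], int n)
{
    // Hypothetical sketch: elementwise compare, then Boolean reduction
    for (int i = 0; i < n; i += 8) {
        __m256 a = _mm256_loadu_ps(&v1[i]);
        __m256 b = _mm256_loadu_ps(&v2[i]);
        __m256 eqmask = _mm256_cmp_ps(a, b, _CMP_EQ_OQ); // Boolean vector
        if (_mm256_movemask_ps(eqmask) != 0xFF) // Horizontal AND of 8 results
            return false; // Some element pair differed
    }
    return true;
}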