
  • Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
  • by David Spuler, Ph.D.

Chapter 6. Horizontal Reductions

What is a Reduction?

A reduction is an operation that “reduces” a vector down to a single number. It’s called “horizontal” because it operates across the vector, requiring multiple elements from the same vector to be examined. An elementwise operation, where each vector element is handled separately without involving any other elements of the same vector, is called a “vertical” operation. Here’s a rule of thumb:

  • Vertical — two vector operands (e.g., add two vectors).
  • Horizontal — one vector operand (e.g., sum of all elements of a single vector).
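As a quick plain C++ illustration of the distinction (a sketch, assuming float arrays a, b, c, and v of length n):

    // Vertical: two vector operands, each element handled independently
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];

    // Horizontal: one vector operand, combined across all its elements
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += v[i];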

Some examples of common horizontal reductions on vectors include:

  • Max
  • Min
  • Sum
  • Average

These look easy enough. But, no!

Horizontal reductions are actually harder to code than vertical operations, even though the vertical algorithm has two operands. SIMD hardware instructions are much better suited to vertical elementwise operations across two different vectors. The horizontal SIMD instructions are often slower!

Example: AVX Vector Sum Reduction

Let us suppose we need to calculate the sum of all the elements of a vector. This is a “reduction” that has dimensions “vector-to-scalar.” Here is a basic naive C++ version without any optimizations:

    float aussie_vector_sum(float v[], int n)  // Summation
    {
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += v[i];
        }
        return sum;
    }

Horizontal AVX Intrinsics

Vector reductions have some issues in the early versions of AVX. Although AVX has SIMD instructions to add two vectors in parallel, it struggles with a “reduction” operation like this. AVX and AVX-2 do have “horizontal add” (“hadd”) intrinsics, but these only do pairwise additions within a single vector, rather than adding all the elements. AVX-512 has a “reduce add” intrinsic (“_mm512_reduce_add_ps”) that horizontally adds 16 float numbers, which works a lot better.
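To see why “hadd” falls short, here’s a hedged illustration (the helper name is mine, not the book’s) of what the pairwise addition actually computes: even summing the 4 floats in one 128-bit register takes two calls.

    float sum4_hadd(__m128 a)   // a = [a0, a1, a2, a3]
    {
        // _mm_hadd_ps(x, y) = [x0+x1, x2+x3, y0+y1, y2+y3] -- pairwise only
        __m128 h1 = _mm_hadd_ps(a, a);   // [a0+a1, a2+a3, a0+a1, a2+a3]
        __m128 h2 = _mm_hadd_ps(h1, h1); // every lane = a0+a1+a2+a3
        return _mm_cvtss_f32(h2);        // extract lane 0
    }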

Parallel Accumulators Trick

For AVX and AVX-2, are we stuck with doing multiple calls to the pairwise “hadd” intrinsics? No, there’s a non-obvious way to use the “vertical add” intrinsics in parallel. We can do “in parallel” squared. It’s almost like we’re doing math inside a computer.

The trick is to use the AVX registers as a set of 4 parallel accumulators (AVX’s 128 bits) or 8 parallel accumulators (AVX-2’s 256 bits). The overall algorithm becomes:

    1. Initialize the 4 (or 8) accumulators.

    2. Scan the whole vector doing 4 (or 8) parallel additions every iteration.

    3. Sum the 4 (or 8) accumulators (and done).

In this way, we can defer the horizontal addition (“hadd”) until the very end, and since it’s not in the critical loop, its performance hardly matters. Here’s the code for AVX-1 with 128-bit registers:

    float aussie_vector_sum_AVX1(float v[], int n)   // Summation (horizontal) of a single vector
    {
        if (n % 4 != 0) {  // Safety
            aussie_assert(n % 4 == 0);
            return 0.0; // fail
        }

        __m128 sumdst = _mm_setzero_ps();   // Set accumulators to zero
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
            sumdst = _mm_add_ps(r1, sumdst); // SUM = SUM + V
        }

        // Add the final 4 accumulators manually
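        // Note: the ".m128_f32" union member is Microsoft-specific (MSVC);
        // with GCC/Clang, store the register to a float[4] via _mm_storeu_ps instead.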
        float* farr = sumdst.m128_f32;
        float sum = farr[0] + farr[1] + farr[2] + farr[3];
        return sum;
    }

The AVX-2 version is faster, because it processes 8 float values at a time. It uses the same strategy with 8 parallel accumulators and a loop unrolling factor of 8 (i.e., the loop increment is now “i+=8”). Here’s the C++ code:

    float aussie_vector_sum_AVX2(float v[], int n)   // Summation (horizontal) of a single vector
    {
        if (n % 8 != 0) { // Safety check (no extra cases)
            aussie_assert(n % 8 == 0);
            return 0.0; // fail
        }

        __m256 sumdst = _mm256_setzero_ps();   // Set 8 accumulators to zero
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats into 256-bits
            sumdst = _mm256_add_ps(r1, sumdst); // SUM = SUM + V
        }

        // Add the final 8 accumulators manually
        float* farr = sumdst.m256_f32;
        float sum = farr[0] + farr[1] + farr[2] + farr[3]
                  + farr[4] + farr[5] + farr[6] + farr[7]; 
        return sum;
    }

I’ve been lazy in not bothering to optimize the final horizontal addition. A small extra speedup is probably available by using the pairwise “hadd” intrinsics a few times in a row to drop it down from 8 accumulators to a single float (on AVX-2 this also needs a lane extraction, because “_mm256_hadd_ps” only adds pairs within each 128-bit lane). If this were AVX-512, we could use the horizontal reduction “_mm512_reduce_add_ps” intrinsic for the summation at the end (for adding 16 partial sums of type float).
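For instance, here’s one hedged sketch of that final step for the AVX-2 version (the helper name is mine, not the book’s):

    float sum8_final(__m256 sumdst)   // Reduce 8 partial sums to 1 float
    {
        __m128 lo = _mm256_castps256_ps128(sumdst);   // Lower 4 floats
        __m128 hi = _mm256_extractf128_ps(sumdst, 1); // Upper 4 floats
        __m128 sum4 = _mm_add_ps(lo, hi);      // 4 partial sums
        __m128 sum2 = _mm_hadd_ps(sum4, sum4); // 2 partial sums (pairwise)
        __m128 sum1 = _mm_hadd_ps(sum2, sum2); // Total in lane 0
        return _mm_cvtss_f32(sum1);            // Extract lane 0
    }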

Loop Peeling Optimization: Another inefficiency with these AVX addition routines is that they needlessly perform an addition with zero in the first iteration. Effectively, we need to do “loop peeling” to handle the first loop iteration differently. This is the slow first iteration of the AVX2 vector sum:

    __m256 sumdst = _mm256_setzero_ps();   // Set 8 accumulators to zero
    for (int i = 0; i < n; i += 8) {
        // ... 
    }

Loop peeling says to replace the zero initialization with a load of the first 8 values from the vector. The loop then starts its first iteration at i=8 instead of i=0, skipping what had been the first addition:

    __m256 sumdst = _mm256_loadu_ps(&v[0]);  // Get first 8 values
    for (int i = 8 /*not 0!*/; i < n; i += 8) {
        // ... same
    }

AVX Vector Max and Min Reductions

The need to find a minimum or maximum of a vector’s elements is similar to a summation reduction. Again, AVX1 and AVX2 don’t have proper “reduction” intrinsics for max or min, but we can compute them in parallel by keeping a running min or max value of 4 or 8 float values (i.e., analogous to parallel accumulators when doing summation). The AVX intrinsics are:

  • MIN: _mm_min_ps, _mm256_min_ps
  • MAX: _mm_max_ps, _mm256_max_ps

Here is the AVX1 version of MAX vector reduction:

    float aussie_vector_max_AVX1(float v[], int n)   
    {
        // Maximum (horizontal) of a single vector
        if (n % 4 != 0) {
            aussie_assert(n % 4 == 0);
            return 0.0; // fail
        }
        __m128 sumdst = _mm_loadu_ps(&v[0]);   // Initial values
        for (int i = 4 /*not 0*/; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
            sumdst = _mm_max_ps(r1, sumdst); // dst = MAX(dst, r1)
        }
        // Find Max of the final 4 accumulators
        float* farr = sumdst.m128_f32;
        float fmax = farr[0];
        if (farr[1] > fmax) fmax = farr[1];
        if (farr[2] > fmax) fmax = farr[2];
        if (farr[3] > fmax) fmax = farr[3];
        return fmax;
    }

And here is the analogous AVX2 version of MAX vector reduction:

    float aussie_vector_max_AVX2(float v[], int n)
    {
        // Maximum (horizontal) of a single vector
        if (n % 8 != 0) { // Safety check (no extra cases)
            aussie_assert(n % 8 == 0);
            return 0.0; // fail
        }
        __m256 sumdst = _mm256_loadu_ps(&v[0]);   // Initial 8 values
        for (int i = 8/*not 0*/; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]); // Load floats into 256-bits
            sumdst = _mm256_max_ps(r1, sumdst); // dst = MAX(dst, r1)
        }

        // Find Max of the final 8 accumulators
        float* farr = sumdst.m256_f32;
        float fmax = farr[0];
        if (farr[1] > fmax) fmax = farr[1];
        if (farr[2] > fmax) fmax = farr[2];
        if (farr[3] > fmax) fmax = farr[3];
        if (farr[4] > fmax) fmax = farr[4];
        if (farr[5] > fmax) fmax = farr[5];
        if (farr[6] > fmax) fmax = farr[6];
        if (farr[7] > fmax) fmax = farr[7];
        return fmax;
    }

The MIN versions are very similar. They use the “min” AVX intrinsics, and the final steps use “<” not “>” operations. Here’s the AVX1 version of a MIN vector reduction:

    float aussie_vector_min_AVX1(float v[], int n)
    {
        // Minimum (horizontal) of a single vector
        if (n % 4 != 0) {
            aussie_assert(n % 4 == 0);
            return 0.0; // fail
        }
        __m128 sumdst = _mm_loadu_ps(&v[0]);   // Initial values
        for (int i = 4 /*not 0*/; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
            sumdst = _mm_min_ps(r1, sumdst); // dst = MIN(dst, r1)
        }
        // Find Min of the final 4 accumulators
        float* farr = sumdst.m128_f32;
        float fmin = farr[0];
        if (farr[1] < fmin) fmin = farr[1];
        if (farr[2] < fmin) fmin = farr[2];
        if (farr[3] < fmin) fmin = farr[3];
        return fmin;
    }

This is the AVX2 version of a MIN vector reduction:

    float aussie_vector_min_AVX2(float v[], int n)   // Minimum (horizontal) of a single vector
    {
        if (n % 8 != 0) { // Safety check (no extra cases)
            aussie_assert(n % 8 == 0);
            return 0.0; // fail
        }
        __m256 sumdst = _mm256_loadu_ps(&v[0]);   // Initial 8 values
        for (int i = 8/*not 0*/; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]); // Load floats into 256-bits
            sumdst = _mm256_min_ps(r1, sumdst); // dst = MIN(dst, r1)
        }

        // Find Min of the final 8 accumulators
        float* farr = sumdst.m256_f32;
        float fmin = farr[0];
        if (farr[1] < fmin) fmin = farr[1];
        if (farr[2] < fmin) fmin = farr[2];
        if (farr[3] < fmin) fmin = farr[3];
        if (farr[4] < fmin) fmin = farr[4];
        if (farr[5] < fmin) fmin = farr[5];
        if (farr[6] < fmin) fmin = farr[6];
        if (farr[7] < fmin) fmin = farr[7];
        return fmin;
    }

These versions are not especially optimized. AVX-512 would allow us to further vectorize to 16 float values. Also, the final computation of the maximum or minimum of 8 float numbers is far from optimal. Unlike addition, AVX has no pairwise horizontal min/max intrinsics for float values, but a short sequence of shuffles and vertical min/max operations can halve the number of candidates at each step (a sketch follows the code below). Or we can at least avoid some comparisons by doing it pairwise sequentially. Here’s the alternative for AVX1 minimum computation:

        // Find Min of the final 4 accumulators
    #define FMIN(x,y)  ( (x) < (y) ? (x) : (y) )
        float* farr = sumdst.m128_f32;
        float fmin1 = FMIN(farr[0], farr[1]);
        float fmin2 = FMIN(farr[2], farr[3]);
        float fmin = FMIN(fmin1, fmin2);
        return fmin;
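And here is the shuffle-based alternative, as a hedged sketch: two vertical “min” operations, with “_mm_movehl_ps” and SSE3’s “_mm_movehdup_ps” rearranging lanes so each step halves the candidates.

        // Find Min of the final 4 accumulators via shuffles (sketch)
        __m128 m2 = _mm_min_ps(sumdst, _mm_movehl_ps(sumdst, sumdst)); // [min(s0,s2), min(s1,s3), ...]
        __m128 m1 = _mm_min_ps(m2, _mm_movehdup_ps(m2)); // Lane 0 = min of all 4
        float fmin = _mm_cvtss_f32(m1);  // Extract lane 0
        return fmin;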

These functions can also have their main loops further improved. Other basic optimizations would include using loop pointer arithmetic to remove the index variable “i” and also unrolling the loop body multiple times.
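For example, a pointer-arithmetic version of the AVX2 MAX main loop might look like this hedged sketch (still assuming n % 8 == 0):

    __m256 maxdst = _mm256_loadu_ps(&v[0]);   // First 8 values (loop peeling)
    for (const float* p = v + 8, *endp = v + n; p != endp; p += 8) {
        maxdst = _mm256_max_ps(_mm256_loadu_ps(p), maxdst); // No index variable "i"
    }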

Vectorized Sum-of-Squares Reduction

The sum of the squares of a vector’s elements has various applications in our AI Engine. Firstly, it can be used to compute the magnitude of a vector. Secondly, the sum-of-squares is used in various normalization functions, as part of computing the variance from the sum-of-squares of the differences between the values and the mean. The RMS factor in RMSNorm is also based on the sum-of-squares (it is the square root of its mean).
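For reference, these are the standard definitions for a vector v of n elements:

    \|v\| = \sqrt{\sum_{i=1}^{n} v_i^2}
    \qquad
    \mathrm{RMS}(v) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} v_i^2}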

The method of adding up the sum-of-squares to reduce a vector to a single float is very similar to a simple summation reduction. The idea for AVX1 and AVX2 is to keep 4 or 8 running sum accumulators, and then add them up at the final step.

Here is the AVX1 version of sum-of-squares of a vector:

    float aussie_vector_sum_squares_AVX1(float v[], int n)  
    {
        // Summation of squares of all elements
        if (n % 4 != 0) { // Safety check (no extra cases)
            aussie_assert(n % 4 == 0);
            return 0.0; // fail
        }
        __m128 sumdst = _mm_setzero_ps();   // Zero accumulators
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats
            __m128 sqr = _mm_mul_ps(r1, r1);   // Square (V*V)
            sumdst = _mm_add_ps(sqr, sumdst); // SUM = SUM + V*V
        }
        // Add the final 4 accumulators manually
        float* farr = sumdst.m128_f32;
        float sum = farr[0] + farr[1] + farr[2] + farr[3];
        return sum;
    }

And here is the AVX2 version of sum-of-squares:

    float aussie_vector_sum_squares_AVX2(float v[], int n)  
    {
        // Summation of squares of all elements
        if (n % 8 != 0) { // Safety check (no extra cases)
            aussie_assert(n % 8 == 0);
            return 0.0; // fail
        }

        __m256 sumdst = _mm256_setzero_ps();   // Zero accumulators
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load floats
            __m256 sqr = _mm256_mul_ps(r1, r1);   // Square (V*V)
            sumdst = _mm256_add_ps(sqr, sumdst); // SUM = SUM + V*V
        }

        // Add the final 8 accumulators manually
        float* farr = sumdst.m256_f32;
        float sum = farr[0] + farr[1] + farr[2] + farr[3]
                  + farr[4] + farr[5] + farr[6] + farr[7];
        return sum;
    }

Various optimizations can be further applied to these versions. Like the summation reduction, these loops needlessly add zero in the first iteration, so loop peeling should be used to split out the first iteration separately. The final horizontal addition of 4 or 8 float values should be optimized. AVX-512 would allow greater parallelism with 16 float numbers. Finally, the basic loop optimizations of pointer arithmetic and loop unrolling could be applied.
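As a hedged sketch, loop peeling for the AVX2 sum-of-squares main loop would replace the zero initialization with the squares of the first 8 values:

    __m256 r0 = _mm256_loadu_ps(&v[0]);     // First 8 values
    __m256 sumdst = _mm256_mul_ps(r0, r0);  // Their squares replace the zeros
    for (int i = 8 /*not 0!*/; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats
        sumdst = _mm256_add_ps(_mm256_mul_ps(r1, r1), sumdst); // SUM += V*V
    }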

Hybrid Vertical-Horizontal Operations

There are also many more complicated reductions, and some one-dimensional vector operations are “hybrid” and don’t fit neatly into the categories of horizontal reductions or vertical elementwise operations. Some examples include:

  • Vector dot product
  • AI normalization (BatchNorm)
  • AI statistical normalization (Softmax)

For example, a vector dot product is like a vertical elementwise multiplication followed by a horizontal summation reduction. Normalization functions in AI, such as BatchNorm or Softmax, are another example, where the whole vector is processed to “normalize” every value, but doing so requires computations both horizontally (to compute the scaling factor) and vertically (to scale every element).
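For instance, here’s a hedged sketch of an AVX2 dot product in the style of the code above (the function name follows the book’s convention but is mine; it assumes n % 8 == 0 and uses the same MSVC-specific register access as the earlier examples):

    float aussie_vector_dot_AVX2(float v1[], float v2[], int n)
    {
        __m256 sumdst = _mm256_setzero_ps();   // Zero 8 accumulators
        for (int i = 0; i < n; i += 8) {
            __m256 a = _mm256_loadu_ps(&v1[i]); // Load 8 floats from each vector
            __m256 b = _mm256_loadu_ps(&v2[i]);
            sumdst = _mm256_add_ps(_mm256_mul_ps(a, b), sumdst); // Vertical multiply, then add
        }
        // Horizontal reduction of the 8 partial sums
        float* farr = sumdst.m256_f32;
        return farr[0] + farr[1] + farr[2] + farr[3]
             + farr[4] + farr[5] + farr[6] + farr[7];
    }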

Boolean Vector Operations

Weirdly, a lot of Boolean testing operations on vectors are also hybrid vertical-horizontal operations. Some examples include:

  • Vector equality (two vectors)
  • Vector is zero (one vector)
  • Vector contains a value (one vector)
  • Vector has a NaN or Inf (floating-point)

All of these seem like they’re elementwise tests, where each vector element can be examined separately. The two-vector comparison does pairwise comparison of each element separately, and the single-vector operations can examine one element at a time. That would seem to make them all meet the criteria for vertical elementwise operations, with very fast implementations via AVX SIMD instructions.

Alas, no, it’s not that simple.

A more accurate way to think about them is that the first phase creates a Boolean vector (as a temporary), where each element of this interim vector has the test result for each element. Hence, there’s actually a second phase where the many Boolean results from analyzing each vector element need to be “reduced” to a single Boolean value as the final answer. This last phase is like a horizontal reduction over a Boolean vector using a logical “or” or “and” operator.
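Here’s a hedged sketch of this two-phase idea for vector equality with AVX intrinsics (the function name is mine; it assumes n % 8 == 0). The comparison builds the Boolean vector as a per-element bitmask, and “movemask” then reduces those 8 results, with the early return acting as the logical “and”:

    bool aussie_vector_equal_AVX2(float v1[], float v2[], int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256 a = _mm256_loadu_ps(&v1[i]);
            __m256 b = _mm256_loadu_ps(&v2[i]);
            __m256 eqmask = _mm256_cmp_ps(a, b, _CMP_EQ_OQ); // Phase 1: Boolean vector
            if (_mm256_movemask_ps(eqmask) != 0xFF)  // Phase 2: reduce 8 results to 1
                return false;  // Some element differed
        }
        return true;
    }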

 
