Chapter 5. One-Dimensional Vectorization

  • Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
  • by David Spuler, Ph.D.

What is Vectorization?

Vectorization is the name given to transforming a software loop that runs sequentially over a one-dimensional array of data into code that performs the same computation in parallel, by sending the data to a GPU or to the CPU's SIMD extensions. This is a powerful way to optimize the processing of contiguous data structures such as arrays and vectors.

Vectorization uses techniques from loop optimization to transform loops into faster parallelizable versions, such as “unrolling” a loop into its element-wise actions, and strip mining (also called “loop sectioning”), which breaks the array into segments of the right size to fit in parallel into your GPU or CPU SIMD extensions. In theory, a good optimizing compiler can do these vectorization optimizations automatically for simple loops, but often you have to do it yourself.
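
For example, here is a minimal sketch (plain C++, no intrinsics, illustrative function name) of strip mining a simple scaling loop into chunks of 4 elements, with a scalar loop to finish off any leftover elements:

    void scale_strip_mined(float v[], int n, float c)
    {
        int i = 0;
        for (; i + 4 <= n; i += 4) {  // Process a "strip" of 4 elements per iteration
            v[i] *= c;
            v[i + 1] *= c;
            v[i + 2] *= c;
            v[i + 3] *= c;
        }
        for (; i < n; i++) {  // Scalar remainder loop for the leftover 0-3 elements
            v[i] *= c;
        }
    }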

A powerful way to vectorize contiguous data processing is to use the AVX SIMD instructions for CPU-based parallelism. The AVX intrinsics are C++ built-in functions that wrap around SIMD instruction codes in the x86 instruction set. The basic AVX intrinsics are 128 bits wide (four 32-bit float values), AVX-2 is 256 bits (8 float values), and AVX-512 is 512 bits (surprise!), which is 16 float values. The upcoming AVX-10 (announced in July 2023) is also 512 bits, but with extra capabilities.
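
As a practical note, these intrinsic functions and types are declared in the <immintrin.h> header, and the compiler must be allowed to emit AVX instructions (e.g., /arch:AVX2 on MSVC, or -mavx2 on GCC and Clang). A minimal sanity-check program looks like this:

    #include <immintrin.h>   // AVX / AVX-2 intrinsic types and functions
    #include <cstdio>

    int main()
    {
        float data[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        __m256 r = _mm256_loadu_ps(data);            // Load 8 floats (unaligned load)
        r = _mm256_mul_ps(r, _mm256_set1_ps(2.0f));  // Multiply all 8 lanes by 2.0
        _mm256_storeu_ps(data, r);                   // Store the 8 results back
        printf("%f\n", data[7]);                     // Prints 16.000000
        return 0;
    }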

Obviously, since the largest number of floating-point values that can be parallelized is 16, the AVX intrinsics cannot fully vectorize a larger vector of many float values, such as an AI model with dimension 1024. Instead, we can use AVX intrinsics on segments of vectors, and thereby vectorize chunks of the right size to get a speedup.
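
As a sketch of this chunking idea (the function name is illustrative, and AVX-2 availability is assumed), a long vector can be processed 8 floats at a time, with a plain scalar loop mopping up any leftover elements when the length is not a multiple of 8:

    void vector_scale_chunked(float v[], int n, float c)   // Illustrative chunking sketch
    {
        const __m256 rscalar = _mm256_set1_ps(c);   // 8 copies of the scalar
        int i = 0;
        for (; i + 8 <= n; i += 8) {                // Vectorized chunks of 8 floats
            __m256 r = _mm256_loadu_ps(&v[i]);
            _mm256_storeu_ps(&v[i], _mm256_mul_ps(r, rscalar));
        }
        for (; i < n; i++) {                        // Scalar remainder (0-7 elements)
            v[i] *= c;
        }
    }

The examples later in this chapter omit the scalar remainder loop for brevity, and instead assume the vector length is a multiple of the SIMD width.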

Vectorized Multiply Vector by Scalar

Multiplying a vector by a scalar is a common requirement when scaling vectors. Division by a scalar can also be handled by multiplying by the reciprocal (e.g., as needed for Softmax). Multiplication by a scalar is amenable to vectorization because the naive C++ version is very simple:

    void aussie_vector_multiply_scalar(float v[], int n, float c)  
    {
        // Multiply all vector elements by constant
        for (int i = 0; i < n; i++) {
            v[i] *= c;
        }
    }
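
As an aside on the reciprocal trick mentioned above, a divide-by-scalar routine can simply reuse the multiply version (this helper is an illustrative sketch, not part of the code above):

    void aussie_vector_divide_scalar(float v[], int n, float c)  // Illustrative: divide via reciprocal
    {
        // Divide all vector elements by a constant, by multiplying by its reciprocal
        aussie_assert(c != 0.0f);   // Division by zero is not meaningful here
        aussie_vector_multiply_scalar(v, n, 1.0f / c);
    }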

Loop Pointer Arithmetic. First, we can try the basic C++ optimization of pointer arithmetic:

    void aussie_vector_multiply_scalar_pointer_arith(float v[], int n, float c)  
    {
        // Multiply all vector elements by constant
        for (; n > 0; n--, v++) {
            *v *= c;
        }
    }

AVX1 multiply-by-scalar: There is no special scalar multiplication opcode in AVX or AVX-2, but we can populate a constant register (128-bit or 256-bit) with multiple copies of the scalar (i.e., _mm_set1_ps or _mm256_set1_ps), and we need do this only once. We can then use the SIMD multiply intrinsics in the unrolled loop section. The AVX 128-bit vector multiplication by scalar becomes:

    void aussie_vector_multiply_scalar_AVX1(float v[], int n, float c)
    {
        const __m128 rscalar = _mm_set1_ps(c);  // Vector of 4 copies of the scalar
        for (int i = 0; i < n; i += 4) {  // Note: assumes n is a multiple of 4
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load 4 floats (unaligned load)
            __m128 dst = _mm_mul_ps(r1, rscalar); // Multiply by scalars
            _mm_storeu_ps(&v[i], dst);  // Store 4 floats back (unaligned store)
        }
    }

AVX2 multiply-by-scalar: Even faster is to use 8 parallel multiplications with AVX-2’s 256-bit registers. The AVX-1 version is simply changed to use the “__m256” type and the analogous AVX-2 intrinsics. The new code looks like:

    void aussie_vector_multiply_scalar_AVX2(float v[], int n, float c)
    {
        const __m256 rscalar = _mm256_set1_ps(c);  // Vector of 8 copies of the scalar
        for (int i = 0; i < n; i += 8) {  // Note: assumes n is a multiple of 8
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats (unaligned load)
            __m256 dst = _mm256_mul_ps(r1, rscalar); // Multiply by scalars
            _mm256_storeu_ps(&v[i], dst);  // Store 8 floats back (unaligned store)
        }
    }

Combining AVX-2 with pointer arithmetic. Finally, we can get a small extra benefit by adding pointer arithmetic optimizations to the AVX-2 parallelized version. The new code is:

    void aussie_vector_multiply_scalar_AVX2_pointer_arith(float v[], int n, float c)
    {
        // Multiply all vector elements by constant
        const __m256 rscalar = _mm256_set1_ps(c);  // Vector full of scalars...
        for (; n > 0; n -= 8, v += 8) {  // Note: assumes n is a multiple of 8
            __m256 r1 = _mm256_loadu_ps(v);   // Load 8 floats (unaligned load)
            __m256 dst = _mm256_mul_ps(r1, rscalar);   // Multiply by scalars
            _mm256_storeu_ps(v, dst);  // Store 8 floats back (unaligned store)
        }
    }

Benchmarking results. In theory, the AVX-2 intrinsics could parallelize the computation by 8 times, but benchmarking showed only about a 4-times speedup over the basic C++ version, most likely because this simple loop is limited more by memory access than by arithmetic.

    Vector-scalar operation benchmarks (N=1024, ITER=1000000):
    Vector mult-scalar C++: 1412 ticks (1.41 seconds)
    Vector mult-scalar pointer-arith: 995 ticks (0.99 seconds)
    Vector mult-scalar AVX1: 677 ticks (0.68 seconds)
    Vector mult-scalar AVX2: 373 ticks (0.37 seconds)
    Vector mult-scalar AVX2 + pointer arith: 340 ticks (0.34 seconds)
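
For context, the benchmark harness itself is not shown in this excerpt; a minimal sketch of the kind of timing loop that could produce numbers like these (the function name and details here are assumptions, not the book's actual harness) is:

    #include <cstdio>
    #include <ctime>

    void benchmark_mult_scalar(void (*fn)(float[], int, float), const char* name)
    {
        const int N = 1024, ITER = 1000000;
        static float v[N];                       // Test vector (contents not important here)
        for (int i = 0; i < N; i++) v[i] = 1.0f;
        std::clock_t start = std::clock();
        for (int iter = 0; iter < ITER; iter++) {
            fn(v, N, 1.0001f);                   // Run the routine under test
        }
        std::clock_t ticks = std::clock() - start;
        printf("%s: %ld ticks (%.2f seconds)\n", name, (long)ticks,
               (double)ticks / CLOCKS_PER_SEC);
    }

A call such as benchmark_mult_scalar(aussie_vector_multiply_scalar_AVX2, "Vector mult-scalar AVX2") then times one of the versions above.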

Vectorized Add Scalar

The code to vectorize an “add-scalar” operation is almost identical to “multiply-scalar” operations, except that “add” intrinsics are used. Here is the AVX-1 version with “_mm_add_ps”:

    void aussie_vector_add_scalar_AVX1(float v[], int n, float c)
    {
        // Add scalar constant to all vector elements
        const __m128 rscalar = _mm_set1_ps(c);  // Set up vector full of scalars...
        for (int i = 0; i < n; i += 4) {  // Note: assumes n is a multiple of 4
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load 4 floats (unaligned load)
            __m128 dst = _mm_add_ps(r1, rscalar);   // Add scalars
            _mm_storeu_ps(&v[i], dst);  // Store 4 floats back (unaligned store)
        }
    }

And this is the analogous AVX-2 version using the “_mm256_add_ps” intrinsic:

    void aussie_vector_add_scalar_AVX2(float v[], int n, float c)  // Add scalar constant to all vector elements
    {
        const __m256 rscalar = _mm256_set1_ps(c);  // Vector full of scalars...
        for (int i = 0; i < n; i += 8) {  // Note: assumes n is a multiple of 8
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats (unaligned load)
            __m256 dst = _mm256_add_ps(r1, rscalar);   // Add scalars
            _mm256_storeu_ps(&v[i], dst);  // Store 8 floats back (unaligned store)
        }
    }

Vectorized RELU with Max Intrinsics

The RELU activation function is an important piece of code in AI engines. However, it is arithmetically very simple: it converts negative values to zero and leaves positive values unchanged. This is algebraically equivalent to max(x,0), which can be implemented in AVX like a “max-scalar” operation.

To vectorize RELU applied to a whole vector of float elements, we are effectively doing a SIMD max operation with a scalar zero (i.e., 0.0). Hence, the code is very similar to vectorization of add-scalar, but uses the “_mm_max_ps” intrinsic.
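
For reference, a plain scalar C++ version of element-wise RELU is just a loop with a test (this reference version is a sketch for comparison; the name mimics the book's style):

    void aussie_vector_reluize_basic(float v[], int n)   // Scalar reference version (illustrative)
    {
        // Apply RELU to each element (sets negatives to zero)
        for (int i = 0; i < n; i++) {
            if (v[i] < 0.0f) v[i] = 0.0f;
        }
    }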

The AVX1 version of vectorized RELU looks like:

    void aussie_vector_reluize_AVX1(float v[], int n)   // Apply RELU to each element (sets negatives to zero)
    {
        if (n % 4 != 0) {
            aussie_assert(n % 4 == 0);  // Require n to be a multiple of 4
            return; // fail
        }
        const __m128 rzeros = _mm_set1_ps(0.0f);  // Set up vector full of zeros...
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load 4 floats (unaligned load)
            __m128 dst = _mm_max_ps(r1, rzeros);   // MAX(r1, 0)
            _mm_storeu_ps(&v[i], dst);  // Store 4 floats back (unaligned store)
        }
    }

And here is the AVX2 version doing 8 float elements at a time using the “_mm256_max_ps” intrinsic:

    void aussie_vector_reluize_AVX2(float v[], int n)  // Apply RELU to each element (sets negatives to zero)
    {
        if (n % 8 != 0) {
            aussie_assert(n % 8 == 0);  // Require n to be a multiple of 8
            return; // fail
        }
        const __m256 rzeros = _mm256_set1_ps(0.0f);  // Vector full of zeros...
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats (unaligned load)
            __m256 dst = _mm256_max_ps(r1, rzeros);   // MAX(r1, 0)
            _mm256_storeu_ps(&v[i], dst);  // Store 8 floats back (unaligned store)
        }
    }

Vectorization of Exponentiation

The expf function is very expensive to call, but exponentiation of entire vectors of float values is required in several parts of AI engines, such as activation functions and Softmax normalization. Although mainstream x86 has no single opcode that computes expf, there are AVX intrinsics for SIMD exponentiation of small vectors (i.e., 4 float values for AVX-1 and 8 float values for AVX-2); these come from the compiler's short vector math library (SVML) support, available in the Intel compilers and recent MSVC versions, and each call exponentiates all lanes in parallel.

The basic C++ version to apply expf to every element of a vector, and store the result in the original vector, looks like this:

    void aussie_vector_expf(float v[], int n)   
    {
        // Apply EXPF (exponential) to each element
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i]);
        }
    }

Loop Pointer Arithmetic. Applying the basic C++ optimization of pointer arithmetic, the new code is:

    void aussie_vector_expf_pointer_arith(float v[], int n)
    {
        for (; n > 0; n--, v++) {
            *v = expf(*v);
        }
    }

AVX1 SIMD exponentiation of 4 values: There is an AVX intrinsic called “_mm_exp_ps” to exponentiate 4 float values in parallel using the 128-bit registers. Here’s the new vector exponentiation code with loop unrolling every 4 elements and AVX1 vectorization:

    void aussie_vector_expf_AVX1(float v[], int n)
    {
        for (int i = 0; i < n; i += 4) {  // Note: assumes n is a multiple of 4
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load 4 floats (unaligned load)
            __m128 dst = _mm_exp_ps(r1);   // Exponentiate (expf) all 4 lanes
            _mm_storeu_ps(&v[i], dst);  // Store 4 floats back (unaligned store)
        }
    }

AVX2 SIMD exponentiation of 8 values: The AVX2 intrinsic is “_mm256_exp_ps” to exponentiate 8 elements in parallel using the 256-bit registers. The new code with loop unrolling every 8 values and AVX-2 intrinsics becomes:

    void aussie_vector_expf_AVX2(float v[], int n)  // Apply EXPF (exponential) to each element
    {
        for (int i = 0; i < n; i += 8) {  // Note: assumes n is a multiple of 8
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats (unaligned load)
            __m256 dst = _mm256_exp_ps(r1);    // Exponentiate (expf) all 8 lanes
            _mm256_storeu_ps(&v[i], dst);  // Store 8 floats back (unaligned store)
        }
    }
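
One caveat: _mm_exp_ps and _mm256_exp_ps come from SVML-style compiler support, and their approximations may not match the C library's expf bit-for-bit, so it is worth validating the vectorized version against the scalar one. A minimal sketch of such a check (the function name, test data, and tolerance are illustrative) might be:

    bool aussie_vector_expf_check(int n)   // Illustrative accuracy check (assumes n <= 1024, n % 8 == 0)
    {
        float v1[1024], v2[1024];
        for (int i = 0; i < n; i++) {
            v1[i] = v2[i] = (i - 512) * 0.01f;   // Test inputs in roughly [-5.12, +5.12)
        }
        aussie_vector_expf(v1, n);        // Basic scalar version
        aussie_vector_expf_AVX2(v2, n);   // Vectorized AVX-2 version
        for (int i = 0; i < n; i++) {
            if (fabsf(v1[i] - v2[i]) > 1e-4f * fabsf(v1[i])) {
                return false;   // Mismatch beyond a small relative tolerance
            }
        }
        return true;
    }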

Benchmarking results. The results of optimizing exponentiation are striking! AVX1 is massively faster, cutting out about 96% of the original computation time, and then AVX2 is faster still. It’s almost as if computing 4 or 8 exponentials at a time is faster than computing them one by one. Who knew?

    Vector-exponentiation operation benchmarks (N=1024, ITER=100000):
    Vector expf basic: 6695 ticks (6.70 seconds)
    Vector expf pointer-arith: 6395 ticks (6.39 seconds)
    Vector expf AVX1: 260 ticks (0.26 seconds)
    Vector expf AVX2: 124 ticks (0.12 seconds)
