Aussie AI

Chapter 2. Simple AVX Example

  • Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
  • by David Spuler, Ph.D.

Basic AVX SIMD Multiply

Let us do a basic element-wise SIMD multiply using AVX (version 1) and its 128-bit registers. This performs a paired vector multiply of two arrays of 4 float numbers (i.e., 4 x 32-bit float = 128 bits). Each float in the resulting array is the pairwise product of the corresponding elements in the two operands.

This is how SIMD instructions work: by operating on each element of the array (i.e., “pairwise” or “element-wise”). For example, a “vertical” multiply takes the 4 float values in one input array, multiplies each of them by the corresponding float in the other input array of 4 float numbers, and then returns a resulting output array with 4 float values.
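To make the semantics concrete, here is the same “vertical” multiply written as a plain scalar C++ loop (a reference sketch; the function name is illustrative, not from the book):

```cpp
#include <cstddef>

// Scalar reference version of a SIMD "vertical" multiply:
// each output element is the product of the two corresponding inputs.
void multiply_vectors_scalar(const float* v1, const float* v2,
                             float* vresult, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        vresult[i] = v1[i] * v2[i];   // one pairwise (element-wise) product
}
```

A SIMD multiply instruction performs all of these per-element products in a single instruction, rather than one loop iteration at a time.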

For testing, let us assume we want to create an AVX function that multiplies 4 float values element-wise. The test code looks like this:

    float arr1[4] = { 1.0f , 2.5f , 3.14f, 0.0f };
    float arr2[4] = { 1.0f , 2.5f , 3.14f, 0.0f };
    float resultarr[4];
    // Multiply element-wise
    aussie_multiply_vectors(arr1, arr2, resultarr, 4);  

We then test that the result is the element-wise product of each pair of the 4 float values, using my home-grown “aussie_testf” unit testing function, which compares float numbers for equality:

    aussie_testf(resultarr[0], 1.0f * 1.0f);  // Unit tests
    aussie_testf(resultarr[1], 2.5f * 2.5f);
    aussie_testf(resultarr[2], 3.14f * 3.14f);
    aussie_testf(resultarr[3], 0.0f * 0.0f);
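For readers following along without the book's test harness, a minimal float-comparison helper might look like the sketch below. Note that this is a hypothetical stand-in (the real “aussie_testf” implementation may differ), and it uses a small tolerance rather than exact “==” comparison, which is generally safer for floating-point results:

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical minimal float-equality test helper (illustrative only;
// the book's real aussie_testf may be implemented differently).
// Compares within a small tolerance and reports failures to stderr.
bool my_testf(float actual, float expected, float eps = 1e-6f)
{
    bool ok = std::fabs(actual - expected) <= eps;
    if (!ok)
        std::fprintf(stderr, "FAIL: got %g, expected %g\n", actual, expected);
    return ok;
}
```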

Here’s the low-level C++ code that actually does the SIMD multiply using the “_mm_mul_ps” AVX intrinsic function:

    #include <xmmintrin.h>
    #include <intrin.h>

    void aussie_avx_multiply_4_floats(
        float v1[4], float v2[4], float vresult[4])
    {
        // Multiply 4x32-bit float in 128-bit AVX registers
        __m128 r1 = _mm_loadu_ps(v1);   // Load floats
        __m128 r2 = _mm_loadu_ps(v2);
        __m128 dst = _mm_mul_ps(r1, r2);   // AVX SIMD Multiply
        _mm_storeu_ps(vresult, dst);  // Convert back to floats
    }

Explaining this code one line at a time:

    1. The header files are included: <xmmintrin.h> and <intrin.h>.

    2. The basic AVX register type is “__m128” which is an AVX 128-bit register (i.e., it is 128 bits in the basic AVX version, not AVX-2 or AVX-512).

    3. The variables “r1” and “r2” are declared as “__m128” registers. The names “r1” and “r2” are not significant; they are just variable names.

    4. The intrinsic function “_mm_loadu_ps” is used to convert the arrays of 4 float values into the 128-bit register types, and the result is “loaded” into the “r1” and “r2” 128-bit types.

    5. Another 128-bit variable “dst” is declared to hold the results of the SIMD multiply. The name “dst” can be any variable name.

    6. The main AVX SIMD multiply is performed by the “_mm_mul_ps” intrinsic function. The suffix “ps” means “packed single-precision” (i.e., multiple 32-bit float values). This is where the rubber meets the road, and the results of the element-wise multiplication of registers “r1” and “r2” are computed and saved into the “dst” register. It is analogous to the basic C++ expression: dst = r1*r2;

    7. The 128-bit result register variable “dst” is converted back to 32-bit float values (4 of them), by “storing” the 128 bits into the float array using the “_mm_storeu_ps” AVX intrinsic.
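As a side note, the “u” in “_mm_loadu_ps” and “_mm_storeu_ps” stands for “unaligned”: those intrinsics accept float arrays at any address. There are also aligned variants, “_mm_load_ps” and “_mm_store_ps”, which require the arrays to be 16-byte aligned but can be used when you control the data layout. A sketch of the same multiply using the aligned intrinsics (assuming the caller provides 16-byte-aligned arrays, e.g., via alignas(16)):

```cpp
#include <xmmintrin.h>   // __m128, _mm_load_ps, _mm_mul_ps, _mm_store_ps

// Variant using the aligned load/store intrinsics (no "u" suffix).
// All three arrays must be 16-byte aligned, or the loads/stores may fault.
void multiply_4_floats_aligned(const float* v1, const float* v2, float* vresult)
{
    __m128 r1 = _mm_load_ps(v1);       // aligned load of 4 floats
    __m128 r2 = _mm_load_ps(v2);
    __m128 dst = _mm_mul_ps(r1, r2);   // element-wise SIMD multiply
    _mm_store_ps(vresult, dst);        // aligned store of 4 floats
}
```

Aligned loads and stores can be slightly faster on some CPUs, but the unaligned versions in the main example are the safer default when alignment is not guaranteed.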

AVX-2 SIMD Multiplication

Here is the AVX-2 version of pairwise SIMD multiply with intrinsics for 256-bit registers, which hold eight 32-bit float values.

    void aussie_avx2_multiply_8_floats(
        float v1[8], float v2[8], float vresult[8])
    {
        // Multiply 8x32-bit floats in 256-bit AVX2 registers
        __m256 r1 = _mm256_loadu_ps(v1);   // Load floats
        __m256 r2 = _mm256_loadu_ps(v2);
        __m256 dst = _mm256_mul_ps(r1, r2);  // Multiply (SIMD)
        _mm256_storeu_ps(vresult, dst);  // Convert to 8 floats
    }

This is similar to the basic AVX 128-bit version, with some differences:

  • The type for 256-bit registers is “__m256”.
  • The AVX-2 loading intrinsic is “_mm256_loadu_ps”.
  • The AVX-2 multiplication intrinsic is “_mm256_mul_ps”.
  • The conversion back to float uses AVX-2 intrinsic “_mm256_storeu_ps”.

AVX-512 SIMD Multiplication

Here is the basic 16-float SIMD vector multiplication using 512-bit registers in AVX-512.

    void aussie_avx512_multiply_16_floats(
        float v1[16], float v2[16], float vresult[16])
    {
        // Multiply 16x32-bit floats in 512-bit registers
        __m512 r1 = _mm512_loadu_ps(v1); // Load 16 floats
        __m512 r2 = _mm512_loadu_ps(v2);
        __m512 dst = _mm512_mul_ps(r1, r2); // Multiply (SIMD)
        _mm512_storeu_ps(vresult, dst);  // Convert to floats
    }

Note that AVX-512 code will fail with an “unhandled exception: illegal instruction” error (e.g., in MSVS) if AVX-512 is not supported by your CPU. Hence, it's important to check your platform before optimizing!
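One way to do that check at runtime is to query the CPU's feature flags before calling any AVX-512 code path. A sketch for GCC/Clang on x86, using the compiler's built-in feature test (on MSVC, the equivalent check would use the “__cpuidex” intrinsic instead):

```cpp
// Runtime check for AVX-512 Foundation support (GCC/Clang, x86 targets).
// Assumption: __builtin_cpu_supports is available (GCC 5+ / modern Clang).
bool cpu_has_avx512f()
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx512f");
#else
    return false;   // unknown compiler/architecture: assume unsupported
#endif
}
```

Typical usage is to dispatch on the result: call the AVX-512 version when “cpu_has_avx512f()” returns true, and fall back to the AVX-2, AVX, or scalar version otherwise.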
