Chapter 1. AVX Intrinsics

  • Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
  • by David Spuler, Ph.D.

What are AVX Intrinsics?

Hardware-assisted vectorization is a powerful optimization for processing contiguous data structures. AVX intrinsics are SIMD parallel instructions for the x86 and x64 architectures. They are actually machine opcodes supported by the x86/x64 CPU, but are wrapped in intrinsic function prototypes for easy access from a C++ program.

The main advantage of SIMD instructions is that they are CPU-supported parallel optimizations. Hence, they do not require a GPU, and can even be used on a basic Windows laptop. The main downside is that their level of parallelism is nowhere near that of a high-end GPU.

There are multiple generations of AVX intrinsics based on x86/x64 CPU instructions. Different CPUs support different features, and exactly which intrinsic calls you can use depends on the CPU on which your C++ is running (a runtime check is sketched after the lists below). The basic AVX generations are:

  • AVX-1 — 128-bit registers = 16 bytes (these 128-bit “_mm_” intrinsics date back to SSE)
  • AVX-2 — 256-bit registers = 32 bytes
  • AVX-512 — 512-bit registers = 64 bytes
  • AVX-10 — also 512-bit registers (with speedups)

In terms of numerical processing, you get this level of parallelism:

  • AVX-1 — 4 x 32-bit float or int values
  • AVX-2 — 8 x 32-bit values
  • AVX-512 — 16 x 32-bit values
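
Because feature support varies between CPUs, it is common to check for AVX capabilities at runtime before calling the corresponding intrinsics. Here is a minimal sketch using the GCC/Clang builtin (on MSVC you would query the __cpuid intrinsic instead):

    #include <cstdio>

    int main() {
        // __builtin_cpu_supports() queries CPUID under the hood (GCC/Clang).
        if (__builtin_cpu_supports("avx2"))
            std::printf("AVX-2 supported\n");
        if (__builtin_cpu_supports("avx512f"))
            std::printf("AVX-512 foundation supported\n");
        return 0;
    }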

The AVX intrinsics use special C++ type names to declare variables that correspond to the underlying registers. The float register types all have a double-underscore prefix: “__m128” for 128-bit registers (4 floats), “__m256” for 256-bit registers (8 floats), and “__m512” for 512-bit registers (16 floats). Similarly, there are register type names for int types (__m128i, __m256i, and __m512i), and types for “double” registers (__m128d, __m256d, and __m512d).

AVX intrinsic functions and their types are declared as ordinary function prototypes in header files. The header files you may need for these intrinsics are <intrin.h>, <emmintrin.h>, and <immintrin.h>; in practice, including <immintrin.h> pulls in most of the SSE/AVX declarations.
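
For example, a minimal program declaring and initializing some of these register types might look like this (a sketch; compile with AVX enabled, e.g., -mavx on GCC/Clang or /arch:AVX on MSVC):

    #include <immintrin.h>   // SSE/AVX intrinsic prototypes and register types
    #include <cstdio>

    int main() {
        __m128 a = _mm_set1_ps(1.5f);     // 4 floats, all set to 1.5
        __m256 b = _mm256_setzero_ps();   // 8 floats, all zero
        __m256d d = _mm256_set1_pd(2.5);  // 4 doubles, all set to 2.5

        float out[8];
        _mm256_storeu_ps(out, b);         // copy the register back to memory
        std::printf("out[0] = %g\n", out[0]);
        (void)a; (void)d;                 // silence unused-variable warnings
        return 0;
    }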

Useful AVX SIMD vector intrinsics for float types include:

  • Initialize to all-zeros — _mm_setzero_ps, _mm256_setzero_ps
  • Set all values to a single float — _mm_set1_ps, _mm256_set1_ps
  • Set to 4 or 8 values — _mm_set_ps, _mm256_set_ps
  • Load from arrays to AVX registers — _mm_loadu_ps, _mm256_loadu_ps
  • Store registers back to float arrays — _mm_storeu_ps, _mm256_storeu_ps
  • Addition — _mm_add_ps, _mm256_add_ps
  • Multiplication — _mm_mul_ps (SSE), _mm256_mul_ps (AVX)
  • Vector dot product — _mm_dp_ps, _mm256_dp_ps
  • Fused Multiply-Add (FMA) — _mm_fmadd_ps, _mm256_fmadd_ps
  • Horizontal addition (pairwise) — _mm_hadd_ps, _mm256_hadd_ps

Note that the names of the intrinsic functions have meaningful suffixes. The “_ps” suffix means “packed single-precision” (i.e., float), whereas the “_pd” suffix means “packed double-precision” (i.e., double).
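
Putting a few of these intrinsics together, here is a sketch of element-wise addition of two float arrays, 8 floats at a time (it assumes AVX support and that n is a multiple of 8; real code would add a scalar cleanup loop for leftover elements):

    #include <immintrin.h>

    // Add two float arrays element-wise, 8 floats per iteration (AVX).
    void vector_add(const float* x, const float* y, float* out, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);   // load 8 floats (unaligned is fine)
            __m256 vy = _mm256_loadu_ps(y + i);
            __m256 vsum = _mm256_add_ps(vx, vy);  // vertical (element-wise) add
            _mm256_storeu_ps(out + i, vsum);      // store 8 results
        }
    }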

AVX Operations

The main SIMD instructions are called “vertical” instructions, by convention. They take one vector and a second vector (e.g., both 128-bit), apply an operation element-wise in parallel, and put the result into a third register. In other words, a vertical instruction performs a “pair-wise” or “element-wise” operation on two vectors and returns the result as a third vector.

For example, vertical addition requires two input vectors and will output a third vector with the sums. AVX-512 SIMD addition will add two 512-bit registers full of float values on a paired element basis (i.e., adds 16 pairs of 32-bit float values), yielding a third 512-bit vector with the result (16 float values).

Binary operations. The full list of binary AVX operations is very long. Supported operations include (a short example follows the list):

  • Multiplication
  • Addition
  • Subtraction
  • Division
  • Maximum
  • Minimum
  • Fused Multiply-Add (FMA)
  • Bitwise operations
  • ...and many more
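
For instance, the minimum and maximum operations combine naturally to clamp values into a range. A sketch (AVX; the function name is illustrative):

    #include <immintrin.h>

    // Clamp 8 floats into the range [lo, hi] using SIMD max/min.
    __m256 clamp8(__m256 v, float lo, float hi) {
        v = _mm256_max_ps(v, _mm256_set1_ps(lo));  // raise anything below lo
        v = _mm256_min_ps(v, _mm256_set1_ps(hi));  // lower anything above hi
        return v;
    }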

Unary operations. AVX unary intrinsics apply a particular function to all elements of an AVX register in parallel, and return the resulting register. Supported AVX unary operations include (see the sketch after the list):

  • Clear to zero
  • Set to a constant
  • Casts
  • Conversions
  • Popcount (POPCNT)
  • Leading-zero count (LZCNT)
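
As an example of conversions versus casts, this sketch converts 8 floats to 32-bit integers, and also shows a bitwise cast that changes the C++ type without changing any bits (AVX):

    #include <immintrin.h>

    // Numeric conversion vs. bitwise cast on a 256-bit register.
    __m256i floats_to_ints(__m256 v) {
        __m256i rounded = _mm256_cvtps_epi32(v);  // converts each float to int32 (rounds)
        __m256i bits = _mm256_castps_si256(v);    // reinterprets the bits; no instruction emitted
        (void)bits;                               // unused; shown only for comparison
        return rounded;
    }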

Mathematical Functions. Simple float-to-float mathematical functions are effectively a type of unary operator. AVX exposes a variety of these through vector intrinsics: a few (square root, approximate reciprocal, rounding, ceiling, floor) map directly to hardware instructions, while many others (e.g., erf, exp, log, the trigonometric functions) are supplied by compiler vector math libraries such as Intel’s SVML under the same intrinsic naming style. Examples include (a sqrt sketch follows the list):

  • Absolute value: abs
  • Error function: erf
  • Reciprocal
  • Rounding, ceiling, floor
  • Roots: sqrt (square root), cube root
  • Inverted roots (e.g., invsqrt)
  • Exponential: exp, exp10
  • Logarithm: log, log10
  • Trigonometric functions
  • Hyperbolic functions
  • Statistics (e.g., Cumulative Distribution Function)
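
For example, square root and floor map directly onto AVX hardware instructions (vsqrtps and vroundps). A minimal sketch:

    #include <immintrin.h>

    // Element-wise square root, then floor, on 8 floats (AVX).
    __m256 sqrt_floor8(__m256 v) {
        __m256 r = _mm256_sqrt_ps(v);  // vsqrtps: 8 square roots in parallel
        return _mm256_floor_ps(r);     // vroundps: round each element toward -infinity
    }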

AVX Horizontal Intrinsics

Horizontal operations refer to arithmetic across the values within a single vector, such as adding neighboring elements of a vector, or finding the maximum of pairs of elements within a vector.

Horizontal SIMD instructions are typically designated with an “h” prefix on the operation name (e.g., “horizontal add” is “hadd”). More specifically, the intrinsic for 128-bit horizontal add is “_mm_hadd_ps” and it is “_mm256_hadd_ps” for 256 bits.

However, do not make the mistake of assuming that these horizontal AVX intrinsics are a “reduction” of a vector down to a single float (i.e., vector-to-scalar). I mean, they really should do exactly that, but that would be too good to be true. The horizontal intrinsic functions are still effectively “pairwise” operations for AVX and AVX-2, except the pairs are within the same vector (i.e., horizontal pairs). If you want to add all elements of a vector, or find the maximum, you will need multiple calls to these intrinsics, each time processing pairs of numbers, halving the number of elements you are examining at each iteration. Hence, for example, summing all the float values in a vector with AVX or AVX-2 uses a method of “shuffle-and-add” multiple times.
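
For example, summing the 4 floats in a 128-bit register takes two rounds of pairwise adds. A sketch (note that _mm_hadd_ps is technically an SSE3 instruction):

    #include <immintrin.h>

    // Sum all 4 floats in a 128-bit register via repeated pairwise hadd.
    float sum4(__m128 v) {
        v = _mm_hadd_ps(v, v);   // [a0+a1, a2+a3, a0+a1, a2+a3]
        v = _mm_hadd_ps(v, v);   // every lane now holds a0+a1+a2+a3
        return _mm_cvtss_f32(v); // extract the lowest lane as a scalar
    }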

Thankfully, AVX-512 actually does have horizontal reductions that process all the elements in its 512-bit registers. Hence, the 512-bit horizontal add uses a different naming convention, with “reduce_add” in the intrinsic name (e.g., _mm512_reduce_add_ps is a summation reduction). This reduction operates on all 16 float values in an AVX-512 register, adding them up in a single intrinsic call. This horizontal reduction summation is useful for vectorizing functions such as average, and could be used for vector dot products (i.e., do an AVX-512 SIMD vertical multiplication into a third vector of 16 float values, then a horizontal reduction to sum those 16 float values), although there’s an even better way with FMA intrinsics.
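
Here is a sketch of that dot product idea, combining FMA accumulation with a final reduction (it assumes AVX-512F support and that n is a multiple of 16):

    #include <immintrin.h>

    // Dot product: FMA accumulation, then one horizontal reduction at the end.
    float dot_product(const float* x, const float* y, int n) {
        __m512 acc = _mm512_setzero_ps();
        for (int i = 0; i < n; i += 16) {
            __m512 vx = _mm512_loadu_ps(x + i);
            __m512 vy = _mm512_loadu_ps(y + i);
            acc = _mm512_fmadd_ps(vx, vy, acc);  // acc += vx * vy, element-wise
        }
        return _mm512_reduce_add_ps(acc);        // sum the 16 partial results
    }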

The supported horizontal operations, whether pairwise calculations (AVX and AVX-2) or vector-to-scalar reductions (AVX-512), come in floating-point and integer versions at various sizes, for primitives such as:

  • Addition
  • Maximum
  • Minimum
  • Bitwise operations

Combining Multithreading and SIMD CPU Instructions

You can double up! C++ multithreading can be interleaved with CPU SIMD instructions, stacking one optimization on top of another. It’s totally allowed, and you can even put it on your resume. The idea is basically this structure (sketched in code after the list):

  • Multithreading architecture — higher-level CPU parallelization.
  • SIMD instructions — lower-level CPU vectorization.
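
Here is a minimal sketch of the two CPU levels combined, splitting an array addition across threads, with each thread running an AVX inner loop over its own slice (it assumes n is a multiple of 8; compile with AVX and threads enabled, e.g., -mavx -pthread):

    #include <immintrin.h>
    #include <thread>
    #include <vector>

    // Lower level: each thread runs an AVX loop over its own slice.
    static void add_slice(const float* x, const float* y, float* out,
                          int begin, int end) {
        for (int i = begin; i < end; i += 8)
            _mm256_storeu_ps(out + i,
                _mm256_add_ps(_mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i)));
    }

    // Higher level: divide the array across threads (multithreading).
    void parallel_add(const float* x, const float* y, float* out,
                      int n, int nthreads) {
        std::vector<std::thread> pool;
        int chunk = (n / nthreads) / 8 * 8;  // keep each slice a multiple of 8
        for (int t = 0; t < nthreads; t++) {
            int begin = t * chunk;
            int end = (t == nthreads - 1) ? n : begin + chunk;
            pool.emplace_back(add_slice, x, y, out, begin, end);
        }
        for (auto& th : pool)
            th.join();
    }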

You can even triple up your parallelism:

  • Multithreading/multicore (CPU)
  • SIMD instructions (CPU)
  • GPU vectorization

Each different type of parallelization comes in at a different level. There’s even a fourth level, because CUDA C++ GPU programming has its own SIMD instructions to run on the GPU, based on the float4 family of types. However, they’re not AVX, and don’t work on an x86 CPU, so we’ll leave the GPU SIMD discussion to another day.
