Chapter 1. AVX Intrinsics
Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
by David Spuler, Ph.D.
What are AVX Intrinsics?
Hardware-assisted vectorization is a powerful optimization for processing contiguous data structures. AVX intrinsics are SIMD parallel instructions for x86 and x64 architectures. They are actually machine opcodes supported by the x86/x64 CPU, but are wrapped in intrinsic function prototypes for easy access from a C++ program.
The main advantage of SIMD instructions is that they are CPU-supported parallel optimizations. Hence, they do not require a GPU, and can even be used on a basic Windows laptop. The main downside is that their level of parallelism is nowhere near that of a high-end GPU.
There are multiple generations of AVX intrinsics based on x86/x64 CPU instructions. Different CPUs support different features, and exactly which intrinsic calls can be used will depend on the CPU on which your C++ is running. The basic AVX types are:
- AVX-1 — 128-bit registers = 16 bytes (4 x 32-bit float or int values)
- AVX-2 — 256-bit registers = 32 bytes (8 x 32-bit values)
- AVX-512 — 512-bit registers = 64 bytes (16 x 32-bit values)
- AVX-10 — also 512-bit registers (with speedups)
The AVX intrinsics use C++ type names to declare variables for their registers. The float types used to declare AVX registers in C++ all have a double-underscore prefix: “__m128” for 128-bit registers (4 floats), “__m256” for 256-bit registers (8 floats), and “__m512” for 512-bit registers (16 floats). Similarly, there are also register type names for int types (__m128i, __m256i, and __m512i), and types for “double” registers (__m128d, __m256d, and __m512d).
AVX intrinsic functions and their types are declared as ordinary function prototypes in header files.
The header files that you may need to include for these intrinsics include <intrin.h>, <emmintrin.h>,
and <immintrin.h>.
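As a minimal sketch, here is one register of each type family being declared and initialized. It assumes a compiler with AVX support enabled (e.g., -mavx2 on GCC/Clang or /arch:AVX2 on MSVC); the variable names are illustrative only. On mainstream compilers, <immintrin.h> pulls in the older SSE headers as well.

    #include <immintrin.h>   // AVX/AVX-2 intrinsics (includes the SSE headers)

    int main()
    {
        __m128  f4 = _mm_setzero_ps();       // 4 floats, all zero
        __m256  f8 = _mm256_set1_ps(3.14f);  // 8 floats, all set to 3.14
        __m256i i8 = _mm256_set1_epi32(42);  // 8 x 32-bit ints, all 42
        __m256d d4 = _mm256_set1_pd(2.0);    // 4 doubles, all 2.0
        (void)f4; (void)f8; (void)i8; (void)d4;  // suppress unused warnings
        return 0;
    }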
Useful AVX SIMD vector intrinsics for float types include (a worked example follows the list):
- Initialize to all-zeros — _mm_setzero_ps, _mm256_setzero_ps
- Set all values to a single float — _mm_set1_ps, _mm256_set1_ps
- Set to 4 or 8 values — _mm_set_ps, _mm256_set_ps
- Load from arrays to AVX registers — _mm_loadu_ps, _mm256_loadu_ps
- Store registers back to float arrays — _mm_storeu_ps, _mm256_storeu_ps
- Addition — _mm_add_ps, _mm256_add_ps
- Multiplication — _mm_mul_ps (SSE), _mm256_mul_ps (AVX-2)
- Vector dot product — _mm_dp_ps, _mm256_dp_ps
- Fused Multiply-Add (FMA) — _mm_fmadd_ps, _mm256_fmadd_ps
- Horizontal addition (pairwise) — _mm_hadd_ps, _mm256_hadd_ps
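As a concrete sketch of how the load, add, and store intrinsics fit together, this loop adds two float arrays 8 elements at a time. It assumes n is a multiple of 8 and AVX-2 compiler support (e.g., -mavx2 or /arch:AVX2); the function name is illustrative.

    #include <immintrin.h>

    void vector_add_avx2(const float* a, const float* b, float* out, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256 va   = _mm256_loadu_ps(&a[i]);  // load 8 floats (unaligned is fine)
            __m256 vb   = _mm256_loadu_ps(&b[i]);
            __m256 vsum = _mm256_add_ps(va, vb);   // 8 additions in parallel
            _mm256_storeu_ps(&out[i], vsum);       // store 8 results
        }
    }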
Note that the names of the intrinsic functions have meaningful suffixes. The “_ps” suffix means “packed single-precision” (i.e., float), whereas the “_pd” suffix means “packed double-precision” (i.e., double).
AVX Operations
The main SIMD instructions are called “vertical” instructions, by convention. They take one vector and a second vector (e.g., both are 128-bit), apply an operation element-wise in parallel, and put the result into a third register. In other words, they return the result of a “pair-wise” or “element-wise” operation on two vectors into a third vector.
For example, vertical addition requires two input vectors and will output a third vector with the sums.
AVX-512 SIMD addition will add two 512-bit registers full of float values
on a paired element basis (i.e., adds 16 pairs of 32-bit float values), yielding a third 512-bit vector with the result (16 float values).
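A minimal sketch of that vertical add, assuming AVX-512F hardware and compiler flags (e.g., -mavx512f); the function name is illustrative:

    #include <immintrin.h>

    __m512 add16(__m512 x, __m512 y)
    {
        return _mm512_add_ps(x, y);  // 16 pairwise float additions at once
    }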
Binary operations. The full list of binary AVX operations is very long. Supported AVX operations include the following (an FMA sketch appears after the list):
- Multiplication
- Addition
- Subtraction
- Division
- Maximum
- Minimum
- Fused Multiply-Add (FMA)
- Bitwise operations
- ...and many more
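For instance, the FMA entry above computes a*b + c element-wise with a single rounding step. A minimal sketch, assuming FMA hardware support (e.g., compile with -mfma); the function name is illustrative:

    #include <immintrin.h>

    __m256 fma8(__m256 a, __m256 b, __m256 c)
    {
        return _mm256_fmadd_ps(a, b, c);  // element-wise a[i]*b[i] + c[i]
    }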
Unary operations. AVX unary intrinsics apply a particular function to all elements of an AVX register in parallel, and return the resulting register. Supported AVX unary operations include the following (a conversion example appears after the list):
- Clear to zero
- Set to a constant
- Casts
- Conversions
- Popcount (POPCNT)
- Leading-zero count (LZCNT)
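As a small example of the conversion category, this sketch converts 8 floats to 8 32-bit integers with rounding, using plain AVX; the name to_ints is illustrative:

    #include <immintrin.h>

    __m256i to_ints(__m256 x)
    {
        return _mm256_cvtps_epi32(x);  // convert 8 floats to 8 rounded ints
    }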
Mathematical Functions. Simple float-to-float mathematical functions are effectively a type of unary operator. AVX supports a variety of functions with vector hardware instructions, such as (a sqrt sketch follows the list):
- Absolute value: abs
- Error function: erf
- Reciprocal
- Rounding, ceiling, floor
- Roots: sqrt (square root), cube root
- Inverted roots (e.g., invsqrt)
- Exponential: exp, exp10
- Logarithm: log, log10
- Trigonometric functions
- Hyperbolic functions
- Statistics (e.g., Cumulative Distribution Function)
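Square root is one of these that maps directly to a hardware instruction. A minimal sketch with plain AVX; the function name is illustrative:

    #include <immintrin.h>

    __m256 sqrt8(__m256 x)
    {
        return _mm256_sqrt_ps(x);  // square root of each of the 8 float lanes
    }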
AVX Horizontal Intrinsics
Horizontal operations refer to arithmetic across the values within one vector. AVX intrinsics exist to do “horizontal” operations across the same vector, such as adding horizontal elements of a vector, or finding the maximum of pairs of elements within a vector.
Horizontal SIMD instructions are typically designated with an “h” prefix on the operation name (e.g., “horizontal add” is “hadd”). More specifically, the intrinsic for 128-bit horizontal add is “_mm_hadd_ps” and the 256-bit version is “_mm256_hadd_ps”.
However, do not make the mistake of assuming that these horizontal AVX
intrinsics are a “reduction” of a vector down to a single float (i.e., vector-to-scalar).
I mean, they really should do exactly that,
but that would be too good to be true.
The horizontal intrinsic functions
are still effectively “pairwise” operations for AVX and AVX-2, except the pairs are within the same vector (i.e., horizontal pairs).
If you want to add all elements of a vector, or find the maximum,
you will need multiple calls to these intrinsics,
each time processing pairs of numbers,
halving the number of elements you are examining at each iteration.
Hence, for example, summing all the float values in a vector with AVX or AVX-2 requires a “shuffle-and-add” method applied multiple times, as in the sketch below.
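Here is one common version of that shuffle-and-add pattern for a 256-bit register, as a sketch; the helper name hsum8 is illustrative:

    #include <immintrin.h>

    float hsum8(__m256 v)
    {
        __m128 lo   = _mm256_castps256_ps128(v);    // lower 4 floats (free cast)
        __m128 hi   = _mm256_extractf128_ps(v, 1);  // upper 4 floats
        __m128 sum4 = _mm_add_ps(lo, hi);           // 4 partial sums
        sum4 = _mm_hadd_ps(sum4, sum4);             // 2 partial sums
        sum4 = _mm_hadd_ps(sum4, sum4);             // 1 total (in every lane)
        return _mm_cvtss_f32(sum4);                 // extract lane 0
    }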
Thankfully, AVX-512 actually does have horizontal reductions that process all the elements in its 512-bit registers. Hence, the 512-bit horizontal add uses a different naming convention, with “reduce_add” in the intrinsic name (e.g., _mm512_reduce_add_ps is a summation reduction). In other words, this reduction operates on all 16 float values in an AVX-512 register, and the _mm512_reduce_add_ps intrinsic can sum all 16 float values in a single call.
This horizontal reduction summation is useful for vectorizing functions such as average,
and could be used for vector dot products
(i.e., do an AVX-512 SIMD vertical multiplication into a third vector of 16 float values, then a horizontal reduction to sum those 16 float values),
although there’s an even better way with FMA intrinsics.
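A sketch of that multiply-then-reduce dot product, assuming n is a multiple of 16 and AVX-512F support (e.g., -mavx512f). The function name is illustrative, and this is deliberately not yet the faster FMA version:

    #include <immintrin.h>

    float dot_avx512(const float* a, const float* b, int n)
    {
        float total = 0.0f;
        for (int i = 0; i < n; i += 16) {
            __m512 va   = _mm512_loadu_ps(&a[i]);
            __m512 vb   = _mm512_loadu_ps(&b[i]);
            __m512 prod = _mm512_mul_ps(va, vb);  // 16 multiplies in parallel
            total += _mm512_reduce_add_ps(prod);  // horizontal sum of 16 products
        }
        return total;
    }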
Supported AVX horizontal operations for pairwise horizontal calculations (AVX or AVX-2) or vector-to-scalar reductions (AVX-512) include floating-point and integer versions, in various sizes, for primitive operations such as:
- Addition
- Maximum
- Minimum
- Bitwise operations
Combining Multithreading and SIMD CPU Instructions
You can double up! C++ multithreading software can be interleaved with CPU SIMD instructions as an optimized optimization. It’s totally allowed, and you can even put it on your resume. The idea is basically this structure (sketched in code after the list):
- Multithreading architecture — higher-level CPU parallelization.
- SIMD instructions — lower-level CPU vectorization.
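A sketch of that two-level structure, using std::thread for the higher level and the AVX-2 add for the lower level. It assumes n divides evenly into per-thread chunks that are multiples of 8, and all names are illustrative:

    #include <immintrin.h>
    #include <thread>
    #include <vector>

    static void add_range(const float* a, const float* b, float* out, int lo, int hi)
    {
        for (int i = lo; i < hi; i += 8) {  // SIMD level: 8 floats at a time
            __m256 v = _mm256_add_ps(_mm256_loadu_ps(&a[i]),
                                     _mm256_loadu_ps(&b[i]));
            _mm256_storeu_ps(&out[i], v);
        }
    }

    void add_parallel(const float* a, const float* b, float* out, int n, int nthreads)
    {
        std::vector<std::thread> pool;  // threading level: one slice per thread
        int chunk = n / nthreads;
        for (int t = 0; t < nthreads; ++t)
            pool.emplace_back(add_range, a, b, out, t * chunk, (t + 1) * chunk);
        for (auto& th : pool)
            th.join();
    }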
You can even triple up your parallelism:
- Multithreading/multicore (CPU)
- SIMD instructions (CPU)
- GPU vectorization
Each different type of parallelization comes in at a different level.
There’s even a fourth level, because CUDA C++ GPU programming
has its own SIMD instructions to run on the GPU,
based on the float4 family of types.
However, they’re not AVX, and don’t work on an x86 CPU,
so we’ll leave the GPU SIMD discussion to another day.
References
- Intel (2023), Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1, August 2023, 248966-Software-Optimization-Manual-V1-048.pdf
- Agner Fog (2023), Optimizing subroutines in assembly language, https://www.agner.org/optimize/optimizing_assembly.pdf
- Félix Cloutier (2023), x86 and amd64 instruction reference, https://www.felixcloutier.com/x86/
- Microsoft (2023), x86 intrinsics list, https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list
- Intel (2023), Intel Intrinsics Guide, Version 3.6.6, May 10th, 2023, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
- Intel (2023), Intel C++ Compiler Classic Developer Guide, version 2021.10, https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-10/overview.html, PDF: https://cdrdv2.intel.com/v1/dl/getContent/781922?fileName=cpp-compiler_developer-guide-reference_2021.10-767249-781922.pdf
- Microsoft, 2021, __cpuid, __cpuidex, https://learn.microsoft.com/en-us/cpp/intrinsics/cpuid-cpuidex?view=msvc-170 (Using CPUID to detect versions.)
- Wikipedia, July 2025 (accessed), Advanced Vector Extensions, https://en.wikipedia.org/wiki/Advanced_Vector_Extensions