Aussie AI
Chapter 2. Simple AVX Example
Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
by David Spuler, Ph.D.
Basic AVX SIMD Multiply
Let us do a basic element-wise SIMD multiply using AVX (version 1) and its 128-bit registers.
This will do a pairwise vector multiply of two arrays of 4 float numbers (i.e., 4 x 32-bit floats = 128 bits).
Each float in the resulting array is the pairwise product of the corresponding elements in the two operands.
This is how SIMD instructions work: by operating on each element of the arrays (i.e., “pairwise” or “element-wise”).
For example, a “vertical” multiply
will take the 4 float values in one input array,
multiply each of them by the corresponding float in the other input array of 4 float numbers,
and then return a resulting output array with 4 float values.
For testing, let us assume we want to create an AVX function that multiplies 4 float values element-wise.
The test code looks like:
float arr1[4] = { 1.0f , 2.5f , 3.14f, 0.0f };
float arr2[4] = { 1.0f , 2.5f , 3.14f, 0.0f };
float resultarr[4];
// Multiply element-wise
aussie_multiply_vectors(arr1, arr2, resultarr, 4);
We then test that the result is an element-wise multiply of each pair of the 4 float values
(using my home-grown “aussie_testf” unit testing function that compares float numbers for equality):
aussie_testf(resultarr[0], 1.0f * 1.0f); // Unit tests
aussie_testf(resultarr[1], 2.5f * 2.5f);
aussie_testf(resultarr[2], 3.14f * 3.14f);
aussie_testf(resultarr[3], 0.0f * 0.0f);
Here’s the low-level C++ code that actually does the SIMD multiply
using the “_mm_mul_ps” AVX intrinsic function:
#include <xmmintrin.h>
#include <intrin.h>
void aussie_avx_multiply_4_floats(
float v1[4], float v2[4], float vresult[4])
{
// Multiply 4x32-bit float in 128-bit AVX registers
__m128 r1 = _mm_loadu_ps(v1); // Load floats
__m128 r2 = _mm_loadu_ps(v2);
__m128 dst = _mm_mul_ps(r1, r2); // AVX SIMD Multiply
_mm_storeu_ps(vresult, dst); // Convert back to floats
}
Explaining this code one line at a time:
1. The header files are included: <xmmintrin.h> and <intrin.h>.
2. The basic AVX register type is “__m128” which is an AVX 128-bit register (i.e., it is 128 bits in the basic AVX version, not AVX-2 or AVX-512).
3. The variables “r1” and “r2” are declared as __m128 registers. The names “r1” and “r2” are not important, and are just variable names.
4. The intrinsic function “_mm_loadu_ps” is used to convert the arrays of 4 float values into the 128-bit register types,
and the result is “loaded” into the “r1” and “r2” 128-bit types.
5. Another 128-bit variable “dst” is declared to hold the results of the SIMD multiply. The name “dst” can be any variable name.
6. The main AVX SIMD multiply is performed by the “_mm_mul_ps” intrinsic function. The suffix “ps” means “packed single-precision” (i.e., 32-bit float).
This is where the rubber meets the road, and the results
of the element-wise multiplication of registers “r1” and “r2”
are computed and saved into the “dst” register.
It is analogous to the basic C++ expression: dst = r1*r2;
7. The 128-bit result register variable “dst” is converted back to 32-bit float values (4 of them),
by “storing” the 128 bits into the float array using the “_mm_storeu_ps” AVX intrinsic.
AVX-2 SIMD Multiplication
Here is the AVX-2 version of pairwise SIMD multiply with intrinsics
for 256-bit registers,
which hold eight 32-bit float values.
void aussie_avx2_multiply_8_floats(
float v1[8], float v2[8], float vresult[8])
{
// Multiply 8x32-bit floats in 256-bit AVX2 registers
__m256 r1 = _mm256_loadu_ps(v1); // Load floats
__m256 r2 = _mm256_loadu_ps(v2);
__m256 dst = _mm256_mul_ps(r1, r2); // Multiply (SIMD)
_mm256_storeu_ps(vresult, dst); // Convert to 8 floats
}
This is similar to the basic AVX 128-bit version, with some differences:
- The type for 256-bit registers is “__m256”.
- The AVX-2 loading intrinsic is “_mm256_loadu_ps”.
- The AVX-2 multiplication intrinsic is “_mm256_mul_ps”.
- The conversion back to float uses the AVX-2 intrinsic “_mm256_storeu_ps”.
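Note that these wider intrinsics typically require the matching target flags at compile time. For example, with GCC or Clang the flags below enable the needed instruction sets (the source file names are hypothetical; MSVC uses /arch:AVX2 and /arch:AVX512 instead):

```shell
# GCC/Clang: enable AVX-2 for the 256-bit version
g++ -O2 -mavx2 -c avx2_multiply.cpp

# GCC/Clang: enable the AVX-512 Foundation subset for the 512-bit version
g++ -O2 -mavx512f -c avx512_multiply.cpp
```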
AVX-512 SIMD Multiplication
Here is the basic 16-float SIMD vector multiplication using 512-bit registers in AVX-512.
void aussie_avx512_multiply_16_floats(
float v1[16], float v2[16], float vresult[16])
{
// Multiply 16x32-bit floats in 512-bit registers
__m512 r1 = _mm512_loadu_ps(v1); // Load 16 floats
__m512 r2 = _mm512_loadu_ps(v2);
__m512 dst = _mm512_mul_ps(r1, r2); // Multiply (SIMD)
_mm512_storeu_ps(vresult, dst); // Convert to floats
}
Note that AVX-512 will fail with an “unhandled exception: illegal instruction” (e.g., in MSVS) if AVX-512 is not supported on your CPU. Hence, it’s important to check your platform before optimizing!