Chapter 10. Advanced AVX Techniques
-
Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
-
by David Spuler, Ph.D.
AVX Memory Alignment Issues
Some of the AVX examples gloss over the issue of managing “alignment” of memory addresses on byte boundaries with the “alignas” specifier.
Some of the AVX SIMD intrinsic calls require that addresses are 16-byte aligned (i.e., this is effectively 128-bit alignment),
which is not guaranteed by the C++ compiler.
However, we’ve tolerated non-aligned addresses by using the “_mm_storeu_ps” intrinsic,
which works with either aligned or non-aligned addresses.
Note that the alignment requirements of AVX are somewhat in flux. Not all AVX intrinsics require alignment, and the restrictions have been “relaxed” in many cases. There have also been some compiler bugs in tolerating non-aligned addresses in C++ intrinsics. Where required, the alignment needs are:
- AVX-1 — 16-byte alignment (128-bit).
- AVX-2 — 32-byte alignment (256-bit).
- AVX-512 — 64-byte alignment (512-bit).
Since we can sort out alignment at compile-time using the C++ “alignas” specifier and “aligned” type attributes,
there is no performance penalty
(except in terms of space)
for ensuring greater compatibility
across CPU platforms and compiler versions by preferring aligned addresses.
You can create your own macros to easily test pointer addresses for alignment,
which is conceptually a check of their remainder with the % operator.
These examples use bitwise-and to replace the slower remainder operator, casting the pointer via “uintptr_t” (from <cstdint>) so that it fits the integer type on all platforms:
#define aussie_is_aligned_16(ptr) ((((uintptr_t)(ptr)) & 15) == 0)
#define aussie_is_aligned_32(ptr) ((((uintptr_t)(ptr)) & 31) == 0)
Although our code to multiply 4 float values
tolerates non-alignment, it’s a minor slug.
The “_mm_storeu_ps” AVX intrinsic is slower if the addresses are not aligned,
so we should fix the alignment for performance reasons.
There’s also another “store” intrinsic to convert
from 128-bits to 4 floats called “_mm_store_ps” (without the “u”) that runs faster,
but does not tolerate non-aligned float arrays.
Actually, “_mm_storeu_ps” is supposed to be just as fast as “_mm_store_ps” if the address is correctly aligned,
so we can still use that intrinsic if we prefer safety,
but we need to change the variables to be aligned on 16-byte boundaries for a speedup.
To ensure alignment in C++, there is an “alignas” specifier for variable declarations.
We can use “alignas(16)” to force C++ to create the variables with 16-byte alignment of the address
where they are stored.
For example, our unit test harness code could have ensured 16-byte alignment of all memory addresses via:
// Test with 16-byte alignment
alignas(16) float arr1[4] = { 1.0f , 2.5f , 3.14f, 0.0f };
alignas(16) float arr2[4] = { 1.0f , 2.5f , 3.14f, 0.0f };
alignas(16) float resultarr[4];
There are various non-standard alternatives to “alignas” in the various compilers.
For example, MSVS has “__declspec(align(16))” with two prefix underscores,
and GCC supports the “__attribute__((aligned(16)))” type attribute.
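As a brief illustration of these compiler-specific spellings, a portable macro could select between them; this is a minimal sketch with a made-up macro name (AUSSIE_ALIGN16), assuming only the MSVC-style and GCC/Clang-style spellings need handling (on modern compilers, C++11 “alignas” makes this largely unnecessary):
#if defined(_MSC_VER)
#define AUSSIE_ALIGN16 __declspec(align(16))
#else
#define AUSSIE_ALIGN16 __attribute__((aligned(16)))
#endif

AUSSIE_ALIGN16 static float g_scratch[4];  // 16-byte aligned on either compiler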
The AVX code for an alignment-requiring version is not much different, with minor changes to the names of the C++ intrinsics:
void aussie_avx_multiply_4_floats_aligned(float v1[4], float v2[4], float vresult[4])
{
// Use 128-bit AVX registers to multiply 4x32-bit floats...
__m128 r1 = _mm_load_ps(v1); // Aligned load of floats into 128-bits
__m128 r2 = _mm_load_ps(v2); // Aligned load (requires 16-byte alignment)
__m128 dst = _mm_mul_ps(r1, r2); // Multiply
_mm_store_ps(vresult, dst); // Aligned store back to 4 floats
}
Ideally we’d like to ensure that
the function is only called with aligned addresses
at compile-time.
The first attempt is to declare “vresult” above as “alignas(16)” for type checking
of alignment issues, but it fails for function parameters.
Fortunately, there’s another way using type attributes:
__attribute__((aligned(16)))
Another method is to define our own assertion that uses bitwise tests on the address instead:
#define is_aligned_16(ptr) ((((uintptr_t)(ptr)) & 15) == 0)
This tests that the address is a multiple of 16 using bitwise-and with 15, but the check happens at runtime and costs extra cycles.
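As a minimal sketch of this idea, the alignment macro can be wrapped in a standard “assert” call at the top of the aligned function, so that misaligned addresses fail fast in debug builds (the “_checked” function name here is just for illustration):
#include <cassert>
#include <cstdint>
#include <immintrin.h>

#define is_aligned_16(ptr) ((((uintptr_t)(ptr)) & 15) == 0)

void aussie_avx_multiply_4_floats_checked(float v1[4], float v2[4], float vresult[4])
{
    // Runtime (debug-build) check that all three addresses are 16-byte aligned
    assert(is_aligned_16(v1) && is_aligned_16(v2) && is_aligned_16(vresult));
    __m128 r1 = _mm_load_ps(v1);               // Aligned load of 4 floats
    __m128 r2 = _mm_load_ps(v2);               // Aligned load of 4 floats
    _mm_store_ps(vresult, _mm_mul_ps(r1, r2)); // Multiply and aligned store
}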
Permute and Shuffle
There are two classes of AVX instructions known as “permute” and “shuffle” operations. They’re both very similar in that they reorder data in the AVX registers. There are various ways that this can be used to optimize different types of algorithms. Generally speaking, the permute options came later, and are better:
- Shuffle — AVX-1/SSE.
- Permute — AVX-2 and AVX-512.
Some example intrinsic functions:
- _mm_shuffle_epi32 — shuffle (AVX-1)
- vpermilps — permute (AVX-2)
Was it a marketing name change? The permute and shuffle commands look very similar, except that the later commands operate on more bits. I’m not 100% sure.
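As a small example of the shuffle style (a minimal sketch, assuming SSE2 support and <immintrin.h>; the helper name is hypothetical), the immediate control operand of “_mm_shuffle_epi32” selects which source lane feeds each destination lane; here it reverses the four 32-bit lanes:
#include <immintrin.h>

__m128i aussie_reverse_4_ints(__m128i x)
{
    // _MM_SHUFFLE(3,2,1,0) is the identity ordering; (0,1,2,3) reverses the lanes
    return _mm_shuffle_epi32(x, _MM_SHUFFLE(0, 1, 2, 3));
}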
Blend Ternary Operations
The AVX “blend” operations are like a C++ ternary operator on steroids. Generally, they test a mask vector, and then choose from either of their two operands, depending on the value of a bit in the mask vector. You can see how it’s a lot like doing:
z = bit ? x : y;
Except, you know, in parallel.
For example, there are AVX blend functions such as:
__m256 ret = _mm256_blendv_ps(x, y, mask);
One of the main ways to go fast with blend is to combine it with one of the many “cmp” comparison operations. This allows a vector comparison to create a mask, where each element is either 0 or 0xFF (all 1s). The main AVX comparison functions are:
_mm256_cmp_ps(x, y, cond)
_mm256_cmp_pd(x, y, cond)
The condition or “predicate” operand can be a builtin constant, such as:
- _CMP_EQ_OQ — equality
- _CMP_LT_OQ — less-than
There are many other predicate constants with different operations and variants (ordered/unordered, signaling/quiet). The operations include: EQ (equality), LE (less-equal), LT (less-than), NEQ (not-equal), NLT (not-less-than), NLE (not-less-equal), NGT (not-greater-than), NGE (not-greater-equal), ORD (ordered), and UNORD (unordered). There are also the “no-op” conditions of FALSE for always false, and TRUE for always true.
This idea of using comparisons with blend operations has a lot of similarities to the non-SIMD CPU equivalent of ternary operators, the CMOV assembly instruction. The blend instructions are branchless logic, just like CMOV is for a single operation.
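As a minimal sketch of this compare-then-blend pattern (assuming AVX support and <immintrin.h>; the function name is just for illustration), here is a branchless elementwise minimum of two vectors of 8 floats:
#include <immintrin.h>

__m256 aussie_avx_elementwise_min(__m256 x, __m256 y)
{
    // Each lane of the mask is all-1s where x < y, and all-0s otherwise
    __m256 mask = _mm256_cmp_ps(x, y, _CMP_LT_OQ);
    // Blend picks x where the mask lane is set, y where it is not
    return _mm256_blendv_ps(y, x, mask);
}
Of course, AVX already has a dedicated minimum intrinsic (_mm256_min_ps), but the same compare-and-blend pattern generalizes to arbitrary conditional selections.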
Vectorization of Lookup Tables
The use of lookup-tables was once a powerful speed optimization, but I’m not sure they’re being used much any more. Memory is slow, and CPUs are fast. Before you assume a LUT is better, you really should benchmark it against just plain old computation, or even re-computation!
Anyway, if you’re using a LUT to trade space for speed, you can double down by adding vectorization. The AVX SIMD instruction sets include a variety of “gather” intrinsics that perform parallel array lookups from a vector of integer indices, using a base address.
The basic algorithm we’re going to use for AVX SIMD optimizations of a LUT precalculation of some mathematical function is as follows:
- Offline: Precalculate a big LUT for 24 bits with 2^24 elements using non-AVX basic C++ methods.
- Input: vector of 4 float values (AVX-1) or 8 float values (AVX-2).
- Use a cast to treat these float arrays as arrays of integers.
- Load these “int” arrays into an AVX register.
- AVX shift right by 8 with the AVX-2 “_mm_srli_epi32” intrinsic, which shifts right and adds zero bits, so that they are now 24-bit numbers in 32 bits, with a zero sign bit (hence, all indices are positive integers).
- AVX “gather” with base on the LUT array, and scale of 4 (i.e., float byte size).
- Store the AVX register results back into an array of float values.
- Output: vector of 4/8 float values with the LUT-calculated function.
Note that we can use a smaller (or bigger) LUT than 24 bits simply by modifying the bitshift counts.
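Here is a minimal sketch of the lookup step under the assumptions above (AVX-2 available, a hypothetical precomputed 2^24-entry LUT named “g_lut24” built offline, and unaligned 8-float input and output arrays):
#include <immintrin.h>

extern float g_lut24[1 << 24];  // hypothetical 2^24-entry LUT, precalculated offline

void aussie_avx2_lut24_lookup_8(const float vin[8], float vout[8])
{
    // Reinterpret the 8 float bit patterns as 8 32-bit integers
    __m256i bits = _mm256_loadu_si256((const __m256i*)vin);
    // Shift right by 8, leaving a positive 24-bit index in each 32-bit lane
    __m256i idx = _mm256_srli_epi32(bits, 8);
    // Parallel "gather" from the LUT; scale 4 is the byte size of a float
    __m256 result = _mm256_i32gather_ps(g_lut24, idx, 4);
    _mm256_storeu_ps(vout, result);
}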
LUTs with AVX Shuffle. Another way to implement a LUT in AVX is to use “shuffle” operations on another register. This only works for small lookup tables that have few enough elements to fit inside AVX registers. In other words, this can be fast, but only for 16 or 32 elements in the LUT for AVX-2, or more if you use AVX-512. This optimization is unlikely to be relevant to computing the massive 16-bit or 24-bit LUTs that we need for AI mathematical functions.
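As a minimal sketch of the small-table idea (using the older 128-bit SSSE3 byte shuffle rather than AVX-2, and a hypothetical 16-entry table of nibble popcounts), each byte lane of the index register selects one byte of the table register, giving 16 table lookups in parallel:
#include <immintrin.h>

__m128i aussie_lut16_bytes(__m128i indices)  // low 4 bits of each byte used as index
{
    // Hypothetical 16-entry LUT held in a register: popcount of each nibble 0..15
    const __m128i table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                        1, 2, 2, 3, 2, 3, 3, 4);
    // 16 parallel table lookups, one per byte lane
    return _mm_shuffle_epi8(table, indices);
}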
AVX SIMD Pointer Dereferences.
A corollary to the AVX LUT “gather” functionality
is that the gather intrinsics can possibly be used to vectorize arrays of pointers,
where the pointers are directly aimed at the data without
any intervening lookup-table.
For example, suppose we have an array of pointers to float (i.e., rather than an array of integer indices),
and we want to access these addresses
to generate the corresponding array of float.
This is analogous to using a lookup table, but with a base address of zero.
Hence, we could potentially use AVX “gather” intrinsics
with a zero base address, and the integer offsets equal to the address (i.e., the pointers converted to integer).
The x86 platform has 64-bit pointers, so 64-bit integer index offsets are required in the “gather” intrinsic.
For example, the AVX2 “_mm256_i64gather_epi32” and “_mm256_i64gather_ps” intrinsics seem to be along these lines
with 64-bit indices.
I haven’t actually tested this approach to check if it works.
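For what it’s worth, an untested sketch of the idea might look like the following (assumptions: AVX-2, 64-bit pointers, and that converting the pointers to 64-bit integers is acceptable on the platform):
#include <immintrin.h>

void aussie_avx2_deref_4_float_pointers(float* const ptrs[4], float vout[4])
{
    // Load the 4 pointers as four 64-bit integer "indices"
    __m256i addrs = _mm256_loadu_si256((const __m256i*)ptrs);
    // Gather with a zero base address and scale 1, so each lane reads *ptrs[i]
    __m128 result = _mm256_i64gather_ps((const float*)0, addrs, 1);
    _mm_storeu_ps(vout, result);
}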
Auto-Vectorization and Restricted Pointers
Modern C++ compilers attempt to automatically vectorize simple loops. Basic loop structures can be unrolled by optimizers, either partially or fully, and then mapped onto SIMD instructions automatically.
One of the most important hints to the compiler is a “restrict” designation on pointer variables.
Ironically, the benefit of restrict is to limit what you can code,
but also to allow unrestricted use of the pointers by the optimizer.
The purpose of the restrict attribute is a type specifier to tell the C++ compiler that
a given pointer or array variable is not an “alias” for any other pointer.
There are various loop transformations and vectorization optimizations
that cannot be performed if the compiler has to be conservative
and assume that aliasing could occur.
One of the main uses of restrict is on pointer or array function parameters,
because arrays are pointers in this context.
For example, if we have two pointer parameters (e.g., for vector addition),
declaring both parameters as restrict tells the compiler that neither pointer
will ever point into the other vector’s data.
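A minimal sketch of such a restrict-qualified addition function (using the “__restrict__” extension spelling, since plain “restrict” is not standard C++, as discussed below):
void aussie_vector_add(float* __restrict__ dst, const float* __restrict__ src, int n)
{
    // Because dst and src are declared non-aliasing, the compiler is free
    // to vectorize this loop without runtime overlap checks.
    for (int i = 0; i < n; i++)
        dst[i] += src[i];
}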
Note that this use of the word “aliasing” refers to two pointers referring to the same object or array (i.e., the pointers are aliases of each other). There is another unrelated but similar use of the term in C++ “aliases” for declarations, which means one function or type with two alias names.
The “restrict” keyword is merely a hint to the optimizer,
and recalcitrant C++ compilers are free to ignore the advice.
In fact, “restrict” isn’t even valid C++, because it’s part of C, but not yet in the C++ standard.
Nevertheless, various compilers support it or similar extensions like __restrict__, so it can
be used in C++ programs.
Restricted pointers don’t always need to be marked as such.
In some usages, the use of “const” can allow the compiler to infer non-aliasing of parameters,
but it probably doesn’t hurt to declare it with “restrict” as well.
Note also that the C++ compiler is free to assume non-aliasing of pointers of different types,
because it is undefined behavior if they are aliases.
This is known as the “strict aliasing rule” and this assumption can be disabled in GCC
via the option “-fno-strict-aliasing”.
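As a small, contrived illustration of the strict aliasing assumption (not one of the book’s examples), the compiler may reorder or cache the accesses below because a float* and an int* are unrelated pointer types:
float scale_then_read(float* f, int* n)
{
    *f = *f * 2.0f;      // the compiler may assume this store cannot modify *n,
    return (float)(*n);  // because float* and int* are assumed not to alias
}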
The C++ compiler doesn’t really check if you are lying (to yourself). If you tell the compiler that pointers are restricted, and then pass in two aliased pointers, the behavior of your program is “undefined” and there aren’t likely to be any compilation errors or runtime warnings. So, don’t do that.
The correct declaration of a “restrict” pointer is:
int * restrict ptr; // Correct
This is actually incorrect:
int restrict * ptr; // Wrong
restrict int * ptr; // Also wrong
The syntax for array parameters has the keyword inside the square brackets:
void myfunc(int arr[restrict]);
Read-only functions.
Note that read-only functions don’t really need to use the restrict keyword.
For example, the calculation of a vector dot product of two arrays doesn’t really have an aliasing problem,
since neither of the vectors is changed.
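For example, a simple dot product like the following (a minimal sketch) only reads through its two pointers, so aliasing between them cannot invalidate the loop transformation:
float aussie_vector_dot_product(const float v1[], const float v2[], int n)
{
    float sum = 0.0f;
    // Neither array is written, so aliasing between v1 and v2 is harmless here
    for (int i = 0; i < n; i++)
        sum += v1[i] * v2[i];
    return sum;
}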
Restricted references.
The “restrict” type specifier can be used on references, as well as pointers and arrays.
This is helpful for some of the issues with aliasing between references in pass-by-reference function parameters.
But this usage of restrict for references isn’t very important for auto-vectorization optimizations.
Restricted “this” pointer.
GCC also supports specifying that the class object “this” pointer is unaliased
by marking the function body with the “__restrict__” keyword.
This is placed after the closing right parenthesis of the function parameters (i.e., similar to a const member function declaration).
The declaration looks like:
void MyClass::myfunc(int x) __restrict__;
Overall, it’s unclear how much all these restricted pointer specifiers help the compiler to optimize, but it certainly won’t harm the performance!