
Chapter 4. Common AVX Bugs & Slugs

  • Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
  • by David Spuler, Ph.D.

Common AVX Bugs

Nobody said that AVX was easy! There are certainly plenty of great speedups, but there are also some new ways to crash your code:

  • AVX version not supported by CPU architecture (crash!).
  • Bugs in tricky AVX loop bounds and incrementers (various mistakes possible).
  • Alignment problems (usually 16-byte alignment for SSE, 32-byte for AVX/AVX2, and 64-byte for AVX-512).
  • CPU overheating (AVX instructions are heavy on the poor silicon).
  • Pointer arithmetic errors (AVX types are bigger than normal).
  • Wrongly mixing integers and floating-point numbers (they’re the same size, after all).
  • Bytewise comparison pitfalls (e.g., memcmp, vpcmpeqb, vpmovmskb, and bzhi; beware padding bytes, negative-zero, Inf/NaN floating-point values, and more).

In addition to AVX-specific bugs, there are all sorts of normal variable bugs! The AVX register variables can simply be uninitialized, or you can divide by zero, or make any number of memory mistakes. You can still use the same debugging techniques on AVX variables, and runtime checkers such as Valgrind and ASan will catch many AVX-related memory errors, so make sure to use them.
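
For example, the very first bullet above (an unsupported AVX version) is easy to guard against at runtime. Here's a minimal sketch using the GCC/Clang builtin __builtin_cpu_supports (MSVC would need __cpuidex from <intrin.h> instead; the function names are mine):

    #include <cstdio>

    // Runtime check for AVX2 support before dispatching to AVX2 kernels.
    bool cpu_has_avx2()
    {
        return __builtin_cpu_supports("avx2");  // GCC/Clang builtin
    }

    int main()
    {
        if (!cpu_has_avx2()) {
            std::fprintf(stderr, "No AVX2 support; using scalar fallback\n");
            return 1;
        }
        // ... safe to call AVX2 code paths here ...
        return 0;
    }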

Common AVX Slugs

If you do AVX correctly, your program goes much faster! But you can also accidentally slow it down, and here are some of the ways:

  • Slow memory accesses — poor cache locality of your memory lookups will slow things, no matter what AVX instructions you use (e.g., prefer contiguous data storage like arrays or vectors).
  • Alignment slugs — misaligned accesses are sometimes auto-corrected by the hardware, but then they're slower, even when they don't crash (e.g., unaligned stores that straddle a cache line).
  • Overuse of alignment-safe AVX primitives — unaligned loads and stores can be slower on some CPUs, so prefer the aligned versions when you can guarantee alignment.
  • Downclocking of AVX instructions — heavy AVX usage (especially AVX-512) can force the CPU to lower its clock frequency, undermining any overclocking you might be doing!
  • Setting AVX constants inside the loop — tune your inner loops even in AVX (a common mistake).
  • Accidental redundant AVX code — e.g., wrong logic in loop indices or tests.
  • Gather instructions are often slower — due to their poor memory access patterns.
  • Auto-vectorization prevention — compilers sometimes don’t speak AVX very well (check the assembly output).
  • Pointers or arrays not declared with “restrict” semantics — throw your poor compiler a bone (see the sketch after this list).
  • Too much prefetching — _mm_prefetch() can be a slowdown if overused.
  • Lookup tables can be a de-optimization — benchmark against raw computation.
  • Caching can be a slug — for the same reasons, benchmark caching against recomputation.
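
To illustrate the “restrict” bullet, here's a hypothetical sketch. Note that __restrict is a compiler extension (supported by GCC, Clang, and MSVC); standard C++ has no restrict keyword:

    // Promising the compiler that a, b, and out never alias
    // lets it auto-vectorize the loop without alias checks.
    void vector_add(const float* __restrict a,
                    const float* __restrict b,
                    float* __restrict out, int n)
    {
        for (int i = 0; i < n; i++) {
            out[i] = a[i] + b[i];
        }
    }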

Loop Invariant Code Hoisting

AVX statements can be misplaced like any other statements. Can you spot the slug in this code?

    void aussie_vec_mult_scalar_AVX1_sluggy(float v[], int n, float c)
    {
        for (int i = 0; i < n; i += 4) {  // 4 floats per iteration (n must be a multiple of 4)
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load 4 floats (unaligned-safe)
            __m128 rscalar = _mm_set1_ps(c);  // Vector of scalars
            __m128 dst = _mm_mul_ps(r1, rscalar); // Multiply by scalars
            _mm_storeu_ps(&v[i], dst);  // Store 4 floats (unaligned-safe)
        }
    }

The fixed code hoists the constant operation out of the loop, since its value doesn't change across iterations:

    void aussie_vector_mult_scalar_AVX1(float v[], int n, float c)
    {
        const __m128 rscalar = _mm_set1_ps(c);  // Hoisted!!
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load 4 floats
            __m128 dst = _mm_mul_ps(r1, rscalar); // Multiply by scalars
            _mm_storeu_ps(&v[i], dst);   // Store 4 floats
        }
    }

Accidental Redundant Computations

This is buggy and sluggy code. Can you see the bug? It’s hidden by “code blindness” because of what C++ programmers are used to seeing.

    void aussie_vector_multiply_scalar_AVX2(float v[], int n, float c)
    {
        const __m256 rscalar = _mm256_set1_ps(c);  // Vector of scalars
        for (int i = 0; i < n; i++) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats
            __m256 dst = _mm256_mul_ps(r1, rscalar); // Multiply by scalars
            _mm256_storeu_ps(&v[i], dst);  // Store 8 floats
        }
    }

The bug is “i++”: it should really be “i += 8” to stride through the vector eight floats at a time. This is the type of bug that can happen in any SIMD kernel. Depending on the function, it can be an outright bug, or an insidious slug whereby the same computations are done over and over, losing all benefit of the AVX vectorized instructions.
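
Here is one corrected sketch, with the stride fixed and a scalar tail loop added to handle the case where n is not a multiple of 8 (an assumption the original quietly makes); the “_fixed” suffix is mine:

    void aussie_vector_multiply_scalar_AVX2_fixed(float v[], int n, float c)
    {
        const __m256 rscalar = _mm256_set1_ps(c);  // Vector of scalars
        int i = 0;
        for (; i + 8 <= n; i += 8) {               // 8 floats per iteration
            __m256 r1 = _mm256_loadu_ps(&v[i]);    // Load 8 floats
            __m256 dst = _mm256_mul_ps(r1, rscalar);
            _mm256_storeu_ps(&v[i], dst);          // Store 8 floats
        }
        for (; i < n; i++) {
            v[i] *= c;                             // Scalar tail loop
        }
    }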

Too Much AVX

What do you think of this AVX routine to clear a vector in parallel? Here’s the unoptimized code:

    void aussie_vector_clear_AVX2(float v[], int n)
    {
        const __m256 rzeros = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            _mm256_storeu_ps(&v[i], rzeros);  // Store 8 zeros
        }
    }

Umm, yeah. Do you think I like AVX maybe a little too much? How about:

    std::memset(v, 0, n * sizeof(float));  // requires <cstring>

Don’t worry. The standard library implementers are certainly using something better than a loop of AVX calls inside memset.

List of AVX Optimization Tricks

A lot of these ideas are covered in other parts of the book. However, here’s a convenient list of some of the major techniques:

  • Unroll loops manually (reduce loop overhead and have fewer branches).
  • Use “double unrolling” of loops (unroll once to AVX, then unroll those AVX instructions, too!)
  • Parallel accumulators (with single or double unrolled loops) — see the dot-product sketch after this list.
  • Avoid data dependencies for “out-of-order” execution (parallel accumulators; split integer versus floating-point arithmetic, etc.)
  • Fused Multiply-Add (FMA) is fast (and a pleasure to use).
  • Use “alignas” to maintain alignment.
  • Use “broadcast” of constants (e.g., _mm256_set1_ps()).
  • Manual prefetching (e.g., _mm_prefetch).
  • Masked operations (useful branchless coding trick).
  • Optimizer architecture flags such as “-march” for GCC/Clang.
  • Compare memcmp versus vpcmpeqb, vpmovmskb, and bzhi (use with care!).
  • Use permute and shuffle primitives to reorder data.
  • Store data with “streaming stores” via _mm256_stream_ps().
  • Use vector sizes that are a multiple of loop unroll (or pad with zeros).
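
Several of these ideas combine naturally. Here's a hedged sketch of a dot product using double unrolling (two AVX registers per iteration), two parallel accumulators to break the data dependency chain, and FMA; it assumes compilation with -mavx2 -mfma on GCC/Clang, and that n is a multiple of 16 for brevity:

    #include <immintrin.h>

    float dot_product_avx2_fma(const float* __restrict a,
                               const float* __restrict b, int n)
    {
        __m256 acc0 = _mm256_setzero_ps();  // Parallel accumulator #1
        __m256 acc1 = _mm256_setzero_ps();  // Parallel accumulator #2
        for (int i = 0; i < n; i += 16) {   // Double-unrolled: 2 x 8 floats
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(&a[i]),
                                   _mm256_loadu_ps(&b[i]), acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(&a[i + 8]),
                                   _mm256_loadu_ps(&b[i + 8]), acc1);
        }
        __m256 acc = _mm256_add_ps(acc0, acc1);   // Combine accumulators
        // Horizontal reduction of the 8 lanes down to one float:
        __m128 sum4 = _mm_add_ps(_mm256_castps256_ps128(acc),
                                 _mm256_extractf128_ps(acc, 1));
        sum4 = _mm_hadd_ps(sum4, sum4);
        sum4 = _mm_hadd_ps(sum4, sum4);
        return _mm_cvtss_f32(sum4);
    }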

Special issues with some AVX instructions:

  • _mm256_setr_ps is slow (it acts like a gather, typically compiling to multiple scalar loads and shuffles)
  • bzhi (zero the high bits; handy for masking tail elements)
  • tzcnt (trailing zero count; handy for scanning comparison masks)
  • _mm256_blendv_ps can help branchless programming, but blends can also be slow (see the sketch after this list)
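
As a quick example of the blendv bullet, here's a hypothetical branchless sketch that clamps negative elements to zero (a vectorized ReLU); in real code, benchmark it against the simpler _mm256_max_ps approach:

    #include <immintrin.h>

    void aussie_vector_relu_AVX2(float v[], int n)  // n must be a multiple of 8
    {
        const __m256 rzeros = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            __m256 x = _mm256_loadu_ps(&v[i]);
            __m256 mask = _mm256_cmp_ps(x, rzeros, _CMP_LT_OQ); // lanes where x < 0
            __m256 dst = _mm256_blendv_ps(x, rzeros, mask);     // pick 0 where masked
            _mm256_storeu_ps(&v[i], dst);
        }
    }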

General low-level coding optimization tricks that also apply to AVX programming:

  • Cache locality
  • Cache lines
  • Avoid false sharing (multithreaded code)
  • Prefetching
  • Cache warming
  • Branchless coding
  • Reduce data sizes
  • Pack data together
  • Prefer contiguous data
  • Prefer Structure-of-Arrays (SoA) over Array-of-Structures (AoS) — see the layout sketch after this list.
  • Use transpose tricks in matrix multiplication (contiguous data)
  • Avoid or reduce memory allocations (e.g., preallocation, memory pools)
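
To make the SoA-versus-AoS bullet concrete, here's a hypothetical layout sketch. With SoA, each coordinate is contiguous, so a single _mm256_loadu_ps grabs eight x-values at once; with AoS, the x-values are strided and need slow gathers or shuffles:

    struct PointAoS { float x, y, z; };  // Array-of-Structures: interleaved
    PointAoS points_aos[1024];           // x-values are 12 bytes apart

    struct PointsSoA {                   // Structure-of-Arrays: contiguous
        float x[1024];                   // one AVX load = 8 x-values
        float y[1024];
        float z[1024];
    };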

Tools and commands to use:

  • Check the compiler’s assembly output (e.g., “gcc -S”)
  • Use optimizer settings such as “-O” and “-march” flags
  • Check for memory errors with Valgrind (Memcheck) and ASan
  • Profile low-level performance with “perf” or Intel VTune
  • Linux kernel optimizations (e.g., “noatime” in /etc/fstab)

Useful third-party libraries to consider for their AVX SIMD methods:

  • xsimd (header-only library)
  • VCL (Vector Class by Agner Fog)
  • Eigen (linear algebra)
  • Highway (high-performance SIMD)
  • SIMDe (portable SIMD operations)

But only look at the libraries if you don’t want the fun of coding AVX yourself!

 
