Chapter 3. CPU Platform Detection
Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
by David Spuler, Ph.D.
Portability Checking of AVX Versions
AVX support has changed over the years, with different CPUs having different capabilities, not only across AVX, AVX-2, and AVX-512, but also across their sub-releases. The future is also a little unclear, with reports that some of the newer Intel chips have AVX-512 disabled.
If you write some code using AVX-512 intrinsics, and compile your C++ into an executable with the AVX-512 flags on, and then it runs on a lower-capability CPU without AVX-512, what happens? Do the AVX-512 intrinsics fail, or are they simulated somehow so that they’re slower but still work?
Answer: kaboom on MSVS.
In the MSVS IDE, if you try to call these intrinsics on a CPU that doesn't support them, you get "unhandled exception: illegal instruction." In other words, the C++ compiler still emits the AVX-512 instruction codes, but they aren't valid on that CPU, so the program faults at runtime.
Hence, the calls to AVX-512 are not emulated at run-time on lower-capability CPUs. And they aren’t checked, either. That’s up to you!
Preprocessor Macro Tests
Firstly, you cannot generally use the preprocessor to decide what version of AVX you have (if any). This only works if:
1. There’s only one platform, and
2. You’re compiling on (or for) the same platform that will run the binary.
In other words, it’s either you and your one box doing everything, or else you’re carefully maintaining lots of different executable binaries for each platform.
Note that you can modify the default CPU platform target via compiler mode settings. During compilation, you can either take whatever platform you’re on, or you can modify the setting with compiler flags for different compile-time platform effects:
-mavx — GCC/Clang compiler
-march=native — GCC/Clang compiler
/arch:AVX — MSVC compiler
/arch:AVX2 — MSVC compiler
In those limited circumstances, you can use the builtin preprocessor macros:
__AVX__
__AVX2__
__AVX512F__
There are also the SSE versions of these macros:
__MMX__
__SSE__
__SSE2__
__SSE3__
__SSE4_1__
__SSE4_2__
There are also some macros for specific types of CPU functionality or individual machine codes:
__FMA__ — fused multiply-add.
__BMI__ — bit manipulation instructions.
__POPCNT__ — popcount (set bits count instruction).
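As a quick illustration, here is a minimal sketch (not from any particular codebase) that reports which instruction set a binary was compiled for, using these predefined macros. Note that this only reflects the compiler flags used at build time, not what the CPU actually running the program supports:

#include <cstdio>

int main() {
    // Reports what the *binary* was compiled for, based on compiler flags,
    // not what the CPU it eventually runs on actually supports.
#if defined(__AVX512F__)
    std::printf("Compiled with AVX-512F enabled\n");
#elif defined(__AVX2__)
    std::printf("Compiled with AVX2 enabled\n");
#elif defined(__AVX__)
    std::printf("Compiled with AVX enabled\n");
#else
    std::printf("Compiled without AVX\n");
#endif
    return 0;
}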
If you’re also supporting non-AVX platforms, your AVX code should probably have a check like this somewhere:
#if defined(_M_ARM) || defined(_M_ARM64) || defined(_M_HYBRID_X86_ARM64) \
    || defined(_M_ARM64EC) || defined(__arm__) || defined(__aarch64__)
#error AVX not supported on ARM platform
#endif
Source: GGML AI inference backend open-source code (see Appendix for license details).
Runtime CPU Feature Checking
In general, for shipping a binary to customers, you can’t test #if or #ifdef for whether you’ve got AVX-512 in the CPU or not. You can use the preprocessor to distinguish between different platforms where you’ll compile a separate binary (e.g., ARM Neon for phones or Apple M1/M2/M3 chipsets).
Preprocessor checks can help with the non-AVX platforms, but not so much on x86 CPUs. You cannot choose between AVX, AVX-2, and AVX-512 at compile-time, unless you really plan to ship three separate binary executables. Well, you probably could do this if you really, really wanted to. Go ahead, prove me wrong!
The other thing you don’t really want to do is low-level testing of capabilities. You don’t want to test a flag right in front of every AVX-512 intrinsic call; otherwise, you’ll lose most of the speedup benefits. Instead, you want this test done much higher up, and then have multiple versions of the higher-level kernel operations (e.g., vector add, vector multiply, vector dot product, etc.).
CPUID Instruction
Given the preprocessor limitations, it is important to check your CPU platform has the AVX support that you need. What this means is that you have to check in your runtime code what the CPU’s capabilities are, at a very high level in your program, usually during initialization.
Fortunately, every CPU has a builtin machine-code instruction called “CPUID” that is very fast and provides this information. The main features of CPUID include:
1. It’s a hardware opcode! (fast), and
2. The bit flags are very obscure, and therefore
3. Using it directly is a real pain.
The main way to do this is via one of several possible “cpuid” intrinsic functions at program startup.
There are several versions of this non-standard C++ intrinsic:
cpuid — the main CPU instruction.
__cpuid() — basic CPU information (MSVC)
__cpuidex() — extended information (MSVC)
__get_cpuid() — GCC/Clang version in <cpuid.h>
__cpuid_count() — also GCC/Clang, but more specific.
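As a hedged sketch of the raw approach, here is one way to query the AVX-512 Foundation feature bit (CPUID leaf 7, sub-leaf 0, EBX bit 16), using __cpuidex on MSVC and __get_cpuid_count on GCC/Clang. The function name cpu_has_avx512f is just an illustrative choice, and a fully robust check would also confirm operating system support for the wider registers (OSXSAVE/XGETBV), which is omitted here:

#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// Returns true if the CPU reports the AVX-512 Foundation feature bit
// (CPUID leaf 7, sub-leaf 0, EBX bit 16). OS-support checks are omitted.
bool cpu_has_avx512f() {
#if defined(_MSC_VER)
    int regs[4] = { 0 };                  // EAX, EBX, ECX, EDX
    __cpuid(regs, 0);                     // EAX = highest supported standard leaf
    if (regs[0] < 7) return false;        // leaf 7 not available
    __cpuidex(regs, 7, 0);                // leaf 7, sub-leaf 0
    return (regs[1] & (1 << 16)) != 0;    // EBX is regs[1]
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return false;                     // CPU does not support leaf 7
    }
    return (ebx & (1u << 16)) != 0;
#endif
}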
GCC also has a more user-friendly version without any bit flags needed:
__builtin_cpu_supports("NAME") — look up CPU features by name (e.g., “sse”).
int _may_i_use_cpu_feature(unsigned __int64 a) — an old version.
The GCC version is current and quite easy to use. The other one looks like a bad AI hallucination, but it’s in some 2022 Intel documentation, so best of luck with that.
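For comparison, a minimal sketch of the GCC/Clang builtin; the feature names are lowercase strings such as “avx2” and “avx512f”, and the have_avx512 flag here is just an illustrative global:

// GCC/Clang only; not available in MSVC.
static bool have_avx512 = false;

void detect_cpu_features() {
    // Feature names are lowercase, e.g. "sse4.2", "avx", "avx2", "avx512f".
    if (__builtin_cpu_supports("avx512f")) {
        have_avx512 = true;
    }
}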
Then you have a dynamic flag that specifies whether you have AVX-512 or not, and you can then choose between an AVX-2 dot product or an AVX-512 dot product, or whatever else, during execution. Obviously, it gets a bit convoluted when you have to dynamically choose between versions for AVX, AVX-2 and AVX-512, not to mention all the AVX sub-capabilities and also AVX-10 coming soon!
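As a sketch of that kind of high-level dispatch (the kernel names, g_vec_dot, and init_cpu_dispatch below are hypothetical, not taken from any particular library), you pick the kernel once at startup and every later call goes through a function pointer:

#include <cstddef>

// Stand-in kernels: in real code each would live in its own translation
// unit, compiled with the matching -mavx2 / -mavx512f flags and written
// with the corresponding intrinsics. These scalar bodies are placeholders.
static float vec_dot_basic(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
static float vec_dot_avx2(const float* a, const float* b, std::size_t n) {
    return vec_dot_basic(a, b, n);   // placeholder for an AVX2 kernel
}
static float vec_dot_avx512(const float* a, const float* b, std::size_t n) {
    return vec_dot_basic(a, b, n);   // placeholder for an AVX-512 kernel
}

// Function pointer chosen once at startup.
static float (*g_vec_dot)(const float*, const float*, std::size_t) = vec_dot_basic;

void init_cpu_dispatch() {
#if defined(__GNUC__) || defined(__clang__)
    if (__builtin_cpu_supports("avx512f")) {
        g_vec_dot = vec_dot_avx512;
    } else if (__builtin_cpu_supports("avx2")) {
        g_vec_dot = vec_dot_avx2;
    }
#else
    // On MSVC, use a CPUID-based test instead (e.g., cpu_has_avx512f() above).
#endif
}

// Callers go through the dispatcher; the capability check is not
// repeated in front of every intrinsic call.
float vec_dot(const float* a, const float* b, std::size_t n) {
    return g_vec_dot(a, b, n);
}

Calling init_cpu_dispatch() once during program initialization keeps the capability test out of the inner loops, which is the whole point of doing the detection at a high level.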
References
- Microsoft, 2021, __cpuid, __cpuidex, https://github.com/MicrosoftDocs/cpp-docs/blob/main/docs/intrinsics/cpuid-cpuidex.md
- Microsoft, April 2025, DirectXMath: DirectXMath is an all inline SIMD C++ linear algebra library for use in games and graphics apps, https://github.com/microsoft/DirectXMath/blob/main/Extensions/DirectXMathAVX.h
- Agner Fog, 2023, version2: Vector Class Library, latest version, https://github.com/vectorclass/version2/blob/master/instrset_detect.cpp, https://github.com/vectorclass/version2/blob/master/instrset.h
- Wikipedia, July 2025 (accessed), Advanced Vector Extensions, https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
- Stack Overflow, 2013, Intrinsics for CPUID like informations?, https://stackoverflow.com/questions/17758409/intrinsics-for-cpuid-like-informations
- GNU, July 2025 (accessed), x86 Built-in Functions, https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html
- Intel, 2022, Intel® C++ Compiler Classic Developer Guide and Reference, https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/overview.html, PDF: https://cdrdv2.intel.com/v1/dl/getContent/767250?fileName=cpp-compiler_developer-guide-reference_2021.8-767249-767250.pdf
- GGML, July 2025 (accessed), llama.cpp: LLM inference in C/C++, https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cpu/arch/x86/cpu-feats.cpp (Testing the CPU platform.)