Chapter 6. CUDA Compilers and Optimizers

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler

Compiler Optimization Options

Like most robust C++ compilers, the nvcc CUDA C++ compiler has a “-O” optimization flag, but it goes further: it also has separate options for optimizing device code. The nvcc compiler supports several different kinds of optimization flags:

  • Host code optimization flags: -O or --optimize, with a numeric level (e.g., -O3).
  • Device code optimization flags: --dopt (short form -dopt), with a kind (e.g., -dopt=on).
  • Link-time optimization flag: --dlink-time-opt (short form -dlto).

NVCC compiler control of code optimization settings and GPU modes can also be useful to optimize the generated code (an example build command follows the list):

  • Flush-to-Zero (FTZ) mode: faster floating-point computations can be turned on with the compiler flag “-ftz=true”.
  • Fast math mode: set with “--use_fast_math”.
  • Optimizer level: use the “--optimize” flag.
  • Device code optimization level: set with the “--dopt” flag.
  • Lower-precision division: “-prec-div=false”.
  • Lower-precision square root: “-prec-sqrt=false”.
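As an illustration only (the file name is hypothetical, and flag spellings should be checked against your nvcc version), a release build combining several of these options might look like:

    nvcc -O3 -dopt=on -ftz=true -prec-div=false -prec-sqrt=false \
        -o mykernel mykernel.cu

Note that “--use_fast_math” implies “-ftz=true”, “-prec-div=false”, and “-prec-sqrt=false” (plus fused multiply-add), so it is a broader switch than setting those flags individually.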

Disable Compiler Debug Flags. The “-g” (host debug) and “-G” (device debug) compiler options to nvcc should be disabled for performance. The issue is not so much that the executable contains extra symbol naming information, but that these options suppress a wide variety of auto-optimizations that would otherwise be performed by the compiler. The effect is smaller when “-g” is passed through to gcc for the host code, but it is especially pronounced for “-G” on kernel code.

People Helping Parsers

The humble C++ compiler needs your attention. Hat in hand, the compiler is sitting there saying “I am but a poor, helpless lexer, without even a single neural network. Please help me.” Hence, please consider donating your time to help a poor struggling compiler in your neighborhood.

There is a long history of the C++ compiler needing “hints” about optimization from the programmer. The early C++ language in the 1990s had a “register” specifier that hinted to the compiler that a variable was going to be heavily used, and that the compiler should optimize it by putting the variable in a CPU register. The “register” keyword was deprecated in C++11 and removed in C++17, which indicates that compiler register allocation algorithms no longer benefit from human help.

Some of the other longstanding C++ keywords that can be used for efficiency-related purposes include (a small sketch follows the list):

  • inline
  • const
  • static
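For instance, a small sketch of the classical usage (the names are illustrative):

    inline float squaref(float x) { return x * x; }  // hint: expand calls inline
    static const int kBlockSize = 256;  // internal linkage, value known to the compiler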

And with the evolving C++ standards, there’s a whole new set of directives that are hints to the compiler about how to optimize (a kernel sketch using two of these follows the list):

  • constexpr
  • constinit
  • consteval
  • reinterpret_cast
  • restricted pointers (“__restrict__”)
  • [[likely]] and [[unlikely]] path attributes (C++20)
  • [[fallthrough]] (C++17)
  • [[expects]], [[ensures]], [[assert]] (contract attributes proposed for C++20, but not adopted into the standard)
  • __assume (CUDA 11.2)
  • __builtin_assume (CUDA 11.2)
  • __builtin_assume_aligned (CUDA 11.2)
  • __builtin_unreachable (CUDA 11.3)
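Here is a minimal kernel sketch (the kernel and parameter names are assumptions, and it presumes a recent nvcc compiling with C++20 support) showing two of these hints in device code:

    // __builtin_assume gives the compiler a fact it cannot prove on its own;
    // [[likely]] marks the branch that most threads are expected to take.
    __global__ void scale_kernel(float *data, int n, float factor)
    {
        __builtin_assume(n % 256 == 0);  // promise: n is a multiple of the block size
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) [[likely]] {
            data[i] *= factor;
        }
    }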

Note that these capabilities are generally available in both device kernels and host code. Various additional capabilities are available in host-only code, using the underlying host compiler, such as additional GCC builtin primitives (a host-only sketch follows the list):

  • __builtin_prefetch (control data fetching)
  • __builtin___clear_cache
  • __builtin_alloca_with_align_and_max
  • __attribute__((fallthrough))
  • __attribute__((assume(expression)))
  • __attribute__((cold)), __attribute__((hot)), __attribute__((unused))
  • __builtin_expect
  • __builtin_expect_with_probability
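As a host-only sketch (the function is hypothetical), __builtin_expect can be used as a branch hint when the uncommon path is known:

    // Tell gcc that a non-zero error code is the rare case, so the loop is
    // laid out with the "no error" path as the fall-through branch.
    int count_errors(const int *codes, int n)
    {
        int errors = 0;
        for (int i = 0; i < n; i++) {
            if (__builtin_expect(codes[i] != 0, 0)) {
                errors++;
            }
        }
        return errors;
    }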

The constexpr and related directives help the compiler do “constant folding” and “constant propagation” to compute as much as possible at compile-time, thereby avoiding runtime cost for a lot of the code. In fact, the idea is extended to its logical extreme, whereby you can declare an entire function as “constexpr” and then expect the poor compiler to interpret the whole mess at compile-time. Pity the overworked compiler designers.

The “__restrict__” pointer declarations help the compiler with advanced optimizations like loop unrolling and vectorization by telling the compiler to ignore potential “aliasing” of pointers, allowing much more powerful code transformations on loops. The restricted pointer optimizations are actually of more interest than constexpr for AI development. Restricted pointers are standard in C99 (the “restrict” keyword), but remain a non-standard extension in C++, where most compilers, including nvcc, accept the “__restrict__” spelling. The possible benefit for C++ AI engines is that restricted pointer specifications might help the compiler do auto-vectorization of loops into parallel hardware-assisted code.
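For example, here is a minimal kernel sketch (the names are illustrative) using restricted pointers; the __restrict__ qualifiers promise the compiler that the two arrays never overlap:

    __global__ void axpy_kernel(float * __restrict__ out,
                                const float * __restrict__ in,
                                float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Because out and in cannot alias, the compiler is free to
            // reorder and batch these loads and stores.
            out[i] = a * in[i] + out[i];
        }
    }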

How much do these help? It’s rather unclear, and the compiler is free to simply ignore these hints. Compilers already did a lot of constant propagation optimizations before the “constexpr” directives came along, so presumably compiler designers have upped their game even further now.

Optimizer-Triggered Glitches

Here’s a pro tip: don’t turn the optimizer on the night before you ship!

And there’s a corollary, too: don’t disable the “-g” and “-G” debug flags for the compiler just before you ship. That’s because these flags suppress various nvcc optimizations, so if you suddenly remove them, it’s the same as enabling a much higher level of optimization.

You need to run the optimizer regularly in your build, since it might shake out a few bugs. This is far less embarrassing if it’s only in front of the team. In fact, it can be a desirable self-testing method to run your kernel against its unit tests and regressions with all sorts of different optimization levels enabled.

There are a large number of obscure coding errors that can be triggered by higher levels of optimization, even though the code may run fine without optimization, or in the interactive debugger. Memory errors have to be top of mind, because the optimizer may change the way in which variables are arranged in memory, or how memory management is optimized. The usual culprits include:

  • Use of uninitialized memory (e.g., from new or malloc)
  • Use of already-deallocated memory (e.g., after delete or free)
  • Dangling pointers
  • Buffer overruns
  • Doubly-deallocated memory
  • Deallocating non-allocated memory
  • Returning the address of a local stack variable or buffer.

Less commonly, there are also various arithmetic oddities whose results are not strictly guaranteed by the C++ language, and which may be changed subtly by optimization. Some examples to consider if you cannot find a memory problem include (a tiny sketch of the last case follows the list):

  • Integer division or modulus by negative integers.
  • Right-bitshift on a signed negative integer.
  • Order of evaluation undefined behavior (e.g., a side-effect like ++ applied to a variable that is also read elsewhere in the same expression, where the result depends on which operand of a binary operator is evaluated first).
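A tiny sketch of the last pitfall (the function is hypothetical): the variable i is both read and incremented within the one expression, so the result can legitimately differ between optimization levels:

    int order_pitfall(const int *a)
    {
        int i = 1;
        return a[i] + a[i++];  // unsequenced read and write of i: undefined behavior
    }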

So, if the code suddenly breaks when you turn up the optimization level, it’s more likely to be a latent bug in your C++ code, rather than an optimizer bug. Nobody to blame but yourself!

Compiler Auto-Optimization

Your compiler is trying to do its best. If it sees “3+5” in your code, then it’s going to change that to “8” as it’s spinning along. That’s called “constant folding” and it is one of the many techniques that the “optimizer” part of your compiler uses.

If you write “sqrt(3.14)” then the compiler could pre-compute that, too. This is going beyond constant folding to precomputing all constant expressions at compile-time, based on constexpr and related specifiers. I’m not sure that all C++ compilers do this yet, but the state of the art is continually improving.

What optimization techniques does your compiler use automatically? Here is an overview of some of the various methods whereby compilers emit better low-level instructions during their code generation phase. There is an extensive body of research on how compilers can auto-optimize your code, with some of the major techniques including:

  • Constant folding
  • Constant propagation
  • Operator strength reduction
  • Algebraic identities in expressions
  • Compile-time expression evaluation
  • Compile-time function execution
  • Common subexpression elimination
  • Dead code elimination (unreachable code)
  • Loop optimizations (unrolling, hoisting)

Constant Folding

Constant folding is the name of the optimization where the compiler automatically merges constant expressions at compile-time during code generation. For example, assume your code has:

    x = 3 + 4;

The compiler should “fold” the two constants (3 and 4) by doing the “+” operation at compile-time, and then automatically generate code without the “+” operation. Effectively, it should execute as if you’d written:

    x = 7;

So, how good are the compilers? The answer is that they’re getting pretty amazing, but it’s not always clear. Here are some cases of constant folding to consider (a small sketch follows the list):

  • sizeof is a constant. Any use of the sizeof operator in an expression should be treated as an integer constant. If your code says “8*sizeof(int)”, then it’s hopefully folded to give “32” at compile-time (assuming a 4-byte int).
  • Named constants. If you declare “const int n=3”, then hopefully all subsequent uses of “n” will be folded as if they said “3”.
  • Named macro constants. These are handled trivially, because the preprocessor substitutes the literal text before compilation.
  • Type casts: Hopefully your compiler propagates types correctly while doing constant folding.
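A small sketch of these cases (the variable names are illustrative), with the values the optimizer should produce at compile-time:

    const int n = 3;
    size_t nbits = 8 * sizeof(int);  // sizeof folds: 32 (assuming 4-byte int)
    int    n10   = n * 10;           // named constant propagates and folds: 30
    float  fval  = (float)(1 / 2);   // integer division folds to 0, cast yields 0.0f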

What about some harder cases? Consider this code with an intrinsic math function:

   const float scalefactor = sqrtf(2.0f * 3.14159f);

Can the compiler fold this at compile-time? Surely it should do “2.0f*3.14159f” correctly. But what about the sqrtf calculation? It’s theoretically possible, but I’m far from certain. I’d be inclined to declare it as a global constant, or a local “static” constant, so as to ensure it only gets pre-calculated at most once:

   static const float scalefactor = sqrtf(2.0f * 3.14159f);

Now, since it’s declared as “static” (and “const”), hopefully the executable code would only compute it once at program startup, even if it wasn’t fully folded. Another faster alternative that ensures it’s not calculated even once is to work out the formula yourself, and just put in the numeric result as a hard-coded floating-point constant.

   static const float scalefactor = 2.506627216f; // sqrtf(2.0f * 3.14159f);

Constant Propagation

Constant propagation is the compiler optimization whereby constants are “propagated” through expressions and control flow. This is a longstanding area of research and the technique is done by many compilers. For the basic idea of constant propagation, consider this code:

    x = 3;
    y = x;

The idea is to “propagate” the constant value of 3 through the expressions, yielding the faster code:

    x = 3;
    y = 3;  // Propagated
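Propagation is most useful because it feeds further folding; a small sketch:

    int x = 3;
    int y = x * 2 + 1;  // propagate x = 3, then fold: y = 7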

Constant Expression Evaluation

The idea of “constant expression evaluation” is to generalize the older techniques of constant folding and constant propagation to their logical extreme. Whenever a value can be computed at compile-time, then the optimizer should attempt to do so. For example, consider the code:

    f = sqrtf(3.14f);

In theory, this can be computed at compile-time, because it has a fixed value, and the return value of the sqrtf function only depends on its input. If the compiler is advanced enough, it can “interpret” or “evaluate” the value of the sqrtf function from the constant argument (3.14f), and replace this function call with the numeric result.

The “constexpr” C++ directive is one manner in which programmers attempt to guide the compiler as to what expressions or functions it can compute at compile-time. In newer C++ standards (C++23 and later), math functions such as sqrtf can be declared constexpr, so you may see that property in your system header files. Whether the compiler actually evaluates this at compile-time is another matter, but the capabilities of optimizers are moving very rapidly.

The constexpr property can be applied to your own declared functions. In theory, the compiler could then evaluate at compile-time any calls to your function that are passing in a constant value. This becomes like constant propagation on steroids.
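For example, here is a minimal sketch (the names are illustrative) of a user-defined constexpr function that the compiler can evaluate entirely at compile-time when its arguments are constants:

    // Ceiling division, usable at compile-time for grid size calculations.
    constexpr int ceil_div(int n, int block)
    {
        return (n + block - 1) / block;
    }

    constexpr int kNumBlocks = ceil_div(1000, 256);  // evaluated at compile-time: 4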

Algebraic Identities

The compiler knows about all sorts of algebra! The calculations in some complicated expressions can be reduced by transforming the expression into another equivalent form. The aim when using algebraic identities is to group the operations differently, to reduce the total number of arithmetic operations. Care must be taken to ensure that the new expression has equivalent meaning. For example, the short-circuiting of the logical operators can cause differences. Some useful algebraic identities are:

    2 * x == x + x == x << 1
    a * x + a * y == a * (x + y)
    -x + -y == -(x + y)

There are also Boolean algebraic identities that can be used to perform fewer logical operations:

    (a && b) || (a && c) == a && (b || c)
    (a || b) && (a || c) == a || (b && c)
    !a && !b == !(a || b)
    !a || !b == !(a && b)
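A small sketch (hypothetical function) of applying the distributive identity by hand; note that with floating-point values the two forms can round slightly differently:

    float factored(float a, float x, float y)
    {
        return a * (x + y);  // one multiplication instead of two, versus a*x + a*y
    }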

Common Subexpression Elimination

Common subexpression elimination (CSE) is avoiding the recomputation of the same expression twice. There are many cases where the same computation appears multiple times in a single expression, or across the control flow of a program. Compiler optimizers attempt to automatically detect such cases and reuse the first computation.

In a complicated expression, there are often repeated sub-expressions. These are inefficient as they require the computer to calculate the same value twice or more. To save time, calculate the sub-expression first and store it in a temporary variable. Then replace the sub-expression with the temporary variable. For example:

    x = (i * i) + (i * i);

With a temporary variable, this becomes:

    temp = i * i;
    x = temp + temp;

Note that this attempt to be concise is incorrect:

    x = (temp = i * i) + temp; // Bug

This may fail because of its reliance on the order of evaluation of the + operator. It is not actually guaranteed in C++ that the operands of the + operator are evaluated left-to-right.

Common sub-expressions do not occur only in single expressions. It often happens that a program computes the same thing in subsequent statements. For example, consider the code sequence:

    if (x > y && x > 10) {
        // ...
    }
    if (x > y && y > 10) {
        // ...
    }

The Boolean condition “x>y” need be calculated only once:

    temp = (x > y);
    if (temp && x>10) {
        // ...
    }
    if (temp && y>10) {
        // ...
    }

 
