Aussie AI

Chapter 15. Compile-Time Optimizations

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler

CUDA C++ Compile-time Techniques

Compile-time processing is the optimal way to run a program. All the work is done by the compiler and none by your program. There are literally zero instructions executed on the CPU or GPU at runtime, whether it’s doing training or inference. It will be blindingly fast for your users.

If only all code could be like that!

The reality is that programmers are still needed and that code still needs to run (sigh!). But to make it faster, there are lots of ways to have more computation done by the compiler, long before it ever goes near a user.

The C++ programming language has numerous features that help perform work at compile-time. These include ways to explicitly control what goes to the compiler, or to give more information to the compiler so that its optimizer can do good work on your behalf. Some of the various C++ language features to consider include:

  • Conditional compilation — #if/#ifdef statements.
  • inline functions.
  • Loop unrolling — #pragma unroll.
  • Templates — these expand at compile-time.
  • Symbolic constants — const or #define.
  • Function-like macros — #define with parameters.
  • Constant hints — constexpr, if constexpr, etc.
  • Global and static variable initializations.
  • static data members — fixed data in C++ classes.
  • Type traits — compile-time type testing.
  • Restricted pointers — avoid aliasing-related slowdowns.

Conditional Compilation

The CUDA C++ preprocessor can be used to simply remove code from the program. You can just go crazy and put #if directives around any statements you like, and your program will definitely run a lot faster, but this idea is more typically used to remove:

  • Debug code
  • Self-testing code
  • Assertions
  • Tracing code

The way to remove such code is to define the preprocessor macros (e.g., using the “-D” option for nvcc in a Makefile), or more precisely to not define macros like “DEBUG” or “SELFTEST” or whatever you’ve used. For the builtin assert macro, you can define the “NDEBUG” (no debug) macro flag to compile them out.

It’s a policy decision on whether to leave self-testing code in for production or not. There’s a supportability benefit to leaving it in, but a performance cost to doing so. If you want a lot of meetings, ask the Marketing Department what to do about that.

Inline Functions

Placing the keyword “inline” before any function declarations makes that function instantly disappear in a puff of smoke. Well, sort of. It gives your C++ compiler the hint to optimize the code by putting the function’s body there instead of the function call. This is faster, but means there are many copies of the function’s statements, so it increases code size.

Which functions should you inline? General wisdom is to do so for these types of C++ functions:

  • Short functions (esp. single-statement functions)
  • Getters and setters in a class
  • Frequently called functions at the bottom of the call hierarchy.

The inline specifier is just a hint. Your compiler is free to completely ignore you. In fact, this choice will probably disappear in a few years, as compilers become better than humans at choosing which functions to inline.

If you want to force the compiler to inline, use preprocessor macros. However, there’s a whole minefield of problems in function-like macros. For example, you need to add parentheses around the whole expression and also around each parameter’s appearance in the replacement text. Hence, inline functions are much safer than macros.

The value of inline functions is not only from avoiding function call overhead. The merging of the statements into the caller’s code also allows many other optimizations to be applied there as follow-up transformations. Constants can be propagated further through the inlined statements, which is similar to constexpr, but the range of optimizations is much larger with inline.

GCC has some additional C++ language features related to inlining. There is the “always_inline” function attribute which says to always inline this function, and the “flatten” attribute which says to inline every call to other functions inside this function. There is also the “gnu_inline” attribute that prevents creation of a non-inlined function body.

inline function limitations

The inline specifier is wonderful when it works. A very important point to note about inline functions is that the inline specifier, by itself, is not enough to guarantee that inline code will be generated. The other requirement is that the compiler must know the function body code at the point where the function is called.

Hence, an inline keyword in a function prototype declaration is not enough. The executable statements inside the function’s definition (i.e., the function body) must be available to the C++ compiler. Otherwise, how is the compiler to know what inline code to expand a function call into? I guess in theory the C++ compiler could maintain a huge database of all the functions in your source code, or scan through all the CPP files to find it, and that would be amazing, but we’re not there yet. In practice, the compiler will only inline functions where it has seen the function body within the current C++ source file or an included header file. This requirement imposes two restrictions on the use of inline functions:

    1. Member functions declared as inline should include the function body inside the same header file as the class declaration. This can be achieved by placing the function body of a member function inside the class declaration. For a more readable style when there are many inline member functions, the class declaration can declare the function prototypes, and then provide the inline function definitions immediately after it, in the same header file. This restriction ensures that whenever the class declaration is included as a header file, the member function body is available for inlining.

    2. Non-member inline functions must be defined before they are used within a source file, preferably by placing the inline functions in a header file. Placing inline functions at the top of a source file allows the inlining of any function calls later in the same source file, but calls to the functions from a different source file cannot be inlined by the compiler unless the inline function definition is placed in a header file.

Non-inlined functions

Some functions declared as inline will not be expanded into inline code by the compiler, simply because they are too complicated for the compiler to handle. In this case, the inline specifier is ignored and the function is treated like any other function. The sophistication of the inline code generation depends on the compiler implementor.

Even if a compiler could theoretically inline a function, the compiler is sometimes still forced to generate a “real” function. There are various possible reasons for this:

    1. The name of an inline function is used as a pointer-to-function constant.

    2. A call to the inline function from within another source file.

    3. virtual member functions.

When an inline function is called from a source file, where the function body has not been made available, the compiler generates a real function call (simply because it cannot inline the function). Hence, the real function must exist and be linked like any other function. Fortunately, the placement of inline functions in header files as discussed above will avoid this for any function the compiler decides to inline.

Inline Variables

Since C++17 you can define a variable as “inline”. What does this do?

Basically, it’s not really much of a speedup, but makes it easier to manage global constants, global variables, or static data members in C++ classes. You can declare these variables as “inline” in a header file, with an initializer:

    inline int g_x = 3;

Then you can with wild abandon include that header file all over the place without any problems whatsoever. The C++ linker is required to:

  • Merge all of them into one variable at link-time.
  • Guarantee that it’s initialized as specified.
  • Have the same address for that variable everywhere.

I find this addition to C++ somewhat humorous because it fixes up a huge mess that’s existed since old K&R C code, and I’ve battled against it many times trying to get my program linked. I’m not going to irritate myself by repeating all the quirks, but it was always messy whether you had a global variable that was extern or non-extern, initialized or non-initialized, in a header file or a non-header file. So, if you ask me, the way that “extern” variable declarations “worked” was always broken, and now it’s fixed in C++17. Hooray! (A bit late for me.)

Overall, allowing “inline” for variables is helpful to efficiency because you can be guaranteed about constants, static members, or global variables at compile-time. And it’s always nice to get your program to link.

Constant Specifiers

The “const” keyword means that something is constant, and cannot be modified. It is helpful for efficiency, but its role is also to help detect programming errors, where code accidentally attempts to modify a constant variable or object. There are multiple places where “const” can be used.

  • Symbolic constants
  • const variables
  • const objects
  • const function parameters (i.e., “const&” idiom)
  • const member functions (read-only)

But don’t get me started on “const correctness.” I’ve seen too many dawns fighting with compilers about const. Anyway, let’s move on, and assume we love const.

Basic const symbols. Symbolic constants can be declared as a representation of a numeric value or other type data (instead of using #define symbols):

    const float pi = 3.14f;

Set-once variables with const. Variables can be made constant via “const”, which is effectively the same as a symbolic constant, except that the initializer need not be a compile-time constant. It is a “set-only-once” variable. The C++ compiler ensures that const variables cannot be modified, once they are initialized.

    const int scale_factor = get_config("scale");
    const int primes[] = { 2, 3, 5, 7, 11, 13, 17 };

Function parameters and const. The const specifier can ensure that function parameters are not modified, especially for arrays passed by reference. const on a scalar parameter type such as int is not as useful, only ensuring that the code inside the function doesn’t modify the parameter (which isn’t really a problem anyway). However, the idiom of “const&” to specify a const reference as a function parameter allows constant pass-by-reference of object parameters, which is extremely important for C++ efficiency.

Instantiate-only objects with const. Class objects can be declared as const variables. When the variable is a const object, it can be instantiated via a constructor, but cannot be modified thereafter.

    const Complex cfactor(3.14, 1.0);

Member functions declared const. Class member functions can be declared by adding the keyword “const” immediately after the function parameter list:

    int MyVector::count() const;

The C++ compiler blocks a const member function from modifying data members, although it can still change “static” data members. For const object variables, the C++ compiler ensures that any calls to non-const member functions are disallowed.

Non-member functions. Note that a non-member function cannot be const. The actions of a friend function or other non-class function are controlled by using const on the parameters, rather than the whole function itself.

Beyond const. Newer C++ features have generalized and improved some of the uses of const. The “constexpr” specifier is much more powerful in terms of allowing compile-time optimizations, as are its derivatives “constinit” and “consteval.” The newer use of “inline” on a variable (yes, a variable, not a function, supported since C++17), can be helpful for safely sharing constants across multiple files.

Constant Expressions Specifier

The constexpr keyword is an optimization hint for the compiler that’s more powerful than “const.” Whereas const only guarantees that something won’t change, constexpr is a guarantee by the human that something can be evaluated at compile-time.

The compiler should use the constexpr hint to try to propagate constant values throughout the evaluation of expressions and function calls, producing an overall speedup. However, if the compiler doesn’t have the capability to do the level of compile-time optimization required, or if the human has told the machine a bald-faced lie, there’s no penalty and the code just runs like it never had a constexpr specifier.

There’s not a whole lot of difference between const and constexpr if you use it only for named constants:

    const float PI = 3.14f;
    constexpr float PI = 3.14f;  // Same same

constexpr functions

The real power is when you use constexpr for functions.

    const float SQRTPI = sqrtf(3.14f);   // Works?
    constexpr float SQRTPI = sqrtf(3.14f); // Works?

Oh, dear! I just tested this code snippet, and the const version works, whereas the constexpr version fails to compile, which is the opposite of what I was expecting. According to an informed source that was trained on Internet scrapings, sqrtf is not going to be declared as a “constexpr” function until C++26. Alas, by then all C++ programmers will have been replaced by robots, so feel free to skip this section.

The apparently futuristic idea is that sqrtf should have a “constexpr” keyword in its declaration, because the function return value can be computed at compile-time if you pass it a constant argument. In other words, the compiler can evaluate “sqrtf(3.14f)” at compile-time. Hence, the whole function should be declared “constexpr” in the standard library header file. The const version is also probably not evaluating the sqrtf function at compile-time, but just calling it dynamically whenever the const variable is first initialized (this non-compile-time initialization is allowed for const variables, provided you don’t later attempt to change its value).

Anyway, you can already declare your own function with the “constexpr” specifier.

    constexpr int twice(int x)
    {
        return x + x;
    }

constexpr functions vs inline functions

A lot of the same value in terms of optimization can be had by making a function just inline rather than constexpr. Note that you can use both, but officially constexpr for functions implies inline on the function as well.

Is constexpr any better than just inline? If you pass a constant argument to a small inline function, then the expansion of the function body will trigger lots of constant propagation optimizations, effectively evaluating most of it at compile-time, which is almost the same as constexpr.

constexpr is supposed to be more formal in guaranteeing that the result of a function is a compile-time constant, and the compiler is honor-bound to do “compile-time function evaluation” to get the constant return value. Also, a constexpr function is more officially usable as a compile-time constant, so that you can use an expression with a constexpr function’s return value in various places where C++ needs a constant (e.g., an array size declaration, some template situations, etc.).

An inline function is also supposed to be optimized at run-time for non-constant arguments, and constexpr functions are implicitly inline functions. The code generation requirements of dynamic inlining are often more advanced than constant expression evaluation.

Also, the restrictions on how a constexpr function can be structured make it much easier for the compiler to evaluate than the unrestricted nature of an inline function body. However, as a practical matter, the compile-time evaluation of expressions and the code generation for inlined expressions have a lot of overlap, so I expect C++ compilers will mostly try to do both on every type of function.

The inline keyword also serves a weird secondary purpose, by guaranteeing that there’s only one copy of the function. This means we can include header files with the full definition of that inline function anywhere we like, without getting a multiple-definition error at link-time. But this isn’t a performance optimization, and the linker feature of inline is almost the opposite of what we want in making a function inline, because we don’t want a real function to be called at all.

if constexpr statements

There is an alternative usage of constexpr in terms of “if” statement conditions (since C++17):

   if constexpr(cond)

This new syntax tags the condition as being amenable to computation at compile-time. Hence, the compiler should optimize the if statement to a constant value, and it can then determine at compile-time which branch should be executed. So, there is a double speedup from:

    (a) the condition computation is removed at run-time, and

    (b) code size reduction from unexecuted “dead code” being removed.

In fact, if constexpr determines at compile-time which branch will be compiled: the discarded branch of a template is not instantiated. Hence, there are cases where you can avoid a compile-time error in templates by wrapping the offending code inside an “if constexpr” check. This prevents some expressions from being compiled at all, which is useful in template expansion, and also reduces code bloat.

constinit variables

The constinit specifier is like a hybrid between constexpr and static variables. The constinit specifier declares a variable with static storage duration that is initialized at compile-time.

A variable declared as constinit must be initialized at compile-time, but note that, unlike “const”, the variable can still be modified afterwards. Also, the function used in the initializer needn’t be a “constant expression” on every path, provided this particular initialization can be calculated at compile-time.

Huh? That makes no sense. Sure it does, in the world of C++ standards. Functions that are constant on every path are only a subset of the calls that can be calculated at compile-time.

The best example is a call to a function that has one path where it’s constant, and another path where it’s not. The definition of “somefunc” has two paths:

    constexpr int somefunc()
    {
        if (something) return 27;           // compile-time path
        else return some_random_number();   // runtime-only path
    }

The “somefunc” function isn’t a constant on all paths, so a call to it isn’t always a “constant expression.” However, it can still be declared “constexpr,” because the standard only requires that at least one invocation of a constexpr function be evaluable at compile-time.

However, if we’re using “somefunc” at program startup initialization, we can try:

    constinit int s_myconst = somefunc();

Here, if we know that the constant path will be taken, the initialization of “s_myconst” goes through the fixed path to get the compile-time constant value of 27, and declaring the variable as constinit tells the compiler to verify that this initialization really does happen at compile-time.

Anyway, now that you’ve been forced to learn all that, just forget it. You’ll rarely if ever be needing constinit.

consteval functions

Use consteval for functions that are always constant. A consteval function is strictly declared so that every invocation of the function must return a compile-time constant.

Hence, consteval functions are a subset of constexpr functions (and consteval also implies inline on a function). Whereas a constexpr function is constant if its arguments are constant, but can also return a dynamic value for non-constant arguments, a consteval function rejects non-constant arguments outright.

When would you use consteval versus constexpr functions? I mean, when you ask your boss to make you a cup of coffee, do you like to ask politely or do you issue commands? Supposedly constexpr is optional for the C++ compiler, whereas consteval is mandating compile-time evaluation.

Personally, I can’t see much difference in general usage, since the compiler will probably optimize a constexpr function at compile-time if it’s capable enough. Hence, for regular functions I don’t see much benefit to consteval over constexpr. There are some complicated places in C++ where it helps to guarantee a compile-time constant, such as reflexive types and other tricks in compile-time template usage.

Auto-Vectorization and Restricted Pointers

Modern C++ compilers attempt to automatically vectorize simple loops. Basic loop structures can be unrolled by optimizers, either partially or fully, and then mapped onto vector hardware instructions automatically.

One of the most important hints to the compiler is a “restrict” designation on pointer variables. Ironically, the benefit of restrict is to limit what you can code, but also to allow unrestricted use of the pointers by the optimizer.

The restrict qualifier is a type specifier that tells the C++ compiler that a given pointer or array variable is not an “alias” for any other pointer. There are various loop transformations and vectorization optimizations that cannot be performed if the compiler has to be conservative and assume that aliasing could occur.

One of the main uses of restrict is on pointer or array function parameters, because arrays are pointers in this context. For example, if we have two function parameters (e.g., vector addition), declaring both parameters as restrict tells the compiler that the two pointers will never point to the other vector.

Note that this use of the word “aliasing” refers to two pointers referring to the same object or array (i.e., the pointers are aliases of each other). There is another unrelated but similar use of the term in C++ “aliases” for declarations, which means one function or type with two alias names.

The “restrict” keyword is merely a hint to the optimizer, and recalcitrant C++ compilers are free to ignore the advice. In fact, “restrict” isn’t even valid C++, because it’s part of C, but not yet in the C++ standard. Nevertheless, various compilers support it or similar extensions like __restrict__, so it can be used in C++ programs.

Restricted pointers don’t always need to be marked as such. In some usages, the use of “const” can allow the compiler to infer non-aliasing of parameters, but it probably doesn’t hurt to declare it with “restrict” as well. Note also that the C++ compiler is free to assume non-aliasing of pointers of different types, because it is undefined behavior if they are aliases. This is known as the “strict aliasing rule” and this assumption can be disabled in GCC via the option “-fno-strict-aliasing”.

The C++ compiler doesn’t really check if you are lying (to yourself). If you tell the compiler that pointers are restricted, and then pass in two aliased pointers, the behavior of your program is “undefined” and there aren’t likely to be any compilation errors or runtime warnings. So, don’t do that.

The correct declaration of a “restrict” pointer is:

    int * restrict ptr;  // Correct

This is actually incorrect:

    int restrict * ptr;   // Wrong
    restrict int * ptr;   // Also wrong

The syntax for array parameters has the keyword inside the square brackets:

   void myfunc(int arr[restrict]);

Read-only functions. Note that read-only functions don’t really need the restrict keyword. For example, the calculation of a vector dot product of two arrays doesn’t really have an aliasing problem, since neither of the vectors is modified.

Restricted references. The “restrict” type specifier can be used on references, as well as pointers and arrays. This is helpful for some of the issues with aliasing between references in pass-by-reference function parameters. But this usage of restrict for references isn’t very important for auto-vectorization optimizations.

Restricted “this” pointer. GCC also supports specifying that the class object “this” pointer is unaliased by marking the function body with the “__restrict__” keyword. This is placed after the closing right parenthesis of the function parameters (i.e., similar to a const member function declaration). The declaration looks like:

    void MyClass::myfunc(int x) __restrict__;

Templates for Kernels

You can define templated versions of CUDA kernels, much as you would for standard C++ kernels. The advantages of templated kernels include:

  • Extra compile-time auto-optimizations from the use of constant parameters.
  • Multiple specialized versions of your kernels can be defined.

In other words, templating your kernels allows either you or your compiler to do a better job optimizing.

To define a templated CUDA kernel, just use the “template” keyword, much like standard C++ templates. The basic syntax is:

    template < /*...template params ...*/ >
    __global__ void mykernel( /*... kernel params ...*/ )
    {
        // kernel code...
    }

In other words, it looks exactly like a standard C++ templated function, except for an extra “__global__” or “__device__” specifier. And you can also invoke a templated kernel using the same ideas:

    mykernel< /*...template params ...*/ > 
            <<< blocks,threads >>> ( /*... kernel params ...*/ );

For example, a templated kernel launch with an explicit template argument would look like:

    vector_clear<float> <<< blocks,threads >>> (dv, n);

It looks a little messy with “<...>” for template parameters and “<<<...>>>” sequences for the kernel launch, but it works.
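To make the example concrete, here is a sketch of the templated vector_clear kernel assumed by the launch above (the kernel body is my guess at the obvious implementation):

```cuda
// Templated kernel: zero one element per thread.
template <typename T>
__global__ void vector_clear(T* v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = T(0);
}

// Launch for a float device array dv of n elements:
//   vector_clear<float> <<< blocks, threads >>> (dv, n);
// The explicit <float> can often be omitted, since the compiler
// deduces the template argument from the type of dv.
```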

Template Compile-Time Optimizations

Going beyond just using template code to write the same algorithm for different types, there are various ways to optimize code that is templated to do more at compile-time:

  • Template class and function specializations
  • Constant template parameters
  • Compile-time conditional tests on types (e.g., sizeof, type traits, etc.)
  • if constexpr syntax
  • Variadic templates
  • SFINAE techniques
  • Template Metaprogramming (TMP) techniques

Constants can be used to instantiate template code in a way that helps the compiler to optimize by evaluating constant expressions. Template parameters don’t need to be types, but can also be constant variables or numbers, such as the size of an array. Using a template in this way is as efficient as hard-coding the array size, which helps the compiler to know exactly what it can optimize, such as if the array size is used in any computations.

If you think you can do better than the compiler’s optimizer, remember that you can also override the generic template code. For example, you can specialize a template class for a particular type with your own explicit version. Similarly, you can provide an explicit specialization of a templated function containing your hand-tuned code.

An alternative to specializing a version of a template class or function is to use compile-time tests inside the generic template code. For example, you can use conditional tests involving compile-time operations:

  • sizeof
  • typeid
  • std::is_same_v
  • if constexpr conditional test syntax

Next level templating

C++ templates are a very powerful programming mechanism. In fact, you can define entire projects as templates inside header files. To get the most out of template optimizations at compile-time, consider these methods:

  • Type traits
  • Variadic templates
  • SFINAE
  • Template Meta-Programming (TMP)

Type traits are a generic feature of C++ (since C++11) that you can use to interrogate the type of a variable. They are declared in the <type_traits> header file and there are numerous ways that you can test the type of a variable. The above example std::is_same_v is one example. As another example, there is std::is_signed and std::is_unsigned to test whether it’s a signed or unsigned type. There’s also std::is_pointer and std::is_array and various others. Combining type traits with “if constexpr” gives a powerful way to ensure templated code gets evaluated at compile-time, and to specialize blocks of code for particular types.

Variadic templates are another way to level up your code and have been supported since C++11. These are variable-argument templates via the use of the ellipsis “...” operator in a template declaration. This allows templates to accept a variable number of parameters for instantiation.

SFINAE. Another optimization for advanced templating is to rely on SFINAE semantics. This refers to “Substitution Failure Is Not An Error” and means that a template instantiation that fails should not itself trigger a compilation error. More specifically, if the compiler tries and fails to instantiate a template, but there’s another way to resolve the call, such as a different overloaded function, then the code compiles via that alternative. Relying on this capability in C++ not only avoids compilation errors that would block some advanced template usages, but can also be used to ensure compile-time calculations. However, although there are some good use cases in making templates faster, SFINAE is an obscure programming technique that isn’t widely used in everyday C++ programming.

Template Metaprogramming

Template metaprogramming, often abbreviated TMP, is weird. Several years ago, some bright spark figured out that the C++ template syntax was Turing-complete, because it has:

    1. Sequence,

    2. Selection, and

    3. Recursion

Sequence and selection are trivial with multiple statements and if statements. The third requirement is usually “iteration” with loops, but the use of well-defined recursion is actually equivalent. There’s also the fourth requirement of “data storage” performed by the compiler as it parses the code.

Voila! We have a fully-fledged programming language.

Although there aren’t any loops in C++ template declarations (yet?), you can have one template declaration declare another, for a different value. Hence, there are “recursive” template definitions (e.g., “factorial<n-1>”), and you can separately declare a “base case” for a particular constant value (e.g., “factorial<1>”). This amazing insight means that you can use a fancy template definition to perform full computations at compile-time. For example, there are TMP “programs” that compute the Fibonacci numbers or factorials. Here’s an example:

    template <int N>
    struct Factorial
    {
        // Recursive case N! = N * (N-1)!
        enum { value = N * Factorial<N-1>::value };
    };

    // Base case 1! = 1
    template <>
    struct Factorial<1> {
        enum { value = 1 };
    };

    void print_factorial()
    {
        printf("Factorial 5 = %d\n", Factorial<5>::value);
    }

This code “tricks” the compiler into using the template machinery to evaluate 5*4*3*2*1=120 at compile-time. And I have to tip my hat to whoever figured that out! According to Wikibooks it was discovered accidentally by Erwin Unruh.

Unfortunately, or perhaps fortunately, if you let your excitement subside, there are two points to consider:

  • TMP only works if the input value is a constant, and
  • Standard C++ now has “constexpr” which does almost the same thing in a non-ugly syntax.

So, I’m not sure how much time you really should spend trying to optimize your CUDA kernels with TMP, but trying to will definitely grow your brain size!

References

  1. Bjorn Andrist, Viktor Sehr (2020), C++ High Performance: Master the art of optimizing the functioning of your C++ code, 2nd Edition, Packt Publishing, Dec 2020, https://www.amazon.com/dp/1839216549, Code: https://github.com/PacktPublishing/Cpp-High-Performance-Second-Edition (Chapter 8 is on compile-time optimizations.)
  2. Gnu.org (2023), GCC Command Options, GNU Compiler Collection, https://gcc.gnu.org/onlinedocs/gcc/Invoking-GCC.html
  3. Wikipedia, Oct 2024 (accessed), Template metaprogramming, https://en.wikipedia.org/wiki/Template_metaprogramming
  4. Wikibooks, Oct 2024 (accessed), History of TMP, https://en.wikibooks.org/wiki/C%2B%2B_Programming/Templates/Template_Meta-Programming#History_of_TMP
  5. Arthy Sundaram, Jaydeep Marathe, Hari Sandanagobalane, Gautam Chakrabarti, Mukesh Kapoor, Steve Wells and Mike Murphy, Feb 12, 2021, Boosting Productivity and Performance with the NVIDIA CUDA 11.2 C++ Compiler, https://developer.nvidia.com/blog/boosting-productivity-and-performance-with-the-nvidia-cuda-11-2-c-compiler/

 
