Aussie AI
Appendix: CUDA C++ Slugs
Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
by David Spuler
Slug Hunting Advice
This appendix is about speeding up your C++ programs through general improvements to sequential or parallel coding. A more detailed analysis of many of these techniques with code examples is found in the bonus materials online for this book: https://www.aussieai.com/cuda/optimization.
But, before we begin with anything that’s actually useful, I have to introduce the obligatory wrist-slapping politically-correct deslugging advice for programmers. Hence, here are some general nuggets of advice when attempting to speed up your program:
- Profile twice, code once. Performance profiling tools exist for a reason. Find your busiest kernels.
- Focus on parallelization of sequential code to GPU kernels.
- Don’t micro-optimize. Unless you’re into that kind of thing. But really, try to sit on your hands.
- Do macro-optimize. Think about your data structures and algorithms.
- Optimizing introduces new bugs. 100% guaranteed! Don’t optimize the night before your release. Re-run your test suite.
- Don’t optimize exception handling. Tweaking rarely-executed code is a poor use of your genius.
- Use open source third-party libraries that have already been optimized by others.
Or just ignore that advice and go crazy. It’s just too much fun optimizing when the alternative is dreary debugging. Pro tip: it’s even more fun writing a book on optimizing!
Where to hunt slugs? Some of the common large-scale issues with coding inefficiency in typical C++ programs include:
- Sequential algorithms
- Function call hierarchies
- Nested loops
- Overuse of memory allocation
- Constructor and destructor inefficiencies
- Inefficient sequential algorithms (e.g., linear search of arrays)
- Unnecessary overhead or wrappers
- Recursion. After you’ve coded up your university assignments (Tower of Hanoi, anyone?), please forget recursion exists.
C++ Speedup Techniques: Some of the general ways to speed up C++ programs at the design structure or algorithmic level include:
- Parallelize via CUDA kernels.
- Further parallelize via multi-kernel, multi-threading, multi-process, multi-core, multi-GPU, multi-something.
- Faster data structures (parallelizable linear ones preferably).
- Faster algorithms (e.g., restructure algorithms for vectorization).
- Vectorization (parallelize your important linear loops).
- Precompute expensive functions into a lookup table at compile-time (e.g., activation functions).
- Cache any complex calculations to trade extra space for time savings (e.g., KV caching).
- Change floating-point to integer operations (quantization, anyone?)
- Replace recursion with iteration. Subtract ten bonus points if you need to do this.
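To make the lookup-table idea concrete, here is a minimal host-side sketch (the function names and the cheap sigmoid approximation are mine, not from the book; assumes C++17): precompute a 256-entry table at compile time so that the runtime cost of the "activation function" is a single array index.

```cpp
#include <array>
#include <cstdint>

// A constexpr-safe sigmoid-like approximation (constexpr std::exp is not
// guaranteed pre-C++26, so we use a cheap rational function instead).
constexpr float approx_activation(int i) {
    // Map index 0..255 onto the range -4.0..+4.0.
    float x = (i - 128) / 32.0f;
    return 0.5f + 0.5f * x / (1.0f + (x < 0 ? -x : x));
}

// Build the whole 256-entry table at compile time, not at runtime.
constexpr std::array<float, 256> make_table() {
    std::array<float, 256> t{};
    for (int i = 0; i < 256; ++i) t[i] = approx_activation(i);
    return t;
}

constexpr auto kActivationTable = make_table();

// Runtime cost: one array index instead of a transcendental function call.
inline float fast_activation(uint8_t q) { return kActivationTable[q]; }
```

The same trick extends to any pure function over a small integer domain (8-bit quantized inputs are the sweet spot, since the table stays tiny).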
Some of the high-level C++ coding optimizations include:
- Flatten function call hierarchies (stop wrapping everything so much, and inline the small functions at the bottom).
- Optimize loops, especially nested loops (e.g., move loop-invariant code out, loop unrolling, vectorization, etc.)
- Templates are effectively a compile-time optimization that improves speed at the cost of code space.
- Reduce memory allocation (use less memory overall or replace memory allocation with temporary stack buffers).
- Operator strength reduction (e.g., replace “*” with “+”, a pipe dream of all AI engineers).
- Declare variables as close as possible to where they are used. This avoids instantiating objects that aren’t needed on some paths.
- Use pointer arithmetic, especially for loops over arrays.
- Bitwise operations are fast, but the basic C++ integer operations are also fast nowadays. Benchmark, don’t assume.
- Use short-circuiting of the && and || operators, and also the ternary ?: operator, to avoid expensive function calls.
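As a quick illustration of operator strength reduction (this example and its names are mine, not from the book): a per-iteration multiplication can be replaced with a running addition.

```cpp
#include <cstddef>

// Before: arr[i] = i * step computes a multiplication on every iteration.
// After: strength-reduced to a running addition. Many compilers do this
// automatically for induction variables, but it is shown explicitly here.
void fill_multiples(int *arr, size_t n, int step) {
    int value = 0;              // invariant: value == i * step
    for (size_t i = 0; i < n; ++i) {
        arr[i] = value;         // no multiplication in the loop body
        value += step;          // "+" replaces "*"
    }
}
```

As always with micro-optimizations, check the generated assembly or benchmark before claiming victory, because the optimizer may already have done this for you.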
And finally, some things you might forget (and some that are forgettable):
- Benchmark any important changes (e.g., operator strength reductions).
- Turn up your C++ optimizer. There are higher settings you could try.
- Add compile-time optimization hints (e.g., constexpr and restrict).
- Overclock your GPU (like a gamer).
- Sell your car to buy a better GPU.
- Put every function in a header file and make them all inline.
- Reorder your case labels. Surely it helps.
- Change all uses of i++ to ++i in everyone else’s code.
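Since "benchmark, don't assume" is the recurring theme, here is a minimal sketch of a timing helper using std::chrono (the name time_usec is illustrative; a serious benchmark also needs warm-up runs and defenses against dead-code elimination):

```cpp
#include <chrono>

// Time a callable over many iterations and return elapsed microseconds.
// steady_clock is used because it is monotonic (wall-clock time can jump).
template <typename Func>
long long time_usec(Func f, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

Compare the before and after versions of a change with the same iteration count, and be suspicious of any measurement where the optimizer might have deleted the work entirely (a volatile sink variable helps).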
Slugs in CUDA C++
The whole point of CUDA is to replace sequential sluggish code with streamlined SIMD kernels. But there are a few ways to go wrong and actually introduce slugs into device code.
Computational slugs:
- Launching too many kernel threads (wasted resources)
- Kernels duplicating the same work (wasted resources)
- Forgetting to enable the NVCC optimization flags on both host and device code.
- Low occupancy rates on the GPU
- Tail effect of a small final wave
- Block size not a multiple of 32
- Segmented loop kernels (prefer grid-stride loops)
- Nested kernels (CUDA dynamic parallelism) can get out of control (e.g., too many threads, indirect recursion).
Thread execution bottlenecks:
- Avoid recursion (generally a poor idea overall, except in school assignments, but here it also uses up GPU local memory on the stack).
- Unroll loops with care (unrolling can increase register pressure and cause non-coalesced memory accesses).
- Output in kernel code (i.e., printf calls run slow and have weird buffering delays).
- Complex class objects as kernel parameters (pass-by-reference is not supported and pass-by-value will run the constructor).
- Device debug mode — don’t compile production code with nvcc -G or --device-debug, which turns off all the auto-optimizations on kernel code (you can combine device debugging with optimization by adding the --dopt option).
- nvcc debug and profiler modes — take care whether to use the various debug and profiler flags in production: “-g”, “-pg”, “-profile”, “-debug”, or “-lineinfo”/“--generate-line-info”.
- Debug tracing output or failing assertions slowing down execution.
Instruction execution slugs:
- Warp divergence (from if statements or loops in kernels)
- Low instruction-level parallelism
- Poor instruction locality and instruction cache misses
- Overly large thread code size (bloated kernels)
- Non-restricted pointer declarations
Synchronization slugs:
- Too many calls to cudaDeviceSynchronize
- Overuse of __syncwarp() or __syncthreads()
- Redundant barriers (synchronization barriers that don’t actually stop any race condition)
- Implied hidden synchronizations in many CUDA runtime API calls (e.g., cudaGetLastError, cudaMemcpy, etc.)
- Overuse of unnecessary atomics
Sequential execution slugs (ugh!):
- Serialized kernel execution of multiple different kernels
- Accidentally leaving the “kernel blocking” environment variable (CUDA_LAUNCH_BLOCKING=1) enabled.
Memory usage slugs:
- Using too much global memory (or repeatedly accessing the same data)
- Memory leaks (host or device memory)
- Non-coalesced data access patterns
- Excessive data transfer costs
- Using pageable memory for data transfers (use pinned memory)
- Poor data locality in global memory usage
- Array-of-Structures (AoS) layout has non-contiguous data
- Using shared memory for inter-thread data movement when warp shuffle operations would be faster
- Contention in memory accesses (shared memory or global memory)
- Register spills and “register pressure”
- Cache inefficiency
- Shared memory “bank conflicts” (access contention in shared memory accessed by multiple threads).
- Non-linearized data layout for matrices (2D) or tensors (3D)
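To illustrate the last point, here is a hedged sketch of a linearized row-major matrix (the Matrix struct is mine, not from the book): one contiguous buffer indexed by r*cols+c, which is also the layout CUDA kernels typically use for coalesced access.

```cpp
#include <cstddef>
#include <vector>

// A 2D matrix stored as one contiguous (linearized) buffer in row-major
// order. Contiguous storage gives better locality than a vector-of-vectors,
// and the index formula carries over directly to device code.
struct Matrix {
    size_t rows, cols;
    std::vector<float> data;
    Matrix(size_t r, size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float &at(size_t r, size_t c) { return data[r * cols + c]; }  // row-major
};
```

A 3D tensor linearizes the same way, with index ((d * rows) + r) * cols + c.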
C++ Class Slugs
The C++ class features are designed to add encapsulation and modularity, while retaining speed, but there are still plenty of ways for slugs to crawl into your classes. C++ class optimizations include:
- Ensure small member functions are inline, especially those that do “get” and “set”.
- Add inline to other friend or non-class functions (especially if small or commonly used).
- Pass objects to functions using “const&” (pass-by-reference), rather than pass-by-value.
- Watch out for temporary objects. These can occur in simple assignments or function call expressions, or in weird ways like accidentally giving your overloaded assignment operator the wrong return type.
- Avoid virtual functions (they have a runtime cost and block compile-time optimizations like inlining and constexpr).
- Specialize inherited member functions in derived classes.
- Declare objects as close to first use as possible.
- Initialize objects at declaration, rather than calling a constructor and then later an initialization function.
- Overloaded binary operators can be a slug returning an object. Instead of c=a+b, use a three-parameter function with references, such as add3(a,b,c).
- Use reference variables instead of copying objects into temporary variables.
- Take care templating class objects (e.g., when using the std::vector class for a vector of your class objects). Lots of hidden calls to constructors and destructors may arise in resizing.
- Don’t return objects as the return type of the assignment operators.
- Declare assignment operators as return type void.
- Use the initializer list in the constructor for initializing data members.
- Use friend functions for faster accesses to internal object data.
- Block accidental calls to the copy constructor or class assignment operator (i.e., if you aren’t defining them, make a dummy version that is “private” with a “void” function body).
- Avoid returning objects if you can. Return a reference if it’s safe to do so.
- Take care with “wrapper” classes like “smart pointers”, “smart integers” or “smart buffers”. Usually, they’re safer but slower. How smart is that?
- Singleton classes can be made more efficient than general classes.
- Postfix overloaded increment operators are inherently inefficient because they return a copy (make them return void, or disallow them as a private declaration with a void body).
- Destructor execution with cleanup code can be a slug on program exit (consider avoiding memory deallocations in destructors by setting a global “i_am_shutting_down” flag).
- Reduce object byte size with “largest to smallest” member ordering.
- Put the most-used data member first in the object (it has a zero offset, which is compiled-out).
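The last two bullets can be demonstrated with sizeof (the struct names are illustrative, and the exact sizes assume a typical 64-bit ABI):

```cpp
#include <cstdint>

// Poor ordering: the compiler inserts padding so the 8-byte member is
// aligned, and again at the tail to round the struct up to its alignment.
struct PaddedOrder {
    uint8_t flag;     // 1 byte + 7 bytes padding
    uint64_t count;   // 8 bytes
    uint8_t kind;     // 1 byte + 7 bytes tail padding  -> 24 bytes total
};

// Largest-to-smallest ordering packs the small members together.
struct PackedOrder {
    uint64_t count;   // 8 bytes
    uint8_t flag;     // 1 byte
    uint8_t kind;     // 1 byte + 6 bytes tail padding  -> 16 bytes total
};
```

Smaller objects mean fewer cache lines touched per object, which is often worth more than any single instruction-level tweak.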
Function Slugs
Functions are an important building block of your code. Some ways to get the slugs out of general non-member functions include:
- Declare small functions inline.
- Scour open source libraries for versions of your function already written by experts (e.g., the many CUDA source code libraries).
- Avoid recursion (use loops instead).
- If you can’t fully remove recursion, (a) use tail recursion elimination, (b) collapse recursive calls by unwinding or pre-computing more of the lower-level base cases, or (c) use a stack data structure.
- Call builtin or intrinsic functions.
- Call library functions rather than writing your own.
- Avoid function pointers (they block compile-time optimizations like inline and constexpr).
- Optimize the most frequently called function (profile it!).
- Use parameters as if they’re local variables (reduces register usage).
- Use constexpr functions.
- Pass objects by reference.
- Use const references for objects to allow auto-optimizations by the compiler.
- Declare pointer parameters as __restrict__ for additional compiler optimizations.
- Optimize functions with default arguments by creating a separate specialized function without the extra parameter (i.e., remove the default argument).
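A tiny sketch of replacing recursion with iteration (the function names are mine, not from the book):

```cpp
// Recursive sum of 1..n — every call burns stack space (and on a GPU,
// stack frames live in slow local memory).
long long sum_recursive(int n) {
    return n <= 0 ? 0 : n + sum_recursive(n - 1);
}

// The tail-recursion-eliminated version: same result, a plain loop,
// no call overhead, and no stack growth.
long long sum_iterative(int n) {
    long long total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    return total;
}
```

For non-tail recursion (e.g., tree traversals), the loop version needs an explicit stack data structure, as noted in the list above.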
Medium-Sized Slugs
There are a lot more examples of possible inefficiencies in C++ coding. Some of the types of errors that are “medium-sized” slugs include:
- Avoid non-static automatic array initializations with constant data.
- Don’t put function calls in a loop test (i.e., expensive loop conditional tests).
- Member initializations in the constructor body (they should be in the initializer list).
- The new for loop auto-iterators are elegant, but not necessarily the fastest way.
- Program startup hidden initializations (global or static object constructors).
- Small non-inline functions called frequently.
- Busy wait loops are a worst case.
- Disk I/O with file operations is often slow (e.g., load the whole file into memory).
- Don’t accidentally call your global initialization routine twice (use a flag to check it’s already run).
- Unnecessary code inside loops should be “hoisted” out.
- C++ classes wrapping simple data types (e.g., overuse of “smart pointers” or “smart integer” classes).
- Overuse of standard string concatenation operations.
- Recursion is almost always a slug.
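The "function calls in a loop test" slug deserves a concrete sketch (the functions here are illustrative): calling strlen in the loop condition turns a linear scan into quadratic work.

```cpp
#include <cctype>
#include <cstring>

// Slug: strlen re-scans the whole string on every iteration — O(n^2).
int count_digits_slow(const char *s) {
    int count = 0;
    for (size_t i = 0; i < strlen(s); ++i)
        if (isdigit((unsigned char)s[i])) ++count;
    return count;
}

// Fixed: test for the null byte directly — one pass, O(n).
int count_digits_fast(const char *s) {
    int count = 0;
    for (; *s != '\0'; ++s)
        if (isdigit((unsigned char)*s)) ++count;
    return count;
}
```

Hoisting strlen into a variable before the loop works just as well if you need the index.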
More Slug Repellent
There are plenty of other optimizations in the earlier chapters. Well, actually, in all of the book! Nevertheless, here’s a list of some more C++ code optimization techniques for you to consider. Some of the bigger ideas:
- Use “move constructors” instead of copy constructors where appropriate (since C++11).
- Use static data members where appropriate, so they are initialized once only.
- Use std::sort rather than qsort.
- Don’t put try..catch inside an inner loop that’s a bottleneck.
- Use std::bitset for bit sets or bit vectors.
- Use the “iterators” design pattern rather than returning a full scan of a data structure all at once (saves memory and allows early exit).
- Consider basic C++ arrays instead of std::vector if it has a fixed size (known at compile-time) or its maximum size is small enough.
- Consider C++20 coroutines where appropriate for the architecture.
- Structure-of-Arrays (SoA) data layout is more vectorizable than Array-of-Structures (AoS).
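Here is a minimal sketch contrasting the AoS and SoA layouts (the struct names are mine, not from the book): in the SoA form, a loop over one field reads a dense, contiguous array.

```cpp
#include <cstddef>
#include <vector>

// Array-of-Structures (AoS): x, y, z interleaved in memory, so a loop
// over just x strides through memory and vectorizes poorly.
struct PointAoS { float x, y, z; };

// Structure-of-Arrays (SoA): each field is its own contiguous array —
// friendly to SIMD on the CPU and to coalesced access on the GPU.
struct PointsSoA {
    std::vector<float> x, y, z;
};

float sum_x(const PointsSoA &p) {
    float total = 0.0f;
    for (size_t i = 0; i < p.x.size(); ++i) total += p.x[i];  // dense reads
    return total;
}
```

The trade-off: SoA is clumsier when you always need all fields of one element together, so choose the layout to match the hot loop.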
And some of the smaller optimizations:
- Commonly used object or struct fields should be first. On some platforms, smaller offsets from the start of an object are accessed faster. Also, the very first field has offset zero, which is optimized away, so put the most used field first.
- Avoid long else-if sequences. You are effectively doing linear search on the problem space in a long block of if-else-if statements. The best alternative is to use a switch statement, if the conditions are constants. For non-constant conditions or string comparisons, consider tabularizing the options and/or using heuristics to bifurcate the search space (e.g., start with a switch on the first letter of a string).
- Use compact numeric ranges for switch. If the case numbers are close together, the compiler will probably use a lookup-table in assembler. If the cases are sparse, it can be forced to emit an if-else-if equivalent in machine code.
- Correct choice of loop. If the condition is true at the first iteration, use do-while loops.
- Instead of range-checking a signed integer with “i>=0 && i<MAX”, use a typecast with “(unsigned)i<MAX”, because negatives become large unsigned positives, and a cast from int to unsigned int isn’t a real instruction at run-time.
- Enable the FTZ (“flush-to-zero”) and/or DAZ (“denormals-are-zero”) floating-point modes on your CPU, even though they violate the IEEE 754 standard. You probably don’t care about tiny floating-point numbers in your weight or probability calculations.
- Enable GCC’s floating-point arithmetic speedup options: -ffast-math, -fno-math-errno, -fno-trapping-math, and -ffinite-math-only.
- bsearch is slow. Choose a better method.
- Use static_assert rather than assert (e.g., to check data type sizes).
- Copy arrays by wrapping them in a dummy struct and using C++ struct bitwise assignment. It might be faster than memcpy.
- Use memcpy rather than memmove if you’re sure the arguments won’t overlap.
- Move local non-static objects outside of a critical loop. Reuse the same object rather than re-running constructors and destructors every loop iteration. Add a “reset” member function if needed.
- Use scaling factors that are a power-of-two, so that multiplication or division can be a bitshift.
- Specialize a function with a void and non-void version if you find yourself ignoring the return value sometimes. This avoids all of the calculations to determine the return value inside the void function, because the function itself cannot tell whether or not the caller will use its return value.
- Prefer pre-increment (++i) to post-increment (i++) for non-scalar values. And it’s better to use pre-increment even for “int” types, even though it’s the same, just to get into the habit.
- Use the GCC __builtin_unreachable() statement and the “noreturn” function attribute to help the GCC optimizer identify dead code paths, allowing unreachable code removal (not that we care that much) and better path-specific optimizations on other live paths (e.g., compile-time constant propagation).
- Test the first character of two strings directly with character tests before calling strcmp.
- Replace calls to the “round”, “floor” or “ceil” functions with a type cast to int (as an approximation).
- Consider using the simpler putchar/putc/fputc or puts/fputs functions rather than printf or fprintf.
- Write your own versions of abs and fabs/fabsf (but benchmark it).
- Avoid the floating-point pow function for computing integer powers.
- Instead of strlen("literal"), declare it as an initialized char[] array variable and use sizeof(arr)-1.
- Merge a large number of function parameters into an object. Don’t pass 10 Boolean flags as differently named function parameters. Create an object or structure and make them fields instead.
- Avoid calling strlen in a “for” loop conditional. Compute strlen before the loop, or test for the null byte.
- Merge multiple Boolean function parameters into a bit set packed into an int or long. The gain from passing fewer values as function arguments will be offset by the cost of packing and unpacking bits, but it should still be a net win.
- Use the int type mostly, not char or short. Maybe prefer int to size_t, too.
- Specialize functions being called with a constant for an argument using a template function with an integer parameter. This will increase code size, but the constant will be propagated more at compile-time, and you also don’t have the cost of passing it as an argument.
- Add “noexcept” specifiers to functions wherever it applies, because this allows the compiler to know not to add any extra exception handling code.
- If you’re “searching” an array or set of constant integers, known at compile-time, consider “proceduralization” by putting the numbers as cases in a switch. (Trust the compiler engineers.)
- Consider writing your own faster atoi/itoa functions, as the standard libraries need to handle lots of rare cases, making them slower. (I’m not sure I agree, and you might want to benchmark.)
- Don’t overuse “alignas” to specify address alignments if you don’t need them, as the enforcement of alignment requirements can impose runtime cost.
- sprintf is a slow and unsafe function. snprintf is safer but still slow. Find another way.
- Post-increment can be faster in pointer arithmetic, so prefer the normal idiom “*ptr++” rather than “*++ptr” to scan a vector.
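The unsigned range-check trick from the list above, as a self-contained sketch (the function names are illustrative):

```cpp
// Two comparisons: the straightforward range check of a signed index.
bool in_range_two_tests(int i, int max) {
    return i >= 0 && i < max;
}

// One comparison: the int-to-unsigned cast costs nothing at runtime, and
// any negative value wraps to a huge unsigned value that fails "< max".
bool in_range_one_test(int i, int max) {
    return (unsigned)i < (unsigned)max;
}
```

Compilers often perform this transformation themselves, so as always: benchmark, don’t assume.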