Aussie AI
Appendix: CUDA C++ Slugs
Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
by David Spuler
Slug Hunting Advice
This appendix is about speeding up your C++ programs through general improvements to sequential or parallel coding. A more detailed analysis of many of these techniques with code examples is found in the bonus materials online for this book: https://www.aussieai.com/cuda/optimization.
But, before we begin with anything that’s actually useful, I have to introduce the obligatory wrist-slapping politically-correct deslugging advice for programmers. Hence, here are some general nuggets of advice when attempting to speed up your program:
- Profile twice, code once. Performance profiling tools exist for a reason. Find your busiest kernels.
- Focus on parallelization of sequential code to GPU kernels.
- Don’t micro-optimize. Unless you’re into that kind of thing. But really, try to sit on your hands.
- Do macro-optimize. Think about your data structures and algorithms.
- Optimizing introduces new bugs. 100% guaranteed! Don’t optimize the night before your release. Re-run your test suite.
- Don’t optimize exception handling. Tweaking rarely-executed code is a poor use of your genius.
- Use open source third-party libraries that have already been optimized by others.
Or just ignore that advice and go crazy. It’s just too much fun optimizing when the alternative is dreary debugging. Pro tip: it’s even more fun writing a book on optimizing!
Where to hunt slugs? Some of the common large-scale issues with coding inefficiency in typical C++ programs include:
- Sequential algorithms
- Function call hierarchies
- Nested loops
- Overuse of memory allocation
- Constructor and destructor inefficiencies
- Inefficient sequential algorithms (e.g., linear search of arrays)
- Unnecessary overhead or wrappers
- Recursion. After you’ve coded up your university assignments (Tower of Hanoi, anyone?), please forget recursion exists.
C++ Speedup Techniques: Some of the general ways to speed up C++ programs at the design structure or algorithmic level include:
- Parallelize via CUDA kernels.
- Further parallelize via multi-kernel, multi-threading, multi-process, multi-core, multi-GPU, multi-something.
- Faster data structures (parallelizable linear ones preferably).
- Faster algorithms (e.g., restructure algorithms for vectorization).
- Vectorization (parallelize your important linear loops).
- Precompute expensive functions into a lookup table at compile-time (e.g., activation functions).
- Cache any complex calculations to trade extra space for time savings (e.g., KV caching).
- Change floating-point to integer operations (quantization, anyone?)
- Replace recursion with iteration. Subtract ten bonus points if you need to do this.
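To make the lookup-table idea concrete, here is a minimal host-side sketch (the function names and the cheap sigmoid approximation are mine, not from the book; assumes C++17): precompute a 256-entry table at compile time so that the runtime cost of the "activation function" is a single array index.

```cpp
#include <array>
#include <cstdint>

// A constexpr-safe sigmoid-like approximation (constexpr std::exp is not
// guaranteed pre-C++26, so we use a cheap rational function instead).
constexpr float approx_activation(int i) {
    // Map index 0..255 onto the range -4.0..+4.0.
    float x = (i - 128) / 32.0f;
    return 0.5f + 0.5f * x / (1.0f + (x < 0 ? -x : x));
}

// Build the whole 256-entry table at compile time, not at runtime.
constexpr std::array<float, 256> make_table() {
    std::array<float, 256> t{};
    for (int i = 0; i < 256; ++i) t[i] = approx_activation(i);
    return t;
}

constexpr auto kActivationTable = make_table();

// Runtime cost: one array index instead of a transcendental function call.
inline float fast_activation(uint8_t q) { return kActivationTable[q]; }
```

The same trick extends to any pure function over a small integer domain (8-bit quantized inputs are the sweet spot, since the table stays tiny).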
Some of the high-level C++ coding optimizations include:
- Flatten function call hierarchies (stop wrapping everything so much, and inline the small functions at the bottom).
- Optimize loops, especially nested loops (e.g., move loop-invariant code out, loop unrolling, vectorization, etc.)
- Templates are effectively a compile-time optimization that improves speed at the cost of code space.
- Reduce memory allocation (use less memory overall or replace memory allocation with temporary stack buffers).
- Operator strength reduction (e.g., replace “*” with “+”, a pipe dream of all AI engineers).
- Declare variables as close as possible to where they are used. This avoids instantiating objects that aren’t needed on some paths.
- Use pointer arithmetic, especially for loops over arrays.
- Bitwise operations are fast, but the basic C++ integer operations are also fast nowadays. Benchmark, don’t assume.
- Use short-circuiting of the && and || operators, and also the ternary ?: operator, to avoid expensive function calls.
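As a quick illustration of operator strength reduction (this example and its names are mine, not from the book): a per-iteration multiplication can be replaced with a running addition.

```cpp
#include <cstddef>

// Before: arr[i] = i * step computes a multiplication on every iteration.
// After: strength-reduced to a running addition. Many compilers do this
// automatically for induction variables, but it is shown explicitly here.
void fill_multiples(int *arr, size_t n, int step) {
    int value = 0;              // invariant: value == i * step
    for (size_t i = 0; i < n; ++i) {
        arr[i] = value;         // no multiplication in the loop body
        value += step;          // "+" replaces "*"
    }
}
```

As always with micro-optimizations, check the generated assembly or benchmark before claiming victory, because the optimizer may already have done this for you.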
And finally, some things you might forget (and some that are forgettable):
- Benchmark any important changes (e.g., operator strength reductions).
- Turn up your C++ optimizer. There are higher settings you could try.
- Add compile-time optimization hints (e.g., constexpr and restrict).
- Overclock your GPU (like a gamer).
- Sell your car to buy a better GPU.
- Put every function in a header file and make them all inline.
- Reorder your case labels. Surely it helps.
- Change all uses of i++ to ++i in everyone else’s code.
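Since "benchmark, don't assume" is the recurring theme, here is a minimal sketch of a timing helper using std::chrono (the name time_usec is illustrative; a serious benchmark also needs warm-up runs and defenses against dead-code elimination):

```cpp
#include <chrono>

// Time a callable over many iterations and return elapsed microseconds.
// steady_clock is used because it is monotonic (wall-clock time can jump).
template <typename Func>
long long time_usec(Func f, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

Compare the before and after versions of a change with the same iteration count, and be suspicious of any measurement where the optimizer might have deleted the work entirely (a volatile sink variable helps).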
Slugs in CUDA C++
The whole point of CUDA is to replace sequential sluggish code with streamlined SIMD kernels. But there are a few ways to go wrong and actually introduce slugs into device code.
Computational slugs:
- Launching too many kernel threads (wasted resources)
- Kernels duplicating the same work (wasted resources)
- Forgetting to enable the NVCC optimization flags on both host and device code.
- Low occupancy rates on the GPU
- Tail effect of a small final wave
- Block size not a multiple of 32
- Segmented loop kernels (prefer grid-stride loops)
- Nested kernels (CUDA dynamic parallelism) can get out of control (e.g., too many threads, indirect recursion).
Thread execution bottlenecks:
- Avoid recursion (generally a poor idea overall, except in school assignments, but here it also uses up GPU local memory on the stack).
- Unroll loops with care (unrolling can increase register pressure and cause non-coalesced memory accesses).
- Output in kernel code (i.e., printf calls run slow and have weird buffering delays).
- Complex class objects as kernel parameters (pass-by-reference is not supported and pass-by-value will run the constructor).
- Device debug mode — don’t compile production code with nvcc -G or --device-debug, which turns off all the auto-optimizations on kernel code (you can combine device debugging with optimization by adding the --dopt option).
- nvcc debug and profiler modes — take care whether to use the various debug and profiler flags in production: “-g”, “-pg”, “-profile”, “-debug”, or “-lineinfo”/“--generate-line-info”.
- Debug tracing output or failing assertions slowing down execution.
Instruction execution slugs:
- Warp divergence (from if statements or loops in kernels)
- Low instruction-level parallelism
- Poor instruction locality and instruction cache misses
- Overly large thread code size (bloated kernels)
- Non-restricted pointer declarations
Synchronization slugs:
- Too many calls to cudaDeviceSynchronize
- Overuse of __syncwarp() or __syncthreads()
- Redundant barriers (synchronization barriers that don’t actually stop any race condition)
- Implied hidden synchronizations in many CUDA runtime API calls (e.g., cudaGetLastError, cudaMemcpy, etc.)
- Overuse of unnecessary atomics
Sequential execution slugs (ugh!):
- Serialized kernel execution of multiple different kernels
- Accidentally leaving the “kernel blocking” environment variable (CUDA_LAUNCH_BLOCKING=1) enabled.
Memory usage slugs:
- Using too much global memory (or repeatedly accessing the same data)
- Memory leaks (host or device memory)
- Non-coalesced data access patterns
- Excessive data transfer costs
- Using pageable memory for data transfers (use pinned memory)
- Poor data locality in global memory usage
- Array-of-Structures (AoS) layout has non-contiguous data
- Using shared memory for inter-thread data movement when warp shuffle operations would be faster
- Contention in memory accesses (shared memory or global memory)
- Register spills and “register pressure”
- Cache inefficiency
- Shared memory “bank conflicts” (access contention in shared memory accessed by multiple threads).
- Non-linearized data layout for matrices (2D) or tensors (3D)
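To illustrate the last point, here is a hedged sketch of a linearized row-major matrix (the Matrix struct is mine, not from the book): one contiguous buffer indexed by r*cols+c, which is also the layout CUDA kernels typically use for coalesced access.

```cpp
#include <cstddef>
#include <vector>

// A 2D matrix stored as one contiguous (linearized) buffer in row-major
// order. Contiguous storage gives better locality than a vector-of-vectors,
// and the index formula carries over directly to device code.
struct Matrix {
    size_t rows, cols;
    std::vector<float> data;
    Matrix(size_t r, size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float &at(size_t r, size_t c) { return data[r * cols + c]; }  // row-major
};
```

A 3D tensor linearizes the same way, with index ((d * rows) + r) * cols + c.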
C++ Class Slugs
The C++ class features are designed to add encapsulation and modularity, while retaining speed, but there are still plenty of ways for slugs to crawl into your classes. C++ class optimizations include:
- Ensure small member functions are inline, especially those that do “get” and “set”.
- Add inline to other friend or non-class functions (especially if small or commonly used).
- Pass objects to functions using “const&” (pass-by-reference), rather than pass-by-value.
- Watch out for temporary objects. These can occur in simple assignments or function call expressions, or in weird ways like accidentally giving your overloaded assignment operator the wrong return type.
- Avoid virtual functions (they have a runtime cost and block compile-time optimizations like inlining and constexpr).
- Specialize inherited member functions in derived classes.
- Declare objects as close to first use as possible.
- Initialize objects at declaration, rather than calling a constructor and then later an initialization function.
- Overloaded binary operators can be a slug returning an object. Instead of c=a+b, use a three-parameter function with references, such as add3(a,b,c).
- Use reference variables instead of copying objects into temporary variables.
- Take care templating class objects (e.g., when using the std::vector class for a vector of your class objects). Lots of hidden calls to constructors and destructors may arise in resizing.
- Don’t return objects as the return type of the assignment operators.
- Declare assignment operators as return type void.
- Use the initializer list in the constructor for initializing data members.
- Use friend functions for faster accesses to internal object data.
- Block accidental calls to the copy constructor or class assignment operator (i.e., if you aren’t defining them, make a dummy version that is “private” with a “void” function body).
- Avoid returning objects if you can. Return a reference if it’s safe to do so.
- Take care with “wrapper” classes like “smart pointers”, “smart integers” or “smart buffers”. Usually, they’re safer but slower. How smart is that?
- Singleton classes can be made more efficient than general classes.
- Postfix overloaded increment operators are inherently inefficient because they return a copy (make them return void, or disallow them as a private declaration with a void body).
- Destructor execution with cleanup code can be a slug on program exit (consider avoiding memory deallocations in destructors by setting a global “i_am_shutting_down” flag).
- Reduce object byte size with “largest to smallest” member ordering.
- Put the most-used data member first in the object (it has a zero offset, which is compiled-out).
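The last two bullets can be demonstrated with sizeof (the struct names are illustrative, and the exact sizes assume a typical 64-bit ABI):

```cpp
#include <cstdint>

// Poor ordering: the compiler inserts padding so the 8-byte member is
// aligned, and again at the tail to round the struct up to its alignment.
struct PaddedOrder {
    uint8_t flag;     // 1 byte + 7 bytes padding
    uint64_t count;   // 8 bytes
    uint8_t kind;     // 1 byte + 7 bytes tail padding  -> 24 bytes total
};

// Largest-to-smallest ordering packs the small members together.
struct PackedOrder {
    uint64_t count;   // 8 bytes
    uint8_t flag;     // 1 byte
    uint8_t kind;     // 1 byte + 6 bytes tail padding  -> 16 bytes total
};
```

Smaller objects mean fewer cache lines touched per object, which is often worth more than any single instruction-level tweak.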
Function Slugs
Functions are an important building block of your code. Some ways to get the slugs out of general non-member functions include:
- Declare small functions inline.
- Scour open source libraries for versions of your function already written by experts (e.g., the many CUDA source code libraries).
- Avoid recursion (use loops instead).
- If you can’t fully remove recursion, (a) use tail recursion elimination, (b) collapse recursive calls by unwinding or pre-computing more of the lower-level base cases, or (c) use a stack data structure.
- Call builtin or intrinsic functions.
- Call library functions rather than writing your own.
- Avoid function pointers (they block compile-time optimizations like inline and constexpr).
- Optimize the most frequently called function (profile it!).
- Use parameters as if they’re local variables (reduces register usage).
- Use constexpr functions.
- Pass objects by reference.
- Use const references for objects to allow auto-optimizations by the compiler.
- Declare pointer parameters as __restrict__ for additional compiler optimizations.
- Optimize functions with default arguments by creating a separate specialized function without the extra parameter (i.e., remove the default argument).
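A tiny sketch of replacing recursion with iteration (the function names are mine, not from the book):

```cpp
// Recursive sum of 1..n — every call burns stack space (and on a GPU,
// stack frames live in slow local memory).
long long sum_recursive(int n) {
    return n <= 0 ? 0 : n + sum_recursive(n - 1);
}

// The tail-recursion-eliminated version: same result, a plain loop,
// no call overhead, and no stack growth.
long long sum_iterative(int n) {
    long long total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    return total;
}
```

For non-tail recursion (e.g., tree traversals), the loop version needs an explicit stack data structure, as noted in the list above.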
Medium-Sized Slugs
There are a lot more examples of possible inefficiencies in C++ coding. Some of the types of errors that are “medium-sized” slugs include:
- Avoid non-static automatic array initializations with constant data.
- Don’t put function calls in a loop test (i.e., expensive loop conditional tests).
- Member initializations in the constructor body (they should be in the initializer list).
- The new for loop auto-iterators are elegant, but not necessarily the fastest way.
- Program startup hidden initializations (global or static object constructors).
- Small non-inline functions called frequently.
- Busy wait loops are a worst case.
- Disk I/O with file operations is often slow (e.g., load the whole file into memory).
- Don’t accidentally call your global initialization routine twice (use a flag to check it’s already run).
- Unnecessary code inside loops should be “hoisted” out.
- C++ classes wrapping simple data types (e.g., overuse of “smart pointers” or “smart integer” classes).
- Overuse of standard string concatenation operations.
- Recursion is almost always a slug.
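The "function calls in a loop test" slug deserves a concrete sketch (the functions here are illustrative): calling strlen in the loop condition turns a linear scan into quadratic work.

```cpp
#include <cctype>
#include <cstring>

// Slug: strlen re-scans the whole string on every iteration — O(n^2).
int count_digits_slow(const char *s) {
    int count = 0;
    for (size_t i = 0; i < strlen(s); ++i)
        if (isdigit((unsigned char)s[i])) ++count;
    return count;
}

// Fixed: test for the null byte directly — one pass, O(n).
int count_digits_fast(const char *s) {
    int count = 0;
    for (; *s != '\0'; ++s)
        if (isdigit((unsigned char)*s)) ++count;
    return count;
}
```

Hoisting strlen into a variable before the loop works just as well if you need the index.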
More Slug Repellent
There are plenty of other optimizations in the earlier chapters. Well, actually, in all of the book! Nevertheless, here’s a list of some more C++ code optimization techniques for you to consider. Some of the bigger ideas:
- Use “move constructors” instead of copy constructors where appropriate (since C++11).
- Use static data members where appropriate, so they are initialized once only.
- Use std::sort rather than qsort.
- Don’t put try..catch inside an inner loop that’s a bottleneck.
- Use std::bitset for bit sets or bit vectors.
- Use the “iterators” design pattern rather than returning a full scan of a data structure all at once (saves memory and allows early exit).
- Consider basic C++ arrays instead of std::vector if it has a fixed size (known at compile-time) or its maximum size is small enough.
- Consider C++20 coroutines where appropriate for the architecture.
- Structure-of-Arrays (SoA) data layout is more vectorizable than Array-of-Structures (AoS).
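Here is a minimal sketch contrasting the AoS and SoA layouts (the struct names are mine, not from the book): in the SoA form, a loop over one field reads a dense, contiguous array.

```cpp
#include <cstddef>
#include <vector>

// Array-of-Structures (AoS): x, y, z interleaved in memory, so a loop
// over just x strides through memory and vectorizes poorly.
struct PointAoS { float x, y, z; };

// Structure-of-Arrays (SoA): each field is its own contiguous array —
// friendly to SIMD on the CPU and to coalesced access on the GPU.
struct PointsSoA {
    std::vector<float> x, y, z;
};

float sum_x(const PointsSoA &p) {
    float total = 0.0f;
    for (size_t i = 0; i < p.x.size(); ++i) total += p.x[i];  // dense reads
    return total;
}
```

The trade-off: SoA is clumsier when you always need all fields of one element together, so choose the layout to match the hot loop.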
And some of the smaller optimizations:
- Commonly used object or struct fields should be first. On some platforms, smaller offsets from the start of an object are accessed faster. Also, the very first field has offset zero, which is optimized away, so put the most used field first.
- Avoid long else-if sequences. You are effectively doing linear search on the problem space in a long block of if-else-if statements. The best alternative is to use a switch statement, if the conditions are constants. For non-constant conditions or string comparisons, consider tabularizing the options and/or using heuristics to bifurcate the search space (e.g., start with a switch on the first letter of a string).
- Use compact numeric ranges for switch. If the case numbers are close together, the compiler will probably use a lookup-table in assembler. If the cases are sparse, it can be forced to emit an if-else-if equivalent in machine code.
- Correct choice of loop. If the condition is true at the first iteration, use do-while loops.
- Instead of range-checking a signed integer with “i>=0 && i<MAX”, use a typecast with “(unsigned)i<MAX”, because negatives become large unsigned positives, and a cast from int to unsigned int isn’t a real instruction at run-time.
- Enable the FTZ (“flush-to-zero”) and/or DAZ (“denormals-are-zero”) floating-point modes on your CPU, even though they violate the IEEE 754 standard. You probably don’t care about tiny floating-point numbers in your weight or probability calculations.
- Enable GCC’s floating-point arithmetic speedup options: -ffast-math, -fno-math-errno, -fno-trapping-math, and -ffinite-math-only.
- bsearch is slow. Choose a better method.
- Use static_assert rather than assert (e.g., to check data type sizes).
- Copy arrays by wrapping them in a dummy struct and using C++ struct bitwise assignment. It might be faster than memcpy.
- Use memcpy rather than memmove if you’re sure the arguments won’t overlap.
- Move local non-static objects outside of a critical loop. Reuse the same object rather than re-running constructors and destructors every loop iteration. Add a “reset” member function if needed.
- Use scaling factors that are a power-of-two, so that multiplication or division can be a bitshift.
- Specialize a function with a void and non-void version if you find yourself ignoring the return value sometimes. This avoids all of the calculations to determine the return value inside the void function, because the function itself cannot tell whether or not the caller will use its return value.
- Prefer pre-increment (++i) to post-increment (i++) for non-scalar values. And it’s better to use pre-increment even for “int” types, even though it’s the same, just to get into the habit.
- Use the GCC __builtin_unreachable() statement and the “noreturn” function attribute to help the GCC optimizer identify dead code paths, allowing unreachable code removal (not that we care that much) and better path-specific optimizations on other live paths (e.g., compile-time constant propagation).
- Test the first character of two strings directly with character tests before calling strcmp.
- Replace calls to the “round”, “floor” or “ceil” functions with a type cast to int (as an approximation).
- Consider using the simpler putchar/putc/fputc or puts/fputs functions rather than printf or fprintf.
- Write your own versions of abs and fabs/fabsf (but benchmark it).
- Avoid the floating-point pow function for computing integer powers.
- Instead of strlen("literal"), declare it as an initialized char[] array variable and use sizeof(arr)-1.
- Merge a large number of function parameters into an object. Don’t pass 10 Boolean flags as differently named function parameters. Create an object or structure and make them fields instead.
- Avoid calling strlen in a “for” loop conditional. Compute strlen before the loop, or test for the null byte.
- Merge multiple Boolean function parameters into a bit set packed into an int or long. The gain from passing fewer values as function arguments will be offset by the cost of packing and unpacking bits, but it should still be a net win.
- Use the int type mostly, not char or short. Maybe prefer int to size_t, too.
- Specialize functions being called with a constant for an argument using a template function with an integer parameter. This will increase code size, but the constant will be propagated more at compile-time, and you also don’t have the cost of passing it as an argument.
- Add “noexcept” specifiers to functions wherever it applies, because this allows the compiler to know not to add any extra exception handling code.
- If you’re “searching” an array or set of constant integers, known at compile-time, consider “proceduralization” by putting the numbers as cases in a switch. (Trust the compiler engineers.)
- Consider writing your own faster atoi/itoa functions, as the standard libraries need to handle lots of rare cases, making them slower. (I’m not sure I agree, and you might want to benchmark.)
- Don’t overuse “alignas” to specify address alignments if you don’t need them, as the enforcement of alignment requirements can impose runtime cost.
- sprintf is a slow and unsafe function. snprintf is safer but still slow. Find another way.
- Post-increment can be faster in pointer arithmetic, so prefer the normal idiom “*ptr++” rather than “*++ptr” to scan a vector.
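The unsigned range-check trick from the list above, as a self-contained sketch (the function names are illustrative):

```cpp
// Two comparisons: the straightforward range check of a signed index.
bool in_range_two_tests(int i, int max) {
    return i >= 0 && i < max;
}

// One comparison: the int-to-unsigned cast costs nothing at runtime, and
// any negative value wraps to a huge unsigned value that fails "< max".
bool in_range_one_test(int i, int max) {
    return (unsigned)i < (unsigned)max;
}
```

Compilers often perform this transformation themselves, so as always: benchmark, don’t assume.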