Aussie AI

Chapter 12. Zero Runtime Cost Operations

  • Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
  • by David Spuler, Ph.D.

You want free CPU cycles? You got it! There are plenty of “freebies” in C++!

We’ve already talked about compile-time operations in C++, but here’s a summary of some of the “hints” you can give to the compiler for a free gain, usually via helping the optimizer to do fancier optimizations:

  • inline
  • template
  • const
  • constexpr (also consteval and constinit)
  • noexcept
  • static_assert
  • Restricted pointers (e.g., __restrict)
  • likely/unlikely or __builtin_expect (expressions)
  • [[likely]] and [[unlikely]] path attributes

I’ve missed a bunch of them, so you should re-read those chapters. Those are well-known optimizations via programmer hints.
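As a small illustration of a few of these hints working together, here is a minimal sketch. The helper name padded_size is hypothetical and invented for this example; the point is only that constexpr, noexcept, and static_assert are all resolved by the compiler.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the book): rounds n up to a multiple of 8.
// constexpr lets the compiler evaluate it when the argument is a constant;
// noexcept tells the optimizer that no exception path is needed.
constexpr std::size_t padded_size(std::size_t n) noexcept {
    return (n + 7) / 8 * 8;
}

// Checked entirely at compile time; no runtime code is generated for it.
static_assert(padded_size(13) == 16, "padding arithmetic is wrong");

// The array bound is computed by the compiler, not at runtime.
struct Buffer {
    unsigned char bytes[padded_size(13)];
};

static_assert(sizeof(Buffer) == 16, "Buffer should be 16 bytes");
```

Calling padded_size with a runtime value still works; in that case it is an ordinary (and very likely inlined) function.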

Here are some other ones that are useful. If you see these keywords, these are free or compile-time operations:

  • auto types (type deduction)
  • decltype
  • final
  • override
  • explicit
  • [[nodiscard]] (function attribute)
  • = delete

But there’s always more. Here are some advanced C++ language features that you might think cost real CPU juice, but are free for various language design reasons:

  • Type traits — compile-time type operators (not RTTI).
  • Concepts (C++20) — compile-time guarantees.
  • Static reflection (C++26) — fixing RTTI inefficiencies.
  • Profiles — safety with compile-time validation.
  • Curiously Recurring Template Pattern (CRTP) — useful for devirtualization.
  • Structured bindings — grouped assignments are compile-time processed.
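To make the CRTP item concrete, here is a minimal sketch of devirtualization. The Shape and Square names are hypothetical, chosen only for illustration. The base class calls into the derived class through a static_cast, so there is no vtable and the compiler is free to inline the call.

```cpp
#include <cassert>

// Base class template: forwards to the derived class via static_cast.
// The target of the call is known at compile time, so no virtual dispatch occurs.
template <typename Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

// Derived class passes itself as the template argument (the "curious" part).
struct Square : Shape<Square> {
    double side;
    explicit Square(double s) : side(s) {}
    double area_impl() const { return side * side; }
};

// Generic code written against the base template.
template <typename T>
double total_area(const Shape<T>& a, const Shape<T>& b) {
    return a.area() + b.area();
}
```

The trade-off is that Shape<Square> and any other Shape<T> are unrelated types, so you cannot store them together in one container the way you can with virtual functions.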

Type traits are a form of Compile-Time Type Information (CTTI) and work at compile-time. Some examples are operations like std::is_trivial or std::is_same. However, note that you have to be careful not to move across into the darker side of RTTI, which is dynamic_cast and typeid.
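Here is a short sketch of type traits in use. The Pixel struct and the is_plain_data function are hypothetical names for this example. Every trait below is answered by the compiler, and none of it touches RTTI.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical plain-old-data type for the example.
struct Pixel {
    unsigned char r, g, b;
};

// Both checks are evaluated at compile time and generate no code.
static_assert(std::is_trivial<Pixel>::value, "Pixel should be trivial");
static_assert(std::is_same<decltype(1.0f), float>::value, "1.0f is a float");

// A compile-time query wrapped in a constexpr function.
template <typename T>
constexpr bool is_plain_data() {
    return std::is_trivial<T>::value;
}

// if constexpr (C++17) discards the untaken branch during compilation.
template <typename T>
int copy_strategy() {
    if constexpr (is_plain_data<T>()) {
        return 1;   // e.g., eligible for a raw byte copy
    } else {
        return 2;   // must go through constructors
    }
}
```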

Free Type Cast Operations

There are various arithmetic operations that can look real, but actually disappear in a puff of compiler smoke. The first item on the list is type casts, which have many freebies:

  • reinterpret_cast
  • static_cast
  • const_cast
  • std::move (move semantics)
  • std::forward (perfect forwarding)

Note that std::move is effectively a compile-time type cast, which turns an l-value into an r-value (I’m simplifying the idea here). However, there are also overloads of std::move in <algorithm> that take an iterator range and a destination, and these really do move elements at runtime (effectively doing memcpy for trivially copyable types), so be aware of the distinction between free uses of std::move for move semantics versus real byte movers.
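The two meanings of std::move can be seen side by side in this sketch. The function names take_ownership and move_all are hypothetical wrappers written for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One-argument std::move (<utility>): just a cast to an rvalue reference.
// Any runtime work happens in the move constructor that the cast selects.
inline std::string take_ownership(std::string& src) {
    return std::string(std::move(src));
}

// Range std::move (<algorithm>): a real loop that move-assigns every element
// from [first, last) into the destination.
inline std::vector<int> move_all(std::vector<int>& src) {
    std::vector<int> dst(src.size());
    std::move(src.begin(), src.end(), dst.begin());
    return dst;
}
```

For a vector of int, the range version moves each element by copying it, which a library may implement as a single memmove, but it is still real runtime work proportional to the number of elements.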

Arithmetic type casts between similarly represented numbers can often be optimized away. For example, these are usually free, or at least very fast:

  • Downsizing integer type casts (e.g., int to char)
  • Upsizing integer type casts (e.g., char to int)
  • Floating-point type conversions (e.g., float to double)

Differently sized integer types seem like they would cost real instructions to convert between them. If a char is one byte and an int is four bytes, you’d think there’s an operation that adds or removes three bytes. However, the compiler has many tricks up its sleeves here, such as:

  • Copy propagation
  • Register allocation
  • Peephole optimizations

This is often true of the conversions between any of the many and varied integer types, from a 1-byte char to an 8-byte long long (the usual sizes on mainstream platforms). In the cases where the compiler cannot find a way to do it freely, the operation is very inexpensive anyway.

But note that not all type casts are free. In particular, converting between integers and floating-point types is expensive, in both directions, because the way these two types of values are represented is very different. Be careful with explicit type casts, and also with any expressions that mix integer and floating-point types, because they may contain implicit type casts.
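The following sketch contrasts the cheap integer conversions with the int-to-float conversions described above. The function names are hypothetical, and the comments describe what compilers typically emit rather than a guarantee; check the assembly for your own target.

```cpp
#include <cassert>
#include <cstdint>

// Widening: typically a single sign-extending move, often folded into a load.
inline std::int32_t widen(std::int8_t c) { return c; }

// Narrowing: keeps the low-order byte, often just by using a smaller register name.
inline std::uint8_t narrow(std::uint32_t x) { return static_cast<std::uint8_t>(x); }

// Explicit int -> float: needs a real conversion instruction for each operand.
inline float average(int sum, int count) {
    return static_cast<float>(sum) / static_cast<float>(count);
}

// Implicit int -> double: the mixed expression silently converts x first.
inline double half_of(int x) { return x * 0.5; }
```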

Optimized Away

Here’s a somewhat random list of stuff that should get optimized away by the compiler. We can be reasonably sure these are free:

  • Constant expressions (via “constant folding” and newer constexpr features)
  • Small getter member functions (via inlining)
  • Null-effect expressions (useful for compiling-out assertions)
  • Unnecessary temporary variables (removed by copy propagation, peephole optimizations, and register allocation)
  • Wrongly typed constants (e.g., using 1 or 1U or 1.0 or 1.0f should be implicitly type-converted at compile-time).
  • Double negation (using “!!(x)” is a common trick).
  • Algebraic simplifications (e.g., plus zero, subtract zero, times one, and many more).
  • Explicit zero conditional tests (e.g., if (x != 0) or if (ptr != nullptr) equates to if(x) or if(ptr) at runtime).
  • First data member in an object or structure (its offset is zero, so there’s a “plus zero” in the address calculation that is optimized away).
  • Assertions and #if DEBUG (if compiled-out for production).

The compiler optimization of “dead code elimination” will make these control flow features free:

  • while(1) — using for(;;) isn’t faster!
  • if(true) or if(1) or if(0) or whatever
  • do...while(0) — a common macro trick.
  • Short-circuited constants in || or && operators
  • Tested constants in the ?: ternary operator

You can always check the assembly code with “g++ -S” (or “gcc -S”) or the Visual Studio Disassembly window.
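Here is a small file you could feed to the compiler to try this out. It is only a sketch: MY_CHECK and MY_DEBUG are hypothetical names for a home-grown assertion macro, not a standard facility.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical assertion macro using the do...while(0) trick.
// Without MY_DEBUG it expands to an unevaluated sizeof, so no code is emitted.
#ifdef MY_DEBUG
#define MY_CHECK(cond) \
    do { if (!(cond)) std::fprintf(stderr, "check failed: %s\n", #cond); } while (0)
#else
#define MY_CHECK(cond) do { (void)sizeof(cond); } while (0)
#endif

// Constant folding: the compiler stores 86400, not three multiplications.
constexpr int kSecondsPerDay = 24 * 60 * 60;

// Returns the first positive element, or -1 if there is none.
inline int first_positive(const int* data, int n) {
    MY_CHECK(data != nullptr);
    int i = 0;
    while (1) {                      // compiles the same as for(;;)
        if (i >= n) return -1;
        if (data[i] > 0) return data[i];
        ++i;
    }
}
```

Compiling this with optimization enabled and without MY_DEBUG should show no trace of the check or of the loop condition in the generated assembly.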

Standard Container Operations

A lot of the standard containers have many optimized specializations for builtin types. Hence, if you’re using std::vector<int>, you can expect operations like push_back to be inlined and very fast. All of the contiguous containers have a simple structure, and the non-contiguous linked containers maintain incrementally updated variables, making begin() and end() calls very fast. Similarly, most of the containers maintain a running count of the objects inside, so calls to std::size are as fast as a getter accessing an integer data member (inlined, of course). One exception is std::forward_list, which has no size() member at all.

There are some relatively simple standard C++ data types where operations can often be inlined or optimized away by the compiler:

  • std::pair
  • std::tuple
  • std::optional
  • std::expected
  • std::variant (modern C++ unions)
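As a rough illustration of two of these types, the sketch below returns a std::pair and a std::optional from small functions. The names div_mod and safe_div are hypothetical. With optimization on, compilers can usually keep such small wrapper objects in registers, though that is worth confirming in the assembly.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Returns quotient and remainder together; a small pair of ints
// is typically passed back in registers.
inline std::pair<int, int> div_mod(int a, int b) {
    return {a / b, a % b};
}

// std::optional is a value plus a flag, with no heap allocation involved.
inline std::optional<int> safe_div(int a, int b) {
    if (b == 0) return std::nullopt;
    return a / b;
}
```

A caller can unpack the pair with a structured binding, such as auto [q, r] = div_mod(17, 5), which is itself processed at compile time.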

Finally, note that some calls to containers can lead to memory allocations, which is a slowdown. And various containers when used on your own non-scalar objects can trigger many calls to constructors or assignment operators, which is slow regardless of whether it calls copy or move versions. I mean, moving is better than copying an object, but the optimizer can only do so much.

The Opposite of Free

There are also features of C++ that look like they should be free, but are actually costly. Perhaps we should call them “costlies”?

Elegance and the beauty of short code sequences are not the same thing as speed. Here are some examples of beautiful things that can be slow:

  • Calls to virtual functions
  • RTTI (i.e., dynamic_cast and typeid)
  • Lambdas, functors and other function objects
  • std::function
  • Comparators (except maybe standard ones like std::less)
  • Fold expressions (expanded at compile time, but every expanded operation still runs)
  • Exception handling

The issue with lambdas and function objects is not clear-cut. If you use a lambda with a simple capture and an immediate assignment to a functor variable, which is then called, the optimizer probably can handle this and inline the function call. However, if you declare your own complex lambda as a comparator that is sent to a function (e.g., to std::sort), the optimizer may decide the body is too large to inline at every call site, and wrapping the lambda in std::function or converting it to a function pointer makes inlining less likely still. Those un-inlined calls can become a performance bottleneck.

Also, if you use a builtin comparator like std::greater and pass it to std::sort or other library functions, the library function is a template instantiated on the comparator’s type, so the comparison is likely to be inlined rather than executed as a real function call. However, you might want to benchmark this or look at the generated assembly to confirm!
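Here are three ways of handing the same comparison to std::sort, as a starting point for your own benchmarking. The function names are hypothetical. Which versions actually get inlined depends on your compiler and optimization level, so treat the comments as expectations to verify, not guarantees.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Lambda: its unique type is a template argument of std::sort,
// so the compiler can see the body at the call site.
inline void sort_desc_lambda(std::vector<int>& v) {
    std::sort(v.begin(), v.end(), [](int a, int b) { return a > b; });
}

// Standard comparator: same mechanism, with a library-provided functor.
inline void sort_desc_std(std::vector<int>& v) {
    std::sort(v.begin(), v.end(), std::greater<int>());
}

// std::function: type erasure hides the body behind an indirect call,
// which is usually the hardest case for the optimizer.
inline void sort_desc_erased(std::vector<int>& v) {
    std::function<bool(int, int)> cmp = [](int a, int b) { return a > b; };
    std::sort(v.begin(), v.end(), cmp);
}
```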

And here are some more slugs that are less obvious, because the code is concise and looks like it should be fast:

  • Operator overloading (looks like a single instruction, but it’s a function call, even if it’s inlined).
  • Initializer lists (can call lots of copy constructors).
  • Pointer-to-function types (usually cannot be inlined, unless the compiler can prove which function is being called).
  • Implicit type conversions (especially via overloaded type cast operators).
  • Temporary object creation (accidental)
  • Type casts between int and float (explicit or implicit)
  • Container resize() calls
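Accidental temporary objects are easy to demonstrate with a class that counts its own copies. This is a minimal sketch; the Counted class and both function names are hypothetical and exist only for this example.

```cpp
#include <cassert>

// Hypothetical class that counts how many times it is copied.
struct Counted {
    static int copies;
    int value;
    explicit Counted(int v) : value(v) {}
    Counted(const Counted& other) : value(other.value) { ++copies; }
};
int Counted::copies = 0;

// Pass-by-value: every call with an existing object makes a copy.
inline int read_by_value(Counted c) { return c.value; }

// Pass-by-const-reference: looks almost identical, but makes no copy.
inline int read_by_ref(const Counted& c) { return c.value; }
```

The two call sites look the same in the calling code, which is exactly why this kind of cost is easy to miss. For a class that owns heap memory, each of those hidden copies can also mean an allocation.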

Modern C++ is becoming such a complex language with conflicting goals of elegance and performance, so it’s hard to know which things are freebies or costlies.

 
