Chapter 39. Timing and Benchmarking

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.

Timing C++ Code

There are a number of reasons why it can be useful to time the execution of a program. Timing C++ code can help determine which statements should be optimized, whereas profilers may only indicate which functions are consuming time. Timing code can also determine the relative efficiency of various operations and give you valuable information about writing code for your machine (e.g., is shifting faster than integer multiplication?).

There are several ways to time your C++ code, some of which have existed for decades, and some that are newer and standardized. Here’s a list of some options:

  • time shell command
  • time library function (from C)
  • clock library function (from C)
  • <chrono> standard C++ library

Another way to examine the efficiency of a C++ operation is to look at the assembly code. This is examined later in the chapter.

If the full execution time for a program is all that is needed, the Linux time command can be used to calculate the time required by a program. There are two versions: a stand-alone utility in /bin, and a command built into most shells (e.g., bash and csh). The command to run is usually:

    time ./a.out

A different executable name can be used, and command-line arguments can also be specified.
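
The bash built-in version reports three figures: "real" (elapsed wall-clock time), "user" (CPU time spent in user code), and "sys" (CPU time spent in the kernel). The output looks something like this (the numbers here are illustrative only):

    real    0m1.234s
    user    0m1.150s
    sys     0m0.062s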

The Chrono Library

The std::chrono library is an awesome piece of work, and has many features. It’s been part of the C++ standard since C++11. I’m only going to touch on a handful of basic measurements here.

Here’s an example of how to measure the duration between two events:

   auto before = std::chrono::high_resolution_clock::now();
   // ... Do something
   auto now = std::chrono::high_resolution_clock::now();
   auto diff = std::chrono::duration_cast<std::chrono::microseconds>(now - before).count();
   std::cout << "Time: " << diff << " microseconds" << std::endl;

There are other ways to do this, as the library is very flexible, with many capabilities. Reading the documentation for this library is enough to make my head spin. Someone had a lot of time to spend on time! Kudos to them. But one way is good enough for timing our C++ code, so after one more example below, let's move on and leave the rest as an exercise for the reader (LOL!).
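
As a hedged illustration of that flexibility, here is a minimal RAII timer sketch (the ScopedTimer class name is invented for this example, not a standard facility). It uses std::chrono::steady_clock, which is guaranteed monotonic and is therefore often a safer choice for interval measurement than high_resolution_clock (which is commonly just an alias for another clock):

    #include <chrono>
    #include <iostream>

    class ScopedTimer {
    public:
        explicit ScopedTimer(const char* label)
            : label_(label), start_(std::chrono::steady_clock::now()) {}
        ~ScopedTimer() {
            auto end = std::chrono::steady_clock::now();
            auto usec = std::chrono::duration_cast<std::chrono::microseconds>(
                end - start_).count();
            std::cout << label_ << ": " << usec << " microseconds" << std::endl;
        }
    private:
        const char* label_;
        std::chrono::steady_clock::time_point start_;
    };

    int main() {
        ScopedTimer timer("loop");  // starts timing on construction
        volatile long sum = 0;
        for (int i = 0; i < 1000000; i++)
            sum = sum + i;
        return 0;
    }   // elapsed time prints here when the timer is destroyed

The RAII pattern means the timing code cannot be accidentally left unreported: the destructor prints the elapsed time whenever the scope exits.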

The Clock Function

If a more detailed speed analysis is needed, it is possible to add C++ self-instrumentation code to your program to monitor its own performance. The basic idea is to use the standard library functions to monitor the time before and after an action. The advantages of the clock function over the new-fangled std::chrono library:

  • Measures CPU clock ticks, not wall clock time.
  • Works in C, if you need it, not only C++.
  • Only have to remember one function name!

The oldest useful function is the "clock" function, which dates back to the C programming language. The clock function counts the number of clock ticks since the program began executing. The "time" function, which keeps track of real calendar time, could also be used, but it is not a true indication of processor time on a busy multi-user system. The clock function is correct for both single-user and multi-user systems.

The clock function returns a value of type clock_t (typically a long or int) that counts the number of clock ticks. This value can be converted to seconds by dividing by the constant CLOCKS_PER_SEC, which is also declared in <time.h> (or <ctime> in C++).

The basic idea of timing C++ code blocks is to call the clock function before and after an operation and examine the difference between the two tick counts. The code below examines the relative speed of shift and multiplication operations on int operands.

    #include <cstdio>   // printf
    #include <ctime>    // clock, clock_t, CLOCKS_PER_SEC

    void profile_shifts()
    {
        const int MILLION = 1000000;
        const int ITERATIONS = 100 * MILLION;

        int x = 1, y = 2, z = 3;

        clock_t before = clock();
        for (int i = 0; i < ITERATIONS; i++)
            x = y << z;
        printf("%d Shifts took %f seconds\n", ITERATIONS,
            (double)(clock() - before) / CLOCKS_PER_SEC);

        before = clock();
        for (int i = 0; i < ITERATIONS; i++)
            x = y * z;
        printf("%d Multiplications took %f seconds\n", ITERATIONS,
            (double)(clock() - before) / CLOCKS_PER_SEC);
    }

Clock Problems

clock Portability Pitfall. Note that some implementations on older Unix versions don't conform to the standard and return the number of clock ticks since the first call to the clock function. This means that a single call to clock at the end of the program would always return zero. Hence, it is more portable to measure the number of clock ticks between two calls to clock, one at the start and one at the end. Obviously, you can also put the first call to "clock" at the start of the "main" function to avoid this rare glitch. Note that on conforming implementations, a call at the start of "main" may return a non-zero value due to the overhead of global and static C++ object instantiations (i.e., constructors for global objects), which occur before entering main.

Clock Tick Integer Division Pitfall. Note that the clock_t type and CLOCKS_PER_SEC constant are both integers. Hence, here’s a bug:

    clock_t diff = clock() - before;
    double seconds = diff / CLOCKS_PER_SEC; // Bug!

The problem is that it's integer division, so the result is inaccurately truncated to an integer. You need a cast to float or double on either side of the division operator.

    clock_t diff = clock() - before;
    double seconds = diff / (double)CLOCKS_PER_SEC; // Correct

Clock Tick Overflow Pitfall. The clock function also has a problem with wraparound on some implementations. Because of its high resolution, the number of clock ticks can quickly overflow the maximum value that can be stored in the type clock_t. For example, a signed 32-bit clock_t with the POSIX-mandated CLOCKS_PER_SEC of 1,000,000 overflows after about 2^31/10^6 ≈ 2147 seconds, which is only around 36 minutes. If the program being timed runs for longer than this period, the use of clock can be misleading. One solution is to use the "time" function rather than "clock" for longer executions, but this usually only has resolution to the nearest second.
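
As a quick check, a small sketch like the following can estimate the wraparound period on your own platform (assuming clock_t is an integer type, as it is on typical Linux systems; a 64-bit clock_t reports an enormous figure and effectively never wraps):

    #include <cstdio>
    #include <ctime>
    #include <limits>

    int main() {
        // Largest representable tick count, converted to minutes.
        double max_ticks = (double)std::numeric_limits<clock_t>::max();
        printf("clock() wraps after about %.1f minutes\n",
            max_ticks / CLOCKS_PER_SEC / 60.0);
        return 0;
    }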

Benchmarking

Benchmarking is a slightly different concept from tuning, and refers to testing the efficiency of certain operations, such as low-level operators, to find a more efficient way to do an operation. For example, if you want to compare multiplication versus addition, you write a program to run these operations a few million times. When changing a program to increase efficiency, you shouldn't assume that a certain operation is clearly faster; you should benchmark whether the changes have noticeably increased the operation's efficiency (or even decreased it!).

Techniques for measuring program efficiency range from the stop-watch method to the use of sophisticated profiler software tools. If no profiler is adequate, the programmer can gain timing information by adding instrumentation statements to the program, although there are many pitfalls in attempting to determine the time taken by a sequence of statements.

The measurement of the memory usage and space-efficiency of a C++ program is a slightly more difficult problem. There are several types of memory: instruction code, static memory, read-only string literals, initialization data, global/static variables, the stack, and the heap. Measuring the memory usage of the stack and heap is somewhat difficult because of their dynamic nature. However, various tools exist to measure the different types of memory, and clever use of C++ programming constructs can also yield reasonable data.

Benchmark programs attempt to examine how quickly your machine executes certain instructions, which is useful for examining the cost of individual operations, such as a single multiplication. You mainly use benchmarking for code that's running in low-level kernels, such as CPU speedups (e.g., AVX intrinsics) or examining the use of different GPU primitives.

Consider benchmarking to time the low-level arithmetic operations on your platform. For example, how would you determine whether the integer multiplication operation x*2 could be more efficiently replaced by x<<1?

How can you time these instructions? You obviously cannot just time a single operation of each with the "clock" function, because a single clock tick contains many CPU cycles. So, you have to time thousands or even millions of such operations.

    for (int i = 0; i < 100 * MILLION; i++) {
        x << 1;
    }

We’ve already noted one problem: there’s all this extra loop overhead time for the for loop conditional test (the “<” operator) and its incrementer (i++). The loop actually has three operations that are all about the same order-of-magnitude cost (i.e., <, ++, <<). To get at the operator cost, we’d need to subtract out the loop overhead. We could, for example, try to time an empty loop without any loop body, and subtract that from our final cost.

Benchmarking Problems

Null effect problems. Another problem is that we cannot easily time the operators with these statements in the loop body:

    x << 1;
    x * 2;

The compiler is clever enough to notice that the x<<1 and x*2 statements have no effect in the program above (and gives “null effect” warnings). The built-in optimizer may even remove them completely. So, they won’t get timed properly, or at all, even in a loop.
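
For instance, GCC with warnings enabled typically reports something like the following for such statements (the exact wording varies by compiler and version):

    warning: statement has no effect [-Wunused-value]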

Add volatility? One possible solution is to force the compiler to avoid this optimization on the original expressions by declaring x as a "volatile" variable.

    volatile int x = 0;

The volatile qualifier tells the compiler that all accesses to x are important, and that it should not remove any. The intended purpose of volatile is to allow the declaration of addresses for memory-mapped I/O, debugger-modified variables, or for variables modified by other programs (e.g., a semaphore modified by another program running concurrently). However, we can use it here to force all accesses to x to occur even if they appear pointless.

On the other hand, by doing this, we’ve lost the ability to see the “real” time cost of these operations when they’re running in normal code. Most variables aren’t volatile.

Anyway, it doesn’t even work properly. Unfortunately, the computations of the << and * operators in x<<1 and x*2 are not being assigned anywhere, so the computations themselves could be optimized out, even though the actual read operations on x must occur because x is volatile. To force the << and * operations to occur, it is necessary to use their result somehow, such as by assigning it to the (volatile) variable x:

    x = x << 1;
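
As an aside, some benchmark harnesses avoid volatile entirely by passing each result through an empty inline-assembly "sink" that forces the compiler to materialize the value. Here is a GCC/Clang-specific sketch (the do_not_optimize helper name is invented for this example; the Google Benchmark library provides a similar DoNotOptimize facility):

    // Empty asm statement with the value as an input: the compiler must
    // compute the value, but no machine instructions are emitted for it.
    inline void do_not_optimize(int value) {
        asm volatile("" : : "r"(value) : "memory");
    }

    // Usage sketch: the shift must now be computed on every iteration.
    // for (int i = 0; i < ITERATIONS; i++)
    //     do_not_optimize(x << 1);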

Although all of the above improvements enhance the previous version, a far better method is to time a loop that performs a huge number of the operations. Hence, we have to use something like these assignment expressions inside the loop:

    x <<= 1;
    x *= 2;

The code given here examines the relative speed of a large number of shift and multiplication operations on int operands:

   const int ITERATIONS = 100 * 1000000; // e.g., 100 million iterations
   volatile int x = 0; // volatile to prevent optimizations
   clock_t before  = clock();
   for (int i = 0; i < ITERATIONS; i++)
       x = x << 1;
   printf("%d Shifts took %f seconds\n", ITERATIONS,
       (double)(clock() - before) / CLOCKS_PER_SEC);
   before = clock();
   for (int i = 0; i < ITERATIONS; i++)
       x = x * 2;
   printf("%d Multiplications took %f seconds\n", ITERATIONS,
       (double)(clock() - before) / CLOCKS_PER_SEC);

Loop Unrolling

Unfortunately, the above method of measuring the speed of operations is not completely accurate, because it also includes the loop overhead (incrementing and testing the loop variable i) and the cost of the assignment of the result to x. The loop overhead can be minimized by placing many operations within the loop, as below:

    const int ITERATIONS = 100 * 1000000; // e.g., 100 million iterations
    volatile int x = 0; // volatile to prevent optimizations
    clock_t before = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        x = x << 1; x = x << 1; x = x << 1; x = x << 1;
        x = x << 1; x = x << 1; x = x << 1; x = x << 1;
        x = x << 1; x = x << 1; x = x << 1; x = x << 1;
        x = x << 1; x = x << 1; x = x << 1; x = x << 1;
        x = x << 1; x = x << 1; x = x << 1; x = x << 1;
    }
    printf("%d Shifts took %f seconds\n", ITERATIONS * 20,
        (double)(clock() - before) / CLOCKS_PER_SEC);
    before = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        x = x * 2; x = x * 2; x = x * 2; x = x * 2;
        x = x * 2; x = x * 2; x = x * 2; x = x * 2;
        x = x * 2; x = x * 2; x = x * 2; x = x * 2;
        x = x * 2; x = x * 2; x = x * 2; x = x * 2;
        x = x * 2; x = x * 2; x = x * 2; x = x * 2;
    }
    printf("%d Multiplications took %f seconds\n", ITERATIONS * 20,
        (double)(clock() - before) / CLOCKS_PER_SEC);

Unfortunately, the assignment operations are needed to prevent the optimizer removing the computations, as discussed above. The only truly effective method of removing the cost of the assignment from the measurement is to time another separate loop, and subtract its time from that of the other loops, as below. This method also automatically accounts for the loop overhead cost, so the multiple operations inside each loop are not needed (and in fact would be incorrect). Our final version of the benchmark program is also made more sophisticated to output the relative magnitude of the two operations:

    #include <cstdio>  // printf
    #include <cmath>   // fabs
    #include <ctime>   // clock, clock_t, CLOCKS_PER_SEC

    void profile_shifts4()
    {
        const int MILLION = 1000000;
        const int ITERATIONS = 1000 * MILLION;
        volatile int x = 0; // volatile to prevent optimizations
        double time1, time2;

        // Time the loop overhead
        clock_t before = clock();
        for (int i = 0; i < ITERATIONS; i++)
            x = 1;
        clock_t loop_cost = clock() - before; // overhead
        double ovtime = (double)(loop_cost) / CLOCKS_PER_SEC;
        printf("%d overhead: %f seconds\n", ITERATIONS, ovtime);

        // Shifts
        before = clock();
        for (int i = 0; i < ITERATIONS; i++) {
            x = x << 1;
        }
        time1 = (double)(clock() - before - loop_cost) / CLOCKS_PER_SEC;
        printf("%d Shifts took %f seconds\n", ITERATIONS, time1);

        // Multiplications
        before = clock();
        for (int i = 0; i < ITERATIONS; i++) {
            x = x * 2;
        }
        time2 = (double)(clock() - before - loop_cost) / CLOCKS_PER_SEC;
        printf("%d Multiplications took %f seconds\n", ITERATIONS, time2);

        // Compare both times, and print percentage difference
        const float ACCURACY = 0.00001f; // maximum error
        if (fabs(time1 - time2) < ACCURACY) // (almost) equal?
            printf("Shift and multiplications: same time\n");
        else if (time1 < time2) {
            printf("Shifts faster by %5.2f percent\n",
                    (time2 - time1) / time2 * 100.0);
        } 
        else {
            printf("Multiplications faster by %5.2f percent\n",
                (time1 - time2) / time1 * 100.0);
        }
    }

Limitations of Benchmarking

Benchmarking of C++ using these timing methods is not perfect, but I've always found it useful. There are various reasons why these benchmark timing results may not be fully accurate:

  • Hard to account for parallelism (e.g., GPU throughput)
  • Single-threaded code is not always a true representation.
  • Pipelining speedups often differ in production code (even for sequential CPU code, such as AVX intrinsics).
  • Loop overhead is hard to separate from the raw operations (as seen above!)
  • Compiler optimizations might modify or even remove the operations being benchmarked.
  • Memory cache hit rates are too high because you’re running tight code accessing only a few addresses.
  • Optimization levels in test mode might not match your production version.
  • Debug modes might not match production (e.g., if running in a debugger).
  • Pipelining by the CPU of many instructions makes it appear better than reality.
  • Unrealistic non-production conditions are being tested.

Compiler optimizations. In this day and age of amazing optimization algorithms, note that on some platforms the benchmarking code above may indicate that shifts and multiplications cost exactly the same. This is most likely an indication that the compiler automatically optimizes any multiplications by powers of two into left shifts. To get the true cost of a multiplication, the expression should be:

    x = x * x;

But even this might be optimized algebraically by a compiler. The only way to know for sure what’s actually being benchmarked is to examine the assembly language.

Examining Assembly Output

Another way of examining the relative costs of particular operations for a particular compiler is to examine the assembly language produced by the compiler. Many compilers have an option to produce assembly language output. For example, under Linux the command may be:

    g++ -S main.cpp

This will produce the assembly language listing for the C++ source file and store it in a new file "main.s" as human-readable text. Without the -S option, the assembly output would be passed to the assembler to create the machine code executable. GCC also has a "-masm" option that controls the different "dialects" of assembly language output (e.g., "intel" or "att"), and a verbosity control on assembly output via the "-fverbose-asm" and "-fno-verbose-asm" options.
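
Combining these flags, a typical invocation to inspect optimized, Intel-syntax assembly on an x86 machine might be:

    g++ -O2 -S -masm=intel -fverbose-asm main.cpp

The -O2 flag matters here: unoptimized assembly often bears little resemblance to what runs in production.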

Another way to generate assembly with GCC is the “-save-temps” option. This option tells GCC to save the temporary assembly language file that it used for the real compilation. Hence, this option can be used with the normal compilation mode to both build the code as normal and also output a “.s” assembly file. The advantage of this GCC “-save-temps” option over “-S” is that you don’t need to create a separate build path for generating assembly text files.

Reviewing assembly code. Examining assembly language instructions produced for C++ operations can be very enlightening. For example, you can determine whether the compiler uses a special increment instruction for the ++ operator. Whether or not the compiler is performing various optimizations can also be examined.

Counting the number of assembly instructions is a simple measure and gives a reasonable indication of how efficiently an operation will be performed. A better method is to determine the number of cycles used by each instruction, but this requires a rather more intimate knowledge of the assembly language being used.

Many useful things can be discovered by examining assembly output. For example, does the expression x*2 generate a multiply instruction or a shift instruction (or an addition instruction to do “x+x”)? Does the compiler notice that x=x+1 can be replaced by x++? Is the integer % remainder operator implemented by a sequence of instructions?
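
A convenient way to answer such questions is to compile a small file of one-line functions and read the generated assembly for each one. A minimal sketch:

    // ops.cpp -- compile with: g++ -O2 -S ops.cpp
    int times_two(int x)  { return x * 2; }   // multiply, shift, or x+x?
    int shift_one(int x)  { return x << 1; }  // baseline shift for comparison
    int incr(int x)       { return x + 1; }   // special increment instruction?
    int remainder8(int x) { return x % 8; }   // mask, branch, or divide sequence?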

Consider the use of the relational operators (e.g., >, <) in expressions such as:

    flag = x > y;

This will often produce a sequence of instructions because of the need to assign flag the value either 0 or 1. The instructions may well look like the following pseudo-assembly language:

    LOAD 10($sp) # Load x (from stack)
    CMP 12($sp) # Compare with y (on stack)
    BGT $1 # Branch if greater than
    LOAD 0 # Result of > operation is 0
    JUMP $2
    $1:
    LOAD 1 # Result of > operation is 1
    $2:
    STORE 14($sp) # Store in flag (on stack)

However, review the assembler for the similar test in if statements, such as:

    if (x > y) ...

For an if statement, the instructions need not be as complex, because there is no need to store the value 0 or 1 anywhere. The assembly language could be similar to branches without computations:

    LOAD 10($sp) # Load x (from stack)
    CMP 12($sp) # Compare with y (on stack)
    BLE $1 # Branch if NOT greater than
    ... # Code for if statement body
    $1:
    ... # Statements after if statement

Examining Object Files

The objdump command is another useful tool on Linux for analyzing binary object files. DUMPBIN is the comparable tool on Windows for MSVS (or you can use the LINK command with the “/DUMP” option). These tools can get to the assembly language text in reverse, by disassembling the binary instructions that are in the object file, in combination with the various symbolic information.

objdump can be used to examine object files in various ways and there are many useful options. The "-d" and "-D" options provide disassembly, where you can examine a full dump of the assembly code in printable form (an alternative to generating assembly text with the compiler's -S option). The "-h" option shows the headers of the object file and "-g" shows debugging information in the file. There are numerous other options, and the "--help" option can be used to list them all. The objdump command is part of GNU Binutils, which also includes other useful binary file tools such as the nm, size, strip, and strings utilities.
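
For example, a typical disassembly invocation with Intel syntax (on x86) might be:

    objdump -d -M intel main.o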

DUMPBIN also has various options that can be used from the Windows command line. The default is "/SUMMARY", which gives a summary of the information about the object file. The "/DISASM" option shows a disassembly of the object file in assembly language form. Also useful is "/SYMBOLS", which shows the symbolic names.
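
For example, from a Visual Studio developer command prompt:

    DUMPBIN /DISASM main.obj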

Performance Tuning Practices

How should the huge number of methods of improving program efficiency be applied to a program? The code transformations that improve the program by a significant amount should be tried first, and the smaller optimizations used only when it is important to squeeze out that last bit of extra speed in bottlenecks. Hence, I suggest the following steps for improving the efficiency of a program:

    1. Time your program to get a baseline (i.e., run a full inference query).

    2. Invoke the C++ compiler’s built-in optimizer.

    3. Profile the code and find the “hot spots.”

    4. Consider a better data structure or algorithm.

    5. Use the major code transformations.

    6. Use smaller code transformations, if speed is crucial.

The first step is to measure your code’s time cost. Otherwise, how will you know whether anything made it better?

The next step is easy: turn on your optimizer. All modern C++ compilers have an option to invoke an optimizer on the code. The optimizer, although it may not always yield a major increase in speed, has one very important advantage — the programmer need not change the code. Hence, if a small improvement is desired, the optimizer can often provide it without much effort.
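
For example, with GCC or Clang, enabling the optimizer is a single compilation flag (or -O3 for more aggressive optimization):

    g++ -O2 -o myprog main.cpp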

Software tuning. Assuming you’re done with all the non-code changes to the system (e.g., hardware, networking), it’s time to examine the C++. You can either start high by looking at the data structures, or start low by optimizing the busiest low-level kernels.

The choice of a better algorithm (usually with different data structures) for a program is not an easy method of program improvement. Simply identifying what would be a better algorithm is a difficult problem! And once identified, the new algorithm must be implemented by the programmer, costing precious man hours. However, this is the best method to achieve an order-of-magnitude increase in the program’s performance.

The next step is to profile in detail the C++ code to determine which functions (or statements) are accounting for most of the program’s time; these are the “hot spots” of the program. This identification of costly statements is best achieved by a profiler, although if I had to take a guess, I’d say look at your vector dot product code. Identifying frequently called functions and deeply nested loops is often adequate. Once the hot spots are identified, all efficiency measures, large and small, should be applied to this code. Any improvement to the efficiency of a statement, no matter how small, will improve the overall efficiency greatly if that statement is executed often.

Once the most costly functions and loops have been optimized, other statements can also be optimized, although the increase in speed will not be as noticeable. Some of the better code transformations to apply are parallelization, loop optimizations (vectorizations), using pass-by-reference for passing structures or objects to functions, and replacing small functions with macros or inline functions.
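
For example, the pass-by-reference transformation avoids copying large objects on every call, as in this hypothetical illustration:

    // Hypothetical illustration: pass-by-reference avoids copying large objects.
    struct BigConfig { int data[1024]; };      // a 4 KB structure

    void by_value(BigConfig cfg);              // copies the whole 4 KB per call
    void by_reference(const BigConfig& cfg);   // passes only a reference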

Make it right first? The speed improvement techniques in C++ can be applied either as the programmer is writing the code, or after the development and debugging of the program. The second approach is often referred to as the “make it right first” rule. However, I believe that the first method is preferable simply because optimizing your program once it is working is a dangerous practice, and often introduces new bugs. Deferring efficiency improvement to the final development stage can also waste programmer time in improving the basic algorithms used in a program. Using efficiency techniques during the development of the program is a much sounder method of improving efficiency.

Tuning Trade-offs

Tuning a program is not always a clear-cut gain. There are several other qualities of a program that tuning for efficiency may affect:

  • Space versus time-efficiency.
  • Robustness of a program.
  • Readability and maintainability of a program.
  • Portability of a program.

There is almost always a trade-off between time and space when making programs run faster. Many of the algorithm improvements sacrifice space for extra speed, such as caching and precalculation. An often overlooked trade-off is between program efficiency and a programmer’s time in making the changes.

Changing a program for efficiency can introduce extra bugs into a program (although you could argue that it might remove bugs, too). If a piece of code has already been debugged, improving its efficiency may not be worth the risk to the robustness of a program.

Many of the program transformations used for efficiency can reduce the readability of a program. Naturally, this also makes it more difficult for a program to be maintained, and since the major cost in a program’s development cycle is usually maintenance, improving efficiency may not be worth it in the long run.

Perhaps surprisingly, the efficiency of a program can usually be increased significantly without affecting portability. A few of the efficiency techniques in this book are platform-specific, but many are generic methods that work across all C++ code.

Almost all of the dangers of improving efficiency are dangers for the programmer. On the other hand, the users of a program will be well pleased by extra responsiveness, and this alone makes efficiency improvement a worthwhile exercise.
