Aussie AI

Chapter 16. Debugging Strategies

Book Excerpt from "Safe C++: Fixing Memory Safety Issues"

by David Spuler

General Debugging Techniques

A lot of the work in debugging programs is nothing special: it’s just basic C++ coding mistakes. Most of the errors in coding are ordinary, boring coding errors that every C++ programmer is prone to. There are a variety of ways to go wrong in handling pointers and addresses, from basic beginner mistakes to traps that can catch the experienced practitioner.

The best way to catch a bug is to try to make it happen early. We want the program to crash in the lab, not out in production. In this regard, some of the best practices are about auto-detecting the failures in your code, rather than waiting for them to actually cause a crash:

Check every return code (even the harmless functions that can “never” fail).
Use macro wrappers to help handle errors.
Add debug wrapper functions and enable them while testing.
Run valgrind or a sanitizer (e.g., AddressSanitizer) on your code regularly.
Thrash the code in many ways in the nightly builds.
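As a rough illustration of the first two items, here is a minimal sketch of a macro wrapper that checks a return code and logs the source location on failure. The names CHECK, report_check, delete_file_checked, and g_check_failures are hypothetical, made up for this example rather than taken from any library.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical counter so that tests (or a nightly build) can see how many checks failed.
static int g_check_failures = 0;

// Logs a failed check with its source location and passes the result through.
inline bool report_check(bool ok, const char* expr, const char* file, int line)
{
    if (!ok) {
        ++g_check_failures;
        std::fprintf(stderr, "CHECK FAILED: %s (%s:%d)\n", expr, file, line);
    }
    return ok;
}

// CHECK(expr) evaluates expr exactly once; it yields true on success, false on failure.
#define CHECK(expr) report_check(static_cast<bool>(expr), #expr, __FILE__, __LINE__)

// Example: std::remove "never" fails on the happy path, but its return code is checked anyway.
inline bool delete_file_checked(const char* path)
{
    return CHECK(std::remove(path) == 0);
}

// Usage sketch: removing a file that does not exist is reported once and counted.
inline void check_macro_demo()
{
    const int before = g_check_failures;
    assert(!delete_file_checked("no_such_dir_for_demo/no_such_file.tmp"));
    assert(g_check_failures == before + 1);
}
```

Because the macro returns the result, the caller can still decide what to do about the failure; the wrapper only guarantees that it does not pass silently.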

If you mess up, and a bug happens in the production backend of your AI training run, I suggest this: blame the data scientists. Surely, the problem was in the training data, not in my perfect C++ code. And if that doesn’t work, well, obviously the GPU was overheating.

Very Difficult Bugs. Some bugs are like roaches and keep coming out of the woodwork. General strategies for solving a tricky bug include:

Can you reproduce it? That’s the key.
Write a unit test that triggers it (if you can).
Try to cut down the input to the smallest case that triggers the fault.
Gather as much information about the context as possible (e.g., if it’s a user-reported error).

Your debugging approach should include:

Run valgrind or sanitizers to check for memory glitches.
Think about what code you just changed recently (or was just committed to the repo by someone else!).
Memory-related failures often cause weird errors nowhere near the cause.
Review the debug trace output carefully (i.e., it may be that some other part of the code failed much earlier).
Step through the code in gdb about ten more times.
Run a static analysis (“linter”) tool on the code.
Run an AI copilot debugger tool. I hear they’re terrific.
Refactor a large module into smaller functions that are more easily unit-tested (often you accidentally fix the bug!).

If you really get stuck, you could try talking to another human (gasp!). Show your code to someone else and they’ll find the bug in three seconds.

Bug Symptom Diagnosis

It is very beneficial to the debugging process to be able to identify the cause of an error from its symptoms. Unfortunately, this is a very difficult process — otherwise debugging would be easy! Nevertheless, there are some common run-time errors with well-known causes, and this section attempts to provide a brief catalog of common error causes, mapping observable failure symptoms into the common errors.

Linux core dumps

There are a number of run-time error messages that occur mainly on Linux machines. Some of the common run-time error messages are:

Segmentation fault
Bus error
Illegal instruction
Trace/BPT trap

The message "core dumped" will often accompany the error message if it causes program termination, and this indicates that a file named "core" has been saved in the current directory. The "core" file can be used for postmortem debugging to locate the failure with a symbolic debugger.

Note that the dump of the core file can be prevented by providing an empty file named "core" that is set to protection mode 000 using chmod. This may be useful if disk space is limited and the core dumps are huge.

A segmentation fault occurs when the hardware detects a memory access by the program that attempts to reference memory it is not allowed to use. For example, the address NULL cannot be referenced, and in fact, the single most common cause of a segmentation fault (at least for the experienced programmer) is a null dereference, but there are many other causes.

A bus error occurs when an attempt is made to load an incorrect address into an address bus. Although this leads us to suspect bad pointers, this error can also arise via stack corruption (because this can cause bad pointer addresses), and so there are a variety of potential causes.

Segmentation faults and bus errors may be reported as the program receiving signal SIGSEGV or SIGBUS in some situations. The most common causes of a segmentation fault or bus error are listed below. Different architectures will have different results for these errors, but will usually produce either a segmentation fault or bus error.

Null pointer dereference.
Wayward pointer dereference (memory allocation problem).
Non-initialized pointer dereference.
Array index out of bounds.
Wrong number or type of arguments to non-prototyped function.
Bad arguments to scanf or printf.
Forgetting the & on arguments to scanf.
Deallocating non-allocated location using free or delete.
Deallocating same address twice using free or delete.
Executable file removed/modified while being executed (dynamic loading).
Stack overflow due to function calls or automatic variables.

Another common abnormal termination condition for Linux machines is the message "illegal instruction," which usually causes a core dump. The most common causes of this method of termination are:

assert macro has failed (causes abort call).
abort library function called.
Data has been executed somehow (uninitialized pointer-to-function?).
Stack corruption (e.g., write past end of local array).
Stack overflow (not the website).
C++ exception problem causing abort call.
Unhandled exception was thrown.
Unexpected exception from function with interface specification.
Exception thrown in destructor during exception-related stack unwinding.

Another run-time error message for Linux machines is the message "fixed up non-aligned data access," although this does not necessarily lead to program termination. This indicates that hardware has detected an attempt to access a value through an address with incorrect alignment requirements. Typically it refers to attempting to read or write an integer or pointer at an odd-valued address (i.e., an address that is not word-aligned). Note that on machines without this automatic "fix-up" the same code will probably cause a bus error.
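One common way to avoid misaligned accesses when pulling values out of a byte buffer is to copy the bytes into a properly aligned local variable instead of casting the pointer. The sketch below assumes a hypothetical helper named read_u32_unaligned; it is an illustration, not a claim about how any particular codebase does it.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Risky: *reinterpret_cast<const std::uint32_t*>(p) when p is not suitably aligned.
// Portable alternative: copy the bytes into a properly aligned local variable.
inline std::uint32_t read_u32_unaligned(const unsigned char* p)
{
    std::uint32_t value = 0;
    std::memcpy(&value, p, sizeof(value));  // a byte-wise copy has no alignment requirement
    return value;
}

// Usage sketch: store a value at a deliberately misaligned offset and read it back.
inline void unaligned_read_demo()
{
    alignas(std::uint32_t) unsigned char buffer[8] = {0};
    const std::uint32_t original = 0x12345678u;
    std::memcpy(buffer + 1, &original, sizeof(original));  // buffer + 1 is not 4-byte aligned
    assert(read_u32_unaligned(buffer + 1) == original);
}
```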

Program hangs infinitely

When one is faced with debugging a program that seems to get stuck, it is important to determine what type of "hang" has occurred. If the program is simply stuck in an infinite loop, you will still have control of the program and can interrupt it. One method of finding out where the program is stuck is to run the program from a debugger, or to use the keyboard interrupt <ctrl-\> to cause a core dump, which can then be examined by a debugger. Some causes of this form of infinite looping are:

NP-complete algorithm (i.e., basically anything in AI).
Infinite loop is occurring due to logic bug or coding error.
Looping down to zero with a size_t index variable (i.e., unsigned).
Accidental semicolon on end of while/for loop header.
exit called within a destructor of global object (C++ only).
Handled/ignored signal is recurring (e.g., SIGSEGV, SIGBUS).
Waiting for input: getc/getch assigned to char.
Linked data structure corrupted (contains pointer cycles).
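The size_t item in this list is easy to write by accident. Here is a minimal sketch of the problem and one common fix; the function name reversed_copy is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// BUGGY (shown as a comment only, because it never terminates):
//   for (std::size_t i = v.size() - 1; i >= 0; --i) { ... }
// An unsigned i is always >= 0, so decrementing past zero wraps around to a huge value.

// Safer variant: test before decrementing, so i takes the values n-1, ..., 1, 0 and then stops.
inline std::vector<int> reversed_copy(const std::vector<int>& v)
{
    std::vector<int> out;
    out.reserve(v.size());
    for (std::size_t i = v.size(); i-- > 0; ) {
        out.push_back(v[i]);
    }
    return out;
}

// Usage sketch, including the empty case that trips up the buggy version immediately.
inline void countdown_demo()
{
    assert((reversed_copy({1, 2, 3}) == std::vector<int>{3, 2, 1}));
    assert(reversed_copy({}).empty());
}
```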

If the program hangs for a period of time and then crashes, a likely candidate is a runaway recursive function. This will loop (almost) infinitely, consuming stack space all the time, until it runs out of stack space and:

(a) Terminates abnormally, or

(b) The stack overwrites some important memory and the second, more severe form of "hang" occurs.

The most severe form of a "hung" program is one that will not respond. You know it's a bad bug when the reset button is the only thing that works. When this occurs, I recommend the use of any compiler run-time checks, especially stack overflow checking and array bounds checking (if available). An additional method is to recompile using a sanitizer or a memory allocation debugging library. Some possible causes of a non-responsive program are:

Infinite recursion error.
Stack overflow for other reasons.
Array index out of bounds.
Modification via wayward pointer.
Modification via non-initialized pointer.
Modification via null pointer.
Freeing a non-allocated block.
Freeing a string constant.
Non-terminated string was copied.
Inconsistent compiler/linker options.

Failure after long execution

A very annoying error is that of a program that runs perfectly for a long period of time and then suddenly fails for no apparent reason. This usually indicates a "memory leak" causing the system to use up all available memory and malloc to return NULL. However, there are other causes and a more complete list is:

Untested rare sequence of events is causing the error (try to repeat it).
Heap memory leak causing allocation failure (allocated memory not deallocated).
Running out of FILE* handles (files opened but not closed).
Some form of memory corruption (symptom of bug doesn't appear immediately).
Integer overflow (e.g., of some 16-bit or 32-bit counter).
Disk filling up (e.g., excessive logging).
Unhandled peripheral error (e.g., printer out of paper).
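The integer overflow item explains why such failures only show up after a long run: a narrow counter behaves perfectly until it wraps. A small sketch, using hypothetical function names, makes the wraparound visible:

```cpp
#include <cassert>
#include <cstdint>

// Counts events in a 16-bit unsigned counter, as a long-running program might.
inline std::uint16_t count_with_16_bits(std::uint32_t events)
{
    std::uint16_t counter = 0;
    for (std::uint32_t i = 0; i < events; ++i) {
        ++counter;  // wraps from 65535 back to 0 (well-defined for unsigned types)
    }
    return counter;
}

// The same count with a 64-bit counter, which will not wrap in any realistic run.
inline std::uint64_t count_with_64_bits(std::uint32_t events)
{
    std::uint64_t counter = 0;
    for (std::uint32_t i = 0; i < events; ++i) {
        ++counter;
    }
    return counter;
}

// Usage sketch: everything looks fine for 65,535 events, and then the count silently resets.
inline void counter_overflow_demo()
{
    assert(count_with_16_bits(65535) == 65535);
    assert(count_with_16_bits(65536) == 0);
    assert(count_with_64_bits(65536) == 65536);
}
```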

Optimizer-only bugs

A program that runs correctly with normal compilation but fails when the optimizer is invoked is a well-known problem. The immediate reaction is to blame a bug in the optimizer. Although such bugs are not so rare as one would wish, there are a number of other potential causes. Memory safety errors are a likely cause! However, if your runtime memory checker is showing no errors, it's something else. Either way, an optimizer-only failure is usually an indication that some erroneous or non-portable code has been working correctly more by luck than good programming, and the more aggressive optimizations have shown up the error. Some possible causes are:

Order of evaluation errors (optimizer rearranges expressions).
Special location not declared volatile.
Use of an uninitialized variable.
Wrong number/type of arguments to non-prototyped function.
Wrong arguments to prototyped function not declared before use.
Memory access problems (optimizer has rearranged some memory).

In this situation it may be useful to examine which compiler options are available to control the optimizations that are applied. For example, there may be an option to choose between traditional stack-based argument passing and pass by register. If so, recompilation with and without that option can help to test for argument passing errors.
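For the order-of-evaluation item in the list above, the usual repair is to split the expression so that each statement changes the variable at most once. A minimal sketch, assuming a hypothetical function fill_with_index:

```cpp
#include <cassert>

// Fragile one-liner (do not use):  a[i] = i++;
// Whether a[i] sees the old or the new value of i is not something to rely on
// across compilers, optimization levels, and language versions.

// Explicit version: each statement has at most one side effect on i.
inline void fill_with_index(int* a, int n)
{
    int i = 0;
    while (i < n) {
        a[i] = i;  // store using the current index
        ++i;       // then advance
    }
}

// Usage sketch: the result no longer depends on how the compiler orders the expression.
inline void evaluation_order_demo()
{
    int a[4] = {-1, -1, -1, -1};
    fill_with_index(a, 4);
    assert(a[0] == 0 && a[1] == 1 && a[2] == 2 && a[3] == 3);
}
```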

Failure disappears when debugger used

A really annoying situation is a program that crashes when run normally, but does not fail when run via a symbolic debugger or interpreter. One fairly well-known cause is the use of an uninitialized automatic variable. The error may disappear when run via the debugger, because some debuggers set these local variables to zero or NULL initially. Thus, the possible causes are mainly memory access problems, where the debugger has rearranged memory somehow:

Using uninitialized variable (especially a pointer).
Array index out of bounds.
Modification via wayward pointer.
Modification via non-initialized pointer.
NULL pointer dereference.
Modification via null pointer.
Freeing a non-allocated block.
Freeing a string constant.

The list of errors possibly causing a memory-related problem is comparable with the list of errors causing a non-recoverable hung program.

Program crashes at startup

When a C++ program crashes on program startup, without even executing the first statement in main, we must suspect constructors of global objects. Use a run-time debugger to determine if main has been entered; but note that some debuggers allow debugging of constructors before main and others do not. Alternatively, place an output statement as the very first statement in main (even before the first declaration!) to ensure that the problem really is arising before main, rather than from instructions in main. Once a constructor problem has been identified, finding the root cause of the problem is a debugging matter. There are no forms of error particular to constructors, so the problem is something being done by a constructor that is probably some type of other error (e.g., a memory stomp error).
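A minimal sketch of this kind of startup tracing is shown here. StartupTracer, the two global objects, and trace_main_entered are hypothetical stand-ins for whatever globals the real program has.

```cpp
#include <cassert>
#include <cstdio>

// Plain int with a constant initializer: it is set up before any global constructor runs.
static int g_globals_constructed = 0;

// Hypothetical helper: announces each global object as it is constructed.
struct StartupTracer {
    explicit StartupTracer(const char* name)
    {
        ++g_globals_constructed;
        // stderr is normally unbuffered, so the message should appear even if a later constructor crashes.
        std::fprintf(stderr, "[startup] constructed global: %s\n", name);
    }
};

// Stand-ins for real global objects; within one source file they are constructed in declaration order.
static StartupTracer g_trace_config("config");
static StartupTracer g_trace_cache("cache");

// Call this as the very first statement of main.
inline void trace_main_entered()
{
    std::fprintf(stderr, "[startup] main entered after %d global constructors\n", g_globals_constructed);
    assert(g_globals_constructed == 2);
}
```

If the last line printed before the crash names a global object rather than main, the search narrows to that constructor or the one that runs after it.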

Program crashes on exit

The program can fail in a few obscure ways at the end of execution. Careful consideration of what actions are taking place at the end of execution is important (e.g., destructors are invoked in C++; any functions registered with atexit will be called). In my experience this failure is most common during the learning phase of C++ programming, when destructor errors are common.

delete operation in object destructor is trashing memory.
Destructor in global object calls exit.
main accidentally declared returning non-int (e.g., missing semicolon on class or struct declaration above main).
setbuf buffer is a non-static local variable of main.
No call to exit, and no return statement in main (a few platforms only).
File closed twice (e.g., double fclose error).

Function apparently not invoked

Consider the situation where you are debugging a program, and discover that a particular function seems to be having no effect. You put an output statement at its first statement and no output appears. Why isn't the function being invoked? Some possible causes are:

No call to the function (!), e.g., you didn't rebuild properly (your fault), or a source code repo issue (someone else's fault) or you're looking in the wrong C++ source file (I've done it many times).
Control flow or conditional test controlling the call is wrong.
Missing brackets on function call (null-effect).
Function is a macro at call location.
Function is a reserved library function name (wrong function is getting called).
Missing semicolon or statement in if statement above the function call.
Nested comments unclosed and deleting call to function.

Garbage output

When a program runs and produces strange output there are a number of possibilities (mostly related to misusing string variables). Note that it is important to distinguish whether the output of a statement is entirely garbage or whether it has a correct prefix (which may indicate a non-terminated string). Some causes are:

Uninitialized local or allocated variable.
Constructor not initializing all data members.
Missing argument to printf %s format.
Wrong type argument to printf %s format.
Returning address of automatic local string array.
Stack corruption (local array buffer overrun).
strncpy leaves string non-terminated.
Pointer variable not initialized.
Address has already been deallocated.
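The strncpy item deserves a closer look, because the output typically has a correct prefix followed by garbage. When the source string is at least as long as the count, strncpy copies exactly that many characters and writes no terminator. Here is a minimal sketch of a wrapper that always terminates; the name copy_terminated is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies src into dst (capacity dst_size bytes), truncating if needed and always terminating.
inline void copy_terminated(char* dst, std::size_t dst_size, const char* src)
{
    if (dst == nullptr || dst_size == 0) {
        return;  // nowhere to write
    }
    if (src == nullptr) {
        dst[0] = '\0';  // treat a null source as an empty string
        return;
    }
    std::strncpy(dst, src, dst_size - 1);  // leave room for the terminator
    dst[dst_size - 1] = '\0';              // strncpy does not add one on truncation
}

// Usage sketch: a too-long source is truncated to "abc" instead of overrunning the buffer.
inline void copy_terminated_demo()
{
    char small[4];
    std::memset(small, 'X', sizeof(small));
    copy_terminated(small, sizeof(small), "abcdef");
    assert(std::strcmp(small, "abc") == 0);
}
```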

Failure on new platform

When a program appears to be running successfully on one machine, it is by no means guaranteed that porting the source code and recompiling on a new machine will not lead to new errors. When a new error is discovered, the first thing that must be tested is whether the same error exists for the same test data on the original machine. The bug might not be a portability problem — it might be an untested case.

However, if the bug appears on one machine but not on another there are a few common causes. The most frequent portability problem is a memory corruption error since these will often lurk undetected on one machine, and appear in the new memory layout of a different environment.

Other possible causes are different compilation results that may arise when a new compiler uses more aggressive optimization. Hence, code that relies on an undocumented compiler feature (e.g., left to right function argument evaluation) may suddenly fail. Note that this implies that portability errors can arise after a compiler upgrade on the same machine, as well as when moving code to a new machine. Some common causes of portability errors are:

Memory corruption errors.
Array index out of bounds.
Modification via wayward pointer.
Modification via non-initialized pointer.
Null pointer dereference.
Modification via null pointer.
Freeing a non-allocated block.
Freeing a string constant.
Function has no return statement.
Order of evaluation error.
Operator order-of-evaluation: a[i]=i++;
Function argument order-of-evaluation: fn(i,i++);
Global object construction in separate files.
Special location not declared volatile.
Use of an uninitialized variable.
Constructor not initializing all data members.
new doesn't initialize non-class types.
malloc doesn't initialize any types.
Bit-field is plain int.
Plain char is signed/unsigned.
getc/getchar return value assigned to char.

Most of these causes are fairly self-explanatory. However, the appearance of "function has no return statement" in the list may appear surprising: surely this will cause a bug on all implementations? In fact, most C++ implementations will issue a compilation warning or error, but this is not always true, and not applicable to C compilers. It has been observed surprisingly frequently that a function that terminates without a return statement might accidentally return the correct value. Typically, this surprising outcome occurs if, by coincidence, a local scalar variable that is intended to be returned happens to be in the hardware register that is used to hold the function return value. Since that register is not loaded when no return statement is found, the correct result is accidentally returned, and there is no failure until a different compiler or environment is used.

Some compilers have compilation options to change various compiler-dependent features. For example, there may be options to change the default type of plain char and/or plain int bit-fields to signed or unsigned. If it is suspected that this may be the cause of the error, the code can be recompiled with different option settings to confirm this. Any run-time error checking options such as memory allocation debugging and stack overflow checking should also be enabled.
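Rather than depending on a compiler option, code can be written so that it does not care whether plain char is signed. The sketch below shows one way to do that, using the hypothetical helpers byte_value and is_alpha_portable:

```cpp
#include <cassert>
#include <cctype>

// Converts a plain char to its byte value 0..255, whether plain char is signed or unsigned.
inline int byte_value(char c)
{
    return static_cast<unsigned char>(c);
}

// The <cctype> functions need a value representable as unsigned char (or EOF);
// passing a negative plain char is undefined behavior, so convert first.
inline bool is_alpha_portable(char c)
{
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Usage sketch: the same answers on signed-char and unsigned-char platforms.
inline void plain_char_demo()
{
    const char high = '\xE9';  // negative where plain char is signed, 233 where it is unsigned
    assert(byte_value(high) == 233);
    assert(is_alpha_portable('A'));
    assert(!is_alpha_portable('7'));
}
```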

Making the Correction

An important part of the debugging phase that is often neglected is actually making the correction. You’ve found the cause of the failure, but how do you fix it? It is imperative that you actually understand what caused the error before fixing it; don’t be satisfied when a correction works and you don’t know why.

Here are some thoughts on the best practices for the “fixing” part of debugging:

Test it one last time.
Add a unit test or regression test.
Re-run the entire unit test or regression test suite.
Update status logs, bug databases, change logs, etc.
Update documentation (if applicable).

Another common pitfall is to make the correction and then not test whether it actually fixed the problem. Furthermore, making a correction will often uncover (or introduce!) another new bug. Hence, not only should you test for this bug, but it’s a very good idea to use extensive regression tests after making an apparently successful correction.

Level Up Your Post-Debugging Routine. Assuming you can fix it, think about the next level of professionalism to avoid having a repetition of similar problems. Consider doing followups such as:

Add a unit test or regression test to re-check that problematic input every build.
Write it up and close the incident in the bug tracking database like a Goody Two-Shoes.
Add safety input validation tests so that a similar failure is tolerated (and logged).
Add a self-check in a C++ debug wrapper function to check for it next time at runtime.
Is there a tool that would have found it? Or even a grep script? Can you run it automatically? Every build?
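As one possible shape for the debug wrapper item, here is a minimal sketch of a self-checking wrapper around memcpy that tolerates and logs a null pointer instead of crashing. The names debug_memcpy, MEMCPY, DISABLE_DEBUG_WRAPPERS, and g_debug_wrapper_errors are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical counter of problems caught by the wrappers.
static int g_debug_wrapper_errors = 0;

// Debug wrapper: checks the pointer arguments before forwarding to the real memcpy.
inline void* debug_memcpy(void* dst, const void* src, std::size_t n, const char* file, int line)
{
    if (dst == nullptr || src == nullptr) {
        ++g_debug_wrapper_errors;
        std::fprintf(stderr, "memcpy: null pointer argument (%s:%d)\n", file, line);
        return dst;  // tolerate the error and log it, rather than crash
    }
    return std::memcpy(dst, src, n);
}

// Assumed build switch: define DISABLE_DEBUG_WRAPPERS to call memcpy directly.
#ifndef DISABLE_DEBUG_WRAPPERS
#define MEMCPY(dst, src, n) debug_memcpy((dst), (src), (n), __FILE__, __LINE__)
#else
#define MEMCPY(dst, src, n) std::memcpy((dst), (src), (n))
#endif

// Usage sketch: a good call copies the bytes, a bad call is counted and skipped.
inline void debug_wrapper_demo()
{
    char src[4] = "abc";
    char dst[4] = "";
    char* missing = nullptr;
    const int before = g_debug_wrapper_errors;
    MEMCPY(dst, src, sizeof(src));
    assert(std::strcmp(dst, "abc") == 0);
    assert(debug_memcpy(missing, src, sizeof(src), __FILE__, __LINE__) == nullptr);
    assert(g_debug_wrapper_errors == before + 1);
}
```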

Production-Level Code

As with all applications, there’s another level needed to get the code out the door into production. Some of the issues for fully production-ready C++ code include:

Validate function parameters (don’t trust the caller or the user).
Check return codes of all primitives.
Handle memory allocation failure (e.g., graceful shutdown).
Add unique error message codes for supportability.
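These items can be combined in a single routine. The following is a rough sketch only: the error codes, the size limit kMaxElements, and the function allocate_buffer are assumptions invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <memory>
#include <new>

// Hypothetical error codes: unique numbers are easy to search for in logs and support tickets.
enum ErrorCode {
    ERR_OK = 0,
    ERR_BAD_SIZE = 1001,
    ERR_OUT_OF_MEMORY = 1002
};

const std::size_t kMaxElements = 1000000;  // assumed sanity limit for this example

// Validates the parameter, handles allocation failure, and reports a unique code.
inline ErrorCode allocate_buffer(std::size_t count, std::unique_ptr<float[]>& out)
{
    if (count == 0 || count > kMaxElements) {
        std::fprintf(stderr, "E%d: invalid buffer size %zu\n", static_cast<int>(ERR_BAD_SIZE), count);
        return ERR_BAD_SIZE;
    }
    // The nothrow form returns nullptr on failure; the trailing () zero-initializes the elements.
    float* raw = new (std::nothrow) float[count]();
    if (raw == nullptr) {
        std::fprintf(stderr, "E%d: out of memory for %zu floats\n", static_cast<int>(ERR_OUT_OF_MEMORY), count);
        return ERR_OUT_OF_MEMORY;
    }
    out.reset(raw);
    return ERR_OK;
}

// Usage sketch: bad input is rejected with a code, good input yields a zeroed buffer.
inline void allocate_buffer_demo()
{
    std::unique_ptr<float[]> buffer;
    assert(allocate_buffer(0, buffer) == ERR_BAD_SIZE);
    assert(buffer == nullptr);
    assert(allocate_buffer(16, buffer) == ERR_OK);
    assert(buffer != nullptr && buffer[15] == 0.0f);
}
```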

Let’s not forget that maybe a little testing is required. High-quality coding requires all manner of joyous programmer tasks: write unit tests, warning-free compilation, static analysis checks, add assertions and debug tracing, run valgrind or sanitizers, write useful commit summaries (rather than “I forget”), don’t cuss in the bug tracking record, update the doc, comment your code, and be good to your mother.
