Aussie AI

Chapter 8. Undefined C++ Features

  • Book Excerpt from "Safe C++: Fixing Memory Safety Issues"
  • by David Spuler

What are Undefined Behaviors?

The C++ programming language is very portable despite its low-level focus on efficiency. However, there are numerous areas of C++ that are "undefined behaviors" and some of them are relatively common. Technically, there are two types:

  • "undefined behaviors" — compilers can do whatever they like, even differently each time, or
  • "implementation-specific behaviors" — compilers can do whatever they like, but they have to be consistent in doing the same thing every time.

If you've never heard of this, well, actually you have. You're certainly familiar with some of the undefined behaviors, like accessing an uninitialized value, a null pointer dereference, or accessing a memory address that has already been de-allocated. In other words, all the memory bugs are from undefined behaviors.

Some of the indicators that you've accidentally used an undefined behavior include:

  • Portability problems where the code crashes on different platforms.
  • Using higher optimization levels causes the code to crash.
  • Code runs fine in the debugger, but fails without it.

These are also indicators of a memory safety failure, which underscores my point that the most common type of undefined behavior is related to memory errors.

Safety Issues for Compiler Vendors

We've examined various types of problems that can be addressed by memory debug libraries and other debug wrapper libraries. But we can't do everything that way.

Here are some of the issues that are hard to address without changing the code-generation of the compiler. In other words, they can't be detected or resolved by changes to the standard C++ library alone.

Some of the problematic areas include:

  • Initialization of automatic stack variables without an explicit initializer (i.e., auto-initialization to zero).
  • Checking pointer de-references have a valid address.
  • Checking array accesses have a valid address.
  • Race conditions and other concurrency problems.
  • Integer overflow and underflow (signed or unsigned types).
  • Floating-point overflow and underflow.
  • Order-of-evaluation issues with binary operator operands.
  • Order-of-evaluation issues with function argument expressions.
  • Undefined integer operations on signed negative values (e.g., remainder, integer division, right bitshift).
  • Throwing an exception inside a destructor.
  • Calling exit or abort inside a destructor.

That's not the full list. There are literally hundreds of obscure "undefined behaviors" in C++, although most of them are very rare.

Note that some of the most common memory safety issues are not on the above list. For example, it would be easy for a vendor to change their standard library so as to guarantee that malloc and new would zero the allocated memory. These could be fixed in the vendor's standard library, rather than needing changes to the compiler engine.
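
As a rough sketch of what such a library-level fix could look like (not how any particular vendor actually does it), the global allocation functions can be replaced to zero every block:

    #include <cstdlib>
    #include <cstring>
    #include <new>

    // Illustrative replacement of the global allocator that zero-fills
    // every new block before returning it.
    void* operator new(std::size_t size)
    {
        void* p = std::malloc(size);
        if (!p) throw std::bad_alloc();
        std::memset(p, 0, size);   // zero-fill the allocation
        return p;
    }

    void operator delete(void* p) noexcept
    {
        std::free(p);
    }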

C++ Operator Pitfalls

Most of the low-level arithmetic code in C++ algorithms looks quite standardized. Well, not so much. The general areas where C++ code that looks standard is actually non-portable include trappy issues such as:

  • Arithmetic overflow of integer or float operators.
  • Integer % remainder and / division operators on negatives.
  • Right bitshift operator >> on a negative signed integer is not division.
  • Divide-by-zero doesn’t always crash on all CPUs and GPUs.
  • Order of evaluation of expression operands (e.g., with side-effects).
  • Order of evaluation of function arguments.
  • Functions that look Boolean don’t always return 0 or 1 (e.g., isdigit, isalpha).
  • Functions that only guarantee the sign of their result, not an exact value (e.g., strcmp, memcmp).
  • Initialization order for static or global objects is undefined.
  • memcmp is not an array equality test for non-basic types (e.g., structures).

Note that these errors are not only portability problems, but can arise in any C++ program. In particular, different levels of optimization in C++ compilers may cause different computations, leading to insidious bugs.

Signed right bitshift is not division

The shift operators << and >> are often used to replace multiplication and division by a power of 2 as a low-level optimization. However, it is dangerous to use >> on negative numbers. Right shift is not equivalent to division for negative values. Note that the problem does not arise for unsigned data types, which are never negative and for which right shifting is always a division.

There are two separate issues involved in shifting signed types with negative values: firstly, the compiler may choose between two distinct methods of implementing >>, and secondly, neither approach is guaranteed to be equivalent to division (although sign extension often gives the same answer). It is left to the implementation whether >> on negative values will:

    (a) sign extend, or

    (b) shift in zero bits.

Each compiler must choose one of these methods, document it, and use it consistently for all applications of the >> operator. Shifting in zero bits is never equal to division for a negative number, since it shifts a zero into the sign bit, making the result a non-negative integer (dividing a negative number by two and getting a positive result is not division!). Shifting in zero bits is always used for unsigned types, which is why right shifting an unsigned value really is a division. (Since C++20, signed integers must be two's complement and >> on a negative value is defined to sign extend, but the result still differs from / for negative values that aren't exactly divisible.)
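
As a small example of the difference, the snippet below compares a signed right shift with integer division; on a typical compiler that sign-extends, the two results differ:

    #include <cstdio>

    int main()
    {
        int x = -7;
        // A sign-extending shift gives -4, but -7 / 2 truncates toward
        // zero and gives -3 (a compiler that shifts in zero bits would
        // print a large positive number instead).
        std::printf("%d %d\n", x >> 1, x / 2);
        return 0;
    }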

Divide and remainder on negative integers

Extreme care is needed when the integer division and remainder operators / and % are applied to negative values. Actually, in performance-critical code you may want to avoid these operators entirely: if you can choose a power-of-two divisor and keep the operands unsigned, division becomes a right bitshift and the remainder becomes a bitwise-and with the divisor minus one.
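
As a quick illustration of this power-of-two trick for unsigned operands, where the shift and mask forms really are equivalent to / and %:

    #include <cstdio>

    int main()
    {
        unsigned int n = 37;
        std::printf("%u %u\n", n / 8, n >> 3);        // both print 4
        std::printf("%u %u\n", n % 8, n & (8u - 1));  // both print 5
        return 0;
    }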

Anyway, another reason to avoid these operators is their behavior on negatives. Problems arise if a program assumes, for example, that -7/2 equals -3 (rather than -4). Before C++11 the direction of truncation of the / operator was implementation-defined if either operand was negative; modern C++ requires truncation toward zero, but code that must also compile with older compilers or old C dialects cannot rely on it.

Order of evaluation errors

Humans tend to assume that expressions are evaluated left-to-right. However, in C++ the order of evaluation of the operands of most binary operators is unspecified, and an expression that both modifies and reads the same variable in unsequenced sub-expressions has undefined behavior. This freedom lets compilers apply aggressive optimizations to the code. Unfortunately, it also leads to some problems that the programmer must be aware of.

To see the problem, consider the increment operator in the expression below, which applies a dangerous side effect to x:

    y = (x++) + (x * 2);

Because the order of evaluation of the addition operator is not specified, there are two orders in which the expression could actually be executed. The programmer’s intended order is left-to-right:

    temp = x++;
    y = (temp) + (x * 2);

The other incorrect order is right-to-left:

    temp = x * 2;
    y = (x++) + (temp);

In the first case, the increment occurs before x*2 is evaluated. In the second, the increment occurs after x*2 has been evaluated. Obviously, the two interpretations give different results. This is a bug because it is undefined which order the compiler will choose.

Function-call side effects

If there are two function calls in the one expression, the order of the function calls can be important. For example, consider the code below:

    x = f() + g();

Our first instinct is to assume a left-to-right evaluation of the “+” operator. If both functions produce output or both modify the same global variable, the result of the expression may depend on the order of evaluation of the “+” operator, which is undefined in C++.
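
A minimal sketch of the problem, using two hypothetical functions that both print, shows output whose order may legitimately vary between compilers:

    #include <cstdio>

    int f() { std::printf("f "); return 1; }
    int g() { std::printf("g "); return 2; }

    int main()
    {
        int x = f() + g();   // may print "f g " or "g f "
        std::printf("x=%d\n", x);
        return 0;
    }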

Order of evaluation of assignment operator

Order of evaluation errors are a complicated problem. Most binary operators have unspecified order of evaluation — even the assignment operators. A simple assignment statement can be the cause of an error. This error can occur in assignment statements such as:

   a[i] = i++;   // Bug

The problem here is that “i” has a side effect applied to it (i.e., ++), and is also used without a side effect. Historically the order of evaluation around the = operator was unspecified, so it was undefined whether the increment occurred before or after the evaluation of i in the array index. C++17 finally sequences the right-hand side of an assignment before the left, but on older compilers this line remains a genuine bug.

Function-call arguments

Another form of the order of evaluation problem occurs because the order of the evaluation of arguments to a function call is not specified in C++. It is not necessarily left-to-right, as the programmer expects it to be. For example, consider the function call:

    fn(a++, a);  // Bug

Which argument is evaluated first? Is the second argument the new or old value of a? The order is unspecified in C++, and on pre-C++17 compilers the unsequenced modification of a makes the call undefined behavior outright.

Order of initialization of static objects

A special order of evaluation error exists because the order of initialization of static or global objects is not defined across files. Within a single file the ordering is the same as the textual appearance of the definitions. For example, the Chicken object is always initialized before the Egg object in the following code:

    Chicken chicken; // Chicken comes first
    Egg egg;

However, as for any declarations there is no specified left-to-right ordering for initialization of objects within a single declaration. Therefore, it is undefined which of c1 or c2 is initialized first in the code below:

    Chicken c1, c2;

If the definitions of the global objects “chicken” and “egg” appear in different source files that are compiled separately and then linked together, it is undefined which will be constructed first.
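
One common workaround, sketched below (the_chicken is just an illustrative name, and the Chicken class is assumed from the example above), is the construct-on-first-use idiom: the object lives inside a function and is built the first time that function is called, regardless of link order:

    // Construct-on-first-use: avoids the cross-file initialization order problem.
    Chicken& the_chicken()
    {
        static Chicken chicken;   // constructed on first call, not at program startup
        return chicken;
    }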

Standard Library Problems

Not everything is well-defined in the standard C++ library. There are numerous pitfalls that can jump up and bite you in apparently innocuous uses of the library functions. Here's a selection of them below.

memcmp cannot test array equality

For equality tests on many types of arrays, the memcmp function might seem an efficient way to test if two arrays are exactly equal. However, it only works in a few simple situations (e.g., arrays of int), and is buggy in several cases:

  • Floating-point has two zeros (+0.0 and −0.0) that compare equal but have different bytes, so it fails.
  • Floating-point also has multiple numbers representing NaN (not-a-number).
  • If there’s any padding in the array, such as arrays of objects or structures.
  • Bit-field data members may have undefined padding.

You can’t skip a proper comparison by looking at the bytes.
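
A small example of the floating-point zero problem: the two values compare equal with ==, but memcmp sees different bytes:

    #include <cstdio>
    #include <cstring>

    int main()
    {
        double a[1] = { 0.0 };
        double b[1] = { -0.0 };
        std::printf("%d\n", a[0] == b[0]);                      // 1: values are equal
        std::printf("%d\n", std::memcmp(a, b, sizeof a) == 0);  // 0: bytes differ
        return 0;
    }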

EOF is not a char

The EOF constant is a special value for C++ file operations. One problem related to signed versus unsigned chars is comparing a char variable with EOF. Because EOF is a negative integer value (typically −1), it should never be stored in or compared with a char type. Although it will usually work if characters happen to be signed for a particular implementation, if characters are unsigned the comparison of a char with EOF can never be true: the char value is promoted to a non-negative int in the range 0 to 255, which never equals −1, so a read loop never terminates. An example of this type of bug:

   char ch = getchar();
   if (ch == EOF) { ... }   // Bug

The correct fix is to declare ch with type int instead.
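
For example, the usual input loop keeps the value in an int until after the EOF test:

    #include <cstdio>

    int main()
    {
        int ch;                               // int, not char
        while ((ch = std::getchar()) != EOF)
            std::putchar(ch);
        return 0;
    }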

fflush on an input file

The fflush function is used to flush the buffer associated with a file pointer. Unfortunately, it can only be used to flush an output buffer, causing output to appear on screen (or be flushed to a file). Applying fflush on an input file leads to undefined results; it will succeed on some systems, but cause failure on others. The problem is typified by the following statement that often appears in code:

    fflush(stdin);

The intention is to flush all input characters currently awaiting processing (i.e., stored in the buffer), so that the next call to getchar (or another input function) will only read characters entered by the user after the fflush call. This functionality would be very useful, but is unfortunately not possible in general, as the effect of fflush is undefined on input streams. There is no portable way to flush any "type ahead" input keystrokes; fflush(stdin) may work on some systems, but on others it is necessary to call some non-standard library functions.
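
A partial, portable substitute for the common case is to read and discard the rest of the current input line; this is only a sketch (discard_line is just an illustrative name), and it does not remove type-ahead beyond the newline:

    #include <cstdio>

    void discard_line(std::FILE* in)
    {
        int c;
        while ((c = std::getc(in)) != '\n' && c != EOF)
            ;   // discard characters up to the end of the line
    }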

fread and fwrite without intervening fseek

When a binary file is opened for update using a mode such as "rb+", the programmer must be careful when using fread and fwrite. It is an error to mix fread and fwrite operations without an intervening call to a repositioning function such as fseek or rewind.

For example, do not assume, after sequentially reading all records in a file using fread, that a call to fwrite will write data to the end of the file. Instead, use an fseek call to explicitly reach the end of the file before writing. The best method of avoiding this error is to call fseek immediately before every fread or fwrite call.
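
A sketch of the safe pattern (the filename is only an example):

    #include <cstdio>

    int main()
    {
        std::FILE* fp = std::fopen("records.bin", "rb+");
        if (!fp) return 1;
        char buf[128];
        std::fread(buf, 1, sizeof buf, fp);   // read a record
        std::fseek(fp, 0, SEEK_END);          // reposition before switching to writing
        std::fwrite(buf, 1, sizeof buf, fp);  // now the write is well-defined
        std::fclose(fp);
        return 0;
    }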

Modification of string literals

String literals should not be modified in C++, because they may be stored in read-only memory. They should be thought of as having type const char* (and in fact, since C++11, the implicit conversion from a string literal to a plain char* is no longer legal, although many compilers still accept it with only a warning). Therefore, using char* string types without caution can lead to errors, such as applying strcpy to a pointer that currently points to a string literal, as below:

    char *result = "yes";
    if (...)
        strcpy(result, "no"); // WRONG

The effect of this code is to try to modify the memory containing the string literal "yes". If the literal is stored in read-only memory, the strcpy call either has no effect or causes a run-time failure (typically a crash). Even if string literals happen to be modifiable for the particular implementation, this form of modification can lead to strange errors. Overwriting "yes" with "no" means that the initialization of result will never again set result to "yes". The code can be thought of as equivalent to the following code:

    char yes_addr[4] = { 'y', 'e', 's', '\0' };
    char *result = yes_addr;
    if (...)
        strcpy(result, "no"); // WRONG

Hence, the strcpy call changes the contents of yes_addr, and the initialization will always set "result" to point at whatever yes_addr currently contains.

Worse still is the problem that many compilers merge identical string literals so as to save space. Hence, the above strcpy call will change all uses of the constant "yes" to be "no" throughout the program! Therefore, one change to a string constant will affect all other instances of the same string constant — a very severe form of aliasing.

Avoiding the modification of string literals is not all that difficult, requiring only a better understanding of strings. One solution to the above problem is to use an array of characters instead of a pointer:

    char result[] = "yes";
    if (...)
        strcpy(result, "no"); // RIGHT

In this case the compiler allocates 4 bytes for the result array and copies the string literal into it, rather than making result point directly at the literal's own 4 bytes (which may well be shared by every other use of the same literal).

Backslash in DOS filenames

Windows and DOS use backslashes for directory paths in filenames, whereas Linux uses a forward slash character. A common error with file operations occurs when a DOS filename is encoded with its full path name. The backslash starts an escape inside the string constant. Hence, the filename below is wrong:

    fp = fopen("c:\file.cpp", "r");     // Bug

The backslash character starts the escape sequence \f, which represents a formfeed character. The correct statement uses two backslash characters:

    fp = fopen("c:\\file.cpp", "r");    // Correct

Summary

The above is just a selection of some of the undefined and non-portable aspects of C++. The solution to a fully safe C++ programming language rests not only in addressing memory safety, but in fixing a lot of these areas. The goal would be to change all the "undefined behaviors" into "well-defined" parts of the safe C++ standard. Unfortunately, there are many issues to choose from!

 
