Aussie AI

Chapter 16. Contiguous Memory Blocks

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.

Why Contiguous Memory Blocks?

A critical part of optimizing low-latency engines is to store data in a contiguous memory block, so that it occupies a sequential address space. Processing chunks of data in parallel is the main optimization used in both GPU and CPU SIMD acceleration. All of the vectors, matrices, and tensors need their underlying data in a single block for efficiency.

Processing data that is in adjacent addresses is much faster than jumping all over the place. Vectors should obviously be stored in a simple contiguous array of memory. Less obviously, similar comments apply to the memory storage of matrices and tensors.

The use of contiguous memory is an important optimization for both sequential and parallel algorithms. The reasons that memory blocks are more efficient include:

  • Data locality (cache hits)
  • Data block GPU uploads (model weights from memory-to-cache)
  • Predictive cache pipelining (in CPU sequential accesses)

Data locality refers to using data in the same or similar address locations. This is helpful for the cache hit rate because data that is already in the cache is much faster to access than a non-cached RAM memory address.

GPU uploads from CPU RAM to the GPU’s Video RAM (VRAM) are done in blocks. Obviously, we don’t want to be uploading random bits of data from different parts of the RAM.

Non-GPU architectures also benefit from the use of contiguous memory. This is obviously true of CPU SIMD instructions (e.g., AVX on x86), but even in sequential execution, the CPU has its own RAM caching methods and often has other optimizations of memory accesses. Predictive cache pipelining is where the CPU attempts to predict what the next memory location will be, and load it in a pipelined speedup, before being asked. This pipelining of memory accesses is much faster than doing completely sequential address lookups.

Typically, predictive cache pipelining uses the simple heuristic that the next address is the most likely next request, which assumes that data is being processed in order of the addresses. Hence, scanning an array in reverse is the worst possible order for these CPUs. Similarly, jumping around to different memory addresses, such as scanning the column of a matrix using a large “stride,” is also inefficient.
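To illustrate the stride effect, here is a minimal sketch (function names are my own) that sums the same contiguous row-major matrix in two traversal orders. The row-order loop has a stride of one element in its inner loop, which is what the prefetcher likes; the column-order loop jumps a whole row's width on every step:

```cpp
#include <cstddef>
#include <vector>

// Sum a rows-by-cols matrix stored contiguously in row-major order.
// Row-order traversal visits adjacent addresses (cache- and prefetch-friendly).
double sum_row_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            total += m[i * cols + j];   // stride of 1 element in the inner loop
    return total;
}

// Column-order traversal jumps by a stride of `cols` doubles on each step,
// defeating the CPU's sequential prefetcher (same result, slower on big data).
double sum_col_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            total += m[i * cols + j];   // stride of `cols` elements in the inner loop
    return total;
}
```

Both functions compute the same total; on matrices larger than the cache, the row-order version is typically much faster.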

Low-Level Memory Block Functions

Memory block operations in the standard C++ libraries are implemented using fast assembly language behind the scenes. The main functions in the standard C++ library that operate at a low level on binary bytes in a memory block are:

  • memset(): set bytes to a value, usually used to clear bytes to zero.
  • memcpy(): copy bytes.
  • memmove(): copy bytes, but tolerates overlapping regions.
  • memcmp(): compare a sequence of bytes.
  • memchr(): search for a byte in a sequence.

These functions are lower-level than the modern C++ versions, such as std::copy, std::move(), and their “backward” versions. The above listed memory block functions are not aware of object-level semantics, and won’t run any of the special functions on memory containing objects.

Note that unlike the standard string functions (such as strlen), these functions do not assume a block is null-terminated by a zero byte. Zero is simply a binary value, and these functions don’t stop at a zero byte. All of these functions operate on a block of memory with a known maximum byte length.
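A small sketch (helper names are my own) of what "no null termination" means in practice: both memcmp and memchr happily work on buffers containing embedded zero bytes, where the str-family functions would stop:

```cpp
#include <cstddef>
#include <cstring>

// Compare two buffers that may contain embedded zero bytes.
// strcmp would stop at the first '\0'; memcmp compares all `len` bytes.
bool blocks_equal(const unsigned char* a, const unsigned char* b, std::size_t len) {
    return std::memcmp(a, b, len) == 0;
}

// memchr likewise scans a fixed length, and can find bytes past an embedded zero.
bool block_contains(const unsigned char* block, std::size_t len, unsigned char value) {
    return std::memchr(block, value, len) != nullptr;
}
```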

Each compiler environment typically offers some extra non-standard byte-wise functions that are also fast. Some of the less standardized C++ intrinsics that operate on memory blocks include:

  • _memccpy(): copy bytes up to a specified sentinel byte.
  • _memicmp() or memicmp(): compare bytes ignoring letter case.
  • bcopy(): copy bytes.
  • bzero(): clear bytes to zero.
  • bcmp(): compare bytes.
  • _byteswap_uint64() (Microsoft intrinsic): Swap the bytes of an integer.
  • __builtin_bswap16(): GCC function to swap the bytes in an integer. There are versions for 32-bit and 64-bit.
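As a sketch of the byte-swap intrinsics (function names are my own; __builtin_bswap32 is GCC/Clang-specific, with _byteswap_ulong as the MSVC analog), here is the intrinsic alongside a portable shift-based fallback for comparison:

```cpp
#include <cstdint>

// Reverse the byte order of a 32-bit integer (endianness conversion).
// __builtin_bswap32 is a GCC/Clang intrinsic that typically compiles down
// to a single BSWAP instruction on x86.
uint32_t swap32(uint32_t x) {
    return __builtin_bswap32(x);
}

// Portable fallback using shifts and masks: same result on any compiler.
uint32_t swap32_portable(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}
```

C++23 also standardizes this operation as std::byteswap in the &lt;bit&gt; header.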

Fast Memory Block Operations

The slow way to do things in arrays is one element at a time. The faster way is to use the standard memory block functions on the whole array. There are a number of standard functions that operate on array data or memory blocks and they are very fast.

Initialize with memset byte fill. The memset function sets all of a memory block to a byte value. It is widely used as a fast way to initialize a block of memory to all zeros.

    memset(&x, 0, sizeof(x));

Almost all usages of memset will be for the zero byte. The only other usage I’ve seen is to fill memory with a dummy non-zero byte as a form of mutation testing to catch uses of uninitialized memory.

    memset(&x, 0x55, sizeof(x));

Fast array copying with memcpy. The fast way to copy an entire array is with memcpy. Rather than copy each element of an array, one at a time, in a loop, the memcpy standard library function can be used to copy the entire array in one statement:

    memcpy(destarr, srcarr, sizeof(srcarr)); 

Note that this is a bitwise copy of the array intended for simple data types. For example, it won’t run copy constructors if applied to an array of objects.

The memcpy function does a very fast memory block copy. Like strcpy, the destination is the first parameter. Unlike strcpy, memcpy does not stop at a null byte: it always copies a fixed number of bytes, including any null bytes and hidden padding bytes. memcpy is a super-fast byte copy, but is unsafe, because it does not have well-defined behavior if the source and destination blocks overlap.

Safer byte copy with memmove: The memmove function is a safer version of memcpy, which also works correctly if the memory blocks overlap. If the source and destination blocks don’t overlap, it’s the same as memcpy, except probably slightly slower. If they do overlap, then memmove conceptually will copy the source to a temporary area, and then copy it to the destination block.
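The classic case where the overlap matters is shuffling an array in place. A minimal sketch (function name is my own) that removes one element by sliding the tail down a slot:

```cpp
#include <cstddef>
#include <cstring>

// Remove element `pos` from the first `n` ints of `a` by shifting the tail
// down one slot. The source (a + pos + 1) and destination (a + pos) regions
// overlap, so memmove is required; memcpy here would be undefined behavior.
void erase_at(int* a, std::size_t n, std::size_t pos) {
    std::memmove(a + pos, a + pos + 1, (n - pos - 1) * sizeof(int));
}
```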

Copying arrays using struct assignment. An alternative method of copying arrays is to make use of struct assignments. This is similar to how std::array works, which could also be used in a similar vein, but this example totally avoids any constructor, copying or move costs (and also works in C).

This method is not portable, is very unreadable, and technically has undefined behavior because it accesses the arrays through an unrelated struct type (a strict aliasing violation). However, it can be faster than memcpy because it uses the assignment operator rather than calling a function. On the other hand, memcpy is an intrinsic function that might be inlined to assembler instructions by the compiler, so this trick might be a waste of time. Benchmarking is recommended here.

To copy an array using this method it is necessary to declare a new dummy struct type that is the same size as the array that is to be copied. Then we use type casting to fool the compiler into thinking it is copying structures when really it is copying arrays. The method is illustrated below:

    struct dummy_transfer { // The new struct type
        int a[MAX]; // This field gives the right size
    };

    int a[MAX], b[MAX]; // The array variables being copied
    static_assert(sizeof(struct dummy_transfer) == sizeof(a));
    *(struct dummy_transfer *)a = *(struct dummy_transfer *)b;

The assignment statement first type casts both “a” and “b” to be pointers to the new struct type, and then dereferences these pointers so that the compiler believes it is assigning between two structures. The assertion is an efficient compile-time safety net to ensure that the copying statement will work. Of course, a better way entirely is probably to put the array inside a class object, with lovely encapsulation and modularity, and then we can simply copy the objects.

memcmp byte comparisons. The memcmp function does a byte-wise comparison of a memory block. Its return value is like strcmp, returning 0 for equality, and a negative or positive value otherwise. Note that memcmp is not like strcmp, and will not stop when it finds a zero byte.

Memory Block Function Pitfalls

The standard memory block functions are fast, but they are not always safe. Here are some of the common pitfalls that occur in everyday coding.

memset sizeof problem. Here’s another glitch in using memset inside functions:

    void zero_array(int arr[10])
    {
        memset(arr, 0, sizeof(arr));  // Bug
    }

The problem is not memset, but the sizeof operator on function parameters. An array parameter in a function is like a hologram and isn’t really there. It’s not really an array, but a pointer, so sizeof(arr) is really sizeof(int*). Hence, sizeof(arr) is probably only 4 or 8, rather than the full 40 bytes, leaving most of the array uninitialized. Personally, I recommend a memset debug wrapper function to catch this kind of problem at runtime, or a compile-time check with static_assert.
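One compile-time fix is a template wrapper (name is my own) that takes the array by reference, so it never decays to a pointer and sizeof sees the real array type:

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Taking the array by reference preserves its real type, so sizeof(arr) is
// the full array size. A plain pointer will not bind to this parameter,
// which catches the decayed-pointer mistake at compile time.
template <typename T, std::size_t N>
void zero_array(T (&arr)[N]) {
    static_assert(std::is_trivially_copyable_v<T>,
                  "memset is only safe for trivially copyable types");
    static_assert(sizeof(arr) == N * sizeof(T), "full array size seen");
    std::memset(arr, 0, sizeof(arr));
}
```

Calling zero_array on a pointer, or on an array parameter that has decayed to one, fails to compile instead of silently zeroing the wrong number of bytes.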

memset portability issue. Even though it’s a fast zeroing method, the use of memset to zero bytes has an obscure portability problem on any architecture where all-bytes-zero is not the same as the zero value of every data type. However, on most standard platforms, all-bytes-zero is correct for all types: integer zero (regardless of endianness), floating-point zero (positive zero is all bits zero), and the null pointer.
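A fully portable alternative is value-initialization, which gives each member its own genuine zero rather than assuming a byte layout. A minimal sketch (type and function names are my own):

```cpp
struct Stats {
    int counts[8];
    double total;
    const char* name;
};

// Value-initialization zeroes every member with that type's own zero value
// (integer 0, floating-point 0.0, null pointer), with no assumptions about
// the bit representation, so it is portable where memset might not be.
Stats make_zeroed_stats() {
    return Stats{};   // value-initialized: all members zeroed
}
```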

memcpy overlapping blocks error: The only downside with memcpy is that it can fail with overlapping ranges for the source and destination blocks, so if you are shuffling arrays up or down one element using memcpy, then you have to be careful, because the results on overlapping ranges are undefined. Here’s a buggy example of using memcpy to remove the first character of a string in place:

    memcpy(s, s+1, strlen(s+1)+1);  // Bug

The problem is that the blocks starting at “s” and “s+1” overlap, so the behavior is undefined: it may appear to work on one platform and fail on another. The fix is simply to use memmove, which always works correctly for overlaps:

    memmove(s, s+1, strlen(s+1)+1);  // Correct

memcmp return value. A pitfall with memcmp is that you cannot assume that it returns 1 or -1, but must compare the return result to zero (like the strcmp function).

    if (memcmp(&a, &b, sizeof(a)) == 1)  // Bug
    if (memcmp(&a, &b, sizeof(a)) > 0)   // Correct

memcmp object equality testing. Looking at the memcmp function, you might think of it as an opportunity to do a fast equality/inequality test on large objects by simply doing a byte-wise test. You would not be the first to think that.

Consider if you have a complex number class:

    class MyComplex {
    public:
        float real, imag;
        // .. etc
    };

The brute-force equality test is:

    bool is_equal(const MyComplex &a, const MyComplex &b)
    {
        return (a.real == b.real && a.imag == b.imag);
    }

Our idea to optimize this with memcmp looks like:

    bool is_equal(const MyComplex &a, const MyComplex &b)
    {
        return memcmp(&a, &b, sizeof(MyComplex)) == 0;  // Bug!
    }

Unfortunately, there are multiple obscure pitfalls with this approach:

  • Padding bytes
  • Two types of floating-point zero
  • Multiple types of floating-point NaN (not-a-number)
  • Bitfields

Padding byte problems. If float is 4 bytes, but the machine has 8-byte alignment, then the “real” and “imag” data members will be stored on 8-byte-aligned addresses, and there will be another 4 bytes each of dummy padding. It doesn’t even have to be a machine with alignment issues, but can occur with a bigger object if we’ve mixed different-sized members (e.g., char, int, and pointers). The padding bytes will be uninitialized (e.g., for local objects or if allocated with “new”), in which case they can contain random values. Since memcmp does not skip the padding bytes, its test will fail. Now, we could possibly work around this portability issue via the use of memset in the constructor, or calloc memory allocation, to zero all of the bytes of an object including the padding bytes.

Negative zero problems. Unfortunately, the next problem is not a portability problem, but a fundamental issue with floating-point numbers. There are two zeros! There’s the normal zero with all bits zero, and there’s negative zero, with the sign bit set, but all other bits zero. Hence, the bitwise testing of both float numbers fails if there’s ever a negative zero.

NaN problems. Similarly, but perhaps less seriously, the representation of NaN (Not-a-Number) in floating-point is also not fixed. There are multiple values of NaN, both positive and negative. So, memcmp would say the float values differ, even if both are NaN. I think this NaN issue is less serious than negative zero, because if your computations are generating NaN, then they’re probably already failing, and an incorrect memcmp equality test won’t matter as much.
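Both the negative-zero and NaN problems can be demonstrated directly. A small sketch (helper name is my own) contrasting bitwise equality via memcmp with arithmetic equality via operator==, assuming the usual IEEE 754 float representation:

```cpp
#include <cmath>
#include <cstring>

// Bitwise equality of two floats, for contrast with operator==.
bool bits_equal(float a, float b) {
    return std::memcmp(&a, &b, sizeof(float)) == 0;
}

// 0.0f == -0.0f is true arithmetically, but the two values differ in the
// sign bit, so bits_equal reports them as different.
// Conversely, a NaN never compares equal to itself with operator==, yet
// comparing a NaN's bytes against themselves reports equality.
```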

Bitfield problems. If our structure has any bitfield data members, this memcmp idea fails too. Bitfields are a standard C++ feature that is defined with a suffix colon and a number of bits like:

    unsigned int myflag:1;  // Boolean bitfield with 1-bit

With bitfields it’s implementation-defined how this is represented numerically, and there might be undefined bits in the same byte, or extra padding bytes again.

Still want your memcmp speedup? I’ve just shown you about 15 pitfalls, but maybe you still want to live on the edge and get that speedup? You can use memcmp to do fast array or object comparisons if you’re really, really sure that you have:

  • Zero byte initializations. All allocated arrays or objects must be first zero’d by memset or calloc. You cannot rely on constructors, and it’s hard to put a memset as the first action of the constructor due to initializer lists and base classes. You might have to intercept all new and new[] operators with your own wrapper that does memset on the block, rather than use constructor tricks. It’s also unclear if you can actually rely on static or global variable initialization to carefully zero all the padding bytes in an array or object. Probably it works on most platforms, but I doubt it’s fully portable. To be sure, use memset on the global variables during program startup.
  • No bit-fields used. That’s easy, at least.
  • Floating-point computations should avoid negative zero and NaN.
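If you are on C++17 or later, the compiler can police most of this checklist for you: the standard trait std::has_unique_object_representations_v&lt;T&gt; is true only for types with no padding bytes and a single representation per value (floating-point members make it false). A sketch (function name is my own) of a memcmp comparison guarded by that trait:

```cpp
#include <cstring>
#include <type_traits>

// Bytewise equality guarded by a C++17 trait: the static_assert rejects any
// type with padding bytes or multiple representations of the same value
// (so a struct with float/double members will fail to compile here).
template <typename T>
bool fast_equal(const T& a, const T& b) {
    static_assert(std::has_unique_object_representations_v<T>,
                  "type has padding or non-unique value representations");
    return std::memcmp(&a, &b, sizeof(T)) == 0;
}

struct IntPair { int x, y; };   // no padding on typical platforms
```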

Raw Subarray Memory Blocks

Passing raw subarray types to functions can be a fast alternative to some of the modern C++ contiguous containers (i.e., std::array, std::vector). However, the passing of a container object by reference with “const&” parameters is also very fast, so don’t assume that raw arrays are always faster.

If a function accepts a raw array type, it is possible to pass it any array as an argument, or any pointer of the right type. In this way, it is possible to pass memory blocks or “sub-arrays” to a function by passing the address of a particular array element. A function to operate on a particular type of array can be written, and used to operate on various arrays.

    void clear(int a[], int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = 0;
    }

    void test_subarrays()
    {
        int a[100];
        clear(a, 10); // clear first ten, 0..9
        clear(a + 50, 10); // clear 50..59 
        clear(&a[50], 10); // clear 50..59 (equivalent)
    }

Multidimensional subarrays. It is also legal to pass multi-dimensional arrays to functions. However, the sizes of all but the first dimension must be specified in the function receiving the array. For example, to pass a two-dimensional array to a function, the function header would look like:

    void fn(int a[][SIZE2]);

The reason for this restriction is that the compiler cannot determine the address of an arbitrary array element if it does not know the sizes of all but one of the dimensions.

Because the sizes of most of the array dimensions must be specified in the function declaration it is very difficult to write a function to act on sub-arrays of multi-dimensional arrays. For example, this idea would be useful to define library functions to operate on matrices with different dimensions. Ideally, we would like one function to calculate the determinant of a matrix for any dimension (i.e., an n-by-n matrix where n varies). Consider how we would like the determinant function to look:

    double determinant(double matrix[][], int n); // ILLEGAL

Ideally, the dimensions of the matrix are not specified at compile-time, but are specified at run-time by the n argument. This is not possible as a simple C++ declaration because the second dimension (i.e., n) needs to be specified in the definition of the two-dimensional array type. The best solution is to use dynamic multi-dimensional arrays.

Dynamic Memory Management Pitfalls

Memory management is really not the strong suit of C++. If your program is crashing or behaving badly, it’s highly likely to be some kind of memory problem. There are so many pitfalls in C++ dynamic memory management, and even in static or global (non-dynamic) memory, that it’s hard to list them all.

C++ programs have access to a large block of free memory, called the heap. The actual size of the available memory depends on the system. This memory is available to a C++ program, which can allocate itself chunks of memory from this heap. This is useful when a C++ program does not know beforehand how much data is being stored, and hence, how much memory is required. Instead of allocating a large array to cater for the worst case, the program can allocate itself blocks of memory as required.

Blocks of dynamic memory can be allocated in two main ways:

  • The C++ style “new” or “new[]” operators
  • The older style malloc() and calloc() functions (inherited from C)

Other ways to allocate dynamic memory include:

  • strdup(): make an allocated copy of a string.
  • realloc(): a companion to malloc/calloc that is rarely used.

Once the memory is no longer needed it is “freed” back to the heap. Again, there are two main ways:

  • The C++ style “delete” and “delete[]” operators
  • The older style “free” function

Some of the main memory problems in a C++ program can include:

    Uninitialized new memory. The new operator does not initialize the new chunk of allocated memory. Accidentally using it is a common bug.

    Uninitialized malloc memory. The malloc function also does not initialize its allocated memory. Again, use of a memory block that is allocated by malloc but hasn’t been properly cleared is a common bug. One of the mitigations is to use calloc instead, because calloc does zero the bytes of every block it allocates.

    Mismatched new/delete with malloc/free. Memory allocated with new should be deallocated by delete, but malloc’d memory should be free’d. Never the twain shall meet, or else kaboom.

    Mixing new/new[] and delete/delete[]. Memory allocated by new should be released by delete, but memory allocated by the array version “new[]” should be freed by the delete[] array version. Again, they’re not supposed to mix.

    free(nullptr) is harmless. If it’s so harmless, why is it a pitfall? Sure, free(nullptr) is officially defined by the standard to do nothing. But if your coding is doing this, it sure walks and talks and quacks like a buggy duck.

    strdup(nullptr) is not harmless. This is probably a crash, but even on systems where it’s not, it’s clearly a bug in your code if you’re trying to duplicate a null pointer.
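One way to dodge the pairing pitfalls (new vs. malloc, delete vs. delete[]) altogether is RAII. A minimal sketch (helper name is my own) using std::unique_ptr&lt;T[]&gt;, which records that delete[] must run and runs it automatically:

```cpp
#include <cstddef>
#include <memory>

// std::make_unique<double[]>(n) allocates with new[], value-initializes the
// elements to zero, and the returned unique_ptr<double[]> guarantees the
// matching delete[] runs when the owner goes out of scope. No mismatched
// delete, no double-free, no leak on early return.
std::unique_ptr<double[]> make_buffer(std::size_t n) {
    return std::make_unique<double[]>(n);
}
```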

Pitfalls for Non-Dynamic Memory Blocks

There are so many pitfalls in managing dynamic memory, with either new/delete or malloc/free, that surely we’ve run out? No, don’t worry, it’s comforting to know that there are still a bunch more insidious problems in other types of non-allocated memory.

Here’s a list of some more fatal memory stomps that aren’t about allocated blocks on the heap:

  • Buffer overrun of a global, local, static, or stack buffer variable.
  • Returning the address of a local variable on the stack (i.e., non-static variable).
  • Trying to write to addresses of string literals (often a crash if they’re non-writable, but maybe worse behavior if it can be modified).
  • Modifying arr[10] in an array of size 10 (raw arrays or std::array).
  • Uninitialized local variables or local buffers on the stack (non-static).
  • Using an uninitialized local pointer variable to access some random address in Timbuktu.
  • Null pointer dereferences. Oh, well, at least you initialized it.
  • Returning the address of a “static” local variable (aliasing problems).
  • Using a negative array index.

The standard C++ library functions can also have problems:

  • strcpy() on overlapping string arguments: strcpy(s, s+1);
  • strncpy() can leave strings without a null byte terminator.
  • memcpy() on overlapping memory blocks (use memmove instead).
  • Trying to free() or delete a global, static, stack, or instruction address is undefined behavior and will usually crash.
  • Double fclose() on file pointers from fopen.
  • Ignoring the return value of erase() in an iterator loop.
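On the last point, erase() invalidates the iterator passed to it, so the loop must continue from erase()'s return value. A sketch of the correct pattern (function name is my own):

```cpp
#include <list>

// Correct erase-in-loop pattern: erase() invalidates the erased iterator,
// so the loop must resume from the iterator that erase() returns.
// Note: no ++it in the loop header, and no ++it after an erase.
void remove_negatives(std::list<int>& xs) {
    for (auto it = xs.begin(); it != xs.end(); ) {
        if (*it < 0)
            it = xs.erase(it);   // returns iterator to the next element
        else
            ++it;
    }
}
```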

 
