Chapter 7. Memory Pools

  • Book Excerpt from "Advanced C++ Memory Techniques: Efficiency and Safety"
  • by David Spuler, Ph.D.


What are Memory Pools?

Memory pools are a C++ optimization where you take control of the memory allocation used for a class of objects. The basic idea is to store all objects of the same type in a big array, next to each other, rather than being spread out over the heap wherever the new operator decides to put them.

Memory pools are a general optimization that can be used in C++ with the new operator, and also in C programming with malloc. Some of the related data structures include:

  • Bucket array
  • Hive

A bucket array is like a memory pool, in that it’s a big memory block, and you put your objects in there. However, a bucket array usually handles erasing an object by simply marking it as invalid using a Boolean flag. The memory for an erased object is not usually re-used when you insert a new object.

A hive is a generalization of a bucket array, whereby a hive can dynamically expand and contract the number of buckets. Notably, there’s a std::hive class to use in C++26, which would make a good basis for an advanced type of memory pool. However, we’re going to examine some of the simpler types of memory pools first.

Why Memory Pools?

Other than being a fun and gritty project in low-level C++ coding, the goal is speed, and this is achieved in various ways:

  • Preallocation — no need to allocate memory on a low-latency hotpath.
  • Fewer allocation calls — one big chunk rather than lots of small ones.
  • Fewer deallocation calls — reusing memory addresses within the pool.
  • No memory fragmentation — we don’t mix small and large memory allocations.
  • Less memory overhead — hidden heap memory “control blocks” are not needed.
  • Cache locality — all objects are stored contiguously.

In fact, you can even get the number of memory allocations for your class down to zero, if you really want to, by using a global memory pool object. Even the memory pool is not on the heap! But this only works for a fixed-size memory pool, and thus, only if you’re really sure you won’t need too many objects.
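
For example, here's a minimal sketch of a zero-allocation global pool (assuming the MemoryPool template developed later in this chapter; the Widget class and size are hypothetical):

    // Global fixed-size pool: static storage, so even the pool itself
    // is not on the heap (Widget and the size are hypothetical).
    MemoryPool<Widget, 1000> g_widget_pool;

    Widget* make_widget() {
        return g_widget_pool.alloc();  // Zero heap allocations, ever
    }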

Memory fragmentation is also a slowdown that can be avoided or reduced with memory pools. The problems with fragmentation arise in two ways:

  • Frequent allocations and de-allocations, and
  • Different-sized memory blocks.

A memory pool is helpful in both respects. The memory pool avoids lots of allocations by using one big block, and avoids deallocations by re-using the locations inside the block. And because the memory block stores lots of blocks of the same size, we aren’t mixing up different size allocations.

Disadvantages of Memory Pools

Firstly, this whole idea of memory pools is only about reducing allocated memory on the heap. This optimization is not relevant for objects stored on the stack (i.e., local variables), or static objects, such as global scope objects or static data members.

Memory pools are not the only option for optimizing memory allocation. In fact, another significant option is an open-source drop-in replacement for the standard C++ memory allocator:

  • jemalloc — the original FreeBSD allocator, now a Facebook favorite.
  • tcmalloc — from Google, with an Apache 2.0 license.

The other disadvantages of memory pools include:

  • Fixed maximum number of objects (in the basic versions).
  • Only works for single-sized objects (e.g., one class).
  • Need one memory pool object for each type of object (via templating).
  • Not useful for optimizing variable-sized objects (e.g., strings).
  • Allocating too much memory in one massive chunk.

However, we can work around a lot of these disadvantages by using a templated class for our memory pool. The optimization of memory pools is a general algorithm that works for all types of objects.

Memory Control Block Overhead

Whenever you allocate memory on the heap, using the new operator or the old-style malloc function, it returns you the address of the block. But that’s not actually the start of the real memory block.

There’s actually an extra memory control block stored before that address. It contains meta-information about the memory block, which is used by the C++ standard library to keep track of things. For example, the size of the memory block is stored in that control block.

Whenever you deallocate a memory block by sending the address to delete or free, the standard library knows to look backwards a few bytes. Hence, it can find the size of the memory block, which helps it to deallocate the full block of memory. You don’t need to worry about it, because the standard library takes care of it.

Hence, if you create a memory pool from one big chunk to contain 100 objects, rather than 100 separate calls to the new operator, there are 99 fewer memory control blocks. This is why memory pools reduce the memory overhead from your objects.
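
As an illustration, compare 100 separate allocations to one big block (the exact header size is implementation-dependent; typical allocators add roughly 8 to 16 hidden bytes per block, plus alignment rounding):

    struct Obj { int x; };  // A small object (typically 4 bytes)

    Obj* items[100];
    for (int i = 0; i < 100; i++)
        items[i] = new Obj();   // 100 hidden control blocks
    // ... versus ...
    Obj* pool = new Obj[100];   // Only 1 hidden control block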

Fixed-Size Memory Pool Algorithms

For simplicity, we’re going to limit our first memory pools to just one huge block of memory. This means that we can choose the overall capacity of the memory pool, but we can’t increase it later by adding a second big block. This makes our memory pool more like a vector or array, rather than a dynamic bucket array or hive.

Even with these restrictions, there are still quite a few choices to make about designing our memory pool algorithm. Some of the alternatives include:

  • Boolean flag — storing an “active” flag in each object.
  • Index array — maintaining a list of indices of free blocks as a “free list” (instead of a per-object flag).
  • Pointer array — tracking the free list via pointers.
  • Permutation-based free list approach.

In the first case, we only have one array, and each block contains the “active” flag along with the stored user objects. In the other cases, we maintain two arrays, one of the user’s objects, and another as the free list (with either indices, pointers, or permutations).

Boolean Flag Memory Pool

This is the simplest approach, but not the fastest. Let’s examine it to get some of the basic ideas.

Some of the interesting features of this code include:

  • Boolean flag — stored as a data member in every memory pool record.
  • Pointer arithmetic — used in computing the offset when erasing an object.
  • Incremental count — increment on allocation, decrement on release.
  • Compile-time pool size — this uses std::array rather than std::vector.

Here’s the basic layout of the memory pool class.

    template<typename T, int N>
    class MemoryPool {
        struct Node {
            T data;
            bool active;
        };
    private:
        std::array<Node, N> arr_;
        int nextfree_;
        int ct_;
        // ...
    };

The constructor has to set all the “active” flags (although using memset would be faster than a loop):

    MemoryPool() : arr_(), nextfree_(0), ct_(0) {
        for (int i = 0; i < N; i++) arr_[i].active = false;
    }
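
The allocation code further below also calls a full() helper that isn't shown in the class layout above. Here's a minimal version, matching the primitives shown later for the index-array pool:

    bool empty() { return ct_ == 0; }
    bool full() { return ct_ == N; }
    int capacity() { return N; }
    int count() { return ct_; }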

The code maintains the index of the “next free” object. Initially, this index simply advances as the first blocks get used, but once the pool has wrapped around, a linear scan is needed to find a free slot.

    int find_next_free(int offset) {
        if (offset == -1) offset = 0;
        int i = offset;
        do {
            if (!arr_[i].active) return i; // Found
            i = (i + 1) % N;
        } while (i != offset);
        return -1;  // It’s full!
    }

Here’s the code for the allocation of a memory pool block:

    T* alloc() {
        if (full()) return nullptr; // fail!
        assert(nextfree_ != -1);
        int oldindex = nextfree_;
        arr_[oldindex].active = true; // Not free
        nextfree_ = find_next_free(nextfree_);
        ct_++; // Incremental count
        return reinterpret_cast<T*>(&arr_[oldindex]);
    }

And here’s the code whereby a block is released by the caller. Note that the index computation requires pointers converted to the correct type. This code has some safety checks that are quite expensive, and might later be removed for production usage.

    void erase(T* addr) {
        assert(ct_ > 0);  // Cannot erase from an empty pool
        Node* nptr = reinterpret_cast<Node*>(addr);
        if (nptr >= reinterpret_cast<Node*>(&arr_[0])
            && nptr <= reinterpret_cast<Node*>(&arr_[N - 1])) {
            // Valid pointer...
            int offset = nptr - &arr_[0];  // Pointer arithmetic
            assert(nptr->active);
            nptr->active = false;  // Free now
            ct_--;  // Incremental count
            if (nextfree_ == -1) { // Was full?
                nextfree_ = offset; 
            }
        }
        else { // Invalid pointer...
            assert(false);
        }
    }
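
Here's a quick usage sketch (Widget is a hypothetical class). Note that in this first version, every slot's T object is already default-constructed when the pool is created, so the caller just assigns into the slot:

    MemoryPool<Widget, 100> pool;      // Hypothetical class and size
    Widget* w = pool.alloc();          // Grab a slot
    if (w != nullptr) {
        *w = Widget(42);               // Assign a value (hypothetical ctor)
        // ... use *w ...
        pool.erase(w);                 // Slot becomes re-usable
    }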

Constructor inefficiency. This implementation has a high-level slug if the memory pool is instantiated for use with a non-trivial class type. The definition of the std::array will cause the constructor of every single element to run needlessly when the memory pool is first created, even though no objects are logically stored yet. The solution here is simply to use raw bytes instead of the class type for the storage declaration:

    struct Node {
        unsigned char data[sizeof(T)];  // Raw object storage
        bool active;
    };

But we also need to be careful of memory alignment in this situation. The template could be instantiated on any type, some of which will need aligned addresses. Character arrays won't get automatically aligned, so we have to use the alignas specifier. However, it's awkward to fix in this implementation: even if we write alignas(alignof(T)) on the data member, the extra “active” flag in the structure forces the compiler to pad every Node out to a multiple of that alignment. But that's only one disadvantage of this method.

Disadvantages of Boolean Flag Method

The first point to remember is that this memory pool is a significant optimization. It achieves all the advantages of a memory pool as outlined above: preallocation, fewer allocations and deallocations, less memory fragmentation, and so on. Hence, it’s a good start, and a worthy improvement to our classes.

We could stop now, and go home with a smile on our face.

However, it’s not optimal. There are even better ways to code up a memory pool. The suboptimal features of this version of a memory pool include:

  • Mixing hot and cold data
  • Alignment issues for some types
  • Extra padding bytes needed
  • Slow insertions

One problem with the above approach is that it mixes “hot” and “cold” data. Your objects are probably hot areas of processing that are doing whatever you need. The Boolean flags are only used by the memory pool when inserting and deleting objects, and are thus cold data for the main processing algorithms. It would be better for cache locality if the cold data was separated from our hot objects.

Memory size is also not optimal. By adding a single Boolean variable to each object, it’s not just 1 byte extra, because the compiler probably has to add a number of padding bytes to meet the alignment requirements (depending on what’s inside your objects). This will increase the memory size, and worsen cache locality when processing multiple objects.

However, the main problem with the Boolean flag approach is that it’s slow. In fact, it has worst case O(n) performance for an insertion, because it might have to scan the entire array to find a free block. This worst case won’t happen initially, but the performance can degrade as the memory pool fills up, and we do lots of insertions and deletions.

We can do better!

Boolean Flag Array Method

One way that we can address some of these issues is by separating all of the Boolean “active” flags into a different array. Rather than storing a flag in each object, we just store the user’s object in the main block, and have a second block that contains the Boolean flags.

The advantages are that it fixes the hot-cold data problem, addresses alignment concerns, and the compiler won't need to add extra padding to the array of user objects. The Boolean flags still cost one byte per object, but they are stored in a separate array away from the hot data.

Firstly, we move the “active” flag out of the structures:

    struct Node {
        unsigned char data[sizeof(T)];  // Raw object storage
    };

And put it into a separate array:

    bool activearr_[N];

The handful of places that used the “active” flag need to be changed to use the “activearr_” array member instead.
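
For example, here's the revised scan, a minimal sketch assuming the rest of the class is otherwise unchanged:

    int find_next_free(int offset) {
        if (offset == -1) offset = 0;
        int i = offset;
        do {
            if (!activearr_[i]) return i; // Found
            i = (i + 1) % N;
        } while (i != offset);
        return -1;  // It’s full!
    }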

We can also fix the alignment issues using the alignas and alignof specifiers:

    alignas(alignof(T)) std::array<Node, N> arr_;

Bit packing. This active flag array method can be further improved by using bit packing. We only need one bit flag per object, rather than one byte each. Hence, we can pack them all into an array of 64-bit unsigned integers (e.g., std::uint64_t), and can check for a free block using one integer comparison, testing 64 memory blocks at a time.
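
Here's a minimal sketch of the bit-packed flags (an assumption, not code from the pool classes above; it uses std::countr_one from C++20's <bit> header to find the first zero bit in a word):

    #include <cstdint>
    #include <bit>  // std::countr_one (C++20)

    template<int N>
    struct BitFlags {
        static const int NWORDS = (N + 63) / 64;
        std::uint64_t bits_[NWORDS] = {};  // 0 bit = free, 1 bit = active

        int find_free() const {
            for (int w = 0; w < NWORDS; w++) {
                if (bits_[w] != ~0ull) {  // Tests 64 blocks at once
                    int i = w * 64 + std::countr_one(bits_[w]);  // First zero bit
                    return (i < N) ? i : -1;  // Guard the partial last word
                }
            }
            return -1;  // It's full!
        }
        void set_active(int i) { bits_[i / 64] |= (1ull << (i % 64)); }
        void set_free(int i) { bits_[i / 64] &= ~(1ull << (i % 64)); }
    };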

In practice, this version is pretty fast. Even so, it is technically still an O(n) worst case algorithm for insertion or deletion with large numbers of objects. And there are a few ways to fix that.

Index Array Memory Pool

The faster solution is to maintain an array of integer indices for the free locations. The advantages of this index array approach over the earlier “active” flag method include:

  • Insertion and deletion always have O(1) complexity.
  • Separates hot data from cold data.
  • No extra padding bytes needed.

Here’s the basic definition of the class:

    template<typename T, int N>
    class IndexMemoryPool {
        struct Node {
            unsigned char data[sizeof(T)];  // Raw object storage
        };
    private:
        alignas(alignof(T)) std::array<Node, N> arr_;
        int freelist_[N];  // array of free indexes (stack-like)
        int ct_;
        int ctfree_;
    // ...
    };

Some of the basic primitives are simple:

    bool empty() { return ct_ == 0; }
    bool full() { return ct_ == N; }
    int capacity() { return N; }
    int count() { return ct_; }
    int count_free() { return ctfree_; }

The index array is a “free list” that tells us where to find a free memory block. After a lot of insertions and deletions, it functions a lot like a stack of free locations. At the start, it's a full fixed-size stack holding the index of every element.

    IndexMemoryPool() : arr_(), ct_(0), ctfree_(N) {
        for (int i = 0; i < N; i++) {
            freelist_[i] = i;  // Store all indexes
        }
    }

When we allocate a new block, that’s a “pop” of the stack, because we’re removing from the free list:

    int pop_free_index()
    {
        assert(ctfree_ > 0);
        int index = freelist_[ctfree_ - 1];
        assert(index != -1);
        freelist_[ctfree_ - 1] = -1; // Clear it
        ctfree_--;
        return index;
    }

The allocation of a block is mostly a call to this “pop” of the free list:

    T* alloc() {
        if (full()) return nullptr; // fail!
        int index = pop_free_index();
        assert(index != -1);
        ct_++; // Incremental count
        return reinterpret_cast<T*>(&arr_[index]);
    }

And the reverse is true when the caller releases a memory block. This is a push of a newly free index onto the stack.

    void push_free_index(int index)
    {
        assert(ctfree_ < N);
        freelist_[ctfree_] = index;
        ctfree_++;
    }

And here’s the erase function that releases the memory:

    void erase(T* addr) {
        assert(ct_ > 0);  // Cannot erase from an empty pool
        Node* nptr = reinterpret_cast<Node*>(addr);
        if (nptr >= reinterpret_cast<Node*>(&arr_[0])
            && nptr <= reinterpret_cast<Node*>(&arr_[N - 1])) {
            // Valid pointer...
            int offset = nptr - &arr_[0];  // Pointer arithmetic
            push_free_index(offset);
            ct_--;  // Incremental count
        }
        else { // Invalid pointer...
            assert(false);
        }
    }

In summary, note that the push and pop of the free list stack is very efficient with O(1) complexity. Everything in this index array version has constant-time efficiency.
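
Here's a usage sketch for this raw-storage version (Widget is a hypothetical class). Since alloc() returns uninitialized bytes, the caller constructs with placement new and destroys explicitly before erasing:

    #include <new>  // Placement new

    IndexMemoryPool<Widget, 100> pool;      // Hypothetical class and size
    void* spot = pool.alloc();              // O(1) grab of raw storage
    if (spot != nullptr) {
        Widget* w = new (spot) Widget(42);  // Construct in-place
        // ... use *w ...
        w->~Widget();                       // Destroy explicitly
        pool.erase(w);                      // O(1) return of the slot
    }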

Memory Pools Versus Containers

Why do you need a memory pool? Why not just use the standard C++ containers for your objects? Isn’t a memory pool about the same as std::vector?

Yes and no.

Yes, a memory pool for your objects is very similar to managing them all in a standard vector. After all, the memory pool code can use a std::vector object inside it as the big pool. So, yes, you can manage your objects in a standard vector if you:

  • Use a single reserve or resize call to allocate the vector’s memory up front.
  • Keep track of objects going in and out of the vector.

In other words, it’s almost the same thing as writing a memory pool, except it’s mixed in the middle of your application’s main logic.

Hence, no, it’s not quite the same thing. There are two types of containers to consider:

  • Contiguous storage containers — a memory pool is very similar.
  • Node-based containers (maps, sets, hash tables) — a memory pool offers real memory management performance gains.

We’ll examine vectors and arrays in a minute, but first let’s look at the other containers. There are two problematic aspects to using normal memory allocation when storing your objects in these node-based containers:

  • Allocating memory for your objects — you’ve improved nothing (it’s one allocation call per object).
  • Extra container allocations — the container also needs memory allocation and a memory pool doesn’t help with that.

But for the containers based on contiguous memory, the issue is less clear cut. The standard containers based on contiguous storage include:

  • std::vector
  • std::array
  • std::inplace_vector (C++26)

When you compare a memory pool to using a standard vector of your objects, there is less of a performance gain. However, creating a memory pool as a standalone class still has several practical advantages:

  • Separate memory management optimizations from business logic.
  • Ensures only a single (huge) memory allocation occurs (or only a few if it’s dynamic).
  • Callers of the interface or API don’t need to know about the memory management aspects.

Creating a memory pool as a separate idiom is good for encapsulating the performance optimization aspects of memory management. It encourages modularity by isolating high-level business logic from low-level resource management.

Advanced Memory Pools

Higher-level improvements to the memory pool interface are also possible. Most of the discussion here has been about a memory pool for one type of class, with a focus on reducing the number of distinct blocks requested on the heap. More advanced memory allocators are well-known, and they offer a variety of generalized performance optimizations and convenience features:

  • Thread safety (e.g., a single mutex or a lock-free version).
  • Intercepting the class-specific new and delete operators (see the sketch after this list).
  • Passing arguments to object constructors via parameter packs and std::forward().
  • Placement new operator — does not really allocate memory!
  • Custom allocators — memory pools via allocator functor objects.
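
For instance, here's a minimal sketch of intercepting the class-specific operators (an assumption built on the IndexMemoryPool template from earlier; the Widget class and pool size are hypothetical):

    #include <cassert>
    #include <cstddef>
    #include <new>      // std::bad_alloc

    class Widget {
    public:
        static void* operator new(std::size_t sz) {
            assert(sz == sizeof(Widget));
            void* p = pool_.alloc();       // O(1) pool allocation
            if (p == nullptr) throw std::bad_alloc();
            return p;
        }
        static void operator delete(void* p) noexcept {
            if (p != nullptr)
                pool_.erase(static_cast<Widget*>(p));  // O(1) release
        }
    private:
        static IndexMemoryPool<Widget, 1000> pool_;
        int x_ = 0;  // Example data member
    };

    IndexMemoryPool<Widget, 1000> Widget::pool_;  // Define the static pool

With this in place, every “new Widget” and “delete” expression for this class routes through the pool automatically, with no changes needed at the call sites.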

Additional memory management features that could be added to a memory pool include:

  • Dynamic expansion with multiple chunks rather than a fixed-size pool.
  • Multiple object types supported in the memory pool.
  • Dynamic size of objects allowed by allocating multiple large “pools” or memory chunks.
  • Downsizing the memory pool if fewer objects are required.

Even more general than memory pools is the concept of “custom allocators.” The idea with custom allocators is not just to enhance the memory handling of a few classes, but to take over the whole memory allocation shemozzle from the standard library.
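
As a taste of what's involved, here's a minimal sketch of the custom allocator interface (an assumption, not the book's code): since C++17, a standard-conforming allocator only needs value_type, allocate, and deallocate, plus equality operators.

    #include <cstddef>
    #include <new>

    template<typename T>
    struct PoolAllocator {
        using value_type = T;
        PoolAllocator() = default;
        template<typename U> PoolAllocator(const PoolAllocator<U>&) {}

        T* allocate(std::size_t n) {
            // A real version would draw from a memory pool here.
            return static_cast<T*>(::operator new(n * sizeof(T)));
        }
        void deallocate(T* p, std::size_t) noexcept {
            ::operator delete(p);  // A real version returns to the pool
        }
    };
    template<typename T, typename U>
    bool operator==(const PoolAllocator<T>&, const PoolAllocator<U>&) { return true; }
    template<typename T, typename U>
    bool operator!=(const PoolAllocator<T>&, const PoolAllocator<U>&) { return false; }

    // Usage: std::vector<int, PoolAllocator<int>> v;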

Extensions

  1. Build your own simple memory pool templated class.
  2. Add a memory pool to your object class by overloading a set of class-specific new and delete operators, sending these requests to the memory pool instead.
  3. Code up multiple types of memory pools and measure their performance.
  4. Generalize your memory pool class to dynamically manage multiple big chunks of memory, rather than just one.
  5. Implement an advanced dynamic memory pool using std::hive (C++26) as the underlying data structure, rather than a vector or array.

References

  1. Sourav Ghosh, July 2023, Building Low Latency Applications with C++, Packt Publishing, https://www.amazon.com/dp/1837639353
  2. Larry Jones, 27 Feb 2025, Mastering Concurrency and Multithreading in C++: Unlock the Secrets of Expert-Level Skills, https://www.amazon.com.au/Mastering-Concurrency-Multithreading-Secrets-Expert-Level-ebook/dp/B0DYSB519C/
  3. Devansh, Feb 27, 2024, A quick introduction to Memory Pools: Optimizing Memory Management in Software Engineering, https://machine-learning-made-simple.medium.com/a-quick-introduction-to-memory-pools-cc3198d004db
  4. happyer, Apr 23, 2024, Memory Pool Techniques in C++, https://medium.com/@threehappyer/memory-pool-techniques-in-c-79e01f6d2b19
  5. Bernardo Palos, 2025, Memory Pools in C++: What They Are and How to Implement Them, https://palospublishing.com/memory-pools-in-c_-what-they-are-and-how-to-implement-them/
  6. Stack Overflow, 2019, C++11 memory pool design pattern? https://stackoverflow.com/questions/16378306/c11-memory-pool-design-pattern
  7. Boost, 2011, Boost.Pool, https://www.boost.org/doc/libs/1_53_0/libs/pool/doc/html/index.html
  8. Roger Ferrer Ibáñez, Nov 19, 2017, A very simple memory pool in C++11, https://thinkingeek.com/2017/11/19/simple-memory-pool/
  9. Contributors, May 2025 (accessed), jemalloc memory allocator, https://jemalloc.net/, https://github.com/jemalloc/jemalloc (originally from FreeBSD, then updated by the Mozilla Foundation and Facebook, Inc.)
  10. Google, May 2025 (accessed), TCMalloc, https://github.com/google/tcmalloc (Apache 2.0 License)

 
