Aussie AI
Chapter 26. Fast Ring Buffers
-
Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
-
by David Spuler, Ph.D.
What is a Ring Buffer?
A ring buffer is an array-like data structure where the data moves around in a “ring” so that the end wraps around to the beginning. It’s also known as a “circular buffer” and is often what is meant when people talk about a “fixed-size queue.”
A ring buffer is stored in a single array or vector of contiguous data, but is not accessed in the usual array idiom. The data is processed in a FIFO (First-In-First-Out) manner, where items are added at the “tail” of the queue, and removed from the “head” for processing. Hence, a ring buffer is a good data structure for implementing a fixed-size queue or deque (double-ended queue).
Some of the main design decisions when implementing a ring buffer involve error handling:
- Overflow — inserting into a full buffer
- Underflow — removing from an empty buffer
Should the ring buffer throw an exception, or just return a Boolean failure status to the caller?
Simple Ring Buffer
A basic ring buffer data structure has three main elements:
- Array or vector of objects (fixed-size)
- Head index (integer)
- Tail index (integer)
Here’s some code using std::array for a ring buffer:
template<typename T, int sz>
class RingBuffer {
private:
    std::array<T, sz> arr; // Fixed-size array
    int head;
    int tail;
    // ....
};
New objects are inserted at the tail, and retrieved for processing from the head. In a typical implementation, the progression goes from left to right, using a “+1” step to the next location. Technically, the ring buffer data could be handled in reverse order, but the forward progression around the ring is simpler and allows marginally more efficient arithmetic because there are no negatives to handle.
Thus, the basic primitives needed by a ring buffer:
- Insert at the tail
- Remove at the head
Here’s the basic insertion method:
bool push(const T& x) {
    int newtail = (tail + 1) % sz;
    if (newtail == head) {
        // Overflow (full)
        return false;
    }
    arr[tail] = x;  // store at the current tail (the next free slot)
    tail = newtail; // then advance
    return true; // success
}
And here’s the “top” method for an interface that allows “top” to access, and “pop” to remove:
T top() {
    if (is_empty()) {
        // Underflow
        return T{}; // default-constructed value (or throw)
    }
    return arr[head];
}
The “pop” method actually removes the item from the ring buffer:
void pop() { // Just remove (no return)
    if (is_empty()) {
        // Throw exception? (optional)
        return;
    }
    head = (head + 1) % sz; // advance past the front element
}
And there are also various simple primitives:
- Capacity — the fixed-size of buffer.
- Empty — zero elements
- Full — fixed-size array is full.
The code is reasonably simple:
int capacity() const { return sz; } // Note: only sz-1 slots are usable (one stays empty)
bool is_empty() const { return head == tail; }
bool is_full() const { return (tail + 1) % sz == head; }
Pros and Cons of Ring Buffers
The main advantage of a ring buffer is that it has contiguous data.
This means that our fixed-size queue should be faster to access
than one stored as a linked list using std::queue.
The main disadvantage of a ring buffer is that it has a fixed size,
unlike std::queue, which grows dynamically.
This ring buffer size doesn’t necessarily need to be known at compile-time,
but does need to be set when you initialize the ring buffer.
There are also more advanced types of ring buffers which use multiple arrays, which can be
dynamically grown in size.
The other disadvantages are that the ring buffer is very specific to a FIFO access pattern. It’s not a fast data structure for these operations:
- Searching for a value
- Sorting data
- Inserting at a random location (rather than the tail)
- Deleting from a random location (rather than the head)
Insertions and deletions are slow because they require a “shuffle” of all objects. Note that there’s an interesting wrinkle: we could make insertion and deletions fast if we don’t mind violating the FIFO ordering and moving objects around (invalidating any pointers or iterators referencing them). The idea is that the ring buffer becomes like an unsorted array (with wraparound):
- Fast random insertion — move the current element at the insertion location to a free location at the end of the ring buffer, then insert.
- Fast random deletion — move the last element to the location we are deleting from.
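To make the idea concrete, here is a minimal sketch of the “move the last element” deletion trick on an unsorted fixed-size container. The class and method names (UnsortedBuffer, erase_at) are illustrative only, and the sketch simplifies to a flat unsorted array rather than a wrapping ring, since that is what the buffer effectively becomes once FIFO order is abandoned:

```cpp
#include <array>
#include <cstddef>

// Unsorted fixed-size container using the "move the last element"
// trick for O(1) random deletion. FIFO order is deliberately NOT
// preserved, and any pointers to the moved element are invalidated.
template<typename T, std::size_t Cap>
class UnsortedBuffer {
    std::array<T, Cap> arr{};
    std::size_t n = 0; // current element count
public:
    bool push_back(const T& x) {
        if (n == Cap) return false; // full
        arr[n++] = x;
        return true;
    }
    bool erase_at(std::size_t i) {
        if (i >= n) return false;
        arr[i] = arr[n - 1]; // overwrite the victim with the last element
        --n;
        return true;
    }
    std::size_t size() const { return n; }
    const T& operator[](std::size_t i) const { return arr[i]; }
};
```

Deleting the first element moves the last element into slot zero, so deletion is a single assignment rather than a shuffle of all subsequent objects.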
It’s not all bad news. The data in a ring buffer is mostly stored contiguously, so there are some operations that still have good cache locality properties:
- Scanning or visiting all data elements
- Random access of data by integer index
A linear scan of all the elements can be quite fast, provided you don’t mind that it’s unsorted (or rather, it’s sorted by order-of-insertion). The data elements are always in one or two contiguous data blocks, which is better than dispersed data structures like linked lists or binary trees. However, it’s not quite as fast as an array or vector of objects, which is always one contiguous block.
Accessing one of the objects via an integer ordinal is still quite fast (i.e., 0...n-1).
Mainly, it’s just some integer arithmetic with head and tail to find its array offset
in the ring buffer.
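That index arithmetic can be sketched as a small helper (a hypothetical function, assuming head is the index of the first element and sz is the capacity):

```cpp
// Map a logical ordinal i (0..count-1) onto the physical array
// offset, wrapping around the end of the ring.
inline int physical_index(int head, int i, int sz) {
    return (head + i) % sz; // i-th element counting from the head
}
```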
Incremental Count Optimization
Computing how many elements are currently inside the ring buffer is somewhat tricky. With the implementation above, we can compute the “count” using arithmetic on the head and tail indices:
int count() const {
    return (tail >= head) ? tail - head : sz - (head - tail);
}
An alternative that can be faster, if the count() method is called often,
is to maintain an incremental count, and store it in the ring buffer.
The idea is pretty simple:
- Insertions — count++ (except if full)
- Deletions — count-- (except if empty)
- Count — just return the count variable.
Hence, the computations during insertion and deletion are only
a single integer increment or decrement,
and the count() function becomes a simple getter of an integer data member.
In addition, the availability of a “count” variable actually allows some optimizations
to some of the other methods:
- empty() — test count == 0
- full() — test count == capacity
These are much faster than the earlier versions using head and tail index arithmetic. Hence, these efficiency gains may override the extra costs from incrementally computing the count during object insertions and removals.
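Putting the incremental count together, here is a sketch (the member name count_ is illustrative). Note a bonus of the explicit count: the head==tail empty/full sentinel is no longer needed, so all sz slots become usable:

```cpp
#include <array>

// Ring buffer with an incremental count. The empty/full tests no
// longer rely on head/tail arithmetic, and count() is a getter.
template<typename T, int sz>
class CountedRingBuffer {
    std::array<T, sz> arr{};
    int head = 0;   // index of the first (oldest) element
    int tail = 0;   // index of the next free slot
    int count_ = 0; // incremental element count
public:
    bool push(const T& x) {
        if (count_ == sz) return false; // full -- O(1) test
        arr[tail] = x;
        tail = (tail + 1) % sz;
        ++count_;
        return true;
    }
    bool pop(T& out) {
        if (count_ == 0) return false; // empty -- O(1) test
        out = arr[head];
        head = (head + 1) % sz;
        --count_;
        return true;
    }
    int count() const { return count_; } // simple getter
    bool is_empty() const { return count_ == 0; }
    bool is_full() const { return count_ == sz; }
};
```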
Avoiding Three Integers
If we use an incremental count optimization for the number of items in the ring buffer, we end up with three integer values:
- Head
- Tail
- Count
It turns out that we don’t need all three, because they are inter-related numbers. We can calculate the “tail” variable from the “head” and the “count” value.
tail = (head + count) % sz;
There are actually some other numbers that are also related, which we could also use. For example, the total number of insertions and deletions of objects is related to the head and tail values, and the count is simply the difference between them.
Alternative Variable Pairs. It turns out that a ring buffer can be defined by any two variables from a set of several related calculations. Some of the possible pairs include:
- Head and tail
- Head and count
- Tail and count
Note that there are two main implementations of the initialization of head and tail values. These yield implementations that differ by one in all calculations, so you have to consistently choose between them:
- head = tail = 0
- head = 1, tail = 0
The meanings of head and tail differ slightly in these two variants. Hence, the inter-relationship with the count is also different by one. Care must be taken to avoid off-by-one errors!
Combining Two Variables. The optimization ideas above reduced our three variables (head, tail, and count) down to two variables. Any pair of them will do, since they are inter-related.
But what about reducing it to one variable? Having only one integer variable in our ring buffer might be desirable because:
- Efficient single arithmetic operations.
- One integer value as an atomic for lock-free versions.
Can it be done?
The key point to note is that we really do need two distinct values.
However, we can put them together into a single integer
with encoding and packing ideas.
For example, we could store the head as 16 bits
and the count as 16 bits, and put both in a 32-bit unsigned integer.
Note that this limits the capacity of the ring buffer to 2^16 which is 65,536.
We could also pack them into a 64-bit unsigned long if we needed more capacity.
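For example, here is a sketch of 16/16-bit packing helpers (the function names are illustrative). The packed word could then be stored in a single std::atomic<std::uint32_t> for a lock-free variant:

```cpp
#include <cstdint>

// Pack head (high 16 bits) and count (low 16 bits) into a single
// 32-bit word, so the whole ring buffer state is one integer.
inline std::uint32_t pack_state(std::uint16_t head, std::uint16_t count) {
    return (static_cast<std::uint32_t>(head) << 16) | count;
}
inline std::uint16_t state_head(std::uint32_t state) {
    return static_cast<std::uint16_t>(state >> 16);
}
inline std::uint16_t state_count(std::uint32_t state) {
    return static_cast<std::uint16_t>(state & 0xFFFFu);
}
```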
Modulo Arithmetic Optimizations
The % operator for modulo arithmetic (or remainders) is one of the slowest operations in C++.
The typical code we want to optimize in a ring buffer or fixed-size queue uses this idiom:
head = (head + 1) % N;
Modulo arithmetic is based on division, which is also slow, even on integers. Hence, our ring buffer can be improved by getting rid of the percent!
How? There are several options:
- Bitwise arithmetic
- Type casts
- Ternary operator
- Branchless coding
- Unsigned arithmetic
Bitwise-and trick.
Firstly, if we choose the buffer size N, to be a power-of-two, then we can use
bitwise arithmetic.
A remainder of a power-of-two is the bitwise-and of the number one less.
These are equivalent:
head = (head + 1) % 16; // Modulo
head = (head + 1) & 15; // Bitwise-and
Validating power-of-two. One thing you might want is a safety net to ensure nobody uses the ring buffer for a size that’s not a power-of-two. We want this:
static_assert(is_power_of_two(N)); // How?
We can use the Kernighan bit trick:
static_assert( (N & (N-1)) == 0); // Kernighan
How does this work?
It’s just magic, and let’s forget about it.
No, actually, the Kernighan trick is that “N&(N-1)” clears the value
of the rightmost bit of a number.
Hence, if the number without the rightmost bit equals zero,
then there’s only one bit set in the number.
And the set of numbers with only one bit set: powers of two.
Note that lots of parentheses are necessary around the bitwise operator
to avoid an operator precedence glitch.
Also note that the Kernighan trick fails with a false positive
if N is zero or negative, so we should add some more safety checks
at compile-time:
static_assert(N > 0);
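Putting both checks together, a constexpr helper along the lines of the is_power_of_two() name hypothesized above can wrap the Kernighan test and the positivity check in one place:

```cpp
// A constexpr power-of-two test, usable inside static_assert.
constexpr bool is_power_of_two(unsigned n) {
    return n > 0 && (n & (n - 1)) == 0; // Kernighan bit trick
}
```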
Type casts. The use of bitwise-and is limited to powers of two, which is annoying, but there’s an even more specific way to do this for some of them: type casts. If we can choose the size as 256 (8-bits) or 65,536 (16-bits), we can do this:
head = (unsigned char)(head + 1); // 8-bits
head = (unsigned short)(head + 1); // 16-bits
Note that type casts are often effectively free after C++ does its optimization thing. The register allocation algorithm can just choose to use a value in a different way, and propagate that forward to other arithmetic. Thus, a type cast operation may result in zero runtime instructions.
Ternary operator. But why are we using arithmetic in general, when there’s actually only one case where we want to reset the value. Another way is to use the ternary operator instead of arithmetic. The calculation becomes:
head = (head + 1 == N) ? 0 : head + 1;
We can also implement this logic in two instructions, which is worth a try:
head++;
if (head == N) head = 0;
Or if you like short-circuiting operators, you can do this:
(++head) == N && (head = 0);
The compiler probably treats that the same, but you never know,
and you might want to check the assembly output (e.g., using “gcc -S”).
Branchless coding tricks.
Another trick is to notice that we just want to zero the value in one specific case.
Hence, we can use the branchless coding trick of using logical operators as 0 or 1 integers.
The goal of branchless coding is to remove all control flow branches, so that the CPU’s
branch prediction logic can run fast.
Note that the ternary operator is actually like an if statement, and it has two branches.
The branchless version with only fixed arithmetic is:
head = (head + 1) * (head + 1 != N); // Branchless
The way this works is to multiply the value by 0 or 1, depending on the logical test. Again, we can also try this as two statements:
head++;
head *= (head != N); // Branchless
Note that I doubt the branchless versions are very efficient, because they’ve added a multiplication operation. The ternary operator version is likely better, and isn’t that bad despite its branches, if you look at the assembly. Most compilers will convert it to a single CMOV (conditional move) CPU instruction, which makes it effectively branchless, too.
Unsigned arithmetic. One final trick is to note that we have modulo arithmetic for free in the CPU: unsigned integer arithmetic. Overflow of unsigned integers is not an exception in C++ and when you think about it, implements the exact semantics of modulo arithmetic. Hence, here’s the idea:
unsigned char head;
...
head++;
It works! And there’s not a single percent operator anywhere! All this time and we had cheap modulo arithmetic hiding in plain sight.
We really need to time this, because it isn’t 100% guaranteed to be faster.
A lot of the uses of head
will involve converting it from unsigned char to an integer offset,
such as for array indexing in the vector of objects that makes up the ring buffer.
A variation of this idea would be to store the head and tail as integers or unsigned integers,
so that they can be used as the fastest type of normal integer,
but still use unsigned arithmetic overflow tricks for modulo arithmetic.
This is the idea for an N=256 size ring buffer:
int head;
....
(*(unsigned char*)&head)++; // increment only the low-order byte
This relies on the platform being “little endian,” with the lowest-order byte stored at the lowest memory address,
which is true of most modern CPUs (but not if you’re sending integers over the network in “network byte order”).
And, yes, you got me, I really should use reinterpret_cast here rather than
the old C-style type cast.
Obviously, these tricks of using head and tail as unsigned integers
only work for a limited set of sizes:
- N=256 — unsigned char (8-bits)
- N=65,536 — unsigned short (16-bits)
- N=4,294,967,296 (about 4.3 billion) — unsigned int (32-bits)
We can even do decrement and negative calculations this way,
since underflow is also not an exception,
whereas the % operator and negatives don’t talk to each other at parties.
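A tiny sketch of this wraparound behavior for an N=256 ring buffer (the helper names are illustrative):

```cpp
#include <cstdint>

// For N=256, an 8-bit unsigned index wraps modulo 256 for free:
// unsigned overflow and underflow are well defined in C++, so no
// % operator is needed in either direction.
inline std::uint8_t next_index(std::uint8_t i) {
    return static_cast<std::uint8_t>(i + 1); // 255 wraps to 0
}
inline std::uint8_t prev_index(std::uint8_t i) {
    return static_cast<std::uint8_t>(i - 1); // 0 wraps to 255
}
```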
Move Semantics
If our ring buffer contains complex objects, there are many more considerations for making it efficient. One of the biggest inefficiencies in a ring buffer class is inserting and deleting any non-trivial objects. If we do it wrong, we’re calling copy assignment operators and copy constructors to make new objects in the array, and running the destructor when we release an object.
Move semantics to the rescue!
The first point to note is that it doesn’t matter for simple data types in our ring buffer. Any scalar values like integers or floating-point numbers don’t have any copy constructors or destructors to worry about. In fact, this is also true of simple structures and classes, so long as they are “plain-old data” or POD data types.
But anything more complicated than this will have costly calls to copy constructors and copy assignment operators. To optimize this, we need to talk about:
- Move constructor and move assignment operator
- R-value references
- Copy elision
- Return Value Optimization (RVO)
In practice, the problems arise in both our “push” and “top” versions. The “push” routine causes a copy assignment operator invocation:
bool push(const T& x) {
    // ....
    arr[tail] = x; // Copy assignment
    return true; // success
}
And the “top” member has the problem of returning an object type,
which will use a copy constructor call at the return statement.
T top() {
    // ...
    return arr[head]; // Copy constructor
}
The automatic compiler optimization of “copy elision” might help improve the performance of the “top” method. Returning an object is exactly the situation it’s meant for. However, we can use move semantics explicitly to ensure it’s improved:
bool pop_top_move(T& outobj) {
    if (is_empty()) { return false; }
    ct_incremental--; // incremental count (from the earlier optimization)
    int oldhead = head;
    head = (head + 1) % sz;
    outobj = std::move(arr[oldhead]); // Move assignment
    return true; // success
}
Note that std::move() is a compile-time
type-cast here, without any runtime cost.
And it’s required to convert to an R-value reference,
as otherwise the assignment statement
would still call a copy assignment operator.
Constructor Problems
One of the performance problems with our ring buffer implementation
is that std::array calls the constructor for every object
whenever a new ring buffer object is defined or created.
This occurs with this use of std::array for our ring buffer:
std::array<T, sz> arr; // Fixed-size array
How to avoid these constructor calls? After all, our ring buffer is supposedly empty with zero objects initially. Some of the solutions that don’t work and will still call constructors:
- Raw arrays
- Pointer to std::array
Using a raw array like this will still call all the constructors when our ring buffer is created:
T arr[sz];
Similarly, we could use an allocated copy of std::array, since it’s
really an object not an array.
It works like this:
std::array<T, sz>* arrptr;
....
arrptr = new std::array<T, sz>; // in constructor
This allocates our big array in the constructor
rather than as a non-allocated data member.
This adds an extra inefficiency from the extra allocated block,
and doesn’t work anyway.
The new operator will still run all the individual object constructors.
What about using std::vector instead?
Standard Vector Problems
Using std::vector can be better than std::array,
because it delays both its memory allocation and its construction of objects:
std::vector<T> arr;
Unfortunately, I’m not a big fan of this approach, because it has other difficulties:
- Extra memory allocation call (inefficient).
- Bounds checking failures in debug libraries.
The first point is that resize() has the same problem
with too many constructor calls.
Doing this in the constructor will still call all the constructors:
arr.resize(sz); // Constructors!
So, maybe we can call reserve() instead of resize().
That won’t call constructors:
std::vector<T> arr;
// ....
arr.reserve(sz); // No constructors!
This has hopefully allocated the memory for all the objects,
without running their constructors.
But this can run into various problems when we try to use the vector elements.
The problem is on this type of statement in our push method:
arr[tail] = x;
And the same problem still occurs with our code that gets items out of the ring buffer. Note that the issue is not move semantics, because this has the same issue:
outobj = std::move(arr[oldhead]); // Move assignment
The issue is bounds checking on the [] operator for std::vector.
In theory, the reserve() function has allocated valid memory for enough objects.
However, the size() function is still zero, so the runtime bounds checking
will trigger on any debug run of the code.
Yes, on some platforms this may work, with no bounds checking. But you can run into portability problems. For example, it makes the code fail with spurious runtime errors on any type of “hardened” standard C++ library.
Explicit Destructor Calls
Another problem with our ring buffer implementation when instantiated with class types is destructor calls. Instead of too many constructor calls, we have too few destructor calls. The problems include:
- Destructor calls missed after move assignments (e.g., popping).
- Destructor calls on destroying the whole ring buffer.
One solution: don’t bother. If the object that’s used in a ring buffer doesn’t have important destructor actions after a move (and it shouldn’t), or if destroying the whole ring buffer is in the shutdown sequence of the application, then you can maybe just forget about this problem.
Another solution is to
explicitly call the destructor ourselves.
You can call the destructor of a class like any other member function using
the ~T() syntax.
For example, in the pop function, we can do:
arr[head].~T(); // Explicit destructor
Basic types don’t need destructor calls, so we ideally want to distinguish trivial types from fancy class objects. We can also use type traits to do this, which are wonderfully efficient compile-time operators that work during instantiation of the template. Here’s how it works:
if (!std::is_trivially_destructible<T>::value) {
arr[head].~T(); // Explicit destructor
}
The alternative is to note that trivial types have no-op destructors, and the compiler would remove them anyway. Hence, the above type trait test may be unnecessary, but it’s a fast compile-time test anyway, so either way is fine.
Note that we are assuming here that the class being used has a destructor that works properly after an object has been moved away. In other words, it doesn’t do something silly like assuming a pointer in the object is non-null. The move assignment operator also needs to properly clear all the non-trivial data members, such as pointers, to zero or null values, so that the destructor doesn’t access bad memory after a move.
Class Interface Bypass
There are a couple ways to bypass the class interfaces, and thereby avoid the inefficiencies of construction and destruction. This makes the caller of our ring buffer manage when the objects are created and destroyed. The main ways are:
- Blocking non-trivial types
- Raw character buffer arrays
- Pointers to objects
Trivial types only. We can make our ring buffer, or other home-grown containers, faster simply by disallowing their use with complex objects. We can efficiently trigger compiler errors with type traits, so that users of the template know to only use scalars or other POD types. Here are some examples using the various different settings:
static_assert(std::is_pod<T>::value); // Plain-Old Data (deprecated in C++20)
static_assert(std::is_trivial<T>::value); // Trivial type
Raw character-array memory buffers.
The idea is to use a character array as a raw buffer,
rather than std::array or std::vector,
for our container class (e.g., our ring buffer).
To bypass class constructions by using raw memory buffers,
we have choices like:
char arr[sizeof(T) * sz]; // Static data member
char *arr = new char[sizeof(T)*sz]; // Dynamic allocation
This raw byte idea is workable, but every use of the array has to involve index calculations
and type casts to object-type pointers.
It’s fiddly and annoying, but it’s faster, because it avoids constructor calls,
and doesn’t need all the extra messing around to avoid std::vector bounds checking.
There are also concerns with:
- Uninitialized bytes in the buffer
- Alignment of addresses
We really should also initialize the bytes in our array buffer
to all nulls in the constructor
using memset on the whole array.
To do this, we also need to make sure that
all the classes using the ring buffer
have properties like:
- All-bytes-null is a stable but invalid initial status of the object.
- Destructor doesn’t fail on an all-bytes-null object.
We also need to manually take care of alignment of the addresses,
since the compiler thinks we only have characters, which don’t have alignment issues.
There’s the alignas standard specifier and various non-standard implementations
for older language versions.
If we’re really careful, maybe the initialization is not needed
and we can leave out the memset call in the constructor.
There are some new “uninitialized memory” primitives coming in C++26
that may also help here.
You can maybe avoid needing the null byte initialization, but I’m betting against you
when I run valgrind on your code.
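Here is a minimal sketch of the raw-buffer slot idea, combining alignas, placement new, and an explicit destructor call. The RawSlots class and its methods are illustrative fragments, not a full ring buffer:

```cpp
#include <cstddef>
#include <new>

// Raw byte storage for sz objects of type T: no T constructors
// run when a RawSlots object is created.
template<typename T, std::size_t sz>
class RawSlots {
    alignas(T) unsigned char bytes[sizeof(T) * sz]; // raw, aligned storage
public:
    T* slot(std::size_t i) {
        return reinterpret_cast<T*>(bytes + i * sizeof(T));
    }
    void construct(std::size_t i, const T& x) {
        new (slot(i)) T(x); // placement new: copy-construct in place
    }
    void destroy(std::size_t i) {
        slot(i)->~T(); // explicit destructor call
    }
};
```

The alignas(T) specifier handles the alignment concern above, and the caller takes responsibility for constructing and destroying each slot exactly once.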
Pointers. As much as I admire the design of move semantics, there is a simpler way to avoid the overhead of objects moving in and out of our ring buffer. Old-school coding still works: store pointers to the objects in the ring buffer instead of full objects. The upside is avoidance of object copying and moving overhead.
The downside of pointers is the extra level of indirection, and double hit to memory with poor cache locality because of that. And pointers have a few pitfalls with a bad reputation as being unsafe, but I’m sure you’ve heard that before.
Extensions
- Implement a reverse ring buffer that uses decremented indices for head and tail, rather than addition, so that it grows from right-to-left instead of left-to-right.
- Implement a deque in a ring buffer by adding “insert-at-head” and “remove-from-tail” operations for the ring buffer (rather than the normal insert-at-tail and remove-from-head idiom). The trick is we’ll need to subtract one from indices and go in reverse.
- Implement a ring buffer with initialization of “head=1” and “tail=0” (rather than “head=tail=0”). All calculations will differ by one; for example, the “empty” test is no longer “head==tail”.
- Implement a ring buffer using two full-size integers that count the number of insertions and deletions. Note: the relationship between head and tail versus insertions and deletions is not that difficult!