Chapter 23. Sorted Arrays

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.

Sorted Arrays Overview

There is no standard C++ sorted array class, so we’ve got to implement our own. The C++ containers that can be used for sorted arrays include:

  • std::vector — variable size.
  • std::array — fixed size at compile-time.
  • Native C++ arrays — old-style, non-container, built-in arrays.

A sorted array has a good search lookup cost, being logarithmic in the number of elements, by using the “binary search” lookup algorithm. However, that’s not as good as a hash table (e.g., std::unordered_map), which has O(1) average search cost.

Insertions and deletions have a poor O(n) theoretical complexity, although the first phase of finding where to insert or delete is also logarithmic, using an algorithm very similar to binary search. The linear cost arises because once they find the location, they then need to shuffle elements:

  • Make a gap (insertion), or
  • Close a gap (deletion).

If we’re using std::vector for our array, we can use its insert() method, which does the shuffle behind the scenes. Note that std::array is fixed-size, so it has no insert() method, and we have to code the shuffle ourselves.

The main advantage of a sorted array is that it’s, well, sorted, so if we want to process the array elements in sorted order, then it’s already done for us. That’s desirable because sorting an unsorted array is expensive with an O(n log n) complexity (e.g., std::sort typically uses introsort, a quicksort-heapsort hybrid).

If we need sorted data, there are other options among the C++ containers. The std::map container is implemented as a balanced binary tree, called a “red-black tree,” and this has logarithmic complexity for all major operations: search, insertion, and deletion. However, a sorted array has a good memory cost because it uses contiguous storage, so it should not be underestimated!

Shuffling Array Elements

Shuffling array elements along by one position is required for both insertion and deletion in sorted arrays. Shuffle right to create a gap for a new insertion, and shuffle left to close a gap after deletion. We can also use this idea for unsorted arrays, but there are faster tricks, as examined later in this section.

In practice, shuffling of sorted arrays is quite efficient for scalar types via a memory block copy, using the memmove() standard function. Note that memmove() is an old C library function that does a bytewise copy of the memory, ignoring object constructors and move operators, so it is only safe for trivially copyable types. Presumably, the standard insert() method uses fast byte copies for scalar types.
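
As a concrete illustration, here’s a minimal sketch of the right-shuffle for plain int elements using memmove(), which is well-defined for overlapping blocks (the function name open_gap_ints is ours, not standard):

    #include <cstddef>
    #include <cstring>

    // Open a gap at index 'pos' by shuffling arr[pos..n-1] right one slot.
    // memmove() handles the overlapping source and destination correctly.
    void open_gap_ints(int* arr, std::size_t n, std::size_t pos) {
        std::memmove(arr + pos + 1, arr + pos, (n - pos) * sizeof(int));
    }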

Here’s an obscure pitfall: we cannot use various other copying methods, because the shuffle involves overlapping source and destination memory blocks. These functions lead to undefined behavior on overlapping memory blocks, which any array shuffle necessarily involves:

  • std::memcpy (old C-style)
  • std::copy_n

However, we can use the range overloads of std::move (from <algorithm>) that work on ranges of multiple objects. These versions of std::move have a real runtime cost, unlike the basic single-argument version (from <utility>), which is a compile-time type-cast that converts to a movable R-value reference (with no runtime code generated). We also need to pay attention to whether we are shuffling to the left or right, because these functions don’t work for all overlapping arguments (a sketch follows the list below).

  • std::move or std::copy — moving or copying left (i.e., close a gap for deletion).
  • std::move_backward or std::copy_backward — moving or copying right (i.e., create a gap for insertion).
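
Here’s a minimal sketch of both shuffle directions on a raw buffer, assuming the array has spare capacity for the right shuffle (the names open_gap and close_gap are our own):

    #include <algorithm>
    #include <cstddef>

    // Shuffle right: open a gap at index 'pos' for an insertion.
    // Moving right must run back-to-front, hence std::move_backward.
    template <typename T>
    void open_gap(T* arr, std::size_t n, std::size_t pos) {
        std::move_backward(arr + pos, arr + n, arr + n + 1);
    }

    // Shuffle left: close the gap at index 'pos' after a deletion.
    // Moving left runs front-to-back, so the range std::move is safe.
    template <typename T>
    void close_gap(T* arr, std::size_t n, std::size_t pos) {
        std::move(arr + pos + 1, arr + n, arr + pos);
    }

Note that after close_gap, the last element is left in a moved-from state and should be destroyed or overwritten before reuse.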

Note that the std::copy and std::copy_backward functions also work here, but copying is slower than moving for non-scalar types. Hence, the std::move versions are more general, but they still have some downsides:

  • Expensive for non-scalar objects.
  • Invalidates iterators into the array.
  • Invalidates any pointers or references to specific objects.

Unfortunately, the shuffle cost is terrible for complex objects, which require their move operators to be called for every single element. I can’t say that I recommend sorted arrays for those types. Note that there are also various types of objects where we could still use a memory block move to do a “shallow move” of the objects (i.e., “relocatable objects”), rather than individually moving each element. However, using this idea requires tricks to prevent the C++ container from invoking its move machinery, such as using a low-level raw array rather than std::vector.

Binary-Like Sorted Array Insertion

Sorted arrays are logarithmic for searches, but not quite as good for insertions and deletions. Inserting a new element into a sorted array is a three-phase algorithm:

    1. Find the location to insert,

    2. Shuffle elements to the right (create a gap), and

    3. Insert the new element at the location.

There are three ways to find the location in a sorted array:

    1. Linear search from the front.

    2. Linear search from the back.

    3. Binary-like search (faster!)

Linear search over a sorted array doesn’t use equality, but instead finds the first element that is bigger than the new element. Or, to go in reverse, start at the end and look for the first element that’s smaller than the new one.

The advantage of starting at the end is that we can shuffle as we go, but it’ll have terrible cache locality problems in accessing memory addresses in reverse. CPU memory prefetch algorithms usually assume a forward access order.

Anyway, neither of the linear algorithms is fast, and they aren’t typically used. Instead, a binary-like search for the insertion point is much faster, with logarithmic complexity.

Binary-like search for insertion involves splitting up the array into two intervals, and choosing between the two based on the midpoint value. This is not exactly the same as binary search, because we’re assuming that the element is not already in the array. Hence, it’s like binary search, but we’re looking for smaller versus bigger elements in comparison to the new element, rather than seeking equality.

You can code your own binary insertion algorithm, or the standard C++ library has two functions to help:

  • std::lower_bound()
  • std::upper_bound()

These are general algorithms over many containers’ iterator ranges, but they require the underlying data to be sorted, or “partitioned” in official standardese. It means the same as “sorted” for everyone except polymaths who like Abelian groups.

For random-access containers, the standard guarantees that these functions perform a logarithmic number of comparisons, so they are effectively a binary-like search and suitably fast. However, if we follow one up with an insert() call, then that’s an array shuffle that’s likely to be linear in cost.
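
Putting the two phases together, a minimal sketch of sorted insertion into a std::vector might look like this:

    #include <algorithm>
    #include <vector>

    // Insert 'x' into an already-sorted vector, keeping it sorted.
    void sorted_insert(std::vector<int>& v, int x) {
        auto pos = std::lower_bound(v.begin(), v.end(), x); // logarithmic search
        v.insert(pos, x);  // linear-cost shuffle happens behind the scenes
    }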

Note that there’s no equivalent “binary deletion” algorithm when we’re deleting from a sorted array. That just uses normal binary search to find the element, such as std::binary_search, if it’s there, and then we can remove it. Insertion is different to deletion in that sense.

Sorted Array Deletion

Deletion of an element in a sorted array is easier than insertion. There are two major phases:

    1. Find the element using binary search.

    2. Shuffle the elements left to close the gap.

Note that we’re using real binary search, not the binary-like search for insertion, because we assume the element is present. We can’t delete an element that’s not in the array. Hence, we can use std::binary_search to test whether the element is present, although it only returns a bool, so in practice std::lower_bound is the way to get the element’s position.

The deletion phase is a left shuffle of the subsequent array elements. As discussed above, we can do a byte copy such as memmove(), or use the range version of std::move, both of which are well-defined for this direction of overlapping memory blocks.

These methods can be efficient for scalar and other trivial types where bitwise shallow copying is allowed, but may trigger a cascade of move constructors or move assignments on complex classes. Thus, sorted arrays can be potentially inefficient for non-scalars because of the hidden costs of shuffling objects.
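
Here’s a minimal sketch of the whole find-and-shuffle deletion for a std::vector, using std::lower_bound to locate the element (sorted_erase is our own name):

    #include <algorithm>
    #include <vector>

    // Delete one occurrence of 'x' from a sorted vector, if present.
    bool sorted_erase(std::vector<int>& v, int x) {
        auto pos = std::lower_bound(v.begin(), v.end(), x); // binary search phase
        if (pos == v.end() || *pos != x)
            return false;  // not present; nothing to delete
        v.erase(pos);      // left shuffle happens inside erase()
        return true;
    }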

Batched Multiple Insertions in Sorted Arrays

The optimization here is that we can perform two or more sorted array insertions faster if we do them together. There are several possibilities to consider:

    1. Adjacent sequences — if we have two or more items that “fit” between two elements of the array, we can insert them in a block, and do only one shuffle.

    2. Sorted sequence — if we have a sorted list of two or more new elements to insert, we can insert them with a single merge pass (i.e., merging two sorted arrays into one longer array).

These ideas are certainly more efficient than naive repeated insertions, but they are special cases. However, looking at those efficiency gains, we get the inspirational idea for a more general way to handle a sorted array that’s undergoing lots of insertions:

    1. Defer the insertions by storing the to-be-inserted elements in a separate location.

    2. Batch insert all of the new elements at some point later (e.g., every 100 insertions).

For example, we could store the first 100 to-be-inserted elements in a separate array of 100 elements. Or we could reserve an extra 100 elements at the end of the main vector.

However, correctness before efficiency. Our to-be-inserted array elements are supposedly already inserted into our sorted array, so they should be found by search, and should be able to be deleted, too. Hence, both search and deletion algorithms will need to look in two locations, which gets complex and bug-prone, not to mention less efficient.

Even assuming we’ve fixed those bugs, the overall efficiency of this batched-insertion method is actually not that great. We’ve overlooked the practical problem that the batched insertion step needs the 100 extra elements to be in sorted order before the merge. So, we’d have to run std::sort() on the new array of 100 extra elements, before merging them into the main sorted array.
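
Here’s a minimal sketch of that flush step, assuming the pending elements live in a separate vector (flush_batch is a hypothetical name): append the batch, sort only the new tail, then merge the two sorted runs with std::inplace_merge:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Merge a batch of pending insertions into a sorted vector.
    void flush_batch(std::vector<int>& v, std::vector<int>& pending) {
        auto middle = static_cast<std::ptrdiff_t>(v.size());
        v.insert(v.end(), pending.begin(), pending.end());  // append the batch
        std::sort(v.begin() + middle, v.end());             // sort just the tail
        std::inplace_merge(v.begin(), v.begin() + middle, v.end());
        pending.clear();
    }

Note that std::inplace_merge typically allocates a temporary buffer, falling back to a slower truly-in-place merge if the allocation fails.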

Alternatively, we could maintain our 100 elements as a sorted array, but then your boss might notice this oddity that could be hard to explain on a whiteboard: maintaining insertions into a sorted array to optimize insertions into a sorted array.

I’m not sure how much we’ve actually improved things. My brain is about to explode figuring it out, but feel free to talk amongst yourselves. I’m going to analyze deletions instead.

Batched Multiple Deletions in Sorted Arrays

The deletion of an element in a sorted array is a “find-and-destroy” sequence that is quite inefficient. The finding of the element is fast using binary search, with a logarithmic cost. However, the shuffle required to remove an element in the middle of the array has linear cost.

If we’re doing a lot of deletions, the cost is significant from a lot of shuffling. If we’re inserting and later deleting n elements into a sorted array, and each deletion is linear, then it’s quadratic in complexity.

Can we do better?

Yes, and there are multiple ways to do so. Some of the options include:

  • Deleting a range of elements
  • Deferred deletions

The simplest idea is to delete two or more array elements at once. This reduces multiple deletions to a single shuffle operation. However, it’s not always possible to know that our many deletions will be in a subrange or even pairs of adjacent elements.
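
For the cases where we do know the deletions form a contiguous run, a single-shuffle range erase is trivial with std::vector (erase_range is our own wrapper):

    #include <cstddef>
    #include <vector>

    // Remove the contiguous index range [lo, hi) with a single shuffle.
    void erase_range(std::vector<int>& v, std::ptrdiff_t lo, std::ptrdiff_t hi) {
        v.erase(v.begin() + lo, v.begin() + hi);
    }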

The more general case of multiple random deletions can be optimized via deferred deletion algorithms. The idea of deferred deletions is to track multiple deletions, possibly in some other ancillary data structure, and then finalize the deletions all at once.

Deferred Deletions with Extra Data Structure

The idea of optimizing a large number of deletions is to group multiple deletions and then perform them together. For example, if we track the locations to be deleted, and can then detect adjacent pairs of elements (or more), then these subarrays can be deleted together. This is efficient because we shuffle once per deleted subarray rather than once per deleted array element.

Here’s an idea: store the array indices for deferred deletion. The approach involves:

    1. Store the indices of the array elements to be deleted in a secondary data structure, and

    2. Later, scan this data structure looking for adjacent indices, whose elements can then be deleted together in a single shuffle, which is faster.

In addition to the gain of processing merged pairs or longer subarrays of elements, the shuffle can also be optimized to move smaller chunks around, by removing the “gaps” where the to-be-deleted elements are located.

This approach is workable, and fortunately there’s only two problems:

    1. Bugs, and

    2. Slugs.

The downsides of this approach of storing the indices of the to-be-deleted elements include some major potential bugs:

  • Insertions will mess up the indices in the secondary data structure.
  • Searches will still find the to-be-deleted items.

Fixing these problems is not easy, and certainly not efficient. Even in the simple case of many deletions in a row, this approach is not especially efficient in general, with both space and time overhead:

  • The space overhead of the secondary data structure.
  • Extra time cost of updating the secondary data structure (and then destroying it).
  • The algorithm to find adjacent indices is inefficient (e.g., sort the indices).

The other major problem with this approach of using an extra secondary data structure containing deletion offsets is simply: it’s unnecessary. There’s a better way.

Deferred Deletions with Vector Defragmentation

There’s a simple way to handle deferred deletions in a sorted array, and it doesn’t even require an extra data structure. The basic strategy looks like:

    1. Mark each deleted element as “to-be-deleted” (later).

    2. Ignore all to-be-deleted elements (e.g., when searching).

    3. Remove all the to-be-deleted elements together (using vector defrag).

How do you mark an array element for deletion? There are two basic strategies:

    1. Add a new Boolean flag to each array element, or

    2. Use special values.

Extra Boolean Flag Method. Obviously, if our array elements are large objects, then we can just add another bool data member to mark each element’s status. But if our array elements are small, such as an array of integers or timestamps or other scalars, then adding an extra data field is inefficient in terms of both space and time.

Extra space cost will be at least a byte per array element, and likely more due to alignment considerations. Larger array elements also reduce cache locality and will impact speed.
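
A hypothetical sketch of the flag approach on a larger record type:

    // Hypothetical record with an embedded to-be-deleted marker.
    struct Record {
        long long key;          // the sorted key
        // ... other data members ...
        bool deleted = false;   // deletion flag (alignment may add padding)
    };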

Special Values Method. The idea of using a special value is to re-use an existing data field in the array, rather than adding to its size. Some common special values to consider include:

  • 0
  • -1
  • nullptr
  • Negatives

Usually, we will just use a single, fixed special value such as 0 or -1. However, if we want to mark the data but still be able to recover the original value at that location, one way is to transform it reversibly. The slow way to do this is to multiply by -1, whereas the faster way is to toggle the sign bit (for floating-point types this is exactly negation; for two’s-complement integers it isn’t quite negation, but it is still self-inverse).
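
A minimal sketch of sign-bit marking for int elements, assuming the live values are all non-negative (a hypothetical precondition of this sketch):

    #include <climits>

    // Toggling the sign bit is self-inverse: marking twice restores the value.
    inline int mark_deleted(int x) { return x ^ INT_MIN; }
    inline bool is_deleted(int x)  { return x < 0; }  // valid values are >= 0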

No matter which method is used, the key point is that we can look at an array element and decide whether it’s valid or a to-be-deleted element. We’ll need to use that validity test in all of the other array operations.

Searching and Insertion with Deferred Deletions. Note that it’s quite tricky to ignore the to-be-deleted elements during search and insertion. The binary search algorithm may still “find” the number in a to-be-deleted array element, in which case you need to check adjacent array elements, as there may be a non-deleted array element with the same value.

Sorted array insertion with some to-be-deleted elements should still work. The sorting of the array key should be maintained across both valid and already-deleted elements. Inserting a new element will change all the array indices, but note that we’re not tracking these indices for our deferred deletion algorithm, so this shuffling from insertion doesn’t cause problems with the deletions done later.

An interesting wrinkle in this method occurs when an insertion hits the location of a to-be-deleted element via the binary insertion method. In this case, the new element can simply replace the to-be-deleted item, and no shuffle is needed, leading to a very efficient insertion. Furthermore, even if not, our shuffle for insertion can be shorter, as it only needs to shuffle the elements until the first to-be-deleted item is found (i.e., only shuffle up to a “gap” in the array).

Searches and insertions are not the only code to modify. You also need to ignore the to-be-deleted elements in any other array operation, such as printing the array elements or any other linear scan. It’s actually quite error-prone to have to remember to handle already-deleted elements in every operation. Easy to leave an insidious bug this way!

Vector Defragmentation. The final stage of this deferred deletion algorithm is to clean up the array to remove all of the to-be-deleted array elements. Until this is done, the array could be wasting a significant amount of space.

The idea of vector defragmentation is to scan the entire array and compact all the valid array elements together. This is accomplished via a simple two-pointer algorithm. At the end, we need to resize our array down to the reduced number of stored elements.
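
Here’s a minimal sketch of the two-pointer defragmentation pass, reusing the hypothetical is_deleted() test from the marking sketch above:

    #include <cstddef>
    #include <vector>

    // Compact all valid elements to the front, dropping marked ones.
    void defragment(std::vector<int>& v) {
        std::size_t dst = 0;
        for (std::size_t src = 0; src < v.size(); ++src) {
            if (!is_deleted(v[src]))
                v[dst++] = v[src];  // shift the valid element left
        }
        v.resize(dst);              // shrink down to the compacted size
    }

This is essentially the classic erase-remove idiom; std::remove_if followed by erase() performs the same single-pass compaction.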

Phew! That was a lot of special cases to handle for our delayed deletion algorithm. Hope it’s worth it!

Mixing Many Searches, Insertions & Deletions

The general case is where we have a sorted array that’s undergoing a large volume of searches, insertions, and deletions. A sorted array is not necessarily the best data structure for that, but the assumption is that we need a sorted array for some other reason, such as fast scanning of the entire array through its contiguous memory.

Searching is not the problem. The binary search algorithm is very efficient with logarithmic complexity in both average and worst-case cost.

Insertions and deletions in a sorted array are much worse, since both involve a “shuffle” that is linear in cost. In both cases, the main optimization to consider is a deferred algorithm, where multiple insertions and deletions can be delayed, and then performed as a group. Overall, this deferred batching idea doesn’t seem to work very well for insertions, but works extremely well for deletions.

In practice, the shuffle is not that bad, because it’s just a big memory block copy using memmove() or similar functions. Thus, if the array elements are scalars, or any other “plain old data” type that doesn’t require a move constructor or move assignment operator, then insertion or deletion in a sorted array is still technically O(n), but with such a small constant factor that it behaves more like a cheap block copy. Hence, the benefits of using deferred deletion with vector defragmentation may not be as great as they seem.

Extensions

  1. Benchmark the sorted array implementation with a raw array versus using std::vector as the internal data array, especially to see if our hand-coded binary search is fast or not.
  2. Explore the use of “shallow copying” on sorted arrays containing “relocatable objects” in the shuffle needed for insertions and deletions in a sorted array data structure.
  3. Explore the efficiency of calls to move constructors in a “shuffle” for a sorted array implemented using std::vector or std::array.
  4. Implement the binary-like search algorithm to find the insertion location in a sorted array. (Note that deletion is just the normal binary search to find the element.)
  5. Benchmark inserting into an unsorted array and then sorting using std::sort, versus incrementally maintaining a sorted array. Do the results differ for a scalar integer type versus arrays of an object like std::string (which has move operators)?
  6. Implement a hybrid binary-linear search where the binary search reverts to linear search once the interval is small enough.
  7. Implement an AVX SIMD version of linear search over integers that tests a number of integers in the array at once.
  8. Implement a “cache-aware” binary search that chooses the middle index at the start of a cache line (where possible), and tests all values in that cache line immediately using an unrolled linear search.
  9. Implement a binary search that is both cache-aware and uses AVX SIMD instructions to test all elements in the same cache line more efficiently.
  10. Implement a sorted array with deferred insertions and deletions.
  11. Is there a better way to optimize insertions into a sorted array via batched insertions or deferred insertions (in the general case)? What about if we exclude searches and deletions, so that it’s only a sequence of many random insertions? Maybe we can build some other data structure with better insertion complexity, such as a red-black tree (std::map), and then linearize it into the array with a tree traversal at the end.

 
