Chapter 22. Unsorted Arrays

Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
by David Spuler, Ph.D.
Unsorted Arrays Overview
Unsorted arrays are not an all-star data structure, and don’t get a lot of use for basic search requirements. The main features include:
- Slow search lookups in cases like associative arrays or sets (linear scan cost).
- Fast insertions and deletions (constant cost, without any “shuffle”).
- Sorting an unsorted array is costly with O(n log n) complexity.
Unsorted arrays are very useful if we want fast insertions and deletions, but rarely need to search or sort the array. Insertion is very fast with constant time, just by adding the new element at the end of the array. Deletions can also be implemented in constant time, but only via a trick of swapping the to-be-deleted element with the last element.
Interestingly, we can always fix our unsorted array by sorting it, and that turns out to be a decent idea. Let’s examine the two ways to get a sorted array:
- Build an unsorted array, then sort it, or
- Incrementally maintain a sorted array.
The first plan costs O(n) in total to do all n insertions (unsorted, appending at the end), and then O(n log n) to sort it with std::sort.
The second plan costs O(n) for each of the n insertions into a sorted array (because of the shuffle), and summing 1 + 2 + ... + n = n(n+1)/2 gives O(n^2) quadratic complexity for the incremental sorted array approach.
In summary, our analysis suggests:
- Unsorted array (sort it later) — complexity of O(n log n).
- Sorted array (incremental) — quadratic O(n^2) complexity.
So an unsorted array might be the way to go? However, as discussed above, the sorted array is not as bad as the analysis sounds if its elements are scalar types, because the “shuffle” is a single memory block copy.
Note that an unsorted array is actually sorted in a weird way: by the order of insertions. Hence, if you have an ordered sequence of data, they are mapped into the array sequence according to the order in which they are processed. If these objects have an associated timestamp, your supposedly unsorted array may well be sorted implicitly according to the timestamp field.
Unsorted arrays are underestimated, and can be efficient in practice.
An array that is unsorted functions as a list of items, but is stored in contiguous
memory, which can make scanning the array efficient in terms of cache locality
(e.g., faster than linked lists in std::list
or red-black binary trees in std::map).
Unsorted arrays can be useful for semantics other than basic search lookups. An array can efficiently implement a fixed-size stack, but a fixed-size queue is better implemented using a ring buffer that progresses around the array in a circular fashion. You can also put a balanced binary tree or a heap data structure into an array, but we’re getting far away from a basic unsorted array in doing that.
Linear Search of Unsorted Arrays
Linear search is the worst part of unsorted arrays. There’s not really a better way to search an unsorted array. Here’s a simple hand-coded linear search to demonstrate what’s happening:
int find_linear_search(const T &item)
{
    for (int i = 0; i < count_; i++) {
        if (item == arr_[i])
            return i; // found
    }
    return -1; // not found
}
The above assumes we’ve stored our data in a raw array type as the data member. If we choose to store the data as std::array or std::vector, we could search it with the standard std::find algorithm instead.
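For context, here’s a minimal sketch of the kind of fixed-capacity class that the code fragments in this chapter assume (the class name is my own invention; the members arr_, count_, full(), and empty() match the fragments):

template <typename T, int N>
class UnsortedArray {
public:
    bool full() const { return count_ == N; }
    bool empty() const { return count_ == 0; }
    // ... search, insert, and delete member functions from this chapter ...
private:
    T arr_[N + 1] {}; // One spare slot (used by the sentinel trick later)
    int count_ = 0;
};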
Note that if we were doing a lot of searches of an array without many insertions or deletions, here’s an idea: pre-sort the array! This gives us this approach:
1. Pre-sort the array with std::sort
2. Use binary search on our newly sorted array.
The use of binary search reduces our searches to logarithmic complexity, which is much faster than linear search.
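As a sketch of that two-step approach, using std::sort and std::binary_search on a std::vector (the container choice here is an assumption):

#include <algorithm> // std::sort, std::binary_search
#include <vector>

bool presort_then_search(std::vector<int> &v, int key)
{
    std::sort(v.begin(), v.end()); // 1. Pre-sort, O(n log n), done once
    return std::binary_search(v.begin(), v.end(), key); // 2. O(log n) lookup
}

In practice, you would sort once and then run many binary searches against the sorted data.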
Template Value vs Reference Parameters
Templating based on a type raises a common conundrum: should the function parameters be passed by reference or by value? The efficient choice is usually:
- Small integer types — pass-by-value.
- Large class types — pass-by-reference.
Which signature should we use?
int find_linear_search(const T &item) // Const reference
int find_linear_search(T item) // Pass-by-value
Which one we desire for larger non-class types, such as long or double, is
somewhat implementation-dependent and you need to benchmark it!
Unfortunately, there’s no clean way to alter the signature of a templated function according to a compile-time setting. The type traits header can compute a parameter type, but there’s a catch.
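For instance, here’s a sketch using std::conditional to choose the parameter type (the alias name param_type is my own). The catch is that T ends up in a non-deduced context, so callers must spell out the template argument explicitly:

#include <type_traits>

template <typename T>
using param_type = typename std::conditional<
    std::is_scalar<T>::value, T, const T &>::type;

template <typename T>
int find_linear_search2(param_type<T> item); // T is no longer deducible

// Callers must write: find_linear_search2<int>(42);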
However, the most common modern C++ style is to use const reference parameters.
The reasons are:
- Large class types — const& references are much faster.
- Small integer types — it’s not much worse.
In one sense, I’m not sure about the last point, because:
1. It’s a micro-optimization, and
2. The compiler may auto-optimize it anyway.
But there is a simple solution whereby you can use const& reference parameters
for generic types,
but use pass-by-value for small integers.
Overloading to the rescue!
Just define non-template overloads for the handful of small integer types (an exact-match overload is preferred over the template, and note that a true template specialization couldn’t change const T& to pass-by-value anyway):
int find_linear_search(int item) // Pass-by-value
{
    // etc...
}
Now you only have to define about 27 more versions for every single integral and floating-point type.
Fast Linear Search
You’re thinking that this doesn’t exist, and the heading is an oxymoron. But there are situations where linear search on an unsorted array can be faster than the alternatives:
- Small number of elements
- Sentinel search optimization
- Low-level support for searching
- Parallel linear search
Let’s examine all of these techniques in turn.
Sentinel linear search optimization.
This is an optimization attributable to Knuth (1973), who presented it in MIX assembly code. The idea is to remove the conditional test in the loop (i.e., removing “i < count_”) by guaranteeing a successful search. The trick is to add an extra element at the end of the array, which equals what we’re searching for.
Note that this requires that we declare our array data member with one more item than the capacity. We always need a spare element at the end, even if the array is full to capacity.
T arr_[N + 1]; // Extra dummy element
Sentinel-based searching is only good for arrays of scalar types, because it requires making a copy of the search element, which is placed at the end. The sentinel search of an unsorted array still has linear complexity, but with a lower constant factor because each loop iteration does less work in practice.
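Here’s a minimal sketch of the sentinel loop, assuming the spare slot declared above and the same arr_ and count_ members:

int find_sentinel(const T &item)
{
    arr_[count_] = item; // Plant the sentinel in the spare slot
    int i = 0;
    while (!(arr_[i] == item))
        i++; // No "i < count_" test needed: success is guaranteed
    return (i < count_) ? i : -1; // Stopping at the sentinel means not found
}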
Low-Level Search Support
Some types of CPU have explicit instructions that support scanning a memory block for a value. If we’re using an array of characters or bytes, there are these candidates:
- std::find — on an array, vector, or string type.
- strchr — old-style character strings (null-terminated).
- memchr — low-level memory blocks of bytes.
The modern C++ code using std::find looks something like this:
bool find_standard(const T &item)
{
    // arr_ assumed to be a std::vector here
    auto iter = std::find(arr_.begin(), arr_.end(), item);
    return iter != arr_.end();
}
The version that returns the integer index of the element in the array is:
int find_standard_index(const T &item)
{
    auto iter = std::find(arr_.begin(), arr_.end(), item);
    if (iter == arr_.end()) return -1; // Fail
    return iter - arr_.begin(); // Iterator arithmetic
}
Note that the iterator subtraction only works for random-access iterators, such as those of contiguous containers. For node-based containers (e.g., std::list), you would need std::distance instead, which walks the sequence in linear time.
Parallel Linear Search
There are multiple ways that we could parallelize our linear search algorithm. It just depends on our budget! Here are some options:
- CPU SIMD instructions (e.g., AVX or ARM Neon)
- Multithreading (on CPU)
- GPU hardware
SIMD instructions allow us to test multiple values in parallel on a CPU. For example, x86 CPUs from Intel and AMD support the AVX instruction set extensions, of which there are a few versions:
- AVX — 256-bit registers, but integer comparisons are still 128-bit (4 x 32-bit integers).
- AVX-2 — 256 bits (8 x 32-bit integers).
- AVX-512 — 512 bits (16 x 32-bit integers).
- AVX10 — the converged successor to AVX-512 (still up to 512 bits).
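As an illustration, here’s a minimal AVX-2 sketch of a parallel linear search over 32-bit integers. The function name is my own, and __builtin_ctz is GCC/Clang-specific (compile with -mavx2):

#include <immintrin.h> // AVX-2 intrinsics

int find_avx2(const int *arr, int n, int key)
{
    __m256i vkey = _mm256_set1_epi32(key); // Broadcast key to all 8 lanes
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(arr + i));
        __m256i eq = _mm256_cmpeq_epi32(chunk, vkey); // 8 compares at once
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq)); // 1 bit per lane
        if (mask != 0)
            return i + __builtin_ctz(mask); // Lowest set bit = first match
    }
    for (; i < n; i++) // Scalar tail for leftover elements
        if (arr[i] == key)
            return i;
    return -1; // Not found
}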
CUDA C++ GPU linear search. If we have an NVIDIA GPU, the type of parallelism is much more extensive. In fact, we can launch a block of 1024 threads, and each thread compares only a few elements against our search key. This sounds like an almost constant-time algorithm on the GPU, but it’s not quite that good. In practice, there are two phases:
1. Compare each loop element in parallel, and
2. Collate the results.
The GPU can compare all the array elements 1024 at a time. Hence, it’s not constant time, but it’s still linear time divided by 1024.
Also, at the end we have a synchronization problem with detecting which of the threads had a successful result of the comparison. It’s not quite as bad as a “horizontal reduction” of the array (e.g., max or sum), but we have to synchronize the results in shared memory or global memory. We could use “warp shuffle” instructions that coordinate via faster GPU registers, but these only work within each warp of 32 threads, so it ends up being like a horizontal reduction over each warp.
Unsorted Array Insertions
Inserting into an unsorted array is very fast because we can just insert it at the end. This is very efficient with constant time complexity. The code example for insertion at the end:
void insert_end(const T &obj)
{
    if (full()) {
        throw std::overflow_error("Insert on full array");
    }
    else {
        arr_[count_++] = obj;
    }
}
There’s nothing much to this code: the main path is only one statement! It’s very efficient to insert at the end of an array.
Insertion at an Index
Inserting in the middle of an unsorted array seems to be an O(n) operation. If we needed to insert into the middle, it would seem slower because of the need to shuffle the other elements out of the way. And that would certainly be true of a sorted array, where a shuffle is needed to maintain the sorted array.
But, no, we’re talking about an unsorted array here. Let’s ban the shuffle.
There’s a move trick to insert into the middle of an unsorted array at a given index in O(1) time. The trick is to note that in an unsorted array we only need to move a single element out of the way. The idea is two short phases:
1. Move the existing element “out of the way” and to the end.
2. Insert the element at that location.
Here’s a coded version of the “move away to the end” optimization. One fast way is to use std::move, which is like a type cast with no runtime code of its own, and triggers move assignment on a class object (or a simple byte copy on a scalar type):
void insert_at_offset(const T &obj, int offset)
{
    if (full()) {
        throw std::overflow_error("Insert on full array");
    }
    else {
        arr_[count_] = std::move(arr_[offset]); // Move old element to end
        arr_[offset] = obj; // Insert at location
        count_++;
    }
}
Note that this only works for an unsorted array, not a sorted array. If we wanted a sorted order, or we need the implicit order-of-insertion in an unsorted array, then this “move to end” idea cannot be used as it will ruin the ordering.
Fast Unsorted Array Deletion
There’s a trick for deleting an arbitrary element from an unsorted array that is often missed in articles. Unsorted array deletion need not be O(n) complexity, but can be done in O(1) time.
Deletion of an item from an unsorted array is a two-phase operation: find and destroy. Here’s the code to find the element, which uses linear search to find its offset, and is thus O(n) unavoidably:
void delete_key(const T &item)
{
    int offset = find_linear_search(item);
    if (offset == -1) {
        throw std::invalid_argument("Delete object not found");
    }
    else {
        delete_offset_swap(offset);
    }
}
The naive idea for deleting the element we’ve found is to remove it and “shuffle” the rest of the elements downwards (to the left) so that there’s no “gap” in the array.
Doing a shuffle isn’t so bad for scalar types, where it’s probably just
one call to memmove behind the scenes.
But for non-scalar objects, we’re moving a lot of objects.
Either way, our unsorted array deletion with a shuffle has cost complexity of O(n) time.
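For comparison, here’s a minimal sketch of the shuffle-based deletion, using the same assumed members and idioms as the rest of this chapter:

void delete_offset_shuffle(int offset)
{
    for (int i = offset; i < count_ - 1; i++) {
        arr_[i] = std::move(arr_[i + 1]); // Shuffle each element down one slot
    }
    if (!std::is_trivially_destructible<T>::value) {
        arr_[count_ - 1].~T(); // Destroy the leftover end element
    }
    count_--;
}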
There is a faster way!
First, let’s get rid of the special cases: if there’s only one element in the array, just erase it, and set the count to zero. And if the erase location is the end-most object, just erase it there, and decrement the count. Otherwise, if the object we want to remove is at the front or middle of the array, we do a tricky swap with the end element:
1. Swap arr[i] with arr[n-1].
2. Erase at arr[n-1].
3. Decrement n.
This swap idea has changed our unsorted array deletion from O(n) time to the optimal O(1) complexity. There are no loops anywhere!
Note that we can use std::swap here, and we may need to explicitly
run the destructor of objects being destroyed (optional for scalar types).
Here’s what the code looks like:
void delete_offset_swap(int offset)
{
    if (empty()) {
        throw std::underflow_error("Delete on empty array");
    }
    else if (count_ == 1) { // ***
        if (!std::is_trivially_destructible<T>::value) {
            arr_[0].~T(); // Explicit destructor (if needed)
        }
        count_ = 0;
    }
    else {
        if (offset != count_ - 1) {
            // Swap with the end element
            std::swap(arr_[offset], arr_[count_ - 1]);
        }
        if (!std::is_trivially_destructible<T>::value) {
            arr_[count_ - 1].~T(); // Explicit destructor (at end)
        }
        count_--;
    }
}
The above code uses “type traits” from modern C++ to detect whether or not we need to explicitly run the destructor when destroying an object in the array. This is very efficient because type traits are evaluated to compile-time constants, so the compiler should optimize out the path if not needed (i.e., using “dead code elimination”). There are several options available in the type traits library, depending on exactly what types we want to support in our array:
- std::is_trivially_destructible<T>::value
- std::is_destructible<T>::value
- std::is_scalar<T>::value
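In C++17, if constexpr makes this compile-time branching explicit; here’s a minimal sketch (the helper name destroy_slot is my own):

#include <type_traits>

template <typename T>
void destroy_slot(T &slot)
{
    if constexpr (!std::is_trivially_destructible<T>::value) {
        slot.~T(); // This branch is only instantiated for non-trivial types
    }
}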
Actually, the above code has a minor inefficiency. The giveaway is that the two code sequences using is_trivially_destructible are almost identical. Can you see it? We don’t need an express test for count_ == 1 (marked with stars), because the general code in the else clause handles that special case as well.
And also, what was I thinking? There’s no need to swap the element to the end, only to destroy it there. That’s two hidden moves inside std::swap, when we only need one moved element. A better idea than swapping is to move the end element down over the doomed slot (move assignment releases the old value), and then destroy the moved-from end element:
if (offset != count_ - 1) {
    // Move the end element down over the deleted slot
    arr_[offset] = std::move(arr_[count_ - 1]);
}
if (!std::is_trivially_destructible<T>::value) {
    arr_[count_ - 1].~T(); // Destroy the (moved-from) end element
}
count_--;
Note that std::move() here is only a compile-time type cast operation.
It will ensure that the move assignment operator is used on complex class types,
and is also efficient for scalar and other trivial types.
Yes, moving the end element to the middle of the unsorted array changes some addresses. It will certainly invalidate iterators over the container. But so would the shuffle of elements, so we’re okay there.
Note that this only works for an unsorted array data structure. If we did this on a sorted array, we’d ruin the sorting order in the array by moving the biggest element into the middle of the sequence. Sorted arrays need to do the shuffle.
One final point is that this fast deletion trick with swapping will break the unofficial ordering of the array by its insertion order. If we have timestamps associated with our array elements, swapping the end element into the middle will ruin that implicit ordering.