Chapter 27. Perfect Hashing
-
Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
-
by David Spuler, Ph.D.
What is Perfect Hashing?
Perfect hashing is the extreme of hashing, where we guarantee that there are no collisions. The hash function is “perfect” because no two keys map to the same hash value. This makes for a super-fast hash lookup with guaranteed O(1) search performance, and no need to ever probe a second hash location.
Perfect hashing is faster than normal hash tables. Regular hashing is fast on average, with O(1) average search, but collision-resolution mechanisms like chaining or linear probing can have worst case O(n) search cost. Perfect hashing has guaranteed O(1) search complexity for best, average, and worst case. In fact, we don’t even code up a collision resolution method at all.
Unfortunately, the good news stops there, because this only works in a very special situation: where the set of keys is known at compile-time. The hash table can only contain a fixed set of keys that we know when we build the perfect hashing code.
Insertions and deletions invalidate the perfect hash function, and each one may force us to re-run and re-compile our perfect hashing engine. Thus, we can tolerate insertions and deletions, but only if they are rare. Some examples of rarely changing sets of strings we might want to look up with perfect hashing include:
- Special keywords in a programming language tokenizer (e.g., 100 reserved words).
- Common English words in a grammar checker (e.g., 1,000 basic words).
- Stock tickers on an exchange’s market data feed (e.g., about 5,000).
- Vocabulary words of an AI model (often 50,000 to 100,000 words).
Yes, the last two are a bit tricky: tickers might change daily, in which case we might need to re-run our perfect hashing in every overnight build, and finding a perfect hash function for 100,000 LLM vocabulary strings in a reasonable amount of time might be a struggle.
Disadvantages of Perfect Hashing
We already mentioned the main disadvantage of perfect hashing, which is that it requires a known set of keys, or at least a very rarely changing set of keys. Other disadvantages include:
- Cost to build — expensive to scan the search space to find a perfect hash map.
- Scalability problems — cannot handle a large number of keys because the search space becomes too large.
- Static data — insertions and deletions invalidate the hash map.
- Recomputations — increasing the key set requires a total re-run of the whole shemozzle.
Perfect hashing also has some of the disadvantages of a basic hash map like std::unordered_map, such as:
- Unsorted data
- Scanning all data is somewhat inefficient (and in unsorted order)
- Cache locality issues because objects are stored randomly in the hash table.
Perfect hashing is not perfect for every case. Some alternative data structures to consider for search lookup optimization include:
- Bloom filters
- Tries
- Automata (precomputed)
Or you could just put all your keys in an array and use a GPU to check them all in parallel.
Perfect Hash Functions
Special hashing algorithms can be used in any situation where the search data is known at compile-time. The most efficient solution is to use hashing with a specially developed hash function, designed to prevent all collisions. This is called a perfect hash function and can only be developed for unchanging data. If a perfect hash function can be found, the symbol table can be searched with one computation of the hash function and one key comparison to determine if the key is actually there at the index.
The most difficult aspect of using this method is the search for a perfect hash function for a particular set of data. There are a few common methods of doing so:
- Inspired guesswork
- Brute-force computation
- Use a perfect hashing tool (e.g., GNU gperf)
In some cases, the programmer can work out a function that has no collisions simply by guessing. For example, if the programmer notices that all keys have a different first letter, then it is easy to compute a perfect hash function as a mapping from the 26 letters to unique integers: the hash values. It’s a curious fact, unknown to most AI engineers, that humans are very resourceful, and this method of “guessing” the function works surprisingly well. However, the average human might have a little trouble with 100,000 distinct keys.
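For instance, here is a minimal sketch of the first-letter trick (the function name is mine, and it assumes lowercase ASCII keys with distinct first letters):
int first_letter_hash(const char* s)
{
    return s[0] - 'a'; // maps to 0..25; assumes distinct lowercase first letters
}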
The brute-force approach uses a computer to generate the hash function: the program tries a number of different hash functions of a particular meta-pattern, applies each candidate to every key, and reports when it finds a function that produces no collisions.
Further Optimizations of Perfect Hashing
The general complexity of perfect hashing is O(1), which is true of the best case, average, and worst case complexity. Hence, it’s fast for large sizes, but we still might want to optimize it a little more! There are two places to try to speed up:
- Lookup function (online)
- Perfect hash function creation (offline)
The basic method of perfect hashing can be optimized so that lookup is even faster. Some of the ways that we might super-optimize the search phase of the perfect hash function include:
- Not checking that the key is present.
- Using a power-of-two hash table size.
- Larger hash table size.
Avoiding string comparisons. The computation sequence for a perfect hash lookup goes like this:
1. Calculate the perfect hash function.
2. Find that location in the hash table.
3. Compare the string at that location with our search key.
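As a minimal sketch, that whole lookup is only a few lines of code; the names perfect_hash, g_table, and TABLE_SIZE here are hypothetical stand-ins for the generated function and its precomputed table:
#include <string.h>
#define TABLE_SIZE 134 // table size (assumed; produced by the generator)
extern int perfect_hash(const char* key); // generated offline (hypothetical)
extern const char* g_table[TABLE_SIZE];   // precomputed keys; NULL marks an empty slot
bool perfect_lookup(const char* key)
{
    int h = perfect_hash(key);                       // 1. compute the perfect hash
    const char* entry = g_table[h];                  // 2. index the hash table
    return entry != NULL && strcmp(entry, key) == 0; // 3. one string comparison
}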
But why are we doing this string comparison at the end? That’s quite slow. Well, sometimes we don’t need to, and it depends on context. For a grammar checker or LLM tokenizer, we need to detect whether or not the key is there, because multiple words could map to the same hash location.
On the other hand, a market data feed from a US stock exchange might only contain our set of ticker names, so we can assume that only one string could possibly be at the hash table location. In other words, we’re assuming that every string is found, and there are zero failed searches, so our hash table is a mapping from the string to a set of data structures (e.g., our order book for that stock). That’s all fine, and it will go faster, but the code will break completely if the exchange adds a new stock ticker!
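A sketch of that trust-the-feed shortcut looks like this, with hypothetical names (perfect_hash, g_books, OrderBook):
struct OrderBook;                         // hypothetical per-stock data structure
extern int perfect_hash(const char* key); // generated offline (hypothetical)
extern OrderBook* g_books[];              // one entry per possible ticker
OrderBook* lookup_book(const char* ticker)
{
    return g_books[perfect_hash(ticker)]; // no strcmp: trusts the feed completely
}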
Another way we could avoid the string comparison is to use two or more perfect hash functions. This data structure is known as a Bloom filter, and combines multiple bit vectors with multiple hash functions. Bloom filters are a probabilistic data structure: they can confirm with 100% certainty that a key is absent, but can only report that a key is likely to be present, not that it certainly is.
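Here is a minimal Bloom filter sketch, assuming two hypothetical hash functions hash1 and hash2 over a single shared bit vector (real filters tune the vector size and the number of hash functions):
#include <bitset>
extern unsigned hash1(const char* key); // hypothetical hash function #1
extern unsigned hash2(const char* key); // hypothetical hash function #2
const int NBITS = 1024;                 // bit vector size (assumed)
std::bitset<NBITS> g_bits;
void bloom_insert(const char* key)
{
    g_bits.set(hash1(key) % NBITS);
    g_bits.set(hash2(key) % NBITS);
}
bool bloom_maybe_present(const char* key) // false means definitely absent
{
    return g_bits.test(hash1(key) % NBITS)
        && g_bits.test(hash2(key) % NBITS);
}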
Power-of-two hash table size. The size of the array that is our hash table is one of the parameters for a perfect hash function, so we have some control over it. Note this basic point: the hash table size must be at least the number of keys, or else it’s a little hard to avoid collisions! In fact, it’s easier to find a perfect hash function if the size is significantly more than the number of keys, so that there are some empty slots.
But what size? For some reason lost in the mists of time, everyone wants to choose a prime number, preferably a Mersenne prime, because that supposedly makes hash maps more evenly spread. But in the case of perfect hashing, we are looking for exact mappings with zero collisions, so it’s perhaps not so important to use a prime.
Instead, we should use a power-of-two hash table size, because that allows the arithmetic in our perfect hash function to be faster. The reason is that most perfect hash functions look like this:
offset = some_big_number(key) % N;
The % remainder operator is extremely slow, even on integers. The only reason it is used here is to ensure that the hash function maps to the range 0 to N-1, where N is the hash table size.
We can use “strength reduction” to replace it with a faster arithmetic operation, such as:
- Bitwise-and (&) operator — if N is a power-of-two (e.g., for x%16 do x&15).
- Type cast to unsigned char — if N is 256 (8 bits).
- Type cast to unsigned short — if N is 65,536 (16 bits).
- Overflow of unsigned char — if N is 256 (8 bits).
- Overflow of unsigned short — if N is 65,536 (16 bits).
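As a rough sketch of these replacements (raw_hash is a hypothetical function producing the unreduced hash value):
extern unsigned raw_hash(const char* key); // hypothetical unreduced hash

unsigned reduce_pow2(const char* key)  { return raw_hash(key) & (1024u - 1); }   // N = 1024
unsigned reduce_256(const char* key)   { return (unsigned char)raw_hash(key); }  // N = 256
unsigned reduce_65536(const char* key) { return (unsigned short)raw_hash(key); } // N = 65,536

unsigned reduce_overflow(const char* key) // N = 256 by letting an 8-bit sum wrap
{
    unsigned char h = 0; // overflow wraps modulo 256 automatically
    for (const char* p = key; *p != 0; p++)
        h += (unsigned char)*p;
    return h;
}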
We’ve already examined a lot of these optimizations to modulo arithmetic in detail for the discussion of ring buffers in Chapter 21.
Larger hash table size. An important point about hash table sizes is that bigger can be better. This is true for both the offline computation of the perfect hash function and the online search lookup. Bigger hash tables have more “gaps”, making for an easier search space in which to find a solution. In terms of online search performance, a bigger table worsens cache locality, but cache locality is not likely to be great for a hash table anyway. Furthermore, these extra gaps also mean that unsuccessful searches will be faster on average, because keys that map to a gap can avoid the string comparison at the end. And memory is cheap, after all.
Offline search optimizations. The search for a perfect hash function can be very expensive, and even impossible. Some of the ways to speed things up include:
- All of the hash function optimizations.
- Splitting up the search space (partitioning).
The first point is that any optimization to the perfect hash function computation applies a thousand-fold to the offline search. For example, restricting the search to power-of-two table sizes makes the hash computations inside the offline search faster too. In fact, our offline code performs far more of those computations than the online lookup ever will.
Search space partitioning optimizations. The search space is combinatorial and explodes with large key sets. One approach is to split the keys into multiple perfect hash tables, such as by partitioning the key sets. Some of the ways to consider partitioning include:
- First letter — we can use 26 different perfect hash tables.
- Two letters — this gives 26*26=676 separate hash tables.
- Length of keys — e.g., stock tickers are at most 5 letters long.
- Preliminary hash — a simple hash function to start with (e.g., first two letters modulo a size smaller than 676).
Note that this means running the perfect hash engine multiple times to find a different perfect hash function for each partitioned set of keys. However, running 26 searches for smaller sets of keys will often run faster overall than trying to find one super-perfect hash function for every single key.
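A sketch of how a first-letter partitioned lookup might be wired up (all names hypothetical, assuming lowercase ASCII keys):
#include <string.h>
typedef int (*hash_fn)(const char*);
extern hash_fn g_part_hash[26];       // 26 separately generated perfect hash functions
extern const char** g_part_table[26]; // 26 separate hash tables of keys
bool partitioned_lookup(const char* key)
{
    int p = key[0] - 'a';               // pick the partition by first letter
    int h = g_part_hash[p](key);        // that partition's own perfect hash
    const char* e = g_part_table[p][h];
    return e != NULL && strcmp(e, key) == 0;
}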
Example: ANSI C Keywords
As an example of the various approaches, let us attempt to develop a perfect hash function for the set of 32 ANSI C keywords for a programming language tool:
auto break case char
const continue default do
double else enum extern
float for goto if
int long register return
short signed sizeof static
struct switch typedef union
unsigned void volatile while
Using my own version of “inspired guesswork”, involving a couple of hours of poring over ASCII tables, I managed to come up with a reasonable perfect hash function. The basic approach I took was to break up the words into groups of about five keys by testing the string length, and also by making single-character comparisons on the larger groups of keys with the same length. Once the group was small enough, I looked for letters in the keys that were unique, often the first or second letter, and then examined the ASCII binary values of these letters. This way, the hash function extracts certain bits from each letter and generates a small integer, which is then mapped into an “interval” of values for that particular group. The function, which produces hash values in the range 0..37, is as follows:
int my_hash(char* s)
{
    switch (strlen(s)) {
    case 2: // Only "if" and "do"
        return (s[0] & 01) + 2; // 2..3
    case 3: // Only "for" and "int"
        return (s[0] & 01) + 8; // 8..9
    case 4:
        if (s[1] == 'o') // goto, long, void
            return (s[0] & 03) + 26; // 26..29
        else // auto, case, char, else, enum
            return ((s[1] & 14) >> 1) + 30; // 30..37
    case 5: // break, const, float, short, union, while
        // First letter is unique
        return (s[0] & 07) + (s[0] == 'c') + 10; // 12..17
    case 6:
        if (s[0] == 's') // signed, sizeof, static, struct, switch
            return (s[5] & 03) + ((s[5] & 8) >> 3)
                + ((s[5] & 16) >> 2) + 18; // 18..22
        else // First letter not 's' - double, return, extern
            return (s[0] & 03) + 23; // 23..25
    case 7: // "typedef", "default"
        return (s[0] & 16) != 0; // 0..1
    case 8: // continue, register, unsigned, volatile
        // First letter is unique
        return ((s[0] & 04) >> 1) + (s[0] & 01) + 4; // 4..7
    default: // Can't be a C keyword
        return 0; // Pick any number
    }
}
The second approach is to make the computer perform a brute-force search for a perfect hash function. The following program takes a set of keys from a file and develops a hash function of the following form:
( Σ C[i] * key[i] ) mod N
The code attempts brute-force computations with many combinations of the constants C[i] and N. If one of these hash functions produces no collisions, a perfect hash function has been found. The source code below implements this concept.
//-----------------------------------------------------------------
// PERFECT HASH FUNCTION BRUTE-FORCE SEARCH
//-----------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
//-----------------------------------------------------------------
#define MAX 1000 // Maximum number of words (value assumed; missing from excerpt)
#define LEN 10 // Maximum length of a word
//-----------------------------------------------------------------
char words[MAX][LEN]; // words being hashed
int C[LEN]; // coefficients of hash function
//-----------------------------------------------------------------
#define MAX_MULTIPLIER 1 // Let C[i] range 0..MAX_MULTIPLIER
// 0 means skip, 1 --> use addition
#define MAX_MODULUS 1000
int G_MODULUS;
int G_MODULUS_START_MULTIPLIER = 5;
int G_MODULUS_TOP;
//-----------------------------------------------------------------
// Apply the hash function coefficients to a key
//-----------------------------------------------------------------
int compute_hash_perfect(char* s, int modulus)
{
    unsigned int hash = 0;
    for (int i = 0; i < LEN && s[i] != 0; i++) {
        hash += s[i] * C[i];
    }
    return hash % modulus;
}
//-----------------------------------------------------------------
// Try all the combinations of coefficients
// This function finds the perfect hash function!
//-----------------------------------------------------------------
void perfect_hash_find_best(int nwords, int nstart)
{
    bool done = false;
    bool flags[MAX_MODULUS]; // has a key hashed here yet?
    int modulus = nstart * G_MODULUS_START_MULTIPLIER;
    do {
        // Do one possible modulus (table size)
        for (int i = 0; i < LEN; i++) C[i] = 0; // Clear coefficients
        do {
            // Update C[i] coefficients for next attempt
            C[0]++;
            for (int i = 0; i < LEN; i++) {
                if (C[i] <= MAX_MULTIPLIER) break;
                C[i] = 0;
                if (i + 1 < LEN) { C[i + 1]++; }
            }
            memset(&flags, 0, sizeof flags);
            // Scan all strings to count collisions...
            bool collision = false;
            for (int num = 0; num < nwords; num++) {
                int val = compute_hash_perfect(words[num], modulus);
                if (flags[val]) {
                    collision = true;
                    break;
                }
                flags[val] = true;
            }
            if (!collision) { // report success!!
                printf("NO COLLISION: ");
                for (int i = 0; i < LEN; i++) {
                    printf("%2d ", C[i]);
                }
                printf(", MODULUS = %d ", modulus);
                if (modulus == nstart) printf(" PERFECT!!! (n=%d)", (int)nstart);
                printf("\n");
                break; // exit do loop. Do next MODULUS
            }
            done = true; // Finish only when all multipliers reach MAX_MULTIPLIER
            for (int i = 0; i < LEN; i++) {
                if (C[i] < MAX_MULTIPLIER) {
                    done = false;
                    break;
                }
            }
        } while (!done);
        if (done) {
            printf("FAILED with MODULUS %d\n", modulus);
        }
        modulus--; // Try the next (smaller) modulus value
    } while (modulus >= nstart);
}
As shown in the source code above, the program is set to find all hash functions where each coefficient is either 0 or 1. These functions are a useful special case, as no multiplications are actually needed (all the characters with a 1 coefficient are simply added). When the program is run on the ANSI C keywords as input, the best hash function it produces has modulus 134 (i.e., hash table size 134) and the following coefficients:
NO COLLISION: 1 0 1 1 1 1 1 0 0 0 , MODULUS = 134
This information can be coded up into a simple perfect hash function. Unfortunately, the memset and strncpy calls are necessary to ensure that characters beyond the end of the string are considered zero, as is assumed by the hash function generator.
//----------------------------------------------------------
// Computer-generated addition hash function for C keywords
//----------------------------------------------------------
int computer_hash(char* s)
{
    char s2[10];
    memset(s2, 0, 7); // zero the first 7 letters
    strncpy(s2, s, 7); // copy up to 7 letters
    return ((int)s2[0] + (int)s2[2] + (int)s2[3]
        + (int)s2[4] + (int)s2[5] + (int)s2[6]) % 134;
}
This is not a minimal perfect hash function for these 32 keys. If the records to be stored with these keys are quite large, the space wastage of 134 hash table entries may be too great. A simple method of overcoming this is to add an array of 134 small integers (i.e., using the char type), where each entry maps a C keyword’s hash location to a unique value in the range 0..31. On the other hand, this may be a de-optimization, as a sparse hash table can be more efficient than a minimal perfect hash function. If the table is large, it becomes likely that an unsuccessful search will map to a location containing a null pointer entry, which avoids the need for the key comparison.
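A sketch of that remapping idea, where g_remap is a hypothetical 134-entry table filled offline so that g_remap[computer_hash(kw)] is a unique id for each keyword:
extern signed char g_remap[134]; // dense ids 0..31 for keywords; -1 marks empty slots
int keyword_id(char* s)
{
    return g_remap[computer_hash(s)]; // 0..31 for a keyword, -1 otherwise
}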
Perfect Final Thoughts
The hash functions we found here are not minimal perfect hash functions. If the stars align, you can sometimes find a mapping that works with the hash table size exactly equal to the number of keys. It might take a lot of CPU juice to find one, though. Good luck with that!
All of the hash functions in this section (both human and computer-generated) have multiple limitations, such as:
- ASCII-specific — not portable to the EBCDIC set or other character sets.
- Little endian — I haven’t checked portability to big endian machines.
Finally, if you’d rather use a tool for perfect hashing than have as much fun as I just did, you can use the GNU gperf tool, which is a perfect hash function generator. GNU gperf will output the perfect hash function in C++ for you, and is highly customizable.
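As a rough sketch of the workflow (the file name is hypothetical), a minimal gperf input file simply lists the keywords between %% delimiters:
%%
auto
break
case
char
const
continue
(and so on for the rest of the 32 keywords). Running gperf keywords.gperf > keywords.c then emits a generated hash function plus an in_word_set lookup routine.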
Extensions
- Generalize the perfect hash functions to use parallel arithmetic in the hash function computation, such as AVX or ARM Neon SIMD instructions on a CPU, or kernel calculations on a GPU.
- Parallelize the search for a perfect hash function on either a CPU (e.g., AVX or ARM Neon functions) or on a GPU (e.g., in CUDA C++).
- Implement multiple perfect hash functions on the same set of keys to get a Bloom filter data structure, where the string comparison can be omitted during lookup.
- Try out the GNU gperf tool for one of the data sets.