Aussie AI
Chapter 18. Data Compression
-
Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
-
by David Spuler, Ph.D.
What is Data Compression?
Data compression is a common efficiency requirement in low-level programming. This is the general class of algorithms that aim to make data smaller. Generally, there are two main phases:
- Compression — reducing data to a smaller size.
- Decompression — inflating the data to its original values.
There are two main classes of data compression algorithms:
- Lossless — the original data is fully restored.
- Lossy — the uncompressed data differs and is partially “lost” in some sense.
Obviously, lossless algorithms are preferable, but some types of lossy algorithms can achieve much better compression ratios. Hence, there is a trade-off between data size and accuracy of the uncompressed data.
The best example of lossy compression in the modern era is “quantization” in AI, which is a type of “model compression” technique. The billions of 32-bit numbers that are an LLM’s “weights” can be shrunk down to fewer bits, often as few as 4 bits each, while still retaining the general capabilities and knowledge of the original model. In this way, LLMs are much smaller to transport over the network, to store on disk, and also faster to run on CPUs or GPUs (less memory usage). Note that this is not a typical data compression algorithm, because quantized models are not “uncompressed” back to 32-bit numbers, but are used in their low-bit formats.
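As a rough illustration of the idea (not the method of any particular LLM toolkit), here is a sketch of symmetric quantization from 32-bit floats down to 8-bit integers. The names `quantize8` and `Quantized` are invented for this example; the scale factor maps the largest-magnitude weight onto the signed 8-bit range.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch: symmetric 8-bit quantization of float weights.
// The scale maps the largest-magnitude weight onto the int8 range [-127, 127].
struct Quantized {
    std::vector<int8_t> q;
    float scale;
};

Quantized quantize8(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    std::vector<int8_t> q;
    q.reserve(w.size());
    for (float x : w)
        q.push_back(static_cast<int8_t>(std::lround(x / scale)));
    return {q, scale};
}

// Dequantization, shown only for illustration -- as noted above,
// quantized models are typically used directly in their low-bit form.
float dequantize(int8_t v, float scale) { return v * scale; }
```

Each stored weight shrinks from 4 bytes to 1 byte, at the cost of rounding error bounded by half the scale factor.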
Related Data Algorithms
Data compression is somewhat related to other common programming tasks, but it is not the same as:
- Encoding — converting data to a simpler representation.
- Encryption — hiding the data with varying levels of difficulty.
- Hashing — mapping data to a hash value.
- Data compaction — using smaller data records.
- Data structures — organizing data for fast search or other operations.
Encoding and encryption of data are different requirements from compression. In fact, both encoding and encryption can increase the data load and slow things down, but they offer other advantages.
Encoding neither makes the data smaller nor hides it from prying eyes. The goal of encoding is to make it easier to use data, whereas data compression has the central aim of reducing size. There are some common encoding algorithms:
- Base64 (or UUencode) — transmit binary data over text-only streams using the subset of printable characters.
- UTF-8 — variable-width Unicode encoding that handles accented European letters and non-Latin scripts (superseding older Double-Byte Character Sets (DBCS)).
- Rot-13 — a trivial letter-substitution scheme, closer to obfuscation than real encryption.
Note that most of these encoding schemes actually increase the size rather than decrease it. Base64 and UUencode expand binary data by roughly a third, and UTF-8 uses multiple bytes for each non-ASCII character.
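To see where the size increase comes from, here is a minimal Base64 encoder sketch: every 3 input bytes become 4 output characters drawn from a 64-symbol printable alphabet, a 33% expansion.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Minimal Base64 encoder sketch. Every 3 input bytes are regrouped
// into four 6-bit values, each mapped to a printable character.
static const char kB64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

std::string base64_encode(const std::vector<uint8_t>& in) {
    std::string out;
    size_t i = 0;
    for (; i + 3 <= in.size(); i += 3) {
        uint32_t n = (in[i] << 16) | (in[i + 1] << 8) | in[i + 2];
        out += kB64[(n >> 18) & 63];
        out += kB64[(n >> 12) & 63];
        out += kB64[(n >> 6) & 63];
        out += kB64[n & 63];
    }
    // Handle a 1- or 2-byte tail with '=' padding.
    size_t rem = in.size() - i;
    if (rem) {
        uint32_t n = in[i] << 16;
        if (rem == 2) n |= in[i + 1] << 8;
        out += kB64[(n >> 18) & 63];
        out += kB64[(n >> 12) & 63];
        out += (rem == 2) ? kB64[(n >> 6) & 63] : '=';
        out += '=';
    }
    return out;
}
```

For example, the 3 bytes of "Man" encode to the 4 characters "TWFu".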
Encryption is a different task again, and it is not a form of data compression. Its purpose is to maintain the privacy and security of the data payload, and doing so often increases the size. Some well-known cryptographic algorithms include:
- SHA (Secure Hash Algorithm) — a family of cryptographic hash functions.
- MD5 (Message Digest 5) — an older hash function, now considered broken for security purposes.
- AES (Advanced Encryption Standard) — a symmetric block cipher for encrypting data.
Strictly speaking, SHA and MD5 are hash functions rather than encryption: they produce a fixed-size digest that cannot be reversed to recover the input. These algorithms have been used for tasks such as password hashing in the past, and SHA-family hashes are also central to Bitcoin mining and other “crypto” creations.
Low Latency Data Compression
When there’s too much data, you want to compress it. In low latency programming, some example situations where data compression algorithms can be useful include:
- Disk storage — e.g., compressing large volumes of historical data in files or databases.
- Network data transmission — e.g., sending data off to a different site for ML model processing.
The goals of data compression include:
1. Reduced storage space on disk or in memory, and/or
2. Faster network transmission
Note that the goal is not to make processing faster. In fact, to process the data later, you must first decompress it. Both compression and decompression are extra processing costs, so there is a trade-off to weigh against the benefits of data compression algorithms.
Trading algorithms involve processing a lot of data, often aggregated from multiple financial exchange locations. Full market data with deep order book details grows quickly in size, and compressing this data helps with historical storage, algorithmic analysis of the stored trading data, and backtesting.
Data Compression Algorithms
There is a long history of data compression algorithms from the 1970s and earlier. The lossless algorithms from that era include:
- Run-Length Encoding (RLE)
- Huffman coding
- Lempel-Ziv algorithm (LZ)
Run-length encoding is well-known and unsophisticated. It can do well if the data has long “runs” of the same value, which is relatively common in images, but not in text.
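A minimal RLE sketch makes the trade-off concrete: each run of identical bytes is stored as a (count, value) pair, so "aaaabbb" shrinks to two pairs, while text with no runs actually grows.

```cpp
#include <string>
#include <utility>
#include <vector>

// Run-length encoding sketch: each run of identical bytes becomes a
// (count, value) pair. Effective on long runs (common in images),
// ineffective on typical text.
std::vector<std::pair<int, char>> rle_compress(const std::string& s) {
    std::vector<std::pair<int, char>> runs;
    for (char c : s) {
        if (!runs.empty() && runs.back().second == c)
            ++runs.back().first;     // extend the current run
        else
            runs.push_back({1, c});  // start a new run
    }
    return runs;
}

std::string rle_decompress(const std::vector<std::pair<int, char>>& runs) {
    std::string out;
    for (const auto& [n, c] : runs) out.append(n, c);
    return out;
}
```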
Huffman coding is a more complicated data compression algorithm. It involves building a separate data structure that represents the bitwise encodings for different patterns. Huffman coding is quite successful as an algorithm, but its main downside is the need to convey the dictionary as a separate data structure before decoding can begin on the main compressed data.
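The core of Huffman coding can be sketched with a min-heap: repeatedly merge the two lowest-frequency nodes, then walk the resulting tree to assign bit strings (this mapping is the dictionary that must accompany the compressed data). This is an illustrative sketch, not a production encoder; it omits the bit packing and dictionary serialization.

```cpp
#include <map>
#include <memory>
#include <queue>
#include <string>
#include <vector>

// Huffman coding sketch: build the code tree from symbol frequencies,
// then assign shorter bit strings to more frequent symbols.
struct Node {
    int freq;
    char sym;                            // valid only for leaf nodes
    Node *left = nullptr, *right = nullptr;
};

struct Cmp {
    bool operator()(const Node* a, const Node* b) const {
        return a->freq > b->freq;        // min-heap on frequency
    }
};

void assign_codes(const Node* n, const std::string& prefix,
                  std::map<char, std::string>& codes) {
    if (!n->left && !n->right) {
        codes[n->sym] = prefix.empty() ? "0" : prefix;
        return;
    }
    assign_codes(n->left, prefix + "0", codes);
    assign_codes(n->right, prefix + "1", codes);
}

std::map<char, std::string> huffman_codes(const std::map<char, int>& freqs) {
    std::priority_queue<Node*, std::vector<Node*>, Cmp> heap;
    std::vector<std::unique_ptr<Node>> pool;   // owns all nodes
    for (const auto& [c, f] : freqs) {
        pool.push_back(std::make_unique<Node>(Node{f, c}));
        heap.push(pool.back().get());
    }
    if (heap.empty()) return {};
    while (heap.size() > 1) {
        Node* a = heap.top(); heap.pop();
        Node* b = heap.top(); heap.pop();
        pool.push_back(std::make_unique<Node>(Node{a->freq + b->freq, 0, a, b}));
        heap.push(pool.back().get());
    }
    std::map<char, std::string> codes;
    assign_codes(heap.top(), "", codes);
    return codes;
}
```

With frequencies {a:5, b:2, c:1}, the frequent symbol 'a' gets a 1-bit code while 'b' and 'c' get 2-bit codes.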
The Lempel-Ziv algorithm was a clever idea of overcoming Huffman coding’s main disadvantage by building a dictionary incrementally inside the compressed data, thereby alleviating the need to pre-send a separate data structure. Several variants of LZ data compression have been used:
- LZ77 — the original 1977 version.
- LZ78 — an improved version in 1978 by the original researchers.
- Lempel-Ziv-Welch (LZW) — a further improved and popular algorithm.
LZ77 is the original LZ algorithm from 1977, using a “sliding window” over the text. Each entry in the compressed file includes a character count and an “offset” back to a prior occurrence of the same text string. In this way, the explicit dictionary required for Huffman encoding is no longer needed, because the strings in the prior text are used as if they were a dictionary.
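The back-reference idea can be sketched as follows, emitting (offset, length, next-char) tokens. This is a simplified illustration: the search scans all prior text rather than a bounded sliding window, and the matching is naive rather than optimized.

```cpp
#include <string>
#include <tuple>
#include <vector>

// LZ77 sketch: each token is (offset, length, next-char), where offset
// and length point back into already-seen text, which acts as the
// implicit dictionary. Simplification: the whole prior text is searched
// instead of a fixed-size sliding window.
std::vector<std::tuple<int, int, char>> lz77_compress(const std::string& s) {
    std::vector<std::tuple<int, int, char>> tokens;
    size_t pos = 0;
    while (pos < s.size()) {
        int best_len = 0, best_off = 0;
        for (size_t start = 0; start < pos; ++start) {
            size_t len = 0;
            while (pos + len + 1 < s.size() &&      // reserve one "next" char
                   s[start + len] == s[pos + len])  // match may overlap pos
                ++len;
            if ((int)len > best_len) {
                best_len = (int)len;
                best_off = (int)(pos - start);
            }
        }
        tokens.push_back({best_off, best_len, s[pos + best_len]});
        pos += best_len + 1;
    }
    return tokens;
}

std::string lz77_decompress(const std::vector<std::tuple<int, int, char>>& tokens) {
    std::string out;
    for (const auto& [off, len, c] : tokens) {
        size_t start = out.size() - off;
        for (int i = 0; i < len; ++i) out += out[start + i];  // handles overlap
        out += c;
    }
    return out;
}
```

For "abcabcabcx", the second and third "abc" collapse into a single back-reference spanning six characters.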
The LZ78 variant is a modification of the LZ77 algorithm, which uses an explicit dictionary data structure. However, this dictionary is used internally by both the encoder and decoder, but does not need to be sent from the encoder to the decoder. Instead, the decoder can rebuild the dictionary in an incremental fashion as it decompresses the text.
The LZW algorithm became the most widely used data compression algorithm. It was used in numerous Unix tools and notably in the GIF image format.
The improved LZW algorithm modified the LZ78 algorithm to introduce faster handling of some failed-match sequences. Initially, it has a “predefined” dictionary of all possible symbols, which is used if a text string has no match in the normal LZ78 dictionary. For 8-bit data, there are 256 of these predefined symbols, being all the single-character strings. Both the encoder and decoder start with this initial dictionary.
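The encoder side of LZW can be sketched in a few lines. This illustration starts from the 256 predefined single-byte strings and emits integer codes; a real implementation would also pack the codes into variable-width bit strings, which is omitted here.

```cpp
#include <map>
#include <string>
#include <vector>

// LZW encoder sketch: both encoder and decoder start with the 256
// single-byte strings, so no dictionary is transmitted; new entries
// are added as longer matches are seen. Codes are plain ints here
// (no variable-width bit packing).
std::vector<int> lzw_compress(const std::string& s) {
    std::map<std::string, int> dict;
    for (int i = 0; i < 256; ++i)
        dict[std::string(1, (char)i)] = i;   // predefined dictionary
    int next_code = 256;
    std::vector<int> out;
    std::string w;                           // current longest match
    for (char c : s) {
        std::string wc = w + c;
        if (dict.count(wc)) {
            w = wc;                          // extend the current match
        } else {
            out.push_back(dict[w]);          // emit code for longest match
            dict[wc] = next_code++;          // grow the dictionary
            w = std::string(1, c);
        }
    }
    if (!w.empty()) out.push_back(dict[w]);
    return out;
}
```

Encoding "ABABAB" yields the codes 65, 66, 256, 256: after the first "AB" is seen, the new dictionary entry 256 replaces each later occurrence.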
The main downside of the LZW algorithm for modern computing is that it’s an inherently sequential algorithm, since both the decoder and encoder must incrementally build the dictionary from the data sequence. This means that it’s difficult to fully parallelize LZW encoding or decoding on multiple CPU threads or using GPUs.
Parallel Data Compression Algorithms
Many of the traditional data compression algorithms are difficult to parallelize. The problems that need to be overcome include:
- Inherently incremental algorithms (sequential nature)
- Variable-length bit strings
Incremental algorithms. Compression that must start at the beginning of the data stream is problematic for parallelization. An example of this is the “sliding window” approach in LZ77. The dictionary approach in LZ78 is also difficult to parallelize, because it requires a phase to construct the dictionary, before all sub-texts can be compressed.
Bit position offsets. It might seem that the Huffman algorithm could allow parallel decompression once the dictionary has been sent to the decoder. However, another problem arises: given a chunk of data to decode, how does the decoder know which bit in the first byte to start with? It could be any of the bit positions. The variable-length bit strings used by compression mean that this cannot be determined without processing from the beginning of the input stream.
Naive parallelism. You can parallelize all of these algorithms by doing a “restart” in every chunk, which is sometimes called a “segmented compression” algorithm. The LZW algorithm even has a “clear code” for exactly this purpose. But that’s not a very good method! I don’t see much difference between that idea and just splitting the data into a bunch of files, and then compressing each file. Furthermore, the code that’s working on each chunk is running an inherently sequential algorithm, rather than one that’s vectorized over SIMD or GPU kernels. Somehow that doesn’t seem optimal to me!
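The segmented approach can be sketched as below: split the input into fixed-size chunks and compress each chunk independently on its own thread (here with trivial RLE standing in for any sequential compressor; `segmented_compress` and `compress_chunk` are invented names). Each chunk restarts from scratch, so cross-chunk redundancy is lost, exactly the file-splitting trade-off described above.

```cpp
#include <string>
#include <thread>
#include <vector>

// Placeholder sequential compressor (trivial RLE: "aaab" -> "3a1b").
// Any chunk-restartable compressor could be substituted here.
std::string compress_chunk(const std::string& chunk) {
    std::string out;
    for (size_t i = 0; i < chunk.size();) {
        size_t j = i;
        while (j < chunk.size() && chunk[j] == chunk[i]) ++j;
        out += std::to_string(j - i) + chunk[i];
        i = j;
    }
    return out;
}

// "Segmented compression" sketch: each chunk is compressed independently
// on its own thread; each result also decompresses independently.
std::vector<std::string> segmented_compress(const std::string& data,
                                            size_t chunk_size) {
    size_t n_chunks = (data.size() + chunk_size - 1) / chunk_size;
    std::vector<std::string> results(n_chunks);
    std::vector<std::thread> workers;
    for (size_t i = 0; i < n_chunks; ++i) {
        workers.emplace_back([&, i] {
            results[i] = compress_chunk(data.substr(i * chunk_size, chunk_size));
        });
    }
    for (auto& t : workers) t.join();
    return results;
}
```

Note that the per-thread work is still an inherently sequential algorithm, as the text above points out; this gains thread-level parallelism but nothing at the SIMD or GPU-kernel level.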
Parallelizing LZW compression. The LZW algorithm suffers from both of these problems that limit parallelism. The encoder and decoder both build the dictionary incrementally from the start, limiting parallelism, and most LZW implementations use variable-length bit representations, which also depend on prior input data.
There are some parallelizations possible for an LZW data compression algorithm, but these are not inside the main data compression loop. Rather, the features that wrap around the main LZW logic can be separated. Parallel optimizations to the LZW algorithm include:
- Run the bit packing/unpacking in a separate thread from the encoder/decoder.
- Run the I/O in parallel to the encoding/decoding (i.e., file input, file output) by reading or writing chunks to files in separate threads.
Even this bit packing parallelization runs into problems and requires modification. There’s no difficulty if the LZW algorithm uses fixed-width 12-bit codes. However, variable-length LZW variants can change the bit positions, leading to difficulty with parallel packing and unpacking of chunks.
Chunk headers. One possible solution to the variable bit position issue is to extend the algorithm so that every chunk has a “header” of data. A simple version would have an extra two bytes that encode: (a) what bit position to start for the rest of the data, and (b) how many bits are being used for packing.
However, this simple two-byte header still has a problem: the number of bits for encoding numbers in LZW can increase at any point in the middle of a chunk. A more complex header is needed, such as the two bytes indicating the bit position and number of bits, along with another two bytes indicating how far along in the chunk until the bit sizes increase.
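Such a header might look like the struct below. The field layout here is illustrative, not a standard format: where decoding starts within the first byte, the current code width, and the byte offset at which the code width grows (with a sentinel meaning "no increase in this chunk").

```cpp
#include <cstdint>

// Illustrative per-chunk header for variable-width LZW (not a standard
// format). Four bytes total, matching the 2+2 byte scheme described above.
#pragma pack(push, 1)
struct ChunkHeader {
    uint8_t  start_bit;        // 0..7: first valid bit within byte 0
    uint8_t  code_bits;        // current LZW code width (e.g., 9..16)
    uint16_t width_change_at;  // byte offset where code_bits increases;
                               // 0xFFFF = no increase within this chunk
};
#pragma pack(pop)

static_assert(sizeof(ChunkHeader) == 4, "header must stay 4 bytes");
```

With this header prepended, a decoder thread can begin unpacking its chunk at the correct bit position without replaying the entire preceding stream.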
Newer parallel algorithms. This chapter has only scratched the surface of parallelizing data compression algorithms. It is perhaps unsurprising that older algorithms designed for sequential processing are difficult to parallelize. However, there is an immense body of research in this area, and some newer parallel data compression algorithms:
- Parallel gzip
- Zstandard
- Parallel bzip2
- Parallel LZ4
- Parallel LZMA
- BWT-based parallel Huffman (using the Burrows–Wheeler transform)
There are many options and many research papers to read.
References
- Michael Dipperstein, 2019, lzw: An ANSI C implementation of the LZW compression algorithm, https://github.com/MichaelDipperstein/lzw
- Mark R. Nelson, 1999, LZW data compression/expansion demonstration program, https://people.cs.pitt.edu/~kirk/cs1501/assignments/lzw/lzw.cxx
- Geeks for Geeks, 21 May, 2024, LZW (Lempel–Ziv–Welch) Compression technique, https://www.geeksforgeeks.org/computer-networks/lzw-lempel-ziv-welch-compression-technique/
- Wikipedia, June 2025 (accessed), LZ77 and LZ78, https://en.wikipedia.org/wiki/LZ77_and_LZ78
- Facebook, 2025, Zstandard - Fast real-time compression algorithm, https://github.com/facebook/zstd
- LZ4 contributors, 2025, LZ4, http://www.lz4.org/, https://github.com/lz4/lz4
- Dingwen Tao, 2019, GPU-Accelerated Lossless Compression Survey, https://github.com/dingwentao/GPU-lossless-compression
- Flanglet, 2025, Kanzi: Fast lossless data compression in C++, https://github.com/flanglet/kanzi-cpp
- Martin Vorbrodt, 2019, Extremely Fast Compression Algorithm, Vorbrodt’s C++ Blog, https://vorbrodt.blog/2019/03/22/extremely-fast-compression-algorithm/