Aussie AI

Model Compression

  • Last Updated 29 August, 2025
  • by David Spuler, Ph.D.

Model compression is the general class of AI optimizations that reduce the size of the model. The goal is two-fold: (a) size reduction: a smaller model that uses less memory and storage, and (b) latency optimization: faster inference on the more compact model.

Model compression techniques have been highly successful and are widely used, second only to hardware acceleration in their impact on the AI industry. The main model compression techniques are:

  • Quantization
  • Pruning
  • Knowledge distillation
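As a concrete illustration of the first technique, here is a minimal sketch of symmetric int8 quantization using a per-tensor "absmax" scale. This is an illustrative toy, not any particular library's implementation; real quantization schemes use per-channel scales, zero-points, and calibration data.

```python
def quantize_int8(weights):
    """Map float weights to int8 values using a per-tensor absmax scale."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]  # each fits in [-127, 127]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights for use during inference."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(weights)
approx = dequantize(q, s)
# Each int8 weight needs 1 byte instead of 4 bytes for float32,
# a roughly 4x size reduction at the cost of a small rounding error.
```

The size reduction comes directly from the narrower integer type; the latency benefit comes from integer arithmetic being cheaper than floating-point on most hardware.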

There are also various lesser-known model compression methods:

Survey Papers on Model Compression

General surveys that cover model compression include:

Research on Model Compression (Generally)

Research papers on model compression:

KV Caching and Model Compression

There are several analogous model compression optimizations for KV cache data. Read more about these KV cache research areas:

Data Compression

Data compression refers to the use of classic lossless bitstream compression algorithms to make LLMs smaller, including methods such as:

  • Huffman coding
  • Run-length encoding
  • LZW compression
  • Zip file formats

One area where data compression is particularly relevant is sparse models: for example, run-length encoding can store the number of zeros between non-zero values instead of storing each zero individually.
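The run-length idea above can be sketched in a few lines: store (zero-run, value) pairs rather than the full weight vector. This is a toy illustration of the encoding scheme, not a production sparse format.

```python
def rle_encode_sparse(weights):
    """Encode a sparse vector as (zeros_before, value) pairs.

    Trailing zeros are returned separately so the decoder can
    reconstruct the exact original length.
    """
    pairs = []
    zeros = 0
    for w in weights:
        if w == 0:
            zeros += 1
        else:
            pairs.append((zeros, w))
            zeros = 0
    return pairs, zeros  # zeros is the trailing-zero count

def rle_decode_sparse(pairs, trailing_zeros):
    """Expand the (zeros_before, value) pairs back to a dense vector."""
    out = []
    for zeros, value in pairs:
        out.extend([0] * zeros)
        out.append(value)
    out.extend([0] * trailing_zeros)
    return out

sparse = [0, 0, 3, 0, 0, 0, 5, 0]
pairs, tail = rle_encode_sparse(sparse)
# pairs == [(2, 3), (3, 5)], tail == 1
assert rle_decode_sparse(pairs, tail) == sparse
```

The more zeros the model has (i.e., the sparser it is after pruning), the fewer pairs are stored and the greater the compression, which is why these classic encodings pair naturally with sparsity-based model compression.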

Research papers on data compression with LLMs:

More Model Compression Research

Read more about:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging