Aussie AI Blog

What's Hot in LLM Inference Optimization in 2025?

  • March 3rd, 2025
  • by David Spuler, Ph.D.

Inference Optimization News in 2025

Surprisingly, 2025 started with a huge focus on LLM efficiency, triggered by the release of DeepSeek's R1 advanced reasoning model, which outpaced the frontier models on many benchmarks at a fraction of the cost.

DeepSeek Efficiency Advancements

Although the reported cheap training cost of DeepSeek R1 was what put tremors through NVIDIA's stock price, the DeepSeek architecture improvements spanned several categories, including a number of advancements to inference efficiency in reasoning models.

Some of these algorithms first appeared in the earlier V3 model and were then carried over to the R1 reasoning model. Read more here: DeepSeek's research advancements.

Neural Magic Acquisition

Another industry change related to inference efficiency, one that received a lot less attention, was the acquisition of Neural Magic, a Boston-based AI inference startup, by Red Hat and IBM in late 2024. As one of the few pure-play inference optimization startups, Neural Magic raised over $50m in venture capital and has now exited (price undisclosed). With a focus on the software stack, especially dynamic sparsity, they achieved significant advances in inference efficiency.

New Research on Reasoning Efficiency

Chain-of-Thought efficiency. The rise of reasoning models from several major AI companies (not just DeepSeek) has led to a rash of follow-on papers on improving the efficiency of LLM reasoning algorithms. Reasoning models, especially those using multi-step inference, are very costly to run.

In particular, Chain-of-Thought tends to emit a huge number of tokens as the model "talks to itself" to achieve good reasoning results. This is true both in single-step CoT "long answer" models (e.g., DeepSeek R1) and in multi-step CoT versions with "test-time compute" (e.g., OpenAI's o3 and Google's Gemini reasoning models). Hence, a number of techniques have been tested in the research to reduce the token processing cost of CoT, ranging from high-level changes to the CoT algorithm, such as skipping steps and pruning redundant reasoning paths (in multi-step CoT), down to lower-level optimizations such as token compression.
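
To make the step-pruning idea concrete, here's a minimal Python sketch of one simple cost control for multi-step CoT: sample reasoning paths one at a time and stop as soon as enough of them agree on an answer, rather than always paying for the full self-consistency budget. The generate_step and extract_answer callables are hypothetical placeholders for your own model calls, not any particular library's API.

    from collections import Counter

    def early_stop_self_consistency(generate_step, extract_answer, prompt,
                                    max_paths=8, agree_threshold=3):
        """Sample CoT reasoning paths one at a time, stopping early once
        `agree_threshold` paths agree on the same final answer."""
        votes = Counter()
        paths_used = 0
        for _ in range(max_paths):
            chain = generate_step(prompt)    # one sampled chain-of-thought
            answer = extract_answer(chain)   # final answer parsed from the chain
            votes[answer] += 1
            paths_used += 1
            if votes[answer] >= agree_threshold:
                break                        # consensus reached: skip the remaining paths
        best_answer, _ = votes.most_common(1)[0]
        return best_answer, paths_used

On easy queries this stops after a handful of paths, which is exactly the kind of token saving the CoT efficiency papers are chasing.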

In general, almost all of the 500+ LLM optimization techniques could theoretically be applied to the special case of reasoning algorithms, but so far only a small fraction of these methods have been tested with Chain-of-Thought.

Small Reasoning Models. Another related area of research is "Small Reasoning Models," which is the use of smaller models with reasoning capabilities. This has been approached in various ways:

  • Multi-step CoT algorithms wrapped around smaller base models.
  • Improved training and fine-tuning of reasoning techniques applied to small models.
  • Distillation of small models from Large Reasoning Models (see the sketch after this list).
  • Model compression of larger reasoning models (e.g. quantization of DeepSeek R1).
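
For the distillation bullet above, here's a minimal sketch (assuming PyTorch, and assuming the teacher and student share a tokenizer) of the standard soft-label distillation loss that a small reasoning model can be trained with over teacher-generated reasoning traces. It illustrates the general recipe only, not any particular lab's exact procedure, and the teacher/student/batch names in the usage comment are hypothetical.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """Soft-label distillation: KL divergence between the teacher's and
        student's next-token distributions, softened by a temperature.
        Both tensors have shape (batch, seq_len, vocab_size) and are assumed
        to come from models that share a tokenizer."""
        t = temperature
        student_logprobs = F.log_softmax(student_logits / t, dim=-1)
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        # Scale by T^2, as is standard practice in distillation.
        return F.kl_div(student_logprobs, teacher_probs,
                        reduction="batchmean") * (t * t)

    # Hypothetical usage over a batch of teacher-generated reasoning traces:
    #   with torch.no_grad():
    #       teacher_logits = teacher(batch_ids).logits
    #   student_logits = student(batch_ids).logits
    #   loss = distillation_loss(student_logits, teacher_logits)
    #   loss.backward()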

Endlessly Hot Research Areas

There continues to be an endless flow of research papers on these LLM efficiency optimizations:

Quantization: Ever the hottest area, quantization is now widespread throughout industry, not just in research labs, and new low-bit variants keep appearing in the research literature.
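
For readers new to the area, the core of most post-training quantization schemes is just mapping floats onto a small integer grid. Here's a minimal NumPy sketch of symmetric per-tensor INT8 weight quantization; real systems use per-channel or group-wise scales and lower bit widths, but the idea is the same.

    import numpy as np

    def quantize_int8(weights):
        """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
        scale = max(float(np.abs(weights).max()) / 127.0, 1e-12)
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q, scale):
        """Recover approximate float weights from the INT8 grid."""
        return q.astype(np.float32) * scale

    # Example: quantization error on a random weight matrix.
    w = np.random.randn(256, 256).astype(np.float32)
    q, scale = quantize_int8(w)
    print("mean abs error:", np.abs(w - dequantize_int8(q, scale)).mean())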

Speculative decoding: Improvements to speculative decoding have included more accurate draft models and better use of multi-machine "distributed speculative decoding." In another extension, prompt lookup decoding has been generalized to look beyond the current prompt to the entire history of prompts across multiple queries. (See the blog article "What's New in Speculative Decoding?")
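
Prompt lookup decoding is one of the simplest drafting schemes to illustrate: instead of a separate draft model, the draft tokens are copied from the most recent place where the current n-gram already appeared in the context (or, in the generalized version, in earlier prompts too). Here's a minimal sketch over plain token-ID lists; the function name and defaults are illustrative, not from any particular implementation.

    def prompt_lookup_draft(context_ids, ngram_size=3, num_draft=5):
        """Propose draft tokens by matching the last `ngram_size` tokens of the
        context against an earlier occurrence in the same context, and copying
        the tokens that followed it. `context_ids` is a plain list of token IDs
        (the prompt plus tokens generated so far); the target model then
        verifies the returned draft in one batched forward pass."""
        if len(context_ids) < ngram_size + 1:
            return []
        tail = context_ids[-ngram_size:]
        # Search backwards for the most recent earlier match of the tail n-gram.
        for start in range(len(context_ids) - ngram_size - 1, -1, -1):
            if context_ids[start:start + ngram_size] == tail:
                return context_ids[start + ngram_size:start + ngram_size + num_draft]
        return []

    # The n-gram [4, 5, 6] appeared earlier, so its continuation is drafted.
    print(prompt_lookup_draft([1, 2, 3, 4, 5, 6, 9, 8, 7, 4, 5, 6]))  # [9, 8, 7, 4, 5]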

KV cache optimizations: KV cache compression remains hot because the KV cache is the main memory bottleneck for long context processing. In fact, we're now into the realm of "ultralong" contexts of over 1M tokens, which cannot be handled without KV cache optimizations. Quantization of KV cache data to INT4 or INT2 is now common, but there's also research on numerous other ways to compress the KV cache, such as token pruning and head fusion.
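
As a sketch of what low-bit KV cache quantization looks like, here's a simplified NumPy version of group-wise asymmetric 4-bit quantization of a cache tensor. Real implementations also pack two 4-bit codes per byte and fuse the dequantization into the attention kernel, but the numerics are the same.

    import numpy as np

    def quantize_kv_int4(kv, group_size=32):
        """Asymmetric 4-bit group-wise quantization of a KV cache tensor of
        shape (tokens, head_dim), with head_dim divisible by group_size.
        Each group shares one scale and one zero point, which is what makes
        4 bits tolerable for attention accuracy."""
        tokens, head_dim = kv.shape
        groups = kv.reshape(tokens, head_dim // group_size, group_size)
        lo = groups.min(axis=-1, keepdims=True)
        hi = groups.max(axis=-1, keepdims=True)
        scale = np.maximum((hi - lo) / 15.0, 1e-8)   # 4 bits -> levels 0..15
        q = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
        return q, scale, lo

    def dequantize_kv_int4(q, scale, lo):
        """Reconstruct approximate float values from the 4-bit codes."""
        return (q.astype(np.float32) * scale + lo).reshape(q.shape[0], -1)

    # Example: the key cache for 1024 tokens of one attention head (head_dim=128).
    k_cache = np.random.randn(1024, 128).astype(np.float32)
    q, scale, lo = quantize_kv_int4(k_cache)
    print("mean abs error:", np.abs(k_cache - dequantize_kv_int4(q, scale, lo)).mean())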

Generally speaking, almost all of the LLM model compression optimizations (to model parameters or activations) can be re-applied in the new realm of the KV cache, so there's no shortage of future paper topics.

Fused KV caching. My favorite area, and one that I feel is certain to break out, is the generalization of prefix KV caching to non-prefix strings by concatenating precomputed KV caches and making adjustments. Recent research has examined the accuracy problems and improved the handling of positional encoding adjustments and cross-chunk attention; see substring/fused/concatenated KV cache. Yes, it needs a better name!
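
To make the idea concrete, here's a simplified NumPy sketch of fusing two independently precomputed KV chunks, under the assumption that keys are cached before rotary position embedding (RoPE) so that RoPE can be applied at fusion time with the correct global positions. That's only one of several ways the positional adjustment is handled in the literature, and it deliberately ignores the cross-chunk attention correction, which is where much of the accuracy research is focused.

    import numpy as np

    def rope_rotate(x, positions, base=10000.0):
        """Apply rotary position embedding (RoPE) to `x` of shape
        (tokens, head_dim) at the given absolute positions."""
        tokens, head_dim = x.shape
        half = head_dim // 2
        freqs = base ** (-np.arange(half) / half)         # (half,)
        angles = positions[:, None] * freqs[None, :]      # (tokens, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, :half], x[:, half:]
        return np.concatenate([x1 * cos - x2 * sin,
                               x1 * sin + x2 * cos], axis=-1)

    def fuse_kv_chunks(chunks):
        """Concatenate per-chunk KV caches (keys stored un-rotated) into one
        cache, applying RoPE to the keys at their new global positions.
        `chunks` is a list of (keys, values) arrays, each (tokens_i, head_dim)."""
        keys_raw = np.concatenate([k for k, _ in chunks], axis=0)
        values = np.concatenate([v for _, v in chunks], axis=0)
        positions = np.arange(keys_raw.shape[0], dtype=np.float64)
        return rope_rotate(keys_raw, positions), values

    # Example: fuse caches precomputed separately for two document chunks.
    chunk_a = (np.random.randn(100, 64), np.random.randn(100, 64))
    chunk_b = (np.random.randn(50, 64), np.random.randn(50, 64))
    k, v = fuse_kv_chunks([chunk_a, chunk_b])
    print(k.shape, v.shape)   # (150, 64) (150, 64)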

New attention algorithms: Another way to achieve long context processing is to alter the attention algorithm itself. Every paper on this topic, and there are many, chooses a new name for its approach, so there are numerous new candidates.
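
One long-standing and easy-to-illustrate member of this family is sliding-window (local) attention, where each token attends only to a fixed window of recent tokens, turning the quadratic attention cost into a linear one. Here's a minimal single-head NumPy sketch (illustrative, not any specific paper's variant).

    import numpy as np

    def sliding_window_attention(q, k, v, window=256):
        """Local (sliding-window) attention: each query token attends only to
        the `window` most recent key positions (causal), so attention cost is
        linear in sequence length rather than quadratic.
        q, k, v each have shape (seq_len, head_dim)."""
        seq_len, head_dim = q.shape
        out = np.zeros_like(q)
        for i in range(seq_len):
            start = max(0, i - window + 1)
            scores = q[i] @ k[start:i + 1].T / np.sqrt(head_dim)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            out[i] = weights @ v[start:i + 1]
        return out

    # Example: 1,000 tokens with a 256-token local window.
    q = np.random.randn(1000, 64)
    k = np.random.randn(1000, 64)
    v = np.random.randn(1000, 64)
    print(sliding_window_attention(q, k, v).shape)   # (1000, 64)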

We seem to be stuck at version 3 of FlashAttention, which was released eons ago in July 2024. There also hasn't been much added recently to the literature on PagedAttention. Some of the newer attention algorithms have been combined with these two major memory-efficient attention techniques, but there hasn't been a core advance in either one lately.
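
For readers wondering why FlashAttention matters so much, the key trick is the online softmax: attention can be computed tile by tile with running maxima and running denominators, so the full seq_len-by-seq_len score matrix is never materialized in memory. Here's a minimal single-head NumPy sketch of that recurrence; the real kernels fuse it into GPU tiles, so this is an illustration of the idea, not the actual kernel code.

    import numpy as np

    def online_softmax_attention(q, k, v, tile=128):
        """Exact single-head attention computed over key/value tiles using the
        online-softmax recurrence, so only one (seq_len, tile) block of scores
        exists at any time instead of the full (seq_len, seq_len) matrix."""
        seq_len, head_dim = q.shape
        out = np.zeros_like(q, dtype=np.float64)
        running_max = np.full(seq_len, -np.inf)   # running row maxima
        running_sum = np.zeros(seq_len)           # running softmax denominators
        for start in range(0, k.shape[0], tile):
            k_t, v_t = k[start:start + tile], v[start:start + tile]
            scores = q @ k_t.T / np.sqrt(head_dim)        # (seq_len, tile)
            new_max = np.maximum(running_max, scores.max(axis=1))
            correction = np.exp(running_max - new_max)    # rescale old partial sums
            p = np.exp(scores - new_max[:, None])
            running_sum = running_sum * correction + p.sum(axis=1)
            out = out * correction[:, None] + p @ v_t
            running_max = new_max
        return out / running_sum[:, None]

    # Sanity check against naive (materialized) attention.
    q = np.random.randn(300, 64); k = np.random.randn(300, 64); v = np.random.randn(300, 64)
    s = q @ k.T / np.sqrt(64)
    naive = np.exp(s - s.max(axis=1, keepdims=True))
    naive = (naive / naive.sum(axis=1, keepdims=True)) @ v
    print(np.allclose(online_softmax_attention(q, k, v), naive))   # True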

More AI Research Topics

Read more about:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging