Aussie AI Blog

What's Hot in LLM Inference Optimization in 2025?

  • March 3rd, 2025
  • by David Spuler, Ph.D.

Inference Optimization News in 2025

Surprisingly, 2025 started with a huge focus on LLM efficiency, triggered by the release of DeepSeek's R1 advanced reasoning model, which outpaced the frontier models on many benchmarks at a fraction of the cost.

DeepSeek Efficiency Advancements

Although the reported cheap training cost of DeepSeek R1 was what put tremors through NVIDIA's stock price, the DeepSeek architecture improvements spanned several categories, including a number of advancements to inference efficiency in reasoning models.

Some of these algorithms first appeared in the earlier V3 model and were then carried over to the R1 reasoning model. Read more here: DeepSeek's research advancements.

Neural Magic Acquisition

Another industry change related to inference efficiency, one that received a lot less attention, was the acquisition of Neural Magic, a Boston-based AI inference startup, by Red Hat and IBM in late 2024. As one of the few pure-play inference optimization startups, Neural Magic raised over $50m in venture capital and has now exited (price undisclosed). With a focus on the software stack, especially dynamic sparsity, they achieved significant advances in inference efficiency.

New Research on Reasoning Efficiency

Chain-of-Thought efficiency. The rise of reasoning models from several major AI companies (not just DeepSeek) has led to a rash of follow-on papers on improving the efficiency of LLM reasoning algorithms. Reasoning models, especially those using multi-step inference, are very costly to run.

In particular, Chain-of-Thought tends to emit a huge number of tokens as the model "talks to itself" to achieve good reasoning results. This is true both in single-step CoT "long answer" models (e.g., DeepSeek R1) and in multi-step CoT versions with "test-time compute" (e.g., OpenAI's o3 and Google's Gemini reasoning models). Hence, a number of techniques have been tested in the research to reduce the token processing cost of CoT, ranging from high-level changes to the CoT algorithm, such as skipping steps and pruning redundant reasoning paths (in multi-step CoT), down to lower-level optimizations such as token compression.
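
To make the step-pruning idea concrete, here's a minimal Python sketch of one simple cost control for multi-step CoT: sample reasoning paths one at a time and stop as soon as enough of them agree on an answer, rather than always paying for the full self-consistency budget. The generate_step and extract_answer callables are hypothetical placeholders for your own model calls, not any particular library's API.

    from collections import Counter

    def early_stop_self_consistency(generate_step, extract_answer, prompt,
                                    max_paths=8, agree_threshold=3):
        """Sample CoT reasoning paths one at a time, stopping early once
        `agree_threshold` paths agree on the same final answer."""
        votes = Counter()
        paths_used = 0
        for _ in range(max_paths):
            chain = generate_step(prompt)    # one sampled chain-of-thought
            answer = extract_answer(chain)   # final answer parsed from the chain
            votes[answer] += 1
            paths_used += 1
            if votes[answer] >= agree_threshold:
                break                        # consensus reached: skip the remaining paths
        best_answer, _ = votes.most_common(1)[0]
        return best_answer, paths_used

On easy queries this stops after a handful of paths, which is exactly the kind of token saving the CoT efficiency papers are chasing.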

In general, almost all of the 500+ LLM optimization techniques could theoretically be applied to the special case of reasoning algorithms, but so far only a small fraction of these methods have been tested with Chain-of-Thought.

Small Reasoning Models. Another related area of research is "Small Reasoning Models," which is the use of smaller models with reasoning capabilities. This has been approached in various ways:

  • Multi-step CoT algorithms wrapped around smaller base models.
  • Improved training and fine-tuning of reasoning techniques applied to small models.
  • Distillation of small models from Large Reasoning Models (see the sketch after this list).
  • Model compression of larger reasoning models (e.g. quantization of DeepSeek R1).
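
For the distillation bullet above, here's a minimal sketch (assuming PyTorch, and assuming the teacher and student share a tokenizer) of the standard soft-label distillation loss that a small reasoning model can be trained with over teacher-generated reasoning traces. It illustrates the general recipe only, not any particular lab's exact procedure, and the teacher/student/batch names in the usage comment are hypothetical.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """Soft-label distillation: KL divergence between the teacher's and
        student's next-token distributions, softened by a temperature.
        Both tensors have shape (batch, seq_len, vocab_size) and are assumed
        to come from models that share a tokenizer."""
        t = temperature
        student_logprobs = F.log_softmax(student_logits / t, dim=-1)
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        # Scale by T^2, as is standard practice in distillation.
        return F.kl_div(student_logprobs, teacher_probs,
                        reduction="batchmean") * (t * t)

    # Hypothetical usage over a batch of teacher-generated reasoning traces:
    #   with torch.no_grad():
    #       teacher_logits = teacher(batch_ids).logits
    #   student_logits = student(batch_ids).logits
    #   loss = distillation_loss(student_logits, teacher_logits)
    #   loss.backward()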

Endlessly Hot Research Areas

There continues to be an endless flow of research papers on these LLM efficiency optimizations:

Quantization: Ever the hottest area, quantization is now widespread throughout industry, not just in research labs, and new low-bit variants keep appearing in the research literature.
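
For readers new to the area, the core of most post-training quantization schemes is just mapping floats onto a small integer grid. Here's a minimal NumPy sketch of symmetric per-tensor INT8 weight quantization; real systems use per-channel or group-wise scales and lower bit widths, but the idea is the same.

    import numpy as np

    def quantize_int8(weights):
        """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
        scale = max(float(np.abs(weights).max()) / 127.0, 1e-12)
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q, scale):
        """Recover approximate float weights from the INT8 grid."""
        return q.astype(np.float32) * scale

    # Example: quantization error on a random weight matrix.
    w = np.random.randn(256, 256).astype(np.float32)
    q, scale = quantize_int8(w)
    print("mean abs error:", np.abs(w - dequantize_int8(q, scale)).mean())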

Speculative decoding: Improvements to speculative decoding have included more accurate draft models and better use of multi-machine "distributed speculative decoding." In another extension, prompt lookup decoding has been generalized to look beyond the current prompt to the entire history of prompts across multiple queries. (See the blog article "What's New in Speculative Decoding?")
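
Prompt lookup decoding is one of the simplest drafting schemes to illustrate: instead of a separate draft model, the draft tokens are copied from the most recent place where the current n-gram already appeared in the context (or, in the generalized version, in earlier prompts too). Here's a minimal sketch over plain token-ID lists; the function name and defaults are illustrative, not from any particular implementation.

    def prompt_lookup_draft(context_ids, ngram_size=3, num_draft=5):
        """Propose draft tokens by matching the last `ngram_size` tokens of the
        context against an earlier occurrence in the same context, and copying
        the tokens that followed it. `context_ids` is a plain list of token IDs
        (the prompt plus tokens generated so far); the target model then
        verifies the returned draft in one batched forward pass."""
        if len(context_ids) < ngram_size + 1:
            return []
        tail = context_ids[-ngram_size:]
        # Search backwards for the most recent earlier match of the tail n-gram.
        for start in range(len(context_ids) - ngram_size - 1, -1, -1):
            if context_ids[start:start + ngram_size] == tail:
                return context_ids[start + ngram_size:start + ngram_size + num_draft]
        return []

    # The n-gram [4, 5, 6] appeared earlier, so its continuation is drafted.
    print(prompt_lookup_draft([1, 2, 3, 4, 5, 6, 9, 8, 7, 4, 5, 6]))  # [9, 8, 7, 4, 5]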

KV cache optimizations: KV cache compression remains hot because the KV cache is the main memory bottleneck for long context processing. In fact, we're now into the realm of "ultralong" contexts of over 1M tokens, which cannot be handled without KV cache optimizations. Quantization of KV cache data to INT4 or INT2 is now common, but there's also research on numerous other ways to compress the KV cache, such as token pruning and head fusion.
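
As a sketch of what low-bit KV cache quantization looks like, here's a simplified NumPy version of group-wise asymmetric 4-bit quantization of a cache tensor. Real implementations also pack two 4-bit codes per byte and fuse the dequantization into the attention kernel, but the numerics are the same.

    import numpy as np

    def quantize_kv_int4(kv, group_size=32):
        """Asymmetric 4-bit group-wise quantization of a KV cache tensor of
        shape (tokens, head_dim), with head_dim divisible by group_size.
        Each group shares one scale and one zero point, which is what makes
        4 bits tolerable for attention accuracy."""
        tokens, head_dim = kv.shape
        groups = kv.reshape(tokens, head_dim // group_size, group_size)
        lo = groups.min(axis=-1, keepdims=True)
        hi = groups.max(axis=-1, keepdims=True)
        scale = np.maximum((hi - lo) / 15.0, 1e-8)   # 4 bits -> levels 0..15
        q = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
        return q, scale, lo

    def dequantize_kv_int4(q, scale, lo):
        """Reconstruct approximate float values from the 4-bit codes."""
        return (q.astype(np.float32) * scale + lo).reshape(q.shape[0], -1)

    # Example: the key cache for 1024 tokens of one attention head (head_dim=128).
    k_cache = np.random.randn(1024, 128).astype(np.float32)
    q, scale, lo = quantize_kv_int4(k_cache)
    print("mean abs error:", np.abs(k_cache - dequantize_kv_int4(q, scale, lo)).mean())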

Generally speaking, almost all of the LLM model compression optimizations (to model parameters or activations) can be re-applied in the new realm of the KV cache, so there's no shortage of future paper topics.

Fused KV caching. My favorite area, and one that I feel is certain to break out, is the generalization of prefix KV caching to non-prefix strings by concatenating precomputed KV caches and making adjustments. Recent research has examined the accuracy problems and improved the handling of positional encoding adjustments and cross-chunk attention; see substring/fused/concatenated KV cache. Yes, it needs a better name!
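
To make the idea concrete, here's a simplified NumPy sketch of fusing two independently precomputed KV chunks, under the assumption that keys are cached before rotary position embedding (RoPE) so that RoPE can be applied at fusion time with the correct global positions. That's only one of several ways the positional adjustment is handled in the literature, and it deliberately ignores the cross-chunk attention correction, which is where much of the accuracy research is focused.

    import numpy as np

    def rope_rotate(x, positions, base=10000.0):
        """Apply rotary position embedding (RoPE) to `x` of shape
        (tokens, head_dim) at the given absolute positions."""
        tokens, head_dim = x.shape
        half = head_dim // 2
        freqs = base ** (-np.arange(half) / half)         # (half,)
        angles = positions[:, None] * freqs[None, :]      # (tokens, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, :half], x[:, half:]
        return np.concatenate([x1 * cos - x2 * sin,
                               x1 * sin + x2 * cos], axis=-1)

    def fuse_kv_chunks(chunks):
        """Concatenate per-chunk KV caches (keys stored un-rotated) into one
        cache, applying RoPE to the keys at their new global positions.
        `chunks` is a list of (keys, values) arrays, each (tokens_i, head_dim)."""
        keys_raw = np.concatenate([k for k, _ in chunks], axis=0)
        values = np.concatenate([v for _, v in chunks], axis=0)
        positions = np.arange(keys_raw.shape[0], dtype=np.float64)
        return rope_rotate(keys_raw, positions), values

    # Example: fuse caches precomputed separately for two document chunks.
    chunk_a = (np.random.randn(100, 64), np.random.randn(100, 64))
    chunk_b = (np.random.randn(50, 64), np.random.randn(50, 64))
    k, v = fuse_kv_chunks([chunk_a, chunk_b])
    print(k.shape, v.shape)   # (150, 64) (150, 64)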

New attention algorithms: Another way to achieve long context processing is to alter the attention algorithm itself. Every paper on this topic, and there are many, chooses a new name for its approach, so there are numerous new candidates.
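
One long-standing and easy-to-illustrate member of this family is sliding-window (local) attention, where each token attends only to a fixed window of recent tokens, turning the quadratic attention cost into a linear one. Here's a minimal single-head NumPy sketch (illustrative, not any specific paper's variant).

    import numpy as np

    def sliding_window_attention(q, k, v, window=256):
        """Local (sliding-window) attention: each query token attends only to
        the `window` most recent key positions (causal), so attention cost is
        linear in sequence length rather than quadratic.
        q, k, v each have shape (seq_len, head_dim)."""
        seq_len, head_dim = q.shape
        out = np.zeros_like(q)
        for i in range(seq_len):
            start = max(0, i - window + 1)
            scores = q[i] @ k[start:i + 1].T / np.sqrt(head_dim)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            out[i] = weights @ v[start:i + 1]
        return out

    # Example: 1,000 tokens with a 256-token local window.
    q = np.random.randn(1000, 64)
    k = np.random.randn(1000, 64)
    v = np.random.randn(1000, 64)
    print(sliding_window_attention(q, k, v).shape)   # (1000, 64)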

We seem to be stuck at version 3 of FlashAttention, which was released eons ago in July 2024. There also hasn't been much added recently to the literature on PagedAttention. Some of the newer attention algorithms have been combined with these two major memory-efficient attention techniques, but there hasn't been a core advance in either one lately.
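
For readers wondering why FlashAttention matters so much, the key trick is the online softmax: attention can be computed tile by tile with running maxima and running denominators, so the full seq_len-by-seq_len score matrix is never materialized in memory. Here's a minimal single-head NumPy sketch of that recurrence; the real kernels fuse it into GPU tiles, so this is an illustration of the idea, not the actual kernel code.

    import numpy as np

    def online_softmax_attention(q, k, v, tile=128):
        """Exact single-head attention computed over key/value tiles using the
        online-softmax recurrence, so only one (seq_len, tile) block of scores
        exists at any time instead of the full (seq_len, seq_len) matrix."""
        seq_len, head_dim = q.shape
        out = np.zeros_like(q, dtype=np.float64)
        running_max = np.full(seq_len, -np.inf)   # running row maxima
        running_sum = np.zeros(seq_len)           # running softmax denominators
        for start in range(0, k.shape[0], tile):
            k_t, v_t = k[start:start + tile], v[start:start + tile]
            scores = q @ k_t.T / np.sqrt(head_dim)        # (seq_len, tile)
            new_max = np.maximum(running_max, scores.max(axis=1))
            correction = np.exp(running_max - new_max)    # rescale old partial sums
            p = np.exp(scores - new_max[:, None])
            running_sum = running_sum * correction + p.sum(axis=1)
            out = out * correction[:, None] + p @ v_t
            running_max = new_max
        return out / running_sum[:, None]

    # Sanity check against naive (materialized) attention.
    q = np.random.randn(300, 64); k = np.random.randn(300, 64); v = np.random.randn(300, 64)
    s = q @ k.T / np.sqrt(64)
    naive = np.exp(s - s.max(axis=1, keepdims=True))
    naive = (naive / naive.sum(axis=1, keepdims=True)) @ v
    print(np.allclose(online_softmax_attention(q, k, v), naive))   # True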

More AI Research Topics

Read more about:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging