Aussie AI Blog

Promising LLM Inference Optimization Research

  • September 29, 2025
  • by David Spuler, Ph.D.

There are literally 500 ways to optimize LLM inference. Here are some of the more interesting techniques from recent research papers that warrant further attention.

Fused KV Caching

Fused KV caching, also called substring, concatenated, or position-independent KV caching, merges the precomputed KV caches of two or more adjacent pieces of text. This is a generalization of prefix KV caching. It recently got a boost from Meta's research in Lin et al. (2025), which created a related technique based on modifying the attention patterns across adjacent pieces of text and their per-chunk KV caches. See Fused KV caching research.
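
As a minimal sketch of the basic idea, the code below precomputes a KV cache for each text chunk separately, then fuses the caches by concatenating them along the sequence axis. The helper model.prefill_kv() is hypothetical, and naive concatenation glosses over two issues the real methods must handle: re-encoding positions (e.g., re-rotating RoPE keys to their new absolute positions) and the cross-chunk attention that a full prefill would have computed.

    import torch

    def precompute_chunk_cache(model, chunk_tokens):
        # Prefill one chunk on its own and keep its per-layer (K, V) tensors.
        # `model.prefill_kv` is a hypothetical helper, not a real API.
        with torch.no_grad():
            return model.prefill_kv(chunk_tokens)  # list of (K, V) per layer

    def fuse_caches(chunk_caches):
        # Fuse the per-chunk caches by concatenating K and V on the sequence
        # axis; K/V shapes assumed [batch, heads, seq, head_dim], so dim=-2 is seq.
        fused = []
        for layer in range(len(chunk_caches[0])):
            k = torch.cat([cache[layer][0] for cache in chunk_caches], dim=-2)
            v = torch.cat([cache[layer][1] for cache in chunk_caches], dim=-2)
            fused.append((k, v))
        return fused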

FFN Fusion

In Bercovich et al. (2025), NVIDIA researchers found a way to merge two or more FFN components across layers. The method pairs with attention pruning: once a layer's attention block is removed, the FFNs of consecutive layers become adjacent and can be fused into a single, wider FFN. See FFN Fusion.
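
Below is a minimal sketch of the weight-level trick, assuming two plain two-matrix FFNs whose intervening attention has been pruned away. Because the activation is elementwise, concatenating the up-projections and down-projections yields one wider FFN that computes exactly the sum of the two FFNs applied in parallel to the same input, which approximates running them sequentially when the inter-layer dependency is weak.

    import torch
    import torch.nn.functional as F

    def fuse_ffn_weights(W_up1, W_down1, W_up2, W_down2):
        # W_up*: [d_model, d_ff]; W_down*: [d_ff, d_model].
        # Concatenate the inner dimensions to build one wider FFN.
        W_up = torch.cat([W_up1, W_up2], dim=1)        # [d_model, 2*d_ff]
        W_down = torch.cat([W_down1, W_down2], dim=0)  # [2*d_ff, d_model]
        return W_up, W_down

    def fused_ffn(x, W_up, W_down):
        # Equals gelu(x @ W_up1) @ W_down1 + gelu(x @ W_up2) @ W_down2:
        # both FFNs evaluated in parallel in one pair of wide MatMuls.
        return F.gelu(x @ W_up) @ W_down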

FFN MatMul Merging

An FFN does two matrix multiplications with an intervening activation function (e.g., GELU). Interestingly, you could merge both MatMuls in every FFN into a single operation if not for that pesky activation function in the middle. Recent research by Hu et al. (2025) shows that it's possible to do so by using linear approximations of the activation function. See Merging FFN MatMuls.
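
As a simplified sketch, suppose the activation is approximated on some input region by a single elementwise linear function act(z) ≈ a·z + b. (The paper's approximations are finer-grained than this, and the coefficients a and b below are illustrative placeholders.) The two weight matrices then fold into one offline, by matrix-multiplication associativity:

    import torch

    def merge_ffn_matmuls(W1, W2, a=0.5, b=0.0):
        # If act(z) ~= a*z + b elementwise, then:
        #   act(x @ W1) @ W2 ~= x @ (a * W1 @ W2) + b * W2.sum(dim=0)
        # so both MatMuls collapse into one precomputed matrix.
        W_merged = a * (W1 @ W2)      # folded offline: [d_model, d_model]
        bias = b * W2.sum(dim=0)      # constant term from the linear part
        return W_merged, bias

    def approx_ffn(x, W_merged, bias):
        return x @ W_merged + bias    # a single MatMul replaces two

With the usual d_ff = 4 × d_model, this cuts the FFN's MatMul cost from 8·d_model² to d_model² multiply-accumulates per token, at the accuracy cost of the approximation.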

References

  1. Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan, 1 Sep 2025, REFRAG: Rethinking RAG based Decoding, https://www.arxiv.org/abs/2509.01092, https://www.alphaxiv.org/pdf/2509.01092 (Separates the attention computations across RAG chunks, which is effectively the same as "fused KV" or "concatenated KV" approaches with pre-computed per-chunk KV caches.)
  2. Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv, 24 Mar 2025, FFN Fusion: Rethinking Sequential Computation in Large Language Models, https://arxiv.org/abs/2503.18908
  3. Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, the optimization merges the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., ReLU or GELU) with linear functions and merging the two matrices using matrix-multiplication associativity.)

Aussie AI Advanced C++ Coding Books



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • Multithreading optimizations
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency



Advanced C++ Memory Techniques: Efficiency & Safety:
  • Memory optimization techniques
  • Memory-efficient data structures
  • DIY memory safety techniques
  • Intercepting memory primitives
  • Preventive memory safety
  • Memory reduction optimizations

Get your copy from Amazon: Advanced C++ Memory Techniques



Safe C++: Fixing Memory Safety Issues:
  • The memory safety debate
  • Memory and non-memory safety
  • Pragmatic approach to safe C++
  • Rust versus C++
  • DIY memory safety methods
  • Safe standard C++ library

Get it from Amazon: Safe C++: Fixing Memory Safety Issues



Efficient C++ Multithreading: Modern Concurrency Optimization:
  • Multithreading optimization techniques
  • Reduce synchronization overhead
  • Standard container multithreading
  • Multithreaded data structures
  • Memory access optimizations
  • Sequential code optimizations

Get your copy from Amazon: Efficient C++ Multithreading



Efficient Modern C++ Data Structures:
  • Data structures overview
  • Modern C++ container efficiency
  • Time & space optimizations
  • Contiguous data structures
  • Multidimensional data structures

Get your copy from Amazon: Efficient C++ Data Structures



Low Latency C++: Multithreading and Hotpath Optimizations: advanced coding book:
  • Low Latency for AI and other applications
  • C++ multithreading optimizations
  • Efficient C++ coding
  • Time and space efficiency
  • C++ slug catalog

Get your copy from Amazon: Low Latency C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging