Aussie AI

LLM Memory Optimization

  • Last Updated 29 August, 2025
  • by David Spuler, Ph.D.

What is LLM Memory Optimization?

Memory optimization involves using less memory during model inference. This means that inference requires fewer resources, and it can also reduce CPU usage because less data is swapped in and out of memory. Memory optimization can refer to either CPU memory or GPU memory.

Note that this page covers memory optimizations for backend coding of AI engines. For the other type of "memory", where a model retains information across its answers, see also: LLM short-term and long-term memory architectures.

Various research reports show that model inference is memory-bound rather than CPU-bound. In such cases, memory management is key to improving latency and throughput. On the other hand, researchers have also examined increasing memory usage to save time by caching and computation reuse.

Memory-Bound vs Compute-Bound

The situation with memory versus compute is more nuanced in LLM inference. Inference has two distinct phases with opposite characteristics:

  • Prefill phase (prompt processing) — compute-bound.
  • Decoding phase (autoregressive token generation) — memory-bound.

Hence, there is various research on prefill optimization, including "phase splitting," which disaggregates the prefill and decoding phases so that each can run on machines with a different memory/GPU setup.

Going even further, it turns out that the decoding phase is memory-bound overall, but this arises mainly from the attention module and its loading of KV cache data, which changes for each token. Hence, the memory characteristics of the decoding phase are themselves more nuanced:

  • Attention module (KV cache) — memory-bound.
  • FFN/MLP modules — compute-bound.

The FFN always operates with the same weight matrices, so they can be fully pre-loaded, making it compute-bound. The attention module also uses fixed model parameters, but it must additionally load a different set of KV cache data for each token, making it memory-bound overall. Hence, one high-level memory optimization is not only to split prefill and decoding (phase splitting), but also to do "sublayer splitting," running the attention and FFN computations on different platforms.
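
To make the attention-side memory traffic concrete, below is a minimal back-of-the-envelope sketch in C++. All of the model dimensions are illustrative assumptions rather than any particular model; the point is that the KV cache bytes read per decoding step grow with context length and must be re-read for every new token.

    // Sketch: estimated KV cache bytes read per decoded token (illustrative sizes only).
    #include <cstdio>

    int main() {
        const long long num_layers = 32;      // transformer layers (assumed)
        const long long num_kv_heads = 8;     // KV heads, e.g. with grouped-query attention (assumed)
        const long long head_dim = 128;       // dimension per head (assumed)
        const long long bytes_per_value = 2;  // FP16 cache entries

        // Each layer stores one K vector and one V vector per KV head, per token.
        const long long kv_bytes_per_token =
            2 * num_layers * num_kv_heads * head_dim * bytes_per_value;

        const long long context_lengths[] = {1024, 8192, 32768};
        for (long long context_len : context_lengths) {
            // During decoding, attention re-reads the whole KV cache for every new token.
            long long bytes_per_step = kv_bytes_per_token * context_len;
            printf("context %6lld tokens: ~%.1f MB of KV cache read per decoding step\n",
                   context_len, bytes_per_step / (1024.0 * 1024.0));
        }
        return 0;
    }

By contrast, the FFN weights are the same bytes for every token and every request, so batching amortizes their memory traffic, whereas each sequence's KV cache is unique to that sequence and keeps growing, which is what pushes decode-time attention into memory-bound territory.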

Model Compression Techniques

The main class of optimizations that reduce memory requirements by making the model smaller is called "model compression." It includes sub-strategies such as quantization, pruning, and knowledge distillation.
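
As a simple illustration of how model compression saves memory, here is a minimal, hedged sketch of symmetric per-tensor INT8 quantization in C++ (one of the simplest quantization schemes, not the method of any particular library); it cuts weight storage from 4 bytes to 1 byte per parameter.

    // Sketch of symmetric per-tensor INT8 weight quantization (illustrative only).
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Quantize FP32 weights to INT8 with a single scale: w is approximately q * scale.
    std::vector<int8_t> quantize_int8(const std::vector<float>& w, float& scale) {
        float max_abs = 0.0f;
        for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
        scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        std::vector<int8_t> q(w.size());
        for (size_t i = 0; i < w.size(); ++i)
            q[i] = (int8_t)std::clamp((int)std::lround(w[i] / scale), -127, 127);
        return q;
    }

    int main() {
        std::vector<float> weights = {0.12f, -0.50f, 0.03f, 0.49f, -0.27f};
        float scale = 1.0f;
        std::vector<int8_t> q = quantize_int8(weights, scale);
        for (size_t i = 0; i < q.size(); ++i)   // dequantize on the fly: q[i] * scale
            printf("w=%+.3f  q=%+4d  dequantized=%+.3f\n", weights[i], q[i], q[i] * scale);
        printf("Storage: %zu bytes FP32 vs %zu bytes INT8 (plus one 4-byte scale)\n",
               weights.size() * sizeof(float), q.size() * sizeof(int8_t));
        return 0;
    }

Production schemes use per-channel or group-wise scales and even lower bit widths, but the memory arithmetic is the same: fewer bits per weight means a smaller model in memory.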

Recomputation

Recomputation is a method of trading time for space in LLM algorithms by re-computing data rather than storing the results in memory. On a memory-constrained device, it reduces space requirements at the cost of extra processor time. In research papers the technique is also called "rematerialization," and when used during LLM training it is closely related to "checkpointing." When recomputation is used to optimize the training of a model that is too large to fit inside GPU memory, it is called "gradient checkpointing." The portion of this algorithm that involves swapping tensors off the GPU back to the CPU is often called "offloading."

The recomputation method means not storing the results of a computation that may be needed later, but instead waiting and recomputing them when they are actually required. Hence, recomputation trades time for space, and is effectively the opposite of caching and data reuse optimizations, which trade space for time.

Recomputation means doing calculations a second time, which is redundant computation. It is not something to do often, since it costs extra CPU or GPU time, but it is worth considering when memory is at a premium, and it is sometimes used as a GPU optimization.
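
The following is a minimal sketch of the idea in C++ (the "expensive intermediate" function is a placeholder stand-in, not real inference code): one version keeps an intermediate buffer alive between its two uses, while the recomputation version frees it and recomputes it later, reducing peak memory at the cost of repeating the work.

    // Recomputation sketch: trade extra compute for lower peak memory (illustrative only).
    #include <cmath>
    #include <numeric>
    #include <vector>

    // Placeholder for an expensive intermediate computation (e.g., a layer activation).
    static std::vector<float> expensive_intermediate(const std::vector<float>& x) {
        std::vector<float> out(x.size());
        for (size_t i = 0; i < x.size(); ++i) out[i] = std::tanh(x[i]);
        return out;
    }

    // Space-for-time: store the intermediate result and keep it live until its second use.
    float with_storage(const std::vector<float>& x) {
        std::vector<float> a = expensive_intermediate(x);
        float first_use = std::accumulate(a.begin(), a.end(), 0.0f);
        // ... other memory-hungry work runs here, with 'a' still held in memory ...
        float second_use = a[0];
        return first_use + second_use;
    }

    // Time-for-space: discard the intermediate result and recompute it when needed again.
    float with_recomputation(const std::vector<float>& x) {
        float first_use;
        {
            std::vector<float> a = expensive_intermediate(x);  // freed at end of this scope
            first_use = std::accumulate(a.begin(), a.end(), 0.0f);
        }
        // ... other memory-hungry work runs here without 'a' held in memory ...
        float second_use = expensive_intermediate(x)[0];       // recomputed (extra time)
        return first_use + second_use;
    }

Gradient checkpointing applies the same trade-off during training: intermediate activations are dropped after the forward pass and recomputed during the backward pass.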

Research on Recomputation: Research papers on the recomputation memory optimization technique include:

Data Locality

Data locality is the method of speeding up LLM algorithms by using data that is stored closely together. The simplest idea is to store all data in contiguous memory, which is commonly used for model matrices and tensors. The use of data in "nearby" regions helps with optimizations such as caching, prefetching, tiling, coalescing, and other memory access pattern optimizations.
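
As a small illustration of why locality matters, the following C++ sketch compares two loop orderings for a matrix multiply over contiguous row-major arrays (the sizes and layout are illustrative assumptions): the reordered i-k-j version reads B and writes C with unit stride, which is much friendlier to hardware caches than the naive i-j-k version that strides down a column of B on every inner iteration.

    // Data locality sketch: two loop orders for C = A * B with row-major n x n matrices.
    #include <vector>

    // Naive i-j-k order: the inner loop strides through B column-wise (poor locality).
    void matmul_ijk(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * B[k * n + j];   // B accessed with stride n
                C[i * n + j] = sum;
            }
    }

    // Reordered i-k-j: the inner loop touches B and C sequentially (good locality).
    // Assumes C is zero-initialized before the call.
    void matmul_ikj(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n) {
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k) {
                float a = A[i * n + k];
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];     // unit-stride reads and writes
            }
    }

Tiling (blocking) extends the same idea by keeping sub-blocks of the matrices resident in cache, or in GPU shared memory, across many reuses.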

Research papers on data locality in LLM computations:

  • Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
  • Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang, 22 Mar 2024, Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems, https://arxiv.org/abs/2403.15069
  • Kathryn S. McKinley, Steve Carr, Chau-Wen Tseng, 1996, Improving data locality with loop transformations, ACM Transactions on Programming Languages and Systems, Volume 18, Issue 4, pp 424–453, https://dl.acm.org/doi/10.1145/233561.233564
  • Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
  • Ilias Bournias, Lukas Cavigelli, Georgios Zacharopoulos, 8 Nov 2024, AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality, https://arxiv.org/abs/2411.05555
  • Jordi Wolfson-Pou, Jan Laukemann, Fabrizio Petrini, 13 Jan 2025, Generating Data Locality to Accelerate Sparse Matrix-Matrix Multiplication on CPUs, https://arxiv.org/abs/2501.07056

Prefetching

Prefetching is the optimization technique of requesting data from memory before it is needed, so that its later use does not slow down computations. Any type of memory access may benefit from prefetching: there is "instruction prefetching" for CPU execution and "data prefetching" for computations.
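
On CPUs, one way to experiment with software data prefetching is a compiler builtin such as GCC/Clang's __builtin_prefetch, as in the hedged sketch below; the 16-element look-ahead distance is an arbitrary tuning assumption, and GPU kernels use their own hardware and software prefetch mechanisms instead.

    // Software prefetching sketch using the GCC/Clang __builtin_prefetch intrinsic.
    #include <cstddef>
    #include <vector>

    float dot_with_prefetch(const std::vector<float>& a, const std::vector<float>& b) {
        const std::size_t n = a.size();
        const std::size_t lookahead = 16;   // look-ahead distance (arbitrary tuning assumption)
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + lookahead < n) {
                // Hint the hardware to start loading data needed a few iterations from now,
                // so the later reads are less likely to stall on memory.
                __builtin_prefetch(&a[i + lookahead], /*rw=*/0, /*locality=*/1);
                __builtin_prefetch(&b[i + lookahead], 0, 1);
            }
            sum += a[i] * b[i];
        }
        return sum;
    }

Hardware prefetchers already handle simple sequential scans well, so explicit hints like this mainly pay off for irregular or indirect access patterns, and should always be benchmarked.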

Research papers on prefetching optimizations:

SSD Storage

The use of SSDs is common for large-scale storage of models and their data. Research papers on SSD issues include:

  • Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen, 23 Oct 2024 (v2), Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching, https://arxiv.org/abs/2410.14740
  • Lucas Mearian, 24 Oct 2024, 2025: The year of the AI PC, Computer World, https://www.computerworld.com/article/3583355/2025-the-year-of-the-ai-pc.html
  • Tuowei Wang, Ruwen Fan, Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren, 29 Oct 2024 (v2), Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management, https://arxiv.org/abs/2410.19274
  • Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
  • S. Wang, Q. Cao, K. Zhou, J. Xu, Z. Guo and J. Guo, "ParaCkpt: Heterogeneous Multi-Path Checkpointing Mechanism for Training Deep Learning Models," 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 2024, pp. 183-190, doi: 10.1109/ICCD63220.2024.00036. https://ieeexplore.ieee.org/abstract/document/10818161/ (Generalizing in-memory checkpoints by storing data in shards across multiple storage areas including CPU memory and SSDs.)
  • Hongfan Gao, Wangmeng Shen, Xiangfei Qiu, Ronghui Xu, Jilin Hu and Bin Yang, 19 Aug 2025, SSD-TS: Exploring the Potential of Linear State Space Models for Diffusion Models in Time Series Imputation, https://arxiv.org/abs/2410.13338

Compute-in-Memory (CIM)

Compute-in-Memory (CIM) or Processing-in-Memory (PIM) optimizations perform computations inside or near the memory hardware itself, rather than shuttling data back and forth between memory and the processor. More generally, keeping LLM computations entirely within GPU memory is one of the main optimizations in LLM inference.

Research papers on CIM/PIM include:

  • Vaclav Snasel, Tran Khanh Dang, Josef Kueng, Lingping Kong 22 December 2023, A review of in-memory computing for machine learning: architectures, options, International Journal of Web Information Systems, https://www.emerald.com/insight/content/doi/10.1108/IJWIS-08-2023-0131/full/html
  • Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura, 12 Jun 2024, Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, https://arxiv.org/abs/2406.08413
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • H. Diao et al., 2024, A Multiply-Less Approximate SRAM Compute-In-Memory Macro for Neural-Network Inference, IEEE Journal of Solid-State Circuits, doi: 10.1109/JSSC.2024.3433417, https://ieeexplore.ieee.org/abstract/document/10622078
  • B. Kim et al., 2024, The Breakthrough Memory Solutions for Improved Performance on LLM Inference, IEEE Micro, vol. 44, no. 3, pp. 40-48, May-June 2024, doi: 10.1109/MM.2024.3375352, https://ieeexplore.ieee.org/abstract/document/10477465
  • https://community.juniper.net/blogs/sharada-yeluri/2024/02/20/llm-inference-hw-sw-optimizations
  • Wenlun Zhang, Shimpei Ando, Yung-Chin Chen, Satomi Miyagi, Shinya Takamaeda-Yamazaki, Kentaro Yoshioka, 29 Aug 2024, PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation, https://arxiv.org/abs/2408.16246
  • Md Tawsif Rahman Chowdhury, Huynh Quang Nguyen Vo, Paritosh Ramanan, Murat Yildirim, Gozde Tutuncuoglu, 10 Sep 2024, The Lynchpin of In-Memory Computing: A Benchmarking Framework for Vector-Matrix Multiplication in RRAMs, https://arxiv.org/abs/2409.06140
  • Bettayeb, M., Halawani, Y., Khan, M.U. et al. Efficient memristor accelerator for transformer self-attention functionality. Sci Rep 14, 24173 (2024). https://doi.org/10.1038/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z.pdf
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
  • Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, Woongkyu Lee, Minjae Lee, Hyungdeok Lee, Yousub Jung, Jaehan Park, Yosub Song, Byeongsu Yang, Haerang Choi, Guhyun Kim, Jongsoon Won, Woojae Shin, Changhyun Kim, Gyeongcheol Shin, Yongkee Kwon, Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi, 28 Dec 2024, LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System, https://arxiv.org/abs/2412.20166
  • Dong Eun Kim, Tanvi Sharma, Kaushik Roy, 17 Feb 2025, Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory, https://arxiv.org/abs/2502.12344
  • Zhantong Zhu, Hongou Li, Wenjie Ren, Meng Wu, Le Ye, Ru Huang, Tianyu Jia, 1 Mar 2025, Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs, https://arxiv.org/abs/2503.00461
  • T. Sharma, M. Ali, I. Chakraborty and K. Roy, 2025, "What, When, Where to Compute-in-Memory for Efficient Matrix Multiplication during Machine Learning Inference," in IEEE Transactions on Emerging Topics in Computing, doi: 10.1109/TETC.2025.3574508, https://ieeexplore.ieee.org/abstract/document/11026257/
  • Yuannuo Feng, Wenyong Zhou, Yuexi Lyu, Yixiang Zhang, Zhengwu Liu, Ngai Wong and Wang Kang, 16 Aug 2025, Extending Straight-Through Estimation for Robust Neural Networks on Analog CIM Hardware, https://arxiv.org/abs/2508.11940
  • Yuannuo Feng, Wenyong Zhou, Yuexi Lyu, Hanjie Liu, Zhengwu Liu, Ngai Wong, Wang Kang, 16 Aug 2025, HPD: Hybrid Projection Decomposition for Robust State Space Models on Analog CIM Hardware, https://arxiv.org/abs/2508.11935

Memory-Bound versus CPU-Bound

Surprisingly, researchers discovered that LLM inference was not CPU-bound (or GPU-bound), but was memory-bound, with the cost of accessing all those tensors full of weights (and activations) being the main efficiency bottleneck.

Subsequently, the picture was found to be more nuanced in decoder-only transformer architectures (e.g. GPT), so that:

  • Prefill phase — CPU-bound
  • Decoding phase — memory-bound

The prefill phase is the initial phase of "prompt processing," where every token in the prompt is processed (in parallel) to generate the overall KV caches. This has been found to thrash the CPU, or rather, the GPU. Prefill is a busy time, but it also takes a long time, and it is the cause of the initial delay (the time to first token) before an LLM starts answering your question.

The decoding phase is then the next phase, in which the autoregressive algorithm emits one token at a time. Because it cannot be fully parallelized, decoding tends not to fill the GPU pipeline, yet it continually accesses the entire model, one layer at a time. Hence, it is memory-bound.
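
A rough roofline-style estimate makes the point. In the sketch below, all sizes are illustrative assumptions: a single decoding step of one matrix-vector multiply performs about two FLOPs per weight but must stream every weight byte from memory, for an arithmetic intensity of roughly one FLOP per byte at FP16, far below the ridge point a modern GPU needs to be compute-limited.

    // Roofline-style sketch: arithmetic intensity of one decoding step (illustrative numbers).
    #include <cstdio>

    int main() {
        // One token's pass through a single weight matrix: y = W * x, W is rows x cols.
        const double rows = 4096, cols = 4096;          // assumed layer size
        const double bytes_per_weight = 2.0;            // FP16 weights

        double flops = 2.0 * rows * cols;               // one multiply and one add per weight
        double bytes = rows * cols * bytes_per_weight;  // every weight streamed from memory
        double intensity = flops / bytes;               // FLOPs per byte moved

        // Order-of-magnitude GPU peaks (assumed, not a specific device).
        const double peak_flops = 300e12;               // ~300 TFLOP/s of FP16 compute
        const double peak_bandwidth = 2e12;             // ~2 TB/s of memory bandwidth

        printf("Decode arithmetic intensity: %.2f FLOPs/byte\n", intensity);
        printf("GPU ridge point:             %.0f FLOPs/byte\n", peak_flops / peak_bandwidth);
        return 0;
    }

This is also why batching requests improves decoding throughput: the weights are read from memory once per step but reused across the whole batch, raising the FLOPs per byte moved.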

Research papers on the memory-bound versus CPU-bound nature of transformers:

  • Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, 25 March 2024, AI and Memory Wall, IEEE Micro (Early Access), pp. 1-5, https://ieeexplore.ieee.org/abstract/document/10477550
  • Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
  • Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111

Research on Memory Optimization

For model compression and its popular subtypes, see research paper lists on the individual pages (e.g. quantization, pruning). Other research that is specifically on memory management and reducing memory includes:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: