Aussie AI

Dataflow Optimizations of LLMs

  • Last Updated 26 August, 2025
  • by David Spuler, Ph.D.

Dataflow optimizations are a broad category of techniques for speeding up LLM inference in Transformer engines. The idea is to better manage the movement of the large volumes of data in both weights and activations, and thereby gain efficiency.

The sources of improvement may include:

  • Computation reuse (avoiding redundant computations; see the sketch after this list)
  • Memory access reduction (avoiding the cost of accessing memory)
  • A combination of these.
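
As a concrete illustration of computation reuse, the sketch below precomputes a small lookup table for the GELU activation over 8-bit quantized inputs, so the expensive transcendental math is evaluated once per table entry rather than once per activation element. This is a minimal, hypothetical C++ example: the quantization scale and zero-point are placeholder values, not taken from any particular model.

    // Minimal sketch of computation reuse via precomputation (illustrative only).
    // Assumes activations are quantized to 8 bits, so GELU can be computed once
    // into a 256-entry table and then reused for every activation element.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Reference GELU (tanh approximation).
    static float gelu(float x) {
        return 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
    }

    class GeluTable {
    public:
        GeluTable(float scale, int zero_point) {
            // One-off cost: 256 evaluations, reused for millions of activations.
            for (int q = 0; q < 256; ++q) {
                float x = scale * static_cast<float>(q - zero_point);
                table_[q] = gelu(x);
            }
        }
        // Reuse: a table lookup replaces the tanhf call per element.
        float operator()(uint8_t q) const { return table_[q]; }
    private:
        float table_[256];
    };

    int main() {
        GeluTable act(/*scale=*/0.05f, /*zero_point=*/128);  // placeholder quantization parameters
        std::vector<uint8_t> activations = {0, 64, 128, 192, 255};
        float sum = 0.0f;
        for (uint8_t q : activations) sum += act(q);  // no transcendental calls in this loop
        return sum > 0.0f ? 0 : 1;
    }

The same trade is memory for computation: a small table stays cache-resident and is amortized across the whole activation tensor.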

Types of Dataflow Optimizations

Some of the possible types of dataflow optimizations include:

  • Computation reuse
  • Conditional computation
  • Pipelining
  • Data marshalling improvements
  • Data locality (e.g., tiling; see the tiled matrix multiply sketch after this list)
  • Kernel fusion
  • Caching

Research Papers on Dataflow Optimizations

Papers on the use of dataflow optimizations in LLMs and Transformer architectures:

  • Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
  • Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang, 22 Mar 2024, Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems, https://arxiv.org/abs/2403.15069
  • Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique, 2020, Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead, https://ieeexplore.ieee.org/iel7/6287639/6514899/09269334.pdf, https://arxiv.org/abs/2012.11233 (Analysis of optimizations for DNNs and SNNs.)
  • Jianyi Cheng, Cheng Zhang, Zhewen Yu, Christos-Savvas Bouganis, George A. Constantinides, Yiren Zhao, 19 Apr 2024 (v2), A Dataflow Compiler for Efficient LLM Inference using Custom Microscaling Formats, https://arxiv.org/abs/2307.15517
  • Cyrus Zhou, Zack Hassman, Ruize Xu, Dhirpal Shah, Vaugnn Richard, Yanjing Li, 23 Nov 2023 (v3), YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs, https://arxiv.org/abs/2310.00574
  • Lois Orosa, Skanda Koppula, Yaman Umuroglu, Konstantinos Kanellopoulos, Juan Gomez-Luna, Michaela Blott, Kees Vissers, Onur Mutlu, 4 Feb 2022, EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators, https://arxiv.org/abs/2202.02310
  • G Abarajithan, Chamira U. S. Edussooriya, 6 Dec 2021, Kraken: An Efficient Engine with a Uniform Dataflow for Deep Neural Networks, https://arxiv.org/abs/2112.02793
  • Dingqing Yang, Amin Ghasemazar, Xiaowei Ren, Maximilian Golub, Guy Lemieux, Mieszko Lis, 23 Sep 2020, Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training, https://arxiv.org/abs/2009.10976
  • SC Kao, S Subramanian, G Agrawal, 2023, FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks https://dl.acm.org/doi/pdf/10.1145/3575693.3575747
  • Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang, 2024, FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics, Proceedings of Machine Learning and Systems 6 (MLSys 2024), PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf (Next generation of Flash Decoding, with improved asynchronous parallelism of Softmax in both prefill and decoding phases, heuristic dataflow management algorithms, and enhanced GEMM during the decoding phase.)
  • Chen, C, 2024, Hardware-software co-exploration and optimization for next-generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of individual non-linear functions to end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
  • J Liu, 2024, Data-driven Performance Optimization for Data-intensive Applications, Ph.D. Thesis, Electrical Engineering and Computer Science, University of California, Merced, https://escholarship.org/content/qt6gn2p8mn/qt6gn2p8mn.pdf (Optimization of data movement intensive algorithms, mostly non-AI applications.)
  • Agarwal, Saurabh, Aug 2024, Minimizing Data Movement in Machine Learning Systems, Ph.D. Thesis, Computer Sciences, University of Wisconsin--Madison, https://digital.library.wisc.edu/1711.dl/MKLIYRPB24A5R9D https://search.library.wisc.edu/digital/AMKLIYRPB24A5R9D PDF: https://asset.library.wisc.edu/1711.dl/QXSTVAIXECHQA8L/R/file-62b54.pdf?dl https://www.proquest.com/openview/c1ae2a92106d7ec681a7296cd163e0c1/1 (Dataflow optimization in training and also "clustered head attention" for memory-efficient inference, an extension of multi-head attention similar to layer-wise head fusion/pruning.)
  • Marcin Rogowski, 2024, Addressing Data Movement Challenges in High-Performance Computing, Ph.D. Thesis, King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia, https://repository.kaust.edu.sa/bitstreams/6a297b08-e7a1-48b9-b0d4-bf2d101636c3/download
  • Ruhai Lin, Rui-Jie Zhu, Jason K. Eshraghian, 12 Oct 2024, Reducing Data Bottlenecks in Distributed, Heterogeneous Neural Networks, https://arxiv.org/abs/2410.09650
  • David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, Mingran Wang, Raghu Prabhakar, 31 Oct 2024, Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance, https://arxiv.org/abs/2410.23668
  • Michael Davies, Neal Crago, Karthikeyan Sankaralingam, Stephen W. Keckler, 25 Feb 2025, Kitsune: Enabling Dataflow Execution on GPUs, https://arxiv.org/abs/2502.18403
  • Gustavo Moreira, Leonardo Ferreira, Carolina Veiga, Maryam Hosseini, Fabio Miranda, 10 Aug 2025, Urbanite: A Dataflow-Based Framework for Human-AI Interactive Alignment in Urban Visual Analytics, https://arxiv.org/abs/2508.07390
  • Cristian Sestito, Shady Agwa, Themis Prodromakis, 31 Jul 2025, TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Dataflow and Analytical Modelling, https://arxiv.org/abs/2408.01254
  • Choongseok Song and Doo Seok Jeong, 20 Aug 2025, Computing-In-Memory Dataflow for Minimal Buffer Traffic, https://arxiv.org/abs/2508.14375

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: