Aussie AI

Inference Frameworks

  • Last Updated 29 August, 2025
  • by David Spuler, Ph.D.

Inference frameworks are software platforms that take a trained model and execute it against requests from users. Many inference frameworks also provide training and fine-tuning capabilities, but not all do. Many frameworks are open source, while others remain proprietary, and there is intense competition in this space.
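To make the request-serving pattern concrete, here is a minimal sketch (not production code) using the Hugging Face Transformers library, one of the frameworks listed below; the publicly available "gpt2" checkpoint is used purely as an illustrative model choice.

    # Minimal sketch: serving one text-generation request with Hugging Face Transformers.
    # The "gpt2" checkpoint and handle_request() are illustrative choices, not a standard.
    from transformers import pipeline

    # Load the model and tokenizer once at startup, then reuse them for every request.
    generator = pipeline("text-generation", model="gpt2")

    def handle_request(prompt: str) -> str:
        # Execute the model against a single user request.
        outputs = generator(prompt, max_new_tokens=20, do_sample=False)
        return outputs[0]["generated_text"]

    print(handle_request("Inference frameworks are"))

A real inference framework wraps this basic load-once, serve-many loop with batching, caching, and GPU scheduling.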

There is considerable overlap between the concept of an inference framework and that of a "deep learning compiler". There is also overlap with "AI cloud hosting" services, offered both by new startups and by the major cloud providers (e.g., Amazon AWS, Microsoft Azure, and Google GCP), which typically include both training and inference features.
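The framework/compiler overlap is visible directly in framework APIs; for example, PyTorch 2.x exposes its compiler stack through torch.compile. The sketch below uses a tiny stand-in model, not a real LLM, just to show the call pattern.

    # Minimal sketch of the framework/compiler overlap: torch.compile JIT-compiles
    # the model's kernels on first call (PyTorch 2.x). The tiny model is a stand-in.
    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(16, 32),
        torch.nn.ReLU(),
        torch.nn.Linear(32, 4),
    ).eval()

    compiled_model = torch.compile(model)  # framework hands the graph to its compiler

    with torch.no_grad():
        x = torch.randn(1, 16)
        print(compiled_model(x).shape)  # torch.Size([1, 4])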

Software frameworks are only one part of the AI tech stack. Read more about inference optimization, training optimization, hardware accelerators, ML compilers, and our list of common and obscure AI optimization techniques.

List of Machine Learning Frameworks

Some of the many frameworks include (a brief usage sketch follows the list):

  • TensorFlow (open-sourced by Google)
  • PyTorch
  • Torch
  • MXNet
  • Hugging Face Transformers
  • LangChain
  • GGML
  • llama.cpp
  • LLVM (compiler infrastructure used by several ML compilers)
  • Caffe and Caffe2
  • Theano
  • RNN
  • Keras
  • Microsoft CNTK (Cognitive Toolkit)
  • Amazon ML
  • Google Cloud AutoML
  • Microsoft Azure (various)
  • scikit-learn
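Most of these frameworks share the same basic load-then-predict pattern. As a second illustration alongside the Transformers example above, here is a minimal TensorFlow/Keras sketch, again with a toy model standing in for a trained network.

    # Minimal sketch of the common load/predict pattern in TensorFlow/Keras.
    # The toy model is an illustrative stand-in for a trained network.
    import numpy as np
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(16,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(4),
    ])

    x = np.random.rand(1, 16).astype("float32")
    print(model.predict(x).shape)  # (1, 4)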

Features of ML Frameworks

Some of the desirable features include:

Survey Papers on ML Software Frameworks

Papers that review or survey software frameworks:

General Research on ML Software Frameworks

Research papers about general issues or specific frameworks:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: