Aussie AI
Outliers in Quantization
Last Updated 7 February, 2026
by David Spuler, Ph.D.
Research on Outliers in Quantization
Research papers include:
- Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Robin Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, Han Liu, 4 Apr 2024, Outlier-Efficient Hopfield Layers for Large Transformer-Based Models, https://arxiv.org/abs/2404.03828 Code: https://github.com/MAGICS-LAB/OutEffHop (Addresses outliers in quantization with a modified Softmax and an advanced Hopfield memory model.)
- Xing Hu, Yuan Chen, Dawei Yang, Sifan Zhou, Zhihang Yuan, Jiangyong Yu, Chen Xu, 28 May 2024, I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models, https://arxiv.org/abs/2405.17849 Code: https://anonymous.4open.science/r/I-LLM-F242/
- Wanyun Cui, Qianle Wang, 3 Apr 2024, Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models, https://arxiv.org/abs/2404.02837 (Examines which weights most affect inference, including outlier values.)
- Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu, 11 June 2024, WiP: Efficient LLM Prefilling with Mobile NPU, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 33 - 35, https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066 PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662066 (Faster NPU prefill via chunked prefilling using sequences of tokens, along with INT8 NPU quantization that is aware of outliers and offloads FP32 calculations from NPU back to CPU.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Lianwei Yang, Haisong Gong, 6 Aug 2024, DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers, https://arxiv.org/abs/2408.03291
- Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
- Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang, 15 Apr 2024 (v4), CBQ: Cross-Block Quantization for Large Language Models, https://arxiv.org/abs/2312.07950
- Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao, 4 Jun 2024 (v2), APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference, ICML 2024 Oral, https://arxiv.org/abs/2401.12200 https://github.com/ROIM1998/APT
- Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu, 20 Aug 2024, LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models, https://arxiv.org/abs/2408.10631 https://github.com/YupengSu/LLM-Barber
- Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung, 6 Sep 2024, OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models, https://arxiv.org/abs/2409.05902
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang, 25 Sep 2024, VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models, https://arxiv.org/abs/2409.17066 https://arxiv.org/pdf/2409.17066
- Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
- Ke Yi, Zengke Liu, Jianwei Zhang, Chengyuan Li, Tong Zhang, Junyang Lin, Jingren Zhou, 30 Sep 2024, Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference, https://arxiv.org/abs/2409.20361 (Handling of outliers in INT4 quantization.)
- Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo, 7 Oct 2024, PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs, https://arxiv.org/abs/2410.05265 https://github.com/ChenMnZ/PrefixQuant (Puts outliers into the KV cache as a prefix.)
- Akshat Ramachandran, Souvik Kundu, Tushar Krishna, 12 Nov 2024 (v2), MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization, https://arxiv.org/abs/2411.05282
- Dongwei Wang, Huanrui Yang, 8 Dec 2024, Taming Sensitive Weights : Noise Perturbation Fine-tuning for Robust LLM Quantization, https://arxiv.org/abs/2412.06858
- H Kang, Q Zhang, S Kundu, G Jeong, Z Liu, T Krishna, Dec 2024, GEAR: An Efficient Error Reduction Framework for KV Cache Compression in LLM Inference, https://neurips2024-enlsp.github.io/papers/paper_3.pdf (Use extra information in low-rank and sparse matrices to efficiently alleviate lossy KV cache quantization issues such as outliers.)
- Jiun-Man Chen, Yu-Hsuan Chao, Yu-Jie Wang, Ming-Der Shieh, Chih-Chung Hsu, Wei-Fen Lin, 11 Mar 2024, QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning, https://arxiv.org/abs/2403.06497 (Outlier-correcting fine-tuning and quantization method.)
- Kyle Wiggers, December 23, 2024, A popular technique to make AI more efficient has drawbacks, https://techcrunch.com/2024/12/23/a-popular-technique-to-make-ai-more-efficient-has-drawbacks/
- Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
- S. Kim, H. Lee, S. Kim, C. Kim and W. W. Ro, "AirGun: Adaptive Granularity Quantization for Accelerating Large Language Models," 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 2024, pp. 645-652, doi: 10.1109/ICCD63220.2024.00103. https://ieeexplore.ieee.org/abstract/document/10818069
- Xuerui Qiu, Jieyuan Zhang, Wenjie Wei, Honglin Cao, Junsheng Guo, Rui-Jie Zhu, Yimeng Shan, Yang Yang, Malu Zhang, Haizhou Li, 23 Jan 2025, Quantized Spike-driven Transformer, https://arxiv.org/abs/2501.13492 https://github.com/bollossom/QSD-Transformer
- Minhajul Hoque, Jan 4, 2025, DeepSeek V3: How They Achieved Big Results with Small Compute, https://ai.plainenglish.io/deepseek-v3-how-they-achieved-big-results-with-small-compute-fb694606d59a (DeepSeek optimizations included FP8 quantization with outlier handling, attention and KV cache optimization via Multi-head Latent Attention (MLA), and multi-token prediction.)
- Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, Kehong Yuan, 25 Jan 2025, RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations, https://arxiv.org/abs/2501.16383 (INT2 KV caching with special handling of outliers, RoPE, and attention sinks, and the resulting architecture works in Chain-of-Thought.)
- Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang, 3 Feb 2025, Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding, https://arxiv.org/abs/2502.01563 https://github.com/MingyuJ666/Rope_with_LLM (Finds that outliers in attention are important, and arise by being generated by RoPE.)
- Songhao Wu, Ang Lv, Xiao Feng, Yufei Zhang, Xun Zhang, Guojun Yin, Wei Lin, Rui Yan, 1 Feb 2025, PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration, https://arxiv.org/abs/2502.00527
- G. Wang, S. Cai, W. Li, D. Lyu and G. He, "OFQ-LLM: Outlier-Flexing Quantization for Efficient Low-Bit Large Language Model Acceleration," in IEEE Transactions on Circuits and Systems I: Regular Papers, doi: 10.1109/TCSI.2025.3547732. https://ieeexplore.ieee.org/abstract/document/10924797
- Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang, 16 May 2025, Accurate KV Cache Quantization with Outlier Tokens Tracing, https://arxiv.org/abs/2505.10938
- P Czakó, G Kertész, S Szénási, 2025, Addressing Activation Outliers in LLMs: A Systematic Review of Post-Training Quantization Techniques, IEEE Access, 2025, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10994764
- M. Seo, J. Hyun, S. Jeong, X. T. Nguyen, H. -J. Lee and H. Lee, "OASIS: Outlier-Aware KV Cache Clustering for Scaling LLM Inference in CXL Memory Systems," in IEEE Computer Architecture Letters, doi: 10.1109/LCA.2025.3567844, https://ieeexplore.ieee.org/abstract/document/10990150
- Yutong Liu, Cairong Zhao, Guosheng Hu, 23 Jul 2025, A Comprehensive Evaluation on Quantization Techniques for Large Language Models, https://arxiv.org/pdf/2507.17417
- Dongyeun Lee, Jiwan Hur, Hyounguk Shon, Jae Young Lee, Junmo Kim, 17 Jul 2025, DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization, https://arxiv.org/abs/2507.12933
- Joonsung Kang, 23 Jul 2025, Doubly robust outlier resistant inference on causal treatment effect, https://arxiv.org/abs/2507.17439
- Ivan Letteri, 20 Jul 2025, A Comparative Analysis of Statistical and Machine Learning Models for Outlier Detection in Bitcoin Limit Order Books, https://arxiv.org/abs/2507.14960
- Arend Hintze and Clifford Bohm, 11 Aug 2025, Rethinking Self-Replication: Detecting Distributed Selfhood in the Outlier Cellular Automaton, https://arxiv.org/abs/2508.08047
- Tanvir Islam, 26 Jul 2025, Extended Histogram-based Outlier Score (EHBOS), https://arxiv.org/abs/2502.05719
- Marcello D'Orazio, 28 Jul 2025, An empirical comparison of some outlier detection methods with longitudinal data, https://arxiv.org/abs/2507.21203
- Katharine M. Clark and Paul D. McNicholas, 31 Jul 2025, funOCLUST: Clustering Functional Data with Outliers, https://arxiv.org/abs/2508.00110
- Jiaxi Li, Lu Yin, Xilu Wang, 4 Aug 2025, OWLed: Outlier-weighed Layerwise Pruning for Efficient Autonomous Driving Framework, https://arxiv.org/abs/2411.07711
- Muhammad Rajabinasab, Farhad Pakdaman, Moncef Gabbouj, Peter Schneider-Kamp, Arthur Zimek, 18 Aug 2025, Randomized PCA Forest for Outlier Detection, https://arxiv.org/abs/2508.12776
- Mingyu Kim, Daniel Stilwell, Jorge Jimenez, 18 Aug 2025, Outlier Detection of Poisson-Distributed Targets Using a Seabed Sensor Network, https://arxiv.org/abs/2508.13099
- Kévin Ducharlet, Louise Travé-Massuyès, Jean-Bernard Lasserre, Marie-Véronique Le Lann, Youssef Miloudi, 13 Aug 2025, Leveraging the Christoffel Function for Outlier Detection in Data Streams, https://arxiv.org/abs/2508.16617
- Sunwoo Kim, 17 Aug 2025, Deep Learning and Matrix Completion-aided IoT Network Localization in the Outlier Scenarios, https://arxiv.org/abs/2508.18225
- Ryan Faulkner, Ian Reid, Simon Ratcliffe, Tat-Jun Chin, 25 Aug 2025, Finding Outliers in a Haystack: Anomaly Detection for Large Pointcloud Scenes, https://arxiv.org/abs/2508.17634
- Paul Fogel, Christophe Geissler, George Luta, 22 Aug 2025, The Target Polish: A New Approach to Outlier-Resistant Non-Negative Matrix Factorization, https://arxiv.org/abs/2507.10484
- Simon Klüttermann and Emmanuel Müller, 1 Sep 2025, Deep Transductive Outlier Detection, https://arxiv.org/abs/2404.03495
- Dietmar Saupe, Tim Bleile, 8 Sep 2025, Robustness and accuracy of mean opinion scores with hard and soft outlier detection, https://arxiv.org/abs/2509.06554
- Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang, 11 Sep 2025, ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms, https://arxiv.org/abs/2509.09679
- Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer, 5 Jun 2024 (v4), SqueezeLLM: Dense-and-Sparse Quantization, https://arxiv.org/abs/2306.07629 https://github.com/SqueezeAILab/SqueezeLLM (Separating outliers into a separately-stored small sparse data structure with full precision values, while "squeezing" the main LLM with low-bit quantization.)
- Mengxia Yu, De Wang, Qi Shan, Colorado J Reed, Alvin Wan, 7 Jul 2025 (v2), The Super Weight in Large Language Models, https://arxiv.org/pdf/2411.07191
- Arslan Majal, Aamir Hussain Chughtai and Muhammad Tahir, 9 Sep 2025, EMORF-II: Adaptive EM-based Outlier-Robust Filtering with Correlated Measurement Noise, https://arxiv.org/abs/2509.07415
- Waqar Ahmad, Evan Murphy, Vladimir A. Krylov, 12 Sep 2025, Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures, https://arxiv.org/abs/2509.08926
- Avinash Patil, 19 Sep 2025, When Bugs Linger: A Study of Anomalous Resolution Time Outliers and Their Themes, https://arxiv.org/abs/2509.16140
- Ye Qiao, Sitao Huang, 17 Sep 2025, Q-ROAR: Outlier-Aware Rescaling for RoPE Position Interpolation in Quantized Long-Context LLMs, https://arxiv.org/abs/2509.14391
- Yigit E. Yildirim, Samet Demir, Zafer Dogan, 18 Sep 2025, Benefits of Online Tilted Empirical Risk Minimization: A Case Study of Outlier Detection and Robust Regression, https://arxiv.org/abs/2509.15141
- Georgios Vlassis, Saleh Ashkboos, Alexandra Volkova, Torsten Hoefler, Dan Alistarh, 2 Oct 2025, Beyond Outliers: A Study of Optimizers Under Quantization, https://arxiv.org/abs/2509.23500
- Ye Qiao, Haocheng Xu, Xiaofan Zhang, Sitao Huang, 26 Sep 2025, Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling, https://arxiv.org/abs/2510.00028
- Hongkang Li, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Meng Wang, 1 Oct 2025, Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis, https://arxiv.org/abs/2510.00399
- Buhe Li, Berkay Kaplan, Maksym Lazirko, Aleksandr Kogan, 19 Sep 2025, Unsupervised Outlier Detection in Audit Analytics: A Case Study Using USA Spending Data, https://arxiv.org/abs/2509.19366
- Trung Nguyen Thanh, Huyen Giang Thi Thu, Tai Le Quy, Ha-Bang Ban, 23 Sep 2025, Constraint-Reduced MILP with Local Outlier Factor Modeling for Plausible Counterfactual Explanations in Credit Approval, https://arxiv.org/abs/2509.19504
- Akira Tamamori, 28 Oct 2025, Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection, https://arxiv.org/abs/2510.24043
- Ziyi Fang, Lingxiao Huang, Runkai Yang, 28 Oct 2025, Coreset for Robust Geometric Median: Eliminating Size Dependency on Outliers, https://arxiv.org/abs/2510.24621
- Jonah Ekelund, Savvas Raptis, Vicki Toy-Edens, Wenli Mo, Drew L. Turner, Ian J. Cohen, Stefano Markidis, 23 Oct 2025, Adaptive PCA-Based Outlier Detection for Multi-Feature Time Series in Space Missions, https://arxiv.org/abs/2504.15846
- Juan A. Lara, David Lizcano, Víctor Rampérez, Javier Soriano, 27 Oct 2025, A method for outlier detection based on cluster analysis and visual expert criteria, https://arxiv.org/abs/2510.23136
- William Roy Orchard, Nastaran Okati, Sergio Hernan Garrido Mejia, Patrick Blöbaum, Dominik Janzing, 25 Oct 2025, Root Cause Analysis of Outliers with Missing Structural Knowledge, https://arxiv.org/abs/2406.05014
- Carlo Dindorf, Jonas Dully, Steven Simon, Dennis Perchthaler, Stephan Becker, Hannah Ehmann, Kjell Heitmann, Bernd Stetter, Christian Diers, Michael Fröhlich, 26 Sep 2025, Outlier Detection in Plantar Pressure: Human-Centered Comparison of Statistical Parametric Mapping and Explainable Machine Learning, https://arxiv.org/abs/2509.21943
- Daniela Schkoda, Dominik Janzing, 8 Oct 2025, Root Cause Analysis of Outliers in Unknown Cyclic Graphs, https://arxiv.org/abs/2510.06995
- Iuri Macocco, Nora Graichen, Gemma Boleda, Marco Baroni, 3 Oct 2025, Not a nuisance but a useful heuristic: Outlier dimensions favor frequent tokens in language models, https://arxiv.org/abs/2503.21718
- Yuchen Shen, Haomin Wen and Leman Akoglu, 24 Sep 2025, FoMo-0D: A Foundation Model for Zero-shot Tabular Outlier Detection, https://arxiv.org/abs/2409.05672
- Zifan Wang, Xinlei Yi, Xenia Konti, Michael M. Zavlanos, and Karl H. Johansson, 29 Sep 2025, Distributionally Robust Federated Learning with Outlier Resilience, https://arxiv.org/abs/2509.24462
- Sajjad Hashemian, Mohammad Saeed Arvenaghi, Ebrahim Ardeshir-Larijani, 6 Oct 2025, Optimal Bound for PCA with Outliers using Higher-Degree Voronoi Diagrams, https://arxiv.org/abs/2408.06867
- Yihao Ang, Peicheng Yao, Yifan Bao, Yushuo Feng, Qiang Huang, Anthony K. H. Tung, Zhiyong Huang, 9 Oct 2025, RFOD: Random Forest-based Outlier Detection for Tabular Data, https://arxiv.org/abs/2510.08747
- Ilyas Varshavskiy, Bonu Boboeva, Shuhrat Khalilbekov, Azizjon Azimi, Sergey Shulgin, Akhlitdin Nizamitdinov, Haitz Saez de Ocariz Borde, 10 Oct 2025, Mitigating Model Drift in Developing Economies Using Synthetic Data and Outliers, https://arxiv.org/abs/2510.09294
- Cong Zeng, Shengkun Tang, Yuanzhou Chen, Zhiqiang Shen, Wenchao Yu, Xujiang Zhao, Haifeng Chen, Wei Cheng, Zhiqiang Xu, 7 Oct 2025, Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection, https://arxiv.org/abs/2510.08602
- Shivam Patel, Neharika Jali, Ankur Mallick, Gauri Joshi, 10 Oct 2025, ProxRouter: Proximity-Weighted LLM Query Routing for Improved Robustness to Outliers, https://arxiv.org/abs/2510.09852
- Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar, 26 Jan 2026, FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning, https://arxiv.org/abs/2601.19001 https://github.com/robinzixuan/FROST
- James Pan, Guoliang Li, 27 Jun 2025, A Survey of LLM Inference Systems, https://arxiv.org/abs/2506.21901
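Several of the papers above (e.g., SqueezeLLM's dense-and-sparse decomposition) handle outliers by storing the few extreme weights at full precision in a small sparse structure, and quantizing the remaining dense weights at low bit-width with a scale that the outliers can no longer inflate. The sketch below illustrates that general idea with symmetric INT8 quantization; the percentile threshold and all function names are illustrative assumptions, not any single paper's exact method.

```python
# Minimal sketch of dense-and-sparse outlier-aware quantization.
# Names and the 99.5th-percentile threshold are illustrative choices.
import numpy as np

def split_and_quantize(w: np.ndarray, outlier_pct: float = 99.5):
    """Separate outlier weights, then symmetric INT8 quantization of the rest."""
    threshold = np.percentile(np.abs(w), outlier_pct)
    outlier_mask = np.abs(w) > threshold
    # Sparse, full-precision storage for the few outlier weights.
    outlier_idx = np.nonzero(outlier_mask)
    outlier_vals = w[outlier_mask]
    # Dense part: outliers zeroed, so the INT8 scale is not inflated by them.
    dense = np.where(outlier_mask, 0.0, w)
    max_abs = np.abs(dense).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(dense / scale), -127, 127).astype(np.int8)
    return q, scale, outlier_idx, outlier_vals

def dequantize(q, scale, outlier_idx, outlier_vals):
    w = q.astype(np.float32) * scale
    w[outlier_idx] = outlier_vals  # restore outliers at full precision
    return w

# Reconstruction error on outlier weights is zero by construction, and the
# dense scale is much tighter than a naive max-based scale would be.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
w[0, 0] = 50.0  # inject one large outlier
q, s, idx, vals = split_and_quantize(w)
w_hat = dequantize(q, s, idx, vals)
assert w_hat[0, 0] == np.float32(50.0)
```

Variants in the literature differ mainly in how the outliers are chosen (per-channel statistics, Hessian sensitivity, prefix tokens) and in how the sparse part is stored and fused back at inference time.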
AI Books from Aussie AI
The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
Get your copy from Amazon: The Sweetest Lesson
RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
Get your copy from Amazon: RAG Optimization
Generative AI Applications book:
Get your copy from Amazon: Generative AI Applications
Generative AI programming book:
Get your copy from Amazon: Generative AI in C++
CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization
CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home