Aussie AI
Block-Level Quantization
Last Updated 10 March, 2026
by David Spuler, Ph.D.
Research on Block-Level Quantization
Block-level quantization divides a weight or activation tensor into small fixed-size blocks and gives each block its own quantization parameters, typically a scale factor or a shared exponent (as in block floating-point formats). The finer granularity confines the effect of outlier values to a single block, which improves accuracy at low bit widths.
Research papers include:
- Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
- Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
- Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh, https://openreview.net/pdf?id=OWwc8eOIm8
- Yanshu Wang, Wenyang He, Tong Yang, 24 May 2024, Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information, https://arxiv.org/abs/2405.17470
- Haocheng Xi, Yuxiang Chen, Kang Zhao, Kai Jun Teh, Jianfei Chen, Jun Zhu, 21 Jul 2024 (v2), Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization, https://arxiv.org/abs/2403.12422 Code: https://github.com/thu-ml/Jetfire-INT8Training
- Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang, 15 Apr 2024 (v4), CBQ: Cross-Block Quantization for Large Language Models, https://arxiv.org/abs/2312.07950
- Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng, Partha Pratim Pande, Janardhan Rao Doppa, Krishnendu Chakrabarty, Hai Li, 27 Oct 2023 (v3), Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators, https://arxiv.org/abs/2310.12182
- Sebastian Eliassen, Raghavendra Selvan, 16 Jan 2024 (v2), Activation Compression of Graph Neural Networks using Block-wise Quantization with Improved Variance Minimization, https://arxiv.org/abs/2309.11856
- Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer, 20 Jun 2022 (v2), 8-bit Optimizers via Block-wise Quantization, https://arxiv.org/abs/2110.02861
- Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung, 6 Sep 2024, OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models, https://arxiv.org/abs/2409.05902
- W Byun, J Woo, S Mukhopadhyay, 2024, Hardware-friendly Hessian-driven Row-wise Quantization and FPGA Acceleration for Transformer-based Models, https://dl.acm.org/doi/pdf/10.1145/3665314.3670806
- G. Wang, H. Qin, S. A. Jacobs, C. Holmes, S. Rajbhandari, O. Ruwase, F. Yan, L. Yang, and Y. He, 2023, ZeRO++: Extremely Efficient Collective Communication for Giant Model Training, https://arxiv.org/abs/2306.10209
- David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Alireza Khodamoradi, Kristof Denolf, Eric Dellinger, 15 Oct 2024, Error Diffusion: Post Training Quantization with Block-Scaled Number Formats for Neural Networks, https://arxiv.org/abs/2410.11203 https://github.com/ROCm/tensorcast
- Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah, 18 Nov 2024, BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration, https://arxiv.org/abs/2411.11745
- Wonsuk Jang, Thierry Tambe, 2 Jan 2025, BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference, https://arxiv.org/abs/2501.01144 (Per-block granular mixed-precision quantization including FP4.)
- A. Xu et al., "GausiQ: Generalized Automatic Hybrid-Precision Quantization for MIMO Detection," in IEEE Wireless Communications Letters, doi: 10.1109/LWC.2024.3509269. https://ieeexplore.ieee.org/abstract/document/10839390
- M Raji, AG Ahsaei, K Soroush, B Ghavami, Jan 2025, Progressive Bitwidth Assignment Approaches for Efficient Capsule Networks Quantization, IEEE Access, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10854429
- Michael Wu, Arnab Raha, Deepak A. Mathaikutty, Martin Langhammer, Engin Tunali, 31 Jan 2025, StruM: Structured Mixed Precision for Efficient Deep Learning Hardware Codesign, https://arxiv.org/abs/2501.18953
- Weiming Hu, Haoyan Zhang, Cong Guo, Yu Feng, Renyang Guan, Zhendong Hua, Zihan Liu, Yue Guan, Minyi Guo, Jingwen Leng, 26 Feb 2025, M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type, https://arxiv.org/abs/2502.18755
- Jude Haris, José Cano, 15 Oct 2025, F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs, https://arxiv.org/abs/2510.13401
- Weihu Wang, Yaqi Xia, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng. 2025. MXBLAS: Accelerating 8-bit Deep Learning with a Unified Micro-Scaled GEMM Library. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '25). Association for Computing Machinery, New York, NY, USA, 1590–1603. https://doi.org/10.1145/3712285.3759809 https://dl.acm.org/doi/full/10.1145/3712285.3759809 (GEMM using "microscaling format" of 8-bit values with scaling factors, with effect similar to block-level mixed-precision quantization and block-floating point numeric formats.)
- Xin Nie, Haicheng Zhang, Liang Dong, Beining Feng, Jinhong Weng, Guiling Sun, 1 Feb 2026, SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models, https://arxiv.org/abs/2602.01027
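As background to the papers above, here is a minimal NumPy sketch of block-wise absmax quantization, the basic scheme underlying per-block scale factors (e.g., as used in Dettmers et al.'s 8-bit optimizers). The function names, block size, and layout are illustrative, not any particular paper's implementation.

```python
import numpy as np

def quantize_blockwise(weights, block_size=64, bits=8):
    """Quantize a 1-D float vector in fixed-size blocks.

    Each block gets its own absmax scale factor, so an outlier
    in one block does not degrade quantization of the others.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for signed 8-bit
    n = len(weights)
    pad = (-n) % block_size
    padded = np.concatenate([weights, np.zeros(pad, dtype=weights.dtype)])
    blocks = padded.reshape(-1, block_size)

    # Per-block absmax scale; guard against all-zero blocks.
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0
    quantized = np.round(blocks / scales * qmax).astype(np.int8)
    return quantized, scales, n

def dequantize_blockwise(quantized, scales, n, bits=8):
    qmax = 2 ** (bits - 1) - 1
    blocks = quantized.astype(np.float32) / qmax * scales
    return blocks.reshape(-1)[:n]

# Example: an outlier only degrades the block that contains it.
rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
w[100] = 50.0  # outlier lands in the second 64-element block
q, s, n = quantize_blockwise(w, block_size=64)
w_hat = dequantize_blockwise(q, s, n)
```

With tensor-wide absmax scaling, the single outlier of 50.0 would stretch the quantization step for every weight; with per-block scales, only the outlier's block pays that cost, which is the accuracy argument made throughout the papers listed above.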
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI, new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications, new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home