Aussie AI

Block-Level Quantization

  • Last Updated 10 March, 2026
  • by David Spuler, Ph.D.

Research on Block-Level Quantization

Research papers include:

  • Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
  • Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
  • Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh, https://openreview.net/pdf?id=OWwc8eOIm8
  • Yanshu Wang, Wenyang He, Tong Yang, 24 May 2024, Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information, https://arxiv.org/abs/2405.17470
  • Haocheng Xi, Yuxiang Chen, Kang Zhao, Kai Jun Teh, Jianfei Chen, Jun Zhu, 21 Jul 2024 (v2), Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization, https://arxiv.org/abs/2403.12422 Code: https://github.com/thu-ml/Jetfire-INT8Training
  • Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang, 15 Apr 2024 (v4), CBQ: Cross-Block Quantization for Large Language Models, https://arxiv.org/abs/2312.07950
  • Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng, Partha Pratim Pande, Janardhan Rao Doppa, Krishnendu Chakrabarty, Hai Li, 27 Oct 2023 (v3), Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators, https://arxiv.org/abs/2310.12182
  • Sebastian Eliassen, Raghavendra Selvan, 16 Jan 2024 (v2), Activation Compression of Graph Neural Networks using Block-wise Quantization with Improved Variance Minimization, https://arxiv.org/abs/2309.11856
  • Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer, 20 Jun 2022 (v2), 8-bit Optimizers via Block-wise Quantization, https://arxiv.org/abs/2110.02861
  • Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung, 6 Sep 2024, OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models, https://arxiv.org/abs/2409.05902
  • W Byun, J Woo, S Mukhopadhyay, 2024, Hardware-friendly Hessian-driven Row-wise Quantization and FPGA Acceleration for Transformer-based Models, https://dl.acm.org/doi/pdf/10.1145/3665314.3670806
  • G. Wang, H. Qin, S. A. Jacobs, C. Holmes, S. Rajbhandari, O. Ruwase, F. Yan, L. Yang, and Y. He, 2023, Zero++: Extremely efficient collective communication for giant model training, arXiv preprint arXiv:2306.10209, 2023. https://arxiv.org/abs/2306.10209
  • David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Alireza Khodamoradi, Kristof Denolf, Eric Dellinger, 15 Oct 2024, Error Diffusion: Post Training Quantization with Block-Scaled Number Formats for Neural Networks, https://arxiv.org/abs/2410.11203 https://github.com/ROCm/tensorcast
  • Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah, 18 Nov 2024, BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration, https://arxiv.org/abs/2411.11745
  • Wonsuk Jang, Thierry Tambe, 2 Jan 2025, BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference, https://arxiv.org/abs/2501.01144 (Per-block granular mixed-precision quantization including FP4.)
  • A. Xu et al., "GausiQ: Generalized Automatic Hybrid-Precision Quantization for MIMO Detection," in IEEE Wireless Communications Letters, doi: 10.1109/LWC.2024.3509269. https://ieeexplore.ieee.org/abstract/document/10839390
  • M Raji, AG Ahsaei, K Soroush, B Ghavami, Jan 2025, Progressive Bitwidth Assignment Approaches for Efficient Capsule Networks Quantization, IEEE Access, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10854429
  • Michael Wu, Arnab Raha, Deepak A. Mathaikutty, Martin Langhammer, Engin Tunali, 31 Jan 2025, StruM: Structured Mixed Precision for Efficient Deep Learning Hardware Codesign, https://arxiv.org/abs/2501.18953
  • Weiming Hu, Haoyan Zhang, Cong Guo, Yu Feng, Renyang Guan, Zhendong Hua, Zihan Liu, Yue Guan, Minyi Guo, Jingwen Leng, 26 Feb 2025, M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type, https://arxiv.org/abs/2502.18755
  • Jude Haris, José Cano, 15 Oct 2025, F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs, https://arxiv.org/abs/2510.13401
  • Weihu Wang, Yaqi Xia, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng. 2025. MXBLAS: Accelerating 8-bit Deep Learning with a Unified Micro-Scaled GEMM Library. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '25). Association for Computing Machinery, New York, NY, USA, 1590–1603. https://doi.org/10.1145/3712285.3759809 https://dl.acm.org/doi/full/10.1145/3712285.3759809 (GEMM using "microscaling format" of 8-bit values with scaling factors, with effect similar to block-level mixed-precision quantization and block-floating point numeric formats.)
  • Xin Nie, Haicheng Zhang, Liang Dong, Beining Feng, Jinhong Weng, Guiling Sun, 1 Feb 2026, SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models, https://arxiv.org/abs/2602.01027
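The common thread in many of the papers above is per-block scaling: a tensor is split into small fixed-size blocks, and each block stores low-bit integers plus its own scale factor, so an outlier value only degrades quantization accuracy within its own block rather than across the whole tensor. As a minimal illustration (a generic absmax block-wise scheme, not the method of any particular paper cited here), block-wise quantization and dequantization can be sketched as:

```python
def quantize_blockwise(values, block_size=64, bits=8):
    """Quantize a flat list of floats in fixed-size blocks.

    Each block stores signed integers plus one per-block scale
    (absmax scaling), so an outlier only affects its own block.
    Returns (list of integer blocks, list of per-block scales).
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit signed
    blocks, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid divide-by-zero
        scale = absmax / qmax
        blocks.append([round(v / scale) for v in block])
        scales.append(scale)
    return blocks, scales

def dequantize_blockwise(blocks, scales):
    """Reconstruct approximate floats from integer blocks and scales."""
    out = []
    for block, scale in zip(blocks, scales):
        out.extend(q * scale for q in block)
    return out
```

The worst-case reconstruction error per element is half of its block's scale (one rounding step), which is why smaller block sizes trade extra scale-storage overhead for better accuracy around outliers.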

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging
