Aussie AI
Transformer Optimization
Last Updated 30 August, 2025
by David Spuler, Ph.D.
The Transformer was invented at Google in 2017 and open-sourced by its research group. It has become the most widely used AI engine architecture, notably underpinning OpenAI's GPT-3 and ChatGPT. Since then, optimization research has taken off. There are two basic ways to optimize Transformer models:
- Transformer architecture improvements: large-scale structural changes to the model.
- Transformer code optimizations: smaller low-level speedups, discussed below.
There are many ways to optimize a Transformer at the code level, and much research has also examined small modifications to the Transformer architecture that improve latency and throughput in both inference and training.
Transformer Inference Optimizations
See also these articles for further information on Transformer inference optimization:
- What's Hot in Inference Optimization?
- 500+ Techniques for LLM Inference Optimization
- Long List of LLM Optimization Techniques
- Inference Optimization Research Blog
Transformer Kernel Code Optimizations
Some of the specific kernel optimizations of inference engines include:
- Attention head caching: Precomputing and caching attention head matrices for already-processed tokens (HuggingFace, 2021). This reduces the auto-regression cost when outputting multiple tokens (the usual case). See also attention head pruning.
- KV caching: Caching the K and V projections of prior tokens during decoding, rather than recomputing them at every step (Intel, 2023). This reduces the number of decoder matrix multiplications; a Python sketch follows this list. See KV caching research.
- Padding byte optimizations: Removing padding in the Feed Forward Network tensor/matrix computations (Intel, 2023; also in ByteTransformer by Zhai et al. (2023)); see "zero padding removal". This reduces the total number of multiplications.
- Attention dimensions: Merging Q, K, and V matrices (of identical size) into a single large matrix for better matrix multiplication throughput (Zhai et al., 2023).
- Operator fusion and reordering: Reordering reshaping and matmul operations (Intel, 2023). This streamlines some of the arithmetic to use more compact low-level libraries; a fusion sketch follows the reference papers below. See kernel fusion optimizations.
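To make the KV caching idea above concrete, here is a minimal single-head decoding sketch in Python/NumPy. It is illustrative only, not any particular engine's implementation; the weight names (W_q, W_k, W_v) and dimensions are assumptions.

```python
import numpy as np

d = 64  # head dimension (illustrative)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = (K @ q) / np.sqrt(d)           # (T,) scores over past tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax
    return weights @ V                      # (d,) output vector

class KVCache:
    """Caches the K and V projections of already-processed tokens, so
    each decoding step projects only the newest token's embedding."""
    def __init__(self):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, x_new):
        # One matrix-vector product per projection, instead of
        # re-projecting the whole sequence at every step.
        self.K = np.vstack([self.K, x_new @ W_k])
        self.V = np.vstack([self.V, x_new @ W_v])
        return attend(x_new @ W_q, self.K, self.V)

cache = KVCache()
for _ in range(5):                          # decode 5 tokens
    out = cache.step(rng.standard_normal(d))
print(out.shape)                            # (64,)
```

Without the cache, step t would redo t projection matmuls for K and V; with the cache, each step costs a constant number of projections, which is where the decoder savings come from.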
Kernel Optimization Research Papers
Reference papers on some of the specific code optimizations in Transformer engines:
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (Contains a number of significant optimizations to the original Transformer architecture.)
- Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task. In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics, https://aclanthology.org/2021.wmt-1.77/, Code: https://github.com/TenTrans/TenTrans
- Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-based Generative Models, Jaewan Choi, Jaehyun Park, Kwanhee Kyung, Nam Sung Kim, and Jung Ho Ahn, IEEE Computer Architecture Letters, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10218731 (Efficient memory storage of K and V vectors in Transformer inference.)
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019, https://arxiv.org/abs/1904.01038, Code: https://github.com/pytorch/fairseq (Includes inference optimizations such as caching model states from previously generated tokens.)
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (This paper avoids zero-padding inputs amongst other optimizations.)
- Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel, 2022, Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey, ACM Computing Surveys, Volume 55, Issue 4, No. 83, pp 1–36 https://doi.org/10.1145/3527156, https://dl.acm.org/doi/10.1145/3527156, https://arxiv.org/abs/2203.08737 (Extensive survey that contains a section on "Memoization" which is caching computed values for later reuse.)
See also general research on code optimizations.
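As a concrete illustration of the operator fusion item above, the sketch below contrasts two unfused operations, which materialize an intermediate array in memory, with a fused single-pass version. This is a conceptual Python analogy under assumed function names; real engines fuse at the compiled CUDA/C++ kernel level, where the win is fewer kernel launches and memory round-trips.

```python
import numpy as np

SQRT_2_OVER_PI = 0.7978845608  # constant in the tanh-GELU approximation

def bias_gelu_unfused(x, b):
    """Two separate 'kernels': the intermediate t is written out
    to memory by the first and read back by the second."""
    t = x + b                                        # kernel 1: bias add
    return 0.5 * t * (1.0 + np.tanh(                 # kernel 2: GELU
        SQRT_2_OVER_PI * (t + 0.044715 * t**3)))

def bias_gelu_fused(x, b):
    """Fused version: one pass over the data, no stored intermediate.
    (Simulated here with a Python loop; a real fused kernel is one launch.)"""
    out = np.empty_like(x)
    for i in range(x.size):
        t = x.flat[i] + b.flat[i % b.size]           # bias add...
        out.flat[i] = 0.5 * t * (1.0 + np.tanh(      # ...and GELU, together
            SQRT_2_OVER_PI * (t + 0.044715 * t**3)))
    return out

x, b = np.random.randn(4, 8), np.random.randn(8)
assert np.allclose(bias_gelu_unfused(x, b), bias_gelu_fused(x, b))
```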
Transformer General Optimizations
Some of the general classes of optimization techniques for the Transformer architecture include:
- Hardware-specific optimizations and low-level libraries (various)
- Model compilation (graph compilers / deep learning compilers)
- Transformer architectures
- Kernel optimizations (i.e. inference engine code optimizations such as caching and kernel fusion).
- Inference optimization techniques (numerous methods)
- Caching of entire query results for re-use across users. This is called an Inference Cache; see the sketch after this list.
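Here is a minimal sketch of an inference cache from the last item above: repeated queries are answered without running the model at all. The normalize() step and the run_model() placeholder are hypothetical, not from any particular serving stack.

```python
from functools import lru_cache

def normalize(query: str) -> str:
    """Canonicalize the query so trivially different wordings still hit."""
    return " ".join(query.lower().split())

def run_model(prompt: str) -> str:
    """Placeholder for the expensive full LLM inference call."""
    return f"<answer to: {prompt}>"

@lru_cache(maxsize=10_000)
def cached_inference(normalized_prompt: str) -> str:
    return run_model(normalized_prompt)

def answer(query: str) -> str:
    return cached_inference(normalize(query))

print(answer("What is a Transformer?"))
print(answer("what  is a  transformer?"))  # cache hit: model not re-run
```

A production inference cache would also need an eviction policy and, for non-identical queries, semantic matching (e.g., embedding similarity) rather than exact string keys.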
Beyond these, here is a longer list of other possible optimizations:
- Model compression
- Quantization (binary, ternary, logarithmic, 2-bit, 3-bit, 4-bit, 8-bit, FP8, FP16, stochastic, etc.); a sketch follows this list
- Pruning (length, width, depth, dual, triple, layer, token, and more)
- Distillation
- Weight sharing
- Fusion: layer fusion, kernel operator fusion
- Skipping: including layer skipping, early exit, zero skipping
- Arithmetic: zero-multiplication, conditional computation, logs, approximations
- Decoding: speculative decoding (sketched below), parallel decoding, aggressive decoding, etc.
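As one concrete example from the list above, here is a minimal sketch of the quantization item: symmetric per-tensor INT8 post-training quantization of a weight matrix. This is a simplification; production quantizers are usually per-channel and handle outliers specially.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor INT8 quantization: W is approx. scale * W_q."""
    scale = np.abs(W).max() / 127.0                  # largest weight -> 127
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale

W = np.random.randn(256, 256).astype(np.float32)
W_q, s = quantize_int8(W)
err = np.abs(W - dequantize(W_q, s)).max()
print(f"Weights are 4x smaller; max absolute error {err:.4f}")
```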
For even more, see inference optimizations, Transformer architectural optimizations, and a complete list of Transformer optimizations.
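Of the decoding techniques above, speculative decoding is perhaps the easiest to sketch: a small draft model proposes several tokens, and the large target model verifies them together. The greedy-verification version below is a simplification (published algorithms verify with rejection sampling over token probabilities), and draft_next/target_argmax are stand-ins for real models.

```python
def speculative_decode(prefix, draft_next, target_argmax, k=4, steps=8):
    """Greedy speculative decoding sketch.
    draft_next(seq): next token from the cheap draft model.
    target_argmax(seq): next token from the expensive target model.
    In a real engine, the k verification calls are one batched pass."""
    seq = list(prefix)
    for _ in range(steps):
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2. Target model verifies; keep the longest agreeing prefix.
        accepted = 0
        for i in range(k):
            if target_argmax(seq + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        seq += proposal[:accepted]
        # 3. The target's own next token comes from the same pass.
        seq.append(target_argmax(seq))
    return seq

# Toy usage: both "models" count upward, so every draft is accepted,
# and each step emits k+1 = 5 tokens for one (batched) target pass.
print(speculative_decode([0], lambda s: s[-1] + 1, lambda s: s[-1] + 1)[:8])
```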
Survey Papers on Transformer Optimization
Review and survey papers on faster Transformer engines:
- Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami, Full stack optimization of transformer inference: a survey, Feb 2023, arXiv:2302.14017, https://arxiv.org/abs/2302.14017
- Nebuly, Full Stack Optimization of Transformer Inference: A Survey (Part 2 on Transformer Optimization), a paper overview, https://www.nebuly.com/blog/full-stack-optimization-of-transformer-inference-a-survey-part-2
- Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey (v2). arXiv preprint arXiv:2009.06732, 2022, https://arxiv.org/abs/2009.06732
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, A Survey of Techniques for Optimizing Transformer Inference, 2023, arxiv.org July 2023, https://arxiv.org/abs/2307.07982
- L Papa, P Russo, I Amerini, L Zhou, Sep 2023, A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking, arXiv preprint arXiv:2309.02031, 2023, https://arxiv.org/abs/2309.02031
- Efficient Attention: Breaking The Quadratic Transformer Bottleneck, 2023 (accessed 8/12/23), https://gwern.net/note/attention, (A regularly updated bibliography of transformer attention optimization papers)
Tips for Transformer Optimization
Articles and papers with general tips on optimizing a Transformer:
- Fabián Varietti, Rodrigo Gallardo, Ian Spektor, Francisco Kurucz, Facundo Parodi, A guide to optimizing Transformer-based models for faster inference, Nov 29, 2022, https://tryolabs.com/blog/2022/11/24/transformer-based-model-for-faster-inference
- Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, Bag of Tricks for Optimizing Transformer Efficiency, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
- Model optimization (TensorFlow), https://www.tensorflow.org/lite/performance/model_optimization
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (Contains a number of significant optimizations to the original Transformer architecture.)
- Weng, Lilian. (Jan 2023). Large Transformer Model Inference Optimization. Lil’Log. https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- Philipp Schmid, Accelerate Sentence Transformers with Hugging Face Optimum, August 2, 2022, https://www.philschmid.de/optimize-sentence-transformers
- Michaël Benesty, Hugging Face Transformer Inference Under 1 Millisecond Latency, Nov 5, 2021 https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c
Research on Specific Fast Transformers
These papers are on new faster Transformer architectures tested by researchers:
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (This paper avoids zero-padding of variable-length inputs and uses fused attention heads with shared parameters.)
- Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya, Reformer: The efficient transformer, In International Conference on Learning Representations, 2020, https://arxiv.org/abs/2001.04451
- J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient gpu serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578
- NVIDIA, NVIDIA FasterTransformer, https://github.com/NVIDIA/FasterTransformer
General Research on Transformer Optimization
These papers review Transformer optimization techniques in general.
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin , James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, "Efficiently Scaling Transformer Inference", arXiv:2211.05102v1 [cs.LG], 9 Nov 2022, https://arxiv.org/abs/2211.05102
- Dave Dice, Alex Kogan, Optimizing Inference Performance of Transformers on CPUs, Feb 2021, https://arxiv.org/abs/2102.06621
- Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. 2020. Deep encoder, shallow decoder: Reevaluating the speed-quality tradeoff in machine translation. CoRR, abs/2006.10369. https://arxiv.org/abs/2006.10369, Code: https://github.com/jungokasai/deep-shallow (Single-layer decoder architecture, see also shallow decoder Transformer architectures inspired by this paper.)
- Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee, Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding, July 2023, https://arxiv.org/abs/2307.05908
- Zining Zhang; Yao Chen; Bingsheng He; Zhenjie Zhang, NIOT: A Novel Inference Optimization of Transformers on Modern CPUs, IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 6, June 2023, pp.1982-1995, https://ieeexplore.ieee.org/abstract/document/10107474
- So, D. R., Mańke, W., Liu, H., Dai, Z., Shazeer, N. M., and Le, Q. V., 2021 (updated Jan 2022), Primer: Searching for efficient transformers for language modeling, ArXiv, abs/2109.08668, https://arxiv.org/abs/2109.08668 Code: https://github.com/google-research/google-research/tree/master/primer (Has a different Transformer architecture, but not a common one.)
- Sukhbaatar, S., Grave, E., Bojanowski, P., and Joulin, A., Adaptive attention span in transformers. In Annual Meeting of the Association for Computational Linguistics, Aug 2019, https://arxiv.org/abs/1905.07799 (Self-adaptive context lengths for attention heads.)
- Bapna, A., Arivazhagan, N., and Firat, O., Controlling computation versus quality for neural sequence models. ArXiv, abs/2002.07106, Apr 2020, https://arxiv.org/abs/2002.07106 (Conditionally controls which subunits of the model can execute.)
- Tri Dao, July 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, https://arxiv.org/abs/2307.08691, Code: https://github.com/Dao-AILab/flash-attention
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 (The original FlashAttention version 1, now superseded by FlashAttention 2.)
- GI Yu, JS Jeong, GW Kim, S Kim, BG Chun, 2022, Orca: A distributed serving system for Transformer-Based generative models, 16th USENIX Symposium, https://www.usenix.org/conference/osdi22/presentation/yu, PDF: https://www.usenix.org/system/files/osdi22-yu.pdf (Improved parallelization/pipelining with latency reduction from iteration-level scheduling across multiple requests.)
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf (Interesting review of safety and bias/fairness issues for models optimized by quantization, pruning or distillation.)
- X Li, B Ren, X Shen, Y Wang, 2022, CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework, arXiv preprint arXiv:2206.10620, https://arxiv.org/abs/2206.10620 (Various optimizations including block pruning and deep reuse.)
Kernel Optimizations
Further research papers on kernel-level optimizations:
- Soroush Ghodrati, Sean Kinzer, Hanyang Xu, Rohan Mahapatra, Yoonsung Kim, Byung Hoon Ahn, Dong Kai Wang, Lavanya Karthikeyan, Amir Yazdanbakhsh, Jongse Park, Nam Sung Kim, Hadi Esmaeilzadeh, April 2024, Tandem processor: Grappling with emerging operators in neural networks, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 1165–1182, https://doi.org/10.1145/3620665.3640365 https://dl.acm.org/doi/abs/10.1145/3620665.3640365 Code: https://actlab-genesys.github.io (Reviews hardware acceleration of all sub-layer kernel operators, with a focus beyond just GEMM/MatMul operators.)
- Make LLM Fine-tuning 2x faster with Unsloth and HF TRL, January 10, 2023, Daniel Han-Chen, https://huggingface.co/blog/unsloth-trl Code: https://github.com/huggingface/blog/blob/main/unsloth-trl.md (Optimizes some PyTorch kernels for back-propagation and reduces memory usage in fine-tuning; currently works with Llama and Mistral architectures.)
- H Shen, H Chang, B Dong, Y Luo, H Meng, Nov 2023, Efficient LLM Inference on CPUs, arXiv preprint arXiv:2311.00502, https://arxiv.org/pdf/2311.00502.pdf Code: https://github.com/intel/intel-extension-for-transformers (INT4 weight quantization with 16-bit activations, and a highly optimized kernel with support for AVX2, AVX512, AVX512_VNNI and Advanced Matrix Extensions (AMX), and KV caching, tested on Llama 2 models from 3B to 20B with 20-80ms latency per token.)
- Piotr Kluska, Adri´an Castello, Florian Scheidegger, A. Cristiano I. Malossi, 2024, QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers https://openaccess.thecvf.com/content/CVPR2024W/eLVM/papers/Kluska_QAttn_Efficient_GPU_Kernels_for_Mixed-precision_Vision_Transformers_CVPRW_2024_paper.pdf
- Christian Szegedy et al., 2015, Going Deeper with Convolutions, http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf (The GoogleNet paper.)
- Benjamin Charlier, Jean Feydy, Joan Alexis Glaunès, François-David Collin, Ghislain Durif, 8 Apr 2021 (v2), Kernel Operations on the GPU, with Autodiff, without Memory Overflows, https://arxiv.org/abs/2004.11127 Code: https://www.kernel-operations.io/keops/index.html
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Alejandro Araya-Núñez, Justin Fernández-Badilla, Daniel González-Vargas, Jimena León-Huertas, Erick-Andrés Obregón-Fonseca, Danny Xie-Li, June 2024, Proposal of an open-source accelerators library for inference of transformer networks in edge devices based on Linux, Tecnología en Marcha, Vol. 37, special issue, IEEE Latin American Electron Devices Conference (LAEDC), pages 118-125, https://doi.org/10.18845/tm.v37i5.7225 PDF: https://revistas.tec.ac.cr/index.php/tec_marcha/article/download/7225/7076
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 5 Jul 2024 (v3), Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041
- Zheming Jin, July 2024, Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL, Oak Ridge National Laboratory, ORNL/TM-2024/3463, https://info.ornl.gov/sites/publications/Files/Pub217394.pdf
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Intel, 2024, Get Started with Intel® oneAPI Math Kernel Library, https://www.intel.com/content/www/us/en/docs/onemkl/get-started-guide/2023-0/overview.html
- T Zhao, 2024, Acceleration of Deep Learning Algorithms with Transformers, https://escholarship.org/uc/item/3419t2z6
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- J. Bi et al., "Efficient and Fast High-performance Library Generation for Deep Learning Accelerators," in IEEE Transactions on Computers, doi: 10.1109/TC.2024.3475575, https://ieeexplore.ieee.org/abstract/document/10707341 (Finding the most efficient kernel.)
- Wei Zhao, Anand Jayarajan, Gennady Pekhimenko, 9 Oct 2024, Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads, https://arxiv.org/abs/2410.07381 (Interleaved scheduling layer for GPU workloads.)
- Byron (Pin-Lun)Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 http://github.com/linkedin/Liger-Kernel
- Mingcong Song, Xinru Tang, Fengfan Hou, Jing Li, Wei Wei, Yipeng Ma, Runqiu Xiao, Hongjie Si, Dingcheng Jiang, Shouyi Yin, Yang Hu, Guoping Long, 24 Dec 2024, Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels, https://arxiv.org/abs/2412.18106
- Andrew Chan, Dec 12, 2024, Fast LLM Inference From Scratch: Pushing single-GPU inference throughput to the edge without libraries, https://andrewkchan.dev/posts/yalm.html
- HF, 2024, TGI v3 overview, https://huggingface.co/docs/text-generation-inference/conceptual/chunking
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
- Runxin Zhong, Yuyang Jin, Chen Zhang, Kinman Lei, Shuangyu Li, and Jidong Zhai. 2025. FlashTensor: Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP '25). Association for Computing Machinery, New York, NY, USA, 183–196. https://doi.org/10.1145/3710848.3710864 https://dl.acm.org/doi/abs/10.1145/3710848.3710864
- Burkhard Ringlein, Thomas Parnell, Radu Stoica, 15 May 2025 (v2), GPU Performance Portability needs Autotuning, https://arxiv.org/abs/2505.03780
- Anne Ouyang and Azalia Mirhoseini and Percy Liang, June 2025, Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet), https://crfm.stanford.edu/2025/05/28/fast-kernels.html
- Aniruddha Nrusimha, William Brandon, Mayank Mishra, Yikang Shen, Rameswar Panda, Jonathan Ragan-Kelley, Yoon Kim, 28 May 2025, FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference, https://arxiv.org/abs/2505.22758 https://github.com/aninrusimha/flashformer (Optimizing kernels for low latency in a single isolated query, not a batch, via kernel fusion and running all components in one kernel, along with programming techniques like metaprogramming.)
- Bonwoo Lee, Cheolwoo Park, Jeongyoun Ahn, 23 Jul 2025, Optimal differentially private kernel learning with random projection, https://arxiv.org/abs/2507.17544
- Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang, 20 Jul 2025, MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation, https://arxiv.org/abs/2507.17773
- Kaizheng Wang, 24 Jul 2025, Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift, https://arxiv.org/abs/2302.10160
- Masaki Adachi, Masahiro Fujisawa, Michael A Osborne, 24 Jul 2025, Fixing the Pitfalls of Probabilistic Time-Series Forecasting Evaluation by Kernel Quadrature, https://arxiv.org/abs/2503.06079
- Daehyeon Baek, Jieun Choi, Jimyoung Son, Kyungmin Bin, Seungbeom Choi, Kihyo Moon, Minsung Jang, Hyojung Lee, 18 Jul 2025, FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration, https://arxiv.org/abs/2505.20839
- Zikai Xie, Linjiang Chen, 18 Jul 2025, Merge Kernel for Bayesian Optimization on Permutation Space, https://arxiv.org/abs/2507.13263
- Jie Wang and March Boedihardjo and Yao Xie, 18 Jul 2025, Statistical and Computational Guarantees of Kernel Max-Sliced Wasserstein Distances, https://arxiv.org/abs/2405.15441
- Berkay Anahtarci, Can Deha Kariksiz, Naci Saldi, 19 Jul 2025, Kernel Based Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games, https://arxiv.org/abs/2507.14529
- Youran Zhou, Mohamed Reda Bouadjenek, Jonathan Wells, Sunil Aryal, 20 Jul 2025, HI-PMK: A Data-Dependent Kernel for Incomplete Heterogeneous Data Representation, https://arxiv.org/abs/2501.04300
- Alexander Rose, Philipp Schaub, Rolf Findeisen, 21 Jul 2025, Safe and High-Performance Learning of Model Predictive Control using Kernel-Based Interpolation, https://arxiv.org/abs/2410.06771
- Sachin Garg, Michał Dereziński, 19 Jul 2025, Faster Low-Rank Approximation and Kernel Ridge Regression via the Block-Nyström Method, https://arxiv.org/abs/2506.17556
- Leonardo V. Santoro, Victor M. Panaretos, 11 Aug 2025, Likelihood Ratio Tests by Kernel Gaussian Embedding, https://arxiv.org/abs/2508.07982
- Martin Rouault, Rémi Bardenet, Mylène Maïda, 9 Aug 2025, Monte Carlo with kernel-based Gibbs measures: Guarantees for probabilistic herding, https://arxiv.org/abs/2402.11736
- Shuyin Xia, Yifan Wang, Lifeng Shen, Guoyin Wang, 11 Aug 2025, Granular-Ball-Induced Multiple Kernel K-Means, https://arxiv.org/abs/2506.18637
- David M. Bossens, Kishor Bharti, and Jayne Thompson, 11 Aug 2025, Quantum Policy Gradient in Reproducing Kernel Hilbert Space, https://arxiv.org/abs/2411.06650
- Antonin Schrab, 8 Aug 2025, A Practical Introduction to Kernel Discrepancies: MMD, HSIC & KSD, https://arxiv.org/abs/2503.04820
- Rajalaxmi Rajagopalan, Yu-Lin Wei, Romit Roy Choudhury, 28 Jul 2025, Kernel Learning for Sample Constrained Black-Box Optimization, https://arxiv.org/abs/2507.20533
- Jagruti Patel, Mikkel Schöttner, Thomas A. W. Bolton, Patric Hagmann (Department of Radiology, Lausanne University Hospital and University of Lausanne (CHUV-UNIL), Lausanne, Switzerland), 28 Jul 2025, Predicting Cognition from fMRI: A Comparative Study of Graph, Transformer, and Kernel Models Across Task and Rest Conditions, https://arxiv.org/abs/2507.21016
- Victor Rielly, Kamel Lahouel, Ethan Lew, Nicholas Fisher, Vicky Haney, Michael Wells, Bruno Jedynak, 25 Jul 2025, MOCK: an Algorithm for Learning Nonparametric Differential Equations via Multivariate Occupation Kernel Functions, https://arxiv.org/abs/2306.10189
- Shervin Rahimzadeh Arashloo, 31 Jul 2025, Manifold-regularised Signature Kernel Large-Margin $\ell_p$-SVDD for Multidimensional Time Series Anomaly Detection, https://arxiv.org/abs/2507.23449
- Piotr Indyk, Michael Kapralov, Kshiteej Sheth, Tal Wagner, 31 Jul 2025, Improved Algorithms for Kernel Matrix-Vector Multiplication Under Sparsity Assumptions, https://arxiv.org/abs/2507.23539
- Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, and Emad Barsoum, 31 Jul 2025, Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks, https://arxiv.org/abs/2507.23194
- Abhinav Das, Stephan Schl\"uter, Lorenz Schneider, 31 Jul 2025, Electricity Price Prediction Using Multi-Kernel Gaussian Process Regression Combined with Kernel-Based Support Vector Regression, https://arxiv.org/abs/2412.00123
- Filippo Utro, Meltem Tolunay, Kahn Rhrissorrakrai, Tanvi P. Gujarati, Jie Shi, Sara Capponi, Mirko Amico, Nate Earnest-Noble, Laxmi Parida, 30 Jul 2025, Enhanced Prediction of CAR T-Cell Cytotoxicity with Quantum-Kernel Methods, https://arxiv.org/abs/2507.22710
- Erwin de Gelder, Maren Buermann, Olaf Op den Camp, 30 Jul 2025, Comparing Normalizing Flows with Kernel Density Estimation in Estimating Risk of Automated Driving Systems, https://arxiv.org/abs/2507.22429
- Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Hongming Zhang, Haitao Mi, Dong Yu, 1 Aug 2025, Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training, https://arxiv.org/abs/2508.00414
- Rajpreet Singh, Vidhi Kothari, 1 Aug 2025, Composable OS Kernel Architectures for Autonomous Intelligence, https://arxiv.org/abs/2508.00604
- Joon-Hyun Park, Mujin Cheon, Dong-Yeun Koh, 4 Aug 2025, BOOST: Bayesian Optimization with Optimal Kernel and Acquisition Function Selection Technique, https://arxiv.org/abs/2508.02332
- Andrea Gayon-Lombardo, Ehecatl A. del Rio-Chanona, Catalina A. Pino-Munoz, Nigel P. Brandon, 7 Jun 2025, Deep Kernel Bayesian Optimisation for Closed-Loop Electrode Microstructure Design with User-Defined Properties based on GANs, https://arxiv.org/abs/2508.00833
- Haoquan Lu, Hanzhe Liang, Jie Zhang, Chenxi Hu, Jinbao Wang, Can Gao, 2 Aug 2025, C3D-AD: Toward Continual 3D Anomaly Detection via Kernel Attention with Learnable Advisor, https://arxiv.org/abs/2508.01311
- Sadegh Ebrahimkhani and John Lataire, 2 Aug 2025, Kernel-Based Sparse Additive Nonlinear Model Structure Detection through a Linearization Approach, https://arxiv.org/abs/2508.01453
- Nicolas Langrené, Xavier Warin, Pierre Gruet, 3 Aug 2025, Fast Gaussian process inference by exact Matérn kernel decomposition, https://arxiv.org/abs/2508.01864
- Qian Tang, Yuwen Gu, Boxiang Wang, 12 Aug 2025, fastkqr: A Fast Algorithm for Kernel Quantile Regression, https://arxiv.org/abs/2408.05393
- Wouter M. Kouw, 13 Aug 2025, Bayesian autoregression to optimize temporal Matérn kernel Gaussian process hyperparameters, https://arxiv.org/abs/2508.09792
- Yuan-Hao Wei, Fu-Hao Deng, Lin-Yong Cui, Yan-Jie Sun, 13 Aug 2025, Structured Kernel Regression VAE: A Computationally Efficient Surrogate for GP-VAEs in ICA, https://arxiv.org/abs/2508.09721
- Xing Liu, François-Xavier Briol, 12 Aug 2025, On the Robustness of Kernel Goodness-of-Fit Tests, https://arxiv.org/abs/2408.05854
- Paul Dommel and Rajmadan Lakshmanan, 15 Aug 2025, Uniform convergence for Gaussian kernel ridge regression, https://arxiv.org/abs/2508.11274
- Zhan Yu, Zhongjie Shi, Ding-Xuan Zhou, 15 Aug 2025, Theory of Decentralized Robust Kernel-Based Learning, https://arxiv.org/abs/2506.05215
- Hongyu Lin, Yuchen Li, Haoran Luo, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu, 18 Aug 2025, OS-R1: Agentic Operating System Kernel Tuning with Reinforcement Learning, https://arxiv.org/abs/2508.12551
- Iam Kim de S. Hermont, Andre R. Flores and Rodrigo C. de Lamare, 18 Aug 2025, Design and Analysis of Robust Adaptive Filtering with the Hyperbolic Tangent Exponential Kernel M-Estimator Function for Active Noise Control, https://arxiv.org/abs/2508.13018
- Rahul Singh and Suhas Vijaykumar, 18 Aug 2025, Kernel Ridge Regression Inference, https://arxiv.org/abs/2302.06578
- Hengrui Luo and Yunzhang Zhu, 16 Aug 2025, Asymptotic Optimism of Random-Design Linear and Kernel Regression Models, https://arxiv.org/abs/2502.12999
- Anabel Yong, 12 Aug 2025, Multi-Objective Bayesian Optimization with Independent Tanimoto Kernel Gaussian Processes for Diverse Pareto Front Exploration, https://arxiv.org/abs/2508.14072
- Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan, 20 Aug 2025, Learnable Kernel Density Estimation for Graphs, https://arxiv.org/abs/2505.21285
- Yijin Ni and Xiaoming Huo, 20 Aug 2025, Kernel-based Equalized Odds: A Quantification of Accuracy-Fairness Trade-off in Fair Representation Learning, https://arxiv.org/abs/2508.15084
- Reilly Haskins and Benjamin Adams, 21 Aug 2025, KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis, https://arxiv.org/abs/2507.03847
- Francesca Bartolucci, Ernesto De Vito, Lorenzo Rosasco, Stefano Vigogna, 21 Aug 2025, Neural reproducing kernel Banach spaces and representer theorems for deep networks, https://arxiv.org/abs/2403.08750
- Pietro Fré, Federico Milanesio, Marcelo Oyarzo, Matteo Santoro and Mario Trigiante, 22 Aug 2025, Tessellation Groups, Harmonic Analysis on Non-compact Symmetric Spaces and the Heat Kernel in view of Cartan Convolutional Neural Networks, https://arxiv.org/abs/2508.16015
- Jamal Hwaidi and Mohamed Chahine Ghanem, 22 Aug 2025, Motor Imagery EEG Signal Classification Using Minimally Random Convolutional Kernel Transform and Hybrid Deep Learning, https://arxiv.org/abs/2508.16179
- Martin Andrews, Sam Witteveen, 22 Aug 2025, GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization, https://arxiv.org/abs/2506.20807
- Ran Yan, Youhe Jiang, Binhang Yuan, 25 Aug 2025, Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel, https://arxiv.org/abs/2508.18224
- Akira Tamamori, 24 Aug 2025, Kernel Ridge Regression for Efficient Learning of High-Capacity Hopfield Networks, https://arxiv.org/abs/2504.12561
- Kyung-hwan Lee and Kyung-tae Kim, 21 Jul 2025, Semantic-Aware Gaussian Process Calibration with Structured Layerwise Kernels for Deep Neural Networks, https://arxiv.org/abs/2507.15987
- Rahul Khorana, 22 Jul 2025, Families of Optimal Transport Kernels for Cell Complexes, https://arxiv.org/abs/2507.16569
- Jun'ichi Takeuchi, Yoshinari Takeishi, Noboru Murata, Kazushi Mimura, Ka Long Keith Ho, Hiroshi Nagaoka, 24 Jul 2025, Neural Tangent Kernels and Fisher Information Matrices for Simple ReLU Networks with Random Hidden Weights, https://arxiv.org/abs/2507.18555
- Yaniv Shulman, 20 Jul 2025, Robust Local Polynomial Regression with Similarity Kernels, https://arxiv.org/abs/2501.10729
- Jie Hu, Yi-Ting Ma, Do Young Eun, 27 Jul 2025, Beyond Self-Repellent Kernels: History-Driven Target Towards Efficient Nonlinear MCMC on General Graphs, https://arxiv.org/abs/2505.18300
- Roberto Flórez-Ablan, Marco Roth, and Jan Schnabel, 28 Jul 2025, On the similarity of bandwidth-tuned quantum kernels and classical kernels, https://arxiv.org/abs/2503.05602
- Christian Wald and Gabriele Steidl, 2 Aug 2025, Flow Matching: Markov Kernels, Stochastic Processes and Transport Plans, https://arxiv.org/abs/2501.16839
- Max Guillen, Philipp Misof, Jan E. Gerken, 15 Aug 2025, Finite-Width Neural Tangent Kernels from Feynman Diagrams, https://arxiv.org/abs/2508.11522
- Nan-Hong Kuo, Renata Wong, 16 Feb 2025, SVM/SVR Kernels as Quantum Propagators, https://arxiv.org/abs/2502.11153
- Patrick J.F. Groenen and Michael Greenacre, 21 Aug 2025, Interpretable Kernels, https://arxiv.org/abs/2508.15932
- Ana Martínez-Sabiote, Michalis Skotiniotis, Jara J. Bermejo-Vega, Daniel Manzano, Carlos Cano, 25 Aug 2025, Entanglement Detection with Quantum-inspired Kernels and SVMs, https://arxiv.org/abs/2508.17909
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI, new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications, new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research
Read more about:
- List of AI Optimizations
- Transformer architectures
- Inference Optimizations
- Shallow decoder architecture
- Inference Cache
- Zero-Multiplication Models
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Loop Optimizations
- Code Optimizations