Aussie AI
Transformer Optimization
Last Updated 30 August, 2025
by David Spuler, Ph.D.
The Transformer was invented at Google in 2017 and open-sourced by its research group. It has become the most widely used AI engine architecture, notably underpinning OpenAI's GPT-3 and ChatGPT. Since then, optimization research has taken off. There are two basic ways to optimize Transformer models:
- Transformer architecture improvements: large-scale structural changes to the model.
- Transformer code optimizations: smaller low-level speedups, discussed below.
There are many ways to optimize a Transformer at the code level, and much research has also examined small modifications to the Transformer architecture that improve latency and throughput in both inference and training.
Transformer Inference Optimizations
See also these articles for further information on Transformer inference optimization:
- What's Hot in Inference Optimization?
- 500+ Techniques for LLM Inference Optimization
- Long List of LLM Optimization Techniques
- Inference Optimization Research Blog
Transformer Kernel Code Optimizations
Some of the specific kernel optimizations of inference engines include:
- Attention head caching: Precomputing and caching attention head matrices for already-processed tokens (HuggingFace, 2021). This reduces the auto-regression cost when outputting multiple tokens (the usual case). See also attention head pruning.
- KV caching: Caching the K and V projections of prior tokens during decoding, rather than recomputing them at every step (Intel, 2023). This reduces the number of decoder matrix multiplications; a Python sketch follows this list. See KV caching research.
- Padding byte optimizations: Removing padding in the Feed Forward Network tensor/matrix computations (Intel, 2023; also in ByteTransformer by Zhai et al. (2023)); see "zero padding removal". This reduces the total number of multiplications.
- Attention dimensions: Merging Q, K, and V matrices (of identical size) into a single large matrix for better matrix multiplication throughput (Zhai et al., 2023).
- Operator fusion and reordering: Reordering reshaping and matmul operations (Intel, 2023). This streamlines some of the arithmetic to use more compact low-level libraries; a fusion sketch follows the reference papers below. See kernel fusion optimizations.
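To make the KV caching idea above concrete, here is a minimal single-head decoding sketch in Python/NumPy. It is illustrative only, not any particular engine's implementation; the weight names (W_q, W_k, W_v) and dimensions are assumptions.

```python
import numpy as np

d = 64  # head dimension (illustrative)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = (K @ q) / np.sqrt(d)           # (T,) scores over past tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax
    return weights @ V                      # (d,) output vector

class KVCache:
    """Caches the K and V projections of already-processed tokens, so
    each decoding step projects only the newest token's embedding."""
    def __init__(self):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, x_new):
        # One matrix-vector product per projection, instead of
        # re-projecting the whole sequence at every step.
        self.K = np.vstack([self.K, x_new @ W_k])
        self.V = np.vstack([self.V, x_new @ W_v])
        return attend(x_new @ W_q, self.K, self.V)

cache = KVCache()
for _ in range(5):                          # decode 5 tokens
    out = cache.step(rng.standard_normal(d))
print(out.shape)                            # (64,)
```

Without the cache, step t would redo t projection matmuls for K and V; with the cache, each step costs a constant number of projections, which is where the decoder savings come from.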
Kernel Optimization Research Papers
Reference papers on some of the specific code optimizations in Transformer engines:
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (Contains a number of significant optimizations to the original Transformer architecture.)
- Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task. In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics, https://aclanthology.org/2021.wmt-1.77/, Code: https://github.com/TenTrans/TenTrans
- Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-based Generative Models, Jaewan Choi, Jaehyun Park, Kwanhee Kyung, Nam Sung Kim, and Jung Ho Ahn, IEEE Computer Architecture Letters, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10218731 (Efficient memory storage of K and V vectors in Transformer inference.)
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019, https://arxiv.org/abs/1904.01038, Code: https://github.com/pytorch/fairseq (Includes inference optimizations such as caching model states from previously generated tokens.)
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (This paper avoids zero-padding inputs amongst other optimizations.)
- Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel, 2022, Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey, ACM Computing Surveys, Volume 55, Issue 4, No. 83, pp 1–36 https://doi.org/10.1145/3527156, https://dl.acm.org/doi/10.1145/3527156, https://arxiv.org/abs/2203.08737 (Extensive survey that contains a section on "Memoization" which is caching computed values for later reuse.)
See also general research on code optimizations.
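As a concrete illustration of the operator fusion item above, the sketch below contrasts two unfused operations, which materialize an intermediate array in memory, with a fused single-pass version. This is a conceptual Python analogy under assumed function names; real engines fuse at the compiled CUDA/C++ kernel level, where the win is fewer kernel launches and memory round-trips.

```python
import numpy as np

SQRT_2_OVER_PI = 0.7978845608  # constant in the tanh-GELU approximation

def bias_gelu_unfused(x, b):
    """Two separate 'kernels': the intermediate t is written out
    to memory by the first and read back by the second."""
    t = x + b                                        # kernel 1: bias add
    return 0.5 * t * (1.0 + np.tanh(                 # kernel 2: GELU
        SQRT_2_OVER_PI * (t + 0.044715 * t**3)))

def bias_gelu_fused(x, b):
    """Fused version: one pass over the data, no stored intermediate.
    (Simulated here with a Python loop; a real fused kernel is one launch.)"""
    out = np.empty_like(x)
    for i in range(x.size):
        t = x.flat[i] + b.flat[i % b.size]           # bias add...
        out.flat[i] = 0.5 * t * (1.0 + np.tanh(      # ...and GELU, together
            SQRT_2_OVER_PI * (t + 0.044715 * t**3)))
    return out

x, b = np.random.randn(4, 8), np.random.randn(8)
assert np.allclose(bias_gelu_unfused(x, b), bias_gelu_fused(x, b))
```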
Transformer General Optimizations
Some of the general classes of optimization techniques for the Transformer architecture include:
- Hardware-specific optimizations and low-level libraries (various)
- Model compilation (graph compilers / deep learning compilers)
- Transformer architectures
- Kernel optimizations (i.e. inference engine code optimizations such as caching and kernel fusion).
- Inference optimization techniques (numerous methods)
- Caching of entire query results for re-use across users. This is called an Inference Cache; see the sketch after this list.
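Here is a minimal sketch of an inference cache from the last item above: repeated queries are answered without running the model at all. The normalize() step and the run_model() placeholder are hypothetical, not from any particular serving stack.

```python
from functools import lru_cache

def normalize(query: str) -> str:
    """Canonicalize the query so trivially different wordings still hit."""
    return " ".join(query.lower().split())

def run_model(prompt: str) -> str:
    """Placeholder for the expensive full LLM inference call."""
    return f"<answer to: {prompt}>"

@lru_cache(maxsize=10_000)
def cached_inference(normalized_prompt: str) -> str:
    return run_model(normalized_prompt)

def answer(query: str) -> str:
    return cached_inference(normalize(query))

print(answer("What is a Transformer?"))
print(answer("what  is a  transformer?"))  # cache hit: model not re-run
```

A production inference cache would also need an eviction policy and, for non-identical queries, semantic matching (e.g., embedding similarity) rather than exact string keys.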
Beyond these, here is a longer list of other possible optimizations:
- Model compression
- Quantization (binary, ternary, logarithmic, 2-bit, 3-bit, 4-bit, 8-bit, FP8, FP16, stochastic, etc.); a sketch follows this list
- Pruning (length, width, depth, dual, triple, layer, token, and more)
- Distillation
- Weight sharing
- Fusion: layer fusion, kernel operator fusion
- Skipping: including layer skipping, early exit, zero skipping
- Arithmetic: zero-multiplication, conditional computation, logs, approximations
- Decoding: speculative decoding (sketched below), parallel decoding, aggressive decoding, etc.
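As one concrete example from the list above, here is a minimal sketch of the quantization item: symmetric per-tensor INT8 post-training quantization of a weight matrix. This is a simplification; production quantizers are usually per-channel and handle outliers specially.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor INT8 quantization: W is approx. scale * W_q."""
    scale = np.abs(W).max() / 127.0                  # largest weight -> 127
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale

W = np.random.randn(256, 256).astype(np.float32)
W_q, s = quantize_int8(W)
err = np.abs(W - dequantize(W_q, s)).max()
print(f"Weights are 4x smaller; max absolute error {err:.4f}")
```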
For even more, see inference optimizations, Transformer architectural optimizations, and a complete list of Transformer optimizations.
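Of the decoding techniques above, speculative decoding is perhaps the easiest to sketch: a small draft model proposes several tokens, and the large target model verifies them together. The greedy-verification version below is a simplification (published algorithms verify with rejection sampling over token probabilities), and draft_next/target_argmax are stand-ins for real models.

```python
def speculative_decode(prefix, draft_next, target_argmax, k=4, steps=8):
    """Greedy speculative decoding sketch.
    draft_next(seq): next token from the cheap draft model.
    target_argmax(seq): next token from the expensive target model.
    In a real engine, the k verification calls are one batched pass."""
    seq = list(prefix)
    for _ in range(steps):
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2. Target model verifies; keep the longest agreeing prefix.
        accepted = 0
        for i in range(k):
            if target_argmax(seq + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        seq += proposal[:accepted]
        # 3. The target's own next token comes from the same pass.
        seq.append(target_argmax(seq))
    return seq

# Toy usage: both "models" count upward, so every draft is accepted,
# and each step emits k+1 = 5 tokens for one (batched) target pass.
print(speculative_decode([0], lambda s: s[-1] + 1, lambda s: s[-1] + 1)[:8])
```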
Survey Papers on Transformer Optimization
Review and survey papers on faster Transformer engines:
- Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami, Full stack optimization of transformer inference: a survey, Feb 2023, arXiv:2302.14017, https://arxiv.org/abs/2302.14017
- Nebuly, Full Stack Optimization of Transformer Inference: A Survey (Part 2 on Transformer Optimization), a paper overview, https://www.nebuly.com/blog/full-stack-optimization-of-transformer-inference-a-survey-part-2
- Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey (v2). arXiv preprint arXiv:2009.06732, 2022, https://arxiv.org/abs/2009.06732
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, A Survey of Techniques for Optimizing Transformer Inference, 2023, arxiv.org July 2023, https://arxiv.org/abs/2307.07982
- L Papa, P Russo, I Amerini, L Zhou, Sep 2023, A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking, arXiv preprint arXiv:2309.02031, 2023, https://arxiv.org/abs/2309.02031
- Efficient Attention: Breaking The Quadratic Transformer Bottleneck, 2023 (accessed 8/12/23), https://gwern.net/note/attention, (A regularly updated bibliography of transformer attention optimization papers)
Tips for Transformer Optimization
Articles and papers with general tips on optimizing a Transformer:
- Fabián Varietti, Rodrigo Gallardo, Ian Spektor, Francisco Kurucz, Facundo Parodi, A guide to optimizing Transformer-based models for faster inference, Nov 29, 2022, https://tryolabs.com/blog/2022/11/24/transformer-based-model-for-faster-inference
- Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, Bag of Tricks for Optimizing Transformer Efficiency, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
- Model optimization (TensorFlow), https://www.tensorflow.org/lite/performance/model_optimization
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (Contains a number of significant optimizations to the original Transformer architecture.)
- Weng, Lilian. (Jan 2023). Large Transformer Model Inference Optimization. Lil’Log. https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- Philipp Schmid, Accelerate Sentence Transformers with Hugging Face Optimum, August 2, 2022, https://www.philschmid.de/optimize-sentence-transformers
- Michaël Benesty, Hugging Face Transformer Inference Under 1 Millisecond Latency, Nov 5, 2021 https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c
Research on Specific Fast Transformers
These papers are on new faster Transformer architectures tested by researchers:
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (This paper avoids zero-padding of variable-length inputs and uses fused attention heads with shared parameters.)
- Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya, Reformer: The efficient transformer, In International Conference on Learning Representations, 2020, https://arxiv.org/abs/2001.04451
- J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient gpu serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578
- NVIDIA, NVIDIA FasterTransformer, https://github.com/NVIDIA/FasterTransformer
General Research on Transformer Optimization
These papers review Transformer optimization techniques in general.
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin , James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, "Efficiently Scaling Transformer Inference", arXiv:2211.05102v1 [cs.LG], 9 Nov 2022, https://arxiv.org/abs/2211.05102
- Dave Dice, Alex Kogan, Optimizing Inference Performance of Transformers on CPUs, Feb 2021, https://arxiv.org/abs/2102.06621
- Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. 2020. Deep encoder, shallow decoder: Reevaluating the speed-quality tradeoff in machine translation. CoRR, abs/2006.10369. https://arxiv.org/abs/2006.10369, Code: https://github.com/jungokasai/deep-shallow (Single-layer decoder architecture, see also shallow decoder Transformer architectures inspired by this paper.)
- Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee, Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding, July 2023, https://arxiv.org/abs/2307.05908
- Zining Zhang; Yao Chen; Bingsheng He; Zhenjie Zhang, NIOT: A Novel Inference Optimization of Transformers on Modern CPUs, IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 6, June 2023, pp.1982-1995, https://ieeexplore.ieee.org/abstract/document/10107474
- So, D. R., Mańke, W., Liu, H., Dai, Z., Shazeer, N. M., and Le, Q. V., 2021 (updated Jan 2022), Primer: Searching for efficient transformers for language modeling, ArXiv, abs/2109.08668, https://arxiv.org/abs/2109.08668 Code: https://github.com/google-research/google-research/tree/master/primer (Has a different Transformer architecture, but not a common one.)
- Sukhbaatar, S., Grave, E., Bojanowski, P., and Joulin, A., Adaptive attention span in transformers. In Annual Meeting of the Association for Computational Linguistics, Aug 2019, https://arxiv.org/abs/1905.07799 (Self-adaptive context lengths for attention heads.)
- Bapna, A., Arivazhagan, N., and Firat, O., Controlling computation versus quality for neural sequence models. ArXiv, abs/2002.07106, Apr 2020, https://arxiv.org/abs/2002.07106 (Conditionally controls which subunits of the model can execute.)
- Tri Dao, July 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, https://arxiv.org/abs/2307.08691, Code: https://github.com/Dao-AILab/flash-attention
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 (The original FlashAttention version 1, now superseded by FlashAttention 2.)
- GI Yu, JS Jeong, GW Kim, S Kim, BG Chun, 2022, Orca: A distributed serving system for Transformer-Based generative models, 16th USENIX Symposium, https://www.usenix.org/conference/osdi22/presentation/yu, PDF: https://www.usenix.org/system/files/osdi22-yu.pdf (Improved parallelization/pipelining with latency reduction from iteration-level scheduling across multiple requests.)
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf (Interesting review of safety and bias/fairness issues for models optimized by quantization, pruning or distillation.)
- X Li, B Ren, X Shen, Y Wang, 2022, CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework, arXiv preprint arXiv:2206.10620, https://arxiv.org/abs/2206.10620 (Various optimizations including block pruning and deep reuse.)
Kernel Optimizations
Further research papers on kernel-level optimizations:
- Soroush Ghodrati, Sean Kinzer, Hanyang Xu, Rohan Mahapatra, Yoonsung Kim, Byung Hoon Ahn, Dong Kai Wang, Lavanya Karthikeyan, Amir Yazdanbakhsh, Jongse Park, Nam Sung Kim, Hadi Esmaeilzadeh, April 2024, Tandem processor: Grappling with emerging operators in neural networks, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 1165–1182, https://doi.org/10.1145/3620665.3640365 https://dl.acm.org/doi/abs/10.1145/3620665.3640365 Code: https://actlab-genesys.github.io (Reviews hardware acceleration of all sub-layer kernel operators, with a focus beyond just GEMM/MatMul operators.)
- Make LLM Fine-tuning 2x faster with Unsloth and HF TRL, January 10, 2023, Daniel Han-Chen, https://huggingface.co/blog/unsloth-trl Code: https://github.com/huggingface/blog/blob/main/unsloth-trl.md (Optimizes some PyTorch kernels for back-propagation and reduces memory usage in fine-tuning; currently works with Llama and Mistral architectures.)
- H Shen, H Chang, B Dong, Y Luo, H Meng, Nov 2023, Efficient LLM Inference on CPUs, arXiv preprint arXiv:2311.00502, https://arxiv.org/pdf/2311.00502.pdf Code: https://github.com/intel/intel-extension-for-transformers (INT4 weight quantization with 16-bit activations, and a highly optimized kernel with support for AVX2, AVX512, AVX512_VNNI and Advanced Matrix Extensions (AMX), and KV caching, tested on Llama 2 models from 3B to 20B with 20-80ms latency per token.)
- Piotr Kluska, Adri´an Castello, Florian Scheidegger, A. Cristiano I. Malossi, 2024, QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers https://openaccess.thecvf.com/content/CVPR2024W/eLVM/papers/Kluska_QAttn_Efficient_GPU_Kernels_for_Mixed-precision_Vision_Transformers_CVPRW_2024_paper.pdf
- Christian Szegedy et al., 2015, Going Deeper with Convolutions, http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf (The GoogleNet paper.)
- Benjamin Charlier, Jean Feydy, Joan Alexis Glaunès, François-David Collin, Ghislain Durif, 8 Apr 2021 (v2), Kernel Operations on the GPU, with Autodiff, without Memory Overflows, https://arxiv.org/abs/2004.11127 Code: https://www.kernel-operations.io/keops/index.html
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Alejandro Araya-Núñez, Justin Fernández-Badilla, Daniel González-Vargas, Jimena León-Huertas, Erick-Andrés Obregón-Fonseca, Danny Xie-Li, June 2024, Proposal of an open-source accelerators library for inference of transformer networks in edge devices based on Linux, Tecnología en Marcha, Vol. 37, special issue, IEEE Latin American Electron Devices Conference (LAEDC), pages 118-125, https://doi.org/10.18845/tm.v37i5.7225 PDF: https://revistas.tec.ac.cr/index.php/tec_marcha/article/download/7225/7076
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 5 Jul 2024 (v3), Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041
- Zheming Jin, July 2024, Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL, Oak Ridge National Laboratory, ORNL/TM-2024/3463, https://info.ornl.gov/sites/publications/Files/Pub217394.pdf
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Intel, 2024, Get Started with Intel® oneAPI Math Kernel Library, https://www.intel.com/content/www/us/en/docs/onemkl/get-started-guide/2023-0/overview.html
- T Zhao, 2024, Acceleration of Deep Learning Algorithms with Transformers, https://escholarship.org/uc/item/3419t2z6
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- J. Bi et al., "Efficient and Fast High-performance Library Generation for Deep Learning Accelerators," in IEEE Transactions on Computers, doi: 10.1109/TC.2024.3475575, https://ieeexplore.ieee.org/abstract/document/10707341 (Finding the most efficient kernel.)
- Wei Zhao, Anand Jayarajan, Gennady Pekhimenko, 9 Oct 2024, Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads, https://arxiv.org/abs/2410.07381 (Interleaved scheduling layer for GPU workloads.)
- Byron (Pin-Lun)Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 http://github.com/linkedin/Liger-Kernel
- Mingcong Song, Xinru Tang, Fengfan Hou, Jing Li, Wei Wei, Yipeng Ma, Runqiu Xiao, Hongjie Si, Dingcheng Jiang, Shouyi Yin, Yang Hu, Guoping Long, 24 Dec 2024, Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels, https://arxiv.org/abs/2412.18106
- Andrew Chan, Dec 12, 2024, Fast LLM Inference From Scratch: Pushing single-GPU inference throughput to the edge without libraries, https://andrewkchan.dev/posts/yalm.html
- HF, 2024, TGI v3 overview, https://huggingface.co/docs/text-generation-inference/conceptual/chunking
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
- Runxin Zhong, Yuyang Jin, Chen Zhang, Kinman Lei, Shuangyu Li, and Jidong Zhai. 2025. FlashTensor: Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP '25). Association for Computing Machinery, New York, NY, USA, 183–196. https://doi.org/10.1145/3710848.3710864 https://dl.acm.org/doi/abs/10.1145/3710848.3710864
- Burkhard Ringlein, Thomas Parnell, Radu Stoica, 15 May 2025 (v2), GPU Performance Portability needs Autotuning, https://arxiv.org/abs/2505.03780
- Anne Ouyang and Azalia Mirhoseini and Percy Liang, June 2025, Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet), https://crfm.stanford.edu/2025/05/28/fast-kernels.html
- Aniruddha Nrusimha, William Brandon, Mayank Mishra, Yikang Shen, Rameswar Panda, Jonathan Ragan-Kelley, Yoon Kim, 28 May 2025, FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference, https://arxiv.org/abs/2505.22758 https://github.com/aninrusimha/flashformer (Optimizing kernels for low latency in a single isolated query, not a batch, via kernel fusion and running all components in one kernel, along with programming techniques like metaprogramming.)
- Bonwoo Lee, Cheolwoo Park, Jeongyoun Ahn, 23 Jul 2025, Optimal differentially private kernel learning with random projection, https://arxiv.org/abs/2507.17544
- Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang, 20 Jul 2025, MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation, https://arxiv.org/abs/2507.17773
- Kaizheng Wang, 24 Jul 2025, Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift, https://arxiv.org/abs/2302.10160
- Masaki Adachi, Masahiro Fujisawa, Michael A Osborne, 24 Jul 2025, Fixing the Pitfalls of Probabilistic Time-Series Forecasting Evaluation by Kernel Quadrature, https://arxiv.org/abs/2503.06079
- Daehyeon Baek, Jieun Choi, Jimyoung Son, Kyungmin Bin, Seungbeom Choi, Kihyo Moon, Minsung Jang, Hyojung Lee, 18 Jul 2025, FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration, https://arxiv.org/abs/2505.20839
- Zikai Xie, Linjiang Chen, 18 Jul 2025, Merge Kernel for Bayesian Optimization on Permutation Space, https://arxiv.org/abs/2507.13263
- Jie Wang and March Boedihardjo and Yao Xie, 18 Jul 2025, Statistical and Computational Guarantees of Kernel Max-Sliced Wasserstein Distances, https://arxiv.org/abs/2405.15441
- Berkay Anahtarci, Can Deha Kariksiz, Naci Saldi, 19 Jul 2025, Kernel Based Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games, https://arxiv.org/abs/2507.14529
- Youran Zhou, Mohamed Reda Bouadjenek, Jonathan Wells, Sunil Aryal, 20 Jul 2025, HI-PMK: A Data-Dependent Kernel for Incomplete Heterogeneous Data Representation, https://arxiv.org/abs/2501.04300
- Alexander Rose, Philipp Schaub, Rolf Findeisen, 21 Jul 2025, Safe and High-Performance Learning of Model Predictive Control using Kernel-Based Interpolation, https://arxiv.org/abs/2410.06771
- Sachin Garg, Michał Dereziński, 19 Jul 2025, Faster Low-Rank Approximation and Kernel Ridge Regression via the Block-Nyström Method, https://arxiv.org/abs/2506.17556
- Leonardo V. Santoro, Victor M. Panaretos, 11 Aug 2025, Likelihood Ratio Tests by Kernel Gaussian Embedding, https://arxiv.org/abs/2508.07982
- Martin Rouault, Rémi Bardenet, Mylène Maïda, 9 Aug 2025, Monte Carlo with kernel-based Gibbs measures: Guarantees for probabilistic herding, https://arxiv.org/abs/2402.11736
- Shuyin Xia, Yifan Wang, Lifeng Shen, Guoyin Wang, 11 Aug 2025, Granular-Ball-Induced Multiple Kernel K-Means, https://arxiv.org/abs/2506.18637
- David M. Bossens, Kishor Bharti, and Jayne Thompson, 11 Aug 2025, Quantum Policy Gradient in Reproducing Kernel Hilbert Space, https://arxiv.org/abs/2411.06650
- Antonin Schrab, 8 Aug 2025, A Practical Introduction to Kernel Discrepancies: MMD, HSIC & KSD, https://arxiv.org/abs/2503.04820
- Rajalaxmi Rajagopalan, Yu-Lin Wei, Romit Roy Choudhury, 28 Jul 2025, Kernel Learning for Sample Constrained Black-Box Optimization, https://arxiv.org/abs/2507.20533
- Jagruti Patel, Mikkel Schöttner, Thomas A. W. Bolton, Patric Hagmann (Department of Radiology, Lausanne University Hospital and University of Lausanne (CHUV-UNIL), Lausanne, Switzerland), 28 Jul 2025, Predicting Cognition from fMRI: A Comparative Study of Graph, Transformer, and Kernel Models Across Task and Rest Conditions, https://arxiv.org/abs/2507.21016
- Victor Rielly, Kamel Lahouel, Ethan Lew, Nicholas Fisher, Vicky Haney, Michael Wells, Bruno Jedynak, 25 Jul 2025, MOCK: an Algorithm for Learning Nonparametric Differential Equations via Multivariate Occupation Kernel Functions, https://arxiv.org/abs/2306.10189
- Shervin Rahimzadeh Arashloo, 31 Jul 2025, Manifold-regularised Signature Kernel Large-Margin $\ell_p$-SVDD for Multidimensional Time Series Anomaly Detection, https://arxiv.org/abs/2507.23449
- Piotr Indyk, Michael Kapralov, Kshiteej Sheth, Tal Wagner, 31 Jul 2025, Improved Algorithms for Kernel Matrix-Vector Multiplication Under Sparsity Assumptions, https://arxiv.org/abs/2507.23539
- Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, and Emad Barsoum, 31 Jul 2025, Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks, https://arxiv.org/abs/2507.23194
- Abhinav Das, Stephan Schl\"uter, Lorenz Schneider, 31 Jul 2025, Electricity Price Prediction Using Multi-Kernel Gaussian Process Regression Combined with Kernel-Based Support Vector Regression, https://arxiv.org/abs/2412.00123
- Filippo Utro, Meltem Tolunay, Kahn Rhrissorrakrai, Tanvi P. Gujarati, Jie Shi, Sara Capponi, Mirko Amico, Nate Earnest-Noble, Laxmi Parida, 30 Jul 2025, Enhanced Prediction of CAR T-Cell Cytotoxicity with Quantum-Kernel Methods, https://arxiv.org/abs/2507.22710
- Erwin de Gelder, Maren Buermann, Olaf Op den Camp, 30 Jul 2025, Comparing Normalizing Flows with Kernel Density Estimation in Estimating Risk of Automated Driving Systems, https://arxiv.org/abs/2507.22429
- Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Hongming Zhang, Haitao Mi, Dong Yu, 1 Aug 2025, Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training, https://arxiv.org/abs/2508.00414
- Rajpreet Singh, Vidhi Kothari, 1 Aug 2025, Composable OS Kernel Architectures for Autonomous Intelligence, https://arxiv.org/abs/2508.00604
- Joon-Hyun Park, Mujin Cheon, Dong-Yeun Koh, 4 Aug 2025, BOOST: Bayesian Optimization with Optimal Kernel and Acquisition Function Selection Technique, https://arxiv.org/abs/2508.02332
- Andrea Gayon-Lombardo, Ehecatl A. del Rio-Chanona, Catalina A. Pino-Munoz, Nigel P. Brandon, 7 Jun 2025, Deep Kernel Bayesian Optimisation for Closed-Loop Electrode Microstructure Design with User-Defined Properties based on GANs, https://arxiv.org/abs/2508.00833
- Haoquan Lu, Hanzhe Liang, Jie Zhang, Chenxi Hu, Jinbao Wang, Can Gao, 2 Aug 2025, C3D-AD: Toward Continual 3D Anomaly Detection via Kernel Attention with Learnable Advisor, https://arxiv.org/abs/2508.01311
- Sadegh Ebrahimkhani and John Lataire, 2 Aug 2025, Kernel-Based Sparse Additive Nonlinear Model Structure Detection through a Linearization Approach, https://arxiv.org/abs/2508.01453
- Nicolas Langrené, Xavier Warin, Pierre Gruet, 3 Aug 2025, Fast Gaussian process inference by exact Matérn kernel decomposition, https://arxiv.org/abs/2508.01864
- Qian Tang, Yuwen Gu, Boxiang Wang, 12 Aug 2025, fastkqr: A Fast Algorithm for Kernel Quantile Regression, https://arxiv.org/abs/2408.05393
- Wouter M. Kouw, 13 Aug 2025, Bayesian autoregression to optimize temporal Matérn kernel Gaussian process hyperparameters, https://arxiv.org/abs/2508.09792
- Yuan-Hao Wei, Fu-Hao Deng, Lin-Yong Cui, Yan-Jie Sun, 13 Aug 2025, Structured Kernel Regression VAE: A Computationally Efficient Surrogate for GP-VAEs in ICA, https://arxiv.org/abs/2508.09721
- Xing Liu, François-Xavier Briol, 12 Aug 2025, On the Robustness of Kernel Goodness-of-Fit Tests, https://arxiv.org/abs/2408.05854
- Paul Dommel and Rajmadan Lakshmanan, 15 Aug 2025, Uniform convergence for Gaussian kernel ridge regression, https://arxiv.org/abs/2508.11274
- Zhan Yu, Zhongjie Shi, Ding-Xuan Zhou, 15 Aug 2025, Theory of Decentralized Robust Kernel-Based Learning, https://arxiv.org/abs/2506.05215
- Hongyu Lin, Yuchen Li, Haoran Luo, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu, 18 Aug 2025, OS-R1: Agentic Operating System Kernel Tuning with Reinforcement Learning, https://arxiv.org/abs/2508.12551
- Iam Kim de S. Hermont, Andre R. Flores and Rodrigo C. de Lamare, 18 Aug 2025, Design and Analysis of Robust Adaptive Filtering with the Hyperbolic Tangent Exponential Kernel M-Estimator Function for Active Noise Control, https://arxiv.org/abs/2508.13018
- Rahul Singh and Suhas Vijaykumar, 18 Aug 2025, Kernel Ridge Regression Inference, https://arxiv.org/abs/2302.06578
- Hengrui Luo and Yunzhang Zhu, 16 Aug 2025, Asymptotic Optimism of Random-Design Linear and Kernel Regression Models, https://arxiv.org/abs/2502.12999
- Anabel Yong, 12 Aug 2025, Multi-Objective Bayesian Optimization with Independent Tanimoto Kernel Gaussian Processes for Diverse Pareto Front Exploration, https://arxiv.org/abs/2508.14072
- Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan, 20 Aug 2025, Learnable Kernel Density Estimation for Graphs, https://arxiv.org/abs/2505.21285
- Yijin Ni and Xiaoming Huo, 20 Aug 2025, Kernel-based Equalized Odds: A Quantification of Accuracy-Fairness Trade-off in Fair Representation Learning, https://arxiv.org/abs/2508.15084
- Reilly Haskins and Benjamin Adams, 21 Aug 2025, KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis, https://arxiv.org/abs/2507.03847
- Francesca Bartolucci, Ernesto De Vito, Lorenzo Rosasco, Stefano Vigogna, 21 Aug 2025, Neural reproducing kernel Banach spaces and representer theorems for deep networks, https://arxiv.org/abs/2403.08750
- Pietro Fré, Federico Milanesio, Marcelo Oyarzo, Matteo Santoro and Mario Trigiante, 22 Aug 2025, Tessellation Groups, Harmonic Analysis on Non-compact Symmetric Spaces and the Heat Kernel in view of Cartan Convolutional Neural Networks, https://arxiv.org/abs/2508.16015
- Jamal Hwaidi and Mohamed Chahine Ghanem, 22 Aug 2025, Motor Imagery EEG Signal Classification Using Minimally Random Convolutional Kernel Transform and Hybrid Deep Learning, https://arxiv.org/abs/2508.16179
- Martin Andrews, Sam Witteveen, 22 Aug 2025, GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization, https://arxiv.org/abs/2506.20807
- Ran Yan, Youhe Jiang, Binhang Yuan, 25 Aug 2025, Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel, https://arxiv.org/abs/2508.18224
- Akira Tamamori, 24 Aug 2025, Kernel Ridge Regression for Efficient Learning of High-Capacity Hopfield Networks, https://arxiv.org/abs/2504.12561
- Kyung-hwan Lee and Kyung-tae Kim, 21 Jul 2025, Semantic-Aware Gaussian Process Calibration with Structured Layerwise Kernels for Deep Neural Networks, https://arxiv.org/abs/2507.15987
- Rahul Khorana, 22 Jul 2025, Families of Optimal Transport Kernels for Cell Complexes, https://arxiv.org/abs/2507.16569
- Jun'ichi Takeuchi, Yoshinari Takeishi, Noboru Murata, Kazushi Mimura, Ka Long Keith Ho, Hiroshi Nagaoka, 24 Jul 2025, Neural Tangent Kernels and Fisher Information Matrices for Simple ReLU Networks with Random Hidden Weights, https://arxiv.org/abs/2507.18555
- Yaniv Shulman, 20 Jul 2025, Robust Local Polynomial Regression with Similarity Kernels, https://arxiv.org/abs/2501.10729
- Jie Hu, Yi-Ting Ma, Do Young Eun, 27 Jul 2025, Beyond Self-Repellent Kernels: History-Driven Target Towards Efficient Nonlinear MCMC on General Graphs, https://arxiv.org/abs/2505.18300
- Roberto Flórez-Ablan, Marco Roth, and Jan Schnabel, 28 Jul 2025, On the similarity of bandwidth-tuned quantum kernels and classical kernels, https://arxiv.org/abs/2503.05602
- Christian Wald and Gabriele Steidl, 2 Aug 2025, Flow Matching: Markov Kernels, Stochastic Processes and Transport Plans, https://arxiv.org/abs/2501.16839
- Max Guillen, Philipp Misof, Jan E. Gerken, 15 Aug 2025, Finite-Width Neural Tangent Kernels from Feynman Diagrams, https://arxiv.org/abs/2508.11522
- Nan-Hong Kuo, Renata Wong, 16 Feb 2025, SVM/SVR Kernels as Quantum Propagators, https://arxiv.org/abs/2502.11153
- Patrick J.F. Groenen and Michael Greenacre, 21 Aug 2025, Interpretable Kernels, https://arxiv.org/abs/2508.15932
- Ana Martínez-Sabiote, Michalis Skotiniotis, Jara J. Bermejo-Vega, Daniel Manzano, Carlos Cano, 25 Aug 2025, Entanglement Detection with Quantum-inspired Kernels and SVMs, https://arxiv.org/abs/2508.17909
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI, new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications, new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research
Read more about:
- List of AI Optimizations
- Transformer architectures
- Inference Optimizations
- Shallow decoder architecture
- Inference Cache
- Zero-Multiplication Models
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Loop Optimizations
- Code Optimizations