Aussie AI

Positional Encoding Optimization

Last Updated 25 April, 2026

by David Spuler, Ph.D.

Positional Encoding (PE) is the algorithm whereby relative positional information about the placements of words in relation to each other is encoded into "embeddings" that are input into the AI model. The term is often used synonymously with "positional embeddings", but technically, positional encoding is the algorithm (i.e. code) used to create a vector of positional embeddings (i.e. data).

The positional encoding algorithm was one of the important parts of the vanilla 2017 Transformer architecture, which used a sinusoidal positional encoding. Various attempts have been made to try other methods of positional encoding, and to optimize them both in terms of perplexity (prediction accuracy) and computation speed. Positional encoding is not usually a major CPU bottleneck, but it can nevertheless be optimized via improved algorithms, approximations (including integer-only versions), and surprisely, by removing PE entirely with a "NoPE" algorithm.

Positional Encoding: Book Excerpts and Blog Articles

Free online book excerpts with full text chapters online and free PDF downloads, and the Aussie AI blog, including related articles:

David Spuler, Ph.D., Feb 6th, 2026 (updated), 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
David Spuler, March 2024, Chapter 20. Attention, in book "Generative AI in C++", https://www.aussieai.com/book/ch20-attention
David Spuler, March 2024, Chapter 27. Tokenizer and Vocabulary, in book "Generative AI in C++", https://www.aussieai.com/book/ch27-tokenizer-embedding
David Spuler, March 2024, Generative AI in C++: Coding Transformers and LLMs, https://www.aussieai.com/book/toc PDF: https://www.aussieai.com/pdf/BOOK-Generative-AI-CPP-Spuler-2024.pdf

Research on Positional Encoding Optimizations

Research on faster position encoding algorithms includes:

Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (Includes a section discussing positional embedding improvements.)
Galina Alperovich, May 16, 2023, The Secret Sauce behind 100K context window in LLMs: all tricks in one place, https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c (Identifies three bottlenecks including positional embeddings.)
Meta, August 24, 2023, Introducing Code Llama, a state-of-the-art large language model for coding, Meta Blog, https://ai.meta.com/blog/code-llama-large-language-model-coding/ (100k context window size for Meta's Code Llama model.)
Mike Young, Sep 22, 2023, LongLoRA: A New, More Efficient Way to Fine-Tune LLMs, https://notes.aimodels.fyi/longlora-a-new-efficient-fine-tuning-of-long-context-llms/
Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang, Sep 2023, LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models, https://arxiv.org/abs/2308.16137 (Has a useful review of the positional encoding issues.)
Guolin Ke, Di He, and Tie-Yan Liu. Rethinking positional encoding in language pre-training. In International Conference on Learning Representations, March 2021. https://arxiv.org/abs/2006.15595
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu, Aug 2022, Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, https://arxiv.org/abs/2104.09864
Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge, May 2023, Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis, https://arxiv.org/abs/2212.10356
Mingxu Tao, Yansong Feng, Dongyan Zhao, May 2023, A Frustratingly Easy Improvement for Position Embeddings via Random Padding, https://arxiv.org/abs/2305.04859
Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with different positional encoding methods.)
Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. 2020. Encoding word order in complex embeddings. In Proceedings of ICLR. https://openreview.net/forum?id=Hke-WTVtwr, https://arxiv.org/abs/1912.12333 (Replaces positional encoding with continuous word functions.)
Georgi Gerganov, June 2023 Extending context size via RoPE scaling #1965, Llama.cpp project, https://github.com/ggerganov/llama.cpp/discussions/1965
H Jin, X Han, J Yang, Z Jiang, CY Chang, X Hu, Oct 2023, GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length, arXiv preprint arXiv:2310.00576, https://browse.arxiv.org/pdf/2310.00576.pdf (Uses RoPE for long contexts in training.)
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used RoPE embeddings.)
Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, Tom Goldstein, 27 May 2024, Transformers Can Do Arithmetic with the Right Embeddings, https://arxiv.org/abs/2405.17399 (Positional encoding of numeric digits improves math arithmetic accuracy.)
Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
A Haviv, O Ram, O Press, P Izsak, O Levy, 2022, Transformer language models without positional encodings still learn positional information, https://arxiv.org/abs/2203.16634
Karim Lasri, Alessandro Lenci, Thierry Poibeau, Nov 2022, Word Order Matters when you Increase Masking, https://arxiv.org/abs/2211.04427
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, May 2023, Let's Verify Step by Step, https://arxiv.org/abs/2305.20050
Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, Kentaro Inui, Nov 2021, SHAPE: Shifted Absolute Position Embedding for Transformers, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, https://aclanthology.org/2021.emnlp-main.266/ PDF: https://aclanthology.org/2021.emnlp-main.266.pdf
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=YicbFdNTTy
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations, September 2019. https://openreview.net/forum?id=H1eA7AEtvS
Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer Quality in Linear Time. In Proceedings of the 39th International Conference on Machine Learning, pp. 9099–9117. PMLR, June 2022. https://proceedings.mlr.press/v162/hua22a.html
Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. Rethinking Embedding Coupling in Pre-trained Language Models. In International Conference on Learning Representations, September 2020. https://openreview.net/forum?id=xpFFI_NtgpW
Ruiqing Yan, Xingbo Du, Haoyu Deng, Linghan Zheng, Qiuzhuang Sun, Jifang Hu, Yuhang Shao, Penghao Jiang, Jinrong Jiang, Lian Zhao, 3 Jul 2024 (v2), Unveiling and Controlling Anomalous Attention Distribution in Transformers, https://arxiv.org/abs/2407.01601 (Examination of why the very first token in a sequence always gets more attention than others, including the effect of positional encoding, and its impact on KV cache compression.)
Aziz Belaweid, Mar 31, 2024, Complete Summary of Absolute, Relative And Rotary Position Embeddings! https://generativeai.pub/complete-summary-of-absolute-relative-and-rotary-position-embeddings-e2775f663088
18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang, 21 Aug 2025, Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding, https://arxiv.org/abs/2501.00712
Sophie Ostmeier, Brian Axelrod, Maya Varma, Michael E. Moseley, Akshay Chaudhari, Curtis Langlotz, 16 Aug 2025, LieRE: Lie Rotational Positional Encodings, https://arxiv.org/abs/2406.10322
Chang Dai, Hongyu Shan, Mingyang Song, Di Liang, 5 Sep 2025, HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models, https://arxiv.org/abs/2509.05218
Avinash Amballa, 23 Aug 2025, CoPE: A Lightweight Complex Positional Encoding, https://arxiv.org/abs/2508.18308
Nabil Jabareen, Dongsheng Yuan, Dingming Liu, Foo-Wei Ten, S\"oren Lukassen, 2 Sep 2025, Anisotropic Fourier Features for Positional Encoding in Medical Imaging, https://arxiv.org/abs/2509.02488
Habib Irani and Vangelis Metsis, 18 Sep 2025, Positional Encoding in Transformer-Based Time Series Models: A Survey, https://arxiv.org/abs/2502.12370
Yu (Sid) Wang, Sheng Shen, R\'emi Munos, Hongyuan Zhan, Yuandong Tian, 16 Sep 2025, Positional Encoding via Token-Aware Phase Attention, https://arxiv.org/abs/2509.12635
Habib Irani, Vangelis Metsis, 18 Sep 2025, DyWPE: Signal-Aware Dynamic Wavelet Positional Encoding for Time Series Transformers, https://arxiv.org/abs/2509.14640
Rui Melo, Rui Abreu, Corina S. Pasareanu, 1 Oct 2025, Microsaccade-Inspired Probing: Positional Encoding Perturbations Reveal LLM Misbehaviours, https://arxiv.org/abs/2510.01288
Raahul Krishna Durairaju (1), K. Saruladha (2) ((1) California State University, Fullerton, (2) Puducherry Technological University), 2 Oct 2025, PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning, https://arxiv.org/abs/2510.01715
Juan Amboage, Ernst R\"oell, Patrick Schnider, Bastian Rieck, 1 Oct 2025, LEAP: Local ECT-Based Learnable Positional Encodings for Graphs, https://arxiv.org/abs/2510.00757
Kaichen Xu, Yihang Du, Mianpeng Liu, Zimu Yu, Xiaobo Sun, 23 Sep 2025, Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features, https://arxiv.org/abs/2509.16629
Xingjian Tao, Yiwei Wang, Yujun Cai, Yihong Luo, Jing Tang, 25 Oct 2025, Mitigating Coordinate Prediction Bias from Positional Encoding Failures, https://arxiv.org/abs/2510.22102
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinsk\"u, 25 Sep 2025, Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation, https://arxiv.org/abs/2505.22842
Jiawei Xu, Chia Xin Liang, Ziqian Bi, Xiaoming Li, Danyang Zhang, Zhenyu Yu, Dec 2025, A Comprehensive Survey on Large Language Models: From Pre-training to Autonomous Agents, https://www.researchgate.net/profile/Ziqian_Bi/publication/399059225_A_Comprehensive_Survey_on_Large_Language_Models_From_Pre-training_to_Autonomous_Agents/links/694c94a07e61d05b5312836f/A-Comprehensive-Survey-on-Large-Language-Models-From-Pre-training-to-Autonomous-Agents.pdf
David Spuler, Ph.D., Feb 6th, 2026 (updated), 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
Anand Gopalakrishnan, Robert Csord\'as, J\"urgen Schmidhuber, Michael C. Mozer, 5 Sep 2025, Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings, https://arxiv.org/abs/2509.10534
Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han, 4 Oct 2025, Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models, https://arxiv.org/abs/2505.16416
David Spuler, March 2024, Chapter 20. Attention, in book "Generative AI in C++", https://www.aussieai.com/book/ch20-attention
David Spuler, March 2024, Chapter 27. Tokenizer and Vocabulary, in book "Generative AI in C++", https://www.aussieai.com/book/ch27-tokenizer-embedding
David Spuler, March 2024, Generative AI in C++: Coding Transformers and LLMs, https://www.aussieai.com/book/toc PDF: https://www.aussieai.com/pdf/BOOK-Generative-AI-CPP-Spuler-2024.pdf

Pruning Positional Encoding ("NoPE")

Whereas positional encoding methods were important in the paper on the vanilla 2017 Transformer (Vaswani et al, 2017), some recent research suggests they could be removed entirely (Kazemnejad et al, 2023).

Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. May 2023. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466, https://arxiv.org/abs/2305.19466 (Evaluates various positional encoding algorithms in decoder-only Transformers, including none, which they styled "NoPE".)
Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with "implicit" positional encodings.)
Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021. Conditional Positional Encodings for Vision Transformers. arXiv:2102.10882 [cs.CV] https://arxiv.org/abs/2102.10882
Zhiwei Wang, Yao Ma, Zitao Liu, and Jiliang Tang. 2019. R-Transformer: Recurrent Neural Network Enhanced Transformer. CoRR abs/1907.05572 (2019). arXiv:1907.05572 https://arxiv.org/abs/1907.05572, Code: https://github.com/DSE-MSU/R-transformer
18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
Kazuki Irie, 31 Dec 2024, Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? Revisiting a Petroglyph, https://arxiv.org/abs/2501.00659
Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Sebastian Raschka, Jul 19, 2025, The Big LLM Architecture Comparison: From DeepSeek-V3 to Kimi K2: A Look At Modern LLM Architecture Design, https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison
Sesame Disk, Apr 2026, LLM Architecture Gallery 2026: Top Model Designs Explained, https://sesamedisk.com/llm-architecture-gallery-2026/

RoPE (Rotary Positional Encoding)

Research papers on RoPE:

Jesus Rodriguez, Apr 22, 2024, Some Technical Notes About Llama 3: New tokenizer, optimized pretraining and some other details about Meta AI’s new model, Towards AI, https://pub.towardsai.net/some-technical-notes-about-llama-3-042c0b19db14
Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun, 20 Mar 2024, Rotary Position Embedding for Vision Transformer, https://arxiv.org/abs/2403.13298 Code: https://github.com/naver-ai/rope-vit
Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu, 14 Jun 2024, GEB-1.3B: Open Lightweight Large Language Model, https://arxiv.org/abs/2406.09900 Code: https://huggingface.co/GEB-AGI/geb-1.3b
Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, NVIDIA Blog, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/ (NVIDIA releases Nemotron-4 340B model, under an open source license, for the creation of synthetic data, with a decoder-only architecture using grouped-query attention and RoPE.)
NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024. https://arxiv.org/abs/2405.04434 Code: https://github.com/deepseek-ai/DeepSeek-V2 (Introduces various architectural optimizations, notably RoPE handling and KV cache compression via low-rank matrices.)
kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
Aziz Belaweid, Mar 31, 2024, Complete Summary of Absolute, Relative And Rotary Position Embeddings! https://generativeai.pub/complete-summary-of-absolute-relative-and-rotary-position-embeddings-e2775f663088
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
Seungrok Jung., 15, Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
Papers With Code, 2024, Relative Position Encodings, https://paperswithcode.com/method/relative-position-encodings
Hugging Face, 2024, Optimizing LLMs for Speed and Memory, https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization
Nils Graef, Matthew Clapp, Andrew Wasielewski, 12 Jul 2024, Flash normalization: fast RMSNorm for LLMs, https://arxiv.org/abs/2407.09577 Code: https://huggingface.co/open-machine/FlashNorm
Nils Graef, Aarush Gupta, 2024 (accessed), Approximate attention: infinite context length with constant complexity per token [work in progress], https://github.com/OpenMachine-ai/transformer-tricks/blob/main/approximate.pdf
Zihao Ye,, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
Byron (Pin-Lun)Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 http://github.com/linkedin/Liger-Kernel
NVIDIA, Dec 2024, Multi-Head, Multi-Query, and Group-Query Attention, https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#kv-cache
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli, 19 Dec 2024 (v2)], Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, https://arxiv.org/abs/2412.13663 (Encoder-only BERT model updated with modern optimizations including Flash attention, bias removal, RoPE, pre-norm, and GeGLU, a GELU varaint, hybrid local-global attention, and zero padding removal.)
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi, 31 Dec 2024, 2 OLMo 2 Furious, https://arxiv.org/abs/2501.00656
Kazuki Irie, 31 Dec 2024, Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? Revisiting a Petroglyph, https://arxiv.org/abs/2501.00659
Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
Y. Jeon et al., "RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models" in IEEE Computer Architecture Letters, vol. , no. 01, pp. 1-4, PrePrints 5555, doi: 10.1109/LCA.2025.3535470. https://www.computer.org/csdl/journal/ca/5555/01/10856355/23Sakp1WnDi
Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, Kehong Yuan, 25 Jan 2025, RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations, https://arxiv.org/abs/2501.16383 (INT2 KV caching with special handling of outliers, RoPE, and attention sinks, and the resulting architecture works in Chain-of-Thought.)
Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang, 3 Feb 2025, Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding, https://arxiv.org/abs/2502.01563 https://github.com/MingyuJ666/Rope_with_LLM (Finds that outliers in attention are important, and arise by being generated by RoPE.)
Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang, 27 Feb 2025, LongRoPE2: Near-Lossless LLM Context Window Scaling, https://arxiv.org/abs/2502.20082 https://github.com/microsoft/LongRoPE (Addresses RopE issues with long context optimization.)
Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Asif Razzaq, March 5, 2025, Qwen Releases QwQ-32B: A 32B Reasoning Model that Achieves Significantly Enhanced Performance in Downstream Task, https://www.marktechpost.com/2025/03/05/qwen-releases-qwq-32b-a-32b-reasoning-model-that-achieves-significantly-enhanced-performance-in-downstream-task/ (Features 32B parameters, 32K context length, 64 layers, RoPE, SwiGLU, RMSNorm, and attention enhancements.)
Ali Veisi, Amir Mansourian, 11 Mar 2025, Context-aware Biases for Length Extrapolation, https://arxiv.org/abs/2503.08067 https://github.com/axiomlab/Cable
Daehyeon Baek, Jieun Choi, Jimyoung Son, Kyungmin Bin, Seungbeom Choi, Kihyo Moon, Minsung Jang, Hyojung Lee, 18 Jul 2025, FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration, https://arxiv.org/abs/2505.20839
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole, Nov 2023, YaRN: Efficient Context Window Extension of Large Language Models, https://arxiv.org/abs/2309.00071 Code: https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k
J. Su. 2023, Rectified rotary position embeddings. https://github.com/bojone/rerope
Stephen Diehl, 2025, Attention Wasn't All We Needed, https://www.stephendiehl.com/posts/post_transformers/
Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, Jing Liu, 19 Aug 2025, VRoPE: Rotary Position Embedding for Video Large Language Models, https://arxiv.org/abs/2502.11664
Chang Dai, Hongyu Shan, Mingyang Song, Di Liang, 5 Sep 2025, HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models, https://arxiv.org/abs/2509.05218
Hyeongju Kim, Juheon Lee, Jinhyeok Yang, Jacob Morton, 14 Sep 2025, Length-Aware Rotary Position Embedding for Text-Speech Alignment, https://arxiv.org/abs/2509.11084
Ye Qiao, Sitao Huang, 17 Sep 2025, Q-ROAR: Outlier-Aware Rescaling for RoPE Position Interpolation in Quantized Long-Context LLMs, https://arxiv.org/abs/2509.14391
Ye Qiao, Haocheng Xu, Xiaofan Zhang, Sitao Huang, 26 Sep 2025, Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling, https://arxiv.org/abs/2510.00028
Isaac Reid, Arijit Sehanobish, Cedrik H\"ofs, Bruno Mlodozeniec, Leonhard Vulpius, Federico Barbero, Adrian Weller, Krzysztof Choromanski, Richard E. Turner, Petar Veli\v{c}kovi\'c, 26 Sep 2025, Wavelet-Induced Rotary Encodings: RoPE Meets Graphs, https://arxiv.org/abs/2509.22259
Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun, 26 Sep 2025, Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning, https://arxiv.org/abs/2509.22601
Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi, 25 Sep 2025, Behind RoPE: How Does Causal Mask Encode Positional Information?, https://arxiv.org/abs/2509.21042
Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han, 4 Oct 2025, Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models, https://arxiv.org/abs/2505.16416
Pranamya Kulkarni, Puranjay Datta, Burak Var{\i}c{\i}, Emre Acart\"urk, Karthikeyan Shanmugam, Ali Tajer, 23 Oct 2025, ROPES: Robotic Pose Estimation via Score-Based Causal Representation Learning, https://arxiv.org/abs/2510.20884
Sebastian Raschka, PhD, Dec 18, 2025 (updated), The Big LLM Architecture Comparison: From DeepSeek V3 to Mistral 3 Large: A Look At Modern LLM Architecture Design, https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison
Jiawei Xu, Chia Xin Liang, Ziqian Bi, Xiaoming Li, Danyang Zhang, Zhenyu Yu, Dec 2025, A Comprehensive Survey on Large Language Models: From Pre-training to Autonomous Agents, https://www.researchgate.net/profile/Ziqian_Bi/publication/399059225_A_Comprehensive_Survey_on_Large_Language_Models_From_Pre-training_to_Autonomous_Agents/links/694c94a07e61d05b5312836f/A-Comprehensive-Survey-on-Large-Language-Models-From-Pre-training-to-Autonomous-Agents.pdf
Devansh, Apr 2026, Google’s Gemma 4 is Weirder than you Realize: The architecture matters more than the numbers. Here’s what Google actually built, https://machine-learning-made-simple.medium.com/googles-gemma-4-is-weirder-than-you-realize-17d00d95b0d5
Sesame Disk, Apr 2026, LLM Architecture Gallery 2026: Top Model Designs Explained, https://sesamedisk.com/llm-architecture-gallery-2026/
Liger-Kernel: Efficient Triton Kernels for LLM Training, Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, Zhipeng Wang, July 2025 https://openreview.net/forum?id=36SjAIT42G https://openreview.net/pdf?id=36SjAIT42G