Aussie AI

Parameter Sharing

  • Last Updated 10 May, 2026
  • by David Spuler, Ph.D.

What is Parameter Sharing?

Parameter sharing, also called "weight sharing", is the use of the same parameters by different structures of the Transformer. Parameter sharing and pruning are similar techniques, both being forms of model compression, but they are not the same. Pruning avoids doing some computations, whereas parameter sharing still does all the computations, but with shared parameters (reducing the total number of stored weights).

Each layer of the Transformer typically has its own set of weights for each structure. When the same set of weights is used across multiple layers, this is a type of layer fusion, and is conceptually similar to layer pruning. However, note that layer pruning reduces the number of layers that are executed, whereas layerwise parameter sharing does not (although the two ideas can be combined).
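As a minimal sketch (a toy two-dimensional model, not any specific paper's method), layerwise sharing can be expressed by making every layer slot reference the same stored weight object, while each layer's computation still runs:

```python
# Toy sketch of layerwise parameter sharing (illustrative only).
# A "layer" here is just one weight matrix applied as a linear map.  With
# sharing, every layer slot references the *same* weight object, so storage
# shrinks while the per-layer computations remain unchanged.

def linear(weights, x):
    """Multiply a weight matrix (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

num_layers = 6
shared_weights = [[0.5, 0.1], [0.2, 0.4]]   # the single stored weight set

# Unshared: six independent copies.  Shared: six references to one object.
unshared = [[row[:] for row in shared_weights] for _ in range(num_layers)]
shared = [shared_weights] * num_layers

x = [1.0, 2.0]
for layer_weights in shared:                 # all six layers still execute
    x = linear(layer_weights, x)

assert all(w is shared_weights for w in shared)   # one object, reused
assert unshared[0] is not unshared[1]             # separate stored copies
```

The unshared variant stores six weight matrices; the shared variant stores one, yet both perform six matrix-vector multiplications.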

Parameter sharing reduces the total number of weights to be stored, thereby reducing model size. Since loading weights from memory for arithmetic operations is itself costly (sometimes called "overhead"), and Transformer inference is often memory-bound rather than CPU-bound (i.e., in the decoding phase), sharing the data can also sometimes reduce latency and improve inference throughput, even though it does not reduce the number of computations.
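Back-of-envelope arithmetic illustrates the storage saving; the shapes below are illustrative (roughly GPT-2-small-sized), not measurements of any particular model:

```python
# Parameter counts with and without layerwise sharing (illustrative shapes).
d_model, d_ffn, num_layers = 768, 3072, 12

per_layer = (4 * d_model * d_model      # Q, K, V and output projections
             + 2 * d_model * d_ffn)     # FFN up- and down-projections

unshared_total = num_layers * per_layer  # each layer stores its own weights
shared_total = per_layer                 # one weight set serves all layers

print(per_layer, unshared_total // shared_total)
```

With full layerwise sharing, the per-layer weights are stored once, so the attention and FFN weight count drops by a factor equal to the number of layers (embedding weights are excluded here, as they are not shared this way).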

Training time can also be improved by parameter sharing, as there are fewer parameters to train. Obviously, this architecture requires a non-standard extension to the normal Transformer training algorithms.

Types of Parameter Sharing

Parameters can be shared for structures such as:

  • Layer fusion (sharing all weights in a layer or "layerwise parameter sharing")
  • Attention head fusion (a type of "width-wise parameter sharing")
  • Feed-forward network (FFN) parameter sharing
  • Lengthwise parameter sharing (see token pruning and token merging)
  • KV cache data layer fusion

A more granular variant shares weights only for subcomponents within a layer, rather than every weight in the layer. If the FFN weights are shared, this is similar to FFN pruning; likewise, sharing attention head weights is akin to head pruning.

The ideas of layer fusion can also be applied to the KV cache data layers in the method of KV cache layer fusion, a type of KV cache compression that reduces the size of the KV cache. It is also possible to fuse the KV heads, using weight sharing along the width dimension.
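One way to picture KV cache layer fusion is pairs of adjacent layers reading and writing a single shared cache slot. The grouping below (every second layer) is a hypothetical scheme in the spirit of cross-layer KV sharing, not a specific paper's algorithm:

```python
# Minimal sketch of KV cache layer fusion: adjacent layers share a cache slot.
num_layers = 8
group = 2                                  # layers per shared KV slot

kv_slots = [{"K": [], "V": []} for _ in range(num_layers // group)]
layer_cache = [kv_slots[i // group] for i in range(num_layers)]

# Layer 0 stores a (key, value) pair; layer 1 reads it with no extra storage.
layer_cache[0]["K"].append([0.1, 0.2])
layer_cache[0]["V"].append([0.3, 0.4])

assert layer_cache[1] is layer_cache[0]    # same underlying cache slot
assert len(kv_slots) == num_layers // 2    # cache depth halved
```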

Parameter Sharing: Book Excerpts and Blog Articles

Free online book excerpts with full-text chapters and free PDF downloads are available, along with related articles on the Aussie AI blog:

Whole Model Weight Sharing

Large-scale parameter sharing can occur for an entire set of weights in an LLM. The following techniques aren't really considered to be types of parameter sharing, and yet, they really should be!

Layer Fusion

Layer fusion is the sharing of weights across entire layers of a model. See also layer pruning. Research papers include:

KV Cache Layer Fusion

Layer fusion can also be applied to the KV cache, sharing cached data rather than weights across layers. Read more about KV caching optimizations.
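The potential saving is easy to estimate. The sketch below uses illustrative 7B-class shapes with fp16 KV entries; the halving assumes a hypothetical scheme that shares KV data across every pair of layers:

```python
# Rough KV cache sizing arithmetic (illustrative shapes, fp16 entries):
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * 2 bytes
layers, kv_heads, head_dim, seq_len, bytes_per = 32, 32, 128, 4096, 2

kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per

# Fusing KV data across every pair of layers stores caches for only half:
fused_bytes = kv_bytes // 2

print(kv_bytes // 2**30, "GiB ->", fused_bytes // 2**30, "GiB")
```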

Research papers on KV cache layer fusion:

  • Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, https://arxiv.org/abs/2405.14366 (Compresses the KV cache on the depth dimension of layers, analogous to layer fusion.)
  • Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
  • Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
  • Haoyi Wu, Kewei Tu, 4 Jun 2024 (v2), Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Only computes the KV cache of some layers.)
  • Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
  • William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
  • AIModels.FYI, 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://www.aimodels.fyi/papers/arxiv/layer-condensed-kv-cache-efficient-inference-large
  • Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao, 13 Aug 2024 (v3), Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003 https://github.com/zcli-charlie/Awesome-KV%20Cache
  • Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
  • Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
  • Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley, 23 Sep 2024, Inference-Friendly Models With MixAttention, https://arxiv.org/abs/2409.15012 (Attention optimization with sliding window attention and KV cache layer fusion, inspired by the approach of Character AI.)
  • Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He, 4 Oct 2024, SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation, https://arxiv.org/abs/2410.03960
  • You Wu, Haoyi Wu, Kewei Tu, 18 Oct 2024, A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference, https://arxiv.org/abs/2410.14442
  • Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang, 20 Oct 2024, Lossless KV Cache Compression to 2%, https://arxiv.org/abs/2410.15252
  • Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan, 23 Oct 2024, Value Residual Learning For Alleviating Attention Concentration In Transformers, https://arxiv.org/abs/2410.17897
  • Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen, 24 Oct 2024, KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing, https://arxiv.org/abs/2410.18517
  • Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
  • Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov, 20 Nov 2024, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
  • 01.AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
  • Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu, 3 Dec 2024, Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity, https://arxiv.org/abs/2412.02252
  • Meizhi Zhong, Xikai Liu, Chen Zhang, Yikun Lei, Yan Gao, Yao Hu, Kehai Chen, Min Zhang, 12 Dec 2024, ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty, https://arxiv.org/abs/2412.09036 (KV cache compression on a layerwise budget, similar to token-based eviction, giving a kind of "dual-dimension" KV cache compression on depth and length.)
  • Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
  • Longze Chen, Jan 2025 (accessed), Awesome-KV-Cache-Compression: Must-read papers on KV Cache Compression (constantly updating), https://github.com/October2001/Awesome-KV-Cache-Compression (KV cache reuse across multiple prompts via SwiftKV, sounds similar to prefix KV caching or fused KV caching, and also SingleInputKV does KV cache layer fusion in a single prompt.)
  • Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh, 31 Jan 2025, Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models, https://arxiv.org/abs/2501.19392 (Enhances KV cache quantization by exploiting cross-layer similarities.)
  • Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
  • Ben Dickson, July 22, 2025, Mixture-of-recursions delivers 2x faster inference—Here’s how to implement it, https://venturebeat.com/ai/mixture-of-recursions-delivers-2x-faster-inference-heres-how-to-implement-it/
  • Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun, 21 Jul 2025 (v2), Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation, https://www.arxiv.org/abs/2507.10524 (MoR is an adaptive layer fusion or layer reuse method to a fixed "recursive level" and also combined with related optimizations to KV cache management techniques.)
  • Junyi Wen, Junyuan Liang, Zicong Hong, Wuhui Chen, Zibin Zheng, 10 Jul 2025, Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing, https://arxiv.org/abs/2507.08045
  • David Spuler, Ph.D., March 3rd, 2025, What's Hot in LLM Inference Optimization in 2025? Aussie AI Blog, https://www.aussieai.com/blog/hot-inference-optimization-2025
  • Devansh, Apr 2026, Google’s Gemma 4 is Weirder than you Realize: The architecture matters more than the numbers. Here’s what Google actually built, https://machine-learning-made-simple.medium.com/googles-gemma-4-is-weirder-than-you-realize-17d00d95b0d5
  • Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, and Yuxiong He. 2025. SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25734–25753, Suzhou, China. Association for Computational Linguistics. https://aclanthology.org/2025.emnlp-main.1306/ https://github.com/snowflakedb/arctictraining and https://github.com/snowflakedb/arcticinference (Prefill-only KV layer fusion with propagation of KV caches to later layers allows prefill layer skipping.)

Fused Head Attention Research

Research papers on widthwise parameter sharing via fused attention heads:
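The parameter effect of head fusion is easiest to see in the extreme case of multi-query attention (MQA), where all query heads share a single K/V head; the shapes below are illustrative:

```python
# K/V projection parameter counts: per-head K/V (MHA) vs one shared K/V head
# (MQA).  Illustrative shapes, not measurements of any particular model.
num_heads, head_dim, d_model = 12, 64, 768

mha_kv = 2 * d_model * (num_heads * head_dim)  # per-head K and V projections
mqa_kv = 2 * d_model * head_dim                # single shared K/V head

print(mha_kv // mqa_kv)  # K/V projection weights shrink by num_heads (12x)
```

Grouped-query attention (GQA) sits between these extremes, sharing each K/V head among a group of query heads.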

KV Fused Head Research

Research papers on KV head fusion or merging:

  • Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu, 11 Jun 2024, Effectively Compress KV Heads for LLM, https://arxiv.org/abs/2406.07056
  • Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang, 20 Oct 2024, Lossless KV Cache Compression to 2%, https://arxiv.org/abs/2410.15252
  • Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
  • Isaac Rehg, 7 Oct 2024 (v2), KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head, https://arxiv.org/abs/2410.00161
  • Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao, 28 Oct 2024 (v2), Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning, https://arxiv.org/abs/2410.19258
  • Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu, 2 Oct 2024, Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads, https://arxiv.org/abs/2410.01805
  • Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
  • Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)

FFN Parameter Sharing

The FFN weights can be shared in a "fused FFN" optimization, similar to FFN pruning. Research papers include:
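A minimal sketch of this granular sharing (illustrative, loosely in the spirit of the "One Wide Feedforward" approach): each layer keeps private attention weights while all layers reference one FFN weight set:

```python
# Granular weight sharing: per-layer attention weights, one shared FFN.
num_layers = 4
fused_ffn = {"up": [[0.1, 0.2]], "down": [[0.3], [0.4]]}  # stored once

layers = [{"attn": {"Wq": [[0.01 * (i + 1)]]},  # unique weights per layer
           "ffn": fused_ffn}                    # same object in every layer
          for i in range(num_layers)]

assert all(layer["ffn"] is fused_ffn for layer in layers)
assert layers[0]["attn"] is not layers[1]["attn"]
```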

General Research on Parameter Sharing

There have been many attempts to speed up models using parameter sharing:

  • Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  • Qian Lou, Ting Hua, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. 2022. Dictformer: Tiny transformer with shared dictionary. In International Conference on Learning Representations. https://sra.samsung.com/publications/dictformer-tiny-transformer-with-shared-dictionary/ (Effectively shares parameters by using dictionary lookups.)
  • Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2012.14913 (Explores how FFN's work in depth, with relevance to sharing FFN weights.)
  • Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
  • Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. 2021. Subformer: Exploring weight sharing for parameter efficiency in generative transformers. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4081–4090, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2101.00234 (Parameter sharing across layers.)
  • Fahim Dalvi, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov. 2020. Analyzing redundancy in pretrained transformer models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4908–4926, Online. Association for Computational Linguistics. https://arxiv.org/abs/2004.04010 (Detailed analysis finding redundancy in 85% of parameters, with relevance to pruning and sharing.)
  • Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In International Conference on Learning Representations. https://arxiv.org/abs/1807.03819 (Optimizes Transformers with weight sharing and other ways.)
  • Sho Takase and Shun Kiyono. 2023. Lessons on parameter sharing across layers in transformers. In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 78–90, Toronto, Canada (Hybrid). Association for Computational Linguistics. https://arxiv.org/abs/2104.06022
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite bert for self-supervised learning of language representations. In Proceedings of ICLR. https://arxiv.org/abs/1909.11942 (Parameter sharing across layers in the BERT Transformer architecture.)
  • Raj Dabre and Atsushi Fujita. 2019. Recurrent stacking of layers for compact neural machine translation models. Proceedings of AAAI, 33:6292–6299. https://arxiv.org/abs/1807.05353 (Parameter sharing across layers of a Transformer.)
  • Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing attention weights for fast transformer. In Proceedings of IJCAI, pages 5292–5298, https://arxiv.org/abs/1906.11024 (Parameter sharing of attention heads.)
  • Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. 2019. Tied transformers: Neural machine translation with shared encoder and decoder. Proceedings of AAAI, 33(01):5466–5473. PDF: https://taoqin.github.io/papers/tiedT.AAAI2019.pdf
  • Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Contains several sections surveying weight sharing.)
  • Chu, X.; Zhang, B.; Xu, R. FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12219–12228. http://dx.doi.org/10.1109/ICCV48922.2021.01202, https://arxiv.org/abs/1907.01845 (NAS in the context of weight sharing architectures.)
  • Aich, S.; Yamazaki, M.; Taniguchi, Y.; Stavness, I., Multi-Scale Weight Sharing Network for Image Recognition. Pattern Recognit. Lett. 2020, 131, 348–354. http://dx.doi.org/10.1016/j.patrec.2020.01.011, https://arxiv.org/abs/2001.02816
  • Okan Köpüklü, Maryam Babaee, Stefan Hörmann, Gerhard Rigoll, Feb 2019, Convolutional neural networks with layer reuse, 2019 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/8802998/, https://arxiv.org/pdf/1901.09615 (The method of repeatedly re-using the same entire layers.)
  • M Mary Shanthi Rani, P Chitra, S Lakshmanan, M Kalpana Devi, R Sangeetha, S Nithya, 2022, DeepCompNet: A novel neural net model compression architecture, Comput Intell Neurosci. 2022 Feb 22;2022:2213273. https://pubmed.ncbi.nlm.nih.gov/35242176/, https://www.hindawi.com/journals/cin/2022/2213273/ (Combines quantization and pruning with weight sharing.)
  • Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019. https://arxiv.org/abs/1902.00751
  • X Wang, P Guo, Y Zhang, 2023, Unsupervised Domain Adaptation via Bidirectional Cross-Attention Transformer, ECML PKDD 2023: Machine Learning and Knowledge Discovery in Databases: Research Track pp 309–325, https://arxiv.org/abs/2201.05887 (Attention optimization method that uses weight sharing.)
  • Noam Shazeer, Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150 (Multi-query attention shares KV tensors across multiple attention heads.)
  • Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal Transformers. In Proceedings of ICLR. https://openreview.net/forum?id=HyzdRiR9Y7, PDF: https://openreview.net/pdf?id=HyzdRiR9Y7
  • C Fu, 2023, Machine Learning Algorithm and System Co-design for Hardware Efficiency, Ph.D. thesis, Computer Science, University of California San Diego, https://escholarship.org/content/qt52q368p3/qt52q368p3.pdf
  • S Tan, Y Shen, Z Chen, A Courville, C Gan, Oct 2023, Sparse Universal Transformer, arXiv preprint arXiv:2310.07096, https://arxiv.org/pdf/2310.07096.pdf
  • William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
  • Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
  • 3 Jan 2024 (v2), SPEED: Speculative Pipelined Execution for Efficient Decoding, Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao, https://arxiv.org/abs/2310.12072 (Speculatively executing multiple future tokens in parallel to the current token, by using multiple tokens with high probability from the early layers of inference of the current token in the model. This allows multiple speculations of the autoregressive inference of the next token to start before the current token is finished.)
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le, Jan 2022, Primer: Searching for Efficient Transformers for Language Modeling, https://arxiv.org/abs/2109.08668
  • Hesen Chen, Ming Lin, Xiuyu Sun, Qian Qi, Hao Li, and Rong Jin. 2019. Muffnet: Multi-layer feature federation for mobile deep learning. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. https://ieeexplore.ieee.org/document/9022559 PDF: https://openaccess.thecvf.com/content_ICCVW_2019/papers/CEFRL/Chen_MuffNet_Multi-Layer_Feature_Federation_for_Mobile_Deep_Learning_ICCVW_2019_paper.pdf
  • Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
  • Rene Bidart, Representational Redundancy Reduction Strategies for Efficient Neural Network Architectures for Visual and Language Tasks, 2023, Ph.D. thesis, University of Waterloo, https://uwspace.uwaterloo.ca/bitstream/handle/10012/19682/Bidart_Rene.pdf?sequence=1
  • S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, 2016, Eie: Efficient inference engine on compressed deep neural network, in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA, IEEE Press, 2016, pp. 243–254, https://doi.org/10.1109/ISCA.2016.30 https://arxiv.org/abs/1602.01528
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama
  • Salar Shakibhamedan, Amin Aminifar, Nima TaheriNejad, Axel Jantsch, 2024, EASE: Energy Optimization through Adaptation — A Review of Runtime Energy-Aware Approximate Deep Learning Algorithms, https://eclectx.org/Publications/2024_M13.pdf (Survey paper on techniques for adaptive inference with a focus on approximations of inference, including loop performance, stochastic algorithms, approximate arithmetic, quantization, pruning and low-rank.)
  • C Hooper, S Kim, H Mohammadzadeh, H Genc, Oct 2023, SPEED: Speculative Pipelined Execution for Efficient Decoding https://arxiv.org/pdf/2310.12072.pdf
  • Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, June 2024, DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, https://arxiv.org/abs/2406.06567 https://ui.adsabs.harvard.edu/abs/2024arXiv240606567C/abstract
  • David Spuler, March 2024, Chapter 46. Structured Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similarity in model layers.)
  • Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 27 Jun 2024 (v2), MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, Meta Research, https://arxiv.org/abs/2402.14905 Code: https://github.com/facebookresearch/MobileLLM
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
  • Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, Ziyuan Ling, 26 Aug 2024, On-Device Language Models: A Comprehensive Review, https://arxiv.org/abs/2409.00088 https://github.com/NexaAI/Awesome-LLMs-on-device https://www.nexaai.com/models
  • Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster, 28 Oct 2024, Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA, https://arxiv.org/abs/2410.20672
  • Seul-Ki Yeom, Tae-Ho Kim, 3 Dec 2024, UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices, https://arxiv.org/abs/2412.02344 (Shared attention matrix generalizes MHA with fused attention matrices across layers.)
  • L. Zhou et al., "Chasing Common Knowledge: Joint Large Model Selection and Pulling in MEC With Parameter Sharing," in IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2025.3527649. https://ieeexplore.ieee.org/abstract/document/10834568/
  • Ben Dickson, July 22, 2025, Mixture-of-recursions delivers 2x faster inference—Here’s how to implement it, https://venturebeat.com/ai/mixture-of-recursions-delivers-2x-faster-inference-heres-how-to-implement-it/
  • Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun, 21 Jul 2025 (v2), Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation, https://www.arxiv.org/abs/2507.10524 (MoR is an adaptive layer fusion or layer reuse method to a fixed "recursive level" and also combined with related optimizations to KV cache management techniques.)
  • Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, Wanxiang Che, 22 Aug 2025, CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing, https://arxiv.org/abs/2508.16134
  • Guanqiao Qu, Qian Chen, Xianhao Chen, Kaibin Huang, Yuguang Fang, 12 Oct 2025, PartialLoading: User Scheduling and Bandwidth Allocation for Parameter-sharing Edge Inference, https://arxiv.org/abs/2503.22982
  • Hao Ban, Kaiyi Ji, 29 Sep 2025, Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs, https://arxiv.org/abs/2509.25414
  • David Spuler, Ph.D., Feb 6th, 2026 (updated), 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
  • David Spuler, March 2024, Chapter 46. Structured Pruning, in book "Generative AI in C++", https://www.aussieai.com/book/ch46-structured-pruning
  • David Spuler, March 2024, Generative AI in C++: Coding Transformers and LLMs, https://www.aussieai.com/book/toc PDF: https://www.aussieai.com/pdf/BOOK-Generative-AI-CPP-Spuler-2024.pdf

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: