Aussie AI

Parameter-Efficient Fine-Tuning (PEFT)

  • Last Updated 29 August, 2025
  • by David Spuler, Ph.D.

Parameter-Efficient Fine-Tuning (PEFT) is fine-tuning that updates only a small fraction of a model's parameters. Instead of updating all of the model's parameters, which is slow and costly, only a subset of parameters is updated. The rest of the model's parameters are "frozen" during the fine-tuning procedure.
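The freeze-most, train-few idea can be sketched with a toy numpy example (a hypothetical two-tensor "model", not any real training framework): a large frozen base weight and a small trainable tensor, where a training step only ever touches the trainable part.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "model": one large base weight plus a small
# trainable tensor. PEFT freezes the former and trains the latter.
base_weight = rng.normal(size=(512, 512))   # frozen base parameters
adapter = np.zeros((512, 8))                # small trainable subset

frozen_snapshot = base_weight.copy()

# One "training step": gradients conceptually exist for every
# parameter, but only the trainable adapter is actually updated.
lr = 0.01
adapter_grad = rng.normal(size=adapter.shape)
adapter -= lr * adapter_grad                # updated
# base_weight receives no update at all     # frozen

trainable = adapter.size
total = base_weight.size + adapter.size
print(f"trainable fraction: {trainable / total:.4%}")
```

Here under 2% of the parameters are trainable, which is the source of PEFT's speed and memory savings.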

Various types of PEFT have been examined, such as:

  • LoRA (Low-Rank Adaptation)
  • Multi-LoRA
  • QLoRA (quantized LoRA)
  • Prompt tuning (extended vocabulary PEFT)

The alternatives to using PEFT to train additional intelligence into a model include:

LoRA

The idea behind LoRA is to use "low-rank" matrices, which are much smaller, and thus much less costly to fine-tune. These low-rank matrices can be multiplied together to create a weight update that is combined with the original model's weights.
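A minimal numpy sketch of the low-rank idea: the frozen weight W gets a correction B @ A, where A and B are small rank-r factors (B is initialized to zero in standard LoRA, so training starts from the unmodified model). The dimensions here are illustrative, not from any particular model.

```python
import numpy as np

rng = np.random.default_rng(42)
d, r = 64, 4                        # hidden size and LoRA rank (r << d)

W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                # zero-initialized, so B @ A starts at 0

x = rng.normal(size=(d,))
# LoRA forward pass: base output plus the low-rank correction.
y = W @ x + B @ (A @ x)

# The low-rank pair stores far fewer parameters than a full d-by-d delta.
print(W.size, A.size + B.size)
```

With d = 64 and r = 4, the LoRA factors hold 512 parameters versus 4096 for a full delta matrix, and the gap widens as d grows.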

Multi-LoRA

The use of multiple LoRA adapters got a boost when Apple chose this method for its Apple Intelligence platform. Several other platforms use multi-LoRA as an efficiency gain for both training and inference.
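The efficiency gain comes from sharing one base model across many small adapters. A hypothetical sketch (the task names and registry are illustrative, not any real serving API): each request selects its own (B, A) pair, while the large weight W stays loaded once.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 2
W = rng.normal(size=(d, d))  # single shared base weight, loaded once

# Hypothetical per-task adapter registry; only the small (B, A)
# pairs differ between tasks, the base W is shared by all of them.
adapters = {
    "summarize": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "translate": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def forward(x, task):
    """Apply the shared base weight plus the task's LoRA correction."""
    B, A = adapters[task]
    return W @ x + B @ (A @ x)

x = rng.normal(size=(d,))
y1 = forward(x, "summarize")
y2 = forward(x, "translate")
```

Each adapter adds only 2*r*d parameters, so dozens of tasks can be served from one resident copy of the base weights.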

Research papers on multi-LoRA include:

LoRA Inference Optimizations

The popularity of LoRA as an efficient training method has also spawned research on maximizing its inference efficiency. The loading and unloading of LoRA adapters can be quite expensive, and methods of optimizing multi-LoRA platforms have seen various research papers.

Research papers on LoRA and multi-LoRA inference optimization include:

  • Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy, 28 Oct 2023, Punica: Multi-Tenant LoRA Serving https://arxiv.org/abs/2310.18547 Code: https://github.com/punica-ai/punica
  • Jingwei Xu, Junyu Lai, Yunpeng Huang, 24 May 2024 (v2), MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models, https://arxiv.org/abs/2405.13053
  • Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag, 28 Mar 2024, CLoRA: A Contrastive Approach to Compose Multiple LoRA Models, https://arxiv.org/abs/2403.19776
  • Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang, 20 Jan 2024, CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference, https://arxiv.org/abs/2401.11240 (Multi-LoRA inference where it starts running prefill computations in the CPU while loading the LoRA weights into the GPU.)
  • Rui Kong, Qiyang Li, Xinyu Fang, Qingtian Feng, Qingfeng He, Yazhu Dong, Weijun Wang, Yuanchun Li, Linghe Kong, Yunxin Liu, 28 May 2024, LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design, https://arxiv.org/abs/2405.17741
  • Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica, 5 Jun 2024 (v3), S-LoRA: Serving Thousands of Concurrent LoRA Adapters, https://arxiv.org/abs/2311.03285 Code: https://github.com/S-LoRA/S-LoRA
  • Chen, Lequn, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/items/13e14599-b4ee-4fbb-86bf-e58a4118d0f9
  • Bingyang Wu, Ruidong Zhu, and Zili Zhang, Peng Sun, Shanghai AI Lab; Xuanzhe Liu, Xin Jin, 2024, dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving, https://www.usenix.org/conference/osdi24/presentation/wu-bingyang
  • Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
  • Yuren Mao, Yuhang Ge, Yijiang Fan, Wenyi Xu, Yu Mi, Zhonghao Hu, Yunjun Gao 12 Aug 2024 (v3), A Survey on LoRA of Large Language Models, https://arxiv.org/abs/2407.11046 https://github.com/ZJU-LLMs/Awesome-LoRAs.git
  • Yuxuan Zhang, Ruizhe Li, 2 Oct 2024, DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models, https://arxiv.org/abs/2410.01497 https://github.com/MeCuping/DLP-LoRA (Merging multiple LoRA adapters for parallel inference.)
  • Liang Mi, Weijun Wang, Wenming Tu, Qingfeng He, Rui Kong, Xinyu Fang, Yazhu Dong, Yikang Zhang, Yunchun Li, Meng Li, Haipeng Dai, Guihai Chen, Yunxin Liu, 1 Nov 2024, V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM, https://arxiv.org/abs/2411.00915
  • Nikoleta Iliakopoulou, Jovan Stojkovic, Chloe Alverti, Tianyin Xu, Hubertus Franke, Josep Torrellas, 24 Nov 2024, Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments, https://arxiv.org/abs/2411.17741
  • Jiaxuan Chen. 2024. Comparative Analysis and Optimization of LoRA Adapter Co-serving for Large Language Models. In Proceedings of the 25th International Middleware Conference: Demos, Posters and Doctoral Symposium (Middleware '24). Association for Computing Machinery, New York, NY, USA, 27–28. https://doi.org/10.1145/3704440.3704777 https://dl.acm.org/doi/abs/10.1145/3704440.3704777 https://dl.acm.org/doi/pdf/10.1145/3704440.3704777 (Serving multiple LoRA adapters while maintaining a single backbone LLM model in memory.)
  • https://arxiv.org/abs/2505.14468
  • Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang, 28 Apr 2025, Taming the Titans: A Survey of Efficient LLM Inference Serving, https://arxiv.org/abs/2504.19720 (Survey of various inference and serving optimizations, such as parallelism, offloading, scheduling, length prediction, KV cache compression, and prefill-decode phase disaggregation.)

QLoRA

QLoRA is quantized LoRA. Quantization is now standard practice for LoRA, to the point that many research papers no longer use the term "QLoRA" at all. For example, Apple Intelligence uses QLoRA with 4-bit quantization in its multi-LoRA architecture.
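The combination can be sketched in numpy: the frozen base weights are quantized to 4 bits to save memory, while the trainable LoRA factors stay in full precision. Note this uses simple symmetric round-to-nearest quantization as an illustration; real QLoRA uses the NF4 data type with block-wise scaling.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 4
W = rng.normal(size=(d, d))

# Symmetric 4-bit quantization of the frozen base weights
# (a sketch; actual QLoRA uses NF4 with per-block scales).
scale = np.abs(W).max() / 7.0          # signed 4-bit range: -8..7
Wq = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
W_deq = Wq.astype(np.float32) * scale  # dequantized for the forward pass

# LoRA factors stay in full precision; they are the only trained weights.
A = rng.normal(size=(r, d)).astype(np.float32) * 0.01
B = np.zeros((d, r), dtype=np.float32)

x = rng.normal(size=(d,)).astype(np.float32)
y = W_deq @ x + B @ (A @ x)
```

Storing the base weights at 4 bits cuts their memory footprint by roughly 4x versus 16-bit floats, while the tiny full-precision adapter preserves trainability.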

Research papers on QLoRA include:

Prompt Tuning (Extended Vocabulary PEFT)

Prompt tuning is a lengthwise PEFT that creates new tokens to extend the vocabulary, rather than training the parameters for existing tokens. Since the new tokens start without trained values, they are the only parameters updated; the weights for all of the original tokens, which make up most of the model, remain frozen. As an example, this type of PEFT can be useful when extending the LLM via fine-tuning on a specially curated data set, so as to have particular "trigger tokens" that launch integrated tools or perform other advanced capabilities. For example, new tokens indicating a "tool launch" can be added to the vocabulary, with fine-tuning applied only to those tokens.
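A toy numpy sketch of the vocabulary-extension idea (the token name and sizes are hypothetical): new rows are appended to the embedding table, a training step updates only those new rows, and the original vocabulary's embeddings stay bit-identical.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, dim = 100, 16
embeddings = rng.normal(size=(vocab, dim))   # frozen original vocabulary

# Hypothetical new "trigger" tokens appended to the vocabulary;
# only these new rows will be trained.
n_new = 2
new_rows = rng.normal(size=(n_new, dim)) * 0.01
table = np.vstack([embeddings, new_rows])

TOOL_LAUNCH = vocab          # id of the first new token (hypothetical)
frozen_copy = embeddings.copy()

# One "training step" touching only the appended rows:
grad = rng.normal(size=(n_new, dim))
table[vocab:] -= 0.1 * grad
```

After training, emitting token id `TOOL_LAUNCH` can serve as the model's signal to invoke an external tool, without any drift in the original token embeddings.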

Research Papers on PEFT

PEFT is a popular technique that receives a lot of research attention:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: