Aussie AI
QLoRA Models
Last Updated 15 August, 2025
by David Spuler, Ph.D.
Research on QLoRA Models
Research papers include:
- Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts, 26 Mar 2024, The Unreasonable Ineffectiveness of the Deeper Layers, https://arxiv.org/abs/2403.17887 (Static layer pruning with some PEFT re-training after removing layers, with quantization and QLoRA.)
- Intel, 2024, Intel Extension for Transformers, https://github.com/intel/intel-extension-for-transformers
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, a small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, with claimed speeds of 0.6 ms per prompt token in prefill and 30 tokens per second in decoding.)
- Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 23 May 2023, QLoRA: Efficient Finetuning of Quantized LLMs, https://arxiv.org/abs/2305.14314 Code: https://github.com/artidoro/qlora Code: https://github.com/TimDettmers/bitsandbytes (The original QLoRA paper; a usage sketch follows the list below.)
- Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang, 13 Apr 2024 (v4), EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models, https://arxiv.org/abs/2310.03270 Code: https://github.com/ThisisBillhe/EfficientDM
- Pranav Patel, 2024, In-depth guide to fine-tuning LLMs with LoRA and QLoRA, https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li, 24 Jul 2024, Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance, https://arxiv.org/abs/2407.17029 Code: https://github.com/xiaocaigou/qbaraqahira (Combining quantization and LoRA.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Shikhar Bajpai, Sep 27, 2024, Shrinking Elephants: A Funny Guide to 4-bit and 8-bit Quantization for LLMs with LoRA, https://medium.com/@shikharstruck/shrinking-elephants-a-funny-guide-to-4-bit-and-8-bit-quantization-for-llms-with-lora-ddf9f1a62070
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Neal Lawton, Aishwarya Padmakumar, Judith Gaspers, Jack FitzGerald, Anoop Kumar, Greg Ver Steeg, Aram Galstyan, 9 Oct 2024, QuAILoRA: Quantization-Aware Initialization for LoRA, https://arxiv.org/abs/2410.14713
- Meta, October 24, 2024, Introducing quantized Llama models with increased speed and a reduced memory footprint, https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- Towards AI, December 24, 2024, LLM Fine-Tuning Guide: Do You Need It and How to Do It, https://towardsai.net/p/artificial-intelligence/llm-fine-tuning-guide-do-you-need-it-and-how-to-do-it-4
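For context on what these papers build on: QLoRA freezes a base model whose weights are quantized to 4-bit NormalFloat (NF4) and trains only small low-rank adapter matrices on top, so the effective weight is roughly W ≈ dequant(W_q) + (alpha/r)·B·A, with only A and B updated during fine-tuning. Below is a minimal sketch of this setup, assuming the Hugging Face transformers, peft, and bitsandbytes libraries (the bitsandbytes repository is linked from the Dettmers et al. entry above); the checkpoint name and hyperparameter values are illustrative choices, not prescriptions from any one paper.

```python
# Minimal QLoRA fine-tuning sketch (assumed stack:
#   pip install transformers peft bitsandbytes accelerate).
# The model checkpoint and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # any causal LM checkpoint

# The "Q" in QLoRA: quantize the frozen base weights to 4-bit NF4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # set up for k-bit training

# The "LoRA" part: small trainable low-rank adapters on the frozen base.
lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # effective update is (alpha/r) * B @ A
    target_modules=["q_proj", "v_proj"],   # attention projections, a common choice
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Training then proceeds with any standard trainer over the wrapped model; at adapter ranks like the above, the trainable parameters are a small fraction of the base model, which is what makes fine-tuning a quantized 7B+ model feasible on a single GPU.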
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI, a new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications, a new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- Research Home