Aussie AI

RAG Evaluation

Last Updated 17 November, 2025

by David Spuler, Ph.D.

RAG evaluation assesses the LLM-based RAG architecture as a whole, whereas conventional model evaluation examines only the model. A typical RAG system includes not only an LLM, but also a vector database of document chunks and an orchestrator component. Advanced RAG architectures typically also include a keyword search datastore, a reranker, a packer, and other components.
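As a rough illustration of evaluating the pipeline rather than the model alone, the following minimal Python sketch scores the retrieval stage and the final answer separately over a small test set. Everything in it is an assumption made for illustration: the `EvalCase` fields, the `retrieve` and `generate` callables, and the choice of recall@k and exact match as metrics. The frameworks in the research list below generally use richer metrics than these.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Set


@dataclass
class EvalCase:
    """One hypothetical test case with gold labels for both pipeline stages."""
    question: str
    relevant_chunk_ids: Set[str]  # chunks a good retriever should return
    expected_answer: str          # reference answer for the end-to-end check


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved chunks."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


def exact_match(answer: str, expected: str) -> float:
    """Crude end-to-end score: 1.0 if the answers match ignoring case and whitespace."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def evaluate_rag(
    cases: List[EvalCase],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    k: int = 3,
) -> Dict[str, float]:
    """Average a retrieval metric and an answer metric over all test cases."""
    if not cases:
        raise ValueError("at least one evaluation case is required")
    recalls, matches = [], []
    for case in cases:
        chunk_ids = retrieve(case.question)            # retrieval stage
        answer = generate(case.question, chunk_ids)    # generation stage
        recalls.append(recall_at_k(chunk_ids, case.relevant_chunk_ids, k))
        matches.append(exact_match(answer, case.expected_answer))
    return {
        "recall_at_k": sum(recalls) / len(cases),
        "exact_match": sum(matches) / len(cases),
    }


# Toy stand-ins for a real vector database lookup and a real LLM call.
TOY_INDEX = {
    "What is the capital of France?": ["c1", "c3"],
    "Who wrote Hamlet?": ["c3", "c4"],
}


def toy_retrieve(question: str) -> List[str]:
    return TOY_INDEX.get(question, [])


def toy_generate(question: str, chunk_ids: List[str]) -> str:
    # Pretend the model can only answer when chunk "c1" was retrieved.
    return "Paris" if "c1" in chunk_ids else "unknown"


cases = [
    EvalCase("What is the capital of France?", {"c1"}, "Paris"),
    EvalCase("Who wrote Hamlet?", {"c2"}, "Shakespeare"),
]
scores = evaluate_rag(cases, toy_retrieve, toy_generate, k=3)
print(scores)
```

Reporting the two scores separately helps localize failures: in the toy data, the second question fails at retrieval, so the wrong answer cannot be blamed on the LLM alone.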

Research on RAG Evaluation

Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert, 26 Sep 2023, RAGAS: Automated Evaluation of Retrieval Augmented Generation, https://arxiv.org/abs/2309.15217
Shangeetha Sivasothy, Scott Barnett, Stefanus Kurniawan, Zafaryab Rasool, Rajesh Vasa, 24 Sep 2024, RAGProbe: An Automated Approach for Evaluating RAG Applications, https://arxiv.org/abs/2409.19019
Jon Saad-Falcon, Omar Khattab, Christopher Potts, Matei Zaharia, 31 Mar 2024 (v2), ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems, https://arxiv.org/abs/2311.09476
Kevin Wu, Eric Wu, James Zou, 10 Jun 2024 (v2), ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence, https://arxiv.org/abs/2404.10198
Galla, D., Hoda, S., Zhang, M., Quan, W., Yang, T.D., Voyles, J. (2024). CoURAGE: A Framework to Evaluate RAG Systems. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14763. Springer, Cham. https://doi.org/10.1007/978-3-031-70242-6_37 https://link.springer.com/chapter/10.1007/978-3-031-70242-6_37
Rafael Teixeira de Lima, Shubham Gupta, Cesar Berrospi, Lokesh Mishra, Michele Dolfi, Peter Staar, Panagiotis Vagenas, 29 Nov 2024, Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems, IBM Research, https://arxiv.org/abs/2411.19710
Lilian Weng, July 7, 2024, Extrinsic Hallucinations in LLMs, https://lilianweng.github.io/posts/2024-07-07-hallucination/
Rama Akkiraju, Anbang Xu, Deepak Bora, Tan Yu, Lu An, Vishal Seth, Aaditya Shukla, Pritam Gundecha, Hridhay Mehta, Ashwin Jha, Prithvi Raj, Abhinav Balasubramanian, Murali Maram, Guru Muthusamy, Shivakesh Reddy Annepally, Sidney Knowles, Min Du, Nick Burnett, Sean Javiya, Ashok Marannan, Mamta Kumari, Surbhi Jha, Ethan Dereszenski, Anupam Chakraborty, Subhash Ranjan, Amina Terfai, Anoop Surya, Tracey Mercer, Vinodh Kumar Thanigachalam, Tamar Bar, Sanjana Krishnan, Samy Kilaru, Jasmine Jaksic, Nave Algarici, Jacob Liberman, Joey Conway, Sonu Nayyar, Justin Boitano, 10 Jul 2024, FACTS About Building Retrieval Augmented Generation-based Chatbots, NVIDIA Research, https://arxiv.org/abs/2407.07858
Contextual AI Team, March 19, 2024, Introducing RAG 2.0, https://contextual.ai/introducing-rag2/
Angels Balaguer, Vinamra Benara, Renato Luiz de Freitas Cunha, Roberto de M. Estevão Filho, Todd Hendry, Daniel Holstein, Jennifer Marsman, Nick Mecklenburg, Sara Malvar, Leonardo O. Nunes, Rafael Padilha, Morris Sharp, Bruno Silva, Swati Sharma, Vijay Aski, Ranveer Chandra, 30 Jan 2024 (v3), RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture, https://arxiv.org/abs/2401.08406
Mingyue Cheng, Yucong Luo, Jie Ouyang, Qi Liu, Huijie Liu, Li Li, Shuo Yu, Bohou Zhang, Jiawei Cao, Jie Ma, Daoyu Wang, Enhong Chen, 17 Mar 2025 (v2), A Survey on Knowledge-Oriented Retrieval-Augmented Generation, https://arxiv.org/abs/2503.10677
Chaitanya Sharma, 28 May 2025, Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers, https://arxiv.org/abs/2506.00054
Quentin Romero Lauro, Shreya Shankar, Sepanta Zeighami, Aditya Parameswaran, 18 Apr 2025, RAG Without the Lag: Interactive Debugging for Retrieval-Augmented Generation Pipelines, https://arxiv.org/abs/2504.13587
Jeongsoo Lee, Daeyong Kwon, Kyohoon Jin, 23 Aug 2025, GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation, https://arxiv.org/abs/2508.16994
Mohita Chowdhury, Yajie Vera He, Jared Joselowitz, Aisling Higham, Ernest Lim, 18 Jul 2025, ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems, https://arxiv.org/abs/2501.08208
Grégoire Martinon, Alexandra Lorenzo de Brionne, Jérôme Bohard, Antoine Lojou, Damien Hervault, Nicolas J-B. Brunel (ENSIIE, LaMME), 29 Jul 2025, Towards a rigorous evaluation of RAG systems: the challenge of due diligence, https://arxiv.org/abs/2507.21753
Jiaxuan Liang, Shide Zhou, and Kailong Wang, 26 Jul 2025, OmniBench-RAG: A Multi-Domain Evaluation Platform for Retrieval-Augmented Generation Tools, https://arxiv.org/abs/2508.05650
Ilias Driouich, Hongliu Cao, Eoin Thomas, 26 Aug 2025, Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework, https://arxiv.org/abs/2508.18929
Md Toufique Hasan, Muhammad Waseem, Kai-Kristian Kemell, Ayman Asad Khan, Mika Saari and Pekka Abrahamsson, 18 Sep 2025, Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation, https://arxiv.org/abs/2506.20869
Sicheng Dong, Vahid Zolfaghari, Nenad Petrovic, Alois Knoll, 2 Oct 2025, Knowledge-Graph Based RAG System Evaluation Framework, https://arxiv.org/abs/2510.02549
Aline Mangold, Kiran Hoffmann, 30 Sep 2025, Human-Centered Evaluation of RAG outputs: a framework and questionnaire for human-AI collaboration, https://arxiv.org/abs/2509.26205