Aussie AI

Model Evaluation

  • Last Updated 30 August, 2025
  • by David Spuler, Ph.D.

Leaderboards (Model Evaluation)

Benchmarks for Model Evaluation

  • Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
  • Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, Sandipan Dandapat, December 2023, Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing https://aclanthology.org/2023.emnlp-main.418/ PDF: https://aclanthology.org/2023.emnlp-main.418.pdf
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You, 3 Jun 2024, MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures, https://arxiv.org/abs/2406.06565
  • Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi, 13 Jun 2024, Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning, https://arxiv.org/abs/2406.09170
  • Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
  • Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
  • Petr Spelda and Vit Stritecky, 13 Aug 2025, Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1, https://arxiv.org/abs/2508.10173
  • Pengbo Shen, Yaqing Wang, Ni Mu, Yao Luan, Runpeng Xie, Senhao Yang, Lexiang Wang, Hao Hu, Shuang Xu, Yiqin Yang, Bo Xu, 14 Aug 2025, SC2Arena and StarEvolve: Benchmark and Self-Improvement Framework for LLMs in Complex Decision-Making Tasks, https://arxiv.org/abs/2508.10428
  • Lucie-Aimée Kaffee, Giada Pistilli, Yacine Jernite, 4 Aug 2025, INTIMA: A Benchmark for Human-AI Companionship Behavior, https://arxiv.org/abs/2508.09998
  • Rakesh Thakur, Sneha Sharma, Gauri Chopra, 4 Aug 2025, HiFACTMix: A Code-Mixed Benchmark and Graph-Aware Model for Evidence-Based Political Claim Verification in Hinglish, https://arxiv.org/abs/2508.10001
  • Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen, 13 Aug 2025, mSCoRe: A Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning, https://arxiv.org/abs/2508.10137
  • Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza, 13 Aug 2025, PakBBQ: A Culturally Adapted Bias Benchmark for QA, https://arxiv.org/abs/2508.10186
  • Chenggang Chen, Zhiyu Yang, 13 Aug 2025, No Free Lunch from Audio Pretraining in Bioacoustics: A Benchmark Study of Embeddings, https://arxiv.org/abs/2508.10230
  • Jieyu Li, Xin Zhang, and Joey Tianyi Zhou, 14 Aug 2025, AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences, https://arxiv.org/abs/2508.10771
  • Brooke R. Weborg and Gursel Serpen, 14 Aug 2025, Empirical Investigation into Configuring Echo State Networks for Representative Benchmark Problem Domains, https://arxiv.org/abs/2508.10887
  • Anand Kumar, Harminder Pal Monga, Tapasi Brahma, Satyam Kalra, Navas Sherif, 14 Aug 2025, Mobile-Friendly Deep Learning for Plant Disease Detection: A Lightweight CNN Benchmark Across 101 Classes of 33 Crops, https://arxiv.org/abs/2508.10817
  • Fabian David Schmidt, Ivan Vulić, Goran Glavaš, David Ifeoluwa Adelani, 13 Aug 2025, Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding, https://arxiv.org/abs/2501.06117
  • Yuping Wang and Xiangyu Huang and Xiaokang Sun and Mingxuan Yan and Shuo Xing and Zhengzhong Tu and Jiachen Li, 14 Aug 2025, UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving, https://arxiv.org/abs/2503.24381
  • Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou, 14 Aug 2025, PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts, https://arxiv.org/abs/2508.09848
  • Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang, Wentao Zhang, 22 Jul 2025, CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos, https://arxiv.org/abs/2507.16878
  • Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang, 23 Jul 2025, SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs, https://arxiv.org/abs/2507.17178
  • Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing, 23 Jul 2025, MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs, https://arxiv.org/abs/2507.17476
  • Helen Jin, Shreya Havaldar, Chaehyeon Kim, Anton Xue, Weiqiu You, Helen Qu, Marco Gatti, Daniel A Hashimoto, Bhuvnesh Jain, Amin Madani, Masao Sako, Lyle Ungar, Eric Wong, 22 Jul 2025, The FIX Benchmark: Extracting Features Interpretable to eXperts, https://arxiv.org/abs/2409.13684
  • Xu Yang, Qi Zhang, Shuming Jiang, Yaowen Xu, Zhaofan Zou, Hao Sun, Xuelong Li, 22 Jul 2025, METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark, https://arxiv.org/abs/2507.16206
  • Goeric Huybrechts, Srikanth Ronanki, Sai Muralidhar Jayanthi, Jack Fitzgerald, Srinivasan Veeravanallur, 18 Jul 2025, Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark, https://arxiv.org/abs/2507.15882
  • Eduardo Pacheco, Atila Orhon, Berkin Durmus, Blaise Munyampirwa, Andrey Leonov, 22 Jul 2025, SDBench: A Comprehensive Benchmark Suite for Speaker Diarization, https://arxiv.org/abs/2507.16136
  • Yasser Ashraf, Ahmed Sharshar, Velibor Bojkovic, Bin Gu, 22 Jul 2025, SPACT18: Spiking Human Action Recognition Benchmark Dataset with Complementary RGB and Thermal Modalities, https://arxiv.org/abs/2507.16151
  • Shalaka Satheesh, Katrin Klug, Katharina Beckh, Héctor Allende-Cid, Sebastian Houben, Teena Hassan, 22 Jul 2025, GG-BBQ: German Gender Bias Benchmark for Question Answering, https://arxiv.org/abs/2507.16410
  • Alireza Dizaji, Benedict Aaron Tjandra, Mehrab Hamidi, Shenyang Huang, Guillaume Rabusseau, 22 Jul 2025, T-GRAB: A Synthetic Diagnostic Benchmark for Learning on Temporal Graphs, https://arxiv.org/abs/2507.10183
  • Huan Liu, Shusen Yang, Yuzhe Zhang, Mengze Wang, Fanyu Gong, Chengxi Xie, Guanjian Liu, Zejun Liu, Yong-Jin Liu, Bao-Liang Lu, Dalin Zhang, 22 Jul 2025, LibEER: A Comprehensive Benchmark and Algorithm Library for EEG-based Emotion Recognition, https://arxiv.org/abs/2410.09767
  • Pierre Sermanet, Anirudha Majumdar, Vikas Sindhwani, 22 Jul 2025, SciFi-Benchmark: Leveraging Science Fiction To Improve Robot Behavior, https://arxiv.org/abs/2503.10706
  • Mustafa Chasmai, Wuao Liu, Subhransu Maji, Grant Van Horn, 21 Jul 2025, Audio Geolocation: A Natural Sounds Benchmark, https://arxiv.org/abs/2505.18726
  • Zehan Li, Hongjie Chen, Yuxin Zhang, Jing Zhou, Xuening Wang, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li, 24 Jul 2025, TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios, https://arxiv.org/abs/2507.18061
  • Minje Park, Jeonghwa Lim, Taehyung Yu, and Sunghoon Joo, 24 Jul 2025, A Multi-Dataset Benchmark for Semi-Supervised Semantic Segmentation in ECG Delineation, https://arxiv.org/abs/2507.18323
  • Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang, 20 Jul 2025, MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation, https://arxiv.org/abs/2507.17773
  • Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer, 24 Jul 2025, BEARCUBS: A benchmark for computer-using web agents, https://arxiv.org/abs/2503.07919
  • Xiao Wang, Qian Zhu, Shujuan Wu, Bo Jiang, Shiliang Zhang, Yaowei Wang, Yonghong Tian, Bin Luo, 18 Jul 2025, When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework, https://arxiv.org/abs/2507.13659
  • Ishant Chintapatla, Kazuma Choji, Naaisha Agarwal, Andrew Lin, Hannah You, Charles Duong, Kevin Zhu, Sean O'Brien, Vasu Sharma, 17 Jul 2025, COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark, https://arxiv.org/abs/2507.13405
  • Jie Ouyang, Tingyue Pan, Mingyue Cheng, Ruiran Yan, Yucong Luo, Jiaying Lin, Qi Liu, 18 Jul 2025, HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation, https://arxiv.org/abs/2503.04800
  • Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, Jinsik Lee, 18 Jul 2025, From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation, https://arxiv.org/abs/2507.08924
  • Zhijiang Tang, Jiaxin Qi, Yuhua Zheng, Jianqiang Huang, 15 Jul 2025, A Comprehensive Benchmark for Electrocardiogram Time-Series, https://arxiv.org/abs/2507.14206
  • Lingbo Li, Anuradha Mathrani, Teo Susnjak, 20 Jul 2025, What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction, https://arxiv.org/abs/2507.15152
  • Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li, 20 Jul 2025, CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents, https://arxiv.org/abs/2407.01511
  • Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang, 19 Jul 2025, TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios, https://arxiv.org/abs/2505.12891
  • María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico, 19 Jul 2025, MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation, https://arxiv.org/abs/2502.17163
  • Tom Sander, Pierre Fernandez, Saeed Mahloujifar, Alain Durmus, Chuan Guo, 21 Jul 2025, Detecting Benchmark Contamination Through Watermarking, https://arxiv.org/abs/2502.17259
  • Ziyu Wang (1), Tao Xue (1), Jingyuan Li (1), Haibin Zhang (1), Zhiqiang Xu (3), Gaofei Xu (4), Zhen Wang (5), Yanbin Wang (2), Zhiquan Liu (6) ((1) Xidian University, (2) Shenzhen MSU-BIT University, (3) Jiangxi University of Science and Technology, (4) Institute of Deep-sea Science and Engineering, (5) Northwestern Polytechnical University, (6) Jinan University), 20 Jul 2025, Can Optical Denoising Clean Sonar Images? A Benchmark and Fusion Approach, https://arxiv.org/abs/2503.01655
  • Shengtao Wen, Haodong Chen, Yadong Wang, Zhongying Pan, Xiang Chen, Yu Tian, Bo Qian, Dong Liang, Sheng-Jun Huang, 9 Aug 2025, MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA, https://arxiv.org/abs/2508.07022
  • Naseem Machlovi, Maryam Saleki, Innocent Ababio, Ruhul Amin, 9 Aug 2025, Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach, https://arxiv.org/abs/2508.07063
  • Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li, 10 Aug 2025, Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach, https://arxiv.org/abs/2508.07353
  • Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo, 11 Aug 2025, MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark, https://arxiv.org/abs/2508.07575
  • Zhe Zhang, Runlin Liu, Aishan Liu, Xingyu Liu, Xiang Gao, Hailong Sun, 10 Aug 2025, Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes, https://arxiv.org/abs/2508.07180
  • Haiyang Guo, Fei Zhu, Hongbo Zhao, Fanhu Zeng, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang, 10 Aug 2025, MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark, https://arxiv.org/abs/2508.07307
  • Sarina Penquitt, Jonathan Klees, Rinor Cakaj, Daniel Kondermann, Matthias Rottmann, Lars Schmarje, 6 Aug 2025, From Label Error Detection to Correction: A Modular Framework and Benchmark for Object Detection Datasets, https://arxiv.org/abs/2508.06556
  • Mohammad Zia Ur Rehman, Anukriti Bhatnagar, Omkar Kabde, Shubhi Bansal, Nagendra Kumar, 7 Aug 2025, ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos, https://arxiv.org/abs/2508.06570
  • Lucia Cipolina-Kun, Marianna Nezhurina, Jenia Jitsev, 10 Aug 2025, Game Reasoning Arena: A Framework and Benchmark for Assessing Reasoning Capabilities of Large Language Models via Game Play, https://arxiv.org/abs/2508.03368
  • Yuxuan Li, Xiang Li, Weijie Li, Qibin Hou, Li Liu, Ming-Ming Cheng, Jian Yang, 9 Aug 2025, SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection, https://arxiv.org/abs/2403.06534
  • Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang, 9 Aug 2025, LVBench: An Extreme Long Video Understanding Benchmark, https://arxiv.org/abs/2406.08035
  • Mihir Godbole, Xiangbo Gao, Zhengzhong Tu, 9 Aug 2025, DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving, https://arxiv.org/abs/2506.17590
  • Zhihao Zhu, Yi Yang, Defu Lian, 9 Aug 2025, TDDBench: A Benchmark for Training data detection, https://arxiv.org/abs/2411.03363
  • Hafsteinn Einarsson, 27 Jul 2025, MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models, https://arxiv.org/abs/2507.20395
  • Chenchen Zhao, Zhengyuan Shi, Xiangyu Wen, Chengjie Liu, Yi Liu, Yunhao Zhou, Yuxiang Zhao, Hefei Feng, Yinan Zhu, Gwok-Waa Wan, Xin Cheng, Weiyu Chen, Yongqi Fu, Chujie Chen, Chenhao Xue, Guangyu Sun, Ying Wang, Yibo Lin, Jun Yang, Ning Xu, Xi Wang and Qiang Xu, 20 Jul 2025, MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs, https://arxiv.org/abs/2507.19525
  • Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues, 25 Jul 2025, MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks, https://arxiv.org/abs/2507.19634
  • Roberto Labadie-Tamayo, Adrian Jaques Böck, Djordje Slijepčević, Xihui Chen, Andreas Babic, Matthias Zeppelzauer, 28 Jul 2025, FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models, https://arxiv.org/abs/2507.20924
  • Xinhan Di, Kristin Qi, Pengqian Yu, 28 Jul 2025, JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1, https://arxiv.org/abs/2507.20987
  • Ali Ismail-Fawaz and Maxime Devanne and Stefano Berretti and Jonathan Weber and Germain Forestier, 28 Jul 2025, Deep Learning for Skeleton Based Human Motion Rehabilitation Assessment: A Benchmark, https://arxiv.org/abs/2507.21018
  • Xuzhao Li and Xuchen Li and Shiyu Hu and Yongzhen Guo and Wentao Zhang, 26 Jul 2025, VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains, https://arxiv.org/abs/2507.09884
  • Hassan Ismail Fawaz, Ganesh Del Grosso, Tanguy Kerdoncuff, Aurelie Boisbunon, Illyyne Saffar, 25 Jul 2025, Deep Unsupervised Domain Adaptation for Time Series Classification: a Benchmark, https://arxiv.org/abs/2312.09857
  • Valay Bundele, Karahan Sarıtaş, Bora Kargi, Oğuz Ata Çal, Kıvanç Tezören, Zohreh Ghaderi, Hendrik Lensch, 26 Jul 2025, Evaluating Self-Supervised Learning in Medical Imaging: A Benchmark for Robustness, Generalizability, and Multi-Domain Impact, https://arxiv.org/abs/2412.19124
  • David Maria Schmidt, Raoul Schubert, Philipp Cimiano, 28 Jul 2025, CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting, https://arxiv.org/abs/2507.21257
  • Haiquan Wang, Yi Chen, Shang Zeng, Yun Bian, Zhe Cui, 29 Jul 2025, GovRelBench: A Benchmark for Government Domain Relevance, https://arxiv.org/abs/2507.21419
  • Leonard Hinckeldey, Elliot Fosong, Elle Miller, Rimvydas Rubavicius, Trevor McInroe, Patricia Wollstadt, Christiane B. Wiebel-Herboth, Subramanian Ramamoorthy, Stefano V. Albrecht, 29 Jul 2025, Assistax: A Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics, https://arxiv.org/abs/2507.21638
  • Amber Huang, Ian Scott Knight, Slava Naprienko, 29 Jul 2025, Data Leakage and Redundancy in the LIT-PCBA Benchmark, https://arxiv.org/abs/2507.21404
  • Qinglong Yang, Haoming Li, Haotian Zhao, Xiaokai Yan, Jingtao Ding, Fengli Xu, Yong Li, 9 Jun 2025, FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents, https://arxiv.org/abs/2507.21071
  • Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz, 28 Jul 2025, StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation, https://arxiv.org/abs/2507.21340
  • Loc Pham, Tung Luu, Thu Vo, Minh Nguyen, Viet Hoang, 29 Jul 2025, VN-MTEB: Vietnamese Massive Text Embedding Benchmark, https://arxiv.org/abs/2507.21500
  • Kristian G. Barman, Sascha Caron, Faegheh Hasibi, Eugene Shalugin, Yoris Marcet, Johannes Otte, Henk W. de Regt, and Merijn Moody, 29 Jul 2025, Towards a Large Physics Benchmark, https://arxiv.org/abs/2507.21695
  • Rohan Hitchcock, Jesse Hoogland, 29 Jul 2025, From Global to Local: A Scalable Benchmark for Local Posterior Sampling, https://arxiv.org/abs/2507.21449
  • Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang, 25 Mar 2025, OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching, https://arxiv.org/abs/2503.21813
  • Jinzhi Wang, Qingke Peng, Haozhou Li, Zeyuan Zeng, Qinfeng Song, Kaixuan Yang, Jiangbo Zhang, Yaoying Wang, Ruimeng Li, Biyi Zhou, 19 Jul 2025, ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing, https://arxiv.org/abs/2507.22911
  • Chengqian Ma, Wei Tao, Yiwen Guo, 30 Jul 2025, C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations, https://arxiv.org/abs/2507.22968
  • Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che, 31 Jul 2025, MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models, https://arxiv.org/abs/2507.23382
  • Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan, 31 Jul 2025, MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks, https://arxiv.org/abs/2507.23511
  • Kai Goebel and Patrik Zips, 31 Jul 2025, Can LLM-Reasoning Models Replace Classical Planning? A Benchmark Study, https://arxiv.org/abs/2507.23589
  • Shimanto Bhowmik, Tawsif Tashwar Dipto, Md Sazzad Islam, Sheryl Hsu, Tahsin Reasat, 31 Jul 2025, Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis, https://arxiv.org/abs/2507.23248
  • Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Kai Chen, Xiaofeng Wang, Baosheng Wang, 31 Jul 2025, LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models, https://arxiv.org/abs/2507.22359
  • Takashi Ishida, Thanawat Lodkaew, Ikko Yamane, 31 Jul 2025, How Can I Publish My LLM Benchmark Without Giving the True Answers Away?, https://arxiv.org/abs/2505.18102
  • Gianluca Carloni, Biagio Brattoli, Seongho Keum, Jongchan Park, Taebum Lee, Chang Ho Ahn, Sergio Pereira, 29 Jul 2025, Pathology Foundation Models are Scanner Sensitive: Benchmark and Mitigation with Contrastive ScanGen Loss, https://arxiv.org/abs/2507.22092
  • Yimeng Liu, Maolin Gan, Yidong Ren, Gen Li, Jingkai Lin, Younsuk Dong, Zhichao Cao, 30 Jul 2025, Hydra-Bench: A Benchmark for Multi-Modal Leaf Wetness Sensing, https://arxiv.org/abs/2507.22685
  • Matej Šprogar, 30 Jul 2025, AGITB: A Signal-Level Benchmark for Evaluating Artificial General Intelligence, https://arxiv.org/abs/2504.04430
  • Thuy Nguyen, Dang Nguyen, Hoang Nguyen, Thuan Luong, Long Hoang Dang, Viet Dac Lai, 30 Jul 2025, OWLViz: An Open-World Benchmark for Visual Question Answering, https://arxiv.org/abs/2503.07631
  • Xiang Xiang, Zhuo Xu, Yao Deng, Qinhao Zhou, Yifan Liang, Ke Chen, Qingfang Zheng, Yaowei Wang, Xilin Chen, Wen Gao, 30 Jul 2025, OpenEarthSensing: Large-Scale Fine-Grained Benchmark for Open-World Remote Sensing, https://arxiv.org/abs/2502.20668
  • Muhammad Farid Adilazuarda, Musa Izzanardi Wijanarko, Lucky Susanto, Khumaisa Nur'aini, Derry Wijaya, Alham Fikri Aji, 25 Feb 2025, NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts, https://arxiv.org/abs/2502.18148
  • Kejia Gao, Liguo Zhou, Mingjun Liu, Alois Knoll, 1 Aug 2025, E2E Parking Dataset: An Open Benchmark for End-to-End Autonomous Parking, https://arxiv.org/abs/2504.10812
  • Junjie Shi, Wei Ma, Shi Ying, Lingxiao Jiang, Yang Liu, Bo Du, 2 Aug 2025, Importance Sampling is All You Need: Predict LLM's performance on new benchmark by reusing existing benchmark, https://arxiv.org/abs/2508.01203
  • Zihan Zheng, Tianle Cui, Chuwen Xie, Jiahui Zhang, Jiahui Pan, Lewei He, Qianglong Chen, 2 Aug 2025, NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset, https://arxiv.org/abs/2508.01330
  • Yuanzhe Shen, Kaimin Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 2 Aug 2025, TripTailor: A Real-World Benchmark for Personalized Travel Planning, https://arxiv.org/abs/2508.01432
  • Xuan Liu, Siru Ouyang, Xianrui Zhong, Jiawei Han, Huimin Zhao, 1 Aug 2025, FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models, https://arxiv.org/abs/2508.01055
  • Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu, 24 Jul 2025, EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow, https://arxiv.org/abs/2507.22929
  • Lyle Regenwetter, Yazan Abu Obaideh, Fabien Chiotti, Ioanna Lykourentzou, Faez Ahmed, 25 May 2025, Bike-Bench: A Bicycle Design Benchmark for Generative Models with Objectives and Constraints, https://arxiv.org/abs/2508.00830
  • Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning, 2 Aug 2025, WebDS: An End-to-End Benchmark for Web-based Data Science, https://arxiv.org/abs/2508.01222
  • Rushin H. Gindra, Giovanni Palla, Mathias Nguyen, Sophia J. Wagner, Manuel Tran, Fabian J Theis, Dieter Saur, Lorin Crawford, Tingying Peng, 2 Aug 2025, A Large-Scale Benchmark of Cross-Modal Learning for Histology and Gene Expression in Spatial Transcriptomics, https://arxiv.org/abs/2508.01490
  • Amir DN Cohen, Hilla Merhav, Yoav Goldberg, Reut Tsarfaty, 3 Aug 2025, HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark, https://arxiv.org/abs/2508.01812
  • Andrea Dosi, Semanto Mondal, Rajib Chandra Ghosh, Massimo Brescia, Giuseppe Longo, 3 Aug 2025, Less is More: AMBER-AFNO -- a New Benchmark for Lightweight 3D Medical Image Segmentation, https://arxiv.org/abs/2508.01941
  • Wanqi Yang, Yanda Li, Yunchao Wei, Meng Fang, Ling Chen, 4 Aug 2025, SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models, https://arxiv.org/abs/2508.02018
  • Yebo Peng, Zixiang Liu, Yaoming Li, Zhizhuo Yang, Xinye Xu, Bowen Ye, Weijun Yuan, Zihan Wang, Tong Yang, 4 Aug 2025, Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems, https://arxiv.org/abs/2508.02208
  • Gustaf Ahdritz, Anat Kleiman, 4 Aug 2025, The SMeL Test: A simple benchmark for media literacy in language models, https://arxiv.org/abs/2508.02074
  • Ivan Karpukhin, Foma Shipilov, Andrey Savchenko, 2 Aug 2025, HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?, https://arxiv.org/abs/2406.14341
  • Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje, 4 Aug 2025, DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA, https://arxiv.org/abs/2412.05430
  • Junying Wang, Wenzhe Li, Yalun Wu, Yingji Liang, Yijin Guo, Chunyi Li, Haodong Duan, Zicheng Zhang, Guangtao Zhai, 2 Aug 2025, Affordance Benchmark for MLLMs, https://arxiv.org/abs/2506.00893
  • Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang, 4 Aug 2025, VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos, https://arxiv.org/abs/2506.10857
  • Xiaohao Liu, Xiaobo Xia, Zhuo Huang, See-Kiong Ng, Tat-Seng Chua, 4 Aug 2025, Towards Modality Generalization: A Benchmark and Prospective Analysis, https://arxiv.org/abs/2412.18277
  • Jaehyun Jeon, Min Soo Kim, Jang Han Yoon, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu, 4 Aug 2025, Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding, https://arxiv.org/abs/2505.05026
  • Feng Rui, Zhiyao Luo, Wei Wang, Yuting Song, Yong Liu, Tingting Zhu, Jianqing Li, and Xingyao Wang, 5 Aug 2025, CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment, https://arxiv.org/abs/2508.03360
  • Haoran Liu, Yihan Zhan, Mingzhe Liu, Yanhua Liu, Peng Li, Zhuo Zuo, Bingqi Liu, Runxi Liu, 3 Aug 2025, Pulse Shape Discrimination Algorithms: Survey and Benchmark, https://arxiv.org/abs/2508.02750
  • Yahia Dalbah, Marcel Worring, Yen-Chia Hsu, 1 Aug 2025, Veli: Unsupervised Method and Unified Benchmark for Low-Cost Air Quality Sensor Correction, https://arxiv.org/abs/2508.02724
  • Yihao Ang, Qiang Wang, Qiang Huang, Yifan Bao, Xinyu Xi, Anthony K. H. Tung, Chen Jin, Zhiyong Huang, 3 Aug 2025, CTBench: Cryptocurrency Time Series Generation Benchmark, https://arxiv.org/abs/2508.02758
  • Zixuan Gu, Qiufeng Fan, Long Sun, Yang Liu, Xiaojun Ye, 5 Aug 2025, VFLAIR-LLM: A Comprehensive Framework and Benchmark for Split Learning of LLMs, https://arxiv.org/abs/2508.03097
  • Longling Geng and Edward Y. Chang, 5 Aug 2025, REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks, https://arxiv.org/abs/2502.18836
  • Alexis Roger, Prateek Humane, Daniel Z. Kaplan, Kshitij Gupta, Qi Sun, George Adamopoulos, Jonathan Siu Chi Lim, Quentin Anthony, Edwin Fennell, Irina Rish, 5 Aug 2025, CHIRP: A Fine-Grained Benchmark for Open-Ended Response Evaluation in Vision-Language Models, https://arxiv.org/abs/2501.09672
  • Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng, 5 Aug 2025, ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark, https://arxiv.org/abs/2506.10960
  • Yue Zhou, Yi Chang, Yuan Wu, 6 Aug 2025, ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges, https://arxiv.org/abs/2508.04576
  • Ashutosh Bandooni and Brindha Subburaj, 31 Jul 2025, GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models, https://arxiv.org/abs/2508.03737
  • Xiao Wang, Ziwen Wang, Wentao Wu, Anjie Wang, Jiashu Wu, Yantao Pan, Chenglong Li, 6 Aug 2025, Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark, https://arxiv.org/abs/2508.04260
  • Xiao Wang, Xufeng Lou, Shiao Wang, Ju Huang, Lan Chen, Bo Jiang, 6 Aug 2025, Long-Term Visual Object Tracking with Event Cameras: An Associative Memory Augmented Tracker and A Benchmark Dataset, https://arxiv.org/abs/2403.05839
  • Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, Ziran Wang, 5 Aug 2025, NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models, https://arxiv.org/abs/2503.12772
  • Xiao Wang, Haiyang Wang, Shiao Wang, Qiang Chen, Jiandong Jin, Haoyu Song, Bo Jiang, Chenglong Li, 6 Aug 2025, RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework, https://arxiv.org/abs/2504.10018
  • Dexuan Xu, Jieyi Wang, Zhongyan Chai, Yongzhi Cao, Hanpin Wang, Huamin Zhang, Yu Huang, 7 Aug 2025, MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models, https://arxiv.org/abs/2508.05083
  • Pouyan Navard, Yasemin Ozkut, Srikar Adhikari, Elaine Situ-LaCasse, Josie Acuña, Adrienne Yarnish, Alper Yilmaz, 5 Aug 2025, ERDES: A Benchmark Video Dataset for Retinal Detachment and Macular Status Classification in Ocular Ultrasound, https://arxiv.org/abs/2508.04735
  • Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov, 7 Aug 2025, ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents, https://arxiv.org/abs/2410.06703
  • Yumeng Fu, Jiayin Zhu, Lingling Zhang, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Yanrui Wu, Wenjun Wu, 8 Aug 2025, GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines, https://arxiv.org/abs/2508.06226
  • Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique, 5 Aug 2025, Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark, https://arxiv.org/abs/2508.05674
  • Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo, 7 Aug 2025, INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance, https://arxiv.org/abs/2406.09105
  • Zheda Mai, Arpita Chowdhury, Zihe Wang, Sooyoung Jeon, Lemeng Wang, Jiacheng Hou, Wei-Lun Chao, 8 Aug 2025, AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models, https://arxiv.org/abs/2506.09082
  • Nikolaos Dionelis, Alessandra Feliciotti, Mattia Marconcini, Devis Peressutti, Nika Oman Kadunc, JaeWan Park, Hagai Raja Sinulingga, Steve Andreas Immanuel, Ba Tran, Caroline Arnold, Nicolas Longépé, 8 Aug 2025, Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge, https://arxiv.org/abs/2502.13818
  • Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, Isil Dillig, 8 Aug 2025, CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation, https://arxiv.org/abs/2504.15254
  • Haiyun Guo, ZhiYan Hou, Yu Chen, Jinghan He, Yandu Sun, Yuzhe Zhou, Shujing Guo, Kuan Zhu, Jinqiao Wang, 31 Jul 2025, MLLM-CBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis, https://arxiv.org/abs/2508.08275
  • Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo, 5 Aug 2025, Putnam-AXIOM: A Functional and Static Benchmark, https://arxiv.org/abs/2508.08292
  • Manuel Herrador, 13 Aug 2025, The PacifAIst Benchmark: Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety?, https://arxiv.org/abs/2508.09762
  • Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, Ziyu Guo, Zhihong Zhu, Hao Wu, Haixin Wang, Yefeng Zheng, Xiaojiang Peng, Xian Wu, Kun Wang, Xiangang Li, Jieping Ye, Pheng-Ann Heng, 11 Aug 2025, MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models, https://arxiv.org/abs/2508.09210
  • Amir Hosseinian, Ashkan Dehghani Zahedani, Umer Mansoor, Noosheen Hashemi, Mark Woodward, 13 Aug 2025, January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis, https://arxiv.org/abs/2508.09966
  • Kechen Li, Yaotian Tao, Ximing Wen, Quanwei Sun, Zifei Gong, Chang Xu, Xizhe Zhang, Tianbo Ji, 13 Aug 2025, GridRoute: A Benchmark for LLM-Based Route Planning with Cardinal Movement in Grid Environments, https://arxiv.org/abs/2505.24306
  • Grigor Bezirganyan, Sana Sellami, Laure Berti-Équille, Sébastien Fournier, 13 Aug 2025, LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data, https://arxiv.org/abs/2406.09864
  • Chunan Liu, Aurelien Pelissier, Yanjun Shao, Lilian Denzler, Andrew C.R. Martin, Brooks Paige, María Rodríguez Martínez, 13 Aug 2025, AbRank: A Benchmark Dataset and Metric-Learning Framework for Antibody-Antigen Affinity Ranking, https://arxiv.org/abs/2506.17857
  • Abhishek Kolari, Mohammadhossein Khojasteh, Yifan Jiang, Floris den Hengst, Filip Ilievski, 14 Aug 2025, ORBIT: An Object Property Reasoning Benchmark for Visual Inference Tasks, https://arxiv.org/abs/2508.10956
  • Wenpeng Xing, Lanyi Wei, Haixiao Hu, Rongchang Li, Mohan Li, Changting Lin, Meng Han, 14 Aug 2025, SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth, https://arxiv.org/abs/2508.11009
  • Beichen Guo, Zhiyuan Wen, Yu Yang, Peng Gao, Ruosong Yang, Jiaxing Shen, 15 Aug 2025, SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems, https://arxiv.org/abs/2508.11310
  • Hongtao Liu, Zhicheng Du, Zihe Wang and Weiran Shen, 16 Aug 2025, CHBench: A Cognitive Hierarchy Benchmark for Evaluating Strategic Reasoning Capability of LLMs, https://arxiv.org/abs/2508.11944
  • Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang, 16 Aug 2025, FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction, https://arxiv.org/abs/2508.11987
  • Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, Vincent Bissonnette, 18 Aug 2025, HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds, https://arxiv.org/abs/2508.12782
  • Fan Li, Xiaoyang Wang, Wenjie Zhang, Ying Zhang, Xuemin Lin, 17 Aug 2025, DHG-Bench: A Comprehensive Benchmark on Deep Hypergraph Learning, https://arxiv.org/abs/2508.12244
  • Manuela Imbriani, Gina Belmonte, Mieke Massink, Alessandro Tofani, Vincenzo Ciancia, 18 Aug 2025, A Multi-Resolution Benchmark Framework for Spatial Reasoning Assessment in Neural Networks, https://arxiv.org/abs/2508.12741
  • Ananya Singha, Harshita Sahijwani, Walt Williams, Emmanuel Aboah Boateng, Nick Hausman, Miguel Di Luca, Keegan Choudhury, Chaya Binet, Vu Le, Tianwei Chen, Oryan Rokeah Chen, Sulaiman Vesal, Sadid Hasan, 14 Aug 2025, Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs, https://arxiv.org/abs/2508.11715
  • Javier Muñoz-Haro, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, 14 Aug 2025, Privacy-Aware Detection of Fake Identity Documents: Methodology, Benchmark, and Improved Detection Methods (FakeIDet2), https://arxiv.org/abs/2508.11716
  • Elon Ezra, Ariel Weizman, Amos Azaria, 17 Aug 2025, The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution, https://arxiv.org/abs/2508.12277
  • Zhiyuan Ning, Tianle Gu, Jiaxin Song, Shixin Hong, Lingyu Li, Huacan Liu, Jie Li, Yixu Wang, Meng Lingyu, Yan Teng, Yingchun Wang, 18 Aug 2025, LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models, https://arxiv.org/abs/2508.12733
  • Krzysztof Kotowski, Christoph Haskamp, Jacek Andrzejewski, Bogdan Ruszczak, Jakub Nalepa, Daniel Lakey, Peter Collins, Aybike Kolmas, Mauro Bartesaghi, Jose Martinez-Heras, Gabriele De Canio, 17 Aug 2025, European Space Agency Benchmark for Anomaly Detection in Satellite Telemetry, https://arxiv.org/abs/2406.17826
  • Bryan L. M. de Oliveira, Luana G. B. Martins, Bruno Brandão, Murilo L. da Luz, Telma W. de L. Soares, Luckeciano C. Melo, 16 Aug 2025, Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning, https://arxiv.org/abs/2410.14038
  • Hyunjong Ok, Jaeho Lee, 18 Aug 2025, S2Cap: A Benchmark and a Baseline for Singing Style Captioning, https://arxiv.org/abs/2409.09866
  • Haohang Xu, Chengjie Liu, Qihang Wang, Wenhao Huang, Yongjian Xu, Weiyu Chen, Anlan Peng, Zhijun Li, Bo Li, Lei Qi, Jun Yang, Yuan Du, and Li Du, 27 Jun 2025, Image2Net: Datasets, Benchmark and Hybrid Framework to Convert Analog Circuit Diagrams into Netlists, https://arxiv.org/abs/2508.13157
  • Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen, 14 Aug 2025, MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents, https://arxiv.org/abs/2508.13186
  • Yixuan Yang and Daoyuan Wu and Yufan Chen, 17 Aug 2025, MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols, https://arxiv.org/abs/2508.13220
  • James Meaden, Michał Jarosz, Piotr Jodłowski, Grigori Melnik, 19 Aug 2025, COMPASS: A Multi-Dimensional Benchmark for Evaluating Code Generation in Large Language Models, https://arxiv.org/abs/2508.13757
  • Guanghao Jin, Jingpei Wu, Tianpei Guo, Yiyi Niu, Weidong Zhou, Guoyang Liu, 12 Aug 2025, KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge, https://arxiv.org/abs/2508.14080
  • Chanyeol Choi, Jihoon Kwon, Alejandro Lopez-Lira, Chaewoon Kim, Minjae Kim, Juneha Hwang, Jaeseon Ha, Hojun Choi, Suyeol Yun, Yongjin Kim, and Yongjae Lee, 7 Aug 2025, FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering, https://arxiv.org/abs/2508.14052
  • Sujit Roy, Dinesha V. Hegde, Johannes Schmude, Amy Lin, Vishal Gaur, Rohit Lal, Kshitiz Mandal, Talwinder Singh, Andrés Muñoz-Jaramillo, Kang Yang, Chetraj Pandey, Jinsu Hong, Berkay Aydin, Ryan McGranaghan, Spiridon Kasapis, Vishal Upendran, Shah Bahauddin, Daniel da Silva, Marcus Freitag, Iksha Gurung, Nikolai Pogorelov, Campbell Watson, Manil Maskey, Juan Bernabe-Moreno, Rahul Ramachandran, 18 Aug 2025, SuryaBench: Benchmark Dataset for Advancing Machine Learning in Heliophysics and Space Weather Prediction, https://arxiv.org/abs/2508.14107
  • Tapio Pitk\"aranta, 20 Aug 2025, The NordDRG AI Benchmark for Large Language Models, https://arxiv.org/abs/2506.13790
  • Ningyi Liao, Haoyu Liu, Zulun Zhu, Siqiang Luo, Laks V.S. Lakshmanan, 20 Aug 2025, A Comprehensive Benchmark on Spectral GNNs: The Impact on Efficiency, Memory, and Effectiveness, https://arxiv.org/abs/2406.09675
  • Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran, 21 Aug 2025, GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning, https://arxiv.org/abs/2508.15690
  • Chenghao Zhang, Qingqing Long, Ludi Wang, Wenjuan Cui, Jianjun Yu, Yi Du, 21 Aug 2025, CITE: A Comprehensive Benchmark for Heterogeneous Text-Attributed Graphs on Catalytic Materials, https://arxiv.org/abs/2508.15392
  • Jiahao Xu (Ohio State University, USA), Changchang Yin (Ohio State University Wexner Medical Center, USA), Odysseas Chatzipanagiotou (Ohio State University Wexner Medical Center, USA), Diamantis Tsilimigras (Ohio State University Wexner Medical Center, USA), Kevin Clear (Ohio State University Wexner Medical Center, USA), Bingsheng Yao (Northeastern University, USA), Dakuo Wang (Northeastern University, USA), Timothy Pawlik (Ohio State University Wexner Medical Center, USA), Ping Zhang (Ohio State University, USA), 21 Aug 2025, SurgWound-Bench: A Benchmark for Surgical Wound Diagnosis, https://arxiv.org/abs/2508.15189
  • Nikita Kachaev, Andrei Spiridonov, Andrey Gorodetsky, Kirill Muravyev, Nikita Oskolkov, Aditya Narendra, Vlad Shakhuro, Dmitry Makarov, Aleksandr I. Panov, Polina Fedotova, Alexey K. Kovalev, 21 Aug 2025, Mind and Motion Aligned: A Joint Evaluation IsaacSim Benchmark for Task Planning and Low-Level Policies in Mobile Manipulation, https://arxiv.org/abs/2508.15663
  • Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, Xiangyang Li, 19 Aug 2025, MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers, https://arxiv.org/abs/2508.14925
  • Changshun Wu, Weicheng He, Chih-Hong Cheng, Xiaowei Huang, Saddek Bensalem, 20 Aug 2025, Revisiting Out-of-Distribution Detection in Real-time Object Detection: From Benchmark Pitfalls to a New Mitigation Paradigm, https://arxiv.org/abs/2503.07330
  • Fei Lei, Yibo Yang, Wenxiu Sun, Dahua Lin, 22 Aug 2025, MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use, https://arxiv.org/abs/2508.16260
  • Mohan Jiang, Jin Gao, Jiahao Zhan, Dequan Wang, 14 Aug 2025, MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding, https://arxiv.org/abs/2508.15802
  • Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans, 18 Aug 2025, A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains, https://arxiv.org/abs/2508.15832
  • Ahmed Allam, Youssef Mansour, and Mohamed Shalan, 21 Aug 2025, ASIC-Agent: An Autonomous Multi-Agent System for ASIC Design with Benchmark Evaluation, https://arxiv.org/abs/2508.15940
  • Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie, 21 Aug 2025, Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset, https://arxiv.org/abs/2508.15986
  • Mahinthan Chandramohan, Jovan Jancic, Yuntong Zhang and Padmanabhan Krishnan, 22 Aug 2025, From Benchmark Data To Applicable Program Repair: An Experience Report, https://arxiv.org/abs/2508.16071
  • Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu, 22 Aug 2025, RoMedQA: The First Benchmark for Romanian Medical Question Answering, https://arxiv.org/abs/2508.16390
  • Yakup Abrek Er, Ilker Kesen, Gözde Gül Şahin, Aykut Erdem, 22 Aug 2025, Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish, https://arxiv.org/abs/2508.16431
  • Adil Bahaj, Mounir Ghogho, 22 Aug 2025, PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark, https://arxiv.org/abs/2508.16439
  • Sam Earle, Graham Todd, Yuchen Li, Ahmed Khalifa, Muhammad Umair Nasir, Zehua Jiang, Andrzej Banburski-Fahey, Julian Togelius, 22 Aug 2025, PuzzleJAX: A Benchmark for Reasoning and Learning, https://arxiv.org/abs/2508.16821
  • Nilay Pande, Sahiti Yerramilli, Jayant Sravan Tamarapalli, Rynaa Grover, 24 Aug 2025, MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes, https://arxiv.org/abs/2508.17180
  • Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson, 25 Aug 2025, SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models, https://arxiv.org/abs/2508.18179
  • Robert Yang, 25 Aug 2025, Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery, https://arxiv.org/abs/2508.17681
  • Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Lei Bai, Yunqi Cai, Xi Dai, Shufei Zhang, Jinguang Cheng, Zhong Fang, Hongming Weng, 25 Aug 2025, CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics, https://arxiv.org/abs/2508.18124
  • Fangxin Shang, Yuan Xia, Dalu Yang, Yahui Wang, Binglin Yang, 21 Aug 2025, MedRepBench: A Comprehensive Benchmark for Medical Report Interpretation, https://arxiv.org/abs/2508.16674
  • Liane Makatura, Benjamin Jones, Siyuan Bian, Wojciech Matusik, 25 Aug 2025, MetaGen: A DSL, Database, and Benchmark for VLM-Assisted Metamaterial Generation, https://arxiv.org/abs/2508.17568
  • Wei Xiong and Jiangtong Li and Jie Li and Kun Zhu, 25 Aug 2025, EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation of EEG Foundation Models, https://arxiv.org/abs/2508.17742
  • Keke Lian and Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li and Dong Zhang, 25 Aug 2025, A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code, https://arxiv.org/abs/2508.18106
  • Yajing Yang, Qian Liu, Min-Yen Kan, 23 Aug 2025, DataTales: A Benchmark for Real-World Intelligent Data Narration, https://arxiv.org/abs/2410.17859
  • Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish, 23 Aug 2025, MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark, https://arxiv.org/abs/2506.05587
  • Linbo Cao, Jinman Zhao, 23 Jul 2025, Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks, https://arxiv.org/abs/2507.17747
  • Jianhao Chen, Junyang Ren, Wentao Ding, Haoyuan Ouyang, Wei Hu, Yuzhong Qu, 23 Jul 2025, Conflict Detection for Temporal Knowledge Graphs:A Fast Constraint Mining Algorithm and New Benchmarks, https://arxiv.org/abs/2312.11053
  • Fred Mutisya (1 and 2), Shikoh Gitau (1), Christine Syovata (2), Diana Oigara (2), Ibrahim Matende (2), Muna Aden (2), Munira Ali (2), Ryan Nyotu (2), Diana Marion (2), Job Nyangena (2), Nasubo Ongoma (1), Keith Mbae (1), Elizabeth Wamicha (1), Eric Mibuari (1), Jean Philbert Nsengemana (3), Talkmore Chidede (4) ((1) Qhala (Nairobi, Kenya), (2) Kenya Medical Association (Nairobi, Kenya), (3) Africa CDC (Addis Ababa, Ethiopia), (4) AfCFTA (Accra, Ghana)), 22 Jul 2025, Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens, https://arxiv.org/abs/2507.16322
  • Roland Pihlakas, Joel Pyykkö, 22 Jul 2025, From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent AI safety benchmarks, https://arxiv.org/abs/2410.00081
  • Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu, 10 Aug 2025, Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks, https://arxiv.org/abs/2508.07179
  • Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, and Emad Barsoum, 31 Jul 2025, Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks, https://arxiv.org/abs/2507.23194
  • Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim, 25 Jul 2025, Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and Benchmarks, https://arxiv.org/abs/2507.19684
  • Khloud AL Jallad, Nada Ghneim, Ghaida Rebdawi, 27 Jul 2025, Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?, https://arxiv.org/abs/2507.20419
  • Fred Mutisya (1,2), Shikoh Gitau (1), Nasubo Ongoma (1), Keith Mbae (1), Elizabeth Wamicha (1), 31 Jul 2025, Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench, https://arxiv.org/abs/2508.00081
  • Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen (Cherise) Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert, 30 Jul 2025, Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models, https://arxiv.org/abs/2508.00923
  • Olawale Salaudeen, Nicole Chiou, Shiny Weng, Sanmi Koyejo, 2 Aug 2025, Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?, https://arxiv.org/abs/2504.00186
  • Zizhan Ma, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Wenting Chen, Linlin Shen, 6 Aug 2025, Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models, https://arxiv.org/abs/2508.04325
  • Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang, 7 Aug 2025, Establishing Best Practices for Building Rigorous Agentic Benchmarks, https://arxiv.org/abs/2507.02825
  • Prathamesh Kalamkar, Janani Venugopalan Ph.D., Vivek Raghavan Ph.D., 13 Jul 2021, Indian Legal NLP Benchmarks: A Survey, https://arxiv.org/abs/2107.06056
  • Serina Chang, Ashton Anderson, Jake M. Hofman, 12 Aug 2025, ChatBench: From Static Benchmarks to Human-AI Evaluation, https://arxiv.org/abs/2504.07114
  • Martin Pelikan, Sheikh Shams Azam, Vitaly Feldman, Jan "Honza" Silovsky, Kunal Talwar, Christopher G. Brinton, Tatiana Likhomanenko, 14 Aug 2025, Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping, https://arxiv.org/abs/2310.00098
  • Shengbo Wang, Mingwei Liu, Zike Li, Anji Li, Yanlin Wang, Xin Peng, Zibin Zheng, 18 Aug 2025, EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing, https://arxiv.org/abs/2508.13003

Research on Model Evaluation

  • Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
  • Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, Andy Zou, 23 May 2024, Lessons from the Trenches on Reproducible Evaluation of Language Models, https://arxiv.org/abs/2405.14782 (Model evaluation theory and practice with the lm-eval test harness tool; a usage sketch appears at the end of this section.)
  • Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
  • Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, Sandipan Dandapat, December 2023, Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing https://aclanthology.org/2023.emnlp-main.418/ PDF: https://aclanthology.org/2023.emnlp-main.418.pdf
  • Yifan Wei, Yisong Su, Huanhuan Ma, Xiaoyan Yu, Fangyu Lei, Yuanzhe Zhang, Jun Zhao, Kang Liu, 8 Oct 2023, MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models, https://arxiv.org/abs/2310.05157
  • George Cybenko, Joshua Ackerman, Paul Lintilhac, 16 Apr 2024, TEL'M: Test and Evaluation of Language Models, https://arxiv.org/abs/2404.10200
  • Gayathri Saranathan, Mahammad Parwez Alam, James Lim, Suparna Bhattacharya, Soon Yee Wong, Foltin Martin & Cong Xu, 2024, DELE: Data Efficient LLM Evaluation, Hewlett Packard Labs, Navigating and Addressing Data Problems for Foundation Models (DPFM) Workshop, ICLR 2024, https://openreview.net/pdf?id=I8bsxPWLNF
  • Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang, 17 Mar 2024 (v2), Compressing LLMs: The Truth is Rarely Pure and Never Simple, https://arxiv.org/abs/2310.01382 Code: https://github.com/VITA-Group/llm-kick (A set of tasks to evaluate LLMs.)
  • Aaditya Naik, Adam Stein, Yinjun Wu, Mayur Naik, Eric Wong, April 2024, TorchQL: A Programming Framework for Integrity Constraints in Machine Learning, Proc. ACM Program. Lang., Vol. 8, No. OOPSLA1, Article 124. PDF: https://dl.acm.org/doi/pdf/10.1145/3649841
  • Tal Peretz, 15 NOV 2023, The Developer's Guide to Production-Grade LLM Apps: Advanced Techniques for Maximizing LLM Performance, https://buildingaistuff.com/p/the-developers-guide-to-production
  • Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, Min Lin, 22 Jan 2024, Benchmarking Large Multimodal Models against Common Corruptions, https://arxiv.org/abs/2401.11943 Code: https://github.com/sail-sg/MMCBench
  • Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, Gordon Wetzstein, Jan 2024, GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation, https://arxiv.org/abs/2401.04092 Code: https://github.com/3DTopia/GPTEval3D Project: https://gpteval3d.github.io/
  • Lan Chu, Jan 2024, LLM Output — Evaluating, debugging, and interpreting, Towards AI, https://pub.towardsai.net/llm-output-evaluating-debugging-and-interpreting-f3bd29e7d14d
  • Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You, 3 Jun 2024, MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures, https://arxiv.org/abs/2406.06565
  • Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo, 9 Jun 2024, The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models, https://arxiv.org/abs/2406.05761 Code: https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench
  • Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi, 7 Jun 2024, WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild, https://arxiv.org/abs/2406.04770 Code: https://hf.co/spaces/allenai/WildBench
  • Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi, 13 Jun 2024, Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning, https://arxiv.org/abs/2406.09170
  • Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
  • Tianle Li, Wei-Lin Chiang, Lisa Dunlap, May 20, 2024, Introducing Hard Prompts Category in Chatbot Arena, https://lmsys.org/blog/2024-05-17-category-hard/
  • Louis Bouchard, Jun 24, 2024, LLM Evals: What, why, when and how, https://www.louisbouchard.ai/llm-evals/
  • Clémentine Fourrier, May 23, 2024, Let's talk about LLM evaluation, https://huggingface.co/blog/clefourrier/llm-evaluation
  • Jeffrey Ip, November 7, 2023, How to Evaluate LLM Applications: The Complete Guide, https://www.confident-ai.com/blog/how-to-evaluate-llm-applications
  • Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
  • Anirban Ghoshal, July 3, 2024, AWS approach to RAG evaluation could help enterprises reduce AI spending, https://www.infoworld.com/article/3715629/aws-new-approach-to-rag-evaluation-could-help-enterprises-reduce-ai-spending.html
  • Tianyi Tang, Yiwen Hu, Bingqian Li, Wenyang Luo, Zijing Qin, Haoxiang Sun, Jiapeng Wang, Shiyi Xu, Xiaoxue Cheng, Geyang Guo, Han Peng, Bowen Zheng, Yiru Tang, Yingqian Min, Yushuo Chen, Jie Chen, Yuanqian Zhao, Luran Ding, Yuhao Wang, Zican Dong, Chunxuan Xia, Junyi Li, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen, 8 Jul 2024, LLMBox: A Comprehensive Library for Large Language Models, https://arxiv.org/abs/2407.05563 Code: https://github.com/RUCAIBox/LLMBox
  • Jin Peng Zhou, Christian K. Belardi, Ruihan Wu, Travis Zhang, Carla P. Gomes, Wen Sun, Kilian Q. Weinberger, 8 Jul 2024, On Speeding Up Language Model Evaluation, https://arxiv.org/abs/2407.06172
  • HELM, July 2024 (accessed), A holistic framework for evaluating foundation models, Stanford University, https://crfm.stanford.edu/helm/lite/latest/
  • Juan Pablo Bottaro, April 25, 2024, Musings on building a Generative AI product, https://www.linkedin.com/blog/engineering/generative-ai/musings-on-building-a-generative-ai-product?_l=en_US
  • Angie Boggust, Venkatesh Sivaraman, Yannick Assogba, Donghao Ren, Dominik Moritz, Fred Hohman, 6 Aug 2024, Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments, https://arxiv.org/abs/2408.03274
  • Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
  • Andrew Ng, Sep 2024, X post, https://x.com/AndrewYNg/status/1829190549842321758 (Dropping token prices for LLMs means developers can focus on the app layer.)
  • Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
  • Lior Solomon, Sep 2024, Gen AI testing strategies and tools, https://medium.com/ai-in-grc/gen-ai-testing-strategies-and-tools-257383e5cbfb
  • Michael Nuñez, September 9, 2024, LightEval: Hugging Face’s open-source solution to AI’s accountability problem, https://venturebeat.com/ai/lighteval-hugging-faces-open-source-solution-to-ais-accountability-problem/
  • Michael Nuñez, September 13, 2024, Microsoft’s Windows Agent Arena: Teaching AI assistants to navigate your PC, https://venturebeat.com/ai/microsofts-windows-agent-arena-teaching-ai-assistants-to-navigate-your-pc/
  • Flow AI, Sep 2024, Flow Judge: An Open Small Language Model for LLM System Evaluations, https://www.flow-ai.com/blog/flow-judge
  • Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, Kate Olszewska, 20 Sep 2024 (v2), Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries, https://arxiv.org/abs/2409.12640 (Long context model evaluation dataset.)
  • Anthony C. Ou, Feb 2024, Large Language Model Routing with Benchmark Datasets, Master's Thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, https://dspace.mit.edu/handle/1721.1/153846
  • Cameron R. Wolfe, Ph.D., Dec 02, 2024, Finetuning LLM Judges for Evaluation: The Prometheus suite, JudgeLM, PandaLM, AutoJ, and more..., https://cameronrwolfe.substack.com/p/finetuned-judge
  • Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu, 10 Dec 2024 (v2), LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, https://arxiv.org/abs/2412.05579 https://github.com/CSHaitao/Awesome-LLMs-as-Judges
  • Liam Seymour, Basar Kutukcu, Sabur Baidya, 19 Dec 2024, Large Language Models on Small Resource-Constrained Systems: Performance Characterization, Analysis and Trade-offs, https://arxiv.org/abs/2412.15352 https://github.com/LiamS57/orin-llm-testing
  • Jie He, Nan Hu, Wanqiu Long, Jiaoyan Chen, Jeff Z. Pan, 22 Dec 2024, MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge, https://arxiv.org/abs/2412.17032 https://github.com/probe2/multi-hop/ (Model evaluation of reasoning abilities.)
  • Latent Space, Dec 28, 2024, The 2025 AI Engineering Reading List: We picked 50 paper/models/blogs across 10 fields in AI Eng: LLMs, Benchmarks, Prompting, RAG, Agents, CodeGen, Vision, Voice, Diffusion, Finetuning. If you're starting from scratch, start here. https://www.latent.space/p/2025-papers
  • Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu, 24 Dec 2024, LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating, https://arxiv.org/abs/2412.18424
  • Y Li, H Jiang, Q Wu, X Luo, S Ahn, C Zhang, AH Abdi, Dec 2024, SharedContextBench: Evaluating Long-Context Methods in KV Cache Reuse, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://neurips2024-enlsp.github.io/papers/paper_93.pdf (Evaluating model performance with KV cache compression.)
  • Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
  • Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
  • Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
  • Lucas C. Cordeiro, Matthew L. Daggitt, Julien Girard-Satabin, Omri Isac, Taylor T. Johnson, Guy Katz, Ekaterina Komendantskaya, Augustin Lemesle, Edoardo Manino, Artjoms Šinkarovs, Haoze Wu, 10 Jan 2025, Neural Network Verification is a Programming Language Challenge, https://arxiv.org/abs/2501.05867
  • Dr. Marcel Müller, Jan 2025, Why Generative-AI Apps’ Quality Often Sucks and What to Do About It: How to get from PoCs to tested high-quality applications in production, https://towardsdatascience.com/why-generative-ai-apps-quality-often-sucks-and-what-to-do-about-it-f84407f263c3
  • Bharani Subramaniam, 13 February 2025, Emerging Patterns in Building GenAI Products, https://martinfowler.com/articles/gen-ai-patterns/
  • Nikhil, February 26, 2025, How to Compare Two LLMs in Terms of Performance: A Comprehensive Web Guide for Evaluating and Benchmarking Language Models, https://www.marktechpost.com/2025/02/26/how-to-compare-two-llms-in-terms-of-performance-a-comprehensive-web-guide-for-evaluating-and-benchmarking-language-models/
  • Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
  • Yansheng Qiu, Li Xiao, Zhaopan Xu, Pengfei Zhou, Zheng Wang, Kaipeng Zhang, 16 May 2025, Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans, https://arxiv.org/abs/2505.11141
  • Brandon Lepine, Gawesha Weerantunga, Juho Kim, Pamela Mishkin, Matthew Beane, 15 May 2025, Evaluations at Work: Measuring the Capabilities of GenAI in Use, https://arxiv.org/abs/2505.10742
  • Rachel Draelos, MD, PhD, May 14, 2025, HealthBench Does Not Evaluate Patient Safety, https://medium.com/data-science-collective/healthbench-does-not-evaluate-patient-safety-11eda5f0eeac
  • Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo, 29 May 2025, ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions, https://arxiv.org/abs/2505.23662 https://github.com/bwookwak/ToolHaystack
  • Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, Wenxuan Zhang, Lifu Huang, Muhao Chen, Lei Hou, Qianru Sun, Xingjun Ma, Zuxuan Wu, Min-Yen Kan, David Lo, Qi Zhang, Heng Ji, Jing Jiang, Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, Yu-Gang Jiang, 26 Apr 2025, Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks, https://arxiv.org/abs/2504.18838
  • Liyun Zhang, Jingcheng Ke, Shenli Fan, Xuanmeng Sha and Zheng Lian, 14 Aug 2025, A Unified Evaluation Framework for Multi-Annotator Tendency Learning, https://arxiv.org/abs/2508.10393
  • Brian Shing-Hei Wong, Joshua Mincheol Kim, Sin-Hang Fung, Qing Xiong, Kelvin Fu-Kiu Ao, Junkang Wei, Ran Wang, Dan Michelle Wang, Jingying Zhou, Bo Feng, Alfred Sze-Lok Cheng, Kevin Y. Yip, Stephen Kwok-Wing Tsui, Qin Cao, 14 Aug 2025, Driving Accurate Allergen Prediction with Protein Language Models and Generalization-Focused Evaluation, https://arxiv.org/abs/2508.10541
  • Xiao Fu, Hossein A. Rahmani, Bin Wu, Jerome Ramos, Emine Yilmaz, Aldo Lipani, 8 Aug 2025, PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs, https://arxiv.org/abs/2508.10028
  • Gal Amram, Eitan Farchi, Shmulik Froimovich, Raviv Gal, and Avi Ziv, 13 Aug 2025, LaajMeter: A Framework for LaaJ Evaluation, https://arxiv.org/abs/2508.10161
  • Jieyu Li, Xin Zhang, and Joey Tianyi Zhou, 14 Aug 2025, AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences, https://arxiv.org/abs/2508.10771
  • Yuzhuo Xiao, Zeyu Han, Yuhan Wang, Huaizu Jiang, 4 Aug 2025, XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs, https://arxiv.org/abs/2508.09999
  • Aditya Ashvin, Rimita Lahiri, Aditya Kommineni, Somer Bishop, Catherine Lord, Sudarsana Reddy Kadiri, Shrikanth Narayanan, 14 Aug 2025, Evaluation of Speech Foundation Models for ASR on Child-Adult Conversations in Autism Diagnostic Sessions, https://arxiv.org/abs/2409.16135
  • Zhe Chen, Daniel Harabor, Ryan Hechnenberger, Nathan R. Sturtevant, 23 Jul 2025, Online Submission and Evaluation System Design for Competition Operations, https://arxiv.org/abs/2507.17730
  • Yutong Liu, Cairong Zhao, Guosheng Hu, 23 Jul 2025, A Comprehensive Evaluation on Quantization Techniques for Large Language Models, https://arxiv.org/abs/2507.17417
  • Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing, 23 Jul 2025, MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs, https://arxiv.org/abs/2507.17476
  • Miguel Carrasco, César González-Martín, José Aranda, Luis Oliveros, 23 Jul 2025, Vision Transformer attention alignment with human visual perception in aesthetic object evaluation, https://arxiv.org/abs/2507.17616
  • Karen Zhou, John Giorgi, Pranav Mani, Peng Xu, Davis Liang, Chenhao Tan, 23 Jul 2025, From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes, https://arxiv.org/abs/2507.17717
  • Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian, 23 Jul 2025, Fairness Evaluation of Large Language Models in Academic Library Reference Services, https://arxiv.org/abs/2507.04224
  • Roman Mayr, Michel Schimpf, Thomas Bohné, 22 Jul 2025, ChatChecker: A Framework for Dialogue System Testing and Evaluation Through Non-cooperative User Simulation, https://arxiv.org/abs/2507.16792
  • Abhash Kumar Jha, Shakiba Moradian, Arjun Krishnakumar, Martin Rapp, Frank Hutter, 22 Jul 2025, confopt: A Library for Implementation and Evaluation of Gradient-based One-Shot NAS Methods, https://arxiv.org/abs/2507.16533
  • Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao, 22 Jul 2025, RAVine: Reality-Aligned Evaluation for Agentic Search, https://arxiv.org/abs/2507.16725
  • Danil Gusak, Anna Volodkevich, Anton Klenitskiy, Alexey Vasilev, Evgeny Frolov, 22 Jul 2025, Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders, https://arxiv.org/abs/2507.16289
  • Jakub Michańków, Paweł Sakowski, Robert Ślepaczuk, 22 Jul 2025, Alternative Loss Function in Evaluation of Transformer Models, https://arxiv.org/abs/2507.16548
  • Bruno Deprez, Toon Vanderschueren, Bart Baesens, Tim Verdonck, Wouter Verbeke, 22 Jul 2025, Network Analytics for Anti-Money Laundering -- A Systematic Literature Review and Experimental Evaluation, https://arxiv.org/abs/2405.19383
  • Xiaoxu Guo, Siyan Liang, Yachao Cui, Juxiang Zhou, Lei Wang, Han Cao, 21 Jul 2025, Multimodal Fine-grained Reasoning for Post Quality Evaluation, https://arxiv.org/abs/2507.17934
  • Rodrigo Moreira, Larissa F. Rodrigues Moreira, Flávio de Oliveira Silva, 23 Jul 2025, Performance Evaluation and Threat Mitigation in Large-scale 5G Core Deployment, https://arxiv.org/abs/2507.17850
  • Maria Vlachou, 24 Jul 2025, Fashion-AlterEval: A Dataset for Improved Evaluation of Conversational Recommendation Systems with Alternative Relevant Items, https://arxiv.org/abs/2507.18017
  • Yefeng Yuan, Yuhong Liu, Liang Cheng, 24 Jul 2025, A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models, https://arxiv.org/abs/2404.14445
  • Gregor Baer, Isel Grau, Chao Zhang, Pieter Van Gorp, 24 Jul 2025, Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation, https://arxiv.org/abs/2506.11790
  • Niket Patel, Randall Balestriero, 23 Jul 2025, Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks, https://arxiv.org/abs/2507.09871
  • Ashray Gupta and Rohan Joseph and Sunny Rai, 23 Jul 2025, Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation, https://arxiv.org/abs/2507.13238
  • Masaki Adachi, Masahiro Fujisawa, Michael A Osborne, 24 Jul 2025, Fixing the Pitfalls of Probabilistic Time-Series Forecasting Evaluation by Kernel Quadrature, https://arxiv.org/abs/2503.06079
  • Gerben van der Hoek, Johan Jeuring and Rogier Bos, 18 Jul 2025, Buggy rule diagnosis for combined steps through final answer evaluation in stepwise tasks, https://arxiv.org/abs/2507.13651
  • Viraj Nishesh Darji, Callie C. Liao, Duoduo Liao, 18 Jul 2025, Automated Interpretation of Non-Destructive Evaluation Contour Maps Using Large Language Models for Bridge Condition Assessment, https://arxiv.org/abs/2507.14107
  • Yudai Hayashi, Shuhei Goda, Yuta Saito, 18 Jul 2025, Off-Policy Evaluation and Learning for Matching Markets, https://arxiv.org/abs/2507.13608
  • Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang, 17 Jul 2025, "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models, https://arxiv.org/abs/2507.13428
  • Steven Lamp, Jason D. Hiser, Anh Nguyen-Tuong, Jack W. Davidson, 17 Jul 2025, PHASE: Passive Human Activity Simulation Evaluation, https://arxiv.org/abs/2507.13505
  • Yuan Gao, Mattia Piccinini, Korbinian Moller, Amr Alanwar, Johannes Betz, 18 Jul 2025, From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios, https://arxiv.org/abs/2502.02145
  • Daniel Commey, Benjamin Appiah, Griffith S. Klogo, and Garth V. Crosby, 18 Jul 2025, ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs, https://arxiv.org/abs/2507.11649
  • Mohita Chowdhury, Yajie Vera He, Jared Joselowitz, Aisling Higham, Ernest Lim, 18 Jul 2025, ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems, https://arxiv.org/abs/2501.08208
  • Dawar Khan and Xinyu Liu and Omar Mena and Donggang Jia and Alexandre Kouyoumdjian and Ivan Viola, 18 Jul 2025, AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results, https://arxiv.org/abs/2502.15761
  • Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, Jinsik Lee, 18 Jul 2025, From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation, https://arxiv.org/abs/2507.08924
  • Shingo Ayabe, Takuto Otomo, Hiroshi Kera, Kazuhiko Kawamoto, 18 Jul 2025, Robustness Evaluation of Offline Reinforcement Learning for Robot Control Against Action Perturbations, https://arxiv.org/abs/2412.18781
  • Anna Sofia Lippolis, Mohammad Javad Saeedizade, Robin Keskisärkkä, Aldo Gangemi, Eva Blomqvist, Andrea Giovanni Nuzzolese, 19 Jul 2025, Large Language Models Assisting Ontology Evaluation, https://arxiv.org/abs/2507.14552
  • Qianchao Wang, Yuxuan Ding, Chuanzhen Jia, Zhe Li, Yaping Du, 21 Jul 2025, Explainable Artificial Intelligence based Soft Evaluation Indicator for Arc Fault Diagnosis, https://arxiv.org/abs/2507.15239
  • Firdaus Ahmed Choudhury, Ethan Leicht, Jude Ethan Bislig, Hangzhi Guo, Amulya Yadav, 20 Jul 2025, Designing User-Centric Metrics for Evaluation of Counterfactual Explanations, https://arxiv.org/abs/2507.15162
  • Amina Dzafic, Merve Kavut, Ulya Bayram, 19 Jul 2025, Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation, https://arxiv.org/abs/2507.14693
  • Chenlei Gong, Yuanhe Tian, Lei Mao, Yan Song, 20 Jul 2025, Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling, https://arxiv.org/abs/2507.15087
  • Devichand Budagam, Ashutosh Kumar, Mahsa Khoshnoodi, Sankalp KJ, Vinija Jain, Aman Chadha, 21 Jul 2025, Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles, https://arxiv.org/abs/2406.12644
  • Peilong Wang, Jason Holmes, Zhengliang Liu, Dequan Chen, Tianming Liu, Jiajian Shen, Wei Liu, 18 Jul 2025, A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options, https://arxiv.org/abs/2412.10622
  • Wanke Xia, Ruoxin Peng, Haoqi Chu, Xinlei Zhu, Zhiyu Yang, Lili Yang, 21 Jul 2025, An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice, https://arxiv.org/abs/2502.13764
  • María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico, 19 Jul 2025, MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation, https://arxiv.org/abs/2502.17163
  • Felix Härer, 19 Jul 2025, Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications, https://arxiv.org/abs/2506.10467
  • Zhijin He, Alan B. McMillan, 21 Jul 2025, Comparative Evaluation of Radiomics and Deep Learning Models for Disease Detection in Chest Radiography, https://arxiv.org/abs/2504.12249
  • Pengfei Zhou, Xiaopeng Peng, Fanrui Zhang, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Zekai Li, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, and Kaipeng Zhang, 9 Aug 2025, MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams, https://arxiv.org/abs/2508.06851
  • Jiawei Zhang, Yifei Zhang, Baozhao Yi, Yao Ren, Qi Jiao, Hanyu Bai, Weiran Jiang, Ziyou Song, 9 Aug 2025, Discovery Learning accelerates battery design evaluation, https://arxiv.org/abs/2508.06985
  • Lin-Han Jia, Si-Yu Han, Wen-Chao Hu, Jie-Jing Shao, Wen-Da Wei, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li, 10 Aug 2025, When Is Prior Knowledge Helpful? Exploring the Evaluation and Selection of Unsupervised Pretext Tasks from a Neuro-Symbolic Perspective, https://arxiv.org/abs/2508.07299
  • Gregory Schuit, Denis Parra, Cecilia Besa, 10 Aug 2025, Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays, https://arxiv.org/abs/2508.07128
  • Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu, 10 Aug 2025, Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks, https://arxiv.org/abs/2508.07179
  • Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, and Masashi Sugiyama, 3 Aug 2025, What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?, https://arxiv.org/abs/2508.06530
  • Bruno L. Pereira, Alan Said, Rodrygo L. T. Santos, 11 Aug 2025, On the Reliability of Sampling Strategies in Offline Recommender Evaluation, https://arxiv.org/abs/2508.05398
  • Xiaohua Feng, Jiaming Zhang, Fengyuan Yu, Chengye Wang, Li Zhang, Kaixiang Li, Yuyuan Li, Chaochao Chen, Jianwei Yin, 26 Jul 2025, A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction, https://arxiv.org/abs/2507.19894
  • Aishwarya Mandyam, Jason Meng, Ge Gao, Jiankai Sun, Mac Schwager, Barbara E. Engelhardt, Emma Brunskill, 26 Jul 2025, PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data, https://arxiv.org/abs/2507.20068
  • Minju Kim, Dongje Yoo, Yeonjun Hwang, Minseok Kang, Namyoung Kim, Minju Gwak, Beong-woo Kwak, Hyungjoo Chae, Harim Kim, Yunjoong Lee, Min Hee Kim, Dayi Jung, Kyong-Mee Chung, Jinyoung Yeo, 25 Jul 2025, Can You Share Your Story? Modeling Clients' Metacognition and Openness for LLM Therapist Evaluation, https://arxiv.org/abs/2507.19643
  • Matin Aghaei, Mohammad Ali Alomrani, Yingxue Zhang, Mahdi Biparva, 26 Jul 2025, When Engineering Outruns Intelligence: A Re-evaluation of Instruction-Guided Navigation, https://arxiv.org/abs/2507.20021
  • Harsh Purohit, Tomoya Nishida, Kota Dohi, Takashi Endo, and Yohei Kawaguchi, 28 Jul 2025, MIMII-Agent: Leveraging LLMs with Function Calling for Relative Evaluation of Anomalous Sound Detection, https://arxiv.org/abs/2507.20666
  • Yonghyun Kim, Wayne Chi, Anastasios N. Angelopoulos, Wei-Lin Chiang, Koichi Saito, Shinji Watanabe, Yuki Mitsufuji, Chris Donahue, 28 Jul 2025, Music Arena: Live Evaluation for Text-to-Music, https://arxiv.org/abs/2507.20900
  • Adrien Bazoge, 28 Jul 2025, MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation, https://arxiv.org/abs/2507.20917
  • Khalid Hasan, Jamil Saquer and Mukulika Ghosh, 17 Jul 2025, Advancing Mental Disorder Detection: A Comparative Evaluation of Transformer and LSTM Architectures on Social Media, https://arxiv.org/abs/2507.19511
  • Hugo Retief, Kayathri Vigneswaran, Surajit Ghosh, Mariangel Garcia Andarcia, Chris Dickens, 28 Jul 2025, Satellite-Surface-Area Machine-Learning Models for Reservoir Storage Estimation: Regime-Sensitive Evaluation and Operational Deployment at Loskop Dam, South Africa, https://arxiv.org/abs/2502.19989
  • Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Jin-Ge Yao, Bowen Qin, Richeng Xuan, Xi Yang, 28 Jul 2025, FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation, https://arxiv.org/abs/2506.09081
  • Afonso Martini Spezia, Mariana Recamonde-Mendoza, 30 Jul 2025, Comparing Cluster-Based Cross-Validation Strategies for Machine Learning Model Evaluation, https://arxiv.org/abs/2507.22299
  • Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje, 4 Aug 2025, DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA, https://arxiv.org/abs/2412.05430
  • Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, Kuntae Kim, 4 Aug 2025, Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons, https://arxiv.org/abs/2411.01281
  • Arthur Cho, 4 Aug 2025, GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics, https://arxiv.org/abs/2508.02926
  • Roshita Bhonsle, Rishav Dutta, Sneha Vavilapalli, Harsh Seth, Abubakarr Jaye, Yapei Chang, Mukund Rungta, Emmanuel Aboah Boateng, Sadid Hasan, Ehi Nosakhare, Soundar Srinivasan, 7 Aug 2025, Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation, https://arxiv.org/abs/2508.05508
  • Dewi S. W. Gould, Bruno Mlodozeniec, Samuel F. Brown, 8 Aug 2025, SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges, https://arxiv.org/abs/2508.06111
  • Yuchen Tian, Kaixin Li, Hao Chen, Ziyang Luo, Hongzhan Lin, Sebastian Schelter, Lun Du, Jing Ma, 13 Aug 2025, AmbiGraph-Eval: Can LLMs Effectively Handle Ambiguous Graph Queries?, https://arxiv.org/abs/2508.09631
  • Seungju Yoo, Hyuk Kwon, Joong-Won Hwang, Kibok Lee, 16 Aug 2025, Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability, https://arxiv.org/abs/2508.12082
  • David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge, 18 Aug 2025, Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation, https://arxiv.org/abs/2508.13144
  • Jun Li, Aaron Aguirre, Junior Moura, Che Liu, Lanhai Zhong, Chenxi Sun, Gari Clifford, Brandon Westover, Shenda Hong, 4 Aug 2025, An Electrocardiogram Foundation Model Built on over 10 Million Recordings with External Evaluation across Multiple Domains, https://arxiv.org/abs/2410.04133
  • Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, Tom Kocmi, 29 Jul 2025, Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation, https://arxiv.org/abs/2504.11829
  • Mahmoud Mohammadi, Yipeng Li, Jane Lo, Wendy Yip, 29 Jul 2025, Evaluation and Benchmarking of LLM Agents: A Survey, https://arxiv.org/abs/2507.21504
  • Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Kai Chen, Xiaofeng Wang, Baosheng Wang, 31 Jul 2025, LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models, https://arxiv.org/abs/2507.22359
  • Ke Miao, Yuke Hu, Xiaochen Li, Wenjie Bao, Zhihao Liu, Zhan Qin, Kui Ren, 2 Aug 2025, Towards Evaluation for Real-World LLM Unlearning, https://arxiv.org/abs/2508.01324
  • Jungkoo Kang, 3 Aug 2025, Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation, https://arxiv.org/abs/2507.02253
  • Jialin Li, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu, 5 Aug 2025, Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework, https://arxiv.org/abs/2508.03622
  • Aditya Pathak, Rachit Gandhi, Vaibhav Uttam, Arnav Ramamoorthy, Pratyush Ghosh, Aaryan Raj Jindal, Shreyash Verma, Aditya Mittal, Aashna Ased, Chirag Khatri, Yashwanth Nakka, Devansh, Jagat Sesh Challa, Dhruv Kumar, 6 Aug 2025, Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics, https://arxiv.org/abs/2503.23989
  • Zachary Robertson, Sanmi Koyejo, 7 Aug 2025, Let's Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes, https://arxiv.org/abs/2508.05469
  • Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, Weinan Zhang, 3 Aug 2025, A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges, https://arxiv.org/abs/2508.05668
  • Tuhina Tripathi, Manya Wadhwa, Greg Durrett, Scott Niekum, 21 Aug 2025, Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation, https://arxiv.org/abs/2504.14716
  • Mingyang Li, Viktor Schlegel, Tingting Mu, Wuraola Oyewusi, Kai Kang, Goran Nenadic, 22 Aug 2025, Evaluation and LLM-Guided Learning of ICD Coding Rationales, https://arxiv.org/abs/2508.16777
  • Patricia Paskov, Michael J. Byun, Kevin Wei, Toby Webster, 22 Jul 2025, Preliminary suggestions for rigorous GPAI model evaluations, https://arxiv.org/abs/2508.00875

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: