Aussie AI

Model Evaluation

  • Last Updated 30 August, 2025
  • by David Spuler, Ph.D.

Leaderboards (Model Evaluation)

Benchmarks for Model Evaluation

  • Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
  • Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, Sandipan Dandapat, December 2023, Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing https://aclanthology.org/2023.emnlp-main.418/ PDF: https://aclanthology.org/2023.emnlp-main.418.pdf
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You, 3 Jun 2024, MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures, https://arxiv.org/abs/2406.06565
  • Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi, 13 Jun 2024, Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning, https://arxiv.org/abs/2406.09170
  • Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
  • Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
  • Petr Spelda and Vit Stritecky, 13 Aug 2025, Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1, https://arxiv.org/abs/2508.10173
  • Pengbo Shen, Yaqing Wang, Ni Mu, Yao Luan, Runpeng Xie, Senhao Yang, Lexiang Wang, Hao Hu, Shuang Xu, Yiqin Yang, Bo Xu, 14 Aug 2025, SC2Arena and StarEvolve: Benchmark and Self-Improvement Framework for LLMs in Complex Decision-Making Tasks, https://arxiv.org/abs/2508.10428
  • Lucie-Aimée Kaffee, Giada Pistilli, Yacine Jernite, 4 Aug 2025, INTIMA: A Benchmark for Human-AI Companionship Behavior, https://arxiv.org/abs/2508.09998
  • Rakesh Thakur, Sneha Sharma, Gauri Chopra, 4 Aug 2025, HiFACTMix: A Code-Mixed Benchmark and Graph-Aware Model for Evidence-Based Political Claim Verification in Hinglish, https://arxiv.org/abs/2508.10001
  • Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen, 13 Aug 2025, mSCoRe: A Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning, https://arxiv.org/abs/2508.10137
  • Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza, 13 Aug 2025, PakBBQ: A Culturally Adapted Bias Benchmark for QA, https://arxiv.org/abs/2508.10186
  • Chenggang Chen, Zhiyu Yang, 13 Aug 2025, No Free Lunch from Audio Pretraining in Bioacoustics: A Benchmark Study of Embeddings, https://arxiv.org/abs/2508.10230
  • Jieyu Li, Xin Zhang, and Joey Tianyi Zhou, 14 Aug 2025, AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences, https://arxiv.org/abs/2508.10771
  • Brooke R. Weborg and Gursel Serpen, 14 Aug 2025, Empirical Investigation into Configuring Echo State Networks for Representative Benchmark Problem Domains, https://arxiv.org/abs/2508.10887
  • Anand Kumar, Harminder Pal Monga, Tapasi Brahma, Satyam Kalra, Navas Sherif, 14 Aug 2025, Mobile-Friendly Deep Learning for Plant Disease Detection: A Lightweight CNN Benchmark Across 101 Classes of 33 Crops, https://arxiv.org/abs/2508.10817
  • Fabian David Schmidt, Ivan Vulić, Goran Glavaš, David Ifeoluwa Adelani, 13 Aug 2025, Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding, https://arxiv.org/abs/2501.06117
  • Yuping Wang and Xiangyu Huang and Xiaokang Sun and Mingxuan Yan and Shuo Xing and Zhengzhong Tu and Jiachen Li, 14 Aug 2025, UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving, https://arxiv.org/abs/2503.24381
  • Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou, 14 Aug 2025, PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts, https://arxiv.org/abs/2508.09848
  • Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang, Wentao Zhang, 22 Jul 2025, CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos, https://arxiv.org/abs/2507.16878
  • Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang, 23 Jul 2025, SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs, https://arxiv.org/abs/2507.17178
  • Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing, 23 Jul 2025, MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs, https://arxiv.org/abs/2507.17476
  • Helen Jin, Shreya Havaldar, Chaehyeon Kim, Anton Xue, Weiqiu You, Helen Qu, Marco Gatti, Daniel A Hashimoto, Bhuvnesh Jain, Amin Madani, Masao Sako, Lyle Ungar, Eric Wong, 22 Jul 2025, The FIX Benchmark: Extracting Features Interpretable to eXperts, https://arxiv.org/abs/2409.13684
  • Xu Yang, Qi Zhang, Shuming Jiang, Yaowen Xu, Zhaofan Zou, Hao Sun, Xuelong Li, 22 Jul 2025, METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark, https://arxiv.org/abs/2507.16206
  • Goeric Huybrechts, Srikanth Ronanki, Sai Muralidhar Jayanthi, Jack Fitzgerald, Srinivasan Veeravanallur, 18 Jul 2025, Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark, https://arxiv.org/abs/2507.15882
  • Eduardo Pacheco, Atila Orhon, Berkin Durmus, Blaise Munyampirwa, Andrey Leonov, 22 Jul 2025, SDBench: A Comprehensive Benchmark Suite for Speaker Diarization, https://arxiv.org/abs/2507.16136
  • Yasser Ashraf, Ahmed Sharshar, Velibor Bojkovic, Bin Gu, 22 Jul 2025, SPACT18: Spiking Human Action Recognition Benchmark Dataset with Complementary RGB and Thermal Modalities, https://arxiv.org/abs/2507.16151
  • Shalaka Satheesh, Katrin Klug, Katharina Beckh, Héctor Allende-Cid, Sebastian Houben, Teena Hassan, 22 Jul 2025, GG-BBQ: German Gender Bias Benchmark for Question Answering, https://arxiv.org/abs/2507.16410
  • Alireza Dizaji, Benedict Aaron Tjandra, Mehrab Hamidi, Shenyang Huang, Guillaume Rabusseau, 22 Jul 2025, T-GRAB: A Synthetic Diagnostic Benchmark for Learning on Temporal Graphs, https://arxiv.org/abs/2507.10183
  • Huan Liu, Shusen Yang, Yuzhe Zhang, Mengze Wang, Fanyu Gong, Chengxi Xie, Guanjian Liu, Zejun Liu, Yong-Jin Liu, Bao-Liang Lu, Dalin Zhang, 22 Jul 2025, LibEER: A Comprehensive Benchmark and Algorithm Library for EEG-based Emotion Recognition, https://arxiv.org/abs/2410.09767
  • Pierre Sermanet, Anirudha Majumdar, Vikas Sindhwani, 22 Jul 2025, SciFi-Benchmark: Leveraging Science Fiction To Improve Robot Behavior, https://arxiv.org/abs/2503.10706
  • Mustafa Chasmai, Wuao Liu, Subhransu Maji, Grant Van Horn, 21 Jul 2025, Audio Geolocation: A Natural Sounds Benchmark, https://arxiv.org/abs/2505.18726
  • Zehan Li, Hongjie Chen, Yuxin Zhang, Jing Zhou, Xuening Wang, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li, 24 Jul 2025, TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios, https://arxiv.org/abs/2507.18061
  • Minje Park, Jeonghwa Lim, Taehyung Yu, and Sunghoon Joo, 24 Jul 2025, A Multi-Dataset Benchmark for Semi-Supervised Semantic Segmentation in ECG Delineation, https://arxiv.org/abs/2507.18323
  • Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang, 20 Jul 2025, MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation, https://arxiv.org/abs/2507.17773
  • Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer, 24 Jul 2025, BEARCUBS: A benchmark for computer-using web agents, https://arxiv.org/abs/2503.07919
  • Xiao Wang, Qian Zhu, Shujuan Wu, Bo Jiang, Shiliang Zhang, Yaowei Wang, Yonghong Tian, Bin Luo, 18 Jul 2025, When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework, https://arxiv.org/abs/2507.13659
  • Ishant Chintapatla, Kazuma Choji, Naaisha Agarwal, Andrew Lin, Hannah You, Charles Duong, Kevin Zhu, Sean O'Brien, Vasu Sharma, 17 Jul 2025, COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark, https://arxiv.org/abs/2507.13405
  • Jie Ouyang, Tingyue Pan, Mingyue Cheng, Ruiran Yan, Yucong Luo, Jiaying Lin, Qi Liu, 18 Jul 2025, HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation, https://arxiv.org/abs/2503.04800
  • Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, Jinsik Lee, 18 Jul 2025, From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation, https://arxiv.org/abs/2507.08924
  • Zhijiang Tang, Jiaxin Qi, Yuhua Zheng, Jianqiang Huang, 15 Jul 2025, A Comprehensive Benchmark for Electrocardiogram Time-Series, https://arxiv.org/abs/2507.14206
  • Lingbo Li, Anuradha Mathrani, Teo Susnjak, 20 Jul 2025, What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction, https://arxiv.org/abs/2507.15152
  • Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li, 20 Jul 2025, CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents, https://arxiv.org/abs/2407.01511
  • Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang, 19 Jul 2025, TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios, https://arxiv.org/abs/2505.12891
  • María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico, 19 Jul 2025, MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation, https://arxiv.org/abs/2502.17163
  • Tom Sander, Pierre Fernandez, Saeed Mahloujifar, Alain Durmus, Chuan Guo, 21 Jul 2025, Detecting Benchmark Contamination Through Watermarking, https://arxiv.org/abs/2502.17259
  • Ziyu Wang (1), Tao Xue (1), Jingyuan Li (1), Haibin Zhang (1), Zhiqiang Xu (3), Gaofei Xu (4), Zhen Wang (5), Yanbin Wang (2), Zhiquan Liu (6) ((1) Xidian University, (2) Shenzhen MSU-BIT University, (3) Jiangxi University of Science and Technology, (4) Institute of Deep-sea Science and Engineering, (5) Northwestern Polytechnical University, (6) Jinan University), 20 Jul 2025, Can Optical Denoising Clean Sonar Images? A Benchmark and Fusion Approach, https://arxiv.org/abs/2503.01655
  • Shengtao Wen, Haodong Chen, Yadong Wang, Zhongying Pan, Xiang Chen, Yu Tian, Bo Qian, Dong Liang, Sheng-Jun Huang, 9 Aug 2025, MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA, https://arxiv.org/abs/2508.07022
  • Naseem Machlovi, Maryam Saleki, Innocent Ababio, Ruhul Amin, 9 Aug 2025, Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach, https://arxiv.org/abs/2508.07063
  • Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li, 10 Aug 2025, Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach, https://arxiv.org/abs/2508.07353
  • Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo, 11 Aug 2025, MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark, https://arxiv.org/abs/2508.07575
  • Zhe Zhang, Runlin Liu, Aishan Liu, Xingyu Liu, Xiang Gao, Hailong Sun, 10 Aug 2025, Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes, https://arxiv.org/abs/2508.07180
  • Haiyang Guo, Fei Zhu, Hongbo Zhao, Fanhu Zeng, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang, 10 Aug 2025, MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark, https://arxiv.org/abs/2508.07307
  • Sarina Penquitt, Jonathan Klees, Rinor Cakaj, Daniel Kondermann, Matthias Rottmann, Lars Schmarje, 6 Aug 2025, From Label Error Detection to Correction: A Modular Framework and Benchmark for Object Detection Datasets, https://arxiv.org/abs/2508.06556
  • Mohammad Zia Ur Rehman, Anukriti Bhatnagar, Omkar Kabde, Shubhi Bansal, Nagendra Kumar, 7 Aug 2025, ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos, https://arxiv.org/abs/2508.06570
  • Lucia Cipolina-Kun, Marianna Nezhurina, Jenia Jitsev, 10 Aug 2025, Game Reasoning Arena: A Framework and Benchmark for Assessing Reasoning Capabilities of Large Language Models via Game Play, https://arxiv.org/abs/2508.03368
  • Yuxuan Li, Xiang Li, Weijie Li, Qibin Hou, Li Liu, Ming-Ming Cheng, Jian Yang, 9 Aug 2025, SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection, https://arxiv.org/abs/2403.06534
  • Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang, 9 Aug 2025, LVBench: An Extreme Long Video Understanding Benchmark, https://arxiv.org/abs/2406.08035
  • Mihir Godbole, Xiangbo Gao, Zhengzhong Tu, 9 Aug 2025, DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving, https://arxiv.org/abs/2506.17590
  • Zhihao Zhu, Yi Yang, Defu Lian, 9 Aug 2025, TDDBench: A Benchmark for Training data detection, https://arxiv.org/abs/2411.03363
  • Hafsteinn Einarsson, 27 Jul 2025, MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models, https://arxiv.org/abs/2507.20395
  • Chenchen Zhao, Zhengyuan Shi, Xiangyu Wen, Chengjie Liu, Yi Liu, Yunhao Zhou, Yuxiang Zhao, Hefei Feng, Yinan Zhu, Gwok-Waa Wan, Xin Cheng, Weiyu Chen, Yongqi Fu, Chujie Chen, Chenhao Xue, Guangyu Sun, Ying Wang, Yibo Lin, Jun Yang, Ning Xu, Xi Wang and Qiang Xu, 20 Jul 2025, MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs, https://arxiv.org/abs/2507.19525
  • Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues, 25 Jul 2025, MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks, https://arxiv.org/abs/2507.19634
  • Roberto Labadie-Tamayo, Adrian Jaques Böck, Djordje Slijepčević, Xihui Chen, Andreas Babic, Matthias Zeppelzauer, 28 Jul 2025, FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models, https://arxiv.org/abs/2507.20924
  • Xinhan Di, Kristin Qi, Pengqian Yu, 28 Jul 2025, JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1, https://arxiv.org/abs/2507.20987
  • Ali Ismail-Fawaz and Maxime Devanne and Stefano Berretti and Jonathan Weber and Germain Forestier, 28 Jul 2025, Deep Learning for Skeleton Based Human Motion Rehabilitation Assessment: A Benchmark, https://arxiv.org/abs/2507.21018
  • Xuzhao Li and Xuchen Li and Shiyu Hu and Yongzhen Guo and Wentao Zhang, 26 Jul 2025, VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains, https://arxiv.org/abs/2507.09884
  • Hassan Ismail Fawaz, Ganesh Del Grosso, Tanguy Kerdoncuff, Aurelie Boisbunon, Illyyne Saffar, 25 Jul 2025, Deep Unsupervised Domain Adaptation for Time Series Classification: a Benchmark, https://arxiv.org/abs/2312.09857
  • Valay Bundele, Karahan Sarıtaş, Bora Kargi, Oğuz Ata Çal, Kıvanç Tezören, Zohreh Ghaderi, Hendrik Lensch, 26 Jul 2025, Evaluating Self-Supervised Learning in Medical Imaging: A Benchmark for Robustness, Generalizability, and Multi-Domain Impact, https://arxiv.org/abs/2412.19124
  • David Maria Schmidt, Raoul Schubert, Philipp Cimiano, 28 Jul 2025, CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting, https://arxiv.org/abs/2507.21257
  • Haiquan Wang, Yi Chen, Shang Zeng, Yun Bian, Zhe Cui, 29 Jul 2025, GovRelBench: A Benchmark for Government Domain Relevance, https://arxiv.org/abs/2507.21419
  • Leonard Hinckeldey, Elliot Fosong, Elle Miller, Rimvydas Rubavicius, Trevor McInroe, Patricia Wollstadt, Christiane B. Wiebel-Herboth, Subramanian Ramamoorthy, Stefano V. Albrecht, 29 Jul 2025, Assistax: A Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics, https://arxiv.org/abs/2507.21638
  • Amber Huang, Ian Scott Knight, Slava Naprienko, 29 Jul 2025, Data Leakage and Redundancy in the LIT-PCBA Benchmark, https://arxiv.org/abs/2507.21404
  • Qinglong Yang, Haoming Li, Haotian Zhao, Xiaokai Yan, Jingtao Ding, Fengli Xu, Yong Li, 9 Jun 2025, FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents, https://arxiv.org/abs/2507.21071
  • Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz, 28 Jul 2025, StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation, https://arxiv.org/abs/2507.21340
  • Loc Pham, Tung Luu, Thu Vo, Minh Nguyen, Viet Hoang, 29 Jul 2025, VN-MTEB: Vietnamese Massive Text Embedding Benchmark, https://arxiv.org/abs/2507.21500
  • Kristian G. Barman, Sascha Caron, Faegheh Hasibi, Eugene Shalugin, Yoris Marcet, Johannes Otte, Henk W. de Regt, and Merijn Moody, 29 Jul 2025, Towards a Large Physics Benchmark, https://arxiv.org/abs/2507.21695
  • Rohan Hitchcock, Jesse Hoogland, 29 Jul 2025, From Global to Local: A Scalable Benchmark for Local Posterior Sampling, https://arxiv.org/abs/2507.21449
  • Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang, 25 Mar 2025, OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching, https://arxiv.org/abs/2503.21813
  • Jinzhi Wang, Qingke Peng, Haozhou Li, Zeyuan Zeng, Qinfeng Song, Kaixuan Yang, Jiangbo Zhang, Yaoying Wang, Ruimeng Li, Biyi Zhou, 19 Jul 2025, ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing, https://arxiv.org/abs/2507.22911
  • Chengqian Ma, Wei Tao, Yiwen Guo, 30 Jul 2025, C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations, https://arxiv.org/abs/2507.22968
  • Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che, 31 Jul 2025, MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models, https://arxiv.org/abs/2507.23382
  • Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan, 31 Jul 2025, MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks, https://arxiv.org/abs/2507.23511
  • Kai Goebel and Patrik Zips, 31 Jul 2025, Can LLM-Reasoning Models Replace Classical Planning? A Benchmark Study, https://arxiv.org/abs/2507.23589
  • Shimanto Bhowmik, Tawsif Tashwar Dipto, Md Sazzad Islam, Sheryl Hsu, Tahsin Reasat, 31 Jul 2025, Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis, https://arxiv.org/abs/2507.23248
  • Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Kai Chen, Xiaofeng Wang, Baosheng Wang, 31 Jul 2025, LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models, https://arxiv.org/abs/2507.22359
  • Takashi Ishida, Thanawat Lodkaew, Ikko Yamane, 31 Jul 2025, How Can I Publish My LLM Benchmark Without Giving the True Answers Away?, https://arxiv.org/abs/2505.18102
  • Gianluca Carloni, Biagio Brattoli, Seongho Keum, Jongchan Park, Taebum Lee, Chang Ho Ahn, Sergio Pereira, 29 Jul 2025, Pathology Foundation Models are Scanner Sensitive: Benchmark and Mitigation with Contrastive ScanGen Loss, https://arxiv.org/abs/2507.22092
  • Yimeng Liu, Maolin Gan, Yidong Ren, Gen Li, Jingkai Lin, Younsuk Dong, Zhichao Cao, 30 Jul 2025, Hydra-Bench: A Benchmark for Multi-Modal Leaf Wetness Sensing, https://arxiv.org/abs/2507.22685
  • Matej Šprogar, 30 Jul 2025, AGITB: A Signal-Level Benchmark for Evaluating Artificial General Intelligence, https://arxiv.org/abs/2504.04430
  • Thuy Nguyen, Dang Nguyen, Hoang Nguyen, Thuan Luong, Long Hoang Dang, Viet Dac Lai, 30 Jul 2025, OWLViz: An Open-World Benchmark for Visual Question Answering, https://arxiv.org/abs/2503.07631
  • Xiang Xiang, Zhuo Xu, Yao Deng, Qinhao Zhou, Yifan Liang, Ke Chen, Qingfang Zheng, Yaowei Wang, Xilin Chen, Wen Gao, 30 Jul 2025, OpenEarthSensing: Large-Scale Fine-Grained Benchmark for Open-World Remote Sensing, https://arxiv.org/abs/2502.20668
  • Muhammad Farid Adilazuarda, Musa Izzanardi Wijanarko, Lucky Susanto, Khumaisa Nur'aini, Derry Wijaya, Alham Fikri Aji, 25 Feb 2025, NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts, https://arxiv.org/abs/2502.18148
  • Kejia Gao, Liguo Zhou, Mingjun Liu, Alois Knoll, 1 Aug 2025, E2E Parking Dataset: An Open Benchmark for End-to-End Autonomous Parking, https://arxiv.org/abs/2504.10812
  • Junjie Shi, Wei Ma, Shi Ying, Lingxiao Jiang, Yang Liu, Bo Du, 2 Aug 2025, Importance Sampling is All You Need: Predict LLM's performance on new benchmark by reusing existing benchmark, https://arxiv.org/abs/2508.01203
  • Zihan Zheng, Tianle Cui, Chuwen Xie, Jiahui Zhang, Jiahui Pan, Lewei He, Qianglong Chen, 2 Aug 2025, NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset, https://arxiv.org/abs/2508.01330
  • Yuanzhe Shen, Kaimin Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 2 Aug 2025, TripTailor: A Real-World Benchmark for Personalized Travel Planning, https://arxiv.org/abs/2508.01432
  • Xuan Liu, Siru Ouyang, Xianrui Zhong, Jiawei Han, Huimin Zhao, 1 Aug 2025, FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models, https://arxiv.org/abs/2508.01055
  • Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu, 24 Jul 2025, EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow, https://arxiv.org/abs/2507.22929
  • Lyle Regenwetter, Yazan Abu Obaideh, Fabien Chiotti, Ioanna Lykourentzou, Faez Ahmed, 25 May 2025, Bike-Bench: A Bicycle Design Benchmark for Generative Models with Objectives and Constraints, https://arxiv.org/abs/2508.00830
  • Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning, 2 Aug 2025, WebDS: An End-to-End Benchmark for Web-based Data Science, https://arxiv.org/abs/2508.01222
  • Rushin H. Gindra, Giovanni Palla, Mathias Nguyen, Sophia J. Wagner, Manuel Tran, Fabian J Theis, Dieter Saur, Lorin Crawford, Tingying Peng, 2 Aug 2025, A Large-Scale Benchmark of Cross-Modal Learning for Histology and Gene Expression in Spatial Transcriptomics, https://arxiv.org/abs/2508.01490
  • Amir DN Cohen, Hilla Merhav, Yoav Goldberg, Reut Tsarfaty, 3 Aug 2025, HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark, https://arxiv.org/abs/2508.01812
  • Andrea Dosi, Semanto Mondal, Rajib Chandra Ghosh, Massimo Brescia, Giuseppe Longo, 3 Aug 2025, Less is More: AMBER-AFNO -- a New Benchmark for Lightweight 3D Medical Image Segmentation, https://arxiv.org/abs/2508.01941
  • Wanqi Yang, Yanda Li, Yunchao Wei, Meng Fang, Ling Chen, 4 Aug 2025, SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models, https://arxiv.org/abs/2508.02018
  • Yebo Peng, Zixiang Liu, Yaoming Li, Zhizhuo Yang, Xinye Xu, Bowen Ye, Weijun Yuan, Zihan Wang, Tong Yang, 4 Aug 2025, Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems, https://arxiv.org/abs/2508.02208
  • Gustaf Ahdritz, Anat Kleiman, 4 Aug 2025, The SMeL Test: A simple benchmark for media literacy in language models, https://arxiv.org/abs/2508.02074
  • Ivan Karpukhin, Foma Shipilov, Andrey Savchenko, 2 Aug 2025, HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?, https://arxiv.org/abs/2406.14341
  • Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje, 4 Aug 2025, DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA, https://arxiv.org/abs/2412.05430
  • Junying Wang, Wenzhe Li, Yalun Wu, Yingji Liang, Yijin Guo, Chunyi Li, Haodong Duan, Zicheng Zhang, Guangtao Zhai, 2 Aug 2025, Affordance Benchmark for MLLMs, https://arxiv.org/abs/2506.00893
  • Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang, 4 Aug 2025, VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos, https://arxiv.org/abs/2506.10857
  • Xiaohao Liu, Xiaobo Xia, Zhuo Huang, See-Kiong Ng, Tat-Seng Chua, 4 Aug 2025, Towards Modality Generalization: A Benchmark and Prospective Analysis, https://arxiv.org/abs/2412.18277
  • Jaehyun Jeon, Min Soo Kim, Jang Han Yoon, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu, 4 Aug 2025, Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding, https://arxiv.org/abs/2505.05026
  • Feng Rui, Zhiyao Luo, Wei Wang, Yuting Song, Yong Liu, Tingting Zhu, Jianqing Li, and Xingyao Wang, 5 Aug 2025, CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment, https://arxiv.org/abs/2508.03360
  • Haoran Liu, Yihan Zhan, Mingzhe Liu, Yanhua Liu, Peng Li, Zhuo Zuo, Bingqi Liu, Runxi Liu, 3 Aug 2025, Pulse Shape Discrimination Algorithms: Survey and Benchmark, https://arxiv.org/abs/2508.02750
  • Yahia Dalbah, Marcel Worring, Yen-Chia Hsu, 1 Aug 2025, Veli: Unsupervised Method and Unified Benchmark for Low-Cost Air Quality Sensor Correction, https://arxiv.org/abs/2508.02724
  • Yihao Ang, Qiang Wang, Qiang Huang, Yifan Bao, Xinyu Xi, Anthony K. H. Tung, Chen Jin, Zhiyong Huang, 3 Aug 2025, CTBench: Cryptocurrency Time Series Generation Benchmark, https://arxiv.org/abs/2508.02758
  • Zixuan Gu, Qiufeng Fan, Long Sun, Yang Liu, Xiaojun Ye, 5 Aug 2025, VFLAIR-LLM: A Comprehensive Framework and Benchmark for Split Learning of LLMs, https://arxiv.org/abs/2508.03097
  • Longling Geng and Edward Y. Chang, 5 Aug 2025, REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks, https://arxiv.org/abs/2502.18836
  • Alexis Roger, Prateek Humane, Daniel Z. Kaplan, Kshitij Gupta, Qi Sun, George Adamopoulos, Jonathan Siu Chi Lim, Quentin Anthony, Edwin Fennell, Irina Rish, 5 Aug 2025, CHIRP: A Fine-Grained Benchmark for Open-Ended Response Evaluation in Vision-Language Models, https://arxiv.org/abs/2501.09672
  • Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng, 5 Aug 2025, ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark, https://arxiv.org/abs/2506.10960
  • Yue Zhou, Yi Chang, Yuan Wu, 6 Aug 2025, ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges, https://arxiv.org/abs/2508.04576
  • Ashutosh Bandooni and Brindha Subburaj, 31 Jul 2025, GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models, https://arxiv.org/abs/2508.03737
  • Xiao Wang, Ziwen Wang, Wentao Wu, Anjie Wang, Jiashu Wu, Yantao Pan, Chenglong Li, 6 Aug 2025, Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark, https://arxiv.org/abs/2508.04260
  • Xiao Wang, Xufeng Lou, Shiao Wang, Ju Huang, Lan Chen, Bo Jiang, 6 Aug 2025, Long-Term Visual Object Tracking with Event Cameras: An Associative Memory Augmented Tracker and A Benchmark Dataset, https://arxiv.org/abs/2403.05839
  • Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, Ziran Wang, 5 Aug 2025, NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models, https://arxiv.org/abs/2503.12772
  • Xiao Wang, Haiyang Wang, Shiao Wang, Qiang Chen, Jiandong Jin, Haoyu Song, Bo Jiang, Chenglong Li, 6 Aug 2025, RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework, https://arxiv.org/abs/2504.10018
  • Dexuan Xu, Jieyi Wang, Zhongyan Chai, Yongzhi Cao, Hanpin Wang, Huamin Zhang, Yu Huang, 7 Aug 2025, MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models, https://arxiv.org/abs/2508.05083
  • Pouyan Navard, Yasemin Ozkut, Srikar Adhikari, Elaine Situ-LaCasse, Josie Acuña, Adrienne Yarnish, Alper Yilmaz, 5 Aug 2025, ERDES: A Benchmark Video Dataset for Retinal Detachment and Macular Status Classification in Ocular Ultrasound, https://arxiv.org/abs/2508.04735
  • Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov, 7 Aug 2025, ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents, https://arxiv.org/abs/2410.06703
  • Yumeng Fu, Jiayin Zhu, Lingling Zhang, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Yanrui Wu, Wenjun Wu, 8 Aug 2025, GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines, https://arxiv.org/abs/2508.06226
  • Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique, 5 Aug 2025, Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark, https://arxiv.org/abs/2508.05674
  • Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo, 7 Aug 2025, INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance, https://arxiv.org/abs/2406.09105
  • Zheda Mai, Arpita Chowdhury, Zihe Wang, Sooyoung Jeon, Lemeng Wang, Jiacheng Hou, Wei-Lun Chao, 8 Aug 2025, AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models, https://arxiv.org/abs/2506.09082
  • Nikolaos Dionelis, Alessandra Feliciotti, Mattia Marconcini, Devis Peressutti, Nika Oman Kadunc, JaeWan Park, Hagai Raja Sinulingga, Steve Andreas Immanuel, Ba Tran, Caroline Arnold, Nicolas Longépé, 8 Aug 2025, Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge, https://arxiv.org/abs/2502.13818
  • Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, Isil Dillig, 8 Aug 2025, CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation, https://arxiv.org/abs/2504.15254
  • Haiyun Guo, ZhiYan Hou, Yu Chen, Jinghan He, Yandu Sun, Yuzhe Zhou, Shujing Guo, Kuan Zhu, Jinqiao Wang, 31 Jul 2025, MLLM-CBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis, https://arxiv.org/abs/2508.08275
  • Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo, 5 Aug 2025, Putnam-AXIOM: A Functional and Static Benchmark, https://arxiv.org/abs/2508.08292
  • Manuel Herrador, 13 Aug 2025, The PacifAIst Benchmark: Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety?, https://arxiv.org/abs/2508.09762
  • Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, Ziyu Guo, Zhihong Zhu, Hao Wu, Haixin Wang, Yefeng Zheng, Xiaojiang Peng, Xian Wu, Kun Wang, Xiangang Li, Jieping Ye, Pheng-Ann Heng, 11 Aug 2025, MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models, https://arxiv.org/abs/2508.09210
  • Amir Hosseinian, Ashkan Dehghani Zahedani, Umer Mansoor, Noosheen Hashemi, Mark Woodward, 13 Aug 2025, January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis, https://arxiv.org/abs/2508.09966
  • Kechen Li, Yaotian Tao, Ximing Wen, Quanwei Sun, Zifei Gong, Chang Xu, Xizhe Zhang, Tianbo Ji, 13 Aug 2025, GridRoute: A Benchmark for LLM-Based Route Planning with Cardinal Movement in Grid Environments, https://arxiv.org/abs/2505.24306
  • Grigor Bezirganyan, Sana Sellami, Laure Berti-Équille, Sébastien Fournier, 13 Aug 2025, LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data, https://arxiv.org/abs/2406.09864
  • Chunan Liu, Aurelien Pelissier, Yanjun Shao, Lilian Denzler, Andrew C.R. Martin, Brooks Paige, María Rodríguez Martínez, 13 Aug 2025, AbRank: A Benchmark Dataset and Metric-Learning Framework for Antibody-Antigen Affinity Ranking, https://arxiv.org/abs/2506.17857
  • Abhishek Kolari, Mohammadhossein Khojasteh, Yifan Jiang, Floris den Hengst, Filip Ilievski, 14 Aug 2025, ORBIT: An Object Property Reasoning Benchmark for Visual Inference Tasks, https://arxiv.org/abs/2508.10956
  • Wenpeng Xing, Lanyi Wei, Haixiao Hu, Rongchang Li, Mohan Li, Changting Lin, Meng Han, 14 Aug 2025, SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth, https://arxiv.org/abs/2508.11009
  • Beichen Guo, Zhiyuan Wen, Yu Yang, Peng Gao, Ruosong Yang, Jiaxing Shen, 15 Aug 2025, SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems, https://arxiv.org/abs/2508.11310
  • Hongtao Liu, Zhicheng Du, Zihe Wang and Weiran Shen, 16 Aug 2025, CHBench: A Cognitive Hierarchy Benchmark for Evaluating Strategic Reasoning Capability of LLMs, https://arxiv.org/abs/2508.11944
  • Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang, 16 Aug 2025, FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction, https://arxiv.org/abs/2508.11987
  • Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, Vincent Bissonnette, 18 Aug 2025, HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds, https://arxiv.org/abs/2508.12782
  • Fan Li, Xiaoyang Wang, Wenjie Zhang, Ying Zhang, Xuemin Lin, 17 Aug 2025, DHG-Bench: A Comprehensive Benchmark on Deep Hypergraph Learning, https://arxiv.org/abs/2508.12244
  • Manuela Imbriani, Gina Belmonte, Mieke Massink, Alessandro Tofani, Vincenzo Ciancia, 18 Aug 2025, A Multi-Resolution Benchmark Framework for Spatial Reasoning Assessment in Neural Networks, https://arxiv.org/abs/2508.12741
  • Ananya Singha, Harshita Sahijwani, Walt Williams, Emmanuel Aboah Boateng, Nick Hausman, Miguel Di Luca, Keegan Choudhury, Chaya Binet, Vu Le, Tianwei Chen, Oryan Rokeah Chen, Sulaiman Vesal, Sadid Hasan, 14 Aug 2025, Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs, https://arxiv.org/abs/2508.11715
  • Javier Muñoz-Haro, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, 14 Aug 2025, Privacy-Aware Detection of Fake Identity Documents: Methodology, Benchmark, and Improved Detection Methods (FakeIDet2), https://arxiv.org/abs/2508.11716
  • Elon Ezra, Ariel Weizman, Amos Azaria, 17 Aug 2025, The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution, https://arxiv.org/abs/2508.12277
  • Zhiyuan Ning, Tianle Gu, Jiaxin Song, Shixin Hong, Lingyu Li, Huacan Liu, Jie Li, Yixu Wang, Meng Lingyu, Yan Teng, Yingchun Wang, 18 Aug 2025, LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models, https://arxiv.org/abs/2508.12733
  • Krzysztof Kotowski, Christoph Haskamp, Jacek Andrzejewski, Bogdan Ruszczak, Jakub Nalepa, Daniel Lakey, Peter Collins, Aybike Kolmas, Mauro Bartesaghi, Jose Martinez-Heras, Gabriele De Canio, 17 Aug 2025, European Space Agency Benchmark for Anomaly Detection in Satellite Telemetry, https://arxiv.org/abs/2406.17826
  • Bryan L. M. de Oliveira, Luana G. B. Martins, Bruno Brandão, Murilo L. da Luz, Telma W. de L. Soares, Luckeciano C. Melo, 16 Aug 2025, Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning, https://arxiv.org/abs/2410.14038
  • Hyunjong Ok, Jaeho Lee, 18 Aug 2025, S2Cap: A Benchmark and a Baseline for Singing Style Captioning, https://arxiv.org/abs/2409.09866
  • Haohang Xu, Chengjie Liu, Qihang Wang, Wenhao Huang, Yongjian Xu, Weiyu Chen, Anlan Peng, Zhijun Li, Bo Li, Lei Qi, Jun Yang, Yuan Du, and Li Du, 27 Jun 2025, Image2Net: Datasets, Benchmark and Hybrid Framework to Convert Analog Circuit Diagrams into Netlists, https://arxiv.org/abs/2508.13157
  • Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen, 14 Aug 2025, MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents, https://arxiv.org/abs/2508.13186
  • Yixuan Yang and Daoyuan Wu and Yufan Chen, 17 Aug 2025, MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols, https://arxiv.org/abs/2508.13220
  • James Meaden, Michał Jarosz, Piotr Jodłowski, Grigori Melnik, 19 Aug 2025, COMPASS: A Multi-Dimensional Benchmark for Evaluating Code Generation in Large Language Models, https://arxiv.org/abs/2508.13757
  • Guanghao Jin, Jingpei Wu, Tianpei Guo, Yiyi Niu, Weidong Zhou, Guoyang Liu, 12 Aug 2025, KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge, https://arxiv.org/abs/2508.14080
  • Chanyeol Choi, Jihoon Kwon, Alejandro Lopez-Lira, Chaewoon Kim, Minjae Kim, Juneha Hwang, Jaeseon Ha, Hojun Choi, Suyeol Yun, Yongjin Kim, and Yongjae Lee, 7 Aug 2025, FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering, https://arxiv.org/abs/2508.14052
  • Sujit Roy, Dinesha V. Hegde, Johannes Schmude, Amy Lin, Vishal Gaur, Rohit Lal, Kshitiz Mandal, Talwinder Singh, Andrés Muñoz-Jaramillo, Kang Yang, Chetraj Pandey, Jinsu Hong, Berkay Aydin, Ryan McGranaghan, Spiridon Kasapis, Vishal Upendran, Shah Bahauddin, Daniel da Silva, Marcus Freitag, Iksha Gurung, Nikolai Pogorelov, Campbell Watson, Manil Maskey, Juan Bernabe-Moreno, Rahul Ramachandran, 18 Aug 2025, SuryaBench: Benchmark Dataset for Advancing Machine Learning in Heliophysics and Space Weather Prediction, https://arxiv.org/abs/2508.14107
  • Tapio Pitk\"aranta, 20 Aug 2025, The NordDRG AI Benchmark for Large Language Models, https://arxiv.org/abs/2506.13790
  • Ningyi Liao, Haoyu Liu, Zulun Zhu, Siqiang Luo, Laks V.S. Lakshmanan, 20 Aug 2025, A Comprehensive Benchmark on Spectral GNNs: The Impact on Efficiency, Memory, and Effectiveness, https://arxiv.org/abs/2406.09675
  • Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran, 21 Aug 2025, GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning, https://arxiv.org/abs/2508.15690
  • Chenghao Zhang, Qingqing Long, Ludi Wang, Wenjuan Cui, Jianjun Yu, Yi Du, 21 Aug 2025, CITE: A Comprehensive Benchmark for Heterogeneous Text-Attributed Graphs on Catalytic Materials, https://arxiv.org/abs/2508.15392
  • Jiahao Xu (Ohio State University, USA), Changchang Yin (Ohio State University Wexner Medical Center, USA), Odysseas Chatzipanagiotou (Ohio State University Wexner Medical Center, USA), Diamantis Tsilimigras (Ohio State University Wexner Medical Center, USA), Kevin Clear (Ohio State University Wexner Medical Center, USA), Bingsheng Yao (Northeastern University, USA), Dakuo Wang (Northeastern University, USA), Timothy Pawlik (Ohio State University Wexner Medical Center, USA), Ping Zhang (Ohio State University, USA), 21 Aug 2025, SurgWound-Bench: A Benchmark for Surgical Wound Diagnosis, https://arxiv.org/abs/2508.15189
  • Nikita Kachaev, Andrei Spiridonov, Andrey Gorodetsky, Kirill Muravyev, Nikita Oskolkov, Aditya Narendra, Vlad Shakhuro, Dmitry Makarov, Aleksandr I. Panov, Polina Fedotova, Alexey K. Kovalev, 21 Aug 2025, Mind and Motion Aligned: A Joint Evaluation IsaacSim Benchmark for Task Planning and Low-Level Policies in Mobile Manipulation, https://arxiv.org/abs/2508.15663
  • Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, Xiangyang Li, 19 Aug 2025, MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers, https://arxiv.org/abs/2508.14925
  • Changshun Wu, Weicheng He, Chih-Hong Cheng, Xiaowei Huang, Saddek Bensalem, 20 Aug 2025, Revisiting Out-of-Distribution Detection in Real-time Object Detection: From Benchmark Pitfalls to a New Mitigation Paradigm, https://arxiv.org/abs/2503.07330
  • Fei Lei, Yibo Yang, Wenxiu Sun, Dahua Lin, 22 Aug 2025, MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use, https://arxiv.org/abs/2508.16260
  • Mohan Jiang, Jin Gao, Jiahao Zhan, Dequan Wang, 14 Aug 2025, MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding, https://arxiv.org/abs/2508.15802
  • Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans, 18 Aug 2025, A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains, https://arxiv.org/abs/2508.15832
  • Ahmed Allam, Youssef Mansour, and Mohamed Shalan, 21 Aug 2025, ASIC-Agent: An Autonomous Multi-Agent System for ASIC Design with Benchmark Evaluation, https://arxiv.org/abs/2508.15940
  • Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie, 21 Aug 2025, Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset, https://arxiv.org/abs/2508.15986
  • Mahinthan Chandramohan, Jovan Jancic, Yuntong Zhang and Padmanabhan Krishnan, 22 Aug 2025, From Benchmark Data To Applicable Program Repair: An Experience Report, https://arxiv.org/abs/2508.16071
  • Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu, 22 Aug 2025, RoMedQA: The First Benchmark for Romanian Medical Question Answering, https://arxiv.org/abs/2508.16390
  • Yakup Abrek Er, Ilker Kesen, Gözde Gül Şahin, Aykut Erdem, 22 Aug 2025, Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish, https://arxiv.org/abs/2508.16431
  • Adil Bahaj, Mounir Ghogho, 22 Aug 2025, PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark, https://arxiv.org/abs/2508.16439
  • Sam Earle, Graham Todd, Yuchen Li, Ahmed Khalifa, Muhammad Umair Nasir, Zehua Jiang, Andrzej Banburski-Fahey, Julian Togelius, 22 Aug 2025, PuzzleJAX: A Benchmark for Reasoning and Learning, https://arxiv.org/abs/2508.16821
  • Nilay Pande, Sahiti Yerramilli, Jayant Sravan Tamarapalli, Rynaa Grover, 24 Aug 2025, MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes, https://arxiv.org/abs/2508.17180
  • Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson, 25 Aug 2025, SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models, https://arxiv.org/abs/2508.18179
  • Robert Yang, 25 Aug 2025, Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery, https://arxiv.org/abs/2508.17681
  • Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Lei Bai, Yunqi Cai, Xi Dai, Shufei Zhang, Jinguang Cheng, Zhong Fang, Hongming Weng, 25 Aug 2025, CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics, https://arxiv.org/abs/2508.18124
  • Fangxin Shang, Yuan Xia, Dalu Yang, Yahui Wang, Binglin Yang, 21 Aug 2025, MedRepBench: A Comprehensive Benchmark for Medical Report Interpretation, https://arxiv.org/abs/2508.16674
  • Liane Makatura, Benjamin Jones, Siyuan Bian, Wojciech Matusik, 25 Aug 2025, MetaGen: A DSL, Database, and Benchmark for VLM-Assisted Metamaterial Generation, https://arxiv.org/abs/2508.17568
  • Wei Xiong and Jiangtong Li and Jie Li and Kun Zhu, 25 Aug 2025, EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation of EEG Foundation Models, https://arxiv.org/abs/2508.17742
  • Keke Lian and Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li and Dong Zhang, 25 Aug 2025, A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code, https://arxiv.org/abs/2508.18106
  • Yajing Yang, Qian Liu, Min-Yen Kan, 23 Aug 2025, DataTales: A Benchmark for Real-World Intelligent Data Narration, https://arxiv.org/abs/2410.17859
  • Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish, 23 Aug 2025, MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark, https://arxiv.org/abs/2506.05587
  • Linbo Cao, Jinman Zhao, 23 Jul 2025, Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks, https://arxiv.org/abs/2507.17747
  • Jianhao Chen, Junyang Ren, Wentao Ding, Haoyuan Ouyang, Wei Hu, Yuzhong Qu, 23 Jul 2025, Conflict Detection for Temporal Knowledge Graphs:A Fast Constraint Mining Algorithm and New Benchmarks, https://arxiv.org/abs/2312.11053
  • Fred Mutisya (1 and 2), Shikoh Gitau (1), Christine Syovata (2), Diana Oigara (2), Ibrahim Matende (2), Muna Aden (2), Munira Ali (2), Ryan Nyotu (2), Diana Marion (2), Job Nyangena (2), Nasubo Ongoma (1), Keith Mbae (1), Elizabeth Wamicha (1), Eric Mibuari (1), Jean Philbert Nsengemana (3), Talkmore Chidede (4) ((1) Qhala (Nairobi, Kenya), (2) Kenya Medical Association (Nairobi, Kenya), (3) Africa CDC (Addis Ababa, Ethiopia), (4) AfCFTA (Accra, Ghana)), 22 Jul 2025, Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens, https://arxiv.org/abs/2507.16322
  • Roland Pihlakas, Joel Pyykkö, 22 Jul 2025, From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent AI safety benchmarks, https://arxiv.org/abs/2410.00081
  • Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu, 10 Aug 2025, Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks, https://arxiv.org/abs/2508.07179
  • Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, and Emad Barsoum, 31 Jul 2025, Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks, https://arxiv.org/abs/2507.23194
  • Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim, 25 Jul 2025, Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and Benchmarks, https://arxiv.org/abs/2507.19684
  • Khloud AL Jallad, Nada Ghneim, Ghaida Rebdawi, 27 Jul 2025, Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?, https://arxiv.org/abs/2507.20419
  • Fred Mutisya (1,2), Shikoh Gitau (1), Nasubo Ongoma (1), Keith Mbae (1), Elizabeth Wamicha (1), 31 Jul 2025, Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench, https://arxiv.org/abs/2508.00081
  • Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen (Cherise) Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert, 30 Jul 2025, Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models, https://arxiv.org/abs/2508.00923
  • Olawale Salaudeen, Nicole Chiou, Shiny Weng, Sanmi Koyejo, 2 Aug 2025, Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?, https://arxiv.org/abs/2504.00186
  • Zizhan Ma, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Wenting Chen, Linlin Shen, 6 Aug 2025, Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models, https://arxiv.org/abs/2508.04325
  • Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang, 7 Aug 2025, Establishing Best Practices for Building Rigorous Agentic Benchmarks, https://arxiv.org/abs/2507.02825
  • Prathamesh Kalamkar, Janani Venugopalan Ph.D., Vivek Raghavan Ph.D., 13 Jul 2021, Indian Legal NLP Benchmarks: A Survey, https://arxiv.org/abs/2107.06056
  • Serina Chang, Ashton Anderson, Jake M. Hofman, 12 Aug 2025, ChatBench: From Static Benchmarks to Human-AI Evaluation, https://arxiv.org/abs/2504.07114
  • Martin Pelikan, Sheikh Shams Azam, Vitaly Feldman, Jan "Honza" Silovsky, Kunal Talwar, Christopher G. Brinton, Tatiana Likhomanenko, 14 Aug 2025, Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping, https://arxiv.org/abs/2310.00098
  • Shengbo Wang, Mingwei Liu, Zike Li, Anji Li, Yanlin Wang, Xin Peng, Zibin Zheng, 18 Aug 2025, EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing, https://arxiv.org/abs/2508.13003

Research on Model Evaluation

  • Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
  • Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, Andy Zou, 23 May 2024, Lessons from the Trenches on Reproducible Evaluation of Language Models, https://arxiv.org/abs/2405.14782 (Model evaluation theory and practice with the lm-eval test harness tool; a usage sketch appears at the end of this section.)
  • Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
  • Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, Sandipan Dandapat, December 2023, Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing https://aclanthology.org/2023.emnlp-main.418/ PDF: https://aclanthology.org/2023.emnlp-main.418.pdf
  • Yifan Wei, Yisong Su, Huanhuan Ma, Xiaoyan Yu, Fangyu Lei, Yuanzhe Zhang, Jun Zhao, Kang Liu, 8 Oct 2023, MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models, https://arxiv.org/abs/2310.05157
  • George Cybenko, Joshua Ackerman, Paul Lintilhac, 16 Apr 2024, TEL'M: Test and Evaluation of Language Models, https://arxiv.org/abs/2404.10200
  • Gayathri Saranathan, Mahammad Parwez Alam, James Lim, Suparna Bhattacharya, Soon Yee Wong, Foltin Martin & Cong Xu, 2024, DELE: Data Efficient LLM Evaluation, Hewlett Packard Labs, Navigating and Addressing Data Problems for Foundation Models (DPFM) Workshop, ICLR 2024, https://openreview.net/pdf?id=I8bsxPWLNF
  • Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang, 17 Mar 2024 (v2), Compressing LLMs: The Truth is Rarely Pure and Never Simple, https://arxiv.org/abs/2310.01382 Code: https://github.com/VITA-Group/llm-kick (A set of tasks to evaluate LLMs.)
  • Aaditya Naik, Adam Stein, Yinjun Wu, Mayur Naik, Eric Wong, April 2024, TorchQL: A Programming Framework for Integrity Constraints in Machine Learning, Proc. ACM Program. Lang., Vol. 8, No. OOPSLA1, Article 124. PDF: https://dl.acm.org/doi/pdf/10.1145/3649841
  • Tal Peretz, 15 NOV 2023, The Developer's Guide to Production-Grade LLM Apps: Advanced Techniques for Maximizing LLM Performance, https://buildingaistuff.com/p/the-developers-guide-to-production
  • Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, Min Lin, 22 Jan 2024, Benchmarking Large Multimodal Models against Common Corruptions, https://arxiv.org/abs/2401.11943 Code: https://github.com/sail-sg/MMCBench
  • Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, Gordon Wetzstein, Jan 2024, GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation, https://arxiv.org/abs/2401.04092 Code: https://github.com/3DTopia/GPTEval3D Project: https://gpteval3d.github.io/
  • Lan Chu, Jan 2024, LLM Output — Evaluating, debugging, and interpreting, Towards AI, https://pub.towardsai.net/llm-output-evaluating-debugging-and-interpreting-f3bd29e7d14d
  • Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You, 3 Jun 2024, MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures, https://arxiv.org/abs/2406.06565
  • Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo, 9 Jun 2024, The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models, https://arxiv.org/abs/2406.05761 Code: https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench
  • Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi, 7 Jun 2024, WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild, https://arxiv.org/abs/2406.04770 Code: https://hf.co/spaces/allenai/WildBench
  • Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi, 13 Jun 2024, Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning, https://arxiv.org/abs/2406.09170
  • Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
  • Tianle Li, Wei-Lin Chiang, Lisa Dunlap, May 20, 2024, Introducing Hard Prompts Category in Chatbot Arena, https://lmsys.org/blog/2024-05-17-category-hard/
  • Louis Bouchard, Jun 24, 2024, LLM Evals: What, why, when and how, https://www.louisbouchard.ai/llm-evals/
  • Clémentine Fourrier, May 23, 2024, Let's talk about LLM evaluation, https://huggingface.co/blog/clefourrier/llm-evaluation
  • Jeffrey Ip, November 7, 2023, How to Evaluate LLM Applications: The Complete Guide, https://www.confident-ai.com/blog/how-to-evaluate-llm-applications
  • Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
  • Anirban Ghoshal, July 3, 2024, AWS approach to RAG evaluation could help enterprises reduce AI spending, https://www.infoworld.com/article/3715629/aws-new-approach-to-rag-evaluation-could-help-enterprises-reduce-ai-spending.html
  • Tianyi Tang, Yiwen Hu, Bingqian Li, Wenyang Luo, Zijing Qin, Haoxiang Sun, Jiapeng Wang, Shiyi Xu, Xiaoxue Cheng, Geyang Guo, Han Peng, Bowen Zheng, Yiru Tang, Yingqian Min, Yushuo Chen, Jie Chen, Yuanqian Zhao, Luran Ding, Yuhao Wang, Zican Dong, Chunxuan Xia, Junyi Li, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen, 8 Jul 2024, LLMBox: A Comprehensive Library for Large Language Models, https://arxiv.org/abs/2407.05563 Code: https://github.com/RUCAIBox/LLMBox
  • Jin Peng Zhou, Christian K. Belardi, Ruihan Wu, Travis Zhang, Carla P. Gomes, Wen Sun, Kilian Q. Weinberger, 8 Jul 2024, On Speeding Up Language Model Evaluation, https://arxiv.org/abs/2407.06172
  • HELM, July 2024 (accessed), A holistic framework for evaluating foundation models, Stanford University, https://crfm.stanford.edu/helm/lite/latest/
  • Juan Pablo Bottaro, April 25, 2024, Musings on building a Generative AI product, https://www.linkedin.com/blog/engineering/generative-ai/musings-on-building-a-generative-ai-product?_l=en_US
  • Angie Boggust, Venkatesh Sivaraman, Yannick Assogba, Donghao Ren, Dominik Moritz, Fred Hohman, 6 Aug 2024, Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments, https://arxiv.org/abs/2408.03274
  • Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
  • Andrew Ng, Sep 2024, X post, https://x.com/AndrewYNg/status/1829190549842321758 (Dropping token prices for LLMs means developers can focus on the app layer.)
  • Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
  • Lior Solomon, Sep 2024, Gen AI testing strategies and tools, https://medium.com/ai-in-grc/gen-ai-testing-strategies-and-tools-257383e5cbfb
  • Michael Nuñez, September 9, 2024, LightEval: Hugging Face’s open-source solution to AI’s accountability problem, https://venturebeat.com/ai/lighteval-hugging-faces-open-source-solution-to-ais-accountability-problem/
  • Michael Nuñez, September 13, 2024, Microsoft’s Windows Agent Arena: Teaching AI assistants to navigate your PC, https://venturebeat.com/ai/microsofts-windows-agent-arena-teaching-ai-assistants-to-navigate-your-pc/
  • Flow AI, Sep 2024, Flow Judge: An Open Small Language Model for LLM System Evaluations, https://www.flow-ai.com/blog/flow-judge
  • Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, Kate Olszewska, 20 Sep 2024 (v2), Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries, https://arxiv.org/abs/2409.12640 (Long context model evaluation dataset.)
  • Anthony C. Ou, Feb 2024, Large Language Model Routing with Benchmark Datasets, Master's Thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, https://dspace.mit.edu/handle/1721.1/153846
  • Cameron R. Wolfe, Ph.D., Dec 02, 2024, Finetuning LLM Judges for Evaluation: The Prometheus suite, JudgeLM, PandaLM, AutoJ, and more..., https://cameronrwolfe.substack.com/p/finetuned-judge
  • Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu, 10 Dec 2024 (v2), LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, https://arxiv.org/abs/2412.05579 https://github.com/CSHaitao/Awesome-LLMs-as-Judges
  • Liam Seymour, Basar Kutukcu, Sabur Baidya, 19 Dec 2024, Large Language Models on Small Resource-Constrained Systems: Performance Characterization, Analysis and Trade-offs, https://arxiv.org/abs/2412.15352 https://github.com/LiamS57/orin-llm-testing
  • Jie He, Nan Hu, Wanqiu Long, Jiaoyan Chen, Jeff Z. Pan, 22 Dec 2024, MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge, https://arxiv.org/abs/2412.17032 https://github.com/probe2/multi-hop/ (Model evaluation of reasoning abilities.)
  • Latent Space, Dec 28, 2024, The 2025 AI Engineering Reading List: We picked 50 paper/models/blogs across 10 fields in AI Eng: LLMs, Benchmarks, Prompting, RAG, Agents, CodeGen, Vision, Voice, Diffusion, Finetuning. If you're starting from scratch, start here. https://www.latent.space/p/2025-papers
  • Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu, 24 Dec 2024, LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating, https://arxiv.org/abs/2412.18424
  • Y Li, H Jiang, Q Wu, X Luo, S Ahn, C Zhang, AH Abdi, Dec 2024, SharedContextBench: Evaluating Long-Context Methods in KV Cache Reuse, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://neurips2024-enlsp.github.io/papers/paper_93.pdf (Evaluating model performance with KV cache compression.)
  • Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
  • Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
  • Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
  • Lucas C. Cordeiro, Matthew L. Daggitt, Julien Girard-Satabin, Omri Isac, Taylor T. Johnson, Guy Katz, Ekaterina Komendantskaya, Augustin Lemesle, Edoardo Manino, Artjoms Šinkarovs, Haoze Wu, 10 Jan 2025, Neural Network Verification is a Programming Language Challenge, https://arxiv.org/abs/2501.05867
  • Dr. Marcel Müller, Jan 2025, Why Generative-AI Apps’ Quality Often Sucks and What to Do About It: How to get from PoCs to tested high-quality applications in production, https://towardsdatascience.com/why-generative-ai-apps-quality-often-sucks-and-what-to-do-about-it-f84407f263c3
  • Bharani Subramaniam, 13 February 2025, Emerging Patterns in Building GenAI Products, https://martinfowler.com/articles/gen-ai-patterns/
  • Nikhil, February 26, 2025, How to Compare Two LLMs in Terms of Performance: A Comprehensive Web Guide for Evaluating and Benchmarking Language Models, https://www.marktechpost.com/2025/02/26/how-to-compare-two-llms-in-terms-of-performance-a-comprehensive-web-guide-for-evaluating-and-benchmarking-language-models/
  • Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
  • Yansheng Qiu, Li Xiao, Zhaopan Xu, Pengfei Zhou, Zheng Wang, Kaipeng Zhang, 16 May 2025, Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans, https://arxiv.org/abs/2505.11141
  • Brandon Lepine, Gawesha Weerantunga, Juho Kim, Pamela Mishkin, Matthew Beane, 15 May 2025, Evaluations at Work: Measuring the Capabilities of GenAI in Use, https://arxiv.org/abs/2505.10742
  • Rachel Draelos, MD, PhD, May 14, 2025, HealthBench Does Not Evaluate Patient Safety, https://medium.com/data-science-collective/healthbench-does-not-evaluate-patient-safety-11eda5f0eeac
  • Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo, 29 May 2025, ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions, https://arxiv.org/abs/2505.23662 https://github.com/bwookwak/ToolHaystack
  • Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, Wenxuan Zhang, Lifu Huang, Muhao Chen, Lei Hou, Qianru Sun, Xingjun Ma, Zuxuan Wu, Min-Yen Kan, David Lo, Qi Zhang, Heng Ji, Jing Jiang, Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, Yu-Gang Jiang, 26 Apr 2025, Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks, https://arxiv.org/abs/2504.18838
  • Liyun Zhang, Jingcheng Ke, Shenli Fan, Xuanmeng Sha and Zheng Lian, 14 Aug 2025, A Unified Evaluation Framework for Multi-Annotator Tendency Learning, https://arxiv.org/abs/2508.10393
  • Brian Shing-Hei Wong, Joshua Mincheol Kim, Sin-Hang Fung, Qing Xiong, Kelvin Fu-Kiu Ao, Junkang Wei, Ran Wang, Dan Michelle Wang, Jingying Zhou, Bo Feng, Alfred Sze-Lok Cheng, Kevin Y. Yip, Stephen Kwok-Wing Tsui, Qin Cao, 14 Aug 2025, Driving Accurate Allergen Prediction with Protein Language Models and Generalization-Focused Evaluation, https://arxiv.org/abs/2508.10541
  • Xiao Fu, Hossein A. Rahmani, Bin Wu, Jerome Ramos, Emine Yilmaz, Aldo Lipani, 8 Aug 2025, PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs, https://arxiv.org/abs/2508.10028
  • Gal Amram, Eitan Farchi, Shmulik Froimovich, Raviv Gal, and Avi Ziv, 13 Aug 2025, LaajMeter: A Framework for LaaJ Evaluation, https://arxiv.org/abs/2508.10161
  • Jieyu Li, Xin Zhang, and Joey Tianyi Zhou, 14 Aug 2025, AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences, https://arxiv.org/abs/2508.10771
  • Yuzhuo Xiao, Zeyu Han, Yuhan Wang, Huaizu Jiang, 4 Aug 2025, XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs, https://arxiv.org/abs/2508.09999
  • Aditya Ashvin, Rimita Lahiri, Aditya Kommineni, Somer Bishop, Catherine Lord, Sudarsana Reddy Kadiri, Shrikanth Narayanan, 14 Aug 2025, Evaluation of Speech Foundation Models for ASR on Child-Adult Conversations in Autism Diagnostic Sessions, https://arxiv.org/abs/2409.16135
  • Zhe Chen, Daniel Harabor, Ryan Hechnenberger, Nathan R. Sturtevant, 23 Jul 2025, Online Submission and Evaluation System Design for Competition Operations, https://arxiv.org/abs/2507.17730
  • Yutong Liu, Cairong Zhao, Guosheng Hu, 23 Jul 2025, A Comprehensive Evaluation on Quantization Techniques for Large Language Models, https://arxiv.org/abs/2507.17417
  • Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing, 23 Jul 2025, MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs, https://arxiv.org/abs/2507.17476
  • Miguel Carrasco, César González-Martín, José Aranda, Luis Oliveros, 23 Jul 2025, Vision Transformer attention alignment with human visual perception in aesthetic object evaluation, https://arxiv.org/abs/2507.17616
  • Karen Zhou, John Giorgi, Pranav Mani, Peng Xu, Davis Liang, Chenhao Tan, 23 Jul 2025, From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes, https://arxiv.org/abs/2507.17717
  • Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian, 23 Jul 2025, Fairness Evaluation of Large Language Models in Academic Library Reference Services, https://arxiv.org/abs/2507.04224
  • Roman Mayr, Michel Schimpf, Thomas Bohné, 22 Jul 2025, ChatChecker: A Framework for Dialogue System Testing and Evaluation Through Non-cooperative User Simulation, https://arxiv.org/abs/2507.16792
  • Abhash Kumar Jha, Shakiba Moradian, Arjun Krishnakumar, Martin Rapp, Frank Hutter, 22 Jul 2025, confopt: A Library for Implementation and Evaluation of Gradient-based One-Shot NAS Methods, https://arxiv.org/abs/2507.16533
  • Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao, 22 Jul 2025, RAVine: Reality-Aligned Evaluation for Agentic Search, https://arxiv.org/abs/2507.16725
  • Danil Gusak, Anna Volodkevich, Anton Klenitskiy, Alexey Vasilev, Evgeny Frolov, 22 Jul 2025, Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders, https://arxiv.org/abs/2507.16289
  • Jakub Michańków, Paweł Sakowski, Robert Ślepaczuk, 22 Jul 2025, Alternative Loss Function in Evaluation of Transformer Models, https://arxiv.org/abs/2507.16548
  • Bruno Deprez, Toon Vanderschueren, Bart Baesens, Tim Verdonck, Wouter Verbeke, 22 Jul 2025, Network Analytics for Anti-Money Laundering -- A Systematic Literature Review and Experimental Evaluation, https://arxiv.org/abs/2405.19383
  • Xiaoxu Guo, Siyan Liang, Yachao Cui, Juxiang Zhou, Lei Wang, Han Cao, 21 Jul 2025, Multimodal Fine-grained Reasoning for Post Quality Evaluation, https://arxiv.org/abs/2507.17934
  • Rodrigo Moreira, Larissa F. Rodrigues Moreira, Flávio de Oliveira Silva, 23 Jul 2025, Performance Evaluation and Threat Mitigation in Large-scale 5G Core Deployment, https://arxiv.org/abs/2507.17850
  • Maria Vlachou, 24 Jul 2025, Fashion-AlterEval: A Dataset for Improved Evaluation of Conversational Recommendation Systems with Alternative Relevant Items, https://arxiv.org/abs/2507.18017
  • Yefeng Yuan, Yuhong Liu, Liang Cheng, 24 Jul 2025, A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models, https://arxiv.org/abs/2404.14445
  • Gregor Baer, Isel Grau, Chao Zhang, Pieter Van Gorp, 24 Jul 2025, Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation, https://arxiv.org/abs/2506.11790
  • Niket Patel, Randall Balestriero, 23 Jul 2025, Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks, https://arxiv.org/abs/2507.09871
  • Ashray Gupta and Rohan Joseph and Sunny Rai, 23 Jul 2025, Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation, https://arxiv.org/abs/2507.13238
  • Masaki Adachi, Masahiro Fujisawa, Michael A Osborne, 24 Jul 2025, Fixing the Pitfalls of Probabilistic Time-Series Forecasting Evaluation by Kernel Quadrature, https://arxiv.org/abs/2503.06079
  • Gerben van der Hoek, Johan Jeuring and Rogier Bos, 18 Jul 2025, Buggy rule diagnosis for combined steps through final answer evaluation in stepwise tasks, https://arxiv.org/abs/2507.13651
  • Viraj Nishesh Darji, Callie C. Liao, Duoduo Liao, 18 Jul 2025, Automated Interpretation of Non-Destructive Evaluation Contour Maps Using Large Language Models for Bridge Condition Assessment, https://arxiv.org/abs/2507.14107
  • Yudai Hayashi, Shuhei Goda, Yuta Saito, 18 Jul 2025, Off-Policy Evaluation and Learning for Matching Markets, https://arxiv.org/abs/2507.13608
  • Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang, 17 Jul 2025, "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models, https://arxiv.org/abs/2507.13428
  • Steven Lamp, Jason D. Hiser, Anh Nguyen-Tuong, Jack W. Davidson, 17 Jul 2025, PHASE: Passive Human Activity Simulation Evaluation, https://arxiv.org/abs/2507.13505
  • Yuan Gao, Mattia Piccinini, Korbinian Moller, Amr Alanwar, Johannes Betz, 18 Jul 2025, From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios, https://arxiv.org/abs/2502.02145
  • Daniel Commey, Benjamin Appiah, Griffith S. Klogo, and Garth V. Crosby, 18 Jul 2025, ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs, https://arxiv.org/abs/2507.11649
  • Mohita Chowdhury, Yajie Vera He, Jared Joselowitz, Aisling Higham, Ernest Lim, 18 Jul 2025, ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems, https://arxiv.org/abs/2501.08208
  • Dawar Khan and Xinyu Liu and Omar Mena and Donggang Jia and Alexandre Kouyoumdjian and Ivan Viola, 18 Jul 2025, AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results, https://arxiv.org/abs/2502.15761
  • Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, Jinsik Lee, 18 Jul 2025, From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation, https://arxiv.org/abs/2507.08924
  • Shingo Ayabe, Takuto Otomo, Hiroshi Kera, Kazuhiko Kawamoto, 18 Jul 2025, Robustness Evaluation of Offline Reinforcement Learning for Robot Control Against Action Perturbations, https://arxiv.org/abs/2412.18781
  • Anna Sofia Lippolis, Mohammad Javad Saeedizade, Robin Keskisärkkä, Aldo Gangemi, Eva Blomqvist, Andrea Giovanni Nuzzolese, 19 Jul 2025, Large Language Models Assisting Ontology Evaluation, https://arxiv.org/abs/2507.14552
  • Qianchao Wang, Yuxuan Ding, Chuanzhen Jia, Zhe Li, Yaping Du, 21 Jul 2025, Explainable Artificial Intelligence based Soft Evaluation Indicator for Arc Fault Diagnosis, https://arxiv.org/abs/2507.15239
  • Firdaus Ahmed Choudhury, Ethan Leicht, Jude Ethan Bislig, Hangzhi Guo, Amulya Yadav, 20 Jul 2025, Designing User-Centric Metrics for Evaluation of Counterfactual Explanations, https://arxiv.org/abs/2507.15162
  • Amina Dzafic, Merve Kavut, Ulya Bayram, 19 Jul 2025, Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation, https://arxiv.org/abs/2507.14693
  • Chenlei Gong, Yuanhe Tian, Lei Mao, Yan Song, 20 Jul 2025, Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling, https://arxiv.org/abs/2507.15087
  • Devichand Budagam, Ashutosh Kumar, Mahsa Khoshnoodi, Sankalp KJ, Vinija Jain, Aman Chadha, 21 Jul 2025, Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles, https://arxiv.org/abs/2406.12644
  • Peilong Wang, Jason Holmes, Zhengliang Liu, Dequan Chen, Tianming Liu, Jiajian Shen, Wei Liu, 18 Jul 2025, A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options, https://arxiv.org/abs/2412.10622
  • Wanke Xia, Ruoxin Peng, Haoqi Chu, Xinlei Zhu, Zhiyu Yang, Lili Yang, 21 Jul 2025, An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice, https://arxiv.org/abs/2502.13764
  • María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico, 19 Jul 2025, MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation, https://arxiv.org/abs/2502.17163
  • Felix Härer, 19 Jul 2025, Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications, https://arxiv.org/abs/2506.10467
  • Zhijin He, Alan B. McMillan, 21 Jul 2025, Comparative Evaluation of Radiomics and Deep Learning Models for Disease Detection in Chest Radiography, https://arxiv.org/abs/2504.12249
  • Pengfei Zhou, Xiaopeng Peng, Fanrui Zhang, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Zekai Li, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, and Kaipeng Zhang, 9 Aug 2025, MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams, https://arxiv.org/abs/2508.06851
  • Jiawei Zhang, Yifei Zhang, Baozhao Yi, Yao Ren, Qi Jiao, Hanyu Bai, Weiran Jiang, Ziyou Song, 9 Aug 2025, Discovery Learning accelerates battery design evaluation, https://arxiv.org/abs/2508.06985
  • Lin-Han Jia, Si-Yu Han, Wen-Chao Hu, Jie-Jing Shao, Wen-Da Wei, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li, 10 Aug 2025, When Is Prior Knowledge Helpful? Exploring the Evaluation and Selection of Unsupervised Pretext Tasks from a Neuro-Symbolic Perspective, https://arxiv.org/abs/2508.07299
  • Gregory Schuit, Denis Parra, Cecilia Besa, 10 Aug 2025, Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays, https://arxiv.org/abs/2508.07128
  • Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu, 10 Aug 2025, Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks, https://arxiv.org/abs/2508.07179
  • Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, and Masashi Sugiyama, 3 Aug 2025, What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?, https://arxiv.org/abs/2508.06530
  • Bruno L. Pereira, Alan Said, Rodrygo L. T. Santos, 11 Aug 2025, On the Reliability of Sampling Strategies in Offline Recommender Evaluation, https://arxiv.org/abs/2508.05398
  • Xiaohua Feng, Jiaming Zhang, Fengyuan Yu, Chengye Wang, Li Zhang, Kaixiang Li, Yuyuan Li, Chaochao Chen, Jianwei Yin, 26 Jul 2025, A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction, https://arxiv.org/abs/2507.19894
  • Aishwarya Mandyam, Jason Meng, Ge Gao, Jiankai Sun, Mac Schwager, Barbara E. Engelhardt, Emma Brunskill, 26 Jul 2025, PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data, https://arxiv.org/abs/2507.20068
  • Minju Kim, Dongje Yoo, Yeonjun Hwang, Minseok Kang, Namyoung Kim, Minju Gwak, Beong-woo Kwak, Hyungjoo Chae, Harim Kim, Yunjoong Lee, Min Hee Kim, Dayi Jung, Kyong-Mee Chung, Jinyoung Yeo, 25 Jul 2025, Can You Share Your Story? Modeling Clients' Metacognition and Openness for LLM Therapist Evaluation, https://arxiv.org/abs/2507.19643
  • Matin Aghaei, Mohammad Ali Alomrani, Yingxue Zhang, Mahdi Biparva, 26 Jul 2025, When Engineering Outruns Intelligence: A Re-evaluation of Instruction-Guided Navigation, https://arxiv.org/abs/2507.20021
  • Harsh Purohit, Tomoya Nishida, Kota Dohi, Takashi Endo, and Yohei Kawaguchi, 28 Jul 2025, MIMII-Agent: Leveraging LLMs with Function Calling for Relative Evaluation of Anomalous Sound Detection, https://arxiv.org/abs/2507.20666
  • Yonghyun Kim, Wayne Chi, Anastasios N. Angelopoulos, Wei-Lin Chiang, Koichi Saito, Shinji Watanabe, Yuki Mitsufuji, Chris Donahue, 28 Jul 2025, Music Arena: Live Evaluation for Text-to-Music, https://arxiv.org/abs/2507.20900
  • Adrien Bazoge, 28 Jul 2025, MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation, https://arxiv.org/abs/2507.20917
  • Khalid Hasan, Jamil Saquer and Mukulika Ghosh, 17 Jul 2025, Advancing Mental Disorder Detection: A Comparative Evaluation of Transformer and LSTM Architectures on Social Media, https://arxiv.org/abs/2507.19511
  • Hugo Retief, Kayathri Vigneswaran, Surajit Ghosh, Mariangel Garcia Andarcia, Chris Dickens, 28 Jul 2025, Satellite-Surface-Area Machine-Learning Models for Reservoir Storage Estimation: Regime-Sensitive Evaluation and Operational Deployment at Loskop Dam, South Africa, https://arxiv.org/abs/2502.19989
  • Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Jin-Ge Yao, Bowen Qin, Richeng Xuan, Xi Yang, 28 Jul 2025, FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation, https://arxiv.org/abs/2506.09081
  • Afonso Martini Spezia, Mariana Recamonde-Mendoza, 30 Jul 2025, Comparing Cluster-Based Cross-Validation Strategies for Machine Learning Model Evaluation, https://arxiv.org/abs/2507.22299
  • Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje, 4 Aug 2025, DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA, https://arxiv.org/abs/2412.05430
  • Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, Kuntae Kim, 4 Aug 2025, Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons, https://arxiv.org/abs/2411.01281
  • Arthur Cho, 4 Aug 2025, GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics, https://arxiv.org/abs/2508.02926
  • Roshita Bhonsle, Rishav Dutta, Sneha Vavilapalli, Harsh Seth, Abubakarr Jaye, Yapei Chang, Mukund Rungta, Emmanuel Aboah Boateng, Sadid Hasan, Ehi Nosakhare, Soundar Srinivasan, 7 Aug 2025, Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation, https://arxiv.org/abs/2508.05508
  • Dewi S. W. Gould, Bruno Mlodozeniec, Samuel F. Brown, 8 Aug 2025, SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges, https://arxiv.org/abs/2508.06111
  • Yuchen Tian, Kaixin Li, Hao Chen, Ziyang Luo, Hongzhan Lin, Sebastian Schelter, Lun Du, Jing Ma, 13 Aug 2025, AmbiGraph-Eval: Can LLMs Effectively Handle Ambiguous Graph Queries?, https://arxiv.org/abs/2508.09631
  • Seungju Yoo, Hyuk Kwon, Joong-Won Hwang, Kibok Lee, 16 Aug 2025, Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability, https://arxiv.org/abs/2508.12082
  • David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge, 18 Aug 2025, Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation, https://arxiv.org/abs/2508.13144
  • Jun Li, Aaron Aguirre, Junior Moura, Che Liu, Lanhai Zhong, Chenxi Sun, Gari Clifford, Brandon Westover, Shenda Hong, 4 Aug 2025, An Electrocardiogram Foundation Model Built on over 10 Million Recordings with External Evaluation across Multiple Domains, https://arxiv.org/abs/2410.04133
  • Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, Tom Kocmi, 29 Jul 2025, Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation, https://arxiv.org/abs/2504.11829
  • Mahmoud Mohammadi, Yipeng Li, Jane Lo, Wendy Yip, 29 Jul 2025, Evaluation and Benchmarking of LLM Agents: A Survey, https://arxiv.org/abs/2507.21504
  • Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Kai Chen, Xiaofeng Wang, Baosheng Wang, 31 Jul 2025, LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models, https://arxiv.org/abs/2507.22359
  • Ke Miao, Yuke Hu, Xiaochen Li, Wenjie Bao, Zhihao Liu, Zhan Qin, Kui Ren, 2 Aug 2025, Towards Evaluation for Real-World LLM Unlearning, https://arxiv.org/abs/2508.01324
  • Jungkoo Kang, 3 Aug 2025, Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation, https://arxiv.org/abs/2507.02253
  • Jialin Li, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu, 5 Aug 2025, Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework, https://arxiv.org/abs/2508.03622
  • Aditya Pathak, Rachit Gandhi, Vaibhav Uttam, Arnav Ramamoorthy, Pratyush Ghosh, Aaryan Raj Jindal, Shreyash Verma, Aditya Mittal, Aashna Ased, Chirag Khatri, Yashwanth Nakka, Devansh, Jagat Sesh Challa, Dhruv Kumar, 6 Aug 2025, Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics, https://arxiv.org/abs/2503.23989
  • Zachary Robertson, Sanmi Koyejo, 7 Aug 2025, Let's Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes, https://arxiv.org/abs/2508.05469
  • Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, Weinan Zhang, 3 Aug 2025, A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges, https://arxiv.org/abs/2508.05668
  • Tuhina Tripathi, Manya Wadhwa, Greg Durrett, Scott Niekum, 21 Aug 2025, Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation, https://arxiv.org/abs/2504.14716
  • Mingyang Li, Viktor Schlegel, Tingting Mu, Wuraola Oyewusi, Kai Kang, Goran Nenadic, 22 Aug 2025, Evaluation and LLM-Guided Learning of ICD Coding Rationales, https://arxiv.org/abs/2508.16777
  • Patricia Paskov, Michael J. Byun, Kevin Wei, Toby Webster, 22 Jul 2025, Preliminary suggestions for rigorous GPAI model evaluations, https://arxiv.org/abs/2508.00875

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: