Aussie AI

Model Evaluation

  • Last Updated 17 November, 2025
  • by David Spuler, Ph.D.

Leaderboards (Model Evaluation)

Benchmarks for Model Evaluation
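
Before the reference list, here is a minimal sketch of the scoring loop that many of the QA-style benchmarks below share: pose each test item to the model, normalize the model's answer, and compare it with the reference answer to compute exact-match accuracy. This is a generic illustration in Python; the "question"/"answer" fields and the ask_model callback are hypothetical placeholders, not the evaluation protocol of any particular paper in the list.

    # Minimal exact-match benchmark scoring sketch (illustrative only).
    # The "question"/"answer" fields and ask_model() are hypothetical placeholders.
    from typing import Callable, Iterable

    def normalize(text: str) -> str:
        # Lowercase and drop punctuation so trivially different answers still match.
        return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

    def exact_match_accuracy(examples: Iterable[dict], ask_model: Callable[[str], str]) -> float:
        # Score a model on (question, answer) pairs with normalized exact match.
        total = correct = 0
        for ex in examples:
            prediction = ask_model(ex["question"])
            correct += int(normalize(prediction) == normalize(ex["answer"]))
            total += 1
        return correct / max(total, 1)

    if __name__ == "__main__":
        # A tiny in-memory stand-in for a benchmark test set, plus a stub model.
        benchmark = [
            {"question": "What is 2 + 2?", "answer": "4"},
            {"question": "What is the capital of France?", "answer": "Paris"},
        ]
        canned = {"What is 2 + 2?": "4", "What is the capital of France?": "paris"}
        print(exact_match_accuracy(benchmark, lambda q: canned.get(q, "")))

Real benchmark harnesses add task-specific answer extraction, per-category breakdowns, and safeguards against test-set contamination (topics covered by several of the papers below), but the core loop is usually this simple.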

  • Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
  • Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, Sandipan Dandapat, December 2023, Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing https://aclanthology.org/2023.emnlp-main.418/ PDF: https://aclanthology.org/2023.emnlp-main.418.pdf
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You, 3 Jun 2024, MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures, https://arxiv.org/abs/2406.06565
  • Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi, 13 Jun 2024, Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning, https://arxiv.org/abs/2406.09170
  • Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
  • Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
  • Petr Spelda and Vit Stritecky, 13 Aug 2025, Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1, https://arxiv.org/abs/2508.10173
  • Pengbo Shen, Yaqing Wang, Ni Mu, Yao Luan, Runpeng Xie, Senhao Yang, Lexiang Wang, Hao Hu, Shuang Xu, Yiqin Yang, Bo Xu, 14 Aug 2025, SC2Arena and StarEvolve: Benchmark and Self-Improvement Framework for LLMs in Complex Decision-Making Tasks, https://arxiv.org/abs/2508.10428
  • Lucie-Aimée Kaffee and Giada Pistilli and Yacine Jernite, 4 Aug 2025, INTIMA: A Benchmark for Human-AI Companionship Behavior, https://arxiv.org/abs/2508.09998
  • Rakesh Thakur, Sneha Sharma, Gauri Chopra, 4 Aug 2025, HiFACTMix: A Code-Mixed Benchmark and Graph-Aware Model for Evidence-Based Political Claim Verification in Hinglish, https://arxiv.org/abs/2508.10001
  • Nghia Trung Ngo, Franck Dernoncourt and Thien Huu Nguyen, 13 Aug 2025, mSCoRe: a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning, https://arxiv.org/abs/2508.10137
  • Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza, 13 Aug 2025, PakBBQ: A Culturally Adapted Bias Benchmark for QA, https://arxiv.org/abs/2508.10186
  • Chenggang Chen, Zhiyu Yang, 13 Aug 2025, No Free Lunch from Audio Pretraining in Bioacoustics: A Benchmark Study of Embeddings, https://arxiv.org/abs/2508.10230
  • Jieyu Li, Xin Zhang, and Joey Tianyi Zhou, 14 Aug 2025, AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences, https://arxiv.org/abs/2508.10771
  • Brooke R. Weborg and Gursel Serpen, 14 Aug 2025, Empirical Investigation into Configuring Echo State Networks for Representative Benchmark Problem Domains, https://arxiv.org/abs/2508.10887
  • Anand Kumar, Harminder Pal Monga, Tapasi Brahma, Satyam Kalra, Navas Sherif, 14 Aug 2025, Mobile-Friendly Deep Learning for Plant Disease Detection: A Lightweight CNN Benchmark Across 101 Classes of 33 Crops, https://arxiv.org/abs/2508.10817
  • Fabian David Schmidt, Ivan Vulić, Goran Glavaš, David Ifeoluwa Adelani, 13 Aug 2025, Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding, https://arxiv.org/abs/2501.06117
  • Yuping Wang and Xiangyu Huang and Xiaokang Sun and Mingxuan Yan and Shuo Xing and Zhengzhong Tu and Jiachen Li, 14 Aug 2025, UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving, https://arxiv.org/abs/2503.24381
  • Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou, 14 Aug 2025, PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts, https://arxiv.org/abs/2508.09848
  • Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang, Wentao Zhang, 22 Jul 2025, CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos, https://arxiv.org/abs/2507.16878
  • Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang, 23 Jul 2025, SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs, https://arxiv.org/abs/2507.17178
  • Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing, 23 Jul 2025, MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs, https://arxiv.org/abs/2507.17476
  • Helen Jin, Shreya Havaldar, Chaehyeon Kim, Anton Xue, Weiqiu You, Helen Qu, Marco Gatti, Daniel A Hashimoto, Bhuvnesh Jain, Amin Madani, Masao Sako, Lyle Ungar, Eric Wong, 22 Jul 2025, The FIX Benchmark: Extracting Features Interpretable to eXperts, https://arxiv.org/abs/2409.13684
  • Xu Yang, Qi Zhang, Shuming Jiang, Yaowen Xu, Zhaofan Zou, Hao Sun, Xuelong Li, 22 Jul 2025, METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark, https://arxiv.org/abs/2507.16206
  • Goeric Huybrechts, Srikanth Ronanki, Sai Muralidhar Jayanthi, Jack Fitzgerald, Srinivasan Veeravanallur, 18 Jul 2025, Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark, https://arxiv.org/abs/2507.15882
  • Eduardo Pacheco, Atila Orhon, Berkin Durmus, Blaise Munyampirwa, Andrey Leonov, 22 Jul 2025, SDBench: A Comprehensive Benchmark Suite for Speaker Diarization, https://arxiv.org/abs/2507.16136
  • Yasser Ashraf, Ahmed Sharshar, Velibor Bojkovic, Bin Gu, 22 Jul 2025, SPACT18: Spiking Human Action Recognition Benchmark Dataset with Complementary RGB and Thermal Modalities, https://arxiv.org/abs/2507.16151
  • Shalaka Satheesh, Katrin Klug, Katharina Beckh, Héctor Allende-Cid, Sebastian Houben, Teena Hassan, 22 Jul 2025, GG-BBQ: German Gender Bias Benchmark for Question Answering, https://arxiv.org/abs/2507.16410
  • Alireza Dizaji, Benedict Aaron Tjandra, Mehrab Hamidi, Shenyang Huang, Guillaume Rabusseau, 22 Jul 2025, T-GRAB: A Synthetic Diagnostic Benchmark for Learning on Temporal Graphs, https://arxiv.org/abs/2507.10183
  • Huan Liu, Shusen Yang, Yuzhe Zhang, Mengze Wang, Fanyu Gong, Chengxi Xie, Guanjian Liu, Zejun Liu, Yong-Jin Liu, Bao-Liang Lu, Dalin Zhang, 22 Jul 2025, LibEER: A Comprehensive Benchmark and Algorithm Library for EEG-based Emotion Recognition, https://arxiv.org/abs/2410.09767
  • Pierre Sermanet, Anirudha Majumdar, Vikas Sindhwani, 22 Jul 2025, SciFi-Benchmark: Leveraging Science Fiction To Improve Robot Behavior, https://arxiv.org/abs/2503.10706
  • Mustafa Chasmai, Wuao Liu, Subhransu Maji, Grant Van Horn, 21 Jul 2025, Audio Geolocation: A Natural Sounds Benchmark, https://arxiv.org/abs/2505.18726
  • Zehan Li, Hongjie Chen, Yuxin Zhang, Jing Zhou, Xuening Wang, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li, 24 Jul 2025, TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios, https://arxiv.org/abs/2507.18061
  • Minje Park, Jeonghwa Lim, Taehyung Yu, and Sunghoon Joo, 24 Jul 2025, A Multi-Dataset Benchmark for Semi-Supervised Semantic Segmentation in ECG Delineation, https://arxiv.org/abs/2507.18323
  • Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang, 20 Jul 2025, MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation, https://arxiv.org/abs/2507.17773
  • Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer, 24 Jul 2025, BEARCUBS: A benchmark for computer-using web agents, https://arxiv.org/abs/2503.07919
  • Xiao Wang, Qian Zhu, Shujuan Wu, Bo Jiang, Shiliang Zhang, Yaowei Wang, Yonghong Tian, Bin Luo, 18 Jul 2025, When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework, https://arxiv.org/abs/2507.13659
  • Ishant Chintapatla, Kazuma Choji, Naaisha Agarwal, Andrew Lin, Hannah You, Charles Duong, Kevin Zhu, Sean O'Brien, Vasu Sharma, 17 Jul 2025, COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark, https://arxiv.org/abs/2507.13405
  • Jie Ouyang, Tingyue Pan, Mingyue Cheng, Ruiran Yan, Yucong Luo, Jiaying Lin, Qi Liu, 18 Jul 2025, HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation, https://arxiv.org/abs/2503.04800
  • Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, Jinsik Lee, 18 Jul 2025, From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation, https://arxiv.org/abs/2507.08924
  • Zhijiang Tang, Jiaxin Qi, Yuhua Zheng, Jianqiang Huang, 15 Jul 2025, A Comprehensive Benchmark for Electrocardiogram Time-Series, https://arxiv.org/abs/2507.14206
  • Lingbo Li, Anuradha Mathrani, Teo Susnjak, 20 Jul 2025, What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction, https://arxiv.org/abs/2507.15152
  • Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li, 20 Jul 2025, CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents, https://arxiv.org/abs/2407.01511
  • Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang, 19 Jul 2025, TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios, https://arxiv.org/abs/2505.12891
  • María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico, 19 Jul 2025, MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation, https://arxiv.org/abs/2502.17163
  • Tom Sander, Pierre Fernandez, Saeed Mahloujifar, Alain Durmus, Chuan Guo, 21 Jul 2025, Detecting Benchmark Contamination Through Watermarking, https://arxiv.org/abs/2502.17259
  • Ziyu Wang (1), Tao Xue (1), Jingyuan Li (1), Haibin Zhang (1), Zhiqiang Xu (3), Gaofei Xu (4), Zhen Wang (5), Yanbin Wang (2), Zhiquan Liu (6) ((1) Xidian University, (2) Shenzhen MSU-BIT University, (3) Jiangxi University of Science and Technology, (4) Institute of Deep-sea Science and Engineering, (5) Northwestern Polytechnical University, (6) Jinan University), 20 Jul 2025, Can Optical Denoising Clean Sonar Images? A Benchmark and Fusion Approach, https://arxiv.org/abs/2503.01655
  • Shengtao Wen, Haodong Chen, Yadong Wang, Zhongying Pan, Xiang Chen, Yu Tian, Bo Qian, Dong Liang, Sheng-Jun Huang, 9 Aug 2025, MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA, https://arxiv.org/abs/2508.07022
  • Naseem Machlovi, Maryam Saleki, Innocent Ababio, Ruhul Amin, 9 Aug 2025, Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach, https://arxiv.org/abs/2508.07063
  • Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li, 10 Aug 2025, Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach, https://arxiv.org/abs/2508.07353
  • Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo, 11 Aug 2025, MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark, https://arxiv.org/abs/2508.07575
  • Zhe Zhang, Runlin Liu, Aishan Liu, Xingyu Liu, Xiang Gao, Hailong Sun, 10 Aug 2025, Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes, https://arxiv.org/abs/2508.07180
  • Haiyang Guo, Fei Zhu, Hongbo Zhao, Fanhu Zeng, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang, 10 Aug 2025, MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark, https://arxiv.org/abs/2508.07307
  • Sarina Penquitt, Jonathan Klees, Rinor Cakaj, Daniel Kondermann, Matthias Rottmann, Lars Schmarje, 6 Aug 2025, From Label Error Detection to Correction: A Modular Framework and Benchmark for Object Detection Datasets, https://arxiv.org/abs/2508.06556
  • Mohammad Zia Ur Rehman, Anukriti Bhatnagar, Omkar Kabde, Shubhi Bansal, Nagendra Kumar, 7 Aug 2025, ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos, https://arxiv.org/abs/2508.06570
  • Lucia Cipolina-Kun and Marianna Nezhurina and Jenia Jitsev, 10 Aug 2025, Game Reasoning Arena: A Framework and Benchmark for Assessing Reasoning Capabilities of Large Language Models via Game Play, https://arxiv.org/abs/2508.03368
  • Yuxuan Li, Xiang Li, Weijie Li, Qibin Hou, Li Liu, Ming-Ming Cheng, Jian Yang, 9 Aug 2025, SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection, https://arxiv.org/abs/2403.06534
  • Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang, 9 Aug 2025, LVBench: An Extreme Long Video Understanding Benchmark, https://arxiv.org/abs/2406.08035
  • Mihir Godbole, Xiangbo Gao, Zhengzhong Tu, 9 Aug 2025, DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving, https://arxiv.org/abs/2506.17590
  • Zhihao Zhu, Yi Yang, Defu Lian, 9 Aug 2025, TDDBench: A Benchmark for Training data detection, https://arxiv.org/abs/2411.03363
  • Hafsteinn Einarsson, 27 Jul 2025, MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models, https://arxiv.org/abs/2507.20395
  • Chenchen Zhao, Zhengyuan Shi, Xiangyu Wen, Chengjie Liu, Yi Liu, Yunhao Zhou, Yuxiang Zhao, Hefei Feng, Yinan Zhu, Gwok-Waa Wan, Xin Cheng, Weiyu Chen, Yongqi Fu, Chujie Chen, Chenhao Xue, Guangyu Sun, Ying Wang, Yibo Lin, Jun Yang, Ning Xu, Xi Wang and Qiang Xu, 20 Jul 2025, MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs, https://arxiv.org/abs/2507.19525
  • Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues, 25 Jul 2025, MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks, https://arxiv.org/abs/2507.19634
  • Roberto Labadie-Tamayo, Adrian Jaques Böck, Djordje Slijepčević, Xihui Chen, Andreas Babic, Matthias Zeppelzauer, 28 Jul 2025, FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models, https://arxiv.org/abs/2507.20924
  • Xinhan Di, Kristin Qi, Pengqian Yu, 28 Jul 2025, JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1, https://arxiv.org/abs/2507.20987
  • Ali Ismail-Fawaz and Maxime Devanne and Stefano Berretti and Jonathan Weber and Germain Forestier, 28 Jul 2025, Deep Learning for Skeleton Based Human Motion Rehabilitation Assessment: A Benchmark, https://arxiv.org/abs/2507.21018
  • Xuzhao Li and Xuchen Li and Shiyu Hu and Yongzhen Guo and Wentao Zhang, 26 Jul 2025, VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains, https://arxiv.org/abs/2507.09884
  • Hassan Ismail Fawaz, Ganesh Del Grosso, Tanguy Kerdoncuff, Aurelie Boisbunon, Illyyne Saffar, 25 Jul 2025, Deep Unsupervised Domain Adaptation for Time Series Classification: a Benchmark, https://arxiv.org/abs/2312.09857
  • Valay Bundele, Karahan Sarıtaş, Bora Kargi, Oğuz Ata Çal, Kıvanç Tezören, Zohreh Ghaderi, Hendrik Lensch, 26 Jul 2025, Evaluating Self-Supervised Learning in Medical Imaging: A Benchmark for Robustness, Generalizability, and Multi-Domain Impact, https://arxiv.org/abs/2412.19124
  • David Maria Schmidt, Raoul Schubert, Philipp Cimiano, 28 Jul 2025, CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting, https://arxiv.org/abs/2507.21257
  • Haiquan Wang, Yi Chen, Shang Zeng, Yun Bian, Zhe Cui, 29 Jul 2025, GovRelBench: A Benchmark for Government Domain Relevance, https://arxiv.org/abs/2507.21419
  • Leonard Hinckeldey, Elliot Fosong, Elle Miller, Rimvydas Rubavicius, Trevor McInroe, Patricia Wollstadt, Christiane B. Wiebel-Herboth, Subramanian Ramamoorthy, Stefano V. Albrecht, 29 Jul 2025, Assistax: A Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics, https://arxiv.org/abs/2507.21638
  • Amber Huang, Ian Scott Knight, Slava Naprienko, 29 Jul 2025, Data Leakage and Redundancy in the LIT-PCBA Benchmark, https://arxiv.org/abs/2507.21404
  • Qinglong Yang, Haoming Li, Haotian Zhao, Xiaokai Yan, Jingtao Ding, Fengli Xu, Yong Li, 9 Jun 2025, FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents, https://arxiv.org/abs/2507.21071
  • Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz, 28 Jul 2025, StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation, https://arxiv.org/abs/2507.21340
  • Loc Pham, Tung Luu, Thu Vo, Minh Nguyen, Viet Hoang, 29 Jul 2025, VN-MTEB: Vietnamese Massive Text Embedding Benchmark, https://arxiv.org/abs/2507.21500
  • Kristian G. Barman, Sascha Caron, Faegheh Hasibi, Eugene Shalugin, Yoris Marcet, Johannes Otte, Henk W. de Regt, and Merijn Moody, 29 Jul 2025, Towards a Large Physics Benchmark, https://arxiv.org/abs/2507.21695
  • Rohan Hitchcock, Jesse Hoogland, 29 Jul 2025, From Global to Local: A Scalable Benchmark for Local Posterior Sampling, https://arxiv.org/abs/2507.21449
  • Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang, 25 Mar 2025, OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching, https://arxiv.org/abs/2503.21813
  • Jinzhi Wang, Qingke Peng, Haozhou Li, Zeyuan Zeng, Qinfeng Song, Kaixuan Yang, Jiangbo Zhang, Yaoying Wang, Ruimeng Li, Biyi Zhou, 19 Jul 2025, ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing, https://arxiv.org/abs/2507.22911
  • Chengqian Ma, Wei Tao, Yiwen Guo, 30 Jul 2025, C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations, https://arxiv.org/abs/2507.22968
  • Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che, 31 Jul 2025, MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models, https://arxiv.org/abs/2507.23382
  • Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan, 31 Jul 2025, MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks, https://arxiv.org/abs/2507.23511
  • Kai Goebel and Patrik Zips, 31 Jul 2025, Can LLM-Reasoning Models Replace Classical Planning? A Benchmark Study, https://arxiv.org/abs/2507.23589
  • Shimanto Bhowmik, Tawsif Tashwar Dipto, Md Sazzad Islam, Sheryl Hsu, Tahsin Reasat, 31 Jul 2025, Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis, https://arxiv.org/abs/2507.23248
  • Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Kai Chen, Xiaofeng Wang, Baosheng Wang, 31 Jul 2025, LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models, https://arxiv.org/abs/2507.22359
  • Takashi Ishida, Thanawat Lodkaew, Ikko Yamane, 31 Jul 2025, How Can I Publish My LLM Benchmark Without Giving the True Answers Away?, https://arxiv.org/abs/2505.18102
  • Gianluca Carloni, Biagio Brattoli, Seongho Keum, Jongchan Park, Taebum Lee, Chang Ho Ahn, Sergio Pereira, 29 Jul 2025, Pathology Foundation Models are Scanner Sensitive: Benchmark and Mitigation with Contrastive ScanGen Loss, https://arxiv.org/abs/2507.22092
  • Yimeng Liu, Maolin Gan, Yidong Ren, Gen Li, Jingkai Lin, Younsuk Dong, Zhichao Cao, 30 Jul 2025, Hydra-Bench: A Benchmark for Multi-Modal Leaf Wetness Sensing, https://arxiv.org/abs/2507.22685
  • Matej Šprogar, 30 Jul 2025, AGITB: A Signal-Level Benchmark for Evaluating Artificial General Intelligence, https://arxiv.org/abs/2504.04430
  • Thuy Nguyen, Dang Nguyen, Hoang Nguyen, Thuan Luong, Long Hoang Dang, Viet Dac Lai, 30 Jul 2025, OWLViz: An Open-World Benchmark for Visual Question Answering, https://arxiv.org/abs/2503.07631
  • Xiang Xiang, Zhuo Xu, Yao Deng, Qinhao Zhou, Yifan Liang, Ke Chen, Qingfang Zheng, Yaowei Wang, Xilin Chen, Wen Gao, 30 Jul 2025, OpenEarthSensing: Large-Scale Fine-Grained Benchmark for Open-World Remote Sensing, https://arxiv.org/abs/2502.20668
  • Muhammad Farid Adilazuarda, Musa Izzanardi Wijanarko, Lucky Susanto, Khumaisa Nur'aini, Derry Wijaya, Alham Fikri Aji, 25 Feb 2025, NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts, https://arxiv.org/abs/2502.18148
  • Kejia Gao, Liguo Zhou, Mingjun Liu, Alois Knoll, 1 Aug 2025, E2E Parking Dataset: An Open Benchmark for End-to-End Autonomous Parking, https://arxiv.org/abs/2504.10812
  • Junjie Shi, Wei Ma, Shi Ying, Lingxiao Jiang, Yang Liu, Bo Du, 2 Aug 2025, Importance Sampling is All You Need: Predict LLM's performance on new benchmark by reusing existing benchmark, https://arxiv.org/abs/2508.01203
  • Zihan Zheng, Tianle Cui, Chuwen Xie, Jiahui Zhang, Jiahui Pan, Lewei He, Qianglong Chen, 2 Aug 2025, NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset, https://arxiv.org/abs/2508.01330
  • Yuanzhe Shen, Kaimin Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 2 Aug 2025, TripTailor: A Real-World Benchmark for Personalized Travel Planning, https://arxiv.org/abs/2508.01432
  • Xuan Liu, Siru Ouyang, Xianrui Zhong, Jiawei Han, Huimin Zhao, 1 Aug 2025, FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models, https://arxiv.org/abs/2508.01055
  • Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu, 24 Jul 2025, EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow, https://arxiv.org/abs/2507.22929
  • Lyle Regenwetter, Yazan Abu Obaideh, Fabien Chiotti, Ioanna Lykourentzou, Faez Ahmed, 25 May 2025, Bike-Bench: A Bicycle Design Benchmark for Generative Models with Objectives and Constraints, https://arxiv.org/abs/2508.00830
  • Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning, 2 Aug 2025, WebDS: An End-to-End Benchmark for Web-based Data Science, https://arxiv.org/abs/2508.01222
  • Rushin H. Gindra, Giovanni Palla, Mathias Nguyen, Sophia J. Wagner, Manuel Tran, Fabian J Theis, Dieter Saur, Lorin Crawford, Tingying Peng, 2 Aug 2025, A Large-Scale Benchmark of Cross-Modal Learning for Histology and Gene Expression in Spatial Transcriptomics, https://arxiv.org/abs/2508.01490
  • Amir DN Cohen, Hilla Merhav, Yoav Goldberg, Reut Tsarfaty, 3 Aug 2025, HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark, https://arxiv.org/abs/2508.01812
  • Andrea Dosi, Semanto Mondal, Rajib Chandra Ghosh, Massimo Brescia, Giuseppe Longo, 3 Aug 2025, Less is More: AMBER-AFNO -- a New Benchmark for Lightweight 3D Medical Image Segmentation, https://arxiv.org/abs/2508.01941
  • Wanqi Yang, Yanda Li, Yunchao Wei, Meng Fang, Ling Chen, 4 Aug 2025, SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models, https://arxiv.org/abs/2508.02018
  • Yebo Peng, Zixiang Liu, Yaoming Li, Zhizhuo Yang, Xinye Xu, Bowen Ye, Weijun Yuan, Zihan Wang, Tong Yang, 4 Aug 2025, Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems, https://arxiv.org/abs/2508.02208
  • Gustaf Ahdritz, Anat Kleiman, 4 Aug 2025, The SMeL Test: A simple benchmark for media literacy in language models, https://arxiv.org/abs/2508.02074
  • Ivan Karpukhin, Foma Shipilov, Andrey Savchenko, 2 Aug 2025, HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?, https://arxiv.org/abs/2406.14341
  • Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje, 4 Aug 2025, DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA, https://arxiv.org/abs/2412.05430
  • Junying Wang, Wenzhe Li, Yalun Wu, Yingji Liang, Yijin Guo, Chunyi Li, Haodong Duan, Zicheng Zhang, Guangtao Zhai, 2 Aug 2025, Affordance Benchmark for MLLMs, https://arxiv.org/abs/2506.00893
  • Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang, 4 Aug 2025, VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos, https://arxiv.org/abs/2506.10857
  • Xiaohao Liu, Xiaobo Xia, Zhuo Huang, See-Kiong Ng, Tat-Seng Chua, 4 Aug 2025, Towards Modality Generalization: A Benchmark and Prospective Analysis, https://arxiv.org/abs/2412.18277
  • Jaehyun Jeon, Min Soo Kim, Jang Han Yoon, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu, 4 Aug 2025, Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding, https://arxiv.org/abs/2505.05026
  • Feng Rui, Zhiyao Luo, Wei Wang, Yuting Song, Yong Liu, Tingting Zhu, Jianqing Li, and Xingyao Wang, 5 Aug 2025, CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment, https://arxiv.org/abs/2508.03360
  • Haoran Liu, Yihan Zhan, Mingzhe Liu, Yanhua Liu, Peng Li, Zhuo Zuo, Bingqi Liu, Runxi Liu, 3 Aug 2025, Pulse Shape Discrimination Algorithms: Survey and Benchmark, https://arxiv.org/abs/2508.02750
  • Yahia Dalbah, Marcel Worring, Yen-Chia Hsu, 1 Aug 2025, Veli: Unsupervised Method and Unified Benchmark for Low-Cost Air Quality Sensor Correction, https://arxiv.org/abs/2508.02724
  • Yihao Ang, Qiang Wang, Qiang Huang, Yifan Bao, Xinyu Xi, Anthony K. H. Tung, Chen Jin, Zhiyong Huang, 3 Aug 2025, CTBench: Cryptocurrency Time Series Generation Benchmark, https://arxiv.org/abs/2508.02758
  • Zixuan Gu, Qiufeng Fan, Long Sun, Yang Liu, Xiaojun Ye, 5 Aug 2025, VFLAIR-LLM: A Comprehensive Framework and Benchmark for Split Learning of LLMs, https://arxiv.org/abs/2508.03097
  • Longling Geng and Edward Y. Chang, 5 Aug 2025, REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks, https://arxiv.org/abs/2502.18836
  • Alexis Roger, Prateek Humane, Daniel Z. Kaplan, Kshitij Gupta, Qi Sun, George Adamopoulos, Jonathan Siu Chi Lim, Quentin Anthony, Edwin Fennell, Irina Rish, 5 Aug 2025, CHIRP: A Fine-Grained Benchmark for Open-Ended Response Evaluation in Vision-Language Models, https://arxiv.org/abs/2501.09672
  • Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng, 5 Aug 2025, ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark, https://arxiv.org/abs/2506.10960
  • Yue Zhou, Yi Chang, Yuan Wu, 6 Aug 2025, ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges, https://arxiv.org/abs/2508.04576
  • Ashutosh Bandooni and Brindha Subburaj, 31 Jul 2025, GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models, https://arxiv.org/abs/2508.03737
  • Xiao Wang, Ziwen Wang, Wentao Wu, Anjie Wang, Jiashu Wu, Yantao Pan, Chenglong Li, 6 Aug 2025, Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark, https://arxiv.org/abs/2508.04260
  • Xiao Wang, Xufeng Lou, Shiao Wang, Ju Huang, Lan Chen, Bo Jiang, 6 Aug 2025, Long-Term Visual Object Tracking with Event Cameras: An Associative Memory Augmented Tracker and A Benchmark Dataset, https://arxiv.org/abs/2403.05839
  • Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, Ziran Wang, 5 Aug 2025, NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models, https://arxiv.org/abs/2503.12772
  • Xiao Wang, Haiyang Wang, Shiao Wang, Qiang Chen, Jiandong Jin, Haoyu Song, Bo Jiang, Chenglong Li, 6 Aug 2025, RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework, https://arxiv.org/abs/2504.10018
  • Dexuan Xu, Jieyi Wang, Zhongyan Chai, Yongzhi Cao, Hanpin Wang, Huamin Zhang, Yu Huang, 7 Aug 2025, MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models, https://arxiv.org/abs/2508.05083
  • Pouyan Navard, Yasemin Ozkut, Srikar Adhikari, Elaine Situ-LaCasse, Josie Acuña, Adrienne Yarnish, Alper Yilmaz, 5 Aug 2025, ERDES: A Benchmark Video Dataset for Retinal Detachment and Macular Status Classification in Ocular Ultrasound, https://arxiv.org/abs/2508.04735
  • Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov, 7 Aug 2025, ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents, https://arxiv.org/abs/2410.06703
  • Yumeng Fu, Jiayin Zhu, Lingling Zhang, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Yanrui Wu, Wenjun Wu, 8 Aug 2025, GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines, https://arxiv.org/abs/2508.06226
  • Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique, 5 Aug 2025, Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark, https://arxiv.org/abs/2508.05674
  • Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo, 7 Aug 2025, INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance, https://arxiv.org/abs/2406.09105
  • Zheda Mai, Arpita Chowdhury, Zihe Wang, Sooyoung Jeon, Lemeng Wang, Jiacheng Hou, Wei-Lun Chao, 8 Aug 2025, AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models, https://arxiv.org/abs/2506.09082
  • Nikolaos Dionelis, Alessandra Feliciotti, Mattia Marconcini, Devis Peressutti, Nika Oman Kadunc, JaeWan Park, Hagai Raja Sinulingga, Steve Andreas Immanuel, Ba Tran, Caroline Arnold, Nicolas Longépé, 8 Aug 2025, Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge, https://arxiv.org/abs/2502.13818
  • Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, Isil Dillig, 8 Aug 2025, CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation, https://arxiv.org/abs/2504.15254
  • Haiyun Guo, ZhiYan Hou, Yu Chen, Jinghan He, Yandu Sun, Yuzhe Zhou, Shujing Guo, Kuan Zhu, Jinqiao Wang, 31 Jul 2025, MLLM-CBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis, https://arxiv.org/abs/2508.08275
  • Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo, 5 Aug 2025, Putnam-AXIOM: A Functional and Static Benchmark, https://arxiv.org/abs/2508.08292
  • Manuel Herrador, 13 Aug 2025, The PacifAIst Benchmark: Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety?, https://arxiv.org/abs/2508.09762
  • Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, Ziyu Guo, Zhihong Zhu, Hao Wu, Haixin Wang, Yefeng Zheng, Xiaojiang Peng, Xian Wu, Kun Wang, Xiangang Li, Jieping Ye, Pheng-Ann Heng, 11 Aug 2025, MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models, https://arxiv.org/abs/2508.09210
  • Amir Hosseinian, Ashkan Dehghani Zahedani, Umer Mansoor, Noosheen Hashemi, Mark Woodward, 13 Aug 2025, January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis, https://arxiv.org/abs/2508.09966
  • Kechen Li, Yaotian Tao, Ximing Wen, Quanwei Sun, Zifei Gong, Chang Xu, Xizhe Zhang, Tianbo Ji, 13 Aug 2025, GridRoute: A Benchmark for LLM-Based Route Planning with Cardinal Movement in Grid Environments, https://arxiv.org/abs/2505.24306
  • Grigor Bezirganyan, Sana Sellami, Laure Berti-Équille, Sébastien Fournier, 13 Aug 2025, LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data, https://arxiv.org/abs/2406.09864
  • Chunan Liu, Aurelien Pelissier, Yanjun Shao, Lilian Denzler, Andrew C.R. Martin, Brooks Paige and María Rodríguez Martínez, 13 Aug 2025, AbRank: A Benchmark Dataset and Metric-Learning Framework for Antibody-Antigen Affinity Ranking, https://arxiv.org/abs/2506.17857
  • Abhishek Kolari, Mohammadhossein Khojasteh, Yifan Jiang, Floris den Hengst, Filip Ilievski, 14 Aug 2025, ORBIT: An Object Property Reasoning Benchmark for Visual Inference Tasks, https://arxiv.org/abs/2508.10956
  • Wenpeng Xing, Lanyi Wei, Haixiao Hu, Rongchang Li, Mohan Li, Changting Lin, Meng Han, 14 Aug 2025, SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth, https://arxiv.org/abs/2508.11009
  • Beichen Guo, Zhiyuan Wen, Yu Yang, Peng Gao, Ruosong Yang, Jiaxing Shen, 15 Aug 2025, SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems, https://arxiv.org/abs/2508.11310
  • Hongtao Liu, Zhicheng Du, Zihe Wang and Weiran Shen, 16 Aug 2025, CHBench: A Cognitive Hierarchy Benchmark for Evaluating Strategic Reasoning Capability of LLMs, https://arxiv.org/abs/2508.11944
  • Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang, 16 Aug 2025, FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction, https://arxiv.org/abs/2508.11987
  • Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, Vincent Bissonnette, 18 Aug 2025, HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds, https://arxiv.org/abs/2508.12782
  • Fan Li, Xiaoyang Wang, Wenjie Zhang, Ying Zhang, Xuemin Lin, 17 Aug 2025, DHG-Bench: A Comprehensive Benchmark on Deep Hypergraph Learning, https://arxiv.org/abs/2508.12244
  • Manuela Imbriani, Gina Belmonte, Mieke Massink, Alessandro Tofani, Vincenzo Ciancia, 18 Aug 2025, A Multi-Resolution Benchmark Framework for Spatial Reasoning Assessment in Neural Networks, https://arxiv.org/abs/2508.12741
  • Ananya Singha, Harshita Sahijwani, Walt Williams, Emmanuel Aboah Boateng, Nick Hausman, Miguel Di Luca, Keegan Choudhury, Chaya Binet, Vu Le, Tianwei Chen, Oryan Rokeah Chen, Sulaiman Vesal, Sadid Hasan, 14 Aug 2025, Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs, https://arxiv.org/abs/2508.11715
  • Javier Muñoz-Haro and Ruben Tolosana and Ruben Vera-Rodriguez and Aythami Morales and Julian Fierrez, 14 Aug 2025, Privacy-Aware Detection of Fake Identity Documents: Methodology, Benchmark, and Improved Detection Methods (FakeIDet2), https://arxiv.org/abs/2508.11716
  • Elon Ezra, Ariel Weizman, Amos Azaria, 17 Aug 2025, The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution, https://arxiv.org/abs/2508.12277
  • Zhiyuan Ning, Tianle Gu, Jiaxin Song, Shixin Hong, Lingyu Li, Huacan Liu, Jie Li, Yixu Wang, Meng Lingyu, Yan Teng, Yingchun Wang, 18 Aug 2025, LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models, https://arxiv.org/abs/2508.12733
  • Krzysztof Kotowski, Christoph Haskamp, Jacek Andrzejewski, Bogdan Ruszczak, Jakub Nalepa, Daniel Lakey, Peter Collins, Aybike Kolmas, Mauro Bartesaghi, Jose Martinez-Heras, Gabriele De Canio, 17 Aug 2025, European Space Agency Benchmark for Anomaly Detection in Satellite Telemetry, https://arxiv.org/abs/2406.17826
  • Bryan L. M. de Oliveira, Luana G. B. Martins, Bruno Brandão, Murilo L. da Luz, Telma W. de L. Soares, Luckeciano C. Melo, 16 Aug 2025, Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning, https://arxiv.org/abs/2410.14038
  • Hyunjong Ok, Jaeho Lee, 18 Aug 2025, S2Cap: A Benchmark and a Baseline for Singing Style Captioning, https://arxiv.org/abs/2409.09866
  • Haohang Xu, Chengjie Liu, Qihang Wang, Wenhao Huang, Yongjian Xu, Weiyu Chen, Anlan Peng, Zhijun Li, Bo Li, Lei Qi, Jun Yang, Yuan Du, and Li Du, 27 Jun 2025, Image2Net: Datasets, Benchmark and Hybrid Framework to Convert Analog Circuit Diagrams into Netlists, https://arxiv.org/abs/2508.13157
  • Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen, 14 Aug 2025, MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents, https://arxiv.org/abs/2508.13186
  • Yixuan Yang and Daoyuan Wu and Yufan Chen, 17 Aug 2025, MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols, https://arxiv.org/abs/2508.13220
  • James Meaden, Michał Jarosz, Piotr Jodłowski, Grigori Melnik, 19 Aug 2025, COMPASS: A Multi-Dimensional Benchmark for Evaluating Code Generation in Large Language Models, https://arxiv.org/abs/2508.13757
  • Guanghao Jin, Jingpei Wu, Tianpei Guo, Yiyi Niu, Weidong Zhou, Guoyang Liu, 12 Aug 2025, KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge, https://arxiv.org/abs/2508.14080
  • Chanyeol Choi, Jihoon Kwon, Alejandro Lopez-Lira, Chaewoon Kim, Minjae Kim, Juneha Hwang, Jaeseon Ha, Hojun Choi, Suyeol Yun, Yongjin Kim, and Yongjae Lee, 7 Aug 2025, FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering, https://arxiv.org/abs/2508.14052
  • Sujit Roy, Dinesha V. Hegde, Johannes Schmude, Amy Lin, Vishal Gaur, Rohit Lal, Kshitiz Mandal, Talwinder Singh, Andrés Muñoz-Jaramillo, Kang Yang, Chetraj Pandey, Jinsu Hong, Berkay Aydin, Ryan McGranaghan, Spiridon Kasapis, Vishal Upendran, Shah Bahauddin, Daniel da Silva, Marcus Freitag, Iksha Gurung, Nikolai Pogorelov, Campbell Watson, Manil Maskey, Juan Bernabe-Moreno, Rahul Ramachandran, 18 Aug 2025, SuryaBench: Benchmark Dataset for Advancing Machine Learning in Heliophysics and Space Weather Prediction, https://arxiv.org/abs/2508.14107
  • Tapio Pitkäranta, 20 Aug 2025, The NordDRG AI Benchmark for Large Language Models, https://arxiv.org/abs/2506.13790
  • Ningyi Liao, Haoyu Liu, Zulun Zhu, Siqiang Luo, Laks V.S. Lakshmanan, 20 Aug 2025, A Comprehensive Benchmark on Spectral GNNs: The Impact on Efficiency, Memory, and Effectiveness, https://arxiv.org/abs/2406.09675
  • Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran, 21 Aug 2025, GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning, https://arxiv.org/abs/2508.15690
  • Chenghao Zhang, Qingqing Long, Ludi Wang, Wenjuan Cui, Jianjun Yu, Yi Du, 21 Aug 2025, CITE: A Comprehensive Benchmark for Heterogeneous Text-Attributed Graphs on Catalytic Materials, https://arxiv.org/abs/2508.15392
  • Jiahao Xu (Ohio State University, USA), Changchang Yin (Ohio State University Wexner Medical Center, USA), Odysseas Chatzipanagiotou (Ohio State University Wexner Medical Center, USA), Diamantis Tsilimigras (Ohio State University Wexner Medical Center, USA), Kevin Clear (Ohio State University Wexner Medical Center, USA), Bingsheng Yao (Northeastern University, USA), Dakuo Wang (Northeastern University, USA), Timothy Pawlik (Ohio State University Wexner Medical Center, USA), Ping Zhang (Ohio State University, USA), 21 Aug 2025, SurgWound-Bench: A Benchmark for Surgical Wound Diagnosis, https://arxiv.org/abs/2508.15189
  • Nikita Kachaev, Andrei Spiridonov, Andrey Gorodetsky, Kirill Muravyev, Nikita Oskolkov, Aditya Narendra, Vlad Shakhuro, Dmitry Makarov, Aleksandr I. Panov, Polina Fedotova, Alexey K. Kovalev, 21 Aug 2025, Mind and Motion Aligned: A Joint Evaluation IsaacSim Benchmark for Task Planning and Low-Level Policies in Mobile Manipulation, https://arxiv.org/abs/2508.15663
  • Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, Xiangyang Li, 19 Aug 2025, MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers, https://arxiv.org/abs/2508.14925
  • Changshun Wu, Weicheng He, Chih-Hong Cheng, Xiaowei Huang, Saddek Bensalem, 20 Aug 2025, Revisiting Out-of-Distribution Detection in Real-time Object Detection: From Benchmark Pitfalls to a New Mitigation Paradigm, https://arxiv.org/abs/2503.07330
  • Fei Lei, Yibo Yang, Wenxiu Sun, Dahua Lin, 22 Aug 2025, MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use, https://arxiv.org/abs/2508.16260
  • Mohan Jiang, Jin Gao, Jiahao Zhan, Dequan Wang, 14 Aug 2025, MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding, https://arxiv.org/abs/2508.15802
  • Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans, 18 Aug 2025, A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains, https://arxiv.org/abs/2508.15832
  • Ahmed Allam, Youssef Mansour, and Mohamed Shalan, 21 Aug 2025, ASIC-Agent: An Autonomous Multi-Agent System for ASIC Design with Benchmark Evaluation, https://arxiv.org/abs/2508.15940
  • Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie, 21 Aug 2025, Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset, https://arxiv.org/abs/2508.15986
  • Mahinthan Chandramohan, Jovan Jancic, Yuntong Zhang and Padmanabhan Krishnan, 22 Aug 2025, From Benchmark Data To Applicable Program Repair: An Experience Report, https://arxiv.org/abs/2508.16071
  • Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu, 22 Aug 2025, RoMedQA: The First Benchmark for Romanian Medical Question Answering, https://arxiv.org/abs/2508.16390
  • Yakup Abrek Er, Ilker Kesen, Gözde Gül Şahin, Aykut Erdem, 22 Aug 2025, Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish, https://arxiv.org/abs/2508.16431
  • Adil Bahaj, Mounir Ghogho, 22 Aug 2025, PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark, https://arxiv.org/abs/2508.16439
  • Sam Earle, Graham Todd, Yuchen Li, Ahmed Khalifa, Muhammad Umair Nasir, Zehua Jiang, Andrzej Banburski-Fahey, Julian Togelius, 22 Aug 2025, PuzzleJAX: A Benchmark for Reasoning and Learning, https://arxiv.org/abs/2508.16821
  • Nilay Pande, Sahiti Yerramilli, Jayant Sravan Tamarapalli, Rynaa Grover, 24 Aug 2025, MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes, https://arxiv.org/abs/2508.17180
  • Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson, 25 Aug 2025, SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models, https://arxiv.org/abs/2508.18179
  • Robert Yang, 25 Aug 2025, Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery, https://arxiv.org/abs/2508.17681
  • Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Lei Bai, Yunqi Cai, Xi Dai, Shufei Zhang, Jinguang Cheng, Zhong Fang, Hongming Weng, 25 Aug 2025, CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics, https://arxiv.org/abs/2508.18124
  • Fangxin Shang, Yuan Xia, Dalu Yang, Yahui Wang, Binglin Yang, 21 Aug 2025, MedRepBench: A Comprehensive Benchmark for Medical Report Interpretation, https://arxiv.org/abs/2508.16674
  • Liane Makatura, Benjamin Jones, Siyuan Bian, Wojciech Matusik, 25 Aug 2025, MetaGen: A DSL, Database, and Benchmark for VLM-Assisted Metamaterial Generation, https://arxiv.org/abs/2508.17568
  • Wei Xiong and Jiangtong Li and Jie Li and Kun Zhu, 25 Aug 2025, EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation of EEG Foundation Models, https://arxiv.org/abs/2508.17742
  • Keke Lian and Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li and Dong Zhang, 25 Aug 2025, A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code, https://arxiv.org/abs/2508.18106
  • Yajing Yang, Qian Liu, Min-Yen Kan, 23 Aug 2025, DataTales: A Benchmark for Real-World Intelligent Data Narration, https://arxiv.org/abs/2410.17859
  • Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish, 23 Aug 2025, MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark, https://arxiv.org/abs/2506.05587
  • Linbo Cao, Jinman Zhao, 23 Jul 2025, Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks, https://arxiv.org/abs/2507.17747
  • Jianhao Chen, Junyang Ren, Wentao Ding, Haoyuan Ouyang, Wei Hu, Yuzhong Qu, 23 Jul 2025, Conflict Detection for Temporal Knowledge Graphs:A Fast Constraint Mining Algorithm and New Benchmarks, https://arxiv.org/abs/2312.11053
  • Fred Mutisya (1 and 2), Shikoh Gitau (1), Christine Syovata (2), Diana Oigara (2), Ibrahim Matende (2), Muna Aden (2), Munira Ali (2), Ryan Nyotu (2), Diana Marion (2), Job Nyangena (2), Nasubo Ongoma (1), Keith Mbae (1), Elizabeth Wamicha (1), Eric Mibuari (1), Jean Philbert Nsengemana (3), Talkmore Chidede (4) ((1) Qhala (Nairobi, Kenya), (2) Kenya Medical Association (Nairobi, Kenya), (3) Africa CDC (Addis Ababa, Ethiopia), (4) AfCFTA (Accra, Ghana)), 22 Jul 2025, Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens, https://arxiv.org/abs/2507.16322
  • Roland Pihlakas, Joel Pyykkö, 22 Jul 2025, From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent AI safety benchmarks, https://arxiv.org/abs/2410.00081
  • Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu, 10 Aug 2025, Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks, https://arxiv.org/abs/2508.07179
  • Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, and Emad Barsoum, 31 Jul 2025, Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks, https://arxiv.org/abs/2507.23194
  • Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim, 25 Jul 2025, Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and Benchmarks, https://arxiv.org/abs/2507.19684
  • Khloud AL Jallad, Nada Ghneim, Ghaida Rebdawi, 27 Jul 2025, Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?, https://arxiv.org/abs/2507.20419
  • Fred Mutisya (1,2), Shikoh Gitau (1), Nasubo Ongoma (1), Keith Mbae (1), Elizabeth Wamicha (1), 31 Jul 2025, Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench, https://arxiv.org/abs/2508.00081
  • Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen (Cherise) Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert, 30 Jul 2025, Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models, https://arxiv.org/abs/2508.00923
  • Olawale Salaudeen, Nicole Chiou, Shiny Weng, Sanmi Koyejo, 2 Aug 2025, Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?, https://arxiv.org/abs/2504.00186
  • Zizhan Ma, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Wenting Chen, Linlin Shen, 6 Aug 2025, Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models, https://arxiv.org/abs/2508.04325
  • Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang, 7 Aug 2025, Establishing Best Practices for Building Rigorous Agentic Benchmarks, https://arxiv.org/abs/2507.02825
  • Prathamesh Kalamkar, Janani Venugopalan Ph.D., Vivek Raghavan Ph.D, 13 Jul 2021, Indian Legal NLP Benchmarks: A Survey, https://arxiv.org/abs/2107.06056
  • Serina Chang, Ashton Anderson, Jake M. Hofman, 12 Aug 2025, ChatBench: From Static Benchmarks to Human-AI Evaluation, https://arxiv.org/abs/2504.07114
  • Martin Pelikan, Sheikh Shams Azam, Vitaly Feldman, Jan "Honza" Silovsky, Kunal Talwar, Christopher G. Brinton, Tatiana Likhomanenko, 14 Aug 2025, Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping, https://arxiv.org/abs/2310.00098
  • Shengbo Wang, Mingwei Liu, Zike Li, Anji Li, Yanlin Wang, Xin Peng, Zibin Zheng, 18 Aug 2025, EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing, https://arxiv.org/abs/2508.13003
  • Weihao Wu, Liang Cao, Xinyu Wu, Zhiwei Lin, Rui Niu, Jingbei Li, Zhiyong Wu, 4 Sep 2025, VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents, https://arxiv.org/abs/2509.03940
  • Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero, 4 Sep 2025, On Robustness and Reliability of Benchmark-Based Evaluation of LLMs, https://arxiv.org/abs/2509.04013
  • Gyehun Go, Satbyul Han, Ahyeon Choi, Eunjin Choi, Juhan Nam and Jeong Mi Park, 4 Sep 2025, AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation, https://arxiv.org/abs/2509.00813
  • Lu Wang, Hao Chen, Siyu Wu, Zhiyue Wu, Hao Zhou, Chengfeng Zhang, Ting Wang and Haodi Zhang, 4 Sep 2025, AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation, https://arxiv.org/abs/2509.02349
  • Jason Gardner, Ayan Dutta, Swapnoneel Roy, O. Patrick Kreidl, Ladislau Boloni, 5 Sep 2025, Greener Deep Reinforcement Learning: Analysis of Energy and Carbon Efficiency Across Atari Benchmarks, https://arxiv.org/abs/2509.05273
  • Xuan Yao, Qianteng Wang, Xinbo Liu, Ke-Wei Huang, 29 Aug 2025, Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study, https://arxiv.org/abs/2509.04468
  • Shengyin Sun and Yiming Li and Xing Li and Yingzhao Lian and Weizhe Lin and Hui-Ling Zhen and Zhiyuan Yang and Chen Chen and Xianzhi Yu and Mingxuan Yuan and Chen Ma, 30 Aug 2025, Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling, https://arxiv.org/abs/2509.04474
  • Medhasweta Sen, Zachary Gottesman, Jiaxing Qiu, C. Bayan Bruss, Nam Nguyen, Tom Hartvigsen, 5 Sep 2025, BEDTime: A Unified Benchmark for Automatically Describing Time Series, https://arxiv.org/abs/2509.05215
  • Jun Wang, Ninglun Gu, Kailai Zhang, Zijiao Zhang, Yelun Bao, Jin Yang, Xu Yin, Liwei Liu, Yihuan Liu, Pengyong Li, Gary G. Yen, Junchi Yan, 26 Aug 2025, Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap, https://arxiv.org/abs/2508.18646
  • Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, and Liang He, 26 Aug 2025, Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark, https://arxiv.org/abs/2508.19005
  • Ziyi Ni, Huacan Wang, Shuo Zhang, Shuo Lu, Ziyang He, Wang You, Zhenheng Tang, Yuntao Du, Bill Sun, Hongzhang Liu, Sen Hu, Ronghao Chen, Bo Li, Xin Li, Chen Hu, Binxing Jiao, Daxin Jiang, Pin Lyu, 26 Aug 2025, GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging, https://arxiv.org/abs/2508.18993
  • Dan Ristea, Vasilios Mavroudis, 26 Aug 2025, HonestCyberEval: An AI Cyber Risk Benchmark for Automated Software Exploitation, https://arxiv.org/abs/2410.21939
  • Vilém Heinz, Petr Vilím, Zdeněk Hanzálek, 27 Aug 2025, Reinforcement Learning for Search Tree Size Minimization in Constraint Programming: New Results on Scheduling Benchmarks, https://arxiv.org/abs/2508.20056
  • Qian Liang, Menghaoran Tang, Yi Zeng, 8 Aug 2025, MuSpike: A Benchmark and Evaluation Framework for Symbolic Music Generation with Spiking Neural Networks, https://arxiv.org/abs/2508.19251
  • Jiayu Ding, Shuming Ma, Lei Cui, Nanning Zheng, Furu Wei, 26 Aug 2025, LongReasonArena: A Long Reasoning Benchmark for Large Language Models, https://arxiv.org/abs/2508.19363
  • Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia and Carlos Guestrin, 27 Aug 2025, DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis, https://arxiv.org/abs/2508.20033
  • Jiahui Geng, Fengyu Cai, Shaobo Cui, Qing Li, Liangwei Chen, Chenyang Lyu, Haonan Li, Derui Zhu, Walter Pretschner, Heinz Koeppl, Fakhri Karray, 27 Aug 2025, CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval, https://arxiv.org/abs/2506.11066
  • Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, Yao Mu, 27 Aug 2025, RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation, https://arxiv.org/abs/2506.18088
  • Chihiro Taguchi, Seng Mai, Keita Kurabe, Yusuke Sakai, Georgina Agyei, Soudabeh Eslami, David Chiang, 28 Aug 2025, Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark, https://arxiv.org/abs/2508.20511
  • Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, Lingpeng Kong, 29 Aug 2025, MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents, https://arxiv.org/abs/2508.21475
  • Hao Xu, Zhichao Wang, Shengqi Sang, Pisit Wajanasara, Nuno Bandeira, 12 Aug 2025, Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS²-based Proteomics, https://arxiv.org/abs/2508.21076
  • João Guilherme Alves Santos, Giovana Kerche Bonás, Thales Sales Almeida, 29 Aug 2025, BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning, https://arxiv.org/abs/2508.21294
  • Chen Gong, Kecen Li, Zinan Lin, Tianhao Wang, 28 Aug 2025, DPImageBench: A Unified Benchmark for Differentially Private Image Synthesis, https://arxiv.org/abs/2503.14681
  • Terry Jingchen Zhang, Gopal Dev, Ning Wang, Nicole Ni, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin, 26 Aug 2025, Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination, https://arxiv.org/abs/2509.00072
  • Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, Gunhee Kim, 1 Sep 2025, FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games, https://arxiv.org/abs/2509.01052
  • Tung Nguyen, Harkanwar Singh, Nilay Naharas, Lucas Bandarkar, Aditya Grover, 31 Aug 2025, IndiaWeatherBench: A Dataset and Benchmark for Data-Driven Regional Weather Forecasting over India, https://arxiv.org/abs/2509.00653
  • Aryan Amit Barsainyan, Jing Yu Lim, Dianbo Liu, 1 Sep 2025, Toward a Unified Benchmark and Taxonomy of Stochastic Environments, https://arxiv.org/abs/2509.01793
  • Rio Akizuki, Yuya Kudo, Nozomu Yoshinari, Yoichi Hirose, Toshiyuki Nishimoto, Kento Uchida, Shinichi Shirakawa, 2 Sep 2025, Surrogate Benchmarks for Model Merging Optimization, https://arxiv.org/abs/2509.02555
  • Muhammad Ali, Salman Khan, 29 Aug 2025, Waste-Bench: A Comprehensive Benchmark for Evaluating VLLMs in Cluttered Environments, https://arxiv.org/abs/2509.00176
  • Artur Díaz-Juan, Coloma Ballester, Gloria Haro, 1 Sep 2025, SoccerHigh: A Benchmark Dataset for Automatic Soccer Video Summarization, https://arxiv.org/abs/2509.01439
  • Yi Cao, Paulette Clancy, 27 Aug 2025, Migration as a Probe: A Generalizable Benchmark Framework for Specialist vs. Generalist Machine-Learned Force Fields in Doped Materials, https://arxiv.org/abs/2509.00090
  • Yumeng Lin, Dong Li, Xintao Wu, Minglai Shao, Xujiang Zhao, Zhong Chen, Chen Zhao, 31 Aug 2025, Face4FairShifts: A Large Image Benchmark for Fairness and Robust Learning across Visual Domains, https://arxiv.org/abs/2509.00658
  • Feng Wang, Yiding Sun, Jiaxin Mao, Wei Xue, Danqing Xu, 1 Sep 2025, FinS-Pilot: A Benchmark for Online Financial RAG System, https://arxiv.org/abs/2506.02037
  • Ming Zhang, Yujiong Shen, Zelin Li, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang, 31 Aug 2025, LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation, https://arxiv.org/abs/2506.04078
  • Yunxin Sun, Abulhair Saparov, 3 Sep 2025, Language Models Do Not Follow Occam's Razor: A Benchmark for Inductive and Abductive Reasoning, https://arxiv.org/abs/2509.03345
  • Taiga Saito, Yu Otake, Stephen Wu, 3 Sep 2025, Tabular foundation model for GEOAI benchmark problems BM/AirportSoilProperties/2/2025, https://arxiv.org/abs/2509.03191
  • Jigang Fan, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang, Zaixi Zhang, 3 Sep 2025, SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models, https://arxiv.org/abs/2509.03487
  • Roland Pihlakas, Sruthi Kuriakose, 2 Sep 2025, BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format, https://arxiv.org/abs/2509.02655
  • Yuhang Yao, Yuan Li, Xinyi Fan, Junhao Li, Kay Liu, Weizhao Jin, Yu Yang, Srivatsan Ravi, Philip S. Yu, Carlee Joe-Wong, 2 Sep 2025, FedGraph: A Research Library and Benchmark for Federated Graph Learning, https://arxiv.org/abs/2410.06340
  • Pengxiang Zhao, Guangyi Liu, Yaozhen Liang, Weiqing He, Zhengxi Lu, Yuehao Huang, Yaxuan Guo, Kexin Zhang, Hao Wang, Liang Liu, Yong Liu, 8 Sep 2025, MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents, https://arxiv.org/abs/2509.06477
  • Chen Shao, Yue Wang, Zhenyi Zhu, Zhanbo Huang, Sebastian Pütz, Benjamin Schäfer, Tobias Käfer, Michael Färber, 6 Sep 2025, Real-E: A Foundation Benchmark for Advancing Robust and Generalizable Electricity Forecasting, https://arxiv.org/abs/2509.05768
  • Honggang Jia, Xiucheng Wang, Nan Cheng, Ruijin Sun, Changle Li, 8 Sep 2025, UrbanMIMOMap: A Ray-Traced MIMO CSI Dataset with Precoding-Aware Maps and Benchmarks, https://arxiv.org/abs/2509.06270
  • Shay Dahary, Avi Edana, Alexander Apartsin, Yehudit Aperstein, 6 Sep 2025, From Joy to Fear: A Benchmark of Emotion Estimation in Pop Song Lyrics, https://arxiv.org/abs/2509.05617
  • Qianheng Zhang, Song Gao, Chen Wei, Yibo Zhao, Ying Nie, Ziru Chen, Shijie Chen, Yu Su, Huan Sun, 7 Sep 2025, GeoAnalystBench: A GeoAI benchmark for assessing large language models for spatial analysis workflow and code generation, https://arxiv.org/abs/2509.05881
  • Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Si-Yu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, Yu-Feng Li, 6 Sep 2025, ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning, https://arxiv.org/abs/2412.13682
  • Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Bram Grooten, Meng Fang, Yali Du, Mykola Pechenizkiy, 6 Sep 2025, MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning, https://arxiv.org/abs/2506.14990
  • Ziye Chen and Chengwei Qin and Yao Shu, 9 Sep 2025, RIMO: An Easy-to-Evaluate, Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning, https://arxiv.org/abs/2509.07711
  • Fangchen Yu, Haiyuan Wan, Qianjia Cheng, Yuchen Zhang, Jiacheng Chen, Fujun Han, Yulun Wu, Junchi Yao, Ruilizhen Hu, Ning Ding, Yu Cheng, Tao Chen, Lei Bai, Dongzhan Zhou, Yun Luo, Ganqu Cui, Peng Ye, 9 Sep 2025, HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?, https://arxiv.org/abs/2509.07894
  • Yu Song, Zhigang Hua, Yan Xie, Jingzhe Liu, Bo Long, Hui Liu, 28 Aug 2025, GSTBench: A Benchmark Study on the Transferability of Graph Self-Supervised Learning, https://arxiv.org/abs/2509.06975
  • Kutub Uddin, Muhammad Umar Farooq, Awais Khan, Khalid Mahmood Malik, 8 Sep 2025, Adversarial Attacks on Audio Deepfake Detection: A Benchmark and Comparative Study, https://arxiv.org/abs/2509.07132
  • Zonghai Yao, Michael Sun, Won Seok Jang, Sunjae Kwon, Soie Kwon, Hong Yu, 8 Sep 2025, DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge, https://arxiv.org/abs/2509.07188
  • Christopher Brady and Xu Wu (North Carolina State University), 9 Sep 2025, Nuclear Data Adjustment for Nonlinear Applications in the OECD/NEA WPNCS SG14 Benchmark -- A Bayesian Inverse UQ-based Approach for Data Assimilation, https://arxiv.org/abs/2509.07790
  • Timothy Ossowski, Jixuan Chen, Danyal Maqbool, Zefan Cai, Tyler Bradshaw, Junjie Hu, 8 Sep 2025, COMMA: A Communicative Multimodal Multi-Agent Benchmark, https://arxiv.org/abs/2410.07553
  • Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, Chao Zhang, 9 Sep 2025, Audio-centric Video Understanding Benchmark without Text Shortcut, https://arxiv.org/abs/2503.19951
  • Riccardo Lunelli, Angus Nicolson, Samuel Martin Pröll, Sebastian Johannes Reinstadler, Axel Bauer, Clemens Dlaska, 12 Sep 2025, BenchECG and xECG: a benchmark and baseline for ECG foundation models, https://arxiv.org/abs/2509.10151
  • Ninad Bhat, Kieran Browne, Pip Bingemann, 5 Sep 2025, Creativity Benchmark: A benchmark for marketing creativity for LLM models, https://arxiv.org/abs/2509.09702
  • Claudio Pinhanez and Paulo Cavalin and Cassia Sanctos and Marcelo Grave and Yago Primerano, 5 Sep 2025, The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks, https://arxiv.org/abs/2509.09705
  • Aya E. Fouda, Abdelrahamn A. Hassan, Radwa J. Hanafy and Mohammed E. Fouda, 7 Sep 2025, Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry, https://arxiv.org/abs/2509.09711
  • Jun Zhan, Mingyang Han, Yuxuan Xie, Chen Wang, Dong Zhang, Kexin Huang, Haoxiang Shi, DongXiao Wang, Tengtao Song, Qinyuan Cheng, Shimin Li, Jun Song, Xipeng Qiu, Bo Zheng, 9 Sep 2025, VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions, https://arxiv.org/abs/2509.09716
  • Kaikai Zhao, Zhaoxiang Liu, Peng Wang, Xin Wang, Zhicheng Ma, Yajun Xu, Wenjing Zhang, Yibing Nan, Kai Wang, Shiguo Lian, 10 Sep 2025, MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance, https://arxiv.org/abs/2509.09730
  • Jiří Milička, Anna Marklová, Václav Cvrček, 12 Sep 2025, Benchmark of stylistic variation in LLM-generated texts, https://arxiv.org/abs/2509.10179
  • Łukasz Grzybowski, Jakub Pokrywka, Michał Ciesiółka, Jeremi I. Kaczmarek, Marek Kubis, 12 Sep 2025, Polish-English medical knowledge transfer: A new benchmark and results, https://arxiv.org/abs/2412.00559
  • Platon Lukyanenko, Joshua Mayourian, Mingxuan Liu, John K. Triedman, Sunil J. Ghelani, William G. La Cava, 12 Sep 2025, Deep Survival Analysis from Adult and Pediatric Electrocardiograms: A Multi-center Benchmark Study, https://arxiv.org/abs/2406.17002
  • Hangyi Jia, Yuxi Qian, Hanwen Tong, Xinhui Wu, Lin Chen, Feng Wei, 11 Sep 2025, Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization, https://arxiv.org/abs/2509.09321
  • Zhengzhao Lai, Youbin Zheng, Zhenyang Cai, Haonan Lyu, Jinpu Yang, Hongqing Liang, Yan Hu, Benyou Wang, 11 Sep 2025, Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization, https://arxiv.org/abs/2509.09307
  • Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang, 11 Sep 2025, LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering, https://arxiv.org/abs/2509.09614
  • Weixuan Sun, Jucai Zhai, Dengfeng Liu, Xin Zhang, Xiaojun Wu, Qiaobo Hao, AIMgroup, Yang Fang, Jiuyang Tang, 19 Sep 2025, CCrepairBench: A High-Fidelity Benchmark and Reinforcement Learning Framework for C++ Compilation Repair, https://arxiv.org/abs/2509.15690
  • Chi Yang, Fu Wang, Xiaofei Yang, Hao Huang, Weijia Cao, Xiaowen Chu, 19 Sep 2025, SGMAGNet: A Baseline Model for 3D Cloud Phase Structure Reconstruction on a New Passive Active Satellite Benchmark, https://arxiv.org/abs/2509.15706
  • Marylou Fauchard, Florian Carichon, Margarida Carvalho and Golnoosh Farnadi, 16 Sep 2025, Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets, https://arxiv.org/abs/2509.13131
  • Wanru Zhuang, Wenbo Li, Zhibin Lan, Xu Han, Peng Li, Jinsong Su, 14 Sep 2025, PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models, https://arxiv.org/abs/2509.12278
  • Matteo Marcuzzo, Alessandro Zangari, Andrea Albarelli, Jose Camacho-Collados, Mohammad Taher Pilehvar, 15 Sep 2025, MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables, https://arxiv.org/abs/2509.12371
  • Maximus Powers, Shaina Raza, Alex Chang, Rehana Riaz, Umang Mavani, Harshitha Reddy Jonala, Ansh Tiwari, Hua Wei, 15 Sep 2025, Responsible AI in NLP: GUS-Net Span-Level Bias Detection Dataset and Benchmark for Generalizations, Unfairness, and Stereotypes, https://arxiv.org/abs/2410.08388
  • Yao Liang, Dongcheng Zhao, Feifei Zhao, Guobin Shen, Yuwei Wang, Dongqi Liang, Yi Zeng, 16 Sep 2025, MVPBench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values, https://arxiv.org/abs/2509.08022
  • Tara Bogavelli, Roshnee Sharma, Hari Subramani, 13 Sep 2025, AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise, https://arxiv.org/abs/2509.10769
  • Rodrigo Tertulino, 3 Sep 2025, A Comparative Benchmark of Federated Learning Strategies for Mortality Prediction on Heterogeneous and Imbalanced Clinical Data, https://arxiv.org/abs/2509.10517
  • Yonghao Weng and Liqiang Gao and Linwu Zhu and Jian Huang, 14 Sep 2025, MatQnA: A Benchmark Dataset for Multi-modal Large Language Models in Materials Characterization and Analysis, https://arxiv.org/abs/2509.11335
  • Sai Kartheek Reddy Kasu, 15 Sep 2025, EthicsMH: A Pilot Benchmark for Ethical Reasoning in Mental Health AI, https://arxiv.org/abs/2509.11648
  • Payam Latifi, 15 Sep 2025, Is 'Hope' a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities, https://arxiv.org/abs/2509.12098
  • Shuaiqi Wang, Vikas Raunak, Arturs Backurs, Victor Reis, Pei Zhou, Sihao Chen, Longqi Yang, Zinan Lin, Sergey Yekhanin, Giulia Fanti, 12 Sep 2025, Struct-Bench: A Benchmark for Differentially Private Structured Text Generation, https://arxiv.org/abs/2509.10696
  • William Corrias and Fabio De Gaspari and Dorjan Hitaj and Luigi V. Mancini, 15 Sep 2025, MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark, https://arxiv.org/abs/2504.16651
  • Hangyu Li and Qin Zhao and Haoran Xu and Xinyu Jiang and Qingwei Ben and Feiyu Jia and Haoyu Zhao and Liang Xu and Jia Zeng and Hanqing Wang and Bo Dai and Junting Dong and Jiangmiao Pang, 15 Sep 2025, TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation, https://arxiv.org/abs/2505.12748
  • Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, and Nenghai Yu, 15 Sep 2025, MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation, https://arxiv.org/abs/2505.23810
  • Yixiong Fang, Tianran Sun, Yuling Shi, Min Wang, Xiaodong Gu, 14 Sep 2025, LastingBench: Defend Benchmarks Against Knowledge Leakage, https://arxiv.org/abs/2506.21614
  • Yidan Sun, Viktor Schlegel, Srinivasan Nandakumar, Iqra Zahid, Yuping Wu, Yulong Wu, Hao Li, Jie Zhang, Warren Del-Pinto, Goran Nenadic, Siew Kei Lam, Anil Anthony Bharath, 18 Sep 2025, SynBench: A Benchmark for Differentially Private Text Generation, https://arxiv.org/abs/2509.14594
  • Masaharu Mizumoto, Dat Nguyen, Zhiheng Han, Jiyuan Fang, Heyuan Guan, Xingfu Li, Naoya Shiraishi, Xuyang Tian, Yo Nakawake, Le Minh Nguyen, 18 Sep 2025, The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs, https://arxiv.org/abs/2509.14704
  • Rashid Mushkani, 18 Sep 2025, Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark, https://arxiv.org/abs/2509.14574
  • Carolin Benjamins, Helena Graf, Sarah Segel, Difan Deng, Tim Ruhkopf, Leona Hennig, Soham Basu, Neeratyoy Mallik, Edward Bergman, Deyao Chen, François Clément, Alexander Tornede, Matthias Feurer, Katharina Eggensperger, Frank Hutter, Carola Doerr, Marius Lindauer, 18 Sep 2025, carps: A Framework for Comparing N Hyperparameter Optimizers on M Benchmarks, https://arxiv.org/abs/2506.06143
  • Kai Yin, Xiangjue Dong, Chengkai Liu, Lipai Huang, Yiming Xiao, Zhewei Liu, Ali Mostafavi, James Caverlee, 17 Sep 2025, DisastIR: A Comprehensive Information Retrieval Benchmark for Disaster Management, https://arxiv.org/abs/2505.15856
  • Bingjian Yang, Danni Xu, Kaipeng Niu, Wenxuan Liu, Zheng Wang, Mohan Kankanhalli, 8 Sep 2025, A New Dataset and Benchmark for Grounding Multimodal Misinformation, https://arxiv.org/abs/2509.08008
  • David Robinson, Animesh Gupta, Rizwan Quershi, Qiushi Fu, Mubarak Shah, 2 Sep 2025, STROKEVISION-BENCH: A Multimodal Video And 2D Pose Benchmark For Tracking Stroke Recovery, https://arxiv.org/abs/2509.07994
  • Shang Qin, Jingheng Ye, Yinghui Li, Hai-Tao Zheng, Qi Li, Jinxiao Shan, Zhixing Li, Hong-Gee Kim, 17 Sep 2025, CL$^2$GEC: A Multi-Discipline Benchmark for Continual Learning in Chinese Literature Grammatical Error Correction, https://arxiv.org/abs/2509.13672
  • G. Charbel N. Kindji (LACODAM), Lina Maria Rojas-Barahona, Elisa Fromont (LACODAM), Tanguy Urvoy, 17 Sep 2025, Tabular Data Generation Models: An In-Depth Survey and Performance Benchmarks with Extensive Tuning, https://arxiv.org/abs/2406.12945
  • Youngjoon Lee, Jinu Gong, Joonhyuk Kang, 17 Sep 2025, A Unified Benchmark of Federated Learning with Kolmogorov-Arnold Networks for Medical Imaging, https://arxiv.org/abs/2504.19639
  • Sanghyu Yoon, Dongmin Kim, Suhee Yoon, Ye Seul Sim, Seungdong Yoa, Hye-Seung Cho, Soonyoung Lee, Hankook Lee, Woohyung Lim, 2 Oct 2025, ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection, https://arxiv.org/abs/2510.02060
  • Yang Yao, Yixu Wang, Yuxuan Zhang, Yi Lu, Tianle Gu, Lingyu Li, Dingyi Zhao, Keming Wu, Haozhe Wang, Ping Nie, Yan Teng, Yingchun Wang, 2 Oct 2025, A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports, https://arxiv.org/abs/2510.02190
  • Aritra Das, Joseph T. Iosue and Victor V. Albert, 1 Oct 2025, Quantum-inspired Benchmark for Estimating Intrinsic Dimension, https://arxiv.org/abs/2510.01335
  • Dongjun Kim, Gyuho Shim, Yongchan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim, 23 Sep 2025, Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks, https://arxiv.org/abs/2510.01232
  • Punit Kumar Singh, Nishant Kumar, Akash Ghosh, Kunal Pasad, Khushi Soni, Manisha Jaishwal, Sriparna Saha, Syukron Abu Ishaq Alfarozi, Asres Temam Abagissa, Kitsuchart Pasupa, Haiqin Yang, Jose G Moreno, 24 Sep 2025, Let's Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models' Understanding of Sports, https://arxiv.org/abs/2510.01247
  • Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely, 24 Sep 2025, Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs, https://arxiv.org/abs/2510.01254
  • Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, Saab Mansour, 2 Oct 2025, MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization, https://arxiv.org/abs/2510.01659
  • Yuxun Tang, Lan Liu, Wenhao Feng, Yiwen Zhao, Jionghao Han, Yifeng Yu, Jiatong Shi, Qin Jin, 2 Oct 2025, SingMOS-Pro: A Comprehensive Benchmark for Singing Quality Assessment, https://arxiv.org/abs/2510.01812
  • Mario Medrano-Paredes, Carmen Fernández-González, Francisco-Javier Díaz-Pernas, Hichem Saoudi, Javier González-Alonso, Mario Martínez-Zarzuela, 2 Oct 2025, Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities, https://arxiv.org/abs/2510.02264
  • Jian Yao, Ran Cheng, and Kay Chen Tan, 2 Oct 2025, VAR-MATH: Probing True Mathematical Reasoning in LLMs via Symbolic Multi-Instance Benchmarks, https://arxiv.org/abs/2507.12885
  • Aaron Xuxiang Tian, Ruofan Zhang, Jiayao Tang, Young Min Cho, Xueqian Li, Qiang Yi, Ji Wang, Zhunping Zhang, Danrui Qi, Zekun Li, Xingyu Xiang, Sharath Chandra Guntuku, Lyle Ungar, Tianyu Shi, Chi Wang, 1 Oct 2025, Beyond the Strongest LLM: Multi-Turn Multi-Agent Orchestration vs. Single LLMs on Benchmarks, https://arxiv.org/abs/2509.23537
  • Feiyang Cai, Jiahui Bai, Tao Tang, Guijuan He, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, Feng Luo, 2 Oct 2025, MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation, https://arxiv.org/abs/2505.15054
  • Keane Ong, Rui Mao, Deeksha Varshney, Paul Pu Liang, Erik Cambria, Gianmarco Mengaldo, 1 Oct 2025, Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation, https://arxiv.org/abs/2505.19430
  • Monoshi Kumar Roy, Simin Chen, Benjamin Steenhoek, Jinjun Peng, Gail Kaiser, Baishakhi Ray, Wei Le, 2 Oct 2025, CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning, https://arxiv.org/abs/2506.00750
  • A. Alfarano, L. Venturoli, D. Negueruela del Castillo (University of Zurich, Max Planck Society), 14 Oct 2025, VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage, https://arxiv.org/abs/2510.12750
  • Evgeniy Glukhov, Michele Conti, Egor Bogomolov, Yaroslav Golubev, Alexander Bezzubov, 14 Oct 2025, Diff-XYZ: A Benchmark for Evaluating Diff Understanding, https://arxiv.org/abs/2510.12487
  • Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, Han Zhao, 13 Oct 2025, MergeBench: A Benchmark for Merging Domain-Specialized LLMs, https://arxiv.org/abs/2505.10833
  • Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, Wei Wang, 14 Oct 2025, Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series, https://arxiv.org/abs/2506.10412
  • Shen Dong, Mingxuan Zhang, Pengfei He, Li Ma, Bhavani Thuraisingham, Hui Liu, Yue Xing, 14 Oct 2025, PEAR: Planner-Executor Agent Robustness Benchmark, https://arxiv.org/abs/2510.07505
  • Gongping Chen, Lei Zhao, Xiaotao Yin, Liang Cui, Jianxun Zhang, Yu Dai, Ningning Liu, 14 Oct 2025, BAAF: A benchmark attention adaptive framework for medical ultrasound image segmentation tasks, https://arxiv.org/abs/2310.00919
  • Dadi Guo, Tianyi Zhou, Dongrui Liu, Chen Qian, Qihan Ren, Shuai Shao, Zhiyuan Fan, Yi R. Fung, Kun Wang, Linfeng Zhang, Jing Shao, 1 Oct 2025, Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm, https://arxiv.org/abs/2510.00415
  • Yuchen Song, Andong Chen, Wenxin Zhu, Kehai Chen, Xuefeng Bai, Muyun Yang, Tiejun Zhao, 27 Sep 2025, Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness, https://arxiv.org/abs/2510.00041
  • Jinghang Shi, Xiao Yu Tang, Yang Huang, Yuyang Li, Xiaokong, Yanxia Zhang, and Caizhan Yue, 29 Sep 2025, AstroMMBench: A Benchmark for Evaluating Multimodal Large Language Models Capabilities in Astronomy, https://arxiv.org/abs/2510.00063
  • Xin Xu, Xunzhi He, Churan Zhi, Ruizhe Chen, Julian McAuley, Zexue He, 30 Sep 2025, BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses, https://arxiv.org/abs/2510.00232
  • Harethah Abu Shairah, Somayah AlHarbi, Abdulaziz AlHussein, Sameer Alsabea, Omar Shaqaqi, Hebah AlShamlan, Omar Knio, George Turkiyyah, 1 Oct 2025, ALARB: An Arabic Legal Argument Reasoning Benchmark, https://arxiv.org/abs/2510.00694
  • Changyu Zeng, Yifan Wang, Zimu Wang, Wei Wang, Zhengni Yang, Muyi Bao, Jiming Xiao, Anh Nguyen, and Yutao Yue, 1 Oct 2025, NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities, https://arxiv.org/abs/2509.16656
  • Yu Gu, Jingjing Fu, Xiaodong Liu, Jeya Maria Jose Valanarasu, Noel CF Codella, Reuben Tan, Qianchu Liu, Ying Jin, Sheng Zhang, Jinyu Wang, Rui Wang, Lei Song, Guanghui Qin, Naoto Usuyama, Cliff Wong, Hao Cheng, Hohin Lee, Praneeth Sanapathi, Sarah Hilado, Jiang Bian, Javier Alvarez-Valle, Mu Wei, Khalil Malik, Jianfeng Gao, Eric Horvitz, Matthew P Lungren, Hoifung Poon, Paul Vozila, 1 Oct 2025, The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks, https://arxiv.org/abs/2509.18234
  • Siyang Wu, Honglin Bao, Sida Li, Ari Holtzman, James A. Evans, 1 Oct 2025, Mapping Overlaps in Benchmarks through Perplexity in the Wild, https://arxiv.org/abs/2509.23488
  • Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Yufeng Du, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Christopher Wilson, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Xuefei Guo, Luyu Gao, Ze Yang, Chi Xue, Dmytro Bandak, Yaïr Hein, Yonatan Kahn, Kevin Zhou, John Drew Wilson, Jarrod T. Reilly, Di Luo, Daniel Inafuku, Hao Tong, Liang Yang, Ruixing Zhang, Xueying Wang, Ofir Press, Nicolas Chia, et al. (2 additional authors not shown), 1 Oct 2025, Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark, https://arxiv.org/abs/2509.26574
  • Kaiyuan Hou, Minghui Zhao, Lilin Xu, Yuang Fan and Xiaofan Jiang, 30 Sep 2025, TDBench: A Benchmark for Top-Down Image Understanding with Reliability Analysis of Vision-Language Models, https://arxiv.org/abs/2504.03748
  • Peiyu Yang and Naveed Akhtar and Jiantong Jiang and Ajmal Mian, 1 Oct 2025, A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions, https://arxiv.org/abs/2405.02344
  • Nicolas Yax, Pierre-Yves Oudeyer, Stefano Palminteri, 1 Oct 2025, PhyloLM : Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks, https://arxiv.org/abs/2404.04671
  • Nathanael Jo, Ashia Wilson, 23 Sep 2025, What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities, https://arxiv.org/abs/2509.19590
  • Benjamin Feuer, Chiung-Yi Tseng, Astitwa Sarthak Lathe, Oussama Elachqar, John P Dickerson, 24 Sep 2025, When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity, https://arxiv.org/abs/2509.20293
  • Zejun Liu, Yunshan Chen, Chengxi Xie, Huan Liu, 14 Sep 2025, LibEMER: A novel benchmark and algorithms library for EEG-based Multimodal Emotion Recognition, https://arxiv.org/abs/2509.19330
  • Hailay Kidu Teklehaymanot, Gebrearegawi Gidey, Wolfgang Nejdl, 24 Sep 2025, Low-Resource English-Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks, https://arxiv.org/abs/2509.20209
  • Sergey Berezin, Reza Farahbakhsh, Noel Crespi, 24 Sep 2025, Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems, https://arxiv.org/abs/2409.18708
  • Mahdi Zakizadeh and Mohammad Taher Pilehvar, 24 Sep 2025, Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets, https://arxiv.org/abs/2501.01168
  • Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi, 24 Sep 2025, OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models, https://arxiv.org/abs/2506.03135
  • Segev Shlomov, Alon Oved, Sami Marreed, Ido Levy, Offer Akrabi, Avi Yaeli, Łukasz Strąk, Elizabeth Koumpan, Yinon Goldshtein, Eilam Shapira, Nir Mashkif, Asaf Adi, 27 Oct 2025, From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production, https://arxiv.org/abs/2510.23856
  • Anisha Saha, Varsha Suresh, Timothy Hospedales, Vera Demberg, 27 Oct 2025, MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection, https://arxiv.org/abs/2510.23727
  • Yong Huang, Zhongqi Yang, Amir Rahmani, 28 Oct 2025, MIMIC-Sepsis: A Curated Benchmark for Modeling and Learning from Sepsis Trajectories in the ICU, https://arxiv.org/abs/2510.24500
  • Grace Byun, Rebecca Lipschutz, Sean T. Minton, Abigail Lott, Jinho D. Choi, 27 Oct 2025, CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection, https://arxiv.org/abs/2510.23845
  • Mirali Purohit, Bimal Gajera, Vatsal Malaviya, Irish Mehta, Kunal Kasodekar, Jacob Adler, Steven Lu, Umaa Rebbapragada, Hannah Kerner, 28 Oct 2025, Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks, https://arxiv.org/abs/2510.24010
  • Chanwoo Park, Suyoung Park, JiA Kang, Jongyeon Park, Sangho Kim, Hyunji M. Park, Sumin Bae, Mingyu Kang, Jaejin Lee, 28 Oct 2025, Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean, https://arxiv.org/abs/2510.24150
  • Hunzalah Hassan Bhatti, Firoj Alam, 28 Oct 2025, Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants, https://arxiv.org/abs/2510.24328
  • Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma, Yuxuan Zhou, Jian Wu, Bowen Yu, Zuozhu Liu, Junyang Lin, 28 Oct 2025, LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability, https://arxiv.org/abs/2510.24345
  • Kaveh Eskandari Miandoab, Mahammed Kamruzzaman, Arshia Gharooni, Gene Louis Kim, Vasanth Sarathy, Ninareh Mehrabi, 27 Oct 2025, Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation, https://arxiv.org/abs/2510.23921
  • Ahmet Onur Akman, Anastasia Psarou, Michał Hoffmann, Łukasz Gorczyca, Łukasz Kowalski, Paweł Gora, Grzegorz Jamróz, Rafał Kucharski, 28 Oct 2025, URB - Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles, https://arxiv.org/abs/2505.17734
  • Dongwon Choi, Sunwoo Kim, Juyeon Kim, Kyungho Kim, Geon Lee, Shinhwan Kang, Myunghwan Kim, Kijung Shin, 28 Oct 2025, RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases, https://arxiv.org/abs/2506.01360
  • Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, Yiqun Liu, 28 Oct 2025, MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems, https://arxiv.org/abs/2510.17281
  • Roham Koohestani, Philippe de Bekker, Begüm Koç, Maliheh Izadi, 28 Oct 2025, Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality, https://arxiv.org/abs/2503.05860
  • Aneesha Sampath, Oya Aran, Emily Mower Provost, 28 Oct 2025, SEER: The Span-based Emotion Evidence Retrieval Benchmark, https://arxiv.org/abs/2510.03490
  • Yu Wu and Ke Shu and Jonas Fischer and Lidia Pivovarova and David Rosson and Eetu Mäkelä and Mikko Tolonen, 28 Oct 2025, Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark, https://arxiv.org/abs/2510.19585
  • Disheng Liu, Yiran Qiao, Wuche Liu, Yiren Lu, Yunlai Zhou, Tuo Liang, Yu Yin, Jing Ma, 28 Oct 2025, CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data, https://arxiv.org/abs/2503.04852
  • Eric Ngoiya and Tianshu Bao, 23 Oct 2025, Fluidity Index: Next-Generation Super-intelligence Benchmarks, https://arxiv.org/abs/2510.20636
  • Tom Maus, Asma Atamna, Tobias Glasmachers, 23 Oct 2025, Balancing Specialization and Centralization: A Multi-Agent Reinforcement Learning Benchmark for Sequential Industrial Control, https://arxiv.org/abs/2510.20408
  • Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei, Ximing Lu, Meng Jiang, Faeze Brahman, Snigdha Chaturvedi, Haw-Shiuan Chang, Daniel Khashabi, Xiang Lorraine Li, 23 Oct 2025, CreativityPrism: A Holistic Benchmark for Large Language Model Creativity, https://arxiv.org/abs/2510.20091
  • Tian Xia, Tianrun Gao, Wenhao Deng, Long Wei, Xiaowei Qian, Yixian Jiang, Chenglei Yu, Tailin Wu, 23 Oct 2025, BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction, https://arxiv.org/abs/2510.16559
  • Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri, 23 Oct 2025, CLEVER: A Curated Benchmark for Formally Verified Code Generation, https://arxiv.org/abs/2505.13938
  • Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, Weiran Xu, 23 Oct 2025, Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants, https://arxiv.org/abs/2501.01243
  • Weikang Yuan, Kaisong Song, Zhuoren Jiang, Junjie Cao, Yujie Zhang, Jun Lin, Kun Kuang, Ji Zhang, Xiaozhong Liu, 23 Oct 2025, LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation, https://arxiv.org/abs/2505.19667
  • Max Gutbrod, David Rauber, Danilo Weber Nunes, Christoph Palm, 23 Oct 2025, OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection, https://arxiv.org/abs/2503.16247
  • Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri, 22 Oct 2025, Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models, https://arxiv.org/abs/2505.20612
  • Charles Rhys Campbell, Aldo H. Romero, Kamal Choudhary, 17 Oct 2025, AtomBench: A Benchmark for Generative Atomic Structure Models using GPT, Diffusion, and Flow Architectures, https://arxiv.org/abs/2510.16165
  • Ashutosh Srivastava, Lokesh Nagalapatti, Gautam Jajoo, Aniket Vashishtha, Parameswari Krishnamurthy, Amit Sharma, 18 Oct 2025, Realizing LLMs' Causal Potential Requires Science-Grounded, Novel Benchmarks, https://arxiv.org/abs/2510.16530
  • Ioannis Tsaknakis, Bingqing Song, Shuyu Gan, Dongyeop Kang, Alfredo Garcia, Gaowen Liu, Charles Fleming, Mingyi Hong, 20 Oct 2025, Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction, https://arxiv.org/abs/2510.17132
  • Alexander Aghili, Andy Bruce, Daniel Sabo, Sanya Murdeshwar, Kevin Bachelor, Ionut Mistreanu, Ashwin Lokapally and Razvan Marinescu, 20 Oct 2025, A Standardized Benchmark for Machine-Learned Molecular Dynamics using Weighted Ensemble Sampling, https://arxiv.org/abs/2510.17187
  • Tal Barami, Nimrod Berman, Ilan Naiman, Amos H. Hason, Rotem Ezra, Omri Azencot, 20 Oct 2025, Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations, https://arxiv.org/abs/2510.17313
  • Guiyao Tie, Zenghui Yuan, Zeli Zhao, Chaoran Hu, Tianhe Gu, Ruihang Zhang, Sizhe Zhang, Junran Wu, Xiaoyue Tu, Ming Jin, Qingsong Wen, Lixing Chen, Pan Zhou, Lichao Sun, 17 Oct 2025, Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs, https://arxiv.org/abs/2510.16062
  • Ryoto Miyamoto, Xin Fan, Fuyuko Kido, Tsuneo Matsumoto and Hayato Yamana, 18 Oct 2025, OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models, https://arxiv.org/abs/2510.16295
  • Mohammad Javad Ahmadi, Iman Gandomi, Parisa Abdi, Seyed-Farzad Mohammadi, Amirhossein Taslimi, Mehdi Khodaparast, Hassan Hashemi, Mahdi Tavakoli, Hamid D. Taghirad, 18 Oct 2025, Cataract-LMM: Large-Scale, Multi-Source, Multi-Task Benchmark for Deep Learning in Surgical Video Analysis, https://arxiv.org/abs/2510.16371
  • Sheikh Jubair, Arwa Omayrah, Amal Alshammari, Alhanoof Althnian, Abdulhamed Alothaimen, Norah A. Alzahrani, Shahad D. Alzaidi, Nora Al-Twairesh, Abdulmohsen Al-Thubaity, 19 Oct 2025, LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding, https://arxiv.org/abs/2510.16783
  • Yahia Battach, Abdulwahab Felemban, Faizan Farooq Khan, Yousef A. Radwan, Xiang Li, Fabio Marchese, Sara Beery, Burton H. Jones, Francesca Benzoni, Mohamed Elhoseiny, 19 Oct 2025, ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification, https://arxiv.org/abs/2510.16822
  • Numaan Naeem, Abdellah El Mekki, Muhammad Abdul-Mageed, 20 Oct 2025, EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs, https://arxiv.org/abs/2510.17389
  • Yaning Pan, Zekun Wang, Qianqian Xie, Yongqian Wen, Yuanxing Zhang, Guohui Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping, Tianhao Peng, Jiaheng Liu, 20 Oct 2025, MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues, https://arxiv.org/abs/2510.17722
  • Haozhen Zhang, Tao Feng, Pengrui Han, Jiaxuan You, 20 Oct 2025, AcademicEval: Live Long-Context LLM Benchmark, https://arxiv.org/abs/2510.17725
  • Qianru Zhang, Yuting Sun, Honggang Wen, Peng Yang, Xinzhu Li, Ming Li, Kwok-Yan Lam, Siu-Ming Yiu, Hongzhi Yin, 12 Feb 2025, Time Series Analysis in Frequency Domain: A Survey of Open Challenges, Opportunities and Benchmarks, https://arxiv.org/abs/2504.07099
  • Jie Zhang, Cezara Petrui, Kristina Nikolić, Florian Tramèr, 19 Oct 2025, RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics, https://arxiv.org/abs/2505.12575
  • Pei Yang, Hai Ci, and Mike Zheng Shou, 18 Oct 2025, macOSWorld: A Multilingual Interactive Benchmark for GUI Agents, https://arxiv.org/abs/2506.04135
  • Zhining Liu, Zihao Li, Ze Yang, Tianxin Wei, Jian Kang, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong, 20 Oct 2025, CLIMB: Class-imbalanced Learning Benchmark on Tabular Data, https://arxiv.org/abs/2505.17451
  • Weijie Xu, Shixian Cui, Xi Fang, Chi Xue, Stephanie Eckman, Chandan K. Reddy, 17 Oct 2025, SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions, https://arxiv.org/abs/2506.00643
  • Mohammad Ramezanali, Mo Vazifeh, Paolo Santi, 21 Sep 2025, seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs, https://arxiv.org/abs/2509.16866
  • Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, and Junhua Zhao, 22 Sep 2025, EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving, https://arxiv.org/abs/2509.17677
  • Michelangelo Conserva, Remo Sasso, Paulo Rauber, 21 Sep 2025, On the Limits of Tabular Hardness Metrics for Deep RL: A Study with the Pharos Benchmark, https://arxiv.org/abs/2509.17092
  • Siu Hang Ho, Prasad Ganesan, Nguyen Duong, Daniel Schlabig, 22 Sep 2025, Optimizing Inference in Transformer-Based Models: A Multi-Method Benchmark, https://arxiv.org/abs/2509.17894
  • Asiya Ibrahim Zanga, Salisu Mamman Abdulrahman, Abubakar Ado, Abdulkadir Abubakar Bichi, Lukman Aliyu Jibril, Abdulmajid Babangida Umar, Alhassan Adamu, Shamsuddeen Hassan Muhammad and Bashir Salisu Abubakar, 17 Sep 2025, HausaMovieReview: A Benchmark Dataset for Sentiment Analysis in Low-Resource African Language, https://arxiv.org/abs/2509.16256
  • Burak Satar, Zhixin Ma, Patrick A. Irawan, Wilfried A. Mulyawan, Jing Jiang, Ee-Peng Lim, Chong-Wah Ngo, 20 Sep 2025, Seeing Culture: A Benchmark for Visual Reasoning and Grounding, https://arxiv.org/abs/2509.16517
  • Ritabrata Chakraborty, Avijit Dasgupta, Sandeep Chaurasia, 20 Sep 2025, CAMBench-QR : A Structure-Aware Benchmark for Post-Hoc Explanations with QR Understanding, https://arxiv.org/abs/2509.16745
  • Hang Du, Jiayang Zhang, Guoshun Nan, Wendi Deng, Zhenyan Chen, Chenyang Zhang, Wang Xiao, Shan Huang, Yuqi Pan, Tao Qi, and Sicong Leng, 21 Sep 2025, From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning, https://arxiv.org/abs/2509.17040
  • Yuzhen Lei, Hongbin Xie, Jiaxing Zhao, Shuangxue Liu, Xuan Song, 22 Sep 2025, MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents, https://arxiv.org/abs/2509.17628
  • Zhichao Ma, Fan Huang, Lu Zhao, Fengjun Guo, Guangtao Zhai, Xiongkuo Min, 21 Sep 2025, DocIQ: A Benchmark Dataset and Feature Fusion Network for Document Image Quality Assessment, https://arxiv.org/abs/2509.17012
  • Florinel Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, 22 Sep 2025, PRNU-Bench: A Novel Benchmark and Model for PRNU-Based Camera Identification, https://arxiv.org/abs/2509.17581
  • Peter Jansen, Samiah Hassan, Ruoyao Wang, 19 Sep 2025, Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science, https://arxiv.org/abs/2506.04410
  • Minglai Yang, Ethan Huang, Liang Zhang, Mihai Surdeanu, William Wang, and Liangming Pan, 22 Sep 2025, How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark, https://arxiv.org/abs/2505.18761
  • Alireza Salemi, Hamed Zamani, 20 Sep 2025, LaMP-QA: A Benchmark for Personalized Long-form Question Answering, https://arxiv.org/abs/2506.00137
  • Changti Wu and Shijie Lian and Zihao Liu and Lei Zhang and Laurence Tianruo Yang and Kai Chen, 25 Oct 2025, DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry, https://arxiv.org/abs/2510.22340
  • Shilaj Baral, Youngkyu Lee, Sangam Khanal, Joongoo Jeon, 21 Oct 2025, Residual-guided AI-CFD hybrid method enables stable and scalable simulations: from 2D benchmarks to 3D applications, https://arxiv.org/abs/2510.21804
  • Ana K. Rivera, Anvita Bhagavathula, Alvaro Carbonero, and Priya Donti, 24 Oct 2025, PF$\Delta$: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations, https://arxiv.org/abs/2510.22048
  • Darshana Priyasad, Tharindu Fernando, Maryam Haghighat, Harshala Gammulle, Clinton Fookes, 27 Oct 2025, Transforming volcanic monitoring: A dataset and benchmark for onboard volcano activity detection, https://arxiv.org/abs/2510.22889
  • Shvetank Prakash, Andrew Cheng, Arya Tschand, Mark Mazumder, Varun Gohil, Jeffrey Ma, Jason Yik, Zishen Wan, Jessica Quaye, Elisavet Lydia Alvanaki, Avinash Kumar, Chandrashis Mazumdar, Tuhin Khare, Alexander Ingare, Ikechukwu Uchendu, Radhika Ghosal, Abhishek Tyagi, Chenyu Wang, Andrea Mattia Garavagno, Sarah Gu, Alice Guo, Grace Hur, Luca Carloni, Tushar Krishna, Ankita Nayak, Amir Yazdanbakhsh, Vijay Janapa Reddi, 24 Oct 2025, QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture, https://arxiv.org/abs/2510.22087
  • Yutao Wu, Xiao Liu, Yunhao Feng, Jiale Ding, Xingjun Ma, 25 Oct 2025, PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading, https://arxiv.org/abs/2510.22242
  • Iliass Ayaou and Denis Cavallucci, 25 Oct 2025, PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding, https://arxiv.org/abs/2510.22264
  • Chenyu Zhang, Tairen Zhang, Lanjun Wang, Ruidong Chen, Wenhui Li, Anan Liu, 25 Oct 2025, T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model, https://arxiv.org/abs/2510.22300
  • Mahiro Ukai, Shuhei Kurita and Nakamasa Inoue, 26 Oct 2025, STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models, https://arxiv.org/abs/2510.22571
  • Daoyu Wang, Mingyue Cheng, Qi Liu, Shuo Yu, Zirui Liu, Ze Guo, 27 Oct 2025, PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature, https://arxiv.org/abs/2510.10909
  • Ziheng Cheng, Yixiao Huang, Hui Xu, Somayeh Sojoudi, Xuandong Zhao, Dawn Song, Song Mei, 25 Oct 2025, OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models, https://arxiv.org/abs/2505.21347
  • Han Deng, Yuan Meng, Shixiang Tang, Wanli Ouyang, Xinzhu Ma, 26 Oct 2025, CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming, https://arxiv.org/abs/2505.12925
  • Hyungyung Lee, Geon Choi, Jung-Oh Lee, Hangyul Yoon, Hyuk Gi Hong, Edward Choi, 27 Oct 2025, CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays, https://arxiv.org/abs/2505.18087
  • Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongyan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan, 25 Oct 2025, SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus, https://arxiv.org/abs/2510.03160
  • Chen Chen, ZeYang Hu, Fengjiao Chen, Liya Ma, Jiaxing Liu, Xiaoyu Li and Xuezhi Cao, 27 Oct 2025, UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in OmniModels, https://arxiv.org/abs/2510.18915
  • Zhaomin Wu, Ziyang Wang, Bingsheng He, 27 Oct 2025, WikiDBGraph: A Data Management Benchmark Suite for Collaborative Learning over Database Silos, https://arxiv.org/abs/2505.16635
  • Shrey Pandit, Austin Xu, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, Shafiq Joty, 15 Oct 2025, Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math, https://arxiv.org/abs/2510.13744
  • Ivan Dubrovsky, Anastasia Orlova, Illarion Iov, Nina Gubina, Irena Gureeva, Alexey Zaytsev, 15 Oct 2025, Selective Adversarial Attacks on LLM Benchmarks, https://arxiv.org/abs/2510.13570
  • Xiangyu Zhao, Wanghan Xu, Bo Liu, Yuhao Zhou, Fenghua Ling, Ben Fei, Xiaoyu Yue, Lei Bai, Wenlong Zhang, Xiao-Ming Wu, 15 Oct 2025, MSEarth: A Multimodal Scientific Dataset and Benchmark for Phenomena Uncovering in Earth Science, https://arxiv.org/abs/2505.20740
  • Haiyang Li, Yaxiong Wang, Shengeng Tang, Lianwei Wu, Lechao Cheng, Zhun Zhong, 15 Oct 2025, Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline, https://arxiv.org/abs/2509.25991
  • Xingjian Zhou, Keyi Shen, Andy Xu, Hongji Xu, Cho-Jui Hsieh, Huan Zhang, Zhouxing Shi, 14 Oct 2025, SoundnessBench: A Soundness Benchmark for Neural Network Verifiers, https://arxiv.org/abs/2412.03154
  • Zhijian Xu, Wanxu Cai, Xilin Dai, Zhaorong Deng, Qiang Xu, 15 Oct 2025, Fidel-TS: A High-Fidelity Benchmark for Multimodal Time Series Forecasting, https://arxiv.org/abs/2509.24789
  • Spandan Garg, Benjamin Steenhoek, Yufan Huang, 14 Oct 2025, Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation, https://arxiv.org/abs/2510.08996
  • Jiaqi Shao, Yuxiang Lin, Munish Prasad Lohani, Yufeng Miao, Bing Luo, 26 Sep 2025, Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents, https://arxiv.org/abs/2509.22391
  • Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Lijun Wang, Yuanyuan Peng, Huan Gao, Mingkun Xu, Shangyang Li, 26 Sep 2025, Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks, https://arxiv.org/abs/2509.22258
  • Hui Li, Changhao Jiang, Hongyu Wang, Ming Zhang, Jiajun Sun, Zhixiong Yang, Yifei Cao, Shihan Dou, Xiaoran Fan, Baoyu Fan, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang, 26 Sep 2025, MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark, https://arxiv.org/abs/2509.22461
  • Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, Xianpeng Lang, 26 Sep 2025, DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models, https://arxiv.org/abs/2506.05667
  • Raj Ghugare, Catherine Ji, Kathryn Wantlin, Jin Schofield, Benjamin Eysenbach, 7 Oct 2025, BuilderBench -- A benchmark for generalist agents, https://arxiv.org/abs/2510.06288
  • Prabhant Singh, Sibylle Hess, Joaquin Vanschoren, 7 Oct 2025, How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation, https://arxiv.org/abs/2510.06448
  • Tian Qin, Felix Bai, Ting-Yao Hu, Raviteja Vemulapalli, Hema Swetha Koppula, Zhiyang Xu, Bowen Jin, Mert Cemri, Jiarui Lu, Zirui Wang, Meng Cao, 8 Oct 2025, COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization, https://arxiv.org/abs/2510.07043
  • Laurent Brisson (IMT Atlantique - DSD), Cécile Bothorel (IMT Atlantique - DSD), Nicolas Duminy (IMT Atlantique, IMT Atlantique - DSD), 3 Oct 2025, DynBenchmark: Customizable Ground Truths to Benchmark Community Detection and Tracking in Temporal Networks, https://arxiv.org/abs/2510.06245
  • Haotian Wu, Shufan Jiang, Chios Chen, Yiyang Feng, Hehai Lin, Heqing Zou, Yao Shu, Yanran Li, Chengwei Qin, 8 Oct 2025, FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline, https://arxiv.org/abs/2510.06800
  • Peize He, Zichen Wen, Yubo Wang, Yuxuan Wang, Xiaoqian Liu, Jiajie Huang, Zehui Lei, Zhuangcheng Gu, Xiangqi Jin, Jiabing Yang, Kai Li, Zhifei Liu, Weijia Li, Cunxiang Wang, Conghui He, Linfeng Zhang, 8 Oct 2025, AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs, https://arxiv.org/abs/2510.07293
  • Junli Liu, Qizhi Chen, Zhigang Wang, Yiwen Tang, Yiting Zhang, Chi Yan, Dong Wang, Xuelong Li, Bin Zhao, 8 Oct 2025, AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations, https://arxiv.org/abs/2504.07836
  • Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Om Chabra, Sivaprasad Sudhir, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska, 7 Oct 2025, KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes, https://arxiv.org/abs/2506.06541
  • Yunqi Huang, Nishith Chennakeshava, Alexis Carras, Vladislav Neverov, Wei Liu, Aske Plaat, Yingjie Fan, 2 Oct 2025, A Benchmark Study of Deep Reinforcement Learning Algorithms for the Container Stowage Planning Problem, https://arxiv.org/abs/2510.02589
  • Rakshith S Srinivasa, Zora Che, Chen Bo Calvin Zhang, Diego Mares, Ernesto Hernandez, Jayeon Park, Dean Lee, Guillermo Mangialardi, Charmaine Ng, Ed-Yeremai Hernandez Cardona, Anisha Gunjal, Yunzhong He, Bing Liu, Chen Xing, 3 Oct 2025, TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models, https://arxiv.org/abs/2510.02663
  • Xiao-Wen Yang, Zihao Zhang, Jianuo Cao, Zhi Zhou, Zenan Li, Lan-Zhe Guo, Yuan Yao, Taolue Chen, Yu-Feng Li, and Xiaoxing Ma, 26 Sep 2025, FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory, https://arxiv.org/abs/2510.02335
  • Xinjie Shen, Mufei Li, Pan Li, 27 Sep 2025, Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark, https://arxiv.org/abs/2510.02356
  • Han Wang, Haoyu Li, Brian Ko, Huan Zhang, 30 Sep 2025, On The Fragility of Benchmark Contamination Detection in Reasoning Models, https://arxiv.org/abs/2510.02386
  • Abhinav Arun, Reetu Raj Harsh, Bhaskarjit Sarmah, Stefano Pasquali, 3 Oct 2025, FinReflectKG - MultiHop: Financial QA Benchmark for Reasoning with Knowledge Graph Evidence, https://arxiv.org/abs/2510.02906
  • Nicole N Khatibi, Daniil A. Radamovich, Michael P. Brenner, 23 Sep 2025, EEFSUVA: A New Mathematical Olympiad Benchmark, https://arxiv.org/abs/2510.01227
  • Ingrid Navarro, Pablo Ortega-Kral, Jay Patrikar, Haichuan Wang, Alonso Cano, Zelin Ye, Jong Hoon Park, Sebastian Scherer and Jean Oh, 3 Oct 2025, Amelia: A Large Dataset and Benchmark for Airport Surface Movement Forecasting, https://arxiv.org/abs/2407.21185
  • Haoran Zhang, Chenhao Zhu, Sicong Guo, Hanzhe Guo, Haiming Li, Donglin Yu, 21 Oct 2025, StarBench: A Turn-Based RPG Benchmark for Agentic Multimodal Decision-Making and Information Seeking, https://arxiv.org/abs/2510.18483
  • Ho Fai Leung, Xiaoyan Xi, Fei Zuo, 21 Oct 2025, AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification, https://arxiv.org/abs/2510.18488
  • Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu, 19 Oct 2025, Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism, https://arxiv.org/abs/2510.17896
  • Rikard Vinge and Isabelle Wittmann and Jannik Schneider and Michael Marszalek and Luis Gilch and Thomas Brunschwiler and Conrad M Albrecht, 19 Oct 2025, NeuCo-Bench: A Novel Benchmark Framework for Neural Embeddings in Earth Observation, https://arxiv.org/abs/2510.17914
  • Seunghee Ryu, Donghoon Kwon, Seongjin Choi, Aryan Deshwal, Seungmo Kang, Carolina Osorio, 21 Oct 2025, BO4Mob: Bayesian Optimization Benchmarks for High-Dimensional Urban Mobility Problem, https://arxiv.org/abs/2510.18824
  • Jiahao Tang, Henry Hengyuan Zhao, Lijian Wu, Yifei Tao, Dongxing Mao, Yang Wan, Jingru Tan, Min Zeng, Min Li, Alex Jinpeng Wang, 20 Oct 2025, From Charts to Code: A Hierarchical Benchmark for Multimodal Models, https://arxiv.org/abs/2510.17932
  • Nishant Subramani, Alfredo Gomez, Mona Diab, 20 Oct 2025, SimBA: Simplifying Benchmark Analysis Using Performance Matrices Alone, https://arxiv.org/abs/2510.17998
  • Zeyuan Ma, Yue-Jiao Gong, Hongshu Guo, Wenjie Qiu, Sijie Ma, Hongqiao Lian, Jiajun Zhan, Kaixu Chen, Chen Wang, Zhiyang Huang, Zechuan Huang, Guojun Peng, Ran Cheng, Yining Ma, 21 Oct 2025, MetaBox-v2: A Unified Benchmark Platform for Meta-Black-Box Optimization, https://arxiv.org/abs/2505.17745
  • Yue Jiang, Jichu Li, Yang Liu, Dingkang Yang, Feng Zhou, Quyu Kong, 21 Oct 2025, DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding, https://arxiv.org/abs/2505.18411
  • Samarth Goel, Reagan J. Lee, Kannan Ramchandran, 25 Sep 2025, SAGE: A Realistic Benchmark for Semantic Understanding, https://arxiv.org/abs/2509.21310
  • Meng Wan, Benxi Tian, Jue Wang, Cui Hui, Ningming Nie, Tiantian Liu, Zongguo Wang, Cao Rongqiang, Peng Shi, and Yangang Wang, 25 Sep 2025, Lossless Compression: A New Benchmark for Time Series Model Evaluation, https://arxiv.org/abs/2509.21002
  • Nithin Somasekharan, Ling Yue, Yadi Cao, Weichao Li, Patrick Emami, Pochinapeddi Sai Bhargav, Anurag Acharya, Xingyu Xie, Shaowu Pan, 19 Sep 2025, CFD-LLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics, https://arxiv.org/abs/2509.20374
  • Jungsoo Park, Ethan Mendes, Gabriel Stanovsky, Alan Ritter, 25 Sep 2025, Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions, https://arxiv.org/abs/2509.20645
  • Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud, 25 Sep 2025, CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density, https://arxiv.org/abs/2509.18458
  • Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, Qing Wang, 28 Sep 2025, Diagnosing Failure Root Causes in Platform-Orchestrated Agentic Systems: Dataset, Taxonomy, and Benchmark, https://arxiv.org/abs/2509.23735
  • Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, Zuyan Liu, Bohan Zeng, Ruizhe Chen, Qixun Wang, Zhuoran Zhang, Xinlong Chen, Chengzhuo Tong, Bozhou Li, Chaoyou Fu, Qiang Liu, Haotian Wang, Wenjing Yang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang, Ziwei Liu, 29 Sep 2025, RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark, https://arxiv.org/abs/2509.24897
  • Prashant Govindarajan, Mathieu Reymond, Antoine Clavaud, Mariano Phielipp, Santiago Miret and Sarath Chandar, 27 Sep 2025, CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning, https://arxiv.org/abs/2509.23156
  • Xavier Aramayo Carrasco, Grigoriy Ksenofontov, Aleksei Leonov, Iaroslav Sergeevich Koshelev, Alexander Korotin, 27 Sep 2025, Entering the Era of Discrete Diffusion Models: A Benchmark for Schrödinger Bridges and Entropic Optimal Transport, https://arxiv.org/abs/2509.23348
  • Jiahao Ying, Mingbao Lin, Qianru Sun, Yixin Cao, 28 Sep 2025, Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms, https://arxiv.org/abs/2509.23933
  • Hongpei Li, Ziyan He, Yufei Wang, Wenting Tu, Shanwen Pu, Qi Deng and Dongdong Ge, 3 Jun 2025, BenLOC: A Benchmark for Learning to Configure MIP Optimizers, https://arxiv.org/abs/2506.02752
  • Jie Cai, Kangning Yang, Lan Fu, Jiaming Ding, Jinlong Li, Huiming Sun, Daitao Xing, Jinglin Shen, Zibo Meng, 25 Sep 2025, CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models, https://arxiv.org/abs/2509.22737
  • Wei Zhou, Guoliang Li, Haoyu Wang, Yuxing Han, Xufei Wu, Fan Wu, Xuanhe Zhou, 27 Sep 2025, PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation, https://arxiv.org/abs/2509.23338
  • Amit Agarwal, Hitesh Laxmichand Patel, Srikant Panda, Hansa Meghwani, Jyotika Singh, Karan Dua, Paul Li, Tao Sheng, Sujith Ravi, Dan Roth, 28 Sep 2025, RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks, https://arxiv.org/abs/2509.23673
  • Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, and Michael Qizhe Shieh, 28 Sep 2025, MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use, https://arxiv.org/abs/2509.24002
  • Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre, Meng Lu, Morteza Ziyadi, Xuan Wang, 29 Sep 2025, BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models, https://arxiv.org/abs/2509.24210
  • Gyuhyeon Seo, Jungwoo Yang, Junseong Pyo, Nalim Kim, Jonggeun Lee, Yohan Jo, 29 Sep 2025, SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents, https://arxiv.org/abs/2509.24282
  • Mohamad Ballout, Okajevo Wilfred, Seyedalireza Yaghoubi, Nohayr Muhammad Abdelmoneim, Julius Mayer, Elia Bruni, 29 Sep 2025, Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs, https://arxiv.org/abs/2509.24640
  • Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, Wangchunshu Zhou, 29 Sep 2025, Towards Personalized Deep Research: Benchmarks and Evaluations, https://arxiv.org/abs/2509.25106
  • Dumitran Adrian Marius, Theodor-Pierre Moroianu and Buca Mihnea-Vicentiu, 3 Jul 2025, MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks, https://arxiv.org/abs/2507.03162
  • Adrian-Marius Dumitran and Alexandra-Mihaela Danila and Angela-Liliana Dumitran, 19 Aug 2025, GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs, https://arxiv.org/abs/2508.14279
  • Sergiu Bursuc (BAIF), Theodore Ehrenborg (BAIF), Shaowei Lin (BAIF), Lacramioara Astefanoaei (BAIF), Ionel Emilian Chiosa (MIT), Jure Kukovec (BAIF), Alok Singh (BAIF), Oliver Butterley (BAIF), Adem Bizid (BAIF), Quinn Dougherty (BAIF), Miranda Zhao (MIT), Max Tan (MIT), Max Tegmark (MIT), 26 Sep 2025, A benchmark for vericoding: formally verified program synthesis, https://arxiv.org/abs/2509.22908
  • Philipp D. Siedler, 27 Sep 2025, SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution, https://arxiv.org/abs/2505.16048
  • Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar, Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, Jaewoong Cho, 29 Sep 2025, Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games, https://arxiv.org/abs/2506.03610
  • Pengyun Wang, Junyu Luo, Yanxin Shen, Ming Zhang, Shaoen Qin, Siyu Heng, Xiao Luo, 28 Sep 2025, A Comprehensive Graph Pooling Benchmark: Effectiveness, Robustness and Generalizability, https://arxiv.org/abs/2406.09031
  • Yuxuan Wang, Haixu Wu and Jiaxiang Dong, Yong Liu, Chen Wang, Mingsheng Long, Jianmin Wang, 27 Sep 2025, Deep Time Series Models: A Comprehensive Survey and Benchmark, https://arxiv.org/abs/2407.13278
  • Jiahao Kuang, Nuowei Liu, Jie Wang, Changzhi Sun, Tao Ji, Yuanbin Wu, 28 Sep 2025, PDFBench: A Benchmark for De novo Protein Design from Function, https://arxiv.org/abs/2505.20346
  • Yang Du, Yuqi Liu, Qin Jin, 28 Sep 2025, Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval, https://arxiv.org/abs/2412.19178
  • Tiansheng Hu, Tongyan Hu, Liuyang Bai, Yilun Zhao, Arman Cohan, Chen Zhao, 17 Oct 2025, FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain, https://arxiv.org/abs/2510.15232
  • Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, Alan Yuille, 12 Feb 2025, Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models, https://arxiv.org/abs/2502.08636
  • Yao Huang, Yitong Sun, Yichi Zhang, Ruochen Zhang, Yinpeng Dong, Xingxing Wei, 17 Oct 2025, DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios, https://arxiv.org/abs/2510.15501
  • Tingyu Lin, Marco Peer, Florian Kleber, Robert Sablatnig, 17 Oct 2025, ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents, https://arxiv.org/abs/2510.15557
  • Dongjun Kim, Chanhee Park, Chanjun Park, Heuiseok Lim, 17 Oct 2025, KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models, https://arxiv.org/abs/2510.15558
  • Zhuohan Xie, Daniil Orel, Rushil Thareja, Dhruv Sahnan, Hachem Madmoun, Fan Zhang, Debopriyo Banerjee, Georgi Georgiev, Xueqing Peng, Lingfei Qian, Jimin Huang, Jinyan Su, Aaryamonvikram Singh, Rui Xing, Rania Elbadry, Chen Xu, Haonan Li, Fajri Koto, Ivan Koychev, Tanmoy Chakraborty, Yuxia Wang, Salem Lahlou, Veselin Stoyanov, Sophia Ananiadou, and Preslav Nakov, 17 Oct 2025, FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning, https://arxiv.org/abs/2506.02515
  • Su Kara, Fazle Faisal, Suman Nath, 28 Sep 2025, WAREX: Web Agent Reliability Evaluation on Existing Benchmarks, https://arxiv.org/abs/2510.03285
  • Pengfei He, Zhenwei Dai, Bing He, Hui Liu, Xianfeng Tang, Hanqing Lu, Juanhui Li, Jiayuan Ding, Subhabrata Mukherjee, Suhang Wang, Yue Xing, Jiliang Tang, Benoit Dumoulin, 6 Oct 2025, TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use, https://arxiv.org/abs/2510.04550
  • Ivo Petrov, Jasper Dekoninck, Martin Vechev, 6 Oct 2025, BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs, https://arxiv.org/abs/2510.04721
  • C. Coelho, M. Hohmann, D. Fernández, L. Penter, S. Ihlenfeldt, O. Niggemann, 26 Sep 2025, Data-Driven Temperature Modelling of Machine Tools by Neural Networks: A Benchmark, https://arxiv.org/abs/2510.03261
  • Ali Khairallah and Arkaitz Zubiaga, 3 Oct 2025, ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection, https://arxiv.org/abs/2510.03502
  • Zirui Wang, Jiajun Wu, Braden Teitge, Jessalyn Holodinsky, Steve Drew, 5 Oct 2025, Small Language Models for Emergency Departments Decision Support: A Benchmark Study, https://arxiv.org/abs/2510.04032
  • Taoyuze Lv, Alexander Chen, Fengyu Xie, Chu Wu, Jeffrey Meng, Dongzhan Zhou, Bram Hoex, Zhicheng Zhong, Tong Xie, 6 Oct 2025, AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials, https://arxiv.org/abs/2510.04704
  • Victor May, Diganta Misra, Yanqi Luo, Anjali Sridhar, Justine Gehring, Silvio Soares Ribeiro Junior, 6 Oct 2025, FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration, https://arxiv.org/abs/2510.04852
  • Yiqiao Chen, 4 Oct 2025, A Benchmark Study of Deep Learning Methods for Multi-Label Pediatric Electrocardiogram-Based Cardiovascular Disease Classification, https://arxiv.org/abs/2510.03780
  • Ayan Majumdar, Feihao Chen, Jinghui Li, Xiaozhen Wang, 6 Oct 2025, Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study, https://arxiv.org/abs/2510.04641
  • Chao Wen, Jacqueline Staub, Adish Singla, 6 Oct 2025, Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment, https://arxiv.org/abs/2406.11334
  • Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, Takuya Akiba, 6 Oct 2025, ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering, https://arxiv.org/abs/2506.09050
  • Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka, 6 Oct 2025, FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations, https://arxiv.org/abs/2507.07644
  • Xinkai Zou, Xuan Jiang, Ruikai Huang, Haoze He, Parv Kapoor, Hongrui Wu, Yibo Wang, Jian Sha, Xiongbo Shi, Zixun Huang, Jinhua Zhao, 3 Oct 2025, Towards Generalizable Context-aware Anomaly Detection: A Large-scale Benchmark in Cloud Environments, https://arxiv.org/abs/2508.01844
  • Naomi Fridman (Ariel University), Anat Goldstein (Ariel University), 4 Oct 2025, Transformer Classification of Breast Lesions: The BreastDCEDL_AMBL Benchmark Dataset and 0.92 AUC Baseline, https://arxiv.org/abs/2509.26440
  • Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, Yiqun Liu, 4 Oct 2025, SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation, https://arxiv.org/abs/2508.15658
  • Alhasan Abdellatif, Hannah P. Menke, Julien Maes, Ahmed H. Elsheikh and Florian Doster, 3 Oct 2025, Benchmark Dataset for Pore-Scale CO2-Water Interaction, https://arxiv.org/abs/2503.17592
  • Hyundong Jin, Joonghyuk Hahn, Yo-Sub Han, 10 Oct 2025, RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems, https://arxiv.org/abs/2510.09227
  • Xiangxu Zhang, Lei Li, Yanyun Zhou, Xiao Zhou, Yingying Zhang, Xian Wu, 10 Oct 2025, Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation, https://arxiv.org/abs/2510.09275
  • Yifei Dong, Fengyi Wu, Qi He, Zhi-Qi Cheng, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Alexander G Hauptmann, 9 Oct 2025, HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions, https://arxiv.org/abs/2503.14229
  • Benjamin Herdeanu, Juan Nathaniel, Carla Roesch, Jatan Buch, Gregor Ramien, Johannes Haux, Pierre Gentine, 10 Oct 2025, CausalDynamics: A large-scale benchmark for structural discovery of dynamical causal models, https://arxiv.org/abs/2505.16620
  • Yuangang Li, Jiaqi Li, Zhuo Xiao, Tiankai Yang, Yi Nian, Xiyang Hu, Yue Zhao, 9 Oct 2025, NLP-ADBench: NLP Anomaly Detection Benchmark, https://arxiv.org/abs/2412.04784
  • Pengyu Xu, Shijia Li, Ao Sun, Feng Zhang, Yahan Li, Bo Wu, Zhanyu Ma, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Rui Wang, Yang Liu, Xiaobo Hu, Fan Yang, Jia Zheng, Guanghua Yao, 24 Oct 2025, OutboundEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Outbound Evaluation of Xbench's Professional-Aligned Series, https://arxiv.org/abs/2510.21244
  • Gaku Morio, Harri Rowlands, Dominik Stammbach, Christopher D. Manning, Peter Henderson, 24 Oct 2025, A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection, https://arxiv.org/abs/2510.21679
  • Priyanshu Karmakar (1), Soumyabrata Chaudhuri (1), Shubhojit Mallick (2), Manish Gupta (2), Abhik Jana (1), Shreya Ghosh (1) ((1) School of Electrical and Computer Sciences, IIT Bhubaneswar, India, (2) Microsoft, India), 24 Oct 2025, TripTide: A Benchmark for Adaptive Travel Planning under Disruptions, https://arxiv.org/abs/2510.21329
  • Mojca Brglez and Špela Vintar, 24 Oct 2025, From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene, https://arxiv.org/abs/2510.21575
  • Jesse Haworth, Juo-Tung Chen, Nigel Nelson, Ji Woong Kim, Masoud Moghani, Chelsea Finn, Axel Krieger, 23 Oct 2025, SutureBot: A Precision Framework & Benchmark For Autonomous End-to-End Suturing, https://arxiv.org/abs/2510.20965
  • Sean McGregor, Victor Lu, Vassil Tashev, Armstrong Foundjem, Aishwarya Ramasethu, Sadegh AlMahdi Kazemi Zarkouei, Chris Knotz, Kongtao Chen, Alicia Parrish, Anka Reuel, Heather Frase, 24 Oct 2025, Risk Management for Mitigating Benchmark Failure Modes: BenchRisk, https://arxiv.org/abs/2510.21460
  • Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zhengyao Gu, Yuwei Han, Langzhou He, Yuyao Yang, Yinghui Li, Hai-Tao Zheng, Xue Liu, Irwin King, Philip S. Yu, 24 Oct 2025, RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback, https://arxiv.org/abs/2510.06186
  • Xinbang Dai, Huikang Hu, Yongrui Chen, Jiaqi Li, Rihui Jin, Yuyang Zhang, Xiaoguang Li, Lifeng Shang, Guilin Qi, 12 Oct 2025, ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding, https://arxiv.org/abs/2510.10549
  • Sanjari Srivastava, Gang Li, Cheng Chang, Rishu Garg, Manpreet Kaur, Charlene Y. Lee, Yuezhang Li, Yining Mao, Ignacio Cases, Yanan Xie, Peng Qi, 10 Oct 2025, WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions, https://arxiv.org/abs/2510.09872
  • Linfei Li, Fengyi Zhang, Zhong Wang, Lin Zhang, and Ying Shen, 11 Oct 2025, INR-Bench: A Unified Benchmark for Implicit Neural Representations in Multi-Domain Regression and Reconstruction, https://arxiv.org/abs/2510.10188
  • Zihan Chen, Yiming Zhang, Hengguang Zhou, Zenghui Ding, Yining Sun, Cho-Jui Hsieh, 12 Oct 2025, Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?, https://arxiv.org/abs/2510.10541
  • Wanshu Nie, Sujay V. Kumar, Junyu Chen, Long Zhao, Olya Skulovich, Jinwoong Yoo, Justin Pflug, Shahryar Khalique Ahmad, Goutam Konapala, 12 Oct 2025, Rethinking deep learning: linear regression remains a key benchmark in predicting terrestrial water storage, https://arxiv.org/abs/2510.10799
  • Mohammad Karami, Mostafa Jalali, Fatemeh Ghassemi, 13 Oct 2025, A Comprehensive Forecasting-Based Framework for Time Series Anomaly Detection: Benchmarking on the Numenta Anomaly Benchmark (NAB), https://arxiv.org/abs/2510.11141
  • Prasanna Mayilvahanan and Ricardo Dominguez-Olmedo and Thaddäus Wiedemer and Wieland Brendel, 13 Oct 2025, MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model, https://arxiv.org/abs/2510.11653
  • Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu, 12 Oct 2025, FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth, https://arxiv.org/abs/2510.10472
  • Xuanqi Gao, Siyi Xie, Juan Zhai, Shiqing Ma, Chao Shen, 12 Oct 2025, MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models, https://arxiv.org/abs/2505.16700
  • Zifeng Ding, Sikuan Yan, Zhangdie Yuan, Xianglong Hu, Fangru Lin, Andreas Vlachos, 13 Oct 2025, TCP: a Benchmark for Temporal Constraint-Based Planning, https://arxiv.org/abs/2505.19927
  • Zhipeng He, Chun Ouyang, Lijie Wen, Cong Liu, Catarina Moreira, 13 Oct 2025, TabAttackBench: A Benchmark for Adversarial Attacks on Tabular Data, https://arxiv.org/abs/2505.21027
  • Yichen Shi, Ze Zhang, Hongyang Wang, Zhuofu Tao, Zhongyi Li, Bingyu Chen, Yaxin Wang, Zhen Huang, Xuhua Liu, Quan Chen, Zhiping Yu, Ting-Jung Lin, Lei He, 13 Oct 2025, AMSbench: A Comprehensive Benchmark for Evaluating MLLM Capabilities in AMS Circuits, https://arxiv.org/abs/2505.24138
  • Yifang Zhang, Pengfei Duan, Henan Wang, Wenjie Yin, Chen Zhou, Shengwu Xiong, 13 Oct 2025, How Effective Are Time-Series Models for Rainfall Nowcasting? A Comprehensive Benchmark for Rainfall Nowcasting Incorporating PWV Data, https://arxiv.org/abs/2509.25263
  • Jakub Macina, Nico Daheim, Ido Hakimi, Manu Kapur, Iryna Gurevych, Mrinmaya Sachan, 12 Oct 2025, MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors, https://arxiv.org/abs/2502.18940
  • Trinh T.L. Vuong and Jin Tae Kwak, 13 Oct 2025, ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos, https://arxiv.org/abs/2505.04192
  • Xinyan Zhao, Yi-Ching Tang, Akshita Singh, Victor J Cantu, KwanHo An, Junseok Lee, Adam E Stogsdill, Ibraheem M Hamdi, Ashwin Kumar Ramesh, Zhiqiang An, Xiaoqian Jiang, Yejin Kim, 10 Oct 2025, AbBiBench: A Benchmark for Antibody Binding Affinity Maturation and Design, https://arxiv.org/abs/2506.04235
  • Shuangyan Deng, Haizhou Peng, Jiachen Xu, Rui Mao, Ciprian Doru Giurcăneanu, Jiamou Liu, 9 Oct 2025, FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning, https://arxiv.org/abs/2510.07852
  • Qin Liu, Jacob Dineen, Yuxi Huang, Sheng Zhang, Hoifung Poon, Ben Zhou, Muhao Chen, 9 Oct 2025, ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation, https://arxiv.org/abs/2510.08569
  • Julia Moska, Oleksii Furman, Kacper Kozaczko, Szymon Leszkiewicz, Jakub Polczyk, Piotr Gramacki and Piotr Szymański, 9 Oct 2025, OBSR: Open Benchmark for Spatial Representations, https://arxiv.org/abs/2510.05879
  • Shuo Yu (1), Mingyue Cheng (1), Qi Liu (1), Daoyu Wang (1), Jiqian Yang (1), Jie Ouyang (1), Yucong Luo (1), Chenyi Lei (2), Enhong Chen (1) ((1) State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China (2) Kuaishou Technology, Beijing, China), 9 Oct 2025, Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study, https://arxiv.org/abs/2409.13694
  • Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann, Siddharth Narayanan, Geemi P Wellawatte, Andrew White, Lorenzo Sani, Samuel G Rodriques, 8 Oct 2025, BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology, https://arxiv.org/abs/2503.00096
  • Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, Zhedong Zheng, Zhipeng Zhang, Yifan Wang, Lin Song, Lijun Wang, Yanwei Li, Ying Shan, Huchuan Lu, 23 Sep 2025, How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective, https://arxiv.org/abs/2509.18905
  • Shaoheng Wang, Yao Lu, Yuqi Li, Yaxin Gao, Jiaqi Nie, Shanqing Yu, Yingli Tian, Qi Xuan, 14 Sep 2025, LoRALib: A Standardized Benchmark for Evaluating LoRA-MoE Methods, https://arxiv.org/abs/2509.18137
  • Hongyi Luo, Qing Cheng, Daniel Matos, Hari Krishna Gadi, Yanfeng Zhang, Lu Liu, Yongliang Wang, Niclas Zeller, Daniel Cremers, Liqiu Meng, 17 Sep 2025, TurnBack: A Geospatial Route Cognition Benchmark for Large Language Models through Reverse Route, https://arxiv.org/abs/2509.18173
  • Chen Liang, Zhaoqi Huang, Haofen Wang, Fu Chai, Chunying Yu, Huanhuan Wei, Zhengjie Liu, Yanpeng Li, Hongjun Wang, Ruifeng Luo, Xianzhong Zhao, 23 Sep 2025, AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field, https://arxiv.org/abs/2509.18776
  • Haonan He, Yuchen Ren, Yining Tang, Ziyang Xu, Junxian Li, Minghao Yang, Di Zhang, Dong Yuan, Tao Chen, Shufei Zhang, Yuqiang Li, Nanqing Dong, Wanli Ouyang, Dongzhan Zhou, Peng Ye, 23 Sep 2025, Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models, https://arxiv.org/abs/2412.19191
  • Ziyuan Liu, Ruifei Zhu, Long Gao, Yuanxiu Zhou, Jingyu Ma, and Yuantao Gu, 23 Sep 2025, JL1-CD: A New Benchmark for Remote Sensing Change Detection and a Robust Multi-Teacher Knowledge Distillation Framework, https://arxiv.org/abs/2502.13407
  • Samuel Stockman, Daniel Lawson, Maximilian Werner, 23 Sep 2025, EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes, https://arxiv.org/abs/2410.08226
  • Brandon James Carone, Iran R. Roman, and Pablo Ripollés, 21 Oct 2025, The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs, https://arxiv.org/abs/2510.19055
  • Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai, 22 Oct 2025, MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration, https://arxiv.org/abs/2510.19423
  • Yiqian Yang, Tian Lan, Qianghuai Jia, Li Zhu, Hui Jiang, Hang Zhu, Longyue Wang, Weihua Luo, Kaifu Zhang, 22 Oct 2025, HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application, https://arxiv.org/abs/2510.19631
  • Fabian Schaipp, 22 Oct 2025, Optimization Benchmark for Diffusion Models on Dynamical Systems, https://arxiv.org/abs/2510.19376
  • Umar Butler, Abdur-Rahman Butler, Adrian Lucas Malec, 22 Oct 2025, The Massive Legal Embedding Benchmark (MLEB), https://arxiv.org/abs/2510.19365
  • Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Sebastian Otalora and Mauricio Reyes, 22 Oct 2025, XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography, https://arxiv.org/abs/2510.19599
  • Luis Wyss, Vincent Mallet, Wissam Karroucha, Karsten Borgwardt, Carlos Oliver, 22 Oct 2025, A Comprehensive Benchmark for RNA 3D Structure-Function Modeling, https://arxiv.org/abs/2503.21681
  • Justin Chavarria, Rohan Raizada, Justin White, Eyad Alhetairshi, 30 Sep 2025, SOCK: A Benchmark for Measuring Self-Replication in Large Language Models, https://arxiv.org/abs/2509.25643
  • Lujun Li, Lama Sleem, Yiqun Wang, Yangjie Xu, Niccolò Gentile, Radu State, 30 Sep 2025, How Far Do Time Series Foundation Models Paint the Landscape of Real-World Benchmarks?, https://arxiv.org/abs/2509.26347
  • Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu, 30 Sep 2025, SCUBA: Salesforce Computer Use Benchmark, https://arxiv.org/abs/2509.26506
  • Joshua Sebastian, Karma Tobden, KMA Solaiman, 30 Sep 2025, LLM-Assisted Emergency Triage Benchmark: Bridging Hospital-Rich and MCI-Like Field Simulation, https://arxiv.org/abs/2509.26351
  • Oleksandr Shchur, Abdul Fatir Ansari, Caner Turkmen, Lorenzo Stella, Nick Erickson, Pablo Guerron, Michael Bohlke-Schneider, Yuyang Wang, 30 Sep 2025, fev-bench: A Realistic Benchmark for Time Series Forecasting, https://arxiv.org/abs/2509.26468
  • Zejun Zhang, Jian Wang, Qingyun Yang, Yifan Pan, Yi Tang, Yi Li, Zhenchang Xing, Tian Zhang, Xuandong Li, Guoan Zhang, 26 Sep 2025, A Benchmark for Localizing Code and Non-Code Issues in Software Projects, https://arxiv.org/abs/2509.25242
  • Zhengpeng Shi, Hengli Li, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, Zilong Zheng, 30 Sep 2025, V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs, https://arxiv.org/abs/2509.25773
  • Jisu Shin, Hoyun Song, Juhyun Oh, Changgeon Ko, Eunsu Kim, Chani Jung, Alice Oh, 30 Sep 2025, RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity, https://arxiv.org/abs/2509.25897
  • Dominik Macko, Jakub Kopal, 30 Sep 2025, CEAID: Benchmark of Multilingual Machine-Generated Text Detection Methods for Central European Languages, https://arxiv.org/abs/2509.26051
  • Yida Xue, Mingjun Mao, Xiangyuan Ru, Yuqi Zhu, Baochang Ren, Shuofei Qiao, Mengru Wang, Shumin Deng, Xinyu An, Ningyu Zhang, Ying Chen, Huajun Chen, 30 Sep 2025, OceanGym: A Benchmark Environment for Underwater Embodied Agents, https://arxiv.org/abs/2509.26536
  • Wenda Xu, Sweta Agrawal, Vil\'em Zouhar, Markus Freitag, Daniel Deutsch, 30 Sep 2025, Deconstructing Self-Bias in LLM-generated Translation Benchmarks, https://arxiv.org/abs/2509.26600
  • Yingming Pu and Tao Lin and Hongyu Chen, 29 Sep 2025, Mechanisms of Matter: Language Inferential Benchmark on Physicochemical Hypothesis in Materials Synthesis, https://arxiv.org/abs/2509.25281
  • Yi-Cheng Lin, Yu-Hua Chen, Jia-Kai Dong, Yueh-Hsuan Huang, Szu-Chi Chen, Yu-Chen Chen, Chih-Yao Chen, Yu-Jung Lin, Yu-Ling Chen, Zih-Yu Chen, I-Ning Tsai, Hsiu-Hsuan Wang, Ho-Lam Chung, Ke-Han Lu, Hung-yi Lee, 30 Sep 2025, TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics, https://arxiv.org/abs/2509.26329
  • Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, Baishakhi Ray, 29 Sep 2025, Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation, https://arxiv.org/abs/2502.17521
  • Julius Mayer, Mohamad Ballout, Serwan Jassim, Farbod Nosrat Nezami, Elia Bruni, 30 Sep 2025, iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs, https://arxiv.org/abs/2502.03214
  • Dayyán O'Brien, Barry Haddow, Emily Allaway, Pinzhen Chen, 7 Oct 2025, MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization, https://arxiv.org/abs/2510.05962
  • Haining Pan, James V. Roggeveen, Erez Berg, Juan Carrasquilla, Debanjan Chowdhury, Surya Ganguli, Federico Ghimenti, Juraj Hasik, Henry Hunt, Hong-Chen Jiang, Mason Kamb, Ying-Jer Kao, Ehsan Khatami, Michael J. Lawler, Di Luo, Titus Neupert, Xiaoliang Qi, Michael P. Brenner, Eun-Ah Kim, 6 Oct 2025, CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers, https://arxiv.org/abs/2510.05228
  • João Palmeiro, Diogo Duarte, Rita Costa, Pedro Bizarro, 7 Oct 2025, Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks, https://arxiv.org/abs/2510.06071
  • Periklis Mantenoglou, Rishi Hazra, Pedro Zuidberg Dos Martires, Luc De Raedt, 7 Oct 2025, LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language, https://arxiv.org/abs/2510.05972
  • Deheng Zhang, Yuqian Fu, Runyi Yang, Yang Miao, Tianwen Qian, Xu Zheng, Guolei Sun, Ajad Chhatkuli, Xuanjing Huang, Yu-Gang Jiang, Luc Van Gool, and Danda Pani Paudel, 7 Oct 2025, EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark, https://arxiv.org/abs/2510.06218
  • Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, Vidhisha Balachandran, 7 Oct 2025, BenchAgents: Multi-Agent Systems for Structured Benchmark Creation, https://arxiv.org/abs/2410.22584
  • Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty, 16 Oct 2025, LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild, https://arxiv.org/abs/2510.14240
  • Xukai Wang, Xuanbo Liu, Mingrui Chen, Haitian Zhong, Xuanlin Yang, Bohan Zeng, Jinbo Hu, Hao Liang, Junbo Niu, Xuchen Li, Ruitao Wu, Ruichuan An, Yang Shi, Liu Liu, Xu-Yao Zhang, Qiang Liu, Zhouchen Lin, Wentao Zhang, Bin Dong, 16 Oct 2025, MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning, https://arxiv.org/abs/2510.14265
  • Peter Banyas, Shristi Sharma, Alistair Simmons, Atharva Vispute, 11 Oct 2025, ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups, https://arxiv.org/abs/2510.13852
  • Fabian Wenz and Omar Bouattour and Devin Yang and Justin Choi and Cecil Gregg and Nesime Tatbul and Çağatay Demiralp, 11 Oct 2025, BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation, https://arxiv.org/abs/2510.13853
  • Steffen Hagedorn, Luka Donkov, Aron Distelzweig, Alexandru P. Condurache, 16 Oct 2025, When Planners Meet Reality: How Learned, Reactive Traffic Agents Shift nuPlan Benchmarks, https://arxiv.org/abs/2510.14677
  • Yuxing Lu, Xukai Zhao, J. Ben Tamo, Micky C. Nnamdi, Rui Peng, Shuang Zeng, Xingyu Hu, Jinzhuo Wang, May D. Wang, 16 Oct 2025, MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics, https://arxiv.org/abs/2510.14944
  • Jae-Won Chung and Jeff J. Ma and Ruofan Wu and Jiachen Liu and Oh Jun Kweon and Yuxuan Xia and Zhiyu Wu and Mosharaf Chowdhury, 16 Oct 2025, The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization, https://arxiv.org/abs/2505.06371

Research on Model Evaluation

  • Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
  • Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, Andy Zou, 23 May 2024, Lessons from the Trenches on Reproducible Evaluation of Language Models, https://arxiv.org/abs/2405.14782 (Model evaluation theory and practice with the lm-eval test harness tool.)
  • Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
  • Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, Sandipan Dandapat, December 2023, Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing https://aclanthology.org/2023.emnlp-main.418/ PDF: https://aclanthology.org/2023.emnlp-main.418.pdf
  • Yifan Wei, Yisong Su, Huanhuan Ma, Xiaoyan Yu, Fangyu Lei, Yuanzhe Zhang, Jun Zhao, Kang Liu, 8 Oct 2023, MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models, https://arxiv.org/abs/2310.05157
  • George Cybenko, Joshua Ackerman, Paul Lintilhac, 16 Apr 2024, TEL'M: Test and Evaluation of Language Models, https://arxiv.org/abs/2404.10200
  • Gayathri Saranathan, Mahammad Parwez Alam, James Lim, Suparna Bhattacharya, Soon Yee Wong, Foltin Martin & Cong Xu, 2024, DELE: Data Efficient LLM Evaluation, Hewlett Packard Labs, Navigating and Addressing Data Problems for Foundation Models (DPFM) Workshop, ICLR 2024, https://openreview.net/pdf?id=I8bsxPWLNF
  • Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang, 17 Mar 2024 (v2), Compressing LLMs: The Truth is Rarely Pure and Never Simple, https://arxiv.org/abs/2310.01382 Code: https://github.com/VITA-Group/llm-kick (A set of tasks to evaluate LLMs.)
  • Aaditya Naik, Adam Stein, Yinjun Wu, Mayur Naik, Eric Wong, April 2024, TorchQL: A Programming Framework for Integrity Constraints in Machine Learning, Proc. ACM Program. Lang., Vol. 8, No. OOPSLA1, Article 124. PDF: https://dl.acm.org/doi/pdf/10.1145/3649841
  • Tal Peretz, 15 NOV 2023, The Developer's Guide to Production-Grade LLM Apps: Advanced Techniques for Maximizing LLM Performance, https://buildingaistuff.com/p/the-developers-guide-to-production
  • Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, Min Lin, 22 Jan 2024, Benchmarking Large Multimodal Models against Common Corruptions, https://arxiv.org/abs/2401.11943 Code: https://github.com/sail-sg/MMCBench
  • Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, Gordon Wetzstein, Jan 2024, GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation, https://arxiv.org/abs/2401.04092 Code: https://github.com/3DTopia/GPTEval3D Project: https://gpteval3d.github.io/
  • Lan Chu, Jan 2024, LLM Output — Evaluating, debugging, and interpreting, Towards AI, https://pub.towardsai.net/llm-output-evaluating-debugging-and-interpreting-f3bd29e7d14d
  • Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You, 3 Jun 2024, MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures, https://arxiv.org/abs/2406.06565
  • Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo, 9 Jun 2024, The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models, https://arxiv.org/abs/2406.05761 Code: https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench
  • Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi, 7 Jun 2024, WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild, https://arxiv.org/abs/2406.04770 Code: https://hf.co/spaces/allenai/WildBench
  • Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi, 13 Jun 2024, Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning, https://arxiv.org/abs/2406.09170
  • Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
  • Tianle Li, Wei-Lin Chiang, Lisa Dunlap, May 20, 2024, Introducing Hard Prompts Category in Chatbot Arena, https://lmsys.org/blog/2024-05-17-category-hard/
  • Louis Bouchard, Jun 24, 2024, LLM Evals: What, why, when and how, https://www.louisbouchard.ai/llm-evals/
  • Clémentine Fourrier, May 23, 2024 Let's talk about LLM evaluation, https://huggingface.co/blog/clefourrier/llm-evaluation
  • Jeffrey Ip, November 7, 2023, How to Evaluate LLM Applications: The Complete Guide, https://www.confident-ai.com/blog/how-to-evaluate-llm-applications
  • Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
  • Anirban Ghoshal, July 3, 2024, AWS approach to RAG evaluation could help enterprises reduce AI spending, https://www.infoworld.com/article/3715629/aws-new-approach-to-rag-evaluation-could-help-enterprises-reduce-ai-spending.html
  • Tianyi Tang, Yiwen Hu, Bingqian Li, Wenyang Luo, Zijing Qin, Haoxiang Sun, Jiapeng Wang, Shiyi Xu, Xiaoxue Cheng, Geyang Guo, Han Peng, Bowen Zheng, Yiru Tang, Yingqian Min, Yushuo Chen, Jie Chen, Yuanqian Zhao, Luran Ding, Yuhao Wang, Zican Dong, Chunxuan Xia, Junyi Li, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen, 8 Jul 2024, LLMBox: A Comprehensive Library for Large Language Models, https://arxiv.org/abs/2407.05563 Code: https://github.com/RUCAIBox/LLMBox
  • Jin Peng Zhou, Christian K. Belardi, Ruihan Wu, Travis Zhang, Carla P. Gomes, Wen Sun, Kilian Q. Weinberger, 8 Jul 2024, On Speeding Up Language Model Evaluation, https://arxiv.org/abs/2407.06172
  • HELM, July 2024 (accessed), A holistic framework for evaluating foundation models, Stanford University, https://crfm.stanford.edu/helm/lite/latest/
  • Juan Pablo Bottaro, April 25, 2024, Musings on building a Generative AI product, https://www.linkedin.com/blog/engineering/generative-ai/musings-on-building-a-generative-ai-product?_l=en_US
  • Angie Boggust, Venkatesh Sivaraman, Yannick Assogba, Donghao Ren, Dominik Moritz, Fred Hohman, 6 Aug 2024, Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments, https://arxiv.org/abs/2408.03274
  • Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
  • Andrew Ng, Sep 2024, X post, https://x.com/AndrewYNg/status/1829190549842321758 (Dropping token prices for LLMs means developers can focus on the app layer.)
  • Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
  • Lior Solomon, Sep 2024, Gen AI testing strategies and tools, https://medium.com/ai-in-grc/gen-ai-testing-strategies-and-tools-257383e5cbfb
  • Michael Nuñez, September 9, 2024, LightEval: Hugging Face’s open-source solution to AI’s accountability problem, https://venturebeat.com/ai/lighteval-hugging-faces-open-source-solution-to-ais-accountability-problem/
  • Michael Nuñez, September 13, 2024, Microsoft’s Windows Agent Arena: Teaching AI assistants to navigate your PC, https://venturebeat.com/ai/microsofts-windows-agent-arena-teaching-ai-assistants-to-navigate-your-pc/
  • Flow AI, Sep 2024, Flow Judge: An Open Small Language Model for LLM System Evaluations, https://www.flow-ai.com/blog/flow-judge
  • Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, Kate Olszewska, 20 Sep 2024 (v2), Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries, https://arxiv.org/abs/2409.12640 (Long context model evaluation dataset.)
  • Ou, Anthony C., Feb 2024, Large Language Model Routing with Benchmark Datasets, Master's Thesis, Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, https://dspace.mit.edu/handle/1721.1/153846
  • Cameron R. Wolfe, Ph.D., Dec 02, 2024, Finetuning LLM Judges for Evaluation: The Prometheus suite, JudgeLM, PandaLM, AutoJ, and more..., https://cameronrwolfe.substack.com/p/finetuned-judge
  • Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu, 10 Dec 2024 (v2), LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, https://arxiv.org/abs/2412.05579 https://github.com/CSHaitao/Awesome-LLMs-as-Judges
  • Liam Seymour, Basar Kutukcu, Sabur Baidya, 19 Dec 2024, Large Language Models on Small Resource-Constrained Systems: Performance Characterization, Analysis and Trade-offs, https://arxiv.org/abs/2412.15352 https://github.com/LiamS57/orin-llm-testing
  • Jie He, Nan Hu, Wanqiu Long, Jiaoyan Chen, Jeff Z. Pan, 22 Dec 2024, MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge, https://arxiv.org/abs/2412.17032 https://github.com/probe2/multi-hop/ (Model evaluation of reasoning abilities.)
  • Latent Space, Dec 28, 2024, The 2025 AI Engineering Reading List: 50 papers/models/blogs picked across 10 fields in AI Engineering (LLMs, Benchmarks, Prompting, RAG, Agents, CodeGen, Vision, Voice, Diffusion, Finetuning), recommended as a starting point for newcomers. https://www.latent.space/p/2025-papers
  • Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu, 24 Dec 2024, LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating, https://arxiv.org/abs/2412.18424
  • Y Li, H Jiang, Q Wu, X Luo, S Ahn, C Zhang, AH Abdi, Dec 2024, SharedContextBench: Evaluating Long-Context Methods in KV Cache Reuse, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://neurips2024-enlsp.github.io/papers/paper_93.pdf (Evaluating model performance with KV cache compression.)
  • Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
  • Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
  • Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
  • Lucas C. Cordeiro, Matthew L. Daggitt, Julien Girard-Satabin, Omri Isac, Taylor T. Johnson, Guy Katz, Ekaterina Komendantskaya, Augustin Lemesle, Edoardo Manino, Artjoms Šinkarovs, Haoze Wu, 10 Jan 2025, Neural Network Verification is a Programming Language Challenge, https://arxiv.org/abs/2501.05867
  • Dr. Marcel Müller, Jan 2025, Why Generative-AI Apps’ Quality Often Sucks and What to Do About It: How to get from PoCs to tested high-quality applications in production, https://towardsdatascience.com/why-generative-ai-apps-quality-often-sucks-and-what-to-do-about-it-f84407f263c3
  • Bharani Subramaniam, 13 February 2025, Emerging Patterns in Building GenAI Products, https://martinfowler.com/articles/gen-ai-patterns/
  • Nikhil, February 26, 2025, How to Compare Two LLMs in Terms of Performance: A Comprehensive Web Guide for Evaluating and Benchmarking Language Models, https://www.marktechpost.com/2025/02/26/how-to-compare-two-llms-in-terms-of-performance-a-comprehensive-web-guide-for-evaluating-and-benchmarking-language-models/
  • Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
  • Yansheng Qiu, Li Xiao, Zhaopan Xu, Pengfei Zhou, Zheng Wang, Kaipeng Zhang, 16 May 2025, Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans, https://arxiv.org/abs/2505.11141
  • Brandon Lepine, Gawesha Weerantunga, Juho Kim, Pamela Mishkin, Matthew Beane, 15 May 2025, Evaluations at Work: Measuring the Capabilities of GenAI in Use, https://arxiv.org/abs/2505.10742
  • Rachel Draelos, MD, PhD, May 14, 2025, HealthBench Does Not Evaluate Patient Safety, https://medium.com/data-science-collective/healthbench-does-not-evaluate-patient-safety-11eda5f0eeac
  • Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo, 29 May 2025, ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions, https://arxiv.org/abs/2505.23662 https://github.com/bwookwak/ToolHaystack
  • Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, Wenxuan Zhang, Lifu Huang, Muhao Chen, Lei Hou, Qianru Sun, Xingjun Ma, Zuxuan Wu, Min-Yen Kan, David Lo, Qi Zhang, Heng Ji, Jing Jiang, Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, Yu-Gang Jiang, 26 Apr 2025, Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks, https://arxiv.org/abs/2504.18838
  • Liyun Zhang, Jingcheng Ke, Shenli Fan, Xuanmeng Sha and Zheng Lian, 14 Aug 2025, A Unified Evaluation Framework for Multi-Annotator Tendency Learning, https://arxiv.org/abs/2508.10393
  • Brian Shing-Hei Wong, Joshua Mincheol Kim, Sin-Hang Fung, Qing Xiong, Kelvin Fu-Kiu Ao, Junkang Wei, Ran Wang, Dan Michelle Wang, Jingying Zhou, Bo Feng, Alfred Sze-Lok Cheng, Kevin Y. Yip, Stephen Kwok-Wing Tsui, Qin Cao, 14 Aug 2025, Driving Accurate Allergen Prediction with Protein Language Models and Generalization-Focused Evaluation, https://arxiv.org/abs/2508.10541
  • Xiao Fu, Hossein A. Rahmani, Bin Wu, Jerome Ramos, Emine Yilmaz, Aldo Lipani, 8 Aug 2025, PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs, https://arxiv.org/abs/2508.10028
  • Gal Amram, Eitan Farchi, Shmulik Froimovich, Raviv Gal, and Avi Ziv, 13 Aug 2025, LaajMeter: A Framework for LaaJ Evaluation, https://arxiv.org/abs/2508.10161
  • Jieyu Li, Xin Zhang, and Joey Tianyi Zhou, 14 Aug 2025, AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences, https://arxiv.org/abs/2508.10771
  • Yuzhuo Xiao, Zeyu Han, Yuhan Wang, Huaizu Jiang, 4 Aug 2025, XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs, https://arxiv.org/abs/2508.09999
  • Aditya Ashvin, Rimita Lahiri, Aditya Kommineni, Somer Bishop, Catherine Lord, Sudarsana Reddy Kadiri, Shrikanth Narayanan, 14 Aug 2025, Evaluation of Speech Foundation Models for ASR on Child-Adult Conversations in Autism Diagnostic Sessions, https://arxiv.org/abs/2409.16135
  • Zhe Chen, Daniel Harabor, Ryan Hechnenberger, Nathan R. Sturtevant, 23 Jul 2025, Online Submission and Evaluation System Design for Competition Operations, https://arxiv.org/abs/2507.17730
  • Yutong Liu, Cairong Zhao, Guosheng Hu, 23 Jul 2025, A Comprehensive Evaluation on Quantization Techniques for Large Language Models, https://arxiv.org/abs/2507.17417
  • Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing, 23 Jul 2025, MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs, https://arxiv.org/abs/2507.17476
  • Miguel Carrasco, César González-Martín, José Aranda, Luis Oliveros, 23 Jul 2025, Vision Transformer attention alignment with human visual perception in aesthetic object evaluation, https://arxiv.org/abs/2507.17616
  • Karen Zhou, John Giorgi, Pranav Mani, Peng Xu, Davis Liang, Chenhao Tan, 23 Jul 2025, From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes, https://arxiv.org/abs/2507.17717
  • Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian, 23 Jul 2025, Fairness Evaluation of Large Language Models in Academic Library Reference Services, https://arxiv.org/abs/2507.04224
  • Roman Mayr, Michel Schimpf, Thomas Bohné, 22 Jul 2025, ChatChecker: A Framework for Dialogue System Testing and Evaluation Through Non-cooperative User Simulation, https://arxiv.org/abs/2507.16792
  • Abhash Kumar Jha, Shakiba Moradian, Arjun Krishnakumar, Martin Rapp, Frank Hutter, 22 Jul 2025, confopt: A Library for Implementation and Evaluation of Gradient-based One-Shot NAS Methods, https://arxiv.org/abs/2507.16533
  • Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao, 22 Jul 2025, RAVine: Reality-Aligned Evaluation for Agentic Search, https://arxiv.org/abs/2507.16725
  • Danil Gusak, Anna Volodkevich, Anton Klenitskiy, Alexey Vasilev, Evgeny Frolov, 22 Jul 2025, Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders, https://arxiv.org/abs/2507.16289
  • Jakub Michańków, Paweł Sakowski, Robert Ślepaczuk, 22 Jul 2025, Alternative Loss Function in Evaluation of Transformer Models, https://arxiv.org/abs/2507.16548
  • Bruno Deprez, Toon Vanderschueren, Bart Baesens, Tim Verdonck, Wouter Verbeke, 22 Jul 2025, Network Analytics for Anti-Money Laundering -- A Systematic Literature Review and Experimental Evaluation, https://arxiv.org/abs/2405.19383
  • Xiaoxu Guo, Siyan Liang, Yachao Cui, Juxiang Zhou, Lei Wang, Han Cao, 21 Jul 2025, Multimodal Fine-grained Reasoning for Post Quality Evaluation, https://arxiv.org/abs/2507.17934
  • Rodrigo Moreira and Larissa F. Rodrigues Moreira and Fl\'avio de Oliveira Silva, 23 Jul 2025, Performance Evaluation and Threat Mitigation in Large-scale 5G Core Deployment, https://arxiv.org/abs/2507.17850
  • Maria Vlachou, 24 Jul 2025, Fashion-AlterEval: A Dataset for Improved Evaluation of Conversational Recommendation Systems with Alternative Relevant Items, https://arxiv.org/abs/2507.18017
  • Yefeng Yuan, Yuhong Liu, Liang Cheng, 24 Jul 2025, A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models, https://arxiv.org/abs/2404.14445
  • Gregor Baer, Isel Grau, Chao Zhang, Pieter Van Gorp, 24 Jul 2025, Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation, https://arxiv.org/abs/2506.11790
  • Niket Patel, Randall Balestriero, 23 Jul 2025, Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks, https://arxiv.org/abs/2507.09871
  • Ashray Gupta and Rohan Joseph and Sunny Rai, 23 Jul 2025, Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation, https://arxiv.org/abs/2507.13238
  • Masaki Adachi, Masahiro Fujisawa, Michael A Osborne, 24 Jul 2025, Fixing the Pitfalls of Probabilistic Time-Series Forecasting Evaluation by Kernel Quadrature, https://arxiv.org/abs/2503.06079
  • Gerben van der Hoek, Johan Jeuring and Rogier Bos, 18 Jul 2025, Buggy rule diagnosis for combined steps through final answer evaluation in stepwise tasks, https://arxiv.org/abs/2507.13651
  • Viraj Nishesh Darji, Callie C. Liao, Duoduo Liao, 18 Jul 2025, Automated Interpretation of Non-Destructive Evaluation Contour Maps Using Large Language Models for Bridge Condition Assessment, https://arxiv.org/abs/2507.14107
  • Yudai Hayashi, Shuhei Goda, Yuta Saito, 18 Jul 2025, Off-Policy Evaluation and Learning for Matching Markets, https://arxiv.org/abs/2507.13608
  • Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang, 17 Jul 2025, "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models, https://arxiv.org/abs/2507.13428
  • Steven Lamp, Jason D. Hiser, Anh Nguyen-Tuong, Jack W. Davidson, 17 Jul 2025, PHASE: Passive Human Activity Simulation Evaluation, https://arxiv.org/abs/2507.13505
  • Yuan Gao, Mattia Piccinini, Korbinian Moller, Amr Alanwar, Johannes Betz, 18 Jul 2025, From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios, https://arxiv.org/abs/2502.02145
  • Daniel Commey, Benjamin Appiah, Griffith S. Klogo, and Garth V. Crosby, 18 Jul 2025, ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs, https://arxiv.org/abs/2507.11649
  • Mohita Chowdhury, Yajie Vera He, Jared Joselowitz, Aisling Higham, Ernest Lim, 18 Jul 2025, ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems, https://arxiv.org/abs/2501.08208
  • Dawar Khan and Xinyu Liu and Omar Mena and Donggang Jia and Alexandre Kouyoumdjian and Ivan Viola, 18 Jul 2025, AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results, https://arxiv.org/abs/2502.15761
  • Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, Jinsik Lee, 18 Jul 2025, From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation, https://arxiv.org/abs/2507.08924
  • Shingo Ayabe, Takuto Otomo, Hiroshi Kera, Kazuhiko Kawamoto, 18 Jul 2025, Robustness Evaluation of Offline Reinforcement Learning for Robot Control Against Action Perturbations, https://arxiv.org/abs/2412.18781
  • Anna Sofia Lippolis, Mohammad Javad Saeedizade, Robin Keskisärkkä, Aldo Gangemi, Eva Blomqvist, Andrea Giovanni Nuzzolese, 19 Jul 2025, Large Language Models Assisting Ontology Evaluation, https://arxiv.org/abs/2507.14552
  • Qianchao Wang, Yuxuan Ding, Chuanzhen Jia, Zhe Li, Yaping Du, 21 Jul 2025, Explainable Artificial Intelligence based Soft Evaluation Indicator for Arc Fault Diagnosis, https://arxiv.org/abs/2507.15239
  • Firdaus Ahmed Choudhury, Ethan Leicht, Jude Ethan Bislig, Hangzhi Guo, Amulya Yadav, 20 Jul 2025, Designing User-Centric Metrics for Evaluation of Counterfactual Explanations, https://arxiv.org/abs/2507.15162
  • Amina Dzafic, Merve Kavut, Ulya Bayram, 19 Jul 2025, Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation, https://arxiv.org/abs/2507.14693
  • Chenlei Gong, Yuanhe Tian, Lei Mao, Yan Song, 20 Jul 2025, Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling, https://arxiv.org/abs/2507.15087
  • Devichand Budagam, Ashutosh Kumar, Mahsa Khoshnoodi, Sankalp KJ, Vinija Jain, Aman Chadha, 21 Jul 2025, Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles, https://arxiv.org/abs/2406.12644
  • Peilong Wang, Jason Holmes, Zhengliang Liu, Dequan Chen, Tianming Liu, Jiajian Shen, Wei Liu, 18 Jul 2025, A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options, https://arxiv.org/abs/2412.10622
  • Wanke Xia, Ruoxin Peng, Haoqi Chu, Xinlei Zhu, Zhiyu Yang, Lili Yang, 21 Jul 2025, An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice, https://arxiv.org/abs/2502.13764
  • María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico, 19 Jul 2025, MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation, https://arxiv.org/abs/2502.17163
  • Felix H\"arer, 19 Jul 2025, Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications, https://arxiv.org/abs/2506.10467
  • Zhijin He, Alan B. McMillan, 21 Jul 2025, Comparative Evaluation of Radiomics and Deep Learning Models for Disease Detection in Chest Radiography, https://arxiv.org/abs/2504.12249
  • Pengfei Zhou, Xiaopeng Peng, Fanrui Zhang, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Zekai Li, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, and Kaipeng Zhang, 9 Aug 2025, MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams, https://arxiv.org/abs/2508.06851
  • Jiawei Zhang, Yifei Zhang, Baozhao Yi, Yao Ren, Qi Jiao, Hanyu Bai, Weiran Jiang, Ziyou Song, 9 Aug 2025, Discovery Learning accelerates battery design evaluation, https://arxiv.org/abs/2508.06985
  • Lin-Han Jia, Si-Yu Han, Wen-Chao Hu, Jie-Jing Shao, Wen-Da Wei, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li, 10 Aug 2025, When Is Prior Knowledge Helpful? Exploring the Evaluation and Selection of Unsupervised Pretext Tasks from a Neuro-Symbolic Perspective, https://arxiv.org/abs/2508.07299
  • Gregory Schuit, Denis Parra, Cecilia Besa, 10 Aug 2025, Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays, https://arxiv.org/abs/2508.07128
  • Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu, 10 Aug 2025, Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks, https://arxiv.org/abs/2508.07179
  • Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, and Masashi Sugiyama, 3 Aug 2025, What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?, https://arxiv.org/abs/2508.06530
  • Bruno L. Pereira, Alan Said, Rodrygo L. T. Santos, 11 Aug 2025, On the Reliability of Sampling Strategies in Offline Recommender Evaluation, https://arxiv.org/abs/2508.05398
  • Xiaohua Feng,Jiaming Zhang,Fengyuan Yu,Chengye Wang,Li Zhang,Kaixiang Li,Yuyuan Li,Chaochao Chen,Jianwei Yin, 26 Jul 2025, A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction, https://arxiv.org/abs/2507.19894
  • Aishwarya Mandyam, Jason Meng, Ge Gao, Jiankai Sun, Mac Schwager, Barbara E. Engelhardt, Emma Brunskill, 26 Jul 2025, PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data, https://arxiv.org/abs/2507.20068
  • Minju Kim, Dongje Yoo, Yeonjun Hwang, Minseok Kang, Namyoung Kim, Minju Gwak, Beong-woo Kwak, Hyungjoo Chae, Harim Kim, Yunjoong Lee, Min Hee Kim, Dayi Jung, Kyong-Mee Chung, Jinyoung Yeo, 25 Jul 2025, Can You Share Your Story? Modeling Clients' Metacognition and Openness for LLM Therapist Evaluation, https://arxiv.org/abs/2507.19643
  • Matin Aghaei, Mohammad Ali Alomrani, Yingxue Zhang, Mahdi Biparva, 26 Jul 2025, When Engineering Outruns Intelligence: A Re-evaluation of Instruction-Guided Navigation, https://arxiv.org/abs/2507.20021
  • Harsh Purohit, Tomoya Nishida, Kota Dohi, Takashi Endo, and Yohei Kawaguchi, 28 Jul 2025, MIMII-Agent: Leveraging LLMs with Function Calling for Relative Evaluation of Anomalous Sound Detection, https://arxiv.org/abs/2507.20666
  • Yonghyun Kim, Wayne Chi, Anastasios N. Angelopoulos, Wei-Lin Chiang, Koichi Saito, Shinji Watanabe, Yuki Mitsufuji, Chris Donahue, 28 Jul 2025, Music Arena: Live Evaluation for Text-to-Music, https://arxiv.org/abs/2507.20900
  • Adrien Bazoge, 28 Jul 2025, MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation, https://arxiv.org/abs/2507.20917
  • Khalid Hasan, Jamil Saquer and Mukulika Ghosh, 17 Jul 2025, Advancing Mental Disorder Detection: A Comparative Evaluation of Transformer and LSTM Architectures on Social Media, https://arxiv.org/abs/2507.19511
  • Hugo Retief, Kayathri, Vigneswaran, Surajit Ghosh, Mariangel Garcia Andarcia, Chris Dickens, 28 Jul 2025, Satellite-Surface-Area Machine-Learning Models for Reservoir Storage Estimation: Regime-Sensitive Evaluation and Operational Deployment at Loskop Dam, South Africa, https://arxiv.org/abs/2502.19989
  • Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Jin-Ge Yao, Bowen Qin, Richeng Xuan, Xi Yang, 28 Jul 2025, FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation, https://arxiv.org/abs/2506.09081
  • Afonso Martini Spezia, Mariana Recamonde-Mendoza, 30 Jul 2025, Comparing Cluster-Based Cross-Validation Strategies for Machine Learning Model Evaluation, https://arxiv.org/abs/2507.22299
  • Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje, 4 Aug 2025, DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA, https://arxiv.org/abs/2412.05430
  • Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, Kuntae Kim, 4 Aug 2025, Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons, https://arxiv.org/abs/2411.01281
  • Arthur Cho, 4 Aug 2025, GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics, https://arxiv.org/abs/2508.02926
  • Roshita Bhonsle, Rishav Dutta, Sneha Vavilapalli, Harsh Seth, Abubakarr Jaye, Yapei Chang, Mukund Rungta, Emmanuel Aboah Boateng, Sadid Hasan, Ehi Nosakhare, Soundar Srinivasan, 7 Aug 2025, Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation, https://arxiv.org/abs/2508.05508
  • Dewi S. W. Gould, Bruno Mlodozeniec, Samuel F. Brown, 8 Aug 2025, SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges, https://arxiv.org/abs/2508.06111
  • Yuchen Tian, Kaixin Li, Hao Chen, Ziyang Luo, Hongzhan Lin, Sebastian Schelter, Lun Du, Jing Ma, 13 Aug 2025, AmbiGraph-Eval: Can LLMs Effectively Handle Ambiguous Graph Queries?, https://arxiv.org/abs/2508.09631
  • Seungju Yoo, Hyuk Kwon, Joong-Won Hwang, Kibok Lee, 16 Aug 2025, Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability, https://arxiv.org/abs/2508.12082
  • David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge, 18 Aug 2025, Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation, https://arxiv.org/abs/2508.13144
  • Jun Li, Aaron Aguirre, Junior Moura, Che Liu, Lanhai Zhong, Chenxi Sun, Gari Clifford, Brandon Westover, Shenda Hong, 4 Aug 2025, An Electrocardiogram Foundation Model Built on over 10 Million Recordings with External Evaluation across Multiple Domains, https://arxiv.org/abs/2410.04133
  • Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, Tom Kocmi, 29 Jul 2025, Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation, https://arxiv.org/abs/2504.11829
  • Mahmoud Mohammadi, Yipeng Li, Jane Lo, Wendy Yip, 29 Jul 2025, Evaluation and Benchmarking of LLM Agents: A Survey, https://arxiv.org/abs/2507.21504
  • Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Kai Chen, Xiaofeng Wang, Baosheng Wang, 31 Jul 2025, LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models, https://arxiv.org/abs/2507.22359
  • Ke Miao, Yuke Hu, Xiaochen Li, Wenjie Bao, Zhihao Liu, Zhan Qin, Kui Ren, 2 Aug 2025, Towards Evaluation for Real-World LLM Unlearning, https://arxiv.org/abs/2508.01324
  • Jungkoo Kang, 3 Aug 2025, Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation, https://arxiv.org/abs/2507.02253
  • Jialin Li, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu, 5 Aug 2025, Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework, https://arxiv.org/abs/2508.03622
  • Aditya Pathak, Rachit Gandhi, Vaibhav Uttam, Arnav Ramamoorthy, Pratyush Ghosh, Aaryan Raj Jindal, Shreyash Verma, Aditya Mittal, Aashna Ased, Chirag Khatri, Yashwanth Nakka, Devansh, Jagat Sesh Challa, Dhruv Kumar, 6 Aug 2025, Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics, https://arxiv.org/abs/2503.23989
  • Zachary Robertson, Sanmi Koyejo, 7 Aug 2025, Let's Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes, https://arxiv.org/abs/2508.05469
  • Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, Weinan Zhang, 3 Aug 2025, A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges, https://arxiv.org/abs/2508.05668
  • Tuhina Tripathi, Manya Wadhwa, Greg Durrett, Scott Niekum, 21 Aug 2025, Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation, https://arxiv.org/abs/2504.14716
  • Mingyang Li, Viktor Schlegel, Tingting Mu, Wuraola Oyewusi, Kai Kang, Goran Nenadic, 22 Aug 2025, Evaluation and LLM-Guided Learning of ICD Coding Rationales, https://arxiv.org/abs/2508.16777
  • Patricia Paskov, Michael J. Byun, Kevin Wei, Toby Webster, 22 Jul 2025, Preliminary suggestions for rigorous GPAI model evaluations, https://arxiv.org/abs/2508.00875
  • Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sepehr Karimi, Sina Rashidi, Ali Zolnour, Maryam Dadkhah, Yasaman Haghbin, Hossein AzadMaleki, Maryam Zolnoori, 24 Aug 2025, Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies, https://arxiv.org/abs/2509.03525
  • Han Wang, Alex Whitworth, Pak Ming Cheung, Zhenjie Zhang, Krishna Kamath, 3 Sep 2025, LLM-based Relevance Assessment for Web-Scale Search Evaluation at Pinterest, https://arxiv.org/abs/2509.03764
  • Seganrasan Subramanian, Abhigya Verma, 4 Sep 2025, Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation, https://arxiv.org/abs/2509.01185
  • Sasha Mitts, 4 Sep 2025, An Approach to Grounding AI Model Evaluations in Human-derived Criteria, https://arxiv.org/abs/2509.04676
  • Brennen Hill, 4 Sep 2025, Scaling Environments for Organoid Intelligence with LLM-Automated Design and Plasticity-Based Evaluation, https://arxiv.org/abs/2509.04633
  • Yan Wang, Xinyi Hou, Yanjie Zhao, Weiguo Lin, Haoyu Wang, Junjun Si, 26 Aug 2025, LaQual: A Novel Framework for Automated Evaluation of LLM App Quality, https://arxiv.org/abs/2508.18636
  • Courtney Ford and Mark T. Keane, 26 Aug 2025, Feature-Guided Neighbor Selection for Non-Expert Evaluation of Model Predictions, https://arxiv.org/abs/2507.06029
  • Jessica Lundin, Guillaume Chabot-Couture, 28 Aug 2025, A Graph-Based Test-Harness for LLM Evaluation, https://arxiv.org/abs/2508.20810
  • Daryna Oliynyk, Rudolf Mayer, Kathrin Grosse, Andreas Rauber, 29 Aug 2025, I Stolenly Swear That I Am Up to (No) Good: Design and Evaluation of Model Stealing Attacks, https://arxiv.org/abs/2508.21654
  • Haichen Hu, David Simchi-Levi, 2 Sep 2025, Wild Refitting for Model-Free Excess Risk Evaluation of Opaque ML/AI Models under Bregman Loss, https://arxiv.org/abs/2509.02476
  • S.R. Eshwar, 7 Sep 2025, Teaching Precommitted Agents: Model-Free Policy Evaluation and Control in Quasi-Hyperbolic Discounted MDPs, https://arxiv.org/abs/2509.06094
  • Sam Davidson, Li Sun, Bhavana Bhasker, Laurent Callot, Anoop Deoras, 21 Aug 2025, Multi-IaC-Eval: Benchmarking Cloud Infrastructure as Code Across Multiple Formats, https://arxiv.org/abs/2509.05303
  • Yichi Zhang, Alexander Belloni, Ethan X. Fang, Junwei Lu, Xiaoan Xu, 6 Sep 2025, Fisher Random Walk: Automatic Debiasing Contextual Preference Inference for Large Language Model Evaluation, https://arxiv.org/abs/2509.05852
  • Manit Baser, Dinil Mon Divakaran, Mohan Gurusamy, 6 Sep 2025, ThinkEval: Practical Evaluation of Knowledge Leakage in LLM Editing using Thought-based Knowledge Graphs, https://arxiv.org/abs/2506.01386
  • Zhiyin Tan, Jennifer D'Souza, 8 Sep 2025, Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models, https://arxiv.org/abs/2509.07142
  • Edouard Lansiaux, Ramy Azzouz, Emmanuel Chazard, Amélie Vromant, Eric Wiel, 11 Sep 2025, Development and Comparative Evaluation of Three Artificial Intelligence Models (NLP, LLM, JEPA) for Predicting Triage in Emergency Departments: A 7-Month Retrospective Proof-of-Concept, https://arxiv.org/abs/2507.01080
  • Kanato Masayoshi, Masahiro Hashimoto, Ryoichi Yokoyama, Naoki Toda, Yoshifumi Uwamino, Shogo Fukuda, Ho Namkoong, Masahiro Jinzaki, 19 Sep 2025, EHR-MCP: Real-world Evaluation of Clinical Information Retrieval by Large Language Models via Model Context Protocol, https://arxiv.org/abs/2509.15957
  • Fangyi Yu, Nabeel Seedat, Dasha Herrmannova, Frank Schilder, Jonathan Richard Schwarz, 19 Sep 2025, Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses, https://arxiv.org/abs/2509.16093
  • Seyoung Song, Seogyeong Jeong, Eunsu Kim, Jiho Jin, Dongkwan Kim, Jay Shin, Alice Oh, 19 Sep 2025, MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language, https://arxiv.org/abs/2505.14395
  • Yiyang Li, Yonghuang Wu, Ying Luo, Liangtai Sun, Zishu Qin, Lin Qiu, Xuezhi Cao, Xunliang Cai, 16 Sep 2025, Instance-level Randomization: Toward More Stable LLM Evaluations, https://arxiv.org/abs/2509.12678
  • Yongmin Yoo, Qiongkai Xu, Longbing Cao, 16 Sep 2025, PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims, https://arxiv.org/abs/2505.19345
  • Ziang Li, Manasi Ganti, Zixian Ma, Helena Vasconcelos, Qijia He, Ranjay Krishna, 14 Sep 2025, Rethinking Human Preference Evaluation of LLM Rationales, https://arxiv.org/abs/2509.11026
  • Argimiro Arratia, Alejandra Cabaña, Ernesto Mordecki, Gerard Rovira-Parra, 15 Sep 2025, The Morgan-Pitman Test of Equality of Variances and its Application to Machine Learning Model Evaluation and Selection, https://arxiv.org/abs/2509.12185
  • Sidharth Surapaneni, Hoang Nguyen, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Akshay Kalkunte, Sai Rajeswar, Sathwik Tejaswi Madhusudhan, 9 Sep 2025, LALM-Eval: An Open-Source Toolkit for Holistic Evaluation of Large Audio Language Models, https://arxiv.org/abs/2509.08031
  • Alejandro Andrade-Lotero, Lee Becker, Joshua Southerland and Scott Hellman, 10 Sep 2025, Toward Subtrait-Level Model Explainability in Automated Writing Evaluation, https://arxiv.org/abs/2509.08345
  • Hossein Siadati, Haadi Jafarian, Sima Jafarikhah, 10 Sep 2025, Send to which account? Evaluation of an LLM-based Scambaiting System, https://arxiv.org/abs/2509.08493
  • Yongye Su, Zeya Zhang, Jane Kou, Cheng Ju, Shubhojeet Sarkar, Yamin Wang, Ji Liu, Shengbo Guo, 17 Sep 2025, Modernizing Facebook Scoped Search: Keyword and Embedding Hybrid Retrieval with LLM Evaluation, https://arxiv.org/abs/2509.13603
  • Renan Souza, Timothy Poteet, Brian Etz, Daniel Rosendo, Amal Gueroudji, Woong Shin, Prasanna Balaprakash, Rafael Ferreira da Silva, 17 Sep 2025, LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology, https://arxiv.org/abs/2509.13978
  • Zarreen Reza, 1 Oct 2025, The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation, https://arxiv.org/abs/2510.01295
  • Manuel Cebrian, Tomomi Kito, Raul Castro Fernandez, 30 Sep 2025, Emergent evaluation hubs in a decentralizing large language model ecosystem, https://arxiv.org/abs/2510.01286
  • Shuyang Hou, Haoyue Jiao, Ziqi Liu, Lutong Xie, Guanyu Chen, Shaowen Wu, Xuefeng Guan, Huayi Wu, 2 Oct 2025, GeoSQL-Eval: First Evaluation of LLMs on PostGIS-Based NL2GeoSQL Queries, https://arxiv.org/abs/2509.25264
  • Ali Mekky, Omar El Herraoui, Preslav Nakov, Yuxia Wang, 14 Oct 2025, HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment, https://arxiv.org/abs/2510.12217
  • Patrizio Migliarini, Mashal Afzal Memon, Marco Autili, Paola Inverardi, 1 Oct 2025, Advancing Automated Ethical Profiling in SE: a Zero-Shot Evaluation of LLM Reasoning, https://arxiv.org/abs/2510.00881
  • Sujeong Lee, Hayoung Lee, Seongsoo Heo, Wonik Choi, 1 Oct 2025, Integrated Framework for LLM Evaluation with Answer Generation, https://arxiv.org/abs/2509.20097
  • William Walden, Marc Mason, Orion Weller, Laura Dietz, Hannah Recknor, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, James Mayfield, Eugene Yang, 1 Oct 2025, Auto-ARGUE: LLM-Based Report Generation Evaluation, https://arxiv.org/abs/2509.26184
  • Angelina Wang and Daniel E. Ho and Sanmi Koyejo, 18 Sep 2025, The Inadequacy of Offline LLM Evaluations: A Need to Account for Personalization in Model Behavior, https://arxiv.org/abs/2509.19364
  • Wei-Hsiang Lin, Sheng-Lun Wei, Hen-Hsen Huang, Hsin-Hsi Chen, 24 Sep 2025, Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation, https://arxiv.org/abs/2509.19880
  • Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, Junaid Qadir, 28 Oct 2025, Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content, https://arxiv.org/abs/2510.24438
  • Qiuli Wang, Jie Chen, Yongxu Liu, Xingpeng Zhang, Xiaoming Li, Wei Chen, 28 Oct 2025, From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports, https://arxiv.org/abs/2510.23008
  • Tatsuki Kawakami, Kazuki Egashira, Atsuyuki Miyai, Go Irie and Kiyoharu Aizawa, 28 Oct 2025, PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning, https://arxiv.org/abs/2507.01271
  • Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li, 23 Oct 2025, RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration, https://arxiv.org/abs/2509.25271
  • Dou Liu, Ying Long, Sophia Zuoqiu, Di Liu, Kang Li, Yiting Lin, Hanyi Liu, Rong Yin, Tian Tang, 17 Oct 2025, Reliability of Large Language Model Generated Clinical Reasoning in Assisted Reproductive Technology: Blinded Comparative Evaluation Study, https://arxiv.org/abs/2510.16095
  • Melik Ozolcer, Sang Won Bae, 20 Oct 2025, Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users, https://arxiv.org/abs/2510.17173
  • Dania Refai and Moataz Ahmed, 19 Oct 2025, Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation, https://arxiv.org/abs/2510.16943
  • Sheikh Jubair, Arwa Omayrah, Amal Alshammari, Alhanoof Althnian, Abdulhamed Alothaimen, Norah A. Alzahrani, Shahad D. Alzaidi, Nora Al-Twairesh, Abdulmohsen Al-Thubaity, 19 Oct 2025, LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding, https://arxiv.org/abs/2510.16783
  • Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam, 19 Oct 2025, AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents, https://arxiv.org/abs/2506.00641
  • Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, 22 Sep 2025, Privacy in Action: Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents, https://arxiv.org/abs/2509.17488
  • Xuyang Wu, Jinming Nian, Ting-Ruen Wei, Zhiqiang Tao, Hsin-Tai Wu, Yi Fang, 20 Sep 2025, Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning, https://arxiv.org/abs/2502.15361
  • Chenyu Zhang, Tairen Zhang, Lanjun Wang, Ruidong Chen, Wenhui Li, Anan Liu, 25 Oct 2025, T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model, https://arxiv.org/abs/2510.22300
  • Dario Loi, Elena Maria Muià, Federico Siciliano, Giovanni Trappolini, Vincenzo Crisà, Peter Kruger, Fabrizio Silvestri, 26 Oct 2025, AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment, https://arxiv.org/abs/2510.22593
  • Madhur Jindal, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat, 27 Oct 2025, SAGE: A Generic Framework for LLM Safety Evaluation, https://arxiv.org/abs/2504.19674
  • Simon Sinong Zhan, Yao Liu, Philip Wang, Zinan Wang, Qineng Wang, Zhian Ruan, Xiangyu Shi, Xinyu Cao, Frank Yang, Kangrui Wang, Huajie Shao, Manling Li, Qi Zhu, 14 Oct 2025, SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of LLM-based Embodied Agents, https://arxiv.org/abs/2510.12985
  • Soheil Hashtarkhani, Rezaur Rashid, Christopher L Brett, Lokesh Chinthala, Fekede Asefa Kumsa, Janet A Zink, Robert L Davis, David L Schwartz, Arash Shaban-Nejad, 8 Oct 2025, Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study, https://arxiv.org/abs/2510.12813
  • Ke Wang, Houxing Ren, Zimu Lu, Mingjie Zhan, Hongsheng Li, 26 Sep 2025, VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing, https://arxiv.org/abs/2509.22651
  • Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Zhenxin Huang, Shengjie Ma, Yinghan Shen, Jian Guo, Yuanzhuo Wang, 26 Sep 2025, JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer, https://arxiv.org/abs/2509.02097
  • Yong Oh Lee, Byeonghun Bang, Sejun Oh, 4 Oct 2025, LLM-Driven Rubric-Based Assessment of Algebraic Competence in Multi-Stage Block Coding Tasks with Design and Field Evaluation, https://arxiv.org/abs/2510.06253
  • Joseph Enguehard, Morgane Van Ermengem, Kate Atkinson, Sujeong Cha, Arijit Ghosh Chowdhury, Prashanth Kallur Ramaswamy, Jeremy Roghair, Hannah R Marlowe, Carina Suzana Negreanu, Kitty Boxall, Diana Mincu, 8 Oct 2025, LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation, https://arxiv.org/abs/2510.07243
  • Chenyue Zhou, Gürkan Solmaz, Flavio Cirillo, Kiril Gashteovski, Jonathan Fürst, 8 Oct 2025, TextMine: Data, Evaluation Framework and Ontology-guided LLM Pipeline for Humanitarian Mine Action, https://arxiv.org/abs/2509.15098
  • Hadi Mohammadi, Anastasia Giachanou, and Ayoub Bagheri, 8 Oct 2025, EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models, https://arxiv.org/abs/2510.05942
  • Weijian Deng, Weijie Tu, Ibrahim Radwan, Mohammad Abu Alsheikh, Stephen Gould, Liang Zheng, 3 Oct 2025, Confidence and Dispersity as Signals: Unsupervised Model Evaluation and Ranking, https://arxiv.org/abs/2510.02956
  • Aurélien Bück-Kaeffer, Je Qin Chooi, Dan Zhao, Maximilian Puelma Touzel, Kellin Pelrine, Jean-François Godbout, Reihaneh Rabbany, Zachary Yang, 27 Sep 2025, BluePrint: A Social Media User Dataset for LLM Persona Evaluation and Training, https://arxiv.org/abs/2510.02343
  • Lara Ahrens, Wilhelm Haverkamp, Nils Strodthoff, 21 Oct 2025, ECG-LLM -- training and evaluation of domain-specific large language models for electrocardiography, https://arxiv.org/abs/2510.18339
  • Panagiotis Michelakis, Yiannis Hadjiyiannis, Dimitrios Stamoulis, 25 Sep 2025, CORE: Full-Path Evaluation of LLM Agents Beyond Final State, https://arxiv.org/abs/2509.20998
  • Wenkai Guo, Xuefeng Liu, Haolin Wang, Jianwei Niu, Shaojie Tang, Jing Yuan, 25 Sep 2025, Can Federated Learning Safeguard Private Data in LLM Training? Vulnerabilities, Attacks, and Defense Evaluation, https://arxiv.org/abs/2509.20680
  • Meng Wan, Benxi Tian, Jue Wang, Cui Hui, Ningming Nie, Tiantian Liu, Zongguo Wang, Cao Rongqiang, Peng Shi, and Yangang Wang, 25 Sep 2025, Lossless Compression: A New Benchmark for Time Series Model Evaluation, https://arxiv.org/abs/2509.21002
  • Nicolas Salvy, Hugues Talbot and Bertrand Thirion, 25 Sep 2025, Enhanced Generative Model Evaluation with Clipped Density and Coverage, https://arxiv.org/abs/2507.01761
  • Miguel Angel Alvarado Gonzalez, Michelle Bruno Hernandez, Miguel Angel Peñaloza Perez, Bruno Lopez Orozco, Jesus Tadeo Cruz Soto, Sandra Malagon, 28 Sep 2025, Do Repetitions Matter? Strengthening Reliability in LLM Evaluations, https://arxiv.org/abs/2509.24086
  • Wanjin Feng, Yuan Yuan, Jingtao Ding, Yong Li, 27 Sep 2025, Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting, https://arxiv.org/abs/2509.23074
  • Raviteja Anantha, Soheil Hor, Teodor Nicola Antoniu, Layne C. Price, 27 Sep 2025, NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning, https://arxiv.org/abs/2509.23252
  • Bo Li, Xin Zheng, Ming Jin, Can Wang, Shirui Pan, 28 Sep 2025, Test-time GNN Model Evaluation on Dynamic Graphs, https://arxiv.org/abs/2509.23816
  • Chunyang Jiang, Yonggang Zhang, Yiyang Cai, Chi-Min Chan, Yulong Liu, Mingming Chen, Wei Xue, Yike Guo, 27 Sep 2025, Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks, https://arxiv.org/abs/2509.23067
  • Jiahao Zhao, Yunjia Li, Wei Li, Kazuyoshi Yoshii, 27 Sep 2025, ABC-Eval: Benchmarking Large Language Models on Symbolic Music Understanding and Instruction Following, https://arxiv.org/abs/2509.23350
  • Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, Kui Ren, 29 Sep 2025, HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment, https://arxiv.org/abs/2509.24384
  • Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, Sherry Yang, 29 Sep 2025, WorldGym: World Model as An Environment for Policy Evaluation, https://arxiv.org/abs/2506.00613
  • Kuang-Da Wang, Zhao Wang, Yotaro Shimose, Wei-Yao Wang, Shingo Takamatsu, 17 Oct 2025, WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation, https://arxiv.org/abs/2510.15306
  • Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary, 5 Oct 2025, Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation, https://arxiv.org/abs/2510.04265
  • Ashish Kattamuri, Harshwardhan Fartale, Arpita Vats, Rahul Raja, Ishita Prasad, 10 Oct 2025, RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation, https://arxiv.org/abs/2510.08931
  • Xin Zhou, Kisub Kim, Ting Zhang, Martin Weyssow, Luis F. Gomes, Guang Yang, Kui Liu, Xin Xia, David Lo, 10 Oct 2025, An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks, https://arxiv.org/abs/2505.20854
  • Qinghua Lu, Dehai Zhao, Yue Liu, Hao Zhang, Liming Zhu, Xiwei Xu, Angela Shi, Tristan Tan, Rick Kazman, 23 Oct 2025, AgentArcEval: An Architecture Evaluation Method for Foundation Model based Agents, https://arxiv.org/abs/2510.21031
  • Pai Liu, Lingfeng Zhao, Shivangi Agarwal, Jinghan Liu, Audrey Huang, Philip Amortila, Nan Jiang, 24 Oct 2025, Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol, https://arxiv.org/abs/2502.08021
  • Andrew Keenan Richardson, Ryan Othniel Kearns, Sean Moss, Vincent Wang-Mascianica, Philipp Koralus, 24 Oct 2025, Theory-Grounded Evaluation of Human-Like Fallacy Patterns in LLM Reasoning, https://arxiv.org/abs/2506.11128
  • Shaobo Wang, Cong Wang, Wenjie Fu, Yue Min, Mingquan Feng, Isabel Guan, Xuming Hu, Conghui He, Cunxiang Wang, Kexin Yang, Xingzhang Ren, Fei Huang, Dayiheng Liu, Linfeng Zhang, 12 Oct 2025, Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?, https://arxiv.org/abs/2510.10457
  • Dongjie Yang, Chengqiang Lu, Qimeng Wang, Xinbei Ma, Yan Gao, Yao Hu, Hai Zhao, 12 Oct 2025, Wide-Horizon Thinking and Simulation-Based Evaluation for Real-World LLM Planning with Multifaceted Constraints, https://arxiv.org/abs/2506.12421
  • Ran Zhang, Wei Zhao, Lieve Macken, Steffen Eger, 13 Oct 2025, LiTransProQA: an LLM-based Literary Translation evaluation metric with Professional Question Answering, https://arxiv.org/abs/2505.05423
  • Chaithanya Bandi and Abir Harrasse, 11 Oct 2025, Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation, https://arxiv.org/abs/2410.04663
  • Akhil Kumar, Jianliang Leon Zhao, Om Dobariya, 8 Oct 2025, Evaluation of LLMs for Process Model Analysis and Optimization, https://arxiv.org/abs/2510.07489
  • Alexander Rubinstein, Benjamin Raible, Martin Gubri, Seong Joon Oh, 9 Oct 2025, DISCO: Diversifying Sample Condensation for Efficient Model Evaluation, https://arxiv.org/abs/2510.07959
  • Amruta Parulekar, Preethi Jyothi, 8 Oct 2025, LASER: An LLM-based ASR Scoring and Evaluation Rubric, https://arxiv.org/abs/2510.07437
  • Fanwei Zhua, Jiaxuan He, Xiaoxiao Chen, Zulong Chen, Quan Lu and Chenrui Mei, 9 Oct 2025, Towards Human-Like Grading: A Unified LLM-Enhanced Framework for Subjective Question Evaluation, https://arxiv.org/abs/2510.07912
  • Qin Liu, Jacob Dineen, Yuxi Huang, Sheng Zhang, Hoifung Poon, Ben Zhou, Muhao Chen, 9 Oct 2025, ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation, https://arxiv.org/abs/2510.08569
  • Hong-Jie Dai, Zheng-Hao Li, An-Tai Lu, Bo-Tsz Shain, Ming-Ta Li, Tatheer Hussain Mir, Kuang-Te Wang, Min-I Su, Pei-Kang Liu, Ming-Ju Tsai, 23 Sep 2025, Model selection meets clinical semantics: Optimizing ICD-10-CM prediction via LLM-as-Judge evaluation, redundancy-aware sampling, and section-aware fine-tuning, https://arxiv.org/abs/2509.18846
  • Mahmoud Ibrahim, Bart Elen, Chang Sun, Gökhan Ertaylan, Michel Dumontier, 22 Oct 2025, Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series, https://arxiv.org/abs/2510.19728
  • Kenya S. Andrews, Deborah Dormah Kanubala, Kehinde Aruleba, Francisco Enrique Vicente Castro, Renata A Revelo, 21 Oct 2025, A Justice Lens on Fairness and Ethics Courses in Computing Education: LLM-Assisted Multi-Perspective and Thematic Evaluation, https://arxiv.org/abs/2510.18931
  • Chengcan Wu, Zhixin Zhang, Mingqian Xu, Zeming Wei, Meng Sun, 22 Oct 2025, Monitoring LLM-based Multi-Agent Systems Against Corruptions via Node Evaluation, https://arxiv.org/abs/2510.19420
  • Boxuan Zhang, Yi Yu, Jiaxuan Guo, Jing Shao, 29 Sep 2025, Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents, https://arxiv.org/abs/2509.25302
  • Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Anna Vasileva, Anna Antipina, Tatyana Zaitseva, Alina Ermilova, Evgeny Burnaev, Egor Shvetsov, 29 Sep 2025, From Internal Representations to Text Quality: A Geometric Approach to LLM Evaluation, https://arxiv.org/abs/2509.25359
  • Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, Baishakhi Ray, 29 Sep 2025, Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation, https://arxiv.org/abs/2502.17521
  • Jiacheng Liang, Zian Wang, Lauren Hong, Shouling Ji, Ting Wang, 29 Sep 2025, Watermark under Fire: A Robustness Evaluation of LLM Watermarking, https://arxiv.org/abs/2411.13425
  • Jingtong Su, Jianyu Zhang, Karen Ullrich, Léon Bottou, Mark Ibrahim, 2 Oct 2025, A Single Character can Make or Break Your LLM Evals, https://arxiv.org/abs/2510.05152
  • Zachary Robertson, 16 Oct 2025, Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores, https://arxiv.org/abs/2510.14966

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: