Aussie AI

Training Optimization

  • Last Updated 3 January, 2026
  • by David Spuler, Ph.D.

Training is very expensive, which has driven a rise in papers on optimizing model training methods. The cost of a training run is typically many multiples of the cost of a single inference query, but total inference cost can overshadow training cost given enough users. Nevertheless, the total cost of training to the industry is likely to remain high, since almost all use cases require not only initial training, but also ongoing fine-tuning and re-training.
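The trade-off between one-off training cost and cumulative inference cost can be made concrete with a little arithmetic. The following is a minimal sketch only: the function names and all dollar figures are hypothetical placeholders chosen for illustration, not measurements from any real model.

```python
def breakeven_queries(training_cost: float, cost_per_query: float) -> float:
    """Number of queries at which cumulative inference cost equals training cost."""
    if cost_per_query <= 0:
        raise ValueError("cost_per_query must be positive")
    return training_cost / cost_per_query


def total_cost(training_cost: float, cost_per_query: float, queries: int,
               retrain_cost: float = 0.0, retrains: int = 0) -> float:
    """Initial training plus any re-training or fine-tuning runs plus inference."""
    return training_cost + retrain_cost * retrains + cost_per_query * queries


# Hypothetical figures, for illustration only.
TRAINING_COST = 5_000_000.0   # one-off pre-training cost (assumed)
COST_PER_QUERY = 0.002        # average inference cost per query (assumed)

print("Break-even queries:", breakeven_queries(TRAINING_COST, COST_PER_QUERY))
print("Total with 4 re-training runs:",
      total_cost(TRAINING_COST, COST_PER_QUERY, 1_000_000_000,
                 retrain_cost=250_000.0, retrains=4))
```

Past the break-even point, inference dominates total spend, while the re-training term shows why training cost does not disappear even for a mature deployment.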

Research on training algorithms in general:

  • Unsupervised learning
  • Reinforcement Learning from Human Feedback (RLHF)
  • In-context learning (ICL)
  • Direct Preference Optimization (DPO)
  • Self-supervised learning (automated AI Feedback)
  • Human-In-The-Loop (HITL)
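Of the methods listed above, DPO has a compact per-example loss that is easy to write down. The sketch below computes the standard DPO objective for a single preference pair from four summed log-probabilities (policy and frozen reference model, each scored on the chosen and rejected responses). The function name, argument names, and the default beta value are illustrative assumptions, and a real implementation would operate on batched tensors in a deep learning framework.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more the policy likes each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_shift - rejected_shift)
    # Numerically stable -log(sigmoid(x)) = log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# The loss is lower when the policy favors the chosen response
# more strongly than the reference model does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
print(dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

Unlike RLHF, this objective needs no separately trained reward model, only log-probabilities from the policy and a frozen reference copy.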

General concepts in LLM reasoning and model capabilities:

Information on improving the accuracy and/or speed of training algorithms:

Improvements in resiliency of the training infrastructure for multi-GPU clusters in data centers:

Research on the data used in pre-training:

Modified types of pre-training for models:

Fine-tuning methods include:

Lesser-known alternatives to fine-tuning being researched for improving model capabilities that require only a single inference step, but may also require a short training-like phase:

  • Prompt tuning (extended vocabulary PEFT, typically with extra soft tokens prepended to prompt)
  • Decoding-based reasoning in single inference step (e.g., tree decoding)
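The "soft tokens prepended to prompt" idea in prompt tuning can be sketched at the embedding level. In this toy example the vocabulary, the embedding dimension, and the function name embed_with_soft_prompt are all hypothetical; the point is only that a small set of trainable vectors is placed in front of the frozen token embeddings, and those vectors are the only parameters an optimizer would update.

```python
import random

EMBED_DIM = 4          # toy embedding width (assumed)
NUM_SOFT_TOKENS = 3    # number of trainable soft tokens (assumed)

random.seed(0)
VOCAB = {"translate": 0, "this": 1, "sentence": 2}

# Frozen embedding table of the base model (never updated in prompt tuning).
frozen_embeddings = [[random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
                     for _ in VOCAB]

# Trainable soft prompt: the only parameters that would receive gradients.
soft_prompt = [[0.0] * EMBED_DIM for _ in range(NUM_SOFT_TOKENS)]


def embed_with_soft_prompt(tokens, soft_vectors, embeddings, vocab):
    """Return the model input sequence: soft vectors followed by token embeddings."""
    token_vectors = [embeddings[vocab[t]] for t in tokens]
    return list(soft_vectors) + token_vectors


sequence = embed_with_soft_prompt(["translate", "this", "sentence"],
                                  soft_prompt, frozen_embeddings, VOCAB)
print(len(sequence), "input vectors,", NUM_SOFT_TOKENS, "of them trainable")
```

Because the base model weights stay frozen, the trainable parameter count here is just NUM_SOFT_TOKENS times EMBED_DIM, which is why the approach is counted among parameter-efficient methods.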

Retrieval-based alternatives to fine-tuning for extra LLM capabilities and intelligence/accuracy (without requiring any extra training):

Non-retrieval methods of giving LLMs additional context information for their inference queries, but only with a single inference query (and without traditional RAG-type data retrieval):

Prompt engineering enhancements to LLM capabilities (single-step):

Advanced topics in prompt engineering (single-shot):

Inference-based reasoning algorithms with multiple steps combining prompt engineering and inference processing of queries:

Addressing limitations of model intelligence:

Other directions for model intelligence:

  • Planning
  • Followup questions
  • Interactive prompting
  • Program execution models (e.g., LLM generates Python code to run)
  • Symbolic reasoning
  • Concept models ("large concept models" or LCMs)
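The program execution idea in the list above (the LLM writes Python, and the host runs it to get the answer) can be sketched as follows. Here fake_llm is a hypothetical stand-in that returns a fixed code string instead of calling a real model, and the convention that generated code stores its answer in a variable named result is likewise an assumption. Restricting the available builtins, as done here, is not a real security sandbox; production systems would isolate execution far more carefully.

```python
def fake_llm(question: str) -> str:
    """Hypothetical stand-in for an LLM call; returns Python source as text."""
    return "result = sum(n * n for n in range(1, 11))"


def run_generated_code(source: str):
    """Execute generated code with a small whitelist of builtins.

    This limits what the code can name, but it is NOT a secure sandbox.
    """
    allowed_builtins = {"sum": sum, "range": range, "len": len,
                        "min": min, "max": max, "abs": abs}
    namespace = {"__builtins__": allowed_builtins}
    exec(source, namespace)
    # By (assumed) convention, the generated code stores its answer in `result`.
    return namespace.get("result")


answer = run_generated_code(fake_llm("What is the sum of squares from 1 to 10?"))
print(answer)  # prints 385
```

Delegating the arithmetic to an interpreter avoids relying on the model to compute the number token by token.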

Survey Papers on Training Optimizations

Survey papers on speeding up training:

  • Yarally T, Cruz L, Feitosa D, et al (2023), Uncovering energy-efficient practices in deep learning training: Preliminary steps towards green AI. International Conference on AI Engineering - Software Engineering for AI (CAIN), https://arxiv.org/abs/2303.13972
  • A. Apicella, F. Donnarumma, F. Isgrò, and R. Prevete, A survey on modern trainable activation functions, Neural Networks, vol. 138, pp. 14–32, 2021, https://arxiv.org/abs/2005.00817 (Extensive survey all about training with activation functions, e.g. ReLU, Swish, Maxout, leaky ReLU.)
  • R. Immonen, T. Hämäläinen et al., Tiny machine learning for resource-constrained microcontrollers, Journal of Sensors, vol. 2022, 2022, https://www.hindawi.com/journals/js/2022/7437023/ (Survey of on-device training for TinyML/edge computing.)
  • P Freire, E Manuylovich, JE Prilepsky, SK Turitsyn, 2023, Artificial neural networks for photonic applications—from algorithms to implementation: tutorial, Advances in Optics and Photonics, Sep 2023, https://opg.optica.org/directpdfaccess/f0ae8746-2f89-4ac4-bb598eda29c7977c_539680/aop-15-3-739.pdf?da=1&id=539680&seq=0&mobile=no (Large survey covering many aspects of the future of training optimization.)
  • Marcos Treviso, Tianchu Ji, Ji-Ung Lee, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Pedro H. Martins, Andre F. T. Martins, Peter Milder, Colin Raffel, Edwin Simpson, Noam Slonim, Niranjan Balasubramanian, Leon Derczynski, Roy Schwartz, Aug 2022, Efficient Methods for Natural Language Processing: A Survey. arxiv:2209.00099[cs], August 2022. http://arxiv.org/abs/2209.00099
  • MM YAPICI, N Topaloğlu, 2021, Computers and Informatics, Performance comparison of deep learning frameworks https://dergipark.org.tr/en/pub/ci/issue/60236/769457, PDF: https://dergipark.org.tr/en/download/article-file/1201877 (Examines Torch, Theano, Caffe, Caffe2, MXNet, Keras, TensorFlow, and CNTK frameworks in terms of training speed.)
  • H. Jahangir, S. K. Goel and S. Khurana, "Scaling Up the Transformers: A Survey of Training and Inference Optimization Techniques," 2024 International Conference on Electrical Electronics and Computing Technologies (ICEECT), Greater Noida, India, 2024, pp. 1-6, doi: 10.1109/ICEECT61758.2024.10739061. https://ieeexplore.ieee.org/abstract/document/10739061
  • Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
  • Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao, 20 Feb 2024 (v2), Large Language Models: A Survey, https://arxiv.org/abs/2402.06196
  • R Abdulkadirov, P Lyakhov, N Nagornov, 2023, Survey of Optimization Algorithms in Modern Neural Networks https://www.mdpi.com/2227-7390/11/11/2466 https://www.mdpi.com/2227-7390/11/11/2466/pdf
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques, including training optimization.)
  • Zehao Xiao, Cees G. M. Snoek, 6 Nov 2024, Beyond Model Adaptation at Test Time: A Survey. https://arxiv.org/abs/2411.03687
  • Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
  • Dan Zhang, Tao Feng, Lilong Xue, Yuandong Wang, Yuxiao Dong, Jie Tang, 23 Jan 2025, Parameter-Efficient Fine-Tuning for Foundation Models, https://arxiv.org/abs/2501.13787

Training Speed Optimizations

Papers with specific techniques for optimization of training in terms of throughput, latency or processing speed, rather than accuracy or perplexity of results (chosen out of literally thousands):

Fine-Tuning

Papers on fine-tuning optimizations:

  • Libo Qin, Qiguang Chen, Xiachong Feng, Yang Wu, Yongheng Zhang, Yinghui Li, Min Li, Wanxiang Che, Philip S. Yu, 21 May 2024, Large Language Models Meet NLP: A Survey, https://arxiv.org/abs/2405.12819 (A survey of research into how LLMs, with and without fine-tuning, perform in various NLP use cases, such as mathematical reasoning, dialogue understanding, translation, and more.)
  • Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 7 May 2024, FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065 (Optimizes RAG by appending rather than prepending retrieved documents, which improves KV cache reuse.)
  • Benjue Weng, 13 Apr 2024, Navigating the Landscape of Large Language Models: A Comprehensive Review and Analysis of Paradigms and Fine-Tuning Strategies, https://arxiv.org/abs/2404.09022 (Reviewing fine-tuning of large models.)
  • Tal Peretz, 15 Nov 2023, The Developer's Guide to Production-Grade LLM Apps: Advanced Techniques for Maximizing LLM Performance, https://buildingaistuff.com/p/the-developers-guide-to-production
  • Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim, 18 Jan 2024, Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation, https://arxiv.org/abs/2401.08417
  • David Spuler, March 2024, Chapter 6. Training, Fine-Tuning & RAG, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
  • Pranav Patel, 2024, In-depth guide to fine-tuning LLMs with LoRA and QLoRA, https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora
  • Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu, 6 Jun 2024 (v2), Full Parameter Fine-tuning for Large Language Models with Limited Resources, https://arxiv.org/abs/2306.09782 Code: https://github.com/OpenLMLab/LOMO (Low-memory usage for full-parameter fine-tuning.)
  • Louis-François Bouchard, Louie Peters, May 2024, Chapter 10: Fine-Tuning, Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG, https://www.amazon.com/Building-LLMs-Production-Reliability-Fine-Tuning/dp/B0D4FFPFW8/
  • Valentina Alto, 2024, Chapter 11: Fine-Tuning Large Language Models, Building LLM-Powered Applications: Create intelligent apps and agents with large language models, Packt Publishing, https://www.amazon.com/Building-LLM-Apps-Intelligent-Language/dp/1835462316/
  • Aarushi Kansal, Chapter 5: Fine-Tuning: The Theory, Chapter 6: Fine-Tuning: Hands-On, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
  • Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
  • Yi Zhou, Dec 16, 2023, Optimizing GenAI: Comparing Model Training, Fine-Tuning, RAG, and Prompt Engineering, https://medium.com/generative-ai-revolution-ai-native-transformation/optimizing-genai-comparing-model-training-fine-tuning-rag-and-prompt-engineering-7a7c6c65e0f0
  • Dan Peng, Zhihui Fu, Jun Wang, 1 Jul 2024, PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs, https://arxiv.org/abs/2407.01031 (Running fine-tuning on a smartphone via a low-memory, "derivative-free" "zeroth-order" optimization technique called MeZO, with advantages such as privacy.)
  • OpenAI, August 20, 2024, Fine-tuning now available for GPT-4o, https://openai.com/index/gpt-4o-fine-tuning/
  • Judy Hanwen Shen, Inioluwa Deborah Raji, Irene Y. Chen, 8 Aug 2024, The Data Addition Dilemma, https://arxiv.org/abs/2408.04154
  • Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen, 28 May 2024 (v3) Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark, https://arxiv.org/abs/2402.11592 Code: https://github.com/ZO-Bench/ZO-LLM
  • Junjie Ye, Yuming Yang, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan, 24 Sep 2024, Empirical Insights on Fine-Tuning Large Language Models for Question-Answering, https://arxiv.org/abs/2409.15825
  • Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu, 23 Sep 2024, Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely, https://arxiv.org/abs/2409.14924
  • Foundry AI, Oct 2024, When Should You Move Beyond Prompting and Start Fine-Tuning? https://thefoundryai.com/blog/fine-tuning
  • Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
  • Angels Balaguer, Vinamra Benara, Renato Luiz de Freitas Cunha, Roberto de M. Estevão Filho, Todd Hendry, Daniel Holstein, Jennifer Marsman, Nick Mecklenburg, Sara Malvar, Leonardo O. Nunes, Rafael Padilha, Morris Sharp, Bruno Silva, Swati Sharma, Vijay Aski, Ranveer Chandra, 30 Jan 2024 (v3), RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture, https://arxiv.org/abs/2401.08406
  • Towards AI, December 24, 2024, LLM Fine Tuning Guide: Do You Need It and How to Do It https://towardsai.net/p/artificial-intelligence/llm-fine-tuning-guide-do-you-need-it-and-how-to-do-it-4
  • Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)
  • Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
  • Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao, 8 Mar 2025, A Survey on Post-training of Large Language Models, https://arxiv.org/abs/2503.06072
  • Maxime Heuillet, Yufei Cui, Boxing Chen, Audrey Durand, Prasanna Parthasarathi, 13 Aug 2025, Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts, https://arxiv.org/abs/2508.10123
  • Tianjun Yuan, Jiaxiang Geng, Pengchao Han, Xianhao Chen, Bing Luo, 14 Aug 2025, Flexible Personalized Split Federated Learning for On-Device Fine-Tuning of Foundation Models, https://arxiv.org/abs/2508.10349
  • Dongyue Li and Hongyang R. Zhang, 13 Aug 2025, Improved Regularization and Robustness for Fine-tuning in Neural Networks, https://arxiv.org/abs/2111.04578
  • Yanxia Deng, Aozhong Zhang, Selcuk Gurses, Naigang Wang, Zi Yang and Penghang Yin, 14 Aug 2025, CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization, https://arxiv.org/abs/2501.18475
  • Suhas G Hegde, Shilpy Kaur, Aruna Tiwari, 14 Aug 2025, VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models, https://arxiv.org/abs/2503.19530
  • Andrew P. Berg, Qian Zhang, Mia Y. Wang, 14 Aug 2025, 15,500 Seconds: Lean UAV Classification Using EfficientNet and Lightweight Fine-Tuning, https://arxiv.org/abs/2506.11049
  • Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Olivier Lévêque, Elise Colin, 14 Aug 2025, Quantitative Comparison of Fine-Tuning Techniques for Pretrained Latent Diffusion Models in the Generation of Unseen SAR Images, https://arxiv.org/abs/2506.13307
  • Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong, 23 Jul 2025, LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning, https://arxiv.org/abs/2506.15606
  • Simon Ouellette, 17 Jul 2025, Out-of-Distribution Generalization in the ARC-AGI Domain: Comparing Execution-Guided Neural Program Synthesis and Test-Time Fine-Tuning, https://arxiv.org/abs/2507.15877
  • Boheng Li, Renjie Gu, Junjie Wang, Leyi Qi, Yiming Li, Run Wang, Zhan Qin, Tianwei Zhang, 22 Jul 2025, Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning, https://arxiv.org/abs/2507.16302
  • Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda, 22 Jul 2025, Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning, https://arxiv.org/abs/2507.16795
  • Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li, 22 Jul 2025, Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance, https://arxiv.org/abs/2407.17029
  • Furong Peng, Jinzhen Gao, Xuan Lu, Kang Liu, Yifan Huo, Sheng Wang, 22 Jul 2025, Towards a deeper GCN: Alleviate over-smoothing with iterative training and fine-tuning, https://arxiv.org/abs/2506.17576
  • Binghua Li, Ziqing Chang, Tong Liang, Chao Li, Toshihisa Tanaka, Shigeki Aoki, Qibin Zhao, Zhe Sun, 24 Jul 2025, Parameter-Efficient Fine-Tuning of 3D DDPM for MRI Image Generation Using Tensor Networks, https://arxiv.org/abs/2507.18112
  • Ziming Yu, Pan Zhou, Sike Wang, Jia Li, Mi Tian, Hua Huang, 24 Jul 2025, Zeroth-Order Fine-Tuning of LLMs in Random Subspaces, https://arxiv.org/abs/2410.08989
  • Tim Rensmeyer, Denis Kramer, Oliver Niggemann, 18 Jul 2025, On-the-Fly Fine-Tuning of Foundational Neural Network Potentials: A Bayesian Neural Network Approach, https://arxiv.org/abs/2507.13805
  • Amro Abdalla, Ismail Shaheen, Dan DeGenaro, Rupayan Mallick, Bogdan Raita, Sarah Adel Bargal, 18 Jul 2025, GIFT: Gradient-aware Immunization of diffusion models against malicious Fine-Tuning with safe concepts retention, https://arxiv.org/abs/2507.13598
  • Rafiq Kamel, Filippo Guerranti, Simon Geisler, Stephan Günnemann, 15 Jul 2025, SAFT: Structure-Aware Fine-Tuning of LLMs for AMR-to-Text Generation, https://arxiv.org/abs/2507.13381
  • Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, Geng Yuan, 18 Jul 2025, Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning, https://arxiv.org/abs/2502.03304
  • Harsh Nilesh Pathak and Randy Paffenroth, 18 Jul 2025, Solo Connection: A Parameter Efficient Fine-Tuning Technique for Transformers, https://arxiv.org/abs/2507.14353
  • Fufang Wen and Shichang Zhang, 14 Jul 2025, Retention analysis of edited knowledge after fine-tuning, https://arxiv.org/abs/2507.14198
  • Yujia Tong, Jingling Yuan, Tian Zhang, Jianquan Liu, Chuang Hu, 19 Jul 2025, DFQ-ViT: Data-Free Quantization for Vision Transformers without Fine-tuning, https://arxiv.org/abs/2507.14481
  • Wooseok Ha, Yuansi Chen, 19 Jul 2025, When few labeled target data suffice: a theory of semi-supervised domain adaptation via fine-tuning from multiple adaptive starts, https://arxiv.org/abs/2507.14661
  • Roy H. Jennings, Genady Paikin, Roy Shaul, Evgeny Soloveichik, 20 Jul 2025, Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression, https://arxiv.org/abs/2507.14997
  • Hanyang Zhao, Haoxian Chen, Yucheng Guo, Genta Indra Winata, Tingting Ou, Ziyu Huang, David D. Yao, Wenpin Tang, 19 Jul 2025, Fine-Tuning Diffusion Generative Models via Rich Preference Optimization, https://arxiv.org/abs/2503.11720
  • Xingke Yang and Liang Li and Sicong Li and Liwei Guan and Hao Wang and Xiaoqi Qi and Jiang Liu and Xin Fu and Miao Pan, 9 Aug 2025, Fed MobiLLM: Efficient Federated LLM Fine-Tuning over Heterogeneous Mobile Devices via Server Assisted Side-Tuning, https://arxiv.org/abs/2508.06765
  • Brendan R. Hogan, Will Brown, Adel Boyarsky, Anderson Schneider, Yuriy Nevmyvaka, 9 Aug 2025, Technical Report: Full-Stack Fine-Tuning for the Q Programming Language, https://arxiv.org/abs/2508.06813
  • Amal Saadallah, Abdulaziz Al-Ademi, 11 Aug 2025, Adaptive Fine-Tuning via Pattern Specialization for Deep Time Series Forecasting, https://arxiv.org/abs/2508.07927
  • Bujar Raufi, 10 Aug 2025, Fine-Tuning Large Language Models Using EEG Microstate Features for Mental Workload Assessment, https://arxiv.org/abs/2508.07283
  • Zhaorui Tan, Tan Pan, Kaizhu Huang, Weimiao Yu, Kai Yao, Chen Jiang, Qiufeng Wang, Anh Nguyen, Xin Guo, Yuan Cheng, Xi Yang, 11 Aug 2025, Exploiting Layer Normalization Fine-tuning in Visual Transformer Foundation Models for Classification, https://arxiv.org/abs/2508.07577
  • Vishwas M. Shetty, Jiusi Zheng, Abeer Alwan, 11 Aug 2025, G-IFT: A Gated Linear Unit adapter with Iterative Fine-Tuning for Low-Resource Children's Speaker Verification, https://arxiv.org/abs/2508.07836
  • Xingke Yang and Liang Li and Zhiyi Wan and Sicong Li and Xiaoqi Qi and Jiang Liu and Tomoaki Ohtsuki and Xin Fu and Miao Pan, 9 Aug 2025, PAE MobiLLM: Privacy-Aware and Efficient LLM Fine-Tuning on the Mobile Device via Additive Side-Tuning, https://arxiv.org/abs/2507.01216
  • Mohammad Mehdi Rastikerdar, Jin Huang, Hui Guan, Deepak Ganesan, 11 Aug 2025, In-Situ Fine-Tuning of Wildlife Models in IoT-Enabled Camera Traps for Efficient Adaptation, https://arxiv.org/abs/2409.07796
  • Qingguo Wang, 10 Aug 2025, Accurate Measles Rash Detection via Vision Transformer Fine-Tuning, https://arxiv.org/abs/2005.09112
  • Atharva Nijasure, Tanya Chowdhury, James Allan, 10 Aug 2025, How Relevance Emerges: Interpreting LoRA Fine-Tuning in Reranking LLMs, https://arxiv.org/abs/2504.08780
  • Yining Huang, Bin Li, Keke Tang, Meilian Chen, 28 Jul 2025, LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning, https://arxiv.org/abs/2507.20999
  • Roman Macháček and Anastasiia Grishina and Max Hort and Leon Moonen, 26 Jul 2025, The Impact of Fine-tuning Large Language Models on Automated Program Repair, https://arxiv.org/abs/2507.19909
  • Fabrizio Nunnari, Alakshendra Jyotsnaditya Ramkrishna Singh, Patrick Gebhard, 27 Jul 2025, Color histogram equalization and fine-tuning to improve expression recognition of (partially occluded) faces on sign language datasets, https://arxiv.org/abs/2507.20197
  • Wei Lu, Daniel L. Chen, Christian B. Hansen, 28 Jul 2025, Aligning Large Language Model Agents with Rational and Moral Preferences: A Supervised Fine-Tuning Approach, https://arxiv.org/abs/2507.20796
  • Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin, 28 Jul 2025, Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards, https://arxiv.org/abs/2505.16789
  • Yifu Han and Geo Zhang, 27 Jul 2025, Reinforcement learning fine-tuning of language model for instruction following and math reasoning, https://arxiv.org/abs/2506.21560
  • Zixuan Chen and Weikai Lu and Xin Lin and Ziqian Zeng, 27 Jul 2025, SDD: Self-Degraded Defense against Malicious Fine-tuning, https://arxiv.org/abs/2507.21182
  • Zengyang Li, Yimeng Li, Binbin Huang, Peng Liang, Ran Mo, Hui Liu, Yutao Ma, 29 Jul 2025, Fine-Tuning Code Language Models to Detect Cross-Language Bugs, https://arxiv.org/abs/2507.21954
  • Aly M. Kassem, Zhuan Shi, Negar Rostamzadeh, Golnoosh Farnadi, 19 Jun 2025, Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing, https://arxiv.org/abs/2507.21084
  • Georg Slamanig, Francesco Corti, Olga Saukh, 31 Jul 2025, From LLMs to Edge: Parameter-Efficient Fine-Tuning on Edge Devices, https://arxiv.org/abs/2507.23536
  • Sirine Arfa, Bernhard Vogginger, Christian Mayr, 31 Jul 2025, Hardware-Aware Fine-Tuning of Spiking Q-Networks on the SpiNNaker2 Neuromorphic Platform, https://arxiv.org/abs/2507.23562
  • Yan Zhu, Jingyang Zhu, Ting Wang, Yuanming Shi, Chunxiao Jiang and Khaled Ben Letaief, 31 Jul 2025, Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks, https://arxiv.org/abs/2504.10403
  • Wei Guo, Siyuan Lu, Yiqi Tong, Zhaojun Hu, Fuzhen Zhuang, Xiao Zhang, Tao Fan, Jin Dong, 31 Jul 2025, H2Tune: Federated Foundation Model Fine-Tuning with Hybrid Heterogeneity, https://arxiv.org/abs/2507.22633
  • Vishwesh Ramanathan, Tony Xu, Pushpak Pati, Faruk Ahmed, Maged Goubran, Anne L. Martel, 30 Jul 2025, ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology, https://arxiv.org/abs/2503.17564
  • Afshin Khadangi, Amir Sartipi, Igor Tchappi, Ramin Bahmani, Gilbert Fridgen, 30 Jul 2025, Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning, https://arxiv.org/abs/2507.22565
  • Yebo Wu, Jingguang Li, Zhijiang Guo and Li Li, 31 Jul 2025, Learning Like Humans: Resource-Efficient Federated Fine-Tuning through Cognitive Developmental Stages, https://arxiv.org/abs/2508.00041
  • Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Anton van den Hengel, Ehsan Abbasnejad, 1 Aug 2025, Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri--Rao Product, https://arxiv.org/abs/2508.00230
  • Shayan Jalilian, Abdul Bais, 31 Jul 2025, SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient, Parallel-Text Adapters, https://arxiv.org/abs/2508.00213
  • Prerana Ramkumar, 1 Aug 2025, SU-ESRGAN: Semantic and Uncertainty-Aware ESRGAN for Super-Resolution of Satellite and Drone Imagery with Fine-Tuning for Cross Domain Evaluation, https://arxiv.org/abs/2508.00750
  • Julian Lemmel, Manuel Kranzl, Adam Lamine, Philipp Neubauer, Radu Grosu, Sophie Neubauer, 1 Aug 2025, Online Fine-Tuning of Carbon Emission Predictions using Real-Time Recurrent Learning for State Space Models, https://arxiv.org/abs/2508.00804
  • Derin Cayir, Renjie Tao, Rashi Rungta, Kai Sun, Sean Chen, Haidar Khan, Minseok Kim, Julia Reinspach, Yue Liu, 3 Aug 2025, Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning, https://arxiv.org/abs/2508.01543
  • Yixin Shen, 4 Aug 2025, Kronecker-LoRA: hybrid Kronecker-LoRA adapters for scalable, sustainable fine-tuning, https://arxiv.org/abs/2508.01961
  • Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha, 4 Aug 2025, AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization, https://arxiv.org/abs/2508.02079
  • Yilun Liu, Yunpu Ma, Yuetian Lu, Shuo Chen, Zifeng Ding, Volker Tresp, 4 Aug 2025, Parameter-Efficient Routed Fine-Tuning: Mixture-of-Experts Demands Mixture of Adaptation Modules, https://arxiv.org/abs/2508.02587
  • Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, Chunhe Xia, 4 Aug 2025, CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning, https://arxiv.org/abs/2508.02219
  • Ayan Sengupta, Vaibhav Seth, Arinjay Pathak, Aastha Verma, Natraj Raman, Sriram Gopalakrishnan, Niladri Chatterjee, Tanmoy Chakraborty, 3 Aug 2025, Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation, https://arxiv.org/abs/2411.04358
  • Yinbin Han, Meisam Razaviyayn, Renyuan Xu, 3 Aug 2025, Stochastic Control for Fine-tuning Diffusion Models: Optimality, Regularity, and Convergence, https://arxiv.org/abs/2412.18164
  • Jack Chen, Fazhong Liu, Naruto Liu, Yuhan Luo, Erqu Qin, Harry Zheng, Tian Dong, Haojin Zhu, Yan Meng, Xiao Wang, 4 Aug 2025, Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs, https://arxiv.org/abs/2505.13026
  • Yidong Chai, Yang Liu, Yonghang Zhou, Jiaheng Xie, Daniel Dajun Zeng, 31 Jul 2025, A Bayesian Hybrid Parameter-Efficient Fine-Tuning Method for Large Language Models, https://arxiv.org/abs/2508.02711
  • Jingyi Chen, Ju Seung Byun, Micha Elsner, Pichao Wang, Andrew Perrault, 5 Aug 2025, Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback, https://arxiv.org/abs/2508.03123
  • Yutong Chen, Jiandong Gao, Ji Wu, 5 Aug 2025, Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning, https://arxiv.org/abs/2505.17988
  • Joel Walsh, Siddarth Mamidanna, Benjamin Nye, Mark Core, and Daniel Auerbach, 6 Aug 2025, Fine-tuning for Better Few Shot Prompting: An Empirical Comparison for Short Answer Grading, https://arxiv.org/abs/2508.04063
  • Ali Taheri Ghahrizjani, Alireza Taban, Qizhou Wang, Shanshan Ye, Abdolreza Mirzaei, Tongliang Liu, Bo Han, 6 Aug 2025, Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning, https://arxiv.org/abs/2508.04329
  • Liujian Tang, Shaokang Dong, Yijia Huang, Minqi Xiang, Hongtao Ruan, Bin Wang, Shuo Li, Zhihui Cao, Hailiang Pang, Heng Kong, He Yang, Mingxu Chai, Zhilin Gao, Xingyu Liu, Yingnan Fu, Jiaming Liu, Tao Gui, Xuanjing Huang, Yu-Gang Jiang, Qi Zhang, Kang Wang, Yunke Zhang, Yuran Wang, 19 Jul 2025, MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning, https://arxiv.org/abs/2508.03700
  • Yanjie Dong, Haijun Zhang, Chengming Li, Song Guo, Victor C. M. Leung, Xiping Hu, 6 Aug 2025, Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches, https://arxiv.org/abs/2408.10691
  • Bohao Wu, Qingyun Wang, Yue Guo, 6 Aug 2025, Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning, https://arxiv.org/abs/2505.16227
  • Mahdi Nazari Ashani, Ali Asghar Alesheikh, Saba Kazemi, Kimya Kheirkhah, Yasin Mohammadi, Fatemeh Rezaie, Amir Mahdi Manafi, Hedieh Zarkesh, 6 Aug 2025, Fine-Tuning Small Language Models (SLMs) for Autonomous Web-based Geographical Information Systems (AWebGIS), https://arxiv.org/abs/2508.04846
  • Chang Tian, Matthew B. Blaschko, Mingzhe Xing, Xiuxing Li, Yinliang Yue, Marie-Francine Moens, 6 Aug 2025, Large Language Models Reasoning Abilities Under Non-Ideal Conditions After RL-Fine-Tuning, https://arxiv.org/abs/2508.04848
  • Nan Li, Wanting Yang, Marie Siew, Zehui Xiong, Binbin Chen, Shiwen Mao, Kwok-Yan Lam, 6 Aug 2025, Edge-Assisted Collaborative Fine-Tuning for Multi-User Personalized Artificial Intelligence Generated Content (AIGC), https://arxiv.org/abs/2508.04745
  • Dai Do, Manh Nguyen, Svetha Venkatesh, Hung Le, 7 Aug 2025, SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models, https://arxiv.org/abs/2508.05015
  • Zhongheng Yang, Aijia Sun, Yushang Zhao, Yinuo Yang, Dannier Li, Chengrui Zhou, 7 Aug 2025, RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders, https://arxiv.org/abs/2508.05289
  • Younwoo Choi, Muhammad Adil Asif, Ziwen Han, John Willes, Rahul G. Krishnan, 7 Aug 2025, Teaching LLMs How to Learn with Contextual Fine-Tuning, https://arxiv.org/abs/2503.09032
  • Jin Khye Tan (Faculty of Computer Science and Information Technology, Universiti Malaya), En Jun Choong, Ethan Jeremiah Chitty, Yan Pheng Choo, John Hsin Yang Wong, Chern Eu Cheah, 4 Aug 2025, Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports, https://arxiv.org/abs/2508.05669
  • Kaichuan Kong, Dongjie Liu, Xiaobo Jin, Guanggang Geng, Zhiying Li, Jian Weng, 6 Aug 2025, DMFI: Dual-Modality Fine-Tuning and Inference Framework for LLM-Based Insider Threat Detection, https://arxiv.org/abs/2508.05694
  • Han Gao, Timo Hartmann, Botao Zhong, Kai Lia, Hanbin Luo, 5 Aug 2025, Domain-Specific Fine-Tuning and Prompt-Based Learning: A Comparative Study for developing Natural Language-Based BIM Information Retrieval Systems, https://arxiv.org/abs/2508.05676
  • Jucheng Hu, Surong Yang, Lijun Wu, Dongzhan Zhou, 8 Aug 2025, DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning, https://arxiv.org/abs/2504.14810
  • Mahmoud Salhab, Shameed Sait, Mohammad Abusheikh, Hasan Abusheikh, 12 Aug 2025, Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning, https://arxiv.org/abs/2508.08912
  • Dong Wang, Haris Šikić, Lothar Thiele, Olga Saukh, 12 Aug 2025, Forget the Data and Fine-Tuning! Just Fold the Network to Compress, https://arxiv.org/abs/2502.10216
  • Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani, 12 Aug 2025, Few-Shot Adversarial Low-Rank Fine-Tuning of Vision-Language Models, https://arxiv.org/abs/2505.15130
  • Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong, 12 Aug 2025, Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning, https://arxiv.org/abs/2506.03850
  • Jan Tauberschmidt, Sophie Fellenz, Sebastian J. Vollmer, Andrew B. Duncan, 5 Aug 2025, Physics-Constrained Fine-Tuning of Flow-Matching Models for Generation and Inverse Problems, https://arxiv.org/abs/2508.09156
  • Bokeng Zheng, Jianqiang Zhong, Jiayi Liu, Xiaoxi Zhang, 13 Aug 2025, Decentralized Rank Scheduling for Energy-Constrained Multi-Task Federated Fine-Tuning in Edge-Assisted IoV Networks, https://arxiv.org/abs/2508.09532
  • Zainab Khan, Ahmed Hussain, Mukesh Thakur, Arto Hellas, and Panos Papadimitratos, 12 Aug 2025, NEFMind: Parameter-Efficient Fine-Tuning of Open-Source LLMs for Telecom APIs Automation, https://arxiv.org/abs/2508.09240
  • Basile Lewandowski, Robert Birke, Lydia Y. Chen, 14 Aug 2025, Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models, https://arxiv.org/abs/2508.10993
  • Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou, 15 Aug 2025, On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting, https://arxiv.org/abs/2508.11408
  • Baihong Qian, Haotian Fan, Wenjie Liao, Yunqiu Wang, Tao Li, and Junhui Cui, 15 Aug 2025, Better Supervised Fine-tuning for VQA: Integer-Only Loss, https://arxiv.org/abs/2508.11170
  • Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu, 15 Aug 2025, ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection, https://arxiv.org/abs/2508.11281
  • Yuan Li, Zhengzhong Liu, and Eric Xing, 16 Aug 2025, Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models, https://arxiv.org/abs/2508.11953
  • Daria Diatlova, Nikita Balagansky, Alexander Varlamov, Egor Spirin, 16 Aug 2025, VARAN: Variational Inference for Self-Supervised Speech Models Fine-Tuning on Downstream Tasks, https://arxiv.org/abs/2508.12061
  • Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, Adel Bibi, 17 Aug 2025, Rethinking Safety in LLM Fine-tuning: An Optimization Perspective, https://arxiv.org/abs/2508.12531
  • Yuhao Zhou, Jindi Lv, Yuxin Tian, Dan Si, Qing Ye, Jiancheng Lv, 18 Aug 2025, Deploying Models to Non-participating Clients in Federated Learning without Fine-tuning: A Hypernetwork-based Approach, https://arxiv.org/abs/2508.12673
  • Manning Zhu, Songtao Guo, Pengzhan Zhou, Yansong Ning, Chang Han, Dewen Qiao, 18 Aug 2025, FedSODA: Federated Fine-tuning of LLMs via Similarity Group Pruning and Orchestrated Distillation Alignment, https://arxiv.org/abs/2508.12727
  • Julia Sammartino, Libby Barak, Jing Peng, Anna Feldman, 15 Aug 2025, When Does Language Transfer Help? Sequential Fine-Tuning for Cross-Lingual Euphemism Detection, https://arxiv.org/abs/2508.11831
  • Shiwei Li, Xiandi Luo, Xing Tang, Haozhao Wang, Hao Chen, Weihong Luo, Yuhua Li, Xiuqiang He, Ruixuan Li, 17 Aug 2025, Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics, https://arxiv.org/abs/2505.23194
  • Rafi Ibn Sultan, Chengyin Li, Hui Zhu, Prashant Khanduri, Marco Brocanelli, Dongxiao Zhu, 15 Aug 2025, GeoSAM: Fine-tuning SAM with Multi-Modal Prompts for Mobility Infrastructure Segmentation, https://arxiv.org/abs/2311.11319
  • Keyu Chen, Wenchao Sun, Hao Cheng, Sifa Zheng, 18 Aug 2025, RIFT: Closed-Loop RL Fine-Tuning for Realistic and Controllable Traffic Simulation, https://arxiv.org/abs/2505.03344
  • Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee, 19 Aug 2025, Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation, https://arxiv.org/abs/2508.14031
  • Hassan Barmandah, 19 Aug 2025, Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation, https://arxiv.org/abs/2508.13525
  • Eric Nuertey Coleman, Luigi Quarantiello, Ziyue Liu, Qinwen Yang, Samrat Mukherjee, Julio Hurtado and Vincenzo Lomonaco, 19 Aug 2025, Parameter-Efficient Continual Fine-Tuning: A Survey, https://arxiv.org/abs/2504.13822
  • Yajie Zhou, Xiaoyi Pang, Zhibo Wang, 20 Aug 2025, AFLoRA: Adaptive Federated Fine-Tuning of Large Language Models with Resource-Aware Low-Rank Adaption, https://arxiv.org/abs/2505.24773
  • Xujia Wang, Yunjia Qi, Bin Xu, 20 Aug 2025, LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization, https://arxiv.org/abs/2507.04487
  • Mayla R. Boguslav, Adam Kiehl, David Kott, G. Joseph Strecker, Tracy Webb, Nadia Saklou, Terri Ward, Michael Kirby, 20 Aug 2025, Fine-tuning foundational models to code diagnoses from veterinary health records, https://arxiv.org/abs/2410.15186
  • Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, Jun Wang, 22 Aug 2025, AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs, https://arxiv.org/abs/2508.16153
  • Sungmin Kang, Jisoo Kim, Salman Avestimehr, Sunwoo Lee, 22 Aug 2025, GEM: A Scale-Aware and Distribution-Sensitive Sparse Fine-Tuning Framework for Effective Downstream Adaptation, https://arxiv.org/abs/2508.16191
  • Hangzhan Jin, Sicheng Lv, Sifan Wu, Mohammad Hamdaqa, 22 Aug 2025, RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs, https://arxiv.org/abs/2508.16546
  • Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang, 21 Aug 2025, CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning, https://arxiv.org/abs/2508.15868
  • Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani, 21 Aug 2025, Decentralized Low-Rank Fine-Tuning of Large Language Models, https://arxiv.org/abs/2501.15361
  • Hanyang Zhao, Haoxian Chen, Ji Zhang, David D. Yao and Wenpin Tang, 21 Aug 2025, Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning, https://arxiv.org/abs/2502.01819
  • Jack Youstra, Mohammed Mahfoud, Yang Yan, Henry Sleight, Ethan Perez, Mrinank Sharma, 23 Aug 2025, Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks, https://arxiv.org/abs/2508.17158
  • Wenhong Zhu, Ruobing Xie, Rui Wang, Xingwu Sun, Di Wang, Pengfei Liu, 25 Aug 2025, Proximal Supervised Fine-Tuning, https://arxiv.org/abs/2508.17784
  • Bin Pan, Shiyu Shen, Zongbin Wang, Zhenwei Shi and Xia Xu, 23 Aug 2025, Preserving Domain Generalization in Fine-Tuning via Joint Parameter Selection, https://arxiv.org/abs/2508.16976
  • Haojie Zhang, 24 Aug 2025, DropLoRA: Sparse Low-Rank Adaptation for Parameter-Efficient Fine-Tuning, https://arxiv.org/abs/2508.17337
  • Yuhao Zhang, Shaoming Duan, Jinhang Su, Chuanyi Liu, Peiyi Han, 4 Sep 2025, SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning, https://arxiv.org/abs/2509.03937
  • Junyu Yan, Feng Chen, Yuyang Xue, Yuning Du, Konstantinos Vilouras, Sotirios A. Tsaftaris, Steven McDonagh, 4 Sep 2025, SWiFT: Soft-Mask Weight Fine-tuning for Bias Mitigation, https://arxiv.org/abs/2508.18826
  • Wei Huang, Huang Wei, Yinggui Wang, 4 Sep 2025, DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression, https://arxiv.org/abs/2509.01221
  • Cheng Peng, Xinyu Dong, Mengxian Lyu, Daniel Paredes, Yaoyun Zhang, Yonghui Wu, 5 Sep 2025, A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning, https://arxiv.org/abs/2509.04753
  • Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu, 5 Sep 2025, Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning, https://arxiv.org/abs/2408.09600
  • William F. Shen, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane, 5 Sep 2025, Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning, https://arxiv.org/abs/2506.14387
  • Gang Hu, Yinglei Teng, Pengfei Wu, and Nan Wang, 26 Aug 2025, FFT-MoE: Efficient Federated Fine-Tuning for Foundation Models via Large-scale Sparse MoE under Heterogeneous Edge, https://arxiv.org/abs/2508.18663
  • Qing Xiao, Yingshan Peng and PeiPei Zhang, 26 Aug 2025, Cross-Learning Fine-Tuning Strategy for Dysarthric Speech Recognition Via CDSD database, https://arxiv.org/abs/2508.18732
  • Qihang Ai, Pi Bu, Yue Cao, Yingyao Wang, Jihao Gu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Zhicheng Zheng, Jun Song, Yuning Jiang, Bo Zheng, 27 Aug 2025, InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning, https://arxiv.org/abs/2508.19679
  • Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi, 26 Aug 2025, Fine-Tuning Vision-Language Models for Neutrino Event Analysis in High-Energy Physics Experiments, https://arxiv.org/abs/2508.19376
  • Yuhang Liu, Tao Li, Zhehao Huang, Zuopeng Yang, and Xiaolin Huang, 27 Aug 2025, Bi-LoRA: Efficient Sharpness-Aware Minimization for Fine-Tuning Large-Scale Models, https://arxiv.org/abs/2508.19564
  • Fahao Chen, Jie Wan, Peng Li, Zhou Su, Dongxiao Yu, 26 Aug 2025, Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices, https://arxiv.org/abs/2508.19078
  • Manuel Mosquera, Melissa Robles, Johan Rodriguez, Ruben Manrique, 26 Aug 2025, Improving Low-Resource Translation with Dictionary-Guided Fine-Tuning and RL: A Spanish-to-Wayuunaiki Study, https://arxiv.org/abs/2508.19481
  • Fatema Siddika, Md Anwar Hossen, J. Pablo Muñoz, Tanya Roosta, Anuj Sharma, Ali Jannesari, 27 Aug 2025, FedReFT: Federated Representation Fine-Tuning with All-But-Me Aggregation, https://arxiv.org/abs/2508.20295
  • Weitao Feng, Lixu Wang, Tianyi Wei, Jie Zhang, Chongyang Gao, Sinong Zhan, Peizhuo Lv, Wei Dong, 28 Aug 2025, Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning, https://arxiv.org/abs/2508.20697
  • Jinyuan Feng, Chaopeng Wei, Tenghai Qiu, Tianyi Hu, Zhiqiang Pu, 28 Aug 2025, CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning, https://arxiv.org/abs/2505.17553
  • Ali Nazari and Michael Weiss, 28 Aug 2025, Fine-Tuning Topics through Weighting Aspect Keywords, https://arxiv.org/abs/2502.08496
  • Jessica Liang, Anirudh Bharadwaj, 29 Aug 2025, QR-LoRA: QR-Based Low-Rank Adaptation for Efficient Fine-Tuning of Large Language Models, https://arxiv.org/abs/2508.21810
  • Guofu Liao, Taotao Wang, Shengli Zhang, Jiqun Zhang, Shi Long, and Dacheng Tao, 29 Aug 2025, zkLoRA: Fine-Tuning Large Language Models with Verifiable Security via Zero-Knowledge Proofs, https://arxiv.org/abs/2508.21393
  • Zinan Tang, Xin Gao, Qizhi Pei, Zhuoshi Pan, Mengzhang Cai, Jiang Wu, Conghui He and Lijun Wu, 29 Aug 2025, Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning, https://arxiv.org/abs/2508.21589
  • João Valente, Atabak Dehban, Rodrigo Ventura, 29 Aug 2025, CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models, https://arxiv.org/abs/2508.21732
  • Yanxiao Zhao, Yaqian Li, Zihao Bo, Rinyoichi Takezoe, Haojia Hui, Mo Guang, Lei Ren, Xiaolin Qin, Kaiwen Long, 31 Aug 2025, SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs, https://arxiv.org/abs/2509.00930
  • Elie Thellier (EPIONE), Huiyu Li (EPIONE), Nicholas Ayache (EPIONE), Hervé Delingette (EPIONE), 20 Aug 2025, Mitigating Data Exfiltration Attacks through Layer-Wise Learning Rate Decay Fine-Tuning, https://arxiv.org/abs/2509.00027
  • Shikun Liu, Deyu Zou, Nima Shoghi, Victor Fung, Kai Liu, Pan Li, 30 Aug 2025, RoFt-Mol: Benchmarking Robust Fine-Tuning with Molecular Graph Foundation Models, https://arxiv.org/abs/2509.00614
  • Xinlu Zhang, Na Yan, Yang Su, Yansha Deng, Toktam Mahmoodi, 1 Sep 2025, Communication-Aware Knowledge Distillation for Federated LLM Fine-Tuning over Wireless Networks, https://arxiv.org/abs/2509.01750
  • Wenlong Mou, 2 Sep 2025, Is RL fine-tuning harder than regression? A PDE learning approach for diffusion models, https://arxiv.org/abs/2509.02528
  • Asif Mohammed Saad, Umme Niraj Mahi, 2 Sep 2025, SegFormer Fine-Tuning with Dropout: Advancing Hair Artifact Removal in Skin Lesion Analysis, https://arxiv.org/abs/2509.02156
  • Sifeng Shang, Jiayi Zhou, Chenyu Lin, Minxian Li, Kaiyang Zhou, 1 Sep 2025, Fine-tuning Quantized Neural Networks with Zeroth-order Optimization, https://arxiv.org/abs/2505.13430
  • Xingyu Su, Xiner Li, Masatoshi Uehara, Sunwoo Kim, Yulai Zhao, Gabriele Scalia, Ehsan Hajiramezanali, Tommaso Biancalani, Degui Zhi, Shuiwang Ji, 30 Aug 2025, Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design, https://arxiv.org/abs/2507.00445
  • Xin Chen, Shuaijun Chen, Omid Tavallaie, Nguyen Tran, Shuhuang Xiang, Albert Zomaya, 30 Aug 2025, Convergence Analysis of Aggregation-Broadcast in LoRA-enabled Distributed Fine-Tuning, https://arxiv.org/abs/2508.01348
  • Jianwei Wang, Chengming Shi, Junyao Yang, Haoran Li, Qianli Ma, Huiping Zhuang, Cen Chen and Ziqian Zeng, 31 Aug 2025, RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis, https://arxiv.org/abs/2502.18517
  • Linus Jern, Valter Uotila, Cong Yu, Bo Zhao, 1 Sep 2025, Agent-Q: Fine-Tuning Large Language Models for Quantum Circuit Generation and Optimization, https://arxiv.org/abs/2504.11109
  • Christopher Subia-Waud (Rayonlabs Team), 3 Sep 2025, Gradients: When Markets Meet Fine-tuning -- A Distributed Approach to Model Optimisation, https://arxiv.org/abs/2506.07940
  • Xiang Yuan, Jun Shu, Deyu Meng, Zongben Xu, 31 Aug 2025, Feed Two Birds with One Scone: Exploiting Function-Space Regularization for Both OOD Robustness and ID Fine-Tuning Performance, https://arxiv.org/abs/2509.05328
  • ZiXuan Zhang, Bowen Hao, Yingjie Li, Hongzhi Yin, 6 Sep 2025, ZhiFangDanTai: Fine-tuning Graph-based Retrieval-Augmented Generation Model for Traditional Chinese Medicine Formula, https://arxiv.org/abs/2509.05867
  • Joe Wilder, Nikhil Kadapala, Benji Xu, Mohammed Alsaadi, Aiden Parsons, Mitchell Rogers, Palash Agarwal, Adam Hassick, Laura Dietz, 8 Sep 2025, UNH at CheckThat! 2025: Fine-tuning Vs Prompting in Claim Extraction, https://arxiv.org/abs/2509.06883
  • Lishan Yang, Nam Kha Nguygen, Po Hu, Wei Emma Zhang, Yanjun Shu, Mong Yuan Sim and Weitong Chen, 1 Sep 2025, FediLoRA: Heterogeneous LoRA for Federated Multimodal Fine-tuning under Missing Modalities, https://arxiv.org/abs/2509.06984
  • Xiao Li, Bharat Gandhi, Ming Zhan, Mohit Nehra, Zhicheng Zhang, Yuchen Sun, Meijia Song, Naisheng Zhang, Xi Wang, 9 Sep 2025, Fine-Tuning Vision-Language Models for Visual Navigation Assistance, https://arxiv.org/abs/2509.07488
  • Michele Joshua Maggini, Dhia Merzougui, Rabiraj Bandyopadhyay, Gaël Dias, Fabrice Maurel, Pablo Gamallo, 9 Sep 2025, Are LLMs Enough for Hyperpartisan, Fake, Polarized and Harmful Content Detection? Evaluating In-Context Learning vs. Fine-Tuning, https://arxiv.org/abs/2509.07768
  • Jiahao Chen, Zhiyuan Huang, Yurou Liu, Bing Su, 12 Sep 2025, LoFT: Parameter-Efficient Fine-Tuning for Long-tailed Semi-Supervised Learning in Open-World Scenarios, https://arxiv.org/abs/2509.09926
  • Himanshu Thakur, Eshani Agrawal, Smruthi Mukund, 18 Aug 2025, Personas within Parameters: Fine-Tuning Small Language Models with Low-Rank Adapters to Mimic User Behaviors, https://arxiv.org/abs/2509.09689
  • Talha Tahir, 8 Sep 2025, The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization, https://arxiv.org/abs/2509.09712
  • Hao Zhang, Bo Huang, Zhenjia Li, Xi Xiao, Hui Yi Leong, Zumeng Zhang, Xinwei Long, Tianyang Wang, Hao Xu, 11 Sep 2025, Sensitivity-LoRA: Low-Load Sensitivity-Based Fine-Tuning for Large Language Models, https://arxiv.org/abs/2509.09119
  • Honghui Xu, Shiva Shrestha, Wei Chen, Zhiyuan Li, Zhipeng Cai, 11 Sep 2025, DP-FedLoRA: Privacy-Enhanced Federated Fine-Tuning for On-Device Large Language Models, https://arxiv.org/abs/2509.09097
  • Leonardo Matone, Ben Abramowitz, Ben Armstrong, Avinash Balakrishnan, Nicholas Mattei, 11 Sep 2025, DeepVoting: Learning and Fine-Tuning Voting Rules with Canonical Embeddings, https://arxiv.org/abs/2408.13630
  • Marko Tuononen, Heikki Penttinen, Ville Hautamäki, 19 Sep 2025, Targeted Fine-Tuning of DNN-Based Receivers via Influence Functions, https://arxiv.org/abs/2509.15950
  • Baichuan Huang, Ananth Balashankar, Amir Aminifar, 19 Sep 2025, BEFT: Bias-Efficient Fine-Tuning of Language Models, https://arxiv.org/abs/2509.15974
  • Youngwon Choi, Jaeyoon Jung, Hyeonyu Kim, Huu-Kim Nguyen, Hwayeon Kim, 18 Sep 2025, Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech data, https://arxiv.org/abs/2509.15389
  • Ishika Agarwal, Dilek Hakkani-Tür, 19 Sep 2025, Neural Networks for Learnable and Scalable Influence Estimation of Instruction Fine-Tuning Data, https://arxiv.org/abs/2502.09969
  • Shiwan Zhao, Xuyang Zhao, Jiaming Zhou, Aobo Kong, Qicheng Li, Yong Qin, 19 Sep 2025, Mind the Gap: Data Rewriting for Stable Off-Policy Supervised Fine-Tuning, https://arxiv.org/abs/2509.15157
  • MSR Avinash, 7 Sep 2025, Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study, https://arxiv.org/abs/2509.12229
  • Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, Mohammad Hamdaqa, 8 Sep 2025, RL Fine-Tuning Heals OOD Forgetting in SFT, https://arxiv.org/abs/2509.12235
  • Mengyi Deng, Xin Li, Tingyu Zhu, Zhicheng Yang, Zhijiang Guo, Wei Wang, 16 Sep 2025, When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning, https://arxiv.org/abs/2509.13079
  • Bo Yin, Xingyi Yang, Xinchao Wang, 16 Sep 2025, Don't Forget the Nonlinearity: Unlocking Activation Functions in Efficient Fine-Tuning, https://arxiv.org/abs/2509.13240
  • Rodrigo M Carrillo-Larco, 16 Sep 2025, LLMs for energy and macronutrients estimation using only text data from 24-hour dietary recalls: a parameter-efficient fine-tuning experiment using a 10-shot prompt, https://arxiv.org/abs/2509.13268
  • Kiho Lee, Jungkon Kim, Doowon Kim, Hyoungshick Kim, 16 Sep 2025, A Systematic Evaluation of Parameter-Efficient Fine-Tuning Methods for the Security of Code LLMs, https://arxiv.org/abs/2509.12649
  • Yao Liang, Dongcheng Zhao, Feifei Zhao, Guobin Shen, Yuwei Wang, Dongqi Liang, Yi Zeng, 16 Sep 2025, MVPBench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values, https://arxiv.org/abs/2509.08022
  • Pengcheng Luo, Yunyang Zhao, Bowen Zhang, Genke Yang, Boon-Hee Soong, Chau Yuen, 30 Aug 2025, SABR: A Stable Adaptive Bitrate Framework Using Behavior Cloning Pretraining and Reinforcement Learning Fine-Tuning, https://arxiv.org/abs/2509.10486
  • Milan Marocchi, Matthew Fynn, Kayapanda Mandana, Yue Rong, 15 Sep 2025, Scaling to Multimodal and Multichannel Heart Sound Classification: Fine-Tuning Wav2Vec 2.0 with Synthetic and Augmented Biosignals, https://arxiv.org/abs/2509.11606
  • Lei Wang, Jieming Bian, Letian Zhang, Jie Xu, 18 Sep 2025, Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning, https://arxiv.org/abs/2509.15087
  • Yeongbin Seo, Dongha Lee, Jaehyung Kim, Jinyoung Yeo, 18 Sep 2025, Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning, https://arxiv.org/abs/2509.15188
  • Gustavo Sandoval, Denys Fenchenko and Junyao Chen, 15 Sep 2025, Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models, https://arxiv.org/abs/2509.14271
  • Chenjun Li, Laurin Lux, Alexander H. Berger, Martin J. Menten, Mert R. Sabuncu, Johannes C. Paetzold, 17 Sep 2025, Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis, https://arxiv.org/abs/2503.09808
  • Yu Cheng Chih, Yong Hao Hou, 10 Sep 2025, Low-Resource Fine-Tuning for Multi-Task Structured Information Extraction with a Billion-Parameter Instruction-Tuned Model, https://arxiv.org/abs/2509.08381
  • Alejandro Moreno Arcas, Albert Sanchis, Jorge Civera, Alfons Juan, 10 Sep 2025, HOFT: Householder Orthogonal Fine-tuning, https://arxiv.org/abs/2505.16531
  • Pittawat Taveekitworachai, Potsawee Manakul, Sarana Nutanong, Kunat Pipatanakul, 10 Sep 2025, Prior Prompt Engineering for Reinforcement Fine-Tuning, https://arxiv.org/abs/2505.14157
  • Shambhavi Krishna, Atharva Naik, Chaitali Agarwal, Sudharshan Govindan, Taesung Lee, Haw-Shiuan Chang, 17 Sep 2025, Latent Traits and Cross-Task Transfer: Deconstructing Dataset Interactions in LLM Fine-tuning, https://arxiv.org/abs/2509.13624
  • Haoteng Yin, Rongzhe Wei, Eli Chien, Pan Li, 16 Sep 2025, Privately Learning from Graphs with Applications in Fine-tuning Large Language Models, https://arxiv.org/abs/2410.08299
  • Adel ElZemity, Budi Arief and Shujun Li, 17 Sep 2025, CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning, https://arxiv.org/abs/2503.09334
  • Humaid Ibrahim, Nikolai Rozanov, Marek Rei, 1 Oct 2025, Fine-tuning with RAG for Improving LLM Learning of New Skills, https://arxiv.org/abs/2510.01375
  • Jaeyeon Kim, Seunggeun Kim, Taekyun Lee, David Z. Pan, Hyeji Kim, Sham Kakade, Sitan Chen, 1 Oct 2025, Fine-Tuning Masked Diffusion for Provable Self-Correction, https://arxiv.org/abs/2510.01384
  • Haotian Xiang, Jinwen Xu, Qin Lu, 1 Oct 2025, Fine-tuning LLMs with variational Bayesian last layer for high-dimensional Bayesian optimization, https://arxiv.org/abs/2510.01471
  • Zhaoyi Li, Jingtao Ding, Yong Li, Shihua Li, 2 Oct 2025, Fine-Tuning Flow Matching via Maximum Likelihood Estimation of Reconstructions, https://arxiv.org/abs/2510.02081
  • Neal Gregory Lawton, Alfy Samuel, Anoop Kumar, Daben Liu, 2 Oct 2025, A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation, https://arxiv.org/abs/2510.01600
  • Nilay Naharas, Dang Nguyen, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, Baharan Mirzasoleiman, 1 Oct 2025, Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories, https://arxiv.org/abs/2510.01454
  • Kathy Garcia and Leyla Isik, 1 Oct 2025, Aligning Video Models with Human Social Judgments via Behavior-Guided Fine-Tuning, https://arxiv.org/abs/2510.01502
  • Junseo Hwang, Wonguk Cho, Taesup Kim, 2 Oct 2025, PiCa: Parameter-Efficient Fine-Tuning with Column Space Projection, https://arxiv.org/abs/2505.20211
  • Kaustubh Ponkshe, Raghav Singhal, Eduard Gorbunov, Alexey Tumanov, Samuel Horvath, Praneeth Vepakomma, 2 Oct 2025, Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning, https://arxiv.org/abs/2411.19557
  • Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma, 2 Oct 2025, ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models, https://arxiv.org/abs/2505.14238
  • Tian Xia, Matthew Sinclair, Andreas Schuh, Fabio De Sousa Ribeiro, Raghav Mehta, Rajat Rasal, Esther Puyol-Antón, Samuel Gerber, Kersten Petersen, Michiel Schaap, Ben Glocker, 2 Oct 2025, Segmentor-Guided Counterfactual Fine-Tuning for Locally Coherent and Targeted Image Synthesis, https://arxiv.org/abs/2509.24913
  • Abdulhady Abas Abdullah, Arkaitz Zubiaga, Seyedali Mirjalili, Amir H. Gandomi, Fatemeh Daneshfar, Mohammadsadra Amini, Alan Salam Mohammed, Hadi Veisi, 14 Oct 2025, Evolution of Meta's Llama models and parameter-efficient fine-tuning of large language models: a survey, https://arxiv.org/abs/2510.12178
  • Yukun Zhang, and Qi Dong, 14 Oct 2025, Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models, https://arxiv.org/abs/2510.12044
  • Sijing Xie, Dingzhu Wen, Changsheng You, Qimei Chen, Mehdi Bennis, and Kaibin Huang, 14 Oct 2025, FedLoDrop: Federated LoRA with Dropout for Generalized LLM Fine-tuning, https://arxiv.org/abs/2510.12078
  • Rohan Kadekodi, Zhan Jin, Keisuke Kamahori, Yile Gu, Sean Khatiri, Noah H. Bayindirli, Sergey Gorbunov and Baris Kasikci, 30 Sep 2025, DualTune: Decoupled Fine-Tuning for On-Device Agentic Systems, https://arxiv.org/abs/2510.00229
  • Xin Yu, Cong Xie, Ziyu Zhao, Tiantian Fan, Lingzhou Xue, Zhi Zhang, 30 Sep 2025, PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning, https://arxiv.org/abs/2510.00192
  • Zhanda Zhu, Qidong Su, Yaoyao Ding, Kevin Song, Shang Wang, and Gennady Pekhimenko, 30 Sep 2025, LoRAFusion: Efficient LoRA Fine-Tuning for LLMs, https://arxiv.org/abs/2510.00206
  • Ayush Jain, Andrea Montanari, Eren Sasoglu, 1 Oct 2025, Train on Validation (ToV): Fast data selection with applications to fine-tuning, https://arxiv.org/abs/2510.00386
  • Kairun Zhang, Haoyu Li, Yanjun Zhao, Yifan Sun, Huan Zhang, 1 Oct 2025, Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs, https://arxiv.org/abs/2510.00419
  • Ali Dadsetan, Frank Rudzicz, 1 Oct 2025, Sample-Efficient Differentially Private Fine-Tuning via Gradient Matrix Denoising, https://arxiv.org/abs/2510.01137
  • Zhexiong Liu, Diane Litman, 30 Sep 2025, Efficient Layer-wise LLM Fine-tuning for Revision Intention Prediction, https://arxiv.org/abs/2510.00268
  • Run Su, Hao Fu, Shuai Zhou, and Yingao Fu, 1 Oct 2025, Integrating Offline Pre-Training with Online Fine-Tuning: A Reinforcement Learning Approach for Robot Social Navigation, https://arxiv.org/abs/2510.00466
  • Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong, 1 Oct 2025, Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum, https://arxiv.org/abs/2510.00526
  • Roshan Kenia, Anfei Li, Rishabh Srivastava, Kaveri A. Thakoor, 1 Oct 2025, AI-CNet3D: An Anatomically-Informed Cross-Attention Network with Multi-Task Consistency Fine-tuning for 3D Glaucoma Classification, https://arxiv.org/abs/2510.00882
  • Matteo Fuoli, Weihang Huang, Jeannette Littlemore, Sarah Turner, Ellen Wilding, 1 Oct 2025, Metaphor identification using large language models: A comparison of RAG, prompt engineering, and fine-tuning, https://arxiv.org/abs/2509.24866
  • Matteo Cardoni, Sam Leroux, 24 Sep 2025, Predictive Coding-based Deep Neural Network Fine-tuning for Computationally Efficient Domain Adaptation, https://arxiv.org/abs/2509.20269
  • Jingyi Wang, Zhongyuan Zhao, Qingtian Wang, Zexu Li, Yue Wang, Tony Q. S. Quek, 5 Sep 2025, A Federated Fine-Tuning Paradigm of Foundation Models in Heterogenous Wireless Networks, https://arxiv.org/abs/2509.19306
  • Adrien Goldszal, Diego Calanzone, Vincent Taboga, Pierre-Luc Bacon, 23 Sep 2025, Discovery of Sustainable Refrigerants through Physics-Informed RL Fine-Tuning of Sequence Models, https://arxiv.org/abs/2509.19588
  • Babak Barazandeh, Subhabrata Majumdar, Om Rajyaguru, George Michailidis, 23 Sep 2025, Localized LoRA: A Structured Low-Rank Approximation for Efficient Fine-Tuning, https://arxiv.org/abs/2506.00236
  • Hui Yi Leong, Yi Fan Gao, Ji Shuai, Yang Zhang, Uktu Pamuksuz, 24 Sep 2025, Efficient Fine-Tuning of Large Language Models for Automated Medical Documentation, https://arxiv.org/abs/2409.09324
  • Yingming Zheng, Hanqi Li, Kai Yu and Lu Chen, 24 Sep 2025, When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models, https://arxiv.org/abs/2509.18762
  • Wenpin Tang and Fuzhong Zhou, 23 Sep 2025, Fine-tuning of diffusion models via stochastic control: entropy regularization and beyond, https://arxiv.org/abs/2403.06279
  • Yilang Zhang, Xiaodong Yang, Yiwei Cai, Georgios B. Giannakis, 27 Oct 2025, ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning, https://arxiv.org/abs/2510.23818
  • Kanghyun Choi, Hyeyoon Lee, SunJong Park, Dain Kwon, Jinho Lee, 28 Oct 2025, FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic, https://arxiv.org/abs/2510.24061
  • Marton Szep, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, Florian Hinterwimmer, 14 Nov 2024, Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide, https://arxiv.org/abs/2411.09539
  • Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig, 28 Oct 2025, Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents, https://arxiv.org/abs/2510.24702
  • Shufan Shen, Junshu Sun, Shuhui Wang, Qingming Huang, 28 Oct 2025, Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models, https://arxiv.org/abs/2510.24037
  • Amit Peleg, Naman Deep Singh, Matthias Hein, 28 Oct 2025, Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning, https://arxiv.org/abs/2505.24424
  • Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, Huan Zhang, 28 Oct 2025, Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay, https://arxiv.org/abs/2506.05316
  • Nathan Paull, 28 Oct 2025, CustomIR: Unsupervised Fine-Tuning of Dense Embeddings for Known Document Corpora, https://arxiv.org/abs/2510.21729
  • Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon, Trung Bui, Anup Rao, Jayakumar Subramanian, Branislav Kveton, 27 Oct 2025, Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization, https://arxiv.org/abs/2506.06964
  • Chen Wang, Zhaochun Li, Jionghao Bai, Yuzhi Zhang, Shisheng Cui, Zhou Zhao, Yue Wang, 23 Oct 2025, Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Fine-tuning, https://arxiv.org/abs/2510.08141
  • Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen, 23 Oct 2025, Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification, https://arxiv.org/abs/2505.16722
  • Saransh Gupta, Umesh Deshpande, Travis Janssen, Swami Sundararaman, 23 Oct 2025, Symbiosis: Multi-Adapter Inference and Fine-Tuning, https://arxiv.org/abs/2507.03220
  • Igli Begolli, Meltem Aksoy, Daniel Neider, 23 Oct 2025, Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects, https://arxiv.org/abs/2507.19271
  • M. H. I. Abdalla, Zhipin Wang, Christian Frey, Steffen Eger, Josif Grabocka, 21 Oct 2025, Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning, https://arxiv.org/abs/2510.19733
  • Tuowei Wang, Kun Li, Zixu Hao, Donglin Bai, Ju Ren, Yaoxue Zhang, Ting Cao, Mao Yang, 12 Oct 2025, Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity, https://arxiv.org/abs/2510.15964
  • Changsheng Wang, Xin Chen, Sijia Liu, Ke Ding, 15 Oct 2025, Breaking Memorization Barriers in LLM Code Fine-Tuning via Information Bottleneck for Improved Generalization, https://arxiv.org/abs/2510.16022
  • Heming Zou, Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji, 19 Oct 2025, Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning, https://arxiv.org/abs/2510.16882
  • Sarah Egler, John Schulman, Nicholas Carlini, 17 Oct 2025, Detecting Adversarial Fine-tuning with Auditing Agents, https://arxiv.org/abs/2510.16255
  • Mingzheng Zhang, Jinfeng Gao, Dan Xu, Jiangrui Yu, Yuhan Qiao, Lan Chen, Jin Tang, and Xiao Wang, 19 Oct 2025, EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation, https://arxiv.org/abs/2510.16776
  • Akif Islam and Mohd Ruhul Ameen, 19 Oct 2025, Parameter-Efficient Fine-Tuning for Low-Resource Languages: A Comparative Study of LLMs for Bengali Hate Speech Detection, https://arxiv.org/abs/2510.16985
  • Yupeng Chen, Senmiao Wang, Yushun Zhang, Zhihang Lin, Haozhe Zhang, Weijian Sun, Tian Ding, Ruoyu Sun, 18 Oct 2025, MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning, https://arxiv.org/abs/2407.20999
  • Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Peter Deisenroth, Sepp Hochreiter, 20 Oct 2025, Parameter Efficient Fine-tuning via Explained Variance Adaptation, https://arxiv.org/abs/2410.07170
  • Mingyang Liu, Gabriele Farina, Asuman Ozdaglar, 19 Oct 2025, UFT: Unifying Supervised and Reinforcement Fine-Tuning, https://arxiv.org/abs/2505.16984
  • Binghao Huang, Jie Xu, Iretiayo Akinola, Wei Yang, Balakumar Sundaralingam, Rowland O'Flaherty, Dieter Fox, Xiaolong Wang, Arsalan Mousavian, Yu-Wei Chao, Yunzhu Li, 18 Oct 2025, VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning, https://arxiv.org/abs/2510.14930
  • Lovely Yeswanth Panchumarthi, Saurabh Kataria, Yi Wu, Xiao Hu, Alex Fedorov, Hyunjung Gloria Kwak, 20 Sep 2025, FairTune: A Bias-Aware Fine-Tuning Framework Towards Fair Heart Rate Prediction from PPG, https://arxiv.org/abs/2509.16491
  • Junjie Ye, Yuming Yang, Yang Nan, Shuo Li, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan, 20 Sep 2025, Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels, https://arxiv.org/abs/2509.16596
  • Salha Alyami, Amani Jamal, Areej Alhothali, 20 Sep 2025, Domain-Adaptive Pre-Training for Arabic Aspect-Based Sentiment Analysis: A Comparative Study of Domain Adaptation and Fine-Tuning Strategies, https://arxiv.org/abs/2509.16788
  • Talha Tahir, 20 Sep 2025, Fine-Tuning Open-Weight Language Models to Deliver Cognitive Behavioral Therapy for Depression: A Feasibility Study, https://arxiv.org/abs/2412.00251
  • Lei Gao, Amir Ziashahabi, Yue Niu, Salman Avestimehr, Murali Annavaram, 20 Sep 2025, MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines, https://arxiv.org/abs/2409.15520
  • Yilang Zhang, Bingcong Li, Georgios B. Giannakis, 21 Sep 2025, RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models, https://arxiv.org/abs/2505.18877
  • Yang Wang, Qibin Liang, Chenghao Xiao, Yizhi Li, Noura Al Moubayed, Chenghua Lin, 22 Sep 2025, Audio Contrastive-based Fine-tuning: Decoupling Representation Learning and Classification, https://arxiv.org/abs/2309.11895
  • Yujie Xiao, Gongzhen Tang, Wenhui Liu, Jun Li, Guangkun Nie, Zhuoran Kan, Deyun Zhang, Qinghao Zhao, Shenda Hong, 25 Oct 2025, AnyECG-Lab: An Exploration Study of Fine-tuning an ECG Foundation Model to Estimate Laboratory Values from Single-Lead ECG Signals, https://arxiv.org/abs/2510.22301
  • Anh Pham, Mihir Thalanki, Michael Sun, Aditya Chaloo, Ankita Gupta, Tian Xia, Aditya Mate, Ehimwenma Nosakhare, Soundararajan Srinivasan, 23 Oct 2025, Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning, https://arxiv.org/abs/2510.21885
  • Andrei Baroian, 25 Oct 2025, Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER, https://arxiv.org/abs/2510.22285
  • Noshitha Padma Pratyusha Juttu, Sahithi Singireddy, Sravani Gona and Sujal Timilsina, 26 Oct 2025, Text to Trust: Evaluating Fine-Tuning and LoRA Trade-offs in Language Models for Unfair Terms of Service Detection, https://arxiv.org/abs/2510.22531
  • Wanxin Tian, Shijie Zhang, Kevin Zhang, Xiaowei Chi, Chunkai Fan, Junyu Lu, Yulin Luo, Qiang Zhou, Yiming Zhao, Ning Liu, Siyu Lin, Zhiyuan Qin, Xiaozhu Ju, Shanghang Zhang, Jian Tang, 27 Oct 2025, SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents, https://arxiv.org/abs/2506.21669
  • Reza Shirkavand, Peiran Yu, Qi He, Heng Huang, 27 Oct 2025, Bilevel ZOFO: Bridging Parameter-Efficient and Zeroth-Order Techniques for Efficient LLM Fine-Tuning and Meta-Training, https://arxiv.org/abs/2502.03604
  • Shivam Ratnakar, Abhiroop Talasila, Raghav Chamadiya, Nikhil Agarwal, Vinayak K Doifode, 26 Oct 2025, Beyond QA Pairs: Assessing Parameter-Efficient Fine-Tuning for Fact Embedding in LLMs, https://arxiv.org/abs/2503.01131
  • Nicolas Menet, Aleksandar Terzić, Andreas Krause, Abbas Rahimi, 15 Oct 2025, Thompson Sampling via Fine-Tuning of LLMs, https://arxiv.org/abs/2510.13328
  • Shingo Ayabe, Hiroshi Kera, Kazuhiko Kawamoto, 15 Oct 2025, Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control, https://arxiv.org/abs/2510.13358
  • Jingyao Wang, Wenwen Qiang, Zeen Song, Changwen Zheng, Hui Xiong, 15 Oct 2025, Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs, https://arxiv.org/abs/2505.10425
  • Zilun Zhang, Zian Guan, Tiancheng Zhao, Haozhan Shen, Tianyu Li, Yuxiang Cai, Zhonggen Su, Zhaojun Liu, Jianwei Yin, Xiang Li, 15 Oct 2025, Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning, https://arxiv.org/abs/2509.21976
  • Guanghao Zhu, Zhitian Hou, Zeyu Liu, Zhijie Sang, Congkai Xie, Hongxia Yang, 26 Sep 2025, InfiMed-Foundation: Pioneering Advanced Multimodal Medical Models with Compute-Efficient Pre-Training and Multi-Stage Fine-Tuning, https://arxiv.org/abs/2509.22261
  • Feng Yu and Jia Hu and Geyong Min, 25 Sep 2025, Blockwise Hadamard high-Rank Adaptation for Parameter-Efficient LLM Fine-Tuning, https://arxiv.org/abs/2509.21637
  • Shilei Cao, Hehai Lin, Jiashun Cheng, Yang Liu, Guowen Li, Xuehe Wang, Juepeng Zheng, Haoyuan Liang, Meng Jin, Chengwei Qin, Hong Cheng, Haohuan Fu, 26 Sep 2025, Task-Adaptive Parameter-Efficient Fine-Tuning for Weather Foundation Models, https://arxiv.org/abs/2509.22020
  • Aayush Mishra, Daniel Khashabi, Anqi Liu, 26 Sep 2025, IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning, https://arxiv.org/abs/2509.22621
  • Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang, 26 Sep 2025, Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation, https://arxiv.org/abs/2505.11235
  • Jaedong Hwang, Brian Cheung, Zhang-Wei Hong, Akhilan Boopathy, Pulkit Agrawal, Ila Fiete, 26 Sep 2025, Large Pre-Training Datasets Don't Always Guarantee Robustness after Fine-Tuning, https://arxiv.org/abs/2410.21582
  • Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang and Kai Chen, 26 Sep 2025, Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective, https://arxiv.org/abs/2506.23508
  • Samyak Jhaveri, Vanessa Klotzmann, Crista Lopes, 26 Sep 2025, ACCeLLiuM: Supervised Fine-Tuning for Automated OpenACC Pragma Generation, https://arxiv.org/abs/2509.20380
  • Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani, 26 Sep 2025, pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models, https://arxiv.org/abs/2507.05394
  • Aryan Golbaghi, Shuo Zhou, 8 Oct 2025, Enhancing Speech Emotion Recognition via Fine-Tuning Pre-Trained Models and Hyper-Parameter Optimisation, https://arxiv.org/abs/2510.07052
  • Gautham Govind Anil, Shaan Ul Haque, Nithish Kannen, Dheeraj Nagaraj, Sanjay Shakkottai, Karthikeyan Shanmugam, 3 Oct 2025, Fine-Tuning Diffusion Models via Intermediate Distribution Shaping, https://arxiv.org/abs/2510.02692
  • Derek Shi, Ruben Glatt, Christine Klymko, Shubham Mohole, Hongjun Choi, Shashank Kushwaha, Sam Sakla, Felipe Leno da Silva, 2 Oct 2025, Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models through Reinforcement Learning from Ranking Feedback, https://arxiv.org/abs/2510.02561
  • Daphne Tsolissou, Theofanis Ganitidis, Konstantinos Mitsis, Stergios Christodoulidis, Maria Vakalopoulou, Konstantina Nikita, 3 Oct 2025, Multimodal Carotid Risk Stratification with Large Vision-Language Models: Benchmarking, Fine-Tuning, and Clinical Insights, https://arxiv.org/abs/2510.02922
  • Jannik Graebner, Ryne Beeson, 2 Oct 2025, Self-supervised diffusion model fine-tuning for costate initialization using Markov chain Monte Carlo, https://arxiv.org/abs/2510.02527
  • He Zhu, Junyou Su, Peng Lai, Ren Ma, Wenjia Zhang, Linyi Yang, Guanhua Chen, 3 Oct 2025, Anchored Supervised Fine-Tuning, https://arxiv.org/abs/2509.23753
  • Huan Song, Deeksha Razdan, Yiyue Qian, Arijit Ghosh Chowdhury, Parth Patwa, Aman Chadha, Shinan Zhang, Sharlina Keshava, Hannah Marlowe, 20 Oct 2025, Learning from Generalization Patterns: An Evaluation-Driven Approach to Enhanced Data Augmentation for Fine-Tuning Small Language Models, https://arxiv.org/abs/2510.18143
  • Xiaohan Qin, Xiaoxing Wang, Ning Liao, Cancheng Zhang, Xiangdong Zhang, Mingquan Feng, Jingzhi Wang, Junchi Yan, 21 Oct 2025, ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning, https://arxiv.org/abs/2510.18250
  • Jiajun Fan, Tong Wei, Chaoran Cheng, Yuxin Chen, Ge Liu, 20 Oct 2025, Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models, https://arxiv.org/abs/2510.18053
  • Jiajun Fan, Chaoran Cheng, Shuaike Shen, Xiangxin Zhou, Ge Liu, 20 Oct 2025, Fine-tuning Flow Matching Generative Models with Intermediate Feedback, https://arxiv.org/abs/2510.18072
  • Zhendong Mi, Qitao Tan, Grace Li Zhang, Zhaozhuo Xu, Geng Yuan, Shaoyi Huang, 21 Oct 2025, Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations, https://arxiv.org/abs/2510.18228
  • Mariano Rivera and Angello Hoyos, 20 Oct 2025, COLORA: Efficient Fine-Tuning for Convolutional Models with a Study Case on Optical Coherence Tomography Image Classification, https://arxiv.org/abs/2505.18315
  • Mingze Yuan, Pengfei Jin, Na Li, Quanzheng Li, 24 Sep 2025, PIRF: Physics-Informed Reward Fine-Tuning for Diffusion Models, https://arxiv.org/abs/2509.20570
  • Honglin Zhang, Qianyue Hao, Fengli Xu, Yong Li, 25 Sep 2025, Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs, https://arxiv.org/abs/2509.21044
  • Yongda Yu, Guohao Shi, Xianwei Wu, Haochuan He, XueMing Gu, Qianqian Zhao, Kui Liu, Qiushi Wang, Zhao Tian, Haifeng Shen, Guoping Rong, 25 Sep 2025, Fine-Tuning LLMs to Analyze Multiple Dimensions of Code Review: A Maximum Entropy Regulated Long Chain-of-Thought Approach, https://arxiv.org/abs/2509.21170
  • Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov, 24 Sep 2025, Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling, https://arxiv.org/abs/2507.01679
  • Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu, 25 Sep 2025, Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training, https://arxiv.org/abs/2507.05386
  • Hangwei Zhang, Chun Kang, Yan Wang, Difan Zou, 27 Sep 2025, F-Adapter: Frequency-Adaptive Parameter-Efficient Fine-Tuning in Scientific Machine Learning, https://arxiv.org/abs/2509.23173
  • Jonas Ngnawé, Maxime Heuillet, Sabyasachi Sahoo, Yann Pequignot, Ola Ahmad, Audrey Durand, Frédéric Precioso, Christian Gagné, 27 Sep 2025, Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer With Adversarial Scheduling, https://arxiv.org/abs/2509.23325
  • Jiang-Xin Shi, Wen-Da Wei, Jin-Fei Qi, Xuanyu Chen, Tong Wei, Yu-Feng Li, 27 Sep 2025, Memory-Efficient Fine-Tuning via Low-Rank Activation Compression, https://arxiv.org/abs/2509.23472
  • Zhixin Zhang, Zeming Wei, Meng Sun, 28 Sep 2025, Dynamic Orthogonal Continual Fine-tuning for Mitigating Catastrophic Forgettings, https://arxiv.org/abs/2509.23893
  • Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen, 29 Sep 2025, Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning, https://arxiv.org/abs/2509.24372
  • David González Martínez, 29 Sep 2025, BALF: Budgeted Activation-Aware Low-Rank Factorization for Fine-Tuning-Free Model Compression, https://arxiv.org/abs/2509.25136
  • Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee, 29 Sep 2025, TR2-D2: Tree Search Guided Trajectory-Aware Fine-Tuning for Discrete Diffusion, https://arxiv.org/abs/2509.25171
  • Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son, 26 Sep 2025, Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment, https://arxiv.org/abs/2509.22745
  • Nayeong Kim, Seong Joon Oh, Suha Kwak, 28 Sep 2025, GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning, https://arxiv.org/abs/2509.23781
  • Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu, 28 Sep 2025, Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning, https://arxiv.org/abs/2503.11832
  • Ruijia Niu, Dongxia Wu, Rose Yu, Yi-An Ma, 29 Sep 2025, Functional-level Uncertainty Quantification for Calibrated Fine-tuning on LLMs, https://arxiv.org/abs/2410.06431
  • Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Seyyedali Hosseinalipour, Christopher G. Brinton, 28 Sep 2025, Federated Sketching LoRA: A Flexible Framework for Heterogeneous Collaborative Fine-Tuning of LLMs, https://arxiv.org/abs/2501.19389
  • Ningyuan Yang, Jiaxuan Gao, Feng Gao, Yi Wu, Chao Yu, 28 Sep 2025, Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps, https://arxiv.org/abs/2505.10482
  • Xuchen Pan, Yanxi Chen, Yushuo Chen, Yuchang Sun, Daoyuan Chen, Wenhao Zhang, Yuexiang Xie, Yilun Huang, Yilei Zhang, Dawei Gao, Weijie Shi, Yaliang Li, Bolin Ding, Jingren Zhou, 29 Sep 2025, Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models, https://arxiv.org/abs/2505.17826
  • Junyu Chen, Junzhuo Li, Zhen Peng, Wenjie Wang, Yuxiang Ren, Long Shi, Xuming Hu, 28 Sep 2025, LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning, https://arxiv.org/abs/2505.18724
  • Tao Ren, Zishi Zhang, Jingyang Jiang, Zehao Li, Shentao Qin, Yi Zheng, Guanghao Li, Qianyou Sun, Yan Li, Jiafeng Liang, Xinping Li, Yijie Peng, 28 Sep 2025, Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer, https://arxiv.org/abs/2502.00639
  • Chenxing Wei, Yao Shu, Mingwen Ou, Ying Tiffany He, Fei Richard Yu, 27 Sep 2025, PAFT: Prompt-Agnostic Fine-Tuning, https://arxiv.org/abs/2502.12859
  • Core Francisco Park, Zechen Zhang, Hidenori Tanaka, 27 Sep 2025, New News: System-2 Fine-tuning for Robust Integration of New Knowledge, https://arxiv.org/abs/2505.01812
  • Hakaze Cho, Peng Luo, Mariko Kato, Rin Kaenbyou, Naoya Inoue, 27 Sep 2025, Mechanistic Fine-tuning for In-context Learning, https://arxiv.org/abs/2505.14233
  • Yuansheng Ni, Ping Nie, Kai Zou, Xiang Yue, Wenhu Chen, 29 Sep 2025, VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation, https://arxiv.org/abs/2506.03930
  • Zengjue Chen, Runliang Niu, He Kong, Qi Wang, Qianli Xing, Zipei Fan, 27 Sep 2025, TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization, https://arxiv.org/abs/2506.08440
  • Lee Qi Zun, Mohamad Zulhilmi Bin Abdul Halim, Goh Man Fye, 17 Oct 2025, Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs, https://arxiv.org/abs/2510.15418
  • Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, J. Andrew Bagnell, 17 Oct 2025, All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning, https://arxiv.org/abs/2503.01067
  • Congzheng Song, Xinyu Tang, 3 Oct 2025, Memory-Efficient Backpropagation for Fine-Tuning LLMs on Resource-Constrained Mobile Devices, https://arxiv.org/abs/2510.03425
  • Yongfu Xue, 4 Oct 2025, Optimizing Fine-Tuning through Advanced Initialization Strategies for Low-Rank Adaptation, https://arxiv.org/abs/2510.03731
  • Junde Xu, Yapin Shi, Lijun Lang, Taoyong Cui, Zhiming Zhang, Guangyong Chen, Jiezhong Qiu, Pheng-Ann Heng, 3 Oct 2025, InstructPLM-mu: 1-Hour Fine-Tuning of ESM2 Beats ESM3 in Protein Mutation Predictions, https://arxiv.org/abs/2510.03370
  • Imran Mansha, 6 Oct 2025, Resource-Efficient Fine-Tuning of LLaMA-3.2-3B for Medical Chain-of-Thought Reasoning, https://arxiv.org/abs/2510.05003
  • Kuang Yuan, Yang Gao, Xilin Li, Xinhao Mei, Syavosh Zadissa, Tarun Pruthi, Saeed Bagheri Sereshki, 4 Oct 2025, Lightweight and Generalizable Acoustic Scene Representations via Contrastive Fine-Tuning and Distillation, https://arxiv.org/abs/2510.03728
  • Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, Wentao Zhang, 4 Oct 2025, Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions, https://arxiv.org/abs/2506.07527
  • Snehal Raj, Brian Coyle, 5 Oct 2025, QuIC: Quantum-Inspired Compound Adapters for Parameter Efficient Fine-Tuning, https://arxiv.org/abs/2502.06916
  • Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma, 4 Oct 2025, Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning, https://arxiv.org/abs/2502.15436
  • Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma, 4 Oct 2025, Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study, https://arxiv.org/abs/2505.14185
  • Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, 5 Oct 2025, Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models, https://arxiv.org/abs/2503.20752
  • Hung-Ying Chu, Shao-Yu Wei, Guan-Wei Chen, Tzu-Wei Hung, ChengYang Tsai and Yu-Cheng Lin, 4 Oct 2025, HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling, https://arxiv.org/abs/2509.25694
  • Nirmal Elamon, Rouzbeh Davoudi, 3 Oct 2025, Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes, https://arxiv.org/abs/2510.08589
  • Sybelle Goedicke-Fritz, Michelle Bous, Annika Engel, Matthias Flotho, Pascal Hirsch, Hannah Wittig, Dino Milanovic, Dominik Mohr, Mathias Kaspar, Sogand Nemat, Dorothea Kerner, Arno Bücker, Andreas Keller, Sascha Meyer, Michael Zemlin, Philipp Flotho, 10 Oct 2025, Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants, https://arxiv.org/abs/2507.12269
  • Aymane El Firdoussi, El Mahdi Chayti, Mohamed El Amine Seddik, Martin Jaggi, 24 Oct 2025, α-LoRA: Effective Fine-Tuning via Base Model Rescaling, https://arxiv.org/abs/2510.21345
  • Jan Wehner, Mario Fritz, 24 Oct 2025, Probe-based Fine-tuning for Reducing Toxicity, https://arxiv.org/abs/2510.21531
  • Mingyang Lyu, Yinqian Sun, Erliang Lin, Huangrui Li, Ruolin Chen, Feifei Zhao, Yi Zeng, 11 Oct 2025, Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models, https://arxiv.org/abs/2510.09976
  • Jianzhe Zhao, Hailin Zhu, Yu Zhang, Ziqi Chen, Guibing Guo, 13 Oct 2025, FedLoRA-Optimizer: Federated LoRA Fine-Tuning with Global and Local Optimization in Heterogeneous Data Scenarios, https://arxiv.org/abs/2510.11274
  • Guozhi Liu, Qi Mu, Tiansheng Huang, Xinhua Wang, Li Shen, Weiwei Lin, Zhang Li, 11 Oct 2025, Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning, https://arxiv.org/abs/2510.10085
  • Maël Macuglia, Paul Friedrich, Giorgia Ramponi, 13 Oct 2025, Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning, https://arxiv.org/abs/2509.26605
  • Haifeng Wen, Hong Xing, Osvaldo Simeone, 11 Oct 2025, Pre-Training and Personalized Fine-Tuning via Over-the-Air Federated Meta-Learning: Convergence-Generalization Trade-Offs, https://arxiv.org/abs/2406.11569
  • Jieming Bian, Lei Wang, Letian Zhang, Jie Xu, 12 Oct 2025, LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement, https://arxiv.org/abs/2411.14961
  • Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Jiahao Hu, Shuo Wu, Yazhe Niu, Ruihao Gong, Dahua Lin, Ningyi Xu, 13 Oct 2025, Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM, https://arxiv.org/abs/2503.07680
  • Changsheng Wang, Yihua Zhang, Jinghan Jia, Parikshit Ram, Dennis Wei, Yuguang Yao, Soumyadeep Pal, Nathalie Baracaldo, Sijia Liu, 10 Oct 2025, Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning, https://arxiv.org/abs/2506.01339
  • Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev, 11 Oct 2025, Complexity-aware fine-tuning, https://arxiv.org/abs/2506.21220
  • Majid Jaberi-Douraki, Hossein Sholehrasa, Xuan Xu, Remya Ampadi Ramachandran, 9 Oct 2025, HySim-LLM: Embedding-Weighted Fine-Tuning Bounds and Manifold Denoising for Domain-Adapted LLMs, https://arxiv.org/abs/2510.07796
  • Yicheng Zhang, Zhen Qin, Zhaomin Wu, Jian Hou, Shuiguang Deng, 9 Oct 2025, Personalized Federated Fine-Tuning for LLMs via Data-Driven Heterogeneous Model Architectures, https://arxiv.org/abs/2411.19128
  • Vignesh Ethiraj, Ashwath David, Sidhanth Menon, Divya Vijay, Vidhyakshaya Kannan, 9 Oct 2025, T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning, https://arxiv.org/abs/2504.16460
  • Hong-Jie Dai, Zheng-Hao Li, An-Tai Lu, Bo-Tsz Shain, Ming-Ta Li, Tatheer Hussain Mir, Kuang-Te Wang, Min-I Su, Pei-Kang Liu, Ming-Ju Tsai, 23 Sep 2025, Model selection meets clinical semantics: Optimizing ICD-10-CM prediction via LLM-as-Judge evaluation, redundancy-aware sampling, and section-aware fine-tuning, https://arxiv.org/abs/2509.18846
  • Xiao Han, Zimo Zhao, Wanyu Wang, Maolin Wang, Zitao Liu, Yi Chang, Xiangyu Zhao, 23 Sep 2025, Data Efficient Adaptation in Large Language Models via Continuous Low-Rank Fine-Tuning, https://arxiv.org/abs/2509.18942
  • Yu Chen, Yifei Han, Long Zhang, Yue Du, Bin Li, 23 Sep 2025, TsqLoRA: Towards Sensitivity and Quality Low-Rank Adaptation for Efficient Fine-Tuning, https://arxiv.org/abs/2509.18585
  • Yueyan Li, Wenhao Gao, Caixia Yuan, Xiaojie Wang, 23 Sep 2025, Fine-Tuning is Subgraph Search: A New Lens on Learning Dynamics, https://arxiv.org/abs/2502.06106
  • Zhi Zhang, Yixian Shen, Congfeng Cao, Ekaterina Shutova, 21 Oct 2025, NeuroAda: Activating Each Neuron's Potential for Parameter-Efficient Fine-Tuning, https://arxiv.org/abs/2510.18940
  • Peng Wang and Minghao Gu and Qiang Huang, 22 Oct 2025, Feature Space Adaptation for Robust Model Fine-Tuning, https://arxiv.org/abs/2510.19155
  • Aël Quélennec, Nour Hezbri, Pavlo Mozharovskyi, Van-Tam Nguyen, Enzo Tartaglione, 22 Oct 2025, Study of Training Dynamics for Memory-Constrained Fine-Tuning, https://arxiv.org/abs/2510.19675
  • Reece Shuttleworth, Jacob Andreas, Antonio Torralba, Pratyusha Sharma, 22 Oct 2025, LoRA vs Full Fine-tuning: An Illusion of Equivalence, https://arxiv.org/abs/2410.21228
  • Rongguang Ye, Ming Tang, Edith C. H. Ngai, 22 Sep 2025, On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs, https://arxiv.org/abs/2509.25214
  • Yuan Huang, 25 Sep 2025, Fine-tuning of Large Language Models for Domain-Specific Cybersecurity Knowledge, https://arxiv.org/abs/2509.25241
  • Hao Ban, Kaiyi Ji, 29 Sep 2025, Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs, https://arxiv.org/abs/2509.25414
  • Abhinav Rastogi, Adam Yang, Albert Q. Jiang, Alexander H. Liu, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Anmol Agarwal, Andy Ehrenberg, Andy Lo, Antoine Roux, Arthur Darcet, Arthur Mensch, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Chris Bamford, Christian Wallenwein, Christophe Renaudin, Clémence Lanfranchi, Clément Denoix, Corentin Barreau, Darius Dabert Devon Mizelle, Diego de las Casas, Elliot Chane-Sane, Emilien Fugier, Emma Bou Hanna, Gabrielle Berrada, Gauthier Delerce, Gauthier Guinet, Georgii Novikov, Graham Neubig, Guillaume Lample, Guillaume Martin, Himanshu Jaju, Jan Ludziejewski, Jason Rute, Jean-Malo Delignon, JeanHadrien Chabran, Joachim Studnia, Joep Barmentlo, Jonas Amar, Josselin Somerville Roberts, Julien Denize, Karan Saxena, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, Kush Jain, Lélio Renard Lavaud, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Marie Pellat, Mathilde Guillaumin, Mathis Felardos, Matthieu Dinot, Maxime Darrin, Maximilian Augustin, Mickaël Seznec, Neha Gupta, Nikhil Raghuraman, Olivier Duchenne, Patricia Wang, Patrick von Platen, Patryk Saffer, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Rémi Delacourt, Roman Soletskyi, Romain Sauvestre, Sagar Vaze, Sanchit Gandhi, Sandeep Subramanian, Shashwat Dalal, Siddharth Gandhi, Soham Ghosh, Srijan Mishra, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Thibaut Lavril, Thibault Schueller, Thomas Foubert, Thomas Robert, Thomas Wang, Timothée Lacroix, Tom Bewley, Valeriia Nemychnikova, Victor Paltz, Virgile Richard, Wen-Ding Li, William Marshall, Xingyao Wang, Xuanyu Zhang, Yihan Wan, Yunhao Tang, 8 Aug 2025, Devstral: Fine-tuning Language Models for Coding Agent Applications, https://arxiv.org/abs/2509.25193
  • Darren King, Yaser Atlasi and Gholamreza Rafiee, 28 Sep 2025, DNABERT-2: Fine-Tuning a Genomic Language Model for Colorectal Gene Enhancer Classification, https://arxiv.org/abs/2509.25274
  • Chenhua Shi, Gregor Macdonald, Bhavika Jalli, Wanlu Lei, John Zou, Mridul Jain, Joji Philip, 30 Sep 2025, Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications, https://arxiv.org/abs/2509.25736
  • Matthew DosSantos DiSorbo, Harang Ju, Sinan Aral, 30 Sep 2025, Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment, https://arxiv.org/abs/2503.02976
  • Jiawei Li, 30 Sep 2025, Detecting Instruction Fine-tuning Attacks on Language Models using Influence Function, https://arxiv.org/abs/2504.09026
  • Prashant Govindarajan, Davide Baldelli, Jay Pathak, Quentin Fournier, Sarath Chandar, 30 Sep 2025, CADmium: Fine-Tuning Code Language Models for Text-Driven Sequential CAD Design, https://arxiv.org/abs/2507.09792
  • Ruoxing Yang, 6 Oct 2025, DP-Adam-AC: Privacy-preserving Fine-Tuning of Localizable Language Models Using Adam Optimization with Adaptive Clipping, https://arxiv.org/abs/2510.05288
  • Yurun Song, Zhuoyi Yang, Ian G. Harris, Sangeetha Abdu Jyothi, 7 Oct 2025, AMAQ: Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning, https://arxiv.org/abs/2510.05468
  • Jiancong Xiao, Bojian Hou, Zhanliang Wang, Ruochen Jin, Qi Long, Weijie J. Su, Li Shen, 16 Oct 2025, Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach, https://arxiv.org/abs/2505.01997
  • Xiaoxue Yang, Bozhidar Stevanoski, Matthieu Meeus, Yves-Alexandre de Montjoye, 16 Oct 2025, Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses, https://arxiv.org/abs/2505.15738

Data Sets

Research papers on datasets used for training:

  • Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
  • Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, Sandipan Dandapat, December 2023, Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing https://aclanthology.org/2023.emnlp-main.418/ PDF: https://aclanthology.org/2023.emnlp-main.418.pdf
  • Gayathri Saranathan, Mahammad Parwez Alam, James Lim, Suparna Bhattacharya, Soon Yee Wong, Foltin Martin & Cong Xu, 2024, DELE: Data Efficient LLM Evaluation, Hewlett Packard Labs, Navigating and Addressing Data Problems for Foundation Models (DPFM) Workshop, ICLR 2024, https://openreview.net/pdf?id=I8bsxPWLNF
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, Jan 2024, Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
  • Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen, Nov 2023, A Survey of Large Language Models, https://arxiv.org/abs/2303.18223
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama
  • Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, 29 Jan 2024, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, https://arxiv.org/abs/2401.16380
  • Cobus Greyling, Dec 2023, A Comprehensive Survey of Large Language Models (LLMs), https://cobusgreyling.medium.com/a-comprehensive-survey-of-large-language-models-llms-946a30d9288e
  • Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
  • Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt, 17 Jun 2024, MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens, https://arxiv.org/abs/2406.11271
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared ReLU.)
  • Piotr Skalski, June 20, 2024, Florence-2: Open Source Vision Foundation Model by Microsoft, https://blog.roboflow.com/florence-2/
  • Sharon Goldman, August 24, 2024, The hidden reason AI costs are soaring—and it’s not because Nvidia chips are more expensive, https://fortune.com/2024/08/23/data-labeling-ai-scaleai-snorkel-costs/ (The high cost of data labeling.)
  • Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024, A Survey on Transformer Compression, https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
  • Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao, 20 Feb 2024 (v2), Large Language Models: A Survey, https://arxiv.org/abs/2402.06196
  • Reddit Signs AI Content Licensing Deal Ahead of IPO, https://www.bloomberg.com/news/articles/2024-02-16/reddit-is-said-to-sign-ai-content-licensing-deal-ahead-of-ipo?srnd=undefined&sref=b0SdE1lu&tpcc=NL_Marketing
  • Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Marius Hobbhahn, Jun 06, 2024, Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data, Epoch AI, https://epochai.org/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
  • Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla, 9 Mar 2024, Algorithmic progress in language models, https://arxiv.org/abs/2403.05812
  • Georgia Argyro, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou, 10 Sep 2024, Prompt2Fashion: An automatically generated fashion dataset, https://arxiv.org/abs/2409.06442
  • Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang, 23 Sep 2024, MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding, https://arxiv.org/abs/2409.14818
  • Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
  • Pierre-Carl Langlais, Anastasia Stasenko, Catherine Arnett, November 13, 2024, Releasing the largest multilingual open pretraining dataset, https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open
  • Arindam Mitra, Ahmed Awadallah, Yash Lara, November 14, 2024, Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators/
  • Paul Sawers, Dec 2024, Harvard and Google to release 1 million public-domain books as AI training dataset, https://techcrunch.com/2024/12/12/harvard-and-google-to-release-1-million-public-domain-books-as-ai-training-dataset/
  • Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
  • Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)
  • Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan, 28 Feb 2025, LLM Post-Training: A Deep Dive into Reasoning Large Language Models, https://arxiv.org/abs/2502.21321 https://github.com/mbzuai-oryx/Awesome-LLM-Post-training
  • Ali Forootani, 22 Mar 2025, A Survey on Mathematical Reasoning and Optimization with Large Language Models, https://arxiv.org/abs/2503.17726
  • Cameron R. Wolfe, Ph.D., May 19, 2025, A Guide for Debugging LLM Training Data: Data-centric techniques and tools that anyone should use when training an LLM, https://cameronrwolfe.substack.com/p/llm-debugging
  • Yi Dong, Yusuke Muraoka, Scott Shi, and Yi Zhang, 14 Aug 2025, MM-Food-100K: A 100,000-Sample Multimodal Food Intelligence Dataset with Verifiable Provenance, https://arxiv.org/abs/2508.10429
  • Haydn Thomas Jones, Natalie Maus, Josh Magnus Ludan, Maggie Ziyu Huan, Jiaming Liang, Marcelo Der Torossian Torres, Jiatao Liang, Zachary Ives, Yoseph Barash, Cesar de la Fuente-Nunez, Jacob R. Gardner, Mark Yatskar, 14 Aug 2025, A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design, https://arxiv.org/abs/2508.10899
  • Ziye Deng, Ruihan He, Jiaxiang Liu, Yuan Wang, Zijie Meng, Songtao Jiang, Yong Xie, Zuozhu Liu, 14 Aug 2025, Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset, https://arxiv.org/abs/2508.10528
  • Feiran Li, Qianqian Xu, Shilong Bao, Boyu Han, Zhiyong Yang, Qingming Huang, 14 Aug 2025, Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation, https://arxiv.org/abs/2508.10672
  • Yuzhuo Xiao, Zeyu Han, Yuhan Wang, Huaizu Jiang, 4 Aug 2025, XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs, https://arxiv.org/abs/2508.09999
  • Seunghyeok Back, Joosoon Lee, Kangmin Kim, Heeseon Rho, Geonhyup Lee, Raeyoung Kang, Sangbeom Lee, Sangjun Noh, Youngjin Lee, Taeyeop Lee, Kyoobin Lee, 14 Aug 2025, GraspClutter6D: A Large-scale Real-world Dataset for Robust Perception and Grasping in Cluttered Scenes, https://arxiv.org/abs/2504.06866
  • Quang-Trung Truong, Yuk-Kwan Wong, Vo Hoang Kim Tuyen Dang, Rinaldi Gotama, Duc Thanh Nguyen, Sai-Kit Yeung, 14 Aug 2025, MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning, https://arxiv.org/abs/2508.04549
  • Youneng Bao, Yiping Liu, Zhuo Chen, Yongsheng Liang, Mu Li, Kede Ma, 23 Jul 2025, Dataset Distillation as Data Compression: A Rate-Utility Perspective, https://arxiv.org/abs/2507.17221
  • Md Min-Ha-Zul Abedin and Tazqia Mehrub, 22 Jul 2025, Evaluating Ensemble and Deep Learning Models for Static Malware Detection with Dimensionality Reduction Using the EMBER Dataset, https://arxiv.org/abs/2507.16952
  • Mashiro Toyooka, Kiyoharu Aizawa and Yoko Yamakata, 23 Jul 2025, A Highly Clean Recipe Dataset with Ingredient States Annotation for State Probing Task, https://arxiv.org/abs/2507.17232
  • Yuanchen Shi, Biao Ma, Longyin Zhang, and Fang Kong, 23 Jul 2025, Impact of Stickers on Multimodal Sentiment and Intent in Social Media: A New Task, Dataset and Baseline, https://arxiv.org/abs/2405.08427
  • David Kurtenbach, Lior Shamir, 15 Jul 2025, An open dataset of neural networks for hypernetwork research, https://arxiv.org/abs/2507.15869
  • Morad Tukan, Loay Mualem, Eitan Netzer, Liran Sigalat, 22 Jul 2025, Improving Model Classification by Optimizing the Training Dataset, https://arxiv.org/abs/2507.16729
  • Yasser Ashraf, Ahmed Sharshar, Velibor Bojkovic, Bin Gu, 22 Jul 2025, SPACT18: Spiking Human Action Recognition Benchmark Dataset with Complementary RGB and Thermal Modalities, https://arxiv.org/abs/2507.16151
  • Aaron Ho (1), Lorenzo Zanisi (2), Bram de Leeuw (3), Vincent Galvan (1), Pablo Rodriguez-Fernandez (1), Nathaniel T. Howard (1) ((1) MIT Plasma Science and Fusion Center, Cambridge, USA, (2) UKAEA Culham Centre for Fusion Energy, Abingdon, UK, (3) Radboud University, Nijmegen, Netherlands), 21 Jul 2025, Efficient dataset construction using active learning and uncertainty-aware neural networks for plasma turbulent transport surrogate models, https://arxiv.org/abs/2507.15976
  • Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum, 22 Jul 2025, Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning, https://arxiv.org/abs/2507.16746
  • Fateme Nateghi Haredasht, Fatemeh Amrollahi, Manoj Maddali, Nicholas Marshall, Stephen P. Ma, Lauren N. Cooper, Andrew O. Johnson, Ziming Wei, Richard J. Medford, Sanjat Kanjilal, Niaz Banaei, Stanley Deresinski, Mary K. Goldstein, Steven M. Asch, Amy Chang, Jonathan H. Chen, 21 Jul 2025, Antibiotic Resistance Microbiology Dataset (ARMD): A Resource for Antimicrobial Resistance from EHRs, https://arxiv.org/abs/2503.07664
  • Daniel Grimm, Ahmed Abouelazm, J. Marius Zöllner, 24 Jul 2025, Goal-based Trajectory Prediction for improved Cross-Dataset Generalization, https://arxiv.org/abs/2507.18196
  • Paulo Mendes (1), Eva Maia (1), Isabel Praça (1) ((1) GECAD, ISEP, Polytechnic of Porto, Portugal), 23 Jul 2025, MeAJOR Corpus: A Multi-Source Dataset for Phishing Email Detection, https://arxiv.org/abs/2507.17978
  • Maria Vlachou, 24 Jul 2025, Fashion-AlterEval: A Dataset for Improved Evaluation of Conversational Recommendation Systems with Alternative Relevant Items, https://arxiv.org/abs/2507.18017
  • Xuebo Jin, Longfei Gao, Anshuo Tong, Zhengyang Chen, Jianlei Kong, Ning Sun, Huijun Ma, Qiang Wang, Yuting Bai, Tingli Su, 24 Jul 2025, TCM-Tongue: A Standardized Tongue Image Dataset with Pathological Annotations for AI-Assisted TCM Diagnosis, https://arxiv.org/abs/2507.18288
  • Minje Park, Jeonghwa Lim, Taehyung Yu, and Sunghoon Joo, 24 Jul 2025, A Multi-Dataset Benchmark for Semi-Supervised Semantic Segmentation in ECG Delineation, https://arxiv.org/abs/2507.18323
  • Baoyao Yang, Wanyun Li, Dixin Chen, Junxiang Chen, Wenbin Yao, Haifeng Lin, 24 Jul 2025, VideoMind: An Omni-Modal Video Dataset with Intent Grounding for Deep-Cognitive Video Understanding, https://arxiv.org/abs/2507.18552
  • Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim, 24 Jul 2025, SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning, https://arxiv.org/abs/2507.18616
  • Sam Gordon James, Miranda Elaine Glynis Armstrong, Aisling Ann O'Kane, Harry Emerson and Zahraa S. Abdallah, 7 May 2025, BrisT1D Dataset: Young Adults with Type 1 Diabetes in the UK using Smartwatches, https://arxiv.org/abs/2507.17757
  • Gabriel Jarry, Ramon Dalmau, Philippe Very, Franck Ballerini, Stephania-Denisa Bocu, 24 Jul 2025, GVCCS: A Dataset for Contrail Identification and Tracking on Visible Whole Sky Camera Sequences, https://arxiv.org/abs/2507.18330
  • Charvi Rastogi, Tian Huey Teh, Pushkar Mishra, Roma Patel, Ding Wang, Mark Díaz, Alicia Parrish, Aida Mostafazadeh Davani, Zoe Ashwood, Michela Paganini, Vinodkumar Prabhakaran, Verena Rieser, Lora Aroyo, 15 Jul 2025, Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models, https://arxiv.org/abs/2507.13383
  • Paul E. Calzada, Zahin Ibnat, Tanvir Rahman, Kamal Kandula, Danyu Lu, Sujan Kumar Saha, Farimah Farahmandi, Mark Tehranipoor, 9 Jul 2025, VerilogDB: The Largest, Highest-Quality Dataset with a Preprocessing Framework for LLM-based RTL Generation, https://arxiv.org/abs/2507.13369
  • Xiao Wang, Qian Zhu, Shujuan Wu, Bo Jiang, Shiliang Zhang, Yaowei Wang, Yonghong Tian, Bin Luo, 18 Jul 2025, When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework, https://arxiv.org/abs/2507.13659
  • Morteza Bodaghi, Majid Hosseini, Raju Gottumukkala, Ravi Teja Bhupatiraju, Iftikhar Ahmad, Moncef Gabbouj, 16 Jul 2025, UL-DD: A Multimodal Drowsiness Dataset Using Video, Biometric Signals, and Behavioral Data, https://arxiv.org/abs/2507.13403
  • Hengjie Yu, Kenneth A. Dawson, Haiyun Yang, Shuya Liu, Yan Yan, Yaochu Jin, 18 Jul 2025, A million-scale dataset and generalizable foundation model for nanomaterial-protein interactions, https://arxiv.org/abs/2507.14245
  • Daniel Fein, Gabriela Aranguiz-Dias, 18 Jul 2025, Influence Functions for Preference Dataset Pruning, https://arxiv.org/abs/2507.14344
  • Refik Samet, Nooshin Nemati, Emrah Hancer, Serpil Sak, Bilge Ayca Kirmizi, Zeynep Yildirim, 18 Jul 2025, MiDeSeC: A Dataset for Mitosis Detection and Segmentation in Breast Cancer Histopathology Images, https://arxiv.org/abs/2507.14271
  • Refik Samet, Nooshin Nemati, Emrah Hancer, Serpil Sak, Bilge Ayca Kirmizi, 18 Jul 2025, NuSeC: A Dataset for Nuclei Segmentation in Breast Cancer Histopathology Images, https://arxiv.org/abs/2507.14272
  • Deyun Zhang, Xiang Lan, Shijia Geng, Qinghao Zhao, Sumei Fan, Mengling Feng, and Shenda Hong, 21 Jul 2025, MEETI: A Multimodal ECG Dataset from MIMIC-IV-ECG with Signals, Images, Features and Interpretations, https://arxiv.org/abs/2507.15255
  • Shuo Tang, Jian Xu, Jiadong Zhang, Yi Chen, Qizhao Jin, Lingdong Shen, Chenglin Liu, Shiming Xiang, 9 Aug 2025, MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction, https://arxiv.org/abs/2508.06859
  • Keyu Li, Mohan Jiang, Dayuan Fu, Yunze Wu, Xiangkun Hu, Dequan Wang, Pengfei Liu, 9 Aug 2025, DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery, https://arxiv.org/abs/2508.06960
  • Naseem Machlovi, Maryam Saleki, Innocent Ababio, Ruhul Amin, 9 Aug 2025, Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach, https://arxiv.org/abs/2508.07063
  • Xiaoyuan Zhu, Muru Zhang, Ollie Liu, Robin Jia, Willie Neiswanger, 8 Aug 2025, LLM Unlearning Without an Expert Curated Dataset, https://arxiv.org/abs/2508.06595
  • Xiaobo Zhang (1 and 2), Congqing He (2), Ying He (1 and 2), Jian Peng (1), Dajie Fu (1), Tien-Ping Tan (2) ((1) School of Information Engineering, Jiangxi Vocational College of Finance & Economics, Jiujiang, China, (2) School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia), 9 Aug 2025, ESNERA: Empirical and semantic named entity alignment for named entity dataset merging, https://arxiv.org/abs/2508.06877
  • Muhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim, Maxalmina Satria Kahfi, Fajri Koto, Alham Fikri Aji, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Genta Indra Winata, 9 Aug 2025, SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages, https://arxiv.org/abs/2508.07069
  • Licheng Zhang, Bach Le, Naveed Akhtar, Tuan Ngo, 11 Aug 2025, DoorDet: Semi-Automated Multi-Class Door Detection Dataset via Object Detection and Large Language Models, https://arxiv.org/abs/2508.07714
  • Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang and Yanbin Hao, 11 Aug 2025, UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models, https://arxiv.org/abs/2508.07766
  • Vojtěch Staněk, Karel Srna, Anton Firc, Kamil Malinka, 11 Aug 2025, SCDF: A Speaker Characteristics DeepFake Speech Dataset for Bias Analysis, https://arxiv.org/abs/2508.07944
  • Unisha Joshi, 6 Aug 2025, Age-Diverse Deepfake Dataset: Bridging the Age Gap in Deepfake Detection, https://arxiv.org/abs/2508.06552
  • Mohammad Zia Ur Rehman, Anukriti Bhatnagar, Omkar Kabde, Shubhi Bansal, Nagendra Kumar, 7 Aug 2025, ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos, https://arxiv.org/abs/2508.06570
  • Anurag Tripathi, Vaibhav Patle, Abhinav Jain, Ayush Pundir, Sairam Menon, Ajeet Kumar Singh, Dorien Herremans, 11 Aug 2025, End-to-End Text-to-SQL with Dataset Selection: Leveraging LLMs for Adaptive Query Generation, https://arxiv.org/abs/2508.06387
  • Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim, 25 Jul 2025, Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and Benchmarks, https://arxiv.org/abs/2507.19684
  • Yazeed Alrubyli, Omar Alomeir, Abrar Wafa, Diána Hidvégi, Hend Alrasheed, Mohsen Bahrami, 25 Jul 2025, NAICS-Aware Graph Neural Networks for Large-Scale POI Co-visitation Prediction: A Multi-Modal Dataset and Methodology, https://arxiv.org/abs/2507.19697
  • Tan-Minh Nguyen, Hoang-Trung Nguyen, Trong-Khoi Dao, Xuan-Hieu Phan, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong, 26 Jul 2025, VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering, https://arxiv.org/abs/2507.19995
  • Adrien Bazoge, 28 Jul 2025, MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation, https://arxiv.org/abs/2507.20917
  • Abir Harrasse, Philip Quirke, Clement Neo, Dhruv Nathawani, Luke Marks and Amir Abdullah, 27 Jul 2025, TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research, https://arxiv.org/abs/2503.12730
  • Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi, 27 Jul 2025, FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation, https://arxiv.org/abs/2505.14351
  • Robin Burchard and Kristof Van Laerhoven, 28 Jul 2025, Enhancing Wearable Tap Water Audio Detection through Subclass Annotation in the HD-Epic Dataset, https://arxiv.org/abs/2505.20788
  • Shenghe Zheng, Qianjia Cheng, Junchi Yao, Mengsong Wu, Haonan He, Ning Ding, Yu Cheng, Shuyue Hu, Lei Bai, Dongzhan Zhou, Ganqu Cui, Peng Ye, 28 Jul 2025, Scaling Physical Reasoning with the PHYSICS Dataset, https://arxiv.org/abs/2506.00022
  • Andreas Spilz, Heiko Oppel, Jochen Werner, Kathrin Stucke-Straub, Felix Capanni and Michael Munz, 6 Jun 2025, GAITEX: Human motion dataset from impaired gait and rehabilitation exercises of inertial and optical sensor data, https://arxiv.org/abs/2507.21069
  • Ariel E. Stassi, Yanina Boria, J. Matías Di Martino and Gregory Randall, 7 Jul 2025, iLSU-T: an Open Dataset for Uruguayan Sign Language Translation, https://arxiv.org/abs/2507.21104
  • Sheng-Feng Yu, Jia-Jiun Yao, and Wei-Chen Chiu, 29 Jul 2025, Boost Self-Supervised Dataset Distillation via Parameterization, Predefined Augmentation, and Approximation, https://arxiv.org/abs/2507.21455
  • Basak Demirok, Mucahid Kutlu, Selin Mergen, 29 Jul 2025, MultiAIGCD: A Comprehensive dataset for AI Generated Code Detection Covering Multiple Languages, Models, Prompts, and Scenarios, https://arxiv.org/abs/2507.21693
  • Mohammed Baharoon, Luyang Luo, Michael Moritz, Abhinav Kumar, Sung Eun Kim, Xiaoman Zhang, Miao Zhu, Mahmoud Hussain Alabbad, Maha Sbayel Alhazmi, Neel P. Mistry, Kent Ryan Kleinschmidt, Brady Chrisler, Sathvik Suryadevara, Sri Sai Dinesh Jaliparthi, Noah Michael Prudlo, Mark David Marino, Jeremy Palacio, Rithvik Akula, Hong-Yu Zhou, Ibrahim Ethem Hamamci, Scott J. Adams, Hassan Rayhan AlOmaish, Pranav Rajpurkar, 29 Jul 2025, ReXGroundingCT: A 3D Chest CT Dataset for Segmentation of Findings from Free-Text Reports, https://arxiv.org/abs/2507.22030
  • Salvatore Sinno, Markus Bertl, Arati Sahoo, Bhavika Bhalgamiya, Thomas Groß, Nicholas Chancellor, 29 Jul 2025, Implementing Large Quantum Boltzmann Machines as Generative AI Models for Dataset Balancing, https://arxiv.org/abs/2502.03086
  • Xiaoyi Feng, Kaifeng Zou, Caichun Cen, Tao Huang, Hui Guo, Zizhou Huang, Yingli Zhao, Mingqing Zhang, Ziyuan Zheng, Diwei Wang, Yuntao Zou, Dagang Li, 29 Jul 2025, LinkTo-Anime: A 2D Animation Optical Flow Dataset from 3D Model Rendering, https://arxiv.org/abs/2506.02733
  • Fengyi Jiang, Xiaorui Zhang, Lingbo Jin, Ruixing Liang, Yuxin Chen, Adi Chola Venkatesh, Jason Culman, Tiantian Wu, Lirong Shao, Wenqing Sun, Cong Gao, Hallie McNamara, Jingpei Lu, Omid Mohareri, 29 Jul 2025, SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures, https://arxiv.org/abs/2507.00209
  • Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang, 25 Mar 2025, OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching, https://arxiv.org/abs/2503.21813
  • Bastien Le Guellec, Kokou Adambounou, Lisa C Adams, Thibault Agripnidis, Sung Soo Ahn, Radhia Ait Chalal, Tugba Akinci D Antonoli, Philippe Amouyel, Henrik Andersson, Raphael Bentegeac, Claudio Benzoni, Antonino Andrea Blandino, Felix Busch, Elif Can, Riccardo Cau, Armando Ugo Cavallo, Christelle Chavihot, Erwin Chiquete, Renato Cuocolo, Eugen Divjak, Gordana Ivanac, Barbara Dziadkowiec Macek, Armel Elogne, Salvatore Claudio Fanni, Carlos Ferrarotti, Claudia Fossataro, Federica Fossataro, Katarzyna Fulek, Michal Fulek, Pawel Gac, Martyna Gachowska, Ignacio Garcia Juarez, Marco Gatti, Natalia Gorelik, Alexia Maria Goulianou, Aghiles Hamroun, Nicolas Herinirina, Krzysztof Kraik, Dominik Krupka, Quentin Holay, Felipe Kitamura, Michail E Klontzas, Anna Kompanowska, Rafal Kompanowski, Alexandre Lefevre, et al. (43 additional authors not shown), 25 Jul 2025, PARROT: An Open Multilingual Radiology Reports Dataset, https://arxiv.org/abs/2507.22939
  • Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, Shinsuke Mori, 31 Jul 2025, EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts, https://arxiv.org/abs/2410.05343
  • Eylon Caplan, Tania Chakraborty, Dan Goldwasser, 31 Jul 2025, Splits! A Flexible Dataset and Evaluation Framework for Sociocultural Linguistic Investigation, https://arxiv.org/abs/2504.04640
  • Thomas Sugg, Kyle O'Brien, Lekh Poudel, Alex Dumouchelle, Michelle Jou, Marc Bosch, Deva Ramanan, Srinivasa Narasimhan, Shubham Tulsiani, 30 Jul 2025, Accenture-NVS1: A Novel View Synthesis Dataset, https://arxiv.org/abs/2503.18711
  • Feng Zhu, Zihang Zhang, Kangcheng Teng, Abduhelil Yakup and Xiaohong Zhang, 31 Jul 2025, SmartPNT-MSF: A Multi-Sensor Fusion Dataset for Positioning and Navigation Research, https://arxiv.org/abs/2507.19079
  • Hongjie Chen, Akshay Mehra, Josh Kimball, Ryan A. Rossi, 29 Jul 2025, Measuring Time-Series Dataset Similarity using Wasserstein Distance, https://arxiv.org/abs/2507.22189
  • Vanessa Rebecca Wiyono, David Anugraha, Ayu Purwarianti, Genta Indra Winata, 29 Jul 2025, IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian, https://arxiv.org/abs/2507.22159
  • Evgeniy I. Sosnin, Yuriy L. Vasilev, Roman A. Solovyev, Aleksandr L. Stempkovskiy, Dmitry V. Telpukhov, Artem A. Vasilev, Aleksandr A. Amerikanov, Aleksandr Y. Romanov, 30 Jul 2025, AlphaDent: A dataset for automated tooth pathology detection, https://arxiv.org/abs/2507.22512
  • Lucas Correia, Jan-Christoph Goos, Thomas Bäck, Anna V. Kononova, 31 Jul 2025, PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series, https://arxiv.org/abs/2411.13951
  • Kejia Gao, Liguo Zhou, Mingjun Liu, Alois Knoll, 1 Aug 2025, E2E Parking Dataset: An Open Benchmark for End-to-End Autonomous Parking, https://arxiv.org/abs/2504.10812
  • Zihan Zheng, Tianle Cui, Chuwen Xie, Jiahui Zhang, Jiahui Pan, Lewei He, Qianglong Chen, 2 Aug 2025, NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset, https://arxiv.org/abs/2508.01330
  • Xuan Liu, Siru Ouyang, Xianrui Zhong, Jiawei Han, Huimin Zhao, 1 Aug 2025, FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models, https://arxiv.org/abs/2508.01055
  • Ali Forootani, Raffaele Iervolino, 3 Aug 2025, Asynchronous Federated Learning with non-convex client objective functions and heterogeneous dataset, https://arxiv.org/abs/2508.01675
  • Zhihao Zhu, Jiale Han, Yi Yang, 27 Jul 2025, HoneyImage: Verifiable, Harmless, and Stealthy Dataset Ownership Verification for Image Models, https://arxiv.org/abs/2508.00892
  • Huyu Wu, Duo Su, Junjie Hou, Guang Li, 2 Aug 2025, Dataset Condensation with Color Compensation, https://arxiv.org/abs/2508.01139
  • Han Wang, Zhuoran Wang, Roy Ka-Wei Lee, 3 Aug 2025, HateClipSeg: A Segment-Level Annotated Dataset for Fine-Grained Hate Video Detection, https://arxiv.org/abs/2508.01712
  • Runkai Zheng, Vishnu Asutosh Dasu, Yinong Oliver Wang, Haohan Wang, Fernando De la Torre, 3 Aug 2025, Improving Noise Efficiency in Privacy-preserving Dataset Distillation, https://arxiv.org/abs/2508.01749
  • Junyi Mo, Jiayu Li, Duo Zhang, Elynn Chen, 3 Aug 2025, ACT-Tensor: Tensor Completion Framework for Financial Dataset Imputation, https://arxiv.org/abs/2508.01861
  • Fan Gao, Cheng Huang, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu, 4 Aug 2025, TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models, https://arxiv.org/abs/2508.01977
  • Raviraj Joshi, Rakesh Paul, Kanishk Singla, Anusha Kamath, Michael Evans, Katherine Luna, Shaona Ghosh, Utkarsh Vaidya, Eileen Long, Sanjay Singh Chauhan, Niranjan Wartikar, 3 Aug 2025, CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications, https://arxiv.org/abs/2508.01710
  • Cuno Sankey-Olsen, Rasmus Hvass Olesen, Tobias Oliver Eberhard, Andreas Triantafyllopoulos, Björn Schuller, Ilhan Aslan, 4 Aug 2025, Detecting COPD Through Speech Analysis: A Dataset of Danish Speech and Machine Learning Approach, https://arxiv.org/abs/2508.02354
  • Nazmun N Khan, Taylor Sweet, Chase A Harvey, Calder Knapp, Dean J. Krusienski, David E Thompson, 4 Aug 2025, The Role of Review Process Failures in Affective State Estimation: An Empirical Investigation of DEAP Dataset, https://arxiv.org/abs/2508.02417
  • Jungdae Lee, Taiki Miyanishi, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Yutaka Matsuo, Nakamasa Inoue, 2 Aug 2025, CityNav: A Large-Scale Dataset for Real-World Aerial Navigation, https://arxiv.org/abs/2406.14240
  • Zedong Peng, Zeju Li, Mingzhe Gao, Qiang Xu, Chen Zhang, Jieru Zhao, 4 Aug 2025, ForgeHLS: A Large-Scale, Open-Source Dataset for High-Level Synthesis, https://arxiv.org/abs/2507.03255
  • Shaofeng Yin, Ting Lei, Yang Liu, 5 Aug 2025, ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools, https://arxiv.org/abs/2508.03284
  • Sai Ma, Zhuang Li, John A Taylor, 5 Aug 2025, Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery, https://arxiv.org/abs/2508.03127
  • Kaiwen Zhao, Bharathan Balaji, Stephen Lee, 5 Aug 2025, CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation, https://arxiv.org/abs/2508.03489
  • Anuroop Sriram, Logan M. Brabson, Xiaohan Yu, Sihoon Choi, Kareem Abdelmaqsoud, Elias Moubarak, Pim de Haan, Sindy Löwe, Johann Brehmer, John R. Kitchin, Max Welling, C. Lawrence Zitnick, Zachary Ulissi, Andrew J. Medford, David S. Sholl, 5 Aug 2025, The Open DAC 2025 Dataset for Sorbent Discovery in Direct Air Capture, https://arxiv.org/abs/2508.03162
  • Abdul Basit, Nouhaila Innan, Muhammad Haider Asif, Minghao Shao, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique, 5 Aug 2025, PennyLang: Pioneering LLM-Based Quantum Code Generation with a Novel PennyLane-Centric Dataset, https://arxiv.org/abs/2503.02497
  • Chenxi Wang, Jizhan Fang, Xiang Chen, Bozhong Tian, Ziwen Xu, Huajun Chen, Ningyu Zhang, 5 Aug 2025, ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems, https://arxiv.org/abs/2503.20756
  • Mei Jiang, Houping Yue, Bingdong Li, Hao Hao, Ying Qian, Bo Jiang, and Aimin Zhou, 6 Aug 2025, SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset, https://arxiv.org/abs/2508.04563
  • Shengchao Chen, Guodong Long, Jing Jiang, 6 Aug 2025, FeDaL: Federated Dataset Learning for Time Series Foundation Models, https://arxiv.org/abs/2508.04045
  • Se Won Oh, Hyuntae Jeong, Seungeun Chung, Jeong Mook Lim, Kyoung Ju Noh, Sunkyung Lee, Gyuwon Jung, 18 Jul 2025, Understanding Human Daily Experience Through Continuous Sensing: ETRI Lifelog Dataset 2024, https://arxiv.org/abs/2508.03698
  • Xiao Wang, Xufeng Lou, Shiao Wang, Ju Huang, Lan Chen, Bo Jiang, 6 Aug 2025, Long-Term Visual Object Tracking with Event Cameras: An Associative Memory Augmented Tracker and A Benchmark Dataset, https://arxiv.org/abs/2403.05839
  • Naba Rizvi, Harper Strickland, Daniel Gitelman, Tristan Cooper, Alexis Morales-Flores, Michael Golden, Aekta Kallepalli, Akshat Alurkar, Haaset Owens, Saleha Ahmedi, Isha Khirwadkar, Imani Munyaka, Nedjma Ousidhoum, 6 Aug 2025, AUTALIC: A Dataset for Anti-AUTistic Ableist Language In Context, https://arxiv.org/abs/2410.16520
  • Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, Ziran Wang, 5 Aug 2025, NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models, https://arxiv.org/abs/2503.12772
  • Xiao Wang, Haiyang Wang, Shiao Wang, Qiang Chen, Jiandong Jin, Haoyu Song, Bo Jiang, Chenglong Li, 6 Aug 2025, RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework, https://arxiv.org/abs/2504.10018
  • Pouyan Navard, Yasemin Ozkut, Srikar Adhikari, Elaine Situ-LaCasse, Josie Acuña, Adrienne Yarnish, Alper Yilmaz, 5 Aug 2025, ERDES: A Benchmark Video Dataset for Retinal Detachment and Macular Status Classification in Ocular Ultrasound, https://arxiv.org/abs/2508.04735
  • Changle Qu, Sunhao Dai, Ke Guo, Liqin Zhao, Yanan Niu, Xiao Zhang, Jun Xu, 7 Aug 2025, KuaiLive: A Real-time Interactive Dataset for Live Streaming Recommendation, https://arxiv.org/abs/2508.05633
  • Vladimir Frants, Sos Agaian, 6 Aug 2025, Quaternion-Hadamard Network: A Novel Defense Against Adversarial Attacks with a New Dataset, https://arxiv.org/abs/2502.10452
  • Yin Li, Qi Chen, Kai Wang, Meige Li, Liping Si, Yingwei Guo, Yu Xiong, Qixing Wang, Yang Qin, Ling Xu, Patrick van der Smagt, Jun Tang, Nutan Chen, 6 Aug 2025, A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation, https://arxiv.org/abs/2404.03253
  • Ingo Ziegler, Abdullatif Köksal, Desmond Elliott, Hinrich Schütze, 6 Aug 2025, CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation, https://arxiv.org/abs/2409.02098
  • Zekun Liu, Xiaowen Huang, Jitao Sang, 1 Aug 2025, ITDR: An Instruction Tuning Dataset for Enhancing Large Language Models in Recommendations, https://arxiv.org/abs/2508.05667
  • Youguang Xing, Xu Luo, Junlin Xie, Lianli Gao, Hengtao Shen, Jingkuan Song, 8 Aug 2025, Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation, https://arxiv.org/abs/2508.06426
  • Jucheng Hu, Surong Yang, Lijun Wu, Dongzhan Zhou, 8 Aug 2025, DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning, https://arxiv.org/abs/2504.14810
  • Nikolaos Dionelis, Alessandra Feliciotti, Mattia Marconcini, Devis Peressutti, Nika Oman Kadunc, JaeWan Park, Hagai Raja Sinulingga, Steve Andreas Immanuel, Ba Tran, Caroline Arnold, Nicolas Longépé, 8 Aug 2025, Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge, https://arxiv.org/abs/2502.13818
  • Zhenhui Ou, Dawei Li, Zhen Tan, Wenlin Li, Huan Liu, Siyuan Song, 9 Aug 2025, Building Safer Sites: A Large-Scale Multi-Level Dataset for Construction Safety Research, https://arxiv.org/abs/2508.09203
  • Yuxiao Wang, Yu Lei, Wolin Liang, Weiying Xue, Zhenao Wei, Nan Zhuang, Qi Liu, 13 Aug 2025, What-Meets-Where: Unified Learning of Action and Contact Localization in a New Dataset, https://arxiv.org/abs/2508.09428
  • Amir Hosseinian, Ashkan Dehghani Zahedani, Umer Mansoor, Noosheen Hashemi, Mark Woodward, 13 Aug 2025, January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis, https://arxiv.org/abs/2508.09966
  • Grigor Bezirganyan, Sana Sellami, Laure Berti-Équille, Sébastien Fournier, 13 Aug 2025, LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data, https://arxiv.org/abs/2406.09864
  • Chunan Liu, Aurelien Pelissier, Yanjun Shao, Lilian Denzler, Andrew C.R. Martin, Brooks Paige and María Rodríguez Martínez, 13 Aug 2025, AbRank: A Benchmark Dataset and Metric-Learning Framework for Antibody-Antigen Affinity Ranking, https://arxiv.org/abs/2506.17857
  • Angela John, Selvyn Allotey, Till Koebe, Alexandra Tyukavina, Ingmar Weber, 15 Aug 2025, A Global Dataset of Location Data Integrity-Assessed Reforestation Efforts, https://arxiv.org/abs/2508.11349
  • Wentao Li, Yonghu He, Kun Gao, Qing Liu and Yali Zheng, 7 Aug 2025, Collaborative Learning-Enhanced Lightweight Models for Predicting Arterial Blood Pressure Waveform in a Large-scale Perioperative Dataset, https://arxiv.org/abs/2508.11669
  • Manish Shukla, 17 Aug 2025, Interpreting Time Series Forecasts with LIME and SHAP: A Case Study on the Air Passengers Dataset, https://arxiv.org/abs/2508.12253
  • Ananya Singha, Harshita Sahijwani, Walt Williams, Emmanuel Aboah Boateng, Nick Hausman, Miguel Di Luca, Keegan Choudhury, Chaya Binet, Vu Le, Tianwei Chen, Oryan Rokeah Chen, Sulaiman Vesal, Sadid Hasan, 14 Aug 2025, Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs, https://arxiv.org/abs/2508.11715
  • Marcel Gregoriadis, Jingwei Kang, Johan Pouwelse, 17 Aug 2025, A Large-Scale Web Search Dataset for Federated Online Learning to Rank, https://arxiv.org/abs/2508.12353
  • Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos and Frank Kargl, 19 Aug 2025, Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias, https://arxiv.org/abs/2508.13813
  • Jonathan A. Karr Jr., Benjamin F. Herbst, Ting Hua, Matthew Hauenstein, Georgina Curto, Nitesh V. Chawla, 14 Aug 2025, Combating Homelessness Stigma with LLMs: A New Multi-Modal Dataset for Bias Detection, https://arxiv.org/abs/2508.13187
  • Hunter McNichols, Fareya Ikram, Andrew Lan, 19 Aug 2025, The StudyChat Dataset: Student Dialogues With ChatGPT in an Artificial Intelligence Course, https://arxiv.org/abs/2503.07928
  • Anirudh Sundar, Christopher Richardson, Adar Avsian, Larry Heck, 19 Aug 2025, iTBLS: A Dataset of Interactive Conversations Over Tabular Information, https://arxiv.org/abs/2404.12580
  • Chinmoy Biswas, Nafis Faisal, Vivek Chowdhury, Abrar Al-Shadid Abir, Sabir Mahmud, Mithon Rahman, Shaikh Anowarul Fattah, Hafiz Imtiaz, 12 Aug 2025, Load Forecasting on A Highly Sparse Electrical Load Dataset Using Gaussian Interpolation, https://arxiv.org/abs/2508.14069
  • Chanyeol Choi, Jihoon Kwon, Alejandro Lopez-Lira, Chaewoon Kim, Minjae Kim, Juneha Hwang, Jaeseon Ha, Hojun Choi, Suyeol Yun, Yongjin Kim, and Yongjae Lee, 7 Aug 2025, FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering, https://arxiv.org/abs/2508.14052
  • Sujit Roy, Dinesha V. Hegde, Johannes Schmude, Amy Lin, Vishal Gaur, Rohit Lal, Kshitiz Mandal, Talwinder Singh, Andrés Muñoz-Jaramillo, Kang Yang, Chetraj Pandey, Jinsu Hong, Berkay Aydin, Ryan McGranaghan, Spiridon Kasapis, Vishal Upendran, Shah Bahauddin, Daniel da Silva, Marcus Freitag, Iksha Gurung, Nikolai Pogorelov, Campbell Watson, Manil Maskey, Juan Bernabe-Moreno, Rahul Ramachandran, 18 Aug 2025, SuryaBench: Benchmark Dataset for Advancing Machine Learning in Heliophysics and Space Weather Prediction, https://arxiv.org/abs/2508.14107
  • Yuzhuo Li, Di Zhao, Tingrui Qiao, Yihao Wu, Bo Pang, Yun Sing Koh, 20 Aug 2025, MetaWild: A Multimodal Dataset for Animal Re-Identification with Environmental Metadata, https://arxiv.org/abs/2501.13368
  • Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, 20 Aug 2025, Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset, https://arxiv.org/abs/2508.15096
  • Manuel Serna-Aguilera, Fiona L. Goggin, Aranyak Goswami, Alexander Bucksch, Suxing Liu, Khoa Luu, 19 Aug 2025, AGP: A Novel Arabidopsis thaliana Genomics-Phenomics Dataset and its HyperGraph Baseline Benchmarking, https://arxiv.org/abs/2508.14934
  • Laura De Grazia, Pol Pastells, Mauro Vázquez Chas, Desmond Elliott, Danae Sánchez Villegas, Mireia Farrús, Mariona Taulé, 21 Aug 2025, MuSeD: A Multimodal Spanish Dataset for Sexism Detection in Social Media Videos, https://arxiv.org/abs/2504.11169
  • Ruiqi Wu, Yuang Yao, Tengfei Ma, Chenran Zhang, Na Su, Tao Zhou, Geng Chen, Wen Fan, Yi Zhou, 22 Aug 2025, Bridging the Gap in Ophthalmic AI: MM-Retinal-Reason Dataset and OphthaReason Model toward Dynamic Multimodal Reasoning, https://arxiv.org/abs/2508.16129
  • Andreas Loizou and Dimitrios Tsoumakos, 22 Aug 2025, Chunked Data Shapley: A Scalable Dataset Quality Assessment for Machine Learning, https://arxiv.org/abs/2508.16255
  • Anyu Ying, Natarajan Balaji Shankar, Chyi-Jiunn Lin, Mohan Shi, Pu Wang, Hye-jin Shim, Siddhant Arora, Hugo Van hamme, Abeer Alwan, and Shinji Watanabe, 22 Aug 2025, Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet, https://arxiv.org/abs/2508.16576
  • Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie, 21 Aug 2025, Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset, https://arxiv.org/abs/2508.15986
  • Boran Zhao, Hetian Liu, Zihang Yuan, Li Zhu, Fan Yang, Lina Xie, Tian Xia, Wenzhe Zhao, Pengju Ren, 19 Aug 2025, AdapSNE: Adaptive Fireworks-Optimized and Entropy-Guided Dataset Sampling for Edge DNN Training, https://arxiv.org/abs/2508.16647
  • Syed Nazmus Sakib, Nafiul Haque, Mohammad Zabed Hossain, and Shifat E. Arman, 23 Aug 2025, PlantVillageVQA: A Visual Question Answering Dataset for Benchmarking Vision-Language Models in Plant Science, https://arxiv.org/abs/2508.17117
  • Siying Zhou, Yiquan Wu, Hui Chen, Xavier Hu, Kun Kuang, Adam Jatowt, Ming Hu, Chunyan Zheng, Fei Wu, 24 Aug 2025, ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation, https://arxiv.org/abs/2508.17234
  • Pedro Antonio Rabelo Saraiva, Enzo Ferreira de Souza, Joao Manoel Herrera Pinheiro, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker, 24 Aug 2025, A Synthetic Dataset for Manometry Recognition in Robotic Applications, https://arxiv.org/abs/2508.17468
  • Yan Cathy Hua, Paul Denny, Jörg Wicker, Katerina Taskova, 23 Aug 2025, EduRABSA: An Education Review Dataset for Aspect-based Sentiment Analysis Tasks, https://arxiv.org/abs/2508.17008
  • Dakuan Lu, Xiaoyu Tan, Rui Xu, Tianchu Yao, Chao Qu, Wei Chu, Yinghui Xu, Yuan Qi, 24 Aug 2025, SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain, https://arxiv.org/abs/2501.15587
  • Hua Li, Shijie Lian, Zhiyuan Li, Runmin Cong, Chongyi Li, Laurence T. Yang, Weidong Zhang, Sam Kwong, 25 Aug 2025, Advancing Marine Research: UWSAM Framework and UIIS10K Dataset for Precise Underwater Instance Segmentation, https://arxiv.org/abs/2505.15581
  • Andy Bonnetto, Haozhe Qi, Franklin Leong, Matea Tashkovska, Mahdi Rad, Solaiman Shokur, Friedhelm Hummel, Silvestro Micera, Marc Pollefeys, Alexander Mathis, 25 Aug 2025, EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models, https://arxiv.org/abs/2506.01608
  • Mika Leo Hube, Filip Lemic, Ethungshan Shitiri, Gerard Calvo Bartra, Sergi Abadal, Xavier Costa Pérez, 22 Aug 2025, Set Transformer Architectures and Synthetic Data Generation for Flow-Guided Nanoscale Localization, https://arxiv.org/abs/2508.16200
  • Rafael Ayllón-Gavilán, David Guijo-Rubio, Antonio Manuel Gómez-Orellana, David Guijo-Rubio, Francisco Bérchez-Moreno, Víctor Manuel Vargas-Yun and Pedro A. Gutiérrez, 23 Jul 2025, TOC-UCO: a comprehensive repository of tabular ordinal classification datasets, https://arxiv.org/abs/2507.17348
  • Run-Ze Fan, Zengzhi Wang, Pengfei Liu, 22 Jul 2025, MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning, https://arxiv.org/abs/2507.16812
  • Varsha Ramineni, Hossein A. Rahmani, Emine Yilmaz, David Barber, 24 Jul 2025, Beyond Internal Data: Constructing Complete Datasets for Fairness Testing, https://arxiv.org/abs/2507.18561
  • Ruizhe Chen, Zhiting Fan, Tianze Luo, Heqing Zou, Zhaopeng Feng, Guiyang Xie, Hansheng Zhang, Zhuochen Wang, Zuozhu Liu, Huaijian Zhang, 24 Jul 2025, Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning, https://arxiv.org/abs/2507.18100
  • Juhwan Choi, Junehyoung Kwon, JungMin Yun, Seunguk Yu, YoungBin Kim, 24 Jul 2025, VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks, https://arxiv.org/abs/2407.19795
  • Xin Gu, Gautam Kamath, Zhiwei Steven Wu, 23 Jul 2025, Choosing Public Datasets for Private Machine Learning via Gradient Subspace Distance, https://arxiv.org/abs/2303.01256
  • Temiloluwa Prioleau, Baiying Lu, Yanjun Cui, 18 Jul 2025, Glucose-ML: A collection of longitudinal diabetes datasets for development of robust AI solutions, https://arxiv.org/abs/2507.14077
  • Joanna Komorniczak, 20 Jul 2025, Transforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic Algorithm, https://arxiv.org/abs/2507.15132
  • Zihang Ma and Qitian Yin, 21 Jul 2025, Graph Attention Specialized Expert Fusion Model for Node Classification: Based on Cora and Pubmed Datasets, https://arxiv.org/abs/2507.15784
  • Mohammed Alkhowaiter, Norah Alshahrani, Saied Alshahrani, Reem I. Masoud, Alaa Alzahrani, Deema Alnuhait, Emad A. Alghamdi, Khalid Almubarak, 19 Jul 2025, Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations, https://arxiv.org/abs/2507.14688
  • Giwon Lee, Wooseong Jeong, Daehee Park, Jaewoo Jeong, and Kuk-Jin Yoon, 21 Jul 2025, Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning, https://arxiv.org/abs/2507.04790
  • Bartlomiej Chybowski, Shima Abdullateef, Hollan Haule, Alfredo Gonzalez-Sulser, Javier Escudero, 10 Aug 2025, PySeizure: A single machine learning classifier framework to detect seizures in diverse datasets, https://arxiv.org/abs/2508.07253
  • Cem Ata Baykara, Saurav Raj Pandey, Ali Burak Ünal, Harlin Lee, and Mete Akgün, 11 Aug 2025, Federated Learning for Epileptic Seizure Prediction Across Heterogeneous EEG Datasets, https://arxiv.org/abs/2508.08159
  • Cristian Cosentino, Annamaria Defilippo, Marco Dossena, Christopher Irwin, Sara Joubbi and Pietro Liò, 10 Aug 2025, HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways, https://arxiv.org/abs/2508.07308
  • Sarina Penquitt, Jonathan Klees, Rinor Cakaj, Daniel Kondermann, Matthias Rottmann, Lars Schmarje, 6 Aug 2025, From Label Error Detection to Correction: A Modular Framework and Benchmark for Object Detection Datasets, https://arxiv.org/abs/2508.06556
  • Yuya Kawakami, Daniel Cayan, Dongyu Liu, and Kwan-Liu Ma, 8 Aug 2025, ClimateSOM: A Visual Analysis Workflow for Climate Ensemble Datasets, https://arxiv.org/abs/2508.06732
  • Sajjad Rezvani Boroujeni, Hossein Abedi, Tom Bush, 29 Jul 2025, Enhancing Glass Defect Detection with Diffusion Models: Addressing Imbalanced Datasets in Manufacturing Quality Control, https://arxiv.org/abs/2505.03134
  • Nicolas Lapautre, Maria Marchenko, Carlos Miguel Patiño, Xin Zhou, 14 Aug 2025, Natively Trainable Sparse Attention for Hierarchical Point Cloud Datasets, https://arxiv.org/abs/2508.10758
  • Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian, 14 Aug 2025, Measuring Diversity in Synthetic Datasets, https://arxiv.org/abs/2502.08512
  • Fabrizio Nunnari, Alakshendra Jyotsnaditya Ramkrishna Singh, Patrick Gebhard, 27 Jul 2025, Color histogram equalization and fine-tuning to improve expression recognition of (partially occluded) faces on sign language datasets, https://arxiv.org/abs/2507.20197
  • Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas, 28 Jul 2025, Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models, https://arxiv.org/abs/2507.20704
  • Aria Salari, Abtin Djavadifar, Xiangrui Liu, Homayoun Najjaran, 30 Jul 2025, Object Recognition Datasets and Challenges: A Review, https://arxiv.org/abs/2507.22361
  • Farid Ariai, Joel Mackenzie and Gianluca Demartini, 30 Jul 2025, Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges, https://arxiv.org/abs/2410.21306
  • Maziyar Panahi, 3 Aug 2025, OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets, https://arxiv.org/abs/2508.01630
  • Kenneth Enevoldsen, Kristian Nørgaard Jensen, Jan Kostkan, Balázs Szabó, Márton Kardos, Kirten Vad, Andrea Blasi Núñez, Gianluca Barmina, Jacob Nielsen, Rasmus Larsen, Peter Vahlstrup, Per Møldrup Dalum, Desmond Elliott, Lukas Galke, Peter Schneider-Kamp, Kristoffer Nielbo, 4 Aug 2025, Dynaword: From One-shot to Continuously Developed Datasets, https://arxiv.org/abs/2508.02271
  • Bhavesh Neekhra, Debayan Gupta, Partha Pratim Chakravarti, 5 Aug 2025, On the (In)Significance of Feature Selection in High-Dimensional Datasets, https://arxiv.org/abs/2508.03593
  • J. Alex Hurt, Trevor M. Bajkowski, Grant J. Scott, Curt H. Davis, 4 Aug 2025, Evaluation and Analysis of Deep Neural Transformers and Convolutional Neural Networks on Modern Remote Sensing Datasets, https://arxiv.org/abs/2508.02871
  • Wesley Brewer, Murali Meena Gopalakrishnan, Matthias Maiterth, Aditya Kashi, Jong Youl Choi, Pei Zhang, Stephen Nichols, Riccardo Balin, Miles Couchman, Stephen de Bruyn Kops, P.K. Yeung, Daniel Dotson, Rohini Uma-Vaideswaran, Sarp Oral, Feiyi Wang, 5 Aug 2025, Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training, https://arxiv.org/abs/2508.03872
  • Wei Liu, Zhongyu Niu, Lang Gao, Zhiying Deng, Jun Wang, Haozhao Wang, Ruixuan Li, 6 Aug 2025, Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets, https://arxiv.org/abs/2505.02118
  • Burak Can Kaplan, Hugo Cesar De Castro Carneiro, Stefan Wermter, 7 Aug 2025, Can Large Language Models Generate Effective Datasets for Emotion Recognition in Conversations?, https://arxiv.org/abs/2508.05474
  • Minwoo Oh, Minsu Park, Eunil Park, 8 Aug 2025, Solving Copyright Infringement on Short Video Platforms: Novel Datasets and an Audio Restoration Deep Learning Pipeline, https://arxiv.org/abs/2504.21772
  • Connor Wilhelm, Dan Ventura, 12 Aug 2025, Distilling Reinforcement Learning into Single-Batch Datasets, https://arxiv.org/abs/2508.09283
  • Viacheslav Barkov, Jonas Schmidinger, Robin Gebbers, Martin Atzmueller, 13 Aug 2025, Modern Neural Networks for Small Tabular Datasets: The New Default for Field-Scale Digital Soil Mapping?, https://arxiv.org/abs/2508.09888
  • Simon Klüttermann, Emmanuel Müller, 13 Aug 2025, Rare anomalies require large datasets: About proving the existence of anomalies, https://arxiv.org/abs/2508.09894
  • Aishik Mandal, Prottay Kumar Adhikary, Hiba Arnaout, Iryna Gurevych, Tanmoy Chakraborty, 13 Aug 2025, A Comprehensive Survey of Datasets for Clinical Mental Health AI Systems, https://arxiv.org/abs/2508.09809
  • Lingyu Chen, Yawen Zeng, Yue Wang, Peng Wan, Guo-chen Ning, Hongen Liao, Daoqiang Zhang, Fang Chen, 13 Aug 2025, COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets, https://arxiv.org/abs/2508.09886
  • Sai Krishna Mendu, Harish Yenala, Aditi Gulati, Shanu Kumar, Parag Agrawal, 12 Aug 2025, Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs, https://arxiv.org/abs/2505.02009
  • Gauri Jain, Dominik Rothenhäusler, Kirk Bansak, Elisabeth Paulson, 15 Aug 2025, CTRL Your Shift: Clustered Transfer Residual Learning for Many Small Datasets, https://arxiv.org/abs/2508.11144
  • SeungBum Ha, Taehwan Lee, Jiyoun Lim, Sung Whan Yoon, 17 Aug 2025, Benchmarking Federated Learning for Semantic Datasets: Federated Scene Graph Generation, https://arxiv.org/abs/2412.10436
  • Mizuki Ohira, Toshimichi Saito, 17 Aug 2025, A Recurrent Neural Network based Clustering Method for Binary Data Sets in Education, https://arxiv.org/abs/2508.13224
  • Wanjun Hu, 19 Aug 2025, Typed Topological Structures Of Datasets, https://arxiv.org/abs/2508.14008
  • Haohang Xu, Chengjie Liu, Qihang Wang, Wenhao Huang, Yongjian Xu, Weiyu Chen, Anlan Peng, Zhijun Li, Bo Li, Lei Qi, Jun Yang, Yuan Du, and Li Du, 27 Jun 2025, Image2Net: Datasets, Benchmark and Hybrid Framework to Convert Analog Circuit Diagrams into Netlists, https://arxiv.org/abs/2508.13157
  • Qian Zhanga, Ruilin Zhang, Jun Xiao, Yifan Liu and Zhe Wang, 12 Aug 2025, MCLPD: Multi-view Contrastive Learning for EEG-based PD Detection Across Datasets, https://arxiv.org/abs/2508.14073
  • Ishaan Mahapatra and Nihar R. Mahapatra, 14 Aug 2025, Systematic FAIRness Assessment of Open Voice Biomarker Datasets for Mental Health and Neurodegenerative Diseases, https://arxiv.org/abs/2508.14089
  • Corinna Coupette, Jeremy Wayland, Emily Simons, Bastian Rieck, 20 Aug 2025, No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets, https://arxiv.org/abs/2502.02379
  • Sen Yan, Chinmaya Kaundanya, Noel E. O'Connor, Suzanne Little, Mingming Liu, 22 Aug 2025, Machine Learning in Micromobility: A Systematic Review of Datasets, Techniques, and Applications, https://arxiv.org/abs/2508.16135
  • Sridevi Bonthu, S. Rama Sree, M.H.M. Krishna Prasad, 19 Aug 2025, Statistical Comparative Analysis of Semantic Similarities and Model Transferability Across Datasets for Short Answer Grading, https://arxiv.org/abs/2508.15837
  • Julian Oestreich and Lydia Müller, 21 Aug 2025, Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets, https://arxiv.org/abs/2508.15910
  • Andreas Loizou and Dimitrios Tsoumakos, 22 Aug 2025, Analytics Modelling over Multiple Datasets using Vector Embeddings, https://arxiv.org/abs/2502.17060
  • Aaron Rodrigues, Mahmood Hegazy and Azzam Naeem, 22 Aug 2025, Enhancing and Scaling Search Query Datasets for Recommendation Systems, https://arxiv.org/abs/2505.11176
  • Nikolaos Pavlidis, Vasilis Perifanis, Symeon Symeonidis, Pavlos S. Efraimidis, 24 Aug 2025, Large Language Models as Universal Predictors? An Empirical Study on Small Tabular Datasets, https://arxiv.org/abs/2508.17391
  • Prashant Gupta, 23 Aug 2025, Learning ON Large Datasets Using Bit-String Trees, https://arxiv.org/abs/2508.17083
  • Sarina Penquitt, Tobias Riedlinger, Timo Heller, Markus Reischl, Matthias Rottmann, 25 Aug 2025, Learning to Detect Label Errors by Making Them: A Method for Segmentation and Object Detection Datasets, https://arxiv.org/abs/2508.17930
  • Maximilian Burzer, Tobias King, Till Riedel, Michael Beigl and Tobias Röddiger, 12 Aug 2025, WHAR Datasets: An Open Source Library for Wearable Human Activity Recognition, https://arxiv.org/abs/2508.16604
  • Milad Hoseinpour, Vladimir Dvorkin, 25 Aug 2025, Constrained Diffusion Models for Synthesizing Representative Power Flow Datasets, https://arxiv.org/abs/2506.11281
  • Charles Jones and Ben Glocker, 4 Sep 2025, A Primer on Causal and Statistical Dataset Biases for Fair and Robust Image Analysis, https://arxiv.org/abs/2509.04295
  • Adrian Catalin Lutu, Ioana Pintilie, Elena Burceanu, Andrei Manolache, 4 Sep 2025, ChronoGraph: A Real-World Graph-Based Multivariate Time Series Dataset, https://arxiv.org/abs/2509.04449
  • Qizhou Wang, Hanxun Huang, Guansong Pang, Sarah Erfani, Christopher Leckie, 4 Sep 2025, AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds, https://arxiv.org/abs/2509.04345
  • Harald Störrle and Anastasia Hort, 19 Aug 2025, A Small Dataset May Go a Long Way: Process Duration Prediction in Clinical Settings, https://arxiv.org/abs/2509.03522
  • Matilde Contestabile, Chiara Ferrara, Alberto Giovannetti, Giovanni Parrillo, Andrea Vandin, 25 Aug 2025, The ProLiFIC dataset: Leveraging LLMs to Unveil the Italian Lawmaking Process, https://arxiv.org/abs/2509.03528
  • Iro Lim, Haein Ji, and Byungjun Kim, 4 Sep 2025, Decoding the Poetic Language of Emotion in Korean Modern Poetry: Insights from a Human-Labeled Dataset and AI Modeling, https://arxiv.org/abs/2509.03932
  • Junjie Wang, Yuxiang Zhang, Minghao Liu, Yin Zhang, Yatai Ji, Weihao Xuan, Nie Lin, Kang Zhu, Zhiqiang Lin, Yiming Ren, Chunyang Jiang, Yiyao Yu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Qunshu Liu, Yujiu Yang, Ge Zhang, Ruibin Yuan, Bei Chen, Wenhu Chen, 4 Sep 2025, PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents, https://arxiv.org/abs/2406.13923
  • Hong Ye Tan, Emma Slade, 3 Sep 2025, Dataset Distillation as Pushforward Optimal Quantization, https://arxiv.org/abs/2501.07681
  • Dizhan Xue, Shengsheng Qian, Chuanrui Hu, Changsheng Xu, 4 Sep 2025, Short-video Propagation Influence Rating: A New Real-world Dataset and A New Large Graph Model, https://arxiv.org/abs/2503.23746
  • Krittanon Kaewtawee, Wachiravit Modecrua, Krittin Pachtrachai, Touchapon Kraisingkorn, 5 Sep 2025, Cloning a Conversational Voice AI Agent from Call Recording Datasets for Telesales, https://arxiv.org/abs/2509.04871
  • Jiequn Han, Kui Ren, Nathan Soedjak, 4 Sep 2025, Instance-Wise Adaptive Sampling for Dataset Construction in Approximating Inverse Problem Solutions, https://arxiv.org/abs/2509.04583
  • Mohammed Khalil, Mohammed Sabry, 4 Sep 2025, ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation, https://arxiv.org/abs/2407.19835
  • Rustam Tagiew, Ilkay Wunderlich, Mark Sastuba, Kilian Göller and Steffen Seitz, 5 Sep 2025, RailGoerl24: Görlitz Rail Test Center CV Dataset 2024, https://arxiv.org/abs/2504.00204
  • Janet Wang, Xin Hu, Yunbei Zhang, Diabate Almamy, Vagamon Bamba, Konan Amos Sébastien Koffi, Yao Koffi Aubin, Zhengming Ding, Jihun Hamm, Rie R. Yotsu, 26 Aug 2025, eSkinHealth: A Multimodal Dataset for Neglected Tropical Skin Diseases, https://arxiv.org/abs/2508.18608
  • Nowshin Sharmily, Rusab Sarmun, Muhammad E. H. Chowdhury, Mir Hamidul Hussain, Saad Bin Abul Kashem, Molla E Majid, and Amith Khandakar, 23 Aug 2025, Automated Landfill Detection Using Deep Learning: A Comparative Study of Lightweight and Custom Architectures with the AerialWaste Dataset, https://arxiv.org/abs/2508.18315
  • Jaehwan Jeong, Tuan-Anh Vu, Mohammad Jony, Shahab Ahmad, Md. Mukhlesur Rahman, Sangpil Kim, M. Khalid Jawed, 26 Aug 2025, AgriChrono: A Multi-modal Dataset Capturing Crop Growth and Lighting Variability with a Field Robot, https://arxiv.org/abs/2508.18694
  • Ilias Driouich, Hongliu Cao, Eoin Thomas, 26 Aug 2025, Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework, https://arxiv.org/abs/2508.18929
  • Soumen Ghosh, Christine Jestin Hannan, Rajat Vashistha, Parveen Kundu, Sandra Brosda, Lauren G. Aoude, James Lonie, Andrew Nathanson, Jessica Ng, Andrew P. Barbour, Viktor Vegh, 26 Aug 2025, Stress-testing cross-cancer generalizability of 3D nnU-Net for PET-CT tumor segmentation: multi-cohort evaluation with novel oesophageal and lung cancer datasets, https://arxiv.org/abs/2508.18612
  • Ri Su, Zhao Chen, Caleb Chen Cao, Nan Tang, Lei Chen, 27 Aug 2025, SCAR: A Characterization Scheme for Multi-Modal Dataset, https://arxiv.org/abs/2508.19659
  • Sumon Kanti Dey, Jeanne M. Powell, Azra Ismail, Jeanmarie Perrone, Abeed Sarker, 26 Aug 2025, Inference Gap in Domain Expertise and Machine Intelligence in Named Entity Recognition: Creation of and Insights from a Substance Use-related Dataset, https://arxiv.org/abs/2508.19467
  • Anusha Kamath, Kanishk Singla, Rakesh Paul, Raviraj Joshi, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar, 27 Aug 2025, Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis, https://arxiv.org/abs/2508.19831
  • Aakash Tripathi, Asim Waqas, Matthew B. Schabath, Yasin Yilmaz, Ghulam Rasool, 27 Aug 2025, HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models, https://arxiv.org/abs/2405.07460
  • Shuo Shao, Yiming Li, Mengren Zheng, Zhiyang Hu, Yukun Chen, Boheng Li, Yu He, Junfeng Guo, Dacheng Tao, Zhan Qin, 27 Aug 2025, DATABench: Evaluating Dataset Auditing in Deep Learning from an Adversarial Perspective, https://arxiv.org/abs/2507.05622
  • Yijia Guo, Junqing Zhang, Y.-W. Peter Hong, 28 Aug 2025, Practical Physical Layer Authentication for Mobile Scenarios Using a Synthetic Dataset Enhanced Deep Learning Approach, https://arxiv.org/abs/2508.20861
  • Ali Ramlaoui, Martin Siron, Inel Djafar, Joseph Musielewicz, Amandine Rossello, Victor Schmidt, Alexandre Duval, 28 Aug 2025, LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling, https://arxiv.org/abs/2508.20875
  • Seunghyeon Jung, Seoyoung Hong, Jiwoo Jeong, Seungwon Jeong, Jaerim Choi, Hoki Kim, Woojin Lee, 28 Aug 2025, CaddieSet: A Golf Swing Dataset with Human Joint Features and Ball Information, https://arxiv.org/abs/2508.20491
  • Hashim Ali, Surya Subramani, Lekha Bollinani, Nithin Sai Adupa, Sali El-Loh, Hafiz Malik, 28 Aug 2025, Multilingual Dataset Integration Strategies for Robust Audio Deepfake Detection: A SAFE Challenge System, https://arxiv.org/abs/2508.20983
  • Lianpeng Qiao, Ziqi Cao, Kaiyu Feng, Ye Yuan, Guoren Wang, 28 Aug 2025, Graph-Based Feature Augmentation for Predictive Tasks on Relational Datasets, https://arxiv.org/abs/2508.20986
  • Jeongkyun Park, Jung-Wook Hwang, Kwanghee Choi, Seung-Hyun Lee, Jun Hwan Ahn, Rae-Hong Park, Hyung-Min Park, 28 Aug 2025, OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset, https://arxiv.org/abs/2301.06375
  • João Valente, Atabak Dehban, Rodrigo Ventura, 29 Aug 2025, CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models, https://arxiv.org/abs/2508.21732
  • Aishwarya Mirashi, Ananya Joshi, Raviraj Joshi, 29 Aug 2025, L3Cube-MahaSTS: A Marathi Sentence Similarity Dataset and Models, https://arxiv.org/abs/2508.21569
  • Nidhi Kowtal, Raviraj Joshi, 29 Aug 2025, L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models, https://arxiv.org/abs/2506.00863
  • Tung Nguyen, Harkanwar Singh, Nilay Naharas, Lucas Bandarkar, Aditya Grover, 31 Aug 2025, IndiaWeatherBench: A Dataset and Benchmark for Data-Driven Regional Weather Forecasting over India, https://arxiv.org/abs/2509.00653
  • Hirofumi Tsuruta, Masaya Kumagai, 1 Sep 2025, MatPROV: A Provenance Graph Dataset of Material Synthesis Extracted from Scientific Literature, https://arxiv.org/abs/2509.01042
  • Smayan Khanna, Doruk Efe Gökmen, Risi Kondor, Vincenzo Vitelli, 1 Sep 2025, Graph Contrastive Learning versus Untrained Baselines: The Role of Dataset Size, https://arxiv.org/abs/2509.01541
  • Austin Meek, Carlos H. Mendoza-Cardenas, and Austin J. Brockmeier, 1 Sep 2025, Convolutional Monge Mapping between EEG Datasets to Support Independent Component Labeling, https://arxiv.org/abs/2509.01721
  • Han Chen, Hanchen Wang, Hongmei Chen, Ying Zhang, Lu Qin, Wenjie Zhang, 2 Sep 2025, HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis, https://arxiv.org/abs/2509.02113
  • Tongtong Feng, Xin Wang, Feilin Han, Leping Zhang, Wenwu Zhu, 25 Aug 2025, U2UData-2: A Scalable Swarm UAVs Autonomous Flight Dataset for Long-horizon Tasks, https://arxiv.org/abs/2509.00055
  • Kun Qiu, Ying Wang, Baoqian Li, Wenjun Zhu, 31 Aug 2025, Unsupervised Dataset Cleaning Framework for Encrypted Traffic Classification, https://arxiv.org/abs/2509.00701
  • Artur Díaz-Juan, Coloma Ballester, Gloria Haro, 1 Sep 2025, SoccerHigh: A Benchmark Dataset for Automatic Soccer Video Summarization, https://arxiv.org/abs/2509.01439
  • Seungkyu Lee, Nalim Kim, Yohan Jo, 1 Sep 2025, In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents, https://arxiv.org/abs/2509.01560
  • Nishant Tanksale, Tanmay Kokate, Darshan Gohad, Sarvadnyaa Barate, Raviraj Joshi, 2 Sep 2025, L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages, https://arxiv.org/abs/2509.02503
  • Hallee E. Wong and Jose Javier Gonzalez Ortiz and John Guttag and Adrian V. Dalca, 31 Aug 2025, MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance, https://arxiv.org/abs/2412.15058
  • Yanlin Zhang, Sungyong Chung, Nachuan Li, Dana Monzer, Hani S. Mahmassani, Samer H. Hamdar, and Alireza Talebpour, 3 Sep 2025, Can the Waymo Open Motion Dataset Support Realistic Behavioral Modeling? A Validation Study with Naturalistic Trajectories, https://arxiv.org/abs/2509.03515
  • Daniel C. Castro, Aurelia Bustos, Shruthi Bannur, Stephanie L. Hyland, Kenza Bouzid, Maria Teodora Wetscherek, Maria Dolores Sánchez-Valverde, Lara Jaques-Pérez, Lourdes Pérez-Rodríguez, Kenji Takeda, José María Salinas, Javier Alvarez-Valle, Joaquín Galant Herrero, Antonio Pertusa, 3 Sep 2025, PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation, https://arxiv.org/abs/2411.05085
  • Braeden Sherritt, Isar Nejadgholi, Efstratios Aivaliotis, Khaled Mslmani and Marzieh Amini, 3 Sep 2025, WildFireCan-MMD: A Multimodal Dataset for Classification of User-Generated Content During Wildfires in Canada, https://arxiv.org/abs/2504.13231
  • Liming Xu and Yunbo Long and Alexandra Brintrup, 30 Aug 2025, SynDelay: A Synthetic Dataset for Delivery Delay Prediction, https://arxiv.org/abs/2509.05325
  • Seyed Muhammad Hossein Mousavi, Atiye Ilanloo, 31 Aug 2025, MVRS: The Multimodal Virtual Reality Stimuli-based Emotion Recognition Dataset, https://arxiv.org/abs/2509.05330
  • Honggang Jia, Xiucheng Wang, Nan Cheng, Ruijin Sun, Changle Li, 8 Sep 2025, UrbanMIMOMap: A Ray-Traced MIMO CSI Dataset with Precoding-Aware Maps and Benchmarks, https://arxiv.org/abs/2509.06270
  • Yunfei Guo, Tao Zhang, Wu Huang, Yao Song, 30 Aug 2025, A Dataset Generation Scheme Based on Video2EEG-SPGN-Diffusion for SEED-VD, https://arxiv.org/abs/2509.05321
  • Youssef Chakir and Iyad Lahsen-Cherif, 31 Aug 2025, ForensicsData: A Digital Forensics Dataset for Large Language Models, https://arxiv.org/abs/2509.05331
  • Ahad Jawaid, Yu Xiang, 5 Sep 2025, OpenEgo: A Large-Scale Multimodal Egocentric Dataset for Dexterous Manipulation, https://arxiv.org/abs/2509.05513
  • Leo Ho, Yinghao Huang, Dafei Qin, Mingyi Shi, Wangpok Tse, Wei Liu, Junichi Yamagishi, Taku Komura, 6 Sep 2025, InterAct: A Large-Scale Dataset of Dynamic, Expressive and Interactive Activities between Two People in Daily Scenarios, https://arxiv.org/abs/2509.05747
  • Phongsakon Mark Konrad, Andrei-Alexandru Popa, Yaser Sabzehmeidani, Liang Zhong, Elisa A. Liehn, Serkan Ayvaz, 7 Sep 2025, Challenges in Deep Learning-Based Small Organ Segmentation: A Benchmarking Perspective for Medical Research with Limited Datasets, https://arxiv.org/abs/2509.05892
  • Omkar Prabhu, 7 Sep 2025, Khana: A Comprehensive Indian Cuisine Dataset, https://arxiv.org/abs/2509.06006
  • Valentin Quesnel and Damien Sileo, 8 Sep 2025, Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem, https://arxiv.org/abs/2509.06809
  • Jinrui Yang, Timothy Baldwin, Trevor Cohn, 3 Nov 2023, Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval, https://arxiv.org/abs/2311.01870
  • Zhyar Rzgar K Rostam and Gábor Kertész, 7 Sep 2025, Advancing Scientific Text Classification: Fine-Tuned Models with Dataset Expansion and Hard-Voting, https://arxiv.org/abs/2504.19021
  • Peter Mortimer, Raphael Hagmanns, Miguel Granero, Thorsten Luettel, Janko Petereit, Hans-Joachim Wuensche, 8 Sep 2025, The GOOSE Dataset for Perception in Unstructured Environments, https://arxiv.org/abs/2310.16788
  • Nicholas Sung, Steven Spreizer, Mohamed Elrefaie, Kaira Samuel, Matthew C. Jones, and Faez Ahmed, 8 Sep 2025, BlendedNet: A Blended Wing Body Aircraft Dataset and Surrogate Model for Aerodynamic Predictions, https://arxiv.org/abs/2509.07209
  • Cedric Caruzzo, Jong Chul Ye, 2 Sep 2025, CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis, https://arxiv.org/abs/2509.06986
  • Amelia Kovacs, Jerry Chee, Kimia Kazemian, Sarah Dean, 8 Sep 2025, Datasets for Navigating Sensitive Topics in Recommendation Systems, https://arxiv.org/abs/2509.07269
  • Gianluca Amprimo, Alberto Ancilotto, Alessandro Savino, Fabio Quazzolo, Claudia Ferraris, Gabriella Olmo, Elisabetta Farella, Stefano Di Carlo, 9 Sep 2025, EHWGesture -- A dataset for multimodal understanding of clinical gestures, https://arxiv.org/abs/2509.07525
  • Seyd Teymoor Seydi, 9 Sep 2025, Deep Learning-Based Burned Area Mapping Using Bi-Temporal Siamese Networks and AlphaEarth Foundation Datasets, https://arxiv.org/abs/2509.07852
  • Neeshu Rathi, Sanjeev Kumar, 8 Sep 2025, A Quantum Bagging Algorithm with Unsupervised Base Learners for Label Corrupted Datasets, https://arxiv.org/abs/2509.07040
  • Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma, 8 Sep 2025, Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges, https://arxiv.org/abs/2507.02074
  • Tong Chen, Raghavendra Selvan, 12 Sep 2025, A Discrepancy-Based Perspective on Dataset Condensation, https://arxiv.org/abs/2509.10367
  • Bruno Yui Yamate, Thais Rodrigues Neubauer, Marcelo Fantinato, Sarajane Marques Peres, 18 Aug 2025, Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation, https://arxiv.org/abs/2509.09684
  • Kaikai Zhao, Zhaoxiang Liu, Peng Wang, Xin Wang, Zhicheng Ma, Yajun Xu, Wenjing Zhang, Yibing Nan, Kai Wang, Shiguo Lian, 10 Sep 2025, MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance, https://arxiv.org/abs/2509.09730
  • Ying Yuan, Xing-Yue Monica Ge, Aaron Archer Waterman, Tommaso Biancalani, David Richmond, Yogesh Pandit, Avtar Singh, Russell Littman, Jin Liu, Jan-Christian Huetter, Vladimir Ermakov, 10 Sep 2025, HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets, https://arxiv.org/abs/2509.09740
  • Utsab Saha, Tanvir Muntakim Tonoy, and Hafiz Imtiaz, 12 Sep 2025, Differentially Private Decentralized Dataset Synthesis Through Randomized Mixing with Correlated Noise, https://arxiv.org/abs/2509.10385
  • Emily Kaczmarek, Justin Szeto, Brennan Nichyporuk, Tal Arbel, 12 Sep 2025, SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimer's Prediction Tasks and Datasets, https://arxiv.org/abs/2509.10453
  • Tong Chen, Raghavendra Selvan, 12 Sep 2025, Is Adversarial Training with Compressed Datasets Effective?, https://arxiv.org/abs/2402.05675
  • Marianna Nezhurina and Jörg Franke and Taishi Nakamura and Timur Carstensen and Niccolò Ajroldi and Ville Komulainen and David Salinas and Jenia Jitsev, 12 Sep 2025, Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison, https://arxiv.org/abs/2509.09009
  • Maria Risques and Kratika Bhagtani and Amit Kumar Singh Yadav and Edward J. Delp, 11 Sep 2025, HISPASpoof: A New Dataset For Spanish Speech Forensics, https://arxiv.org/abs/2509.09155
  • Cynthia Moreira Maia, Lucas B. V. de Amorim, George D. C. Cavalcanti, and Rafael M. O. Cruz, 11 Sep 2025, PIPES: A Meta-dataset of Machine Learning Pipelines, https://arxiv.org/abs/2509.09512
  • Lei Wang, Piotr Koniusz, Yongsheng Gao, 11 Sep 2025, Video Understanding by Design: How Datasets Shape Architectures and Insights, https://arxiv.org/abs/2509.09151
  • Doha Nam, Taehyoun Kim, Duksan Ryu, Jongmoon Baik, 11 Sep 2025, Probing Pre-trained Language Models on Code Changes: Insights from ReDef, a High-Confidence Just-in-Time Defect Prediction Dataset, https://arxiv.org/abs/2509.09192
  • Victor Livernoche, Akshatha Arodi, Andreea Musulan, Zachary Yang, Adam Salvail, Gaétan Marceau Caron, Jean-François Godbout, Reihaneh Rabbany, 11 Sep 2025, OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection, https://arxiv.org/abs/2509.09495
  • Meghan Wilkinson and Robert H Thomson, 11 Sep 2025, What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets, https://arxiv.org/abs/2509.09564
  • Kordel K. France, Ovidiu Daescu, 11 Sep 2025, Diffusion Graph Neural Networks for Robustness in Olfaction Sensors and Datasets, https://arxiv.org/abs/2506.00455
  • Henning Höfener, Farina Kock, Martina Pontones, Tabita Ghete, David Pfrang, Nicholas Dickel, Meik Kunz, Daniela P. Schacherer, David A. Clunie, Andrey Fedorov, Max Westphal, Markus Metzler, 19 Sep 2025, From Data to Diagnosis: A Large, Comprehensive Bone Marrow Dataset and AI Methods for Childhood Leukemia Prediction, https://arxiv.org/abs/2509.15895
  • Shubham Kavane, Kajol Kulkarni, Harald Koestler, 17 Sep 2025, ChannelFlow-Tools: A Standardized Dataset Creation Pipeline for 3D Obstructed Channel Flows, https://arxiv.org/abs/2509.15236
  • Nomi Yu, Md Ferdous Alam, A. John Hart, and Faez Ahmed, 17 Sep 2025, GenCAD-3D: CAD Program Generation using Multimodal Latent Space Alignment and Synthetic Dataset Balancing, https://arxiv.org/abs/2509.15246
  • Benedikt W. Hosp, 19 Sep 2025, FOVAL: Calibration-Free and Subject-Invariant Fixation Depth Estimation Across Diverse Eye-Tracking Datasets, https://arxiv.org/abs/2408.03591
  • Michael Galarnyk, Rutwik Routu, Vidhyakshaya Kannan, Kosha Bheda, Prasun Banerjee, Agam Shah, Sudheer Chava, 19 Sep 2025, ConfReady: A RAG based Assistant and Dataset for Conference Checklist Responses, https://arxiv.org/abs/2408.04675
  • Hanjun Luo, Yingbin Jin, Xinfeng Li, Xuecheng Liu, Ruizhe Chen, Tong Shang, Kun Wang, Qingsong Wen, Zuozhu Liu, 19 Sep 2025, DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition, https://arxiv.org/abs/2409.11022
  • Maximus Powers, Shaina Raza, Alex Chang, Rehana Riaz, Umang Mavani, Harshitha Reddy Jonala, Ansh Tiwari, Hua Wei, 15 Sep 2025, Responsible AI in NLP: GUS-Net Span-Level Bias Detection Dataset and Benchmark for Generalizations, Unfairness, and Stereotypes, https://arxiv.org/abs/2410.08388
  • Rohan Tan Bhowmik, Youn Soo Jung, Juan Aguilera, Mary Prunicki, Kari Nadeau, 14 Sep 2025, California Wildfire Inventory (CAWFI): An Extensive Dataset for Predictive Techniques based on Artificial Intelligence, https://arxiv.org/abs/2509.11015
  • Farbod Bijary, Mohsen Ebadpour, Amirhosein Tajbakhsh, 14 Sep 2025, Agentic Username Suggestion and Multimodal Gender Detection in Online Platforms: Introducing the PNGT-26K Dataset, https://arxiv.org/abs/2509.11136
  • Yonghao Weng and Liqiang Gao and Linwu Zhu and Jian Huang, 14 Sep 2025, MatQnA: A Benchmark Dataset for Multi-modal Large Language Models in Materials Characterization and Analysis, https://arxiv.org/abs/2509.11335
  • Loka Li, Wong Yu Kang, Minghao Fu, Guangyi Chen, Zhenhao Chen, Gongxu Luo, Yuewen Sun, Salman Khan, Peter Spirtes, Kun Zhang, 14 Sep 2025, PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits, https://arxiv.org/abs/2509.11362
  • Grigori Fursin and Daniel Altunay, 14 Sep 2025, Framing AI System Benchmarking as a Learning Task: FlexBench and the Open MLPerf Dataset, https://arxiv.org/abs/2509.11413
  • Zhizheng Wang, Yifan Yang, Qiao Jin, Zhiyong Lu, 11 Sep 2025, Gene-R1: Reasoning with Data-Augmented Lightweight LLMs for Gene Set Analysis, https://arxiv.org/abs/2509.10575
  • Haiyu Yang, Enhong Liu, Jennifer Sun, Sumit Sharma, Meike van Leerdam, Sebastien Franceschini, Puchun Niu, Miel Hostens, 15 Sep 2025, A Computer Vision Pipeline for Individual-Level Behavior Analysis: Benchmarking on the Edinburgh Pig Dataset, https://arxiv.org/abs/2509.12047
  • Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca, 15 Sep 2025, PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams - Dataset Construction and Evaluation, https://arxiv.org/abs/2509.11517
  • Mikhail Kulyabin, Jan Joosten, Choro Ulan uulu, Nuno Miguel Martins Pacheco, Fabian Ries, Filippos Petridis, Jan Bosch, and Helena Holmström Olsson, 15 Sep 2025, User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums, https://arxiv.org/abs/2509.11777
  • Daniel Lepe-Soltero, Thierry Artières, Anaïs Baudot, Paul Villoutreix, 15 Sep 2025, MODIS: Multi-Omics Data Integration for Small and unpaired datasets, https://arxiv.org/abs/2503.18856
  • Christian Internò, Andrea Castellani, Sebastian Schmitt, Fabio Stella, Barbara Hammer, 15 Sep 2025, Industrial Energy Disaggregation with Digital Twin-generated Dataset and Efficient Data Augmentation, https://arxiv.org/abs/2506.20525
  • Julian Junyan Wang, Victor Xiaoqi Wang, 14 Sep 2025, Leveraging Large Language Models to Democratize Access to Costly Datasets for Academic Research, https://arxiv.org/abs/2412.02065
  • Amy Rafferty, Rishi Ramaesh, Ajitha Rajan, 18 Sep 2025, Limitations of Public Chest Radiography Datasets for Artificial Intelligence: Label Quality, Domain Shift, Bias and Evaluation Challenges, https://arxiv.org/abs/2509.15107
  • Happymore Masoka, 10 Sep 2025, Advancing Conversational AI with Shona Slang: A Dataset and Hybrid Model for Digital Inclusion, https://arxiv.org/abs/2509.14249
  • Roman Kovalchuk, Mariana Romanyshyn, Petro Ivaniuk, 18 Sep 2025, Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction, https://arxiv.org/abs/2509.14504
  • Luca Rolshoven, Vishvaksenan Rasiah, Srinanda Brügger Bose, Sarah Hostettler, Lara Burkhalter, Matthias Stürmer, Joel Niklaus, 18 Sep 2025, Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland, https://arxiv.org/abs/2410.13456
  • Akshay Paruchuri, Maryam Aziz, Rohit Vartak, Ayman Ali, Best Uchehara, Xin Liu, Ishan Chatterjee, Monica Agrawal, 18 Sep 2025, "What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets, https://arxiv.org/abs/2506.21532
  • Christopher Wiedeman, Anastasiia Sarmakeeva, Elena Sizikova, Daniil Filienko, Miguel Lago, Jana G. Delfino, Aldo Badano, 18 Sep 2025, T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images, https://arxiv.org/abs/2507.04038
  • Woohyun Cho and Youngmin Kim and Sunghyun Lee and Youngjae Yu, 18 Sep 2025, MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation, https://arxiv.org/abs/2505.18614
  • Bingjian Yang, Danni Xu, Kaipeng Niu, Wenxuan Liu, Zheng Wang, Mohan Kankanhalli, 8 Sep 2025, A New Dataset and Benchmark for Grounding Multimodal Misinformation, https://arxiv.org/abs/2509.08008
  • Yonghyun Kim, Junhyung Park, Joonhyung Bae, Kirak Kim, Taegyun Kwon, Alexander Lerch, Juhan Nam, 10 Sep 2025, PianoVAM: A Multimodal Piano Performance Dataset, https://arxiv.org/abs/2509.08800
  • Rafał Osadnik, Pablo Gómez, Eleni Bohacek, Rickbir Bahia, 9 Sep 2025, MCTED: A Machine-Learning-Ready Dataset for Digital Elevation Model Generation From Mars Imagery, https://arxiv.org/abs/2509.08027
  • Shambhavi Krishna, Atharva Naik, Chaitali Agarwal, Sudharshan Govindan, Taesung Lee, Haw-Shiuan Chang, 17 Sep 2025, Latent Traits and Cross-Task Transfer: Deconstructing Dataset Interactions in LLM Fine-tuning, https://arxiv.org/abs/2509.13624
  • Adel ElZemity, Budi Arief and Shujun Li, 17 Sep 2025, CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning, https://arxiv.org/abs/2503.09334
  • Rajvee Sheth, Himanshu Beniwal, Mayank Singh, 17 Sep 2025, COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing, https://arxiv.org/abs/2503.21670
  • Sean Michael Kerner, 17 Oct 2025, World's largest open-source multimodal dataset delivers 17x training efficiency, unlocking enterprise AI that connects documents, audio and video, https://venturebeat.com/data-infrastructure/worlds-largest-open-source-multimodal-dataset-delivers-17x-training
  • Shriram Karpoora Sundara Pandian and Ali Baheri, 1 Oct 2025, Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets, https://arxiv.org/abs/2510.01479
  • Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, Sachin Sharma, John D. Kelleher, 2 Oct 2025, Pre-Hoc Predictions in AutoML: Leveraging LLMs to Enhance Model Selection and Benchmarking for Tabular datasets, https://arxiv.org/abs/2510.01842
  • Leroy Z. Wang, 21 Sep 2025, Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset, https://arxiv.org/abs/2510.01219
  • Ahmed Adel Attia, Jing Liu, Carol Espy Wilson, 1 Oct 2025, RealClass: A Framework for Classroom Speech Simulation with Public Datasets and Game Engines, https://arxiv.org/abs/2510.01462
  • Ying-Ren Chien, Po-Heng Chou, You-Jie Peng, Chun-Yuan Huang, Hen-Wai Tsao, and Yu Tsao, 2 Oct 2025, NGGAN: Noise Generation GAN Based on the Practical Measurement Dataset for Narrowband Powerline Communications, https://arxiv.org/abs/2510.01850
  • Jong Bum Won, Wesley De Neve, Joris Vankerschaver, Utku Ozbulak, 2 Oct 2025, SpurBreast: A Curated Dataset for Investigating Spurious Correlations in Real-world Breast MRI Classification, https://arxiv.org/abs/2510.02109
  • Fatou Ndiaye Mbodji, El-hacen Diallo, Jordan Samhi, Kui Liu, Jacques Klein, Tegawendé F. Bissyande, 2 Oct 2025, SIEVE: Towards Verifiable Certification for Code-datasets, https://arxiv.org/abs/2510.02166
  • Krishna Teja Chitty-Venkata, Murali Emani, 2 Oct 2025, ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models, https://arxiv.org/abs/2510.01582
  • Liangliang Zhang, Zhuorui Jiang, Hongliang Chi, Haoyang Chen, Mohammed Elkoumy, Fali Wang, Qiong Wu, Zhengyi Zhou, Shirui Pan, Suhang Wang, Yao Ma, 1 Oct 2025, Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking, https://arxiv.org/abs/2505.23495
  • Monoshi Kumar Roy, Simin Chen, Benjamin Steenhoek, Jinjun Peng, Gail Kaiser, Baishakhi Ray, Wei Le, 2 Oct 2025, CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning, https://arxiv.org/abs/2506.00750
  • Emma Kondrup, Sebastian Sabry, Hussein Abdallah, Zachary Yang, James Zhou, Kellin Pelrine, Jean-François Godbout, Michael M. Bronstein, Reihaneh Rabbany, Shenyang Huang, 2 Oct 2025, CrediBench: Building Web-Scale Network Datasets for Information Integrity, https://arxiv.org/abs/2509.23340
  • Rizal Fathony, Igor Melnyk, Owen Reinert, Nam H. Nguyen, Daniele Rosa, C. Bayan Bruss, 13 Oct 2025, Integrating Sequential and Relational Modeling for User Events: Datasets and Prediction Tasks, https://arxiv.org/abs/2510.11903
  • Benjamin W. Nelson, Celeste Wong, Matthew T. Silvestrini, Sooyoon Shin, Alanna Robinson, Jessica Lee, Eric Yang, John Torous, Andrew Trister, 14 Oct 2025, An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations, https://arxiv.org/abs/2510.12083
  • Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, Wei Wang, 14 Oct 2025, Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series, https://arxiv.org/abs/2506.10412
  • Shaharyar Ahmed Khan Tareen, Filza Khan Tareen, 14 Oct 2025, Optimally Deep Networks -- Adapting Model Depth to Datasets for Superior Efficiency, https://arxiv.org/abs/2510.10764
  • Giacomo Gonella, Gian Maria Campedelli, Stefano Menini, Marco Guerini, 13 Oct 2025, CrisiText: A dataset of warning messages for LLM training in emergency communication, https://arxiv.org/abs/2510.09243
  • Hayat Rajani, Valerio Franchi, Borja Martinez-Clavel Valles, Raimon Ramos, Rafael Garcia and Nuno Gracias, 13 Oct 2025, BenthiCat: An opti-acoustic dataset for advancing benthic classification and habitat mapping, https://arxiv.org/abs/2510.04876
  • Zulkaif Sajjad, Furqan Shaukat, Junaid Mir, 1 Oct 2025, U-DFA: A Unified DINOv2-Unet with Dual Fusion Attention for Multi-Dataset Medical Segmentation, https://arxiv.org/abs/2510.00585
  • Yannick Hauri, Luca A. Lanzendörfer, Till Aczel, 1 Oct 2025, Virtual Fashion Photo-Shoots: Building a Large-Scale Garment-Lookbook Dataset, https://arxiv.org/abs/2510.00633
  • Shuqing Li, Chenran Zhang, Cuiyun Gao, Michael R. Lyu, 1 Oct 2025, XRZoo: A Large-Scale and Versatile Dataset of Extended Reality (XR) Applications, https://arxiv.org/abs/2412.06759
  • Ruixu Zhang and Yuran Wang and Xinyi Hu and Chaoyu Mai and Wenxuan Liu and Danni Xu and Xian Zhong and Zheng Wang, 1 Oct 2025, Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset, https://arxiv.org/abs/2509.20715
  • Félix Therrien, Jamal Abou Haibeh, Divya Sharma, Rhiannon Hendley, Leah Wairimu Mungai, Sun Sun, Alain Tchagang, Jiang Su, Samuel Huberman, Yoshua Bengio, Hongyu Guo, Alex Hernández-García, Homin Shin, 1 Oct 2025, OBELiX: A Curated Dataset of Crystal Structures and Experimentally Measured Ionic Conductivities for Lithium Solid-State Electrolytes, https://arxiv.org/abs/2502.14234
  • Sarmistha Das, R E Zera Marveen Lyngkhoi, Kirtan Jain, Vinayak Goyal, Sriparna Saha, and Manish Gupta, 24 Sep 2025, When Words Can't Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset, https://arxiv.org/abs/2509.19952
  • Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Hojjat Torabi Goudarzi, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte, 24 Sep 2025, LEMUR Neural Network Dataset: Towards Seamless AutoML, https://arxiv.org/abs/2504.10552
  • Mahdi Zakizadeh and Mohammad Taher Pilehvar, 24 Sep 2025, Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets, https://arxiv.org/abs/2501.01168
  • Liangrui Pan, Qingchun Liang, Shen Zhao, Songqing Fan, Shaoliang Peng, 24 Sep 2025, PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset, https://arxiv.org/abs/2506.00096
  • Lisa Benato, Wahid Bhimji, Paolo Calafiura, Ragansu Chakkappai, Po-Wen Chang, Yuan-Tang Chou, Sascha Diefenbacher, Jordan Dudley, Ibrahim Elsharkawy, Steven Farrell, Aishik Ghosh, Cristina Giordano, Isabelle Guyon, Chris Harris, Yota Hashizume, Shih-Chieh Hsu, Elham E. Khoda, Claudius Krause, Ang Li, Benjamin Nachman, Peter Nugent, David Rousseau, Robert Schoefbeck, Maryam Shooshtari, Dennis Schwarz, Benjamin Thorne, Ihsan Ullah, Daohan Wang, Yulei Zhang, 24 Sep 2025, FAIR Universe HiggsML Uncertainty Dataset and Competition, https://arxiv.org/abs/2410.02867
  • Akwasi Asare, Ulas Bagci, 24 Sep 2025, PolypSeg-GradCAM: Towards Explainable Computer-Aided Gastrointestinal Disease Detection Using U-Net Based Segmentation and Grad-CAM Visualization on the Kvasir Dataset, https://arxiv.org/abs/2509.18159
  • Xin Yang, Yuhang Zhang, Wei Li, Xin Lin, Wenbin Zou, Chen Xu, 28 Oct 2025, UniPlanner: A Unified Motion Planning Framework for Autonomous Vehicle Decision-Making Systems via Multi-Dataset Integration, https://arxiv.org/abs/2510.24166
  • Xinqi Li, Yiqun Liu, Shan Jiang, Enrong Zheng, Huaijin Zheng, Wenhao Dai, Haodong Deng, Dianhai Yu, Yanjun Ma, 28 Oct 2025, GraphNet: A Large-Scale Computational Graph Dataset for Tensor Compiler Research, https://arxiv.org/abs/2510.24035
  • Aaron Scott, Maike Züfle, Jan Niehues, 28 Oct 2025, MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations, https://arxiv.org/abs/2510.24178
  • Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig, 28 Oct 2025, Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents, https://arxiv.org/abs/2510.24702
  • J. T. Fry, Xinyi Hope Fu, Zhenghao Fu, Kaliroe M. W. Pappas, Lindley Winslow, Aobo Li, 28 Oct 2025, TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising, https://arxiv.org/abs/2406.04378
  • Junseo Kim, Jongwook Han, Dongmin Choi, Jongwook Yoon, Eun-Ju Lee, Yohan Jo, 28 Oct 2025, PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings, https://arxiv.org/abs/2506.00481
  • Bo Liu, Xiangyu Zhao, Along He, Yidi Chen, Huazhu Fu, Xiao-Ming Wu, 28 Oct 2025, GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning, https://arxiv.org/abs/2506.17939
  • Trajan Murphy, Akshunna S. Dogra, Hanfeng Gu, Caleb Meredith, Mark Kon, Julio Enrique Castrillion-Candas, 22 Oct 2025, FINDER: Feature Inference on Noisy Datasets using Eigenspace Residuals, https://arxiv.org/abs/2510.19917
  • Shumin Li, 23 Oct 2025, Assessing the Feasibility of Early Cancer Detection Using Routine Laboratory Data: An Evaluation of Machine Learning Approaches on an Imbalanced Dataset, https://arxiv.org/abs/2510.20209
  • Matteo Silvestri, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei, 23 Oct 2025, Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models, https://arxiv.org/abs/2510.20351
  • Zhenhuan Zhou, Jingbo Zhu, Yuchen Zhang, Xiaohang Guan, Peng Wang and Tao Li, 23 Oct 2025, Deep Learning in Dental Image Analysis: A Systematic Review of Datasets, Methodologies, and Emerging Challenges, https://arxiv.org/abs/2510.20634
  • Alicia Sagae and Chia-Jung Lee and Sandeep Avula and Brandon Dang and Vanessa Murdock, 23 Oct 2025, A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text, https://arxiv.org/abs/2510.20782
  • Hashem Omrani, Raha Imanirad, Adam Diamant, Utkarsh Verma, Amol Verma, Fahad Razak, 22 Oct 2025, Endogenous Aggregation of Multiple Data Envelopment Analysis Scores for Large Data Sets, https://arxiv.org/abs/2510.20052
  • Weikang Yuan, Kaisong Song, Zhuoren Jiang, Junjie Cao, Yujie Zhang, Jun Lin, Kun Kuang, Ji Zhang, Xiaozhong Liu, 23 Oct 2025, LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation, https://arxiv.org/abs/2505.19667
  • Abdou Karim Kandji and Frédéric Precioso and Cheikh Ba and Samba Ndiaye and Augustin Ndione, 22 Oct 2025, WolBanking77: Wolof Banking Speech Intent Classification Dataset, https://arxiv.org/abs/2509.19271
  • Arianna Francesconi, Donato Cappetta, Fabio Rebecchi, Paolo Soda, Valerio Guarrasi, Rosa Sicilia, 10 Oct 2025, Cross-dataset Multivariate Time-series Model for Parkinson's Diagnosis via Keyboard Dynamics, https://arxiv.org/abs/2510.15950
  • Jiyan Qiu, Lyulin Kuang, Guan Wang, Yichen Xu, Leiyao Cui, Shaotong Fu, Yixin Zhu, Ruihua Zhang, 19 Oct 2025, DrivAerStar: An Industrial-Grade CFD Dataset for Vehicle Aerodynamic Optimization, https://arxiv.org/abs/2510.16857
  • Duo Su, Huyu Wu, Huanran Chen, Yiming Shi, Yuzhu Wang, Xi Ye, Jun Zhu, 20 Oct 2025, Diffusion Models as Dataset Distillation Priors, https://arxiv.org/abs/2510.17421
  • Chen Kong, James Fort, Aria Kang, Jonathan Wittmer, Simon Green, Tianwei Shen, Yipu Zhao, Cheng Peng, Gustavo Solaira, Andrew Berkovich, Nikhil Raina, Vijay Baiyya, Evgeniy Oleinik, Eric Huang, Fan Zhang, Julian Straub, Mark Schwesinger, Luis Pesqueira, Xiaqing Pan, Jakob Julian Engel, Carl Ren, Mingfei Yan, Richard Newcombe, 17 Oct 2025, Aria Gen 2 Pilot Dataset, https://arxiv.org/abs/2510.16134
  • Yahia Battach, Abdulwahab Felemban, Faizan Farooq Khan, Yousef A. Radwan, Xiang Li, Fabio Marchese, Sara Beery, Burton H. Jones, Francesca Benzoni, Mohamed Elhoseiny, 19 Oct 2025, ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification, https://arxiv.org/abs/2510.16822
  • Numaan Naeem, Abdellah El Mekki, Muhammad Abdul-Mageed, 20 Oct 2025, EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs, https://arxiv.org/abs/2510.17389
  • Frédéric Lin, Biruk Abere Ambaw, Adrian Popescu, Hejer Ammar, Romaric Audigier, Hervé Le Borgne, 20 Oct 2025, CaMiT: A Time-Aware Car Model Dataset for Classification and Generation, https://arxiv.org/abs/2510.17626
  • Matheus Ramos Parracho, 20 Oct 2025, Signature Forgery Detection: Improving Cross-Dataset Generalization, https://arxiv.org/abs/2510.17724
  • Andy Shi, 15 Oct 2025, A Storm-Centric 250 m NEXRAD Level-II Dataset for High-Resolution ML Nowcasting, https://arxiv.org/abs/2510.16031
  • Jugal Gajjar and Kamalasankari Subramaniakuppusamy, 18 Oct 2025, MLCPD: A Unified Multi-Language Code Parsing Dataset with Universal AST Schema, https://arxiv.org/abs/2510.16357
  • Ivan Molodetskikh, Kirill Malyshev, Mark Mirgaleev, Nikita Zagainov, Evgeney Bogatyrev, Dmitriy Vatolin, 19 Oct 2025, Prominence-Aware Artifact Detection and Dataset for Image Super-Resolution, https://arxiv.org/abs/2510.16752
  • Wei-Jer Chang, Wei Zhan, Masayoshi Tomizuka, Manmohan Chandraker, and Francesco Pittaluga, 20 Oct 2025, LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation, https://arxiv.org/abs/2504.11521
  • Chi Zhang, Mengxin Zheng, Qian Lou, Hui Min Leung, and Fan Chen, 22 Sep 2025, VQEzy: An Open-Source Dataset for Parameter Initialize in Variational Quantum Eigensolvers, https://arxiv.org/abs/2509.17322
  • Asiya Ibrahim Zanga, Salisu Mamman Abdulrahman, Abubakar Ado, Abdulkadir Abubakar Bichi, Lukman Aliyu Jibril, Abdulmajid Babangida Umar, Alhassan Adamu, Shamsuddeen Hassan Muhammad and Bashir Salisu Abubakar, 17 Sep 2025, HausaMovieReview: A Benchmark Dataset for Sentiment Analysis in Low-Resource African Language, https://arxiv.org/abs/2509.16256
  • Jina Suh, Lindy Le, Erfan Shayegani, Gonzalo Ramos, Judith Amores, Desmond C. Ong, Mary Czerwinski, Javier Hernandez, 19 Sep 2025, SENSE-7: Taxonomy and Dataset for Measuring User Perceptions of Empathy in Sustained Human-AI Conversations, https://arxiv.org/abs/2509.16437
  • Eunjin Choi, Hyerin Kim, Jiwoo Ryu, Juhan Nam, Dasaem Jeong, 20 Sep 2025, On the de-duplication of the Lakh MIDI dataset, https://arxiv.org/abs/2509.16662
  • Tiancheng Huang, Ruisheng Cao, Yuxin Zhang, Zhangyi Kang, Zijian Wang, Chenrun Wang, Yijie Luo, Hang Zheng, Lirong Qian, Lu Chen, Kai Yu, 21 Sep 2025, AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation, https://arxiv.org/abs/2509.16952
  • Haojun Yu, Youcheng Li, Zihan Niu, Nan Zhang, Xuantong Gong, Huan Li, Zhiying Zou, Haifeng Qi, Zhenxiao Cao, Zijie Lan, Xingjian Yuan, Jiating He, Haokai Zhang, Shengtao Zhang, Zicheng Wang, Dong Wang, Ziwei Zhao, Congying Chen, Yong Wang, Wangyan Qin, and Qingli Zhu, 21 Sep 2025, A Chain-of-thought Reasoning Breast Ultrasound Dataset Covering All Histopathology Categories, https://arxiv.org/abs/2509.17046
  • Yutong Liu, Ziyue Zhang, Ban Ma-bao, Renzeng Duojie, Yuqing Cai, Yongbin Yu, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi, 22 Sep 2025, TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation, https://arxiv.org/abs/2509.18060
  • Joe Barrow, 20 Sep 2025, CommonForms: A Large, Diverse Dataset for Form Field Detection, https://arxiv.org/abs/2509.16506
  • Zhichao Ma, Fan Huang, Lu Zhao, Fengjun Guo, Guangtao Zhai, Xiongkuo Min, 21 Sep 2025, DocIQ: A Benchmark Dataset and Feature Fusion Network for Document Image Quality Assessment, https://arxiv.org/abs/2509.17012
  • Junhong Lai, Jiyu Wei, Lin Yao and Yueming Wang, 21 Sep 2025, A Simple Review of EEG Foundation Models: Datasets, Advancements and Future Perspectives, https://arxiv.org/abs/2504.20069
  • JunSeo Kim and HyeHyeon Kim, 20 Sep 2025, KoACD: The First Korean Adolescent Dataset for Cognitive Distortion Analysis via Role-Switching Multi-LLM Negotiation, https://arxiv.org/abs/2505.00367
  • Manolis Mylonas, Evlampios Apostolidis, Vasileios Mezaris, 22 Sep 2025, SD-VSum: A Method and Dataset for Script-Driven Video Summarization, https://arxiv.org/abs/2505.03319
  • Peizhen Li, Longbing Cao, Xiao-Ming Wu, Runze Yang, Xiaohan Yu, 20 Sep 2025, X2C: A Dataset Featuring Nuanced Facial Expressions for Realistic Humanoid Imitation, https://arxiv.org/abs/2505.11146
  • Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Deli Zhao, Wenbing Huang, Tingyang Xu, Qifeng Bai, Yu Rong, 22 Sep 2025, ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning, https://arxiv.org/abs/2506.09513
  • Jiale Zhang, Zichong Wang, Avash Palikhe, Zhipeng Yin, Wenbin Zhang, 22 Sep 2025, Datasets for Fairness in Language Models: An In-Depth Survey, https://arxiv.org/abs/2506.23411
  • Nannan Shi, Chuanyu Qin, Shipeng Song, Man Luo, 23 Oct 2025, GeoThought: A Dataset for Enhancing Mathematical Geometry Reasoning in Vision-Language Models, https://arxiv.org/abs/2510.21881
  • Quoc Anh Nguyen, Bernard Cheng and Kelvin Soh, 25 Oct 2025, VietLyrics: A Large-Scale Dataset and Models for Vietnamese Automatic Lyrics Transcription, https://arxiv.org/abs/2510.22295
  • Vishvesh Bhat, Omkar Ghugarkar, Julian McAuley, 27 Oct 2025, On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset, https://arxiv.org/abs/2510.22898
  • Shuang Wang, Xuben Wang, Fei Deng, Peifan Jiang, Jian Chen and Gianluca Fiandaca, 23 Oct 2025, OpenEM: Large-scale multi-structural 3D datasets for electromagnetic methods, https://arxiv.org/abs/2510.21859
  • Ana K. Rivera, Anvita Bhagavathula, Alvaro Carbonero, and Priya Donti, 24 Oct 2025, PFΔ: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations, https://arxiv.org/abs/2510.22048
  • Qingzhu Zhang, Jiani Zhong, Zongsheng Li, Xinke Shen, Quanying Liu, 25 Oct 2025, Multi-dataset Joint Pre-training of Emotional EEG Enables Generalizable Affective Computing, https://arxiv.org/abs/2510.22197
  • Darshana Priyasad, Tharindu Fernando, Maryam Haghighat, Harshala Gammulle, Clinton Fookes, 27 Oct 2025, Transforming volcanic monitoring: A dataset and benchmark for onboard volcano activity detection, https://arxiv.org/abs/2510.22889
  • Hong Wang, Jie Wang, Jian Luo, Huanshuo Dong, Yeqiu Chen, Runmin Jiang, Zhen Huang, 27 Oct 2025, Accelerating Eigenvalue Dataset Generation via Chebyshev Subspace Filter, https://arxiv.org/abs/2510.23215
  • Jesse Atuhurra, Hidetaka Kamigaito, Taro Watanabe, Koichiro Yoshino, 13 Oct 2025, J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception, https://arxiv.org/abs/2510.21761
  • Yulong Zhang, 17 Oct 2025, OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment, https://arxiv.org/abs/2510.21774
  • Mouhand Alkadri and Dania Desouki and Khloud Al Jallad, 27 Oct 2025, Arabic Little STT: Arabic Children Speech Recognition Dataset, https://arxiv.org/abs/2510.23319
  • Weiyu Chen, Arnaud Delorme, 24 Oct 2025, Adaptive Split-MMD Training for Small-Sample Cross-Dataset P300 EEG Classification, https://arxiv.org/abs/2510.21969
  • Sarabeth S. Mullins and Georg Götz and Eric Bezzam and Steven Zheng and Daniel Gert Nielsen, 27 Oct 2025, Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement, https://arxiv.org/abs/2510.23141
  • Linda Zeng, Rithwik Gupta, Divij Motwani, Diji Yang, Yi Zhang, 26 Oct 2025, Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals, https://arxiv.org/abs/2502.16101
  • Lily Hong Zhang and Smitha Milli and Karen Jusko and Jonathan Smith and Brandon Amos and Wassim Bouaziz and Manon Revel and Jack Kussman and Yasha Sheynin and Lisa Titus and Bhaktipriya Radharapu and Jane Yu and Vidya Sarma and Kris Rose and Maximilian Nickel, 24 Oct 2025, Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset, https://arxiv.org/abs/2507.09650
  • Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, Chris Tanner, 24 Oct 2025, DocFinQA: A Long-Context Financial Reasoning Dataset, https://arxiv.org/abs/2401.06915
  • Han Deng, Yuan Meng, Shixiang Tang, Wanli Ouyang, Xinzhu Ma, 26 Oct 2025, CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming, https://arxiv.org/abs/2505.12925
  • Dan A. Calian and Gregory Farquhar and Iurii Kemaev and Luisa M. Zintgraf and Matteo Hessel and Jeremy Shar and Junhyuk Oh and András György and Tom Schaul and Jeffrey Dean and Hado van Hasselt and David Silver, 27 Oct 2025, DataRater: Meta-Learned Dataset Curation, https://arxiv.org/abs/2505.17895
  • Shudong Sun, Hao Helen Zhang, 25 Oct 2025, Quantifying Dataset Similarity to Guide Transfer Learning, https://arxiv.org/abs/2510.10866
  • Xiangyu Zhao, Wanghan Xu, Bo Liu, Yuhao Zhou, Fenghua Ling, Ben Fei, Xiaoyu Yue, Lei Bai, Wenlong Zhang, Xiao-Ming Wu, 15 Oct 2025, MSEarth: A Multimodal Scientific Dataset and Benchmark for Phenomena Uncovering in Earth Science, https://arxiv.org/abs/2505.20740
  • Haiyang Li, Yaxiong Wang, Shengeng Tang, Lianwei Wu, Lechao Cheng, Zhun Zhong, 15 Oct 2025, Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline, https://arxiv.org/abs/2509.25991
  • Yuxin Wang, Maresa Schröder, Dennis Frauen, Jonas Schweisthal, Konstantin Hess and Stefan Feuerriegel, 15 Oct 2025, Constructing Confidence Intervals for Average Treatment Effects from Multiple Datasets, https://arxiv.org/abs/2412.11511
  • Hao Yin, Lijun Gu, Paritosh Parmar, Lin Xu, Tianxiao Guo, Weiwei Fu, Yang Zhang, Tianyou Zheng, 15 Oct 2025, FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment, https://arxiv.org/abs/2506.03198
  • Kemal Sami Karaca, Bahaeddin Eravcı, 26 Sep 2025, A Large-Scale Dataset and Citation Intent Classification in Turkish with LLMs, https://arxiv.org/abs/2509.21907
  • Nabeel Nisar Bhat, Maksim Karnaukh, Stein Vandenbroeke, Wouter Lemoine, Jakob Struye, Jesus Omar Lacruz, Siddhartha Kumar, Mohammad Hossein Moghaddam, Joerg Widmer, Rafael Berkvens, Jeroen Famaey, 24 Sep 2025, mmHSense: Multi-Modal and Distributed mmWave ISAC Datasets for Human Sensing, https://arxiv.org/abs/2509.21396
  • Seokbin Yoon, Keumjin Lee, 26 Sep 2025, Aircraft Trajectory Dataset Augmentation in Latent Space, https://arxiv.org/abs/2506.07585
  • Jaedong Hwang, Brian Cheung, Zhang-Wei Hong, Akhilan Boopathy, Pulkit Agrawal, Ila Fiete, 26 Sep 2025, Large Pre-Training Datasets Don't Always Guarantee Robustness after Fine-Tuning, https://arxiv.org/abs/2410.21582
  • M. Sajid, Mushir Akhtar, A. Quadir, M. Tanveer, 6 Oct 2025, RVFL-X: A Novel Randomized Network Based on Complex Transformed Real-Valued Tabular Datasets, https://arxiv.org/abs/2510.06278
  • Jiqun Pan, Zhenke Duan, Jiani Tu, Anzhi Cheng, Yanqing Wang, 3 Oct 2025, Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets, https://arxiv.org/abs/2510.06240
  • Yann Bellec, 3 Oct 2025, Dream2Image : An Open Multimodal EEG Dataset for Decoding and Visualizing Dreams with Artificial Intelligence, https://arxiv.org/abs/2510.06252
  • Aryan Kumar Singh and Janvi Singh, 5 Oct 2025, Prakriti200: A Questionnaire-Based Dataset of 200 Ayurvedic Prakriti Assessments, https://arxiv.org/abs/2510.06262
  • Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin, 8 Oct 2025, SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation, https://arxiv.org/abs/2510.06596
  • Neel Prabhanjan Rachamalla, Aravind Konakalla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, and Shubham Agarwal, 8 Oct 2025, Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages, https://arxiv.org/abs/2510.07000
  • Fred Philippy, Laura Bernardy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé, 8 Oct 2025, LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish, https://arxiv.org/abs/2510.07074
  • Brian B. Moser, Federico Raue, Sebastian Palacio, Stanislav Frolov and Andreas Dengel, 8 Oct 2025, Unlocking Dataset Distillation with Diffusion Models, https://arxiv.org/abs/2403.03881
  • Chiara Pugliese, Francesco Lettich, Guido Rocchietti, Chiara Renso, Fabio Pinelli, 26 Sep 2025, Human Mobility Datasets Enriched With Contextual and Social Dimensions, https://arxiv.org/abs/2510.02333
  • Aurélien Bück-Kaeffer, Je Qin Chooi, Dan Zhao, Maximilian Puelma Touzel, Kellin Pelrine, Jean-François Godbout, Reihaneh Rabbany, Zachary Yang, 27 Sep 2025, BluePrint: A Social Media User Dataset for LLM Persona Evaluation and Training, https://arxiv.org/abs/2510.02343
  • Ingrid Navarro, Pablo Ortega-Kral, Jay Patrikar, Haichuan Wang, Alonso Cano, Zelin Ye, Jong Hoon Park, Sebastian Scherer and Jean Oh, 3 Oct 2025, Amelia: A Large Dataset and Benchmark for Airport Surface Movement Forecasting, https://arxiv.org/abs/2407.21185
  • Huidong Liang, Haitz Sáez de Ocáriz Borde, Baskaran Sripathmanathan, Michael Bronstein, Xiaowen Dong, 3 Oct 2025, Towards Quantifying Long-Range Interactions in Graph Machine Learning: a Large Graph Dataset and a Measurement, https://arxiv.org/abs/2503.09008
  • Bartosz Bieganowski and Daniel Strzelecki and Robert Skiba and Mateusz Topolewski, 3 Oct 2025, Putnam-like dataset summary: LLMs as mathematical competition contestants, https://arxiv.org/abs/2509.24827
  • Yao-Ching Yu, Tsun-Han Chiang, Cheng-Wei Tsai, Chien-Ming Huang, Wen-Kwang Tsao, 3 Oct 2025, Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training, https://arxiv.org/abs/2502.11191
  • Yuchen Su, Yonghua Zhu, Ruofan Wang, Zijian Huang, Diana Benavides-Prado, Michael Witbrock, 3 Oct 2025, A Survey of Pun Generation: Datasets, Evaluations and Methodologies, https://arxiv.org/abs/2507.04793
  • Antonio-Gabriel Chacón Menke, Phan Xuan Tan, Eiji Kamioka, 20 Oct 2025, Annotating the Chain-of-Thought: A Behavior-Labeled Dataset for AI Safety, https://arxiv.org/abs/2510.18154
  • Yuanhe Guo, Linxi Xie, Zhuoran Chen, Kangrui Yu, Ryan Po, Guandao Yang, Gordon Wetztein, Hongyi Wen, 21 Oct 2025, ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization, https://arxiv.org/abs/2510.18433
  • Yongmin Lee, Hye Won Chung, 21 Oct 2025, CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder, https://arxiv.org/abs/2510.18583
  • Abhinav Nippani, Dongyue Li, Haotian Ju, Haris N. Koutsopoulos, Hongyang R. Zhang, 21 Oct 2025, Graph Neural Networks for Road Safety Modeling: Datasets and Evaluations for Accident Analysis, https://arxiv.org/abs/2311.00164
  • Yidan Zhang, Mutian Xu, Yiming Hao, Kun Zhou, Jiahao Chang, Xiaoqiang Liu, Pengfei Wan, Hongbo Fu, Xiaoguang Han, 25 Sep 2025, VC-Agent: An Interactive Agent for Customized Video Dataset Collection, https://arxiv.org/abs/2509.21291
  • Annabel Ma, Kaiying Hou, David Alvarez-Melis, Melanie Weber, 25 Sep 2025, Bispectral OT: Dataset Comparison using Symmetry-Aware Optimal Transport, https://arxiv.org/abs/2509.20678
  • Srinidhi Kalgundi Srinivas, Yash Shukla, Adam Arnold, Sachin Chitta, 24 Sep 2025, GraspFactory: A Large Object-Centric Grasping Dataset, https://arxiv.org/abs/2509.20550
  • Dincy R. Arikkat and Sneha B. T. and Serena Nicolazzo and Antonino Nocera and Vinod P. and Rafidha Rehiman K. A. and Karthika R, 25 Sep 2025, CTI Dataset Construction from Telegram, https://arxiv.org/abs/2509.20943
  • Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Dovile Juodelyte, Théo Sourget, Caroline Vang-Larsen, Anna Rogers, Hubert Dariusz Zając, Veronika Cheplygina, 9 Feb 2024, Copycats: the many lives of a publicly available medical imaging dataset, https://arxiv.org/abs/2402.06353
  • Joana Reuss, Jan Macdonald, Simon Becker, Ekaterina Gikalo, Konrad Schultka, Lorenz Richter, Marco Körner, 25 Sep 2025, Benchmarking for Practice: Few-Shot Time-Series Crop-Type Classification on the EuroCropsML Dataset, https://arxiv.org/abs/2504.11022
  • Jakub Adamczyk, Jakub Poziemski, Franciszek Job, Mateusz Król, Maciej Makowski, 25 Sep 2025, MolPILE - large-scale, diverse dataset for molecular representation learning, https://arxiv.org/abs/2509.18353
  • Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi, 24 Sep 2025, Scaling Rich Style-Prompted Text-to-Speech Datasets, https://arxiv.org/abs/2503.04713
  • Jialong Mai, Jinxin Ji, Xiaofen Xing, Chen Yang, Weidong Chen, Jingyuan Xing, Xiangmin Xu, 24 Sep 2025, MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech, https://arxiv.org/abs/2509.18196
  • Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, Qing Wang, 28 Sep 2025, Diagnosing Failure Root Causes in Platform-Orchestrated Agentic Systems: Dataset, Taxonomy, and Benchmark, https://arxiv.org/abs/2509.23735
  • Qi Xue, Minrui Jiang, Runjia Zhang, Xiurui Xie, Pei Ke, Guisong Liu, 28 Sep 2025, Falcon: A Cross-Modal Evaluation Dataset for Comprehensive Safety Perception, https://arxiv.org/abs/2509.23783
  • John N. Daras, 28 Sep 2025, Efficient Identification of High Similarity Clusters in Polygon Datasets, https://arxiv.org/abs/2509.23942
  • Youssef Sabiri, Walid Houmaidi, Ouail El Maadi, Yousra Chtouki, 28 Sep 2025, AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring, https://arxiv.org/abs/2509.24069
  • Jackson Loth, Pedro Sarmento, Saurjya Sarkar, Zixun Guo, Mathieu Barthet, Mark Sandler, 22 Jul 2025, GOAT: A Large Dataset of Paired Guitar Audio Recordings and Tablatures, https://arxiv.org/abs/2509.22655
  • Chen Yizhe, Wang Qi, Hu Dongxiao, Jingzhe Fang, Liu Sichao, Zixin An, Hongliang Niu, Haoran Liu, Li Dong, Chuanfen Feng, Lan Dapeng, Liu Yu, Zhibo Pang, 27 Sep 2025, Liaohe-CobotMagic-PnP: an Imitation Learning Dataset of Intelligent Robot for Industrial Applications, https://arxiv.org/abs/2509.23111
  • Wenyu Li, Xiaoqi Jiao, Yi Chang, Guangyan Zhang, Yiwen Guo, 27 Sep 2025, AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models, https://arxiv.org/abs/2509.23435
  • Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li, Li Lin, Yuwang Wang, 28 Sep 2025, M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation, https://arxiv.org/abs/2509.23728
  • Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang, 29 Sep 2025, OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing, https://arxiv.org/abs/2509.24900
  • Ryosuke Takanami, Petr Khrapchenkov, Shu Morikuni, Jumpei Arima, Yuta Takaba, Shunsuke Maeda, Takuya Okubo, Genki Sano, Satoshi Sekioka, Aoi Kadoya, Motonari Kambara, Naoya Nishiura, Haruto Suzuki, Takanori Yoshimoto, Koya Sakamoto, Shinnosuke Ono, Hu Yang, Daichi Yashima, Aoi Horo, Tomohiro Motoda, Kensuke Chiyoma, Hiroshi Ito, Koki Fukuda, Akihito Goto, Kazumi Morinaga, Yuya Ikeda, Riko Kawada, Masaki Yoshikawa, Norio Kosuge, Yuki Noguchi, Kei Ota, Tatsuya Matsushima, Yusuke Iwasawa, Yutaka Matsuo, Tetsuya Ogata, 29 Sep 2025, AIRoA MoMa Dataset: A Large-Scale Hierarchical Dataset for Mobile Manipulation, https://arxiv.org/abs/2509.25032
  • Alexandru-Gabriel Ganea and Antonia-Adelina Popovici and Adrian-Marius Dumitran, 6 Jun 2025, A Culturally-Rich Romanian NLP Dataset from "Who Wants to Be a Millionaire?" Videos, https://arxiv.org/abs/2506.05991
  • Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, Sumit Chopra, 27 Sep 2025, Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional, https://arxiv.org/abs/2509.23499
  • Jiangning Zhu, Yuxing Zhou, Zheng Wang, Juntao Yao, Yima Gu, Yuhui Yuan, Shixia Liu, 28 Sep 2025, InfoDet: A Dataset for Infographic Element Detection, https://arxiv.org/abs/2505.17473
  • Trevor Stalnaker, Nathan Wintersgill, Oscar Chaparro, Laura A. Heymann, Massimiliano Di Penta, Daniel M German, Denys Poshyvanyk, 29 Sep 2025, An Empirical Analysis of Machine Learning Model and Dataset Documentation, Supply Chain, and Licensing Challenges on Hugging Face, https://arxiv.org/abs/2502.04484
  • Xiao Sun, 17 Oct 2025, WELD: A Large-Scale Longitudinal Dataset of Emotional Dynamics for Ubiquitous Affective Computing, https://arxiv.org/abs/2510.15221
  • Shuo Sun, Meiling Zhou, Chen Zhao, Joyce H. Keyak, Nancy E. Lane, Jeffrey D. Deng, Kuan-Jui Su, Hui Shen, Hong-Wen Deng, Kui Zhang, and Weihua Zhou, 16 Oct 2025, An Advanced Two-Stage Model with High Sensitivity and Generalizability for Prediction of Hip Fracture Risk Using Multiple Datasets, https://arxiv.org/abs/2510.15179
  • Giulia Lanzillotta, Felix Sarnthein, Gil Kur, Thomas Hofmann, Bobby He, 17 Oct 2025, Revisiting Knowledge Distillation: The Hidden Role of Dataset Size, https://arxiv.org/abs/2510.15516
  • Ting Qiao, Xing Liu, Wenke Huang, Jianbin Li, Zhaoxin Fan, Yiming Li, 17 Oct 2025, DSSmoothing: Toward Certified Dataset Ownership Verification for Pre-trained Language Models via Dual-Space Smoothing, https://arxiv.org/abs/2510.15303
  • Catarina G Belem, Parker Glenn, Alfy Samuel, Anoop Kumar and Daben Liu, 17 Oct 2025, Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics, https://arxiv.org/abs/2510.15345
  • Chitralekha Gupta, Soundarya Ramesh, Praveen Sasikumar, Kian Peen Yeo, Suranga Nanayakkara, 17 Oct 2025, DroneAudioset: An Audio Dataset for Drone-based Search and Rescue, https://arxiv.org/abs/2510.15383
  • Antonyo Musabini, Rachid Benmokhtar, Jagdish Bhanushali, Victor Galizzi, Bertrand Luvison, Xavier Perrotton, 17 Oct 2025, Valeo Near-Field: a novel dataset for pedestrian intent detection, https://arxiv.org/abs/2510.15673
  • Usman Ali, 17 Oct 2025, A Multimodal Lightweight Approach to Fault Diagnosis of Induction Motors in High-Dimensional Dataset, https://arxiv.org/abs/2501.03746
  • Keren Fuentes, Mimee Xu, Irene Chen, 17 Oct 2025, Privacy-Preserving Dataset Combination, https://arxiv.org/abs/2502.05765
  • Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama, 17 Oct 2025, Hyperbolic Dataset Distillation, https://arxiv.org/abs/2505.24623
  • Pafue Christy Nganjimi, Andrew Soltan, Danielle Belgrave, Lei Clifton, David A. Clifton, Anshul Thakur, 16 Oct 2025, Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates, https://arxiv.org/abs/2510.05805
  • Muhao Guo, Haoran Li, Yang Weng, 5 Oct 2025, Efficient Manifold-Constrained Neural ODE for High-Dimensional Datasets, https://arxiv.org/abs/2510.04138
  • Guillaume Godin, 6 Oct 2025, Bond-Centered Molecular Fingerprint Derivatives: A BBBP Dataset Study, https://arxiv.org/abs/2510.04837
  • Ali Khairallah and Arkaitz Zubiaga, 3 Oct 2025, ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection, https://arxiv.org/abs/2510.03502
  • Saja Al-Dabet, Sherzod Turaev, Nazar Zaki, Arif O. Khan, Luai Eldweik, 4 Oct 2025, PoseGaze-AHP: A Knowledge-Based 3D Dataset for AI-Driven Ocular and Postural Diagnosis, https://arxiv.org/abs/2510.03873
  • Davood Rafiei and Morgan Lindsay Heisler and Weiwei Zhang and Mohammadreza Pourreza and Yong Zhang, 6 Oct 2025, Do LLMs Align with My Task? Evaluating Text-to-SQL via Dataset Alignment, https://arxiv.org/abs/2510.04919
  • Tiago Rodrigues de Almeida, Yufei Zhu, Andrey Rudenko, Tomasz P. Kucner, Johannes A. Stork, Martin Magnusson, Achim J. Lilienthal, 4 Oct 2025, Trajectory prediction for heterogeneous agents: A performance analysis on small and imbalanced datasets, https://arxiv.org/abs/2510.03776
  • Muquan Li, Hang Gou, Dongyang Zhang, Shuang Liang, Xiurui Xie, Deqiang Ouyang, and Ke Qin, 6 Oct 2025, Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation, https://arxiv.org/abs/2510.04838
  • Naomi Fridman (Ariel University), Anat Goldstein (Ariel University), 4 Oct 2025, Transformer Classification of Breast Lesions: The BreastDCEDL_AMBL Benchmark Dataset and 0.92 AUC Baseline, https://arxiv.org/abs/2509.26440
  • Jelena Bratulić, Sudhanshu Mittal, David T. Hoffmann, Samuel Böhm, Robin Tibor Schirrmeister, Tonio Ball, Christian Rupprecht, Thomas Brox, 6 Oct 2025, Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling, https://arxiv.org/abs/2501.06256
  • Zekai Zhang, Mingwei Liu, Zhenxi Chen, Linxi Liang, Yuxuan Chen, Guangsheng Ou, Yanlin Wang, Dan Li, Xin Peng and Zibin Zheng, 5 Oct 2025, Generating High-Quality Datasets for Code Editing via Open-Source Language Models, https://arxiv.org/abs/2509.25203
  • Marlon Tobaben, Hibiki Ito, Joonas Jälkö, Yuan He, and Antti Honkela, 6 Oct 2025, Impact of Dataset Properties on Membership Inference Vulnerability of Deep Transfer Learning, https://arxiv.org/abs/2402.06674
  • Benjamin Townsend, Madison May, Katherine Mackowiak and Christopher Wells, 6 Oct 2025, RealKIE: Five Novel Datasets for Enterprise Key Information Extraction, https://arxiv.org/abs/2403.20101
  • Alhasan Abdellatif, Hannah P. Menke, Julien Maes, Ahmed H. Elsheikh and Florian Doster, 3 Oct 2025, Benchmark Dataset for Pore-Scale CO2-Water Interaction, https://arxiv.org/abs/2503.17592
  • Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong, 9 Oct 2025, Enhancing Self-Supervised Learning with Semantic Pairs A New Dataset and Empirical Study, https://arxiv.org/abs/2510.08722
  • Eshika Saxena, Alberto Alfarano, François Charton, Emily Wenger, Kristin Lauter, 9 Oct 2025, TAPAS: Datasets for Learning the Learning with Errors Problem, https://arxiv.org/abs/2510.08797
  • Yuanming Zhang, Yan Lin, Arijit Khan, Huaiyu Wan, 10 Oct 2025, Large Language Model Prompt Datasets: An In-depth Analysis and Insights, https://arxiv.org/abs/2510.09316
  • Zhenyu Zhao and Hongyi Jing and Xiawei Liu and Jiageng Mao and Abha Jha and Hanwen Yang and Rong Xue and Sergey Zakharor and Vitor Guizilini and Yue Wang, 9 Oct 2025, Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation, https://arxiv.org/abs/2510.08807
  • Bhanu Tokas, Rahul Nair, Hannah Kerner, 9 Oct 2025, Making Bias Amplification in Balanced Datasets Directional and Interpretable, https://arxiv.org/abs/2412.11060
  • Ankur Sinha, Shobhit Arora, and Dhaval Pujara, 24 Oct 2025, AutoOpt: A Dataset and a Unified Framework for Automating Optimization Problem Solving, https://arxiv.org/abs/2510.21436
  • Gereon Elvers, Gilad Landau, Oiwi Parker Jones, 23 Oct 2025, Elementary, My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset, https://arxiv.org/abs/2510.21038
  • Luca Demetrio, Giovanni Apruzzese, Kathrin Grosse, Pavel Laskov, Emil Lupu, Vera Rimmer, Philine Widmer, 24 Oct 2025, Gen-Review: A Large-scale Dataset of AI-Generated (and Human-written) Peer Reviews, https://arxiv.org/abs/2510.21192
  • Prakhar Ganesh, Hsiang Hsu, Golnoosh Farnadi, 24 Oct 2025, Data as a Lever: A Neighbouring Datasets Perspective on Predictive Multiplicity, https://arxiv.org/abs/2510.21303
  • Juntao Li, Haobin Yuan, Ling Luo, Yan Jiang, Fan Wang, Ping Zhang, Huiyi Lv, Jian Wang, Yuanyuan Sun, Hongfei Lin, 24 Oct 2025, CDrugRed: A Chinese Drug Recommendation Dataset for Discharge Medications in Metabolic Diseases, https://arxiv.org/abs/2510.21084
  • Raheem Karim Hashmani, Garrett W. Merz, Helen Qu, Mariel Pettee, Kyle Cranmer, 24 Oct 2025, Multimodal Datasets with Controllable Mutual Information, https://arxiv.org/abs/2510.21686
  • Stylianos Stasinos, Martino Mensio, Elena Lazovik, Athanasios Trantas, 24 Oct 2025, BioCube: A Multimodal Dataset for Biodiversity Research, https://arxiv.org/abs/2505.11568
  • Amartya Chakraborty, Paresh Dashore, Nadia Bathaee, Anmol Jain, Anirban Das, Shi-Xiong Zhang, Sambit Sahu, Milind Naphade, Genta Indra Winata, 23 Oct 2025, T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning, https://arxiv.org/abs/2505.16986
  • Riccardo Fosco Gramaccioni, Christian Marinoni, Fabrizio Frezza, Aurelio Uncini, and Danilo Comminiello, 7 Oct 2025, Generative Models for Helmholtz Equation Solutions: A Dataset of Acoustic Materials, https://arxiv.org/abs/2510.09657
  • Md Ibrahim Shikder Mahin, Md Shamsul Arefin and Md Tanvir Hasan, 12 Oct 2025, A Hybrid Machine Learning Approach for Synthetic Data Generation with Post Hoc Calibration for Clinical Tabular Datasets, https://arxiv.org/abs/2510.10513
  • Shreshth Saini, Alan C. Bovik, Neil Birkbeck, Yilin Wang, Balu Adsumilli, 10 Oct 2025, CHUG: Crowdsourced User-Generated HDR Video Quality Dataset, https://arxiv.org/abs/2510.09879
  • Zuha Fatima, Muhammad Anser Sohaib, Muhammad Talha, Sidra Sultana, Ayesha Kanwal, Nazia Perwaiz, 12 Oct 2025, GLOFNet -- A Multimodal Dataset for GLOF Monitoring and Prediction, https://arxiv.org/abs/2510.10546
  • Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han, 13 Oct 2025, KOTOX: A Korean Toxic Dataset for Deobfuscation and Detoxification, https://arxiv.org/abs/2510.10961
  • Birat Poudel, Satyam Ghimire, Sijan Bhattarai, Saurav Bhandari and Suramya Sharma Dahal, 13 Oct 2025, Nepali Sign Language Characters Recognition: Dataset Development and Deep Learning Approaches, https://arxiv.org/abs/2510.11243
  • Massinissa Merouani, Afif Boudaoud, Riyadh Baghdadi, 11 Oct 2025, LOOPerSet: A Large-Scale Dataset for Data-Driven Polyhedral Compiler Optimization, https://arxiv.org/abs/2510.10209
  • Dewei Feng, Carol Li, Wei Dai, Paul Pu Liang, 11 Oct 2025, SMELLNET: A Large-scale Dataset for Real-world Smell Recognition, https://arxiv.org/abs/2506.00239
  • Qihang Zhou, Shenhao Fang, Shibo He, Wenchao Meng, Jiming Chen, 12 Oct 2025, FairDD: Fair Dataset Distillation, https://arxiv.org/abs/2411.19623
  • Trinh T.L. Vuong and Jin Tae Kwak, 13 Oct 2025, ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos, https://arxiv.org/abs/2505.04192
  • Shenghao Qin, Jianliang He, Qi Kuang, Bowen Gang, Yin Xia, 11 Oct 2025, Data-light Uncertainty Set Merging with Admissibility, https://arxiv.org/abs/2410.12201
  • Mkululi Sikosana, Sean Maudsley-Barton and Oluwaseun Ajao, 8 Oct 2025, Linguistic Patterns in Pandemic-Related Content: A Comparative Analysis of COVID-19, Constraint, and Monkeypox Datasets, https://arxiv.org/abs/2510.07579
  • Ziyi Dong, Yurui Zhang, Changmao Li, Naomi Rue Golding, Qing Long, 9 Oct 2025, A Large-scale Dataset for Robust Complex Anime Scene Text Detection, https://arxiv.org/abs/2510.07951
  • Kehui Liu, Zhongjie Jia, Yang Li, Zhaxizhuoma, Pengan Chen, Song Liu, Xin Liu, Pingrui Zhang, Haoming Song, Xinyi Ye, Nieqing Cao, Zhigang Wang, Jia Zeng, Dong Wang, Yan Ding, Bin Zhao, Xuelong Li, 9 Oct 2025, FastUMI-100K: Advancing Data-driven Robotic Manipulation with a Large-scale UMI-style Dataset, https://arxiv.org/abs/2510.08022
  • Hongruixuan Chen and Jian Song and Olivier Dietrich and Clifford Broni-Bediako and Weihao Xuan and Junjue Wang and Xinlei Shao and Yimin Wei and Junshi Xia and Cuiling Lan and Konrad Schindler and Naoto Yokoya, 9 Oct 2025, BRIGHT: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response, https://arxiv.org/abs/2501.06019
  • George Corrêa de Araújo, Helena de Almeida Maia, Helio Pedrini, 17 Sep 2025, A Framework for Generating Artificial Datasets to Validate Absolute and Relative Position Concepts, https://arxiv.org/abs/2509.18177
  • Yunzhi Xu, Yushuang Ding, Hu Sun, Hongxi Zhang, Li Zhao, 23 Sep 2025, HyKid: An Open MRI Dataset with Expert-Annotated Multi-Structure and Choroid Plexus in Pediatric Hydrocephalus, https://arxiv.org/abs/2509.19218
  • Varun Babbar, Zhicheng Guo, Cynthia Rudin, 22 Sep 2025, "What is Different Between These Datasets?" A Framework for Explaining Data Distribution Shifts, https://arxiv.org/abs/2403.05652
  • Haonan He, Yuchen Ren, Yining Tang, Ziyang Xu, Junxian Li, Minghao Yang, Di Zhang, Dong Yuan, Tao Chen, Shufei Zhang, Yuqiang Li, Nanqing Dong, Wanli Ouyang, Dongzhan Zhou, Peng Ye, 23 Sep 2025, Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models, https://arxiv.org/abs/2412.19191
  • Hamed Jelodar, Samita Bai, Roozbeh Razavi-Far, Ali A. Ghorbani, 21 Oct 2025, FlexiDataGen: An Adaptive LLM Framework for Dynamic Semantic Dataset Generation in Sensitive Domains, https://arxiv.org/abs/2510.19025
  • Tong Zhang, Yihuan Huang, Yanzhen Ren, 22 Oct 2025, EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection, https://arxiv.org/abs/2510.19414
  • Basavasagar Patil, Sydney Belt, Jayjun Lee, Nima Fazeli, Bernadette Bucher, 22 Oct 2025, Using Temperature Sampling to Effectively Train Robot Learning Policies on Imbalanced Datasets, https://arxiv.org/abs/2510.19373
  • Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, Zhe Gan, 22 Oct 2025, Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing, https://arxiv.org/abs/2510.19808
  • Li Jiang, Yusen Wu, Junwu Xiong, Jingqing Ruan, Yichuan Ding, Qingpei Guo, Zujie Wen, Jun Zhou, Xiaotie Deng, 22 Oct 2025, Hummer: Towards Limited Competitive Preference Dataset, https://arxiv.org/abs/2405.11647
  • Mélanie Roschewitz, Raghav Mehta, Fabio de Sousa Ribeiro, Ben Glocker, 22 Oct 2025, Where are we with calibration under dataset shift in image classification?, https://arxiv.org/abs/2507.07780
  • Yuan Gao, Mattia Piccinini, Roberto Brusnicki, Yuchen Zhang, Johannes Betz, 30 Sep 2025, NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving, https://arxiv.org/abs/2509.25944
  • Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son (Sonny) Vu, Jenia Jitsev, 29 Sep 2025, MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources, https://arxiv.org/abs/2509.25531
  • Walid Houmaidi, Youssef Sabiri, Fatima Zahra Iguenfer, Amine Abouaomar, 30 Sep 2025, AttriGen: Automated Multi-Attribute Annotation for Blood Cell Datasets, https://arxiv.org/abs/2509.26185
  • Chenyang Jiang, Zhengcen Li, Hang Zhao, Qiben Shan, Shaocong Wu, Jingyong Su, 30 Sep 2025, Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation, https://arxiv.org/abs/2509.26219
  • Dragos-Dumitru Ghinea and Adela-Nicoleta Corbeanu and Adrian-Marius Dumitran, 30 Sep 2025, RoBiologyDataChoiceQA: A Romanian Dataset for improving Biology understanding of Large Language Models, https://arxiv.org/abs/2509.25813
  • Xenia Heilmann, Luca Corbucci, Mattia Cerrato, Anna Monreale, 30 Sep 2025, FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation, https://arxiv.org/abs/2506.21095
  • Matthias Kümmerer, Harneet Singh Khanuja, Matthias Bethge, 30 Sep 2025, Modeling Saliency Dataset Bias, https://arxiv.org/abs/2505.10169
  • Simon Ging and Sebastian Walter and Jelena Bratulić and Johannes Dienert and Hannah Bast and Thomas Brox, 30 Sep 2025, Using Knowledge Graphs to harvest datasets for efficient CLIP model training, https://arxiv.org/abs/2505.02746
  • Josefa Lia Stoisser, Lawrence Phillips, Aditya Misra, Tom A. Lamb, Philip Torr, Marc Boubnovski Martell, Julien Fauqueur, Kaspar Märtens, 7 Oct 2025, Towards Label-Free Biological Reasoning Synthetic Dataset Creation via Uncertainty Filtering, https://arxiv.org/abs/2510.05871
  • Austin Feng, Andreas Varvarigos, Ioannis Panitsas, Daniela Fernandez, Jinbiao Wei, Yuwei Guo, Jialin Chen, Ali Maatouk, Leandros Tassiulas, Rex Ying, 7 Oct 2025, TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis, https://arxiv.org/abs/2510.06063
  • Shadi Rahimian and Mario Fritz, 7 Oct 2025, DP-SNP-TIHMM: Differentially Private, Time-Inhomogeneous Hidden Markov Models for Synthesizing Genome-Wide Association Datasets, https://arxiv.org/abs/2510.05777
  • João Palmeiro, Diogo Duarte, Rita Costa, Pedro Bizarro, 7 Oct 2025, Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks, https://arxiv.org/abs/2510.06071
  • Muskaan Chopra, Lorenz Sparrenberg, Rafet Sifa, 1 Oct 2025, SynCED-EnDe 2025: A Synthetic and Curated English - German Dataset for Critical Error Detection in Machine Translation, https://arxiv.org/abs/2510.05144
  • Sebastian Höfer, Dorian Henning, Artemij Amiranashvili, Douglas Morrison, Mariliza Tzes, Ingmar Posner, Marc Matvienko, Alessandro Rennola, Anton Milan, 7 Oct 2025, Kaputt: A Large-Scale Dataset for Visual Defect Detection, https://arxiv.org/abs/2510.05903
  • Chengwei Wu, Jiapu Wang, Mingyang Gao, Xingrui Zhuo, Jipeng Guo, Runlin Lei, Haoran Luo, Tianyu Chen, Haoyi Zhou, Shirui Pan, Zechao Li, 7 Oct 2025, CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs, https://arxiv.org/abs/2510.06039
  • Bjoern Hansen, Jonas Pedersen, Klaus F. Kofoed, Oscar Camara, Rasmus R. Paulsen, Kristine Soerensen, 7 Oct 2025, A public cardiac CT dataset featuring the left atrial appendage, https://arxiv.org/abs/2510.06090
  • Ahmed Elhussein and Gamze Gursoy, 6 Oct 2025, A Universal Metric of Dataset Similarity for Cross-silo Federated Learning, https://arxiv.org/abs/2404.18773
  • Gaya Mehenni and Fabrice Lamarche and Odette Rios-Ibacache and John Kildea and Amal Zouaq, 7 Oct 2025, MedHal: An Evaluation Dataset for Medical Hallucination Detection, https://arxiv.org/abs/2504.08596
  • Matthew D. Merris and Tim Andersen, 15 Oct 2025, Data Understanding Survey: Pursuing Improved Dataset Characterization Via Tensor-based Methods, https://arxiv.org/abs/2510.14161
  • Skylar Sargent Walters, Arthea Valderrama, Thomas C. Smits, David Kouřil, Huyen N. Nguyen, Sehi L'Yi, Devin Lange, Nils Gehlenborg, 19 Sep 2025, GQVis: A Dataset of Genomics Data Questions and Visualizations for Generative AI, https://arxiv.org/abs/2510.13816
  • Mayuri Kate and Suresh Neethirajan, 16 Oct 2025, Big Data Approaches to Bovine Bioacoustics: A FAIR-Compliant Dataset and Scalable ML Framework for Precision Livestock Welfare, https://arxiv.org/abs/2510.14443
  • Yunwen Li, Shuangshuang Ying, Xingwei Qu, Xin Li, Sheng Jin, Minghao Liu, Zhoufutu Wen, Tianyu Zheng, Xeron Du, Qiguang Chen, Jiajun Shi, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Libo Qin, Stephen Huang, Wanxiang Che, Chenghua Lin, and Eli Zhang, 16 Oct 2025, COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes, https://arxiv.org/abs/2510.14763
  • Huifang Lyu and James Alvey and Noemi Anau Montel and Mauro Pieroni and Christoph Weniger, 15 Oct 2025, Dynamic SBI: Round-free Sequential Simulation-Based Inference with Adaptive Datasets, https://arxiv.org/abs/2510.13997

Synthetic Data

Research papers on LLM-generated synthetic data for training:

  • Skurzhanskyi, O.H., Marchenko, O.O. & Anisimov, A.V., 2024, Specialized Pre-Training of Neural Networks on Synthetic Data for Improving Paraphrase Generation. Cybern Syst Anal 2024 https://doi.org/10.1007/s10559-024-00658-7 https://link.springer.com/article/10.1007/s10559-024-00658-7
  • Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, 29 Jan 2024, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, https://arxiv.org/abs/2401.16380
  • André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster, 4 Jan 2024, Comprehensive Exploration of Synthetic Data Generation: A Survey https://arxiv.org/abs/2401.02524
  • Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
  • David Spuler, March 2024, Chapter 45. Knowledge Distillation, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • A Gudibande, E Wallace, C Snell, X Geng, H Liu 2023, The false promise of imitating proprietary llms, https://arxiv.org/abs/2305.15717
  • Y Wang, W Zhong, L Li, F Mi, X Zeng, W Huang 2023, Aligning large language models with human: A survey, https://arxiv.org/abs/2307.12966
  • Y Gu, L Dong, F Wei, M Huang, 2023, Knowledge Distillation of Large Language Models, https://arxiv.org/abs/2306.08543
  • X Wan, R Sun, H Dai, SO Arik, T Pfister, 2023, Better zero-shot reasoning with self-adaptive prompting, https://arxiv.org/abs/2305.14106
  • S Horawalavithana, S Munikoti, I Stewart, 2023, SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions, https://arxiv.org/abs/2307.01139
  • X Daull, P Bellot, E Bruno, V Martin, 2023, Complex QA and language models hybrid architectures, Survey, https://arxiv.org/abs/2302.09051
  • Z Yuan, J Liu, Q Zi, M Liu, X Peng, Y Lou, 2023, Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation, https://arxiv.org/abs/2308.01240
  • W AlShikh, M Daaboul, K Goddard, B Imel, 2023, Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning, https://arxiv.org/abs/2307.03692
  • Z He, Z Xie, R Jha, H Steck, D Liang, Y Feng, 2023, Large Language Models as Zero-Shot Conversational Recommenders, https://arxiv.org/abs/2308.10053
  • NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared ReLU.)
  • Michael Nuñez, July 18, 2024, Groq’s open-source Llama AI model tops leaderboard, outperforming GPT-4o and Claude in function calling, https://venturebeat.com/ai/groq-open-source-llama-ai-model-tops-leaderboard-outperforming-gpt-4o-and-claude-in-function-calling/
  • Louie Peters, Aug 27, 2024, Two Paths to Small LMs? Synthetic Data (Phi 3.5) vs Pruning & Distillation (Llama-3.1-Minitron), https://newsletter.towardsai.net/p/114-two-paths-to-small-lms-synthetic
  • Aatish Bhatia, Aug. 25, 2024, When A.I.’s Output Is a Threat to A.I. Itself: As A.I.-generated data becomes harder to detect, it’s increasingly likely to be ingested by future A.I., leading to worse results, NY Times, https://www.nytimes.com/interactive/2024/08/26/upshot/ai-synthetic-data.html
  • Shumailov, I., Shumaylov, Z., Zhao, Y. et al. 2024, AI models collapse when trained on recursively generated data. Nature 631, 755–759. https://doi.org/10.1038/s41586-024-07566-y https://www.nature.com/articles/s41586-024-07566-y
  • Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, Gauthier Gidel, 12 Jun 2024, Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences, https://arxiv.org/abs/2407.09499
  • Ryan McNeal, Aug 27, 2024, ChatGPT and GPT-4 could get a sweet upgrade this fall with 'strawberry', https://www.androidauthority.com/openai-strawberry-ai-3475682/
  • Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai, 10 Aug 2024 (v2), Best Practices and Lessons Learned on Synthetic Data, https://arxiv.org/abs/2404.07503
  • Georgia Argyro, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou, 10 Sep 2024, Prompt2Fashion: An automatically generated fashion dataset, https://arxiv.org/abs/2409.06442
  • Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli, 12 Sep 2024, Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources, https://arxiv.org/abs/2409.08239
  • Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, Mehran Kazemi, 29 Aug 2024, Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, https://arxiv.org/abs/2408.16737
  • Ulyana Piterbarg, Lerrel Pinto, Rob Fergus, 3 Oct 2024, Training Language Models on Synthetic Edit Sequences Improves Code Synthesis, https://arxiv.org/abs/2410.02749
  • Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang, 16 Oct 2024, A Survey on Data Synthesis and Augmentation for Large Language Models, https://arxiv.org/abs/2410.12896
  • Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He, 23 Oct 2024, SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains, https://arxiv.org/abs/2410.17952
  • Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
  • Arindam Mitra, Ahmed Awadallah, Yash Lara, November 14, 2024, Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators/
  • Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig, 4 Dec 2024, Evaluating Language Models as Synthetic Data Generators, https://arxiv.org/abs/2412.03679
  • Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
  • Xiang Huang, Jiayu Shen, Shanshan Huang, Sitao Cheng, Xiaxia Wang, Yuzhong Qu, 27 Dec 2024, TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data, https://arxiv.org/abs/2412.19544
  • Sebastian Raschka, PhD, Jan 15, 2025, Noteworthy AI Research Papers of 2024 (Part Two). Six influential AI papers from July to December, https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2 (Examines multimodal Llama 3 models and the different multimodal architectures.)
  • FZ Subah, Oct 2025, Mitigating and Assessing Bias and Fairness in Large Language Model-Generated Synthetic Tabular Data, Masters Thesis, Department of Engineering, University of Cambridge, https://www.mlmi.eng.cam.ac.uk/files/2023-2024/fzs21_mitigating_2024.pdf
  • Chetan Harsha, Karmvir Singh Phogat, Sridhar Dasaratha, Sai Akhil Puranam, Shashishekar Ramakrishna, Jan 2025, Synthetic Data Generation Using Large Language Models for Financial Question Answering, Proceedings of the Joint Workshop of the 9th FinNLP, the 6th FNP, and the 1st LLMFinLegal, pages 76–95 January 19–20, 2025, Association for Computational Linguistics, https://aclanthology.org/2025.finnlp-1.7.pdf
  • Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen, 25 Jan 2025, LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion, https://arxiv.org/abs/2501.15089
  • Minsang Kim, Seungjun Baek, 6 Feb 2025, Syntriever: How to Train Your Retriever with Synthetic Data from LLMs, https://arxiv.org/abs/2502.03824
  • Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang, 13 Feb 2025, Logical Reasoning in Large Language Models: A Survey, https://arxiv.org/abs/2502.09100
  • Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen, 18 Feb 2025, Theorem Prover as a Judge for Synthetic Data Generation, https://arxiv.org/abs/2502.13137
  • Maria Korolov, Jun 25, 2025, 7 ways synthetic data creates business value, https://www.cio.com/article/4003262/7-ways-synthetic-data-creates-business-value.html
  • Ali Zolnour, Hossein Azadmaleki, Yasaman Haghbin, Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sina Rashidi, Masoud Khani, AmirSajjad Taleban, Samin Mahdizadeh Sani, Maryam Dadkhah, James M. Noble, Suzanne Bakken, Yadollah Yaghoobzadeh, Abdol-Hossein Vahabie, Masoud Rouhizadeh, Maryam Zolnoori, 8 Aug 2025, LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data, https://arxiv.org/abs/2508.10027
  • Nitin Rai, Nathan S. Boyd, Gary E. Vallad, Arnold W. Schumann, 13 Aug 2025, Improving watermelon (Citrullus lanatus) disease classification with generative artificial intelligence (GenAI)-based synthetic and real-field images via a custom EfficientNetV2-L model, https://arxiv.org/abs/2508.10156
  • Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian, 14 Aug 2025, Measuring Diversity in Synthetic Datasets, https://arxiv.org/abs/2502.08512
  • Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng, 22 Jul 2025, Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation, https://arxiv.org/abs/2507.17066
  • Álvaro Ruiz-Ródenas, Jaime Pujante Sáez, Daniel García-Algora, Mario Rodríguez Béjar, Jorge Blasco and José Luis Hernández-Ramos, 21 Jul 2025, SynthCTI: LLM-Driven Synthetic CTI Generation to enhance MITRE Technique Mapping, https://arxiv.org/abs/2507.16852
  • Rishemjit Kaur, Arshdeep Singh Bhankhar, Surangika Ranathunga, Jashanpreet Singh Salh, Sudhir Rajput, Vidhi, Kashish Mahendra, Bhavika Berwal, Ritesh Kumar, 22 Jul 2025, Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain, https://arxiv.org/abs/2507.16974
  • Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong, 22 Jul 2025, More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment, https://arxiv.org/abs/2504.02193
  • Shreya Saxena, Siva Prasad, Zishan Ahmad, Vishal Vaddina, 22 Jul 2025, ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training, https://arxiv.org/abs/2507.16478
  • Ivona Krchova, Michael Platzer, Paul Tiwald, 22 Jul 2025, Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling, https://arxiv.org/abs/2507.16419
  • Alireza Dizaji, Benedict Aaron Tjandra, Mehrab Hamidi, Shenyang Huang, Guillaume Rabusseau, 22 Jul 2025, T-GRAB: A Synthetic Diagnostic Benchmark for Learning on Temporal Graphs, https://arxiv.org/abs/2507.10183
  • Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim, 24 Jul 2025, Synthetic Data Generation for Phrase Break Prediction with Large Language Model, https://arxiv.org/abs/2507.18044
  • Basel Alshaikhdeeb, Ahmed Abdelmonem Hemedan, Soumyabrata Ghosh, Irina Balaur, and Venkata Satagopam, 24 Jul 2025, Generation of Synthetic Clinical Text: A Systematic Review, https://arxiv.org/abs/2507.18451
  • Zhengyun Zhao, Huaiyuan Ying, Yue Zhong, Sheng Yu, 24 Jul 2025, DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data, https://arxiv.org/abs/2507.18583
  • Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim, 24 Jul 2025, SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning, https://arxiv.org/abs/2507.18616
  • Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim, 24 Jul 2025, SIDA: Synthetic Image Driven Zero-shot Domain Adaptation, https://arxiv.org/abs/2507.18632
  • Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng, 24 Jul 2025, Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs, https://arxiv.org/abs/2507.18055
  • Yefeng Yuan, Yuhong Liu, Liang Cheng, 24 Jul 2025, A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models, https://arxiv.org/abs/2404.14445
  • Gregor Baer, Isel Grau, Chao Zhang, Pieter Van Gorp, 24 Jul 2025, Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation, https://arxiv.org/abs/2506.11790
  • Keito Inoshita, Rushia Harada, 15 Jul 2025, Persona-Based Synthetic Data Generation Using Multi-Stage Conditioning with Large Language Models for Emotion Recognition, https://arxiv.org/abs/2507.13380
  • Junsu Kim, Yunhoe Ku, Seungryul Baek, 18 Jul 2025, Can Synthetic Images Conquer Forgetting? Beyond Unexplored Doubts in Few-Shot Class-Incremental Learning, https://arxiv.org/abs/2507.13739
  • Matthew A. Chan, Casey J. Pellizzari, Christopher A. Metzler, 17 Jul 2025, Inverse Synthetic Aperture Fourier Ptychography, https://arxiv.org/abs/2507.03733
  • Claudio Giusti, Luca Guarnera, Mirko Casu, Sebastiano Battiato, 19 Jul 2025, Fraud is Not Just Rarity: A Causal Prototype Attention Approach to Realistic Synthetic Oversampling, https://arxiv.org/abs/2507.14706
  • Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro, 21 Jul 2025, FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs, https://arxiv.org/abs/2507.15839
  • Pan Peng, Hangyu Xu, 20 Jul 2025, Differentially Private Synthetic Graphs Preserving Triangle-Motif Cuts, https://arxiv.org/abs/2507.14835
  • Zijian Ding, Tung Nguyen, Weikai Li, Aditya Grover, Yizhou Sun, Jason Cong, 19 Jul 2025, Iceberg: Enhancing HLS Modeling with Synthetic Data, https://arxiv.org/abs/2507.09948
  • Rohit Kundu, Shan Jia, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury, 19 Jul 2025, TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data, https://arxiv.org/abs/2503.15867
  • Yewon Byun, Shantanu Gupta, Zachary C. Lipton, Rachel Leah Childers, Bryan Wilder, 8 Aug 2025, Using Imperfect Synthetic Data in Downstream Inference Tasks, https://arxiv.org/abs/2508.06635
  • Andrey Sidorenko and Paul Tiwald, 8 Aug 2025, Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN, https://arxiv.org/abs/2508.06647
  • Sabrina Namazova, Alessandra Brondetta, Younes Strittmatter, Matthew Nassar, Sebastian Musslick, 11 Aug 2025, Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant, https://arxiv.org/abs/2508.07887
  • Raunak Narwal and Syed Abbas, 10 Aug 2025, BIGBOY1.2: Generating Realistic Synthetic Data for Disease Outbreak Modelling and Analytics, https://arxiv.org/abs/2508.07239
  • Ethan Lo and Dan C. Lo, 18 Jul 2025, Exoplanet Detection Using Machine Learning Models Trained on Synthetic Light Curves, https://arxiv.org/abs/2507.19520
  • Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Zexue He, Shafiq Abedin, Jennifer Sun, Ben Wiesel, Eli Schwartz, Ahmed Nassar, Bo Wu, Assaf Arbelle, Aude Oliva, Dan Gutfreund, Leonid Karlinsky, Rogerio Feris, 31 May 2025, ChartGen: Scaling Chart Understanding Via Code-Guided Synthetic Chart Generation, https://arxiv.org/abs/2507.19492
  • Tao Lian, Jose L. Gómez, Antonio M. López, 26 Jul 2025, FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving, https://arxiv.org/abs/2507.19881
  • Pavel Korshunov, Ketan Kotwal, Christophe Ecabert, Vidit Vidit, Amir Mohammadi, and Sebastien Marcel, 28 Jul 2025, Investigation of Accuracy and Bias in Face Recognition Trained with Synthetic Data, https://arxiv.org/abs/2507.20782
  • Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, 25 Jul 2025, Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task, https://arxiv.org/abs/2310.09336
  • Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 28 Jul 2025, Explainable Synthetic Image Detection through Diffusion Timestep Ensembling, https://arxiv.org/abs/2503.06201
  • Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz, 28 Jul 2025, StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation, https://arxiv.org/abs/2507.21340
  • Yida Tao, Yen-Chia Hsu, 29 Jul 2025, Bridging Synthetic and Real-World Domains: A Human-in-the-Loop Weakly-Supervised Framework for Industrial Toxic Emission Segmentation, https://arxiv.org/abs/2507.22002
  • Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu, 31 Jul 2025, CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks, https://arxiv.org/abs/2507.23751
  • Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen, 31 Jul 2025, Continual Learning with Synthetic Boundary Experience Blending, https://arxiv.org/abs/2507.23534
  • Jessica Bader, Leander Girrbach, Stephan Alaniz, Zeynep Akata, 31 Jul 2025, SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions, https://arxiv.org/abs/2507.23784
  • Patricia A. Apellániz and Ana Jiménez and Borja Arroyo Galende and Juan Parras and Santiago Zazo, 31 Jul 2025, Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios, https://arxiv.org/abs/2407.03080
  • Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg, 30 Jul 2025, Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning, https://arxiv.org/abs/2502.13820
  • Georgi Ganev and Meenatchi Sundaram Muthu Selva Annamalai and Sofiane Mahiou and Emiliano De Cristofaro, 29 Jul 2025, The Importance of Being Discrete: Measuring the Impact of Discretization in End-to-End Differentially Private Synthetic Data, https://arxiv.org/abs/2504.06923
  • Tom Or and Omri Azencot (Ben Gurion University of the Negev), 1 Aug 2025, Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics, https://arxiv.org/abs/2508.00784
  • Ivona Krchova, Mariana Vargas Vieyra, Mario Scriminaci, Andrey Sidorenko, 1 Aug 2025, Democratizing Tabular Data Access with an Open–Source Synthetic–Data SDK, https://arxiv.org/abs/2508.00718
  • Jianwei Wang, Ziming Wu, Fuming Lai, Shaobing Lian, Ziqian Zeng, 1 Aug 2025, SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought, https://arxiv.org/abs/2508.00574
  • Abdulmajid Murad, Massimiliano Ruocco, 4 Aug 2025, Pre-Tactical Flight-Delay and Turnaround Forecasting with Synthetic Aviation Data, https://arxiv.org/abs/2508.02294
  • Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Cyril Rakovski, Frank Rudzicz, 2 Aug 2025, MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs, https://arxiv.org/abs/2508.01401
  • Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou, 5 Aug 2025, Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation, https://arxiv.org/abs/2508.03117
  • Océane Doremus, Ariel Guerra-Adames, Marta Avalos-Fernandez, Vianney Jouhet, Cédric Gil-Jardiné, Emmanuel Lagarde, 4 Aug 2025, Synthetic medical data generation: state of the art and application to trauma mechanism classification, https://arxiv.org/abs/2508.02771
  • Shifeng Xie, Vasilii Feofanov, Marius Alonso, Ambroise Odonnat, Jianfeng Zhang, Themis Palpanas, and Ievgen Redko, 4 Aug 2025, CauKer: classification time series foundation models can be pretrained on synthetic data only, https://arxiv.org/abs/2508.02879
  • Yongyi Wang, Lingfeng Li, Bozhou Chen, Ang Li, Hanyu Liu, Qirui Zheng, Xionghui Yang, Wenxin Li, 6 Aug 2025, Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling, https://arxiv.org/abs/2508.04282
  • George Bredis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov, 6 Aug 2025, Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success, https://arxiv.org/abs/2508.04280
  • Mohd Ashhad and Ricardo Henao, 5 Aug 2025, Generating Accurate Synthetic Survival Data by Conditioning on Outcomes, https://arxiv.org/abs/2405.17333
  • Yunbo Long, Liming Xu, Alexandra Brintrup, 7 Aug 2025, LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion, https://arxiv.org/abs/2503.02161
  • Ingo Ziegler, Abdullatif Köksal, Desmond Elliott, Hinrich Schütze, 6 Aug 2025, CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation, https://arxiv.org/abs/2409.02098
  • Alejandro Moreno R., Desale Fentaw, Samuel Palmer, Raúl Salles de Padua, Ninad Dixit, Samuel Mugel, Roman Orús, Manuel Radons, Josef Menter, and Ali Abedi, 8 Aug 2025, Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS), https://arxiv.org/abs/2508.06251
  • Ojonugwa Oluwafemi Ejiga Peter, Akingbola Oluwapemiisin, Amalahu Chetachi, Adeniran Opeyemi, Fahmi Khalifa, and Md Mahmudur Rahman, 8 Aug 2025, Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation, https://arxiv.org/abs/2508.06170
  • Pavitra Chauhan, Mohsen Gamal Saad Askar, Kristian Svendsen, Bjørn Fjukstad, Brita Elvevåg, Lars Ailo Bongo, Edvard Pedersen, 8 Aug 2025, From research to clinic: Accelerating the translation of clinical decision support systems by making synthetic data interoperable, https://arxiv.org/abs/2308.02613
  • Shayan Alahyari, Mike Domaratzki, 8 Aug 2025, SMOGAN: Synthetic Minority Oversampling with GAN Refinement for Imbalanced Regression, https://arxiv.org/abs/2504.21152
  • Arshia Ilaty, Hossein Shirazi, Hajar Homayouni, 11 Aug 2025, SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering, https://arxiv.org/abs/2508.08529
  • Audrey Poinsot, Panayiotis Panayiotou, Alessandro Leite, Nicolas Chesneau, Özgür Şimşek, Marc Schoenauer, 12 Aug 2025, Position: Causal Machine Learning Requires Rigorous Synthetic Experiments for Broader Adoption, https://arxiv.org/abs/2508.08883
  • Farah Atif, Nursultan Askarbekuly, Kareem Darwish, Monojit Choudhury, 4 Aug 2025, Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions, https://arxiv.org/abs/2508.08287
  • Vibeke Binz Vallevik, Anne Kjersti C. Befring, Severin Elvatun and Jan Franz Nygaard, 11 Aug 2025, Processing of synthetic data in AI development for healthcare and the definition of personal data in EU law, https://arxiv.org/abs/2508.08353
  • Aydin Zaboli and Junho Hong, 12 Aug 2025, Generative AI for Critical Infrastructure in Smart Grids: A Unified Framework for Synthetic Data Generation and Anomaly Detection, https://arxiv.org/abs/2508.08593
  • Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić, 12 Aug 2025, Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions, https://arxiv.org/abs/2502.13135
  • Min Tang, Peng Lu, Qing Feng, 6 Aug 2025, Generating Feasible and Diverse Synthetic Populations Using Diffusion Models, https://arxiv.org/abs/2508.09164
  • Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, Conghui He, Weijia Li, 13 Aug 2025, Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation, https://arxiv.org/abs/2508.09987
  • Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun, 13 Aug 2025, Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning, https://arxiv.org/abs/2505.16483
  • Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt, 14 Aug 2025, BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining, https://arxiv.org/abs/2508.10975
  • Liam Chalcroft and Ioannis Pappas and Cathy J. Price and John Ashburner, 15 Aug 2025, Synthetic Data for Robust Stroke Segmentation, https://arxiv.org/abs/2404.01946
  • Nitish Nagesh, Salar Shakibhamedan, Mahdi Bagheri, Ziyu Wang, Nima TaheriNejad, Axel Jantsch, Amir M. Rahmani, 15 Aug 2025, FairTabGen: Unifying Counterfactual and Causal Fairness in Synthetic Tabular Data Generation, https://arxiv.org/abs/2508.11810
  • Jonas van Elburg, Peter van der Putten, Maarten Marx, 15 Aug 2025, Can we Evaluate RAGs with Synthetic Data?, https://arxiv.org/abs/2508.11758
  • Ahmet H. Güzel, Ilija Bogunovic, Jack Parker-Holder, 17 Aug 2025, Synthetic Data is Sufficient for Zero-Shot Visual Generalization from Offline Data, https://arxiv.org/abs/2508.12356
  • Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, Yulia Tsvetkov, 17 Aug 2025, Generalizable LLM Learning of Graph Synthetic Data with Post-training Alignment, https://arxiv.org/abs/2506.00845
  • Matey Krastev, Miklos Hamar, Danilo Toapanta, Jesse Brouwers, Yibin Lei, 19 Aug 2025, InPars+: Supercharging Synthetic Data Generation for Information Retrieval Systems, https://arxiv.org/abs/2508.13930
  • Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti, 19 Aug 2025, POPri: Private Federated Learning using Preference-Optimized Synthetic Data, https://arxiv.org/abs/2504.16438
  • Suleyman Olcay Polat, Poli A. Nemkova, Mark V. Albert, 20 Aug 2025, Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method, https://arxiv.org/abs/2508.14783
  • Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban, 20 Aug 2025, Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference, https://arxiv.org/abs/2508.14735
  • Gaston Gustavo Rios, 20 Aug 2025, HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation, https://arxiv.org/abs/2508.14345
  • Saptarshi Neil Sinha and P. Julius Kuehn and Johannes Koppe and Arjan Kuijper and Michael Weinmann, 20 Aug 2025, Neural Restoration of Greening Defects in Historical Autochrome Photographs Based on Purely Synthetic Data, https://arxiv.org/abs/2505.22291
  • Bidyapati Pradhan, Surajit Dasgupta, Amit Kumar Saha, Omkar Anustoop, Sriram Puttagunta, Vipul Mittal, Gopal Sarda, 21 Aug 2025, GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO, https://arxiv.org/abs/2508.15432
  • Jan Kapar, Kathrin Günther, Lori Ann Vallis, Klaus Berger, Nadine Binder, Hermann Brenner, Stefanie Castell, Beate Fischer, Volker Harth, Bernd Holleczek, Timm Intemann, Till Ittermann, André Karch, Thomas Keil, Lilian Krist, Berit Lange, Michael F. Leitzmann, Katharina Nimptsch, Nadia Obi, Iris Pigeot, Tobias Pischon, Tamara Schikowski, Börge Schmidt, Carsten Oliver Schmidt, Anja M. Sedlmair, Justine Tanoey, Harm Wienbergen, Andreas Wienke, Claudia Wigmann and Marvin N. Wright, 19 Aug 2025, Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI, https://arxiv.org/abs/2508.14936
  • Juntao Tan, Liangwei Yang, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Tulika Manoj Awalgaonkar, Jianguo Zhang, Weiran Yao, Ming Zhu, Shirley Kokane, Silvio Savarese, Huan Wang, Caiming Xiong, Shelby Heinecke, 20 Aug 2025, PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data, https://arxiv.org/abs/2502.20616
  • Arefeh Kazemi, Sri Balaaji Natarajan Kalaivendan, Joachim Wagner, Hamza Qadeer, Kanishk Verma, Brian Davis, 20 Aug 2025, Synthetic vs. Gold: The Role of LLM Generated Labels and Data in Cyberbullying Detection, https://arxiv.org/abs/2502.15860
  • Weijie Niu, Alberto Huertas Celdran, Karoline Siarsky, Burkhard Stiller, 22 Aug 2025, FEST: A Unified Framework for Evaluating Synthetic Tabular Data, https://arxiv.org/abs/2508.16254
  • Seyedali Mohammadi, Manas Paldhe, Amit Chhabra, 13 Aug 2025, LingVarBench: Benchmarking LLM for Automated Named Entity Recognition in Structured Synthetic Spoken Transcriptions, https://arxiv.org/abs/2508.15801
  • Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie, 21 Aug 2025, Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset, https://arxiv.org/abs/2508.15986
  • Mika Leo Hube, Filip Lemic, Ethungshan Shitiri, Gerard Calvo Bartra, Sergi Abadal, Xavier Costa Pérez, 22 Aug 2025, Set Transformer Architectures and Synthetic Data Generation for Flow-Guided Nanoscale Localization, https://arxiv.org/abs/2508.16200
  • Stefania L. Moroianu, Christian Bluethgen, Pierre Chambon, Mehdi Cherti, Jean-Benoit Delbrouck, Magdalini Paschali, Brandon Price, Judy Gichoya, Jenia Jitsev, Curtis P. Langlotz, Akshay S. Chaudhari, 22 Aug 2025, Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data, https://arxiv.org/abs/2508.16783
  • Pedro Antonio Rabelo Saraiva, Enzo Ferreira de Souza, Joao Manoel Herrera Pinheiro, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker, 24 Aug 2025, A Synthetic Dataset for Manometry Recognition in Robotic Applications, https://arxiv.org/abs/2508.17468
  • Weikang Wan, Jiawei Fu, Xiaodi Yuan, Yifeng Zhu, Hao Su, 24 Aug 2025, LodeStar: Long-horizon Dexterity via Synthetic Data Augmentation from Human Demonstrations, https://arxiv.org/abs/2508.17547
  • Rishikesh Devanathan, Varun Nathan, Ayush Kumar, 25 Aug 2025, Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation, https://arxiv.org/abs/2508.18210
  • Melissa Kazemi Rad, Alberto Purpura, Himanshu Kumar, Emily Chen, Mohammad Shahed Sorower, 23 Aug 2025, GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection, https://arxiv.org/abs/2508.17057
  • Chenhao Xue, Yuanzhe Jin, Adrian Carrasco-Revilla, Joyraj Chakraborty, Min Chen, 4 Aug 2025, AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification, https://arxiv.org/abs/2508.10000
  • Amirmohammad Farzaneh, Matteo Zecchin, Osvaldo Simeone, 4 Sep 2025, Synthetic Counterfactual Labels for Efficient Conformal Counterfactual Inference, https://arxiv.org/abs/2509.04112
  • Chanon Puttanawarut, Natcha Fongsrisin, Porntep Amornritvanich, Cholatid Ratanatharathorn, Panu Looareesuwan, 4 Sep 2025, Synthetic Survival Data Generation for Heart Failure Prognosis Using Deep Generative Models, https://arxiv.org/abs/2509.04245
  • Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych, 4 Sep 2025, MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions, https://arxiv.org/abs/2509.04183
  • Mollie Shichman, Claire Bonial, Austin Blodgett, Taylor Hudson, Francis Ferraro, Rachel Rudinger, 3 Sep 2025, FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response, https://arxiv.org/abs/2502.18452
  • Seganrasan Subramanian, Abhigya Verma, 4 Sep 2025, Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation, https://arxiv.org/abs/2509.01185
  • Matías Pizarro, Mike Laszkiewicz, Shawkat Hesso, Dorothea Kolossa, Asja Fischer, 4 Sep 2025, Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints, https://arxiv.org/abs/2411.14013
  • Yogev Cohen, Dudi Ohayon, Romy Somkin, Yehudit Aperstein, Alexander Apartsin, 5 Sep 2025, Code Review Without Borders: Evaluating Synthetic vs. Real Data for Review Recommendation, https://arxiv.org/abs/2509.04810
  • Alpana Dubey, Suma Mani Kuriakose, Nitish Bhardwaj, 5 Sep 2025, SynGen-Vision: Synthetic Data Generation for training industrial vision models, https://arxiv.org/abs/2509.04894
  • Kellen Tan Cheng, Anna Lisa Gentile, Chad DeLuca, Guang-Jie Ren, 25 Aug 2025, Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails, https://arxiv.org/abs/2508.18384
  • Ilias Driouich, Hongliu Cao, Eoin Thomas, 26 Aug 2025, Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework, https://arxiv.org/abs/2508.18929
  • Dawei Li, Yue Huang, Ming Li, Tianyi Zhou, Xiangliang Zhang, Huan Liu, 27 Aug 2025, Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era, https://arxiv.org/abs/2508.19570
  • Zhan Shi, Yefeng Yuan, Yuhong Liu, Liang Cheng, Yi Fang, 25 Aug 2025, RL-Finetuned LLMs for Privacy-Preserving Synthetic Rewriting, https://arxiv.org/abs/2508.19286
  • Michael Nidd, Christoph Miksovic, Thomas Gschwind, Francesco Fusco, Andrea Giovannini, Ioana Giurgiu, 27 Aug 2025, Bootstrapping Learned Cost Models with Synthetic SQL Queries, https://arxiv.org/abs/2508.19807
  • Jingze Zhang, Jiahe Qian, Yiliang Zhou, Yifan Peng, 28 Aug 2025, Enhancing Health Fact-Checking with LLM-Generated Synthetic Data, https://arxiv.org/abs/2508.20525
  • Sang Su Lee, Vineeth Loganathan, and Vijay Raghavan, 28 Aug 2025, Dynamic Synthetic Controls vs. Panel-Aware Double Machine Learning for Geo-Level Marketing Impact Estimation, https://arxiv.org/abs/2508.20335
  • Yijia Guo, Junqing Zhang, Y.-W. Peter Hong, 28 Aug 2025, Practical Physical Layer Authentication for Mobile Scenarios Using a Synthetic Dataset Enhanced Deep Learning Approach, https://arxiv.org/abs/2508.20861
  • Yewon Byun, Sanket Vaibhav Mehta, Saurabh Garg, Emma Strubell, Michael Oberst, Bryan Wilder, Zachary C. Lipton, 28 Aug 2025, Expert Routing with Synthetic Data for Continual Learning, https://arxiv.org/abs/2412.17009
  • Joshua Ward, Chi-Hua Wang, Guang Cheng, 28 Aug 2025, Privacy Auditing Synthetic Data Release through Local Likelihood Attacks, https://arxiv.org/abs/2508.21146
  • Pujan Thapa, Alexander Ororbia, Travis Desell, 28 Aug 2025, Class Incremental Continual Learning with Self-Organizing Maps and Variational Autoencoders Using Synthetic Replay, https://arxiv.org/abs/2508.21240
  • João Valente, Atabak Dehban, Rodrigo Ventura, 29 Aug 2025, CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models, https://arxiv.org/abs/2508.21732
  • Jorge Saldivar, Anna Gatzioura, Carlos Castillo, 28 Aug 2025, Synthetic CVs To Build and Test Fairness-Aware Hiring Tools, https://arxiv.org/abs/2508.21179
  • Nidhi Kowtal, Raviraj Joshi, 29 Aug 2025, L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models, https://arxiv.org/abs/2506.00863
  • Shang Liu, Jing Wang, Wenji Fang, Zhiyao Xie, 26 Aug 2025, SynCircuit: Automated Generation of New Synthetic RTL Circuits Can Enable Big Data in Circuits, https://arxiv.org/abs/2509.00071
  • G. Charbel N. Kindji, Elisa Fromont, Lina Maria Rojas-Barahona, Tanguy Urvoy, 27 Aug 2025, Robust Detection of Synthetic Tabular Data under Schema Variability, https://arxiv.org/abs/2509.00092
  • Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki, 2 Sep 2025, Unsupervised Training of Vision Transformers with Synthetic Negatives, https://arxiv.org/abs/2509.02024
  • Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki, 2 Sep 2025, Fake & Square: Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives, https://arxiv.org/abs/2509.02029
  • Yevhen Havrylenko, Meelis Käärik and Artur Tuttar, 2 Sep 2025, Amputation-imputation based generation of synthetic tabular data for ratemaking, https://arxiv.org/abs/2509.02171
  • Hunter Gittlin, 29 Aug 2025, Beyond Synthetic Augmentation: Group-Aware Threshold Calibration for Robust Balanced Accuracy in Imbalanced Learning, https://arxiv.org/abs/2509.02592
  • Vikas Kashtriya and Pardeep Singh, 2 Sep 2025, Enhancing Machine Learning for Imbalanced Medical Data: A Quantum-Inspired Approach to Synthetic Oversampling (QI-SMOTE), https://arxiv.org/abs/2509.02863
  • Jorn K. Teutloff, 29 Aug 2025, Synthetic Founders: AI-Generated Social Simulations for Startup Validation Research in Computational Social Science, https://arxiv.org/abs/2509.02605
  • Leire Benito-Del-Valle, Pedro A. Moreno-Sánchez, Itziar Egusquiza, Itsaso Vitoria, Artzai Picón, Cristina López-Saratxaga, Adrian Galdran, 30 Aug 2025, Is Synthetic Image Augmentation Useful for Imbalanced Classification Problems? Case-Study on the MIDOG2025 Atypical Cell Detection Competition, https://arxiv.org/abs/2509.02612
  • Honglu Zhou, Xiangyu Peng, Shrikant Kendre, Michael S. Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles, 3 Sep 2025, Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data, https://arxiv.org/abs/2509.03501
  • Liming Xu, Yunbo Long, Alexandra Brintrup, 30 Aug 2025, SynDelay: A Synthetic Dataset for Delivery Delay Prediction, https://arxiv.org/abs/2509.05325
  • Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu, 6 Sep 2025, Icon²: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation, https://arxiv.org/abs/2509.05605
  • Ching-Chun Chang and Isao Echizen, 6 Sep 2025, Tell-Tale Watermarks for Explanatory Reasoning in Synthetic Media Forensics, https://arxiv.org/abs/2509.05753
  • Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke, 8 Sep 2025, MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML, https://arxiv.org/abs/2509.06806
  • Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro, 8 Sep 2025, Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection, https://arxiv.org/abs/2412.12761
  • Benjamin Hoffman, David Robinson, Marius Miron, Vittorio Baglione, Daniela Canestrari, Damian Elias, Eva Trapote, Felix Effenberger, Maddie Cusimano, Masato Hagiwara, Olivier Pietquin, 5 Sep 2025, Synthetic data enables context-aware bioacoustic sound event detection, https://arxiv.org/abs/2503.00296
  • Wang Wang, Mingyu Shi, Jun Jiang, Wenqian Ma, Chong Liu, Yasutaka Narazaki, Xuguang Wang, 5 Sep 2025, Empowering Bridge Digital Twins by Bridging the Data Gap with a Unified Synthesis Framework, https://arxiv.org/abs/2507.05814
  • Seunghyeon Kim, Kyeongryeol Go, 22 Jul 2025, Edge-case Synthesis for Fisheye Object Detection: A Data-centric Perspective, https://arxiv.org/abs/2507.16254
  • Xiaopeng Ke, Hexuan Deng, Xuebo Liu, Jun Rao, Zhenxi Song, Jun Yu, Min Zhang, 24 Jul 2025, AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs, https://arxiv.org/abs/2507.18584
  • Xi Long, Christy Boscardin, Lauren A. Maggio, Joseph A. Costello, Ralph Gonzales, Rasmyah Hammoudeh, Ki Lai, Yoon Soo Park, Brian C. Gin, 14 Aug 2025, Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis, https://arxiv.org/abs/2508.09458
  • Qiushi Sun, Jinyang Gong, Lei Li, Qipeng Guo, Fei Yuan, 25 Jul 2025, CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback, https://arxiv.org/abs/2507.22080
  • Xiaoling Hu, Xiangrui Zeng, Oula Puonti, Juan Eugenio Iglesias, Bruce Fischl, Yael Balbastre, 1 Aug 2025, Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation, https://arxiv.org/abs/2411.16719
  • Siyi Liu, Yujia Zheng, Yongqi Zhang, 4 Aug 2025, StructSynth: Leveraging LLMs for Structure-Aware Tabular Data Synthesis in Low-Data Regimes, https://arxiv.org/abs/2508.02601
  • Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, Chi Jin, 5 Aug 2025, Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction, https://arxiv.org/abs/2508.03613
  • Parker Seegmiller, Kartik Mehta, Soumya Saha, Chenyang Tao, Shereen Oraby, Arpit Gupta, Tagyoung Chung, Mohit Bansal and Nanyun Peng, 22 Aug 2025, FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline, https://arxiv.org/abs/2508.16514
  • Feng Tian, Flora D. Salim, Hao Xue, 25 Aug 2025, TradingGroup: A Multi-Agent Trading System with Self-Reflection and Data-Synthesis, https://arxiv.org/abs/2508.17565
  • Sunguk Choi, Yonghoon Kwon, Heondeuk Lee, 26 Aug 2025, CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks, https://arxiv.org/abs/2508.18743
  • Timur Sattarov, Marco Schreyer, Damian Borth, 29 Aug 2025, Federated Diffusion Modeling with Differential Privacy for Tabular Data Synthesis, https://arxiv.org/abs/2412.16083
  • Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu, 30 Aug 2025, Open Data Synthesis For Deep Research, https://arxiv.org/abs/2509.00375
  • Jianwei Wang, Chengming Shi, Junyao Yang, Haoran Li, Qianli Ma, Huiping Zhuang, Cen Chen and Ziqian Zeng, 31 Aug 2025, RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis, https://arxiv.org/abs/2502.18517
  • Yuntao Du, Ninghui Li, 7 Sep 2025, Systematic Assessment of Tabular Data Synthesis, https://arxiv.org/abs/2402.06806
  • Laura Boggia, Bogdan Malaescu, 9 Sep 2025, Synthetic Data Generation with Lorenzetti for Time Series Anomaly Detection in High-Energy Physics Calorimeters, https://arxiv.org/abs/2509.07451
  • Ali Reza Ibrahimzada, Yang Chen, Ryan Rong, Reyhaneh Jabbarvand, 9 Sep 2025, Challenging Bug Prediction and Repair Models with Synthetic Bugs, https://arxiv.org/abs/2310.02407
  • Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra, 11 Sep 2025, A Modular and Multimodal Generative AI Framework for Urban Building Energy Data: Generating Synthetic Homes, https://arxiv.org/abs/2509.09794
  • Keunwoo Choi, Seungheon Doh, Juhan Nam, 18 Aug 2025, TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation, https://arxiv.org/abs/2509.09685
  • Bastián González-Bustamante, Nando Verelst, Carla Cisternas, 11 Sep 2025, Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case, https://arxiv.org/abs/2509.09871
  • Jing Zhang, Alexandre Bousse, Chi-Hieu Pham, Kuangyu Shi, Julien Bert, 12 Sep 2025, Semi-Supervised Learning for Dose Prediction in Targeted Radionuclide: A Synthetic Data Study, https://arxiv.org/abs/2503.05367
  • Tung Vu, Lam Nguyen, Quynh Dao, 10 Sep 2025, PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability, https://arxiv.org/abs/2509.08910
  • Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng, 11 Sep 2025, Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function, https://arxiv.org/abs/2509.09197
  • Nazia Nafis, Inaki Esnaola, Alvaro Martinez-Perez, Maria-Cruz Villa-Uriol, Venet Osmani, 11 Sep 2025, Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review, https://arxiv.org/abs/2504.18544
  • Dimitris Tsirmpas, Ion Androutsopoulos, John Pavlopoulos, 11 Sep 2025, Scalable Evaluation of Online Facilitation Strategies via Synthetic Simulation of Discussions, https://arxiv.org/abs/2503.16505
  • Sepehr Dehdashtian, Mashrur M. Morshed, Jacob H. Seidman, Gaurav Bharaj and Vishnu Naresh Boddeti, 19 Sep 2025, PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors, https://arxiv.org/abs/2509.15551
  • Nakul Sharma, 19 Sep 2025, Efficient Long-Tail Learning in Latent Space by sampling Synthetic Data, https://arxiv.org/abs/2509.15859
  • Nomi Yu, Md Ferdous Alam, A. John Hart, and Faez Ahmed, 17 Sep 2025, GenCAD-3D: CAD Program Generation using Multimodal Latent Space Alignment and Synthetic Dataset Balancing, https://arxiv.org/abs/2509.15246
  • Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, Ruoming Pang, 17 Sep 2025, Synthetic bootstrapped pretraining, https://arxiv.org/abs/2509.15248
  • Caitlin Cisar, Emily Sheffield, Joshua Drake, Alden Harrell, Subramanian Chidambaram, Nikita Nangia, Vinayak Arannil, Alex Williams, 18 Sep 2025, PILOT: Steering Synthetic Data Generation with Psychological & Linguistic Output Targeting, https://arxiv.org/abs/2509.15447
  • Junlong Jia, Xing Wu, Chaochen Gao, Ziyang Chen, Zijia Lin, Zhongzhi Li, Weinong Wang, Haotian Xu, Donghui Jin, Debing Zhang, Binghui Guo, 19 Sep 2025, LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs, https://arxiv.org/abs/2509.15568
  • Yixuan Yang, Zhen Luo, Tongsheng Ding, Junru Lu, Mingqi Gao, Jinyu Yang, Victor Sanchez, Feng Zheng, 19 Sep 2025, OptiScene: LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization, https://arxiv.org/abs/2506.07570
  • Alessandro Crimi and Andrea Brovelli, 15 Sep 2025, Prediction and Causality of functional MRI and synthetic signal using a Zero-Shot Time-Series Foundation Model, https://arxiv.org/abs/2509.12497
  • Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou, 16 Sep 2025, WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning, https://arxiv.org/abs/2509.13305
  • Riyaadh Gani, 12 Sep 2025, Physics-Informed Neural Networks vs. Physics Models for Non-Invasive Glucose Monitoring: A Comparative Study Under Realistic Synthetic Conditions, https://arxiv.org/abs/2509.12253
  • Nolan Platt and Pragyansmita Nayak, 16 Sep 2025, Multi-Model Synthetic Training for Mission-Critical Small Language Models, https://arxiv.org/abs/2509.13047
  • Shanmuka Sadhu, Arca Baran, Preeti Pandey, and Ayush Kumar, 15 Sep 2025, Task Decoding based on Eye Movements using Synthetic Data Augmentation, https://arxiv.org/abs/2509.11547
  • Rumeng Li, Xun Wang, Hong Yu, 5 Sep 2025, DualAlign: Generating Clinically Grounded Synthetic Data, https://arxiv.org/abs/2509.10538
  • Omkar Shailendra Vengurlekar, Adithya Pediredla, Suren Jayasuriya, 14 Sep 2025, SH-SAS: An Implicit Neural Representation for Complex Spherical-Harmonic Scattering Fields for 3D Synthetic Aperture Sonar, https://arxiv.org/abs/2509.11087
  • Milan Marocchi, Matthew Fynn, Kayapanda Mandana, Yue Rong, 15 Sep 2025, Scaling to Multimodal and Multichannel Heart Sound Classification: Fine-Tuning Wav2Vec 2.0 with Synthetic and Augmented Biosignals, https://arxiv.org/abs/2509.11606
  • Mikhail Kulyabin, Jan Joosten, Choro Ulan uulu, Nuno Miguel Martins Pacheco, Fabian Ries, Filippos Petridis, Jan Bosch, and Helena Holmström Olsson, 15 Sep 2025, User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums, https://arxiv.org/abs/2509.11777
  • Lauri Suomela, Sasanka Kuruppu Arachchige, German F. Torres, Harry Edelman, Joni-Kristian Kämäräinen, 15 Sep 2025, Synthetic vs. Real Training Data for Visual Navigation, https://arxiv.org/abs/2509.11791
  • Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji, 13 Sep 2025, FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering, https://arxiv.org/abs/2412.07030
  • Shengjie Ma, Xuhui Jiang, Chengjin Xu, Cehao Yang, Liyu Zhang, Jian Guo, 14 Sep 2025, Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models, https://arxiv.org/abs/2505.00979
  • Karan Dua, Puneet Mittal, Ranjeet Gupta, Hitesh Laxmichand Patel, 15 Sep 2025, SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models, https://arxiv.org/abs/2509.14270
  • Luisa Torquato Niño and Hamza A. A. Gardi, 18 Sep 2025, Synthetic-to-Real Object Detection using YOLOv11 and Domain Randomization Strategies, https://arxiv.org/abs/2509.15045
  • Christopher Wiedeman, Anastasiia Sarmakeeva, Elena Sizikova, Daniil Filienko, Miguel Lago, Jana G. Delfino, Aldo Badano, 18 Sep 2025, T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images, https://arxiv.org/abs/2507.04038
  • Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Thomas Oberlin, 18 Sep 2025, Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation, https://arxiv.org/abs/2505.16360
  • Lauren H. Cooke, Matthias Jung, Jan M. Brendel, Nora M. Kerkovits, Borek Foldyna, Michael T. Lu, Vineet K. Raghu, 10 Sep 2025, RoentMod: A Synthetic Chest X-Ray Modification Model to Identify and Correct Image Interpretation Model Shortcuts, https://arxiv.org/abs/2509.08640
  • Dietmar Offenhuber, 14 Sep 2025, Synthetic Data and the Shifting Ground of Truth, https://arxiv.org/abs/2509.13355
  • Inder Pal Singh, Nidhal Eddine Chenni, Abd El Rahman Shabayek, Arunkumar Rathinam, Djamila Aouada, 17 Sep 2025, Bridging the Synthetic-Real Gap: Supervised Domain Adaptation for Robust Spacecraft 6-DoF Pose Estimation, https://arxiv.org/abs/2509.13792
  • Gustavo Kruger, Nikhil Sachdeva, Michael Sobolev, 17 Sep 2025, Synthetic Data Generation for Screen Time and App Usage, https://arxiv.org/abs/2509.13892
  • Niklas Grieger, Siamak Mehrkanoon, Stephan Bialonski, 17 Sep 2025, Data-Efficient Sleep Staging with Synthetic Time Series Pretraining, https://arxiv.org/abs/2403.08592
  • Karan Dua, Hitesh Laxmichand Patel, Puneet Mittal, Ranjeet Gupta, Amit Agarwal, Praneet Pabolu, Srikant Panda, Hansa Meghwani, Graham Horwood, Fahad Shah, 2 Oct 2025, FlexDoc: Parameterized Sampling for Diverse Multilingual Synthetic Documents for Training Document Understanding Models, https://arxiv.org/abs/2510.02133
  • Brett Barkley and David Fridovich-Keil, 1 Oct 2025, Fixing That Free Lunch: When, Where, and Why Synthetic Data Fails in Model-Based Policy Optimization, https://arxiv.org/abs/2510.01457
  • Feiyang Kang, Newsha Ardalani, Michael Kuchnik, Youssef Emad, Mostafa Elhoushi, Shubhabrata Sengupta, Shang-Wen Li, Ramya Raghavendra, Ruoxi Jia, Carole-Jean Wu, 2 Oct 2025, Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls, https://arxiv.org/abs/2510.01631
  • Adil Koeken, Alexander Ziller, Moritz Knolle, Daniel Rueckert, 2 Oct 2025, Sensitivity, Specificity, and Consistency: A Tripartite Evaluation of Privacy Filters for Synthetic Data Generation, https://arxiv.org/abs/2510.01793
  • Dimitar Peshevski, Kiril Blazhevski, Martin Popovski, Gjorgji Madjarov, 23 Sep 2025, Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision, https://arxiv.org/abs/2510.01229
  • Adithya Rajan, Xiaoyu Liu, Prateek Verma, Vibhu Arora, 2 Oct 2025, Synthetic Prefixes to Mitigate Bias in Real-Time Neural Query Autocomplete, https://arxiv.org/abs/2510.01574
  • Krishna Teja Chitty-Venkata, Murali Emani, 2 Oct 2025, ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models, https://arxiv.org/abs/2510.01582
  • Momin Abbas, Muneeza Azmat, Raya Horesh, Mikhail Yurochkin, 1 Oct 2025, Out-of-Distribution Detection using Synthetic Data Generation, https://arxiv.org/abs/2502.03323
  • Anish Agarwal, Sukjin Han, Dwaipayan Saha, Vasilis Syrgkanis, Haeyeon Yoon, 1 Oct 2025, Synthetic Blips: Generalizing Synthetic Controls for Dynamic Treatment Effects, https://arxiv.org/abs/2210.11003
  • Urs Spiegelhalter, Jörg K.H. Franke, Frank Hutter, 13 Oct 2025, Balancing Synthetic Data and Replay for Enhancing Task-Specific Capabilities, https://arxiv.org/abs/2510.11842
  • Gautier Evennou, Antoine Chaffin, Vivien Chappelier and Ewa Kijak, 14 Oct 2025, Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation, https://arxiv.org/abs/2412.15939
  • Anni Li, Aria Attar, Paul Dong, 30 Sep 2025, Thinkquel: A Model Dedicated to Text-to-dbt Using Synthetic Data and a Span-Aware Objective, https://arxiv.org/abs/2510.00186
  • Jieun Yu, Minjung Park, Sangmi Chai, 1 Oct 2025, Improving Cryptocurrency Pump-and-Dump Detection through Ensemble-Based Models and Synthetic Oversampling Techniques, https://arxiv.org/abs/2510.00836
  • Xiaoyang Liu, Kangjie Bao, Jiashuo Zhang, Yunqi Liu, Yu Chen, Yuntian Liu, Yang Jiao, Tao Luo, 1 Oct 2025, ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data, https://arxiv.org/abs/2502.05567
  • Congzhi Zhang, Zhibin Wang, Yinchao Ma, Jiawei Peng, Yihan Wang, Qiang Zhou, Jun Song, Bo Zheng, 1 Oct 2025, ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis, https://arxiv.org/abs/2509.23652
  • Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita, 23 Sep 2025, ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation, https://arxiv.org/abs/2509.19454
  • Meshi Bashari, Yonghoon Lee, Roy Maor Lotan, Edgar Dobriban, Yaniv Romano, 24 Sep 2025, Statistical Inference Leveraging Synthetic Data with Distribution-Free Guarantees, https://arxiv.org/abs/2509.20345
  • Yijun Liang, Shweta Bhardwaj, Tianyi Zhou, 24 Sep 2025, Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion, https://arxiv.org/abs/2410.13674
  • Yuanyuan Wu, Zhenlin Qin, Zhenliang Ma, 28 Oct 2025, A Comprehensive Evaluation Framework for Synthetic Trip Data Generation in Public Transport, https://arxiv.org/abs/2510.24375
  • Jongsuk Kim, Jaeyoung Lee, Gyojin Han, Dongjae Lee, Minki Jeong, Junmo Kim, 28 Oct 2025, SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration, https://arxiv.org/abs/2510.24052
  • Keiya Hirashima, Shingo Nozaki, Naoto Harada, 28 Oct 2025, Self-supervised Synthetic Pretraining for Inference of Stellar Mass Embedded in Dense Gas, https://arxiv.org/abs/2510.24159
  • Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang, 28 Oct 2025, Repurposing Synthetic Data for Fine-grained Search Agent Supervision, https://arxiv.org/abs/2510.24694
  • Emma Rose Madden, 28 Oct 2025, Evaluating the Use of Large Language Models as Synthetic Social Agents in Social Science Research, https://arxiv.org/abs/2509.26080
  • Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Fabrice Jimenez, Thomas Oberlin, 23 Oct 2025, Synthetic Data for Robust Runway Detection, https://arxiv.org/abs/2510.20349
  • Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G. Campolongo, Matthew J. Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu, 23 Oct 2025, BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models, https://arxiv.org/abs/2510.20095
  • Touqeer Ahmad, Mohammadreza M. Kalan, François Portier, Gilles Stupfler, 23 Oct 2025, Concentration and excess risk bounds for imbalanced classification with synthetic oversampling, https://arxiv.org/abs/2510.20472
  • Shuqiao Liang, Jian Liu, Renzhang Chen, Quanlong Guan, 23 Oct 2025, FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies, https://arxiv.org/abs/2509.20890
  • Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Linyang Li, Haodong Duan, Qingwen Liu, Kai Chen, 18 Oct 2025, NP-Engine: Empowering Optimization Reasoning in Large Language Models with Verifiable Synthetic NP Problems, https://arxiv.org/abs/2510.16476
  • Shurong Lin, Aleksandra Slavković, Deekshith Reddy Bhoomireddy, 19 Oct 2025, Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees, https://arxiv.org/abs/2510.16974
  • Peini Cheng and Amir Bahmani, 16 Oct 2025, Membership Inference over Diffusion-models-based Synthetic Tabular Data, https://arxiv.org/abs/2510.16037
  • Bingji Yi, Qiyuan Liu, Yuwei Cheng, and Haifeng Xu, 18 Oct 2025, Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence, https://arxiv.org/abs/2510.16657
  • Shawn M. Gibford, Mohammad Reza Boskabadi, Christopher J. Savoie, Seyed Soheil Mansouri, 20 Oct 2025, Quantum Synthetic Data Generation for Industrial Bioprocess Monitoring, https://arxiv.org/abs/2510.17688
  • Spencer Giddens, Xiaon Lang, Fang Liu, 20 Oct 2025, SAFES: Sequential Privacy and Fairness Enhancing Data Synthesis for Responsible AI, https://arxiv.org/abs/2411.09178
  • Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang, 20 Oct 2025, Synthetic Series-Symbol Data Generation for Time Series Foundation Models, https://arxiv.org/abs/2510.08445
  • Muhammad Ishfaq Hussain, Ma Van Linh, Zubia Naz, Unse Fatima, Yeongmin Ko, Moongu Jeon, 20 Oct 2025, SWIR-LightFusion: Multi-spectral Semantic Fusion of Synthetic SWIR with Thermal IR (LWIR/MWIR) and RGB, https://arxiv.org/abs/2510.13404
  • Yifan Yan, Shuai Yang, Xiuzhen Guo, Xiangguang Wang, Wei Chow, Yuanchao Shu, Shibo He, 20 Sep 2025, mmExpert: Integrating Large Language Models for Comprehensive mmWave Data Synthesis and Understanding, https://arxiv.org/abs/2509.16521
  • Weihua Du, Hailei Gong, Zhan Ling, Kang Liu, Lingfeng Shen, Xuesong Yao, Yufei Xu, Dingyuan Shi, Yiming Yang, Jiecao Chen, 22 Sep 2025, Generalizable End-to-End Tool-Use RL with Synthetic CodeGym, https://arxiv.org/abs/2509.17325
  • Tianyi Chen, Pengxiao Lin, Zhiwei Wang, Zhi-Qin John Xu, 22 Sep 2025, Achilles' Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data, https://arxiv.org/abs/2509.17514
  • Xin Lei Lin, Soroush Mehraban, Abhishek Moturu, Babak Taati, 20 Sep 2025, Pain in 3D: Generating Controllable Synthetic Faces for Automated Pain Assessment, https://arxiv.org/abs/2509.16727
  • Vivek Iyer, Pinzhen Chen, Ricardo Rei, and Alexandra Birch, 20 Sep 2025, XL-Suite: Cross-Lingual Synthetic Training and Evaluation Data for Open-Ended Generation, https://arxiv.org/abs/2503.22973
  • Suhas BN, Dominik Mattioli, Saeed Abdullah, Rosa I. Arriaga, Chris W. Wiese, Andrew M. Sherrill, 20 Sep 2025, How Real Are Synthetic Therapy Conversations? Evaluating Fidelity in Prolonged Exposure Dialogues, https://arxiv.org/abs/2504.21800
  • Amal Abed, Ivan Lukic, Jörg K.H. Franke, Frank Hutter, 27 Oct 2025, Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks, https://arxiv.org/abs/2510.23208
  • Ollie Olby, Rory Baggott, Namid Stillman, 26 Oct 2025, TABL-ABM: A Hybrid Framework for Synthetic LOB Generation, https://arxiv.org/abs/2510.22685
  • Austin A. Barr, Brij S. Karmur, Anthony J. Winder, Eddie Guo, John T. Lysack, James N. Scott, William F. Morrish, Muneer Eesa, Morgan Willson, David W. Cadotte, Michael M.H. Yang, Ian Y.M. Chan, Sanju Lama, Garnette R. Sutherland, 25 Oct 2025, Expert Validation of Synthetic Cervical Spine Radiographs Generated with a Denoising Diffusion Probabilistic Model, https://arxiv.org/abs/2510.22166
  • Jahidul Arafat, Sanjaya Poudel, 25 Oct 2025, Synthetic-to-Real Transfer Learning for Chromatin-Sensitive PWS Microscopy, https://arxiv.org/abs/2510.22239
  • Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Fuxiao Liu, Tianyi Zhou, Dinesh Manocha, Jordan Lee Boyd-Graber, 26 Oct 2025, VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding, https://arxiv.org/abs/2505.01481
  • Sai Suhruth Reddy Karri, Yashwanth Sai Nallapuneni, Laxmi Narasimha Reddy Mallireddy, Gopichand G, 15 Oct 2025, LLM-Guided Synthetic Augmentation (LGSA) for Mitigating Bias in AI Systems, https://arxiv.org/abs/2510.13202
  • Imon Mia, Armi Tiihonen, Anna Ernst, Anusha Srivastava, Tonio Buonassisi, William Vandenberghe, and Julia W.P. Hsu, 15 Oct 2025, Multi-Variable Batch Bayesian Optimization in Materials Research: Synthetic Data Analysis of Noise Sensitivity and Problem Landscape Effects, https://arxiv.org/abs/2504.03943
  • Marie Brockschmidt, Maresa Schröder, Stefan Feuerriegel, 26 Sep 2025, SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis, https://arxiv.org/abs/2509.22352
  • Aurosweta Mahapatra, Ismail Rasim Ulgen, Berrak Sisman, 25 Sep 2025, HuLA: Prosody-Aware Anti-Spoofing with Multi-Task Learning for Expressive and Emotional Synthetic Speech, https://arxiv.org/abs/2509.21676
  • Luc Boudier, Loris Manganelli, Eleftherios Tsonis, Nicolas Dufour, Vicky Kalogeiton, 26 Sep 2025, Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance, https://arxiv.org/abs/2509.22635
  • Khartik Uppalapati, Shakeel Abdulkareem, Bora Yimenicioglu, 6 Oct 2025, RareGraph-Synth: Knowledge-Guided Diffusion Models for Generating Privacy-Preserving Synthetic Patient Trajectories in Ultra-Rare Diseases, https://arxiv.org/abs/2510.06267
  • Sashank Makanaboyina, 6 Oct 2025, SER-Diff: Synthetic Error Replay Diffusion for Incremental Brain Tumor Segmentation, https://arxiv.org/abs/2510.06283
  • Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin, 8 Oct 2025, SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation, https://arxiv.org/abs/2510.06596
  • Tiago de Conto, John Armston, Ralph Dubayah, 7 Oct 2025, Scalable deep fusion of spaceborne lidar and synthetic aperture radar for global forest structural complexity mapping, https://arxiv.org/abs/2510.06299
  • Junki Mori, Kazuya Kakizaki, Taiki Miyagawa, Jun Sakuma, 8 Oct 2025, Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG), https://arxiv.org/abs/2510.06719
  • Moonkyung Ryu, Chih-Wei Hsu, Yinlam Chow, Mohammad Ghavamzadeh, Craig Boutilier, 26 Sep 2025, Synthetic Dialogue Generation for Interactive Conversational Elicitation & Recommendation (ICER), https://arxiv.org/abs/2510.02331
  • He Du, Bowen Li, Aijun Yang, Siyang He, Qipeng Guo, Dacheng Tao, 20 Oct 2025, EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning, https://arxiv.org/abs/2510.17928
  • Harry Amad and Zhaozhi Qian and Dennis Frauen and Julianna Piskorz and Stefan Feuerriegel and Mihaela van der Schaar, 21 Oct 2025, Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference, https://arxiv.org/abs/2510.18768
  • Henrique de Lima Alexandre and Clodoaldo Aparecido de Moraes Lima, 3 Oct 2025, Synthetic EEG Generation using Diffusion Models for Motor Imagery Tasks, https://arxiv.org/abs/2510.17832
  • Pranav Sambhu, Om Guin, Madhav Sambhu, Jinho Cha, 20 Oct 2025, Curriculum Learning with Synthetic Data for Enhanced Pulmonary Nodule Detection in Chest Radiographs, https://arxiv.org/abs/2510.07681
  • Maria F. Davila R and Azizjon Turaev and Wolfram Wingerath, 25 Sep 2025, Measuring LLM Sensitivity in Transformer-based Tabular Data Synthesis, https://arxiv.org/abs/2509.20768
  • Bing Liu, Wenqiang Yv, Xuzheng Yang, Shichang Wang, Junzhuo Liu, Peng Wang, Guoqing Wang, Yang Yang, and Heng Tao Shen, 25 Sep 2025, GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions, https://arxiv.org/abs/2509.21050
  • Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu and Chen Zhang, 25 Sep 2025, Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy, https://arxiv.org/abs/2509.21190
  • Hadley Black, Kasper Green Larsen, Arya Mazumdar, Barna Saha, Geelon So, 25 Sep 2025, Actively Learning Halfspaces without Synthetic Data, https://arxiv.org/abs/2509.20848
  • Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud, 25 Sep 2025, CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density, https://arxiv.org/abs/2509.18458
  • Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, Alexander Toshev, 29 Sep 2025, Scaling Synthetic Task Generation for Agents via Exploration, https://arxiv.org/abs/2509.25047
  • Mohammed Sabry, Anya Belz, 26 Sep 2025, What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?, https://arxiv.org/abs/2509.22947
  • Zi Liang and Qingqing Ye and Xuan Liu and Yanyun Wang and Jianliang Xu and Haibo Hu, 27 Sep 2025, Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data, https://arxiv.org/abs/2509.23041
  • Ting-Kang Wang, Yueh-Po Peng, Li Su and Vincent K.M. Cheung, 28 Sep 2025, VioPTT: Violin Technique-Aware Transcription from Synthetic Data Augmentation, https://arxiv.org/abs/2509.23759
  • Junsu Kim, Yunhoe Ku, Dongyoon Han, Seungryul Baek, 27 Sep 2025, Beyond Synthetic Replays: Turning Diffusion Features into Few-Shot Class-Incremental Learning Knowledge, https://arxiv.org/abs/2503.23402
  • Samarth Mishra, Kate Saenko and Venkatesh Saligrama, 28 Sep 2025, SCRAMBLe: Enhancing Multimodal LLM Compositionality with Synthetic Preference Data, https://arxiv.org/abs/2504.04740
  • Chen Qian, Haoyu Zhang, Junnan Ma, Liuhong Zhu, Qingrui Cai, Yu Wang, Ruibo Song, Lv Li, Lin Mei, Xianwang Jiang, Qin Xu, Boyu Jiang, Ran Tao, Chunmiao Chen, Shufang Chen, Dongyun Liang, Qiu Guo, Jianzhong Lin, Taishan Kang, Mengtian Lu, Liyuan Fu, Ruibin Huang, Huijuan Wan, Xu Huang, Jianhua Wang, Di Guo, Hai Zhong, Jianjun Zhou and Xiaobo Qu, 17 Oct 2025, Robust High-Resolution Multi-Organ Diffusion MRI Using Synthetic-Data-Tuned Prompt Learning, https://arxiv.org/abs/2510.15400
  • Hamin Koo and Jaehyung Kim, 17 Oct 2025, EMCee: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context, https://arxiv.org/abs/2503.05846
  • Youngjoon Lee, Seongmin Cho, Yehhyun Jo, Jinu Gong, Hyunjoo Jenny Lee, Joonhyuk Kang, 6 Oct 2025, Forecasting-Based Biomedical Time-series Data Synthesis for Open Data and Robust AI, https://arxiv.org/abs/2510.04622
  • Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R. Fung, Weizhu Chen, Minhao Cheng, Furu Wei, 5 Oct 2025, Scaling Laws of Synthetic Data for Language Models, https://arxiv.org/abs/2503.19551
  • Alexander Gill, Abhilasha Ravichander, Ana Marasović, 3 Oct 2025, What Has Been Lost with Synthetic Evaluation?, https://arxiv.org/abs/2505.22830
  • Hossein A. Rahmani, Varsha Ramineni, Emine Yilmaz, Nick Craswell, Bhaskar Mitra, 4 Oct 2025, Towards Understanding Bias in Synthetic Data for Evaluation, https://arxiv.org/abs/2506.10301
  • Dang Nguyen, Jiping Li, Jinghao Zheng, Baharan Mirzasoleiman, 6 Oct 2025, Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models, https://arxiv.org/abs/2505.21574
  • Ilyas Varshavskiy, Bonu Boboeva, Shuhrat Khalilbekov, Azizjon Azimi, Sergey Shulgin, Akhlitdin Nizamitdinov, Haitz Saez de Ocariz Borde, 10 Oct 2025, Mitigating Model Drift in Developing Economies Using Synthetic Data and Outliers, https://arxiv.org/abs/2510.09294
  • Muhammad Ali Shafique, Kanwal Mehreen, Muhammad Arham, Maaz Amjad, Sabur Butt, Hamza Farooq, 10 Oct 2025, Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation, https://arxiv.org/abs/2510.09051
  • Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Ranjay Krishna, 10 Oct 2025, SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding, https://arxiv.org/abs/2510.09110
  • Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna, 9 Oct 2025, Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training, https://arxiv.org/abs/2412.08221
  • Xiyuan Zhang, Danielle C. Maddix, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W. Mahoney, Cuixiong Hu, Huzefa Rangwala, George Karypis, Bernie Wang, 24 Oct 2025, Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models, https://arxiv.org/abs/2510.21204
  • Jens E. d'Hondt, Wieger R. Punter, Odysseas Papapetrou, 24 Oct 2025, Generative Correlation Manifolds: Generating Synthetic Data with Preserved Higher-Order Correlations, https://arxiv.org/abs/2510.21610
  • Massimiliano Ciranni, Vito Paolo Pastore, Roberto Di Via, Enzo Tartaglione, Francesca Odone, Vittorio Murino, 24 Oct 2025, Diffusing DeBias: Synthetic Bias Amplification for Model Debiasing, https://arxiv.org/abs/2502.09564
  • Parsa Rahimi, Sebastien Marcel, 24 Oct 2025, ScoreMix: Synthetic Data Generation by Score Composition in Diffusion Models Improves Recognition, https://arxiv.org/abs/2506.10226
  • Kai Zeng, Zhanqian Wu, Kaixin Xiong, Xiaobao Wei, Xiangyu Guo, Zhenxin Zhu, Kalok Ho, Lijun Zhou, Bohan Zeng, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wentao Zhang, 24 Oct 2025, Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks, https://arxiv.org/abs/2510.19195
  • Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, Liubov Nedoshivina, Pin-Yu Chen, Prasanna Sattigeri, Xiangliang Zhang, 10 Oct 2025, Building a Foundational Guardrail for General Agentic Systems via Synthetic Data, https://arxiv.org/abs/2510.09781
  • Md Ibrahim Shikder Mahin, Md Shamsul Arefin and Md Tanvir Hasan, 12 Oct 2025, A Hybrid Machine Learning Approach for Synthetic Data Generation with Post Hoc Calibration for Clinical Tabular Datasets, https://arxiv.org/abs/2510.10513
  • Hengyuan Zhang, Shiping Yang, Xiao Liang, Chenming Shang, Yuxuan Jiang, Chaofan Tao, Jing Xiong, Hayden Kwok-Hay So, Ruobing Xie, Angel X. Chang, Ngai Wong, 13 Oct 2025, Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation, https://arxiv.org/abs/2510.10925
  • Sneha Varur, Anirudh R Hanchinamani, Tarun S Bagewadi, Uma Mudenagudi, Chaitra D Desai, Sujata C, Padmashree Desai and Sumit Meharwade, 12 Oct 2025, DISC-GAN: Disentangling Style and Content for Cluster-Specific Synthetic Underwater Image Generation, https://arxiv.org/abs/2510.10782
  • Joshua Niemeijer, Jan Ehrhardt, Heinz Handels, Hristina Uzunova, 13 Oct 2025, Uncertainty-Aware ControlNet: Bridging Domain Gaps with Synthetic Image Generation, https://arxiv.org/abs/2510.11346
  • David Benavente-Rios and Juan Ruiz Rodriguez and Gustavo Gatica, 10 Oct 2025, Exploration of Incremental Synthetic Non-Morphed Images for Single Morphing Attack Detection, https://arxiv.org/abs/2510.09836
  • Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso, 11 Oct 2025, InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques, https://arxiv.org/abs/2407.14494
  • Nupur Kumari, Xi Yin, Jun-Yan Zhu, Ishan Misra, Samaneh Azadi, 13 Oct 2025, Generating Multi-Image Synthetic Data for Text-to-Image Customization, https://arxiv.org/abs/2502.01720
  • Rongchao Xu, Kunlin Cai, Lin Jiang, Dahai Yu, Zhiqing Hong, Yuan Tian, Guang Wang, 9 Oct 2025, GeoGen: A Two-stage Coarse-to-Fine Framework for Fine-grained Synthetic Location-based Social Network Trajectory Generation, https://arxiv.org/abs/2510.07735
  • Jannek Ulm, Kevin Du, Vésteinn Snæbjarnarson, 9 Oct 2025, Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling, https://arxiv.org/abs/2510.08245
  • Amitis Shidani, Tyler Farghly, Yang Sun, Habib Ganjgahi, and George Deligiannidis, 9 Oct 2025, Beyond Real Data: Synthetic Data through the Lens of Regularization, https://arxiv.org/abs/2510.08095
  • Parham Rezaei, Filip Kovacevic, Francesco Locatello, Marco Mondelli, 9 Oct 2025, High-dimensional Analysis of Synthetic Data Selection, https://arxiv.org/abs/2510.08123
  • Zhuoyi Huang, Nutan Sahoo, Anamika Kumari, Girish Kumar, Kexuan Cai, Shixing Cao, Yue Kang, Tian Xia, Somya Chatterjee, Nicholas Hausman, Aidan Jay, Eric S. Rosenthal, Soundar Srinivasan, Sadid Hasan, Alex Fedorov, Sulaiman Vesal, 9 Oct 2025, High-Fidelity Synthetic ECG Generation via Mel-Spectrogram Informed Diffusion Training, https://arxiv.org/abs/2510.05492
  • Rachel Chung, Pratyush Nidhi Sharma, Mikko Siponen, Rohit Vadodaria, and Luke Smith, 23 Sep 2025, Hybrid Data can Enhance the Utility of Synthetic Data for Training Anti-Money Laundering Models, https://arxiv.org/abs/2509.18499
  • Haoyu Wang and Fengze Liu and Jiayao Zhang and Dan Roth and Kyle Richardson, 16 Sep 2025, Event Causality Identification with Synthetic Control, https://arxiv.org/abs/2509.18156
  • Zhipei Xu, Xuanyu Zhang, Qing Huang, Xing Zhou, Jian Zhang, 23 Sep 2025, AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection, https://arxiv.org/abs/2505.15173
  • Mahmoud Ibrahim, Bart Elen, Chang Sun, Gökhan Ertaylan, Michel Dumontier, 22 Oct 2025, Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series, https://arxiv.org/abs/2510.19728
  • Lawrence Phillips, Marc Boubnovski Martell, Aditya Misra, Josefa Lia Stoisser, Cesar A. Prada-Medina, Rory Donovan-Maiye, Kaspar Märtens, 29 Sep 2025, SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction, https://arxiv.org/abs/2509.25346
  • Hasan Alp Caferoğlu, Mehmet Serhat Çelik, Özgür Ulusoy, 30 Sep 2025, SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation, https://arxiv.org/abs/2509.25672
  • Zihao Zhao, Anjalie Field, 30 Sep 2025, Controlled Generation for Private Synthetic Text, https://arxiv.org/abs/2509.25729
  • Chenhua Shi, Gregor Macdonald, Bhavika Jalli, Wanlu Lei, John Zou, Mridul Jain, Joji Philip, 30 Sep 2025, Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications, https://arxiv.org/abs/2509.25736
  • Kyeongryeol Go, 30 Sep 2025, Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis, https://arxiv.org/abs/2509.26158
  • Josefa Lia Stoisser, Lawrence Phillips, Aditya Misra, Tom A. Lamb, Philip Torr, Marc Boubnovski Martell, Julien Fauqueur, Kaspar Märtens, 7 Oct 2025, Towards Label-Free Biological Reasoning Synthetic Dataset Creation via Uncertainty Filtering, https://arxiv.org/abs/2510.05871
  • Muskaan Chopra, Lorenz Sparrenberg, Rafet Sifa, 1 Oct 2025, SynCED-EnDe 2025: A Synthetic and Curated English - German Dataset for Critical Error Detection in Machine Translation, https://arxiv.org/abs/2510.05144
  • Sara Mandelli, Diego Vila-Portela, David Vázquez-Padín, Paolo Bestagini, Fernando Pérez-González, 7 Oct 2025, Beyond Spectral Peaks: Interpreting the Cues Behind Synthetic Image Detection, https://arxiv.org/abs/2510.05633
  • Maria-Teresa De Rosa Palmini and Eva Cetinic, 18 May 2025, Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models, https://arxiv.org/abs/2505.17064
  • Pu Yang, Yunzhen Feng, Ziyuan Chen, Yuhang Wu, Zhuoyuan Li, 16 Oct 2025, Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping, https://arxiv.org/abs/2501.18962

Unnatural Instructions (Synthetic Data)

Research papers on "unnatural instructions," a type of synthetic data for training:

Distributed Training

Distributed training is the optimization of spreading training computations across multiple GPUs or multiple servers. Trillion-parameter models are trained on large clusters of 100,000+ GPUs, using complex multi-server, multi-GPU architectures. Distributed training can also be performed on much more widely dispersed architectures, with servers communicating over the internet.
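As a rough illustration of the simplest form of distributed training (data parallelism), the following sketch simulates several workers in a single Python process. Each worker computes gradients on its own shard of the batch, the gradients are averaged (a stand-in for the all-reduce collective that real clusters run over a GPU interconnect or network), and every worker applies the same update. The toy linear model, the function names, and the hyperparameters are all assumptions made for illustration; this is not how any particular framework implements it.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # one (x, y) training pair


def local_gradient(w: float, b: float, shard: List[Sample]) -> Tuple[float, float]:
    """Gradient of mean squared error for y ~ w*x + b on one worker's shard."""
    n = len(shard)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in shard) / n
    return grad_w, grad_b


def all_reduce_mean(grads: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Simulated all-reduce: average the per-worker gradients."""
    k = len(grads)
    return sum(g[0] for g in grads) / k, sum(g[1] for g in grads) / k


def split_into_shards(data: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Round-robin split; shards are equal-sized when len(data) divides evenly."""
    return [data[i::num_workers] for i in range(num_workers)]


def train_data_parallel(data: List[Sample], num_workers: int = 4,
                        lr: float = 0.1, steps: int = 2000) -> Tuple[float, float]:
    """Gradient descent where each step averages gradients across workers."""
    shards = split_into_shards(data, num_workers)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # On a real cluster these run concurrently, one per GPU or server.
        grads = [local_gradient(w, b, shard) for shard in shards]
        grad_w, grad_b = all_reduce_mean(grads)
        # Every worker applies the identical update, keeping replicas in sync.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    # Toy data generated from y = 3x + 1.
    data = [(i / 4.0, 3.0 * (i / 4.0) + 1.0) for i in range(8)]
    print(train_data_parallel(data))
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so the result is the same as single-device training; the cost that distributed training research tries to reduce is the communication in the averaging step.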

Some of the research papers on distributed training:

Training Costs

Research on the total costs of performing LLM training:

Federated Learning

Research on federated learning, a type of distributed training for LLMs:

  • Caelin Kaplan, Tareq Si Salem, Angelo Rodio, Chuan Xu, Giovanni Neglia, 7 May 2024, Federated Learning for Cooperative Inference Systems: The Case of Early Exit Networks, https://arxiv.org/abs/2405.04249
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti, 2024, Fed-EE: Federating Heterogeneous ASR Models using Early-Exit Architectures, PDF: https://cris.fbk.eu/bitstream/11582/343747/1/paper_49.pdf
  • H Woisetschläger, A Isenko, S Wang, R Mayer, 2023, Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly, https://arxiv.org/abs/2310.03150
  • Lorenzo Sani, Alex Iacob, Zeyu Cao, Bill Marino, Yan Gao, Tomas Paulik, Wanru Zhao, William F. Shen, Preslav Aleksandrov, Xinchi Qiu, Nicholas D. Lane, 19 Jul 2024 (v2), The Future of Large Language Model Pre-training is Federated, https://arxiv.org/abs/2405.10853
  • Joo Hyung Lee, Wonpyo Park, Nicole Elyse Mitchell, Jonathan Pilault, Johan Samir Obando Ceron, Han-Byul Kim, Namhoon Lee, Elias Frantar, Yun Long, Amir Yazdanbakhsh, Woohyun Han, Shivani Agrawal, Suvinay Subramanian, Xin Wang, Sheng-Chun Kao, Xingyao Zhang, Trevor Gale, Aart J.C. Bik, Milen Ferev, Zhonglin Han, Hong-Seok Kim, Yann Dauphin, Gintare Karolina Dziugaite, Pablo Samuel Castro, Utku Evci, 2024, Jaxpruner: A Concise Library for Sparsity Research, Conference on Parsimony and Learning, PMLR 234:515-528, https://proceedings.mlr.press/v234/lee24a.html https://proceedings.mlr.press/v234/lee24a/lee24a.pdf https://openreview.net/forum?id=H2rCZCfXkS https://openreview.net/pdf?id=H2rCZCfXkS
  • Eric Samikwa, 2024, Resource-Aware Distributed Machine Learning for Artificial Intelligence of Things, Ph.D. thesis, Faculty of Science, University of Bern, Switzerland, https://boristheses.unibe.ch/5378/1/24samikwa_e_1_.pdf https://doi.org/10.48549/5378 (Multi-edge device with early exit, "micro-split" scheduling, split/federated learning, and distributed inference.)
  • Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
  • Shengwen Ding, Chenhui Hu, 24 Nov 2024, eFedLLM: Efficient LLM Inference Based on Federated Learning, https://arxiv.org/abs/2411.16003
  • Natalie Lang, Alejandro Cohen, Nir Shlezinger, 27 Mar 2024, Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates, https://arxiv.org/abs/2403.18375
  • Chengxi Li, Ming Xiao, Mikael Skoglund, 22 Mar 2024, Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation, https://arxiv.org/abs/2403.14905
  • Andrew Hard, Antonious M. Girgis, Ehsan Amid, Sean Augenstein, Lara McConnaughey, Rajiv Mathews, Rohan Anil, 14 Mar 2024, Learning from straggler clients in federated learning, https://arxiv.org/abs/2403.09086
  • Hongpeng Guo, Haotian Gu, Xiaoyang Wang, Bo Chen, Eun Kyung Lee, Tamar Eilam, Deming Chen, Klara Nahrstedt, 31 Jan 2024, FedCore: Straggler-Free Federated Learning with Distributed Coresets, https://arxiv.org/abs/2402.00219
  • Frederico Vicente, Cláudia Soares, Dušan Jakovetić, 13 May 2025, Modular Federated Learning: A Meta-Framework Perspective, https://arxiv.org/abs/2505.08646
  • Keke Gai, Dongjue Wang, Jing Yu, Liehuang Zhu, Qi Wu, 14 Aug 2025, A Vision-Language Pre-training Model-Guided Approach for Mitigating Backdoor Attacks in Federated Learning, https://arxiv.org/abs/2508.10315
  • Kejia Fan, Jianheng Tang, Zhirui Yang, Feijiang Han, Jiaxu Li, Run He, Yajiang Huang, Anfeng Liu, Houbing Herbert Song, Yunhuai Liu, Huiping Zhuang, 14 Aug 2025, APFL: Analytic Personalized Federated Learning via Dual-Stream Least Squares, https://arxiv.org/abs/2508.10732
  • Rodrigo Tertulino, 6 Aug 2025, A Robust Pipeline for Differentially Private Federated Learning on Imbalanced Clinical Data using SMOTETomek and FedProx, https://arxiv.org/abs/2508.10017
  • Jane Carney, Kushal Upreti, Gaby G. Dagher, Tim Andersen, 11 Aug 2025, FIDELIS: Blockchain-Enabled Protection Against Poisoning Attacks in Federated Learning, https://arxiv.org/abs/2508.10042
  • Tianjun Yuan, Jiaxiang Geng, Pengchao Han, Xianhao Chen, Bing Luo, 14 Aug 2025, Flexible Personalized Split Federated Learning for On-Device Fine-Tuning of Foundation Models, https://arxiv.org/abs/2508.10349
  • Wenxuan Ye, Xueli An, Junfan Wang, Xueqiang Yan, Georg Carle, 14 Aug 2025, FedABC: Attention-Based Client Selection for Federated Learning with Long-Term View, https://arxiv.org/abs/2507.20871
  • Murtaza Rangwala, KR Venugopal, Rajkumar Buyya, 14 Aug 2025, Blockchain-Enabled Federated Learning, https://arxiv.org/abs/2508.06406
  • Mattia Sabella and Monica Vitali, 23 Jul 2025, Eco-Friendly AI: Unleashing Data Power for Green Federated Learning, https://arxiv.org/abs/2507.17241
  • Aritz Pérez, Carlos Echegoyen and Guzmán Santafé, 23 Jul 2025, Decentralized Federated Learning of Probabilistic Generative Classifiers, https://arxiv.org/abs/2507.17285
  • Amandeep Singh Bhatia, Sabre Kais, 23 Jul 2025, Enhancing Quantum Federated Learning with Fisher Information-Based Optimization, https://arxiv.org/abs/2507.17580
  • Dario Fenoglio, Gabriele Dominici, Pietro Barbiero, Alberto Tonda, Martin Gjoreski, Marc Langheinrich, 23 Jul 2025, Federated Behavioural Planes: Explaining the Evolution of Client Behaviour in Federated Learning, https://arxiv.org/abs/2405.15632
  • Mehdi Khalaj, Shahrzad Golestani Najafabadi, Julita Vassileva, 23 Jul 2025, Privacy-Preserving Multimodal News Recommendation through Federated Learning, https://arxiv.org/abs/2507.15460
  • Binbin Ding, Penghui Yang, Sheng-Jun Huang, 22 Jul 2025, FLAIN: Mitigating Backdoor Attacks in Federated Learning via Flipping Weight Updates of Low-Activation Input Neurons, https://arxiv.org/abs/2408.08655
  • Seung-Wook Kim, Seongyeol Kim, Jiah Kim, Seowon Ji, Se-Ho Lee, 22 Jul 2025, FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization, https://arxiv.org/abs/2506.23516
  • Baran Can Gül, Suraksha Nadig, Stefanos Tziampazis, Nasser Jazdi, Michael Weyrich, 22 Jul 2025, FedMultiEmo: Real-Time Emotion Recognition via Multimodal Federated Learning, https://arxiv.org/abs/2507.15470
  • Obaidullah Zaland, Chanh Nguyen, Florian T. Pokorny and Monowar Bhuyan, 23 Jul 2025, Federated Learning for Large-Scale Cloud Robotic Manipulation: Opportunities and Challenges, https://arxiv.org/abs/2507.17903
  • Ahmad Alhonainy, Praveen Rao, 19 Jul 2025, Caching Techniques for Reducing the Communication Cost of Federated Learning in IoT Environments, https://arxiv.org/abs/2507.17772
  • Constantin Philippenko and Aymeric Dieuleveut, 24 Jul 2025, Compressed and distributed least-squares regression: convergence rates with applications to Federated Learning, https://arxiv.org/abs/2308.01358
  • Daniel Commey, Kamel Abbad, Garth V. Crosby and Lyes Khoukhi, 18 Jul 2025, FedSkipTwin: Digital-Twin-Guided Client Skipping for Communication-Efficient Federated Learning, https://arxiv.org/abs/2507.13624
  • Sahar Ghoflsaz Ghinani and Elaheh Sadredini, 18 Jul 2025, FuSeFL: Fully Secure and Scalable Cross-Silo Federated Learning, https://arxiv.org/abs/2507.13591
  • Di Yu, Xin Du, Linshan Jiang, Huijing Zhang, Shuiguang Deng, 18 Jul 2025, Exploiting Label Skewness for Spiking Neural Networks in Federated Learning, https://arxiv.org/abs/2412.17305
  • Huan Wang, Haoran Li, Huaming Chen, Jun Yan, Jiahua Shi, Jun Shen, 18 Jul 2025, FedDifRC: Unlocking the Potential of Text-to-Image Diffusion Models in Heterogeneous Federated Learning, https://arxiv.org/abs/2507.06482
  • Zhiyong Jin, Runhua Xu, Chao Li, Yizhong Liu, Jianxin Li, 18 Jul 2025, Sparsification Under Siege: Defending Against Poisoning Attacks in Communication-Efficient Federated Learning, https://arxiv.org/abs/2505.01454
  • Nuria Rodríguez-Barroso and Mario García-Márquez and M. Victoria Luzón and Francisco Herrera, 21 Jul 2025, Challenges of Trustworthy Federated Learning: What's Done, Current Trends and Remaining Work, https://arxiv.org/abs/2507.15796
  • Yajiao Dai, Jun Li, Zhen Mei, Yiyang Ni, Shi Jin, Zengxiang Li, Sheng Guo, Wei Xiang, 12 Jul 2025, Semi-Supervised Federated Learning via Dual Contrastive Learning and Soft Labeling for Intelligent Fault Diagnosis, https://arxiv.org/abs/2507.14181
  • Md Rafid Haque, Abu Raihan Mostofa Kamal, Md. Azam Hossain, 18 Jul 2025, FedStrategist: A Meta-Learning Framework for Adaptive and Robust Aggregation in Federated Learning, https://arxiv.org/abs/2507.14322
  • Tianle Li, Yongzhi Huang, Linshan Jiang, Qipeng Xie, Chang Liu, Wenfeng Du, Lu Wang, and Kaishun Wu, 20 Jul 2025, FedWCM: Unleashing the Potential of Momentum-based Federated Learning in Long-Tailed Scenarios, https://arxiv.org/abs/2507.14980
  • Yunfeng Li, Junhong Liu, Zhaohui Yang, Guofu Liao, Chuyun Zhang, 20 Jul 2025, Clustered Federated Learning for Generalizable FDIA Detection in Smart Grids with Heterogeneous Data, https://arxiv.org/abs/2507.14999
  • Huiling Yang, Zhanwei Wang, and Kaibin Huang, 21 Jul 2025, Optimal Batch-Size Control for Low-Latency Federated Learning with Device Heterogeneity, https://arxiv.org/abs/2507.15601
  • Juntao Tan, Anran Li, Quanchao Liu, Peng Ran, Lan Zhang, 19 Jul 2025, VTarbel: Targeted Label Attack with Minimal Knowledge on Detector-enhanced Vertical Federated Learning, https://arxiv.org/abs/2507.14625
  • Juntao Tan, Lan Zhang, Zhonghao Hu, Kai Yang, Peng Ran, Bo Li, 19 Jul 2025, VMask: Tunable Label Privacy Protection for Vertical Federated Learning via Layer Masking, https://arxiv.org/abs/2507.14629
  • Khoa Nguyen, Tanveer Khan, Antonis Michalas, 20 Jul 2025, A Privacy-Centric Approach: Scalable and Secure Federated Learning Enabled by Hybrid Homomorphic Encryption, https://arxiv.org/abs/2507.14853
  • Zhipeng Wang, Nanqing Dong, Jiahao Sun, William Knottenbelt, Yike Guo, 21 Jul 2025, zkFL: Zero-Knowledge Proof-based Gradient Aggregation for Federated Learning, https://arxiv.org/abs/2310.02554
  • Shunsuke Yoneda, Valdemar Švábenský, Gen Li, Daisuke Deguchi, Atsushi Shimada, 21 Jul 2025, Ranking-Based At-Risk Student Prediction Using Federated Learning and Differential Features, https://arxiv.org/abs/2505.09287
  • Xinglin Zhao, Yanwen Wang, Xiaobo Liu, Yanrong Hao, Rui Cao, Xin Wen, 8 Aug 2025, A Federated Learning Framework for Handling Subtype Confounding and Heterogeneity in Large-Scale Neuroimaging Diagnosis, https://arxi