Aussie AI

Synthetic Data

Last Updated 17 November, 2025

by David Spuler, Ph.D.

What is Synthetic Data?

Synthetic data is computer-generated text used for LLM training. One approach is to generate entirely new training data, such as by using output from another LLM. Another approach is to "augment" existing training data, such as by using synonymization to create slightly different versions of a training set with different words.
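
The synonymization idea can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration only: the SYNONYMS table is a hypothetical toy thesaurus, and the function names synonymize and augment are invented for this example. A real pipeline would more likely draw replacements from a full thesaurus or from another LLM, and would handle punctuation and capitalization, which this sketch ignores.

```python
import random

# Hypothetical toy synonym table (an assumption for illustration);
# a real pipeline would use a thesaurus or an LLM instead.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "big": ["large", "huge"],
    "happy": ["glad", "cheerful"],
}

def synonymize(text, rng, swap_prob=0.5):
    """Return a copy of text with some known words swapped for a synonym.

    Words are split on whitespace only, so punctuation attached to a word
    prevents a match, and replacements are always lowercase.
    """
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < swap_prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

def augment(dataset, copies=2, seed=0):
    """Return the original examples followed by `copies` variants of each."""
    rng = random.Random(seed)  # seeded so the augmentation is repeatable
    augmented = list(dataset)
    for example in dataset:
        for _ in range(copies):
            augmented.append(synonymize(example, rng))
    return augmented

if __name__ == "__main__":
    for line in augment(["the quick dog is happy", "a big house"]):
        print(line)
```

Each variant keeps the same word count and sentence structure as its source, so the augmented set differs only in word choice. Some variants may be identical to the original when no word is swapped, so a practical version might deduplicate the output.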

Research on Synthetic Data

Research papers include:

Skurzhanskyi, O.H., Marchenko, O.O. & Anisimov, A.V., 2024, Specialized Pre-Training of Neural Networks on Synthetic Data for Improving Paraphrase Generation. Cybern Syst Anal 2024 https://doi.org/10.1007/s10559-024-00658-7 https://link.springer.com/article/10.1007/s10559-024-00658-7
Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, 29 Jan 2024, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, https://arxiv.org/abs/2401.16380
André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster, 4 Jan 2024, Comprehensive Exploration of Synthetic Data Generation: A Survey https://arxiv.org/abs/2401.02524
Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
David Spuler, March 2024, Chapter 45. Knowledge Distillation, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
A Gudibande, E Wallace, C Snell, X Geng, H Liu 2023, The False Promise of Imitating Proprietary LLMs, https://arxiv.org/abs/2305.15717
Y Wang, W Zhong, L Li, F Mi, X Zeng, W Huang 2023, Aligning large language models with human: A survey, https://arxiv.org/abs/2307.12966
Y Gu, L Dong, F Wei, M Huang, 2023, Knowledge Distillation of Large Language Models, https://arxiv.org/abs/2306.08543
X Wan, R Sun, H Dai, SO Arik, T Pfister, 2023, Better zero-shot reasoning with self-adaptive prompting, https://arxiv.org/abs/2305.14106
S Horawalavithana, S Munikoti, I Stewart, 2023, SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions, https://arxiv.org/abs/2307.01139
X Daull, P Bellot, E Bruno, V Martin, 2023, Complex QA and language models hybrid architectures, Survey, https://arxiv.org/abs/2302.09051
Z Yuan, J Liu, Q Zi, M Liu, X Peng, Y Lou, 2023, Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation, https://arxiv.org/abs/2308.01240
W AlShikh, M Daaboul, K Goddard, B Imel, 2023, Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning, https://arxiv.org/abs/2307.03692
Z He, Z Xie, R Jha, H Steck, D Liang, Y Feng, 2023, Large Language Models as Zero-Shot Conversational Recommenders, https://arxiv.org/abs/2308.10053
NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared ReLU.)
Michael Nuñez, July 18, 2024, Groq’s open-source Llama AI model tops leaderboard, outperforming GPT-4o and Claude in function calling, https://venturebeat.com/ai/groq-open-source-llama-ai-model-tops-leaderboard-outperforming-gpt-4o-and-claude-in-function-calling/
Louie Peters, Aug 27, 2024, Two Paths to Small LMs? Synthetic Data (Phi 3.5) vs Pruning & Distillation (Llama-3.1-Minitron), https://newsletter.towardsai.net/p/114-two-paths-to-small-lms-synthetic
Aatish Bhatia, Aug. 25, 2024, When A.I.’s Output Is a Threat to A.I. Itself: As A.I.-generated data becomes harder to detect, it’s increasingly likely to be ingested by future A.I., leading to worse results, NY Times, https://www.nytimes.com/interactive/2024/08/26/upshot/ai-synthetic-data.html
Shumailov, I., Shumaylov, Z., Zhao, Y. et al. 2024, AI models collapse when trained on recursively generated data. Nature 631, 755–759. https://doi.org/10.1038/s41586-024-07566-y https://www.nature.com/articles/s41586-024-07566-y
Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, Gauthier Gidel, 12 Jun 2024, Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences, https://arxiv.org/abs/2407.09499
Ryan McNeal, Aug 27, 2024, ChatGPT and GPT-4 could get a sweet upgrade this fall with 'strawberry', https://www.androidauthority.com/openai-strawberry-ai-3475682/
Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai, 10 Aug 2024 (v2), Best Practices and Lessons Learned on Synthetic Data, https://arxiv.org/abs/2404.07503
Georgia Argyro, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou, 10 Sep 2024, Prompt2Fashion: An automatically generated fashion dataset, https://arxiv.org/abs/2409.06442
Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli, 12 Sep 2024, Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources, https://arxiv.org/abs/2409.08239
Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, Mehran Kazemi, 29 Aug 2024, Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, https://arxiv.org/abs/2408.16737
Ulyana Piterbarg, Lerrel Pinto, Rob Fergus, 3 Oct 2024, Training Language Models on Synthetic Edit Sequences Improves Code Synthesis, https://arxiv.org/abs/2410.02749
Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang, 16 Oct 2024, A Survey on Data Synthesis and Augmentation for Large Language Models, https://arxiv.org/abs/2410.12896
Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He, 23 Oct 2024, SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains, https://arxiv.org/abs/2410.17952
Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
Arindam Mitra, Ahmed Awadallah, Yash Lara, November 14, 2024, Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators/
Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig, 4 Dec 2024, Evaluating Language Models as Synthetic Data Generators, https://arxiv.org/abs/2412.03679
Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
Xiang Huang, Jiayu Shen, Shanshan Huang, Sitao Cheng, Xiaxia Wang, Yuzhong Qu, 27 Dec 2024, TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data, https://arxiv.org/abs/2412.19544
Sebastian Raschka, PhD, Jan 15, 2025, Noteworthy AI Research Papers of 2024 (Part Two). Six influential AI papers from July to December, https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2 (Examines multimodal Llama 3 models and the different multimodal architectures.)
FZ Subah, Oct 2025, Mitigating and Assessing Bias and Fairness in Large Language Model-Generated Synthetic Tabular Data, Masters Thesis, Department of Engineering, University of Cambridge, https://www.mlmi.eng.cam.ac.uk/files/2023-2024/fzs21_mitigating_2024.pdf
Chetan Harsha, Karmvir Singh Phogat, Sridhar Dasaratha, Sai Akhil Puranam, Shashishekar Ramakrishna, Jan 2025, Synthetic Data Generation Using Large Language Models for Financial Question Answering, Proceedings of the Joint Workshop of the 9th FinNLP, the 6th FNP, and the 1st LLMFinLegal, pages 76–95 January 19–20, 2025, Association for Computational Linguistics, https://aclanthology.org/2025.finnlp-1.7.pdf
Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen, 25 Jan 2025, LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion, https://arxiv.org/abs/2501.15089
Minsang Kim, Seungjun Baek, 6 Feb 2025, Syntriever: How to Train Your Retriever with Synthetic Data from LLMs, https://arxiv.org/abs/2502.03824
Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang, 13 Feb 2025, Logical Reasoning in Large Language Models: A Survey, https://arxiv.org/abs/2502.09100
Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen, 18 Feb 2025, Theorem Prover as a Judge for Synthetic Data Generation, https://arxiv.org/abs/2502.13137
Maria Korolov, Jun 25, 2025, 7 ways synthetic data creates business value, https://www.cio.com/article/4003262/7-ways-synthetic-data-creates-business-value.html
Ali Zolnour, Hossein Azadmaleki, Yasaman Haghbin, Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sina Rashidi, Masoud Khani, AmirSajjad Taleban, Samin Mahdizadeh Sani, Maryam Dadkhah, James M. Noble, Suzanne Bakken, Yadollah Yaghoobzadeh, Abdol-Hossein Vahabie, Masoud Rouhizadeh, Maryam Zolnoori, 8 Aug 2025, LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data, https://arxiv.org/abs/2508.10027
Nitin Rai, Nathan S. Boyd, Gary E. Vallad, Arnold W. Schumann, 13 Aug 2025, Improving watermelon (Citrullus lanatus) disease classification with generative artificial intelligence (GenAI)-based synthetic and real-field images via a custom EfficientNetV2-L model, https://arxiv.org/abs/2508.10156
Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian, 14 Aug 2025, Measuring Diversity in Synthetic Datasets, https://arxiv.org/abs/2502.08512
Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng, 22 Jul 2025, Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation, https://arxiv.org/abs/2507.17066
Álvaro Ruiz-Ródenas, Jaime Pujante Sáez, Daniel García-Algora, Mario Rodríguez Béjar, Jorge Blasco and José Luis Hernández-Ramos, 21 Jul 2025, SynthCTI: LLM-Driven Synthetic CTI Generation to enhance MITRE Technique Mapping, https://arxiv.org/abs/2507.16852
Rishemjit Kaur, Arshdeep Singh Bhankhar, Surangika Ranathunga, Jashanpreet Singh Salh, Sudhir Rajput, Vidhi, Kashish Mahendra, Bhavika Berwal, Ritesh Kumar, 22 Jul 2025, Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain, https://arxiv.org/abs/2507.16974
Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong, 22 Jul 2025, More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment, https://arxiv.org/abs/2504.02193
Shreya Saxena, Siva Prasad, Zishan Ahmad, Vishal Vaddina, 22 Jul 2025, ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training, https://arxiv.org/abs/2507.16478
Ivona Krchova, Michael Platzer, Paul Tiwald, 22 Jul 2025, Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling, https://arxiv.org/abs/2507.16419
Alireza Dizaji, Benedict Aaron Tjandra, Mehrab Hamidi, Shenyang Huang, Guillaume Rabusseau, 22 Jul 2025, T-GRAB: A Synthetic Diagnostic Benchmark for Learning on Temporal Graphs, https://arxiv.org/abs/2507.10183
Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim, 24 Jul 2025, Synthetic Data Generation for Phrase Break Prediction with Large Language Model, https://arxiv.org/abs/2507.18044
Basel Alshaikhdeeb, Ahmed Abdelmonem Hemedan, Soumyabrata Ghosh, Irina Balaur, and Venkata Satagopam, 24 Jul 2025, Generation of Synthetic Clinical Text: A Systematic Review, https://arxiv.org/abs/2507.18451
Zhengyun Zhao, Huaiyuan Ying, Yue Zhong, Sheng Yu, 24 Jul 2025, DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data, https://arxiv.org/abs/2507.18583
Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim, 24 Jul 2025, SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning, https://arxiv.org/abs/2507.18616
Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim, 24 Jul 2025, SIDA: Synthetic Image Driven Zero-shot Domain Adaptation, https://arxiv.org/abs/2507.18632
Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng, 24 Jul 2025, Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs, https://arxiv.org/abs/2507.18055
Yefeng Yuan, Yuhong Liu, Liang Cheng, 24 Jul 2025, A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models, https://arxiv.org/abs/2404.14445
Gregor Baer, Isel Grau, Chao Zhang, Pieter Van Gorp, 24 Jul 2025, Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation, https://arxiv.org/abs/2506.11790
Keito Inoshita, Rushia Harada, 15 Jul 2025, Persona-Based Synthetic Data Generation Using Multi-Stage Conditioning with Large Language Models for Emotion Recognition, https://arxiv.org/abs/2507.13380
Junsu Kim, Yunhoe Ku, Seungryul Baek, 18 Jul 2025, Can Synthetic Images Conquer Forgetting? Beyond Unexplored Doubts in Few-Shot Class-Incremental Learning, https://arxiv.org/abs/2507.13739
Matthew A. Chan, Casey J. Pellizzari, Christopher A. Metzler, 17 Jul 2025, Inverse Synthetic Aperture Fourier Ptychography, https://arxiv.org/abs/2507.03733
Claudio Giusti, Luca Guarnera, Mirko Casu, Sebastiano Battiato, 19 Jul 2025, Fraud is Not Just Rarity: A Causal Prototype Attention Approach to Realistic Synthetic Oversampling, https://arxiv.org/abs/2507.14706
Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro, 21 Jul 2025, FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs, https://arxiv.org/abs/2507.15839
Pan Peng, Hangyu Xu, 20 Jul 2025, Differentially Private Synthetic Graphs Preserving Triangle-Motif Cuts, https://arxiv.org/abs/2507.14835
Zijian Ding, Tung Nguyen, Weikai Li, Aditya Grover, Yizhou Sun, Jason Cong, 19 Jul 2025, Iceberg: Enhancing HLS Modeling with Synthetic Data, https://arxiv.org/abs/2507.09948
Rohit Kundu, Shan Jia, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury, 19 Jul 2025, TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data, https://arxiv.org/abs/2503.15867
Yewon Byun, Shantanu Gupta, Zachary C. Lipton, Rachel Leah Childers, Bryan Wilder, 8 Aug 2025, Using Imperfect Synthetic Data in Downstream Inference Tasks, https://arxiv.org/abs/2508.06635
Andrey Sidorenko and Paul Tiwald, 8 Aug 2025, Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN, https://arxiv.org/abs/2508.06647
Sabrina Namazova, Alessandra Brondetta, Younes Strittmatter, Matthew Nassar, Sebastian Musslick, 11 Aug 2025, Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant, https://arxiv.org/abs/2508.07887
Raunak Narwal and Syed Abbas, 10 Aug 2025, BIGBOY1.2: Generating Realistic Synthetic Data for Disease Outbreak Modelling and Analytics, https://arxiv.org/abs/2508.07239
Ethan Lo and Dan C. Lo, 18 Jul 2025, Exoplanet Detection Using Machine Learning Models Trained on Synthetic Light Curves, https://arxiv.org/abs/2507.19520
Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Zexue He, Shafiq Abedin, Jennifer Sun, Ben Wiesel, Eli Schwartz, Ahmed Nassar, Bo Wu, Assaf Arbelle, Aude Oliva, Dan Gutfreund, Leonid Karlinsky, Rogerio Feris, 31 May 2025, ChartGen: Scaling Chart Understanding Via Code-Guided Synthetic Chart Generation, https://arxiv.org/abs/2507.19492
Tao Lian, Jose L. Gómez, Antonio M. López, 26 Jul 2025, FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving, https://arxiv.org/abs/2507.19881
Pavel Korshunov, Ketan Kotwal, Christophe Ecabert, Vidit Vidit, Amir Mohammadi, and Sebastien Marcel, 28 Jul 2025, Investigation of Accuracy and Bias in Face Recognition Trained with Synthetic Data, https://arxiv.org/abs/2507.20782
Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, 25 Jul 2025, Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task, https://arxiv.org/abs/2310.09336
Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 28 Jul 2025, Explainable Synthetic Image Detection through Diffusion Timestep Ensembling, https://arxiv.org/abs/2503.06201
Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz, 28 Jul 2025, StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation, https://arxiv.org/abs/2507.21340
Yida Tao, Yen-Chia Hsu, 29 Jul 2025, Bridging Synthetic and Real-World Domains: A Human-in-the-Loop Weakly-Supervised Framework for Industrial Toxic Emission Segmentation, https://arxiv.org/abs/2507.22002
Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu, 31 Jul 2025, CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks, https://arxiv.org/abs/2507.23751
Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen, 31 Jul 2025, Continual Learning with Synthetic Boundary Experience Blending, https://arxiv.org/abs/2507.23534
Jessica Bader, Leander Girrbach, Stephan Alaniz, Zeynep Akata, 31 Jul 2025, SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions, https://arxiv.org/abs/2507.23784
Patricia A. Apellániz and Ana Jiménez and Borja Arroyo Galende and Juan Parras and Santiago Zazo, 31 Jul 2025, Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios, https://arxiv.org/abs/2407.03080
Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg, 30 Jul 2025, Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning, https://arxiv.org/abs/2502.13820
Georgi Ganev and Meenatchi Sundaram Muthu Selva Annamalai and Sofiane Mahiou and Emiliano De Cristofaro, 29 Jul 2025, The Importance of Being Discrete: Measuring the Impact of Discretization in End-to-End Differentially Private Synthetic Data, https://arxiv.org/abs/2504.06923
Tom Or and Omri Azencot (Ben Gurion University of the Negev), 1 Aug 2025, Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics, https://arxiv.org/abs/2508.00784
Ivona Krchova, Mariana Vargas Vieyra, Mario Scriminaci, Andrey Sidorenko, 1 Aug 2025, Democratizing Tabular Data Access with an Open–Source Synthetic–Data SDK, https://arxiv.org/abs/2508.00718
Jianwei Wang, Ziming Wu, Fuming Lai, Shaobing Lian, Ziqian Zeng, 1 Aug 2025, SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought, https://arxiv.org/abs/2508.00574
Abdulmajid Murad, Massimiliano Ruocco, 4 Aug 2025, Pre-Tactical Flight-Delay and Turnaround Forecasting with Synthetic Aviation Data, https://arxiv.org/abs/2508.02294
Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Cyril Rakovski, Frank Rudzicz, 2 Aug 2025, MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs, https://arxiv.org/abs/2508.01401
Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou, 5 Aug 2025, Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation, https://arxiv.org/abs/2508.03117
Océane Doremus, Ariel Guerra-Adames, Marta Avalos-Fernandez, Vianney Jouhet, Cédric Gil-Jardiné, Emmanuel Lagarde, 4 Aug 2025, Synthetic medical data generation: state of the art and application to trauma mechanism classification, https://arxiv.org/abs/2508.02771
Shifeng Xie, Vasilii Feofanov, Marius Alonso, Ambroise Odonnat, Jianfeng Zhang, Themis Palpanas, and Ievgen Redko, 4 Aug 2025, CauKer: classification time series foundation models can be pretrained on synthetic data only, https://arxiv.org/abs/2508.02879
Yongyi Wang, Lingfeng Li, Bozhou Chen, Ang Li, Hanyu Liu, Qirui Zheng, Xionghui Yang, Wenxin Li, 6 Aug 2025, Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling, https://arxiv.org/abs/2508.04282
George Bredis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov, 6 Aug 2025, Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success, https://arxiv.org/abs/2508.04280
Mohd Ashhad and Ricardo Henao, 5 Aug 2025, Generating Accurate Synthetic Survival Data by Conditioning on Outcomes, https://arxiv.org/abs/2405.17333
Yunbo Long, Liming Xu, Alexandra Brintrup, 7 Aug 2025, LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion, https://arxiv.org/abs/2503.02161
Ingo Ziegler, Abdullatif Köksal, Desmond Elliott, Hinrich Schütze, 6 Aug 2025, CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation, https://arxiv.org/abs/2409.02098
Alejandro Moreno R., Desale Fentaw, Samuel Palmer, Raúl Salles de Padua, Ninad Dixit, Samuel Mugel, Roman Orús, Manuel Radons, Josef Menter, and Ali Abedi, 8 Aug 2025, Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS), https://arxiv.org/abs/2508.06251
Ojonugwa Oluwafemi Ejiga Peter, Akingbola Oluwapemiisin, Amalahu Chetachi, Adeniran Opeyemi, Fahmi Khalifa, and Md Mahmudur Rahman, 8 Aug 2025, Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation, https://arxiv.org/abs/2508.06170
Pavitra Chauhan, Mohsen Gamal Saad Askar, Kristian Svendsen, Bjørn Fjukstad, Brita Elvevåg, Lars Ailo Bongo, Edvard Pedersen, 8 Aug 2025, From research to clinic: Accelerating the translation of clinical decision support systems by making synthetic data interoperable, https://arxiv.org/abs/2308.02613
Shayan Alahyari, Mike Domaratzki, 8 Aug 2025, SMOGAN: Synthetic Minority Oversampling with GAN Refinement for Imbalanced Regression, https://arxiv.org/abs/2504.21152
Arshia Ilaty, Hossein Shirazi, Hajar Homayouni, 11 Aug 2025, SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering, https://arxiv.org/abs/2508.08529
Audrey Poinsot, Panayiotis Panayiotou, Alessandro Leite, Nicolas Chesneau, Özgür Şimşek, Marc Schoenauer, 12 Aug 2025, Position: Causal Machine Learning Requires Rigorous Synthetic Experiments for Broader Adoption, https://arxiv.org/abs/2508.08883
Farah Atif, Nursultan Askarbekuly, Kareem Darwish, Monojit Choudhury, 4 Aug 2025, Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions, https://arxiv.org/abs/2508.08287
Vibeke Binz Vallevik, Anne Kjersti C. Befring, Severin Elvatun and Jan Franz Nygaard, 11 Aug 2025, Processing of synthetic data in AI development for healthcare and the definition of personal data in EU law, https://arxiv.org/abs/2508.08353
Aydin Zaboli and Junho Hong, 12 Aug 2025, Generative AI for Critical Infrastructure in Smart Grids: A Unified Framework for Synthetic Data Generation and Anomaly Detection, https://arxiv.org/abs/2508.08593
Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić, 12 Aug 2025, Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions, https://arxiv.org/abs/2502.13135
Min Tang, Peng Lu, Qing Feng, 6 Aug 2025, Generating Feasible and Diverse Synthetic Populations Using Diffusion Models, https://arxiv.org/abs/2508.09164
Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, Conghui He, Weijia Li, 13 Aug 2025, Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation, https://arxiv.org/abs/2508.09987
Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun, 13 Aug 2025, Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning, https://arxiv.org/abs/2505.16483
Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt, 14 Aug 2025, BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining, https://arxiv.org/abs/2508.10975
Liam Chalcroft and Ioannis Pappas and Cathy J. Price and John Ashburner, 15 Aug 2025, Synthetic Data for Robust Stroke Segmentation, https://arxiv.org/abs/2404.01946
Nitish Nagesh, Salar Shakibhamedan, Mahdi Bagheri, Ziyu Wang, Nima TaheriNejad, Axel Jantsch, Amir M. Rahmani, 15 Aug 2025, FairTabGen: Unifying Counterfactual and Causal Fairness in Synthetic Tabular Data Generation, https://arxiv.org/abs/2508.11810
Jonas van Elburg, Peter van der Putten, Maarten Marx, 15 Aug 2025, Can we Evaluate RAGs with Synthetic Data?, https://arxiv.org/abs/2508.11758
Ahmet H. Güzel, Ilija Bogunovic, Jack Parker-Holder, 17 Aug 2025, Synthetic Data is Sufficient for Zero-Shot Visual Generalization from Offline Data, https://arxiv.org/abs/2508.12356
Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, Yulia Tsvetkov, 17 Aug 2025, Generalizable LLM Learning of Graph Synthetic Data with Post-training Alignment, https://arxiv.org/abs/2506.00845
Matey Krastev, Miklos Hamar, Danilo Toapanta, Jesse Brouwers, Yibin Lei, 19 Aug 2025, InPars+: Supercharging Synthetic Data Generation for Information Retrieval Systems, https://arxiv.org/abs/2508.13930
Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti, 19 Aug 2025, POPri: Private Federated Learning using Preference-Optimized Synthetic Data, https://arxiv.org/abs/2504.16438
Suleyman Olcay Polat, Poli A. Nemkova, Mark V. Albert, 20 Aug 2025, Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method, https://arxiv.org/abs/2508.14783
Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban, 20 Aug 2025, Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference, https://arxiv.org/abs/2508.14735
Gaston Gustavo Rios, 20 Aug 2025, HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation, https://arxiv.org/abs/2508.14345
Saptarshi Neil Sinha and P. Julius Kuehn and Johannes Koppe and Arjan Kuijper and Michael Weinmann, 20 Aug 2025, Neural Restoration of Greening Defects in Historical Autochrome Photographs Based on Purely Synthetic Data, https://arxiv.org/abs/2505.22291
Bidyapati Pradhan, Surajit Dasgupta, Amit Kumar Saha, Omkar Anustoop, Sriram Puttagunta, Vipul Mittal, Gopal Sarda, 21 Aug 2025, GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO, https://arxiv.org/abs/2508.15432
Jan Kapar, Kathrin Günther, Lori Ann Vallis, Klaus Berger, Nadine Binder, Hermann Brenner, Stefanie Castell, Beate Fischer, Volker Harth, Bernd Holleczek, Timm Intemann, Till Ittermann, André Karch, Thomas Keil, Lilian Krist, Berit Lange, Michael F. Leitzmann, Katharina Nimptsch, Nadia Obi, Iris Pigeot, Tobias Pischon, Tamara Schikowski, Börge Schmidt, Carsten Oliver Schmidt, Anja M. Sedlmair, Justine Tanoey, Harm Wienbergen, Andreas Wienke, Claudia Wigmann and Marvin N. Wright, 19 Aug 2025, Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI, https://arxiv.org/abs/2508.14936
Juntao Tan, Liangwei Yang, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Tulika Manoj Awalgaonkar, Jianguo Zhang, Weiran Yao, Ming Zhu, Shirley Kokane, Silvio Savarese, Huan Wang, Caiming Xiong, Shelby Heinecke, 20 Aug 2025, PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data, https://arxiv.org/abs/2502.20616
Arefeh Kazemi and Sri Balaaji Natarajan Kalaivendan and Joachim Wagner and Hamza Qadeer and Kanishk Verma and Brian Davis, 20 Aug 2025, Synthetic vs. Gold: The Role of LLM Generated Labels and Data in Cyberbullying Detection, https://arxiv.org/abs/2502.15860
Weijie Niu, Alberto Huertas Celdran, Karoline Siarsky, Burkhard Stiller, 22 Aug 2025, FEST: A Unified Framework for Evaluating Synthetic Tabular Data, https://arxiv.org/abs/2508.16254
Seyedali Mohammadi, Manas Paldhe, Amit Chhabra, 13 Aug 2025, LingVarBench: Benchmarking LLM for Automated Named Entity Recognition in Structured Synthetic Spoken Transcriptions, https://arxiv.org/abs/2508.15801
Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie, 21 Aug 2025, Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset, https://arxiv.org/abs/2508.15986
Mika Leo Hube, Filip Lemic, Ethungshan Shitiri, Gerard Calvo Bartra, Sergi Abadal, Xavier Costa Pérez, 22 Aug 2025, Set Transformer Architectures and Synthetic Data Generation for Flow-Guided Nanoscale Localization, https://arxiv.org/abs/2508.16200
Stefania L. Moroianu, Christian Bluethgen, Pierre Chambon, Mehdi Cherti, Jean-Benoit Delbrouck, Magdalini Paschali, Brandon Price, Judy Gichoya, Jenia Jitsev, Curtis P. Langlotz, Akshay S. Chaudhari, 22 Aug 2025, Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data, https://arxiv.org/abs/2508.16783
Pedro Antonio Rabelo Saraiva, Enzo Ferreira de Souza, Joao Manoel Herrera Pinheiro, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker, 24 Aug 2025, A Synthetic Dataset for Manometry Recognition in Robotic Applications, https://arxiv.org/abs/2508.17468
Weikang Wan, Jiawei Fu, Xiaodi Yuan, Yifeng Zhu, Hao Su, 24 Aug 2025, LodeStar: Long-horizon Dexterity via Synthetic Data Augmentation from Human Demonstrations, https://arxiv.org/abs/2508.17547
Rishikesh Devanathan, Varun Nathan, Ayush Kumar, 25 Aug 2025, Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation, https://arxiv.org/abs/2508.18210
Melissa Kazemi Rad, Alberto Purpura, Himanshu Kumar, Emily Chen, Mohammad Shahed Sorower, 23 Aug 2025, GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection, https://arxiv.org/abs/2508.17057
Chenhao Xue, Yuanzhe Jin, Adrian Carrasco-Revilla, Joyraj Chakraborty, Min Chen, 4 Aug 2025, AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification, https://arxiv.org/abs/2508.10000
Amirmohammad Farzaneh, Matteo Zecchin, Osvaldo Simeone, 4 Sep 2025, Synthetic Counterfactual Labels for Efficient Conformal Counterfactual Inference, https://arxiv.org/abs/2509.04112
Chanon Puttanawarut, Natcha Fongsrisin, Porntep Amornritvanich, Cholatid Ratanatharathorn, Panu Looareesuwan, 4 Sep 2025, Synthetic Survival Data Generation for Heart Failure Prognosis Using Deep Generative Models, https://arxiv.org/abs/2509.04245
Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych, 4 Sep 2025, MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions, https://arxiv.org/abs/2509.04183
Mollie Shichman, Claire Bonial, Austin Blodgett, Taylor Hudson, Francis Ferraro, Rachel Rudinger, 3 Sep 2025, FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response, https://arxiv.org/abs/2502.18452
Seganrasan Subramanian, Abhigya Verma, 4 Sep 2025, Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation, https://arxiv.org/abs/2509.01185
Matías Pizarro, Mike Laszkiewicz, Shawkat Hesso, Dorothea Kolossa, Asja Fischer, 4 Sep 2025, Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints, https://arxiv.org/abs/2411.14013
Yogev Cohen, Dudi Ohayon, Romy Somkin, Yehudit Aperstein, Alexander Apartsin, 5 Sep 2025, Code Review Without Borders: Evaluating Synthetic vs. Real Data for Review Recommendation, https://arxiv.org/abs/2509.04810
Alpana Dubey, Suma Mani Kuriakose, Nitish Bhardwaj, 5 Sep 2025, SynGen-Vision: Synthetic Data Generation for training industrial vision models, https://arxiv.org/abs/2509.04894
Kellen Tan Cheng, Anna Lisa Gentile, Chad DeLuca, Guang-Jie Ren, 25 Aug 2025, Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails, https://arxiv.org/abs/2508.18384
Ilias Driouich, Hongliu Cao, Eoin Thomas, 26 Aug 2025, Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework, https://arxiv.org/abs/2508.18929
Dawei Li, Yue Huang, Ming Li, Tianyi Zhou, Xiangliang Zhang, Huan Liu, 27 Aug 2025, Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era, https://arxiv.org/abs/2508.19570
Zhan Shi, Yefeng Yuan, Yuhong Liu, Liang Cheng, Yi Fang, 25 Aug 2025, RL-Finetuned LLMs for Privacy-Preserving Synthetic Rewriting, https://arxiv.org/abs/2508.19286
Michael Nidd, Christoph Miksovic, Thomas Gschwind, Francesco Fusco, Andrea Giovannini, Ioana Giurgiu, 27 Aug 2025, Bootstrapping Learned Cost Models with Synthetic SQL Queries, https://arxiv.org/abs/2508.19807
Jingze Zhang, Jiahe Qian, Yiliang Zhou, Yifan Peng, 28 Aug 2025, Enhancing Health Fact-Checking with LLM-Generated Synthetic Data, https://arxiv.org/abs/2508.20525
Sang Su Lee, Vineeth Loganathan, and Vijay Raghavan, 28 Aug 2025, Dynamic Synthetic Controls vs. Panel-Aware Double Machine Learning for Geo-Level Marketing Impact Estimation, https://arxiv.org/abs/2508.20335
Yijia Guo and Junqing Zhang and Y.-W. Peter Hong, 28 Aug 2025, Practical Physical Layer Authentication for Mobile Scenarios Using a Synthetic Dataset Enhanced Deep Learning Approach, https://arxiv.org/abs/2508.20861
Yewon Byun, Sanket Vaibhav Mehta, Saurabh Garg, Emma Strubell, Michael Oberst, Bryan Wilder, Zachary C. Lipton, 28 Aug 2025, Expert Routing with Synthetic Data for Continual Learning, https://arxiv.org/abs/2412.17009
Joshua Ward, Chi-Hua Wang, Guang Cheng, 28 Aug 2025, Privacy Auditing Synthetic Data Release through Local Likelihood Attacks, https://arxiv.org/abs/2508.21146
Pujan Thapa, Alexander Ororbia, Travis Desell, 28 Aug 2025, Class Incremental Continual Learning with Self-Organizing Maps and Variational Autoencoders Using Synthetic Replay, https://arxiv.org/abs/2508.21240
João Valente, Atabak Dehban, Rodrigo Ventura, 29 Aug 2025, CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models, https://arxiv.org/abs/2508.21732
Jorge Saldivar, Anna Gatzioura, Carlos Castillo, 28 Aug 2025, Synthetic CVs To Build and Test Fairness-Aware Hiring Tools, https://arxiv.org/abs/2508.21179
Nidhi Kowtal, Raviraj Joshi, 29 Aug 2025, L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models, https://arxiv.org/abs/2506.00863
Shang Liu, Jing Wang, Wenji Fang, Zhiyao Xie, 26 Aug 2025, SynCircuit: Automated Generation of New Synthetic RTL Circuits Can Enable Big Data in Circuits, https://arxiv.org/abs/2509.00071
G. Charbel N. Kindji (MALT), Elisa Fromont (MALT), Lina Maria Rojas-Barahona, Tanguy Urvoy, 27 Aug 2025, Robust Detection of Synthetic Tabular Data under Schema Variability, https://arxiv.org/abs/2509.00092
Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki, 2 Sep 2025, Unsupervised Training of Vision Transformers with Synthetic Negatives, https://arxiv.org/abs/2509.02024
Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki, 2 Sep 2025, Fake & Square: Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives, https://arxiv.org/abs/2509.02029
Yevhen Havrylenko, Meelis Käärik and Artur Tuttar, 2 Sep 2025, Amputation-imputation based generation of synthetic tabular data for ratemaking, https://arxiv.org/abs/2509.02171
Hunter Gittlin, 29 Aug 2025, Beyond Synthetic Augmentation: Group-Aware Threshold Calibration for Robust Balanced Accuracy in Imbalanced Learning, https://arxiv.org/abs/2509.02592
Vikas Kashtriya and Pardeep Singh, 2 Sep 2025, Enhancing Machine Learning for Imbalanced Medical Data: A Quantum-Inspired Approach to Synthetic Oversampling (QI-SMOTE), https://arxiv.org/abs/2509.02863
Jorn K. Teutloff, 29 Aug 2025, Synthetic Founders: AI-Generated Social Simulations for Startup Validation Research in Computational Social Science, https://arxiv.org/abs/2509.02605
Leire Benito-Del-Valle, Pedro A. Moreno-Sánchez, Itziar Egusquiza, Itsaso Vitoria, Artzai Picón, Cristina López-Saratxaga, Adrian Galdran, 30 Aug 2025, Is Synthetic Image Augmentation Useful for Imbalanced Classification Problems? Case-Study on the MIDOG2025 Atypical Cell Detection Competition, https://arxiv.org/abs/2509.02612
Honglu Zhou, Xiangyu Peng, Shrikant Kendre, Michael S. Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles, 3 Sep 2025, Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data, https://arxiv.org/abs/2509.03501
Liming Xu and Yunbo Long and Alexandra Brintrup, 30 Aug 2025, SynDelay: A Synthetic Dataset for Delivery Delay Prediction, https://arxiv.org/abs/2509.05325
Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu, 6 Sep 2025, Icon²: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation, https://arxiv.org/abs/2509.05605
Ching-Chun Chang and Isao Echizen, 6 Sep 2025, Tell-Tale Watermarks for Explanatory Reasoning in Synthetic Media Forensics, https://arxiv.org/abs/2509.05753
Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke, 8 Sep 2025, MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML, https://arxiv.org/abs/2509.06806
Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro, 8 Sep 2025, Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection, https://arxiv.org/abs/2412.12761
Benjamin Hoffman, David Robinson, Marius Miron, Vittorio Baglione, Daniela Canestrari, Damian Elias, Eva Trapote, Felix Effenberger, Maddie Cusimano, Masato Hagiwara, Olivier Pietquin, 5 Sep 2025, Synthetic data enables context-aware bioacoustic sound event detection, https://arxiv.org/abs/2503.00296
Wang Wang, Mingyu Shi, Jun Jiang, Wenqian Ma, Chong Liu, Yasutaka Narazaki, Xuguang Wang, 5 Sep 2025, Empowering Bridge Digital Twins by Bridging the Data Gap with a Unified Synthesis Framework, https://arxiv.org/abs/2507.05814
Seunghyeon Kim, Kyeongryeol Go, 22 Jul 2025, Edge-case Synthesis for Fisheye Object Detection: A Data-centric Perspective, https://arxiv.org/abs/2507.16254
Xiaopeng Ke and Hexuan Deng and Xuebo Liu and Jun Rao and Zhenxi Song and Jun Yu and Min Zhang, 24 Jul 2025, AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs, https://arxiv.org/abs/2507.18584
Xi Long, Christy Boscardin, Lauren A. Maggio, Joseph A. Costello, Ralph Gonzales, Rasmyah Hammoudeh, Ki Lai, Yoon Soo Park, Brian C. Gin, 14 Aug 2025, Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis, https://arxiv.org/abs/2508.09458
Qiushi Sun, Jinyang Gong, Lei Li, Qipeng Guo, Fei Yuan, 25 Jul 2025, CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback, https://arxiv.org/abs/2507.22080
Xiaoling Hu, Xiangrui Zeng, Oula Puonti, Juan Eugenio Iglesias, Bruce Fischl, Yael Balbastre, 1 Aug 2025, Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation, https://arxiv.org/abs/2411.16719
Siyi Liu, Yujia Zheng, Yongqi Zhang, 4 Aug 2025, StructSynth: Leveraging LLMs for Structure-Aware Tabular Data Synthesis in Low-Data Regimes, https://arxiv.org/abs/2508.02601
Yong Lin and Shange Tang and Bohan Lyu and Ziran Yang and Jui-Hui Chung and Haoyu Zhao and Lai Jiang and Yihan Geng and Jiawei Ge and Jingruo Sun and Jiayun Wu and Jiri Gesi and Ximing Lu and David Acuna and Kaiyu Yang and Hongzhou Lin and Yejin Choi and Danqi Chen and Sanjeev Arora and Chi Jin, 5 Aug 2025, Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction, https://arxiv.org/abs/2508.03613
Parker Seegmiller, Kartik Mehta, Soumya Saha, Chenyang Tao, Shereen Oraby, Arpit Gupta, Tagyoung Chung, Mohit Bansal and Nanyun Peng, 22 Aug 2025, FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline, https://arxiv.org/abs/2508.16514
Feng Tian, Flora D. Salim, Hao Xue, 25 Aug 2025, TradingGroup: A Multi-Agent Trading System with Self-Reflection and Data-Synthesis, https://arxiv.org/abs/2508.17565
Sunguk Choi, Yonghoon Kwon, Heondeuk Lee, 26 Aug 2025, CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks, https://arxiv.org/abs/2508.18743
Timur Sattarov, Marco Schreyer, Damian Borth, 29 Aug 2025, Federated Diffusion Modeling with Differential Privacy for Tabular Data Synthesis, https://arxiv.org/abs/2412.16083
Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu, 30 Aug 2025, Open Data Synthesis For Deep Research, https://arxiv.org/abs/2509.00375
Jianwei Wang, Chengming Shi, Junyao Yang, Haoran Li, Qianli Ma, Huiping Zhuang, Cen Chen and Ziqian Zeng, 31 Aug 2025, RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis, https://arxiv.org/abs/2502.18517
Yuntao Du, Ninghui Li, 7 Sep 2025, Systematic Assessment of Tabular Data Synthesis, https://arxiv.org/abs/2402.06806
Laura Boggia, Bogdan Malaescu, 9 Sep 2025, Synthetic Data Generation with Lorenzetti for Time Series Anomaly Detection in High-Energy Physics Calorimeters, https://arxiv.org/abs/2509.07451
Ali Reza Ibrahimzada, Yang Chen, Ryan Rong, Reyhaneh Jabbarvand, 9 Sep 2025, Challenging Bug Prediction and Repair Models with Synthetic Bugs, https://arxiv.org/abs/2310.02407
Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra, 11 Sep 2025, A Modular and Multimodal Generative AI Framework for Urban Building Energy Data: Generating Synthetic Homes, https://arxiv.org/abs/2509.09794
Keunwoo Choi, Seungheon Doh, Juhan Nam, 18 Aug 2025, TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation, https://arxiv.org/abs/2509.09685
Bastián González-Bustamante, Nando Verelst, Carla Cisternas, 11 Sep 2025, Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case, https://arxiv.org/abs/2509.09871
Jing Zhang, Alexandre Bousse, Chi-Hieu Pham, Kuangyu Shi, Julien Bert, 12 Sep 2025, Semi-Supervised Learning for Dose Prediction in Targeted Radionuclide: A Synthetic Data Study, https://arxiv.org/abs/2503.05367
Tung Vu, Lam Nguyen, Quynh Dao, 10 Sep 2025, PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability, https://arxiv.org/abs/2509.08910
Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng, 11 Sep 2025, Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function, https://arxiv.org/abs/2509.09197
Nazia Nafis, Inaki Esnaola, Alvaro Martinez-Perez, Maria-Cruz Villa-Uriol, Venet Osmani, 11 Sep 2025, Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review, https://arxiv.org/abs/2504.18544
Dimitris Tsirmpas and Ion Androutsopoulos and John Pavlopoulos, 11 Sep 2025, Scalable Evaluation of Online Facilitation Strategies via Synthetic Simulation of Discussions, https://arxiv.org/abs/2503.16505
Sepehr Dehdashtian, Mashrur M. Morshed, Jacob H. Seidman, Gaurav Bharaj and Vishnu Naresh Boddeti, 19 Sep 2025, PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors, https://arxiv.org/abs/2509.15551
Nakul Sharma, 19 Sep 2025, Efficient Long-Tail Learning in Latent Space by sampling Synthetic Data, https://arxiv.org/abs/2509.15859
Nomi Yu, Md Ferdous Alam, A. John Hart, and Faez Ahmed (Massachusetts Institute of Technology), 17 Sep 2025, GenCAD-3D: CAD Program Generation using Multimodal Latent Space Alignment and Synthetic Dataset Balancing, https://arxiv.org/abs/2509.15246
Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, Ruoming Pang, 17 Sep 2025, Synthetic bootstrapped pretraining, https://arxiv.org/abs/2509.15248
Caitlin Cisar, Emily Sheffield, Joshua Drake, Alden Harrell, Subramanian Chidambaram, Nikita Nangia, Vinayak Arannil, Alex Williams, 18 Sep 2025, PILOT: Steering Synthetic Data Generation with Psychological & Linguistic Output Targeting, https://arxiv.org/abs/2509.15447
Junlong Jia, Xing Wu, Chaochen Gao, Ziyang Chen, Zijia Lin, Zhongzhi Li, Weinong Wang, Haotian Xu, Donghui Jin, Debing Zhang, Binghui Guo, 19 Sep 2025, LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs, https://arxiv.org/abs/2509.15568
Yixuan Yang, Zhen Luo, Tongsheng Ding, Junru Lu, Mingqi Gao, Jinyu Yang, Victor Sanchez, Feng Zheng, 19 Sep 2025, OptiScene: LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization, https://arxiv.org/abs/2506.07570
Alessandro Crimi and Andrea Brovelli, 15 Sep 2025, Prediction and Causality of functional MRI and synthetic signal using a Zero-Shot Time-Series Foundation Model, https://arxiv.org/abs/2509.12497
Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou, 16 Sep 2025, WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning, https://arxiv.org/abs/2509.13305
Riyaadh Gani, 12 Sep 2025, Physics-Informed Neural Networks vs. Physics Models for Non-Invasive Glucose Monitoring: A Comparative Study Under Realistic Synthetic Conditions, https://arxiv.org/abs/2509.12253
Nolan Platt and Pragyansmita Nayak, 16 Sep 2025, Multi-Model Synthetic Training for Mission-Critical Small Language Models, https://arxiv.org/abs/2509.13047
Shanmuka Sadhu, Arca Baran, Preeti Pandey, and Ayush Kumar, 15 Sep 2025, Task Decoding based on Eye Movements using Synthetic Data Augmentation, https://arxiv.org/abs/2509.11547
Rumeng Li, Xun Wang, Hong Yu, 5 Sep 2025, DualAlign: Generating Clinically Grounded Synthetic Data, https://arxiv.org/abs/2509.10538
Omkar Shailendra Vengurlekar, Adithya Pediredla, Suren Jayasuriya, 14 Sep 2025, SH-SAS: An Implicit Neural Representation for Complex Spherical-Harmonic Scattering Fields for 3D Synthetic Aperture Sonar, https://arxiv.org/abs/2509.11087
Milan Marocchi, Matthew Fynn, Kayapanda Mandana, Yue Rong, 15 Sep 2025, Scaling to Multimodal and Multichannel Heart Sound Classification: Fine-Tuning Wav2Vec 2.0 with Synthetic and Augmented Biosignals, https://arxiv.org/abs/2509.11606
Mikhail Kulyabin, Jan Joosten, Choro Ulan uulu, Nuno Miguel Martins Pacheco, Fabian Ries, Filippos Petridis, Jan Bosch, and Helena Holmström Olsson, 15 Sep 2025, User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums, https://arxiv.org/abs/2509.11777
Lauri Suomela, Sasanka Kuruppu Arachchige, German F. Torres, Harry Edelman, Joni-Kristian Kämäräinen, 15 Sep 2025, Synthetic vs. Real Training Data for Visual Navigation, https://arxiv.org/abs/2509.11791
Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji, 13 Sep 2025, FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering, https://arxiv.org/abs/2412.07030
Shengjie Ma, Xuhui Jiang, Chengjin Xu, Cehao Yang, Liyu Zhang, Jian Guo, 14 Sep 2025, Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models, https://arxiv.org/abs/2505.00979
Karan Dua, Puneet Mittal, Ranjeet Gupta, Hitesh Laxmichand Patel, 15 Sep 2025, SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models, https://arxiv.org/abs/2509.14270
Luisa Torquato Niño and Hamza A. A. Gardi, 18 Sep 2025, Synthetic-to-Real Object Detection using YOLOv11 and Domain Randomization Strategies, https://arxiv.org/abs/2509.15045
Christopher Wiedeman, Anastasiia Sarmakeeva, Elena Sizikova, Daniil Filienko, Miguel Lago, Jana G. Delfino, Aldo Badano, 18 Sep 2025, T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images, https://arxiv.org/abs/2507.04038
Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Thomas Oberlin, 18 Sep 2025, Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation, https://arxiv.org/abs/2505.16360
Lauren H. Cooke, Matthias Jung, Jan M. Brendel, Nora M. Kerkovits, Borek Foldyna, Michael T. Lu, Vineet K. Raghu, 10 Sep 2025, RoentMod: A Synthetic Chest X-Ray Modification Model to Identify and Correct Image Interpretation Model Shortcuts, https://arxiv.org/abs/2509.08640
Dietmar Offenhuber, 14 Sep 2025, Synthetic Data and the Shifting Ground of Truth, https://arxiv.org/abs/2509.13355
Inder Pal Singh, Nidhal Eddine Chenni, Abd El Rahman Shabayek, Arunkumar Rathinam, Djamila Aouada, 17 Sep 2025, Bridging the Synthetic-Real Gap: Supervised Domain Adaptation for Robust Spacecraft 6-DoF Pose Estimation, https://arxiv.org/abs/2509.13792
Gustavo Kruger, Nikhil Sachdeva, Michael Sobolev, 17 Sep 2025, Synthetic Data Generation for Screen Time and App Usage, https://arxiv.org/abs/2509.13892
Niklas Grieger, Siamak Mehrkanoon, Stephan Bialonski, 17 Sep 2025, Data-Efficient Sleep Staging with Synthetic Time Series Pretraining, https://arxiv.org/abs/2403.08592
Karan Dua, Hitesh Laxmichand Patel, Puneet Mittal, Ranjeet Gupta, Amit Agarwal, Praneet Pabolu, Srikant Panda, Hansa Meghwani, Graham Horwood, Fahad Shah, 2 Oct 2025, FlexDoc: Parameterized Sampling for Diverse Multilingual Synthetic Documents for Training Document Understanding Models, https://arxiv.org/abs/2510.02133
Brett Barkley and David Fridovich-Keil, 1 Oct 2025, Fixing That Free Lunch: When, Where, and Why Synthetic Data Fails in Model-Based Policy Optimization, https://arxiv.org/abs/2510.01457
Feiyang Kang, Newsha Ardalani, Michael Kuchnik, Youssef Emad, Mostafa Elhoushi, Shubhabrata Sengupta, Shang-Wen Li, Ramya Raghavendra, Ruoxi Jia, Carole-Jean Wu, 2 Oct 2025, Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls, https://arxiv.org/abs/2510.01631
Adil Koeken, Alexander Ziller, Moritz Knolle, Daniel Rueckert, 2 Oct 2025, Sensitivity, Specificity, and Consistency: A Tripartite Evaluation of Privacy Filters for Synthetic Data Generation, https://arxiv.org/abs/2510.01793
Dimitar Peshevski, Kiril Blazhevski, Martin Popovski, Gjorgji Madjarov, 23 Sep 2025, Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision, https://arxiv.org/abs/2510.01229
Adithya Rajan, Xiaoyu Liu, Prateek Verma, Vibhu Arora, 2 Oct 2025, Synthetic Prefixes to Mitigate Bias in Real-Time Neural Query Autocomplete, https://arxiv.org/abs/2510.01574
Krishna Teja Chitty-Venkata, Murali Emani, 2 Oct 2025, ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models, https://arxiv.org/abs/2510.01582
Momin Abbas and Muneeza Azmat and Raya Horesh and Mikhail Yurochkin, 1 Oct 2025, Out-of-Distribution Detection using Synthetic Data Generation, https://arxiv.org/abs/2502.03323
Anish Agarwal, Sukjin Han, Dwaipayan Saha, Vasilis Syrgkanis, Haeyeon Yoon, 1 Oct 2025, Synthetic Blips: Generalizing Synthetic Controls for Dynamic Treatment Effects, https://arxiv.org/abs/2210.11003
Urs Spiegelhalter, Jörg K.H. Franke, Frank Hutter, 13 Oct 2025, Balancing Synthetic Data and Replay for Enhancing Task-Specific Capabilities, https://arxiv.org/abs/2510.11842
Gautier Evennou, Antoine Chaffin, Vivien Chappelier and Ewa Kijak, 14 Oct 2025, Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation, https://arxiv.org/abs/2412.15939
Anni Li, Aria Attar, Paul Dong, 30 Sep 2025, Thinkquel: A Model Dedicated to Text-to-dbt Using Synthetic Data and a Span-Aware Objective, https://arxiv.org/abs/2510.00186
Jieun Yu, Minjung Park, Sangmi Chai, 1 Oct 2025, Improving Cryptocurrency Pump-and-Dump Detection through Ensemble-Based Models and Synthetic Oversampling Techniques, https://arxiv.org/abs/2510.00836
Xiaoyang Liu, Kangjie Bao, Jiashuo Zhang, Yunqi Liu, Yu Chen, Yuntian Liu, Yang Jiao, Tao Luo, 1 Oct 2025, ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data, https://arxiv.org/abs/2502.05567
Congzhi Zhang, Zhibin Wang, Yinchao Ma, Jiawei Peng, Yihan Wang, Qiang Zhou, Jun Song, Bo Zheng, 1 Oct 2025, ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis, https://arxiv.org/abs/2509.23652
Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita, 23 Sep 2025, ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation, https://arxiv.org/abs/2509.19454
Meshi Bashari, Yonghoon Lee, Roy Maor Lotan, Edgar Dobriban, Yaniv Romano, 24 Sep 2025, Statistical Inference Leveraging Synthetic Data with Distribution-Free Guarantees, https://arxiv.org/abs/2509.20345
Yijun Liang, Shweta Bhardwaj, Tianyi Zhou, 24 Sep 2025, Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion, https://arxiv.org/abs/2410.13674
Yuanyuan Wu, Zhenlin Qin, Zhenliang Ma, 28 Oct 2025, A Comprehensive Evaluation Framework for Synthetic Trip Data Generation in Public Transport, https://arxiv.org/abs/2510.24375
Jongsuk Kim, Jaeyoung Lee, Gyojin Han, Dongjae Lee, Minki Jeong, Junmo Kim, 28 Oct 2025, SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration, https://arxiv.org/abs/2510.24052
Keiya Hirashima, Shingo Nozaki, Naoto Harada, 28 Oct 2025, Self-supervised Synthetic Pretraining for Inference of Stellar Mass Embedded in Dense Gas, https://arxiv.org/abs/2510.24159
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang, 28 Oct 2025, Repurposing Synthetic Data for Fine-grained Search Agent Supervision, https://arxiv.org/abs/2510.24694
Emma Rose Madden, 28 Oct 2025, Evaluating the Use of Large Language Models as Synthetic Social Agents in Social Science Research, https://arxiv.org/abs/2509.26080
Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Fabrice Jimenez, Thomas Oberlin, 23 Oct 2025, Synthetic Data for Robust Runway Detection, https://arxiv.org/abs/2510.20349
Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G. Campolongo, Matthew J. Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu, 23 Oct 2025, BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models, https://arxiv.org/abs/2510.20095
Touqeer Ahmad, Mohammadreza M. Kalan, François Portier, Gilles Stupfler, 23 Oct 2025, Concentration and excess risk bounds for imbalanced classification with synthetic oversampling, https://arxiv.org/abs/2510.20472
Shuqiao Liang, Jian Liu, Renzhang Chen, Quanlong Guan, 23 Oct 2025, FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies, https://arxiv.org/abs/2509.20890
Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Linyang Li, Haodong Duan, Qingwen Liu, Kai Chen, 18 Oct 2025, NP-Engine: Empowering Optimization Reasoning in Large Language Models with Verifiable Synthetic NP Problems, https://arxiv.org/abs/2510.16476
Shurong Lin, Aleksandra Slavković, Deekshith Reddy Bhoomireddy, 19 Oct 2025, Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees, https://arxiv.org/abs/2510.16974
Peini Cheng and Amir Bahmani, 16 Oct 2025, Membership Inference over Diffusion-models-based Synthetic Tabular Data, https://arxiv.org/abs/2510.16037
Bingji Yi, Qiyuan Liu, Yuwei Cheng, and Haifeng Xu, 18 Oct 2025, Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence, https://arxiv.org/abs/2510.16657
Shawn M. Gibford, Mohammad Reza Boskabadi, Christopher J. Savoie, Seyed Soheil Mansouri, 20 Oct 2025, Quantum Synthetic Data Generation for Industrial Bioprocess Monitoring, https://arxiv.org/abs/2510.17688
Spencer Giddens, Xiaon Lang, Fang Liu, 20 Oct 2025, SAFES: Sequential Privacy and Fairness Enhancing Data Synthesis for Responsible AI, https://arxiv.org/abs/2411.09178
Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang, 20 Oct 2025, Synthetic Series-Symbol Data Generation for Time Series Foundation Models, https://arxiv.org/abs/2510.08445
Muhammad Ishfaq Hussain, Ma Van Linh, Zubia Naz, Unse Fatima, Yeongmin Ko, Moongu Jeon, 20 Oct 2025, SWIR-LightFusion: Multi-spectral Semantic Fusion of Synthetic SWIR with Thermal IR (LWIR/MWIR) and RGB, https://arxiv.org/abs/2510.13404
Yifan Yan, Shuai Yang, Xiuzhen Guo, Xiangguang Wang, Wei Chow, Yuanchao Shu, Shibo He, 20 Sep 2025, mmExpert: Integrating Large Language Models for Comprehensive mmWave Data Synthesis and Understanding, https://arxiv.org/abs/2509.16521
Weihua Du, Hailei Gong, Zhan Ling, Kang Liu, Lingfeng Shen, Xuesong Yao, Yufei Xu, Dingyuan Shi, Yiming Yang, Jiecao Chen, 22 Sep 2025, Generalizable End-to-End Tool-Use RL with Synthetic CodeGym, https://arxiv.org/abs/2509.17325
Tianyi Chen, Pengxiao Lin, Zhiwei Wang, Zhi-Qin John Xu, 22 Sep 2025, Achilles' Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data, https://arxiv.org/abs/2509.17514
Xin Lei Lin, Soroush Mehraban, Abhishek Moturu, Babak Taati, 20 Sep 2025, Pain in 3D: Generating Controllable Synthetic Faces for Automated Pain Assessment, https://arxiv.org/abs/2509.16727
Vivek Iyer, Pinzhen Chen, Ricardo Rei, and Alexandra Birch, 20 Sep 2025, XL-Suite: Cross-Lingual Synthetic Training and Evaluation Data for Open-Ended Generation, https://arxiv.org/abs/2503.22973
Suhas BN, Dominik Mattioli, Saeed Abdullah, Rosa I. Arriaga, Chris W. Wiese, Andrew M. Sherrill, 20 Sep 2025, How Real Are Synthetic Therapy Conversations? Evaluating Fidelity in Prolonged Exposure Dialogues, https://arxiv.org/abs/2504.21800
Amal Abed, Ivan Lukic, Jörg K.H. Franke, Frank Hutter, 27 Oct 2025, Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks, https://arxiv.org/abs/2510.23208
Ollie Olby, Rory Baggott, Namid Stillman, 26 Oct 2025, TABL-ABM: A Hybrid Framework for Synthetic LOB Generation, https://arxiv.org/abs/2510.22685
Austin A. Barr, Brij S. Karmur, Anthony J. Winder, Eddie Guo, John T. Lysack, James N. Scott, William F. Morrish, Muneer Eesa, Morgan Willson, David W. Cadotte, Michael M.H. Yang, Ian Y.M. Chan, Sanju Lama, Garnette R. Sutherland, 25 Oct 2025, Expert Validation of Synthetic Cervical Spine Radiographs Generated with a Denoising Diffusion Probabilistic Model, https://arxiv.org/abs/2510.22166
Jahidul Arafat, Sanjaya Poudel, 25 Oct 2025, Synthetic-to-Real Transfer Learning for Chromatin-Sensitive PWS Microscopy, https://arxiv.org/abs/2510.22239
Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Fuxiao Liu, Tianyi Zhou, Dinesh Manocha, Jordan Lee Boyd-Graber, 26 Oct 2025, VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding, https://arxiv.org/abs/2505.01481
Sai Suhruth Reddy Karri, Yashwanth Sai Nallapuneni, Laxmi Narasimha Reddy Mallireddy, Gopichand G, 15 Oct 2025, LLM-Guided Synthetic Augmentation (LGSA) for Mitigating Bias in AI Systems, https://arxiv.org/abs/2510.13202
Imon Mia, Armi Tiihonen, Anna Ernst, Anusha Srivastava, Tonio Buonassisi, William Vandenberghe, and Julia W.P. Hsu, 15 Oct 2025, Multi-Variable Batch Bayesian Optimization in Materials Research: Synthetic Data Analysis of Noise Sensitivity and Problem Landscape Effects, https://arxiv.org/abs/2504.03943
Marie Brockschmidt, Maresa Schröder, Stefan Feuerriegel, 26 Sep 2025, SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis, https://arxiv.org/abs/2509.22352
Aurosweta Mahapatra, Ismail Rasim Ulgen, Berrak Sisman, 25 Sep 2025, HuLA: Prosody-Aware Anti-Spoofing with Multi-Task Learning for Expressive and Emotional Synthetic Speech, https://arxiv.org/abs/2509.21676
Luc Boudier, Loris Manganelli, Eleftherios Tsonis, Nicolas Dufour, Vicky Kalogeiton, 26 Sep 2025, Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance, https://arxiv.org/abs/2509.22635
Khartik Uppalapati, Shakeel Abdulkareem, Bora Yimenicioglu, 6 Oct 2025, RareGraph-Synth: Knowledge-Guided Diffusion Models for Generating Privacy-Preserving Synthetic Patient Trajectories in Ultra-Rare Diseases, https://arxiv.org/abs/2510.06267
Sashank Makanaboyina, 6 Oct 2025, SER-Diff: Synthetic Error Replay Diffusion for Incremental Brain Tumor Segmentation, https://arxiv.org/abs/2510.06283
Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin, 8 Oct 2025, SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation, https://arxiv.org/abs/2510.06596
Tiago de Conto, John Armston, Ralph Dubayah, 7 Oct 2025, Scalable deep fusion of spaceborne lidar and synthetic aperture radar for global forest structural complexity mapping, https://arxiv.org/abs/2510.06299
Junki Mori, Kazuya Kakizaki, Taiki Miyagawa, Jun Sakuma, 8 Oct 2025, Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG), https://arxiv.org/abs/2510.06719
Moonkyung Ryu, Chih-Wei Hsu, Yinlam Chow, Mohammad Ghavamzadeh, Craig Boutilier, 26 Sep 2025, Synthetic Dialogue Generation for Interactive Conversational Elicitation & Recommendation (ICER), https://arxiv.org/abs/2510.02331
He Du, Bowen Li, Aijun Yang, Siyang He, Qipeng Guo, Dacheng Tao, 20 Oct 2025, EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning, https://arxiv.org/abs/2510.17928
Harry Amad and Zhaozhi Qian and Dennis Frauen and Julianna Piskorz and Stefan Feuerriegel and Mihaela van der Schaar, 21 Oct 2025, Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference, https://arxiv.org/abs/2510.18768
Henrique de Lima Alexandre and Clodoaldo Aparecido de Moraes Lima, 3 Oct 2025, Synthetic EEG Generation using Diffusion Models for Motor Imagery Tasks, https://arxiv.org/abs/2510.17832
Pranav Sambhu, Om Guin, Madhav Sambhu, Jinho Cha, 20 Oct 2025, Curriculum Learning with Synthetic Data for Enhanced Pulmonary Nodule Detection in Chest Radiographs, https://arxiv.org/abs/2510.07681
Maria F. Davila R, Azizjon Turaev, Wolfram Wingerath, 25 Sep 2025, Measuring LLM Sensitivity in Transformer-based Tabular Data Synthesis, https://arxiv.org/abs/2509.20768
Bing Liu, Wenqiang Yv, Xuzheng Yang, Shichang Wang, Junzhuo Liu, Peng Wang, Guoqing Wang, Yang Yang, Heng Tao Shen, 25 Sep 2025, GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions, https://arxiv.org/abs/2509.21050
Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang, 25 Sep 2025, Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy, https://arxiv.org/abs/2509.21190
Hadley Black, Kasper Green Larsen, Arya Mazumdar, Barna Saha, Geelon So, 25 Sep 2025, Actively Learning Halfspaces without Synthetic Data, https://arxiv.org/abs/2509.20848
Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud, 25 Sep 2025, CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density, https://arxiv.org/abs/2509.18458
Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, Alexander Toshev, 29 Sep 2025, Scaling Synthetic Task Generation for Agents via Exploration, https://arxiv.org/abs/2509.25047
Mohammed Sabry, Anya Belz, 26 Sep 2025, What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?, https://arxiv.org/abs/2509.22947
Zi Liang, Qingqing Ye, Xuan Liu, Yanyun Wang, Jianliang Xu, Haibo Hu, 27 Sep 2025, Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data, https://arxiv.org/abs/2509.23041
Ting-Kang Wang, Yueh-Po Peng, Li Su, Vincent K.M. Cheung, 28 Sep 2025, VioPTT: Violin Technique-Aware Transcription from Synthetic Data Augmentation, https://arxiv.org/abs/2509.23759
Junsu Kim, Yunhoe Ku, Dongyoon Han, Seungryul Baek, 27 Sep 2025, Beyond Synthetic Replays: Turning Diffusion Features into Few-Shot Class-Incremental Learning Knowledge, https://arxiv.org/abs/2503.23402
Samarth Mishra, Kate Saenko, Venkatesh Saligrama, 28 Sep 2025, SCRAMBLe: Enhancing Multimodal LLM Compositionality with Synthetic Preference Data, https://arxiv.org/abs/2504.04740
Chen Qian, Haoyu Zhang, Junnan Ma, Liuhong Zhu, Qingrui Cai, Yu Wang, Ruibo Song, Lv Li, Lin Mei, Xianwang Jiang, Qin Xu, Boyu Jiang, Ran Tao, Chunmiao Chen, Shufang Chen, Dongyun Liang, Qiu Guo, Jianzhong Lin, Taishan Kang, Mengtian Lu, Liyuan Fu, Ruibin Huang, Huijuan Wan, Xu Huang, Jianhua Wang, Di Guo, Hai Zhong, Jianjun Zhou, Xiaobo Qu, 17 Oct 2025, Robust High-Resolution Multi-Organ Diffusion MRI Using Synthetic-Data-Tuned Prompt Learning, https://arxiv.org/abs/2510.15400
Hamin Koo and Jaehyung Kim, 17 Oct 2025, EMCee: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context, https://arxiv.org/abs/2503.05846
Youngjoon Lee, Seongmin Cho, Yehhyun Jo, Jinu Gong, Hyunjoo Jenny Lee, Joonhyuk Kang, 6 Oct 2025, Forecasting-Based Biomedical Time-series Data Synthesis for Open Data and Robust AI, https://arxiv.org/abs/2510.04622
Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R. Fung, Weizhu Chen, Minhao Cheng, Furu Wei, 5 Oct 2025, Scaling Laws of Synthetic Data for Language Models, https://arxiv.org/abs/2503.19551
Alexander Gill, Abhilasha Ravichander, Ana Marasović, 3 Oct 2025, What Has Been Lost with Synthetic Evaluation?, https://arxiv.org/abs/2505.22830
Hossein A. Rahmani, Varsha Ramineni, Emine Yilmaz, Nick Craswell, Bhaskar Mitra, 4 Oct 2025, Towards Understanding Bias in Synthetic Data for Evaluation, https://arxiv.org/abs/2506.10301
Dang Nguyen, Jiping Li, Jinghao Zheng, Baharan Mirzasoleiman, 6 Oct 2025, Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models, https://arxiv.org/abs/2505.21574
Ilyas Varshavskiy, Bonu Boboeva, Shuhrat Khalilbekov, Azizjon Azimi, Sergey Shulgin, Akhlitdin Nizamitdinov, Haitz Saez de Ocariz Borde, 10 Oct 2025, Mitigating Model Drift in Developing Economies Using Synthetic Data and Outliers, https://arxiv.org/abs/2510.09294
Muhammad Ali Shafique, Kanwal Mehreen, Muhammad Arham, Maaz Amjad, Sabur Butt, Hamza Farooq, 10 Oct 2025, Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation, https://arxiv.org/abs/2510.09051
Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Ranjay Krishna, 10 Oct 2025, SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding, https://arxiv.org/abs/2510.09110
Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna, 9 Oct 2025, Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training, https://arxiv.org/abs/2412.08221
Xiyuan Zhang, Danielle C. Maddix, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W. Mahoney, Cuixiong Hu, Huzefa Rangwala, George Karypis, Bernie Wang, 24 Oct 2025, Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models, https://arxiv.org/abs/2510.21204
Jens E. d'Hondt, Wieger R. Punter, Odysseas Papapetrou, 24 Oct 2025, Generative Correlation Manifolds: Generating Synthetic Data with Preserved Higher-Order Correlations, https://arxiv.org/abs/2510.21610
Massimiliano Ciranni, Vito Paolo Pastore, Roberto Di Via, Enzo Tartaglione, Francesca Odone, Vittorio Murino, 24 Oct 2025, Diffusing DeBias: Synthetic Bias Amplification for Model Debiasing, https://arxiv.org/abs/2502.09564
Parsa Rahimi, Sebastien Marcel, 24 Oct 2025, ScoreMix: Synthetic Data Generation by Score Composition in Diffusion Models Improves Recognition, https://arxiv.org/abs/2506.10226
Kai Zeng, Zhanqian Wu, Kaixin Xiong, Xiaobao Wei, Xiangyu Guo, Zhenxin Zhu, Kalok Ho, Lijun Zhou, Bohan Zeng, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wentao Zhang, 24 Oct 2025, Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks, https://arxiv.org/abs/2510.19195
Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, Liubov Nedoshivina, Pin-Yu Chen, Prasanna Sattigeri, Xiangliang Zhang, 10 Oct 2025, Building a Foundational Guardrail for General Agentic Systems via Synthetic Data, https://arxiv.org/abs/2510.09781
Md Ibrahim Shikder Mahin, Md Shamsul Arefin, Md Tanvir Hasan, 12 Oct 2025, A Hybrid Machine Learning Approach for Synthetic Data Generation with Post Hoc Calibration for Clinical Tabular Datasets, https://arxiv.org/abs/2510.10513
Hengyuan Zhang, Shiping Yang, Xiao Liang, Chenming Shang, Yuxuan Jiang, Chaofan Tao, Jing Xiong, Hayden Kwok-Hay So, Ruobing Xie, Angel X. Chang, Ngai Wong, 13 Oct 2025, Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation, https://arxiv.org/abs/2510.10925
Sneha Varur, Anirudh R Hanchinamani, Tarun S Bagewadi, Uma Mudenagudi, Chaitra D Desai, Sujata C, Padmashree Desai, Sumit Meharwade, 12 Oct 2025, DISC-GAN: Disentangling Style and Content for Cluster-Specific Synthetic Underwater Image Generation, https://arxiv.org/abs/2510.10782
Joshua Niemeijer, Jan Ehrhardt, Heinz Handels, Hristina Uzunova, 13 Oct 2025, Uncertainty-Aware ControlNet: Bridging Domain Gaps with Synthetic Image Generation, https://arxiv.org/abs/2510.11346
David Benavente-Rios, Juan Ruiz Rodriguez, Gustavo Gatica, 10 Oct 2025, Exploration of Incremental Synthetic Non-Morphed Images for Single Morphing Attack Detection, https://arxiv.org/abs/2510.09836
Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso, 11 Oct 2025, InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques, https://arxiv.org/abs/2407.14494
Nupur Kumari, Xi Yin, Jun-Yan Zhu, Ishan Misra, Samaneh Azadi, 13 Oct 2025, Generating Multi-Image Synthetic Data for Text-to-Image Customization, https://arxiv.org/abs/2502.01720
Rongchao Xu, Kunlin Cai, Lin Jiang, Dahai Yu, Zhiqing Hong, Yuan Tian, Guang Wang, 9 Oct 2025, GeoGen: A Two-stage Coarse-to-Fine Framework for Fine-grained Synthetic Location-based Social Network Trajectory Generation, https://arxiv.org/abs/2510.07735
Jannek Ulm, Kevin Du, Vésteinn Snæbjarnarson, 9 Oct 2025, Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling, https://arxiv.org/abs/2510.08245
Amitis Shidani, Tyler Farghly, Yang Sun, Habib Ganjgahi, George Deligiannidis, 9 Oct 2025, Beyond Real Data: Synthetic Data through the Lens of Regularization, https://arxiv.org/abs/2510.08095
Parham Rezaei, Filip Kovacevic, Francesco Locatello, Marco Mondelli, 9 Oct 2025, High-dimensional Analysis of Synthetic Data Selection, https://arxiv.org/abs/2510.08123
Zhuoyi Huang, Nutan Sahoo, Anamika Kumari, Girish Kumar, Kexuan Cai, Shixing Cao, Yue Kang, Tian Xia, Somya Chatterjee, Nicholas Hausman, Aidan Jay, Eric S. Rosenthal, Soundar Srinivasan, Sadid Hasan, Alex Fedorov, Sulaiman Vesal, 9 Oct 2025, High-Fidelity Synthetic ECG Generation via Mel-Spectrogram Informed Diffusion Training, https://arxiv.org/abs/2510.05492
Rachel Chung, Pratyush Nidhi Sharma, Mikko Siponen, Rohit Vadodaria, Luke Smith, 23 Sep 2025, Hybrid Data can Enhance the Utility of Synthetic Data for Training Anti-Money Laundering Models, https://arxiv.org/abs/2509.18499
Haoyu Wang, Fengze Liu, Jiayao Zhang, Dan Roth, Kyle Richardson, 16 Sep 2025, Event Causality Identification with Synthetic Control, https://arxiv.org/abs/2509.18156
Zhipei Xu, Xuanyu Zhang, Qing Huang, Xing Zhou, Jian Zhang, 23 Sep 2025, AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection, https://arxiv.org/abs/2505.15173
Mahmoud Ibrahim, Bart Elen, Chang Sun, Gökhan Ertaylan, Michel Dumontier, 22 Oct 2025, Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series, https://arxiv.org/abs/2510.19728
Lawrence Phillips, Marc Boubnovski Martell, Aditya Misra, Josefa Lia Stoisser, Cesar A. Prada-Medina, Rory Donovan-Maiye, Kaspar Märtens, 29 Sep 2025, SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction, https://arxiv.org/abs/2509.25346
Hasan Alp Caferoğlu, Mehmet Serhat Çelik, Özgür Ulusoy, 30 Sep 2025, SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation, https://arxiv.org/abs/2509.25672
Zihao Zhao, Anjalie Field, 30 Sep 2025, Controlled Generation for Private Synthetic Text, https://arxiv.org/abs/2509.25729
Chenhua Shi, Gregor Macdonald, Bhavika Jalli, Wanlu Lei, John Zou, Mridul Jain, Joji Philip, 30 Sep 2025, Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications, https://arxiv.org/abs/2509.25736
Kyeongryeol Go, 30 Sep 2025, Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis, https://arxiv.org/abs/2509.26158
Josefa Lia Stoisser, Lawrence Phillips, Aditya Misra, Tom A. Lamb, Philip Torr, Marc Boubnovski Martell, Julien Fauqueur, Kaspar Märtens, 7 Oct 2025, Towards Label-Free Biological Reasoning Synthetic Dataset Creation via Uncertainty Filtering, https://arxiv.org/abs/2510.05871
Muskaan Chopra, Lorenz Sparrenberg, Rafet Sifa, 1 Oct 2025, SynCED-EnDe 2025: A Synthetic and Curated English - German Dataset for Critical Error Detection in Machine Translation, https://arxiv.org/abs/2510.05144
Sara Mandelli, Diego Vila-Portela, David Vázquez-Padín, Paolo Bestagini, Fernando Pérez-González, 7 Oct 2025, Beyond Spectral Peaks: Interpreting the Cues Behind Synthetic Image Detection, https://arxiv.org/abs/2510.05633
Maria-Teresa De Rosa Palmini, Eva Cetinic, 18 May 2025, Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models, https://arxiv.org/abs/2505.17064
Pu Yang, Yunzhen Feng, Ziyuan Chen, Yuhang Wu, Zhuoyuan Li, 16 Oct 2025, Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping, https://arxiv.org/abs/2501.18962