Aussie AI

Synthetic Data

  • Last Updated 22 October, 2025
  • by David Spuler, Ph.D.

What is Synthetic Data?

Synthetic data is computer-generated text used for LLM training. One approach is to generate entirely new training data, such as by using output from another LLM. Another approach is to "augment" existing training data, such as by using synonymization, which swaps words for their synonyms to create slightly different versions of the training set.
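As a rough illustration of the synonymization idea above, here is a minimal Python sketch. The synonym table and the function names (synonymize, augment) are hypothetical and exist only for this example; a practical pipeline would more likely draw synonyms from a thesaurus or from an LLM, and would need to handle punctuation and capitalization.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "big": ["large", "huge"],
    "happy": ["glad", "cheerful"],
}

def synonymize(sentence, rng, prob=1.0):
    """Return a copy of sentence with known words swapped for a random synonym.

    Words are split on whitespace, so punctuation attached to a word
    prevents a match, and replaced words come out in lowercase.
    """
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

def augment(dataset, copies=2, seed=0):
    """Return the original sentences plus `copies` synonymized variants of each."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for sentence in dataset:
        for _ in range(copies):
            augmented.append(synonymize(sentence, rng))
    return augmented

if __name__ == "__main__":
    for line in augment(["the quick dog is happy"]):
        print(line)
```

Each variant keeps the sentence structure and changes only the vocabulary, so variants of the same sentence can be identical when the random choices coincide. Deduplicating the output would be a reasonable next step.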

Research on Synthetic Data

Research papers include:

  • Skurzhanskyi, O.H., Marchenko, O.O. & Anisimov, A.V., 2024, Specialized Pre-Training of Neural Networks on Synthetic Data for Improving Paraphrase Generation. Cybern Syst Anal 2024 https://doi.org/10.1007/s10559-024-00658-7 https://link.springer.com/article/10.1007/s10559-024-00658-7
  • Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, 29 Jan 2024, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, https://arxiv.org/abs/2401.16380
  • André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster, 4 Jan 2024, Comprehensive Exploration of Synthetic Data Generation: A Survey https://arxiv.org/abs/2401.02524
  • Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
  • David Spuler, March 2024, Chapter 45. Knowledge Distillation, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • A Gudibande, E Wallace, C Snell, X Geng, H Liu 2023, The false promise of imitating proprietary llms, https://arxiv.org/abs/2305.15717
  • Y Wang, W Zhong, L Li, F Mi, X Zeng, W Huang 2023, Aligning large language models with human: A survey, https://arxiv.org/abs/2307.12966
  • Y Gu, L Dong, F Wei, M Huang, 2023, Knowledge Distillation of Large Language Models, https://arxiv.org/abs/2306.08543
  • X Wan, R Sun, H Dai, SO Arik, T Pfister, 2023, Better zero-shot reasoning with self-adaptive prompting, https://arxiv.org/abs/2305.14106
  • S Horawalavithana, S Munikoti, I Stewart, 2023, SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions, https://arxiv.org/abs/2307.01139
  • X Daull, P Bellot, E Bruno, V Martin, 2023, Complex QA and language models hybrid architectures, Survey, https://arxiv.org/abs/2302.09051
  • Z Yuan, J Liu, Q Zi, M Liu, X Peng, Y Lou, 2023, Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation, https://arxiv.org/abs/2308.01240
  • W AlShikh, M Daaboul, K Goddard, B Imel, 2023, Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning, https://arxiv.org/abs/2307.03692
  • Z He, Z Xie, R Jha, H Steck, D Liang, Y Feng, 2023, Large Language Models as Zero-Shot Conversational Recommenders, https://arxiv.org/abs/2308.10053
  • NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
  • Michael Nuñez, July 18, 2024, Groq’s open-source Llama AI model tops leaderboard, outperforming GPT-4o and Claude in function calling, https://venturebeat.com/ai/groq-open-source-llama-ai-model-tops-leaderboard-outperforming-gpt-4o-and-claude-in-function-calling/
  • Louie Peters, Aug 27, 2024, Two Paths to Small LMs? Synthetic Data (Phi 3.5) vs Pruning & Distillation (Llama-3.1-Minitron), https://newsletter.towardsai.net/p/114-two-paths-to-small-lms-synthetic
  • Aatish Bhatia, Aug. 25, 2024, When A.I.’s Output Is a Threat to A.I. Itself: As A.I.-generated data becomes harder to detect, it’s increasingly likely to be ingested by future A.I., leading to worse results, NY Times, https://www.nytimes.com/interactive/2024/08/26/upshot/ai-synthetic-data.html
  • Shumailov, I., Shumaylov, Z., Zhao, Y. et al. 2024, AI models collapse when trained on recursively generated data. Nature 631, 755–759. https://doi.org/10.1038/s41586-024-07566-y https://www.nature.com/articles/s41586-024-07566-y
  • Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, Gauthier Gidel, 12 Jun 2024, Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences, https://arxiv.org/abs/2407.09499
  • Ryan McNeal, Aug 27, 2024, ChatGPT and GPT-4 could get a sweet upgrade this fall with 'strawberry', https://www.androidauthority.com/openai-strawberry-ai-3475682/
  • Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai, 10 Aug 2024 (v2), Best Practices and Lessons Learned on Synthetic Data, https://arxiv.org/abs/2404.07503
  • Georgia Argyro, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou, 10 Sep 2024, Prompt2Fashion: An automatically generated fashion dataset, https://arxiv.org/abs/2409.06442
  • Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli, 12 Sep 2024, Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources, https://arxiv.org/abs/2409.08239
  • Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, Mehran Kazemi, 29 Aug 2024, Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, https://arxiv.org/abs/2408.16737
  • Ulyana Piterbarg, Lerrel Pinto, Rob Fergus, 3 Oct 2024, Training Language Models on Synthetic Edit Sequences Improves Code Synthesis, https://arxiv.org/abs/2410.02749
  • Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang, 16 Oct 2024, A Survey on Data Synthesis and Augmentation for Large Language Models, https://arxiv.org/abs/2410.12896
  • Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He, 23 Oct 2024, SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains, https://arxiv.org/abs/2410.17952
  • Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
  • Arindam Mitra, Ahmed Awadallah, Yash Lara, November 14, 2024, Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators/
  • Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig, 4 Dec 2024, Evaluating Language Models as Synthetic Data Generators, https://arxiv.org/abs/2412.03679
  • Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
  • Xiang Huang, Jiayu Shen, Shanshan Huang, Sitao Cheng, Xiaxia Wang, Yuzhong Qu, 27 Dec 2024, TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data, https://arxiv.org/abs/2412.19544
  • Sebastian Raschka, PhD, Jan 15, 2025, Noteworthy AI Research Papers of 2024 (Part Two). Six influential AI papers from July to December, https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2 (Examines multimodal Llama 3 models and the different multimodal architectures.)
  • FZ Subah, Oct 2025, Mitigating and Assessing Bias and Fairness in Large Language Model-Generated Synthetic Tabular Data, Masters Thesis, Department of Engineering, University of Cambridge, https://www.mlmi.eng.cam.ac.uk/files/2023-2024/fzs21_mitigating_2024.pdf
  • Chetan Harsha, Karmvir Singh Phogat, Sridhar Dasaratha, Sai Akhil Puranam, Shashishekar Ramakrishna, Jan 2025, Synthetic Data Generation Using Large Language Models for Financial Question Answering, Proceedings of the Joint Workshop of the 9th FinNLP, the 6th FNP, and the 1st LLMFinLegal, pages 76–95 January 19–20, 2025, Association for Computational Linguistics, https://aclanthology.org/2025.finnlp-1.7.pdf
  • Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen, 25 Jan 2025, LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion, https://arxiv.org/abs/2501.15089
  • Minsang Kim, Seungjun Baek, 6 Feb 2025, Syntriever: How to Train Your Retriever with Synthetic Data from LLMs, https://arxiv.org/abs/2502.03824
  • Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang, 13 Feb 2025, Logical Reasoning in Large Language Models: A Survey, https://arxiv.org/abs/2502.09100
  • Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen, 18 Feb 2025, Theorem Prover as a Judge for Synthetic Data Generation, https://arxiv.org/abs/2502.13137
  • Maria Korolov, Jun 25, 2025, 7 ways synthetic data creates business value, https://www.cio.com/article/4003262/7-ways-synthetic-data-creates-business-value.html
  • Ali Zolnour, Hossein Azadmaleki, Yasaman Haghbin, Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sina Rashidi, Masoud Khani, AmirSajjad Taleban, Samin Mahdizadeh Sani, Maryam Dadkhah, James M. Noble, Suzanne Bakken, Yadollah Yaghoobzadeh, Abdol-Hossein Vahabie, Masoud Rouhizadeh, Maryam Zolnoori, 8 Aug 2025, LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data, https://arxiv.org/abs/2508.10027
  • Nitin Rai, Nathan S. Boyd, Gary E. Vallad, Arnold W. Schumann, 13 Aug 2025, Improving watermelon (Citrullus lanatus) disease classification with generative artificial intelligence (GenAI)-based synthetic and real-field images via a custom EfficientNetV2-L model, https://arxiv.org/abs/2508.10156
  • Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian, 14 Aug 2025, Measuring Diversity in Synthetic Datasets, https://arxiv.org/abs/2502.08512
  • Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng, 22 Jul 2025, Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation, https://arxiv.org/abs/2507.17066
  • Álvaro Ruiz-Ródenas, Jaime Pujante Sáez, Daniel García-Algora, Mario Rodríguez Béjar, Jorge Blasco and José Luis Hernández-Ramos, 21 Jul 2025, SynthCTI: LLM-Driven Synthetic CTI Generation to enhance MITRE Technique Mapping, https://arxiv.org/abs/2507.16852
  • Rishemjit Kaur, Arshdeep Singh Bhankhar, Surangika Ranathunga, Jashanpreet Singh Salh, Sudhir Rajput, Vidhi, Kashish Mahendra, Bhavika Berwal, Ritesh Kumar, 22 Jul 2025, Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain, https://arxiv.org/abs/2507.16974
  • Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong, 22 Jul 2025, More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment, https://arxiv.org/abs/2504.02193
  • Shreya Saxena, Siva Prasad, Zishan Ahmad, Vishal Vaddina, 22 Jul 2025, ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training, https://arxiv.org/abs/2507.16478
  • Ivona Krchova, Michael Platzer, Paul Tiwald, 22 Jul 2025, Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling, https://arxiv.org/abs/2507.16419
  • Alireza Dizaji, Benedict Aaron Tjandra, Mehrab Hamidi, Shenyang Huang, Guillaume Rabusseau, 22 Jul 2025, T-GRAB: A Synthetic Diagnostic Benchmark for Learning on Temporal Graphs, https://arxiv.org/abs/2507.10183
  • Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim, 24 Jul 2025, Synthetic Data Generation for Phrase Break Prediction with Large Language Model, https://arxiv.org/abs/2507.18044
  • Basel Alshaikhdeeb, Ahmed Abdelmonem Hemedan, Soumyabrata Ghosh, Irina Balaur, and Venkata Satagopam, 24 Jul 2025, Generation of Synthetic Clinical Text: A Systematic Review, https://arxiv.org/abs/2507.18451
  • Zhengyun Zhao, Huaiyuan Ying, Yue Zhong, Sheng Yu, 24 Jul 2025, DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data, https://arxiv.org/abs/2507.18583
  • Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim, 24 Jul 2025, SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning, https://arxiv.org/abs/2507.18616
  • Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim, 24 Jul 2025, SIDA: Synthetic Image Driven Zero-shot Domain Adaptation, https://arxiv.org/abs/2507.18632
  • Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng, 24 Jul 2025, Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs, https://arxiv.org/abs/2507.18055
  • Yefeng Yuan, Yuhong Liu, Liang Cheng, 24 Jul 2025, A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models, https://arxiv.org/abs/2404.14445
  • Gregor Baer, Isel Grau, Chao Zhang, Pieter Van Gorp, 24 Jul 2025, Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation, https://arxiv.org/abs/2506.11790
  • Keito Inoshita, Rushia Harada, 15 Jul 2025, Persona-Based Synthetic Data Generation Using Multi-Stage Conditioning with Large Language Models for Emotion Recognition, https://arxiv.org/abs/2507.13380
  • Junsu Kim, Yunhoe Ku, Seungryul Baek, 18 Jul 2025, Can Synthetic Images Conquer Forgetting? Beyond Unexplored Doubts in Few-Shot Class-Incremental Learning, https://arxiv.org/abs/2507.13739
  • Matthew A. Chan, Casey J. Pellizzari, Christopher A. Metzler, 17 Jul 2025, Inverse Synthetic Aperture Fourier Ptychography, https://arxiv.org/abs/2507.03733
  • Claudio Giusti, Luca Guarnera, Mirko Casu, Sebastiano Battiato, 19 Jul 2025, Fraud is Not Just Rarity: A Causal Prototype Attention Approach to Realistic Synthetic Oversampling, https://arxiv.org/abs/2507.14706
  • Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro, 21 Jul 2025, FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs, https://arxiv.org/abs/2507.15839
  • Pan Peng, Hangyu Xu, 20 Jul 2025, Differentially Private Synthetic Graphs Preserving Triangle-Motif Cuts, https://arxiv.org/abs/2507.14835
  • Zijian Ding, Tung Nguyen, Weikai Li, Aditya Grover, Yizhou Sun, Jason Cong, 19 Jul 2025, Iceberg: Enhancing HLS Modeling with Synthetic Data, https://arxiv.org/abs/2507.09948
  • Rohit Kundu, Shan Jia, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury, 19 Jul 2025, TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data, https://arxiv.org/abs/2503.15867
  • Yewon Byun, Shantanu Gupta, Zachary C. Lipton, Rachel Leah Childers, Bryan Wilder, 8 Aug 2025, Using Imperfect Synthetic Data in Downstream Inference Tasks, https://arxiv.org/abs/2508.06635
  • Andrey Sidorenko and Paul Tiwald, 8 Aug 2025, Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN, https://arxiv.org/abs/2508.06647
  • Sabrina Namazova, Alessandra Brondetta, Younes Strittmatter, Matthew Nassar, Sebastian Musslick, 11 Aug 2025, Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant, https://arxiv.org/abs/2508.07887
  • Raunak Narwal and Syed Abbas, 10 Aug 2025, BIGBOY1.2: Generating Realistic Synthetic Data for Disease Outbreak Modelling and Analytics, https://arxiv.org/abs/2508.07239
  • Ethan Lo and Dan C. Lo, 18 Jul 2025, Exoplanet Detection Using Machine Learning Models Trained on Synthetic Light Curves, https://arxiv.org/abs/2507.19520
  • Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Zexue He, Shafiq Abedin, Jennifer Sun, Ben Wiesel, Eli Schwartz, Ahmed Nassar, Bo Wu, Assaf Arbelle, Aude Oliva, Dan Gutfreund, Leonid Karlinsky, Rogerio Feris, 31 May 2025, ChartGen: Scaling Chart Understanding Via Code-Guided Synthetic Chart Generation, https://arxiv.org/abs/2507.19492
  • Tao Lian, Jose L. Gómez, Antonio M. López, 26 Jul 2025, FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving, https://arxiv.org/abs/2507.19881
  • Pavel Korshunov, Ketan Kotwal, Christophe Ecabert, Vidit Vidit, Amir Mohammadi, and Sebastien Marcel, 28 Jul 2025, Investigation of Accuracy and Bias in Face Recognition Trained with Synthetic Data, https://arxiv.org/abs/2507.20782
  • Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, 25 Jul 2025, Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task, https://arxiv.org/abs/2310.09336
  • Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 28 Jul 2025, Explainable Synthetic Image Detection through Diffusion Timestep Ensembling, https://arxiv.org/abs/2503.06201
  • Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz, 28 Jul 2025, StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation, https://arxiv.org/abs/2507.21340
  • Yida Tao, Yen-Chia Hsu, 29 Jul 2025, Bridging Synthetic and Real-World Domains: A Human-in-the-Loop Weakly-Supervised Framework for Industrial Toxic Emission Segmentation, https://arxiv.org/abs/2507.22002
  • Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu, 31 Jul 2025, CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks, https://arxiv.org/abs/2507.23751
  • Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen, 31 Jul 2025, Continual Learning with Synthetic Boundary Experience Blending, https://arxiv.org/abs/2507.23534
  • Jessica Bader, Leander Girrbach, Stephan Alaniz, Zeynep Akata, 31 Jul 2025, SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions, https://arxiv.org/abs/2507.23784
  • Patricia A. Apellániz and Ana Jiménez and Borja Arroyo Galende and Juan Parras and Santiago Zazo, 31 Jul 2025, Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios, https://arxiv.org/abs/2407.03080
  • Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg, 30 Jul 2025, Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning, https://arxiv.org/abs/2502.13820
  • Georgi Ganev and Meenatchi Sundaram Muthu Selva Annamalai and Sofiane Mahiou and Emiliano De Cristofaro, 29 Jul 2025, The Importance of Being Discrete: Measuring the Impact of Discretization in End-to-End Differentially Private Synthetic Data, https://arxiv.org/abs/2504.06923
  • Tom Or and Omri Azencot (Ben Gurion University of the Negev), 1 Aug 2025, Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics, https://arxiv.org/abs/2508.00784
  • Ivona Krchova, Mariana Vargas Vieyra, Mario Scriminaci, Andrey Sidorenko, 1 Aug 2025, Democratizing Tabular Data Access with an Open–Source Synthetic–Data SDK, https://arxiv.org/abs/2508.00718
  • Jianwei Wang, Ziming Wu, Fuming Lai, Shaobing Lian, Ziqian Zeng, 1 Aug 2025, SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought, https://arxiv.org/abs/2508.00574
  • Abdulmajid Murad, Massimiliano Ruocco, 4 Aug 2025, Pre-Tactical Flight-Delay and Turnaround Forecasting with Synthetic Aviation Data, https://arxiv.org/abs/2508.02294
  • Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Cyril Rakovski, Frank Rudzicz, 2 Aug 2025, MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs, https://arxiv.org/abs/2508.01401
  • Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou, 5 Aug 2025, Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation, https://arxiv.org/abs/2508.03117
  • Océane Doremus, Ariel Guerra-Adames, Marta Avalos-Fernandez, Vianney Jouhet, Cédric Gil-Jardiné, Emmanuel Lagarde, 4 Aug 2025, Synthetic medical data generation: state of the art and application to trauma mechanism classification, https://arxiv.org/abs/2508.02771
  • Shifeng Xie, Vasilii Feofanov, Marius Alonso, Ambroise Odonnat, Jianfeng Zhang, Themis Palpanas, and Ievgen Redko, 4 Aug 2025, CauKer: classification time series foundation models can be pretrained on synthetic data only, https://arxiv.org/abs/2508.02879
  • Yongyi Wang, Lingfeng Li, Bozhou Chen, Ang Li, Hanyu Liu, Qirui Zheng, Xionghui Yang, Wenxin Li, 6 Aug 2025, Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling, https://arxiv.org/abs/2508.04282
  • George Bredis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov, 6 Aug 2025, Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success, https://arxiv.org/abs/2508.04280
  • Mohd Ashhad and Ricardo Henao, 5 Aug 2025, Generating Accurate Synthetic Survival Data by Conditioning on Outcomes, https://arxiv.org/abs/2405.17333
  • Yunbo Long, Liming Xu, Alexandra Brintrup, 7 Aug 2025, LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion, https://arxiv.org/abs/2503.02161
  • Ingo Ziegler, Abdullatif Köksal, Desmond Elliott, Hinrich Schütze, 6 Aug 2025, CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation, https://arxiv.org/abs/2409.02098
  • Alejandro Moreno R., Desale Fentaw, Samuel Palmer, Raúl Salles de Padua, Ninad Dixit, Samuel Mugel, Roman Orús, Manuel Radons, Josef Menter, and Ali Abedi, 8 Aug 2025, Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS), https://arxiv.org/abs/2508.06251
  • Ojonugwa Oluwafemi Ejiga Peter, Akingbola Oluwapemiisin, Amalahu Chetachi, Adeniran Opeyemi, Fahmi Khalifa, and Md Mahmudur Rahman, 8 Aug 2025, Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation, https://arxiv.org/abs/2508.06170
  • Pavitra Chauhan, Mohsen Gamal Saad Askar, Kristian Svendsen, Bjørn Fjukstad, Brita Elvevåg, Lars Ailo Bongo, Edvard Pedersen, 8 Aug 2025, From research to clinic: Accelerating the translation of clinical decision support systems by making synthetic data interoperable, https://arxiv.org/abs/2308.02613
  • Shayan Alahyari, Mike Domaratzki, 8 Aug 2025, SMOGAN: Synthetic Minority Oversampling with GAN Refinement for Imbalanced Regression, https://arxiv.org/abs/2504.21152
  • Arshia Ilaty, Hossein Shirazi, Hajar Homayouni, 11 Aug 2025, SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering, https://arxiv.org/abs/2508.08529
  • Audrey Poinsot, Panayiotis Panayiotou, Alessandro Leite, Nicolas Chesneau, Özgür Şimşek, Marc Schoenauer, 12 Aug 2025, Position: Causal Machine Learning Requires Rigorous Synthetic Experiments for Broader Adoption, https://arxiv.org/abs/2508.08883
  • Farah Atif, Nursultan Askarbekuly, Kareem Darwish, Monojit Choudhury, 4 Aug 2025, Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions, https://arxiv.org/abs/2508.08287
  • Vibeke Binz Vallevik, Anne Kjersti C. Befring, Severin Elvatun and Jan Franz Nygaard, 11 Aug 2025, Processing of synthetic data in AI development for healthcare and the definition of personal data in EU law, https://arxiv.org/abs/2508.08353
  • Aydin Zaboli and Junho Hong, 12 Aug 2025, Generative AI for Critical Infrastructure in Smart Grids: A Unified Framework for Synthetic Data Generation and Anomaly Detection, https://arxiv.org/abs/2508.08593
  • Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić, 12 Aug 2025, Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions, https://arxiv.org/abs/2502.13135
  • Min Tang, Peng Lu, Qing Feng, 6 Aug 2025, Generating Feasible and Diverse Synthetic Populations Using Diffusion Models, https://arxiv.org/abs/2508.09164
  • Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, Conghui He, Weijia Li, 13 Aug 2025, Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation, https://arxiv.org/abs/2508.09987
  • Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun, 13 Aug 2025, Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning, https://arxiv.org/abs/2505.16483
  • Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt, 14 Aug 2025, BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining, https://arxiv.org/abs/2508.10975
  • Liam Chalcroft and Ioannis Pappas and Cathy J. Price and John Ashburner, 15 Aug 2025, Synthetic Data for Robust Stroke Segmentation, https://arxiv.org/abs/2404.01946
  • Nitish Nagesh, Salar Shakibhamedan, Mahdi Bagheri, Ziyu Wang, Nima TaheriNejad, Axel Jantsch, Amir M. Rahmani, 15 Aug 2025, FairTabGen: Unifying Counterfactual and Causal Fairness in Synthetic Tabular Data Generation, https://arxiv.org/abs/2508.11810
  • Jonas van Elburg, Peter van der Putten, Maarten Marx, 15 Aug 2025, Can we Evaluate RAGs with Synthetic Data?, https://arxiv.org/abs/2508.11758
  • Ahmet H. Güzel, Ilija Bogunovic, Jack Parker-Holder, 17 Aug 2025, Synthetic Data is Sufficient for Zero-Shot Visual Generalization from Offline Data, https://arxiv.org/abs/2508.12356
  • Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, Yulia Tsvetkov, 17 Aug 2025, Generalizable LLM Learning of Graph Synthetic Data with Post-training Alignment, https://arxiv.org/abs/2506.00845
  • Matey Krastev, Miklos Hamar, Danilo Toapanta, Jesse Brouwers, Yibin Lei, 19 Aug 2025, InPars+: Supercharging Synthetic Data Generation for Information Retrieval Systems, https://arxiv.org/abs/2508.13930
  • Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti, 19 Aug 2025, POPri: Private Federated Learning using Preference-Optimized Synthetic Data, https://arxiv.org/abs/2504.16438
  • Suleyman Olcay Polat, Poli A. Nemkova, Mark V. Albert, 20 Aug 2025, Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method, https://arxiv.org/abs/2508.14783
  • Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban, 20 Aug 2025, Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference, https://arxiv.org/abs/2508.14735
  • Gaston Gustavo Rios, 20 Aug 2025, HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation, https://arxiv.org/abs/2508.14345
  • Saptarshi Neil Sinha and P. Julius Kuehn and Johannes Koppe and Arjan Kuijper and Michael Weinmann, 20 Aug 2025, Neural Restoration of Greening Defects in Historical Autochrome Photographs Based on Purely Synthetic Data, https://arxiv.org/abs/2505.22291
  • Bidyapati Pradhan, Surajit Dasgupta, Amit Kumar Saha, Omkar Anustoop, Sriram Puttagunta, Vipul Mittal, Gopal Sarda, 21 Aug 2025, GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO, https://arxiv.org/abs/2508.15432
  • Jan Kapar, Kathrin Günther, Lori Ann Vallis, Klaus Berger, Nadine Binder, Hermann Brenner, Stefanie Castell, Beate Fischer, Volker Harth, Bernd Holleczek, Timm Intemann, Till Ittermann, André Karch, Thomas Keil, Lilian Krist, Berit Lange, Michael F. Leitzmann, Katharina Nimptsch, Nadia Obi, Iris Pigeot, Tobias Pischon, Tamara Schikowski, Börge Schmidt, Carsten Oliver Schmidt, Anja M. Sedlmair, Justine Tanoey, Harm Wienbergen, Andreas Wienke, Claudia Wigmann and Marvin N. Wright, 19 Aug 2025, Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI, https://arxiv.org/abs/2508.14936
  • Juntao Tan, Liangwei Yang, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Tulika Manoj Awalgaonkar, Jianguo Zhang, Weiran Yao, Ming Zhu, Shirley Kokane, Silvio Savarese, Huan Wang, Caiming Xiong, Shelby Heinecke, 20 Aug 2025, PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data, https://arxiv.org/abs/2502.20616
  • Arefeh Kazemi and Sri Balaaji Natarajan Kalaivendan and Joachim Wagner and Hamza Qadeer and Kanishk Verma and Brian Davis, 20 Aug 2025, Synthetic vs. Gold: The Role of LLM Generated Labels and Data in Cyberbullying Detection, https://arxiv.org/abs/2502.15860
  • Weijie Niu, Alberto Huertas Celdran, Karoline Siarsky, Burkhard Stiller, 22 Aug 2025, FEST: A Unified Framework for Evaluating Synthetic Tabular Data, https://arxiv.org/abs/2508.16254
  • Seyedali Mohammadi, Manas Paldhe, Amit Chhabra, 13 Aug 2025, LingVarBench: Benchmarking LLM for Automated Named Entity Recognition in Structured Synthetic Spoken Transcriptions, https://arxiv.org/abs/2508.15801
  • Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie, 21 Aug 2025, Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset, https://arxiv.org/abs/2508.15986
  • Mika Leo Hube, Filip Lemic, Ethungshan Shitiri, Gerard Calvo Bartra, Sergi Abadal, Xavier Costa Pérez, 22 Aug 2025, Set Transformer Architectures and Synthetic Data Generation for Flow-Guided Nanoscale Localization, https://arxiv.org/abs/2508.16200
  • Stefania L. Moroianu, Christian Bluethgen, Pierre Chambon, Mehdi Cherti, Jean-Benoit Delbrouck, Magdalini Paschali, Brandon Price, Judy Gichoya, Jenia Jitsev, Curtis P. Langlotz, Akshay S. Chaudhari, 22 Aug 2025, Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data, https://arxiv.org/abs/2508.16783
  • Pedro Antonio Rabelo Saraiva, Enzo Ferreira de Souza, Joao Manoel Herrera Pinheiro, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker, 24 Aug 2025, A Synthetic Dataset for Manometry Recognition in Robotic Applications, https://arxiv.org/abs/2508.17468
  • Weikang Wan, Jiawei Fu, Xiaodi Yuan, Yifeng Zhu, Hao Su, 24 Aug 2025, LodeStar: Long-horizon Dexterity via Synthetic Data Augmentation from Human Demonstrations, https://arxiv.org/abs/2508.17547
  • Rishikesh Devanathan, Varun Nathan, Ayush Kumar, 25 Aug 2025, Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation, https://arxiv.org/abs/2508.18210
  • Melissa Kazemi Rad, Alberto Purpura, Himanshu Kumar, Emily Chen, Mohammad Shahed Sorower, 23 Aug 2025, GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection, https://arxiv.org/abs/2508.17057
  • Chenhao Xue, Yuanzhe Jin, Adrian Carrasco-Revilla, Joyraj Chakraborty, Min Chen, 4 Aug 2025, AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification, https://arxiv.org/abs/2508.10000
  • Amirmohammad Farzaneh, Matteo Zecchin, Osvaldo Simeone, 4 Sep 2025, Synthetic Counterfactual Labels for Efficient Conformal Counterfactual Inference, https://arxiv.org/abs/2509.04112
  • Chanon Puttanawarut, Natcha Fongsrisin, Porntep Amornritvanich, Cholatid Ratanatharathorn, Panu Looareesuwan, 4 Sep 2025, Synthetic Survival Data Generation for Heart Failure Prognosis Using Deep Generative Models, https://arxiv.org/abs/2509.04245
  • Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych, 4 Sep 2025, MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions, https://arxiv.org/abs/2509.04183
  • Mollie Shichman, Claire Bonial, Austin Blodgett, Taylor Hudson, Francis Ferraro, Rachel Rudinger, 3 Sep 2025, FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response, https://arxiv.org/abs/2502.18452
  • Seganrasan Subramanian, Abhigya Verma, 4 Sep 2025, Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation, https://arxiv.org/abs/2509.01185
  • Matías Pizarro, Mike Laszkiewicz, Shawkat Hesso, Dorothea Kolossa, Asja Fischer, 4 Sep 2025, Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints, https://arxiv.org/abs/2411.14013
  • Yogev Cohen, Dudi Ohayon, Romy Somkin, Yehudit Aperstein, Alexander Apartsin, 5 Sep 2025, Code Review Without Borders: Evaluating Synthetic vs. Real Data for Review Recommendation, https://arxiv.org/abs/2509.04810
  • Alpana Dubey, Suma Mani Kuriakose, Nitish Bhardwaj, 5 Sep 2025, SynGen-Vision: Synthetic Data Generation for training industrial vision models, https://arxiv.org/abs/2509.04894
  • Kellen Tan Cheng, Anna Lisa Gentile, Chad DeLuca, Guang-Jie Ren, 25 Aug 2025, Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails, https://arxiv.org/abs/2508.18384
  • Ilias Driouich, Hongliu Cao, Eoin Thomas, 26 Aug 2025, Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework, https://arxiv.org/abs/2508.18929
  • Dawei Li, Yue Huang, Ming Li, Tianyi Zhou, Xiangliang Zhang, Huan Liu, 27 Aug 2025, Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era, https://arxiv.org/abs/2508.19570
  • Zhan Shi, Yefeng Yuan, Yuhong Liu, Liang Cheng, Yi Fang, 25 Aug 2025, RL-Finetuned LLMs for Privacy-Preserving Synthetic Rewriting, https://arxiv.org/abs/2508.19286
  • Michael Nidd, Christoph Miksovic, Thomas Gschwind, Francesco Fusco, Andrea Giovannini, Ioana Giurgiu, 27 Aug 2025, Bootstrapping Learned Cost Models with Synthetic SQL Queries, https://arxiv.org/abs/2508.19807
  • Jingze Zhang, Jiahe Qian, Yiliang Zhou, Yifan Peng, 28 Aug 2025, Enhancing Health Fact-Checking with LLM-Generated Synthetic Data, https://arxiv.org/abs/2508.20525
  • Sang Su Lee, Vineeth Loganathan, and Vijay Raghavan, 28 Aug 2025, Dynamic Synthetic Controls vs. Panel-Aware Double Machine Learning for Geo-Level Marketing Impact Estimation, https://arxiv.org/abs/2508.20335
  • Yijia Guo, Junqing Zhang, Y.-W. Peter Hong, 28 Aug 2025, Practical Physical Layer Authentication for Mobile Scenarios Using a Synthetic Dataset Enhanced Deep Learning Approach, https://arxiv.org/abs/2508.20861
  • Yewon Byun, Sanket Vaibhav Mehta, Saurabh Garg, Emma Strubell, Michael Oberst, Bryan Wilder, Zachary C. Lipton, 28 Aug 2025, Expert Routing with Synthetic Data for Continual Learning, https://arxiv.org/abs/2412.17009
  • Joshua Ward, Chi-Hua Wang, Guang Cheng, 28 Aug 2025, Privacy Auditing Synthetic Data Release through Local Likelihood Attacks, https://arxiv.org/abs/2508.21146
  • Pujan Thapa, Alexander Ororbia, Travis Desell, 28 Aug 2025, Class Incremental Continual Learning with Self-Organizing Maps and Variational Autoencoders Using Synthetic Replay, https://arxiv.org/abs/2508.21240
  • João Valente, Atabak Dehban, Rodrigo Ventura, 29 Aug 2025, CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models, https://arxiv.org/abs/2508.21732
  • Jorge Saldivar, Anna Gatzioura, Carlos Castillo, 28 Aug 2025, Synthetic CVs To Build and Test Fairness-Aware Hiring Tools, https://arxiv.org/abs/2508.21179
  • Nidhi Kowtal, Raviraj Joshi, 29 Aug 2025, L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models, https://arxiv.org/abs/2506.00863
  • Shang Liu, Jing Wang, Wenji Fang, Zhiyao Xie, 26 Aug 2025, SynCircuit: Automated Generation of New Synthetic RTL Circuits Can Enable Big Data in Circuits, https://arxiv.org/abs/2509.00071
  • G. Charbel N. Kindji, Elisa Fromont, Lina Maria Rojas-Barahona, Tanguy Urvoy, 27 Aug 2025, Robust Detection of Synthetic Tabular Data under Schema Variability, https://arxiv.org/abs/2509.00092
  • Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki, 2 Sep 2025, Unsupervised Training of Vision Transformers with Synthetic Negatives, https://arxiv.org/abs/2509.02024
  • Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki, 2 Sep 2025, Fake & Square: Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives, https://arxiv.org/abs/2509.02029
  • Yevhen Havrylenko, Meelis Käärik and Artur Tuttar, 2 Sep 2025, Amputation-imputation based generation of synthetic tabular data for ratemaking, https://arxiv.org/abs/2509.02171
  • Hunter Gittlin, 29 Aug 2025, Beyond Synthetic Augmentation: Group-Aware Threshold Calibration for Robust Balanced Accuracy in Imbalanced Learning, https://arxiv.org/abs/2509.02592
  • Vikas Kashtriya and Pardeep Singh, 2 Sep 2025, Enhancing Machine Learning for Imbalanced Medical Data: A Quantum-Inspired Approach to Synthetic Oversampling (QI-SMOTE), https://arxiv.org/abs/2509.02863
  • Jorn K. Teutloff, 29 Aug 2025, Synthetic Founders: AI-Generated Social Simulations for Startup Validation Research in Computational Social Science, https://arxiv.org/abs/2509.02605
  • Leire Benito-Del-Valle, Pedro A. Moreno-Sánchez, Itziar Egusquiza, Itsaso Vitoria, Artzai Picón, Cristina López-Saratxaga, Adrian Galdran, 30 Aug 2025, Is Synthetic Image Augmentation Useful for Imbalanced Classification Problems? Case-Study on the MIDOG2025 Atypical Cell Detection Competition, https://arxiv.org/abs/2509.02612
  • Honglu Zhou, Xiangyu Peng, Shrikant Kendre, Michael S. Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles, 3 Sep 2025, Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data, https://arxiv.org/abs/2509.03501
  • Liming Xu, Yunbo Long, Alexandra Brintrup, 30 Aug 2025, SynDelay: A Synthetic Dataset for Delivery Delay Prediction, https://arxiv.org/abs/2509.05325
  • Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu, 6 Sep 2025, Icon²: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation, https://arxiv.org/abs/2509.05605
  • Ching-Chun Chang and Isao Echizen, 6 Sep 2025, Tell-Tale Watermarks for Explanatory Reasoning in Synthetic Media Forensics, https://arxiv.org/abs/2509.05753
  • Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke, 8 Sep 2025, MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML, https://arxiv.org/abs/2509.06806
  • Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro, 8 Sep 2025, Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection, https://arxiv.org/abs/2412.12761
  • Benjamin Hoffman, David Robinson, Marius Miron, Vittorio Baglione, Daniela Canestrari, Damian Elias, Eva Trapote, Felix Effenberger, Maddie Cusimano, Masato Hagiwara, Olivier Pietquin, 5 Sep 2025, Synthetic data enables context-aware bioacoustic sound event detection, https://arxiv.org/abs/2503.00296
  • Wang Wang, Mingyu Shi, Jun Jiang, Wenqian Ma, Chong Liu, Yasutaka Narazaki, Xuguang Wang, 5 Sep 2025, Empowering Bridge Digital Twins by Bridging the Data Gap with a Unified Synthesis Framework, https://arxiv.org/abs/2507.05814
  • Seunghyeon Kim, Kyeongryeol Go, 22 Jul 2025, Edge-case Synthesis for Fisheye Object Detection: A Data-centric Perspective, https://arxiv.org/abs/2507.16254
  • Xiaopeng Ke, Hexuan Deng, Xuebo Liu, Jun Rao, Zhenxi Song, Jun Yu, Min Zhang, 24 Jul 2025, AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs, https://arxiv.org/abs/2507.18584
  • Xi Long, Christy Boscardin, Lauren A. Maggio, Joseph A. Costello, Ralph Gonzales, Rasmyah Hammoudeh, Ki Lai, Yoon Soo Park, Brian C. Gin, 14 Aug 2025, Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis, https://arxiv.org/abs/2508.09458
  • Qiushi Sun, Jinyang Gong, Lei Li, Qipeng Guo, Fei Yuan, 25 Jul 2025, CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback, https://arxiv.org/abs/2507.22080
  • Xiaoling Hu, Xiangrui Zeng, Oula Puonti, Juan Eugenio Iglesias, Bruce Fischl, Yael Balbastre, 1 Aug 2025, Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation, https://arxiv.org/abs/2411.16719
  • Siyi Liu, Yujia Zheng, Yongqi Zhang, 4 Aug 2025, StructSynth: Leveraging LLMs for Structure-Aware Tabular Data Synthesis in Low-Data Regimes, https://arxiv.org/abs/2508.02601
  • Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, Chi Jin, 5 Aug 2025, Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction, https://arxiv.org/abs/2508.03613
  • Parker Seegmiller, Kartik Mehta, Soumya Saha, Chenyang Tao, Shereen Oraby, Arpit Gupta, Tagyoung Chung, Mohit Bansal and Nanyun Peng, 22 Aug 2025, FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline, https://arxiv.org/abs/2508.16514
  • Feng Tian, Flora D. Salim, Hao Xue, 25 Aug 2025, TradingGroup: A Multi-Agent Trading System with Self-Reflection and Data-Synthesis, https://arxiv.org/abs/2508.17565
  • Sunguk Choi, Yonghoon Kwon, Heondeuk Lee, 26 Aug 2025, CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks, https://arxiv.org/abs/2508.18743
  • Timur Sattarov, Marco Schreyer, Damian Borth, 29 Aug 2025, Federated Diffusion Modeling with Differential Privacy for Tabular Data Synthesis, https://arxiv.org/abs/2412.16083
  • Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu, 30 Aug 2025, Open Data Synthesis For Deep Research, https://arxiv.org/abs/2509.00375
  • Jianwei Wang, Chengming Shi, Junyao Yang, Haoran Li, Qianli Ma, Huiping Zhuang, Cen Chen and Ziqian Zeng, 31 Aug 2025, RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis, https://arxiv.org/abs/2502.18517
  • Yuntao Du, Ninghui Li, 7 Sep 2025, Systematic Assessment of Tabular Data Synthesis, https://arxiv.org/abs/2402.06806
  • Laura Boggia, Bogdan Malaescu, 9 Sep 2025, Synthetic Data Generation with Lorenzetti for Time Series Anomaly Detection in High-Energy Physics Calorimeters, https://arxiv.org/abs/2509.07451
  • Ali Reza Ibrahimzada, Yang Chen, Ryan Rong, Reyhaneh Jabbarvand, 9 Sep 2025, Challenging Bug Prediction and Repair Models with Synthetic Bugs, https://arxiv.org/abs/2310.02407
  • Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra, 11 Sep 2025, A Modular and Multimodal Generative AI Framework for Urban Building Energy Data: Generating Synthetic Homes, https://arxiv.org/abs/2509.09794
  • Keunwoo Choi, Seungheon Doh, Juhan Nam, 18 Aug 2025, TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation, https://arxiv.org/abs/2509.09685
  • Bastián González-Bustamante, Nando Verelst, Carla Cisternas, 11 Sep 2025, Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case, https://arxiv.org/abs/2509.09871
  • Jing Zhang, Alexandre Bousse, Chi-Hieu Pham, Kuangyu Shi, Julien Bert, 12 Sep 2025, Semi-Supervised Learning for Dose Prediction in Targeted Radionuclide: A Synthetic Data Study, https://arxiv.org/abs/2503.05367
  • Tung Vu, Lam Nguyen, Quynh Dao, 10 Sep 2025, PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability, https://arxiv.org/abs/2509.08910
  • Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng, 11 Sep 2025, Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function, https://arxiv.org/abs/2509.09197
  • Nazia Nafis, Inaki Esnaola, Alvaro Martinez-Perez, Maria-Cruz Villa-Uriol, Venet Osmani, 11 Sep 2025, Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review, https://arxiv.org/abs/2504.18544
  • Dimitris Tsirmpas, Ion Androutsopoulos, John Pavlopoulos, 11 Sep 2025, Scalable Evaluation of Online Facilitation Strategies via Synthetic Simulation of Discussions, https://arxiv.org/abs/2503.16505
  • Sepehr Dehdashtian, Mashrur M. Morshed, Jacob H. Seidman, Gaurav Bharaj and Vishnu Naresh Boddeti, 19 Sep 2025, PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors, https://arxiv.org/abs/2509.15551
  • Nakul Sharma, 19 Sep 2025, Efficient Long-Tail Learning in Latent Space by sampling Synthetic Data, https://arxiv.org/abs/2509.15859
  • Nomi Yu, Md Ferdous Alam, A. John Hart, and Faez Ahmed, 17 Sep 2025, GenCAD-3D: CAD Program Generation using Multimodal Latent Space Alignment and Synthetic Dataset Balancing, https://arxiv.org/abs/2509.15246
  • Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, Ruoming Pang, 17 Sep 2025, Synthetic bootstrapped pretraining, https://arxiv.org/abs/2509.15248
  • Caitlin Cisar, Emily Sheffield, Joshua Drake, Alden Harrell, Subramanian Chidambaram, Nikita Nangia, Vinayak Arannil, Alex Williams, 18 Sep 2025, PILOT: Steering Synthetic Data Generation with Psychological & Linguistic Output Targeting, https://arxiv.org/abs/2509.15447
  • Junlong Jia, Xing Wu, Chaochen Gao, Ziyang Chen, Zijia Lin, Zhongzhi Li, Weinong Wang, Haotian Xu, Donghui Jin, Debing Zhang, Binghui Guo, 19 Sep 2025, LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs, https://arxiv.org/abs/2509.15568
  • Yixuan Yang, Zhen Luo, Tongsheng Ding, Junru Lu, Mingqi Gao, Jinyu Yang, Victor Sanchez, Feng Zheng, 19 Sep 2025, OptiScene: LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization, https://arxiv.org/abs/2506.07570
  • Alessandro Crimi and Andrea Brovelli, 15 Sep 2025, Prediction and Causality of functional MRI and synthetic signal using a Zero-Shot Time-Series Foundation Model, https://arxiv.org/abs/2509.12497
  • Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou, 16 Sep 2025, WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning, https://arxiv.org/abs/2509.13305
  • Riyaadh Gani, 12 Sep 2025, Physics-Informed Neural Networks vs. Physics Models for Non-Invasive Glucose Monitoring: A Comparative Study Under Realistic Synthetic Conditions, https://arxiv.org/abs/2509.12253
  • Nolan Platt and Pragyansmita Nayak, 16 Sep 2025, Multi-Model Synthetic Training for Mission-Critical Small Language Models, https://arxiv.org/abs/2509.13047
  • Shanmuka Sadhu, Arca Baran, Preeti Pandey, and Ayush Kumar, 15 Sep 2025, Task Decoding based on Eye Movements using Synthetic Data Augmentation, https://arxiv.org/abs/2509.11547
  • Rumeng Li, Xun Wang, Hong Yu, 5 Sep 2025, DualAlign: Generating Clinically Grounded Synthetic Data, https://arxiv.org/abs/2509.10538
  • Omkar Shailendra Vengurlekar, Adithya Pediredla, Suren Jayasuriya, 14 Sep 2025, SH-SAS: An Implicit Neural Representation for Complex Spherical-Harmonic Scattering Fields for 3D Synthetic Aperture Sonar, https://arxiv.org/abs/2509.11087
  • Milan Marocchi, Matthew Fynn, Kayapanda Mandana, Yue Rong, 15 Sep 2025, Scaling to Multimodal and Multichannel Heart Sound Classification: Fine-Tuning Wav2Vec 2.0 with Synthetic and Augmented Biosignals, https://arxiv.org/abs/2509.11606
  • Mikhail Kulyabin, Jan Joosten, Choro Ulan uulu, Nuno Miguel Martins Pacheco, Fabian Ries, Filippos Petridis, Jan Bosch, and Helena Holmström Olsson, 15 Sep 2025, User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums, https://arxiv.org/abs/2509.11777
  • Lauri Suomela, Sasanka Kuruppu Arachchige, German F. Torres, Harry Edelman, Joni-Kristian Kämäräinen, 15 Sep 2025, Synthetic vs. Real Training Data for Visual Navigation, https://arxiv.org/abs/2509.11791
  • Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji, 13 Sep 2025, FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering, https://arxiv.org/abs/2412.07030
  • Shengjie Ma, Xuhui Jiang, Chengjin Xu, Cehao Yang, Liyu Zhang, Jian Guo, 14 Sep 2025, Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models, https://arxiv.org/abs/2505.00979
  • Karan Dua, Puneet Mittal, Ranjeet Gupta, Hitesh Laxmichand Patel, 15 Sep 2025, SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models, https://arxiv.org/abs/2509.14270
  • Luisa Torquato Niño and Hamza A. A. Gardi, 18 Sep 2025, Synthetic-to-Real Object Detection using YOLOv11 and Domain Randomization Strategies, https://arxiv.org/abs/2509.15045
  • Christopher Wiedeman, Anastasiia Sarmakeeva, Elena Sizikova, Daniil Filienko, Miguel Lago, Jana G. Delfino, Aldo Badano, 18 Sep 2025, T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images, https://arxiv.org/abs/2507.04038
  • Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Thomas Oberlin, 18 Sep 2025, Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation, https://arxiv.org/abs/2505.16360
  • Lauren H. Cooke, Matthias Jung, Jan M. Brendel, Nora M. Kerkovits, Borek Foldyna, Michael T. Lu, Vineet K. Raghu, 10 Sep 2025, RoentMod: A Synthetic Chest X-Ray Modification Model to Identify and Correct Image Interpretation Model Shortcuts, https://arxiv.org/abs/2509.08640
  • Dietmar Offenhuber, 14 Sep 2025, Synthetic Data and the Shifting Ground of Truth, https://arxiv.org/abs/2509.13355
  • Inder Pal Singh, Nidhal Eddine Chenni, Abd El Rahman Shabayek, Arunkumar Rathinam, Djamila Aouada, 17 Sep 2025, Bridging the Synthetic-Real Gap: Supervised Domain Adaptation for Robust Spacecraft 6-DoF Pose Estimation, https://arxiv.org/abs/2509.13792
  • Gustavo Kruger, Nikhil Sachdeva, Michael Sobolev, 17 Sep 2025, Synthetic Data Generation for Screen Time and App Usage, https://arxiv.org/abs/2509.13892
  • Niklas Grieger, Siamak Mehrkanoon, Stephan Bialonski, 17 Sep 2025, Data-Efficient Sleep Staging with Synthetic Time Series Pretraining, https://arxiv.org/abs/2403.08592

More AI Research Topics

Read more about: