Aussie AI
Synthetic Data
-
Last Updated 30 August, 2025
-
by David Spuler, Ph.D.
What is Synthetic Data?
Synthetic data is the use of computer-generated text for LLM training. This can include generating entirely new training data, such as by using output from another LLM. Another approach is to "augment" training data, such as using synonymization to create slightly different versions of training sets with different words.
Research on Synthetic Data
Research papers include:
- Skurzhanskyi, O.H., Marchenko, O.O. & Anisimov, A.V., 2024, Specialized Pre-Training of Neural Networks on Synthetic Data for Improving Paraphrase Generation. Cybern Syst Anal 2024 https://doi.org/10.1007/s10559-024-00658-7 https://link.springer.com/article/10.1007/s10559-024-00658-7
- Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, 29 Jan 2024, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, https://arxiv.org/abs/2401.16380
- André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster, 4 Jan 2024, Comprehensive Exploration of Synthetic Data Generation: A Survey https://arxiv.org/abs/2401.02524
- Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
- David Spuler, March 2024, Chapter 45. Knowledge Distillation, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- A Gudibande, E Wallace, C Snell, X Geng, H Liu 2023, The false promise of imitating proprietary llms, https://arxiv.org/abs/2305.15717
- Y Wang, W Zhong, L Li, F Mi, X Zeng, W Huang 2023, Aligning large language models with human: A survey, https://arxiv.org/abs/2307.12966
- Y Gu, L Dong, F Wei, M Huang, 2023, Knowledge Distillation of Large Language Models, https://arxiv.org/abs/2306.08543
- X Wan, R Sun, H Dai, SO Arik, T Pfister, 2023, Better zero-shot reasoning with self-adaptive prompting, https://arxiv.org/abs/2305.14106
- S Horawalavithana, S Munikoti, I Stewart, 2023, SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions, https://arxiv.org/abs/2307.01139
- X Daull, P Bellot, E Bruno, V Martin, 2023, Complex QA and language models hybrid architectures, Survey, https://arxiv.org/abs/2302.09051
- Z Yuan, J Liu, Q Zi, M Liu, X Peng, Y Lou, 2023, Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation, https://arxiv.org/abs/2308.01240
- W AlShikh, M Daaboul, K Goddard, B Imel, 2023, Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning, https://arxiv.org/abs/2307.03692
- Z He, Z Xie, R Jha, H Steck, D Liang, Y Feng, 2023, Large Language Models as Zero-Shot Conversational Recommenders, https://arxiv.org/abs/2308.10053
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Michael Nuñez, July 18, 2024, Groq’s open-source Llama AI model tops leaderboard, outperforming GPT-4o and Claude in function calling, https://venturebeat.com/ai/groq-open-source-llama-ai-model-tops-leaderboard-outperforming-gpt-4o-and-claude-in-function-calling/
- Louie Peters, Aug 27, 2024, Two Paths to Small LMs? Synthetic Data (Phi 3.5) vs Pruning & Distillation (Llama-3.1-Minitron), https://newsletter.towardsai.net/p/114-two-paths-to-small-lms-synthetic
- Aatish Bhatia, Aug. 25, 2024, When A.I.’s Output Is a Threat to A.I. Itself: As A.I.-generated data becomes harder to detect, it’s increasingly likely to be ingested by future A.I., leading to worse results, NY Times, https://www.nytimes.com/interactive/2024/08/26/upshot/ai-synthetic-data.html
- Shumailov, I., Shumaylov, Z., Zhao, Y. et al. 2024, AI models collapse when trained on recursively generated data. Nature 631, 755–759. https://doi.org/10.1038/s41586-024-07566-y https://www.nature.com/articles/s41586-024-07566-y
- Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, Gauthier Gidel, 12 Jun 2024, Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences, https://arxiv.org/abs/2407.09499
- Ryan McNeal, Aug 27, 2024, ChatGPT and GPT-4 could get a sweet upgrade this fall with 'strawberry', https://www.androidauthority.com/openai-strawberry-ai-3475682/
- Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai, 10 Aug 2024 (v2), Best Practices and Lessons Learned on Synthetic Data, https://arxiv.org/abs/2404.07503
- Georgia Argyro, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou, 10 Sep 2024, Prompt2Fashion: An automatically generated fashion dataset, https://arxiv.org/abs/2409.06442
- Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli, 12 Sep 2024, Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources, https://arxiv.org/abs/2409.08239
- Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, Mehran Kazemi, 29 Aug 2024, Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, https://arxiv.org/abs/2408.16737
- Ulyana Piterbarg, Lerrel Pinto, Rob Fergus, 3 Oct 2024, Training Language Models on Synthetic Edit Sequences Improves Code Synthesis, https://arxiv.org/abs/2410.02749
- Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang, 16 Oct 2024, A Survey on Data Synthesis and Augmentation for Large Language Models, https://arxiv.org/abs/2410.12896
- Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He, 23 Oct 2024, SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains, https://arxiv.org/abs/2410.17952
- Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
- Arindam Mitra , Ahmed Awadallah , Yash Lara , November 14, 2024, Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators/
- Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig, 4 Dec 2024, Evaluating Language Models as Synthetic Data Generators, https://arxiv.org/abs/2412.03679
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- Xiang Huang, Jiayu Shen, Shanshan Huang, Sitao Cheng, Xiaxia Wang, Yuzhong Qu, 27 Dec 2024, TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data, https://arxiv.org/abs/2412.19544?
- Sebastian Raschka, PhD, Jan 15, 2025, Noteworthy AI Research Papers of 2024 (Part Two). Six influential AI papers from July to December, https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2 (Examines multimodal LLama3 models and the different multimodal architectures.)
- FZ Subah, Oct 2025, Mitigating and Assessing Bias and Fairness in Large Language Model-Generated Synthetic Tabular Data, Masters Thesis, Department of Engineering, University of Cambridge, https://www.mlmi.eng.cam.ac.uk/files/2023-2024/fzs21_mitigating_2024.pdf
- Chetan Harsha, Karmvir Singh Phogat, Sridhar Dasaratha, Sai Akhil Puranam, Shashishekar Ramakrishna, Jan 2025, Synthetic Data Generation Using Large Language Models for Financial Question Answering, Proceedings of the Joint Workshop of the 9th FinNLP, the 6th FNP, and the 1st LLMFinLegal, pages 76–95 January 19–20, 2025, Association for Computational Linguistics, https://aclanthology.org/2025.finnlp-1.7.pdf
- Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen, 25 Jan 2025, LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion, https://arxiv.org/abs/2501.15089
- Minsang Kim, Seungjun Baek, 6 Feb 2025, Syntriever: How to Train Your Retriever with Synthetic Data from LLMs, https://arxiv.org/abs/2502.03824
- Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang, 13 Feb 2025, Logical Reasoning in Large Language Models: A Survey, https://arxiv.org/abs/2502.09100
- Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen, 18 Feb 2025, Theorem Prover as a Judge for Synthetic Data Generation, https://arxiv.org/abs/2502.13137
- Maria Korolov, Jun 25, 2025, 7 ways synthetic data creates business value, https://www.cio.com/article/4003262/7-ways-synthetic-data-creates-business-value.html
- Ali Zolnour, Hossein Azadmaleki, Yasaman Haghbin, Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sina Rashidi, Masoud Khani, AmirSajjad Taleban, Samin Mahdizadeh Sani, Maryam Dadkhah, James M. Noble, Suzanne Bakken, Yadollah Yaghoobzadeh, Abdol-Hossein Vahabie, Masoud Rouhizadeh, Maryam Zolnoori, 8 Aug 2025, LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data, https://arxiv.org/abs/2508.10027
- Nitin Rai, Nathan S. Boyd, Gary E. Vallad, Arnold W. Schumann, 13 Aug 2025, Improving watermelon (Citrullus lanatus) disease classification with generative artificial intelligence (GenAI)-based synthetic and real-field images via a custom EfficientNetV2-L model, https://arxiv.org/abs/2508.10156
- Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian, 14 Aug 2025, Measuring Diversity in Synthetic Datasets, https://arxiv.org/abs/2502.08512
- Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng, 22 Jul 2025, Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation, https://arxiv.org/abs/2507.17066
- \'Alvaro Ruiz-R\'odenas, Jaime Pujante S\'aez, Daniel Garc\'ia-Algora, Mario Rodr\'iguez B\'ejar, Jorge Blasco and Jos\'e Luis Hern\'andez-Ramos, 21 Jul 2025, SynthCTI: LLM-Driven Synthetic CTI Generation to enhance MITRE Technique Mapping, https://arxiv.org/abs/2507.16852
- Rishemjit Kaur, Arshdeep Singh Bhankhar, Surangika Ranathunga, Jashanpreet Singh Salh, Sudhir Rajput, Vidhi, Kashish Mahendra, Bhavika Berwal, Ritesh Kumar, 22 Jul 2025, Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain, https://arxiv.org/abs/2507.16974
- Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong, 22 Jul 2025, More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment, https://arxiv.org/abs/2504.02193
- Shreya Saxena, Siva Prasad, Zishan Ahmad, Vishal Vaddina, 22 Jul 2025, ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training, https://arxiv.org/abs/2507.16478
- Ivona Krchova, Michael Platzer, Paul Tiwald, 22 Jul 2025, Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling, https://arxiv.org/abs/2507.16419
- Alireza Dizaji, Benedict Aaron Tjandra, Mehrab Hamidi, Shenyang Huang, Guillaume Rabusseau, 22 Jul 2025, T-GRAB: A Synthetic Diagnostic Benchmark for Learning on Temporal Graphs, https://arxiv.org/abs/2507.10183
- Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim, 24 Jul 2025, Synthetic Data Generation for Phrase Break Prediction with Large Language Model, https://arxiv.org/abs/2507.18044
- Basel Alshaikhdeeb, Ahmed Abdelmonem Hemedan, Soumyabrata Ghosh, Irina Balaur, and Venkata Satagopam, 24 Jul 2025, Generation of Synthetic Clinical Text: A Systematic Review, https://arxiv.org/abs/2507.18451
- Zhengyun Zhao, Huaiyuan Ying, Yue Zhong, Sheng Yu, 24 Jul 2025, DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data, https://arxiv.org/abs/2507.18583
- Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim, 24 Jul 2025, SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning, https://arxiv.org/abs/2507.18616
- Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim, 24 Jul 2025, SIDA: Synthetic Image Driven Zero-shot Domain Adaptation, https://arxiv.org/abs/2507.18632
- Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng, 24 Jul 2025, Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs, https://arxiv.org/abs/2507.18055
- Yefeng Yuan, Yuhong Liu, Liang Cheng, 24 Jul 2025, A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models, https://arxiv.org/abs/2404.14445
- Gregor Baer, Isel Grau, Chao Zhang, Pieter Van Gorp, 24 Jul 2025, Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation, https://arxiv.org/abs/2506.11790
- Keito Inoshita, Rushia Harada, 15 Jul 2025, Persona-Based Synthetic Data Generation Using Multi-Stage Conditioning with Large Language Models for Emotion Recognition, https://arxiv.org/abs/2507.13380
- Junsu Kim, Yunhoe Ku, Seungryul Baek, 18 Jul 2025, Can Synthetic Images Conquer Forgetting? Beyond Unexplored Doubts in Few-Shot Class-Incremental Learning, https://arxiv.org/abs/2507.13739
- Matthew A. Chan, Casey J. Pellizzari, Christopher A. Metzler, 17 Jul 2025, Inverse Synthetic Aperture Fourier Ptychography, https://arxiv.org/abs/2507.03733
- Claudio Giusti, Luca Guarnera, Mirko Casu, Sebastiano Battiato, 19 Jul 2025, Fraud is Not Just Rarity: A Causal Prototype Attention Approach to Realistic Synthetic Oversampling, https://arxiv.org/abs/2507.14706
- Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro, 21 Jul 2025, FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs, https://arxiv.org/abs/2507.15839
- Pan Peng, Hangyu Xu, 20 Jul 2025, Differentially Private Synthetic Graphs Preserving Triangle-Motif Cuts, https://arxiv.org/abs/2507.14835
- Zijian Ding, Tung Nguyen, Weikai Li, Aditya Grover, Yizhou Sun, Jason Cong, 19 Jul 2025, Iceberg: Enhancing HLS Modeling with Synthetic Data, https://arxiv.org/abs/2507.09948
- Rohit Kundu, Shan Jia, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury, 19 Jul 2025, TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data, https://arxiv.org/abs/2503.15867
- Yewon Byun, Shantanu Gupta, Zachary C. Lipton, Rachel Leah Childers, Bryan Wilder, 8 Aug 2025, Using Imperfect Synthetic Data in Downstream Inference Tasks, https://arxiv.org/abs/2508.06635
- Andrey Sidorenko and Paul Tiwald, 8 Aug 2025, Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN, https://arxiv.org/abs/2508.06647
- Sabrina Namazova, Alessandra Brondetta, Younes Strittmatter, Matthew Nassar, Sebastian Musslick, 11 Aug 2025, Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant, https://arxiv.org/abs/2508.07887
- Raunak Narwal and Syed Abbas, 10 Aug 2025, BIGBOY1.2: Generating Realistic Synthetic Data for Disease Outbreak Modelling and Analytics, https://arxiv.org/abs/2508.07239
- Ethan Lo and Dan C. Lo, 18 Jul 2025, Exoplanet Detection Using Machine Learning Models Trained on Synthetic Light Curves, https://arxiv.org/abs/2507.19520
- Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Zexue He, Shafiq Abedin, Jennifer Sun, Ben Wiesel, Eli Schwartz, Ahmed Nassar, Bo Wu, Assaf Arbelle, Aude Oliva, Dan Gutfreund, Leonid Karlinsky, Rogerio Feris, 31 May 2025, ChartGen: Scaling Chart Understanding Via Code-Guided Synthetic Chart Generation, https://arxiv.org/abs/2507.19492
- Tao Lian, Jose L. G\'omez, Antonio M. L\'opez, 26 Jul 2025, FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving, https://arxiv.org/abs/2507.19881
- Pavel Korshunov, Ketan Kotwal, Christophe Ecabert, Vidit Vidit, Amir Mohammadi, and Sebastien Marcel, 28 Jul 2025, Investigation of Accuracy and Bias in Face Recognition Trained with Synthetic Data, https://arxiv.org/abs/2507.20782
- Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, 25 Jul 2025, Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task, https://arxiv.org/abs/2310.09336
- Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 28 Jul 2025, Explainable Synthetic Image Detection through Diffusion Timestep Ensembling, https://arxiv.org/abs/2503.06201
- Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz, 28 Jul 2025, StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation, https://arxiv.org/abs/2507.21340
- Yida Tao, Yen-Chia Hsu, 29 Jul 2025, Bridging Synthetic and Real-World Domains: A Human-in-the-Loop Weakly-Supervised Framework for Industrial Toxic Emission Segmentation, https://arxiv.org/abs/2507.22002
- Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu, 31 Jul 2025, CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks, https://arxiv.org/abs/2507.23751
- Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen, 31 Jul 2025, Continual Learning with Synthetic Boundary Experience Blending, https://arxiv.org/abs/2507.23534
- Jessica Bader, Leander Girrbach, Stephan Alaniz, Zeynep Akata, 31 Jul 2025, SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions, https://arxiv.org/abs/2507.23784
- Patricia A. Apell\'aniz and Ana Jim\'enez and Borja Arroyo Galende and Juan Parras and Santiago Zazo, 31 Jul 2025, Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios, https://arxiv.org/abs/2407.03080
- Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg, 30 Jul 2025, Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning, https://arxiv.org/abs/2502.13820
- Georgi Ganev and Meenatchi Sundaram Muthu Selva Annamalai and Sofiane Mahiou and Emiliano De Cristofaro, 29 Jul 2025, The Importance of Being Discrete: Measuring the Impact of Discretization in End-to-End Differentially Private Synthetic Data, https://arxiv.org/abs/2504.06923
- Tom Or and Omri Azencot (Ben Gurion University of the Negev), 1 Aug 2025, Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics, https://arxiv.org/abs/2508.00784
- Ivona Krchova, Mariana Vargas Vieyra, Mario Scriminaci, Andrey Sidorenko, 1 Aug 2025, Democratizing Tabular Data Access with an Open$\unicode{x2013}$Source Synthetic$\unicode{x2013}$Data SDK, https://arxiv.org/abs/2508.00718
- Jianwei Wang, Ziming Wu, Fuming Lai, Shaobing Lian, Ziqian Zeng, 1 Aug 2025, SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought, https://arxiv.org/abs/2508.00574
- Abdulmajid Murad, Massimiliano Ruocco, 4 Aug 2025, Pre-Tactical Flight-Delay and Turnaround Forecasting with Synthetic Aviation Data, https://arxiv.org/abs/2508.02294
- Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Cyril Rakovski, Frank Rudzicz, 2 Aug 2025, MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs, https://arxiv.org/abs/2508.01401
- Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou, 5 Aug 2025, Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation, https://arxiv.org/abs/2508.03117
- Oc\'eane Doremus, Ariel Guerra-Adames, Marta Avalos-Fernandez, Vianney Jouhet, C\'edric Gil-Jardin\'e, Emmanuel Lagarde, 4 Aug 2025, Synthetic medical data generation: state of the art and application to trauma mechanism classification, https://arxiv.org/abs/2508.02771
- Shifeng Xie, Vasilii Feofanov, Marius Alonso, Ambroise Odonnat, Jianfeng Zhang, Themis Palpanas, and Ievgen Redko, 4 Aug 2025, CauKer: classification time series foundation models can be pretrained on synthetic data only, https://arxiv.org/abs/2508.02879
- Yongyi Wang, Lingfeng Li, Bozhou Chen, Ang Li, Hanyu Liu, Qirui Zheng, Xionghui Yang, Wenxin Li, 6 Aug 2025, Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling, https://arxiv.org/abs/2508.04282
- George Bredis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov, 6 Aug 2025, Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success, https://arxiv.org/abs/2508.04280
- Mohd Ashhad and Ricardo Henao, 5 Aug 2025, Generating Accurate Synthetic Survival Data by Conditioning on Outcomes, https://arxiv.org/abs/2405.17333
- Yunbo Long, Liming Xu, Alexandra Brintrup, 7 Aug 2025, LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion, https://arxiv.org/abs/2503.02161
- Ingo Ziegler, Abdullatif K\"oksal, Desmond Elliott, Hinrich Sch\"utze, 6 Aug 2025, CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation, https://arxiv.org/abs/2409.02098
- Alejandro Moreno R., Desale Fentaw, Samuel Palmer, Ra\'ul Salles de Padua, Ninad Dixit, Samuel Mugel, Roman Or\'us, Manuel Radons, Josef Menter, and Ali Abedi, 8 Aug 2025, Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS), https://arxiv.org/abs/2508.06251
- Ojonugwa Oluwafemi Ejiga Peter, Akingbola Oluwapemiisin, Amalahu Chetachi, Adeniran Opeyemi, Fahmi Khalifa, and Md Mahmudur Rahman, 8 Aug 2025, Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation, https://arxiv.org/abs/2508.06170
- Pavitra Chauhan, Mohsen Gamal Saad Askar, Kristian Svendsen, Bj{\o}rn Fjukstad, Brita Elvev{\aa}g, Lars Ailo Bongo, Edvard Pedersen, 8 Aug 2025, From research to clinic: Accelerating the translation of clinical decision support systems by making synthetic data interoperable, https://arxiv.org/abs/2308.02613
- Shayan Alahyari, Mike Domaratzki, 8 Aug 2025, SMOGAN: Synthetic Minority Oversampling with GAN Refinement for Imbalanced Regression, https://arxiv.org/abs/2504.21152
- Arshia Ilaty, Hossein Shirazi, Hajar Homayouni, 11 Aug 2025, SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering, https://arxiv.org/abs/2508.08529
- Audrey Poinsot, Panayiotis Panayiotou, Alessandro Leite, Nicolas Chesneau, \"Ozg\"ur \c{S}im\c{s}ek, Marc Schoenauer, 12 Aug 2025, Position: Causal Machine Learning Requires Rigorous Synthetic Experiments for Broader Adoption, https://arxiv.org/abs/2508.08883
- Farah Atif, Nursultan Askarbekuly, Kareem Darwish, Monojit Choudhury, 4 Aug 2025, Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions, https://arxiv.org/abs/2508.08287
- Vibeke Binz Vallevik, Anne Kjersti C. Befring, Severin Elvatun and Jan Franz Nygaard, 11 Aug 2025, Processing of synthetic data in AI development for healthcare and the definition of personal data in EU law, https://arxiv.org/abs/2508.08353
- Aydin Zaboli and Junho Hong, 12 Aug 2025, Generative AI for Critical Infrastructure in Smart Grids: A Unified Framework for Synthetic Data Generation and Anomaly Detection, https://arxiv.org/abs/2508.08593
- Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matari\'c, 12 Aug 2025, Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions, https://arxiv.org/abs/2502.13135
- Min Tang, Peng Lu, Qing Feng, 6 Aug 2025, Generating Feasible and Diverse Synthetic Populations Using Diffusion Models, https://arxiv.org/abs/2508.09164
- Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, Conghui He, Weijia Li, 13 Aug 2025, Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation, https://arxiv.org/abs/2508.09987
- Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun, 13 Aug 2025, Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning, https://arxiv.org/abs/2505.16483
- Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt, 14 Aug 2025, BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining, https://arxiv.org/abs/2508.10975
- Liam Chalcroft and Ioannis Pappas and Cathy J. Price and John Ashburner, 15 Aug 2025, Synthetic Data for Robust Stroke Segmentation, https://arxiv.org/abs/2404.01946
- Nitish Nagesh, Salar Shakibhamedan, Mahdi Bagheri, Ziyu Wang, Nima TaheriNejad, Axel Jantsch, Amir M. Rahmani, 15 Aug 2025, FairTabGen: Unifying Counterfactual and Causal Fairness in Synthetic Tabular Data Generation, https://arxiv.org/abs/2508.11810
- Jonas van Elburg, Peter van der Putten, Maarten Marx, 15 Aug 2025, Can we Evaluate RAGs with Synthetic Data?, https://arxiv.org/abs/2508.11758
- Ahmet H. G\"uzel, Ilija Bogunovic, Jack Parker-Holder, 17 Aug 2025, Synthetic Data is Sufficient for Zero-Shot Visual Generalization from Offline Data, https://arxiv.org/abs/2508.12356
- Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, Yulia Tsvetkov, 17 Aug 2025, Generalizable LLM Learning of Graph Synthetic Data with Post-training Alignment, https://arxiv.org/abs/2506.00845
- Matey Krastev, Miklos Hamar, Danilo Toapanta, Jesse Brouwers, Yibin Lei, 19 Aug 2025, InPars+: Supercharging Synthetic Data Generation for Information Retrieval Systems, https://arxiv.org/abs/2508.13930
- Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti, 19 Aug 2025, POPri: Private Federated Learning using Preference-Optimized Synthetic Data, https://arxiv.org/abs/2504.16438
- Suleyman Olcay Polat, Poli A. Nemkova, Mark V. Albert, 20 Aug 2025, Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method, https://arxiv.org/abs/2508.14783
- Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban, 20 Aug 2025, Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference, https://arxiv.org/abs/2508.14735
- Gaston Gustavo Rios, 20 Aug 2025, HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation, https://arxiv.org/abs/2508.14345
- Saptarshi Neil Sinha and P. Julius Kuehn and Johannes Koppe and Arjan Kuijper and Michael Weinmann, 20 Aug 2025, Neural Restoration of Greening Defects in Historical Autochrome Photographs Based on Purely Synthetic Data, https://arxiv.org/abs/2505.22291
- Bidyapati Pradhan, Surajit Dasgupta, Amit Kumar Saha, Omkar Anustoop, Sriram Puttagunta, Vipul Mittal, Gopal Sarda, 21 Aug 2025, GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO, https://arxiv.org/abs/2508.15432
- Jan Kapar, Kathrin G\"unther, Lori Ann Vallis, Klaus Berger, Nadine Binder, Hermann Brenner, Stefanie Castell, Beate Fischer, Volker Harth, Bernd Holleczek, Timm Intemann, Till Ittermann, Andr\'e Karch, Thomas Keil, Lilian Krist, Berit Lange, Michael F. Leitzmann, Katharina Nimptsch, Nadia Obi, Iris Pigeot, Tobias Pischon, Tamara Schikowski, B\"orge Schmidt, Carsten Oliver Schmidt, Anja M. Sedlmair, Justine Tanoey, Harm Wienbergen, Andreas Wienke, Claudia Wigmann and Marvin N. Wright, 19 Aug 2025, Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI, https://arxiv.org/abs/2508.14936
- Juntao Tan, Liangwei Yang, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Tulika Manoj Awalgaonkar, Jianguo Zhang, Weiran Yao, Ming Zhu, Shirley Kokane, Silvio Savarese, Huan Wang, Caiming Xiong, Shelby Heinecke, 20 Aug 2025, PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data, https://arxiv.org/abs/2502.20616
- Arefeh Kazemi and Sri Balaaji Natarajan Kalaivendan and Joachim Wagner and Hamza Qadeer and Kanishk Verma and Brian Davis, 20 Aug 2025, Synthetic vs. Gold: The Role of LLM Generated Labels and Data in Cyberbullying Detection, https://arxiv.org/abs/2502.15860
- Weijie Niu, Alberto Huertas Celdran, Karoline Siarsky, Burkhard Stiller, 22 Aug 2025, FEST: A Unified Framework for Evaluating Synthetic Tabular Data, https://arxiv.org/abs/2508.16254
- Seyedali Mohammadi, Manas Paldhe, Amit Chhabra, 13 Aug 2025, LingVarBench: Benchmarking LLM for Automated Named Entity Recognition in Structured Synthetic Spoken Transcriptions, https://arxiv.org/abs/2508.15801
- Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie, 21 Aug 2025, Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset, https://arxiv.org/abs/2508.15986
- Mika Leo Hube, Filip Lemic, Ethungshan Shitiri, Gerard Calvo Bartra, Sergi Abadal, Xavier Costa P\'erez, 22 Aug 2025, Set Transformer Architectures and Synthetic Data Generation for Flow-Guided Nanoscale Localization, https://arxiv.org/abs/2508.16200
- Stefania L. Moroianu, Christian Bluethgen, Pierre Chambon, Mehdi Cherti, Jean-Benoit Delbrouck, Magdalini Paschali, Brandon Price, Judy Gichoya, Jenia Jitsev, Curtis P. Langlotz, Akshay S. Chaudhari, 22 Aug 2025, Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data, https://arxiv.org/abs/2508.16783
- Pedro Antonio Rabelo Saraiva, Enzo Ferreira de Souza, Joao Manoel Herrera Pinheiro, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker, 24 Aug 2025, A Synthetic Dataset for Manometry Recognition in Robotic Applications, https://arxiv.org/abs/2508.17468
- Weikang Wan, Jiawei Fu, Xiaodi Yuan, Yifeng Zhu, Hao Su, 24 Aug 2025, LodeStar: Long-horizon Dexterity via Synthetic Data Augmentation from Human Demonstrations, https://arxiv.org/abs/2508.17547
- Rishikesh Devanathan, Varun Nathan, Ayush Kumar, 25 Aug 2025, Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation, https://arxiv.org/abs/2508.18210
- Melissa Kazemi Rad, Alberto Purpura, Himanshu Kumar, Emily Chen, Mohammad Shahed Sorower, 23 Aug 2025, GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection, https://arxiv.org/abs/2508.17057
- Chenhao Xue, Yuanzhe Jin, Adrian Carrasco-Revilla, Joyraj Chakraborty, Min Chen, 4 Aug 2025, AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification, https://arxiv.org/abs/2508.10000
AI Books from Aussie AI
![]() |
The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
Get your copy from Amazon: The Sweetest Lesson |
![]() |
RAG Optimization: Accurate and Efficient LLM Applications:
new book on RAG architectures:
Get your copy from Amazon: RAG Optimization |
![]() |
Generative AI Applications book:
Get your copy from Amazon: Generative AI Applications |
![]() |
Generative AI programming book:
Get your copy from Amazon: Generative AI in C++ |
![]() |
CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization |
![]() |
CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging |
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home