8. Data Crisis

  • Book Excerpt from "The Sweetest Lesson: Your Brain vs AI"
  • by David Spuler, Ph.D.

“We’re still in the horseless carriage era of AI applications.”

— Tomasz Tunguz, June 2025.


Training Data Sizes

AI engines are beasts that need regular feedings of new content to get smarter. They are continually searching for data, going out on safari all over the internet in search of prey. AI companies are looking for more sources of new data to train their smart silicon. Here are some statistics on how much data recent models required:

  • OpenAI GPT-4 — 6.5 trillion tokens
  • Meta Llama-3 — 15 trillion tokens
  • Meta Llama-4 — 40 trillion tokens

This data is expressed in “tokens” rather than words, but the two metrics are somewhat comparable. For ChatGPT, a token is about three-quarters of a word on average.

Interestingly, a four-year-old child only needs to hear about 40 million words to learn to speak. Probably not as eloquently as GPT-4, but that’s a vast difference in scale: on the order of a hundred thousand times fewer words than GPT-4’s training data, and closer to a million times fewer than Llama-4’s. The brain is a little better at learning.
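
If you want to check the back-of-the-envelope math, here is a minimal Python sketch of the comparison. It assumes the rough three-quarters-of-a-word-per-token conversion mentioned above, and it uses the token counts quoted in this chapter, which are estimates rather than official figures.

    # Back-of-the-envelope comparison of LLM training data vs. a child's input.
    # Assumes ~0.75 words per token and the rough token counts quoted above.
    WORDS_PER_TOKEN = 0.75
    CHILD_WORDS = 40e6  # approximate words a child hears by about age four

    training_tokens = {
        "GPT-4": 6.5e12,
        "Llama-3": 15e12,
        "Llama-4": 40e12,
    }

    for model, tokens in training_tokens.items():
        words = tokens * WORDS_PER_TOKEN
        ratio = words / CHILD_WORDS
        print(f"{model}: ~{words:.1e} words, roughly {ratio:,.0f}x a child's input")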

Data Shortage

How could the world be running out of data when there are so many kitten videos? Although it seems to most people that the world is awash with endless content, it is running out of good content for training AI engines.

There’s a growing crisis. The world is running out of training content. Come on, people, we need to do better at recycling.

One aspect of the training “wall” was a shortage of AI training data. The other aspect, discussed in the prior chapter, was concern about reasoning models and software algorithms.

To humans, the size of the internet is vast beyond comprehension. But it’s not actually that big for LLMs, which are already training on a “crawl” of literally the entire public internet.

Yum, yum. All those tasty Reddit threads.

But that was only part of the training data for the massive AI models. There are also vast databases of millions of scanned public-domain books, not to mention a few pirated ones! The lawsuits about that are still ongoing.

There’s not actually that much more content out there. I mean, we don’t have ten extra internets to go crawl. New books are published at a rate of hundreds of thousands per year, but a hundred thousand books at about 100,000 words each is only 10 billion words, which won’t even make an AI raise a sweat.

Where else could we find content?

Synthetic Data

Synthetic data is computer-generated data that can be used for AI training. This doesn’t necessarily mean AI-generated content, although AI is probably the best technology we have for writing. In fact, there are two ways that technology can create more training content:

  • New content creation
  • Derivative content

Creating new content means using LLMs or other writing technologies to generate text (or images) that can be used for training. The other method is to take some existing content and “derive” extra content from it. This is also called “data augmentation,” and the various methods include the following (two of which are sketched in code after this list):

  • Machine translation — creating foreign language content.
  • Synonymization — modify text using synonyms in place of the original wording.
  • Summarization — create a shortened version of War and Peace.
  • Paraphrasing — create versions that are similar but with different sentence structures.
  • Sentence shuffling — reordering sentences in the text.
  • Double translation — convert English to French and back again to get a variation.
  • Noise injection — intentionally insert errors or mistakes to make it different.

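To make a couple of these concrete, here is a minimal Python sketch of synonym substitution and sentence shuffling. The tiny synonym table and sample text are invented purely for illustration; real augmentation pipelines use proper NLP tooling for tokenization, word senses, and translation.

    import random

    # Toy text augmentation: synonym substitution and sentence shuffling.
    # The synonym table and sample text are purely illustrative.
    SYNONYMS = {"big": "large", "fast": "quick", "smart": "clever"}

    def synonymize(text: str) -> str:
        """Swap in synonyms for any words we have a substitute for."""
        return " ".join(SYNONYMS.get(word, word) for word in text.split())

    def shuffle_sentences(text: str, seed: int = 42) -> str:
        """Reorder the sentences of a paragraph to create a variant."""
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        random.Random(seed).shuffle(sentences)
        return ". ".join(sentences) + "."

    print(synonymize("the big model is fast and smart"))
    print(shuffle_sentences("Models need data. Data is scarce. Synthetic data can help."))
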
There’s also a whole slew of similar techniques for video or audio data. You can augment images and video by changing colors or cropping, and audio data can have its pitch or other attributes changed.

Why is there a problem? There shouldn’t be a need to use any of these data augmentation techniques. I mean, if ChatGPT is great at writing content and creating images, surely we have an infinite source of content to use for training AI models? There’s only one major problem with synthetic data: it can make your model collapse in a heap.

Model Collapse

What is model collapse? The term was coined by Shumailov et al. (2024) in relation to models trained “recursively” on their own data. They repeatedly generated output from a model and re-used that output as training data for the next generation of the model. The result was that accuracy declined with every cycle, with the outputs degrading to the point where they were no longer recognizable.

Hence, the concern is that the accuracy of an LLM “collapses” if it is repeatedly trained on its own output. This is a major concern for large pre-trained LLMs that are trained on public datasets, because LLM-created text is increasingly found on the internet, and may be scraped into these training sets.

The problem is the circularity, whereby LLMs are outputting new text onto the internet, which is then being re-used to train new capabilities into LLMs. If the theory of model collapse applies at scale, this cycle becomes problematic, with training on scraped data degrading new models rather than improving them.
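
A toy simulation, with no LLM in sight, shows the flavor of the problem. The sketch below is only an illustrative analogy, not the experiment from the Shumailov et al. paper: each generation fits a normal distribution to samples drawn from the previous generation’s fit, and over many generations the fit drifts and its spread tends to shrink, losing the tails of the original data.

    import random
    import statistics

    # Toy analogy for model collapse (not the Shumailov et al. experiment):
    # each "generation" is fitted only to samples drawn from the previous
    # generation's fitted model, never to the original data. Over many
    # generations the fitted spread tends to shrink and the mean drifts.
    random.seed(1)
    SAMPLES_PER_GENERATION = 20   # deliberately small, to speed up the drift
    GENERATIONS = 200

    mu, sigma = 0.0, 1.0          # the "real" data distribution
    for generation in range(1, GENERATIONS + 1):
        data = [random.gauss(mu, sigma) for _ in range(SAMPLES_PER_GENERATION)]
        mu, sigma = statistics.mean(data), statistics.stdev(data)
        if generation % 50 == 0:
            print(f"generation {generation:3d}: mean={mu:+.3f}, stdev={sigma:.4f}")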

Amusingly, the AI “slop” that is infesting the internet may also be annoying to the AIs. Seems somehow fitting.

People Parrots

Humans to the rescue? How much data can one person produce? Let’s just assume we’re talking about textual information. What is the fastest way for a human to create data to train an AI engine? Let’s compare some ways:

  • Typing — 60 words per minute (assuming touch typing, not “hunt-and-peck”).
  • Speaking — 240 words per minute (English), 360 words per minute (Chinese).

I’m pretty sure I’ve seen young people on smartphones texting at about a thousand words per minute, but let’s not go there. So, talking. Let’s assume it’s a full-time job, which is around 2,000 hours a year (50 weeks at 40 hours per week). Continual blather would therefore get you:

    240 x 60 x 2,000 = 28.8 million words per person-year (English).

    360 x 60 x 2,000 = 43.2 million words per person-year (Chinese).

However, we ideally want to create a trillion words, which is the order of magnitude of these training data sets. How many person-years?

  • English — 34,722 person-years
  • Chinese — 23,148 person-years

At an average salary of around $60,000 USD annually, this amounts to roughly $1.4 to $2.1 billion. That’s what it’ll cost to create a trillion original words of new human-sourced training content. Give or take. I wonder how much of that would be good content? The people creating it would actually cost a lot more than the GPUs used to train our super-duper LLM on that new content, and the GPUs won’t need bathroom breaks.
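
For anyone who wants to redo the arithmetic, here is the same back-of-the-envelope calculation as a tiny Python sketch, using the assumed speaking rates, 2,000 working hours per year, and the $60,000 salary figure, all of which are rough assumptions.

    # Back-of-the-envelope: person-years (and payroll) to speak a trillion words.
    # The speaking rates and salary are the rough assumptions used in the text.
    HOURS_PER_YEAR = 50 * 40        # 2,000 working hours per year
    TARGET_WORDS = 1e12             # one trillion words
    SALARY_USD = 60_000             # assumed average annual salary

    rates_wpm = {"English": 240, "Chinese": 360}  # assumed speaking rates

    for language, wpm in rates_wpm.items():
        words_per_year = wpm * 60 * HOURS_PER_YEAR
        person_years = TARGET_WORDS / words_per_year
        cost = person_years * SALARY_USD
        print(f"{language}: {words_per_year / 1e6:.1f}M words per person-year, "
              f"{person_years:,.0f} person-years, about ${cost / 1e9:.1f} billion")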

References

Training Data Sets. Research on “trillion token” training data set sizes for frontier models:

  1. Ashley Altus, October 15, 2024, Understanding LLMs: Model size, training data, and tokenization, https://outshift.cisco.com/blog/understanding-llms-model-size-training-data-tokenization
  2. Educating Silicon, May 9, 2024, How much LLM training data is there, in the limit?, https://www.educatingsilicon.com/2024/05/09/how-much-llm-training-data-is-there-in-the-limit/
  3. Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt, 17 Jun 2024, MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens, https://arxiv.org/abs/2406.11271
  4. Pierre-Carl Langlais, Anastasia Stasenko, Catherine Arnett, November 13, 2024, Releasing the largest multilingual open pretraining dataset, https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open
  5. Paul Sawers, Dec 2024, Harvard and Google to release 1 million public-domain books as AI training dataset, https://techcrunch.com/2024/12/12/harvard-and-google-to-release-1-million-public-domain-books-as-ai-training-dataset/

Data Crisis. Research on the “AI data crisis” with concerns about an AI training data shortage:

  1. Kylie Robison, Dec 14, 2024, OpenAI cofounder Ilya Sutskever says the way AI is built is about to change. “We’ve achieved peak data and there’ll be no more,” OpenAI’s former chief scientist told a crowd of AI researchers, https://www.theverge.com/2024/12/13/24320811/what-ilya-sutskever-sees-openai-model-data-training
  2. Humans in the Loop, 2025, Why Synthetic Data Is Taking Over in 2025: Solving AI’s Data Crisis, https://humansintheloop.org/why-synthetic-data-is-taking-over-in-2025-solving-ais-data-crisis/
  3. Joy Calder, 18 Nov 2024, The AI data crisis, https://cms.law/en/gbr/publication/data-bandwidth/the-ai-data-crisis
  4. David Gewirtz, June 5, 2025, The hidden data crisis threatening your AI transformation plans, https://www.zdnet.com/article/the-hidden-data-crisis-threatening-your-ai-transformation-plans/
  5. Appen, December 6, 2023, The Impending Data Crisis in the AI Economy, https://www.appen.com/blog/data-crisis-in-the-ai-economy

Human Writing. Research on human content creation capabilities:

  1. Jiahong Yuan, Mark Liberman, Christopher Cieri, 2006, Towards an Integrated Understanding of Speaking Rate in Conversation, https://languagelog.ldc.upenn.edu/myl/ldc/llog/icslp06_final.pdf
  2. Dean Talbot, November 23, 2023, Typing Speed Statistics, https://wordsrated.com/typing-speed-statistics/
  3. Javier Naranjo-Alcazar, Jordi Grau-Haro, Ruben Ribes-Serrano, Pedro Zuccarello, 8 Oct 2024 (v3), A Data-Centric Framework for Machine Listening Projects: Addressing Large-Scale Data Acquisition and Labeling through Active Learning, https://arxiv.org/abs/2405.18153
  4. John Elmore, May 13, 2025, Cracking the Code: Unraveling the Mystery of Average Typing Speed, https://thetechylife.com/what-is-the-average-typing-speed/
  5. Lidia Hovhan, Benjamin Noble, April 4, 2025, Key Steps for Creating High-Quality and Effective Image Datasets, https://www.sapien.io/blog/creating-high-quality-and-effective-image-datasets
  6. Cem Dilmegani, May 19, 2025, Audio Data Collection for AI: Challenges & Best Practices, https://research.aimultiple.com/audio-data-collection/

Synthetic Training Content. Research on synthetic content creation by AI (for AI):

  1. Luke Conroy and Anne Fehres, January 13, 2025, Tech companies are turning to ‘synthetic data’ to train AI models – but there’s a hidden cost, https://theconversation.com/tech-companies-are-turning-to-synthetic-data-to-train-ai-models-but-theres-a-hidden-cost-246248
  2. Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai, 10 Aug 2024 (v2), Best Practices and Lessons Learned on Synthetic Data, https://arxiv.org/abs/2404.07503
  3. Goyal, M., & Mahmoud, Q. H., 2024, A Systematic Review of Synthetic Data Generation Techniques Using Generative AI, Electronics, 13(17), 3509, https://doi.org/10.3390/electronics13173509, https://www.mdpi.com/2079-9292/13/17/3509
  4. Ashley Shedlock, July 15, 2025, The Secret Life of Synthetic Data: Why It’s Taking Over Research, https://www.greenbook.org/insights/data-science/the-secret-life-of-synthetic-data-why-its-taking-over-research
  5. André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster, 4 Jan 2024, Comprehensive Exploration of Synthetic Data Generation: A Survey, https://arxiv.org/abs/2401.02524
  6. Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/

Training Data Augmentation. Research on “data augmentation” to create derivative content:

  1. Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang, 16 Oct 2024, A Survey on Data Synthesis and Augmentation for Large Language Models, https://arxiv.org/abs/2410.12896
  2. Pinzhen Chen, 2023, Data Augmentation for Language Generation Inspired by Machine Translation, Ph.D. Thesis, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, https://era.ed.ac.uk/bitstream/handle/1842/41873/Chen2024.pdf?sequence=1&isAllowed=y
  3. Xiang Huang, Jiayu Shen, Shanshan Huang, Sitao Cheng, Xiaxia Wang, Yuzhong Qu, 27 Dec 2024, TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data, https://arxiv.org/abs/2412.19544
  4. Skurzhanskyi, O.H., Marchenko, O.O. & Anisimov, A.V., 2024, Specialized Pre-Training of Neural Networks on Synthetic Data for Improving Paraphrase Generation, Cybern Syst Anal 2024, https://doi.org/10.1007/s10559-024-00658-7, https://link.springer.com/article/10.1007/s10559-024-00658-7
  5. Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, 29 Jan 2024, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, https://arxiv.org/abs/2401.16380

Model Collapse. Research on “model collapse” issues:

  1. Shumailov I, Shumaylov Z, Zhao Y, Papernot N, Anderson R, Gal Y., July 2024, AI models collapse when trained on recursively generated data, Nature, 2024 Jul;631(8022):755-759. doi: 10.1038/s41586-024-07566-y, https://www.nature.com/articles/s41586-024-07566-y, https://pmc.ncbi.nlm.nih.gov/articles/PMC11269175/
  2. Bernard Marr, Aug 19, 2024, Why AI Models Are Collapsing And What It Means For The Future Of Technology, https://www.forbes.com/sites/bernardmarr/2024/08/19/why-ai-models-are-collapsing-and-what-it-means-for-the-future-of-technology/
  3. Alice Gomstyn, Alexandra Jonker, October 10, 2024, What is model collapse?, IBM, https://www.ibm.com/think/topics/model-collapse
  4. Wikipedia, July 2025, Model collapse, https://en.wikipedia.org/wiki/Model_collapse
  5. Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, Gauthier Gidel, 12 Jun 2024, Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences, https://arxiv.org/abs/2407.09499

 
