Aussie AI
Limitations of LLMs
Last Updated 27 August 2025
by David Spuler, Ph.D.
LLMs can do some amazing new things, but they also have a lot of limitations. This article is a deep dive into limitations in various categories:
- Risks and safety
- Reasoning limitations
- Computational limitations
Safety Risks and Limitations
Your average LLM has problems with:
- Inaccuracies or misinformation (wrong facts or omissions)
- Biases (of many types)
- Insensitivity (e.g. when writing eulogies)
- Gullibility (not challenging the input text)
- Hallucinations (plausible-looking made-up facts)
- Confabulation (wrongly merging two sources)
- Dangerous or harmful answers (e.g. wrong mushroom picking advice)
- Plagiarism (in its training data set)
- Paraphrasing (plagiarism-like)
- Sensitive topics (the LLM requires training on each and every one)
- Training data quality ("Garbage in, garbage out")
- Alignment (people have purpose; LLMs only have language)
- Security (e.g. "jailbreaks")
- Refusal (knowing when it should)
- Personally Identifiable Information (PII) (e.g., emails or phone numbers in training data)
- Proprietary data leakage (e.g., trade secrets in an article used in a training data set)
- Surfacing inaccurate or outdated information
- Over-confidence (it knows not what it says)
- Veneer of authority (users tend to believe the words)
- Use for nefarious purposes (e.g., by hackers)
- Transparency (of the data, of the guardrails, of how it works, etc.)
- Privacy issues (sure, but Googling online has similar issues, so this isn't as new as everyone says)
- Legal issues (copyright violations, patentability, copyrightability, and more)
- Regulatory issues (inconsistent)
- Unintended consequences
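Some of these risks can at least be screened for mechanically. As a minimal sketch of PII detection in training data, the snippet below scans text for email addresses and phone-like strings. The patterns and the `find_pii` function are illustrative assumptions only; production PII scrubbers use far more robust detection, including NER models for names and addresses.

```python
import re

# Deliberately simplistic patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def find_pii(text: str) -> dict:
    """Return any email addresses or phone-like strings found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }

sample = "Contact jane.doe@example.com or call +1 (555) 123-4567."
print(find_pii(sample))
```

A scan like this only catches the obvious surface forms; anything paraphrased or obfuscated sails straight through, which is part of why PII leakage remains an open problem.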
Reasoning Limitations
Let's begin with some of the limitations that have largely been solved:
- Words about words (e.g. "words", "sentences", etc.)
- Writing style, tone, reading level, etc.
- Ending responses nicely with stop tokens and max tokens
- Tool integrations (e.g. clocks, calendars, calculators)
- Cut-off date for training data sets
- Long contexts
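As a toy sketch of the stop-token and max-token mechanism mentioned above: generation loops until the model emits a stop token (a natural ending) or hits a hard cap on length. The model, token names, and `generate` function here are hypothetical stand-ins, not a real API.

```python
STOP_TOKEN = "<eos>"

def toy_model(tokens):
    # Stand-in for a real next-token prediction step.
    stream = ["The", "answer", "is", "42", ".", STOP_TOKEN, "garbage"]
    return stream[len(tokens)]

def generate(max_tokens: int) -> list:
    out = []
    while len(out) < max_tokens:      # hard cap on response length
        tok = toy_model(out)
        if tok == STOP_TOKEN:         # model signals a natural ending
            break
        out.append(tok)
    return out

print(generate(max_tokens=10))  # ends at <eos> before the cap
print(generate(max_tokens=3))   # truncated mid-sentence by the cap
```

The second call shows the failure mode that early LLMs were notorious for: if the cap bites before the stop token arrives, the answer just cuts off mid-thought.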
Some other issues:
- Explainability
- Attribution (source citations)
- Logical reasoning
- Planning
- Probabilistic, non-deterministic generation (outputs are sampled from a distribution, not computed exactly)
- Mathematical reasoning
- Banal, bland, or overly formal writing
- Math word problems
- Crosswords and other word puzzles (e.g. anagrams, alliteration)
- Repetition (e.g., if it has nothing new to add, it may repeat a prior answer, rather than admitting that)
- Specialized domains (e.g. jargon, special meanings of words)
- Prompt engineering requirements (awkward wordings! Nobody really talks like that.)
- Oversensitivity to prompt variations (and yet, sadly, prompt engineering works)
- Ambiguity (of input queries)
- Over-explaining
- Nonsense answers
- Americanisms (e.g., word spellings and implied meanings, cultural issues like "football", etc.)
- Model "drift" (decline in accuracy over time)
- Non-repeatability (same question, different answer)
- Novice assumption (not identifying a user's higher level of knowledge from words in the questions; dare I say it's a kind of "AI-splaining")
- Words and meanings are not the same thing.
- Gibberish output (usually a bug; Transformers are just C++ programs, you know)
- Lack of common sense (although I know some people like that, too)
- Lack of a "world model"
- Lack of a sense of personal context (they don't understand what it means to be a person)
- Time/temporal reasoning (the concept of things happening in sequence is tricky)
- 3D scene visualization (LLMs struggle to understand the relationship between objects in the real world)
- Sarcasm and satire (e.g. articles espousing the benefits of "eating rocks")
- Spin, biased viewpoints, and outright disinformation/deception (of source content)
- Going rogue (usually a bug, or is it?)
- Trick questions (e.g., queries that look like common online puzzles, but aren't quite the same)
- Falling back on training data (overly complex answers)
- Detecting intentional deception or other malfeasance by users
- LLMs asking follow-up questions to clarify user requests (this capability has been improving quickly)
- Not correctly prioritizing parts of the request (i.e., given multiple requests in a prompt instruction, it doesn't always automatically know which things are most important to you)
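Several of the items above (non-repeatability, the probabilistic method) come down to sampled decoding. A minimal sketch of temperature-based softmax sampling over toy logits shows why the same "prompt" can yield different tokens on different runs; the logit values are invented for illustration.

```python
import math, random

def sample_with_temperature(logits, temperature, rng):
    """Softmax sampling: higher temperature flattens the distribution,
    so repeated calls with the same logits can pick different tokens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Toy logits over a 4-token vocabulary (illustrative values).
logits = [2.0, 1.5, 0.5, 0.1]
rng = random.Random(0)
picks = [sample_with_temperature(logits, temperature=1.0, rng=rng)
         for _ in range(10)]
print(picks)  # same logits, varying tokens
```

Setting temperature near zero makes the sampler collapse toward greedy decoding (always the top token), which is why "temperature 0" is the usual advice when repeatability matters, though even that is not a full guarantee in practice.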
Computational Limitations
There's really only one big problem with AI computation: it's slooow. Hence, the need for all of those expensive GPU chips. This leads to problems with:
- Cloud data center execution is expensive.
- AI phone execution problems (e.g., frozen phone, battery depletion, overheating)
- AI PC execution problems (big models are still too slow to run)
- Training data set requirements (they need to feed on lots of tokens)
- Environmental impact (e.g., by one estimate, a ten-fold need of extra data center electricity for AI answers compared to non-AI internet searches)
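A rough back-of-the-envelope calculation illustrates the scale of the compute problem. All numbers below are assumed round figures for illustration, not benchmarks, and real autoregressive decoding is usually memory-bandwidth-bound, so this if anything understates the latency.

```python
# Sketch of why LLM inference is slow: each generated token touches
# every weight, costing roughly 2 FLOPs per parameter.
params = 70e9                   # assumed model size: 70B parameters
flops_per_token = 2 * params    # ~2 FLOPs per parameter per token
gpu_flops = 300e12              # assumed sustained rate: 300 TFLOP/s
tokens = 500                    # a typical long-ish answer

seconds = tokens * flops_per_token / gpu_flops
print(f"~{seconds:.2f} s of pure compute for {tokens} tokens")
```

Multiply that by millions of queries per day and the expensive-GPU, hot-phone, and power-bill problems above follow directly.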
More Research on Limitations
Research papers that cover various other AI limitations:
- Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
- Abdelrahman "Boda" Sadallah, Daria Kotova, Ekaterina Kochmar, 15 Mar 2024, Are LLMs Good Cryptic Crossword Solvers? https://arxiv.org/abs/2403.12094 Code: https://github.com/rdeits/cryptics
- Jonas Wallat, Adam Jatowt, Avishek Anand, March 2024, Temporal Blind Spots in Large Language Models, WSDM '24: Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Pages 683–692, https://arxiv.org/abs/2401.12078, https://doi.org/10.1145/3616855.3635818, https://dl.acm.org/doi/abs/10.1145/3616855.3635818
- Juntu Zhao, Junyu Deng, Yixin Ye, Chongxuan Li, Zhijie Deng, Dequan Wang, 22 Sept 2023 (modified 11 Feb 2024), Lost in Translation: Conceptual Blind Spots in Text-to-Image Diffusion Models, ICLR 2024, https://openreview.net/forum?id=vb3O9jxTLc
- Victoria Basmov, Yoav Goldberg, Reut Tsarfaty, 11 Apr 2024 (v2), Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds, https://arxiv.org/abs/2305.14785
- Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Jonathan St. Onge, Mikaela Fudolig, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds, 23 Feb 2024 (v2), A blind spot for large language models: Supradiegetic linguistic information, https://arxiv.org/abs/2306.06794
- Michael King, July 24, 2023, Large Language Models are Extremely Bad at Creating Anagrams, https://www.techrxiv.org/doi/full/10.36227/techrxiv.23712309.v1
- George Cybenko, Joshua Ackerman, Paul Lintilhac, 16 Apr 2024, TEL'M: Test and Evaluation of Language Models, https://arxiv.org/abs/2404.10200
- Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082, 2023. https://arxiv.org/abs/2304.11082
- Mikhail Burtsev, Martin Reeves, and Adam Job. The working limitations of large language models. MIT Sloan Management Review, 65(1):1–5, 2023. https://sloanreview.mit.edu/article/the-working-limitations-of-large-language-models/
- Michael O'Neill, Mark Connor, 6 Jul 2023, Amplifying Limitations, Harms and Risks of Large Language Models, https://arxiv.org/abs/2307.04821
- Karl, May 10, 2023, Large Language Models: Reasoning Capabilities and Limitations, https://medium.com/@glovguy/large-language-models-reasoning-capabilities-and-limitations-951cee0ac642
- The PyCoach Apr 20, 2024, The False Promises of AI: How tech companies are fooling us https://medium.com/artificial-corner/the-false-promises-of-ai-fe23124e0fb9
- Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
- Bill Doerrfeld, Feb 6, 2024, Does Using AI Assistants Lead to Lower Code Quality? https://devops.com/does-using-ai-assistants-lead-to-lower-code-quality/
- Piotr Wojciech Mirowski, Juliette Love, Kory W. Mathewson, Shakir Mohamed, 3 Jun 2024 (v2), A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs' Humour Alignment with Comedians, https://arxiv.org/abs/2405.20956 (The unfunny fact that AI is bad at humor.)
- Rafe Brena, May 24, 2024, 3 Key Differences Between Human and Machine Intelligence You Need to Know: AI is an alien intelligence https://pub.towardsai.net/3-key-differences-between-human-and-machine-intelligence-you-need-to-know-7a34dcee2cd3 (Good article about how LLMs don't have "emotions" or "intelligence" and they don't "pause".)
- Amanda Silberling, August 27, 2024, Why AI can’t spell ‘strawberry’, https://techcrunch.com/2024/08/27/why-ai-cant-spell-strawberry/
- Kyle Wiggers, July 6, 2024, Tokens are a big reason today’s generative AI falls short, https://techcrunch.com/2024/07/06/tokens-are-a-big-reason-todays-generative-ai-falls-short/
- Xinyi Hou, Yanjie Zhao, Haoyu Wang, 3 Aug 2024, Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum, https://arxiv.org/abs/2408.01687
- Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme, 7 Feb 2024 (v2), OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax? https://arxiv.org/abs/2309.09992
- Radhika Rajkumar, Sept. 6, 2024, What AI can't do, digital twins, and swiveling laptop screens, https://www.zdnet.com/article/what-ai-cant-do-digital-twins-and-swiveling-laptop-screens/
- Victor Tangermann, Sep 13, 2024, OpenAI's New "Strawberry" AI Is Still Making Idiotic Mistakes, https://futurism.com/openai-strawberry-o1-mistakes
- Michael Nuñez, November 11, 2024, AI’s math problem: FrontierMath benchmark shows how far technology still has to go, https://venturebeat.com/ai/ais-math-problem-frontiermath-benchmark-shows-how-far-technology-still-has-to-go/
- Dynomight, Nov 2024, Something weird is happening with LLMs and chess, https://dynomight.net/chess/
- Evan Doyle, Nov 14, 2024, AI Makes Tech Debt More Expensive, https://www.gauge.sh/blog/ai-makes-tech-debt-more-expensive
- H Xu, Z Bi, H Tseng, X Song, P Feng, From Transformers to the Future: An In-Depth Exploration of Modern Language Model Architectures, https://osf.io/n8r5j/download
- Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song, Chiwun Yang, 8 Dec 2024, Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond, https://arxiv.org/abs/2412.06061
- Sam Liberty, Oct 15, 2024, Why AI Can’t Crack the NYT Connections Puzzle (Yet), https://medium.com/design-bootcamp/why-ai-cant-crack-the-nyt-connections-puzzle-yet-7bd3e00b4087
- Matthias Bastian, Oct 6, 2024, Study reveals major reasoning flaws in smaller AI language models, https://the-decoder.com/study-reveals-major-reasoning-flaws-in-smaller-ai-language-models/
- Paul Sawers, January 23, 2025, Meta’s Yann LeCun predicts a ‘new AI architectures paradigm’ within 5 years and ‘decade of robotics’, https://techcrunch.com/2025/01/23/metas-yann-lecun-predicts-a-new-ai-architectures-paradigm-within-5-years-and-decade-of-robotics/
- Ethan Mollick, Sep 16, 2023, Centaurs and Cyborgs on the Jagged Frontier. I think we have an answer on whether AIs will reshape work.... https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the-jagged
- Mehul Gupta, Jan 2025, Why AI Agents will be a huge disaster: Problems with AI Agents, https://medium.com/data-science-in-your-pocket/why-ai-agents-will-be-a-huge-disaster-9a68d9db18a1
- Lan Pan, Hanbo Xie, Robert C. Wilson, 29 Jan 2025, Large Language Models Think Too Fast To Explore Effectively, https://arxiv.org/abs/2501.18009
- Venkatesh Mishra, Bimsara Pathiraja, Mihir Parmar, Sat Chidananda, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral, 8 Feb 2025, Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning, https://arxiv.org/abs/2502.05675
- Safal Shrestha, Minwu Kim, Keith Ross, 12 Feb 2025, Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, https://arxiv.org/abs/2502.08680
- Frank Landymore, Jan 25, 2025, OpenAI's Agent Has a Problem: Before It Does Anything Important, You Have to Double-Check It Hasn't Screwed Up: Not as hands-off as you might hope, https://futurism.com/openai-asks-permission-important
- Yihang Yao, Zhepeng Cen, Miao Li, William Han, Yuyou Zhang, Emerson Liu, Zuxin Liu, Chuang Gan, Ding Zhao, 25 Feb 2025, Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training, https://arxiv.org/abs/2502.17800 (Using "query augmentation" as a type of automatic prompt optimization inside a reasoning chain.)
- Tong Yu, Yongcheng Jing, Xikun Zhang, Wentao Jiang, Wenjie Wu, Yingjie Wang, Wenbin Hu, Bo Du, Dacheng Tao, 6 Mar 2025, Benchmarking Reasoning Robustness in Large Language Models, https://arxiv.org/abs/2503.04550
- Allison Morrow, March 27, 2025, Apple’s AI isn’t a letdown. AI is the letdown, https://edition.cnn.com/2025/03/27/tech/apple-ai-artificial-intelligence/index.html
- Alberto Romero, May 21, 2025, AI Has No Sense of Humor: Still an exclusive human quality, https://www.thealgorithmicbridge.com/p/ai-has-no-sense-of-humor
- Yihong Dong, Yuchen Liu, Xue Jiang, Zhi Jin, Ge Li, 15 May 2025, Rethinking Repetition Problems of LLMs in Code Generation, https://arxiv.org/abs/2505.10402
- Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, Jennifer Neville, 9 May 2025, LLMs Get Lost In Multi-Turn Conversation, https://arxiv.org/abs/2505.06120
- Tobias Schnabel, Kiran Tomlinson, Adith Swaminathan, Jennifer Neville, 19 May 2025 (v2), Lost in Transmission: When and Why LLMs Fail to Reason Globally, https://arxiv.org/abs/2505.08140
- Parshin Shojaee, Maxwell Horton, Iman Mirzadeh, Samy Bengio, Keivan Alizadeh, June 2025, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, Apple, https://machinelearning.apple.com/research/illusion-of-thinking https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
- Dr. Ashish Bamania, June 2025, Apple’s New Research Shows That LLM Reasoning Is Completely Broken: A deep dive into Apple research that exposes the flawed thinking process in state-of-the-art Reasoning LLMs, https://ai.gopubby.com/apples-new-research-shows-that-llm-reasoning-is-completely-broken-47b5be71a06a
- Shreya Shankar, Jun 16, 2025, Writing in the Age of LLMs: Common Patterns of Bad Writing I See from LLM Tools, https://www.sh-reya.com/blog/ai-writing/ (A good overview of the types of bad writing that comes out of LLMs.)
- Sven Balnojan, Jun 17, 2025, Your RAG System Is Going to Kill Your Startup, https://infusedata.io/your-rag-system-is-going-to-kill-your-startup-700f32b69bb0 (Build for tomorrow's AI, not work-arounds like RAG for current limitations.)
- Mohamed Amine Ferrag, Norbert Tihanyi, Merouane Debbah, 26 Mar 2025, Reasoning Beyond Limits: Advances and Open Problems for LLMs, https://arxiv.org/abs/2503.22732
- Fluxus, Aug 2025, Why Your AI Never Works on the First Try, https://fluxus.io/article/why-your-ai-never-works-on-the-first-try
- Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
- Kenneth Wolters, Aug 12, 2025, No AGI in Sight: What This Means for LLMs, https://kennethwolters.com/posts/no-agi/
- Samet Ozkale, Aug 12, 2025, Why AI Can't Touch Your Taste: Taste is becoming the ultimate differentiator as AI democratizes technical execution across product development, https://productpower.substack.com/p/ai-wont-replace-taste
- Tiernan Ray, Aug. 13, 2025, Why GPT-5's rocky rollout is the reality check we needed on superintelligence hype: A year after Altman said superintelligence was imminent, GPT-5 is all we get? https://www.zdnet.com/article/why-gpt-5s-rocky-rollout-is-the-reality-check-we-needed-on-superintelligence-hype/
- Andrew Zuo, Aug 2025, Apple Can’t Figure Out AI Because There’s Nothing To Figure Out: Despite massive investments and widespread hype, the promised revolution of AI might be a search for something that doesn’t exist, https://andrewzuo.com/apple-cant-figure-out-ai-because-there-s-nothing-to-figure-out-e249f16adb50
- Maël Jullien, Marco Valentino, and André Freitas, 14 Aug 2025, The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference, https://arxiv.org/abs/2508.10777
- Zhao Song, Song Yue, Jiahao Zhang, 23 Jul 2025, Thinking Isn't an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations, https://arxiv.org/abs/2507.17699
- Mathieu Godbout and Audrey Durand, 18 Jul 2025, On the Fundamental Limitations of Dual Static CVaR Decompositions in Markov Decision Processes, https://arxiv.org/abs/2507.14005
- Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar, 18 Jul 2025, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, https://arxiv.org/abs/2506.06941
- Mohammed Alkhowaiter, Norah Alshahrani, Saied Alshahrani, Reem I. Masoud, Alaa Alzahrani, Deema Alnuhait, Emad A. Alghamdi, Khalid Almubarak, 19 Jul 2025, Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations, https://arxiv.org/abs/2507.14688
- Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu, 13 Aug 2025 (v3), Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens, https://arxiv.org/abs/2508.01191
- Armen Manukyan, Hrant Khachatrian, Edvard Ghukasyan, Theofanis P. Raptis, 25 Jul 2025, On the Limitations of Ray-Tracing for Learning-Based RF Tasks in Urban Environments, https://arxiv.org/abs/2507.19653
- Siwoo Park, 30 Jul 2025, Investigating the Invertibility of Multimodal Latent Spaces: Limitations of Optimization-Based Methods, https://arxiv.org/abs/2507.23010
- Evangelos Sariyanidi, John D. Herrington, Lisa Yankowitz, Pratik Chaudhari, Theodore D. Satterthwaite, Casey J. Zampella, Robert T. Schultz, Russell T. Shinohara, Birkan Tunc, 29 Jul 2025, Measuring Dependencies between Biological Signals with Temporal Self-supervision, and its Limitations, https://arxiv.org/abs/2508.02703
- Emanuele Nardone, Tiziana D'Alessandro, Francesco Fontanella, Claudio De Stefano, 5 Aug 2025, When Deep Learning Fails: Limitations of Recurrent Models on Stroke-Based Handwriting for Alzheimer's Disease Detection, https://arxiv.org/abs/2508.03773
- Yu-Hsuan Fang, Tien-Hong Lo, Yao-Ting Sung, Berlin Chen, 18 Aug 2025, Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning, https://arxiv.org/abs/2508.12591
- Noah Kasmanoff and Rahul Zalkikar, 15 Aug 2025, Limitation Learning: Catching Adverse Dialog with GAIL, https://arxiv.org/abs/2508.11767
- Zhihao Zhan, Jianan Zhao, Zhaocheng Zhu, Jian Tang, 16 Aug 2025, Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention, https://arxiv.org/abs/2507.00449
- Ben Dickson, August 19, 2025, LLMs generate ‘fluent nonsense’ when reasoning outside their training zone, https://venturebeat.com/ai/llms-generate-fluent-nonsense-when-reasoning-outside-their-training-zone/
- Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova and Radu State, 25 Aug 2025, Limitations of Normalization in Attention Mechanism, https://arxiv.org/abs/2508.17821
- Seamus Somerstep, Ya'acov Ritov, Mikhail Yurochkin, Subha Maity, Yuekai Sun, 23 Aug 2025, Limitations of refinement methods for weak to strong generalization, https://arxiv.org/abs/2508.17018