Aussie AI
Limitations of LLMs
Last Updated 27 August 2025
by David Spuler, Ph.D.
LLMs can do some amazing new things, but they also have a lot of limitations. This article is a deep dive into limitations in various categories:
- Risks and safety
- Reasoning limitations
- Computational limitations
Safety Risks and Limitations
Your average LLM has problems with:
- Inaccuracies or misinformation (wrong facts or omissions)
- Biases (of many types)
- Insensitivity (e.g. when writing eulogies)
- Gullibility (not challenging the input text)
- Hallucinations (plausible-looking made-up facts)
- Confabulation (wrongly merging two sources)
- Dangerous or harmful answers (e.g. wrong mushroom picking advice)
- Plagiarism (in its training data set)
- Paraphrasing (plagiarism-like)
- Sensitive topics (the LLM requires training on each and every one)
- Training data quality ("Garbage in, garbage out")
- Alignment (people have purpose; LLMs only have language)
- Security (e.g. "jailbreaks")
- Refusal (knowing when it should)
- Personally Identifiable Information (PII) (e.g., emails or phone numbers in training data)
- Proprietary data leakage (e.g., trade secrets in an article used in a training data set)
- Surfacing inaccurate or outdated information
- Over-confidence (it knows not what it says)
- Veneer of authority (users tend to believe the words)
- Use for nefarious purposes (e.g., by hackers)
- Transparency (of the data, of the guardrails, of how it works, etc.)
- Privacy issues (sure, but Googling online has similar issues, so this isn't as new as everyone says)
- Legal issues (copyright violations, patentability, copyrightability, and more)
- Regulatory issues (inconsistent)
- Unintended consequences
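Some of these risks can at least be screened for mechanically. As a minimal sketch of PII detection in training data, the snippet below scans text for email addresses and phone-like strings. The patterns and the `find_pii` function are illustrative assumptions only; production PII scrubbers use far more robust detection, including NER models for names and addresses.

```python
import re

# Deliberately simplistic patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def find_pii(text: str) -> dict:
    """Return any email addresses or phone-like strings found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }

sample = "Contact jane.doe@example.com or call +1 (555) 123-4567."
print(find_pii(sample))
```

A scan like this only catches the obvious surface forms; anything paraphrased or obfuscated sails straight through, which is part of why PII leakage remains an open problem.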
Reasoning Limitations
Let's begin with some of the limitations that have largely been solved:
- Words about words (e.g. "words", "sentences", etc.)
- Writing style, tone, reading level, etc.
- Ending responses nicely with stop tokens and max tokens
- Tool integrations (e.g. clocks, calendars, calculators)
- Cut-off date for training data sets
- Long contexts
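As a toy sketch of the stop-token and max-token mechanism mentioned above: generation loops until the model emits a stop token (a natural ending) or hits a hard cap on length. The model, token names, and `generate` function here are hypothetical stand-ins, not a real API.

```python
STOP_TOKEN = "<eos>"

def toy_model(tokens):
    # Stand-in for a real next-token prediction step.
    stream = ["The", "answer", "is", "42", ".", STOP_TOKEN, "garbage"]
    return stream[len(tokens)]

def generate(max_tokens: int) -> list:
    out = []
    while len(out) < max_tokens:      # hard cap on response length
        tok = toy_model(out)
        if tok == STOP_TOKEN:         # model signals a natural ending
            break
        out.append(tok)
    return out

print(generate(max_tokens=10))  # ends at <eos> before the cap
print(generate(max_tokens=3))   # truncated mid-sentence by the cap
```

The second call shows the failure mode that early LLMs were notorious for: if the cap bites before the stop token arrives, the answer just cuts off mid-thought.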
Some other issues:
- Explainability
- Attribution (source citations)
- Logical reasoning
- Planning
- Probabilistic, non-deterministic generation (outputs are sampled from a distribution, not computed exactly)
- Mathematical reasoning
- Banal, bland, or overly formal writing
- Math word problems
- Crosswords and other word puzzles (e.g. anagrams, alliteration)
- Repetition (e.g., if it has nothing new to add, it may repeat a prior answer, rather than admitting that)
- Specialized domains (e.g. jargon, special meanings of words)
- Prompt engineering requirements (awkward wordings! Nobody really talks like that.)
- Oversensitivity to prompt variations (and yet, sadly, prompt engineering works)
- Ambiguity (of input queries)
- Over-explaining
- Nonsense answers
- Americanisms (e.g., word spellings and implied meanings, cultural issues like "football", etc.)
- Model "drift" (decline in accuracy over time)
- Non-repeatability (same question, different answer)
- Novice assumption (not identifying a user's higher level of knowledge from words in the questions; dare I say it's a kind of "AI-splaining")
- Words and meanings are not the same thing.
- Gibberish output (usually a bug; Transformers are just C++ programs, you know)
- Lack of common sense (although I know some people like that, too)
- Lack of a "world model"
- Lack of a sense of personal context (they don't understand what it means to be a person)
- Time/temporal reasoning (the concept of things happening in sequence is tricky)
- 3D scene visualization (LLMs struggle to understand the relationship between objects in the real world)
- Sarcasm and satire (e.g. articles espousing the benefits of "eating rocks")
- Spin, biased viewpoints, and outright disinformation/deception (of source content)
- Going rogue (usually a bug, or is it?)
- Trick questions (e.g., queries that look like common online puzzles, but aren't quite the same)
- Falling back on training data (overly complex answers)
- Detecting intentional deception or other malfeasance by users
- LLMs asking follow-up questions to clarify user requests (this capability has been improving quickly)
- Not correctly prioritizing parts of the request (i.e., given multiple requests in a prompt instruction, it doesn't always automatically know which things are most important to you)
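Several of the items above (non-repeatability, the probabilistic method) come down to sampled decoding. A minimal sketch of temperature-based softmax sampling over toy logits shows why the same "prompt" can yield different tokens on different runs; the logit values are invented for illustration.

```python
import math, random

def sample_with_temperature(logits, temperature, rng):
    """Softmax sampling: higher temperature flattens the distribution,
    so repeated calls with the same logits can pick different tokens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Toy logits over a 4-token vocabulary (illustrative values).
logits = [2.0, 1.5, 0.5, 0.1]
rng = random.Random(0)
picks = [sample_with_temperature(logits, temperature=1.0, rng=rng)
         for _ in range(10)]
print(picks)  # same logits, varying tokens
```

Setting temperature near zero makes the sampler collapse toward greedy decoding (always the top token), which is why "temperature 0" is the usual advice when repeatability matters, though even that is not a full guarantee in practice.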
Computational Limitations
There's really only one big problem with AI computation: it's slooow. Hence, the need for all of those expensive GPU chips. This leads to problems with:
- Cloud data center execution is expensive.
- AI phone execution problems (e.g., frozen phone, battery depletion, overheating)
- AI PC execution problems (big models are still too slow to run)
- Training data set requirements (they need to feed on lots of tokens)
- Environmental impact (e.g., by one estimate, a ten-fold need of extra data center electricity for AI answers compared to non-AI internet searches)
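A rough back-of-the-envelope calculation illustrates the scale of the compute problem. All numbers below are assumed round figures for illustration, not benchmarks, and real autoregressive decoding is usually memory-bandwidth-bound, so this if anything understates the latency.

```python
# Sketch of why LLM inference is slow: each generated token touches
# every weight, costing roughly 2 FLOPs per parameter.
params = 70e9                   # assumed model size: 70B parameters
flops_per_token = 2 * params    # ~2 FLOPs per parameter per token
gpu_flops = 300e12              # assumed sustained rate: 300 TFLOP/s
tokens = 500                    # a typical long-ish answer

seconds = tokens * flops_per_token / gpu_flops
print(f"~{seconds:.2f} s of pure compute for {tokens} tokens")
```

Multiply that by millions of queries per day and the expensive-GPU, hot-phone, and power-bill problems above follow directly.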
More Research on Limitations
Research papers that cover various other AI limitations:
- Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
- Abdelrahman "Boda" Sadallah, Daria Kotova, Ekaterina Kochmar, 15 Mar 2024, Are LLMs Good Cryptic Crossword Solvers? https://arxiv.org/abs/2403.12094 Code: https://github.com/rdeits/cryptics
- Jonas Wallat, Adam Jatowt, Avishek Anand, March 2024, Temporal Blind Spots in Large Language Models, WSDM '24: Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Pages 683–692, https://arxiv.org/abs/2401.12078, https://doi.org/10.1145/3616855.3635818, https://dl.acm.org/doi/abs/10.1145/3616855.3635818
- Juntu Zhao, Junyu Deng, Yixin Ye, Chongxuan Li, Zhijie Deng, Dequan Wang, 22 Sept 2023 (modified 11 Feb 2024), Lost in Translation: Conceptual Blind Spots in Text-to-Image Diffusion Models, ICLR 2024, https://openreview.net/forum?id=vb3O9jxTLc
- Victoria Basmov, Yoav Goldberg, Reut Tsarfaty, 11 Apr 2024 (v2), Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds, https://arxiv.org/abs/2305.14785
- Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Jonathan St. Onge, Mikaela Fudolig, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds, 23 Feb 2024 (v2), A blind spot for large language models: Supradiegetic linguistic information, https://arxiv.org/abs/2306.06794
- Michael King, July 24, 2023, Large Language Models are Extremely Bad at Creating Anagrams, https://www.techrxiv.org/doi/full/10.36227/techrxiv.23712309.v1
- George Cybenko, Joshua Ackerman, Paul Lintilhac, 16 Apr 2024, TEL'M: Test and Evaluation of Language Models, https://arxiv.org/abs/2404.10200
- Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082, 2023. https://arxiv.org/abs/2304.11082
- Mikhail Burtsev, Martin Reeves, and Adam Job. The working limitations of large language models. MIT Sloan Management Review, 65(1):1–5, 2023. https://sloanreview.mit.edu/article/the-working-limitations-of-large-language-models/
- Michael O'Neill, Mark Connor, 6 Jul 2023, Amplifying Limitations, Harms and Risks of Large Language Models, https://arxiv.org/abs/2307.04821
- Karl, May 10, 2023, Large Language Models: Reasoning Capabilities and Limitations, https://medium.com/@glovguy/large-language-models-reasoning-capabilities-and-limitations-951cee0ac642
- The PyCoach Apr 20, 2024, The False Promises of AI: How tech companies are fooling us https://medium.com/artificial-corner/the-false-promises-of-ai-fe23124e0fb9
- Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
- Bill Doerrfeld, Feb 6, 2024, Does Using AI Assistants Lead to Lower Code Quality? https://devops.com/does-using-ai-assistants-lead-to-lower-code-quality/
- Piotr Wojciech Mirowski, Juliette Love, Kory W. Mathewson, Shakir Mohamed, 3 Jun 2024 (v2), A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs' Humour Alignment with Comedians, https://arxiv.org/abs/2405.20956 (The unfunny fact that AI is bad at humor.)
- Rafe Brena, May 24, 2024, 3 Key Differences Between Human and Machine Intelligence You Need to Know: AI is an alien intelligence https://pub.towardsai.net/3-key-differences-between-human-and-machine-intelligence-you-need-to-know-7a34dcee2cd3 (Good article about how LLMs don't have "emotions" or "intelligence" and they don't "pause".)
- Amanda Silberling, August 27, 2024, Why AI can’t spell ‘strawberry’, https://techcrunch.com/2024/08/27/why-ai-cant-spell-strawberry/
- Kyle Wiggers, July 6, 2024, Tokens are a big reason today’s generative AI falls short, https://techcrunch.com/2024/07/06/tokens-are-a-big-reason-todays-generative-ai-falls-short/
- Xinyi Hou, Yanjie Zhao, Haoyu Wang, 3 Aug 2024, Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum, https://arxiv.org/abs/2408.01687
- Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme, 7 Feb 2024 (v2), OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax? https://arxiv.org/abs/2309.09992
- Radhika Rajkumar, Sept. 6, 2024, What AI can't do, digital twins, and swiveling laptop screens, https://www.zdnet.com/article/what-ai-cant-do-digital-twins-and-swiveling-laptop-screens/
- Victor Tangermann, Sep 13, 2024, OpenAI's New "Strawberry" AI Is Still Making Idiotic Mistakes, https://futurism.com/openai-strawberry-o1-mistakes
- Michael Nuñez, November 11, 2024, AI’s math problem: FrontierMath benchmark shows how far technology still has to go, https://venturebeat.com/ai/ais-math-problem-frontiermath-benchmark-shows-how-far-technology-still-has-to-go/
- Dynomight, Nov 2024, Something weird is happening with LLMs and chess, https://dynomight.net/chess/
- Evan Doyle, Nov 14, 2024, AI Makes Tech Debt More Expensive, https://www.gauge.sh/blog/ai-makes-tech-debt-more-expensive
- H Xu, Z Bi, H Tseng, X Song, P Feng, From Transformers to the Future: An In-Depth Exploration of Modern Language Model Architectures, https://osf.io/n8r5j/download
- Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song, Chiwun Yang, 8 Dec 2024, Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond, https://arxiv.org/abs/2412.06061
- Sam Liberty, Oct 15, 2024, Why AI Can’t Crack the NYT Connections Puzzle (Yet), https://medium.com/design-bootcamp/why-ai-cant-crack-the-nyt-connections-puzzle-yet-7bd3e00b4087
- Matthias Bastian, Oct 6, 2024, Study reveals major reasoning flaws in smaller AI language models, https://the-decoder.com/study-reveals-major-reasoning-flaws-in-smaller-ai-language-models/
- Paul Sawers, January 23, 2025, Meta’s Yann LeCun predicts a ‘new AI architectures paradigm’ within 5 years and ‘decade of robotics’, https://techcrunch.com/2025/01/23/metas-yann-lecun-predicts-a-new-ai-architectures-paradigm-within-5-years-and-decade-of-robotics/
- Ethan Mollick, Sep 16, 2023, Centaurs and Cyborgs on the Jagged Frontier. I think we have an answer on whether AIs will reshape work.... https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the-jagged
- Mehul Gupta, Jan 2025, Why AI Agents will be a huge disaster: Problems with AI Agents, https://medium.com/data-science-in-your-pocket/why-ai-agents-will-be-a-huge-disaster-9a68d9db18a1
- Lan Pan, Hanbo Xie, Robert C. Wilson, 29 Jan 2025, Large Language Models Think Too Fast To Explore Effectively, https://arxiv.org/abs/2501.18009
- Venkatesh Mishra, Bimsara Pathiraja, Mihir Parmar, Sat Chidananda, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral, 8 Feb 2025, Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning, https://arxiv.org/abs/2502.05675
- Safal Shrestha, Minwu Kim, Keith Ross, 12 Feb 2025, Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, https://arxiv.org/abs/2502.08680
- Frank Landymore, Jan 25, 2025, OpenAI's Agent Has a Problem: Before It Does Anything Important, You Have to Double-Check It Hasn't Screwed Up: Not as hands-off as you might hope, https://futurism.com/openai-asks-permission-important
- Yihang Yao, Zhepeng Cen, Miao Li, William Han, Yuyou Zhang, Emerson Liu, Zuxin Liu, Chuang Gan, Ding Zhao, 25 Feb 2025, Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training, https://arxiv.org/abs/2502.17800 (Using "query augmentation" as a type of automatic prompt optimization inside a reasoning chain.)
- Tong Yu, Yongcheng Jing, Xikun Zhang, Wentao Jiang, Wenjie Wu, Yingjie Wang, Wenbin Hu, Bo Du, Dacheng Tao, 6 Mar 2025, Benchmarking Reasoning Robustness in Large Language Models, https://arxiv.org/abs/2503.04550
- Allison Morrow, March 27, 2025, Apple’s AI isn’t a letdown. AI is the letdown, https://edition.cnn.com/2025/03/27/tech/apple-ai-artificial-intelligence/index.html
- Alberto Romero, May 21, 2025, AI Has No Sense of Humor: Still an exclusive human quality, https://www.thealgorithmicbridge.com/p/ai-has-no-sense-of-humor
- Yihong Dong, Yuchen Liu, Xue Jiang, Zhi Jin, Ge Li, 15 May 2025, Rethinking Repetition Problems of LLMs in Code Generation, https://arxiv.org/abs/2505.10402
- Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, Jennifer Neville, 9 May 2025, LLMs Get Lost In Multi-Turn Conversation, https://arxiv.org/abs/2505.06120
- Tobias Schnabel, Kiran Tomlinson, Adith Swaminathan, Jennifer Neville, 19 May 2025 (v2), Lost in Transmission: When and Why LLMs Fail to Reason Globally, https://arxiv.org/abs/2505.08140
- Parshin Shojaee, Maxwell Horton, Iman Mirzadeh, Samy Bengio, Keivan Alizadeh, June 2025, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, Apple, https://machinelearning.apple.com/research/illusion-of-thinking https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
- Dr. Ashish Bamania, June 2025, Apple’s New Research Shows That LLM Reasoning Is Completely Broken: A deep dive into Apple research that exposes the flawed thinking process in state-of-the-art Reasoning LLMs, https://ai.gopubby.com/apples-new-research-shows-that-llm-reasoning-is-completely-broken-47b5be71a06a
- Shreya Shankar, Jun 16, 2025, Writing in the Age of LLMs: Common Patterns of Bad Writing I See from LLM Tools, https://www.sh-reya.com/blog/ai-writing/ (A good overview of the types of bad writing that comes out of LLMs.)
- Sven Balnojan, Jun 17, 2025, Your RAG System Is Going to Kill Your Startup, https://infusedata.io/your-rag-system-is-going-to-kill-your-startup-700f32b69bb0 (Build for tomorrow's AI, not work-arounds like RAG for current limitations.)
- Mohamed Amine Ferrag, Norbert Tihanyi, Merouane Debbah, 26 Mar 2025, Reasoning Beyond Limits: Advances and Open Problems for LLMs, https://arxiv.org/abs/2503.22732
- Fluxus, Aug 2025, Why Your AI Never Works on the First Try, https://fluxus.io/article/why-your-ai-never-works-on-the-first-try
- Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
- Kenneth Wolters, Aug 12, 2025, No AGI in Sight: What This Means for LLMs, https://kennethwolters.com/posts/no-agi/
- Samet Ozkale, Aug 12, 2025, Why AI Can't Touch Your Taste: Taste is becoming the ultimate differentiator as AI democratizes technical execution across product development, https://productpower.substack.com/p/ai-wont-replace-taste
- Tiernan Ray, Aug. 13, 2025, Why GPT-5's rocky rollout is the reality check we needed on superintelligence hype: A year after Altman said superintelligence was imminent, GPT-5 is all we get? https://www.zdnet.com/article/why-gpt-5s-rocky-rollout-is-the-reality-check-we-needed-on-superintelligence-hype/
- Andrew Zuo, Aug 2025, Apple Can’t Figure Out AI Because There’s Nothing To Figure Out: Despite massive investments and widespread hype, the promised revolution of AI might be a search for something that doesn’t exist, https://andrewzuo.com/apple-cant-figure-out-ai-because-there-s-nothing-to-figure-out-e249f16adb50
- Maël Jullien, Marco Valentino, and André Freitas, 14 Aug 2025, The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference, https://arxiv.org/abs/2508.10777
- Zhao Song, Song Yue, Jiahao Zhang, 23 Jul 2025, Thinking Isn't an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations, https://arxiv.org/abs/2507.17699
- Mathieu Godbout and Audrey Durand, 18 Jul 2025, On the Fundamental Limitations of Dual Static CVaR Decompositions in Markov Decision Processes, https://arxiv.org/abs/2507.14005
- Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar, 18 Jul 2025, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, https://arxiv.org/abs/2506.06941
- Mohammed Alkhowaiter, Norah Alshahrani, Saied Alshahrani, Reem I. Masoud, Alaa Alzahrani, Deema Alnuhait, Emad A. Alghamdi, Khalid Almubarak, 19 Jul 2025, Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations, https://arxiv.org/abs/2507.14688
- Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu, 13 Aug 2025 (v3), Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens, https://arxiv.org/abs/2508.01191
- Armen Manukyan, Hrant Khachatrian, Edvard Ghukasyan, Theofanis P. Raptis, 25 Jul 2025, On the Limitations of Ray-Tracing for Learning-Based RF Tasks in Urban Environments, https://arxiv.org/abs/2507.19653
- Siwoo Park, 30 Jul 2025, Investigating the Invertibility of Multimodal Latent Spaces: Limitations of Optimization-Based Methods, https://arxiv.org/abs/2507.23010
- Evangelos Sariyanidi, John D. Herrington, Lisa Yankowitz, Pratik Chaudhari, Theodore D. Satterthwaite, Casey J. Zampella, Robert T. Schultz, Russell T. Shinohara, Birkan Tunc, 29 Jul 2025, Measuring Dependencies between Biological Signals with Temporal Self-supervision, and its Limitations, https://arxiv.org/abs/2508.02703
- Emanuele Nardone, Tiziana D'Alessandro, Francesco Fontanella, Claudio De Stefano, 5 Aug 2025, When Deep Learning Fails: Limitations of Recurrent Models on Stroke-Based Handwriting for Alzheimer's Disease Detection, https://arxiv.org/abs/2508.03773
- Yu-Hsuan Fang, Tien-Hong Lo, Yao-Ting Sung, Berlin Chen, 18 Aug 2025, Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning, https://arxiv.org/abs/2508.12591
- Noah Kasmanoff and Rahul Zalkikar, 15 Aug 2025, Limitation Learning: Catching Adverse Dialog with GAIL, https://arxiv.org/abs/2508.11767
- Zhihao Zhan, Jianan Zhao, Zhaocheng Zhu, Jian Tang, 16 Aug 2025, Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention, https://arxiv.org/abs/2507.00449
- Ben Dickson, August 19, 2025, LLMs generate ‘fluent nonsense’ when reasoning outside their training zone, https://venturebeat.com/ai/llms-generate-fluent-nonsense-when-reasoning-outside-their-training-zone/
- Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova and Radu State, 25 Aug 2025, Limitations of Normalization in Attention Mechanism, https://arxiv.org/abs/2508.17821
- Seamus Somerstep, Ya'acov Ritov, Mikhail Yurochkin, Subha Maity, Yuekai Sun, 23 Aug 2025, Limitations of refinement methods for weak to strong generalization, https://arxiv.org/abs/2508.17018