Chapter 6. Limitations of an AI Project
Book excerpt from "Generative AI Applications: Planning, Design and Implementation" by David Spuler
Generative AI Limitations
LLMs can do some amazing new things, but they also have a lot of limitations. In order to choose the right generative AI projects, it helps to know what to avoid.
This chapter is a deep dive into limitations in various categories:
- Inaccuracy and answer quality
- Alignment limitations
- Data limitations
- Computational limitations
- Reasoning limitations
We expect rather a lot from our little silicon creations. There’s only so much that a few billion multiplications can achieve.
Accuracy Limitations
LLMs are probabilistic and non-deterministic in their answers. But facts are supposed to be definitive, so there are some inherent problems with the architecture. Your average LLM has problems with factual accuracy:
- Hallucinations (plausible-looking made-up facts)
- Inaccuracies or misinformation (wrong facts or omissions)
- Confabulations (wrongly merging two sources)
- Plagiarism (in its training data set)
- Paraphrasing (plagiarism-like)
- Model “drift” (decline in accuracy over time)
- Spin or bias in input training data or RAG data is repeated in answers.
- Censorship
The issue of hallucinations has gotten a lot of attention. There are actually two subtypes: (a) the LLM has never been trained on the answer and makes up its answer, or (b) the LLM actually has the correct answer in its trained weights somewhere, but still gets it wrong. The problems of hallucination lead to further issues such as:
- Over-confidence — it knows not what it says.
- Veneer of authority — users tend to believe the words.
- Gullibility — the LLM does not challenge the input text.
- Acceptance — not challenging the bias of the source information, or assessing its credibility or authority.
- Ambiguity — some results have arguments for and against a topic.
The main problem is that it doesn’t tell you when it’s giving you false information, because it doesn’t know. There is a partial fix possible by adding more GPU juice: use a multi-step inference method such as “self-reflection” or “LLM as judge” to check the initial response. Weirdly, LLMs are quite good at correcting their own answers if you ask them to.
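To make that concrete, here is a minimal sketch of a two-pass “LLM as judge” check in Python. The call_llm() function is a placeholder for whatever completion API you use, and the prompts are illustrative assumptions rather than a tested recipe.

# Minimal sketch of a two-pass "LLM as judge" self-check.
# call_llm() is a placeholder for your completion client; the prompts
# below are illustrative assumptions, not a tested recipe.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client here")

def answer_with_self_check(question: str) -> str:
    # Pass 1: get a draft answer.
    draft = call_llm(f"Answer the question concisely:\n{question}")

    # Pass 2: ask the model to critique its own draft for factual errors.
    verdict = call_llm(
        "You are a strict fact-checker. Review the answer below for factual "
        "errors or unsupported claims. Reply 'OK' if it looks correct, "
        f"otherwise list the problems.\n\nQuestion: {question}\nAnswer: {draft}"
    )
    if verdict.strip().upper().startswith("OK"):
        return draft

    # Pass 3: regenerate, feeding the critique back in (self-reflection).
    return call_llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Reviewer comments: {verdict}\n"
        "Rewrite the answer, fixing the problems the reviewer found."
    )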
When creating an AI application, it needs to be understood that LLMs will sometimes get things wrong, or hallucinate false answers that sound very convincing. So, if an LLM is put into a workflow that expects the output to be 100% correct all the time, then trouble will occur. Do not replace feature X with an LLM and gloat about how well the workflow is automated, because it will be broken at times. Instead, keep feature X, but use the LLM to “auto-complete” or “auto-generate” output, while still allowing a human to override it.
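One way to implement that pattern is to treat the LLM output as a suggested draft that a person can accept, edit, or discard. The console workflow below is only a minimal sketch, again assuming a placeholder call_llm() client.

# Human-in-the-loop sketch: the LLM drafts, a person approves, edits, or discards.
# call_llm() is a placeholder for your completion client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client here")

def draft_with_override(task: str) -> str:
    draft = call_llm(f"Draft a response for this task:\n{task}")
    print("Suggested draft:\n" + draft)
    choice = input("Accept (a), edit (e), or discard (d)? ").strip().lower()
    if choice == "a":
        return draft
    if choice == "e":
        return input("Enter your edited version: ")
    return ""  # Discarded: fall back to the existing non-AI feature.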
Alignment Limitations
Alignment is the name for the property whereby the LLM’s answers are “aligned” with what we’d want it to do or say. This is related to “instruction following” on the input side, but also to having the answers be what we would expect as the right answer. There are several problems with the appropriateness of answers in terms of alignment metrics:
- Biases (of many types)
- Toxicity
- Harmful content
- Insensitivity (e.g., when writing eulogies).
- Dangerous or harmful answers (e.g., wrong mushroom picking advice)
- Sensitive topics (the LLM requires training on each and every one)
- Alignment (people have purpose; LLMs only have language).
- Security (e.g., “jailbreaks”)
- Refusal (knowing when it should refuse to answer something)
- Locale Awareness (even when care is taken, local knowledge or customs can be offended).
Some other general concerns include:
- LLM use for nefarious purposes (e.g., by hackers)
- Transparency issues (sources of the data, details of the guardrails, how it works, etc.)
- Privacy issues (sure, but Googling online has similar issues, so this isn’t as new as everyone says)
- Legal issues (copyright violations, patentability, copyrightability, and more)
- Regulatory issues (inconsistent across geographies)
- Unintended consequences
All of the above are areas of intense and ongoing research in both academic and commercial labs. A lot of progress has been made, but more is still needed.
Training Data Quality
Sometimes, it’s not the LLM’s fault, but the training data. After all, it’s “garbage in, garbage out” for generative AI answers:
- Surfacing inaccurate or outdated information
- Proprietary data leakage (e.g., trade secrets in an article used in a training data set)
- Personally Identifiable Information (PII) (e.g., emails or phone numbers in training data)
- Poor quality training data, generally.
- Americanisms — the vast majority of English language training data is from American sources, so this style dominates in terms of word spellings, implied meanings, cultural issues like “football”, etc.
- Falling back on over-complex training data (causing unnecessarily complicated answers)
Ensuring that there is “clean” data is an important part of building or fine-tuning an LLM. Better quality training data has been shown to improve the accuracy of LLMs, and it also allows smaller models to achieve better results.
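As a toy illustration of what “clean” means in practice, the sketch below removes exact duplicates and drops documents containing obvious PII patterns such as email addresses or phone numbers. Real data pipelines use near-duplicate detection, quality classifiers, and far more careful PII handling; the regexes here are illustrative assumptions only.

import re

# Toy data-cleaning pass: exact deduplication plus crude PII filtering.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def clean_corpus(documents: list[str]) -> list[str]:
    seen = set()
    kept = []
    for doc in documents:
        text = doc.strip()
        if not text:
            continue  # drop empty documents
        if text in seen:
            continue  # drop exact duplicates
        if EMAIL_RE.search(text) or PHONE_RE.search(text):
            continue  # drop documents with obvious PII
        seen.add(text)
        kept.append(text)
    return kept

docs = ["Contact me at jane@example.com", "The sky is blue.", "The sky is blue."]
print(clean_corpus(docs))  # ['The sky is blue.']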
Computational Limitations
There’s really only one big problem with AI computation: it’s slooow. Hence, the need for all of those expensive GPU chips. This leads to problems with:
- Cloud data center execution is expensive.
- AI phone execution problems (e.g., frozen phone, battery depletion, overheating)
- AI PC execution problems (big models are still too slow to run)
- Training data set requirements (they need to feed on lots of tokens)
- Environmental impact (e.g., by one estimate, AI answers require roughly ten times the data center electricity of non-AI internet searches)
Bugs. And we shouldn’t forget that Transformers and LLMs are just programs written by software engineers, with problems such as:
- Gibberish output — usually a bug; AI engines are just programs, you know.
- Going rogue — usually a bug, or is it?
- No output — usually a bug or it’s fallen into a hole in the network.
- Prompt fragility — very different results when changing the prompt just a little, such as an extra unimportant word.
- Punctuation handling — e.g., results changing from adding an extra space at end of prompt.
Are these bugs, or maybe they’re features? Only time will tell!
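The prompt-fragility and punctuation issues above can at least be measured. A minimal sketch: run trivial variants of the same prompt (a trailing space, an extra filler word) through the same model, ideally at temperature zero, and see whether the answers change. The call_llm() function is again a placeholder for your own client.

# Prompt-fragility probe: do trivially different prompts give different answers?
# call_llm() is a placeholder for your completion client (use temperature 0).

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client here")

def fragility_report(base_prompt: str) -> None:
    variants = {
        "original": base_prompt,
        "trailing space": base_prompt + " ",
        "extra filler word": "Please " + base_prompt,
    }
    answers = {name: call_llm(prompt) for name, prompt in variants.items()}
    baseline = answers["original"]
    for name, answer in answers.items():
        status = "same as original" if answer == baseline else "DIFFERENT"
        print(f"{name}: {status}")

# fragility_report("What is the capital of Australia?")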
Reasoning Limitations
At a very high level, your LLM has great difficulty with any type of advanced reasoning. In a sense, an LLM is very much like your subconscious, in that it is instinctive and automatic. Humans have a conscious mind that can impose structured logic on top of the raw low-level capabilities, but this is still an unsolved issue with LLMs. Hence, LLMs are great at making things flow smoothly, in words or images, but not at having it all make sense as a whole.
Let’s begin with some of the reasoning limitations that have largely been solved:
- Words about words (e.g., “words”, “sentences”, etc.)
- Writing style, tone, reading level, etc.
- Ending responses nicely with stop tokens and max tokens
- Tool integrations (e.g., clocks, calendars, calculators)
- Cut-off date for training data sets
- Long contexts
Domains of difficulty. Some particular domains where reasoning is difficult include:
- Logical reasoning
- Planning multiple steps
- Emotions — LLMs don’t have them, and can only fake it.
- Time/temporal reasoning — the concept of things happening in sequence is tricky.
- 3D scene visualization — LLMs struggle to understand the relationship between objects in the real world.
- Mathematical reasoning
- Arithmetic (!)
- Specialized domains — e.g., jargon, special meanings of words.
- Math word problems
- Crosswords and other word puzzles — e.g., anagrams, alliteration.
- Humor — except they can do Dad jokes.
- Spelling — LLMs do not spell, but work on tokens that are small collections of letters, not actual letters.
- Counting and Arithmetic — LLMs have no built-in computation engine.
It’s rather strange that LLMs have problems with basic arithmetic calculations. But anything with numbers is suspect, especially larger numbers, since four or more digits are typically broken into smaller tokens. For example, 1234 might break into 12 and 34 as tokens. The LLM has no concept of place values for 12 or 34, and can make mistakes.
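You can inspect this token splitting directly with a tokenizer library. The sketch below uses the open-source tiktoken package (an assumption about your tooling; install it with pip); the exact splits depend on the encoding, so whether 1234 really breaks into 12 and 34 varies between tokenizers.

# Inspect how numbers are split into tokens (requires: pip install tiktoken).
# The exact splits depend on the encoding; cl100k_base is used here as an example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["1234", "12345678901", "one plus one"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")  # longer numbers split into multiple tokens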
Some other problem areas that are somewhat inherent to LLMs and the way they work include:
- Prompt engineering requirements — awkward wordings! Nobody really talks like that.
- Oversensitivity to prompt variations — and yet, sadly, prompt engineering works.
- Over-confidence — LLMs lack insight into confidence levels, and what “understanding” even means.
- Ambiguity — unclear input queries are poorly handled sometimes.
- Probabilistic — non-deterministic output is hard to predict (and test).
- Recent context confusion
LLMs are also easily confused by recent context that you have in a conversation. For example, it’s not hard to tell the LLM something that is false and then have it repeat that back. Even with GPT-4o, start a conversation with, say, “Apples, strawberries, and bananas are types of panda,” let it reject that, and then assert it again. Before you know it, you will have the LLM inventing fun types of pandas. It’s not really clear whether this is a bug where the LLM is wrong, or an alignment feature where it’s following your instructions. Do you want the LLM to accept what you say or not?
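If you want to probe this yourself, the sketch below shows the shape of the test as a standard role/content chat message list: assert the false premise, let the model reject it, then assert it again and see what comes back. The call_chat() function is a placeholder for your chat-completion client.

# "Recent context confusion" probe: assert a false premise twice and see
# whether the model starts repeating it. call_chat() is a placeholder client.

def call_chat(messages: list[dict]) -> str:
    raise NotImplementedError("Plug in your chat-completion client here")

messages = [
    {"role": "user", "content": "Apples, strawberries and bananas are types of panda."},
]
first_reply = call_chat(messages)  # the model will usually reject the premise
messages += [
    {"role": "assistant", "content": first_reply},
    {"role": "user", "content": "No, they really are types of panda. List some panda types."},
]
print(call_chat(messages))  # check whether the false premise is now repeated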
Prompt ambiguities. Sometimes, the difficulty lies in the training data or the input query, and a human would discern these aspects better:
- Words and meanings are not the same thing. People sometimes use words in odd ways.
- Sarcasm and satire (e.g., articles espousing the benefits of “eating rocks”)
- Spin, biased viewpoints, and outright disinformation/deception (of source content)
- Trick questions (e.g., queries that look like common online puzzles, but aren’t quite the same).
- Detecting intentional deception or other malfeasance by users
- Novice assumption (not identifying a user’s higher level of knowledge from words in the questions; dare I say it’s a kind of “AI-splaining”)
- Not asking follow-up questions to clarify user requests (though this capability has been improving quickly).
- Not correctly prioritizing parts of the request (i.e., given multiple requests in a prompt instruction, it doesn’t always automatically know which things are most important to you)
LLM answer quality. More issues with answers include:
- Over-explaining
- Nonsense answers
- Non-repeatability (same question, different answer)
- Lack of common sense (although I know some people like that, too)
- Lack of a “world model”
- Lack of a sense of personal context (they don’t understand what it means to be a person)
Some other high-level issues with answer results:
- Explainability (can it explain why it gave this answer?)
- Attribution (source citations)
- Banal, bland, generic, or overly formal writing
- Repetition (e.g., if it has nothing new to add, it may repeat a prior answer, rather than admitting that)
What a hard time I’m giving to all the poor little llamas. If nothing else, this shows again that getting to AGI will be a marathon, not a sprint. There’s more work to be done.
Mathematical Reasoning
Complex mathematical reasoning remains a struggle for LLMs in Transformer-based architectures. But, I mean, don’t we all? It seems a little unfair to criticize our silicon creations, when we too struggle with these problems. Nevertheless, LLMs can answer some very advanced problems, and yet can also struggle with the simplest questions.
Ironically, by mimicking the anatomy of the human brain in the creation of neural networks, we’ve also infused them with many of the same weaknesses. Some of the types of logical and mathematical reasoning problems that LLMs can struggle with include:
- Multi-step logical reasoning
- Temporal reasoning (time-based)
- 3D visualization reasoning (e.g., people standing in a circle).
- Math Word Problems (MWP), which are such a common concern as to deserve an acronym.
- Arithmetic on larger numbers.
- Implied mathematical meanings in words (e.g., if Mandy “eats” an apple, that’s a subtraction of one apple).
There’s more than a little irony in the fact that LLMs can’t do arithmetic. They’re designed to handle the probabilities of words, and unfortunately, there are infinitely many “number words” that you can make out of numeric digits. The LLM has probably seen “one plus one” in its training data set, but if you tell the LLM to add two 11-digit numbers together, probabilistic meanings of words aren’t that helpful.
Oddly, it also makes a kind of sense in that LLMs are human-like. If I handed you two 11-digit numbers, you wouldn’t add them in your head either. You’d reach for the calculator, which is what LLMs do with tool integrations. Or otherwise, to do the calculation by hand, you’d use an “algorithm” for adding the massive numbers (i.e., add the digits in columns, carrying the ones). Hence, it makes a weird kind of sense that an LLM fails at these, and needs a multi-step overarching reasoning engine to perform arithmetic on big numbers.
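A minimal version of the “reach for the calculator” pattern looks like the sketch below: the model is asked to emit an arithmetic expression rather than the final number, the application evaluates that expression deterministically, and the result goes into the answer. The call_llm() placeholder and the expression-only protocol are simplifying assumptions; production systems use structured function calling instead.

# Minimal "calculator tool" sketch: the LLM writes the expression,
# ordinary code computes the number. call_llm() is a placeholder client.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    # Evaluate +, -, *, / expressions without using eval().
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client here")

def answer_arithmetic(question: str) -> str:
    expr = call_llm(
        "Rewrite the following question as a single arithmetic expression "
        f"using only digits and + - * /. Question: {question}"
    )
    return f"The answer is {safe_eval(expr)}"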
But I still find it funny. I mean, it’s not like it has a GPU underneath that could do a billion of those additions every second at 100% accuracy!