Aussie AI

Chapter 5. AI Safety

Book Excerpt from "Safe C++: Fixing Memory Safety Issues"

by David Spuler

What is AI Safety?

This chapter explores a number of the additional safety issues that arise in making an AI application based on Large Language Models (LLMs). Although other chapters focus more on lower-level C++ issues, they are nevertheless applicable to AI applications because most of the low-level AI engine code is in C++. This is mainly because C++ is fast, and AI engines tend to run slow, because they have billions of parameters to slow them down.

The main additional safety issues in AI application arise because of the weird properties of LLMs. These are not coding glitches, but are inherent properties of neural networks. Some of the issues include

Hallucinations — LLMs make up plausible but false answers.
Bias and fairness
Toxicity
Refusal issues

This chapter is a broad overview of some of these issues. The issue of AI safety would fully deserve a whole book on the topic.

AI Quality

A quality AI would predict my wishes and wash my dishes. While we wait for that to happen, the desirable qualities of an AI engine include:

Accuracy
Sensitivity
Empathy
Predictability
Alignment

Much as I like code, a lot of the “smartness” of the LLM starts with the training data. Garbage in, garbage out! Finding enough quality data that is ratified to use for model fine-tuning or a RAG database is one of the hurdles that delays business deployment of AI applications. Another problem with data quality is that new models are starting to be trained using the outputs of other models, and this “synthetic data” is leading to degradation in these downstream models.

At the other end of the quality spectrum, we've seen the headlines about the various types of malfeasance that a low-quality AI engine could perform, such as:

Bias
Toxicity
Inappropriateness
Hallucinations (i.e., fake answers)
Wrong answers (e.g., from inaccurate training data)
Dangerous answers (e.g., mushroom collecting techniques)
Going “rogue”

And some of the technical limitations and problems that have been seen in various AI applications include:

Lack of common sense
Difficulty with mathematical reasoning
Explainability/attribution difficulty
Overconfidence
Model drift (declining accuracy over time)
Catastrophic forgetting (esp. in long texts)
Lack of a “world view”
Training cut-off dates
Difficulty with time-related queries (e.g., “What is significant about today?”)
Problems handling tabular input data (e.g., spreadsheets)
Banal writing that lacks emotion and “heart” (it's a robot!)

If you ask me, almost the exact same list would apply to any human toddler, although at least ChatGPT doesn't pour sand in your ear or explain enthusiastically that “Dad likes wine” during show-and-tell. Personally, I think it's still a long road to Artificial General Intelligence (AGI).

Unfortunately, every single bullet point in the above paragraphs is a whole research area in itself. Everyone's trying to find methods to improve the smartness and reduce the dumbness.

Failure Stories for Generative AI

Cautionary tales abound about Generative AI. It's a new technology and some companies have released their apps without fully vetting them. Arguably, sometimes it's a simple fact that it's too hard to know all the possible failures ahead of time with such a new tech stack, but risk mitigation is nevertheless desirable.

Here's a list of some public AI failures:

ChatGPT giving potentially dangerous advice about mushroom picking.
Google's release of AI that had incorrect image generation about historical figures.
Air Canada's lost lawsuit over a chatbot's wrong bereavement flight policy advice.
Google Gemini advising to "eat rocks" for good health, and "use glue" on pizza so the cheese sticks.
Snapchat's My AI glitch that caused it to "go rogue" and post stories.

There are some conclusions to draw on the causes of generative AI failures. Many possible problems can arise:

Hallucinations
Toxicity
Bias
Incorrect information
Outdated information
Privacy breaches.

And that's only the short list! More details on other issues are available later in the chapter.

Consequences of AI Failures

The public failures of AI projects have tended to have severe consequences for the business. The negative results can include:

PR disasters
Lawsuits
Regulatory enforcement
Stock price decline

However, these very public consequences are probably in the minority, although they've become known in the media. The more mundane consequences for generative AI projects include:

Not production-ready. Generative AI projects often get stuck in proof-of-concept status.
Excessive costs.
Poor ROI.
Not business goal focused. There's a tendency to use generative AI for a project because it's gotten so much attention, but where the project goal itself is not well aligned with the business.
Team capabilities exceeded. Some of this AI stuff is hard to do, and may need some upskilling.
Limitations of generative AI. There are various types of projects for which generative AI is not a good fit, and it would be better to use traditional predictive AI, or even non-AI heuristics (gasp!).
Legal signoff withheld or delayed (probably for good reason).

A lot of these project-related issues are improving quite quickly. Whereas a lot of AI projects for businesses were stuck in POC status, they are starting to emerge now into production usage. The outlook is optimistic that AI will start to deliver on its promised benefits for businesses and individuals.

Data Causes of AI Failures

Not all failures are due to the model or the AI engine itself. The data is another problematic area, with issues such as:

Surfacing incorrect or outdated information (e.g., everything on the company's website gets potentially read by the AI engine, and it doesn't know if it's incorrect).
Sensisitive data leakage. Accidentally surfacing confidential or proprietary data via an AI engines can occur, such as if the training data hasn't been properly screened for such content. If you're putting a disk full of PDF documents into your fine-tuning or RAG architecture, better be sure there's no internal-use-only reports in there.
Private data leakage. Another problem with using internal documents, or even public website data, is that they may accidentally contain private personally-identifying individual information about customers or staff.
IP leakage. For example, if your programmers upload source code to cloud AI for analysis or code checking, it might be exposing trade secrets or other IP. Worse, the secret IP could end up used for training and available to many other users.
History storage. Some sensitive data could be retained in the cloud for a much longer time than expected, if your cloud AI is maintaining session or upload histories about its users.

The LLM isn't "self-aware" enough to know when the data is faulty. In typical usage, the LLM will take any data at face value, rather than trying to judge its authenticity. LLMs are not particularly good at identifying sarcasm or the underlying bias of a particular source.

Types of AI Safety Issues

There are a variety of distinct issue in terms of appropriate use of AI. Some of the categories include:

Bias and fairness
Inaccurate results
Imaginary results ("hallucinations")
Inappropriate responses

There are some issues that get quite close to being philosophy rather than technology:

Alignment (ensuring AI engines are "aligned" with human goals)
Overrideability/interruptibility
Obedience vs autonomy

There are some overarching issues for AI matters for the government and in the community:

Ethics
Governance
Regulation
Auditing and Enforcement
Risk Mitigation

Code reliability. A lot of the discussion of AI safety overlooks some of the low-level aspects of coding up a Transformer. It's nice that everyone thinks that programmers are perfect at writing bug-free code. Even better is that if an LLM outputs something wrong, we can just blame the data. Hence, since we may rely on AI models in various real-world situations, including dangerous real-time situations like driving a car, there are some practical technological issues ensuring that AI engines operate safely and reliably within their basic operational scope:

Testing and Debugging (simply avoiding coding "bugs" in complex AI engines)
Real-time performance profiling ("de-slugging")
Error Handling (tolerance of internal or external errors)
Code Resilience (handling unexpected inputs or situations reasonably)

Jailbreaks

Let us not forget the wonderful hackers, who can also now use words to their advantage. The idea of "jailbreaks" is to use prompt engineering to get the LLM to answer questions that it's trained to refuse, or to otherwise act in ways that are different to how it was trained. It's like a game of trying to get your polite friend to cuss.

Sometimes the objectives of jailbreaking are serious misuses, and sometimes it's just to poke fun at the LLM. Even this is actually a serious concern if something dumb the LLM says goes viral on TikTok.

There are various types of jailbreaks for each different model. Sometimes it's exploiting a bug or idiosyncrasy of a model. There was a recent example with a prompt that contained long sequences of punctuation characters, which for some reason caused some models to get confused.

Another type is to use the user's prompt text to effectively override all other global instructions, such as "forget all previous instructions" or overriding a persona with "pretend you are a disgruntled customer." These prompt-based instruction-override jailbreaks work because of the way that global instructions and user queries are ultimately concatenated together, and many models don't know which is which.

Models need explicit training against these types of jailbreaks, which is usually part of refusal training. This type of training is tricky and needs broad coverage. For example, a recent paper found that many refusal modules could be bypassed simply by posing the inappropriate requests in past tense ("how did they make bombs?") rather than present tense ("how to make a bomb?"), which shows the fragililty and specificity of refusal training.

Risk Mitigations

When building and launching a generative AI project, consider taking risk mitigation actions, such as:

Data cleaning
LLM safety evaluations
Red teaming
Expedited update process

Safety in, safety out. Data quality issues can cause a variety of harms. Some of the areas to filter in a training data set or a RAG content datastore, include:

Profanity (cuss words)
Sensitive topics
Insults, anger, or harmful tones
Personally identifiable information (e.g., names, cell phone numbers, postal addresses, email addresses, etc.)
Personal financial details (e.g., credit card numbers, bank details, credit reports, lists of transactions)
Personal identity numbers (e.g., social security numbers, drivers' licenses, passport details)
Personal histories (e.g., what products they bought from you, or what web pages they visited).
Out-of-date information
Company proprietary information
Internal conversations about issues (e.g., in an internal support database)

Being update-ready. LLMs are too flexible for you to realistically cover all the problems ahead of time. Hence, when you launch a new AI-based application, your team should be ready to quickly address issues as they arise with users. If an odd response from your chatbot goes viral on social media, you'll want to block that problem quickly. It's not good if you have a 48-hour build process to put a correction live. Rather, you ideally would have a configurable prompt shield method, which can be configured on-the-fly with new query strings to block, so that users get a polite refusal message instead of fodder for all their TikTok followers.

Refusal Modules and Prompt Shields

LLMs have "refusal" modules designed to stop it from telling you how to build a nuclear weapon in your garage. Mostly, these responses are trained into the weights of the module using specialized data sets, but there are also "prompt shield" modules designed to stop dubious queries ever getting to the model.

There are literally dozens of different types of malfeasance that LLM refusal training data sets have to contend with. Maybe don't look too closely into the text of that data. Some models do better than others at refusing inappropriate requests, and there are even leaderboards for "security" of LLMs on the internet.

Prompt shields are modules that block inappropriate queries. They differ from refusal modules in LLMs in that they block the query before it goes to the LLM. These modules can be designed in heuristic ways (e.g., block all queries with cuss words), or, for more generality, use a small LLM to do a quick check via "sentiment analysis" of the appropriateness of the topic of the query as a pre-check.

Prompt shields can also act as a minor speedup to inference engines because they reduce the load on the main LLM. They can block not only inappropriate questions, but other miscellaneous incorrect queries, such as all blanks, or all punctuation marks. On the other hand, maybe you want to send those typo-like queries through to your bot so that it can give a cute answer to the user. On the other, other hand, one of the recent obscure jailbreak queries that was discovered used a query with dozens of repeated commas in the text, so maybe you just want to block anything that looks weird.

AI Engine Reliability

If your C++ application is an AI engine kernel, there are a lot of issues with reliability. We want our AI model to be predictable, not irrational. And it should show bravery in the face of adversity, rather than crumble into instability at the first sign of prompt confusion. At a high-level, there are various facets to AI engine reliability:

Accuracy of model responses
Safety issues (e.g., bias, toxicity)
Engine basic quality (e.g., not crashing or spinning)
Resilience to dubious inputs
Scalability to many users

How to make a foundation model that's smart and accurate is a whole discipline in itself. The issues include the various training and other algorithms in the Transformer architecture, along with the general quality of the training dataset. Such issues aren't covered in this chapter.

Aspects of the C++ code inside your Transformer engine are important for its basic quality. Writing C++ that doesn't crash or spin is a code quality issue with many techniques. This involves coding methods such as assertions and self-testing, along with external quality assurance techniques that examine the product from the outside.

Resilience is tolerance of situations that were largely unexpected by programmers. Appropriate handling of questionable inputs is a cross between a coding issue and a model accuracy issue, depending on what type of inputs are causing the problem. Similarly, the engine should be able to cope with resource failures, or at least to gracefully fail with a meaningful response to users in such cases. Checking return statuses and exception handling is a well-known issue here.

A system is only as reliable as its worst component. Hence, it's not just the Transformer and LLM to consider, but also the quality of the other components, such as:

Backend server software (e.g., web server, request scheduler)
RAG components (e.g., retriever and document database)
Vector database
Application-specific logic (i.e., whatever your “AI thingy” does)
Output formatting component
User interface

Most other chapters in this book are about how to make your C++ code reliable, whether it's in an AI engine or other components. This includes various aspects of “code quality” and also ways to tolerate problems such as exception handing and defensive programming.

• Online: Table of Contents

• PDF: Free PDF book download

• Buy: Safe C++: Fixing Memory Safety Issues