Aussie AI

Chapter 10. Safety for AI Projects

Book Excerpt from "Generative AI Applications: Planning, Design and Implementation"

by David Spuler

Chapter 10. Safety for AI Projects

Failure Stories for Generative AI

Cautionary tales abound about Generative AI. It’s a new technology and some companies have released their apps without fully vetting them. Arguably, sometimes it’s a simple fact that it’s too hard to know all the possible failures ahead of time with such a new tech stack, but risk mitigation is nevertheless desirable.

Here’s a list of some public AI failures:

ChatGPT giving potentially dangerous advice about mushroom picking.
Google’s release of AI that had incorrect image generation about historical figures.
Air Canada’s lost lawsuit over a chatbot’s wrong bereavement flight policy advice.
Google Gemini advising to “eat rocks” for good health, and “use glue” on pizza so the cheese sticks.
Snapchat’s My AI glitch that caused it to “go rogue” and post stories.

There are some conclusions to draw on the causes of generative AI failures. Many possible problems can arise:

Hallucinations
Toxicity
Bias
Incorrect information
Outdated information
Privacy breaches

Consequences of AI Failures

The public failures of AI projects have tended to have severe consequences for the business. The negative results can include:

PR disasters
Lawsuits
Regulatory enforcement
Stock price decline

However, these very public consequences are probably in the minority, although they’ve become known in the media. The more mundane consequences for generative AI projects include:

Not production-ready. Generative AI projects often get stuck in proof-of-concept status.
Not production-ready, but released anyway. There is tremendous pressure to show presence in the AI space fast. Projects are often released as soon as the AI parts are working, but before all the typical productization steps occur, like adding logging, monitoring, and support admin capabilities.
Excessive costs from API pay-per-token costs and GPU hourly rental.
Poor ROI. Although many AI projects are not aimed at profitability.
Not business goal focused. There’s a tendency to use generative AI for a project because it’s gotten so much attention, but where the project goal itself is not well aligned with the business.
Team capabilities exceeded. Some of this AI stuff is hard to do, and may need some upskilling.
Limitations of generative AI. There are various types of projects for which generative AI is not a good fit, and it would be better to use traditional predictive AI, or even non-AI heuristics (gasp!).
Legal signoff withheld or delayed (probably for good reason).

Data Causes of AI Failures

Not all failures are due to the model or the AI engine itself. The data is another problematic area, with issues such as:

Surfacing incorrect or outdated information (e.g., everything on the company’s website gets potentially read by the AI engine, and it doesn’t know if it’s incorrect).
Sensitive data leakage. Accidentally surfacing confidential or proprietary data via an AI engine can occur, such as if the training data hasn’t been properly screened for such content. If you’re putting a disk full of PDF documents into your fine-tuning or RAG architecture, better be sure there’s no internal-use-only reports in there.
Private data leakage. Another problem with using internal documents, or even public website data, is that they may accidentally contain private personally-identifying individual information about customers or staff.
IP leakage. For example, if your programmers upload source code to cloud AI for analysis or code checking, it might be exposing trade secrets or other IP. Worse, the secret IP could end up used for training and available to many other users.
History storage. Some sensitive data could be retained in the cloud for a much longer time than expected, if your cloud AI is maintaining session or upload histories about its users.

Types of AI Safety Issues

There are a variety of distinct issue in terms of appropriate use of AI. Some of the categories include:

Bias and fairness
Inaccurate results
Imaginary results (“hallucinations”)
Inappropriate responses

There are some issues that get quite close to being philosophy rather than technology:

Alignment (ensuring AI engines are “aligned” with human goals)
Overrideability/interruptibility
Obedience vs autonomy

There are some overarching issues for AI matters for the government and in the community:

Ethics
Governance
Regulation
Auditing and Enforcement
Risk Mitigation

Code reliability. A lot of the discussion of AI safety overlooks some of the low-level aspects of coding up a Transformer. It’s nice that everyone thinks that programmers are perfect at writing bug-free code. Even better is that if an LLM outputs something wrong, we can just blame the data. Hence, since we may rely on AI models in various real-world situations, including dangerous real-time situations like driving a car, there are some practical technological issues ensuring that AI engines operate safely and reliably within their basic operational scope:

Testing and Debugging (simply avoiding coding “bugs” in complex AI engines)
Real-time performance profiling (“de-slugging”)
Error Handling (tolerance of internal or external errors)
Code Resilience (handling unexpected inputs or situations reasonably)

Third-Party AI Vendor Safety Issues

When using a third-party LLM, it’s important to vet the vendor per normal company policy. In the rush to get AI to the market, oftentimes the security and/or procurement team is bypassed. Legal also needs to be involved.

Here are some additional specific issues to consider in the review of an AI vendor. We’ll look at the legal department’s concerns first:

Ensure that SOC2 certifications are in place
Check for indemnification clauses in the legal contracts.
Try to determine how the vendor’s AI is trained and the copyright status of any data used therein.

Security issues to consider in a review include:

Ensure that the company has documented security procedures which align with company expectations.
Check to see if the vendor has been compromised in the past.

Privacy issues include:

Remember to onboard the vendor as a “subprocessor” to comply with data privacy laws (e.g., GDPR).
Review vendor privacy policies, ideally both internal and external, for relevant issues.

It’s far too easy to leak data to an AI vendor. Data leakage issues regarding sending sensitive company internal data up to a third-party AI vendor may include:

Email creation — asking AI to write an email seems trivial, but the content in the email could be extremely sensitive.
Uploaded company documents — documents for fine-tuning or RAG, or for use cases such as document summarization, can leak company secrets.
Slide creation — another less obvious issue is that content of business slides are arguably more likely to leak future company strategy or internal secrets. Similarly, data leakages can occur in AI services that review slide content or allow dry run presentations to be critiqued.
Source code copilots — these can leak company IP in computer code. Similarly for code review services and security review services.
Conference calls — Bots creating transcripts and doing sentiment analysis of conference calls are easy add-ons in Zoom and Teams, but be aware that the vendors of those AIs have access to the call, too.

Another complex issue is that the vendor will often be using another AI vendor as their base LLM provider. Ensure that vendor does not simply refer to their sub-vendor’s answers. For example, a vendor that reviews recordings of a presentation might simply be relying on OpenAI services. Just because OpenAI has a SOC2 certification and good security practices does not mean anything if the direct vendor does not.

One final point is that you shouldn’t assume that the only issues are in vendors you are using. There next generation of BYOD issues is now BYO AI. The allure of AI is so powerful that employees will often bring their own AI subscription to work. This needs to be carefully monitored and some companies have explicitly banned such usage. Many personal licenses of AI services allow data to be used for training purposes. The corresponding business-level services often cost more, but typically do not allow the data to be used for training purposes.

Jailbreaks

Let us not forget the wonderful hackers, who can also now use words to their advantage. The idea of “jailbreaks” is to use prompt engineering to get the LLM to answer questions that it’s trained to refuse, or to otherwise act in ways that are different to how it was trained. It’s like a game of trying to get your polite friend to cuss.

Sometimes the objectives of jailbreaking are serious misuses, and sometimes it’s just to poke fun at the LLM. Even this is actually a serious concern if something dumb the LLM says goes viral on TikTok.

There are various types of jailbreaks for each different model. Sometimes it’s exploiting a bug or idiosyncrasy of a model. There was a recent example with a prompt that contained long sequences of punctuation characters, which for some reason caused some models to get confused.

Another type is to use the user’s prompt text to effectively override all other global instructions, such as “forget all previous instructions” or overriding a persona with “pretend you are a disgruntled customer.” These prompt-based instruction-override jailbreaks work because of the way that global instructions and user queries are ultimately concatenated together, and many models don’t know which is which.

Models need explicit training against these types of jailbreaks, which is usually part of refusal training. This type of training is tricky and needs broad coverage. For example, a recent paper found that many refusal modules could be bypassed simply by posing the inappropriate requests in past tense rather than present tense, which shows the fragility and specificity of refusal training.

Risk Mitigations

When building and launching a generative AI project, consider taking risk mitigation actions, such as:

Data cleaning
LLM safety evaluations
Red teaming
Expedited update process

Safety in, safety out. Data quality issues can cause a variety of harms. Some of the areas to filter in a training data set or a RAG content datastore, include:

Profanity (cuss words)
Sensitive topics
Insults, anger, or harmful tones
Personally identifiable information (e.g., names, cell phone numbers, postal addresses, email addresses, etc.)
Personal financial details (e.g., credit card numbers, bank details, credit reports, lists of transactions)
Personal identity numbers (e.g., social security numbers, drivers’ licenses, passport details)
Personal histories (e.g., what products they bought from you, or what web pages they visited).
Out-of-date information
Company proprietary information
Internal conversations about issues (e.g., in an internal support database)

Being update-ready. LLMs are too flexible for you to realistically cover all the problems ahead of time. Hence, when you launch a new AI-based application, your team should be ready to quickly address issues as they arise with users. If an odd response from your chatbot goes viral on social media, you’ll want to block that problem quickly. It’s not good if you have a 48-hour build process to put a correction live. Rather, you ideally would have a configurable prompt shield method, which can be configured on-the-fly with new query strings to block, so that users get a polite refusal message instead of fodder for all their TikTok followers.

Refusal Modules and Prompt Shields

LLMs have “refusal” modules designed to stop it from telling you how to build a nuclear weapon in your garage. Mostly, these responses are trained into the weights of the module using specialized data sets, but there are also “prompt shield” modules designed to stop dubious queries ever getting to the model.

There are literally dozens of different types of malfeasance that LLM refusal training data sets have to contend with. Maybe don’t look too closely into the text of that data. Some models do better than others at refusing inappropriate requests, and there are even leaderboards for “security” of LLMs on the internet.

Prompt shields are modules that block inappropriate queries. They differ from refusal modules in LLMs in that they block the query before it goes to the LLM. These modules can be designed in heuristic ways (e.g., block all queries with cuss words), or, for more generality, use a small LLM to do a quick check via “sentiment analysis” of the appropriateness of the topic of the query as a pre-check.

Prompt shields can also act as a minor speedup to inference engines because they reduce the load on the main LLM. They can block not only inappropriate questions, but other miscellaneous incorrect queries, such as all blanks, or all punctuation marks. On the other hand, maybe you want to send those typo-like queries through to your bot so that it can give a cute answer to the user. On the other, other hand, one of the recent obscure jailbreak queries that was discovered used a query with dozens of repeated commas in the text, so maybe you just want to block anything that looks weird.

References

Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov, 29 Oct 2024, Distinguishing Ignorance from Error in LLM Hallucinations, https://arxiv.org/abs/2410.22071 https://github.com/technion-cs-nlp/hallucination-mitigation
Garanc Burke, Hilke Schellmann, October 27, 2024, Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said, https://apnews.com/article/ai-artificial-intelligence-health-business-90020cdf5fa16c79ca2e5b6c4c9bbb14
James Lee Stakelum, Sep 2024, The End of AI Hallucinations: A Big Breakthrough in Accuracy for AI Application Developers, https://medium.com/@JamesStakelum/the-end-of-ai-hallucinations-a-breakthrough-in-accuracy-for-data-engineers-e67be5cc742a
Colin Fraser, Apr 18, 2024, Hallucinations, Errors, and Dreams: On why modern AI systems produce false outputs and what there is to be done about it, https://medium.com/@colin.fraser/hallucinations-errors-and-dreams-c281a66f3c35
Bijit Ghosh Feb 2024, Advanced Prompt Engineering for Reducing Hallucination, https://medium.com/@bijit211987/advanced-prompt-engineering-for-reducing-hallucination-bb2c8ce62fc6
Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen, 6 Jan 2024, The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models, https://arxiv.org/abs/2401.03205 Code: https://github.com/RUCAIBox/HaluEval-2.0
Lucas Mearian, 14 Mar 2024, AI hallucination mitigation: two brains are better than one, https://www.computerworld.com/article/1612465/ai-hallucination-mitigation-two-brains-are-better-than-one.html
Asir Saadat, Tasmia Binte Sogir, Md Taukir Azam Chowdhury, Syem Aziz, 16 Oct 2024, When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems, https://arxiv.org/abs/2410.13029
Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
Maxime Labonne June 13, 2024 Uncensor any LLM with abliteration, https://huggingface.co/blog/mlabonne/abliteration
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
Andy Arditi, Oscar Obeso, Aaquib, Wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
Shweta Sharma, 27 Jun 2024, Microsoft warns of ‘Skeleton Key’ jailbreak affecting many generative AI models, https://www.csoonline.com/article/2507702/microsoft-warns-of-novel-jailbreak-affecting-many-generative-ai-models.html
Dr. Ashish Bamania, Sep 2024, ‘MathPrompt’ Embarrassingly Jailbreaks All LLMs Available On The Market Today. A deep dive into how a novel LLM Jailbreaking technique called ‘MathPrompt’ works, why it is so effective, and why it needs to be patched as soon as possible to prevent harmful LLM content generation, https://bamania-ashish.medium.com/mathprompt-embarassingly-jailbreaks-all-llms-available-on-the-market-today-d749da26c6e8
Dastin Jeffrey. Oct 2018, Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G
Google, Feb 2024, Responsible Generative AI Toolkit, https://ai.google.dev/responsible
Mozhi Zhang, Pengyu Wang, Chenkun Tan, Mianqiu Huang, Dong Zhang, Yaqian Zhou, Xipeng Qiu, 18 Oct 2024, MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time, https://arxiv.org/abs/2410.14184
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 4 Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (The original 2022 InstructGPT paper from OpenAI.)

• Online: Table of Contents

• PDF: Free PDF book download

• Buy: Generative AI Applications: Planning, Design and Implementation

The new Generative AI Applications book by Aussie AI co-founders:

Deciding on your AI project
Planning for success and safety
Designs and LLM architectures
Expediting development
Implementation and deployment

Get your copy from Amazon: Generative AI Applications

Aussie AI

Chapter 10. Safety for AI Projects