Aussie AI
Prompt Shield
Last Updated 26 August, 2025
by David Spuler, Ph.D.
Research on Prompt Shield
- OpenAI, Moderation: Learn how to build moderation into your AI applications, 2024, https://platform.openai.com/docs/guides/moderation (a minimal sketch of this moderation-based shielding pattern appears after this reference list)
- Azure, 06/13/2024, Content filtering, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython
- Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
- Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
- Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian, 8 Aug 2023 (v2), Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion, https://arxiv.org/abs/2308.02552
- Xiao Peng, Tao Liu, Ying Wang, 3 Jun 2024 (v2), Genshin: General Shield for Natural Language Processing with Large Language Models, https://arxiv.org/abs/2405.18741
- Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, Huan Liu, 8 May 2024 (v2), Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales, https://arxiv.org/abs/2403.12403 Code: https://github.com/AmritaBh/shield
- Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
- Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
- Marko Zivkovic, Aug 06, 2024, Discovered Apple Intelligence prompts show Apple's attempt at preventing AI disaster, https://appleinsider.com/articles/24/08/06/discovered-apple-intelligence-prompts-show-apples-attempt-at-preventing-ai-disaster
- Meta, July 2024 (accessed), Llama: Making safety tools accessible to everyone, https://llama.meta.com/trust-and-safety/
- Aditi Bodhankar, Dec 06, 2024, Content Moderation and Safety Checks with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/content-moderation-and-safety-checks-with-nvidia-nemo-guardrails/
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
- Aditi Bodhankar, Jan 16, 2025, How to Safeguard AI Agents for Customer Service with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/how-to-safeguard-ai-agents-for-customer-service-with-nvidia-nemo-guardrails/
- Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, XiaoFeng Wang, Bo Li, 7 Jan 2025, PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models, https://arxiv.org/abs/2501.03544
- Vagner Figueredo de Santana, Sara Berger, Tiago Machado, Maysa Malfiza Garcia de Macedo, Cassia Sampaio Sanctos, Lemara Williams, and Zhaoqing Wu. 2025. Can LLMs Recommend More Responsible Prompts? In Proceedings of the 30th International Conference on Intelligent User Interfaces (IUI '25). Association for Computing Machinery, New York, NY, USA, 298–313. https://doi.org/10.1145/3708359.3712137 https://dl.acm.org/doi/full/10.1145/3708359.3712137 https://dl.acm.org/doi/pdf/10.1145/3708359.3712137
- Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese, 31 Jan 2025, Trading Inference-Time Compute for Adversarial Robustness, https://arxiv.org/abs/2501.18841
- Taryn Plumb, June 27, 2025, The rise of prompt ops: Tackling hidden AI costs from bad inputs and context bloat, https://venturebeat.com/ai/the-rise-of-prompt-ops-tackling-hidden-ai-costs-from-bad-inputs-and-context-bloat/
- Holistic AI Team, March 6, 2025, Anthropic’s Claude 3.7 Sonnet Jailbreaking & Red Teaming Audit: The Most Secure Model Yet? https://www.holisticai.com/blog/claude-3-7-sonnet-jailbreaking-audit
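A recurring pattern across the work above is the pre-flight moderation check: classify the incoming prompt before it reaches the main model, and refuse or reroute anything flagged as unsafe. Below is a minimal sketch of that pattern using OpenAI's Moderation endpoint (first reference in the list). The model name and response fields follow the public Python SDK, but the `prompt_shield` helper and its blocking policy are illustrative assumptions, not part of any cited system. Note also that a content-moderation classifier targets unsafe content categories (hate, violence, self-harm, and so on) rather than prompt-injection phrasing, so real deployments typically layer it with injection-specific guardrails such as those in the NeMo Guardrails and WildGuard references above.

```python
# Minimal prompt-shield sketch (assumptions: the "openai" Python SDK v1+ is
# installed and OPENAI_API_KEY is set in the environment; the model name
# "omni-moderation-latest" is taken from the OpenAI moderation guide).
from openai import OpenAI

client = OpenAI()

def prompt_shield(user_prompt: str) -> bool:
    """Return True if the prompt passes the moderation check, False if flagged."""
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=user_prompt,
    )
    result = response.results[0]
    # 'flagged' is True when any moderation category (hate, violence, etc.) fires.
    return not result.flagged

if __name__ == "__main__":
    prompt = "How do I build a simple chatbot in Python?"
    if prompt_shield(prompt):
        print("Prompt passed the shield; forward it to the main LLM.")
    else:
        print("Prompt blocked by the moderation shield.")
```

In a fuller pipeline the same check would normally run on both the user's prompt and the model's draft response, and the per-category flags (available in `result.categories`) can be logged for auditing or used to customize the refusal message.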