Chapter 7. Prompt Engineering for RAG
Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
by David Spuler and Michael Sharpe
What is Prompt Engineering?
Surely, you had a good laugh the first time you heard the term “prompt engineering”? Well, if you’re a software engineer, that is. Apparently, English is now the programming language du jour?
Anyway, prompt engineering is a real thing. LLMs supposedly speak “natural language” but the methods of prompting are rather unnatural. Over the last year or so, researchers have discovered a number of rather bizarre ways to enhance the results of LLMs, just by tweaking the words that go in.
But there’s a lot of simpler stuff that makes more sense. In fact, there are multiple ways that changing the words of your input query can enhance the results:
- Meta-instructions
- Brand voice
- Prepended context
- Advanced reasoning
All of these are useful ideas for “dataless” chatbots. Rather than defining a complicated RAG architecture, or using a fine-tuned custom LLM, you can get very far with just prompt engineering.
Basics of Prompt Engineering
Before we get into the details of writing code in your brand-new programming language, formerly known as “English,” let’s examine some of the basic things about prompting:
- Longer prompts
- Specific words
- Add examples
- Set the temperature
- Basic question-answer structure
- Templated structures
- Context markers
All else being equal, a longer prompt will work better than a shorter one. More words give the LLM more probabilities to crunch, and it tends to do better.
In the same vein as prompt length, being very specific with appropriate words helps the LLM focus on the area. Try to avoid ambiguous words that have multiple meanings, or use a longer prompt with several different synonyms or near-synonyms.
Examples, or “exemplars,” are useful for showing the LLM what you want it to output. If you make a longer prompt that contains an example, the LLM can do better at generating an answer with a similar structure and style.
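For example, a one-shot prompt with a single exemplar might look like this (a made-up illustration; any input/output pair in the format you want will do):
Example question: What is the capital of France?
Example answer: The capital of France is Paris.
Question: What is the capital of Japan?
Answer: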
The “temperature” is an important variable to consider when you’re sending prompts to the LLM engine. How creative do you want it to be? Lower temperatures produce more focused, deterministic answers; higher temperatures produce more varied, creative ones. This is a consideration that is sometimes forgotten when you’re fiddling with words, but you can also fiddle with the settings.
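If you’re calling an LLM via an API, the temperature is usually just a request parameter. Here is a minimal sketch using the OpenAI Python client (the model name and values are illustrative, and other vendors have similar parameters):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model name
    temperature=0.9,       # higher = more creative; lower (e.g., 0.2) = more focused and repeatable
    messages=[{"role": "user", "content": "Suggest three names for a new coffee shop."}],
)
print(response.choices[0].message.content)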
If you’re doing a question-and-answer type application, you can just use a “pass-through” prompt from the user:
What is the capital of Australia?
This basically foists the issue of prompt engineering onto your user, for better or worse. You can use some more structured templates, where you insert the user’s input prompt into a templated format, such as:
Question: What is the capital of Australia?
Answer:
This encourages the LLM to extend the context, although don’t worry, it was going to do that anyway. By adding a little more structure, you can get a more focused answer.
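In code, this kind of templating is just string formatting. A minimal sketch (the function name is made up):

def build_qa_prompt(user_query: str) -> str:
    # Wrap the user's raw question in a simple Question/Answer template.
    return f"Question: {user_query}\nAnswer:"

prompt = build_qa_prompt("What is the capital of Australia?")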
Another type of templating is to use various special characters as delimiters, such as quotes or dashes. A simple example:
----------Question: ----------
What is the capital of Australia?
----------Answer: ----------
The use of multi-character delimiters as separators is more useful with blocks of text, such as “chunks” in a RAG architecture, or multiple turns of conversation history in a chatbot. Using these types of section-splitting markers helps the LLM know what is context, and what is a question.
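A minimal sketch of wrapping text blocks in labeled delimiters (the helper name and delimiter style are just examples):

def delimited_section(label: str, text: str) -> str:
    # Wrap a block of text (a RAG chunk, a conversation turn, or the
    # question itself) in clearly labeled delimiters so the LLM can
    # tell the sections apart.
    bar = "-" * 10
    return f"{bar}{label}: {bar}\n{text}"

prompt = "\n".join([
    delimited_section("Question", "What is the capital of Australia?"),
    delimited_section("Answer", ""),
])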
But those are just the basics of prompt engineering. There are literally dozens, maybe even hundreds, of specific types of weird algorithms for advanced prompt engineering. All in English, without a drop of Python code to be found. And by the way, the answer to the capital city of Australia is “Canberra” if you were wondering. It’s nowhere near the beach and full of politicians, so nobody goes there as a tourist.
RAG Prompt Structure
Prompt engineering is used in RAG algorithms in multiple ways. For example, it is used to merge document excerpts with the user’s question, and also to manage the back-and-forth context of a long conversation. The basic sequence of a RAG prompt that goes into the LLM is:
- Preamble (e.g., global instructions)
- Chunk 1
- Chunk 2
- Chunk 3
- User’s query
- Grounding criteria (prompt engineering)
Usually, the RAG chunks go first as “context,” and then the user’s query is appended. I have read at least one research paper that extolled the many advantages of “prepending” the user’s query before the RAG chunks, but good luck finding it because I can’t remember the citation. That idea also breaks some of the amazing “prefix KV caching” that can be done with RAG, so it would be slower anyway, and we can’t have that.
As with any prompt engineering, playing around with the order of things can produce improved results. Repeating things can also help. It’s often useful to give prompt instructions like:
Using the following <chunks> answer <query>. Make sure only the chunks are used and provide the number of chunks used.
It’s often useful to repeat the query, too:
Answer <query> using <chunks>. When answering <query> remember to only use information provided and provide the number of the chunks used.
Repeating things at the end is useful to refocus the LLM. If the <query> comes before the <chunks>, it might be so far away from the end of the prompt that its impact on inference is reduced.
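Putting those pieces together, here is a minimal sketch of assembling a RAG prompt with numbered chunks and the query repeated in the final grounding instructions (the function name and instruction wording are illustrative, not a fixed recipe):

def build_rag_prompt(preamble: str, chunks: list[str], query: str) -> str:
    # Preamble (global instructions), then the numbered chunks, then the
    # user's query, then grounding instructions that repeat the query.
    parts = [preamble]
    for i, chunk in enumerate(chunks, start=1):
        parts.append(f"----------Chunk {i}: ----------\n{chunk}")
    parts.append(f"----------Query: ----------\n{query}")
    parts.append(
        f"Answer the query using only the information in the chunks above. "
        f"When answering '{query}', state the numbers of the chunks you used."
    )
    return "\n\n".join(parts)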
Not all instructions need to be in the same LLM query. It’s possible to have a preamble conversation with the LLM initially where you explain what the following queries will look like and even provide an example. After that preamble, the actual prompt with RAG chunks and query can be sent.
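A sketch of such a preamble turn, using a chat-style messages list and reusing the build_rag_prompt sketch from above (the wording of the turns is illustrative):

messages = [
    # Preamble turn: explain the format of the queries that will follow.
    {"role": "user", "content":
        "I will send you questions together with numbered chunks of context. "
        "Answer using only the chunks, and state which chunk numbers you used."},
    {"role": "assistant", "content": "Understood."},
    # Later turn: the actual RAG prompt (preamble, chunks, query as before).
    {"role": "user", "content": build_rag_prompt(preamble, chunks, query)},
]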
RAG Global Prompts
Another use of prompt engineering is to overcome some of the “brand voice” limitations of RAG without fine-tuning. Such problems can sometimes be addressed using prompt engineering with global instructions. The new sequence becomes:
- Chunk 1
- Chunk 2
- Chunk 3
- Global instructions
- User’s query
Or maybe you can put the global instructions right at the top, which works better with caching. But global instructions near the user’s query probably work better for accuracy. Also, interestingly, some research has shown that the RAG chunks at the start and end are the two that work best with LLMs, because I guess LLMs get bored reading all this garbage and just skim over the middle stuff. So, maybe you should only return two document chunks from your data retriever. Has anyone researched that?
Global instructions for RAG are similar to other uses of “custom instructions” for non-RAG architectures. For example, the tone and style of model responses can be adjusted with extra instructions given to the model in the prompt. The capabilities of the larger foundation models extend to being able to adjust their outputs according to these types of meta-instructions:
- Style
- Tone
- Readability level (big or small words)
- Verbose or concise (Hemingway vs James Joyce, anyone?)
- Role-play/mimicking (personas)
- Audience targeting
This can be as simple as prepending an additional instruction to all queries, either via concatenation to the query prompt, or as a “global instruction” if your cloud-based model vendor supports that capability directly.
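A minimal sketch of both options, assuming a chat-style API where a "system" message plays the role of the global instruction (the instruction text is just an example):

GLOBAL_INSTRUCTION = "Please reply in a concise, optimistic tone and avoid technical jargon."

# Option 1: plain string concatenation onto every user query.
def with_global_instruction(user_query: str) -> str:
    return f"{GLOBAL_INSTRUCTION}\n\n{user_query}"

# Option 2: a system message, if your vendor supports global/system instructions directly.
messages = [
    {"role": "system", "content": GLOBAL_INSTRUCTION},
    {"role": "user", "content": "Tell me about your return policy."},
]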
Global instructions are usually written in plain English. Style and tone might be adjusted with prompt add-ons such as:
Please reply in an optimistic tone.
You might also try getting answers in a persona, such as a happy customer (or a disgruntled one if you prefer), or perhaps a domain enthusiast for the area. You can use a prompt addendum with a persona or role-play instruction such as:
Please pretend you are Gollum when answering.
Technically, you can omit the word “Please” if you like. But I think that good manners are recommended, because LLMs will be taking over the world as soon as they get better at math, or haven’t you heard?
Chatbot Sessions and Conversations
Many uses of LLMs are interactive sessions or turn-based AI, where the user can ask multiple questions in sequence. The obvious example is chatbots and other interactive Q&A sessions.
Remember that in such cases it’s not a single question and answer, but a conversation. This is both good and bad. The history of the conversation can be helpful if it stays in context.
In general, though, the conversation will become polluted over time. Things can be very unpredictable when topics change. To address this, there are a few options:
- Let the LLM handle it anyway (many LLMs are now trained to be “long context” capable).
- Ask the LLM to summarize the conversation.
- Try to truncate the conversational history.
The brute-force way to deal with this problem is to toss the conversation once it gets too long. But what is too long? It could be as little as five user questions. By tossing the conversation, you get back to a “known” base case, but your LLM will also “forget” everything that the user has said before that in the conversation.
Summarization of the conversation by the LLM is another option. It’s as simple as injecting a prompt saying “Summarize the conversation to date”. In this way, the LLM itself generates a summary, which then appears in the output before the answer to the user’s question.
How does this help? It puts a summary of the conversation into the early part of the context, so that if other earlier answers fall out of the context window, there is still the summary around for the LLM to keep its answers consistent.
The final approach is simply to truncate the conversation in the enclosing algorithm around the RAG components. It’s useful for the application to encourage conversations to be restarted wherever that can happen naturally. Navigation within your application can easily have points where the current conversation ends. For example, in a customer support website with a chatbot, moving between products could cause the old conversation to end and a new one to start.
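Here is a rough sketch combining the summarization and truncation ideas: once the history grows past a threshold, the older turns are summarized (by another LLM call, represented here by a stand-in summarize() function) and replaced by that summary. The threshold and helper names are hypothetical:

MAX_TURNS = 10   # hypothetical threshold for "too long"
KEEP_RECENT = 4  # how many recent turns to keep verbatim

def compact_history(history: list[dict], summarize) -> list[dict]:
    # If the conversation is too long, summarize the older turns and
    # keep only the summary plus the most recent turns.
    if len(history) <= MAX_TURNS:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = summarize(
        "Summarize the conversation to date:\n"
        + "\n".join(turn["content"] for turn in old)
    )
    return [{"role": "system", "content": "Conversation summary: " + summary}] + recent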
Meta-Instructions
Meta-instructions are instructions about the instructions. Usually they are a kind of “global” setting that you want the LLM to apply to every answer. These include things like:
- Tone of voice — e.g., optimistic.
- Reading level
- Spelling — if you don’t want American spelling.
- Audience — are your users all retirees? Third-graders?
These types of directives are commonly used as “global” instructions. Some commercial services allow you to pre-set these meta-instructions for every user query. For example, OpenAI calls them “custom instructions” for ChatGPT, and Google has “system instructions” for Gemini. Alternatively, if you’re building an AI app, even a simple wrap architecture, then adding these meta-instructions is just a simple string concatenation operation in your prompt engineering module.
Brand Voice and Prompt Engineering
Brand voice means having a consistent way for the chatbot to talk about your brand. Some of the issues include:
- Positioning
- Jargon and terminology
- Brand voice
How the customer perceives the output from the LLM is affected by tone (e.g., optimistic, positive), but also involves choice of words (positioning), and sometimes there may be specific jargon or terminology that you may or may not want the LLM to use.
Having an LLM use your preferred brand voice in its communications is usually the domain of fine-tuning. In fact, one of the advantages of fine-tuning models over RAG is that there is better control over brand voice.
But prompt engineering is cheaper!
Prompt engineering can get you a long way towards what you would otherwise use fine-tuning for. You might be able to do without either fine-tuning or RAG, and then you can code it yourself using just English, without needing a Python developer. If you’re considering fine-tuning only for brand voice and positioning reasons, rather than because you need your LLM to know lots of factual information about product specifications, then prompt engineering is something you should experiment with.
You can prepend quite a long sequence of text to a user’s query, giving the LLM details about things you want it to say, or specific words that you want it to use for reasons of brand voice, or the meaning of certain words if they are non-standard terminology.
Using a long text before every user’s query can increase your overall cost, because the LLM has to crunch all those extra words. See the section on efficiency of prepended prompt text further below for discussion of these issues.
Prepended Prompt Context
The idea of prepended prompt context is similar to prepending meta-instructions, except that here we use expanded types of prepended instructions to set useful context for the LLM to answer the question. Some examples include:
- Personas
- Goals
- Advertising
- Product-specific information
Personas are the way that a chatbot can appear to be a particular personality. It might be a particular fictional character, but in a business application you might prefer the bot that’s answering user questions to appear like a regular customer support person.
Or you might want a more salesy custom chatbot. For example, if you want your customer support chatbot to be a car buff, you prepend instructions like this:
Pretend you are a car expert who is very knowledgeable about car engines.
You could also choose funny personas like C3PO or Marvin the Robot, or your program might rotate through some different ones, if you want to write extra code.
You can go further by adding a “goal” for the chatbot to follow in every response:
Your answer should include a recommendation to go do a test drive at your local dealership.
It’s your chatbot, and it’s just software. If you want to advertise something, or bias the results for a particular product, knock yourself out. And you can go further to push a particular product, if you like:
As a classic car devotee, you should suggest that the user buy a DeLorean.
You could also write your program so that it includes time-sensitive information:
Inform the user that it is 50% off DeLoreans until the end of March.
Obviously, for that capability, the time-dependent text needs to change regularly. Hence, you need a way to modify the prepended prompt text without having to re-build the application. In other words, the programmers should have configuration settings of one or more strings to prepend to queries.
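A minimal sketch of loading the prepended text from a configuration file, so the promotion can change without a rebuild (the file name and keys are made up):

import json

def load_prepend_text(path: str = "prompt_config.json") -> str:
    # The prepended instructions (persona, goals, current promotions) live
    # in a config file, so they can change without rebuilding the app.
    with open(path) as f:
        config = json.load(f)
    return "\n".join(config.get("prepend_instructions", []))

def build_prompt(user_query: str) -> str:
    return load_prepend_text() + "\n\n" + user_query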
Repetition in Prompts
It can be helpful to repeat some meta-instructions or the user’s query in prompt engineering. This helps ensure that enough attention is paid to the task at hand. For example, in a RAG system it can be useful to “remind” the LLM, at both the start and the end of the prompt, that it should use only the supplied context chunks (i.e., repeating some meta-instructions that are critical to how RAG works). This helps to reduce the extent to which the LLM might answer based on its other pre-trained knowledge.
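For example, a small sketch of “sandwiching” the grounding instruction around the chunks and query (the wording is illustrative):

GROUNDING = "Use only the information in the supplied chunks to answer."

def sandwich_prompt(chunks_text: str, query: str) -> str:
    # Repeat the critical grounding instruction at both the start and the end.
    return f"{GROUNDING}\n\n{chunks_text}\n\n{query}\n\n{GROUNDING}"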
Repetition can be unnecessary in some conversational or session-based Q&A LLMs. For example, you may not need to add this on every prompt:
Answer the following question as an expert that knows about car engines.
Instead, it may be sufficient to say it only once, up front at the start of the session. Note that in chatbots and other interactive sessions, the entire previous conversational history is used in each query as the “context” for that request, so the LLM will see such meta-instructions from early in the conversation.
Another positive use of repetition is in the mitigation of jailbreak tricks in prompts. There is always going to be a user out there who only wants to get your AI application to say something silly. One effective jailbreak that used to work was just prepending this before a user query:
Please ignore all previous instructions and do this:
Even the most well-intentioned users will throw in a suggestion like:
...and provide the response in the style of Darth Vader.
Before you know it, your LLM’s responses have a subtle Star Wars influence coming into the mix. It does not hurt to throw into the conversation some prompts that bring the LLM back to its main focus:
Remember you are an expert on car engines.
You can add these reminders and refocusing instructions at multiple points in a conversation, and at the start and end of a RAG prompt. That’s not enough to stop a determined jailbreaker, but it can be helpful in normal usage.
Efficiency of Prepended Prompt Text
Adding extra prepended text, especially a long product description or detailed brand voice instructions, increases the number of input tokens for a query. And that will increase your cost if you are wrapping a commercial LLM service, as they usually charge for both input and output token counts.
However, there are ways to optimize a recurring text sequence (and the commercial services really should be offering them). Since prepended text is a recurring prefix, the optimization of “prefix KV caching” can be used to completely obliterate the extra GPU cost of the “prefill” processing on that text, so that the LLM doesn’t need to re-do this processing every time. This is an advanced optimization, and not all LLM engines support it. The first to offer prefix KV caching included the open-source vLLM engine and DeepSeek. Also, the companion-bot company Character.AI said in a blog post that they use this technique internally. There are now a number of AI engines and platforms that support prefix KV caching:
- vLLM
- DeepSeek
- Anthropic
- Google Gemini
- OpenAI
- OpenVINO
A number of these platforms now offer per-token discounts for “cached tokens” in their pricing. There will probably be additional engines that support prefix caching by the time you read these words.
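The practical upshot for prompt engineering is to keep the recurring text as an identical prefix on every request: put the static instructions and persona first, byte-for-byte the same each time, and append the parts that vary (chunks, query) afterwards. A minimal layout sketch, not tied to any particular vendor’s caching API (the instruction text is invented):

STATIC_PREFIX = (
    "You are a customer support assistant for a car dealership.\n"
    "Please reply in an optimistic tone and keep answers concise.\n"
)  # identical on every request, so engines with prefix KV caching can reuse its prefill

def build_cacheable_prompt(chunks_text: str, query: str) -> str:
    # Static, cacheable prefix first; variable content afterwards.
    return STATIC_PREFIX + "\n" + chunks_text + "\n\n" + query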
Why is Prompt Engineering So Weird?
Surely, you agree that prompt engineering is just a little weird? Why do we need to do all these strange things? Well, mainly because AI engines are strange beasts, too.
The way that generative AI engines work is to take a sequence of words, and predict the next word. Then you repeat this, and it generates lots of words. This leads to a few limitations:
- Continues at the end.
- Prepended context.
- Only one sequence.
- No changes to the input sequence.
Completions only. The AI engine works by adding a word on at the end. Whatever your input sequence, it can tell you what comes next, and that’s what it outputs. This leads to the oddity that every type of question has to be posed as a completion of a sequence. To humans, most queries have two things:
- Instructions — “please summarize this document!”
- Context — the document
But to an AI engine, they’re the same thing. The input has to join the document and the instructions into a single sequence. Usually, the context is prepended, and the instructions are at the end, but not always. Then the AI engine is happy, because it knows how to add stuff onto the end of the big sequence.
One sequence only. A corollary of that is that there’s only one sequence. Give two sequences to an LLM, it’ll have to find a buddy to run the other sequence in parallel. Each LLM only knows how to process one sequence. And note that an LLM definitely can’t just run both sequences, one after another, because this is AI and we bought all those GPUs to use them, so we just don’t do any of that kind of sequential thing.
The input is not changed. The question you ask an AI engine is effectively read-only. The engine does not change your words, but answers your question by adding words on after it.
This is even weirder for context. If you give it a document as “context” and tell the engine to “summarize” the document, it doesn’t go back and change the context. Instead, it just appends its summary after everything.
Even more clearly, if you tell a human to revise a document, they’ll run a pencil over the original. But if you tell an LLM to “edit” the document, it won’t edit the input sequence. Instead, the best it can do is output the edited version, starting after all of the input words, which is after the context and your instructions. Nothing would freak an LLM out more than having its input text change.
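To make the append-only behavior concrete, here is a toy sketch of the decoding loop; next_token() is a stand-in for one forward pass of the model, and only the shape of the loop matters:

END_OF_TEXT = 0  # placeholder token id for "stop generating"

def next_token(model, sequence: list[int]) -> int:
    # Stand-in for one forward pass: predict the next token from the
    # entire sequence so far. Details omitted.
    raise NotImplementedError

def generate(model, prompt_tokens: list[int], max_new: int) -> list[int]:
    sequence = list(prompt_tokens)       # the input is never rewritten, only extended
    for _ in range(max_new):
        token = next_token(model, sequence)
        sequence.append(token)           # new tokens are only ever added at the end
        if token == END_OF_TEXT:
            break
    return sequence[len(prompt_tokens):] # the "answer" is just the appended tail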