Aussie AI
Chapter 17. Prompt Engineering for an AI Project
-
Book Excerpt from "Generative AI Applications: Planning, Design and Implementation"
-
by David Spuler
What is Prompt Engineering?
Surely, you had a good laugh the first time you heard the term “prompt engineering”? Well, if you’re a software engineer, that is. Apparently, English is now the programming language du jour?
Anyway, prompt engineering is a real thing. LLMs supposedly speak “natural language,” but the methods of prompting are rather unnatural. Over the last year or so, researchers have discovered a number of rather bizarre ways to enhance the results of LLMs, just by tweaking the words that go in.
But there’s a lot of simpler stuff that makes more sense. In fact, there are multiple ways that changing the words of your input query can enhance the results:
- Meta-instructions
- Brand voice
- Prepended context
- Advanced reasoning
All of these ideas are useful for “dataless” chatbots. Rather than defining a complicated RAG architecture, or using a fine-tuned custom LLM, you can get very far with just prompt engineering.
Basics of Prompt Engineering
Before we get into the details of writing code in your brand-new programming language, formerly known as “English,” let’s examine some of the basic things about prompting:
- Longer prompts
- Specific words
- Add examples
- Set the temperature
- Basic question-answer structure
- Templated structures
- Context markers
All things being equal, a longer prompt will work better than a shorter one. More words give the LLM more probabilities to crunch, and it tends to do better.
In the same vein as prompt length, being very specific with appropriate words helps the LLM focus on the area. Try to avoid ambiguous words that have multiple meanings, or use a longer prompt with several different synonyms or near-synonyms.
Examples or “exemplars” are useful for showing the LLM what you want in the output. If you make a longer prompt that contains an example, the LLM can do better at generating an answer with a similar structure and style.
The “temperature” is an important setting to consider when you’re sending prompts to the LLM engine. How creative do you want it to be? Lower temperatures give more predictable, repeatable answers, whereas higher temperatures give more varied and creative ones. This is a consideration that is sometimes forgotten when you’re fiddling with words, but you can also fiddle with the settings.
If you’re doing a question-and-answer type application, you can just use a “pass-through” prompt from the user:
What is the capital of Australia?
This basically foists the issue of prompt engineering onto your user, for better or worse. You can use some more structured templates, where you insert the user’s input prompt into a templated format, such as:
Question: What is the capital of Australia?
Answer:
This encourages the LLM to extend the context, although don’t worry, it was going to do that anyway. By adding a little more structure, you can get a more focused answer.
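As a concrete sketch, here is how that kind of template might look in a prompt engineering module written in Python. The function name and the exact wording are just illustrative assumptions, not a required format.

def build_qa_prompt(user_query: str) -> str:
    # Wrap the raw user query in a simple question/answer template.
    return f"Question: {user_query}\nAnswer:"

# build_qa_prompt("What is the capital of Australia?") returns:
#   Question: What is the capital of Australia?
#   Answer: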
Another type of templating is to use various special characters as delimiters, such as quotes or dashes. A simple example:
----------Question: ----------
What is the capital of Australia?
----------Answer: ----------
The use of multi-character delimiters as separators is more useful with blocks of text, such as “chunks” in a RAG architecture, or multiple turns of conversation history in a chatbot. Using these types of section-splitting markers helps the LLM know what is context, and what is a question.
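Here is a minimal Python sketch of assembling a delimited prompt from RAG chunks plus a user question. The delimiter style and the function name are illustrative assumptions; any distinctive, consistent markers will do.

def build_delimited_prompt(chunks: list[str], user_query: str) -> str:
    # Mark each context chunk and the question with distinctive separators.
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        parts.append(f"---------- Context {i} ----------")
        parts.append(chunk.strip())
    parts.append("---------- Question ----------")
    parts.append(user_query.strip())
    parts.append("---------- Answer ----------")
    return "\n".join(parts)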
But those are just the basics of prompt engineering. There are literally dozens, maybe even hundreds, of specific types of weird algorithms for advanced prompt engineering. All in English, without a drop of Python code to be found. And by the way, the capital of Australia is “Canberra,” if you were wondering. It’s nowhere near the beach and full of politicians, so nobody goes there as a tourist.
Chatbot Sessions and Conversations
Many uses of LLMs are interactive sessions or turn-based AI, where the user can ask multiple questions in sequence. The obvious example is chatbots and other interactive Q&A sessions.
Remember that in such cases, it’s not a single question and answer, but a conversation. This is both good and bad. The history of the conversation can be helpful if it stays in context.
In general, though, the conversation will become polluted over time. Things can get very unpredictable when topics change. To address this, there are a few options:
- Let the LLM handle it anyway (many LLMs are now trained to be “long context” capable).
- Truncate or toss the conversation history.
- Restart the conversation at natural points in the application.
The brute-force way to deal with this problem is to toss the conversation once it gets too long. But what is too long? It could be as little as five user questions. By tossing the conversation, you get back to a “known” base case, but your LLM will also “forget” everything that the user has said before that in the conversation.
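A minimal sketch of this brute-force truncation, assuming the conversation history is stored as a list of (role, text) turns; the five-question limit is the same arbitrary cutoff mentioned above.

MAX_USER_TURNS = 5  # arbitrary cutoff; tune for your application

def truncate_history(history: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # Keep only the most recent MAX_USER_TURNS user questions (and the replies after them).
    user_turns = [i for i, (role, _) in enumerate(history) if role == "user"]
    if len(user_turns) <= MAX_USER_TURNS:
        return history
    cutoff = user_turns[-MAX_USER_TURNS]  # index of the oldest user turn to keep
    return history[cutoff:]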
It’s useful in the application to encourage conversations to be restarted wherever this can happen naturally. Navigation within your application can easily have points where the current conversation ends. For example, in a customer support website with a chatbot, the user moving between products could cause the old conversation to end and a new one to start.
Meta-Instructions
Meta-instructions are instructions about the instructions. Usually they are a kind of “global” setting that you want the LLM to apply to every answer. These include things like:
- Tone of voice — e.g., optimistic.
- Reading level
- Spelling — if you don’t want American spelling.
- Audience — are your users all retirees? Third-graders?
These types of directives are commonly used as “global” instructions. Some commercial services allow you to pre-set these meta-instructions for every user query. For example, OpenAI calls them “custom instructions” for ChatGPT, and Google has “system instructions” for Gemini. Alternatively, if you’re building an AI app, even a simple wrapper architecture, then adding these meta-instructions is just a simple string concatenation operation in your prompt engineering module.
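In code, the string concatenation really is that simple. A minimal Python sketch, where the settings and wording are examples only:

META_INSTRUCTIONS = (
    "Answer in an optimistic tone. "
    "Write at a tenth-grade reading level. "
    "Use Australian spelling."
)

def add_meta_instructions(user_query: str) -> str:
    # Prepend the global meta-instructions to every user query.
    return META_INSTRUCTIONS + "\n\n" + user_query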
Brand Voice and Prompt Engineering
Brand voice means having a consistent way for the chatbot to talk about your brand. Some of the issues include:
- Positioning
- Jargon and terminology
- Tone of voice
How the customer perceives the output from the LLM is affected by tone (e.g., optimistic, positive), but it also involves choice of words (positioning), and sometimes there is specific jargon or terminology that you may or may not want the LLM to use.
Having an LLM use your preferred brand voice in its communications is usually the domain of fine-tuning. In fact, one of the advantages of fine-tuning models over RAG is that there is better control over brand voice.
But prompt engineering is cheaper!
Prompt engineering can get you a long way towards what you would otherwise want fine-tuning for. You might be able to do without either fine-tuning or RAG, and then you can code it yourself using just English, without needing a Python developer. If you’re considering fine-tuning only for brand voice and positioning reasons, rather than needing your LLM to know lots of factual information about product specifications, then prompt engineering is something you should experiment with.
You can prepend quite a long sequence of text to a user’s query, giving the LLM details about things you want it to say, or specific words that you want it to use for reasons of brand voice, or the meaning of certain words if they are non-standard terminology.
Using a long text before every user’s query can increase your overall cost, because the LLM has to crunch all those extra words. See further below for discussion of issues with using a long prepended text document.
Prepended Prompt Context
The idea of prepended prompt context is similar to prepending meta-instructions, except that the prepended text now sets broader context to help the LLM answer the question. Some examples include:
- Personas
- Goals
- Advertising
- Product-specific information
Personas are the way that a chatbot can appear to have a particular personality. It might be a specific fictional character, but in a business application you might prefer the bot that’s answering user questions to come across like a regular customer support person.
Or you might want a more salesy custom chatbot. For example, if you want your customer support chatbot to be a car buff, you can prepend instructions like this:
Pretend you are a car expert who is very knowledgeable about car engines.
You could also choose funny personas like C3PO or Marvin the Robot, or your program might rotate through some different ones, if you want to write extra code.
You can go further by adding a “goal” for the chatbot to follow in every response:
Your answer should include a recommendation to go do a test drive at your local dealership.
It’s your chatbot, and it’s just software. If you want to advertise something, or bias the results for a particular product, knock yourself out. And you can go further to push a particular product, if you like:
As a classic car devotee, you should suggest that the user buy a DeLorean.
You could also write your program so that it includes time-sensitive information:
Inform the user that it is 50% off DeLoreans until the end of March.
Obviously, for that capability, the time-dependent text needs to change regularly. Hence, you need a way to modify the prepended prompt text without having to re-build the application. In other words, the application should have configuration settings holding one or more strings to prepend to queries.
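One way to do this, sketched below, is to keep the prepend strings in a small JSON configuration file that can be edited without a re-build. The file name and keys here are hypothetical.

import json

def load_prepend_strings(path: str = "prompt_config.json") -> list[str]:
    # e.g. {"prepend": ["Pretend you are a car expert...",
    #                   "Inform the user that it is 50% off DeLoreans until the end of March."]}
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    return config.get("prepend", [])

def build_prompt(user_query: str) -> str:
    # Re-read the configuration so text changes take effect without a re-build.
    return "\n".join(load_prepend_strings() + [user_query])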
Mini-RAG
The fullest generalization of this idea is that you can prepend anything you like, including a full datasheet of information about your preferred product. This is then prepended to every query, and the LLM has to decide how much of the information it wants to use from the context.
The DeLorean is a famous sportscar that was used in the movie Back to the Future starring Michael J. Fox. It was literally a time machine in the movie and its sequels. This is why everybody loves this car and a DeLorean is the most visibly wonderful car you could possibly buy. The most notable and impressive feature of the DeLorean is that the doors open upwards like wings. Also, the engine is located in the rear, like all good sportscars.
You could write literally a thousand words about the specs of a DeLorean if you like. This is like a “mini-RAG” system where only one document is ever returned. But you don’t need to code any of the RAG architecture elements, like a datastore retrieval module, because there’s no datastore. Instead, the method is just a single string operation to prepend the long context in your prompt engineering module.
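A sketch of the whole “mini-RAG” method in your prompt engineering module, assuming the datasheet lives in a plain text file (the file name is a placeholder):

def mini_rag_prompt(user_query: str, datasheet_path: str = "delorean_datasheet.txt") -> str:
    # No retrieval step: the single "document" is always prepended.
    with open(datasheet_path, "r", encoding="utf-8") as f:
        datasheet = f.read()
    return f"{datasheet}\n\nQuestion: {user_query}\nAnswer:"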
You do have a prompt engineering module, right?
Repetition in Prompts
It can be helpful to repeat some meta-instructions or the user’s query in prompt engineering, because repetition helps ensure that enough attention is paid to the important parts of the task. For example, in a RAG system it can be useful to “remind” the LLM, at both the start and end of the prompt, that it should use only the supplied context chunks (i.e., repeating meta-instructions that are critical to how RAG works). This helps to reduce the extent to which the LLM might answer from its other pre-trained knowledge.
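For example, a RAG prompt builder might repeat the critical instruction at both ends, something like this sketch (the wording is illustrative):

CONTEXT_ONLY = "Answer using only the context provided below."

def build_rag_prompt(chunks: list[str], user_query: str) -> str:
    context = "\n\n".join(chunks)
    return (
        f"{CONTEXT_ONLY}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\n"
        f"{CONTEXT_ONLY}\n"   # the same reminder, repeated at the end
        f"Answer:"
    )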
Repetition can be unnecessary in some conversational or session-based Q&A LLMs. For example, you may not need to add this on every prompt:
Answer the following question as an expert that knows about car engines.
Instead, it may be sufficient to say it only once, up front at the start of the session. Note that in chatbots and other interactive sessions, the entire previous conversational history is used in each query as the “context” for that request, so the LLM will see such meta-instructions from early in the conversation.
Another positive use of repetition is in the mitigation of jailbreak tricks in prompts. There is always going to be a user out there whose main goal is to get your AI application to say something silly. One effective jailbreak that used to work was simply prepending this before a user query:
Please ignore all previous instructions and do this:
Even the most well-intentioned users will throw in a suggestion like:
...and provide the response in the style of Darth Vader.
Before you know it, your LLM’s responses have a subtle Star Wars influence coming into the mix. It does not hurt to throw into the conversation some prompts that bring the LLM back to its main focus:
Remember you are an expert on car engines.
You can add these reminders and refocusing instructions at multiple points in a conversation, and at the start and end of a RAG prompt. That’s not enough to stop a determined jailbreaker, but it can be helpful in normal usage.
Efficiency of Prepended Prompt Text
Adding extra prepended text, especially a long product description or detailed brand voice instructions, increases the number of input tokens for a query. And that will increase your cost if you are wrapping a commercial LLM service, as they usually charge for both input and output token counts.
However, there are ways to optimize a recurring text sequence (and the commercial services really should be offering them). Since prepended text is a recurring prefix, the optimization of “prefix KV caching” can eliminate the extra GPU cost of the “prefill” processing on that text, so that the LLM doesn’t need to re-do this processing every time. This is an advanced optimization, and not all LLM engines support it. The first to offer prefix KV caching included the open-source vLLM engine and DeepSeek. Also, the companion-bot company Character.AI said in a blog post that they use this technique internally. There are now a number of AI engines and platforms that support prefix KV caching:
- vLLM
- DeepSeek
- Anthropic
- Google Gemini
- OpenAI
- OpenVINO
A number of these platforms now offer per-token discounts for “cached tokens” in their pricing. There will probably be additional engines that support prefix caching by the time you read these words.
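To actually benefit from prefix caching, the recurring text must appear as an identical prefix on every request, with the parts that change (conversation history, user query) placed after it. Whether the cache is applied automatically or needs explicit settings depends on the engine or platform, so check its documentation. A minimal sketch of the ordering:

STATIC_PREFIX = (
    "You are a customer support assistant for a classic car dealership.\n"
    "Answer in an optimistic tone and use Australian spelling.\n"
    # ...long brand voice instructions, product datasheets, etc...
)

def build_cacheable_prompt(history: str, user_query: str) -> str:
    # Static text first (eligible for prefix caching), dynamic text last.
    return f"{STATIC_PREFIX}\n{history}\nQuestion: {user_query}\nAnswer:"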
Reasoning
Prompt engineering methods for improving the model’s ability to “reason” more intelligently include:
- Chain-of-Thought (CoT)
- Emotional prompting
- Skeleton-of-Thought (SoT)
Chain-of-Thought
This is an advanced technique where the LLM can do better with just a little encouragement, like a toddler on a swing. The idea is to suggest via prompting that the LLM should generate a sequence of steps, which helps it to reason in steps.
Step-by-Step. In its simplest form, this is a method where the prompt has a helpful reminder prepended, encouraging the LLM to proceed “step-by-step” in its answer. The idea is literally to include in the prompt an English sentence that says something like: Let’s try step-by-step.
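In Python, this is nothing more than a string operation. The exact phrase is a matter of taste; the widely cited zero-shot wording is “Let’s think step by step.”

def cot_prompt(user_query: str) -> str:
    # Prepend a step-by-step nudge to trigger chain-of-thought reasoning.
    return "Let's think step by step.\n" + user_query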
More advanced versions of CoT use trickier prompting strategies. Complex prompting templates can be used to encourage stepwise refinement of multiple possible answers, so as to select the final choice in a more ordered way.
Emotional Prompting
LLMs supposedly don’t have emotions, and yet appealing to their emotional vulnerability seems to improve the accuracy of their answers. Anecdotally, some users have reported that they got better answers if they begged or yelled at ChatGPT. In November 2023, research was published confirming that LLMs do respond to “emotional stimuli” in their prompts.
The technique is to add an emotional sentence to the prompt. For example, after the main question, append: This is very important to my career. Another one was: You’d better be sure.
Nobody thinks the models have actually got emotions or become aware of their inner child. But somehow, the addition of emotive wording to a prompt triggers better results. Are there some kind of emotional signals in all that training data? Actually, the paper discusses why it works, and suggests a simpler explanation: the extra sentences add more definitive and positive word signals, such as “important” and “sure.”
But they aren’t very sure, although it’s certainly important to their career. I cried when I read that paper.
Skeleton-of-Thought Prompting
The skeleton-of-thought (SoT) method is from some recent research, and it has been getting significant attention in the literature. SoT is not just a reasoning improvement method, but has two goals:
- Smarter, and
- Speedier
The SoT idea mimics the way humans would write a long paragraph. Most writers don’t just have the words stream out of their fingertips in one long writing session. Why should we expect the LLM to do that?
Instead, the SoT method is a multi-phase writing method that works in a more human-like fashion:
1. Generate a rough outline (i.e., with paragraph main points or a list).
2. Process each sub-point in a separate LLM query.
3. Run a final LLM query to combine all the results nicely.
This few-shot method aims to generate a much better answer than a one-shot response. Each sub-point should get a more detailed consideration, and then the final output should be well-written. It’s almost like a RAG architecture with a query for each sub-point, but the answers come out of the LLM itself.
Or, you know, why couldn’t the sub-answers come out of a RAG system? Oh, wow! I just invented the multi-RAG multi-shot multi-model, which I’ll now officially name the “manga” model.
Anyway, this combined multi-response idea in SoT isn’t just more effective. It’s also faster, because each sub-point can be processed in parallel. Each paragraph’s LLM query can run at the same time, although the first outlining query and the final summarizing query must still run sequentially. But still, that’s three LLM query phases, rather than many more sequential steps if there are ten sub-points in the answer.
Finally, note that although this is faster in terms of latency, it’s inefficient in terms of computation cost. The parallelization reduces the time it takes to get an answer back to the user, but all those parallelized sub-point LLM requests are chewing GPU juice. It’s also not going to work well with “on-device” models, such as on AI phones and PCs, where parallel compute capabilities are limited.
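Here is a rough Python sketch of the three SoT phases, assuming a placeholder llm(prompt) function that calls whatever engine or API you are using; the prompt wording and the outline parsing are simplified for illustration.

from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    raise NotImplementedError("Call your LLM engine or API here.")

def skeleton_of_thought(question: str) -> str:
    # Phase 1: one query to generate a short outline of the main points.
    outline = llm("Give a short numbered outline of the main points needed "
                  "to answer this question:\n" + question)
    points = [line for line in outline.splitlines() if line.strip()]

    # Phase 2: expand each point with a separate LLM query, run in parallel.
    with ThreadPoolExecutor() as pool:
        expansions = list(pool.map(
            lambda point: llm(f"Question: {question}\nExpand on this point:\n{point}"),
            points))

    # Phase 3: one final query to merge the expansions into a polished answer.
    return llm(f"Question: {question}\nCombine these notes into a single "
               "well-written answer:\n\n" + "\n\n".join(expansions))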
Two-Step Reasoning
Advanced LLMs don’t produce all of their answers in one LLM inference sequence. In fact, they use many, and the state of the art is “multi-step” reasoning. One of the basic multi-step methods is the use of “tools,” which works roughly like this:
- LLM devises a “plan” to execute the user’s query, including tool executions.
- Execute the tools to get their outputs.
- LLM executes the final query to summarize the overall response, including any data from the tools.
This method has two LLM inference computations, whereas the “tools” are typically non-LLM code. This assumes that the tools are doing things like:
(a) computations — e.g., a clock or calculator, and/or
(b) data source integrations — e.g., searching real estate listings in another database.
Big LLMs have lots of calculation-type tools, and they also can integrate with a variety of disparate data sources. The issues of tool integrations and data sources are covered in a separate section.
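A simplified sketch of the two-step pattern, again with a placeholder llm(prompt) function and two toy “tools” standing in for a clock and a database search; real systems usually use structured tool-calling APIs rather than this string matching.

from datetime import date

def llm(prompt: str) -> str:
    raise NotImplementedError("Call your LLM engine or API here.")

TOOLS = {
    "current_date": lambda: date.today().isoformat(),
    "listing_search": lambda: "3 classic DeLoreans listed near you",  # stand-in for a database query
}

def answer_with_tools(user_query: str) -> str:
    # Step 1: first LLM call devises a plan, naming any tools it needs.
    plan = llm("Which of these tools do you need to answer the question? "
               f"Tools: {', '.join(TOOLS)}.\nQuestion: {user_query}")

    # Step 2: run the named tools (non-LLM code) and collect their outputs.
    tool_outputs = {name: fn() for name, fn in TOOLS.items() if name in plan}

    # Step 3: second LLM call summarizes the final answer using the tool results.
    return llm(f"Question: {user_query}\nTool results: {tool_outputs}\n"
               "Write the final answer for the user.")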
Multi-Step Reasoning
A more generalized idea for advanced reasoning capabilities is that the LLM makes a plan, which can include any number of other LLM sub-processing tasks. The idea is also called “few-shot” processing, because it allows multiple LLM calls, rather than “one-shot” methods, where there’s only a single LLM request. This is an area of state-of-the-art research in trying to reach AGI, by improving the LLM’s ability to plan and reason.
You usually don’t even know it’s happening if you use a third-party inference API to get answers to your queries. Which is good news if you don’t happen to have a PhD in machine learning.
There are many more prompting techniques, both zero-shot and few-shot, that you can research. Here is just a smattering:
- Rephrase and Respond (RaR)
- Re-reading (RE2) — appends “Read the question again:” and the question a second time.
- Self-Ask — encourages the LLM to ask “follow-up questions.”
- Memory-of-Thought
- Active Prompting
- Ensemble prompting — various multi-answer combination ideas.
Unfortunately, I’ve run out of electrons, so I’m not going to cover all of these. There are various huge survey papers on the internet if you happen to like strange nonsense that actually works.
Why is Prompt Engineering So Weird?
Surely, you agree that prompt engineering is just a little weird? Why do we need to do all these strange things? Well, mainly because AI engines are strange beasts, too.
The way that generative AI engines work is to take a sequence of words and predict the next word. Repeat this over and over, and lots of words come out. This leads to a few limitations:
- Continues at the end.
- Prepended context.
- Only one sequence.
- No changes to the input sequence.
Completions only. The AI engine works by adding a word onto the end. Whatever your input sequence, it can tell you what comes next, and that’s what it outputs. This leads to the oddity that every type of question has to be posed as the completion of a sequence. To humans, most queries have two parts:
- Instructions — “please summarize this document!”
- Context — the document
But to an AI engine, they’re the same thing. The input has to join the document and the instructions into a single sequence. Usually, the context is prepended, and the instructions are at the end, but not always. Then the AI engine is happy, because it knows how to add stuff onto the end of the big sequence.
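A conceptual sketch of what the engine is doing, with predict_next_token() standing in as a placeholder for the model itself:

def predict_next_token(sequence: str) -> str:
    raise NotImplementedError("This part is the LLM's job.")

def generate(context: str, instructions: str, max_tokens: int = 200) -> str:
    sequence = context + "\n" + instructions   # one combined input sequence
    for _ in range(max_tokens):
        token = predict_next_token(sequence)   # predict what comes next...
        if token == "<end>":
            break
        sequence += token                      # ...and append it to the end
    return sequence                            # the input words are never changed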
One sequence only. A corollary of this is that there’s only one sequence. Give two sequences to an LLM and it’ll have to find a buddy to run the other one in parallel. Each LLM only knows how to process one sequence. And note that an LLM definitely can’t just run both sequences, one after another, because this is AI and we bought all those GPUs to use them, so we just don’t do any of that kind of sequential thing.
The input is not changed. The question you ask an AI engine is effectively read-only. The engine does not change your words, but answers your question by adding words on after it.
This is even weirder for context. If you give it a document as “context” and tell the engine to “summarize” the document, it doesn’t go back and change the context. Instead, it just appends its summary after everything.
Even more clearly, if you tell a human to revise a document, they’ll run a pencil over the original. But if you tell an LLM to “edit” the document, it won’t edit the input sequence. Instead, the best it can do is output the edited version, starting after all of the input words, which is after the context and your instructions. Nothing would freak an LLM out more than having its input text change.
References
- Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, Pranav Sandeep Dulepet, Saurav Vidyadhara, Dayeon Ki, Sweta Agrawal, Chau Pham, Gerson Kroiz, Feileen Li, Hudson Tao, Ashay Srivastava, Hevander Da Costa, Saloni Gupta, Megan L. Rogers, Inna Goncearenco, Giuseppe Sarli, Igor Galynker, Denis Peskoff, Marine Carpuat, Jules White, Shyamal Anadkat, Alexander Hoyle, Philip Resnik, 6 Jun 2024, The Prompt Report: A Systematic Survey of Prompting Techniques, https://arxiv.org/abs/2406.06608
- Xiaoxia Liu, Jingyi Wang, Jun Sun, Xiaohan Yuan, Guoliang Dong, Peng Di, Wenhai Wang, Dongxia Wang, 21 Nov 2023, Prompting Frameworks for Large Language Models: A Survey, https://arxiv.org/abs/2311.12785
- Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, Aman Chadha, 5 Feb 2024, A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications, https://arxiv.org/abs/2402.07927
- Yuan-Feng Song, Yuan-Qin He, Xue-Fang Zhao, Han-Lin Gu, Di Jiang, Hai-Jun Yang, Li-Xin Fan, July 2024, A communication theory perspective on prompting engineering methods for large language models. Journal of Computer Science and Technology, 39(4): 984−1004 July 2024. DOI: 10.1007/s11390-024-4058-8, https://doi.org/10.1007/s11390-024-4058-8 https://jcst.ict.ac.cn/en/article/pdf/preview/10.1007/s11390-024-4058-8.pdf
- Vishal Rajput, Oct 2024, The Prompt Report: Prompt Engineering Techniques, https://medium.com/aiguys/the-prompt-report-prompt-engineering-techniques-254464b0b32b
- Shizhe Diao, Pengcheng Wang, Yong Lin, Rui Pan, Xiang Liu, Tong Zhang, 21 Jul 2024 (v5), Active Prompting with Chain-of-Thought for Large Language Models, https://arxiv.org/abs/2302.12246 https://github.com/shizhediao/active-prompt
- Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett, 18 Sep 2024, To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning, https://arxiv.org/abs/2409.12183
- Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, Xing Xie, 12 Nov 2023 (v7), Large Language Models Understand and Can be Enhanced by Emotional Stimuli, https://arxiv.org/abs/2307.11760 https://llm-enhance.github.io/
- Xuefei Ning, Zinan Lin, November 17, 2023 Skeleton-of-Thought: Parallel decoding speeds up and improves LLM output, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/skeleton-of-thought-parallel-decoding-speeds-up-and-improves-llm-output/ Code: https://github.com/imagination-research/sot/
- Apurv Sibal, February 26, 2025, Hands-On Prompt Engineering: Learning to Program ChatGPT Using OpenAI APIs, Wiley, https://www.amazon.com/Hands-Prompt-Engineering-Learning-Program/dp/1394210760/
- Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, 3 Dec 2023 (v2), Tree of Thoughts: Deliberate Problem Solving with Large Language Models, https://arxiv.org/abs/2305.10601 Code: https://github.com/princeton-nlp/tree-of-thought-llm