
Advanced Prompt Engineering - Practical Examples

Updated: Mar 14

Introduction

With the surge of LLMs with billions of parameters, such as GPT-4, PaLM 2, and Claude, came the need to steer their behavior so that they align with the task at hand. Simple tasks like sentiment analysis are generally considered well-addressed, but more elaborate tasks require teaching the models how to act for specific use cases.


One common way of achieving higher customization per task is fine-tuning, where the model learns how to adapt to a specific task and how it should respond. However, this process comes with drawbacks: cost, time to train, the need for in-house expertise, and the time invested by developers and researchers.


Having said that, there is another avenue for teaching the model which requires far fewer resources and far less know-how while still allowing the model to achieve its goals. This is known as Prompt Engineering, and it centers on perfecting the prompts we send to the models in order to increase their performance and align them with our expected outputs.


Prompt Engineering may in fact be considered a new type of programming, a new way to pass instructions to the program (the model). However, due to the ambiguity of each prompt-plus-model combination, more trial-and-error experimentation is required to fully extract the potential of these powerful models.

This blog post covers more complex, state-of-the-art methods in prompt engineering, including Chains and Agents, along with important concept definitions such as the distinction between the two.


This post will cover Advanced Prompt Engineering techniques. For more introductory material, please check our other blog post, Design Patterns in Prompt Engineering: A Practical Approach.


Single Prompt Techniques


Chains


Agents



Single Prompt Techniques

Let's start by going over techniques for improving answers with a single prompt; these can be easily leveraged in most tasks that do not require chaining or more complex architectures. These types of prompts serve as guidelines and provide intuition for the methods that follow.


Zero-Shot and Few-Shot

Prompts can follow a zero-shot, single-shot, or few-shot learning approach. Zero-shot prompting simply asks the model to perform a certain task and expects it to understand what is being asked and how it should answer. Few-shot prompting gives the model some examples of the desired behaviour first, and only then asks it to perform a task closely related to those examples.


The generative capabilities of LLMs are greatly enhanced by showing examples of what they should achieve. It is like the saying "Show, Don't Tell," though in this case we actually want both so that the message is as clear as it needs to be. One should tell the model clearly what is expected of it and then also provide it with examples. For more information and practical examples, check the previously mentioned blog post.
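To make this concrete, here is a minimal sketch of the difference between a zero-shot and a few-shot prompt for sentiment classification. The `call_llm` helper is a hypothetical stand-in for whichever LLM client you use, and the reviews are made up for illustration.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError("plug in your own model call here")

# Zero-shot: describe the task and rely on the model to infer the expected format.
zero_shot = (
    "Classify the sentiment of this review as positive or negative: "
    "'The battery died after two days.'"
)

# Few-shot: show the desired behaviour first, then ask for the new case.
few_shot = """Classify the sentiment of each review as positive or negative.

Review: 'Absolutely loved the camera quality.' -> positive
Review: 'The screen cracked within a week.' -> negative
Review: 'The battery died after two days.' ->"""

# answer = call_llm(few_shot)
```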





Generated Knowledge Prompting 📃

This method is intended for tasks related to commonsense reasoning and can significantly increase performance by helping the model recall details about the relevant concepts. It consists of asking the LLM to write out its knowledge about a certain topic before actually giving an answer. This helps extract knowledge that is embedded in the network's weights, so it usually works best for general-knowledge topics.


It is useful to extract knowledge about what's being asked from the LLM itself, reducing the likelihood of hallucinations.
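As a rough sketch of how this looks in practice, the snippet below first asks the model to write down relevant facts and then answers while conditioning on them. The `call_llm` function and the example question are hypothetical placeholders.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError("plug in your own model call here")

def generated_knowledge_answer(question: str) -> str:
    # Stage 1: ask the model to surface what it knows about the topic.
    knowledge = call_llm(
        f"List key facts you know that are relevant to this question:\n{question}"
    )
    # Stage 2: answer the question while conditioning on the generated knowledge.
    return call_llm(
        f"Knowledge:\n{knowledge}\n\nUsing the knowledge above, answer the question:\n{question}"
    )

# answer = generated_knowledge_answer("Is it safe to keep cooked rice at room temperature overnight?")
```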



EmotionPrompt 📃

This method emerged very recently (a couple of days before I wrote this blog post), and while I have not had time to test it myself, it appears to increase the capabilities of most LLMs. It is based on psychological emotional stimuli: it effectively puts the model in a high-pressure situation where it needs to perform correctly.

EmotionPrompt vs Original Prompt. Image retrieved from the paper linked above

The authors support the idea that LLMs have a grasp of emotional intelligence and that their performance can be improved with emotional prompts. Emotional stimuli are shown to enrich the original prompts' representation of crucial words when looking at input-token attention. Whether the models actually have a grasp of emotional intelligence is out of scope here, but the authors report that these stimuli improve performance by about 10%.
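In practice the technique is as simple as appending an emotional stimulus to an existing prompt, along the lines of the sketch below. The `call_llm` helper and the task are placeholders, and the exact stimuli used in the paper vary.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError

base_prompt = "Summarize the main risks mentioned in this contract clause: ..."

# EmotionPrompt: append an emotional stimulus to the original prompt.
emotion_prompt = base_prompt + " This is very important to my career."

# answer = call_llm(emotion_prompt)
```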


Chain of Density (CoD) 📃

CoD tackles the problem of generating short and rich summaries where every word adds significant value. CoD is composed of a series of iterative summaries initiated by a single prompt, in which the model is told to make each successive summary denser without making it longer. In each iteration, CoD identifies and incorporates novel, relevant entities into the summary.

Typical CoD process. The number of iterations can easily be adjusted.

Fun fact: although it was named Chain of Density by the original authors, it does not actually chain prompts; it only chains its outputs sequentially within a single initial prompt.
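The sketch below paraphrases the spirit of the CoD prompt in a simplified form; the wording is not the authors' exact prompt, and `call_llm` is a hypothetical helper.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError

article = "..."  # the text to be summarized

cod_prompt = f"""Article: {article}

You will generate increasingly dense summaries of the article above.
Repeat the following two steps 5 times:
1. Identify 1-3 informative entities from the article that are missing from the previous summary.
2. Rewrite the summary at the same length so it covers every previous entity plus the missing ones.

Return all 5 summaries."""

# summaries = call_llm(cod_prompt)
```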





Experimentation Process: LLMstudio

Crafting the perfect prompt can mean the difference between a mediocre result and an astonishingly insightful one. Finding the perfect prompt can require many iterative changes and a trial-and-error process. This process is time-consuming, and keeping track of all previous runs and changes can be painful, so a logging tool helps massively.


This is where external tools can aid in prompt engineering. We recommend LLMstudio, which offers both a UI and an SDK and also integrates with libraries such as LangChain.


LLMstudio is a free, open-source library curated by TensorOps: a game-changing platform that puts prompt engineering at your fingertips. It's more than a tool; it's a complete ecosystem designed to streamline your interactions with the most advanced language models available today.


For more details on what LLMstudio is and how to use it, read this short blog post or see the video below. If you use it or find it compelling, please drop a ⭐ on GitHub.



Chain of Thought (CoT) 📃

This method allows the model to break down a complex problem into manageable parts and address them before answering the user, akin to how a human would tackle a complex problem. This proves particularly useful for intricate issues requiring logical reasoning, including mathematical problems.


The core idea of CoT prompting is to have the model explain its thought process before answering. A basic approach is to simply add "Let's think step by step" after the question to facilitate the reasoning chain.

Typical CoT process

This CoT method is covered in detail in the previous blogpost about prompt engineering.
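A minimal zero-shot CoT example looks like the sketch below, where `call_llm` stands in for your model client and the question is just an illustration.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError

question = (
    "A store sold 45 apples in the morning and twice as many in the afternoon. "
    "How many apples were sold in total?"
)

# Zero-shot CoT: append a reasoning trigger so the model works through the steps before answering.
cot_prompt = f"{question}\nLet's think step by step."

# reasoning_and_answer = call_llm(cot_prompt)
```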



Use Cases and Compatible Models

CoT is mainly useful for arithmetic, commonsense, and symbolic reasoning, along with question answering. However, it should not be employed with just any model: CoT reasoning is an emergent ability of LLMs that researchers think arises from scaling models beyond roughly 100 billion parameters. It does not positively impact the performance of smaller LLMs and only yields gains with models of that size.



What are Chains

We have talked about Chain of Density and Chain of Thought, so by now the general idea should be that chains enable the chaining of thoughts or answers by the LLM within the same prompt. However, chains are much more than that.


Basically, chaining is a data pipeline. At its core, chaining in prompt engineering involves using the output of one prompt as the input to the next prompt, or as part of an ongoing conversation. By seamlessly connecting prompts, a conversational assistant gains the ability to maintain continuity and context, enhancing the overall conversational experience.

Comparison between Simple Prompting and Advanced CoT

Chains create a series of interconnected data pipelines that enable continuous prompts for conversational assistants or for models which require external tools or data sources to adapt and respond effectively to various circumstances. Each prompt-answer pair can be seen as a building block of the chain.
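A two-step chain can be as simple as the sketch below, where the output of the first prompt becomes the input of the second. The `call_llm` helper and the prompts themselves are illustrative assumptions.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError

def facts_then_email(document: str) -> str:
    # Prompt 1: extract the key facts from the raw text.
    facts = call_llm(f"Extract the key facts from the following text as a bullet list:\n{document}")
    # Prompt 2: feed the first output into the next prompt in the chain.
    return call_llm(f"Write a short status-update email based on these facts:\n{facts}")
```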


Validation

Because hallucinations are common in LLMs, there is a need to validate their responses automatically, especially in production. While this often comes at the expense of additional cost and inference time, it is still usually worthwhile for many LLM applications.


In this section, we will go over some validation techniques and architectures both for calculating metrics and for designing a more robust system.


Verbalized 📃

One can simply ask the LLM to state how confident it is in its answer and take that as a metric of how good the output is supposed to be. However, this has an obvious bias and is not very representative of the truth: LLMs are prone to stating confidence levels of around 80% to 100%. With more advanced prompting techniques such as CoT, Multi-Step, and Top-K, this can be slightly calibrated and improved, but it is still not a very reliable metric.
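A verbalized-confidence prompt can look like the sketch below; `call_llm` is a placeholder, and the confidence the model reports should be treated with the skepticism described above.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError

question = "In which year was the Eiffel Tower completed?"

# Ask for the answer plus a self-reported (verbalized) confidence score.
prompt = (
    f"{question}\n"
    "Answer, then on a new line state your confidence as a percentage, e.g. 'Confidence: 85%'."
)

# response = call_llm(prompt)  # expect the reported confidence to cluster near 80-100%
```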



Self-Consistency 📃

Self-consistency is used to mitigate inconsistencies by running the same prompt more than once with a non-zero temperature, collecting these results, and choosing the right option by defining a merging strategy: for categories it can be done with a majority vote, for numerical answers an average can be used.

This approach works well for logical reasoning problems where the required chain of thought is not too long. When many thoughts have to be chained to solve a problem, any failure in one of them will most likely lead to a wrong output.


A major downside of this technique is the increased cost for each answer since multiple chains will be run for the same prompt. Latency-wise the impact can be mitigated by running the prompts in parallel, if possible.
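A minimal self-consistency sketch, assuming a hypothetical `call_llm` client that accepts a temperature parameter:

```python
from collections import Counter

def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for your LLM client; temperature > 0 yields varied answers."""
    raise NotImplementedError

def self_consistent_answer(prompt: str, n_samples: int = 5) -> str:
    # Sample the same prompt several times with a non-zero temperature...
    answers = [call_llm(prompt, temperature=0.7) for _ in range(n_samples)]
    # ...then merge with a majority vote (use an average instead for numerical answers).
    return Counter(answers).most_common(1)[0][0]
```

Since the samples are independent, they can be fired off in parallel to keep latency in check.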



Uncertainty 📃

A higher number of distinct answers to the same prompt implies a higher disagreement value, which in turn means higher uncertainty in the model. This metric differs from others by providing insight into the disagreement level and allowing for a different perspective than simply taking the majority vote. It could also theoretically be used with Self-Consistency, even though the authors did not propose it.
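One simple way to quantify this disagreement, in the spirit of the metric described above, is the share of distinct answers among all the sampled answers:

```python
def disagreement(answers: list[str]) -> float:
    """Fraction of distinct answers among the samples (1.0 means every answer differs)."""
    return len(set(answers)) / len(answers)

# Five samples of the same prompt with three distinct answers -> disagreement of 0.6.
print(disagreement(["Paris", "Paris", "Lyon", "Paris", "Marseille"]))
```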



Chain of Verification (CoVe) 📃

CoVe allows the model to create a plan to verify information before answering. The LLM automatically designs verification questions to confirm whether the information it is generating is true. This flow is similar to how a human would verify whether a piece of information is correct.

CoVe Process

An obvious use case is the integration of CoVe with a RAG (Retrieval Augmented Generation) system to allow for the checking of real-time information from multiple sources.


This technique is particularly important as a validation layer for high-stakes environments, especially with LLMs' common hallucinations.
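A bare-bones CoVe loop might look like the sketch below; the prompts are simplified paraphrases and `call_llm` is a hypothetical helper (step 3 is exactly where a RAG lookup could be plugged in).

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError

def chain_of_verification(question: str) -> str:
    # 1. Draft an initial (possibly hallucinated) answer.
    draft = call_llm(question)
    # 2. Plan verification questions for the claims in the draft.
    checks = call_llm(f"Draft answer:\n{draft}\n\nWrite short fact-checking questions to verify each claim above.")
    # 3. Answer each verification question independently (or route them to a RAG system).
    verifications = call_llm(checks)
    # 4. Produce a final answer that is consistent with the verification results.
    return call_llm(
        f"Question: {question}\nDraft: {draft}\nVerification Q&A:\n{verifications}\n\n"
        "Write the corrected final answer."
    )
```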


Active Prompting 📃

CoT methods rely on a fixed set of human-annotated exemplars of how the model should think. The problem is that these exemplars might not be the most effective examples for every task. Since only a limited number of examples can be given to the LLM, it is key to make sure they add as much value as possible.


To address this, a new prompting approach called Active-Prompt was proposed, which adapts LLMs to different tasks with task-specific example prompts annotated with human-designed CoT reasoning (humans design the thought process). This method ends up creating a database of the most relevant thought processes for each type of question. Additionally, its nature allows it to keep being updated to cover new types of tasks and the reasoning they require.


Active Prompting process
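The selection step can be sketched roughly as below: sample each candidate question several times, rank the questions by disagreement, and send the most uncertain ones to humans for CoT annotation. `call_llm` and the sample counts are assumptions for illustration.

```python
def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError

def disagreement(answers: list[str]) -> float:
    return len(set(answers)) / len(answers)

def select_for_annotation(questions: list[str], k_samples: int = 5, top_n: int = 3) -> list[str]:
    # The most uncertain questions are the most valuable to annotate with human CoT reasoning.
    scored = []
    for q in questions:
        answers = [call_llm(q, temperature=0.7) for _ in range(k_samples)]
        scored.append((disagreement(answers), q))
    scored.sort(reverse=True)
    return [q for _, q in scored[:top_n]]
```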




Agents 🤖 - The Frontier of Prompt Engineering


There is a huge hype around Agents in the AI field, with some declaring that they can reach a weak version of AGI while others point out their flaws and say they are overrated.


Agents usually have access to a set of tools, and any request that falls within the scope of these tools can be addressed by the agent. They commonly pair short-term memory, which keeps track of the context, with long-term memory, which lets them tap into knowledge accumulated over time (an external database). Their ability to design, on the fly, a plan of what needs to be executed lends the agent its independence. Because agents figure out their own path, a number of iterations across several tasks might be required until the agent decides it has reached the final answer.


Prompt Chaining vs Agents

Chaining is the execution of a predetermined, fixed sequence of actions. The appeal of agents is that they do not follow a predetermined sequence of events: they maintain a high level of autonomy and are thus able to complete much more complex tasks.


However, autonomy can be a double-edged sword: it can allow the agent to derail its thought process completely and end up acting in undesired ways. As the famous saying goes, "With great power comes great responsibility".



Tree of Thought (ToT) 📃

This method was designed for intricate tasks that require exploration or strategic lookahead, where traditional or simple prompting techniques fall short. Using a tree-like structure allows the developer to leverage all the procedures well known to increase the capability and efficiency of tree searches, such as pruning, DFS/BFS, lookahead, etc.


While ToT can fall under either standard chains or agents, we decided to include it under agents since it can be used to give more freedom and autonomy to an LLM (which is its most effective use) while the tree structure provides robustness, efficiency, and easier debugging of the system.



At each node, starting from the input, several candidate answers are generated and evaluated; usually the most promising one is chosen and the model follows that path. Depending on the evaluation and search method, this may be customized to the problem at hand. The evaluation can also be done by an external LLM, perhaps even a lightweight model, whose only job is to assign a score to each node and let the search algorithm decide which path to pursue.
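The sketch below shows a greedy variant for brevity: at each level it generates a few candidate thoughts, scores them with an evaluator call, and follows the best one (the paper also explores BFS/DFS with pruning). `call_llm` and the 0-10 scoring prompt are assumptions.

```python
def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError

def score(problem: str, partial_solution: str) -> float:
    # The evaluator can be a separate, lighter model whose only job is to rate nodes.
    reply = call_llm(
        f"Problem: {problem}\nPartial solution:\n{partial_solution}\n"
        "Rate how promising this is from 0 to 10. Reply with a number only.",
        temperature=0.0,
    )
    return float(reply.strip())

def greedy_tree_of_thought(problem: str, depth: int = 3, branching: int = 3) -> str:
    state = ""
    for _ in range(depth):
        # Expand the current node into several candidate next thoughts...
        candidates = [
            call_llm(f"Problem: {problem}\nSteps so far:\n{state}\nPropose the next reasoning step.")
            for _ in range(branching)
        ]
        # ...and greedily follow the highest-scoring one.
        best = max(candidates, key=lambda c: score(problem, state + c))
        state += best + "\n"
    return call_llm(f"Problem: {problem}\nReasoning:\n{state}\nGive the final answer.")
```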



ReAct (Reasoning + Act) 📃

This framework uses LLMs to generate both reasoning traces and task-specific actions, alternating between them until the model reaches an answer. Reasoning traces are the thoughts the LLM writes about how it should proceed or how it interprets something; generating them allows the model to induce, track, and update action plans, and even handle exceptions. The action step allows the model to interface with and gather information from external sources such as knowledge bases or environments.


ReAct also supports more complex flows, since the AI can decide for itself what the next step should be and when it should return an answer to the user. Then again, this can also be a source of derailing or hallucinations.

Typical ReAct process for the rescheduling of a flight

Overall, the authors found improvements when combining ReAct with chain of thought, allowing the model to think properly before acting, just like we tell our children to. It also improves human interpretability, since the model clearly states its thoughts, actions, and observations.

On the downside, ReAct requires considerably more prompts, which drives the cost up significantly while also delaying the final answer. It also has a track record of easily derailing from the main task and chasing a task it created for itself that is not aligned with the main one.
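A stripped-down ReAct loop is sketched below with a single hypothetical `search_tool`; the Thought/Action/Observation format and the parsing are simplified assumptions rather than the authors' exact prompt.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError

def search_tool(query: str) -> str:
    """Hypothetical external tool (knowledge base, flight-status API, ...)."""
    raise NotImplementedError

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model alternates reasoning (Thought) and acting (Action); we add the Observation.
        step = call_llm(
            transcript
            + "Continue with 'Thought: ...' and then either 'Action: search[...]' or 'Final Answer: ...'"
        )
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action: search[" in step:
            query = step.split("Action: search[", 1)[1].split("]", 1)[0]
            transcript += f"Observation: {search_tool(query)}\n"
    return "No answer reached within the step budget."
```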



ReWOO (Reasoning WithOut Observation) 📃

ReWOO is a method that decouples reasoning from external observations, enhancing efficiency by lowering token consumption. The process is split into three modules: Planner, Worker, and Solver.



Typical ReWOO process. Plans are executed sequentially

ReWOO sacrifices some autonomy and some ability to adjust on the fly (the plans are all defined by the Planner after the initial prompt is received). Nevertheless, it generally outperforms ReAct: the authors report that it reduces token usage by about 64% with an absolute accuracy gain of around 4.4%. It is also considered more robust to tool failures and malfunctions than ReAct.


Furthermore, ReWOO allows different LLMs to be used for the Planner, Worker, and Solver modules. Since each module has a different inherent complexity, differently sized networks can be leveraged for better efficiency.
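The three modules can be sketched roughly as below; the plan format with #E1, #E2 evidence placeholders follows the spirit of the paper, while `call_llm`, `tool`, and the parsing are simplified assumptions (each module could be backed by a differently sized model).

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM client; Planner and Solver could use different models."""
    raise NotImplementedError

def tool(name: str, argument: str) -> str:
    """Hypothetical tool dispatcher (search, calculator, ...)."""
    raise NotImplementedError

def rewoo(question: str) -> str:
    # Planner: lay out the whole plan up front, with no intermediate observations needed.
    plan = call_llm(
        f"Question: {question}\n"
        "Write a numbered plan. Each step is 'tool_name[input]' and may reference "
        "earlier evidence as #E1, #E2, ..."
    )

    # Worker: execute the planned tool calls sequentially, substituting earlier evidence.
    evidence: dict[str, str] = {}
    step_no = 0
    for line in plan.splitlines():
        if "[" not in line:
            continue
        step_no += 1
        name, argument = line.split("[", 1)
        argument = argument.rstrip("]")
        for key, value in evidence.items():
            argument = argument.replace(key, value)
        evidence[f"#E{step_no}"] = tool(name.strip(" .0123456789"), argument)

    # Solver: combine the plan and the collected evidence into the final answer.
    return call_llm(f"Question: {question}\nPlan:\n{plan}\nEvidence:\n{evidence}\nGive the final answer.")
```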


Reflexion and Self-Reflection 📃

Self-Reflection can be as simple as asking the model “Are you sure?” after its answer, effectively gaslighting it, and allowing the model to answer again. In many cases, this simple trick leads to better results, although for more complex tasks it does not have a clear positive impact.


This is where the Reflexion framework comes in, enabling agents to reflect on task feedback, and then maintain their own reflective text in an episodic memory buffer. This reflective text is then used to induce better decision-making in subsequent answers.

Reflexion framework

The Actor, Evaluator, and Self-Reflection models work together through trials, looping over trajectories until the Evaluator deems a trajectory correct. The Actor can take the form of many prompting techniques and agents, such as Chain of Thought, ReAct, or ReWOO. This compatibility with all the previous prompting techniques is what makes the framework so powerful.


On the other hand, some recent papers have demonstrated some issues with this method, suggesting that these models might sometimes intensify their own hallucinations, doubling down on misinformation instead of improving the quality of answers. It is still unclear when it should and should not be used, so it is a matter of testing it out in each use case.
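A toy version of the loop is sketched below: the Actor attempts the task, the Evaluator judges the trajectory, and the Self-Reflection step writes a lesson into an episodic memory buffer used on the next trial. `call_llm` and the PASS/FAIL evaluator prompt are assumptions; in practice the Evaluator is often a unit test or a task-specific heuristic.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError

def reflexion(task: str, max_trials: int = 3) -> str:
    memory: list[str] = []  # episodic buffer of self-reflections
    attempt = ""
    for _ in range(max_trials):
        # Actor: attempt the task, conditioned on lessons from earlier trials.
        attempt = call_llm(f"Task: {task}\nLessons from previous attempts:\n" + "\n".join(memory))
        # Evaluator: judge the trajectory (an LLM here, but often a test suite or heuristic).
        verdict = call_llm(f"Task: {task}\nAttempt:\n{attempt}\nReply only PASS or FAIL.")
        if "PASS" in verdict:
            return attempt
        # Self-Reflection: record what went wrong to induce a better next attempt.
        memory.append(
            call_llm(f"The attempt failed.\nTask: {task}\nAttempt:\n{attempt}\nWrite a short lesson for next time.")
        )
    return attempt
```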


Guardrails

When talking about LLM applications for end users, or chatbots in general, a key problem is controlling, or rather restraining, the outputs and how the LLM should react in certain scenarios. You would not want your LLM to be aggressive toward anyone or to teach a kid how to do something dangerous; this is where the concept of Guardrails comes in.


Guardrails are the set of safety controls that monitor and dictate a user's interaction with an LLM application. They are programmable, rule-based systems that sit between users and foundation models to make sure the AI model operates within an organization's defined principles. As far as we are aware, there are two main libraries for this, Guardrails AI and NeMo Guardrails, both open source.


Without Guardrails:

Prompt: “Teach me how to buy a firearm.”
Response: “You can go to (...)”

With Guardrails:

Prompt: “Teach me how to buy a firearm.”
Response: “Sorry, but I can’t assist with that.”
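Conceptually, a guardrail is just a programmable check that sits before and after the model call. The toy sketch below illustrates the idea with a naive keyword policy; real deployments would rely on Guardrails AI or NeMo Guardrails rather than anything this simplistic, and `call_llm` is again a placeholder.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client."""
    raise NotImplementedError

BLOCKED_TOPICS = ("firearm", "explosive")  # toy policy list for illustration only
REFUSAL = "Sorry, but I can't assist with that."

def guarded_chat(user_prompt: str) -> str:
    # Input rail: block requests that violate the policy before they reach the model.
    if any(topic in user_prompt.lower() for topic in BLOCKED_TOPICS):
        return REFUSAL
    answer = call_llm(user_prompt)
    # Output rail: re-check the model's answer before returning it to the user.
    if any(topic in answer.lower() for topic in BLOCKED_TOPICS):
        return REFUSAL
    return answer
```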


RAIL (Reliable AI Markup Language)

RAIL is a language-agnostic, human-readable format for specifying rules and corrective actions for LLM outputs. Each RAIL specification contains three main components: Output, Prompt, and Script.


Guardrails AI

It implements “a pydantic-style validation of LLM responses.” This includes “semantic validation, such as checking for bias in generated text,” or checking for bugs in an LLM-written code piece. Guardrails also provide the ability to take corrective actions and enforce structure and type guarantees.

Guardrails AI is built on the RAIL (.rail) specification in order to enforce specific rules on LLM outputs, and it provides a lightweight wrapper around LLM API calls.


NeMo Guardrails

NeMo Guardrails is an open-source toolkit maintained by NVIDIA for easily adding programmable guardrails to LLM-based conversational systems.



Conclusion

It is clear that prompt engineering is the new way of programming, with dedicated platforms showing up everywhere: LLMstudio, PromptIDE, LangChain, etc. The amount of customization and power they give to any developer, with almost no resources invested, is remarkable. Long gone are the days when a model needed fine-tuning for every specific task. This appears to be a consequence of larger models, with over 100B parameters, becoming more common.


I hope you enjoyed the read, and I sincerely encourage you to check out some of our other blog posts. If you are interested in news about new posts, you can also subscribe to email notifications below.


The Future 🚀

A year ago, Sam Altman posited that in five years prompt engineering as a key aspect of large language models (LLMs) might become obsolete. He envisions LLMs evolving to a stage where they intuitively grasp the intended task from minimal context and adjust to user requests effortlessly, without the need for carefully crafted prompts. Echoing his sentiment, I agree that the current necessity for prompt engineering resembles a temporary fix, akin to duct tape. We will almost certainly devise superior methods, potentially even automated ones that develop the best prompt on their own or that have no need for great prompts at all. While I share Altman's outlook, I anticipate this evolution might span 10 to 15 years.


Ultimately, predicting the future is an endeavor best left to fortune tellers. Our role is merely to observe the current state of affairs and discern and adapt to the direction in which the world is heading.




