When deploying AI in specialized domains, such as customer support or legal analysis, models require access to relevant background knowledge. This often involves integrating retrieval techniques to access external data sources. One popular method is Retrieval-Augmented Generation (RAG), which retrieves relevant information and appends it to a user's query to enhance response accuracy. However, traditional RAG systems often strip crucial context from retrieved chunks, leading to lower-quality outputs. In response, Contextual Retrieval has emerged as an innovative technique to overcome these limitations.
Executive Summary
Deploying AI in specialized domains like customer support, legal, and financial analysis requires accuracy and context. Traditional Retrieval-Augmented Generation (RAG) systems lack the ability to maintain context in retrieved information, often leading to irrelevant or ambiguous responses.
Typical RAG example: "The earnings increased by 10% this quarter" - specifies neither the quarter nor the company
Contextual Retrieval addresses this by enriching data chunks with specific context before they are indexed, dramatically improving the AI's relevance and precision. By combining contextual embeddings with traditional techniques like BM25 search and reranking, this method reduces retrieval failures by up to 67%, ensuring the right information is retrieved with the necessary context.
Example with Context: "Company: Apple; 2024; Q3; Report by Deloitte Dec 2024: The earnings increased by 10% this quarter" - provides the LLM with the relevant context
Index
Basics of RAG
Challenges with Traditional RAG Systems
Contextual Retrieval: The Solution
Use Cases and Applications
Combining BM25 with Embeddings for Optimal Retrieval
Enhancing Retrieval with Reranking
Cost Reduction: Prompt Caching
Alternative Approaches: RAG vs. Large Context Models
Practical Guide for Implementing Contextual Retrieval
Conclusion
References
The Basics of Retrieval-Augmented Generation (RAG)
RAG improves an AI’s ability to handle large knowledge bases that cannot fit into a single prompt. It works by:
1. Splitting a knowledge base into small chunks.
2. Converting those chunks into vector embeddings that capture semantic meaning.
3. Storing the embeddings in a searchable database.
4. Retrieving the most relevant chunks based on their similarity to a user query.
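Concretely, the four steps above map onto a pipeline like the minimal sketch below. It uses sentence-transformers and an in-memory NumPy "index" purely as illustrative stand-ins; a production system would typically persist the embeddings in a dedicated vector database.

```python
# Minimal RAG sketch: split a document, embed the chunks, and retrieve by
# cosine similarity. Model and index choices are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Step 1: split the knowledge base into small chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_index(chunks: list[str]) -> np.ndarray:
    """Steps 2-3: embed every chunk and keep the vectors in a searchable store."""
    return model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 5) -> list[str]:
    """Step 4: return the k chunks most similar to the user query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q  # cosine similarity, since all vectors are normalized
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```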
While this approach works well for many scenarios, it suffers from missed exact matches, especially with technical terms or specific identifiers. For example, when querying a database for "Error code TS-999," semantic embeddings may overlook this precise match. In these cases, traditional RAG fails to retrieve the most relevant chunks due to a lack of context.
Problem with RAG - Lost Context
In traditional RAG, documents are split into smaller parts to make retrieval more efficient. While effective in many cases, this approach can lead to issues when individual pieces lack the necessary context.
Imagine a knowledge base filled with financial reports from multiple companies. Now, suppose you ask: "What was the revenue growth for ACME Corp in Q2 2023?" The most relevant chunk might simply read: "The company's revenue grew by 3% over the previous quarter."
However, this chunk doesn't specify the company name (ACME Corp) or the time period (Q2 2023). Without these details, it's unclear whether the answer is truly about ACME Corp's Q2 2023 revenue growth or pertains to another company and period altogether.
How Contextual Retrieval Fixes the Problem
Contextual Retrieval addresses these issues by appending additional information to each chunk before embedding it into a vector database. Using Contextual Embeddings and Contextual BM25 (a lexical search method), chunks are enriched with explanatory context specific to their source document.
For example, relevant context is prepended to the raw chunk, which might then read: "This chunk is from ACME Corp's Q2 2023 financial report. The company's revenue grew by 3% over the previous quarter."
By augmenting chunks with contextual data, retrieval accuracy improves dramatically. According to Anthropic's tests, this method reduces retrieval failures by 49%, and when combined with reranking strategies, failures drop by 67%.
A simpler, lighter, and more cost-effective model can be used to add context to each chunk. The prompt can be refined for the specific use case, including examples to guide the model's expected behavior. Here, we provide a straightforward example of a possible prompt:
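The template below is only a sketch, modeled on the prompt Anthropic published alongside this technique; the WHOLE_DOCUMENT and CHUNK_CONTENT placeholders are filled in per chunk, and the wording should be adapted to your own documents.

```python
# Illustrative contextualization prompt (adapt wording and examples to your domain).
CONTEXTUALIZE_PROMPT = """\
<document>
{WHOLE_DOCUMENT}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{CHUNK_CONTENT}
</chunk>

Give a short, succinct context that situates this chunk within the overall document,
for the purpose of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else."""
```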
The generated contextual text, typically around 50-100 tokens, is added before the chunk itself during embedding and when building the BM25 index.
Here’s an example of the preprocessing flow pipeline in action:
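The sketch below strings the steps together, reusing the prompt template and the build_index helper from the earlier snippets. It assumes the anthropic and rank_bm25 packages; the model name is simply an illustrative choice of a small, inexpensive model.

```python
# Preprocessing sketch: generate situating context for each chunk, prepend it,
# then build both the embedding index and the BM25 index over the result.
import anthropic
from rank_bm25 import BM25Okapi

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize(document: str, chunk: str) -> str:
    """Ask a lightweight model for ~50-100 tokens of situating context."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # illustrative lightweight model
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXTUALIZE_PROMPT.format(
                WHOLE_DOCUMENT=document, CHUNK_CONTENT=chunk
            ),
        }],
    )
    return response.content[0].text

def preprocess(document: str, chunks: list[str]):
    """Contextualize every chunk, then index it for both semantic and lexical search."""
    contextualized = [f"{contextualize(document, c)}\n\n{c}" for c in chunks]
    embeddings = build_index(contextualized)                       # embedding index
    bm25 = BM25Okapi([c.lower().split() for c in contextualized])  # BM25 index
    return contextualized, embeddings, bm25
```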
Use Cases and Examples
- Customer Support: A chatbot serving a technical support team might need to retrieve error documentation. Using Contextual Retrieval, the system can differentiate between similar error codes or find specific cases where exact term matches (like "Error code TS-999") are critical.
- Legal Research: In legal AI systems, retrieval of relevant case law often depends on precise document matches. Traditional RAG systems might return general cases, but with Contextual Retrieval, the AI can pull exact rulings or clauses that are directly applicable to a given query.
- Financial Analysis: For systems managing large databases of financial filings, maintaining the integrity of context across multiple queries is essential. By using contextual embeddings, these systems can ensure relevant and accurate data is retrieved, improving responses to specific inquiries like revenue growth, quarterly reports, or stock analysis.
Combining BM25 and Embeddings for Better Results
To further enhance retrieval quality, Contextual Retrieval merges semantic embeddings with BM25, a technique that excels at identifying exact lexical matches. This dual approach ensures both broad semantic understanding and precise matching.
In practice, BM25 is invaluable for handling queries involving unique identifiers, such as "Q2 2023", "ACME Corp", or an identification number, where exact matching matters and context remains crucial.
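A common way to merge the two signals is reciprocal rank fusion, sketched below. It reuses the embedding model and the indexes built in the earlier snippets, and the fusion constant is an arbitrary but widely used default.

```python
# Hybrid retrieval sketch: fuse semantic (embedding) and lexical (BM25) rankings
# with reciprocal rank fusion, so a chunk ranked highly by either method wins.
import numpy as np

def hybrid_retrieve(query: str, chunks: list[str], embeddings: np.ndarray,
                    bm25, k: int = 20, rrf_k: int = 60) -> list[str]:
    # Semantic ranking: cosine similarity against the normalized chunk embeddings
    q = model.encode([query], normalize_embeddings=True)[0]
    semantic_rank = np.argsort(embeddings @ q)[::-1]

    # Lexical ranking: BM25 scores over the tokenized query
    lexical_rank = np.argsort(bm25.get_scores(query.lower().split()))[::-1]

    # Reciprocal rank fusion
    fused = np.zeros(len(chunks))
    for ranking in (semantic_rank, lexical_rank):
        for rank, idx in enumerate(ranking):
            fused[idx] += 1.0 / (rrf_k + rank + 1)

    return [chunks[i] for i in np.argsort(fused)[::-1][:k]]
```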
Anthropic's experiments showed that combining Contextual Embedding + Contextual BM25 provides the best results:
- Contextual Embeddings reduced the top-20-chunk retrieval failure rate by 35% (5.7% → 3.7%).
- Combining Contextual Embeddings and Contextual BM25 reduced the top-20-chunk retrieval failure rate by 49% (5.7% → 2.9%).
Key Considerations for Contextual Retrieval:
- Chunking: Choose chunk size, boundaries, and overlap carefully to improve retrieval performance (a simple overlap-based splitter is sketched after this list).
- Model Selection: Contextual Retrieval improved results across all embedding models tested, with Gemini and Voyage embeddings performing best.
- Custom Prompts: Tailor the contextualization prompt to your domain, possibly including key terms for added relevance.
- Chunk Quantity: Passing more chunks to the model can help; 20 chunks proved most effective in testing, though this may vary by use case.
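As an illustration of the chunking point above, a simple character-based splitter with overlap might look like the sketch below; the default sizes are arbitrary and should be tuned to your documents.

```python
# Illustrative chunker: fixed-size chunks with overlap, so text that straddles a
# boundary still appears intact in at least one chunk.
def chunk_with_overlap(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```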
Reranking: Boosting Retrieval Accuracy
Combining Contextual Retrieval with a reranking model can further optimize retrieval by filtering out less relevant chunks. In large knowledge bases, reranking helps ensure that only the top, most relevant chunks are passed to the AI model, boosting accuracy while minimizing processing costs and time.
Research done with the Cohere Reranker showed significant improvements:
- Reranked Contextual Embedding and Contextual BM25 reduced the top-20-chunk retrieval failure rate by 67% (5.7% → 1.9%).
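As a sketch, reranking the candidates from hybrid retrieval with Cohere's rerank endpoint might look like this; the model name is illustrative, and any cross-encoder reranker could be substituted.

```python
# Reranking sketch: re-score the retrieved candidates with a dedicated reranker
# and keep only the most relevant ones for the final prompt.
import os
import cohere

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

def rerank(query: str, candidates: list[str], top_n: int = 20) -> list[str]:
    """Reorder hybrid-retrieval candidates by relevance to the query."""
    response = co.rerank(
        model="rerank-english-v3.0",  # illustrative reranker choice
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [candidates[r.index] for r in response.results]
```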
Reducing Costs - Prompt Caching
Adding generated context to each chunk sent to the LLM (typically around 20 chunks per query) naturally increases token usage, and if larger contexts are needed, costs can escalate quickly.
With prompt caching, repeated queries can be handled faster and more cost-effectively, reducing latency by over 2x and cutting costs by up to 90%.
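With the Anthropic API, for example, this amounts to marking the large, reused part of the prompt with cache_control so that follow-up queries hit the cache instead of re-processing it; the model choice and system wording below are illustrative.

```python
# Prompt caching sketch: the retrieved reference material is cached across calls,
# so only the (short) user question is paid for in full on repeat queries.
import anthropic

client = anthropic.Anthropic()

def answer(query: str, retrieved_context: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model choice
        max_tokens=1024,
        system=[
            {"type": "text",
             "text": "Answer using only the reference material below."},
            {"type": "text",
             "text": retrieved_context,
             "cache_control": {"type": "ephemeral"}},  # cached between requests
        ],
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text
```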
For more details and a comparison across providers, check out our blog post on Prompt Caching - OpenAI vs Anthropic vs Gemini.
In some cases, you might even consider skipping RAG entirely. For smaller knowledge bases (up to 200,000 tokens with Anthropic's models), developers can bypass RAG by including the entire knowledge base directly in the prompt. With models like Google's Gemini, this can go up to 2 million tokens, though token usage costs scale accordingly. This approach has several benefits and drawbacks that we will thoroughly discuss in a future blog post, but for now you can find some relevant thoughts here: RAG vs Large Context Models - Gemini
Practical Guide - Contextual Retrieval
To access a guide on using Contextual Retrieval with Anthropic, see this Notebook Guide. With slight modifications, it can be adapted to work with other providers.
The guide covers the following topics:
- Set Up Basic Retrieval: Build a simple system to find information efficiently.
- Add Contextual Embeddings: Enhance each piece of information by incorporating contextual meaning.
- Improve with Contextual Embeddings: Discover how adding context boosts search accuracy.
- Combine with BM25 Search: Integrate contextual embeddings with traditional search for better results.
- Boost Results with Reranking: Reorder search results to prioritize the most relevant information.
Conclusion: A New Era of AI Retrieval
Contextual Retrieval significantly advances the performance of AI in specialized fields, enabling better retrieval accuracy through a combination of contextual embeddings, BM25, and reranking techniques. Developers can leverage these tools for more precise, context-rich information retrieval, ultimately delivering more relevant and actionable responses across domains.
Additional costs incurred can be mostly mitigated by leveraging prompt caching.
To start implementing Contextual Retrieval in your systems, check out the Notebook Guide by Anthropic and explore the potential of this groundbreaking method.