# Building a Contextual Retrieval System for Improving RAG Accuracy

*October 17, 2024*

Domain-specific knowledge is needed to adapt AI models to specific tasks. For example, customer support chatbots require business-related information, while legal bots rely on historical case data. Developers typically use Retrieval-Augmented Generation (RAG) to fetch relevant knowledge from a database and improve AI responses. However, existing RAG approaches often miss context at retrieval time, leading to failures. In this post we introduce a method called contextual retrieval: contextual embeddings improve search accuracy, and reranking reduces misses.

For large knowledge bases, RAG provides a scalable solution. Modern RAG systems combine two powerful search methods:

**Semantic search using embeddings**

- Split your knowledge base into manageable segments (typically a few hundred tokens each).
- Convert these chunks into vector embeddings that capture their semantic meaning.
- Store the embeddings in a vector database for similarity search.

**Lexical search using BM25**

- Builds on the Term Frequency-Inverse Document Frequency (TF-IDF) principle.
- Accounts for document length and term-frequency saturation.
- Excels at finding exact matches and specific terms.

An optimal RAG implementation combines both approaches:

1. Split the knowledge base into chunks.
2. Generate both TF-IDF encodings and semantic embeddings for each chunk.
3. Run parallel searches using BM25 and embedding similarity.
4. Merge and de-duplicate the results using rank fusion (sketched below).
5. Include the most relevant chunks in the prompt.
6. Generate the response using the enhanced context.

The problem with traditional RAG is how it splits documents into smaller chunks for efficient retrieval, sometimes losing important context. For example, consider an academic database asked, "What was Dr. Smith's primary research focus in 2021?" If a retrieved chunk states that "the research emphasized AI" without naming Dr. Smith or the year, the ambiguity may make an exact answer impossible. This issue reduces the accuracy and usefulness of retrieval in knowledge-rich domains.

Contextual retrieval solves this problem by prepending chunk-specific explanatory context to each chunk before embedding it ("contextual embedding"): a short contextual passage is generated for every chunk.

A typical RAG pipeline has the following components: user input is authenticated and passed through a content safety system (learn more [here](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/overview)). The next step rewrites the query based on the conversation history; query expansion can also be attached to enrich the generated answers. Then come the retriever and the reranker.

The retriever and the reranker play complementary roles in finding and prioritizing relevant context. The retriever acts as an initial filter, efficiently scanning large document collections to identify potentially relevant chunks based on their similarity to the query; common approaches include dense retrievers (e.g., embedding-based search) and sparse retrievers (e.g., BM25). The reranker then acts as a more sophisticated second stage, taking the retriever's candidate passages and scoring each in detail for relevance. It uses powerful language models to analyze the deep semantic relationship between the query and each passage, taking into account factors such as factual alignment, answer coverage, and contextual relevance. This two-stage approach balances efficiency and accuracy: the retriever quickly narrows the search space, and the reranker applies more computationally intensive analysis to a small set of promising candidates to pick the most appropriate context for the generation step.
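To make the rank-fusion step from the list above concrete, here is a minimal sketch of reciprocal rank fusion (RRF). The function name and the constant `k = 60` are illustrative assumptions, not code from the pipeline built in this post:

```python
from typing import Dict, List

def reciprocal_rank_fusion(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk IDs into a single ranking."""
    scores: Dict[str, float] = {}
    for ranked in result_lists:
        for rank, chunk_id in enumerate(ranked):
            # Higher-ranked chunks contribute more; k dampens the tail.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: merged = reciprocal_rank_fusion([bm25_ranked_ids, embedding_ranked_ids])
```

A chunk surfaced by both searches accumulates score from each list, so the merge step de-duplicates the result sets for free.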
This example uses LangChain as the framework, together with Azure OpenAI, Azure AI Document Intelligence, and Cohere.

```python
import os
from typing import List, Tuple
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai import AzureChatOpenAI
from langchain.prompts import ChatPromptTemplate
from rank_bm25 import BM25Okapi
import cohere
import logging
import time
from azure.ai.documentintelligence.models import DocumentAnalysisFeature
from langchain_community.document_loaders.doc_intelligence import AzureAIDocumentIntelligenceLoader

# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

load_dotenv('azure.env', override=True)
```

Now let's implement contextual embedding in a custom retriever. We use Azure AI Document Intelligence for PDF parsing, break documents into manageable chunks while maintaining context, and overlap chunk boundaries so that information at the edges is not lost:

```python
class ContextualRetrieval:
    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=800,
            chunk_overlap=100,
        )
        self.embeddings = AzureOpenAIEmbeddings(
            api_key=os.getenv("AZURE_OPENAI_API_KEY"),
            azure_deployment="text-embedding-ada-002",
            openai_api_version="2024-03-01-preview",
            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"]
        )
        self.llm = AzureChatOpenAI(
            api_key=os.environ["AZURE_OPENAI_API_KEY"],
            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
            azure_deployment="gpt-4o",
            temperature=0,
            max_tokens=None,
            timeout=None,
            max_retries=2,
        )
        self.cohere_client = cohere.Client(os.getenv("COHERE_API_KEY"))

    def load_pdf_and_parse(self, pdf_path: str) -> str:
        loader = AzureAIDocumentIntelligenceLoader(
            file_path=pdf_path,
            api_key=os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY"),
            api_endpoint=os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"),
            api_model="prebuilt-layout",
            api_version="2024-02-29-preview",
            mode="markdown",
            analysis_features=[DocumentAnalysisFeature.OCR_HIGH_RESOLUTION]
        )
        try:
            documents = loader.load()
            if not documents:
                raise ValueError("No content extracted from the PDF.")
            return " ".join([doc.page_content for doc in documents])
        except Exception as e:
            logging.error(f"Error while parsing the file '{pdf_path}': {str(e)}")
            raise

    def process_document(self, document: str) -> Tuple[List[Document], List[Document]]:
        if not document.strip():
            raise ValueError("The document is empty after parsing.")
        chunks = self.text_splitter.create_documents([document])
        contextualized_chunks = self._generate_contextualized_chunks(document, chunks)
        return chunks, contextualized_chunks

    def _generate_contextualized_chunks(self, document: str, chunks: List[Document]) -> List[Document]:
        contextualized_chunks = []
        for chunk in chunks:
            # Prepend the LLM-generated context to the raw chunk text
            context = self._generate_context(document, chunk.page_content)
            contextualized_content = f"{context}\n\n{chunk.page_content}"
            contextualized_chunks.append(Document(page_content=contextualized_content, metadata=chunk.metadata))
        return contextualized_chunks
```
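A quick sanity check of the parsing and chunking steps might look like this (the file name is a placeholder):

```python
cr = ContextualRetrieval()
raw_text = cr.load_pdf_and_parse("sample.pdf")  # placeholder path
chunks, contextualized_chunks = cr.process_document(raw_text)

# Each contextualized chunk is the generated context followed by the original text
print(len(chunks), "chunks")
print(contextualized_chunks[0].page_content[:300])
```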
The `_generate_context` method asks GPT-4o for a short, chunk-specific context, and the remaining methods build the BM25 index, generate answers, rerank with Cohere, and expand queries. They continue the `ContextualRetrieval` class:

```python
    def _generate_context(self, document: str, chunk: str) -> str:
        prompt = ChatPromptTemplate.from_template("""
You are an AI assistant specializing in document analysis. Your task is to provide brief, relevant context for a chunk of text from the given document.

Here is the document:
{document}

Here is the chunk we want to situate within the whole document:
{chunk}

Provide a concise context (2-3 sentences) for this chunk, considering the following guidelines:
1. Identify the main topic or concept discussed in the chunk.
2. Mention any relevant information or comparisons from the broader document context.
3. If applicable, note how this information relates to the overall theme or purpose of the document.
4. Include any key figures, dates, or percentages that provide important context.
5. Do not use phrases like "This chunk discusses" or "This section provides". Instead, directly state the context.

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.

Context:
""")
        messages = prompt.format_messages(document=document, chunk=chunk)
        response = self.llm.invoke(messages)
        return response.content

    def create_bm25_index(self, chunks: List[Document]) -> BM25Okapi:
        tokenized_chunks = [chunk.page_content.split() for chunk in chunks]
        return BM25Okapi(tokenized_chunks)

    def generate_answer(self, query: str, relevant_chunks: List[str]) -> str:
        prompt = ChatPromptTemplate.from_template("""
Based on the following information, please provide a concise and accurate answer to the question. If the information is not sufficient to answer the question, say so.

Question: {query}

Relevant information:
{chunks}

Answer:
""")
        messages = prompt.format_messages(query=query, chunks="\n\n".join(relevant_chunks))
        response = self.llm.invoke(messages)
        return response.content

    def rerank_results(self, query: str, documents: List[Document], top_n: int = 3) -> List[Document]:
        logging.info(f"Reranking {len(documents)} documents for query: {query}")
        doc_contents = [doc.page_content for doc in documents]

        max_retries = 3
        for attempt in range(max_retries):
            try:
                reranked = self.cohere_client.rerank(
                    model="rerank-english-v2.0",
                    query=query,
                    documents=doc_contents,
                    top_n=top_n
                )
                break
            except cohere.errors.TooManyRequestsError:
                if attempt < max_retries - 1:
                    logging.warning(f"Rate limit hit. Waiting for 60 seconds before retry {attempt + 1}/{max_retries}")
                    time.sleep(60)  # Wait for 60 seconds before retrying
                else:
                    logging.error("Rate limit hit. Max retries reached. Returning original documents.")
                    return documents[:top_n]

        logging.info(f"Reranking complete. Top {top_n} results:")
        reranked_docs = []
        for idx, result in enumerate(reranked.results):
            original_doc = documents[result.index]
            reranked_docs.append(original_doc)
            logging.info(f"  {idx+1}. Score: {result.relevance_score:.4f}, Index: {result.index}")
        return reranked_docs

    def expand_query(self, original_query: str) -> str:
        prompt = ChatPromptTemplate.from_template("""
You are an AI assistant specializing in document analysis. Your task is to expand the given query to include related terms and concepts that might be relevant for a more comprehensive search of the document.

Original query: {query}

Please provide an expanded version of this query, including relevant terms, concepts, or related ideas that might help in summarizing the full document. The expanded query should be a single string, not a list.

Expanded query:
""")
        messages = prompt.format_messages(query=original_query)
        response = self.llm.invoke(messages)
        return response.content
```
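The snippets that follow call a `process_query` helper that the post does not define. A minimal sketch consistent with how it is used might look like this (the candidate count of 10 is an assumption):

```python
# Hypothetical helper: BM25 retrieval, Cohere reranking, then answer generation.
def process_query(cr: ContextualRetrieval, query: str,
                  bm25_index: BM25Okapi, chunks: List[Document]) -> str:
    # Optionally expand the query first: query = cr.expand_query(query)
    scores = bm25_index.get_scores(query.split())
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
    candidates = [chunks[i] for i in top_indices]
    reranked = cr.rerank_results(query, candidates, top_n=3)
    answer = cr.generate_answer(query, [doc.page_content for doc in reranked])
    print(answer)
    return answer
```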
Now let's load a sample PDF, generate the contextual embeddings, and create two BM25 indexes: one over the regular chunks and one over the context-aware chunks.

```python
cr = ContextualRetrieval()
pdf_path = "1.pdf"
document = cr.load_pdf_and_parse(pdf_path)

# Process the document
chunks, contextualized_chunks = cr.process_document(document)

# Create BM25 indexes
contextualized_bm25_index = cr.create_bm25_index(contextualized_chunks)
normal_bm25_index = cr.create_bm25_index(chunks)
```

Now let's run the same query against both indexes and compare the results.

Regular index:

```python
original_query = "When does the term of the Agreement commence and how long does it last?"
print(f"\nOriginal Query: {original_query}")
process_query(cr, original_query, normal_bm25_index, chunks)
```

Context-aware index:

```python
original_query = "When does the term of the Agreement commence and how long does it last?"
print(f"\nOriginal Query: {original_query}")
process_query(cr, original_query, contextualized_bm25_index, contextualized_chunks)
```

You will likely get a better answer from the second run because of the contextual retriever.

Now let's evaluate this against a benchmark, using the Azure AI Evaluation SDK. First, load the dataset. You can generate ground truth as JSON Lines records like the following:

```json
{"chat_history": [], "question": "What is short-term memory in the context of the model?", "ground_truth": "Short-term memory involves utilizing in-context learning to learn."}
```

```python
import pandas as pd

df = pd.read_json(output_file, lines=True, orient="records")
df.head()
```

With the dataset loaded, run it against both the standard and the contextually embedded retrieval strategies.

```python
normal_answers = []
contextual_answers = []
for index, row in df.iterrows():
    normal_answers.append(process_query(cr, row["question"], normal_bm25_index, chunks))
    contextual_answers.append(process_query(cr, row["question"], contextualized_bm25_index, contextualized_chunks))
```

Finally, evaluate the answers against the ground truth. Here we use the similarity score; you can use other built-in or custom metrics (learn more [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)).

```python
from azure.ai.evaluation import SimilarityEvaluator

# Initializing the Similarity Evaluator
similarity_eval = SimilarityEvaluator(model_config)

df["answer"] = normal_answers
df['score'] = df.apply(lambda x: similarity_eval(
    response=x["answer"],
    ground_truth=x["ground_truth"],
    query=x["question"],
), axis=1)

df["answer_contextual"] = contextual_answers
df['score_contextual'] = df.apply(lambda x: similarity_eval(
    response=x["answer_contextual"],
    ground_truth=x["ground_truth"],
    query=x["question"],
), axis=1)
```
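The `SimilarityEvaluator` above takes a `model_config` that the original post does not show. For an Azure OpenAI judge model it is typically a dictionary like the following (the deployment name is a placeholder):

```python
# Assumed Azure OpenAI judge configuration for the evaluator (values are placeholders)
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}
```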
As you can see, contextual embeddings improve retrieval, and this is reflected in the similarity scores.

The contextual retrieval system described in this post represents a sophisticated approach to document analysis and question answering. By integrating several NLP techniques, including contextualization with GPT-4o, efficient indexing with BM25, reranking with Cohere models, and query expansion, the system not only retrieves relevant information but also understands and synthesizes it to provide accurate answers. Its modular architecture ensures flexibility, allowing individual components to be enhanced or replaced as better technology becomes available. As the field of natural language processing continues to advance, systems like this will become increasingly important in making large amounts of text more accessible, searchable, and actionable across a variety of domains.

References:

- [Azure AI Content Safety overview](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/overview)
- [Evaluate with the Azure AI Evaluation SDK](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)
- [Introducing Contextual Retrieval (Anthropic)](https://www.anthropic.com/news/contextual-retrieval)

Thank you,
Manoranjan Rajguru
https://www.linkedin.com/in/manoranjan-rajguru/