# Building a Contextual Retrieval System for Improving RAG Accuracy

*October 17, 2024*

Domain-specific knowledge is needed to adapt AI models to specific tasks. For example, customer support chatbots require business-related information, while legal bots rely on historical case data. Developers typically use Retrieval-Augmented Generation (RAG) to fetch relevant knowledge from a database and improve AI responses. However, existing RAG approaches often miss context at retrieval time, leading to failures. In this post we introduce a method called contextual retrieval: contextual embeddings improve search accuracy, and reranking reduces misses.

For large knowledge bases, RAG provides a scalable solution. Modern RAG systems combine two powerful search methods:

**Semantic search using embeddings**

- Split your knowledge base into manageable segments (typically a few hundred tokens each).
- Convert these chunks into vector embeddings that capture their semantic meaning.
- Store the embeddings in a vector database for similarity search.

**Lexical search using BM25**

- Builds on the Term Frequency-Inverse Document Frequency (TF-IDF) principle.
- Accounts for document length and term-frequency saturation.
- Excels at finding exact matches and specific terms.

An optimal RAG implementation combines both approaches:

1. Split the knowledge base into chunks.
2. Generate both TF-IDF encodings and semantic embeddings for each chunk.
3. Run parallel searches using BM25 and embedding similarity.
4. Merge and de-duplicate the results using rank fusion (sketched below).
5. Include the most relevant chunks in the prompt.
6. Generate the response using the enhanced context.

The problem with traditional RAG is how it splits documents into smaller chunks for efficient retrieval, sometimes losing important context. For example, consider an academic database asked, "What was Dr. Smith's primary research focus in 2021?" If a retrieved chunk states that "the research emphasized AI" without naming Dr. Smith or the year, the ambiguity may make an exact answer impossible. This issue reduces the accuracy and usefulness of retrieval in knowledge-rich domains.

Contextual retrieval solves this problem by prepending chunk-specific explanatory context to each chunk before embedding it ("contextual embedding"): a short contextual passage is generated for every chunk.

A typical RAG pipeline has the following components: user input is authenticated and passed through a content safety system (learn more [here](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/overview)). The next step rewrites the query based on the conversation history; query expansion can also be attached to enrich the generated answers. Then come the retriever and the reranker.

The retriever and the reranker play complementary roles in finding and prioritizing relevant context. The retriever acts as an initial filter, efficiently scanning large document collections to identify potentially relevant chunks based on their similarity to the query; common approaches include dense retrievers (e.g., embedding-based search) and sparse retrievers (e.g., BM25). The reranker then acts as a more sophisticated second stage, taking the retriever's candidate passages and scoring each in detail for relevance. It uses powerful language models to analyze the deep semantic relationship between the query and each passage, taking into account factors such as factual alignment, answer coverage, and contextual relevance. This two-stage approach balances efficiency and accuracy: the retriever quickly narrows the search space, and the reranker applies more computationally intensive analysis to a small set of promising candidates to pick the most appropriate context for the generation step.
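To make the rank-fusion step from the list above concrete, here is a minimal sketch of reciprocal rank fusion (RRF). The function name and the constant `k = 60` are illustrative assumptions, not code from the pipeline built in this post:

```python
from typing import Dict, List

def reciprocal_rank_fusion(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk IDs into a single ranking."""
    scores: Dict[str, float] = {}
    for ranked in result_lists:
        for rank, chunk_id in enumerate(ranked):
            # Higher-ranked chunks contribute more; k dampens the tail.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: merged = reciprocal_rank_fusion([bm25_ranked_ids, embedding_ranked_ids])
```

A chunk surfaced by both searches accumulates score from each list, so the merge step de-duplicates the result sets for free.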
This example uses LangChain as the framework, together with Azure OpenAI, Azure AI Document Intelligence, and Cohere.

```python
import os
from typing import List, Tuple
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai import AzureChatOpenAI
from langchain.prompts import ChatPromptTemplate
from rank_bm25 import BM25Okapi
import cohere
import logging
import time
from azure.ai.documentintelligence.models import DocumentAnalysisFeature
from langchain_community.document_loaders.doc_intelligence import AzureAIDocumentIntelligenceLoader

# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

load_dotenv('azure.env', override=True)
```

Now let's implement contextual embedding in a custom retriever. We use Azure AI Document Intelligence for PDF parsing, break documents into manageable chunks while maintaining context, and overlap chunk boundaries so that information at the edges is not lost:

```python
class ContextualRetrieval:
    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=800,
            chunk_overlap=100,
        )
        self.embeddings = AzureOpenAIEmbeddings(
            api_key=os.getenv("AZURE_OPENAI_API_KEY"),
            azure_deployment="text-embedding-ada-002",
            openai_api_version="2024-03-01-preview",
            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"]
        )
        self.llm = AzureChatOpenAI(
            api_key=os.environ["AZURE_OPENAI_API_KEY"],
            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
            azure_deployment="gpt-4o",
            temperature=0,
            max_tokens=None,
            timeout=None,
            max_retries=2,
        )
        self.cohere_client = cohere.Client(os.getenv("COHERE_API_KEY"))

    def load_pdf_and_parse(self, pdf_path: str) -> str:
        loader = AzureAIDocumentIntelligenceLoader(
            file_path=pdf_path,
            api_key=os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY"),
            api_endpoint=os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"),
            api_model="prebuilt-layout",
            api_version="2024-02-29-preview",
            mode="markdown",
            analysis_features=[DocumentAnalysisFeature.OCR_HIGH_RESOLUTION]
        )
        try:
            documents = loader.load()
            if not documents:
                raise ValueError("No content extracted from the PDF.")
            return " ".join([doc.page_content for doc in documents])
        except Exception as e:
            logging.error(f"Error while parsing the file '{pdf_path}': {str(e)}")
            raise

    def process_document(self, document: str) -> Tuple[List[Document], List[Document]]:
        if not document.strip():
            raise ValueError("The document is empty after parsing.")
        chunks = self.text_splitter.create_documents([document])
        contextualized_chunks = self._generate_contextualized_chunks(document, chunks)
        return chunks, contextualized_chunks

    def _generate_contextualized_chunks(self, document: str, chunks: List[Document]) -> List[Document]:
        contextualized_chunks = []
        for chunk in chunks:
            # Prepend the LLM-generated context to the raw chunk text
            context = self._generate_context(document, chunk.page_content)
            contextualized_content = f"{context}\n\n{chunk.page_content}"
            contextualized_chunks.append(Document(page_content=contextualized_content, metadata=chunk.metadata))
        return contextualized_chunks
```
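A quick sanity check of the parsing and chunking steps might look like this (the file name is a placeholder):

```python
cr = ContextualRetrieval()
raw_text = cr.load_pdf_and_parse("sample.pdf")  # placeholder path
chunks, contextualized_chunks = cr.process_document(raw_text)

# Each contextualized chunk is the generated context followed by the original text
print(len(chunks), "chunks")
print(contextualized_chunks[0].page_content[:300])
```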
The `_generate_context` method asks GPT-4o for a short, chunk-specific context, and the remaining methods build the BM25 index, generate answers, rerank with Cohere, and expand queries. They continue the `ContextualRetrieval` class:

```python
    def _generate_context(self, document: str, chunk: str) -> str:
        prompt = ChatPromptTemplate.from_template("""
You are an AI assistant specializing in document analysis. Your task is to provide brief, relevant context for a chunk of text from the given document.

Here is the document:
{document}

Here is the chunk we want to situate within the whole document:
{chunk}

Provide a concise context (2-3 sentences) for this chunk, considering the following guidelines:
1. Identify the main topic or concept discussed in the chunk.
2. Mention any relevant information or comparisons from the broader document context.
3. If applicable, note how this information relates to the overall theme or purpose of the document.
4. Include any key figures, dates, or percentages that provide important context.
5. Do not use phrases like "This chunk discusses" or "This section provides". Instead, directly state the context.

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.

Context:
""")
        messages = prompt.format_messages(document=document, chunk=chunk)
        response = self.llm.invoke(messages)
        return response.content

    def create_bm25_index(self, chunks: List[Document]) -> BM25Okapi:
        tokenized_chunks = [chunk.page_content.split() for chunk in chunks]
        return BM25Okapi(tokenized_chunks)

    def generate_answer(self, query: str, relevant_chunks: List[str]) -> str:
        prompt = ChatPromptTemplate.from_template("""
Based on the following information, please provide a concise and accurate answer to the question. If the information is not sufficient to answer the question, say so.

Question: {query}

Relevant information:
{chunks}

Answer:
""")
        messages = prompt.format_messages(query=query, chunks="\n\n".join(relevant_chunks))
        response = self.llm.invoke(messages)
        return response.content

    def rerank_results(self, query: str, documents: List[Document], top_n: int = 3) -> List[Document]:
        logging.info(f"Reranking {len(documents)} documents for query: {query}")
        doc_contents = [doc.page_content for doc in documents]

        max_retries = 3
        for attempt in range(max_retries):
            try:
                reranked = self.cohere_client.rerank(
                    model="rerank-english-v2.0",
                    query=query,
                    documents=doc_contents,
                    top_n=top_n
                )
                break
            except cohere.errors.TooManyRequestsError:
                if attempt < max_retries - 1:
                    logging.warning(f"Rate limit hit. Waiting for 60 seconds before retry {attempt + 1}/{max_retries}")
                    time.sleep(60)  # Wait for 60 seconds before retrying
                else:
                    logging.error("Rate limit hit. Max retries reached. Returning original documents.")
                    return documents[:top_n]

        logging.info(f"Reranking complete. Top {top_n} results:")
        reranked_docs = []
        for idx, result in enumerate(reranked.results):
            original_doc = documents[result.index]
            reranked_docs.append(original_doc)
            logging.info(f"  {idx+1}. Score: {result.relevance_score:.4f}, Index: {result.index}")
        return reranked_docs

    def expand_query(self, original_query: str) -> str:
        prompt = ChatPromptTemplate.from_template("""
You are an AI assistant specializing in document analysis. Your task is to expand the given query to include related terms and concepts that might be relevant for a more comprehensive search of the document.

Original query: {query}

Please provide an expanded version of this query, including relevant terms, concepts, or related ideas that might help in summarizing the full document. The expanded query should be a single string, not a list.

Expanded query:
""")
        messages = prompt.format_messages(query=original_query)
        response = self.llm.invoke(messages)
        return response.content
```
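The snippets that follow call a `process_query` helper that the post does not define. A minimal sketch consistent with how it is used might look like this (the candidate count of 10 is an assumption):

```python
# Hypothetical helper: BM25 retrieval, Cohere reranking, then answer generation.
def process_query(cr: ContextualRetrieval, query: str,
                  bm25_index: BM25Okapi, chunks: List[Document]) -> str:
    # Optionally expand the query first: query = cr.expand_query(query)
    scores = bm25_index.get_scores(query.split())
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
    candidates = [chunks[i] for i in top_indices]
    reranked = cr.rerank_results(query, candidates, top_n=3)
    answer = cr.generate_answer(query, [doc.page_content for doc in reranked])
    print(answer)
    return answer
```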
Now let's load a sample PDF, generate the contextual embeddings, and create two BM25 indexes: one over the regular chunks and one over the context-aware chunks.

```python
cr = ContextualRetrieval()
pdf_path = "1.pdf"
document = cr.load_pdf_and_parse(pdf_path)

# Process the document
chunks, contextualized_chunks = cr.process_document(document)

# Create BM25 indexes
contextualized_bm25_index = cr.create_bm25_index(contextualized_chunks)
normal_bm25_index = cr.create_bm25_index(chunks)
```

Now let's run the same query against both indexes and compare the results.

Regular index:

```python
original_query = "When does the term of the Agreement commence and how long does it last?"
print(f"\nOriginal Query: {original_query}")
process_query(cr, original_query, normal_bm25_index, chunks)
```

Context-aware index:

```python
original_query = "When does the term of the Agreement commence and how long does it last?"
print(f"\nOriginal Query: {original_query}")
process_query(cr, original_query, contextualized_bm25_index, contextualized_chunks)
```

You will likely get a better answer from the second run because of the contextual retriever.

Now let's evaluate this against a benchmark, using the Azure AI Evaluation SDK. First, load the dataset. You can generate ground truth as JSON Lines records like the following:

```json
{"chat_history": [], "question": "What is short-term memory in the context of the model?", "ground_truth": "Short-term memory involves utilizing in-context learning to learn."}
```

```python
import pandas as pd

df = pd.read_json(output_file, lines=True, orient="records")
df.head()
```

With the dataset loaded, run it against both the standard and the contextually embedded retrieval strategies.

```python
normal_answers = []
contextual_answers = []
for index, row in df.iterrows():
    normal_answers.append(process_query(cr, row["question"], normal_bm25_index, chunks))
    contextual_answers.append(process_query(cr, row["question"], contextualized_bm25_index, contextualized_chunks))
```

Finally, evaluate the answers against the ground truth. Here we use the similarity score; you can use other built-in or custom metrics (learn more [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)).

```python
from azure.ai.evaluation import SimilarityEvaluator

# Initializing the Similarity Evaluator
similarity_eval = SimilarityEvaluator(model_config)

df["answer"] = normal_answers
df['score'] = df.apply(lambda x: similarity_eval(
    response=x["answer"],
    ground_truth=x["ground_truth"],
    query=x["question"],
), axis=1)

df["answer_contextual"] = contextual_answers
df['score_contextual'] = df.apply(lambda x: similarity_eval(
    response=x["answer_contextual"],
    ground_truth=x["ground_truth"],
    query=x["question"],
), axis=1)
```
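The `SimilarityEvaluator` above takes a `model_config` that the original post does not show. For an Azure OpenAI judge model it is typically a dictionary like the following (the deployment name is a placeholder):

```python
# Assumed Azure OpenAI judge configuration for the evaluator (values are placeholders)
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}
```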
As you can see, contextual embeddings improve retrieval, and this is reflected in the similarity scores.

The contextual retrieval system described in this post represents a sophisticated approach to document analysis and question answering. By integrating several NLP techniques, including contextualization with GPT-4o, efficient indexing with BM25, reranking with Cohere models, and query expansion, the system not only retrieves relevant information but also understands and synthesizes it to provide accurate answers. Its modular architecture ensures flexibility, allowing individual components to be enhanced or replaced as better technology becomes available. As the field of natural language processing continues to advance, systems like this will become increasingly important in making large amounts of text more accessible, searchable, and actionable across a variety of domains.

References:

- [Azure AI Content Safety overview](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/overview)
- [Evaluate with the Azure AI Evaluation SDK](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)
- [Introducing Contextual Retrieval (Anthropic)](https://www.anthropic.com/news/contextual-retrieval)

Thank you,
Manoranjan Rajguru
https://www.linkedin.com/in/manoranjan-rajguru/