From Pixels to Intelligence: Introduction to OCR Free Vision RAG using Colpali For Complex Documents



In the rapidly evolving artificial intelligence landscape, the ability to understand and process complex documents is becoming increasingly important. Existing optical character recognition (OCR) systems are effective at extracting text from images, but often fall short at interpreting the complex visual elements that accompany textual information. In this blog we introduce ColPali, a groundbreaking approach that enhances the Retrieval-Augmented Generation (RAG) process by leveraging multi-vector retrieval via late-interaction mechanisms and Vision Language Models (VLMs), and take a closer look at how it is changing document understanding and search.

In practice, the parsing and preprocessing steps in these PDF document search pipelines have a significant, non-trivial impact on end-to-end performance.


Limitations of existing OCR

What is OCR?

Optical character recognition (OCR) is a technology that converts various types of documents, including scanned paper documents, PDFs, and images captured by digital cameras, into editable and searchable data. Although OCR has made significant advances in accuracy, it primarily focuses on text extraction and often overlooks the context and visual elements present in complex documents.

Challenges of Complex Documents

Complex documents such as financial reports, legal contracts, and academic papers often include:

  • Tables and charts: These elements convey important information that cannot be captured by text alone.
  • Images and diagrams: Visuals play an important role in understanding content, but are often ignored in traditional OCR systems.
  • Layout and Format: The arrangement of text and visual elements can have a significant impact on meaning, but OCR typically processes each element individually.

These limitations can cause traditional OCR to result in incomplete or misleading interpretations of complex documents.

What is ColPali?

ColPali builds on recent developments in VLMs, combining the capabilities of Large Language Models (LLMs) with Vision Transformers (ViTs). By feeding image patch embeddings through a language model, ColPali maps visual features into a latent space aligned with the textual content. This alignment is critical for effective retrieval because it ensures that the visual elements of the document contribute meaningfully to matching against the user's query.

Key features of ColPali

  1. Unified Vision Language Model (VLM):
    • ColPali utilizes VLMs such as PaliGemma to effectively interpret document images. These models are trained on massive datasets that include not only text, but also images, diagrams, and layouts.
    • ColPali can provide richer insights into complex documents by understanding the relationships between visual elements and text.
  2. Improved contextual understanding:
    • Unlike traditional OCR systems that treat text as isolated data points, ColPali analyzes the entire layout of the document.
    • This means recognizing how a table relates to surrounding text or how a diagram illustrates key concepts, enabling more accurate interpretation.
  3. Dynamic Retrieval-Augmented Generation (RAG):
    • ColPali is fully integrated into the RAG framework, enabling real-time information retrieval based on user queries.
    • This dynamic approach ensures that responses are not only relevant but also contextually rich, providing users with comprehensive insights.


In addition to improved accuracy, ColPali offers significant efficiency gains.

  • Simplified Indexing: ColPali accelerates the document indexing process by eliminating the need for complex preprocessing steps. Existing methods can be time-consuming due to the multiple steps involved in document parsing and chunking.

  • Low query latency: ColPali maintains low latency during queries, a critical requirement for real-time applications. The end-to-end learnable architecture optimizes the search process to ensure rapid responses to user queries.

Now let's implement this pipeline using Azure AI services.


Let's load the required libraries.

import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from PIL import Image
from io import BytesIO

from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.utils.colpali_processing_utils import process_images, process_queries
from colpali_engine.utils.image_utils import scale_image, get_base64_image

import os
from dotenv import load_dotenv
load_dotenv('azure.env', override=True)

Load ColPali model

if torch.cuda.is_available():
    device = torch.device("cuda")
    # Prefer bfloat16 on GPUs that support it; otherwise fall back to float16.
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    dtype = torch.float32
else:
    device = torch.device("cpu")
    dtype = torch.float32

Let's load the model and move it to the device selected above.

model_name = "vidore/colpali-v1.2"
model = ColPali.from_pretrained("vidore/colpaligemma-3b-pt-448-base", torch_dtype=type).eval()
model.load_adapter(model_name)
model = model.eval()
model.to(device)
processor = AutoProcessor.from_pretrained(model_name)

After loading the model, the first step in the pipeline is to extract the page images from the PDF.

import requests
from pdf2image import convert_from_path
from pypdf import PdfReader

def download_pdf(url):
    response = requests.get(url)
    if response.status_code == 200:
        return BytesIO(response.content)
    else:
        raise Exception(f"Failed to download PDF: Status code {response.status_code}")

def get_pdf_images(pdf_url):
    # Download the PDF
    pdf_file = download_pdf(pdf_url)
    # Save the PDF temporarily to disk (pdf2image requires a file path)
    temp_file = "temp.pdf"
    with open(temp_file, "wb") as f:
        f.write(pdf_file.read())
    reader = PdfReader(temp_file)
    page_texts = []
    for page_number in range(len(reader.pages)):
        page = reader.pages[page_number]
        text = page.extract_text()
        page_texts.append(text)
    images = convert_from_path(temp_file)
    assert len(images) == len(page_texts)
    return (images, page_texts)

Let's go ahead and download the PDF and extract its page images (note that pdf2image requires the Poppler utilities to be installed).

sample_pdfs = [
        {
            "title": "Attention Is All You Need",
            "url": "https://arxiv.org/pdf/1706.03762"
        }
]

The page images and page texts are loaded once per PDF:

for pdf in sample_pdfs:
  page_images, page_texts = get_pdf_images(pdf['url'])
  pdf['images'] = page_images
  pdf['texts'] = page_texts

Now let's create an embedding for each page image.

for pdf in sample_pdfs:
  page_embeddings = []
  dataloader = DataLoader(
        pdf['images'],
        batch_size=2,
        shuffle=False,
        collate_fn=lambda x: process_images(processor, x),
    )
  for batch_doc in tqdm(dataloader):
    with torch.no_grad():
      batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
      embeddings = model(**batch_doc)
      # ColPali emits one 128-dim vector per image patch; mean-pool across
      # patches so each page is stored as a single 128-dim vector that fits
      # a standard vector index.
      mean_embedding = torch.mean(embeddings, dim=1).float().cpu().numpy()
      page_embeddings.extend(mean_embedding)
  pdf['embeddings'] = page_embeddings

ColPali aims to remove a lot of complexity by directly using images (“screenshots”) of document pages during indexing.

The Vision LLM (PaliGemma-3B) encodes an image by splitting it into a series of patches and feeding them to a vision transformer.

At query time, the user query is encoded by the language model to obtain token embeddings. A ColBERT-style "late interaction" (LI) operation then efficiently matches query tokens against document patches. To compute the LI(query, document) score, we find, for each query term, the document patch with the most similar ColPali representation. The final query-document score is obtained by summing these maximum patch similarities over all query terms. Intuitively, this late-interaction operation allows for rich interactions between every query term and every document patch, while still benefiting from the fast matching and offline computation offloading that more standard (bi-encoder) embedding models support. A minimal sketch of this scoring is shown after this paragraph.
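
For illustration, here is a minimal sketch of the MaxSim late-interaction score, assuming un-pooled ColPali outputs of shape (n_query_tokens, 128) for the query and (n_patches, 128) for a page. This function is not part of the pipeline below, which mean-pools each page into a single vector so it can live in a standard vector index; it only shows what the full late-interaction score computes.

def late_interaction_score(query_embeddings: torch.Tensor,
                           page_embeddings: torch.Tensor) -> float:
    # Pairwise dot-product similarities between every query token and every
    # image patch: shape (n_query_tokens, n_patches).
    similarities = query_embeddings @ page_embeddings.T
    # MaxSim: for each query token keep only its best-matching patch ...
    max_per_token = similarities.max(dim=1).values
    # ... and sum over query tokens to get the final query-page score.
    return max_per_token.sum().item()

# Toy usage with random vectors standing in for real ColPali outputs.
q = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(1030, 128), dim=-1)
print(late_interaction_score(q, d))

Returning to the pipeline, we now assemble the documents to upload to the index, each carrying its page image, text, and pooled embedding.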

import numpy as np
lst_feed = []
for pdf in sample_pdfs:
    url = pdf['url']
    title = pdf['title']
    for page_number, (page_text, embedding, image) in enumerate(zip(pdf['texts'], pdf['embeddings'], pdf['images'])):
      base_64_image = get_base64_image(scale_image(image,640),add_url_prefix=False)   
      page = {
        "id": str(hash(url + str(page_number))),
        "url": url,
        "title": title,
        "page_number": page_number,
        "image": base_64_image,
        "text": page_text,
        "embedding": embedding.tolist()
      }
      lst_feed.append(page)

Now that the embeddings are computed, we need to save them to a vector store. We will use Azure AI Search as our vector store. Let's create a search index.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
    SearchableField,
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    SearchIndex
)

def create_pdf_search_index(endpoint: str, key: str, index_name: str) -> SearchIndex:
    # Initialize the search index client
    credential = AzureKeyCredential(key)
    index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

    # Define vector search configuration
    vector_search = VectorSearch(
            algorithms=[
                HnswAlgorithmConfiguration(
                    name="myHnsw",
                    parameters={
                        "m": 4,  # Default HNSW parameter
                        "efConstruction": 400,  # Default HNSW parameter
                        "metric": "cosine"
                    }
                )
            ],
            profiles=[
                VectorSearchProfile(
                    name="myHnswProfile",
                    algorithm_configuration_name="myHnsw",
                    vectorizer="myVectorizer"
                )
            ]
    )

    # Define the fields
    fields = [
            SimpleField(
                name="id",
                type=SearchFieldDataType.String,
                key=True,
                filterable=True
            ),
            SimpleField(
                name="url",
                type=SearchFieldDataType.String,
                filterable=True
            ),
            SearchableField(
                name="title",
                type=SearchFieldDataType.String,
                searchable=True,
                retrievable=True
            ),
            SimpleField(
                name="page_number",
                type=SearchFieldDataType.Int32,
                filterable=True,
                sortable=True
            ),
            SimpleField(
                name="image",
                type=SearchFieldDataType.String,
                retrievable=True
            ),
            SearchableField(
                name="text",
                type=SearchFieldDataType.String,
                searchable=True,
                retrievable=True
            ),
            SearchField(
                name="embedding",
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True,
                vector_search_dimensions=128,
                vector_search_profile_name="myHnswProfile"
            )
        ]

    # Create the index definition
    index = SearchIndex(
        name=index_name,
        fields=fields,
        vector_search=vector_search
    )

    # Create the index in Azure Cognitive Search
    result = index_client.create_or_update_index(index)
    return result
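
With the index definition in place, we can create the index. Here is a minimal sketch, assuming the endpoint and key come from environment variables; the variable names and index name below are placeholders for your own configuration, not part of the original code.

# Assumed environment variable names -- adjust to match your own azure.env file.
SEARCH_ENDPOINT = os.environ["AZURE_SEARCH_ENDPOINT"]
SEARCH_KEY = os.environ["AZURE_SEARCH_KEY"]
INDEX_NAME = "colpali-pdf-index"  # example index name

index = create_pdf_search_index(SEARCH_ENDPOINT, SEARCH_KEY, INDEX_NAME)
print(f"Index '{index.name}' is ready.")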

Once the index is created, you can upload the documents prepared earlier.

from azure.search.documents import SearchClient
credential = AzureKeyCredential(SEARCH_KEY)
index_client = SearchClient(endpoint=SEARCH_ENDPOINT, credential=credential, index_name=INDEX_NAME)

index_client.upload_documents(documents=lst_feed)

Once the documents are uploaded, the next step is processing user queries. As shown in the code below, we generate an embedding for the input query using the same ColPali model.

def process_query(query: str, processor: AutoProcessor, model: ColPali) -> np.ndarray:
    # The ColPali processor expects an image alongside the text, so we pass a
    # blank placeholder image with the query.
    mock_image = Image.new('RGB', (224, 224), color="white")

    inputs = processor(text=query, images=mock_image, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        embeddings = model(**inputs)

    # Mean-pool the query token embeddings into a single 128-dim vector,
    # matching how the page embeddings were stored in the index.
    return torch.mean(embeddings, dim=1).float().cpu().numpy().tolist()[0]
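
As a quick sanity check (not part of the original walkthrough), the pooled query embedding should have the same 128 dimensions as the embedding field defined in the index.

# Example query text; any string works here.
example_embedding = process_query("How does multi-head attention work?", processor, model)
print(len(example_embedding))  # Expected: 128, matching vector_search_dimensions above.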

Now let’s create a search client.

from IPython.display import display, HTML
from openai import AzureOpenAI
from azure.search.documents.models import VectorizedQuery

client = AzureOpenAI(api_key=os.environ['AZURE_OPENAI_API_KEY'],
                    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT'],
                    api_version=os.environ['OPENAI_API_VERSION'])

search_client = SearchClient(
    SEARCH_ENDPOINT,
    index_name=INDEX_NAME,
    credential=credential,
)

def display_query_results(query, response, hits=5):
    html_content = f"<h3>Query text: '{query}', top results:</h3>"
    for i, hit in enumerate(response):
        title = hit["title"]
        url = hit["url"]
        page = hit["page_number"]
        image = hit["image"]
        score = hit["@search.score"]
        html_content += f"<h4>PDF Result {i + 1}</h4>"
        html_content += f"<p>Title: {title}, page {page+1} with score {score:.2f}</p>"
        html_content += (
            f'<img src="data:image/jpg;base64,{image}" style="max-width:100%;">'
        )
    display(HTML(html_content))

After retrieving the most relevant page images, you can pass them to a VLM (GPT-4o in this example) along with the user's question.

query = "What is the projected global energy related co2 emission in 2030?"
vector_query = VectorizedQuery(
    vector=process_query(query, processor, model),
    k_nearest_neighbors=3,
    fields="embedding",
)
results = search_client.search(search_text=None, vector_queries=[vector_query])
#display_query_results(query, results)
response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
        "role": "system",
        "content": """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.You will be given a mixed of text, tables, and image(s) usually of charts or graphs."""
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": query},
        *map(lambda x: {"type": "image_url", "image_url": {"url": f'data:image/jpg;base64,{x["image"]}', "detail": "low"}}, results),
      ],
    }
  ],
  max_tokens=300,
)

print("Answer:" + response.choices[0].message.content)

Conclusion

As we move into an era where data becomes increasingly complex and multifaceted, tools like multimodal LLMs are essential for extracting valuable insights from your documents. By integrating advanced vision language models with retrieval-augmented generation techniques, ColPali sets a new standard for document understanding that goes beyond traditional OCR limitations. Whether you're a researcher looking to streamline your workflow or a developer interested in advancing AI, embracing technologies like VLMs and ColPali will undoubtedly improve your ability to navigate complex information environments. Stay tuned for further updates as we continue to explore the exciting intersection of AI and document processing!

** Check the license of the open-source model before using it.

Learn more about Azure AI Search: https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search

Model: https://huggingface.co/vidore/colpali-v1.2

Thank you

Manoranjan Rajguru

https://www.linkedin.com/in/manoranjan-rajguru/




