
Integrating vision into RAG applications



Retrieval Augmented Generation (RAG) is a widely used technique for grounding LLM answers in your own data sources. But what if your knowledge base also contains images, such as graphs or photos? By adding a multimodal model to your RAG flow, you can get answers based on image sources as well!

Our most popular RAG solution accelerator, azure-search-openai-demo, now has an optional feature for RAG on image sources. In the example below, the app answers a question that requires correctly interpreting a bar graph:

[Screenshot: the app answering a question by interpreting a bar graph]

In this blog post, we'll walk through the changes we made to enable multimodal RAG, both so that developers using the solution accelerator understand how it works, and so that developers using other RAG solutions can adopt multimodal support themselves.

First, let's look at the two essential components: a multimodal LLM and a multimodal embedding model.

Multimodal LLM

Azure now offers several multimodal LLMs: gpt-4o and gpt-4o-mini through the Azure OpenAI Service, and Phi-3.5-vision-instruct through the Azure AI Model Catalog. These models can accept both images and text and return a text response. (In the future, we may see LLMs that accept audio input and return non-text output as well!)

For example, an API call to the gpt-4o model can include a user question along with an image URL:

{
  "role": "user",
  "content": [
    { 
      "type": "text", 
      "text": "Whats in this image?" 
    },
    { 
      "type": "image_url", 
      "image_url": { "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } 
    } 
  ]
}

The image URL can be a full HTTP URL, if the image happens to be available on the public web, or it can be a base64-encoded data URI, which is especially helpful for images stored privately.
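
To make the data URI approach concrete, here is a minimal Python sketch (not taken from the accelerator) that sends a local image to gpt-4o through the openai SDK. The endpoint, API key, deployment name, and file name are placeholders.

# Minimal sketch: send a local image to gpt-4o as a base64 data URI,
# using the openai SDK against Azure OpenAI. All credentials are placeholders.
import base64
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-OPENAI-RESOURCE.openai.azure.com",
    api_key="YOUR-OPENAI-KEY",
    api_version="2024-06-01",
)

# Read the image file and encode it as a data URI.
with open("chart.png", "rb") as f:
    data_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # your gpt-4o deployment name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }
    ],
)
print(response.choices[0].message.content)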

For more examples of working with gpt-4o, see openai-chat-vision-quickstart, a repository for deploying a simple Chat + Vision app to Azure, which also includes Jupyter notebooks for additional scenarios.

Multimodal embedding model

Azure also offers a multimodal embeddings API as part of the Azure AI Vision API, which can compute embeddings in a multimodal space for both text and images. The API is powered by the state-of-the-art Florence model from Microsoft Research.

For example, this API call returns the embedding vector for an image:

curl.exe -v -X POST "https://<YOUR-VISION-ENDPOINT>/computervision/retrieval:vectorizeImage?api-version=2024-02-01-preview&model-version=2023-04-15" \
  -H "Content-Type: application/json" \
  -H "Ocp-Apim-Subscription-Key: <YOUR-VISION-KEY>" \
  --data-ascii "{ 'url':'https://learn.microsoft.com/azure/ai-services/computer-vision/media/quickstarts/presentation.png' }"

Because both images and text can be embedded in the same space, we can use vector search to find images that are similar to a user's query. For example, this notebook sets up a basic multimodal image search using Azure AI Search.
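
As a rough illustration of that pairing, here is a minimal Python sketch (the endpoint, key, and query text are placeholders) that computes an image embedding and a text embedding with the same API and compares them with cosine similarity.

# Minimal sketch: compute multimodal embeddings for an image and a text query
# with the Azure AI Vision API, then compare them with cosine similarity.
import requests
import numpy as np

ENDPOINT = "https://YOUR-VISION-RESOURCE.cognitiveservices.azure.com"  # placeholder
KEY = "YOUR-VISION-KEY"  # placeholder
PARAMS = {"api-version": "2024-02-01-preview", "model-version": "2023-04-15"}
HEADERS = {"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/json"}

def vectorize_image(image_url: str) -> np.ndarray:
    response = requests.post(
        f"{ENDPOINT}/computervision/retrieval:vectorizeImage",
        params=PARAMS, headers=HEADERS, json={"url": image_url})
    response.raise_for_status()
    return np.array(response.json()["vector"])

def vectorize_text(text: str) -> np.ndarray:
    response = requests.post(
        f"{ENDPOINT}/computervision/retrieval:vectorizeText",
        params=PARAMS, headers=HEADERS, json={"text": text})
    response.raise_for_status()
    return np.array(response.json()["vector"])

image_vec = vectorize_image("https://learn.microsoft.com/azure/ai-services/computer-vision/media/quickstarts/presentation.png")
query_vec = vectorize_text("a presentation slide")  # illustrative query
similarity = np.dot(image_vec, query_vec) / (np.linalg.norm(image_vec) * np.linalg.norm(query_vec))
print(f"Cosine similarity: {similarity:.3f}")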

Multimodal RAG

With these two multimodal models available, we were able to give our RAG solution the ability to include image sources in both the retrieval and answering steps.

At a high level, we have made the following changes:

  • Search index: We added a new field to the Azure AI Search index to store the embedding returned by the multimodal Azure AI Vision API (while keeping the existing field that stores the OpenAI text embeddings).
  • Data ingestion: In addition to the usual PDF ingestion flow, we also convert each PDF document page to an image, save that image with the file name rendered across its top, and add its embedding to the index.
  • Question answering: We search the index using both the text and multimodal embeddings, then send both text and images to gpt-4o and ask it to answer the question based on both kinds of sources.
  • Citations: The frontend displays both image sources and text sources, so that users can understand how the answer was generated.

Let’s take a closer look at each of the above changes.

Search Index

For the standard RAG approach on documents, we use an Azure AI Search index that stores the following fields:

  • content: The text content extracted by Azure Document Intelligence, which can process a wide range of file types and can even OCR images inside files.
  • sourcefile: The file name of the source document.
  • sourcepage: The file name plus the page number, for more precise citations.
  • embedding: A 1536-dimensional vector field that stores the embedding of the content field, computed with the text-only OpenAI ada-002 model.

For RAG on images, we add one additional field:

  • imageEmbedding: A 1024-dimensional vector field that stores the embedding of the image version of the document page, computed with the AI Vision vectorizeImage API endpoint.
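
For illustration, here is a minimal sketch of how those two vector fields could be declared with the azure-search-documents SDK; the vector search profile name is a placeholder and the non-vector fields are omitted.

# Minimal sketch: the two vector fields in the Azure AI Search index.
from azure.search.documents.indexes.models import SearchField, SearchFieldDataType

fields = [
    # ...the content, sourcefile, and sourcepage fields go here...
    SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,  # OpenAI ada-002 text embedding
        vector_search_profile_name="my-vector-profile",  # placeholder profile name
    ),
    SearchField(
        name="imageEmbedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1024,  # Azure AI Vision multimodal embedding
        vector_search_profile_name="my-vector-profile",
    ),
]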

Data ingestion

For the standard RAG approach, data ingestion involves the following steps:

  1. Extract text from the documents using Azure Document Intelligence.
  2. Split the text into chunks using a splitting strategy. This keeps each chunk at a reasonable size, since sending too much content to the LLM at once tends to degrade response quality.
  3. Upload the original files to Azure Blob Storage.
  4. Compute the ada-002 embedding of the content field.
  5. Add each chunk to the Azure AI Search index (steps 4 and 5 are sketched below).
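
A minimal sketch of those last two steps, assuming placeholder endpoints, keys, and deployment names; the document key and the exact sourcepage format are illustrative assumptions, not the accelerator's exact values.

# Minimal sketch: embed one chunk with ada-002 and upload it to the index.
from openai import AzureOpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

openai_client = AzureOpenAI(
    azure_endpoint="https://YOUR-OPENAI-RESOURCE.openai.azure.com",
    api_key="YOUR-OPENAI-KEY",
    api_version="2024-06-01",
)
search_client = SearchClient(
    endpoint="https://YOUR-SEARCH-SERVICE.search.windows.net",
    index_name="YOUR-INDEX-NAME",
    credential=AzureKeyCredential("YOUR-SEARCH-KEY"),
)

# Step 4: compute the ada-002 embedding for one chunk of text.
chunk_text = "Revenue grew 12% year over year, driven by cloud services."
embedding = openai_client.embeddings.create(
    model="text-embedding-ada-002",  # your embedding deployment name
    input=chunk_text,
).data[0].embedding

# Step 5: add the chunk to the search index (key and field values are illustrative).
search_client.upload_documents([{
    "id": "financial-report-pdf-page-3",
    "content": chunk_text,
    "sourcefile": "financial-report.pdf",
    "sourcepage": "financial-report.pdf#page=3",
    "embedding": embedding,
}])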

For RAG on images, we add two more steps before indexing: uploading an image version of each document page to Blob Storage, and computing a multimodal embedding for each of those images.

Creating citable images

The image is not an exact copy of the document page. Instead, it has the original document's file name written in its top left corner, like this:

[Image: a rendered document page with the source file name written along a border at the top]

This important step is what allows the GPT vision model to provide citations in its answers later on. Technically, we achieve this by first using the PyMuPDF Python package to convert each document page to an image, and then using the Pillow Python package to add a top border to the image and write the file name onto it.
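
Here is a minimal sketch of that conversion (not the accelerator's exact code; the DPI, banner height, and the SourceFileName prefix from the prompt shown later are assumptions):

# Minimal sketch: render a PDF page with PyMuPDF, then use Pillow to add a
# white banner at the top and write the source file name onto it.
import io
import fitz  # PyMuPDF
from PIL import Image, ImageDraw

def page_to_citable_image(pdf_path: str, page_number: int) -> Image.Image:
    doc = fitz.open(pdf_path)
    pixmap = doc[page_number].get_pixmap(dpi=150)  # render the page as a bitmap
    page_image = Image.open(io.BytesIO(pixmap.tobytes("png")))

    # Create a slightly taller canvas with a white banner across the top.
    banner_height = 40  # assumed height
    canvas = Image.new("RGB", (page_image.width, page_image.height + banner_height), "white")
    canvas.paste(page_image, (0, banner_height))

    # Write the file name near the top left corner, where the prompt later
    # tells the model to look for the citation text.
    draw = ImageDraw.Draw(canvas)
    draw.text((10, 10), f"SourceFileName:{pdf_path}", fill="black")
    return canvas

page_to_citable_image("financial-report.pdf", page_number=0).save("financial-report-page-0.png")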

Question answering

Now that we have citable images in a Blob Storage container and multimodal embeddings in the AI Search index, users can ask questions whose answers are found in the images.

There are two main question flows in the RAG app: one for "single-turn" questions, and one for "multi-turn" questions that incorporates as much of the conversation history as fits into the context window. To keep this explanation simple, we'll focus on the single-turn flow.

The single-turn RAG flow for documents is:

[Diagram: single-turn RAG flow for documents]

  1. Receive the user's question from the frontend.
  2. Compute an embedding of the user's question using the OpenAI ada-002 model.
  3. Use the question to retrieve matching document chunks from the Azure AI Search index, with a hybrid search that performs a keyword search on the question text and a vector search on the question embedding (see the sketch after this list).
  4. Pass the retrieved document chunks and the original user question to the gpt-3.5 model, with a prompt instructing it to stick to the sources and to provide citations in a particular format.
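
A minimal sketch of step 3, reusing the search client from the ingestion sketch above; compute_ada002_embedding is a hypothetical helper wrapping the embeddings call shown earlier.

# Minimal sketch: hybrid search combining keyword search on the question text
# with a vector search on the question embedding.
from azure.search.documents.models import VectorizedQuery

question = "What does a product manager do?"
question_embedding = compute_ada002_embedding(question)  # hypothetical helper, 1536-dim

results = search_client.search(
    search_text=question,  # keyword half of the hybrid search
    vector_queries=[
        VectorizedQuery(vector=question_embedding, fields="embedding", k_nearest_neighbors=3),
    ],
    top=3,
)
for result in results:
    print(result["sourcepage"], result["content"][:80])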

The single-turn RAG flow for documents and images is:

[Diagram: single-turn RAG flow for documents and images]

  1. Receive the user's question from the frontend.
  2. Compute an embedding of the user's question using the OpenAI ada-002 model, plus an additional embedding using the AI Vision API multimodal model.
  3. Use the question to retrieve matching document chunks from the Azure AI Search index, with a hybrid multi-vector search that adds a vector search on the imageEmbedding field using the additional embedding (see the sketch after this list). That way, the underlying vector search algorithm finds results that are semantically similar to the text of the documents, but also semantically similar to any images in the documents (for example, "What trends are increasing?" might match a chart with a line going up and to the right).
  4. For each document chunk returned in the search results, convert its Blob image URL into a base64-encoded data URI. Pass both the text content and the image URIs to the GPT vision model, along with this prompt explaining how to find and format citations:
    The documents contain text, graphs, tables and images. 
    
    Each image source has the file name in the top left corner of the image with coordinates (10,10) pixels and is in the format SourceFileName: 
    
    Each text source starts in a new line and has the file name followed by colon and the actual information. Always include the source name from the image or text for each fact you use in the response in the format: [filename]  
    
    Answer the following question using only the data provided in the sources below. 
    
    The text and image source can be the same file name, don't use the image title when citing the image source, only use the file name as mentioned. 
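
A minimal sketch of the multi-vector search in step 3 and the message construction in step 4, reusing the search client and the compute_ada002_embedding and vectorize_text helpers from the sketches above; blob_image_as_data_uri is a hypothetical helper that downloads a page image from Blob Storage and base64-encodes it.

# Minimal sketch: hybrid multi-vector search over both embedding fields,
# then a user message that interleaves text chunks and page images.
from azure.search.documents.models import VectorizedQuery

question = "What trends are increasing?"
text_vector = compute_ada002_embedding(question)      # 1536-dim text embedding
image_vector = vectorize_text(question).tolist()      # 1024-dim multimodal embedding

results = search_client.search(
    search_text=question,  # keyword half of the hybrid search
    vector_queries=[
        VectorizedQuery(vector=text_vector, fields="embedding", k_nearest_neighbors=3),
        VectorizedQuery(vector=image_vector, fields="imageEmbedding", k_nearest_neighbors=3),
    ],
    top=3,
)

# Build the user message for the GPT vision model: the question, each chunk's
# text content, and each chunk's page image as a base64 data URI.
user_content = [{"type": "text", "text": question}]
for result in results:
    user_content.append({"type": "text", "text": f"{result['sourcepage']}: {result['content']}"})
    user_content.append({"type": "image_url", "image_url": {"url": blob_image_as_data_uri(result)}})  # hypothetical helper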

Now users can ask questions whose answers are found entirely within images! This can be great for domains with lots of diagrams, like finance.

Considerations

While we’ve seen some really interesting use cases with this multimodal RAG approach, there’s still a lot of room for exploration to improve the experience.

More file types: Our repository currently implements image generation only for PDFs, but developers are now ingesting many more formats, including image files like PNG and JPEG as well as non-image files like HTML and docx. We would love to add multimodal RAG support for more file formats with the help of the community.

More selective embeddings: Our ingestion flow uploads an image for *every* PDF page, but many pages may lack visual content, which can negatively impact vector search results. For example, if a PDF contains completely blank pages, and the index stores embeddings for those blank pages, we found that vector search often retrieves blank pages. Perhaps in a multimodal space, “blank” is considered similar to everything else. We considered approaches such as using a vision model during the ingestion step to determine whether an image is meaningful, or using that model to generate highly descriptive captions for the images instead of storing the image embeddings themselves.

Image extraction: Another approach would be to extract the images from document pages and store each image separately. This could be helpful for documents that contain multiple distinct images serving different purposes on a page, since it would allow the LLM to focus on only the most relevant images.

We would appreciate your help in experimenting with RAG on images, sharing how well it works for your domain, and suggesting improvements. Head over to our repo and follow the steps to deploy it with the optional GPT vision feature enabled!




