# Multimodal Knowledge Extraction and Retrieval System for Generative AI Agents and RAG Systems

*September 16, 2024*

The rapid development of AI has given rise to powerful tools for knowledge retrieval and question answering, especially with the emergence of retrieval-augmented generation (RAG) systems. In this blog post, I present my capstone project, created as part of UCL's IXN program in collaboration with Microsoft. The project aims to improve RAG systems by adding multimodal knowledge extraction and retrieval capabilities, allowing AI agents to process both textual and visual data to provide more accurate and contextual responses. In this post, I describe the goals of the project, the development process, the technical implementation, and the results.

## Project Overview

The main goal of this project was to improve RAG performance by improving the way multimodal data is extracted, stored, and retrieved. Current RAG systems rely primarily on text-based data, which limits their ability to generate accurate responses when queries require a combination of text and images. To address this, I developed a system that extracts, processes, and searches multimodal data from Wikimedia, enabling AI agents to generate more accurate, grounded, and contextual answers.

Key features include:

- **Multimodal knowledge extraction:** Preprocesses Wikimedia data (text, images, tables), passes it through a transformation pipeline, and stores it in vector and graph databases for efficient retrieval.
- **Dynamic knowledge retrieval:** A custom query engine combined with an agentic approach using ReAct agents ensures flexible and accurate information retrieval by dynamically selecting the most appropriate tools and strategies for each query.

This project began with the challenge of addressing the limitations of existing RAG systems, particularly in handling visual data and providing accurate responses. After reviewing a number of technologies, I designed a system architecture that supports both text and image data. Throughout the process, components were refined to ensure compatibility between LlamaIndex, Qdrant, and Neo4j, and to optimise performance when managing large data sets. Key challenges were refactoring the system for Wikimedia's vast data sets, particularly image processing and Dockerisation. These were addressed by iteratively refining the system architecture to ensure efficient multimodal data handling and reliable deployment across environments.

## Implementation Overview

This project improves the retrieval and response generation of a RAG system by integrating textual and visual data. The system's architecture is divided into two main processes:

- **Knowledge extraction:** Data is fetched from Wikimedia and converted into text and image embeddings. These embeddings are stored in Qdrant for efficient retrieval, while Neo4j captures the relationships between nodes to preserve the data's structure.
- **Knowledge search:** A dynamic query engine processes user queries and retrieves data from Qdrant (using vector search) and Neo4j (via graph traversal). Techniques such as query expansion, re-ranking, and cross-referencing ensure that the most relevant information is returned.
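As a concrete preview of the extraction process, here is a minimal sketch of pulling one page's plain-text extract and image list from the MediaWiki API. The parameters are standard MediaWiki API usage; the project's actual extraction code also relies on web scraping and is more involved.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_page(title: str) -> dict:
    """Fetch the plain-text extract and image list for one Wikipedia page."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "extracts|images",  # page text plus the files it embeds
        "explaintext": 1,           # return the extract as plain text
        "format": "json",
    }
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    pages = response.json()["query"]["pages"]
    return next(iter(pages.values()))  # a single page was requested

page = fetch_page("Retrieval-augmented generation")
print(page["title"], "-", len(page.get("extract", "")), "characters of text")
print([img["title"] for img in page.get("images", [])][:5])
```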
*System architecture diagram*

## Technology Stack

The following technologies were used to build and deploy the system:

- **Python:** Core programming language for the data pipelines.
- **LlamaIndex:** Framework for multimodal data indexing, transformation, and retrieval.
- **Qdrant:** Vector database for embedding-based similarity search.
- **Neo4j:** Graph database used to store and manage relationships between data entities.
- **Azure OpenAI (GPT-4o):** Used for multimodal input processing; the model is deployed via Azure App Services.
- **text-embedding-ada-002:** Model for generating text embeddings.
- **Azure Computer Vision:** Used to generate image embeddings.
- **Gradio:** Provides an interactive interface for querying the system.
- **Docker and Docker Compose:** Used for containerization and orchestration to ensure consistent deployments.

## Implementation Details

### Multimodal knowledge extraction

The system starts by retrieving both textual and visual data from Wikimedia using the Wikimedia API and web scraping techniques. The key steps in the knowledge extraction pipeline are:

- **Data preprocessing:** Organise the text, categorise images (for example, plots versus photographs), and structure tables, so that each data type can be processed appropriately in the transformation step.
- **Creating and transforming nodes:** Initial LlamaIndex nodes are created from this data and then passed through a transformation pipeline powered by GPT-4o models deployed via Azure OpenAI:
  - **Text and table transformation:** Text data is cleaned and split into smaller chunks using semantic chunking, and new derived nodes are created from transformations such as key-entity extraction and table analysis. Each node has a unique LlamaIndex ID and carries metadata such as title, context, and relationships that reflect both the hierarchy of the Wikimedia page and the parent-child links to the newly transformed nodes.
  - **Image transformation:** Images are processed to generate descriptions, perform plot analysis, and identify key objects depending on the image type, producing new text nodes.
- **Generating embeddings:** The final step of the pipeline generates embeddings for the images and transformed text nodes.
  - **Text embeddings:** Generated with the text-embedding-ada-002 model deployed via Azure OpenAI on Azure App Services.
  - **Image embeddings:** Created using the Azure Computer Vision service.
- **Storage:** Both text and image embeddings are stored in Qdrant, with the corresponding node IDs in the payload for fast lookups. The full set of nodes and their relationships is stored in Neo4j.

*Neo4j graph (left) and an enlarged portion of the graph (right)*

### Knowledge Search

The search process involves several main steps (code sketches for both the extraction and search steps follow this section):

- **Query expansion:** The system widens the search space by generating variations of the original query to collect more relevant data.
- **Vector search:** The expanded queries are passed to Qdrant for similarity search using cosine similarity.
- **Re-ranking and cross-referencing:** Results are re-ranked by relevance. Nodes retrieved from Qdrant carry a LlamaIndex node ID in their payload, which is used to fetch the corresponding nodes from Neo4j and traverse the graph back to the nodes holding the original Wikimedia data, ensuring that the final response is grounded solely in original Wikipedia content.
- **ReAct agent integration:** A ReAct agent dynamically manages the search process by selecting tools based on the query context. It integrates with a custom query engine to balance AI-generated insights with raw data from Neo4j and Qdrant.
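To make the extraction pipeline concrete, here is a minimal sketch of node creation and semantic chunking with LlamaIndex, reusing the `page` fetched earlier. The endpoint and key values are placeholders, and the project's real pipeline adds further transformations (entity extraction, table analysis, image-derived nodes):

```python
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

# Embedding model used both for semantic chunking and, later, for indexing.
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="text-embedding-ada-002",  # your Azure deployment name
    api_key="<azure-openai-key>",
    azure_endpoint="https://<your-resource>.openai.azure.com/",
)

# Semantic chunking: split where the embedding similarity between adjacent
# sentences drops, rather than at fixed character counts.
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(buffer_size=1, embed_model=embed_model),
    ]
)

docs = [Document(text=page["extract"], metadata={"title": page["title"]})]
nodes = pipeline.run(documents=docs)
print(f"{len(nodes)} nodes; first node ID: {nodes[0].node_id}")
```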
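The two embedding calls might look roughly like this. Deployment names, API versions, and endpoints are placeholders; in particular, using the Azure AI Vision image-retrieval ("vectorize") endpoint is my assumption about how the image embeddings are produced:

```python
import requests
from openai import AzureOpenAI

text_client = AzureOpenAI(
    api_key="<azure-openai-key>",
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com/",
)

def embed_text(text: str) -> list[float]:
    """1536-dimensional vector from the text-embedding-ada-002 deployment."""
    result = text_client.embeddings.create(
        model="text-embedding-ada-002",  # Azure deployment name
        input=text,
    )
    return result.data[0].embedding

def embed_image(image_bytes: bytes) -> list[float]:
    """Image vector from the Azure AI Vision image-retrieval endpoint."""
    response = requests.post(
        "https://<vision-resource>.cognitiveservices.azure.com"
        "/computervision/retrieval:vectorizeImage",
        params={"api-version": "2024-02-01", "model-version": "2023-04-15"},
        headers={
            "Ocp-Apim-Subscription-Key": "<vision-key>",
            "Content-Type": "application/octet-stream",
        },
        data=image_bytes,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["vector"]
```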
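Storage can then be sketched as follows: vectors go to Qdrant with the LlamaIndex node ID in the payload, while nodes and their parent-child relationships go to Neo4j. Collection, label, and relationship names are illustrative assumptions, not the project's actual schema:

```python
import uuid
from neo4j import GraphDatabase
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

qdrant = QdrantClient(host="localhost", port=6333)
qdrant.recreate_collection(
    collection_name="wiki_text",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def store_embedding(node_id: str, vector: list[float], text: str) -> None:
    """Upsert one vector; the LlamaIndex node ID rides along in the payload."""
    qdrant.upsert(
        collection_name="wiki_text",
        points=[PointStruct(
            id=str(uuid.uuid4()),
            vector=vector,
            payload={"node_id": node_id, "text": text},
        )],
    )

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "<password>"))

def store_relationship(parent_id: str, child_id: str, title: str) -> None:
    """Mirror the page hierarchy / parent-child links in the graph."""
    with driver.session() as session:
        session.run(
            "MERGE (p:Node {id: $parent}) "
            "MERGE (c:Node {id: $child}) SET c.title = $title "
            "MERGE (p)-[:HAS_CHILD]->(c)",
            parent=parent_id, child=child_id, title=title,
        )
```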
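On the search side, a simplified sketch of query expansion, cosine-similarity search, re-ranking, and the Neo4j cross-reference step could look like this. It reuses the `qdrant`, `driver`, and `embed_text` objects from the sketches above; the expansion prompt, the `original` flag, and the Cypher pattern are simplified assumptions:

```python
from llama_index.llms.azure_openai import AzureOpenAI as AzureOpenAILLM

# GPT-4o deployment used for query expansion (and, later, the agent).
llm = AzureOpenAILLM(
    engine="gpt-4o",  # Azure deployment name
    api_key="<azure-openai-key>",
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_version="2024-02-01",
)

def expand_query(query: str, n: int = 3) -> list[str]:
    """Use the LLM to generate paraphrased variants of the query."""
    prompt = f"Rewrite this search query {n} different ways, one per line:\n{query}"
    variants = llm.complete(prompt).text.strip().splitlines()
    return [query] + [v.strip() for v in variants if v.strip()][:n]

def retrieve(query: str, top_k: int = 5) -> list[str]:
    """Expanded vector search in Qdrant, then cross-reference into Neo4j."""
    hits = []
    for variant in expand_query(query):
        hits += qdrant.search(
            collection_name="wiki_text",
            query_vector=embed_text(variant),  # cosine similarity search
            limit=top_k,
        )
    # Re-rank the pooled hits by similarity score and keep the best ones.
    best = sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]

    # Cross-reference: walk the graph back to the original Wikimedia nodes.
    # Assumes source nodes were flagged `original = true` during extraction.
    texts = []
    with driver.session() as session:
        for hit in best:
            record = session.run(
                "MATCH (src:Node)-[:HAS_CHILD*0..]->(n:Node {id: $id}) "
                "WHERE src.original = true "
                "RETURN src.text AS text LIMIT 1",
                id=hit.payload["node_id"],
            ).single()
            if record:
                texts.append(record["text"])
    return texts
```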
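Finally, the retrieval function can be exposed to a LlamaIndex ReAct agent as a tool. The project integrates a custom query engine; this sketch swaps in a plain function tool to show the wiring, with the tool name and description as assumptions:

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

# Expose the retrieval sketch above as a tool the agent can choose to call.
wiki_tool = FunctionTool.from_defaults(
    fn=retrieve,
    name="wikimedia_knowledge_search",
    description=(
        "Searches the multimodal Wikimedia knowledge base (text chunks and "
        "image descriptions) and returns grounded source passages."
    ),
)

agent = ReActAgent.from_tools([wiki_tool], llm=llm, verbose=True)
response = agent.chat("What do the extracted pages say about <your topic>?")
print(response)
```

The agent decides per query whether to call the tool, which mirrors the dynamic tool-selection behaviour described above.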
## Dockerization using Docker Compose

To ensure consistent deployment across environments, the entire application is containerized with Docker. Docker Compose orchestrates the containers, including the knowledge extractor, the retriever, and the Neo4j and Qdrant services. This setup simplifies the deployment process and improves scalability (a sketch of a possible compose file appears at the end of this post).

*Docker containers*

## Results and Achievements

The system effectively improves the grounding and accuracy of responses generated by the RAG system. It integrates multimodal data to provide contextually relevant answers, especially in scenarios where visual information matters. The combination of Qdrant and Neo4j has proven very efficient, enabling fast searches and accurate results. A user-friendly Gradio interface makes it easy to evaluate the improvements by letting users interact with the system and compare its responses with standard LLM outputs. Here is a snapshot of the Gradio UI:

*Gradio UI snapshot*

## Future Development

Several directions for future development have been identified to further enhance the system:

- **Extending the agent framework:** Future versions could integrate an autonomous tool that determines whether the existing knowledge base is sufficient for a query. If it is judged insufficient, the system could automatically start a knowledge extraction run to fill the gap, making the system more adaptable and self-sufficient.
- **Knowledge graph with entities:** Extend the knowledge graph to include key entities such as people, locations, events, or other entities relevant to the domain. Integrating these entities adds depth and precision to the search process, providing a more comprehensive and interconnected knowledge base and improving both the relevance and accuracy of results.
- **Enhanced multimodality:** Future iterations could deepen the system's image-processing capabilities, for example by adding support for image comparison, object detection, or segmenting images into individual components. These capabilities would enable more sophisticated queries and increase the system's versatility across data formats.

With these advancements in place, the system could play a key role in the evolving field of multimodal AI, further bridging the gap between text and visual data in knowledge retrieval.

## Summary

This project demonstrates the potential of integrating multimodal data to improve RAG systems, enabling AI to process text and images more effectively. Using technologies such as LlamaIndex, Qdrant, and Neo4j, the system provides more relevant and contextual answers at speed. With its focus on accurate knowledge retrieval and dynamic query processing, the project represents a meaningful step forward for AI-based question-answering systems.

For more insights and to explore the project, visit the GitHub repository. If you would like to connect, feel free to reach out to me on LinkedIn.
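To close, here is a sketch of what the Docker Compose file referenced in the Dockerization section might look like. Service names, image tags, ports, and volumes are my assumptions, not the project's actual configuration:

```yaml
# Hypothetical docker-compose.yml for the four services described above.
services:
  extractor:
    build: ./extractor          # knowledge-extraction pipeline
    depends_on: [qdrant, neo4j]
    env_file: .env              # Azure OpenAI / Vision credentials
  retriever:
    build: ./retriever          # query engine + Gradio UI
    ports:
      - "7860:7860"             # Gradio's default port
    depends_on: [qdrant, neo4j]
    env_file: .env
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
  neo4j:
    image: neo4j:5
    ports:
      - "7474:7474"             # browser UI
      - "7687:7687"             # bolt protocol
    environment:
      - NEO4J_AUTH=neo4j/please-change-me
    volumes:
      - neo4j_data:/data
volumes:
  qdrant_data:
  neo4j_data:
```

With a file along these lines, `docker compose up` brings up both databases and both application services together, which is what makes the deployment consistent across environments.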