VoiceRAG: An App Pattern for RAG + Voice Using Azure AI Search and the GPT-4o Realtime API for Audio

October 1, 2024

The new Azure OpenAI gpt-4o-realtime-preview model enables a much more natural application user experience through speech-to-speech capabilities. This voice-based interface also brings an interesting new challenge: how do you implement retrieval-augmented generation (RAG), the common pattern for combining language models with your own data, in a system that uses audio for input and output?

In this blog post we present a simple architecture for a voice-based generative AI application that enables RAG on top of the real-time audio API, with full-duplex audio streaming to and from client devices, while securely handling access to the model and the retrieval system.

Designed for Real-Time Voice + RAG

RAG workflow support

We use two main components to accomplish voice interactions with RAG:

Function calling: The gpt-4o-realtime-preview model supports function calling, which lets you include "tools" for searching and grounding in the session configuration. The model listens to the audio input and calls these tools directly, with parameters describing what it wants to retrieve from the knowledge base.

Real-time middle tier: We separate what must happen on the client from what cannot happen there. Full-duplex, real-time audio needs to stream to and from the client device's speakers and microphone. On the other hand, model configuration (system message, max tokens, temperature, etc.) and access to the knowledge base for RAG must be handled on the server: we don't want clients to hold credentials for these resources, and we don't want to require clients to have network visibility into these components. To achieve this, we introduce a middle-tier component that proxies audio traffic while keeping model configuration and function calling entirely in the backend.

These two components work in tandem. The real-time API knows not to proceed with the conversation while a function call is outstanding. When the model needs information from the knowledge base to respond to input, it emits a "search" function call. We turn that function call into an Azure AI Search hybrid query (vector + keyword + semantic reranking), grab the content passages most relevant to what the model needs to know, and send them back to the model as the output of the function. Once the model sees that output, it responds over the audio channel and the conversation continues.

A critical factor in this picture is fast and accurate retrieval. The search call happens between the user's turn and the model's response on the audio channel, a window that is sensitive to latency. Azure AI Search is perfectly suited for this, with low latency for vector and hybrid queries and support for semantic reranking to maximize the relevance of results.
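To make this concrete, here is a minimal sketch of the two pieces involved, written in Python against the azure-search-documents SDK (a recent version that includes VectorizableTextQuery). The index field names (chunk_id, chunk, text_vector) and the semantic configuration name are illustrative assumptions, not the exact code from the sample repository:

```python
# Minimal sketch: a "search" tool declared in the realtime session
# configuration, plus the middle-tier handler that turns the model's
# function call into an Azure AI Search hybrid query. Field and
# configuration names below are illustrative assumptions.
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Tool definition sent to the model as part of a session.update event.
SEARCH_TOOL = {
    "type": "function",
    "name": "search",
    "description": "Search the knowledge base for passages relevant to the user's question.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"}
        },
        "required": ["query"],
    },
}

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name=os.environ["AZURE_SEARCH_INDEX"],
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

def handle_search_call(query: str) -> str:
    """Run a hybrid (keyword + vector) query with semantic reranking and
    return the top passages as the function-call output for the model."""
    results = search_client.search(
        search_text=query,                      # keyword leg of the hybrid query
        vector_queries=[VectorizableTextQuery(  # vector leg, vectorized server-side
            text=query, k_nearest_neighbors=50, fields="text_vector"
        )],
        query_type="semantic",                  # enable the semantic reranker
        semantic_configuration_name="default",
        top=5,
    )
    # Concatenate passages; the model sees this string as the tool's output.
    return "\n---\n".join(
        f"[{doc['chunk_id']}] {doc['chunk']}" for doc in results
    )
```

The string returned by handle_search_call is what gets sent back to the model as the function call's output; the proxy sketch further below shows one way to wire that up.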
Generate grounded responses

Using function calling solves the problem of how to coordinate search queries against the knowledge base, but this inversion of control introduces a new problem: we don't know which of the passages retrieved from the knowledge base were actually used to ground each response. In a typical RAG application that interacts with the model API over text, you might instruct the model in the prompt to generate citations using a special notation that the UX can display appropriately; when generating audio, however, you don't want the model to speak file names or URLs out loud.

In generative AI applications it is important to be transparent about which underlying data was used to respond to a given input, so we need a different mechanism for identifying and displaying citations in the user experience. We use function calling for this as well: we introduce a second tool called "report_grounding" and include the following instructions as part of the system prompt:

Use the following step-by-step instructions to respond with short, concise answers from the knowledge base:
Step 1 – Always use the "search" tool to check the knowledge base before answering a question.
Step 2 – Always use the "report_grounding" tool to report the source of information from the knowledge base.
Step 3 – Produce an answer that is as short as possible. If the answer isn't in the knowledge base, say you don't know.

We experimented with different ways of phrasing this prompt and found that explicitly laying it out as a step-by-step process was particularly effective. With these two tools we now have a complete system: audio flows to the model, the model calls back into the app logic in the backend for retrieval, and it reports which grounding data it used. We then let the audio flow back to the client, along with an additional message that carries the grounding information (you can see this in the UI as citations to documents that appear alongside the spoken reply).
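Here is a compressed sketch of what the middle-tier proxy loop can look like, assuming two already-connected websockets (for example from the websockets package or aiohttp) and the realtime preview API's event protocol. The "extension.grounding" message type sent to the client is an invented name for illustration, and a production proxy needs far more error handling and session management than this:

```python
# Minimal sketch of the middle-tier proxy loop. `client_ws` and `model_ws`
# are assumed to be already-connected websockets; `handle_search_call` is
# the function from the previous sketch. The "extension.grounding" message
# type is an invented name for illustration.
import asyncio
import json

async def relay(client_ws, model_ws):
    async def client_to_model():
        async for raw in client_ws:
            msg = json.loads(raw)
            # Drop client attempts to reconfigure the session: the backend
            # owns the system prompt, tools, temperature, and so on.
            if msg.get("type") != "session.update":
                await model_ws.send(raw)

    async def model_to_client():
        async for raw in model_ws:
            event = json.loads(raw)
            if event.get("type") == "response.function_call_arguments.done":
                args = json.loads(event["arguments"])
                if event["name"] == "search":
                    output = handle_search_call(args["query"])
                elif event["name"] == "report_grounding":
                    # Tell the client which sources grounded the answer; the
                    # UI renders these as citations next to the spoken reply.
                    await client_ws.send(json.dumps(
                        {"type": "extension.grounding",
                         "sources": args.get("sources")}))
                    output = "ok"
                else:
                    output = "unknown tool"
                # Return the tool output to the model, then ask it to respond.
                await model_ws.send(json.dumps({
                    "type": "conversation.item.create",
                    "item": {"type": "function_call_output",
                             "call_id": event["call_id"],
                             "output": output},
                }))
                await model_ws.send(json.dumps({"type": "response.create"}))
            else:
                # Everything else (audio deltas, transcripts, ...) passes through.
                await client_ws.send(raw)

    await asyncio.gather(client_to_model(), model_to_client())
```

Note how the session.update filter keeps prompt and tool configuration under backend control, which is also what makes the security story below work.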
Using a Real-Time API-enabled client

The middle tier fully absorbs tool-related interactions and ignores any client-side attempts to change the system configuration, but otherwise maintains the same wire protocol. Since the RAG process is completely encapsulated in the backend, any client that works directly against the Azure OpenAI real-time API will "just work" against the real-time middle tier.

Create safe generative AI apps

We keep all configuration (system prompt, max tokens, etc.) and all credentials (for access to Azure OpenAI, Azure AI Search, etc.) in the backend, securely separated from the client. Beyond that, Azure OpenAI and Azure AI Search include extensive security capabilities to further secure your backend: network isolation, so that neither the model endpoints nor the search indexes are reachable over the internet; Entra ID, to avoid using keys for cross-service authentication; and options for multiple layers of encryption of indexed content.

Try it today

The code and data for everything discussed in this blog post can be found in the GitHub repository azure-samples/aisearch-openai-rag-audio. You can use it as-is, or easily swap in your own data and talk to it.

The code in the repository and the explanation in this blog post are more of a pattern than a finished solution: getting the prompts right may require some experimentation, you may want to extend the RAG workflow, and you should make sure to evaluate security and AI safety for your scenario. To learn more about the Azure OpenAI gpt-4o-realtime-preview model, the real-time API, and Azure AI Search, see the respective documentation.

We look forward to the new "conversation with your data" scenarios this enables!