Monitoring LLM Inference Endpoints with LLM Listeners

In the second blog post in this series, guest blogger Martin Bald, Senior Manager, DevRel and Community at Microsoft partner Wallaroo.AI, walks through the steps to operationalize an LLM and put measures in place to preserve model integrity, covering fundamentals such as security, privacy, and compliance, so as to avoid consequences such as toxicity and hallucinations.

Introduction

As GenAI and related services like ChatGPT have emerged, enterprises are feeling pressure to implement GenAI quickly to avoid being left behind in the race toward broad enterprise AI adoption. In conversations with customers and partners, however, adoption has been slow, largely because teams underestimate the time it takes to get an LLM to an effective and reliable state. For those who don't know, OpenAI tested ChatGPT for two years before releasing it. For AI practitioners, understanding the complexities of bringing these powerful models into production environments is essential to building robust, high-performance AI systems.

LLM Monitoring with Wallaroo's Listeners

As we covered in the previous RAG LLM blog posts, deploying an LLM to production is not the end of the process. Far from it. The model needs to be monitored to ensure it is performing optimally and producing the intended results. For LLMs, proactive monitoring is key. We have seen very public situations where quality and accuracy were compromised by hallucinations and toxic output, leading to litigation and to a loss of trust and confidence in the company.

RAG is not the only way AI teams can ensure that an LLM generates effective and accurate text. There may be specific use cases, or compliance and regulatory rules, that limit the use of RAG. LLM accuracy and integrity can still be achieved through a validation and monitoring component that Wallaroo.AI calls the LLM Listener.

We came up with the concept of an LLM Listener after working with customers who were doing this in a traditional ML context involving a variety of modalities and customer interactions, particularly audio scenarios: models listening to calls to extract specific information, such as sentiment. As those customers transitioned to LLMs as the way they interact with their customers, the same monitoring approach and models remained relevant. Together with them, we developed the LLM Listener: a set of models that we build and provide out of the box, and that can be customized to detect and monitor specific behaviors such as toxicity and harmful language.

You may want to generate alerts immediately for poor-quality responses, or auto-correct them inline in the LLM pipeline. You can also run listeners offline to perform additional analysis on LLM interactions, which is especially useful in more controlled environments. For example, you can do this in a RAG setting and layer these validation and monitoring steps on top of it. The LLM Listener can also be tuned to generate real-time monitoring reports and metrics, so you understand how the LLM is performing and can confirm it remains effective in production, which helps businesses accelerate time to value.
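To make the idea concrete, here is a minimal, generic sketch of what an offline listener step might look like: a small off-the-shelf toxicity classifier scoring a batch of generated responses and flagging anything above a threshold. This is not Wallaroo's LLM Listener implementation; the model choice, labels, and threshold are illustrative assumptions.

```python
# Minimal offline "listener" sketch: score generated text for toxicity
# and flag responses that cross a threshold. Illustrative only; not the
# Wallaroo LLM Listener implementation.
from transformers import pipeline

# Assumed model choice; any text-classification model that emits
# toxicity-style labels would work the same way.
toxicity_scorer = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    top_k=None,  # return scores for all labels, not just the top one
)

THRESHOLD = 0.8  # assumed acceptable-risk threshold

def score_interactions(generated_texts):
    """Return (text, label, score) for every label that exceeds THRESHOLD."""
    flagged = []
    for text, label_scores in zip(generated_texts, toxicity_scorer(generated_texts)):
        for entry in label_scores:
            if entry["score"] >= THRESHOLD:
                flagged.append((text, entry["label"], entry["score"]))
    return flagged

# Example usage against a small batch of LLM outputs pulled from logs.
if __name__ == "__main__":
    batch = ["Thanks for reaching out, happy to help!", "..."]
    for text, label, score in score_interactions(batch):
        print(f"ALERT {label}={score:.2f}: {text[:80]}")
```

In production, a listener along these lines runs as its own pipeline step or scheduled job against the LLM's inference logs, which is the pattern shown in the walkthrough below.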
Because the inference endpoint stays static while everything behind it remains fluid, AI teams can iterate rapidly on the LLM Listener, and on the LLM itself, without impacting the bottom line, whether that is business reputation, revenue, cost, or customer satisfaction.

LLM Listeners with Wallaroo in Action

Let's take a look at how the LLM Listener works and how easy it is to deploy to production.

Figure 1

The LLM Listener approach illustrated in Figure 1 works as follows:

1: Text is entered into the application and the corresponding text is generated.
2: Wallaroo provides a service that exposes an LLM inference endpoint.
3: Interactions between the LLM inference endpoint and the user are logged, capturing the input text and the corresponding generated text.
4: The logs are monitored by a series of listener models, which can range from standard processes to other NLP models that score these outputs inline or offline. Think of a sentiment analyzer, or an entire system that compares outputs against some ground truth.
5: The LLM Listener evaluates LLM interactions against a variety of factors and can generate automated reports and alerts if behavior changes over time or if any of these scores fall outside acceptable ranges.

In addition to the passive listening shown here, where listeners monitor macro-level behavior across many interactions, listeners can also be deployed alongside the LLM and given the ability to suppress output that violates these thresholds in the first place.

Now let's walk through a real-world example. You can follow along on the LLM Monitoring documentation page. The following demonstrates evaluating a llama3-instruct LLM by running the LLM Listener as a Run Once task via the Wallaroo SDK. The listener arguments can be modified to evaluate any other deployed LLM with its own text output fields. This assumes the LLM Listener has already been uploaded, is ready to accept new jobs, and is stored in the variable llm_listener.

Here we create and orchestrate the LLM monitoring job for the LLM Listener, providing the workspace and pipeline of the deployed LLM, and the workspace and pipeline of the LLM Listener:

```python
args = {
    'llm_workspace': 'llm-models',
    'llm_pipeline': 'llamav3-instruct',
    'llm_output_field': 'out.generated_text',
    'monitor_workspace': 'llm-models',
    'monitor_pipeline': 'full-toxmonitor-pipeline',
    'window_length': -1,  # in hours; if -1, no limit (for testing)
    'n_toxlabels': 6,
}

task = llm_listener.run_once(name="sample_monitor", json_args=args, timeout=1000)
```

Next, list the tasks of the Wallaroo client stored in wl and check that the task completed with Success:

```python
wl.list_tasks()
```

Figure 2

Once this is done, check the LLM Listener logs and use the evaluation fields to check for toxicity issues:

```python
llm_evaluation_results = llm_listener_pipeline.logs()
display(llm_evaluation_results)
```

This produces output similar to the truncated Figure 3 below. Note the toxicity column headings and the scores for Insult, Obscenity, and Severe Toxicity.

Figure 3
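As a quick aside, pipeline logs from the Wallaroo SDK typically come back as a pandas DataFrame, so the evaluation results can also be post-processed directly in the notebook. The sketch below flags rows whose toxicity scores cross a threshold; the column names and threshold are assumptions based on the truncated output above, not a guaranteed schema, so adjust them to match your actual log headings.

```python
import pandas as pd

# Assumed column names for the listener's evaluation fields; check the
# actual headings in your log output (Figure 3) and adjust accordingly.
TOX_COLUMNS = ["out.insult", "out.obscene", "out.severe_toxicity"]
THRESHOLD = 0.5  # illustrative threshold

def flag_toxic_rows(evaluation_results: pd.DataFrame) -> pd.DataFrame:
    """Return only the log rows where any toxicity score exceeds THRESHOLD."""
    present = [c for c in TOX_COLUMNS if c in evaluation_results.columns]
    mask = (evaluation_results[present] > THRESHOLD).any(axis=1)
    return evaluation_results[mask]

flagged = flag_toxic_rows(llm_evaluation_results)
print(f"{len(flagged)} of {len(llm_evaluation_results)} interactions flagged")
display(flagged)
```

This kind of ad hoc filtering complements, rather than replaces, the automated alerts and dashboards described next.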
Once the job is done, the results are available, and the listener's inference log can be monitored via Wallaroo analytics. The analysis results chart shows the time periods during which the toxicity values stayed within the normal threshold (Figure 4). Clicking on such a time period shows what those interactions look like (Figure 5).

Figure 4

Figure 5

You can also see the periods where the output exceeds the normal threshold and outliers appear (Figure 6), and drill into a more detailed view of that period (Figure 7).

Figure 6

Figure 7

Beyond that, you can drill further into the logs for a given time period and view individual audit logs for specific interactions, to see exactly what the model output was and how it scored on a variety of metrics, from insulting to obscene to threatening language (Figure 8).

Figure 8

Conclusion

The LLM Listener is just one of the monitoring methods available in LLMOps, adding monitoring metrics and alerts to ensure that your LLM remains robust and effective after it reaches production. Using the LLM Listener to guard against risks such as toxicity and obscenity helps ensure accurate and relevant output. As mentioned in the introduction, Wallaroo is actively building out this family of listeners and working with customers to build listeners that fit their applications and use cases.