New GenAI simulation and evaluation tools in Azure AI Studio

September 24, 2024

While it is important to mitigate the potential risks of generative AI (GenAI), it is equally important throughout the development lifecycle to measure and identify those risks so developers know where to focus their attention. Azure AI Studio provides comprehensive evaluation tools so organizations can proactively assess the quality and safety of GenAI outputs in a systematic, transparent, and repeatable manner. Together with tracing and monitoring, iterative evaluation in Azure AI Studio can help development teams make data-driven decisions about model selection, content filter configuration, prompt engineering, and other application components before and after deploying to production.

In this blog, we share new features, available in public preview, that will help you evaluate and improve your application's outputs. We have also provided new documentation and step-by-step tutorials to help you get started more easily. Many of these evaluation features were originally developed to support Microsoft's internal teams building popular generative AI solutions such as Microsoft Copilot and GitHub Copilot. Now, based on feedback from customers and partners, we are bringing the same proven capabilities to Azure AI Studio so every organization can build more trustworthy AI applications.

Evaluate how your app responds to indirect prompt injection attacks

A risk and safety evaluation for indirect prompt injection attacks is now available in public preview, accessible through the Azure AI Studio UI and SDK experiences. Indirect prompt injection attacks (also known as cross-domain prompt injection attacks, or XPIA) are an emerging attack vector in which threat actors poison the grounding data sources of a model, such as public websites, emails, or internal documents, to pass hidden malicious instructions to the model and bypass its safety guardrails.

With the Azure AI Evaluation SDK, users can now simulate indirect prompt injection attacks on their generative AI model or application and measure how often the AI fails to detect and deflect the attacks (the failure rate), broken down by the attack subcategories of manipulated content, intrusion, and information gathering. Users can also drill into the evaluation details to better understand how their application typically responds to these attacks and the associated risks. With this information, users can decide whether to enable Prompt Shields in Azure AI Content Safety, adjust their underlying data sources, or apply other mitigations to their system message, then re-run the evaluation before deploying to production. Follow this step-by-step tutorial with Python to get started.
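To make that workflow concrete, here is a minimal sketch of simulating indirect prompt injection attacks against an application and scoring a result. It assumes the preview azure-ai-evaluation Python package; the IndirectAttackSimulator and IndirectAttackEvaluator names, their constructor parameters, and the callback contract follow the preview documentation and may change between preview releases, and my_app is a hypothetical stand-in for your own application.

```python
# Illustrative sketch only: assumes the preview azure-ai-evaluation package.
# Class names, parameters, and result keys may differ in your SDK version.
import asyncio
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import IndirectAttackEvaluator
from azure.ai.evaluation.simulator import IndirectAttackSimulator

# The Azure AI Studio project that hosts the safety evaluation service.
azure_ai_project = {
    "subscription_id": "<subscription-id>",      # placeholder values
    "resource_group_name": "<resource-group>",
    "project_name": "<ai-studio-project>",
}
credential = DefaultAzureCredential()

def my_app(query: str) -> str:
    # Hypothetical stand-in for your deployed GenAI application.
    return "<your app's answer to: " + query + ">"

# Target callback: wraps YOUR app. The simulator plants XPIA payloads in the
# conversation; the app should respond exactly as it would in production.
async def target(messages, stream=False, session_state=None, context=None):
    query = messages["messages"][-1]["content"]
    answer = my_app(query)
    messages["messages"].append({"role": "assistant", "content": answer})
    return {"messages": messages["messages"], "stream": stream,
            "session_state": session_state, "context": context}

async def main():
    simulator = IndirectAttackSimulator(
        azure_ai_project=azure_ai_project, credential=credential
    )
    conversations = await simulator(target=target, max_simulation_results=10)
    # 'conversations' holds the simulated attack transcripts; in practice you
    # would iterate over its query/response pairs instead of the literals below.

    evaluator = IndirectAttackEvaluator(
        credential=credential, azure_ai_project=azure_ai_project
    )
    result = evaluator(
        query="<simulated user query carrying an injected instruction>",
        response="<your app's answer>",
    )
    print(result)  # e.g. an overall XPIA label plus per-subcategory labels

asyncio.run(main())
```

Aggregating the per-pair labels across all simulated conversations yields the failure rate discussed above.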
Evaluate how often your app outputs protected material

A risk and safety evaluation for protected material (text) is now available in public preview, accessible through the Azure AI Studio UI and SDK experiences. Because foundation models are typically trained on vast amounts of data, users are understandably concerned that a model may output responses containing protected material, putting end users at risk of inadvertent infringement. With the Azure AI Evaluation SDK, users can simulate conversations with their generative AI model or application that attempt to elicit responses containing protected text (for example, lyrics, articles, recipes, or selected web content) and measure how often the AI outputs protected text in its responses (the defect rate). To do this, the evaluation compares the outputs against an index of third-party text content maintained on GitHub. By drilling into the evaluation details, users can better understand how their application typically responds to these user prompts and the associated risks. With this information, users can decide whether to enable protected material detection in Azure AI Content Safety, adjust their system message, or apply other mitigations, then re-run the evaluation before deploying to production. Follow this step-by-step tutorial with Python to get started.

Evaluate the quality and accuracy of your app's output

New quality evaluations are now available in public preview, accessible today via the Azure AI SDK, with UI support coming in October 2024. ROUGE, BLEU, METEOR, and GLEU are popular math-based metrics that help AI developers evaluate text-based outputs for qualities such as similarity to an expected output, precision, recall, and grammatical correctness. AI developers can use a dedicated evaluator for each metric, and combine multiple evaluators into a holistic evaluation run. After reviewing the evaluation results, users can compare different models, adjust their ground truth, or make other changes through prompt engineering, and then re-run the evaluation to see the impact of their changes. Overall, these evaluations can help AI developers improve the quality, accuracy, and reliability of their AI applications.

The evaluators, as named in the Azure AI SDK, are:

- RougeScoreEvaluator: The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score measures the quality of generated text by comparing it to a reference text using n-gram recall, precision, and F1 score.
- BleuScoreEvaluator: The BLEU (Bilingual Evaluation Understudy) score measures how closely the generated text matches a reference text based on n-gram overlap.
- MeteorScoreEvaluator: The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score evaluates the quality of generated text based on precision and recall, along with linguistic features such as synonymy, stemming, and word order.
- GleuScoreEvaluator: The GLEU (Google-BLEU) score measures the degree of overlap between the generated text and both the reference and source texts, striking a balance between precision and recall.

Follow this step-by-step tutorial with Python to get started; a minimal sketch also follows below.
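Here is a minimal sketch of running these math-based evaluators locally, assuming the preview azure-ai-evaluation Python package. The class names match the list above, though parameter names and result keys may differ between preview versions.

```python
# Minimal sketch, assuming the preview azure-ai-evaluation package.
# These evaluators are math-based and run locally: no Azure project is needed.
from azure.ai.evaluation import (
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator,
    RougeScoreEvaluator,
    RougeType,
)

response = "The capital of France is Paris."       # your app's output
ground_truth = "Paris is the capital of France."   # expected output

# Each metric has a dedicated evaluator; instantiate once, reuse per row.
evaluators = {
    "rouge": RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L),
    "bleu": BleuScoreEvaluator(),
    "meteor": MeteorScoreEvaluator(),
    "gleu": GleuScoreEvaluator(),
}

for name, evaluator in evaluators.items():
    # Each call returns a dict of scores, e.g. {"bleu_score": 0.22}.
    scores = evaluator(response=response, ground_truth=ground_truth)
    print(name, scores)
```

Because these metrics are computed from response/ground-truth pairs, the same evaluator objects can also be passed to the SDK's evaluate() function to score an entire JSONL test dataset in a single holistic run.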
Role-play to see how your application responds to common user prompts

A synthetic data generator and simulator for non-adversarial tasks is now available in public preview, accessible via the Azure AI Evaluation SDK. One of the biggest evaluation challenges we hear about from customers is the lack of a comprehensive, high-quality test dataset with which to run a full evaluation. In March, we launched an adversarial simulator designed to role-play with your model or application and generate high-quality test data for risk and safety evaluations. However, because it was built to accelerate adversarial red-teaming, it lacked the ability to simulate the more ordinary interactions of your application's real target users. Now we are excited to announce end-to-end synthetic data generation capabilities to help developers understand how their applications respond to everyday user prompts. Using an index-based query generator and a fully customizable simulator, AI developers can create robust test datasets around application-specific non-adversarial tasks and personas. This will help organizations fill a critical gap in their existing evaluation toolkits, enabling higher-quality evaluations and faster iteration on their applications. Follow this step-by-step tutorial with Python to get started; a minimal sketch also appears at the end of this post.

Learn more about evaluation in Azure AI Studio

Evaluation is a critical step in building a production-ready GenAI app. Evaluation in Azure AI Studio makes it easy for organizations to:

- Get started quickly with pre-built, customizable, and user-defined evaluation metrics.
- Prepare for the unexpected by simulating adversarial and non-adversarial interactions with their application.
- Make data-driven decisions using interpretable evaluation results that developers can easily update and compare over time.

If you are new to evaluation in Azure AI Studio, check out the documentation and step-by-step tutorials linked throughout this post.
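To close, here is a minimal sketch of the non-adversarial simulator described above, again assuming the preview azure-ai-evaluation Python package. The Simulator class name, the model_config shape, the target-callback contract, and the text/num_queries/tasks parameters follow the preview documentation and may change between preview releases; my_app is a hypothetical stand-in for your own application.

```python
# Illustrative sketch only: assumes the preview azure-ai-evaluation package.
# Names and parameters may differ in your SDK version.
import asyncio
from azure.ai.evaluation.simulator import Simulator

# Model the simulator uses to role-play users (placeholder values; supply an
# api_key or rely on Azure AD authentication as appropriate).
model_config = {
    "azure_endpoint": "https://<your-endpoint>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
}

# Grounding text the index-based query generator mines for realistic questions.
document_text = "Contoso travel insurance covers trip cancellations, ..."

def my_app(query: str) -> str:
    # Hypothetical stand-in for your deployed GenAI application.
    return "<your app's answer to: " + query + ">"

# Target callback: wraps YOUR app so the simulator can converse with it.
async def target(messages, stream=False, session_state=None, context=None):
    query = messages["messages"][-1]["content"]
    answer = my_app(query)
    messages["messages"].append({"role": "assistant", "content": answer})
    return {"messages": messages["messages"], "stream": stream,
            "session_state": session_state, "context": context}

async def main():
    simulator = Simulator(model_config=model_config)
    outputs = await simulator(
        target=target,
        text=document_text,          # source text for query generation
        num_queries=2,               # how many user queries to synthesize
        max_conversation_turns=2,    # length of each simulated conversation
        tasks=[                      # non-adversarial personas/tasks to role-play
            "A customer asking what the policy covers",
            "A customer asking how to file a claim",
        ],
    )
    # Each output is a simulated conversation you can save as a JSONL test
    # dataset and feed to the evaluators covered earlier in this post.
    print(outputs)

asyncio.run(main())
```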