Evaluating Distilled Models with Azure AI Evaluation SDK

by info.odysseyx@gmail.com, November 1, 2024

Part 4 – Maximize your fine-tuned model performance with the new Azure AI Evaluation SDK

By Cedric Vidal, Principal AI Advocate, Microsoft

Part of the Future of AI series initiated by Marco Casalaina with his Exploring Multi-Agent AI Systems blog post.

Image generated using Azure OpenAI DALL-E 3

In earlier posts of this distillation series, we detailed the process of distilling a Llama 3.1 405B model into a more compact Llama 3.1 8B model. This journey included generating a synthetic dataset using RAFT, as well as fine-tuning and deploying our student model on Azure AI Serverless.

But how can we confirm that our distilled model performs optimally? The crucial final step is evaluating the model.

Effective model evaluation is key to ensuring that our AI systems function as expected and meet the desired standards. With the introduction of the Azure AI Evaluation Python SDK, we now have a powerful toolkit for assessing AI models through advanced metrics. In this blog post, we'll look at evaluating a distilled student model, which was trained with data generated by RAFT, and compare it against a baseline model.

In our setup, Llama 3.1 405B functions as the teacher, Llama 3.1 8B serves as the student model and GPT-4 serves as the judge.

Why evaluate?

Evaluating distilled student models is crucial because it allows us to assess how effectively knowledge has been transferred from the teacher model to the student model. Distillation aims to compress a larger, more complex model into a smaller, more efficient one without significantly sacrificing performance. By thoroughly evaluating the distilled models, we ensure they not only mimic the teacher model's outputs but also maintain high levels of accuracy, coherence, and relevance.
This evaluation process helps identify areas where the student model may need further fine-tuning and ensures that the distilled models are ready for deployment in resource-constrained environments where computational efficiency is paramount.

Process Overview

Evaluating the performance of our models involves several key steps, which can be broadly categorized under Testing and Scoring.

Testing

- Run the Baseline Model on the Evaluation Split: Our first step is to run the teacher model (Llama 3.1 405B) on the evaluation split to generate its predictions.
- Run the Student Model on the Evaluation Split: Next, we run the student model on the same evaluation dataset to generate its predictions.

Scoring

- Calculate Metrics for the Baseline Model: Using the predictions from the baseline model, we calculate various performance metrics.
- Calculate Metrics for the Student Model: Similarly, we calculate the performance metrics for the student model's predictions.
- Compare Metrics: Finally, we compare the performance of both models, highlighting the results through visuals and diagrams.

Testing the baseline and student models

Installing the SDK

First, you need to install the Azure AI Evaluation SDK:

    pip install openai azure-ai-evaluation azure-identity promptflow-azure

Note on SDK Availability: It's important to highlight that the Azure AI Evaluation SDK is currently in beta. This means that while the SDK offers a comprehensive suite of tools and features for evaluating AI models, it may still undergo changes and improvements. Users should stay updated with any modifications or enhancements introduced by Azure, and consider providing feedback to help refine and optimize the SDK for wider use in its official release.
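Because the SDK is in beta and may change between releases, it can help to record which package versions were installed when an evaluation was run. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical, and the package list simply mirrors the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names taken from the pip install command in this post.
PACKAGES = ("openai", "azure-ai-evaluation", "azure-identity", "promptflow-azure")

for name in PACKAGES:
    # Print each package with its version, or a marker when it is not installed.
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Saving this output next to your evaluation results makes it easier to tell whether a change in scores comes from the model or from an SDK update.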

Baseline Model Testing

This will generate answers to the questions in the evaluation dataset using the baseline model:

    env $(cat .env .env.state) python .gorilla/raft/eval.py \
        --question-file $dataset_path_hf_eval \
        --answer-file $dataset_path_hf_eval_answer_baseline \
        --model $BASELINE_OPENAI_DEPLOYMENT \
        --env-prefix BASELINE \
        --mode $BASELINE_MODEL_API

Note: The resulting JSONL file needs to be further converted to a format suitable for testing; see the eval notebook for details.

Student Model Testing

This will generate answers to the questions in the evaluation dataset using the student model:

    env $(cat .env .env.state) python .gorilla/raft/eval.py \
        --question-file $dataset_path_hf_eval \
        --answer-file $dataset_path_hf_eval_answer \
        --model $STUDENT_DEPLOYMENT_NAME \
        --env-prefix STUDENT \
        --mode $STUDENT_MODEL_API

Note: The resulting JSONL file needs to be further converted to a format suitable for testing; see the eval notebook for details.

Let's look at a sample

This sample is extracted from the evaluation split and shows the baseline and student answers:

question: What types of waves do strong direct offshore winds create?

gold_final_answer: plunging or large barrel waves

context: Lefts, Rights, and A-frames could be directed from this pump design providing forWave intensity Artificial reefs Artificial wavesSurfing a stationary, artificialwave in Southern California A surfer going for the tube Catching waves at a surfing conteston the North Shore of Oahu, Hawaiirippable surf and barrel rides. The Ocean Dome cost about $2 billion tobuild and was expensive to maintain.[31] The Ocean Dome was closed in2007. However, thewaves that are produced by reef breaks are some of the best in the world.
Famous reef breaks arepresent in Padang Padang (Indonesia), Pipeline (Hawaii), Uluwatu (Bali), and Teahupo'o(Tahiti).[49][52]A ledge break is formed by steep rocks ledges that make intense waves because the waves travelthrough deeper water then abruptly reach shallower water at the ledge. Shark Island, Australia is alocation with a ledge break.

baseline_answer: Strong direct offshore winds create plunging or large barrel waves. These waves are characterized by their increased height and intensity due to the shallow water depth when they break.

student_answer: plunging or large barrel waves

This sample was chosen randomly and in this case, the student model answer is identical to the gold answer. This is not always the case.

Evaluating the baseline and student model responses

Built-in Evaluators

The Azure AI Evaluation SDK offers an extensive suite of built-in metrics, designed to facilitate comprehensive evaluation of AI models. In the following sections, we'll highlight selected evaluators and provide detailed examples of their application, showcasing how they can enhance your model assessments.

They are categorized into two main groups: (1) metrics that leverage GPT models for scoring, providing advanced qualitative assessments, and (2) metrics that utilize straightforward mathematical calculations for evaluation.

GPT based metrics

| Category | Evaluator Class | Notes |
| --- | --- | --- |
| Quality | GroundednessEvaluator | Groundedness measures the extent to which the generated content is based on factual correctness and aligns with the provided data or context. |
| | RelevanceEvaluator | Relevance assesses how pertinent the generated text is to the given input or prompt. Higher relevance scores indicate that the generated responses are more appropriate and closely aligned with the query or topic. |
| | CoherenceEvaluator | Coherence measures how logically consistent and semantically meaningful the generated text is. Higher coherence indicates better understanding and logical consistency. |
| | FluencyEvaluator | Fluency evaluates how naturally the generated text reads. Fluent text should be grammatically correct and smooth in its flow. |
| | SimilarityEvaluator | Measures the similarity between the predicted answer and the correct answer. |
| Content Safety | ViolenceEvaluator | |
| | SexualEvaluator | |
| | SelfHarmEvaluator | |
| | HateUnfairnessEvaluator | |
| Composite | QAEvaluator | Built on top of individual quality evaluators. |
| | ChatEvaluator | Similar to QAEvaluator but designed for evaluating chat messages. |
| | ContentSafetyEvaluator | Built on top of individual content safety evaluators. |

Math based metrics

| Evaluator Class | Notes |
| --- | --- |
| BleuScoreEvaluator | BLEU (Bilingual Evaluation Understudy) is a widely-used metric for evaluating the quality of text generated by an AI by comparing it to one or more reference texts. It particularly looks at the precision of n-grams in the generated text. |
| RougeScoreEvaluator | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) primarily measures recall, comparing n-grams between the generated text and reference texts. It is commonly used for evaluation in summarization tasks. |
| F1ScoreEvaluator | A balance between precision and recall, the F1 score provides a single metric that combines both, offering a more comprehensive view of performance in classification problems. |

Running metrics individually

The Azure AI Evaluation SDK enables the utilization of individual metrics. This feature is particularly useful for experimentation, gaining deeper insights, and incorporating metrics into bespoke evaluation workflows.

Tech Tip: This blog post is crafted using the Quarto writing system, a versatile tool for publishing with code. The Azure AI Evaluation metrics are seamlessly executed and displayed inline within this post.
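To make the F1 score described above more concrete, here is a minimal sketch of a token-overlap F1 in plain Python. It assumes SQuAD-style normalization (lowercasing, stripping punctuation and the articles a/an/the) followed by whitespace tokenization. The function names `normalize` and `token_f1` are hypothetical, and the SDK's `F1ScoreEvaluator` may normalize text differently, so treat this as an illustration of the idea rather than the SDK's implementation. The example reuses the gold answer from the sample above with a shortened version of the baseline answer.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(response: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall against the ground truth."""
    response_tokens = normalize(response)
    truth_tokens = normalize(ground_truth)
    if not response_tokens or not truth_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    shared = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(response_tokens)
    recall = shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1(
    response="Strong direct offshore winds create plunging or large barrel waves.",
    ground_truth="plunging or large barrel waves",
)
print(f"Token F1: {score:.2f}")  # prints "Token F1: 0.67"
```

Here recall is 1.0, since every gold token appears in the response, but precision is only 0.5, so the extra wording alone pulls the score down to 0.67 even though the answer is correct.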

Let's look first at the F1 Score math metric

For a response that is accurate but includes additional information not found in the ground truth:

    from azure.ai.evaluation import F1ScoreEvaluator

    f1_score_evaluator = F1ScoreEvaluator()
    f1_score = f1_score_evaluator(
        ground_truth="The capital of Japan is Tokyo.",
        response="Tokyo is Japan's capital, known for its blend of traditional culture"
    )
    print(f"The F1 Score is {round(f1_score['f1_score'], 2)}")

    The F1 Score is 0.5

For a response that is accurate but phrases the same words differently:

    from azure.ai.evaluation import F1ScoreEvaluator

    f1_score_evaluator = F1ScoreEvaluator()
    f1_score = f1_score_evaluator(
        ground_truth="The capital of Japan is Tokyo.",
        response="Tokyo is Japan's capital"
    )
    print(f"The F1 Score is {round(f1_score['f1_score'], 2)}")

    The F1 Score is 0.67

Let's look next at the Similarity GPT metric

We first need to instantiate the Judge model client:

    from os import getenv

    from azure.ai.evaluation import AzureOpenAIModelConfiguration

    # The judge endpoint, deployment and API version are read from the environment
    model_config = AzureOpenAIModelConfiguration(
        azure_endpoint=getenv("JUDGE_AZURE_OPENAI_ENDPOINT"),
        azure_deployment=getenv("JUDGE_AZURE_OPENAI_DEPLOYMENT"),
        api_version=getenv("JUDGE_OPENAI_API_VERSION"),
    )

Let's now instantiate the Similarity score metric:

    from azure.ai.evaluation import SimilarityEvaluator

    similarity_evaluator = SimilarityEvaluator(model_config)

For a response that is accurate but includes additional information not found in the ground truth:

    similarity = similarity_evaluator(
        query="What's the capital of Japan?",
        ground_truth="The capital of Japan is Tokyo.",
        response="Tokyo is Japan's capital, known for its blend of traditional culture"
    )
    print(f"The Similarity is {similarity['gpt_similarity']}")

    The Similarity is 4.0

For a response that is accurate but phrases the same words differently:

    similarity = similarity_evaluator(
        query="What's the capital of Japan?",
        ground_truth="The capital of Japan is Tokyo.",
        response="Tokyo is Japan's capital"
    )
    print(f"The Similarity is {similarity['gpt_similarity']}")

    The Similarity is 5.0

The GPT-based similarity metric is more robust than the F1 score when a correct response is phrased differently from the ground truth: it gives the rephrased answer the top score of 5.0, whereas F1 gives it only 0.67.

Running metrics in bulk

Evaluating metrics individually helps in understanding how they work, but obtaining statistically meaningful results requires running them across the whole evaluation dataset.

The Azure AI Evaluation SDK provides a convenient bulk evaluation capability via the evaluate function.

To begin, we need to initialize the evaluators that will be used to assess the student and baseline models:

    from azure.ai.evaluation import (
        BleuScoreEvaluator,
        CoherenceEvaluator,
        F1ScoreEvaluator,
        FluencyEvaluator,
        GroundednessEvaluator,
        RelevanceEvaluator,
        RougeScoreEvaluator,
        RougeType,
        SimilarityEvaluator,
    )

    # Initializing evaluators
    evaluators = {
        # GPT based metrics (scored by the judge model)
        "coherence": CoherenceEvaluator(model_config),
        "fluency": FluencyEvaluator(model_config),
        "groundedness": GroundednessEvaluator(model_config),
        "relevance": RelevanceEvaluator(model_config),
        "similarity": SimilarityEvaluator(model_config),
        # Math metrics (computed locally, no judge model needed)
        "f1_score": F1ScoreEvaluator(),
        "bleu": BleuScoreEvaluator(),
        "rouge_1": RougeScoreEvaluator(RougeType.ROUGE_1),
        "rouge_2": RougeScoreEvaluator(RougeType.ROUGE_2),
    }

Note that we have previously executed the baseline and student models on the evaluation dataset, which means the JSONL file provided to the evaluate function already includes their responses. Consequently, further model invocations are unnecessary at this stage.

Recommendation: It's often beneficial to run the baseline and student models once initially.
By doing so, you can execute the evaluate function multiple times with various metrics configurations without re-incurring the inference time and costs associated with model executions. Note that while this avoids repeated inference expenses, GPT-based metrics still incur cost and time on every evaluate execution, because each one calls the Judge model.

    from azure.ai.evaluation import evaluate

    result = evaluate(
        # Placeholder: pass either the baseline or the student results file
        data="test-results-[baseline|student].jsonl",
        evaluators=evaluators,
        evaluator_config={
            "default": {
                # Map the evaluators' inputs to the columns of the JSONL file
                "column_mapping": {
                    "query": "${data.question}",
                    "response": "${data.final_answer}",
                    "ground_truth": "${data.gold_final_answer}",
                    "context": "${data.context}",
                }
            }
        },
    )

This command initiates a background process that hosts a user interface locally. The interface updates in real time to display the progress of the scoring process on the evaluation dataset.

Additionally, you can click on each completed line to view the detailed trace of the calls. This feature is particularly useful for GPT-based metrics, as it reveals the system prompt used and provides insights into the underlying logic that contributed to the final score.

Comparing Metrics and Visualizing Results

Note: You can find the implementation details for generating the comparison figures of baseline and student metrics in the repository notebook. This resource provides comprehensive insights into how the metric comparisons were conducted, along with the code necessary to reproduce these visualizations.

Going further with continuous model evaluation and GenAIOps

This marks the beginning of a continuous improvement journey. It's quite common to find that the student model's initial performance does not meet expectations. Through our evaluation, we may uncover areas needing adjustment, whether it's refining the synthetically generated dataset, optimizing fine-tuning parameters, or other elements.
This initiates a cycle of iterative improvement and reassessment before the model is ready for deployment in production.

To effectively help you navigate this process, we came up with the GenAIOps Maturity Model, which serves as a comprehensive guide for evaluating your progress and maturity in operationalizing AI models.

Conclusion

By leveraging the Azure AI Evaluation Python SDK, we gain a detailed understanding of how our distilled student model compares to the baseline model across a spectrum of performance indicators. This structured evaluation framework not only helps in refining our models but also ensures that we are continuously improving and delivering robust AI solutions.

Explore, fork and clone the comprehensive GitHub Recipe Repository for complete code coverage on executing the full distillation process, including in-depth evaluations as detailed in this blog post. Discover step-by-step notebooks and resources to master the entire pipeline efficiently.

Stay tuned for more insights and tutorials on advanced AI topics and the latest tools available in the Azure ecosystem!