

The future of AI: distillation just got easier

Part 3 – Why Deploying LoRA Fine-Tuned Llama 3.1 8B Models Is So Easy!

Learn how to easily deploy LoRA fine-tuned models using Azure AI Studio (🚀🔥 GitHub recipe repository).

by Cedric Vidal, Senior AI Advocate, Microsoft

Part of the Future of AI 🚀 series started by Marco Casalaina with his Exploring Multi-Agent AI Systems blog post.


Llama on a rocket launching into space, created using Azure OpenAI DALL-E 3

Welcome back to our series about how to leverage Azure AI Studio to accelerate your AI development journey. In the previous posts we explored generating synthetic datasets and fine-tuning the model. Today we'll look at the critical step that turns those efforts into actionable results: deploying the fine-tuned model. In this installment, we'll walk through deploying a model using Azure AI Studio and the Python SDK so you can move smoothly from development to production.

Why deploying GPU-accelerated inference workloads is difficult

Deploying GPU-accelerated inference workloads comes with unique challenges that make the process significantly more complex than standard CPU workloads. Below are some of the major challenges you face:


  • GPU resource allocation: GPUs are specialized and limited resources, requiring precise allocation to prevent waste and ensure efficiency. Unlike CPUs, where larger numbers can be easily provisioned, the special characteristics of GPUs mean that an effective allocation strategy is critical to optimizing performance.
  • GPU scaling: GPU workload scaling is inherently more difficult due to high costs and limited GPU resource availability. Unlike simpler CPU resource scaling, it requires careful planning to balance cost-effectiveness and workload demands.
  • Load balancing for GPU instances: Implementing load balancing for GPU-based tasks is complex because the work must be distributed evenly across available GPU instances. This step is important to avoid bottlenecks, avoid overloading specific instances, and ensure optimal performance of each GPU unit.
  • Model splitting and sharding: Partitioning and sharding are required for large models that cannot fit into a single GPU memory. This process involves splitting the model across multiple GPUs, which introduces additional complexity in terms of load distribution and resource management.
  • Containerization and Orchestration: While containerization simplifies the deployment process by packaging models and dependencies, managing GPU resources within containers and coordinating them across nodes adds an additional level of complexity. Handling the subtle dynamics of GPU resource utilization and management requires fine-tuning an effective orchestration setup.
  • LoRA adapter integration: Low-Rank Adaptation (LoRA) is a powerful optimization technique that reduces the number of trainable parameters by decomposing weight updates into low-rank matrices, making it efficient to fine-tune large models with fewer resources. However, integrating LoRA adapters into a deployment pipeline requires additional steps to efficiently save, load, and merge the lightweight adapter with the base model and serve the final model (see the sketch after this list), making the deployment process more complex.
  • GPU Inference Endpoint Monitoring: Monitoring GPU inference endpoints is complex because it requires special metrics to capture GPU utilization, memory bandwidth, and thermal limits, as well as model-specific metrics such as number of tokens or number of requests. These metrics are essential for understanding performance bottlenecks and ensuring efficient operations, but they require complex tools and expertise to collect and analyze accurately.
  • Model-specific considerations: It is important to recognize that the deployment process often depends on the underlying model architecture you are working with. Each new version of a model or a different model vendor requires a significant amount of adjustment to the deployment pipeline. This may include changing preprocessing steps, modifying environment configuration, integrating third-party libraries, or adjusting versions. Therefore, it is important to keep model documentation and vendor-specific deployment instructions up to date to ensure a smooth and efficient deployment process.
  • Model versioning complexity: Keeping track of multiple versions of a model can be complicated. Each version may exhibit unique behavior and performance metrics, requiring thorough evaluation to manage updates, rollbacks, and compatibility with other systems. I will cover the topic of model evaluation in more detail in my next blog post. Another difficulty in versioning is storing the weights of the various LoRA adapters and keeping track of which version of the base model they should be applied to.
  • Cost planning: Planning costs for GPU inference workloads is difficult due to the variable nature of GPU usage and the high cost of GPU resources. It can be hard to predict the exact GPU time needed for inference across different workloads, which can result in unexpected costs.

Understanding and addressing these challenges is critical to successfully deploying GPU-accelerated inference workloads and leveraging the full potential of GPU capabilities.
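
To make the LoRA integration point more concrete, here is a minimal sketch of what merging an adapter into its base model might look like using the Hugging Face transformers and peft libraries. The model name and adapter path are placeholder assumptions, and this illustrates the general technique rather than what Azure AI Serverless does internally.

# Minimal sketch of merging a LoRA adapter into its base model.
# The model name and adapter path below are placeholder assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model
ADAPTER_PATH = "./my-lora-adapter"               # assumed local adapter path

# Load the base model, then attach the LoRA adapter weights.
base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)

# Fold the low-rank updates into the base weights so the result can be
# served like any regular model, with no adapter logic at inference time.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained("./merged-model")

With Azure AI Serverless, this kind of adapter handling is managed for you, which is a large part of why the next section calls it a game changer.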

Azure AI Serverless: A Game Changer

Azure AI Serverless is a game changer because it effectively solves many of the challenges associated with deploying GPU-accelerated inference workloads. Its serverless architecture abstracts away the complexities of GPU resource allocation, model-specific deployment considerations, and API management, so you can deploy models without managing the underlying infrastructure and focus on your application requirements. Azure AI Serverless also supports a rich model collection and abstracts the selection and provisioning of GPU hardware accelerators to ensure efficient and fast inference times. The integration of platform and managed services provides robust container orchestration, further simplifying the deployment process and improving overall operational efficiency.

Attractive pay-as-you-go cost model

One of the great features of Azure AI Serverless is its token-based cost model, which greatly simplifies cost planning. With token-based billing, you are charged based on the number of tokens processed by your model, making it easier to estimate costs from expected usage patterns. This model is especially useful for applications with variable load because you only pay for what you use.

There are additional hourly costs associated with fine-tuned serverless endpoints because the managed infrastructure must keep LoRA adapters in memory and replace them as needed, but you are only billed on an hourly basis while the endpoint is in use. This makes it very easy to plan ahead for future bills based on your expected usage profile.

Additionally, hourly costs are trending downward: there has already been a significant reduction from $3.09 per hour for Llama 2 7B based models to $0.74 per hour for Llama 3.1 8B based models.
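
As a rough illustration of how this billing model simplifies planning, here is a back-of-the-envelope estimate in Python. The per-token prices and traffic numbers are made-up assumptions for illustration only; check current Azure pricing for real figures.

# Back-of-the-envelope monthly cost estimate for a fine-tuned serverless endpoint.
# All per-token prices and traffic figures are illustrative assumptions, not quotes.
HOSTING_PER_HOUR = 0.74               # $/hour hosting fee mentioned above (Llama 3.1 8B)
PRICE_PER_1K_INPUT_TOKENS = 0.0003    # assumed $ per 1K input tokens
PRICE_PER_1K_OUTPUT_TOKENS = 0.0006   # assumed $ per 1K output tokens

requests_per_day = 10_000             # assumed traffic
input_tokens_per_request = 500        # assumed prompt size
output_tokens_per_request = 200       # assumed completion size

hosting = HOSTING_PER_HOUR * 24 * 30
input_k_tokens = requests_per_day * 30 * input_tokens_per_request / 1000
output_k_tokens = requests_per_day * 30 * output_tokens_per_request / 1000
inference = (input_k_tokens * PRICE_PER_1K_INPUT_TOKENS
             + output_k_tokens * PRICE_PER_1K_OUTPUT_TOKENS)

print(f"Estimated monthly cost: ${hosting + inference:,.2f}")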

Paying attention to these important factors will help you ensure that your model deployment is robust, secure, and capable of meeting your application requirements.

Regional Availability

When deploying a Llama 3.1 fine-tuned model, it is important to consider the geographic region in which the model may be deployed. Currently, Azure AI Studio supports deploying Llama 3.1 fine-tuned models in the East US, East US 2, North Central US, South Central US, West US, and West US 3 regions. Choosing a closer region can help reduce latency and improve performance for end users. For optimal results, select the appropriate region based on your target audience.

For the most up-to-date information on the regional availability of other models, see this guide to deploying serverless models.

Let’s code it using Azure AI Studio and the Python SDK.

Before proceeding with deployment, you will need a previously fine-tuned model. One way is to follow the process described in the previous two articles in this fine-tuning blog post series: the first on generating synthetic datasets using RAFT, and the second on fine-tuning the model. This will let you take full advantage of the deployment step using Azure AI Studio.


Note: All of the following code samples are extracted from the 3_deploy.ipynb notebook in the RAFT recipe GitHub repository. The snippets have been simplified and some intermediate steps omitted for readability. You can clone the repository and start experimenting right away, or follow the overview here.

Step 1: Set up your environment

First, make sure you have the required libraries installed. You need the Azure Machine Learning SDK for Python (azure-ai-ml) and the Azure Identity library for authentication, both installable with pip:

pip install azure-ai-ml azure-identity

Next, import the required modules and authenticate against your Azure ML workspace. This is standard boilerplate; MLClient is the gateway to your ML workspace, providing access to Azure AI and ML resources.

from azure.ai.ml import MLClient
from azure.identity import (
    DefaultAzureCredential,
    InteractiveBrowserCredential,
)
from azure.ai.ml.entities import MarketplaceSubscription, ServerlessEndpoint

# Use environment / Azure CLI credentials if available, otherwise fall back
# to an interactive browser login.
try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception:
    credential = InteractiveBrowserCredential()

# Load the workspace configuration (config.json) from the current directory.
try:
    client = MLClient.from_config(credential=credential)
except Exception:
    print("Please create a workspace configuration file in the current directory.")

# Get the AzureML workspace object and its unique id.
workspace = client.workspaces.get(client.workspace_name)
workspace_id = workspace._workspace_id
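
If you don't have a workspace configuration file (config.json) in the current directory, you can also construct the client explicitly. The subscription, resource group, and project names below are placeholders for your own values.

# Alternative to MLClient.from_config(): pass the workspace coordinates explicitly.
# The identifiers below are placeholders.
client = MLClient(
    credential=credential,
    subscription_id="<your-subscription-id>",
    resource_group_name="<your-resource-group>",
    workspace_name="<your-ai-studio-project-name>",
)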

Step 2: Resolve the previously registered fine-tuned model

You need to resolve your fine-tuned model in your Azure ML workspace before deploying it.

The fine-tuning job may still be running, so you may want to wait for the model to be registered. Here is a simple helper function you can use:

def wait_for_model(client, model_name):
    """Wait for the model to be available, typically while a fine-tuning job completes."""
    import time

    attempts = 0
    while True:
        try:
            # The "latest" label resolves to the most recently registered version.
            model = client.models.get(model_name, label="latest")
            return model
        except Exception:
            print(f"Model not found yet #{attempts}")
            attempts += 1
            time.sleep(30)

Although this helper is basic, it lets you proceed with deployment as soon as the model is available:

print(f"Waiting for fine tuned model {FINETUNED_MODEL_NAME} to complete training...")
model = wait_for_model(client, FINETUNED_MODEL_NAME)
print(f"Model {FINETUNED_MODEL_NAME} is ready")

Step 3: Subscribe to the model provider

Before deploying a fine-tuned model using a base model from a non-Microsoft source, you must subscribe to the model provider’s marketplace offering. This subscription allows you to access and use models within Azure ML.

print(f"Deploying model asset id {model_asset_id}")

from azure.core.exceptions import ResourceExistsError
marketplace_subscription = MarketplaceSubscription(
    model_id=base_model_id,
    name=subscription_name,
)

try:
    marketplace_subscription = client.marketplace_subscriptions.begin_create_or_update(marketplace_subscription).result()
except ResourceExistsError as ex:
    print(f"Marketplace subscription {subscription_name} already exists for model {base_model_id}")

For more information on how to configure base_model_id and subscription_name, see the 3_deploy.ipynb notebook.
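
For illustration only, those variables might look roughly like the snippet below. The exact registry path and model name depend on the base model you fine-tuned, so treat these values as assumptions and prefer the resolution logic in the notebook.

# Illustrative values only -- the exact registry path and model name depend on
# the base model you fine-tuned; the notebook derives them programmatically.
base_model_id = "azureml://registries/azureml-meta/models/Meta-Llama-3.1-8B-Instruct"
subscription_name = "Meta-Llama-3-1-8B-Instruct-subscription"
model_asset_id = model.id  # the fine-tuned model resolved in Step 2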

Step 4: Deploy the model as a serverless endpoint

In this section, you will manage the serverless endpoint deployment for your fine-tuned model using the Azure ML client: check for an existing endpoint, create one if it doesn't exist, and then proceed with the deployment.

from azure.core.exceptions import ResourceNotFoundError

try:
    # Reuse the endpoint if it has already been deployed.
    serverless_endpoint = client.serverless_endpoints.get(endpoint_name)
    print(f"Found existing endpoint {endpoint_name}")
except ResourceNotFoundError:
    # Create the serverless endpoint from the fine-tuned model asset.
    serverless_endpoint = ServerlessEndpoint(name=endpoint_name, model_id=model_asset_id)

    print("Waiting for deployment to complete...")
    serverless_endpoint = client.serverless_endpoints.begin_create_or_update(serverless_endpoint).result()
    print("Deployment complete")

Step 5: Verify that the endpoint is deployed correctly

As part of your deployment pipeline, we recommend including an integration test to verify that the model is deployed correctly and to fail fast, rather than letting a later step fail without context.

import requests

# Retrieve the endpoint's API keys and build the chat completions URL.
endpoint_keys = client.serverless_endpoints.get_keys(endpoint_name)
url = f"{serverless_endpoint.scoring_uri}/v1/chat/completions"

prompt = "What do you know?"
payload = {
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 1024,
}
headers = {"Content-Type": "application/json", "Authorization": endpoint_keys.primary_key}

response = requests.post(url, json=payload, headers=headers)

response.json()

For simplicity, this code assumes that the deployed model is a chat model. The code in the 3_deploy.ipynb notebook is more general and covers both completion and chat models.
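
To turn this smoke test into the fail-fast integration check mentioned above, you can assert on the response before moving on. The assertions below are a minimal example, assuming the endpoint returns an OpenAI-compatible chat completion payload.

# Fail fast if the endpoint is not healthy, rather than letting a later
# pipeline step fail without context.
assert response.status_code == 200, f"Endpoint returned {response.status_code}: {response.text}"

completion = response.json()
answer = completion["choices"][0]["message"]["content"]
assert answer.strip(), "Endpoint returned an empty completion"
print(f"Endpoint answered: {answer[:200]}")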

Conclusion

Deploying fine-tuned models using Azure AI Studio and the Python SDK not only simplifies the process, but also provides unparalleled control, ensuring you have a powerful and reliable platform for your deployment needs.

Stay tuned for our next blog post: in two weeks we will learn how to evaluate the performance of deployed models through a rigorous evaluation methodology. Until then, head over to the GitHub repo and happy coding!




