Fine-tune/Evaluate/Quantize SLM/LLM using torchtune on Azure ML
by info.odysseyx@gmail.com, November 4, 2024

In this blog, we’ll explore how to leverage torchtune on Azure ML to fine-tune, evaluate, and quantize small and large language models (SLM/LLM) effectively. As demand for adaptable and efficient language models grows, there’s a need for robust tools that make model fine-tuning and optimization more accessible. torchtune is a versatile library that simplifies these processes, offering support for distributed training, flexible logging, and model quantization. Azure ML complements torchtune by providing scalable infrastructure and integration options, making it an ideal platform for experimenting with and deploying SLM/LLMs.

This guide provides hands-on code examples and step-by-step instructions for:
Setting up Azure ML to work with torchtune for distributed model fine-tuning.
Handling dynamic path adjustments in the YAML recipe, particularly useful for Azure’s storage-mounted environments.
Applying quantization techniques to optimize models for deployment on resource-limited devices.

By the end of this guide, you’ll be equipped to run scalable and efficient language model pipelines using torchtune on Azure ML, enhancing your model’s performance and accessibility.

Hands-on Labs: https://github.com/Azure/torchtune-azureml

1.1. torchtune
torchtune is a Python library designed to simplify fine-tuning SLM/LLM models using PyTorch. torchtune stands out for its simplicity and flexibility, enabling users to perform fine-tuning, evaluation, and quantization with minimal code through YAML-based recipes. This setup allows users to define and adjust complex training configurations in a structured, readable format, reducing the need for extensive code changes.
By centralizing settings into a YAML recipe, torchtune not only speeds up experimentation but also makes it easy to replicate or modify configurations across different models and tasks. This approach streamlines model optimization and keeps fine-tuning and deployment both quick and adaptable.

The representative features are as follows:
Easy Model Tuning: torchtune is a PyTorch-native library that simplifies SLM fine-tuning, making it accessible to users without advanced AI expertise.
Easy Application of Distributed Training: torchtune simplifies the setup for distributed training, allowing users to scale their models across multiple GPUs with minimal configuration. This significantly reduces trial and error.
Simplified Model Evaluation and Quantization: torchtune makes model evaluation and quantization straightforward, providing built-in support to assess model performance and optimize models for deployment.
Scalability and Portability: torchtune is flexible enough to be used on various cloud platforms and local environments. It can be easily integrated with Azure ML.

For more information about torchtune, please refer to the torchtune documentation.

1.2. Azure ML with torchtune
Running torchtune on Azure ML offers several advantages that streamline the GenAI workflow. Here are some key benefits of using Azure ML with torchtune:
Scalability and Compute Power: Azure ML provides powerful, scalable compute resources, allowing torchtune to handle multiple SLMs/LLMs across multiple GPUs or distributed clusters. This makes it well suited to intensive tasks like fine-tuning and quantization on large datasets.
Managed ML Environment: Azure ML offers a fully managed environment, so setting up dependencies and managing versions is handled with ease. This reduces setup time for torchtune, letting users focus directly on model optimization without infrastructure concerns.
Model Deployment and Scaling: Once the model is optimized with torchtune, Azure ML provides a straightforward pathway to deploy it on Azure’s cloud infrastructure, making it easy to scale applications to production with robust monitoring and scaling features.
Seamless Integration with Other Azure Services: Users can leverage other Azure services, such as Azure Blob Storage for dataset storage or Azure SQL for data management. This ecosystem support enhances workflow efficiency and makes Azure ML a powerful choice for torchtune-based model tuning and deployment.

In a torchtune YAML configuration, each parameter and setting controls a specific aspect of fine-tuning large language models (LLMs). Here’s a breakdown of key components: supervised fine-tuning (SFT), direct preference optimization (DPO), knowledge distillation (KD), evaluation, and quantization.
SFT (Supervised Fine-Tuning): This setting manages the fine-tuning process by training the model with labeled datasets. It involves specifying the dataset path, batch size, learning rate, and the number of epochs. SFT is critical for adapting pre-trained models to specific tasks using supervised data.
DPO (Direct Preference Optimization): This setting is for training models on human preference data. Unlike RLHF pipelines that first train a separate reward model, DPO optimizes the model directly on pairs of chosen and rejected responses, guiding it toward the preferred ones. In torchtune, you can apply DPO with a YAML recipe.
KD (Knowledge Distillation): In this setting, a larger, more accurate model (teacher) transfers knowledge to a smaller model (student). YAML settings might define teacher and student model paths, temperature (for smoothing probabilities), and alpha (weight for balancing loss between teacher predictions and labels). KD allows smaller models to approach larger models’ performance while reducing computation needs. In torchtune, you can apply KD with a YAML recipe.
Evaluation: torchtune integrates with EleutherAI’s LM Evaluation Harness, which allows you to evaluate the truthfulness and accuracy of your models using benchmarks like TruthfulQA. You can perform these evaluations using torchtune’s eleuther_eval recipe.
Quantization: This setting reduces model size and computational requirements by lowering the bit precision of model weights. YAML settings specify the quantization method (e.g., 8-bit or 4-bit), target layers, and possibly additional parameters for post-training quantization. This is particularly helpful for deploying models on edge devices with limited resources. In torchtune, you can apply quantization with a YAML recipe.

Check out the YAML samples on torchtune’s official website.

Applying torchtune’s standalone command to Azure ML is very simple. However, applying the full pipeline of Hugging Face model download, fine-tuning, evaluation, and quantization together with distributed training, as expressed in the architecture, requires some trial and error. Refer to the tips below to minimize trial and error when applying them to your workload.

3.1. Downloading model
The torch_distributed_zero_first context manager is used to ensure that only one process (typically rank 0 in a distributed setup) performs certain operations, such as downloading or loading a model. This is crucial in a distributed environment where multiple processes might attempt to load a model concurrently, which could lead to redundant downloads, excessive memory usage, or conflicts. Here’s why torch_distributed_zero_first is used to download the model on a single process:
Prevent Redundant Downloads: In a distributed setup, if every process tries to download the model simultaneously, it can lead to unnecessary network traffic and redundant file storage. By ensuring that only one process downloads the model, torch_distributed_zero_first prevents this redundancy.
Avoid Conflicts and File Corruption: If multiple processes attempt to write or modify the same file during download, it could lead to file corruption or access conflicts. torch_distributed_zero_first minimizes this risk by allowing only one process to handle the file download.

After downloading, the model can be distributed or loaded into memory across all processes using standard PyTorch distributed training methods. This approach makes the model loading process more efficient and stable in multi-process environments.

3.2. Destroying process group
When applying distributed training on Azure ML with torchtune’s CLI, it’s essential to manage the process groups carefully. The distributed training recipe in the torchtune CLI initializes a process group using dist.init_process_group(...). However, if a process group is already active, initializing another one can cause conflicts, leading to nested or redundant process groups.

To prevent this, you should close any existing process group before torchtune’s distributed training starts. This can be done by calling dist.destroy_process_group() to terminate the active process group, ensuring a clean state. By doing so, you avoid process conflicts, enabling the torchtune CLI’s distributed training recipe to operate without overlapping with pre-existing groups.

Code snippets for 3.1 and 3.2 are below.

    import os
    from contextlib import contextmanager

    import torch
    import torch.distributed as dist

    MASTER_ADDR = os.environ.get('MASTER_ADDR', '127.0.0.1')
    MASTER_PORT = os.environ.get('MASTER_PORT', '7777')
    WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))
    GLOBAL_RANK = int(os.environ.get('RANK', -1))
    LOCAL_RANK = int(os.environ.get('LOCAL_RANK', -1))

    NUM_GPUS_PER_NODE = torch.cuda.device_count()
    NUM_NODES = WORLD_SIZE // NUM_GPUS_PER_NODE

    if LOCAL_RANK != -1:
        dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo")

    @contextmanager
    def torch_distributed_zero_first(local_rank: int):
        """
        Context manager to make all processes in distributed training wait for each local_master to do something.
        """
        # Non-master ranks wait here until the local master has finished.
        if local_rank not in [-1, 0]:
            dist.barrier(device_ids=[local_rank])
        yield
        # The local master reaches the barrier last, releasing the other ranks.
        if local_rank == 0:
            dist.barrier(device_ids=[0])

    ...

    with torch_distributed_zero_first(LOCAL_RANK):
        # Download the model
        download_model(args.teacher_model_id, args.teacher_model_dir)
        download_model(args.student_model_id, args.student_model_dir)

    # Construct the fine-tuning command
    if "single" in args.tune_recipe:
        print("***** Single Device Training *****")
        full_command = (
            f'tune run '
            f'{args.tune_recipe} '
            f'--config {args.tune_config_name}'
        )
        # Run the fine-tuning command
        run_command(full_command)
    else:
        print("***** Distributed Training *****")
        # Close the existing process group so that the torchtune recipe can create its own.
        if dist.is_initialized():
            dist.destroy_process_group()
        if GLOBAL_RANK in {-1, 0}:
            # Run the fine-tuning command
            full_command = (
                f'tune run --master-addr {MASTER_ADDR} --master-port {MASTER_PORT} --nnodes {NUM_NODES} --nproc_per_node {NUM_GPUS_PER_NODE} '
                f'{args.tune_recipe} '
                f'--config {args.tune_config_name}'
            )
            run_command(full_command)
    ...

3.3. Dynamic configuration
Since the path to the blob storage mounted on the computing cluster is dynamic, the YAML recipe must be modified dynamically. Here’s an example of how to adjust the configuration using Jinja templates to ensure the paths are set correctly at runtime:

    # Dynamically modify fine-tuning YAML file.
    import os
    from pathlib import Path

    import jinja2

    jinja_env = jinja2.Environment()
    template = jinja_env.from_string(Path(args.tune_config_name).read_text())
    train_path = os.path.join(args.train_dir, "train.jsonl")
    metric_logger = "DiskLogger"
    if len(args.wandb_api_key) > 0:
        metric_logger = "WandBLogger"

    # Overwrite the template file with the rendered configuration.
    Path(args.tune_config_name).write_text(
        template.render(
            train_path=train_path,
            log_dir=args.log_dir,
            model_dir=args.model_dir,
            model_output_dir=args.model_output_dir,
            metric_logger=metric_logger
        )
    )

lora_finetune.yaml code snippet

    # Model arguments
    model:
      ...

    # Tokenizer
    tokenizer:
      _component_: torchtune.models.phi3.phi3_mini_tokenizer
      path: {{model_dir}}/tokenizer.model
      max_seq_len: null

    # Checkpointer
    checkpointer:
      _component_: torchtune.training.FullModelHFCheckpointer
      checkpoint_dir: {{model_dir}}
      checkpoint_files: [
        model-00001-of-00002.safetensors,
        model-00002-of-00002.safetensors
      ]
      recipe_checkpoint: null
      output_dir: {{model_output_dir}}
      model_type: PHI3_MINI
    resume_from_checkpoint: False
    save_adapter_weights_only: False

    # Dataset
    dataset:
      _component_: torchtune.datasets.instruct_dataset
      source: json
      data_files: {{train_path}}
      column_map:
        input: instruction
        output: output
      train_on_input: False
      packed: False
      split: train
    seed: null
    shuffle: True

    # Logging
    output_dir: {{log_dir}}/lora_finetune_output
    metric_logger:
      _component_: torchtune.training.metric_logging.{{metric_logger}}
      log_dir: {{log_dir}}/training_logs
    log_every_n_steps: 1
    log_peak_memory_stats: False
    ...

In this setup:
The script reads the template YAML file and dynamically injects the appropriate paths and configurations.
train_path, log_dir, model_dir, and model_output_dir are populated based on the environment’s dynamically assigned paths, ensuring that the YAML file reflects the actual storage locations.
metric_logger is set to "DiskLogger" by default but changes to "WandBLogger" if a wandb_api_key is provided, allowing for flexible metric logging configurations.

This approach keeps the configuration in sync with the environment, even when paths are assigned dynamically by Azure ML’s blob storage mounting.

3.4. Logging
When running a training pipeline with the torchtune CLI, it may be challenging to use MLflow for logging. Therefore, you should use torchtune’s DiskLogger or WandBLogger instead. The DiskLogger option logs metrics and training information directly to disk, making it a suitable choice when MLflow is unavailable.
Alternatively, if you have a Weights & Biases (WandB) account and API key, the WandBLogger can be used to log metrics to your WandB dashboard, enabling remote access and visualization of training progress. This way, you can ensure robust logging and monitoring within the torchtune framework.

Before reading this section, please refer to the Azure guide and past blogs (Blog 1, Blog 2) for basic information on Azure ML training and serving.

4.1. Dataset preparation
torchtune provides several dataset options, but in this blog, we will show how to save a Hugging Face dataset as JSON and register it as a Data asset in the Azure Blob Datastore. If you would like to build or augment your own dataset, please refer to the blog and the GitHub repo for synthetic data generation.

Instruction Dataset for SFT and KD
Preprocessing the dataset is not difficult, but don’t forget to convert the column names to match the specifications in the YAML file.

    from datasets import load_dataset

    dataset = load_dataset("HuggingFaceH4/helpful_instructions", name="self_instruct", split="train[:10%]")
    # Rename the columns to match column_map in the YAML recipe.
    dataset = dataset.rename_column('prompt', 'instruction')
    dataset = dataset.rename_column('completion', 'output')

    print(f"Loaded Dataset size: {len(dataset)}")

    if IS_DEBUG:
        logger.info("Activated Debug mode. The number of samples was reduced to 800.")
        dataset = dataset.select(range(800))
        print(f"Debug Dataset size: {len(dataset)}")

    logger.info(f"Save dataset to {SFT_DATA_DIR}")
    dataset = dataset.train_test_split(test_size=0.2)
    train_dataset = dataset['train']
    train_dataset.to_json(f"{SFT_DATA_DIR}/train.jsonl", force_ascii=False)
    test_dataset = dataset['test']
    test_dataset.to_json(f"{SFT_DATA_DIR}/eval.jsonl", force_ascii=False)

Preference Dataset for DPO
For the preference dataset, it may be necessary to convert it into a chat template format. Below is a code example.
    import json

    from datasets import load_dataset

    def convert_to_preference_format(dataset):
        # Build one chosen and one rejected conversation per row.
        json_format = [
            {
                "chosen_conversations": [
                    {"content": row["prompt"], "role": "user"},
                    {"content": row["chosen"], "role": "assistant"}
                ],
                "rejected_conversations": [
                    {"content": row["prompt"], "role": "user"},
                    {"content": row["rejected"], "role": "assistant"}
                ]
            }
            for row in dataset
        ]
        return json_format

    # Load dataset from the hub
    data_path = "jondurbin/truthy-dpo-v0.1"
    dataset = load_dataset(data_path, split="train")
    print(f"Dataset size: {len(dataset)}")

    # if IS_DEBUG:
    #     logger.info("Activated Debug mode. The number of samples was reduced to 800.")
    #     dataset = dataset.select(range(800))

    logger.info(f"Save dataset to {DPO_DATA_DIR}")
    dataset = dataset.train_test_split(test_size=0.2)
    train_dataset = dataset['train']
    test_dataset = dataset['test']

    train_dataset = convert_to_preference_format(train_dataset)
    test_dataset = convert_to_preference_format(test_dataset)

    # Note: json.dump writes a single JSON array, not one JSON object per line.
    with open(f"{DPO_DATA_DIR}/train.jsonl", "w") as f:
        json.dump(train_dataset, f, ensure_ascii=False, indent=4)

    with open(f"{DPO_DATA_DIR}/eval.jsonl", "w") as f:
        json.dump(test_dataset, f, ensure_ascii=False, indent=4)

4.2. Environment asset
You can add pip install to the command based on the curated environment or add a conda-based custom environment, but in this blog, we will add a Docker-based custom environment.

    FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2

    # Install pip dependencies
    COPY requirements.txt .
    RUN pip install -r requirements.txt --no-cache-dir

    # Inference requirements
    COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/
    RUN /var/requirements/install_system_requirements.sh && \
        cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
        cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
        ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
        rm -f /etc/nginx/sites-enabled/default
    ENV SVDIR=/var/runit
    ENV WORKER_TIMEOUT=400
    EXPOSE 5001 8883 8888

    # support Deepspeed launcher requirement of passwordless ssh login
    RUN apt-get update && apt-get install -y openssh-server openssh-client

    RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation

[Tip] If you are building a container with Ubuntu 22.04, make sure to remove the liblttng-ust0 related packages/dependencies. Otherwise, you will get an error when building the container.

    FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2204-cu124-py310-torch250:biweekly.202410.2
    ...
    # Remove packages or dependencies related to liblttng-ust0.
    # Starting from Ubuntu 22.04, liblttng-ust0 has been updated to the liblttng-ust1 package, deprecating liblttng-ust0 for compatibility reasons.
    # If you build a Dockerfile on Ubuntu 22.04 without including this line, you will get the following liblttng-ust0 error:
    # -- Package 'liblttng-ust0' has no installation candidate
    RUN sed -i '/liblttng-ust0/d' /var/requirements/system_requirements.txt
    ...

4.3. Start a Training job
The code snippet below activates a compute cluster for training. The command function allows the user to configure the following key aspects:
inputs – This is the dictionary of inputs using name value pairs to the command. Inputs are referenced in the command using the ${{inputs.<input_name>}} expression. To use files or folders as inputs, we can use the Input class. The Input class supports three parameters:
    type – The type of input. This can be a uri_file or uri_folder. The default is uri_folder.
    path – The path to the file or folder. These can be local or remote files or folders. For remote files, http/https and wasb are supported. Azure ML data/dataset or datastore are of type uri_folder. To use data/dataset as input, you can use a registered dataset in the workspace using the format '<name>:<version>'. For example, Input(type="uri_folder", path="my_dataset:1").
    mode – Mode of how the data should be delivered to the compute target. Allowed values are ro_mount, rw_mount and download. Default is ro_mount.
code – This is the path where the code to run the command is located.
compute – The compute on which the command will run. You can run it on the local machine by using local for the compute.
command – This is the command that needs to be run.
environment – This is the environment needed for the command to run. Curated (built-in) or custom environments from the workspace can be used.
instance_count – Number of nodes. Default is 1.
distribution – Distribution configuration for distributed training scenarios. Azure Machine Learning supports PyTorch, TensorFlow, and MPI-based distributed training.
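To make the ${{inputs.<input_name>}} mechanism concrete before the full job definition, here is a minimal sketch in plain Python. It only builds the command string and does not call the Azure ML SDK. The helper names build_command and has_value are hypothetical and are not part of torchtune or Azure ML.

```python
from typing import List, Optional


def has_value(value: Optional[str]) -> bool:
    """Return True only for a non-empty string (hypothetical helper).

    Checking for None first avoids calling len() on None.
    """
    return value is not None and len(value) > 0


def build_command(script: str, input_names: List[str]) -> str:
    """Build a command string that references each job input by name.

    Every flag is paired with a placeholder of the form
    ${{inputs.<input_name>}}; Azure ML substitutes the actual value
    when the job runs. This function is an illustration only.
    """
    flags = ["--" + name + " ${{inputs." + name + "}}" for name in input_names]
    return " ".join(["python " + script] + flags)


# Example: optional inputs are added only when a value is present.
wandb_api_key: Optional[str] = None
names = ["train_dir", "model_dir"]
if has_value(wandb_api_key):
    names.insert(0, "wandb_api_key")

command_string = build_command("launcher_single.py", names)
print(command_string)
```

Generating the flags from a list of names keeps the command string and the keys of the inputs dictionary aligned, which is easy to get wrong when the string is assembled by hand.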
    from azure.ai.ml import command
    from azure.ai.ml import Input
    from azure.ai.ml.entities import ResourceConfiguration
    from utils.aml_common import get_num_gpus

    num_gpu = get_num_gpus(azure_compute_cluster_size)
    logger.info(f"Number of GPUs={num_gpu}")

    str_command = ""
    if USE_BUILTIN_ENV:
        str_env = "azureml://registries/azureml/environments/acpt-pytorch-2.2-cuda12.1/versions/19"  # Use Curated (built-in) Environment asset
        str_command += "pip install -r requirements.txt && "
    else:
        str_env = f"{azure_env_name}@latest"  # Use custom Environment asset

    if num_gpu > 1:
        tune_recipe = "lora_finetune_distributed"
        str_command += "python launcher_distributed.py "
    else:
        tune_recipe = "lora_finetune_single_device"
        str_command += "python launcher_single.py "

    # Check for None first; len(None) would raise a TypeError.
    if wandb_api_key is not None and len(wandb_api_key) > 0:
        str_command += (
            "--wandb_api_key ${{inputs.wandb_api_key}} "
            "--wandb_project ${{inputs.wandb_project}} "
            "--wandb_watch ${{inputs.wandb_watch}} "
        )

    str_command += (
        "--train_dir ${{inputs.train_dir}} "
        "--hf_token ${{inputs.hf_token}} "
        "--tune_recipe ${{inputs.tune_recipe}} "
        "--tune_action ${{inputs.tune_action}} "
        "--model_id ${{inputs.model_id}} "
        "--model_dir ${{inputs.model_dir}} "
        "--log_dir ${{inputs.log_dir}} "
        "--model_output_dir ${{inputs.model_output_dir}} "
        "--tune_config_name ${{inputs.tune_config_name}}"
    )
    logger.info(f"Tune recipe: {tune_recipe}")

    job = command(
        inputs=dict(
            #train_dir=Input(type="uri_folder", path=SFT_DATA_DIR), # Get data from local path
            train_dir=Input(path=f"{AZURE_SFT_DATA_NAME}@latest"),  # Get data from Data asset
            hf_token=HF_TOKEN,
            wandb_api_key=wandb_api_key,
            wandb_project=wandb_project,
            wandb_watch=wandb_watch,
            tune_recipe=tune_recipe,
            tune_action="fine-tune,run-quant",
            model_id=HF_MODEL_NAME_OR_PATH,
            model_dir="./model",
            log_dir="./outputs/log",
            model_output_dir="./outputs",
            tune_config_name="lora_finetune.yaml"
        ),
        code="./scripts",  # local path where the code is stored
        compute=azure_compute_cluster_name,
        command=str_command,
        environment=str_env,
        instance_count=1,
        distribution={
            "type": "PyTorch",
            "process_count_per_instance": num_gpu,  # For multi-gpu training set this to an integer value more than 1
        },
    )

    returned_job = ml_client.jobs.create_or_update(job)
    logger.info("""Started training job. Now a dedicated Compute Cluster for training is provisioned and the environment
    required for training is automatically set up from Environment.

    If you have set up a new custom Environment, it will take approximately 20 minutes or more to set up the Environment before provisioning the training cluster.
    """)
    ml_client.jobs.stream(returned_job.name)

4.4. Logging
Use torchtune.training.metric_logging.DiskLogger or torchtune.training.metric_logging.WandBLogger. When applying DiskLogger, the save path must be a subfolder of outputs. Otherwise, you cannot check it in the Azure ML UI.

[Screenshot: DiskLogger applied]
[Screenshot: WandBLogger applied]

Any additional training history is recorded in the user_logs folder of Azure ML.

[Screenshot: example when using Standard_NC48ads_A100_v4 (NVIDIA A100 GPU x 2ea) as a compute cluster]

Please do not forget to save the quantized model parameters when you apply the fine-tuning, evaluation, and quantization pipeline in your training code. It is recommended that you also save the original model weights before quantization for comparison.

4.5. Registering a Model
Once you have fine-tuned and quantized your model using torchtune, you can register it as a Model asset on Azure ML. This registration process offers several advantages, making model management and deployment more efficient and organized. Here are the advantages of registering as a Model asset.
Version Control: Azure ML’s Model asset allows you to maintain multiple versions of a model. Each new iteration of your model, whether it’s a different fine-tuning configuration or an updated quantization approach, can be registered as a new version.
This makes it easy to track model evolution, compare performance across versions, and roll back to previous versions if necessary.
Centralized Repository: By registering your model as an asset, you store it in a centralized repository. This repository provides easy access for other team members or projects within your organization, enabling collaboration and consistent model usage across different applications.
Deployment Ready: Models registered as assets in Azure ML are directly deployable. This means you can set up endpoints, batch inference pipelines, or other serving mechanisms using the registered model, streamlining the deployment process and minimizing potential errors.
Metadata Management: Along with the model, you can also store relevant metadata (such as training configuration, environment details, and evaluation metrics) in the Model asset. This metadata is essential for reproducibility and for understanding model performance under different conditions.

Below is a code snippet that registers a model asset and downloads the model artifact.

    import os

    from azure.ai.ml.entities import Model
    from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError

    def get_or_create_model_asset(ml_client, model_name, job_name, model_dir="outputs", model_type="custom_model",
                                  download_quantized_model_only=False, update=False):
        try:
            latest_model_version = max([int(m.version) for m in ml_client.models.list(name=model_name)])
            if update:
                raise ResourceExistsError('Found Model asset, but will update the Model.')
            else:
                model_asset = ml_client.models.get(name=model_name, version=str(latest_model_version))
                print(f"Found Model asset: {model_name}. Will not create again")
        # ValueError covers the case where the version list is empty and max() fails.
        except (ResourceNotFoundError, ResourceExistsError, ValueError) as e:
            print(f"Exception: {e}")
            model_path = f"azureml://jobs/{job_name}/outputs/artifacts/paths/{model_dir}"
            if download_quantized_model_only:
                model_path = f"azureml://jobs/{job_name}/outputs/artifacts/paths/{model_dir}/quant"
            run_model = Model(
                name=model_name,
                path=model_path,
                description="Model created from run.",
                type=model_type  # mlflow_model, custom_model, triton_model
            )
            model_asset = ml_client.models.create_or_update(run_model)
            print(f"Created Model asset: {model_name}")

        return model_asset

    model = get_or_create_model_asset(ml_client, azure_model_name, job_name, model_dir, model_type="custom_model",
                                      download_quantized_model_only=True, update=False)

    # Download the model (this is optional)
    DOWNLOAD_TO_LOCAL = False
    local_model_dir = "./artifact_downloads_dpo"

    if DOWNLOAD_TO_LOCAL:
        os.makedirs(local_model_dir, exist_ok=True)
        ml_client.models.download(name=azure_model_name, download_path=local_model_dir, version=model.version)

We have published the end-to-end code for this post at https://github.com/Azure/torchtune-azureml.

We hope you can easily perform fine-tuning/evaluation/quantization using torchtune and Azure ML.
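As a footnote to the registration snippet in section 4.5, the "find the latest version" step can be isolated and checked without an Azure workspace. The sketch below is plain Python and assumes that version labels arrive as strings; latest_version is a hypothetical helper, not an Azure ML API.

```python
from typing import Iterable, Optional


def latest_version(versions: Iterable[str]) -> Optional[int]:
    """Return the highest numeric version, or None if there is none.

    Versions are compared as integers, so "10" ranks above "9".
    Non-numeric labels are skipped rather than raising an error.
    """
    numeric = [int(v) for v in versions if str(v).isdigit()]
    return max(numeric) if numeric else None


# Integer comparison avoids the lexicographic trap where "9" > "10".
print(latest_version(["1", "9", "10"]))
# An empty list yields None instead of raising ValueError from max().
print(latest_version([]))
```

Returning None for the empty case lets the caller decide explicitly whether to create the first version, instead of relying on an exception raised by max().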