
Fine-tuning a Hugging Face Diffusion Model on CycleCloud Workspace for Slurm



Introduction

Azure CycleCloud Workspace for Slurm (CCWS) is a new Infrastructure as a Service (IaaS) solution that allows users to purpose-build environments for their distributed AI training and HPC simulation needs. The offer, available through the Azure Marketplace, simplifies and streamlines the creation and management of Slurm clusters in Azure. Users can easily create and configure predefined Slurm clusters with Azure CycleCloud without any prior knowledge of the cloud or Slurm, while retaining full control over the infrastructure. Slurm clusters come pre-configured with PMIx v4, Pyxis, and enroot to support containerized AI Slurm jobs, and users can import their own libraries or customize their clusters by selecting software and libraries to fit their needs. Users can access the provisioned login nodes (or scheduler node) over SSH or with Visual Studio Code to perform common tasks such as submitting and managing Slurm jobs. This blog outlines the steps to fine-tune a Hugging Face diffusion model using CCWS.
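For example, because Pyxis and enroot are pre-configured, a containerized job can be launched directly through srun; a minimal illustration (the container image and tag here are only an example, not from this walkthrough):

$ srun --container-image=nvcr.io#nvidia/pytorch:24.03-py3 nvidia-smi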

Deploy Azure CycleCloud Workspace for Slurm

Follow the steps listed here to deploy Azure CCWS from the Azure Marketplace.

The provisioned environment for CCWS consists of:

(1) 1 scheduler node on the D4asv4 SKU with Slurm 23.11.7.

(2) Up to 10 on-demand GPU nodes of the NDasrA100v4 SKU in the GPU partition.

(3) Up to 16 on-demand HPC nodes of the HBv3 SKU in the HPC partition.

(4) Up to 100 on-demand HTC nodes of the NCA100v4 SKU in the HTC partition.

(5) 1 TB of shared NFS.

Note: For the node types in (1) through (4) above, it is important to request quota in the Azure portal before deploying CCWS.


Once CCWS is deployed, connect to the scheduler node via SSH. You can check the mounted filesystems using:

$ df -h


This cluster has 1TB of shared NFS available to all nodes.

Environment preparation

Once CCWS is deployed, SSH to the login node.

(1) Install the conda environment and required tools on the shared NFS. In our environment, the shared NFS is mounted at /shared.

First, let’s install Miniconda.

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Install Miniconda under /shared. The -p flag sets the install prefix and -b runs the installer non-interactively:

$ bash Miniconda3-latest-Linux-x86_64.sh -b -p /shared/miniconda3
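Once installed, activate the base environment so that conda and Python resolve from the shared install (the path assumes the prefix above):

$ source /shared/miniconda3/bin/activate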


(2) Once the conda installation is complete, install Python 3.8. Newer versions of Python may cause library conflicts with this code.

$ conda install python=3.8

(3) Clone the Hugging Face diffusers repository and install the required libraries.

$ git clone https://github.com/huggingface/diffusers
$ cd diffusers
$ pip install .

(4) Go to the examples folder and install the packages listed in its requirements.txt file, as shown below.
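A minimal sketch, assuming the text-to-image fine-tuning example (the blog does not name the specific example folder):

$ cd examples/text_to_image
$ pip install -r requirements.txt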

Prepare the Slurm submission script

Below is the Slurm submission script used to start the multi-GPU training job. The provisioned GPU nodes are NDasrA100v4 SKUs with 8 GPUs each.

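A minimal sketch of such a script, assuming the diffusers text-to-image example, a partition named gpu, and placeholder model, dataset, and output paths:

#!/bin/bash
#SBATCH --job-name=sd-finetune
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --output=sd-finetune-%j.log

# Activate the shared conda install (assumes the Miniconda prefix used earlier)
source /shared/miniconda3/bin/activate

# Placeholders: the base model and training dataset are assumptions, not from this walkthrough
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/naruto-blip-captions"

cd /shared/diffusers/examples/text_to_image

# Launch one training process per GPU across the node's 8 A100s
accelerate launch --multi_gpu --num_processes 8 train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --output_dir=/shared/sd-finetuned-model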

Jobs can be submitted using sbatch and queried using squeue. When a job is submitted, Azure CCWS takes a few minutes (6-7) to provision the GPU nodes before the job starts running.

Querying sinfo lists the nodes allocated to the job and the nodes available. It is important to note that users are only charged for nodes in the 'allocated' state, not for nodes in the 'idle' state.
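For example (the script name is a placeholder):

$ sbatch submit.slurm
$ squeue
$ sinfo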

NODELIST provides the node names, and users can SSH from the login node to a compute node using those names, as shown below.


$ ssh ccw-gpu-1
$ nvidia-smi


Model inference

Once fine-tuning is complete, you can test the model for inference. We test it here with a simple Python script, although there are several other ways to run inference.

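A minimal sketch of such a script, using the diffusers StableDiffusionPipeline API with a placeholder path for the fine-tuned model:

from diffusers import StableDiffusionPipeline
import torch

# Placeholder: path to the fine-tuned model written by the training job
model_path = "/shared/sd-finetuned-model"

# Load the fine-tuned weights and move the pipeline to the GPU
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate an image from the prompt and save it
image = pipe(prompt="yoda").images[0]
image.save("yoda.png")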

The above script uses the fine-tuned model to generate an image of the character Yoda from the prompt "yoda".


Conclusion

This blog outlined the steps to fine-tune a Hugging Face diffusion model using Azure CycleCloud Workspace for Slurm. The environment is secure and fully customizable: users can bring their own code and software packages, deploy on their own virtual networks, choose their virtual machines, and select high-performance storage (e.g., Azure Managed Lustre). The operating system image (Ubuntu 20.04 in this case) is available from the Azure Marketplace and comes pre-configured with InfiniBand drivers, MPI libraries, the NCCL communicator, and the CUDA toolkit, so a cluster can be up and running in minutes. Importantly, GPU VMs are on-demand (auto-scaling), so users are only billed for the period of time they are used.




