Comprehensive Nvidia GPU Monitoring for Azure N-Series VMs Using Telegraf with Azure Monitor

by info.odysseyx@gmail.com, September 30, 2024

In today's AI and HPC environments, GPU monitoring has become essential because of the complexity and high resource demands of these workloads. Effective monitoring ensures that your GPUs are used optimally and helps prevent both under- and over-utilization, each of which can hurt performance and increase costs. By identifying bottlenecks such as memory limitations or thermal throttling, GPU monitoring helps optimize performance and enables a smoother workflow. In cloud environments such as Azure, where GPU resources are expensive, monitoring plays an important role in managing costs by tracking usage patterns and promoting efficient resource allocation. Monitoring also helps with capacity planning, workload scaling, and forecasting, so that resources are allocated appropriately for future requirements.

Azure Monitor provides powerful tools for tracking CPU, memory, storage, and network usage, but GPU monitoring is not supported by default. For Azure N-series VMs, tracking GPU performance requires additional configuration through a third-party tool or integration such as Telegraf. At the time of writing, Azure Monitor has no built-in GPU metrics without such an external solution.

Telegraf is an open-source, lightweight agent developed by InfluxData, designed to collect, process, and transmit metric and event data from a variety of systems, applications, and services. It supports a wide range of input plugins, allowing you to collect data from sources such as system statistics, databases, and APIs. Telegraf can then send this data to various destinations, including monitoring platforms such as InfluxDB, Azure Monitor, or other time series databases.
Its flexibility and low resource footprint make it ideal for real-time monitoring of infrastructure and applications, especially in cloud environments.

In this blog, we will see how to configure Telegraf to send GPU monitoring metrics to Azure Monitor. This guide covers all the steps required to enable GPU monitoring, so you can effectively track and optimize GPU performance in Azure.

Step 1: Make changes in Azure to send GPU metrics from the Telegraf agent to Azure Monitor on the VM or VMSS.

1. Register the microsoft.insights resource provider in your Azure subscription. Refer to: Resource providers and resource types - Azure Resource Manager | Microsoft Learn

2. Enable a managed service identity on the Azure VM or Azure VMSS so that it can authenticate. In this example, we are using a managed identity for authentication. You can also authenticate using user-assigned managed identities or service principals. Refer to: telegraf/plugins/outputs/azure_monitor at release-1.15 · influxdata/telegraf (github.com)

Step 2: Set up the Telegraf agent inside your VM or VMSS to send data to Azure Monitor.

For this example, we will use an Azure Standard_ND96asr_v4 VM with the Ubuntu-HPC 2204 image to configure the environment for both VMs and VMSS. The Ubuntu-HPC 2204 image comes with NVIDIA GPU drivers and CUDA preinstalled. If you choose a different image, you will need to install the required GPU drivers and the CUDA toolkit yourself.

To install the Telegraf agent on Ubuntu 22.04, download and run the gpumon-setup.sh script. This script also enables the NVIDIA SMI input plugin and sets up the Telegraf configuration to send data to Azure Monitor.
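For orientation, a Telegraf configuration for this kind of setup generally pairs the `nvidia_smi` input plugin with the `azure_monitor` output plugin. The snippet below is a minimal sketch under stated assumptions, not the contents of gpumon-setup.sh: it writes an illustrative configuration to a scratch file so it can be inspected without touching /etc/telegraf/telegraf.conf. The `CONF` path, the file name, and the interval values are assumptions chosen for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch location for this sketch; the live configuration
# is /etc/telegraf/telegraf.conf and is not modified here.
CONF="${CONF:-$(mktemp -d)/telegraf-gpu-sketch.conf}"

cat > "$CONF" <<'EOF'
# Illustrative only: not the file generated by gpumon-setup.sh.
[agent]
  interval = "10s"        # how often inputs are sampled (assumed value)
  flush_interval = "10s"  # how often outputs are written (assumed value)

# Collect per-GPU statistics by calling the nvidia-smi binary.
[[inputs.nvidia_smi]]
  bin_path = "/usr/bin/nvidia-smi"
  timeout = "5s"

# Send metrics to Azure Monitor as custom metrics.
# region and resource_id are left unset so the plugin can look them up
# from the VM's instance metadata.
[[outputs.azure_monitor]]
  namespace_prefix = "Telegraf/"
EOF

echo "wrote sketch config to $CONF"
```

On a machine where Telegraf and the NVIDIA driver are installed, a file like this could be checked with `telegraf --config "$CONF" --test`, which runs the inputs once and prints the gathered metrics without sending them to any output.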
Run the following commands to download and execute the setup script:

    wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpumon-setup.sh -O gpumon-setup.sh
    chmod +x gpumon-setup.sh
    ./gpumon-setup.sh

Test your Telegraf configuration by running the following command:

    sudo telegraf --config /etc/telegraf/telegraf.conf --test

Step 3: Create a dashboard in Azure Monitor to check NVIDIA GPU usage.

Telegraf includes an output plugin designed specifically for Azure Monitor, allowing users to send custom metrics directly to the platform. Azure Monitor operates with a metric resolution of 1 minute, so the Telegraf output plugin automatically aggregates metrics into 1-minute buckets and sends them to Azure Monitor at each flush interval. Metrics from each input plugin are written to a separate Azure Monitor namespace, prefixed with "Telegraf/" by default so that they are easy to identify.

To visualize your NVIDIA GPU usage, go to the Metrics section in the Azure portal. Select the VM name as the scope, then select the metric namespace `Telegraf/nvidia-smi`. Here you can select various metrics to check your NVIDIA GPU utilization. You can also apply filters and splits to analyze your data further.

You can create GPU monitoring dashboards for both VMs and VMSS. Here are some sample charts to consider:

[Figure: sample GPU monitoring charts]

Bonus: Simulate GPU usage using a sample training program.

If you are testing and do not have a program at hand to simulate GPU usage, I've got you covered! I created a script that runs a multi-GPU distributed training model. The script installs Anaconda and sets up the environment needed to run distributed training with TensorFlow. Running it generates GPU load, which lets you check the monitoring metrics you have set up.
To get started, run the following commands:

    wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpu_test_program.sh -O gpu_test_program.sh
    chmod +x gpu_test_program.sh
    ./gpu_test_program.sh

I hope you found this blog post helpful. With the right tools and insights, you can unlock the full potential of your GPU resources. Happy reading!

References:

https://azuremarketplace.microsoft.com/en-gb/marketplace/apps/microsoft-dsvm.ubuntu-hpc?tab=PlansAnd…
https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndasra100v4-series?ta…
telegraf/plugins/outputs/azure_monitor at release-1.15 · influxdata/telegraf (github.com)
telegraf/plugins/inputs/nvidia_smi at release-1.15 · influxdata/telegraf (github.com)