
Comprehensive Nvidia GPU Monitoring for Azure N-Series VMs Using Telegraf with Azure Monitor

by info.odysseyx@gmail.com


In today’s AI and HPC environments, GPU monitoring has become essential due to the complexity and high resource demands of these workloads. Effective monitoring ensures that your GPU is utilized optimally, preventing under- and over-utilization, which can negatively impact performance and increase costs. By identifying bottlenecks such as memory limitations or thermal throttling, GPU monitoring optimizes performance and enables a smoother workflow. In cloud environments such as Azure, where GPU resources are expensive, monitoring plays an important role in managing costs by tracking usage patterns and promoting efficient resource allocation. Monitoring also helps with capacity planning, workload scaling, and forecasting, ensuring resources are allocated appropriately for future requirements.

Azure Monitor provides powerful tools for tracking CPU, memory, storage, and network usage, but GPU monitoring is not supported by default. For Azure N-series VMs, tracking GPU performance requires additional configuration through a third-party tool or integration such as Telegraf. At the time of writing, Azure Monitor has no built-in GPU metrics without such an external solution.

Telegraf is an open source, lightweight agent developed by InfluxData, designed to collect, process, and transmit metric and event data from a variety of systems, applications, and services. It supports a wide range of input plugins, allowing you to collect data from sources such as system statistics, databases, and APIs. Telegraf can then output this data to various destinations, including monitoring platforms such as InfluxDB, Azure Monitor, or other time series databases. Its flexibility and low resource footprint make it well suited to real-time monitoring of infrastructure and applications, especially in cloud environments.

In this blog, we will see how to configure Telegraf to send GPU monitoring metrics to Azure Monitor. This comprehensive guide covers all the steps required to enable GPU monitoring, so you can effectively track and optimize GPU performance in Azure.

Step 1: Make changes in Azure to send GPU metrics from the Telegraf agent to Azure Monitor on the VM or VMSS.

1. Register the microsoft.insights resource provider for your Azure subscription. Refer to: Resource providers and resource types – Azure Resource Manager | Microsoft Learn


2. Enable a managed identity on the Azure VM or Azure VMSS so that it can authenticate. In this example, we are using a managed identity for authentication. You can also authenticate using user-assigned managed identities or service principals. Refer to: telegraf/plugins/outputs/azure_monitor · influxdata/telegraf in release-1.15 (github.com)
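
The two portal steps above can also be scripted. The following is a minimal sketch using the Azure CLI, assuming `az` is installed and already logged in to the right subscription. The function names and the resource names in the usage comment are hypothetical placeholders, not values from this post. Note that `az vm identity assign` without further arguments enables a system-assigned identity.

```shell
# Register the microsoft.insights resource provider for the current
# subscription, then print its registration state.
register_insights_provider() {
  az provider register --namespace microsoft.insights
  az provider show --namespace microsoft.insights --query registrationState --output tsv
}

# Enable a managed identity on a VM or a scale set.
# $1 = vm|vmss, $2 = resource group, $3 = resource name
enable_managed_identity() {
  case "$1" in
    vm|vmss)
      az "$1" identity assign --resource-group "$2" --name "$3"
      ;;
    *)
      echo "usage: enable_managed_identity vm|vmss <resource-group> <name>" >&2
      return 2
      ;;
  esac
}

# Example (hypothetical names):
#   register_insights_provider
#   enable_managed_identity vm my-gpu-rg my-nd96-vm
```

Registration can take a few minutes, so the state printed by the first function may read `Registering` before it becomes `Registered`.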


Step 2: Set up the Telegraf agent inside your VM or VMSS to send data to Azure Monitor

For this example, we will use an Azure Standard_ND96asr_v4 VM with the Ubuntu-HPC 2204 image. The same configuration applies to both VMs and VMSS. The Ubuntu-HPC 2204 image comes with NVIDIA GPU drivers and CUDA preinstalled. If you choose to use a different image, you will need to install the required GPU drivers and CUDA toolkit.
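
If you are unsure whether an image already has the driver, a quick pre-flight check can save a failed setup. This is a small sketch, not part of the original setup script. The function name is hypothetical, and the optional argument exists only so the check can be tried on a machine without a GPU.

```shell
# Check that the NVIDIA driver tooling is present before installing Telegraf.
# $1 = binary to look for (defaults to nvidia-smi)
preflight_gpu_check() {
  smi_bin="${1:-nvidia-smi}"
  if ! command -v "$smi_bin" >/dev/null 2>&1; then
    echo "$smi_bin not found: install the NVIDIA driver and CUDA toolkit first" >&2
    return 1
  fi
  # -L lists each GPU the driver can see, one per line.
  "$smi_bin" -L
}

# Example:
#   preflight_gpu_check && echo "GPU driver looks usable"
```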

To install the Telegraf agent on Ubuntu 22.04, download and run the ‘gpumon-setup.sh’ script. This script also configures the NVIDIA SMI input plugin and sets up your Telegraf configuration to send data to Azure Monitor.

Run the following command:

wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpumon-setup.sh -O gpumon-setup.sh
chmod +x gpumon-setup.sh
./gpumon-setup.sh
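
The contents of the script are not reproduced in this post. As a rough idea of what a minimal Telegraf configuration for this setup might contain, here is a sketch that writes a sample file to a temporary location. The file path and the variable name `SAMPLE_CONF` are assumptions for illustration; the option values shown are the plugins' documented defaults, and the real script may set different ones.

```shell
# Write an illustrative Telegraf configuration to a scratch path.
# This does NOT touch /etc/telegraf/telegraf.conf.
SAMPLE_CONF="${SAMPLE_CONF:-${TMPDIR:-/tmp}/telegraf-gpu-sample.conf}"

cat > "$SAMPLE_CONF" <<'EOF'
[agent]
  # How often inputs are collected and how often outputs are flushed.
  interval = "10s"
  flush_interval = "10s"

# Collect GPU statistics by calling the nvidia-smi binary.
[[inputs.nvidia_smi]]
  bin_path = "/usr/bin/nvidia-smi"
  timeout = "5s"

# Send metrics to Azure Monitor. On an Azure VM with a managed identity,
# the region and resource ID are discovered from the instance metadata.
[[outputs.azure_monitor]]
  timeout = "20s"
  namespace_prefix = "Telegraf/"
EOF

echo "Wrote sample configuration to $SAMPLE_CONF"
```

Writing to a scratch path keeps the sketch from overwriting the configuration that the setup script installs.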

Test your Telegraf configuration by running the following command:

sudo telegraf --config /etc/telegraf/telegraf.conf --test
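
The test output includes every configured input, so the GPU lines can be easy to miss. The following sketch narrows the run to the nvidia_smi input and reports whether any GPU metrics came back. The function name is hypothetical, and it assumes test mode prints each metric as a line starting with `> ` followed by the measurement name. Run it with sufficient privileges to read the configuration file.

```shell
# Run Telegraf once in test mode for the nvidia_smi input only and report
# whether any GPU metric lines were produced.
# $1 = config path (defaults to the path used in this post)
check_gpu_metrics() {
  conf="${1:-/etc/telegraf/telegraf.conf}"
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  count=$(telegraf --config "$conf" --input-filter nvidia_smi --test 2>/dev/null \
    | grep -c '^> nvidia_smi' || true)
  if [ "${count:-0}" -gt 0 ]; then
    echo "OK: $count nvidia_smi metric line(s) collected"
  else
    echo "No nvidia_smi metrics collected; check the driver and the plugin configuration" >&2
    return 1
  fi
}

# Example:
#   check_gpu_metrics /etc/telegraf/telegraf.conf
```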

Step 3: Create a dashboard in Azure Monitor to check NVIDIA GPU usage

Telegraf includes an output plugin designed specifically for Azure Monitor, allowing users to send custom metrics directly to the platform. Azure Monitor operates with a metric resolution of 1 minute, so the Telegraf output plugin automatically aggregates metrics into 1-minute buckets and sends them to Azure Monitor at each flush interval. Metrics from each input plugin are written to a separate Azure Monitor namespace, prefixed by default with “Telegraf/” for easy identification.
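
To make the 1-minute bucketing concrete, here is a small illustration of the idea, not the plugin's actual code. It assumes each bucket is summarized as min, max, sum, and count, and it takes lines of `<unix-seconds> <value>` on standard input. The function name and input format are made up for this example.

```shell
# Group "<unix-seconds> <value>" samples into 1-minute buckets and print
# min, max, sum and count for each bucket, in order of first appearance.
aggregate_per_minute() {
  awk '{
    v = $2 + 0
    bucket = $1 - ($1 % 60)          # start of the minute this sample falls in
    if (!(bucket in count)) {
      min[bucket] = v; max[bucket] = v; order[++n] = bucket
    }
    if (v < min[bucket]) min[bucket] = v
    if (v > max[bucket]) max[bucket] = v
    sum[bucket] += v
    count[bucket]++
  }
  END {
    for (i = 1; i <= n; i++) {
      b = order[i]
      printf "%d min=%s max=%s sum=%s count=%d\n", b, min[b], max[b], sum[b], count[b]
    }
  }'
}

# Example: three utilization samples, two in the first minute, one in the second.
#   printf '0 10\n10 30\n65 50\n' | aggregate_per_minute
```

With a 10-second collection interval, each bucket would typically summarize about six samples per metric.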

To visualize your NVIDIA GPU usage, go to the Metrics section in the Azure portal. Select the VM name as the scope, then select the metric namespace `Telegraf/nvidia-smi`. Here you can select various metrics to check your NVIDIA GPU utilization. You can also apply filters and splits to further analyze your data.
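
The same data can be inspected from the command line. This is a sketch using the Azure CLI, assuming `az` is installed and logged in. The function names are hypothetical, and the default namespace and the metric name in the usage comment are assumptions, so use the exact names that the portal or the definitions listing shows for your VM.

```shell
# List the metric definitions Azure Monitor has received in a namespace.
# $1 = full resource ID of the VM or VMSS, $2 = metric namespace (assumed default)
list_gpu_metric_definitions() {
  az monitor metrics list-definitions \
    --resource "$1" \
    --namespace "${2:-Telegraf/nvidia-smi}" \
    --output table
}

# Fetch one metric at 1-minute granularity, averaged per bucket.
# $1 = resource ID, $2 = metric name, $3 = metric namespace (assumed default)
query_gpu_metric() {
  az monitor metrics list \
    --resource "$1" \
    --namespace "${3:-Telegraf/nvidia-smi}" \
    --metrics "$2" \
    --interval PT1M \
    --aggregation Average \
    --output table
}

# Example (hypothetical resource ID, assumed metric name):
#   vm_id=$(az vm show --resource-group my-gpu-rg --name my-nd96-vm --query id --output tsv)
#   list_gpu_metric_definitions "$vm_id"
#   query_gpu_metric "$vm_id" utilization_gpu
```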


You can create GPU monitoring dashboards for both VMs and VMSS. Here are some sample charts to consider:

[Images: sample GPU monitoring dashboard charts]

Bonus: Simulate GPU usage using a sample training program.

If you are testing and have no program at hand to simulate GPU usage, I’ve got you covered! I created a script that runs a multi-GPU distributed training model. The script installs Anaconda and sets up the environment needed to run distributed training with TensorFlow. Running it generates real GPU load, so you can check the monitoring metrics you have set up.

To get started, run the following command:

wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpu_test_program.sh -O gpu_test_program.sh
chmod +x gpu_test_program.sh
./gpu_test_program.sh
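
While the training program runs, you may want a quick local confirmation that the GPUs are busy before looking at Azure Monitor. The following sketch filters `nvidia-smi` CSV output for GPUs at or above a utilization threshold. The function name `busy_gpus` and the threshold of 50 percent are arbitrary choices for illustration.

```shell
# Read "index, utilization" CSV rows on stdin and print the index of every
# GPU whose utilization (percent) is at or above the threshold in $1.
busy_gpus() {
  awk -F', *' -v min="${1:-50}" '$2 + 0 >= min + 0 { print $1 }'
}

# Only query the driver when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | busy_gpus 50
fi
```

An empty result means no GPU has reached the threshold yet, which is expected during the environment setup phase of the test script.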

I hope you found this blog post helpful. With the right tools and insights, you can unlock the full potential of your GPU resources. Happy reading!

References:

https://azuremarketplace.microsoft.com/en-gb/marketplace/apps/microsoft-dsvm.ubuntu-hpc?tab=PlansAnd…

https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndasra100v4-series?ta…

telegraf/plugins/outputs/azure_monitor · influxdata/telegraf in release-1.15 (github.com)

telegraf/plugins/inputs/nvidia_smi in release-1.15 · influxdata/telegraf (github.com)




