Running tightly coupled HPC/AI workloads with InfiniBand using NVIDIA Network Operator on AKS

by info.odysseyx@gmail.com, October 13, 2024

Dr. Kai Neuffer – Principal Program Manager, Industry and Partner Sales – Energy Industry
Dr. Wolfgang De Salvador – Senior Product Manager – Azure Storage
Paul Edwards – Principal Technical Program Manager – Azure Core HPC & AI

Acknowledgments

We would like to express our gratitude to Cormac Garvey for his previous contributions to the NVIDIA network-operator and GPU-operator, which inspired this article.

Resources and references used in this article:

Introduction

We increasingly see AKS gaining share as an orchestration solution for HPC/AI workloads. The drivers behind this trend are multiple: the progressive move toward containerization of the HPC/AI software stacks, the ease of management and the universal nature of Kubernetes APIs.

The focus of this blog post is to provide a guide for getting an AKS cluster InfiniBand enabled, with the possibility of having HCAs or IPoIB available inside Kubernetes Pods as cluster resources. Several methodologies and articles have provided insights on the topic, as well as the official documentation of NVIDIA Network Operator. The purpose of this article is to organize and harmonize these different experiences while proposing a deployment model that is closer to the most maintained and standard way of enabling InfiniBand on a cluster: using NVIDIA Network Operator.

Of course, this is only the first step toward having an AKS cluster HPC/AI ready. This article is meant to work in tandem with the blog post for NVIDIA GPU operator. In a similar way, a proper HPC/AI AKS cluster will require an adequate job scheduler like Kueue or Volcano to properly handle multi-node jobs and allow smooth interaction in parallel processing.
This is out of the scope of the current blog post, but references and examples can be found in the already mentioned repository related to HPC on AKS or in the article on running workloads on NDv4 GPUs on AKS.

Getting the basics up and running

In this section we describe how to deploy a vanilla dev/testing cluster on which the content of this article can be deployed for demonstration. If you already have an AKS cluster with InfiniBand-enabled nodes, you can skip this section.

Deploying a vanilla AKS cluster

The standard way of deploying a vanilla AKS cluster is to follow the procedure described in the Azure documentation. Please be aware that this command will create an AKS cluster with:

- Kubenet as Network CNI
- A public endpoint for the AKS cluster
- Local accounts with Kubernetes RBAC

In general, for production workloads we strongly recommend reviewing the main security concepts for AKS clusters:

- Use Azure CNI
- Evaluate using a Private AKS Cluster to limit API exposure to the public internet
- Evaluate using Azure RBAC with Entra ID accounts or Kubernetes RBAC with Entra ID accounts

This is out of scope for the present demo, but please be aware that this cluster is meant for NVIDIA Network Operator demo purposes only.

Using Azure CLI we can create an AKS cluster with this procedure (set the empty variables below to your preferred values):

    export RESOURCE_GROUP_NAME=
    export AKS_CLUSTER_NAME=
    export LOCATION=

    ## Following line to be used only if Resource Group is not available
    az group create --resource-group $RESOURCE_GROUP_NAME --location $LOCATION

    az aks create --resource-group $RESOURCE_GROUP_NAME --name $AKS_CLUSTER_NAME --node-count 2 --generate-ssh-keys

Connecting to the cluster

Several ways to connect to the AKS cluster are documented in the Azure documentation. In this part of the article we use a VM running Ubuntu 22.04 with the Azure CLI installed to deploy the cluster.
We authenticate to Azure with the az login command and install the Kubernetes client with az aks install-cli (be aware that the login command may require --tenant if you have access to multiple tenants). Moreover, make sure you are using the right subscription by checking with az account show:

    ## Add --tenant in case of multiple tenants
    ## Add --identity in case of using a managed identity on the VM
    az login
    az aks install-cli
    az aks get-credentials --resource-group $RESOURCE_GROUP_NAME --name $AKS_CLUSTER_NAME

After this is completed, you should be able to run standard kubectl commands like:

    kubectl get nodes

    root@aks-gpu-playground-rg-jumpbox:~# kubectl get nodes
    NAME                                STATUS   ROLES   AGE     VERSION
    aks-agentpool-11280554-vmss00001a   Ready    agent   9m40s   v1.27.7
    aks-agentpool-11280554-vmss00001b   Ready    agent   9m47s   v1.27.7

The command line is perfectly fine for all the operations in this blog post. However, if you would like a TUI experience, we suggest using k9s, which can be easily installed on Linux following the installation instructions. For Ubuntu, you can install the version that was current at the time of writing with:

    wget "https://github.com/derailed/k9s/releases/download/v0.32.5/k9s_linux_amd64.deb"
    dpkg -i k9s_linux_amd64.deb

k9s makes it easy to interact with the different resources of the AKS cluster directly from a terminal user interface. It can be launched with the k9s command. Detailed documentation on how to navigate the different resources (Pods, DaemonSets, Nodes) can be found on the official k9s documentation page.

Attaching an Azure Container Registry to the Azure Kubernetes Cluster

To build our own Docker images and run them on the cluster, it is convenient to create a private Azure Container Registry and attach it to the AKS cluster.
This can be done in the following way:

    export ACR_NAME=
    az acr create --resource-group $RESOURCE_GROUP_NAME \
      --name $ACR_NAME --sku Basic
    az aks update --name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME --attach-acr $ACR_NAME

You will need to be able to perform pull and push operations on this Container Registry through Docker. To push new containers into the registry you need to log in first, which you can do using the az acr login command:

    az acr login --name $ACR_NAME

Attaching the container registry to the AKS cluster ensures that the system-managed identity of the AKS nodes is allowed to pull containers.

About taints for AKS nodes

It is important to have a solid understanding of taints and tolerations for Spot and GPU nodes in AKS. In the case of Spot instances, AKS applies the following taint:

    kubernetes.azure.com/scalesetpriority=spot:NoSchedule

We will show later that we have to add some tolerations to get the NVIDIA Network Operator running on Spot instances.

Creating the first IB pool

By default, the AKS cluster created so far has only one node pool with 2 nodes of Standard_D2s_v2 VMs. It is now time to add the first InfiniBand-enabled node pool. This can be done using Azure Cloud Shell, for example using a Standard_HB120rs_v3 with autoscaling enabled.
We are setting the minimum number of nodes to 2 for the subsequent tests, so please remember to downscale the pool by setting the minimum node count to 0 once completed, to avoid unnecessary costs:

    az aks nodepool add \
      --resource-group $RESOURCE_GROUP_NAME \
      --cluster-name $AKS_CLUSTER_NAME \
      --name hb120v3 \
      --node-vm-size Standard_HB120rs_v3 \
      --enable-cluster-autoscaler \
      --min-count 2 --max-count 2 --node-count 2

To deploy in Spot mode, the following flags should be added to the Azure CLI command:

    --priority Spot --eviction-policy Delete --spot-max-price -1

Once the deployment has finished, we can see the nodes with the kubectl get nodes command:

    kubectl get nodes
    NAME                                STATUS   ROLES    AGE    VERSION
    aks-hb120v3-16191816-vmss000000     Ready    <none>   51s    v1.29.8
    aks-hb120v3-16191816-vmss000001     Ready    <none>   51s    v1.29.8
    aks-nodepool1-28303986-vmss000000   Ready    <none>   101m   v1.29.8
    aks-nodepool1-28303986-vmss000001   Ready    <none>   101m   v1.29.8

Installing the Network Operator

Installing Helm

On a machine with kubectl configured and with the context set up above for connecting to the AKS cluster, we run the following to install Helm if it is not installed already:

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
      && chmod 700 get_helm.sh \
      && ./get_helm.sh

Installing Node Feature Discovery

On Azure, since most of the nodes have a ConnectX-4 network card for accelerated networking, it is important to fine-tune the node feature recognition. Moreover, AKS nodes may have special taints that need to be tolerated by the Node Feature Discovery daemons. Because of this we install Node Feature Discovery separately from NVIDIA Network Operator.

NVIDIA Network Operator will act on the nodes with the label feature.node.kubernetes.io/pci-15b3.present. Moreover, it is important to tune the node discovery plugin so that it is scheduled even on Spot instances of the Kubernetes cluster.
Here we also introduce the MIG toleration and the GPU toleration to ensure compatibility with the NVIDIA GPU Operator article.

    helm install --wait --create-namespace -n network-operator node-feature-discovery node-feature-discovery \
      --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts \
      --set-json master.config.extraLabelNs='["nvidia.com"]' \
      --set-json worker.tolerations='[{ "effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu"},{"effect": "NoSchedule", "key": "kubernetes.azure.com/scalesetpriority", "value":"spot", "operator": "Equal"},{"effect": "NoSchedule", "key": "mig", "value":"notReady", "operator": "Equal"}]'

After enabling Node Feature Discovery, it is important to create a custom rule that precisely matches the ConnectX-6 and ConnectX-7 cards available on the most recent Azure nodes. This can be done by creating a file called nfd-network-rule.yaml containing the following:

    ---
    apiVersion: nfd.k8s-sigs.io/v1alpha1
    kind: NodeFeatureRule
    metadata:
      name: nfd-network-rule
    spec:
      rules:
        - name: "nfd-network-rule"
          labels:
            "feature.node.kubernetes.io/pci-15b3.present": "true"
          matchFeatures:
            - feature: pci.device
              matchExpressions:
                device: {op: In, value: ["101c", "101e"]}

After this file is created, we apply it to the AKS cluster:

    kubectl apply -n network-operator -f nfd-network-rule.yaml

After a few seconds the Node Feature Discovery will label ONLY the HBv3 nodes.
This can be checked with:

    kubectl get nodes
    NAME                                STATUS   ROLES    AGE    VERSION
    aks-hb120v3-16191816-vmss000000     Ready    <none>   51s    v1.29.8
    aks-hb120v3-16191816-vmss000001     Ready    <none>   51s    v1.29.8
    aks-nodepool1-28303986-vmss000000   Ready    <none>   101m   v1.29.8
    aks-nodepool1-28303986-vmss000001   Ready    <none>   101m   v1.29.8

    kubectl describe nodes aks-hb120v3-16191816-vmss000001 | grep present
        feature.node.kubernetes.io/pci-15b3.present=true

    kubectl describe nodes aks-nodepool1-28303986-vmss000000 | grep present

The label should not be found on the agent node pool without InfiniBand (since its nodes are of type Standard_D2s_v2).

Installing and configuring the NVIDIA Network Operator helm chart

We add the NVIDIA Helm repository, which contains the Network Operator chart, and update our local repository index:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update

Then we generate a configuration template file with the name values.yaml. For a quick view of all the available values we can type:

    helm show values nvidia/network-operator --version v24.7.0

The values.yaml file is divided into several sections that configure the different components of the network operator.

The SR-IOV Network Operator is relevant for OpenShift environments and requires access to the InfiniBand subnet manager, which is not available on Azure. This is why we disabled it.

For this article we modify the values.yaml file to enable the components we need. Our proposed values.yaml becomes:

    nfd:
      # -- Deploy Node Feature Discovery operator.
      enabled: false
      deployNodeFeatureRules: false

    # -- Enable CRDs upgrade with helm pre-install and pre-upgrade hooks.
    upgradeCRDs: true

    operator:
      tolerations: [{ "effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu"},{"effect": "NoSchedule", "key": "kubernetes.azure.com/scalesetpriority", "value":"spot", "operator": "Equal"},{"effect": "NoSchedule", "key": "mig", "value":"notReady", "operator": "Equal"}]

    sriovNetworkOperator:
      # -- Deploy SR-IOV Network Operator.
      enabled: false

    deployCR: true

    ofedDriver:
      # -- Deploy the NVIDIA DOCA Driver driver container.
      deploy: true

    rdmaSharedDevicePlugin:
      # -- Deploy RDMA shared device plugin.
      deploy: true
      useCdi: true

    secondaryNetwork:
      # -- Deploy Secondary Network.
      deploy: true
      cniPlugins:
        # -- Deploy CNI Plugins Secondary Network.
        deploy: true
      multus:
        # -- Deploy Multus Secondary Network.
        deploy: true
      ipoib:
        # -- Deploy IPoIB CNI.
        deploy: true
      ipamPlugin:
        # -- Deploy IPAM CNI Plugin Secondary Network.
        deploy: true

    nicFeatureDiscover:
      deploy: false

    tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule
      - key: mig
        operator: Equal
        value: notReady
        effect: NoSchedule

Once we have made all modifications to the values.yaml file, we can deploy the helm chart:

    helm install network-operator nvidia/network-operator \
      -n nvidia-network-operator \
      --create-namespace \
      --version v24.7.0 \
      -f ./values.yaml \
      --wait

After a few minutes we should see the following pods running on the compute nodes:

    kubectl get pods -n nvidia-network-operator --field-selector spec.nodeName=aks-hb120v3-16191816-vmss000001
    NAME                                    READY   STATUS    RESTARTS   AGE
    cni-plugins-ds-f4kcj                    1/1     Running   0          2m50s
    kube-ipoib-cni-ds-blfxf                 1/1     Running   0          2m50s
    kube-multus-ds-45hzt                    1/1     Running   0          2m50s
    mofed-ubuntu22.04-659dcf4b88-ds-7m462   0/1     Running   0          2m7s
    nv-ipam-node-fpht4                      1/1     Running   0          2m14s
    whereabouts-78z5j                       1/1     Running   0          2m50s

Further, we can check whether the RDMA shared device plugin has published the HCAs as a new resource:

    kubectl describe node aks-hb120v3-16191816-vmss000001 | grep rdma/rdma_shared_device_a:
      rdma/rdma_shared_device_a:  63
      rdma/rdma_shared_device_a:  63

The first line (the node's capacity) shows the number of configured Host Channel Adapters (HCAs) that can be used by pods running on the node.
The second entry shows how many of them are allocatable. Both values are the same here; the number currently requested by pods is listed separately under Allocated resources in the kubectl describe node output. The maximum number of HCAs can be set in the values.yaml configuration file by modifying the rdmaHcaMax setting within the rdmaSharedDevicePlugin section:

    ...
    rdmaSharedDevicePlugin:
      deploy: true
      image: k8s-rdma-shared-dev-plugin
      repository: ghcr.io/mellanox
      version: sha-fe7f371c7e1b8315bf900f71cd25cfc1251dc775
      useCdi: true
      ...
      resources:
        - name: rdma_shared_device_a
          vendors: [15b3]
          rdmaHcaMax: 63
    ...

We can also check that the IPoIB module has been loaded on the compute node and that the corresponding network interface exists. To do so, we use the kubectl debug command to start a busybox container on the node and connect to it:

    kubectl debug node/aks-hb120v3-16191816-vmss000001 -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0

Once on the node, we run chroot to switch to the host file system and sudo to become root:

    chroot /host
    sudo su -

Then we can run the following commands to check that the IPoIB module is loaded and the network interface is configured:

    lsmod | grep ipoib
    ib_ipoib              135168  0
    ib_cm                 131072  2 rdma_cm,ib_ipoib
    ib_core               409600  8 rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
    mlx_compat             69632  11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core

    ip a
    ...
    8: ibP257s79720: mtu 2044 qdisc mq state UP group default qlen 256
        link/infiniband 00:00:01:48:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:1a brd 00:ff:ff:ff:ff:12:40:1b:80:06:00:00:00:00:00:00:ff:ff:ff:ff
        altname ibP257p0s0
        inet6 fe80::215:5dff:fd33:ff1a/64 scope link
           valid_lft forever preferred_lft forever

The name of the IPoIB interface, ibP257s79720, is not consistent across nodes of the same VM SKU, but the altname ibP257p0s0 is. This will be important when we configure the IPoIB CNI later.
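Since the altname is the stable identifier, it can be handy to extract it directly instead of reading through the full ip a output. The following is a minimal sketch, not part of the original procedure: the helper name ipoib_altnames is hypothetical, and it assumes the ip a layout shown above, where the altname line follows the link/infiniband line of the same interface.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the altname of every InfiniBand interface
# found in `ip a`-style text read from stdin (one altname per line).
ipoib_altnames() {
  awk '
    /^[0-9]+: /                { in_ib = 0 }   # a new interface block starts
    /link\/infiniband/         { in_ib = 1 }   # this block is an InfiniBand link
    in_ib && $1 == "altname"   { print $2 }
  '
}

# Demo on the excerpt shown in this article.
ipoib_altnames <<'EOF'
8: ibP257s79720: mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:01:48:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:1a brd 00:ff:ff:ff:ff:12:40:1b:80:06:00:00:00:00:00:00:ff:ff:ff:ff
    altname ibP257p0s0
    inet6 fe80::215:5dff:fd33:ff1a/64 scope link
       valid_lft forever preferred_lft forever
EOF
# prints: ibP257p0s0
```

On a real node the function would be fed with live output, for example ip a | ipoib_altnames, assuming awk is available on the host.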
About the OFED driver

The AKS Ubuntu 22.04 node image comes with the inbox InfiniBand drivers, which are usually relatively old and lack some features. Therefore, the NVIDIA Network Operator uses a driver container to load the most recent DOCA-OFED modules into the OS kernel of the hosts. To make sure that the modules match the kernel version of the host, the entrypoint script of the container compiles the right DOCA-OFED version before loading the modules. This process takes a few minutes and slows down the spin-up of new nodes within the node pool. This could be addressed by using a custom DOCA-OFED container that already contains the modules for the kernel version of the AKS nodes.

Configuring the IPoIB CNI

The IPoIB CNI plugin allows us to create an IPoIB child link on each Pod that runs on the node. The number of these links is limited to the number of HCAs configured in the values.yaml configuration file, as discussed previously.

To enable the IPoIB CNI we create the following ipoib_network.yaml file:

    apiVersion: mellanox.com/v1alpha1
    kind: IPoIBNetwork
    metadata:
      name: example-ipoibnetwork
    spec:
      networkNamespace: "default"
      master: "ibP257p0s0"
      ipam: |
        {
          "type": "whereabouts",
          "datastore": "kubernetes",
          "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
          },
          "range": "192.168.5.225/28",
          "exclude": [
            "192.168.6.229/30",
            "192.168.6.236/32"
          ],
          "log_file" : "/var/log/whereabouts.log",
          "log_level" : "info",
          "gateway": "192.168.6.1"
        }

The configuration contains the IPoIB interface name on the AKS host as the value of the master key, and the IP range for the IPoIB subnet. We apply the configuration by running:

    kubectl create -f ipoib_network.yaml

Building a sample image for testing IPoIB and RDMA connectivity

To test RDMA connectivity with the InfiniBand utilities, we need a Docker image in which DOCA OFED is installed with the userspace tools.
An example of such an image could have this Dockerfile:

    FROM ubuntu:22.04
    ENV DOCA_URL="https://www.mellanox.com/downloads/DOCA/DOCA_v2.8.0/host/doca-host_2.8.0-204000-24.07-ubuntu2204_amd64.deb"
    ENV DOCA_SHA256="289a3e00f676032b52afb1aab5f19d2a672bcca782daf9d30ade0b59975af582"
    RUN apt-get update
    RUN apt-get install wget git -y
    WORKDIR /root
    RUN wget $DOCA_URL
    RUN echo "$DOCA_SHA256  $(basename ${DOCA_URL})" | sha256sum --check --status
    RUN dpkg -i $(basename ${DOCA_URL})
    RUN apt-get update
    RUN apt-get install doca-ofed-userspace -y

After creating this Dockerfile, we can build the image and push it to the ACR created at the beginning of the blog post:

    az acr login -n $ACR_NAME
    docker build . -t $ACR_NAME.azurecr.io/ibtest
    docker push $ACR_NAME.azurecr.io/ibtest

Testing IPoIB and RDMA connectivity

We will create three pods and spread them over two AKS compute nodes to demonstrate IPoIB and RDMA connectivity. To do so, we need to create the pods with the following resource requests and settings:

- Define the number of required CPU cores in such a way that two pods fit on a single host but the third one is scheduled on a second one, e.g. cpu: 50.
- Request one HCA per pod: rdma/rdma_shared_device_a: 1.
- Run the pods in privileged: true mode to be able to access the RDMA InfiniBand interface.

The last point is only required if we want to use the InfiniBand interface for RDMA workloads such as MPI, but not if we only want to use IPoIB network communication.
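The choice of cpu: 50 in the first point can be checked with a little integer arithmetic. The sketch below is only an illustration: the helper names pods_per_node and nodes_needed are hypothetical, and the allocatable CPU count of 119 is an assumption (the 120 vCPUs of the VM minus a small AKS reservation).

```bash
#!/usr/bin/env bash
# Hypothetical helper: how many pods with a given CPU request fit on one
# node (integer division of allocatable CPUs by the per-pod request).
pods_per_node() {
  local allocatable_cpus=$1 cpu_request=$2
  echo $(( allocatable_cpus / cpu_request ))
}

# Hypothetical helper: nodes needed to place a number of pods, given how
# many pods fit on one node (ceiling division).
nodes_needed() {
  local pods=$1 per_node=$2
  if [ "$per_node" -le 0 ]; then
    echo "request does not fit on a node" >&2
    return 1
  fi
  echo $(( (pods + per_node - 1) / per_node ))
}

ALLOCATABLE_CPUS=119   # assumption, not a measured value
per_node=$(pods_per_node "$ALLOCATABLE_CPUS" 50)
echo "pods per node: $per_node"
echo "nodes for 3 pods: $(nodes_needed 3 "$per_node")"
```

With these assumed numbers, two pods fit on one node and three pods need two nodes, which is the placement the test relies on. The actual allocatable value of a node can be read with kubectl get node followed by the node name and -o jsonpath='{.status.allocatable.cpu}'.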
To test the connectivity we create the following ibtestpod.yaml (replace <ACR_NAME> in the image name with your ACR_NAME):

    apiVersion: v1
    kind: Pod
    metadata:
      name: ipoib-test-pod-1
      annotations:
        k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: NoSchedule
      containers:
        - image: <ACR_NAME>.azurecr.io/ibtest
          name: rdma-test-ctr
          securityContext:
            capabilities:
              add: [ "IPC_LOCK" ]
            privileged: true
          resources:
            requests:
              cpu: 50
              rdma/rdma_shared_device_a: 1
            limits:
              cpu: 50
              rdma/rdma_shared_device_a: 1
          command:
            - sh
            - -c
            - |
              sleep inf
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: ipoib-test-pod-2
      annotations:
        k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: NoSchedule
      containers:
        - image: <ACR_NAME>.azurecr.io/ibtest
          name: rdma-test-ctr
          securityContext:
            capabilities:
              add: [ "IPC_LOCK" ]
            privileged: true
          resources:
            requests:
              cpu: 50
              rdma/rdma_shared_device_a: 1
            limits:
              cpu: 50
              rdma/rdma_shared_device_a: 1
          command:
            - sh
            - -c
            - |
              sleep inf
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: ipoib-test-pod-3
      annotations:
        k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: NoSchedule
      containers:
        - image: <ACR_NAME>.azurecr.io/ibtest
          name: rdma-test-ctr
          securityContext:
            capabilities:
              add: [ "IPC_LOCK" ]
            privileged: true
          resources:
            requests:
              cpu: 50
              rdma/rdma_shared_device_a: 1
            limits:
              cpu: 50
              rdma/rdma_shared_device_a: 1
          command:
            - sh
            - -c
            - |
              sleep inf

We start the pods by executing:

    kubectl apply -f ibtestpod.yaml
    pod/ipoib-test-pod-1 created
    pod/ipoib-test-pod-2 created
    pod/ipoib-test-pod-3 created

We can check how the pods are distributed over the two nodes (the placement might differ in your setup):

    kubectl get pods --output 'jsonpath={range .items[*]}{.spec.nodeName}{" "}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}'
    aks-hb120v3-16191816-vmss000000 default ipoib-test-pod-1
    aks-hb120v3-16191816-vmss000001 default ipoib-test-pod-2
    aks-hb120v3-16191816-vmss000001 default ipoib-test-pod-3

By executing the following commands we can get the IPoIB IP addresses of the pods:

    kubectl describe pod ipoib-test-pod-1 | grep multus | grep ipoib
      Normal  AddedInterface  35s  multus  Add net1 [192.168.5.225/28] from default/example-ipoibnetwork
    kubectl describe pod ipoib-test-pod-2 | grep multus | grep ipoib
      Normal  AddedInterface  38s  multus  Add net1 [192.168.5.226/28] from default/example-ipoibnetwork
    kubectl describe pod ipoib-test-pod-3 | grep multus | grep ipoib
      Normal  AddedInterface  41s  multus  Add net1 [192.168.5.227/28] from default/example-ipoibnetwork

We see that pod ipoib-test-pod-1, which runs on the first AKS node, has the IPoIB IP address 192.168.5.225, while pod ipoib-test-pod-2 has the IPoIB IP address 192.168.5.226 and pod ipoib-test-pod-3 has 192.168.5.227. The latter two both run on the second AKS node.
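The three addresses above all fall inside the whereabouts range 192.168.5.225/28 configured in ipoib_network.yaml. As a worked example, the bounds of that /28 block can be computed with plain bash arithmetic. This is a minimal sketch; the function names ip_to_int, int_to_ip and cidr_bounds are hypothetical helpers, not part of any tool used in this article.

```bash
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo "$(( (a << 24) | (b << 16) | (c << 8) | d ))"
}

# Convert a 32-bit integer back to dotted-quad notation.
int_to_ip() {
  local n=$1
  echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
}

# Print "<network address> <broadcast address>" for a CIDR block.
cidr_bounds() {
  local addr=${1%/*} prefix=${1#*/}
  local ip mask net bcast
  ip=$(ip_to_int "$addr")
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  net=$(( ip & mask ))
  bcast=$(( net | (~mask & 0xFFFFFFFF) ))
  echo "$(int_to_ip "$net") $(int_to_ip "$bcast")"
}

cidr_bounds "192.168.5.225/28"
# prints: 192.168.5.224 192.168.5.239
```

A /28 block holds 16 addresses, so the block containing 192.168.5.225 spans 192.168.5.224 to 192.168.5.239, and the pods received the first addresses from the configured starting point onward.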
To test the intra-node and inter-node IPoIB connectivity we connect to ipoib-test-pod-2 and ping the other two pods:

    kubectl exec -it ipoib-test-pod-2 -- /bin/sh
    / # ping 192.168.5.225 -c 3
    PING 192.168.5.225 (192.168.5.225): 56 data bytes
    64 bytes from 192.168.5.225: seq=0 ttl=64 time=1.352 ms
    64 bytes from 192.168.5.225: seq=1 ttl=64 time=0.221 ms
    64 bytes from 192.168.5.225: seq=2 ttl=64 time=0.281 ms

    --- 192.168.5.225 ping statistics ---
    3 packets transmitted, 3 packets received, 0% packet loss
    round-trip min/avg/max = 0.221/0.618/1.352 ms
    / # ping 192.168.5.227 -c 3
    PING 192.168.5.227 (192.168.5.227): 56 data bytes
    64 bytes from 192.168.5.227: seq=0 ttl=64 time=1.171 ms
    64 bytes from 192.168.5.227: seq=1 ttl=64 time=0.255 ms
    64 bytes from 192.168.5.227: seq=2 ttl=64 time=0.290 ms

    --- 192.168.5.227 ping statistics ---
    3 packets transmitted, 3 packets received, 0% packet loss
    round-trip min/avg/max = 0.255/0.572/1.171 ms

We also confirm that the IP address is 192.168.5.226 by running:

    / # ip a | grep net1
    12: net1@if8: mtu 2044 qdisc noqueue state UP
        inet 192.168.5.226/28 brd 192.168.5.239 scope global net1

Note that the name of the IPoIB link device is net1@if8.

To test RDMA we start ib_read_lat on ipoib-test-pod-2:

    # ib_read_lat

    ************************************
    * Waiting for client to connect... *
    ************************************

Then we open another shell, connect to ipoib-test-pod-1 and run ib_read_lat 192.168.5.226, which is the IP address of ipoib-test-pod-2:

    kubectl exec -it ipoib-test-pod-1 -- /bin/sh
    # ib_read_lat 192.168.5.226
    ---------------------------------------------------------------------------------------
                        RDMA_Read Latency Test
     Dual-port       : OFF          Device         : mlx5_0
     Number of qps   : 1            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     TX depth        : 1
     Mtu             : 4096[B]
     Link type       : IB
     Outstand reads  : 16
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x1b7 QPN 0x0126 PSN 0xc5305e OUT 0x10 RKey 0x04070c VAddr 0x0056453ff00000
     remote address: LID 0x1c1 QPN 0x0126 PSN 0x18a616 OUT 0x10 RKey 0x04060a VAddr 0x005607b90aa000
    ---------------------------------------------------------------------------------------
     #bytes #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
    Conflicting CPU frequency values detected: 1846.554000 != 3463.693000. CPU Frequency is not max.
    Conflicting CPU frequency values detected: 1846.554000 != 3510.158000. CPU Frequency is not max.
     2      1000         3.80         5.95         3.83             3.84         0.00           3.93                  5.95
    ---------------------------------------------------------------------------------------

This shows that there is RDMA connectivity between ipoib-test-pod-1 and ipoib-test-pod-2, which run on different AKS nodes.

Then we go back to pod ipoib-test-pod-2 and start the ib_read_lat server again:

    [root@ipoib-test-pod-2 /]# ib_read_lat

    ************************************
    * Waiting for client to connect... *
    ************************************

Then we open another shell, connect to ipoib-test-pod-3 and run ib_read_lat 192.168.5.226 again:

    kubectl exec -it ipoib-test-pod-3 -- /bin/sh
    # ib_read_lat 192.168.5.226
    ---------------------------------------------------------------------------------------
                        RDMA_Read Latency Test
     Dual-port       : OFF          Device         : mlx5_0
     Number of qps   : 1            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     TX depth        : 1
     Mtu             : 4096[B]
     Link type       : IB
     Outstand reads  : 16
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x1b7 QPN 0x0127 PSN 0x99bde7 OUT 0x10 RKey 0x040700 VAddr 0x005624407f9000
     remote address: LID 0x1c1 QPN 0x0127 PSN 0x9539df OUT 0x10 RKey 0x040600 VAddr 0x00563b8d000000
    ---------------------------------------------------------------------------------------
     #bytes #iterations  t_min[usec]  t_max[usec]  t_typical[usec]  t_avg[usec]  t_stdev[usec]  99% percentile[usec]  99.9% percentile[usec]
    Conflicting CPU frequency values detected: 1846.554000 != 3512.949000. CPU Frequency is not max.
    Conflicting CPU frequency values detected: 1846.554000 != 2089.386000. CPU Frequency is not max.
     2      1000         3.79         7.03         3.82             3.82         0.00           3.87                  7.03
    ---------------------------------------------------------------------------------------

Now both of these pods run on the same AKS node, which is reflected in the slightly lower latency.

Let us clean up by deleting the pods:

    kubectl delete -f ibtestpod.yaml
    pod "ipoib-test-pod-1" deleted
    pod "ipoib-test-pod-2" deleted
    pod "ipoib-test-pod-3" deleted

Uninstalling the NVIDIA Network Operator

To apply a different configuration, we might want to uninstall the operator by uninstalling the Helm release before we install a new version:

    helm uninstall network-operator -n nvidia-network-operator

Downscaling the node pool

To allow the node pool to autoscale down to zero, use the following Azure CLI command. This is important to avoid any unwanted cost:

    az aks nodepool update \
      --resource-group $RESOURCE_GROUP_NAME \
      --cluster-name $AKS_CLUSTER_NAME \
      --name hb120v3 \
      --update-cluster-autoscaler \
      --min-count 0 \
      --max-count 2

Conclusions

This article is meant to provide low-level insight into how to configure InfiniBand using NVIDIA Network Operator on an AKS cluster. Of course, this is only the first step toward having an AKS cluster HPC/AI ready. This article is meant to work in tandem with the blog post for NVIDIA GPU operator.
In a similar way, a proper HPC/AI AKS cluster will require an adequate job scheduler like Kueue or Volcano to properly handle multi-node jobs and allow smooth interaction in parallel processing. This is out of the scope of the current blog post, but references and examples can be found in the already mentioned repository related to HPC on AKS or in the article on running workloads on NDv4 GPUs on AKS.