AKS Networking for Data-Intensive Kubernetes Workloads

Generative AI and cloud computing have transformed the way organizations build and manage their infrastructure. As more data-intensive workloads are deployed, especially on Kubernetes-based compute, the demands on networking infrastructure have never been greater. This is especially true in high-performance computing (HPC) environments, where securely training advanced models on Kubernetes is paramount. The scalability and flexibility of the ecosystem make Kubernetes the preferred choice for managing complex workloads, but they also bring unique networking challenges that must be addressed. This blog series provides practical insights and strategies for building secure and scalable Kubernetes clusters on Azure infrastructure.

Networking Requirements for HPC and AI Workloads

High-performance computing and AI workloads, such as large language model (LLM) training, require networking platforms with high input/output (I/O) capabilities. These platforms must provide low latency and high bandwidth to ensure efficient data handling and processing. As data sets grow in size and complexity, the networking infrastructure must scale accordingly to maintain performance and reliability. Overall, the requirements can be categorized as follows:

- Scalability: As organizations scale their AI initiatives, the networking infrastructure must accommodate increasing data loads and more complex models, growing seamlessly without compromising performance.
- Security: Protecting data integrity and ensuring secure access to workloads are of utmost importance. Networking platforms must incorporate strong security measures to protect sensitive information and prevent unauthorized access. A least-privilege approach, granting users and applications only the permissions they need, minimizes the attack surface.
- Observability: To maintain optimal operation, it is important to monitor network performance and identify potential problems. Advanced observability tools help you track traffic patterns, diagnose issues, and ensure efficient data flow throughout the network.
- Low latency: AI training for LLMs in particular requires high-speed data transfer to process vast amounts of information in near real time. Low latency minimizes the communication delays that would otherwise affect overall training time and model accuracy.
- High bandwidth: The volume of data exchanged between compute nodes during training demands high bandwidth, so that transfers stay fast and efficient and bottlenecks that slow down computation are avoided.

Key Implementation Strategies

AKS allows developers to easily deploy and manage containerized AI models, ensuring consistent performance and rapid iteration. Native integration with Azure's high-performance storage, networking, and security features lets you handle AI workloads efficiently. AKS also supports advanced GPU scheduling (reference); the availability of specialized hardware for training and inference accelerates the development of sophisticated GenAI applications. For example, a dedicated GPU node pool can be added to an existing cluster, as sketched below.
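The following is a minimal, illustrative sketch only; the resource group, cluster name, pool name, and VM size are placeholders, and GPU SKU availability varies by region and quota:

```bash
# Placeholder names throughout; substitute your own resource group and cluster.
# Adds a node pool backed by GPU VMs for training and inference workloads.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpupool \
  --node-count 1 \
  --node-vm-size Standard_NC24ads_A100_v4
```

Once the pool is up, GPU workloads can be targeted to it with standard Kubernetes scheduling constructs such as node selectors and taints/tolerations.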
Now let's take a look at some of the latest cluster networking features we've introduced to provide a high-performance network data path architecture that helps users build secure and scalable network platforms. With Azure CNI Powered by Cilium, users have the right underlying infrastructure to address these requirements, along with comprehensive integration with Azure's extensive networking capabilities.

Azure CNI Powered by Cilium

Azure Container Networking Interface (CNI) Powered by Cilium is built on a Linux kernel technology called extended Berkeley Packet Filter (eBPF). eBPF runs sandboxed programs in the kernel with high efficiency and minimal overhead, making it ideal for advanced networking tasks. Azure CNI leverages eBPF to deliver a range of performance benefits along with advanced in-cluster security and observability features.

Performance Benefits of eBPF

eBPF provides several benefits essential for high-performance networking:

- Efficient packet processing: eBPF runs custom packet-processing logic directly in the kernel, reducing context switching between user space and kernel space. The result is faster packet processing and lower latency.
- Dynamic programmability: Networking policies and rules can be updated on the fly without recompiling the kernel or restarting the system. This flexibility is critical for adapting to changing network conditions and security requirements.
- High throughput: By keeping packet processing in the kernel, eBPF sustains high throughput with minimal impact on system performance, which is especially valuable for data-intensive workloads that require high bandwidth.

Efficient IP Addressing for Scalability and Interoperability

IP address planning is the cornerstone of building dynamic data workloads in AKS. Starting with v1.30, Azure CNI Powered by Cilium supports both overlay addressing (the default for AKS clusters) and VNet addressing for direct access to pods. Azure CNI Powered by Cilium also supports dual-stack IP addressing, allowing IPv4 and IPv6 to coexist within the same network. This flexibility is essential for supporting legacy applications that still rely on IPv4 while adopting newer, more efficient IPv6-based systems. A dual-stack configuration reduces the overhead of maintaining separate network infrastructures while ensuring compatibility and seamless interoperability, and it facilitates a smooth transition to IPv6, improving future-proofing and scalability as network demands grow.

In-Cluster Security and Observability

Azure CNI Powered by Cilium improves security and observability within your cluster through several key features:

- Advanced network policies: Azure CNI supports Layer 3 and Layer 4 network policies along with advanced policies based on fully qualified domain names (FQDN). These let users tighten security by restricting connections to specific DNS names and limiting access to trusted endpoints.
- Comprehensive network observability: The network observability platform provides detailed insights into network traffic and performance. Users can identify DNS issues such as query throttling, missing DNS responses, and errors, and can track the top DNS queries. This level of visibility is critical for diagnosing problems and optimizing network performance. With the Hubble CLI, users can trace network flows on demand across the cluster for detailed analysis and debugging.

Users unlock the recently released observability and FQDN-based features by enabling Advanced Container Networking Services (ACNS) on the AKS cluster. Let's take a closer look at how to set this up; two sketches follow.
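As a minimal sketch (resource group, cluster name, and region are placeholders), a new cluster using Azure CNI Powered by Cilium in overlay mode with dual-stack addressing could be created like this:

```bash
# Creates an AKS cluster with the Cilium data plane, overlay networking,
# and dual-stack (IPv4 + IPv6) pod addressing. Names are placeholders.
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --location eastus \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --network-dataplane cilium \
  --ip-families ipv4,ipv6 \
  --generate-ssh-keys
```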
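ACNS can then be turned on for the cluster. At the time of writing, the Azure CLI exposes this through the --enable-acns flag; treat the following as a sketch and confirm the exact syntax against the current AKS documentation:

```bash
# Enables Advanced Container Networking Services (network observability
# and FQDN-based filtering) on an existing cluster.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-acns
```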
FQDN Filtering

FQDN filtering is implemented through CiliumNetworkPolicy (CNP) resources and a DNS proxy, an architecture that allows the Cilium agent to be upgraded with minimal impact on DNS resolution. Suppose you have Kubernetes pods labeled app: genai_backend and you want to control their egress traffic: specifically, you want to allow access to "myblobstorage.com" while blocking all other egress traffic except DNS queries to the kube-dns service.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-genai-to-blobstorage
spec:
  endpointSelector:
    matchLabels:
      app: genai_backend
  egress:
    # Allow DNS queries to kube-dns and inspect them via the DNS proxy
    - toEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": kube-system
            "k8s:k8s-app": kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*.myblobstorage.com"
    # Allow application egress only to the trusted storage endpoint
    - toFQDNs:
        - matchName: app1.myblobstorage.com
```

With this policy in place, DNS lookups from the labeled pods are proxied and matched against the allowed pattern, and application traffic is permitted only to app1.myblobstorage.com; everything else is dropped. Applying and verifying the policy is sketched below.
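Assuming the manifest above is saved as allow-genai-to-blobstorage.yaml (a hypothetical filename), it can be applied and inspected with kubectl:

```bash
# Apply the CiliumNetworkPolicy and confirm the cluster accepted it
kubectl apply -f allow-genai-to-blobstorage.yaml
kubectl get ciliumnetworkpolicy allow-genai-to-blobstorage -o yaml
```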
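To watch the policy in action, the Hubble CLI can trace flows for the affected pods. This sketch assumes the Hubble CLI is installed locally and configured to reach Hubble Relay; the service name, port, and TLS setup may differ in your environment, so check the ACNS observability documentation:

```bash
# Make Hubble Relay reachable locally (adjust the service name/port as needed)
kubectl port-forward -n kube-system svc/hubble-relay 4245:443 &

# Watch DNS traffic from the genai_backend pods
hubble observe --label app=genai_backend --protocol dns

# Show flows dropped by the policy (anything outside the allowed FQDN)
hubble observe --label app=genai_backend --verdict DROPPED
```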
Additional Considerations for High-Performance Networking

Kubernetes-based data applications also demand high performance from the container networking platform. High-throughput, low-latency requirements often translate into high-speed interfaces built on technologies such as InfiniBand, which can provide bandwidths of 100 Gbps or more, significantly reducing data transfer times and improving application performance. Managing multiple interfaces can be cumbersome, however, since it involves setting up the network fabric, managing traffic flows, and ensuring compatibility with existing infrastructure. We've heard from many of our users that they need native functionality that integrates seamlessly with their Kubernetes environment. Azure CNI gives users the flexibility to securely configure these high-speed interfaces using native Kubernetes constructs such as custom resource definitions (CRDs). Azure CNI also supports Single Root I/O Virtualization (SR-IOV), which gives pods a dedicated network interface and further improves performance by reducing the CPU overhead associated with networking. We will cover this in more detail in a future blog.

Conclusion

As data-intensive workloads become more common in HPC and AI environments, the demands on networking infrastructure keep intensifying. Kubernetes-based compute provides the scalability and flexibility needed to manage these workloads, but it also presents unique networking challenges. Azure CNI, with its eBPF-based architecture, addresses these challenges by providing a high-performance networking data plane, advanced security, and comprehensive observability. Please visit the Azure Kubernetes Service roadmap (public) on GitHub and let us know how we can evolve the roadmap to support best-of-breed deployments on Azure. The next blog will focus on extending security controls from Layer 4 to Layer 7 while simplifying configuration, so stay tuned!