
Revolutionizing AI Workloads with Microsoft’s Custom AI Accelerator



Authors:

Sherry Xu, Partner Lead SoC Architect, Azure Maia

Chandru Ramakrishnan, Partner Software Engineering Manager

As advances in artificial intelligence continue to demand new innovations in the cloud, optimizing hardware and software together is critical to delivering AI infrastructure with maximum performance, scalability, and compatibility.

At Hot Chips 2024, Microsoft shared specifications for the Maia 100, Microsoft’s first-generation purpose-built AI accelerator designed specifically for large-scale AI workloads deployed on Azure. Vertically integrated to optimize performance and reduce costs, the Maia 100 system includes a platform architecture with custom server boards in a custom rack and a software stack built to drive performance and cost efficiency for advanced AI capabilities in services like Azure OpenAI Service.


Maia 100 accelerator architecture

The Maia 100 accelerator is purpose-built for a wide range of cloud-based AI workloads. The chip measures ~820mm2 and utilizes TSMC’s N5 process with CoWoS-S interposer technology. The Maia 100’s reticle-sized SoC die with large on-die SRAM is combined with four HBM2E dies to provide 1.8 terabytes per second of bandwidth and 64 gigabytes of capacity to accommodate AI-scale data processing requirements.

AI accelerator built for high throughput and diverse data formats

Designed to support up to 700W TDP but provisioned at 500W, the Maia 100 can deliver high performance while efficiently managing power depending on the target workload.

Chip architecture designed to support advanced machine learning requirements

Designed for modern machine learning requirements, Maia 100’s architecture reflects careful research into AI systems for optimal computational speed, performance, and accuracy.

  • High-speed tensor units provide fast processing for training and inference while supporting a wide range of data types, including low-precision formats such as the MX data format, which Microsoft first introduced through the MX Consortium in 2023 (a rough sketch of block-scaled quantization in this spirit appears after this list). The tensor units are built as 16xRx16 units.
  • The vector processor is a loosely coupled superscalar engine built on a custom instruction set architecture (ISA) to support a wide range of data types, including FP32 and BF16.
  • The DMA (Direct Memory Access) engine supports various tensor sharding methods.
  • Hardware semaphores enable asynchronous programming in Maia systems.
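
For a feel for what a block-scaled, low-precision format buys, the sketch below quantizes a tensor in fixed-size blocks that share one power-of-two scale. This is a minimal NumPy illustration of the general microscaling idea; the block size, element width, and rounding here are assumptions for illustration, not the published MX specification.

```python
import numpy as np

def mx_style_quantize(x, block_size=32, elem_bits=8):
    """Toy shared-scale block quantization, loosely in the spirit of
    microscaling (MX) formats. Illustrative only, not the MX spec."""
    x = np.asarray(x, dtype=np.float32).reshape(-1, block_size)
    qmax = 2 ** (elem_bits - 1) - 1                     # e.g. 127 for 8-bit elements
    max_abs = np.maximum(np.max(np.abs(x), axis=1, keepdims=True), 1e-30)
    scale = 2.0 ** np.ceil(np.log2(max_abs / qmax))     # one power-of-two scale per block
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def mx_style_dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(4, 32).astype(np.float32)
q, scale = mx_style_quantize(x)
print(np.max(np.abs(x - mx_style_dequantize(q, scale).reshape(x.shape))))  # small error
```

Storing and moving the small integer elements plus one scale per block is what cuts bandwidth relative to FP32, the same lever the accelerator's low-precision formats pull at hardware speed.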


A software-centric approach to data utilization and power efficiency

Maia accelerators are designed with low-precision storage data types and data compression engines to reduce the bandwidth and capacity required for large-scale inference tasks, which are often bottlenecked by data movement. To further improve data utilization and power efficiency, the large L1 and L2 scratch pads are managed in software.
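
To make "software-managed scratch pads" concrete, the toy loop below explicitly stages operand tiles into small, fixed-size buffers before computing on them, which is roughly how a compiler or kernel writer orchestrates data movement into on-chip SRAM. The sizes and buffer names are illustrative, not Maia's actual memory hierarchy.

```python
import numpy as np

TILE = 64  # illustrative tile size chosen to "fit" the pretend scratch pad

def scratchpad_matmul(A, B):
    """Tiled matmul in which every operand tile is explicitly copied into a
    scratch-pad-like buffer before use (software-managed, not cached)."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    a_buf = np.empty((TILE, TILE), dtype=np.float32)   # stand-in for an L1 scratch pad
    b_buf = np.empty((TILE, TILE), dtype=np.float32)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            for k in range(0, K, TILE):
                a_buf[:] = A[i:i+TILE, k:k+TILE]       # software decides what is resident
                b_buf[:] = B[k:k+TILE, j:j+TILE]       # in the scratch pad, and when
                C[i:i+TILE, j:j+TILE] += a_buf @ b_buf
    return C

A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)
print(np.allclose(scratchpad_matmul(A, B), A @ B, atol=1e-3))
```

Because residency is decided in software rather than by a cache, the scheduler can keep exactly the tiles the next step needs on chip, which is where the data-utilization and power savings come from.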

Ethernet-based interconnect supports large-scale AI models

In 2023, Microsoft led the formation of the Ultra Ethernet Consortium to enable the industry to adopt an Ethernet-based interconnect designed for ultra-high-bandwidth computing. Maia 100 supports up to 4800 Gbps of all-gather and scatter-reduce bandwidth and 1200 Gbps of all-to-all bandwidth. The Ethernet interconnect leverages a custom RoCE-like protocol to provide enhanced reliability and load balancing. Maia’s backend network protocol supports AES-GCM encryption, making it well suited to confidential computing. Maia 100 is also supported in a unified backend network for scale-up and scale-out workloads, providing the flexibility to support both direct and switched connections.

Ethernet-based backend network protocol for Maia 100

Enabling rapid deployment and model portability with the Maia SDK

Maia 100 vertically integrates learnings from every layer of the cloud architecture, from advanced cooling and networking requirements to the software stack that allows for rapid deployment of models, with a hardware and software architecture designed from the ground up to run large workloads more efficiently. The Maia Software Development Kit (SDK) allows users to quickly port models written in PyTorch and Triton to Maia.

The Maia SDK provides a comprehensive set of components that enable developers to quickly deploy models to Azure OpenAI services.

  • Framework Integration: A first-class PyTorch backend supporting both eager execution and graph modes.
  • Developer Tools: Model debugging and performance tuning tools such as debuggers, profilers, visualization tools, model quantization and verification tools.
  • Compilers: Maia has two programming models and compilers. The Triton programming model provides agility and portability, while the Maia API is ideal for top performance.
  • Kernel and collective libraries: We have developed a set of highly optimized ML compute and communication kernels to get you up and running quickly on Maia using the compiler. Writing custom kernels is also supported.
  • Maia Host/Device Runtime: The host/device runtime layer comes with a hardware abstraction layer that handles memory allocation, kernel execution, scheduling, and device management.

Maia 100’s software development kit (SDK) allows users to quickly port models written in PyTorch and Triton to Maia.

The dual programming model ensures efficient data processing and synchronization

The Maia programming model leverages asynchronous programming using semaphores for synchronization, allowing for the overlap of memory and network transfers with computation. It operates on two streams of execution: the control processor issues asynchronous commands through a queue, and hardware threads execute these commands, ensuring efficient data processing through semaphore-based synchronization.
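
Maia's hardware queues and semaphores are not publicly documented, so the sketch below only mimics the pattern with ordinary Python threads: a control thread enqueues commands to a "DMA" stream and a "compute" stream that run concurrently, and a semaphore makes each compute step wait only for the transfer it depends on. All names and structures here are illustrative assumptions.

```python
import threading, queue, time

dma_q, compute_q = queue.Queue(), queue.Queue()
tile_ready = threading.Semaphore(0)          # released once per completed "transfer"

def dma_engine():
    for cmd in iter(dma_q.get, None):        # run until a None sentinel arrives
        cmd()                                # stand-in for an asynchronous memory copy
        tile_ready.release()                 # signal dependents that the tile landed

def compute_engine():
    for cmd in iter(compute_q.get, None):
        tile_ready.acquire()                 # block only until this step's input is ready
        cmd()

streams = [threading.Thread(target=dma_engine), threading.Thread(target=compute_engine)]
for s in streams:
    s.start()

# The "control processor": issue all commands asynchronously up front; transfers
# for later tiles overlap with compute on earlier ones.
for i in range(3):
    dma_q.put(lambda i=i: (time.sleep(0.01), print(f"copied tile {i}")))
    compute_q.put(lambda i=i: print(f"GEMM on tile {i}"))
dma_q.put(None)
compute_q.put(None)
for s in streams:
    s.join()
```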

To program Maia, developers can choose between two programming models. Triton is a popular open-source domain-specific language (DSL) for deep neural networks (DNNs) that simplifies coding and runs on both GPUs and Maia, while the Maia API is a Maia-specific custom programming model built for maximum performance with more granular control. Triton has fewer lines of code and automatically handles memory and semaphore management, while the Maia API requires more code and explicit management from the programmer.
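
For context, the Triton side of that trade-off typically looks like the generic kernel below: tiling and masking are explicit, but memory movement and synchronization are left to the compiler. This is standard, hardware-agnostic Triton, not Maia-specific code.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)               # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

A Maia API version of the same kernel would expose the memory staging and semaphore signaling that Triton hides, trading extra code for tighter control over the hardware.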

Optimizing data flow through gather-based matrix multiplication

Maia uses a Gather-based approach for large-scale distributed General Matrix Multiplication (GEMM), unlike All-Reduce-based approaches. This approach offers several advantages: it fuses the activation function (such as GELU) directly into SRAM after GEMM, improving processing speed and efficiency; it overlaps network communication and computation to reduce idle time; and it transmits quantized data over the network, reducing latency and thus increasing data transfer speed and improving overall system performance.

We also leverage static random-access memory (SRAM) at the cluster level to buffer activations and intermediate results. Network reads and writes are served directly from this cluster SRAM as well, which significantly reduces HBM reads and improves latency.

We further improve performance by parallelizing computation across the cluster and leveraging a network-on-chip (NOC) for on-chip activation collection.
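
A rough single-process NumPy sketch of that data flow is below: activation shards are quantized before being "gathered," each shard then runs its local GEMM, and GELU is applied immediately on the result. The sharding layout and the toy int8 quantizer are illustrative assumptions, not Maia's actual scheme.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, the activation fused right after the GEMM
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def quantize(x):                           # toy int8 quantizer standing in for the
    scale = np.max(np.abs(x)) / 127.0      # low-precision data sent over the network
    return np.round(x / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
n_dev, tokens, d_in, d_out = 4, 8, 64, 128
# Each "device" holds a slice of the tokens and a column shard of the weights.
X_shards = [rng.standard_normal((tokens // n_dev, d_in), dtype=np.float32) for _ in range(n_dev)]
W_shards = [rng.standard_normal((d_in, d_out // n_dev), dtype=np.float32) for _ in range(n_dev)]

# Gather step: peers exchange quantized activations, not full-precision partial sums.
wire = [quantize(x) for x in X_shards]
X_full = np.concatenate([dequantize(q, s) for q, s in wire])

# Local GEMM per device with GELU fused immediately after, then stitch the columns.
Y = np.concatenate([gelu(X_full @ W) for W in W_shards], axis=1)
print(Y.shape)  # (tokens, d_out)
```

In the gather-based layout only compact activations cross the network and each device finishes its own output columns without a final all-reduce, which is what lets communication overlap cleanly with the GEMM.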

Optimize workload performance with portability and flexibility

The key to Maia 100’s fungibility is the ability to run PyTorch models on Maia with just one line of code. This is supported by a PyTorch backend that operates in both eager mode for optimal developer experience and graph mode for best performance. Leveraging PyTorch with Triton gives developers full portability and flexibility between hardware backends to optimize workload performance without sacrificing efficiency and the ability to target AI workloads.
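
The exact invocation isn't spelled out publicly, but the claim maps onto PyTorch's usual pattern for out-of-tree accelerators, where switching hardware amounts to a device or backend string. The "maia" device and backend names below are assumptions for illustration, not a documented API.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Hypothetical: if the Maia PyTorch backend registers a "maia" device, moving an
# existing model is the familiar one-liner (eager mode for developer experience).
model = model.to("maia")                          # assumed device string

# Graph mode would then go through torch.compile with the vendor backend selected.
compiled = torch.compile(model, backend="maia")   # assumed backend name
```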

Satya Nadella holds the Maia 100 AI accelerator chip at Microsoft Ignite 2023.

With its advanced architecture, comprehensive developer tools, and seamless integration with Azure, Maia 100 is revolutionizing how Microsoft manages and runs AI workloads. With co-design of algorithms, software, and hardware, built-in options for both model developers and custom kernel writers, and a vertically integrated design that optimizes performance and improves power efficiency while reducing costs, Maia 100 provides new options for running advanced cloud-based AI workloads on Microsoft’s AI infrastructure.




