Optimizing Language Model Inference on Azure

October 2, 2024

Written by Shantanu Deepak Patankar, Software Engineer Intern, and Hugo Affaticati, Technical Program Manager

Inefficient inference optimization can drive up costs for customers, so it is important to establish clear performance benchmarks. In this blog, we set the standard for expected performance to help customers make informed decisions that maximize efficiency and minimize costs with the new Azure ND H200 v5 series. We evaluated the inference performance of the new Azure ND H200 v5 series for small language models (SLMs) and large language models (LLMs). The ND H200 v5 series is powered by eight NVIDIA H200 Tensor Core GPUs, offering 76% more GPU memory bandwidth than the NVIDIA H100 Tensor Core GPUs of the ND H100 v5 series. We benchmarked Phi-3 (128K context length), Mistral v0.1 (7B parameters), and Llama 3.1 (8B, 70B, and 405B parameters) to set performance standards and help Azure customers optimize their workloads for time or resources.

Model Architecture

Achieving optimal performance requires a clear understanding of where time is spent in your inference workload. The first step is to examine the parameters that directly affect performance. For the models discussed here, these key parameters are input sequence length, output sequence length, batch size, and tensor parallelism. In this blog, we measured the impact of these variables using two essential metrics: throughput and first-token latency.

The inference process can be broken down into three basic components: a pure computation phase (e.g., local GEMMs), a pure communication phase (e.g., global reductions), and an attention phase. Analysis of the Llama 3 8B model on the new ND H200 v5 virtual machine shows that computation consistently takes up at least 50% and up to 85% of the total inference time. Communication accounts for 10% to 25% of the time and grows as the number of GPUs increases from 2 to 8. In contrast, the attention mechanism consistently represents less than 10% of the total time, as shown in Table 1. Customers must strike the right balance between computation and communication when choosing an AI inference architecture, depending on whether time efficiency or cost efficiency is their primary goal.

Tensor parallelism | Compute (% of time) | Communication (% of time) | Attention (% of time)
1 GPU | 83.3 | 0 | 9.2
2 GPUs | 70.7 | 10.8 | 7.4
4 GPUs | 56.7 | 24.7 | 6.1
8 GPUs | 57.2 | 25.1 | 8.2

Table 1: Breakdown of time spent per mechanism for Llama 3 8B inference on an ND H200 v5 virtual machine with an input sequence length of 1024, an output sequence length of 128, and a batch size of 32.

Resource Optimization

Because most of the inference time is spent on computation, GPU compute speed has a large impact on overall performance. Understanding your memory requirements also improves GPU utilization. The two main factors that drive GPU memory consumption are the model weights and the key-value (KV) cache.

Model weights: The memory occupied by the model weights depends on the number of parameters and the quantization of the model. The required memory can be calculated with the following formula:

Memory used (in GB) = number of parameters (in billions) × precision (in bits) / 8

For example, the model weights of a Llama 3 model with 8B parameters at FP8 precision require 8 GB of memory (8B parameters × 8 / 8 = 8 GB).
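To make this formula concrete, here is a minimal Python sketch (the helper name and example values are ours, not part of any Azure or NVIDIA tooling) that reproduces the weight-memory estimate:

```python
def weight_memory_gb(params_billions: float, precision_bits: int) -> float:
    """Model weight memory in GB: parameters (in billions) times bytes per parameter."""
    return params_billions * precision_bits / 8


# Llama 3 8B at FP8 precision: 8 x 8 / 8 = 8 GB of weights.
print(weight_memory_gb(8, 8))    # 8.0
# The same model at FP16 would need twice as much memory.
print(weight_memory_gb(8, 16))   # 16.0
```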
Key-value cache: Because the attention score for each token depends only on the previous tokens, the model stores the key and value matrices in a cache to avoid recomputing the attention values for every token in the sequence; this accounts for the factor of 2 in the formula below.

Size of KV cache (in GB) = batch size × sequence length × 2 × number of layers × (number of heads × dimension of head) × precision (in bits) / 8

For example, the key-value cache of a Llama 3 model with 8B parameters, FP8 precision, an input length of 1024, and an output length of 128 requires 0.5 GB of memory at batch size 1 (1 × (1024 + 128) sequence length × 2 × 32 layers × 4096 × 8 / 8 = 0.5 GB).

Using these two quantities, customers can optimize resource utilization by accurately predicting the maximum batch size a virtual machine can accommodate for a given model. The available GPU memory is calculated by subtracting the weight memory from the total GPU memory when the system is idle. The maximum batch size is then obtained by dividing the available memory by the KV cache size required for batch size 1. Table 2 gives some examples of these theoretical batch sizes. This approach not only simplifies the process but also helps customers avoid trial-and-error methods that can lead to higher GPU consumption and increased costs.

Model | ND H200 v5 memory per GPU (GB) | Number of parameters (billions) | Weight memory (GB) | Available memory (GB) | KV cache size (GB) | Maximum batch size
Llama 3 | 140 | 8 | 16 | 124 | 0.60 | 206
Mistral | 140 | 7 | 14 | 126 | 0.60 | 210
Phi-3 medium | 140 | 14 | 28 | 115.8 | 0.94 | 123

Table 2: Theoretical maximum batch sizes for inference with different language models (Llama 3 8B, Mistral, Phi-3 medium) on the ND H200 v5 virtual machine with a sequence length of 1152 and FP8 precision.

The empirical results closely match these theoretical limits. Figure 1 below shows the maximum batch size that fully utilizes a single NVIDIA H200 Tensor Core GPU, and how that throughput scales across up to eight GPUs of the ND H200 v5 virtual machine. By optimizing the batch size, customers can extract additional performance from each GPU by making full use of the available resources. This allows every virtual machine to operate at full capacity, maximizing performance while minimizing costs.

Figure 1: Experimental maximum batch size as a function of tensor parallelism (TP) for inference with Llama 3 8B on an ND H200 v5 virtual machine with a total sequence length of 1152.

Time Optimization

For some workloads, time matters most. Increasing the batch size improves throughput and maximizes resource utilization, but it also increases latency. By measuring both the latency and the throughput of your inference workload, you can determine the optimal balance. For example, when running models like Llama 3 and Mistral on a single GPU of an ND H200 v5 virtual machine, batch size 32 delivers the highest throughput-to-latency ratio, as shown in Figure 2. The optimal batch size depends on the customer's workload, as highlighted by the Phi-3 model, which achieves its highest ratio at batch size 64 on a single GPU. When scaling to two GPUs, the optimal batch size increases to 64, as shown in Figure 3. Although this approach may not fully utilize the available memory, it achieves the lowest possible latency for inference, making it ideal for time-sensitive applications.
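As a rough illustration of this sizing logic, the following minimal Python sketch (our own helper names, not an official tool) transcribes the KV-cache formula and the maximum-batch-size estimate; the Llama 3 8B row of Table 2 is reproduced as a sanity check.

```python
def kv_cache_gb_per_sample(seq_len: int, n_layers: int, n_heads: int,
                           head_dim: int, precision_bits: int) -> float:
    """KV cache for batch size 1, transcribing the formula above:
    seq_len x 2 (keys and values) x layers x (heads x head dim) x bytes per value."""
    return seq_len * 2 * n_layers * (n_heads * head_dim) * (precision_bits / 8) / 1e9


def max_batch_size(gpu_memory_gb: float, weight_memory_gb: float,
                   kv_cache_gb_batch1: float) -> int:
    """Theoretical ceiling: memory left after the weights, divided by the
    KV cache required for a single sample."""
    return int((gpu_memory_gb - weight_memory_gb) // kv_cache_gb_batch1)


# Reproducing the Llama 3 8B row of Table 2: 140 GB of GPU memory,
# 16 GB of weights, and a 0.60 GB KV cache per sample -> batch size 206.
print(max_batch_size(gpu_memory_gb=140, weight_memory_gb=16, kv_cache_gb_batch1=0.60))

# The per-sample KV cache can also be estimated from the model configuration
# (here: 1152 total sequence length, 32 layers, 32 heads of dimension 128, FP8).
# Exact sizes depend on details such as grouped-query attention, so treat this
# as a rough estimate rather than an exact figure.
print(round(kv_cache_gb_per_sample(1152, 32, 32, 128, 8), 2))
```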
Figure 2: Experimental optimal throughput versus latency trade-off as a function of batch size for inference with Llama 3, Phi-3, and Mistral on a single GPU of an ND H200 v5 virtual machine with a total sequence length of 1152, FP8, and TP 1.

Figure 3: Experimental optimal throughput versus latency trade-off as a function of batch size for inference with Llama 3, Phi-3, and Mistral on two GPUs of an ND H200 v5 virtual machine with a total sequence length of 1152, FP8, and TP 2.
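For readers who want to automate this choice, here is a small sketch that picks the batch size with the best throughput-to-latency ratio from a set of measurements; the numbers below are hypothetical placeholders, not the benchmark results reported in this blog.

```python
# Hypothetical measurements for illustration only: (batch size, throughput in
# tokens/s, first-token latency in s). Replace with numbers from your workload.
measurements = [
    (8, 2500, 0.20),
    (16, 4200, 0.25),
    (32, 6800, 0.35),
    (64, 8100, 0.60),
]

# Pick the batch size with the highest throughput-to-latency ratio,
# mirroring the selection criterion behind Figures 2 and 3.
best = max(measurements, key=lambda m: m[1] / m[2])
print(f"Best batch size: {best[0]} (throughput/latency ratio = {best[1] / best[2]:.0f})")
```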