
Optimizing Language Model Inference on Azure



Written by Shantanu Deepak Patankar, Software Engineer Intern, and Hugo Affaticati, Technical Program Manager

Inefficient inference can drive up costs for customers, so it is important to establish clear performance benchmarks. In this blog, we set the standard for expected performance to help customers make informed decisions that maximize efficiency and minimize costs with the new Azure ND H200 v5 series.

We evaluated the inference performance of the new Azure ND H200 v5 series for small language models (SLMs) and large language models (LLMs). The ND H200 v5 series, powered by eight NVIDIA H200 Tensor Core GPUs, offers 76% more memory bandwidth than the NVIDIA H100 Tensor Core GPUs of the ND H100 v5 series. We compared Phi 3 (128k context length), Mistral v0.1 (7B parameters), and Llama 3.1 (8B, 70B, and 405B parameters) to set performance standards and help Azure customers optimize their workloads for time or resources.

Model Architecture

Achieving optimal performance requires a clear understanding of where time is spent in the inference workload. The first step is to examine the parameters that directly affect performance: for the models discussed here, these are input sequence length, output sequence length, batch size, and tensor parallelism. In this blog, we measured the impact of these variables using two essential metrics: throughput and first-token latency.

The inference process can be broken down into three basic components: a purely computational phase (e.g., local GEMMs), a pure communication phase (e.g., global reductions), and an attention phase. Analysis of the Llama 3 8B model on the new ND H200 v5 virtual machine shows that computation consistently takes up at least 50% and up to 85% of the total inference time. Communication accounts for 10% to 25% of the time and grows as the number of GPUs increases from two to eight. In contrast, the attention mechanism consistently represents less than 10% of the total time, as shown in Table 1. Customers must strike the right balance between computation and communication when choosing an AI inference architecture, depending on whether time efficiency or cost efficiency is their primary goal.

Tensor parallelism | Compute (% of time) | Communication (% of time) | Attention (% of time)
1 GPU  | 83.3 | 0.0  | 9.2
2 GPUs | 70.7 | 10.8 | 7.4
4 GPUs | 56.7 | 24.7 | 6.1
8 GPUs | 57.2 | 25.1 | 8.2

Table 1: Breakdown of time spent per phase for Llama 3 8B inference on an ND H200 v5 virtual machine with an input sequence length of 1024, an output sequence length of 128, and a batch size of 32.
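To make the first two phases concrete, the sketch below times a single tensor-parallel (row-parallel) linear layer: each GPU runs its local GEMM, then an all-reduce sums the partial results across GPUs. It is a minimal sketch assuming PyTorch with NCCL and illustrative layer sizes; it is not the harness used to produce Table 1, and production engines overlap and fuse these steps. Launch it with torchrun, one process per GPU.

# Launch with: torchrun --nproc_per_node=<number of GPUs> this_script.py
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank)

    # Illustrative sizes only: one row-parallel linear layer whose input
    # feature dimension is sharded across the GPUs.
    batch, hidden, ffn = 32, 4096, 14336
    x_local = torch.randn(batch, hidden // world, device="cuda", dtype=torch.float16)
    w_local = torch.randn(ffn, hidden // world, device="cuda", dtype=torch.float16)

    # Warm up so kernel launch and cuBLAS setup do not dominate the timing.
    for _ in range(3):
        dist.all_reduce(x_local @ w_local.t())
    torch.cuda.synchronize()

    t0 = time.perf_counter()
    partial = x_local @ w_local.t()   # pure compute phase: local GEMM
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    dist.all_reduce(partial)          # pure communication phase: global reduction
    torch.cuda.synchronize()
    t2 = time.perf_counter()

    if rank == 0:
        print(f"compute: {(t1 - t0) * 1e3:.2f} ms, communication: {(t2 - t1) * 1e3:.2f} ms")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Repeating the run at different tensor-parallel sizes shows the same trend as Table 1: the local GEMM shrinks per GPU while the reduction grows with the number of participants.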

Resource Optimization

Because most of the inference time is spent on computation, GPU compute speed has a large impact on overall performance. Understanding memory requirements helps you make full use of each GPU. The two main factors that drive GPU memory consumption are the model weights and the key-value (KV) cache.

Model weights: The memory occupied by the model weights depends on the number of parameters and the quantization of the model. The required memory can be calculated with the following formula:

Memory used (in GB) = number of parameters (in billions) × precision (in bits) / 8

For example, the model weights of a Llama 3 model with 8B parameters at FP8 precision require 8 GB of memory (8 billion parameters × 8 bits / 8 = 8 GB).
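As a minimal sketch, the formula above expressed in plain Python (no framework required; the helper name is ours, not part of any library):

def weight_memory_gb(num_params_billions: float, precision_bits: int) -> float:
    # Memory occupied by the model weights, in GB.
    return num_params_billions * precision_bits / 8

print(weight_memory_gb(8, 8))   # Llama 3 8B at FP8 precision -> 8.0 GB, as in the example above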

Key-value cache: Because the attention score for each new token depends on the keys and values of all previous tokens, the model caches the key and value matrices instead of recomputing them for every token in the sequence. Storing both keys and values is what the factor of 2 in the formula below accounts for.

Size of KV cache (in bytes) = batch size × sequence length × 2 × number of layers × (number of heads × dimension of head) × precision (in bits) / 8

For example, the key-value cache of a Llama 3 model with 8B parameters, FP8 precision, an input length of 1024, and an output length of 128 requires about 0.5 GB of memory at batch size 1 (a sequence length of 1024 + 128 = 1152, 32 layers, and a hidden dimension of 4096).
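A minimal sketch of the same formula in Python; the configuration in the example call is purely illustrative and does not correspond to any of the benchmarked models:

def kv_cache_gb(batch_size, seq_len, num_layers, num_heads, head_dim, precision_bits):
    # Keys and values (the factor of 2) cached for every token, layer, and head.
    size_bytes = batch_size * seq_len * 2 * num_layers * num_heads * head_dim * precision_bits / 8
    return size_bytes / 1e9

# Illustrative configuration: 24 layers, 16 heads of dimension 64, 16-bit cache.
print(kv_cache_gb(batch_size=8, seq_len=2048, num_layers=24,
                  num_heads=16, head_dim=64, precision_bits=16))  # ~1.6 GB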

Using these two quantities, customers can optimize resource utilization by accurately predicting the maximum batch size a virtual machine can accommodate for a given model. Available GPU memory is calculated by subtracting the weight memory from the total GPU memory when the system is idle. The maximum batch size is then obtained by dividing the available memory by the KV cache size required for batch size 1. Table 2 gives some examples of these theoretical batch sizes. This approach not only simplifies the process but also helps customers avoid trial-and-error methods that increase GPU consumption and cost.

Model | ND H200 v5 memory per GPU (GB) | Number of parameters (billions) | Weight memory (GB) | Available memory (GB) | KV cache size (GB) | Maximum batch size
Llama 3      | 140 | 8  | 16 | 124   | 0.60 | 206
Mistral      | 140 | 7  | 14 | 126   | 0.60 | 210
Phi-3 medium | 140 | 14 | 28 | 115.8 | 0.94 | 123

Table 2: Theoretical maximum batch sizes for inference with different language models (Llama 3 8B, Mistral 7B, Phi-3 medium) on an ND H200 v5 virtual machine with a total sequence length of 1152 and FP8.
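As a minimal sketch of that last division step, the snippet below reproduces the maximum batch sizes in Table 2 from its available-memory and per-sequence KV cache columns (the helper is ours, for illustration only):

import math

def max_batch_size(available_memory_gb: float, kv_cache_per_seq_gb: float) -> int:
    # Largest batch whose KV cache still fits in the memory left after loading the weights.
    return math.floor(available_memory_gb / kv_cache_per_seq_gb)

# (available memory GB, KV cache per sequence GB) taken from Table 2
models = {
    "Llama 3 8B":   (124.0, 0.60),
    "Mistral":      (126.0, 0.60),
    "Phi-3 medium": (115.8, 0.94),
}
for name, (available, kv) in models.items():
    print(name, max_batch_size(available, kv))   # 206, 210, 123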

These theoretical limits were confirmed empirically, with very similar results. Figure 1 below shows the measured maximum batch size when fully using a single NVIDIA H200 Tensor Core GPU and when scaling up to all eight GPUs of the ND H200 v5 virtual machine. By optimizing the batch size, customers can extract additional performance from each GPU by making full use of the available resources. This allows the virtual machine to operate at full capacity, maximizing performance while minimizing costs.


Figure 1: Experimental maximum batch size as a function of tensor parallelism (TP) for inference using Llama 3 8B on an ND H200 v5 virtual machine with a total sequence length of 1152.

Time Optimization

For some workloads, time matters most. Increasing the batch size can improve throughput and maximize resource utilization, but it also increases latency. Measuring both the latency and the throughput of an inference workload makes it possible to determine the optimal balance. For example, when running models like Llama 3 and Mistral on a single GPU of an ND H200 v5 virtual machine, a batch size of 32 provides the highest throughput-to-latency ratio, as shown in Figure 2. The optimal batch size depends on the customer's workload, as highlighted by the Phi-3 model, which achieves its highest ratio at batch size 64 on a single GPU. When scaling to two GPUs, the optimal batch size increases to 64, as shown in Figure 3. Although this approach may not fully utilize the available memory, it achieves the lowest possible latency for inference, making it ideal for time-sensitive applications.
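A minimal sketch of such a batch-size sweep is shown below. It is written against a placeholder generate(prompts, max_new_tokens) callable rather than any specific inference engine, so the function name and signature are assumptions; in practice you would also track first-token latency separately and average over several runs.

import time

def sweep_batch_sizes(generate, prompts, max_new_tokens=128,
                      batch_sizes=(1, 8, 16, 32, 64, 128)):
    results = []
    for bs in batch_sizes:
        batch = prompts[:bs]
        start = time.perf_counter()
        generate(batch, max_new_tokens)              # one full inference pass (placeholder call)
        latency = time.perf_counter() - start        # seconds for the whole batch
        throughput = bs * max_new_tokens / latency   # output tokens per second
        results.append((bs, throughput, latency))
    # Pick the batch size with the highest throughput-to-latency ratio,
    # i.e. the "knee" highlighted in Figures 2 and 3.
    return max(results, key=lambda r: r[1] / r[2])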


Figure 2: Experimental optimal throughput versus latency trade-off as a function of batch size for inference with Llama 3, Phi-3, and Mistral on a single GPU in an ND H200 v5 virtual machine with a total sequence length of 1152, FP8, and TP 1.


Figure 3: Experimental optimal throughput versus latency trade-off as a function of batch size for inference with Llama 3, Phi-3, and Mistral on two GPUs in an ND H200 v5 virtual machine with a total sequence length of 1152, FP8, and TP 2.




