
Benchmarking 6th gen. Intel-based Dv6 (preview) VM SKUs for HPC Workloads in Financial Services

by info.odysseyx@gmail.com


Introduction

In the rapidly changing world of financial services, high-performance computing (HPC) in the cloud has become indispensable. From derivatives pricing and risk assessment to portfolio optimization and regulatory workloads such as CVA and FRTB, the flexibility and scalability of cloud deployments are transforming the industry. Unlike traditional HPC workloads that require complex parallelization frameworks (e.g., depending on MPI and InfiniBand networking), many financial calculations can run efficiently on general-purpose SKUs in Azure.

Depending on the code used to perform the calculations, many implementations utilize vendor-specific optimizations such as Intel’s AVX-512. Following the recent public preview announcement of the 6th-generation Intel-based Dv6 VMs (see here), this article looks at the performance evolution across three generations of the D32ds SKU, from D32ds_v4 to D32ds_v6.
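Whether a given VM actually exposes AVX-512 to the guest can be checked from inside the OS. A minimal sketch for Linux, reading the CPU feature flags from /proc/cpuinfo (this path is Linux-specific; the list will be empty on SKUs or platforms without AVX-512):

```python
def avx512_flags(cpuinfo_path="/proc/cpuinfo"):
    """Return the sorted AVX-512 feature flags exposed by the CPU, if any."""
    try:
        with open(cpuinfo_path) as f:
            text = f.read()
    except OSError:
        return []  # not Linux, or /proc not mounted
    flags = set()
    for line in text.splitlines():
        if line.startswith("flags"):
            # the flags line is a space-separated list of CPU features
            flags.update(tok for tok in line.split() if tok.startswith("avx512"))
    return sorted(flags)

if __name__ == "__main__":
    found = avx512_flags()
    print("AVX-512 flags:", ", ".join(found) if found else "none detected")
```

This is a convenient sanity check before benchmarking, since vectorized financial libraries may take a different code path depending on which AVX-512 subsets (e.g., avx512f, avx512dq) are present.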

It follows a testing methodology similar to the January 2023 article “Benchmarking Azure HPC SKUs for Financial Services Workloads” (available here).

The official announcement mentioned that the upcoming Dv6 series (currently in preview) offers significant improvements over the previous Dv5 generation. The main points are as follows:

  • Up to 27% improved vCPU performance and 3x larger L3 cache compared to previous-generation Intel Dl/D/Ev5 VMs.
  • Support for up to 192 vCPUs and over 18 GiB of memory.
  • Azure Boost provides:
    • Up to 400,000 IOPS and 12 GB/s remote storage throughput.
    • Up to 200 Gbps VM network bandwidth.
  • Local SSD capacity increased by 46% and read IOPS more than tripled.
  • NVMe interface for both local and remote disks.

Note: Enhanced security through Total Memory Encryption (TME) technology is not enabled in preview deployments and will be benchmarked after it is available.

Technical Specifications for 3rd Generation D32ds SKUs

| VM name | D32ds_v4 | D32ds_v5 | D32ds_v6 |
|---|---|---|---|
| Number of vCPUs | 32 | 32 | 32 |
| InfiniBand | Not applicable | Not applicable | Not applicable |
| Processor | Intel® Xeon® Platinum 8370C (Ice Lake) or Intel® Xeon® Platinum 8272CL (Cascade Lake) | Intel® Xeon® Platinum 8370C (Ice Lake) | Intel® Xeon® Platinum 8573C (Emerald Rapids) |
| Highest CPU frequency | 3.4 GHz | 3.5 GHz | 3.0 GHz |
| RAM per VM | 128 GB | 128 GB | 128 GB |
| RAM per core | 4 GB | 4 GB | 4 GB |
| Attached local disk | 1200 GiB SSD | 1200 GiB SSD | 440 GiB SSD |

Benchmarking settings

For the benchmarking setup, we used the Phoronix Test Suite (link) and ran two tests from the OpenBenchmarking.org test suite that specifically target quantitative finance workloads.

The tests in the “Financial Suite” are divided into two groups, each running independent benchmarks. In addition to our suite of financial tests, we also ran AI-Benchmark to evaluate the evolution of AI inference capabilities across three VM generations.

| Finance Bench | QuantLib | AI Benchmark |
|---|---|---|
| Bonds OpenMP | Size XXS | Device Inference Score |
| Repo OpenMP | Size S | Device AI Score |
| Monte Carlo OpenMP | | Device Training Score |

Software dependencies

| Element | Version |
|---|---|
| OS image | Ubuntu Marketplace Image: 24_04-lts |
| Phoronix Test Suite | 10.8.5 |
| QuantLib benchmark | 1.35-dev |
| Finance Bench benchmark | 2016-07-25 |
| AI Benchmark Alpha | 0.1.2 |
| Python | 3.12.3 |

To run the benchmarks on a newly created D-series VM, run the following commands (after updating the installed packages to the latest versions):

git clone https://github.com/phoronix-test-suite/phoronix-test-suite.git
sudo apt-get install php-cli php-xml cmake
cd phoronix-test-suite
sudo ./install-sh
phoronix-test-suite benchmark finance

Running the AI benchmark requires a few additional steps: create a virtual environment for the additional Python packages, then install the tensorflow and ai-benchmark packages.

sudo apt install python3 python3-pip python3-virtualenv
mkdir ai-benchmark && cd ai-benchmark
virtualenv venv
source venv/bin/activate
pip install tensorflow
pip install ai-benchmark
phoronix-test-suite benchmark ai-benchmark

Benchmark runtimes and results

The purpose of this article is to share the results of a set of benchmarks that closely match the use cases mentioned in the introduction. Since most of these use cases are primarily CPU-bound, we limited our benchmarks to D-series VMs. For memory-bound code that benefits from a higher memory-to-core ratio, the new Ev6 SKU may be a suitable option.

In the picture below you can see a representative benchmark running on a Dv6 VM with almost 100% of the CPU utilized during execution. Individual runs of the Phoronix test suite, starting with Finance Bench and continuing with QuantLib, are clearly visible.

Runtime

Figure 1: CPU utilization for the entire financial benchmark run.

| Benchmark | VM size | Start time | End time | Duration | Minutes |
|---|---|---|---|---|---|
| Financial benchmarks | Standard D32ds_v4 | 12:08 | 15:29 | 03:21 | 201.00 |
| Financial benchmarks | Standard D32ds_v5 | 11:38 | 14:12 | 02:34 | 154.00 |
| Financial benchmarks | Standard D32ds_v6 | 11:39 | 13:27 | 01:48 | 108.00 |
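From the total durations above, the generation-over-generation speedups can be computed directly. A quick sketch using the minute values reported for the financial benchmark runs:

```python
# Total financial-benchmark runtimes in minutes, from the table above
durations_min = {"D32ds_v4": 201.0, "D32ds_v5": 154.0, "D32ds_v6": 108.0}

def speedup(old, new):
    """Ratio of old runtime to new runtime: >1 means the newer SKU is faster."""
    return old / new

v4_to_v5 = speedup(durations_min["D32ds_v4"], durations_min["D32ds_v5"])
v5_to_v6 = speedup(durations_min["D32ds_v5"], durations_min["D32ds_v6"])
v4_to_v6 = speedup(durations_min["D32ds_v4"], durations_min["D32ds_v6"])
print(f"v4->v5: {v4_to_v5:.2f}x, v5->v6: {v5_to_v6:.2f}x, v4->v6: {v4_to_v6:.2f}x")
# → v4->v5: 1.31x, v5->v6: 1.43x, v4->v6: 1.86x
```

In other words, the full suite completes almost twice as fast on D32ds_v6 as on D32ds_v4.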

Financial Bench Results

[Charts: Finance Bench results across D32ds_v4, D32ds_v5, and D32ds_v6]

QuantLib Results

[Chart: QuantLib results across D32ds_v4, D32ds_v5, and D32ds_v6]

AI Benchmark Alpha Results

[Chart: AI Benchmark Alpha results across D32ds_v4, D32ds_v5, and D32ds_v6]

Discussion of Results

The results show significant performance improvements in QuantLib across the D32v4, D32v5, and D32v6 generations. In particular, operations per second for size S increased by 47.18% from D32v5 to D32v6, while for size XXS they increased by 45.55%.

Benchmark times for ‘Repo OpenMP’ and ‘Bonds OpenMP’ also decreased, indicating improved performance. ‘Repo OpenMP’ time decreased by 18.72% from D32v4 to D32v5 and by 20.46% from D32v5 to D32v6. Similarly, ‘Bonds OpenMP’ time decreased by 11.98% from D32v4 to D32v5 and by 18.61% from D32v5 to D32v6.

As for Monte Carlo OpenMP performance, D32v6 showed the best result at 51,927.04 ms, followed by D32v5 at 56,443.91 ms and D32v4 at 57,093.94 ms. Runtimes decreased by 1.14% from D32v4 to D32v5 and by 8.00% from D32v5 to D32v6.

AI benchmark alpha scores for device inference and training have also improved significantly. The inference score increased by 15.22% from D32v4 to D32v5 and by 42.41% from D32v5 to D32v6. The training score increased by 21.82% from D32v4 to D32v5 and by 43.49% from D32v5 to D32v6.

Lastly, the Device AI score improved across versions, with a D32v4 score of 6726 points, a D32v5 score of 7996 points, and a D32v6 score of 11436 points. The increase was 18.88% from D32v4 to D32v5 and 43.02% from D32v5 to D32v6.
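The percentage changes quoted in this discussion follow the usual relative-change formula. A small sketch reproducing a few of them from the raw numbers in the text (negative values indicate a runtime decrease, i.e., an improvement):

```python
def pct_change(old, new):
    """Relative change from old to new, in percent (positive = increase)."""
    return (new - old) / old * 100.0

# Monte Carlo OpenMP runtimes in ms (lower is better)
mc = {"v4": 57093.94, "v5": 56443.91, "v6": 51927.04}
print(f"Monte Carlo v4->v5: {pct_change(mc['v4'], mc['v5']):.2f}%")  # ≈ -1.14%
print(f"Monte Carlo v5->v6: {pct_change(mc['v5'], mc['v6']):.2f}%")  # ≈ -8.00%

# Device AI scores in points (higher is better)
ai = {"v4": 6726, "v5": 7996, "v6": 11436}
print(f"Device AI v4->v5: {pct_change(ai['v4'], ai['v5']):.2f}%")    # ≈ +18.88%
print(f"Device AI v5->v6: {pct_change(ai['v5'], ai['v6']):.2f}%")    # ≈ +43.02%
```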

Next steps and final thoughts

The public preview of the new Intel-based SKUs has already produced very promising benchmark results, with significant performance gains over the previous D-series generations that are widely used in FSI scenarios.

It is important to note that custom code or purchased libraries may exhibit different characteristics than the selected benchmark. Therefore, we recommend that you verify performance indicators with your own settings.

In this benchmarking setup, we have not disabled hyperthreading on the CPU, so the available cores are exposed as virtual cores. If you are interested in this scenario, please contact the author for more information.

Azure also offers a wide range of VM families to suit different needs, including F, FX, Fa, D, Da, E, Ea, and specialized HPC SKUs such as HC and HB VMs.

Dedicated validation based on individual code/workload is also recommended here to ensure that the most appropriate SKU is selected for the task at hand.




