Benchmarking 6th Gen Intel-based Dv6 (preview) VM SKUs for HPC Workloads in Financial Services

by info.odysseyx@gmail.com, October 28, 2024

Introduction

In the rapidly changing world of financial services, high-performance computing (HPC) in the cloud has become indispensable. From instrument pricing and risk assessment to portfolio optimization and regulatory workloads such as CVA and FRTB, the flexibility and scalability of cloud deployments are transforming the industry. Unlike traditional HPC systems that depend on complex parallelization frameworks (e.g., MPI over InfiniBand networking), many financial calculations run efficiently on a general-purpose SKU in Azure. Depending on the code used to perform the calculations, many implementations make use of vendor-specific optimizations such as Intel's AVX-512.

With the recent public preview announcement of the 6th generation Intel-based Dv6 VMs (see here), this article looks at the performance evolution across three generations of the D32ds size, from D32ds_v4 to D32ds_v6. It follows a testing methodology similar to the January 2023 article "Benchmarking Azure HPC SKUs for Financial Services Workloads" (here).

The official announcement states that the upcoming Dv6 series (currently in preview) offers significant improvements over the previous Dv5 generation. The main points are:

- Up to 27% higher vCPU performance and a 3x larger L3 cache compared to the previous generation Intel Dl/D/Ev5 VMs.
- Support for up to 192 vCPUs and more than 18 GiB of memory.
- Azure Boost, providing up to 400,000 IOPS and 12 GB/s remote storage throughput, as well as up to 200 Gbps VM network bandwidth.
- 46% larger local SSD capacity and more than 3x higher read IOPS.
- NVMe interface for both local and remote disks.
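Whether a given VM actually exposes AVX-512 to the guest can be verified directly from the OS. The following is a generic check for any Linux-based Azure VM (a minimal sketch, not specific to the SKUs tested here):

```shell
# Check whether the AVX-512 foundation instruction set is exposed to the guest.
# /proc/cpuinfo lists one "flags" line per logical CPU; avx512f is the base flag.
if grep -q avx512f /proc/cpuinfo 2>/dev/null; then
  echo "AVX-512 foundation instructions available"
else
  echo "AVX-512 not exposed on this VM"
fi
```

Libraries built with vendor-specific code paths typically fall back to narrower vector widths when these flags are absent, which can materially change benchmark results.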
Note: Enhanced security through Total Memory Encryption (TME) technology is not enabled in the preview deployment and will be benchmarked once it becomes available.

Technical specifications of the three D32ds generations

| VM name | D32ds_v4 | D32ds_v5 | D32ds_v6 |
| --- | --- | --- | --- |
| Number of vCPUs | 32 | 32 | 32 |
| InfiniBand | Not applicable | Not applicable | Not applicable |
| Processor | Intel® Xeon® Platinum 8370C (Ice Lake) or Intel® Xeon® Platinum 8272CL (Cascade Lake) | Intel® Xeon® Platinum 8370C (Ice Lake) | Intel® Xeon® Platinum 8573C (Emerald Rapids) |
| Peak CPU frequency | 3.4 GHz | 3.5 GHz | 3.0 GHz |
| RAM per VM | 128 GiB | 128 GiB | 128 GiB |
| RAM per core | 4 GiB | 4 GiB | 4 GiB |
| Attached local disk | 1200 GiB SSD | 1200 GiB SSD | 440 GiB SSD |

Benchmarking setup

For the benchmarking setup, the Phoronix Test Suite (link) is used. We run two tests from the OpenBenchmarking.org test suite that specifically target quantitative finance workloads. The tests in the "Financial Suite" are divided into two groups, each running independent benchmarks. In addition to the suite of financial tests, we also ran AI-Benchmark to evaluate the evolution of AI inference capabilities across the three VM generations.

| Finance Bench | QuantLib | AI Benchmark |
| --- | --- | --- |
| Bonds OpenMP | Size XXS | Device inference score |
| Repo OpenMP | Size S | Device AI score |
| Monte Carlo OpenMP | | Device training score |

Software dependencies:

| Element | Version |
| --- | --- |
| OS image | Ubuntu Marketplace image: 24_04-lts |
| Phoronix Test Suite | 10.8.5 |
| QuantLib benchmark | 1.35-dev |
| Finance Bench benchmark | 2016-07-25 |
| AI Benchmark Alpha | 0.1.2 |
| Python | 3.12.3 |

To run the benchmark on a newly created D-series VM, run the following commands (after updating the installed packages to their latest versions):

```shell
git clone https://github.com/phoronix-test-suite/phoronix-test-suite.git
sudo apt-get install php-cli php-xml cmake
cd phoronix-test-suite
sudo ./install-sh
phoronix-test-suite benchmark finance
```

AI benchmark testing requires a few additional steps: a virtual environment for the additional Python packages has to be created, and the tensorflow and ai-benchmark packages installed.
```shell
sudo apt install python3 python3-pip python3-virtualenv
mkdir ai-benchmark && cd ai-benchmark
virtualenv virtualenv
source virtualenv/bin/activate
pip install tensorflow
pip install ai-benchmark
phoronix-test-suite benchmark ai-benchmark
```

Runtime and benchmarking results

The purpose of this article is to share the results of a set of benchmarks that closely match the use cases mentioned in the introduction. Since most of these use cases are primarily CPU-bound, we limited our benchmarks to D-series VMs. For memory-bound code that benefits from a higher memory-to-core ratio, the new Ev6 SKU may be a suitable option.

In the figure below you can see a representative benchmark run on a Dv6 VM, with almost 100% CPU utilization during execution. The individual runs of the Phoronix Test Suite, starting with Finance Bench and continuing with QuantLib, are clearly visible.

Figure 1: CPU utilization for the entire financial benchmark run.

| Benchmark | VM size | Start time | End time | Duration (hh:mm) | Minutes |
| --- | --- | --- | --- | --- | --- |
| Financial benchmarks | Standard_D32ds_v4 | 12:08 | 15:29 | 03:21 | 201.00 |
| Financial benchmarks | Standard_D32ds_v5 | 11:38 | 14:12 | 02:34 | 154.00 |
| Financial benchmarks | Standard_D32ds_v6 | 11:39 | 13:27 | 01:48 | 108.00 |

Finance Bench results (chart)
QuantLib results (chart)
AI Benchmark Alpha results (chart)

Discussion of results

The results show significant performance improvements in QuantLib across D32v4, D32v5, and D32v6. In particular, operations per second for size S increased by 47.18% from D32v5 to D32v6, while size XXS increased by 45.55%. Benchmark times for 'Repo OpenMP' and 'Bonds OpenMP' also decreased, indicating improved performance: the 'Repo OpenMP' time dropped by 18.72% from D32v4 to D32v5 and by 20.46% from D32v5 to D32v6, while the 'Bonds OpenMP' time dropped by 11.98% from D32v4 to D32v5 and by 18.61% from D32v5 to D32v6. For Monte Carlo OpenMP performance, D32v6 showed the best result at 51,927.04 ms, followed by D32v5 at 56,443.91 ms and D32v4 at 57,093.94 ms.
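The generation-over-generation percentages quoted here are plain relative differences of the reported values. For example, the Monte Carlo OpenMP runtime reduction from D32v5 to D32v6 can be recomputed from the raw timings above:

```shell
# Relative runtime reduction: (old - new) / old * 100,
# using the Monte Carlo OpenMP timings (ms) reported above.
awk 'BEGIN { v5 = 56443.91; v6 = 51927.04; printf "%.2f%%\n", (v5 - v6) / v5 * 100 }'
# prints 8.00%
```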
The runtime reduction was 1.14% from D32v4 to D32v5 and 8.00% from D32v5 to D32v6.

The AI Benchmark Alpha scores for device inference and training also improved significantly. The inference score increased by 15.22% from D32v4 to D32v5 and by 42.41% from D32v5 to D32v6. The training score increased by 21.82% from D32v4 to D32v5 and by 43.49% from D32v5 to D32v6. Finally, the Device AI score improved across generations, from 6,726 points on D32v4 to 7,996 points on D32v5 and 11,436 points on D32v6, an increase of 18.88% and 43.02%, respectively.

Next steps and final thoughts

The public preview of the new Intel SKUs has already shown very promising benchmarking results, with significant performance gains over the previous D-series generations that are widely used in FSI scenarios. It is important to note that custom code or purchased libraries may exhibit characteristics different from the selected benchmarks; we therefore recommend verifying performance indicators with your own setup. In this benchmarking setup we did not disable hyperthreading on the CPU, so the available cores are exposed as virtual cores. If you are interested in this scenario, please contact the author for more information.

Azure also offers a wide range of VM families to suit different needs, including F, FX, Fa, D, Da, E, Ea, and specialized HPC SKUs such as the HC and HB VMs. Dedicated validation based on the individual code/workload is recommended here as well, to ensure that the most appropriate SKU is selected for the task at hand.
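Related to the hyperthreading note above: the SMT topology a VM presents to the guest can be inspected with lscpu (a generic Linux check, not specific to the SKUs tested here); "Thread(s) per core: 2" indicates that hyperthreading is exposed.

```shell
# Show the CPU topology as seen by the guest OS (util-linux lscpu).
lscpu | grep -E '^(CPU\(s\):|Thread\(s\) per core:|Core\(s\) per socket:)'
```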