Benchmarking 6th Gen Intel-based Dv6 (preview) VM SKUs for HPC Workloads in Financial Services

by info.odysseyx@gmail.com, October 28, 2024

Introduction

In the rapidly changing world of financial services, high-performance computing (HPC) in the cloud has become indispensable. From instrument pricing and risk assessment to portfolio optimization and regulatory workloads such as CVA and FRTB, the flexibility and scalability of cloud deployments are transforming the industry. Unlike traditional HPC systems that depend on complex parallelization frameworks (e.g., MPI over InfiniBand networking), many financial calculations run efficiently on a general-purpose SKU in Azure. Depending on the code used to perform the calculations, many implementations make use of vendor-specific optimizations such as Intel's AVX-512.

With the recent public preview announcement of the 6th generation Intel-based Dv6 VMs (see here), this article looks at the performance evolution across three generations of the D32ds size, from D32ds_v4 to D32ds_v6. It follows a testing methodology similar to the January 2023 article "Benchmarking Azure HPC SKUs for Financial Services Workloads" (here).

The official announcement states that the upcoming Dv6 series (currently in preview) offers significant improvements over the previous Dv5 generation. The main points are:

- Up to 27% higher vCPU performance and a 3x larger L3 cache compared to the previous generation Intel Dl/D/Ev5 VMs.
- Support for up to 192 vCPUs and more than 18 GiB of memory.
- Azure Boost, providing up to 400,000 IOPS and 12 GB/s remote storage throughput, as well as up to 200 Gbps VM network bandwidth.
- 46% larger local SSD capacity and more than 3x higher read IOPS.
- NVMe interface for both local and remote disks.
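Whether a given VM actually exposes AVX-512 to the guest can be verified directly from the OS. The following is a generic check for any Linux-based Azure VM (a minimal sketch, not specific to the SKUs tested here):

```shell
# Check whether the AVX-512 foundation instruction set is exposed to the guest.
# /proc/cpuinfo lists one "flags" line per logical CPU; avx512f is the base flag.
if grep -q avx512f /proc/cpuinfo 2>/dev/null; then
  echo "AVX-512 foundation instructions available"
else
  echo "AVX-512 not exposed on this VM"
fi
```

Libraries built with vendor-specific code paths typically fall back to narrower vector widths when these flags are absent, which can materially change benchmark results.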
Note: Enhanced security through Total Memory Encryption (TME) technology is not enabled in the preview deployment and will be benchmarked once it becomes available.

Technical specifications of the three D32ds generations

| VM name | D32ds_v4 | D32ds_v5 | D32ds_v6 |
| --- | --- | --- | --- |
| Number of vCPUs | 32 | 32 | 32 |
| InfiniBand | Not applicable | Not applicable | Not applicable |
| Processor | Intel® Xeon® Platinum 8370C (Ice Lake) or Intel® Xeon® Platinum 8272CL (Cascade Lake) | Intel® Xeon® Platinum 8370C (Ice Lake) | Intel® Xeon® Platinum 8573C (Emerald Rapids) |
| Peak CPU frequency | 3.4 GHz | 3.5 GHz | 3.0 GHz |
| RAM per VM | 128 GiB | 128 GiB | 128 GiB |
| RAM per core | 4 GiB | 4 GiB | 4 GiB |
| Attached local disk | 1200 GiB SSD | 1200 GiB SSD | 440 GiB SSD |

Benchmarking setup

For the benchmarking setup, the Phoronix Test Suite (link) is used. We run two tests from the OpenBenchmarking.org test suite that specifically target quantitative finance workloads. The tests in the "Financial Suite" are divided into two groups, each running independent benchmarks. In addition to the suite of financial tests, we also ran AI-Benchmark to evaluate the evolution of AI inference capabilities across the three VM generations.

| Finance Bench | QuantLib | AI Benchmark |
| --- | --- | --- |
| Bonds OpenMP | Size XXS | Device inference score |
| Repo OpenMP | Size S | Device AI score |
| Monte Carlo OpenMP | | Device training score |

Software dependencies:

| Element | Version |
| --- | --- |
| OS image | Ubuntu Marketplace image: 24_04-lts |
| Phoronix Test Suite | 10.8.5 |
| QuantLib benchmark | 1.35-dev |
| Finance Bench benchmark | 2016-07-25 |
| AI Benchmark Alpha | 0.1.2 |
| Python | 3.12.3 |

To run the benchmark on a newly created D-series VM, run the following commands (after updating the installed packages to their latest versions):

```shell
git clone https://github.com/phoronix-test-suite/phoronix-test-suite.git
sudo apt-get install php-cli php-xml cmake
cd phoronix-test-suite
sudo ./install-sh
phoronix-test-suite benchmark finance
```

AI benchmark testing requires a few additional steps: a virtual environment for the additional Python packages has to be created, and the tensorflow and ai-benchmark packages installed.
```shell
sudo apt install python3 python3-pip python3-virtualenv
mkdir ai-benchmark && cd ai-benchmark
virtualenv virtualenv
source virtualenv/bin/activate
pip install tensorflow
pip install ai-benchmark
phoronix-test-suite benchmark ai-benchmark
```

Runtime and benchmarking results

The purpose of this article is to share the results of a set of benchmarks that closely match the use cases mentioned in the introduction. Since most of these use cases are primarily CPU-bound, we limited our benchmarks to D-series VMs. For memory-bound code that benefits from a higher memory-to-core ratio, the new Ev6 SKU may be a suitable option.

In the figure below you can see a representative benchmark run on a Dv6 VM, with almost 100% CPU utilization during execution. The individual runs of the Phoronix Test Suite, starting with Finance Bench and continuing with QuantLib, are clearly visible.

Figure 1: CPU utilization for the entire financial benchmark run.

| Benchmark | VM size | Start time | End time | Duration (hh:mm) | Minutes |
| --- | --- | --- | --- | --- | --- |
| Financial benchmarks | Standard_D32ds_v4 | 12:08 | 15:29 | 03:21 | 201.00 |
| Financial benchmarks | Standard_D32ds_v5 | 11:38 | 14:12 | 02:34 | 154.00 |
| Financial benchmarks | Standard_D32ds_v6 | 11:39 | 13:27 | 01:48 | 108.00 |

Finance Bench results (chart)
QuantLib results (chart)
AI Benchmark Alpha results (chart)

Discussion of results

The results show significant performance improvements in QuantLib across D32v4, D32v5, and D32v6. In particular, operations per second for size S increased by 47.18% from D32v5 to D32v6, while size XXS increased by 45.55%. Benchmark times for 'Repo OpenMP' and 'Bonds OpenMP' also decreased, indicating improved performance: the 'Repo OpenMP' time dropped by 18.72% from D32v4 to D32v5 and by 20.46% from D32v5 to D32v6, while the 'Bonds OpenMP' time dropped by 11.98% from D32v4 to D32v5 and by 18.61% from D32v5 to D32v6. For Monte Carlo OpenMP performance, D32v6 showed the best result at 51,927.04 ms, followed by D32v5 at 56,443.91 ms and D32v4 at 57,093.94 ms.
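The generation-over-generation percentages quoted here are plain relative differences of the reported values. For example, the Monte Carlo OpenMP runtime reduction from D32v5 to D32v6 can be recomputed from the raw timings above:

```shell
# Relative runtime reduction: (old - new) / old * 100,
# using the Monte Carlo OpenMP timings (ms) reported above.
awk 'BEGIN { v5 = 56443.91; v6 = 51927.04; printf "%.2f%%\n", (v5 - v6) / v5 * 100 }'
# prints 8.00%
```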
The runtime reduction was 1.14% from D32v4 to D32v5 and 8.00% from D32v5 to D32v6.

The AI Benchmark Alpha scores for device inference and training also improved significantly. The inference score increased by 15.22% from D32v4 to D32v5 and by 42.41% from D32v5 to D32v6. The training score increased by 21.82% from D32v4 to D32v5 and by 43.49% from D32v5 to D32v6. Finally, the Device AI score improved across generations, from 6,726 points on D32v4 to 7,996 points on D32v5 and 11,436 points on D32v6, an increase of 18.88% and 43.02%, respectively.

Next steps and final thoughts

The public preview of the new Intel SKUs has already shown very promising benchmarking results, with significant performance gains over the previous D-series generations that are widely used in FSI scenarios. It is important to note that custom code or purchased libraries may exhibit characteristics different from the selected benchmarks; we therefore recommend verifying performance indicators with your own setup. In this benchmarking setup we did not disable hyperthreading on the CPU, so the available cores are exposed as virtual cores. If you are interested in this scenario, please contact the author for more information.

Azure also offers a wide range of VM families to suit different needs, including F, FX, Fa, D, Da, E, Ea, and specialized HPC SKUs such as the HC and HB VMs. Dedicated validation based on the individual code/workload is recommended here as well, to ensure that the most appropriate SKU is selected for the task at hand.
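Related to the hyperthreading note above: the SMT topology a VM presents to the guest can be inspected with lscpu (a generic Linux check, not specific to the SKUs tested here); "Thread(s) per core: 2" indicates that hyperthreading is exposed.

```shell
# Show the CPU topology as seen by the guest OS (util-linux lscpu).
lscpu | grep -E '^(CPU\(s\):|Thread\(s\) per core:|Core\(s\) per socket:)'
```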