Hewlett-Packard, 3000 Waterview Parkway, Richardson, TX 75080, USA
SIMULIA, 166 Valley Street, Providence, RI 02909, USA
Contents

Introduction
Abaqus FEA Software on Dual-Core HP ProLiant Servers
  Abaqus Models
  System Information
  Abaqus/Standard S4b
  Abaqus/Standard S6
  Abaqus/Explicit E2 (Double Precision)
  Summary of Abaqus/Standard on Dual-core HP ProLiant Servers
  Summary of Abaqus/Explicit on Dual-core HP ProLiant Servers
Advances in Interconnects
Abaqus/Standard & Abaqus/Explicit Performance on Dual-Core HP ProLiant and Integrity Servers
  Abaqus/Standard S4b
  Abaqus/Standard S6
  Abaqus/Explicit E2 (Double Precision)
Conclusion
Acknowledgments
For more information
Introduction
Over the past couple of years, high performance computing (HPC) applications, such as Abaqus FEA software from the SIMULIA brand of Dassault Systèmes, have begun to take advantage of multi-core processor technology to achieve more computing capacity. Dual-core processor servers are becoming an increasingly popular platform for running such applications in a cluster environment. This paper explores the behavior of Abaqus/Standard and Abaqus/Explicit running in parallel on HP ProLiant servers based on Intel Xeon and AMD Opteron dual-core processors, using InfiniBand and Gigabit Ethernet interconnects. We also take a look at advances in InfiniBand technology using Abaqus/Explicit. The paper concludes with a runtime parallel performance comparison of specific datasets on dual-core HP ProLiant and Integrity servers.
System Information
The results presented in this paper were obtained on similarly configured clusters of servers located in the HP High Performance Computing Division and described in the following table:
Table 1. System Configurations

                          HP ProLiant DL145 G2               HP ProLiant DL140 G3               HP ProLiant BL460c
Processor                 2.6 GHz dual-core AMD Opteron      3.0 GHz dual-core Intel Xeon       3.0 GHz dual-core Intel Xeon
Memory/node               8 GB                               8 GB                               8 GB
Processors/node           2                                  2                                  2
Cores/node                4                                  4                                  4
Local disk                2 SCSI disks                       2 SAS disks                        1 SAS disk
Interconnect              Gigabit Ethernet, SDR InfiniBand   Gigabit Ethernet, SDR InfiniBand   Gigabit Ethernet, DDR (Lx) InfiniBand
Processor data cache      2 MB (1 MB per core)               4 MB (shared by 2 cores)           4 MB (shared by 2 cores)
Peak floating-point rate  5.2 GFLOP/sec                      12 GFLOP/sec                       12 GFLOP/sec
Linux OS                  RHEL 4                             RHEL 4                             RHEL 4
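The peak floating-point rates are per core and, assuming 2 double-precision FLOP per cycle for the Opteron core and 4 for the Xeon core, follow directly from the clock rates: 2.6 GHz × 2 = 5.2 GFLOP/sec and 3.0 GHz × 4 = 12 GFLOP/sec.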
Abaqus/Standard S4b
In the S4b model, the analysis is compute-bound: more time is spent in computation (performing operations such as the matrix-matrix multiply, DGEMM) than in communication (moving data between processes). For this reason, a high-speed interconnect such as InfiniBand is only 5% to 14% faster than Gigabit Ethernet on the HP ProLiant DL145 G2 Opteron cluster. In addition, Gigabit Ethernet is still scaling at 64 cores. The following figure shows the elapsed time of S4b running on the Opteron cluster on 8, 16, 32, and 64 cores.
[Figure: Elapsed time (wallclock seconds) of S4b on the HP ProLiant DL145 G2, Opteron DC 2.6 GHz, comparing IB-SDR and GigE at 8, 16, 32, and 64 cores; a chart annotation marks InfiniBand as 5% faster.]
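To make the compute-bound character of S4b concrete, the dominant kernel is an ordinary BLAS level-3 call. The sketch below is a minimal, self-contained illustration of such a call through the CBLAS interface; the matrix size and the choice of CBLAS (rather than whatever BLAS library Abaqus links internally) are assumptions for illustration only.

```c
/* Minimal CBLAS DGEMM sketch: computes C = 1.0*A*B + 0.0*C for
 * n-by-n double-precision matrices.  One such call performs about
 * 2*n^3 floating-point operations, which is why DGEMM throughput
 * tracks the peak FLOP rates listed in Table 1.
 * Build with, e.g.: cc -O2 dgemm_demo.c -lopenblas
 */
#include <cblas.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1024;   /* illustrative size, not taken from S4b */
    double *A = calloc((size_t)n * n, sizeof *A);
    double *B = calloc((size_t)n * n, sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);

    /* Row-major, no transposes: C <- alpha*A*B + beta*C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    free(A); free(B); free(C);
    return 0;
}
```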
However, running S4b on the HP ProLiant DL140 G3 Intel Xeon cluster tells a different story. Although S4b is a highly compute-intensive problem, the Xeon spends less time in computation than the Opteron, making it more sensitive to communication speed. S4b runs 4% to 25% faster on InfiniBand than on Gigabit Ethernet. The following figure shows the elapsed time of S4b running on the Xeon-based cluster with 8, 16, 32, and 64 cores.
[Figure: Elapsed time (wallclock seconds) of S4b on the HP ProLiant DL140 G3, Intel Xeon DC 3.0 GHz, comparing IB-SDR and GigE at 8, 16, 32, and 64 cores; chart annotations mark InfiniBand as 4% faster at the low end and 25% faster at higher core counts.]
Abaqus/Standard S6
In the S6 model, the analysis is more communication-bound than S4b. Because more effort goes into forming the matrices in S6, more than twice as many messages are passed as in S4b, resulting in more communication per unit of computation. In this scenario, a high-speed interconnect such as InfiniBand makes a bigger difference on the Opteron and is 14% to 21% faster than Gigabit Ethernet. The following figure shows the elapsed time of S6 running on the Opteron cluster using 8, 16, 32, and 64 cores.
[Figure: Elapsed time (wallclock seconds) of S6 on the HP ProLiant DL145 G2, Opteron DC 2.6 GHz, comparing IB-SDR and GigE at 8, 16, 32, and 64 cores.]
Running S6 on the Xeon cluster tells a similar story, with one difference. Up to 32 cores, the Xeon shows behavior similar to the Opteron, but beyond that the benefits of Single Data Rate (SDR) InfiniBand start to taper off. Although there is more communication than computation in S6, the Xeon finishes its compute portion fast enough to saturate SDR InfiniBand. Double Data Rate (DDR) InfiniBand, with its increased bandwidth, is beneficial for these medium-sized Standard messages, making the Xeon's behavior once again similar to the Opteron's. The following figure shows the elapsed time of S6 running on the Xeon cluster with 8, 16, 32, and 64 cores.
[Figure: Elapsed time (wallclock seconds) of S6 on the HP ProLiant DL140 G3, Xeon DC 3.0 GHz (IB-SDR and GigE) and the HP ProLiant BL460c, Xeon DC 3.0 GHz (IB-DDR (Lx)) at 8, 16, 32, and 64 cores; a chart annotation marks InfiniBand as 31% faster.]
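A first-order cost model helps explain both the SDR taper and the DDR recovery: the time to move one message is roughly latency + size/bandwidth, so for medium-sized messages the bandwidth term dominates. The sketch below applies this model using the SDR and DDR latency and bandwidth figures measured later in this paper (Table 2); the 256 KB message size is an assumption chosen only to represent a "medium-sized" Abaqus/Standard message, not a measured value.

```c
/* Illustrative first-order cost model for moving one message:
 * time = latency + bytes / bandwidth.  For small messages the
 * latency term dominates; for medium and large messages the
 * bandwidth term does, which is where DDR's higher bandwidth
 * pays off.  SDR/DDR figures are taken from Table 2. */
#include <stdio.h>

static double xfer_usec(double lat_usec, double bytes, double mb_per_sec)
{
    /* convert bytes / (MB/sec) into microseconds and add latency */
    return lat_usec + (bytes / (mb_per_sec * 1e6)) * 1e6;
}

int main(void)
{
    double bytes = 256.0 * 1024.0;  /* assumed "medium" message size */
    printf("SDR: %.0f usec\n", xfer_usec(4.0, bytes, 900.0));   /* ~295 usec */
    printf("DDR: %.0f usec\n", xfer_usec(3.2, bytes, 1360.0));  /* ~196 usec */
    return 0;
}
```

Under these assumptions, DDR moves the message roughly a third faster, almost entirely from the bandwidth term rather than the small latency difference.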
Advances in Interconnects
Advances in interconnect performance are ongoing. For example, improvements in Mellanox Technologies' InfiniBand cards yield improved scalability. To demonstrate this, we measured the performance and scalability of Abaqus/Explicit Version 6.6 on a compute cluster, comparing the following Mellanox InfiniBand cards:

SDR - Single Data Rate
DDR (Lx) - Double Data Rate
ConnectX

Each server node in the cluster was a ProLiant DL140 G3 with 2 dual-core Xeon 3.0 GHz processors and 8 GB of memory per node, running RHEL 4. Using the industry-standard ping-pong benchmark, the following table shows the measured minimum latency and measured maximum bandwidth of the Mellanox InfiniBand cards. These measurements are configuration-dependent, so they may vary from cluster to cluster.
Table 2. Measured latency and bandwidth of Mellanox InfiniBand cards

Mellanox InfiniBand Card      Minimum Latency   Maximum Bandwidth
Single Data Rate (SDR)        4 µsec            900 MB/sec
Double Data Rate (DDR (Lx))   3.2 µsec          1360 MB/sec
ConnectX
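For reference, a ping-pong measurement of the kind summarized in Table 2 can be sketched in a few lines of MPI. This is a minimal illustration, not the exact benchmark used for the table; the iteration count and default message size are arbitrary choices.

```c
/* Minimal MPI ping-pong sketch (illustrative, not the exact benchmark
 * behind Table 2).  Rank 0 bounces a message off rank 1; half the
 * average round-trip time approximates one-way latency for small
 * messages, and bytes / one-way time approximates bandwidth for
 * large ones.  Build with, e.g., mpicc -O2 pingpong.c; run on 2 ranks.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nbytes = (argc > 1) ? atoi(argv[1]) : 8;  /* message size */
    const int iters = 1000;
    char *buf = malloc(nbytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way = (MPI_Wtime() - t0) / (2.0 * iters);  /* seconds */

    if (rank == 0)
        printf("%d bytes: %.2f usec one-way, %.1f MB/sec\n",
               nbytes, one_way * 1e6, (nbytes / one_way) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```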
We ran all six Abaqus/Explicit datasets in single precision. The performance and scalability of each dataset was affected to a varying degree by the type of InfiniBand card used; some datasets were affected more than others. Here we show the results for E2 and E5, the most and least affected datasets, respectively:

E2 - a simplified model of a cell phone impacting a fixed rigid floor
E5 - a stiffened steel plate subjected to a high-intensity blast load

For a 64-core job, the following table shows the number of messages passed and the average message size for E2 and E5.
Table 3. Messages passed and average message size for E2 and E5

Explicit Dataset   Number of Messages   Average Message Size
E2                 1.2 billion          476 bytes
E5                 700 million          189 bytes
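A back-of-envelope calculation shows the scale of the communication involved: 1.2 billion messages averaging 476 bytes amount to roughly 570 GB of aggregate payload for E2, versus roughly 130 GB for E5 (700 million × 189 bytes), so E2 both moves more data and pays the per-message latency cost far more often.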
The following figure shows the performance of dataset E2. Here we see that the type of InfiniBand card used can make a big difference in runtime for E2. Typical of Abaqus/Explicit jobs, E2 passes a relatively large number of small messages. From 8 to 64 cores, we see a 1.2% to 46% difference between the slowest and fastest runtimes, depending on the card used.
[Figure: Runtime of dataset E2 on SDR, DDR (Lx), and ConnectX InfiniBand cards, 8 to 64 cores.]
The following figure shows the performance of dataset E5. In this example, the type of InfiniBand card used makes a smaller difference in runtime for E5 than for E2. E5 passes about half as many small messages as E2, so it spends much less time in communication. From 8 to 64 cores, we see a 1.3% to 20% difference between the slowest and fastest runtimes, depending on the card used.
[Figure: Runtime (wallclock seconds) of dataset E5 on SDR, DDR (Lx), and ConnectX InfiniBand cards, 8 to 64 cores.]
In conclusion, these examples demonstrate a wide range of performance gains across different generations of interconnect technology. The size of the gain is job-dependent.
Abaqus/Standard & Abaqus/Explicit Performance on Dual-Core HP ProLiant and Integrity Servers
Table 4. Clustered system configurations

                          HP ProLiant DL145 G2            HP ProLiant DL140 G3           HP ProLiant BL460c             HP Integrity rx2660
Processor                 2.6 GHz dual-core AMD Opteron   3.0 GHz dual-core Intel Xeon   3.0 GHz dual-core Intel Xeon   1.6 GHz dual-core Intel Itanium2
Memory/node               8 GB                            8 GB                           8 GB                           8 GB
Processors/node           2                               2                              2                              2
Cores/node                4                               4                              4                              4
Local disk                2 SCSI disks                    2 SAS disks                    1 SAS disk                     5 SCSI disks
Interconnect              SDR InfiniBand                  SDR InfiniBand                 DDR (Lx) InfiniBand            SDR InfiniBand
Processor data cache      2 MB (1 MB per core)            4 MB (shared by 2 cores)       4 MB (shared by 2 cores)       18 MB (9 MB per core)
Peak floating-point rate  5.2 GFLOP/sec                   12 GFLOP/sec                   12 GFLOP/sec                   6.4 GFLOP/sec
Linux OS                  RHEL 4                          RHEL 4                         RHEL 4                         RHEL 4
Abaqus/Standard S4b
The following figure shows the elapsed time of S4b on dual-core HP ProLiant and Integrity clustered systems. Xeon and Integrity are the fastest, thanks to their higher peak floating-point rates and highly tuned DGEMM implementations.
[Figure: Elapsed time (wallclock seconds) of S4b on the HP Integrity rx2660 (Itanium2 DC 1.6 GHz, IB-SDR), HP ProLiant DL145 G2 (Opteron DC 2.6 GHz, IB-SDR), and HP ProLiant DL140 G3 (Xeon DC 3.0 GHz, IB-SDR) at 1, 8, 16, 32, and 64 cores.]
Abaqus/Standard S6
The following figure shows the elapsed time of S6 on dual-core HP ProLiant and Integrity clustered systems. Xeon is the fastest.
[Figure: Elapsed time (wallclock seconds) of S6 on the HP ProLiant DL145 G2 (Opteron DC 2.6 GHz, IB-SDR), HP ProLiant BL460c (Xeon DC 3.0 GHz, IB-DDR (Lx)), and HP Integrity rx2660 (Itanium2 DC 1.6 GHz, IB-SDR) versus number of cores.]
Abaqus/Explicit E2 (Double Precision)

[Figure: Elapsed time (wallclock seconds) of E2 on the HP Integrity rx2660 (Itanium2 DC 1.6 GHz, IB-SDR), HP ProLiant DL145 G2 (Opteron DC 2.6 GHz, IB-SDR), and HP ProLiant DL140 G3 (Xeon DC 3.0 GHz, IB-SDR) at 1, 8, 16, 32, and 64 cores.]
Conclusion
To select the best components for a compute cluster, it is important to study the performance characteristics of your workload. The best choice balances the performance of the server with the amount of memory in the server, the performance of the cluster network, and the I/O performance.

For Abaqus/Standard compute-bound jobs on dual-core Xeon and Opteron clusters, the performance characteristics differ by platform. A high-speed interconnect, such as InfiniBand, is beneficial to varying degrees at different core counts.

For Abaqus/Standard communication-bound jobs, both Xeon and Opteron are bound by the communication speed of the cluster network. Xeon, however, is more sensitive and can take better advantage of faster interconnects.

For any Abaqus/Standard job on any platform, the SIMULIA white paper of May 2006 (Running ABAQUS/Standard Version 6.6 on Compute Clusters) reminds us that memory and I/O performance must be taken into consideration as well: for best performance, jobs must fit in the memory available per node, and local disk is necessary for scratch I/O.

For Abaqus/Explicit jobs on dual-core Xeon and Opteron clusters, the performance characteristics are similar, and a high-speed interconnect such as InfiniBand is clearly beneficial. Interconnect performance improvements are ongoing and worth investigating when running either Abaqus/Standard or Abaqus/Explicit.

Early in 2007, the HP High Performance Computing Division launched its Multi-Core Optimization Program. The goal of this program is to investigate and implement techniques that improve the performance of HPC applications on HP servers that use multi-core processors. This analysis of Abaqus FEA software performance is part of that HPCD program.
Acknowledgments
The idea for this project originated in HP's High Performance Computing Division. It is one of the results of HP's Multi-Core Optimization Program, which seeks ways to improve total application performance and per-core application performance on servers using multi-core processors.
© 2007 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

AMD and AMD Opteron are trademarks of Advanced Micro Devices, Inc. Intel and Xeon are registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Itanium is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation. Linux is a U.S. registered trademark of Linus Torvalds. SIMULIA is a registered trademark of Dassault Systèmes or its subsidiaries in the US and/or other countries.

4AA1-6095ENW, November 2007