
Abaqus FEA Software on Multi-Core HP Servers

Sharon Shaw (1), Matt Dunbar (2)

(1) Hewlett-Packard, 3000 Waterview Parkway, Richardson, TX 75080, USA
(2) SIMULIA, 166 Valley Street, Providence, RI 02909, USA

Introduction
Abaqus FEA Software on Dual-Core HP ProLiant Servers
    Abaqus Models
    System Information
    Abaqus/Standard S4b
    Abaqus/Standard S6
    Abaqus/Explicit E2 (Double Precision)
    Summary of Abaqus/Standard on Dual-core HP ProLiant Servers
    Summary of Abaqus/Explicit on Dual-core HP ProLiant Servers
Advances in Interconnects
Abaqus/Standard & Abaqus/Explicit Performance on Dual-Core HP ProLiant and Integrity Servers
    Abaqus/Standard S4b
    Abaqus/Standard S6
    Abaqus/Explicit E2 (Double Precision)
Conclusion
Acknowledgments
For more information

Introduction
Over the past couple of years, high performance computing (HPC) applications such as Abaqus FEA software from the SIMULIA brand of Dassault Systèmes have begun to take advantage of multi-core processor technology to gain computing capacity. Servers with dual-core processors are becoming an increasingly popular platform for running such applications in a cluster environment. This paper explores the behavior of Abaqus/Standard and Abaqus/Explicit running in parallel on HP ProLiant servers based on Intel Xeon and AMD Opteron dual-core processors, using InfiniBand and Gigabit Ethernet interconnects. We also look at advances in InfiniBand technology using Abaqus/Explicit. The paper concludes with a runtime parallel performance comparison of specific datasets on dual-core HP ProLiant and Integrity servers.

Abaqus FEA Software on Dual-Core HP ProLiant Servers


Abaqus Models
We used the following Abaqus/Standard and Abaqus/Explicit Version 6.7 benchmark datasets to obtain performance data:

S4b - 5.2 MDOF powertrain model; a mildly nonlinear static analysis that simulates bolting a cylinder head onto an engine block
S6 - 700 KDOF model; a strongly nonlinear static analysis that determines the footprint of an automobile tire
E2 - simplified model of a cell phone impacting a fixed rigid floor

More information on these models can be obtained from the Abaqus Version 6.7 Performance Data page on the SIMULIA website: http://www.simulia.com/support/v67/v67_performance.html.

System Information
The results presented in this paper were obtained on similarly configured clusters of servers located in the HP High Performance Computing Division, described in the following table:

Table 1. System Configurations

                          HP ProLiant DL145 G2           HP ProLiant DL140 G3          HP ProLiant BL460c
Processor                 2.6 GHz dual-core AMD Opteron  3.0 GHz dual-core Intel Xeon  3.0 GHz dual-core Intel Xeon
Memory/Node               8 GB                           8 GB                          8 GB
Processors/Node           2                              2                             2
Cores/Node                4                              4                             4
Local Disk                2 SCSI disks                   2 SAS disks                   1 SAS disk
Interconnect              Gigabit Ethernet,              Gigabit Ethernet,             Gigabit Ethernet,
                          SDR InfiniBand                 SDR InfiniBand                DDR (Lx) InfiniBand
Processor Data Cache      2 MB (1 MB per core)           4 MB (shared by 2 cores)      4 MB (shared by 2 cores)
Peak Floating Point Rate  5.2 GFLOP/sec                  12 GFLOP/sec                  12 GFLOP/sec
Linux OS                  RHEL 4                         RHEL 4                        RHEL 4

Abaqus/Standard S4b
In the S4b model, the analysis is compute-bound: more time is spent in computation, performing operations such as matrix-matrix multiply (DGEMM), than in communication (moving data around). For this reason, a high-speed interconnect such as InfiniBand is only 5% to 14% faster than Gigabit Ethernet on the HP ProLiant DL145 G2 Opteron cluster. In addition, Gigabit Ethernet is still scaling at 64 cores. The following figure shows the elapsed time of S4b running on the Opteron cluster on 8, 16, 32, and 64 cores; a small DGEMM timing sketch follows the figure.

Figure 1. Abaqus/Standard S4b on HP ProLiant servers based on AMD Opteron processors.
[Chart: wallclock time in seconds vs. number of cores (8, 16, 32, 64) on the HP ProLiant DL145 G2 (Opteron DC 2.6 GHz), comparing IB-SDR and GigE. Annotations mark InfiniBand as 5%, 12%, and 14% faster than GigE as the core count grows.]
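As an aside, the standalone sketch below times a single BLAS DGEMM call and reports the achieved floating point rate, illustrating the kernel class that dominates S4b. This is not Abaqus code: it assumes a CBLAS implementation such as OpenBLAS or Intel MKL (link with -lopenblas, for example), and the matrix size n = 2048 is an arbitrary choice. Comparing the achieved GFLOP/sec against the peak rates in Table 1 gives a feel for how compute-bound a given platform is.

/* Standalone DGEMM timing sketch (not Abaqus code). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void) {
    const int n = 2048;
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (long i = 0; i < (long)n * n; i++) {
        A[i] = 1.0 / (i + 1);   /* arbitrary nonzero data */
        B[i] = 2.0 / (i + 1);
        C[i] = 0.0;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gflops = 2.0 * n * n * n / secs / 1e9;  /* DGEMM does 2n^3 flops */
    printf("n = %d: %.2f sec, %.1f GFLOP/sec\n", n, secs, gflops);

    free(A); free(B); free(C);
    return 0;
}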

However, running S4b on the HP ProLiant DL140 G3 Intel Xeon cluster tells a different story. Although S4b is a highly compute-intensive problem, the Xeon spends less time in computation than the Opteron, making it more sensitive to communication speed. S4b runs 4% to 25% faster on InfiniBand than on Gigabit Ethernet. The following figure shows the elapsed time of S4b running on the Xeon-based cluster with 8, 16, 32, and 64 cores.

Figure 2. Abaqus/Standard S4b on HP ProLiant servers based on Intel Xeon processors.
[Chart: wallclock time in seconds vs. number of cores (8, 16, 32, 64) on the HP ProLiant DL140 G3 (Intel Xeon DC 3.0 GHz), comparing IB-SDR and GigE. Annotations mark InfiniBand as 4%, 25%, and 25% faster than GigE as the core count grows.]

Abaqus/Standard S6
In the S6 model, the analysis is more communication-bound than S4b. Because more effort goes into forming the matrices in S6, more than twice as many messages are passed as in S4b, resulting in more communication per unit of computation. In this scenario, a high-speed interconnect such as InfiniBand makes a bigger difference on the Opteron and is 14% to 21% faster than Gigabit Ethernet. The following figure shows the elapsed time of S6 running on the Opteron cluster using 8, 16, 32, and 64 cores.

Figure 3. Abaqus/Standard S6 on HP ProLiant servers based on AMD Opteron processors.
[Chart: wallclock time in seconds vs. number of cores (8, 16, 32, 64) on the HP ProLiant DL145 G2 (Opteron DC 2.6 GHz), comparing IB-SDR and GigE. Annotations mark InfiniBand as 14%, 17%, and 21% faster than GigE as the core count grows.]

Running S6 on the Xeon cluster tells a similar story with a twist. Up to 32 cores, the Xeon shows behavior similar to the Opteron's, but beyond that the benefits of Single Data Rate (SDR) InfiniBand begin to taper off. Although there is more communication than computation in S6, the Xeon finishes its compute portion fast enough to saturate SDR InfiniBand. Double Data Rate (DDR) InfiniBand, with its increased bandwidth, is beneficial for these medium-sized Abaqus/Standard messages, making the Xeon's behavior once again similar to the Opteron's (a back-of-the-envelope transfer-time estimate follows Figure 4). The following figure shows the elapsed time of S6 running on the Xeon cluster with 8, 16, 32, and 64 cores.

Figure 4. Abaqus/Standard S6 on HP ProLiant servers based on Intel Xeon processors.
[Chart: wallclock time in seconds vs. number of cores (8, 16, 32, 64) comparing the HP ProLiant DL140 G3 (Xeon DC 3.0 GHz) on IB-SDR and GigE with the HP ProLiant BL460c (Xeon DC 3.0 GHz) on IB-DDR (Lx). Annotations mark SDR as 12%, 15%, and 9% faster than GigE, and DDR as 22% faster, as the core count grows.]
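To see why the extra bandwidth helps, consider a back-of-the-envelope transfer-time estimate, time = latency + size / bandwidth, using the InfiniBand figures measured later in this paper (Table 2) and a hypothetical 1 MB message standing in for the medium-sized Abaqus/Standard messages: SDR takes roughly 4 μsec + 1 MB / 900 MB/sec ≈ 1.1 msec, while DDR takes roughly 3.2 μsec + 1 MB / 1360 MB/sec ≈ 0.74 msec, about 1.5 times faster. At this message size the latency term is negligible and wire bandwidth dominates, consistent with the DDR gains shown in Figure 4.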

Abaqus/Explicit E2 (Double Precision)


Whereas Abaqus/Standard passes data in relatively large messages, Abaqus/Explicit passes a relatively large number of small messages. In the E2 model run in double precision on 64 cores, 1.2 billion messages with an average size of about 1 KB are passed. For this reason, a high-speed interconnect such as InfiniBand makes the biggest difference of all the analyses here and is 12% to 31% faster than Gigabit Ethernet. Gigabit Ethernet is saturated at 32 cores, making InfiniBand necessary for further performance gains. Both Xeon and Opteron show this same behavior. The following figure shows the elapsed time of E2 double precision running on the Xeon cluster with 8, 16, 32, and 64 cores; a sketch of the per-message cost follows the figure.

Figure 5. Abaqus/Explicit E2 (DP) on HP ProLiant servers based on Intel Xeon processors.
[Chart: wallclock time in seconds vs. number of cores (8, 16, 32, 64) on the HP ProLiant DL140 G3 (Xeon DC 3.0 GHz), comparing IB-SDR and GigE. Annotations mark InfiniBand as 12%, 24%, and 31% faster than GigE as the core count grows.]
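The same transfer-time model (time = latency + size / bandwidth) shows why latency, not bandwidth, dominates for Abaqus/Explicit. The sketch below evaluates it for a roughly 1 KB message; the SDR InfiniBand figures are the measurements reported later in this paper (Table 2), while the Gigabit Ethernet latency and bandwidth are assumed typical values, not measurements from this study.

/* Back-of-the-envelope per-message cost: latency plus wire time. */
#include <stdio.h>

static double msg_time(double latency_s, double bandwidth_Bps, double bytes) {
    return latency_s + bytes / bandwidth_Bps;
}

int main(void) {
    const double size = 1024.0;                 /* ~1 KB Abaqus/Explicit message */
    double gige = msg_time(50e-6, 125e6, size); /* GigE: assumed typical figures */
    double ib   = msg_time(4e-6, 900e6, size);  /* SDR IB: measured, Table 2 */
    printf("GigE:   %.1f usec per message\n", gige * 1e6);  /* ~58 usec */
    printf("SDR IB: %.1f usec per message\n", ib * 1e6);    /* ~5 usec */
    return 0;
}

Under these assumptions each small message costs roughly an order of magnitude more over Gigabit Ethernet; with over a billion messages in a 64-core E2 run, that gap compounds quickly, which is consistent with Gigabit Ethernet saturating at 32 cores.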

Summary of Abaqus/Standard on Dual-core HP ProLiant Servers


For compute-bound problems like S4b, which spend a great deal of time in floating point calculations such as matrix-matrix multiply (DGEMM), the dual-core Xeon, with its higher FLOP rate and highly tuned DGEMM, spends less time in computation than the dual-core Opteron. This makes the Xeon more sensitive to the communication speed of the interconnect, so InfiniBand is more beneficial on the dual-core Xeon than on the dual-core Opteron. It is interesting to note that Gigabit Ethernet is still scaling at 64 cores on both platforms.

For communication-bound problems like S6, InfiniBand is equally beneficial on both dual-core Xeon and Opteron clusters. However, the faster dual-core Xeon may require DDR rather than SDR InfiniBand to realize the same performance gains that the Opteron sees with SDR. Again, Gigabit Ethernet is still scaling at 64 cores on both platforms.

Summary of Abaqus/Explicit on Dual-core HP ProLiant Servers


InfiniBand is equally beneficial on both dual-core Xeon and Opteron clusters. Running E2 double precision, InfiniBand is up to 31% faster than Gigabit Ethernet on 64 cores. In addition, Gigabit Ethernet is saturated at 32 cores. The benefit of InfiniBand in this study for E2 is typical of most Abaqus/Explicit datasets, since Abaqus/Explicit tends to pass a relatively large number of small messages.

Advances in Interconnects
Advances in interconnect performance are ongoing. For example, successive generations of Mellanox Technologies' InfiniBand cards deliver improved scalability. To demonstrate this, we measured the performance and scalability of Abaqus/Explicit Version 6.6 on a compute cluster, comparing the following Mellanox InfiniBand cards:

SDR - Single Data Rate
DDR (Lx) - Double Data Rate
ConnectX

Each server node in the cluster was a ProLiant DL140 G3 with two dual-core Xeon 3.0 GHz processors and 8 GB of memory, running RHEL 4. Using the industry-standard ping-pong benchmark, the following table shows the measured minimum latency and maximum bandwidth of the Mellanox InfiniBand cards (a minimal ping-pong sketch follows the table). These measurements are configuration dependent, so they may vary from cluster to cluster.

Table 2. Measured latency and bandwidth of Mellanox InfiniBand cards

Mellanox InfiniBand Card    Minimum Latency   Maximum Measured Bandwidth
Single Data Rate (SDR)      4 μsec            900 MB/sec
Double Data Rate (DDR Lx)   3.2 μsec          1360 MB/sec
ConnectX                    1.4 μsec          1400 MB/sec
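For reference, the sketch below shows a minimal MPI ping-pong measurement of the kind used to produce Table 2. It is illustrative only (the actual benchmark binary behind these measurements is not reproduced here): rank 0 sends a message to rank 1, which echoes it back; half the averaged round-trip time approximates the one-way latency for small messages and yields the bandwidth for large ones.

/* Minimal MPI ping-pong sketch (illustrative). Build with mpicc,
 * run with exactly 2 ranks, e.g.: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static char buf[1 << 20];   /* room for messages up to 1 MB */
    const int iters = 1000;
    const int size = 1024;      /* bytes; sweep this to map latency vs. bandwidth */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {        /* send, then wait for the echo */
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) { /* echo the message back */
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double oneway = (MPI_Wtime() - t0) / iters / 2.0;
    if (rank == 0)
        printf("%d bytes: %.2f usec one-way, %.1f MB/s\n",
               size, oneway * 1e6, size / oneway / 1e6);

    MPI_Finalize();
    return 0;
}

Sweeping the message size from a few bytes up to 1 MB traces out the latency floor and the bandwidth ceiling reported in Table 2.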

We ran all six Abaqus/Explicit datasets in single precision. The performance and scalability of each dataset were affected to varying degrees by the type of InfiniBand card used, some more than others. Here we show the results for E2 and E5, the most and least affected datasets respectively:

E2 - simplified model of a cell phone impacting a fixed rigid floor
E5 - stiffened steel plate subjected to a high-intensity blast load

For a 64-core job, the following table shows the number of messages passed and the average message size for E2 and E5.

Table 3. Messages passed and average message size for E2 and E5

Explicit Dataset   Number of Messages   Average Message Size
E2                 1.2 billion          476 bytes
E5                 700 million          189 bytes

The following figure shows the performance of dataset E2, where the type of InfiniBand card used can make a big difference in runtime. Typical of Abaqus/Explicit jobs, E2 passes a relatively large number of small messages. From 8 to 64 cores, we see a 1.2% to 46% difference between the slowest and fastest runtimes, depending on the card used.

Figure 6. Abaqus/Explicit E2 (SP) on HP ProLiant servers based on Intel Xeon processors.
[Chart: runtime (smaller is better) vs. number of cores (8 to 64) comparing InfiniBand-SDR, InfiniBand-DDR (Lx), and InfiniBand-ConnectX.]

The following figure shows the performance of dataset E5. Here the type of InfiniBand card used makes a smaller difference in runtime than for E2. E5 passes about half as many messages as E2, and smaller ones (189 versus 476 bytes on average), so it moves roughly a quarter of the data. From 8 to 64 cores, we see a 1.3% to 20% difference between the slowest and fastest runtimes, depending on the card used.

Figure 7. Abaqus/Explicit E5 (SP) on HP ProLiant servers based on Intel Xeon processors.
[Chart: runtime (smaller is better) vs. number of cores (8 to 64) comparing InfiniBand-SDR, InfiniBand-DDR (Lx), and InfiniBand-ConnectX.]

In conclusion, these examples demonstrate a wide range of performance gains across generations of interconnect technology; the size of the gain is job-dependent.


Abaqus/Standard & Abaqus/Explicit Performance on Dual-Core HP ProLiant and Integrity Servers


The next three figures show the elapsed time of Abaqus/Standard and Abaqus/Explicit Version 6.7 benchmark datasets S4b, S6, and E2 (double precision) on dual-core HP ProLiant and Integrity servers. The times for the HP ProLiant servers are the InfiniBand results obtained in the dual-core HP ProLiant Xeon and Opteron Gigabit Ethernet/InfiniBand study found earlier in this paper. The following table shows the configuration of all clustered systems.

Table 4. Clustered systems configurations

                          HP ProLiant DL145 G2           HP ProLiant DL140 G3          HP ProLiant BL460c            HP Integrity rx2660
Processor                 2.6 GHz dual-core AMD Opteron  3.0 GHz dual-core Intel Xeon  3.0 GHz dual-core Intel Xeon  1.6 GHz dual-core Intel Itanium2
Memory/Node               8 GB                           8 GB                          8 GB                          8 GB
Processors/Node           2                              2                             2                             2
Cores/Node                4                              4                             4                             4
Local Disk                2 SCSI disks                   2 SAS disks                   1 SAS disk                    5 SCSI disks
Interconnect              SDR InfiniBand                 SDR InfiniBand                DDR (Lx) InfiniBand           SDR InfiniBand
Processor Data Cache      2 MB (1 MB per core)           4 MB (shared by 2 cores)      4 MB (shared by 2 cores)      18 MB (9 MB per core)
Peak Floating Point Rate  5.2 GFLOP/sec                  12 GFLOP/sec                  12 GFLOP/sec                  6.4 GFLOP/sec
Linux OS                  RHEL 4                         RHEL 4                        RHEL 4                        RHEL 4


Abaqus/Standard S4b
The following figure shows the elapsed time of S4b on dual-core HP ProLiant and Integrity clustered systems. Xeon and Integrity are the fastest due to higher FLOP rates and highly tuned DGEMMs.

Figure 8. Abaqus/Standard S4b Parallel Performance
[Chart: wallclock time in seconds vs. number of cores (1, 8, 16, 32, 64) comparing the HP Integrity rx2660 (Itanium2 DC 1.6 GHz, IB-SDR), HP ProLiant DL145 G2 (Opteron DC 2.6 GHz, IB-SDR), and HP ProLiant DL140 G3 (Xeon DC 3.0 GHz, IB-SDR).]


Abaqus/Standard S6
The following figure shows the elapsed time of S6 on dual-core HP ProLiant and Integrity clustered systems. Xeon is the fastest.

Figure 9. Abaqus/Standard S6 Parallel Performance
[Chart: wallclock time in seconds vs. number of cores (1, 8, 16, 32, 64) comparing the HP ProLiant DL145 G2 (Opteron DC 2.6 GHz, IB-SDR), HP ProLiant BL460c (Xeon DC 3.0 GHz, IB-DDR Lx), and HP Integrity rx2660 (Itanium2 DC 1.6 GHz, IB-SDR).]


Abaqus/Explicit E2 (Double Precision)


The following figure shows the elapsed time of E2 run in double precision on dual-core HP ProLiant and Integrity clustered systems. Integrity and Xeon are the fastest; Integrity's fast double precision arithmetic and large data cache make it as fast as Xeon.

Figure 10. Abaqus/Explicit E2 (DP) Parallel Performance
[Chart: wallclock time in seconds vs. number of cores (1, 8, 16, 32, 64) comparing the HP Integrity rx2660 (Itanium2 DC 1.6 GHz, IB-SDR), HP ProLiant DL145 G2 (Opteron DC 2.6 GHz, IB-SDR), and HP ProLiant DL140 G3 (Xeon DC 3.0 GHz, IB-SDR).]


Conclusion
To select the best components for a compute cluster, it is important to study the performance characteristics of your workload. The best choice balances the performance characteristics of the server with the amount of memory in the server, the performance of the cluster network, and the I/O performance.

For Abaqus/Standard compute-bound jobs on dual-core Xeon and Opteron clusters, the performance characteristics differ by platform. A high-speed interconnect, such as InfiniBand, is beneficial to varying degrees at different core counts. For Abaqus/Standard communication-bound jobs, both Xeon and Opteron are bound by the communication speed of the cluster network; however, Xeon is more sensitive and can take advantage of faster interconnects. For any Abaqus/Standard job on any platform, the May 2006 SIMULIA white paper (Running ABAQUS/Standard Version 6.6 on Compute Clusters) reminds us that memory and I/O performance need to be taken into consideration as well. For best performance, jobs must fit in the memory available per node, and local disk is necessary for scratch I/O.

For Abaqus/Explicit jobs on dual-core Xeon and Opteron clusters, the performance characteristics are similar, and a high-speed interconnect such as InfiniBand is clearly beneficial.

Interconnect performance improvements are ongoing and well worth investigating when running either Abaqus/Standard or Abaqus/Explicit.

Early in 2007, the HP High Performance Computing Division launched its Multi-core Optimization Program. The goal of this program is to investigate and implement techniques that improve the performance of HPC applications on HP servers that use multi-core processors. This analysis of Abaqus FEA software performance is part of this HPCD program.


Acknowledgments
The idea for this project originated in HP's High Performance Computing Division. It is one of the results of HP's Multi-Core Optimization Program, which seeks ways to improve total application performance and per-core application performance on servers using multi-core processors.

For more information


www.hp.com/go/hpc - Hewlett-Packard High Performance Computing home site
www.simulia.com - Dassault Systèmes/SIMULIA website

© 2007 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

AMD and AMD Opteron are trademarks of Advanced Micro Devices, Inc. Intel and Xeon are registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Itanium is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation. Linux is a U.S. registered trademark of Linus Torvalds. SIMULIA is a registered trademark of Dassault Systèmes or its subsidiaries in the US and/or other countries.

4AA1-6095ENW, November 2007
