NVIDIA is an MSC Software Performance partner whose Quadro® and Professional Solution product lines provide excellent performance for Patran and MSC Nastran on Windows® and Linux® systems.
Key Highlights:

Industry: High-Performance Computing
Challenge: Increase computing performance by developing a hybrid computing model
MSC Software Solutions: MSC Nastran 2012 to support GPU computing capability, including multiple GPU computing capability for DMP runs
Benefits:
• Vastly reduce use of pinned host memory
• Handle arbitrarily large fronts, for very large models

The power wall (resulting from increased power consumption and heat dissipation due to increased processor speeds) has introduced radical changes in computer architectures. Increasing core counts, and hence increasing parallelism, have replaced increasing clock speeds as the primary way of delivering greater hardware performance. A modern GPU (Graphics Processing Unit) consists of hundreds of simple processing cores; this degree of parallelism on a single processor is typically referred to as ‘many-core’, relative to ‘multi-core’, which refers to processors with at most a few dozen cores.

Many-core GPUs will often demand a high degree of fine-grained parallelism: the application program should create many threads so that while some threads are waiting for data to return from memory, other threads can be executing, offering a different approach to hiding memory latency. Today’s GPUs can provide memory bandwidth and floating-point performance that are several factors faster than the latest CPUs because of their specialization to inherently parallel problems.

With the ever-increasing demand for more computing performance, the HPC industry is moving towards a hybrid computing model, where GPUs and CPUs work together to perform general-purpose computing tasks. In this hybrid computing model, the GPU serves as a co-processor to the CPU. Co-processing refers to the use of an accelerator, here a GPU, to offload the CPU and increase computational efficiency. In order to exploit this hybrid computing model and the massively parallel GPU architecture, application software needs to be redesigned. MSC Software and NVIDIA engineers have been working together over the last year on the use of GPUs to accelerate the sparse direct solver in MSC Nastran.

Solver Acceleration in MSC Nastran 2012:

A sparse direct solver is possibly the most important component in a finite element structural analysis program. Typically, a multi-frontal algorithm with out-of-core capability for solving extremely large problems is implemented, with BLAS level 3 kernels for the highest compute efficiency. Elimination-tree and compute-kernel level parallelism with dynamic scheduling is used to ensure the best scalability. The BLAS level 3 compute kernels in a sparse direct solver are the prime candidates for GPU computing due to their high floating-point density and favorable compute-to-communication ratio.

The proprietary symmetric MSCLDL and asymmetric MSCLU sparse direct solvers in MSC Nastran employ a super-element analysis concept instead of dynamic tree-level parallelism. In this super-element analysis, the structure/matrix is first decomposed into large sub-structures/sub-domains according to user input and load-balance heuristics. The out-of-core multi-frontal algorithm is then used to compute the boundary stiffness, or Schur complement, followed by the transformation of the load vector, or right-hand side, to the boundary. The global solution is found after the boundary stiffness matrices are assembled into the residual structure and the residual structure is factorized and solved. The GPU is a natural fit for each sub-structure boundary stiffness/Schur complement calculation.

In MSC Nastran, the most time-consuming part is the BLAS level 3 operations in the multi-frontal factorization process. To date, only the trailing matrix updates of the front factorization are implemented as CUDA kernels, and these update kernels are the subject of collaborative work between NVIDIA and MSC engineers.

GPU Computing Implementation and Target Analysis (Solution Sequences):

NVIDIA’s CUDA parallel programming architecture is used to implement the update kernels. CUDA is the hardware and software architecture that enables NVIDIA GPUs to execute programs written in C, C++, FORTRAN, OpenCL, and other languages. Vastly reduced use of pinned host memory and the ability to handle arbitrarily large fronts, for very large models (greater than 15M DOF), on a single Tesla C2050 GPU are some strengths of the GPU implementation in MSC Nastran 2012. ‘Staging’ is the term used to describe how very large fronts are handled: if the trailing submatrix is too large to fit in the GPU device memory, it is broken up into approximately equal-sized ‘stages’, and the stages are completed in order, with multiple streams used within each stage. So an arbitrarily large submatrix of, say, 40GB would be solved in, say, 10 stages of 4GB each. The actual sizes of the stages can be varied for performance tuning.

In addition, the MSC Nastran implementation supports multiple GPU computing capability for DMP (Distributed Memory Parallel) runs. In such cases of DMP>1, multiple fronts are factorized concurrently on multiple GPUs. The matrix is decomposed into two domains, and each domain is computed by an MPI process. A typical MSC Nastran job submission command with multiple GPUs is shown below:

  nastran2012 jid=myinput mem=48gb buffsize=65537 dmp=2 gpuid=0:1 gputhresh=12000 sys205=192 sys151=1 mode=i8 sdir=/local/skodiyal/tmp bat=no scr=yes

gpuid is the ID of a licensed GPU device to be used in the analysis; multiple IDs may be assigned for MSC Nastran DMP runs. gputhresh represents the minimum threshold for GPU computing in the multi-frontal sparse factorization: if the product of the rank size and the front size of a front is smaller than this value, the rank update of that front is processed on the CPU; otherwise, the GPU device is used for the rank update.

The GPUs supported with this implementation are the NVIDIA Tesla 20-series (shown in Figure 1) and Quadro GPUs based on the Fermi architecture (compute capability 2.0). Linux and Windows 64-bit platforms are supported.

Any ‘fat’ BLAS3 code path would be a potential candidate for GPU computing. The sparse direct solver intensive SOL101 (linear statics), SOL108 (direct frequency), and SOL400 (nonlinear) solution sequences fall into this category.
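The staging and gputhresh rules described above can be sketched in a few lines of Python. This is an illustrative model only: the helper names and the byte-based, evenly balanced stage split are assumptions for clarity, not MSC Nastran internals.

```python
import math

GB = 1 << 30  # bytes per gigabyte

def plan_stages(submatrix_bytes, gpu_budget_bytes):
    """Split a trailing submatrix that exceeds GPU device memory into
    approximately equal-sized 'stages' that are processed in order
    (hypothetical helper; the real stage sizes are tunable)."""
    n_stages = max(1, math.ceil(submatrix_bytes / gpu_budget_bytes))
    base, rem = divmod(submatrix_bytes, n_stages)
    # Spread the remainder over the first stages so no stage exceeds budget.
    return [base + 1 if i < rem else base for i in range(n_stages)]

def rank_update_device(rank_size, front_size, gputhresh=12000):
    """Apply the gputhresh rule from the text: a front whose
    rank_size * front_size falls below the threshold stays on the CPU."""
    return "cpu" if rank_size * front_size < gputhresh else "gpu"

# The 40GB example from the text splits into 10 stages of 4GB each:
stages = plan_stages(40 * GB, 4 * GB)
```

The even split keeps every stage within the device-memory budget while covering the whole submatrix, matching the "approximately equal-sized stages" behavior the text describes.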
Figure 1: NVIDIA Tesla 20-series GPUs (workstation & server form factors)
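The boundary-stiffness (Schur complement) computation that the multi-frontal solver performs for each front is, at its core, a dense BLAS level 3 update. Below is a minimal NumPy sketch of that update for a small in-memory frontal matrix; the MSCLDL/MSCLU solvers are proprietary, out-of-core, and far more elaborate, so treat this strictly as an illustration of the underlying linear algebra.

```python
import numpy as np

def trailing_update(F, k):
    """Eliminate the first k fully-summed rows/columns of a dense frontal
    matrix F and return the Schur complement of the trailing block:
        S = F22 - F21 @ inv(F11) @ F12
    This dense rank-k update is the BLAS level 3 work that dominates
    the factorization (simplified sketch, not the MSC Nastran kernels)."""
    F11, F12 = F[:k, :k], F[:k, k:]
    F21, F22 = F[k:, :k], F[k:, k:]
    return F22 - F21 @ np.linalg.solve(F11, F12)
```

A quick consistency check is the determinant identity det(F) = det(F11) · det(S), which holds for any such partition with nonsingular F11.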
MSC Software Partner Showcase: NVIDIA
Figure 2: Automotive crank shaft (945K DOF) and engine (15.2M DOF) models
SOL108 would need a complex sparse direct solver, which is not supported in the MSC Nastran 2012 implementation; however, this feature is currently under development and testing for an upcoming point release. Likewise, conventional SOL111 (modal frequency) runs with large MPYADs (matrix multiply-add operations) should also benefit from GPU computing in a later release.

Performance Analysis with GPU Computing:

Linear and nonlinear structural stress analyses are the target applications for this first implementation of GPU computing in MSC Nastran 2012. Structural finite element models dominated by solid elements provide more concentrated computational work in the sparse matrix factorization, which is highly desirable for the GPU. A range of models with varying fidelity, from around 1M degrees of freedom (DOF) to 15M DOF, is considered (Figure 2). Performance comparisons are made relative to a serial Nastran run, which is still widely adopted within the customer community, as well as to multi-core (2x quad-core Nehalem) CPUs.

The hardware configurations used for these benchmark runs consisted of:
(1) AMAX server, Linux, 2x hex-core Westmere, 2.67GHz, 32GB memory, 2x Tesla C2050 GPUs, for the 945K and 1.3M DOF models
(2) Super Micro server, Linux, 2x quad-core Nehalem, 2.27GHz, 96GB memory, 2.2TB SATA 5-way striped RAID, and 2x Tesla C2050 GPUs, for all other models

Figure 3 shows the end-to-end (total) speed-up for single and multiple GPU runs. In general, based on the benchmark models, we see speed-ups in the range of 4-6X with a single GPU over a serial run, and in the range of 1.4-2X with 2 GPUs over an 8-core DMP run.

Summary:

GPU computing is implemented in MSC Nastran 2012 to significantly lower the simulation times for industry-standard analysis models. Vastly reduced use of pinned memory and the ability to handle arbitrarily large front sizes for very large models are some of the strengths of this implementation. Further, multiple GPUs can be used with Nastran DMP analysis. The performance speed-ups enabled by GPU computing will help MSC Nastran users add more realism to their models, thus improving the quality of their simulations. A rapid CAE simulation capability from GPUs has the potential to transform current practices in engineering analysis and design optimization procedures.

This initial GPU computing implementation also identified certain issues; for one, the larger the model, the higher the DMP overhead in MSC Nastran. This increased CPU-side overhead reduces the overall speed-up resulting from GPU computing. Future releases of MSC Nastran will address such issues as well as expand the GPU computing capability to include complex solver kernels for the NVH and dynamics markets.
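The pairing of DMP (MPI) processes with devices from the gpuid list (as in the dmp=2 gpuid=0:1 example) can be pictured as a simple round-robin mapping. This is purely illustrative; MSC Nastran's actual assignment policy may differ.

```python
def assign_gpus(dmp, gpu_ids):
    """Map each of the dmp MPI processes to a GPU from the gpuid list,
    round-robin (hypothetical illustration of 'dmp=2 gpuid=0:1';
    not MSC Nastran's documented scheduling logic)."""
    return {rank: gpu_ids[rank % len(gpu_ids)] for rank in range(dmp)}
```

With dmp=2 and gpuid=0:1, each of the two MPI processes gets its own device, so the two domain fronts factorize concurrently.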
MSC Nastran offers a complete set of linear static and dynamic analysis capabilities, along with unparalleled support for explicit nonlinear analysis, thermal and interior/exterior acoustics, and coupling between various disciplines such as thermal, structural, and fluid interaction. New modular packaging that enables you to get only what you need makes it more affordable than ever to own MSC Nastran.

Please visit … for more partner showcases.
NVIDIA*2012MAY*PS