
What is HYBRID and Why?

Jason Wang (LSTC)


1/24/2018
http://ftp.lstc.com/anonymous/outgoing/jason/Slides
Outline

• Introduction
• Numerical variations
• Evolution of crashworthiness modelling
• Evolution of CPU technology
• SMP and MPP
• Summary
Introduction - SMP vs MPP

SMP (Shared Memory Parallel) - shared-memory computers
• Starts from and builds on the serial code
• Uses OpenMP directives to split the tasks (fine-grained parallelism)
• Consistent results with different numbers of cores
• Runs only on SMP (single-image) computers
• Scalable up to ~16 CPUs
• Telapse = Tcpu + Tsys + Tomp

MPP (Message Passing Parallel) - clusters
• Uses domain decomposition
• Uses MPI for communication between sub-domains (coarse-grained parallelism)
• Results change with different numbers of cores
• Works on both SMP machines and clusters
• Scalable well beyond 16 CPUs
• Telapse = Tcpu + Tsys + Tmpp
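As a rough illustration of the fine-grained style (not LS-DYNA source; the loop and the function element_force are made up for the example), the sketch below parallelizes a single element loop with one OpenMP directive. The coarse-grained MPP style instead partitions the mesh into sub-domains and exchanges data with MPI; a combined sketch appears later in the HYBRID section.

```c
/* Minimal sketch of fine-grained SMP parallelism, assuming a compiler with
 * OpenMP support (e.g. gcc -fopenmp). All names here are illustrative only. */
#include <omp.h>
#include <stdio.h>

#define NELEM 1000000

static double element_force(int e) { return 1.0e-6 * e; }  /* dummy work */

int main(void)
{
    double total = 0.0;

    /* The directive is the only parallel construct; the loop body is the
     * unchanged serial code, which is why SMP starts from the serial code. */
    #pragma omp parallel for reduction(+ : total)
    for (int e = 0; e < NELEM; e++)
        total += element_force(e);

    printf("threads available: %d, total = %.12g\n",
           omp_get_max_threads(), total);
    return 0;
}
```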
MPP performance

[Bar chart: elapsed time (seconds, 0 to 600,000) versus number of CPUs (2 to
1024) for Cray XT4 (2x1) AMD Opteron 2.6 GHz, HP CP3000 + InfiniBand (2x1)
Xeon 3.0 GHz, and HP CP3000 + InfiniBand (2x2) Xeon 3.0 GHz.]

For 1024 cores, the maximum speedup is 251 (2007)!

Model (car2car model):
• 2,500,000+ nodes and elements
• Fully integrated (type 16) shells
Numerical Variations - SMP

[Diagram: four elements E1-E4 with forces f1-f4 sharing node 2, all inside MPP
domain 0 and processed by 4 SMP threads.]

SMP ncpu=4, no special treatment
Depending on the system load, 4 different runs may assemble the nodal force in
different orders:
Force at node 2: h2 = ((f1 + f2) + f3) + f4
Force at node 2: h2 = ((f3 + f1) + f2) + f4
Force at node 2: h2 = ((f4 + f1) + f3) + f2
Force at node 2: h2 = ((f4 + f2) + f3) + f1
Because floating-point addition is not associative, h2 can differ from run to
run - the same model may create 4 different results!

SMP ncpu=-4, special treatment
Each element contribution is stored separately:
Force at node 2 from e1: g1 = f1
Force at node 2 from e2: g2 = f2
Force at node 2 from e3: g3 = f3
Force at node 2 from e4: g4 = f4
Force assembly is outside of the element loop, always in the same order:
Force at node 2: h2 = ((g1 + g2) + g3) + g4

Consistent results, but ~10% slowdown
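A minimal OpenMP sketch of the two strategies (illustrative only, not LS-DYNA source; the array f and the element count are made up): the first loop lets threads add into the shared node in whatever order they arrive, the second stores per-element contributions and assembles them in a fixed order outside the element loop.

```c
/* Illustrative only: non-deterministic vs deterministic assembly of one
 * nodal force shared by NE elements. Names are made up for the example. */
#include <omp.h>
#include <stdio.h>

#define NE 4

int main(void)
{
    const double f[NE] = {10.1, 4.01, 5.05, 3.06};  /* element forces */

    /* ncpu=4 style: threads add into h2 as they finish, so the summation
     * order (and thus the rounding) can change from run to run. */
    double h2_fast = 0.0;
    #pragma omp parallel for
    for (int e = 0; e < NE; e++) {
        #pragma omp atomic
        h2_fast += f[e];
    }

    /* ncpu=-4 style: each element stores its contribution g[e]; the sum is
     * done afterwards, always in the same order -> consistent results. */
    double g[NE];
    #pragma omp parallel for
    for (int e = 0; e < NE; e++)
        g[e] = f[e];

    double h2_consistent = 0.0;
    for (int e = 0; e < NE; e++)   /* force assembly outside the element loop */
        h2_consistent += g[e];

    printf("fast: %.17g  consistent: %.17g\n", h2_fast, h2_consistent);
    return 0;
}
```

The second version needs the extra g storage and a separate assembly pass, which is the kind of extra work behind the ~10% slowdown reported for ncpu=-4.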


Numerical Variations - SMP

[Diagram: the same four elements E1-E4 sharing node 2 in MPP domain 0,
processed by 2 or 4 SMP threads.]

SMP ncpu=-2
Thread 1: force at node 2 from e1: g1 = f1
Thread 2: force at node 2 from e2: g2 = f2
Thread 1: force at node 2 from e4: g4 = f4
Thread 2: force at node 2 from e3: g3 = f3

SMP ncpu=-4
Thread 1: force at node 2 from e1: g1 = f1
Thread 2: force at node 2 from e2: g2 = f2
Thread 3: force at node 2 from e3: g3 = f3
Thread 4: force at node 2 from e4: g4 = f4

Force assembly is outside of the element loop, summing the nodal force in a
fixed order:
Force at node 2: h2 = ((g1 + g2) + g3) + g4

The results from ncpu=-2 and ncpu=-4 are identical.


Numerical Variations - MPP

[Diagram: the same four elements sharing node 2. Left: all four elements in
MPP domain 0. Right: a new domain line splits them between MPP domain 0
(E1, E2) and MPP domain 1 (E3, E4) when the MPP ranks change.]

Example with 3 significant digits:
f1 = 10.1, f2 = 4.01, f3 = 5.05, f4 = 3.06

4 elements in one MPP domain:
Force at node 2: h2 = ((f1 + f2) + f3) + f4
h2 = ((10.1 + 4.01) + 5.05) + 3.06
   = (14.1 + 5.05) + 3.06
   = 19.1 + 3.06
   = 22.1

Elements split across two MPP domains:
Processor 0, node 2: p0 = f1 + f2 = 10.1 + 4.01 = 14.1
Processor 1, node 2: p1 = f3 + f4 = 5.05 + 3.06 = 8.11
Force at node 2: h2 = p0 + p1 = 14.1 + 8.11 = 22.2

22.1 != 22.2 - changing the decomposition changes the answer.

MPP consistency? Very expensive!
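A minimal MPI sketch of the same effect (illustrative only, not LS-DYNA source; the force values and the ownership pattern are made up): each rank sums the elements it owns, and the final value depends on how the partial sums are combined. Making the answer independent of the number of ranks requires summing the contributions in one fixed global order, which costs extra communication and serial work - the reason full MPP consistency is expensive.

```c
/* Illustrative only: partial sums per MPI rank vs a fixed-order global sum. */
#include <mpi.h>
#include <stdio.h>

#define NE 4

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const double f[NE] = {10.1, 4.01, 5.05, 3.06};

    /* Each rank owns some elements (a crude "domain decomposition"). */
    double partial = 0.0;
    for (int e = rank; e < NE; e += nranks)
        partial += f[e];

    /* Fast reduction: the order in which partials are combined depends on
     * the MPI library and the rank count, so the rounding can differ. */
    double h2_fast = 0.0;
    MPI_Allreduce(&partial, &h2_fast, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Consistent reduction: sum all contributions in one fixed global order,
     * independent of the decomposition. Since f is replicated in this toy,
     * rank 0 can sum it directly; a real code would first have to gather
     * the distributed contributions - the price of consistency.            */
    double h2_fixed = 0.0;
    if (rank == 0)
        for (int e = 0; e < NE; e++)
            h2_fixed += f[e];
    MPI_Bcast(&h2_fixed, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("reduce: %.17g  fixed order: %.17g\n", h2_fast, h2_fixed);

    MPI_Finalize();
    return 0;
}
```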
VOLVO XC90 GEN II Crash CAE model 2014
• Total number of elements: 10 000 000+
• Number of CPUs: 480
• Run time ~ 40 hours

• CrachFEM materials
• Detailed HAZ welds
• Multiple airbags
• Rolling wheels
• New powertrain models
• New chassis models
• Steering mechanics from wheel to steering wheel
Courtesy of: VOLVO CAR CORPORATION
Model size for Safety Analysis

[Chart: model size (x1000 elements) versus year, 1994-2016, growing from well
under 1,000,000 elements to roughly 16,000,000, and heading for 35M within 3
years. Annotations along the curve: BT shell, transition from SMP to MPP,
fully integrated shell, more sophisticated material models, multi-physics,
solid spotweld clusters, smaller elements, solid elements. Turn-around time
stays ~16h-48h.]

Era:       Serial | Serial+SMP | SMP/MPP | MPP
cores/job: 1, 4, 8, 24, 48, 96, 192
MPP performance – topcrunch.org (2007)

[Bar chart: elapsed time (seconds, 0 to 600,000) versus number of CPUs (2 to
1024) for Cray XT4 (2x1) AMD Opteron 2.6 GHz, HP CP3000 + InfiniBand (2x1)
Xeon 3.0 GHz, and HP CP3000 + InfiniBand (2x2) Xeon 3.0 GHz. Annotated
speedups: 120 times and 251 times.]

Model (car2car model):
• 2,500,000+ nodes and elements
• Fully integrated (type 16) shells
MPP scaling

[Chart: performance versus number of MPP cores (16 to 416), comparing ideal,
total, element, contact and rigid-body scaling.]

Courtesy of: Steve Kan, Dhafer Marzougui (CCSA, George Mason University)

Scaling of contact:
• Contact occurs in a small region
• Long simulation time
• Hard to scale with CPU
CPU technology - Moore's Law

Do you see the clock rate increase? Do you notice the cores/socket increase?

Intel         Min. feature sz   Rel.   Cores/socket   Clock rate   Instructions
Bloomfield    45 nm             2008   4              3.2 GHz      SSE
Westmere      32 nm             2010   6              2.4 GHz      SSE
SandyBridge   32 nm             2011   6              3.1 GHz      AVX
Haswell       22 nm             2013   18             2.5 GHz      AVX2
Skylake       14 nm             2015   28             2.5 GHz      AVX512

AMD           Min. feature sz   Rel.   Cores/socket   Clock rate   Instructions
Epyc          14 nm             2017   32             2.2 GHz      AVX2

Data collected from Wikipedia
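The table suggests that per-core gains now come mostly from wider SIMD units (SSE, AVX, AVX2, AVX-512) and from more cores per socket rather than from higher clock rates. As a hedged illustration (plain C, not LS-DYNA source), a loop like the one below is the kind of kernel that benefits when the compiler is allowed to target those instruction sets (e.g. gcc -O3 -mavx2 or -mavx512f):

```c
/* Illustrative only: a straightforward loop the compiler can auto-vectorize
 * with SSE/AVX/AVX-512, depending on the target flags used at compile time. */
#include <stdio.h>

#define N 1024

/* restrict tells the compiler the arrays do not overlap, helping it emit
 * packed SIMD instructions for the whole loop. */
static void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    saxpy(N, 2.0f, x, y);

    printf("y[10] = %g\n", y[10]);   /* expect 21 */
    return 0;
}
```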


SMP and MPP – HYBRID

Example: a 4-node cluster with 2 sockets x 12 cores per node (96 cores total)

[Diagram: pure MPP runs 96 MPP ranks, one per core, all exchanging MPP
messages. HYBRID runs (8, 16) MPP ranks with (12, 6) SMP threads each, so
only the ranks exchange MPP messages.]

MPP:     96 MPP ranks
         Telapse = Tcpu + Tsys + Tmpp
HYBRID:  (8, 16) MPP ranks x (12, 6) SMP threads
         Telapse = Tcpu + Tsys + Tmpp + Tomp
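A minimal hybrid MPI + OpenMP sketch (illustrative only, not LS-DYNA source; element counts and the force function are made up): MPI is initialized with thread support, each rank owns a block of elements, and OpenMP threads split the rank-local loop, so far fewer ranks exchange messages.

```c
/* Illustrative hybrid sketch: a few MPI ranks, each using OpenMP threads. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NELEM 1200000   /* total elements across all ranks */

static double element_force(int e) { return 1.0e-6 * e; }

int main(int argc, char **argv)
{
    /* FUNNELED: only the master thread of each rank makes MPI calls. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Coarse-grained: each MPI rank owns one block ("sub-domain") of elements. */
    int chunk = (NELEM + nranks - 1) / nranks;
    int lo = rank * chunk;
    int hi = (lo + chunk > NELEM) ? NELEM : lo + chunk;

    /* Fine-grained: OpenMP threads split the rank-local element loop. */
    double local = 0.0;
    #pragma omp parallel for reduction(+ : local)
    for (int e = lo; e < hi; e++)
        local += element_force(e);

    /* Only nranks partial results cross the network, instead of one per core. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks x %d threads, total = %.12g\n",
               nranks, omp_get_max_threads(), total);

    MPI_Finalize();
    return 0;
}
```

Launched as, for example, 8 ranks with OMP_NUM_THREADS=12 on the 96-core cluster above, it uses every core while keeping only 8 communicating ranks.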


Message latency - Tmpp

• MPI ranks reduced from 96 to 16/8
• Each message size increases a little

[Chart: message size comparison.]

• Total message size and latency reduced by ~thread² (Tmpp)
• Number of local files reduced by ~thread (Tsys)

Typical sweet spots by CPU core count: SMP ~1-16, MPP ~16-512, HYBRID 512 to 5,000+
Explicit vs Implicit

[Diagram: memory hierarchy pyramid - L1$, L2$, L3$, DRAM, NVMe disk, SSD disk,
disk - with speed increasing toward the top and capacity increasing toward the
bottom.]

Explicit: CPU bounded
Implicit: CPU bounded, memory bounded, IO bounded
Numerical consistency

SMP
• ncpu=-# (~10% slower)
  To give identical results while changing the number of threads

MPP - keep the same number of MPP ranks
• lstc_reduce (~2% slower)
  To give identical results while changing CPU cores
• rcblog
  To preserve the domain lines and reduce the noise from changing meshes

HYBRID = SMP + MPP


Explicit MPP/Hybrid Performance
Multi-core/Multi-socket clusters

• NCAC model
• 2.5M elements, 53 contacts, type 16 shell
• 35 mph frontal
• 100 ms
Explicit MPP/Hybrid Performance
Multi-core/Multi-socket clusters

• Nodes to monitor
Explicit MPP/Hybrid Performance

[Chart: elapsed time for varying SMP thread counts with a fixed decomposition.]

• Varying SMP threads but fixed decomposition
• Tmpp (pure MPP) ~ Tmpp + Tomp (HYBRID)
Explicit MPP/Hybrid Consistency
Implicit Hybrid Performance
Model CYL0P5E6 – Intel 6-core 2.95 GHz

Node x MPI x OMP   N=2    N=4    N=8    N=16
N x 1 x 1          5074   2761   1500   850
N x 1 x 4          1564   918    571    373
N x 1 x 6          1209   720    465    318


Implicit MPP/Hybrid Performance
Performance on Linux AMD64 systems

No. of cores             WCT of factor matrix   WCT for job to complete
(node x socket x core)   (seconds)              (seconds)
16 x 4 x 1               2055                   14417
16 x 4 x 2               985                    13290
16 x 4 x 4               582                    29135
16 x 4 x 4omp (Hybrid)   960                    9887

Telapse = Tcpu + Tsys + Tmpp + Tomp


Hybrid Performance vs Model Size

• Dynapower 2-car models (Dr. Makino)
• 6 contacts, type 16 shell, 8M/16M elements
• 35 mph frontal
• 20 ms
Explicit Hybrid Performance vs Model Size
LS-DYNA R9.1 SVN:117280, 240 cores/job

[Stacked bar chart: elapsed time (seconds, 0 to 25,000) split into element,
contact and other costs for the 8M and 16M models at several SMP thread
counts. For the 16M model with 2 threads, Tmpp > Tmpp + Tomp, i.e. HYBRID
beats pure MPP.]
How to choose your executable
Explicit analysis
• SMP - 1-2 cores
• MPP - 2-100 cores
• HYBRID
– More than 100 cores
– Runs with different core counts that need consistent answers

Implicit analysis
• HYBRID
– Keep as much data as possible in memory for an in-core solution
– Reduce the IO bottleneck
– Reduce the amount of data transferred over the network
Summary
For better prediction and to reduce the cost of prototyping:
• Models involve coupled multi-physics, HAZ welds, CPM airbags, etc.
• Meshes become finer and critical parts use solid elements
• More sophisticated contact (mortar) for better behavior

HYBRID:
• Gives better scaling for explicit analysis with more than 100 cores
• Gives better performance in most implicit analyses
• Has already been used in production (stable)
• More capabilities will become OpenMP-enabled for better speedup

Thank you !
