
High Performance Computing for Mechanical Simulations using ANSYS

Jeff Beisheim
ANSYS, Inc.
HPC Defined

High Performance Computing (HPC) at ANSYS: an ongoing effort designed to remove computing limitations from engineers who use computer-aided engineering in all phases of design, analysis, and testing.

It is a hardware and software initiative!
Need for Speed

- Impact product design
- Enable large models
- Allow parametric studies

Modal, nonlinear, multiphysics, dynamics, assemblies, CAD to mesh, capture fidelity.
A History of HPC Performance

1980s      Vector processing on mainframes
1990       Shared Memory Multiprocessing (SMP) available
1994       Iterative PCG solver introduced for large analyses
1999-2000  64-bit large memory addressing
2004       1st company to solve 100M structural DOF
2005-2007  Distributed PCG solver; Distributed ANSYS (DMP) released; distributed sparse solver; Variational Technology; support for clusters using Windows HPC
2007-2009  Optimized for multicore processors; teraflop performance at 512 cores
2010       GPU acceleration (single GPU; SMP)
2012       GPU acceleration (multiple GPUs; DMP)
HPC Revolution

Recent advancements have revolutionized the computational speed available on the desktop:
- Multicore processors (every core is really an independent processor)
- Large amounts of RAM and SSDs
- GPUs
Parallel Processing - Hardware

Two types of memory systems:
- Shared memory parallel (SMP): single box, workstation/server
- Distributed memory parallel (DMP): multiple boxes, cluster

[Figure: a single workstation vs. a multi-node cluster]
Parallel Processing - Software

Two types of parallel processing for Mechanical APDL, plus GPU acceleration (example launch lines below):
- Shared memory parallel (-np > 1)
  - First available in v4.3
  - Can only be used on a single machine
- Distributed memory parallel (-dis -np > 1)
  - First available in v6.0 with the DDS solver
  - Can be used on a single machine or a cluster
- GPU acceleration (-acc)
  - First available in v13.0 using NVIDIA GPUs
  - Supports using either a single GPU or multiple GPUs
  - Can be used on a single machine or a cluster
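As a rough sketch of how the SMP and DMP modes are typically requested on the Mechanical APDL launch line (the executable name ansys145 and the input/output file names are placeholders and vary by release and platform):

  # Shared memory parallel (SMP) on 4 cores, single machine
  ansys145 -np 4 -i input.dat -o output.out

  # Distributed memory parallel (DMP) on 8 cores, single machine or cluster
  ansys145 -dis -np 8 -i input.dat -o output.out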
Distributed ANSYS - Design Requirements

- No limitation in simulation capability
  - Must support all features
  - Continually working to add more functionality with each release
- Reproducible and consistent results
  - Same answers achieved using 1 core or 100 cores
  - Same quality checks and testing are done as with the SMP version
  - Uses the same code base as the SMP version of ANSYS
- Support for all major platforms
  - Most widely used processors, operating systems, and interconnects
  - Supports the same platforms that the SMP version supports
  - Uses the latest versions of MPI software, which support the latest interconnects
Distributed ANSYS - Design

Distributed steps (-dis -np N):
- At the start of the first load step, decompose the FEA model into N pieces (domains)
- Each domain goes to a different core to be solved
- The solution is not independent!
  - Lots of communication required to achieve the solution
  - Lots of synchronization required to keep all processes together
- Each process writes its own set of files (file0*, file1*, file2*, ..., file[N-1]*); see the sketch below
- Results are automatically combined at the end of solution
  - Facilitates postprocessing in /POST1, /POST26, or Workbench
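For illustration only, a working directory after a 4-process (-dis -np 4) run with the default jobname might contain per-process files along these lines (the exact file extensions depend on the analysis; this is a sketch of the naming pattern, not an exhaustive listing):

  file0.esav  file0.full   ! written by the master process (core 0)
  file1.esav  file1.full   ! written by core 1
  file2.esav  file2.full   ! written by core 2
  file3.esav  file3.full   ! written by core 3
  file.rst                 ! combined results file created at the end of solution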
Distributed ANSYS - Capabilities

A wide variety of features and analysis capabilities are supported:
- Static linear or nonlinear analyses
- Buckling analyses
- Modal analyses
- Harmonic response analyses using the FULL method
- Transient response analyses using the FULL method
- Single-field structural and thermal analyses
- Low-frequency electromagnetic analyses
- High-frequency electromagnetic analyses
- Coupled-field analyses
- All widely used element types and materials
- Superelements (use pass)
- NLGEOM, SOLC, LNSRCH, AUTOTS, IC, INISTATE, Linear Perturbation
- Multiframe restarts
- Cyclic symmetry analyses
- User Programmable Features (UPFs)
Distributed ANSYS - Equation Solvers

- Sparse direct solver (default)
  - Supports SMP, DMP, and GPU acceleration
  - Can handle all analysis types and options
  - Foundation for the Block Lanczos, Unsymmetric, Damped, and QR damped eigensolvers
- PCG iterative solver
  - Supports SMP, DMP, and GPU acceleration
  - Symmetric, real-valued matrices only (i.e., static/full transient)
  - Foundation for the PCG Lanczos eigensolver
- JCG/ICCG iterative solvers
  - Support SMP only

(See the solver-selection sketch below.)
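A minimal sketch of choosing among these solvers in the solution phase with the EQSLV command (the tolerance value is only an example):

  /SOLU
  EQSLV,SPARSE        ! sparse direct solver (default for most analysis types)
  ! EQSLV,PCG,1E-8    ! PCG iterative solver with an example convergence tolerance
  ! EQSLV,JCG         ! JCG iterative solver (SMP only)
  SOLVE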
Distributed ANSYS - Eigensolvers

- Block Lanczos eigensolver (including QR damp)
  - Supports SMP and GPU acceleration
- PCG Lanczos eigensolver
  - Supports SMP, DMP, and GPU acceleration
  - Great for large models (>5 MDOF) with relatively few modes (<50)
- Supernode eigensolver
  - Supports SMP only
  - Optimal choice when requesting hundreds or thousands of modes
- Subspace eigensolver
  - Supports SMP, DMP, and GPU acceleration
  - Currently only supports buckling analyses; beta for modal in R14.5
- Unsymmetric/Damped eigensolvers
  - Support SMP, DMP, and GPU acceleration

(See the MODOPT sketch below.)
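As a hedged illustration, the modal eigensolver is picked with the MODOPT command; the mode counts here are arbitrary examples:

  /SOLU
  ANTYPE,MODAL
  MODOPT,LANB,40        ! Block Lanczos, 40 modes
  ! MODOPT,LANPCG,20    ! PCG Lanczos - large models, relatively few modes
  ! MODOPT,SNODE,500    ! Supernode - hundreds or thousands of modes
  SOLVE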
Distributed ANSYS - Benefits

- Better architecture
  - More computations performed in parallel → faster solution time
  - Better speedups than SMP: can achieve >10x on 16 cores (try getting that with SMP!)
  - Can be used for jobs running on 1000+ CPU cores
- Can take advantage of resources on multiple machines
  - Memory usage and bandwidth scale
  - Disk (I/O) usage scales
- A whole new class of problems can be solved!
Distributed ANSYS - Performance

Need fast interconnects to feed fast processors.
- Two main characteristics for each interconnect: latency and bandwidth
- Distributed ANSYS is highly bandwidth bound

Example from the solver output:

  +------------------ D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------------+

  Release: 14.5       Build: UP20120802       Platform: LINUX x64
  Date Run: 08/09/2012   Time: 23:07
  Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

  Total number of cores available      : 32
  Number of physical cores available   : 32
  Number of cores requested            : 4 (Distributed Memory Parallel)
  MPI Type: INTELMPI

  Core   Machine Name   Working Directory
  ----------------------------------------
   0     hpclnxsmc00    /data1/ansyswork
   1     hpclnxsmc00    /data1/ansyswork
   2     hpclnxsmc01    /data1/ansyswork
   3     hpclnxsmc01    /data1/ansyswork

  Latency time from master to core 1 = 1.171 microseconds
  Latency time from master to core 2 = 2.251 microseconds
  Latency time from master to core 3 = 2.225 microseconds

  Communication speed from master to core 1 = 7934.49 MB/sec   (same machine)
  Communication speed from master to core 2 = 3011.09 MB/sec   (QDR InfiniBand)
  Communication speed from master to core 3 = 3235.00 MB/sec   (QDR InfiniBand)
Distributed ANSYS - Performance

Need fast interconnects to feed fast processors.

[Chart: Interconnect Performance - rating (runs/day) at 8, 16, 32, 64, and 128 cores, Gigabit Ethernet vs. DDR InfiniBand]

Benchmark: turbine model, 2.1 million DOF, SOLID187 elements, nonlinear static analysis, sparse solver (DMP), Linux cluster (8 cores per node).
Distributed ANSYS - Performance

Need fast hard drives to feed fast processors.
- Check the bandwidth specs
- ANSYS Mechanical can be highly I/O bandwidth bound
  - The sparse solver in the out-of-core memory mode does lots of I/O
- Distributed ANSYS can be highly I/O latency bound
  - Seek time to read/write each set of files causes overhead
- Consider SSDs
  - High bandwidth and extremely low seek times
- Consider RAID configurations
  - RAID 0 for speed
  - RAID 1, 5 for redundancy
  - RAID 10 for speed and redundancy
Distributed ANSYS - Performance

Need fast hard drives to feed fast processors.

[Chart: Hard Drive Performance - rating (runs/day) at 1, 2, 4, and 8 cores, HDD vs. SSD]

Benchmark: 8 million DOF, linear static analysis, sparse solver (DMP), Dell T5500 workstation (12 Intel Xeon X5675 cores, 48 GB RAM, single 7,200 rpm HDD, single SSD, Windows 7).

Beisheim, J.R., "Boosting Memory Capacity with SSDs," ANSYS Advantage Magazine, Volume IV, Issue 1, pp. 37, 2010.
Avoid waiting for I/O to complete!
- Check to see whether the job is I/O bound or compute bound
  - Check the output file for the CPU and Elapsed times
- When Elapsed time >> main thread CPU time, the job is I/O bound
  - Consider adding more RAM or a faster hard drive configuration
- When Elapsed time ≈ main thread CPU time, the job is compute bound
  - Consider moving the simulation to a machine with faster processors
  - Consider using Distributed ANSYS (DMP) instead of SMP
  - Consider running on more cores or possibly using GPU(s)
Distributed ANSYS - Performance

  Total CPU time for main thread           :      167.8 seconds
  ...
  Elapsed Time (sec) =       388.000       Date = 08/21/2012
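One way to pull those two lines out of a solver output file, as a rough sketch (the file name solve.out is a placeholder):

  grep -i "Total CPU time for main thread" solve.out
  grep -i "Elapsed Time" solve.out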
Distributed ANSYS - Performance

Results courtesy of MicroConsult Engineering, GmbH (ANSYS 11.0 through 14.0):

Thermal (full model), 3M DOF
  Time:  4 hours (11.0, 8 cores) | 4 hours (12.0, 8 cores) | 4 hours (12.1, 8 cores) | 4 hours (13.0 SP2, 8 cores) | 1 hour (14.0, 8 cores + 1 GPU) | 0.8 hour (14.0, 32 cores)

Thermomechanical simulation (full model), 7.8M DOF
  Time:        ~5.5 days (11.0) | 34.3 hours (12.0) | 12.5 hours (12.1) | 9.9 hours (13.0 SP2) | 7.5 hours (14.0)
  Iterations:  163 | 164 | 195 | 195 | 195
  Cores:       8 | 20 | 64 | 64 | 128

Interpolation of boundary conditions
  Time:        37 hours (11.0) | 37 hours (12.0) | 37 hours (12.1) | 0.2 hour (13.0 SP2) | 0.2 hour (14.0)
  Load steps:  16 | 16 | 16 | improved algorithm | 16

Submodel: creep strain analysis, 5.5M DOF
  Time:        ~5.5 days (11.0, 18 cores) | 38.5 hours (12.0, 16 cores) | 8.5 hours (12.1, 76 cores) | 6.1 hours (13.0 SP2, 128 cores) | 5.9 hours (14.0, 64 cores + 8 GPUs) | 4.2 hours (14.0, 256 cores)
  Iterations:  492 | 492 | 492 | 488 | 498 | 498

Total time:  2 weeks (11.0) | 5 days (12.0) | 2 days (12.1) | 1 day (13.0 SP2) | 0.5 day (14.0)

Notes:
- All runs with the sparse solver
- Hardware for 12.0: dual X5460 (3.16 GHz Harpertown Intel Xeon), 64 GB RAM per node
- Hardware for 12.1 and 13.0: dual X5570 (2.93 GHz Nehalem Intel Xeon), 72 GB RAM per node
- ANSYS 12.0 to 14.0 runs with DDR InfiniBand interconnect
- ANSYS 14.0 creep runs with NROPT,CRPL + DDOPT,METIS
Distributed ANSYS - Performance

Minimum time to solution is more important than scaling.

[Chart: Solution Scalability - speedup (0 to 25) vs. number of cores (0 to 64)]

Benchmark: turbine model, 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node).
Distributed ANSYS - Performance

Minimum time to solution is more important than scaling.

[Chart: Solution Scalability - solution elapsed time vs. number of cores (0 to 64); annotated times drop from 11 hrs 48 mins to 1 hr 20 mins and then 30 mins as the core count increases]

Benchmark: turbine model, 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node).
GPU Accelerator Capability

Graphics processing units (GPUs):
- Widely used for gaming and graphics rendering
- Recently made available as general-purpose accelerators
  - Support for double precision computations
  - Performance exceeding the latest multicore CPUs

So how can ANSYS make use of this new technology to reduce the overall time to solution?
- Accelerates the sparse direct solver (SMP & DMP)
  - The GPU is used to factor many dense frontal matrices
  - The decision on when to send data to the GPU is made automatically
    - Frontal matrix too small: too much overhead, stays on the CPU
    - Frontal matrix too large: exceeds GPU memory, only partially accelerated
- Accelerates the PCG/JCG iterative solvers (SMP & DMP)
  - The GPU is only used for the sparse matrix-vector multiply (SpMV kernel)
  - The decision on when to send data to the GPU is made automatically
    - Model too small: too much overhead, stays on the CPU
    - Model too large: exceeds GPU memory, only partially accelerated

(An example launch line with GPU acceleration follows.)
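As a sketch, GPU acceleration is requested with the -acc option on top of an SMP or DMP run (the executable name ansys145, core count, and file names are illustrative):

  # DMP run on 8 cores with NVIDIA GPU acceleration
  ansys145 -dis -np 8 -acc nvidia -i input.dat -o output.out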
GPU Accelerator Capability

Supported hardware
- Currently supports NVIDIA Tesla 20-series, Quadro 6000, and Quadro K5000 cards
- Next-generation NVIDIA Tesla cards (Kepler) should work with R14.5
- Installing a GPU requires the following:
  - Larger power supply (a single card needs ~250 W)
  - Open 2x form factor PCIe x16 2.0 (or 3.0) slot

Supported platforms
- Windows and Linux 64-bit platforms only
- Does not include the Linux Itanium (IA-64) platform
GPU Accelerator Capability

Targeted hardware (all NVIDIA):

                            Tesla      Tesla      Quadro     Quadro     Tesla      Tesla
                            C2075      M2090      6000       K5000*     K10*       K20*
Power (W)                   225        250        225        122        250        250
Memory                      6 GB       6 GB       6 GB       4 GB       8 GB       6 to 24 GB
Memory bandwidth (GB/s)     144        177.4      144        173        320        288
Peak speed SP/DP (GFlops)   1030/515   1331/665   1030/515   2290/95    4577/190   5184/1728

* These NVIDIA Kepler-based products are not released yet, so specifications may be incorrect.
GPU Accelerator Capability

GPUs can offer significantly faster time to solution.

[Chart: GPU Performance - relative speedup vs. 2 cores (no GPU): 2.6x at 8 cores (no GPU), 3.8x at 8 cores with 1 GPU]

Benchmark: 6.5 million DOF, linear static analysis, sparse solver (DMP), 2 x Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Windows 7.
GPU Accelerator Capability

GPUs can offer significantly faster time to solution.

[Chart: GPU Performance - relative speedup vs. 2 cores (no GPU): 2.7x at 8 cores with 1 GPU, 5.2x at 16 cores with 4 GPUs]

Benchmark: 11.8 million DOF, linear static analysis, PCG solver (DMP), 2 x Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Windows 7.
GPU Accelerator Capability

- Supports the majority of ANSYS users
  - Covers both the sparse direct and PCG iterative solvers
  - Only a few minor limitations
- Ease of use
  - Requires at least one supported GPU card to be installed
  - Requires at least one HPC Pack license
  - No rebuild, no additional installation steps
- Performance
  - Roughly 10-25% reduction in time to solution when using 8 CPU cores
  - Should never slow down your simulation!
How will you use all of this computing power?

Design Optimization Studies
- Higher fidelity
- Full assemblies
- More nonlinear
HPC Licensing

ANSYS HPC Packs enable high-fidelity insight:
- Each simulation consumes one or more packs
- Parallel capability enabled increases quickly with added packs
- Single solution for all physics and any level of fidelity
- Flexibility as your HPC resources grow
  - Reallocate packs as resources allow

Parallel enabled (cores + GPUs) per simulation:
  1 pack:  8 cores + 1 GPU
  2 packs: 32 cores + 4 GPUs
  3 packs: 128 cores + 16 GPUs
  4 packs: 512 cores + 64 GPUs
  5 packs: 2048 cores + 256 GPUs
HPC Parametric Pack Licensing

- Scalable, like ANSYS HPC Packs
- Enhances the customer's ability to include many design points as part of a single study
  - Ensures sound product decision making
- Amplifies the complete workflow
  - Design points can include execution of multiple products (pre, solve, HPC, post)
- Packaged to encourage adoption of the path to robust design!

Number of simultaneous design points enabled:
  1 license:  4
  2 licenses: 8
  3 licenses: 16
  4 licenses: 32
  5 licenses: 64
HPC Revolution

The right combination of algorithms and hardware leads to maximum efficiency:
- SMP vs. DMP
- HDD vs. SSDs
- Interconnects / clusters
- GPUs
HPC Revolution

Every computer today is a parallel computer.
Every simulation in ANSYS can benefit from parallel processing.