
ST7 : High Performance Simulation

Parallel & Distributed Architectures

Stéphane Vialle

Stephane.Vialle@centralesupelec.fr
http://www.metz.supelec.fr/~vialle
What is « HPC » ?

High Performance Computing (HPC) lies at the intersection of:

• High performance hardware (TeraFlops, PetaFlops, ExaFlops)
• Parallel algorithmics & programming
• Numerical algorithms
• Intensive computing & HPDA: large scale applications
• Big Data (TeraBytes, PetaBytes, ExaBytes)
High Performance Hardware

Inside … high performance hardware
High Performance Hardware

From core to SuperComputer

• Computing cores with vector units (instructions driving the ALUs / AVX units); a vectorization sketch is given below
• Multi-core processor
• Multi-core PC/node
• Multi-core PC cluster
• SuperComputer: Fat Tree interconnect + hardware accelerators
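As a hedged illustration of what the vector units of a single core do, the sketch below adds two float arrays eight elements at a time with 256-bit AVX intrinsics; the function and array names are illustrative, not taken from the course.

// Minimal AVX sketch: add two float arrays 8 floats at a time.
// Assumes an x86 CPU with AVX and a size n that is a multiple of 8.
#include <immintrin.h>

void add_avx(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);    // 8 additions in one instruction
        _mm256_storeu_ps(c + i, vc);          // store 8 results
    }
}

In practice such code is usually generated by the compiler itself, from a plain loop or from a vectorization pragma, as shown later in these slides.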
High Performance Hardware

Top500 site: HPC roadmap

Roadmap figure: the required computing power grows steadily over time. At the present day the current machine is installed and in production, while the expected next machine is still being planned. Modeling and numerical investigations (on small problems, on the current machine), then implementation, debug and test prepare for the new machine going into production: this regular improvement allows planning the future.
High Performance Hardware

HPC in the cloud ?

« Azure BigCompute »: high performance nodes,
with a high performance interconnect (InfiniBand)
→ Customers can allocate a part of an HPC cluster

• Allows allocating a huge number of nodes for a short time
• High performance interconnection network
→ Comfortable for HPC & Big Data scaling benchmarks

• Suitable for very large scale systems and applications

No supercomputer is available in a cloud,
but some HPC or large scale PC clusters do exist in clouds.
Wind of change?
Architecture issues

Why multi-core processors ? Shared or distributed memory ?

Architecture issues
Re-interpreted Moore’s law
Processor performance has increased through parallelism since 2004:
CPU performance gains have come from (different) parallel
mechanisms for several years now…
… and these mechanisms require explicit parallel programming.

W. Kirshenmann, EDF & EPI AlGorille,
based on an initial study by SpiralGen
Architecture issues
Re-interpreted Moore’s law
It is impossible to dissipate so much electrical power in the silicon.

Author: Jack Dongarra
Architecture issues
Re-interpreted Moore’s law

It became impossible to dissipate the energy consumed
by the semiconductor when the frequency increased!
Architecture issues
Re-interpreted Moore’s law

Bounding the frequency and increasing the number of cores is energy efficient: dynamic power grows roughly with the cube of the frequency, so two cores running at 75% of the frequency consume about 2 × 0.75³ ≈ 0.8 of the original power (see the worked model below).

Author: Jack Dongarra
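A short worked version of this argument, under the common first-order assumption that dynamic power scales as \(P \propto C\,V^{2}f\) with the voltage roughly proportional to the frequency (so \(P \propto f^{3}\)); the exact exponent depends on the technology, this only reproduces the slide's order of magnitude:

\[
\frac{P_{\text{2 cores at }0.75f}}{P_{\text{1 core at }f}} \;=\; 2 \times 0.75^{3} \;\approx\; 0.84 \;\approx\; 0.8
\]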
Architecture issues
Re-interpreted Moore’s law
Initial (electronic) Moore's law:
every 18 months → ×2 transistors per µm²

Previous computer-science interpretation:
every 18 months → ×2 processor speed

New computer-science interpretation:
every 24 months → ×2 number of cores

This leads to a massive parallelism challenge:
splitting many codes into 100, 1000, … 10⁶ tasks … 10⁷ tasks!!
Architecture issues

3 classic parallel architectures


• Shared memory « SMP »: the processors access one shared memory through a network. Simple and efficient up to about 16 processors. A limited solution.
• Distributed memory « Cluster »: each processor has its own memory, and processors communicate over a network. Highly scalable, but efficiency and price depend on the interconnect.
• Distributed shared memory « DSM »: memories are physically distributed but logically shared over the network. A comfortable solution, with efficient hardware implementations up to about 1000 processors. A limited solution.

Today: all supercomputers are clusters with a hierarchical architecture.
(The shared-memory programming model is illustrated by the sketch below.)
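As a hedged illustration of the shared-memory (« SMP ») programming model, here is a minimal OpenMP sum; the array size and variable names are illustrative and not taken from the course.

// Shared-memory version: all threads of one process see the same array.
// Compile for instance with: gcc -fopenmp sum_omp.c
#include <stdio.h>
#include <omp.h>

#define N 1000000   /* illustrative size */

int main(void)
{
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) a[i] = 1.0;      // shared data

    #pragma omp parallel for reduction(+:sum)    // threads split the loop,
    for (int i = 0; i < N; i++)                  // partial sums are combined
        sum += a[i];

    printf("sum = %f (max threads = %d)\n", sum, omp_get_max_threads());
    return 0;
}

On a distributed-memory cluster the same computation would instead use one process per node and explicit messages (MPI), as sketched a few slides further on.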
Architecture issues

Modern clusters

• One PC / one node → a cluster of nodes (one PC cluster)
• One NUMA (Non-Uniform Memory Architecture) node → a cluster of NUMA nodes

In a NUMA node, each processor has its own local memory bank and reaches the other banks through an internal network; the cluster then adds an external network between nodes. (A memory-placement sketch follows below.)
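On NUMA nodes the physical placement of data matters. A minimal sketch, assuming a Linux-like first-touch policy (a page is placed on the memory bank of the core that first writes it) and OpenMP threads pinned to cores: the data is initialized by the same threads that will later use it.

// First-touch initialization sketch for a NUMA node.
#include <stdlib.h>
#include <omp.h>

double *alloc_and_init(size_t n)
{
    double *v = malloc(n * sizeof(double));

    // Each thread writes its own chunk first, so the corresponding pages
    // are allocated on the memory bank local to that thread.
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        v[i] = 0.0;

    return v;
}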
Architecture issues

Hierarchical & hybrid architecture


EU/France, China, USA:
clusters of multi-processor NUMA nodes with hardware vector accelerators

• Hierarchical and hybrid architectures
• Multi-paradigm programming:
  AVX vectorization: #pragma simd
  + CPU multithreading: #pragma omp parallel for
  + Message passing: MPI_Send(…, Me-1, …); MPI_Recv(…, Me+1, …);
  + GPU vectorization: myKernel<<<grid,bloc>>>(…)
  + checkpointing

(A hybrid sketch combining some of these paradigms is given below.)
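To make the multi-paradigm idea concrete, here is a minimal hybrid sketch (an illustration, not the course's own code): one MPI process per node, OpenMP threads inside each process, and a simd hint on the inner loop. The slice size and names are illustrative.

// Hybrid MPI + OpenMP + vectorization: each MPI process owns a data slice,
// its threads share the slice, and the inner loop is vectorized.
// Compile for instance with: mpicc -fopenmp hybrid.c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define LOCAL_N 1000000   /* illustrative slice size per process */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double x[LOCAL_N];
    double local_sum = 0.0, global_sum = 0.0;

    // Shared-memory parallelism inside the process + vectorization hint.
    #pragma omp parallel for simd reduction(+:local_sum)
    for (int i = 0; i < LOCAL_N; i++) {
        x[i] = (double)(rank + 1);
        local_sum += x[i];
    }

    // Message passing between the distributed memories of the processes.
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g over %d processes\n", global_sum, size);

    MPI_Finalize();
    return 0;
}

A GPU-accelerated node would add a kernel launch (myKernel<<<grid,bloc>>>(…)) for the most compute-intensive part.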
Architecture issues

Hierarchical & hybrid architecture


Distributed software architecture: the application is made of several processes, each with its own memory space shared by its threads; each process holds its code, the stack and code of its main thread, and the stacks of its other threads (threads x, y, z).

Need for an efficient deployment of software resources over hardware resources:
• One process per node? per processor? per core?
• How many threads per process? per core?
• How to load balance the computations?
• How to group processes to minimize communications?
• How to co-locate data and processes/threads?
(A small placement-checking sketch is given below.)
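In practice one often starts by checking where the runtime actually places processes and threads. A minimal sketch, assuming a Linux system (for sched_getcpu) and an MPI + OpenMP build; the binding itself is then adjusted with the options of the MPI launcher or batch scheduler.

// Placement check: every OpenMP thread of every MPI process reports the
// core it is currently running on (sched_getcpu is Linux-specific).
// Compile for instance with: mpicc -fopenmp whereami.c
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    #pragma omp parallel
    printf("host %s | MPI rank %d | OpenMP thread %d | core %d\n",
           host, rank, omp_get_thread_num(), sched_getcpu());

    MPI_Finalize();
    return 0;
}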
Energy Consumption

1 PetaFlops: 2.3 MW!
→ 1 ExaFlops: 2.3 GW!! 350 MW! ….. 20 MW?

Perhaps we will be able to build the machines,
but not to pay for their energy consumption!!
Energy consumption
How much electrical power for an Exaflops ?

1.0 Exaflops should be reached around 2020-2022:

• 2.0 GWatts with the flops/watt ratio of the 2008 Top500 #1 machine
• 1.2 GWatts with the flops/watt ratio of the 2011 Top500 #1 machine
• 350 MWatts if the flops/watt ratio keeps improving at its regular pace
• 20 MWatts if we succeed in improving the architecture? …
  … « the maximum energy cost we can support! » (2010)
• 2 MWatts …
  … « the maximum cost for a large set of customers » (2014)

(A worked example of the first figure is given below.)
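As a back-of-envelope check of the first figure, using the 2008 RoadRunner numbers quoted on the next slide (about 1.03 Pflops for 2.35 MW, i.e. roughly 0.44 Gflops/W):

\[
\frac{10^{18}\ \text{flops}}{1.03 \times 10^{15}\ \text{flops} \,/\, 2.35\ \text{MW}}
\;\approx\; \frac{10^{18}}{4.4 \times 10^{8}\ \text{flops/W}}
\;\approx\; 2.3 \times 10^{9}\ \text{W} \;\approx\; 2\ \text{GW}
\]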
Energy consumption

From Petaflops to Exaflops


• 1.03 Petaflops: June 2008, RoadRunner (IBM), Opteron + PowerXCell, 122 440 « cores », 500 Gb/s (IO), 2.35 MWatt!!!
• 1.00 Exaflops: expected for 2018-2020, then 2020-2022, with ×1000 perf, ×100 cores/node, ×10 nodes, ×50 IO (25 Tb/s), but only ×10 energy (20 MWatt max….)
• 513.8 Petaflops: June 2020, Fugaku (Fujitsu, Japan), Fujitsu A64FX (ARM) 48C 2.2GHz, 7 630 848 CPU cores, 28.3 MWatt

Related challenges:
• Elegant & efficient programming?
• Training of large programmer teams?

(The scaling factors are checked in the short calculation below.)
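A quick consistency check of these target factors, using only the slide's own numbers:

\[
\begin{aligned}
\text{performance:}\quad & 100\ \text{cores/node} \times 10\ \text{nodes} = \times 1000 \;\Rightarrow\; 1.03\ \text{Pflops} \times 1000 \approx 1\ \text{Eflops}\\
\text{IO:}\quad & 500\ \text{Gb/s} \times 50 = 25\ \text{Tb/s}\\
\text{energy:}\quad & 2.35\ \text{MW} \times 10 \approx 23.5\ \text{MW},\ \text{close to the 20 MWatt target}
\end{aligned}
\]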
Energy consumption
Sunway TaihuLight - China: 93.0 Pflops
N°1 June 2016 – Nov. 2017
• 41 000 Sunway SW26010 260C 1.45GHz processors
  → 10 649 600 « cores »
• Sunway interconnect: 5-level integrated hierarchy (InfiniBand-like?)
• 15.4 MWatt
Energy consumption
Summit - USA: N°1 June 2018 – Nov. 2019
143.5 Pflops (×1.54)
• 9 216 IBM POWER9 22C 3.07GHz processors
• 27 648 Volta GV100 GPUs
  → 2 282 544 « cores » (CPU+GPU)
• Interconnect: dual-rail Mellanox EDR InfiniBand
• 9.8 MWatt (×0.64), Flops/Watt: ×2.4
Energy consumption
Fugaku - Japan: N°1 June 2020 – Nov. 2021
513.8 Pflops (×3.58)
• ≈ 150 000 Fujitsu A64FX (64-bit ARM) 48C 2.2GHz processors
  → 7 630 848 cores
• Interconnect: Tofu Interconnect D, 6D mesh/torus
• 28.3 MWatt (×2.89), Flops/Watt: ×1.24

(Each factor in parentheses is the ratio to the previous N°1 machine; a quick check follows below.)
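For Fugaku versus Summit, the factors can be checked directly from the numbers above:

\[
\frac{513.8}{143.5} \approx 3.58, \qquad
\frac{28.3}{9.8} \approx 2.89, \qquad
\frac{513.8 / 28.3}{143.5 / 9.8} \approx \frac{18.2}{14.6} \approx 1.24
\]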
Cooling

Cooling is close to 30% of the energy consumption

Optimization is mandatory!

Cooling

Cooling is strategic !
Processors consume less energy:
• we try to limit the power consumption of each processor
• processors switch to an economy mode when they are idle
• the flops/watt ratio keeps improving
But processor density is increasing:
• a trend toward limiting the total footprint
  of the machines (floor area in m²)

→ Need for efficient and cheap (!) cooling

Often estimated at 30% of the energy bill!
Cooling

Optimized air flow


Optimizing the air flow at the inlet and outlet of the cabinets:
• Blue Gene architecture: high processor density
• Goal of minimal floor footprint and minimal energy consumption
• Triangular shapes added to optimize the air flow
IBM Blue Gene
Cooling

Cold doors (air+water cooling)


A water-cooled « door/grid » through which the air flow that has just cooled the machine circulates.

The cooling is concentrated at cabinet level.
Cooling

Direct liquid cooling


Cold water is brought directly to the hot spots, but the water stays isolated from the electronics.
• Experimental in 2009
• Adopted since then (IBM, BULL, …)
IBM compute blade in 2012 (commercialized)
Experimental IBM board in 2009 (Blue Water project)
Cooling

Liquid and immersive cooling


Cooling by immersing the boards in an electrically neutral liquid, which is itself cooled.

Immersive liquid cooling tested by SGI & Novec in 2014.
Immersive liquid cooling on the CRAY-2 in 1985.
Cooling

Extreme air cooling


Cooling with air at ambient temperature:
- circulating at high speed
- circulating in large volumes

→ The CPUs run close to their maximum supported temperature
  (e.g. 35°C on a motherboard without problems)

→ The air flow itself is not cooled. Economical!
  But the machine must be shut down when the ambient air is too hot (in summer)!

A Grid'5000 machine in Grenoble
(the only one with extreme cooling)
Cooling

Extreme air cooling


Same principle as the previous slide: ambient-temperature air circulating at high speed and in large volumes, no cooling of the air flow, and a shutdown when the ambient air is too hot (in summer).

The Iliium installation at CentraleSupélec in Metz
(blockchain - 2018)
SG6 – High Performance Computing
Parallel & Distributed Architectures

Questions?
