OPTIMIZING TOMORROW’S SOFTWARE
NODE COMPLEXITY IS INCREASING
• Significant HPC node architecture changes
  – Increases in core counts
    • More, lower-power cores (for energy efficiency)
  – Increases in thread (SMT) counts
  – Cache-coherent NUMA
TRENDS IN PROCESSOR DESIGN: CORES
Number of cores per node is increasing
  – 2001: Dual-core POWER4
  – 2005: Dual-core AMD Opteron
  – 2011: 10-core Intel Xeon Westmere-EX
  – 2012: Intel MIC Knights Corner (60+ cores)
  – 2013: Intel MIC Knights Landing announced¹
¹ intel-powers-the-worlds-fastest-supercomputer-reveals-new-and-future-high-performance-computing-technologies
MANY ARCHITECTURE OPTIONS
[Figure: candidate node organizations built from cores with private L1I/L1D caches and varying arrangements of private and shared L2 and L3 caches, connected by a network-on-chip (NoC) to DRAM]
UPCOMING CHALLENGES
• Future systems will be diverse
  – Varying processor speeds
  – Varying failure rates for different components
  – Homogeneous applications show heterogeneous performance
THE SNIPER MULTI-CORE SIMULATOR: VISUALIZATION
HTTP://WWW.SNIPERSIM.ORG
SATURDAY, FEBRUARY 1ST, 2014
FOSDEM 2014 – HPC DEVROOM – BRUSSELS, BELGIUM
VISUALIZATION
Sniper generates quite a few statistics, but with text alone it is difficult to understand the performance details
CYCLE STACKS
[Figure: example cycle stack, breaking cycles per instruction (CPI) into per-component contributions]
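As a rough illustration of what a cycle stack captures, the sketch below attributes every cycle to the component that stalled the core and normalizes by the instruction count to obtain a CPI stack. The component names and counter values are hypothetical placeholders, not Sniper’s actual statistics format.

```python
# Minimal CPI-stack sketch: attribute total cycles to the component
# responsible for them, then normalize by instruction count.
# Component names and values below are illustrative only.

cycles_by_component = {
    "base":   4.0e9,   # cycles spent doing useful work
    "branch": 0.3e9,   # branch misprediction penalties
    "l2":     0.8e9,   # stalls served by the L2 cache
    "l3":     0.6e9,   # stalls served by the L3 cache
    "dram":   1.5e9,   # stalls served by main memory
    "sync":   0.3e9,   # synchronization (locks, barriers)
}
instructions = 5.0e9

cpi_stack = {name: cycles / instructions
             for name, cycles in cycles_by_component.items()}
total_cpi = sum(cpi_stack.values())

for name, cpi in cpi_stack.items():
    print(f"{name:>6}: {cpi:5.2f} CPI ({100 * cpi / total_cpi:4.1f}%)")
print(f" total: {total_cpi:5.2f} CPI")
```

Plotting one such stack per thread (as on the next slide) makes it easy to spot threads whose time goes to different components.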
CYCLE STACKS FOR PARALLEL APPLICATIONS
By thread: heterogeneous behavior in a homogeneous application?
[Figure: node with per-core private L1 and L2 caches, a shared L3, and DRAM]
USING CYCLE STACKS TO EXPLAIN SCALING BEHAVIOR
• Scale input: application becomes DRAM bound
• Scale core count: synchronization losses increase to 20%
VIZ: CYCLE STACKS IN TIME
VIZ: ENERGY OUTPUT OVER TIME
3D VISUALIZATION: IPC VS. TIME VS. CORE
ARCHITECTURE TOPOLOGY VISUALIZATION
• System topology information
  – With IPC/MPKI/APKI stats for each component (see the sketch below)
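For reference, MPKI and APKI are misses and accesses per kilo-instruction. A minimal sketch of deriving these per-component statistics from raw counters; the counter names and values here are hypothetical, not Sniper’s exact statistic names.

```python
# Sketch: derive IPC, APKI and MPKI for one cache level from raw counters.
# Counter names and values are illustrative assumptions.

counters = {
    "instructions": 2.0e9,
    "cycles":       2.5e9,
    "l2.accesses":  40.0e6,
    "l2.misses":     6.0e6,
}

kilo_insns = counters["instructions"] / 1e3
ipc  = counters["instructions"] / counters["cycles"]
apki = counters["l2.accesses"] / kilo_insns   # accesses per kilo-instruction
mpki = counters["l2.misses"]   / kilo_insns   # misses per kilo-instruction

print(f"IPC = {ipc:.2f}, L2 APKI = {apki:.1f}, L2 MPKI = {mpki:.1f}")
```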
SUGGESTIONS FOR OPTIMIZATION: INSTRUCTIONS VS. TIME PLOT
• Expected trends
• Outlying functions (more time per instruction)
SUGGESTIONS FOR OPTIMIZATION: ROOFLINE MODEL
• Peak FP performance
• Peak memory bandwidth

S. Williams, A. Waterman, and D. A. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
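A minimal numeric sketch of the roofline bound: attainable performance is the minimum of the compute roof and the memory-bandwidth roof at a given arithmetic intensity. The peak values are made-up placeholders, not measurements of any particular machine.

```python
# Roofline-model sketch: attainable FLOP/s is limited either by peak compute
# or by peak memory bandwidth times arithmetic intensity (FLOP per byte).
# Peak values below are illustrative assumptions.

PEAK_GFLOPS = 32.0   # assumed peak floating-point performance (GFLOP/s)
PEAK_BW_GBS = 8.0    # assumed peak memory bandwidth (GB/s)

def attainable_gflops(arithmetic_intensity):
    """Roofline bound for a given arithmetic intensity (FLOP/byte)."""
    return min(PEAK_GFLOPS, PEAK_BW_GBS * arithmetic_intensity)

for ai in (0.5, 1, 2, 4, 8, 16):
    print(f"AI = {ai:4.1f} FLOP/byte -> at most {attainable_gflops(ai):5.1f} GFLOP/s")

# Ridge point: the arithmetic intensity where the two roofs meet.
print(f"ridge point at {PEAK_GFLOPS / PEAK_BW_GBS:.1f} FLOP/byte")
```

A kernel plotted well below its roof suggests an optimization opportunity; one pinned against the bandwidth roof needs more locality or a higher arithmetic intensity.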
THE SNIPER MULTI-CORE SIMULATOR: POWER-AWARE HW/SW OPTIMIZATION
HTTP://WWW.SNIPERSIM.ORG
SATURDAY, FEBRUARY 1ST, 2014
FOSDEM 2014 – HPC DEVROOM – BRUSSELS, BELGIUM
H/W UNDER/OVERSUBSCRIPTION
• Main idea:
  – For Xeon-Phi-style cores, cache performance is the biggest indicator of overall performance
• Each core has a private cache hierarchy
  – Private L1 + private L2
    • Can access other L2s via coherency
• Each application has its own cache scaling characteristics
  – We see cache requirements both increasing and decreasing per core
    • Increasing: globally shared working set
    • Decreasing: data is partitioned per core
• By controlling the core/thread count we can optimize placement (see the sketch below)
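As a rough illustration of the increasing/decreasing cache-requirement cases above: a simple model of per-core cache footprint combines a shared component, which every core ends up caching, with a partitioned component that is split across cores. Whether the footprint fits the private L2 then depends on the core/thread count. All sizes below are hypothetical.

```python
# Sketch: estimate per-core cache footprint as shared data (cached by every
# core) plus partitioned data divided over the cores, and compare it against
# an assumed 512 KB private L2. All sizes are illustrative assumptions.

L2_BYTES = 512 * 1024

def per_core_footprint(shared_bytes, partitioned_bytes, cores):
    return shared_bytes + partitioned_bytes / cores

shared      = 256 * 1024        # globally shared working set
partitioned = 8 * 1024 * 1024   # data partitioned per core

for cores in (1, 2, 4, 8, 16, 32, 64):
    fp = per_core_footprint(shared, partitioned, cores)
    status = "fits L2" if fp <= L2_BYTES else "exceeds L2"
    print(f"{cores:3d} cores: ~{fp / 1024:7.1f} KB per core -> {status}")
```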
POWER-AWARE HW/SW CO-OPTIMIZATION
• Hooked up McPAT (Multi-Core Power, Area, Timing framework) to Sniper’s output statistics
• Evaluate different architecture directions (45nm to 22nm) with near-constant area
• Compare performance and energy efficiency
[Heirman et al., PACT 2012]
[Figure: explored designs, ranging from 8 cores with per-core caches to 16 cores with no L3 and stacked DRAM]
HARDWARE/SOFTWARE CO-DESIGN
• Heat transfer: stencil on a regular grid
  – Used in the ExaScience Lab as a component of space weather modeling
  – Important kernel, part of the Berkeley Dwarfs (structured grid)
• Improve memory locality: tiling over multiple time steps
  – Trade off locality with redundant computation (sketched below)
  – Optimum depends on the relative cost (performance & energy) of computation and data transfer → requires an integrated simulator
[Figure: three iterations of the heat transfer stencil applied to an 8×8 tile; to satisfy data dependencies up to the third time step, redundant computations are performed at the tile boundaries]
[Roofline plot: total and useful performance (GFLOP/s) vs. arithmetic intensity (FLOP/byte) for 128² and 256² tiles, bounded by peak floating-point performance and peak memory bandwidth]
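A minimal sketch of the locality-versus-redundancy trade-off, assuming a 2D 5-point stencil with overlapped time tiling: to advance a B×B tile by T time steps from local data only, a halo of width T is read and boundary points are recomputed redundantly at every step. The formulas and constants below are illustrative, not taken from the PACT 2012 study.

```python
# Overlapped time-tiling sketch for a 2D 5-point stencil: advancing a B x B
# tile by T steps without communication requires reading a (B + 2T)^2 input
# region and redundantly recomputing shrinking border regions each step.
# Constants (8-byte values, 4 FLOP per point update) are assumptions.

BYTES_PER_VALUE = 8   # double precision
FLOP_PER_POINT = 4    # assumed cost of one stencil update

def tile_stats(B, T):
    useful_points = T * B * B
    # At step t (1..T), a (B + 2*(T - t))^2 region must still be computed.
    computed_points = sum((B + 2 * (T - t)) ** 2 for t in range(1, T + 1))
    # One pass streams in the (B + 2T)^2 input region and writes B^2 results.
    bytes_moved = ((B + 2 * T) ** 2 + B * B) * BYTES_PER_VALUE
    redundancy = computed_points / useful_points
    useful_ai = useful_points * FLOP_PER_POINT / bytes_moved
    return redundancy, useful_ai

for B in (32, 64, 128, 256):
    for T in (1, 2, 4, 8):
        red, ai = tile_stats(B, T)
        print(f"B={B:3d} T={T:2d}: {red:4.2f}x work, useful AI {ai:4.2f} FLOP/byte")
```

Deeper time tiling raises the useful arithmetic intensity (more reuse per byte moved) but also the redundant-work factor; where the optimum lies depends on the relative performance and energy cost of computation and data transfer, which is what the integrated Sniper/McPAT simulation is used to explore.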
POWER-AWARE HW/SW CO-OPTIMIZATION
• Match tile size to L2 size; find the optimum between locality and redundant work, depending on their (performance/energy) cost (see the sketch below)
• Isolated optimization:
  – Fix SW parameters, explore HW architecture
[Plots: performance (simulated time steps per second) and energy efficiency (steps per joule) vs. arithmetic intensity (FLOP/byte), for tile sizes 32 to 512]
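A rough sketch of the "match tile size to L2 size" step, reusing the assumptions from the previous sketch: pick the largest tile side B whose working set (tile plus a halo of width T) still fits the private L2. The 512 KB L2 capacity and the candidate tile sizes are assumed values.

```python
# Sketch: choose the largest tile side B whose working set, including a halo
# of width T for time tiling, fits an assumed 512 KB private L2.

L2_BYTES = 512 * 1024
BYTES_PER_VALUE = 8

def fits_in_l2(B, T):
    working_set = (B + 2 * T) ** 2 * BYTES_PER_VALUE
    return working_set <= L2_BYTES

def best_tile(T, candidates=(32, 64, 128, 256, 512)):
    fitting = [B for B in candidates if fits_in_l2(B, T)]
    return max(fitting) if fitting else None

for T in (1, 2, 4, 8):
    print(f"T={T}: largest candidate tile that fits the L2 is {best_tile(T)}")
```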
REFERENCES
• Sniper website
  – http://snipersim.org/
• Download
  – http://snipersim.org/w/Download
  – http://snipersim.org/w/Download_Benchmarks
• Getting started
  – http://snipersim.org/w/Getting_Started
• Questions?
  – http://groups.google.com/group/snipersim
  – http://snipersim.org/w/Frequently_Asked_Questions
HPC NODE PERFORMANCE AND POWER SIMULATION WITH THE SNIPER MULTI-CORE SIMULATOR
TREVOR E. CARLSON, WIM HEIRMAN, LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SATURDAY, FEBRUARY 1ST, 2014
FOSDEM 2014 – HPC DEVROOM – BRUSSELS, BELGIUM