
HPC NODE PERFORMANCE AND POWER SIMULATION
WITH THE SNIPER MULTI-CORE SIMULATOR

TREVOR E. CARLSON, WIM HEIRMAN, LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SATURDAY, FEBRUARY 1ST, 2014
FOSDEM 2014 – HPC DEVROOM – BRUSSELS, BELGIUM
MAJOR GOALS OF SNIPER

• What will node performance look like for next-generation systems?
– Intel Xeon, Xeon Phi, etc.
• What optimizations can we make for these systems?
– Software optimizations
– Hardware/software co-design
• How is my application performing?
– Detailed insight into application performance on today's systems
OPTIMIZING TOMORROW'S SOFTWARE

• Design tomorrow's processor using today's hardware
• Optimize tomorrow's software for tomorrow's processors
• Simulation is one promising solution
– Obtain performance characteristics for new architectures
– Architectural exploration
– Early software optimization
WHY CAN'T I JUST …

… use performance counters?
– perf stat, perf record

… use Cachegrind?

It can be difficult to see exactly where the problems are:
– Not all cache misses are alike – latency matters
– Modern out-of-order processors can overlap misses
– Both core and cache performance matters
NODE COMPLEXITY IS INCREASING

• Significant HPC node architecture changes
– Increases in core counts
• More, lower-power cores (for energy efficiency)
– Increases in thread (SMT) counts
– Cache-coherent NUMA
• Optimizing for efficiency
– How do we analyze our current software?
– How do we design our next-generation software?

(Figure: processor die photo, source: Wikimedia Commons)
TRENDS IN PROCESSOR DESIGN: CORES

Number of cores per node is increasing:
– 2001: Dual-core POWER4
– 2005: Dual-core AMD Opteron
– 2011: 10-core Intel Xeon Westmere-EX
– 2012: Intel MIC Knights Corner (60+ cores)
– 2013: Intel MIC Knights Landing announced¹

(Figures: Westmere-EX and Xeon Phi die photos, source: Intel)

¹ http://newsroom.intel.com/community/intel_newsroom/blog/2013/06/17/intel-powers-the-worlds-fastest-supercomputer-reveals-new-and-future-high-performance-computing-technologies
MANY ARCHITECTURE OPTIONS

(Figure: a range of possible cache hierarchies – per-core L1I/L1D caches, private or shared L2s, an optional L3, a NoC, and DRAM)
UPCOMING CHALLENGES

• Future systems will be diverse
– Varying processor speeds
– Varying failure rates for different components
– Homogeneous applications show heterogeneous performance
• Software and hardware solutions are needed to solve these challenges
– Handle heterogeneity (reactive load balancing)
– Handle fault tolerance
– Improve power efficiency at the algorithmic level (extreme data locality)
• Hard to model accurately with analytical models
FAST AND ACCURATE SIMULATION IS NEEDED

• Evaluating current software on current hardware is difficult
– Performance counters do not provide enough insight
• Simulation use cases
– Pre-silicon software optimization
– Architecture exploration
• Cycle-accurate simulation is too slow for exploring the multi/many-core design space and software
• Key questions
– Can we raise the level of abstraction?
– What is the right level of abstraction?
– When should these abstraction models be used?
SNIPER: A FAST AND ACCURATE SIMULATOR

• Hybrid simulation approach
– Analytical interval core model
– Micro-architectural structure simulation
• branch predictors, caches (incl. coherency), NoC, etc.
• Hardware-validated, Pin-based
• Models multi/many-cores running multi-threaded and multi-program workloads
• Parallel simulator scales with the number of simulated cores
• Available at http://snipersim.org
TOP SNIPER FEATURES

• Interval model
• Multi-threaded application sampling
• CPI stacks and interactive visualization
• Parallel multithreaded simulator
• x86-64 and SSE2 support
• Validated against Core 2 and Nehalem
• Thread scheduling and migration
• Full DVFS support
• Shared and private caches
• Modern branch predictor
• Supports pthreads and OpenMP, TBB, OpenCL, MPI, …
• SimAPI and Python interfaces to the simulator
• Many flavors of Linux supported (Red Hat, Ubuntu, etc.)
SNIPER LIMITATIONS

• User-level
– Not the best match for workloads with significant OS involvement
• Functional-directed
– No simulation of cache accesses along false (mispredicted) paths
• High-abstraction core model
– Not suited to model all effects of core-level changes
– Perfect for memory subsystem or NoC work
• x86 only
• But … it is a perfect match for HPC evaluation
SNIPER HISTORY

• November 2011: SC'11 paper, first public release
• March 2012, version 2.0: Multi-program workloads
• May 2012, version 3.0: Heterogeneous architectures
• November 2012, version 4.0: Thread scheduling and migration
• April 2013, version 5.0: Multi-threaded application sampling
• June 2013, version 5.1: Suggestions-for-optimization visualization
• September 2013, version 5.2: MESI/F, 2-level TLBs, Python scheduling
• Today: 700+ downloads from 60 countries
THE SNIPER MULTI-CORE SIMULATOR
VISUALIZATION

HTTP://WWW.SNIPERSIM.ORG
SATURDAY, FEBRUARY 1ST, 2014
FOSDEM 2014 – HPC DEVROOM – BRUSSELS, BELGIUM
VISUALIZATION

Sniper generates quite a few statistics, but with text output alone it is difficult to understand the performance details.

(Figure: text output from Sniper (sim.stats))
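To make the point concrete, here is a minimal post-processing sketch in Python. It assumes the statistics dump can be reduced to flat `name = value` lines; the file name and the counter names (`core0.instructions`, `core0.cycles`) are hypothetical stand-ins, not Sniper's actual layout – the distribution ships its own post-processing scripts for this.

```python
# Toy parser for a flat "name = value" statistics dump.
# NOTE: the file name and counter names are hypothetical stand-ins;
# Sniper's real sim.stats layout differs.

def parse_stats(path):
    """Read 'name = value' lines into a dict of ints, skipping the rest."""
    stats = {}
    with open(path) as f:
        for line in f:
            name, sep, value = line.partition('=')
            if not sep:
                continue
            try:
                stats[name.strip()] = int(value.strip())
            except ValueError:
                pass  # non-numeric entry
    return stats

stats = parse_stats('sim.stats')
# Derive a headline metric instead of eyeballing thousands of raw counters:
ipc = stats['core0.instructions'] / stats['core0.cycles']
print(f'core 0 IPC: {ipc:.2f}')
```

Even this much already shows why raw text is painful: every derived metric is another manual aggregation step, which is what the visualizations below automate.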
CYCLE STACKS

• Where did my cycles go?
• CPI stack
– Cycles per instruction
– Broken up into components
• Normalize by either
– Number of instructions (CPI stack)
– Execution time (time stack)
• Different from miss rates: cycle stacks directly quantify the effect on performance

(Figure: example CPI stack with components Base, Branch, I-cache, L2 cache)
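The two normalizations are simple arithmetic; a small sketch with made-up per-component counts shows both:

```python
# Hypothetical per-component cycle counts for one thread, as a
# cycle-stack tool might attribute them (numbers are made up).
cycles = {'base': 6.0e9, 'branch': 0.5e9, 'i-cache': 0.3e9,
          'l2-cache': 1.2e9, 'dram': 2.0e9}
instructions = 4.0e9
total_cycles = sum(cycles.values())

for component, c in cycles.items():
    cpi = c / instructions            # CPI stack: cycles per instruction
    share = 100.0 * c / total_cycles  # time stack: share of execution time
    print(f'{component:9s}  CPI {cpi:5.2f}   {share:5.1f}% of time')
```

Unlike a miss rate, each component directly answers "how much faster would I run if this component disappeared?"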
CYCLE STACKS FOR PARALLEL APPLICATIONS

By thread: heterogeneous behavior in a homogeneous application?

(Figure: per-thread cycle stacks for a machine with per-core L1 and L2 caches, a shared L3, and DRAM)
USING CYCLE STACKS TO EXPLAIN SCALING BEHAVIOR

• Scale the input: the application becomes DRAM bound
• Scale the core count: synchronization losses increase to 20%
VIZ: CYCLE STACKS OVER TIME
VIZ: ENERGY OUTPUT OVER TIME
3D VISUALIZATION: IPC VS. TIME VS. CORE
ARCHITECTURE TOPOLOGY VISUALIZATION

• System topology information
– With IPC/MPKI/APKI stats for each component
SUGGESTIONS FOR OPTIMIZATION: INSTRUCTIONS VS. TIME PLOT

(Figure: per-function instructions vs. time scatter plot – most functions follow the expected trends; outlying functions spend more time per instruction)
SUGGESTIONS FOR OPTIMIZATION: ROOFLINE MODEL

(Figure: roofline plot with ceilings for peak FP performance and peak memory bandwidth)

S. Williams, A. Waterman, and D. A. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Communications of the ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
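The roofline bound itself is one line of arithmetic: attainable throughput is the lesser of the compute ceiling and bandwidth times arithmetic intensity. A sketch with illustrative machine numbers (assumptions, not any particular CPU):

```python
# Roofline: attainable GFLOP/s = min(peak FLOP/s, peak BW * AI).
# The peak values below are illustrative assumptions.
PEAK_GFLOPS = 100.0  # peak floating-point performance (GFLOP/s)
PEAK_BW = 25.0       # peak memory bandwidth (GB/s)

def roofline(ai):
    """Attainable GFLOP/s at arithmetic intensity ai (FLOP/byte)."""
    return min(PEAK_GFLOPS, PEAK_BW * ai)

for ai in (0.5, 1, 2, 4, 8, 16):
    bound = 'memory' if PEAK_BW * ai < PEAK_GFLOPS else 'compute'
    print(f'AI {ai:4.1f} FLOP/byte -> {roofline(ai):6.1f} GFLOP/s ({bound}-bound)')
```

A kernel sitting well below its roofline point is the candidate for the optimization suggestions above.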
THE SNIPER MULTI-CORE SIMULATOR
POWER-AWARE HW/SW OPTIMIZATION

HTTP://WWW.SNIPERSIM.ORG
SATURDAY, FEBRUARY 1ST, 2014
FOSDEM 2014 – HPC DEVROOM – BRUSSELS, BELGIUM
H/W UNDER/OVERSUBSCRIPTION

• Main idea:
– For Xeon-Phi-style cores, cache performance is the biggest indicator of overall performance
• Each core has a private cache hierarchy
– Private L1 + private L2
• Can access other L2s via coherency
• Each application has its own cache scaling characteristics
– We see per-core cache requirements both increasing and decreasing
• Increasing: globally shared working set
• Decreasing: data is partitioned per core
• By controlling the core/thread count we can optimize placement (see the sketch below)
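A back-of-the-envelope sketch of that trade-off (all sizes are assumptions, not measurements): if each core's working set is a shared part plus a partitioned slice, adding cores only helps until the shared part dominates, and the largest thread count whose per-core set still fits the private L2 is a natural candidate.

```python
# Toy model: per-core working set = shared part + partitioned slice.
# All sizes below are illustrative assumptions.
L2_SIZE = 512 * 1024            # private L2 capacity per core (bytes)
SHARED = 256 * 1024             # globally shared working set (bytes)
PARTITIONED = 16 * 1024 * 1024  # data partitioned across cores (bytes)

def per_core_working_set(ncores):
    return SHARED + PARTITIONED / ncores

for n in (8, 16, 32, 64, 128):
    ws = per_core_working_set(n)
    verdict = 'fits in L2' if ws <= L2_SIZE else 'spills out of L2'
    print(f'{n:3d} cores: {ws / 1024:7.0f} KiB per core -> {verdict}')
```

With these numbers, undersubscribing (fewer threads than cores) can win whenever the shared term keeps every per-core set above the L2 capacity.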
POWER-AWARE HW/SW CO-OPTIMIZATION

• Hooked up McPAT (Multicore Power, Area, and Timing framework) to Sniper's output statistics
• Evaluate different architecture directions (45nm to 22nm) with near-constant area
• Compare performance and energy efficiency [Heirman et al., PACT 2012]

(Figure: core/cache floorplans – baseline 2x quad-core; 8 cores; 16 cores with no L3 and stacked DRAM; 16 slow cores; 16 thin cores)
POWER-AWARE HW/SW CO-OPTIMIZATION

• Heat transfer: stencil computation on a regular grid
– Used in the ExaScience Lab as a component of space weather modeling
– Important kernel, part of the Berkeley Dwarfs (structured grid)
• Improve memory locality: tiling over multiple time steps
– Trade off locality with redundant computation (see the sketch below)
– Optimum depends on the relative cost (performance & energy) of computation and data transfer → requires an integrated simulator

(Figure: three iterations of the heat transfer stencil applied to an 8x8 tile – to satisfy data dependencies up to the third time step, redundant computations are performed on the darker points at the tile boundaries)
(Figure: roofline plot – total vs. useful performance for 256² and 128² tiles against the peak floating-point and memory-bandwidth ceilings)
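The tiling idea is easiest to see in code. Below is a minimal sketch assuming a simple Jacobi-style 5-point averaging stencil (not the paper's exact kernel): a tile is advanced T time steps at once using a halo of width T, and the halo points are recomputed redundantly so the tile needs no neighbor communication during those steps.

```python
# Minimal time-tiled stencil sketch (assumption: a Jacobi-style
# 5-point averaging stencil, not the paper's exact kernel).
import numpy as np

def step(g):
    """One stencil step; boundary values stay fixed."""
    out = g.copy()
    out[1:-1, 1:-1] = 0.25 * (g[:-2, 1:-1] + g[2:, 1:-1] +
                              g[1:-1, :-2] + g[1:-1, 2:])
    return out

def advance_tile(grid, i, j, B, T):
    """Advance the BxB tile at (i, j) by T steps without communication.

    A halo of width T is copied in with the tile; halo points are
    recomputed redundantly each step, so after T steps the central
    BxB block is exact (for tiles away from the global boundary)."""
    h = T
    work = grid[i - h:i + B + h, j - h:j + B + h].copy()
    for _ in range(T):
        work = step(work)
    return work[h:h + B, h:h + B]

# Check the tile result against T steps over the full grid.
N, B, T = 32, 8, 3
grid = np.random.rand(N, N)
tile = advance_tile(grid, 8, 8, B, T)
ref = grid
for _ in range(T):
    ref = step(ref)
assert np.allclose(tile, ref[8:16, 8:16])
```

The redundant work is the shrinking halo recomputed each step; the payoff is that the (B+2T)² working set is touched once per T time steps instead of once per step.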
POWER-AWARE HW/SW CO-OPTIMIZATION

• Match the tile size to the L2 size; find the optimum between locality and redundant work, depending on their (performance/energy) cost – a rough model follows below
• Isolated optimization:
– Fix the HW architecture, explore the SW parameters
– Fix the SW parameters, explore the HW architecture
• Co-optimization yields 1.66x more performance, or 1.25x more energy efficiency, than isolated optimization

(Figure: (a) performance (simulated time steps per second) and (b) energy efficiency (simulated time steps per Joule) vs. arithmetic intensity (FLOP/byte), for tile sizes 32–512 on four architectures: 8-core, 3D, low-frequency, dual-issue)
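A rough cost model shows how the sweet spot emerges (the constants are assumptions: 8-byte values, 4 FLOPs per point per step, 512 KiB L2): bigger tiles raise arithmetic intensity and cut redundant work, but past the L2 capacity the halo'd tile no longer stays resident.

```python
# Back-of-the-envelope model for picking a tile size.
# All constants are illustrative assumptions.
L2_SIZE = 512 * 1024   # private L2 (bytes)
BYTES_PER_POINT = 8    # double precision
FLOPS_PER_POINT = 4    # one 5-point averaging update

def footprint(B, T):
    """Bytes resident while advancing a BxB tile T steps (halo'd tile, double-buffered)."""
    side = B + 2 * T
    return 2 * side * side * BYTES_PER_POINT

def arithmetic_intensity(B, T):
    """Useful FLOPs per byte of DRAM traffic for one tile pass."""
    side = B + 2 * T
    flops = FLOPS_PER_POINT * B * B * T                # count only useful work
    traffic = (side * side + B * B) * BYTES_PER_POINT  # read halo'd tile, write tile
    return flops / traffic

def redundancy(B, T):
    """Total computed points / useful points (redundant-work overhead)."""
    total = sum((B + 2 * (T - t)) ** 2 for t in range(1, T + 1))
    return total / (B * B * T)

T = 4
for B in (32, 64, 128, 256):
    fit = 'OK' if footprint(B, T) <= L2_SIZE else 'too big for L2'
    print(f'B={B:3d} T={T}: AI={arithmetic_intensity(B, T):4.2f} FLOP/byte, '
          f'redundancy x{redundancy(B, T):4.2f}, footprint {fit}')
```

With these numbers the largest tile that still fits (B=128) wins on both intensity and redundancy, which is exactly the kind of joint SW-parameter/HW-capacity optimum the slide refers to; changing the L2 size moves the optimum, hence the need for co-optimization.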
REFERENCES

• Sniper website
– http://snipersim.org/
• Download
– http://snipersim.org/w/Download
– http://snipersim.org/w/Download_Benchmarks
• Getting started
– http://snipersim.org/w/Getting_Started
• Questions?
– http://groups.google.com/group/snipersim
– http://snipersim.org/w/Frequently_Asked_Questions
HPC NODE PERFORMANCE AND POWER SIMULATION
WITH THE SNIPER MULTI-CORE SIMULATOR

TREVOR E. CARLSON, WIM HEIRMAN, LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SATURDAY, FEBRUARY 1ST, 2014
FOSDEM 2014 – HPC DEVROOM – BRUSSELS, BELGIUM
