Professional Documents
Culture Documents
on Intel® Architecture
Week 1
Modern Code
May 2017
2
Course Roadmap
Optimized Software
Novel Platforms
6
Parallel Programming Layers
§2. How Computers Get Faster
Source: karlrupp.net
9
Clock Speed and Power Wall
Cooling solutions for high clock speeds are not practical or expensive
10
Pipelining and ILP Wall
Clock Cycles
Clock Cycles
MKL MPI
C++ OpenMP
Boost
Fortran Implementation Hybrid
Atlas Python Pthreads Offload
2000
1612 1712
1500
500 443
316 261 196 288
160 124
0 5.9 0.8 2.1
Step 0: Step 1: Step 2: Step 3: Step 4:
Initial Multi-threaded Scalar Tuning Vectorized w/SoA Memory Optimization
§5. What You Are Going to Learn
26
Lecture 2: Multi-Threading with OpenMP
Cores implement MIMD (Multiple Instruction Multiple Data) arch
Core Ring
SBOX
CORE CORE CORE CORE Interconnect (CRI)
PCIe v2.0
controller, DATA
L2 L2 L2 L2
DMA engines
ADDRESS
COHERENCE
TD TD TD TD
GDDR5 GDDR5
TD TD
CORE L2 L2 CORE
GDDR5 TD TD TD TD GDDR5
GBOX GBOX
GDDR5 (memory (memory GDDR5
controller) L2 L2 L2 L2 controller)
Vector Length
0 + 3 = 3 0 3 3
-2 + 8 = 6 -2
+ 8
= 6
9 + -7 = 2 9 -7 2
28
Lecture 4: Essentials of Memory Traffic
Caches facilitate data re-use
RAM is optimized for streaming
1 2 3 4 1 3 9 11
5 6 7 8 2 4 10 12
9 10 11 12 5 7 13 15
13 14 15 16 6 8 14 16
29
Lecture 5: Clusters and MPI
scheduler
Q
Login qsub
u
Internet Node e MPI
(Linux) u
SSH
e
NFS ... ...