
Fundamentals of Parallelism on Intel® Architecture
Week 1
Modern Code

May 2017
Course Roadmap

Lecture 1: Intel Architecture and Modern Code
Lecture 2: Multi-Threading with OpenMP
Lecture 3: Vectorization with Intel Compilers
Lecture 4: Essentials of Memory Traffic
Lecture 5: Clusters and MPI
§1. Why This Course
Optimized software on novel platforms delivers:
▷ Shorter time to insight
▷ Better power efficiency
▷ Reduced hardware costs
▷ Greater problem sizes
Parallel Programming Layers
§2. How Computers Get Faster
Source: karlrupp.net
Clock Speed and Power Wall

Overclocking? Free cooling?

Cooling solutions for high clock speeds are either impractical or too expensive.
Pipelining and ILP Wall

[Diagram: pipelined execution; the FETCH, DECODE, EXECUTE, MEMORY, and WRITE stages of successive instructions overlap across clock cycles.]

The gain is limited: there are only so many pipeline stages, and conflicts between instructions cause stalls.


Superscalar Execution and ILP Wall

[Diagram: superscalar execution; multiple independent instructions are issued per clock cycle into parallel pipelines.]

The automatic search for independent instructions requires extra hardware resources.
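As a rough illustration (not from the course materials), the following C++ sketch contrasts a reduction written as one long dependency chain with one written with several independent accumulators; the latter gives a superscalar core independent instructions to issue in parallel.

#include <cstdio>

// Single accumulator: each addition depends on the previous one,
// forming a dependency chain that limits instruction-level parallelism.
double sum_chain(const double* a, int n) {
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}

// Four independent accumulators: the additions within one iteration do not
// depend on each other, so a superscalar core can issue them in parallel.
double sum_ilp(const double* a, int n) {
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  int i = 0;
  for (; i + 3 < n; i += 4) {
    s0 += a[i + 0];
    s1 += a[i + 1];
    s2 += a[i + 2];
    s3 += a[i + 3];
  }
  for (; i < n; i++)  // remainder loop
    s0 += a[i];
  return (s0 + s1) + (s2 + s3);
}

int main() {
  double a[1000];
  for (int i = 0; i < 1000; i++) a[i] = 1.0;
  std::printf("%f %f\n", sum_chain(a, 1000), sum_ilp(a, 1000));
}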


Out-of-Order Execution and Memory Wall

[Diagram: in-order vs. out-of-order execution; with in-order execution, a long MEMORY access stalls the instructions behind it, whereas out-of-order execution lets independent instructions proceed during the stall.]

The benefit is limited by data locality and the memory access pattern of the application.
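A minimal C++ sketch (an assumed example with an arbitrary stride, not from the course) of why the access pattern matters: the same array is summed once with unit stride and once with a large stride.

#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 20;          // number of elements
  std::vector<double> a(n, 1.0);
  const int stride = 4096;        // hypothetical large stride, in elements

  // Unit-stride traversal: consecutive addresses, so hardware prefetchers
  // and caches hide most of the memory latency.
  double sum1 = 0.0;
  for (int i = 0; i < n; i++)
    sum1 += a[i];

  // Strided traversal of the same data: each access touches a new cache
  // line, so the loop is dominated by memory latency.
  double sum2 = 0.0;
  for (int start = 0; start < stride; start++)
    for (int i = start; i < n; i += stride)
      sum2 += a[i];

  std::printf("sums: %f %f\n", sum1, sum2);
}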


Parallelism: Cores and Vectors

Unbounded growth opportunity, but not automatic


Parallelism is the Path Forward

The Show Must Go On


▷ Power wall
▷ ILP wall
▷ Memory wall

Hardware keeps evolving through parallelism. Software must catch up!


§3. Intel Architecture
Intel Computing Platforms
Intel Xeon Processors
▷ 1-, 2-, 4-way
▷ General-purpose
▷ Highly parallel (44 cores*)
▷ Resource-rich
▷ Forgiving performance
▷ Theor. ∼ 1.0 TFLOP/s in DP*
▷ Meas. ∼ 154 GB/s bandwidth*

* 2-way Intel Xeon processor, Broadwell architecture (2016), top-of-the-line (e.g., E5-2699 V4)
Intel Xeon Phi Processors (1st Gen)

▷ PCIe add-in card


▷ Specialized for computing
▷ Highly-parallel (61 cores*)
▷ Balanced for compute
▷ Less forgiving
▷ Theor. ∼ 1.2 TFLOP/s in DP*
▷ Meas. ∼ 176 GB/s bandwidth*

* Intel Xeon Phi coprocessor, Knights Corner architecture (2012), top-of-the-line (e.g., 7120P)
Intel Xeon Phi Processors (2nd Gen)

▷ Bootable or PCIe add-in card


▷ Specialized for computing
▷ Highly-parallel (72 cores*)
▷ Balanced for compute
▷ Less forgiving than Xeon
▷ Theor. ∼ 3.0 TFLOP/s in DP*
▷ Meas. ∼ 490 GB/s bandwidth*

* Intel Xeon Phi processor, Knights Landing architecture (2016), top-of-the-line (e.g., 7290P)
§4. Modern Code
One Code for All Platforms
Computing in Science and Engineering

From problem to machine, every application passes through the same layers, each with its own examples:

▷ Problem: galaxy formation, weather prediction, oil mining, chemical interaction
▷ Analytical model: gravity field + star formation, Euler equations, reservoir model, Schroedinger's equation
▷ Discretization: N-body problem, finite differences, finite element analysis, spectral methods
▷ Numerical algorithm: Barnes-Hut algorithm, Godunov method, leap-frog method, Cooley-Tukey algorithm
▷ Implementation: C++, Fortran, Python, MKL, Atlas, Boost, MPI, OpenMP, Pthreads, hybrid, offload
▷ Computer architecture: Intel Xeon, Intel Xeon Phi, Intel Omni-Path, Intel 3D XPoint
Optimization Areas
Common Code Modernization Experience
[Bar chart: N-body simulation performance in GFLOP/s on Intel Xeon E5-2697 v3 (Haswell), Intel Xeon Phi 7120A (1st gen, KNC), and Intel Xeon Phi 7220 (2nd gen, KNL) at each optimization step: Step 0: Initial, Step 1: Multi-threaded, Step 2: Scalar Tuning, Step 3: Vectorized w/SoA, Step 4: Memory Optimization. Performance grows from single-digit GFLOP/s for the initial code to as high as 2875 GFLOP/s after all optimizations.]
§5. What You Are Going to Learn
Lecture 2: Multi-Threading with OpenMP
Cores implement the MIMD (Multiple Instruction, Multiple Data) architecture.

[Diagram: many-core die layout; cores with private L2 caches and distributed tag directories (TD/DTD) are connected by the Core Ring Interconnect (CRI) with data, address, and coherence rings, alongside GDDR5 memory controllers (GBOX) and an SBOX containing the PCIe v2.0 controller and DMA engines.]

Lecture 3: Vectorization with Intel Compilers

Vectors form the Single Instruction Multiple Data (SIMD) architecture.

Scalar instructions perform one operation at a time: 4 + 1 = 5, 0 + 3 = 3, -2 + 8 = 6, 9 + (-7) = 2 take four separate additions. A vector instruction performs all of them at once on vector registers: [4, 0, -2, 9] + [1, 3, 8, -7] = [5, 3, 6, 2], where the number of lanes is the vector length.
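A minimal sketch of a loop the compiler can map to vector instructions; the #pragma omp simd hint and the array contents are illustrative assumptions (enable the pragma with, e.g., -qopenmp-simd on Intel compilers or -fopenmp-simd on GCC).

#include <cstdio>

int main() {
  const int n = 1024;
  float a[n], b[n], c[n];
  for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 1.0f; }

  // One vector instruction adds several consecutive lanes of a[] and b[]
  // at once (e.g., 8 floats per 256-bit AVX register).
  #pragma omp simd
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

  std::printf("c[0]=%f c[%d]=%f\n", c[0], n - 1, c[n - 1]);
}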
Lecture 4: Essentials of Memory Traffic
Caches facilitate data re-use
RAM is optimized for streaming

Tiling:            Cache-Oblivious Recursion:
 1  2  3  4         1  3  9 11
 5  6  7  8         2  4 10 12
 9 10 11 12         5  7 13 15
13 14 15 16         6  8 14 16
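A minimal sketch of loop tiling, here applied to a matrix transpose; the matrix size and tile size are arbitrary assumptions, not values from the course.

#include <cstdio>
#include <vector>

int main() {
  const int n = 1024;       // matrix dimension
  const int tile = 32;      // hypothetical tile size chosen to fit in cache
  std::vector<float> A(n * n), B(n * n);
  for (int i = 0; i < n * n; i++) A[i] = (float)i;

  // Tiled transpose: work proceeds tile by tile, so the cache lines of A
  // and B touched by one tile are re-used before they are evicted.
  for (int ii = 0; ii < n; ii += tile)
    for (int jj = 0; jj < n; jj += tile)
      for (int i = ii; i < ii + tile; i++)
        for (int j = jj; j < jj + tile; j++)
          B[j * n + i] = A[i * n + j];

  std::printf("B[1]=%f (expect %f)\n", B[1], A[n]);
}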
Lecture 5: Clusters and MPI

Clusters form distributed-memory systems with network interconnects
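A minimal MPI sketch of the distributed-memory model Lecture 5 covers (the reduction example is an assumption, not from the course): each process works in its own address space, and the partial results are combined over the interconnect. Build with an MPI compiler wrapper (e.g., mpiicpc or mpicxx) and launch with mpirun.

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's ID
  MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

  // Each rank computes a partial value in its own memory; the values are
  // combined across the cluster with MPI_Reduce on rank 0.
  double partial = 1.0, total = 0.0;
  MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("%d processes, total = %f\n", size, total);

  MPI_Finalize();
}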


§6. Remote Access
Educational Computing Cluster
[Diagram: users connect over the Internet via SSH to a Linux login node; jobs submitted with qsub enter the scheduler queue and run on the compute nodes using MPI; storage (/home, /opt) is shared with all nodes over NFS.]

You might also like