
Fundamentals of Parallelism on Intel® Architecture
Week 1
Modern Code

May 2017
Course Roadmap

Lecture 1: Intel Architecture and Modern Code
Lecture 2: Multi-Threading with OpenMP
Lecture 3: Vectorization with Intel Compilers
Lecture 4: Essentials of Memory Traffic
Lecture 5: Clusters and MPI
§1. Why This Course
Optimized software on novel platforms delivers:
▷ Shorter time to insight
▷ Better power efficiency
▷ Reduced hardware costs
▷ Greater problem sizes
Parallel Programming Layers
§2. How Computers Get Faster
Source: karlrupp.net
Clock Speed and Power Wall

Overclocking? Free cooling?

Cooling solutions for high clock speeds are either impractical or too expensive.
Pipelining and ILP Wall

[Diagram: pipelined execution; the FETCH, DECODE, EXECUTE, MEMORY, and WRITE stages of successive instructions overlap across clock cycles.]

The gain is limited: there are only so many pipeline stages, and conflicts between instructions cause stalls.


Superscalar Execution and ILP Wall

[Diagram: superscalar execution; multiple independent instructions are issued per clock cycle into parallel pipelines.]

The automatic search for independent instructions requires extra hardware resources.
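As a rough illustration (not from the course materials), the following C++ sketch contrasts a reduction written as one long dependency chain with one written with several independent accumulators; the latter gives a superscalar core independent instructions to issue in parallel.

#include <cstdio>

// Single accumulator: each addition depends on the previous one,
// forming a dependency chain that limits instruction-level parallelism.
double sum_chain(const double* a, int n) {
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}

// Four independent accumulators: the additions within one iteration do not
// depend on each other, so a superscalar core can issue them in parallel.
double sum_ilp(const double* a, int n) {
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  int i = 0;
  for (; i + 3 < n; i += 4) {
    s0 += a[i + 0];
    s1 += a[i + 1];
    s2 += a[i + 2];
    s3 += a[i + 3];
  }
  for (; i < n; i++)  // remainder loop
    s0 += a[i];
  return (s0 + s1) + (s2 + s3);
}

int main() {
  double a[1000];
  for (int i = 0; i < 1000; i++) a[i] = 1.0;
  std::printf("%f %f\n", sum_chain(a, 1000), sum_ilp(a, 1000));
}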


Out-of-Order Execution and Memory Wall

[Diagram: in-order vs. out-of-order execution; with in-order execution, a long MEMORY access stalls the instructions behind it, whereas out-of-order execution lets independent instructions proceed during the stall.]

The benefit is limited by data locality and the memory access pattern of the application.
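A minimal C++ sketch (an assumed example with an arbitrary stride, not from the course) of why the access pattern matters: the same array is summed once with unit stride and once with a large stride.

#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 20;          // number of elements
  std::vector<double> a(n, 1.0);
  const int stride = 4096;        // hypothetical large stride, in elements

  // Unit-stride traversal: consecutive addresses, so hardware prefetchers
  // and caches hide most of the memory latency.
  double sum1 = 0.0;
  for (int i = 0; i < n; i++)
    sum1 += a[i];

  // Strided traversal of the same data: each access touches a new cache
  // line, so the loop is dominated by memory latency.
  double sum2 = 0.0;
  for (int start = 0; start < stride; start++)
    for (int i = start; i < n; i += stride)
      sum2 += a[i];

  std::printf("sums: %f %f\n", sum1, sum2);
}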


Parallelism: Cores and Vectors

Unbounded growth opportunity, but not automatic


Parallelism is the Path Forward

The Show Must Go On


▷ Power wall
▷ ILP wall
▷ Memory wall

Hardware keeps evolving through parallelism. Software must catch up!


§3. Intel Architecture
Intel Computing Platforms
Intel Xeon Processors
▷ 1-, 2-, 4-way
▷ General-purpose
▷ Highly parallel (44 cores*)
▷ Resource-rich
▷ Forgiving performance
▷ Theor. ∼ 1.0 TFLOP/s in DP*
▷ Meas. ∼ 154 GB/s bandwidth*

* 2-way Intel Xeon processor, Broadwell architecture (2016), top-of-the-line (e.g., E5-2699 V4)
Intel Xeon Phi Processors (1st Gen)

▷ PCIe add-in card


▷ Specialized for computing
▷ Highly-parallel (61 cores*)
▷ Balanced for compute
▷ Less forgiving
▷ Theor. ∼ 1.2 TFLOP/s in DP*
▷ Meas. ∼ 176 GB/s bandwidth*

* Intel Xeon Phi coprocessor, Knights Corner architecture (2012), top-of-the-line (e.g., 7120P)
Intel Xeon Phi Processors (2nd Gen)

▷ Bootable or PCIe add-in card


▷ Specialized for computing
▷ Highly-parallel (72 cores*)
▷ Balanced for compute
▷ Less forgiving than Xeon
▷ Theor. ∼ 3.0 TFLOP/s in DP*
▷ Meas. ∼ 490 GB/s bandwidth*

* Intel Xeon Phi processor, Knights Landing architecture (2016), top-of-the-line (e.g., 7290P)
§4. Modern Code
One Code for All Platforms
Computing in Science and Engineering

From problem to machine, every application passes through the same layers, each with its own examples:

▷ Problem: galaxy formation, weather prediction, oil mining, chemical interaction
▷ Analytical model: gravity field + star formation, Euler equations, reservoir model, Schroedinger's equation
▷ Discretization: N-body problem, finite differences, finite element analysis, spectral methods
▷ Numerical algorithm: Barnes-Hut algorithm, Godunov method, leap-frog method, Cooley-Tukey algorithm
▷ Implementation: C++, Fortran, Python, MKL, Atlas, Boost, MPI, OpenMP, Pthreads, hybrid, offload
▷ Computer architecture: Intel Xeon, Intel Xeon Phi, Intel Omni-Path, Intel 3D XPoint
Optimization Areas
Common Code Modernization Experience
[Bar chart: N-body simulation performance in GFLOP/s on Intel Xeon E5-2697 v3 (Haswell), Intel Xeon Phi 7120A (1st gen, KNC), and Intel Xeon Phi 7220 (2nd gen, KNL) at each optimization step: Step 0: Initial, Step 1: Multi-threaded, Step 2: Scalar Tuning, Step 3: Vectorized w/SoA, Step 4: Memory Optimization. Performance grows from single-digit GFLOP/s for the initial code to as high as 2875 GFLOP/s after all optimizations.]
§5. What You Are Going to Learn
Lecture 2: Multi-Threading with OpenMP
Cores implement the MIMD (Multiple Instruction, Multiple Data) architecture.

[Diagram: many-core die layout; cores with private L2 caches and distributed tag directories (TD/DTD) are connected by the Core Ring Interconnect (CRI) with data, address, and coherence rings, alongside GDDR5 memory controllers (GBOX) and an SBOX containing the PCIe v2.0 controller and DMA engines.]

Lecture 3: Vectorization with Intel Compilers

Vectors form the Single Instruction Multiple Data (SIMD) architecture.

Scalar instructions perform one operation at a time: 4 + 1 = 5, 0 + 3 = 3, -2 + 8 = 6, 9 + (-7) = 2 take four separate additions. A vector instruction performs all of them at once on vector registers: [4, 0, -2, 9] + [1, 3, 8, -7] = [5, 3, 6, 2], where the number of lanes is the vector length.
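A minimal sketch of a loop the compiler can map to vector instructions; the #pragma omp simd hint and the array contents are illustrative assumptions (enable the pragma with, e.g., -qopenmp-simd on Intel compilers or -fopenmp-simd on GCC).

#include <cstdio>

int main() {
  const int n = 1024;
  float a[n], b[n], c[n];
  for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 1.0f; }

  // One vector instruction adds several consecutive lanes of a[] and b[]
  // at once (e.g., 8 floats per 256-bit AVX register).
  #pragma omp simd
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

  std::printf("c[0]=%f c[%d]=%f\n", c[0], n - 1, c[n - 1]);
}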
Lecture 4: Essentials of Memory Traffic
Caches facilitate data re-use
RAM is optimized for streaming

Tiling:            Cache-Oblivious Recursion:
 1  2  3  4         1  3  9 11
 5  6  7  8         2  4 10 12
 9 10 11 12         5  7 13 15
13 14 15 16         6  8 14 16
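A minimal sketch of loop tiling, here applied to a matrix transpose; the matrix size and tile size are arbitrary assumptions, not values from the course.

#include <cstdio>
#include <vector>

int main() {
  const int n = 1024;       // matrix dimension
  const int tile = 32;      // hypothetical tile size chosen to fit in cache
  std::vector<float> A(n * n), B(n * n);
  for (int i = 0; i < n * n; i++) A[i] = (float)i;

  // Tiled transpose: work proceeds tile by tile, so the cache lines of A
  // and B touched by one tile are re-used before they are evicted.
  for (int ii = 0; ii < n; ii += tile)
    for (int jj = 0; jj < n; jj += tile)
      for (int i = ii; i < ii + tile; i++)
        for (int j = jj; j < jj + tile; j++)
          B[j * n + i] = A[i * n + j];

  std::printf("B[1]=%f (expect %f)\n", B[1], A[n]);
}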
Lecture 5: Clusters and MPI

Clusters form distributed-memory systems with network interconnects
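A minimal MPI sketch of the distributed-memory model Lecture 5 covers (the reduction example is an assumption, not from the course): each process works in its own address space, and the partial results are combined over the interconnect. Build with an MPI compiler wrapper (e.g., mpiicpc or mpicxx) and launch with mpirun.

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's ID
  MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

  // Each rank computes a partial value in its own memory; the values are
  // combined across the cluster with MPI_Reduce on rank 0.
  double partial = 1.0, total = 0.0;
  MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("%d processes, total = %f\n", size, total);

  MPI_Finalize();
}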


§6. Remote Access
Educational Computing Cluster
[Diagram: users connect over the Internet via SSH to a Linux login node; jobs submitted with qsub enter the scheduler queue and run on the compute nodes using MPI; storage (/home, /opt) is shared with all nodes over NFS.]

You might also like