Lecture13 - Full IS1500
Fall 2020
Lecture 13: SIMD, MIMD, and Parallel Programming
Artur Podobas
Researcher, KTH Royal Institute of Technology
Course Structure
Module 1: C and Assembly Programming (LE1, LE2, LE3, LE4, EX1, S1, LAB1, LAB2)
Module 4: Processor Design (LE9, LE10, EX4, S2, LAB4)
Abstraction stack (figure): Application Software, Operating System, Microarchitecture, Digital Circuits, Analog Circuits, Devices and Physics, spanning software, digital design, and analog design and physics.
Agenda
Part I: SIMD, Multithreading, and GPUs (DLP)
Part II: MIMD, Multicore, and Clusters (TLP)
Part III: Parallelization in Practice (DLP + TLP)
Part I
SIMD, Multithreading, and GPUs
DLP
Acknowledgement: The structure and several of the good examples are derived from the book
“Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy
MISD: few examples (systolic arrays come closest).
MIMD: task-level parallelism; examples are multicore processors (e.g., Intel Core i7) and cluster computers.
Subword Parallelism
• SSE/SSE2 (Streaming SIMD Extensions), introduced by Intel in the Pentium III (1999). Included single-precision FP.
• AVX (Advanced Vector Extensions), supported by both Intel and AMD (processors available in 2011). Added support for 256/512 bits and double-precision FP.
• NEON, multimedia extension for ARMv7 and ARMv8 (32 registers 8 bytes wide, or 16 registers 16 bytes wide).
• SVE, a specialized scalable vector extension to ARMv8.
Slot-utilization diagram (Slots 1-3 over time): typically, all functional units cannot be fully utilized in a single-threaded program (white space is an unused slot/functional unit).
Hardware Multithreading
In a multithreaded processor, several hardware threads (e.g., Threads A, B, and C) share the same functional units. The purpose of multithreading is to hide latencies and avoid stalls due to cache misses, etc.
Coarse-grained multithreading switches threads only at costly stalls, e.g., last-level cache misses. It cannot overcome throughput losses from short stalls.
Fine-grained multithreading switches between hardware threads every cycle, giving better utilization.
Part II
MIMD, Multicore, and Clusters
TLP
Cache Coherence
Different cores’ local caches can result in different cores seeing different values for the same memory address. This is called the cache coherence problem.
Snooping Protocol
Cache coherence can be enforced using a cache coherence protocol, for instance a write-invalidate protocol such as the snooping protocol.
False Sharing
Assume that Core 1 and Core 2 share a
cache line Z (the same set).
Process and Thread
A process context includes its own virtual memory space, I/O files, read-only code, heap, shared libraries, process ID (PID), etc.
Each process can have N concurrent threads. The thread context includes thread ID, stack, stack pointer, program counter, etc.
Note: all threads share the process context, including virtual memory.
Hands-on: Activity Monitor
Shared Variables
POSIX threads (pthreads) is a common way of programming
concurrency and utilizing multicores for parallel computation.
Semaphores
A semaphore is a global variable that can hold a nonnegative integer value. It can only be changed by the following two operations:

P(s): If s > 0, then decrement s and return. If s = 0, then wait until s > 0, then decrement s and return.

V(s): Increment s.

Note that the check and decrement of P(s), and the increment of V(s), must be atomic, meaning that each appears to happen instantaneously.
Mutex
A semaphore can be used for mutual exclusion, meaning that only one thread at a time can access a particular resource. Such a binary semaphore is called a mutex.
Supercomputers
Similar to a cluster but with focus on high-performance:
• Used to solve tough, real-life problems
• Medicine, Weather, Fluid-Dynamics, AI, …
• Fast inter-node communication
• Infiniband, Tofu, Slingshot, etc.
• Large amount of memory bandwidth (HBM2, etc.)
• Non-volatile in-memory storage (Burst-buffers, think
caches but for I/O)
• Performance measured in FLOP/s (double-precision); see the Top500 list (www.top500.org)
• Programmed using different models
• e.g., OpenMP for intra-node (shared memory)
• Message Passing Interface (MPI) inter-node
Part III
Parallelization in Practice
DLP + TLP
Hands-on: Show example
Part I: SIMD, Multithreading, and GPUs. Part II: MIMD, Multicore, and Clusters. Part III: Parallelization in Practice.
Artur Podobas, podobas@kth.se
Parallelizing GEMM
Unoptimized C version (previous page), using one core: 1.7 GigaFLOPS (32x32), 0.8 GigaFLOPS (960x960).
“For x86 computers, we expect to see two additional cores per chip
every two years and the SIMD width to double every four years.”
Hennessy & Patterson, Computer Architecture – A
Quantitative Approach, 5th edition, 2013 (page 263)
Figure residue: systolic-array dataflow (stream, domain, drain; B[x][y-1] feeding B[x][y]).
❑ 400,000 cores (tailored for AI training)
Reading Guidelines
Summary