Lecture13 - Full IS1500
Fall 2020
Lecture 13: SIMD, MIMD, and Parallel Programming
Artur Podobas
Researcher, KTH Royal Institute of Technology
Course Structure
Module 1: C and Assembly Programming (LE1, LE2, LE3, LE4, EX1, S1, LAB1, LAB2)
Module 4: Processor Design (LE9, LE10, EX4, S2, LAB4)
Abstraction stack (figure): Application Software, Operating System, Microarchitecture, Digital Circuits, Analog Circuits, Devices and Physics, spanning software, digital design, and analog design and physics.
Agenda
Part I: SIMD, Multithreading, and GPUs (DLP)
Part II: MIMD, Multicore, and Clusters (TLP)
Part III: Parallelization in Practice (DLP + TLP)
Part I
SIMD, Multithreading, and GPUs
DLP
Acknowledgement: The structure and several of the good examples are derived from the book
“Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy
MISD: few examples (systolic arrays come closest).
MIMD: task-level parallelism; examples are multicore processors (e.g., Intel Core i7) and cluster computers.
Subword Parallelism
• SSE/SSE2 (Streaming SIMD Extensions), introduced by Intel in the Pentium III (1999). Included single-precision FP.
• AVX (Advanced Vector Extensions), supported by both Intel and AMD (processors available in 2011). Added support for 256/512 bits and double-precision FP.
• NEON, multimedia extension for ARMv7 and ARMv8 (32 registers 8 bytes wide, or 16 registers 16 bytes wide).
• SVE, a specialized scalable vector extension to ARMv8.
Slot-utilization diagram (Slots 1-3 over time): typically, all functional units cannot be fully utilized in a single-threaded program (white space is an unused slot/functional unit).
Hardware Multithreading
In a multithreaded processor, several hardware threads (e.g., Threads A, B, and C) share the same functional units. The purpose of multithreading is to hide latencies and avoid stalls due to cache misses, etc.
Coarse-grained multithreading switches threads only at costly stalls, e.g., last-level cache misses. It cannot overcome throughput losses from short stalls.
Fine-grained multithreading switches between hardware threads every cycle, giving better utilization.
Part II
MIMD, Multicore, and Clusters
TLP
Cache Coherence
Different cores’ local caches can result in different cores seeing different values for the same memory address. This is called the cache coherence problem.
Snooping Protocol
Cache coherence can be enforced using a cache coherence protocol, for instance a write-invalidate protocol such as the snooping protocol.
False Sharing
Assume that Core 1 and Core 2 share a
cache line Z (the same set).
Process and Thread
A process context includes its own virtual memory space, I/O files, read-only code, heap, shared libraries, process ID (PID), etc.
Each process can have N concurrent threads. The thread context includes thread ID, stack, stack pointer, program counter, etc.
Note: all threads share the process context, including virtual memory.
Hands-on: Activity Monitor
Shared Variables
POSIX threads (pthreads) is a common way of programming
concurrency and utilizing multicores for parallel computation.
Semaphores
A semaphore is a global variable that can hold a nonnegative integer value. It can only be changed by the following two operations:

P(s): If s > 0, then decrement s and return. If s = 0, then wait until s > 0, then decrement s and return.

V(s): Increment s.

Note that the check and decrement of P(s), and the increment of V(s), must be atomic, meaning that each appears to happen instantaneously.
Mutex
A semaphore can be used for mutual exclusion, meaning that only one thread at a time can access a particular resource. Such a binary semaphore is called a mutex.
Supercomputers
Similar to a cluster but with focus on high-performance:
• Used to solve tough, real-life problems
• Medicine, Weather, Fluid-Dynamics, AI, …
• Fast inter-node communication
• Infiniband, Tofu, Slingshot, etc.
• Large amount of memory bandwidth (HBM2, etc.)
• Non-volatile in-memory storage (Burst-buffers, think
caches but for I/O)
• Performance measured in FLOP/s (double-precision); see the Top500 list (www.top500.org)
• Programmed using different models
• e.g., OpenMP for intra-node (shared memory)
• Message Passing Interface (MPI) inter-node
Part III
Parallelization in Practice
DLP + TLP
Hands-on: Show example
Part I: SIMD, Multithreading, and GPUs. Part II: MIMD, Multicore, and Clusters. Part III: Parallelization in Practice.
Artur Podobas, podobas@kth.se
Parallelizing GEMM
Unoptimized C version (previous page), using one core: 1.7 GigaFLOPS (32x32), 0.8 GigaFLOPS (960x960).
“For x86 computers, we expect to see two additional cores per chip
every two years and the SIMD width to double every four years.”
Hennessy & Patterson, Computer Architecture – A
Quantitative Approach, 5th edition, 2013 (page 263)
Figure residue: systolic-array dataflow (stream, domain, drain; B[x][y-1] feeding B[x][y]).
❑ 400,000 cores (tailored for AI training)
Reading Guidelines
Summary