
Vector Processing: SIMD/Vector/GPU

Exploiting Regular (Data) Parallelism
Data Parallelism

 Concurrency arises from performing the same operations on different pieces of data
   Single instruction multiple data (SIMD)
   E.g., dot product of two vectors
 Contrast with data flow
   Concurrency arises from executing different operations in parallel (in a data-driven manner)
 Contrast with thread ("control") parallelism
   Concurrency arises from executing different threads of control in parallel
 SIMD exploits instruction-level parallelism
   Multiple instructions concurrent: the instructions happen to be the same

SIMD Processing

 Single instruction operates on multiple data elements
   In time or in space
 Multiple processing elements
 Time-space duality
   Array processor: instruction operates on multiple data elements at the same time
   Vector processor: instruction operates on multiple data elements in consecutive time steps
Array vs. Vector Processors

 Instruction stream:
   LD  VR ← A[3:0]
   ADD VR ← VR, 1
   MUL VR ← VR, 2
   ST  A[3:0] ← VR
 ARRAY PROCESSOR: same op @ same time, different ops @ same space
   All elements of one operation (LD0–LD3) execute in the same cycle, across the processing elements
 VECTOR PROCESSOR: different ops @ same time, same op @ same space
   Elements of one operation execute in consecutive cycles in the same functional unit (LD0; then LD1 with AD0; then LD2, AD1, MU0; …)

SIMD Array Processing vs. VLIW

 VLIW
   [Figure: a VLIW instruction issues multiple different, independent operations in the same cycle]
SIMD Array Processing vs. VLIW

 Array processor
   [Figure: the same operation is issued to all processing elements in the same cycle]

Vector Processors

 A vector is a one-dimensional array of numbers
 Many scientific/commercial programs use vectors
     for (i = 0; i <= 49; i++)
         C[i] = (A[i] + B[i]) / 2
 A vector processor is one whose instructions operate on vectors rather than scalar (single data) values
 Basic requirements
   Need to load/store vectors → vector registers (contain vectors)
   Need to operate on vectors of different lengths → vector length register (VLEN)
   Elements of a vector might be stored apart from each other in memory → vector stride register (VSTR)
     Stride: distance between two elements of a vector
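To make VLEN and VSTR concrete, here is a minimal C sketch (not from the lecture) of the scalar-equivalent semantics of a strided vector load such as VLD V0 = A; the function name and interface are illustrative only:

    #include <stddef.h>

    /* Scalar-equivalent semantics of a strided vector load "VLD V0 = A":
     * element i of the vector register comes from address A + i*VSTR,
     * and VLEN elements are transferred in total.                        */
    void vld(long v0[], const long *A, size_t vlen, size_t vstr)
    {
        for (size_t i = 0; i < vlen; i++)
            v0[i] = A[i * vstr];   /* hardware streams one element per cycle */
    }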
Vector Processors (II)

 A vector instruction performs an operation on each element in consecutive cycles
   Vector functional units are pipelined
   Each pipeline stage operates on a different data element
 Vector instructions allow deeper pipelines
   No intra-vector dependencies → no hardware interlocking needed within a vector
   No control flow within a vector
   Known stride allows prefetching of vectors into cache/memory

Vector Processor Advantages

+ No dependencies within a vector
   Pipelining and parallelization work well
   Can have very deep pipelines, no dependencies!
+ Each instruction generates a lot of work
   Reduces instruction fetch bandwidth
+ Highly regular memory access pattern
   Interleaving multiple banks for higher memory bandwidth
   Prefetching
+ No need to explicitly code loops
   Fewer branches in the instruction sequence
Vector Processor Disadvantages

-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
      -- How about searching for a key in a linked list?

 Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983.

Vector Processor Limitations

-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks
Vector Registers

 Each vector data register holds N M-bit values
 Vector control registers: VLEN, VSTR, VMASK
 Maximum VLEN can be N
   Maximum number of elements stored in a vector register
 Vector Mask Register (VMASK)
   Indicates which elements of the vector to operate on
   Set by vector test instructions
     e.g., VMASK[i] = (Vk[i] == 0)
 [Figure: vector registers V0 and V1 holding elements V0,0 … V0,N-1 and V1,0 … V1,N-1, each element M bits wide]

Vector Functional Units

 Use a deep pipeline (=> fast clock) to execute element operations
 Simplifies control of the deep pipeline because the elements in a vector are independent
 [Figure: a six-stage multiply pipeline computing V3 ← V1 * V2, with operands streamed from vector registers V1 and V2]
Vector Machine Organization (CRAY-1)

 CRAY-1
   Russell, "The CRAY-1 Computer System," CACM 1978.
 Scalar and vector modes
 8 64-element vector registers
 64 bits per element
 16 memory banks
 8 64-bit scalar registers
 8 24-bit address registers

Memory Banking

 Example: 16 banks; can start one bank access per cycle
 Bank latency: 11 cycles
 Can sustain 16 parallel accesses if they go to different banks
 [Figure: Banks 0–15, each with its own MDR/MAR, sharing data and address buses to the CPU]
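As a software analogy (not the CRAY-1's actual address mapping), a minimal sketch of word-interleaved banking under the 16-bank / 11-cycle parameters above; the function names are illustrative:

    #include <stdint.h>

    #define NUM_BANKS    16
    #define BANK_LATENCY 11   /* cycles a bank is busy serving one access */

    /* Word-interleaved mapping: consecutive word addresses hit consecutive banks. */
    static unsigned bank_of(uint64_t word_addr) {
        return (unsigned)(word_addr % NUM_BANKS);
    }

    /* Earliest cycle an access to word_addr can start, given when each bank
     * becomes free. With NUM_BANKS > BANK_LATENCY, a unit-stride stream can
     * start one access per cycle and keep all banks busy.                   */
    uint64_t issue_access(uint64_t word_addr, uint64_t now,
                          uint64_t bank_free[NUM_BANKS])
    {
        unsigned b = bank_of(word_addr);
        uint64_t start = (now > bank_free[b]) ? now : bank_free[b];
        bank_free[b] = start + BANK_LATENCY;
        return start;
    }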
Vector Memory System

 [Figure: base and stride registers feed an address generator, which issues accesses to the interleaved memory banks (0–F) and transfers data to/from the vector registers]

Scalar Code Example

 For i = 0 to 49
   C[i] = (A[i] + B[i]) / 2
 Scalar code (304 dynamic instructions):
      MOVI R0 = 50            1
      MOVA R1 = A             1
      MOVA R2 = B             1
      MOVA R3 = C             1
   X: LD R4 = MEM[R1++]       11   ;autoincrement addressing
      LD R5 = MEM[R2++]       11
      ADD R6 = R4 + R5        4
      SHFR R7 = R6 >> 1       1
      ST MEM[R3++] = R7       11
      DECBNZ --R0, X          2    ;decrement and branch if NZ
Scalar Code Execution Time

 Scalar execution time on an in-order processor with 1 bank
   First two loads in the loop cannot be pipelined: 2*11 cycles
   Loop body: 11 + 11 + 4 + 1 + 11 + 2 = 40 cycles per iteration
   4 + 50*40 = 2004 cycles
 Scalar execution time on an in-order processor with 16 banks (word-interleaved)
   First two loads in the loop can be pipelined, so the loads cost 11 + 1 instead of 11 + 11 cycles: 30 cycles per iteration
   4 + 50*30 = 1504 cycles
 Why 16 banks?
   11-cycle memory access latency
   Having 16 (> 11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency

Vectorizable Loops

 A loop is vectorizable if each iteration is independent of any other
 For i = 0 to 49
   C[i] = (A[i] + B[i]) / 2
 Vectorized loop (7 dynamic instructions):
   MOVI VLEN = 50          1
   MOVI VSTR = 1           1
   VLD V0 = A              11 + VLEN - 1
   VLD V1 = B              11 + VLEN - 1
   VADD V2 = V0 + V1       4 + VLEN - 1
   VSHFR V3 = V2 >> 1      1 + VLEN - 1
   VST C = V3              11 + VLEN - 1
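To illustrate the independence requirement, a small C example (not from the slides) contrasting a vectorizable loop with one that carries a dependence across iterations:

    void example(int *A, const int *B, int *C, int n)
    {
        /* Vectorizable: every iteration is independent of every other. */
        for (int i = 0; i < n; i++)
            C[i] = (A[i] + B[i]) / 2;

        /* NOT vectorizable as written: iteration i reads A[i-1], the value
         * produced by iteration i-1 (a loop-carried dependence).          */
        for (int i = 1; i < n; i++)
            A[i] = A[i - 1] + B[i];
    }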
Vector Code Performance

 No chaining
   i.e., the output of a vector functional unit cannot be used as the input of another (i.e., no vector data forwarding)
 One memory port (one address generator)
 16 memory banks (word-interleaved)
 [Figure: execution timeline; each vector instruction completes before the next dependent one starts]
 285 cycles
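One way to account for the 285 cycles (my arithmetic, consistent with the per-instruction latencies above): with no chaining and a single memory port, the seven instructions serialize, and with VLEN = 50 each vector instruction takes its startup latency plus VLEN - 1 cycles:

    1 + 1 + (11 + 49) + (11 + 49) + (4 + 49) + (1 + 49) + (11 + 49) = 285 cycles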
Vector Chaining

 Vector chaining: data forwarding from one vector functional unit to another
   LV   V1
   MULV V3, V1, V2
   ADDV V5, V3, V4
 [Figure: the load unit chains into the multiplier (via V1), and the multiplier chains into the adder (via V3), so dependent vector instructions overlap element by element]
Vector Code Performance – Chaining

 Vector chaining: data forwarding from one vector functional unit to another
 Strict assumption: each memory bank has a single port (memory bandwidth bottleneck)
 The two VLDs cannot be pipelined. WHY?
 The VLD and VST cannot be pipelined. WHY?
 [Figure: timeline with segments 1, 1, 11 + 49, 11 + 49, 4 + 49, 1 + 49, 11 + 49; the arithmetic instructions chain, but the memory instructions serialize]
 182 cycles

Vector Code Performance – Multiple Memory Ports

 Chaining and 2 load ports, 1 store port in each bank
 79 cycles
Questions (I)

 What if # data elements > # elements in a vector register?
   Need to break loops so that each iteration operates on # elements in a vector register
     E.g., 527 data elements, 64-element VREGs
     8 iterations where VLEN = 64
     1 iteration where VLEN = 15 (need to change the value of VLEN)
   Called vector strip-mining (see the C sketch below)
 What if vector data is not stored in a strided fashion in memory? (irregular memory access to a vector)
   Use indirection to combine elements into vector registers
   Called scatter/gather operations
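A minimal C sketch of strip-mining (not from the slides), assuming a 64-element vector register; set_vlen() and vadd_shr1() are hypothetical stand-ins for setting VLEN and executing one vector operation:

    #define MAX_VLEN 64

    /* Hypothetical intrinsics: set the vector length register, and perform
     * c[i] = (a[i] + b[i]) >> 1 over the current VLEN elements.           */
    void set_vlen(int vlen);
    void vadd_shr1(int *c, const int *a, const int *b);

    void strip_mined(int *c, const int *a, const int *b, int n)
    {
        for (int i = 0; i < n; i += MAX_VLEN) {
            int vlen = (n - i < MAX_VLEN) ? (n - i) : MAX_VLEN;
            set_vlen(vlen);                 /* last strip uses a shorter VLEN */
            vadd_shr1(c + i, a + i, b + i);
        }
        /* e.g., n = 527: eight strips with VLEN = 64, one strip with VLEN = 15 */
    }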
Gather/Scatter Operations

 Want to vectorize loops with indirect accesses:
     for (i = 0; i < N; i++)
         A[i] = B[i] + C[D[i]]
 Indexed load instruction (Gather):
     LV     vD, rD       # Load indices in D vector
     LVI    vC, rC, vD   # Load indirect from rC base
     LV     vB, rB       # Load B vector
     ADDV.D vA, vB, vC   # Do add
     SV     vA, rA       # Store result
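For reference, a scalar-equivalent C sketch (not from the slides) of what an indexed load (gather) and indexed store (scatter) do:

    /* Gather: dst[i] = base[idx[i]]  (scalar equivalent of LVI vC, rC, vD) */
    void gather(double *dst, const double *base, const int *idx, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = base[idx[i]];
    }

    /* Scatter: base[idx[i]] = src[i]  (the indexed-store counterpart) */
    void scatter(double *base, const int *idx, const double *src, int n)
    {
        for (int i = 0; i < n; i++)
            base[idx[i]] = src[i];
    }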

Gather/Scatter Operations

 Gather/scatter operations are often implemented in hardware to handle sparse matrices
 Vector loads and stores use an index vector which is added to the base register to generate the addresses
   Index vector:  1, 3, 7, 8
   Data vector (stored):  3.14, 6.5, 71.2, 2.71
   Equivalent (dense) vector:  3.14, 0.0, 6.5, 0.0, 0.0, 0.0, 0.0, 71.2, 2.71

Conditional Operations in a Loop

 What if some operations should not be executed on a vector (based on a dynamically determined condition)?
     loop:  if (a[i] != 0) then b[i] = a[i] * b[i]
            goto loop
 Idea: masked operations
   VMASK register is a bit mask determining which data elements should not be acted upon
     VLD V0 = A
     VLD V1 = B
     VMASK = (V0 != 0)
     VMUL V1 = V0 * V1
     VST B = V1
   Does this look familiar? This is essentially predicated execution.
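A C sketch (not from the slides) of the masked multiply above, with the mask held in a plain array; n is assumed not to exceed the vector length:

    #define VLEN 64

    /* Compute VMASK = (a != 0), then multiply only where the mask is set. */
    void masked_mul(int *b, const int *a, int n)
    {
        unsigned char vmask[VLEN];

        for (int i = 0; i < n; i++)      /* VMASK = (V0 != 0) */
            vmask[i] = (a[i] != 0);

        for (int i = 0; i < n; i++)      /* VMUL V1 = V0 * V1, under VMASK */
            if (vmask[i])
                b[i] = a[i] * b[i];
    }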
Another Example with Masking

     for (i = 0; i < 64; ++i)
         if (a[i] >= b[i]) then c[i] = a[i]
         else c[i] = b[i]

 Steps to execute the loop:
   1. Compare A, B to get VMASK
   2. Masked store of A into C
   3. Complement VMASK
   4. Masked store of B into C

   A    B    VMASK
   1    2    0
   2    2    1
   3    2    1
   4    10   0
   -5   -4   0
   0    -3   1
   6    5    1
   -7   -8   1

Masked Vector Instructions

 Simple implementation
   Execute all N operations; turn off result writeback according to the mask
 Density-time implementation
   Scan the mask vector and only execute elements with non-zero mask bits
 [Figure: the simple implementation gates the write enable of every element (M[7]..M[0]) at the write data port; the density-time implementation compacts the active elements (e.g., only C[5], C[4], C[2], C[1]) before the write data port]
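As a software analogy (not the hardware datapath), a C sketch contrasting the two implementations of a masked vector add:

    /* Simple implementation: perform all n operations, but gate writeback. */
    void masked_simple(int *c, const int *a, const int *b,
                       const unsigned char *mask, int n)
    {
        for (int i = 0; i < n; i++) {
            int result = a[i] + b[i];   /* work is done for every element     */
            if (mask[i])
                c[i] = result;          /* ...but only masked-in results land */
        }
    }

    /* Density-time implementation: spend time only on elements whose mask
     * bit is set (time proportional to the density of the mask).           */
    void masked_density_time(int *c, const int *a, const int *b,
                             const unsigned char *mask, int n)
    {
        for (int i = 0; i < n; i++)
            if (mask[i])
                c[i] = a[i] + b[i];
    }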

Some Issues

 Stride and banking
   As long as the stride and the number of banks are relatively prime to each other, and there are enough banks to cover the bank access latency, consecutive accesses proceed in parallel
 Storage of a matrix
   Row major: consecutive elements in a row are laid out consecutively in memory
   Column major: consecutive elements in a column are laid out consecutively in memory
   You need to change the stride when accessing a row versus a column
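A small C illustration (not from the slides) of why the stride changes: with row-major storage, walking a row is stride 1 while walking a column is stride N, which is also why strides sharing a factor with the number of banks hurt bank parallelism:

    #define N 64

    /* Row-major N x N matrix: element (r, c) lives at M[r*N + c]. */
    double row_sum(const double *M, int r)
    {
        double s = 0.0;
        for (int c = 0; c < N; c++)
            s += M[r * N + c];     /* consecutive addresses: stride 1 (VSTR = 1) */
        return s;
    }

    double col_sum(const double *M, int c)
    {
        double s = 0.0;
        for (int r = 0; r < N; r++)
            s += M[r * N + c];     /* addresses N apart: stride N (VSTR = N) */
        return s;
    }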
Array vs. Vector Processors, Revisited

 Array vs. vector processor distinction is a "purist's" distinction
 Most "modern" SIMD processors are a combination of both
   They exploit data parallelism in both time and space

Remember: Array vs. Vector Processors

 [Figure: repeat of the earlier comparison; the instruction stream LD VR ← A[3:0]; ADD VR ← VR, 1; MUL VR ← VR, 2; ST A[3:0] ← VR executes with the same op at the same time across space on an array processor, and with the same op per functional unit across consecutive time steps on a vector processor]
Vector Instruction Execution

 ADDV C, A, B
 [Figure: execution using one pipelined functional unit (one element pair A[i], B[i] enters the pipeline per cycle) vs. execution using four pipelined functional units (four element pairs enter per cycle, e.g. C[0]–C[3] complete together)]

Vector Unit Structure

 [Figure: the vector unit is partitioned into lanes; each lane contains a slice of the vector registers (elements 0, 4, 8, … in lane 0; elements 1, 5, 9, … in lane 1; and so on), a pipeline of each functional unit, and a port to the memory subsystem]
Vector Instruction Level Parallelism

 Can overlap execution of multiple vector instructions
   Example machine has 32 elements per vector register and 8 lanes
   Complete 24 operations/cycle while issuing 1 short instruction/cycle
 [Figure: the load unit, multiply unit, and add unit each work on a different vector instruction at the same time; a new instruction is issued as soon as a unit frees up]

Automatic Code Vectorization

     for (i = 0; i < N; i++)
         C[i] = A[i] + B[i];

 Scalar sequential code: each iteration executes load, load, add, store in sequence
 Vectorized code: one vector load, vector load, vector add, vector store cover many iterations at once
 Vectorization is a compile-time reordering of operation sequencing
   Requires extensive loop dependence analysis
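As a present-day illustration (not from the slides), a loop that compilers can typically auto-vectorize; the OpenMP simd pragma is one portable way to assert that iterations are independent, and with GCC an optimization level such as -O3 enables the vectorizer (flags and behavior vary by compiler):

    void vec_add(float *restrict C, const float *restrict A,
                 const float *restrict B, int n)
    {
        /* restrict pointers + independent iterations let the compiler emit SIMD code */
        #pragma omp simd
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }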

Vector/SIMD Processing Summary

 Vector/SIMD machines are good at exploiting regular data-level parallelism
   Same operation performed on many data elements
   Improve performance, simplify design (no intra-vector dependencies)
 Performance improvement limited by the vectorizability of the code
   Scalar operations limit vector machine performance
   Amdahl's Law
   CRAY-1 was the fastest SCALAR machine of its time!
 Many existing ISAs include (vector-like) SIMD operations
   Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD

SIMD Operations in Modern ISAs
Intel Pentium MMX Operations

 Idea: one instruction operates on multiple data elements simultaneously
   À la array processing (yet much more limited)
   Designed with multimedia (graphics) operations in mind
 No VLEN register
 Opcode determines the data type:
   8 8-bit bytes
   4 16-bit words
   2 32-bit doublewords
   1 64-bit quadword
 Stride is always equal to 1
 Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 1996.

MMX Example: Image Overlaying (I)

 [Figure: image overlay example]
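A sketch of the kind of operation the overlay example performs (my reconstruction, written with the newer 128-bit SSE2 intrinsics rather than the original 64-bit MMX forms): foreground pixels that match a chroma-key value are replaced by the background image, 16 bytes per instruction; n is assumed to be a multiple of 16.

    #include <emmintrin.h>   /* SSE2: 128-bit integer SIMD, successor to MMX */

    void overlay(unsigned char *out, const unsigned char *fg,
                 const unsigned char *bg, unsigned char key, int n)
    {
        __m128i vkey = _mm_set1_epi8((char)key);
        for (int i = 0; i < n; i += 16) {
            __m128i f    = _mm_loadu_si128((const __m128i *)(fg + i));
            __m128i b    = _mm_loadu_si128((const __m128i *)(bg + i));
            __m128i mask = _mm_cmpeq_epi8(f, vkey);          /* 0xFF where keyed */
            __m128i blend = _mm_or_si128(_mm_and_si128(mask, b),     /* bg where keyed  */
                                         _mm_andnot_si128(mask, f)); /* fg elsewhere    */
            _mm_storeu_si128((__m128i *)(out + i), blend);
        }
    }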

MMX Example: Image Overlaying (II)

 [Figure: continuation of the image overlay example]

Graphics Processing Units
SIMD not Exposed to Programmer (SIMT)
High-Level View of a GPU

 [Figure: high-level GPU organization: many cores, each with a SIMD pipeline executing groups of threads]

Concept of "Thread Warps" and SIMT

 Warp: a set of threads that execute the same instruction (on different data elements) → SIMT (Nvidia-speak)
 All threads run the same kernel
 Warp: the threads that run lengthwise in a woven fabric …
 [Figure: a thread warp with a common PC groups scalar threads W, X, Y, Z feeding one SIMD pipeline; several warps (e.g., Thread Warps 3, 7, 8) are available for scheduling]
Loop Iterations as Threads

     for (i = 0; i < N; i++)
         C[i] = A[i] + B[i];

 [Figure: the same loop shown as scalar sequential code (one iteration after another) and as vectorized code (one vector instruction per operation); in the SIMT view, each iteration becomes a thread]

SIMT Memory Access

 Same instruction in different threads uses the thread id to index and access different data elements
 Let's assume N = 16, blockDim = 4 → 4 blocks
 [Figure: elements 0–15 are split into four blocks of four; each block of threads performs the + on its own four elements]
Sample GPU SIMT Code (Simplified) / Sample GPU Program (Less Simplified)

 CPU code
     for (ii = 0; ii < 100; ++ii) {
         C[ii] = A[ii] + B[ii];
     }

 CUDA code
     // there are 100 threads
     __global__ void KernelFunction(…) {
         int tid = blockDim.x * blockIdx.x + threadIdx.x;
         int varA = aa[tid];
         int varB = bb[tid];
         C[tid] = varA + varB;
     }
Latency Hiding with "Thread Warps"

 Warp: a set of threads that execute the same instruction (on different data elements)
 Fine-grained multithreading
   One instruction per thread in the pipeline at a time (no branch prediction)
   Interleave warp execution to hide latencies
 Register values of all threads stay in the register file
 No OS context switching
 Memory latency hiding
   Graphics has millions of pixels
 [Figure: SIMD pipeline (I-Fetch, Decode, RF, ALUs, D-Cache, Writeback); warps available for scheduling are fetched in turn; warps whose loads miss wait while accessing the memory hierarchy; warps whose loads all hit write back]
 Slide credit: Tor Aamodt

Warp-based SIMD vs. Traditional SIMD

 Traditional SIMD contains a single thread
   Lock step
   Programming model is SIMD (no threads) → SW needs to know the vector length
   ISA contains vector/SIMD instructions
 Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i.e., the same instruction executed by all threads)
   Does not have to be lock step
   Each thread can be treated individually (i.e., placed in a different warp) → programming model is not SIMD
     SW does not need to know the vector length
     Enables memory and branch latency tolerance
   ISA is scalar → vector instructions formed dynamically
   Essentially, it is the SPMD programming model implemented on SIMD hardware
SPMD

 Single procedure/program, multiple data
   This is a programming model rather than a computer organization
 Each processing element executes the same procedure, except on different data elements
   Procedures can synchronize at certain points in the program, e.g., barriers
 Essentially, multiple instruction streams execute the same program
   Each program/procedure can 1) execute a different control-flow path, 2) work on different data, at run time
   Many scientific applications are programmed this way and run on MIMD computers (multiprocessors)
   Modern GPUs are programmed in a similar way on a SIMD computer

Branch Divergence Problem in Warp-based SIMD

 SPMD execution on SIMD hardware
   NVIDIA calls this "Single Instruction, Multiple Thread" ("SIMT") execution
 [Figure: a control-flow graph A → B → {C, D, F} → E → G; a thread warp with a common PC contains threads 1–4, which may take different paths through the graph]
Control Flow Problem in GPUs/SIMD

 GPU uses a SIMD pipeline to save area on control logic
   Group scalar threads into warps
 Branch divergence occurs when threads inside a warp branch to different execution paths
 [Figure: at a branch, some threads of the warp follow Path A and the others follow Path B; the common-PC warp can no longer issue one instruction for all threads]
 Slide credit: Tor Aamodt

Branch Divergence Handling (I)

 [Figure: a reconvergence stack handles divergence; each entry holds a reconvergence PC, a next PC, and an active mask (e.g., A/1111, B/1111, C/1001, D/0110, E/1111, G/1111); the two sides of the branch (C with mask 1001, D with mask 0110) execute one after the other, and the warp reconverges at E; over time the warp executes A, B, C, D, E, G]
 Slide credit: Tor Aamodt
Dynamic Warp Formation

 Idea: dynamically merge threads executing the same instruction (after branch divergence)
 Form new warps at divergence
   Enough threads branching to each path to create full new warps
 [Figure: after the branch, threads from different warps that took Path A are regrouped into new warps, and likewise for Path B]

Dynamic Warp Formation/Merging

 Idea: dynamically merge threads executing the same instruction (after branch divergence)
 Fung et al., "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," MICRO 2007.
Dynamic Warp Formation Example

 [Figure: a control-flow graph A–G with per-block active masks for two warps x and y (A: x/1111, y/1111; B: x/1110, y/0011; C: x/1000, y/0010; D: x/0110, y/0001; F: x/0001, y/1100; E: x/1110, y/0011; G: x/1111, y/1111); at basic block D a new warp is created from the scalar threads of both Warp x and Warp y]
 Baseline execution: A A B B C C D D E E F F G G A A
 With dynamic warp formation: A A B B C D E E F G G A A (fewer issue slots, finishes earlier)

What About Memory Divergence?

 Modern GPUs have caches
   Ideally: want all threads in the warp to hit (without conflicting with each other)
 Problem: one thread in a warp can stall the entire warp if it misses in the cache
 Need techniques to
   Tolerate memory divergence
   Integrate solutions to branch and memory divergence
NVIDIA GeForce GTX 285

 NVIDIA-speak:
   240 stream processors
   "SIMT execution"
 Generic speak:
   30 cores
   8 SIMD functional units per core

NVIDIA GeForce GTX 285 "core"

 64 KB of storage for fragment contexts (registers)
 [Figure: one core contains 8 SIMD functional units (multiply-add) plus two multiply units, an instruction stream decoder with control shared across the 8 units, and execution context storage]

NVIDIA GeForce GTX 285 "core"

 64 KB of storage for thread contexts (registers)
 Groups of 32 threads share an instruction stream (each such group is a warp)
 Up to 32 warps are simultaneously interleaved
 Up to 1024 thread contexts can be stored

NVIDIA GeForce GTX 285

 [Figure: the full chip: 30 such cores, grouped with texture (Tex) units]
 There are 30 of these things on the GTX 285: 30,720 threads