Array vs. Vector Processors

[Figure: an array processor applies the same operation to all elements at the same time, exploiting parallelism in space; a vector processor applies the same operation to the elements one after another, exploiting parallelism in time.]

SIMD Array Processing vs. VLIW

[Figure: a VLIW instruction packs different, independent operations into one long instruction word; a SIMD array instruction applies one operation across space.]

Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.
Vector Processors (II)

A vector instruction performs an operation on each element in consecutive cycles
- Vector functional units are pipelined
- Each pipeline stage operates on a different data element

Vector instructions allow deeper pipelines
- No intra-vector dependencies → no hardware interlocking needed within a vector
- No control flow within a vector
- Known stride allows prefetching of vectors into cache/memory

Vector Processor Advantages

+ No dependencies within a vector
- Pipelining and parallelization work well
- Can have very deep pipelines; no dependencies!

+ Each instruction generates a lot of work
- Reduces instruction fetch bandwidth

+ Highly regular memory access pattern
- Interleaving multiple banks for higher memory bandwidth
- Prefetching
Vector Registers

Each vector data register holds N M-bit values
Vector control registers: VLEN, VSTR, VMASK
Maximum VLEN can be N
- Maximum number of elements stored in a vector register
Vector Mask Register (VMASK)
- Indicates which elements of the vector to operate on
- Set by vector test instructions
  e.g., VMASK[i] = (Vk[i] == 0)

[Figure: a vector register file; each register holds elements V0,0 … V0,N-1, each entry M bits wide.]

Vector Functional Units

Use a deep pipeline (=> fast clock) to execute element operations
Simplifies control of a deep pipeline because elements in a vector are independent

[Figure: a six-stage multiply pipeline computing V3 <- V1 * V2, with one element pair from V1 and V2 entering per cycle.]
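To make the effect of VMASK concrete, here is a minimal software model of a masked vector add, assuming per-element predication as described above (the function name and types are illustrative):

    // Masked vector add: C[i] = A[i] + B[i] only where VMASK[i] is set.
    // Hardware streams all elements through the pipeline; the mask simply
    // suppresses write-back for disabled elements.
    void vadd_masked(const int *A, const int *B, int *C,
                     const unsigned char *VMASK, int VLEN) {
        for (int i = 0; i < VLEN; ++i) {
            if (VMASK[i])            // element enabled by a vector test instruction
                C[i] = A[i] + B[i];  // disabled elements keep their old values
        }
    }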
Vector Memory System

[Figure: a vector memory system; Base and Stride registers drive an Address Generator, which spreads accesses across 16 interleaved memory banks (0–F) feeding the vector registers.]

Scalar Code Example

For i = 0 to 49
  C[i] = (A[i] + B[i]) / 2

Scalar code (304 dynamic instructions):

   MOVI R0 = 50          1
   MOVA R1 = A           1
   MOVA R2 = B           1
   MOVA R3 = C           1
X: LD R4 = MEM[R1++]     11   ;autoincrement addressing
   LD R5 = MEM[R2++]     11
   ADD R6 = R4 + R5      4
   SHFR R7 = R6 >> 1     1
   ST MEM[R3++] = R7     11
   DECBNZ --R0, X        2    ;decrement and branch if NZ

(The right-hand column is each instruction's latency in cycles. 4 setup instructions + 50 iterations × 6 loop-body instructions = 304 dynamic instructions.)
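For contrast, here is how the same loop could look to a vector ISA, modeled in C. Each function below stands in for one vector instruction (the names VLD, VADD, VSHFR, VST are illustrative); with VLEN = 50 and VSTR = 1, the whole computation is a handful of vector instructions instead of 304 scalar ones:

    enum { VLEN = 50 };   // vector length, set once for the whole loop

    // One "vector instruction" per function; hardware would stream the
    // elements through a pipelined functional unit, one per cycle.
    void VLD  (int *V, const int *mem)  { for (int i = 0; i < VLEN; ++i) V[i] = mem[i]; }
    void VADD (int *Vd, const int *Va, const int *Vb)
                                        { for (int i = 0; i < VLEN; ++i) Vd[i] = Va[i] + Vb[i]; }
    void VSHFR(int *Vd, const int *Va)  { for (int i = 0; i < VLEN; ++i) Vd[i] = Va[i] >> 1; }
    void VST  (int *mem, const int *V)  { for (int i = 0; i < VLEN; ++i) mem[i] = V[i]; }

    void vector_version(const int *A, const int *B, int *C) {
        int V0[VLEN], V1[VLEN], V2[VLEN], V3[VLEN];
        VLD(V0, A);          // load 50 elements of A
        VLD(V1, B);          // load 50 elements of B
        VADD(V2, V0, V1);    // A[i] + B[i]
        VSHFR(V3, V2);       // ... >> 1, i.e., divide by 2
        VST(C, V3);          // store 50 results
    }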
Vector Chaining

[Figure: chaining on a CRAY-like machine; the Load Unit streams elements from Memory directly into the Mult. unit, whose results chain into the Add unit, so dependent vector instructions overlap.]

Vector Code Performance

Without chaining, each vector instruction waits for the previous one to complete:
1 + 1 + (11 + 49) + (11 + 49) + (4 + 49) + (1 + 49) + (11 + 49) = 285 cycles
(set VLEN and VSTR, then VLD, VLD, VADD, VSHFR, VST, each paying its startup latency plus one cycle for each of the 49 remaining elements).
Vector Code Performance - Chaining

Vector chaining: data forwarding from one vector functional unit to another, so a dependent vector instruction can start as soon as the first element of its source arrives.

Strict assumption: each memory bank has a single port, so memory bandwidth is the bottleneck. This is why the two VLDs cannot be pipelined with each other: they contend for the banks' single port.

[Figure: timing diagram; after the 2 setup instructions, the second VLD (11 + 49 cycles) starts only when the first VLD (11 + 49 cycles) finishes, while the VADD (4 + 49) and VSHFR (1 + 49) chain off the second VLD.]

Vector Code Performance – Multiple Memory Ports

Chaining plus 2 load ports and 1 store port in each bank removes the memory bottleneck: the two VLDs proceed in parallel and the VST overlaps with the computation.

[Figure: timing diagram; both VLDs issue back-to-back and all dependent instructions chain, leaving mostly the startup latencies plus the 49 remaining element cycles.]
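A toy timing model of the single-port machine, in C, using the startup latencies from the scalar example (load 11, add 4, shift 1, store 11) and 50 elements. This is a sketch of the cost model implied above, not a cycle-accurate simulation:

    #include <stdio.h>

    enum { N = 50 };   // VLEN; one element per cycle after startup

    int main(void) {
        int ld = 11, add = 4, shf = 1, st = 11;

        // No chaining, single memory port: every vector instruction waits
        // for the previous one, costing startup + (N - 1) cycles each.
        int serial = 1 + 1                          /* MOVI VLEN, MOVI VSTR */
                   + (ld + N - 1) + (ld + N - 1)    /* VLD A, VLD B        */
                   + (add + N - 1)                  /* VADD                */
                   + (shf + N - 1)                  /* VSHFR               */
                   + (st + N - 1);                  /* VST                 */
        printf("no chaining, 1 port per bank: %d cycles\n", serial);  // 285

        // Chaining, single port: VADD and VSHFR hide under the memory
        // operations, but VLD A, VLD B, and VST still serialize on the port.
        int chained = 1 + 1
                    + (ld + N - 1)    /* VLD A holds the port   */
                    + (ld + N - 1)    /* VLD B holds the port   */
                    + (st + N - 1);   /* VST after port is free */
        printf("chaining, 1 port per bank:    %d cycles\n", chained);
        return 0;
    }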
Some Issues

Stride and banking
- As long as the stride and the number of banks are relatively prime to each other, and there are enough banks to cover the bank access latency, consecutive accesses proceed in parallel

Storage of a matrix
- Row major: consecutive elements in a row are laid out consecutively in memory
- Column major: consecutive elements in a column are laid out consecutively in memory
- You need to change the stride when accessing a row versus a column
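The relative-primality condition can be checked with a few lines of C. Low-order interleaving (bank = address mod number of banks) and the bank count of 16 are assumptions for illustration:

    #include <stdio.h>

    #define NBANKS 16

    // Count how many distinct banks a strided access stream touches.
    // A stride sharing a factor with NBANKS hammers a subset of banks
    // and serializes; a relatively prime stride cycles through them all.
    void probe(int stride) {
        int used[NBANKS] = {0}, distinct = 0;
        for (int i = 0; i < NBANKS; ++i) {
            int bank = (i * stride) % NBANKS;   // low-order interleaving
            if (!used[bank]) { used[bank] = 1; ++distinct; }
        }
        printf("stride %2d -> %2d of %d banks used\n", stride, distinct, NBANKS);
    }

    int main(void) {
        probe(1);   // unit stride (e.g., row access): all 16 banks in parallel
        probe(7);   // 7 is relatively prime to 16: still all 16 banks
        probe(8);   // shares factor 8 with 16: only 2 banks, severe conflicts
        return 0;
    }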
Array vs. Vector Processors, Revisited

Array vs. vector processor distinction is a “purist’s” distinction
Most “modern” SIMD processors are a combination of both
- They exploit data parallelism in both time and space

Remember: Array vs. Vector Processors

Instruction stream:
  LD  VR <- A[3:0]
  ADD VR <- VR, 1
  MUL VR <- VR, 2
  ST  A[3:0] <- VR

ARRAY PROCESSOR (same op @ same time; different ops @ time):

  t0: LD0  LD1  LD2  LD3
  t1: AD0  AD1  AD2  AD3
  t2: MU0  MU1  MU2  MU3
  t3: ST0  ST1  ST2  ST3

VECTOR PROCESSOR (same op @ space; different ops @ same space):

  t0: LD0
  t1: LD1  AD0
  t2: LD2  AD1  MU0
  t3: LD3  AD2  MU1  ST0
  t4:      AD3  MU2  ST1
  t5:           MU3  ST2
  t6:                ST3

(Space runs horizontally; time runs downward.)
Vector Unit Structure

[Figure: a four-lane vector unit; the vector registers are partitioned across lanes, with lane 0 holding elements 0, 4, 8, …, lane 1 elements 1, 5, 9, …, lane 2 elements 2, 6, 10, …, and lane 3 elements 3, 7, 11, …; operand pairs (A[3] B[3] through A[27] B[27]) stream through the lanes’ functional units in parallel.]

Automatic Code Vectorization

[Figure: scalar sequential code performs load, load, add, store per iteration, one iteration at a time; vectorized code issues one vector load, vector load, vector add, and vector store covering all iterations, overlapped in time.]

Vectorization is a compile-time reordering of operation sequencing
- Requires extensive loop dependence analysis
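Why the dependence analysis matters, sketched in C (array names illustrative). The first loop can be reordered into vector instructions; the second cannot, because each iteration consumes the previous iteration’s result:

    // Vectorizable: iteration i touches only element i, so the compiler
    // may legally reorder the operations into VLD/VADD/VST over vectors.
    void independent(int *c, const int *a, const int *b, int n) {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

    // Not vectorizable as written: c[i] reads c[i-1], a loop-carried
    // dependence. Doing all the adds "at once" would read stale values.
    void carried(int *c, int n) {
        for (int i = 1; i < n; ++i)
            c[i] = c[i - 1] + c[i];
    }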
High-Level View of a GPU

[Figure: a GPU as an array of cores, each with its own SIMD pipeline, in front of a shared memory hierarchy.]

Concept of “Thread Warps” and SIMT

Warp: A set of threads that execute the same instruction (on different data elements) → SIMT (Nvidia-speak)
All threads run the same kernel
Warp: The threads that run lengthwise in a woven fabric …

[Figure: Thread Warps 3, 7, 8, … wait to issue; the threads of one warp (scalar threads W, X, Y, Z) share a common PC and feed a SIMD pipeline.]
[Figure: the loop from before mapped onto SIMD execution; one vector instruction per operation (load, load, add, store) covers elements 0–15 at once, instead of iterating one element at a time.]
Sample GPU SIMT Code (Simplified)

CPU code:

  for (ii = 0; ii < 100; ++ii) {
    C[ii] = A[ii] + B[ii];
  }

CUDA code:

  // there are 100 threads
  __global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int varA = aa[tid];
    int varB = bb[tid];
    C[tid] = varA + varB;
  }

Sample GPU Program (Less Simplified)
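A minimal host-side sketch of how such a kernel is typically set up and launched; the names, block size, and added bounds check are illustrative, not from the original program:

  #include <cuda_runtime.h>

  __global__ void vecAdd(const int *aa, const int *bb, int *C, int n) {
      int tid = blockDim.x * blockIdx.x + threadIdx.x;
      if (tid < n)                  // guard: the grid may overshoot n
          C[tid] = aa[tid] + bb[tid];
  }

  void launch(const int *dA, const int *dB, int *dC, int n) {
      int threads = 128;                            // threads per block
      int blocks  = (n + threads - 1) / threads;    // ceil(n / threads)
      vecAdd<<<blocks, threads>>>(dA, dB, dC, n);   // each thread handles one ii
      cudaDeviceSynchronize();                      // wait for the kernel
  }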
Latency Hiding with “Thread Warps”

Warp: A set of threads that execute the same instruction (on different data elements)

Fine-grained multithreading
- One instruction per thread in the pipeline at a time (no branch prediction)
- Interleave warp execution to hide latencies

Register values of all threads stay in the register file
No OS context switching
Memory latency hiding
- Graphics has millions of pixels

[Figure: warps available for scheduling (Thread Warps 3, 7, 8) feed a SIMD pipeline of I-Fetch, Decode, RF, ALUs; on a D-Cache miss the warp joins the warps accessing the memory hierarchy (Thread Warps 1, 2, 6) while other warps issue; when all accesses hit, the data returns and the warp writes back.]

Slide credit: Tor Aamodt

Warp-based SIMD vs. Traditional SIMD

Traditional SIMD contains a single thread
- Lock step
- Programming model is SIMD (no threads) → SW needs to know the vector length
- ISA contains vector/SIMD instructions

Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i.e., the same instruction executed by all threads)
- Each thread can be treated individually (i.e., placed in a different warp) → programming model is not SIMD
  - SW does not need to know the vector length
  - Enables memory and branch latency tolerance
- ISA is scalar → vector instructions formed dynamically
- Essentially, it is the SPMD programming model implemented on SIMD hardware
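The contrast, sketched as code. The first function bakes a vector length into the software (4-wide chunks standing in for SIMD instructions; the width is illustrative); the second is the warp-based style, a scalar per-thread kernel in which no vector length appears:

  // Traditional SIMD style: software is written around a known vector
  // length (4 here), including a remainder loop the SW must handle.
  void simd_style_add(const int *a, const int *b, int *c, int n) {
      for (int i = 0; i + 4 <= n; i += 4) {   // one "SIMD instruction"
          c[i+0] = a[i+0] + b[i+0];
          c[i+1] = a[i+1] + b[i+1];
          c[i+2] = a[i+2] + b[i+2];
          c[i+3] = a[i+3] + b[i+3];
      }
      for (int i = n & ~3; i < n; ++i)        // leftover elements
          c[i] = a[i] + b[i];
  }

  // Warp-based SIMD style: each thread is scalar; the hardware groups
  // 32 threads into a warp, and the warp width never appears in the code.
  __global__ void warp_style_add(const int *a, const int *b, int *c, int n) {
      int tid = blockDim.x * blockIdx.x + threadIdx.x;
      if (tid < n)
          c[tid] = a[tid] + b[tid];
  }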
SPMD

Single procedure/program, multiple data
- This is a programming model rather than a computer organization

Each processing element executes the same procedure, except on different data elements
- Procedures can synchronize at certain points in the program, e.g. barriers

SPMD Execution on SIMD Hardware

NVIDIA calls this “Single Instruction, Multiple Thread” (“SIMT”) execution

Branch Divergence Problem in Warp-based SIMD
[Figure: a control-flow graph A through G; at a divergent branch some threads of the warp take Path A and others Path B, tracked by per-path thread masks such as x/1111 and y/1111, and the paths execute one after the other over time.]

Need techniques to
- Tolerate memory divergence
- Integrate solutions to branch and memory divergence

[Figure: warp scheduling over time; the Baseline issues warps A A B B C C D D E E F F G G A A, while Dynamic Warp Formation compacts partially full warps into A A B B C D E E F G G A A, finishing sooner.]
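What divergence looks like at the source level, as a hedged CUDA sketch (the kernel and its branch condition are illustrative). Odd and even lanes of every 32-thread warp take different sides of the branch, so the hardware serializes the two paths with complementary masks:

  // Every warp diverges: Path A runs with the odd lanes masked off, then
  // Path B with the even lanes masked off. SIMD utilization drops to 50%.
  __global__ void divergent(const int *in, int *out, int n) {
      int tid = blockDim.x * blockIdx.x + threadIdx.x;
      if (tid >= n) return;
      if (tid % 2 == 0)
          out[tid] = in[tid] * 2;   // Path A: even lanes
      else
          out[tid] = in[tid] + 7;   // Path B: odd lanes
  }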
NVIDIA GeForce GTX 285

NVIDIA-speak:
- 240 stream processors
- “SIMT execution”

NVIDIA GeForce GTX 285 “core”

[Figure: one GTX 285 core; groups of SIMD functional units with shared texture (Tex) units and storage for thread contexts.]

Groups of 32 threads share an instruction stream (each group is a warp)
Up to 1024 thread contexts can be stored per core (32 warps of 32 threads)
There are 30 of these cores on the GTX 285: 30 × 1024 = 30,720 thread contexts