Chapter 4
Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Basic idea:
Heterogeneous execution model
CPU is the host, GPU is the device
Develop a C-like programming language for GPU
Unify all forms of GPU parallelism as CUDA thread
Programming model is “Single Instruction Multiple Thread” (SIMT)
[Figure: Apple iPhone 5 A6 application processor. Source: http://www.chipworks.com/blog/recentteardowns/2012/09/21/apple-iphone-5-the-a6-application-processor/]
Graphical Processing Units
Threads and Blocks
A thread is associated with each data element
Threads are organized into blocks
Blocks are organized into a grid
Differences (compared to vector architectures):
No scalar processor
OpenCL (Open Computing Language): an open-standard alternative to CUDA
Graphical Processing Units
Terminology
Threads of SIMD instructions
Each has its own PC
Thread scheduler uses scoreboard to dispatch
No data dependencies between threads!
Keeps track of up to 48 threads of SIMD instructions
Hides memory latency
Thread block scheduler schedules blocks to
SIMD processors
Within each SIMD processor:
32 SIMD lanes
Wide and shallow compared to vector processors
32 cores per block
ref: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Graphical Processing Units
NVIDIA Fermi Architecture (2011)
NVIDIA Fermi GPU has 16 x 32,768 registers of 32 bits
The device is divided into 4 GPCs (Graphics Processing Clusters)
Each GPC is divided into 4 SMs (Streaming Multiprocessors), 16 SMs in total
Each SM is organized as SIMD lanes of threads
Fermi has 16 physical SIMD lanes, each containing 2048 registers
Each SIMD thread is limited to 64 registers
SIMD thread has up to:
64 vector registers of 32 32-bit elements
32 vector registers of 32 64-bit elements
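From the programmer's side, the per-thread register budget can be influenced at compile time. A minimal sketch, assuming an illustrative SAXPY kernel that is not from the slides; __launch_bounds__ and nvcc's --maxrregcount flag are the standard CUDA mechanisms for this:

    // Promise the compiler at most 512 threads per block, so it can budget
    // registers per thread accordingly (illustrative kernel, not from the slides).
    __global__ void __launch_bounds__(512) saxpy(float a, const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
    // Alternatively, compiling with nvcc --maxrregcount=64 caps registers per thread file-wide.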
[Chart: transistor counts, GPU: 21 billion; FPGAs: 50 billion]
Nvidia Tesla V100 (2017): 21 billion transistors
Source: https://techreport.com/news/31881/nvidia-tesla-v100-throws-21-billion-transistors-at-gpu-computing
What can I do with 21 billion transistors?
https://blogs.nvidia.com/blog/2018/03/26/live-jensen-huang-keynote-2018-gtc/
The GPU-CPU Gap
Source: https://www.hpcwire.com/2016/08/23/2016-important-year-hpc-two-decades/
How to program GPUs:
Let's Start with Examples
Architecture
Let's Start with C and MIPS (RISC)
int A[2][4];
for(i=0;i<2;i++){
    for(j=0;j<4;j++){
        A[i][j]++;
    }
}

Assembly code of the inner loop:
    lw   r0, 4(r1)
    addi r0, r0, 1
    sw   r0, 4(r1)

Programmer's view of RISC
CPUs with Vector SIMD Units
Let's Program the Vector SIMD
Unroll inner-loop to vector operation.
Scalar version, with the assembly code of the inner loop shown as comments:

int A[2][4];
for(i=0;i<2;i++){
    for(j=0;j<4;j++){      // lw   r0, 4(r1)
        A[i][j]++;         // addi r0, r0, 1
    }                      // sw   r0, 4(r1)
}

Vector (SSE) version, with the inner loop replaced by one vector operation:

int A[2][4];
for(i=0;i<2;i++){
    movups xmm0, [ &A[i][0] ]   // load
    addps  xmm0, xmm1           // add 1
    movups [ &A[i][0] ], xmm0   // store
}

Looks like the previous example, but the SSE instructions execute on 4 ALUs.
How Do Vector SIMD Programs Run?
int A[2][4];
for(i=0;i<2;i++){
    movups xmm0, [ &A[i][0] ]   // load
    addps  xmm0, xmm1           // add 1
    movups [ &A[i][0] ], xmm0   // store
}
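For reference, the same idea written as host-side C with SSE intrinsics. A minimal sketch, assuming a float array, since addps operates on single-precision lanes (which the slide's int example glosses over); the intrinsics _mm_set1_ps, _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps are the standard SSE API, but the surrounding code is illustrative:

    #include <xmmintrin.h>                          // SSE intrinsics

    float A[2][4];

    void add_one_sse(void) {
        __m128 ones = _mm_set1_ps(1.0f);            // broadcast 1.0 into all 4 lanes (the slide's xmm1)
        for (int i = 0; i < 2; i++) {
            __m128 row = _mm_loadu_ps(&A[i][0]);    // movups: load 4 elements
            row = _mm_add_ps(row, ones);            // addps: add 1 to each element
            _mm_storeu_ps(&A[i][0], row);           // movups: store 4 elements
        }
    }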
CUDA Programmer's View of GPUs
A GPU contains multiple SIMD Units.
All of them can access global memory.
What Are the Differences?
[Figure: SSE vs. GPU]
Remember: Thread Hierarchy in CUDA
A Grid contains Thread Blocks
A Thread Block contains Threads
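Inside a kernel, this hierarchy is exposed through built-in index variables. A minimal sketch of the usual idiom (the flattening arithmetic is standard CUDA practice, not taken from the slides):

    // Which block am I in, and which thread within that block?
    int i = blockIdx.x;                                      // block index within the grid
    int j = threadIdx.x;                                     // thread index within the block
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;   // common 1-D flattening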
Let's Start Again from C
int A[2][4];
for(i=0;i<2;i++){
    for(j=0;j<4;j++){
        A[i][j]++;
    }
}

Convert into CUDA:
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);    // define threads
__global__ kernelF(A){          // all threads run the same kernel
    i = blockIdx.x;             // each thread block has its id
    j = threadIdx.x;            // each thread has its id
    A[i][j]++;                  // each thread has a different i and j
}
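For reference, a fuller version that actually compiles and runs: a sketch assuming unified (managed) memory, with types added and the launch configuration written with dim3, as real CUDA requires:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void kernelF(int (*A)[4]) {            // all threads run the same kernel
        int i = blockIdx.x;                           // each thread block has its id
        int j = threadIdx.x;                          // each thread has its id
        A[i][j]++;                                    // each thread touches a different element
    }

    int main() {
        int (*A)[4];
        cudaMallocManaged(&A, 2 * 4 * sizeof(int));   // memory visible to host and device
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 4; j++)
                A[i][j] = 0;

        kernelF<<<dim3(2, 1), dim3(4, 1)>>>(A);       // 2 blocks x 4 threads = 8 threads
        cudaDeviceSynchronize();                      // wait for the GPU to finish

        printf("A[1][3] = %d\n", A[1][3]);            // prints 1
        cudaFree(A);
        return 0;
    }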
What Is the Thread Hierarchy?
Thread 3 of block 1 operates on element A[1][3]
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);    // define 2 x 4 = 8 threads
__global__ kernelF(A){          // all threads run the same kernel
    i = blockIdx.x;             // each thread block has its own id
    j = threadIdx.x;            // each thread has its own id
    A[i][j]++;                  // each thread has a different i and j
}
How Are Threads Scheduled?
Blocks Are Dynamically Scheduled
How Are Threads Executed?
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);
__global__ kernelF(A){
    i = blockIdx.x;
    j = threadIdx.x;
    A[i][j]++;
}
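In hardware, the threads of a block are issued in groups of 32 (warps) that share one instruction stream. A small illustrative sketch, not from the slides, of the main consequence, branch divergence:

    __global__ void divergent(int *data) {
        int lane = threadIdx.x % 32;         // position within the warp (warpSize is 32)
        if (lane < 16)
            data[threadIdx.x] += 1;          // first the lanes taking this path execute...
        else
            data[threadIdx.x] -= 1;          // ...then the lanes taking the other path
        // The warp issues both paths serially; lanes not on the current path are masked off.
    }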
NVIDIA Fermi Architecture (2011)
Processing Element (PE) = Stream Processor (SP) = Core
32 cores per block
ref: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Graphical Processing Units
Example 2
Multiply two vectors of length 8192 using Fermi
Code that works over all elements is the grid
Thread blocks break this down into manageable sizes
512 threads per block
SIMD instruction executes 32 elements at a time
Thus grid size = 16 blocks (SM)
Block is analogous to a strip-mined vector loop with
vector length of 32
Block is assigned to a multithreaded SIMD processor
by the thread block scheduler
Current-generation GPUs (Fermi) have 7-15
multithreaded SIMD processors
Copyright © 2012, Elsevier Inc. All rights reserved.
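To make the numbers above concrete, a minimal CUDA sketch of the 8192-element vector multiply launched as 16 blocks of 512 threads (the kernel name and pointers are illustrative assumptions, not from the book):

    #define N 8192

    __global__ void vecMul(const float *x, const float *y, float *z) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (i < N)
            z[i] = x[i] * y[i];
    }

    // Launch: 8192 / 512 = 16 thread blocks of 512 threads each.
    // The thread block scheduler assigns each block to one multithreaded SIMD processor (SM),
    // where its 512 threads are issued as warps of 32 (one SIMD instruction covers 32 elements).
    // vecMul<<<N / 512, 512>>>(d_x, d_y, d_z);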