
Computer Architecture

A Quantitative Approach, Fifth Edition

Chapter 4
Data-Level Parallelism in
Vector, SIMD, and
GPU Architectures

Also based on Prof. Zhenyu Ye's guest lecture on GPU architecture and programming for the course 5SIA0 Embedded Computer Architecture at TU Delft, 2017.

Copyright © 2012, Elsevier Inc. All rights reserved.


Graphical Processing Units
 Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

 Basic idea:
 Heterogeneous execution model
 CPU is the host, GPU is the device
 Develop a C-like programming language for the GPU
 Unify all forms of GPU parallelism as the CUDA thread
 Programming model is "Single Instruction, Multiple Thread" (SIMT)



GPU Is In Your Pocket Too

Apple iPhone X – 2017


~ R$ 6,000.00 (April 2018)

Source: http://www.chipworks.com/blog/recentteardowns/2012/09/21/apple-iphone-5-the-a6-application-processor/

Threads and Blocks
 A thread is associated with each data element
 Threads are organized into blocks
 Blocks are organized into a grid

 GPU hardware handles thread management, not applications or the OS
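
A hedged sketch of this hierarchy in source form (the kernel name whoAmI and the 2 x 4 shape are illustrative, not from the book). Each thread can read its own coordinates, while the hardware decides when and where it runs:

  #include <cstdio>
  #include <cuda_runtime.h>

  // Each thread prints its position in the block/grid hierarchy.
  __global__ void whoAmI(void) {
      printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
  }

  int main(void) {
      dim3 grid(2);               // the grid contains 2 thread blocks
      dim3 block(4);              // each block contains 4 threads
      whoAmI<<<grid, block>>>();  // GPU hardware schedules these threads
      cudaDeviceSynchronize();    // wait for the device-side printf output
      return 0;
  }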



NVIDIA GPU Architecture
 Similarities to vector machines:
 Works well with data-level parallel problems
 Scatter-gather transfers (see the gather sketch below)
 Mask registers
 Large register files

CUDA (Compute Unified Device Architecture) is an API for parallel computing and GPGPU; OpenCL (Open Computing Language) is the open, cross-vendor counterpart.

 Differences:
 No scalar processor
 Uses multithreading to hide memory latency
 Has many functional units, as opposed to a few deeply pipelined units like a vector processor
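
A hedged illustration of the scatter-gather similarity: in CUDA, a gather is simply each thread loading through its own index (the names gather, out, table, and idx are illustrative):

  // Like a vector machine's gather: each thread does an indexed load
  // from global memory through its own element of idx.
  __global__ void gather(float *out, const float *table,
                         const int *idx, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = table[idx[i]];
  }
  // Launch sketch: gather<<<(n + 255) / 256, 256>>>(d_out, d_table, d_idx, n);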



General GPU Architecture – NVIDIA Fermi, 512 Processing Elements (PEs)

GTC = Graphic Threading Component (Device)
SM = Streaming Multiprocessor (GRID)
ROP = Raster Operations Pipeline
SP = Streaming Processor (CUDA Core)

Terminology
 Threads of SIMD instructions
 Each has its own PC
 Thread scheduler uses scoreboard to dispatch
 No data dependencies between threads!
 Keeps track of up to 48 threads of SIMD instructions
 Hides memory latency
 Thread block scheduler schedules blocks to
SIMD processors
 Within each SIMD processor:
 32 SIMD lanes
 Wide and shallow compared to vector processors
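
The 48-thread limit is Fermi-specific (48 warps x 32 lanes = 1536 resident threads per SIMD processor). A hedged sketch for querying the corresponding limits of whatever device you have, using the CUDA runtime API:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main(void) {
      cudaDeviceProp p;
      cudaGetDeviceProperties(&p, 0);   // properties of device 0
      printf("SIMD processors (SMs):       %d\n", p.multiProcessorCount);
      printf("SIMD lanes (warp size):      %d\n", p.warpSize);
      printf("Max resident threads per SM: %d\n", p.maxThreadsPerMultiProcessor);
      return 0;
  }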



NVIDIA Fermi Architecture (2011)
Processing Element (PE) = Stream Processor (SP) = Core

Maximum capacity: 512 CUDA cores
32 cores per block (SM)

ref: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

NVIDIA Fermi Architecture (2011)
 The NVIDIA Fermi GPU has 16 x 32,768 registers of 32 bits
 Divided into 4 GTCs (Device)
 Each GTC is divided into 4 SMs (Streaming Multiprocessors)
 Each SM is divided into SIMD lanes of threads
 Fermi has 16 physical SIMD lanes, each containing 2048 registers
 Each SIMD thread is limited to 64 registers
 A SIMD thread therefore has up to:
 64 vector registers of 32 32-bit elements
 32 vector registers of 32 64-bit elements
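
A hedged sketch of how the 64-register budget shows up in practice: __launch_bounds__ tells the compiler the intended launch shape so it can budget registers per thread (the kernel saxpyish and its body are illustrative); alternatively, nvcc's -maxrregcount=64 flag caps register use for a whole compilation unit.

  // Illustrative kernel: the compiler plans for 512-thread blocks, which
  // constrains how many registers each thread may consume.
  __global__ void __launch_bounds__(512) saxpyish(float *x, float a, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          x[i] = a * x[i] + 1.0f;
  }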



Transistor Count (ref: http://en.wikipedia.org/wiki/Transistor_count)

[Chart: transistor counts by device class]
 GPU: 21 billion transistors – GV100 Volta (NVIDIA, 2017), the most recent GPU shown
 Manycore & SoC: between 8 and 19 billion
 FPGAs: 50 billion

The NVIDIA Tesla V100 (2017) "throws 21 billion transistors at GPU computing".
Source: https://techreport.com/news/31881/nvidia-tesla-v100-throws-21-billion-transistors-at-gpu-computing
What can I do with 21 billion transistors?

 3D rendering: 19 billion triangles per second!
 Machine Learning and Artificial Intelligence
 Autonomous machines (Atlas – Boston Dynamics)
 Self-driving vehicles (Tesla and Google)
 Scientific Research

ref: "How GPUs Work", http://dx.doi.org/10.1109/MC.2007.59
https://blogs.nvidia.com/blog/2018/03/26/live-jensen-huang-keynote-2018-gtc/
The GPU-CPU Gap

Source: https://www.hpcwire.com/2016/08/23/2016-important-year-hpc-two-decades/
How to program GPUs:
Let's Start with Examples

SSE – Streaming SIMD Extensions: a SIMD instruction set for the x86 architecture
Let's Start with C and MIPS (RISC)
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}

Assembly code of the inner loop:
  lw   r0, 4(r1)
  addi r0, r0, 1
  sw   r0, 4(r1)

Programmer's view of RISC.
CPUs with Vector SIMD Units

Programmer's view of a vector SIMD, e.g. SSE.

SSE – Streaming SIMD extensions

Let's Program the Vector SIMD
Unroll the inner loop into a vector operation.

Scalar version:
  int A[2][4];
  for(i=0;i<2;i++){
    for(j=0;j<4;j++){
      A[i][j]++;
    }
  }

Vectorized version (SSE):
  int A[2][4];
  for(i=0;i<2;i++){
    movups xmm0, [ &A[i][0] ]   // load 4 elements
    addps  xmm0, xmm1           // add 1 (xmm1 holds four 1s)
    movups [ &A[i][0] ], xmm0   // store 4 elements
  }

Looks like the previous scalar example (lw / addi / sw in the inner loop), but the SSE instructions execute on 4 ALUs at once.
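
The same unrolled loop can also be written portably with SSE intrinsics instead of raw assembly. A hedged sketch (note that addps operates on packed floats, so the array is float here rather than the slide's int; everything else mirrors the assembly above):

  #include <stdio.h>
  #include <xmmintrin.h>                    // SSE intrinsics

  int main(void) {
      float A[2][4] = { {0,1,2,3}, {4,5,6,7} };
      __m128 ones = _mm_set1_ps(1.0f);      // the slide's xmm1: four 1s
      for (int i = 0; i < 2; i++) {
          __m128 v = _mm_loadu_ps(A[i]);    // movups: load 4 elements
          v = _mm_add_ps(v, ones);          // addps: 4 additions at once
          _mm_storeu_ps(A[i], v);           // movups: store 4 elements
      }
      for (int i = 0; i < 2; i++)
          for (int j = 0; j < 4; j++)
              printf("%.0f ", A[i][j]);
      printf("\n");
      return 0;
  }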
How Do Vector SIMD Programs Run?
int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ]   // load
  addps  xmm0, xmm1           // add 1
  movups [ &A[i][0] ], xmm0   // store
}

CUDA Programmer's View of GPUs
A GPU contains multiple SIMD Units.
All of them can access global memory.

What Are the Differences?

Comparing the SSE view with the GPU view:

1. GPUs use threads instead of vectors.

2. GPUs have "Shared Memory" spaces (see the sketch below).
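
A hedged sketch of difference 2 (the kernel reverseBlock is illustrative): each block stages data in on-chip __shared__ memory, visible to all threads of that block, something SSE has no analogue of:

  #define BLOCK 4

  // Each block reverses its BLOCK elements through shared memory.
  __global__ void reverseBlock(int *d) {
      __shared__ int tile[BLOCK];            // per-block shared memory space
      int t = threadIdx.x;
      tile[t] = d[blockIdx.x * BLOCK + t];   // stage the block's slice on chip
      __syncthreads();                       // the whole block arrives here
      d[blockIdx.x * BLOCK + t] = tile[BLOCK - 1 - t];
  }
  // Launch sketch: reverseBlock<<<2, BLOCK>>>(d_A);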
Remember: Thread Hierarchy in CUDA
A Grid contains Thread Blocks; a Thread Block contains Threads.
Let's Start Again from C
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}

Convert into CUDA:

int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define threads (A must live in device memory)
__global__ void kernelF(A){    // all threads run the same kernel
  i = blockIdx.x;              // each thread block has its id
  j = threadIdx.x;             // each thread has its id
  A[i][j]++;                   // each thread has a different i and j
}

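For completeness, a hedged, self-contained version of the sketch above that compiles with nvcc; the h_A/d_A names and the cudaMalloc/cudaMemcpy plumbing are illustrative additions, not part of the original slide:

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void kernelF(int A[][4]) {   // all threads run the same kernel
      int i = blockIdx.x;                 // block id selects the row
      int j = threadIdx.x;                // thread id selects the column
      A[i][j]++;
  }

  int main(void) {
      int h_A[2][4] = { {0,1,2,3}, {4,5,6,7} };
      int (*d_A)[4];
      cudaMalloc((void**)&d_A, sizeof(h_A));                     // device copy of A
      cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
      kernelF<<<dim3(2,1), dim3(4,1)>>>(d_A);                    // 2 blocks x 4 threads
      cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
      cudaFree(d_A);
      for (int i = 0; i < 2; i++)
          for (int j = 0; j < 4; j++)
              printf("%d ", h_A[i][j]);   // prints 1 2 3 4 5 6 7 8
      printf("\n");
      return 0;
  }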
What Is the Thread Hierarchy?
Thread 3 of block 1 operates on element A[1][3].

int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define 2 x 4 = 8 threads
__global__ void kernelF(A){    // all threads run the same kernel
  i = blockIdx.x;              // each thread block has its own id
  j = threadIdx.x;             // each thread has its own id
  A[i][j]++;                   // each thread has a different i and j
}
How Are Threads Scheduled?

Blocks Are Dynamically Scheduled

How Are Threads Executed?
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);
__global__ void kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}

Compiled to PTX (simplified):

mov.u32 %r0, %ctaid.x        // r0 = i = blockIdx.x
mov.u32 %r1, %ntid.x         // r1 = "threads-per-block"
mov.u32 %r2, %tid.x          // r2 = j = threadIdx.x
mad.u32 %r3, %r0, %r1, %r2   // r3 = i * "threads-per-block" + j
ld.global.s32 %r4, [%r3]     // r4 = A[i][j] (shown as an element index;
                             // real PTX scales by 4 and adds A's base address)
add.s32 %r4, %r4, 1          // r4 = r4 + 1
st.global.s32 [%r3], %r4     // A[i][j] = r4

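A hedged tip: listings like this can be regenerated at any time with nvcc -ptx file.cu, which makes nvcc emit the PTX it produces for the kernels in file.cu.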
Example 2
 Multiply two vectors of length 8192 using Fermi
 Code that works over all elements is the grid
 Thread blocks break this down into manageable sizes
 512 threads per block
 SIMD instruction executes 32 elements at a time
 Thus grid size = 8192 / 512 = 16 blocks (one block per SM)
 Block is analogous to a strip-mined vector loop with
vector length of 32
 Block is assigned to a multithreaded SIMD processor
by the thread block scheduler
 Current-generation GPUs (Fermi) have 7-15
multithreaded SIMD processors
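
A hedged sketch of this example as CUDA source (the names mulVec, a, b, c are illustrative, not from the book): 16 blocks of 512 threads give one thread per element of the 8192-element vectors.

  #define N 8192
  #define THREADS_PER_BLOCK 512    // threads per block, as in the example

  // Grid = N / THREADS_PER_BLOCK = 16 blocks; each thread multiplies one element.
  __global__ void mulVec(const double *a, const double *b, double *c) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N)                 // guard, in case N were not a multiple of the block size
          c[idx] = a[idx] * b[idx];
  }
  // Launch sketch: mulVec<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);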
