
Computer Architecture

A Quantitative Approach, Fifth Edition

Chapter 4
Data-Level Parallelism in
Vector, SIMD, and
GPU Architectures

Also based on Prof. Zhenyu Ye's guest lecture on GPU architecture and programming for the course 5SIA0 Embedded Computer Architecture at TU Delft, 2017.

Copyright © 2012, Elsevier Inc. All rights reserved.


Graphical Processing Units
 Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

 Basic idea:
 Heterogeneous execution model
 CPU is the host, GPU is the device
 Develop a C-like programming language for the GPU
 Unify all forms of GPU parallelism as the CUDA thread
 Programming model is "Single Instruction, Multiple Thread" (SIMT)



GPU Is In Your Pocket Too

Apple iPhone X – 2017


~ R$ 6,000.00 (April 2018)

Source: http://www.chipworks.com/blog/recentteardowns/2012/09/21/apple-iphone-5-the-a6-application-processor/

Threads and Blocks
 A thread is associated with each data element
 Threads are organized into blocks
 Blocks are organized into a grid

 GPU hardware handles thread management, not applications or the OS
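
A hedged sketch of this hierarchy in source form (the kernel name whoAmI and the 2 x 4 shape are illustrative, not from the book). Each thread can read its own coordinates, while the hardware decides when and where it runs:

  #include <cstdio>
  #include <cuda_runtime.h>

  // Each thread prints its position in the block/grid hierarchy.
  __global__ void whoAmI(void) {
      printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
  }

  int main(void) {
      dim3 grid(2);               // the grid contains 2 thread blocks
      dim3 block(4);              // each block contains 4 threads
      whoAmI<<<grid, block>>>();  // GPU hardware schedules these threads
      cudaDeviceSynchronize();    // wait for the device-side printf output
      return 0;
  }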



NVIDIA GPU Architecture
 Similarities to vector machines:
 Works well with data-level parallel problems
 Scatter-gather transfers (see the gather sketch below)
 Mask registers
 Large register files

CUDA (Compute Unified Device Architecture) is an API for parallel computing and GPGPU; OpenCL (Open Computing Language) is the open, cross-vendor counterpart.

 Differences:
 No scalar processor
 Uses multithreading to hide memory latency
 Has many functional units, as opposed to a few deeply pipelined units like a vector processor
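
A hedged illustration of the scatter-gather similarity: in CUDA, a gather is simply each thread loading through its own index (the names gather, out, table, and idx are illustrative):

  // Like a vector machine's gather: each thread does an indexed load
  // from global memory through its own element of idx.
  __global__ void gather(float *out, const float *table,
                         const int *idx, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = table[idx[i]];
  }
  // Launch sketch: gather<<<(n + 255) / 256, 256>>>(d_out, d_table, d_idx, n);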



General GPU Architecture – NVIDIA Fermi, 512 Processing Elements (PEs)

GTC = Graphic Threading Component (Device)
SM = Streaming Multiprocessor (GRID)
ROP = Raster Operations Pipeline
SP = Streaming Processor (CUDA Core)

Terminology
 Threads of SIMD instructions
 Each has its own PC
 Thread scheduler uses scoreboard to dispatch
 No data dependencies between threads!
 Keeps track of up to 48 threads of SIMD instructions
 Hides memory latency
 Thread block scheduler schedules blocks to
SIMD processors
 Within each SIMD processor:
 32 SIMD lanes
 Wide and shallow compared to vector processors
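
The 48-thread limit is Fermi-specific (48 warps x 32 lanes = 1536 resident threads per SIMD processor). A hedged sketch for querying the corresponding limits of whatever device you have, using the CUDA runtime API:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main(void) {
      cudaDeviceProp p;
      cudaGetDeviceProperties(&p, 0);   // properties of device 0
      printf("SIMD processors (SMs):       %d\n", p.multiProcessorCount);
      printf("SIMD lanes (warp size):      %d\n", p.warpSize);
      printf("Max resident threads per SM: %d\n", p.maxThreadsPerMultiProcessor);
      return 0;
  }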



NVIDIA Fermi Architecture (2011)
Processing Element (PE) = Stream Processor (SP) = Core

Maximum capacity: 512 CUDA cores
32 cores per block (SM)

ref: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

NVIDIA Fermi Architecture (2011)
 The NVIDIA Fermi GPU has 16 x 32,768 registers of 32 bits
 Divided into 4 GTCs (Device)
 Each GTC is divided into 4 SMs (Streaming Multiprocessors)
 Each SM is divided into SIMD lanes of threads
 Fermi has 16 physical SIMD lanes, each containing 2048 registers
 Each SIMD thread is limited to 64 registers
 A SIMD thread therefore has up to:
 64 vector registers of 32 32-bit elements
 32 vector registers of 32 64-bit elements
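
A hedged sketch of how the 64-register budget shows up in practice: __launch_bounds__ tells the compiler the intended launch shape so it can budget registers per thread (the kernel saxpyish and its body are illustrative); alternatively, nvcc's -maxrregcount=64 flag caps register use for a whole compilation unit.

  // Illustrative kernel: the compiler plans for 512-thread blocks, which
  // constrains how many registers each thread may consume.
  __global__ void __launch_bounds__(512) saxpyish(float *x, float a, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          x[i] = a * x[i] + 1.0f;
  }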



Transistor Count (ref: http://en.wikipedia.org/wiki/Transistor_count)

[Chart: transistor counts by device class]
 GPU: 21 billion transistors – GV100 Volta (NVIDIA, 2017), the most recent GPU shown
 Manycore & SoC: between 8 and 19 billion
 FPGAs: 50 billion

The NVIDIA Tesla V100 (2017) "throws 21 billion transistors at GPU computing".
Source: https://techreport.com/news/31881/nvidia-tesla-v100-throws-21-billion-transistors-at-gpu-computing
What can I do with 21 billion transistors?

 3D rendering: 19 billion triangles per second!
 Machine Learning and Artificial Intelligence
 Autonomous machines (Atlas – Boston Dynamics)
 Self-driving vehicles (Tesla and Google)
 Scientific Research

ref: "How GPUs Work", http://dx.doi.org/10.1109/MC.2007.59
https://blogs.nvidia.com/blog/2018/03/26/live-jensen-huang-keynote-2018-gtc/
The GPU-CPU Gap

Source: https://www.hpcwire.com/2016/08/23/2016-important-year-hpc-two-decades/
How to program GPUs:
Let's Start with Examples

SSE – Streaming SIMD Extensions: a SIMD instruction set for the x86 architecture
Let's Start with C and MIPS (RISC)
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}

Assembly code of the inner loop:
  lw   r0, 4(r1)
  addi r0, r0, 1
  sw   r0, 4(r1)

Programmer's view of RISC.
CPUs with Vector SIMD Units

Programmer's view of a vector SIMD, e.g. SSE.

SSE – Streaming SIMD extensions

Let's Program the Vector SIMD
Unroll the inner loop into a vector operation.

Scalar version:
  int A[2][4];
  for(i=0;i<2;i++){
    for(j=0;j<4;j++){
      A[i][j]++;
    }
  }

Vectorized version (SSE):
  int A[2][4];
  for(i=0;i<2;i++){
    movups xmm0, [ &A[i][0] ]   // load 4 elements
    addps  xmm0, xmm1           // add 1 (xmm1 holds four 1s)
    movups [ &A[i][0] ], xmm0   // store 4 elements
  }

Looks like the previous scalar example (lw / addi / sw in the inner loop), but the SSE instructions execute on 4 ALUs at once.
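
The same unrolled loop can also be written portably with SSE intrinsics instead of raw assembly. A hedged sketch (note that addps operates on packed floats, so the array is float here rather than the slide's int; everything else mirrors the assembly above):

  #include <stdio.h>
  #include <xmmintrin.h>                    // SSE intrinsics

  int main(void) {
      float A[2][4] = { {0,1,2,3}, {4,5,6,7} };
      __m128 ones = _mm_set1_ps(1.0f);      // the slide's xmm1: four 1s
      for (int i = 0; i < 2; i++) {
          __m128 v = _mm_loadu_ps(A[i]);    // movups: load 4 elements
          v = _mm_add_ps(v, ones);          // addps: 4 additions at once
          _mm_storeu_ps(A[i], v);           // movups: store 4 elements
      }
      for (int i = 0; i < 2; i++)
          for (int j = 0; j < 4; j++)
              printf("%.0f ", A[i][j]);
      printf("\n");
      return 0;
  }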
How Do Vector SIMD Programs Run?
int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ]   // load
  addps  xmm0, xmm1           // add 1
  movups [ &A[i][0] ], xmm0   // store
}

CUDA Programmer's View of GPUs
A GPU contains multiple SIMD Units.
All of them can access global memory.

What Are the Differences?

Comparing the SSE view with the GPU view:

1. GPUs use threads instead of vectors.

2. GPUs have "Shared Memory" spaces (see the sketch below).
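
A hedged sketch of difference 2 (the kernel reverseBlock is illustrative): each block stages data in on-chip __shared__ memory, visible to all threads of that block, something SSE has no analogue of:

  #define BLOCK 4

  // Each block reverses its BLOCK elements through shared memory.
  __global__ void reverseBlock(int *d) {
      __shared__ int tile[BLOCK];            // per-block shared memory space
      int t = threadIdx.x;
      tile[t] = d[blockIdx.x * BLOCK + t];   // stage the block's slice on chip
      __syncthreads();                       // the whole block arrives here
      d[blockIdx.x * BLOCK + t] = tile[BLOCK - 1 - t];
  }
  // Launch sketch: reverseBlock<<<2, BLOCK>>>(d_A);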
Remember: Thread Hierarchy in CUDA
A Grid contains Thread Blocks; a Thread Block contains Threads.
Let's Start Again from C
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}

Convert into CUDA:

int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define threads (A must live in device memory)
__global__ void kernelF(A){    // all threads run the same kernel
  i = blockIdx.x;              // each thread block has its id
  j = threadIdx.x;             // each thread has its id
  A[i][j]++;                   // each thread has a different i and j
}

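For completeness, a hedged, self-contained version of the sketch above that compiles with nvcc; the h_A/d_A names and the cudaMalloc/cudaMemcpy plumbing are illustrative additions, not part of the original slide:

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void kernelF(int A[][4]) {   // all threads run the same kernel
      int i = blockIdx.x;                 // block id selects the row
      int j = threadIdx.x;                // thread id selects the column
      A[i][j]++;
  }

  int main(void) {
      int h_A[2][4] = { {0,1,2,3}, {4,5,6,7} };
      int (*d_A)[4];
      cudaMalloc((void**)&d_A, sizeof(h_A));                     // device copy of A
      cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
      kernelF<<<dim3(2,1), dim3(4,1)>>>(d_A);                    // 2 blocks x 4 threads
      cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
      cudaFree(d_A);
      for (int i = 0; i < 2; i++)
          for (int j = 0; j < 4; j++)
              printf("%d ", h_A[i][j]);   // prints 1 2 3 4 5 6 7 8
      printf("\n");
      return 0;
  }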
What Is the Thread Hierarchy?
Thread 3 of block 1 operates on element A[1][3].

int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define 2 x 4 = 8 threads
__global__ void kernelF(A){    // all threads run the same kernel
  i = blockIdx.x;              // each thread block has its own id
  j = threadIdx.x;             // each thread has its own id
  A[i][j]++;                   // each thread has a different i and j
}
How Are Threads Scheduled?

Blocks Are Dynamically Scheduled

How Are Threads Executed?
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);
__global__ void kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}

Compiled to PTX (simplified):

mov.u32 %r0, %ctaid.x        // r0 = i = blockIdx.x
mov.u32 %r1, %ntid.x         // r1 = "threads-per-block"
mov.u32 %r2, %tid.x          // r2 = j = threadIdx.x
mad.u32 %r3, %r0, %r1, %r2   // r3 = i * "threads-per-block" + j
ld.global.s32 %r4, [%r3]     // r4 = A[i][j] (shown as an element index;
                             // real PTX scales by 4 and adds A's base address)
add.s32 %r4, %r4, 1          // r4 = r4 + 1
st.global.s32 [%r3], %r4     // A[i][j] = r4

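A hedged tip: listings like this can be regenerated at any time with nvcc -ptx file.cu, which makes nvcc emit the PTX it produces for the kernels in file.cu.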
Example 2
 Multiply two vectors of length 8192 using Fermi
 Code that works over all elements is the grid
 Thread blocks break this down into manageable sizes
 512 threads per block
 SIMD instruction executes 32 elements at a time
 Thus grid size = 8192 / 512 = 16 blocks (one block per SM)
 Block is analogous to a strip-mined vector loop with
vector length of 32
 Block is assigned to a multithreaded SIMD processor
by the thread block scheduler
 Current-generation GPUs (Fermi) have 7-15
multithreaded SIMD processors
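
A hedged sketch of this example as CUDA source (the names mulVec, a, b, c are illustrative, not from the book): 16 blocks of 512 threads give one thread per element of the 8192-element vectors.

  #define N 8192
  #define THREADS_PER_BLOCK 512    // threads per block, as in the example

  // Grid = N / THREADS_PER_BLOCK = 16 blocks; each thread multiplies one element.
  __global__ void mulVec(const double *a, const double *b, double *c) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N)                 // guard, in case N were not a multiple of the block size
          c[idx] = a[idx] * b[idx];
  }
  // Launch sketch: mulVec<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);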
