
Graphics Processing Unit (GPU)

Architectures

Dr. Ranjani Parthasarathi


Professor
Dept. of IST, CEG Campus
Anna University

MIT - Comp Arch FDP, 27 June 2014


Topics

• GPU Evolution
• GPU architecture
• GPU micro-architecture
• GPU programming
• Performance optimizations
• Trends



System Architecture



GPU Architecture
NVIDIA Fermi, 512 Processing Elements (PEs)



What Can It Do?
Render triangles.

NVIDIA GTX480 can render 1.6 billion triangles per second!



General Purpose Computing



ref: http://www.nvidia.com/object/tesla_computing_solutions.html
GPU Highlights
• A total of 62 systems on the June 2014 Top500 list use accelerator/co-processor
technology, up from 53 in November 2013. Forty-four of these use NVIDIA chips,
two use ATI Radeon, and there are now 17 systems with Intel MIC technology
(Xeon Phi). The average number of accelerator cores for these 62 systems is
78,127 cores/system.



The Gap Between CPU and GPU



ref: Tesla GPU Computing Brochure
GPU Has >10x Comp Density

Given the same chip area, the achievable performance of a GPU is >10x higher than that of a CPU.



Evolution of Intel Pentium
(Figure: chip area breakdown of Pentium I, Pentium II, Pentium III, and Pentium IV.)

Q: What can you observe? Why?
Extrapolation of Single Core CPU
If we extrapolate the trend, in a few generations, Pentium
would look like:

Of course, we know it did not happen.

Q: What happened instead? Why?



Evolution of Multi-core CPUs
(Figure: chip area breakdown of Penryn, Bloomfield, Gulftown, and Beckton.)

Q: What can you observe? Why?



From Single core to Multi core to
Simple Cores
"This [multiple IA-core] approach is analogous to trying to
build an airplane by putting wings on a train."
--Bill Dally, NVIDIA



Let's Take a Closer Look

Less than 10% of the total chip area is used for actual execution.

Q: Why?



The Memory Hierarchy

Notes on energy at 45 nm:
64-bit Int ADD takes about 1 pJ.
64-bit FP FMA takes about 200 pJ.

It seems we cannot further increase the computational density.


The Brick Wall -- UC Berkeley's View

Power Wall: power expensive, transistors free


Memory Wall: Memory slow, multiplies fast
ILP Wall: diminishing returns on more ILP HW

David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007.
The Brick Wall -- UC Berkeley's View
Power Wall: power expensive, transistors free
Memory Wall: Memory slow, multiplies fast
ILP Wall: diminishing returns on more ILP HW

Power Wall + Memory Wall + ILP Wall = Brick Wall

David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007.
How to Break the Brick Wall?

Hint: how to exploit the parallelism inside the application?



Step 1: Trade Latency with Throughput

Hide the memory latency through fine-grained interleaved threading.



Interleaved Multi-threading



Interleaved Multi-threading

The granularity of interleaved multi-threading:


• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency



Interleaved Multi-threading

The granularity of interleaved multi-threading:


• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency

Fine-grained interleaved multi-threading:
Pros: ?
Cons: ?
Interleaved Multi-threading

The granularity of interleaved multi-threading:


• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency

Fine-grained interleaved multi-threading:
Pros: remove branch predictor, OOO scheduler, large cache
Cons: register pressure, etc.
Fine-Grained Interleaved Threading
Without and with fine-grained interleaved threading

Pros:
reduce cache size,
no branch predictor,
no OOO scheduler

Cons:
register pressure,
thread scheduler,
require huge parallelism



HW Support
The register file supports zero-overhead context switches between interleaved threads.



Can We Make Further Improvement?

• Reducing the large cache gives 2x computational density.

• Q: Can we make further improvements?

Hint: we have only utilized thread-level parallelism (TLP) so far.



Step 2: Single Instruction Multiple Data
A GPU uses wide SIMD: 8/16/24/... processing elements (PEs).
A CPU uses short SIMD: usually a vector width of 4.

SSE has 4 data lanes; a GPU has 8/16/24/... data lanes.
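To make the lane-count contrast concrete, here is a minimal sketch (not from the original slides) of the same element-wise add written with 4-wide SSE intrinsics for the CPU and as one scalar thread per element for the GPU; the function names (add4_sse, add_gpu) are illustrative.

#include <xmmintrin.h>   // SSE intrinsics (host code)

// CPU: one SSE instruction adds 4 packed floats (4 data lanes).
void add4_sse(float *c, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));
}

// GPU: each scalar thread handles one element; the hardware groups
// threads into much wider SIMD lanes.
__global__ void add_gpu(float *c, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}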



Hardware Support
Supporting interleaved threading + SIMD execution



Single Instruction Multiple Thread (SIMT)
Hide vector width using scalar threads.



Example of SIMT Execution
Assume 32 threads are grouped into one warp.
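As a minimal sketch (assuming the usual warp size of 32, as on the GPUs discussed here), the warp and lane that a scalar thread belongs to follow directly from its thread index; the kernel name warp_ids is illustrative.

__global__ void warp_ids(int *warp_of, int *lane_of)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global scalar thread ID
    int warp = threadIdx.x / 32;       // warp index within the block
    int lane = threadIdx.x % 32;       // lane index within the warp
    warp_of[tid] = warp;               // all 32 lanes of a warp execute
    lane_of[tid] = lane;               // the same instruction together
}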



Step 3: Simple Core
The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core.

Lightweight PE: Fused Multiply-Add (FMA)
SFU: Special Function Unit



Arrival of Throughput Oriented
Architectures – 3 key ideas/features
1. Simple cores (~2x comp density)
   Use many slimmed-down cores running in parallel

2. SIMD/SIMT (>10x comp density)
   Pack cores full of ALUs (by sharing an instruction stream across groups)
   Explicit SIMD / implicit SIMT

3. Fine-grained interleaved threading (~2x comp density)
   Avoid latency stalls

ref: Michael Garland and David B. Kirk, "Understanding throughput-oriented architectures", CACM 2010.



So We Reach Here & Beyond!
NVIDIA Fermi, 512 Processing Elements (PEs)



The GPU perspective



The GPU pipeline
• The GPU receives geometry information from
the CPU as an input and provides a picture as
an output.
• Let’s see how that happens:

host interface → vertex processing → triangle setup → pixel processing → memory interface



Host Interface
• The host interface is the communication bridge between the
CPU and the GPU
• It receives commands from the CPU and also pulls geometry
information from system memory
• It outputs a stream of vertices in object space with all their
associated information (normals, texture coordinates, per-vertex
color, etc.)

host interface → vertex processing → triangle setup → pixel processing → memory interface



Vertex Processing
• The vertex processing stage receives vertices from the
host interface in object space and outputs them in screen
space
• This may be a simple linear transformation, or a complex
operation involving morphing effects
• Normals, texcoords, etc. are also transformed
• No new vertices are created in this stage, and no vertices
are discarded (input/output has 1:1 mapping)

host interface → vertex processing → triangle setup → pixel processing → memory interface
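A minimal sketch (not from the original slides) of what the vertex-processing stage computes for a single vertex: multiply by a 4x4 model-view-projection matrix, perform the perspective divide, and map to screen coordinates. The type Vec4, the function vertex_to_screen, and the row-major layout of mvp are assumptions made for illustration.

typedef struct { float x, y, z, w; } Vec4;

Vec4 vertex_to_screen(const float mvp[16], Vec4 v, float width, float height)
{
    Vec4 clip;                                 // object space -> clip space
    clip.x = mvp[0]*v.x  + mvp[1]*v.y  + mvp[2]*v.z  + mvp[3]*v.w;
    clip.y = mvp[4]*v.x  + mvp[5]*v.y  + mvp[6]*v.z  + mvp[7]*v.w;
    clip.z = mvp[8]*v.x  + mvp[9]*v.y  + mvp[10]*v.z + mvp[11]*v.w;
    clip.w = mvp[12]*v.x + mvp[13]*v.y + mvp[14]*v.z + mvp[15]*v.w;

    Vec4 screen;                               // perspective divide -> NDC in [-1,1],
    screen.x = (clip.x / clip.w * 0.5f + 0.5f) * width;   // then viewport mapping
    screen.y = (clip.y / clip.w * 0.5f + 0.5f) * height;
    screen.z =  clip.z / clip.w;               // depth kept for the z-buffer test
    screen.w =  clip.w;
    return screen;
}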



Triangle setup
• In this stage geometry information becomes raster
information (screen space geometry is the input, pixels are
the output).
• Prior to rasterization, triangles that are backfacing or located
outside the viewing frustum are rejected.
• Some GPUs also do some hidden surface removal at this
stage.

host interface → vertex processing → triangle setup → pixel processing → memory interface



Triangle Setup (cont)
• A fragment is generated if and only if its
center is inside the triangle

• Every fragment generated has its attributes computed as the
perspective-correct interpolation of the three vertices that
make up the triangle

host interface → vertex processing → triangle setup → pixel processing → memory interface



Fragment Processing
• Each fragment provided by triangle setup is fed into
fragment processing as a set of attributes (position,
normal, texcoord etc), which are used to compute the
final color for this pixel
• The computations taking place here include texture
mapping and math operations
• Typically the bottleneck in modern applications

host interface → vertex processing → triangle setup → pixel processing → memory interface



Memory Interface
• Fragment colors provided by the previous stage are
written to the framebuffer
• Used to be the biggest bottleneck before fragment
processing took over
• Before the final write occurs, some fragments are
rejected by the z-buffer, stencil and alpha tests
• On modern GPUs, z and color are compressed to reduce
framebuffer bandwidth (but not size)
host interface → vertex processing → triangle setup → pixel processing → memory interface



Programmability in the GPU



Topics

• GPU Evolution
• GPU architecture
• GPU micro-architecture
• GPU programming
• Performance optimizations
• Trends



NVIDIA’s G80, GT200, Tesla, Fermi,
Kepler
• November 2006: G80
• June 2008: Tesla (GT200)
• March 2010: Fermi (GF10x)
• March 2012: Kepler (GK10x)



Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi Block Diagram
GF100
• 16 Streaming
Multiprocessors
(SMs)
• Each with 32 cores
– 512 total cores
• Each SM hosts up to
– 48 warps, or
– 1,536 threads
• In flight, up to
– 24,576 threads



Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
512 Cores



Fermi SM

• Why 32 cores per SM instead of 8?
  – Why not more SMs?

(Figure: G80 – 8 cores/SM, GT200 – 8 cores/SM, GF100 – 32 cores/SM.)


Fermi SM
• Dual warp scheduling
– Why?
• 32K registers
• 32 cores
– Floating point and integer
unit per core
• 16 load/store units
• 4 SFUs
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi SM
• 16 SMs * 32 cores/SM =
512 floating point
operations per cycle
• Why not in practice?



Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi SM
• Each SM
– 64KB on-chip memory
• 48KB shared memory /
16KB L1 cache, or
• 16KB L1 cache / 48 KB
shared memory
– Configurable by CUDA
developer



Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi Dual Warp Scheduling



Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf
Fermi Caches



Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi Caches



Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi ECC

• ECC Protected
– Register file, L1, L2, DRAM
• Uses redundancy to ensure data integrity against
cosmic rays flipping bits
– For example, 64 bits is stored as 72 bits
• Fix single bit errors, detect multiple bit errors
• What are the applications?
• Data centre computers



Topics

• GPU Evolution
• GPU architecture
• GPU micro-architecture
• GPU programming
• Performance optimizations
• Trends



Micro-architecture
GF100 micro-architecture



HW Groups Threads Into Warps
Example: 32 threads per warp



Example of Implementation
Note: NVIDIA may use a more
complicated implementation.



Example
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Assume warp 0 and warp 1 are scheduled for execution.



Read Src Op
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Read source operands: r1 for warp 0, r4 for warp 1.



Buffer Src Op
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Push ops to the operand collector: r1 for warp 0, r4 for warp 1.



Read Src Op
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Read source operands: r2 for warp 0, r5 for warp 1.



Buffer Src Op
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Push ops to the operand collector: r2 for warp 0, r5 for warp 1.



Execute
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Compute the first 16 threads in the warp.
Execute
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Compute the last 16 threads in the warp.
Write back
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Write back:
r0 for warp 0
r3 for warp 1



NVIDIA Instruction Set Arch.

• The target ISA of the NVIDIA compiler is an abstraction of the hardware instruction set:
  – “Parallel Thread Execution (PTX)”
  – Uses virtual registers
  – Translation to machine code is performed in s/w at load time



PTX instructions
– Example (DAXPY-style body):
shl.s32 R8, blockIdx, 9        ; Thread Block ID * Block size (512 or 2^9)
add.s32 R8, R8, threadIdx      ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]      ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]      ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4          ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2          ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0      ; Y[i] = sum (X[i]*a + Y[i])
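For reference, a minimal CUDA C sketch of the loop body that such PTX corresponds to (a DAXPY-style update, Y[i] = a*X[i] + Y[i]); the kernel name and the bounds check are illustrative additions not present in the PTX excerpt above.

__global__ void daxpy(int n, double a, const double *X, double *Y)
{
    // with 512 threads per block, blockIdx * 512 + threadIdx gives i,
    // matching the shl/add address arithmetic in the PTX above
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        Y[i] = a * X[i] + Y[i];
}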



Conditional Branching
• Like vector architectures, GPU branch hardware uses
internal masks
• Also uses
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into
multiple execution paths
• Push on divergent branch
– …and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by
programmer



Example
if (X[i] != 0)
X[i] = X[i] – Y[i];
else X[i] = Z[i];

ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0         ; P1 is predicate register 1
@!P1, bra ELSE1, *Push           ; Push old mask, set new mask bits
                                 ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2            ; Difference in RD0
st.global.f64 [X+R8], RD0        ; X[i] = RD0
@P1, bra ENDIF1, *Comp           ; complement mask bits
                                 ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0        ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask



Topics

• GPU evolution
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimizations
• Trends



Programming GPUs with CUDA
• CUDA provides facilities for programming the
accelerator
– A thread execution model
– A memory hierarchy
– Extensions to C for writing 'kernels'
– A run-time API for
• querying device attributes (e.g. compute capability)
• memory management (allocation, movement)
• for launching kernels
• for managing 'task parallelism' (CUDA streams)
• CUDA Toolkit gives tools
– compiler, debugger, profiler
• CUDA driver (kernel level) for making all this
happen
CUDA – Programming
Massive numbers (>10,000) of lightweight threads.



The CUDA Thread Model
• user 'kernels' execute in a
'grid' of 'blocks' of 'threads'
– block has ID in the grid
– thread has ID in the block
• blocks are 'independent'
– no synchronization between
blocks
• threads within a block may
cooperate
– use shared memory
– fast synchronization
• in H/W blocks are mapped
to SMs
Two Levels of Thread Hierarchy
kernelF<<<(4,1),(8,1)>>>(A);

__global__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}



Multi-dimension Thread and Block ID
Both the grid and the thread block can have a two-dimensional index.

kernelF<<<(2,2),(4,2)>>>(A);

__global__ kernelF(A){
  i = gridDim.x  * blockIdx.y  + blockIdx.x;
  j = blockDim.x * threadIdx.y + threadIdx.x;
  A[i][j]++;
}
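On the host side, the same 2-D launch is usually written with dim3 variables; a minimal sketch (assuming kernelF and A as on this slide):

dim3 grid(2, 2);     // gridDim.x = 2, gridDim.y = 2  -> 4 blocks
dim3 block(4, 2);    // blockDim.x = 4, blockDim.y = 2 -> 8 threads per block
kernelF<<<grid, block>>>(A);   // together they cover the 4x8 array A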



CUDA Program
• CUDA program expresses data level parallelism (DLP) in
terms of thread level parallelism (TLP).
• Hardware converts TLP into DLP at run time.
Scalar program:

float A[4][8];
do-all(i=0;i<4;i++){
  do-all(j=0;j<8;j++){
    A[i][j]++;
  }
}

CUDA program:

float A[4][8];
kernelF<<<(4,1),(8,1)>>>(A);

__global__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}

Scheduling Thread Blocks on SM
Example:
Scheduling 4 thread blocks on 3 SMs.



Multiple Levels of Memory Hierarchy
Name      Cached?   Latency (cycles)        Access
Global    L1/L2     200~400 (cache miss)    R/W
Shared    No        1~3                     R/W
Constant  Yes       1~3                     Read-only
Texture   Yes       ~100                    Read-only
Local     L1/L2     200~400 (cache miss)    R/W
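A minimal sketch (not from the original slides) showing how these spaces appear in CUDA C; the names (coeff, tile, memspaces, gdata, gout) are illustrative, and the block size is assumed to be at most 256 threads. Texture memory, accessed through texture-fetch functions, is omitted here.

__constant__ float coeff[16];              // constant memory: cached, read-only in kernels

__global__ void memspaces(const float *gdata, float *gout)   // gdata/gout live in global memory
{
    __shared__ float tile[256];            // shared memory: on-chip, per thread block
    float tmp;                             // register (may spill to local memory)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = gdata[i];          // global -> shared
    __syncthreads();
    tmp = tile[threadIdx.x] * coeff[threadIdx.x % 16];
    gout[i] = tmp;                         // register -> global
}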



CUDA Memories
• Registers - automatic variables in kernels are mapped to registers
– Fermi hardware places a limit of 63 registers / thread.
• Shared memory- shared by kernels within a thread block
– shared memory is 'banked' (like CPU N-way caches)
• Global device memory
– accessed through 'device' pointers
• Constant cache – fast read only memory for constants
• Texture cache – fast read only memory for data with spatial locality
• Host memory
– host pointers cannot be directly accessed by kernels
– must copy memory from host to a device memory
– can be mapped to GPU (zero copy) – accessed through a device pointer (see the sketch below)
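A minimal sketch (an assumption, reusing N, device_z, device_y and the add kernel from the vector-add example on the following slides) of the zero-copy path: pinned host memory is mapped into the device address space and read by a kernel through a device pointer.

cudaSetDeviceFlags(cudaDeviceMapHost);     // enable mapping (call before any CUDA allocation)

float *host_buf, *dev_view;
cudaHostAlloc((void**)&host_buf, N * sizeof(float), cudaHostAllocMapped);  // pinned + mapped
cudaHostGetDevicePointer((void**)&dev_view, host_buf, 0);  // device-side view of host_buf

// fill host_buf on the CPU, then let the kernel read it directly over PCIe:
add<<<1, N>>>(device_z, dev_view, device_y);

cudaFreeHost(host_buf);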
Programmers Think in Threads
Q: Why go through this hassle?



Example: Kernel to add two vectors
#include <cuda.h>      // include cuda.h to access the CUDA API
#include <cstdio>      // (may also need cuda_runtime.h)
#include <iostream>

#define N 20

// Kernel to add vectors 'x' and 'y' into 'z'
// vectors are of length N elements
// __global__ marks this as a kernel
__global__
void add( float *z, float *x, float *y )
{
    // Compute global thread ID from:
    // - local id within the block (threadIdx)
    // - id of block within grid (blockIdx)
    // threadIdx and blockIdx are predefined and can be up to 3d
    int thread_id = threadIdx.x + blockIdx.x * blockDim.x;

    if( thread_id < N ) {
        // these are device memory accesses
        z[ thread_id ] = x[ thread_id ] + y[ thread_id ];
    }
}
Example: Host Code
int main(int argc, char *argv[])
{
    // Set up input on host
    float host_x[N], host_y[N], host_z[N];
    float *device_x, *device_y, *device_z;

    for(int i=0; i < N; i++) {
        host_x[i] = (float)i;  host_y[i] = (float)(2*i);
    }

    // Allocate arrays in device global memory
    cudaMalloc( &device_x, N*sizeof(float) );
    cudaMalloc( &device_y, N*sizeof(float) );
    cudaMalloc( &device_z, N*sizeof(float) );

    // Copy host data to device arrays (via PCIe bus)
    cudaMemcpy( device_x, host_x, N*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( device_y, host_y, N*sizeof(float), cudaMemcpyHostToDevice );

    // Set up grid (1-d) of blocks
    dim3 n_blocks; dim3 threads_per_block;
    n_blocks.x = 1; threads_per_block.x = N;

    // LAUNCH KERNEL!!!  (the launch configuration is <<<grid, block>>>)
    add<<< n_blocks, threads_per_block >>>( device_z, device_x, device_y );

    // Copy back answer to host
    cudaMemcpy( host_z, device_z, N*sizeof(float), cudaMemcpyDeviceToHost );

    cudaFree( device_x ); cudaFree( device_y ); cudaFree( device_z );
}
Warps & Divergence
• Threads mapped to hardware in groups of 32 threads at a time
– these groups are called 'warps'
• Threads within a warp proceed in lock-step
– if threads within a warp take different execution paths one gets
'thread-divergence’
– Divergence reduces performance as divergent branches are
serialized
– e.g.:
__global__
void add( float *z, float *x, float *y )
{
    int thread_id = threadIdx.x + blockIdx.x * blockDim.x;
    if( thread_id % 32 < 16 ) {                            // 16 threads go one way,
        z[ thread_id ] = x[ thread_id ] + y[ thread_id ];  // the other 16 wait
    }
    else {                                                 // 2nd 16 threads go the other way,
        z[ thread_id ] = x[ thread_id ] - y[ thread_id ];  // the first 16 wait
    }
    // all threads in the warp exit the if-else together
}
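A minimal sketch (an assumption) of the same computation with the branch taken at warp granularity, so every thread of a warp follows the same path and no divergence occurs; the kernel name is illustrative.

__global__
void add_no_divergence( float *z, float *x, float *y )
{
    int thread_id = threadIdx.x + blockIdx.x * blockDim.x;
    if ( (thread_id / 32) % 2 == 0 ) {      // even-numbered warps: all 32 lanes take this path
        z[ thread_id ] = x[ thread_id ] + y[ thread_id ];
    }
    else {                                  // odd-numbered warps: all 32 lanes take this path
        z[ thread_id ] = x[ thread_id ] - y[ thread_id ];
    }
}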
Read/Write Coalescing Pre Fermi
• Memory transactions are issued for a half-warp (16 threads) at the same
time
• Under the right circumstances, the reads for the 16 threads may be
combined into “bursts”: called “read coalescing”
• For compute capability 1.2 & 1.3 coalescing rules are simple:
– the words accessed by threads in ½ warp must lie in the same
segment of size equal to:
• 32 bytes if all threads access 8-bit words
• 64 bytes if all threads access 16-bit words
• 128 bytes if all threads access 32-bit or 64-bit words
• For compute capability < 1.2 the rules are much more restrictive
  – required alignment, sequential access, etc.
• Fermi coalescing is different yet again
  – memory accesses are cached; the cache line length is 128 bytes
  – a single memory request serves a whole warp when the accesses are
    128-byte aligned and all addresses in the warp fall within the
    128-byte line (see the sketch below)
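A minimal sketch (an assumption) contrasting an access pattern that coalesces with one that does not; the kernel names are illustrative.

// Coalesced: thread k of a warp reads element base+k, so the warp's 32
// 4-byte reads fall into one 128-byte line -> one memory transaction on Fermi.
__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Strided: consecutive threads read addresses 'stride' elements apart, so a
// warp touches many 128-byte lines -> many transactions, wasted bandwidth.
__global__ void copy_strided(float *dst, const float *src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) dst[i] = src[i];
}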
Coalescing 'double'-s (c.c. 1.3)
(Figure: a half-warp of 16 threads each reading an 8-byte double, shown against 128-byte segments and alignment boundaries. Left: the accesses start on a 128-byte boundary and are coalesced – this would be coalesced even for compute capability < 1.2. Right: the same accesses misaligned across two 128-byte segments – for compute capability < 1.2 the misalignment would have caused 16 separate transactions, while compute capability >= 1.2 breaks it into just 2 transactions, one for each segment.)
Using Shared Memory
• CUDA devices contain on-chip fast access
shared memory
• Fermi: shared mem can be configured as
addressable/cache
• In CUDA one can declare memory as
__shared__
• Shared memory is banked
  – compute capability 2.x: 32 banks
  – compute capability 1.x: 16 banks
  – successive 32-bit words are assigned to successive banks

  __shared__ float data[17][2];

  (Figure: the words of data[17][2] – [0][0], [0][1], [1][0], [1][1], [2][0], ..., [15][1], [16][0], [16][1] –
  laid out at byte offsets 0, 4, 8, ..., 124, 128, 132 and mapped round-robin to banks 0, 1, 2, 3, 4, ..., 31, 0, 1.)
Shared Mem Contains Multiple Banks



Bank Conflicts
• As long as all requests come from separate banks, there are no
conflicts and requests can be satisfied simultaneously
• If multiple requests hit same bank: bank conflicts
– requests serviced in serial
• Similar to n-way cache bank conflicts
• Broadcast special case: several threads hit same word (no conflict)

(Figure: the same data[17][2] bank layout as above. Top: threads tid=0..4 access words in different banks – no conflict. Bottom: tid=0 and tid=4 hit the same bank – a conflict, so the accesses are serialized – while tid=1 and tid=2 access the same word – a broadcast, no conflict.)
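A minimal sketch (an assumption) of the standard padding trick that removes such conflicts, shown for a 32x32 shared-memory tile transpose; the kernel name is illustrative and a 32x32 thread block is assumed.

__global__ void transpose32(const float *in, float *out)
{
    // The read tile[x][y] walks down a column: without padding those 32 words
    // are 32 banks apart, i.e. all in the SAME bank (a 32-way conflict on
    // compute capability 2.x).  Declaring the row as 33 floats shifts each row
    // to a different bank and removes the conflict.
    __shared__ float tile[32][33];                 // 33 = 32 + 1 word of padding

    int x = threadIdx.x, y = threadIdx.y;          // assumes a 32x32 thread block
    tile[y][x] = in[y * 32 + x];                   // coalesced, conflict-free write
    __syncthreads();
    out[y * 32 + x] = tile[x][y];                  // column read: conflict-free only
}                                                  // because of the padding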
Shared Memory (compute capability 2.x)
(Figure: shared-memory access patterns without and with bank conflicts on compute capability 2.x.)



CUDA Streams
• CUDA provides a form of task parallelism: streams
• Streams are command queues: enque task & wait to finish
• Classic use: overlap computation with host-device memcpy
Host code:
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);

  kern1<<<size,Nb>>>(...);                 // runs in stream 0 (the default stream)
  kern2<<<size,Nb,Smem,stream1>>>(...);    // runs in stream1
  cudaMemcpyAsync(..., stream2);           // runs in stream2

  cudaStreamSynchronize(0);
  cudaStreamSynchronize(stream1);
  cudaStreamSynchronize(stream2);

  cudaStreamDestroy(stream1);
  cudaStreamDestroy(stream2);
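A minimal sketch (an assumption) of the classic overlap pattern: the input is split into two chunks, and each chunk's host-to-device copy is issued in its own stream so that the copy of one chunk can overlap the kernel working on the other. The chunk size CHUNK, the placeholder kernel process, and run_two_streams are illustrative; the host buffer must be pinned (cudaMallocHost) for the asynchronous copies to overlap.

#define CHUNK (1 << 20)                      // illustrative chunk size, in elements

__global__ void process(float *out, const float *in)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];
}

void run_two_streams()
{
    float *h_in, *d_in, *d_out;
    cudaStream_t s[2];
    cudaMallocHost((void**)&h_in, 2 * CHUNK * sizeof(float));   // pinned host memory
    cudaMalloc((void**)&d_in,  2 * CHUNK * sizeof(float));
    cudaMalloc((void**)&d_out, 2 * CHUNK * sizeof(float));
    cudaStreamCreate(&s[0]);  cudaStreamCreate(&s[1]);

    for (int c = 0; c < 2; c++) {            // the copy of one chunk overlaps the
        size_t off = (size_t)c * CHUNK;      // kernel running on the other chunk
        cudaMemcpyAsync(d_in + off, h_in + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        process<<<CHUNK / 256, 256, 0, s[c]>>>(d_out + off, d_in + off);
    }
    cudaStreamSynchronize(s[0]);  cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);      cudaStreamDestroy(s[1]);
    cudaFreeHost(h_in);  cudaFree(d_in);  cudaFree(d_out);
}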
What else is there to help me?
• Thrust: The STL of CUDA
– uses C++ type system & template tricks to hide a
lot of the finicky memcpy stuff etc.
– http://code.google.com/p/thrust/
• Lots of Tools, libraries for BLAS, LAPACK, FFTs
etc
– http://developer.nvidia.com/object/gpucomputing.html
– Prefer Python to C++?
– Check out PyCUDA
• Your favorite piece of software may already
have been ported to CUDA (talks later today…)
Other High Performance GPUs

• ATI Radeon 5000 series.



ATI Radeon 5000 Series Architecture



Radeon SIMD Engine

• 16 Stream Cores (SC)


• Local Data Share



VLIW Stream Core (SC)



Local Data Share (LDS)



Topics

• GPU evolution
• GPU architecture
• GPU micro-architecture
• GPU programming
• Performance optimizations
• Trends



Performance Optimizations
• Optimizations on memory latency tolerance
  – Reduce register pressure
  – Reduce shared memory pressure

• Optimizations on memory bandwidth
  – Global memory coalescing
  – Avoid shared memory bank conflicts
  – Group byte accesses
  – Avoid partition camping

• Optimizations on computation efficiency
  – Mul/Add balancing
  – Increase the floating-point proportion

• Optimizations on operational intensity
  – Use tiled algorithms (see the sketch below)
  – Tune thread granularity
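As an example of the "tiled algorithms" item above, a minimal sketch (an assumption, names illustrative) of tiled matrix multiplication: each 16x16 tile of A and B is staged in shared memory and reused 16 times, raising the operational intensity compared with the naive one-load-per-multiply version. n is assumed to be a multiple of TILE, and the kernel is launched with dim3 grid(n/TILE, n/TILE) and dim3 block(TILE, TILE).

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                       // tiles fully loaded before use

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // done with this pair of tiles
    }
    C[row * n + col] = acc;
}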
Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimizations
• Trends



OpenCL
• Open Computing Language
• For heterogeneous architectures
• Work-item – CUDA thread
• Work-group – thread block
• ND-Range – grid



GPUs: NVIDIA
• Current: Kepler Architecture – GK110
• 7.1 billion transistors
• 192 CUDA cores per SMX
• 32 Special Function Units per SMX
• 32 Load/Store Units per SMX
• More than 1 TFLOPS
• Recent: Maxwell Architecture – 2014
• ARM instruction set
• 20 nm Fab process
• 14-16 GFLOPS in double precision per
watt
• Power efficient – 1/4th the power of
Fermi

http://www.xbitlabs.com/news/cpu/display/20110119204601_Nvidia_Maxwell_Graphics_Processors_to_Have_Integrated_ARM_General_Purpose_Cores.html
GPUs: AMD
• Current: AMD Radeon HD 7970 (Southern Islands HD7xxx series)
• Released in January 2012
• 28 nm Fab process
• 352 mm2 die size with 4.3 billion transistors
• Up to 925 MHz engine clock
• 947 GFLOPS double precision compute power
• 230 W TDP
• Latest AMD architecture – Graphics Core Next (GCN):
• 28 nm GPU architecture
• Designed both for graphics and general computing
• 32 compute units (2,048 stream processors)
• Can take over compute workloads from the CPU



Research Trends

• Low-power GPUs - lpgpu.org


• GPU based cloud computing
• Virtualization support in GPU
• Interconnection networks
• Cache coherence in GPUs
• GPU debugging tools
• Deterministic execution on GPUs
• Cache aware scheduling
• Algorithm optimization & tuning – diff domains



Simulators
• GPGPU-Sim that runs CUDA and OpenCL
applications – www.gpgpusim.org
• GPUWattch
• Multi2sim – www.multi2sim.org
• Macsim – http://code.google.com/p/macsim
• BarraSim – http://code.google.com/p/barra-sim/



Acknowledgements
• Zhenyu Ye, Bart Mesman, and Henk Corporaal for their slides – GPU research
in the Electronic Systems group, Eindhoven University of Technology, Netherlands
• http://www.es.ele.tue.nl/~gpuattue/
• www.nvidia.com – NVIDIA architecture technical briefs



Thank you

rp@annauniv.edu

