
Graphics Processing Unit (GPU)

Architectures

Dr. Ranjani Parthasarathi


Professor
Dept. of IST, CEG Campus
Anna University

MIT - Comp Arch FDP, 27 June 2014


Topics

• GPU Evolution
• GPU architecture
• GPU micro-architecture
• GPU programming
• Performance optimizations
• Trends



System Architecture



GPU Architecture
NVIDIA Fermi, 512 Processing Elements (PEs)



What Can It Do?
Render triangles.

NVIDIA GTX480 can render 1.6 billion triangles per second!



General Purpose Computing



ref: http://www.nvidia.com/object/tesla_computing_solutions.html
GPU Highlights
• A total of 62 systems on the June 2014 Top500 list use accelerator/co-processor
technology, up from 53 in November 2013. Forty-four of these use NVIDIA chips,
two use ATI Radeon, and there are now 17 systems with Intel MIC technology
(Xeon Phi). The average number of accelerator cores for these 62 systems is
78,127 cores/system.



The Gap Between CPU and GPU



ref: Tesla GPU Computing Brochure
GPU Has >10x Comp Density

Given the same chip area, the achievable performance of a GPU is >10x higher than that of a CPU.



Evolution of Intel Pentium
(Figure: chip area breakdown of Pentium I, Pentium II, Pentium III, and Pentium IV.)

Q: What can you observe? Why?
Extrapolation of Single Core CPU
If we extrapolate the trend, in a few generations, Pentium
would look like:

Of course, we know it did not happen.

Q: What happened instead? Why?



Evolution of Multi-core CPUs
(Figure: chip area breakdown of Penryn, Bloomfield, Gulftown, and Beckton.)

Q: What can you observe? Why?



From Single core to Multi core to
Simple Cores
"This [multiple IA-core] approach is analogous to trying to
build an airplane by putting wings on a train."
--Bill Dally, NVIDIA



Let's Take a Closer Look

Less than 10% of the total chip area is used for actual execution.

Q: Why?



The Memory Hierarchy

Notes on energy at 45 nm:
64-bit Int ADD takes about 1 pJ.
64-bit FP FMA takes about 200 pJ.

It seems we cannot further increase the computational density.


The Brick Wall -- UC Berkeley's View

Power Wall: power expensive, transistors free


Memory Wall: Memory slow, multiplies fast
ILP Wall: diminishing returns on more ILP HW

David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007.
The Brick Wall -- UC Berkeley's View
Power Wall: power expensive, transistors free
Memory Wall: Memory slow, multiplies fast
ILP Wall: diminishing returns on more ILP HW

Power Wall + Memory Wall + ILP Wall = Brick Wall

David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007.
How to Break the Brick Wall?

Hint: how to exploit the parallelism inside the application?



Step 1: Trade Latency with Throughput

Hide the memory latency through fine-grained interleaved threading.



Interleaved Multi-threading



Interleaved Multi-threading

The granularity of interleaved multi-threading:


• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency



Interleaved Multi-threading

The granularity of interleaved multi-threading:


• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency

Fine-grained interleaved multi-threading:
Pros: ?
Cons: ?
Interleaved Multi-threading

The granularity of interleaved multi-threading:


• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency

Fine-grained interleaved multi-threading:
Pros: remove branch predictor, OOO scheduler, large cache
Cons: register pressure, etc.
Fine-Grained Interleaved Threading
Without and with fine-grained interleaved threading

Pros:
reduce cache size,
no branch predictor,
no OOO scheduler

Cons:
register pressure,
thread scheduler,
require huge parallelism



HW Support
The register file supports zero-overhead context switches between interleaved threads.



Can We Make Further Improvement?

• Reducing the large cache gives 2x computational density.

• Q: Can we make further improvements?

Hint: we have only utilized thread-level parallelism (TLP) so far.



Step 2: Single Instruction Multiple Data
A GPU uses wide SIMD: 8/16/24/... processing elements (PEs).
A CPU uses short SIMD: usually a vector width of 4.

SSE has 4 data lanes; a GPU has 8/16/24/... data lanes.
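To make the lane-count contrast concrete, here is a minimal sketch (not from the original slides) of the same element-wise add written with 4-wide SSE intrinsics for the CPU and as one scalar thread per element for the GPU; the function names (add4_sse, add_gpu) are illustrative.

#include <xmmintrin.h>   // SSE intrinsics (host code)

// CPU: one SSE instruction adds 4 packed floats (4 data lanes).
void add4_sse(float *c, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));
}

// GPU: each scalar thread handles one element; the hardware groups
// threads into much wider SIMD lanes.
__global__ void add_gpu(float *c, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}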



Hardware Support
Supporting interleaved threading + SIMD execution



Single Instruction Multiple Thread (SIMT)
Hide vector width using scalar threads.



Example of SIMT Execution
Assume 32 threads are grouped into one warp.
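As a minimal sketch (assuming the usual warp size of 32, as on the GPUs discussed here), the warp and lane that a scalar thread belongs to follow directly from its thread index; the kernel name warp_ids is illustrative.

__global__ void warp_ids(int *warp_of, int *lane_of)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global scalar thread ID
    int warp = threadIdx.x / 32;       // warp index within the block
    int lane = threadIdx.x % 32;       // lane index within the warp
    warp_of[tid] = warp;               // all 32 lanes of a warp execute
    lane_of[tid] = lane;               // the same instruction together
}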



Step 3: Simple Core
The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core.

Lightweight PE: Fused Multiply-Add (FMA)
SFU: Special Function Unit



Arrival of Throughput Oriented
Architectures – 3 key ideas/features
1. Simple cores (~2x comp density)
   Use many slimmed-down cores running in parallel

2. SIMD/SIMT (>10x comp density)
   Pack cores full of ALUs (by sharing an instruction stream across groups)
   Explicit SIMD / implicit SIMT

3. Fine-grained interleaved threading (~2x comp density)
   Avoid latency stalls

ref: Michael Garland and David B. Kirk, "Understanding throughput-oriented architectures", CACM 2010.



So We Reach Here & Beyond!
NVIDIA Fermi, 512 Processing Elements (PEs)



The GPU perspective



The GPU pipeline
• The GPU receives geometry information from
the CPU as an input and provides a picture as
an output.
• Let’s see how that happens:

host interface → vertex processing → triangle setup → pixel processing → memory interface



Host Interface
• The host interface is the communication bridge between the
CPU and the GPU
• It receives commands from the CPU and also pulls geometry
information from system memory
• It outputs a stream of vertices in object space with all their
associated information (normals, texture coordinates, per-vertex
color, etc.)

host interface → vertex processing → triangle setup → pixel processing → memory interface



Vertex Processing
• The vertex processing stage receives vertices from the
host interface in object space and outputs them in screen
space
• This may be a simple linear transformation, or a complex
operation involving morphing effects
• Normals, texcoords, etc. are also transformed
• No new vertices are created in this stage, and no vertices
are discarded (input/output has 1:1 mapping)

host interface → vertex processing → triangle setup → pixel processing → memory interface
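A minimal sketch (not from the original slides) of what the vertex-processing stage computes for a single vertex: multiply by a 4x4 model-view-projection matrix, perform the perspective divide, and map to screen coordinates. The type Vec4, the function vertex_to_screen, and the row-major layout of mvp are assumptions made for illustration.

typedef struct { float x, y, z, w; } Vec4;

Vec4 vertex_to_screen(const float mvp[16], Vec4 v, float width, float height)
{
    Vec4 clip;                                 // object space -> clip space
    clip.x = mvp[0]*v.x  + mvp[1]*v.y  + mvp[2]*v.z  + mvp[3]*v.w;
    clip.y = mvp[4]*v.x  + mvp[5]*v.y  + mvp[6]*v.z  + mvp[7]*v.w;
    clip.z = mvp[8]*v.x  + mvp[9]*v.y  + mvp[10]*v.z + mvp[11]*v.w;
    clip.w = mvp[12]*v.x + mvp[13]*v.y + mvp[14]*v.z + mvp[15]*v.w;

    Vec4 screen;                               // perspective divide -> NDC in [-1,1],
    screen.x = (clip.x / clip.w * 0.5f + 0.5f) * width;   // then viewport mapping
    screen.y = (clip.y / clip.w * 0.5f + 0.5f) * height;
    screen.z =  clip.z / clip.w;               // depth kept for the z-buffer test
    screen.w =  clip.w;
    return screen;
}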



Triangle setup
• In this stage geometry information becomes raster
information (screen space geometry is the input, pixels are
the output).
• Prior to rasterization, triangles that are backfacing or located
outside the viewing frustum are rejected.
• Some GPUs also do some hidden surface removal at this
stage.

host interface → vertex processing → triangle setup → pixel processing → memory interface



Triangle Setup (cont)
• A fragment is generated if and only if its
center is inside the triangle

• Every fragment generated has its attributes computed as the
perspective-correct interpolation of the three vertices that
make up the triangle

host interface → vertex processing → triangle setup → pixel processing → memory interface



Fragment Processing
• Each fragment provided by triangle setup is fed into
fragment processing as a set of attributes (position,
normal, texcoord etc), which are used to compute the
final color for this pixel
• The computations taking place here include texture
mapping and math operations
• Typically the bottleneck in modern applications

host interface → vertex processing → triangle setup → pixel processing → memory interface



Memory Interface
• Fragment colors provided by the previous stage are
written to the framebuffer
• Used to be the biggest bottleneck before fragment
processing took over
• Before the final write occurs, some fragments are
rejected by the z-buffer, stencil and alpha tests
• On modern GPUs, z and color are compressed to reduce
framebuffer bandwidth (but not size)
host interface → vertex processing → triangle setup → pixel processing → memory interface



Programmability in the GPU



Topics

• GPU Evolution
• GPU architecture
• GPU micro-architecture
• GPU programming
• Performance optimizations
• Trends



NVIDIA’s G80, GT200, Tesla, Fermi,
Kepler
• November 2006: G80
• June 2008: Tesla (GT200)
• March 2010: Fermi (GF10x)
• March 2012: Kepler (GK10x)



Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi Block Diagram
GF100
• 16 Streaming
Multiprocessors
(SMs)
• Each with 32 cores
– 512 total cores
• Each SM hosts up to
– 48 warps, or
– 1,536 threads
• In flight, up to
– 24,576 threads



Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
512 Cores



Fermi SM

• Why 32 cores per SM instead of 8?
  – Why not more SMs?

(Figure: G80 – 8 cores/SM, GT200 – 8 cores/SM, GF100 – 32 cores/SM.)


Fermi SM
• Dual warp scheduling
– Why?
• 32K registers
• 32 cores
– Floating point and integer
unit per core
• 16 load/store units
• 4 SFUs
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi SM
• 16 SMs * 32 cores/SM =
512 floating point
operations per cycle
• Why not in practice?



Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi SM
• Each SM
– 64KB on-chip memory
• 48KB shared memory /
16KB L1 cache, or
• 16KB L1 cache / 48 KB
shared memory
– Configurable by CUDA
developer



Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi Dual Warp Scheduling



Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf
Fermi Caches



Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi Caches



Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi ECC

• ECC Protected
– Register file, L1, L2, DRAM
• Uses redundancy to ensure data integrity against
cosmic rays flipping bits
– For example, 64 bits is stored as 72 bits
• Fix single bit errors, detect multiple bit errors
• What are the applications?
• Data centre computers



Topics

• GPU Evolution
• GPU architecture
• GPU micro-architecture
• GPU programming
• Performance optimizations
• Trends



Micro-architecture
GF100 micro-architecture



HW Groups Threads Into Warps
Example: 32 threads per warp



Example of Implementation
Note: NVIDIA may use a more
complicated implementation.



Example
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Assume warp 0 and warp 1 are scheduled for execution.



Read Src Op
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Read source operands: r1 for warp 0, r4 for warp 1.



Buffer Src Op
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Push ops to the operand collector: r1 for warp 0, r4 for warp 1.



Read Src Op
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Read source operands: r2 for warp 0, r5 for warp 1.



Buffer Src Op
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Push ops to the operand collector: r2 for warp 0, r5 for warp 1.



Execute
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Compute the first 16 threads in the warp.
Execute
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Compute the last 16 threads in the warp.
Write back
• Program Address: Inst
• 0x0004: add r0, r1, r2
• 0x0008: sub r3, r4, r5

Write back:
r0 for warp 0
r3 for warp 1



NVIDIA Instruction Set Arch.

• The target ISA of the NVIDIA compiler is an abstraction of the hardware instruction set:
  – “Parallel Thread Execution (PTX)”
  – Uses virtual registers
  – Translation to machine code is performed in s/w at load time



PTX instructions
– Example (DAXPY-style body):
shl.s32 R8, blockIdx, 9        ; Thread Block ID * Block size (512 or 2^9)
add.s32 R8, R8, threadIdx      ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]      ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]      ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4          ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2          ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0      ; Y[i] = sum (X[i]*a + Y[i])
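For reference, a minimal CUDA C sketch of the loop body that such PTX corresponds to (a DAXPY-style update, Y[i] = a*X[i] + Y[i]); the kernel name and the bounds check are illustrative additions not present in the PTX excerpt above.

__global__ void daxpy(int n, double a, const double *X, double *Y)
{
    // with 512 threads per block, blockIdx * 512 + threadIdx gives i,
    // matching the shl/add address arithmetic in the PTX above
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        Y[i] = a * X[i] + Y[i];
}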



Conditional Branching
• Like vector architectures, GPU branch hardware uses
internal masks
• Also uses
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into
multiple execution paths
• Push on divergent branch
– …and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by
programmer



Example
if (X[i] != 0)
X[i] = X[i] – Y[i];
else X[i] = Z[i];

ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0         ; P1 is predicate register 1
@!P1, bra ELSE1, *Push           ; Push old mask, set new mask bits
                                 ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2            ; Difference in RD0
st.global.f64 [X+R8], RD0        ; X[i] = RD0
@P1, bra ENDIF1, *Comp           ; complement mask bits
                                 ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0        ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask



Topics

• GPU evolution
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimizations
• Trends



Programming GPUs with CUDA
• CUDA provides facilities for programming the
accelerator
– A thread execution model
– A memory hierarchy
– Extensions to C for writing 'kernels'
– A run-time API for
• querying device attributes (e.g. compute capability)
• memory management (allocation, movement)
• for launching kernels
• for managing 'task parallelism' (CUDA streams)
• CUDA Toolkit gives tools
– compiler, debugger, profiler
• CUDA driver (kernel level) for making all this
happen
CUDA – Programming
Massive numbers (>10,000) of lightweight threads.



The CUDA Thread Model
• user 'kernels' execute in a
'grid' of 'blocks' of 'threads'
– block has ID in the grid
– thread has ID in the block
• blocks are 'independent'
– no synchronization between
blocks
• threads within a block may
cooperate
– use shared memory
– fast synchronization
• in H/W blocks are mapped
to SMs
Two Levels of Thread Hierarchy
kernelF<<<(4,1),(8,1)>>>(A);

__global__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}



Multi-dimension Thread and Block ID
Both the grid and the thread block can have a two-dimensional index.

kernelF<<<(2,2),(4,2)>>>(A);

__global__ kernelF(A){
  i = gridDim.x  * blockIdx.y  + blockIdx.x;
  j = blockDim.x * threadIdx.y + threadIdx.x;
  A[i][j]++;
}
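On the host side, the same 2-D launch is usually written with dim3 variables; a minimal sketch (assuming kernelF and A as on this slide):

dim3 grid(2, 2);     // gridDim.x = 2, gridDim.y = 2  -> 4 blocks
dim3 block(4, 2);    // blockDim.x = 4, blockDim.y = 2 -> 8 threads per block
kernelF<<<grid, block>>>(A);   // together they cover the 4x8 array A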



CUDA Program
• CUDA program expresses data level parallelism (DLP) in
terms of thread level parallelism (TLP).
• Hardware converts TLP into DLP at run time.
Scalar program:

float A[4][8];
do-all(i=0;i<4;i++){
  do-all(j=0;j<8;j++){
    A[i][j]++;
  }
}

CUDA program:

float A[4][8];
kernelF<<<(4,1),(8,1)>>>(A);

__global__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}

Scheduling Thread Blocks on SM
Example:
Scheduling 4 thread blocks on 3 SMs.



Multiple Levels of Memory Hierarchy
Name      Cached?   Latency (cycles)        Access
Global    L1/L2     200~400 (cache miss)    R/W
Shared    No        1~3                     R/W
Constant  Yes       1~3                     Read-only
Texture   Yes       ~100                    Read-only
Local     L1/L2     200~400 (cache miss)    R/W
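A minimal sketch (not from the original slides) showing how these spaces appear in CUDA C; the names (coeff, tile, memspaces, gdata, gout) are illustrative, and the block size is assumed to be at most 256 threads. Texture memory, accessed through texture-fetch functions, is omitted here.

__constant__ float coeff[16];              // constant memory: cached, read-only in kernels

__global__ void memspaces(const float *gdata, float *gout)   // gdata/gout live in global memory
{
    __shared__ float tile[256];            // shared memory: on-chip, per thread block
    float tmp;                             // register (may spill to local memory)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = gdata[i];          // global -> shared
    __syncthreads();
    tmp = tile[threadIdx.x] * coeff[threadIdx.x % 16];
    gout[i] = tmp;                         // register -> global
}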



CUDA Memories
• Registers - automatic variables in kernels are mapped to registers
– Fermi hardware places a limit of 63 registers / thread.
• Shared memory- shared by kernels within a thread block
– shared memory is 'banked' (like CPU N-way caches)
• Global device memory
– accessed through 'device' pointers
• Constant cache – fast read only memory for constants
• Texture cache – fast read only memory for data with spatial locality
• Host memory
– host pointers cannot be directly accessed by kernels
– must copy memory from host to a device memory
– can be mapped to GPU (zero copy) – accessed through a device pointer (see the sketch below)
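A minimal sketch (an assumption, reusing N, device_z, device_y and the add kernel from the vector-add example on the following slides) of the zero-copy path: pinned host memory is mapped into the device address space and read by a kernel through a device pointer.

cudaSetDeviceFlags(cudaDeviceMapHost);     // enable mapping (call before any CUDA allocation)

float *host_buf, *dev_view;
cudaHostAlloc((void**)&host_buf, N * sizeof(float), cudaHostAllocMapped);  // pinned + mapped
cudaHostGetDevicePointer((void**)&dev_view, host_buf, 0);  // device-side view of host_buf

// fill host_buf on the CPU, then let the kernel read it directly over PCIe:
add<<<1, N>>>(device_z, dev_view, device_y);

cudaFreeHost(host_buf);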
Programmers Think in Threads
Q: Why go through this hassle?



Example: Kernel to add two vectors
#include <cuda.h>      // include cuda.h to access the CUDA API
#include <cstdio>      // (may also need cuda_runtime.h)
#include <iostream>

#define N 20

// Kernel to add vectors 'x' and 'y' into 'z'
// vectors are of length N elements
// __global__ marks this as a kernel
__global__
void add( float *z, float *x, float *y )
{
    // Compute global thread ID from:
    // - local id within the block (threadIdx)
    // - id of block within grid (blockIdx)
    // threadIdx and blockIdx are predefined and can be up to 3d
    int thread_id = threadIdx.x + blockIdx.x * blockDim.x;

    if( thread_id < N ) {
        // these are device memory accesses
        z[ thread_id ] = x[ thread_id ] + y[ thread_id ];
    }
}
Example: Host Code
int main(int argc, char *argv[])
{
    // Set up input on host
    float host_x[N], host_y[N], host_z[N];
    float *device_x, *device_y, *device_z;

    for(int i=0; i < N; i++) {
        host_x[i] = (float)i;  host_y[i] = (float)(2*i);
    }

    // Allocate arrays in device global memory
    cudaMalloc( &device_x, N*sizeof(float) );
    cudaMalloc( &device_y, N*sizeof(float) );
    cudaMalloc( &device_z, N*sizeof(float) );

    // Copy host data to device arrays (via PCIe bus)
    cudaMemcpy( device_x, host_x, N*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( device_y, host_y, N*sizeof(float), cudaMemcpyHostToDevice );

    // Set up grid (1-d) of blocks
    dim3 n_blocks; dim3 threads_per_block;
    n_blocks.x = 1; threads_per_block.x = N;

    // LAUNCH KERNEL!!!  (the launch configuration is <<<grid, block>>>)
    add<<< n_blocks, threads_per_block >>>( device_z, device_x, device_y );

    // Copy back answer to host
    cudaMemcpy( host_z, device_z, N*sizeof(float), cudaMemcpyDeviceToHost );

    cudaFree( device_x ); cudaFree( device_y ); cudaFree( device_z );
}
Warps & Divergence
• Threads mapped to hardware in groups of 32 threads at a time
– these groups are called 'warps'
• Threads within a warp proceed in lock-step
– if threads within a warp take different execution paths one gets
'thread-divergence’
– Divergence reduces performance as divergent branches are
serialized
– e.g.:
__global__
void add( float *z, float *x, float *y )
{
    int thread_id = threadIdx.x + blockIdx.x * blockDim.x;
    if( thread_id % 32 < 16 ) {                            // 16 threads go one way,
        z[ thread_id ] = x[ thread_id ] + y[ thread_id ];  // the other 16 wait
    }
    else {                                                 // 2nd 16 threads go the other way,
        z[ thread_id ] = x[ thread_id ] - y[ thread_id ];  // the first 16 wait
    }
    // all threads in the warp exit the if-else together
}
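A minimal sketch (an assumption) of the same computation with the branch taken at warp granularity, so every thread of a warp follows the same path and no divergence occurs; the kernel name is illustrative.

__global__
void add_no_divergence( float *z, float *x, float *y )
{
    int thread_id = threadIdx.x + blockIdx.x * blockDim.x;
    if ( (thread_id / 32) % 2 == 0 ) {      // even-numbered warps: all 32 lanes take this path
        z[ thread_id ] = x[ thread_id ] + y[ thread_id ];
    }
    else {                                  // odd-numbered warps: all 32 lanes take this path
        z[ thread_id ] = x[ thread_id ] - y[ thread_id ];
    }
}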
Read/Write Coalescing Pre Fermi
• Memory transactions are issued for a half-warp (16 threads) at the same
time
• Under the right circumstances, the reads for the 16 threads may be
combined into “bursts”: called “read coalescing”
• For compute capability 1.2 & 1.3 coalescing rules are simple:
– the words accessed by threads in ½ warp must lie in the same
segment of size equal to:
• 32 bytes if all threads access 8-bit words
• 64 bytes if all threads access 16-bit words
• 128 bytes if all threads access 32-bit or 64-bit words
• For compute capability < 1.2 the rules are much more restrictive
  – required alignment, sequential access, etc.
• Fermi coalescing is different yet again
  – memory accesses are cached; the cache line length is 128 bytes
  – a single memory request serves a whole warp when the accesses are
    128-byte aligned and all addresses in the warp fall within the
    128-byte line (see the sketch below)
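A minimal sketch (an assumption) contrasting an access pattern that coalesces with one that does not; the kernel names are illustrative.

// Coalesced: thread k of a warp reads element base+k, so the warp's 32
// 4-byte reads fall into one 128-byte line -> one memory transaction on Fermi.
__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Strided: consecutive threads read addresses 'stride' elements apart, so a
// warp touches many 128-byte lines -> many transactions, wasted bandwidth.
__global__ void copy_strided(float *dst, const float *src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) dst[i] = src[i];
}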
Coalescing 'double'-s (c.c. 1.3)
(Figure: a half-warp of 16 threads each reading an 8-byte double, shown against 128-byte segments and alignment boundaries. Left: the accesses start on a 128-byte boundary and are coalesced – this would be coalesced even for compute capability < 1.2. Right: the same accesses misaligned across two 128-byte segments – for compute capability < 1.2 the misalignment would have caused 16 separate transactions, while compute capability >= 1.2 breaks it into just 2 transactions, one for each segment.)
Using Shared Memory
• CUDA devices contain on-chip fast access
shared memory
• Fermi: shared mem can be configured as
addressable/cache
• In CUDA one can declare memory as
__shared__
• Shared memory is banked
  – compute capability 2.x: 32 banks
  – compute capability 1.x: 16 banks
  – successive 32-bit words are assigned to successive banks

  __shared__ float data[17][2];

  (Figure: the words of data[17][2] – [0][0], [0][1], [1][0], [1][1], [2][0], ..., [15][1], [16][0], [16][1] –
  laid out at byte offsets 0, 4, 8, ..., 124, 128, 132 and mapped round-robin to banks 0, 1, 2, 3, 4, ..., 31, 0, 1.)
Shared Mem Contains Multiple Banks



Bank Conflicts
• As long as all requests come from separate banks, there are no
conflicts and requests can be satisfied simultaneously
• If multiple requests hit same bank: bank conflicts
– requests serviced in serial
• Similar to n-way cache bank conflicts
• Broadcast special case: several threads hit same word (no conflict)

(Figure: the same data[17][2] bank layout as above. Top: threads tid=0..4 access words in different banks – no conflict. Bottom: tid=0 and tid=4 hit the same bank – a conflict, so the accesses are serialized – while tid=1 and tid=2 access the same word – a broadcast, no conflict.)
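A minimal sketch (an assumption) of the standard padding trick that removes such conflicts, shown for a 32x32 shared-memory tile transpose; the kernel name is illustrative and a 32x32 thread block is assumed.

__global__ void transpose32(const float *in, float *out)
{
    // The read tile[x][y] walks down a column: without padding those 32 words
    // are 32 banks apart, i.e. all in the SAME bank (a 32-way conflict on
    // compute capability 2.x).  Declaring the row as 33 floats shifts each row
    // to a different bank and removes the conflict.
    __shared__ float tile[32][33];                 // 33 = 32 + 1 word of padding

    int x = threadIdx.x, y = threadIdx.y;          // assumes a 32x32 thread block
    tile[y][x] = in[y * 32 + x];                   // coalesced, conflict-free write
    __syncthreads();
    out[y * 32 + x] = tile[x][y];                  // column read: conflict-free only
}                                                  // because of the padding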
Shared Memory (compute capability 2.x)
(Figure: shared-memory access patterns without and with bank conflicts on compute capability 2.x.)



CUDA Streams
• CUDA provides a form of task parallelism: streams
• Streams are command queues: enque task & wait to finish
• Classic use: overlap computation with host-device memcpy
Host code:
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);

  kern1<<<size,Nb>>>(...);                 // runs in stream 0 (the default stream)
  kern2<<<size,Nb,Smem,stream1>>>(...);    // runs in stream1
  cudaMemcpyAsync(..., stream2);           // runs in stream2

  cudaStreamSynchronize(0);
  cudaStreamSynchronize(stream1);
  cudaStreamSynchronize(stream2);

  cudaStreamDestroy(stream1);
  cudaStreamDestroy(stream2);
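A minimal sketch (an assumption) of the classic overlap pattern: the input is split into two chunks, and each chunk's host-to-device copy is issued in its own stream so that the copy of one chunk can overlap the kernel working on the other. The chunk size CHUNK, the placeholder kernel process, and run_two_streams are illustrative; the host buffer must be pinned (cudaMallocHost) for the asynchronous copies to overlap.

#define CHUNK (1 << 20)                      // illustrative chunk size, in elements

__global__ void process(float *out, const float *in)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];
}

void run_two_streams()
{
    float *h_in, *d_in, *d_out;
    cudaStream_t s[2];
    cudaMallocHost((void**)&h_in, 2 * CHUNK * sizeof(float));   // pinned host memory
    cudaMalloc((void**)&d_in,  2 * CHUNK * sizeof(float));
    cudaMalloc((void**)&d_out, 2 * CHUNK * sizeof(float));
    cudaStreamCreate(&s[0]);  cudaStreamCreate(&s[1]);

    for (int c = 0; c < 2; c++) {            // the copy of one chunk overlaps the
        size_t off = (size_t)c * CHUNK;      // kernel running on the other chunk
        cudaMemcpyAsync(d_in + off, h_in + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        process<<<CHUNK / 256, 256, 0, s[c]>>>(d_out + off, d_in + off);
    }
    cudaStreamSynchronize(s[0]);  cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);      cudaStreamDestroy(s[1]);
    cudaFreeHost(h_in);  cudaFree(d_in);  cudaFree(d_out);
}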
What else is there to help me?
• Thrust: The STL of CUDA
– uses C++ type system & template tricks to hide a
lot of the finicky memcpy stuff etc.
– http://code.google.com/p/thrust/
• Lots of Tools, libraries for BLAS, LAPACK, FFTs
etc
– http://developer.nvidia.com/object/gpucomputing.html
– Prefer Python to C++?
– Check out PyCUDA
• Your favorite piece of software may already
have been ported to CUDA (talks later today…)
Other High Performance GPUs

• ATI Radeon 5000 series.



ATI Radeon 5000 Series Architecture



Radeon SIMD Engine

• 16 Stream Cores (SC)


• Local Data Share



VLIW Stream Core (SC)



Local Data Share (LDS)



Topics

• GPU evolution
• GPU architecture
• GPU micro-architecture
• GPU programming
• Performance optimizations
• Trends



Performance Optimizations
• Optimizations on memory latency tolerance
  – Reduce register pressure
  – Reduce shared memory pressure

• Optimizations on memory bandwidth
  – Global memory coalescing
  – Avoid shared memory bank conflicts
  – Group byte accesses
  – Avoid partition camping

• Optimizations on computation efficiency
  – Mul/Add balancing
  – Increase the floating-point proportion

• Optimizations on operational intensity
  – Use tiled algorithms (see the sketch below)
  – Tune thread granularity
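As an example of the "tiled algorithms" item above, a minimal sketch (an assumption, names illustrative) of tiled matrix multiplication: each 16x16 tile of A and B is staged in shared memory and reused 16 times, raising the operational intensity compared with the naive one-load-per-multiply version. n is assumed to be a multiple of TILE, and the kernel is launched with dim3 grid(n/TILE, n/TILE) and dim3 block(TILE, TILE).

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                       // tiles fully loaded before use

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // done with this pair of tiles
    }
    C[row * n + col] = acc;
}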
Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimizations
• Trends



OpenCL
• Open Computing Language
• For heterogeneous architectures
• Work-item – CUDA thread
• Work-group – thread block
• ND-Range – grid



GPUs: NVIDIA
• Current: Kepler Architecture – GK110
• 7.1 billion transistors
• 192 CUDA cores per SMX
• 32 Special Function Units per SMX
• 32 Load/Store Units per SMX
• More than 1 TFLOPS
• Recent: Maxwell Architecture – 2014
• ARM instruction set
• 20 nm Fab process
• 14-16 GFLOPS in double precision per
watt
• Power efficient – 1/4th the power of
Fermi

http://www.xbitlabs.com/news/cpu/display/20110119204601_Nvidia_Maxwell_Graphics_Processors_to_Have_Integrated_ARM_General_Purpose_Cores.html
GPUs: AMD
• Current: AMD Radeon HD 7970 (Southern Islands HD7xxx series)
• Released in January 2012
• 28 nm Fab process
• 352 mm2 die size with 4.3 billion transistors
• Up to 925 MHz engine clock
• 947 GFLOPS double precision compute power
• 230 W TDP
• Latest AMD architecture – Graphics Core Next (GCN):
• 28 nm GPU architecture
• Designed both for graphics and general computing
• 32 compute units (2,048 stream processors)
• Can take over compute workloads from the CPU



Research Trends

• Low-power GPUs - lpgpu.org


• GPU based cloud computing
• Virtualization support in GPU
• Interconnection networks
• Cache coherence in GPUs
• GPU debugging tools
• Deterministic execution on GPUs
• Cache aware scheduling
• Algorithm optimization & tuning – diff domains



Simulators
• GPGPU-Sim that runs CUDA and OpenCL
applications – www.gpgpusim.org
• GPUWattch
• Multi2sim – www.multi2sim.org
• Macsim – http://code.google.com/p/macsim
• BarraSim – http://code.google.com/p/barra-sim/



Acknowledgements
• Zhenyu Ye, Bart Mesman, and Henk Corporaal for their slides – GPU research
in the Electronic Systems group, Eindhoven University of Technology, Netherlands
• http://www.es.ele.tue.nl/~gpuattue/
• www.nvidia.com – NVIDIA architecture technical briefs



Thank you

rp@annauniv.edu

