How A GPU Works
Kayvon Fatahalian
15-462 (Fall 2011)
Today
Part 1: The graphics pipeline (an abstraction)
Vertex processing
Vertices are transformed into screen space.
[Figure: input vertices v0–v5 mapped into screen space]
EACH VERTEX IS TRANSFORMED INDEPENDENTLY
Primitive processing
Then organized into primitives that are clipped and culled.
[Figure: vertices v0–v5 grouped into primitives (triangles)]
Rasterization
Primitives are rasterized into pixel fragments.
[Figure: a triangle covered by fragments]
Fragment processing
Fragments are shaded to compute a color at each pixel.
[Figure: shaded fragments]
Pixel operations
Fragments are blended into the frame buffer at their pixel locations (z-buffer determines visibility).
Pixels
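The z-buffered blend above can be sketched in a few lines (a minimal illustration, not any real API; names like `depth_buffer` and the fragment tuple format are made up for this sketch):

```python
# Minimal sketch of z-buffered pixel operations: a fragment's color is
# written only if it is closer than what the frame buffer already holds.

def pixel_op(frame_buffer, depth_buffer, frag):
    """frag = (x, y, depth, color); smaller depth = closer to the camera."""
    x, y, depth, color = frag
    if depth < depth_buffer[y][x]:      # z-buffer determines visibility
        depth_buffer[y][x] = depth      # keep the closest depth seen so far
        frame_buffer[y][x] = color      # and its color
    return frame_buffer

# Two fragments land on the same pixel; only the nearer one survives.
W, H = 2, 2
fb = [[None] * W for _ in range(H)]
zb = [[float("inf")] * W for _ in range(H)]
pixel_op(fb, zb, (0, 0, 0.8, "red"))
pixel_op(fb, zb, (0, 0, 0.3, "blue"))   # closer, overwrites red
```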
Pipeline entities
Vertices → Primitives → Fragments → Fragments (shaded) → Pixels
Graphics pipeline
(Memory buffers hold vertices, primitives, fragments, pixels, and textures; programmable stages read textures.)

Vertex Generation      → vertex stream
Vertex Processing      (programmable, reads textures)
Primitive Generation   → primitive stream
Primitive Processing   (programmable, reads textures)
Fragment Generation    → fragment stream
Fragment Processing    (programmable, reads textures)
Pixel Operations       (fixed-function) → pixels
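The stage-to-stage data flow can be sketched as plain stream processing — each stage consumes one stream of entities and produces the next (an illustrative sketch only; the stage bodies are stand-ins, not real transforms):

```python
# Sketch of the pipeline as stream processing. Each function models one
# stage: it maps an input stream to an output stream for the next stage.

def vertex_processing(vertices):
    # each vertex transformed independently (stand-in: a 2D scale)
    return [(2 * x, 2 * y) for (x, y) in vertices]

def primitive_generation(verts):
    # group every three vertices into one triangle
    return [tuple(verts[i:i + 3]) for i in range(0, len(verts), 3)]

def fragment_generation(triangles):
    # rasterize: emit fragments per triangle (stand-in: one per vertex)
    return [v for tri in triangles for v in tri]

verts = [(0, 0), (1, 0), (0, 1), (2, 2), (3, 2), (2, 3)]
frags = fragment_generation(primitive_generation(vertex_processing(verts)))
```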
Part 2: Graphics architectures

Independent
What's so important about independent computations?
[Figure sequence: mapping pipeline stages onto hardware over time]
- 3dfx Voodoo / NVIDIA RIVA TNT era: the CPU ran Vertex Processing, Primitive Generation, and Primitive Processing; the GPU provided clip/cull/rasterize, texture units (Tex), Fragment Generation, Fragment Processing, and pixel operations.
- Later GPUs: the whole pipeline moves onto the GPU, with dedicated vertex units (Vtx), fragment units (Frag), texture units, fixed-function clip/cull/rasterize, and pixel-operation hardware.
- Modern unified GPU: a pool of programmable cores plus texture units, clip/cull/rast, and pixel-op units; a scheduler maps all programmable stages (vertex, primitive, and fragment processing) onto the cores.
Part 3:
Compile shader
1 unshaded fragment input record

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}

Compiles to:

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
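What those nine instructions compute, written out plainly (a sketch, not the compiled semantics; `tex_sample` is a stand-in for the `sample` instruction):

```python
# The diffuse shader in Python: sample the texture, scale the color by
# clamp(dot(lightDir, norm), 0, 1), and emit alpha = 1.0.

def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def diffuse_shader(norm, uv, light_dir, tex_sample):
    kd = tex_sample(uv)                          # sample r0, v4, t0, s0
    s = clamp(dot(light_dir, norm), 0.0, 1.0)    # mul/madd/madd/clmp -> r3
    return tuple(c * s for c in kd) + (1.0,)     # mul o0..o2; mov o3, 1.0

# Surface facing the light (dot = 1): the texture color passes through.
out = diffuse_shader(norm=(0.0, 0.0, 1.0), uv=(0.5, 0.5),
                     light_dir=(0.0, 0.0, 1.0),
                     tex_sample=lambda uv: (0.2, 0.4, 0.6))
```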
Execute shader
[Figure: a core with Fetch/Decode, ALU (Execute), and Execution Context stepping through <diffuseShader> one instruction at a time]

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
CPU-style cores
[Figure: Fetch/Decode, ALU (Execute), Execution Context, and a data cache (a big one)]
Slimming down
Idea #1: Remove components that help a single instruction stream run fast.
[Figure: the core reduced to Fetch/Decode, ALU (Execute), and Execution Context]
Two cores (two fragments in parallel)
[Figure: two slimmed-down cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context, each running <diffuseShader> on a different fragment]

Four cores (four fragments in parallel)
[Figure: four such cores, four fragments in flight]
Add ALUs
[Figure: one Fetch/Decode feeding ALUs 1–8, with a context (Ctx) per lane]
Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs.
(SIMD processing)
[Figure: the scalar binary running on the 8-wide core]

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
[Figure: the same core running an explicitly eight-wide binary]

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul  vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul  vec_o0, vec_r0, vec_r3
VEC8_mul  vec_o1, vec_r1, vec_r3
VEC8_mul  vec_o2, vec_r2, vec_r3
VEC8_mov  o3, l(1.0)
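The VEC8 idea in miniature: one decoded instruction drives eight ALU lanes, each operating on a different fragment's registers (an illustrative sketch, not real ISA semantics):

```python
# One instruction stream, eight lanes: a single "instruction" performs
# eight lane-wise multiply-adds, one per fragment.

WIDTH = 8  # SIMD width: eight fragments per instruction stream

def vec_madd(a, b, c):
    # like VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3 — all lanes at once
    return [a[i] * b[i] + c[i] for i in range(WIDTH)]

# Eight fragments' data, processed under one fetched/decoded instruction.
norm_x  = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
light_x = [1.0] * WIDTH
acc = vec_madd(norm_x, light_x, [0.0] * WIDTH)
```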
128 [vertices / fragments / primitives / OpenCL work items / CUDA threads] in parallel
But what about branches?
[Figure: ALUs 1–8, with time running downward]

<unconditional shader code>
if (x > 0) {
  y = pow(x, exp);
  y *= Ks;
  refl = y + Ka;
} else {
  x = 0;
  refl = Ka;
}
<resume unconditional shader code>
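What the eight ALUs actually do at that branch can be sketched with per-lane masks: both sides of the if/else execute, and a mask selects which lanes keep each result, so a fully divergent group can drop to 1-in-8 useful work per instruction (a sketch of the masking idea; the `Ks`/`Ka` values here are made up):

```python
# SIMD branch divergence via masking: evaluate the condition per lane,
# run BOTH paths for all lanes, then select per lane. Lanes that agree
# (coherent) waste nothing; divergent groups pay for both paths.

def shade_masked(xs, Ks=2.0, Ka=0.5, exp=2):
    mask = [x > 0 for x in xs]                    # per-lane condition
    then_refl = [Ks * (x ** exp) + Ka for x in xs]  # "then" side, all lanes
    else_refl = [Ka for _ in xs]                    # "else" side, all lanes
    # per-lane select: keep the "then" result only where the mask is on
    return [t if m else e for m, t, e in zip(mask, then_refl, else_refl)]

refl = shade_masked([2.0, -1.0, 3.0, 0.0])
```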
Terminology
Coherent execution*** (admittedly fuzzy definition): when processing of different entities is similar, and thus can share resources for efficient execution
- Instruction stream coherence: different fragments follow the same sequence of logic
- Memory access coherence:
  - Different fragments access similar data (avoid memory transactions by reusing data in cache)
  - Different fragments simultaneously access contiguous data (enables efficient, bulk-granularity memory transactions)
*** Do not confuse this use of the term "coherence" with cache coherence protocols
Stalls!
Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
Texture access: latency of 100s of cycles.
[Figure: how a CPU-style core fights latency — Fetch/Decode, ALU, and execution contexts backed by an L1 cache (32 KB), L2 cache (256 KB), a big L3 data cache (8 MB), and 25 GB/sec to memory]
Texture access latency = 100s to 1000s of cycles.
We've removed the fancy caches and logic that help avoid stalls.

Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations.
Hiding stalls
[Figure: the 8-wide core (Fetch/Decode, ALUs 1–8) interleaving four groups of fragments — Frag 1–8, Frag 9–16, Frag 17–24, Frag 25–32 — over time (clocks). When one group stalls on a texture access, the core starts another runnable group; each group eventually runs to completion. Throughput!]

Storing contexts
[Figure: the core's on-chip storage partitioned into execution contexts, one per interleaved group]
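A toy cycle count makes the payoff concrete. Suppose each fragment group alternates 4 cycles of ALU work with a 12-cycle texture stall (numbers invented for this sketch): alone, the core idles 75% of the time; with four groups to switch between, every stall overlaps some other group's ALU work.

```python
# Toy model of latency hiding by interleaving. Each group runs PHASES
# rounds of (ALU cycles of work, then a STALL-cycle texture wait).

ALU, STALL, PHASES = 4, 12, 2

def busy_cycles(n_groups):
    # real ALU work contributed by all groups
    return n_groups * PHASES * ALU

def total_cycles(n_groups):
    # a single group is bounded by its own work + stalls; with enough
    # groups, stalls hide under other groups' work (here STALL == 3*ALU,
    # so four round-robin groups cover each stall exactly)
    one_group_latency = PHASES * (ALU + STALL)
    return max(busy_cycles(n_groups), one_group_latency)

util_1 = busy_cycles(1) / total_cycles(1)   # one group: mostly idle
util_4 = busy_cycles(4) / total_cycles(4)   # four groups: fully busy
```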
My chip!
16 cores
8 mul-add ALUs per core (128 total)
16 simultaneous instruction streams
64 concurrent (but interleaved) instruction streams
512 concurrent fragments
= 256 GFLOPs (@ 1 GHz)
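The 256 GFLOPs figure follows directly from the counts above: each mul-add ALU retires one multiply-add — two floating-point ops — per clock.

```python
# Peak-throughput arithmetic for the chip described above.

cores = 16
alus_per_core = 8
flops_per_alu_per_clock = 2      # one mul-add = 2 FLOPs
clock_hz = 1_000_000_000         # 1 GHz

total_alus = cores * alus_per_core                         # 128 ALUs
peak_flops = total_alus * flops_per_alu_per_clock * clock_hz
peak_gflops = peak_flops / 1e9                             # 256 GFLOPs
```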
My enthusiast chip!
[Figure: many more such cores, each with its own Fetch/Decode, replicated across the chip]
Looking Forward
Recent trends:
- Support for alternative programming interfaces: accelerate non-graphics applications using the GPU (CUDA, OpenCL)
- Evolving the graphics pipeline abstraction to enable more advanced real-time graphics:
  - Direct3D 11 adds three new pipeline stages (Credit: NVIDIA)
  - Ray tracing for accurate reflections and shadows (Credits: Ingo Wald, Bratincevic)
  - More efficient scaling to many lights (1000 lights, [Andersson 09])
  - Simulation
  - Motion blur
Thanks!
Relevant CMU courses for students interested in high-performance graphics:
- 15-869: Graphics and Imaging Architectures (my special topics course)
- 15-668: Advanced Parallel Graphics (Treuille)
- 15-418: Parallel Architecture and Programming (spring semester)