
Ceng 545

GPU Computing

Grading

Midterm Exam: 20%
Homeworks: 40%
   Demo/knowledge: 25%
   Functionality: 40%
   Report: 35%
Project: 40%
   Design Document: 25%
   Project Presentation: 25%
   Demo/Final Report: 50%

Textbooks / References

 D. Kirk and W. Hwu, "Programming Massively Parallel Processors", Morgan Kaufmann, 2010, ISBN 978-0-12-381472-2
 J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Pearson, 2010, ISBN 978-0-13-138768-3
 Draft textbook by Prof. Hwu and Prof. Kirk available at the website
 NVIDIA, NVIDIA CUDA Programming Guide, NVIDIA, 2007 (reference book)
 Videos (Stanford University): http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322
 Lecture Notes (University of Illinois): http://courses.engr.illinois.edu/ece498/al/
 Lecture notes will be posted at the class web site

GPU vs CPU
 A GPU is tailored for highly parallel operation, while a CPU executes programs serially.
 For this reason, GPUs have many parallel execution units and higher transistor counts (the GTX 480 has 3,200 million transistors), while CPUs have few execution units and higher clock speeds.
 GPUs have much deeper pipelines (several thousand stages vs. 10-20 for CPUs).
 GPUs have significantly faster and more advanced memory interfaces, as they need to move much more data than CPUs.
Many-core GPUs vs Multicore CPUs
Design philosophies:
The design of a CPU is optimized for sequential code performance. (Large cache memories are provided to reduce instruction and data access latencies.)
Memory bandwidth:
CPU: It has to satisfy requirements from the OS, applications, and I/O devices.
GPU: Small cache memories are provided to help control the bandwidth requirements, so multiple threads that access the same memory data do not all need to go to the DRAM (see the sketch below).
Marketplace:
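
To make the GPU point concrete, here is a minimal sketch (illustrative names, not course code) of a thread block staging data into on-chip __shared__ memory once, so that the repeated reads by many threads are served on-chip instead of each going to DRAM. It assumes the grid exactly covers the data (the element count is a multiple of TILE), so no bounds check is needed before __syncthreads().

    #define TILE 256  // must match blockDim.x at launch

    __global__ void smooth(const float *in, float *out)
    {
        __shared__ float tile[TILE];        // on-chip memory, visible to the whole block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];          // one DRAM read per element
        __syncthreads();                    // wait until the whole tile is loaded

        // Each thread now reads three tile entries; only the load above touched DRAM.
        int l = (threadIdx.x > 0)        ? threadIdx.x - 1 : threadIdx.x;
        int r = (threadIdx.x < TILE - 1) ? threadIdx.x + 1 : threadIdx.x;
        out[i] = 0.25f * tile[l] + 0.5f * tile[threadIdx.x] + 0.25f * tile[r];
    }

    // Example launch for n elements, n % TILE == 0:
    // smooth<<<n / TILE, TILE>>>(d_in, d_out);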
CPU vs. GPU - Hardware

More transistors devoted to data processing



Processing Element

Processing element = thread processor = ALU



Memory Architecture
 Constant Memory
 Texture Memory
 Device Memory
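
As a rough illustration (declarations only; the names are invented), these three spaces appear in CUDA C of this era as:

    __constant__ float coeffs[16];    // constant memory: small, cached, read-only inside kernels
    texture<float, 1> inputTex;       // texture memory, via the classic texture-reference API
    __device__ float lookup[256];     // device (global) memory, visible to all threads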



Why Massively Parallel Processor
 A quiet revolution and potential build-up
 Calculation: 367 GFLOPS vs. 32 GFLOPS
 Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
 Until last year, programmed through graphics API
[Chart: GFLOPS over time for the GPUs listed below]
G80 = GeForce 8800 GTX
G71 = GeForce 7900 GTX
G70 = GeForce 7800 GTX
NV40 = GeForce 6800 Ultra
NV35 = GeForce FX 5950 Ultra
NV30 = GeForce FX 5800
 GPU in every PC and workstation – massive volume and potential impact
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
GeForce 8800
16 highly threaded SMs (each with 8 SPs), 128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s host bandwidth to the CPU
[Block diagram: Input Assembler and Thread Execution Manager dispatching to an array of SMs, each with a Parallel Data Cache and Texture unit, connected through Load/store units to Global Memory]

We have
GTX 480
15 highly threaded SMs (each with 32 SPs), 480 FPUs, 1344 GFLOPS, 1536 MB DRAM, 177.4 GB/s memory bandwidth
Tesla C2070
Compared to the latest quad-core CPUs, the Tesla C2050 and C2070 Computing Processors deliver equivalent supercomputing performance at 1/10th the cost and 1/20th the power consumption.
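
Figures like the ones above can be read back at run time; a small host-side sketch using the CUDA runtime API (device 0 assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);              // query device 0

        printf("Device:           %s\n", prop.name);
        printf("Multiprocessors:  %d\n", prop.multiProcessorCount);
        printf("Global memory:    %zu MB\n", prop.totalGlobalMem >> 20);
        printf("Memory bus width: %d bits\n", prop.memoryBusWidth);
        return 0;
    }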

Terms
 GPGPU
General-Purpose computing on a Graphics Processing Unit
Using graphics hardware for non-graphics computations
 CUDA
Compute Unified Device Architecture
Software architecture for managing data-parallel programming
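
A minimal sketch of that data-parallel model (illustrative names): the host allocates device memory, copies data over, and launches one thread per element.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            data[i] *= factor;                          // one element per thread
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *h = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, n * sizeof(float));              // allocate device memory
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(d, 3.0f, n);    // data-parallel launch

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h[0] = %f\n", h[0]);                    // prints 3.000000
        cudaFree(d);
        free(h);
        return 0;
    }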

Parallel Programming
MPI: Computing nodes do not share memory; all data sharing and interaction must be done through explicit message passing.
CUDA provides shared memory.
OpenMP: It has not been able to scale beyond a couple hundred computing nodes, due to thread management overheads and cache coherence hardware requirements.
CUDA achieves simple thread management and requires no cache coherence hardware.

Future Apps Reflect a Concurrent World
Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications":
molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products.
These "super-apps" represent and model the physical, concurrent world.
Various granularities of parallelism exist, but:
the programming model must not hinder parallel implementation
data delivery needs careful management

Stretching Traditional Architectures
 Traditional parallel architectures cover some super-applications
 DSP, GPU, network apps, scientific computing
 The game is to grow mainstream architectures "out" or domain-specific architectures "in"
 CUDA is the latter
[Diagram: traditional applications inside current architecture coverage; new applications inside domain-specific architecture coverage; obstacles between the two]

Previous Projects

Application | Description | Source (lines) | Kernel (lines) | % time
H.264 | SPEC '06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC '06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine | 952 | 31 | >99%
TPACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing a matrix Q, a scanner's configuration in MRI reconstruction | 490 | 33 | >99%
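
Of the kernels in this table, SAXPY (y = a*x + y) is the simplest; a hedged CUDA sketch of such a kernel (not the course's actual code):

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];    // each thread updates one element
    }

    // Example launch covering n elements with 256-thread blocks:
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);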

Speedup of Applications
[Bar chart: GPU kernel and full-application speedup relative to CPU for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD; off-scale kernel bars labeled 210, 457, 316, 79, 431, and 263]
GeForce 8800 GTX vs. 2.2 GHz Opteron 248
10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized

