
Ceng 545

GPU Computing

Grading

Midterm Exam: 20%
Homeworks: 40%
   Demo/knowledge: 25%
   Functionality: 40%
   Report: 35%
Project: 40%
   Design Document: 25%
   Project Presentation: 25%
   Demo/Final Report: 50%

Textbooks / References

 D. Kirk and W. Hwu, "Programming Massively Parallel Processors", Morgan Kaufmann, 2010, ISBN 978-0-12-381472-2
 J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Pearson, 2010, ISBN 978-0-13-138768-3
 Draft textbook by Prof. Hwu and Prof. Kirk available at the website
 NVIDIA, NVIDIA CUDA Programming Guide, NVIDIA, 2007 (reference book)
 Videos (Stanford University): http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322
 Lecture Notes (University of Illinois): http://courses.engr.illinois.edu/ece498/al/
 Lecture notes will be posted at the class web site

GPU vs CPU
 A GPU is tailored for highly parallel operation, while a CPU executes programs serially.
 For this reason, GPUs have many parallel execution units and higher transistor counts (the GTX 480 has 3,200 million transistors), while CPUs have few execution units and higher clock speeds.
 GPUs have much deeper pipelines (several thousand stages vs. 10-20 for CPUs).
 GPUs have significantly faster and more advanced memory interfaces, as they need to move much more data than CPUs.
Many-core GPUs vs Multicore CPUs
Design philosophies:
The design of a CPU is optimized for sequential code performance. (Large cache memories are provided to reduce instruction and data access latencies.)
Memory bandwidth:
CPU: It has to satisfy requirements from the OS, applications, and I/O devices.
GPU: Small cache memories are provided to help control the bandwidth requirements, so multiple threads that access the same memory data do not all need to go to the DRAM (see the sketch below).
Marketplace:
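
To make the GPU point concrete, here is a minimal sketch (illustrative names, not course code) of a thread block staging data into on-chip __shared__ memory once, so that the repeated reads by many threads are served on-chip instead of each going to DRAM. It assumes the grid exactly covers the data (the element count is a multiple of TILE), so no bounds check is needed before __syncthreads().

    #define TILE 256  // must match blockDim.x at launch

    __global__ void smooth(const float *in, float *out)
    {
        __shared__ float tile[TILE];        // on-chip memory, visible to the whole block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];          // one DRAM read per element
        __syncthreads();                    // wait until the whole tile is loaded

        // Each thread now reads three tile entries; only the load above touched DRAM.
        int l = (threadIdx.x > 0)        ? threadIdx.x - 1 : threadIdx.x;
        int r = (threadIdx.x < TILE - 1) ? threadIdx.x + 1 : threadIdx.x;
        out[i] = 0.25f * tile[l] + 0.5f * tile[threadIdx.x] + 0.25f * tile[r];
    }

    // Example launch for n elements, n % TILE == 0:
    // smooth<<<n / TILE, TILE>>>(d_in, d_out);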
CPU vs. GPU - Hardware

More transistors devoted to data processing



Processing Element

Processing element = thread processor = ALU



Memory Architecture
 Constant Memory
 Texture Memory
 Device Memory
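
As a rough illustration (declarations only; the names are invented), these three spaces appear in CUDA C of this era as:

    __constant__ float coeffs[16];    // constant memory: small, cached, read-only inside kernels
    texture<float, 1> inputTex;       // texture memory, via the classic texture-reference API
    __device__ float lookup[256];     // device (global) memory, visible to all threads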



Why Massively Parallel Processor
 A quiet revolution and potential build-up
 Calculation: 367 GFLOPS vs. 32 GFLOPS
 Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
 Until last year, programmed through graphics API
[Chart: GFLOPS over time for the GPUs listed below]
G80 = GeForce 8800 GTX
G71 = GeForce 7900 GTX
G70 = GeForce 7800 GTX
NV40 = GeForce 6800 Ultra
NV35 = GeForce FX 5950 Ultra
NV30 = GeForce FX 5800
 GPU in every PC and workstation – massive volume and potential impact
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
GeForce 8800
16 highly threaded SMs (each with 8 SPs), 128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s host bandwidth to the CPU
[Block diagram: Input Assembler and Thread Execution Manager dispatching to an array of SMs, each with a Parallel Data Cache and Texture unit, connected through Load/store units to Global Memory]

We have
GTX 480
15 highly threaded SMs (each with 32 SPs), 480 FPUs, 1344 GFLOPS, 1536 MB DRAM, 177.4 GB/s memory bandwidth
Tesla C2070
Compared to the latest quad-core CPUs, the Tesla C2050 and C2070 Computing Processors deliver equivalent supercomputing performance at 1/10th the cost and 1/20th the power consumption.
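
Figures like the ones above can be read back at run time; a small host-side sketch using the CUDA runtime API (device 0 assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);              // query device 0

        printf("Device:           %s\n", prop.name);
        printf("Multiprocessors:  %d\n", prop.multiProcessorCount);
        printf("Global memory:    %zu MB\n", prop.totalGlobalMem >> 20);
        printf("Memory bus width: %d bits\n", prop.memoryBusWidth);
        return 0;
    }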

Terms
 GPGPU
General-Purpose computing on a Graphics Processing Unit
Using graphics hardware for non-graphics computations
 CUDA
Compute Unified Device Architecture
Software architecture for managing data-parallel programming
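
A minimal sketch of that data-parallel model (illustrative names): the host allocates device memory, copies data over, and launches one thread per element.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            data[i] *= factor;                          // one element per thread
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *h = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, n * sizeof(float));              // allocate device memory
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(d, 3.0f, n);    // data-parallel launch

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h[0] = %f\n", h[0]);                    // prints 3.000000
        cudaFree(d);
        free(h);
        return 0;
    }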

Parallel Programming
MPI: Computing nodes do not share memory; all data sharing and interaction must be done through explicit message passing.
CUDA provides shared memory.
OpenMP: It has not been able to scale beyond a couple hundred computing nodes, due to thread management overheads and cache coherence hardware requirements.
CUDA achieves simple thread management and requires no cache coherence hardware.

Future Apps Reflect a Concurrent World
Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications":
molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products.
These "super-apps" represent and model the physical, concurrent world.
Various granularities of parallelism exist, but:
the programming model must not hinder parallel implementation
data delivery needs careful management

Stretching Traditional Architectures
 Traditional parallel architectures cover some super-applications
 DSP, GPU, network apps, scientific computing
 The game is to grow mainstream architectures "out" or domain-specific architectures "in"
 CUDA is the latter
[Diagram: traditional applications inside current architecture coverage; new applications inside domain-specific architecture coverage; obstacles between the two]

Previous Projects

Application | Description | Source (lines) | Kernel (lines) | % time
H.264 | SPEC '06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC '06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine | 952 | 31 | >99%
TPACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing a matrix Q, a scanner's configuration in MRI reconstruction | 490 | 33 | >99%
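
Of the kernels in this table, SAXPY (y = a*x + y) is the simplest; a hedged CUDA sketch of such a kernel (not the course's actual code):

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];    // each thread updates one element
    }

    // Example launch covering n elements with 256-thread blocks:
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);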

Speedup of Applications
[Bar chart: GPU kernel and full-application speedup relative to CPU for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD; off-scale kernel bars labeled 210, 457, 316, 79, 431, and 263]
GeForce 8800 GTX vs. 2.2 GHz Opteron 248
10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized

