GPU Computing
Grading
Textbooks / References
GPU vs CPU
A GPU is tailored for highly parallel operation, while a
CPU is optimized for executing programs serially
For this reason, GPUs have many parallel execution units
and higher transistor counts (the GTX 480 has about 3.2 billion),
while CPUs have few execution units and higher
clock speeds
GPUs have much deeper pipelines (several thousand stages
vs 10-20 for CPUs)
GPUs have significantly faster and more advanced memory
interfaces as they need to shift around a lot more data than
CPUs
Many-core GPUs vs Multicore CPU
Design philosophies:
The design of a CPU is optimized for sequential code
performance. Large cache memories are provided to
reduce instruction and data access latencies.
Memory bandwidth:
CPU: It has to satisfy requirements from the OS, applications,
and I/O devices.
GPU: Small cache memories are provided to help control
the bandwidth requirements, so multiple threads that
access the same memory data do not all need to go to
DRAM.
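The effect of a small cache on memory traffic can be illustrated with a toy count of DRAM reads. This is only a sketch in plain Python under simplifying assumptions: the function name `count_dram_reads` is hypothetical, the cache never evicts anything, and the "threads" are simulated as a flat list of read requests rather than run on real hardware.

```python
def count_dram_reads(addresses, use_cache):
    """Count how many requests must go all the way to DRAM."""
    cache = set()            # toy cache: unlimited size, no eviction (assumption)
    dram_reads = 0
    for addr in addresses:
        if use_cache and addr in cache:
            continue         # request is served from the cache
        dram_reads += 1      # request goes to DRAM
        if use_cache:
            cache.add(addr)
    return dram_reads

# 32 simulated threads, each reading the same 4 addresses
requests = [addr for _thread in range(32) for addr in range(4)]
print(count_dram_reads(requests, use_cache=False))  # 128
print(count_dram_reads(requests, use_cache=True))   # 4
```

With the cache, only the first read of each address reaches DRAM, which is the bandwidth saving the bullet above describes.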
Marketplace
CPU vs. GPU - Hardware
[Figure: GPU hardware block diagram showing eight parallel data caches and texture units above a shared global memory]
Terms
GPGPU
General-Purpose computing on a Graphics Processing
Unit
Using graphics hardware for non-graphics computations
CUDA
Compute Unified Device Architecture
Software architecture for managing data-parallel
programming
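To make the data-parallel model concrete, here is a minimal sketch in plain Python (not CUDA) of a kernel that is written once and applied per element. It uses SAXPY (y = a*x + y), one of the benchmarks named in these slides. The names `saxpy_kernel` and `launch` are hypothetical, and the loop inside `launch` only emulates what a GPU would spread across many threads at once.

```python
def saxpy_kernel(i, a, x, y):
    """Work done by one logical thread: update a single element."""
    if i < len(x):                 # guard: extra threads past the end do nothing
        y[i] = a * x[i] + y[i]

def launch(kernel, num_threads, *args):
    """Emulate a kernel launch by calling the kernel once per thread index.
    On a GPU these calls would run in parallel; here they run in order."""
    for i in range(num_threads):
        kernel(i, *args)

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
launch(saxpy_kernel, 8, 2.0, x, y)  # launching more threads than elements is safe
print(y)  # [12.0, 24.0, 36.0, 48.0]
```

The point of the sketch is the shape of the program: the kernel describes the work for one index, and parallelism comes from how many indices are launched, not from loops inside the kernel.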
Parallel Programming
MPI: Computing nodes do not share memory; all data
sharing and interaction must be done through explicit
message passing.
CUDA provides shared memory.
OpenMP: It has not been able to scale beyond a couple
hundred computing nodes due to thread management
overheads and cache coherence hardware
requirements.
CUDA achieves simple thread management with no cache
coherence hardware requirements.
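The difference between the two styles can be sketched in plain Python. This is an illustration only, not MPI, OpenMP, or CUDA code: the function names are hypothetical, and the message-passing version uses queues between threads to imitate explicit send and receive, whereas real MPI nodes have physically separate memories.

```python
import queue
import threading

def shared_memory_sum(data, num_workers=4):
    """Shared-memory style: every worker writes into one list they all see."""
    partial = [0] * num_workers

    def work(w):
        partial[w] = sum(data[w::num_workers])  # worker w owns slot w

    threads = [threading.Thread(target=work, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)

def message_passing_sum(data, num_workers=4):
    """Message-passing style: data and results move only by explicit send/receive."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = queue.Queue()

    def work(w):
        chunk = inboxes[w].get()   # receive this worker's share of the data
        results.put(sum(chunk))    # send the partial sum back

    threads = [threading.Thread(target=work, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for w in range(num_workers):
        inboxes[w].put(data[w::num_workers])  # explicit send to worker w
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(num_workers))

data = list(range(100))
print(shared_memory_sum(data), message_passing_sum(data))  # 4950 4950
```

Both versions compute the same sum; the second has to ship every input and result as a message, which is the extra programming effort the MPI bullet refers to.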
[Figure: current architecture coverage, domain-specific architecture coverage, new applications, and obstacles]
Application  Description                                                           Source lines  Kernel lines  % time
H.264        SPEC '06 version, change in guess vector                              34,811        194           35%
LBM          SPEC '06 version, change to single precision and print fewer reports  1,481         285           >99%
FEM          Finite element modeling, simulation of 3D graded materials            1,874         146           99%
RPES         Rys Polynomial Equation Solver, quantum chem, 2-electron repulsion    1,104         281           99%
[Figure: bar chart of speedup by application (H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, MRI-FHD), GeForce 8800 GTX vs. 2.2 GHz Opteron 248]
10× speedup in a kernel is typical, as long as the kernel can occupy
enough parallel threads
25× to 400× speedup if the function's data requirements and control flow
suit the GPU and the application is optimized
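A kernel speedup does not translate directly into a whole-application speedup. The sketch below applies Amdahl's law, which the slides do not state explicitly, and assumes the percentage column in the application table is the fraction of run time spent in the kernel. The function name `overall_speedup` is hypothetical.

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: only the kernel part of the run time gets faster."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# Kernel fractions taken from the table (35% for H.264, 99% for FEM),
# combined with the "typical" 10x kernel speedup quoted above.
for name, fraction in [("H.264", 0.35), ("FEM", 0.99)]:
    print(name, round(overall_speedup(fraction, 10.0), 2))
# H.264 1.46
# FEM 9.17
```

Under these assumptions, an application that spends only 35% of its time in the kernel cannot exceed roughly 1.54× overall (1 / 0.65) however fast the kernel becomes, while one at 99% can get close to the kernel's own speedup.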