Graphics Processing Unit (GPU) Architectures
• GPU Evolution
• GPU architecture
• GPU micro-architecture
• GPU programming
• Performance optimizations
• Trends
Chip area breakdown
Q: What can you observe? Why?
6/27/2014 MIT - Comp Arch FDP Jun 2014
Extrapolation of Single Core CPU
If we extrapolate the trend, in a few generations, Pentium
would look like:
Gulftown Beckton
Less than 10% of the total chip area is used for actual execution.
Q: Why?
David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007
The Brick Wall -- UC Berkeley's View
Power Wall: power expensive, transistors free
Memory Wall: Memory slow, multiplies fast
ILP Wall: diminishing returns on more ILP HW
How to Break the Brick Wall?
Pros:
reduce cache size,
no branch predictor,
no OOO scheduler
Cons:
register pressure,
thread scheduler,
require huge parallelism
Hint:
We have only utilized thread-level parallelism (TLP) so far.
SFU: Special Function Unit
ref: Michael Garland, David B. Kirk, "Understanding throughput-oriented architectures", CACM 2010.
• ECC Protected
– Register file, L1, L2, DRAM
• Uses redundancy to ensure data integrity against
cosmic rays flipping bits
– For example, 64 bits is stored as 72 bits
• Fix single bit errors, detect multiple bit errors
• What are the applications?
• Data centre computers
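The "64 bits stored as 72 bits" scheme above can be illustrated with a small host-side sketch. It assumes an extended Hamming (SECDED) layout with 7 check bits plus 1 overall parity bit; the slides do not say which code the hardware uses, so the bit layout and the names `Codeword`, `EccStatus`, `ecc_encode`, and `ecc_decode` are hypothetical.

```cpp
#include <bitset>
#include <cstdint>

// Illustrative layout (an assumption, not a documented GPU format):
// bit 0 = overall parity, bits 1,2,4,...,64 = Hamming check bits,
// the remaining 64 positions in 1..71 hold the data bits.
using Codeword = std::bitset<72>;

enum class EccStatus { Ok, Corrected, Uncorrectable };

// Positions that are powers of two are reserved for check bits (pos >= 1).
inline bool is_check_position(unsigned pos) { return (pos & (pos - 1)) == 0; }

Codeword ecc_encode(std::uint64_t data) {
    Codeword cw;
    unsigned syndrome = 0;  // XOR of the positions of all set data bits
    unsigned d = 0;         // index of the next data bit to place
    for (unsigned pos = 1; pos < 72; ++pos) {
        if (is_check_position(pos)) continue;
        if ((data >> d) & 1u) {
            cw.set(pos);
            syndrome ^= pos;
        }
        ++d;
    }
    // Set check bits so that the XOR of all set positions becomes zero.
    for (unsigned i = 0; i < 7; ++i)
        if ((syndrome >> i) & 1u) cw.set(1u << i);
    // The overall parity bit makes the total number of ones even.
    if (cw.count() % 2 != 0) cw.set(0);
    return cw;
}

EccStatus ecc_decode(Codeword cw, std::uint64_t& data) {
    unsigned syndrome = 0;
    for (unsigned pos = 1; pos < 72; ++pos)
        if (cw.test(pos)) syndrome ^= pos;
    const bool odd_parity = cw.count() % 2 != 0;

    EccStatus status = EccStatus::Ok;
    if (odd_parity) {
        // Odd number of flips: treat as a single-bit error at `syndrome`
        // (syndrome 0 means the overall parity bit itself flipped).
        if (syndrome >= 72) return EccStatus::Uncorrectable;
        cw.flip(syndrome);
        status = EccStatus::Corrected;
    } else if (syndrome != 0) {
        // Even number of flips with a non-zero syndrome: detected only.
        return EccStatus::Uncorrectable;
    }

    data = 0;
    unsigned d = 0;
    for (unsigned pos = 1; pos < 72; ++pos) {
        if (is_check_position(pos)) continue;
        if (cw.test(pos)) data |= std::uint64_t{1} << d;
        ++d;
    }
    return status;
}
```

Flipping any one of the 72 bits yields `Corrected` with the original data restored, while flipping two bits yields `Uncorrectable`, which matches the "fix single, detect multiple" behaviour listed above for the double-bit case.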
Write back:
r0 for warp 0
r3 for warp 1
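[Figure: two-column array laid out in memory; byte-address ticks 0 4 8 16 24 ... 124 128 132; elements [0][0] [0][1] [1][0] [1][1] [2][0] ... [15][1] [16][0] [16][1]]
Shared Mem Contains Multiple Banks
[Figure: byte-address ticks 0 4 8 16 24 ... 124 128 132 mapped onto banks; panel label: no conflict]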
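The address-to-bank mapping in the figure can be modelled on the host. The sketch below assumes 32 banks that are each 4 bytes wide, and that threads reading the same word are served by a broadcast; the slides state neither number, and `bank_of`, `conflict_degree`, and `strided_addresses` are hypothetical helper names.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kNumBanks = 32;       // assumption: 32 banks
constexpr std::size_t kBankWidthBytes = 4;  // assumption: 4-byte-wide banks

// Bank that serves a given shared-memory byte address.
std::size_t bank_of(std::size_t byte_addr) {
    return (byte_addr / kBankWidthBytes) % kNumBanks;
}

// Largest number of distinct words that one warp requests from a single bank.
// 1 means conflict-free; N means the access is serialized into N passes.
std::size_t conflict_degree(const std::vector<std::size_t>& byte_addrs) {
    std::array<std::vector<std::size_t>, kNumBanks> words_per_bank;
    for (std::size_t addr : byte_addrs) {
        const std::size_t word = addr / kBankWidthBytes;
        auto& words = words_per_bank[word % kNumBanks];
        // Threads that hit the same word are modelled as one broadcast read.
        if (std::find(words.begin(), words.end(), word) == words.end())
            words.push_back(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank)
        worst = std::max(worst, words.size());
    return worst;
}

// Byte addresses touched when thread t reads word t * stride_words.
std::vector<std::size_t> strided_addresses(std::size_t stride_words,
                                           std::size_t num_threads = 32) {
    std::vector<std::size_t> addrs;
    for (std::size_t t = 0; t < num_threads; ++t)
        addrs.push_back(t * stride_words * kBankWidthBytes);
    return addrs;
}
```

Under these assumptions byte addresses 124, 128, and 132 fall in banks 31, 0, and 1, a stride of 1 word is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 33 words is conflict-free again, which is why padding a 32-wide row by one element is a common remedy.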
// Wait for all work queued in each stream to finish, then release the streams
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
What else is there to help me?
• Thrust: The STL of CUDA
– uses C++ type system & template tricks to hide a
lot of the finicky memcpy stuff etc.
– http://code.google.com/p/thrust/
• Lots of tools and libraries for BLAS, LAPACK, FFTs, etc.
– http://developer.nvidia.com/object/gpucomputing.html
• Prefer Python to C++?
– Check out PyCUDA
• Your favorite piece of software may already
have been ported to CUDA (talks later today…)
Other High Performance GPU
http://www.xbitlabs.com/news/cpu/display/20110119204601_Nvidia_Maxwell_Graphics_Processors_to_Have_Integrated_ARM_General_Purpose_Cores.html
GPUs: AMD
• Current: AMD Radeon HD 7970 (Southern Islands HD7xxx series)
• Released in January 2012
• 28 nm Fab process
• 352 mm² die size with 4.3 billion transistors
• Up to 925 MHz engine clock
• 947 GFLOPS double precision compute power
• 230 W TDP
• Latest AMD architecture – Graphics Core Next (GCN):
• 28 nm GPU architecture
• Designed both for graphics and general computing
• 32 compute units (2,048 stream processors)
• Handles workloads of the processor
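As a sanity check on the figures above, the 947 GFLOPS double-precision number can be rederived from the clock and the lane count. The sketch assumes 2,048 stream processors (32 compute units of 64 lanes each), one fused multiply-add (2 FLOPs) per lane per clock, and a double-precision rate of one quarter of single precision; these are assumptions about the HD 7970 rather than statements from the slides, and `peak_gflops` is a hypothetical helper.

```cpp
// Peak throughput in GFLOPS = lanes x FLOPs per lane per clock x clock in GHz.
constexpr double peak_gflops(int lanes, double flops_per_lane_per_clock,
                             double clock_mhz) {
    return lanes * flops_per_lane_per_clock * clock_mhz / 1000.0;
}

constexpr int kStreamProcessors = 32 * 64;  // assumption: 32 CUs x 64 lanes
constexpr double kEngineClockMHz = 925.0;   // "up to 925 MHz" from the list

// Assumption: one fused multiply-add per lane per clock counts as 2 FLOPs.
constexpr double kSinglePrecisionGflops =
    peak_gflops(kStreamProcessors, 2.0, kEngineClockMHz);

// Assumption: double precision runs at 1/4 of the single-precision rate.
constexpr double kDoublePrecisionGflops = kSinglePrecisionGflops / 4.0;
```

With these inputs the single-precision peak comes to about 3,789 GFLOPS and the double-precision peak to about 947 GFLOPS, consistent with the listed value.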
rp@annauniv.edu