
CHW 461:

Parallel Architectures
Lecture # 4
GPU System Context
GPU Computing?
 Design target for CPUs:
 Make a single thread very fast
 Take control away from the programmer

 GPU computing takes a different approach:
 Throughput matters; single threads do not
 Give explicit control to the programmer
"CPU-style" Cores
Slimming down
More Space: Double the Number of Cores
Saving Yet More Space

Idea #2
 Amortize the cost and complexity of managing an instruction stream across many ALUs.

→ SIMD
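As a concrete sketch (a minimal illustration, not how any particular ISA spells it), the C loop below amortizes one instruction stream over several lanes; the inner lane loop is what SIMD hardware executes in lockstep from a single decoded instruction:

```c
#include <stddef.h>

#define LANES 4  /* hypothetical SIMD width: 4 floats per instruction */

/* One instruction stream (the outer loop) drives LANES ALUs at a time.
   On real SIMD hardware the inner lane loop runs in lockstep as a single
   vector instruction. Assumes n is a multiple of LANES. */
void vec_add(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            c[i + l] = a[i + l] + b[i + l];
}
```

Compilers routinely auto-vectorize loops of exactly this shape into SSE/AVX (or GPU) vector instructions.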
Gratuitous Amounts of Parallelism!
Branches
Memory
 Memory latency: The time taken for a memory request to complete. This usually takes hundreds of cycles.

 Memory bandwidth: The rate at which the memory system can deliver data to a processor.

 Stalling: Occurs when a processor cannot continue to execute code because the current instruction depends on a previous instruction that has not yet completed. The processor must wait until that instruction finishes. Stalls commonly occur on memory loads.
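A minimal C sketch of why loads stall a processor: in a pointer chase, each load's address depends on the previous load's result, so nothing can overlap the memory latency, and every step pays it in full (the chain below is illustrative).

```c
/* next[i] holds the index of the element to visit after i. Each load's
   address depends on the previous load's result, so an in-order core must
   stall for the full memory latency on every step. By contrast, the loads
   in a plain sum over a[0..n-1] are independent and can be overlapped. */
int chase(const int *next, int start, int steps) {
    int idx = start;
    for (int s = 0; s < steps; s++)
        idx = next[idx];  /* dependent load: cannot be issued early */
    return idx;
}
```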
Remaining Problem: Slow Memory
 Problem
 Memory still has very high latency...
...but we've removed most of the hardware that helps us deal with that.

We've removed
 caches
 branch prediction
 out-of-order execution
So what now?
Hiding Memory Latency
Discussion !!
 Does multi-threading increase or decrease the time for an individual thread to finish its assigned task?

 Does multi-threading improve throughput?

 Does multi-threading improve performance?

 Does the time to complete all the tasks increase or decrease? Why?

 Why does multi-threading require a lot of memory bandwidth? Explain.
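One way to reason about these questions is a toy cycle-accounting model in C (all parameters illustrative): each thread issues one compute cycle, then waits `latency` cycles for a load; the core switches to any other ready thread instead of idling.

```c
/* Toy model: `threads` threads (at most 32) each issue `work` operations;
   every operation is one compute cycle followed by a load of `latency`
   cycles. The core issues from any ready thread each cycle (zero-cost
   switch) and idles only when every thread is waiting on a load. */
long simulate(int threads, int work, int latency) {
    long ready_at[32] = {0};   /* cycle at which each thread may issue again */
    int remaining[32];
    for (int t = 0; t < threads; t++) remaining[t] = work;
    int left = threads * work; /* operations still to issue */
    long cycle = 0;
    while (left > 0) {
        int issued = 0;
        for (int t = 0; t < threads; t++) {
            if (remaining[t] > 0 && ready_at[t] <= cycle) {
                remaining[t]--;
                left--;
                cycle++;                       /* one compute cycle */
                ready_at[t] = cycle + latency; /* its load is now in flight */
                issued = 1;
                break;
            }
        }
        if (!issued) cycle++;  /* all threads stalled: the core idles */
    }
    return cycle;
}
```

In this model, more threads shorten the total time by keeping the core busy during loads; each extra in-flight load is also an extra concurrent demand on memory bandwidth.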


GPU Architecture Summary
 Core Ideas:
1. Many slimmed-down cores
→ lots of parallelism.

2. More ALUs, fewer control units.

3. Avoid memory stalls by interleaving execution of SIMD groups.


Two Main Goals
• Maintain execution speed of old sequential programs → CPU

• Increase throughput of parallel programs → GPU

The CPU is optimized for sequential code performance; GPU memory offers almost 10x the bandwidth of a multicore CPU (with a relaxed memory model).
A Quick Glimpse on Flynn Classification
• A taxonomy of computer architectures.

• Proposed by Michael Flynn in 1966.

• It is based on two things:
– Instructions
– Data

• Crossing single/multiple instruction streams with single/multiple data streams gives four classes: SISD, SIMD, MISD, and MIMD.
Which one is closest to the GPU?
Problems Faced by GPUs
• Need enough parallelism.

• Under-utilization.

• Bandwidth to CPU
Modern GPU Hardware
 GPUs have
 many parallel execution units and
 higher transistor counts,
 while CPUs have
 few execution units and
 higher clock speeds

• GPUs have much deeper pipelines than the 10-20 stages typical of CPUs
• GPUs have significantly faster and more advanced memory interfaces as
they need to shift around a lot more data than CPUs
Let’s Take A Closer Look:
The Hardware
GPU Architecture: GeForce 8800 (2007)

➢ Each SM is capable of supporting thousands of concurrent hardware threads, up to 2048 on modern architecture GPUs.

➢ The SM performs all the thread management, including creation, scheduling, and barrier synchronization.

➢ The SM employs a SIMT (Single Instruction, Multiple Thread) architecture to efficiently manage the large number of threads that exist.
Streaming Processor (SP) and Streaming Multiprocessor (SM):
• SPs within an SM share control logic and an instruction cache.
• GPU memory has much higher bandwidth than typical system memory, but is a bit slower.
• Communication between GPU memory and system memory is slow.
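The SIMT execution model above can be sketched in plain C (a toy model, not real hardware behavior): all lanes of a warp share one instruction stream, so on a divergent branch the hardware runs each side in turn with the non-participating lanes masked off.

```c
#define WARP 8  /* hypothetical warp width */

/* Scalar reference: what each lane should compute independently. */
static int scalar_op(int v) {
    return (v % 2 == 0) ? v + 1 : v * 2;
}

/* SIMT-style execution: one instruction stream, per-lane active masks.
   Both sides of the branch are issued; each lane is active on exactly
   one of the two passes. */
void simt_branch(int v[WARP]) {
    int take_if[WARP];
    for (int l = 0; l < WARP; l++)  /* evaluate the condition per lane */
        take_if[l] = (v[l] % 2 == 0);
    for (int l = 0; l < WARP; l++)  /* pass 1: "if" side, others masked */
        if (take_if[l]) v[l] = v[l] + 1;
    for (int l = 0; l < WARP; l++)  /* pass 2: "else" side, others masked */
        if (!take_if[l]) v[l] = v[l] * 2;
}
```

Because both passes are issued, a fully divergent warp pays roughly the cost of both branch paths, which is why branch divergence reduces SIMT throughput.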
Scalar vs Threaded
Scalar program
float A[4][8];
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 8; j++) {
        A[i][j]++;
    }
}
Multithreaded: (4x1) blocks – (8x1) threads
Multithreaded: (2x2) blocks – (4x2) threads
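A hedged CUDA sketch of the first multithreaded variant, a (4x1) grid of (8x1) blocks with one thread per element of A; the kernel name and launch line are illustrative, and `d_A` is assumed to be a device copy of A stored row-major.

```cuda
// One thread per element: blockIdx.x selects the row (4 blocks),
// threadIdx.x selects the column (8 threads per block).
__global__ void incrementAll(float *A) {
    int i = blockIdx.x;   // row index, 0..3
    int j = threadIdx.x;  // column index, 0..7
    A[i * 8 + j]++;       // A stored row-major as a flat 4*8 array
}

// Illustrative launch:
// incrementAll<<<dim3(4, 1), dim3(8, 1)>>>(d_A);
```

The (2x2)-blocks, (4x2)-threads variant covers the same 32 elements with a 2-D grid; only the index arithmetic changes.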
