GPUs: Throughput-Oriented Design (a massively parallel architecture consisting of thousands of smaller,
more efficient cores designed to handle many tasks simultaneously)
Use GPUs for the parallel parts of an application, where throughput wins.
-Moderate clock frequency (lower than typical CPU clocks)
-Small caches
To boost memory throughput
-Simple control
No branch prediction, no data forwarding
-Energy efficient ALUs
Many ALUs; long latency, but heavily pipelined for high throughput
-Require a massive number of threads to tolerate latencies (see the launch sketch below).
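A minimal CUDA sketch of that last point, assuming a made-up scale kernel and illustrative sizes: the launch creates far more threads than the GPU has cores, so the scheduler can swap in ready warps while others wait on memory.

```cuda
#include <cuda_runtime.h>

// Each thread scales one element; the grid holds millions of threads.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;                          // one element per thread
}

int main() {
    const int n = 1 << 24;                          // ~16M elements: far more threads than cores
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);  // oversubscribe to hide latency
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```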
Global Memory:
1. Faster for the GPU than reaching back to host RAM, but still slow compared with the on-chip options (shared memory, caches, registers); see the transfer sketch below.
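A minimal sketch of that point, with illustrative names and sizes: data starts in host RAM, is copied once into device global memory, and kernels then read it from there instead of crossing the bus on every access.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);   // host RAM
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, bytes);               // device global memory (on-board DRAM)

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // one slow bus transfer
    // ... kernels launched here read d_data directly from global memory ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```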
Each block:
1. All threads execute the same kernel program (SPMD)
2. Each thread/warp handles a small portion of the given task
3. All threads share data and synchronize while doing their share of the work
4. Threads within the same block can cooperate; threads in different blocks cannot (see the sketch after this list)
5. Hardware is free to assign blocks to any parallel processor at any time, so each block can execute in any
order relative to other blocks
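A minimal sketch of points 1-4, assuming a hypothetical blockSum kernel launched with 256 threads per block: every thread runs the same kernel (SPMD), each handles one element, and threads in the same block cooperate through shared memory and a barrier; blocks never synchronize with each other.

```cuda
#include <cuda_runtime.h>

// Launch as: blockSum<<<numBlocks, 256>>>(d_in, d_totals);
__global__ void blockSum(const float *in, float *blockTotals) {
    __shared__ float tile[256];               // visible only within this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;    // this thread's small portion

    tile[tid] = in[i];
    __syncthreads();                          // barrier: all threads in the block

    // Tree reduction: threads in the block cooperate on a shared result.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        blockTotals[blockIdx.x] = tile[0];    // one result per block; no cross-block sync
}
```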
Warps:
1. All 32 threads in a warp execute the same instruction at the same time (SIMD) because there is only one instruction unit
2. The warp is the scheduling unit in an SM, and warps run concurrently
3. There is no guaranteed execution order among blocks or warps
4. Zero-overhead warp scheduling (no empty issue slots while some instruction is pending)
a. Warps whose next instruction has its operands ready for consumption are eligible for execution
b. Eligible warps are selected for execution based on a prioritized scheduling policy
5. Warp/thread divergence occurs when threads within a single warp take different paths, e.g., through if
statements; the divergent paths are executed sequentially, so extra cycles are needed (see the sketch after this list).
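A minimal sketch of divergence, assuming a hypothetical kernel: the if below splits even and odd lanes of the same 32-thread warp onto two paths, which the hardware serializes.

```cuda
__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)      // even and odd lanes of the SAME warp diverge
        out[i] = out[i] * 2.0f;    // path A runs while odd lanes sit idle
    else
        out[i] = out[i] + 1.0f;    // path B runs while even lanes sit idle
}
// Branching on a warp-aligned condition instead, e.g. (threadIdx.x / 32 == 0),
// keeps every thread in a warp on the same path and avoids the extra cycles.
```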
Operand scoreboarding is used to prevent hazards: an instruction becomes ready only after the values it
needs have been deposited.
#Cache and Shared Memory sizes differ between GPU models and architectures (see the query sketch below).
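Since these sizes vary per device, they can be queried at runtime rather than hard-coded; a minimal sketch using the standard cudaGetDeviceProperties call:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // properties of device 0
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("L2 cache size:           %d bytes\n", prop.l2CacheSize);
    return 0;
}
```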