
CPUs: Latency-Oriented Design (a few cores optimized for sequential, serial processing)

CPUs for sequential parts where latency matters


- High clock frequency
- Large caches
  Convert long-latency memory accesses into short-latency cache accesses
- Sophisticated control (adds more complexity)
  Branch prediction for reduced branch latency
  Data forwarding for reduced data latency
- Powerful ALUs
  Reduced operation latency

GPUs: Throughput-Oriented Design (a massively parallel architecture consisting of thousands of smaller,
more efficient cores designed to handle many tasks simultaneously)
GPUs for parallel parts where throughput wins.
- Moderate clock frequency (lower clock rates than CPUs)
- Small caches
  To boost memory throughput
- Simple control
  No branch prediction, no data forwarding
- Energy-efficient ALUs
  Many ALUs; long latency, but heavily pipelined for high throughput
- Requires a massive number of threads to tolerate latencies

The GPU is a slave device; it cannot run a program on its own.


The OS talks to the CPU, and the CPU then assigns the parallel tasks (applications) to the GPU to exploit its
memory throughput (there is no direct interaction between the OS and the GPU).
Anything related to latency and the sequential parts is handled by the CPU.
The OS and databases run on the CPU only.
The sequential component (1 - f) limits the speedup (Amdahl's law), so the sequential part that runs on the CPU
controls the speedup.
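
A short worked form of Amdahl's law (f for the parallel fraction and N for the number of processors are generic symbols, not from the notes):

\text{Speedup}(N) = \frac{1}{(1 - f) + \frac{f}{N}}

For example, with f = 0.9 and N = 100: Speedup = 1 / (0.1 + 0.009) ≈ 9.2, so the 10% sequential part running on the CPU caps the gain even with 100 parallel processors.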

How to improve the performance of applications:


1- Heterogeneous execution model (CPU is the host, GPU is the device)
2- Develop a C-like programming language for the GPU (Compute Unified Device Architecture, CUDA)
3- Unify all forms of GPU parallelism as the CUDA thread
4- Programming model is “Single Instruction Multiple Thread” (SIMT)
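
A minimal sketch of this model (the vecAdd kernel, launchVecAdd, and the device pointers d_a, d_b, d_c are illustrative assumptions): the host (CPU) launches the kernel on the device (GPU), and every CUDA thread runs the same code on its own element (SIMT).

#include <cuda_runtime.h>

// Hypothetical kernel: each CUDA thread handles one array element (SIMT).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the leftover threads
        c[i] = a[i] + b[i];
}

// Host (CPU) side: launch the kernel on the device (GPU).
// d_a, d_b, d_c are assumed to point into GPU global memory (see the next section).
void launchVecAdd(const float *d_a, const float *d_b, float *d_c, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover n
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
}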

Global Memory:
1. Faster than fetching data from the host's RAM, but the GPU still has other, faster on-chip options (caches, shared memory)
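
A host-side sketch of using global memory (roundTrip, h_a, d_a, and n are assumed names): data is copied from host RAM into the GPU's global memory before a kernel runs, and copied back afterwards.

#include <cuda_runtime.h>

// Hypothetical host routine: stage data through GPU global memory.
void roundTrip(float *h_a, int n) {
    float *d_a;                                           // pointer into GPU global memory
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_a, bytes);                              // allocate in global memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host RAM -> global memory
    // ... launch a kernel that reads/writes d_a ...
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // global memory -> host RAM
    cudaFree(d_a);                                        // release global memory
}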

Streaming Multiprocessors (SMs):


1. Has multiple processors
2. Only one instruction unit (threads share a program counter)
3. A group of processors must run the exact same set of instructions at any given time
4. Up to 32 blocks are assigned to each SM as resources allow; each block is executed as 32-thread warps (SIMD), as sketched below
5. Maintains thread/block IDs and manages/schedules thread execution
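
A quick sketch of the block-to-warp mapping (the block size of 256 is an arbitrary choice): the SM splits a block's threads into warps of 32.

// Illustrative: a 256-thread block is executed by the SM as 256 / 32 = 8 warps.
dim3 block(256);                           // threads per block
int warpsPerBlock = (block.x + 31) / 32;   // = 8 warps, each run in SIMD fashion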

Each block:
1. All threads execute the same kernel program (SPMD)
2. Each thread/warp handles a small portion of the given task
3. All threads share data and synchronize while doing their share of the work
4. Threads within the same block can cooperate (see the sketch after this list); threads in different blocks cannot
5. Hardware is free to assign blocks to any parallel processor at any time, so each block can execute in any
order relative to other blocks
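
A minimal sketch of threads in one block cooperating (the blockSum kernel and the fixed block size of 256 are assumptions): they share data through __shared__ memory and synchronize with __syncthreads(); threads in other blocks cannot take part.

// Hypothetical kernel: one block cooperatively sums 256 input elements.
__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];                     // visible to this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                      // each thread loads one element
    __syncthreads();                                // wait for the whole block

    // Tree reduction inside the block (blockDim.x assumed to be 256).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                  // one partial sum per block
}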

Warps:
1. The 32 threads in a warp execute the same set of instructions at the same time (SIMD) because there
is only one instruction unit
2. The warp is the scheduling unit in the SM, and warps run concurrently
3. There is no guaranteed ordering for executing blocks or warps
4. Zero-overhead warp scheduling (no empty slots while there is some pending instruction)
   a. Warps whose next instruction has its operands ready for consumption are eligible for execution
   b. Eligible warps are selected for execution based on a prioritized scheduling policy
5. Warp/thread divergence occurs when threads in a single warp take different instruction paths (e.g., because
of if statements); the divergent paths are executed sequentially, so extra cycles are needed (see the sketch below)
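
A minimal illustration of warp divergence (the kernel is hypothetical): the if/else splits the lanes of one warp onto two paths, and the hardware runs the two paths one after the other, costing extra cycles.

// Hypothetical kernel: threads in the same warp take different branches,
// so the even-lane path and the odd-lane path execute sequentially.
__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;   // half of the warp runs this path...
    else
        data[i] += 1.0f;   // ...then the other half runs this one
}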

Operand scoreboarding is used to prevent hazards: an instruction becomes ready only after the values it needs
have been deposited.
# Cache and shared memory differ between GPU models and architectures.
