CIS 501 Computer Architecture
• Why multithreading (MT)?
  • Utilization vs. performance
• Three implementations
  • Coarse-grained MT
  • Fine-grained MT
  • Simultaneous MT (SMT)
[Figure: layered system stack: Application, OS, Compiler, Firmware, CPU, I/O, Memory, Digital Circuits]
• Paper
  • Tullsen et al., “Exploiting Choice…”

• Even moderate superscalars (e.g., 4-way) are not fully utilized
  • Average sustained IPC: 1.5–2 → ≤ 50% utilization
    • Mispredicted branches
    • Cache misses, especially L2
    • Data dependences
• Multithreading (MT)
  • Improve utilization by multiplexing multiple threads on a single CPU
  • One thread cannot fully utilize the CPU? Maybe 2, 4 (or 100) can
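The utilization figure is sustained IPC divided by peak issue width. A minimal Python sketch of that arithmetic, plugging in the slide's 4-way core and IPC of 1.5–2 (the function name `utilization` is illustrative, not from the course):

```python
def utilization(sustained_ipc: float, issue_width: int) -> float:
    """Fraction of peak issue slots actually used: sustained IPC / issue width."""
    if issue_width <= 0:
        raise ValueError("issue_width must be positive")
    return sustained_ipc / issue_width


if __name__ == "__main__":
    width = 4  # the "moderate" 4-way superscalar from the slide
    for ipc in (1.5, 2.0):
        # 1.5 / 4 = 37.5% and 2.0 / 4 = 50%: at most half the slots do useful work
        print(f"IPC {ipc} on a {width}-way core: {utilization(ipc, width):.1%} of issue slots used")
```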
[Figure: timeline of one thread stalled on two cache misses; with multithreading, the stall cycles are filled in with instructions from another thread]
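The fill-in idea can be made concrete with a toy cycle count. This is a hypothetical sketch, not the course's model: a single-issue in-order core, per-instruction latencies (1 = cache hit, larger = cache miss), every instruction depending on the previous one in its thread, and a policy that keeps issuing from the current thread and switches only when that thread is stalled.

```python
from typing import List, Tuple


def run(threads: List[List[int]]) -> Tuple[int, int]:
    """Toy single-issue core. Returns (cycles until last issue, busy cycles).

    threads[t] lists the latency of each instruction in thread t
    (1 = cache hit, larger = cache miss). Assumption: every instruction
    depends on the one before it in the same thread, so a thread may not
    issue again until its previous instruction completes.
    """
    n = len(threads)
    next_insn = [0] * n  # index of each thread's next instruction
    ready_at = [0] * n   # first cycle each thread may issue again
    cycle = busy = current = 0
    while any(next_insn[t] < len(threads[t]) for t in range(n)):
        # Try the current thread first; if it is stalled (or finished),
        # fill the slot with the next ready thread in round-robin order.
        for k in range(n):
            t = (current + k) % n
            if next_insn[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + threads[t][next_insn[t]]
                next_insn[t] += 1
                current = t
                busy += 1
                break
        cycle += 1  # a cycle with no issue is an idle (wasted) slot
    return cycle, busy


if __name__ == "__main__":
    miss_thread = [1, 1, 5, 1, 1]  # third instruction misses in the cache
    print(run([miss_thread]))               # one thread alone
    print(run([miss_thread, miss_thread]))  # two threads share the core
```

With one such thread, the core takes 9 cycles to issue 5 instructions (4 idle cycles during the miss). Running two of them together issues all 10 instructions in 12 cycles, versus 18 cycles back to back.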
• Choice depends on
  • What kind of latencies (specifically, how long) you want to tolerate
  • How much single-thread performance you are willing to sacrifice
• Three designs
  • Coarse-grain multithreading (CGMT)
  • Fine-grain multithreading (FGMT)
  • Simultaneous multithreading (SMT)
[Figure: issue-slot usage over time for Superscalar, CGMT, FGMT, and SMT; the CGMT panel switches threads on an "L2 miss?"]
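The four-way comparison in that figure can be approximated by counting filled issue slots. In this hypothetical sketch, `trace[t][c]` is how many independent instructions thread t could issue in cycle c (0 models a stall such as an L2 miss). Two simplifications are assumed: the trace is fixed per cycle, ignoring that a descheduled thread makes no progress, and CGMT's only switch cost is the stall cycle itself.

```python
from typing import List

# trace[t][c] = independent insns thread t could issue in cycle c (made-up data);
# 0 models a stall such as an L2 miss.
Trace = List[List[int]]


def superscalar(trace: Trace, width: int) -> int:
    # One thread only: leftover slots in a cycle are wasted.
    return sum(min(width, ready) for ready in trace[0])


def cgmt(trace: Trace, width: int) -> int:
    # Stay on one thread until it stalls, then switch; the stall cycle is lost.
    filled, t = 0, 0
    for c in range(len(trace[0])):
        ready = trace[t][c]
        if ready == 0:
            t = (t + 1) % len(trace)  # switch threads, effective next cycle
        filled += min(width, ready)
    return filled


def fgmt(trace: Trace, width: int) -> int:
    # Round-robin: each whole cycle belongs to one thread.
    n = len(trace)
    return sum(min(width, trace[c % n][c]) for c in range(len(trace[0])))


def smt(trace: Trace, width: int) -> int:
    # Any thread may fill any slot in the same cycle.
    return sum(min(width, sum(th[c] for th in trace)) for c in range(len(trace[0])))


if __name__ == "__main__":
    trace = [
        [2, 1, 0, 0, 0, 0, 3, 1],  # thread 0: long stall (e.g., an L2 miss)
        [1, 3, 2, 2, 1, 0, 2, 2],  # thread 1
    ]
    width = 4
    for policy in (superscalar, cgmt, fgmt, smt):
        print(policy.__name__, policy(trace, width), "of", width * len(trace[0]), "slots")
```

For this made-up trace on a 4-wide core (32 slots), the sketch fills 7 slots (superscalar), 10 (CGMT), 12 (FGMT), and 19 (SMT). CGMT and FGMT only recover whole idle cycles; SMT also fills leftover slots within a cycle.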
CIS 501 (Martin/Roth): Multithreading
Fine-Grain Multithreading (FGMT)
• Fine-Grain Multithreading (FGMT)
  – Sacrifices significant single-thread performance
  + Tolerates latencies (e.g., L2 misses, mispredicted branches, etc.)
  • Thread scheduling policy
    • Switch threads every cycle (round-robin), L2 miss or no
  • Pipeline partitioning
    • Dynamic, no flushing
  • Length of pipeline doesn’t matter so much
  – Need a lot of threads
• Extreme example: Denelcor HEP
  • So many threads (100+), it didn’t even need caches
  • Failed commercially
• Not popular today
  • Many threads → many register files

Fine-Grain Multithreading
• FGMT
  • Multiple threads in pipeline at once
  • (Many) more threads
[Figure: pipeline in which a thread scheduler selects among per-thread register files (regfile); I$, D$, and BP are shared]
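Why FGMT needs many threads can be seen with a small model of the round-robin policy. Hypothetical assumptions: a single-issue pipeline, every thread an endless chain of dependent instructions that each take `latency` cycles (roughly, the pipeline depth), and a scheduler that gives cycle c to thread c mod N whether or not that thread is ready.

```python
def fgmt_utilization(num_threads: int, latency: int, cycles: int) -> float:
    """Fraction of cycles in which the round-robin scheduler issues an instruction."""
    ready_at = [0] * num_threads  # first cycle each thread may issue again
    issued = 0
    for c in range(cycles):
        t = c % num_threads  # pure round-robin: thread t owns this cycle
        if ready_at[t] <= c:
            ready_at[t] = c + latency  # the next insn depends on this one
            issued += 1
        # otherwise the owning thread is still waiting and the slot is wasted
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 8):
        print(n, "threads:", fgmt_utilization(n, latency=4, cycles=1200))
```

With a 4-cycle latency, utilization is 0.25, 0.5, 0.5, 1.0, and 1.0 for 1, 2, 3, 4, and 8 threads. Once there are at least as many threads as cycles of latency, every slot is filled, so pipeline length mostly changes how many threads (and register files) are needed.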
• …out-of-order execution
• Simultaneous multithreading (SMT): OOO + FGMT
  • Aka “hyper-threading”
• Observation: once insns are renamed, the scheduler doesn’t care which thread they come from (well, for non-loads at least)
• Some examples
  • IBM Power5: 4-way issue, 2 threads
  • Intel Pentium 4: 3-way issue, 2 threads

• SMT
  • Replicate map table, share (larger) physical register file
[Figure: pipeline diagrams with I$, D$, and BP; the SMT version adds a thread scheduler and per-thread map tables in front of a shared register file]
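A minimal sketch of “replicate map table, share (larger) physical register file”, assuming a simple free-list renamer. The class and method names are hypothetical, and freeing old registers at commit is not modeled. The shared file must be larger because it holds every thread's architectural state plus free registers for in-flight instructions.

```python
from collections import deque
from typing import Dict, List, Tuple


class SMTRenamer:
    """Toy SMT rename stage: one map table per thread, one shared free list."""

    def __init__(self, num_threads: int, num_arch_regs: int, num_phys_regs: int):
        committed = num_threads * num_arch_regs  # every thread's architectural state
        if num_phys_regs <= committed:
            raise ValueError("shared register file must exceed all threads' architectural state")
        # Replicated map tables: thread t's r initially maps to p(t * num_arch_regs + r).
        self.map_tables: List[Dict[int, int]] = [
            {r: t * num_arch_regs + r for r in range(num_arch_regs)}
            for t in range(num_threads)
        ]
        # Shared pool of free physical registers, used by all threads.
        self.free_list = deque(range(committed, num_phys_regs))

    def rename(self, thread: int, dest: int, srcs: List[int]) -> Tuple[int, List[int], int]:
        """Rename one instruction; returns (new dest preg, source pregs, old dest preg)."""
        table = self.map_tables[thread]
        phys_srcs = [table[s] for s in srcs]  # read sources before remapping dest
        if not self.free_list:
            raise RuntimeError("no free physical registers: rename must stall")
        new_dest = self.free_list.popleft()
        old_dest = table[dest]  # would be freed when this insn commits (not modeled)
        table[dest] = new_dest
        return new_dest, phys_srcs, old_dest


if __name__ == "__main__":
    rn = SMTRenamer(num_threads=2, num_arch_regs=4, num_phys_regs=16)
    print(rn.rename(thread=0, dest=1, srcs=[2, 3]))  # thread 0: r1 <- r2 op r3
    print(rn.rename(thread=1, dest=1, srcs=[1]))     # thread 1: r1 <- op r1
```

Both threads write architectural r1, yet they receive different physical registers (8 and 9 here). After renaming, their instructions no longer share register names, which is why the scheduler can mix them freely.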