Advanced Computer Architecture

— Part I: General Purpose Simultaneous Multithreading

Sources of Parallelism

Bit-level
  Wider processor datapaths (8, 16, 32, 64 bits, …)

Word-level (SIMD)
  Vector processors; multimedia instruction sets (Intel’s MMX and SSE, Sun’s VIS, etc.)

Instruction-level
  Pipelining, superscalar, VLIW, and EPIC

Task- and application-levels
  Explicit parallel programming; multiple threads; multiple applications, …

Paolo.Ienne@epfl.ch EPFL – I&C – LAP

AdvCompArch — Simultaneous Multithreading

© Ienne 2003-04

Simple Sequential Processor vs. Pipelined Processor

[Figure: issue slots over cycles. The sequential processor completes one operation at a time; the pipelined processor overlaps operations op 1–op 6 (ending in a branch) across successive cycles.]

Superscalar Processor vs. OOO Superscalar Processor

[Figure: functional units vs. cycles. The in-order superscalar issues several operations per cycle but stalls on dependences; the out-of-order superscalar fills more issue slots by scheduling later independent operations (e.g., op 7, op 11) around stalled ones.]

Speculative Execution and Limits of ILP

[Figure: functional units vs. cycles. Speculative execution lets operations beyond the branch (op 7?, op 8?, op 9?, op 10?) issue before the branch resolves; even so, many issue slots remain empty—the ILP available in a single thread is limited.]

Sources of Unused Issue Slots — Horizontal and Vertical Waste

[Figure: issue-slot grid, functional units vs. cycles. Completely empty cycles are vertical waste (61%); empty slots in partially filled cycles are horizontal waste (39%). Source: Tullsen et al., © IEEE 1995]

Multithreading: The Idea
Rather than enlarging the depth of the instruction window (more speculation, with decreasing confidence!), enlarge its “width”: fetch from multiple threads!

[Figure: a deep speculative instruction window crosses many branches into the “future” of a single thread; a multithreaded issue window instead spans several independent threads.]

Basic Needs of a Multithreaded Processor
The processor must be aware of several independent states, one per thread:
  Program Counter
  Register File (and Flags)
  (Memory)
Either multiple resources in the processor or a fast way to switch across states

Thread Scheduling
When does one switch threads? Which thread runs next? Simple interleaving options:

Cycle-by-cycle multithreading
  Round-robin selection among a set of threads

Block multithreading
  Keep executing a thread until something happens:
    A long-latency instruction is found
    Some indication of scheduling difficulties
    A maximum number of cycles per thread has been executed

Cycle-by-Cycle Interleaving (or Fine-Grain) Multithreading
[Figure: functional units vs. cycles. Each cycle issues from a different thread in round-robin order, with a context switch at every cycle boundary.]
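The round-robin selection of fine-grain multithreading can be sketched in a few lines (an illustrative software model with invented names—real hardware does this selection in the fetch stage):

```python
def fine_grain_schedule(num_threads, cycles):
    """Cycle-by-cycle interleaving: strict round-robin, one thread per cycle.
    Returns, for each cycle, (thread id, that thread's instruction index)."""
    counters = [0] * num_threads           # per-thread program counters
    schedule = []
    for cycle in range(cycles):
        t = cycle % num_threads            # round-robin selection
        schedule.append((t, counters[t]))
        counters[t] += 1
    return schedule
```

Note how consecutive instructions of the same thread end up `num_threads` cycles apart—this is why, with at least as many threads as pipeline stages, forwarding paths become unnecessary.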

Block Interleaving (or Coarse-Grain) Multithreading
[Figure: functional units vs. cycles. A thread runs for several consecutive cycles until a switch event occurs; a context switch then brings in the next thread.]

Fundamental Requirement
The key issue in general-purpose processors, which for many years prevented multithreaded techniques from becoming commercially relevant: it is not acceptable for single-thread performance to go down significantly—or at all.

Problems of Cycle-by-Cycle Multithreading
  Null time to switch context → multiple register files
  No need for forwarding paths if the number of threads supported exceeds the pipeline depth!
  Simple(r) hardware
  Fills short vertical waste well (other threads hide latencies ~ number of threads)
  Fills long vertical waste much less well (the stalled thread is rescheduled no matter what)
  Does not significantly reduce horizontal waste (per thread, the instruction window is not much different…)
  Significant deterioration of single-thread performance

Block Interleaving Techniques
Adapted from: Šilc et al., © Springer 1999
  Static
    Explicit switch
    Implicit switch: switch-on-load, switch-on-store, switch-on-branch
  Dynamic
    Explicit switch: conditional-switch
    Implicit switch: switch-on-miss, switch-on-use (a.k.a. lazy-switch-on-miss), switch-on-signal (interrupt, trap, …)
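Dynamic implicit switching can be sketched as follows—here switch-on-miss, under an invented trace format where each instruction is simply marked with whether it misses in the cache (a real implementation switches on the actual miss signal):

```python
def switch_on_miss(threads, cycles):
    """threads: per-thread list of instructions, each True if it misses in cache.
    Run one thread until a miss, then context-switch to the next thread."""
    pcs = [0] * len(threads)               # per-thread program counters
    current, schedule = 0, []
    for _ in range(cycles):
        if pcs[current] >= len(threads[current]):    # thread finished
            current = (current + 1) % len(threads)   # switch costs this cycle
            continue
        misses = threads[current][pcs[current]]
        schedule.append(current)
        pcs[current] += 1
        if misses:                                   # implicit switch event
            current = (current + 1) % len(threads)
    return schedule
```

With thread 0 missing on its second instruction and thread 1 never missing, the schedule runs thread 0 for two cycles and then lets thread 1 hide the miss latency.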

Problems of Block Multithreading
  Scheduling of threads is not self-evident: what happens to thread #2 if thread #1 executes perfectly well and leaves no gap?
  Explicit techniques require ISA modifications—bad…
  More time allowable for a context switch
  Fills long vertical waste very well (other threads come in)
  Fills short vertical waste poorly (if it is not sufficient to switch context)
  Hardly reduces horizontal waste at all

Simultaneous Multithreading (SMT): The Idea
[Figure: functional units vs. cycles. Instructions from several threads issue together in the same cycle, attacking both vertical and horizontal waste.]

Several Simple Scheduling Possibilities
Prioritised scheduling?
  Thread #0 schedules freely
  Thread #1 is allowed to use #0’s empty slots
  Thread #2 is allowed to use #0’s and #1’s empty slots, etc.
Fair scheduling?
  All threads compete for resources
  If several threads want the same resource, round-robin assignment

Superscalar Processor
[Figure: block diagram. A single PC feeds an Instruction Fetch & Decode Unit (multiple instructions per cycle), which dispatches over multiple buses to reservation stations in front of ALU 1, ALU 2, an FP Unit, a Branch Unit, and a Load/Store Unit; a Register File and a Commit Unit (multiple instructions per cycle, with a reorder buffer and instruction queue fed from the F&D unit and the EUs) complete the machine.]
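The prioritised policy can be sketched with a simple slot-filling model (illustrative; it assumes each thread exposes how many instructions it could issue this cycle, which is an invented abstraction of the issue queues):

```python
def prioritised_issue(ready, width):
    """ready[t] = number of issuable instructions for thread t, where the
    thread index is also its priority. Fill up to `width` issue slots,
    lower-numbered threads first; later threads get only leftover slots."""
    slots = []
    for thread, n in enumerate(ready):
        take = min(n, width - len(slots))
        slots.extend([thread] * take)
        if len(slots) == width:
            break
    return slots

# 8-wide issue: thread 0 has 3 ready, thread 1 has 2, thread 2 has 5;
# thread 2 only gets the three slots left over
```

Fair scheduling would instead interleave the per-thread lists round-robin until the slots run out.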

What Must Be Added to a Superscalar to Achieve SMT?
  Multiple program counters (= threads) and a policy for the instruction fetch unit(s) to decide which thread(s) to fetch
  Multiple or larger register file(s), with at least as many registers as logical registers for all threads
  Multiple instruction retirement (e.g., per-thread squashing)
  No changes needed in the execution path
  And also: thread-aware branch predictors (BTBs, etc.) and per-thread Return Address Stacks

Register Renaming (between fetch/decode and commit)
[Figure: a reorder buffer entry holds PC, tag, destination register, address, and value (e.g., the entry at 0x1000 0004 writing $f3 = 0xa87f b351); completed entries between head and tail commit in order to memory and the register file.]
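Because the architectural register identifier becomes (register #, thread #), the rename table simply grows one dimension. A minimal sketch under invented names (a real renamer also recycles physical registers through a free list at commit):

```python
class Renamer:
    """Maps (thread, logical register) -> physical register."""

    def __init__(self, num_threads, logical_regs, physical_regs):
        # Initially, each thread's logical registers map to distinct physicals
        self.table = {(t, r): t * logical_regs + r
                      for t in range(num_threads) for r in range(logical_regs)}
        self.free = list(range(num_threads * logical_regs, physical_regs))

    def rename_dest(self, thread, reg):
        """Allocate a fresh physical register for a new definition of `reg`."""
        phys = self.free.pop(0)
        self.table[(thread, reg)] = phys
        return phys

    def read(self, thread, reg):
        return self.table[(thread, reg)]
```

Two threads both writing logical register 1 receive different physical registers, so they never interfere—which is also why the reservation stations downstream need no thread information.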

SMT Processor as a Natural Extension of a Superscalar
[Figure: the same block diagram as the superscalar, but with several PCs feeding the Instruction Fetch & Decode Unit, per-thread (or enlarged) register file(s), and per-unit instruction queues; reservation stations and execution units are unchanged.]

Reorder Buffer Remembers the Thread of Origin
Some changes to the reorder buffer in the Commit Unit—e.g., each entry is extended with a thread tag alongside the PC, destination register, address, and value, so that commit (and squashing) can proceed per thread.
Architectural Register Identifier: Reg # + Thread #
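With a thread tag in every entry, per-thread squashing becomes a filter over the reorder buffer. A sketch with an invented entry format of (thread id, PC) pairs—the addresses echo the example entries above:

```python
from collections import deque

def squash_thread(rob, bad_thread):
    """rob: deque of (thread_id, pc) entries in program order.
    Remove only the squashed thread's entries; the other threads'
    entries keep their relative order and can still commit."""
    return deque(e for e in rob if e[0] != bad_thread)

rob = deque([(2, 0x1000_0004), (2, 0x1000_0008),
             (1, 0x2001_1234), (2, 0x1000_000C)])
# squashing thread 2 leaves only thread 1's entry
```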

Reservation Stations
Reservation stations do not need to know which thread an instruction belongs to. Remember: operand sources are renamed—physical registers, tags, etc.

[Figure: reservation-station entries hold only op, tags, and argument values in front of an ALU; instructions from thread #1 and thread #2 sit side by side with no thread field.]

Does It Work?!
Source: Tullsen et al., © IEEE 1996

Main Results on Implementability: SMT vs. Superscalar
From [TullsenJun96]:
  Instruction scheduling is not more complex
  Register file datapaths are not more complex (but a much larger register file!)
  Instruction fetch throughput is attainable even without more fetch bandwidth
  Unmodified caches and branch predictors are appropriate for SMT too
  SMT achieves better results than an aggressive superscalar

Where to Fetch?
Static solutions: round-robin
  Each cycle, 8 instructions from 1 thread
  Each cycle, 4 instructions from 2 threads; 2 from 4 threads; …
  Each cycle, 8 instructions from 2 threads: forward as many as possible from #1, then, when a long-latency instruction appears in #1, pick the rest from #2
Dynamic solutions: check the execution queues!
  Favour threads with the fewest in-flight branches
  Favour threads with the fewest outstanding misses
  Favour threads with the fewest in-flight instructions
  Favour threads with instructions far from the queue head
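The third dynamic heuristic—favour threads with the fewest in-flight instructions—is the ICOUNT policy of [TullsenJun96]. The selection step can be sketched as follows (illustrative; the per-thread counters are an invented representation of the decode/rename/issue queue occupancy):

```python
def icount_pick(in_flight, num_pick=2):
    """Pick the `num_pick` threads with the fewest instructions in the
    pre-execution queues; ties broken by lower thread id."""
    order = sorted(range(len(in_flight)), key=lambda t: (in_flight[t], t))
    return order[:num_pick]

# in_flight = [12, 3, 7, 3]: fetch from threads 1 and 3 this cycle
```

The intuition: a thread with few instructions in flight is moving through the machine quickly and is the least likely to clog the issue queues.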

What to Issue?
Not exactly the same question as in superscalars…
  In a superscalar, oldest is best (least speculation, more dependent instructions waiting, etc.)
  In SMT it is not so clear: branch-speculation level and optimism (cache-hit speculation) vary across threads
One can think of many selection strategies:
  Oldest first
  Cache-hit speculated last
  Branch speculated last
  Branches first…
Important result: it doesn’t matter too much! Issue logic (critical in superscalars) can be left alone

Importance of Accurate Branch Prediction in SMT vs. Superscalar
Reducing the impact of branch prediction was one of the initial qualitative motivations. Results from [TullsenJun96]:
  Advantage of perfect branch prediction: 25% at 1 thread, 15% at 4 threads, 9% at 8 threads
  Losses from suppressing speculative execution: −38% at 1 thread (speculation was a good idea…), −7% at 8 threads

Bottlenecks: Sources of Unused Issue Slots
Source: Šilc et al., © Springer 1999
  Fetch and memory throughput are still bottlenecks (fetch: branches, etc.; memory: not addressed)
  Completion queue not very relevant (remember: this is out of the execution path…)
  Rename register count is important
  Most critical: number of register write-back ports
  SMT improves utilisation rate (of the EUs), not bandwidth!…

Bottlenecks (cont.)
Source: Tullsen et al., © IEEE 1996
  Performance vs. number of rename registers (8 threads), in addition to the architectural ones: infinite: +2%; 100: reference; 90: −1%; 80: −3%; 70: −6%
  IPC vs. number of threads, measured with 200 physical registers
  Register file access time is the likely limit to the number of threads

Performance (IPC) per Unit Cost
Source: Šilc et al., © Springer 1999
  Superscalars are cost-effective only for relatively small issue bandwidths; efficiency then drops quickly
  SMT improves the picture significantly already with 2 threads, and the optimum moves to larger issue bandwidths with more threads

SMT in Commercial Processors Coming…
Compaq Alpha 21464 (EV8)
  4T SMT; project killed June 2001
Intel Pentium IV (Xeon)
  2T SMT; availability in 2002 (already there before, but not enabled); 10–30% gains expected
SUN Ultra III
  2-core CMP, 4T SMT

Intel SMT: Xeon Hyper-Threading Pipeline
Source: Marr et al., © Intel 2002
[Figure: the pipeline mixes duplicated resources (e.g., per-thread PCs), freely shared resources, statically split resources, and time-shared resources, across the front-end (trace-cache hit or miss paths) and the OOO execution core.]

Intel SMT: Xeon Hyper-Threading Switching Off Threads
What happens when there is only one thread? What does the OS do when there is nothing to do? Ahem…
Four modes: Low-Power, ST0, ST1, and MT
[Figure: state diagram. Halting one logical processor moves MT to a single-thread mode (ST0 or ST1); an interrupt for the halted thread moves back to MT; halting both threads enters Low-Power Mode. Source: Marr et al., © Intel 2002]

Intel SMT: Xeon Hyper-Threading Goals and Results
Source: Marr et al., © Intel 2002
  Minimum additional cost: SMT ≈ 5% area
  No impact on single-thread performance (partitioned resources are recombined in single-thread mode)
  Fair behaviour with 2 threads

SMT Is Definitely Coming… And Now, What’s Next?
Key ingredients for success so far:
  Maximise compatibility: no info from programmers beyond straight sequential code and coarse threads
  Aggressive prediction and speculation of anything predictable
  Use irregular, fine-grained parallelism (ILP): it is “easier” to extract, can be done at runtime, …
Problems:
  Branch prediction accuracy is hard to improve
  Hard to exploit ILP any further within a thread

An Example of Research Directions
Dynamic Multithreading (DMT) [AkkaryNov98]
  Automatic generation of threads from loops (backward branches) and procedure calls
  Execution on a “typical” SMT microarchitecture
  Speculative execution of threads, with inter-thread data dependence and value speculation
Source: Akkary et al., © IEEE 1998

What does Dynamic Multithreading represent?
  Introduces, from a single program, not only out-of-order issue but now also out-of-order fetch
  Lowers the dependence on correct branch prediction
  Experiments show good advantages even without more fetch bandwidth

References
  PA, Sections 6.3, 6.4, and 6.5
  CAR, Chapter 5—Introduction
  D. M. Tullsen et al., “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” ISCA, 1996
  D. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture,” Intel Technology Journal, Q1, 2002
  H. Akkary and M. A. Driscoll, “A Dynamic Multithreaded Processor,” MICRO, 1998
