Chapter 6
Future Processors to use Coarse-Grain Parallelism
Future processors to use coarse-grain parallelism
Chip multiprocessors (CMPs) or multiprocessor chips
integrate two or more complete processors on a single chip;
every functional unit of a processor is duplicated.

Simultaneous multithreaded processors (SMTs)
store multiple contexts in different register sets on the chip;
the functional units are multiplexed between the threads;
instructions of different contexts are executed simultaneously.

Principal chip multiprocessor alternatives
Symmetric multiprocessor (SMP)

Distributed shared memory multiprocessor (DSM)

Message-passing shared-nothing multiprocessor
Organizational principles of multiprocessors
[Figure: three alternatives — the (SMP) symmetric multiprocessor: processors connected via an interconnection to a shared global memory; the (DSM) distributed-shared-memory multiprocessor: a shared address space over physically distributed local memories; the message-passing (shared-nothing) multiprocessor: distributed address spaces with per-processor local memories, communication by send/receive.]
Typical SMP
[Figure: four processors, each with a primary cache and a secondary cache, connected by a bus to a global memory.]
Shared memory candidates for CMPs
[Figure: shared-main memory (per-processor primary and secondary caches in front of a shared global memory) and shared-secondary cache (per-processor primary caches, one shared secondary cache in front of the global memory).]
Shared memory candidates for CMPs
[Figure: shared-primary cache — all processors share one primary cache, the secondary cache, and the global memory.]
Grain-levels for CMPs
multiple processes in parallel

multiple threads from a single application
implies a common address space for all threads (see the sketch below)

extracting threads of control dynamically from a single instruction stream
see last chapter, multiscalar, trace processors, ...
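As a minimal illustration of the second grain level (threads of one application sharing a common address space), the following sketch uses standard C++ threads; it is not part of the slides and only shows that both threads see the same global variable, which a CMP running them on different on-chip CPUs must keep coherent.

    // Minimal sketch: threads of one application share a single address space.
    #include <atomic>
    #include <iostream>
    #include <thread>

    std::atomic<long> shared_counter{0};   // one copy, visible to all threads

    void worker(long iterations) {
        for (long i = 0; i < iterations; ++i)
            shared_counter.fetch_add(1, std::memory_order_relaxed);
    }

    int main() {
        std::thread t0(worker, 1000000);
        std::thread t1(worker, 1000000);   // same address space as t0
        t0.join();
        t1.join();
        std::cout << "counter = " << shared_counter << "\n";  // 2000000
    }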
Texas Instruments TMS320C80 Multimedia Video Processor
[Figure: block diagram — four Advanced DSPs (DSP0–DSP3) and a master processor (MP) with FPU, each with local (L), global (G) and instruction (I) ports; per-processor parameter RAM, data RAMs 0–2 and instruction cache; a transfer controller (TC), video controller (VC) and TAP; 32- and 64-bit crossbar connections between the processors and the on-chip RAMs.]
Hydra: A single-chip multiprocessor
[Figure: a single chip with four CPUs (CPU 0–3), each with a primary I-cache, a primary D-cache, and its own memory controller, connected through centralized bus arbitration mechanisms to an on-chip secondary cache, a Rambus memory interface, an off-chip L3 interface, an I/O bus interface, and DMA; off chip: cache SRAM array, DRAM main memory, and I/O devices.]
Conclusions on CMP
Usually, a CMP will feature:
separate L1 I-cache and D-cache per on-chip CPU
and an optional unified L2 cache.

If the CPUs always execute threads of the same process, the L2 cache
organization will be simplified, because different processes do not have to be
distinguished.

Recently announced commercial processors with CMP hardware:
IBM Power4 processor with two processors on a single die
Sun MAJC5200 with two processors on a die (each processor a 4-threaded
block-interleaving VLIW)

Multithreaded processors
Aim: Latency tolerance
What is the problem?

Load access latencies measured on an Alpha Server 4100 SMP with four 300
MHz Alpha 21164 processors are:
7 cycles for a primary-cache miss that hits in the on-chip L2 cache of the
21164 processor,
21 cycles for an L2 cache miss that hits in the L3 (board-level) cache,
80 cycles for a miss that is served by the memory, and
125 cycles for a dirty miss, i.e., a miss that has to be served from another
processor's cache memory.

Multithreaded processors are able to bridge latencies by switching to another
thread of control - in contrast to chip multiprocessors.
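A rough analytical sketch (an illustrative model, not from the slides): if a thread executes R cycles between long-latency events of L cycles each, a multithreaded processor with N threads reaches a utilization of roughly

    utilization ≈ min(1, N * R / (R + L))

With the latencies above, a thread that misses to memory every R = 40 cycles (L = 80 cycles) keeps a single-threaded processor only 40/120 ≈ 33% busy, while N = 3 such threads already suffice to hide the memory latency almost completely (3 * 40 / 120 = 1).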
Multithreaded processors
Multithreading:
Provide several program counters (and usually several register sets) on chip
Fast context switching by switching to another thread of control
[Figure: four register sets, each with its own program counter (PC) and processor status register (PSR), holding the contexts of threads 1 to 4.]
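A minimal data-structure sketch (hypothetical C++, not any real processor's layout; 4 contexts and 32 registers are assumptions) of what several program counters and register sets on chip mean: a context switch is just a change of the index that selects the active context, which is why it costs little or nothing.

    // Sketch of the replicated per-thread state of a multithreaded processor.
    #include <array>
    #include <cstdint>

    struct ThreadContext {
        uint64_t pc;                        // program counter
        uint64_t psr;                       // processor status register
        std::array<uint64_t, 32> regs;      // general-purpose register set
    };

    struct MultithreadedCore {
        std::array<ThreadContext, 4> contexts;  // e.g. 4 hardware threads
        unsigned active = 0;                    // index of the running thread

        // "Context switch" = select another register set; no state is saved
        // to or restored from memory, hence (almost) zero cycles of overhead.
        void switch_to(unsigned thread_id) { active = thread_id; }
    };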
Approaches of multithreaded processors
Cycle-by-cycle interleaving
An instruction of another thread is fetched and fed into the execution
pipeline at each processor cycle.
Block-interleaving
The instructions of a thread are executed successively until an event
occurs that may cause latency. This event induces a context switch.
Simultaneous multithreading
Instructions are simultaneously issued from multiple threads to the FUs of
a superscalar processor.
Combines wide superscalar instruction issue with multithreading.
Comparison of multithreading with non-multithreading approaches:
[Figure: (a) single-threaded scalar, (b) cycle-by-cycle interleaving multithreaded scalar, (c) block interleaving multithreaded scalar]
Comparison of multithreading with non-multithreading approaches:
[Figure: (a) superscalar, (b) VLIW, (c) cycle-by-cycle interleaving, (d) cycle-by-cycle interleaving VLIW]
Comparison of multithreading with non-multithreading:
[Figure: simultaneous multithreading (SMT) vs. chip multiprocessor (CMP)]

Cycle-by-cycle interleaving
the processor switches to a different thread after each instruction fetch
pipeline hazards cannot arise and the processor pipeline can be easily built
without the necessity of complex forwarding paths
context-switching overhead is zero cycles
memory latency is tolerated by not scheduling a thread until the memory
transaction has completed
requires at least as many threads as pipeline stages in the processor
single-thread performance is degraded if not enough threads are present
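A sketch (hypothetical, in C++) of the thread-selection step of a cycle-by-cycle interleaving ("barrel") processor: each cycle an instruction of a different ready thread enters the pipeline, so two instructions of the same thread are never in adjacent stages, and a thread waiting for memory is simply not scheduled.

    // Round-robin thread selection of a cycle-by-cycle interleaving processor.
    #include <array>
    #include <cstdint>
    #include <optional>

    constexpr unsigned kThreads = 8;        // at least as many as pipeline stages

    struct ThreadState {
        uint64_t pc = 0;
        bool waiting_for_memory = false;    // stalled threads are skipped
    };

    std::array<ThreadState, kThreads> threads;
    unsigned last = kThreads - 1;

    // Called once per cycle: pick the next ready thread and fetch from its PC.
    std::optional<unsigned> select_thread() {
        for (unsigned i = 1; i <= kThreads; ++i) {
            unsigned t = (last + i) % kThreads;
            if (!threads[t].waiting_for_memory) {  // memory latency is tolerated
                last = t;                          // by not scheduling the
                return t;                          // waiting thread
            }
        }
        return std::nullopt;                       // all threads stalled
    }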
Cycle-by-cycle interleaving
- Improving single-thread performance
The dependence look-ahead technique adds several bits to each instruction
format in the ISA.
The scheduler then feeds instructions of the same thread that are neither data
nor control dependent successively into the pipeline (see the sketch at the end
of this slide).

The interleaving technique proposed by Laudon et al. adds caching and full
pipeline interlocks to the cycle-by-cycle interleaving approach.
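A sketch of how a dependence look-ahead field could be used; the 3-bit encoding and field names are hypothetical, the slides only state that several bits are added to each instruction format. The compiler records in each instruction how many immediately following instructions of the same thread are independent of it, and the scheduler may feed that many same-thread instructions into the pipeline back to back.

    // Hypothetical 3-bit look-ahead field in the instruction word.
    #include <cstdint>

    struct Instruction {
        uint32_t opcode_and_operands;
        uint8_t  lookahead;   // the next 'lookahead' instructions of this
                              // thread (0..7) are neither data nor control
                              // dependent on this one
    };

    // Scheduler: issue from the same thread while the look-ahead allows it,
    // otherwise interleave with instructions of other threads.
    unsigned issuable_from_same_thread(const Instruction& i) {
        return 1u + i.lookahead;
    }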
Tera MTA
cycle-by-cycle interleaving technique
employs the dependence-look-ahead technique
VLIW ISA (3-issue)
The processor switches context every cycle (3 ns cycle period) among as many
as 128 distinct threads, thereby hiding up to 128 cycles (384 ns) of memory
latency.

128 register sets
Tera processing element
[Figure: register state for 128 threads; instruction fetch and issue logic with an I-cache; an A- and C-operation execution pipeline of roughly 16 ticks; an M-unit handling memory operations; asynchronous memory access through the network with an average latency of about 70 ticks.]
Tera MTA
[Figure: system structure — up to 256 computational processors (CP), up to 256 I/O processors (IOP), up to 512 memories (MU) and up to 512 I/O caches (IOC), connected by a 3D toroidal interconnection network.]
Block interleaving
Executes a single thread until it reaches a situation that triggers a context
switch.
Typical switching event: the instruction execution reaches a long-latency
operation or a situation where a latency may arise (see the sketch below).
Compared to the cycle-by-cycle interleaving technique, a smaller number of
threads is needed
A single thread can execute at full speed until the next context switch.
Single thread performance is similar to the performance of a comparable
processor without multithreading.
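A control-flow sketch (hypothetical C++; the helper functions are toy stand-ins) of the dynamic switch-on-cache-miss variant classified on a later slide: the running thread keeps the pipeline to itself until one of its loads misses, and only then is another thread scheduled.

    // Switch-on-cache-miss block interleaving: one thread runs at full speed
    // until a load misses in the cache; the miss triggers the context switch.
    // (A real design would also mark the thread ready again once its data
    // returns from memory; that part is omitted here.)
    #include <array>
    #include <cstdint>

    struct Thread {
        uint64_t pc = 0;
        bool     ready = true;            // false while waiting for memory
    };

    std::array<Thread, 2> threads;        // e.g. a two-threaded processor
    unsigned current = 0;

    bool is_load(uint64_t pc)   { return pc % 16 == 0; }   // toy stand-ins for
    bool cache_hit(uint64_t pc) { return pc % 64 != 0; }   // decode and cache

    void step_one_cycle() {
        Thread& t = threads[current];
        if (is_load(t.pc) && !cache_hit(t.pc)) {
            t.ready = false;              // thread now waits for memory
            current ^= 1;                 // switch-on-cache-miss context switch
            return;
        }
        t.pc += 4;                        // otherwise execute at full speed
    }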

IBM NorthStar processors are two-threaded 64-bit PowerPCs with switch-
on-cache-miss; implemented in departmental computers (eServers) of IBM
since 10/98 (revealed at MTEAC-4, Dec. 2000)
Recent announcement (Oct. 1999): Sun MAJC5200 with two processors on a die,
each processor a 4-threaded block-interleaving VLIW
Interleaving techniques
multithreading
    cycle-by-cycle interleaving
    block interleaving
        static
            explicit-switch
            implicit-switch (switch-on-load, switch-on-store, switch-on-branch, ...)
        dynamic
            switch-on-cache-miss
            switch-on-signal (interrupt, trap, ...)
            switch-on-use (lazy-cache-miss)
            conditional-switch (explicit with condition)
Rhamma
Komodo-microcontroller
Develop multithreaded embedded real-time Java-microcontroller
Java processor core
bytecode as machine language, portability across all platforms
dense machine code, important for embedded applications
fast byte code execution in hardware, microcode and traps
Interrupts activate interrupt service threads (ISTs) instead of interrupt
service routines (ISRs)
extremely fast context switch
no blocking of interrupt services
Switch-on-signal technique enhanced to very fine-grain switching
due to hardware-implemented real-time scheduling algorithms (FPP, EDF,
LLF, guaranteed percentage)
hard real-time requirements fulfilled

For more information see:
http://goethe.ira.uka.de/~jkreuzin/komodo/komodoEng.html
Komodo - microcontroller
Nanothreading and microthreading
- multithreading in same register set
Nanothreading (DanSoft processor) dismisses full multithreading for a
nanothread that executes in the same register set as the main thread.
The nanothread requires only a 9-bit PC and some simple control logic, and it
resides in the same page as the main thread.
Whenever the processor stalls on the main thread, it automatically begins
fetching instructions from the nanothread.

The microthreading technique (Bolychevsky et al. 1996) is similar to
nanothreading.
All threads share the same register set and the same run-time stack.
However, the number of threads is not restricted to two.
Simultaneous multithreading (SMT)
The SMT approach combines a wide superscalar instruction issue with the
multithreading approach
by providing several register sets on the processor
and issuing instructions from several instruction queues simultaneously.

The issue slots of a wide issue processor can be filled by operations of several
threads.

Latencies occurring in the execution of single threads are bridged by issuing
operations of the remaining threads loaded on the processor.
Simultaneous multithreading (SMT)
- Hardware organization (1)
SMT processors can be organized in two ways:

First: Instructions of different threads share all buffer resources in an extended
superscalar pipeline
Thus SMT adds minimal hardware complexity to conventional superscalars,
hardware designers can focus on building a fast single-threaded superscalar
and add multithread capability on top.
The complexity added to a superscalar by multithreading consists of a thread tag
for each internal instruction representation, multiple register sets, and the ability
of the fetch and retire units to fetch and retire instructions of different threads.
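A data-structure sketch of this first organization (hypothetical C++; field names and sizes are assumptions): all threads share one instruction window, and the main additions are a thread tag on every in-flight instruction plus one rename map and register set per thread.

    // First SMT organization: shared buffers, every in-flight instruction
    // carries a thread tag so that rename, issue and retire can keep the
    // threads apart inside an otherwise conventional superscalar pipeline.
    #include <array>
    #include <cstdint>
    #include <vector>

    constexpr unsigned kThreads = 8;

    struct InFlightInstr {
        uint8_t  thread_tag;        // added for multithreading
        uint16_t dest_phys_reg;     // physical register from the shared pool
        bool     completed = false;
    };

    struct SharedWindow {
        std::vector<InFlightInstr> entries;                      // one window for all threads
        std::array<std::array<uint16_t, 32>, kThreads> rename_map;  // one map per thread
        // Retire uses the thread tag to retire each thread's instructions in
        // order; fetch works analogously and may fetch from several threads
        // per cycle.
    };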
Simultaneous multithreading (SMT)
- Hardware organization (2)
Second: Replicate all internal buffers of a superscalar such that each buffer is
bound to a specific thread.
The issue unit is able to issue instructions of different instruction windows
simultaneously to the FUs.
Adds more changes to the superscalar processor organization,
but leads to a natural partitioning of the instruction window (similar to CMP)
and simplifies the issue and retire stages.

Simultaneous multithreading (SMT)
SMT fetch unit can take advantage of the interthread competition for instruction
bandwidth in two ways:
First, it can partition fetch bandwidth among the threads and fetch from
several threads each cycle.
Goal: increasing the probability of fetching only non-speculative
instructions.
Second, the fetch unit can be selective about which threads it fetches.
The main drawback to simultaneous multithreading may be that it complicates
the instruction issue stage, which is shared among all threads.
A functional partitioning as demanded for processors of the 10^9-transistor era
is therefore not easily reached.
No simultaneous multithreaded processors exist to date. Only simulations.
General opinion: SMT will be in next generation microprocessors.
Announcement (Oct. 1999): Compaq Alpha 21464 (EV8) will be four-threaded
SMT

SMT at the Universities of Washington and San Diego
Hypothetical out-of-order issue superscalar microprocessor that resembles
MIPS R10000 and HP PA-8000.
8 threads and 8-issue superscalar organization are assumed.
Eight instructions are decoded, renamed and fed to either the integer or
floating-point instruction window.
Unified buffers are used
When operands become available, up to 8 instructions are issued out-of-order
per cycle, executed and retired.
Each thread can address 32 architectural integer (and floating-point) registers.
These registers are renamed to a large physical register file of 356 physical
registers.
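The size of the physical register file follows from these figures (per register type): 8 threads x 32 architectural registers = 256 registers to hold the architectural state, plus 100 additional registers for renaming, giving 256 + 100 = 356 physical registers. The split into 256 architectural plus 100 rename registers is the usual reading of this model, stated here as an assumption rather than taken from the slides.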
SMT at the Universities of Washington and San Diego
[Figure: processor organization — a fetch unit with PC and I-cache, decode and register renaming, separate floating-point and integer instruction queues, floating-point and integer register files, floating-point units, integer/load-store units, and a D-cache.]
SMT at the Universities of Washington and San Diego
- Instruction fetching schemes
Basic: Round-robin: the RR.2.8 fetching scheme, i.e., in each cycle eight
instructions are fetched from each of two different threads in round-robin
policy; superior to other schemes like RR.1.8, RR.4.2, and RR.2.4
Other fetch policies:
BRCOUNT scheme gives highest priority to those threads that are least
likely to be on a wrong path,
MISSCOUNT scheme gives priority to the threads that have the fewest
outstanding D-cache misses
IQPOSN policy gives lowest priority to the oldest instructions by penalizing
those threads with instructions closest to the head of either the integer or
the floating-point queue
ICOUNT feedback technique gives highest fetch priority to the threads with
the fewest instructions in the decode, renaming, and queue pipeline stages
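A sketch (hypothetical C++) of the ICOUNT heuristic: each cycle the fetch unit counts, per thread, the instructions still sitting in the decode, renaming and queue stages, and gives fetch priority to the threads with the smallest counts, i.e., the threads moving instructions through the front end most effectively. The other policies above would differ only in how the priority value is computed.

    // ICOUNT fetch priority: prefer threads with the fewest instructions in
    // the decode, renaming and instruction-queue stages.
    #include <algorithm>
    #include <array>

    constexpr unsigned kThreads = 8;

    // Number of instructions of each thread in the pre-issue pipeline stages.
    std::array<unsigned, kThreads> front_end_count{};

    // Return the two threads to fetch from this cycle (the ".2" in ICOUNT.2.8).
    std::array<unsigned, 2> icount_select() {
        std::array<unsigned, kThreads> order;
        for (unsigned t = 0; t < kThreads; ++t) order[t] = t;
        std::sort(order.begin(), order.end(), [](unsigned a, unsigned b) {
            return front_end_count[a] < front_end_count[b];
        });
        return {order[0], order[1]};   // fewest in-flight front-end instructions
    }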
SMT at the Universities of Washington and San Diego
- Instruction fetching schemes
The ICOUNT policy proved superior!
The ICOUNT.2.8 fetching strategy reached an IPC of about 5.4 (the RR.2.8
reached only about 4.2).
Most interesting: it is neither mispredicted branches nor blocking due to cache
misses alone, but a mix of both and perhaps some other effects, that determines
which fetching strategy works best.

Recently, simultaneous multithreading has been evaluated with
SPEC95,
database workloads,
and multimedia workloads.
All achieve roughly a 3-fold IPC increase with an eight-threaded SMT over
a single-threaded superscalar with similar resources.
SMT processor with multimedia enhancement
- Combining SMT and multimedia
Start with a wide-issue superscalar general-purpose processor
Enhance by simultaneous multithreading
Enhance by multimedia unit(s)
Utilization of subword parallelism (data-parallel instructions, SIMD)
Saturation arithmetic (see the sketch at the end of this slide)
Additional arithmetic, masking and selection, reordering and conversion
instructions
Enhance by additional features useful for multimedia processing,
e.g. on-chip RAM memory, special cache techniques
For more information see:
http://goethe.ira.uka.de/people/ungerer/smt-mm/SM-MM-processor.html
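A minimal sketch of the two multimedia ingredients named above, written in plain C++ rather than with any real SIMD instruction set; the function names are made up for illustration. A subword-parallel add treats one 32-bit word as four independent 8-bit elements, and saturating arithmetic clamps the result instead of wrapping around.

    // Subword parallelism + saturation arithmetic, expressed in plain C++
    // (a real multimedia unit would do this in a single SIMD instruction).
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>

    // Saturating add of two unsigned 8-bit pixels: 200 + 100 -> 255, not 44.
    uint8_t add_sat_u8(uint8_t a, uint8_t b) {
        return static_cast<uint8_t>(std::min<unsigned>(255u, unsigned(a) + b));
    }

    // "Packed" add: four 8-bit elements stored in one 32-bit word.
    uint32_t padd_sat_u8x4(uint32_t x, uint32_t y) {
        uint32_t r = 0;
        for (int i = 0; i < 4; ++i) {
            uint8_t xi = (x >> (8 * i)) & 0xFF;
            uint8_t yi = (y >> (8 * i)) & 0xFF;
            r |= uint32_t(add_sat_u8(xi, yi)) << (8 * i);
        }
        return r;
    }

    int main() {
        std::printf("%08x\n", padd_sat_u8x4(0xC8646464u, 0x64646464u));
        // element-wise: 0xC8+0x64 saturates to 0xFF, the rest give 0xC8
        // -> prints ffc8c8c8
    }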
The SMT multimedia processor model
[Figure: the processor pipeline — IF and ID stages, rename registers, and branch, completion, integer, simple integer, thread control, global load/store and local load/store units; RI, RT and WB stages; BTAC, I-cache, D-cache, local memory, I/O, and a memory interface to external memory.]
Maximum processor configuration
- IPCs of 8-threaded 8-issue cases
Initial maximum configuration: 2.28
16-entry reservation stations for the thread, global and local load/store units
(instead of 256): 2.96
one common 256-entry reservation station unit for all integer/multimedia
units (instead of 256-entry reservation stations each): 3.27
loads and stores may pass blocked load/stores of other threads: 4.1
highest-priority-first, non-speculative-instruction-first, non-saturated-first
strategies for issue, dispatch, and retire stages: 4.34
32-entry reorder buffer (instead of 256): 4.69
second local load/store unit (because of 20.1% local load/stores): 6.07 (6.32
with dynamic branch prediction)
IPC of maximum processor
[Figure: IPC of the maximum processor configuration as a function of issue bandwidth and number of threads (1 to 8 each), assuming on-chip RAM and two local load/store units, a 4 MB I-cache, and a D-cache fill burst rate of 6:2:2:2; the 8-threaded 8-issue configuration reaches an IPC of 6.32.]
More realistic processor
[Figure: IPC with a D-cache fill burst rate of 32:4:4:4 and an issue bandwidth of 8.]
Speedup
[Figure: speedup of the maximum processor and of the realistic processor over the number of threads.]
A threefold speedup is reached.
IPC-Performance of SMT and CMP (1)
SPEC92-simulations [Tullsen et al.] vs. [Sigmund and Ungerer].
IPC-Performance of SMT and CMP (2)
SPEC95-simulations [Eggers et al.].
CMP2: 2 processors, 4-issue superscalar 2*(1,4)
CMP4: 4 processors, 2-issue superscalar 4*(1,2)
SMT: 8-threaded, 8-issue superscalar 1*(8,8)
IPC-Performance of SMT and CMP
SPEC95-simulations.
Performance is given relative to a single 2-issue superscalar
processor as baseline processor [Hammond et al.].

Comments to the simulation results [Hammond et al.]
CMP (eight 2-issue processors) outperforms a 12-issue superscalar and a 12-
issue, 8-threaded SMT processor on four SPEC95 benchmark programs
(hand-parallelized for CMP and SMT).

The CMP achieved higher performance than the SMT due to its total of 16 issue
slots instead of the 12 issue slots of the SMT.

Hammond et al. argue that design complexity for 16-issue CMPs is similar to
12-issue superscalars or 12-issue SMT processors.
SMT vs. multiprocessor chip [Eggers et al.]
SMT obtained better speedups than the (CMP) chip multiprocessors
- in contrast to results of Hammond et al.!!

Eggers et al. compared 8-issue, 8-threaded SMTs with four 2-issue CMPs.
Hammond et al. compared 12-issue, 8-threaded SMTs with eight 2-issue CMPs.
Eggers et al.:
Speedups on the CMP were hindered by the fixed partitioning of their
hardware resources across the processors.
In the CMP, processors were idle when thread-level parallelism was insufficient.
Exploiting large amounts of instruction-level parallelism in the unrolled
loops of individual threads was not possible due to the CMP processors'
smaller issue bandwidth.
An SMT processor dynamically partitions its resources among threads, and
therefore can respond well to variations in both types of parallelism, exploiting
them interchangeably.
Conclusions
The performance race between SMT and CMP is not yet decided.
CMP is easier to implement, but only SMT has the ability to hide latencies.
A functional partitioning is not easily reached within an SMT processor due to
the centralized instruction issue.
A separation of the thread queues is a possible solution, although it does
not remove the central instruction issue.
A combination of simultaneous multithreading with the CMP may be
superior.
We favor a CMP consisting of moderately equipped (e.g., 4-threaded 4-issue
superscalar) SMTs.
Future research: combine SMT or CMP organization with the ability to create
threads with compiler support or fully dynamically out of a single thread
thread-level speculation
close to multiscalar
