
Pipelining for Multi-Core Architectures
Multi-Core Technology

[Diagram: chip evolution. 2004: single core + cache. 2005: dual core (2 or more cores, each with its own cache). 2007: multi-core (4 or more cores, with 2X the cache).]
Why multi-core?
• Difficult to push single-core clock frequencies even higher
• Deeply pipelined circuits:
  – heat problems
  – clock problems
  – efficiency (stall) problems
• Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, is extremely difficult; the processor would have to
  – issue 3 or 4 data memory accesses per cycle,
  – rename and access more than 20 registers per cycle, and
  – fetch 12 to 24 instructions per cycle
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• A server can serve each client in a separate thread (Web server, database server)
• A computer game can do AI, graphics, and sound in three separate threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
What applications benefit from multi-core?
• Database servers
• Web servers (Web commerce)
• Multimedia applications
• Scientific applications, CAD/CAM
(each of the above can run on its own core)
• In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
More examples
• Editing a photo while recording a TV show through a digital video recorder
• Downloading software while running an anti-virus program
• "Anything that can be threaded today will map efficiently to multi-core"
• BUT: some applications are difficult to parallelize
Core 2 Duo Microarchitecture

[Microarchitecture block diagram]
Without SMT, only a single thread can run at any given time

[Pipeline diagram: Bus → BTB and I-TLB → Decoder → Trace Cache / uCode ROM → Rename/Alloc → Uop queues → Schedulers → Integer and Floating Point units → L1 D-Cache and D-TLB, backed by L2 Cache and Control. Thread 1 (floating point) occupies the pipeline alone.]
Without SMT, only a single thread can run at any given time

[Same pipeline diagram; this time Thread 2 (integer operation) occupies the pipeline alone.]
SMT processor: both threads can run concurrently

[Same pipeline diagram; Thread 1 (floating point) and Thread 2 (integer operation) share the pipeline resources simultaneously.]
But: can't simultaneously use the same functional unit

[Same pipeline diagram; Thread 1 and Thread 2 both need the integer unit. IMPOSSIBLE: this scenario cannot happen with SMT on a single core (assuming a single integer unit).]
Multi-core: threads can run on separate cores

[Diagram: two complete pipelines (Bus, BTB and I-TLB, Decoder, Trace Cache / uCode ROM, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache and D-TLB, L2 Cache and Control), one per core. Thread 1 runs on core 1, Thread 2 on core 2.]
Multi-core: threads can run on separate cores

[Same two-core diagram; Thread 3 runs on core 1, Thread 4 on core 2.]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
• Intel calls them "hyper-threads"
SMT dual-core: all four threads can run concurrently

[Same two-core diagram; Threads 1 and 2 share core 1, Threads 3 and 4 share core 2.]
Multi-Core and cache coherence

[Diagram, left: CORE0 and CORE1, each with a private L1 cache and a private L2 cache, connected directly to memory. Both L1 and L2 are private. Examples: AMD Opteron, AMD Athlon, Intel Pentium D.
Diagram, right: CORE0 and CORE1, each with private L1 and L2 caches and an L3 cache in front of memory. A design with L3 caches. Example: Intel Itanium 2.]
The cache coherence problem
• Since we have private caches: how do we keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared by all the cores
The cache coherence problem
Suppose variable x initially contains 15213.

[Diagram: four cores on a multi-core chip, each with one or more levels of cache (all empty); main memory holds x=15213.]
The cache coherence problem
Core 1 reads x.

[Diagram: Core 1's cache now holds x=15213; main memory holds x=15213.]
The cache coherence problem
Core 2 reads x.

[Diagram: Core 1's and Core 2's caches both hold x=15213; main memory holds x=15213.]
The cache coherence problem
Core 1 writes to x, setting it to 21660.

[Diagram: Core 1's cache holds x=21660 and main memory holds x=21660 (assuming write-through caches); Core 2's cache still holds the old value x=15213.]
The cache coherence problem
Core 2 attempts to read x... and gets a stale copy.

[Diagram: Core 2 reads x=15213 from its own cache, even though Core 1's cache and main memory hold x=21660.]
The Memory Wall Problem
Memory Wall

[Plot: performance vs. year, 1980-2000, log scale. CPU (µProc) performance grows ~60%/yr ("Moore's Law", 2X every 1.5 years); DRAM performance grows ~9%/yr (2X every 10 years). The processor-memory performance gap grows ~50% per year.]
Latency in a Single PC

[Plot, 1997-2009: the CPU clock period (ns) keeps shrinking toward and below 1 ns, while memory system access time stays near 100 ns, so the memory-to-CPU ratio climbs from roughly 100 toward 500: THE WALL.]
Pentium 4 Processor Cache Hierarchy
L1 I (12K µops trace cache), L1 D (8 KiB): 2 cycles
L2 cache (512 KiB): 19 cycles
L3 cache (2 MiB): 43 cycles
Memory: 206 cycles
Technology Trends
         Capacity         Speed (latency)
Logic:   2x in 3 years    2x in 3 years
DRAM:    4x in 3 years    2x in 10 years
Disk:    4x in 3 years    2x in 10 years

DRAM Generations
Year   Size      Cycle Time
1980   64 Kb     250 ns
1983   256 Kb    220 ns
1986   1 Mb      190 ns
1989   4 Mb      165 ns
1992   16 Mb     120 ns
1996   64 Mb     110 ns
1998   128 Mb    100 ns
2000   256 Mb    90 ns
2002   512 Mb    80 ns
2006   1024 Mb   60 ns
~16,000:1 (capacity)   ~4:1 (latency)
Processor-DRAM Performance Gap Impact: Example
• To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory.
• The minimum cost of a full memory access in terms of number of wasted CPU cycles:

Year   CPU speed (MHz)   CPU cycle (ns)   Memory access (ns)   Minimum CPU cycles (instructions) wasted
1986   8                 125              190                  190/125 - 1 = 0.5
1989   33                30               165                  165/30 - 1 = 4.5
1992   60                16.6             120                  120/16.6 - 1 = 6.2
1996   200               5                110                  110/5 - 1 = 21
1998   300               3.33             100                  100/3.33 - 1 = 29
2000   1000              1                90                   90/1 - 1 = 89
2003   2000              0.5              80                   80/0.5 - 1 = 159
Main Memory
• Main memory generally uses Dynamic RAM (DRAM), which uses a single transistor to store a bit, but requires a periodic data refresh (~every 8 msec).
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor/bit for DRAM)
• Size: DRAM/SRAM ~4-8x; cost & cycle time: SRAM/DRAM ~8-16x
• Main memory performance:
  – Memory latency:
    • Access time: the time between a memory access request and the moment the requested information is available to the cache/CPU.
    • Cycle time: the minimum time between requests to memory (greater than access time in DRAM, to allow the address lines to stabilize)
  – Memory bandwidth: the maximum sustained data transfer rate between main memory and the cache/CPU.
Architects Use Transistors to Tolerate Slow Memory
• Cache
  – Small, fast memory
  – Holds information expected to be used soon
  – Mostly successful
• Apply recursively
  – Level-one cache(s)
  – Level-two cache
• Most of a microprocessor's die area is cache!