
Pipelining for Multi-Core Architectures
Multi-Core Technology

[Diagram: chip evolution. 2004: single core + cache. 2005: dual core (2 or more cores, each with its own cache). 2007: multi-core (4 or more cores, with 2X the cache).]
Why multi-core?
• Difficult to push single-core clock frequencies even higher
• Deeply pipelined circuits:
  – heat problems
  – clock problems
  – efficiency (stall) problems
• Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, is extremely difficult; the processor would have to
  – issue 3 or 4 data memory accesses per cycle,
  – rename and access more than 20 registers per cycle, and
  – fetch 12 to 24 instructions per cycle
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• A server can serve each client in a separate thread (Web server, database server)
• A computer game can do AI, graphics, and sound in three separate threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
What applications benefit from multi-core?
• Database servers
• Web servers (Web commerce)
• Multimedia applications
• Scientific applications, CAD/CAM
(each of the above can run on its own core)
• In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
More examples
• Editing a photo while recording a TV show through a digital video recorder
• Downloading software while running an anti-virus program
• "Anything that can be threaded today will map efficiently to multi-core"
• BUT: some applications are difficult to parallelize
Core 2 Duo Microarchitecture

[Microarchitecture block diagram]
Without SMT, only a single thread can run at any given time

[Pipeline diagram: Bus → BTB and I-TLB → Decoder → Trace Cache / uCode ROM → Rename/Alloc → Uop queues → Schedulers → Integer and Floating Point units → L1 D-Cache and D-TLB, backed by L2 Cache and Control. Thread 1 (floating point) occupies the pipeline alone.]
Without SMT, only a single thread can run at any given time

[Same pipeline diagram; this time Thread 2 (integer operation) occupies the pipeline alone.]
SMT processor: both threads can run concurrently

[Same pipeline diagram; Thread 1 (floating point) and Thread 2 (integer operation) share the pipeline resources simultaneously.]
But: can't simultaneously use the same functional unit

[Same pipeline diagram; Thread 1 and Thread 2 both need the integer unit. IMPOSSIBLE: this scenario cannot happen with SMT on a single core (assuming a single integer unit).]
Multi-core: threads can run on separate cores

[Diagram: two complete pipelines (Bus, BTB and I-TLB, Decoder, Trace Cache / uCode ROM, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache and D-TLB, L2 Cache and Control), one per core. Thread 1 runs on core 1, Thread 2 on core 2.]
Multi-core: threads can run on separate cores

[Same two-core diagram; Thread 3 runs on core 1, Thread 4 on core 2.]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
• Intel calls them "hyper-threads"
SMT dual-core: all four threads can run concurrently

[Same two-core diagram; Threads 1 and 2 share core 1, Threads 3 and 4 share core 2.]
Multi-Core and cache coherence

[Diagram, left: CORE0 and CORE1, each with a private L1 cache and a private L2 cache, connected directly to memory. Both L1 and L2 are private. Examples: AMD Opteron, AMD Athlon, Intel Pentium D.
Diagram, right: CORE0 and CORE1, each with private L1 and L2 caches and an L3 cache in front of memory. A design with L3 caches. Example: Intel Itanium 2.]
The cache coherence problem
• Since we have private caches: how do we keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared by all the cores
The cache coherence problem
Suppose variable x initially contains 15213.

[Diagram: four cores on a multi-core chip, each with one or more levels of cache (all empty); main memory holds x=15213.]
The cache coherence problem
Core 1 reads x.

[Diagram: Core 1's cache now holds x=15213; main memory holds x=15213.]
The cache coherence problem
Core 2 reads x.

[Diagram: Core 1's and Core 2's caches both hold x=15213; main memory holds x=15213.]
The cache coherence problem
Core 1 writes to x, setting it to 21660.

[Diagram: Core 1's cache holds x=21660 and main memory holds x=21660 (assuming write-through caches); Core 2's cache still holds the old value x=15213.]
The cache coherence problem
Core 2 attempts to read x... and gets a stale copy.

[Diagram: Core 2 reads x=15213 from its own cache, even though Core 1's cache and main memory hold x=21660.]
The Memory Wall Problem
Memory Wall

[Plot: performance vs. year, 1980-2000, log scale. CPU (µProc) performance grows ~60%/yr ("Moore's Law", 2X every 1.5 years); DRAM performance grows ~9%/yr (2X every 10 years). The processor-memory performance gap grows ~50% per year.]
Latency in a Single PC

[Plot, 1997-2009: the CPU clock period (ns) keeps shrinking toward and below 1 ns, while memory system access time stays near 100 ns, so the memory-to-CPU ratio climbs from roughly 100 toward 500: THE WALL.]
Pentium 4 Processor Cache Hierarchy
L1 I (12K µops trace cache), L1 D (8 KiB): 2 cycles
L2 cache (512 KiB): 19 cycles
L3 cache (2 MiB): 43 cycles
Memory: 206 cycles
Technology Trends
         Capacity         Speed (latency)
Logic:   2x in 3 years    2x in 3 years
DRAM:    4x in 3 years    2x in 10 years
Disk:    4x in 3 years    2x in 10 years

DRAM Generations
Year   Size      Cycle Time
1980   64 Kb     250 ns
1983   256 Kb    220 ns
1986   1 Mb      190 ns
1989   4 Mb      165 ns
1992   16 Mb     120 ns
1996   64 Mb     110 ns
1998   128 Mb    100 ns
2000   256 Mb    90 ns
2002   512 Mb    80 ns
2006   1024 Mb   60 ns
~16,000:1 (capacity)   ~4:1 (latency)
Processor-DRAM Performance Gap Impact: Example
• To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory.
• The minimum cost of a full memory access in terms of number of wasted CPU cycles:

Year   CPU speed (MHz)   CPU cycle (ns)   Memory access (ns)   Minimum CPU cycles (instructions) wasted
1986   8                 125              190                  190/125 - 1 = 0.5
1989   33                30               165                  165/30 - 1 = 4.5
1992   60                16.6             120                  120/16.6 - 1 = 6.2
1996   200               5                110                  110/5 - 1 = 21
1998   300               3.33             100                  100/3.33 - 1 = 29
2000   1000              1                90                   90/1 - 1 = 89
2003   2000              0.5              80                   80/0.5 - 1 = 159
Main Memory
• Main memory generally uses Dynamic RAM (DRAM), which uses a single transistor to store a bit, but requires a periodic data refresh (~every 8 msec).
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor/bit for DRAM)
• Size: DRAM/SRAM ~4-8x; cost & cycle time: SRAM/DRAM ~8-16x
• Main memory performance:
  – Memory latency:
    • Access time: the time between a memory access request and the moment the requested information is available to the cache/CPU.
    • Cycle time: the minimum time between requests to memory (greater than access time in DRAM, to allow the address lines to stabilize)
  – Memory bandwidth: the maximum sustained data transfer rate between main memory and the cache/CPU.
Architects Use Transistors to Tolerate Slow Memory
• Cache
  – Small, fast memory
  – Holds information expected to be used soon
  – Mostly successful
• Apply recursively
  – Level-one cache(s)
  – Level-two cache
• Most of a microprocessor's die area is cache!