Core Architectures
Multi-Core Technology
[Figure: evolution of core architectures — 2004: single core (one core + cache); 2005: dual core (two or more cores, each with its own cache); 2007: multi-core (four or more cores, each with its own cache).]
Why multi-core?
• Difficult to make single-core clock frequencies even higher
• Deeply pipelined circuits:
– Heat problems
– Clock problems
– Efficiency (stall) problems
• Doubling issue rates above today’s 3-6 instructions per clock, say to
6 to 12 instructions, is extremely difficult
– issue 3 or 4 data memory accesses per cycle,
– rename and access more than 20 registers per cycle, and
– fetch 12 to 24 instructions per cycle.
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more
parallelism)
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order, pipeline
instructions, split them into
microinstructions, do aggressive branch
prediction, etc.
• Instruction-level parallelism enabled rapid
increases in processor speeds over the
last 15 years
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• Server can serve each client in a separate
thread (Web server, database server)
• A computer game can do AI, graphics, and
sound in three separate threads
• Single-core superscalar processors cannot
fully exploit TLP
• Multi-core architectures are the next step in
processor evolution: explicitly exploiting TLP
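As a minimal sketch of thread-level parallelism (the function and workload below are illustrative stand-ins, not from the slides), each client of a server can be handled by its own thread, and on a multi-core chip those threads can run on separate cores. Note that CPython's GIL serializes pure-Python compute, so real servers rely on I/O-bound threads or multiple processes; the sketch only shows the structure:

```python
import threading

# Sketch: thread-level parallelism. Each "client" is served by its own
# thread; on a multi-core CPU the threads can run on separate cores.
# handle_client and its workload are hypothetical stand-ins.
def handle_client(client_id, results):
    # Independent per-client work (e.g. building a web response)
    results[client_id] = sum(i * i for i in range(10_000))

results = {}
threads = [threading.Thread(target=handle_client, args=(cid, results))
           for cid in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait until all four "clients" have been served

print(sorted(results))  # [0, 1, 2, 3]: every client thread completed
```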
What applications benefit
from multi-core?
• Database servers
• Web servers (Web commerce)
• Multimedia applications
• Scientific applications, CAD/CAM
• In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
Each of these can run on its own core.
More examples
• Editing a photo while recording a TV show
through a digital video recorder
• Downloading software while running an
anti-virus program
• “Anything that can be threaded today will
map efficiently to multi-core”
• BUT: some applications difficult to
parallelize
Core 2 Duo Microarchitecture
Without SMT, only a single thread
can run at any given time
[Figure: superscalar pipeline (Bus, Decoder, Rename/Alloc, µop queues, Schedulers, L1 D-Cache and D-TLB); without SMT, only one thread at a time — here Thread 2 executing an integer operation — occupies the pipeline.]
SMT processor: both threads can
run concurrently
[Figure: the same pipeline with SMT — Thread 1 and Thread 2 share the Decoder, Rename/Alloc, µop queues, Schedulers, and L1 D-Cache/D-TLB, and run concurrently.]
Multi-core:
threads can run on separate cores
[Figure: a dual-core chip — two complete pipelines, each with its own Rename/Alloc and L1 D-Cache/D-TLB, sharing the bus; Thread 3 and Thread 4 run on separate cores.]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads:
2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
SMT Dual-core: all four threads can
run concurrently
[Figure: an SMT dual-core chip — two pipelines, each running two SMT threads (Threads 1 and 2 on one core, Threads 3 and 4 on the other), so all four threads execute concurrently.]
Multi-Core and Cache Coherence
[Figure: two multi-core cache organizations. Left: both L1 and L2 caches are private to each core (examples: AMD Opteron, AMD Athlon, Intel Pentium D). Right: private L1 caches backed by a shared L3 cache (example: Intel Itanium 2).]
The cache coherence problem
• Since we have private caches:
How to keep the data consistent across caches?
• Each core should perceive the memory as a
monolithic array, shared by all the cores
The cache coherence problem
Suppose variable x initially contains 15213
multi-core chip
Main memory
x = 15213
The cache coherence problem
Core 1 reads x
multi-core chip
Main memory
x = 15213
The cache coherence problem
Core 2 reads x
multi-core chip
Main memory
x = 15213
The cache coherence problem
Core 1 writes to x, setting it to 21660
multi-core chip
Main memory
x = 21660 (assuming write-through caches)
The cache coherence problem
Core 2 attempts to read x… gets a stale copy
multi-core chip
Main memory
x = 21660
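The whole sequence above can be reproduced with a toy model of private write-through caches that lack any invalidation protocol (the class and method names are illustrative, not a real API):

```python
# Toy model of the coherence problem: private write-through caches
# with no invalidation, so a cached copy can go stale.
memory = {"x": 15213}

class Cache:
    def __init__(self):
        self.lines = {}
    def read(self, mem, addr):
        if addr not in self.lines:       # miss: fill line from memory
            self.lines[addr] = mem[addr]
        return self.lines[addr]          # hit: may return a stale value!
    def write(self, mem, addr, value):
        self.lines[addr] = value
        mem[addr] = value                # write-through to main memory

core1, core2 = Cache(), Cache()
core1.read(memory, "x")           # Core 1 reads x: caches 15213
core2.read(memory, "x")           # Core 2 reads x: caches 15213
core1.write(memory, "x", 21660)   # Core 1 writes; memory is updated
print(core2.read(memory, "x"))    # 15213 — Core 2 sees the stale copy
```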
The Memory Wall
Problem
Memory Wall
[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance ("Moore's Law") improves ~60%/yr (2x every 1.5 years), while DRAM latency improves only ~9%/yr (2x every 10 years); the processor-memory performance gap grows ~50% per year.]
Latency in a Single PC
[Figure: latency in a single PC, 1997-2009, log scale — memory latency measured in CPU cycles (the "Ratio" curve) climbs toward 1000: "THE WALL".]
Pentium 4 Processor Cache Hierarchy
[Figure: Pentium 4 cache hierarchy — L1 I (trace cache): 12K µops; L1 D: 8 KiB, 2-cycle latency.]
Technology Trends
Capacity Speed (latency)
Logic: 2x in 3 years 2x in 3 years
DRAM: 4x in 3 years 2x in 10 years
Disk: 4x in 3 years 2x in 10 years
DRAM Generations
Year Size Cycle Time
1980 64 Kb 250 ns
1983 256 Kb 220 ns
1986 1 Mb 190 ns
1989 4 Mb 165 ns
1992 16 Mb 120 ns
1996 64 Mb 110 ns
1998 128 Mb 100 ns
2000 256 Mb 90 ns
2002 512 Mb 80 ns
2006 1024 Mb 60 ns
Ratio (2006 vs. 1980): ~16000:1 capacity, ~4:1 latency
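The summary ratios at the bottom of the table can be checked directly (using binary prefixes for Kb and Mb):

```python
# Verify the capacity and latency ratios from the DRAM generations table:
# 1980: 64 Kb at 250 ns  ->  2006: 1024 Mb at 60 ns
kb = 1024            # bits per Kb
mb = 1024 * kb       # bits per Mb
capacity_ratio = (1024 * mb) / (64 * kb)
latency_ratio = 250 / 60
print(capacity_ratio, round(latency_ratio, 1))  # 16384.0 4.2 -> ~16000:1 and ~4:1
```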
Processor-DRAM Performance Gap Impact:
Example
• To illustrate the performance impact, assume a single-issue pipelined CPU with CPI
= 1 using non-ideal memory.
• The minimum cost of a full memory access in terms of number of wasted CPU
cycles:
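As a worked sketch of that cost (the clock rate and memory latency below are assumed for illustration, not taken from the slides):

```python
# Wasted CPU cycles per full memory access = memory latency / CPU cycle time.
# Assumed numbers: 1 GHz single-issue CPU (CPI = 1), 100 ns memory access.
cpu_clock_hz = 1_000_000_000
mem_access_ns = 100
cycle_ns = 1e9 / cpu_clock_hz           # 1.0 ns per cycle
wasted_cycles = mem_access_ns / cycle_ns
print(wasted_cycles)  # 100.0 cycles stalled while one instruction waits
```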
Main Memory
• Main memory generally uses Dynamic RAM (DRAM),
which uses a single transistor to store a bit, but requires a
periodic data refresh (~every 8 msec).
• Cache uses SRAM: Static Random Access Memory
– No refresh (6 transistors/bit vs. 1 transistor/bit for DRAM)
• Size: DRAM/SRAM ratio ~4-8x;
cost & cycle time: SRAM/DRAM ratio ~8-16x
• Main memory performance:
– Memory latency:
• Access time: The time it takes between a memory access request and
the time the requested information is available to cache/CPU.
• Cycle time: The minimum time between requests to memory
(greater than access time in DRAM to allow address lines to be stable)
– Memory bandwidth: The maximum sustained data transfer
rate between main memory and cache/CPU.
Architects Use Transistors to Tolerate Slow
Memory
• Cache
– Small, Fast Memory
– Holds information expected
to be used soon
– Mostly Successful
• Apply Recursively
– Level-one cache(s)
– Level-two cache
• Most of microprocessor
die area is cache!
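Why applying caches recursively pays off can be seen from the average memory access time (AMAT); the hit times and miss rates below are assumed for illustration:

```python
# AMAT = hit_time + miss_rate * (AMAT of the next level down)
l1_hit, l1_miss_rate = 1, 0.05     # L1: 1-cycle hit, 5% miss (assumed)
l2_hit, l2_miss_rate = 10, 0.20    # L2: 10-cycle hit, 20% miss (assumed)
mem_latency = 100                  # main memory, in cycles (assumed)

l2_amat = l2_hit + l2_miss_rate * mem_latency   # 10 + 0.2*100 = 30 cycles
amat = l1_hit + l1_miss_rate * l2_amat          # 1 + 0.05*30 = 2.5 cycles
print(amat)  # 2.5: far closer to L1 speed than to memory speed
```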