Architecture
http://www.eecs.berkeley.edu/~pattrsn
http://vlsi.cs.berkeley.edu/cs252-s06
Outline
• Review
• Vector Metrics, Terms
• Cray 1 paper discussion
• MP Motivation
• SISD v. SIMD v. MIMD
• Centralized vs. Distributed Memory
• Challenges to Parallel Programming
• Consistency, Coherency, Write Serialization
• Write Invalidate Protocol
• Example
• Conclusion
[Figure: interleaved memory — 8 unpipelined DRAM banks; each address maps to bank (Addr mod 8) = 0, 1, …, 7]
• Great for unit stride:
– Contiguous elements in different DRAMs
– Startup time for vector operation is latency of single read
• What about non-unit stride?
– Above good for strides that are relatively prime to 8
– Bad for: 2, 4
– Better: prime number of banks…!
Scalar: HP PA-8000
• N ops per cycle ⇒ O(N²) circuitry
• 4-way issue
• reorder buffer: 850K transistors, incl. 6,720 5-bit register number comparators
Vector: T0 vector micro
• N ops per cycle ⇒ O(N) circuitry
• 24 ops per cycle
• 730K transistors total; only 23 5-bit register number comparators
• No floating point
Superscalar
• Control logic grows quadratically with issue width
• Control logic consumes energy regardless of available parallelism
• Speculation to increase visible parallelism wastes energy
Vector
• Control logic grows linearly with issue width
• Vector unit switches off when not in use
• Vector instructions expose parallelism without speculation
• Software control of speculation when desired:
  – Whether to use vector mask or compress/expand for conditionals
Decoupled Execution
– Scalar unit runs ahead of vector unit, doing addressing and control
– Hardware dynamically unrolls loops, and issues multiple loops concurrently
– Special sync operations keep pipeline full, even across barriers
• Allows the processor to perform well on short nested loops
• Machine upgradeable
– Can replace Cray X1 nodes with X1E nodes
[Figure: uniprocessor performance, 1978–2006 (log scale) — 25%/year growth through the mid-1980s, then 52%/year]
• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present
12/20/22 CS252 s06 smp 27
Déjà vu all over again?
“… today’s processors … are nearing an impasse as technologies approach the speed of light …”
David Mitchell, The Transputer: The Time Is Now (1989)
• Transputer had bad timing (Uniprocessor performance)
Procrastination rewarded: 2X seq. perf. / 1.5 years
• “We are dedicating all of our future product development to multicore
designs. … This is a sea change in computing”
Paul Otellini, President, Intel (2005)
• All microprocessor companies switch to MP (2X CPUs / 2 yrs)
Procrastination penalized: 2X sequential perf. / 5 yrs
[Figure: threads per chip over time — growing from 2–4 toward 32]
Other Factors ⇒ Multiprocessors
• Growth in data-intensive applications
– Data bases, file servers, …
• Growing interest in servers, server perf.
• Increasing desktop perf. less important
– Outside of graphics
• Improved understanding of how to use multiprocessors effectively
– Especially servers, where there is significant natural TLP
• Advantage of leveraging design investment
by replication
– Rather than unique design
Scale
[Figure: centralized shared memory — P1…Pn, each with a cache ($), connected through an interconnection network to shared Mem and I/O devices — vs. distributed memory, where each node pairs a processor/cache with its own Mem on the interconnection network]
[Figure: conceptual picture of shared memory — memory holds u = 5; two processors have cached u = 5, a third writes u = 7]
Intuitive Memory Model
[Figure: memory hierarchy — for the same address 100, L1 holds 100:67, L2 holds 100:35, disk holds 100:34]
• Reading an address should return the last value written to that address
– Easy in uniprocessors, except for I/O
[Figure: bus snooping — P1…Pn, each with a cache ($) and state tags, share a bus to Mem and I/O devices; every cache snoops the address and data of each cache-memory transaction. Shared variable u: memory and two caches hold u = 5 while one processor has written u = 7]
P0: R R R W R R
P1: R R R R R W
P2: R R R R R R
Write-Back State Machine — CPU requests
• Invalid —CPU write→ Exclusive (read/write): place write miss on bus
• Exclusive: CPU read hit and CPU write hit stay in Exclusive
• Exclusive —CPU write miss (?)→ Exclusive: write back cache block, place write miss on bus
Write-Back State Machine — Bus requests
• State machine for bus requests, for each cache block
• Shared (read/only) —write miss for this block→ Invalid
• Exclusive (read/write) —write miss for this block→ Invalid: write back block (abort memory access)
• Exclusive —read miss for this block→ Shared: write back block (abort memory access)
Write-Back State Machine III
• State machine for CPU requests and for bus requests, for each cache block
CPU requests:
• Invalid —CPU read→ Shared (read/only): place read miss on bus
• Invalid —CPU write→ Exclusive (read/write): place write miss on bus
• Shared: CPU read hit stays in Shared
• Shared —CPU read miss→ Shared: place read miss on bus
• Shared —CPU write→ Exclusive: place write miss on bus
• Exclusive: CPU read hit and CPU write hit stay in Exclusive
• Exclusive —CPU read miss→ Shared: write back block, place read miss on bus
• Exclusive —CPU write miss→ Exclusive: write back cache block, place write miss on bus
Bus requests:
• Shared —write miss for this block→ Invalid
• Exclusive —write miss for this block→ Invalid: write back block (abort memory access)
• Exclusive —read miss for this block→ Shared: write back block (abort memory access)
Example
step                 P1 (state addr val)   P2 (state addr val)   Bus (action proc addr val)   Memory (addr val)
P1: Write 10 to A1   Excl. A1 10                                 WrMs P1 A1
P1: Read A1          Excl. A1 10
P2: Read A1                                Shar. A1              RdMs P2 A1
                     Shar. A1 10                                 WrBk P1 A1 10                A1 10
                                           Shar. A1 10           RdDa P2 A1 10                A1 10
P2: Write 20 to A1   Inv.                  Excl. A1 20           WrMs P2 A1                   A1 10
P2: Write 40 to A2                                               WrMs P2 A2                   A1 10
                                           Excl. A2 40           WrBk P2 A1 20                A1 20