
CS 267: Applications of Parallel Computers
Lecture 5: Shared Memory Parallel Machines

Horst D. Simon
http://www.cs.berkeley.edu/~strive/cs267

09/10/02


Basic Shared Memory Architecture
• Processors all connected to a large shared memory
• Local caches for each processor
• Cost: much cheaper to access the cache than main memory

[Figure: processors P1 … Pn, each with its own cache ($), connected by a network to a shared memory]

° Simple to program, but hard to scale
° Now take a closer look at structure, costs, limits

Programming Shared Memory (review)
• Program is a collection of threads of control.
• Each thread has a set of private variables
  • e.g. local variables on the stack.
• Collectively with a set of shared variables
  • e.g., static variables, shared common blocks, global heap.
• Communication and synchronization through shared variables

[Figure: a shared portion of the address space (y = ..x...) visible to all threads, and a private portion (x = ...) for each processor P ... P]
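To make the model concrete, here is a minimal sketch (not from the slides) using POSIX threads in C: the global array partial[] is a shared variable, while id and mysum live on each thread's private stack, and the final reduction communicates through shared memory. NTHREADS, N, and the function names are illustrative choices; compile with -pthread.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

double partial[NTHREADS];          /* shared: visible to all threads          */

void *work(void *arg) {
    int id = *(int *)arg;          /* private: lives on this thread's stack   */
    double mysum = 0.0;            /* private                                 */
    for (int i = id; i < N; i += NTHREADS)
        mysum += (double)i;
    partial[id] = mysum;           /* communicate through shared memory       */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, work, &ids[i]);
    }
    double total = 0.0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);  /* join is the synchronization point       */
        total += partial[i];
    }
    printf("sum 0..%d = %.0f\n", N - 1, total);
    return 0;
}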

Outline
• Historical perspective
• Bus-based machines
  • Pentium SMP
  • IBM SP node
• Directory-based (CC-NUMA) machine
  • Origin 2000
• Global address space machines
  • Cray T3D and (sort of) T3E

60s Mainframe Multiprocessors
• Enhance memory capacity or I/O capabilities by adding memory modules or I/O devices
• How do you enhance processing capacity?
  • Add processors
• Already need an interconnect between slow memory banks and processors + I/O channels
  • cross-bar or multistage interconnection network

[Figure: memory modules (Mem) and I/O controllers (IOC) connected through an interconnect to processors (Proc); scaled version with memory banks (M), processors (P), and I/O channels (IO) on a cross-bar]

70s Breakthrough: Caches
• Memory system scaled by adding memory modules
  • Both bandwidth and capacity
• Memory was still a bottleneck
  • Enter… Caches!

[Figure: a fast processor and an I/O device or another processor reach slow memory over an interconnect; a cache next to the processor holds recently used data (e.g., A: 17)]

• Cache does two things:
  • Reduces average access time (latency)
  • Reduces bandwidth requirements to memory
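As a rough illustration of the latency effect (not a figure from the slide), here is a back-of-the-envelope average-memory-access-time calculation in C; the 1 ns hit time, 100 ns memory time, and 5% miss rate are assumed numbers, not data from the lecture.

#include <stdio.h>

/* Average memory access time (AMAT) = hit_time + miss_rate * miss_penalty.
   All numbers below are illustrative assumptions. */
int main(void) {
    double hit_time_ns    = 1.0;    /* fast cache access       */
    double memory_time_ns = 100.0;  /* slow main-memory access */
    double miss_rate      = 0.05;   /* 5% of accesses miss     */

    double amat = hit_time_ns + miss_rate * memory_time_ns;
    printf("without cache: %.0f ns per access\n", memory_time_ns);
    printf("with cache:    %.1f ns per access on average\n", amat);
    printf("fraction of accesses reaching memory: %.0f%%\n", miss_rate * 100.0);
    return 0;
}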

Technology Perspective

Capacity                 Speed
Logic:  2x in 3 years    2x in 3 years
DRAM:   4x in 3 years    1.4x in 10 years
Disk:   2x in 3 years    1.4x in 10 years

DRAM:
Year    Size     Cycle Time
1980    64 Kb    250 ns
1983    256 Kb   220 ns
1986    1 Mb     190 ns
1989    4 Mb     165 ns
1992    16 Mb    145 ns
1995    64 Mb    120 ns
        1000:1!  2:1!

[Figure: SpecInt and SpecFP performance vs. year, 1986-1996]

Approaches to Building Parallel Machines

[Figure: three organizations, in order of increasing scale:
  • Shared Cache: P1 … Pn behind a switch, sharing an interleaved first-level cache and interleaved main memory
  • Centralized Memory ("dance hall", UMA): P1 … Pn, each with its own cache ($), sharing Mem modules over an interconnection network
  • Distributed Memory (NUMA): P1 … Pn, each with its own cache and local Mem, connected by an interconnection network]

80s Shared Memory: Shared Cache
• Alliant FX-8
  • early 80's
  • eight 68020s with x-bar to 512 KB interleaved cache
• Encore & Sequent
  • first 32-bit micros (N32032)
  • two to a board with a shared cache

[Figure: shared-cache organization (P1 … Pn behind a switch with an interleaved first-level cache and interleaved main memory); chart of transistors per chip vs. year, 1965-2005, for the i4004, i8086, i80286, i80386, i80486, Pentium, M68K, and MIPS R3010/R4400/R10000]

Shared Cache: Advantages and Disadvantages

Advantages
• Cache placement identical to single cache
  • only one copy of any cached block
• Fine-grain sharing is possible
  • Interference: one processor may prefetch data for another
  • Can share data within a line without moving the line

Disadvantages
• Bandwidth limitation
• Interference
  • One processor may flush another processor's data

Approaches to Building Parallel Machines (roadmap, repeated)

[Same figure as before: Shared Cache → Centralized Memory ("dance hall", UMA) → Distributed Memory (NUMA), in order of increasing scale]

Intuitive Memory Model
• Reading an address should return the last value written to that address
  • Easy in uniprocessors
    • except for I/O
• Cache coherence problem in MPs is more pervasive and more performance critical
• More formally, this is called sequential consistency:
  “A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport, 1979]

Cache Coherence: Semantic Problem
• p1 and p2 both have cached copies of x (as 0)
• p1 writes x=1
  • May “write through” to memory
• p2 reads x, but gets the “stale” cached copy

[Figure: memory holds x=0; after the write, p1's cache holds x=1 while p2's cache still holds x=0]

Cache Coherence: Semantic Problem
What does this imply about program behavior?
• No process ever sees “garbage” values, i.e., ½ of 2 values
• Processors always see values written by some processor
• The value seen is constrained by program order on all processors
  • Time always moves forward
• Example: P1 writes x=1, then writes y=1; P2 reads y, then reads x

  initially: x=0, y=0
  P1            P2
  x = 1
  y = 1
                ... = y
                ... = x

  If P2 sees the new value of y, it must see the new value of x
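A minimal sketch of this example in C, using C11 atomics (which default to sequentially consistent ordering) and POSIX threads; with plain non-atomic variables the compiler and hardware would be free to reorder the accesses and the assertion could fail. The thread functions and variable names are illustrative, not from the slides.

#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>
#include <stdio.h>

atomic_int x = 0, y = 0;           /* shared flags, initially 0 */

void *writer(void *arg) {          /* plays the role of P1 */
    (void)arg;
    atomic_store(&x, 1);           /* x = 1 */
    atomic_store(&y, 1);           /* y = 1 */
    return NULL;
}

void *reader(void *arg) {          /* plays the role of P2 */
    (void)arg;
    int ry = atomic_load(&y);      /* ... = y */
    int rx = atomic_load(&x);      /* ... = x */
    if (ry == 1)
        assert(rx == 1);           /* guaranteed under sequential consistency */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, reader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("done: no violation observed\n");
    return 0;
}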

Snoopy Cache-Coherence Protocols

[Figure: P1 … Pn, each with a cache whose lines hold state, address, and data, on a shared bus with memory and I/O devices; each cache controller snoops bus transactions and may initiate cache-memory transactions]

• Bus is a broadcast medium & caches know what they have
• Cache Controller “snoops” all transactions on the shared bus
  • A transaction is a relevant transaction if it involves a cache block currently contained in this cache
  • take action to ensure coherence
    • invalidate, update, or supply value
  • depends on state of the block and the protocol

Basic Choices in Cache Coherence
• Cache may keep information such as:
  • Valid/invalid
  • Dirty (inconsistent with memory)
  • Shared (in another cache)
• When a processor executes a write operation to shared data, basic design choices are:
  • Write thru: do the write in memory as well as cache
  • Write back: wait and do the write later, when the item is flushed
  • Update: give all other processors the new value
  • Invalidate: all other processors remove from cache

Example: Write-thru Invalidate

[Figure: P1, P2, P3 with caches on a bus to memory; u is read into P1's and P3's caches as 5, P3 then writes u=7, which writes through to memory and invalidates the other cached copies, so later reads of u by P1 and P2 get the new value]

• Update and write-thru both use more memory bandwidth if there are writes to the same address
  • Update to the other caches
  • Write-thru to memory

Write-Back/Ownership Schemes
• When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth
  • reads by others cause it to return to “shared” state
• Most bus-based multiprocessors today use such schemes
• Many variants of ownership-based protocols
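To make the ownership-state transitions concrete, here is a hypothetical toy model in C of an MSI-style protocol for a single memory block; all names are invented, and a real protocol would also track addresses, write dirty data back to memory, and arbitrate the bus.

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } State;

#define NCACHE 3
State cache[NCACHE];                     /* state of the single block in each cache */

/* Every other cache snoops the bus transaction and reacts. */
static void snoop(int requester, int is_write) {
    for (int i = 0; i < NCACHE; i++) {
        if (i == requester || cache[i] == INVALID) continue;
        if (is_write)
            cache[i] = INVALID;          /* invalidate on a remote write            */
        else if (cache[i] == MODIFIED)
            cache[i] = SHARED;           /* owner supplies data, returns to shared  */
    }
}

void cpu_read(int p) {
    if (cache[p] == INVALID) {           /* miss: bus read, others may supply data  */
        snoop(p, 0);
        cache[p] = SHARED;
    }                                    /* SHARED/MODIFIED hits need no bus traffic */
}

void cpu_write(int p) {
    if (cache[p] != MODIFIED) {          /* need ownership: bus read-exclusive      */
        snoop(p, 1);
        cache[p] = MODIFIED;
    }                                    /* already owner: silent write, no bus use */
}

int main(void) {
    const char *name[] = { "INVALID", "SHARED", "MODIFIED" };
    cpu_read(0); cpu_read(1);            /* both caches hold the block shared       */
    cpu_write(0);                        /* P0 takes ownership, P1 is invalidated   */
    cpu_read(1);                         /* P0 supplies data and drops to shared    */
    for (int i = 0; i < NCACHE; i++)
        printf("P%d: %s\n", i, name[cache[i]]);
    return 0;
}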

Sharing: A Performance Problem
• True sharing
  • Frequent writes to a variable can create a bottleneck
  • OK for read-only or infrequently written data
  • Technique: make copies of the value, one per processor, if this is possible in the algorithm
  • Example problem: the data structure that stores the freelist/heap for malloc/free
• False sharing
  • Cache block may also introduce artifacts
    • Two distinct variables in the same cache block
  • Technique: allocate data used by each processor contiguously, or at least avoid interleaving
  • Example problem: an array of ints, one written frequently by each processor (a sketch of this case follows)
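A hedged sketch of that false-sharing example in C with POSIX threads: adjacent ints in shared_counts[] likely land in one cache block, so per-thread increments ping-pong the line between caches, while the padded version gives each counter its own (assumed 64-byte) line. Thread count, iteration count, and the line size are assumptions.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER 10000000L

int shared_counts[NTHREADS];                 /* adjacent ints: likely same cache line   */

struct padded { int count; char pad[60]; };  /* pad each counter to an assumed 64 B line */
struct padded padded_counts[NTHREADS];

void *bump_shared(void *arg) {
    int id = *(int *)arg;
    for (long i = 0; i < NITER; i++)
        shared_counts[id]++;                 /* each write may invalidate neighbors' line */
    return NULL;
}

void *bump_padded(void *arg) {
    int id = *(int *)arg;
    for (long i = 0; i < NITER; i++)
        padded_counts[id].count++;           /* private line: no coherence traffic       */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, bump_shared, &ids[i]);  /* swap in bump_padded to compare */
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("done\n");
    return 0;
}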

Limits of Bus-Based Shared Memory

Assume: 1 GHz processor w/o cache
  => 4 GB/s inst BW per processor (32-bit)
  => 1.2 GB/s data BW at 30% load-store

Suppose 98% inst hit rate and 95% data hit rate
  => 80 MB/s inst BW per processor
  => 60 MB/s data BW per processor
  => 140 MB/s combined BW per processor

Assuming 1 GB/s bus bandwidth
  => 8 processors will saturate the bus

[Figure: memory modules and I/O on a shared bus; each PROC sits behind its own cache; demand is 5.2 GB/s per processor without caches vs. 140 MB/s with caches]
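The same back-of-the-envelope arithmetic as the slide, written out in C so the assumptions are explicit; all rates come from the slide's assumptions, not measurements.

#include <stdio.h>

int main(void) {
    double clock_hz    = 1e9;     /* 1 GHz processor             */
    double inst_bytes  = 4.0;     /* 32-bit instructions         */
    double ls_fraction = 0.30;    /* 30% of insts are load/store */
    double inst_hit    = 0.98;
    double data_hit    = 0.95;
    double bus_bw      = 1e9;     /* 1 GB/s bus                  */

    double inst_bw = clock_hz * inst_bytes;               /* 4 GB/s   */
    double data_bw = clock_hz * ls_fraction * inst_bytes; /* 1.2 GB/s */

    double inst_miss_bw = inst_bw * (1.0 - inst_hit);     /* ~80 MB/s  */
    double data_miss_bw = data_bw * (1.0 - data_hit);     /* ~60 MB/s  */
    double per_proc     = inst_miss_bw + data_miss_bw;    /* ~140 MB/s */

    printf("per-processor bus demand: %.0f MB/s\n", per_proc / 1e6);
    printf("processors to saturate a %.0f MB/s bus: %.1f\n",
           bus_bw / 1e6, bus_bw / per_proc);
    return 0;
}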

Engineering: Intel Pentium Pro Quad

[Figure: four P-Pro modules (CPU, 256-KB L2 $, interrupt controller, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM; PCI bridges to PCI buses and PCI I/O cards]

SMP for the masses:
• All coherence and multiprocessing glue in processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth

Engineering: SUN Enterprise

[Figure: CPU/mem cards (two processors, each with $ and $2, plus memory controller) and I/O cards (bus interface, SBUS slots, 100bT, 2 FiberChannel, SCSI) on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]

• Proc + mem card and I/O card
  • 16 cards of either type
• All memory accessed over bus, so symmetric
• Higher bandwidth, higher latency bus

Approaches to Building Parallel Machines (roadmap, repeated)

[Same figure as before: Shared Cache → Centralized Memory ("dance hall", UMA) → Distributed Memory (NUMA), in order of increasing scale]

Directory-Based Cache-Coherence

90s: Scalable, Cache Coherent Multiprocessors

[Figure: P1 … Pn, each with a cache, connected by an interconnection network to memory; each memory block has a directory entry holding a dirty bit and per-processor presence bits]
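A hypothetical sketch in C of a full-bit-vector directory entry of the kind drawn above (one dirty bit plus per-processor presence bits); the field names, helper functions, and 64-processor limit are illustrative, and a real controller would also send the invalidation and data messages.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PROCS 64

typedef struct {
    bool     dirty;                 /* one cache holds the block modified         */
    uint64_t presence;              /* bit i set => processor i has a cached copy */
} DirEntry;

/* On a read miss by processor p: record the new sharer. */
static inline void dir_read_miss(DirEntry *e, int p) {
    e->presence |= (uint64_t)1 << p;
}

/* On a write miss by processor p: the protocol would first invalidate every
   processor whose presence bit is set, then record p as the sole, dirty owner. */
static inline void dir_write_miss(DirEntry *e, int p) {
    e->presence = (uint64_t)1 << p;
    e->dirty = true;
}

int main(void) {
    DirEntry e = {0};
    dir_read_miss(&e, 3);           /* P3 reads the block               */
    dir_read_miss(&e, 7);           /* P7 reads the block               */
    dir_write_miss(&e, 3);          /* P3 writes: P7's copy invalidated */
    printf("dirty=%d presence=0x%llx\n", e.dirty, (unsigned long long)e.presence);
    return 0;
}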

SGI Origin 2000

[Figure: two node boards, each with two processors (1-4 MB L2 cache per processor), a Hub chip, directory plus main memory (1-4 GB), and an Xbow I/O crossbar, joined by the interconnection network]

• Single 16"-by-11" PCB
• Directory state in same or separate DRAMs, accessed in parallel
• Up to 512 nodes (2 processors per node)
• With 195 MHz R10K processor, peak 390 MFLOPS or 780 MIPS per proc
• Peak SysAD bus bw is 780 MB/s, so also Hub-Mem
• Hub to router chip and to Xbow is 1.56 GB/s (both are off-board)

Caches and Scientific Computing
• Caches tend to perform worst on demanding applications that operate on large data sets
  • transaction processing
  • operating systems
  • sparse matrices
• Modern scientific codes use tiling/blocking to become cache friendly (a tiled matrix-multiply sketch follows)
  • easier for dense codes than for sparse
  • tiling and parallelism are similar transformations
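A minimal sketch of tiling/blocking in C: a cache-blocked matrix multiply in which B-by-B tiles are reused while they fit in cache. N and the block size B are illustrative and would be tuned to the actual cache size.

#include <stdio.h>
#include <stdlib.h>

#define N 512
#define B 64
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* C += A * Bm, with all three matrices stored row-major as N*N doubles. */
void matmul_blocked(const double *A, const double *Bm, double *C) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                /* work on one BxB tile at a time so it stays in cache */
                for (int i = ii; i < MIN(ii + B, N); i++)
                    for (int k = kk; k < MIN(kk + B, N); k++) {
                        double a = A[i * N + k];
                        for (int j = jj; j < MIN(jj + B, N); j++)
                            C[i * N + j] += a * Bm[k * N + j];
                    }
}

int main(void) {
    double *A = calloc(N * N, sizeof *A);
    double *Bm = calloc(N * N, sizeof *Bm);
    double *C = calloc(N * N, sizeof *C);
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; Bm[i] = 1.0; }
    matmul_blocked(A, Bm, C);
    printf("C[0][0] = %g (expect %d)\n", C[0], N);
    free(A); free(Bm); free(C);
    return 0;
}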

Scalable Global Address Space

Global Address Space: Structured Memory

[Figure: nodes on a scalable network; each node has a processor (P), MMU, cache ($), memory (M), a pseudo-memory controller and a pseudo-processor; a load (Ld R <- Addr) becomes a read request message (src, addr, dest) and a reply message (tag, data)]

• Processor performs load
• Pseudo-memory controller turns it into a message transaction with a remote controller, which performs the memory operation and replies with the data
• Examples: BBN Butterfly, Cray T3D

Cray T3D: Global Address Space machine
• 2048 Alphas (150 MHz, 16 or 64 MB each) + fast network
  • 43-bit virtual address space, 32-bit physical
  • 32-bit and 64-bit load/store + byte manipulation on regs
  • no L2 cache
  • non-blocking stores, load/store re-ordering, memory fence
  • load-lock / store-conditional
• Direct global memory access via external segment regs
  • DTB annex, 32 entries, remote processor number and mode
  • atomic swap between special local reg and memory
  • special fetch&inc register
  • global-OR, global-AND barriers
• Prefetch Queue
• Block Transfer Engine
• User-level Message Queue
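The T3D's special fetch&inc register supports the kind of work distribution that a modern shared-memory program gets from an atomic fetch-and-add; here is a hedged C11/pthreads analogue (not T3D code) in which threads claim unique task indices with one atomic operation. NTHREADS and NTASKS are illustrative.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define NTASKS   100

atomic_int next_task = 0;                          /* shared work counter             */
int done[NTASKS];                                  /* each slot written by one thread */

void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);   /* grab a unique task index        */
        if (t >= NTASKS) break;
        done[t] = 1;                               /* "process" task t                */
    }
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(th[i], NULL);
    int n = 0;
    for (int i = 0; i < NTASKS; i++) n += done[i];
    printf("%d of %d tasks claimed exactly once\n", n, NTASKS);
    return 0;
}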

Cray T3E

[Figure: node with processor (P), cache ($), memory, and a combined memory controller and network interface; external I/O; nodes connected by a 3D torus (X, Y, Z switch)]

• Scales up to 1024 processors, 480 MB/s links
• Memory system similar to T3D
  • Memory controller generates request message for non-local references
• No hardware mechanism for coherence

What to Take Away?
• Programming shared memory machines
  • May allocate data in large shared region without too many worries about where
  • Memory hierarchy is critical to performance
    • Even more so than on uniprocs, due to coherence traffic
  • For performance tuning, watch sharing (both true and false)
• Semantics
  • Need to lock access to shared variable for read-modify-write (a sketch follows)
  • Sequential consistency is the natural semantics
  • Architects worked hard to make this work
    • Caches are coherent with buses or directories
    • No caching of remote data on shared address space machines
  • But compiler and processor may still get in the way
    • Non-blocking writes, read prefetching, code motion…
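A minimal sketch of the "lock the read-modify-write" point, using a POSIX mutex in C; without the lock, concurrent balance = balance + 1 updates can be lost. NTHREADS and NITER are illustrative.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    100000L

long balance = 0;                                   /* shared variable   */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* protects balance  */

void *deposit(void *arg) {
    (void)arg;
    for (long i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);
        balance = balance + 1;                      /* read-modify-write is now atomic */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, deposit, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("balance = %ld (expect %ld)\n", balance, NTHREADS * NITER);
    return 0;
}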

Where are things going
• High-end
  • collections of almost complete workstations/SMP on high-speed network (Millennium, IBM SP machines)
  • with specialized communication assist integrated with memory system to provide global access to shared data (??)
• Mid-end
  • almost all servers are bus-based CC SMPs
  • high-end servers are replacing the bus with a network
    • Sun Enterprise 10000, Cray SV1, HP/Convex SPP
    • SGI Origin 2000
    • Sequent, Data General
  • volume approach is Pentium pro quadpack + SCI ring
• Low-end
  • SMP desktop is here
• Major change ahead
  • SMP on a chip as a building block