
Understanding Multiprocessor Systems

This document discusses multiprocessors and parallel programming. It introduces multiprocessors and their scalability, then covers shared-memory and distributed-memory multiprocessor architectures. Programming multiprocessors is difficult because programmers must explicitly manage coordination and synchronization between processors, and parallel programs are harder to develop than sequential programs because of the overhead of communication and coordination.


Chapter 9

Multiprocessors



9.1 Introduction
° Until now we have studied how to improve performance by exploiting parallelism inside the processor (instruction-level parallelism, ILP) with techniques such as pipelining, superscalar execution, VLIW, and dynamic scheduling.

° What if we try to connect many processors to build a machine with as much performance as we want?

° Multiprocessors must therefore be scalable: the hardware and software are designed to be sold with a variable number of processors. We want performance to improve proportionally to the increase in the number of processors.



Multiprocessors

° Idea: create powerful computers by connecting many smaller ones.

° Good news: works for timesharing (better than a supercomputer). Vector processing may be coming back.

° Bad news: it's really hard to write good concurrent programs. Many commercial failures.

[Figure: two organizations: (a) processors with private caches sharing a single bus to memory and I/O; (b) processor/cache/memory nodes connected by a network.]


Why Use Multiprocessors?

° The system is fault tolerant: if one processor fails, the system continues service with the others.

° Multiprocessors may offer the highest absolute performance: faster than the fastest uniprocessor.

° It is more cost effective to use many single-chip uniprocessors than to build a high-performance uniprocessor from a more exotic technology.

° Current file servers can be ordered with multiple processors.

° The database industry has standardized on multiprocessors.



Performance in Multiprocessors

° Commercial vendors usually define high performance as high throughput for independent tasks.

° Alternative: running a single task on multiple processors.

° Definition: a parallel processing program is a single program that runs on multiple processors simultaneously.



Key Questions

° How do parallel processors share data?
— single address space
— message passing

° How do parallel processors coordinate?
— synchronization (locks, semaphores)
— built into send / receive primitives

° How many processors should the system have?
— connected by a single bus (2 - 32 procs)
— connected by a network (8 - 256 procs)



Shared Memory Processors (SMPs)

° Processors with a single address space: all processors share a single memory address space.

° Processors communicate through shared variables in memory. All processors are capable of accessing any memory location with loads and stores.
[Figure: single-bus shared-memory multiprocessor: processors with caches connected by a single bus to memory and I/O.]


Coordinating Shared Memory Processors

° Synchronization: prevent one processor from starting to work on data before another processor is finished with it.

° When sharing is supported with a single address space, there must be a separate mechanism for synchronization.

° Example synchronization mechanism: the lock. When a processor j wants to access a shared variable, it must first acquire its lock. Only one processor at a time can acquire the lock; other interested processors must wait until j unlocks the variable.
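
As a concrete illustration (a minimal sketch using POSIX threads, not code from the slides; shared_counter and add_to_counter are made-up names):

#include <pthread.h>

static long shared_counter = 0;                            /* the shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* its lock */

void add_to_counter(long value)
{
    pthread_mutex_lock(&lock);     /* acquire: other processors wait here */
    shared_counter += value;       /* critical section: only one holder at a time */
    pthread_mutex_unlock(&lock);   /* release: one waiting processor may proceed */
}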



Classifying SMPs as to Memory Access

° UMA (Uniform Memory Access) multiprocessors, also called symmetric multiprocessors (SMPs): all memory accesses take the same time, independent of which processor makes the access and which word is accessed (2 - 64 procs).

° NUMA (Nonuniform Memory Access) multiprocessors: memory access times depend on which processor accesses which word (8 - 256 procs).

° It is harder to program NUMA machines efficiently, but they are more scalable.



Distributed Memory Processors (8 - 256 procs)
° Each processor has a local (private) memory.

° Coordination (communication) is physically achieved by message passing. The system has send / receive primitives (routines) for this. It is much harder to program and requires new programming paradigms.

° Events are triggered in a processor by the sending or the reception of a message.
[Figure: distributed-memory multiprocessor: processor/cache/memory nodes connected by a network.]


Clusters

° One can achieve multiprocessing by making a LAN (computers connected by a local area network) act as a single large multiprocessor.

° The switch-based LAN must provide high bandwidth between the computers in the cluster.



Keeping Up to Date

° Industry is changing rapidly in this field.

° There is no clear answer to all the key questions presented.

° Many answers depend strongly on application behavior.

° Some of the latest information in this field can be found at http://www.mkp.com/cod2e.htm.



9.2 Programming Multiprocessors

° The uniprocessors that compose multiprocessors have been cheap since the early 1980s.

° Communication interconnection network technology is well established.

° We have good programming languages for parallel program execution.



What's Difficult in Parallel and Distributed Processing?

° Identifying real (and important) applications that have inherent parallelism and can benefit from parallel processing.

° Writing parallel programs for these applications is harder than writing sequential programs.

° The challenge is even greater if we want the application to scale (maintain efficiency when the problem size increases and larger multiprocessors are used).

° As a result, parallel processing has had success when parallel programming specialists have developed a parallel subsystem that presents a sequential interface.



Why is Parallel Program Development so Hard?

° An analogy: for 10 people to work on a task (e.g., painting a house, writing an article, or developing SW), they must communicate and coordinate their work. There is no such overhead if one person does the task alone.

° The overhead is even greater if the number of people increases to 100, 1,000, or even 1,000,000. Because of this overhead, 1,000 people do not finish the task 1,000 times faster than a single person; n-fold speedup becomes especially unlikely as n increases.

° The programmer must know a good deal about the HW (number and types of processors, interconnection topology).
Problems of Parallel Program Development
° On a uniprocessor, the high-level language programmer writes the program largely ignoring the underlying machine organization.

° Uniprocessor design techniques for ILP (instruction-level parallelism) such as pipelining, superscalar issue, and out-of-order execution are transparent to the programmer; the compiler handles these issues.

° You must get good performance and efficiency from the parallel program on a multiprocessor; otherwise you would use a uniprocessor, since programming it is easier.

° Even small parts of a program must be parallelized for it to reach its full potential (see Amdahl's law in Chapter 2, and the worked example below); sequential parts are bottlenecks.
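
As a quick illustration of why sequential parts are bottlenecks (not on the original slide), Amdahl's law for a program in which a fraction f of the work can be parallelized over n processors gives:

\[
  \text{Speedup}(n) = \frac{1}{(1 - f) + \dfrac{f}{n}}
\]

For example, with f = 0.9 and n = 1000 processors the speedup is 1 / (0.1 + 0.0009), roughly 9.9, nowhere near 1000, because the 10% sequential part dominates.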



9.3 Multiprocessors Connected by a Single Bus

[Figure: single-bus multiprocessor (2 - 32 processors): processors with caches connected by a single bus to memory and I/O.]


Why use Single-bus Architectures?

° Low cost of microprocessors.

° Each microprocessor is much smaller than a multichip processor, so more processors can be placed on a bus.

° Local caches can lower bus traffic.

° There are mechanisms for keeping caches and memory consistent (similar to the consistency protocols used between caches and memory for I/O), which simplifies programming.

° The traffic per processor and the bus bandwidth determine the useful number of processors in such a multiprocessor.



Caches

Caches replicate data in their faster memories:

• to reduce the latency to the data, and
• to reduce memory traffic on the bus.



Example Parallel Program for Single Bus
° Sum 100000 numbers
° 10 processors
° Each processor sums 100000 / 10 = 10000 numbers
  • Variables
    - Pn: number of the local processor
    - A[100000]: array with the numbers to be summed
  • Each processor Pn executes (i is a private variable):

    sum[Pn] = 0;
    for (i = 10000 * Pn; i < 10000 * (Pn + 1); i++)
        sum[Pn] = sum[Pn] + A[i];

  • Load instructions bring the correct subset of numbers from main memory to the local caches.
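
For readers who want something compilable, here is a minimal sketch of the same computation with POSIX threads; this is only an illustration, not the textbook's code, and NPROCS, worker, and the test data are made up:

#include <pthread.h>
#include <stdio.h>

#define N      100000
#define NPROCS 10

static double A[N];                 /* numbers to be summed */
static double sum[NPROCS];          /* one partial sum per processor */

static void *worker(void *arg)
{
    long Pn = (long)arg;            /* number of the local processor */
    long lo = (N / NPROCS) * Pn;
    long hi = (N / NPROCS) * (Pn + 1);
    sum[Pn] = 0;
    for (long i = lo; i < hi; i++)
        sum[Pn] = sum[Pn] + A[i];
    return NULL;
}

int main(void)
{
    pthread_t t[NPROCS];
    double total = 0;

    for (long i = 0; i < N; i++)
        A[i] = 1.0;                 /* test data: the total should be 100000 */

    for (long p = 0; p < NPROCS; p++)
        pthread_create(&t[p], NULL, worker, (void *)p);
    for (long p = 0; p < NPROCS; p++)
        pthread_join(t[p], NULL);

    for (long p = 0; p < NPROCS; p++)
        total = total + sum[p];     /* sequential combine, just to check the result */
    printf("total = %f\n", total);
    return 0;
}

The per-processor loop is the same as on the slide; the partial sums still have to be combined in parallel, which is what the next slides address.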
Divide and Conquer to Add Partial Sums



Program that Adds Partial Sums

half = 10;                        /* 10 processors, 10 partial sums */
repeat
    synch();                      /* wait until every processor has finished its phase */
    if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half - 1];   /* odd count: P0 folds in the stray element */
    half = half / 2;              /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn + half];
until (half == 1);                /* exit with final sum in sum[0] */



Synchronization Primitives

° synch() is a barrier synchronization primitive: processors wait at the barrier until every processor has reached it. When this occurs, all may proceed.

° This function can be implemented either in SW with the lock synchronization primitive, or with special HW that combines each processor's ready signal into a single global signal that all processors can test. A software sketch follows below.
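
A minimal sketch of the software option, assuming POSIX threads and a known processor count NPROCS (illustrative, not the book's implementation):

#include <pthread.h>

#define NPROCS 10

static pthread_mutex_t barrier_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  barrier_cond = PTHREAD_COND_INITIALIZER;
static int arrived = 0;        /* processors that have reached the barrier */
static int generation = 0;     /* lets the same barrier be reused in a loop */

void synch(void)
{
    pthread_mutex_lock(&barrier_lock);
    int my_generation = generation;
    arrived = arrived + 1;
    if (arrived == NPROCS) {                     /* last one to arrive */
        arrived = 0;
        generation = generation + 1;
        pthread_cond_broadcast(&barrier_cond);   /* release everyone */
    } else {
        while (my_generation == generation)      /* wait for the barrier to open */
            pthread_cond_wait(&barrier_cond, &barrier_lock);
    }
    pthread_mutex_unlock(&barrier_lock);
}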



Cache Consistency

° In the example program, the part that adds the partial sums implies communication.

° It is convenient to have copies of the same data in multiple caches (sum[ ] in the example).

° There may be inconsistencies between data in memory and data in caches, and also among copies of the same data in different caches.

° Protocols that maintain coherency for multiprocessors are called cache coherency protocols.
Some Interesting Problems

° Cache Coherency

[Figure: single-bus multiprocessor with snooping caches: each cache has snoop tags alongside its cache tags and data, so it can monitor the bus while serving its processor.]

° The most popular protocol for maintaining cache coherency is called snooping: each cache monitors (snoops on) the bus to see whether it holds a copy of the block involved in each bus transaction.
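
As an illustration only (the protocol details are in the book, not on this slide), a rough write-invalidate snooping state machine for one cache block, using the classic Invalid / Shared / Modified states and assumed stub routines for the bus actions, might look like:

typedef enum { INVALID, SHARED, MODIFIED } block_state_t;

/* Assumed bus-side helpers, stubbed out so the sketch compiles. */
static void broadcast_invalidate(void) { /* tell other caches to drop their copies */ }
static void fetch_block_from_bus(void)  { /* read miss: get the block from memory or another cache */ }
static void write_block_back(void)      { /* supply or write back the dirty copy */ }

/* This cache's reaction to its own processor's access to the block. */
block_state_t on_cpu_access(block_state_t s, int is_write)
{
    if (is_write) {
        if (s != MODIFIED)
            broadcast_invalidate();       /* gain exclusive ownership */
        return MODIFIED;
    }
    if (s == INVALID)
        fetch_block_from_bus();
    return (s == MODIFIED) ? MODIFIED : SHARED;
}

/* This cache's reaction to a transaction it snoops from another processor. */
block_state_t on_snoop(block_state_t s, int other_is_write)
{
    if (s == MODIFIED)
        write_block_back();               /* someone else needs the latest value */
    if (other_is_write)
        return INVALID;                   /* our copy, if any, is now stale */
    return (s == INVALID) ? INVALID : SHARED;
}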



Synchronization Using Coherency

° Read page 724 in Section 9.3.



9.4 Multiprocessors Connected by a Network

° Three desirable bus characteristics:
  • high bandwidth
  • low latency
  • long length
  These are INCOMPATIBLE.

° There is also a limit to the bandwidth of a single memory module attached to a bus. The number of processors that can be connected to a single bus is therefore limited (at most about 36 processors).
Solution for More Demands on Communication

° Use more communication channels: a network connects the processor-memory modules.

[Figure: network-connected multiprocessor: processor/cache/memory nodes attached to a network.]


Example Network Topologies

[Figure: a. 2-D grid or mesh of 16 nodes; b. n-cube tree of 8 nodes (8 = 2^3, so n = 3).]


Example Network Topologies

[Figure: a. crossbar connecting processors P0 - P7; b. Omega network connecting P0 - P7; c. Omega network switch box with inputs A, B and outputs C, D.]


Distributed Memory vs. Shared Memory

° Shared memory: a single address space (implicit communication with loads and stores).

° Multiple private memories (address spaces): explicit communication with sends and receives.

° Distributed memory refers to the physical location of the memory: physical memory is divided into modules, with some placed near each processor.

° Centralized memory: the access time to a physical memory location is the same for all processors because all accesses go over the interconnect.
Many Combinations are Possible

° Multiprocessors can have a single address space and a distributed physical memory.

° Decisions:
  • single address space?
  • explicit communication?
  • distributed physical memory?

° In machines without a single global address space, communication is explicit: the programmer or compiler must send / receive messages to export / import data to / from other nodes.
Parallel Message Passing Program

° Sum 100000 numbers
° 100 processors
° Each processor sums 100000 / 100 = 1000 numbers
  • Variables
    - Pn: number of the local processor
    - A[100000]: array with the numbers to be summed
    - One processor containing the 100000 numbers sends a subset to each of the 100 processors.
  • Each processor Pn executes, to sum its subset:

    sum = 0;
    for (i = 0; i < 1000; i++)
        sum = sum + A[i];



Divide and Conquer to Add Partial Sums



Parallel Message Passing Program

° Communicate partial sums among neighbors through the network to accumulate the final sum.

° Divide and conquer to obtain the final sum in parallel.

° send(x, y): routine that sends the value y over the interconnection network to processor x.

° receive(): function that accepts a value from the network for this processor.

half = limit = 100;              /* 100 processors */
repeat
    half = (half + 1) / 2;       /* dividing line between receivers and senders */
    if (Pn >= half && Pn < limit) send(Pn - half, sum);
    if (Pn < (limit / 2)) sum = sum + receive();   /* receivers are Pn = 0 .. limit/2 - 1 */
    limit = half;                /* upper limit of senders for the next round */
until (half == 1);               /* final sum ends up at processor 0 */
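
For a concrete flavor of send / receive, here is a rough MPI version of the same reduction; it is only a sketch and assumes each process already holds its partial sum (faked here as Pn + 1):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int Pn, limit, half;
    double sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);      /* number of the local processor */
    MPI_Comm_size(MPI_COMM_WORLD, &limit);   /* total number of processors */

    sum = (double)(Pn + 1);                  /* stand-in for the local partial sum */

    half = limit;
    do {
        half = (half + 1) / 2;               /* dividing line between receivers and senders */
        if (Pn >= half && Pn < limit)
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit / 2) {
            double incoming;
            MPI_Recv(&incoming, 1, MPI_DOUBLE, Pn + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum = sum + incoming;
        }
        limit = half;
    } while (half != 1);

    if (Pn == 0) printf("final sum = %f\n", sum);
    MPI_Finalize();
    return 0;
}

MPI_Recv blocks until a matching message arrives, which matches the stalling behavior described on the next slide.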



Parallel Message Passing Program

° The program divides the processors into senders and receivers.

° Each receiving processor gets only one message per iteration, and a receiving processor will stall until it receives a message.

° send and receive can therefore be used as primitives for synchronization as well as for communication.



Addressing in Large-Scale Processors

° To be scalable, a multiprocessor must have distributed memory.

° For the HW designer it is easier to offer only send and receive.

° Message passing also tends to minimize communication, because the programmer, who knows the application's parallelism, must optimize it explicitly.

° It is possible to add a SW layer that provides a single address space on top of sends and receives, so that communication is possible using loads and stores (implicit communication): Distributed Shared Memory (DSM).

° DSM is comparable to the virtual memory system. It tries to detect spatial locality of data and is only efficient if shared-memory communication is rare; otherwise most of the time is spent transferring data.
Concluding Remarks

° Evolution vs. Revolution


“More often the expense of innovation comes from
being too disruptive to computer users”

[Figure: evolution-revolution spectrum of computer ideas, ranging from evolutionary to revolutionary and placing techniques such as cache, pipelining, virtual memory, microprogramming, RISC, timeshared multiprocessors, CC-UMA and CC-NUMA multiprocessors, not-CC-NUMA multiprocessors, message-passing multiprocessors, massive SIMD, and parallel processing multiprocessors along that axis.]

“Acceptance of hardware ideas requires acceptance by software people; therefore hardware people should learn about software. And if software people want good machines, they must learn more about hardware to be able to communicate with and thereby influence hardware engineers.”
