
Understanding Multiprocessor Systems

This document discusses multiprocessors and parallel programming. It introduces multiprocessors and their scalability, then covers shared-memory and distributed-memory multiprocessor architectures. Programming multiprocessors is difficult because programmers must explicitly manage coordination and synchronization between processors, and parallel programs are harder to develop than sequential programs because of the overhead of communication and coordination.


Chapter 9

Multiprocessors



9.1 Introduction
° Until now we have studied how to improve performance by exploiting parallelism inside the processor (instruction-level parallelism, ILP) with techniques such as pipelining, superscalar execution, VLIW, and dynamic scheduling.

° What if we try to connect many processors to build a machine with as much performance as we want?

° Multiprocessors must therefore be scalable: the hardware and software are designed to be sold with a variable number of processors. We want performance to improve proportionally to the increase in the number of processors.



Multiprocessors

° Idea: create powerful computers by connecting many smaller ones.

° Good news: works for timesharing (better than a supercomputer). Vector processing may be coming back.

° Bad news: it's really hard to write good concurrent programs. Many commercial failures.

[Figure: two organizations: (a) processors with private caches sharing a single bus to memory and I/O; (b) processor/cache/memory nodes connected by a network.]


Why Use Multiprocessors?

° The system is fault tolerant: if one processor fails, the system continues service with the others.

° Multiprocessors may offer the highest absolute performance: faster than the fastest uniprocessor.

° It is more cost effective to use many single-chip uniprocessors than to build a high-performance uniprocessor from a more exotic technology.

° Current file servers can be ordered with multiple processors.

° The database industry has standardized on multiprocessors.



Performance in Multiprocessors

° Commercial vendors usually define high performance as high throughput for independent tasks.

° Alternative: running a single task on multiple processors.

° Definition: a parallel processing program is a single program that runs on multiple processors simultaneously.



Key Questions

° How do parallel processors share data?
— single address space
— message passing

° How do parallel processors coordinate?
— synchronization (locks, semaphores)
— built into send / receive primitives

° How many processors should the system have?
— connected by a single bus (2 - 32 procs)
— connected by a network (8 - 256 procs)



Shared Memory Processors (SMPs)

° Processors with a single address space: all processors share a single memory address space.

° Processors communicate through shared variables in memory. All processors are capable of accessing any memory location with loads and stores.
[Figure: single-bus shared-memory multiprocessor: processors with caches connected by a single bus to memory and I/O.]


Coordinating Shared Memory Processors

° Synchronization: prevent one processor from starting to work on data before another processor is finished with it.

° When sharing is supported with a single address space, there must be a separate mechanism for synchronization.

° Example synchronization mechanism: the lock. When a processor j wants to access a shared variable, it must first acquire its lock. Only one processor at a time can acquire the lock; other interested processors must wait until j unlocks the variable.
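
As a concrete illustration (a minimal sketch using POSIX threads, not code from the slides; shared_counter and add_to_counter are made-up names):

#include <pthread.h>

static long shared_counter = 0;                            /* the shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* its lock */

void add_to_counter(long value)
{
    pthread_mutex_lock(&lock);     /* acquire: other processors wait here */
    shared_counter += value;       /* critical section: only one holder at a time */
    pthread_mutex_unlock(&lock);   /* release: one waiting processor may proceed */
}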



Classifying SMPs as to Memory Access

° UMA (Uniform Memory Access) multiprocessors, also called symmetric multiprocessors (SMPs): all memory accesses take the same time, independent of which processor makes the access and which word is accessed (2 - 64 procs).

° NUMA (Nonuniform Memory Access) multiprocessors: memory access times depend on which processor accesses which word (8 - 256 procs).

° It is harder to program NUMA machines efficiently, but they are more scalable.



Distributed Memory Processors (8 - 256 procs)
° Each processor has a local (private) memory.

° Coordination (communication) is physically achieved by message passing. The system has send / receive primitives (routines) for this. It is much harder to program and requires new programming paradigms.

° Events are triggered in a processor by the sending or the reception of a message.
[Figure: distributed-memory multiprocessor: processor/cache/memory nodes connected by a network.]


Clusters

° One can achieve multiprocessing by making a LAN (computers connected by a local area network) act as a single large multiprocessor.

° The switch-based LAN must provide high bandwidth between the computers in the cluster.



Keeping Up to Date

° Industry is changing rapidly in this field.

° There is no clear answer to all the key questions presented.

° Many answers depend strongly on application behavior.

° Some of the latest information in this field can be found at http://www.mkp.com/cod2e.htm.



9.2 Programming Multiprocessors

° The uniprocessors that compose multiprocessors have been cheap since the early 1980s.

° Communication interconnection network technology is well established.

° We have good programming languages for parallel program execution.



What's Difficult in Parallel and Distributed Processing?

° Identifying real (and important) applications that have inherent parallelism and can benefit from parallel processing.

° Writing parallel programs for these applications is harder than writing sequential programs.

° The challenge is even greater if we want the application to scale (maintain efficiency when the problem size increases and larger multiprocessors are used).

° As a result, parallel processing has had success when parallel programming specialists have developed a parallel subsystem that presents a sequential interface.



Why is Parallel Program Development so Hard?

° An analogy: for 10 people to work on a task (e.g., painting a house, writing an article, or developing SW), they must communicate and coordinate their work. There is no such overhead if one person does the task alone.

° The overhead is even greater if the number of people increases to 100, 1,000, or even 1,000,000. Because of this overhead, 1,000 people do not finish the task 1,000 times faster than a single person; n-fold speedup becomes especially unlikely as n increases.

° The programmer must know a good deal about the HW (number and types of processors, interconnection topology).
Problems of Parallel Program Development
° On a uniprocessor, the high-level language programmer writes the program largely ignoring the underlying machine organization.

° Uniprocessor design techniques for ILP (instruction-level parallelism) such as pipelining, superscalar issue, and out-of-order execution are transparent to the programmer; the compiler handles these issues.

° You must get good performance and efficiency from the parallel program on a multiprocessor; otherwise you would use a uniprocessor, since programming it is easier.

° Even small parts of a program must be parallelized for it to reach its full potential (see Amdahl's law in Chapter 2, and the worked example below); sequential parts are bottlenecks.
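
As a quick illustration of why sequential parts are bottlenecks (not on the original slide), Amdahl's law for a program in which a fraction f of the work can be parallelized over n processors gives:

\[
  \text{Speedup}(n) = \frac{1}{(1 - f) + \dfrac{f}{n}}
\]

For example, with f = 0.9 and n = 1000 processors the speedup is 1 / (0.1 + 0.0009), roughly 9.9, nowhere near 1000, because the 10% sequential part dominates.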



9.3 Multiprocessors Connected by a Single Bus

[Figure: single-bus multiprocessor (2 - 32 processors): processors with caches connected by a single bus to memory and I/O.]


Why use Single-bus Architectures?

° Low cost of microprocessors.

° Each microprocessor is much smaller than a multichip processor, so more processors can be placed on a bus.

° Local caches can lower bus traffic.

° There are mechanisms for keeping caches and memory consistent (similar to the consistency protocols used between caches and memory for I/O), which simplifies programming.

° The traffic per processor and the bus bandwidth determine the useful number of processors in such a multiprocessor.



Caches

Caches replicate data in their faster memories:

• to reduce the latency to the data, and
• to reduce memory traffic on the bus.



Example Parallel Program for Single Bus
° Sum 100000 numbers
° 10 processors
° Each processor sums 100000 / 10 = 10000 numbers
  • Variables
    - Pn: number of the local processor
    - A[100000]: array with the numbers to be summed
  • Each processor Pn executes (i is a private variable):

    sum[Pn] = 0;
    for (i = 10000 * Pn; i < 10000 * (Pn + 1); i++)
        sum[Pn] = sum[Pn] + A[i];

  • Load instructions bring the correct subset of numbers from main memory to the local caches.
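
For readers who want something compilable, here is a minimal sketch of the same computation with POSIX threads; this is only an illustration, not the textbook's code, and NPROCS, worker, and the test data are made up:

#include <pthread.h>
#include <stdio.h>

#define N      100000
#define NPROCS 10

static double A[N];                 /* numbers to be summed */
static double sum[NPROCS];          /* one partial sum per processor */

static void *worker(void *arg)
{
    long Pn = (long)arg;            /* number of the local processor */
    long lo = (N / NPROCS) * Pn;
    long hi = (N / NPROCS) * (Pn + 1);
    sum[Pn] = 0;
    for (long i = lo; i < hi; i++)
        sum[Pn] = sum[Pn] + A[i];
    return NULL;
}

int main(void)
{
    pthread_t t[NPROCS];
    double total = 0;

    for (long i = 0; i < N; i++)
        A[i] = 1.0;                 /* test data: the total should be 100000 */

    for (long p = 0; p < NPROCS; p++)
        pthread_create(&t[p], NULL, worker, (void *)p);
    for (long p = 0; p < NPROCS; p++)
        pthread_join(t[p], NULL);

    for (long p = 0; p < NPROCS; p++)
        total = total + sum[p];     /* sequential combine, just to check the result */
    printf("total = %f\n", total);
    return 0;
}

The per-processor loop is the same as on the slide; the partial sums still have to be combined in parallel, which is what the next slides address.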
Divide and Conquer to Add Partial Sums



Program that Adds Partial Sums

half = 10;                        /* 10 processors, 10 partial sums */
repeat
    synch();                      /* wait until every processor has finished its phase */
    if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half - 1];   /* odd count: P0 folds in the stray element */
    half = half / 2;              /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn + half];
until (half == 1);                /* exit with final sum in sum[0] */



Synchronization Primitives

° synch() is a barrier synchronization primitive: processors wait at the barrier until every processor has reached it. When this occurs, all may proceed.

° This function can be implemented either in SW with the lock synchronization primitive, or with special HW that combines each processor's ready signal into a single global signal that all processors can test. A software sketch follows below.
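
A minimal sketch of the software option, assuming POSIX threads and a known processor count NPROCS (illustrative, not the book's implementation):

#include <pthread.h>

#define NPROCS 10

static pthread_mutex_t barrier_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  barrier_cond = PTHREAD_COND_INITIALIZER;
static int arrived = 0;        /* processors that have reached the barrier */
static int generation = 0;     /* lets the same barrier be reused in a loop */

void synch(void)
{
    pthread_mutex_lock(&barrier_lock);
    int my_generation = generation;
    arrived = arrived + 1;
    if (arrived == NPROCS) {                     /* last one to arrive */
        arrived = 0;
        generation = generation + 1;
        pthread_cond_broadcast(&barrier_cond);   /* release everyone */
    } else {
        while (my_generation == generation)      /* wait for the barrier to open */
            pthread_cond_wait(&barrier_cond, &barrier_lock);
    }
    pthread_mutex_unlock(&barrier_lock);
}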



Cache Consistency

° In the example program, the part that adds the partial sums implies communication.

° It is convenient to have copies of the same data in multiple caches (sum[ ] in the example).

° There may be inconsistencies between data in memory and data in caches, and also among copies of the same data in different caches.

° Protocols that maintain coherency for multiprocessors are called cache coherency protocols.
Some Interesting Problems

° Cache Coherency

[Figure: single-bus multiprocessor with snooping caches: each cache has snoop tags alongside its cache tags and data, so it can monitor the bus while serving its processor.]

° The most popular protocol for maintaining cache coherency is called snooping: each cache monitors (snoops on) the bus to see whether it holds a copy of the block involved in each bus transaction.
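
As an illustration only (the protocol details are in the book, not on this slide), a rough write-invalidate snooping state machine for one cache block, using the classic Invalid / Shared / Modified states and assumed stub routines for the bus actions, might look like:

typedef enum { INVALID, SHARED, MODIFIED } block_state_t;

/* Assumed bus-side helpers, stubbed out so the sketch compiles. */
static void broadcast_invalidate(void) { /* tell other caches to drop their copies */ }
static void fetch_block_from_bus(void)  { /* read miss: get the block from memory or another cache */ }
static void write_block_back(void)      { /* supply or write back the dirty copy */ }

/* This cache's reaction to its own processor's access to the block. */
block_state_t on_cpu_access(block_state_t s, int is_write)
{
    if (is_write) {
        if (s != MODIFIED)
            broadcast_invalidate();       /* gain exclusive ownership */
        return MODIFIED;
    }
    if (s == INVALID)
        fetch_block_from_bus();
    return (s == MODIFIED) ? MODIFIED : SHARED;
}

/* This cache's reaction to a transaction it snoops from another processor. */
block_state_t on_snoop(block_state_t s, int other_is_write)
{
    if (s == MODIFIED)
        write_block_back();               /* someone else needs the latest value */
    if (other_is_write)
        return INVALID;                   /* our copy, if any, is now stale */
    return (s == INVALID) ? INVALID : SHARED;
}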



Synchronization Using Coherency

° Read page 724 in Section 9.3.



9.4 Multiprocessors Connected by a Network

° Three desirable bus characteristics:
  • high bandwidth
  • low latency
  • long length
  These are INCOMPATIBLE.

° There is also a limit to the bandwidth of a single memory module attached to a bus. The number of processors that can be connected to a single bus is therefore limited (at most about 36 processors).
Solution for More Demands on Communication

° Use more communication channels: a network connects the processor-memory modules.

[Figure: network-connected multiprocessor: processor/cache/memory nodes attached to a network.]


Example Network Topologies

[Figure: a. 2-D grid or mesh of 16 nodes; b. n-cube tree of 8 nodes (8 = 2^3, so n = 3).]


Example Network Topologies

[Figure: a. crossbar connecting processors P0 - P7; b. Omega network connecting P0 - P7; c. Omega network switch box with inputs A, B and outputs C, D.]


Distributed Memory vs. Shared Memory

° Shared memory: a single address space (implicit communication with loads and stores).

° Multiple private memories (address spaces): explicit communication with sends and receives.

° Distributed memory refers to the physical location of the memory: physical memory is divided into modules, with some placed near each processor.

° Centralized memory: the access time to a physical memory location is the same for all processors because all accesses go over the interconnect.
Many Combinations are Possible

° Multiprocessors can have a single address space and a distributed physical memory.

° Decisions:
  • single address space?
  • explicit communication?
  • distributed physical memory?

° In machines without a single global address space, communication is explicit: the programmer or compiler must send / receive messages to export / import data to / from other nodes.
Parallel Message Passing Program

° Sum 100000 numbers
° 100 processors
° Each processor sums 100000 / 100 = 1000 numbers
  • Variables
    - Pn: number of the local processor
    - A[100000]: array with the numbers to be summed
    - One processor containing the 100000 numbers sends a subset to each of the 100 processors.
  • Each processor Pn executes, to sum its subset:

    sum = 0;
    for (i = 0; i < 1000; i++)
        sum = sum + A[i];



Divide and Conquer to Add Partial Sums



Parallel Message Passing Program

° Communicate partial sums among neighbors through the network to accumulate the final sum.

° Divide and conquer to obtain the final sum in parallel.

° send(x, y): routine that sends the value y over the interconnection network to processor x.

° receive(): function that accepts a value from the network for this processor.

half = limit = 100;              /* 100 processors */
repeat
    half = (half + 1) / 2;       /* dividing line between receivers and senders */
    if (Pn >= half && Pn < limit) send(Pn - half, sum);
    if (Pn < (limit / 2)) sum = sum + receive();   /* receivers are Pn = 0 .. limit/2 - 1 */
    limit = half;                /* upper limit of senders for the next round */
until (half == 1);               /* final sum ends up at processor 0 */
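
For a concrete flavor of send / receive, here is a rough MPI version of the same reduction; it is only a sketch and assumes each process already holds its partial sum (faked here as Pn + 1):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int Pn, limit, half;
    double sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);      /* number of the local processor */
    MPI_Comm_size(MPI_COMM_WORLD, &limit);   /* total number of processors */

    sum = (double)(Pn + 1);                  /* stand-in for the local partial sum */

    half = limit;
    do {
        half = (half + 1) / 2;               /* dividing line between receivers and senders */
        if (Pn >= half && Pn < limit)
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit / 2) {
            double incoming;
            MPI_Recv(&incoming, 1, MPI_DOUBLE, Pn + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum = sum + incoming;
        }
        limit = half;
    } while (half != 1);

    if (Pn == 0) printf("final sum = %f\n", sum);
    MPI_Finalize();
    return 0;
}

MPI_Recv blocks until a matching message arrives, which matches the stalling behavior described on the next slide.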



Parallel Message Passing Program

° The program divides the processors into senders and receivers.

° Each receiving processor gets only one message per iteration, and a receiving processor will stall until it receives a message.

° send and receive can therefore be used as primitives for synchronization as well as for communication.



Addressing in Large-Scale Processors

° To be scalable, a multiprocessor must have distributed memory.

° For the HW designer it is easier to offer only send and receive.

° Message passing also tends to minimize communication, because the programmer, who knows the application's parallelism, must optimize it explicitly.

° It is possible to add a SW layer that provides a single address space on top of sends and receives, so that communication is possible using loads and stores (implicit communication): Distributed Shared Memory (DSM).

° DSM is comparable to the virtual memory system. It tries to detect spatial locality of data and is only efficient if shared-memory communication is rare; otherwise most of the time is spent transferring data.
Concluding Remarks

° Evolution vs. Revolution


“More often the expense of innovation comes from
being too disruptive to computer users”

[Figure: evolution-revolution spectrum of computer ideas, ranging from evolutionary to revolutionary and placing techniques such as cache, pipelining, virtual memory, microprogramming, RISC, timeshared multiprocessors, CC-UMA and CC-NUMA multiprocessors, not-CC-NUMA multiprocessors, message-passing multiprocessors, massive SIMD, and parallel processing multiprocessors along that axis.]

“Acceptance of hardware ideas requires acceptance by software people; therefore hardware people should learn about software. And if software people want good machines, they must learn more about hardware to be able to communicate with and thereby influence hardware engineers.”
