
Pipeline And Vector Processing

Parallel Processing

Execution of concurrent events in the computing process to achieve faster computational speed.

The purpose of parallel processing is to speed up the computer's processing capability and increase its throughput, that is, the amount of processing that can be accomplished during a given interval of time.

The amount of hardware increases with parallel processing, and with it, the cost of the system.

However, technological developments have reduced hardware costs to the point where parallel processing techniques are economically feasible.
Parallel processing according to levels of complexity:

At the lower level: serial shift registers vs. parallel-load registers.

At the higher level: a multiplicity of functional units that perform identical or different operations simultaneously.
Parallel Computers
SISD COMPUTER SYSTEMS
Von Neumann Architecture
MISD COMPUTER SYSTEMS
SIMD COMPUTER SYSTEMS
MIMD COMPUTER SYSTEMS
PIPELINING

A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.

A pipeline can be visualized as a collection of processing segments through which binary information flows.

The name “pipeline” implies a flow of information analogous to an industrial assembly line.
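The assembly-line behavior above can be sketched in a few lines; this is an illustrative simulation of our own (not from the text), showing that a k-segment pipeline completes n tasks in k + n - 1 clock cycles because the segments operate concurrently.

```python
# Minimal sketch of a k-segment pipeline processing n tasks: at every
# clock, each segment passes its task to the next, so task i completes
# in clock i + k - 1 and all n tasks complete in k + n - 1 clocks.

def pipeline_clocks(k: int, n: int) -> int:
    """Clock cycles for n tasks to flow through a k-segment pipeline."""
    segments = [None] * k          # what each segment holds this clock
    done = clocks = 0
    next_task = 1
    while done < n:
        clocks += 1
        segments = [None] + segments[:-1]   # every segment advances
        if next_task <= n:
            segments[0] = next_task         # a new task enters segment 1
            next_task += 1
        if segments[-1] is not None:        # segment k finishes its task
            done += 1
    return clocks

print(pipeline_clocks(4, 6))   # -> 9, matching k + n - 1 = 4 + 6 - 1
```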
Example of the Pipeline Organization
OPERATIONS IN EACH PIPELINE STAGE
GENERAL PIPELINE
Speedup ratio of pipeline
PIPELINE AND MULTIPLE FUNCTION UNITS
ARITHMETIC PIPELINE

See the example in P. 310


INSTRUCTION CYCLE
INSTRUCTION PIPELINE
INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE
Pipeline
Space time diagram
MAJOR HAZARDS IN PIPELINED EXECUTION

Structural hazards (resource conflicts): the hardware resources required by the instructions' simultaneous overlapped execution cannot be met.

Data hazards (data dependency conflicts): an instruction scheduled to be executed in the pipeline requires the result of a previous instruction, which is not yet available.

Control hazards (branch difficulties): branches and other instructions that change the PC delay the fetch of the next instruction.
Data hazards

Control hazards
STRUCTURAL HAZARDS

Occur when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.

Example:
With one memory port, a data fetch and an instruction fetch cannot be initiated in the same clock.

The pipeline is stalled for a structural hazard: two loads with a one-port memory stall the pipeline, whereas a two-port memory serves both without a stall.
DATA HAZARDS
FORWARDING HARDWARE
INSTRUCTION SCHEDULING
CONTROL HAZARDS
VECTOR PROCESSING

There is a class of computational problems that are beyond the capabilities of conventional computers. These problems are characterized by the fact that they require a vast number of computations that would take a conventional computer days or even weeks to complete.
VECTOR PROCESSING
VECTOR PROGRAMMING
VECTOR INSTRUCTIONS
Matrix Multiplication

The multiplication of two n×n matrices consists of n² inner products or n³ multiply-add operations.

Example: product of two 3×3 matrices

c11 = a11b11 + a12b21 + a13b31

This requires 3 multiplications and 3 additions. The total number of multiply-adds required to compute the matrix product is 9 × 3 = 27.

In general, an inner product consists of the sum of k product terms of the form

C = A1B1 + A2B2 + A3B3 + … + AkBk

In a pipeline, this sum can be accumulated as four interleaved partial sums:

C = A1B1 + A5B5 + A9B9 + A13B13 + …
  + A2B2 + A6B6 + A10B10 + A14B14 + …
  + A3B3 + A7B7 + A11B11 + A15B15 + …
  + A4B4 + A8B8 + A12B12 + A16B16 + …
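The four-way grouping above can be sketched as follows; this is an illustrative sequential model of our own (function name ours), in which each of the four partial sums corresponds to one stream of terms flowing through a four-segment multiply-add pipeline.

```python
# Sketch of the interleaved inner product: term j is accumulated into
# partial sum j mod 4, and the four partial sums are added at the end.

def inner_product_interleaved(A, B):
    assert len(A) == len(B)
    partial = [0, 0, 0, 0]
    for j, (a, b) in enumerate(zip(A, B)):
        partial[j % 4] += a * b      # term j goes to partial sum j mod 4
    return sum(partial)

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [8, 7, 6, 5, 4, 3, 2, 1]
print(inner_product_interleaved(A, B))   # -> 120
```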
VECTOR INSTRUCTION FORMAT
MULTIPLE MEMORY MODULE AND INTERLEAVING
MULTIPLE MEMORY MODULE AND INTERLEAVING
MULTIPLE MEMORY MODULE AND INTERLEAVING
ARRAY PROCESSOR
attached array processor with host computer
SIMD array processor Organization
Don’t forget: try to solve the questions of the chapter.
Chapter 11

Input-Output
Organization
Input-Output Organization
Important Terms
Peripherals

ASCII (American Standard Code for Information Interchange) Alphanumeric Characters


Derive the actual speedup ratio for the pipeline system,
and then list the reasons that prevent the pipeline from
obtaining maximum speedup ratio.

A digital computer has a memory unit of 128K×18 and a cache memory of 2K words. The cache uses direct mapping with a block size of four words. (a) How many bits are there in the tag, index, block, and word fields of the address format? (b) How many bits are there in each word of cache? (c) How many blocks can the cache accommodate?
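A hedged worked sketch of the exercise above (the helper function is ours, not from the text): with a 128K-word memory the CPU address is 17 bits, the 2K-word cache fixes an 11-bit index, and a 4-word block splits the index into block and word fields.

```python
# Direct-mapping field calculation: address = tag + index,
# index = block + word; each cache word stores data bits + tag bits.
import math

def direct_map_fields(mem_words, word_bits, cache_words, block_words):
    addr = int(math.log2(mem_words))      # CPU address bits
    index = int(math.log2(cache_words))   # index field
    tag = addr - index                    # tag field
    word = int(math.log2(block_words))    # word-within-block field
    block = index - word                  # block field
    return {"tag": tag, "index": index, "block": block, "word": word,
            "cache_word_bits": word_bits + tag,       # data + tag
            "cache_blocks": cache_words // block_words}

print(direct_map_fields(128 * 1024, 18, 2 * 1024, 4))
# tag 6, index 11, block 9, word 2; 24 bits per cache word; 512 blocks
```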
ISOLATED VERSUS MEMORY MAPPED I/O

Many computers use one common bus to transfer information between memory or I/O and the CPU.

The distinction between a memory transfer and an I/O transfer is made through separate read and write lines. The CPU specifies whether the address on the address lines is for a memory word or for an interface register by enabling one of two possible read or write lines.

The I/O read and I/O write control lines are enabled during an I/O transfer, and the memory read and memory write control lines are enabled during a memory transfer.

This configuration isolates all I/O interface addresses from the addresses assigned to memory and is referred to as the isolated I/O method for assigning addresses in a common bus.
ISOLATED I/O
In the isolated I/O configuration, the CPU has distinct input and output instructions, and each of these instructions is associated with the address of an interface register.

When the CPU fetches and decodes the operation code of an input or
output instruction, it places the address associated with the instruction
into the common address lines.

At the same time, it enables the I/O read (for input) or I/O write (for output)
control line.
This informs the external components that are attached to the common
bus that the address in the address lines is for an interface register and
not for a memory word.

When the CPU is fetching an instruction or an operand from memory, it places the memory address on the address lines and enables the memory read or memory write control lines. This informs the external components that the address is for a memory word and not for an I/O interface.
MEMORY MAPPED I/O
The isolated I/O method isolates memory and I/O addresses so that memory address values are not affected by interface address assignment, since each has its own address space.
Memory-mapped I/O uses the same address space for both memory and I/O.

This is the case in computers that employ only one set of read and write signals and do not distinguish between memory and I/O addresses.
The computer treats an interface register as being part of the memory system.
The assigned addresses for interface registers cannot be used for memory words, which reduces the memory address range available.
In memory mapped I/O organization, there are no specific inputs or output
instructions. The CPU can manipulate I/O data residing in interface registers with
the same instructions that are used to manipulate memory words.
Typically, a segment of the total address space is reserved for interface registers,
but in general, they can be located at any address as long as there is not also a
memory word that responds to the same address.
It allows the computer to use the same instructions for either input-output
transfers or for memory transfers.
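The idea above can be illustrated with a toy model (entirely ours — the addresses and register names are hypothetical): one address space serves both memory words and an interface register, so the very same load and store operations reach either, with no special IN/OUT instructions.

```python
# Toy memory-mapped I/O: the address decoder routes one address to a
# device register and all others to ordinary memory.

MEMORY_SIZE = 0x100
UART_DATA = 0x1F0            # hypothetical interface-register address

memory = [0] * MEMORY_SIZE
uart_register = [0]

def store(addr, value):
    if addr == UART_DATA:    # address decodes to the interface register
        uart_register[0] = value
    else:                    # ordinary memory word
        memory[addr] = value

def load(addr):
    return uart_register[0] if addr == UART_DATA else memory[addr]

store(0x10, 42)        # a memory transfer
store(UART_DATA, 65)   # an I/O transfer -- same operation, no IN/OUT
print(load(0x10), load(UART_DATA))   # -> 42 65
```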
There are two ways to achieve this transfer:

1. Strobe
2. Handshaking
[Figure: programmed I/O with a flag bit F — the interface between the I/O device and the CPU holds a data register and a status register on the I/O bus; the CPU reads the status register, checks the flag bit, and if F = 1 reads the data register and transfers the data to memory, repeating until the operation is complete.]

Characteristics of programmed I/O:
• Continuous CPU involvement
• CPU slowed down to I/O speed
• Simple to program
• Least hardware
- Polling takes valuable CPU time
- Open communication only when some data has to be passed -> interrupt
- The I/O interface, instead of the CPU, monitors the I/O device
- When the interface determines that the I/O device is ready for data transfer, it generates an interrupt request to the CPU
- Upon detecting an interrupt, the CPU momentarily stops the task it is doing, branches to the service routine to process the data transfer, and then returns to the task it was performing

1) Non-vectored: fixed branch address
2) Vectored: the interrupt source supplies the branch address (interrupt vector)
Software Considerations
• I/O routines
– software routines for controlling peripherals and for transfer of data
between the processor and peripherals
• I/O routines for standard peripherals are provided by the manufacturer
(Device driver, OS or BIOS)
• I/O routines are usually included within the operating system
• I/O routines are usually available as operating system procedures ( OS
or BIOS function call).

Priority Interrupt
• Identify the source of the interrupt when several sources will request
service simultaneously
• Determine which condition is to be serviced first when two or more
requests arrive simultaneously
• Techniques used:
– 1) Software : Polling
– 2) Hardware : Daisy chain, Parallel priority
Priority Interrupt by Software (Polling)
- Priority is established by the order of polling the devices (interrupt sources)
- Flexible, since the order is established by software
- Low cost, since it needs very little hardware
- Very slow

Priority Interrupt by Hardware

- Requires a priority interrupt manager which accepts all the interrupt requests and determines the highest-priority request
- Fast, since identification of the highest-priority interrupt request is done by hardware
- Fast, since each interrupt source has its own interrupt vector to access its own service routine directly
[Figure: daisy-chain priority — devices 1, 2, and 3 are chained through PI/PO lines, with vector addresses VAD 1–3 placed on the processor data bus; their common interrupt request drives INT at the CPU, and the CPU returns INTACK along the chain to the highest-priority requesting device.]
One stage of the daisy-chain priority arrangement

[Figure: one stage of the daisy chain — the device's interrupt request sets an RS flip-flop RF; the incoming priority line PI and RF determine the priority-out line PO, the vector-address (VAD) enable, and the interrupt request to the CPU through an open-collector inverter.]

PI RF | PO Enable
 0  0 |  0   0    No interrupt request
 0  1 |  0   0    Invalid: interrupt request, but no acknowledge
 1  0 |  1   0    No request: acknowledge passes to the next device
 1  1 |  0   1    Interrupt request acknowledged by this stage
[Figure: parallel priority interrupt hardware — four sources (disk I0, printer I1, reader I2, keyboard I3) set bits in an interrupt register; after masking by a mask register, a priority encoder forms the vector address (VAD) sent to the CPU, and the encoder output sets the interrupt status flip-flop; the CPU replies with INTACK.]

Interrupt enable flip-flop (IEN): set or cleared by the program.

Interrupt status flip-flop (IST): set or cleared by the encoder output; together with IEN it gates the interrupt signal to the CPU.
Initial Operation of ISR

1) Clear lower-level mask register bits
2) Clear interrupt status bit IST
3) Save contents of processor registers
4) Set interrupt enable bit IEN
5) Proceed with service routine

Final Operation of ISR

1) Clear interrupt enable bit IEN
2) Restore contents of processor registers
3) Clear the bit in the interrupt register belonging to the source that has been serviced
4) Set lower-level priority bits in the mask register
5) Restore return address into PC and set IEN
DIRECT MEMORY ACCESS (DMA)

The transfer of data between a fast storage device such as a magnetic disk and memory is often limited by the speed of the CPU.

Removing the CPU from the path and letting the peripheral device
manage the memory buses directly would improve the speed of
transfer.

This transfer technique is called direct memory access (DMA).

During DMA transfer, the CPU is idle and has no control of the memory
buses. A DMA controller takes over the buses to manage the transfer
directly between the I/O device and memory.
CPU bus signals for DMA transfer

There are two modes of transfer: burst transfer and cycle stealing.
Block Diagram of DMA Controller
DMA Transfer in a Computer System
Block diagram of a computer with I/O processor
Solve the following question

The time delays of the five segments in a certain pipeline are as follows:

t1 = 30 ns, t2 = 70 ns, t3 = 70 ns, t4 = 25 ns, t5 = 35 ns.

The interface register delay time is tr = 5 ns.

How long would it take to add 100 pairs of numbers in the pipeline?
Derive the actual speedup ratio for the pipeline system,
and then list the reasons that prevent the pipeline from
obtaining maximum speedup ratio.
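A hedged worked sketch for the question above, using the standard relations tp = max(ti) + tr and T = (k + n − 1)·tp. Conventions differ on whether the non-pipelined time includes the register delay; we take tn = Σti, so the result is an assumption-dependent estimate, not the text's official answer.

```python
# Pipeline timing for 100 pairs through 5 segments.
t = [30, 70, 70, 25, 35]     # segment delays in ns
tr = 5                       # interface-register delay in ns
n = 100                      # pairs of numbers

tp = max(t) + tr                      # pipeline clock: 70 + 5 = 75 ns
T_pipe = (len(t) + n - 1) * tp        # (5 + 100 - 1) * 75 = 7800 ns
T_seq = n * sum(t)                    # non-pipelined: 100 * 230 ns
speedup = T_seq / T_pipe              # actual speedup ratio

print(tp, T_pipe, round(speedup, 2))  # -> 75 7800 2.95
```

The speedup falls short of the maximum (sum(t)/tp ≈ 3.07, and of the segment count k = 5) because the clock is set by the slowest segment plus register delay, and the first k − 1 clocks are spent filling the pipe.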

A digital computer has a memory unit of 256K×16 and a cache memory of 2K words. The cache uses direct mapping with a block size of four words. (a) How many bits are there in the tag, index, block, and word fields of the address format? (b) How many bits are there in each word of cache? (c) How many blocks can the cache accommodate?
List five of the main applications that may use vector processing, and draw the instruction format for the vector processor, defining each of its fields.
Memory Organization

12.1 Memory Hierarchy


12.2 Main Memory
12.3 Auxiliary Memory
12.4 Associative Memory
12.5 Cache Memory
12.6 Virtual Memory
12.7 Memory management hardware
Memory Hierarchy

The overall goal of using a memory hierarchy is to obtain the highest possible average access speed while minimizing the total cost of the entire memory system.

Multiprogramming refers to the existence of many programs in different parts of main memory at the same time.
Main memory
ROM Chip
Memory Address Map

The designer of a computer system must calculate the amount of memory required for the particular application and assign it to either RAM or ROM.

The interconnection between memory and processor is then established from knowledge of the size of memory needed and the types of RAM and ROM chips available.

The addressing of memory can be established by means of a table that specifies the memory address assigned to each chip.

The table, called a memory address map, is a pictorial representation of the assigned address space for each chip in the system.

Memory Configuration (case study):

Required: 512 bytes of ROM and 512 bytes of RAM
Available chips: 512×8 ROM and 128×8 RAM
Memory Address Map
[Figure: memory connections to the CPU — the CPU address and data buses connect to four 128×8 RAM chips and one 512×8 ROM chip. Address lines 1–7 (AD7) address within each RAM chip; a decoder on lines 8–9 drives the chip-select input (CS1) of one RAM chip at a time; a further line distinguishes RAM from ROM via CS2; the ROM uses address lines 1–9 (AD9); the RD and WR lines control reading and writing.]
Associative Memory
Associative Memory

The time required to find an item stored in memory can be reduced considerably if stored data can be identified for access by the content of the data itself rather than by an address.

A memory unit accessed by content is called an associative memory or content-addressable memory (CAM). This type of memory is accessed simultaneously and in parallel on the basis of data content rather than by specific address or location.

When a word is written in an associative memory, no address is given. The memory is capable of finding an empty unused location to store the word. When a word is to be read from an associative memory, the content of the word, or part of the word, is specified.

The associative memory is uniquely suited to parallel searches by data association. Moreover, searches can be done on an entire word or on a specific field within a word. Associative memories are used in applications where the search time is very critical and must be very short.
Hardware Organization

[Figure: hardware organization — an argument register (A) and a key register (K), each n bits wide, feed an associative memory array of m words × n bits with read/write control; an m-bit match register M flags matching words; input and output data lines connect to the array.]

Associative memory of m words, n cells per word:

[Figure: cell Cij holds bit j of word i; argument bits A1 … Aj … An and key bits K1 … Kj … Kn are broadcast to every column, and each word i (i = 1 … m) produces a match output Mi.]

One Cell of Associative Memory:

[Figure: each cell stores its bit in an RS flip-flop Fij, written from the input line when Write is enabled and gated to the output when Read is enabled; match logic compares Fij with Aj under control of Kj and contributes to Mi.]
Match logic

Neglect the K bits and compare the argument in A with the bits stored in the cells of the words. Two bits are equal if they are both 1 or both 0, so define

xj = Aj Fij + Aj' Fij'

where xj = 1 when bit j of the argument equals bit j of word i. For word i to be equal to the argument in A, all xj variables must equal 1. This is the condition for setting the corresponding match bit Mi to 1:

Mi = x1 x2 x3 … xn

Now include the key bit Kj in the comparison logic. The requirement is that if Kj = 0, the corresponding bits of Aj and Fij need no comparison; only when Kj = 1 must they be compared. This requirement is achieved by ORing each term with Kj':

Mi = (x1 + K1')(x2 + K2') … (xn + Kn')

Substituting the original definition of xj, the match logic for word i in an associative memory can be expressed by the Boolean function

Mi = ∏ (Aj Fij + Aj' Fij' + Kj'),  j = 1 … n

where ∏ is a product symbol designating the AND operation of all n terms.

[Figure: match logic circuit — for each bit j, Kj and Aj drive comparison gates against Fij and Fij'; the n term outputs are ANDed together to form Mi.]
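The match function can be evaluated directly in software; this sketch (ours, bit-list representation assumed) computes Mi for one stored word, showing how a zero key bit masks its column out of the comparison.

```python
# Evaluate Mi = product over j of (Aj*Fij + Aj'*Fij' + Kj')
# for one stored word F against argument A under key K.

def match(A, K, F):
    """A, K, F: equal-length lists of 0/1 bits; F is stored word i."""
    m = 1
    for a, k, f in zip(A, K, F):
        xj = (a & f) | ((1 - a) & (1 - f))   # 1 when the bits are equal
        m &= xj | (1 - k)                    # Kj = 0 masks the compare
    return m

A = [1, 0, 1, 1]
K = [1, 1, 0, 0]        # only the two leftmost bits participate
print(match(A, K, [1, 0, 0, 0]))   # -> 1: masked bits differ, but Kj = 0
print(match(A, K, [0, 0, 1, 1]))   # -> 0: an unmasked bit differs
```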
Read Operation

If more than one word in memory matches the unmasked argument field, all the matched words will have 1's in the corresponding bit positions of the match register.

It is then necessary to scan the bits of the match register one at a time. The matched words are read in sequence by applying a read signal to each word line whose corresponding Mi bit is 1.

If only one word may match the unmasked argument field, then output Mi can be connected directly to the read line of the same word position. The content of the matched word is then presented automatically at the output lines, and no special read command signal is needed.

If words having all-zero content are excluded, an all-zero output indicates that no match occurred and that the searched item is not available in memory.
Write Operation

If the entire memory is loaded with new information at once, then the writing can be done by addressing each location in sequence. The information is loaded prior to a search operation.

If unwanted words have to be deleted and new words inserted one at a time, there is a need for a special register to distinguish between active and inactive words. This register is called the tag register.

A word is deleted from memory by clearing its tag bit to 0.

Cache Memory
Cache Memory

Locality of reference: the references to memory at any given interval of time tend to be contained within a few localized areas in memory.

If the active portions of the program and data are placed in a fast small memory, the average memory access time can be reduced, thus reducing the total execution time of the program. Such a fast small memory is referred to as a cache memory.

The performance of the cache memory is measured in terms of a quantity called the hit ratio. When the CPU refers to memory and finds the word in cache, it produces a hit; if the word is not found in cache, it counts as a miss. The hit ratio is the number of hits divided by the total number of CPU references to memory (hits + misses). Hit ratios of 0.9 and higher have been reported.
Cache Memory
The average memory access time of a computer system can be improved considerably by the use of a cache. The cache is placed between the CPU and main memory; it is the fastest component in the hierarchy and approaches the speed of the CPU components.

When the CPU needs to access memory, the cache is examined. If the word is found in the cache, it is read very quickly. If it is not found in the cache, the main memory is accessed. A block of words containing the one just accessed is then transferred from main memory to cache memory.

For example, a computer with a cache access time of 100 ns, a main memory access time of 1000 ns, and a hit ratio of 0.9 produces an average access time of 200 ns. This is a considerable improvement over a similar computer without a cache memory, whose access time is 1000 ns.
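The 200 ns figure above can be reproduced under one common convention (which we assume here): a hit costs the cache access, and a miss costs the cache lookup plus the main-memory access.

```python
# Average access time: hits pay tc, misses pay tc + tm.

def avg_access(tc, tm, hit_ratio):
    return hit_ratio * tc + (1 - hit_ratio) * (tc + tm)

print(round(avg_access(100, 1000, 0.9)))   # -> 200 (ns), as in the example
```

Had a miss been charged only the main-memory access (tm alone), the result would be 190 ns instead, which is why stating the convention matters.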
Cache Memory

The basic characteristic of cache memory is its fast access time. Therefore, very little or no time must be wasted when searching for words in the cache.

The transformation of data from main memory to cache memory is referred to as a mapping process. Three types of mapping procedures are available:

· Associative mapping
· Direct mapping
· Set-associative mapping
Cache Memory

Consider the following memory organization to show the mapping procedures of the cache memory:

· The main memory can store 32K words of 12 bits each.
· The cache is capable of storing 512 of these words at any given time.
· For every word stored in cache, there is a duplicate copy in main memory.
· The CPU communicates with both memories: it first sends a 15-bit address to the cache. If there is a hit, the CPU accepts the 12-bit data from the cache. If there is a miss, the CPU reads the word from main memory, and the word is then transferred to the cache.
Associative Mapping
Associative mapping stores both the address and the content (data) of the memory word.

A CPU address of 15 bits is placed in the argument register, and the associative memory is searched for a matching address. If the address is found, the corresponding 12-bit data is read and sent to the CPU. If no match occurs, the main memory is accessed for the word, and the address-data pair is then transferred to the associative cache memory.

If the cache is full, a word must be displaced to make room, using a replacement algorithm; FIFO may be used.
Direct Mapping

The 15-bit CPU address is divided into two fields: the 9 least significant bits constitute the index field, and the remaining 6 bits form the tag field.

The main memory requires the full address, including both the tag and the index bits, while the cache memory requires only the index field, i.e., 9 bits.

There are 2^k words in the cache memory and 2^n words in the main memory, where k is the number of bits in the index field and n is the number of bits in the CPU address.

e.g.: k = 9, n = 15
Direct Mapping
[Figure: direct-mapping example — memory addresses in octal (e.g., 00000) and their stored data words (e.g., 6710).]
Direct Mapping

Each word in cache consists of the data word and its associated tag. When a new word is brought into the cache, the tag bits are stored alongside the data.

When the CPU generates a memory request, the index field of the address is used to access the cache. If the tag field of the CPU address equals the tag in the word read from the cache, there is a hit; otherwise, a miss.

How can we calculate the word size of the cache memory?
Direct Mapping (block organization)
How to calculate…

Address length?
Number of addressable units(M/C)?
Block size?
No. of blocks in Main memory?
No. of blocks in cache memory?
Size of tag?
Set – Associative Mapping

In set-associative mapping, each word of cache can store two or more words of memory under the same index address. Each data word is stored together with its tag, and the number of tag-data items in one word of cache is said to form a set. Here, each index address refers to two data words and their associated tags.
Set – Associative Mapping

Each tag requires 6 bits and each data word has 12 bits, so the word length is 2(6 + 12) = 36 bits. An index address of 9 bits can accommodate 512 cache words, and thus the cache can accommodate 1024 memory words.

When the CPU generates a memory request, the index value of the address is used to access the cache. The tag field of the CPU address is compared with both tags in the cache.

The most common replacement algorithms are:

· Random replacement
· FIFO
· Least Recently Used (LRU)
Page Replacement Strategies
◼ The Principle of Optimality
• Replace page that will be used the farthest in the future.
◼ Random page replacement
• Choose a page randomly
◼ FIFO - First in First Out
• Replace the page that has been in primary memory the
longest
◼ LRU - Least Recently Used
• Replace the page that has not been used for the longest
time
◼ LFU - Least Frequently Used
• Replace the page that has been used least often
◼ NRU - Not Recently Used
• An approximation to LRU.
◼ Working Set
• Keep in memory those pages that the process is actively
using.
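The LRU strategy above can be sketched compactly with an ordered dictionary (an implementation choice of ours, not from the text): the least recently used page sits at the front and is evicted first.

```python
# LRU page replacement: on a hit, move the page to the back of the
# queue; on a fault with full frames, evict the page at the front.
from collections import OrderedDict

def simulate_lru(refs, frames):
    cache, faults = OrderedDict(), 0
    for page in refs:
        if page in cache:
            cache.move_to_end(page)        # freshly used -> back of queue
        else:
            faults += 1
            if len(cache) == frames:
                cache.popitem(last=False)  # evict least recently used
            cache[page] = True
    return faults

print(simulate_lru([1, 2, 3, 1, 4, 2], frames=3))   # -> 5 page faults
```

With FIFO, the `move_to_end` call would simply be dropped, since FIFO ignores how recently a resident page was used.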
Writing into cache

There are two writing methods the system can use.

Write-through method (the simplest and most commonly used): update main memory with every memory write operation, with cache memory being updated in parallel if it contains the word at the specified address. This method has the advantage that main memory always contains the same data as the cache.

Write-back method: only the cache location is updated during a write operation. The location is then marked by a flag, so that later, when the word is removed from the cache, it is copied into main memory. The reason for the write-back method is that during the time a word resides in the cache, it may be updated several times.
The access time of a cache memory is 100 ns and that of main memory is 1100 ns. It is estimated that the number of hits is 67425, while the hit ratio is 0.94. What is the average access time of the system memory? What is the number of misses?
A digital computer has a memory unit of
1Mx14 and a cache memory of 2K
words. The cache uses direct mapping
with a block size of four words. How
many bits are there in the tag, index,
block, and word fields of the address
format? How many bits are there in each
word of cache?
Virtual Memory
Virtual Memory
Q1) An address space is specified by 31 bits and the corresponding memory space by 17 bits. (a) How many words are there in the address space? (b) How many words are there in the memory space? (c) If a page consists of 4K words, how many pages and blocks are there in the system?
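A hedged worked sketch of Q1 above (the arithmetic is ours): the address space holds 2^31 words, the memory space 2^17 words, and a 4K-word page is 2^12 words.

```python
# Virtual-memory sizing: pages in the address space, blocks in memory.
addr_bits, mem_bits, page_words = 31, 17, 4 * 1024

address_space = 2 ** addr_bits        # 2**31 words in the address space
memory_space = 2 ** mem_bits          # 2**17 = 128K words in memory
pages = address_space // page_words   # 2**19 = 524288 pages
blocks = memory_space // page_words   # 2**5  = 32 blocks

print(address_space, memory_space, pages, blocks)
```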

Q2) Obtain the complement function for the match logic of one word in an associative memory. In other words, show that the complement Mi' is a sum of exclusive-OR functions.
Chap. 13 Multiprocessors
13-1 Characteristics of Multiprocessors
◆ Multiprocessor system = MIMD
An interconnection of two or more CPUs with memory and I/O equipment
» A single CPU with one or more IOPs is usually not considered a multiprocessor system, unless the IOP has computational facilities comparable to a CPU
◆ Computation can proceed in parallel in one of two ways:
1) Multiple independent jobs can be made to operate in parallel
2) A single job can be partitioned into multiple parallel tasks
◆ Classified by memory organization:
1) Shared-memory or tightly coupled system
» Local memory + shared memory; allows a higher degree of interaction between tasks
2) Distributed-memory or loosely coupled system
» Local memory + message-passing scheme (packets or messages); most efficient when the interaction between tasks is minimal
13-2 Interconnection Structure
◆ Multiprocessor system components (CPUs, IOPs, and memory units) can be linked by one of these interconnection structures:
1) Time-shared common bus
2) Multiport memory
3) Crossbar switch
4) Multistage switching network
5) Hypercube system
◆ Time-shared common bus
Time-shared single common bus system: Fig. 13-1
» Only one processor can communicate with the memory or another processor at any given time; while one processor is communicating with the memory, all other processors are either busy with internal operations or must idle, waiting for the bus
Dual-bus system: Fig. 13-2 (tightly coupled)
» System bus + local buses
» Shared memory: the memory connected to the common system bus is shared by all processors
» System bus controller: links each local bus to the common system bus
[Fig. 13-1: time-shared common bus — CPU 1, CPU 2, CPU 3, IOP 1, IOP 2, and a memory unit all attached to a single common bus. Fig. 13-2: dual-bus system — each local bus carries a CPU, an IOP, and local memory, and a system bus controller connects it to the common system bus and the common shared memory.]


◆ Multiport memory: Fig. 13-3
Multiple paths between processors and memory
» Advantage: a high transfer rate can be achieved
» Disadvantage: expensive memory control logic and a large number of cables and connectors
◆ Crossbar switch: Fig. 13-4
Each memory module or I/O port is connected to the processors through crosspoint switches; block diagram of a crossbar switch: Fig. 13-5

[Fig. 13-4: crossbar switch — CPUs 1–4 connect to memory modules MM1–MM4 through a grid of crosspoints. Fig. 13-5: each memory module contains multiplexers and arbitration logic that select among the data, address, and control lines from each CPU, with read/write and memory-enable signals.]
[Figure: crossbar hierarchies — clusters of nodes interconnected through crossbar switches; within a cluster, each node contains a processing unit (PU), a control unit (CU), a network interface, I/O, and local memory.]
◆ Multistage switching network
Controls the communication between a number of sources and destinations
» Tightly coupled system: PU to MM
» Loosely coupled system: PU to PU
Basic component of a multistage switching network: the two-input, two-output interchange switch (Fig. 13-6)
Two processors (P1 and P2) connected through switches to 8 memory modules (000–111): Fig. 13-7
Omega network: Fig. 13-8
» 2×2 interchange switches arranged in stages to form an N-input, N-output network topology

[Fig. 13-6: a 2×2 interchange switch connects input A or B to output 0 or 1 (A connected to 0, A connected to 1, B connected to 0, B connected to 1). Fig. 13-7: a binary tree of such switches routes processors P0 and P1 to eight memory modules 000–111. Fig. 13-8: an 8×8 omega network — three stages of 2×2 interchange switches connect inputs 0–7 to outputs 000–111.]
◆ Hypercube interconnection: Fig. 13-9
Loosely coupled system
Hypercube architecture: Intel iPSC (n = 7, 128 nodes)

[Fig. 13-9: one-cube (nodes 0, 1), two-cube (nodes 00–11), and three-cube (nodes 000–111) structures, with links between nodes whose addresses differ in exactly one bit.]

13-3 Interprocessor Arbitration: Bus Control

◆ Single bus system: address bus, data bus, control bus
◆ Multiple bus system: memory bus, I/O bus, system bus
System bus: the bus that connects CPUs, IOPs, and memory in a multiprocessor system
◆ Data transfer methods over the system bus:
Synchronous bus: transfers are driven by a common clock source shared by both units
Asynchronous bus: each transfer is accompanied by handshaking control signals
◆ System bus: IEEE Standard 796 Multibus
86 signal lines: Tab. 13-1
» Bus arbitration signals: BREQ, BUSY, …
◆ Bus arbitration algorithms: static or dynamic
Bus busy line: if this line is inactive, no other processor is using the bus
Static: priority fixed
» Serial arbitration: Fig. 13-10
[Fig. 13-10: bus arbiters 1–4 daisy-chained through PI/PO lines from highest priority to lowest, all sharing a common bus busy line.]
» Parallel arbitration: Fig. 13-11
[Fig. 13-11: bus arbiters 1–4 raise their Req lines into a 4×2 priority encoder; a 2×4 decoder returns Ack to the winning arbiter; all share the bus busy line.]
Dynamic: priority flexible
» Time slice (fixed-length time)
» Polling
» LRU
» FIFO
» Rotating daisy chain
13-4 Interprocessor Communication & Synchronization
◆ Interprocessor communication
Shared memory: tightly coupled system
» Accessible to all processors: common memory
» Acts as a message center, similar to a mailbox
No shared memory: loosely coupled system
» Message passing through I/O channel communication
◆ Interprocessor synchronization
Enforces the correct sequence of processes and ensures mutually exclusive access to shared writable data
Mutual exclusion
» Protects data from being changed simultaneously by two or more processors
Mutual exclusion with a semaphore
» Critical section: once begun, must complete execution before another processor accesses the shared data
» Semaphore: indicates whether or not a processor is executing its critical section
» Hardware lock: a processor-generated signal that prevents other processors from using the system bus
◆ Using shared memory with a semaphore:
1) Execute the TSL SEM instruction (test and set while locked)
» Asserts the hardware-lock signal while accessing SEM
» Requires 2 memory cycles:
R ← M[SEM] : test semaphore (read the semaphore into register R)
M[SEM] ← 1 : set semaphore (bar other processors from the shared memory)
2) If R = 0: the shared memory is available
If R = 1: the processor cannot access the shared memory (the semaphore was originally set)
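The TSL idea can be modeled in software; this sketch is ours, and since Python threads give no true multiprocessor atomicity, a small lock stands in for the hardware bus lock that makes the two memory cycles indivisible.

```python
# Software model of TSL-based mutual exclusion: tsl() atomically reads
# the semaphore and sets it to 1; a thread spins until it reads 0.
import threading

SEM = [0]                      # shared semaphore word M[SEM]
_atomic = threading.Lock()     # stands in for the hardware bus lock
counter = [0]                  # shared data the semaphore protects

def tsl():
    """R <- M[SEM]; M[SEM] <- 1, performed as one indivisible step."""
    with _atomic:
        r, SEM[0] = SEM[0], 1
    return r

def worker():
    for _ in range(1000):
        while tsl() == 1:      # R = 1: semaphore already set, keep spinning
            pass
        counter[0] += 1        # critical section
        SEM[0] = 0             # release: clear the semaphore

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter[0])              # -> 4000: no updates were lost
```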
13-5 Cache Coherence

◆ Conditions for incoherence: Fig. 13-12, 13-13
A multiprocessor system with private caches: main memory holds X = 52, and processors P1, P2, and P3 each hold a cached copy. Suppose P1 writes the value 120 to X:
» Write-through policy (Fig. a): main memory and P1's cache now hold X = 120, but the copies in the caches of P2 and P3 (X = 52) are incoherent
» Write-back policy (Fig. b): only P1's cache holds X = 120; the caches of P2 and P3 and main memory (X = 52) are all incoherent
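The incoherence condition above can be reproduced with a toy model (entirely ours): three private caches copy X from main memory, then P1 writes through, leaving the other two copies stale.

```python
# Toy demonstration of write-through incoherence across private caches.
main_memory = {"X": 52}
caches = {p: dict(main_memory) for p in ("P1", "P2", "P3")}  # all read X

# P1 writes X = 120 with a write-through policy
caches["P1"]["X"] = 120
main_memory["X"] = 120           # write-through also updates memory

print(main_memory["X"], caches["P1"]["X"],
      caches["P2"]["X"], caches["P3"]["X"])
# -> 120 120 52 52: P2 and P3 still hold the stale value
```

A snoopy cache controller, described next, resolves this by invalidating or updating the stale copies when the write appears on the bus.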
◆ Solutions to the cache coherence problem
Software:
» 1) Shared writable data are non-cacheable
» 2) Writable data may exist in only one cache: centralized global table
Hardware:
» 1) Monitor possible write operations: snoopy cache controller
References:
» IEEE Computer, Feb. 1988, “Synchronization, coherence, and event ordering in multiprocessors”
» IEEE Computer, June 1990, “A survey of cache coherence schemes for multiprocessors”
