Parallel Processing
Multiplicity of functional units that perform identical or different operations simultaneously.
Parallel Computers
SISD COMPUTER SYSTEMS
Von Neumann Architecture
MISD COMPUTER SYSTEMS
SIMD COMPUTER SYSTEMS
MIMD COMPUTER SYSTEMS
PIPELINING
Control hazards
STRUCTURAL HAZARDS
Example:
With one memory port, a data fetch and an instruction fetch cannot be initiated in the same clock cycle.
C = A1B1 + A2B2 + A3B3 + … + AkBk
C = (A1B1 + A5B5 + A9B9 + A13B13 + …)
  + (A2B2 + A6B6 + A10B10 + A14B14 + …)
  + (A3B3 + A7B7 + A11B11 + A15B15 + …)
  + (A4B4 + A8B8 + A12B12 + A16B16 + …)
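The four-way split above can be sketched in software: each group of partial sums acts like one segment of a pipelined adder, with products rotating through four accumulators (a minimal sketch, not the slides' hardware).

```python
# Sketch: computing C = sum(A[i]*B[i]) as four interleaved partial sums,
# the way a four-segment pipelined adder would accumulate them.
def four_way_dot(A, B):
    partial = [0.0, 0.0, 0.0, 0.0]      # one accumulator per pipeline segment
    for i, (a, b) in enumerate(zip(A, B)):
        partial[i % 4] += a * b         # products rotate through the four adders
    return sum(partial)                 # final merge of the four partial sums

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [1, 1, 1, 1, 1, 1, 1, 1]
print(four_way_dot(A, B))  # 36, same result as a sequential dot product
```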
VECTOR INSTRUCTION FORMAT
MULTIPLE MEMORY MODULE AND INTERLEAVING
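The heading above names memory interleaving. As a sketch, assuming the common low-order interleaving scheme with four modules, consecutive addresses land in different modules so their accesses can overlap:

```python
# Sketch of low-order interleaving across four memory modules (the four-module
# count and the low-order scheme are assumptions for illustration).
NUM_MODULES = 4

def interleave(address):
    module = address % NUM_MODULES      # low-order bits select the module
    offset = address // NUM_MODULES     # remaining bits select the word inside it
    return module, offset

# consecutive addresses hit modules 0, 1, 2, 3 in turn, so fetches can overlap
print([interleave(a) for a in range(8)])
```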
ARRAY PROCESSOR
Attached array processor with host computer
SIMD array processor organization
Don’t forget: try to solve the questions of the chapter.
Chapter 11
Input-Output Organization
Important Terms
Peripherals
The CPU specifies whether the address on the address lines is for a
memory word or for an interface register by enabling one of two possible
read or write lines.
The I/O read and I/O write control lines are enabled during an I/O transfer.
The memory read and memory write control lines are enabled during a
memory transfer.
This configuration isolates all I/O interface addresses from the address
assigned to memory and is referred to as the isolated I/O method for
assigning addresses in a common bus.
ISOLATED I/O
In the isolated I/O configuration, the CPU has distinct input and output instructions, and each of these instructions is associated with the address of an interface register.
When the CPU fetches and decodes the operation code of an input or
output instruction, it places the address associated with the instruction
into the common address lines.
At the same time, it enables the I/O read (for input) or I/O write (for output)
control line.
This informs the external components that are attached to the common
bus that the address in the address lines is for an interface register and
not for a memory word.
MEMORY-MAPPED I/O
This is the case in computers that employ only one set of read and write signals and do not distinguish between memory and I/O addresses.
The computer treats an interface register as being part of the memory system.
The assigned addresses for interface registers cannot be used for memory words, which reduces the memory address range available.
In the memory-mapped I/O organization, there are no specific input or output instructions. The CPU can manipulate I/O data residing in interface registers with the same instructions that are used to manipulate memory words.
Typically, a segment of the total address space is reserved for interface registers,
but in general, they can be located at any address as long as there is not also a
memory word that responds to the same address.
It allows the computer to use the same instructions for either input-output
transfers or for memory transfers.
There are two ways that can be used to achieve this:
1-Strobe
2-Handshaking
[Figure: CPU–interface connection — the CPU's data bus and address bus connect to an interface whose I/O bus leads to the device; the interface contains a data register and a status register with a flag bit F, driven by the I/O read and I/O write lines.]
[Flowchart: Programmed I/O transfer — read the status register; if flag F = 0, check again; if F = 1, read the data register and transfer the data to memory; repeat until the operation is complete.]
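The polling flowchart above can be sketched as a loop: the CPU keeps reading the status flag until it is set, then reads the data register. The device model below is hypothetical, used only to drive the loop.

```python
# Sketch of programmed I/O with a status flag, following the flowchart above.
class Device:
    def __init__(self, data):
        self._pending = list(data)
        self.flag = 0
        self.data_register = None
    def tick(self):                      # device places the next word when idle
        if self.flag == 0 and self._pending:
            self.data_register = self._pending.pop(0)
            self.flag = 1

def programmed_io(device, count):
    memory = []
    while len(memory) < count:
        device.tick()
        if device.flag == 0:             # read status register; F = 0 -> keep polling
            continue
        memory.append(device.data_register)  # F = 1 -> read data register
        device.flag = 0                  # clear flag; word transferred to memory
    return memory

print(programmed_io(Device([10, 20, 30]), 3))  # [10, 20, 30]
```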
Priority Interrupt
• Identify the source of the interrupt when several sources request service simultaneously
• Determine which condition is to be serviced first when two or more
requests arrive simultaneously
• Techniques used:
– 1) Software : Polling
– 2) Hardware : Daisy chain, Parallel priority
Priority Interrupt by Software (Polling)
- Priority is established by the order of polling the devices (interrupt sources)
- Flexible, since it is established by software
- Low cost, since it needs very little hardware
- Very slow
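The polling scheme above can be sketched directly: service routines are tried in a fixed order, so the first device tested has the highest priority. The device names are illustrative.

```python
# Sketch of software polling: the order of polling establishes the priority.
def poll(interrupt_flags, priority_order=("disk", "printer", "reader", "keyboard")):
    for device in priority_order:          # order of polling = priority
        if interrupt_flags.get(device):
            return device                  # branch to this device's service routine
    return None                            # no device requested service

print(poll({"printer": True, "keyboard": True}))  # printer wins: polled first
```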
[Figure: Devices raise a common interrupt request line (INT) to the CPU; the CPU responds on the interrupt acknowledge line (INTACK).]
One stage of the daisy-chain priority arrangement
[Figure: Each stage receives the acknowledge on its priority-in (PI) input. If the stage has a pending request, it places its vector address (VAD) on the bus; with no interrupt request, it passes the acknowledge to the next device through priority-out (PO). An interrupt request without an acknowledge is an invalid state.]
Parallel Priority Interrupt
[Figure: Interrupt sources disk (I0), printer (I1), reader (I2), and keyboard (I3) set bits in an interrupt register; a mask register enables or disables each source; a priority encoder forms the vector address (VAD) sent to the CPU; the IEN and IST bits gate the interrupt.]
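The parallel-priority hardware can be modeled in a few lines: the interrupt and mask registers are bit lists, and the encoder returns the lowest-numbered unmasked request (a software model for illustration, not the circuit itself).

```python
# Sketch of the parallel-priority arrangement: mask register gates the
# interrupt register, and a priority encoder picks the winning line.
def priority_encoder(interrupt_reg, mask_reg):
    masked = [i & m for i, m in zip(interrupt_reg, mask_reg)]
    for line, bit in enumerate(masked):   # I0 (disk) has the highest priority
        if bit:
            return line                   # encoder output -> vector address bits
    return None                           # no unmasked request pending

# I1 (printer) and I3 (keyboard) request; I1 is masked off, so I3 is taken
print(priority_encoder([0, 1, 0, 1], [1, 0, 1, 1]))  # 3
```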
Initial Operation of ISR
Removing the CPU from the path and letting the peripheral device
manage the memory buses directly would improve the speed of
transfer.
During DMA transfer, the CPU is idle and has no control of the memory
buses. A DMA controller takes over the buses to manage the transfer
directly between the I/O device and memory.
CPU bus signals for DMA transfer
t1 = 30 ns, t2 = 70 ns, t3 = 70 ns, t4 = 25 ns, t5 = 35 ns
[Figure: Memory connection to the CPU — a decoder (outputs 0–3) drives the chip-select inputs CS1/CS2 of four 128×8 RAM chips (RAM 1–RAM 4), each with RD and WR control lines and address lines up to AD7; a ROM chip is selected separately, with address lines up to AD9.]
Associative Memory
[Figure: Block diagram of associative memory — input lines, read and write controls, an associative memory array with logic (m words, n bits per word), and a match register M.]
Associative memory of m words, n cells per word
[Figure: Associative memory array — argument bits A1 … Aj … An and key bits K1 … Kj … Kn are compared against cells Cij of words 1 … i … m; each cell is an R-S flip-flop Fij with read, write, and output lines, and match logic sets the word's match bit Mi.]
Match logic
Neglecting the key bits first, compare the argument in A with the bits stored in the cells of the words: cell j of word i matches when xj = Aj Fij + Aj′ Fij′ = 1.
Including the key bits K1 … Kn, word i matches when
Mi = (x1 + K1′)(x2 + K2′) … (xn + Kn′)
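A minimal software model of the masked match described above, treating each word as an integer (word width and contents are arbitrary examples):

```python
# Sketch: only bit positions where the key register K holds 1s take part in
# the comparison; the result is one match bit per word (the match register M).
def match_register(words, argument, key):
    mask = key  # 1-bits of K select the compared portion of the argument
    return [1 if (word & mask) == (argument & mask) else 0 for word in words]

words = [0b10100011, 0b10100111, 0b01100011]
# compare only the high 4 bits (K = 11110000): first two words match
print(match_register(words, 0b10100000, 0b11110000))  # [1, 1, 0]
```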
Read Operation
If only one word may match the unmasked argument field, then output Mi can be connected directly to the read line of the same word position.
If we exclude words having zero content, then all zero output will
indicate that no match occurred and that the searched item is
not available in memory.
Write Operation
Locality of reference
The references to memory at any given interval of time tend to be contained within a few localized areas in memory.
If the active portions of the program and data are placed in a fast
small memory, the average memory access time can be reduced.
When the CPU refers to memory and finds the word in the cache, it produces a hit. If the word is not found in the cache, it counts as a miss.
The hit ratio is the number of hits divided by the total number of CPU references to memory (hits + misses). Hit ratios of 0.9 and higher have been reported.
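Under the hit-ratio model above, the average access time follows the common simplification t_avg = h · t_cache + (1 − h) · t_main (a sketch of the model, with example timings chosen for illustration):

```python
# Sketch: average memory access time from the hit ratio,
# t_avg = h * t_cache + (1 - h) * t_main.
def average_access_time(hit_ratio, t_cache, t_main):
    return hit_ratio * t_cache + (1 - hit_ratio) * t_main

# with a 0.9 hit ratio, a 100 ns cache and a 1000 ns main memory:
print(average_access_time(0.9, 100, 1000))  # close to 190 ns
```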
Cache Memory
The average memory access time of a computer system can be
improved considerably by use of cache.
The cache is placed between the CPU and main memory. It is the fastest component in the hierarchy and approaches the speed of the CPU components.
Three mapping procedures are commonly used:
· Associative Mapping
· Direct Mapping
· Set-Associative Mapping
Cache Memory
The 9 least significant bits constitute the index field and the remaining 6 bits form the tag field.
The main memory address includes both the tag and the index bits.
The cache memory requires the index bits only, i.e., 9 bits.
There are 2^k words in the cache memory and 2^n words in the main memory, where k is the number of bits in the index field and n is the number of bits in the CPU address.
e.g., k = 9, n = 15
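The k = 9, n = 15 split above can be sketched with a couple of shifts and masks:

```python
# Sketch of the direct-mapping address split: n = 15 CPU address bits,
# k = 9 index bits, n - k = 6 tag bits.
K, N = 9, 15

def split_address(addr):
    index = addr & ((1 << K) - 1)   # 9 least significant bits
    tag = addr >> K                 # remaining 6 bits
    return tag, index

tag, index = split_address(0b101010_110011001)
print(tag, index)   # tag = 0b101010 (42), index = 0b110011001 (409)
```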
Direct Mapping
Each word in the cache consists of the data word and its associated tag.
When a new word is brought into the cache, the tag bits are stored alongside the data.
If the tag field of the CPU address equals the tag in the word read from the cache, there is a hit; otherwise, a miss.
Address length?
Number of addressable units(M/C)?
Block size?
No. of blocks in Main memory?
No. of blocks in cache memory?
Size of tag?
Set-Associative Mapping
Each tag requires 6 bits and each data word has 12 bits, so the word length is 2 × (6 + 12) = 36 bits.
When the CPU generates a memory request, the index value of the
address is used to access the cache.
The tag field of the CPU address is compared with both tags in the
cache.
· Random replacement
· FIFO
· Least Recently Used (LRU)
Page Replacement Strategies
◼ The Principle of Optimality
• Replace page that will be used the farthest in the future.
◼ Random page replacement
• Choose a page randomly
◼ FIFO - First in First Out
• Replace the page that has been in primary memory the
longest
◼ LRU - Least Recently Used
• Replace the page that has not been used for the longest
time
◼ LFU - Least Frequently Used
• Replace the page that has been used least often
◼ NRU - Not Recently Used
• An approximation to LRU.
◼ Working Set
• Keep in memory those pages that the process is actively
using.
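The FIFO and LRU strategies listed above can be compared on one reference string with a small simulator (a sketch; the reference string and frame count are arbitrary):

```python
# Sketch: page-fault counts under FIFO and LRU replacement.
def simulate(refs, frames, policy):
    memory, faults = [], 0
    for page in refs:
        if page in memory:
            if policy == "LRU":         # refresh recency on a hit
                memory.remove(page)
                memory.append(page)
            continue
        faults += 1
        if len(memory) == frames:
            memory.pop(0)               # FIFO: oldest load; LRU: least recent use
        memory.append(page)
    return faults

refs = [1, 2, 3, 1, 4, 1, 2]
print(simulate(refs, 3, "FIFO"), simulate(refs, 3, "LRU"))  # 6 5
```

On this string LRU wins because the hit on page 1 refreshes its recency, so it survives the eviction that FIFO inflicts on it.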
Writing into cache
There are two writing methods the system can use.
Write-through method
In this method, main memory is updated in parallel with the cache on every write operation.
This method has the advantage that main memory always contains the same data as the cache.
Write-back method
In this method only the cache location is updated during a write operation.
The location is then marked by a flag so that later when the word is
removed from the cache it is copied into main memory.
The reason for the write-back method is that during the time a word resides
in the cache, it may be updated several times.
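The write-back behavior described above can be sketched with a dirty flag on a single cache line (a minimal model; the line structure and addresses are illustrative):

```python
# Sketch of write-back: writes update only the cache line and set a dirty
# flag; main memory is written once, at eviction, no matter how many writes.
class WriteBackLine:
    def __init__(self, tag, data):
        self.tag, self.data, self.dirty = tag, data, False

memory_writes = 0

def write(line, value):
    line.data = value
    line.dirty = True          # mark the line; do NOT touch main memory yet

def evict(line, main_memory):
    global memory_writes
    if line.dirty:             # copy back only if the word was modified
        main_memory[line.tag] = line.data
        memory_writes += 1

main_memory = {0x2A: 5}
line = WriteBackLine(0x2A, 5)
write(line, 7)                 # several writes, still zero memory traffic
write(line, 9)
evict(line, main_memory)
print(main_memory[0x2A], memory_writes)  # 9 1
```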
The access time of a cache memory is 100 ns and that of main memory 1100 ns. It is estimated that the number of hit states is 67425, while the hit ratio is 0.94. What is the average access time of the system memory? What is the number of miss states?
A digital computer has a memory unit of 1M×14 and a cache memory of 2K words. The cache uses direct mapping with a block size of four words. How many bits are there in the tag, index, block, and word fields of the address format? How many bits are there in each word of cache?
Virtual Memory
Q1) An address space is specified by 31 bits and the corresponding memory space by 17 bits. (a) How many words are there in the address space? (b) How many words are there in the memory space? (c) If a page consists of 4K words, how many pages and blocks are there in the system?
[Figure: Tightly coupled system — CPU, IOP, and local memory on a local bus, connected through a system bus controller to a common shared memory unit on the shared system bus.]
[Figure: Multiport memory organization — memory modules MM 1–MM 4, each with a dedicated port for CPU 1–CPU 4.]
[Figure: Crossbar switch — at each crosspoint, multiplexers and arbitration logic route the data, address, and control lines from CPU 1–CPU 4 to a memory module; read/write logic and a memory-enable signal complete the path.]
Crossbar Switch
Crossbar Hierarchies
[Figure: Clusters connected through a hierarchy of crossbars; each cluster contains several nodes, and each node contains a processing unit (PU), a control unit (CU), a network interface, I/O, and local memory.]
◆ Multistage Switching Network
Control the communication between a number of sources and destinations
» Tightly coupled system : PU ↔ MM
» Loosely coupled system : PU ↔ PU
Basic components of a multistage switching network :
two-input, two-output interchange switch : Fig. 13-6
Two processors (P1 and P2) are connected through switches to 8 memory modules (000 – 111) : Fig. 13-7
Omega Network : Fig. 13-8
» 2 x 2 Interchange switch used for N input x N output network topology
[Fig. 13-6: 2×2 interchange switch — a control input connects input A or B to output 0 or output 1 (A connected to 0, A connected to 1, B connected to 0, B connected to 1).]
[Fig. 13-7: Binary tree of 2×2 switches connecting processors P0 and P1 to eight memory modules 000–111.]
[Fig. 13-8: 8×8 omega switching network — three stages of 2×2 interchange switches connect inputs 0–7 to outputs 000–111.]
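Routing in an 8×8 omega network is destination-addressed: at each stage, one bit of the destination address sets the 2×2 switch. A minimal sketch of that control rule (assuming the usual convention that 0 selects the upper output and 1 the lower):

```python
# Sketch: the k-th destination bit controls the switch at stage k of an
# N-input omega network (here N = 8, so three stages).
def omega_route(destination, stages=3):
    bits = format(destination, f"0{stages}b")
    return [int(b) for b in bits]   # one control bit per switch stage

print(omega_route(0b101))  # switch settings [1, 0, 1] to reach output 101
```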
◆ Hypercube Interconnection : Fig. 13-9
Loosely coupled system
Hypercube Architecture : Intel iPSC (n = 7, 128 nodes)
[Fig. 13-9: Hypercube structures for n = 1, 2, 3 — nodes are labeled with binary addresses (0/1; 00–11; 000–111), and adjacent nodes differ in exactly one address bit.]
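Because hypercube neighbors differ in exactly one address bit, a message can be routed by flipping one differing bit per hop. A sketch of that routing rule (the bit-order choice is an assumption; any order of differing bits works):

```python
# Sketch of hypercube routing: forward along one differing address bit per hop.
def hypercube_path(src, dst):
    path, node = [src], src
    diff = src ^ dst                   # bits where source and destination differ
    bit = 0
    while diff:
        if diff & 1:
            node ^= (1 << bit)         # flip one differing bit -> move to a neighbor
            path.append(node)
        diff >>= 1
        bit += 1
    return path

print([format(n, "03b") for n in hypercube_path(0b000, 0b101)])
# ['000', '001', '101']
```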
◆ System Bus : IEEE Standard 796 Multibus
86 signal lines : Tab. 13-1
» Bus Arbitration : BREQ, BUSY, …
» Bus Busy Line : if this line is inactive, no other processor is using the bus
◆ Bus Arbitration Algorithm : Static / Dynamic
Static : priority fixed
» Serial arbitration : Fig. 13-10
[Fig. 13-10: Serial (daisy-chain) arbitration — bus arbiters 1 through 4 chained PI → PO from highest to lowest priority; the grant propagates to the next arbiter in line.]
» Parallel arbitration
[Figure: Parallel arbitration — the request (Req) lines of bus arbiters 1–4 feed a 4×2 priority encoder whose output drives a 2×4 decoder that returns the acknowledge (Ack) to the winning arbiter.]
Dynamic : priority changes while the system is in operation
» LRU
» FIFO
» Rotating daisy-chain
13-4 Interprocessor Communication & Synchronization
◆ Interprocessor Communication
shared memory : tightly coupled system
» Accessible to all processors : common memory
» Act as a message center similar to a mailbox
no shared memory : loosely coupled system
» message passing through I/O channel communication
◆ Interprocessor Synchronization
Enforce the correct sequence of processes and ensure mutually exclusive access
to shared writable data
Mutual Exclusion
» Protect data from being changed simultaneously by two or more processors
Mutual Exclusion with Semaphore
» Critical Section
Once begun, must complete execution before another processor accesses
» Semaphore
Indicate whether or not a processor is executing a critical section
» Hardware Lock
Processor generated signal to prevent other processors from using system bus
◆ Using shared memory with a semaphore
1) Execute the TSL SEM instruction (Test and Set while Locked)
» Generates the hardware lock signal while accessing SEM
» Requires 2 memory cycles
R ← M[SEM] : Test semaphore (read the semaphore into register R)
M[SEM] ← 1 : Set semaphore (prevent other processors from using the shared memory)
2) If R = 0 : shared memory is available
If R = 1 : the processor cannot access the shared memory (semaphore was originally set)
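The TSL sequence above can be sketched in software, with a lock standing in for the hardware lock that makes the two memory cycles indivisible (a model for illustration, not the bus hardware):

```python
# Sketch of TSL: read the semaphore into R and set it to 1 in one locked
# operation; R = 0 means the critical section may be entered.
import threading

_lock = threading.Lock()   # stands in for the hardware lock on the bus
SEM = [0]                  # M[SEM], the shared semaphore word

def tsl():
    with _lock:            # both memory cycles happen under the hardware lock
        r = SEM[0]         # R <- M[SEM]  (test)
        SEM[0] = 1         # M[SEM] <- 1  (set)
    return r

assert tsl() == 0          # first processor: R = 0, shared memory available
assert tsl() == 1          # second processor: R = 1, must wait
SEM[0] = 0                 # leaving the critical section clears the semaphore
print("semaphore works")
```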
13-5 Cache Coherence
[Figure: Cache coherence example — main memory holds X, and processors P0, P2, and P3 have copies of X in their caches on a common bus. After one processor writes X = 120 over the original X = 52: (a) with a write-through cache policy, main memory and the writing cache hold X = 120 while the other caches still hold X = 52; (b) with a write-back cache policy, only the writing cache holds X = 120 and main memory still holds X = 52.]
» Write-back : P2, P3, and main memory become incoherent with the writing cache
◆ Solution to the Cache Coherence Problem
Software
» 1) Shared writable data are non-cacheable
» 2) Writable data exists in one cache : Centralized global table
Hardware
» 1) Monitor possible write operation : Snoopy cache controller
References :
» "Synchronization, coherence, and event ordering in multiprocessors," IEEE Computer, Feb. 1988.
» "A survey of cache coherence schemes for multiprocessors," IEEE Computer, June 1990.