
17.1 Emulations Among Architectures


Why do we need emulations?
1. To quickly develop algorithms for new architectures without having to spend significant
resources.
2. To develop algorithms for architectures that are easier to program and then have them run on
machines that are realizable, easily accessible, or affordable.

An important use, other than practical porting, is the following.


Testing of various architectures
1. Take a base architecture for which a large number of algorithms already exist.
2. Check whether the newly developed architecture can emulate this base architecture.
3. If it can, determine the emulation latency (slowdown).
Example:
Let the base architecture be the hypercube, a well-known and very powerful architecture. There exists another architecture called cube-connected cycles (CCC). CCC can emulate a p-node hypercube with a slowdown of O(log p).
(Figure: a hypercube and the corresponding cube-connected cycles network.)
An important rule (transitivity of emulations):
If architecture A emulates architecture B with O(f(p)) slowdown and B in turn emulates C with O(g(p)) slowdown (assuming, for simplicity, that they all contain p processors), then A can emulate C with O(f(p) × g(p)) slowdown.
Three fairly general emulation results follow, involving classes of architectures rather than specific ones. Our first result is based on the embedding of one architecture, or graph, into another.
There are two factors to be considered here:
1. The cost of an embedding, given by a factor called expansion: the ratio of the number of nodes in the two graphs.
2. Performance factors:
◦ Dilation: the length of the longest path onto which any given edge is mapped
◦ Congestion: the maximum number of edges mapped onto the same edge
◦ Load factor: the maximum number of nodes mapped onto the same node
Calculation of slowdown:
slowdown ≤ dilation × congestion × load factor
Example:
Let us embed the p-node complete graph Kp into the two-node complete graph K2. The dilation is 1, the congestion is p²/4, and the load factor is p/2, so the worst-case slowdown is 1 × p²/4 × p/2 = p³/8.
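As a quick check of this bound, here is a minimal Python sketch (illustrative only; the mapping of Kp onto K2 is the obvious one that splits the nodes into two equal halves, and all names are my own):

```python
from itertools import combinations

def embedding_metrics(p):
    """Embed the complete graph K_p into K_2 by mapping node i to i % 2
    (a hypothetical but natural balanced split of the nodes)."""
    nodes = range(p)
    phi = {v: v % 2 for v in nodes}          # node mapping K_p -> K_2

    # Load factor: max number of K_p nodes mapped onto one K_2 node.
    load = max(sum(1 for v in nodes if phi[v] == t) for t in (0, 1))

    # Each K_p edge maps onto a path in K_2 of length 0 (same image) or 1.
    edges = list(combinations(nodes, 2))
    dilation = max(1 if phi[u] != phi[v] else 0 for u, v in edges)

    # Congestion: number of K_p edges routed over the single K_2 edge.
    congestion = sum(1 for u, v in edges if phi[u] != phi[v])

    return dilation, congestion, load

p = 8
d, c, l = embedding_metrics(p)
print(d, c, l, d * c * l)   # expect 1, p*p/4 = 16, p/2 = 4, and the bound p**3/8 = 64
```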
Our second general emulation result is that the
PRAM can emulate any degree-d network
with O(d) slowdown.
1. Slowdown is measured in terms of computation steps rather than real time; i.e., each PRAM
computation step, involving memory access by every processor, counts as one step.

Proof: Consider first the case in which the degree of each node is 2. Each link buffer corresponds to a location in shared memory, and each node can send or receive at most 2 messages at a time.
Continued:
So, for a degree-d graph:
1. The sending of up to d messages by each node can be done in d steps, with each step devoted to writing one message from the send buffer to a link buffer, i.e., O(d) time.
2. Similarly, the receiving part, which involves reading up to d messages from known locations in the shared memory, can be done in O(d) steps.
So the total time taken per emulated step is of order O(d).
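A minimal Python sketch of this emulation idea (the data structures and names are hypothetical; shared memory is modeled as a dictionary of link buffers): each emulated network step costs the PRAM processor up to d writes followed by up to d reads, hence O(d) slowdown.

```python
def emulate_step(node, neighbors, send_buf, shared_memory):
    """One emulated network step for a single node on a PRAM.
    shared_memory[(u, v)] plays the role of the link buffer for edge u -> v."""
    # Phase 1: up to d PRAM steps, one write per outgoing link buffer.
    for v in neighbors:
        shared_memory[(node, v)] = send_buf.get(v)   # None if nothing to send

    # Phase 2: up to d PRAM steps, one read per incoming link buffer.
    received = {v: shared_memory.get((v, node)) for v in neighbors}
    return received                                  # 2d = O(d) shared-memory accesses
```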


Our third and final general result is that the (wrapped) butterfly network can emulate any p-node degree-d network with O(d log p) slowdown.
As an example (refer to the figure), we augment the given graph with a link from Node 1 to Node 3, in order to make the node degree uniformly equal to 3, and represent the augmented graph by the bipartite graph, shown in the middle, which has nodes ui and vi in its two parts corresponding to Node i in the original graph.
Continued:
We next identify a perfect matching in this bipartite graph, say the heavy dotted edges on the right. This perfect matching, with its edges carrying the label 0, defines the permutation P0 = {1, 0, 3, 2} of the node set {0, 1, 2, 3}. Removing these edges, we arrive at a bipartite graph of uniform node degree 2 and the perfect matching defined by the light dotted edges labeled 1; this time we have identified the permutation P1 = {2, 3, 1, 0}. Finally, the remaining edges define the last perfect matching, or permutation, P2 = {3, 2, 0, 1}.
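This decomposition into permutations can be computed mechanically. Below is a hedged Python sketch (function names and the adjacency representation are my own) that peels d perfect matchings off a d-regular bipartite graph using Kuhn's augmenting-path algorithm; each peeled matching is one of the permutations that the butterfly then routes in O(log p) time.

```python
def peel_permutations(adj, d):
    """Decompose a d-regular bipartite (multi)graph into d perfect matchings.

    adj[u] is a list (with multiplicity) of right-side nodes adjacent to left node u.
    Returns d permutations, each mapping the left nodes (in sorted order) to right nodes.
    A d-regular bipartite graph is guaranteed to contain a perfect matching."""
    adj = {u: list(nbrs) for u, nbrs in adj.items()}   # local copy we can consume
    n = len(adj)
    permutations = []

    for _ in range(d):
        match_right = {}                 # right node -> matched left node

        def try_assign(u, visited):
            for v in adj[u]:
                if v in visited:
                    continue
                visited.add(v)
                if v not in match_right or try_assign(match_right[v], visited):
                    match_right[v] = u
                    return True
            return False

        for u in adj:
            try_assign(u, set())

        perm = {u: v for v, u in match_right.items()}   # left -> right
        assert len(perm) == n, "graph was not d-regular"
        permutations.append([perm[u] for u in sorted(adj)])

        # Remove the matched edges before peeling the next matching.
        for u, v in perm.items():
            adj[u].remove(v)

    return permutations

# The augmented 4-node, degree-3 example in the text yields three permutations
# such as [1, 0, 3, 2], [2, 3, 1, 0], [3, 2, 0, 1]; the exact split depends on
# which perfect matchings the search happens to find first.
```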
Continued:
1. A consequence of the above emulation result is that if a problem is efficiently parallelizable on any bounded-degree architecture, then it is efficiently parallelizable on the butterfly network, because emulating the bounded-degree network on the butterfly increases the running time by at most a factor of O(log p) (treating the degree as a constant).
2. More generally, for degree-d networks the factor is O(d log p).
17.2 Distributed Shared Memory
Randomized emulation of the p-processor PRAM on a p-node butterfly:
● Each of the p = 2^q (q + 1) butterfly nodes (2^q rows, q + 1 columns) consists of a router, a processor, and a memory module holding m/p memory locations.
● Use a hash function to map memory locations to modules; the p locations accessed in one PRAM step map to p modules, not necessarily distinct.
● With high probability, at most O(log p) of the p locations will be in modules located in the same row.
● Average slowdown = O(log p).

Fig. 17.2 Butterfly distributed-memory machine emulating the PRAM.
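A toy Python sketch of this randomized idea (purely illustrative; the hash function, address range, and parameter names are assumptions, and routing through the butterfly is abstracted away): it hashes the p addresses accessed in one PRAM step to rows and reports the worst per-row load, which should be on the order of log p.

```python
import random
from collections import Counter

def max_row_load(p, q, seed=0):
    """Hash p memory addresses to the 2**q rows of the butterfly and return the
    largest number of addresses that land in the same row (the potential hot spot)."""
    random.seed(seed)
    rows = 2 ** q
    addresses = random.sample(range(10**6), p)        # p distinct addresses in one step
    row_of = lambda addr: hash((seed, addr)) % rows   # stand-in for a universal hash
    per_row = Counter(row_of(a) for a in addresses)
    return max(per_row.values())

q = 8
p = 2 ** q * (q + 1)
print(max_row_load(p, q))   # typically a small multiple of log2(p), i.e., O(log p)
```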



PRAM Emulation with Butterfly MIN
Emulation of the p-processor PRAM on a (p log p)-node butterfly, with memory modules and processors connected to the two sides; O(log p) average slowdown.
● Less efficient than Fig. 17.2, which uses a smaller butterfly.
● By using p / (log p) physical processors to emulate the p-processor PRAM, this emulation scheme becomes quite efficient (pipeline the memory accesses of the log p virtual processors assigned to each physical processor).

Fig. 17.3 Distributed-memory machine, with a butterfly multistage interconnection network, emulating the PRAM.
Deterministic Shared-Memory Emulation
Deterministic emulation of the p-processor PRAM on a p-node butterfly (each node = router + processor + memory module holding m/p memory locations; the butterfly has 2^q rows and q + 1 columns).
● Store log2 m copies of the contents of each of the m memory locations.
● Time-stamp each updated value.
● A "write" is complete once a majority of the copies are updated.
● A "read" is satisfied when a majority of the copies are accessed and the one with the latest time stamp is used.
● Why it works: a few congested links won't delay the operation, because any write set and any read set (each a majority of the log2 m copies) must overlap in at least one up-to-date copy.
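A minimal Python sketch of the majority/time-stamp idea (class and method names are hypothetical, and the placement of copies on butterfly nodes is abstracted away):

```python
import time

class ReplicatedLocation:
    """Toy model of one PRAM memory location kept as c = log2(m) copies.
    A write succeeds after updating any majority of copies; a read consults
    any majority and returns the value with the newest time stamp."""
    def __init__(self, copies, initial=0):
        self.copies = [(0.0, initial)] * copies        # (time stamp, value)

    def write(self, value, reachable):
        # 'reachable' lists the copy indices we managed to update (a majority).
        assert len(reachable) > len(self.copies) // 2
        stamp = time.time()
        for i in reachable:
            self.copies[i] = (stamp, value)

    def read(self, reachable):
        assert len(reachable) > len(self.copies) // 2
        # Any two majorities intersect, so the newest time stamp is always seen.
        return max(self.copies[i] for i in reachable)[1]

loc = ReplicatedLocation(copies=5)
loc.write(42, reachable=[0, 1, 2])       # copies 3 and 4 sat behind congested links
print(loc.read(reachable=[2, 3, 4]))     # -> 42, found via the intersection at copy 2
```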



PRAM Emulation Using Information Dispersal
Instead of (log m)-fold replication of data, divide each data element into k pieces and
encode the pieces using a redundancy factor of 3, so that any k / 3 pieces suffice for
reconstructing the original data

(Figure: the original data word and its k pieces; the k pieces after encoding, approximately three times larger; a possible read set of size 2k/3 and a possible update set of size 2k/3, which overlap in up-to-date pieces; the reconstruction algorithm recovers the original data word from k/3 encoded pieces.)

Fig. 17.4 Illustrating the information dispersal approach to PRAM emulation with lower data redundancy.
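The following Python sketch shows one way such dispersal can work. It is a simplified, hypothetical Rabin-style scheme over a small prime field (not the specific code used in the source): the data word is viewed as k/3 symbols defining a polynomial, the k encoded pieces are evaluations of that polynomial, and any k/3 of them reconstruct the word by Lagrange interpolation.

```python
P = 257   # small prime field; a real scheme would use a larger, byte-oriented field

def lagrange_eval(points, x):
    """Evaluate, at x, the unique polynomial through the given (xi, yi) points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P   # Fermat inverse of den
    return total

def disperse(data, k):
    """Encode m = len(data) = k//3 symbols into k pieces; any m pieces suffice."""
    base = list(enumerate(data, start=1))        # (x, y) pairs defining the polynomial
    return [(x, lagrange_eval(base, x)) for x in range(1, k + 1)]

def reconstruct(pieces, m):
    """Recover the m data symbols from any m of the k encoded pieces."""
    subset = pieces[:m]
    return [lagrange_eval(subset, x) for x in range(1, m + 1)]

data = [10, 20, 30]                   # m = 3 data symbols
pieces = disperse(data, k=9)          # k = 9 encoded pieces, redundancy factor 3
print(reconstruct(pieces[4:7], m=3))  # any 3 surviving pieces give back [10, 20, 30]
```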



18.3 Multithreading and Latency Hiding

● The general idea of latency hiding methods is to provide each processor with some useful work
to do as it waits for remote memory access requests to be satisfied.

● Multithreading is a practical mechanism for latency hiding.



(Figure: a sequential thread compared with multithreaded parallel computation, showing thread computations, remote accesses, idle time, scheduling overhead, and synchronization overhead.)

Fig. 18.5 The concept of multithreaded parallel computation.



In the figure, a processor is working on a thread, and the thread requires a remote memory access before it can continue. The processor places the thread in a wait queue and switches to a different thread. This switching of context involves some overhead, which must be considerably less than the remote access latency for the scheme to be efficient. The processor is thus likely to have multiple register sets, one per active thread, to minimize that overhead.

The higher the remote memory access delay, the larger the number of threads required to successfully hide the latency. At some point in its computation, the newly activated thread may itself require a remote memory access; another context switch occurs, and so on.
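A back-of-the-envelope Python sketch of how many threads a processor needs to stay busy (the parameter names and the simple utilization model are my own): if each thread computes for R cycles between remote accesses, each context switch costs C cycles, and the remote latency is L cycles, the processor is fully utilized once (n - 1)(R + C) >= L.

```python
import math

def threads_needed(run_length, switch_cost, remote_latency):
    """Smallest number of resident threads that hides a remote access completely:
    while one thread waits for its access, the other n - 1 threads (plus the
    context switches between them) must cover the latency."""
    return 1 + math.ceil(remote_latency / (run_length + switch_cost))

# E.g., 20 cycles of work per access, a 2-cycle switch, and a 200-cycle remote latency:
print(threads_needed(20, 2, 200))   # -> 11 threads keep the processor busy
```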
● The access requests of the threads that are in the wait state may be completed out of order. Thus, each
access request is associated (tagged) with a unique thread identifier so that when the result is returned,
the processor knows which of the waiting threads should be activated. Thread identifiers are sometimes
called “continuations”.

● Application of multithreading is not restricted to a parallel processing environment but offers advantages
even on a uniprocessor.
Multithreading on a Single Processor
Here, the motivation is to reduce the performance impact of data dependencies
and branch misprediction penalty

(Figure: threads in memory feed the issue pipelines and function units, with bubbles, followed by the retirement and commit pipeline.)
Fig. 24.9 of Parhami's Computer Architecture text (2005)



Parallel I/O Technology

● An important requirement for highly parallel systems is the provision of high-bandwidth I/O capability. For some data-intensive applications, the high processing power of a massively parallel system is not of much value unless the I/O subsystem can keep up with the processors.
● Figure shows a multiple-platter high-capacity
magnetic disk. Each platter has two recording
surfaces and a read/write head mounted on an
arm. The access arms are attached to an actuator
that can move the heads radially in order to align
them with a desired cylinder (i.e., a set of tracks,
one per recording surface). A sector or disk block
is part of a track that forms the unit of data
transfer to/from the disk.

● Access to a block of data on disk typically consists of:
1. Cylinder seek: moving the heads to the desired
cylinder (seek time)
2. Sector alignment: waiting until the desired
sector is under the head (rotational latency).
3. Data transfer: reading the bytes out as they
pass under the head (transfer time).
Consider a disk with a sector size of 512 bytes, 2000 tracks per surface, 50 sectors per track, five double-sided platters, and
average seek time of 10 msec.

1. What is the capacity of a track in bytes? What is the capacity of each surface? What is the capacity of the disk?
2. How many cylinders does the disk have?
3. Give examples of valid block sizes. Is 256 bytes a valid block size? 2048? 51200?
4. If the disk platters rotate at 5400 rpm (revolutions per minute), what is the maximum rotational delay?
5. If one track of data can be transferred per revolution, what is the transfer rate?

1. bytes/track = bytes/sector × sectors/track = 512 × 50 = 25K
   bytes/surface = bytes/track × tracks/surface = 25K × 2000 = 50,000K
   bytes/disk = bytes/surface × surfaces/disk = 50,000K × 5 × 2 = 500,000K

2. The number of cylinders is the same as the number of tracks on each surface, which is 2000.

3. The block size should be a multiple of the sector size. We can see that 256 is not a valid block size, while 2048 is. 51200 is not a valid block size in this case, because the block size cannot exceed the size of a track, which is 25,600 bytes.

4. If the disk platters rotate at 5400 rpm, the time required for one complete rotation, which is the maximum rotational delay, is (1/5400) × 60 = 0.011 seconds. The average rotational delay is half of the rotation time, about 0.006 seconds.

5. The capacity of a track is 25K bytes. Since one track of data can be transferred per revolution, the data transfer rate is 25K / 0.011 ≈ 2,250 Kbytes/second.
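A quick Python check of these figures (a throwaway sketch; "K" in the answers above is taken as 1024 bytes):

```python
bytes_per_sector, sectors_per_track = 512, 50
tracks_per_surface, platters, surfaces_per_platter = 2000, 5, 2
rpm = 5400

bytes_per_track = bytes_per_sector * sectors_per_track                  # 25,600 B = 25K
bytes_per_surface = bytes_per_track * tracks_per_surface                # 50,000K
bytes_per_disk = bytes_per_surface * platters * surfaces_per_platter    # 500,000K
cylinders = tracks_per_surface                                          # 2000
max_rotational_delay = 60 / rpm                                         # ~0.011 s
transfer_rate = bytes_per_track / max_rotational_delay                  # ~2,250 KB/s

print(bytes_per_track // 1024, bytes_per_surface // 1024, bytes_per_disk // 1024)
print(cylinders, round(max_rotational_delay, 4), round(transfer_rate / 1024))
```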
One of the earliest attempts at achieving high-throughput parallel I/O was the use of head-per-track disks (Fig. 18.7), with their multiple read/write heads capable of being activated at the same time. The seek (radial head movement) time of a disk can be approximately modeled by the equation

Seek time = a + b √(c – 1)

where c is the seek distance in cylinders. Typical values for the constants a and b are 2 and 0.4 ms, respectively.
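As a sanity check on these constants, here is a small Python sketch (the square-root form of the model is assumed, as in the equation above; the "one-third of the cylinders" average seek distance is a common rule of thumb, not a figure from the source):

```python
def seek_time_ms(c, a=2.0, b=0.4):
    """Approximate seek time for a head movement of c cylinders (0 means no seek)."""
    return 0.0 if c == 0 else a + b * (c - 1) ** 0.5

# An average seek on a 2000-cylinder disk covers roughly a third of the cylinders:
print(round(seek_time_ms(667), 1))   # ~12 ms, in line with a ~10 ms average seek
```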
● Provision of many disks that can be accessed concurrently is only part of the solution to
the I/O problem in parallel systems. Development of suitable file structures and access
strategies in order to reduce the I/O overhead is perhaps more important. A common
technique of file organization in parallel systems is declustering or striping of the files in
order to allow parallel access to different parts of a single file.
● The problem of parallel I/O has received considerable attention in recent years.
Processor speeds have improved by several orders of magnitude since parallel I/O
became an issue, while disk speeds have remained virtually stagnant.
● The use of large disk caches, afforded by higher capacity and cheaper semiconductor
memories, does not solve the entire I/O problem, just as ordinary caches only partially
compensate for slow main memories.
● Various technologies for parallel I/O, along with tools and standardization issues, are
under extensive scrutiny.
18.4 Parallel I/O Technology
(Figure: each platter's recording area is divided into sectors and tracks 0 through c – 1; read/write heads on arms, moved by an actuator, fly over the platters, which rotate about the spindle.)
Fig. 18.6 Moving-head magnetic disk elements.
Comprehensive info about disk memory: http://www.storageview.com/guide/



Access Time for a Disk
1. Head movement from the current position to the desired cylinder: seek time (0 to 10s of ms); Seek time = a + b(c – 1) + β(c – 1)^(1/2)
2. Disk rotation until the desired sector arrives under the head: rotational latency (0 to 10s of ms); Average rotational latency = 30,000 / rpm (in ms)
3. Disk rotation until the sector has passed under the head: data transfer time (< 1 ms); Data transfer time = Bytes / Data rate

The three components of disk access time. Disks that spin faster have a shorter average and worst-case access time.


Amdahl’s Rules of Thumb for System Balance
The need for high-capacity, high-throughput secondary (disk) memory:

Processor speed   RAM size   Disk I/O rate   Number of disks   Disk capacity   Number of disks
1 GIPS            1 GB       100 MB/s        1                 100 GB          1
1 TIPS            1 TB       100 GB/s        1000              100 TB          100
1 PIPS            1 PB       100 TB/s        1 Million         100 PB          100,000
1 EIPS            1 EB       100 PB/s        1 Billion         100 EB          100 Million

Rules of thumb: 1 RAM byte for each IPS; 1 I/O bit per second for each IPS; 100 disk bytes for each RAM byte.
(G = Giga, T = Tera, P = Peta, E = Exa)
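A small Python sketch of these rules of thumb (illustrative only; the per-disk figures of 100 MB/s bandwidth and 1 TB capacity are assumptions chosen so that the disk counts in the table are reproduced):

```python
def balanced_system(ips):
    """Amdahl's rules of thumb: 1 RAM byte and 1 I/O bit per second for every
    instruction per second, plus 100 disk bytes for each RAM byte.
    Per-disk bandwidth of 100 MB/s and capacity of 1 TB are assumed here."""
    ram_bytes = ips                           # 1 RAM byte per IPS
    io_bytes_per_s = ips // 10                # 1 I/O bit/s per IPS, rounded to ~0.1 B/s
    disk_bytes = 100 * ram_bytes              # 100 disk bytes per RAM byte
    disks_for_bandwidth = max(1, io_bytes_per_s // (100 * 10**6))
    disks_for_capacity = max(1, disk_bytes // 10**12)
    return ram_bytes, io_bytes_per_s, disk_bytes, disks_for_bandwidth, disks_for_capacity

# 1 TIPS -> 1 TB RAM, ~100 GB/s I/O, 100 TB disk, 1000 disks (bandwidth), 100 disks (capacity)
print(balanced_system(10**12))
```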



Growing Gap Between Disk and CPU Performance
(Figure: disk seek time, DRAM access time, and CPU cycle time plotted on a logarithmic time scale, from seconds down to picoseconds, over the calendar years 1980 to 2010; from Parhami's computer architecture textbook, Oxford, 2005.)
Fig. 20.11 Trends in disk, main memory, and CPU speeds.



Head-Per-Track Disks
Dedicated track heads eliminate the seek time (replacing it with the activation time for a head); multiple sets of heads reduce rotational latency.
(Figure: tracks 0, 1, ..., c – 1, each with its own read/write head.)
Fig. 18.7 Head-per-track disk concept.



19.1 Defects, Faults, . . . , Failures
Opportunities for fault tolerance in parallel systems: built-in spares, load redistribution, graceful degradation.
Difficulties in achieving fault tolerance: change in structure, bad units disturbing good ones.

The multilevel model of dependable computing (abstraction level / what is dealt with when deviant):
● Defect / Component: atomic parts
● Fault / Logic: signal values or decisions
● Error / Information: data or internal states
● Malfunction / System: functional behavior
● Degradation / Service: performance
● Failure / Result: outputs or actions

Fig. 19.1 System states (ideal, defective, faulty, erroneous, malfunctioning, degraded, failed) and state transitions in our multilevel model.


Analogy for the Multilevel Model
Many avoidance and tolerance methods are applicable to more than one level, but we deal with them at the level for which they are most suitable, or at which they have been most successfully applied.

Fig. 19.2 An analogy for the multilevel model of dependable computing: concentric reservoirs are analogues of the six model levels (defect is innermost), wall heights represent interlevel latencies, inlet valves represent avoidance techniques, and drain valves represent tolerance techniques.



19.2 Defect-Level Methods
Defects are caused in two ways (sideways and downward transitions into the defective state):
a. Design slips leading to defective components
b. Component wear and aging, or harsh operating conditions (e.g., interference)

A dormant (ineffective) defect is very hard to detect.

Methods for coping with defects during dormancy:
● Periodic maintenance
● Burn-in testing

Goals of defect tolerance methods:
● Improving the manufacturing yield
● Reconfiguration during system operation



Defect Tolerance Schemes for Linear Arrays
(Figure: a linear array of processors P0 through P3 plus a spare or defective unit, with test and bypass connections at the I/O ends; a defective processor is bypassed by the reconfiguration switches.)
Fig. 19.3 A linear array with a spare processor and reconfiguration switches.

(Figure: the same arrangement with a multiplexer embedded in each node instead of separate switches.)
Fig. 19.4 A linear array with a spare processor and embedded switching.



Defect Tolerance in 2D Arrays

Fig. 19.5 Two types of reconfiguration switching for 2D arrays.

Assumption: A malfunctioning processor can be bypassed in its row/column by means of a separate switching mechanism (not shown).



A Reconfiguration Scheme for 2D Arrays

(Figure: a redundant mesh with a spare row and a spare column.)
Fig. 19.6 A 5 × 5 working array salvaged from a 6 × 6 redundant mesh through reconfiguration switching.
Fig. 19.7 Seven defective processors and their associated compensation paths.



Limits of Defect Tolerance
(Figure: a set of three defective nodes, one of which cannot be accommodated by the compensation-path method; no compensation path exists for that faulty node.)

Extension: We can go beyond the 3-defect limit by providing spare rows on top and bottom and spare columns on either side.



Fault-Level Methods
● A hardware fault may be defined as any anomalous behavior of logic structures or substructures
that can compromise the correct signal values within a logic circuit.

● If the anomalous behavior results from implementing the logic function g rather than the intended
function ƒ, then the fault is related to a logical design or implementation slip. The alternative cause
of faults is the implementation of the correct logic functions with defective components.
● Defect-based faults can be classified according to duration (permanent,
intermittent/recurring, or transient), extent (local or distributed/catastrophic), and
effect (dormant or active).

● One way to protect computations against fault-induced errors is to use duplication with comparison of the two results (for single-fault detection) or triplication with two-out-of-three voting on the three results (for single-fault masking or tolerance); a small voting sketch appears after this list.
● In the first scheme, the decoding logic is duplicated along with the computation part to ensure that a single fault in the decoder does not go undetected. The encoder, on the other hand, remains a critical element whose faulty behavior will lead to undetected errors.
● In the second scheme, by combining the voting and encoding functions, one may be able to design an efficient self-checking voter-encoder. This three-channel computation strategy can be generalized to m channels in order to tolerate more faults. However, the cost overhead of a higher degree of replication becomes prohibitive.
● The above replication schemes are quite general and can be applied to any part
of a parallel system for any type of fault. However, the cost of full replication is
difficult to justify for most applications.
● Like the original butterfly network, the extra-stage butterfly network is self-routing.
● To see this, note that the connections between Columns q-1 and q are identical
to those between Columns 0 and 1.
● Hence, a processor that is aware of the fault status of the network switches can
append a suitable routing tag to the message and insert it into the network
through one of its two available access ports.
● From that point onward, the message finds its way through the network and
automatically avoids the faulty element(s).
● Of course, if more than one element is faulty, existence of a path is not
guaranteed.
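As a concrete illustration of the replication idea mentioned above (a generic Python sketch, not tied to any particular hardware voter):

```python
def voted_result(channel_results):
    """Two-out-of-three majority voting over three independently computed results.
    A single faulty channel is masked; if all three disagree, the fault is only detected."""
    a, b, c = channel_results
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one channel is faulty")

# One channel (here the third) produces a fault-induced error; the vote masks it.
print(voted_result([42, 42, 17]))   # -> 42
```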
Error-Level Methods
Errors are caused in two ways (sideways and downward transitions into
the erroneous state):

a. Design slips leading to incorrect initial state


b. Exercising of faulty circuits, leading to deviations in stored values or
machine state

Error detecting and correcting techniques cope with the various types of errors by encoding data into a redundant format called a code:
1. Single-error-detecting (SED) codes, e.g., parity
2. Checksum codes
3. Hamming SEC/DED (single-error-correcting / double-error-detecting) codes
(Figure: the data path INPUT → ENCODE → SEND → STORE / MANIPULATE → SEND → DECODE → OUTPUT; the portion between encoding and decoding is protected by the encoding, while the input and output ends are unprotected.)
For example, the sum of two even-parity numbers does not necessarily have
even parity.
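A quick Python check of this claim (purely illustrative):

```python
parity = lambda x: bin(x).count("1") % 2     # 0 means even parity

a, b = 0b0011, 0b0110          # both operands have even parity
print(parity(a), parity(b))    # 0 0
print(parity(a + b))           # 3 + 6 = 0b1001 happens to keep even parity ...
print(parity(0b0011 + 0b0101)) # ... but 3 + 5 = 0b1000 has odd parity
```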

We can address this through two approaches: product codes and algorithm-based error tolerance.
Product codes:

One approach is the use of codes that are closed under the data manipulation operations of interest. For example, product codes are closed under addition and subtraction. A product code with check modulus 15, say, represents the integers x and y as 15x and 15y, respectively; adding or subtracting these codewords directly yields the correct encoded form 15(x ± y) of the sum/difference x ± y. While product codes are not closed under multiplication, division, and square-rooting, it is possible to devise arithmetic algorithms for these operations that deal directly with the coded operands.
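A minimal Python sketch of 15x product-code checking (the modulus 15 matches the example above; the helper names are my own):

```python
A = 15                                  # check modulus

encode = lambda x: A * x
decode = lambda cx: cx // A
valid  = lambda cx: cx % A == 0         # every codeword must be a multiple of 15

cx, cy = encode(7), encode(9)
cs = cx + cy                            # addition performed directly on codewords
assert valid(cs) and decode(cs) == 16   # 15*7 + 15*9 = 15*16, so the sum checks out

cs_faulty = cs + 4                      # a fault perturbs the result by 4
print(valid(cs_faulty))                 # -> False: the error is detected
```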
Algorithm-based error tolerance:
