
17.1 Emulations Among Architectures


Why do we need emulations?
1. To quickly develop algorithms for new architectures without having to spend significant
resources.
2. To develop algorithms for architectures that are easier to program and then have them run on
machines that are realizable, easily accessible, or affordable.

An important use, other than practical porting, is the following.


Testing of various architectures
1. Take a base architecture for which a large number of algorithms already exist.
2. Check whether the newly developed architecture can emulate this base architecture.
3. If it can, determine the emulation latency (slowdown).
Example:
Let the base architecture be the hypercube, a well-known and very powerful architecture. There exists another architecture called cube-connected cycles (CCC). CCC can emulate a p-node hypercube with a slowdown of O(log p).
(Figure: a hypercube and the corresponding cube-connected cycles network.)
An important rule (transitivity of emulations):
If architecture A emulates architecture B with O(f(p)) slowdown and B in turn emulates C with O(g(p)) slowdown (assuming, for simplicity, that they all contain p processors), then A can emulate C with O(f(p) × g(p)) slowdown.
Three fairly general emulation results follow, involving classes of architectures rather than specific ones. Our first result is based on the embedding of one architecture, or graph, into another.
There are two factors to be considered here:
1. The cost of an embedding, given by a factor called expansion: the ratio of the number of nodes in the two graphs.
2. Performance factors:
◦ Dilation: the length of the longest path onto which any given edge is mapped
◦ Congestion: the maximum number of edges mapped onto the same edge
◦ Load factor: the maximum number of nodes mapped onto the same node
Calculation of slowdown:
slowdown ≤ dilation × congestion × load factor
Example:
Let us embed the p-node complete graph Kp into the two-node complete graph K2. The dilation is 1, the congestion is p²/4, and the load factor is p/2, so the worst-case slowdown is 1 × p²/4 × p/2 = p³/8.
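As a quick check of this bound, here is a minimal Python sketch (illustrative only; the mapping of Kp onto K2 is the obvious one that splits the nodes into two equal halves, and all names are my own):

```python
from itertools import combinations

def embedding_metrics(p):
    """Embed the complete graph K_p into K_2 by mapping node i to i % 2
    (a hypothetical but natural balanced split of the nodes)."""
    nodes = range(p)
    phi = {v: v % 2 for v in nodes}          # node mapping K_p -> K_2

    # Load factor: max number of K_p nodes mapped onto one K_2 node.
    load = max(sum(1 for v in nodes if phi[v] == t) for t in (0, 1))

    # Each K_p edge maps onto a path in K_2 of length 0 (same image) or 1.
    edges = list(combinations(nodes, 2))
    dilation = max(1 if phi[u] != phi[v] else 0 for u, v in edges)

    # Congestion: number of K_p edges routed over the single K_2 edge.
    congestion = sum(1 for u, v in edges if phi[u] != phi[v])

    return dilation, congestion, load

p = 8
d, c, l = embedding_metrics(p)
print(d, c, l, d * c * l)   # expect 1, p*p/4 = 16, p/2 = 4, and the bound p**3/8 = 64
```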
Our second general emulation result is that the
PRAM can emulate any degree-d network
with O(d) slowdown.
1. Slowdown is measured in terms of computation steps rather than real time; i.e., each PRAM
computation step, involving memory access by every processor, counts as one step.

Proof: Consider first the case in which the degree of each node is 2. Each link buffer corresponds to a location in shared memory, and each node can send or receive at most 2 messages at a time.
Continued:
So, for a degree-d graph:
1. The sending of up to d messages by each node can be done in d steps, with each step devoted to writing one message from the send buffer to a link buffer, i.e., O(d) time.
2. Similarly, the receiving part, which involves reading up to d messages from known locations in the shared memory, can be done in O(d) steps.
So the total time taken per emulated step is of order O(d).
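A minimal Python sketch of this emulation idea (the data structures and names are hypothetical; shared memory is modeled as a dictionary of link buffers): each emulated network step costs the PRAM processor up to d writes followed by up to d reads, hence O(d) slowdown.

```python
def emulate_step(node, neighbors, send_buf, shared_memory):
    """One emulated network step for a single node on a PRAM.
    shared_memory[(u, v)] plays the role of the link buffer for edge u -> v."""
    # Phase 1: up to d PRAM steps, one write per outgoing link buffer.
    for v in neighbors:
        shared_memory[(node, v)] = send_buf.get(v)   # None if nothing to send

    # Phase 2: up to d PRAM steps, one read per incoming link buffer.
    received = {v: shared_memory.get((v, node)) for v in neighbors}
    return received                                  # 2d = O(d) shared-memory accesses
```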


Our third and final general result is that the (wrapped) butterfly network can emulate any p-node degree-d network with O(d log p) slowdown.
As an example (refer to the figure), we augment the given graph with a link from Node 1 to Node 3, in order to make the node degree uniformly equal to 3, and represent the augmented graph by the bipartite graph, shown in the middle, which has nodes ui and vi in its two parts corresponding to Node i in the original graph.
Continued:
We next identify a perfect matching in this bipartite graph, say the heavy dotted edges on the right. This perfect matching, with its edges carrying the label 0, defines the permutation P0 = {1, 0, 3, 2} of the node set {0, 1, 2, 3}. Removing these edges, we arrive at a bipartite graph of uniform node degree 2 and the perfect matching defined by the light dotted edges labeled 1; this time we have identified the permutation P1 = {2, 3, 1, 0}. Finally, the remaining edges define the last perfect matching, or permutation, P2 = {3, 2, 0, 1}.
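This decomposition into permutations can be computed mechanically. Below is a hedged Python sketch (function names and the adjacency representation are my own) that peels d perfect matchings off a d-regular bipartite graph using Kuhn's augmenting-path algorithm; each peeled matching is one of the permutations that the butterfly then routes in O(log p) time.

```python
def peel_permutations(adj, d):
    """Decompose a d-regular bipartite (multi)graph into d perfect matchings.

    adj[u] is a list (with multiplicity) of right-side nodes adjacent to left node u.
    Returns d permutations, each mapping the left nodes (in sorted order) to right nodes.
    A d-regular bipartite graph is guaranteed to contain a perfect matching."""
    adj = {u: list(nbrs) for u, nbrs in adj.items()}   # local copy we can consume
    n = len(adj)
    permutations = []

    for _ in range(d):
        match_right = {}                 # right node -> matched left node

        def try_assign(u, visited):
            for v in adj[u]:
                if v in visited:
                    continue
                visited.add(v)
                if v not in match_right or try_assign(match_right[v], visited):
                    match_right[v] = u
                    return True
            return False

        for u in adj:
            try_assign(u, set())

        perm = {u: v for v, u in match_right.items()}   # left -> right
        assert len(perm) == n, "graph was not d-regular"
        permutations.append([perm[u] for u in sorted(adj)])

        # Remove the matched edges before peeling the next matching.
        for u, v in perm.items():
            adj[u].remove(v)

    return permutations

# The augmented 4-node, degree-3 example in the text yields three permutations
# such as [1, 0, 3, 2], [2, 3, 1, 0], [3, 2, 0, 1]; the exact split depends on
# which perfect matchings the search happens to find first.
```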
Continued:
1. A consequence of the above emulation result is that if a problem is efficiently parallelizable on any bounded-degree architecture, then it is efficiently parallelizable on the butterfly network, because emulating the bounded-degree network on the butterfly increases the running time by at most a factor of O(log p) (treating the degree as a constant).
2. More generally, for degree-d networks the factor is O(d log p).
17.2 Distributed Shared Memory
Randomized emulation of the p-processor PRAM on a p-node butterfly:
● Each of the p = 2^q (q + 1) butterfly nodes (2^q rows, q + 1 columns) consists of a router, a processor, and a memory module holding m/p memory locations.
● Use a hash function to map memory locations to modules; the p locations accessed in one PRAM step map to p modules, not necessarily distinct.
● With high probability, at most O(log p) of the p locations will be in modules located in the same row.
● Average slowdown = O(log p).

Fig. 17.2 Butterfly distributed-memory machine emulating the PRAM.
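A toy Python sketch of this randomized idea (purely illustrative; the hash function, address range, and parameter names are assumptions, and routing through the butterfly is abstracted away): it hashes the p addresses accessed in one PRAM step to rows and reports the worst per-row load, which should be on the order of log p.

```python
import random
from collections import Counter

def max_row_load(p, q, seed=0):
    """Hash p memory addresses to the 2**q rows of the butterfly and return the
    largest number of addresses that land in the same row (the potential hot spot)."""
    random.seed(seed)
    rows = 2 ** q
    addresses = random.sample(range(10**6), p)        # p distinct addresses in one step
    row_of = lambda addr: hash((seed, addr)) % rows   # stand-in for a universal hash
    per_row = Counter(row_of(a) for a in addresses)
    return max(per_row.values())

q = 8
p = 2 ** q * (q + 1)
print(max_row_load(p, q))   # typically a small multiple of log2(p), i.e., O(log p)
```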



PRAM Emulation with Butterfly MIN
Emulation of the p-processor PRAM on a (p log p)-node butterfly, with memory modules and processors connected to the two sides; O(log p) average slowdown.
● Less efficient than Fig. 17.2, which uses a smaller butterfly.
● By using p / (log p) physical processors to emulate the p-processor PRAM, this emulation scheme becomes quite efficient (pipeline the memory accesses of the log p virtual processors assigned to each physical processor).

Fig. 17.3 Distributed-memory machine, with a butterfly multistage interconnection network, emulating the PRAM.
Deterministic Shared-Memory Emulation
Deterministic emulation of the p-processor PRAM on a p-node butterfly (each node = router + processor + memory module holding m/p memory locations; the butterfly has 2^q rows and q + 1 columns).
● Store log2 m copies of the contents of each of the m memory locations.
● Time-stamp each updated value.
● A "write" is complete once a majority of the copies are updated.
● A "read" is satisfied when a majority of the copies are accessed and the one with the latest time stamp is used.
● Why it works: a few congested links won't delay the operation, because any write set and any read set (each a majority of the log2 m copies) must overlap in at least one up-to-date copy.
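A minimal Python sketch of the majority/time-stamp idea (class and method names are hypothetical, and the placement of copies on butterfly nodes is abstracted away):

```python
import time

class ReplicatedLocation:
    """Toy model of one PRAM memory location kept as c = log2(m) copies.
    A write succeeds after updating any majority of copies; a read consults
    any majority and returns the value with the newest time stamp."""
    def __init__(self, copies, initial=0):
        self.copies = [(0.0, initial)] * copies        # (time stamp, value)

    def write(self, value, reachable):
        # 'reachable' lists the copy indices we managed to update (a majority).
        assert len(reachable) > len(self.copies) // 2
        stamp = time.time()
        for i in reachable:
            self.copies[i] = (stamp, value)

    def read(self, reachable):
        assert len(reachable) > len(self.copies) // 2
        # Any two majorities intersect, so the newest time stamp is always seen.
        return max(self.copies[i] for i in reachable)[1]

loc = ReplicatedLocation(copies=5)
loc.write(42, reachable=[0, 1, 2])       # copies 3 and 4 sat behind congested links
print(loc.read(reachable=[2, 3, 4]))     # -> 42, found via the intersection at copy 2
```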



PRAM Emulation Using Information Dispersal
Instead of (log m)-fold replication of data, divide each data element into k pieces and
encode the pieces using a redundancy factor of 3, so that any k / 3 pieces suffice for
reconstructing the original data

(Figure: the original data word and its k pieces; the k pieces after encoding, approximately three times larger; a possible read set of size 2k/3 and a possible update set of size 2k/3, which overlap in up-to-date pieces; the reconstruction algorithm recovers the original data word from k/3 encoded pieces.)

Fig. 17.4 Illustrating the information dispersal approach to PRAM emulation with lower data redundancy.
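The following Python sketch shows one way such dispersal can work. It is a simplified, hypothetical Rabin-style scheme over a small prime field (not the specific code used in the source): the data word is viewed as k/3 symbols defining a polynomial, the k encoded pieces are evaluations of that polynomial, and any k/3 of them reconstruct the word by Lagrange interpolation.

```python
P = 257   # small prime field; a real scheme would use a larger, byte-oriented field

def lagrange_eval(points, x):
    """Evaluate, at x, the unique polynomial through the given (xi, yi) points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P   # Fermat inverse of den
    return total

def disperse(data, k):
    """Encode m = len(data) = k//3 symbols into k pieces; any m pieces suffice."""
    base = list(enumerate(data, start=1))        # (x, y) pairs defining the polynomial
    return [(x, lagrange_eval(base, x)) for x in range(1, k + 1)]

def reconstruct(pieces, m):
    """Recover the m data symbols from any m of the k encoded pieces."""
    subset = pieces[:m]
    return [lagrange_eval(subset, x) for x in range(1, m + 1)]

data = [10, 20, 30]                   # m = 3 data symbols
pieces = disperse(data, k=9)          # k = 9 encoded pieces, redundancy factor 3
print(reconstruct(pieces[4:7], m=3))  # any 3 surviving pieces give back [10, 20, 30]
```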



18.3 Multithreading and Latency Hiding

● The general idea of latency hiding methods is to provide each processor with some useful work
to do as it waits for remote memory access requests to be satisfied.

● Multithreading is a practical mechanism for latency hiding.



(Figure: a sequential thread compared with multithreaded parallel computation, showing thread computations, remote accesses, idle time, scheduling overhead, and synchronization overhead.)

Fig. 18.5 The concept of multithreaded parallel computation.



In the figure, a processor is working on a thread, and the thread requires a remote memory access before it can continue. The processor places the thread in a wait queue and switches to a different thread. This switching of context involves some overhead, which must be considerably less than the remote access latency for the scheme to be efficient. The processor is thus likely to have multiple register sets, one per active thread, to minimize that overhead.

The higher the remote memory access delay, the larger the number of threads required to successfully hide the latency. At some point in its computation, the newly activated thread may itself require a remote memory access; another context switch occurs, and so on.
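A back-of-the-envelope Python sketch of how many threads a processor needs to stay busy (the parameter names and the simple utilization model are my own): if each thread computes for R cycles between remote accesses, each context switch costs C cycles, and the remote latency is L cycles, the processor is fully utilized once (n - 1)(R + C) >= L.

```python
import math

def threads_needed(run_length, switch_cost, remote_latency):
    """Smallest number of resident threads that hides a remote access completely:
    while one thread waits for its access, the other n - 1 threads (plus the
    context switches between them) must cover the latency."""
    return 1 + math.ceil(remote_latency / (run_length + switch_cost))

# E.g., 20 cycles of work per access, a 2-cycle switch, and a 200-cycle remote latency:
print(threads_needed(20, 2, 200))   # -> 11 threads keep the processor busy
```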
● The access requests of the threads that are in the wait state may be completed out of order. Thus, each
access request is associated (tagged) with a unique thread identifier so that when the result is returned,
the processor knows which of the waiting threads should be activated. Thread identifiers are sometimes
called “continuations”.

● Application of multithreading is not restricted to a parallel processing environment but offers advantages
even on a uniprocessor.
Multithreading on a Single Processor
Here, the motivation is to reduce the performance impact of data dependencies
and branch misprediction penalty

(Figure: threads in memory feed the issue pipelines and function units, with bubbles, followed by the retirement and commit pipeline.)
Fig. 24.9 of Parhami's Computer Architecture text (2005)



Parallel I/O Technology

● An important requirement for highly parallel systems is the provision of high-bandwidth I/O capability. For some data-intensive applications, the high processing power of a massively parallel system is not of much value unless the I/O subsystem can keep up with the processors.
● Figure shows a multiple-platter high-capacity
magnetic disk. Each platter has two recording
surfaces and a read/write head mounted on an
arm. The access arms are attached to an actuator
that can move the heads radially in order to align
them with a desired cylinder (i.e., a set of tracks,
one per recording surface). A sector or disk block
is part of a track that forms the unit of data
transfer to/from the disk.

● Access to a block of data on disk typically consists of:
1. Cylinder seek: moving the heads to the desired
cylinder (seek time)
2. Sector alignment: waiting until the desired
sector is under the head (rotational latency).
3. Data transfer: reading the bytes out as they
pass under the head (transfer time).
Consider a disk with a sector size of 512 bytes, 2000 tracks per surface, 50 sectors per track, five double-sided platters, and
average seek time of 10 msec.

1. What is the capacity of a track in bytes? What is the capacity of each surface? What is the capacity of the disk?
2. How many cylinders does the disk have?
3. Give examples of valid block sizes. Is 256 bytes a valid block size? 2048? 51200?
4. If the disk platters rotate at 5400 rpm (revolutions per minute), what is the maximum rotational delay?
5. If one track of data can be transferred per revolution, what is the transfer rate?

1. bytes/track = bytes/sector × sectors/track = 512 × 50 = 25K
   bytes/surface = bytes/track × tracks/surface = 25K × 2000 = 50,000K
   bytes/disk = bytes/surface × surfaces/disk = 50,000K × 5 × 2 = 500,000K

2. The number of cylinders is the same as the number of tracks on each surface, which is 2000.

3. The block size should be a multiple of the sector size. We can see that 256 is not a valid block size, while 2048 is. 51200 is not a valid block size in this case, because the block size cannot exceed the size of a track, which is 25,600 bytes.

4. If the disk platters rotate at 5400 rpm, the time required for one complete rotation, which is the maximum rotational delay, is (1/5400) × 60 = 0.011 seconds. The average rotational delay is half of the rotation time, about 0.006 seconds.

5. The capacity of a track is 25K bytes. Since one track of data can be transferred per revolution, the data transfer rate is 25K / 0.011 ≈ 2,250 Kbytes/second.
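A quick Python check of these figures (a throwaway sketch; "K" in the answers above is taken as 1024 bytes):

```python
bytes_per_sector, sectors_per_track = 512, 50
tracks_per_surface, platters, surfaces_per_platter = 2000, 5, 2
rpm = 5400

bytes_per_track = bytes_per_sector * sectors_per_track                  # 25,600 B = 25K
bytes_per_surface = bytes_per_track * tracks_per_surface                # 50,000K
bytes_per_disk = bytes_per_surface * platters * surfaces_per_platter    # 500,000K
cylinders = tracks_per_surface                                          # 2000
max_rotational_delay = 60 / rpm                                         # ~0.011 s
transfer_rate = bytes_per_track / max_rotational_delay                  # ~2,250 KB/s

print(bytes_per_track // 1024, bytes_per_surface // 1024, bytes_per_disk // 1024)
print(cylinders, round(max_rotational_delay, 4), round(transfer_rate / 1024))
```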
One of the earliest attempts at achieving high-throughput parallel I/O was the use of head-per-track disks (Fig. 18.7), with their multiple read/write heads capable of being activated at the same time. The seek (radial head movement) time of a disk can be approximately modeled by the equation

Seek time = a + b √(c – 1)

where c is the seek distance in cylinders. Typical values for the constants a and b are 2 and 0.4 ms, respectively.
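As a sanity check on these constants, here is a small Python sketch (the square-root form of the model is assumed, as in the equation above; the "one-third of the cylinders" average seek distance is a common rule of thumb, not a figure from the source):

```python
def seek_time_ms(c, a=2.0, b=0.4):
    """Approximate seek time for a head movement of c cylinders (0 means no seek)."""
    return 0.0 if c == 0 else a + b * (c - 1) ** 0.5

# An average seek on a 2000-cylinder disk covers roughly a third of the cylinders:
print(round(seek_time_ms(667), 1))   # ~12 ms, in line with a ~10 ms average seek
```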
● Provision of many disks that can be accessed concurrently is only part of the solution to
the I/O problem in parallel systems. Development of suitable file structures and access
strategies in order to reduce the I/O overhead is perhaps more important. A common
technique of file organization in parallel systems is declustering or striping of the files in
order to allow parallel access to different parts of a single file.
● The problem of parallel I/O has received considerable attention in recent years.
Processor speeds have improved by several orders of magnitude since parallel I/O
became an issue, while disk speeds have remained virtually stagnant.
● The use of large disk caches, afforded by higher capacity and cheaper semiconductor
memories, does not solve the entire I/O problem, just as ordinary caches only partially
compensate for slow main memories.
● Various technologies for parallel I/O, along with tools and standardization issues, are
under extensive scrutiny.
18.4 Parallel I/O Technology
(Figure: each platter's recording area is divided into sectors and tracks 0 through c – 1; read/write heads on arms, moved by an actuator, fly over the platters, which rotate about the spindle.)
Fig. 18.6 Moving-head magnetic disk elements.
Comprehensive info about disk memory: http://www.storageview.com/guide/



Access Time for a Disk
1. Head movement from the current position to the desired cylinder: seek time (0 to 10s of ms); Seek time = a + b(c – 1) + β(c – 1)^(1/2)
2. Disk rotation until the desired sector arrives under the head: rotational latency (0 to 10s of ms); Average rotational latency = 30,000 / rpm (in ms)
3. Disk rotation until the sector has passed under the head: data transfer time (< 1 ms); Data transfer time = Bytes / Data rate

The three components of disk access time. Disks that spin faster have a shorter average and worst-case access time.


Amdahl’s Rules of Thumb for System Balance
The need for high-capacity, high-throughput secondary (disk) memory:

Processor speed   RAM size   Disk I/O rate   Number of disks   Disk capacity   Number of disks
1 GIPS            1 GB       100 MB/s        1                 100 GB          1
1 TIPS            1 TB       100 GB/s        1000              100 TB          100
1 PIPS            1 PB       100 TB/s        1 Million         100 PB          100,000
1 EIPS            1 EB       100 PB/s        1 Billion         100 EB          100 Million

Rules of thumb: 1 RAM byte for each IPS; 1 I/O bit per second for each IPS; 100 disk bytes for each RAM byte.
(G = Giga, T = Tera, P = Peta, E = Exa)
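A small Python sketch of these rules of thumb (illustrative only; the per-disk figures of 100 MB/s bandwidth and 1 TB capacity are assumptions chosen so that the disk counts in the table are reproduced):

```python
def balanced_system(ips):
    """Amdahl's rules of thumb: 1 RAM byte and 1 I/O bit per second for every
    instruction per second, plus 100 disk bytes for each RAM byte.
    Per-disk bandwidth of 100 MB/s and capacity of 1 TB are assumed here."""
    ram_bytes = ips                           # 1 RAM byte per IPS
    io_bytes_per_s = ips // 10                # 1 I/O bit/s per IPS, rounded to ~0.1 B/s
    disk_bytes = 100 * ram_bytes              # 100 disk bytes per RAM byte
    disks_for_bandwidth = max(1, io_bytes_per_s // (100 * 10**6))
    disks_for_capacity = max(1, disk_bytes // 10**12)
    return ram_bytes, io_bytes_per_s, disk_bytes, disks_for_bandwidth, disks_for_capacity

# 1 TIPS -> 1 TB RAM, ~100 GB/s I/O, 100 TB disk, 1000 disks (bandwidth), 100 disks (capacity)
print(balanced_system(10**12))
```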



Growing Gap Between Disk and CPU Performance
(Figure: disk seek time, DRAM access time, and CPU cycle time plotted on a logarithmic time scale, from seconds down to picoseconds, over the calendar years 1980 to 2010; from Parhami's computer architecture textbook, Oxford, 2005.)
Fig. 20.11 Trends in disk, main memory, and CPU speeds.



Head-Per-Track Disks
Dedicated track heads eliminate the seek time (replacing it with the activation time for a head); multiple sets of heads reduce rotational latency.
(Figure: tracks 0, 1, ..., c – 1, each with its own read/write head.)
Fig. 18.7 Head-per-track disk concept.



19.1 Defects, Faults, . . . , Failures
Opportunities for fault tolerance in parallel systems: built-in spares, load redistribution, graceful degradation.
Difficulties in achieving fault tolerance: change in structure, bad units disturbing good ones.

The multilevel model of dependable computing (abstraction level / what is dealt with when deviant):
● Defect / Component: atomic parts
● Fault / Logic: signal values or decisions
● Error / Information: data or internal states
● Malfunction / System: functional behavior
● Degradation / Service: performance
● Failure / Result: outputs or actions

Fig. 19.1 System states (ideal, defective, faulty, erroneous, malfunctioning, degraded, failed) and state transitions in our multilevel model.


Analogy for the Multilevel Model
Many avoidance and tolerance methods are applicable to more than one level, but we deal with them at the level for which they are most suitable, or at which they have been most successfully applied.

Fig. 19.2 An analogy for the multilevel model of dependable computing: concentric reservoirs are analogues of the six model levels (defect is innermost), wall heights represent interlevel latencies, inlet valves represent avoidance techniques, and drain valves represent tolerance techniques.



19.2 Defect-Level Methods
Defects are caused in two ways (sideways and downward transitions into the defective state):
a. Design slips leading to defective components
b. Component wear and aging, or harsh operating conditions (e.g., interference)

A dormant (ineffective) defect is very hard to detect.

Methods for coping with defects during dormancy:
● Periodic maintenance
● Burn-in testing

Goals of defect tolerance methods:
● Improving the manufacturing yield
● Reconfiguration during system operation



Defect Tolerance Schemes for Linear Arrays
(Figure: a linear array of processors P0 through P3 plus a spare or defective unit, with test and bypass connections at the I/O ends; a defective processor is bypassed by the reconfiguration switches.)
Fig. 19.3 A linear array with a spare processor and reconfiguration switches.

(Figure: the same arrangement with a multiplexer embedded in each node instead of separate switches.)
Fig. 19.4 A linear array with a spare processor and embedded switching.



Defect Tolerance in 2D Arrays

Fig. 19.5 Two types of reconfiguration switching for 2D arrays.

Assumption: A malfunctioning processor can be bypassed in its row/column by means of a separate switching mechanism (not shown).



A Reconfiguration Scheme for 2D Arrays

(Figure: a redundant mesh with a spare row and a spare column.)
Fig. 19.6 A 5 × 5 working array salvaged from a 6 × 6 redundant mesh through reconfiguration switching.
Fig. 19.7 Seven defective processors and their associated compensation paths.



Limits of Defect Tolerance
(Figure: a set of three defective nodes, one of which cannot be accommodated by the compensation-path method; no compensation path exists for that faulty node.)

Extension: We can go beyond the 3-defect limit by providing spare rows on top and bottom and spare columns on either side.



Fault-Level Methods
● A hardware fault may be defined as any anomalous behavior of logic structures or substructures
that can compromise the correct signal values within a logic circuit.

● If the anomalous behavior results from implementing the logic function g rather than the intended
function ƒ, then the fault is related to a logical design or implementation slip. The alternative cause
of faults is the implementation of the correct logic functions with defective components.
● Defect-based faults can be classified according to duration (permanent,
intermittent/recurring, or transient), extent (local or distributed/catastrophic), and
effect (dormant or active).

● One way to protect computations against fault-induced errors is to use duplication with comparison of the two results (for single-fault detection) or triplication with two-out-of-three voting on the three results (for single-fault masking or tolerance); a small voting sketch appears after this list.
● In the first scheme, the decoding logic is duplicated along with the computation part to ensure that a single fault in the decoder does not go undetected. The encoder, on the other hand, remains a critical element whose faulty behavior will lead to undetected errors.
● In the second scheme, by combining the voting and encoding functions, one may be able to design an efficient self-checking voter-encoder. This three-channel computation strategy can be generalized to m channels in order to tolerate more faults. However, the cost overhead of a higher degree of replication becomes prohibitive.
● The above replication schemes are quite general and can be applied to any part
of a parallel system for any type of fault. However, the cost of full replication is
difficult to justify for most applications.
● Like the original butterfly network, the extra-stage butterfly network is self-routing.
● To see this, note that the connections between Columns q-1 and q are identical
to those between Columns 0 and 1.
● Hence, a processor that is aware of the fault status of the network switches can
append a suitable routing tag to the message and insert it into the network
through one of its two available access ports.
● From that point onward, the message finds its way through the network and
automatically avoids the faulty element(s).
● Of course, if more than one element is faulty, existence of a path is not
guaranteed.
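As a concrete illustration of the replication idea mentioned above (a generic Python sketch, not tied to any particular hardware voter):

```python
def voted_result(channel_results):
    """Two-out-of-three majority voting over three independently computed results.
    A single faulty channel is masked; if all three disagree, the fault is only detected."""
    a, b, c = channel_results
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one channel is faulty")

# One channel (here the third) produces a fault-induced error; the vote masks it.
print(voted_result([42, 42, 17]))   # -> 42
```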
Error-Level Methods
Errors are caused in two ways (sideways and downward transitions into
the erroneous state):

a. Design slips leading to incorrect initial state


b. Exercising of faulty circuits, leading to deviations in stored values or
machine state

Error detecting and correcting techniques cope with the various types of errors by encoding data into a redundant format called a code:
1. Single-error-detecting (SED) codes, e.g., parity
2. Checksum codes
3. Hamming SEC/DED (single-error-correcting / double-error-detecting) codes
(Figure: the data path INPUT → ENCODE → SEND → STORE / MANIPULATE → SEND → DECODE → OUTPUT; the portion between encoding and decoding is protected by the encoding, while the input and output ends are unprotected.)
For example, the sum of two even-parity numbers does not necessarily have
even parity.
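A quick Python check of this claim (purely illustrative):

```python
parity = lambda x: bin(x).count("1") % 2     # 0 means even parity

a, b = 0b0011, 0b0110          # both operands have even parity
print(parity(a), parity(b))    # 0 0
print(parity(a + b))           # 3 + 6 = 0b1001 happens to keep even parity ...
print(parity(0b0011 + 0b0101)) # ... but 3 + 5 = 0b1000 has odd parity
```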

We can address this through two approaches: product codes and algorithm-based error tolerance.
Product codes:

One approach is the use of codes that are closed under the data manipulation operations of interest. For example, product codes are closed under addition and subtraction. A product code with check modulus 15, say, represents the integers x and y as 15x and 15y, respectively; adding or subtracting these codewords directly yields the correct encoded form 15(x ± y) of the sum/difference x ± y. While product codes are not closed under multiplication, division, and square-rooting, it is possible to devise arithmetic algorithms for these operations that deal directly with the coded operands.
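A minimal Python sketch of 15x product-code checking (the modulus 15 matches the example above; the helper names are my own):

```python
A = 15                                  # check modulus

encode = lambda x: A * x
decode = lambda cx: cx // A
valid  = lambda cx: cx % A == 0         # every codeword must be a multiple of 15

cx, cy = encode(7), encode(9)
cs = cx + cy                            # addition performed directly on codewords
assert valid(cs) and decode(cs) == 16   # 15*7 + 15*9 = 15*16, so the sum checks out

cs_faulty = cs + 4                      # a fault perturbs the result by 4
print(valid(cs_faulty))                 # -> False: the error is detected
```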
Algorithm-based error tolerance:
