Cube-Connected Cycles
An Important Rule:
If architecture A emulates architecture B with O(f(p)) slowdown and B in turn emulates C with O(g(p)) slowdown (assuming, for simplicity, that they all contain p processors), then A can emulate C with O(f(p) × g(p)) slowdown.
(In other words, emulation slowdown is transitive, with the individual slowdowns multiplying.)
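As a purely formal illustration (the particular slowdown functions below are hypothetical, not results from the text): if A emulates B with O(log p) slowdown and B emulates C with O(√p) slowdown, the rule composes them:

```latex
\underbrace{O(\log p)}_{A \text{ emulates } B} \times \underbrace{O(\sqrt{p})}_{B \text{ emulates } C}
  \;=\; O\!\left(\sqrt{p}\,\log p\right) \quad (A \text{ emulates } C).
```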
Three fairly general emulation results
involving classes of, rather than specific,
architectures.
Our first result is based on embedding one architecture or graph into another. There are two factors to be considered here:
1. The cost of an embedding: given by a factor called expansion, which is the ratio of the numbers of nodes in the two graphs.
2. Performance factors, which are:
◦ Dilation: the length of the longest path onto which any given edge is mapped
◦ Congestion: the maximum number of edges mapped onto the same edge
◦ Load factor: the maximum number of nodes mapped onto the same node
Calculation of slowdown:
slowdown ≤ dilation × congestion × load factor
Example:
Let us emulate the p-node complete graph K_p on the two-node graph K_2, mapping half of the nodes onto each node of K_2. The dilation is 1 (every K_p edge maps onto a path of length at most 1), the congestion is p²/4 (the p/2 × p/2 cross edges all map onto the single edge of K_2), and the load factor is p/2. The worst-case slowdown is therefore 1 × p²/4 × p/2 = p³/8.
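The bound can be checked mechanically. Below is a small sketch (my own illustration, not from the text) that computes the three metrics for the natural mapping of K_p onto K_2, sending the first p/2 nodes to one side and the rest to the other:

```python
# Embedding K_p into K_2: nodes 0..p/2-1 -> node 0, nodes p/2..p-1 -> node 1.
p = 8
side = {v: 0 if v < p // 2 else 1 for v in range(p)}

# Load factor: maximum number of K_p nodes mapped onto one K_2 node.
load = max(list(side.values()).count(n) for n in (0, 1))

# Congestion: every K_p edge with endpoints on different sides must use
# the single edge of K_2.
congestion = sum(1 for u in range(p) for v in range(u + 1, p)
                 if side[u] != side[v])

dilation = 1  # each K_p edge maps onto a path of length at most 1

print(load, congestion, dilation)                       # p/2 = 4, p^2/4 = 16, 1
print("slowdown bound:", dilation * congestion * load)  # p^3/8 = 64
```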
Our second general emulation result is that the
PRAM can emulate any degree-d network
with O(d) slowdown.
1. Slowdown is measured in terms of computation steps rather than real time; i.e., each PRAM
computation step, involving memory access by every processor, counts as one step.
We augment this graph with a link from Node 1 to Node 3, in order to make the node degree uniformly equal to 3, and represent the augmented graph by the bipartite graph shown in the middle, which has nodes u_i and v_i in its two parts corresponding to Node i in the original graph.
[Figure: a butterfly network with 2^q rows and q + 1 columns; each node = router + processor + memory; a write set and a read set among the log2 m copies of a memory location.]
PRAM emulation with redundant memory copies:
● Store log2 m copies of the contents of each of the m memory locations.
● Time-stamp each updated value.
● A "write" is complete once a majority of the copies are updated.
● A "read" is satisfied when a majority of the copies are accessed and the one with the latest time stamp is used.
Why it works: a few congested links won't delay the operation, and because any two majorities among the copies must intersect, every read is guaranteed to see the latest written value.
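A minimal executable sketch of the majority idea (names and structure are my own; the text only describes the protocol): each memory location keeps several timestamped copies, writes update a majority, and reads consult a majority and take the freshest copy.

```python
import random

COPIES = 5                     # stands in for the log2(m) copies of one location
store = [(0, None)] * COPIES   # each copy is a (timestamp, value) pair
clock = 0                      # global write counter used as the time stamp

def write(value):
    """A write completes once a majority of the copies are updated."""
    global clock
    clock += 1
    for i in random.sample(range(COPIES), COPIES // 2 + 1):  # any majority
        store[i] = (clock, value)

def read():
    """Access a majority of the copies and use the latest time stamp."""
    chosen = random.sample(range(COPIES), COPIES // 2 + 1)
    return max(store[i] for i in chosen)[1]

write("A"); write("B")
assert read() == "B"  # any read majority intersects the last write majority
```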
[Figure labels: the k pieces after encoding (approximately three times larger in total); a possible read set of size 2k/3 and a possible update set of size 2k/3, whose overlap contains up-to-date pieces; a reconstruction algorithm recovers the original data word from k/3 encoded pieces.]
Fig. 17.4 Illustrating the information dispersal approach to PRAM emulation with lower data redundancy.
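The dispersal idea of Fig. 17.4 can be made concrete with a polynomial code; the sketch below is my own illustration (the text gives no code). The m = k/3 data bytes define a degree-(m − 1) polynomial over GF(257); the k pieces are its values at k points, and any m pieces recover the data by Lagrange interpolation.

```python
P = 257  # prime field large enough to hold any byte value

def _interp_eval(points, x):
    """Evaluate at x the unique polynomial through `points` (Lagrange, mod P)."""
    total = 0
    for xi, yi in points:
        num = den = 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def disperse(data, k):
    """Encode m = len(data) bytes into k pieces; any m pieces suffice."""
    pts = list(enumerate(data))        # polynomial f with f(j) = data[j]
    return [_interp_eval(pts, x) for x in range(k)]  # piece x is f(x)

def reconstruct(pieces, m):
    """pieces: any m (index, value) pairs; returns the original bytes."""
    return bytes(_interp_eval(pieces[:m], j) for j in range(m))

data = b"PRAM"                            # m = 4 data bytes
pieces = disperse(data, 3 * len(data))    # k = 12 pieces: ~3x total size
subset = [(i, pieces[i]) for i in (2, 5, 7, 11)]  # any 4 of the 12
assert reconstruct(subset, len(data)) == data
```

Each piece here is a single field element; a real implementation would disperse long blocks and pack piece values more carefully, but the 3× size blow-up and the any-k/3 reconstruction match the figure.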
● The general idea of latency hiding methods is to provide each processor with some useful work to do as it waits for remote memory access requests to be satisfied. The higher the remote memory access delay, the larger the number of threads required to successfully hide the latency (see the model sketched below). At some point in its computation, the newly activated thread may itself require a remote memory access; another context switch then occurs, and so on.
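A rough back-of-the-envelope model (mine, not the text's) makes the thread count quantitative: if each thread computes for R cycles between remote accesses, a remote access takes L cycles, and a context switch costs C cycles, the processor can be kept continuously busy when the number of threads n satisfies

```latex
n \;\ge\; \frac{R + L}{R + C},
```

which reduces to roughly 1 + L/R threads when the switch cost is negligible, growing with the latency L.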
● The access requests of the threads that are in the wait state may be completed out of order. Thus, each
access request is associated (tagged) with a unique thread identifier so that when the result is returned,
the processor knows which of the waiting threads should be activated. Thread identifiers are sometimes
called “continuations”.
● Application of multithreading is not restricted to a parallel processing environment but offers advantages
even on a uniprocessor.
Multithreading on a Single Processor
Here, the motivation is to reduce the performance impact of data dependencies and the branch misprediction penalty.
[Figure: bubbles (idle slots) in the function units of a pipelined processor.]
1. What is the capacity of a track in bytes? What is the capacity of each surface? What is the capacity of the disk?
2. How many cylinders does the disk have?
3. Give examples of valid block sizes. Is 256 bytes a valid block size? 2048? 51200?
4. If the disk platters rotate at 5400 rpm (revolutions per minute), what is the maximum rotational delay?
5. If one track of data can be transferred per revolution, what is the transfer rate?
2. The number of cylinders is the same as the number of tracks on each platter, which is 2000.
5. The capacity of a track is 25 Kbytes. One revolution at 5400 rpm takes 60/5400 ≈ 0.0111 s; since one track of data can be transferred per revolution, the data transfer rate is 25K/0.0111 ≈ 2,250 Kbytes/second.
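A small sketch reproducing the rotational-delay and transfer-rate arithmetic (the parameter values come from the answers above; the helper itself is my own):

```python
rpm = 5400
track_kbytes = 25                # track capacity, from answers 1 and 5

rev_time = 60 / rpm              # one revolution takes ~0.0111 s
max_rotational_delay = rev_time  # worst case: wait a full revolution
transfer_rate = track_kbytes / rev_time  # one track transferred per revolution

print(f"max rotational delay = {max_rotational_delay * 1000:.1f} ms")  # ~11.1 ms
print(f"transfer rate = {transfer_rate:.0f} Kbytes/s")                 # ~2250
```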
One of the earliest attempts at achieving high-throughput parallel I/O was the use of head-per-track disks (Fig. 18.7), with their multiple read/write heads capable of being activated at the same time. The seek (radial head movement) time of a disk can be approximately modeled by the equation
Seek time = a + b(c – 1) + β √(c – 1)
where c is the number of cylinders traversed and a, b, β are device-dependent constants.
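To get a feel for the model, here is a tiny sketch with illustrative constants (the values of a, b, β below are made up, not measurements):

```python
from math import sqrt

a, b, beta = 2.0, 0.01, 0.1  # milliseconds; illustrative values only

def seek_time(c):
    """Approximate time to move the head across c cylinders."""
    if c <= 0:
        return 0.0
    return a + b * (c - 1) + beta * sqrt(c - 1)

print(seek_time(1))     # minimum seek: just the fixed overhead a = 2.0 ms
print(seek_time(2000))  # full-stroke seek across ~2000 cylinders
```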
[Figure labels: platter, spindle, arm, actuator, direction of rotation, and the recording area with tracks 0, 1, 2, …, c – 1.]
Fig. 18.6 Moving-head magnetic disk elements.
Comprehensive info about disk memory: http://www.storageview.com/guide/
[Figure: the three components of disk access time, i.e., seek time, rotational latency, and data transfer time.]
Disks that spin faster have a shorter average and worst-case access time.
[Figure: DRAM access time, on a scale from seconds (s) through nanoseconds (ns) to picoseconds (ps), plotted against calendar year, 1980–2010; annotations give the balance rules of thumb: 1 RAM byte per IPS, 1 I/O bit per second per IPS, and 100 disk bytes per RAM byte; the prefixes Giga, Tera, Peta, Exa mark the capacity scale.]
Fig. 20.11 Trends in disk, main memory, and CPU speeds.
[Figure labels: multiplexers (mux) route around spare or defective processors P0–P3; one spare row and one spare column.]
Fig. 19.6 A 5 × 5 working array salvaged from a 6 × 6 redundant mesh through reconfiguration switching.
Fig. 19.7 Seven defective processors and their associated compensation paths; no compensation path exists for one faulty node.
Extension: We can go beyond the 3-defect limit by providing spare rows on top and
bottom and spare columns on either side
● If the anomalous behavior results from implementing the logic function g rather than the intended
function ƒ, then the fault is related to a logical design or implementation slip. The alternative cause
of faults is the implementation of the correct logic functions with defective components.
● Defect-based faults can be classified according to duration (permanent,
intermittent/recurring, or transient), extent (local or distributed/catastrophic), and
effect (dormant or active).
Types of errors:
Error detecting and correcting techniques (these encode data into a redundant format called a code):
1. Parity
2. Checksum
3. Hamming SEC/DED
[Figure: the path of a data word through a coded system: INPUT → ENCODE → SEND → DECODE → OUTPUT.]
For example, the sum of two even-parity numbers does not necessarily have even parity.
One approach is the use of codes that are closed under the data manipulation operations of interest. For example, product codes are closed under addition and subtraction. A product code with check modulus 15, say, represents the integers x and y as 15x and 15y, respectively; adding or subtracting these code words directly yields the correct encoded form 15(x ± y) of the sum/difference x ± y. While product codes are not closed under multiplication, division, and square-rooting, it is possible to devise arithmetic algorithms for these operations that deal directly with the coded operands.
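A minimal sketch (my own illustration) of this closure property for the modulus-15 product code: code words are multiples of 15, sums and differences of code words remain code words, and any word not divisible by 15 signals an error.

```python
A = 15  # check modulus of the product code

def encode(x):
    return A * x                  # integer x is represented as 15x

def check_and_decode(word):
    if word % A != 0:             # residue check: valid words are multiples of 15
        raise ValueError("error detected")
    return word // A

cx, cy = encode(7), encode(5)
s = cx + cy                       # add the encoded operands directly
assert check_and_decode(s) == 12  # 15*7 + 15*5 = 15*12, still a code word

assert (s + 1) % A != 0           # a corrupted word fails the residue check
```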
Algorithm-based error tolerance: