1/31/2012
- Control flow machines: the order of program execution is explicitly stated in user programs.
- Dataflow machines: instructions are selected for execution by determining operand availability.
- Reduction machines: an instruction's execution is triggered by the demand for its results.
Basic Definition
- Control flow: conventional computation; a token of control indicates when a statement should be executed.
- Dataflow: eager evaluation; statements are executed when all their operands are available.
- Reduction: lazy evaluation; statements are executed only when their result is required for another computation.

Advantages
- Control flow: (1) full control; (2) complex data and control structures are easily implemented.
- Dataflow: (1) very high potential parallelism; (2) high throughput; (3) free from side effects.
- Reduction: (1) only required instructions are executed; (2) high degree of parallelism; (3) easy manipulation of data structures.

Disadvantages
- Control flow: less efficient; side effects from shared variables hinder parallel processing.
- Dataflow: high control overhead; time lost waiting for unneeded arguments; difficulty in manipulating data structures.
- Reduction: cannot share objects with changing local state; time is needed to propagate demand tokens.
Since variables are updated by many instructions, there may be side effects on other instructions. These side effects frequently prevent parallel processing. Single processor systems are inherently sequential.
- Instructions in dataflow machines are unordered and can be executed as soon as their operands are available; data is held in the instructions themselves.
- Data tokens are passed from an instruction to its dependents to trigger execution.
A Dataflow Architecture - 1
- The Arvind machine (MIT) has N PEs and an N-by-N interconnection network.
- Each PE has a token-matching mechanism that dispatches only instructions with data tokens available.
- Each datum is tagged with:
  - the address of the instruction to which it belongs
  - the context in which the instruction is being executed
- Tagged tokens enter a PE through a local path (pipelined); they can also be passed to other PEs through the interconnection network.
A Dataflow Architecture - 2
- Instruction address(es) effectively replace the program counter in a control flow machine.
- The context identifier effectively replaces the frame base register in a control flow machine.
- Since the dataflow machine matches the data tags from one instruction with its successors, synchronized instruction execution is implicit.
A Dataflow Architecture - 3
- An I-structure in each PE is provided to eliminate excessive copying of data structures.
- Each word of the I-structure has a two-bit tag indicating whether the value is empty, full, or has pending read requests.
- This is a retreat from the pure dataflow approach.
- Example 2.6 shows a control flow and dataflow comparison.
- Special compiler technology is needed for dataflow machines.
Demand-Driven Mechanisms
- Data-driven machines select instructions for execution based on the availability of their operands; this is essentially a bottom-up approach.
- Demand-driven machines take a top-down approach, attempting to execute the instruction (a demander) that yields the final result. This triggers the execution of instructions that yield its operands, and so forth.
- The demand-driven approach matches naturally with functional programming languages (e.g. LISP and SCHEME).
Summary
- Control flow machines give complete control, but are less efficient than other approaches.
- Dataflow (eager evaluation) machines have high potential parallelism and throughput and freedom from side effects, but have high control overhead, lose time waiting for unneeded arguments, and have difficulty manipulating data structures.
- Reduction (lazy evaluation) machines have high parallelism potential, easy manipulation of data structures, and execute only required instructions. But they do not share objects with changing local state, and they require time to propagate demand tokens.
Question 2 (4+4+2)
- Write notes on the following:
  - Amdahl's law and efficiency of a system
  - Utilization of a system and quality of parallelism
  - Redundancy
Amdahl's Law
- Assume Ri = i, and the weights w = (α, 0, ..., 0, 1 - α).
- Basically this means the system is used sequentially (with probability α) or all n processors are used (with probability 1 - α).
- This yields the speedup equation known as Amdahl's law:

      S_n = n / (1 + (n - 1)α)

- The implication is that the best speedup possible is 1/α, regardless of n, the number of processors.
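As a quick check of the formula above, this sketch (not from the original slides) evaluates the speedup for a hypothetical serial fraction α = 0.1:

```python
def amdahl_speedup(n, alpha):
    """Speedup of an n-processor system when a fraction `alpha`
    of the work is strictly sequential (Amdahl's law)."""
    return n / (1 + (n - 1) * alpha)

# With alpha = 0.1, 16 processors give roughly 6.4x speedup,
# while even a huge machine stays below the 1/alpha = 10 ceiling.
print(amdahl_speedup(16, 0.1))
print(amdahl_speedup(10**6, 0.1))
print(1 / 0.1)  # the asymptotic limit
```

Note how adding processors beyond a point buys almost nothing: the sequential fraction dominates.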
System Efficiency 1
- Assume the following definitions:
  - O(n) = total number of unit operations performed by an n-processor system in completing a program P.
  - T(n) = execution time required to execute the program P on an n-processor system.
- O(n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by a constant factor.
- If we define O(1) = T(1), then it is logical to expect that T(n) < O(n) when n > 1 if the program P is able to make any use at all of the extra processor(s).
System Efficiency 2
- Clearly, the speedup factor (how much faster the program runs with n processors) can now be expressed as

      S(n) = T(1) / T(n)

  Recall that we expect T(n) < T(1), so S(n) ≥ 1.
- System efficiency is defined as

      E(n) = S(n) / n = T(1) / (n × T(n))

  It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup. Thus 1/n ≤ E(n) ≤ 1. The value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.
Redundancy
- The redundancy in a parallel computation is defined as

      R(n) = O(n) / O(1)

- R(n) = 1 when O(n) is independent of the number of processors n. This is the ideal case: the parallelism in the program is carried over to the hardware implementation without extra operations being performed.
- R(n) = n when all processors perform the same number of operations as when only a single processor is used; this implies that n completely redundant computations are performed!
System Utilization
- System utilization is defined as

      U(n) = R(n) × E(n) = O(n) / (n × T(n))

  It indicates the degree to which the system resources were kept busy during execution of the program.
- Since 1 ≤ R(n) ≤ n and 1/n ≤ E(n) ≤ 1, the best possible value for U(n) is 1, and the worst is 1/n:
  - 1/n ≤ E(n) ≤ U(n) ≤ 1
  - 1 ≤ R(n) ≤ 1/E(n) ≤ n
Quality of Parallelism
- The quality of a parallel computation is defined as

      Q(n) = S(n) × E(n) / R(n) = T^3(1) / (n × T^2(n) × O(n))

- This measure is directly related to speedup (S) and efficiency (E), and inversely related to redundancy (R).
- The quality measure is bounded by the speedup; that is, Q(n) ≤ S(n).
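The indices above can be cross-checked numerically. The sketch below uses hypothetical measurements T(1) = 100, T(n) = 20, and O(n) = 120 on n = 8 processors; the function name is illustrative, not from the slides:

```python
def metrics(T1, Tn, On, n):
    """Performance indices from the definitions above, given T(1),
    T(n), O(n) (with O(1) = T(1) by convention) for n processors."""
    S = T1 / Tn   # speedup S(n) = T(1) / T(n)
    E = S / n     # efficiency E(n) = S(n) / n
    R = On / T1   # redundancy R(n) = O(n) / O(1), and O(1) = T(1)
    U = R * E     # utilization U(n) = O(n) / (n * T(n))
    Q = S * E / R # quality of parallelism
    return S, E, R, U, Q

# Hypothetical run: T(1)=100, T(8)=20, O(8)=120 unit operations.
S, E, R, U, Q = metrics(100, 20, 120, 8)
```

With these numbers S = 5, E = 0.625, R = 1.2, U = 0.75, and the bounds 1/n ≤ E ≤ U ≤ 1 and Q ≤ S from the slides all hold.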
Question 3
- Explain superscalar processors.
Superscalar Processors
- This subclass of RISC processors allows multiple instructions to be issued simultaneously during each cycle.
- The effective CPI of a superscalar processor should be less than that of a generic scalar RISC processor.
- Clock rates of scalar RISC and superscalar RISC machines are similar.
- Cache lookup occurs after address translation in the TLB or MMU (no aliasing)
- After a cache miss, load a block from main memory
- Use either a write-back or write-through policy
- Disadvantage:
  - Slowdown in accessing the cache from the OS kernel
Aliasing Problem
- Different logically addressed data have the same index/tag in the cache.
- Confusion results if two or more processors access the same physical cache location.
- Flushing the cache when aliasing occurs works, but leads to slowdown.
- Alternatively, apply special tagging with a process key or with a physical address.
- A cache is characterized by its organization and management policy.
- Blocks in caches are called block frames; blocks in main memory are simply blocks.
- Block frames Bi (i ≤ m) and blocks Bj (j ≤ n), with m << n, m = 2^r, n = 2^s.
- Each block has b = 2^w words, for a cache total of m×b = 2^(r+w) words and a main memory of n×b = 2^(s+w) words.
- Placement is by a modulo-m function: Bj → Bi if i = j mod m.
- There is a unique block frame Bi into which each Bj loads.
- This is the simplest organization to implement.
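A minimal sketch of the direct-mapped placement rule, using the m = 2^r frames and b = 2^w words-per-block notation from the slide above (the concrete sizes here are illustrative):

```python
# Direct-mapped placement: memory block Bj can only occupy frame
# Bi with i = j mod m.  A word address then splits into
# (tag | frame index | word offset) fields of s-r, r, and w bits.
m = 2 ** 3   # 8 block frames  (r = 3)
b = 2 ** 2   # 4 words/block   (w = 2)

def direct_map(j):
    """Unique frame index for memory block j."""
    return j % m

def split_address(addr):
    """Decompose a word address into (tag, frame, word-offset)."""
    word  = addr % b           # low w bits
    frame = (addr // b) % m    # next r bits select the frame
    tag   = addr // (b * m)    # remaining high-order bits
    return tag, frame, word
```

For example, block 11 can only load into frame 11 mod 8 = 3, which is why two busy blocks that share a frame index evict each other (the contention mentioned below).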
- Disadvantages:
  - Rigid mapping
  - Poorer hit ratio
  - Prohibits parallel virtual address translation
- Use a larger cache with more block frames to avoid contention.
- Any main memory block can be placed into any of the available block frames.
- An s-bit tag is needed in each cache block (s > r).
- An m-way associative search requires the tag to be compared with all cache block tags.
- Use an associative memory to achieve a parallel comparison with all tags concurrently.
- Disadvantages:
  - Higher hardware cost
  - Only moderate-size caches are practical
  - Expensive search process
- Advantages:
  - Flexibility in mapping cache blocks
  - Higher hit ratio
  - Allows a better block replacement policy with reduced block contention
- The m block frames are divided into v = m/k sets, with k blocks per set.
- Each set is identified by a d-bit set number.
- Compare the tag with the k tags within the identified set.
- Bj → Bf ∈ Si if j (mod v) = i.
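The set-selection rule can be sketched as follows (the cache contents and function name are hypothetical; note that k = 1 reduces to direct mapping and v = 1 to a fully associative cache):

```python
# k-way set-associative lookup: block Bj may occupy any of the k
# frames in set S_i with i = j mod v, where v = m/k is the number
# of sets and d = log2(v) bits identify the set.
def lookup(cache_sets, j, v):
    """cache_sets: list of v sets, each a list of resident block
    numbers (standing in for the tags).  Returns True on a hit."""
    i = j % v                   # d-bit set number
    return j in cache_sets[i]   # compare against the k tags in set i

v, k = 4, 2                     # 8 frames organized as 4 sets of 2
sets = [[0, 4], [1, 9], [2, 6], [3, 11]]
```

Here block 9 hits (9 mod 4 = 1 and 9 is resident in set 1), while block 5 misses even though it maps to the same set.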
- Sector mapping uses a fully associative search at the sector level.
- Use sector tags for the search, and block fields within the sector to find the block.
- Only the missing block is loaded on a miss.
- The i-th block in a sector is placed into the i-th block frame in the destined sector frame.
- Attach a valid/invalid bit to block frames.
Dynamic Pipeline
- A dynamic pipeline can be reconfigured to perform variable functions at different times.
- Linear pipelines are static, for fixed functions.
- By following different dataflow patterns, the same pipeline can be used to evaluate different functions.
Reservation Tables
- Multiple reservation tables can be generated for the evaluation of different functions.
- Different functions may follow different paths through the pipeline.
- There is a one-to-many mapping between a pipeline configuration and reservation tables.
- The number of columns in a reservation table is the evaluation time of a given function.
Latency
- Latency: the number of time units between two initiations.
- Any attempt by two or more initiations to use the same pipeline stage at the same time causes a collision (resource conflict).
- Forbidden latencies are latencies that cause collisions.
- To detect forbidden latencies, check the distance between any two marks in the same row of the reservation table.
Latency Analysis
- Latency sequence: a sequence of permissible latencies between successive task initiations.
- Latency cycle: a latency sequence that repeats the same cycle indefinitely.
- Average latency: the sum of all latencies divided by the number of latencies in the cycle.
- Constant cycle: a cycle which contains only one latency value.
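The row-distance check for forbidden latencies can be sketched as below; the reservation table used is a made-up example, not one from the slides:

```python
# A forbidden latency is the distance between any two marks in the
# same row of a reservation table: initiating a new task after that
# many cycles would reuse a stage already in use (a collision).
def forbidden_latencies(table):
    """table: list of rows; each row lists the time steps (columns)
    at which that pipeline stage is in use."""
    forbidden = set()
    for row in table:
        for a in row:
            for b in row:
                if b > a:
                    forbidden.add(b - a)
    return forbidden

# Hypothetical 3-stage table: stage 1 used at t=0 and t=5, etc.
table = [[0, 5], [1, 2], [3, 4]]
print(sorted(forbidden_latencies(table)))  # [1, 5]
```

Any latency not in this set is permissible and may appear in a latency sequence.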
Question 6
- Write short notes on:
  - Crossbar switch and multiport memory (5 marks)
  - Multistage network (5 marks)
Crossbar Networks
- Crossbar networks connect every input to every output through a crosspoint switch.
- A crossbar network is a single-stage, non-blocking permutation network.
- In an n-processor, m-memory system, n × m crosspoint switches are required. Each crosspoint is a unary switch which can be open or closed, providing a point-to-point connection path between a processor and a memory module.
- Within each column of the crosspoint mesh, only one switch can be connected at a time.
- Crosspoint switches must be designed to handle the potential contention for each memory module.
- Each processor provides a request line, a read/write line, a set of address lines, and a set of data lines to the crosspoint switches of a single column.
- The crosspoint switch eventually responds with an acknowledgement when the access has been completed.
Multiport Memory
- Since crossbar switches are expensive and not suitable for systems with many processors or memory modules, multiport memory modules may be used instead.
- A multiport memory module has multiple connection points for processors (or I/O devices), and the memory controller in the module handles the arbitration and switching that might otherwise have been accomplished by a crosspoint switch.
- Multiport memories require more testing effort, since all ports have to be verified.
- The crossbar switch and the multiport memory organization are both single-stage networks; such networks are also called recirculating networks, because data items may have to pass through the single stage many times.
- Even if two processors attempt to access the same memory module (or I/O device) at the same time, only one of the requests is serviced at a time.
Multistage Networks
- Multistage networks consist of multiple stages of switch boxes, and should be able to connect any input to any output.
- A multistage network is called blocking if the simultaneous connections of some multiple input-output pairs may result in conflicts in the use of switches or communication links.
- A nonblocking multistage network can perform all possible connections between inputs and outputs by rearranging its connections.
Omega Networks
- n-input Omega networks, in general, have log2 n stages, with the input stage labeled 0.
- The interstage connection (ISC) pattern is a perfect shuffle.
- Routing is controlled by inspecting the destination address: when the i-th highest-order bit is 0, the 2×2 switch in stage i connects the input to the upper output; otherwise it connects the input to the lower output.
Blocking Effects
- Blocking exists in an Omega network when the requested permutation would require that a single switch be set in two positions simultaneously.
- Obviously this is impossible, and it requires that one of the permutation requests be blocked and tried in a later pass.
- In general, with 2×2 switches, an Omega network can implement n^(n/2) permutations in a single pass. For n = 8, this is about 10% of all possible permutations.
- In general, a maximum of log2 n passes are needed for an n-input Omega network.
Omega Broadcast
- An Omega network can be used to broadcast data to multiple destinations.
- The switch to which the input is connected is set to the broadcast position (input connected to both outputs).
- Each additional switch (in later stages) to which an output is directed is also set to the broadcast position.
Question 7
- Explain vector access memory schemes.
- Vector operands may have arbitrary length, and vector elements are not necessarily stored in contiguous memory locations.
- To access a vector in memory, one must specify its base, stride, and length.
- Since each vector register has a fixed length, only a segment of the vector can be loaded into a vector register at a time.
- Vector operands should be stored in memory so as to allow pipelined and parallel access; the access itself should be pipelined.
- C-access memory organization: the m-way low-order interleaved memory structure allows m words to be accessed concurrently and overlapped.
- S-access memory organization: all modules are accessed simultaneously, storing consecutive words into data buffers; the low-order address bits are used to multiplex the m words out of the buffers.
- C/S-access memory organization: combines the C-access and S-access schemes.
EENG-630
C-access
- Eight-way interleaved memory (m = 8 and w = 8). m is called the degree of interleaving.
- The major cycle is the total time required to complete the access of a single word from a memory module.
- The minor cycle is the actual time needed to produce one word, assuming overlapped access of successive memory modules in every memory cycle.
Message Formats
- Messages may be fixed or variable length.
- Messages are comprised of one or more packets.
- Packets are the basic units containing a destination address (e.g. a processor number) for routing purposes.
- Different packets may arrive at the destination asynchronously, so they are sequence-numbered to allow reassembly.
- Flits (flow control digits) are used in wormhole routing; they're discussed a bit later.
- In store-and-forward routing, a packet is held at an intermediate node until it can be forwarded to the next node or the final destination, and only then if the output channel is free and the next node has available buffer space for the packet.
- The latency in store-and-forward networks is directly related to the number of intermediate nodes through which the packet must pass.
- In wormhole routing, packets are subdivided into smaller pieces called flits (flow control digits).
- The first flit in a packet must contain (at least) the destination address. Thus the size of a flit must be at least log2 N bits in an N-processor multicomputer.
- Each flit is transmitted as a separate entity, but all flits belonging to a single packet must be transmitted in sequence, one immediately after the other, in a pipeline through intermediate routers.
Asynchronous Pipelining
- Each intermediate node in a wormhole network, as well as the source and destination, has a buffer capable of storing a flit.
- Adjacent nodes communicate requests and acknowledgements using a one-bit ready/request (R/A) line:
  - When a receiver is ready, it pulls the R/A line low.
  - When the sender is ready, it raises the R/A line high and transmits the next flit; the line is left high.
  - After the receiver deals with the flit (perhaps sending it on to another node), it lowers the R/A line to indicate it is ready to accept another flit.
  - The cycle repeats for the transmission of further flits.
- This asynchronous handshake permits a clock speed higher than that used in a synchronous pipeline.
- The pipeline can be stalled if buffers or successive channels in the path are not available during certain cycles.
- A packet could be buffered, blocked, dragged, or detoured (knocked around, in general) if the pipeline stalls.
Question No. 9
- Explain the steps in parallel processing.
Instruction Pipeline
- To implement pipelining, a designer divides a processor's datapath into sections (stages), and places pipeline latches (also called buffers) between each pair of stages.