
February 2009 Question Paper

Complete Solution & Supplementary Q&A


Question 1 (10 Marks)


• Write notes on any two program flow mechanisms.


Program Flow Mechanisms


• Conventional machines use a control flow mechanism, in which the order of program execution is explicitly stated in the user program.
• Dataflow machines execute instructions as soon as their operands become available.
• Reduction machines trigger an instruction's execution based on the demand for its result.


Comparison of program flow mechanisms


Control flow
• Basic definition: conventional computation; a token of control indicates when a statement should be executed.
• Advantages: (1) full control; (2) complex data and control structures are easily implemented.
• Disadvantages: (1) less efficient; (2) difficult in programming; (3) difficult in preventing run-time errors.

Data flow
• Basic definition: eager evaluation; statements are executed when all their operands are available.
• Advantages: (1) very high potential parallelism; (2) high throughput; (3) free from side effects.
• Disadvantages: (1) high control overhead; (2) difficult in manipulating data structures.

Reduction machine
• Basic definition: lazy evaluation; statements are executed only when their result is required for another computation.
• Advantages: (1) only required instructions are executed; (2) high degree of parallelism; (3) easy manipulation of data structures.
• Disadvantages: (1) time needed to propagate demand tokens.


Control Flow vs. Data Flow


• Control flow machines use shared memory for instructions and data. Since variables may be updated by many instructions, there can be side effects on other instructions; these side effects frequently prevent parallel processing. Single-processor systems are inherently sequential.
• Instructions in dataflow machines are unordered and can be executed as soon as their operands are available; data is held in the instructions themselves. Data tokens are passed from an instruction to its dependents to trigger execution.


Data Flow Features


• No need for
  • shared memory
  • a program counter
  • a control sequencer
• Special mechanisms are required to
  • detect data availability
  • match data tokens with the instructions that need them (see the sketch below)
  • enable the chain reaction of asynchronous instruction execution
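To make the token-matching idea concrete, here is a minimal Python sketch (names and graph are illustrative, not from the question paper) in which a node fires as soon as all of its operand tokens have arrived and then sends result tokens to its dependents:

```python
# Minimal sketch of the dataflow firing rule: no program counter
# orders execution; arrival of the last operand token triggers it.

class Node:
    def __init__(self, op, n_inputs, dests):
        self.op, self.n_inputs, self.dests = op, n_inputs, dests
        self.tokens = {}                              # slot -> value

    def accept(self, slot, value, graph, results):
        self.tokens[slot] = value
        if len(self.tokens) == self.n_inputs:         # all tokens matched
            out = self.op(*(self.tokens[i] for i in range(self.n_inputs)))
            for name, dest_slot in self.dests:        # chain reaction
                if name == 'out':
                    results.append(out)
                else:
                    graph[name].accept(dest_slot, out, graph, results)

# Dataflow graph for (a + b) * (a - b):
graph = {
    'add': Node(lambda x, y: x + y, 2, [('mul', 0)]),
    'sub': Node(lambda x, y: x - y, 2, [('mul', 1)]),
    'mul': Node(lambda x, y: x * y, 2, [('out', 0)]),
}
results = []
for name, slot, value in [('add', 0, 7), ('add', 1, 3),
                          ('sub', 0, 7), ('sub', 1, 3)]:
    graph[name].accept(slot, value, graph, results)
print(results)                                        # [40]
```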


A Dataflow Architecture - 1
• The Arvind machine (MIT) has N PEs and an N-by-N interconnection network.
• Each PE has a token-matching mechanism that dispatches only instructions whose data tokens are available.
• Each datum is tagged with
  • the address of the instruction to which it belongs
  • the context in which the instruction is being executed
• Tagged tokens enter a PE through its local path (pipelined), and can also be communicated to other PEs through the routing network.


A Dataflow Architecture - 2
• Instruction addresses effectively replace the program counter of a control flow machine.
• The context identifier effectively replaces the frame base register of a control flow machine.
• Since the dataflow machine matches the data tags from one instruction with its successors, synchronized instruction execution is implicit.


A Dataflow Architecture - 3
• An I-structure in each PE is provided to eliminate excessive copying of data structures.
• Each word of the I-structure has a two-bit tag indicating whether the value is empty, full, or has pending read requests.
• This is a retreat from the pure dataflow approach.
• Example 2.6 shows a control flow and dataflow comparison.
• Special compiler technology is needed for dataflow machines.


Demand-Driven Mechanisms
• Data-driven machines select instructions for execution based on the availability of their operands; this is essentially a bottom-up approach.
• Demand-driven machines take a top-down approach, attempting to execute the instruction (a demander) that yields the final result; this triggers the execution of the instructions that yield its operands, and so forth.
• The demand-driven approach matches naturally with functional programming languages (e.g. LISP and Scheme).


Reduction Machine Models


• String-reduction model:
  • each demander gets a separate copy of the expression string to evaluate
  • each reduction step has an operator and embedded references to demand the corresponding operands
  • each operator is suspended while its arguments are evaluated
• Graph-reduction model:
  • the expression graph is reduced by evaluation of branches or subgraphs, possibly in parallel, with demanders given pointers to the results of reductions
  • based on sharing of pointers to arguments; traversal and reversal of pointers continues until constant arguments are encountered


Summary
• Control flow machines give complete control, but are less efficient than the other approaches.
• Data flow (eager evaluation) machines have high potential parallelism and throughput and are free from side effects, but incur high control overhead, lose time waiting for unneeded arguments, and have difficulty manipulating data structures.
• Reduction (lazy evaluation) machines have high parallelism potential, manipulate data structures easily, and execute only the required instructions; but they do not share objects with changing local state, and they need time to propagate demand tokens.


Question 2 (4+4+2)
• Write notes on the following:
  • Amdahl's law and efficiency of a system
  • Utilization of a system and quality of parallelism
  • Redundancy


Amdahl's Law
• Assume R_i = i, and that the weights are w = (α, 0, ..., 0, 1 − α).
• This means the system is used either sequentially (with probability α) or with all n processors at once (with probability 1 − α).
• This yields the speedup equation known as Amdahl's law:

  S(n) = n / (1 + (n − 1)α)

• The implication is that the best speedup possible is 1/α, regardless of n, the number of processors (a numerical sketch follows).
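A quick numerical check in Python (a minimal sketch; variable names are illustrative) shows the speedup saturating at 1/α:

```python
def amdahl_speedup(n, alpha):
    """Amdahl's law: S(n) = n / (1 + (n - 1) * alpha), where alpha is
    the probability that the system runs sequentially."""
    return n / (1 + (n - 1) * alpha)

# With alpha = 0.1 the speedup approaches 1/0.1 = 10, no matter how
# many processors are added:
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(n, 0.1), 2))  # 1.0, 5.26, 9.17, 9.91
```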

System Efficiency 1
• Assume the following definitions:
  • O(n) = total number of unit operations performed by an n-processor system in completing a program P
  • T(n) = execution time required to execute the program P on an n-processor system
• O(n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by a constant factor.
• If we define O(1) = T(1), then it is logical to expect that T(n) < O(n) when n > 1 if the program P is able to make any use at all of the extra processors.


System Efficiency 2
• The speedup factor (how much faster the program runs with n processors) can now be expressed as S(n) = T(1) / T(n). Since we expect T(n) < T(1), we have S(n) ≥ 1.
• System efficiency is defined as E(n) = S(n) / n = T(1) / (n × T(n)). It indicates the actual degree of speedup achieved compared with the maximum possible speedup, so 1/n ≤ E(n) ≤ 1. The value is 1/n when only one processor is used (regardless of n), and 1 when all processors are fully utilized.


Redundancy
• The redundancy of a parallel computation is defined as R(n) = O(n) / O(1).
• What values can R(n) take?
  • R(n) = 1 when O(n) = O(1), i.e. the number of operations performed is independent of the number of processors n. This is the ideal case.
  • R(n) = n when every processor performs the same number of operations as a single processor would alone; this implies that n completely redundant computations are performed!
• R(n) indicates the extent to which the software parallelism is carried over to the hardware implementation without extra operations being performed.


System Utilization
• System utilization is defined as U(n) = R(n) × E(n) = O(n) / (n × T(n)). It indicates the degree to which the system resources were kept busy during execution of the program.
• Since 1 ≤ R(n) ≤ n and 1/n ≤ E(n) ≤ 1, the best possible value of U(n) is 1 and the worst is 1/n:
  • 1/n ≤ E(n) ≤ U(n) ≤ 1
  • 1 ≤ R(n) ≤ 1/E(n) ≤ n


Quality of Parallelism
• The quality of a parallel computation is defined as Q(n) = S(n) × E(n) / R(n) = T³(1) / (n × T²(n) × O(n)).
• This measure is directly related to speedup (S) and efficiency (E), and inversely related to redundancy (R).
• The quality measure is bounded above by the speedup: Q(n) ≤ S(n). (A sketch computing all five measures follows.)
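Under the deck's own definitions (with the convention O(1) = T(1)), the five measures fit together in a few lines of Python; the concrete numbers below are illustrative:

```python
def performance_metrics(T1, Tn, On, n):
    """Speedup, efficiency, redundancy, utilization and quality for an
    n-processor run, given T(1), T(n) and O(n), with O(1) = T(1)."""
    S = T1 / Tn        # speedup, 1 <= S <= n
    E = S / n          # efficiency, 1/n <= E <= 1
    R = On / T1        # redundancy, 1 <= R <= n   (since O(1) = T(1))
    U = R * E          # utilization = O(n) / (n * T(n)), 1/n <= U <= 1
    Q = S * E / R      # quality = T(1)^3 / (n * T(n)^2 * O(n)), Q <= S
    return S, E, R, U, Q

# Example: T(1) = 100, T(4) = 30, O(4) = 120 operations, n = 4:
print(performance_metrics(100, 30, 120, 4))
# (3.33..., 0.83..., 1.2, 1.0, 2.31...)
```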


Question 3
• Explain superscalar processors.


Superscalar Processors
• This subclass of RISC processors allows multiple instructions to be issued simultaneously during each cycle.
• The effective CPI of a superscalar processor should be less than that of a generic scalar RISC processor.
• Clock rates of scalar RISC and superscalar RISC machines are similar.


Question 4 (10 marks)


• Explain the cache addressing models.


Cache Addressing Models


• Most systems use a private cache for each processor
• An interconnection network sits between the caches and main memory
• Caches are addressed using either the physical address or the virtual address


Physical Address Caches


• The cache is indexed and tagged with the physical address
• Cache lookup occurs after address translation in the TLB or MMU (no aliasing)
• After a cache miss, a block is loaded from main memory
• Either a write-back or write-through policy is used


Physical Address Caches


• Advantages:
  • No cache flushing needed
  • No aliasing problems
  • Simple design
  • Requires little intervention from the OS kernel
• Disadvantage:
  • Cache access is slowed until the MMU/TLB finishes the address translation


Physical Address Models


Virtual Address Caches


• The cache is indexed or tagged with the virtual address
• Cache access and MMU translation/validation are performed in parallel
• The physical address is saved in the tags for write-back
• Gives more efficient access to the cache


Virtual Address Model


Aliasing Problem
• Different logically addressed data may have the same index/tag in the cache
• Confusion arises if two or more processors access the same physical cache location
• Flushing the cache when aliasing occurs works, but leads to slowdown
• Alternatively, apply special tagging with a process key or with the physical address


Block Placement Schemes


• Performance depends upon cache access patterns, organization, and the management policy
• Blocks in the cache are called block frames B_i (0 ≤ i < m); blocks in main memory are B_j (0 ≤ j < n), with m << n, m = 2^r, n = 2^s
• Each block has b = 2^w words, so the cache holds m·b = 2^(r+w) words and main memory holds n·b = 2^(s+w) words


Direct Mapping Cache


• Direct mapping assigns n/m memory blocks to each block frame in the cache
• Placement uses the modulo-m function: B_j → B_i if i = j mod m
• Each B_j loads into one unique block frame B_i
• This is the simplest organization to implement (see the sketch below)
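A minimal sketch of where a memory address lands under direct mapping, using the m = 2^r, b = 2^w notation above (the concrete numbers are illustrative):

```python
def direct_map(addr, m, b):
    """Decompose a word address for a direct-mapped cache with
    m = 2**r block frames of b = 2**w words each."""
    j = addr // b            # memory block number B_j
    word = addr % b          # word offset within the block
    frame = j % m            # block frame B_i, i = j mod m
    tag = j // m             # tag stored alongside the frame
    return frame, tag, word

# Example: 16 frames of 4 words; address 1234 is in block 308,
# so frame = 308 mod 16 = 4, tag = 308 // 16 = 19, word = 2.
print(direct_map(1234, 16, 4))   # (4, 19, 2)
```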


Direct Mapping Cache


Direct Mapping Cache


• Advantages
  • Simple hardware
  • No associative search
  • No page replacement policy needed
  • Lower cost
  • Higher speed
• Disadvantages
  • Rigid mapping
  • Poorer hit ratio
  • Prohibits parallel virtual address translation
  • A larger cache with more block frames is needed to avoid contention


Fully Associative Cache


• Each block in main memory can be placed in any of the available block frames
• An s-bit tag is needed in each cache block (s > r)
• An m-way associative search requires the tag to be compared with all cache block tags
• An associative memory is used to achieve a parallel comparison with all tags concurrently


Fully Associative Cache


Fully Associative Caches


• Advantages:
  • Offers the most flexibility in mapping cache blocks
  • Higher hit ratio
  • Allows a better block replacement policy with reduced block contention
• Disadvantages:
  • Higher hardware cost
  • Practical only for moderate cache sizes
  • Expensive search process


Set Associative Caches


• In a k-way set-associative cache, the m cache block frames are divided into v = m/k sets, with k blocks per set
• Each set is identified by a d-bit set number (v = 2^d)
• The tag is compared only with the k tags within the identified set
• B_j can be placed in any frame of set S_i if j mod v = i (see the sketch below)
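The same address decomposition for a k-way set-associative cache (a sketch; the numbers are illustrative):

```python
def set_assoc_map(addr, m, k, b):
    """k-way set-associative placement: v = m/k sets; block B_j may go
    in any of the k frames of set S_i, where i = j mod v."""
    j = addr // b            # memory block number B_j
    v = m // k               # number of sets (v = 2**d)
    set_index = j % v        # d-bit set number
    tag = j // v             # tag compared against the k tags in the set
    return set_index, tag, addr % b

# Same cache as before (16 frames, 4-word blocks) but 4-way: v = 4 sets.
print(set_assoc_map(1234, 16, 4, 4))   # (0, 77, 2)
```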


Sector Mapping Cache


• Partition the cache and main memory into fixed-size sectors, then use a fully associative search over sectors
• Use sector tags for the search and block fields within the sector to locate the block
• On a miss, only the missing block is loaded
• The i-th block of a sector is placed into the i-th block frame of the destination sector frame
• A valid/invalid bit is attached to each block frame


Question 5 (10 Marks)


• Explain non-linear pipeline processors.


Dynamic Pipeline
• Can be reconfigured to perform variable functions at different times
• Allows feedforward and feedback connections, making the pipeline nonlinear
• Linear pipelines are static, for fixed functions
• By following different dataflow patterns, the same pipeline can be used to evaluate different functions


Reservation Tables
• Multiple reservation tables can be generated for the evaluation of different functions
• Different functions may follow different paths through the pipeline
• There is a one-to-many mapping between a pipeline configuration and its reservation tables
• The number of columns in a table is the evaluation time of the given function


Latency
• Latency: the number of time units between two initiations
• Any attempt by two or more initiations to use the same pipeline stage at the same time causes a collision (a resource conflict)
• Forbidden latencies are those that cause collisions
• To detect forbidden latencies, check the distance between any two marks in the same row of the reservation table (see the sketch below)
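That row-distance check is mechanical; here is a small Python sketch (the reservation table shown is illustrative, not from the paper):

```python
def forbidden_latencies(table):
    """Forbidden latencies of a reservation table, given as a list of
    rows (one per stage), each row listing the time steps the stage is
    used. A latency is forbidden if some row has two marks that far apart."""
    forbidden = set()
    for row in table:
        for a in row:
            for b in row:
                if a < b:
                    forbidden.add(b - a)
    return sorted(forbidden)

# Illustrative 3-stage table: stage 1 used at t = 0 and 5, etc.
table = [[0, 5], [1, 2], [3, 4]]
print(forbidden_latencies(table))   # [1, 5]
```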


Latency Analysis
• Latency sequence: a sequence of permissible latencies between successive task initiations
• Latency cycle: a latency sequence that repeats itself indefinitely
• Average latency: the sum of all latencies in a cycle divided by the number of latencies in the cycle (illustrated below)
• Constant cycle: a cycle containing only one latency value
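For example, under these definitions the latency cycle (1, 8) has average latency (1 + 8)/2 = 4.5, and the constant cycle (3) has average latency 3 (the cycles are illustrative):

```python
def average_latency(cycle):
    """Average latency of a latency cycle: sum of its latencies divided
    by the number of latencies in the cycle."""
    return sum(cycle) / len(cycle)

print(average_latency((1, 8)))  # 4.5
print(average_latency((3,)))    # 3.0 -- a constant cycle
```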


Question 6
• Write short notes on
  • Crossbar switch and multiport memory (5 marks)
  • Multistage network (5 marks)


Crossbar Networks
• Crossbar networks connect every input to every output through a crosspoint switch.
• A crossbar network is a single-stage, non-blocking permutation network.
• In an n-processor, m-memory system, n × m crosspoint switches are required. Each crosspoint is a unary switch that can be open or closed, providing a point-to-point connection path between a processor and a memory module.

Crosspoint Switch Design


• Of the n crosspoint switches in each column of an n × m crossbar mesh, only one can be connected at a time.
• Crosspoint switches must be designed to handle the potential contention for each memory module.
• Each processor provides a request line, a read/write line, a set of address lines, and a set of data lines to the crosspoint switch of a single column.
• The crosspoint switch responds with an acknowledgement when the access has been completed.

Schematic of a Crosspoint Switch


Multiport Memory
• Since crossbar switches are expensive and not suitable for systems with many processors or memory modules, multiport memory modules may be used instead.
• A multiport memory module has multiple connection points for processors (or I/O devices), and the memory controller in the module handles the arbitration and switching that would otherwise be done by a crosspoint switch.
• Multiport memories require more testing effort, since every port has to be verified.

Multiport Memory Examples

Crossbar Switch and Multiport Memory


• Single-stage networks are sometimes called recirculating networks because data items may have to pass through the single stage many times.
• The crossbar switch and the multiport memory organization are both single-stage networks.
• Even if two processors attempt to access the same memory module (or I/O device) at the same time, only one of the requests is serviced at a time.

Multistage Networks
• Multistage networks consist of multiple stages of switch boxes, and should be able to connect any input to any output.
• A multistage network is called blocking if simultaneous connections of some multiple input-output pairs may result in conflicts in the use of switches or communication links.
• A nonblocking multistage network can perform all possible connections between inputs and outputs by rearranging its connections.

Omega Networks
• An n-input Omega network has log2(n) stages, with the input stage labeled 0.
• The interstage connection (ISC) pattern is the perfect shuffle.
• Routing is controlled by inspecting the destination address: when the i-th highest-order bit is 0, the 2×2 switch in stage i connects the input to the upper output; otherwise it connects the input to the lower output (see the sketch below).
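This destination-tag rule can be expressed compactly: the perfect shuffle rotates the address left one bit, and the switch then overwrites the low bit with the current destination bit. A minimal Python sketch (the function name is illustrative):

```python
def omega_route(src, dst, n):
    """Switch settings used to route src -> dst in an n-input Omega
    network (n a power of two): at stage i, the i-th highest-order bit
    of dst selects the upper (0) or lower (1) switch output."""
    stages = n.bit_length() - 1              # log2(n) stages
    cur, path = src, []
    for s in range(stages):
        bit = (dst >> (stages - 1 - s)) & 1
        cur = ((cur << 1) | bit) & (n - 1)   # perfect shuffle, then switch
        path.append((s, cur >> 1, 'lower' if bit else 'upper'))
    assert cur == dst                        # always lands on dst
    return path

# Route input 2 to output 5 in an 8-input network:
print(omega_route(2, 5, 8))
# [(0, 2, 'lower'), (1, 1, 'upper'), (2, 2, 'lower')]
```

Two routes conflict (blocking, discussed next) exactly when they need the same (stage, switch) pair in different settings.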

Omega Network without Blocking

Blocking Effects
• Blocking exists in an Omega network when the requested permutation would require a single switch to be set in two positions simultaneously.
• This is impossible, so one of the permutation requests must be blocked and retried in a later pass.
• In general, with 2×2 switches, an n-input Omega network can implement n^(n/2) permutations in a single pass. For n = 8, that is 8^4 = 4096 of the 8! = 40320 possible permutations, about 10%.
• At most log2(n) passes are needed for an n-input Omega network.

Omega Network with Blocking

Omega Broadcast
• An Omega network can be used to broadcast data to multiple destinations.
• The switch to which the input is connected is set to the broadcast position (input connected to both outputs).
• Each additional switch (in later stages) to which an output is directed is also set to the broadcast position.

Question 7
• Explain vector access memory schemes.


Vector-access memory schemes


• Vector operands may have arbitrary length, and vector elements are not necessarily stored in contiguous memory locations.
• To access a vector in memory, one must specify its base, stride, and length.
• Since each vector register has a fixed length, only a segment of the vector can be loaded into a vector register at a time.
• Vector operands should be laid out in memory to allow pipelined and parallel access, and the access itself should be pipelined.
• C-access memory organization: the m-way low-order interleaved memory structure allows m words to be accessed concurrently and in overlapped fashion (see the sketch below).
• S-access memory organization: all modules are accessed simultaneously, storing consecutive words into data buffers; the low-order address bits are then used to multiplex the m words out of the buffers.
• C/S-access memory organization: a combination of the two.
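A sketch of the C-access (low-order interleaving) address split; the module count and addresses below are illustrative:

```python
def c_access(addr, m):
    """Low-order interleaving across m = 2**a modules: consecutive
    addresses fall in consecutive modules, so m words can be fetched
    in overlapped fashion."""
    module = addr % m        # low-order bits select the module
    offset = addr // m       # word address within the module
    return module, offset

# A stride-1 vector sweeps all 8 modules before reusing any of them:
print([c_access(a, 8)[0] for a in range(10)])  # [0,1,2,3,4,5,6,7,0,1]
```

Note that a stride equal to m would direct every access to the same module, defeating the overlap; this is why the stride matters when laying out vector operands.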


C-access
• Example: eight-way interleaved memory (m = 8 and w = 8); m is called the degree of interleaving.
• The major cycle is the total time required to complete the access of a single word from a module. The minor cycle is the actual time needed to produce one word, assuming overlapped access of successive memory modules within each memory cycle.


Question No. 8 (10 marks)


• Explain the message passing mechanism in detail.


Message Passing in Multicomputers
• Multicomputers have no shared memory; each computer consists of a single processor, cache, private memory, and I/O devices.
• Some network must be provided to allow the multiple computers to communicate.
• The communication between computers in a multicomputer is called message passing.

Message Formats
• Messages may be of fixed or variable length.
• Messages are composed of one or more packets.
• Packets are the basic units containing a destination address (e.g. a processor number) for routing purposes.
• Different packets may arrive at the destination asynchronously, so they are sequence-numbered to allow reassembly.
• Flits (flow control digits) are used in wormhole routing; they are discussed below.

Store and Forward Routing


• Packets are the basic unit in the store-and-forward scheme.
• An intermediate node must receive a complete packet before it can be forwarded to the next node or the final destination, and only then if the output channel is free and the next node has buffer space available for the packet.
• The latency of store-and-forward networks is directly proportional to the number of intermediate nodes through which the packet must pass.

Flits and Wormhole Routing


• Wormhole routing divides a packet into smaller fixed-size pieces called flits (flow control digits).
• The first flit of a packet must contain (at least) the destination address, so the size of a flit must be at least log2(N) bits in an N-processor multicomputer.
• Each flit is transmitted as a separate entity, but all flits belonging to a single packet must be transmitted in sequence, one immediately after the other, in a pipeline through the intermediate routers.

Store and Forward vs. Wormhole
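Using the usual first-order latency formulas, the difference is easy to see in a sketch (the numbers are illustrative; L = packet length, F = flit length, W = channel bandwidth, D = number of hops):

```python
def t_store_forward(L, W, D):
    """Store-and-forward: each of the D hops must receive the whole
    packet of length L before forwarding it."""
    return D * (L / W)

def t_wormhole(L, W, D, F):
    """Wormhole: only the header flit (length F) pays the per-hop cost;
    the remaining flits pipeline behind it."""
    return L / W + D * (F / W)

# 1024-byte packet, 8-byte flits, 10 hops, unit bandwidth:
print(t_store_forward(1024, 1, 10))   # 10240.0 time units
print(t_wormhole(1024, 1, 10, 8))     # 1104.0 time units
```

The wormhole latency is nearly independent of the distance D once the packet is much longer than a flit, which is the point of the pipelined scheme.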

Asynchronous Pipelining
• Each intermediate node in a wormhole network, as well as the source and destination, has a buffer capable of storing a flit.
• Adjacent nodes communicate requests and acknowledgements using a one-bit ready/request (R/A) line.
  • When the receiver is ready, it pulls the R/A line low.
  • When the sender is ready, it raises the R/A line high and transmits the next flit; the line is left high.
  • After the receiver deals with the flit (perhaps sending it on to another node), it lowers the R/A line to indicate it is ready to accept another flit.
  • The cycle repeats for the transmission of further flits.

Wormhole Node Handshaking

Asynchronous Pipeline Speeds


• An asynchronous pipeline can be very efficient, and can use a clock speed higher than that of a comparable synchronous pipeline.
• The pipeline can be stalled if buffers or successive channels along the path are not available during certain cycles.
• When the pipeline stalls, a packet may be buffered, blocked, dragged, or detoured.

Question No. 9
• Explain the steps in parallel processing.


Instruction Pipeline
• To implement pipelining, a designer divides a processor's datapath into sections (stages) and places pipeline latches (also called buffers) between each pair of stages (a timing sketch follows).
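The payoff is in the timing: a k-stage pipeline takes k cycles to fill, after which it completes one instruction per cycle, so n instructions take k + n − 1 cycles. A minimal sketch:

```python
def pipeline_cycles(k, n):
    """Cycles for n instructions on a k-stage pipeline:
    k cycles to fill, then one completion per cycle."""
    return k + n - 1

print(pipeline_cycles(5, 100))  # 104 cycles, vs. 500 without pipelining
```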


Pipelined five-stage processor

