1/31/2012
- Control flow machines: the order of program execution is explicitly stated in user programs.
- Dataflow machines: instructions are selected for execution by determining operand availability.
- Reduction machines: an instruction's execution is triggered by the demand for its results.
Basic Definition
- Control flow: conventional computation; a token of control indicates when a statement should be executed.
- Dataflow: eager evaluation; statements are executed when all their operands are available.
- Reduction: lazy evaluation; statements are executed only when their result is required for another computation.

Advantages
- Control flow: (1) full control; (2) complex data and control structures are easily implemented.
- Dataflow: (1) very high potential parallelism; (2) high throughput; (3) free from side effects.
- Reduction: (1) only required instructions are executed; (2) high degree of parallelism; (3) easy manipulation of data structures.

Disadvantages
- Control flow: less efficient; side effects from shared variables hinder parallel processing.
- Dataflow: high control overhead; time lost waiting for unneeded arguments; difficulty in manipulating data structures.
- Reduction: cannot share objects with changing local state; time is needed to propagate demand tokens.
Since variables are updated by many instructions, there may be side effects on other instructions. These side effects frequently prevent parallel processing. Single processor systems are inherently sequential.
- Instructions in dataflow machines are unordered and can be executed as soon as their operands are available; data is held in the instructions themselves.
- Data tokens are passed from an instruction to its dependents to trigger execution.
A Dataflow Architecture - 1
- The Arvind machine (MIT) has N PEs and an N-by-N interconnection network.
- Each PE has a token-matching mechanism that dispatches only instructions with data tokens available.
- Each datum is tagged with:
  - the address of the instruction to which it belongs
  - the context in which the instruction is being executed
- Tagged tokens enter a PE through a local path (pipelined); they can also be passed to other PEs through the interconnection network.
A Dataflow Architecture - 2
- Instruction address(es) effectively replace the program counter in a control flow machine.
- The context identifier effectively replaces the frame base register in a control flow machine.
- Since the dataflow machine matches the data tags from one instruction with its successors, synchronized instruction execution is implicit.
A Dataflow Architecture - 3
- An I-structure in each PE is provided to eliminate excessive copying of data structures.
- Each word of the I-structure has a two-bit tag indicating whether the value is empty, full, or has pending read requests.
- This is a retreat from the pure dataflow approach.
- Example 2.6 shows a control flow and dataflow comparison.
- Special compiler technology is needed for dataflow machines.
Demand-Driven Mechanisms
- Data-driven machines select instructions for execution based on the availability of their operands; this is essentially a bottom-up approach.
- Demand-driven machines take a top-down approach, attempting to execute the instruction (a demander) that yields the final result. This triggers the execution of instructions that yield its operands, and so forth.
- The demand-driven approach matches naturally with functional programming languages (e.g. LISP and SCHEME).
Summary
- Control flow machines give complete control, but are less efficient than other approaches.
- Dataflow (eager evaluation) machines have high potential parallelism and throughput and freedom from side effects, but have high control overhead, lose time waiting for unneeded arguments, and have difficulty manipulating data structures.
- Reduction (lazy evaluation) machines have high parallelism potential, easy manipulation of data structures, and execute only required instructions. But they do not share objects with changing local state, and they require time to propagate demand tokens.
Question 2 (4+4+2)
- Write notes on the following:
  - Amdahl's law and efficiency of a system
  - Utilization of a system and quality of parallelism
  - Redundancy
Amdahl's Law
- Assume Ri = i, and the weights w = (α, 0, ..., 0, 1 - α).
- Basically this means the system is used sequentially (with probability α) or all n processors are used (with probability 1 - α).
- This yields the speedup equation known as Amdahl's law:

      S_n = n / (1 + (n - 1)α)

- The implication is that the best speedup possible is 1/α, regardless of n, the number of processors.
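As a quick check of the formula above, this sketch (not from the original slides) evaluates the speedup for a hypothetical serial fraction α = 0.1:

```python
def amdahl_speedup(n, alpha):
    """Speedup of an n-processor system when a fraction `alpha`
    of the work is strictly sequential (Amdahl's law)."""
    return n / (1 + (n - 1) * alpha)

# With alpha = 0.1, 16 processors give roughly 6.4x speedup,
# while even a huge machine stays below the 1/alpha = 10 ceiling.
print(amdahl_speedup(16, 0.1))
print(amdahl_speedup(10**6, 0.1))
print(1 / 0.1)  # the asymptotic limit
```

Note how adding processors beyond a point buys almost nothing: the sequential fraction dominates.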
System Efficiency 1
- Assume the following definitions:
  - O(n) = total number of unit operations performed by an n-processor system in completing a program P.
  - T(n) = execution time required to execute the program P on an n-processor system.
- O(n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by a constant factor.
- If we define O(1) = T(1), then it is logical to expect that T(n) < O(n) when n > 1 if the program P is able to make any use at all of the extra processor(s).
System Efficiency 2
- Clearly, the speedup factor (how much faster the program runs with n processors) can now be expressed as

      S(n) = T(1) / T(n)

  Recall that we expect T(n) < T(1), so S(n) ≥ 1.
- System efficiency is defined as

      E(n) = S(n) / n = T(1) / (n × T(n))

  It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup. Thus 1/n ≤ E(n) ≤ 1. The value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.
Redundancy
- The redundancy in a parallel computation is defined as

      R(n) = O(n) / O(1)

- R(n) = 1 when O(n) is independent of the number of processors n. This is the ideal case: the parallelism in the program is carried over to the hardware implementation without extra operations being performed.
- R(n) = n when all processors perform the same number of operations as when only a single processor is used; this implies that n completely redundant computations are performed!
System Utilization
- System utilization is defined as

      U(n) = R(n) × E(n) = O(n) / (n × T(n))

  It indicates the degree to which the system resources were kept busy during execution of the program.
- Since 1 ≤ R(n) ≤ n and 1/n ≤ E(n) ≤ 1, the best possible value for U(n) is 1, and the worst is 1/n:
  - 1/n ≤ E(n) ≤ U(n) ≤ 1
  - 1 ≤ R(n) ≤ 1/E(n) ≤ n
Quality of Parallelism
- The quality of a parallel computation is defined as

      Q(n) = S(n) × E(n) / R(n) = T^3(1) / (n × T^2(n) × O(n))

- This measure is directly related to speedup (S) and efficiency (E), and inversely related to redundancy (R).
- The quality measure is bounded by the speedup; that is, Q(n) ≤ S(n).
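The indices above can be cross-checked numerically. The sketch below uses hypothetical measurements T(1) = 100, T(n) = 20, and O(n) = 120 on n = 8 processors; the function name is illustrative, not from the slides:

```python
def metrics(T1, Tn, On, n):
    """Performance indices from the definitions above, given T(1),
    T(n), O(n) (with O(1) = T(1) by convention) for n processors."""
    S = T1 / Tn   # speedup S(n) = T(1) / T(n)
    E = S / n     # efficiency E(n) = S(n) / n
    R = On / T1   # redundancy R(n) = O(n) / O(1), and O(1) = T(1)
    U = R * E     # utilization U(n) = O(n) / (n * T(n))
    Q = S * E / R # quality of parallelism
    return S, E, R, U, Q

# Hypothetical run: T(1)=100, T(8)=20, O(8)=120 unit operations.
S, E, R, U, Q = metrics(100, 20, 120, 8)
```

With these numbers S = 5, E = 0.625, R = 1.2, U = 0.75, and the bounds 1/n ≤ E ≤ U ≤ 1 and Q ≤ S from the slides all hold.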
Question 3
- Explain superscalar processors.
Superscalar Processors
- This subclass of RISC processors allows multiple instructions to be issued simultaneously during each cycle.
- The effective CPI of a superscalar processor should be less than that of a generic scalar RISC processor.
- Clock rates of scalar RISC and superscalar RISC machines are similar.
- Cache lookup occurs after address translation in the TLB or MMU (no aliasing)
- After a cache miss, load a block from main memory
- Use either a write-back or write-through policy
- Disadvantage:
  - Slowdown in accessing the cache from the OS kernel
Aliasing Problem
- Different logically addressed data have the same index/tag in the cache.
- Confusion results if two or more processors access the same physical cache location.
- Flushing the cache when aliasing occurs works, but leads to slowdown.
- Alternatively, apply special tagging with a process key or with a physical address.
- A cache is characterized by its organization and management policy.
- Blocks in caches are called block frames; blocks in main memory are simply blocks.
- Block frames Bi (i ≤ m) and blocks Bj (j ≤ n), with m << n, m = 2^r, n = 2^s.
- Each block has b = 2^w words, for a cache total of m×b = 2^(r+w) words and a main memory of n×b = 2^(s+w) words.
- Placement is by a modulo-m function: Bj → Bi if i = j mod m.
- There is a unique block frame Bi into which each Bj loads.
- This is the simplest organization to implement.
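A minimal sketch of the direct-mapped placement rule, using the m = 2^r frames and b = 2^w words-per-block notation from the slide above (the concrete sizes here are illustrative):

```python
# Direct-mapped placement: memory block Bj can only occupy frame
# Bi with i = j mod m.  A word address then splits into
# (tag | frame index | word offset) fields of s-r, r, and w bits.
m = 2 ** 3   # 8 block frames  (r = 3)
b = 2 ** 2   # 4 words/block   (w = 2)

def direct_map(j):
    """Unique frame index for memory block j."""
    return j % m

def split_address(addr):
    """Decompose a word address into (tag, frame, word-offset)."""
    word  = addr % b           # low w bits
    frame = (addr // b) % m    # next r bits select the frame
    tag   = addr // (b * m)    # remaining high-order bits
    return tag, frame, word
```

For example, block 11 can only load into frame 11 mod 8 = 3, which is why two busy blocks that share a frame index evict each other (the contention mentioned below).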
- Disadvantages:
  - Rigid mapping
  - Poorer hit ratio
  - Prohibits parallel virtual address translation
- Use a larger cache with more block frames to avoid contention.
- Any main memory block can be placed into any of the available block frames.
- An s-bit tag is needed in each cache block (s > r).
- An m-way associative search requires the tag to be compared with all cache block tags.
- Use an associative memory to achieve a parallel comparison with all tags concurrently.
- Disadvantages:
  - Higher hardware cost
  - Only moderate-size caches are practical
  - Expensive search process
- Advantages:
  - Flexibility in mapping cache blocks
  - Higher hit ratio
  - Allows a better block replacement policy with reduced block contention
- The m block frames are divided into v = m/k sets, with k blocks per set.
- Each set is identified by a d-bit set number.
- Compare the tag with the k tags within the identified set.
- Bj → Bf ∈ Si if j (mod v) = i.
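The set-selection rule can be sketched as follows (the cache contents and function name are hypothetical; note that k = 1 reduces to direct mapping and v = 1 to a fully associative cache):

```python
# k-way set-associative lookup: block Bj may occupy any of the k
# frames in set S_i with i = j mod v, where v = m/k is the number
# of sets and d = log2(v) bits identify the set.
def lookup(cache_sets, j, v):
    """cache_sets: list of v sets, each a list of resident block
    numbers (standing in for the tags).  Returns True on a hit."""
    i = j % v                   # d-bit set number
    return j in cache_sets[i]   # compare against the k tags in set i

v, k = 4, 2                     # 8 frames organized as 4 sets of 2
sets = [[0, 4], [1, 9], [2, 6], [3, 11]]
```

Here block 9 hits (9 mod 4 = 1 and 9 is resident in set 1), while block 5 misses even though it maps to the same set.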
- Sector mapping uses a fully associative search at the sector level.
- Use sector tags for the search, and block fields within the sector to find the block.
- Only the missing block is loaded on a miss.
- The i-th block in a sector is placed into the i-th block frame in the destined sector frame.
- Attach a valid/invalid bit to block frames.
Dynamic Pipeline
- A dynamic pipeline can be reconfigured to perform variable functions at different times.
- Linear pipelines are static, for fixed functions.
- By following different dataflow patterns, the same pipeline can be used to evaluate different functions.
Reservation Tables
- Multiple reservation tables can be generated for the evaluation of different functions.
- Different functions may follow different paths through the pipeline.
- There is a one-to-many mapping between a pipeline configuration and reservation tables.
- The number of columns in a reservation table is the evaluation time of a given function.
Latency
- Latency: the number of time units between two initiations.
- Any attempt by two or more initiations to use the same pipeline stage at the same time causes a collision (resource conflict).
- Forbidden latencies are latencies that cause collisions.
- To detect forbidden latencies, check the distance between any two marks in the same row of the reservation table.
Latency Analysis
- Latency sequence: a sequence of permissible latencies between successive task initiations.
- Latency cycle: a latency sequence that repeats the same cycle indefinitely.
- Average latency: the sum of all latencies divided by the number of latencies in the cycle.
- Constant cycle: a cycle which contains only one latency value.
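The row-distance check for forbidden latencies can be sketched as below; the reservation table used is a made-up example, not one from the slides:

```python
# A forbidden latency is the distance between any two marks in the
# same row of a reservation table: initiating a new task after that
# many cycles would reuse a stage already in use (a collision).
def forbidden_latencies(table):
    """table: list of rows; each row lists the time steps (columns)
    at which that pipeline stage is in use."""
    forbidden = set()
    for row in table:
        for a in row:
            for b in row:
                if b > a:
                    forbidden.add(b - a)
    return forbidden

# Hypothetical 3-stage table: stage 1 used at t=0 and t=5, etc.
table = [[0, 5], [1, 2], [3, 4]]
print(sorted(forbidden_latencies(table)))  # [1, 5]
```

Any latency not in this set is permissible and may appear in a latency sequence.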
Question 6
- Write short notes on:
  - Crossbar switch and multiport memory (5 marks)
  - Multistage network (5 marks)
Crossbar Networks
- Crossbar networks connect every input to every output through a crosspoint switch.
- A crossbar network is a single-stage, non-blocking permutation network.
- In an n-processor, m-memory system, n × m crosspoint switches are required. Each crosspoint is a unary switch which can be open or closed, providing a point-to-point connection path between a processor and a memory module.
- Within each column of the crosspoint mesh, only one switch can be connected at a time.
- Crosspoint switches must be designed to handle the potential contention for each memory module.
- Each processor provides a request line, a read/write line, a set of address lines, and a set of data lines to the crosspoint switches of a single column.
- The crosspoint switch eventually responds with an acknowledgement when the access has been completed.
Multiport Memory
- Since crossbar switches are expensive and not suitable for systems with many processors or memory modules, multiport memory modules may be used instead.
- A multiport memory module has multiple connection points for processors (or I/O devices), and the memory controller in the module handles the arbitration and switching that might otherwise have been accomplished by a crosspoint switch.
- Multiport memories require more testing effort, since all ports have to be verified.
- The crossbar switch and the multiport memory organization are both single-stage networks; such networks are also called recirculating networks, because data items may have to pass through the single stage many times.
- Even if two processors attempt to access the same memory module (or I/O device) at the same time, only one of the requests is serviced at a time.
Multistage Networks
- Multistage networks consist of multiple stages of switch boxes, and should be able to connect any input to any output.
- A multistage network is called blocking if the simultaneous connections of some multiple input-output pairs may result in conflicts in the use of switches or communication links.
- A nonblocking multistage network can perform all possible connections between inputs and outputs by rearranging its connections.
Omega Networks
- n-input Omega networks, in general, have log2 n stages, with the input stage labeled 0.
- The interstage connection (ISC) pattern is a perfect shuffle.
- Routing is controlled by inspecting the destination address: when the i-th highest-order bit is 0, the 2×2 switch in stage i connects the input to the upper output; otherwise it connects the input to the lower output.
Blocking Effects
- Blocking exists in an Omega network when the requested permutation would require that a single switch be set in two positions simultaneously.
- Obviously this is impossible, and it requires that one of the permutation requests be blocked and tried in a later pass.
- In general, with 2×2 switches, an Omega network can implement n^(n/2) permutations in a single pass. For n = 8, this is about 10% of all possible permutations.
- In general, a maximum of log2 n passes are needed for an n-input Omega network.
Omega Broadcast
- An Omega network can be used to broadcast data to multiple destinations.
- The switch to which the input is connected is set to the broadcast position (input connected to both outputs).
- Each additional switch (in later stages) to which an output is directed is also set to the broadcast position.
Question 7
- Explain vector access memory schemes.
- Vector operands may have arbitrary length, and vector elements are not necessarily stored in contiguous memory locations.
- To access a vector in memory, one must specify its base, stride, and length.
- Since each vector register has a fixed length, only a segment of the vector can be loaded into a vector register at a time.
- Vector operands should be stored in memory so as to allow pipelined and parallel access; the access itself should be pipelined.
- C-access memory organization: the m-way low-order interleaved memory structure allows m words to be accessed concurrently and overlapped.
- S-access memory organization: all modules are accessed simultaneously, storing consecutive words into data buffers; the low-order address bits are used to multiplex the m words out of the buffers.
- C/S-access memory organization: combines the C-access and S-access schemes.
EENG-630
C-access
- Eight-way interleaved memory (m = 8 and w = 8). m is called the degree of interleaving.
- The major cycle is the total time required to complete the access of a single word from a memory module.
- The minor cycle is the actual time needed to produce one word, assuming overlapped access of successive memory modules in every memory cycle.
Message Formats
- Messages may be fixed or variable length.
- Messages are comprised of one or more packets.
- Packets are the basic units containing a destination address (e.g. a processor number) for routing purposes.
- Different packets may arrive at the destination asynchronously, so they are sequence-numbered to allow reassembly.
- Flits (flow control digits) are used in wormhole routing; they're discussed a bit later.
- In store-and-forward routing, a packet is held at an intermediate node until it can be forwarded to the next node or the final destination, and only then if the output channel is free and the next node has available buffer space for the packet.
- The latency in store-and-forward networks is directly related to the number of intermediate nodes through which the packet must pass.
- In wormhole routing, packets are subdivided into smaller pieces called flits (flow control digits).
- The first flit in a packet must contain (at least) the destination address. Thus the size of a flit must be at least log2 N bits in an N-processor multicomputer.
- Each flit is transmitted as a separate entity, but all flits belonging to a single packet must be transmitted in sequence, one immediately after the other, in a pipeline through intermediate routers.
Asynchronous Pipelining
- Each intermediate node in a wormhole network, as well as the source and destination, has a buffer capable of storing a flit.
- Adjacent nodes communicate requests and acknowledgements using a one-bit ready/request (R/A) line:
  - When a receiver is ready, it pulls the R/A line low.
  - When the sender is ready, it raises the R/A line high and transmits the next flit; the line is left high.
  - After the receiver deals with the flit (perhaps sending it on to another node), it lowers the R/A line to indicate it is ready to accept another flit.
  - The cycle repeats for the transmission of further flits.
- This asynchronous handshake permits a clock speed higher than that used in a synchronous pipeline.
- The pipeline can be stalled if buffers or successive channels in the path are not available during certain cycles.
- A packet could be buffered, blocked, dragged, or detoured (knocked around, in general) if the pipeline stalls.
Question No. 9
- Explain the steps in parallel processing.
Instruction Pipeline
- To implement pipelining, a designer divides a processor's datapath into sections (stages), and places pipeline latches (also called buffers) between each pair of stages.