M4-CS405 Computer System Architecture-Ktustudents - in

MESSAGE PASSING MECHANISMS
KTUStudents.in
MODULE 4
Sufyan P
Assistant professor
Computer Science and Engineering
sufyan@meaec.edu.in
For more study materials: WWW.KTUSTUDENTS.IN
INTRODUCTION
 Message passing in a multicomputer network
requires hardware and software support
 There are 2 message routing schemes for
multicomputer networks
 Store and forward routing
 Wormhole routing

MESSAGE FORMATS
 Message
 Message is a logical unit for inter-node communication
 It is formed by assembling an arbitrary number of
fixed-length packets
 A message can therefore be of variable length
 Packet
 A packet is the basic unit containing the destination address for
routing purposes
 Different packets may arrive asynchronously at the
destination
 So a sequence number is needed in each packet to allow
reassembly of the transmitted message

 Flits
 A packet is further divided into a number of fixed-length
flow control digits called flits
 Header flits consist of
 Routing information (destination address)
 Sequence number
 Remaining flits are the data elements of a packet
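The message, packet and flit hierarchy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Packet`, `packetize`, `to_flits` and `reassemble` are hypothetical, the header flit is assumed to hold one byte each for destination and sequence number (so both must be below 256 and the flit size at least 2 bytes), and the last packet is zero-padded to keep packets fixed-length.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Packet:
    dest: int       # destination address (goes into the header flit)
    seq: int        # sequence number used for reassembly
    payload: bytes  # fixed-length data part of the packet


def packetize(message: bytes, dest: int, payload_size: int) -> List[Packet]:
    """Split a variable-length message into fixed-length packets."""
    packets = []
    for seq, start in enumerate(range(0, len(message), payload_size)):
        # Zero-pad the last chunk so every packet has the same length (assumption).
        chunk = message[start:start + payload_size].ljust(payload_size, b"\x00")
        packets.append(Packet(dest, seq, chunk))
    return packets


def to_flits(packet: Packet, flit_size: int) -> List[bytes]:
    """Return the header flit followed by the data flits of one packet."""
    header = bytes([packet.dest, packet.seq]).ljust(flit_size, b"\x00")
    data = [packet.payload[i:i + flit_size]
            for i in range(0, len(packet.payload), flit_size)]
    return [header] + data


def reassemble(packets: List[Packet], length: int) -> bytes:
    """Packets may arrive out of order: sort by sequence number, drop padding."""
    ordered = sorted(packets, key=lambda p: p.seq)
    return b"".join(p.payload for p in ordered)[:length]


msg = b"wormhole routing demo"
pkts = packetize(msg, dest=5, payload_size=8)
print(len(pkts), [len(f) for f in to_flits(pkts[0], flit_size=4)])
print(reassemble(list(reversed(pkts)), len(msg)) == msg)
```

Reversing the packet list before reassembly imitates asynchronous arrival; the sequence numbers are what make the original order recoverable.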

FACTORS AFFECTING PACKET & FLIT SIZES
 Packet length is determined by the routing scheme
& network implementation
 Packet length ranges from 64 to 512 bits
 Sequence number may occupy 1 or 2 flits
depending on message length
 Other factors
 Channel bandwidth
 Router design
 Network traffic intensity

STORE AND FORWARD ROUTING
 This scheme was used in first-generation
multicomputers
 Packets are the basic unit of information flow in a
store-and-forward network
 Each node requires a packet buffer
 A packet is transmitted from source to destination
through a sequence of intermediate nodes

STORE AND FORWARD ROUTING [2]
 When a packet reaches an intermediate node, it is
first stored in the buffer
 It is forwarded to the next node if the output channel
and the packet buffer of the receiving node are available
 Latency is directly proportional to the distance between
source & destination

WORMHOLE ROUTING
 This scheme is implemented in later generations
of multicomputers
 Here packets are subdivided into flits
 Flit buffers are used in the hardware routers attached to
nodes
 Transmission from source to destination is done
via a sequence of routers

WORMHOLE ROUTING
 All flits in the same packet are transmitted in
order
 They are transmitted as inseparable companions in
a pipelined fashion
 A packet can be visualized as a railroad train with
the header flit as the engine and the
data flits as the cars that follow
 Only the header flit knows where the packet is going
 Data flits follow the header flit

WORMHOLE ROUTING
 Packets can be interleaved during transmission
 Flits of different packets must not be mixed up
 Otherwise they may be delivered to the wrong destination
 Latency of this method is nearly independent of the distance
between source & destination

ASYNCHRONOUS PIPELINING
 Pipelining of successive flits in a packet is done
asynchronously using a handshaking protocol
 A one-bit ready/request (R/A) line is used between adjacent
routers to perform handshaking
 When the receiving router D is ready to receive a
flit,
 R/A line is pulled low
 When the sending router S is ready,
 It raises the R/A line high
 It transmits the flit through the channel

ASYNCHRONOUS PIPELINING [2]
 R/A is kept high while the flit is being received by
D
 After flit i is removed from D’s buffer and
transmitted to the next node,
the cycle repeats for the transmission of flit i+1
 This goes on until the entire packet is transmitted

 Asynchronous pipelining can be very efficient
 The clock used can be faster than in a synchronous pipeline
 The pipeline stalls if flit buffers or successive
channels along the path are not available

LATENCY ANALYSIS
 $L$ ➔ packet length (in bits)
 $W$ ➔ channel bandwidth (in bits/s)
 $D$ ➔ distance (no: of nodes traversed - 1)
 $F$ ➔ flit length (in bits)
 $T_{SF}$ ➔ communication latency for store and forward
 $T_{WH}$ ➔ communication latency for wormhole
 $T_{SF}$ is directly proportional to $D$
 $T_{WH} \approx L/W$ if $L \gg F$
 Thus $D$ has a negligible effect on the wormhole routing latency
 A typical first-generation value of $T_{SF}$ is between 2000 and
6000 µs
 A typical $T_{WH}$ is 5 µs or less
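The closed-form latency expressions are not reproduced in the text above. The sketch below assumes the commonly quoted textbook forms T_SF = (L/W)(D+1) and T_WH = L/W + (F/W)D, which agree with both observations made here (T_SF grows linearly with D, and T_WH is close to L/W when L is much larger than F). The function names and the sample numbers are hypothetical.

```python
def t_store_and_forward(L: float, W: float, D: int) -> float:
    """Assumed form: the whole packet is retransmitted on each of the D+1 hops."""
    return (L / W) * (D + 1)


def t_wormhole(L: float, W: float, F: float, D: int) -> float:
    """Assumed form: only the header flit pays the per-hop cost."""
    return L / W + (F / W) * D


# Hypothetical numbers: 512-bit packet, 8-bit flits, W = 1 bit per time unit.
L, F, W = 512, 8, 1
for D in (1, 10):
    print(D, t_store_and_forward(L, W, D), t_wormhole(L, W, F, D))
```

With these numbers, raising D from 1 to 10 multiplies the store-and-forward latency by 5.5 while the wormhole latency rises by under 15 percent, which is the sense in which D has a negligible effect.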

FLOW CONTROL STRATEGIES

INTRODUCTION
 Flow control strategies are used to control network
traffic flow without causing congestion or
deadlock situations
 When 2 or more packets collide at a node, policies
must be set for resolving their conflict

PACKET COLLISION RESOLUTION
 To move a flit between adjacent nodes in a
pipeline of channels, 3 elements must be present
 Source buffer holding the flit
 Channel being allocated
 Receiver buffer accepting the flit
 When 2 packets reach the same node and request
the same receiver buffer or outgoing channel, 2
decisions are to be made
 Which packet will be allocated the channel?
 What will be done with the packet being denied the
channel?

FLOW CONTROL POLICIES FOR COLLISION
RESOLUTION
 Buffering
 Blocking policy
 Discard and retransmission
 Detour after being blocked

BUFFERING METHOD
 This method is applied in virtual cut-through routing
 When packets 1 and 2 collide at a particular node,
 Packet 1 is allocated the channel
 Packet 2 is denied
 Packet 2 is temporarily stored in a packet buffer
 It will be transmitted later, when the channel becomes
available
 Advantage
 Already allocated resources are not wasted
 Disadvantages
 Requires a buffer large enough to hold the entire
packet
 May cause significant storage delay


BLOCKING POLICY
 Pure wormhole routing uses this scheme
 The second packet is blocked from advancing
 However, it is not abandoned
 It is economical to implement
 Disadvantage
 May result in idling of the resources allocated to the blocked
packet

DISCARD POLICY
 It drops the packet which is blocked
 This scheme may result in severe waste of resources
 It demands packet retransmission and
acknowledgement
 Rarely used because of its unstable packet delivery
rate

DETOUR POLICY
 Blocked packet is routed through a detour
channel
 Offers more flexibility in packet routing
 Disadvantage
 May waste more channel resources than necessary to reach the
destination

FLOW CONTROL POLICIES FOR COLLISION
RESOLUTION
 Some multicomputer networks use hybrid policies
which combine the advantages of the above-mentioned
flow control policies

DETERMINISTIC ROUTING
 Communication path is completely determined by
the source and destination addresses
 Routing path is determined in advance,
irrespective of network conditions
 Eg: of deterministic routing algorithms
 E-Cube routing ➔ routing in hypercube
 X-Y routing ➔ routing in mesh
 Both of the above algorithms are based on the
concept of dimension-order routing

DIMENSION ORDER ROUTING
 It selects the successive channels for routing in a
specific order, based on the dimensions of a
multidimensional network
 In case of a 2D mesh network, this scheme is called
X-Y routing
 Path along X dimension is decided first before choosing
a path along Y dimension

E–CUBE ROUTING ON HYPERCUBE
 Consider an n-cube with $N = 2^n$ nodes
 Each node b is binary coded as $b = b_{n-1} b_{n-2} \ldots b_1 b_0$
 Source node $s = s_{n-1} s_{n-2} \ldots s_1 s_0$
 Destination node $d = d_{n-1} d_{n-2} \ldots d_1 d_0$
 We have to determine a route from s to d with
minimum no: of steps
 Let $v = v_{n-1} v_{n-2} \ldots v_1 v_0$ be any node along the route

ALGORITHM TO DETERMINE ROUTE FROM
S TO D
 Step 1: Compute the direction bit $r_i = s_{i-1} \oplus d_{i-1}$ for all n
dimensions (i = 1, 2, …, n). Start the following with i = 1
and v = s
 Step 2: Route from current node v to next node $v \oplus 2^{i-1}$ if
$r_i = 1$. Skip this step if $r_i = 0$
 Step 3: Move to dimension i+1 (ie i = i+1). If i <= n, go to step 2,
else done

EXAMPLE
 n = 4, s = 0110 and d = 1101
 $r = r_4 r_3 r_2 r_1$ ➔ 0110 XOR 1101 ➔ 1011
 Start with v = s and route from v to the next node $v \oplus 2^{i-1}$ whenever $r_i = 1$
 For i = 1
 0110 XOR $2^0$ = 0111 since $r_1 = 1$
 For i = 2
 0111 XOR $2^1$ = 0101 since $r_2 = 1$
 For i = 3
 Skip since $r_3 = 0$
 For i = 4
 0101 XOR $2^3$ = 1101 since $r_4 = 1$
 Route: 0110 ➔ 0111 ➔ 0101 ➔ 1101
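The E-cube steps can be checked with a short Python sketch. The function name `ecube_route` is hypothetical; node addresses are held as integers, so bit i-1 of `s ^ d` plays the role of the direction bit for dimension i.

```python
from typing import List


def ecube_route(s: int, d: int, n: int) -> List[int]:
    """Return the list of nodes visited from s to d in an n-cube."""
    r = s ^ d                      # all direction bits at once
    v = s
    path = [v]
    for i in range(1, n + 1):      # dimensions in increasing order
        if (r >> (i - 1)) & 1:     # direction bit for dimension i is 1
            v ^= 1 << (i - 1)      # cross dimension i: v XOR 2^(i-1)
            path.append(v)
    return path


route = ecube_route(0b0110, 0b1101, n=4)
print(" -> ".join(format(v, "04b") for v in route))
# prints: 0110 -> 0111 -> 0101 -> 1101
```

The number of hops equals the number of 1 bits in s XOR d (the Hamming distance), which is why the route uses the minimum number of steps.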

X–Y ROUTING ON 2D MESH
 From any source node s = (x1, y1) to any destination
node d = (x2, y2)
 Route from s along the X-axis first, until it reaches
the column x = x2 in which d is located
 Then route to d along the Y-axis
 Four possible X-Y routing patterns
 East-north
 East-south
 West-north
 West-south
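A minimal Python sketch of X-Y routing follows. The function name `xy_route` is hypothetical, and the sketch assumes that east means increasing x and north means increasing y; the text does not fix an orientation.

```python
from typing import List, Tuple

Node = Tuple[int, int]


def xy_route(s: Node, d: Node) -> List[Node]:
    """Route along X until the destination column is reached, then along Y."""
    (x, y), (x2, y2) = s, d
    path = [(x, y)]
    step_x = 1 if x2 > x else -1   # east or west
    while x != x2:
        x += step_x
        path.append((x, y))
    step_y = 1 if y2 > y else -1   # north or south
    while y != y2:
        y += step_y
        path.append((x, y))
    return path


print(xy_route((2, 1), (0, 3)))   # a west-north pattern
# prints: [(2, 1), (1, 1), (0, 1), (0, 2), (0, 3)]
```

Because the X moves are always finished before any Y move starts, the path depends only on the source and destination, which is what makes the scheme deterministic.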

LINEAR PIPELINE PROCESSORS

INTRODUCTION
 A linear pipeline processor is a cascade of processing stages, which are
linearly connected to perform a fixed function over a
stream of data flowing from one end to the other
 They can be applied for
 Instruction execution
 Arithmetic computation
 Memory access operations

STRUCTURE OF LINEAR PIPELINE
 It consists of k processing stages
 Inputs are fed into the first stage of the pipeline, ie
$S_1$
 Processed results are passed from stage $S_i$ to stage
$S_{i+1}$ for all i = 1, 2, …, k-1
 Final result emerges from the last stage of the
pipeline, ie $S_k$

CATEGORIES OF LINEAR PIPELINE
PROCESSORS
 Depending on the control of data flow, linear
pipelines are divided into 2 categories
 Asynchronous pipeline model
 Synchronous pipeline model

ASYNCHRONOUS MODEL
 Data flow between adjacent stages of an
asynchronous pipeline is controlled by a
handshaking protocol
 When stage $S_i$ is ready to transmit, it sends a ready
signal to stage $S_{i+1}$
 After stage $S_{i+1}$ receives the incoming data, it returns
an acknowledgement signal to $S_i$

ASYNCHRONOUS MODEL [2]
 Different stages of this pipeline may have different delays
 It may have a variable throughput rate
 This pipeline is used for designing communication
channels in message-passing multicomputers which
employ wormhole routing

SYNCHRONOUS PIPELINE MODEL
 In this pipeline, clocked latches are used between the
stages of the pipeline
 Latches are made with master-slave flip-flops
 They are used to isolate inputs from outputs
 When a clock pulse arrives, all latches transfer data
to the next stage simultaneously
 Pipeline stages are implemented as combinational
logic circuits
 There will be approximately equal delays in all the
stages


RESERVATION TABLE
 It is a table for representing the task flow pattern of
a pipelined system
 Specifies the utilization pattern of successive stages in a
pipeline
 It consists of rows and columns
 Rows ➔ stages (resources) of the pipeline
 Columns ➔ time steps (clock cycles)
 In a linear pipeline, the utilization pattern follows the
diagonal
 For a k stage linear pipeline, k clock cycles are
needed for a data item to flow through the pipeline
 Once the pipeline is filled, one result emerges from the
pipeline for each additional cycle
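The diagonal pattern can be made concrete with a small Python sketch. The helper names are hypothetical, and the table is represented as a list of rows (one per stage) with a 1 where the stage is busy in that clock cycle.

```python
from typing import List


def linear_reservation_table(k: int) -> List[List[int]]:
    """Stage i (row) is used only in cycle i (column): a diagonal of marks."""
    return [[1 if t == s else 0 for t in range(k)] for s in range(k)]


def show(table: List[List[int]]) -> None:
    """Print the table with X for a checkmark and . for an idle slot."""
    for s, row in enumerate(table, start=1):
        print(f"S{s}  " + " ".join("X" if mark else "." for mark in row))


show(linear_reservation_table(4))
```

Each row and each column carries exactly one mark, so a new task can enter every cycle without two tasks ever needing the same stage at the same time.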

SPEEDUP, EFFICIENCY AND THROUGHPUT
 Ideally, a linear pipeline of k stages can process n
tasks in k+(n-1) clock cycles
 k cycles are needed to complete the first task
 Remaining n-1 tasks require n-1 cycles
 Total time
 $T_k = [k + (n-1)]\,\tau$ ---(1)
 $\tau$ ➔ clock period
 Time taken for a non-pipelined processor to execute
n tasks
 $T_1 = n k \tau$ ---(2) where $k\tau$ is the flow-through
delay of the non-pipelined processor

SPEEDUP FACTOR
 Speedup factor of a k stage pipeline over an
equivalent non-pipelined processor is defined as
 $S_k = \frac{T_1}{T_k} = \frac{n k \tau}{[k + (n-1)]\,\tau} = \frac{nk}{k + (n-1)}$ ---(3)

OPTIMAL NUMBER OF STAGES OF A PIPELINE
 Let t be the total time required to execute a non-pipelined
sequential program
 To execute that same program on a k stage pipeline,
with equal flow-through delay (t), the required clock
period is:
 $p = \frac{t}{k} + d$ ---(4)
 t ➔ flow-through delay
 d ➔ latch delay
 Maximum throughput of the pipeline in ideal
condition is:
 $f = \frac{1}{p} = \frac{1}{\frac{t}{k} + d}$ ---(5)
 Total pipeline cost = $c + kh$
 c ➔ cost of all logic stages
 h ➔ cost of each latch

PCR (PIPELINE PERFORMANCE/COST RATIO)
 $PCR = \frac{f}{c + kh} = \frac{1}{\left(\frac{t}{k} + d\right)(c + kh)}$ ---(6)
 Peak of the PCR curve specifies the optimal choice for no: of
desired pipeline stages
 t ➔ total flow-through delay of the pipeline
 c ➔ total stage cost
 d ➔ latch delay
 h ➔ latch cost
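The peak of the PCR curve can be located numerically. The sketch below evaluates equation (6) for integer k and compares the best value with the closed form k0 = sqrt(t·c / (d·h)); that closed form is not stated in the text and is obtained here by setting the derivative of the denominator of (6) with respect to k to zero. The function names and the parameter values are hypothetical.

```python
import math


def pcr(k: int, t: float, d: float, c: float, h: float) -> float:
    """Performance/cost ratio of a k-stage pipeline, equation (6)."""
    f = 1.0 / (t / k + d)          # maximum throughput, equation (5)
    return f / (c + k * h)         # divided by total pipeline cost


def optimal_stages(t: float, d: float, c: float, h: float) -> float:
    """Closed-form peak (derived, not from the text): k0 = sqrt(t*c / (d*h))."""
    return math.sqrt((t * c) / (d * h))


# Hypothetical values: t = 128, d = 2 (time units), c = 100, h = 4 (cost units).
t, d, c, h = 128.0, 2.0, 100.0, 4.0
best_k = max(range(1, 201), key=lambda k: pcr(k, t, d, c, h))
print(best_k, optimal_stages(t, d, c, h))
```

For these values both the search over k = 1..200 and the closed form give 40 stages; with other parameters k0 is generally not an integer and the neighbouring integers have to be compared.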

EFFICIENCY AND THROUGHPUT
 Efficiency of a k stage pipeline is
 $E_k = \frac{S_k}{k} = \frac{n}{k + (n-1)}$

PIPELINE THROUGHPUT
 It is defined as the no: of tasks performed per unit
time
 $H_k = E_k \cdot f = \frac{n f}{k + (n-1)}$, where $f = 1/\tau$
 $H_k = \frac{E_k}{\tau}$
 $H_k = \frac{S_k}{k\tau}$
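The speedup, efficiency and throughput formulas can be evaluated together. The sketch below computes them from equations (1) and (2); the function name `pipeline_metrics` and the sample values (k = 4 stages, n = 64 tasks, clock period of 10 time units) are assumptions for illustration.

```python
from typing import Tuple


def pipeline_metrics(k: int, n: int, tau: float) -> Tuple[float, float, float]:
    """Return (speedup S_k, efficiency E_k, throughput H_k)."""
    t_k = (k + (n - 1)) * tau      # pipelined time, equation (1)
    t_1 = n * k * tau              # non-pipelined time, equation (2)
    speedup = t_1 / t_k            # S_k = nk / (k + n - 1)
    efficiency = speedup / k       # E_k = n / (k + n - 1)
    throughput = efficiency / tau  # H_k = E_k / tau
    return speedup, efficiency, throughput


s_k, e_k, h_k = pipeline_metrics(k=4, n=64, tau=10.0)
print(round(s_k, 3), round(e_k, 3), round(h_k, 4))
```

As n grows, the speedup approaches k and the efficiency approaches 1, so the ideal k-fold gain is only reached for long streams of tasks.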

NON LINEAR PIPELINE PROCESSORS
 Linear pipelines are known as static pipelines,
 because they are used to perform fixed functions
 Non-linear pipelines are dynamic pipelines,
 because they can be reconfigured to perform variable
functions at different times
 A dynamic pipeline allows feedforward and
feedback connections in addition to the
streamline connections
 Hence the structure of this pipeline is called
non-linear

RESERVATION AND LATENCY ANALYSIS
 In a static pipeline, it is easy to partition a given
function into a sequence of linearly ordered
subfunctions
 Function partitioning is difficult in case of a
dynamic pipeline
 Because pipeline stages are interconnected with loops
in addition to streamline connections

 Feedforward and feedback connections make
scheduling of successive events difficult
 Due to these connections, the output of the pipeline need not
necessarily come from the last stage
 The same pipeline can be used to evaluate different
functions

RESERVATION TABLES
 Reservation table for a static pipeline is simple
 Because dataflow follows a linear streamline
 Reservation table for a dynamic pipeline is complex
 Because it follows a nonlinear pattern
 Multiple reservation tables can be generated for
evaluation of different functions
 A static pipeline is specified by a single reservation
table
 A dynamic pipeline may be specified by more than one
reservation table

 Each reservation table displays the time-space
flow of data through the pipeline for evaluation of
one function
 Different functions follow different paths through
the pipeline
 No: of columns in a reservation table ➔ evaluation
time of a given function
 Eg: function X requires 8 clock cycles
 function Y requires 6 clock cycles

 Checkmarks in each row of the reservation table
correspond to the cycles in which a particular stage
will be used
 Multiple checkmarks in a row indicate the repeated
usage of the same stage in different cycles
 Contiguous checkmarks in a row indicate the
extended usage of a stage over more than one cycle
 Multiple checkmarks in a column indicate that
multiple stages need to be used in parallel during a
particular clock cycle

LATENCY ANALYSIS

INTRODUCTION
 No: of clock cycles between two initiations of
a pipeline is the latency between them
 Latency is a non-negative integer
 A latency value k implies that two initiations are
separated by k clock cycles
 Eg: latencies read from the reservation table for X
(distance between checkmarks in the same row)
 6-1 ➔ 5
 8-6 ➔ 2

COLLISION
 An attempt by 2 or more initiations to use the
same pipeline stage at the same time will cause a
collision
 Collision implies resource conflicts between 2 initiations in
the pipeline
 Collisions must be avoided by proper scheduling of the
pipeline

TYPES OF LATENCIES
 2 types
 Forbidden latencies
 Permissible latencies
 Latencies which cause collisions are called
forbidden latencies
 Latencies which do not cause collisions are
called permissible latencies

 Latency sequence
 It is a sequence of permissible (non-forbidden) latencies
between successive task initiations
 Latency cycle
 It is a latency sequence which repeats the same
subsequence (cycle) indefinitely
 Eg: latency cycle (1,8)
 It represents an infinite latency sequence 1,8,1,8,….
 This implies that successive initiations of new tasks are
separated by one cycle and 8 cycles alternately

 Constant latency cycle
 It is a latency cycle which contains only one
latency value
 Eg: cycle (3)
 Average latency
 Average latency of a latency cycle is obtained by
dividing the sum of all latencies by the no: of
latencies along the cycle
 Eg: avg latency of latency cycle (1,8) ➔ (1+8)/2 = 4.5
 Average latency of a constant cycle is simply
the latency itself
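Latency cycles and their average latency are easy to reproduce in Python. The function names below are hypothetical; a latency cycle is represented as a tuple, and the first task is assumed to be initiated at clock cycle 0.

```python
from itertools import cycle, islice
from typing import List, Sequence


def average_latency(latency_cycle: Sequence[int]) -> float:
    """Sum of the latencies divided by the number of latencies in the cycle."""
    return sum(latency_cycle) / len(latency_cycle)


def initiation_times(latency_cycle: Sequence[int], count: int) -> List[int]:
    """Clock cycles at which the first `count` tasks are initiated."""
    times = [0]
    # Repeat the latency cycle indefinitely and take as many gaps as needed.
    for latency in islice(cycle(latency_cycle), count - 1):
        times.append(times[-1] + latency)
    return times


print(average_latency((1, 8)))        # 4.5
print(average_latency((3,)))          # 3.0
print(initiation_times((1, 8), 5))    # [0, 1, 9, 10, 18]
```

The initiation times show the alternating gaps of 1 and 8 cycles that the latency cycle (1,8) stands for.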

COLLISION FREE SCHEDULING
 Objective of scheduling
 Obtain the shortest average latency between initiations
without causing collisions
 Concepts used for collision free scheduling
 Collision vectors
 State diagrams
 Single cycles
 Greedy cycles
 Minimal average latency (MAL)

COLLISION VECTOR
 It is a vector displaying the combined set of
permissible and forbidden latencies in a pipeline
 For a reservation table with n columns, the maximum
forbidden latency is m
 m <= n-1
 Permissible latency ➔ p
 p should be as small as possible
 1 <= p <= m-1
 Collision vector is an m bit binary vector C
 $C = (C_m C_{m-1} \ldots C_2 C_1)$
 $C_i = 1$ if latency i causes collision
 $C_i = 0$ if latency i is permissible
 Eg: function X
 2,4,5,7 ➔ forbidden latencies
 Collision vector $C_X = (1011010)$
 Eg: function Y
 2,4 ➔ forbidden latencies
 Collision vector $C_Y = (1010)$
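A collision vector can be derived mechanically from a reservation table: every distance between two checkmarks in the same row is a forbidden latency. The reservation tables for X and Y are not reproduced in the text, so the two tables below are assumptions, chosen as 3-stage tables with 8 and 6 columns whose forbidden latencies come out as 2,4,5,7 and 2,4. The function names are hypothetical, and `collision_vector` assumes at least one forbidden latency.

```python
from typing import Dict, List, Set


def forbidden_latencies(table: Dict[str, List[int]]) -> Set[int]:
    """table maps a stage name to the clock cycles in which it is used."""
    forbidden = set()
    for cycles in table.values():
        for a in cycles:
            for b in cycles:
                if b > a:
                    forbidden.add(b - a)   # distance between two checkmarks
    return forbidden


def collision_vector(forbidden: Set[int]) -> str:
    """Bits C_m ... C_1, with C_i = 1 when latency i is forbidden."""
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


# Assumed reservation tables (not shown in the text above).
TABLE_X = {"S1": [1, 6, 8], "S2": [2, 4], "S3": [3, 5, 7]}
TABLE_Y = {"S1": [1, 5], "S2": [3], "S3": [2, 4, 6]}

for name, table in (("X", TABLE_X), ("Y", TABLE_Y)):
    f = forbidden_latencies(table)
    print(name, sorted(f), collision_vector(f))
```

For these assumed tables the output reproduces the vectors 1011010 and 1010 given above; the leftmost bit is always 1 because m is itself a forbidden latency.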
