
Petri Net Modeling of Interconnection Networks for Massively Parallel Architectures.

J.A. Gregorio, F. Vallejo, R. Beivide and C. Carrion


Departamento de Electrónica
Universidad de Cantabria
39005 Santander, Spain
e-mail: ja@ctrhp3.unican.es

Abstract. The analysis, design and evaluation of the interconnection subsystem for massively parallel architectures is normally carried out using computer simulation tools, requiring great computational costs. This work shows the suitability of the use of formal representation methods, like SPN (Stochastic Petri Nets), for the description of the message routers, focusing on two important features. Firstly, the possibility of obtaining network performance indicators using simulation but with a lower computational cost than using conventional techniques. And secondly, making the basic parameters of the network design independent of the router implementation issues, thus simplifying the method of establishing the behavior of new router structures. This approach has been successfully applied to the analysis of both symmetrical torus and asymmetrical mesh interconnection topologies, with cut-through flow control, oblivious routing and random traffic. Two different functional router structures have been used in each case: transit buffers located at the input or at the output router channels.

Key words: Interconnection subsystem, massively parallel architectures, stochastic Petri nets, hardware message router, performance evaluation, simulation.

I. Introduction.

Nowadays there is no great problem in developing parallel computers composed of a large number of CMOS microprocessors. A computer using at least 1000 processing elements, processors, or computers in parallel is qualified as "massively parallel" [7]. Experiences like those carried out in Caltech with the Mosaic project show that multicomputers with up to 16K nodes can be built [33]. On the other hand, all the big computer manufacturers offer in their range of products parallel machines with hundreds or thousands of nodes. Such is the case of the CM-5 [23], the Intel Paragon [18], the Meiko CS-2 [17], the Convex MPP [7], and the KSR-2 [16]. Additionally, the idea of connecting hundreds or thousands of workstations using a fast interconnection network is becoming a real and practical alternative for exploiting coarse grain parallelism [7].

It seems evident that the future of massively parallel architectures greatly depends on the adequate design of interconnection subsystems (IS) showing high performance and scalability. On the one hand, latency and throughput are the basic indicators of the IS performance; in order to exploit fine grain parallelism it is necessary to manage messages in the network with low latency even in situations of heavy traffic. On the other hand, it also seems clear that, in order to make an IS highly scalable, direct networks of the k-ary n-cube type with or without wrap-around links must be used [13]. Implementability issues indicate that when a parallel architecture with thousands of nodes is to be designed, it is necessary to look for mesh or torus topologies of 2 or 3 dimensions [2].

Fundamentally, the architect of this type of parallel machines has three basic design tools available: computer simulation, monitoring techniques and formal models. The problem associated with the simulation tools is the computational cost required; for example, a complete simulation of a 16K nodes IS at the register transfer level needs days or even weeks of CPU time on a modern workstation. More accurate simulations, at the logic gate or even at the electrical level, could require an inordinate computational cost. Alternatively, monitoring techniques using software tracers or hardware spies show an inevitable problem: it is necessary to have a working system to obtain measurements. Undoubtedly, the use of formal models that can lead to analytical expressions to represent the IS behavior is an interesting goal in this active area of research. On the one hand, formal methods are extremely elegant since they can give performance indicators in the form of equations which reflect the behavior of the system as a function of the basic design parameters. On the other hand, they avoid the inherent computational costs of the simulation processes, thus reducing the time and the cost of the design process.

Traditionally, queuing theory has been used to evaluate the performance in high-speed data communication networks. For instance, independent M/M/1 queues frequently provide a good approximation to the operation of store-and-forward networks [22].

This work has been done with the support of the CICYT, Spain, under contracts TIC93-0243 and TIC94-0370-E.

Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the Association for Computing Machinery, Inc. (ACM). To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee.
ICS '95 Barcelona, Spain © 1995 ACM 0-89791-726-6/95/0007...$3.50

Unfortunately this flow control mechanism cannot be used in massively parallel architectures since it gives rise to a high message latency, which is unacceptable from a practical point of view. Nowadays, the most widely used flow control mechanism is the so called wormhole [12], which is the blocking version of the cut-through technique [20]. The use of these mechanisms makes IS modeling through queuing networks more difficult [1][2][14][21].

In this paper, an alternative approach to model and evaluate the IS behavior for massively parallel architectures is presented, which gives accurate performance indicators with reduced computational costs. The proposed method is based on the description of the message router using a particular type of Petri net, the so called DSPN (Stochastic Petri Net with Deterministic and Exponential Firing Times) [4]. In recent literature, there have been few papers in which Petri nets have been used to evaluate the performance of interconnection networks. Caselli's work [8] stands out, describing a GSPN-based approach to performance evaluation of message-passing parallel architectures. That work takes into account both the effects of the synchronization schemes and the interconnection topology, although the dimension of the analyzed net is limited to a maximum size of 4x4 nodes.

The main noticeable points in the approach taken in this work are: a) using a formal description tool allows a univocal representation of the router functions; b) using top-down or bottom-up techniques, the same tool can be used for router description at all levels, from the physical level [25] up to the highest functional level; c) although, in most cases, simulation processes must be used in order to obtain performance metrics, in no case are modifications to the simulator necessary; d) gradual approximations to the model can be carried out. This allows the use of the complete model or the realization of small approximations which drastically reduce the required simulation time and computational power. Moreover, the methodology used here permits a hierarchical evaluation of the IS, making the analysis of the system as a function of the basic network design parameters (topology, flow control, routing and traffic pattern) independent from the hardware structure of the message router.

The remainder of this paper is organized as follows. Section II establishes the application environment in which the proposed performance evaluation method is employed. Section III deals with the proposal of a generic router model. Section IV completes the modeling process by including the architectural characteristics of the message router. Additionally, in this section, the main performance metrics obtained through the model are presented, and a comparison with similar metrics supplied by a conventional register transfer level simulator is established. Finally, section V concludes the paper summarizing the most important results.

II. Scenarios, Metrics and Basic Parameters.

In this section, the selection of the three fundamental issues that characterize the proposed method of performance evaluation is justified. These three issues are: traffic conditions, performance indicators and the basic parameters of the IS design. Basically, the problem confronted in this work is to find a reliable evaluation method with a low cost which allows the prediction of the performance of a given IS under specific traffic conditions, before attempting its design. Results obtained using a conventional simulator will be used in order to validate the approach proposed in this paper.

With respect to the traffic conditions applied to the IS in our experiments, situations of random traffic have been considered. This type of scenario gives an idea about the network behavior corresponding to the worst case. Normally, parallel applications will show a certain degree of communication locality, which could be exploited to improve the global system performance. In the majority of our experiments, message lengths of 32 flits have been used, although this is a parameter that could be varied at any time depending on the degree of parallelism which must be represented.

With regard to the performance indicators, it seems obvious that the average message latency and the maximum sustained throughput should be considered. The majority of the authors share this type of choice. Second order indicators, such as the maximum latency, the average queue size and the communication channels occupation could also be considered.

Finally, it is necessary to establish the basic design parameters of the interconnection subsystem: topology, routing and flow control. In reference to topology, k-ary n-cube networks [13], with and without wrap-around links, are considered. This choice is conditioned by the easy scalability of this type of networks. Other topologies such as hypercubes [31] or fat-trees [23] have also been proposed. The authors consider that, when thousands of nodes are to be used, the most implementable and scalable alternatives are the two or three dimensional meshes and toruses. In our experiments, bidimensional structures with a variable number of nodes per dimension have been considered, although there is no difficulty in carrying out evaluations with three dimensional topologies.

With respect to the routing mechanism, DOR (Dimensional Order Routing) has been chosen. Recent studies show that for the scenarios under consideration, with situations of uniform traffic, there are no significant advantages in performance when adaptive routing strategies are used [27]. Additionally, DOR homogeneously distributes the traffic in the topologies considered. Its implementation is simple and, when a mesh is used, the problem of message deadlock is avoided. When torus networks are considered, the problem of message deadlock arises even using DOR routing. In our model, infinite router buffers have been employed, thus avoiding the deadlock problem. Another set of experiments was carried out for a finite buffer space, and no differences were observed; in that case a new deadlock avoidance method for torus networks proposed by the authors was implemented [5]. In addition, the methodology presented here allows the use of other routing algorithms, to be considered by the authors in the future.

Finally, the selected mechanism of flow control is cut-through. Many modern interconnection subsystems use a blocking version of cut-through named wormhole, although there are normally several flits of temporary storage in the routers. Our work is oriented towards fine grain parallelism, in which the length of the packets is normally no greater than a few flits. Moreover, recent studies show that this type of router, with a buffer of only one packet per input/output link, reaches more than 90% of the performance achieved with infinite buffers when random traffic is considered [28]. Thus, the temporary storage requirements in the routers will be minimal. It is also well known that the cut-through mechanism can double the performance of the wormhole technique.
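The dimension-order routing chosen in this section can be illustrated with a minimal sketch. It is not taken from the paper: the function name `dor_path`, the coordinate convention and the tie-breaking rule for even radices are assumptions made only for illustration. The x offset is corrected completely before the y offset, and in a torus the shorter way around each ring is taken.

```python
def dor_path(src, dst, radix, wraparound=True):
    """Nodes visited from src to dst, correcting the x offset first, then y.

    Hypothetical helper: wraparound=True models a torus, False a mesh.
    """

    def unit_step(cur, target):
        # Signed one-hop move along a single dimension.
        if cur == target:
            return 0
        if not wraparound:                       # mesh: only one way to go
            return 1 if target > cur else -1
        ahead = (target - cur) % radix           # hops in the + direction
        # Torus: take the shorter way round (ties go to +, an assumption).
        return 1 if ahead <= radix - ahead else -1

    def advance(cur, target):
        nxt = cur + unit_step(cur, target)
        return nxt % radix if wraparound else nxt

    x, y = src
    path = [(x, y)]
    while x != dst[0]:                           # dimension x is exhausted first
        x = advance(x, dst[0])
        path.append((x, y))
    while y != dst[1]:                           # only then dimension y
        y = advance(y, dst[1])
        path.append((x, y))
    return path
```

Because the route depends only on source and destination, the routing is oblivious, which is what makes the closed-form traffic expressions of the following sections possible.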

III. Petri Model of the Message Router.

III.1. Generic Router Model.

Obtaining a complete Petri model of the message router requires a detailed knowledge of all the features of the router functions, as well as of the environment in which it is used. However, it is possible to obtain the complete model in two different phases. In the first step, an incomplete generic model can be obtained including all the aspects determined by: the interconnection network, the routing strategy, the flow control mechanism and the traffic pattern. Afterwards, the aspects corresponding to the specific router architecture are incorporated, that is, whether buffers are located at the input or output channels, the storage capacity, whether or not simultaneous writing is possible, etc. The model is fully defined when both aspects are established. This allows, for example, the modification of the internal router structure without affecting the first phase of the modeling.

Figure 1. Petri representation of router datapath.

In the model representation, two types of transitions appear: the immediate ones, corresponding to control or decision actions, and the timed ones, corresponding to actions which employ a time in firing, according to some time distribution. In figure 1, the representation of a generic router node corresponding to an n-dimensional interconnection network is shown. The arrival of messages transmitted by the neighboring nodes, which are represented by tokens, implies the firing of the timed transitions $t_i$. Places $P_i$ have as many output arcs as the number of possible paths for the arriving information. The firing probabilities of the transitions $t_{r_{xy}}$ represent the probability of the information following each path. The number of possible paths is only determined by the routing algorithm used, while the transition probabilities depend on the routing and the network structure. The incomplete arcs which appear in these transitions represent the restrictions corresponding to the router hardware structure.

III.2. Model Parameters.

The model parameters to be determined correspond to the firing probabilities of the immediate transitions and to the firing frequencies of the timed transitions [3]. The first ones are determined by the structure of the interconnection network and the routing used, and the second ones by the characteristics of the information packets.

Since cut-through flow control is assumed, when the header flit reaches a node and reserves some of its resources, these remain blocked until all the packet flits have passed, spending a time proportional to L, the packet length. This allows us to consider a packet, within the model, as the minimum information unit. The input of the information in the node is represented by the firing of the input transitions $t_i$, corresponding to the output transitions $t_o$ of the neighboring nodes, and so it can be considered that once any of the input or output transitions has been enabled, the firing frequencies, $\gamma$, will be inversely proportional to the packet length L:

$$ \gamma(t_i) = \gamma(t_o) = \frac{1}{L} \qquad (1) $$

As will be proved later, the constant value of the packet length, L, means that the firing time of the output transitions is constant and so, the transitions have a deterministic firing model. Besides the routing utilized, the firing probabilities of the immediate transitions depend on the structure of the interconnection network. Until both are specified it is not possible to obtain the modeling parameters. Following an approach like that in [2],[6], the method of obtaining the basic parameters for torus and mesh networks will be presented. Although the process is carried out for two dimensions, extending it to n dimensions is straightforward.

III.2.1. Model Parameters for Torus Networks.

In stationary state, the number of packets injected into the network must be equal to the number of packets consumed by it. All the packets that circulate through a node can be divided in four categories: packets which have been generated in the node itself; packets entering the node coming from other nodes; packets consumed in the node; and packets in transit, which correspond to those entering minus those that are being consumed. In a bidimensional torus, when homogeneous message generation and random message destination with static DOR routing are considered, it is easy to obtain the number of messages of each of these four groups.

Input packets in each direction. Assuming that q is the number of packets generated by each node per unit time, those directed to a given node will be $q/R^2$, since a pattern of random destination is assumed (footnote 1), R being the network radix or the number of nodes per dimension. Thus, the input packets for each direction will be all those directed to the node itself, or to nodes further away whose route from the original node passes through this node. For example, the input packets in the west direction will be (fig. 2):

$$ IP_D = \frac{q}{R^2}\left(R \cdot 1 + R \cdot 2 + \dots + R \cdot \frac{R-1}{2}\right) = q\,\frac{R^2-1}{8R} \qquad (2) $$

that is, those injected by all the neighboring nodes located in the same row and at a distance less than or equal to (footnote 2) (R-1)/2, together with those sent by these nodes to other nodes further away, but by the same path. Due to the symmetry of the torus, these correspond to the packets which enter the node from any direction.

Footnote 1: It really should be $q/(R^2-1)$ in order to eliminate the possibility that one node sends packets to itself. Although this is not a realistic assumption, it simplifies the expressions and practically does not modify the results.

Footnote 2: An odd value of R is supposed for simplicity.

Figure 2. Input packets from west direction.

Packets generated by each node. The packets generated by each node that are routed in either of the two directions in the x axis are determined by the product of the packets that each node sends to another, $q/R^2$, multiplied by the number of nodes found to the right (or to the left) at a distance less than or equal to (R-1)/2; then:

(3)

In the same way, the packets generated by each node which are directed towards the y axis in either of the two directions can be expressed as:

(4)

Finally, the sum of the packets sent towards the four possible directions along with those directed to the node itself, $q/R^2$, obviously is equal to all the packets, q, generated by a node.

Packets consumed by each node. Taking into account the homogeneity of the destinations and the DOR routing, it is simple to determine, from all the input packets in a node, those that are consumed in it. So, the packets reaching a node along the x axis, in either of the two directions, correspond to those generated by the nodes located in the same row, at a distance less than or equal to (R-1)/2; then:

$$ P_{xc} = \frac{q}{R^2} \cdot \frac{R-1}{2} \qquad (5) $$

Similarly, the consumed packets which arrive in either of the two directions of the y axis can be expressed as:

$$ P_{yc} = \frac{q}{R^2} \cdot \frac{R(R-1)}{2} \qquad (6) $$

Transit packets. The transit packets are those which enter a node but are not consumed in it. They can be determined either as the difference between those entered and those consumed, or directly. In any case, these packets can be divided into two groups according to whether they change dimension in the node or not. The packets which pass through a node without changing direction are:

$$ P_{SD} = \frac{q}{8R}\,(R-1)(R-3) \qquad (7) $$

And the transit packets which change dimension in the node correspond to those routed to nodes located in the same column:

(8)

Probabilities of immediate transitions in torus networks. From the previously obtained expressions for the number of packets, the firing probabilities of each of the immediate transitions can be immediately obtained (see Table I). So, for example, in order to determine the firing probability of the transition which represents the packets passing through the node without changing dimension, it is sufficient to obtain the relationship between the packets passing through a node in the direction x (y) and those input along the x (y) axis.
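The step from packet counts to firing probabilities can be checked numerically. The sketch below is only an illustration under stated assumptions: the function and key names are hypothetical, and the generated-packet and dimension-changing counts are not copied from the paper's equations (3), (4) and (8) but inferred from the surrounding text and from the probabilities of Table I. Exact fractions are used so that the ratios can be compared term by term.

```python
from fractions import Fraction


def torus_rates(R, q=1):
    """Per-direction packet rates at a node of an R x R torus (R odd)."""
    if R < 3 or R % 2 == 0:
        raise ValueError("R must be odd and at least 3")
    R, q = Fraction(R), Fraction(q)
    pair = q / R**2          # rate from one node to one given destination
    half = (R - 1) / 2       # nodes reachable one way along a ring
    return {
        "input": q * (R**2 - 1) / (8 * R),            # eq. (2)
        "gen_x": pair * R * half,                     # assumed form of eq. (3)
        "gen_y": pair * half,                         # assumed form of eq. (4)
        "cons_x": pair * half,                        # eq. (5)
        "cons_y": pair * R * half,                    # eq. (6)
        "straight": q * (R - 1) * (R - 3) / (8 * R),  # eq. (7)
        # Assumed form of eq. (8), for one x input and one y output.
        "turn": pair * half * half,
    }


def torus_probabilities(R):
    """Firing probabilities as ratios of the rates above (q cancels out)."""
    r = torus_rates(R)
    return {
        "straight": r["straight"] / r["input"],    # e.g. East -> West
        "x_to_y": r["turn"] / r["input"],          # e.g. East -> North
        "x_consume": r["cons_x"] / r["input"],     # East/West -> Consumption
        "y_consume": r["cons_y"] / r["input"],     # North/South -> Consumption
        "inject_x": r["gen_x"],                    # Injection -> East or West
        "inject_y": r["gen_y"],                    # Injection -> North or South
        "inject_consume": 1 / Fraction(R) ** 2,    # Injection -> Consumption
    }
```

For any odd R, the probabilities leaving each place add up to one, which is a quick consistency check on the reconstructed counts: an x input splits into one straight, two turning and one consumption transition, a y input into one straight and one consumption transition.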

Table I. Firing probabilities of the immediate transitions of the Petri model of a router in a torus network.

    Transition                                                    Firing probability
    North → South; South → North; East → West; West → East        (R-3)/(R+1)
    East → South; East → North; West → South; West → North        2(R-1)/(R(R+1))
    East → Consumption; West → Consumption                        4/(R(R+1))
    North → Consumption; South → Consumption                      4/(R+1)
    Injection → East; Injection → West                            (R-1)/(2R)
    Injection → North; Injection → South                          (R-1)/(2R^2)
    Injection → Consumption                                       1/R^2

III.2.2. Model Parameters for Mesh Networks.

Following an analysis parallel to that of torus networks, each one of the expressions of the generated, input, consumed and transit packets for mesh networks can be obtained. Contrary to the torus, the mesh is an asymmetric structure. Each node position (its coordinates x, y) determines the flow of packets which enter or leave along each of its communication channels, and so the expressions of the packets are also dependent on these coordinates.

In the same way as in the previous subsection, from the expressions of the four groups of packets it is easy to obtain the parameters of the router model. Table II shows the firing probabilities of each immediate transition of the model for a mesh. Although the number of transitions is the same as for the torus case, the number of expressions is greater because of the mesh asymmetry.

Table II. Firing probabilities of the immediate transitions of the Petri model of the router, for a mesh network.

    Transition                  Firing probability
    North → South               y/(y+1)
    North → Consumption         1/(y+1)
    South → North               (R-y-1)/(R-y)
    South → Consumption         1/(R-y)
    East → North                (R-y-1)/(R(x+1))
    East → South                y/(R(x+1))
    East → West                 x/(x+1)
    East → Consumption          1/(R(x+1))
    West → North                (R-y-1)/(R(R-x))
    West → South                y/(R(R-x))
    West → East                 (R-x-1)/(R-x)
    West → Consumption          1/(R(R-x))
    Injection → North           (R-y-1)/R^2
    Injection → South           y/R^2
    Injection → East            (R-x-1)/R
    Injection → West            x/R
    Injection → Consumption     1/R^2

IV. Modelling of Message Routers with Different Functional Structures

After setting the firing probabilities of the immediate transitions, a description of a node functional structure completes the model. In this section, the model is fully established taking into account two alternative router structures; subsection IV.1 deals with a router model in which the transit buffers are placed at the output channels and, later in the paper, subsection IV.2 considers the model when the transit buffers are associated to the input channels.

IV.1. Message Router with Output Buffers

A- Petri Model of the Router

In the case of a message router with transit buffers placed at the output channels, as in figure 3, such buffers are represented in the Petri model as places in which the tokens that indicate packets arriving at the node are stored (see Fig. 4). A token (packet) arrives at a node by firing a $t_{i_z}$ transition, and it is stored at an input place, $P_{i_z}$ ($z \in \{n, s, e, w\}$); then, depending on the probabilities of firing a transition, it is routed to one of the output buffers $P_{o_z}$. Each one of these places, $P_{o_z}$, is associated to a timed transition $t_{o_z}$ that represents the time spent by a packet on being forwarded across the corresponding output channel. This amount of time is proportional to L, the packet length.

Nonetheless, when using a cut-through flow control mechanism, the firing time can be divided into two components: the time spent by the packet header to leave the node, and the time to send the remainder of the packet through the output channel, i.e. the packet spooling time. For each channel, both times have been modelled using two transitions, $t_{o_z}$ and $t_{o_{zc}}$, and a

place, channel_z, with an initial token representing the output channel as a non-shared resource.

Figure 3. Output buffer router.

Figure 4 shows a complete Petri model of a message router with infinite buffers associated to the output channels. The firing probability values of each immediate transition depend on the topological characteristics of the interconnection network in which the router will be used (tables I and II). The firing frequencies of the timed transitions are given, basically, by the length of the packets managed by the network.

Figure 4. Petri model of the output buffers router.

B- Load Model

After establishing the Petri model of the message router, the average waiting time of the tokens (messages) must be determined in order to calculate the average message latency. Unfortunately, the mix of deterministic and exponential transition firing delays prevents the obtention of analytical expressions reflecting the system behavior [4]. Therefore, a strategy based on the Petri Net simulation has been employed.

Figure 5. Load model.

The substitution of each node of an RxR network by its corresponding Petri model would lead to a very high computational cost. Hence, it is more convenient to employ a single router in combination with a load model that emulates the effect of all the other elements of the network. In a given network, this approximation will be valid if it is possible to determine the packet arrival distribution at each node. This is the case of the networks under study (meshes and toruses), for which the arrival distribution of packets from each direction has been calculated. As a first approximation, it is sufficient to consider that packet arrival at each input follows an exponential distribution with a firing frequency $\nu$. In the case of a torus, being a symmetric network, all the $\nu$ values of a node ($\nu_n$, $\nu_s$, $\nu_e$ and $\nu_w$) are equal and given by the following equation:

$$ \nu = q\,\frac{R^2-1}{8R} \qquad (9) $$

However, when the node is a component of a mesh network, the firing frequency values of each input depend on the node coordinates, fulfilling the following expressions:

$$ \nu_n = \frac{q}{R}\,(y+1)(R-y-1), \qquad \nu_s = \frac{q}{R}\,(R-y)\,y, \qquad \nu_e = \frac{q}{R}\,(x+1)(R-x-1), \qquad \nu_w = \frac{q}{R}\,(R-x)\,x \qquad (10) $$

The use of this load model allows us, besides working with a single node model, a first simplification in relation with the output transitions, consisting in the elimination of the channel_z places and their corresponding arcs, as well as the transition $t_{o_{zc}}$.

IV.1.1 Average Packet Latency in Torus Interconnection Networks

The use of the previous models allows us to establish the average delay, $t_d$, of the tokens (packets) in the different places (buffers) of the node. Now then, because of the network symmetry, this delay would have the same value in all the torus nodes and, consequently, by knowing only $d_t$, the torus average distance, and the time $t_h$ that a packet spends to go across a node with no collisions, it is possible to determine the average packet latency T, as shown below:

$$ T = d_t\,(t_d - L + t_h) + (L - 1) \qquad (11) $$

where $d_t = R/2$. When the network is totally unloaded $t_d$ is equal to L, and only the time $t_h$ and the spooling time are responsible for the average latency.

Figure 6. Comparison between the average packet latencies obtained by the model and by a conventional simulator, for torus networks. (Curves: Sim. and Model for R=15/L=32 and R=31/L=32, Sim. for R=127/L=32; horizontal axis: network load, 0.1 to 0.8.)

Figure 7. Delay distribution of a packet at the channel_z buffer and the average delay of each mesh network node. (Panels: North (South) channel; East (West) channel.)
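The load model and the torus latency expression lend themselves to a direct numerical sketch. The code below is an illustration only: the function names are hypothetical, the per-input mesh rates are written as they follow from DOR with x routed first and uniformly random destinations (an assumption about the intended form of equation (10)), and the values of `t_d` and `t_h` must come from the Petri net simulation and the router design respectively.

```python
def torus_arrival_rate(q, R):
    """Equation (9): arrival rate at each input of a torus node."""
    return q * (R**2 - 1) / (8 * R)


def mesh_arrival_rates(q, R, x, y):
    """Arrival rates at the four inputs of mesh node (x, y), 0 <= x, y <= R-1.

    Assumed reading of equation (10): x grows towards the east and y
    towards the north, matching the injection probabilities of Table II.
    """
    return {
        "n": q / R * (y + 1) * (R - y - 1),   # packets arriving from the north
        "s": q / R * (R - y) * y,             # packets arriving from the south
        "e": q / R * (x + 1) * (R - x - 1),   # packets arriving from the east
        "w": q / R * (R - x) * x,             # packets arriving from the west
    }


def torus_latency(R, L, t_d, t_h):
    """Equation (11) with the torus average distance d_t = R/2."""
    d_t = R / 2
    return d_t * (t_d - L + t_h) + (L - 1)
```

As a usage example, with an unloaded 15x15 torus, 32-flit packets and a purely illustrative `t_h` of one cycle, `torus_latency(15, 32, t_d=32, t_h=1)` gives 38.5 cycles: 7.5 hops of header time plus 31 cycles of spooling.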
Figure 6 shows the close correspondence of the packet latency results, for different network sizes and message lengths, obtained with the simulation of the Petri model of a single node compared to the ones obtained using a conventional network simulator. It should be noticed that while a conventional simulator could require hours to determine some of the values shown in figure 6, the Petri simulator can obtain close values in a much lower computational time. For instance, for a bidimensional torus of 128x128 nodes, while a conventional simulator running on an HP 9000/735 workstation spent more than 7 hours to obtain the latency curve, the Petri model provides the same values in just 6 seconds.

IV.1.2 Average Packet Latency in Mesh Interconnection Networks

As noticed above, when the firing probabilities were established, in a mesh network each node exhibits a different behavior depending on its position. So, each packet crossing the mesh will be delayed a different amount of time in each node of its path. Because of the unbalanced load distribution in the mesh, as the center nodes have a higher load than the peripheral ones, messages crossing the center will suffer more collisions and, therefore, a higher delay.

The model presented above allows us to establish the average packet delay by simulating each network node separately. Knowing the node coordinates, all the simulation parameters are established and it is possible to obtain a spatial distribution of the message delays. Figure 7 shows, as an example, the packet delay distribution in each network node, for a 15x15 mesh with a load of 50% of its maximum capacity and a packet length of 32 flits.

Since we have considered a random distribution of the messages, the average packet latency can still be calculated by equation 11. In this case, the average network distance is $d_m = \frac{2}{3}(R - 1/R)$ and the average delay $t_d$ can be obtained as an arithmetic average of the average delays of each node:

(12)

When the number of nodes is high, as in the networks under consideration, the obtention of the $t_d$ value from equation (12) can result in computing times similar to those required by a conventional simulator. Thus, a simplification for the $t_d$ obtention has been used. The symmetry of the delay distribution in the mesh (see figure 7) allows us to consider just one row and one column in order to obtain the $t_d$ value. Concretely the (R-1/4, y) row and the (x, R-1/4) column with 0 ≤ x, y ≤ R-1 have been employed. This approach reduces the problem complexity by a factor of R.

Figure 8 shows the close correspondence of the results derived from this method and those obtained using a conventional simulator. Again, the simulation time for the Petri model is reduced compared with a conventional simulation: the bigger the network, the higher the reduction.

Figure 8. Average packet latencies for meshes with the buffers located at the output channels. (Curves: Sim. and Model for R=15/L=32 and R=63/L=32; horizontal axis: network load, 0.1 to 0.8.)
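The mesh latency computation described in this subsection can be sketched as follows. This is a hedged illustration, not the authors' code: `node_delay` is a hypothetical callback standing in for one single-node Petri net simulation, the optional `nodes` argument is an assumed way of restricting the average to a subset such as one row and one column, and equation (11) is simply re-evaluated with the mesh average distance.

```python
def mesh_average_distance(R):
    """Average distance of an R x R mesh: d_m = (2/3)(R - 1/R)."""
    return 2 / 3 * (R - 1 / R)


def mean_node_delay(R, node_delay, nodes=None):
    """Arithmetic mean of the per-node average delays.

    node_delay(x, y) is a hypothetical stand-in for the simulated average
    delay at node (x, y). By default every node is visited (R*R calls);
    passing a smaller iterable of (x, y) pairs gives a reduced estimate.
    """
    if nodes is None:
        nodes = [(x, y) for x in range(R) for y in range(R)]
    nodes = list(nodes)
    return sum(node_delay(x, y) for x, y in nodes) / len(nodes)


def mesh_latency(R, L, t_d, t_h):
    """Equation (11) evaluated with the mesh average distance d_m."""
    return mesh_average_distance(R) * (t_d - L + t_h) + (L - 1)
```

Averaging over all nodes costs R*R single-node simulations, whereas a subset of one row and one column costs about 2R, which is where the factor-of-R saving mentioned above comes from.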
To give an idea about the CPU time consumed in this type of experiments, a version of the Pertel simulator for meshes [28] was developed on an HP 9000/735 workstation. In the case of a bidimensional mesh of 64x64 nodes, with 32-flit messages and a load level corresponding to 80% of the maximum achievable throughput, the evaluation of the average latency spent 1077 minutes of CPU time. On the contrary, our Petri Net model spent 13 minutes obtaining a similar latency value. In order to reduce this time, additional work is in progress that attempts to achieve a methodology similar to that of the torus case.

IV.2 Message Router with Input Buffers

The problems of implementing routers with output queues associated to the output channels are well known [15],[34]. Such a router architecture has a packet delay lower than a router with input buffers [19], but it requires the ability of all input channels to write simultaneously in the output buffers. This results in a higher complexity and implementation cost. For this reason, it is common to implement routers with buffers associated to the input channels.

A main advantage of using Petri Nets for modelling a router is that if the functional structure is modified, but all other working conditions remain constant, the parameters of the model

Figure 9. Partial router model with the buffers located at the input channels.

advance: no packet precedes it in the input FIFO buffer and its selected output channel is free. In the Petri net both conditions are represented by a place for the input channel (bw) and another place associated to the selected output channel (cn), both of them with an initial token; without this token the packets cannot advance, so they must wait for it.

For instance, a token arrives at a node from the west channel and is stored at place pw2. If it is the first packet of the buffer, there will be a token at place bw and, with some probability, one of the output transitions of this place will be fired, routing it to an output channel z, and marking it by putting a token at place pwzz. If the output channel is free, the token is transferred to the corresponding output place and, after a time proportional to the packet length, both the channel resource (place cz) and the buffer head (place bz) are returned. If another packet had arrived from another input channel directed to the same output channel, it would have found the channel resource busy, and then it would have waited until it was released,

The average packet delay at each node, as well as the
are still valid and only the buffer control part has to be modified. average packet latency, are calculated in a similar way and with
the same parameters that in the previous subsection. Figure 10
Figure 9 shows a Petri model of a router with a different shows, as an example, the results obtained for bidimensional
functional structure that the one described above. In this case the meshes (the most complex case considered here). Once more, this
buffers are placed at the input channels (for simplicity only two of method obtains results very similar to the ones of a conventional
the directions are shown, being the others symmetric to them). The simulation, with computational times in the same range of
model is slightly more complex than the previous one, because previous experiments.
now two conditions have to be satisfied for an input packet to

114
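The two firing conditions of the input-buffered router (the packet is
at the head of its input FIFO, and its selected output channel is
free) can be illustrated with a small token-game sketch. This is not
the SPN model or the tool used in the paper: it is a minimal Python
illustration that assumes integer time steps, a fixed packet length,
and a deterministic arbitration between inputs (alphabetical order),
whereas the model in the paper chooses the output probabilistically.
The names simulate_input_buffered_router, head_token and chan_token
are hypothetical and only mirror the roles of the buffer head and
channel resource places.

```python
from collections import deque


def simulate_input_buffered_router(arrivals, packet_length):
    """Token game for one router with FIFO buffers at the input channels.

    arrivals: list of (arrival_time, input_channel, output_channel).
    Returns {packet_index: time_at_which_the_packet_starts_advancing}.
    Assumptions (not from the paper): integer time steps and
    alphabetical arbitration between competing inputs.
    """
    inputs = sorted({a[1] for a in arrivals})
    outputs = sorted({a[2] for a in arrivals})
    fifo = {i: deque() for i in inputs}        # packets stored per input
    head_token = {i: 1 for i in inputs}        # buffer head place, initial token
    chan_token = {o: 1 for o in outputs}       # channel resource place, initial token
    pending = deque(sorted((t, k, i, o) for k, (t, i, o) in enumerate(arrivals)))
    busy = []                                  # (release_time, input, output)
    start = {}
    now = 0
    while pending or busy or any(fifo.values()):
        # Return both tokens once a transmission has lasted packet_length.
        still_busy = []
        for release_time, i, o in busy:
            if release_time <= now:
                head_token[i] = 1
                chan_token[o] = 1
            else:
                still_busy.append((release_time, i, o))
        busy = still_busy
        # Store newly arrived packets in their input FIFO.
        while pending and pending[0][0] <= now:
            _, k, i, o = pending.popleft()
            fifo[i].append((k, o))
        # Fire: only the head packet may advance, and only if both the
        # buffer head token and the channel resource token are present.
        for i in inputs:
            if fifo[i] and head_token[i] == 1:
                k, o = fifo[i][0]
                if chan_token[o] == 1:
                    fifo[i].popleft()
                    head_token[i] = 0
                    chan_token[o] = 0
                    busy.append((now + packet_length, i, o))
                    start[k] = now
        now += 1
    return start
```

For example, with the packets (0, "west", "east"), (0, "south",
"east") and (1, "west", "north") and a packet length of 4, the sketch
starts them at times 4, 0 and 8 respectively. The third packet waits
behind the head of the west FIFO even though its own output channel is
free, because the buffer head token is returned only when the
transmission ends.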
Figure 10 Average packet latencies for meshes with buffers located at
the input channels (latency versus network load, 0.1 to 0.8; curves:
Sim. and Model for R=15/L=32 and for R=63/L=32).

V. Conclusions

This paper attempts to highlight the potential of SPN models, not
only for describing the functional structure of the message router
for interconnection networks, but also for analyzing the network
performance at a lower computational cost than conventional
simulators.

Petri nets constitute a precise and unambiguous method of
representing the conceptual or high-level functioning of the
concurrent operations which occur in this type of system. This
probabilistic model, which does not need to know most of the
implementation details in order to obtain important performance
indicators, is very suitable for use in early design stages.

The analysis carried out on interconnection networks with mesh or
torus topology, cut-through flow control and oblivious routing has
demonstrated how, by using timed Petri nets, an incomplete
probabilistic model can be obtained independently of the internal
router structure. Later this model is completed for each of the
different router structures to be analyzed. It is very beneficial to
reach a high degree of independence between aspects related to the
interconnection network, such as topology, routing, flow control,
etc., and issues of the internal structure of the router itself, such
as buffer placement, capacity, etc., because this facilitates the
trial of different router structures within the same network.

It can be shown that, starting from the complete representation of
the router behavior under given conditions, the model can be
simplified in a more natural way than when there is no graphic
representation of the behavior. Each of the characteristics taken
into account by the analysis is clearly located.

Some aspects, such as the fixed length of the packet, require the
introduction of deterministic firing times in the Petri transitions,
and this creates serious difficulties in obtaining analytical
results, since the GSPN resolution techniques cannot be used.
However, it has been shown that numerical results can be obtained by
simulation of the resultant Petri model, with a lower computational
cost and metrics very close to those obtained by conventional
simulation of the corresponding interconnection network. Our future
interest is directed, on the one hand, to further study of this
aspect using more powerful analysis tools and, on the other hand, to
the use of the method for the analysis of other types of routing,
flow control and interconnection topologies.

References

[1] V.S. Adve and M.K. Vernon. Performance Analysis of Mesh
     Interconnection Networks with Deterministic Routing. IEEE Trans.
     on Parallel and Distributed Systems, vol. 5, no. 3, pp. 225-246,
     March 1994.
[2] A. Agarwal. Limits on Interconnection Network Performance. IEEE
     Trans. on Parallel and Distributed Systems, vol. 2, no. 4,
     pp. 398-412, Oct. 1991.
[3] M. Ajmone Marsan, G. Balbo, and G. Conte. A class of Generalized
     Stochastic Petri Nets for the performance evaluation of
     multiprocessor systems. ACM Transactions on Computer Systems,
     vol. 2, pp. 93-122, May 1984.
[4] M. Ajmone Marsan and G. Chiola. On Petri Nets with Deterministic
     and Exponential Transition Firing Times. Proc. 7th European
     Workshop on Application and Theory of Petri Nets, Oxford,
     England, pp. 151-165, Jun. 1986.
[5] A. Arruabarrena, R. Beivide, C. Izu and J. Miguel. A performance
     evaluation of adaptive routing in bidimensional cut-through
     networks. Parallel Processing Letters, vol. 3, no. 4,
     pp. 469-484, 1993.
[6] A. Arruabarrena. Análisis y Evaluación de Sistemas de
     Interconexión para Procesadores Masivamente Paralelos. Doctoral
     thesis, UPV, December 1993.
[7] G. Bell. Scalable, Parallel Computers: Alternatives, Issues, and
     Challenges. International Journal of Parallel Programming,
     vol. 22, no. 1, pp. 3-46, 1994.
[8] S. Caselli, G. Conti, and U. Malavolta. Topology and Process
     Interaction in Concurrent Architectures: A GSPN Modeling
     Approach. Journal of Parallel and Distributed Computing, no. 15,
     pp. 270-281, 1992.
[9] A.A. Chien. A cost and speed model for k-ary n-cube wormhole
     routers. Proceeding of Hot Interconnect Workshop, Aug. 1993.
[10] G. Ciardo, J. Muppala and K. Trivedi. SPNP: Stochastic Petri Net
     Package. International Conference on Petri Nets and Performance
     Models, Kyoto, Japan, Dec. 1989.
[11] W.J. Dally and C.L. Seitz. The torus routing chip. Distributed
     Computing, vol. 1, no. 4, pp. 187-196, Oct. 1986.
[12] W.J. Dally and C.L. Seitz. Deadlock-free message routing in
     multiprocessor interconnection networks. IEEE Trans. on
     Computers, vol. C-36, no. 5, pp. 547-553, 1987.
[13] W.J. Dally. Performance analysis of k-ary n-cube interconnection
     networks. IEEE Trans. on Computers, vol. 39, no. 6, pp. 775-785,
     June 1990.
[14] W.J. Dally. Network and processor architectures for
     message-driven multicomputers. VLSI and Parallel Computation.
     Morgan Kaufmann Pub., pp. 140-222, 1990.
[15] W.J. Dally. Virtual-Channel Flow Control. IEEE Trans. on
     Parallel and Distributed Systems, vol. 3, no. 2, pp. 194-205,
     March 1992.
[16] S. Frank, H. Burkhardt III and J. Rothnie. The KSR-1: Bridging
     the gap between shared memory and MPPs. CompCon'93,
     pp. 285-294, 1993.
[17] M. Homewood and M. McLaren. Meiko CS-2 interconnect Elan - Elite
     design. Proc. of the IEEE Hot Interconnects Symposium. IEEE
     TCMM, August 1993.
[18] Intel Corporation. Paragon XP/S Product Overview, 1991.
[19] M.J. Karol, M.G. Hluchyj, and S.P. Morgan. Input vs. output
     queueing on a space-division packet switch. IEEE Trans. on
     Communications, vol. COM-35, no. 12, pp. 1347-1356, Dec. 1987.
[20] P. Kermani and L. Kleinrock. Virtual Cut-Through: A New Computer
     Communication Switching Technique. North-Holland Pb. Co.,
     Computer Networks 3, 1979.
[21] J. Kim and C.H. Das. Hypercube Communication Delay with Wormhole
     Routing. IEEE Trans. on Computers, vol. 43, no. 7, pp. 806-814,
     July 1994.
[22] L. Kleinrock. Communication Nets: Stochastic message flow and
     delay. McGraw Hill, 1964.
[23] C. Leiserson, Z.S. Abuhamdeh, D.C. Douglas, C.R. Feynman et al.
     The network architecture of the Connection Machine CM-5. 4th
     Ann. Symp. on Parallel Algorithms and Architectures SPAA'92,
     pp. 272-285, 1992.
[24] N.F. Maxemchuk and M. El Zarki. Routing and flow control in
     high-speed wide-area networks. Proc. of the IEEE, vol. 78,
     no. 1, pp. 204-221, 1990.
[25] T. Meng. Asynchronous Design for Digital Signal Processing
     Architectures. PhD Thesis, Berkeley, 1992.
[26] A. Nowatzyk. Generalized Stochastic Petri-Net Simulator.
     Carnegie-Mellon Univ., April 1989.
[27] M.J. Pertel. A Critique of Adaptive Routing. CalTech Technical
     Report CS-TR-92-06.
[28] M.J. Pertel. A simple simulator for multicomputer routing
     networks. CalTech-CS-TR-92-04, 1992.
[29] J.L. Peterson. Petri Net Theory and the Modeling of Systems.
     Prentice-Hall, Englewood Cliffs, N.J., 1981.
[30] D.A. Reed and R.M. Fujimoto. Multicomputer Networks:
     Message-based Parallel Processing. The MIT Press, 1987.
[31] C.L. Seitz. The Cosmic Cube. Comm. of the ACM, vol. 28, no. 1,
     pp. 22-33, 1985.
[32] C.L. Seitz and W. Su. A Family of routing and communication
     chips based on the Mosaic. Proc. of the University of Washington
     Symp. on Integrated Circuits, 1993.
[33] C.L. Seitz, N.J. Boden, J. Seizovic and W-K Su. The design of
     the Caltech Mosaic C multicomputer. Proc. of the 1993 Symp. on
     Research on Integrated Systems, The MIT Press, pp. 1-22, 1993.
[34] Y. Tamir and G.L. Frazier. Dynamically-Allocated Multi-Queue
     Buffers for VLSI Communication Switches. IEEE Trans. on
     Computers, vol. 41, no. 6, June 1992.
