
Multiprocessing Systems

KEY POINTS OF THE CHAPTER

1. Building real-time multiprocessing systems is hard because building real-time uniprocessing systems is already difficult enough.
2. Reliability in multiprocessing systems can be increased through redundancy and multiplicity. However, security, processing, and reliability costs are associated with the communication links between processors.
3. Describing the functional behavior and design of multiprocessing systems is difficult and requires nontraditional tools.
4. It is crucial to understand the underlying hardware architecture of the multiprocessing system being used.

In this chapter we look at issues related to real-time systems when more than one processor is used. We characterize real-time multiprocessing systems into two types: those that use several autonomous processors, and those that use a large number of interdependent, highly integrated microprocessors.

Although many of the problems encountered in real-time multiprocessing systems are the same as those in the single-processing world, these problems become more troublesome. For example, system specification is more difficult. Intertask communication and synchronization becomes interprocessor communication and synchronization. Integration and testing is more challenging, and reliability is more difficult to manage. Combine these complications with the fact that the individual processors themselves can be multitasking, and you can see the level of complexity.


In a single chapter we can only give a brief introduction to those issues that need to be addressed in the design of real-time multiprocessing systems.

12.1 CLASSIFICATION OF ARCHITECTURES
Computer architectures can be classified in terms of single or multiple instruction streams and single or multiple data streams, as shown in Table 12.1. By providing a taxonomy, it is easier to match a computer to an application and to remember the basic capabilities of a processor. In standard von Neumann architectures, the serial fetch and execute process, coupled with a single combined data/instruction store, forces serial instruction and data streams. This is also the case in RISC (reduced instruction set computer) architectures. Although many RISC architectures include pipelining, and hence become multiple instruction stream, pipelining is not a requisite characteristic of RISC.

TABLE 12.1 Classification of Computer Architectures

                              Single Data Stream              Multiple Data Stream
Single Instruction Stream     von Neumann architectures/      Systolic processors,
                              uniprocessors; RISC             wavefront processors
Multiple Instruction Stream   Pipelined architectures;        Dataflow processors,
                              very long instruction           transputers
                              word processors

In both systolic and wavefront processors, each processing element is executing the same (and only) instruction but on different data. Hence these architectures are SIMD. In pipelined architectures, effectively more than one instruction can be processed simultaneously (one for each level of the pipeline). However, since only one instruction can use data at any one time, it is MISD. Similarly, very long instruction word computers tend to be implemented with microinstructions that have very long bit-lengths (and hence more capability). Hence, rather than breaking down macroinstructions into numerous microinstructions, several (nonconflicting) macroinstructions can be combined into several microinstructions. For example, if object code was generated that called for a load of one register followed by an increment of another register, these two instructions could be executed simultaneously by the processor (or at least appear so at the macroinstruction level) with a series of long microinstructions. Since only nonconflicting instructions can be combined, any two instructions accessing the data bus conflict. Thus, only one instruction can access the data bus, and so the very long instruction word computer is MISD.


Finally, in dataflow processors and transputers (see the following discussion), each processing element is capable of executing numerous different instructions and on different data; hence these architectures are MIMD. Distributed architectures are also classified in this way.

12.2 DISTRIBUTED SYSTEMS
We characterize distributed real-time systems as a collection of interconnected, self-contained processors. We differentiate this type of system from the type discussed in the next section in that each of the processors in the distributed system can perform significant processing without the cooperation of the other processors.

Many of the techniques developed in the context of multitasking systems can be applied to multiprocessing systems. For example, by treating each of the processors in a distributed system as a task, the synchronization and communication techniques previously discussed can be used. But this is not always enough, because often each of the processors in a multiprocessing system is itself multitasking. In any case, this type of distributed-processing system represents the best solution to the real-time problem when such resources are available.

12.2.1 Embedded

Embedded distributed systems are those in which the individual processors are assigned fixed, specific tasks. This type of system is widely used in the areas of avionics, astronautics, and robotics.

■ EXAMPLE 12.1
In an avionics system for a military aircraft, separate processors are usually assigned for navigation, weapons control, and communications. While these systems certainly share information (see Figure 12.1), we can prevent failure of the overall system in the event of a single processor failure. To achieve this safeguard, we designate one of the three processors (or a fourth) to coordinate the activities of the others. If this computer is damaged, or shuts itself off due to a BITS fail, another can assume its role. ■

12.2.2 Organic
Another type of distributed processing system consists of a central scheduler processor and a collection of general processors with nonspecific functions (see Figure 12.2). These systems may be connected in a number of topologies (including ring, hypercube, array, and common bus) and may be used to solve general problems. In organic distributed systems, the challenge is to program the scheduler processor in such a way as to maximize the utilization of the serving processors, as the sketch below illustrates.
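As a toy illustration of this scheduling challenge (the function name dispatch and the greedy policy are our own sketch, not from the text), a central scheduler might assign each task to the currently least-loaded serving processor:

```python
import heapq

def dispatch(task_costs, n_processors):
    """Greedily assign each task to the least-loaded serving processor.

    Returns (processor, task_index) pairs. A real organic system would
    make these decisions online, as tasks arrive over the interconnect.
    """
    # Min-heap of (accumulated load, processor id)
    load = [(0.0, p) for p in range(n_processors)]
    heapq.heapify(load)
    assignments = []
    for i, cost in enumerate(task_costs):
        current, p = heapq.heappop(load)       # least-loaded processor
        assignments.append((p, i))
        heapq.heappush(load, (current + cost, p))
    return assignments

print(dispatch([3, 1, 4, 1, 5, 9], n_processors=3))
```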


Figure 12.1 A distributed computer system for a military aircraft.

Figure 12.2 Organic distributed computer in common bus configuration.

12.2.3 System Specification
The specification of software for distributed systems is challenging because, as we have seen, the specification of software for even a single-processing system is difficult.

One technique that we have discussed, statecharts, lends itself nicely to the specification of distributed systems because orthogonal processes can be assigned to individual processors. If each processor is multitasking, these orthogonal states can be further subdivided into orthogonal states representing the individual tasks for each processor.


■ EXAMPLE 12.2
Consider the specification of the avionics system for the military aircraft. We have discussed the function of the navigation computer throughout this text. The statechart for this function is given in Figure 5.18. The functions for the weapons control and communication systems are depicted in Figure 12.3 and Figure 12.4, respectively. In the interests of space, only this pictorial description of each subsystem will be given. ■

Figure 12.3 Weapons control system for a military aircraft.

Figure 12.4 Communications system for military aircraft.


A second technique that can be used is the dataflow diagram. Here the process symbols can represent processors, whereas the directed arcs represent communications paths between the processors. The sinks and sources can be either devices that produce and consume data or processes that produce or consume raw data.
■ EXAMPLE 12.3

12.2.4 Reliability in Distributed Systems
The characterization of reliability in a distributed system (real-time or otherwise) has been stated in a well-known paper [89], "The Byzantine Generals' Problem." The processors in a distributed system can be considered "generals," and the interconnections between them "messengers." The generals and messengers can be both loyal (operating properly) or traitors (faulty). The task is for the generals, who can only communicate via the messengers, to formulate a strategy for capturing a city (see Figure 12.5). The problem is to find an algorithm that allows the loyal generals to reach an agreement. It turns out that the problem is unsolvable for a totally asynchronous system, but solvable if the generals can vote in rounds [153]. This provision, however, imposes additional timing constraints on the system. Furthermore, the problem can be solved only if the number of traitors is less than one-third the total number of processors. We will be using the Byzantine generals' problem as an analogy for cooperative multiprocessing throughout this chapter.

12.2.5 Calculation of Reliability in Distributed Systems
Consider a group of n processors connected in any flat topology. It would be desirable, but costly, to have every processor connected to every other processor in such a way that data could be shared between processors. This, however, is not usually possible. In any case, we can use a matrix representation to denote the connections between the processors. The matrix, R, is constructed as follows: if processor i is connected to processor j, we place a "1" in the ith row, jth column of R. If they are not connected, a "0" is placed there. We consider every processor


Figure 12.5 The Byzantine generals' problem.

to be connected to itself, so the diagonal entries of R are 1. In the Byzantine generals' analogy, the processors represent generals and the interconnections represent messengers.

■ EXAMPLE 12.4

A topology in which each of n processors is connected to every other would have an n by n reliability matrix with all 1s; that is,

R = | 1  1  ...  1 |
    | 1  1  ...  1 |
    | .  .       . |
    | 1  1  ...  1 |
■


■ EXAMPLE 12.5

A topology in which none of the n processors is connected to any other (except itself) would have an n by n reliability matrix with all 1s on the diagonal but 0s elsewhere; that is,

R = | 1  0  ...  0 |
    | 0  1  ...  0 |
    | .  .       . |
    | 0  0  ...  1 |
■
■ EXAMPLE 12.6

As a more practical example, consider the four processors connected as in Figure 12.6. The reliability matrix for this topology would be

R = | 1  1  1  0 |
    | 1  1  0  1 |
    | 1  0  1  1 |
    | 0  1  1  1 |

Since processors 2 and 3 are disconnected, as are processors 1 and 4, 0s are placed in row 2 column 3, row 3 column 2, row 1 column 4, and row 4 column 1 in the reliability matrix. ■

Figure 12.6 Four-processor distributed system.

The ideal world has all processors and interconnections uniformly reliable, but this is not always the case. We can assign a number between 0 and 1 for each entry to represent its reliability. For example, an entry of 1 represents a perfect general or messenger. If an entry is less than 1, then it represents a traitorous general or messenger. (A very traitorous general or messenger gets a 0; a "small-time" traitor may get a "0.9" entry.) Disconnections still receive a 0.


■ EXAMPLE 12.7

Suppose the distributed system described in Figure 12.6 actually had interconnections with the reliabilities marked in Figure 12.7. The new reliability matrix would be

R = | 1   .4   .7   0  |
    | .4  1    0    1  |
    | .7  0    1    .9 |
    | 0   1    .9   1  |
■

Figure 12.7 Four-processor distributed system with reliabilities.

Notice that if we assume that the communications links have reciprocal reliability (the reliability is the same regardless of which direction the message is traveling in), then the matrix is symmetric with respect to the diagonal. This, along with the assumption that the diagonal elements are always 1 (not necessarily true), can greatly simplify calculations.

12.2.6 Increasing Reliability in Distributed Systems
In Figure 12.7, the fact that processors 1 and 4 do not have direct communications links does not mean that the two processors cannot communicate. Processor 1 can send a message to processor 4 via processor 2 or 3. It turns out that the overall reliability of the system may be increased by using this technique.

Without formalization, the overall reliability of the system can be calculated by performing a series of special matrix multiplications. If R and S are reliability matrices for a system of n processors each, then we define the composition of these matrices, denoted R ⊗ S, to be
(R ⊗ S)(i, j) = ⋁_{k=1}^{n} R(i, k)S(k, j)        (12.1)

where (R ⊗ S)(i, j) is the entry in the ith row and jth column of the resultant matrix, and ⋁ represents taking the maximum of the reliabilities. If R = S, then we denote R ⊗ R = R², called the second-order reliability matrix.
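To make Equation 12.1 concrete, here is a minimal Python sketch (the function name compose and the use of NumPy are our own choices, not from the text). It computes each entry of R ⊗ S as the best single-relay reliability, and reproduces the second-order matrix of Example 12.8 from the matrix of Example 12.7:

```python
import numpy as np

def compose(R, S):
    """Reliability composition of Eq. 12.1:
    (R (x) S)(i, j) = max over k of R(i, k) * S(k, j).
    Like matrix multiplication, but with the sum replaced by a maximum."""
    n = R.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = max(R[i, k] * S[k, j] for k in range(n))
    return out

# Reliability matrix of Figure 12.7 (Example 12.7)
R = np.array([[1.0, 0.4, 0.7, 0.0],
              [0.4, 1.0, 0.0, 1.0],
              [0.7, 0.0, 1.0, 0.9],
              [0.0, 1.0, 0.9, 1.0]])

R2 = compose(R, R)   # second-order reliability matrix (Example 12.8)
print(R2[0, 3])      # 0.63: the indirect route 1 -> 3 -> 4
```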


■ EXAMPLE 12.8

Consider the system in Figure 12.7. Computing R² for this yields

R² = | 1    .4   .7   .63 |
     | .4   1    .9   1   |
     | .7   .9   1    .9  |
     | .63  1    .9   1   |
■

12.2.6.1 Higher-Order Reliability Matrices. Higher-order reliabilities can be found using the same technique as for the second order. Recursively, we can define the nth-order reliability matrix as
R"=R"-lOR

(r2.2)

■ EXAMPLE 12.9

The utility of the higher-order reliability can be seen in Figure 12.8, where processors 1 and 4 are two connections apart. Here, the reliability matrix is

distriburcd Figure 12.8 Four-processor with reliabilities. system

R= ; \ t 0
\ o
The second-orderreliability is

l t ' s

0 0 . 4 0 1 . 3 .3 I

. 2 0
A 1n

1 . 3 .3 I
Calculating the third-order reliability matrix gives

.2 .06 .4 .12
t J

.3

I


The higher-order reliability matrix allows us to draw an equivalent topology for the distributed system. One obvious conclusion that can be drawn from looking at the higher-order reliability matrices, and one that is intuitively pleasing, is that we can increase the reliability of message passing in distributed systems by providing redundant second-, third-, and so on, order paths between processors.
■ EXAMPLE 12.10

For the previous example, the equivalent third-order topology is given in Figure 12.9. ■

Figure 12.9 Equivalent third-order topology for Example 12.9.

Finally, it can be shown that the maximum reliability matrix for n processors is given by
R_max = ⋁_{i=1}^{n} R^i        (12.3)

For example, in the previous example, R_max = R ∨ R² ∨ R³. To what n do we need to compute to obtain x percent of the theoretical maximum reliability? Is this dependent on the topology? Is this dependent on the reliabilities? In addition, the reliability matrix might not be fixed; that is, it might be some function of time t. Finally, the fact that transmissions over higher-order paths increase signal transit time introduces a penalty that must be balanced against the benefit of increased reliability. There are a number of open problems in this area that are beyond the scope of this text.
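Equations 12.2 and 12.3 translate directly into code. The sketch below (our own helpers, building on the compose() function defined earlier) computes the nth-order matrix by repeated composition, and R_max as the elementwise maximum of R¹ through Rⁿ:

```python
def higher_order(R, n):
    """nth-order reliability matrix, R^n = R^(n-1) (x) R (Eq. 12.2)."""
    result = R
    for _ in range(n - 1):
        result = compose(result, R)
    return result

def max_reliability(R):
    """Maximum reliability matrix: elementwise join of R^1 ... R^n (Eq. 12.3)."""
    n = R.shape[0]
    return np.maximum.reduce([higher_order(R, i) for i in range(1, n + 1)])
```

For the chain of Example 12.9, max_reliability(R) coincides with R³, reflecting the fact that the longest useful relay path among those four processors has three links.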

12.3 NON-VON NEUMANN ARCHITECTURES
The processing of discrete signals in real time is of paramount importance to virtually every type of system. Yet the computations needed to detect, extract, mix, or otherwise process signals are computationally intensive. For example, the convolution sum discussed in Chapter 5 is widely used in signal processing.

Because of these computationally intensive operations, real-time designers must look to hardware to improve response times. In response, hardware


designers have provided several non-von Neumann, multiprocessing architectures which, though not general purpose, can be used to solve a wide class of problems in real time. (Recall that von Neumann architectures are stored program, fetch-execute cycle machines.) These multiprocessors typically feature large quantities of simple processors implemented in VLSI.

Increasingly, real-time systems are distributed processing systems consisting of one or more general processors and one or more of these other style processors. The general, von Neumann-style processors provide control and input/output, whereas the specialized processors are used as engines for fast execution of complex and specialized computations. In the next sections, we discuss several of these non-von Neumann architectures and illustrate their applications.

12.3.1 Dataflow Architectures
Dataflow architectures use a large number of special processors in a topology in which each of the processors is connected to every other.

In a dataflow architecture, each of the processors has its own local memory and a counter. Special tokens are passed between the processors asynchronously. These tokens, called activity packets, contain an opcode, an operand count, operands, and a list of destination addresses for the result of the computation. An example of a generic activity packet is given in Figure 12.10. Each processor's local memory is used to hold a list of activity packets for that processor, the operands needed for the current activity packet, and a counter used to keep track of the number of operands received. When the number of operands stored in local memory is equivalent to that required for the operation in the current activity packet, the operation is performed and the results are sent to the specified destinations. Once an activity packet has been executed, the processor begins working on the next activity packet in its execution list.
Opcode | n (number of arguments)
Argument 1
Argument 2
...
Argument n
Destination 1
Destination 2
...
Destination m

Figure 12.10 Generic activity template for dataflow machine.
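As an illustration only, an activity packet and its firing rule might be modeled as below. The field names follow Figure 12.10, but the class itself and its two-opcode repertoire are our own sketch, not a real dataflow machine:

```python
from dataclasses import dataclass, field
from math import prod

@dataclass
class ActivityPacket:
    opcode: str                 # e.g., "ADD" or "MULT"
    n_args: int                 # operand count required before firing
    operands: list = field(default_factory=list)
    destinations: list = field(default_factory=list)  # downstream packets

    def ready(self):
        # The firing rule: all required operands have arrived.
        return len(self.operands) >= self.n_args

    def fire(self):
        result = {"ADD": sum, "MULT": prod}[self.opcode](self.operands)
        for dest in self.destinations:
            dest.operands.append(result)  # results flow on asynchronously
        return result

# A multiply feeding an add, as in the convolution network of Example 12.11
add = ActivityPacket("ADD", 2)
mult = ActivityPacket("MULT", 2, operands=[2.0, 3.0], destinations=[add])
if mult.ready():
    mult.fire()   # the ADD packet now holds one of its two operands
```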


■ EXAMPLE 12.11

We can use the dataflow architecture to perform the discrete convolution of two signals as described in the exercises for Chapter 5. That is, the discrete convolution of two real-valued functions f(t) and g(t) is

(f ∗ g)(t) = Σᵢ f(i)g(t − i),    t = 0, 1, 2, 3, 4

The processor topology and activity packet list is described in Figure 12.11. ■
Figure 12.11 Discrete convolution in a dataflow architecture.
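For reference, the sum that the network of Figure 12.11 evaluates in parallel can be computed sequentially in a few lines (a plain sketch, useful for checking a parallel implementation's outputs):

```python
def convolve(f, g):
    """Discrete convolution (f * g)(t) = sum over i of f(i) * g(t - i)."""
    return [sum(f[i] * g[t - i] for i in range(t + 1)) for t in range(len(f))]

# Two 5-sample signals, t = 0..4
print(convolve([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))   # [5, 14, 26, 40, 55]
```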

Dataflow architectures are an excellent parallel solution for signal processing. The only drawback for dataflow architectures is that currently they cannot be implemented in VLSI. Performance studies for dataflow real-time systems can be found in [148].

12.3.1.1 System Specification for Dataflow Processors. Dataflow architectures are ideal because they are direct implementations of dataflow graphs. In fact, programmers draw dataflow diagrams as part of the programming process. The graphs are then translated into a list of activity packets for each processor.


Figure 12.12 Specification of discrete convolution using dataflow diagrams.

An example of such a specification is given in Figure 12.12. As we have seen in the example, dataflow diagrams are well-adapted to parallel signal processing [52], [53].

12.3.2 Systolic Processors
Systolic processors consist of a large number of uniform processors connected in an array topology. Each processor usually performs only one specialized operation and has only enough local memory to perform its designated operation and to store the inputs and outputs. The individual processors, called processing elements, take inputs from the top and left, perform a specified operation, and output the results to the right and bottom. One such processing element is depicted in Figure 12.13. The processors are connected to the four nearest neighboring processors in the nearest neighbor topology depicted in Figure 12.14. Processing, or firing, at each of the cells occurs simultaneously in synchronization with a central clock. The fact that each cell fires on this heartbeat lends the name systolic. Inputs to the system are from memory stores or input devices at the boundary cells

z = c·y + x
w = y

Figure 12.13 Systolic processor element.
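A minimal model of this processing element (our own sketch; we read the figure's second output as the pass-through w = y): on each clock tick the cell consumes x from the left and y from the top, emits z = c·y + x to the right, and forwards y to the bottom:

```python
class SystolicPE:
    """Processing element of Figure 12.13, fired on every global clock tick."""

    def __init__(self, c):
        self.c = c              # fixed coefficient held by the cell

    def fire(self, x, y):
        z = self.c * y + x      # partial sum moves right
        w = y                   # input sample passes through downward
        return z, w
```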


Figure 12.14 Systolic array in nearest neighbor topology.

at the left and top. Outputs to memory or output devices are obtained from boundary cells at the right and bottom.

■ EXAMPLE 12.12

Once again consider the discrete convolution of two real-valued functions f(t) and g(t), t = 0, 1, 2, 3, 4. A systolic array such as the one in Figure 12.15 can be constructed to perform the convolution. A general algorithm can be found in [52]. ■

Systolic processors are fast and can be implemented in VLSI. They are somewhat troublesome, however, in dealing with propagation delays in the connection buses and in the availability of inputs when the clock ticks.

Figure 12.15 Systolic array for convolution.


Figure 12.16 Petri net specification of the convolution operation.


12.3.2.1 Specification of Systolic Systems. The similarity of the jargon associated with systolic processors leads us to believe that Petri nets can be used to specify such systems. This is indeed true, and an example of specifying the convolution operation is given in Figure 12.16.

12.3.3 Wavefront Processors
Wavefront processors consist of an array of identical processors, each with its own local memory and connected in a nearest neighbor topology. Each processor usually performs only one specialized operation. Hybrids containing two or more different type cells are possible. The cells fire asynchronously when all required inputs from the left and top are present. Outputs then appear to the right and below. Unlike the systolic processor, the outputs are the unaltered inputs. That is, the top input is transmitted, unaltered, to the bottom output bus, and the left input is transmitted, unaltered, to the right output bus. Also different from the systolic processor, outputs from the wavefront processor are read directly from the local memory of selected cells and not obtained from boundary cells. Inputs are still placed on the top and left input buses of boundary cells. The fact that inputs propagate through the array unaltered, like a wave, gives this architecture its name. Figure 12.17 depicts a typical wavefront

Figure 12.17 Wavefront processor element.

processing element. Wavefront processors are very good for computationally intensive real-time systems and are used widely in modern real-time signal processing [51], [52]. In addition, a wavefront architecture can cope with timing uncertainties such as local blocking, random delay in communications, and fluctuations in computing times [86].
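By analogy with the systolic sketch above (again our own illustrative model, not from the text), a wavefront element fires only when both inputs are present, accumulates its result in local memory, and passes both inputs through unaltered:

```python
class WavefrontPE:
    """Fires asynchronously once both inputs have arrived; inputs pass through."""

    def __init__(self):
        self.acc = 0.0               # result accumulates in local memory
        self.x = self.y = None

    def offer(self, x=None, y=None):
        if x is not None:
            self.x = x
        if y is not None:
            self.y = y
        if self.x is None or self.y is None:
            return None              # not ready: wait for the missing input
        self.acc += self.x * self.y  # e.g., one convolution product term
        out = (self.x, self.y)       # unaltered inputs go right and down
        self.x = self.y = None
        return out
```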


■ EXAMPLE 12.13

Once again consider the discrete convolution of two real-valued functions f(t) and g(t), t = 0, 1, 2, 3, 4. A wavefront array such as the one in Figure 12.18 can be constructed to perform the convolution. After five firings, the convolution products will be found in the innermost PEs. ■

Figure 12.18 Discrete convolution using a wavefront array.

Wavefront processors combine the best of systolic architectures with dataflow architectures. That is, they support an asynchronous dataflow computing structure; timing in the interconnection buses and at input and output devices is not a problem. Furthermore, the structure can be implemented in VLSI.

12.3.3.1 System Specification for Wavefront Processors. As is true of the dataflow architecture, dataflow diagrams can be used to specify these systems. For example, the convolution system depicted in the previous example can be specified using Figure 12.12. Finally, Petri nets and finite state automata, or a variation, cellular automata, may have potential use for specifying wavefront systems.

12.3.4 Transputers
Transputers are fully self-sufficient, multiple instruction set, von Neumann processors. The instruction set includes directives to send data or receive data via ports that are connected to other transputers. The transputers, though capable of acting as uniprocessors, are best utilized when connected in a nearest neighbor configuration. In a sense, the transputer provides a wavefront or systolic processing capability but without the restriction of a single instruction. Indeed, by providing each transputer in a network with an appropriate stream of instructions and synchronization signals, wavefront or systolic computers (which can change configurations) can be implemented.

Transputers have been widely used in embedded real-time applications, and commercial implementations are readily available. Moreover, tool support, such as the multitasking language occam-2, has made it easier to build transputer-based applications.



12.4 EXERCISES
1. For the following reliability matrix, draw the associated distributed system graph and compute R².

   R = | 1  1  1 |
       | 1  1  0 |
       | 1  0  1 |

2. For the following reliability matrix, draw the associated distributed system graph and compute R².

   R = | 1    0.2  0.7 |
       | 0.2  1    0   |
       | 0.7  0    1   |
3. For the following reliability matrix, compute R², R³, and R_max. (Hint: R_max ≠ R³.)

   R = | 0    1    0.6 |
       | 1    0    0.8 |
       | 0.6  0.8  0   |

4. Show that the ⊗ operation is not commutative. That is, if R and S are 3 × 3 reliability matrices, then in general,

   R ⊗ S ≠ S ⊗ R

   In fact, you should be able to show that for any n × n reliability matrices,

   R ⊗ S = (S ⊗ R)ᵀ

   where ᵀ represents the matrix transpose.
5. Design a dataflow architecture for performing the matrix multiplication of two 5 by 5 arrays. Assume that binary ADD and MULT are part of the instruction set.
6. Design a dataflow architecture for performing the matrix addition of two 5 by 5 arrays. Assume that binary ADD is part of the instruction set.
7. Use dataflow diagrams to describe the systems in Exercises 5 and 6.
8. Design a systolic array for performing the matrix multiplication of two 5 by 5 arrays. Use the processing element described in Figure 12.13.
9. Design a systolic array for performing the matrix addition of two 5 by 5 arrays. Use the processing element described in Figure 12.13.
10. Use Petri nets and the processing element described in Figure 12.13 to describe the systolic array to perform the functions described in
    (a) Exercise 8
    (b) Exercise 9
11. Design a wavefront array for performing the matrix multiplication of two 5 by 5 arrays. Use the processing element described in Figure 12.17.
12. Design a wavefront array for performing the matrix addition of two 5 by 5 arrays. Use the processing element described in Figure 12.17.
13. Use dataflow diagrams to describe the systems in
    (a) Exercise 11
    (b) Exercise 12
14. Use Petri nets to specify the wavefront array system shown in Figure 12.18.