

4.0 Objectives
4.1 Introduction
4.2 Parallel Computer Models
    4.2.1 Flynn's Classification
    4.2.2 Parallel and Vector Computers
    4.2.3 System Attributes to Performance
    4.2.4 Multiprocessors and Multicomputers
          Shared-Memory Multiprocessors
          Distributed-Memory Multicomputers
    4.2.5 Multivector and SIMD Computers
          4.2.5.1 Vector Supercomputers
          4.2.5.2 SIMD Computers
    4.2.6 PRAM and VLSI Models
          4.2.6.1 Parallel Random-Access Machines
          4.2.6.2 VLSI Complexity Model

Check Your Progress
4.3 Summary
4.4 Glossary
4.5 References
4.6 Answers to Check Your Progress Questions

4.0 Objectives

After going through this unit, you will be able to:
- describe Flynn's classification
- describe parallel and vector computers
- explain system attributes to performance
- distinguish between implicit and explicit parallelism
- explain multiprocessors and multicomputers
- explain multivector and SIMD computers
- describe the PRAM and VLSI models

4.1 Introduction

Parallel processing has emerged as a key enabling technology in modern computers, driven by the ever-increasing demand for higher performance, lower costs, and sustained productivity in real-life applications. Concurrent events take place in today's high-performance computers due to the common practice of multiprogramming, multiprocessing, or multicomputing.


Parallelism appears in various forms, such as lookahead, pipelining, vectorization, concurrency, simultaneity, data parallelism, partitioning, interleaving, overlapping, multiplicity, replication, time sharing, space sharing, multitasking, multiprogramming, multithreading, and distributed computing at different processing levels.

4.2.1 Flynn's Classification

Michael Flynn introduced a classification of computer architectures based on notions of instruction and data streams. As illustrated in Figure 4.1, conventional sequential machines are called SISD computers. Vector computers are equipped with scalar and vector hardware, or appear as SIMD machines. Parallel computers are reserved for MIMD machines. An MISD machine is modeled in Figure 4.1d: the same data stream flows through a linear array of processors executing different instruction streams. This architecture is also known as a systolic array, used for pipelined execution of specific algorithms.

(a) SISD uniprocessor architecture

(b) SIMD architecture (with distributed memory)

(c) MIMD architecture (with shared memory)

(d) MISD architecture (systolic array)

Figure 4.1 Flynn's classification of computer architectures

4.2.2 Parallel and Vector Computers

Intrinsic parallel computers are those that execute programs in MIMD mode. There are two major classes of parallel computers, namely, shared-memory multiprocessors and message-passing multicomputers. The major distinction between multiprocessors and multicomputers lies in memory sharing and the mechanisms used for interprocessor communication. The processors in a multiprocessor system communicate with each other through shared variables in a common memory. Each computer node in a multicomputer system has a local memory, unshared with other nodes. Interprocessor communication is done through message passing among the nodes. Explicit vector instructions were introduced with the appearance of vector processors. A vector processor is equipped with multiple vector pipelines that can be concurrently used under hardware or firmware control. There are two families of pipelined vector processors. A memory-to-memory architecture supports the pipelined flow of vector operands directly from the memory to the pipelines and then back to the memory. A register-to-register architecture uses vector registers to interface between the memory and the functional pipelines.

4.2.3 System Attributes to Performance

The ideal performance of a computer system demands a perfect match between machine capability and program behavior. Machine capability can be enhanced with better hardware technology, architectural features, and efficient resource management. However, program behavior is difficult to predict due to its heavy dependence on application and run-time conditions. There are also many other factors affecting program behavior, including algorithm design, data structures, language efficiency, programmer skill, and compiler technology. It is impossible to achieve a perfect match between hardware and software by merely improving a few factors without touching the others. Besides, machine performance may vary from program to program. This makes peak performance an impossible target to achieve in real-life applications. On the other hand, a machine cannot be said to have a single average performance either. All performance indices or benchmarking results must be tied to a program mix. For this reason, performance should be described as a range or as a harmonic distribution.

Clock Rate and CPI: The CPU (or simply the processor) of today's digital computer is driven by a clock with a constant cycle time (τ, in nanoseconds). The inverse of the cycle time is the clock rate

(f = 1/τ, in megahertz). The size of a program is determined by its instruction count (Ic), the number of machine instructions to be executed in the program. Different machine instructions may require different numbers of clock cycles to execute. Therefore, the cycles per instruction (CPI) becomes an important parameter for measuring the time needed to execute each instruction. For a given instruction set, we can calculate an average CPI over all instruction types, provided we know their frequencies of appearance in the program. An accurate estimate of the average CPI requires a large amount of program code to be traced over a long period of time. Unless specifically focusing on a single instruction type, we simply use the term CPI to mean the average value with respect to a given instruction set and a given program mix.

Performance Factors: Let Ic be the number of instructions in a given program, or the instruction count. The CPU time (T, in seconds/program) needed to execute the program is estimated by finding the product of three contributing factors:

    T = Ic × CPI × τ                (4.1)

The execution of an instruction requires going through a cycle of events involving the instruction fetch, decode, operand(s) fetch, execution, and store of results. In this cycle, only the instruction decode and execution phases are carried out in the CPU. The remaining three operations may require access to the memory. We define a memory cycle as k times the processor cycle τ. The value of k depends on the speed of the memory technology and the processor-memory interconnection scheme used. The CPI of an instruction type can be divided into two component terms corresponding to the total processor cycles and memory cycles needed to complete the execution of the instruction. Depending on the instruction type, the complete instruction cycle may involve one to four memory references (one for instruction fetch, two for operand fetch, and one for storing results). Therefore we can rewrite Eq. 4.1 as follows:

    T = Ic × (p + m × k) × τ        (4.2)

where p is the number of processor cycles needed for the instruction decode and execution, m is the number of memory references needed, k is the ratio between the memory cycle and the processor cycle, Ic is the instruction count, and τ is the processor cycle time. Equation 4.2 can be further refined once the CPI components (p, m, k) are weighted over the entire instruction set.
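As a quick numerical companion to Eqs. 4.1 and 4.2, the following Python sketch computes the CPU time both ways. All the workload numbers (2 million instructions, a 40 MHz clock, and the p, m, k averages) are made-up values for illustration only.

```python
def cpu_time(ic, cpi, tau):
    """Eq. 4.1: T = Ic * CPI * tau (tau in seconds)."""
    return ic * cpi * tau

def cpu_time_refined(ic, p, m, k, tau):
    """Eq. 4.2: T = Ic * (p + m*k) * tau, splitting the CPI into
    p processor cycles plus m memory references of k cycles each."""
    return ic * (p + m * k) * tau

# Hypothetical workload: 2 million instructions on a 40 MHz clock (tau = 25 ns)
ic, tau = 2_000_000, 25e-9
p, m, k = 2, 1.5, 4           # assumed averages over the instruction mix
cpi = p + m * k               # effective CPI = 8.0
print(cpu_time(ic, cpi, tau))              # about 0.4 s
print(cpu_time_refined(ic, p, m, k, tau))  # same value by construction
```

Note that the two forms agree exactly whenever CPI is taken to be p + m × k, which is the refinement Eq. 4.2 expresses.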

System Attributes: The above five performance factors (Ic, p, m, k, τ) are influenced by four system attributes: instruction-set architecture, compiler technology, CPU implementation and control, and cache and memory hierarchy. The instruction-set architecture affects the program length (Ic) and the processor cycles needed (p). The compiler technology affects Ic, p, and the memory reference count (m). The CPU implementation and control determine the total processor time (p·τ) needed. Finally, the memory technology and hierarchy design affect the memory access latency (k·τ). The above CPU time can be used as a basis in estimating the execution rate of a processor.

MIPS Rate: Let C be the total number of clock cycles needed to execute a given program. Then the CPU time can be estimated as T = C × τ = C/f. Furthermore, CPI = C/Ic and T = Ic × CPI × τ = Ic × CPI/f. The processor speed is often measured in terms of million instructions per second (MIPS). We simply call it the MIPS rate of a given processor. It should be emphasized that the MIPS rate varies with respect to a number of factors, including the clock rate (f), the instruction count (Ic), and the CPI of a given machine, as defined below:

    MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6)    (4.3)

Based on Eq. 4.3, the CPU time in Eq. 4.2 can be written as T = Ic × 10^-6 / MIPS. Based on the above derived expressions, we conclude that the MIPS rate of a given computer is directly proportional to the clock rate and inversely proportional to the CPI. All four system attributes (instruction set, compiler, processor, and memory technologies) affect the MIPS rate, which also varies from program to program.

Throughput Rate: Another important concept is related to how many programs a system can execute per unit time, called the system throughput Ws (in programs/second). In a multiprogrammed system, the system throughput is often lower than the CPU throughput Wp defined by:

    Wp = f / (Ic × CPI)             (4.4)

The CPU throughput is a measure of how many programs can be executed per second. The fact that Ws < Wp is due to the additional system overheads caused by the I/O, compiler, and OS when multiple programs are interleaved for CPU execution by multiprogramming or time-sharing operations. If the CPU is kept busy in a perfect program-interleaving fashion, then Ws = Wp. This will probably never happen, since the system overhead often causes an extra delay and the CPU may be left idle for some cycles.
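Continuing the numerical sketch, Eqs. 4.3 and 4.4 relate the clock rate, CPI, and instruction count to the MIPS rate and the CPU throughput Wp. The figures below (a 40 MHz clock, a 2-million-instruction program, CPI of 8) are hypothetical:

```python
def mips_rate(f_hz, cpi):
    """Eq. 4.3: MIPS = f / (CPI * 10^6), with f in Hz."""
    return f_hz / (cpi * 1e6)

def cpu_throughput(f_hz, ic, cpi):
    """Eq. 4.4: Wp = f / (Ic * CPI), in programs/second."""
    return f_hz / (ic * cpi)

f, ic, cpi = 40e6, 2_000_000, 8.0
print(mips_rate(f, cpi))           # 5.0 MIPS
print(cpu_throughput(f, ic, cpi))  # 2.5 programs/s; the system throughput Ws <= this
```

The second result is the ideal per-program rate; as the text notes, the measured system throughput Ws falls below Wp once OS and I/O overheads are counted.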

P+&2+!11"n2 En5"+&n1,n% : 6%e pro/rammability o a computer depends on t%e pro/rammin/ environment provided to t%e users. Most computer environments are not user& riendly. In act3 t%e mar-etability o any ne1 computer system depends on t%e creation o a user& riendly environment in 1%ic% pro/rammin/ becomes a joy ul underta-in/ rat%er t%an a nuisance. @e brie ly introduce belo1 t%e environmental eatures desired in modern computers. Conventional uniprocessor computers are pro/rammed in a se5uential environment in 1%ic% instructions are e,ecuted one a ter anot%er in a se5uential manner. In act3 t%e ori/inal :CID>O# -ernel 1as desi/ned to respond to one system call rom t%e user process at a time. #uccessive system calls must be seriali4ed t%rou/% t%e -ernel. Most e,istin/ compilers are desi/ned to /enerate se5uential object codes to run on a se5uential computer. In ot%er 1ords3 conventional computers are bein/ used in a se5uential pro/rammin/ environment usin/ lan/ua/es3 compilers3 and operatin/ systems all developed or a uniprocessor computer3 desires a parallel environment 1%ere parallelism is automatically e,ploited. +an/ua/e e,tensions or ne1 constructs must be developed to speci y parallelism or to acilitate easy detection o parallelism at various /ranularity levels by more intelli/ent compilers. I1*l"$"% P!+!ll,l" 1 $n implicit approac% uses a conventional lan/ua/e3 suc% as C3 FO*6*$C3 +isp3 or Pascal3 to 1rite t%e source pro/ram. 6%e se5uentially coded source pro/ram is translated into parallel object code by a paralleli4in/ compiler. $s illustrated in Fi/ure 4.23 t%is compiler must be able to detect parallelism and assi/n tar/et mac%ine resources. 6%is compiler approac% %as been applied in pro/rammin/ s%ared&memory multiprocessors. @it% parallelism bein/ implicit3 success relies %eavily on t%e Eintelli/enceF o a paralleli4in/ compiler. 6%is approac% re5uires less e ort on t%e part o t%e pro/rammer.

'!( I1*l"$"% *!+!ll,l" 1

'.( E6*l"$"% *!+!ll,l" 1 F"2)+, 4.2

E6*l"$"% P!+!ll,l" 1: 6%e second approac% re5uires more e ort by t%e pro/rammer to develop a source pro/ram usin/ parallel dialects o C3 FO*6*$C3 +isp3 or Pascal. Parallelism is e,plicitly speci ied in t%e user pro/rams. 6%is 1ill si/ni icantly reduce t%e burden on t%e compiler to detect parallelism. Instead3 t%e compiler needs to preserve parallelism and3 1%ere possible3 assi/ns tar/et mac%ine resources. C%arles #eit4 o Cali ornia Institute o 6ec%nolo/y and @illiam 'ally o Massac%usetts Institute o 6ec%nolo/y adopted t%is e,plicit approac% in multicomputer development. #pecial so t1are tools are needed to ma-e an environment riendlier to user /roups. #ome o t%e tools are parallel e,tensions o conventional %i/%&level lan/ua/es. Ot%ers are inte/rated environments 1%ic% include tools providin/ di erent levels o pro/ram abstraction3 validation3 testin/3 debu//in/3 and tunin/G per ormance prediction and monitorin/G and visuali4ation support to aid pro/ram development3 per ormance measurement3 and /rap%ics display and animation o computer results.

4.2.4 M)l%"*+&$, &+ !n0 M)l%"$&1*)%,+

Two categories of parallel computers are architecturally modeled below. These physical models are distinguished by having a shared common memory or unshared distributed memories.

Shared-Memory Multiprocessors: We describe below three shared-memory multiprocessor models: the uniform memory-access (UMA) model, the nonuniform memory-access (NUMA) model, and the cache-only memory architecture (COMA) model. These models differ in how the memory and peripheral resources are shared or distributed.

The UMA Model: In a UMA multiprocessor model (Figure 4.3), the physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access. Each processor may use a private cache. Peripherals are also shared in some fashion. Multiprocessors are called tightly coupled systems due to the high degree of resource sharing. The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network. Most computer manufacturers have multiprocessor extensions of their uniprocessor product lines. The UMA model is suitable for general-purpose and time-sharing applications by multiple users. It can be used to speed up the execution of a single large program in time-critical applications. To coordinate parallel events, synchronization and communication among processors are done through shared variables in the common memory. When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor. In this case, all the processors are equally capable of running the executive programs, such as the OS kernel and I/O service routines. In an asymmetric multiprocessor, only one or a subset of the processors are executive-capable. An executive or master processor can execute the operating system and handle I/O. The remaining processors have no I/O capability and thus are called attached processors (APs). Attached processors execute user code under the supervision of the master processor. In both multiprocessor and attached-processor configurations, memory sharing among master and attached processors is still in place.

F"2)+, 4.3 T-, UMA 1)l%"*+&$, &+ 1&0,l A**+&6"1!%,0 *,+#&+1!n$, &# ! 1)l%"*+&$, &+ 6%is e,ample e,poses t%e reader to parallel pro/ram e,ecution on a s%ared memory multiprocessor system. Consider t%e ollo1in/ Fortran pro/ram 1ritten or se5uential e,ecution on a uniprocessor system. $ll t%e arrays 3 $;I<3 9;I<3 and C;I<3 are assumed to %ave $ elements. +1? +2? +"? +4? +(? +)? +I? D& 10 I=13 C $;I< = 9;I< A C;I<. 10 C&n%"n), #:M = 0. D& 20 H = 13 C #:M = #:M A $;H<. 20 C&n%"n),

Suppose each line of code L2, L4, and L6 takes 1 machine cycle to execute. The time required to execute the program control statements L1, L3, L5, and L7 is ignored to simplify the analysis. Assume that k cycles are needed for each interprocessor communication operation via the shared memory. Initially, all arrays are assumed already loaded in the main memory and the short program fragment already loaded in the instruction cache. In other words, instruction fetch and data loading overhead is ignored. Also, we ignore bus contention and memory access conflict problems. In this way, we can concentrate on the analysis of CPU demand.
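As a concrete companion to this analysis, the two loops can be rendered in Python alongside the M-way partitioned variant examined next in the text. The sizes (N = 32, M = 4) and array contents below are arbitrary placeholders:

```python
N, M = 32, 4                 # hypothetical sizes; L = N/M elements per section
L = N // M
B = list(range(N))           # placeholder data standing in for B(I) and C(I)
C = [1] * N

# Sequential version: the I loop, then the J loop (2N "cycles" in the text's model)
A = [B[i] + C[i] for i in range(N)]
total = 0
for j in range(N):
    total += A[j]

# M-way partitioned version: each section k computes a partial sum over its
# own L elements (0-based slices here, versus 1-based Fortran indexing),
# and the M partial sums are then merged.
partial = []
for k in range(M):
    lo, hi = k * L, (k + 1) * L
    section = [B[i] + C[i] for i in range(lo, hi)]
    partial.append(sum(section))
assert sum(partial) == total
print(total)                 # 528 for these placeholder arrays
```

The final merge of the M partial sums is exactly the extra step whose cost the cycle analysis below accounts for beyond the 2L cycles of sectioned loop work.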

The above program can be executed on a sequential machine in 2N cycles under the above assumptions. N cycles are needed to execute the N independent iterations in the I loop. Similarly, N cycles are needed for the J loop, which contains N recursive iterations. To execute the program on an M-processor system, we partition the looping operations into M sections with L = N/M elements per section. In the following parallel code, Doall declares that all M sections be executed by M processors in parallel.

Doall K = 1, M
      DO 10 I = L*(K - 1) + 1, K*L
         A(I) = B(I) + C(I)
10    CONTINUE
      SUM(K) = 0
      DO 20 J = 1, L
         SUM(K) = SUM(K) + A(L*(K - 1) + J)
20    CONTINUE
Endall

For M-way parallel execution, the sectioned I loop can be done in L cycles. The sectioned J loop produces M partial sums in L cycles. Thus 2L cycles are consumed to produce all M partial sums. Still, we need to merge these M partial sums to produce the final sum of N elements.

The NUMA Model: A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word. Two NUMA machine models are depicted in Figure 4.4.

(a) Shared local memories

'.( A -",+!+$-"$!l $l) %,+ 1&0,l F"2)+, 4.4 6%e s%ared memory is p%ysically distributed to all processors3 called local memories. 6%e collection o all local memories orms a /lobal address space accessible by all processors. It is aster to access a local memory 1it% a local processor. 6%e access o remote memory attac%ed to ot%er processors ta-es lon/er due to t%e added delay t%rou/% t%e interconnection net1or-. 6%e 99C 6C&2000 9utter ly multiprocessor assumes t%e con i/uration. 9esides distributed memories3 /lobally s%ared memory can be added to a multiprocessor system. In t%is case3 t%ere are t%ree memory&access patterns? 6%e astest is local memory access. 6%e ne,t is /lobal memory access. 6%e slo1est is access o remote memory. $s a mater o act3 t%e models can be easily modi ied to allo1 a mi,ture o s%ared memory and private memory 1it% pre speci ied access ri/%ts. $ %ierarc%ically structured multiprocessor is modeled. 6%e processors are divided into several clusters. 7ac% cluster is itsel an :M$ or a C:M$ multiprocessor. 6%e clusters are connected to glo'al share!-memory modules. 6%e entire system is considered a C:M$ multiprocessor. $ll processors belon/in/ to t%e same cluster are allo1ed to uni ormly access t%e cluster share!-memory modules.

All clusters have equal access to the global memory. However, the access time to the cluster memory is shorter than that to the global memory. One can specify the access rights among intercluster memories in various ways. The Cedar multiprocessor, built at the University of Illinois, assumes such a structure in which each cluster is an Alliant FX/80 multiprocessor.

The COMA Model: A multiprocessor using cache-only memory assumes the COMA model. The COMA model (Figure 4.5) is a special case of a NUMA machine, in which the distributed main memories are converted to caches. There is no memory hierarchy at each processor node. All the caches form a global address space. Remote cache access is assisted by the distributed cache directories. Depending on the interconnection network used, hierarchical directories may sometimes be used to help locate copies of cache blocks. Initial data placement is not critical because data will eventually migrate to where it will be used.

F"2)+, 4.9 T-, COMA 1&0,l &# ! 1)l%"*+&$, &+ 9esides t%e :M$3 C:M$3 and COM$ models speci ied above3 ot%er variations e,ist or mutliprocessors. For e,ample3 a cache-coherent non-uniform memory access ;CC& C:M$< model can be speci ied 1it% distributed s%ared memory and cac%e directories. One can also insist on a cac%e&co%erent COM$ mac%ine in 1%ic% all cac%e copies must be -ept consistent. D" %+".)%,0-M,1&+y M)l%"$&1*)%,+ $ distributed&memory multicomputer system is modeled in Fi/ure 4.). 6%e system consists o multiple computers3 o ten called no!es, interconnected by a messa/e&passin/ net1or-. 7ac% node is an autonomous computer consistin/ o a processor3 local memory3 and sometimes attac%ed dis-s or I>O perip%erals.

F"2)+, 4.: ;,n,+"$ 1&0,l &# ! 1, !2,-*! "n2 1)l%"$&1*)%,+ 6%e messa/e&passin/ net1or- provides point&to&point static connections amon/ t%e nodes. $ll local memories are private and are accessible only by local processors. For t%is reason3 traditional multicomputers %ave been called no-remote-memory-access ;CO*M$< mac%ines. 8o1ever3 t%is restriction 1ill /radually be removed in uture multicomputers 1it% distributed s%ared memories. Internode communication is carried out by passin/ messa/es to t%e static connection net1or-.

4.2.9 M)l%"5,$%&+ !n0 SIMD C&1*)%,+ :

We classify supercomputers either as pipelined vector machines using a few powerful processors equipped with vector hardware, or as SIMD computers emphasizing massive data parallelism.

Vector Supercomputers: A vector computer is often built on top of a scalar processor. As shown in Figure 4.7, the vector processor is attached to the scalar processor as an optional feature. Program and data are first loaded into the main memory through a host computer. All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation or a program control operation, it will be directly executed by the scalar processor using the scalar functional pipelines.

If the instruction is decoded as a vector operation, it will be sent to the vector control unit. This control unit supervises the flow of vector data between the main memory and the vector functional pipelines. The vector data flow is coordinated by the control unit. A number of vector functional pipelines may be built into a vector processor.

Vector Processor Models: Figure 4.7 shows a register-to-register architecture. Vector registers are used to hold the vector operands, and intermediate and final vector results. The vector functional pipelines retrieve operands from and put results into the vector registers. All vector registers are programmable in user instructions. Each vector register is equipped with a component counter which keeps track of the component registers used in successive pipeline cycles.

F"2)+, 4.< T-, !+$-"%,$%)+, &# ! 5,$%&+ )*,+ $&1*)%,+ 6%e len/t% o eac% vector re/ister is usually i,ed3 say3 si,ty& our )4&bit component re/isters in a vector re/ister in a Cray #eries supercomputers. Ot%er mac%ines3 li-e t%e Fujitsu !P2000 #eries3 use recon i/urable vector re/isters to dynamically matc% t%e re/ister len/t% 1it% t%at o t%e vector operands. In /eneral3 t%ere are i,ed numbers o vector re/isters and unctional pipelines in a vector processor. 6%ere ore3 bot% resources must be reserved in advance to avoid resource con licts bet1een vector operations. $ memory-to-memory arc%itecture di ers rom a

re/ister&to&re/ister arc%itecture in t%e use o a vector stream unit to replace t%e vector re/isters. !ector operands and results are directly retrieved rom t%e main memory in super 1ords3 say3 (12 bits as in t%e Cyber 20(. SIMD S)*,+$&1*)%,+ In Fi/ure 4.1b3 1e %ave s%o1n an abstract model o a #IM' computer3 %avin/ a sin/le instruction stream over multiple data streams. $n operational model o an #IM' computer is s%o1n in Fi/ure 4.L. SIMD M!$-"n, M&0,l: $n operational model o an #IM' computer is speci ied by a (&tuple? M = B$ 3 C 3 I 3 M 3 RM 1%ere ;1< $ is t%e number o processing elements ;P7s< in t%e mac%ine. For e,ample3 Illiac I! %as )4 P7s and t%e Connection Mac%ine CM&2 uses )(3(") P7s. ;2< C is t%e set o instructions directly e,ecuted by t%e control unti;C:<3 includin/ scalar and pro/ram lo1 control instructions3 ;"< I is t%e set o instructions broadcast by t%e C: to all P7s or parallel e,ecution. 6%ese include arit%metic3 lo/ic3 data routin/3 mas-in/3 and ot%er local operations e,ecuted by eac% active P7 over data 1it%in t%at P7. ;4< M is t%e set o mas-in/ sc%emes3 1%ere eac% mas- partitions t%e set o P7s into enabled and disabled subsets. ;(< R is t%e set o data&routin/ unctions3 speci yin/ various patterns to be set up in t%e interconnection net1or- or inter&P7 communications. ;4.(<

F"2)+, 4.= O*,+!%"&n!l 1&0,l &# SIMD $&1*)%,+

4.2.6 PRAM and VLSI Models

Theoretical models of parallel computers are abstracted from the physical models studied in the previous sections. These models are often used by algorithm designers and VLSI device/chip developers. The ideal models provide a convenient framework for developing parallel algorithms without worrying about the implementation details or physical constraints. The models can be applied to obtain theoretical performance bounds on parallel computers or to estimate VLSI complexity in chip area and execution time before the chip is fabricated. The abstract models are also useful in scalability and programmability analysis, when real machines are compared with an idealized parallel machine without worrying about communication overhead among processing nodes.

4.2.6.1 Parallel Random-Access Machines

Theoretical models of parallel computers are presented below. We first define the time and space complexities. Computational tractability is reviewed for solving difficult problems on computers. Then we introduce the random-access machine (RAM), the parallel random-access machine (PRAM), and variants of PRAMs. These complexity models facilitate the study of the asymptotic behavior of algorithms implementable on parallel computers.

Time and Space Complexities: The complexity of an algorithm for solving a problem of size s on a computer is determined by the execution time and the storage space required. The time complexity is a function of the problem size. The time complexity function in order notation is the asymptotic time complexity of the algorithm. Usually, the worst-case time complexity is considered. For example, a time complexity g(s) is said to be O(f(s)), read "order f(s)", if there exist positive constants c and s0 such that g(s) <= c·f(s) for all nonnegative values of s >= s0. The space complexity can be similarly defined as a function of the problem size s. The asymptotic space complexity refers to the data storage of large problems. Note that the program (code) storage requirement and the storage for input data are not considered in this. The time complexity of a serial algorithm is simply called the serial complexity. The time complexity of a parallel algorithm is called the parallel complexity. Intuitively, the parallel complexity should be lower than the serial complexity, at least asymptotically. We consider only deterministic algorithms, in which every operational step is uniquely defined, in agreement with the way programs are executed on real computers.
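The order-notation definition above is easy to check mechanically: g(s) is O(f(s)) if some constants c and s0 make g(s) <= c·f(s) for every s >= s0. The Python sketch below spot-checks this over a finite range for a hypothetical g(s) = 3s² + 10s; since big-O is an asymptotic claim, the finite check is evidence, not a proof.

```python
def witnesses_big_o(g, f, c, s0, s_max=10_000):
    """Check g(s) <= c * f(s) for all s0 <= s <= s_max.
    A finite spot-check of the O(f(s)) definition, not a proof."""
    return all(g(s) <= c * f(s) for s in range(s0, s_max + 1))

g = lambda s: 3 * s * s + 10 * s   # hypothetical time complexity
f = lambda s: s * s                # claim: g is O(s^2)
print(witnesses_big_o(g, f, c=4, s0=10))  # True: 3s^2 + 10s <= 4s^2 once s >= 10
```

Here c = 4 and s0 = 10 are witnessing constants: 3s² + 10s <= 4s² exactly when 10s <= s², i.e. s >= 10. With c = 3 no s0 works, since the 10s term always exceeds the slack.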

A nondeterministic algorithm contains operations resulting in one outcome from a set of possible outcomes. There exist no real computers that can execute nondeterministic algorithms.

PRAM Model: Conventional uniprocessor computers have been modeled as random-access machines by Shepherdson and Sturgis. A parallel random-access machine (PRAM) model has been developed by Fortune and Wyllie for modeling idealized parallel computers with zero synchronization or memory access overhead. This PRAM model will be used for parallel algorithm development and for scalability and complexity analysis.

F"2)+, 4.> PRAM 1&0,l &# ! 1)l%"*+&$, &+ y %,1 $n n&processor P*$M ;Fi/ure 4.N< %as a /lobally addressable memory. 6%e s%ared memory can be distributed amon/ t%e processors or centrali4ed in one place. 6%e n processors operate on a sync%roni4ed read&memory3 compute3 and 1rite&memory cycle. @it% s%ared memory3 t%e model must speci y %o1 concurrent read and concurrent 1rite o memory are %andled. Four memory&update options are possible. *(clusive rea! ;7*< K 6%is allo1s at most one processor to read rom any memory location in eac% cycle3 a rat%er restrictive policy. Concurrent rea! ;C*< K 6%is allo1s multiple processors to read t%e same in ormation rom t%e same memory cell in t%e same cycle. Concurrent +rite ;C@< K t%is allo1s simultaneous 1rites to t%e same memory location. In order to avoid con usion3 some policy must be set up to resolve t%e 1rite con licts.

Various combinations of the above options lead to several variants of the PRAM model as specified below. Since CR does not create a conflict problem, variants differ mainly in how they handle the CW conflicts.

PRAM Variants: Described below are four variants of the PRAM model, depending on how the memory reads and writes are handled.

(1) The EREW-PRAM model: This model forbids more than one processor from reading or writing the same memory cell simultaneously. This is the most restrictive PRAM model proposed.
(2) The CREW-PRAM model: The write conflicts are avoided by mutual exclusion. Concurrent reads to the same memory location are allowed.
(3) The ERCW-PRAM model: This allows exclusive reads but concurrent writes to the same memory location.
(4) The CRCW-PRAM model: This model allows either concurrent reads or concurrent writes at the same time. The conflicting writes are resolved by one of the following four policies:

Common: All simultaneous writes store the same value to the hot-spot memory location.
Arbitrary: Any one of the values written may remain; the others are ignored.
Minimum: The value written by the processor with the minimum index will remain.
Priority: The values being written are combined using some associative function, such as summation or maximum.
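The four CRCW write-conflict policies can be made concrete with a toy simulator. In the sketch below, each pending write is a (processor_index, value) pair aimed at the same hot-spot cell; for the last policy the combining function is summation, one of the associative functions the text mentions:

```python
def resolve_crcw(writes, policy):
    """Resolve simultaneous writes to one memory cell.
    writes: list of (processor_index, value) pairs."""
    values = [v for _, v in writes]
    if policy == "common":
        # All writers must agree; modeled here by requiring identical values.
        assert len(set(values)) == 1, "common policy: writers disagree"
        return values[0]
    if policy == "arbitrary":
        return values[0]           # any one value may survive; take the first seen
    if policy == "minimum":
        return min(writes)[1]      # value from the lowest-indexed processor
    if policy == "priority":       # combining policy; summation as in the text
        return sum(values)
    raise ValueError(policy)

writes = [(2, 7), (0, 4), (5, 9)]
print(resolve_crcw(writes, "minimum"))   # 4, since processor 0 has the lowest index
print(resolve_crcw(writes, "priority"))  # 20, the sum 7 + 4 + 9
```

The "arbitrary" branch picks the first pair only as one legal choice; any of the three values would satisfy that policy.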

4.2.6.2 VLSI Complexity Model: Parallel computers rely on the use of VLSI chips to fabricate the major components such as processor arrays, memory arrays, and large-scale switching networks. An AT² model for two-dimensional VLSI chips is presented below, based on the work of Clark Thompson. Three lower bounds on VLSI circuits are interpreted by Jeffrey Ullman. The bounds are obtained by setting limits on memory, I/O, and communication for implementing parallel algorithms with VLSI chips.

The AT² Model: Let A be the chip area and T be the latency for completing a given computation using a VLSI circuit chip. Let s be the problem size involved in the computation. Thompson stated in his doctoral thesis that for certain computations, there exists a lower bound f(s) such that

    A × T² >= O(f(s))               (4.6)

The chip area A is a measure of the chip's complexity. The latency T is the time required from when inputs are applied until all outputs are produced for a single problem instance. The chip is represented by the base area in the two horizontal dimensions. The vertical dimension corresponds to time. Therefore, the three-dimensional solid represents the history of the computation performed by the chip.

Memory Bound on Chip Area A: There are many computations which are memory-bound, due to the need to process large data sets. To implement this type of computation in silicon, one is limited by how densely information (bit cells) can be placed on the chip. As depicted in Figure 4.10a, the memory requirement of a computation sets a lower bound on the chip area A. The amount of information processed by the chip can be visualized as information flow upward across the chip area. Each bit can flow through a unit area of the horizontal chip slice. Thus, the chip area bounds the amount of memory bits stored on the chip.

'!( M,1&+y-l"1"%,0 .&)n0 &n $-"* !+,! A !n0 I3O-l"1"%,0 .&)n0 &n $-"* -" %&+y +,*+, ,n%,0 .y %-, 5&l)1, AT

?? '.(C&11)n"$!%"&n-l"1"%,0 .&)n0 &n %-, ." ,$%"&n @A T F"2)+, 4.10 I3O B&)n0 &n V&l)1, AT: 6%e volume o t%e rectan/ular cube is represented by t%e product ,T. $s in ormation lo1s t%rou/% t%e c%ip or a period o time T3 t%e number o input bits cannot e,ceed t%e volume. 6%is provides an I>O&limited lo1er bound on t%e product ,T3 as demonstrated. 6%e area , corresponds to data into and out o t%e entire sur ace o t%e silicon c%ip. 6%is area measure sets t%e ma,imum I>O limit rat%er t%an usin/ perip%eral I>O pads as seen in conventional c%ips. 6%e %ei/%t T o t%e volume can be visuali4ed as a number o snaps%ots on t%e c%ip3 as computin/ time elapses. 6%e volume represents t%e amount o in ormation lo1in/ t%rou/% t%e c%ip durin/ t%e entire course o t%e computation. ?? B" ,$%"&n C&11)n"$!%"&n B&)n0A @A T: OO It depicts a communication limited lo1er bound on t%e bisection area P ,T. 6%e bisection is represented by t%e vertical slice cuttin/ across t%e s%orter dimension o t%e c%ip area. 6%e distance o t%is dimension is at most s5uare root , or a s5uare c%ip. 6%e %ei/%t o t%e cross section is T. 6%e bisection area represents t%e ma,imum amount o in ormation e,c%an/e bet1een t%e t1o %alves o t%e c%ip circuit durin/ t%e time period T. 6%e cross&section area P,T limits

the communication bandwidth of a computation. VLSI complexity theoreticians have used the square of this measure, A·T², as the lower bound. Charles Seitz has given another interpretation of the AT² result. He considers the area-time product A·T the cost of a computation, which can be expected to vary as 1/T. This implies that the cost of computation for a two-dimensional chip decreases with the execution time allowed. When three-dimensional (multilayer) silicon chips are used, Seitz asserted that the cost of computation, as limited by the volume-time product, would vary as 1/√T. This is due to the fact that the bisection will vary as (AT)^(2/3) for 3-D chips instead of as √(AT) for 2-D chips.
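The AT² tradeoff above can be explored with a small numeric sketch: if a computation of size s carries a lower bound A × T² >= c·f(s), then halving the allowed latency T forces at least a fourfold increase in chip area A. The constant c and the choice f(s) = s² below are purely illustrative stand-ins, not Thompson's actual bound for any particular problem.

```python
def min_area(s, t, c=1.0):
    """Smallest area satisfying A * T^2 >= c * f(s), using f(s) = s^2
    as an illustrative stand-in bound."""
    f = s * s
    return c * f / (t * t)

s = 1024
a_slow = min_area(s, t=64.0)
a_fast = min_area(s, t=32.0)   # halve the allowed latency...
print(a_fast / a_slow)         # 4.0: the minimum area grows fourfold
```

This is exactly the area-time exchange the bound encodes: along the A·T² = constant frontier, speed is bought quadratically in silicon.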


Check Your Progress

1. Conventional sequential machines are called
   (a) SISD (b) SIMD (c) MISD (d) MIMD
2. MISD architecture is also known as
   (a) vector computer (b) systolic array (c) parallel computer (d) sequential structure
3. The CPU time needed to execute a program is
   (a) T = Ic × CPI × τ (b) T = Ic × CPI / τ (c) T = Ic × τ / CPI (d) T = Ic × CPI + τ
4. A processor having equal access to all peripheral devices is called
   (a) a symmetric processor (b) an asymmetric processor (c) an attached processor (d) an associative processor
5. The PRAM model which allows either concurrent reads or concurrent writes at the same time is called the
   (a) EREW model (b) CREW model (c) ERCW model (d) CRCW model

4.3 Summary

- Flynn classifies computers into four categories of computer architectures.
- Intrinsic parallel computers are those that execute programs in MIMD mode.
- Conventional sequential computers are called SISD.
- A vector computer equipped with scalar and vector hardware is called SIMD.
- Parallel computers are reserved for MIMD machines.
- MISD is also known as a systolic architecture.
- In a UMA multiprocessor model, the physical memory is uniformly shared by all the processors.
- A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.

- In the COMA model, remote cache access is assisted by the distributed cache directories.
- The variants of PRAM are EREW, CREW, ERCW, and CRCW.
- Parallel computers rely on the use of VLSI chips.

4.4 Glossary

COMA: Cache-only memory architecture
MIMD: Multiple instruction streams, multiple data streams
MIPS: Million instructions per second
MISD: Multiple instruction streams, single data stream
NUMA: Nonuniform memory access
PRAM: Parallel random-access machine
SISD: Single instruction stream, single data stream
SIMD: Single instruction stream, multiple data streams
Throughput: A measure of how many programs can be executed per second
UMA: Uniform memory access

4.5 References

1. Thomas C. Bartee, "Computer Architecture and Logic Design", TMH
2. Kai Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability", TMH

". 8amac%er3 !ranesic3 and Ra-i EComputer or/ani4ationF 6M8 4. 8ayes EComputer arc%itecture and or/ani4ationF3 688


4.6 Answers to Check Your Progress Questions

1. (a), 2. (b), 3. (a), 4. (a), 5. (d)