4.0 Objectives
4.1 Introduction
4.2 Parallel Computer Models
    4.2.1 Flynn's Classification
    4.2.2 Parallel and Vector Computers
    4.2.3 System Attributes to Performance
    4.2.4 Multiprocessors and Multicomputers
        4.2.4.1 Shared-Memory Multiprocessors
        4.2.4.2 Distributed-Memory Multicomputers
    4.2.5 Multivector and SIMD Computers
        4.2.5.1 Vector Supercomputers
        4.2.5.2 SIMD Computers
    4.2.6 PRAM and VLSI Models
        4.2.6.1 Parallel Random-Access Machines
        4.2.6.2 VLSI Complexity Model
    Check Your Progress
4.3 Summary
4.4 Glossary
4.5 References
4.6 Answers to Check Your Progress Questions

4.0 OBJECTIVES

After going through this unit, you will be able to
• describe Flynn's classification
• describe parallel and vector computers
• explain system attributes to performance
• distinguish between implicit and explicit parallelism
• explain multiprocessors and multicomputers
• explain multivector and SIMD computers
• describe the PRAM and VLSI models

4.1 INTRODUCTION

Parallel processing has emerged as a key enabling technology in modern computers, driven by the ever-increasing demand for higher performance, lower costs, and sustained productivity in real-life applications. Concurrent events are taking place in today's high-performance computers due to the common practice of multiprogramming, multiprocessing, or multicomputing.

Parallelism appears in various forms, such as lookahead, pipelining, vectorization, concurrency, simultaneity, data parallelism, partitioning, interleaving, overlapping, multiplicity, replication, time sharing, space sharing, multitasking, multiprogramming, multithreading, and distributed computing at different processing levels.

4.2 PARALLEL COMPUTER MODELS

4.2.1 Flynn's Classification
Michael Flynn introduced a classification of various computer architectures based on the notions of instruction and data streams. As illustrated in Figure 4.1, conventional sequential machines are called SISD computers. Vector computers are equipped with scalar and vector hardware or appear as SIMD machines. The term parallel computers is reserved for MIMD machines. An MISD machine is also modeled: the same data stream flows through a linear array of processors executing different instruction streams. This architecture is also known as a systolic array for pipelined execution of specific algorithms.

Figure 4.1 Flynn's classification of computer architectures: (a) SISD uniprocessor architecture; (b) SIMD architecture; (c) MIMD architecture; (d) MISD architecture (systolic array)
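Flynn's four categories follow mechanically from how many instruction streams and data streams a machine handles. The sketch below is only an illustration of that rule; the function name `flynn_class` and the sample machines are hypothetical and do not come from the text.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Map stream counts to one of Flynn's four categories."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    instr = "S" if instruction_streams == 1 else "M"  # single or multiple instruction
    data = "S" if data_streams == 1 else "M"          # single or multiple data
    return f"{instr}I{data}D"


if __name__ == "__main__":
    # Hypothetical machines given as (instruction streams, data streams).
    samples = [
        ("uniprocessor", (1, 1)),
        ("64-PE array", (1, 64)),
        ("systolic pipeline", (4, 1)),
        ("4-CPU multiprocessor", (4, 4)),
    ]
    for label, streams in samples:
        print(label, "->", flynn_class(*streams))
```

The classification looks only at stream counts, which is why a vector machine and a processor array can both appear under SIMD.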

4.2.2 Parallel and Vector Computers
Intrinsic parallel computers are those that execute programs in MIMD mode. There are two major classes of parallel computers, namely shared-memory multiprocessors and message-passing multicomputers. The major distinction between multiprocessors and multicomputers lies in memory sharing and the mechanisms used for interprocessor communication. The processors in a multiprocessor system communicate with each other through shared variables in a common memory. Each computer node in a multicomputer system has a local memory, unshared with other nodes. Interprocessor communication is done through message passing among the nodes.

Explicit vector instructions were introduced with the appearance of vector processors. A vector processor is equipped with multiple vector pipelines that can be concurrently used under hardware or firmware control. There are two families of pipelined vector processors. Memory-to-memory architecture supports the pipelined flow of vector operands directly from the memory to pipelines and then back to the memory. Register-to-register architecture uses vector registers to interface between the memory and functional pipelines.

4.2.3 System Attributes to Performance
The ideal performance of a computer system demands a perfect match between machine capability and program behavior. Machine capability can be enhanced with better hardware technology, architectural features, and efficient resource management. However, program behavior is difficult to predict due to its heavy dependence on application and run-time conditions. There are also many other factors affecting program behavior, including algorithm design, data structures, language efficiency, programmer skill, and compiler technology. It is impossible to achieve a perfect match between hardware and software by merely improving a few factors without touching the others.

Besides, machine performance may vary from program to program. This makes peak performance an impossible target to achieve in real-life applications. On the other hand, a machine cannot be said to have an average performance either. All performance indices or benchmarking results must be tied to a program mix. For this reason, the performance should be described as a range or as a harmonic distribution.

Clock Rate and CPI: The CPU (or simply the processor) of today's digital computer is driven by a clock with a constant cycle time (τ in nanoseconds). The inverse of the cycle time is the clock rate (f = 1/τ in megahertz). The size of a program is determined by its instruction count (Ic), in terms of the number of machine instructions to be executed in the program. Different machine instructions may require different numbers of clock cycles to execute. Therefore, the cycles per instruction (CPI) becomes an important parameter for measuring the time needed to execute each instruction.

For a given instruction set, we can calculate an average CPI over all instruction types, provided we know their frequencies of appearance in the program. An accurate estimate of the average CPI requires a large amount of program code to be traced over a long period of time. Unless specifically focusing on a single instruction type, we simply use the term CPI to mean the average value with respect to a given instruction set and a given program mix.

Performance Factors: Let Ic be the number of instructions in a given program, or the instruction count. The CPU time (T in seconds/program) needed to execute the program is estimated by finding the product of three contributing factors:

$$T = I_c \times \mathrm{CPI} \times \tau \qquad (4.1)$$

The execution of an instruction requires going through a cycle of events involving the instruction fetch, decode, operand(s) fetch, execution, and store results. In this cycle, only the instruction decode and execution phases are carried out in the CPU. The remaining three operations may be required to access the memory. A memory cycle, the time needed to complete one memory reference, is k times the processor cycle τ. The value of k depends on the speed of the memory technology and the processor-memory interconnection scheme used.

The CPI of an instruction type can be divided into two component terms corresponding to the total processor cycles and memory cycles needed to complete the execution of the instruction. Depending on the instruction type, the complete instruction cycle may involve one to four memory references (one for instruction fetch, two for operand fetch, and one for store results). Therefore we can rewrite Eq. 4.1 as follows:

$$T = I_c \times (p + m \times k) \times \tau \qquad (4.2)$$

where p is the number of processor cycles needed for the instruction decode and execution, m is the number of memory references needed, k is the ratio between memory cycle and processor cycle, Ic is the instruction count, and τ is the processor cycle time. Equation 4.2 can be further refined once the CPI components (p, m, k) are weighted over the entire instruction set.
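As a worked illustration of Eq. 4.2, the following sketch computes the effective CPI and the CPU time from the five performance factors. The function names and the sample values (200,000 instructions, a 25 ns cycle, and so on) are assumptions chosen only to make the arithmetic easy to follow.

```python
def average_cpi(p: float, m: float, k: float) -> float:
    """Effective cycles per instruction: processor cycles plus memory cycles."""
    return p + m * k


def cpu_time(ic: int, p: float, m: float, k: float, tau: float) -> float:
    """CPU time in seconds, T = Ic * (p + m*k) * tau (Eq. 4.2)."""
    return ic * average_cpi(p, m, k) * tau


if __name__ == "__main__":
    # Hypothetical program and machine, not taken from the text.
    ic = 200_000    # instruction count
    p = 2           # processor cycles per instruction (decode + execute)
    m = 1           # memory references per instruction
    k = 4           # memory cycle / processor cycle ratio
    tau = 25e-9     # processor cycle time: 25 ns, i.e. a 40 MHz clock
    print("CPI =", average_cpi(p, m, k))
    print(f"T   = {cpu_time(ic, p, m, k, tau):.3f} s")
```

With these numbers the CPI is 2 + 1 × 4 = 6, so the program needs 1.2 million cycles, or about 0.03 s. Halving k (a faster memory) would lower the CPI to 4 without touching the processor.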

System Attributes: The above five performance factors (Ic, p, m, k, τ) are influenced by four system attributes: instruction-set architecture, compiler technology, CPU implementation and control, and cache and memory hierarchy. The instruction-set architecture affects the program length (Ic) and the processor cycles needed (p). The compiler technology affects the values of Ic, p, and the memory reference count (m). The CPU implementation and control determine the total processor time (p·τ) needed. Finally, the memory technology and hierarchy design affect the memory access latency (k·τ). The above CPU time can be used as a basis in estimating the execution rate of a processor.

MIPS Rate: Let C be the total number of clock cycles needed to execute a given program. Then the CPU time in Eq. 4.2 can be estimated as T = C × τ = C/f. Furthermore, CPI = C/Ic and T = Ic × CPI × τ = Ic × CPI/f. The processor speed is often measured in terms of million instructions per second (MIPS). We simply call it the MIPS rate of a given processor. It should be emphasized that the MIPS rate varies with respect to a number of factors, including the clock rate (f), the instruction count (Ic), and the CPI of a given machine, as defined below:

$$\text{MIPS rate} = \frac{I_c}{T \times 10^6} = \frac{f}{\mathrm{CPI} \times 10^6} = \frac{f \times I_c}{C \times 10^6} \qquad (4.3)$$

Based on Eq. 4.3, the CPU time in Eq. 4.2 can also be written as T = Ic × 10^-6 / MIPS. From these expressions we conclude that the MIPS rate of a given computer is directly proportional to the clock rate and inversely proportional to the CPI. All four system attributes (instruction set, compiler, processor, and memory technologies) affect the MIPS rate, which also varies from program to program.

Throughput Rate: Another important concept is related to how many programs a system can execute per unit time, called the system throughput Ws (in programs/second). In a multiprogrammed system, the system throughput is often lower than the CPU throughput Wp defined by:

$$W_p = \frac{f}{I_c \times \mathrm{CPI}} \qquad (4.4)$$

The CPU throughput is a measure of how many programs can be executed per second. The system throughput Ws falls below Wp because of the additional system overheads caused by the I/O, compiler, and OS when multiple programs are interleaved for CPU execution by multiprogramming or time-sharing operations. If the CPU is kept busy in a perfect program-interleaving fashion, then Ws = Wp. This will probably never happen, since the system overhead often causes an extra delay and the CPU may be left idle for some cycles.
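The MIPS rate and CPU throughput relations (Eqs. 4.3 and 4.4) can be checked numerically. The sketch below is a minimal illustration; the function names and the 40 MHz, CPI = 5 machine are hypothetical values, not figures from the text.

```python
def mips_rate(f_hz: float, cpi: float) -> float:
    """MIPS rate = f / (CPI * 10^6), per Eq. 4.3."""
    return f_hz / (cpi * 1e6)


def cpu_throughput(f_hz: float, ic: int, cpi: float) -> float:
    """CPU throughput Wp = f / (Ic * CPI) in programs/second, per Eq. 4.4."""
    return f_hz / (ic * cpi)


def cpu_time_from_mips(ic: int, mips: float) -> float:
    """CPU time T = Ic * 10^-6 / MIPS in seconds."""
    return ic * 1e-6 / mips


if __name__ == "__main__":
    f_hz = 40e6     # hypothetical 40 MHz clock
    cpi = 5         # hypothetical average CPI for this program mix
    ic = 200_000    # hypothetical instruction count
    mips = mips_rate(f_hz, cpi)
    print("MIPS rate:", mips)
    print("CPU time :", cpu_time_from_mips(ic, mips), "s")
    print("Wp       :", cpu_throughput(f_hz, ic, cpi), "programs/s")
```

Here the machine delivers 8 MIPS, the program takes 0.025 s, and Wp is the reciprocal of that time, 40 programs per second. The measured system throughput Ws would be lower once I/O and OS overheads are included.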

Programming Environments: The programmability of a computer depends on the programming environment provided to the users. Most computer environments are not user-friendly. In fact, the marketability of any new computer system depends on the creation of a user-friendly environment in which programming becomes a joyful undertaking rather than a nuisance. We briefly introduce below the environmental features desired in modern computers.

Conventional uniprocessor computers are programmed in a sequential environment in which instructions are executed one after another in a sequential manner. In fact, the original UNIX/OS kernel was designed to respond to one system call from the user process at a time. Successive system calls must be serialized through the kernel. Most existing compilers are designed to generate sequential object codes to run on a sequential computer. In other words, conventional computers are being used in a sequential programming environment using languages, compilers, and operating systems all developed for a uniprocessor computer. A parallel computer, by contrast, calls for a parallel environment in which parallelism is automatically exploited. Language extensions or new constructs must be developed to specify parallelism or to facilitate easy detection of parallelism at various granularity levels by more intelligent compilers.

Implicit Parallelism: An implicit approach uses a conventional language, such as C, FORTRAN, Lisp, or Pascal, to write the source program. The sequentially coded source program is translated into parallel object code by a parallelizing compiler. As illustrated in Figure 4.2, this compiler must be able to detect parallelism and assign target machine resources. This compiler approach has been applied in programming shared-memory multiprocessors. With parallelism being implicit, success relies heavily on the "intelligence" of a parallelizing compiler. This approach requires less effort on the part of the programmer.

Figure 4.2 (a) Implicit parallelism; (b) Explicit parallelism

Explicit Parallelism: The second approach requires more effort by the programmer to develop a source program using parallel dialects of C, FORTRAN, Lisp, or Pascal. Parallelism is explicitly specified in the user programs. This significantly reduces the burden on the compiler to detect parallelism. Instead, the compiler needs to preserve parallelism and, where possible, assign target machine resources. Charles Seitz of California Institute of Technology and William Dally of Massachusetts Institute of Technology adopted this explicit approach in multicomputer development.

Special software tools are needed to make an environment friendlier to user groups. Some of the tools are parallel extensions of conventional high-level languages. Others are integrated environments which include tools providing different levels of program abstraction, validation, testing, debugging, and tuning; performance prediction and monitoring; and visualization support to aid program development, performance measurement, and graphics display and animation of computer results.
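The difference between the two approaches can be sketched in Python. In the first function the programmer writes an ordinary sequential loop and leaves any parallelization to the tools; in the second the programmer states the partitioning and the parallel sections explicitly. This is only an analogy: the function names are hypothetical, Python has no parallelizing compiler, and threads in the standard CPython interpreter will not speed up this arithmetic. The point is the programming style, not performance.

```python
from concurrent.futures import ThreadPoolExecutor


def vector_add_sequential(b, c):
    """Implicit style: plain sequential code; parallelism is left to the tools."""
    return [x + y for x, y in zip(b, c)]


def vector_add_explicit(b, c, workers=4):
    """Explicit style: the programmer partitions the work into sections."""
    n = len(b)
    if n == 0:
        return []
    size = -(-n // workers)  # ceiling division: elements per section

    def section(start):
        # One independent section of the loop, run by one worker.
        return [b[i] + c[i] for i in range(start, min(start + size, n))]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(section, range(0, n, size)))
    # pool.map preserves section order, so concatenation restores the vector.
    return [x for part in parts for x in part]


if __name__ == "__main__":
    b = list(range(10))
    c = [10] * 10
    print(vector_add_sequential(b, c))
    print(vector_add_explicit(b, c, workers=3))
```

Both calls produce the same vector; what changes is who is responsible for finding the independent work.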

4.2.4 Multiprocessors and Multicomputers
Two categories of parallel computers are architecturally modeled below. These physical models are distinguished by having a shared common memory or unshared distributed memories.

4.2.4.1 Shared-Memory Multiprocessors
We describe below three shared-memory multiprocessor models: the uniform memory-access (UMA) model, the nonuniform-memory-access (NUMA) model, and the cache-only memory architecture (COMA) model. These models differ in how the memory and peripheral resources are shared or distributed.

The UMA Model: In a UMA multiprocessor model (Figure 4.3) the physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access. Each processor may use a private cache. Peripherals are also shared in some fashion.

Multiprocessors are called tightly coupled systems due to the high degree of resource sharing. The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network. Most computer manufacturers have multiprocessor extensions of their uniprocessor product line. The UMA model is suitable for general-purpose and time-sharing applications by multiple users. It can be used to speed up the execution of a single large program in time-critical applications. To coordinate parallel events, synchronization and communication among processors are done through shared variables in the common memory.

When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor. In this case, all the processors are equally capable of running the executive programs, such as the OS kernel and I/O service routines. In an asymmetric multiprocessor, only one or a subset of processors are executive capable. An executive or a master processor can execute the operating system and handle I/O. The remaining processors have no I/O capability and thus are called attached processors (APs). Attached processors execute user codes under the supervision of the master processor. In both multiprocessor and attached processor configurations, memory sharing among master and attached processors is still in place.

Figure 4.3 The UMA multiprocessor model

Approximated performance of a multiprocessor: This example exposes the reader to parallel program execution on a shared-memory multiprocessor system. Consider the following Fortran program written for sequential execution on a uniprocessor system. All the arrays, A(I), B(I), and C(I), are assumed to have N elements.

    L1:       Do 10 I = 1, N
    L2:         A(I) = B(I) + C(I)
    L3:  10   Continue
    L4:       SUM = 0
    L5:       Do 20 J = 1, N
    L6:         SUM = SUM + A(J)
    L7:  20   Continue

Suppose each line of code L2, L4, and L6 takes 1 machine cycle to execute. The time required to execute the program control statements L1, L3, L5, and L7 is ignored to simplify the analysis. Assume that k cycles are needed for each interprocessor communication operation via the shared memory. Initially, all arrays are assumed already loaded in the main memory and the short program fragment already loaded in the instruction cache. In other words, instruction fetch and data loading overhead is ignored. Also, we ignore bus contention and memory access conflicts. In this way, we can concentrate on the analysis of CPU demand.

The above program can be executed on a sequential machine in 2N cycles under the above assumptions. N cycles are needed to execute the N independent iterations in the I loop. Similarly, N cycles are needed for the J loop, which contains N recursive iterations.

To execute the program on an M-processor system, we partition the looping operations into M sections with L = N/M elements per section. In the following parallel code, Doall declares that all M sections be executed by M processors in parallel.

    Doall K = 1, M
      Do 10 I = L(K-1) + 1, KL
        A(I) = B(I) + C(I)
    10 Continue
      SUM(K) = 0
      Do 20 J = 1, L
        SUM(K) = SUM(K) + A(L(K-1) + J)
    20 Continue
    Endall

For M-way parallel execution, the sectioned I loop can be done in L cycles. The sectioned J loop produces M partial sums in L cycles. Thus 2L cycles are consumed to produce all M partial sums. Still, we need to merge these M partial sums to produce the final sum of N elements.

The NUMA Model: A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word. Two NUMA machine models are depicted in Figure 4.4.
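Returning to the summation example of the UMA discussion, its cycle counts can be reproduced with a short simulation. The merge step is not spelled out in the text, so the sketch assumes one common scheme: the M partial sums are combined by a binary adder tree with log2 M levels, each level costing one addition cycle plus k communication cycles. The function names, the power-of-two restriction on M, and the sample values are all assumptions made for this illustration.

```python
def sequential_cycles(n: int) -> int:
    """N cycles for the I loop plus N cycles for the J loop."""
    return 2 * n


def parallel_cycles(n: int, m: int, k: int) -> int:
    """2L cycles for the sectioned loops plus an assumed adder-tree merge.

    Assumes m is a power of two that divides n, and that each tree level
    costs (k + 1) cycles: k for communication and 1 for the addition.
    """
    if m < 1 or m & (m - 1) or n % m:
        raise ValueError("m must be a power of two that divides n")
    section = n // m               # L = N/M elements per processor
    levels = m.bit_length() - 1    # log2(M) levels in the adder tree
    return 2 * section + levels * (k + 1)


def partitioned_sum(b, c, m):
    """Compute the sum of B(I) + C(I) section by section, then merge pairwise."""
    n = len(b)
    if m < 1 or m & (m - 1) or n % m:
        raise ValueError("m must be a power of two that divides n")
    section = n // m
    partial = []
    for proc in range(m):          # each pass stands in for one processor
        lo = proc * section
        a = [b[i] + c[i] for i in range(lo, lo + section)]
        partial.append(sum(a))
    while len(partial) > 1:        # one pass per level of the adder tree
        partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
    return partial[0]


if __name__ == "__main__":
    n, m, k = 2**20, 256, 200      # hypothetical problem and machine sizes
    seq = sequential_cycles(n)
    par = parallel_cycles(n, m, k)
    print("sequential cycles:", seq)
    print("parallel cycles  :", par)
    print(f"speedup          : {seq / par:.1f}")
    print("check sum        :", partitioned_sum(list(range(16)), [1] * 16, 4))
```

Under these assumptions 256 processors need 8192 + 8 × 201 = 9800 cycles against 2,097,152 sequential cycles. The communication cost k appears only in the merge term, which is why the speedup stays below M.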

The shared memory is physically distributed to all processors, called local memories. The collection of all local memories forms a global address space accessible by all processors. It is faster to access a local memory with a local processor. The access of remote memory attached to other processors takes longer due to the added delay through the interconnection network. The BBN TC-2000 Butterfly multiprocessor assumes this configuration.

Besides distributed memories, globally shared memory can be added to a multiprocessor system. In this case, there are three memory-access patterns: the fastest is local memory access, the next is global memory access, and the slowest is access of remote memory. As a matter of fact, the models can be easily modified to allow a mixture of shared memory and private memory with prespecified access rights.

Figure 4.4 NUMA multiprocessor models: (a) Shared local memories; (b) A hierarchical cluster model

A hierarchically structured multiprocessor is also modeled. The processors are divided into several clusters. Each cluster is itself a UMA or a NUMA multiprocessor. The clusters are connected to global shared-memory modules. The entire system is considered a NUMA multiprocessor. All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules.

All clusters have equal access to the global memory. However, the access time to the cluster memory is shorter than that to the global memory. One can specify the access rights among intercluster memories in various ways. The Cedar multiprocessor, built at the University of Illinois, assumes such a structure in which each cluster is an Alliant FX/80 multiprocessor.

The COMA Model: A multiprocessor using cache-only memory assumes the COMA model. The COMA model (Figure 4.5) is a special case of a NUMA machine, in which the distributed main memories are converted to caches. There is no memory hierarchy at each processor node. All the caches form a global address space. Remote cache access is assisted by the distributed cache directories. Depending on the interconnection network used, sometimes hierarchical directories may be used to help locate copies of cache blocks. Initial data placement is not critical because data will eventually migrate to where it will be used.

Figure 4.5 The COMA model of a multiprocessor

Besides the UMA, NUMA, and COMA models specified above, other variations exist for multiprocessors. For example, a cache-coherent non-uniform memory access (CC-NUMA) model can be specified with distributed shared memory and cache directories. One can also insist on a cache-coherent COMA machine in which all cache copies must be kept consistent.

4.2.4.2 Distributed-Memory Multicomputers
A distributed-memory multicomputer system is modeled in Figure 4.6. The system consists of multiple computers, often called nodes, interconnected by a message-passing network. Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals.

The message-passing network provides point-to-point static connections among the nodes. All local memories are private and are accessible only by local processors. For this reason, traditional multicomputers have been called no-remote-memory-access (NORMA) machines. However, this restriction will gradually be removed in future multicomputers with distributed shared memories. Internode communication is carried out by passing messages through the static connection network.

Figure 4.6 Generic model of a message-passing multicomputer

4.2.5 Multivector and SIMD Computers
We classify supercomputers either as pipelined vector machines using a few powerful processors equipped with vector hardware, or as SIMD computers emphasizing massive data parallelism.

4.2.5.1 Vector Supercomputers
A vector computer is often built on top of a scalar processor. As shown in Figure 4.7, the vector processor is attached to the scalar processor as an optional feature. Program and data are first loaded into the main memory through a host computer. All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation or a program control operation, it will be directly executed by the scalar processor using the scalar functional pipelines.

If the instruction is decoded as a vector operation, it will be sent to the vector control unit. This control unit will supervise the flow of vector data between the main memory and vector functional pipelines. The vector data flow is coordinated by the control unit. A number of vector functional pipelines may be built into a vector processor.

Figure 4.7 The architecture of a vector supercomputer

Vector Processor Models: Figure 4.7 shows a register-to-register architecture. Vector registers are used to hold the vector operands, intermediate and final vector results. The vector functional pipelines retrieve operands from and put results into the vector registers. All vector registers are programmable in user instructions. Each vector register is equipped with a component counter which keeps track of the component registers used in successive pipeline cycles.

The length of each vector register is usually fixed, say, sixty-four 64-bit component registers in a vector register in a Cray Series supercomputer. Other machines, like the Fujitsu VP2000 Series, use reconfigurable vector registers to dynamically match the register length with that of the vector operands.

In general, there are fixed numbers of vector registers and functional pipelines in a vector processor. Therefore, both resources must be reserved in advance to avoid resource conflicts between vector operations. A memory-to-memory architecture differs from a register-to-register architecture in the use of a vector stream unit to replace the vector registers. Vector operands and results are directly retrieved from the main memory in superwords, say, 512 bits as in the Cyber 205.

4.2.5.2 SIMD Supercomputers
In Figure 4.1b, we have shown an abstract model of a SIMD computer, having a single instruction stream over multiple data streams. An operational model of a SIMD computer is shown in Figure 4.8.

SIMD Machine Model: An operational model of a SIMD computer is specified by a 5-tuple:

M = (N, C, I, M, R)

where
(1) N is the number of processing elements (PEs) in the machine. For example, the Illiac IV has 64 PEs and the Connection Machine CM-2 uses 65,536 PEs.
(2) C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
(3) I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include arithmetic, logic, data routing, masking, and other local operations executed by each active PE over data within that PE.
(4) M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
(5) R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.

Figure 4.8 Operational model of SIMD computer
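The roles of N, I, M, and R in the 5-tuple can be made concrete with a toy simulator. The sketch below is a hypothetical illustration, not a model of any real machine: the class `SimdMachine`, its method names, and the ring-shift routing function are all assumptions. Each PE holds a single value, a mask enables or disables PEs, a broadcast instruction runs on the enabled PEs only, and one routing function shifts data around a ring.

```python
from dataclasses import dataclass, field


@dataclass
class SimdMachine:
    n: int                                        # N: number of processing elements
    memory: list = field(default_factory=list)    # one local value per PE
    mask: list = field(default_factory=list)      # True means the PE is enabled

    def __post_init__(self):
        if not self.memory:
            self.memory = [0] * self.n
        if not self.mask:
            self.mask = [True] * self.n

    def set_mask(self, predicate):
        """M: choose a masking scheme from a predicate on the PE index."""
        self.mask = [bool(predicate(pe)) for pe in range(self.n)]

    def broadcast(self, op):
        """I: the CU broadcasts one operation; only enabled PEs execute it."""
        self.memory = [op(v) if on else v for v, on in zip(self.memory, self.mask)]

    def route_shift(self, distance=1):
        """R: one routing pattern; every PE receives from the PE `distance` to its left.

        For simplicity this sketch applies routing to all PEs, ignoring the mask.
        """
        old = self.memory
        self.memory = [old[(pe - distance) % self.n] for pe in range(self.n)]


if __name__ == "__main__":
    machine = SimdMachine(n=8, memory=list(range(8)))
    machine.set_mask(lambda pe: pe % 2 == 0)   # enable even-numbered PEs
    machine.broadcast(lambda v: v * 10)        # same instruction, many data items
    print(machine.memory)
    machine.route_shift(1)                     # ring shift by one position
    print(machine.memory)
```

The set C (instructions executed by the control unit itself) corresponds here to the ordinary Python statements that drive the machine object.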

4.2.6 PRAM and VLSI Models
Theoretical models of parallel computers are abstracted from the physical models studied in previous sections. These models are often used by algorithm designers and VLSI device/chip developers. The ideal models provide a convenient framework for developing parallel algorithms without worrying about the implementation details or physical constraints.

The models can be applied to obtain theoretical performance bounds on parallel computers or to estimate VLSI complexity on chip area and execution time before the chip is fabricated. The abstract models are also useful in scalability and programmability analysis, when real machines are compared with an idealized parallel machine without worrying about communication overhead among processing nodes.

4.2.6.1 Parallel Random-Access Machines
Theoretical models of parallel computers are presented below. We define first the time and space complexities. Computational tractability is reviewed for solving difficult problems on computers. Then we introduce the random-access machine (RAM), parallel random-access machine (PRAM), and variants of PRAMs. These complexity models facilitate the study of asymptotic behavior of algorithms implementable on parallel computers.

Time and Space Complexities: The complexity of an algorithm for solving a problem of size s on a computer is determined by the execution time and the storage space required. The time complexity is a function of the problem size. The time complexity function in order notation is the asymptotic time complexity of the algorithm. Usually, the worst-case time complexity is considered. For example, a time complexity g(s) is said to be O(f(s)), read "order f(s)", if there exist positive constants c and s0 such that

$$g(s) \le c \cdot f(s) \quad \text{for all } s > s_0$$

The space complexity can be similarly defined as a function of the problem size s. The asymptotic space complexity refers to the data storage of large problems. Note that the program (code) storage requirement and the storage for input data are not considered in this.

The time complexity of a serial algorithm is simply called serial complexity. The time complexity of a parallel algorithm is called parallel complexity. Intuitively, the parallel complexity should be lower than the serial complexity, at least asymptotically. We consider only deterministic algorithms, in which every operational step is uniquely defined in agreement with the way programs are executed on real computers.

A nondeterministic algorithm contains operations resulting in one outcome in a set of possible outcomes. There exist no real computers that can execute nondeterministic algorithms.

PRAM Models: Conventional uniprocessor computers have been modeled as random-access machines by Shepherdson and Sturgis. A parallel random-access machine (PRAM) model has been developed by Fortune and Wyllie for modeling idealized parallel computers with zero synchronization or memory access overhead. This PRAM model will be used for parallel algorithm development and for scalability and complexity analysis.

Figure 4.9 PRAM model of a multiprocessor system

An n-processor PRAM (Figure 4.9) has a globally addressable memory. The shared memory can be distributed among the processors or centralized in one place. The n processors operate on a synchronized read-memory, compute, and write-memory cycle. With shared memory, the model must specify how concurrent read and concurrent write of memory are handled. Four memory-update options are possible.
• Exclusive read (ER): This allows at most one processor to read from any memory location in each cycle, a rather restrictive policy.
• Exclusive write (EW): This allows at most one processor to write into a memory location at a time.
• Concurrent read (CR): This allows multiple processors to read the same information from the same memory cell in the same cycle.
• Concurrent write (CW): This allows simultaneous writes to the same memory location. In order to avoid confusion, some policy must be set up to resolve the write conflicts.
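The four memory-update options can be read as rules about which accesses may share a memory location within one synchronized cycle. The checker below is a minimal sketch of that reading; the function name `conflicts_in_cycle` and the way a cycle is represented (a list of read addresses and a list of address/value write pairs, one entry per processor) are assumptions made for the illustration.

```python
from collections import Counter


def conflicts_in_cycle(reads, writes, read_option="ER", write_option="EW"):
    """Return the accesses in one PRAM cycle that violate the chosen options.

    reads  -- addresses read in this cycle, one entry per reading processor
    writes -- (address, value) pairs written in this cycle, one per writer
    """
    if read_option not in ("ER", "CR") or write_option not in ("EW", "CW"):
        raise ValueError("options must be ER or CR, and EW or CW")
    found = []
    if read_option == "ER":
        # Exclusive read: no address may be read by more than one processor.
        read_counts = Counter(reads)
        found += [("read", addr) for addr, n in sorted(read_counts.items()) if n > 1]
    if write_option == "EW":
        # Exclusive write: no address may be written by more than one processor.
        write_counts = Counter(addr for addr, _value in writes)
        found += [("write", addr) for addr, n in sorted(write_counts.items()) if n > 1]
    return found


if __name__ == "__main__":
    reads = [3, 3, 5]                     # two processors read cell 3
    writes = [(7, 1), (7, 2), (9, 0)]     # two processors write cell 7
    for r_opt, w_opt in [("ER", "EW"), ("CR", "EW"), ("CR", "CW")]:
        print(r_opt, w_opt, conflicts_in_cycle(reads, writes, r_opt, w_opt))
```

Under CW the checker reports nothing for cell 7, but the model still has to decide which value survives. That is the job of a separate write-conflict resolution policy.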

Various combinations of the above options lead to several variants of the PRAM model as specified below.

PRAM Variants: Described below are four variants of the PRAM model, depending on how the memory reads and writes are handled.
(1) The EREW-PRAM model: This model forbids more than one processor from reading or writing the same memory cell simultaneously. This is the most restrictive PRAM model proposed.
(2) The CREW-PRAM model: The write conflicts are avoided by mutual exclusion. Concurrent reads to the same memory location are allowed.
(3) The ERCW-PRAM model: This allows exclusive read or concurrent writes to the same memory location.
(4) The CRCW-PRAM model: This model allows either concurrent reads or concurrent writes at the same time. Since CR does not create a conflict problem, variants differ mainly in how they handle the CW conflicts. The conflicting writes are resolved by one of the following four policies:
• Common: All simultaneous writes store the same value to the hot-spot memory location.
• Arbitrary: Any one of the values written may remain; the others are ignored.
• Minimum: The value written by the processor with the minimum index will remain.
• Priority: The values being written are combined using some associative function, such as summation or maximum (other texts often call this the combining policy).

4.2.6.2 VLSI Complexity Model
Parallel computers rely on the use of VLSI chips to fabricate the major components such as processor arrays, memory arrays, and large-scale switching networks. An AT² model for two-dimensional VLSI chips is presented below, based on the work of Clark Thompson. Three lower bounds on VLSI circuits are interpreted by Jeffrey Ullman. The bounds are obtained by setting limits on memory, I/O, and communication for implementing parallel algorithms with VLSI chips.

The AT² Model: Let A be the chip area and T be the latency for completing a given computation using a VLSI circuit chip. Let s be the problem size involved in the computation. Thompson stated in his doctoral thesis that for certain computations, there exists a lower bound f(s) such that

$$A \times T^2 \ge O(f(s)) \qquad (4.6)$$

The chip area A is a measure of the chip's complexity. The latency T is the time required from when inputs are applied until all outputs are produced for a single problem instance. The chip is represented by the base area in the two horizontal dimensions. The vertical dimension corresponds to time. Therefore, the three-dimensional solid represents the history of the computation performed by the chip.

Memory Bound on Chip Area A: There are many computations which are memory-bound, due to the need to process large data sets. To implement this type of computation in silicon, one is limited by how densely information (bit cells) can be placed on the chip. As depicted in Figure 4.10a, the memory requirement of a computation sets a lower bound on the chip area A. The amount of information processed by the chip can be visualized as information flow upward across the chip area. Each bit can flow through a unit area of the horizontal chip slice. Thus, the chip area bounds the amount of memory bits stored on the chip.

I/O Bound on Volume AT: The volume of the rectangular cube is represented by the product AT. As information flows through the chip for a period of time T, the number of input bits cannot exceed the volume. This provides an I/O-limited lower bound on the product AT, as demonstrated in Figure 4.10a. The area A corresponds to data into and out of the entire surface of the silicon chip. This area measure sets the maximum I/O limit rather than using peripheral I/O pads as seen in conventional chips. The height T of the volume can be visualized as a number of snapshots on the chip, as computing time elapses. The volume represents the amount of information flowing through the chip during the entire course of the computation.

Bisection Communication Bound, √A·T: Figure 4.10b depicts a communication-limited lower bound on the bisection area √A·T. The bisection is represented by the vertical slice cutting across the shorter dimension of the chip area. The distance of this dimension is at most √A for a square chip. The height of the cross section is T. The bisection area represents the maximum amount of information exchange between the two halves of the chip circuit during the time period T. The cross-section area √A·T limits the communication bandwidth of a computation. VLSI complexity theoreticians have used the square of this measure, AT², as the lower bound.

Figure 4.10 The AT² complexity model of two-dimensional VLSI chips: (a) Memory-limited bound on chip area A and I/O-limited bound on chip history represented by the volume AT; (b) Communication-limited bound on the bisection √A·T

Charles Seitz has given another interpretation of the AT² result. He considers the area-time product AT the cost of a computation, which can be expected to vary as 1/T. This implies that the cost of computation for a two-dimensional chip decreases with the execution time allowed. When three-dimensional (multilayer) silicon chips are used, Seitz asserted that the cost of computation, as limited by the volume-time product, would vary as 1/√T. This is because the bisection will vary as (AT)^(2/3) for 3-D chips instead of as √A·T for 2-D chips.

CHECK YOUR PROGRESS

1. Conventional sequential machines are called
   (a) SISD (b) SIMD (c) MISD (d) MIMD
2. MISD architecture is also known as
   (a) vector computer (b) systolic array (c) parallel computer (d) sequential structure
3. The CPU time needed to execute the program is
   (a) T = Ic × CPI × τ (b) T = Ic × CPI / τ (c) T = Ic × τ / CPI (d) T = Ic × CPI × τ + …
4. A processor having equal access to all peripheral devices is called
   (a) symmetric processor (b) asymmetric processor (c) attached processor (d) associative processor
5. The PRAM model which allows either concurrent read or concurrent write at the same time is called
   (a) EREW model (b) CREW model (c) ERCW model (d) CRCW model

4.3 SUMMARY

• Flynn classifies computers into four categories of computer architectures.
• Intrinsic parallel computers are those that execute programs in MIMD mode.
• Conventional sequential computers are called SISD.
• Vector computers are equipped with scalar and vector hardware or appear as SIMD machines.
• The term parallel computers is reserved for MIMD machines.
• MISD is the systolic architecture.
• In a UMA multiprocessor model the physical memory is uniformly shared by all the processors.
• A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.
• In the COMA model remote cache access is assisted by the distributed cache directories.
• Variants of PRAM are EREW, CREW, ERCW, and CRCW.
• Parallel computers rely on the use of VLSI chips.

4.4 GLOSSARY

COMA : Cache-only memory architecture
MIMD : Multiple instruction multiple data stream
MIPS : Million instructions per second
MISD : Multiple instruction single data stream
NUMA : Nonuniform memory access
PRAM : Parallel random access machine
SISD : Single instruction single data stream
SIMD : Single instruction multiple data stream
Throughput : A measure of how many programs can be executed per second
UMA : Uniform memory access

4.5 REFERENCES

1. Thomas C. Bartee, "Computer architecture and logic design", TMH
2. Kai Hwang, "Advanced computer architecture: Parallelism, scalability, programmability", TMH
3. Hamacher, Vranesic, and Zaky, "Computer organization", TMH
4. Hayes, "Computer architecture and organization", THH

4.6 ANSWERS TO CHECK YOUR PROGRESS QUESTIONS

1. (a), 2. (b), 3. (a), 4. (a), 5. (d)