In the process of decoding the add instruction, the control sets the busy bit of F, sends the operands to the adder station, sets the tag field of F to the tag value of that station, and then carries out the addition. In the meantime, the decode of the mpy (multiply) instruction finds F busy. This implies that the mpy depends on the result of the add, so the tag of F is set to the multiply station M. When the add instruction is completed, the common data bus (CDB) finds that the result should be sent to M rather than to F. The multiplication starts when both operands are available. When it is done, the CDB finds that the result is destined for F and sends it there.

Pipeline hazards are caused by resource-usage conflicts among various instructions in the pipeline. Such hazards are triggered by interinstruction dependencies. In this section, we characterize various hazard conditions. Approaches to detect and resolve hazards are then introduced. Hazards discussed in this section are known as data-dependent hazards. Methods to cope with such hazards are needed in any type of lookahead processor, whether it is a synchronous pipeline or an asynchronous multiprocessing system.

When successive instructions overlap their fetch, decode, and execution through a pipeline processor, interinstruction dependencies may arise to prevent the sequential data flow in the pipeline. For example, an instruction may depend on the results of a previous instruction. Until the completion of the previous instruction, the present instruction cannot be initiated into the pipeline. In other instances, two stages of a pipeline may need to update the same memory location. Hazards of this sort, if not properly detected and resolved, could result in an interlock situation in the pipeline or produce unreliable results.

There are three classes of data-dependent hazards, according to the data update pattern: write-after-read (WAR) hazards, read-after-write (RAW) hazards, and write-after-write (WAW) hazards.

We use resource objects to refer to working registers, memory locations, and special flags. The contents of these resource objects are called data objects. Each instruction can be considered a mapping from a set of data objects to a set of data objects. The domain $D(I)$ of an instruction $I$ is the set of resource objects whose data objects may affect the execution of instruction $I$. The range $R(I)$ of an instruction $I$ is the set of resource objects whose data objects may be modified by the execution of instruction $I$. Obviously, the operands to be used in an instruction execution are read from its domain, and the results are written into its range.

In what follows, we consider the execution of the two instructions $I$ and $J$ in a program. Instruction $J$ appears after instruction $I$ in the program. There may be no instructions, or several instructions, between $I$ and $J$. A RAW hazard between the two instructions $I$ and $J$ may occur when $J$ attempts to read some data object that has been modified by $I$. A WAR hazard may occur when $J$ attempts to modify some data object that is read by $I$. A WAW hazard may occur if both $I$ and $J$ attempt to modify the same data object. Formally, the necessary conditions for these hazards are stated as follows:

$R(I) \cap D(J) \neq \emptyset$ for RAW
$R(I) \cap R(J) \neq \emptyset$ for WAW (3.18)
$D(I) \cap R(J) \neq \emptyset$ for WAR

Possible hazards for the four types of instructions are listed in Table 3.3. Recognizing the existence of possible hazards, computer designers wish to detect the hazard and then to resolve it effectively.
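The three necessary conditions can be checked mechanically once each instruction is described by its domain and range. The following is a minimal sketch under stated assumptions: the `Instr` record, the `hazards` function, and the register names are hypothetical illustrations, not part of any real instruction set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    domain: frozenset   # resource objects the instruction reads, D(I)
    range_: frozenset   # resource objects the instruction writes, R(I)


def hazards(i, j):
    """Return the hazard classes possible between earlier i and later j."""
    found = set()
    if i.range_ & j.domain:   # R(I) and D(J) overlap
        found.add("RAW")
    if i.range_ & j.range_:   # R(I) and R(J) overlap
        found.add("WAW")
    if i.domain & j.range_:   # D(I) and R(J) overlap
        found.add("WAR")
    return found


# Hypothetical pair: both instructions read and write register F.
add = Instr("ADD", frozenset({"F", "B1"}), frozenset({"F"}))
mpy = Instr("MPY", frozenset({"F", "B2"}), frozenset({"F"}))
print(sorted(hazards(add, mpy)))
```

Because these are only necessary conditions, a nonempty result signals a possible hazard; whether it materializes depends on the relative timing of the two instructions in the pipe.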
Hazard detection can be done in the instruction-fetch stage of a pipeline processor by comparing the domain and range of the incoming instruction with those of the instructions being processed in the pipe. Should any of the conditions in Eq. 3.18 be detected, a warning signal can be generated to prevent the hazard from taking place. Another approach is to allow the incoming instruction through the pipe and to distribute the detection to all the potential pipeline stages.

Once a hazard is detected, the system should resolve the interlock situation. Consider the instruction sequence {..., I, I + 1, ..., J, J + 1, ...} in which a hazard has been detected between the current instruction J and a previous instruction I. A straightforward approach is to stop the pipe and to suspend the execution of instructions J, J + 1, J + 2, ..., until the instruction I has passed the point of resource conflict. A more sophisticated approach is to suspend only instruction J and continue the flow of instructions J + 1, J + 2, ... down the pipe. Of course, the potential hazards due to the suspension of J should be continuously checked as instructions J + 1, J + 2, ... move ahead of J. Multilevel hazard detection may be encountered, requiring much more complex control mechanisms to resolve the hazards.

In order to avoid RAW hazards, IBM engineers developed a short-circuiting approach which gives a copy of the data object to be written directly to the instruction waiting to read the data. This concept was generalized into a technique known as data forwarding. A data-forwarding chain can be established in some cases. The internal forwarding and register-tagging techniques presented in the previous section should be helpful in resolving hazards in pipelines.

Table 3.3 Possible hazards for various instruction types

Section 3.1.3 described the space-time flow pattern of one complete data flow through the pipeline for one function evaluation. In a static pipeline, all initiations are characterized by the same reservation table. On the other hand, successive initiations for a dynamic pipeline may be characterized by a set of reservation tables, one per each function being evaluated. Figure 3.37 shows the reservation table for a unifunction pipeline. The multiple ×'s in a row pose the possibility of collisions. The number of time units between two initiations is called the latency, which may be any positive integer.
For a static pipeline, the latency is usually one, two, or greater. However, zero latency is allowed in dynamic pipelines between different functions. The sequence of latencies between successive initiations is called a latency sequence. A latency sequence that repeats itself is called a latency cycle. The procedure to choose a latency sequence is called a control strategy. A control strategy that always minimizes the latency between the current initiation and the very last initiation is called a greedy strategy. A greedy strategy is made independent of future initiations.

A collision occurs when two tasks are initiated with a latency (initiation interval) equal to the column distance between two ×'s on some row of the reservation table. The set of column distances between all possible pairs of ×'s on each row of the reservation table is called the forbidden set of latencies, F. The collision vector is a binary vector

$C = (C_n \cdots C_2 C_1)$

where $C_i = 1$ if $i \in F$ and $C_i = 0$ otherwise. For the reservation table in Figure 3.37a, the forbidden list is F = {1, 5, 6, 8} and C = (10110001), where n = 8 is the largest forbidden latency.

A shift register can be used to hold the collision vector for implementing a control strategy for successive task initiations in the pipeline. Upon initiation of the first task, the collision vector is loaded into the shift register as the initial state. The register is then shifted right one bit at a time, entering 0s from the left end. A collision-free initiation is allowed at time instant t + k if, and only if, a bit 0 is shifted out of the register after k shifts from time t.

A state diagram characterizes the successive initiations of tasks in the pipeline. Each state in the diagram is represented by the contents of the shift register after the proper number of shifts is made. In Figure 3.37b, the outgoing branches from the initial state (10110001) are labeled by the permissible latencies. By shifting the vector (10110001) right two bit positions and ORing the result with the initial collision vector, we obtain the new state (10111101), which is pointed to by the arc labeled 2.

The average latency of a cycle is the sum of its latencies (the period) divided by the number of states in the cycle. Any cycle can be entered from the initial state. The cycle consisting of states (10110111) and (10111011) in Figure 3.37b has two latencies, three and four. This cycle has a period equal to 7 = 3 + 4. The average latency of this cycle is 7/2 = 3.5.

Another cycle, which consists of the states (10110001), (10111101), and (10111111), has the three latencies 2, 2, and 7, with a period of 11. Its average latency equals 11/3 ≈ 3.67. The throughput of a pipeline is inversely proportional to the average latency.

A latency sequence is called permissible if no collisions exist in the successive initiations governed by the given latency sequence. The maximum throughput is achieved by an optimal scheduling strategy that achieves the minimum average latency (MAL) without collisions. Thus, the job-sequencing problem is equivalent to finding a permissible latency cycle with the MAL in the state diagram. For Figure 3.37b, MAL = (3 + 4)/2 = 3.5. The maximum number of ×'s in any single row of the reservation table is a lower bound of the MAL. In other words, the MAL is always greater than or equal to the maximum number of check marks in any row of the reservation table.

Table 3.4 Simple cycles in Figure 3.37b

Simple cycle    Average latency
(7)             7
(3, 7)          5
(3, 4)*         3.5
(4, 3, 7)       4.67
(4, 7)          5.5
(2, 7)          4.5
(2, 2, 7)*      3.67
(3, 4, 7)       4.67
* Greedy cycles

Simple cycles are those latency cycles in which each state appears only once per each iteration of the cycle. Listed in Table 3.4 are the simple cycles and their average latencies for the state diagram shown in Figure 3.37b. The greedy cycle (3, 4) has an average latency of 3.5, which is less than 4, the number of 1s in the initial collision vector.

The above concepts for static unifunction pipelines can be generalized to multifunction pipelines. A multifunction pipeline processor which can perform p distinct functions can be described by p reservation tables overlaid together. In order to perform multiple functions, the pipeline must be reconfigurable. One example is the arithmetic pipelines in the TI-ASC, which have many possible functional configurations. Each task to be initiated is associated with a function tag identifying the reservation table to be used.
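The collision-vector and state-diagram construction described for the unifunction pipeline can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, states are held as integers whose bit k−1 corresponds to $C_k$, and latencies larger than n (which always lead back to the initial state) are left implicit.

```python
def collision_vector(forbidden):
    """Return (C, n) where bit k-1 of C is 1 for each forbidden latency k."""
    n = max(forbidden)
    c = 0
    for k in forbidden:
        c |= 1 << (k - 1)
    return c, n


def state_diagram(forbidden):
    """Map every reachable state to {permissible latency: next state}."""
    initial, n = collision_vector(forbidden)
    arcs, todo = {}, [initial]
    while todo:
        state = todo.pop()
        if state in arcs:
            continue
        arcs[state] = {}
        for k in range(1, n + 1):
            if not (state >> (k - 1)) & 1:      # bit k is 0: no collision
                nxt = (state >> k) | initial    # shift right k, OR with C
                arcs[state][k] = nxt
                todo.append(nxt)
    return initial, n, arcs


def greedy_cycle(forbidden):
    """Follow the smallest permissible latency from each state until a repeat."""
    initial, n, arcs = state_diagram(forbidden)
    seen, path, state = {}, [], initial
    while state not in seen:
        seen[state] = len(path)
        if arcs[state]:
            k = min(arcs[state])
            nxt = arcs[state][k]
        else:                                   # only latencies > n remain
            k, nxt = n + 1, initial
        path.append(k)
        state = nxt
    return path[seen[state]:]


initial, n, arcs = state_diagram({1, 5, 6, 8})
print(format(initial, "08b"), sorted(arcs[initial]))
cycle = greedy_cycle({1, 5, 6, 8})
print(cycle, sum(cycle) / len(cycle))
```

With the forbidden list {1, 5, 6, 8}, the sketch yields the initial state 10110001, the outgoing latencies 2, 3, 4, and 7, and the greedy cycle (2, 2, 7) entered from the initial state.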
Collisions may occur between two or more tasks with the same function tag or with distinct function tags. For a multifunction pipeline, the reservation tables of the individual functions can be displayed together, each with a different function tag, in an overlaid reservation table. Forbidden latencies for a multifunction pipeline are then derived from the overlaid table. Tremendous control overhead is needed to reconfigure such a pipeline dynamically. Systematic procedures are yet to be developed for designing dynamically reconfigurable multifunction pipelines.

3.4 VECTOR PROCESSING REQUIREMENTS

In this section, we explain the basic concepts of vector processing and the necessary implementation requirements. We distinguish vector processing from scalar processing and describe the characteristics of vector instructions and data.

3.4.1 Characteristics of Vector Processing

A vector operand contains an ordered set of n elements, where n is called the length of the vector. Each element in a vector is a scalar quantity. Vector instructions can be classified into four primitive types:

$f_1: V \to V$
$f_2: V \to S$
$f_3: V \times V \to V$
$f_4: V \times S \to V$

where V and S denote a vector operand and a scalar operand, respectively. The mappings $f_1$ and $f_2$ are unary operations, and $f_3$ and $f_4$ are binary operations. VADD (vector add) is an $f_3$ operation. The dot product of two vectors is generated by applying $f_3$ (vector multiply) and $f_2$ (vector sum) operations in sequence. Listed in Table 3.5 are some representative vector operations that can be found in a modern vector processor. Pipelined implementation of the four basic vector operations is illustrated in Figure 3.41. Note that a feedback connection is needed in the $f_2$ operation.

Table 3.5 Some representative vector instructions

Type   Mnemonic   Description (I = 1 through N)
f1     VSIN       Vector sine: B(I) = sin(A(I))
f1     VCOM       Vector complement: A(I) = complement of A(I)
f2     VSUM       Vector summation: S = sum of A(I)
f3     VADD       Vector add: C(I) = A(I) + B(I)
f3     VMPY       Vector multiply: C(I) = A(I) * B(I)
f3     VAND       Vector AND: C(I) = A(I) and B(I)
f3     VLAR       Vector larger: C(I) = max(A(I), B(I))
f4     SADD       Vector-scalar add: B(I) = S + A(I)

Figure 3.41 Four vector instruction types for pipelined processors.

Some special instructions may be used to facilitate the manipulation of vector data. A boolean vector can be generated as a result of comparing two vectors, and it can be used as a masking vector for enabling or disabling component operations in a vector instruction. A compress instruction will shorten a vector under the control of a masking vector. A merge instruction combines two vectors under the control of a masking vector. Compress and merge are special $f_1$ and $f_3$ operations because the resulting operands may have a length different from that of the input operands.

Example 3.3 Let X = (2, 5, 8, 7) and Y = (9, 3, 6, 4). After the compare instruction B = X > Y is executed, the boolean vector B = (0, 1, 1, 1) is generated.

Let X = (1, 2, 3, 4, 5, 6, 7, 8) and B = (1, 0, 1, 0, 1, 0, 1, 0). After the execution of the compress instruction Y = X(B), the compressed vector Y = (1, 3, 5, 7) is generated.

Let X = (1, 2, 4, 8), Y = (3, 5, 6, 7), and B = (1, 1, 0, 1, 0, 0, 0, 1). After the merge instruction Z = X, Y, (B), the result is Z = (1, 2, 3, 4, 5, 6, 7, 8).
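The compare, compress, and merge examples can be reproduced with plain Python lists. This is a small sketch for checking the arithmetic, not a model of any machine's instruction set; the function names are hypothetical.

```python
def compare_gt(x, y):
    """Boolean vector B with B(i) = 1 where X(i) > Y(i), else 0."""
    return [1 if a > b else 0 for a, b in zip(x, y)]


def compress(x, mask):
    """Keep only the elements of X whose mask bit is 1."""
    return [a for a, m in zip(x, mask) if m]


def merge(x, y, mask):
    """Take the next unused element of X for each 1 in the mask, of Y for each 0."""
    ones = sum(mask)
    if ones != len(x) or len(mask) - ones != len(y):
        raise ValueError("mask does not match the operand lengths")
    xi, yi = iter(x), iter(y)
    return [next(xi) if m else next(yi) for m in mask]


print(compare_gt([2, 5, 8, 7], [9, 3, 6, 4]))
print(compress([1, 2, 3, 4, 5, 6, 7, 8], [1, 0, 1, 0, 1, 0, 1, 0]))
print(merge([1, 2, 4, 8], [3, 5, 6, 7], [1, 1, 0, 1, 0, 0, 0, 1]))
```

Compress returns a vector shorter than its input, and merge returns one longer than either input, which is why the two are treated as special cases of the unary and binary vector types.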
In the merge example, the first 1 in B indicates that Z(1) is selected from the first element of X. Similarly, the first 0 in B indicates that Z(3) is selected from the first element of Y.

In general, machine operations suitable for pipelining should have the following properties:

a. Identical processes (or functions) are repeatedly invoked many times, each of which can be subdivided into subprocesses (or subfunctions).
b. Successive operands are fed through the pipeline segments and require as few buffers and local controls as possible.
c. Operations executed by distinct pipelines should be able to share expensive resources, such as memories and buses, in the system.

These characteristics explain why most vector processors have pipeline structures. Vector instructions need to perform the same operation on different data sets repeatedly. This is not true for scalar processing over a single pair of operands. One obvious advantage of vector processing over scalar processing is the elimination of the overhead caused by the loop-control mechanism. Because of the startup delay in a pipeline, a vector processor should perform better with longer vectors.

Vector instructions are usually specified by the following fields:

1. The operation code must be specified in order to select the functional unit or to reconfigure a multifunctional unit to perform the specified operation. Usually, microcode control is used to set up the required resources.
2. For a memory-reference instruction, the base addresses are needed for both source operands and result vectors. If the operands and results are located in the vector register file, the designated vector registers must be specified.
3. The address increment between the elements must be specified. Some computers, like the STAR-100, restrict the elements to be consecutively stored in the main memory, i.e., the increment is always 1. Some other computers can have a variable increment, which offers higher flexibility.
4. The address offset relative to the base address should be specified. Using the base address and the offset, the effective memory address can be calculated. The offset, either positive or negative, allows the use of skewed vectors to achieve parallel accesses.
5. The vector length is needed to determine the termination of a vector instruction. A masking vector may be used to mask off some of the elements without changing the contents of the original vectors.

We can classify pipeline vector computers into two architectural configurations according to where the operands are retrieved in a vector processor. One class has the memory-to-memory architecture, in which source operands, intermediate results, and final results are retrieved directly from the main memory. For memory-to-memory vector instructions, the base address, the offset, the increment, and the vector length must be specified in order to enable streams of data transfers between the main memory and the pipelines. Vector instructions in the TI-ASC, the CDC STAR-100, and the Cyber-205 have a memory-to-memory format. The other class has a register-to-register architecture, in which operands and results are retrieved indirectly from the main memory through the use of a large number of vector or scalar registers. Vector instructions in the Cray-1 and the Fujitsu VP-200 use a register-to-register format.

To examine the efficiency of vector processing over scalar processing, we compare the following two programs, one written for vector processing and the other written for scalar processing.

Example 3.4 In a conventional scalar processor, the Fortran DO loop

        DO 100 I = 1, N
        A(I) = B(I) + C(I)
    100 B(I) = 2 * A(I+1)

is implemented by the following sequence of scalar operations:

        INITIALIZE I = 1
     10 READ B(I)
        READ C(I)
        ADD B(I) + C(I)
        STORE A(I) ← B(I) + C(I)
        READ A(I+1)
        MULTIPLY 2 * A(I+1)
        STORE B(I) ← 2 * A(I+1)
        INCREMENT I ← I + 1
        IF I ≤ N GO TO 10

In a vector processor, the above DO loop operations can be vectorized into three vector instructions executed in sequence.
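One way to see what a three-instruction vectorization has to preserve is to run both forms side by side. The sketch below is an assumption-laden illustration: the temporary `temp` and the ordering of the three steps are chosen here so that the result matches the scalar loop exactly (the shifted operand A(2:N+1) is captured before A is overwritten), and the lists use 0-based indexing with `a` holding N + 1 elements.

```python
def scalar_loop(a, b, c):
    """Element-by-element form of the DO loop (0-based; a has len(b) + 1 items)."""
    a, b = list(a), list(b)
    for i in range(len(b)):
        a[i] = b[i] + c[i]
        b[i] = 2 * a[i + 1]      # a[i + 1] still holds its original value here
    return a, b


def vector_form(a, b, c):
    """Three whole-vector steps giving the same result as scalar_loop."""
    n = len(b)
    temp = a[1:n + 1]                                     # TEMP(1:N) = A(2:N+1)
    new_a = [x + y for x, y in zip(b, c)] + list(a[n:])   # A(1:N) = B(1:N) + C(1:N)
    new_b = [2 * t for t in temp]                         # B(1:N) = 2 * TEMP(1:N)
    return new_a, new_b


a0 = [0, 1, 2, 3, 4]
b0 = [1, 2, 3, 4]
c0 = [10, 20, 30, 40]
print(scalar_loop(a0, b0, c0))
print(vector_form(a0, b0, c0))
```

The vector form issues three operations regardless of N, whereas the scalar form repeats its read, compute, store, increment, and branch steps N times; this is the loop-control overhead that vector processing eliminates.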
The parameter in parentheses indicates the degree of parallelism at each level:

Parallel algorithm (A)
High-level language (L)
Efficient object code (O)
Target machine code (M)

The degree of parallelism refers to the number of independent operations that can be performed simultaneously. In the ideal situation with full parallelism, we should expect A ≥ L ≥ O ≥ M, as illustrated in Figure 3.42. We first examine vectorization in high-level languages and then discuss desired features in vectorizing compilers.

3.4.2 Multiple Vector Task Dispatching

A parallel task-scheduling model is presented for multi-pipeline vector processors. This model can be applied to explore maximum concurrency in vector supercomputers. The functional block diagram of a modern multiple-pipeline vector computer is shown in Figure 3.43.

A vector task system contains a set of vector instructions (tasks) with a precedence relation determined only by data dependencies among the vector tasks. A vector may be partitioned into many subvectors to be processed by several pipelines concurrently. It has been proved by Hwang and Su (1983) that the multi-pipeline scheduling problem is computationally hard. Heuristic scheduling algorithms are thus desirable for parallel vector processing.

We restrict ourselves to homogeneous vector pipelines and concentrate on scheduling vector tasks. The time required to complete the execution of a single vector task is measured by $t_0 + \tau$, where $t_0$ is the pipeline overhead time due to startup and flushing delays and $\tau = t_l L$ is the production delay. Here $t_l$ is the average latency between two successive operand pairs, and $L$ is the vector length.

Given a task system, we wish to schedule the vector tasks among m pipelines. Each pipeline performs only one subtask at a time. The finish time of a parallel schedule for an n-task system is defined by

$\omega = \max\{F(T_1), F(T_2), \ldots, F(T_n)\}$

where $F(T_i)$ is the finish time of vector task $T_i$. The purpose is to find a good parallel schedule such that $\omega$ can be minimized. This deterministic scheduling concept is clarified by the following example.

Example 3.5 Given a vector task system as specified in Figure 3.44a, with four tasks $T_1, T_2, T_3, T_4$. The delays are marked beside each node of the task graph. We want to schedule the four tasks on two (m = 2) pipelines. A parallel schedule is shown in Figure 3.44b, where the shaded area denotes the idle periods of the pipelines. The vector task $T_1$ is partitioned into the two subtasks $T_{11}$ and $T_{12}$.

STRUCTURES AND ALGORITHMS FOR ARRAY PROCESSORS

This chapter deals with the interconnection structures and parallel algorithms for SIMD array processors and associative processors. The various organizations and control mechanisms of array processors are presented first. Interconnection networks used in array processors will be characterized by their routing functions and implementation methods. We then study the structure of associative memory and parallel search in associative array processors. SIMD algorithms are presented for matrix manipulation, parallel sorting, fast Fourier transform, and associative search and retrieval operations.

5.1 SIMD ARRAY PROCESSORS

A synchronous array of parallel processors is called an array processor, which consists of multiple processing elements (PEs) under the supervision of one control unit (CU). An array processor can handle single instruction and multiple data (SIMD) streams. In this sense, array processors are also known as SIMD computers.
SIMD machines are especially designed to perform vector computations over matrices or arrays of data. In this book, the terms array processors, parallel processors, and SIMD computers are used interchangeably. SIMD computers appear in two basic architectural organizations: array processors, using random-access memory; and associative processors, using content-addressable (or associative) memory. In this section we deal primarily with array processors. We will study associative processors as a special type of array processor whose PEs correspond to the words of an associative memory.

5.1.1 SIMD Computer Organizations

In general, an array processor may assume one of two slightly different configurations. Configuration I was structured in the well-publicized Illiac IV computer. This configuration is structured with N synchronized PEs, all of which are under the control of one CU. Each $PE_i$ is essentially an arithmetic logic unit with attached working registers and local memory $PEM_i$ for the storage of distributed data. The CU also has its own main memory for the storage of programs. The system and user programs are executed under the control of the CU. The user programs are loaded into the CU memory from an external source. The function of the CU is to decode all the instructions and determine where the decoded instructions should be executed. Scalar or control-type instructions are directly executed inside the CU. Vector instructions are broadcast to the PEs for distributed execution to achieve spatial parallelism through duplicate arithmetic units (PEs). All the PEs perform the same function synchronously under the command of the CU. Vector operands are distributed to the PEMs before parallel execution in the array of PEs.

Masking schemes are used to control the status of each PE during the execution of a vector instruction. Each PE may be either active or disabled during an instruction cycle. A masking vector is used to control the status of all PEs. In other words, not all the PEs need to participate in the execution of a vector instruction. Only enabled PEs perform computation. Data exchanges among the PEs are done via an inter-PE communication network, which is under the control of the CU.

An array processor is normally interfaced to a host computer through the control unit. The functions of the host computer include resource management and peripheral and I/O supervision. The control unit of the processor array directly supervises the execution of programs, whereas the host machine performs the executive and I/O functions with the outside world. In this sense, an array processor can be considered a back-end, attached computer, similar in function to the pipeline attached processors studied in Chapter 4.

Another possible way of constructing an array processor is Configuration II, which differs from Configuration I in two aspects. First, the local memories attached to the PEs are now replaced by parallel memory modules shared by all the PEs through an alignment network. Second, the inter-PE permutation network is replaced by the inter-PE memory-alignment network, which is again controlled by the CU. A good example of a Configuration II SIMD machine is the Burroughs Scientific Processor (BSP). There are N PEs and P memory modules in Configuration II. The two numbers are not necessarily equal. In fact, they have been chosen to be relatively prime. The alignment network is a path-switching network between the PEs and the parallel memories. Such an alignment network is desired to allow conflict-free accesses of the shared memories by as many PEs as possible.

Array processors became well publicized with the hardware-software development of the Illiac IV system. Since then, many SIMD machines have been proposed or built, including the Parallel Element Processing Ensemble (PEPE), the Burroughs Scientific Processor (BSP), and the Goodyear Massively Parallel Processor (MPP).

Formally, an SIMD computer is characterized by the following set of parameters:

$C = (N, F, I, M)$

where N is the number of PEs in the system (for example, the Illiac IV has N = 64, the BSP has N = 16, and the MPP has N = 16,384); F is a set of data-routing functions provided by the interconnection network or by the alignment network; I is the set of machine instructions for scalar-vector, data-routing, and network-manipulation operations; and M is the set of masking schemes, where each mask partitions the set of PEs into the two disjoint subsets of enabled PEs and disabled PEs. This model provides a common basis for evaluating different SIMD machines.

Attached processors such as the IBM 3838 and the Datawest MATP are pipelined for array processing. They are not SIMD machines as defined above. The reason that these pipeline attached processors are commonly known as "array" processors lies in the fact that they are used for processing arrays of data. Details of the Illiac IV, BSP, MPP, and multiple-SIMD computers using a shared pool will be treated in Chapter 6.

5.1.2 Masking and Data-Routing Mechanisms

In this chapter, we consider only Configuration I of an SIMD computer. Each $PE_i$ is a processor (Figure 5.2) with its own memory $PEM_i$; a set of working registers and flags, namely $A_i$, $B_i$, $C_i$, and $S_i$; an arithmetic logic unit; a local index register $I_i$; an address register $D_i$; and a data-routing register $R_i$. The $R_i$ of each $PE_i$ is connected to the $R_j$ of other PEs via the interconnection network. When data transfer among PEs occurs, it is the contents of the $R_i$ registers that are being transferred. To facilitate future illustrations, we assume $N = 2^m$, so that m binary digits are needed to encode the address of a PE.

Some array processors may use two routing registers, one for input and the other for output. We will simply consider the use of one $R_i$ per PE, in which the inputs and outputs of $R_i$ are totally isolated by using master-slave flip-flops. Each $PE_i$ is either in the active or in the inactive mode during each instruction cycle. If a $PE_i$ is active, it executes the instruction broadcast to it by the CU. If it is inactive, it will not execute the instructions broadcast to it. The masking schemes are used to specify the status flag $S_i$ of $PE_i$. The convention $S_i = 1$ is chosen for an active $PE_i$ and $S_i = 0$ for an inactive $PE_i$. In the CU, there are a global index register I and a masking register M. The M register has N bits. The ith bit of M will be denoted as $M_i$. The collection of $S_i$ flags for i = 0, 1, 2, ..., N − 1 forms a status register S for all the PEs. Note that the bit patterns in registers M and S are exchangeable under the control of the CU.

In the second step of the summation example, the intermediate sums are routed from $R_i$ to $R_{i+2}$ for i = 0 to 5. In the final step, the intermediate sums are routed from $R_i$ to $R_{i+4}$ for i = 0 to 3. Consequently, $PE_k$ has the final value of S(k) for k = 0, 1, ..., 7, as shown by the last column in Figure 5.3.

As far as the data-routing operations are concerned, $PE_7$ is not involved (receiving but not transmitting) in step 1. $PE_7$ and $PE_6$ are not involved in step 2. Also $PE_7$, $PE_6$, $PE_5$, and $PE_4$ are not involved in step 3. These unwanted PEs are masked off during the corresponding steps. During the addition operations, $PE_0$ is disabled in step 1; $PE_0$ and $PE_1$ are made inactive in step 2; and $PE_0$, $PE_1$, $PE_2$, and $PE_3$ are masked off in step 3. The PEs that are masked off in each step depend on the operation (data routing or arithmetic addition) to be performed. Therefore, the masking patterns keep changing in the different operation cycles. Note that the masking and routing operations will be much more complicated when the vector length n > N.

Array processors are special-purpose computers. The array of PEs are passive arithmetic units waiting to be called for parallel-computation duties. We will describe the detailed structures of the CU and the PEs of specific SIMD computers in Chapter 6. However, the principles of PE masking, global versus local indexing, and data permutation are not much changed in the different machines.

5.1.3 Inter-PE Communications

There are fundamental decisions in determining the appropriate architecture of an interconnection network for an SIMD machine. The decisions are made between operation modes, control strategies, switching methodologies, and network topologies.

Operation mode. The operation modes of interconnection networks can be classified into three categories: synchronous, asynchronous, and combined. All existing SIMD machines choose the synchronous operation mode, in which lock-step operations among all PEs are enforced.

Control strategy. A typical interconnection network consists of a number of switching elements and interconnecting links. Interconnection functions are realized by properly setting control of the switching elements. The control-setting function can be managed by a centralized controller or by the individual switching element. The latter strategy is called distributed control and the first strategy corresponds to centralized control. Most existing SIMD interconnection networks choose centralized control of all switch elements by the control unit.

Switching methodology. The two major switching methodologies are circuit switching and packet switching. In circuit switching, a physical path is actually established between a source and a destination. In packet switching, data is put in a packet and routed through the interconnection network without establishing a physical connection path. In general, circuit switching is much more suitable for bulk data transmission, and packet switching is more efficient for many short data messages. Another option, integrated switching, includes the capabilities of both circuit switching and packet switching. Therefore, three switching methodologies can be identified: circuit switching, packet switching, and integrated switching. Packet-switched networks have been suggested mainly for MIMD machines.

Network topology. A network can be depicted by a graph in which nodes represent switching points and edges represent communication links. The topologies tend to be regular and can be grouped into two categories: static and dynamic. In a static topology, links between two processors are passive, and dedicated buses cannot be reconfigured for direct connections to other processors. On the other hand, links in the dynamic category can be reconfigured by setting the network's active switching elements.

The space of the interconnection networks can be represented by the cartesian product of the above four sets of design features: {operation mode} × {control strategy} × {switching methodology} × {network topology}. The decision to choose a particular interconnection network depends on the application demands, technology supports, and cost-effectiveness.

5.2 SIMD INTERCONNECTION NETWORKS

Various interconnection networks have been suggested for SIMD computers. In this section, we distinguish between single-stage, recirculating networks and multistage networks. The discussion here is on inter-PE communications. The interprocessor-memory networks are studied in Chapter 7 for MIMD operations.

5.2.1 Static Versus Dynamic Networks

The topological structure of an SIMD array processor is mainly characterized by the data-routing network used in interconnecting the processing elements. Formally, such an inter-PE communication network can be specified by a set of data-routing functions. If we identify the addresses of all the PEs in an SIMD machine by the set S = {0, 1, 2, ..., N − 1}, each routing function f is a bijection (a one-to-one and onto mapping) from S to S. When a routing function f is executed via the interconnection network, $PE_i$ copies the content of its $R_i$ register into the $R_{f(i)}$ register of $PE_{f(i)}$. This data-routing operation occurs in all active PEs simultaneously. An inactive PE may receive data from another PE if a routing function is executed, but it cannot transmit data. To pass data between PEs that are not directly connected in the network, the data must be passed through intermediate PEs by executing a sequence of routing functions through the interconnection network.

SIMD interconnection networks are classified into the following two categories based on network topologies: static networks and dynamic networks.

Static networks. Topologies in the static networks can be classified according to the dimensions required for layout. For illustration, one-dimensional, two-dimensional, three-dimensional, and hypercube topologies are shown in Figure 5.4. Examples of one-dimensional topologies include the linear array used for some pipeline architectures (Figure 5.4a). Two-dimensional topologies include the ring, star, tree, mesh, and systolic array. Three-dimensional topologies include the completely connected, chordal ring, 3 cube, and 3-cube-connected-cycle networks. A D-dimensional, W-wide hypercube contains W nodes in each dimension, and there is a connection to a node in each dimension. The mesh and the 3-cube architecture are actually two- and three-dimensional hypercubes, respectively. The cube-connected-cycle is a deviation of the hypercube. For example, the 3-cube-connected-cycle shown in Figure 5.4j is obtained from the 3 cube.

Dynamic networks. We consider two classes of dynamic networks: single-stage versus multistage, as described below separately.

Single-stage networks. A single-stage network is a switching network with N input selectors (IS) and N output selectors (OS). Each IS is essentially a 1-to-D demultiplexer and each OS is an M-to-1 multiplexer, where 1 ≤ D ≤ N and 1 ≤ M ≤ N. Note that the crossbar-switching network is a single-stage network with D = M = N. To establish a desired connecting path, different path control signals will be applied to all IS and OS selectors.

A single-stage network is also called a recirculating network. Data items may have to recirculate through the single stage several times before reaching their final destinations. The number of recirculations needed depends on the connectivity in the single-stage network. In general, the higher the hardware connectivity, the smaller the number of recirculations. The crossbar network is an extreme case in which only one circulation is needed to establish any connection path. However, the fully connected crossbar networks have a cost of $O(N^2)$, which may be prohibitive for large N. Most recirculating networks have a cost of $O(N \log N)$ or lower, which is definitely more cost-effective for large N.

Multistage networks. Many stages of interconnected switches form a multistage SIMD network. Multistage networks are described by three characterizing features: the switch box, the network topology, and the control structure. Many switch boxes are used in a multistage network. Each box is essentially an interchange device with two inputs and two outputs. There are four states of a switch box: straight, exchange, upper broadcast, and lower broadcast. A two-function switch box can assume either the straight or the exchange state. A four-function switch box can be in any one of the four legitimate states.

A multistage network is capable of connecting an arbitrary input terminal to an arbitrary output terminal. Multistage networks can be one-sided or two-sided. The one-sided networks have input-output ports on the same side. The two-sided multistage networks, which usually have an input side and an output side, can be divided into three classes: blocking, nonblocking, and rearrangeable. In blocking networks, simultaneous connections of more than one terminal pair may result in conflicts in the use of network communication links. Examples of blocking networks are the data manipulator, omega, flip, n cube, and baseline networks. A network is called a rearrangeable network if it can perform all possible connections between inputs and outputs by rearranging its existing connections, so that a connection path for a new input-output pair can always be established. A well-defined network, the Benes network, belongs to this class.
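The four states of a two-by-two switch box can be written down directly as a mapping from the two inputs to the two outputs. This is a minimal sketch with hypothetical state names and a hypothetical function name; it models a single box only, not a complete multistage network.

```python
def switch_box(state, upper, lower):
    """Return (upper_output, lower_output) of a 2-by-2 switch box."""
    if state == "straight":
        return upper, lower          # each input goes to its own output
    if state == "exchange":
        return lower, upper          # the two inputs are swapped
    if state == "upper_broadcast":
        return upper, upper          # upper input copied to both outputs
    if state == "lower_broadcast":
        return lower, lower          # lower input copied to both outputs
    raise ValueError(f"unknown switch state: {state}")


# A two-function box uses only the first two states;
# a four-function box may use all four.
for state in ("straight", "exchange", "upper_broadcast", "lower_broadcast"):
    print(state, switch_box(state, "a", "b"))
```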
