But after many years of hard work, the fact of the matter is that "they" didn't come. The overwhelming majority of programmers will not invest the effort to write parallel software. A common view is that you can't teach old programmers new tricks, so the problem will not be solved until the old programmers fade away and a new generation takes over.

But we don't buy into that defeatist attitude. Programmers have shown a remarkable ability to adopt new software technologies over the years. Look at how many old Fortran programmers are now writing elegant Java programs with sophisticated object-oriented designs. The problem isn't with old programmers. The problem is with old parallel computing experts and the way they've tried to create a pool of capable parallel programmers.

And that's where this book comes in. We want to capture the essence of how expert parallel programmers think about parallel algorithms and communicate that essential understanding in a way professional programmers can readily master. The technology we've adopted to accomplish this task is a pattern language. We made this choice not because we started the project as devotees of design patterns looking for a new field to conquer, but because patterns have been shown to work in ways that would be applicable in parallel programming. For example, patterns have been very effective in the field of object-oriented design. They have provided a common language experts can use to talk about the elements of design and have been extremely effective at helping programmers master object-oriented design.
This book contains our pattern language for parallel programming. The book opens with a couple of chapters to introduce the key concepts in parallel computing. These chapters focus on the parallel computing concepts and jargon used in the pattern language as opposed to being an exhaustive introduction to the field.

The pattern language itself is presented in four parts corresponding to the four phases of creating a parallel program:

* Finding Concurrency. The programmer works in the problem domain to identify the available concurrency and expose it for use in the algorithm design.
* Algorithm Structure. The programmer works with high-level structures for organizing a parallel algorithm.
* Supporting Structures. We shift from algorithms to source code and consider how the parallel program will be organized and the techniques used to manage shared data.
* Implementation Mechanisms. The final step is to look at specific software constructs for implementing a parallel program.

The patterns making up these four design spaces are tightly linked. You start at the top (Finding Concurrency), work through the patterns, and by the time you get to the bottom (Implementation Mechanisms), you will have a detailed design for your parallel program.

If the goal is a parallel program, however, you need more than just a parallel algorithm. You also need a programming environment and a notation for expressing the concurrency within the program's source code. Programmers used to be confronted by a large and confusing array of parallel programming environments. Fortunately, over the years the parallel programming community has converged around three programming environments.
* OpenMP. A simple language extension to C, C++, or Fortran to write parallel programs for shared-memory computers.
* MPI. A message-passing library used on clusters and other distributed-memory computers.
* Java. An object-oriented programming language with language features supporting parallel programming on shared-memory computers and standard class libraries supporting distributed computing.

Many readers will already be familiar with one or more of these programming notations, but for readers completely new to parallel computing, we've included a discussion of these programming environments in the appendixes.

In closing, we have been working for many years on this pattern language. Presenting it as a book so people can start using it is an exciting development for us. But we don't see this as the end of this effort. We expect that others will have their own ideas about new and better patterns for parallel programming. We've assuredly missed some important features that really belong in this pattern language. We embrace change and look forward to engaging with the larger parallel computing community to iterate on this language. Over time, we'll update and improve the pattern language until it truly represents the consensus view of the parallel programming community. Then our real work will begin: using the pattern language to guide the creation of better parallel programming environments and helping people to use these technologies to write parallel software. We won't rest until the day sequential software is rare.

ACKNOWLEDGMENTS

We started working together on this pattern language in 1998. It's been a long and twisted road, starting with a vague idea about a new way to think about parallel algorithms and finishing with this book. We couldn't have done this without a great deal of help.

Mani Chandy, who thought we would make a good team, introduced Tim to Beverly and Berna. The National Science Foundation, Intel Corp., and Trinity University have supported this research at various times over the years. Help with the patterns themselves came from the people at the Pattern Languages of Programs (PLoP) workshops held in Illinois each summer. The format of these
Chapter 1. A Pattern Language for Parallel Programming
Section 1.1. INTRODUCTION
Section 1.2. PARALLEL PROGRAMMING
Section 1.3. DESIGN PATTERNS AND PATTERN LANGUAGES
Section 1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
Chapter 2. Background and Jargon of Parallel Computing
Section 2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
Section 2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
Section 2.3. PARALLEL PROGRAMMING ENVIRONMENTS
Section 2.4. THE JARGON OF PARALLEL COMPUTING
Section 2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
Section 2.6. COMMUNICATION
Section 2.7. SUMMARY
Chapter 3. The Finding Concurrency Design Space
Section 3.1. ABOUT THE DESIGN SPACE
Section 3.2. THE TASK DECOMPOSITION PATTERN
Section 3.3. THE DATA DECOMPOSITION PATTERN
Section 3.4. THE GROUP TASKS PATTERN
Section 3.5. THE ORDER TASKS PATTERN
Section 3.6. THE DATA SHARING PATTERN
Section 3.7. THE DESIGN EVALUATION PATTERN
Section 3.8. SUMMARY
Chapter 4. The Algorithm Structure Design Space
Section 4.1. INTRODUCTION
Section 4.2. CHOOSING AN ALGORITHM STRUCTURE PATTERN
Section 4.3. EXAMPLES
Section 4.4. THE TASK PARALLELISM PATTERN
Section 4.5. THE DIVIDE AND CONQUER PATTERN
Section 4.6. THE GEOMETRIC DECOMPOSITION PATTERN
Section 4.7. THE RECURSIVE DATA PATTERN
Section 4.8. THE PIPELINE PATTERN
Section 4.9. THE EVENT-BASED COORDINATION PATTERN
Chapter 5. The Supporting Structures Design Space
Section 5.1. INTRODUCTION
Section 5.2. FORCES
Section 5.3. CHOOSING THE PATTERNS
Section 5.4. THE SPMD PATTERN
Section 5.5. THE MASTER/WORKER PATTERN
Section 5.6. THE LOOP PARALLELISM PATTERN
Section 5.7. THE FORK/JOIN PATTERN
Section 5.8. THE SHARED DATA PATTERN
Section 5.9. THE SHARED QUEUE PATTERN
Section 5.10. THE DISTRIBUTED ARRAY PATTERN
Section 5.11. OTHER SUPPORTING STRUCTURES
Chapter 6. The Implementation Mechanisms Design Space
Section 6.1. OVERVIEW
Section 6.2. UE MANAGEMENT
Section 6.3. SYNCHRONIZATION
Section 6.4. COMMUNICATION
Endnotes
Appendix A: A Brief Introduction to OpenMP
Section A.1. CORE CONCEPTS
Section A.2. STRUCTURED BLOCKS AND DIRECTIVE FORMATS
Section A.3. WORKSHARING
Section A.4. DATA ENVIRONMENT CLAUSES
Section A.5. THE OpenMP RUNTIME LIBRARY
Section A.6. SYNCHRONIZATION
Section A.7. THE SCHEDULE CLAUSE
Section A.8. THE REST OF THE LANGUAGE
Appendix B: A Brief Introduction to MPI
Section B.1. CONCEPTS
Section B.2. GETTING STARTED
Section B.3. BASIC POINT-TO-POINT MESSAGE PASSING
Section B.4. COLLECTIVE OPERATIONS
Section B.5. ADVANCED POINT-TO-POINT MESSAGE PASSING
Section B.6. MPI AND FORTRAN
Section B.7. CONCLUSION
Appendix C: A Brief Introduction to Concurrent Programming in Java
Section C.1. CREATING THREADS
Section C.2. ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
Section C.3. SYNCHRONIZED BLOCKS
Section C.4. WAIT AND NOTIFY
Section C.5. LOCKS
Section C.6. OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
Section C.7. INTERRUPTS
Glossary
Bibliography
About the Authors
Index
1.1. INTRODUCTION
Computers are used to model physical systems in many fields of science, medicine, and engineering. Modelers, whether trying to predict the weather or render a scene in the next blockbuster movie, can usually use whatever computing power is available to make ever more detailed simulations. Vast amounts of data, whether customer shopping patterns, telemetry data from space, or DNA sequences, require analysis. To deliver the required power, computer designers combine multiple processing elements into a single larger system. These so-called parallel computers run multiple tasks simultaneously and solve bigger problems in less time.

Traditionally, parallel computers were rare and available for only the most critical problems. Since the mid-1990s, however, the availability of parallel computers has changed dramatically. With multithreading support built into the latest microprocessors and the emergence of multiple processor cores on a single silicon die, parallel computers are becoming ubiquitous. Now, almost every university computer science department has at least one parallel computer. Virtually all oil companies, automobile manufacturers, drug development companies, and special effects studios use parallel computing.

For example, in computer animation, rendering is the step where information from the animation files, such as lighting, textures, and shading, is applied to 3D models to generate the 2D image that makes up a frame of the film. Parallel computing is essential to generate the needed number of frames (24 per second) for a feature-length film. Toy Story, the first completely computer-generated feature-length film, released by Pixar in 1995, was processed on a "render farm" consisting of 100 dual-processor machines [PS00]. By 1999, for Toy Story 2, Pixar was using a 1,400-processor system with the improvement in processing power fully reflected in the improved details in textures, clothing, and atmospheric effects. Monsters, Inc. (2001) used a system of 250 enterprise servers each containing 14 processors for a total of 3,500 processors. It is interesting that the amount of time required to generate a frame has remained relatively constant; as computing power (both the number of processors and the speed of each processor) has increased, it has been exploited to improve the quality of the animation.

The biological sciences have taken dramatic leaps forward with the availability of DNA sequence information from a variety of organisms, including humans. One approach to sequencing, championed and used with success by Celera Corp., is called the whole genome shotgun algorithm. The idea is to break the genome into small segments, experimentally determine the DNA sequences of the segments, and then use a computer to construct the entire sequence from the segments by finding overlapping areas. The computing facilities used by Celera to sequence the human genome included 150 four-way servers plus a server with 16 processors and 64GB of memory. The calculation involved 500 million trillion base-to-base comparisons [Ein00].
The SETI@home project [SET, ACK02+] provides a fascinating example of the power of parallel computing. The project seeks evidence of extraterrestrial intelligence by scanning the sky with the world's largest radio telescope, the Arecibo Telescope in Puerto Rico. The collected data is then analyzed for candidate signals that might indicate an intelligent source. The computational task is beyond even the largest supercomputer, and certainly beyond the capabilities of the facilities available to the SETI@home project. The problem is solved with public resource computing, which turns PCs around the world into a huge parallel computer connected by the Internet. Data is broken up into work units and distributed over the Internet to client computers whose owners donate spare computing time to support the project. Each client periodically connects with the SETI@home server, downloads the data to analyze, and then sends the results back to the server. The client program is typically implemented as a screen saver so that it will devote CPU cycles to the SETI problem only when the computer is otherwise idle. A work unit currently requires an average of between seven and eight hours of CPU time on a client. More than 205,000,000 work units have been processed since the start of the project. More recently, similar technology to that demonstrated by SETI@home has been used for a variety of public resource computing projects as well as internal projects within large companies utilizing their idle PCs to solve problems ranging from drug screening to chip design validation.

Although computing in less time is beneficial, and may enable problems to be solved that couldn't be otherwise, it comes at a cost. Writing software to run on parallel computers can be difficult. Only a small minority of programmers have experience with parallel programming. If all these computers designed to exploit parallelism are going to achieve their potential, more programmers need to learn how to write parallel programs.
This book addresses this need by showing competent programmers of sequential machines how to design programs that can run on parallel computers. Although many excellent books show how to use particular parallel programming environments, this book is unique in that it focuses on how to think about and design parallel algorithms. To accomplish this goal, we will be using the concept of a pattern language. This highly structured representation of expert design experience has been heavily used in the object-oriented design community.
arithmetic is nonassociative. A good parallel programmer must take care to ensure that nondeterministic issues such as these do not affect the quality of the final answer. Creating safe parallel programs can take considerable effort from the programmer.

Even when a parallel program is "correct", it may fail to deliver the anticipated performance improvement from exploiting concurrency. Care must be taken to ensure that the overhead incurred by managing the concurrency does not overwhelm the program runtime. Also, partitioning the work among the processors in a balanced way is often not as easy as the summation example suggests. The effectiveness of a parallel algorithm depends on how well it maps onto the underlying parallel computer, so a parallel algorithm could be very effective on one parallel architecture and a disaster on another.

We will revisit these issues and provide a more quantitative view of parallel computation in the next chapter.
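The nonassociativity of floating-point arithmetic is easy to demonstrate. In the following sketch (Java, used here purely for illustration; the specific values are our own choices), the same three values are summed in two orders, as two different parallel schedules might sum them, and the results differ:

```java
public class NonAssociative {
    public static void main(String[] args) {
        // Summing the same three values in two different orders,
        // as two different parallel schedules might.
        double a = 1.0e16, b = -1.0e16, c = 1.0;

        double leftToRight = (a + b) + c;  // (1e16 - 1e16) + 1 = 1.0
        double rightToLeft = a + (b + c);  // b + c rounds to -1e16, so the sum is 0.0

        System.out.println(leftToRight);   // prints 1.0
        System.out.println(rightToLeft);   // prints 0.0
    }
}
```

A parallel reduction that sums partial results in a scheduling-dependent order can therefore produce slightly different answers from run to run, which is exactly the nondeterminism discussed above.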
Mensore PLoP (Japan). The proceedings of these workshops [Pat] provide a rich source of patterns covering a vast range of application domains in software development and have been used as a basis for several books [CS95, VCK96, MRB97, HFR99].

In his original work on patterns, Alexander provided not only a catalog of patterns, but also a pattern language that introduced a new approach to design. In a pattern language, the patterns are organized into a structure that leads the user through the collection of patterns in such a way that complex systems can be designed using the patterns. At each decision point, the designer selects an appropriate pattern. Each pattern leads to other patterns, resulting in a final design in terms of a web of patterns. Thus, a pattern language embodies a design methodology and provides domain-specific advice to the application designer. (In spite of the overlapping terminology, a pattern language is not a programming language.)
The Finding Concurrency design space is concerned with structuring the problem to expose exploitable concurrency. The designer working at this level focuses on high-level algorithmic issues and reasons about the problem to expose potential concurrency. The Algorithm Structure design space is concerned with structuring the algorithm to take advantage of potential concurrency. That is, the designer working at this level reasons about how to use the concurrency exposed in working with the Finding Concurrency patterns. The Algorithm Structure patterns describe overall strategies for exploiting concurrency. The Supporting Structures design space represents an intermediate stage between the Algorithm Structure and Implementation Mechanisms design spaces. Two important groups of patterns in this space are those that represent program-structuring approaches and those that represent commonly used shared data structures. The Implementation Mechanisms design space is concerned with how the patterns of the higher-level spaces are mapped into particular programming environments. We use it to provide descriptions of common mechanisms for process/thread management (for example, creating or destroying processes/threads) and process/thread interaction (for example, semaphores, barriers, or message passing). The items in this design space are not presented as patterns because in many cases they map directly onto elements within particular parallel programming environments. They are included in the pattern language anyway, however, to provide a complete path from problem description to code.
task a "slice" of the processor time, the operating system can allow multiple users to use the system as if each were using it alone (but with degraded performance).

Most modern operating systems can use multiple processors to increase the throughput of the system. The UNIX shell uses concurrency along with a communication abstraction known as pipes to provide a powerful form of modularity: Commands are written to accept a stream of bytes as input (the consumer) and produce a stream of bytes as output (the producer). Multiple commands can be chained together with a pipe connecting the output of one command to the input of the next, allowing complex commands to be built from simple building blocks. Each command is executed in its own process, with all processes executing concurrently. Because the producer blocks if buffer space in the pipe is not available, and the consumer blocks if data is not available, the job of managing the stream of results moving between commands is greatly simplified. More recently, with operating systems with windows that invite users to do more than one thing at a time, and the Internet, which often introduces I/O delays perceptible to the user, almost every program that contains a GUI incorporates concurrency.
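The blocking behavior that makes pipes such a convenient abstraction can be sketched with a bounded queue. The following minimal Java example (the queue capacity, item count, and -1 end-of-stream marker are illustrative assumptions, not from the text) connects a producer and a consumer the way a pipe connects two commands: the producer blocks when the buffer is full, and the consumer blocks when it is empty.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipeDemo {
    public static void main(String[] args) throws InterruptedException {
        // A small bounded buffer, playing the role of the pipe.
        BlockingQueue<Integer> pipe = new ArrayBlockingQueue<>(4);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    pipe.put(i);          // blocks when the buffer is full
                }
                pipe.put(-1);             // end-of-stream marker (illustrative)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        int sum = 0;
        for (int v; (v = pipe.take()) != -1; ) {  // take() blocks when empty
            sum += v;
        }
        producer.join();
        System.out.println(sum);          // 0+1+...+9 = 45
    }
}
```

As with shell pipes, neither side needs any explicit flow-control logic; the blocking operations manage the stream of results automatically.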
Although the fundamental concepts for safely handling concurrency are the same in parallel programs and operating systems, there are some important differences. For an operating system, the problem is not finding concurrency; the concurrency is inherent in the way the operating system functions in managing a collection of concurrently executing processes (representing users, applications, and background activities such as print spooling) and providing synchronization mechanisms so resources can be safely shared. However, an operating system must support concurrency in a robust and secure way: Processes should not be able to interfere with each other (intentionally or not), and the entire system should not crash if something goes wrong with one process. In a parallel program, finding and exploiting concurrency can be a challenge, while isolating processes from each other is not the critical concern it is with an operating system. Performance goals are different as well. In an operating system, performance goals are normally related to throughput or response time, and it may be acceptable to sacrifice some efficiency to maintain robustness and fairness in resource allocation. In a parallel program, the goal is to minimize the running time of a single program.
2.2.2. A Further Breakdown of MIMD

The MIMD category of Flynn's taxonomy is too broad to be useful on its own; this category is typically decomposed according to memory organization.

Shared memory. In a shared-memory system, all processes share a single address space and communicate with each other by writing and reading shared variables.

One class of shared-memory systems is called SMPs (symmetric multiprocessors). As shown in Fig. 2.4, all processors share a connection to a common memory and access all memory locations at equal speeds. SMP systems are arguably the easiest parallel systems to program because programmers do not need to distribute data structures among processors. Because increasing the number of processors increases contention for the memory, the processor/memory bandwidth is typically a limiting factor. Thus, SMP systems do not scale well and are limited to small numbers of processors.
Figure 2.4. The Symmetric Multiprocessor (SMP) architecture
The other main class of shared-memory systems is called NUMA (nonuniform memory access). As shown in Fig. 2.5, the memory is shared; that is, it is uniformly addressable from all processors, but some blocks of memory may be physically more closely associated with some processors than others. This reduces the memory bandwidth bottleneck and allows systems with more processors; however, as a result, the access time from a processor to a memory location can be significantly different depending on how "close" the memory location is to the processor. To mitigate the effects of nonuniform access, each processor has a cache, along with a protocol to keep cache entries coherent. Hence, another name for these architectures is cache-coherent nonuniform memory access systems (ccNUMA). Logically, programming a ccNUMA system is the same as programming an SMP, but to obtain the best performance, the programmer will need to be more careful about locality issues and cache effects.
Figure 2.5. An example of the nonuniform memory access (NUMA) architecture
Depending on the topology and technology used for the processor interconnection, communication speed can range from almost as fast as shared memory (in tightly integrated supercomputers) to orders of magnitude slower (for example, in a cluster of PCs interconnected with an Ethernet network). The programmer must explicitly program all the communication between processors and be concerned with the distribution of data.

Distributed-memory computers are traditionally divided into two classes: MPP (massively parallel processors) and clusters. In an MPP, the processors and the network infrastructure are tightly coupled and specialized for use in a parallel computer. These systems are extremely scalable, in some cases supporting the use of many thousands of processors in a single system [MSW96, IBM02].

Clusters are distributed-memory systems composed of off-the-shelf computers connected by an off-the-shelf network. When the computers are PCs running the Linux operating system, these clusters are called Beowulf clusters. As off-the-shelf networking technology improves, systems of this type are becoming more common and much more powerful. Clusters provide an inexpensive way for an organization to obtain parallel computing capabilities [Beo]. Preconfigured clusters are now available from many vendors. One frugal group even reported constructing a useful parallel system by using a cluster to harness the combined power of obsolete PCs that otherwise would have been discarded [HHS01].

Hybrid systems. These systems are clusters of nodes with separate address spaces in which each node contains several processors that share memory.

According to van der Steen and Dongarra's "Overview of Recent Supercomputers" [vdSD03], which contains a brief description of the supercomputers currently or soon to be commercially available, hybrid systems formed from clusters of SMPs connected by a fast network are currently the dominant trend in high-performance computing. For example, in late 2003, four of the five fastest computers in the world were hybrid systems [Top].
Grids. Grids are systems that use distributed, heterogeneous resources connected by LANs and/or WANs [FK03]. Often the interconnection network is the Internet. Grids were originally envisioned as a way to link multiple supercomputers to enable larger problems to be solved, and thus could be viewed as a special type of distributed-memory or hybrid MIMD machine. More recently, the idea of grid computing has evolved into a general way to share heterogeneous resources, such as computation servers, storage, application servers, information services, or even scientific instruments. Grids differ from clusters in that the various resources in the grid need not have a common point of administration. In most cases, the resources on a grid are owned by different organizations that maintain control over the policies governing use of the resources. This affects the way these systems are used, the middleware created to manage them, and most importantly for this discussion, the overhead incurred when communicating between resources within the grid.
Khoros, KOAN/Fortran-S, LAM, Legion, Lilac, Linda, LiPS, Locust, Lparx, Lucid, Maisie, Manifold, Mentat, MetaChaos, Midway, Millipede, Mirage, Modula-2*, Modula-P, MOSIX, MpC, MPC++, MPI, Multipol, Munin, Nano-Threads, NESL, NetClasses++, Papers, Para++, Paradigm, Parafrase2, Paralation, Parallaxis, Parallel Haskell, Parallel C++, ParC, ParLib++, ParLin, Parlog, Parmacs, Parti, pC, pC++, PCN, PCP:, PCU, PEACE, PENNY, PET, PETSc, PH, Phosphorus, POET, Polaris, POOL-T, SciTL, SDDA, SHMEM, SIMPLE, Sina, SISAL, SMI, SONiC, Split-C, SR, Sthreads, Strand, SUIF, SuperPascal, Synergy, TCGMSG, Telegraphos, The FORCE, Threads.h++, TRAPPER, TreadMarks, UC, uC++, UNITY, V, Vic*, Visifold V-NUS, VPE, Athapascan-0b, DSM-Threads, Aurora, Automap, bb_threads, Blaze, BlockComm, BSP, C*, C**, C4, CarlOS, Cashmere, CC++, Charlotte, Charm, Charm++, Chu, Cid, Cilk, CM Fortran, Code, Ease, ECO, Eilean, Emerald, EPL, Excalibur, Express, Falcon, Filaments, FLASH, FM, Fork, Fortran-M, FX, GA, GAMMA, Glenda, GLU, GUARD, HAsL, Concurrent ML, HORUS, Converse, COOL, CORRELATE, CparPar, CPS, CRL, CSP, Cthreads, HPC, HPC++, HPF, IMPACT, ISETL-Linda, ISIS, JADA, JADE
Fortunately, by the late 1990s, the parallel programming community converged predominantly on two environments for parallel programming: OpenMP [OMP] for shared memory and MPI [Mesb] for message passing.

OpenMP is a set of language extensions implemented as compiler directives. Implementations are currently available for Fortran, C, and C++. OpenMP is frequently used to incrementally add parallelism to sequential code. By adding a compiler directive around a loop, for example, the compiler can be instructed to generate code to execute the iterations of the loop in parallel. The compiler takes care of most of the details of thread creation and management. OpenMP programs tend to work very well on SMPs, but because its underlying programming model does not include a notion of nonuniform memory access times, it is less ideal for ccNUMA and distributed-memory machines.

MPI is a set of library routines that provide for process management, message passing, and some collective communication operations (these are operations that involve all the processes involved in a program, such as barrier, broadcast, and reduction). MPI programs can be difficult to write because the programmer is responsible for data distribution and explicit interprocess communication using messages. Because the programming model assumes distributed memory, MPI is a good choice for MPPs and other distributed-memory machines.

Neither OpenMP nor MPI is an ideal fit for hybrid architectures that combine multiprocessor nodes, each with multiple processes and a shared memory, into a larger system with separate address spaces for each node: The OpenMP model does not recognize nonuniform memory access times, so its data allocation can lead to poor performance on machines that are not SMPs, while MPI does not include constructs to manage data structures residing in a shared memory. One solution is a hybrid model in which OpenMP is used on each shared-memory node and MPI is used between the nodes. This works well, but it requires the programmer to work with two different programming models within a single program. Another option is to use MPI on both the shared-memory and distributed-memory portions of the algorithm and give up the advantages of a shared-memory programming model, even when the hardware directly supports it.
New high-level programming environments that simplify portable parallel programming and more accurately reflect the underlying parallel architectures are topics of current research [Cen]. Another approach, more popular in the commercial sector, is to extend MPI and OpenMP. In the mid-1990s, the MPI Forum defined an extended MPI called MPI 2.0, although implementations were not widely available at the time this was written. It is a large, complex extension to MPI that includes dynamic process creation, parallel I/O, and many other features. Of particular interest to programmers of modern hybrid architectures is the inclusion of one-sided communication. One-sided communication mimics some of the features of a shared-memory system by letting one process write into or read from the memory regions of other processes. The term "one-sided" refers to the fact that the read or write is launched by the initiating process without the explicit involvement of the other participating process.

A more sophisticated abstraction of one-sided communication is available as part of the Global Arrays [NHL96, NHK02+, Gloa] package. Global Arrays works together with MPI to help a programmer manage distributed array data. After the programmer defines the array and how it is laid out in memory, the program executes "puts" or "gets" into the array without needing to explicitly manage which MPI process "owns" the particular section of the array. In essence, the global array provides an abstraction of a globally shared array. This only works for arrays, but these are such common data structures in parallel computing that this package, although limited, can be very useful.

Just as MPI has been extended to mimic some of the benefits of a shared-memory environment, OpenMP has been extended to run in distributed-memory environments. The annual WOMPAT (Workshop on OpenMP Applications and Tools) workshops contain many papers discussing various approaches and experiences with OpenMP in clusters and ccNUMA environments.

MPI is implemented as a library of routines to be called from programs written in a sequential programming language, whereas OpenMP is a set of extensions to sequential programming languages.
They represent two of the possible categories of parallel programming environments (libraries and language extensions), and these two particular environments account for the overwhelming majority of parallel computing being done today. There is, however, one more category of parallel programming environments, namely languages with built-in features to support parallel programming. Java is such a language. Rather than being designed to support high-performance computing, Java is an object-oriented, general-purpose programming environment with features for explicitly specifying concurrent processing with shared memory. In addition, the standard I/O and network packages provide classes that make it easy for Java to perform interprocess communication between machines, thus making it possible to write programs based on both the shared-memory and the distributed-memory models. The newer java.nio packages support I/O in a way that is less convenient for the programmer, but gives significantly better performance, and Java 2 1.5 includes new support for concurrent programming, most significantly in the java.util.concurrent.* packages. Additional packages that support different approaches to parallel computing are widely available.

Although there have been other general-purpose languages, both prior to Java and more recent (for example, C#), that contained constructs for specifying concurrency, Java is the first to become widely used. As a result, it may be the first exposure for many programmers to concurrent and parallel programming. Although Java provides software engineering benefits, currently the performance of parallel Java programs cannot compete with OpenMP or MPI programs for typical scientific computing applications. The Java design has also been criticized for several deficiencies that matter in this domain (for example, a floating-point model that emphasizes portability and more reproducible results over exploiting the available floating-point hardware to the fullest, inefficient handling of
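As a small illustration of the java.util.concurrent support mentioned above, the following sketch splits a summation across a fixed pool of threads (the problem size, pool size, and class name are illustrative assumptions, not taken from the text):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    public static void main(String[] args) throws Exception {
        final int n = 1_000_000, tasks = 4;   // illustrative sizes
        ExecutorService pool = Executors.newFixedThreadPool(tasks);

        // Partition [0, n) into equal chunks, one Callable per chunk.
        List<Future<Long>> partials = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            final int lo = t * (n / tasks), hi = (t + 1) * (n / tasks);
            partials.add(pool.submit((Callable<Long>) () -> {
                long s = 0;
                for (int i = lo; i < hi; i++) s += i;
                return s;
            }));
        }

        long sum = 0;
        for (Future<Long> f : partials) sum += f.get();  // get() waits for each task
        pool.shutdown();

        System.out.println(sum);  // sum of 0..n-1 = n*(n-1)/2 = 499999500000
    }
}
```

The thread pool handles thread creation and scheduling, so the program expresses only the decomposition of the work and the combination of the partial results.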
Load balance and load balancing. To execute a parallel program, the tasks must be mapped to UEs, and the UEs to PEs. How the mappings are done can have a significant impact on the overall performance of a parallel algorithm. It is crucial to avoid the situation in which a subset of the PEs is doing most of the work while others are idle. Load balance refers to how well the work is distributed among PEs. Load balancing is the process of allocating work to PEs, either statically or dynamically, so that the work is distributed as evenly as possible.

Synchronization. In a parallel program, due to the nondeterminism of task scheduling and other factors, events in the computation might not always occur in the same order. For example, in one run, a task might read variable x before another task reads variable y; in the next run with the same input, the events might occur in the opposite order. In many cases, the order in which two events occur does not matter. In other situations, the order does matter, and to ensure that the program is correct, the programmer must introduce synchronization to enforce the necessary ordering constraints. The primitives provided for this purpose in our selected environments are discussed in the Implementation Mechanisms design space (Section 6.3).

Synchronous versus asynchronous. We use these two terms to qualitatively refer to how tightly coupled in time two events are. If two events must happen at the same time, they are synchronous; otherwise they are asynchronous. For example, message passing (that is, communication between UEs by sending and receiving messages) is synchronous if a message sent must be received before the sender can continue. Message passing is asynchronous if the sender can continue its computation regardless of what happens at the receiver, or if the receiver can continue computations while waiting for a receive to complete.
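Ordering constraints of the kind just described can be enforced with many different primitives. As a minimal sketch (Java is used here only for illustration; the latch, the field name, and the value 42 are our own assumptions), a latch makes one task wait until another has produced a value before reading it:

```java
import java.util.concurrent.CountDownLatch;

public class OrderingDemo {
    static int x;                                    // written by one task, read by another
    static final CountDownLatch ready = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Thread producer = new Thread(() -> {
            x = 42;                                  // illustrative value
            ready.countDown();                       // signal: x is now valid
        });
        producer.start();

        ready.await();                               // enforce the ordering constraint:
                                                     // do not read x until it is written
        System.out.println(x);                       // always 42, in every run
        producer.join();
    }
}
```

Without the latch, the read of x could be scheduled before the write, and the printed value would depend on the scheduler.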
Race conditions. A race condition is a kind of error peculiar to parallel programs. It occurs when the outcome of a program changes as the relative scheduling of UEs varies. Because the operating system and not the programmer controls the scheduling of the UEs, race conditions result in programs that potentially give different answers even when run on the same system with the same data. Race conditions are particularly difficult errors to debug because by their nature they cannot be reliably reproduced. Testing helps, but is not as effective as with sequential programs: A program may run correctly the first thousand times and then fail catastrophically on the thousand-and-first execution, and then run again correctly when the programmer attempts to reproduce the error as the first step in debugging.

Race conditions result from errors in synchronization. If multiple UEs read and write shared variables, the programmer must protect access to these shared variables so the reads and writes occur in a valid order regardless of how the tasks are interleaved. When many variables are shared or when they are accessed through multiple levels of indirection, verifying by inspection that no race conditions exist can be very difficult. Tools are available that help detect and fix race conditions, such as Thread Checker from Intel Corporation, and the problem remains an area of active and important research [NM92].

Deadlocks. Deadlocks are another type of error peculiar to parallel programs. A deadlock occurs when there is a cycle of tasks in which each task is blocked waiting for another to proceed. Because all are waiting for another task to do something, they will all be blocked forever. As a simple example, consider two tasks in a message-passing environment. Task A attempts to receive a message from task B, after which A will reply by sending a message of its own to task B. Meanwhile, task B attempts to
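A race condition of the kind described above is easy to produce with a shared counter. The following Java sketch (sizes and names are illustrative assumptions) increments one counter with no synchronization and one under a lock. The protected counter is always correct; the unprotected one frequently loses updates, and by how much varies from run to run, which is exactly the irreproducibility described above:

```java
public class RaceDemo {
    static int unsafeCount = 0;                 // incremented with no synchronization
    static int safeCount = 0;                   // incremented under a lock
    static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                unsafeCount++;                  // read-modify-write: can interleave badly
                synchronized (lock) {
                    safeCount++;                // reads and writes occur in a valid order
                }
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();

        System.out.println(safeCount);          // always 200000
        System.out.println(unsafeCount);        // often less than 200000, and varies
    }
}
```

Note that the buggy version may still print 200000 on some runs, which is what makes this class of error so hard to catch by testing.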
A related measure is the efficiency E, which is the speedup normalized by the number of PEs:

Equation 2.4

E(P) = \frac{S(P)}{P}

Substituting the definition of the speedup, S(P) = T_{total}(1)/T_{total}(P), gives:

Equation 2.5

E(P) = \frac{T_{total}(1)}{P \, T_{total}(P)}
Now,rewritingSintermsofthenewexpressionforTtotal(P),weobtainthefamousAmdahl'slaw: Equation2.8
Equation2.9
Equation 2.10

    lim_{P→∞} S(P) = 1/γ
Eq. 2.10 thus gives an upper bound on the speedup obtainable in an algorithm whose serial part represents γ of the total computation.

These concepts are vital to the parallel algorithm designer. In designing a parallel algorithm, it is important to understand the value of the serial fraction so that realistic expectations can be set for performance. It may not make sense to implement a complex, arbitrarily scalable parallel algorithm if 10% or more of the algorithm is serial (and 10% is fairly common).

Of course, Amdahl's law is based on assumptions that may or may not be true in practice. In real life, a number of factors may make the actual running time longer than this formula implies. For example, creating additional parallel tasks may increase overhead and the chances of contention for shared resources. On the other hand, if the original serial computation is limited by resources other than the availability of CPU cycles, the actual performance could be much better than Amdahl's law would predict. For example, a large parallel machine may allow bigger problems to be held in memory, thus reducing virtual memory paging, or multiple processors each with its own cache may allow much more of the problem to remain in the cache. Amdahl's law also rests on the assumption that for any given input, the parallel and serial implementations perform exactly the same number of computational steps. If the serial algorithm being used in the formula is not the best possible algorithm for the problem, then a clever parallel algorithm that structures the computation differently can reduce the total number of computational steps.

It has also been observed [Gus88] that the exercise underlying Amdahl's law, namely running exactly the same problem with varying numbers of processors, is artificial in some circumstances. If, say, the parallel application were a weather simulation, then when new processors were added, one would most likely increase the problem size by adding more details to the model while keeping the total execution time constant. If this is the case, then Amdahl's law, or fixed-size speedup, gives a pessimistic view of the benefits of additional processors.
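As a quick numerical illustration of Amdahl's law (a Python sketch; the 10% serial fraction is taken from the discussion above):

```python
def amdahl_speedup(gamma, P):
    """Amdahl's law: speedup on P PEs for serial fraction gamma."""
    return 1.0 / (gamma + (1.0 - gamma) / P)

# With a 10% serial fraction, speedup saturates well below the PE count;
# the upper bound is 1/gamma = 10 no matter how many PEs are added.
speedups = {P: round(amdahl_speedup(0.10, P), 2) for P in (2, 8, 64, 1024)}
# speedups == {2: 1.82, 8: 4.71, 64: 8.77, 1024: 9.91}
```

Going from 64 to 1024 PEs buys barely one extra unit of speedup, which is why the serial fraction must be understood before investing in an elaborately scalable design.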
To see this, we can reformulate the equation to give the speedup in terms of performance on a P-processor system. Earlier, in Eq. 2.2, we obtained the execution time for P processors, Ttotal(P), from the execution time of the serial terms and the execution time of the parallelizable part when executed on one processor. Here, we do the opposite and obtain Ttotal(1) from the serial and parallel terms when executed on P processors.

Equation 2.11

    Ttotal(1) = Tsetup + P · Tcompute(P) + Tfinalization
Now, we define the so-called scaled serial fraction, denoted γ_scaled, as
Equation 2.12

    γ_scaled = (Tsetup + Tfinalization) / Ttotal(P)
and then
Equation 2.13

    Ttotal(1) = (γ_scaled + P(1 − γ_scaled)) · Ttotal(P)
Rewriting the equation for speedup (Eq. 2.3) and simplifying, we obtain the scaled (or fixed-time) speedup:[1]

Equation 2.14

    S(P) = γ_scaled + P(1 − γ_scaled)
This gives exactly the same speedup as Amdahl's law, but allows a different question to be asked when the number of processors is increased. Since γ_scaled depends on P, the result of taking the limit isn't immediately obvious, but would give the same result as the limit in Amdahl's law. However, suppose we take the limit in P while holding Tcompute and thus γ_scaled constant. The interpretation is that we are increasing the size of the problem so that the total running time remains constant when more processors are added. (This contains the implicit assumption that the execution time of the serial terms does not change as the problem size grows.) In this case, the speedup is linear in P. Thus, while adding more processors to solve a fixed problem may hit the speedup limits of Amdahl's law with a relatively small number of processors, if the problem grows as more processors are added, Amdahl's law will be pessimistic. These two models of speedup, along with a fixed-memory version of speedup, are discussed in [SN90].
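The contrast with Amdahl's law is easy to see numerically (a Python sketch; the 10% scaled serial fraction is an assumed value for illustration):

```python
def scaled_speedup(gamma_scaled, P):
    """Fixed-time (scaled) speedup: S = gamma_scaled + P(1 - gamma_scaled)."""
    return gamma_scaled + P * (1.0 - gamma_scaled)

# Holding gamma_scaled at 10% while the problem grows with P, the speedup
# grows linearly in P rather than saturating at 1/gamma = 10:
speedups = {P: scaled_speedup(0.10, P) for P in (2, 8, 64, 1024)}
# approximately {2: 1.9, 8: 7.3, 64: 57.7, 1024: 921.7}
```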
2.6. COMMUNICATION
2.6.1. Latency and Bandwidth

A simple but useful model characterizes the total time for message transfer as the sum of a fixed cost plus a variable cost that depends on the length of the message.

Equation 2.15

    T_message-transfer(N) = α + N/β
The fixed cost α is called latency and is essentially the time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the data is received by the recipient. Latency (given in some appropriate time unit) includes overhead due to software and network hardware plus the time it takes for the message to traverse the communication medium. The bandwidth β (given in some measure of bytes per time unit) is a measure of the capacity of the communication medium. N is the length of the message.

The latency and bandwidth can vary significantly between systems depending on both the hardware used and the quality of the software implementing the communication protocols. Because these values can be measured with fairly simple benchmarks [DD97], it is sometimes worthwhile to measure values for α and β, as these can help guide optimizations to improve communication performance. For example, in a system in which α is relatively large, it might be worthwhile to try to restructure a program that sends many small messages to aggregate the communication into a few large messages instead. Data for several recent systems has been presented in [BBC+03].
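The payoff from aggregating messages falls directly out of this model. A small Python sketch with hypothetical values (the α and β below are made-up numbers for illustration, not measurements of any system):

```python
def transfer_time(n_bytes, alpha, beta):
    """Message-transfer model: T(N) = alpha + N/beta."""
    return alpha + n_bytes / beta

alpha = 50e-6   # hypothetical latency: 50 microseconds
beta = 100e6    # hypothetical bandwidth: 100 MB/s

many_small = 1000 * transfer_time(100, alpha, beta)  # 1000 messages of 100 bytes
one_large = transfer_time(100_000, alpha, beta)      # one 100 KB message
# many_small is about 0.051 s; one_large is about 0.00105 s.
# Aggregation pays the latency term once instead of 1000 times.
```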
2.6.2. Overlapping Communication and Computation and Latency Hiding

If we look more closely at the computation time within a single task on a single processor, it can roughly be decomposed into computation time, communication time, and idle time. The communication time is the time spent sending and receiving messages (and thus only applies to distributed-memory machines), whereas the idle time is time that no work is being done because the task is waiting for an event, such as the release of a resource held by another task.

A common situation in which a task may be idle is when it is waiting for a message to be transmitted through the system. This can occur when sending a message (as the UE waits for a reply before proceeding) or when receiving a message. Sometimes it is possible to eliminate this wait by restructuring the task to send the message and/or post the receive (that is, indicate that it wants to receive a message) and then continue the computation. This allows the programmer to overlap communication and computation. We show an example of this technique in Fig. 2.7. This style of message passing is more complicated for the programmer, because the programmer must take care to wait for the receive to complete after any work that can be overlapped with communication is completed.
Figure 2.7. Communication without (left) and with (right) support for overlapping communication and computation. Although UE 0 in the computation on the right still has some idle time waiting for the reply from UE 1, the idle time is reduced and the computation requires less total time because of UE 1's earlier start.
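The restructuring shown in Fig. 2.7 can be mimicked in any environment with nonblocking operations. A minimal Python sketch, using a thread-pool future to stand in for a posted receive (fetch_remote_data is a stand-in for a communication call, not a real API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_remote_data():
    time.sleep(0.05)          # stands in for message transit time
    return [1, 2, 3]

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fetch_remote_data)  # "post the receive" and continue
    local = sum(i * i for i in range(1000))   # computation overlapped with it
    remote = pending.result()                 # wait only when the data is needed
```

The programmer's obligation noted above shows up as the explicit result() call: the wait must happen after the overlappable work, but before the data is used.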
2.7. SUMMARY
This chapter has given a brief overview of some of the concepts and vocabulary used in parallel computing. Additional terms are defined in the glossary. We also discussed the major programming environments in use for parallel computing: OpenMP, MPI, and Java. Throughout the book, we will use these three programming environments for our examples. More details about OpenMP, MPI, and Java and how to use them to write parallel programs are provided in the appendixes.
Experienced designers working in a familiar domain may see the exploitable concurrency immediately and could move directly to the patterns in the Algorithm Structure design space.

3.1.1. Overview

Before starting to work with the patterns in this design space, the algorithm designer must first consider the problem to be solved and make sure the effort to create a parallel program will be justified: Is the problem large enough and the results significant enough to justify expending effort to solve it faster? If so, the next step is to make sure the key features and data elements within the problem are well understood. Finally, the designer needs to understand which parts of the problem are most computationally intensive, because the effort to parallelize the problem should be focused on those parts.

After this analysis is complete, the patterns in the Finding Concurrency design space can be used to start designing a parallel algorithm. The patterns in this design space can be organized into three groups.
* Decomposition Patterns. The two decomposition patterns, Task Decomposition and Data Decomposition, are used to decompose the problem into pieces that can execute concurrently.

* Dependency Analysis Patterns. This group contains three patterns that help group the tasks and analyze the dependencies among them: Group Tasks, Order Tasks, and Data Sharing. Nominally, the patterns are applied in this order. In practice, however, it is often necessary to work back and forth between them, or possibly even revisit the decomposition patterns.

* Design Evaluation Pattern. The final pattern in this space guides the algorithm designer through an analysis of what has been done so far before moving on to the patterns in the Algorithm Structure design space. This pattern is important because it often happens that the best design is not found on the first attempt, and the earlier design flaws are identified, the
Viewing the problem decomposition in terms of two distinct dimensions is somewhat artificial. A task decomposition implies a data decomposition and vice versa; hence, the two decompositions are really different facets of the same fundamental decomposition. We divide them into separate dimensions, however, because a problem decomposition usually proceeds most naturally by emphasizing one dimension of the decomposition over the other. By making them distinct, we make this design emphasis explicit and easier for the designer to understand.

3.1.3. Background for Examples

In this section, we give background information on some of the examples that are used in several patterns. It can be skipped for the time being and revisited later when reading a pattern that refers to one of the examples.
Medical imaging
PET (Positron Emission Tomography) scans provide an important diagnostic tool by allowing physicians to observe how a radioactive substance propagates through a patient's body. Unfortunately, the images formed from the distribution of emitted radiation are of low resolution, due in part to the scattering of the radiation as it passes through the body. It is also difficult to reason from the absolute radiation intensities, because different pathways through the body attenuate the radiation differently.

To solve this problem, models of how radiation propagates through the body are used to correct the images. A common approach is to build a Monte Carlo model, as described by Ljungberg and King [LK98]. Randomly selected points within the body are assumed to emit radiation (usually a gamma ray), and the trajectory of each ray is followed. As a particle (ray) passes through the body, it is attenuated by the different organs it traverses, continuing until the particle leaves the body and hits a camera model, thereby defining a full trajectory. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

This problem can be parallelized in two ways. Because each trajectory is independent, it is possible to parallelize the application by associating each trajectory with a task. This approach is discussed in the Examples section of the Task Decomposition pattern. Another approach would be to partition the body into sections and assign different sections to different processing elements. This approach is discussed in the Examples section of the Data Decomposition pattern.
Linear algebra
the pharmaceutical industry. It is also a useful test problem for computer scientists working on parallel computing: It is straightforward to understand, relevant to science at large, and difficult to parallelize effectively. As a result, it has been the subject of much research [Mat94, PH95, Pli95].

The basic idea is to treat a molecule as a large collection of balls connected by springs. The balls represent the atoms in the molecule, while the springs represent the chemical bonds between the atoms. The molecular dynamics simulation itself is an explicit time-stepping process. At each time step, the force on each atom is computed and then standard classical mechanics techniques are used to compute how the force moves the atoms. This process is carried out repeatedly to step through time and compute a trajectory for the molecular system.

The forces due to the chemical bonds (the "springs") are relatively simple to compute. These correspond to the vibrations and rotations of the chemical bonds themselves. These are short-range forces that can be computed with knowledge of the handful of atoms that share chemical bonds. The major difficulty arises because the atoms have partial electrical charges. Hence, while atoms only interact with a small neighborhood of atoms through their chemical bonds, the electrical charges cause every atom to apply a force on every other atom.

This is the famous N-body problem. On the order of N² terms must be computed to find these nonbonded forces. Because N is large (tens or hundreds of thousands) and the number of time steps in a simulation is huge (tens of thousands), the time required to compute these nonbonded forces dominates the computation. Several ways have been proposed to reduce the effort required to solve the N-body problem. We are only going to discuss the simplest one: the cutoff method.
The idea is simple. Even though each atom exerts a force on every other atom, this force decreases with the square of the distance between the atoms. Hence, it should be possible to pick a distance beyond which the force contribution is so small that it can be ignored. By ignoring the atoms that exceed this cutoff, the problem is reduced to one that scales as O(N × n), where n is the number of atoms within the cutoff volume, usually hundreds. The computation is still huge, and it dominates the overall runtime for the simulation, but at least the problem is tractable.

There are a host of details, but the basic simulation can be summarized as in Fig. 3.2. The primary data structures hold the atomic positions (atoms), the velocities of each atom (velocity), the forces exerted on each atom (forces), and lists of atoms within the cutoff distance of each atom (neighbors). The program itself is a time-stepping loop, in which each iteration computes the short-range force terms, updates the neighbor lists, and then finds the nonbonded forces. After the force on each atom has been computed, a simple ordinary differential equation is solved to update the positions and velocities. Physical properties based on atomic motions are then updated, and we go to the next time step.

There are many ways to parallelize the molecular dynamics problem. We consider the most common approach, starting with the task decomposition (discussed in the Task Decomposition pattern) and following with the associated data decomposition (discussed in the Data Decomposition pattern). This example shows how the two decompositions fit together to guide the design of the parallel algorithm.
Figure 3.2. Pseudocode for the molecular dynamics example

    Int const N                        // number of atoms
    Array of Real :: atoms(3,N)        // 3D coordinates
    Array of Real :: velocities(3,N)   // velocity vector
    Array of Real :: forces(3,N)       // force in each dimension
    Array of List :: neighbors(N)      // atoms in cutoff volume

    loop over time steps
      vibrational_forces(N, atoms, forces)
      rotational_forces(N, atoms, forces)
      neighbor_list(N, atoms, neighbors)
      non_bonded_forces(N, atoms, neighbors, forces)
      update_atom_positions_and_velocities(N, atoms, velocities, forces)
      physical_properties( ... Lots of stuff ... )
    end loop
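To make the neighbor-list idea concrete, here is a deliberately tiny Python sketch in one dimension (the function name follows the pseudocode loosely; a real code would use three dimensions and a cell or Verlet list rather than this O(N²) scan):

```python
def neighbor_list(positions, cutoff):
    """For each atom, the indices of the other atoms within the cutoff."""
    n = len(positions)
    return [[j for j in range(n)
             if j != i and abs(positions[i] - positions[j]) <= cutoff]
            for i in range(n)]

atoms = [0.0, 0.4, 2.0, 2.1]          # 1D positions of four atoms
neighbors = neighbor_list(atoms, cutoff=0.5)
# neighbors == [[1], [0], [3], [2]]: two well-separated pairs
```

The nonbonded force computation then only loops over each atom's (short) neighbor list instead of over all N atoms.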
Forces

The main forces influencing the design at this point are flexibility, efficiency, and simplicity.
* Flexibility. Flexibility in the design will allow it to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.

* Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization). For a task decomposition, this means we need enough tasks to keep all the PEs busy, with enough work per task to compensate for overhead incurred to manage dependencies. However, the drive for efficiency can lead to complex decompositions that lack flexibility.

* Simplicity. The task decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.
Solution

The key to an effective task decomposition is to ensure that the tasks are sufficiently independent so that managing dependencies takes only a small fraction of the program's overall execution time. It is also important to ensure that the execution of the tasks can be evenly distributed among the ensemble of PEs (the load-balancing problem).

In an ideal world, the compiler would find the tasks for the programmer. Unfortunately, this almost never happens. Instead, it must usually be done by hand based on knowledge of the problem and the code required to solve it. In some cases, it might be necessary to completely recast the problem into a form that exposes relatively independent tasks.

In a task-based decomposition, we look at the problem as a collection of distinct tasks, paying particular attention to
* In some cases, each task corresponds to a distinct call to a function. Defining a task for each function call leads to what is sometimes called a functional decomposition.

* Another place to find tasks is in distinct iterations of the loops within an algorithm. If the iterations are independent and there are enough of them, then it might work well to base a task decomposition on mapping each iteration onto a task. This style of task-based decomposition leads to what are sometimes called loop-splitting algorithms.

* Tasks also play a key role in data-driven decompositions. In this case, a large data structure is decomposed and multiple units of execution concurrently update different chunks of the data structure. In this case, the tasks are those updates on individual chunks.
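A loop-splitting decomposition is the easiest of these to show in code. A Python sketch for illustration (the book's environments would express this with, for example, an OpenMP parallel for; here a thread pool maps independent iterations onto tasks):

```python
from concurrent.futures import ThreadPoolExecutor

def body(i):
    return i * i          # stands in for one independent loop iteration

# Sequential version: results = [body(i) for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(body, range(8)))  # one task per iteration
# results == [0, 1, 4, 9, 16, 25, 36, 49], same as the sequential loop
```

Because the iterations share no data, the parallel version produces exactly the sequential result regardless of how the tasks are scheduled.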
Also keep in mind the forces given in the Forces section:
* Flexibility. The design needs to be flexible in the number of tasks generated. Usually this is done by parameterizing the number and size of tasks on some appropriate dimension. This will let the design be adapted to a wide range of parallel computers with different numbers of processors.

* Efficiency. There are two major efficiency issues to consider in the task decomposition. First, each task must include enough work to compensate for the overhead incurred by creating the tasks and managing their dependencies. Second, the number of tasks should be large enough so that all the units of execution are busy with useful work throughout the computation.

* Simplicity. Tasks should be defined in a way that makes debugging and maintenance simple. When possible, tasks should be defined so they reuse code from existing sequential programs that solve related problems.
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

It is natural to associate a task with each trajectory. These tasks are particularly simple to manage concurrently because they are completely independent. Furthermore, there are large numbers of trajectories, so there will be many tasks, making this decomposition suitable for a large range of computer systems, from a shared-memory system with a small number of processing elements to a large cluster with hundreds of processing elements.

With the basic tasks defined, we now consider the corresponding data decomposition, that is, we define the data associated with each task. Each task needs to hold the information defining the trajectory. But that is not all: The tasks need access to the model of the body as well. Although it might not be apparent from our description of the problem, the body model can be extremely large. Because it is a read-only model, this is no problem if there is an effective shared-memory system; each task can read data as needed. If the target platform is based on a distributed-memory architecture, however, the body model will need to be replicated on each PE. This can be very time consuming and can waste a great deal of memory. For systems with small memories per PE and/or with slow networks between PEs, a decomposition of the problem based on the body model might be more effective.

This is a common situation in parallel programming: Many problems can be decomposed primarily in terms of data or primarily in terms of tasks. If a task-based decomposition avoids the need to break up and distribute complex data structures, it will be a much simpler program to write and debug. On the other hand, if memory and/or network bandwidth is a limiting factor, a decomposition that focuses on
Consider the multiplication of two matrices (C = A · B), as described in Sec. 3.1.3. We can produce a task-based decomposition of this problem by considering the calculation of each element of the product matrix as a separate task. Each task needs access to one row of A and one column of B. This decomposition has the advantage that all the tasks are independent, and because all the data that is shared among tasks (A and B) is read-only, it will be straightforward to implement in a shared-memory environment.

The performance of this algorithm, however, would be poor. Consider the case where the three matrices are square and of order N. For each element of C, N elements from A and N elements from B would be required, resulting in 2N memory references for N multiply/add operations. Memory access time is slow compared to floating-point arithmetic, so the bandwidth of the memory subsystem would limit the performance.

A better approach would be to design an algorithm that maximizes reuse of data loaded into a processor's caches. We can arrive at this algorithm in two different ways. First, we could group together the element-wise tasks we defined earlier so the tasks that use similar elements of the A and B matrices run on the same UE (see the Group Tasks pattern). Alternatively, we could start with the data decomposition and design the algorithm from the beginning around the way the matrices fit into the caches. We discuss this example further in the Examples section of the Data Decomposition pattern.
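The element-wise task decomposition just described can be sketched as follows (Python for illustration: one task per element of C, with A and B shared read-only):

```python
from concurrent.futures import ThreadPoolExecutor

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
n = len(A)

def element_task(ij):
    """One task: compute C[i][j] from row i of A and column j of B."""
    i, j = ij
    return sum(A[i][k] * B[k][j] for k in range(n))

with ThreadPoolExecutor() as pool:
    flat = list(pool.map(element_task,
                         [(i, j) for i in range(n) for j in range(n)]))
C = [flat[i * n:(i + 1) * n] for i in range(n)]
# C == [[19, 22], [43, 50]]
```

Each task touches 2N elements to perform N multiply/add operations, which is exactly the poor compute-to-memory ratio discussed above.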
Molecular dynamics
    loop over time steps
      vibrational_forces(N, atoms, forces)
      rotational_forces(N, atoms, forces)
      neighbor_list(N, atoms, neighbors)
      non_bonded_forces(N, atoms, neighbors, forces)
      update_atom_positions_and_velocities(N, atoms, velocities, forces)
      physical_properties( ... Lots of stuff ... )
    end loop
Second, the physical_properties() function computes energies, correlation coefficients, and a host of interesting physical properties. These computations, however, are simple and do not significantly affect the program's overall runtime, so we will ignore them in this discussion.

Because the bulk of the computation time will be in non_bonded_forces(), we must pick a problem decomposition that makes that computation run efficiently in parallel. The problem is made easier by the fact that each of the functions inside the time loop has a similar structure: In the sequential version, each function includes a loop over atoms to compute contributions to the force vector. Thus, a natural task definition is the update required by each atom, which corresponds to a loop iteration in the sequential version. After performing the task decomposition, therefore, we obtain the following tasks.
Problem

How can a problem's data be decomposed into units that can be operated on relatively independently?

Context

The parallel algorithm designer must have a detailed understanding of the problem being solved. In addition, the designer should identify the most computationally intensive parts of the problem, the key data structures required to solve the problem, and how data is used as the problem's solution unfolds.

After the basic problem is understood, the parallel algorithm designer should consider the tasks that make up the problem and the data decomposition implied by the tasks. Both the task and data decompositions need to be addressed to create a parallel algorithm. The question is not which decomposition to do. The question is which one to start with. A data-based decomposition is a good starting point if the following is true.
* Flexibility. The size and number of data chunks should be flexible to support the widest range of parallel systems. One approach is to define chunks whose size and number are controlled by a small number of parameters. These parameters define granularity knobs that can be varied to modify the size of the data chunks to match the needs of the underlying hardware. (Note, however, that many designs are not infinitely adaptable with respect to granularity.)

  The easiest place to see the impact of granularity on the data decomposition is in the overhead required to manage dependencies between chunks. The time required to manage dependencies must be small compared to the overall runtime. In a good data decomposition, the dependencies scale at a lower dimension than the computational effort associated with each chunk. For example, in many finite difference programs, the cells at the boundaries between chunks, that is, the surfaces of the chunks, must be shared. The size of the set of dependent cells scales as the surface area, while the effort required in the computation scales as the volume of the chunk. This means that the computational effort can be scaled (based on the chunk's volume) to offset overheads associated with data dependencies (based on the surface area of the chunk).
* Efficiency. It is important that the data chunks be large enough that the amount of work to update the chunk offsets the overhead of managing dependencies. A more subtle issue to consider is how the chunks map onto UEs. An effective parallel algorithm must balance the load between UEs. If this isn't done well, some PEs might have a disproportionate amount of work, and the overall scalability will suffer. This may require clever ways to break up the problem. For example, if the problem clears the columns in a matrix from left to right, a column mapping of the matrix will cause problems as the UEs with the leftmost columns will finish their work before the others. A row-based block decomposition or even a block-cyclic decomposition (in which rows are assigned cyclically to PEs) would do a much better job of keeping all the processors fully occupied. These issues are discussed in more detail in the Distributed Array pattern.
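Two of the points above lend themselves to quick numeric sketches (Python, with made-up sizes): the surface-to-volume argument for chunk granularity, and the cyclic row mapping for load balance.

```python
# 1) Surface-to-volume: for a cubic chunk of side n, shared boundary cells
#    scale as the surface area (6 n^2) while the work scales as the volume
#    (n^3), so the dependency-overhead fraction falls as 6/n as chunks grow.
def surface_to_volume(n):
    return (6 * n * n) / n ** 3

ratios = {n: surface_to_volume(n) for n in (4, 8, 16)}
# ratios == {4: 1.5, 8: 0.75, 16: 0.375}

# 2) Cyclic mapping: assigning row r to PE (r mod P) keeps all PEs involved
#    even when the computation sweeps across the rows from one side.
def cyclic_owner(row, num_pes):
    return row % num_pes

assignment = [cyclic_owner(r, 4) for r in range(10)]
# assignment == [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```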
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

In a data-based decomposition of this problem, the body model is the large central data structure around which the computation can be organized. The model is broken into segments, and one or more segments are associated with each processing element. The body segments are only read, not written, during the trajectory computations, so there are no data dependencies created by the decomposition of the body model.

After the data has been decomposed, we need to look at the tasks associated with each data segment. In this case, each trajectory passing through the data segment defines a task. The trajectories are initiated and propagated within a segment. When a segment boundary is encountered, the trajectory must be passed between segments. It is this transfer that defines the dependencies between data chunks.

On the other hand, in a task-based approach to this problem (as discussed in the Task Decomposition pattern), the trajectories for each particle drive the algorithm design. Each PE potentially needs to access the full body model to service its set of trajectories. In a shared-memory environment, this is easy because the body model is a read-only data set. In a distributed-memory environment, however, this would require substantial startup overhead as the body model is broadcast across the system.
This is a common situation in parallel programming: Different points of view lead to different algorithms with potentially very different performance characteristics. The task-based algorithm is simple, but it only works if each processing element has access to a large memory and if the overhead incurred loading the data into memory is insignificant compared to the program's runtime. An algorithm driven by a data decomposition, on the other hand, makes efficient use of memory and (in distributed-memory environments) less use of network bandwidth, but it incurs more communication overhead during the concurrent part of computation and is significantly more complex. Choosing which is the appropriate approach can be difficult and is discussed further in the Design Evaluation pattern.
Matrix multiplication
Consider the standard multiplication of two matrices (C = A · B), as described in Sec. 3.1.3. Several data-based decompositions are possible for this problem. A straightforward one would be to decompose the product matrix C into a set of row blocks (sets of adjacent rows). From the definition of matrix multiplication, computing the elements of a row block of C requires the full B matrix, but only the corresponding row block of A. With such a data decomposition, the basic task in the algorithm becomes the computation of the elements in a row block of C.

An even more effective approach that does not require the replication of the full B matrix is to decompose all three matrices into submatrices or blocks. The basic task then becomes the update of a C block, with the A and B blocks being cycled among the tasks as the computation proceeds. This decomposition, however, is much more complex to program; communication and computation must be carefully coordinated during the most time-critical portions of the problem. We discuss this example further in the Geometric Decomposition and Distributed Array patterns.

One of the features of the matrix multiplication problem is that the ratio of floating-point operations (O(N³)) to memory references (O(N²)) is large, meaning each matrix element is reused many times. This implies that it is especially important to take into account the memory access patterns to maximize reuse of data from the cache. The most effective approach is to use the block (submatrix) decomposition and adjust the size of the blocks so the subproblems fit into cache. We could arrive at the same algorithm by carefully grouping together the element-wise tasks that were identified in the Examples section of the Task Decomposition pattern, but starting with a data decomposition and assigning a task to update each submatrix seems easier to understand.
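The block decomposition can be sketched as follows (Python for illustration, with B set to the identity so the result is easy to check; a real code would choose the block size s so the working set fits in cache and would distribute the blocks across UEs):

```python
N, s = 4, 2                                        # matrix order and block size
A = [[i * N + j for j in range(N)] for i in range(N)]
B = [[1 if i == j else 0 for j in range(N)] for i in range(N)]  # identity
C = [[0] * N for _ in range(N)]

for bi in range(0, N, s):                          # block row of C
    for bj in range(0, N, s):                      # block column of C: one task
        for bk in range(0, N, s):                  # A and B blocks cycle through
            for i in range(bi, bi + s):
                for j in range(bj, bj + s):
                    C[i][j] += sum(A[i][k] * B[k][j]
                                   for k in range(bk, bk + s))
# Since B is the identity, C ends up equal to A.
```

The innermost three loops touch only one block of each matrix at a time, which is the data reuse the text describes.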
Molecular dynamics
* Tasks that find the vibrational forces on an atom
* Tasks that find the rotational forces on an atom
* Tasks that find the nonbonded forces on an atom
* Tasks that update the position and velocity of an atom
* A task to update the neighbor list for all the atoms (which we will leave sequential)

The key data structures are

* An array of atom coordinates, one element per atom
* An array of atom velocities, one element per atom
* An array of lists, one per atom, each defining the neighborhood of atoms within the cutoff distance of the atom
* An array of forces on atoms, one element per atom
Data decompositions are very common in parallel scientific computing. The parallel linear algebra
waiting on one group to finish filling a file before another group can begin reading it), we can satisfy that constraint once for the whole group. If a group of tasks must work together on a shared data structure, the required synchronization can be worked out once for the whole group. If a set of tasks are independent, combining them into a single group and scheduling them for execution as a single large group can simplify the design and increase the available concurrency (thereby letting the solution scale to more PEs).

In each case, the idea is to define groups of tasks that share constraints and simplify the problem of managing constraints by dealing with groups rather than individual tasks.

Solution

Constraints among tasks fall into a few major categories.
* The easiest dependency to understand is a temporal dependency, that is, a constraint on the order in which a collection of tasks executes. If task A depends on the results of task B, for example, then task A must wait until task B completes before it can execute. We can usually think of this case in terms of data flow: Task A is blocked waiting for the data to be ready from task B; when B completes, the data flows into A. In some cases, A can begin computing as soon as data starts to flow from B (for example, pipeline algorithms as described in the Pipeline pattern).

* Another type of ordering constraint occurs when a collection of tasks must run at the same time. For example, in many data-parallel problems, the original problem domain is divided into multiple regions that can be updated in parallel. Typically, the update of any given region requires information about the boundaries of its neighboring regions. If all of the regions are not processed at the same time, the parallel program could stall or deadlock as some regions wait for data from inactive regions.

* In some cases, tasks in a group are truly independent of each other. These tasks do not have an ordering constraint among them. This is an important feature of a set of tasks because it means they can execute in any order, including concurrently, and it is important to clearly note when this holds.

The goal of this pattern is to group tasks based on these constraints, because of the following.

* By grouping tasks, we simplify the establishment of partial orders between tasks, since ordering constraints can be applied to groups rather than to individual tasks.

* Grouping tasks makes it easier to identify which tasks must execute concurrently.
First, look at how the original problem was decomposed. In most cases, a high-level operation (for example, solving a matrix) or a large iterative program structure (for example, a loop) plays a key role in defining the decomposition. This is the first place to look for grouping tasks. The tasks that correspond to a high-level operation naturally group together.

At this point, there may be many small groups of tasks. In the next step, we will look at the constraints shared between the tasks within a group. If the tasks share a constraint, usually in terms of the update of a shared data structure, keep them as a distinct group. The algorithm design will need to ensure that these tasks execute at the same time. For example, many problems involve the coordinated update of a shared data structure by a set of tasks. If these tasks do not run concurrently, the program could deadlock.
Next, we ask if any other task groups share the same constraint. If so, merge the groups together. Large task groups provide additional concurrency to keep more PEs busy and also provide extra flexibility in scheduling the execution of the tasks, thereby making it easier to balance the load between PEs (that is, ensure that each of the PEs spends approximately the same amount of time working on the problem).

The next step is to look at constraints between groups of tasks. This is easy when groups have a clear temporal ordering or when a distinct chain of data moves between groups. The more complex case, however, is when otherwise independent task groups share constraints between groups. In these cases, it can be useful to merge these into a larger group of independent tasks, once again because large task groups usually make for more scheduling flexibility and better scalability.
Examples
Molecular dynamics
This problem was described in Sec. 3.1.3, and we discussed its decomposition in the Task Decomposition and Data Decomposition patterns. We identified the following tasks:
these into a single group for bonded interactions. The other groups have distinct temporal or ordering constraints and therefore should not be merged.
Matrix multiplication
Theorderingmustberestrictiveenoughtosatisfyalltheconstraintssothattheresulting designiscorrect. Theorderingshouldnotbemorerestrictivethanitneedstobe.Overlyconstrainingthe solutionlimitsdesignoptionsandcanimpairprogramefficiency;thefewertheconstraints,the moreflexibilityyouhavetoshifttasksaroundtobalancethecomputationalloadamongPEs. Firstlookatthedatarequiredbyagroupoftasksbeforetheycanexecute.Afterthisdatahas beenidentified,findthetaskgroupthatcreatesitandanorderingconstraintwillbeapparent. Forexample,ifonegroupoftasks(callitA)buildsacomplexdatastructureandanother group(B)usesit,thereisasequentialorderingconstraintbetweenthesegroups.Whenthese twogroupsarecombinedinaprogram,theymustexecuteinsequence,firstAandthenB. Alsoconsiderwhetherexternalservicescanimposeorderingconstraints.Forexample,ifa programmustwritetoafileinacertainorder,thenthesefileI/Ooperationslikelyimposean orderingconstraint. Finally,itisequallyimportanttonotewhenanorderingconstraintdoesnotexist.Ifanumber oftaskgroupscanexecuteindependently,thereisamuchgreateropportunitytoexploit parallelism,soweneedtonotewhentasksareindependentaswellaswhentheyare dependent.
To identify ordering constraints, consider the following ways tasks can depend on each other.
As addressed in the Group Tasks and Order Tasks patterns, the starting point in a dependency analysis is to group tasks based on constraints among them and then determine what ordering constraints apply to groups of tasks. The next step, discussed here, is to analyze how data is shared among groups of tasks, so that access to shared data can be managed correctly.

Although the analysis that led to the grouping of tasks and the ordering constraints among them focuses primarily on the task decomposition, at this stage of the dependency analysis, the focus shifts to the data decomposition, that is, the division of the problem's data into chunks that can be updated independently, each associated with one or more tasks that handle the update of that chunk. This chunk of data is sometimes called task-local data (or just local data), because it is tightly coupled to the task(s) responsible for its update. It is rare, however, that each task can operate using only its own local data; data may need to be shared among tasks in many ways. Two of the most common situations are the following.
* In addition to task-local data, the problem's data decomposition might define some data that must be shared among tasks; for example, the tasks might need to cooperatively update a large shared data structure. Such data cannot be identified with any given task; it is inherently global to the problem. This shared data is modified by multiple tasks and therefore serves as a source of dependencies among the tasks.
* Data dependencies can also occur when one task needs access to some portion of another task's local data. The classic example of this type of data dependency occurs in finite difference methods parallelized using a data decomposition, where each point in the problem space is updated using values from nearby points, and therefore updates for one chunk of the decomposition require values from the boundaries of neighboring chunks.
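The second situation, where updates for one chunk need boundary values from neighboring chunks, can be illustrated with a toy 1-D stencil. This sketch is an assumed Jacobi-style averaging update, not an algorithm from the text; the point is that interior points of a chunk touch only task-local data, while each chunk's edge points read a neighboring chunk's boundary value.

```python
# Illustrative 1-D stencil update over a decomposition into chunks.
# Interior points read only task-local data; the first and last point of
# each chunk read a boundary value owned by a neighboring chunk.
def update_chunks(data, nchunks):
    n = len(data)
    size = n // nchunks  # assume nchunks evenly divides the data for simplicity
    new = data[:]
    for c in range(nchunks):
        lo, hi = c * size, (c + 1) * size
        for i in range(lo, hi):
            left = data[i - 1] if i > 0 else data[i]    # neighbor chunk's edge
            right = data[i + 1] if i < n - 1 else data[i]
            new[i] = (left + data[i] + right) / 3.0     # three-point average
    return new

print(update_chunks([0.0, 0.0, 3.0, 3.0], 2))  # [0.0, 1.0, 2.0, 3.0]
```

In a distributed-memory implementation, those edge reads become the "ghost cell" exchange between neighboring UEs.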
A barrier is a synchronization construct that defines a point in a program that a group of UEs must all reach before any of them are allowed to proceed.
Solution

The first step is to identify data that is shared among tasks.

This is most obvious when the decomposition is predominantly a data-based decomposition. For example, in a finite difference problem, the basic data is decomposed into blocks. The nature of the decomposition dictates that the data at the edges of the blocks is shared between neighboring blocks. In essence, the data sharing was worked out when the basic decomposition was done.

In a decomposition that is predominantly task-based, the situation is more complex. At some point in the definition of tasks, it was determined how data is passed into or out of the task and whether any data is updated in the body of the task. These are the sources of potential data sharing.

After the shared data has been identified, it needs to be analyzed to see how it is used. Shared data falls into one of the following three categories.
* Read-only. The data is read but not written. Because it is not modified, access to these values does not need to be protected. On some distributed-memory systems, it is worthwhile to replicate the read-only data so each unit of execution has its own copy.
* Effectively-local. The data is partitioned into subsets, each of which is accessed (for read or write) by only one of the tasks. (An example of this would be an array shared among tasks in such a way that its elements are effectively partitioned into sets of task-local data.) This case provides some options for handling the dependencies. If the subsets can be accessed independently (as would normally be the case with, say, array elements, but not necessarily with list elements), then it is not necessary to worry about protecting access to this data. On distributed-memory systems, such data would usually be distributed among UEs, with each UE having only the data needed by its tasks. If necessary, the data can be recombined into a single data structure at the end of the computation.
* Read-write. The data is both read and written and is accessed by more than one task. This is the general case, and includes arbitrarily complicated situations in which data is read from and written to by any number of tasks. It is the most difficult to deal with, because any access to the data (read or write) must be protected with some type of exclusive-access mechanism (locks, semaphores, etc.), which can be very expensive.

Two special cases of read-write data are common enough to deserve special mention:

* Accumulate. The data is being used to accumulate a result (for example, when computing a reduction). For each location in the shared data, the values are updated by multiple tasks, with the update taking place through some sort of associative accumulation operation. The most common accumulation operations are sum, minimum, and maximum, but any associative operation on pairs of operands can be used. For such data, each task (or, usually, each UE) has a separate copy; the accumulations occur into these local copies, which are then accumulated into a single global copy as a final step at the end of the accumulation.
* Multiple-read/single-write. The data is read by multiple tasks (all of which need its initial value), but modified by only one task (which can read and write its value arbitrarily often). Such variables occur frequently in algorithms based on data decompositions. For data of this type, at least two copies are needed, one to preserve the initial value and one to be used by the task that modifies it.
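The accumulate case can be sketched as follows. This is an illustrative Python fragment, not code from the text: each task sums into its own local copy, and the local copies are combined in a single reduction step at the end, so no lock is needed on every update.

```python
from concurrent.futures import ThreadPoolExecutor

# Accumulate-category sketch: per-task local accumulators, combined once
# at the end by an associative operation (here, sum).
def partial_sum(chunk):
    local = 0           # task-local copy of the accumulator
    for x in chunk:
        local += x      # no protection needed: only this task touches it
    return local

data = list(range(100))
chunks = [data[i::4] for i in range(4)]  # interleaved decomposition into 4 chunks
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))  # the final reduction step
print(total)  # 4950
```

The same shape (local copies plus a final reduction) is what a message-passing reduction does across UEs.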
The data sharing in this problem can be complicated. We summarize the data shared between groups in Fig. 3.5. The major shared data items are the following.
* The force array, used by each group except for the neighbor list update. This array is used as read-only data by the position update group and as accumulate data for the bonded and nonbonded force groups. Because the position update group must follow the force computations (as determined using the Order Tasks pattern), we can put this array in the accumulate category for the force groups and in the read-only category for the position update group.

The standard procedure for molecular dynamics simulations [MR95] begins by initializing the force array as a local array on each UE. Contributions to elements of the force array are then computed by each UE, with the precise terms computed being unpredictable because of the way the molecule folds in space. After all the forces have been computed, the local arrays are reduced into a single array, a copy of which is placed on each UE (see the discussion of reduction in Sec. 6.4.2 for more information).
Figure 3.5. Data sharing in molecular dynamics. We distinguish between sharing for reads, read-writes, and accumulations.
It is these four items that will guide the designer's work in the next design space (the Algorithm Structure patterns). Therefore, getting these items right and finding the best problem decomposition is important for producing a high-quality design.

In some cases, the concurrency is straightforward and there is clearly a single best way to decompose a problem. More often, however, multiple decompositions are possible. Hence, it is important before proceeding too far into the design process to evaluate the emerging design and make sure it meets the application's needs. Remember that algorithm design is an inherently iterative process, and designers should not expect to produce an optimum design on the first pass through the Finding Concurrency patterns.

Forces

The design needs to be evaluated from three perspectives.
* Suitability for the target platform. Issues such as the number of processors and how data structures are shared will influence the efficiency of any design, but the more the design depends on the target architecture, the less flexible it will be.
* Design quality. Simplicity, flexibility, and efficiency are all desirable, but possibly conflicting, attributes.
* Preparation for the next phase of the design. Are the tasks and dependencies regular or irregular (that is, are they similar in size, or do they vary)? Is the interaction between tasks synchronous or asynchronous (that is, do the interactions occur at regular intervals or at highly variable or even random times)? Are the tasks aggregated in an effective way? Understanding these issues will help choose an appropriate solution from the patterns in the Algorithm Structure design space.
Although it is desirable to delay mapping a program onto a particular target platform as long as possible, the characteristics of the target platform do need to be considered at least minimally while evaluating a design. Following are some issues relevant to the choice of target platform or platforms.

How many PEs are available? With some exceptions, having many more tasks than PEs makes it easier to keep all the PEs busy. Obviously we can't make use of more PEs than we have tasks, but having only one or a few tasks per PE can lead to poor load balance. For example, consider the case of a Monte Carlo simulation in which a calculation is repeated over and over for different sets of randomly chosen data, such that the time taken for the calculation varies considerably depending on the data. A natural approach to developing a parallel algorithm would be to treat each calculation (for a separate set of data) as a task; these tasks are then completely independent and can be scheduled however we like. But because the time for each task can vary considerably, unless there are many
more tasks than PEs, it will be difficult to achieve good load balance.

The exceptions to this rule are designs in which the number of tasks can be adjusted to fit the number of PEs in such a way that good load balance is maintained. An example of such a design is the block-based matrix multiplication algorithm described in the Examples section of the Data Decomposition pattern: Tasks correspond to blocks, and all the tasks involve roughly the same amount of computation, so adjusting the number of tasks to be equal to the number of PEs produces an algorithm with good load balance. (Note, however, that even in this case it might be advantageous to have more tasks than PEs. This might, for example, allow overlap of computation and communication.)

How are data structures shared among PEs? A design that involves large-scale or fine-grained data sharing among tasks will be easier to implement and more efficient if all tasks have access to the same memory. Ease of implementation depends on the programming environment; an environment based on a shared-memory model (all UEs share an address space) makes it easier to implement a design requiring extensive data sharing. Efficiency depends also on the target machine; a design involving extensive data sharing is likely to be more efficient on a symmetric multiprocessor (where access time to memory is uniform across processors) than on a machine that layers a shared-memory environment over physically distributed memory. In contrast, if the plan is to use a message-passing environment running on a distributed-memory architecture, a design involving extensive data sharing is probably not a good choice.

For example, consider the task-based approach to the medical imaging problem described in the Examples section of the Task Decomposition pattern. This design requires that all tasks have read access to a potentially very large data structure (the body model). This presents no problems in a shared-memory environment; it is also no problem in a distributed-memory environment in which each PE has a large memory subsystem and there is plenty of network bandwidth to handle broadcasting the large data set. However, in a distributed-memory environment with limited memory or network bandwidth, the more memory-efficient algorithm that emphasizes the data decomposition would be required.
A design that requires fine-grained data sharing (in which the same data structure is accessed repeatedly by many tasks, particularly when both reads and writes are involved) is also likely to be more efficient on a shared-memory machine, because the overhead required to protect each access is likely to be smaller than for a distributed-memory machine.

The exception to these principles would be a problem in which it is easy to group and schedule tasks in such a way that the only large-scale or fine-grained data sharing is among tasks assigned to the same unit of execution.

What does the target architecture imply about the number of UEs and how structures are shared among them? In essence, we revisit the preceding two questions, but in terms of UEs rather than PEs. This can be an important distinction to make if the target system depends on multiple UEs per PE to hide latency. There are two factors to keep in mind when considering whether a design using more than one UE per PE makes sense.

The first factor is whether the target system provides efficient support for multiple UEs per PE. Some systems do provide such support, such as the Cray MTA machines and machines built with Intel processors that utilize hyperthreading. This architectural approach provides hardware support for
extremely rapid context switching, making it practical to use in a far wider range of latency-hiding situations. Other systems do not provide good support for multiple UEs per PE. For example, an MPP system with slow context switching and/or one processor per node might run much better when there is only one UE per PE.

The second factor is whether the design can make good use of multiple UEs per PE. For example, if the design involves communication operations with high latency, it might be possible to mask that latency by assigning multiple UEs to each PE so some UEs can make progress while others are waiting on a high-latency operation. If, however, the design involves communication operations that are tightly synchronized (for example, pairs of blocking send/receives) and relatively efficient, assigning multiple UEs to each PE is more likely to interfere with ease of implementation (by requiring extra effort to avoid deadlock) than to improve efficiency.

On the target platform, will the time spent doing useful work in a task be significantly greater than the time taken to deal with dependencies? A critical factor in determining whether a design is effective is the ratio of time spent doing computation to time spent in communication or synchronization: The higher the ratio, the more efficient the program. This ratio is affected not only by the number and type of coordination events required by the design, but also by the characteristics of the target platform. For example, a message-passing design that is acceptably efficient on an MPP with a fast interconnect network and relatively slow processors will likely be less efficient, perhaps unacceptably so, on an Ethernet-connected network of powerful workstations.

Note that this critical ratio is also affected by problem size relative to the number of available PEs, because for a fixed problem size, the time spent by each processor doing computation decreases with the number of processors, while the time spent by each processor doing coordination might stay the same or even increase as the number of processors increases.
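The effect of this critical ratio on fixed-size problems can be made concrete with a small model. The numbers below are invented for illustration: total computation is fixed, per-PE coordination cost is assumed constant, and efficiency is computed as compute time over total time per PE.

```python
# Toy model of the computation-to-coordination ratio for a fixed problem size.
# As PEs are added, per-PE compute time shrinks while coordination cost
# (assumed constant here) does not, so efficiency falls.
def efficiency(total_compute, coord_per_pe, n_pes):
    compute_per_pe = total_compute / n_pes
    return compute_per_pe / (compute_per_pe + coord_per_pe)

# Hypothetical numbers: 1600 time units of work, 25 units of coordination per PE.
for p in (4, 16, 64):
    print(p, round(efficiency(total_compute=1600.0, coord_per_pe=25.0, n_pes=p), 2))
# 4 0.94
# 16 0.8
# 64 0.5
```

The same arithmetic explains why a design that scales well on a fast interconnect may scale poorly on a slow network: a larger `coord_per_pe` pushes the crossover to far fewer PEs.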
Design quality
* Is the decomposition flexible in the number of tasks generated? Such flexibility allows the design to be adapted to a wide range of parallel computers.
* Is the definition of tasks implied by the task decomposition independent of how they are scheduled for execution? Such independence makes the load-balancing problem easier to solve.
* Can the size and number of chunks in the data decomposition be parameterized? Such parameterization makes a design easier to scale for varying numbers of PEs.
* Does the algorithm handle the problem's boundary cases? A good design will handle all relevant cases, even unusual ones. For example, a common operation is to transpose a matrix so that a distribution in terms of blocks of matrix columns becomes a distribution in terms of
blocks of matrix rows. It is easy to write down the algorithm and code it for square matrices where the matrix order is evenly divided by the number of PEs. But what if the matrix is not square, or what if the number of rows is much greater than the number of columns and neither number is evenly divided by the number of PEs? This requires significant changes to the transpose algorithm. For a rectangular matrix, for example, the buffer that will hold the matrix block will need to be large enough to hold the larger of the two blocks. If either the row or column dimension of the matrix is not evenly divisible by the number of PEs, then the blocks will not be the same size on each PE. Can the algorithm deal with the uneven load that will result from having different block sizes on each PE?

Efficiency. The program should effectively utilize the available computing resources. The rest of this section gives a partial list of important factors to check. Note that typically it is not possible to simultaneously optimize all of these factors; design tradeoffs are inevitable.
* Can the computational load be evenly balanced among the PEs? This is easier if the tasks are independent, or if they are roughly the same size.
* Is the overhead minimized? Overhead can come from several sources, including creation and scheduling of the UEs, communication, and synchronization. Creation and scheduling of UEs involves overhead, so each UE needs to have enough work to do to justify this overhead. On the other hand, more UEs allow for better load balance.

Communication can also be a source of significant overhead, particularly on distributed-memory platforms that depend on message passing. As we discussed in Section 2.6, the time to transfer a message has two components: a latency cost, arising from operating-system overhead and message startup costs on the network, and a cost that scales with the length of the message. To minimize the latency costs, the number of messages to be sent should be kept to a minimum. In other words, a small number of large messages is better than a large number of small ones. The second term is related to the bandwidth of the network. These costs can sometimes be hidden by overlapping communication with computation.

On shared-memory machines, synchronization is a major source of overhead. When data is shared between UEs, dependencies arise requiring one task to wait for another to avoid race conditions. The synchronization mechanisms used to control this waiting are expensive compared to many operations carried out by a UE. Furthermore, some synchronization constructs generate significant memory traffic as they flush caches, buffers, and other system resources to make sure UEs see a consistent view of memory. This extra memory traffic can interfere with the explicit data movement within a computation. Synchronization overhead can be reduced by keeping data well localized to a task, thereby minimizing the frequency of synchronization operations.
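The two-component message cost can be checked with a quick calculation. The latency and bandwidth figures below are invented for illustration; the point is that latency is paid once per message, so sending the same data as one large message is much cheaper than as many small ones.

```python
# Two-component message cost model: time = n_messages * (alpha + beta * m),
# where alpha is per-message latency and beta is per-byte transfer cost.
# The alpha and beta values here are hypothetical, chosen only to illustrate.
def transfer_time(n_messages, bytes_per_message, alpha=1e-4, beta=1e-8):
    return n_messages * (alpha + beta * bytes_per_message)

total_bytes = 1_000_000
many_small = transfer_time(1000, total_bytes // 1000)  # 1000 messages of 1 KB
one_large = transfer_time(1, total_bytes)              # a single 1 MB message
print(many_small, one_large)       # 0.11 vs 0.0101 time units
print(many_small > one_large)      # True: latency dominates the small messages
```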
The problem decomposition carried out with the Finding Concurrency patterns defines the key components that will guide the design in the Algorithm Structure design space:
Before moving on in the design, consider these components relative to the following questions.

How regular are the tasks and their data dependencies? Regular tasks are similar in size and effort. Irregular tasks would vary widely among themselves. If the tasks are irregular, the scheduling of the tasks and their sharing of data will be more complicated and will need to be emphasized in the design.

In a regular decomposition, all the tasks are in some sense the same: roughly the same computation (on different sets of data), roughly the same dependencies on data shared with other tasks, etc. Examples include the various matrix multiplication algorithms described in the Examples sections of the Task Decomposition, Data Decomposition, and other patterns.

In an irregular decomposition, the work done by each task and/or the data dependencies vary among tasks. For example, consider a discrete-event simulation of a large system consisting of a number of distinct components. We might design a parallel algorithm for this simulation by defining a task for each component and having them interact based on the discrete events of the simulation. This would be a very irregular design in that there would be considerable variation among tasks with regard to work done and dependencies on other tasks.

Are interactions between tasks (or task groups) synchronous or asynchronous? In some designs, the interaction between tasks is also very regular with regard to time; that is, it is synchronous. For example, a typical approach to parallelizing a linear-algebra problem involving the update of a large matrix is to partition the matrix among tasks and have each task update its part of the matrix, using data from both its and other parts of the matrix. Assuming that all the data needed for the update is present at the start of the computation, these tasks will typically first exchange information and then compute independently. Another type of example is a pipeline computation (see the Pipeline pattern), in which we perform a multistep operation on a sequence of sets of input data by setting up an assembly line of tasks (one for each step of the operation), with data flowing from one task to the next as each task accomplishes its work. This approach works best if all of the tasks stay more or less in step; that is, if their interaction is synchronous.
In other designs, the interaction between tasks is not so chronologically regular. An example is the discrete-event simulation described previously, in which the events that lead to interaction between tasks can be chronologically irregular.

Are the tasks grouped in the best way? The temporal relations are easy: Tasks that can run at the same time are naturally grouped together. But an effective design will also group tasks together based on their logical relationship in the overall problem.

As an example of grouping tasks, consider the molecular dynamics problem discussed in the Examples sections of the Group Tasks, Order Tasks, and Data Sharing patterns. The grouping we eventually arrive at (in the Group Tasks pattern) is hierarchical: groups of related tasks based on the high-level operations of the problem, further grouped on the basis of which ones can execute concurrently. Such an approach makes it easier to reason about whether the design meets the necessary constraints (because the constraints can be stated in terms of the task groups defined by the high-level operations) while allowing for scheduling flexibility.
3.8. SUMMARY
Working through the patterns in the Finding Concurrency design space exposes the concurrency in your problem. The key elements following from that analysis are
4.1. INTRODUCTION
The first phase of designing a parallel algorithm consists of analyzing the problem to identify exploitable concurrency, usually by using the patterns of the Finding Concurrency design space. The output from the Finding Concurrency design space is a decomposition of the problem into design elements:
The key issue at this stage is to decide which pattern or patterns are most appropriate for the problem.
* Efficiency. It is crucial that a parallel program run quickly and make good use of the computer resources.
* Simplicity. A simple algorithm resulting in easy-to-understand code is easier to develop, debug, verify, and modify.
* Portability. Ideally, programs should run on the widest range of parallel computers. This will maximize the "market" for a particular program. More importantly, a program is used for many years, while any particular computer system is used for only a few years. Portable programs protect a software investment.
* Scalability. Ideally, an algorithm should be effective on a wide range of numbers of processing elements (PEs), from a few up to hundreds or even thousands.
These forces conflict in several ways, however.

Efficiency conflicts with portability: Making a program efficient almost always requires that the code take into account the characteristics of the specific system on which it is intended to run, which limits portability. A design that makes use of the special features of a particular system or programming environment may lead to an efficient program for that particular environment, but be unusable for a different platform, either because it performs poorly or because it is difficult or even impossible to implement for the new platform.

Efficiency also can conflict with simplicity: For example, to write efficient programs that use the Task Parallelism pattern, it is sometimes necessary to use complicated scheduling algorithms. These algorithms in many cases, however, make the program very difficult to understand.

Thus, a good algorithm design must strike a balance between (1) abstraction and portability and (2) suitability for a particular target architecture. The challenge faced by the designer, especially at this early phase of the algorithm design, is to leave the parallel algorithm design abstract enough to support portability while ensuring that it can eventually be implemented effectively for the parallel systems on which it will be executed.
an ideal world, however, and software designed without considering the major features of the target platform is unlikely to run efficiently.

The primary issue is how many units of execution (UEs) the system will effectively support, because an algorithm that works well for ten UEs may not work well for hundreds of UEs. It is not necessary to decide on a specific number (in fact, to do so would overly constrain the applicability of the design), but it is important to have in mind at this point an order of magnitude for the number of UEs.

Another issue is how expensive it is to share information among UEs. If there is hardware support for shared memory, information exchange takes place through shared access to common memory, and frequent data sharing makes sense. If the target is a collection of nodes connected by a slow network, however, the communication required to share information is very expensive and must be avoided wherever possible.

When thinking about both of these issues (the number of UEs and the cost of sharing information), avoid the tendency to overconstrain the design. Software typically outlives hardware, so over the course of a program's life it may be used on a tremendous range of target platforms. The goal is to obtain a design that works well on the original target platform, but at the same time is flexible enough to adapt to different classes of hardware.

Finally, in addition to multiple UEs and some way to share information among them, a parallel computer has one or more programming environments that can be used to implement parallel algorithms. Different programming environments provide different ways to create tasks and share information among UEs, and a design that does not map well onto the characteristics of the target programming environment will be difficult to implement.

4.2.2. Major Organizing Principle

When considering the concurrency in the problem, is there a particular way of looking at it that stands out and provides a high-level mechanism for organizing this concurrency?
The analysis carried out using the patterns of the Finding Concurrency design space describes the potential concurrency in terms of tasks and groups of tasks, data (both shared and task-local), and ordering constraints among task groups. The next step is to find an algorithm structure that represents how this concurrency maps onto the UEs. There is usually a major organizing principle implied by the concurrency. This usually falls into one of three camps: organization by tasks, organization by data decomposition, and organization by flow of data. We now consider each of these in more detail.

For some problems, there is really only one group of tasks active at one time, and the way the tasks within this group interact is the major feature of the concurrency. Examples include so-called embarrassingly parallel programs in which the tasks are completely independent, as well as programs in which the tasks in a single group cooperate to compute a result.

For other problems, the way data is decomposed and shared among tasks stands out as the major way to organize the concurrency. For example, many problems focus on the update of a few large data structures, and the most productive way to think about the concurrency is in terms of how this structure is decomposed and distributed among UEs. Programs to solve differential equations or carry out linear algebra computations often fall into this category because they are frequently based on updating large data structures.
Finally, for some problems, the major feature of the concurrency is the presence of well-defined interacting groups of tasks, and the key issue is how the data flows among the tasks. For example, in a signal-processing application, data may flow through a sequence of tasks organized as a pipeline, each performing a transformation on successive data elements. Or a discrete-event simulation might be parallelized by decomposing it into tasks interacting via "events". Here, the major feature of the concurrency is the way in which these distinct task groups interact.

Notice also that the most effective parallel algorithm design might make use of multiple algorithm structures (combined hierarchically, compositionally, or in sequence), and this is the point at which to consider whether such a design makes sense. For example, it often happens that the very top level of the design is a sequential composition of one or more Algorithm Structure patterns. Other designs might be organized hierarchically, with one pattern used to organize the interaction of the major task groups and other patterns used to organize tasks within the groups; for example, an instance of the Pipeline pattern in which individual stages are instances of the Task Parallelism pattern.

4.2.3. The Algorithm Structure Decision Tree

For each subset of tasks, which Algorithm Structure design pattern most effectively defines how to map the tasks onto UEs?
Having considered the questions raised in the preceding sections, we are now ready to select an algorithm structure, guided by an understanding of constraints imposed by the target platform, an appreciation of the role of hierarchy and composition, and a major organizing principle for the problem. The decision is guided by the decision tree shown in Fig. 4.2. Starting at the top of the tree, consider the concurrency and the major organizing principle, and use this information to select one of the three branches of the tree; then follow the upcoming discussion for the appropriate subtree. Notice again that for some problems, the final design might combine more than one algorithm structure: If no single structure seems suitable, it might be necessary to divide the tasks making up the problem into two or more groups, work through this procedure separately for each group, and then determine how to combine the resulting algorithm structures.
Figure 4.2. Decision tree for the Algorithm Structure design space
Organize By Tasks
Select the Organize By Flow of Data branch when the major organizing principle is how the flow of data imposes an ordering on the groups of tasks. This pattern group has two members, one that applies when this ordering is regular and static and one that applies when it is irregular and/or dynamic. Choose the Pipeline pattern when the flow of data among task groups is regular, one-way, and does not change during the algorithm (that is, the task groups can be arranged into a pipeline through which the data flows). Choose the Event-Based Coordination pattern when the flow of data is irregular, dynamic, and/or unpredictable (that is, when the task groups can be thought of as interacting via asynchronous events).

4.2.4. Re-evaluation

Is the Algorithm Structure pattern (or patterns) suitable for the target platform? It is important to frequently review decisions made so far to be sure the chosen pattern(s) are a good fit with the target platform.

After choosing one or more Algorithm Structure patterns to be used in the design, skim through their descriptions to be sure they are reasonably suitable for the target platform. (For example, if the target platform consists of a large number of workstations connected by a slow network, and one of the chosen Algorithm Structure patterns requires frequent communication among tasks, it might be difficult to implement the design efficiently.) If the chosen patterns seem wildly unsuitable for the target platform, try identifying a secondary organizing principle and working through the preceding
step again.
4.3. EXAMPLES
4.3.1. Medical Imaging

For example, consider the medical imaging problem described in Sec. 3.1.3. This application simulates a large number of gamma rays as they move through a body and out to a camera. One way to describe the concurrency is to define the simulation of each ray as a task. Because they are all logically equivalent, we put them into a single task group. The only data shared among the tasks is a large data structure representing the body, and since access to this data structure is read-only, the tasks do not depend on each other.

Because there are many independent tasks for this problem, it is less necessary than usual to consider the target platform: The large number of tasks should mean that we can make effective use of any (reasonable) number of UEs; the independence of the tasks should mean that the cost of sharing information among UEs will not have much effect on performance.

Thus, we should be able to choose a suitable structure by working through the decision tree shown previously in Fig. 4.2. Given that in this problem the tasks are independent, the only issue we really need to worry about as we select an algorithm structure is how to map these tasks onto UEs. That is, for this problem, the major organizing principle seems to be the way the tasks are organized, so we start by following the Organize By Tasks branch.

We now consider the nature of our set of tasks: whether they are arranged hierarchically or reside in an unstructured or flat set. For this problem, the tasks are in an unstructured set with no obvious hierarchical structure among them, so we choose the Task Parallelism pattern. Note that in this problem, the tasks are independent, a fact that we will be able to use to simplify the solution.

Finally, we review this decision in light of possible target-platform considerations. As we observed earlier, the key features of this problem (the large number of tasks and their independence) make it unlikely that we will need to reconsider on the grounds that the chosen structure would be difficult to implement on the target platform.

4.3.2. Molecular Dynamics

As a second example, consider the molecular dynamics problem described in Sec. 3.1.3. In the Task Decomposition pattern, we identified the following groups of tasks associated with this problem:
The tasks within each group are expressed as the iterations of a loop over the atoms within the molecular system.

We can choose a suitable algorithm structure by working through the decision tree shown earlier in Fig. 4.2. One option is to organize the parallel algorithm in terms of the flow of data among the groups of tasks. Note that only the first three task groups (the vibrational, rotational, and nonbonded force calculations) can execute concurrently; that is, they must finish computing the forces before the atomic positions, velocities, and neighbor lists can be updated. This is not very much concurrency to work with, so a different branch in Fig. 4.2 should be used for this problem.

Another option is to derive exploitable concurrency from the set of tasks within each group, in this case the iterations of a loop over atoms. This suggests an organization by tasks with a linear arrangement of tasks, or, based on Fig. 4.2, the Task Parallelism pattern should be used. Total available concurrency is large (on the order of the number of atoms), providing a great deal of flexibility in designing the parallel algorithm.

The target machine can have a major impact on the parallel algorithm for this problem. The dependencies discussed in the Data Decomposition pattern (replicated coordinates on each UE and a combination of partial sums from each UE to compute a global force array) suggest that on the order of 2 to 3 N terms (where N is the number of atoms) will need to be passed among the UEs. The computation, however, is of order nN, where n is the number of atoms in the neighborhood of each atom and considerably less than N. Hence, the communication and computation are of the same order, and management of communication overhead will be a key factor in designing the algorithm.
* Ray-tracing codes such as the medical imaging example described in the Task Decomposition pattern: Here the computation associated with each "ray" becomes a separate and completely independent task.
* The molecular dynamics example described in the Task Decomposition pattern: The update of the nonbonded force on each atom is a task. The dependencies among tasks are managed by replicating the force array on each UE to hold the partial sums for each atom. When all the tasks have completed their contributions to the nonbonded force, the individual force arrays are combined (or "reduced") into a single array holding the full summation of nonbonded forces for each atom.
* Branch-and-bound computations, in which the problem is solved by repeatedly removing a solution space from a list of such spaces, examining it, and either declaring it a solution, discarding it, or dividing it into smaller solution spaces that are then added to the list of spaces to examine. Such computations can be parallelized using this pattern by making each "examine and process a solution space" step a separate task. The tasks weakly depend on each other through the shared queue of tasks.
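The branch-and-bound shape (a pool of solution spaces that tasks examine, discard, or subdivide) can be sketched with a toy problem. This sequential driver is only an illustration of the control structure; in a parallel version, multiple tasks would pull spaces from the shared queue concurrently. The "solution space" here is an invented interval search for a target value.

```python
from collections import deque

# Branch-and-bound control structure: repeatedly remove a solution space
# from the pool, examine it, and declare, discard, or subdivide it.
def branch_and_bound(lo, hi, target):
    spaces = deque([(lo, hi)])           # the shared pool of spaces to examine
    while spaces:
        a, b = spaces.popleft()          # one "examine a space" task
        if a == b:
            if a == target:
                return a                 # declare it a solution
            continue                     # discard the space
        mid = (a + b) // 2
        if a <= target <= mid:           # subdivide: keep only the promising half
            spaces.append((a, mid))
        else:
            spaces.append((mid + 1, b))
    return None                          # pool exhausted, no solution

print(branch_and_bound(0, 100, 37))  # 37
```

Note that tasks arise dynamically here: each subdivision adds new spaces (tasks) to the pool, and the computation can stop as soon as an acceptable solution is declared.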
The common factor is that the problem can be decomposed into a collection of tasks that can execute concurrently. The tasks can be completely independent (as in the medical imaging example), or there can be dependencies among them (as in the molecular dynamics example). In most cases, the tasks will be associated with iterations of a loop, but it is possible to associate them with larger-scale program structures as well.

In many cases, all of the tasks are known at the beginning of the computation (the first two examples). However, in some cases, tasks arise dynamically as the computation unfolds, as in the branch-and-bound example.

Also, while it is usually the case that all tasks must be completed before the problem is done, for some problems, it may be possible to reach a solution without completing all of the tasks. For example, in the branch-and-bound example, we have a pool of tasks corresponding to solution spaces to be searched, and we might find an acceptable solution before all the tasks in this pool have been completed.

Forces
Ideally, the tasks into which the problem is decomposed should meet two criteria: First, there should be at least as many tasks as UEs, and preferably many more, to allow greater flexibility in scheduling. Second, the computation associated with each task must be large enough to offset the overhead associated with managing the tasks and handling any dependencies. If the initial decomposition does not meet these criteria, it is worthwhile to consider whether there is another way of decomposing the problem into tasks that does meet the criteria.

For example, in image-processing applications where each pixel update is independent, the task definition can be individual pixels, image lines, or even whole blocks in the image. On a system with a small number of nodes connected by a slow network, tasks should be large to offset high communication latencies, so basing tasks on blocks of the image is appropriate. The same problem on a system containing a large number of nodes connected by a fast (low-latency) network, however, would need smaller tasks to make sure enough work exists to keep all the UEs occupied. Notice that this imposes a requirement for a fast network, because otherwise the smaller amount of work per task will not be enough to compensate for communication overhead.
Dependencies
Dependencies among tasks have a major impact on the emerging algorithm design. There are two categories of dependencies: ordering constraints and dependencies related to shared data.

For this pattern, ordering constraints apply to task groups and can be handled by forcing the groups to execute in the required order. For example, in a task-parallel multidimensional Fast Fourier Transform, there is a group of tasks for each dimension of the transform, and synchronization or other program constructs are used to make sure computation on one dimension completes before the next dimension begins. Alternatively, we could simply think of such a problem as a sequential composition of task-parallel computations, one for each task group.

Shared-data dependencies are potentially more complicated. In the simplest case, there are no dependencies among the tasks. A surprisingly large number of problems can be cast into this form. Such problems are often called embarrassingly parallel. Their solutions are among the simplest of parallel programs; the main considerations are how the tasks are defined (as discussed previously) and scheduled (as discussed later). When data is shared among tasks, the algorithm can be much more complicated, although there are still some common cases that can be dealt with relatively easily. We can categorize dependencies as follows.
carried dependency. For example, consider the following simple loop:
    int ii = 0, jj = 0;
    for (int i = 0; i < N; i++) {
        ii = ii + 1;
        d[ii] = big_time_consuming_work(ii);
        jj = jj + i;
        a[jj] = other_big_calc(jj);
    }
"Separable"dependencies.Whenthedependenciesinvolveaccumulationintoashareddata structure,theycanbeseparatedfromthetasks("pulledoutsidetheconcurrentcomputation") byreplicatingthedatastructureatthebeginningofthecomputation,executingthetasks,and thencombiningthecopiesintoasingledatastructureafterthetaskscomplete.Oftenthe accumulationisareductionoperation,inwhichacollectionofdataelementsisreducedtoa singleelementbyrepeatedlyapplyingabinaryoperationsuchasadditionormultiplication. Inmoredetail,thesedependenciescanbemanagedasfollows:Acopyofthedatastructure usedintheaccumulationiscreatedoneachUE.Eachcopyisinitialized(inthecaseofa reduction,totheidentityelementforthebinaryoperationforexample,zeroforadditionand oneformultiplication).Eachtaskthencarriesouttheaccumulationintoitslocaldata structure,eliminatingtheshareddatadependency.Whenalltasksarecomplete,thelocaldata structuresoneachUEarecombinedtoproducethefinalglobalresult(inthecaseofa reduction,byapplyingthebinaryoperationagain).Asanexample,considerthefollowingloop tosumtheelementsofarrayf:
    for (int i = 0; i < N; i++) {
        sum = sum + f(i);
    }
Shared Data pattern.
Schedule
Two classes of schedules are used in parallel algorithms: static schedules, in which the distribution of tasks among UEs is determined at the start of the computation and does not change; and dynamic schedules, in which the distribution of tasks among UEs varies as the computation proceeds.

In a static schedule, the tasks are associated into blocks and then assigned to UEs. Block size is adjusted so each UE takes approximately the same amount of time to complete its tasks. In most applications using a static schedule, the computational resources available from the UEs are predictable and stable over the course of the computation, with the most common case being UEs that are identical (that is, the computing system is homogeneous). If the set of times required to complete each task is narrowly distributed about a mean, the sizes of the blocks should be proportional to the relative performance of the UEs (so, in a homogeneous system, they are all the same size). When the
effort associated with the tasks varies considerably, a static schedule can still be useful, but now the number of blocks assigned to UEs must be much greater than the number of UEs. By dealing out the blocks in a round-robin manner (much as a deck of cards is dealt among a group of card players), the load is balanced statistically.

Dynamic schedules are used when (1) the effort associated with each task varies widely and is unpredictable and/or (2) the capabilities of the UEs vary widely and unpredictably. The most common approach used for dynamic load balancing is to define a task queue to be used by all the UEs; when a UE completes its current task and is therefore ready to process more work, it removes a task from the task queue. Faster UEs or those receiving lighter-weight tasks will access the queue more often and thereby be assigned more tasks.

Another dynamic scheduling strategy uses work stealing, which works as follows. The tasks are distributed among the UEs at the start of the computation. Each UE has its own work queue. When the queue is empty, the UE will try to steal work from the queue on some other UE (where the other UE is usually randomly selected). In many cases, this produces an optimal dynamic schedule without incurring the overhead of maintaining a single global queue. In programming environments or packages that provide support for the construct, such as Cilk [BJK+96], Hood [BP99], or the FJTask framework [Lea00b, Lea], it is straightforward to use this approach. But with more commonly used programming environments such as OpenMP, MPI, or Java (without support such as the FJTask framework), this approach adds significant complexity and therefore is not often used.
Many task-parallel problems can be considered to be loop-based. Loop-based problems are, as the name implies, those in which the tasks are based on the iterations of a loop. The best solutions for such problems use the Loop Parallelism pattern. This pattern can be particularly simple to implement in programming environments that provide directives for automatically assigning loop iterations to UEs. For example, in OpenMP a loop can be parallelized by simply adding a "parallel for" directive with an appropriate schedule clause (one that maximizes efficiency). This solution is especially attractive because OpenMP then guarantees that the resulting program is semantically equivalent to the analogous sequential code (within roundoff error associated with different orderings of floating-point operations).

For problems in which the target platform is not a good fit with the Loop Parallelism pattern, or for problems in which the model of "all tasks known initially, all tasks must complete" does not apply (either because tasks can be created during the computation or because the computation can terminate without all tasks being complete), this straightforward approach is not the best choice. Instead, the best design makes use of a task queue; tasks are placed on the task queue as they are created and
removed by UEs until the computation is complete. The overall program structure can be based on either the Master/Worker pattern or the SPMD pattern. The former is particularly appropriate for problems requiring a dynamic schedule.

In the case in which the computation can terminate before all the tasks are complete, some care must be taken to ensure that the computation ends when it should. If we define the termination condition as the condition that, when true, means the computation is complete (either all tasks are complete or some other condition holds; for example, an acceptable solution has been found by one task), then we want to be sure that (1) the termination condition is eventually met (which, if tasks can be created dynamically, might mean building into it a limit on the total number of tasks created), and (2) when the termination condition is met, the program ends. How to ensure the latter is discussed in the Master/Worker and SPMD patterns.
Common idioms
Most problems for which this pattern is applicable fall into the following two categories.

Embarrassingly parallel problems are those in which there are no dependencies among the tasks. A wide range of problems fall into this category, ranging from rendering frames in a motion picture to statistical sampling in computational physics. Because there are no dependencies to manage, the focus is on scheduling the tasks to maximize efficiency. In many cases, it is possible to define schedules that automatically and dynamically balance the load among UEs.

Replicated data or reduction problems are those in which dependencies can be managed by "separating them from the tasks" as described earlier: replicating the data at the beginning of computation and combining results when the termination condition is met (usually "all tasks complete"). For these problems, the overall solution consists of three phases: one to replicate the data into local variables, one to solve the now-independent tasks (using the same techniques used for embarrassingly parallel problems), and one to recombine the results into a single result.

Examples

We will consider two examples of this pattern. The first example, an image-construction example, is embarrassingly parallel. The second example will build on the molecular dynamics example used in several of the Finding Concurrency patterns.
Image construction
where C and Z are complex numbers and the recurrence is started with Z0 = C. The image plots the
imaginary part of C on the vertical axis and the real part on the horizontal axis. The color of each pixel is black if the recurrence relation converges to a stable value, or is colored depending on how rapidly the relation diverges.

At the lowest level, the task is the update for a single pixel. First consider computing this set on a cluster of PCs connected by an Ethernet. This is a coarse-grained system; that is, the rate of communication is slow relative to the rate of computation. To offset the overhead incurred by the slow network, the task size needs to be large; for this problem, that might mean computing a full row of the image. The work involved in computing each row varies depending on the number of divergent pixels in the row. The variation, however, is modest and distributed closely around a mean value. Therefore, a static schedule with many more tasks than UEs will likely give an effective statistical balance of the load among nodes.

The remaining step in applying the pattern is choosing an overall structure for the program. On a shared-memory machine using OpenMP, the Loop Parallelism pattern described in the Supporting Structures design space is a good fit. On a network of workstations running MPI, the SPMD pattern (also in the Supporting Structures design space) is appropriate.

Before moving on to the next example, we consider one more target system: a cluster in which the nodes are not homogeneous; that is, some nodes are much faster than others. Assume also that the speed of each node may not be known when the work is scheduled. Because the time needed to compute the image for a row now depends both on the row and on which node computes it, a dynamic schedule is indicated. This in turn suggests that a general dynamic load-balancing scheme is indicated, which then suggests that the overall program structure should be based on the Master/Worker pattern.
Molecular dynamics
For our second example, we consider the computation of the nonbonded forces in a molecular dynamics computation. This problem is described in Sec. 3.1.3 and in [Mat95, PH95] and is used throughout the patterns in the Finding Concurrency design space. Pseudocode for this computation is shown in Fig. 4.4. The physics in this example is not relevant and is buried in code not shown here (the computation of the neighbors list and the force function). The basic computation structure is a loop over atoms, and then, for each atom, a loop over interactions with other atoms. The number of interactions per atom is computed separately when the neighbors list is determined. This routine (not shown here) computes the number of atoms within a radius equal to a preset cutoff distance. The neighbor list is also modified to account for Newton's third law: Because the force of atom i on atom j is the negative of the force of atom j on atom i, only half of the potential interactions need actually be computed. Understanding this detail is not important for understanding this example. The key is that this causes each loop over j to vary greatly from one atom to another, thereby greatly complicating the load-balancing problem. Indeed, for the purposes of this example, all that must really be understood is that calculating the force is an expensive operation and that the number of interactions per atom varies greatly. Hence, the computational effort for each iteration over i is difficult to predict in advance.
Figure 4.4. Pseudocode for the nonbonded computation in a typical molecular dynamics code

function non_bonded_forces (N, Atoms, neighbors, Forces)
    Int const N                     // number of atoms
    Array of Real :: atoms (3,N)    // 3D coordinates
    Array of Real :: forces (3,N)   // force in each dimension
    Array of List :: neighbors(N)   // atoms in cutoff volume
    Real :: forceX, forceY, forceZ

    loop [i] over atoms
        loop [j] over neighbors(i)
            forceX = non_bond_force(atoms(1,i), atoms(1,j))
            forceY = non_bond_force(atoms(2,i), atoms(2,j))
            forceZ = non_bond_force(atoms(3,i), atoms(3,j))
            force(1,i) += forceX;   force(1,j) -= forceX;
            force(2,i) += forceY;   force(2,j) -= forceY;
            force(3,i) += forceZ;   force(3,j) -= forceZ;
        end loop [j]
    end loop [i]
end function non_bonded_forces
Each component of the force term is an independent computation, meaning that each (i, j) pair is fundamentally an independent task. The number of atoms tends to be on the order of thousands, and squaring that gives a number of tasks that is more than enough for all but the largest parallel systems. Therefore, we can take the more convenient approach of defining a task as one iteration of the loop over i. The tasks, however, are not independent: The force array is read and written by each task. Inspection of the code shows that the arrays are only used to accumulate results from the computation, however. Thus, the full array can be replicated on each UE and the local copies combined (reduced) after the tasks complete.

After the replication is defined, the problem is embarrassingly parallel and the same approaches discussed previously apply. We will revisit this example in the Master/Worker, Loop Parallelism, and SPMD patterns. A choice among these patterns is normally made based on the target platforms.
Known uses
There are many application areas in which this pattern is useful, including the following.

Many ray-tracing programs use some form of partitioning, with individual tasks corresponding to scan lines in the final image [BKS91].

Applications written with coordination languages such as Linda are another rich source of examples of this pattern [BCM+91]. Linda [CG91] is a simple language consisting of only six operations that read and write an associative (that is, content-addressable) shared memory called a tuple space. The tuple space provides a natural way to implement a wide variety of shared-queue and master/worker algorithms.

Parallel computational chemistry applications also make heavy use of this pattern. In the quantum chemistry program GAMESS, the loops over two-electron integrals are parallelized with the task queue implied by the Nextval construct within TCGMSG. An early version of the distance geometry program DGEOM was parallelized with the master/worker form of this pattern. These
Forces
The traditional divide-and-conquer strategy is a widely useful approach to algorithm design. Sequential divide-and-conquer algorithms are almost trivial to parallelize based on the obvious exploitable concurrency.

As Fig. 4.5 suggests, however, the amount of exploitable concurrency varies over the life of the program. At the outermost level of the recursion (initial split and final merge), there is little or no exploitable concurrency, and the subproblems also contain split and merge sections. Amdahl's law (Chapter 2) tells us that the serial parts of a program can significantly constrain the speedup that can be achieved by adding more processors. Thus, if the split and merge computations are nontrivial compared to the amount of computation for the base cases, a program using this pattern might not be able to take advantage of large numbers of processors. Further, if there are many levels of recursion, the number of tasks can grow quite large, perhaps to the point that the overhead of managing the tasks overwhelms any benefit from executing them concurrently.

In distributed-memory systems, subproblems can be generated on one PE and executed by another, requiring data and results to be moved between the PEs. The algorithm will be more efficient if the amount of data associated with a computation (that is, the size of the parameter set and result for each subproblem) is small. Otherwise, large communication costs can dominate the performance.

In divide-and-conquer algorithms, the tasks are created dynamically as the computation proceeds, and in some cases, the resulting "task graph" will have an irregular and data-dependent structure. If this is the case, then the solution should employ dynamic load balancing.
Figure 4.6. Sequential pseudocode for the divide-and-conquer algorithm

    func solve returns Solution;     // a solution stage
    func baseCase returns Boolean;   // direct solution test
    func baseSolve returns Solution; // direct solution
    func merge returns Solution;     // combine subsolutions
    func split returns Problem[];    // split into subprobs
    Solution solve(Problem P) {
        if (baseCase(P))
            return baseSolve(P);
        else {
            Problem subProblems[N];
            Solution subSolutions[N];
            subProblems = split(P);
            for (int i = 0; i < N; i++)
                subSolutions[i] = solve(subProblems[i]);
            return merge(subSolutions);
        }
    }
A sequential divide-and-conquer algorithm has the structure shown in Fig. 4.6. The cornerstone of this structure is a recursively invoked function (solve()) that drives each stage in the solution. Inside solve, the problem is either split into smaller subproblems (using split()) or it is directly solved (using baseSolve()). In the classical strategy, recursion continues until the subproblems are simple enough to be solved directly, often with just a few lines of code each. However, efficiency can be improved by adopting the view that baseSolve() should be called when (1) the overhead of performing further splits and merges significantly degrades performance, or (2) the size of the problem is optimal for the target system (for example, when the data required for a baseSolve() fits entirely in cache).

The concurrency in a divide-and-conquer problem is obvious when, as is usually the case, the subproblems can be solved independently (and hence, concurrently). The sequential divide-and-conquer algorithm maps directly onto a task-parallel algorithm by defining one task for each invocation of the solve() function, as illustrated in Fig. 4.7. Note the recursive nature of the design, with each task in effect dynamically generating and then absorbing a task for each subproblem.
Figure 4.7. Parallelizing the divide-and-conquer strategy. Each dashed-line box represents a task.
Conceptually, this pattern follows a straightforward fork/join approach (see the Fork/Join pattern). One task splits the problem, then forks new tasks to compute the subproblems, waits until the subproblems are computed, and then joins with the subtasks to merge the results.

The easiest situation is when the split phase generates subproblems that are known to be about the same size in terms of needed computation. Then, a straightforward implementation of the fork/join strategy, mapping each task to a UE and stopping the recursion when the number of active subtasks is the same as the number of PEs, works well.

In many situations, the problem will not be regular, and it is best to create more, finer-grained tasks and use a master/worker structure to map tasks to units of execution. The implementation of this approach is described in detail in the Master/Worker pattern. The basic idea is to conceptually maintain a queue of tasks and a pool of UEs, typically one per PE. When a subproblem is split, the new tasks are placed in the queue. When a UE finishes a task, it obtains another one from the queue. In this way, all of the UEs tend to remain busy, and the solution shows a good load balance. Finer-grained tasks allow a better load balance at the cost of more overhead for task management.

Many parallel programming environments directly support the fork/join construct. For example, in OpenMP, we could easily produce a parallel application by turning the for loop of Fig. 4.6 into an OpenMP parallel for construct. Then the subproblems will be solved concurrently rather than
Mergesort is a well-known sorting algorithm based on the divide-and-conquer strategy, applied as follows to sort an array of N elements.
procedure recursively).
In the merge phase, the two (sorted) subarrays are recombined into a single sorted array.
where T1 and T2 are symmetric tridiagonal matrices (which can be diagonalized by recursive calls to the same procedure).
The merge phase recombines the diagonalizations of T1 and T2 into a diagonalization of T.
Details can be found in [DS87] or in [GL96].
Known uses
Any introductory algorithms text will have many examples of algorithms based on the divide-and-conquer strategy, most of which can be parallelized with this pattern.

Some algorithms frequently parallelized with this strategy include the Barnes-Hut [BH86] and Fast Multipole [GG90] algorithms used in N-body simulations; signal-processing algorithms, such as discrete Fourier transforms; algorithms for banded and tridiagonal linear systems, such as those found in the ScaLAPACK package [CD97, Sca]; and algorithms from computational geometry, such as convex hull and nearest neighbor.

A particularly rich source of problems that use the Divide and Conquer pattern is the FLAME project [GGHvdG01]. This is an ambitious project to recast linear algebra problems in recursive algorithms. The motivation is twofold. First, mathematically, these algorithms are naturally recursive; in fact, most pedagogical discussions of these algorithms are recursive. Second, these recursive algorithms have proven to be particularly effective at producing code that is both portable and highly optimized for the cache architectures of modern microprocessors.

Related Patterns

Just because an algorithm is based on a sequential divide-and-conquer strategy does not mean that it
must be parallelized with the Divide and Conquer pattern. A hallmark of this pattern is the recursive arrangement of the tasks, leading to a varying amount of concurrency and potentially high overheads on machines for which managing the recursion is expensive. If the recursive decomposition into subproblems can be reused, however, it might be more effective to do the recursive decomposition, and then use some other pattern (such as the Geometric Decomposition pattern or the Task Parallelism pattern) for the actual computation. For example, the first production-level molecular dynamics program to use the fast multipole method, PMD [Win95], used the Geometric Decomposition pattern to parallelize the fast multipole algorithm, even though the original fast multipole algorithm used divide and conquer. This worked because the multipole computation was carried out many times for each configuration of atoms.
Consider the multiplication of two square matrices (that is, compute C = AB). As discussed in
where at each step in the summation, we compute the matrix product AikBkj and add it to the running matrix sum.

This equation immediately implies a solution in terms of the Geometric Decomposition pattern; that is, one in which the algorithm is based on decomposing the data structure into chunks (square blocks here) that can be operated on concurrently.

To help visualize this algorithm more clearly, consider the case where we decompose all three matrices into square blocks, with each task "owning" corresponding blocks of A, B, and C. Each task will run through the sum over k to compute its block of C, with tasks receiving blocks from other tasks as needed. In Fig. 4.9, we illustrate two steps in this process, showing a block being updated (the solid block) and the matrix blocks required at two different steps (the shaded blocks), where blocks of the A matrix are passed across a row and blocks of the B matrix are passed around a column.
Figure 4.9. Data dependencies in the matrix-multiplication problem. Solid boxes indicate the "chunk" being updated (C); shaded boxes indicate the chunks of A (row) and B (column) required to update C at each of the two steps.
Forces
The granularity of the data decomposition has a significant impact on the efficiency of the program. In a coarse-grained decomposition, there are a smaller number of large chunks. This results in a smaller number of large messages, which can greatly reduce communication overhead. A fine-grained decomposition, on the other hand, results in a larger number of smaller chunks, in many cases leading to many more chunks than PEs. This results in a larger number of smaller messages (and hence increases communication overhead), but it greatly facilitates load balancing.

Although it might be possible in some cases to mathematically derive an optimum granularity for the data decomposition, programmers usually experiment with a range of chunk sizes to empirically determine the best size for a given system. This depends, of course, on the computational performance of the PEs and on the performance characteristics of the communication network. Therefore, the program should be implemented so that the granularity is controlled by parameters that can be easily changed at compile or runtime.

The shape of the chunks can also affect the amount of communication needed between tasks. Often, the data to share between tasks is limited to the boundaries of the chunks. In this case, the amount of shared information scales with the surface area of the chunks. Because the computation scales with the number of points within a chunk, it scales as the volume of the region. This surface-to-volume effect can be exploited to maximize the ratio of computation to communication. Therefore, higher-dimensional decompositions are usually preferred. For example, consider two different decompositions of an N by N matrix into four chunks. In one case, we decompose the problem into four column chunks of size N by N/4. In the second case, we decompose the problem into four square chunks of size N/2 by N/2. For the column-block decomposition, the surface area is 2N + 2(N/4), or 5N/2. For the square-chunk case, the surface area is 4(N/2), or 2N. Hence, the total amount of data that must be exchanged is less for the square-chunk decomposition.
In some cases, the preferred shape of the decomposition can be dictated by other concerns. It may be the case, for example, that existing sequential code can be more easily reused with a lower-dimensional decomposition, and the potential increase in performance is not worth the effort of reworking the code. Also, an instance of this pattern can be used as a sequential step in a larger computation. If the decomposition used in an adjacent step differs from the optimal one for this pattern in isolation, it may or may not be worthwhile to redistribute the data for this step. This is especially an issue in distributed-memory systems, where redistributing the data can require
significant communication that will delay the computation. Therefore, data decomposition decisions must take into account the capability to reuse sequential code and the need to interface with other steps in the computation. Notice that these considerations might lead to a decomposition that would be suboptimal under other circumstances.

Communication can often be more effectively managed by replicating the nonlocal data needed to update the data in a chunk. For example, if the data structure is an array representing the points on a mesh and the update operation uses a local neighborhood of points on the mesh, a common communication-management technique is to surround the data structure for the block with a ghost boundary to contain duplicates of data at the boundaries of neighboring blocks. So now each chunk has two parts: a primary copy owned by the UE (that will be updated directly) and zero or more ghost copies (also referred to as shadow copies). These ghost copies provide two benefits. First, their use may consolidate communication into potentially fewer, larger messages. On latency-sensitive networks, this can greatly reduce communication overhead. Second, communication of the ghost copies can be overlapped (that is, it can be done concurrently) with the update of parts of the array that don't depend on data within the ghost copy. In essence, this hides the communication cost behind useful computation, thereby reducing the observed communication overhead.

For example, in the case of the mesh-computation example discussed earlier, each of the chunks would be extended by one cell on each side. These extra cells would be used as ghost copies of the cells on the boundaries of the chunks. Fig. 4.10 illustrates this scheme.
Figure 4.10. A data distribution with ghost boundaries. Shaded cells are ghost copies; arrows point from primary copies to corresponding secondary copies.
A key factor in using this pattern correctly is ensuring that the nonlocal data required for the update operation is obtained before it is needed.

If all the data needed is present before the beginning of the update operation, the simplest approach is to perform the entire exchange before beginning the update, storing the required nonlocal data in a local data structure designed for that purpose (for example, the ghost boundary in a mesh computation). This approach is relatively straightforward to implement using either copying or message passing.

More sophisticated approaches in which computation and communication overlap are also possible. Such approaches are necessary if some data needed for the update is not initially available, and may improve performance in other cases as well. For example, in the example of a mesh computation, the exchange of ghost cells and the update of cells in the interior region (which do not depend on the ghost cells) can proceed concurrently. After the exchange is complete, the boundary layer (the values that do depend on the ghost cells) can be updated. On systems where communication and computation occur in parallel, the savings from such an approach can be significant. This is such a common feature of parallel algorithms that standard communication APIs (such as MPI) include whole classes of message-passing routines to overlap computation and communication. These are discussed in more
Updating the data structure is done by executing the corresponding tasks (each responsible for the update of one chunk of the data structures) concurrently. If all the needed data is present at the beginning of the update operation, and if none of this data is modified during the course of the update, parallelization is easier and more likely to be efficient.

If the required exchange of information has been performed before beginning the update operation, the update itself is usually straightforward to implement; it is essentially identical to the analogous update in an equivalent sequential program, particularly if good choices have been made about how to represent nonlocal data. If the exchange and update operations overlap, more care is needed to ensure that the update is performed correctly. If a system supports lightweight threads that are well integrated with the communication system, then overlap can be achieved via multithreading within a single task, with one thread computing while another handles communication. In this case, synchronization between the threads is required.

In some systems, for example MPI, nonblocking communication is supported by matching communication primitives: one to start the communication (without blocking), and the other (blocking) to complete the operation and use the results. For maximal overlap, communication should be started as soon as possible and completed as late as possible. Sometimes, operations can be reordered to allow more overlap without changing the algorithm semantics.
Data distribution and task scheduling
The final step in designing a parallel algorithm for a problem that fits this pattern is deciding how to map the collection of tasks (each corresponding to the update of one chunk) to UEs. Each UE can then be said to "own" a collection of chunks and the data they contain. Thus, we have a two-tiered scheme for distributing data among UEs: partitioning the data into chunks and then assigning these chunks to UEs. This scheme is flexible enough to represent a variety of popular schemes for distributing data among UEs.

In the simplest case, each task can be statically assigned to a separate UE; then all tasks can execute concurrently, and the intertask coordination needed to implement the exchange operation is straightforward. This approach is most appropriate when the computation times of the tasks are uniform and the exchange operation has been implemented to overlap communication and computation within each task.

The simple approach can lead to poor load balance in some situations, however. For example, consider a linear algebra problem in which elements of the matrix are successively eliminated as the
computation proceeds. Early in the computation, all the rows and columns of the matrix have numerous elements to work with, and decompositions based on assigning full rows or columns to UEs are effective. Later in the computation, however, rows or columns become sparse, the work per row becomes uneven, and the computational load becomes poorly balanced between UEs. The solution is to decompose the problem into many more chunks than there are UEs and to scatter them among the UEs with a cyclic or block-cyclic distribution. (Cyclic and block-cyclic distributions are discussed in the Distributed Array pattern.) Then, as chunks become sparse, there are (with high probability) other nonsparse chunks for any given UE to work on, and the load becomes well balanced. A rule of thumb is that one needs around ten times as many tasks as UEs for this approach to work well.

It is also possible to use dynamic load-balancing algorithms to periodically redistribute the chunks among the UEs to improve the load balance. These incur overhead that must be traded off against the improvement likely to occur from the improved load balance, as well as increased implementation costs. In addition, the resulting program is more complex than those that use one of the static methods. Generally, one should consider the (static) cyclic allocation strategy first.
Program structure
#include <stdio.h>
#include <stdlib.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]) {
    uk[0] = LEFTVAL; uk[NX-1] = RIGHTVAL;
    for (int i = 1; i < NX-1; ++i) uk[i] = 0.0;
    for (int i = 0; i < NX; ++i) ukp1[i] = uk[i];
}

void printValues(double uk[], int step) { /* NOT SHOWN */ }

int main(void) {
    /* pointers to arrays for two iterations of algorithm */
    double *uk = malloc(sizeof(double) * NX);
    double *ukp1 = malloc(sizeof(double) * NX);
    double *temp;
    double dx = 1.0/NX;
    double dt = 0.5*dx*dx;
    initialize(uk, ukp1);
    for (int k = 0; k < NSTEPS; ++k) {
        /* compute new values */
        for (int i = 1; i < NX-1; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* "copy" ukp1 to uk by swapping pointers */
        temp = ukp1; ukp1 = uk; uk = temp;
        printValues(uk, k);
    }
    return 0;
}
The schedule(static) clause decomposes the iterations of the parallel loop into one contiguous block per thread, with each block being approximately the same size. This schedule is important for Loop Parallelism programs implementing the Geometric Decomposition pattern. Good performance for most Geometric Decomposition problems (and mesh programs in particular) requires that the data in the processor's cache be used many times before it is displaced by data from new cache lines. Using large blocks of contiguous loop iterations increases the chance that multiple values fetched in a cache line will be utilized and that subsequent loop iterations are likely to find at least some of the required data in cache.

The last detail to discuss for the program in Fig. 4.12 is the synchronization required to safely copy the pointers. It is essential that all of the threads complete their work before the pointers they manipulate are swapped in preparation for the next iteration. In this program, this synchronization happens automatically due to the implied barrier (see Sec. 6.3.2) at the end of the parallel loop.

The program in Fig. 4.12 works well with a small number of threads. When large numbers of threads are involved, however, the overhead incurred by placing the thread creation and destruction inside the loop over k would be prohibitive. We can reduce thread-management overhead by splitting the parallel for directive into separate parallel and for directives and moving the thread creation outside the loop over k. This approach is shown in Fig. 4.13. Because the whole k loop is now inside a parallel region, we must be more careful about how data is shared between threads. The private clause causes the loop indices k and i to be local to each thread. The pointers uk and ukp1 are shared, however, so the swap operation must be protected. The easiest way to do this is to ensure that only one member of the team of threads does the swap. In OpenMP, this is most easily done by placing the update inside a single construct. As described in more detail in the OpenMP appendix, Appendix A, the first thread to encounter the construct will carry out the swap while the other threads wait at the end of the construct.
Figure 4.12. Parallel heat-diffusion program using OpenMP
#include <stdio.h>
#include <stdlib.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]) {
    uk[0] = LEFTVAL; uk[NX-1] = RIGHTVAL;
    for (int i = 1; i < NX-1; ++i) uk[i] = 0.0;
    for (int i = 0; i < NX; ++i) ukp1[i] = uk[i];
}

void printValues(double uk[], int step) { /* NOT SHOWN */ }

int main(void) {
    /* pointers to arrays for two iterations of algorithm */
    double *uk = malloc(sizeof(double) * NX);
    double *ukp1 = malloc(sizeof(double) * NX);
    double *temp;
    double dx = 1.0/NX;
    double dt = 0.5*dx*dx;
    initialize(uk, ukp1);
    for (int k = 0; k < NSTEPS; ++k) {
        /* compute new values */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < NX-1; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* "copy" ukp1 to uk by swapping pointers */
        temp = ukp1; ukp1 = uk; uk = temp;
        printValues(uk, k);
    }
    return 0;
}
MPI solution
An MPI-based program for this example is shown in Fig. 4.14 and Fig. 4.15. This program uses a data distribution with ghost cells and the SPMD pattern.

Each process is given a single chunk of the data domain of size NX/NP, where NX is the total size of the global data array and NP is the number of processes. For simplicity, we assume NX is evenly divided by NP.

The update of the chunk is straightforward and essentially identical to that of the sequential code. The greater length and complexity of this MPI program arise from two sources. First, the data initialization is more complex, because it must account for the data values at the edges of the first and last chunks. Second, message-passing routines are required inside the loop over k to exchange ghost cells.
Figure 4.13. Parallel heat-diffusion program using OpenMP. This version has less thread-management overhead.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]) { /* NOT SHOWN */ }
void printValues(double uk[], int step) { /* NOT SHOWN */ }

int main(void) {
    /* pointers to arrays for two iterations of algorithm */
    double *uk = malloc(sizeof(double) * NX);
    double *ukp1 = malloc(sizeof(double) * NX);
    double *temp;
    int i, k;
    double dx = 1.0/NX;
    double dt = 0.5*dx*dx;
    #pragma omp parallel private(k, i)
    {
        initialize(uk, ukp1);
        for (k = 0; k < NSTEPS; ++k) {
            #pragma omp for schedule(static)
            for (i = 1; i < NX-1; ++i) {
                ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
            }
            /* "copy" ukp1 to uk by swapping pointers */
            #pragma omp single
            {
                temp = ukp1; ukp1 = uk; uk = temp;
            }
        }
    }
    return 0;
}
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[], int numPoints,
                int numProcs, int myID) {
    for (int i = 1; i <= numPoints; ++i) uk[i] = 0.0;
    /* left endpoint */
    if (myID == 0) uk[1] = LEFTVAL;
    /* right endpoint */
    if (myID == numProcs-1) uk[numPoints] = RIGHTVAL;
    /* copy values to ukp1 */
    for (int i = 1; i <= numPoints; ++i) ukp1[i] = uk[i];
}

void printValues(double uk[], int step, int numPoints, int myID) { /* NOT SHOWN */ }

int main(int argc, char *argv[]) {
    /* pointers to arrays for two iterations of algorithm */
    double *uk, *ukp1, *temp;
    double dx = 1.0/NX;
    double dt = 0.5*dx*dx;
    int numProcs, myID, leftNbr, rightNbr, numPoints;
    MPI_Status status;

    /* MPI initialization */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myID);  // get own ID

    /* initialization of other variables */
    leftNbr = myID - 1;   // ID of left "neighbor" process
    rightNbr = myID + 1;  // ID of right "neighbor" process
    numPoints = (NX / numProcs);

    /* uk, ukp1 include a "ghost cell" at each end */
    uk = malloc(sizeof(double) * (numPoints+2));
    ukp1 = malloc(sizeof(double) * (numPoints+2));
    initialize(uk, ukp1, numPoints, numProcs, myID);
    /* continued in next figure */
    /* continued from Figure 4.14 */
    for (int k = 0; k < NSTEPS; ++k) {
        /* exchange boundary information */
        if (myID != 0)
            MPI_Send(&uk[1], 1, MPI_DOUBLE, leftNbr, 0, MPI_COMM_WORLD);
        if (myID != numProcs-1)
            MPI_Send(&uk[numPoints], 1, MPI_DOUBLE, rightNbr, 0, MPI_COMM_WORLD);
        if (myID != 0)
            MPI_Recv(&uk[0], 1, MPI_DOUBLE, leftNbr, 0, MPI_COMM_WORLD, &status);
        if (myID != numProcs-1)
            MPI_Recv(&uk[numPoints+1], 1, MPI_DOUBLE, rightNbr, 0,
                     MPI_COMM_WORLD, &status);
        /* compute new values for interior points */
        for (int i = 2; i < numPoints; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* compute new values for boundary points */
        if (myID != 0) {
            int i = 1;
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        if (myID != numProcs-1) {
            int i = numPoints;
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* "copy" ukp1 to uk by swapping pointers */
        temp = ukp1; ukp1 = uk; uk = temp;
        printValues(uk, k, numPoints, myID);
    }
    /* clean up and end */
    MPI_Finalize();
    return 0;
}
While the basic algorithm is the same, the communication is quite different. The immediate-mode communication routines, MPI_Isend and MPI_Irecv, are used to set up and then launch the communication events. These functions (described in more detail in the MPI appendix, Appendix B) return immediately. The update operations on the interior points can then take place because they don't depend on the results of the communication. We then call functions to wait until the communication is complete and update the edges of each UE's chunks using the results of the communication events. In this case, the messages are small in size, so it is unlikely that this version of the program would be any faster than our first one. But it is easy to imagine cases where large, complex communication events would be involved and being able to do useful work while the messages move across the computer network would result in significantly greater performance.
Figure 4.16. Parallel heat-diffusion program using MPI with overlapping communication/computation (continued from Fig. 4.14)
    /* continued */
    MPI_Request reqRecvL, reqRecvR, reqSendL, reqSendR;  // needed for
                                                         // nonblocking I/O
    for (int k = 0; k < NSTEPS; ++k) {
        /* initiate communication to exchange boundary information */
        if (myID != 0) {
            MPI_Irecv(&uk[0], 1, MPI_DOUBLE, leftNbr, 0,
                      MPI_COMM_WORLD, &reqRecvL);
            MPI_Isend(&uk[1], 1, MPI_DOUBLE, leftNbr, 0,
                      MPI_COMM_WORLD, &reqSendL);
        }
        if (myID != numProcs-1) {
            MPI_Irecv(&uk[numPoints+1], 1, MPI_DOUBLE, rightNbr, 0,
                      MPI_COMM_WORLD, &reqRecvR);
            MPI_Isend(&uk[numPoints], 1, MPI_DOUBLE, rightNbr, 0,
                      MPI_COMM_WORLD, &reqSendR);
        }
        /* compute new values for interior points */
        for (int i = 2; i < numPoints; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* wait for communication to complete */
        if (myID != 0) {
            MPI_Wait(&reqRecvL, &status); MPI_Wait(&reqSendL, &status);
        }
        if (myID != numProcs-1) {
            MPI_Wait(&reqRecvR, &status); MPI_Wait(&reqSendR, &status);
        }
        /* compute new values for boundary points */
        if (myID != 0) {
            int i = 1;
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        if (myID != numProcs-1) {
            int i = numPoints;
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* "copy" ukp1 to uk by swapping pointers */
        temp = ukp1; ukp1 = uk; uk = temp;
        printValues(uk, k, numPoints, myID);
    }
    /* clean up and end */
    MPI_Finalize();
    return 0;
}
#include <stdio.h>
#include <stdlib.h>

#define N 100
#define NB 4
#define blockstart(M,i,j,rows_per_blk,cols_per_blk,stride) \
    (M + ((i)*(rows_per_blk))*(stride) + (j)*(cols_per_blk))

int main(int argc, char *argv[]) {
    /* matrix dimensions */
    int dimN = N;
    int dimP = N;
    int dimM = N;
    /* block dimensions */
    int dimNb = dimN/NB;
    int dimPb = dimP/NB;
    int dimMb = dimM/NB;
    /* allocate memory for matrices */
    double *A = malloc(dimN*dimP*sizeof(double));
    double *B = malloc(dimP*dimM*sizeof(double));
    double *C = malloc(dimN*dimM*sizeof(double));
    /* Initialize matrices */
    initialize(A, B, dimN, dimP, dimM);
    /* Do the matrix multiplication */
    for (int ib=0; ib < NB; ++ib) {
        for (int jb=0; jb < NB; ++jb) {
            /* find block[ib][jb] of C */
            double *blockPtr = blockstart(C, ib, jb, dimNb, dimMb, dimM);
            /* clear block[ib][jb] of C (set all elements to zero) */
            matclear(blockPtr, dimNb, dimMb, dimM);
            for (int kb=0; kb < NB; ++kb) {
                /* compute product of block[ib][kb] of A and block[kb][jb]
                   of B and add to block[ib][jb] of C */
                matmul_add(blockstart(A, ib, kb, dimNb, dimPb, dimP),
                           blockstart(B, kb, jb, dimPb, dimMb, dimM),
                           blockPtr, dimNb, dimPb, dimMb, dimP, dimM, dimM);
            }
        }
    }
    /* Code to print results not shown */
    return 0;
}
Matrix multiplication
The matrix multiplication problem is described in the Context section. Fig. 4.17 presents a simple sequential program to compute the desired result, based on decomposing the N by N matrix into NB*NB square blocks. The notation block[i][j] in comments indicates the (i,j)-th block as described earlier. To simplify the coding in C, we represent the matrices as 1D arrays (internally arranged in row-major order) and define a macro blockstart to find the top left corner of a submatrix within one of these 1D arrays. We omit code for functions initialize (initialize matrices A and B), printMatrix (print a matrix's values), matclear (clear a matrix, that is, set all values to zero), and matmul_add (compute the matrix product of the two input matrices and add it to the output matrix). Parameters to most of these functions include matrix dimensions, plus a stride that denotes the distance from the start of one row of the matrix to the start of the next and allows us to apply the functions to submatrices as well as to whole matrices.
Figure 4.18. Sequential matrix multiplication, revised. We do not show the parts of the program that are not changed from the program in Fig. 4.17.

/* Declarations, initializations, etc. not shown -- same as first version */

/* Do the multiply */
matclear(C, dimN, dimM, dimM);  /* sets all elements to zero */
for (int kb=0; kb < NB; ++kb) {
    for (int ib=0; ib < NB; ++ib) {
        for (int jb=0; jb < NB; ++jb) {
            /* compute product of block[ib][kb] of A and block[kb][jb]
               of B and add to block[ib][jb] of C */
            matmul_add(blockstart(A, ib, kb, dimNb, dimPb, dimP),
                       blockstart(B, kb, jb, dimPb, dimMb, dimM),
                       blockstart(C, ib, jb, dimNb, dimMb, dimM),
                       dimNb, dimPb, dimMb, dimP, dimM, dimM);
        }
    }
}
essentially the same as those arising from the mesh program example, so we do not show program source code here.
MPI solution
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <mpi.h>

#define N 100
#define blockstart(M,i,j,rows_per_blk,cols_per_blk,stride) \
    (M + ((i)*(rows_per_blk))*(stride) + (j)*(cols_per_blk))

int main(int argc, char *argv[]) {
    /* matrix dimensions */
    int dimN = N;
    int dimP = N;
    int dimM = N;
    /* block dimensions */
    int dimNb, dimPb, dimMb;
    /* matrices */
    double *A, *B, *C;
    /* buffers for receiving sections of A, B from other processes */
    double *Abuffer, *Bbuffer;
    int numProcs, myID, myID_i, myID_j, NB;
    MPI_Status status;

    /* MPI initialization */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myID);

    /* initialize other variables */
    NB = (int) sqrt((double) numProcs);
    myID_i = myID / NB;
    myID_j = myID % NB;
    dimNb = dimN/NB;
    dimPb = dimP/NB;
    dimMb = dimM/NB;
    A = malloc(dimNb*dimPb*sizeof(double));
    B = malloc(dimPb*dimMb*sizeof(double));
    C = malloc(dimNb*dimMb*sizeof(double));
    Abuffer = malloc(dimNb*dimPb*sizeof(double));
    Bbuffer = malloc(dimPb*dimMb*sizeof(double));

    /* Initialize matrices */
    initialize(A, B, dimNb, dimPb, dimMb, NB, myID_i, myID_j);
    /* continued from previous figure */

    /* Do the multiply */
    matclear(C, dimNb, dimMb, dimMb);
    for (int kb=0; kb < NB; ++kb) {
        if (myID_j == kb) {
            /* send A to other processes in the same "row" */
            for (int jb=0; jb < NB; ++jb) {
                if (jb != myID_j)
                    MPI_Send(A, dimNb*dimPb, MPI_DOUBLE,
                             myID_i*NB + jb, 0, MPI_COMM_WORLD);
            }
            /* copy A to Abuffer */
            memcpy(Abuffer, A, dimNb*dimPb*sizeof(double));
        }
        else {
            MPI_Recv(Abuffer, dimNb*dimPb, MPI_DOUBLE,
                     myID_i*NB + kb, 0, MPI_COMM_WORLD, &status);
        }
        if (myID_i == kb) {
            /* send B to other processes in the same "column" */
            for (int ib=0; ib < NB; ++ib) {
                if (ib != myID_i)
                    MPI_Send(B, dimPb*dimMb, MPI_DOUBLE,
                             ib*NB + myID_j, 0, MPI_COMM_WORLD);
            }
            /* copy B to Bbuffer */
            memcpy(Bbuffer, B, dimPb*dimMb*sizeof(double));
        }
        else {
            MPI_Recv(Bbuffer, dimPb*dimMb, MPI_DOUBLE,
                     kb*NB + myID_j, 0, MPI_COMM_WORLD, &status);
        }
        /* compute product of block[ib][kb] of A and block[kb][jb]
           of B and add to block[ib][jb] of C */
        matmul_add(Abuffer, Bbuffer, C,
                   dimNb, dimPb, dimMb, dimPb, dimMb, dimMb);
    }
    /* Code to print results not shown */

    /* Clean up and end */
    MPI_Finalize();
    return 0;
}
Although the algorithm may seem complex at first, the overall idea is straightforward. The computation proceeds through a number of phases (the loop over kb). At each phase, the process whose row index equals the kb index sends its blocks of A across the row of processes. Likewise, the process whose column index equals kb sends its blocks of B along the column of processes. Following the communication operations, each process then multiplies the A and B blocks it received and sums the result into its block of C. After NB phases, the block of the C matrix on each process will hold the final product.

These types of algorithms are very common when working with MPI. The key to understanding these algorithms is to think in terms of the set of processes, the data owned by each process, and how data from neighboring processes flows among the processes as the calculation unfolds. We revisit these issues in the SPMD and Distributed Array patterns as well as in the MPI appendix.

A great deal of research has been carried out on parallel matrix multiplication and related linear algebra algorithms. A more sophisticated approach, in which the blocks of A and B circulate among processes, arriving at each process just in time to be used, is given in [FJL+88].
Known uses
[Sca, BCC+97] library are for the most part based on this pattern. These two classes of problems cover a large portion of all parallel applications in scientific computing.

Related Patterns

If the update required for each chunk can be done without data from other chunks, then this pattern reduces to the embarrassingly parallel algorithm described in the Task Parallelism pattern. As an example of such a computation, consider computing a 2D FFT (Fast Fourier Transform) by first applying a 1D FFT to each row of the matrix and then applying a 1D FFT to each column. Although the decomposition may appear data-based (by rows/by columns), in fact the computation consists of two instances of the Task Parallelism pattern.

If the data structure to be distributed is recursive in nature, then the Divide and Conquer or Recursive Data pattern may be applicable.
Figure 4.21. Finding roots in a forest. Solid lines represent the original parent-child relationships among nodes; dashed lines point from nodes to their successors.
What we have done is transform the original sequential calculation (find roots for nodes one "hop" from a root, then find roots for nodes two "hops" from a root, etc.) into a calculation that computes a partial result (successor) for each node and then repeatedly combines these partial results, first with neighboring results, then with results from nodes two hops away, then with results from nodes four hops away, and so on. This strategy can be applied to other problems that at first appear unavoidably sequential; the Examples section presents other examples. This technique is sometimes referred to as pointer jumping or recursive doubling.

An interesting aspect of this restructuring is that the new algorithm involves substantially more total work than the original sequential one (O(N log N) versus O(N)), but the restructured algorithm contains potential concurrency that, if fully exploited, reduces total running time to O(log N) (versus O(N)). Most strategies and algorithms based on this pattern similarly trade off an increase in total work for a potential decrease in execution time. Notice also that the exploitable concurrency can be extremely fine-grained (as in the previous example), which may limit the situations in which this pattern yields an efficient algorithm. Nevertheless, the pattern can still serve as an inspiration for lateral thinking about how to parallelize problems that at first glance appear to be inherently sequential.

Forces
In this pattern, the recursive data structure is completely decomposed into individual elements and each element is assigned to a separate UE. Ideally each UE would be assigned to a different PE, but it is also possible to assign multiple UEs to each PE. If the number of UEs per PE is too large, however, the overall performance will be poor because there will not be enough concurrency to overcome the increase in the total amount of work.

For example, consider the root-finding problem described earlier. We'll ignore overhead in our computations. If N = 1024 and t is the time to perform one step for one data element, then the running time of a sequential algorithm will be about 1024t. If each UE is assigned its own PE, then the running time of the parallel algorithm will be around (log N)t, or 10t. If only two PEs are available for the parallel algorithm, however, then all N log N, or 10240, computation steps must be performed on the two PEs, and the execution time will be at least 5120t, considerably more than the sequential algorithm.
Structure
then the parallel algorithm must ensure that next[k] is not updated before other UEs that need its value for their computation have received it. One common technique is to introduce a new variable, say next2, at each element. Even-numbered iterations then read next but update next2, while odd-numbered iterations read next2 and update next. The necessary synchronization is accomplished by placing a barrier (as described in the Implementation Mechanisms design space) between each successive pair of iterations. Notice that this can substantially increase the overhead associated with the parallel algorithm, which can overwhelm any speedup derived from the additional concurrency. This is most likely to be a factor if the calculation required for each element is trivial (which, alas, for many of the examples it is).

If there are fewer PEs than data elements, the program designer must decide whether to assign each data element to a UE and assign multiple UEs to each PE (thereby simulating some of the parallelism) or whether to assign multiple data elements to each UE and process them serially. The latter is less straightforward (requiring an approach similar to that sketched previously, in which variables involved in the simultaneous update are duplicated), but can be more efficient.

Examples
Partial sums of a linked list
Figure 4.23. Steps in finding partial sums of a list. Straight arrows represent links between elements; curved arrows indicate additions.
In combinatorial optimization, problems involving traversing all nodes in a graph or tree can often be solved with this pattern by first finding an ordering on the nodes to create a list. Euler tours and ear decomposition [EG88] are well-known techniques to compute this ordering.

JáJá [J92] also describes several applications of this pattern: finding the roots of trees in a forest of rooted directed trees, computing partial sums on a set of rooted directed trees (similar to the preceding example with linked lists), and list ranking (determining for each element of the list its distance from the start/end of the list).

Related Patterns

With respect to the actual concurrency, this pattern is very much like the Geometric Decomposition pattern, a difference being that in this pattern the data structure containing the elements to be operated on concurrently is recursive (at least conceptually). What makes it different is the emphasis on fundamentally rethinking the problem to expose fine-grained concurrency.
can be vectorized in a way that the special hardware can exploit. After a short startup, one a[i] value will be generated each clock cycle.
creates a three-stage pipeline, with one process for each command (cat, grep, and wc).

These examples and the assembly-line analogy have several aspects in common. All involve applying a sequence of operations (in the assembly-line case it is installing the engine, installing the windshield, etc.) to each element in a sequence of data elements (in the assembly line, the cars). Although there may be ordering constraints on the operations on a single data element (for example, it might be necessary to install the engine before installing the hood), it is possible to perform different operations on different data elements simultaneously (for example, one can install the engine on one car while installing the hood on another).

The possibility of simultaneously performing different operations on different data elements is the potential concurrency this pattern exploits. In terms of the analysis described in the Finding Concurrency patterns, each task consists of repeatedly applying an operation to a data element (analogous to an assembly-line worker installing a component), and the dependencies among tasks are ordering constraints enforcing the order in which operations must be performed on each data element (analogous to installing the engine before the hood).

Forces
Solution

The key idea of this pattern is captured by the assembly-line analogy, namely that the potential concurrency can be exploited by assigning each operation (stage of the pipeline) to a different worker and having them work simultaneously, with the data elements passing from one worker to the next as operations are completed. In parallel-programming terms, the idea is to assign each task (stage of the pipeline) to a UE and provide a mechanism whereby each stage of the pipeline can send data elements to the next stage. This strategy is probably the most straightforward way to deal with this type of ordering constraint. It allows the application to take advantage of special-purpose hardware by appropriate mapping of pipeline stages to PEs and provides a reasonable mechanism for handling errors, described later. It also is likely to yield a modular design that can later be extended or modified.

Before going further, it may help to illustrate how the pipeline is supposed to operate. Let Ci represent a multistep computation on data element i. Ci(j) is the j-th step of the computation. The idea is to map computation steps to pipeline stages so that each stage of the pipeline computes one step. Initially, the first stage of the pipeline performs C1(1). After that completes, the second stage of the pipeline receives the first data item and computes C1(2) while the first stage computes the first step of the second item, C2(1). Next, the third stage computes C1(3), while the second stage computes C2(2) and the first stage C3(1). Fig. 4.24 illustrates how this works for a pipeline consisting of four stages. Notice that concurrency is initially limited and some resources remain idle until all the stages are occupied with useful work. This is referred to as filling the pipeline. At the end of the computation (draining the pipeline), again there is limited concurrency and idle resources as the final item works its way through the pipeline. We want the time spent filling or draining the pipeline to be small compared to the total time of the computation. This will be the case if the number of stages is small compared to the number of items to be processed. Notice also that overall throughput/efficiency is maximized if the time taken to process a data element is roughly the same for each stage.
Figure 4.24. Operation of a pipeline. Each pipeline stage i computes the i-th step of the computation.
The amount of concurrency in a full pipeline is limited by the number of stages. Thus, a larger number of stages allows more concurrency. However, the data sequence must be transferred between the stages, introducing overhead to the calculation. Thus, we need to organize the computation into stages such that the work done by a stage is large compared to the communication overhead. What is "large enough" is highly dependent on the particular architecture. Specialized hardware (such as vector processors) allows very fine-grained parallelism.

The pattern works better if the operations performed by the various stages of the pipeline are all about equally computationally intensive. If the stages in the pipeline vary widely in computational effort, the slowest stage creates a bottleneck for the aggregate throughput.
Figure 4.26. Basic structure of a pipeline stage

    initialize
    while (more data) {
        receive data element from previous stage
        perform operation on data element
        send data element to next stage
    }
    finalize
How data flow between pipeline elements is represented depends on the target platform.

In a message-passing environment, the most natural approach is to assign one process to each operation (stage of the pipeline) and implement each connection between successive stages of the pipeline as a sequence of messages between the corresponding processes. Because the stages are hardly ever perfectly synchronized, and the amount of work carried out at different stages almost always varies, this flow of data between pipeline stages must usually be both buffered and ordered. Most message-passing environments (e.g., MPI) make this easy to do. If the cost of sending individual messages is high, it may be worthwhile to consider sending multiple data elements in each message; this reduces total communication cost at the expense of increasing the time needed to fill the pipeline.

If a message-passing programming environment is not a good fit with the target platform, the stages of the pipeline can be connected explicitly with buffered channels. Such a buffered channel can be implemented as a queue shared between the sending and receiving tasks, using the Shared Queue pattern.

If the individual stages are themselves implemented as parallel programs, then more sophisticated approaches may be called for, especially if some sort of data redistribution needs to be performed between the stages. This might be the case if, for example, the data needs to be partitioned along a different dimension or partitioned into a different number of subsets in the same dimension. For example, an application might include one stage in which each data element is partitioned into three subsets and another stage in which it is partitioned into four subsets. The simplest ways to handle such situations are to aggregate and disaggregate data elements between stages. One approach would be to have only one task in each stage communicate with tasks in other stages; this task would then be responsible for interacting with the other tasks in its stage to distribute input data elements and collect output data elements. Another approach would be to introduce additional pipeline stages to perform aggregation/disaggregation operations. Either of these approaches, however, involves a fair amount of communication. It may be preferable to have the earlier stage "know" about the needs of its successor and communicate with each task receiving part of its data directly rather than aggregating the data at one stage and then disaggregating at the next. This approach improves performance at the cost of reduced simplicity, modularity, and flexibility.

Less traditionally, networked file systems have been used for communication between stages in a pipeline running in a workstation cluster. The data is written to a file by one stage and read from the file by its successor. Network file systems are usually mature and fairly well optimized, and they provide for the visibility of the file at all PEs as well as mechanisms for concurrency control. Higher-level abstractions such as tuple spaces and blackboards implemented over networked file systems can also be used. File-system-based solutions are appropriate in large-grained applications in which the time needed to process the data at each stage is large compared with the time to access the file system.
Handling errors
The simplest approach is to allocate one PE to each stage of the pipeline. This gives good load balance if the PEs are similar and the amount of work needed to process a data element is roughly the same for each stage. If the stages have different requirements (for example, one is meant to be run on special-purpose hardware), this should be taken into consideration in assigning stages to PEs.

If there are fewer PEs than pipeline stages, then multiple stages must be assigned to the same PE, preferably in a way that improves, or at least does not much reduce, overall performance. Stages that do not share many resources can be allocated to the same PE; for example, a stage that writes to a disk and a stage that involves primarily CPU computation might be good candidates to share a PE. If the amount of work to process a data element varies among stages, stages involving less work may be allocated to the same PE, thereby possibly improving load balance. Assigning adjacent stages to the same PE can reduce communication costs. It might also be worthwhile to consider combining adjacent stages of the pipeline into a single stage.

If there are more PEs than pipeline stages, it is worthwhile to consider parallelizing one or more of the pipeline stages using an appropriate Algorithm Structure pattern, as discussed previously, and allocating more than one PE to the parallelized stage(s). This is particularly effective if the parallelized stage was previously a bottleneck (taking more time than the other stages and thereby dragging down overall performance).

Another way to make use of more PEs than pipeline stages, if there are no temporal constraints among the data items themselves (that is, it doesn't matter if, say, data item 3 is computed before data item 2), is to run multiple independent pipelines in parallel. This can be considered an instance of the Task Parallelism pattern. This will improve the throughput of the overall calculation, but it does not significantly improve the latency, since it still takes the same amount of time for a data element to traverse the pipeline.
There are a few more factors to keep in mind when evaluating whether a given design will produce acceptable performance.

In many situations where the Pipeline pattern is used, the performance measure of interest is the throughput, the number of data items per time unit that can be processed after the pipeline is already full. For example, if the output of the pipeline is a sequence of rendered images to be viewed as an animation, then the pipeline must have sufficient throughput (number of items processed per time unit) to generate the images at the required frame rate.

In another situation, the input might be generated from real-time sampling of sensor data. In this case, there might be constraints on both the throughput (the pipeline should be able to handle all the data as it comes in without backing up the input queue and possibly losing data) and the latency (the amount of time between the generation of an input and the completion of processing of that input). In this case, it might be desirable to minimize latency subject to a constraint that the throughput is sufficient to handle the incoming data.

Examples
Fourier-transform computations
* The first stage of the pipeline performs the initial Fourier transform; it repeatedly obtains one set of input data, performs the transform, and passes the result to the second stage of the pipeline.

* The second stage of the pipeline performs the desired elementwise manipulation; it repeatedly obtains a partial result (of applying the initial Fourier transform to an input set of data) from the first stage of the pipeline, performs its manipulation, and passes the result to the third stage of the pipeline. This stage can often itself be parallelized using one of the other Algorithm Structure patterns.

* The third stage of the pipeline performs the final inverse Fourier transform; it repeatedly obtains a partial result (of applying the initial Fourier transform and then the elementwise manipulation to an input set of data) from the second stage of the pipeline, performs the inverse Fourier transform, and outputs the result.
The figures for this example show a simple Java framework for pipelines and an example application. The framework consists of a base class for pipeline stages, PipelineStage, shown in Fig. 4.27, and a base class for pipelines, LinearPipeline, shown in Fig. 4.28. Applications provide a subclass of PipelineStage for each desired stage, implementing its three abstract methods to indicate what the stage should do on the initial step, the compute steps, and the final step, and a subclass of LinearPipeline that implements its abstract methods to create an array containing the desired pipeline stages and the desired queues connecting the stages. For the queues connecting the stages, we use LinkedBlockingQueue, an implementation of the BlockingQueue interface. These classes are found in the java.util.concurrent package and use generics to specify the type of objects the queue can hold. For example, new LinkedBlockingQueue<String> creates a BlockingQueue implemented by an underlying linked list that can hold Strings. The operations of interest are put, which adds an object to the queue, and take, which removes an object; take blocks if the queue is empty. The class CountDownLatch, also found in the java.util.concurrent package, is a simple barrier that allows the program to print a message when it has terminated. Barriers in general, and CountDownLatch in particular, are discussed in the Implementation Mechanisms design space.

The remaining figures show code for an example application, a pipeline to sort integers. Fig. 4.29 is the required subclass of LinearPipeline, and Fig. 4.30 is the required subclass of PipelineStage. Additional pipeline stages to generate or read the input and to handle the output are not shown.
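As a minimal, self-contained illustration of the two java.util.concurrent classes the framework relies on, here is a one-element producer/consumer exchange. This toy example is ours (using a lambda for brevity), not part of the book's framework:

```java
import java.util.concurrent.*;

public class QueueLatchDemo {
    // Runs a one-element producer/consumer exchange and returns the
    // item the consumer received.
    static String demo() throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        CountDownLatch done = new CountDownLatch(1);
        final String[] received = new String[1];

        // Consumer thread: take() blocks until an element is available.
        Thread consumer = new Thread(() -> {
            try {
                received[0] = queue.take();  // blocks on an empty queue
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done.countDown();            // signal termination
            }
        });
        consumer.start();

        queue.put("hello");  // add an element; this wakes the consumer
        done.await();        // wait until the consumer has finished
        return received[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("consumer got: " + demo());
    }
}
```

The same put/take/countDown/await choreography, scaled up to one thread and one queue per stage, is exactly what the framework classes below automate.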
Known uses
Figure 4.27. Base class for pipeline stages

import java.util.concurrent.*;

abstract class PipelineStage implements Runnable {

    BlockingQueue in;
    BlockingQueue out;
    CountDownLatch s;
    boolean done;

    //override to specify initialization step
    abstract void firstStep() throws Exception;

    //override to specify compute step
    abstract void step() throws Exception;

    //override to specify finalization step
    abstract void lastStep() throws Exception;

    void handleComputeException(Exception e) {
        e.printStackTrace();
    }

    public void run() {
        try {
            firstStep();
            while (!done) { step(); }
            lastStep();
        }
        catch (Exception e) { handleComputeException(e); }
        finally { s.countDown(); }
    }

    public void init(BlockingQueue in, BlockingQueue out, CountDownLatch s) {
        this.in = in;
        this.out = out;
        this.s = s;
    }
}
Fx [GOS94], a parallelizing Fortran compiler based on HPF [HPF97], has been used to develop
Figure 4.28. Base class for linear pipelines

import java.util.concurrent.*;

abstract class LinearPipeline {

    PipelineStage[] stages;
    BlockingQueue[] queues;
    int numStages;
    CountDownLatch s;

    //override method to create desired array of pipeline stage objects
    abstract PipelineStage[] getPipelineStages(String[] args);

    //override method to create desired array of BlockingQueues
    //element i of returned array contains queue between stages i and i+1
    abstract BlockingQueue[] getQueues(String[] args);

    LinearPipeline(String[] args) {
        stages = getPipelineStages(args);
        queues = getQueues(args);
        numStages = stages.length;
        s = new CountDownLatch(numStages);

        BlockingQueue in = null;
        BlockingQueue out = queues[0];
        for (int i = 0; i != numStages; i++) {
            stages[i].init(in, out, s);
            in = out;
            if (i < numStages-2) out = queues[i+1];
            else out = null;
        }
    }

    public void start() {
        for (int i = 0; i != numStages; i++) {
            new Thread(stages[i]).start();
        }
    }
}
[J92] presents some finer-grained applications of pipelining, including inserting a sequence of elements into a 2-3 tree and pipelined mergesort.

Related Patterns

This pattern is very similar to the Pipes and Filters pattern of [BMR+96]; the key difference is that this pattern explicitly discusses concurrency.

For applications in which there are no temporal dependencies between the data inputs, an alternative to this pattern is a design based on multiple sequential pipelines executing in parallel and using the Task Parallelism pattern.
Figure 4.29. Pipelined sort (main class)
import java.util.concurrent.*;

class SortingPipeline extends LinearPipeline {

    /* Creates an array of pipeline stages with the number of sorting
       stages given via args. Input and output stages are also included
       at the beginning and end of the array. Details are omitted. */
    PipelineStage[] getPipelineStages(String[] args) {
        // ....
        return stages;
    }

    /* Creates an array of LinkedBlockingQueues to serve as communication
       channels between the stages. For this example, the first is
       restricted to hold Strings, the rest can hold Comparables. */
    BlockingQueue[] getQueues(String[] args) {
        BlockingQueue[] queues = new BlockingQueue[totalStages - 1];
        queues[0] = new LinkedBlockingQueue<String>();
        for (int i = 1; i != totalStages - 1; i++) {
            queues[i] = new LinkedBlockingQueue<Comparable>();
        }
        return queues;
    }

    SortingPipeline(String[] args) { super(args); }

    public static void main(String[] args) throws InterruptedException {
        LinearPipeline l = new SortingPipeline(args); // create pipeline
        l.start();   // start threads associated with stages
        l.s.await(); // wait until all stage threads have terminated
        System.out.println("All threads terminated");
    }
}
At first glance, one might also expect that sequential solutions built using the Chain of Responsibility pattern [GHJV95] could be easily parallelized using the Pipeline pattern. In Chain of Responsibility, or COR, an "event" is passed along a chain of objects until one or more of the objects handle the event. This pattern is directly supported, for example, in the Java Servlet Specification[1] [SER] to enable filtering of HTTP requests. With Servlets, as well as other typical applications of COR, however, the reason for using the pattern is to support modular structuring of a program that will need to handle independent events in different ways depending on the event type. It may be that only one object in the chain will even handle the event. We expect that in most cases, the Task Parallelism pattern would be more appropriate than the Pipeline pattern. Indeed, Servlet container implementations already supporting multithreading to handle independent HTTP requests provide this solution for free.
[1]
A Servlet is a Java program invoked by a Web server. The Java Servlets technology is included in the Java 2 Enterprise Edition platform for Web server applications.
Figure 4.30. Pipelined sort (sorting stage)
class SortingStage extends PipelineStage {

    Comparable val = null;
    Comparable input = null;

    void firstStep() throws InterruptedException {
        input = (Comparable)in.take();
        done = (input.equals("DONE"));
        val = input;
        return;
    }

    void step() throws InterruptedException {
        input = (Comparable)in.take();
        done = (input.equals("DONE"));
        if (!done) {
            if (val.compareTo(input) < 0) {
                out.put(val);
                val = input;
            }
            else {
                out.put(input);
            }
        }
        else out.put(val);
    }

    void lastStep() throws InterruptedException {
        out.put("DONE");
    }
}
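To see why this stage logic sorts, note that each stage keeps the largest value it has seen so far and forwards the rest downstream, so stage 0 ends up holding the maximum, stage 1 the next largest, and so on. The following sequential simulation of that per-stage rule is our own sketch; it replaces the threads and queues with plain loops so the behavior is easy to check:

```java
import java.util.*;

public class PipelineSortSketch {
    // Sequentially simulates the per-stage logic of SortingStage: each
    // stage keeps the larger of (held value, arriving value) and forwards
    // the smaller one to the next stage. With n stages and n inputs,
    // stage 0 ends up holding the maximum, stage 1 the next largest, etc.
    static List<Integer> sort(List<Integer> input) {
        int n = input.size();
        Integer[] held = new Integer[n];          // value kept by each stage
        for (int x : input) {
            Integer moving = x;
            for (int s = 0; s < n && moving != null; s++) {
                if (held[s] == null) { held[s] = moving; moving = null; }
                else if (held[s] < moving) {      // keep larger, forward held
                    int tmp = held[s]; held[s] = moving; moving = tmp;
                }                                  // else forward moving as-is
            }
        }
        List<Integer> result = new ArrayList<>(Arrays.asList(held));
        Collections.reverse(result);              // ascending order
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sort(Arrays.asList(3, 1, 4, 1, 5)));  // [1, 1, 3, 4, 5]
    }
}
```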
Problem

Suppose the application can be decomposed into groups of semi-independent tasks interacting in an irregular fashion. The interaction is determined by the flow of data between them, which implies ordering constraints between the tasks. How can these tasks and their interaction be implemented so they can execute concurrently?

Context

Some problems are most naturally represented as a collection of semi-independent entities interacting in an irregular way. What this means is perhaps clearest if we compare this pattern with the Pipeline pattern. In the Pipeline pattern, the entities form a linear pipeline, each entity interacts only with the entities to either side, the flow of data is one-way, and interaction occurs at fairly regular and predictable intervals. In the Event-Based Coordination pattern, in contrast, there is no restriction to a linear structure, no restriction that the flow of data be one-way, and the interaction takes place at irregular and sometimes unpredictable intervals.

As a real-world analogy, consider a newsroom, with reporters, editors, fact checkers, and other employees collaborating on stories. As reporters finish stories, they send them to the appropriate editors; an editor can decide to send the story to a fact checker (who would then eventually send it back) or back to the reporter for further revision. Each employee is a semi-independent entity, and their interaction (for example, a reporter sending a story to an editor) is irregular.

Many other examples can be found in the field of discrete-event simulation, that is, simulation of a physical system consisting of a collection of objects whose interaction is represented by a sequence of discrete "events". An example of such a system is the car-wash facility described in [Mis86]: The facility has two car-wash machines and an attendant. Cars arrive at random times at the attendant.
Each car is directed by the attendant to a nonbusy car-wash machine if one exists, or queued if both machines are busy. Each car-wash machine processes one car at a time. The goal is to compute, for a given distribution of arrival times, the average time a car spends in the system (time being washed plus any time waiting for a nonbusy machine) and the average length of the queue that builds up at the attendant. The "events" in this system include cars arriving at the attendant, cars being directed to the car-wash machines, and cars leaving the machines. Fig. 4.31 sketches this example. Notice that it includes "source" and "sink" objects to make it easier to model cars arriving and leaving the facility. Notice also that the attendant must be notified when cars leave the car-wash machines so that it knows whether the machines are busy.
Figure 4.31. Discrete-event simulation of a car-wash facility. Arrows indicate the flow of events.
The basic structure of each task consists of receiving an event, processing it, and possibly generating new events.
To allow communication and computation to overlap, one generally needs a form of asynchronous communication of events in which a task can create (send) an event and then continue without waiting for the recipient to receive it. In a message-passing environment, an event can be represented by a message sent asynchronously from the task generating the event to the task that is to process it. In a shared-memory environment, a queue can be used to simulate message passing. Because each such queue will be accessed by more than one task, it must be implemented in a way that allows safe concurrent access, as described in the Shared Queue pattern. Other communication abstractions, such as tuple spaces as found in the Linda coordination language or JavaSpaces [FHA99], can also be used effectively with this pattern. Linda [CG91] is a simple language consisting of only six operations that read and write an associative (that is, content-addressable) shared memory called a tuple space. A tuple space is a conceptually shared repository for data containing objects called tuples that tasks use for communication in a distributed system.
Enforcing event ordering
The enforcement of ordering constraints may make it necessary for a task to process events in a different order from the order in which they are sent, or to wait to process an event until some other event from a given task has been received, so it is usually necessary to be able to look ahead in the queue or message buffer and remove elements out of order. For example, consider the situation in Fig. 4.33. Task 1 generates an event and sends it to task 2, which will process it, and also sends it to task 3, which is recording information about all events. Task 2 processes the event from task 1 and generates a new event, a copy of which is also sent to task 3. Suppose that the vagaries of the scheduling and underlying communication layer cause the event from task 2 to arrive before the event from task 1. Depending on what task 3 is doing with the events, this may or may not be problematic. If task 3 is simply tallying the number of events that occur, there is no problem. If task 3 is writing a log entry that should reflect the order in which events are handled, however, simply processing events in the order in which they arrive would in this case produce an incorrect result. If task 3 is controlling a gate,
In discrete-event simulations, a similar problem can occur because of the semantics of the application domain. An event arrives at a station (task) along with a simulation time when it should be scheduled. An event can arrive at a station before other events with earlier simulation times.

The first step is to determine whether, in a particular situation, out-of-order events can be a problem. There will be no problem if the "event" path is linear so that no out-of-order events will occur, or if, according to the application semantics, out-of-order events do not matter.

If out-of-order events may be a problem, then either an optimistic or pessimistic approach can be chosen. An optimistic approach requires the ability to roll back the effects of events that are mistakenly executed (including the effects of any new events that have been created by the out-of-order execution). In the area of distributed simulation, this approach is called time warp [Jef85]. Optimistic approaches are usually not feasible if an event causes interaction with the outside world.

Pessimistic approaches ensure that the events are always executed in order at the expense of increased latency and communication overhead. Pessimistic approaches do not execute events until it can be guaranteed "safe" to do so. In the figure, for example, task 3 cannot process an event from task 2 until it "knows" that no earlier event will arrive from task 1, and vice versa. Providing task 3 with that knowledge may require introducing null events that contain no information useful for anything except the event ordering. Many implementations of pessimistic approaches are based on timestamps that are consistent with the causality in the system [Lam78].

Much research and development effort has gone into frameworks that take care of the details of event ordering in discrete-event simulation for both optimistic [RMC+98] and pessimistic [CLL+99] approaches. Similarly, middleware is available that handles event-ordering problems in process groups caused by the communication system. An example is the Ensemble system developed at Cornell [vRBH+98].
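The pessimistic idea can be sketched in a few lines: a task merging timestamped streams may deliver an event only when every other stream has presented an event with a timestamp at least as large, since each stream is internally ordered. This simplified single-threaded sketch is ours, not from the pattern text:

```java
import java.util.*;

public class PessimisticMerge {
    // Merge two streams of timestamped events, delivering them in
    // timestamp order. An event from one stream is "safe" to deliver
    // only while the other stream's next event has a timestamp that is
    // not smaller (each stream is assumed internally ordered).
    static List<Long> merge(Deque<Long> a, Deque<Long> b) {
        List<Long> delivered = new ArrayList<>();
        while (!a.isEmpty() && !b.isEmpty()) {
            if (a.peekFirst() <= b.peekFirst()) delivered.add(a.pollFirst());
            else delivered.add(b.pollFirst());
        }
        // Remaining events are NOT safe: an earlier event could still
        // arrive on the now-empty stream. A real implementation would
        // wait, or ask that stream for a null event to advance its clock.
        return delivered;
    }

    public static void main(String[] args) {
        Deque<Long> a = new ArrayDeque<>(Arrays.asList(1L, 4L, 7L));
        Deque<Long> b = new ArrayDeque<>(Arrays.asList(2L, 3L));
        System.out.println(merge(a, b));  // [1, 2, 3]; 4 and 7 must wait
    }
}
```

The events held back at the end are exactly the situation null events resolve: a null event with timestamp 5 on the second stream would make event 4 safe to deliver.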
Avoiding deadlocks
can also be caused by problems in the model that is being simulated. In the latter case, the developer must rethink the solution.

If pessimistic techniques are used to control the order in which events are processed, then deadlocks can occur when an event is available and actually could be processed, but is not processed because the event is not yet known to be safe. The deadlock can be broken by exchanging enough information that the event can be safely processed. This is a very significant problem, as the overhead of dealing with deadlocks can cancel the benefits of parallelism and make the parallel algorithm slower than a sequential simulation. Approaches to dealing with this type of deadlock range from sending frequent enough "null messages" to avoid deadlocks altogether (at the cost of many extra messages) to using deadlock-detection schemes to detect the presence of a deadlock and then resolve it (at the cost of possibly significant idle time before the deadlock is detected and resolved). The approach of choice will depend on the frequency of deadlock. A middle-ground solution, often the best approach, is to use timeouts instead of accurate deadlock detection.
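The timeout middle ground maps directly onto java.util.concurrent: a blocking take is replaced by poll with a time limit, and a null result triggers whatever recovery the design calls for. A minimal sketch (ours, not from the pattern text):

```java
import java.util.concurrent.*;

public class TimeoutReceive {
    // Waits up to timeoutMillis for an event; returns null on timeout,
    // at which point the caller can start deadlock recovery (for
    // example, exchanging clock information or sending null messages).
    static String receive(BlockingQueue<String> queue, long timeoutMillis)
            throws InterruptedException {
        return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        queue.put("event-1");
        System.out.println(receive(queue, 50));  // event-1
        // The queue is now empty: poll times out instead of blocking
        // forever, as take() would.
        System.out.println(receive(queue, 50));  // null
    }
}
```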
Scheduling and processor allocation
A number of discrete-event simulation applications use this pattern. The DPAT simulation used to analyze air traffic control systems [Wie01] is a successful simulation that uses optimistic techniques.
digital circuits (that is, circuits without feedback paths). The circuit is partitioned into subcircuits; associated with each are input signal ports and output voltage ports, which are connected to form a representation of the whole circuit. The simulation of each subcircuit proceeds in a time-stepped fashion; at each time step, the subcircuit's behavior depends on its previous state and the values read at its input ports (which correspond to values at the corresponding output ports of other subcircuits at previous time steps). Simulation of these subcircuits can proceed concurrently, with ordering constraints imposed by the relationship between values generated for output ports and values read on input ports. The solution described in [YWC+96] fits the Event-Based Coordination pattern, defining a task for each subcircuit and representing the ordering constraints as events.

Related Patterns

This pattern is similar to the Pipeline pattern in that both patterns apply to problems in which it is natural to decompose the computation into a collection of semi-independent entities interacting in terms of a flow of data. There are two key differences. First, in the Pipeline pattern, the interaction among entities is fairly regular, with all stages of the pipeline proceeding in a loosely synchronous way, whereas in the Event-Based Coordination pattern there is no such requirement, and the entities can interact in very irregular and asynchronous ways. Second, in the Pipeline pattern, the overall structure (number of tasks and their interaction) is usually fixed, whereas in the Event-Based Coordination pattern, the problem structure can be more dynamic.
5.1. INTRODUCTION
The Finding Concurrency and Algorithm Structure design spaces focus on algorithm expression. At some point, however, algorithms must be translated into programs. The patterns in the Supporting Structures design space address that phase of the parallel program design process, representing an intermediate stage between the problem-oriented patterns of the Algorithm Structure design space and the specific programming mechanisms described in the Implementation Mechanisms design space. We call these patterns Supporting Structures because they describe software constructions or "structures" that support the expression of parallel algorithms. An overview of this design space and its place in the pattern language is shown in Fig. 5.1.
Figure 5.1. Overview of the Supporting Structures design space and its place in the pattern language
The two groups of patterns in this space are those that represent program-structuring approaches and those that represent commonly used shared data structures. These patterns are briefly described in the next section. In some programming environments, some of these patterns are so well supported that there is little work for the programmer. We nevertheless document them as patterns for two reasons: First, understanding the low-level details behind these structures is important for effectively using them. Second, describing these structures as patterns provides guidance for programmers who might need to implement them from scratch. The final section of this chapter describes structures that were not deemed important enough, for various reasons, to warrant a dedicated pattern, but which deserve mention for completeness.

5.1.1. Program Structuring Patterns

Patterns in this first group describe approaches for structuring source code. These patterns include the following.
* SPMD. In an SPMD (Single Program, Multiple Data) program, all UEs execute the same program, each operating on its own data.
* Master/Worker. A master process or thread sets up a pool of worker processes or threads and a bag of tasks. The workers execute concurrently, with each worker repeatedly removing a task from the bag of tasks and processing it, until all tasks have been processed or some other termination condition has been reached. In some implementations, no explicit master is present.

* Loop Parallelism. This pattern addresses the problem of transforming a serial program whose runtime is dominated by a set of compute-intensive loops into a parallel program where the different iterations of the loop are executed in parallel.

* Fork/Join. A main UE forks off some number of other UEs that then continue in parallel to accomplish some portion of the overall work. Often the forking UE waits until the child UEs terminate and join.
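As a concrete illustration of the Master/Worker structure just described, here is a minimal Java sketch (ours, with an assumed poison-pill convention for termination, not a definitive implementation): the master fills a shared bag of tasks, and each worker repeatedly removes and processes a task until it sees the termination marker.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class MasterWorkerSketch {
    static final int POISON = -1;  // termination marker (assumed convention)

    // "Processes" the tasks 1..numTasks by summing them across workers.
    static int runJob(int numWorkers, int numTasks) throws InterruptedException {
        BlockingQueue<Integer> bag = new LinkedBlockingQueue<>();
        AtomicInteger sum = new AtomicInteger();

        // Master: fill the bag of tasks, then one poison pill per worker.
        for (int t = 1; t <= numTasks; t++) bag.put(t);
        for (int w = 0; w < numWorkers; w++) bag.put(POISON);

        Thread[] workers = new Thread[numWorkers];
        for (int w = 0; w < numWorkers; w++) {
            workers[w] = new Thread(() -> {
                try {
                    int task;
                    // Each worker repeatedly removes and processes a task.
                    while ((task = bag.take()) != POISON) sum.addAndGet(task);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[w].start();
        }
        for (Thread w : workers) w.join();
        return sum.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sum = " + runJob(4, 100));  // 5050
    }
}
```

Note how the shared queue provides automatic load balancing: a worker that draws expensive tasks simply draws fewer of them.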
While we define each of these program structures as a distinct pattern, this is somewhat artificial. It is possible, for example, to implement the Master/Worker pattern using the Fork/Join pattern or the SPMD pattern. These patterns do not represent exclusive, unique ways to structure a parallel program. Rather, they define the major idioms used by experienced parallel programmers.

These patterns also inevitably express a bias rooted in the subset of parallel programming environments we consider in this pattern language. To an MPI programmer, for example, all program structure patterns are essentially a variation on the SPMD pattern. To an OpenMP programmer, however, there is a huge difference between programs that utilize thread IDs (that is, the SPMD pattern) versus programs that express all concurrency in terms of loop-level worksharing constructs (that is, the Loop Parallelism pattern).

Therefore, in using these patterns, don't think of them too rigidly. These patterns express important techniques and are worthy of consideration in isolation, but do not hesitate to combine them in different ways to meet the needs of a particular problem. For example, in the SPMD pattern, we will discuss parallel algorithms based on parallelizing loops but expressed with the SPMD pattern. It might seem that this indicates that the SPMD and Loop Parallelism patterns are not really distinct patterns, but in fact it shows how flexible the SPMD pattern is.

5.1.2. Patterns Representing Data Structures

Patterns in this second group have to do with managing data dependencies. The Shared Data pattern deals with the general case. The others describe specific frequently used data structures.
5.2. FORCES
All of the program structuring patterns address the same basic problem: how to structure source code to best support algorithm structures of interest. Unique forces are applicable to each pattern, but in designing a program around these structures, there are some common forces to consider in most cases:
Equation 5.2
Sequential equivalence. Where appropriate, does a program produce equivalent results when run with many UEs as when run with one? If not equivalent, is the relationship between them clear?

It is highly desirable that the results of an execution of a parallel program be the same regardless of the number of PEs used. This is not always possible, especially if bitwise equivalence is desired, because floating-point operations performed in a different order can produce small (or, for ill-conditioned algorithms, not-so-small) changes in the resulting values. However, if we know that the parallel program gives equivalent results when executed on one processor as on many, then we can reason about correctness and do most of the testing on the single-processor version. This is much easier, and thus, when possible, sequential equivalence is a highly desirable goal.
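The floating-point caveat is easy to demonstrate: adding the same terms in a different order can change the result, so bitwise sequential equivalence cannot be expected in general. A small demonstration of our own, with contrived magnitudes:

```java
public class FpOrder {
    // Adds one large value and n tiny values, large value first:
    // each tiny addend is below half an ulp of 1.0 and rounds away.
    static double largeFirst(int n, double tiny) {
        double sum = 1.0;
        for (int i = 0; i < n; i++) sum += tiny;
        return sum;
    }

    // Same terms, tiny values first: they accumulate to a value big
    // enough to survive the final addition to 1.0.
    static double tinyFirst(int n, double tiny) {
        double small = 0.0;
        for (int i = 0; i < n; i++) small += tiny;
        return small + 1.0;
    }

    public static void main(String[] args) {
        double a = largeFirst(1000, 1e-17);
        double b = tinyFirst(1000, 1e-17);
        System.out.println(a == b);  // false: same terms, different order
    }
}
```

A parallel reduction reorders exactly this kind of summation, which is why "equivalent" usually has to mean "equal to within rounding" rather than bitwise identical.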
[Table: relates the programming environments OpenMP, MPI, and Java (rows) to the Algorithm Structure patterns Task Parallelism, Divide and Conquer, Geometric Decomposition, Recursive Data, Pipeline, and Event-Based Coordination (columns); the ratings in the table body are not reproduced here.]
It is not that the available programming environments pushed SPMD; the force was the other way around. The programming environments for MIMD machines pushed SPMD because that is the way programmers wanted to write their programs. They wrote them this way because they found it to be the best way to get the logic correct and efficient for what the tasks do and how they interact. For example, the programming environment PVM, sometimes considered a predecessor to MPI, in addition to the SPMD program structure also supported running different programs on different UEs (sometimes called the MPMD program structure). The MPI designers, with the benefit of the PVM experience, chose to support only SPMD.

In addition to the advantages to the programmer, SPMD makes management of the solution much easier. It is much easier to keep a software infrastructure up to date and consistent if there is only one program to manage. This factor becomes especially important on systems with large numbers of PEs, which can grow to huge numbers. For example, the two fastest computers in the world according to the November 2003 top500 list [Top], the Earth Simulator at the Earth Simulator Center in Japan and the ASCI Q at Los Alamos National Labs, have 5120 and 8192 processors, respectively. If each PE runs a distinct program, managing the application software could quickly become prohibitively expensive.
* Initialize. The program is loaded onto each UE and opens with bookkeeping operations to establish a common context. The details of this procedure are tied to the parallel programming environment and typically involve establishing communication channels with other UEs.

* Obtain a unique identifier. Near the top of the program, an identifier is set that is unique to each UE. This is usually the UE's rank within the MPI group (that is, a number in the interval from 0 to N-1, where N is the number of UEs) or the thread ID in OpenMP. This unique identifier allows different UEs to make different decisions during program execution.

* Run the same program on each UE, using the unique ID to differentiate behavior on different UEs. The same program runs on each UE. Differences in the instructions executed by different UEs are usually driven by the identifier. (They could also depend on the UE's data.) There are many ways to specify that different UEs take different paths through the source code. The most common are (1) branching statements to give specific blocks of code to different UEs and (2) using the UE identifier in loop index calculations to split loop iterations among the UEs.

* Distribute data. The data operated on by each UE is specialized to that UE by one of two techniques: (1) decomposing global data into chunks and storing them in the local memory of each UE, and later, if required, recombining them into the globally relevant results or (2) sharing or replicating the program's major data structures and using the UE identifier to associate subsets of the data with particular UEs.

* Finalize. The program closes by cleaning up the shared context and shutting down the computation. If globally relevant data was distributed among UEs, it will need to be recombined.
Discussion
An important issue to keep in mind when developing SPMD programs is the clarity of abstraction, that is, how easy it is to understand the algorithm from reading the program's source code. Depending on how the data is handled, this can range from awful to good. If complex index algebra on the UE identifier is needed to determine the data relevant to a UE or the instruction branch, the algorithm can be almost impossible to follow from the source code. (The Distributed Array pattern discusses useful techniques for arrays.)

In some cases, a replicated data algorithm combined with simple loop splitting is the best option because it leads to a clear abstraction of the parallel algorithm within the source code and an algorithm with a high degree of sequential equivalence. Unfortunately, this simple approach might not scale well, and more complex solutions might be needed. Indeed, SPMD algorithms can be highly scalable, and algorithms requiring complex coordination between UEs and scaling out to several thousand UEs [PH95] have been written using this pattern. These highly scalable algorithms are usually extremely complicated, as they distribute the data across the nodes (that is, no simplifying replicated-data techniques) and generally include complex load-balancing logic. These algorithms, unfortunately, bear little resemblance to their serial counterparts, reflecting a common criticism of the SPMD pattern.

An important advantage of the SPMD pattern is that overheads associated with startup and termination are segregated at the beginning and end of the program, not inside time-critical loops. This contributes to efficient programs and results in the efficiency issues being driven by the communication overhead, the capability to balance the computational load among the UEs, and the amount of concurrency available in the algorithm itself.
SPMD programs are closely aligned with programming environments based on message passing. For example, most MPI or PVM programs use the SPMD pattern. Note, however, that it is possible to use the SPMD pattern with OpenMP [CPP01]. With regard to the hardware, the SPMD pattern does not assume anything concerning the address space within which the tasks execute. As long as each UE can run its own instruction stream operating on its own data (that is, the computer can be classified as MIMD), the SPMD structure is satisfied. This generality of SPMD programs is one of the strengths of this pattern.

Examples

The issues raised by application of the SPMD pattern are best discussed using three specific examples:
Numerical integration
Figure 5.2. Sequential program to carry out the numerical integration

#include <stdio.h>
#include <math.h>

int main () {
    int i;
    int num_steps = 1000000;
    double x, pi, step, sum = 0.0;

    step = 1.0/(double) num_steps;

    for (i = 0; i < num_steps; i++) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("pi %lf\n", pi);
    return 0;
}
A program to carry this calculation out on a single processor is shown in Fig. 5.2. To keep the program as simple as possible, we fix the number of steps to use in the integration at 1,000,000. The variable sum is initialized to 0 and the step size is computed as the range in x (equal to 1.0 in this case) divided by the number of steps. The area of each rectangle is the width (the step size) times the height (the value of the integrand at the center of the interval). Because the width is a constant, we pull it out of the summation and multiply the sum of the rectangle heights by the step size, step, to get our estimate of the definite integral.

We will look at several versions of the parallel algorithm. We can see all the elements of a classic SPMD program in the simple MPI version of this program, as shown in Fig. 5.3. The same program is run on each UE. Near the beginning of the program, the MPI environment is initialized and the ID for
each UE (my_id) is given by the process rank for each UE in the process group associated with the communicator MPI_COMM_WORLD (for information about communicators and other MPI details, see the MPI appendix, Appendix B). We use the number of UEs and the ID to assign loop ranges (i_start and i_end) to each UE. Because the number of steps may not be evenly divided by the number of UEs, we have to make sure the last UE runs up to the last step in the calculation. After the partial sums have been computed on each UE, we multiply by the step size, step, and then use the MPI_Reduce() routine to combine the partial sums into a global sum. (Reduction operations are described in more detail in the Implementation Mechanisms design space.) This global value will only be available in the process with my_id == 0, so we direct that process to print the answer.

In essence, what we have done in the example in Fig. 5.3 is to replicate the key data (in this case, the partial summation value, sum), use the UE's ID to explicitly split up the work into blocks with one block per UE, and then recombine the local results into the final global result. The challenge in applying this pattern is to (1) split up the data correctly, (2) correctly recombine the results, and (3) achieve an even distribution of the work. The first two steps were trivial in this example. The load balance, however, is a bit more difficult. Unfortunately, the simple procedure we used in Fig. 5.3 could result in significantly more work for the last UE if the number of UEs does not evenly divide the number of steps. For a more even distribution of the work, we need to spread out the extra iterations among multiple UEs. We show one way to do this in the program fragment in Fig. 5.4. We compute the number of iterations left over after dividing the number of steps by the number of processors (rem). We will increase the number of iterations computed by the first rem UEs to cover that amount of work. The code in Fig. 5.4 accomplishes that task. These sorts of index adjustments are the bane of programmers using the SPMD pattern. Such code is error-prone and the source of hours of frustration as program readers try to understand the reasoning behind this logic.
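The index adjustment described above can be captured in a small helper: give one extra iteration to each of the first rem UEs. This sketch is ours, written in Java rather than the C of the figures, and the function name is invented for illustration:

```java
public class BlockRange {
    // Returns {i_start, i_end} (half-open) for the given UE. The first
    // (numSteps % numUEs) UEs each take one extra iteration, so the
    // load across UEs differs by at most one iteration.
    static int[] range(int id, int numUEs, int numSteps) {
        int chunk = numSteps / numUEs;
        int rem = numSteps % numUEs;
        int start = id * chunk + Math.min(id, rem);  // shift by earlier extras
        int end = start + chunk + (id < rem ? 1 : 0);
        return new int[]{start, end};
    }

    public static void main(String[] args) {
        // 10 steps over 4 UEs: block sizes 3, 3, 2, 2 covering [0,10).
        for (int id = 0; id < 4; id++) {
            int[] r = range(id, 4, 10);
            System.out.println("UE " + id + ": [" + r[0] + "," + r[1] + ")");
        }
    }
}
```

The arithmetic is short, but as the text warns, a reader must still reconstruct why the `Math.min(id, rem)` shift makes the blocks contiguous; this is exactly the kind of index algebra that makes SPMD code hard to follow.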
Figure 5.3. MPI program to carry out a trapezoid rule integration in parallel by assigning one block of loop iterations to each UE and performing a reduction
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main (int argc, char *argv[]) {
    int i, i_start, i_end;
    int num_steps = 1000000;
    double x, pi, step, sum = 0.0;
    int my_id, numprocs;

    step = 1.0/(double) num_steps;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    i_start = my_id * (num_steps/numprocs);
    i_end = i_start + (num_steps/numprocs);
    if (my_id == (numprocs-1)) i_end = num_steps;

    for (i = i_start; i < i_end; i++) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    sum *= step;

    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (my_id == 0) printf("pi %lf\n", pi);

    MPI_Finalize();
    return 0;
}
Figure 5.5. MPI program to carry out a trapezoid rule integration in parallel using a simple loop-splitting algorithm with cyclic distribution of iterations and a reduction
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main (int argc, char *argv[]) {
    int i;
    int num_steps = 1000000;
    double x, pi, step, sum = 0.0;
    int my_id, numprocs;

    step = 1.0/(double) num_steps;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    for (i = my_id; i < num_steps; i += numprocs) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    sum *= step;

    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (my_id == 0) printf("pi %lf\n", pi);

    MPI_Finalize();
    return 0;
}
Figure 5.6. OpenMP program to carry out a trapezoid rule integration in parallel using the same SPMD algorithm used in Fig. 5.5

#include <stdio.h>
#include <math.h>
#include <omp.h>

int main () {
    int num_steps = 1000000;
    double pi, step, sum = 0.0;

    step = 1.0/(double) num_steps;

    #pragma omp parallel reduction(+:sum)
    {
        int i, id = omp_get_thread_num();
        int numthreads = omp_get_num_threads();
        double x;

        for (i = id; i < num_steps; i += numthreads) {
            x = (i+0.5)*step;
            sum += 4.0/(1.0+x*x);
        }
    } // end of parallel region

    pi = step * sum;
    printf("\n pi is %lf\n", pi);
    return 0;
}
SPMD programs can also be written using OpenMP and Java. In Fig. 5.6, we show an OpenMP version of this program.
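A Java rendering of the same SPMD algorithm uses explicit threads: each thread obtains a unique ID, takes a cyclic share of the iterations, and the partial results are combined after a join. This sketch is our own, not the book's Java example:

```java
public class SpmdPi {
    static double computePi(int numThreads, int numSteps) throws InterruptedException {
        double step = 1.0 / (double) numSteps;
        double[] partial = new double[numThreads];  // one slot per UE
        Thread[] threads = new Thread[numThreads];

        for (int t = 0; t < numThreads; t++) {
            final int id = t;  // the unique identifier for this UE
            threads[t] = new Thread(() -> {
                double sum = 0.0;
                // Cyclic distribution: thread id takes steps id, id+P, id+2P, ...
                for (int i = id; i < numSteps; i += numThreads) {
                    double x = (i + 0.5) * step;
                    sum += 4.0 / (1.0 + x * x);
                }
                partial[id] = sum * step;  // no contention: each UE owns a slot
            });
            threads[t].start();
        }

        double pi = 0.0;  // the "reduction": combine the partial results
        for (int t = 0; t < numThreads; t++) {
            threads[t].join();
            pi += partial[t];
        }
        return pi;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.printf("pi %f%n", computePi(4, 1000000));
    }
}
```

All the classic SPMD elements are visible: a unique ID, ID-driven loop splitting, replicated partial data, and a final recombination step.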
Throughout this pattern language, we have used molecular dynamics as a recurring example. Molecular dynamics simulates the motions of a large molecular system. It uses an explicit time-stepping methodology where at each time step, the force on each atom is computed and standard techniques from classical mechanics are used to compute how the forces change atomic motions.

This problem is ideal for presenting key concepts in parallel algorithms because there are so many ways to approach the problem based on the target computer system and the intended use of the program. In this discussion, we will follow the approach taken in [Mat95] and assume that (1) a sequential version of the program exists, (2) having a single program for sequential and parallel execution is important, and (3) the target system is a small cluster connected by a standard Ethernet LAN. More scalable algorithms for execution on massively parallel systems are discussed in [PH95].
Figure 5.7. Pseudocode for molecular dynamics example. This code is very similar to the version discussed earlier, but a few extra details have been included. To support more detailed pseudocode examples, the call to the function that initializes the force arrays has been made explicit. Also, the fact that the neighbor list is only occasionally updated is made explicit.

Int const N // number of atoms

Array of Real :: atoms (3,N)      //3D coordinates
Array of Real :: velocities (3,N) //velocity vector
Array of Real :: forces (3,N)     //force in each dimension
Array of List :: neighbors(N)     //atoms in cutoff volume
loop over time steps
    initialize_forces (N, forces)
    if (time to update neighbor list)
        neighbor_list (N, atoms, neighbors)
    end if
    vibrational_forces (N, atoms, forces)
    rotational_forces (N, atoms, forces)
    non_bonded_forces (N, atoms, neighbors, forces)
    update_atom_positions_and_velocities (N, atoms, velocities, forces)
    physical_properties ( ... Lots of stuff ... )
end loop
(Sec. 3.1.3).

2. In computing the non_bonded_force, each atom potentially interacts with all the other atoms. Hence, each UE needs read access to the full atomic position array. Also, due to Newton's third law, each UE will be scattering contributions to the force across the full force array (the Examples section of the Data Sharing pattern).

3. One way to decompose the MD problem into tasks is to focus on the computations needed for a particular atom; that is, we can parallelize this problem by assigning atoms to UEs (the Examples section of the Task Decomposition pattern).

Given that our target is a small cluster and from point (1) in the preceding list, we will only parallelize the force computations. Because the network is slow for parallel computing and given the data dependency in point (2), we will:
#include <mpi.h>

Int const N      // number of atoms
Int const LN     // maximum number of atoms assigned to a UE
Int ID           // an ID for each UE
Int num_UEs      // the number of UEs in the parallel computation

Array of Real :: atoms (3,N)        //3D coordinates
Array of Real :: velocities (3,N)   //velocity vector
Array of Real :: forces (3,N)       //force in each dimension
Array of Real :: final_forces (3,N) //globally summed force
Array of List :: neighbors(LN)      //atoms in cutoff volume
Array of Int  :: local_atoms(LN)    //atoms for this UE
ID = 0       // default ID (used by the serial code)
num_UEs = 1  // default num_UEs (used by the serial code)

MPI_Init()
MPI_Comm_rank(MPI_COMM_WORLD, &ID)
MPI_Comm_size(MPI_COMM_WORLD, &num_UEs)

loop over time steps
   initialize_forces (N, forces, final_forces)
   if(time to update neighbor list)
      neighbor_list (N, LN, atoms, neighbors)
   end if
   vibrational_forces (N, LN, local_atoms, atoms, forces)
   rotational_forces (N, LN, local_atoms, atoms, forces)
   non_bonded_forces (N, LN, atoms, local_atoms, neighbors, forces)
   MPI_Allreduce(forces, final_forces, 3*N, MPI_REAL,
                 MPI_SUM, MPI_COMM_WORLD)
   update_atom_positions_and_velocities(
          N, atoms, velocities, final_forces)
   physical_properties ( ... Lots of stuff ... )
end loop
MPI_Finalize()
Figure 5.9. Pseudocode for the nonbonded computation in a typical parallel molecular dynamics code. This code is almost identical to the sequential version of the function shown in Fig. 4.4. The only major change is a new array of integers holding the indices for the atoms assigned to this UE, local_atoms. We've also assumed that the neighbor list has been generated to hold only those atoms assigned to this UE. For the sake of allocating space for these arrays, we have added a parameter LN which is the largest number of atoms that can be assigned to a single UE.

function non_bonded_forces (N, LN, atoms, local_atoms, neighbors, forces)

   Int N   // number of atoms
   Int LN  // maximum number of atoms assigned to a UE
   Array of Real :: atoms (3,N)     //3D coordinates
   Array of Real :: forces (3,N)    //force in each dimension
   Array of List :: neighbors(LN)   //atoms in cutoff volume
   Array of Int  :: local_atoms(LN) //atoms assigned to this UE
   Real :: forceX, forceY, forceZ

   loop [i] over local_atoms
      loop [j] over neighbors(i)
         forceX = non_bond_force(atoms(1,i), atoms(1,j))
         forceY = non_bond_force(atoms(2,i), atoms(2,j))
         forceZ = non_bond_force(atoms(3,i), atoms(3,j))
         forces(1,i) += forceX;  forces(1,j) -= forceX;
         forces(2,i) += forceY;  forces(2,j) -= forceY;
         forces(3,i) += forceZ;  forces(3,j) -= forceZ;
      end loop [j]
   end loop [i]
end function non_bonded_forces
Only a few changes are made to the sequential functions. First, a second force array called final_forces is defined to hold the globally consistent force array appropriate for the update of the atomic positions and velocities. Second, a list of atoms assigned to the UE is created and passed to any function that will be parallelized. Third, the neighbor_list is modified to hold the list for only those atoms assigned to the UE. Finally, within each of the functions to be parallelized (the forces calculations), the loop over atoms is replaced by a loop over the list of local atoms.

We show an example of these simple changes in Fig. 5.9. This is almost identical to the sequential version of this function discussed in the Task Parallelism pattern. As discussed earlier, the following are the key changes.
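The scheme of private force arrays combined by a global sum can be illustrated serially. In this sketch (the names, sizes, and toy force function are ours, not the book's), each simulated UE accumulates forces only for its cyclically assigned atoms into its own array; element-wise summation of the per-UE arrays, which is what the MPI_Allreduce accomplishes in the real program, reproduces the serial result.

```c
#include <string.h>
#include <assert.h>

#define NATOMS 8
#define NUES   3   /* number of simulated UEs (illustrative) */

/* Toy pairwise force contribution on atom i from atom j. */
static double pair_force(int i, int j) { return (double)(j - i); }

/* Serial reference: forces on every atom in one pass. */
void forces_serial(double f[NATOMS]) {
    memset(f, 0, NATOMS * sizeof(double));
    for (int i = 0; i < NATOMS; i++)
        for (int j = 0; j < NATOMS; j++)
            if (j != i) f[i] += pair_force(i, j);
}

/* Work of one UE: forces only for its cyclic subset of atoms,
   written into a private force array. */
void forces_one_ue(int id, double f[NATOMS]) {
    memset(f, 0, NATOMS * sizeof(double));
    for (int i = id; i < NATOMS; i += NUES)
        for (int j = 0; j < NATOMS; j++)
            if (j != i) f[i] += pair_force(i, j);
}

/* Element-wise sum of the per-UE arrays: the Allreduce step. */
void reduce_forces(double final_f[NATOMS]) {
    double partial[NATOMS];
    memset(final_f, 0, NATOMS * sizeof(double));
    for (int id = 0; id < NUES; id++) {
        forces_one_ue(id, partial);
        for (int i = 0; i < NATOMS; i++) final_f[i] += partial[i];
    }
}
```

Because every atom is owned by exactly one UE under the cyclic assignment, each element of the reduced array is computed by the same inner loop as in the serial pass, so the results match exactly.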
function neighbor (N, LN, ID, cutoff, atoms, local_atoms, neighbors)

   Int N        // number of atoms
   Int LN       // max number of atoms assigned to a UE
   Real cutoff  // radius of sphere defining neighborhood
   Array of Real :: atoms (3,N)     //3D coordinates
   Array of List :: neighbors(LN)   //atoms in cutoff volume
   Array of Int  :: local_atoms(LN) //atoms assigned to this UE
   Real :: dist_squ

   initialize_lists (local_atoms, neighbors)
   loop [i] over atoms on UE  //split loop iterations among UEs
      add_to_list (i, local_atoms)
      loop [j] over atoms greater than i
         dist_squ = square(atom(1,i)-atom(1,j))
                  + square(atom(2,i)-atom(2,j))
                  + square(atom(3,i)-atom(3,j))
         if(dist_squ < (cutoff * cutoff))
            add_to_list (j, neighbors(i))
         end if
      end loop [j]
   end loop [i]
end function neighbor
The logic defining how the parallelism is distributed among the UEs is captured in the single loop in Fig. 5.10:
loop [i] over atoms on UE  //split loop iterations among UEs
   add_to_list (i, local_atoms)
The details of how this loop is split among UEs depend on the programming environment. An approach that works well with MPI is the cyclic distribution we used in Fig. 5.5:
for (i=id; i<number_of_atoms; i+=number_of_UEs){
   add_to_list (i, local_atoms)
}
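One property worth checking for any such distribution is that every atom lands in exactly one UE's local list. The small helper below is our own illustration, not part of the text; it replays the cyclic loop for each simulated UE and counts how many times each index is claimed.

```c
#include <string.h>
#include <assert.h>

/* Count how many UEs claim each index under a cyclic distribution;
   the inner loop mirrors the add_to_list(i, local_atoms) step that
   each UE executes with its own id. */
void cyclic_coverage(int n_atoms, int n_ues, int count[]) {
    memset(count, 0, n_atoms * sizeof(int));
    for (int id = 0; id < n_ues; id++)
        for (int i = id; i < n_atoms; i += n_ues)
            count[i]++;
}
```

If any count were 0 an atom's forces would never be computed; if any were 2 they would be double-counted by the global sum.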
#include <omp.h>

Int const N   // number of atoms
Int const LN  // maximum number of atoms assigned to a UE
Int ID        // an ID for each UE
Int num_UEs   // number of UEs in the parallel computation

Array of Real :: atoms(3,N)      //3D coordinates
Array of Real :: velocities(3,N) //velocity vector
Array of Real :: forces(3,N)     //force in each dim
Array of List :: neighbors(LN)   //atoms in cutoff volume
Array of Int  :: local_atoms(LN) //atoms for this UE

ID = 0
num_UEs = 1

#pragma omp parallel private (ID, num_UEs, local_atoms, forces)
{
   ID = omp_get_thread_num()
   num_UEs = omp_get_num_threads()

   loop over time steps
      initialize_forces (N, forces, final_forces)
      if(time to update neighbor list)
         neighbor_list (N, LN, atoms, neighbors)
      end if
      vibrational_forces (N, LN, local_atoms, atoms, forces)
      rotational_forces (N, LN, local_atoms, atoms, forces)
      non_bonded_forces (N, LN, atoms, local_atoms, neighbors, forces)

      #pragma omp critical
         final_forces += forces

      #pragma omp barrier

      #pragma omp single
      {
         update_atom_positions_and_velocities(
                N, atoms, velocities, final_forces)
         physical_properties ( ... Lots of stuff ... )
      } // remember, the end of a single implies a barrier
   end loop
} // end of OpenMP parallel region
A reduction clause on the parallel region cannot be used in this case because the result would not be available until the parallel region completes. The critical section produces the correct result, but the algorithm used has a runtime that is linear in the number of UEs and is hence suboptimal relative to other reduction algorithms as discussed in the Implementation Mechanisms design space. On systems with a modest number of processors, however, the reduction with a critical section works adequately.

The barrier following the critical section is required to make sure the reduction completes before the atomic positions and velocities are updated. We then use an OpenMP single construct to cause only one UE to do the update. An additional barrier is not needed following the single since the close of a single construct implies a barrier. The functions used to compute the forces are unchanged between the OpenMP and MPI versions of the program.
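As a concrete contrast with the linear critical-section reduction, here is a sketch (ours, not the book's) of a tree-style reduction over per-UE partial sums: values are combined pairwise, so only about log2(P) combining rounds are needed instead of P serialized updates. This is the kind of algorithm a runtime might use internally for a reduction; the serial code below just demonstrates the combining pattern and its result.

```c
#include <assert.h>

/* Pairwise (tree) reduction of p partial sums, in place.
   Each pass doubles the stride, so the number of passes is
   ceil(log2(p)) rather than the p serialized steps a
   critical-section reduction needs. Result ends up in partial[0]. */
double tree_reduce(double partial[], int p) {
    for (int stride = 1; stride < p; stride *= 2)
        for (int i = 0; i + stride < p; i += 2 * stride)
            partial[i] += partial[i + stride];
    return partial[0];
}
```

In a parallel implementation the additions within one pass are independent and can proceed concurrently, which is the source of the logarithmic depth.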
Mandelbrot set computation
Int const Nrows    // number of rows in the image
Int const RowSize  // number of pixels in a row
Int const M        // number of colors in color map
Real :: conv       // divergence rate for a pixel

Array of Int  :: color_map (M) // pixel color based on conv rate
Array of Int  :: row (RowSize) // Pixels to draw
Array of Real :: ranges(2)     // ranges in X and Y dimensions
manage_user_input(ranges, color_map) // input ranges, color map
initialize_graphics(RowSize, Nrows, M, ranges, color_map)

for (int i = 0; i<Nrows; i++){
   compute_Row (RowSize, ranges, row)
   graph(i, RowSize, M, color_map, ranges, row)
} // end loop [i] over rows
number of rows in the image, receiving rows as they finish and plotting them. UEs with rank other than 0 use a cyclic distribution of loop iterations and send the rows to the graphics UE as they finish.
Known uses
Figure 5.13. Pseudocode for a parallel MPI version of the Mandelbrot set generation program
#include <mpi.h>

Int const Nrows    // number of rows in the image
Int const RowSize  // number of pixels in a row
Int const M        // number of colors in color map
Real :: conv       // divergence rate for a pixel

Array of Int  :: color_map (M) // pixel color based on conv rate
Array of Int  :: row (RowSize) // Pixels to draw
Array of Real :: ranges(2)     // ranges in X and Y dimensions

Int :: inRowSize   // size of received row
Int :: ID          // ID of each UE (process)
Int :: num_UEs     // number of UEs (processes)
Int :: nworkers    // number of UEs computing rows
MPI_Status :: stat // MPI status parameter

MPI_Init()
MPI_Comm_rank(MPI_COMM_WORLD, &ID)
MPI_Comm_size(MPI_COMM_WORLD, &num_UEs)

// Algorithm requires at least two UEs since we are
// going to dedicate one to graphics
if (num_UEs < 2) MPI_Abort(MPI_COMM_WORLD, 1)

if (ID == 0){
   manage_user_input(ranges, color_map) // input ranges, color map
   initialize_graphics(RowSize, Nrows, M, ranges, color_map)
}

// Broadcast data from rank 0 process to all other processes
MPI_Bcast (ranges, 2, MPI_REAL, 0, MPI_COMM_WORLD);

if (ID == 0) {  // UE with rank 0 does graphics
   for (int i = 0; i<Nrows; i++){
      MPI_Recv(row, &inRowSize, MPI_REAL, MPI_ANY_SOURCE,
               MPI_ANY_TAG, MPI_COMM_WORLD, &stat)
      row_index = stat.MPI_TAG
      graph(row_index, RowSize, M, color_map, ranges, row)
   } // end loop over i
}
else {          // The other UEs compute the rows
   nworkers = num_UEs - 1
   for (int i = ID-1; i<Nrows; i+=nworkers){
      compute_Row (RowSize, ranges, row)
      MPI_Send (row, RowSize, MPI_REAL, 0, i, MPI_COMM_WORLD);
   } // end loop over i
}
MPI_Finalize()
* The workloads associated with the tasks are highly variable and unpredictable. If workloads are predictable, they can be sorted into equal-cost bins, statically assigned to UEs, and parallelized using the SPMD or Loop Parallelism patterns. But if they are unpredictable, static distributions tend to produce suboptimal load balance.

* The program structure for the computationally intensive portions of the problem doesn't map onto simple loops. If the algorithm is loop-based, one can usually achieve a statistically near-optimal workload by a cyclic distribution of iterations or by using a dynamic schedule on the loop (for example, in OpenMP, by using the schedule(dynamic) clause). But if the control structure in the program is more complex than a simple loop, more general approaches are required.

* The capabilities of the PEs available for the parallel computation vary across the parallel system, change over the course of the computation, or are unpredictable.
In some cases, tasks are tightly coupled (that is, they communicate or share read and write data) and must be active at the same time. In this case, the Master/Worker pattern is not applicable: The programmer has no choice but to explicitly size or group tasks onto UEs dynamically (that is, during the computation) to achieve an effective load balance. The logic to accomplish this can be difficult to implement, and if one is not careful, can add prohibitively large parallel overhead.

If the tasks are independent of each other, however, or if the dependencies can somehow be pulled out from the concurrent computation, the programmer has much greater flexibility in how to balance the load. This allows the load balancing to be done automatically and is the situation we address in this pattern.

This pattern is particularly relevant for problems using the Task Parallelism pattern when there are no dependencies among the tasks (embarrassingly parallel problems). It can also be used with the Fork/Join pattern for the cases where the mapping of tasks onto UEs is indirect.

Forces
* The work for each task, and in some cases even the capabilities of the PEs, varies unpredictably in these problems. Hence, explicit predictions of the runtime for any given task are not possible and the design must balance the load without them.

* Operations to balance the load impose communication overhead and can be very expensive. This suggests that scheduling should revolve around a smaller number of large tasks. However, large tasks reduce the number of ways tasks can be partitioned among the PEs, thereby making it more difficult to achieve good load balance.

* Logic to produce an optimal load can be convoluted and require error-prone changes to a program. Programmers need to make tradeoffs between the desire for an optimal distribution of the load and code that is easy to maintain.
Figure 5.14. The two elements of the Master/Worker pattern are the master and the worker. There is only one master, but there can be one or more workers. Logically, the master sets up the calculation and then manages a bag of tasks. Each worker grabs a task from the bag, carries out the work, and then goes back to the bag, repeating until the termination condition is met.
A straightforward approach to implementing the bag of tasks is with a single shared queue as described in the Shared Queue pattern. Many other mechanisms for creating a globally accessible structure where tasks can be inserted and removed are possible, however. Examples include a tuple space [CG91, FHA99], a distributed queue, or a monotonic counter (when the tasks can be specified with a set of contiguous integers).

Meanwhile, each worker enters a loop. At the top of the loop, the worker takes a task from the bag of tasks, does the indicated work, tests for completion, and then goes to fetch the next task. This continues until the termination condition is met, at which time the master wakes up, collects the results, and finishes the computation.

Master/worker algorithms automatically balance the load. By this, we mean the programmer does not explicitly decide which task is assigned to which UE. This decision is made dynamically by the master as a worker completes one task and accesses the bag of tasks for more work.
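The monotonic-counter variant of the bag of tasks is simple enough to sketch. In this serial simulation (our own illustration, not from the text), grab_task plays the role of an atomic fetch-and-add on a shared counter; in a real shared-memory program the increment would need to be atomic, and in a message-passing program a service like TCGMSG's nextval would provide it.

```c
#include <assert.h>

/* Bag of tasks as a monotonic counter. Each call claims the next
   unclaimed task index, or returns -1 when the bag is empty. In a
   real parallel program the read-and-increment must be atomic
   (e.g., a fetch-and-add); here it is a serial stand-in. */
static int next_task = 0;

int grab_task(int ntasks) {
    if (next_task >= ntasks) return -1;  /* bag is empty */
    return next_task++;
}
```

A worker simply loops calling grab_task until it sees -1; no worker is ever idle while tasks remain, which is exactly the automatic load balancing described above.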
Discussion
Master/worker algorithms have good scalability as long as the number of tasks greatly exceeds the number of workers and the costs of the individual tasks are not so variable that some workers take drastically longer than the others.

Management of the bag of tasks can require global communication, and the overhead for this can limit efficiency. This effect is not a problem when the work associated with the tasks on average is much greater than the time required for management. In some cases, the designer might need to increase the size of each task to decrease the number of times the global task bag needs to be accessed.

The Master/Worker pattern is not tied to any particular hardware environment. Programs using this pattern work well on everything from clusters to SMP machines. It is, of course, beneficial if the programming environment provides support for managing the bag of tasks.
Detecting completion
* In the simplest case, all tasks are placed in the bag before the workers begin. Then each worker continues until the bag is empty, at which point the workers terminate.

* Another approach is to use a queue to implement the task bag and arrange for the master or a worker to check for the desired termination condition. When it is detected, a poison pill, a special task that tells the workers to terminate, is created. The poison pill must be placed in the bag in such a way that it will be picked up on the next round of work. Depending on how the set of shared tasks is managed, it may be necessary to create one poison pill for each remaining worker to ensure that all workers receive the termination condition.

* Problems for which the set of tasks is not known initially produce unique challenges. This occurs, for example, when workers can add tasks as well as consume them (such as in applications of the Divide and Conquer pattern). In this case, it is not necessarily true that when a worker finishes a task and finds the task bag empty that there is no more work to do; another still-active worker could generate a new task. One must therefore ensure that the task bag is empty and all workers are finished. Further, in systems based on asynchronous message passing, it must be determined that there are no messages in transit that could, on their arrival, result in the creation of a new task. There are many known algorithms that solve this problem. For example, suppose the tasks are conceptually organized into a tree, where the root is the master task, and the children of a task are the tasks it generates. When all of the children of a task have terminated, the parent task can terminate. When all the children of the master task have terminated, the computation has terminated. Algorithms for termination detection are described in [BT89, Mat87, DS80].
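The poison-pill mechanism can be sketched with a trivial queue. This serial mock-up is our own illustration, with a plain ring buffer standing in for a thread-safe shared queue: the master enqueues real tasks followed by one pill per worker, and the worker loop exits the moment it dequeues a pill.

```c
#include <assert.h>

#define QCAP 64
#define POISON_PILL (-1)   /* sentinel task meaning "terminate" */

static int queue[QCAP];
static int q_head = 0, q_tail = 0;

void q_put(int v) { queue[q_tail++ % QCAP] = v; }
int  q_get(void)  { return queue[q_head++ % QCAP]; }

/* Worker loop: consume tasks until the poison pill appears.
   Returns the number of real tasks processed. */
int worker_drain(void) {
    int task, done = 0;
    while ((task = q_get()) != POISON_PILL)
        done++;   /* stand-in for doing the task's work */
    return done;
}
```

Because the pill is enqueued after all real tasks, a FIFO queue guarantees no work is left behind when a worker sees its pill; with other bag implementations, one pill per worker is needed, as noted above.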
Variations
There are several variations on this pattern. Because of the simple way it implements dynamic load balancing, this pattern is very popular, especially in embarrassingly parallel problems (as described in the Task Parallelism pattern). Here are a few of the more common variations.
* The master may turn into a worker after it has created the tasks. This is an effective technique when the termination condition can be detected without explicit action by the master (that is, the tasks can detect the termination condition on their own from the state of the bag of tasks).

* When the concurrent tasks map onto a simple loop, the master can be implicit and the pattern can be implemented as a loop with dynamic iteration assignment as described in the Loop Parallelism pattern.

* A centralized task queue can become a bottleneck, especially in a distributed-memory environment. An optimal solution [FLR98] is based on random work stealing. In this approach, each PE maintains a separate double-ended task queue. New tasks are placed in the front of the task queue of the local PE. When a task is completed, a subproblem is removed from the front of the local task queue. If the local task queue is empty, then another PE is chosen randomly, and a subproblem from the back of its task queue is "stolen". If that queue is also empty, then the PE tries again with another randomly chosen PE. This is particularly effective when used in problems based on the Divide and Conquer pattern. In this case, the tasks at the back of the queue were inserted earlier and hence represent larger subproblems. Thus, this approach tends to move large subproblems while handling the finer-grained subproblems at the PE where they were created. This helps the load balance and reduces overhead for the small tasks created in the deeper levels of recursion.

* The Master/Worker pattern can be modified to provide a modest level of fault tolerance [BDK95]. The master maintains two queues: one for tasks that still need to be assigned to workers and another for tasks that have already been assigned, but not completed. After the first queue is empty, the master can redundantly assign tasks from the "not completed" queue. Hence, if a worker dies and therefore can't complete its tasks, another worker will cover the unfinished tasks.
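The double-ended queue at the heart of the work-stealing variation can be sketched as follows. This is a serial illustration with invented names; a real implementation needs synchronization between owner and thief (and bounded-capacity handling). The owner pushes and pops at one end, so it works on its newest, smallest subproblems, while a thief steals from the other end, taking the oldest and typically largest ones.

```c
#include <assert.h>

#define DCAP 64

/* Tasks live in buf[lo..hi-1]. The owner works at the hi end
   (newest tasks); a thief steals at the lo end (oldest tasks). */
typedef struct { int buf[DCAP]; int lo, hi; } Deque;

void dq_init (Deque *d)       { d->lo = d->hi = 0; }
int  dq_empty(const Deque *d) { return d->lo == d->hi; }
void dq_push (Deque *d, int t){ d->buf[d->hi++] = t; }    /* owner adds   */
int  dq_pop  (Deque *d)       { return d->buf[--d->hi]; } /* owner: newest */
int  dq_steal(Deque *d)       { return d->buf[d->lo++]; } /* thief: oldest */
```

The asymmetry is the design point: in divide-and-conquer workloads, stealing from the old end moves a large subproblem (amortizing the steal's cost) while the owner keeps cache-warm, fine-grained work local.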
Figure 5.15. Master process for a master/worker program. This assumes a shared address space so the task and results queues are visible to all UEs. In this simple version, the master initializes the queue, launches the workers, and then waits for the workers to finish (that is, the ForkJoin command launches the workers and then waits for them to finish before returning). At that point, results are consumed and the computation completes.

Int const Ntasks    // Number of tasks
Int const Nworkers  // Number of workers

SharedQueue :: task_queue;     // task queue
SharedQueue :: global_results; // queue to hold results

void master()
{
   void worker();

   // Create and initialize shared data structures
   task_queue = new SharedQueue()
   global_results = new SharedQueue()
   for (int i = 0; i < Ntasks; i++)
      enqueue(task_queue, i)

   // Create Nworkers threads executing function worker()
   ForkJoin (Nworkers, worker)

   consume_the_results (Ntasks)
}
Figure 5.16. Worker process for a master/worker program. We assume a shared address space thereby making task_queue and global_results available to the master and all workers. A worker loops over the task_queue and exits when the end of the queue is encountered.

void worker()
{
   Int :: i
   Result :: res

   while (!empty(task_queue)) {
      i = dequeue(task_queue)
      res = do_lots_of_work(i)
      enqueue(global_results, res)
   }
}
executed by a set of threads whose run methods behave like the worker threads described previously: removing a Runnable object from the queue and executing its run method. The Executor interface in the java.util.concurrent package in Java 2 1.5 provides direct support for the Master/Worker pattern. Classes implementing the interface provide an execute method that takes a Runnable object and arranges for its execution. Different implementations of the Executor interface provide different ways of managing the Thread objects that actually do the work. The ThreadPoolExecutor implements the Master/Worker pattern by using a fixed pool of threads to execute the commands. To use Executor, the program instantiates an instance of a class implementing the interface, usually using a factory method in the Executors class. For example, the code in Fig. 5.17 sets up a ThreadPoolExecutor that creates num_threads threads. These threads execute tasks specified by Runnable objects that are placed in an unbounded queue.

After the Executor has been created, a Runnable object whose run method specifies the behavior of the task can be passed to the execute method, which arranges for its execution. For example, assume the Runnable object is referred to by a variable task. Then for the executor defined previously, exec.execute(task); will place the task in the queue, where it will eventually be serviced by one of the executor's worker threads.
Figure 5.17. Instantiating and initializing a pooled executor

/* create a ThreadPoolExecutor with an unbounded queue */
Executor exec = Executors.newFixedThreadPool(num_threads);
Generating the Mandelbrot set is described in detail in the Examples section of the SPMD pattern. The basic idea is to explore a quadratic recurrence relation at each point in a complex plane and color the point based on the rate at which the recursion converges or diverges. Each point in the complex plane can be computed independently and hence the problem is embarrassingly parallel (see the Task Parallelism pattern).

In Fig. 5.18, we reproduce the pseudocode given in the SPMD pattern for a sequential version of this problem. The program loops over the rows of the image, displaying one row at a time as they are computed.

On homogeneous clusters or lightly loaded shared-memory multiprocessor computers, approaches based on the SPMD or Loop Parallelism patterns are most effective. On a heterogeneous cluster or a multiprocessor system shared among many users (and hence with an unpredictable load on any given PE at any given time), a master/worker approach will be more effective.

We will create a master/worker version of a parallel Mandelbrot program based on the high-level structure described earlier. The master will be responsible for graphing the results. In some problems, the results generated by the workers interact and it can be important for the master to wait until all the workers have completed before consuming results. In this case, however, the results do not interact, so we split the fork and join operations and have the master plot results as they become available.

Following the Fork, the master must wait for results to appear on the global_results queue. Because we know there will be one result per row, the master knows in advance how many results to fetch and the termination condition is expressed simply in terms of the number of iterations of the loop. After all the results have been plotted, the master waits at the Join function until all the workers have completed, at which point the master completes. Code is shown in Fig. 5.19. Notice that this code is similar to the generic case discussed earlier, except that we have overlapped the processing of the results with their computation by splitting the Fork and Join. As the names imply, Fork launches UEs running the indicated function and Join causes the master to wait for the workers to cleanly terminate. See the Shared Queue pattern for more details about the queue.
Figure 5.18. Pseudocode for a sequential version of the Mandelbrot set generation program
Int const Nrows    // number of rows in the image
Int const RowSize  // number of pixels in a row
Int const M        // number of colors in color map
Real :: conv       // divergence rate for a pixel

Array of Int  :: color_map (M) // pixel color based on conv rate
Array of Int  :: row (RowSize) // Pixels to draw
Array of Real :: ranges(2)     // ranges in X and Y dimensions
manage_user_input(ranges, color_map) // input ranges, color map
initialize_graphics(RowSize, Nrows, M, ranges, color_map)

for (int i = 0; i<Nrows; i++){
   compute_Row (RowSize, ranges, row)
   graph(i, RowSize, M, color_map, ranges, row)
} // end loop [i] over rows
Figure 5.19. Master process for a master/worker parallel version of the Mandelbrot set generation program
Int const Ntasks   // number of tasks
Int const Nworkers // number of workers
Int const Nrows    // number of rows in the image
Int const RowSize  // number of pixels in a row
Int const M        // number of colors in color map

typedef Row :: struct of {
   int :: index
   array of int :: pixels (RowSize)
} temp_row;

Array of Int  :: color_map (M) // pixel color based on conv rate
Array of Real :: ranges(2)     // ranges in X and Y dimensions

SharedQueue of Int :: task_queue;     // task queue
SharedQueue of Row :: global_results; // queue to hold results

void master()
{
   void worker();

   manage_user_input(ranges, color_map) // input ranges, color map
   initialize_graphics(RowSize, Nrows, M, ranges, color_map)

   // Create and initialize shared data structures
   task_queue = new SharedQueue();
   global_results = new SharedQueue();
   for (int i = 0; i < Nrows; i++)
      enqueue(task_queue, i);

   // Create Nworkers threads executing function worker()
   Fork (Nworkers, worker);

   // Wait for results and graph them as they appear
   for (int i = 0; i < Nrows; i++) {
      while (empty(global_results)) { // wait for results
         wait
      }
      temp_row = dequeue(global_results)
      graph(temp_row.index, RowSize, M, color_map, ranges,
            temp_row.pixels)
   }

   // Terminate the worker UEs
   Join (Nworkers);
}
Figure 5.20. Worker process for a master/worker parallel version of the Mandelbrot set generation program. We assume a shared address space thereby making task_queue, global_results, and ranges available to the master and the workers.

void worker()
{
   Int i, irow;
   Row temp_row;

   while (!empty(task_queue)) {
      irow = dequeue(task_queue);
      compute_Row (RowSize, ranges, irow, temp_row.pixels)
      temp_row.index = irow
      enqueue(global_results, temp_row);
   }
}
This pattern is extensively used with the Linda programming environment. The tuple space in Linda is ideally suited to programs that use the Master/Worker pattern, as described in depth in [CG91] and in the survey paper [CGMS94].

The Master/Worker pattern is used in many distributed computing environments because these systems must deal with extreme levels of unpredictability in the availability of resources. The SETI@home project [SET] uses the Master/Worker pattern to utilize volunteers' Internet-connected computers to download and analyze radio telescope data as part of the Search for Extraterrestrial Intelligence (SETI). Programs constructed with the Calypso system [BDK95], a distributed computing framework which provides system support for dynamic changes in the set of PEs, also use the Master/Worker pattern. A parallel algorithm for detecting repeats in genomic data [RHB03] uses the Master/Worker pattern with MPI on a cluster of dual-processor PCs.

Related Patterns

This pattern is closely related to the Loop Parallelism pattern when the loops utilize some form of dynamic scheduling (such as when the schedule(dynamic) clause is used in OpenMP). Implementations of the Fork/Join pattern sometimes use the Master/Worker pattern behind the scenes.

This pattern is also closely related to algorithms that make use of the nextval function from TCGMSG [Har91, WSG95, LDSH95]. The nextval function implements a monotonic counter. If the bag of tasks can be mapped onto a fixed range of monotonic indices, the counter provides the bag of tasks and the function of the master is implied by the counter.

Finally, the owner-computes filter discussed in the molecular dynamics example in the SPMD pattern is essentially a variation on the master/worker theme. In such an algorithm, all the master would do is set up the bag of tasks (loop iterations) and assign them to UEs, with the assignment of tasks to UEs defined by the filter. Because the UEs can essentially perform this assignment themselves (by examining each task with the filter), no explicit master is needed.
Context

The overwhelming majority of programs used in scientific and engineering applications are expressed in terms of iterative constructs; that is, they are loop-based. Optimizing these programs by focusing strictly on the loops is a tradition dating back to the older vector supercomputers. Extending this approach to modern parallel computers suggests a parallel algorithm strategy in which concurrent tasks are identified as iterations of parallelized loops.

The advantage of structuring a parallel algorithm around parallelized loops is particularly important in problems for which well-accepted programs already exist. In many cases, it isn't practical to massively restructure an existing program to gain parallel performance. This is particularly important when the program (as is frequently the case) contains convoluted code and poorly understood algorithms.

This pattern addresses ways to structure loop-based programs for parallel computation. When existing code is available, the goal is to "evolve" a sequential program into a parallel program by a series of transformations on the loops. Ideally, all changes are localized to the loops with transformations that remove loop-carried dependencies and leave the overall program semantics unchanged. (Such transformations are called semantically neutral transformations.)

Not all problems can be approached in this loop-driven manner. Clearly, it will only work when the algorithm structure has most, if not all, of the computationally intensive work buried in a manageable number of distinct loops. Furthermore, the body of the loop must result in loop iterations that work well as parallel tasks (that is, they are computationally intensive, express sufficient concurrency, and are mostly independent).

Not all target computer systems align well with this style of parallel programming. If the code cannot be restructured to create effective distributed data structures, some level of support for a shared address space is essential in all but the most trivial cases. Finally, Amdahl's law and its requirement to minimize a program's serial fraction often means that loop-based approaches are only effective for systems with smaller numbers of PEs.
Even with these restrictions, this class of parallel algorithms is growing rapidly. Because loop-based algorithms are the traditional approach in high-performance computing and are still dominant in new programs, there is a large backlog of loop-based programs that need to be ported to modern parallel computers. The OpenMP API was created primarily to support parallelization of these loop-driven problems. Limitations on the scalability of these algorithms are serious, but acceptable, given that there are orders of magnitude more machines with two or four processors than machines with dozens or hundreds of processors.

This pattern is particularly relevant for OpenMP programs running on shared-memory computers and for problems using the Task Parallelism and Geometric Decomposition patterns.

Forces
* Find the bottlenecks. Locate the most computationally intensive loops either by inspection of the code, by understanding the performance needs of each subproblem, or through the use of program performance analysis tools. The amount of total runtime on representative data sets contained by these loops will ultimately limit the scalability of the parallel program (see Amdahl's law).

* Eliminate loop-carried dependencies. The loop iterations must be nearly independent. Find dependencies between iterations or read/write accesses and transform the code to remove or mitigate them. Finding and removing the dependencies is discussed in the Task Parallelism pattern, while protecting dependencies with synchronization constructs is discussed in the Shared Data pattern.

* Parallelize the loops. Split up the iterations among the UEs. To maintain sequential equivalence, use semantically neutral directives such as those provided with OpenMP (as described in the OpenMP appendix, Appendix A). Ideally, this should be done to one loop at a time with testing and careful inspection carried out at each point to make sure race conditions or other errors have not been introduced.

* Optimize the loop schedule. The iterations must be scheduled for execution by the UEs so the load is evenly balanced. Although the right schedule can often be chosen based on a clear understanding of the problem, frequently it is necessary to experiment to find the optimal schedule.
overcome parallel loop overhead, by (1) creating more concurrency to better utilize larger numbers of UEs, and (2) providing additional options for how the iterations are scheduled onto UEs.

Parallelizing the loops is easily done with OpenMP by using the omp parallel for directive. This directive tells the compiler to create a team of threads (the UEs in a shared-memory environment) and to split up loop iterations among the team. The last loop in Fig. 5.22 is an example of a loop parallelized with OpenMP. We describe this directive at a high level in the Implementation Mechanisms design space. Syntactic details are included in the OpenMP appendix, Appendix A.

Notice that in Fig. 5.22 we had to direct the system to create copies of the indices i and j local to each thread. The single most common error in using this pattern is to neglect to "privatize" key variables. If i and j are shared, then updates of i and j by different UEs can collide and lead to unpredictable results (that is, the program will contain a race condition). Compilers usually will not detect these errors, so programmers must take great care to make sure they avoid these situations.
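The earlier step of eliminating loop-carried dependencies can be illustrated with a small example of our own (not taken from the text). The first loop couples its iterations through an induction variable; rewriting the variable as a function of the loop index is a semantically neutral transformation that makes the iterations independent and therefore safe to split among UEs.

```c
#include <assert.h>

/* Original form: iterations depend on the running value of idx,
   a loop-carried dependency that blocks parallelization. */
void fill_dependent(int n, int out[]) {
    int idx = 0;
    for (int i = 0; i < n; i++) {
        out[i] = idx;
        idx += 3;       /* carried from one iteration to the next */
    }
}

/* Transformed form: idx is computed from i alone, so every
   iteration is independent and the loop can be parallelized
   (e.g., with an omp parallel for directive). */
void fill_independent(int n, int out[]) {
    for (int i = 0; i < n; i++)
        out[i] = 3 * i;
}
```

Because the two functions produce identical arrays for a single thread, the transformation is semantically neutral, and since the second version has no cross-iteration state, it is also sequentially equivalent when parallelized.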
Figure 5.21. Program fragment showing merging loops to increase the amount of work per iteration
#define N 20
#define Npoints 512

void FFT();    // a function to apply an FFT
void invFFT(); // a function to apply an inverse FFT
void filter(); // a frequency space filter
void setH();   // Set values of filter, H
int main()
{
   int i, j;
   double A[Npoints], B[Npoints], C[Npoints], H[Npoints];
   setH(Npoints, H);

   // do a bunch of work resulting in values for A and C

   // method one: distinct loops to compute A and C
   for(i=0; i<N; i++){
      FFT (Npoints, A, B);    // B = transformed A
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, A);  // A = inv transformed B
   }
   for(i=0; i<N; i++){
      FFT (Npoints, C, B);    // B = transformed C
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, C);  // C = inv transformed B
   }

   // method two: the above pair of loops combined into
   // a single loop
   for(i=0; i<N; i++){
      FFT (Npoints, A, B);    // B = transformed A
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, A);  // A = inv transformed B
      FFT (Npoints, C, B);    // B = transformed C
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, C);  // C = inv transformed B
   }
}
The key to the application of this pattern is to use semantically neutral modifications to produce sequentially equivalent code. A semantically neutral modification doesn't change the meaning of the single-threaded program. Techniques for loop merging and coalescing of nested loops described previously, when used appropriately, are examples of semantically neutral modifications. In addition, most of the directives in OpenMP are semantically neutral. This means that adding the directive and running it with a single thread will give the same result as running the original program without the OpenMP directive.

Two programs that are semantically equivalent (when run with a single thread) need not both be sequentially equivalent. Recall that sequentially equivalent means that the program will give the same result (subject to roundoff errors due to changing the order of floating-point operations) whether run with one thread or many. Indeed, the (semantically neutral) transformations that eliminate loop-carried dependencies are motivated by the desire to change a program that is not sequentially equivalent into one that is. When transformations are made to improve performance, even though the transformations are semantically neutral, one must be careful that sequential equivalence has not been lost.
Figure 5.22. Program fragment showing coalescing nested loops to produce a single loop with a larger number of iterations
#define N 20
#define M 10
extern double work(); // a time-consuming function

int main() {
   int i, j, ij;
   double A[M][N];

   // method one: nested loops
   for(j=0; j<N; j++){
      for(i=0; i<M; i++){
         A[i][j] = work(i,j);
      }
   }

   // method two: the above pair of nested loops combined into
   // a single loop.
   for(ij=0; ij<N*M; ij++){
      j = ij/M;
      i = ij%M;
      A[i][j] = work(i,j);
   }

   // method three: the above loop parallelized with OpenMP.
   // The omp pragma creates a team of threads and maps loop
   // iterations onto them. The private clause tells each thread
   // to maintain local copies of ij, j, and i.
   #pragma omp parallel for private(ij, j, i)
   for(ij=0; ij<N*M; ij++){
      j = ij/M;
      i = ij%M;
      A[i][j] = work(i,j);
   }
   return 0;
}
It is much more difficult to define sequentially equivalent programs when the code mentions either a thread ID or the number of threads. Algorithms that reference thread IDs and the number of threads tend to favor particular threads or even particular numbers of threads, a situation that is dangerous when the goal is a sequentially equivalent program.

When an algorithm depends on the thread ID, the programmer is using the SPMD pattern. This may be confusing. SPMD programs can be loop based. In fact, many of the examples in the SPMD pattern are indeed loop-based algorithms. But they are not instances of the Loop Parallelism pattern, because they display the hallmark trait of an SPMD program: namely, they use the UE ID to guide the algorithm.

Finally, we've assumed that a directive-based system such as OpenMP is available when using this pattern. It is possible, but clearly more difficult, to apply this pattern without such a directive-based programming environment. For example, in object-oriented designs, one can use the Loop Parallelism pattern by making clever use of anonymous classes with parallel iterators. Because the parallelism is buried in the iterators, the conditions of sequential equivalence can be met.
Performance considerations
some degree of nonuniformity in memory access times across the system. In many cases, these effects are of secondary concern, and we can ignore how a program's memory access patterns match up with the target system's memory hierarchy. In other cases, particularly on larger shared-memory machines, programs must be explicitly organized according to the needs of the memory hierarchy. The most common trick is to make sure the data access patterns during initialization of key data structures match those during later computation using these data structures. This is discussed in more detail in [Mat03, NA01] and later in this pattern as part of the mesh computation example.

Another performance problem is false sharing. This occurs when variables are not shared between UEs, but happen to reside on the same cache line. Hence, even though the program semantics implies independence, each access by each UE requires movement of a cache line between UEs. This can create huge overheads as cache lines are repeatedly invalidated and moved between UEs as these supposedly independent variables are updated. An example of a program fragment that would incur high levels of false sharing is shown in Fig. 5.23. In this code, we have a pair of nested loops. The outermost loop has a small iteration count that will map onto the number of UEs (which we assume is four in this case). The innermost loop runs over a large number of time-consuming iterations. Assuming the iterations of the innermost loop are roughly equal, this loop should parallelize effectively. But the updates to the elements of the A array inside the innermost loop mean each update requires the UE in question to own the indicated cache line. Although the elements of A are truly independent between UEs, they likely sit in the same cache line. Hence, every iteration in the innermost loop incurs an expensive cache-line invalidate-and-movement operation. It is not uncommon for this to not only destroy all parallel speedup, but even to cause the parallel program to become slower as more PEs are added. The solution is to create a temporary variable on each thread to accumulate values in the innermost loop. False sharing is still a factor, but only for the much smaller outermost loop, where the performance impact is negligible.
Figure 5.23. Program fragment showing an example of false sharing. The small array A is held in one or two cache lines. As the UEs access A inside the innermost loop, they will need to take ownership of the cache line back from the other UEs. This back-and-forth movement of the cache lines destroys performance. The solution is to use a temporary variable inside the innermost loop.
#include <omp.h>
#define N 4     // Assume this equals the number of UEs
#define M 1000
extern double work(int, int); // a time-consuming function

int main() {
   int i, j;
   double A[N] = {0.0}; // Initialize the array to zero

   // method one: a loop with false sharing from A since the elements
   // of A are likely to reside in the same cache line.
   #pragma omp parallel for private(j,i)
   for(j=0; j<N; j++){
      for(i=0; i<M; i++){
         A[j] += work(i,j);
      }
   }

   // method two: remove the false sharing by using a temporary
   // private variable in the innermost loop
   double temp;
   #pragma omp parallel for private(j,i,temp)
   for(j=0; j<N; j++){
      temp = 0.0;
      for(i=0; i<M; i++){
         temp += work(i,j);
      }
      A[j] += temp;
   }
   return 0;
}
Examples

As examples of this pattern in action, we will briefly consider the following:

* a numerical integration program to estimate the value of π
* the nonbonded force computation from a molecular dynamics program
* a Mandelbrot set computation
* a mesh computation for the 1D heat-diffusion equation
Each of these examples has been described elsewhere in detail. We will restrict our discussion in this pattern to the key loops and how they can be parallelized.
Numerical integration
Consider the problem of estimating the value of π using Eq. 5.5.

Equation 5.5

    π = ∫₀¹ 4/(1 + x²) dx
#include <stdio.h>
#include <math.h>

int main () {
   int i;
   int num_steps = 1000000;
   double x, pi, step, sum = 0.0;

   step = 1.0/(double) num_steps;
   for (i=0; i<num_steps; i++) {
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
   }
   pi = step * sum;
   printf("pi %lf\n", pi);
   return 0;
}
The OpenMP include file defines function prototypes and opaque data types used by OpenMP.
#pragma omp parallel for private(x) reduction(+:sum)
Molecular dynamics.
Throughout this book, we have used molecular dynamics as a recurring example. Molecular dynamics simulates the motions of a large molecular system. It uses an explicit time-stepping methodology where at each time step the force on each atom is computed and standard techniques from classical mechanics are used to compute how the forces change atomic motions.

The core algorithm, including pseudocode, was presented in Sec. 3.1.3 and in the SPMD pattern. The problem comes down to a collection of computationally expensive loops over the atoms within the molecular system. These are embedded in a top-level loop over time.

The loop over time cannot be parallelized because the coordinates and velocities from time step t-1 are the starting point for time step t. The individual loops over atoms, however, can be parallelized. The most important case to address is the nonbonded energy calculation. The code for this computation is shown in Fig. 5.25. Unlike the approach used in the examples from the SPMD pattern, we assume that the program and its data structures are unchanged from the serial case.
Figure 5.25. Pseudocode for the nonbonded computation in a typical parallel molecular dynamics code. This code is almost identical to the sequential version of the function shown previously in Fig. 4.4.

function non_bonded_forces (N, Atoms, neighbors, Forces)
   Int N                           // number of atoms
   Array of Real :: atoms (3,N)    // 3D coordinates
   Array of Real :: forces (3,N)   // force in each dimension
   Array of List :: neighbors(N)   // atoms in cutoff volume
   Real :: forceX, forceY, forceZ

   loop [i] over atoms
      loop [j] over neighbors(i)
         forceX = non_bond_force(atoms(1,i), atoms(1,j))
         forceY = non_bond_force(atoms(2,i), atoms(2,j))
         forceZ = non_bond_force(atoms(3,i), atoms(3,j))
         force(1,i) += forceX;   force(1,j) -= forceX;
         force(2,i) += forceY;   force(2,j) -= forceY;
         force(3,i) += forceZ;   force(3,j) -= forceZ;
      end loop [j]
   end loop [i]
end function non_bonded_forces
This schedule tells the compiler to group the loop iterations into blocks of size 10 and assign them dynamically to the UEs. The size of the blocks is arbitrary and chosen to balance dynamic scheduling overhead versus how effectively the load can be balanced.

OpenMP 2.0 for C/C++ does not support reductions over arrays, so the reduction would need to be done explicitly. This is straightforward and is shown in Fig. 5.11. A future release of OpenMP will correct this deficiency and support reductions over arrays for all languages that support OpenMP.

The same method used to parallelize the nonbonded force computation could be used throughout the molecular dynamics program. The performance and scalability will lag the analogous SPMD version of the program. The problem is that each time a parallel directive is encountered, a new team of threads is in principle created. Most OpenMP implementations use a thread pool, rather than actually creating a new team of threads for each parallel region, which minimizes thread creation and destruction overhead. However, this method of parallelizing the computation still adds significant overhead. Also, the reuse of data from caches tends to be poor for these approaches. In principle, each loop can access a different pattern of atoms on each UE. This eliminates the capability for UEs to make effective use of values already in cache.

Even with these shortcomings, however, these approaches are commonly used when the goal is extra parallelism on a small shared-memory system [BBE+99]. For example, one might use an SPMD version of the molecular dynamics program across a cluster and then use OpenMP to gain extra performance from dual processors or from microprocessors utilizing simultaneous multithreading [MPS02].
Mandelbrot set computation
C and Z are complex numbers and the recurrence is started with Z0 = C. The image plots the
Each pixel corresponds to a value of C in the quadratic recurrence. We compute this value based on the input range and the pixel indices.
Figure 5.26. Pseudocode for a sequential version of the Mandelbrot set generation program
Int const Nrows      // number of rows in the image
Int const RowSize    // number of pixels in a row
Int const M          // number of colors in color map
Real :: conv         // divergence rate for a pixel
Array of Int :: color_map (M)   // pixel color based on conv rate
Array of Int :: row (RowSize)   // Pixels to draw
Array of Real :: ranges(2)      // ranges in X and Y dimensions

manage_user_input(ranges, color_map)   // input ranges, color map
initialize_graphics(RowSize, Nrows, M, ranges, color_map)

for (int i = 0; i < Nrows; i++){
   compute_Row (RowSize, ranges, row)
   graph(i, RowSize, M, color_map, ranges, row)
} // end loop [i] over rows
The scheduling can be a bit tricky because the work associated with each row will vary considerably
Mesh computation

Consider a simple mesh computation that solves the 1D heat-diffusion equation. The details of this problem and its solution using OpenMP are presented in the Geometric Decomposition pattern. We reproduce this solution in Fig. 5.27.

This program would work well on most shared-memory computers. A careful analysis of the program performance, however, would expose two performance problems. First, the single directive required to protect the swapping of the shared pointers adds an extra barrier, thereby greatly increasing the synchronization overhead. Second, on NUMA computers, memory access overhead is likely to be high because we've made no effort to keep the arrays near the PEs that will be manipulating them.

We address both of these problems in Fig. 5.28. To eliminate the need for the single directive, we modify the program so each thread has its own copy of the pointers uk and ukp1. This can be done with a private clause, but to be useful, we need the new, private copies of uk and ukp1 to point to the shared arrays comprising the mesh of values. We do this with the firstprivate clause applied to the parallel directive that creates the team of threads.

The other performance issue we address, minimizing memory access overhead, is more subtle. As discussed earlier, to reduce memory traffic in the system, it is important to keep the data close to the PEs that will work with the data. On NUMA computers, this corresponds to making sure the pages of memory are allocated and "owned" by the PEs that will be working with the data contained in the page. The most common NUMA page placement algorithm is the "first touch" algorithm, in which
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]) {
   uk[0] = LEFTVAL; uk[NX-1] = RIGHTVAL;
   for (int i = 1; i < NX-1; ++i)
      uk[i] = 0.0;
   for (int i = 0; i < NX; ++i)
      ukp1[i] = uk[i];
}

void printValues(double uk[], int step) { /* NOT SHOWN */ }

int main(void) {
   /* pointers to arrays for two iterations of algorithm */
   double *uk = malloc(sizeof(double) * NX);
   double *ukp1 = malloc(sizeof(double) * NX);
   double *temp;
   int i, k;
   double dx = 1.0/NX;
   double dt = 0.5*dx*dx;

   #pragma omp parallel private (k, i)
   {
      initialize(uk, ukp1);
      for (k = 0; k < NSTEPS; ++k) {
         #pragma omp for schedule(static)
         for (i = 1; i < NX-1; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx))*(uk[i+1]-2*uk[i]+uk[i-1]);
         }
         /* "copy" ukp1 to uk by swapping pointers */
         #pragma omp single
         { temp = ukp1; ukp1 = uk; uk = temp; }
      }
   }
   return 0;
}
We do this by first changing the initialization slightly so the initialization loop is identical to the
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]) {
   int i;
   uk[0] = LEFTVAL; uk[NX-1] = RIGHTVAL;
   ukp1[0] = LEFTVAL; ukp1[NX-1] = RIGHTVAL;
   #pragma omp for schedule(static)
   for (i = 1; i < NX-1; ++i){
      uk[i] = 0.0;
      ukp1[i] = 0.0;
   }
}

void printValues(double uk[], int step) { /* NOT SHOWN */ }

int main(void) {
   /* pointers to arrays for two iterations of algorithm */
   double *uk = malloc(sizeof(double) * NX);
   double *ukp1 = malloc(sizeof(double) * NX);
   double *temp;
   int i, k;
   double dx = 1.0/NX;
   double dt = 0.5*dx*dx;

   #pragma omp parallel private (k, i, temp) firstprivate(uk, ukp1)
   {
      initialize(uk, ukp1);
      for (k = 0; k < NSTEPS; ++k) {
         #pragma omp for schedule(static)
         for (i = 1; i < NX-1; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx))*(uk[i+1]-2*uk[i]+uk[i-1]);
         }
         /* "copy" ukp1 to uk by swapping pointers */
         temp = ukp1; ukp1 = uk; uk = temp;
      }
   }
   return 0;
}
Known uses
The Loop Parallelism pattern is heavily used by OpenMP programmers. Annual workshops are held in North America (Wompat: Workshop on OpenMP Applications and Tools), Europe (EWOMP: European Workshop on OpenMP), and Japan (WOMPEI: Workshop on OpenMP Experiences and Implementations) to discuss OpenMP and its use. Proceedings from many of these workshops are widely available [VJKT00, Sci03, EV01] and are full of examples of the Loop Parallelism pattern.

Most of the work on OpenMP has been restricted to shared-memory multiprocessor machines for problems that work well with a nearly flat memory hierarchy. Work has been done to extend OpenMP applications to more complicated memory hierarchies, including NUMA machines [NA01, SSGF00] and even clusters [HLCZ99, SHTS01].

Related Patterns

The concept of driving parallelism from a collection of loops is general and used with many patterns. In particular, many problems using the SPMD pattern are loop based. They use the UE ID, however, to drive the parallelization of the loop and hence don't perfectly map onto this pattern. Furthermore, problems using the SPMD pattern usually include some degree of parallel logic in between the loops. This allows them to decrease their serial fraction and is one of the reasons why SPMD programs tend to scale better than programs using the Loop Parallelism pattern.

Algorithms targeted for shared-memory computers that use the Task Parallelism or Geometric Decomposition patterns frequently use the Loop Parallelism pattern.
program continues to execute. In most cases, the relationships between tasks are simple, and dynamic task creation can be handled with parallel loops (as described in the Loop Parallelism pattern) or through task queues (as described in the Master/Worker pattern). In other cases, relationships between the tasks within the algorithm must be captured in the way the tasks are managed. Examples include recursively generated task structures, highly irregular sets of connected tasks, and problems where different functions are mapped onto different concurrent tasks. In each of these examples, tasks are forked and later joined with the parent task (that is, the task that executed the fork) and the other tasks created by the same fork. These problems are addressed in the Fork/Join pattern.

As an example, consider an algorithm designed using the Divide and Conquer pattern. As the program execution proceeds, the problem is split into subproblems and new tasks are recursively created (or forked) to concurrently execute subproblems; each of these tasks may in turn be further split. When all the tasks created to handle a particular split have terminated and joined with the parent task, the parent task continues the computation.

This pattern is particularly relevant for Java programs running on shared-memory computers and for problems using the Divide and Conquer and Recursive Data patterns. OpenMP can be used effectively with this pattern when the OpenMP environment supports nested parallel regions.

Forces
In principle, nested parallel regions in OpenMP programs also map onto this direct
Thread and process creation and destruction is one of the more expensive operations that occur in parallel programs. Thus, if a program contains repeated fork and join sections, the simple solution, which would require repeated destruction and creation of UEs, might not be efficient enough. Also, if at some point there are many more UEs than PEs, the program might incur unacceptable overhead due to the costs of context switches.

In this case, it is desirable to avoid the dynamic UE creation by implementing the fork/join paradigm using a thread pool. The idea is to create a (relatively) static set of UEs before the first fork operation. The number of UEs is usually the same as the number of PEs. The mapping of tasks to UEs then occurs dynamically using a task queue. The UEs themselves are not repeatedly created and destroyed, but simply mapped to dynamically created tasks as the need arises. This approach, although complicated to implement, usually results in efficient programs with good load balance. We will discuss a Java program that uses this approach in the Examples section of this pattern.

In OpenMP, there is some controversy over the best approach to use with this indirect mapping approach [Mat03]. An approach gaining credibility is one based on a new, proposed OpenMP workshare construct called a task queue [SHPT00]. The proposal actually defines two new constructs: a task queue and a task. As the name implies, the programmer uses a task queue construct to create the task queue. Inside the task queue construct, a task construct defines a block of code that will be packaged into a task and placed on the task queue. The team of threads (as usually created with a parallel construct), playing the role of a thread pool, pulls tasks off the queue and executes them until the queue is empty.

Unlike OpenMP parallel regions, task queues can be dependably nested to produce a hierarchy of task queues. The threads work across task queues using work stealing to keep all threads fully occupied until all of the queues are empty. This approach has been shown to work well [SHPT00] and is likely to be adopted in a future OpenMP specification.

Examples

As examples of this pattern, we will consider direct mapping and indirect mapping implementations
static void sort(final int[] A, final int lo, final int hi) {
   int n = hi - lo;
   //if not large enough to do in parallel, sort sequentially
   if (n <= THRESHOLD) {
      Arrays.sort(A, lo, hi);
      return;
   } else {
      //split array
      final int pivot = (hi+lo)/2;
      //create and start new thread to sort lower half
      Thread t = new Thread() {
         public void run() {
            sort(A, lo, pivot);
         }
      };
      t.start();
      //sort upper half in current thread
      sort(A, pivot, hi);
      //wait for other thread
      try { t.join(); }
      catch (InterruptedException e){ Thread.dumpStack(); }
      //merge sorted arrays
      int[] ws = new int[n];
      System.arraycopy(A, lo, ws, 0, n);
      int wpivot = pivot - lo;
      int wlo = 0;
      int whi = wpivot;
      for (int i = lo; i != hi; i++) {
         if ((wlo < wpivot) && (whi >= n || ws[wlo] <= ws[whi])) {
            A[i] = ws[wlo++];
         } else {
            A[i] = ws[whi++];
         }
      }
   }
}
The first step of the method is to compute the size of the segment of the array to be sorted. If the size of the problem is too small to make the overhead of sorting it in parallel worthwhile, then a sequential sorting algorithm is used (in this case, the tuned quicksort implementation provided by the Arrays class in the java.util package). If the sequential algorithm is not used, then a pivot point is computed to divide the segment to be sorted. A new thread is forked to sort the lower half of the array, while the parent thread sorts the upper half. The new task is specified by the run method of an anonymous inner subclass of the Thread class. When the new thread has finished sorting, it terminates. When the parent thread finishes sorting, it performs a join to wait for the child thread to terminate and then merges the two sorted segments together.

This simple approach may be adequate in fairly regular problems where appropriate threshold values can easily be determined. We stress that it is crucial that the threshold value be chosen appropriately: If too small, the overhead from too many UEs can make the program run even slower than a sequential version. If too large, potential concurrency remains unexploited.
Figure 5.30. Instantiating FJTaskRunnerGroup and invoking the master task
int groupSize = 4;  //number of threads
FJTaskRunnerGroup group = new FJTaskRunnerGroup(groupSize);
group.invoke(new FJTask() {
   public void run() {
      synchronized(this) {
         sort(A, 0, A.length);
      }
   }
});
This example uses the FJTask framework included as part of the public-domain package EDU.oswego.cs.dl.util.concurrent [Lea00b].[4] Instead of creating a new thread to execute each task, an instance of a (subclass of) FJTask is created. The package then dynamically maps the FJTask objects to a static set of threads for execution. Although less general than a Thread, an FJTask is a much lighter-weight object than a thread and is thus much cheaper to create and destroy. In Fig. 5.30 and Fig. 5.31, we show how to modify the mergesort example to use FJTasks instead of Java threads. The needed classes are imported from package EDU.oswego.cs.dl.util.concurrent. Before starting any FJTasks, a FJTaskRunnerGroup must be instantiated, as shown in Fig. 5.30. This creates the threads that will constitute the thread pool and takes the number of threads (group size) as a parameter. Once instantiated, the master task is invoked using the invoke method on the FJTaskRunnerGroup.
Because OpenMP is based on a fork/join programming model, one might expect heavy use of the
Fork/Join pattern by OpenMP programmers. The reality is, however, that most OpenMP programmers use either the Loop Parallelism or SPMD patterns because the current OpenMP standard provides poor support for true nesting of parallel regions. One of the few published accounts of using the Fork/Join pattern with standard OpenMP is a paper where nested parallelism was used to provide fine-grained parallelism in an implementation of LAPACK [ARv03].

Extending OpenMP so it can use the Fork/Join pattern in substantial applications is an active area of research. We've mentioned one of these lines of investigation for the case of the indirect mapping solution of the Fork/Join pattern (the task queue [SHPT00]). Another possibility is to support nested parallel regions with explicit groups of threads for the direct mapping solution of the Fork/Join pattern (the Nanos OpenMP compiler [GAM+00]).

Related Patterns

Algorithms that use the Divide and Conquer pattern use the Fork/Join pattern.

The Loop Parallelism pattern, in which threads are forked just to handle a single parallel loop, is an instance of the Fork/Join pattern.

The Master/Worker pattern, which in turn uses the Shared Queue pattern, can be used to implement the indirect mapping solution.
Given that the problem naturally decomposes into nearly independent tasks (one per set), the solution to this problem would use the Task Parallelism pattern. Using the pattern is complicated, however, by the fact that all tasks need both read and write access to the data structure of rejected sets. Also, because this data structure changes during the computation, we cannot use the replication technique described in the Task Parallelism pattern. Partitioning the data structure and basing a solution on this data decomposition, as described in the Geometric Decomposition pattern, might seem like a good alternative, but the way in which the elements are rejected is unpredictable, so any data decomposition is likely to lead to a poor load balance.

Similar difficulties can arise any time shared data must be explicitly managed inside a set of concurrent tasks. The common elements for problems that need the Shared Data pattern are (1) at least one data structure is accessed by multiple tasks in the course of the program's execution, (2) at least one task modifies the shared data structure, and (3) the tasks potentially need to use the modified value during the concurrent computation.

Forces
take (dequeue), and possibly other operations, such as a test for an empty queue or a test to see if a specified element is present. Each task will typically perform a sequence of these operations. These operations should have the property that if they are executed serially (that is, one at a time, without interference from other tasks), each operation will leave the data in a consistent state.

The implementation of the individual operations will most likely involve a sequence of lower-level actions, the results of which should not be visible to other UEs. For example, if we implemented the previously mentioned queue using a linked list, a "take" operation actually involves a sequence of lower-level operations (which may themselves consist of a sequence of even lower-level operations):

1. Use variable first to obtain a reference to the first object in the list.
2. From the first object, get a reference to the second object in the list.
3. Replace the value of first with the reference to the second object.
4. Update the size of the list.
5. Return the first element.

If two tasks are executing "take" operations concurrently, and these lower-level operations are interleaved (that is, the "take" operations are not being executed atomically), the result could easily be an inconsistent list.
Implement an appropriate concurrency-control protocol
After the ADT and its operations have been identified, the objective is to implement a concurrency-control protocol to ensure that these operations give the same results as if they were executed serially. There are several ways to do this; start with the first technique, which is the simplest, and then try the other more complex techniques if it does not yield acceptable performance. These more complex techniques can be combined if more than one is applicable.

One-at-a-time execution. The easiest solution is to ensure that the operations are indeed executed serially.

In a shared-memory environment, the most straightforward way to do this is to treat each operation as part of a single critical section and use a mutual-exclusion protocol to ensure that only one UE at a time is executing its critical section. This means that all of the operations on the data are mutually exclusive. Exactly how this is implemented will depend on the facilities of the target programming environment. Typical choices include mutex locks, synchronized blocks, critical sections, and semaphores. These mechanisms are described in the Implementation Mechanisms design space. If the programming language naturally supports the implementation of abstract data types, it is usually appropriate to implement each operation as a procedure or method, with the mutual-exclusion protocol implemented in the method itself.

In a message-passing environment, the most straightforward way to ensure serial execution is to assign the shared data structure to a particular UE. Each operation should correspond to a message type;
other processes request operations by sending messages to the UE managing the data structure, which processes them serially.

In either environment, this approach is usually not difficult to implement, but it can be overly conservative (that is, it might disallow concurrent execution of operations that would be safe to execute simultaneously), and it can produce a bottleneck that negatively affects the performance of the program. If this is the case, the remaining approaches described in this section should be reviewed to see whether one of them can reduce or eliminate this bottleneck and give better performance.

Noninterfering sets of operations. One approach to improving performance begins by analyzing the interference between the operations. We say that operation A interferes with operation B if A writes a variable that B reads. Notice that an operation may interfere with itself, which would be a concern if more than one task executes the same operation (for example, more than one task executes "take" operations on a shared queue). It may be the case, for example, that the operations fall into two disjoint sets, where the operations in different sets do not interfere with each other. In this case, the amount of concurrency can be increased by treating each of the sets as a different critical section. That is, within each set, operations execute one at a time, but operations in different sets can proceed concurrently.
Readers/writers. If there is no obvious way to partition the operations into disjoint sets, consider the type of interference. It may be the case that some of the operations modify the data, but others only read it. For example, if operation A is a writer (both reading and writing the data) and operation B is a reader (reading, but not writing, the data), A interferes with itself and with B, but B does not interfere with itself. Thus, if one task is performing operation A, no other task should be able to execute either A or B, but any number of tasks should be able to execute B concurrently. In such cases, it may be worthwhile to implement a readers/writers protocol that will allow this potential concurrency to be exploited. The overhead of managing the readers/writers protocol is greater than that of simple mutex locks, so the length of the readers' computation should be long enough to make this overhead worthwhile. In addition, there should generally be a larger number of concurrent readers than writers.

The java.util.concurrent package provides read/write locks to support the readers/writers protocol. The code in Fig. 5.32 illustrates how these locks are typically used: First instantiate a ReadWriteLock, and then obtain its read and write locks. ReentrantReadWriteLock is a class that implements the ReadWriteLock interface. To perform a read operation, the read lock must be locked. To perform a write operation, the write lock must be locked. The semantics of the locks are that any number of UEs can simultaneously hold the read lock, but the write lock is exclusive; that is, only one UE can hold the write lock, and if the write lock is held, no UEs can hold the read lock either.
Figure 5.32. Typical use of read/write locks. These locks are defined in the java.util.concurrent.locks package. Putting the unlock in the finally block ensures that the lock will be unlocked regardless of how the try block is exited (normally or with an exception) and is a standard idiom in Java programs that use locks rather than synchronized blocks.
class X {
   ReadWriteLock rw = new ReentrantReadWriteLock();
   // ...

   /* operation A is a writer */
   public void A() throws InterruptedException {
      rw.writeLock().lock();        //lock the write lock
      try {
         // ... do operation A
      }
      finally {
         rw.writeLock().unlock();   //unlock the write lock
      }
   }

   /* operation B is a reader */
   public void B() throws InterruptedException {
      rw.readLock().lock();         //lock the read lock
      try {
         // ... do operation B
      }
      finally {
         rw.readLock().unlock();    //unlock the read lock
      }
   }
}
Readers/writersprotocolsarediscussedin[And00]andmostoperatingsystemstexts. Reducingthesizeofthecriticalsection.Anotherapproachtoimprovingperformancebeginswith analyzingtheimplementationsoftheoperationsinmoredetail.Itmaybethecasethatonlypartofthe operationinvolvesactionsthatinterferewithotheroperations.Ifso,thesizeofthecriticalsectioncan bereducedtothatsmallerpart.Noticethatthissortofoptimizationisveryeasytogetwrong,soit shouldbeattemptedonlyifitwillgivesignificantperformanceimprovementsoversimpler approaches,andtheprogrammercompletelyunderstandstheinterferencesinquestion. Nestedlocks.Thistechniqueisasortofhybridbetweentwoofthepreviousapproaches, noninterferingoperationsandreducingthesizeofthecriticalsection.SupposewehaveanADTwith twooperations.OperationAdoesalotofworkbothreadingandupdatingvariablexandthenreads andupdatesvariableyinasinglestatement.OperationBreadsandwritesy.Someanalysisshows thatUEsexecutingAneedtoexcludeeachother,UEsexecutingBneedtoexcludeeachother,and becausebothoperationsreadandupdatey,technically,AandBneedtomutuallyexcludeeachother aswell.However,closerinspectionshowsthatthetwooperationsarealmostnoninterfering.Ifit weren'tforthatsinglestatementwhereAreadsandupdatesy,thetwooperationscouldbe implementedinseparatecriticalsectionsthatwouldallowoneAandoneBtoexecuteconcurrently.A solutionistousetwolocks,asshowninFig.5.33.AacquiresandholdslockAfortheentire operation.BacquiresandholdslockBfortheentireoperation.AacquireslockBandholdsitonly forthestatementupdatingy. Whenevernestedlockingisused,theprogrammershouldbeawareofthepotentialfordeadlocksand
double-check the code. (The classic example of deadlock, stated in terms of the previous example, is as follows: A acquires lockA and B acquires lockB. A then tries to acquire lockB and B tries to acquire lockA. Neither operation can now proceed.) Deadlocks can be avoided by assigning a partial order to the locks and ensuring that locks are always acquired in an order that respects the partial order. In the previous example, we would define the order to be lockA < lockB and ensure that lockA is never acquired by a UE already holding lockB.

Application-specific semantic relaxation. Yet another approach is to consider partially replicating shared data (the software caching described in [YWC+96]) and perhaps even allowing the copies to be inconsistent if this can be done without affecting the results of the computation. For example, a distributed-memory solution to the phylogeny problem described earlier might give each UE its own copy of the set of sets already rejected and allow these copies to be out of synch; tasks may do extra work (in rejecting a set that has already been rejected by a task assigned to a different UE), but this extra work will not affect the result of the computation, and it may be more efficient overall than the communication cost of keeping all copies in synch.
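This replicate-and-tolerate-staleness idea can be sketched in Java. Everything here (the class name, the two-worker setup, and the stand-in `isViable` test) is illustrative, not code from the phylogeny application; the point is only that when the rejection test is idempotent, each worker can keep a private, possibly out-of-sync copy of the rejected set, duplicating some work but never changing the answer.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of application-specific semantic relaxation: each worker keeps
// its own copy of the "already rejected" set. The copies are never kept
// in synch; a candidate rejected by one worker may be re-tested by
// another. That extra work is harmless because the test is idempotent.
public class RelaxedReplication {
    // Stand-in for an expensive viability test (illustrative only).
    static boolean isViable(int candidate) { return candidate % 7 != 0; }

    public static Set<Integer> rejectedAfterMerge(int numWorkers, int numCandidates)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        List<Callable<Set<Integer>>> workers = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            workers.add(() -> {
                // Private, possibly stale copy of the rejected set.
                Set<Integer> rejectedLocal = new HashSet<>();
                for (int c = 0; c < numCandidates; c++) {
                    if (!rejectedLocal.contains(c) && !isViable(c)) {
                        rejectedLocal.add(c); // may duplicate another worker's test
                    }
                }
                return rejectedLocal;
            });
        }
        // Merging the lagging copies yields the same set a single
        // globally synchronized copy would have produced.
        Set<Integer> merged = new HashSet<>();
        for (Future<Set<Integer>> f : pool.invokeAll(workers)) merged.addAll(f.get());
        pool.shutdown();
        return merged;
    }
}
```

Both workers independently reject the same candidates, so the merged result is identical to what a single shared set would contain, without any synchronization on the hot path.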
Figure 5.33. Example of nested locking using synchronized blocks with dummy objects lockA and lockB

class Y {
    Object lockA = new Object();
    Object lockB = new Object();

    void A() {
        synchronized(lockA) {
            ....compute....
            synchronized(lockB) {
                ....read and update y....
            }
        }
    }

    void B() throws InterruptedException {
        synchronized(lockB) {
            ....compute....
        }
    }
}
Review other considerations

Memory synchronization. Make sure memory is synchronized as required: Caching and compiler optimizations can result in unexpected behavior with respect to shared variables. For example, a stale value of a variable might be read from a cache or register instead of the newest value written by another task, or the latest value might not have been flushed to memory and thus would not be visible to other tasks. In most cases, memory synchronization is performed implicitly by higher-level synchronization primitives, but it is still necessary to be aware of the issue. Unfortunately, memory synchronization techniques are very platform-specific. In OpenMP, the flush directive can be used to synchronize memory explicitly; it is implicitly invoked by several other directives. In Java, memory is
implicitly synchronized when entering and leaving a synchronized block, and, in Java 2 1.5, when locking and unlocking locks. Also, variables marked volatile are implicitly synchronized with respect to memory. This is discussed in more detail in the Implementation Mechanisms design space.

Task scheduling. Consider whether the explicitly managed data dependencies addressed by this pattern affect task scheduling. A key goal in deciding how to schedule tasks is good load balance; in addition to the considerations described in the Algorithm Structure pattern being used, one should also take into account that tasks might be suspended waiting for access to shared data. It makes sense to try to assign tasks in a way that minimizes such waiting, or to assign multiple tasks to each UE in the hope that there will always be one task per UE that is not waiting for access to shared data.

Examples
Shared queues
Consider the GAFORT program from the SPEC OMP2001 benchmark suite [ADE+01]. GAFORT is a small Fortran program (around 1,500 lines) that implements a genetic algorithm for nonlinear optimization. The calculations are predominantly integer arithmetic, and the program's performance is dominated by the cost of moving large arrays of data through the memory subsystem.

The details of the genetic algorithm are not important for this discussion. We are going to focus on a single loop within GAFORT. Pseudocode for the sequential version of this loop, based on the discussion of GAFORT in [EM], is shown in Fig. 5.34. This loop shuffles the population of chromosomes and consumes on the order of 36 percent of the runtime in a typical GAFORT job [AE03].
Figure 5.34. Pseudocode for the population shuffle loop from the genetic algorithm program GAFORT

Int const NPOP      // number of chromosomes (~40000)
Int const NCHROME   // length of each chromosome

Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int  :: iparent(NCHROME, NPOP)
Array of Real :: fitness(NPOP)
Int :: j, iother

loop [j] over NPOP
   iother = rand(j)   // returns random value greater
                      // than or equal to zero but not
                      // equal to j and less than NPOP

   // Swap Chromosomes
   temp(1:NCHROME) = iparent(1:NCHROME, iother)
   iparent(1:NCHROME, iother) = iparent(1:NCHROME, j)
   iparent(1:NCHROME, j) = temp(1:NCHROME)

   // Swap fitness metrics
   tempScalar = fitness(iother)
   fitness(iother) = fitness(j)
   fitness(j) = tempScalar
end loop [j]
A parallel version of this program will be created by parallelizing the loop, using the Loop Parallelism pattern. In this example, the shared data consists of the iparent and fitness arrays. Within the body of the loop, calculations involving these arrays consist of swapping two elements of iparent and then swapping the corresponding elements of fitness. Examination of these operations shows that two swap operations interfere when at least one of the locations being swapped is the same in both operations.

Thinking about the shared data as an ADT helps us to identify and analyze the actions taken on the shared data. This does not mean, however, that the implementation itself always needs to reflect this structure. In some cases, especially when the data structure is simple and the programming language does not support ADTs well, it can be more effective to forgo the encapsulation implied in an ADT and work with the data directly. This example illustrates this.

As mentioned earlier, the chromosomes being swapped might interfere with each other; thus the loop over j cannot safely execute in parallel. The most straightforward approach is to enforce a "one at a time" protocol using a critical section, as shown in Fig. 5.35. It is also necessary to modify the random number generator so it produces a consistent set of pseudorandom numbers when called in parallel by many threads. The algorithms to accomplish this are well understood [Mas97], but will not be discussed here.
Figure 5.35. Pseudocode for an ineffective approach to parallelizing the population shuffle in the genetic algorithm program GAFORT
#include <omp.h>

Int const NPOP      // number of chromosomes (~40000)
Int const NCHROME   // length of each chromosome

Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int  :: iparent(NCHROME, NPOP)
Array of Real :: fitness(NPOP)
Int :: j, iother

#pragma omp parallel for
loop [j] over NPOP
   iother = par_rand(j)  // returns random value greater
                         // than or equal to zero but not
                         // equal to j and less than NPOP
#pragma omp critical
   {
      // Swap Chromosomes
      temp(1:NCHROME) = iparent(1:NCHROME, iother)
      iparent(1:NCHROME, iother) = iparent(1:NCHROME, j)
      iparent(1:NCHROME, j) = temp(1:NCHROME)

      // Swap fitness metrics
      tempScalar = fitness(iother)
      fitness(iother) = fitness(j)
      fitness(j) = tempScalar
   }
end loop [j]
The approach used in [ADE+01] is to create an OpenMP lock for each chromosome. Pseudocode for this solution is shown in Fig. 5.36. In the resulting program, most of the loop iterations do not actually interfere with each other. The total number of chromosomes, NPOP (40,000 in the SPEC OMP2001 benchmark), is much larger than the number of UEs, so there is only a slight chance that loop iterations will happen to interfere with another loop iteration.

OpenMP locks are described in the OpenMP appendix, Appendix A. The locks themselves use an opaque type, omp_lock_t, defined in the omp.h header file. The lock array is defined and later initialized in a separate parallel loop. Once inside the chromosome-swapping loop, the locks are set for the pair of swapping chromosomes, the swap is carried out, and the locks are unset. Nested locks are being used, so the possibility of deadlock must be considered. The solution here is to order the locks using the value of the indices of the array element associated with the lock. Always acquiring locks in this order will prevent deadlock when a pair of loop iterations happen to be swapping the same two elements at the same time. After the more efficient concurrency-control protocol is implemented, the program runs well in parallel.
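The same one-lock-per-element, lower-index-first discipline can be sketched in Java using an array of ReentrantLock objects. This is an illustrative analogue, not the GAFORT code (which uses OpenMP locks in Fortran); the class name and the integer payload are stand-ins for the chromosome and fitness data.

```java
import java.util.concurrent.locks.ReentrantLock;

// One lock per array element; pairs of elements are swapped under the
// two corresponding locks. Locks are always acquired in index order
// (the partial order lck(lo) < lck(hi)), which prevents deadlock when
// two threads swap overlapping pairs concurrently.
public class OrderedPairLocks {
    private final ReentrantLock[] lock;
    private final int[] data;  // stand-in for the per-chromosome data

    public OrderedPairLocks(int n) {
        lock = new ReentrantLock[n];
        data = new int[n];
        for (int i = 0; i < n; i++) {
            lock[i] = new ReentrantLock();
            data[i] = i;
        }
    }

    // Swap elements i and j, locking the lower index first.
    public void swap(int i, int j) {
        int lo = Math.min(i, j), hi = Math.max(i, j);
        lock[lo].lock();
        try {
            lock[hi].lock();
            try {
                int t = data[i];
                data[i] = data[j];
                data[j] = t;
            } finally { lock[hi].unlock(); }
        } finally { lock[lo].unlock(); }
    }

    public int get(int i) { return data[i]; }
}
```

Because every thread respects the same ordering, a cycle of threads each holding one lock and waiting for the other cannot form, exactly as argued for the lck(j)/lck(iother) ordering above.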
Known uses
combinations of elements of the master set to the master set (where they are used to generate new pairs). Different pairs can be processed concurrently, so one can define a task for each pair and
Figure 5.36. Pseudocode for a parallelized population shuffle using one lock per chromosome

#include <omp.h>

Int const NPOP      // number of chromosomes (~40000)
Int const NCHROME   // length of each chromosome
Array of omp_lock_t :: lck(NPOP)

Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int  :: iparent(NCHROME, NPOP)
Array of Real :: fitness(NPOP)
Int :: j, iother

// Initialize the locks
#pragma omp parallel for
for (j=0; j<NPOP; j++){
   omp_init_lock(&lck(j))
}

#pragma omp parallel for
for (j=0; j<NPOP; j++){
   iother = par_rand(j)  // returns random value >= 0, != j,
                         // < NPOP
   if (j < iother) {
      omp_set_lock(&lck(j));  omp_set_lock(&lck(iother))
   }
   else {
      omp_set_lock(&lck(iother));  omp_set_lock(&lck(j))
   }

   // Swap Chromosomes
   temp(1:NCHROME) = iparent(1:NCHROME, iother);
   iparent(1:NCHROME, iother) = iparent(1:NCHROME, j);
   iparent(1:NCHROME, j) = temp(1:NCHROME);

   // Swap fitness metrics
   tempScalar = fitness(iother)
   fitness(iother) = fitness(j)
   fitness(j) = tempScalar

   if (j < iother) {
      omp_unset_lock(&lck(iother));  omp_unset_lock(&lck(j))
   }
   else {
      omp_unset_lock(&lck(j));  omp_unset_lock(&lck(iother))
   }
} // end loop [j]
* Simple concurrency-control protocols provide greater clarity of abstraction and make it easier for the programmer to verify that the shared queue has been correctly implemented.
* Concurrency-control protocols that encompass too much of the shared queue in a single synchronization construct increase the chances UEs will remain blocked waiting to access the queue and will limit available concurrency.
* A concurrency-control protocol finely tuned to the queue and how it will be used increases the available concurrency, but at the cost of much more complicated, and more error-prone, synchronization constructs.
* Maintaining a single queue for systems with complicated memory hierarchies (as found on NUMA machines and clusters) can cause excess communication and increase parallel overhead. Solutions may in some cases need to break with the single-queue abstraction and use multiple or distributed queues.
java.util.concurrent package. Here we develop implementations from scratch to illustrate the concepts.

Implementing shared queues can be tricky. Appropriate synchronization must be utilized to avoid race conditions, and performance considerations, especially for problems where large numbers of UEs access the queue, can require sophisticated synchronization. In some cases, a noncentralized queue might be needed to eliminate performance bottlenecks.

However, if it is necessary to implement a shared queue, it can be done as an instance of the Shared Data pattern: First, we design an ADT for the queue by defining the values the queue can hold and the set of operations on the queue. Next, we consider the concurrency-control protocols, starting with the simplest "one at a time execution" solution and then applying a series of refinements. To make this discussion more concrete, we will consider the queue in terms of a specific problem: a queue to hold tasks in a master/worker algorithm. The solutions presented here, however, are general and can be easily extended to cover other applications of a shared queue.
The abstract data type (ADT)
An ADT is a set of values and the operations defined on that set of values. In the case of a queue, the values are ordered lists of zero or more objects of some type (for example, integers or task IDs). The operations on the queue are put (or enqueue) and take (or dequeue). In some situations, there might be other operations, but for the sake of this discussion, these two are sufficient.

We must also decide what happens when a take is attempted on an empty queue. What should be done depends on how termination will be handled by the master/worker algorithm. Suppose, for example, that all the tasks will be created at startup time by the master. In this case, an empty task queue will indicate that the UE should terminate, and we will want the take operation on an empty queue to return immediately with an indication that the queue is empty; that is, we want a nonblocking queue. Another possible situation is that tasks can be created dynamically and that UEs will terminate when they receive a special poison pill task. In this case, appropriate behavior might be for the take operation on an empty queue to wait until the queue is nonempty; that is, we want a block-on-empty queue.

Queue with "one at a time" execution

Nonblocking queue. Because the queue will be accessed concurrently, we must define a concurrency-control protocol to ensure that interference by multiple UEs will not occur. As recommended in the Shared Data pattern, the simplest solution is to make all operations on the ADT exclude each other. Because none of the operations on the queue can block, a straightforward implementation of mutual exclusion as described in the Implementation Mechanisms design space suffices. The Java implementation shown in Fig. 5.37 uses a linked list to hold the tasks in the queue. (We develop our own list class rather than using an unsynchronized library class such as java.util.LinkedList or a class from the java.util.concurrent package to illustrate how to add appropriate synchronization.) head refers to an always-present dummy node.[6]
[6] The code for take makes the old head node into a dummy node rather than simply manipulating next pointers to allow us to later optimize the code so that put and take can execute concurrently.

The first task in the queue (if any) is held in the node referred to by head.next. The isEmpty method is private, and only invoked inside a synchronized method. Thus, it need not be synchronized. (If it were public, it would need to be synchronized as well.) Of course, numerous ways of implementing the structure that holds the tasks are possible.

Block-on-empty queue. The second version of the shared queue is shown in Fig. 5.38. In this version of the queue, the take operation is changed so that a thread trying to take from an empty queue will wait for a task rather than returning immediately. The waiting thread needs to release its lock and reacquire it before trying again. This is done in Java using the wait and notify methods. These are described in the Java appendix, Appendix C. The Java appendix also shows the queue implemented using locks from the java.util.concurrent.locks package introduced in Java 2 1.5 instead of wait and notify. Similar primitives are available with POSIX threads (Pthreads) [But97, IEE], and techniques for implementing this functionality with semaphores and other basic primitives can be found in [And00].
Figure 5.37. Queue that ensures that at most one thread can access the data structure at one time. If the queue is empty, null is immediately returned.
public class SharedQueue1 {
    class Node  // inner class defines list nodes
    {
        Object task;
        Node next;
        Node(Object task) {this.task = task; next = null;}
    }

    private Node head = new Node(null);  // dummy node
    private Node last = head;

    public synchronized void put(Object task) {
        assert task != null: "Cannot insert null task";
        Node p = new Node(task);
        last.next = p;
        last = p;
    }

    public synchronized Object take() {
        // returns first task in queue or null if queue is empty
        Object task = null;
        if (!isEmpty()) {
            Node first = head.next;
            task = first.task;
            first.task = null;
            head = first;
        }
        return task;
    }

    private boolean isEmpty(){return head.next == null;}
}
In general, to change a method that returns immediately if a condition is false to one that waits until the condition is true, two changes need to be made: First, we replace a statement of the form
if (condition){do_something;}
Figure 5.38. Queue that ensures at most one thread can access the data structure at one time. Unlike the first shared queue example, if the queue is empty, the thread waits. When used in a master/worker algorithm, a poison pill would be required to signal termination to a thread.
public class SharedQueue2 {
    class Node
    {
        Object task;
        Node next;
        Node(Object task) {this.task = task; next = null;}
    }

    private Node head = new Node(null);
    private Node last = head;

    public synchronized void put(Object task) {
        assert task != null: "Cannot insert null task";
        Node p = new Node(task);
        last.next = p;
        last = p;
        notifyAll();
    }

    public synchronized Object take() {
        // returns first task in queue, waits if queue is empty
        Object task = null;
        while (isEmpty())
            {try{wait();}catch(InterruptedException ignore){}}
        {
            Node first = head.next;
            task = first.task;
            first.task = null;
            head = first;
        }
        return task;
    }

    private boolean isEmpty(){return head.next == null;}
}
with a loop[7]
[7]
The fact that wait can throw an InterruptedException must be dealt with; it is ignored here for clarity, but handled properly in the code examples.
while (!condition) {wait();} do_something;
with
while (isEmpty()) {try{wait();}catch(InterruptedException ignore){}}{....}
and, second, we add to any operation that can make the condition true a notification such as
if (w>0) notifyAll();
If the performance of the shared queue is inadequate, we must look for more efficient concurrency-control protocols.
public class SharedQueue3 {
    class Node
    {
        Object task;
        Node next;
        Node(Object task) {this.task = task; next = null;}
    }

    private Node head = new Node(null);
    private Node last = head;
    private Object putLock = new Object();
    private Object takeLock = new Object();

    public void put(Object task) {
        synchronized(putLock) {
            assert task != null: "Cannot insert null task";
            Node p = new Node(task);
            last.next = p;
            last = p;
        }
    }

    public Object take() {
        Object task = null;
        synchronized(takeLock) {
            if (!isEmpty()) {
                Node first = head.next;
                task = first.task;
                first.task = null;
                head = first;
            }
        }
        return task;
    }

    private boolean isEmpty(){return head.next == null;}
}
public class SharedQueue4 {
    class Node
    {
        Object task;
        Node next;
        Node(Object task) {this.task = task; next = null;}
    }

    Node head = new Node(null);
    Node last = head;
    int w;  // number of threads waiting in take
    Object putLock = new Object();
    Object takeLock = new Object();

    public void put(Object task) {
        synchronized(putLock) {
            assert task != null: "Cannot insert null task";
            Node p = new Node(task);
            last.next = p;
            last = p;
            if (w>0){putLock.notify();}
        }
    }

    public Object take() {
        Object task = null;
        synchronized(takeLock) {
            // returns first task in queue, waits if queue is empty
            while (isEmpty()) {
                try{synchronized(putLock){w++; putLock.wait(); w--;}
                } catch(InterruptedException error){assert false;}
            }
            {
                Node first = head.next;
                task = first.task;
                first.task = null;
                head = first;
            }
        }
        return task;
    }

    private boolean isEmpty(){return head.next == null;}
}
Another issue to note is that this solution has nested synchronized blocks in both take and put. Nested synchronized blocks should always be examined for potential deadlocks. In this case, there will be no deadlock because put only acquires one lock, putLock. More generally, we would define a partial order over all the locks and ensure that the locks are always acquired in an order consistent with our partial order. For example, here, we could define takeLock < putLock and make sure that the synchronized blocks are entered in a way that respects that partial order.

As mentioned earlier, several Java-based implementations of queues are included in Java 2 1.5 in the java.util.concurrent package, some based on the simple strategies discussed here and some based on more complex strategies that provide additional flexibility and performance.
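As a point of comparison, the two behaviors developed above correspond directly to real java.util.concurrent classes: poll() on a ConcurrentLinkedQueue is nonblocking and returns null when the queue is empty (like SharedQueue1's take), while take() on a LinkedBlockingQueue blocks until an element is available (like SharedQueue2's take). A minimal sketch:

```java
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Library analogues of the hand-built queues in this section.
public class LibraryQueues {
    // Nonblocking behavior: poll() on an empty queue returns null
    // immediately, mirroring SharedQueue1.take().
    public static String pollEmpty() {
        Queue<String> q = new ConcurrentLinkedQueue<>();
        return q.poll();
    }

    // Blocking behavior: take() waits until an element is available,
    // mirroring SharedQueue2.take(). Here an element is inserted first,
    // so take() returns at once.
    public static String putThenTake(String task) throws InterruptedException {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        q.put(task);
        return q.take();
    }
}
```

In production code these library classes are usually preferable to hand-rolled queues; the implementations in this section exist to show how such synchronization is built.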
Distributed shared queues
Task is an abstract class. Applications extend it and override its run method to indicate the functionality of a task in the computation. Methods offered by the class include fork and join.

TaskRunner extends Thread and provides the functionality of the threads in the thread pool. Each instance contains a shared task queue. The task-stealing code is in this class.
Figure 5.41. Nonblocking shared queue with takeLast operation
public class SharedQueue5 {
    class Node
    {
        Object task;
        Node next;
        Node prev;
        Node(Object task, Node prev)
        {this.task = task; next = null; this.prev = prev;}
    }

    private Node head = new Node(null, null);
    private Node last = head;

    public synchronized void put(Object task) {
        assert task != null: "Cannot insert null task";
        Node p = new Node(task, last);
        last.next = p;
        last = p;
    }

    public synchronized Object take() {
        // returns first task in queue or null if queue is empty
        Object task = null;
        if (!isEmpty()) {
            Node first = head.next;
            task = first.task;
            first.task = null;
            head = first;
        }
        return task;
    }

    public synchronized Object takeLast() {
        // returns last task in queue or null if queue is empty
        Object task = null;
        if (!isEmpty()) {
            task = last.task;
            last = last.prev;
            last.next = null;
        }
        return task;
    }

    private boolean isEmpty(){return head.next == null;}
}
TaskRunnerGroup manages the TaskRunners. It contains methods to initialize and shut down the thread pool.
Figure 5.42. The abstract Task class

public abstract class Task implements Runnable {
    // done indicates whether the task is finished
    private volatile boolean done;
    public final void setDone(){done = true;}
    public boolean isDone(){return done;}

    // returns the currently executing TaskRunner thread
    public static TaskRunner getTaskRunner()
    { return (TaskRunner)Thread.currentThread(); }

    // push this task on the local queue of current thread
    public void fork()
    { getTaskRunner().put(this); }

    // wait until this task is done
    public void join()
    { getTaskRunner().taskJoin(this); }

    // execute the run method of this task
    public void invoke()
    { if (!isDone()){run(); setDone();} }
}
We will now discuss these classes in more detail. Task is shown in Fig. 5.42. The only state associated with the abstract class is done, which is marked volatile to ensure that any thread that tries to access it will obtain a fresh value.

The TaskRunner class is shown in Fig. 5.43, Fig. 5.44, and Fig. 5.45. The thread, as specified in the run method, loops until the poison task is encountered. First it tries to obtain a task from the back of its local queue. If the local queue is empty, it attempts to steal a task from the front of a queue belonging to another thread.

The code for the TaskRunnerGroup class is shown in Fig. 5.46. The constructor for TaskRunnerGroup initializes the thread pool, given the number of threads as a parameter. Typically, this value would be chosen to match the number of processors in the system. The executeAndWait method starts a task by placing it in the task queue of thread 0. One use for this method is to get a computation started. Something like this is needed because we can't
Figure 5.43. Class defining behavior of threads in the thread pool (continued in Fig. 5.44)

import java.util.*;

class TaskRunner extends Thread {
    private final TaskRunnerGroup g;          // managing group
    private final Random chooseToStealFrom;   // random number generator
    private final Task poison;                // poison task
    protected volatile boolean active;        // state of thread
    final int id;                             // index of task in the TaskRunnerGroup
    private final SharedQueue5 q;             // nonblocking shared queue

    // operations relayed to queue
    public void put(Task t){q.put(t);}
    public Task take(){return (Task)q.take();}
    public Task takeLast(){return (Task)q.takeLast();}

    // constructor
    TaskRunner(TaskRunnerGroup g, int id, Task poison) {
        this.g = g;
        this.id = id;
        this.poison = poison;
        chooseToStealFrom = new Random(System.identityHashCode(this));
        setDaemon(true);
        q = new SharedQueue5();
    }

    protected final TaskRunnerGroup getTaskRunnerGroup(){return g;}
    protected final int getID(){return id;}

    /* continued in next figure */
All of this may be clearer from the usage of fork, join, and executeAndWait in the Fibonacci example in the Examples section.
Figure 5.44. Class defining behavior of threads in the thread pool (continued from Fig. 5.43 and continued in Fig. 5.45)
/* continued from previous figure */

    // Attempts to steal a task from another thread. First chooses a
    // random victim, then continues with other threads until either
    // a task has been found or all have been checked. If a task
    // is found, it is invoked. The parameter waitingFor is a task
    // on which this thread is waiting for a join. If steal is not
    // called as part of a join, use waitingFor = null.
    void steal(final Task waitingFor) {
        Task task = null;
        TaskRunner[] runners = g.getRunners();
        int victim = chooseToStealFrom.nextInt(runners.length);
        for (int i = 0; i != runners.length; ++i) {
            TaskRunner tr = runners[victim];
            if (waitingFor != null && waitingFor.isDone()){break;}
            else {
                if (tr != null && tr != this) task = (Task)tr.q.take();
                if (task != null) {break;}
                yield();
                victim = (victim + 1)%runners.length;
            }
        }
        // have either found a task or have checked all other queues;
        // if we have a task, invoke it
        if (task != null && !task.isDone()) {
            task.invoke();
        }
    }

/* continued in next figure */
Examples
Computing Fibonacci numbers
We show in Fig. 5.47 and Fig. 5.48 code that uses our distributed queue package.[8] Recall that the Fibonacci numbers are defined by

Equation 5.7. fib(0) = 0

Equation 5.8. fib(1) = 1

Equation 5.9. fib(n) = fib(n-1) + fib(n-2), for n > 1
Figure 5.45. Class defining behavior of threads in the thread pool (continued from Fig. 5.44)

/* continued from previous figure */

    // Main loop of thread. First attempts to find a task on local
    // queue and execute it. If not found, then tries to steal a task
    // from another thread. Performance may be improved by modifying
    // this method to back off using sleep or lowered priorities if the
    // thread repeatedly iterates without finding a task. The run
    // method, and thus the thread, terminates when it retrieves the
    // poison task from the task queue.
    public void run() {
        Task task = null;
        try {
            while (!poison.equals(task)) {
                task = (Task)q.takeLast();
                if (task != null) {
                    if (!task.isDone()){task.invoke();}
                }
                else { steal(null); }
            }
        }
        finally { active = false; }
    }

    // Looks for another task to run and continues when Task w is done.
    protected final void taskJoin(final Task w) {
        while (!w.isDone()) {
            Task task = (Task)q.takeLast();
            if (task != null) {
                if (!task.isDone()){ task.invoke(); }
            }
            else { steal(w); }
        }
    }
}
The run method defines the behavior of each task. Recursive parallel decomposition is done by creating a new Fib object for each subtask, invoking the fork method on each subtask to start their computation, calling the join method for each subtask to wait for the subtasks to complete, and then computing the sum of their results.

The main method drives the computation. It first reads procs (the number of threads to create), num (the value for which the Fibonacci number should be computed), and optionally the sequentialThreshold. The value of this last, optional parameter (the default is 0) is used to decide when the problem is too small to bother with a parallel decomposition and should therefore use a sequential algorithm. After these parameters have been obtained, the main method creates a TaskRunnerGroup with the indicated number of threads, and then creates a Fib object, initialized with num. The computation is initiated by passing the Fib object to the TaskRunnerGroup's executeAndWait method. When this returns, the computation is finished. The thread pool is shut down with the TaskRunnerGroup's cancel method. Finally, the result is retrieved from the Fib object and displayed.
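The fork/join structure described here closely parallels what the standard ForkJoinPool framework (added to java.util.concurrent in Java 7) provides out of the box, including the work-stealing deques that the TaskRunner classes build by hand. As a point of comparison, a minimal RecursiveTask version of the same recursive decomposition might look like this:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// The same recursive decomposition written against the standard
// fork/join framework: fork() pushes a subtask onto the current
// worker's deque, join() waits for (or steals work toward) its result.
public class FibFJ extends RecursiveTask<Integer> {
    final int n;
    FibFJ(int n) { this.n = n; }

    @Override
    protected Integer compute() {
        if (n <= 1) return n;            // base cases: fib(0)=0, fib(1)=1
        FibFJ f1 = new FibFJ(n - 1);
        FibFJ f2 = new FibFJ(n - 2);
        f1.fork();                       // run f1 asynchronously ...
        int r2 = f2.compute();           // ... while computing f2 directly
        return f1.join() + r2;           // combine results
    }

    public static int fib(int n) {
        return new ForkJoinPool().invoke(new FibFJ(n));
    }
}
```

A real program would also apply a sequential-threshold cutoff, as the Fib example does, since spawning a task per recursive call is far more expensive than the additions being performed.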
Figure 5.46. The TaskRunnerGroup class. This class initializes and manages the threads in the thread pool.
class TaskRunnerGroup {
    protected final TaskRunner[] threads;
    protected final int groupSize;
    protected final Task poison;

    public TaskRunnerGroup(int groupSize) {
        this.groupSize = groupSize;
        threads = new TaskRunner[groupSize];
        poison = new Task(){public void run(){assert false;}};
        poison.setDone();
        for (int i = 0; i != groupSize; i++)
            {threads[i] = new TaskRunner(this,i,poison);}
        for (int i = 0; i != groupSize; i++){ threads[i].start(); }
    }

    // start executing task t and wait for its completion.
    // The wrapper task is used in order to start t from within
    // a Task (thus allowing fork and join to be used)
    public void executeAndWait(final Task t) {
        final TaskRunnerGroup thisGroup = this;
        Task wrapper = new Task() {
            public void run() {
                t.fork();
                t.join();
                synchronized(thisGroup){ thisGroup.notify(); }
            }
        };
        // add wrapped task to queue of thread[0]
        threads[0].put(wrapper);
        // wait for notification that t has finished.
        synchronized(thisGroup) {
            try{thisGroup.wait();}
            catch(InterruptedException e){return;}
        }
    }

    // cause all threads to terminate. The programmer is responsible
    // for ensuring that the computation is complete.
    public void cancel() {
        for (int i = 0; i != groupSize; i++) {
            threads[i].put(poison);
        }
    }

    public TaskRunner[] getRunners(){return threads;}
}
Figure 5.47. Program to compute Fibonacci numbers (continued in Fig. 5.48)

public class Fib extends Task {
    volatile int number;  // number holds value to compute initially,
                          // after computation is replaced by answer

    Fib(int n) { number = n; }  // task constructor, initializes number

    // behavior of task
    public void run() {
        int n = number;
        // Handle base cases:
        if (n <= 1) { /* Do nothing: fib(0) = 0; fib(1) = 1 */ }
        // Use sequential code for small problems:
        else if (n <= sequentialThreshold) { number = seqFib(n); }
        // Otherwise use recursive parallel decomposition:
        else {
            // Construct subtasks:
            Fib f1 = new Fib(n - 1);
            Fib f2 = new Fib(n - 2);
            // Run them in parallel:
            f1.fork(); f2.fork();
            // Await completion:
            f1.join(); f2.join();
            // Combine results:
            number = f1.number + f2.number;
            // (We know numbers are ready, so directly access them.)
        }
    }

    // Sequential version for arguments less than threshold
    static int seqFib(int n) {
        if (n <= 1) return n;
        else return seqFib(n-1) + seqFib(n-2);
    }

    // method to retrieve answer after checking to make sure
    // computation has finished; note that done and isDone are
    // inherited from the Task class. done is set by the executing
    // (TaskRunner) thread when the run method is finished.
    int getAnswer() {
        if (!isDone()) throw new Error("Not yet computed");
        return number;
    }

    /* continued in next figure */
Note that when the tasks in a task queue map onto a consecutive sequence of integers, a monotonic shared counter, which would be much more efficient, can be used in place of a queue.
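A sketch of that counter-as-queue idea (the class name and the -1 convention are illustrative): an AtomicInteger hands out the task indices 0..N-1 exactly once across all threads, with -1 playing the role of the null returned by a nonblocking take on an empty queue.

```java
import java.util.concurrent.atomic.AtomicInteger;

// When tasks are just the integers 0..numTasks-1, an atomic counter
// replaces the shared queue: getAndIncrement() is a single hardware-
// supported atomic operation, far cheaper than a locked linked list.
public class CounterQueue {
    private final AtomicInteger next = new AtomicInteger(0);
    private final int numTasks;

    public CounterQueue(int numTasks) { this.numTasks = numTasks; }

    // Returns the next task index, or -1 when all tasks have been
    // handed out (the analogue of take() returning null when empty).
    public int take() {
        int t = next.getAndIncrement();
        return (t < numTasks) ? t : -1;
    }
}
```

Each index is returned to exactly one caller, no matter how many threads call take() concurrently, which is precisely the property the locked queue was providing.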
Figure 5.48. Program to compute Fibonacci numbers (continued from Fig. 5.47)
/* continued from previous figure */

    // Performance-tuning constant; sequential algorithm is used to
    // find Fibonacci numbers for values <= this threshold
    static int sequentialThreshold = 0;

    public static void main(String[] args) {
        int procs;  // number of threads
        int num;    // Fibonacci number to compute
        try {  // read parameters from command line
            procs = Integer.parseInt(args[0]);
            num = Integer.parseInt(args[1]);
            if (args.length > 2)
                sequentialThreshold = Integer.parseInt(args[2]);
        }
        catch (Exception e) {
            System.out.println("Usage: java Fib <threads> <number> "+
                               "[<sequentialThreshold>]");
            return;
        }
        // initialize thread pool
        TaskRunnerGroup g = new TaskRunnerGroup(procs);
        // create first task
        Fib f = new Fib(num);
        // execute it
        g.executeAndWait(f);
        // computation has finished, shutdown thread pool
        g.cancel();
        // show result
        long result;
        {result = f.getAnswer();}
        System.out.println("Fib: Size: " + num + " Answer: " + result);
    }
}
usually a system in which access times vary substantially depending on which PE is accessing which array element.

The challenge is to organize the arrays so that the elements needed by each UE are nearby at the right time in the computation. In other words, the arrays must be distributed about the computer so that the array distribution matches the flow of the computation.

This pattern is important for any parallel algorithm involving large arrays. It is particularly important when the algorithm uses the Geometric Decomposition pattern for its algorithm structure and the SPMD pattern for its program structure. Although this pattern is in some respects specific to distributed-memory environments in which global data structures must be somehow distributed among the ensemble of PEs, some of the ideas of this pattern apply if the single address space is implemented on a NUMA platform, in which all PEs have access to all memory locations, but access time varies. For such platforms, it is not necessary to explicitly decompose and distribute arrays, but it is still important to manage the memory hierarchy so that array elements stay close[9] to the PEs that need them. Because of this, on NUMA machines, MPI programs can sometimes outperform similar algorithms implemented using a native multithreaded API. Further, the ideas of this pattern can be used with a multithreaded API to keep memory pages close to the processors that will work with them. For example, if the target system uses a first-touch page management scheme, efficiency is improved if every array element is initialized by the PE that will be working with it. This strategy, however, breaks down if arrays need to be remapped in the course of the computation.
* Load balance. Because a parallel computation is not finished until all UEs complete their work, the computational load among the UEs must be distributed so each UE takes nearly the same time to compute.
* Effective memory management. Modern microprocessors are much faster than the computer's memory. To address this problem, high-performance computer systems include complex memory hierarchies. Good performance depends on making good use of this memory hierarchy, and this is done by ensuring that the memory references implied by a series of calculations are close to the processor making the calculation (that is, data reuse from the caches is high and needed pages stay accessible to the processor).
* Clarity of abstraction. Programs involving distributed arrays are easier to write, debug, and maintain if it is clear how the arrays are divided among UEs and mapped to local arrays.
Solution
Overview
The solution is simple to state at a high level; it is the details that make it complicated. The basic
approach is to partition the global array into blocks and then map those blocks onto the UEs. This mapping onto UEs should be done so that, as the computation unfolds, each UE has an equal amount of work to carry out (that is, the load must be well balanced). Unless all UEs share a single address space, each UE's blocks will be stored in an array that is local to a single UE. Thus, the code will access elements of the distributed array using indices into a local array. The mathematical description of the problem and solution, however, is based on indices into the global array. Thus, it must be clear how to move back and forth between two views of the array, one in which each element is referenced by global indices and one in which it is referenced by a combination of local indices and UE identifier. Making these translations clear within the text of the program is the challenge of using this pattern effectively.
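The translation between the two views can be made concrete for the simple case of a 1D block distribution. This sketch is illustrative (the class and method names are not from the pattern): with block size MB = ⌈M/P⌉, global index j lives on UE ⌊j/MB⌋ at local offset j mod MB, and the inverse map recovers the global index from (UE, local offset).

```java
// Global <-> (UE, local) index translation for a 1D block distribution
// of M elements over P UEs. All indices are zero-based.
public class BlockMap {
    final int M, P, MB;  // global size, number of UEs, block size

    BlockMap(int M, int P) {
        this.M = M;
        this.P = P;
        this.MB = (M + P - 1) / P;  // ceil(M/P) in integer arithmetic
    }

    int owner(int j)      { return j / MB; }   // which UE holds global index j
    int localIndex(int j) { return j % MB; }   // offset within that UE's block

    // Inverse translation: global index from UE identifier + local offset.
    int globalIndex(int ue, int local) { return ue * MB + local; }
}
```

Keeping these three helpers in one place, rather than scattering the arithmetic through the program text, is one way to meet the "make the translations clear" requirement above.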
Array distributions
Over the years, a small number of array distributions have become standard.
* One-dimensional (1D) block. The array is decomposed in one dimension only and distributed one block per UE. For a 2D matrix, for example, this corresponds to assigning a single block of contiguous rows or columns to each UE. This distribution is sometimes called a column block or row block distribution depending on which single dimension is distributed among the UEs. The UEs are conceptually organized as a 1D array.

* Two-dimensional (2D) block. As in the 1D block case, one block is assigned to each UE, but now the block is a rectangular subblock of the original global array. This mapping views the collection of UEs as a 2D array.

* Block-cyclic. The array is decomposed into blocks (using a 1D or 2D partition) such that there are more blocks than UEs. These blocks are then assigned round-robin to UEs, analogous to the way a deck of cards is dealt out. The UEs may be viewed as either a 1D or 2D array.
Next, we explore these distributions in more detail. For illustration, we use a square matrix A of order 8, as shown in Fig. 5.49.[10]
[10]
In this and the other figures in this pattern, we will use the following notational conventions: A matrix element will be represented as a lowercase letter with subscripts representing indices; for example, a1,2 is the element in row 1 and column 2 of matrix A. A submatrix will be represented as an uppercase letter with subscripts representing indices; for example, A0,0 is a submatrix containing the top left corner of A. When we talk about assigning parts of A to UEs, we will reference different UEs using UE and an index or indices in parentheses; for example, if we are regarding UEs as forming a 1D array, UE(0) is the conceptually leftmost UE, while if we are regarding UEs as forming a 2D array, UE(0,0) is the conceptually top left UE. Indices are all assumed to be zero-based (that is, the smallest index is 0).
We will use the notation "\" for integer division and "/" for normal division. Thus a\b = ⌊a/b⌋. Also, ⌊x⌋ (floor) is the largest integer at most x, and ⌈x⌉ (ceiling) is the smallest integer at least x. For example, ⌊4/3⌋ = 1, and ⌈4/3⌉ = 2.
Mapping to UEs. More generally, we could have an N x M matrix where the number of UEs, P, need not divide the number of columns evenly. In this case, MB is the maximum number of columns mapped to a UE, and all UEs except UE(P-1) contain MB columns. Then, MB = ⌈M/P⌉, and elements of column j are mapped to UE(⌊j/MB⌋).[12] (This reduces to the formula given earlier for the example, because in the special case where P evenly divides M, ⌈M/P⌉ = M/P and ⌊j/MB⌋ = j/MB.) Analogous formulas apply for row distributions.
[12]
Notice that this is not the only possible way to distribute columns among UEs when the number of UEs does not evenly divide the number of columns. Another approach, more complex to define but producing a more balanced distribution in some cases, is to first define the minimum number of columns per UE as ⌊M/P⌋, and then increase this number by one for the first (M mod P) UEs. For example, for M = 10 and P = 4, UE(0) and UE(1) would have three columns each and UE(2) and UE(3) would have two columns each.

Mapping to local indices. In addition to mapping the columns to UEs, we also need to map the global indices to local indices. In this case, matrix element (i,j) maps to local element (i, j mod MB). Given local indices (x,y) and UE(w), we can recover the global indices (x, w·MB + y). Again, analogous formulas apply for row distributions.

2D block. Fig. 5.51 shows a 2D block distribution of A onto a two-by-two array of UEs. Here, A is being decomposed along two dimensions, so for each subblock, the number of columns is the matrix order divided by the number of columns of UEs, and the number of rows is the matrix order divided by the number of rows of UEs. Matrix element (i,j) is assigned to UE(i\4, j\4).
Mapping to UEs. More generally, we map an N x M matrix to a PR x PC matrix of UEs. The maximum size of a subblock is NB x MB, where NB = ⌈N/PR⌉ and MB = ⌈M/PC⌉. Then, element (i,j) in the global matrix is stored in UE(⌊i/NB⌋, ⌊j/MB⌋).

Mapping to local indices. Global indices (i,j) map to local indices (i mod NB, j mod MB). Given local indices (x,y) on UE(z,w), the corresponding global indices are (z·NB + x, w·MB + y).

Block-cyclic. The main idea behind the block-cyclic distribution is to create more blocks than UEs and allocate them in a cyclic manner, similar to dealing out a deck of cards. Fig. 5.52 shows a 1D block-cyclic distribution of A onto a linear array of four UEs, illustrating how columns are assigned to UEs in a round-robin fashion. Here, matrix element (i,j) is assigned to UE(j mod 4) (where 4 is the number of UEs).
Figure 5.54. 2D block-cyclic distribution of A onto four UEs, part 2: Assigning submatrices to UEs
Mapping to UEs. In the general case, we have an N x M matrix to be mapped onto a PR x PC array of UEs. We choose block size NB x MB. Element (i,j) in the global matrix will be mapped to UE(z,w), where z = ⌊i/NB⌋ mod PR and w = ⌊j/MB⌋ mod PC.

Mapping to local indices. Because multiple blocks are mapped to the same UE, we can view the local indexing blockwise or elementwise.

In the blockwise view, each element on a UE is indexed locally by block indices (l,m) and indices (x,y) into the block. To restate this: In this scheme, the global matrix element (i,j) will be found on the UE within the local (l,m) block at the position (x,y), where (l,m) = (⌊i/(PR·NB)⌋, ⌊j/(PC·MB)⌋) and (x,y) = (i mod NB, j mod MB). Fig. 5.55 illustrates this for UE(0,0).
Figure 5.55. 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to UE(0,0). LAl,m is the block with block indices (l, m). Each element is labeled both with its original global indices (ai,j) and its indices within block LAl,m (lx,y).
For example, consider global matrix element a5,1. Because PR = PC = NB = MB = 2, this element will map to UE(0,0). There are four two-by-two blocks on this UE. From the figure, we see that this element appears in the block on the bottom left, or block LA1,0, and indeed, from the formulas, we obtain (l,m) = (⌊5/(2·2)⌋, ⌊1/(2·2)⌋) = (1,0). Finally, we need the local indices within the block. In this case, the indices within the block are (x,y) = (5 mod 2, 1 mod 2) = (1,1).

In the elementwise view (which requires that all the blocks for each UE form a contiguous matrix), global indices (i,j) are mapped elementwise to local indices (l·NB + x, m·MB + y), where l and m are defined as before. Fig. 5.56 illustrates this for UE(0,0).
Figure 5.56. 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to UE(0,0). Each element is labeled both with its original global indices ai,j and its local indices (x', y'). Local indices are with respect to the contiguous matrix used to store all blocks assigned to this UE.
To select which distribution to use for a problem, consider how the computational load on the UEs changes as the computation proceeds. For example, in many single-channel signal processing problems, the same set of operations is performed on each column of an array. The work does not vary as the computation proceeds, so a column block decomposition will produce both clarity of code and good load balance. If instead the amount of work varies by column, with higher-numbered columns requiring more work, a column block decomposition would lead to poor load balance, with the UEs processing lower-numbered columns finishing ahead of the UEs processing higher-numbered columns. In this case, a cyclic distribution would produce better load balance, because each UE is assigned a mix of low-numbered and high-numbered columns.

This same approach applies to higher dimensions as well. ScaLAPACK [Sca, BCC97+], the leading package of dense linear algebra software for distributed-memory computers, requires a 2D block-cyclic distribution. To see why this choice was made, consider Gaussian elimination, one of the more commonly used of the ScaLAPACK routines. In this algorithm, also known as LU decomposition, a dense square matrix is transformed into a pair of triangular matrices, an upper matrix U and a lower matrix L. At a high level, the algorithm proceeds from the upper left corner and works its way down
The examples in the preceding section illustrate how each element of the original (global) array is mapped to a UE and how each element in the global array, after distribution, is identified by both a set of global indices and a combination of UE identifier and local information. The original problem is typically stated in terms of global indices, but computation within each UE must be in terms of local indices. Applying this pattern effectively requires that the relationship between global indices and the combination of UE and local indices be as transparent as possible. In a quest for program efficiency, it is altogether too easy to bury these index mappings in the code in a way that makes the program painfully difficult to debug. A better approach is to use macros and inline functions to capture the index mappings; a human reader of the program then only needs to master the macro or function once. Such macros or functions also contribute to clarity of abstraction. The Examples section illustrates this strategy.
Aligning computation with locality
One of the cardinal rules of performance-oriented computing is to maximize reuse of data close to a UE. That is, the loops that update local data should be organized in a way that gets as much use as possible out of each memory reference. This objective can also influence the choice of array distribution.

For example, in linear algebra computations, it is possible to organize computations on a matrix into smaller computations over submatrices. If these submatrices fit into cache, dramatic performance gains can result. Similar effects apply to other levels of the memory hierarchy: minimizing misses in the translation lookaside buffer (TLB), page faults, and so on. A detailed discussion of this topic goes well beyond the scope of this book. An introduction can be found in [PH98].

Examples
Transposing a matrix stored as column blocks
/*******************************************************************
NAME:    trans_isend_ircv

PURPOSE: This function uses MPI Isend and Irecv to
         transpose a column-block distributed matrix.
*******************************************************************/
#include "mpi.h"
#include <stdio.h>

/*******************************************************************
** This function transposes a local block of a matrix. We don't
** display the text of this function as it is not relevant to the
** point of this example.
*******************************************************************/
void transpose(
     double* A, int Acols,        /* input matrix */
     double* B, int Bcols,        /* transposed mat */
     int sub_rows, int sub_cols); /* size of slice to transpose */

/*******************************************************************
** Define macros to compute process source and destinations and
** local indices
*******************************************************************/
#define TO(ID, PHASE, NPROC)   ((ID + PHASE        ) % NPROC)
#define FROM(ID, PHASE, NPROC) ((ID + NPROC - PHASE) % NPROC)
#define BLOCK(BUFF, ID)        (BUFF + (ID * block_size))

/* continued in next figure */
/* continued from previous figure */

void trans_isnd_ircv(double *buff, double *trans, int Block_order,
                     double *work, int my_ID, int num_procs)
{
   int iphase;
   int block_size;
   int send_to, recv_from;
   double *bblock;    /* pointer to current location in buff */
   double *tblock;    /* pointer to current location in trans */
   MPI_Status status;
   MPI_Request send_req, recv_req;

   block_size = Block_order * Block_order;

/*******************************************************************
** Do the transpose in num_procs phases.
**
** In the first phase, do the diagonal block. Then move out
** from the diagonal copying the local matrix into a communication
** buffer (while doing the local transpose) and send to process
** (diag+phase)%num_procs.
*******************************************************************/
   bblock = BLOCK(buff, my_ID);
   tblock = BLOCK(trans, my_ID);

   transpose(bblock, Block_order, tblock, Block_order,
             Block_order, Block_order);

   for (iphase=1; iphase<num_procs; iphase++){

      recv_from = FROM(my_ID, iphase, num_procs);
      tblock = BLOCK(trans, recv_from);
      MPI_Irecv (tblock, block_size, MPI_DOUBLE, recv_from,
                 iphase, MPI_COMM_WORLD, &recv_req);

      send_to = TO(my_ID, iphase, num_procs);
      bblock = BLOCK(buff, send_to);
      transpose(bblock, Block_order, work, Block_order,
                Block_order, Block_order);
      MPI_Isend (work, block_size, MPI_DOUBLE, send_to,
                 iphase, MPI_COMM_WORLD, &send_req);

      MPI_Wait(&recv_req, &status);
      MPI_Wait(&send_req, &status);
   }
}
BUFF is the start of the 1D array (buff for the original array, trans for the transpose) and ID is the second index of the block. So for example, we find the diagonal block of both arrays as follows:
bblock = BLOCK(buff, my_ID); tblock = BLOCK(trans, my_ID);
Likewise, we compute where the next block is coming from and which local index corresponds to that block:
recv_from = FROM(my_ID, iphase, num_procs); tblock = BLOCK(trans, recv_from);
This continues until all of the blocks have been transposed.
We use immediate (nonblocking) sends and receives in this example. (These primitives are described in more detail in the MPI appendix, Appendix B.) During each phase, each UE first posts a receive and then performs a transpose on the block it will send. After that transpose is complete, the UE sends the now-transposed block to the UE that should receive it. At the bottom of the loop and before moving to the next phase, functions are called to force the UE to wait until both the sends and receives complete. This approach lets us overlap communication and computation. More importantly (because in this case there isn't much computation to overlap with communication), it prevents deadlock: A more straightforward approach using regular sends and receives would be to first transpose the block to be sent, then send it, and then (wait to) receive a block from another UE. However, if the blocks to be sent are large, a regular send might block because there is insufficient buffer space for the message; in this case, such blocking could produce deadlock. By instead using nonblocking sends and receives and posting the receives first, we avoid this situation.
Known uses
This pattern is used throughout the scientific computing literature. The well-known ScaLAPACK
The PLAPACK package [ABE97+, PLA, vdG97] takes a different approach to array distribution. Rather than focusing on how to distribute the arrays, PLAPACK considers how the vectors operated upon by the arrays are organized. From these distributed vectors, the corresponding array distributions are derived. In many problems, these vectors correspond to the physical quantities in the problem domain, so the PLAPACK team refers to this as the physically based distribution.

Related Patterns

The Distributed Array pattern is often used together with the Geometric Decomposition and SPMD patterns.
times been important in parallel programming. They are only rarely used at this time, but it is still important to be aware of them. They can provide insights into different opportunities for finding and exploiting concurrency. And it is possible that as parallel architectures continue to evolve, the parallel programming techniques suggested by these patterns may become important.

In this section, we will briefly describe some of these additional patterns and their supporting structures: SIMD, MPMD, Client-Server, and Declarative Programming. We close with a brief discussion of problem-solving environments. These are not patterns, but they help programmers work within a targeted set of problems.

5.11.1. SIMD

A SIMD computer has a single stream of instructions operating on multiple streams of data. These machines were inspired by the belief that programmers would find it too difficult to manage multiple streams of instructions. Many important problems are data parallel; that is, the concurrency can be expressed in terms of concurrent updates across the problem's data domain. Carried to its logical extreme, the SIMD approach assumes that it is possible to express all parallelism in terms of the data. Programs would then have single-thread semantics, making understanding and hence debugging them much easier. The basic idea behind the SIMD pattern can be summarized as follows.
* Define a network of virtual PEs to be mapped onto the actual PEs. These virtual PEs are connected according to a well-defined topology. Ideally the topology is (1) well aligned with the way the PEs in the physical machine are connected and (2) effective for the communication patterns implied by the problem being solved.

* Express the problem in terms of arrays or other regular data structures that can be updated concurrently with a single stream of instructions.

* Associate these arrays with the local memories of the virtual PEs.

* Create a single stream of instructions that operates on slices of the regular data structures. These instructions may have an associated mask so they can be selectively skipped for subsets of array elements. This is critical for handling boundary conditions or other constraints.
When a problem is truly data parallel, this is an effective pattern. The resulting programs are relatively easy to write and debug [DKK90].

Unfortunately, most data problems contain subproblems that are not data parallel. Setting up the core data structures, dealing with boundary conditions, and postprocessing after a core data-parallel algorithm can all introduce logic that might not be strictly data parallel. Furthermore, this style of programming is tightly coupled to compilers that support data-parallel programming. These compilers have proven difficult to write and result in code that is difficult to optimize because it can be far removed from how a program runs on a particular machine. Thus, this style of parallel programming and the machines built around the SIMD concept have largely disappeared, except for a few special-purpose machines used for signal processing applications.

The programming environment most closely associated with the SIMD pattern is High Performance Fortran (HPF) [HPF97]. HPF is an extension of the array-based constructs in Fortran 90. It was created to support portable parallel programming across SIMD machines, but also to allow the SIMD
programming model to be used on MIMD computers. This required explicit control over data placement onto the PEs and the capability to remap the data during a calculation. Its dependence on a strictly data-parallel, SIMD model, however, doomed HPF by making it difficult to use with complex applications. The last large community of HPF users is in Japan [ZJS02+], where they have extended the language to relax the data-parallel constraints [HPF99].

5.11.2. MPMD

The Multiple Program, Multiple Data (MPMD) pattern, as the name implies, is used in a parallel algorithm when different programs run on different UEs. The basic approach is the following.
In many ways, the MPMD approach is not too different from an SPMD program using MPI. In fact, the runtime environments associated with the two most common implementations of MPI, MPICH [MPI] and LAM/MPI [LAM], support simple MPMD programming.

Applications of the MPMD pattern typically arise in one of two ways. First, the architecture of the UEs may be so different that a single program cannot be used across the full system. This is the case when using parallel computing across some type of computational grid [Glob, FK03] using multiple classes of high-performance computing architectures. The second (and from a parallel algorithm point of view more interesting) case occurs when completely different simulation programs are combined into a coupled simulation.

For example, climate emerges from a complex interplay between atmospheric and ocean phenomena. Well-understood programs for modeling the ocean and the atmosphere independently have been developed and highly refined over the years. Although an SPMD program could be created that implements a coupled ocean/atmospheric model directly, a more effective approach is to take the separate, validated ocean and atmospheric programs and couple them through some intermediate layer, thereby producing a new coupled model from well-understood component models.

Although both MPICH and LAM/MPI provide some support for MPMD programming, they do not allow different implementations of MPI to interact, so only MPMD programs using a common MPI implementation are supported. To address a wider range of MPMD problems spanning different architectures and different MPI implementations, a new standard called interoperable MPI (iMPI) was created. The general idea of coordinating UEs through the exchange of messages is common to MPI and iMPI, but the detailed semantics are extended in iMPI to address the unique challenges arising from programs running on widely differing architectures. These multiarchitecture issues can add significant communication overhead, so the part of an algorithm dependent on the performance of iMPI must be relatively coarse-grained.

MPMD programs are rare. As increasingly complicated coupled simulations grow in importance,
however, use of the MPMD pattern will increase. Use of this pattern will also grow as grid technology becomes more robust and more widely deployed.

5.11.3. Client-Server Computing

Client-server architectures are related to MPMD. Traditionally, these systems have comprised two or three tiers where the front end is a graphical user interface executed on a client's computer and a mainframe back end (often with multiple processors) provides access to a database. The middle tier, if present, dispatches requests from the clients to (possibly multiple) back ends. Web servers are a familiar example of a client-server system. More generally, a server might offer a variety of services to clients, an essential aspect of the system being that services have well-defined interfaces. Parallelism can appear at the server (which can service many clients concurrently or can use parallel processing to obtain results more quickly for single requests) and at the client (which can initiate requests at more than one server simultaneously).

Techniques used in client-server systems are especially important in heterogeneous systems. Middleware such as CORBA [COR] provides a standard for service interface specifications, enabling new programs to be put together by composing existing services, even if those services are offered on vastly different hardware platforms and implemented in different programming languages. CORBA also provides facilities to allow services to be located. The Java J2EE (Java 2 Platform, Enterprise Edition) [Javb] also provides significant support for client-server applications. In both of these cases, interoperability was a major design force.

Client-server architectures have traditionally been used in enterprise rather than scientific applications. Grid technology, which is heavily used in scientific computing, borrows from client-server technology, extending it by blurring the distinction between clients and servers. All resources in a grid, whether they are computers, instruments, file systems, or anything else connected to the network, are peers and can serve as clients and servers. The middleware provides standards-based interfaces to tie the resources together into a single system that spans multiple administrative domains.

5.11.4. Concurrent Programming with Declarative Languages

The overwhelming majority of programming is done with imperative languages such as C++, Java, or Fortran. This is particularly the case for traditional applications in science and engineering. The artificial intelligence community and a small subset of academic computer scientists, however, have developed and shown great success with a different class of languages, the declarative languages. In these languages, the programmer describes a problem, a problem domain, and the conditions solutions must satisfy. The runtime system associated with the language then uses these to find valid solutions. Declarative semantics impose a different style of programming that overlaps with the approaches discussed in this pattern language, but has some significant differences. There are two important classes of declarative languages: functional languages and logic programming languages.

Logic programming languages are based on formal rules of logical inference. The most common logic programming language by far is Prolog [SS94], a programming language based on first-order predicate calculus. When Prolog is extended to support expression of concurrency, the result is a concurrent logic programming language. Concurrency is exploited in one of three ways with these
Prolog extensions: AND-parallelism (execute multiple predicates), OR-parallelism (execute multiple guards), or through explicit mapping of predicates linked together through single-assignment variables [CG86].

Concurrent logic programming languages were a hot area of research in the late 1980s and early 1990s. They ultimately failed because most programmers were deeply committed to more traditional imperative languages. Even with the advantages of declarative semantics and the value of logic programming for symbolic reasoning, the learning curve associated with these languages proved prohibitive.

The older and more established class of declarative programming languages is based on functional programming models [Hud89]. LISP is the oldest and best known of the functional languages. In pure functional languages, there are no side effects from a function. Therefore, functions can execute as soon as their input data is available. The resulting algorithms express concurrency in terms of the flow of data through the program, thereby resulting in "dataflow" algorithms [Jag96].

The best-known concurrent functional languages are Sisal [FCO90], Concurrent ML [Rep99, Con] (an extension to ML), and Haskell [HPF]. Because mathematical expressions are naturally written down in a functional notation, Sisal was particularly straightforward to work with in science and engineering applications and proved to be highly efficient for parallel programming. However, just as with the logic programming languages, programmers were unwilling to part with their familiar imperative languages, and Sisal essentially died. Concurrent ML and Haskell have not made major inroads into high-performance computing, although both remain popular in the functional programming community.

5.11.5. Problem-Solving Environments

A discussion of supporting structures for parallel algorithms would not be complete without mentioning problem-solving environments (PSEs). A PSE is a programming environment specialized to the needs of a particular class of problems. When applied to parallel computing, PSEs imply a particular algorithm structure as well.
The motivation behind PSEs is to spare the application programmer the low-level details of the parallel system. For example, PETSc (Portable, Extensible Toolkit for Scientific Computation) [BGMS98] supports a variety of distributed data structures and functions required to use them for solving partial differential equations (typically for problems fitting the Geometric Decomposition pattern). The programmer needs to understand the data structures within PETSc, but is spared the need to master the details of how to implement them efficiently and portably. Other important PSEs are PLAPACK [ABE97+] (for dense linear algebra problems) and POOMA [RHC96+] (an object-oriented framework for scientific computing).
Within each of these categories, the most commonly used mechanisms are covered in this chapter. An overview of this design space and its place in the pattern language is shown in Fig. 6.1.
Figure 6.1. Overview of the Implementation Mechanisms design space and its place in the pattern language
In this chapter we also drop the formalism of patterns. Most of the implementation mechanisms are included within the major parallel programming environments. Hence, rather than use patterns, we provide a high-level description of each implementation mechanism and then investigate how the mechanism maps onto our three target programming environments: OpenMP, MPI, and Java. This mapping will in some cases be trivial and require little more than presenting an existing construct in an API or language. The discussion will become interesting when we look at operations native to one programming model, but foreign to another. For example, it is possible to do message passing in OpenMP. It is not pretty, but it works and can be useful at times.

We assume that the reader is familiar with OpenMP, MPI, and Java and how they are used for writing parallel programs. Although we cover specific features of these APIs in this chapter, the details of using them are left to the appendixes.
6.1. OVERVIEW
Parallel programs exploit concurrency by mapping instructions onto multiple UEs. At a very basic level, every parallel program needs to (1) create the set of UEs, (2) manage interactions between them and their access to shared resources, (3) exchange information between UEs, and (4) shut them down in an orderly manner. This suggests the following categories of implementation mechanisms.
UE management. The creation, destruction, and management of the processes and threads used in parallel computation.
6.2. UE MANAGEMENT
Let us revisit the definition of unit of execution, or UE. A UE is an abstraction for the entity that carries out computations and is managed for the programmer by the operating system. In modern parallel programming environments, there are two types of UEs: processes and threads.

A process is a heavyweight object that carries with it the state or context required to define its place in the system. This includes memory, program counters, registers, buffers, open files, and anything else required to define its context within the operating system. In many systems, different processes can belong to different users, and thus processes are well protected from each other. Creating a new process and swapping between processes is expensive because all that state must be saved and restored. Communication between processes, even on the same machine, is also expensive because the protection boundaries must be crossed.

A thread, on the other hand, is a lightweight UE. A collection of threads is contained in a process. Most of the resources, including the memory, belong to the process and are shared among the threads. The result is that creating a new thread and switching context between threads is less expensive, requiring only the saving of a program counter and some registers. Communication between threads belonging to the same process is also inexpensive because it can be done by accessing the shared memory.

The mechanisms for managing these two types of UEs are completely different. We handle them separately in the next two sections.

6.2.1. Thread Creation/Destruction

Threads require a relatively modest number of machine cycles to create. Programmers can reasonably create and destroy threads as needed inside a program, and as long as they do not do so inside tight loops or time-critical kernels, the program's overall runtime will be affected only modestly. Hence, most parallel programming environments make it straightforward to create threads inside a program, and the API supports thread creation and destruction.
OpenMP: thread creation/destruction
In OpenMP, threads are created with the parallel pragma:
#pragma omp parallel
{
   structured block
}
To create and launch the thread, one would write
Thread t = new MyThread();  // create thread object
t.start();                  // launch the thread
To create and execute the thread, we create a Runnable object and pass it to the Thread constructor. The thread is launched using the start method as before:
Thread t = new Thread(new MyRunnable()); // create Runnable
                                         // and Thread objects
t.start();                               // start the thread
In most cases, the second approach is preferred.[1]
[1]
In Java, a class can implement any number of interfaces, but is only allowed to extend a
MPI is fundamentally based on processes. It is thread-aware in that the MPI 2.0 API [Mesa] defines different levels of thread safety and provides functions to query a system at runtime as to the level of thread safety that is supported. The API, however, has no concept of creating and destroying threads. To mix threads into an MPI program, the programmer must use a thread-based programming model in addition to MPI.

6.2.2. Process Creation/Destruction

A process carries with it all the information required to define its place in the operating system. In addition to program counters and registers, a process includes a large block of memory (its address space), system buffers, and everything else required to define its state to the operating system. Consequently, creating and destroying processes is expensive and not done very often.
MPI: process creation/destruction
6.3. SYNCHRONIZATION
Synchronization is used to enforce a constraint on the order of events occurring in different UEs. There is a vast body of literature on synchronization [And00], and it can be complicated. Most programmers, however, use only a few synchronization methods on a regular basis.

6.3.1. Memory Synchronization and Fences

In a simple, classical model of a shared-memory multiprocessor, each UE executes a sequence of instructions that can read or write atomically from the shared memory. We can think of the computation as a sequence of atomic events, with the events from different UEs interleaved. Thus, if UE A writes a memory location and then UE B reads it, UE B will see the value written by UE A.
Suppose, for example, that UE A does some work and then sets a variable done to true. Meanwhile, UE B executes a loop:
while (!done) {/*do something but don't change done*/}
In the simple model, UE A will eventually set done, and then in the next loop iteration, UE B will read the new value and terminate the loop.

In reality, several things could go wrong. First of all, the value of the variable may not actually be written by UE A or read by UE B. The new value could be held in a cache instead of the main memory, and even in systems with cache coherency, the value could be (as a result of compiler optimizations, say) held in a register and not be made visible to UE B. Similarly, UE B may try to read the variable and obtain a stale value, or due to compiler optimizations, not even read the value more than once because it isn't changed in the loop. In general, many factors (properties of the memory system, the compiler, instruction reordering, and so on) can conspire to leave the contents of the memories (as seen by each UE) poorly defined.

A memory fence is a synchronization event that guarantees that the UEs will see a consistent view of memory. Writes performed before the fence will be visible to reads performed after the fence, as would be expected in the classical model, and all reads performed after the fence will obtain a value written no earlier than the latest write before the fence.

Clearly, memory synchronization is only an issue when there is shared context between the UEs. Hence, this is not generally an issue when the UEs are processes running in a distributed-memory environment. For threads, however, putting memory fences in the right locations can make the difference between a working program and a program riddled with race conditions.

Explicit management of memory fences is cumbersome and error-prone. Fortunately, most programmers, although needing to be aware of the issue, only rarely need to deal with fences explicitly because, as we will see in the next few sections, the memory fence is usually implied by higher-level synchronization constructs.
OpenMP: fences
In OpenMP, a memory fence is defined with the flush statement:
#pragma omp flush
OpenMP programmers only rarely use the flush construct because OpenMP's high-level
If a program uses synchronization among the full team, the synchronization will work independently of the size of the team, even if the team size is one. On the other hand, a program with pairwise synchronization will deadlock if run with a single thread. An OpenMP design goal was to encourage code that is equivalent whether run with one thread or many, a property called sequential equivalence. Thus, high-level constructs that are not sequentially equivalent, such as pairwise synchronization, were left out of the API.

In this program, each thread has two blocks of work to carry out concurrently with the other threads in the team. The work is represented by two functions: do_a_whole_bunch() and do_more_stuff(). The contents of these functions are irrelevant (and hence are not shown) for this example. All that matters is that for this example we assume that a thread cannot safely begin work on the second function do_more_stuff() until its neighbor has finished with the first function do_a_whole_bunch().

The program uses the SPMD pattern. The threads communicate their status for the sake of the pairwise synchronization by setting their value (indexed by the thread ID) of the flag array. Because this array must be visible to all of the threads, it needs to be a shared array. We create this within OpenMP by declaring the array in the sequential region (that is, prior to creating the team of threads). We create the team of threads with a parallel pragma:
#pragma omp parallel shared(flag)
When the work is done, the thread sets its flag to 1 to notify any interested threads that the work is done. This must be flushed to memory to ensure that other threads can see the updated value:
#pragma omp flush (flag)
Figure 6.2. Program showing one way to implement pairwise synchronization in OpenMP

The flush construct is vital. It forces the memory to be consistent, thereby making the updates to the flag array visible. For more details about the syntax of OpenMP, see the OpenMP appendix, Appendix A.
#include <omp.h>
#include <stdio.h>

#define MAX 10   // max number of threads

// Functions used in this program: the details of these
// functions are not relevant so we do not include the
// function bodies.
extern int  neighbor(int);   // return the ID for a thread's neighbor
extern void do_a_whole_bunch(int);
extern void do_more_stuff();

int main()
{
   int flag[MAX];   // Define an array of flags, one per thread
   int i;
   for (i = 0; i < MAX; i++) flag[i] = 0;

   #pragma omp parallel shared(flag)
   {
      int ID;
      ID = omp_get_thread_num();  // returns a unique ID for each thread

      do_a_whole_bunch(ID);       // Do a whole bunch of work

      flag[ID] = 1;               // signal that this thread has finished its work
      #pragma omp flush(flag)     // make sure all the threads have a chance to
                                  // see the updated flag value

      while (!flag[neighbor(ID)]) {  // wait to see if neighbor is done
         #pragma omp flush(flag)     // required to see any changes to flag
      }

      do_more_stuff();            // call a function that can't safely start until
                                  // the neighbor's work is complete
   } // end parallel region
}
The thread then waits until its neighbor is finished with do_a_whole_bunch() before moving on to finish its work with a call to do_more_stuff().
while (!flag[neighbor(ID)]) {  // wait to see if neighbor is done
   #pragma omp flush(flag)     // required to see any changes to flag
}
The flush operation in OpenMP only affects the thread-visible variables for the calling thread. If another thread writes a shared variable and forces it to be available to the other threads by using a flush, the thread reading the variable still needs to execute a flush to make sure it picks up the new value. Hence, the body of the while loop must include a flush(flag) construct to make sure the thread sees any new values in the flag array.

As we mentioned earlier, knowing when a flush is needed and when it is not can be challenging. In most cases, the flush is built into the synchronization construct. But when the standard constructs aren't adequate and custom synchronization is required, placing memory fences in the right locations is essential.
Java: fences
Java is one of the first languages where an attempt was made to specify its memory model precisely. The original specification has been criticized for imprecision as well as for not supporting certain synchronization idioms while at the same time disallowing some reasonable compiler optimizations. A new specification for Java 1.5 is described in [JSRa]. In this book, we assume the new specification.

Java allows variables to be declared as volatile. When a variable is marked volatile, all writes to the variable are guaranteed to be immediately visible, and all reads are guaranteed to obtain the last value written. Thus, the compiler takes care of memory synchronization issues for volatiles.[4] In a Java version of a program containing the fragment (where done is expected to be set by another thread)
[4]
we would mark done to be volatile when the variable is declared, as follows, and then ignore memory synchronization issues related to this variable in the rest of the program.
volatile boolean done = false;
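As a minimal sketch of how such a volatile flag behaves (the class name, the runDemo method, and the value 42 are our own illustration, not from the text), one thread publishes a result and sets the flag, while another spins until it sees the update:

```java
// Sketch: the done flag above, declared volatile so the writer's update
// is guaranteed to become visible to the spinning reader without any
// explicit memory fence.
public class VolatileDone {
    static volatile boolean done = false;
    static int result = 0;

    public static int runDemo() {
        done = false;
        result = 0;
        Thread writer = new Thread(() -> {
            result = 42;   // ordinary write, ordered before the volatile write
            done = true;   // volatile write publishes both updates
        });
        writer.start();
        while (!done) {    // volatile read: guaranteed to eventually see true
            Thread.yield();
        }
        return result;     // seeing done == true makes result visible too
    }

    public static void main(String[] args) {
        System.out.println("result = " + runDemo());
    }
}
```

Note that the ordinary write to result is also made visible by this idiom: under the new memory model, the volatile write to done happens-before the volatile read that observes it.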
Because the volatile keyword can be applied only to references to an array and not to the individual elements, the java.util.concurrent.atomic package introduced in Java 1.5 adds a notion of atomic arrays where the individual elements are accessed with volatile semantics. For example, in a Java version of the OpenMP example shown in Fig. 6.2, we would declare the flag array to be of type AtomicIntegerArray and update and read with that class's set and get methods.

Another technique for ensuring proper memory synchronization is synchronized blocks. A synchronized block appears as follows:
synchronized(some_object){/*do something with shared variables*/}
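Returning to the atomic-array approach mentioned above, the following is a minimal sketch (the class name, runDemo, the team size of 4, and the ring-style neighbor function are our own illustration, not the code of Fig. 6.2): each thread sets its own element with set and spins on its neighbor's element with get.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch: per-thread completion flags with volatile access semantics.
// AtomicIntegerArray.set() publishes an element and get() reads it with
// volatile semantics, playing the role of the flag array plus the
// explicit flushes in the OpenMP version.
public class AtomicFlags {
    static final int N = 4;
    static final AtomicIntegerArray flag = new AtomicIntegerArray(N); // zeros

    public static int runDemo() {
        Thread[] t = new Thread[N];
        for (int id0 = 0; id0 < N; id0++) {
            final int id = id0;
            t[id0] = new Thread(() -> {
                // ... first block of work would go here ...
                flag.set(id, 1);                  // signal completion
                int neighbor = (id + 1) % N;      // simple ring neighbor
                while (flag.get(neighbor) != 1) { // spin until neighbor done
                    Thread.yield();
                }
                // ... second block of work would go here ...
            });
            t[id0].start();
        }
        for (Thread th : t) {
            try { th.join(); } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        int sum = 0;
        for (int i = 0; i < N; i++) sum += flag.get(i);
        return sum;   // every thread set its flag, so this is N
    }

    public static void main(String[] args) {
        System.out.println("flags set: " + runDemo());
    }
}
```

Because every thread sets its own flag before waiting on its neighbor's, the spin loops cannot deadlock.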
A fence only arises in environments that include shared memory. In MPI specifications prior to MPI 2.0, the API did not expose shared memory to the programmer, and hence there was no need for a user-callable fence. MPI 2.0, however, includes one-sided communication constructs. These constructs create "windows" of memory visible to other processes in an MPI program. Data can be pushed to or pulled from these windows by a single process without the explicit cooperation of the process owning the memory region in question. These memory windows require some type of fence, but are not discussed here because implementations of MPI 2.0 are not widely available at the time this was written.

6.3.2. Barriers

A barrier is a synchronization point at which every member of a collection of UEs must arrive before any members can proceed. If a UE arrives early, it will wait until all of the other UEs have arrived.

A barrier is one of the most common high-level synchronization constructs. It has relevance both in process-oriented environments such as MPI and thread-based systems such as OpenMP and Java.
MPI: barriers
In MPI, a barrier is invoked by calling the function
MPI_Barrier(MPI_Comm comm)
#include <mpi.h>    // MPI include file
#include <stdio.h>

extern void runit();

int main(int argc, char **argv)
{
   int num_procs;  // number of processes in the group
   int ID;         // unique identifier ranging from 0 to (num_procs-1)
   double time_init, time_final, time_elapsed;

   //
   // Initialize MPI and set up the SPMD program
   //
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

   //
   // Ensure that all processes are set up and ready to go before timing runit()
   //
   MPI_Barrier(MPI_COMM_WORLD);

   time_init = MPI_Wtime();

   runit();   // a function that we wish to time on each process

   time_final = MPI_Wtime();
   time_elapsed = time_final - time_init;

   printf(" I am %d and my computation took %f seconds\n", ID, time_elapsed);

   MPI_Finalize();
   return 0;
}
In OpenMP, a simple pragma sets the barrier:
#pragma omp barrier
#include <omp.h>
#include <stdio.h>

extern void runit();

int main(int argc, char **argv)
{
   double time_init, time_final, time_elapsed;

   #pragma omp parallel private(time_init, time_final, time_elapsed)
   {
      int ID;
      ID = omp_get_thread_num();

      //
      // ensure that all threads are set up and ready to go before timing runit()
      //
      #pragma omp barrier

      time_init = omp_get_wtime();

      runit();   // a function that we wish to time on each thread

      time_final = omp_get_wtime();
      time_elapsed = time_final - time_init;

      printf(" I am %d and my computation took %f seconds\n", ID, time_elapsed);
   }
   return 0;
}
Barriers should be used where required to ensure correct program semantics, but for performance reasons should be used no more often than absolutely required.
Java: barriers
Java did not originally include a barrier primitive, although it is not difficult to create one using the facilities in the language, as was done in the public-domain util.concurrent package [Lea]. In Java 1.5, similar classes are provided in the java.util.concurrent package.

A CyclicBarrier is similar to the barrier described previously. The CyclicBarrier class contains two constructors: one that requires the number of threads that will synchronize on the barrier, and another that takes the number of threads along with a Runnable object whose run method will be executed by the last thread to arrive at the barrier. When a thread arrives at the barrier, it invokes the barrier's await method. If a thread "breaks" the barrier by terminating prematurely with an exception, the other threads will throw a BrokenBarrierException. A CyclicBarrier automatically resets itself when passed and can be used multiple times (in a loop, for example). In Fig. 6.5, we provide a Java version of the barrier examples given previously using a CyclicBarrier.

The java.util.concurrent package also provides a related synchronization primitive, CountDownLatch. A CountDownLatch is initialized to a particular value N. Each invocation of its countDown method decreases the count. A thread executing the await method blocks until the value of the latch reaches 0. The separation of countDown (analogous to "arriving at the barrier") and await (waiting for the other threads) allows more general situations than "all threads wait for all other threads to reach a barrier." For example, a single thread could wait for N events to happen, or N threads could wait for a single event to happen. A CountDownLatch cannot be reset and can only be used once. An example using a CountDownLatch is given in the Java appendix, Appendix C.

6.3.3. Mutual Exclusion

When memory or other system resources (for example, a file system) are shared, the program must ensure that multiple UEs do not interfere with each other. For example, if two threads try to update a shared data structure at the same time, a race condition results that can leave the structure in an inconsistent state. A critical section is a sequence of statements that conflict with a sequence of statements that may be executed by other UEs. Two sequences of statements conflict if both access the same data and at least one of them modifies the data. To protect the resources accessed inside the critical section, the programmer must use some mechanism that ensures that only one thread at a time will execute the code within the critical section. This is called mutual exclusion.
Figure 6.5. Java program containing a CyclicBarrier. This program is used to time the execution of function runit().
import java.util.concurrent.*;

public class TimerExample implements Runnable
{
   static int N;
   static CyclicBarrier barrier;
   final int ID;

   public void run()
   {
      //wait at barrier until all threads are ready
      System.out.println(ID + " at await");
      try { barrier.await(); }
      catch (InterruptedException ex)   { Thread.dumpStack(); }
      catch (BrokenBarrierException ex) { Thread.dumpStack(); }

      //record start time
      long time_init = System.currentTimeMillis();

      //execute function to be timed on each thread
      runit();

      //record ending time
      long time_final = System.currentTimeMillis();

      //print elapsed time
      System.out.println("Elapsed time for thread " + ID +
                         " = " + (time_final - time_init) + " msecs");
      return;
   }

   void runit(){...}   //definition of runit()

   TimerExample(int ID){ this.ID = ID; }

   public static void main(String[] args)
   {
      N = Integer.parseInt(args[0]);   //read number of threads
      barrier = new CyclicBarrier(N);  //instantiate barrier
      for (int i = 0; i != N; i++)     //create and start threads
         { new Thread(new TimerExample(i)).start(); }
   }
}
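The CountDownLatch discussed earlier separates arrival (countDown) from waiting (await). A minimal sketch of the "one thread waits for N events" case follows; the class name, runDemo, and the use of an AtomicInteger to count the events are our own illustration, not from the text.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: one thread waits for n events using a CountDownLatch.
// countDown() marks an event ("arrival") and never blocks; await()
// blocks until the internal count reaches zero.
public class LatchDemo {
    public static int runDemo(int n) {
        CountDownLatch latch = new CountDownLatch(n);
        AtomicInteger events = new AtomicInteger(0);
        for (int i = 0; i < n; i++) {
            new Thread(() -> {
                events.incrementAndGet(); // the "event" we are waiting for
                latch.countDown();        // signal arrival; never blocks
            }).start();
        }
        try {
            latch.await();                // block until all n events happened
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return events.get();              // all n increments are visible here
    }

    public static void main(String[] args) {
        System.out.println("events seen: " + runDemo(5));
    }
}
```

Unlike the CyclicBarrier in Fig. 6.5, this latch is spent after one use: a second round of events would require a new CountDownLatch.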
#include <omp.h>
#include <stdio.h>

#define N 1000

extern double big_computation(int, int);
extern void   consume_results(int, double, double *);

int main()
{
   double global_result[N];

   #pragma omp parallel shared(global_result)
   {
      double local_result;
      int i;
      int ID = omp_get_thread_num();   // set a thread ID

      #pragma omp for
      for (i = 0; i < N; i++) {
         local_result = big_computation(ID, i);   // carry out the UE's work
         #pragma omp critical
         {
            consume_results(ID, local_result, global_result);
         }
      }
   }
   return 0;
}
The for pragma is the OpenMP construct that tells the compiler to distribute the iterations of the loop among a team of threads. After big_computation() is complete, the results need to be combined into the global data structure that will hold the result.

While we don't show the code, assume the update within consume_results() can be done in any order, but the update by one thread must complete before another thread can execute an update. The critical pragma accomplishes this for us. The first thread to finish its big_computation() enters the enclosed block of code and calls consume_results(). If a thread arrives at the top of the critical section block while another thread is processing the block, it waits until the prior thread is finished.

The critical section is an expensive synchronization operation. Upon entry to a critical section, a thread flushes all visible variables to ensure that a consistent view of the memory is seen inside the critical section. At the end of the critical section, we need any memory updates occurring within the critical section to be visible to the other threads in the team, so a second flush of all thread-visible variables is required.

The critical section construct is not only expensive, it is not very general. It cannot be used among subsets of threads within a team or to provide mutual exclusion between different blocks of code. Thus, the OpenMP API provides a lower-level and more flexible construct for mutual exclusion called a lock.

Locks in different shared-memory APIs tend to be similar. The programmer declares the lock and initializes it. Only one thread at a time is allowed to hold the lock. Other threads trying to acquire the lock will block. Blocking while waiting for a lock is inefficient, so many lock APIs allow threads to test a lock's availability without trying to acquire it. Thus, a thread can opt to do useful work and come back to attempt to acquire the lock later.

Consider the use of locks in OpenMP. The example in Fig. 6.7 shows use of a simple lock to make sure only one thread at a time attempts to write to standard output.

The program first declares the lock to be of type omp_lock_t. This is an opaque object, meaning that as long as the programmer only manipulates lock objects through the OpenMP runtime library, the programmer can safely work with the locks without ever considering the details of the lock type. The lock is then initialized with a call to omp_init_lock.
Figure 6.7. Example of using locks in OpenMP

#include <omp.h>
#include <stdio.h>

int main()
{
   omp_lock_t lock;        // declare the lock using the lock
                           // type defined in omp.h
   omp_set_num_threads(5);
   omp_init_lock(&lock);   // initialize the lock

   #pragma omp parallel shared(lock)
   {
      int id = omp_get_thread_num();

      omp_set_lock(&lock);
      printf("\n only thread %d can do this print\n", id);
      omp_unset_lock(&lock);
   }
}
The Java language provides support for mutual exclusion with the synchronized block construct and also, in Java 1.5, with new lock classes contained in the package java.util.concurrent.locks. Every object in a Java program implicitly contains its own lock. Each synchronized block has an associated object, and a thread must acquire the lock on that object before executing the body of the block. When the thread exits the body of the synchronized block, whether normally or abnormally by throwing an exception, the lock is released.

In Fig. 6.8, we provide a Java version of the example given previously. The work done by the threads is specified by the run method in the nested Worker class. Note that because N is declared to be final (and is thus immutable), it can safely be accessed by any thread without requiring any synchronization.

For the synchronized blocks to exclude each other, they must be associated with the same object. In the example, the synchronized block in the run method uses this.getClass() as an argument. This expression returns a reference to the runtime object representing the Worker class. This is a convenient way to ensure that all callers use the same object. We could also have introduced a global instance of java.lang.Object (or an instance of any other class) and used that as the argument to the synchronized block. What is important is that all the threads synchronize on the same object. A potential mistake would be to use, say, this, which would not enforce the desired mutual exclusion because each worker thread would be synchronizing on itself and thus locking a different lock.

Because this is a very common misunderstanding and source of errors in multithreaded Java programs, we emphasize again that a synchronized block only protects a critical section from access by other threads whose conflicting statements are also enclosed in a synchronized block with the same object as an argument. Synchronized blocks associated with different objects do not exclude each other. (They also do not guarantee memory synchronization.) Also, the presence of a synchronized block in a method does not constrain code that is not in a synchronized block. Thus, forgetting a needed synchronized block or making a mistake with the argument to the synchronized block can have serious consequences.
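As a minimal sketch of the explicit lock classes mentioned above (the class name, runDemo, and the shared counter are our own illustration), a ReentrantLock protects a shared update the same way a synchronized block would; its tryLock method additionally lets a thread test availability without blocking, as discussed earlier for lock APIs in general:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: mutual exclusion with an explicit ReentrantLock. lock() blocks
// until the lock is acquired; the try/finally ensures the lock is always
// released, even if the critical section throws.
public class LockDemo {
    static final ReentrantLock lock = new ReentrantLock();
    static int counter = 0;   // shared data protected by the lock

    public static int runDemo(int nThreads, int nIncrements) {
        counter = 0;   // reset so repeated calls are deterministic
        Thread[] t = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            t[i] = new Thread(() -> {
                for (int k = 0; k < nIncrements; k++) {
                    lock.lock();        // enter the critical section
                    try {
                        counter++;      // one thread at a time updates this
                    } finally {
                        lock.unlock();  // always release the lock
                    }
                }
            });
            t[i].start();
        }
        for (Thread th : t) {
            try { th.join(); } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter;   // nThreads * nIncrements when exclusion holds
    }

    public static void main(String[] args) {
        System.out.println("counter = " + runDemo(4, 1000));
    }
}
```

Without the lock, the counter++ statements from different threads could interleave and lose updates; with it, the final count is deterministic.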
The code in Fig. 6.8 is structured similarly to the OpenMP example. A more common approach used in Java programs is to encapsulate a shared data structure in a class and provide access only through synchronized methods. A synchronized method is just a special case of a synchronized block that includes an entire method.
Figure 6.8. Java version of the OpenMP program in Fig. 6.6
public class Example
{
   static final int N = 10;
   static double[] global_result = new double[N];

   public static void main(String[] args) throws InterruptedException
   {
      //create and start N threads
      Thread[] t = new Thread[N];
      for (int i = 0; i != N; i++)
         { t[i] = new Thread(new Worker(i)); t[i].start(); }

      //wait for all N threads to finish
      for (int i = 0; i != N; i++){ t[i].join(); }

      //print the results
      for (int i = 0; i != N; i++)
         { System.out.print(global_result[i] + " "); }
      System.out.println("done");
   }

   static class Worker implements Runnable
   {
      double local_result;
      int i;
      int ID;

      Worker(int ID){ this.ID = ID; }

      //main work of threads
      public void run()
      {
         //perform the main computation
         local_result = big_computation(ID, i);

         //update global variables in synchronized block
         synchronized(this.getClass())
            { consume_results(ID, local_result, global_result); }
      }

      //define computation
      double big_computation(int ID, int i){ . . . }

      //define result update
      void consume_results(int ID, double local_result,
                           double[] global_result){ . . . }
   }
}
This moves the responsibility for protecting updates from the class defining the worker threads to the class owning the global_result array.
Figure 6.9. Java program showing how to implement mutual exclusion with a synchronized method
public class Example2
{
   static final int N = 10;
   private static double[] global_result = new double[N];

   public static void main(String[] args) throws InterruptedException
   {
      //create and start N threads
      Thread[] t = new Thread[N];
      for (int i = 0; i != N; i++)
         { t[i] = new Thread(new Worker(i)); t[i].start(); }

      //wait for all threads to terminate
      for (int i = 0; i != N; i++){ t[i].join(); }

      //print results
      for (int i = 0; i != N; i++)
         { System.out.print(global_result[i] + " "); }
      System.out.println("done");
   }

   //synchronized method serializing the consume_results update
   synchronized static void consume_results(int ID, double local_result)
   {
      global_result[ID] = . . .
   }

   static class Worker implements Runnable
   {
      double local_result;
      int i;
      int ID;

      Worker(int ID){ this.ID = ID; }

      public void run()
      {
         //perform the main computation
         local_result = big_computation(ID, i);   //carry out the UE's work

         //invoke method to update results
         Example2.consume_results(ID, local_result);
      }
   }
}
As is the case with most of the synchronization constructs, mutual exclusion is only needed when the statements execute within a shared context. Hence, a shared-nothing API such as MPI does not provide support for critical sections directly within the standard. Consider the OpenMP program in Fig. 6.6. If we want to implement a similar method in MPI with a complex data structure that has to be updated by one UE at a time, the typical approach is to dedicate a process to this update. The other processes would then send their contributions to the dedicated process. We show this situation in Fig. 6.10.

This program uses the SPMD pattern. As with the OpenMP program in Fig. 6.6, we have a loop to carry out N calls to big_computation(), the results of which are consumed and placed in a single global data structure. Updates to this data structure must be protected so that results from only one UE at a time are applied.

We arbitrarily choose the UE with the highest rank to manage the critical section. This process then executes a loop and posts N receives. By using the MPI_ANY_SOURCE and MPI_ANY_TAG values in the MPI_Recv() statement, the messages holding results from the calls to big_computation() are taken in any order. If the tag or ID are required, they can be recovered from the status variable returned from MPI_Recv().

The other UEs carry out the N calls to big_computation(). Because one UE has been dedicated to managing the critical section, the effective number of processes in the computation is decreased by one. We use a cyclic distribution of the loop iterations as was described in the Examples section of the SPMD pattern. This assigns the loop iterations in a round-robin fashion. After a UE completes its computation, the result is sent to the process managing the critical section.[5]
[5]
We used a synchronous send (MPI_Ssend()), which does not return until a matching MPI receive has been posted, to duplicate the behavior of shared-memory mutual exclusion as closely as possible. It is worth noting, however, that in a distributed-memory environment, making the sending process wait until the message has been received is only rarely needed and adds additional parallel overhead. Usually, MPI programmers go to great lengths to avoid parallel overheads and would only use synchronous message passing as a last resort. In this example, the standard-mode message-passing functions, MPI_Send() and MPI_Recv(), would be a better choice unless either (1) a condition external to the communication requires the two processes to satisfy an ordering constraint, hence forcing them to synchronize with each other, or (2) communication buffers or another system resource limit the capacity of the computer receiving the messages, thereby forcing the processes on the sending side to wait until the receiving side is ready.
Figure 6.10. Example of an MPI program with an update that requires mutual exclusion. A single process is dedicated to the update of this data structure.
#include <mpi.h>    // MPI include file
#include <stdio.h>

#define N 1000

extern void   consume_results(int, double, double *);
extern double big_computation(int, int);

int main(int argc, char **argv)
{
   int Tag1 = 1;  int Tag2 = 2;   // message tags
   int num_procs;                 // number of processes in group
   int ID;                        // unique identifier from 0 to (num_procs-1)
   double local_result, global_result[N];
   int i, ID_CRIT;
   MPI_Status stat;               // MPI status parameter

   // Initialize MPI and set up the SPMD program
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

   // Need at least two processes for this method to work
   if (num_procs < 2) MPI_Abort(MPI_COMM_WORLD, -1);

   // Dedicate the last process to managing update of final result
   ID_CRIT = num_procs - 1;

   if (ID == ID_CRIT) {
      int ID_sender;   // variable to hold ID of sender
      for (i = 0; i < N; i++) {
         MPI_Recv(&local_result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                  MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
         ID_sender = stat.MPI_SOURCE;
         consume_results(ID_sender, local_result, global_result);
      }
   }
   else {
      num_procs--;
      for (i = ID; i < N; i += num_procs) {   // cyclic distribution of iterations
         local_result = big_computation(ID, i);   // carry out UE's work

         // Send local result using a synchronous send - a send that doesn't
         // return until a matching receive has been posted.
         MPI_Ssend(&local_result, 1, MPI_DOUBLE, ID_CRIT, ID,
                   MPI_COMM_WORLD);
      }
   }
   MPI_Finalize();
   return 0;
}
6.4. COMMUNICATION
In most parallel algorithms, UEs need to exchange information as the computation proceeds. Shared-memory environments provide this capability by default, and the challenge in these systems is to synchronize access to shared memory so that the results are correct regardless of how the UEs are scheduled. In distributed-memory systems, however, it is the other way around: Because there are few, if any, shared resources, the need for explicit synchronization to protect these resources is rare. Communication, however, becomes a major focus of the programmer's effort.

6.4.1. Message Passing

A message is the most basic communication element. A message consists of a header containing data about the message (for example, source, destination, and a tag) and a series of bits to be communicated. Message passing is typically two-sided, meaning that a message is explicitly sent between a pair of UEs, from a specific source to a specific destination. In addition to direct message passing, there are communication events that involve multiple UEs (usually all of them) in a single communication event. We call this collective communication. These collective communication events often include computation as well, the classic example being a global summation where a set of values distributed about the system is summed into a single value on each node.

In the following sections, we will explore communication mechanisms in more detail. We start with basic message passing in MPI, OpenMP, and Java. Then, we consider collective communication, looking in particular at the reduction operation in MPI and OpenMP. We close with a brief look at other approaches to managing communication in parallel programs.
MPI: message passing
Message passing between a pair of UEs is the most basic of the communication operations and provides a natural starting point for our discussion. A message is sent by one UE and received by another. In its most basic form, the send and the receive operations are paired.

As an example, consider the MPI program in Fig. 6.11 and 6.12. In this program, a ring of processors work together to iteratively compute the elements within a field (field in the program). To keep the problem simple, we assume the dependencies in the update operation are such that each UE only needs information from its neighbor to the left to update the field.

We show only the parts of the program relevant to the communication. The details of the actual update operation, initialization of the field, or even the structure of the field and boundary data are omitted.

The program in Fig. 6.11 and 6.12 declares its variables and then initializes the MPI environment. This is an instance of the SPMD pattern where the logic within the parallel algorithm will be driven by the process rank ID and the number of processes in the team.

After initializing field, we set up the communication pattern by computing two variables, left and right, that identify each process's neighbors in the ring.
#include <stdio.h>
#include "mpi.h"    // MPI include file

#define IS_ODD(x) ((x)%2)   // test for an odd int

// prototypes for functions to initialize the problem, extract
// the boundary region to share, and perform the field update.
// The contents of these functions are not provided.
extern void init (int, int, double *, double *, double *);
extern void extract_boundary (int, int, int, int, double *);
extern void update (int, int, int, int, double *, double *);
extern void output_results (int, double *);

int main(int argc, char **argv)
{
   int Tag1 = 1;       // message tag
   int nprocs;         // the number of processes in the group
   int ID;             // the process rank
   int Nsize;          // Problem size (order of field matrix)
   int Bsize;          // Number of doubles in the boundary
   int Nsteps;         // Number of iterations
   double *field, *boundary, *incoming;
   int i, left, right;
   MPI_Status stat;    // MPI status parameter

   //
   // Initialize MPI and set up the SPMD program
   //
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

   init(Nsize, Bsize, field, boundary, incoming);

   /* continued in next figure */
return until the relevant receive has been posted.
OpenMP: message passing
/* continued from previous figure */

   // assume a ring of processors and a communication pattern
   // where boundaries are shifted to the right.
   left = (ID+1);  if (left > (nprocs-1)) left = 0;
   right = (ID-1); if (right < 0) right = nprocs-1;

   for (i = 0; i < Nsteps; i++) {
      extract_boundary(Nsize, Bsize, ID, nprocs, boundary);
      if (IS_ODD(ID)) {
         MPI_Send (boundary, Bsize, MPI_DOUBLE, right, Tag1,
                   MPI_COMM_WORLD);
         MPI_Recv (incoming, Bsize, MPI_DOUBLE, left, Tag1,
                   MPI_COMM_WORLD, &stat);
      }
      else {
         MPI_Recv (incoming, Bsize, MPI_DOUBLE, left, Tag1,
                   MPI_COMM_WORLD, &stat);
         MPI_Send (boundary, Bsize, MPI_DOUBLE, right, Tag1,
                   MPI_COMM_WORLD);
      }
      update(Nsize, Bsize, ID, nprocs, field, incoming);
   }
   output_results(Nsize, field);
   MPI_Finalize();
   return 0;
}
shared data structure. In MPI, the synchronization required by the problem is implied by the message-passing functions. In OpenMP, the synchronization is explicit. We start with a barrier to make sure all threads are at the same point (that is, ready to fill the boundary they will share). A second barrier ensures that all the threads have finished filling their boundary before the receiving side uses the data.

This strategy is extremely effective for problems that involve the same amount of work on each thread. When some threads are faster than other threads, because either the hardware is less loaded or perhaps the data is not evenly distributed, the use of barriers leads to excessive parallel overhead as threads wait at the barrier. The solution is to more finely tune the synchronization to the needs of the problem, which, in this case, would indicate that pairwise synchronization should be used.

The "message-passing" OpenMP program in Fig. 6.15 shows a simple way to introduce pairwise synchronization in OpenMP. This code is much more complicated than the OpenMP program in Fig. 6.13 and Fig. 6.14. Basically, we have added an array called done that a thread uses to indicate that its buffer (that is, the boundary data) is ready to use. A thread fills its buffer, sets the flag, and then flushes it to make sure other threads can see the updated value. The thread then checks the flag for its neighbor and waits (using a so-called spin lock) until the buffer from its neighbor is ready.
Figure 6.13. OpenMP program that uses a ring of threads and a communication pattern where information is shifted to the right (continued in Fig. 6.14)
#include <stdio.h>
#include <omp.h>    // OpenMP include file

#define MAX 10   // maximum number of threads

//
// prototypes for functions to initialize the problem,
// extract the boundary region to share, and perform the
// field update. Note: the initialize routine is different
// here in that it sets up a large shared array (that is, for
// the full problem), not just a local block.
//
extern void init (int, int, double *, double *, double *);
extern void extract_boundary (int, int, double *, double *);
extern void update (int, int, double *, double *);
extern void output_results (int, double *);

int main(int argc, char **argv)
{
   int Nsize;              // Problem size (order of field matrix)
   int Bsize;              // Number of doubles in the boundary
   int Nsteps;             // Number of iterations
   double *field;
   double *incoming;
   double *boundary[MAX];  // array of pointers to a buffer to hold
                           // boundary data
   //
   // Create Team of Threads
   //
   init (Nsize, Bsize, field, boundary, incoming);

   /* continued in next figure */
/* continued from previous figure */

   #pragma omp parallel shared(boundary, field, Bsize, Nsize)
   {
      //
      // Set up the SPMD program. Note: by declaring ID and nprocs
      // inside the parallel region, we make them private to each thread.
      //
      int ID, nprocs, i, left;
      ID = omp_get_thread_num();
      nprocs = omp_get_num_threads();
      if (nprocs > MAX) { exit(-1); }

      //
      // assume a ring of processors and a communication pattern
      // where boundaries are shifted to the right.
      //
      left = (ID-1); if (left < 0) left = nprocs-1;

      for (i = 0; i < Nsteps; i++) {
         #pragma omp barrier
         extract_boundary(Nsize, Bsize, ID, nprocs, boundary[ID]);
         #pragma omp barrier
         update(Nsize, Bsize, ID, nprocs, field, boundary[left]);
      }
   }
   int done[MAX];   // an array of flags dimensioned for the
                    // boundary data
   //
   // Create Team of Threads
   //
   init (Nsize, Bsize, field, boundary, incoming);

   #pragma omp parallel shared(boundary, field, Bsize, Nsize)
   {
      //
      // Set up the SPMD program. Note: by declaring ID and nprocs
      // inside the parallel region, we make them private to each thread.
      //
      int ID, nprocs, i, left;
      ID = omp_get_thread_num();
      nprocs = omp_get_num_threads();
      if (nprocs > MAX) { exit(-1); }

      //
      // assume a ring of processors and a communication pattern
      // where boundaries are shifted to the right.
      //
      left = (ID-1); if (left < 0) left = nprocs-1;

      done[ID] = 0;   // set flag stating "buffer ready to fill"

      for (i = 0; i < Nsteps; i++) {
         #pragma omp barrier   // all visible variables flushed, so we
                               // don't need to flush "done".
         extract_boundary(Nsize, Bsize, ID, nprocs, boundary[ID]);

         done[ID] = 1;   // flag that the buffer is ready to use
         #pragma omp flush (done)

         while (done[left] != 1) {
            #pragma omp flush (done)
         }

         update(Nsize, Bsize, ID, nprocs, field, boundary[left]);

         done[left] = 0;   // set flag stating "buffer ready to fill"
      }
      output_results(Nsize, field);
   }
Although the java.io, java.net, and java.rmi.* packages provide very convenient programming abstractions and work very well in the domain for which they were designed, they incur high parallel overheads and are considered to be inadequate for high-performance computing. High-performance computers typically use homogeneous networks of computers, so the general-purpose distributed-computing approaches supported by Java's TCP/IP support result in data conversions and checks that are not usually needed in high-performance computing. Another source of parallel overhead is the emphasis in Java on portability. This led to lowest-common-denominator facilities in the design of Java's networking support. A major problem is blocking I/O. This means, for example, that a read operation on a socket will block until data is available. Stalling the application is usually avoided in practice by creating a new thread to perform the read. Because a read operation can be applied to only one socket at a time, this leads to a cumbersome and nonscalable programming style with separate threads for each communication partner.

To remedy some of these deficiencies (which were also affecting high-performance enterprise servers), new I/O facilities in the java.nio packages were introduced in Java 1.4. Because the Java specification has now been split into three separate versions (enterprise, core, and mobile), the core Java and enterprise Java APIs are no longer restricted to supporting only features that can be supported on the least capable devices. Results reported by Pugh and Sacco [PS04] indicate that the new facilities promise to provide sufficient performance to make Java a reasonable choice for high-performance computing in clusters.

The java.nio packages provide nonblocking I/O and selectors so that one thread can monitor several socket connections. The mechanism for this is a new abstraction, called a channel, that serves as an open connection to sockets, files, hardware devices, etc. A SocketChannel is used for communication over TCP or UDP connections. A program may read or write to or from a SocketChannel into a ByteBuffer using nonblocking operations. Buffers are another new abstraction introduced in Java 1.4. They are containers for linear, finite sequences of primitive types.
They maintain a state containing a current position (along with some other information) and are accessed with "put" and "get" operations that put or get an element in the slot indicated by the current position. Buffers may be allocated as direct or indirect. The space for a direct buffer is allocated
outside of the usual memory space managed by the JVM and subject to garbage collection. As a result, direct buffers are not moved by the garbage collector, and references to them can be passed to the system-level network software, eliminating a copying step by the JVM. Unfortunately, this comes at the price of more complicated programming, because put and get operations are not especially convenient to use. Typically, however, one would not use a Buffer for the main computation in the program, but would use a more convenient data structure such as an array that would be bulk copied to or from the Buffer.[6]

Many parallel programmers desire higher-level, or at least more familiar, communication abstractions than those supported by the standard packages in Java. Several groups have implemented MPI-like bindings for Java using various approaches. One approach uses the JNI (Java Native Interface) to bind to existing MPI libraries. JavaMPI [Min97] and mpiJava [BCKL98] are examples. Other approaches use new communication systems written in Java. These are not widely used and are in various states of availability. Some [JCS98] attempt to provide an MPI-like experience for the programmer following proposed standards described in [BC00], while others experiment with alternative models [Man]. An overview is given in [AJMJS02]. These systems typically use the old java.io because the java.nio package has only recently become available. An exception is the work reported by Pugh and Sacco [PS04]. Although they have not, at this writing, provided downloadable software, they describe a package providing a subset of MPI with performance measurements that suggest an optimistic outlook for the potential of Java with java.nio for high-performance computing in clusters. One can expect that in the near future more packages built on java.nio that simplify parallel programming will become available.

6.4.2. Collective Communication

When multiple (more than two) UEs participate in a single communication event, the event is called a collective communication operation. MPI, as a message-passing library, includes most of the major collective communication operations.
A reduction operation reduces a collection of data items to a single data item by repeatedly combining the items with a binary operator. That is, given data items v0, ..., v(m-1), a reduction computes

v0 o v1 o ... o v(m-1)    (Eq. 6.1)

where o is a binary operator. The most general way to implement a reduction is to perform the calculation serially from left to right, an approach that offers no potential for concurrent execution. If o is associative, however, the calculation in Eq. 6.1 contains exploitable concurrency, in that partial results over subsets of v0, ..., v(m-1) can be calculated concurrently and then combined. If o is also commutative, the calculation can be not only regrouped, but also reordered, opening up additional possibilities for concurrent execution, as described later.

Not all reduction operators have these useful properties, however, and one question to be considered is whether the operator can be treated as if it were associative and/or commutative without significantly changing the result of the calculation. For example, floating-point addition is not strictly associative (because the finite precision with which numbers are represented can cause roundoff errors, especially if the difference in magnitude between operands is large), but if all the data items to be added have roughly the same magnitude, it is usually close enough to associative to permit the parallelization strategies discussed next. If the data items vary considerably in magnitude, this may not be the case.

Most parallel programming environments provide constructs that implement reduction.
As an example of a reduction, consider the MPI program in Fig. 6.16. This is a continuation of our earlier example where we used a barrier to enforce consistent timing (Fig. 6.3). The timing is measured for each UE independently. As an indication of the load imbalance for the program, it is useful to report the time as a minimum time, a maximum time, and an average time. We do this in Fig. 6.16 with three calls to MPI_Reduce: one with the MPI_MIN function, one with the MPI_MAX function, and one with the MPI_SUM function.

An example of reduction in OpenMP can be found in Fig. 6.17. This is an OpenMP version of the program in Fig. 6.16. The variables for the number of threads (num_threads) and the average time (ave_time) are declared prior to the parallel region so they can be visible both inside and following the parallel region. We use a reduction clause on the parallel pragma to indicate that we will be doing a reduction with the + operator on the variable ave_time. The compiler will create a local copy of the variable and initialize it to the value of the identity of the operator in question (zero for
addition). Each thread sums its values into the local copy. Following the close of the parallel region, all the local copies are combined with the global copy to produce the final value.

Because we need to compute the average time, we need to know how many threads were used in the parallel region. The only way to safely determine the number of threads is by a call to the omp_get_num_threads() function inside the parallel region. Multiple writes to the same variable can interfere with each other, so we place the call to omp_get_num_threads() inside a single construct, thereby ensuring that only one thread will assign a value to the shared variable num_threads. A single in OpenMP implies a barrier, so we do not need to explicitly include one to make sure all the threads enter the timed code at the same time. Finally, note that we compute only the average time in Fig. 6.17 because min and max operators are not included in the C/C++ OpenMP version 2.0.
Figure 6.16. MPI program to time the execution of a function called runit(). We use MPI_Reduce to find minimum, maximum, and average runtimes.
#include "mpi.h"   // MPI include file
#include <stdio.h>

int main(int argc, char **argv)
{
   int num_procs;   // the number of processes in the group
   int ID;          // a unique identifier ranging from 0 to (num_procs-1)
   double local_result;
   double time_init, time_final, time_elapsed;
   double min_time, max_time, ave_time;

   //
   // Initialize MPI and set up the SPMD program
   //
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

   //
   // Ensure that all processes are set up and ready to go before timing runit()
   //
   MPI_Barrier(MPI_COMM_WORLD);
   time_init = MPI_Wtime();

   runit();   // a function that we wish to time on each process

   time_final = MPI_Wtime();
   time_elapsed = time_final - time_init;

   MPI_Reduce(&time_elapsed, &min_time, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
   MPI_Reduce(&time_elapsed, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
   MPI_Reduce(&time_elapsed, &ave_time, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

   if (ID == 0) {
      ave_time = ave_time / (double)num_procs;
      printf(" min, ave and max times (secs): %f, %f, %f\n",
             min_time, ave_time, max_time);
   }
   MPI_Finalize();
   return 0;
}
#include <stdio.h>
#include <omp.h>   // OpenMP include file

extern void runit();

int main(int argc, char **argv)
{
   int num_threads;        // the number of threads in the team
   double ave_time = 0.0;

   #pragma omp parallel reduction(+ : ave_time) num_threads(10)
   {
      double time_init, time_final, time_elapsed;

      // The single construct causes one thread to set the value of
      // the num_threads variable. The other threads wait at the
      // barrier implied by the single construct.
      #pragma omp single
         num_threads = omp_get_num_threads();

      time_init = omp_get_wtime();

      runit();   // a function that we wish to time on each thread

      time_final = omp_get_wtime();
      time_elapsed = time_final - time_init;
      ave_time += time_elapsed;   // combined across threads by the reduction
   }
   ave_time = ave_time / (double)num_threads;
If the reduction operator is not associative, or cannot be treated as associative without significantly affecting the result, it will likely be necessary to perform the entire reduction serially in a single UE, as sketched in Fig. 6.18. If only one UE needs the result of the reduction, it is probably simplest to have that UE perform the operation; if all UEs need the result, the reduction operation can be followed by a broadcast operation to communicate the result to other UEs. For simplicity, the figure shows a situation in which there are as many UEs as data items. The solution can be extended to situations in which there are more data items than UEs, but the requirement that all the actual computation be done in a single UE still holds because of the lack of associativity. (Clearly this solution completely lacks concurrency; we mention it for the sake of completeness and because it might still be useful and appropriate if the reduction operation represents only a relatively small part of a computation whose other parts do have exploitable concurrency.)
Figure 6.18. Serial reduction to compute the sum of a(0) through a(3). sum(a(i:j)) denotes the sum of elements i through j of array a.
In this approach, the individual combine-two-elements operations must be performed in sequence, so it is simplest to have them all performed by a single task. Because of the data dependencies (indicated by the arrows in the figure), however, some caution is required if the reduction operation is performed as part of a larger calculation involving multiple concurrent UEs; the UEs not performing the reduction can continue with other work only if they can do so without affecting the computation of the reduction operation (for example, if multiple UEs share access to an array and one of them is computing the sum of the elements of the array, the others should not be simultaneously modifying elements of the array). In a message-passing environment, this can usually be accomplished using message passing to enforce the data dependency constraints (that is, the UEs not performing the reduction operation send their data to the single UE actually doing the computation). In other environments, the UEs not performing the reduction operation can be forced to wait by means of a barrier.
Tree-based reduction
If all of the UEs must know the result of the reduction operation, then the recursive-doubling scheme of Fig. 6.20 is better than the tree-based approach followed by a broadcast.
Figure 6.20. Recursive-doubling reduction to compute the sum of a(0) through a(3). sum (a(i:j)) denotes the sum of elements i through j of array a.
As with the tree-based code, if the number of UEs is equal to 2^n, then the algorithm proceeds in n steps. At the beginning of the algorithm, every UE has some number of values to contribute to the reduction. These are combined locally into a single value to contribute to the reduction. In the first step, the even-numbered UEs exchange their partial sums with their odd-numbered neighbors. In the second stage, instead of immediate neighbors exchanging values, UEs two steps away interact. At the next stage, UEs four steps away interact, and so forth, doubling the reach of the interaction at each step until the reduction is complete.

At the end of n steps, each UE has a copy of the reduced value. Comparing this to the previous strategy of using a tree-based algorithm followed by a broadcast, we see the following: The reduction and the broadcast take n steps each, and the broadcast cannot begin until the reduction is complete, so the elapsed time is O(2n), and during these 2n steps many of the UEs are idle. The recursive-doubling algorithm, however, involves all the UEs at each step and produces the single reduced value at every UE after only n steps.
6.4.3. Other Communication Constructs

We have looked at only the most common message-passing constructs: point-to-point message passing and collective communication. There are many important variations in message passing. In MPI, it is possible to fine-tune performance by changing how the communication buffers are handled or to overlap communication and computation. Some of these mechanisms are described in the MPI appendix, Appendix B.

Notice that in our discussion of communication, there is a receiving UE and a sending UE. This is called two-sided communication. For some algorithms (as discussed in the Distributed Array pattern), the calculations involved in computing indices into a distributed data structure and mapping them onto the process owning the desired data can be very complicated and difficult to debug. Some of these problems can be avoided by using one-sided communication.

One-sided communication occurs when a UE manages communication with another UE without the explicit involvement of the other UE. For example, a message can be directly "put" into a buffer on another node without the involvement of the receiving UE. The MPI 2.0 standard includes a one-sided communication API. Another example of one-sided communication is an older system called GA or Global Arrays [NHL94, NHL96, NHK+02, Gloa]. GA provides a simple one-sided communication environment specialized to the problem of distributed array algorithms.
Another option is to replace explicit communication with a virtual shared memory, where the term "virtual" is used because the physical memory could be distributed. An approach that was popular in the early 1990s was Linda [CG91]. Linda is based on an associative virtual shared memory called a tuple space. The operations in Linda "put", "take", or "read" a set of values bundled together into an object called a tuple. Tuples are accessed by matching against a template, making the memory content-addressable or associative. Linda is generally implemented as a coordination language, that is, a small set of instructions that extend a normal programming language (the so-called computation language). Linda is no longer used to any significant extent, but the idea of an associative virtual shared memory inspired by Linda lives on in JavaSpaces [FHA99].

More recent attempts to hide message passing behind a virtual shared memory are the collection of languages based on the Partitioned Global Address Space model: UPC [UPC], Titanium [Tita], and Co-Array Fortran [Co]. These are explicitly parallel dialects of C, Java, and Fortran (respectively) based on a virtual shared memory. Unlike other shared memory models, such as Linda or OpenMP, the shared memory in the Partitioned Global Address Space model is partitioned and includes the concept of affinity of shared memory to particular processors. UEs can read and write each others' memory and perform bulk transfers, but the programming model takes into account nonuniform memory access, allowing the model to be mapped onto a wide range of machines, from SMPs to NUMA machines to clusters.
Endnotes
6. Pugh and Sacco [PS04] have reported that it is actually more efficient to do bulk copies between buffers and arrays before a send than to eliminate the array and perform the calculation updates directly on the buffers using put and get.
The results may differ slightly due to the nonassociativity of floating-point operations.
Figure A.1. Fortran and C programs that print a simple string to standard output
Incremental parallelism refers to a style of parallel programming in which a program evolves from a sequential program into a parallel program. A programmer starts with a working sequential program and block by block finds pieces of code that are worthwhile to execute in parallel. Thus, parallelism is added incrementally. At each phase of the process, there is a working program that can be verified, greatly increasing the chances that the project will be successful.

It is not always possible to use incremental parallelism or to create sequentially equivalent OpenMP programs. Sometimes a parallel algorithm requires complete restructuring of the analogous sequential program. In other cases, the program is constructed from the beginning to be parallel and there is no sequential program to incrementally parallelize. Also, there are parallel algorithms that do not work with one thread and hence cannot be sequentially equivalent. Still, incremental parallelism and sequential equivalence guided the design of the OpenMP API and are recommended practices.
or in Fortran with the directive
C$OMP PARALLEL
Modern languages such as C and C++ are block structured. Fortran, however, is not. Hence, the Fortran OpenMP specification defines a directive to close the parallel block:
C$OMP END PARALLEL
This pattern is used with other Fortran constructs in OpenMP; that is, one form opens the structured block, and a matching form with the word END inserted after the OMP closes the block.

In Fig. A.2, we show parallel programs in which each thread prints a string to the standard output. When this program executes, the OpenMP runtime system creates a number of threads, each of which will execute the instructions inside the parallel construct. If the programmer doesn't specify the number of threads to create, a default number is used. We will later show how to control the default number of threads, but for the sake of this example, assume it was set to three.

OpenMP requires that I/O be thread-safe. Therefore, each output record printed by one thread is printed completely without interference from other threads. The output from the program in Fig. A.2 would then look like:
E pur si muove E pur si muove E pur si muove
Figure A.3. Fortran and C programs that print a simple string to standard output
Figure A.4. Simple program to show the difference between shared and local (or private) data

#include <stdio.h>
#include <omp.h>

int main()
{
   int i = 5;   // a shared variable

   #pragma omp parallel
   {
      int c;    // a variable local or private to each thread
      c = omp_get_thread_num();
      printf("c = %d, i = %d\n", c, i);
   }
}
The number of threads to be used can also be set inside the program with the num_threads clause. For example, to create a parallel region with three threads, the programmer would use the pragma
#pragma omp parallel num_threads(3)
where directive-name identifies the construct and the optional clauses[3] modify the construct. Some examples of OpenMP pragmas in C or C++ follow:
[3]
Throughout this appendix, we will use square brackets to indicate optional syntactic elements.
#pragma omp parallel private(ii, jj, kk) #pragma omp barrier #pragma omp for reduction(+:result)
For Fortran, the situation is more complicated. We will consider only the simplest case, fixed-form[4] Fortran code. In this case, an OpenMP directive has the following form:
[4]
Fixed form refers to the fixed column conventions for statements in older versions of Fortran (Fortran 77 and earlier).
sentinel directive-name [clause[[,]clause] ... ]
where sentinel can be one of:
C$OMP    !$OMP    *$OMP
A.3. WORKSHARING
When using the parallel construct alone, every thread executes the same block of statements. There are times, however, when we need different code to map onto different threads. This is called worksharing.
The most commonly used worksharing construct in OpenMP is the construct to split loop iterations between different threads. Designing a parallel algorithm around parallel loops is an old tradition in parallel programming [X393]. This style is sometimes called loop splitting and is discussed at length in the Loop Parallelism pattern. In this approach, the programmer identifies the most time-consuming loops in the program. Each loop is restructured, if necessary, so the loop iterations are largely independent. The program is then parallelized by mapping different groups of loop iterations onto different threads.

For example, consider the program in Fig. A.5. In this program, a computationally intensive function big_comp() is called repeatedly to compute results that are then combined into a single global answer. For the sake of this example, we assume the following.
Figure A.5. Fortran and C examples of a typical loop-oriented program
Figure A.6. Fortran and C examples of a typical loop-oriented program. In this version of the program, the computationally intensive loop has been isolated and modified so the iterations are independent.
* The combine() routine does not take much time to run.

* The combine() function must be called in the sequential order.
The first step is to make the loop iterations independent. One way to accomplish this is shown in Fig. A.6. Because the combine() function must be called in the same order as in the sequential program, there is an extra ordering constraint introduced into the parallel algorithm. This creates a dependency between the iterations of the loop. If we want to run the loop iterations in parallel, we need to remove this dependency.

In this example, we've assumed the combine() function is simple and doesn't take much time. Hence, it should be acceptable to run the calls to combine() outside the parallel region. We do this by placing each intermediate result computed by big_comp() into an element of an array. Then the array elements can be passed to the combine() function in the sequential order in a separate loop. This code transformation preserves the meaning of the original program (that is, the results are identical between the parallel code and the original version of the program).

With this transformation, the iterations of the first loop are independent and they can be safely computed in parallel. To divide the loop iterations among multiple threads, an OpenMP worksharing construct is used. This construct assigns loop iterations from the immediately following loop onto a team of threads. Later we will discuss how to control the way the loop iterations are scheduled, but for now, we leave it to the system to figure out how the loop iterations are to be mapped onto the threads. The parallel versions are shown in Fig. A.7.
Figure A.7. Fortran and C examples of a typical loop-oriented program parallelized with OpenMP
The OpenMP parallel construct is used to create the team of threads. This is followed by the worksharing construct to split up loop iterations among the threads: a DO construct in the case of Fortran and a for construct for C/C++. The program runs correctly in parallel and preserves sequential equivalence because no two threads update the same variable and any operations (such as calls to the combine() function) that do not commute or are not associative are carried out in the sequential order. Notice that, according to the rules we gave earlier concerning the sharing of variables, the loop control variable i would be shared between threads. The OpenMP specification, however, recognizes that it never makes sense to share the loop control index on a parallel loop, so it automatically creates a private copy of the loop control index for each thread.

By default, there is an implicit barrier at the end of any OpenMP workshare construct; that is, all the threads wait at the end of the construct and only proceed after all of the threads have arrived. This barrier can be removed by adding a nowait clause to the worksharing construct:
#pragma omp for nowait
As a shortcut, OpenMP defines a combined construct:
#pragma omp parallel for . . .
This is identical to the case where the parallel and for constructs are placed within separate pragmas.
int main ()
{
   int i;
   int num_steps = 1000000;
   double x, pi, step, sum = 0.0;

   step = 1.0/(double) num_steps;

   #pragma omp parallel for private(x) reduction(+:sum)
   for (i = 0; i < num_steps; i++) {
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
   }
   pi = step * sum;
   printf("pi %lf\n", pi);
   return 0;
}
Several other clauses change how variables are shared between threads. The most commonly used ones follow.
firstprivate(list). Just as with private, for each name appearing in the list, a new private variable for that name is created for each thread. Unlike private, however, the newly created variables are initialized with the value of the variable that was bound to the name in the region of code preceding the construct containing the firstprivate clause. This clause can be used on both parallel and the worksharing constructs.

lastprivate(list). Once again, private variables for each thread are created for each name in the list. In this case, however, the value of the private variable from the sequentially last loop iteration is copied out into the variable bound to the name in the region following the OpenMP construct containing the lastprivate clause. This clause can only be used with the loop-oriented workshare constructs.
Variables generally can appear in the list of only a single data clause. The exception is with lastprivate and firstprivate, because it is quite possible that a private variable will need both a well-defined initial value and a value exported to the region following the OpenMP construct in question.

An example using these clauses is provided in Fig. A.9. We declare four variables private to each thread and assign values to three of them: h = 1, j = 2, and k = 0. Following the parallel loop, we print the values of h, j, and k. The variable k is well defined. As each thread executes a loop iteration, the variable k is incremented. It is therefore a measure of how many iterations each thread handled.
Figure A.9. C program showing use of the private, firstprivate, and lastprivate clauses. This program is incorrect in that the variables h and j do not have well-defined values when the printf is called. Notice the use of a backslash to continue the OpenMP pragma onto a second line.

#include <stdio.h>
#include <omp.h>
#define N 1000

int main()
{
   int h, i, j, k;
   h = 1; j = 2; k = 0;

   #pragma omp parallel for private(h) firstprivate(j,k) \
                            lastprivate(j,k)
   for (i = 0; i < N; i++) {
      k++;
      j = h + i;   // ERROR: h, and therefore j, is undefined
   }
   printf("h = %d, j = %d, k = %d\n", h, j, k);  // ERROR: j and h
                                                 // are undefined
}
The value passed outside the loop because of the lastprivate clause is the value of k for whichever thread executed the sequentially last iteration of the loop (that is, the iteration for which i = 999). The values of both h and j are undefined, but for different reasons. The variable j is undefined because it was assigned the value from a sum with an uninitialized variable (h) inside the parallel loop. The problem with the variable h is more subtle. It was declared as a private variable, but its value was unchanged inside the parallel loop. OpenMP stipulates, however, that after a name appears in any of the private clauses, the variable associated with that name in the region of code following the OpenMP construct is undefined. Hence, the print statement following the parallel for does not have a well-defined value of h to print.

A few other clauses affect the way variables are shared, but we do not use them in this book and hence will not discuss them here.

The final clause we will discuss that affects how data is shared is the reduction clause. Reductions were discussed at length in the Implementation Mechanisms design space. A reduction is an operation that, using a binary, associative operator, combines a set of values into a single value. Reductions are very common and are included in most parallel programming environments. In OpenMP, the reduction clause defines a list of variable names and a binary operator. For each name in the list, a private variable is created and initialized with the value of the identity element for the binary operator (for example, zero for addition). Each thread carries out the reduction into its copy of the local variable associated with each name in the list. At the end of the construct containing the reduction clause, the local values are combined with the associated value prior to the OpenMP construct in question to define a single value. This value is assigned to the variable with the same name in the region following the OpenMP construct containing the reduction.

An example of a reduction was given in Fig. A.8. In the example, the reduction clause was applied with the + operator to compute a summation and leave the result in the variable sum.
Although the most common reduction involves summation, OpenMP also supports reductions in Fortran for the operators +, *, -, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, IOR, and IEOR. In C and C++, OpenMP supports reduction with the standard C/C++ operators +, *, -, &, |, ^, &&, and ||. C and C++ do not include a number of useful intrinsic functions such as "min" or "max" within the language definition. Hence, OpenMP cannot provide reductions in such cases; if they are required, the programmer must code them explicitly by hand. More details about reduction in OpenMP are given in the OpenMP specification [OMP].
The output records would be complete and nonoverlapping because OpenMP requires that the I/O libraries be thread-safe. Which thread prints its record when, however, is not specified, and any valid interleaving of the output records can occur.
A.6. SYNCHRONIZATION
Many OpenMP programs can be written using only the parallel and parallel for (parallel do in Fortran) constructs. There are algorithms, however, where one needs more careful control over how variables are shared. When multiple threads read and write shared data, the programmer must ensure that the threads do not interfere with each other, so that the program returns the same results regardless of how the threads are scheduled. This is of critical importance since as a multithreaded program runs, any semantically allowed interleaving of the instructions could actually occur. Hence, the programmer must manage reads and writes to shared variables to ensure that threads read the correct value and that multiple threads do not try to write to a variable at the same time.

Synchronization is the process of managing shared resources so that reads and writes occur in the correct order regardless of how the threads are scheduled. The concepts behind synchronization are discussed in detail in the section on synchronization (Section 6.3) in the Implementation Mechanisms design space. Our focus here will be on the syntax and use of synchronization in OpenMP.

Consider the loop-based program in Fig. A.5 earlier in this chapter. We used this example to introduce worksharing in OpenMP, at which time we assumed the combination of the computed results (res) did not take much time and had to occur in the sequential order. Hence, it was no problem to store intermediate results in a shared array and later (within a serial region) combine the results into the final answer.

In the more common case, however, results of big_comp() can be accumulated in any order as long as the accumulations do not interfere. To make things more interesting, we will assume that the combine() and big_comp() routines are both time-consuming and take unpredictable and widely varying amounts of time to execute. Hence, we need to bring the combine() function into the parallel region and use synchronization constructs to ensure that parallel calls to the combine() function do not interfere.

The major synchronization constructs in OpenMP are the following.
Typically it is needed only by programmers building their own low-level synchronization primitives.
where name is an identifier that can be used to support disjoint sets of critical sections. A critical section implies a call to flush on entry to and on exit from the critical section.
barrier provides a synchronization point at which the threads wait until every member of the team has arrived before any threads continue. The syntax of a barrier is
#pragma omp barrier
void omp_init_lock(omp_lock_t *lock). Initialize the lock.

void omp_destroy_lock(omp_lock_t *lock). Destroy the lock, thereby freeing any memory associated with the lock.

void omp_set_lock(omp_lock_t *lock). Set or acquire the lock. If the lock is free, then the thread calling omp_set_lock() will acquire the lock and continue. If the lock is held by another thread, the thread calling omp_set_lock() will wait on the call until the lock is available.

void omp_unset_lock(omp_lock_t *lock). Unset or release the lock so some other thread can acquire it.

int omp_test_lock(omp_lock_t *lock). Test or inquire if the lock is available. If it is, then the thread calling this function will acquire the lock and continue. The test and acquisition of the lock is done atomically. If it is not, the function returns false (zero) and the calling thread continues. This function is used so a thread can do useful work while waiting for a lock.
The lock functions guarantee that the lock variable itself is consistently updated between threads, but do not imply a flush of other variables. Therefore, programmers using locks must call flush explicitly as needed. An example of a program using OpenMP locks is shown in Fig. A.12. The program declares and then initializes the lock variables at the beginning of the program. Because this occurs prior to the parallel region, the lock variables are shared between the threads. Inside the parallel region, the first lock is used to make sure only one thread at a time tries to print a message to the standard output. The lock is needed to ensure that the two printf statements are executed together and not interleaved with those of other threads. The second lock is used to ensure that only one thread at a time executes the go_for_it() function, but this time, the omp_test_lock() function is used so a thread can do useful work while waiting for the lock. After the parallel region completes, the memory associated with the locks is freed by a call to omp_destroy_lock().
Figure A.12. Example showing how the lock functions in OpenMP are used
#include <stdio.h>
#include <omp.h>

extern void do_something_else(int);
extern void go_for_it(int);

int main()
{
   omp_lock_t lck1, lck2;
   int id;

   omp_init_lock(&lck1);
   omp_init_lock(&lck2);

   #pragma omp parallel shared(lck1, lck2) private(id)
   {
      id = omp_get_thread_num();

      omp_set_lock(&lck1);
      printf("thread %d has the lock \n", id);
      printf("thread %d ready to release the lock \n", id);
      omp_unset_lock(&lck1);

      while (! omp_test_lock(&lck2)) {
         do_something_else(id);  // do something useful while waiting
                                 // for the lock
      }
      go_for_it(id);             // Thread has the lock
      omp_unset_lock(&lck2);
   }
   omp_destroy_lock(&lck1);
   omp_destroy_lock(&lck2);
}
schedule(static [,chunk]). The iteration space is divided into blocks of size chunk. If chunk is omitted, then the block size is selected to provide one approximately equal-sized block per thread. The blocks are dealt out to the threads that make up a team in a round-robin fashion. For example, a chunk size of 2 for 3 threads and 12 iterations will create 6 blocks containing iterations (0,1), (2,3), (4,5), (6,7), (8,9), and (10,11) and assign them to threads as [(0,1), (6,7)] to one thread, [(2,3), (8,9)] to another thread, and [(4,5), (10,11)] to the last thread.

schedule(dynamic [,chunk]). The iteration space is divided into blocks of size chunk. If chunk is omitted, the block size is set to 1. Each thread is initially given one block of iterations to work with. The remaining blocks are placed in a queue. When a thread finishes with its current block, it pulls the next block of iterations that need to be computed off the queue. This continues until all the iterations have been computed.

schedule(guided [,chunk]). This is a variation of the dynamic schedule optimized to decrease scheduling overhead. As with dynamic scheduling, the loop iterations are divided into blocks, and each thread is assigned one block initially and then receives an additional block when it finishes the current one. The difference is the size of the blocks: The size of the first block is implementation dependent, but large; block size is decreased rapidly for subsequent blocks, down to the value specified by chunk. This method has the benefits of the dynamic schedule, but by starting with large block sizes, the number of scheduling decisions at runtime, and hence the parallel overhead, is greatly reduced.

schedule(runtime). The runtime schedule stipulates that the actual schedule and chunk size for the loop is to be taken from the value of the environment variable OMP_SCHEDULE. This lets a programmer try different schedules without having to recompile for each trial.
Consider the loop-based program in Fig. A.13. Because the schedule clause is the same for both Fortran and C, we will only consider the case for C. If the runtimes associated with different loop iterations change unpredictably as the program runs, a static schedule is probably not going to be effective. We will therefore use the dynamic schedule. Scheduling overhead is a serious problem, however, so to minimize the number of scheduling decisions, we will start with a block size of 10 iterations per scheduling decision. There are no firm rules, however, and OpenMP programmers usually experiment with a range of schedules and chunk sizes until the optimum values are found. For example, parallel loops in programs such as the one in Fig. A.13 can also be effectively scheduled with a static schedule as long as the chunk size is small enough that work is equally distributed among threads.
Figure A.13. Parallel version of the program in Fig. A.11, modified to show the use of the schedule clause

#include <stdio.h>
#include <omp.h>
#define N 1000
extern void combine(double,double);
extern double big_comp(int);

int main()
{
   int i;
   double answer, res;
   answer = 0.0;
   #pragma omp parallel for private(res) schedule(dynamic,10)
   for (i=0; i<N; i++){
      res = big_comp(i);
      #pragma omp critical
      combine(answer,res);
   }
   printf("%f\n", answer);
}
distributed among a collection of processes, scatter data across a collection of processes, and much more.
[1]
The UEs are processes in MPI.
MPI was created in the early 1990s to provide a common message-passing environment that could run on clusters, MPPs, and even shared-memory machines. MPI is distributed in the form of a library, and the official specification defines bindings for C and Fortran, although bindings for other languages have been defined as well. The overwhelming majority of MPI programmers today use MPI version 1.1 (released in 1995). The specification of an enhanced version, MPI 2.0, with parallel I/O, dynamic process management, one-sided communication, and other advanced features was released in 1997. Unfortunately, it was such a complex addition to the standard that at the time this is being written (more than six years after the standard was defined), only a handful of MPI implementations support MPI 2.0. For this reason, we will focus on MPI 1.1 in this discussion.

There are several implementations of MPI in common use. The two most common are LAM/MPI [LAM] and MPICH [MPI]. Both may be downloaded free of charge and are straightforward to install using the instructions that come with them. They support a wide range of parallel computers, including Linux clusters, NUMA computers, and SMPs.
B.1. CONCEPTS
The basic idea of passing a message is deceptively simple: One process sends a message and another one receives it. Digging deeper, however, the details behind message passing become much more complicated: How are messages buffered within the system? Can a process do useful work while it is sending or receiving messages? How can messages be identified so that sends are always paired with their intended receives?

The long-term success of MPI is due to its elegant solution to these (and other) problems. The approach is based on two core elements of MPI: process groups and a communication context. A process group is a set of processes involved in a computation. In MPI, all the processes involved in the computation are launched together when the program starts and belong to a single group. As the computation proceeds, however, the programmer can divide the processes into subgroups and precisely control how the groups interact.

A communication context provides a mechanism for grouping together sets of related communications. In any message-passing system, messages must be labeled so they can be delivered to the intended destination or destinations. The message labels in MPI consist of the ID of the sending process, the ID of the intended receiver, and an integer tag. A receive statement includes parameters indicating a source and tag, either or both of which may be wildcards. The result, then, of executing a receive statement at process i is the delivery of a message with destination i whose source and tag match those in the receive statement.

While straightforward, identifying messages with source, destination, and tag may not be adequate in complex applications, particularly those that include libraries or other functions reused from other programs. Often, the application programmer doesn't know any of the details about this borrowed code, and if the library includes calls to MPI, the possibility exists that messages in the application and the library might accidentally share tags, destination IDs, and source IDs. This could lead to
Next, the MPI environment must be initialized. As part of the initialization, the communicator shared by all the processes is created. As mentioned earlier, this is called MPI_COMM_WORLD:
MPI_Init(&argc, &argv);
Figure B.2. Parallel program in which each process prints a simple string to the standard output

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
   int num_procs; // the number of processes in the group
   int ID;        // a unique identifier ranging from 0 to (num_procs-1)

   if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
      // print error message and exit
   }
   MPI_Comm_size (MPI_COMM_WORLD, &num_procs);
   MPI_Comm_rank (MPI_COMM_WORLD, &ID);
   printf("\n Never miss a good chance to shut up %d \n", ID);
   MPI_Finalize();
}
The rank is an integer ranging from zero to the number of processes in the group minus one. It indicates the position of each process within the process group.
int num_procs; // the number of processes in the group
int ID;        // a unique identifier ranging from 0 to (num_procs-1)

MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
MPI_Comm_rank(MPI_COMM_WORLD, &ID);
When the MPI program finishes running, the environment needs to be cleanly shut down. This is done with this function:
MPI_Finalize();
Figure B.3. The standard blocking point-to-point communication routines in the C binding for MPI 1.1
#include <stdio.h>   // standard I/O include file
#include "memory.h"  // standard include file with function
                     // prototypes for memory management
#include "mpi.h"     // MPI include file

int main(int argc, char **argv)
{
   int Tag1 = 1;
   int Tag2 = 2;           // message tags
   int num_procs;          // the number of processes in the group
   int ID;                 // a unique identifier ranging from 0
                           // to (num_procs-1)
   int buffer_count = 100; // number of items in message to bounce
   long *buffer;           // buffer to bounce between processes
   int i;
   MPI_Status stat;        // MPI status parameter

   MPI_Init(&argc, &argv);                    // Initialize MPI environment
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);        // Find rank of this
                                              // process in the group
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs); // Find number of
                                              // processes in the group

   if (num_procs != 2) MPI_Abort(MPI_COMM_WORLD, 1);

   buffer = (long *)malloc(buffer_count * sizeof(long));
   for (i=0; i<buffer_count; i++)  // fill buffer with some values
      buffer[i] = (long) i;

   if (ID == 0) {
      MPI_Send (buffer, buffer_count, MPI_LONG, 1, Tag1, MPI_COMM_WORLD);
      MPI_Recv (buffer, buffer_count, MPI_LONG, 1, Tag2, MPI_COMM_WORLD,
                &stat);
   }
   else {
      MPI_Recv (buffer, buffer_count, MPI_LONG, 0, Tag1, MPI_COMM_WORLD,
                &stat);
      MPI_Send (buffer, buffer_count, MPI_LONG, 0, Tag2, MPI_COMM_WORLD);
   }
   MPI_Finalize();
}
The program opens with the regular include files and declarations. We initialize MPI as the first executable statement in the program. This could be done later, the only firm requirement being that initialization must occur before any MPI routines are called. Next, we determine the rank of each process and the size of the process group. To keep the example short and simple, the program has been written assuming there will only be two processes in the group. We verify this fact (aborting the program if necessary) and then allocate memory to hold the buffer.

The communication itself is split into two parts. In the process with rank equal to 0, we send and then receive. In the other process, we receive and then send. Matching up sends and receives in this way (sending and then receiving on one side of a pairwise exchange of messages while receiving and then sending on the other) can be important with blocking sends and large messages. To understand the issues, consider a problem where two processes (ID 0 and ID 1) both need to send a message (from buffer outgoing) and receive a message (into buffer incoming). Because this is an SPMD program, it is tempting to write this as
int neigh; // the rank of the neighbor process

if (ID == 0) neigh = 1;
else neigh = 0;

MPI_Send (outgoing, buffer_count, MPI_LONG, neigh, Tag1, MPI_COMM_WORLD);
// ERROR: possibly deadlocking program
MPI_Recv (incoming, buffer_count, MPI_LONG, neigh, Tag2,
          MPI_COMM_WORLD, &stat);
MPI_Barrier. A barrier defines a synchronization point at which all processes must arrive before any of them are allowed to proceed. For MPI, this means that every process using the indicated communicator must call the barrier function before any of them proceed. This function is described in detail in the Implementation Mechanisms design space.

MPI_Bcast. A broadcast sends a message from one process to all the processes in a group.

MPI_Reduce. A reduction operation takes a set of values (in the buffer pointed to by inbuff) spread out around a process group and combines them using the indicated binary operation. To be meaningful, the operation in question must be associative. The most common examples for the binary function are summation and finding the maximum or minimum of a set of values. Notice that the final reduced value (in the buffer pointed to by outbuff) is only available in the indicated destination process. If the value is needed by all processes, there is a variant of this routine called MPI_Allreduce(). Reductions are described in detail in the Implementation Mechanisms design space.
Figure B.6. Program to time the ring function as it passes messages around a ring of processes (continued in Fig. B.7). The program returns the time from the process that takes the longest elapsed time to complete the communication. The code to the ring function is not relevant for this example, but it is included in Fig. B.8.
//
// Ring communication test.
// Command-line arguments define the size of the message
// and the number of times it is shifted around the ring:
//
//     a.out msg_size num_shifts
//
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <memory.h>

int main(int argc, char **argv)
{
   int num_shifts = 0;  // number of times to shift the message
   int buff_count = 0;  // number of doubles in the message
   int num_procs = 0;   // number of processes in the ring
   int ID;              // process (or node) id
   int buff_size_bytes; // message size in bytes
   int i;
   double t0;           // wall-clock time in seconds
   double ring_time;    // ring comm time - this process
   double max_time;     // ring comm time - max time for all processes
   double *x;           // the outgoing message
   double *incoming;    // the incoming message
   MPI_Status stat;

   // Initialize the MPI environment
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

   // Process, test, and broadcast input parameters
   if (ID == 0){
      if (argc != 3){
         printf("Usage: %s <size of message> <Num of shifts> \n", *argv);
         fflush(stdout);
         MPI_Abort(MPI_COMM_WORLD, 1);
      }
      buff_count = atoi(*++argv);
      num_shifts = atoi(*++argv);
      printf(": shift %d doubles %d times \n", buff_count, num_shifts);
      fflush(stdout);
   }
   // Continued in the next figure
Figure B.7. Program to time the ring function as it passes messages around a ring of processes (continued from Fig. B.6)
// Continued from the previous figure // Broadcast data from rank 0 process to all other processes MPI_Bcast (&buff_count, 1, MPI_INT, 0, MPI_COMM_WORLD); MPI_Bcast (&num_shifts, 1, MPI_INT, 0, MPI_COMM_WORLD); // Allocate space and fill the outgoing ("x") and "incoming" vectors. buff_size_bytes = buff_count * sizeof(double); x = (double*)malloc(buff_size_bytes); incoming = (double*)malloc(buff_size_bytes); for(i=0;i<buff_count;i++){ x[i] = (double) i; incoming[i] = -1.0; } // Do the ring communication tests. MPI_Barrier(MPI_COMM_WORLD); t0 = MPI_Wtime(); /* code to pass messages around a ring */ ring (x,incoming,buff_count,num_procs,num_shifts,ID); ring_time = MPI_Wtime() - t0; // Analyze results MPI_Barrier(MPI_COMM_WORLD); MPI_Reduce(&ring_time, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD); if(ID == 0){ printf("\n Ring test took %f seconds", max_time); } MPI_Finalize(); }
Figure B.5. The major collective communication routines in the C binding to MPI 1.1 (MPI_Barrier, MPI_Bcast, and MPI_Reduce)
MPI_Wtime() returns the time in seconds since some arbitrary point in the past. Usually the time interval of interest is computed by calling this function twice.

This program begins as most MPI programs do, with declarations, MPI initialization, and finding the rank and number of processes. We then process the command-line arguments to determine the message size and number of times to shift the message around the ring of processes. Every process will need these values, so the MPI_Bcast() function is called to broadcast these values.

We then allocate space for the outgoing and incoming vectors to be used in the ring test. To produce consistent results, every process must complete the initialization before any processes enter the timed section of the program. This is guaranteed by calling the MPI_Barrier() function just before the timed section of code. The time function is then called to get an initial time, the ring test itself is called, and then the time function is called a second time. The difference between these two calls to the time function is the elapsed time this process spent passing messages around the ring.
Figure B.8. Function to pass a message around a ring of processes. It is deadlock-free because the sends and receives are split between the even and odd processes.
/*******************************************************************
NAME:     ring
PURPOSE:  This function does the ring communication, with the
          odd numbered processes sending then receiving while the
          even processes receive and then send. The sends are
          blocking sends, but this version of the ring test still is
          deadlock-free since each send always has a posted receive.
*******************************************************************/
#define IS_ODD(x) ((x)%2)  /* test for an odd int */
#include "mpi.h"

ring( double *x,        /* message to shift around the ring */
      double *incoming, /* buffer to hold incoming message */
      int buff_count,   /* size of message */
      int num_procs,    /* total number of processes */
      int num_shifts,   /* numb of times to shift message */
      int my_ID)        /* process id number */
{
   int next;  /* process id of the next process */
   int prev;  /* process id of the prev process */
   int i;
   MPI_Status stat;

   /****************************************************************
   ** In this ring method, odd processes snd/rcv and even
   ** processes rcv/snd.
   ****************************************************************/
   next = (my_ID + 1) % num_procs;
   prev = ((my_ID == 0) ? (num_procs-1) : (my_ID-1));

   if( IS_ODD(my_ID) ){
      for(i=0; i<num_shifts; i++){
         MPI_Send (x, buff_count, MPI_DOUBLE, next, 3, MPI_COMM_WORLD);
         MPI_Recv (incoming, buff_count, MPI_DOUBLE, prev, 3,
                   MPI_COMM_WORLD, &stat);
      }
} else{ for(i=0;i<num_shifts; i++){ MPI_Recv (incoming, buff_count, MPI_DOUBLE, prev, 3, MPI_COMM_WORLD, &stat); MPI_Send (x, buff_count, MPI_DOUBLE, next, 3, MPI_COMM_WORLD); } } }
Figure B.10. Program using nonblocking communication to iteratively update a field using an algorithm that requires only communication around a ring (shifting messages to the right)
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

// Update a distributed field with a local N by N block on each process
// (held in the array U). The point of this example is to show
// communication overlapped with computation, so code for other
// functions is not included.
#define N 100  // size of an edge in the square field.

extern void initialize(int, double*);
extern void extract_boundary(int, double*, double*);
extern void update_interior(int, double*);
extern void update_edge(int, double*, double*);

int main(int argc, char **argv)
{
   double *U, *B, *inB;
   int i, num_procs, ID, left, right, Nsteps = 100;
   MPI_Status status;
   MPI_Request req_recv, req_send;

   // Initialize the MPI environment
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

   // allocate space for the field (U), and the buffers
   // to send and receive the edges (B, inB)
   U   = (double*)malloc(N*N * sizeof(double));
   B   = (double*)malloc(N * sizeof(double));
   inB = (double*)malloc(N * sizeof(double));

   // Initialize the field and set up a ring communication pattern
   initialize(N, U);
   right = ID + 1;  if (right > (num_procs-1)) right = 0;
   left  = ID - 1;  if (left < 0) left = num_procs-1;

   // Iteratively update the field U
   for(i=0; i<Nsteps; i++){
      MPI_Irecv(inB, N, MPI_DOUBLE, left, i, MPI_COMM_WORLD, &req_recv);
      extract_boundary(N, U, B);  // Copy the edge of U into B
      MPI_Isend(B, N, MPI_DOUBLE, right, i, MPI_COMM_WORLD, &req_send);
      update_interior(N, U);
      MPI_Wait(&req_recv, &status);
      MPI_Wait(&req_send, &status);
      update_edge(ID, inB, U);
   }
   MPI_Finalize();
}
Figure B.11. Function to pass a message around a ring of processes using persistent communication
/*******************************************************************
NAME:     ring_persistent
PURPOSE:  This function uses the persistent communication request
          mechanism to implement the ring communication in MPI.
*******************************************************************/
#include "mpi.h"
#include <stdio.h>

ring_persistent(
      double *x,        /* message to shift around the ring */
      double *incoming, /* buffer to hold incoming message */
      int buff_count,   /* size of message */
      int num_procs,    /* total number of processes */
      int num_shifts,   /* numb of times to shift message */
      int my_ID)        /* process id number */
{
   int next;  /* process id of the next process */
   int prev;  /* process id of the prev process */
   int i;
   MPI_Request snd_req;  /* handle to the persistent send */
   MPI_Request rcv_req;  /* handle to the persistent receive */
   MPI_Status stat;

   /****************************************************************
   ** In this ring method, first post all the sends and then pick up
   ** the messages with the receives.
   ****************************************************************/
   next = (my_ID + 1) % num_procs;
   prev = ((my_ID == 0) ? (num_procs-1) : (my_ID-1));

   MPI_Send_init(x, buff_count, MPI_DOUBLE, next, 3,
                 MPI_COMM_WORLD, &snd_req);
   MPI_Recv_init(incoming, buff_count, MPI_DOUBLE, prev, 3,
                 MPI_COMM_WORLD, &rcv_req);

   for(i=0; i<num_shifts; i++){
      MPI_Start(&snd_req);
      MPI_Start(&rcv_req);
      MPI_Wait(&snd_req, &stat);
      MPI_Wait(&rcv_req, &stat);
   }
}
Most of the MPI examples in this book use standard mode. We used the synchronous mode communication to implement mutual exclusion in the Implementation Mechanisms design space. Information about the other modes can be found in the MPI specification [Mesb].
For example, we show C and Fortran versions of MPI's reduction routine in Fig. B.12.
Figure B.12. Comparison of the C and Fortran language bindings for the reduction routine in MPI 1.1
print *, "Process ", ID, " out of ", Nprocs call MPI_FINALIZE( ierr ) end
B.7. CONCLUSION
MPI is by far the most commonly used API for parallel programming. It is often called the "assembly code" of parallel programming. MPI's low-level constructs are closely aligned to the MIMD model of parallel computers. This allows MPI programmers to precisely control how the parallel computation unfolds and write highly efficient programs. Perhaps even more important, this lets programmers write portable parallel programs that run well on shared-memory machines, massively parallel supercomputers, clusters, and even over a grid.

Learning MPI can be intimidating. It is huge, with more than 125 different functions in MPI 1.1. The large size of MPI does make it complex, but most programmers avoid this complexity and use only a small subset of MPI. Many parallel programs can be written with just six functions: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv, and MPI_Finalize. Good sources of more information about MPI include [Pac96], [GLS99], and [GS98]. Versions of MPI are available for most computer systems, usually in the form of open source software readily available online. The most commonly used versions of MPI are LAM/MPI [LAM] and MPICH [MPI].
C.3 SYNCHRONIZED BLOCKS
C.4 WAIT AND NOTIFY
C.5 LOCKS
C.6 OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
C.7 INTERRUPTS

Java is an object-oriented programming language that provides language support for expressing concurrency in shared-memory programs. Java's support for polymorphism can be exploited to write frameworks directly supporting some of the patterns described in this book. The framework provides the infrastructure for the pattern; the application programmer adds subclasses containing the application-specific code. An example is found in the Examples section of the Pipeline pattern. Because Java programs are typically compiled to an intermediate language called Java bytecode, which is then compiled and/or interpreted on the target machine, Java programs enjoy a high degree of portability.

Java can also be used for distributed computing, and the standard libraries provide support for several mechanisms for interprocess communication in distributed systems. A brief overview of these is given in the Implementation Mechanisms design space. In addition, there is an ever-growing collection of packages that allow concurrency and interprocess communication to be expressed at higher levels than with the facilities provided by the language and core packages. Currently, the most common types of applications that exploit Java's facilities for concurrency are multithreaded server-side applications, graphical user interfaces, and programs that are naturally distributed due to their use of data and/or resources in different locations.

The performance available from current implementations of Java is not as good as that from programming languages more typically used in high-performance scientific computing. However, the ubiquity and portability of Java, along with the language support for concurrency, make it an important platform. For many people, a Java program may be their first concurrent program. In addition, continually improving compiler technology and libraries should reduce the performance gap in the future.
Figure C.1. A class holding pairs of objects of an arbitrary type. Without generic types, this would have been done by declaring x and y to be of type Object, requiring casting the returned values of getX and getY. In addition to less-verbose programs, this allows type errors to be found by the compiler rather than throwing a ClassCastException at runtime.
/* class holding two objects of arbitrary type, providing swap and
   accessor methods */
class Pair<Gentype>  //Gentype is a type variable
{
   private Gentype x,y;

   void swap(){Gentype temp = x; x = y; y = temp;}

   public Gentype getX(){return x;}
   public Gentype getY(){return y;}

   Pair(Gentype x, Gentype y){this.x = x; this.y = y;}

   public String toString(){return "(" + x + "," + y + ")";}
}

/* The following method, defined in some class, instantiates
   two Pair objects */
public static void test()
{
   /*create a Pair holding Strings.*/
   Pair<String> p1 = new Pair<String>("hello", "goodbye");

   /*create a Pair holding Integers. Autoboxing automatically creates
     Integer objects from the int literals, replacing
     new Pair<Integer>(new Integer(1), new Integer(2)) */
   Pair<Integer> p2 = new Pair<Integer>(1,2);

   /*do something with the Pairs.*/
   System.out.println(p1); p1.swap(); System.out.println(p1);
   System.out.println(p2); p2.swap(); System.out.println(p2);
}
The version available at press time was J2SE 1.5.0 Beta 1.
Java 2 1.5 also provides significantly enhanced support for concurrent programming via several changes in the java.lang and java.util packages and the introduction of new packages java.util.concurrent, java.util.concurrent.atomic, and java.util.concurrent.locks. A new, more precise specification of the memory model has also been incorporated. The facilities for concurrent programming are described in [JSRb, JSRc] and the new memory model is described in [JSRa].

It isn't feasible to give a complete introduction to Java in this appendix. Hence, we assume that the reader has some familiarity with the language and focus on those aspects that are most relevant to shared-memory parallel programming, including new features introduced in Java 2 1.5. Introductions to Java and its (pre-Java 2 1.5) libraries can be found in [AGH00, HC02, HC01]. An excellent discussion of concurrent programming in Java is [Lea00a].
A thread can access all variables visible to its run method according to the usual Java scope rules. Thus, by making variables visible to multiple threads, memory can be shared between them. By encapsulating variables, memory can be protected from access by other threads. In addition, a variable can be marked final, which means that, once initialized, its value will not change. Provided the initialization is done properly, final variables can be accessed by multiple threads without requiring synchronization. (However, marking a reference final does not guarantee immutability of the referenced object; care must be taken to ensure that it is thread-safe as well.)

There are two different ways to specify the run method that a thread will execute. One is to extend the Thread class and override its run method. Then, one simply instantiates an object of the subclass and calls its inherited start method. The more common approach is to instantiate a Thread object using the constructor that takes a Runnable object as a parameter. As before, invoking the Thread object's start method begins the thread's execution. The Runnable interface contains a public void run method; the newly created thread executes the Runnable object's run method. Because the run method is parameterless, information is typically passed to the thread via data members of the Thread subclass or the Runnable. These are typically set in the constructor.

In the following example, the class ThinkParallel implements the Runnable interface (that is, it declares that it implements the interface and provides a public void run() method). The main method creates and starts four Thread objects. Each of these is passed a ThinkParallel object, whose id field has been set to the value of loop counter i. This causes four threads to be created, each executing the run method of the ThinkParallel class.
Figure C.2. Program to create four threads, passing a Runnable in the Thread constructor. Thread-specific data is held in a field of the Runnable object.
class ThinkParallel implements Runnable
{
   int id;  //thread-specific variable containing thread ID

   /*The run method defines the thread's behavior*/
   public void run()
   {
      System.out.println(id + ": Are we there yet?");
   }

   /*Constructor sets id*/
   ThinkParallel(int id){this.id = id;}

   /*main method instantiates and starts the threads*/
   public static void main(String[] args)
   {
      /*create and start 4 Thread objects, passing each
        a ThinkParallel object */
      for(int i = 0; i != 4; i++)
      {
         new Thread(new ThinkParallel(i)).start();
      }
   }
}
C.1.1. Anonymous Inner Classes

A common programming idiom found in several of the examples uses an anonymous class to define the Runnables or Threads at the point where they are created. This often makes reading the source code more convenient and avoids file and class clutter. We show how to rewrite the example of Fig. C.2 using this idiom in Fig. C.3. Variables that are local to a method cannot be mentioned in an anonymous class unless declared final, and are thus immutable after being assigned a value. This is the reason for introducing the seemingly redundant variable j.

C.1.2. Executors and Factories

The java.util.concurrent package provides the Executor interface and its subinterface ExecutorService along with several implementations. An Executor executes Runnable objects while hiding the details of thread creation and scheduling.[2] Thus, instead of instantiating a Thread object and passing it a Runnable to execute a task, one would instantiate an Executor object and then use the Executor's execute method to execute the Runnable. For example, a ThreadPoolExecutor manages a pool of threads and arranges for the execution of submitted Runnables using one of the pooled threads. The classes implementing the interfaces provide adjustable parameters, but the Executors class provides factory methods that create various ExecutorServices preconfigured for the most common scenarios. For example, the factory method Executors.newCachedThreadPool() returns an ExecutorService with an unbounded thread pool and automatic thread reclamation. This will attempt to use a thread from the pool if one is available, and, if not, create a new thread. Idle threads are removed from the pool after a certain amount of time. Another example is Executors.newFixedThreadPool(int nThreads), which creates a fixed-size thread pool containing nThreads threads. It contains an unbounded queue for waiting tasks. Other configurations are possible, both using additional factory methods not described here and also by directly instantiating a class implementing ExecutorService and adjusting the parameters manually.
[2]
for(int i = 0; i != 4; i++)
{
   final int j = i;
   new Thread(
      new Runnable()  //define Runnable objects anonymously
      {
         int id = j;  //references the final local variable j
         public void run()
         { System.out.println(id + ": Are we there yet?"); }
      }
   ).start();
}
}
}
As an example, we could rewrite the previous example as shown in Fig. C.4. Note that to change to a different thread management policy, we would only need to invoke a different factory method for the instantiation of the Executor. The rest of the code would remain the same.

The run method in the Runnable interface does not return a value and cannot be declared to throw exceptions. To correct this deficiency, the Callable interface, which contains a method call that throws an Exception and returns a result, was introduced. The interface definition exploits the new support for generic types. Callables can arrange execution by using the submit method of an object that implements the ExecutorService interface. The submit method returns a Future object, which represents the result of an asynchronous computation. The Future object provides methods to check whether the computation is complete, to wait for completion, and to get the result. The type enclosed within angle brackets specifies that the class is to be specialized to that type.
Figure C.4. Program using a ThreadPoolExecutor instead of creating threads directly

import java.util.concurrent.*;

class ThinkParalleln implements Runnable
{
   int id;  //thread-specific variable containing thread ID

   /*The run method defines the task's behavior*/
   public void run()
   {
      System.out.println(id + ": Are we there yet?");
   }

   /*Constructor sets id*/
   ThinkParalleln(int id){this.id = id;}

   /*main method creates an Executor to manage the tasks*/
   public static void main(String[] args)
   {
      /*create an Executor using a factory method in Executors*/
      ExecutorService executor = Executors.newCachedThreadPool();

      // send each task to the executor
      for(int i = 0; i != 4; i++)
      {
         executor.execute(new ThinkParalleln(i));
      }
   }
}
The curly braces delimit the block. The code for acquiring and releasing the lock is generated by the compiler.

Suppose we add a variable static int count to the ThinkParallel class to be incremented by each thread after it prints its message. This is a static variable, so there is one per class (not one per object), and it is visible to and thus shared by all the threads. To avoid race conditions, count could be accessed in a synchronized block.[4] To provide protection, all threads must use the same lock, so we use the object associated with the class itself. For any class X, X.class is a reference to the unique object representing class X, so we could write the following:
[4]
is shorthand for
public void updateSharedVariables(...) { synchronized(this){....body....} }
condition (which no longer holds) and corrupt the program state. Thus, checking the condition and performing the action need to be placed inside a synchronized block. On the other hand, the thread cannot hold the lock associated with the synchronized block while waiting because this would block other threads from access, preventing the condition from ever being established. What is needed is a way for a thread waiting on a condition to release the lock and then, after the condition has been satisfied, reacquire the lock before rechecking the condition and performing its action. Traditional monitors [Hoa74] were proposed to handle this situation. Java provides similar facilities.

The Object class, along with the previously mentioned lock, also contains an implicit wait set that serves as a condition variable. The Object class provides several versions of wait methods that cause the calling thread to implicitly release the lock and add itself to the wait set. Threads in the wait set are suspended and not eligible to be scheduled to run.
Figure C.6. Basic idiom for using wait. Because wait throws an InterruptedException, it should somehow be enclosed in a try-catch block, omitted here.

synchronized(lockObject)
{
   while( !condition ){ lockObject.wait(); }
   action;
}
The basic idiom for using wait is shown in Fig. C.6. The scenario is as follows: The thread acquires the lock associated with lockObject. It checks condition. If the condition does not hold, then the body of the while loop, the wait method, is executed. This causes the lock to be released, suspends the thread, and places it in the wait set belonging to lockObject. If the condition does hold, the thread performs action and leaves the synchronized block. On leaving the synchronized block, the lock is released.

Threads leave the wait set in one of three ways. First, the Object class methods notify and notifyAll awaken one or all threads, respectively, in the wait set of that object. These methods are intended to be invoked by a thread that establishes the condition being waited upon. An awakened thread leaves the wait set and joins the threads waiting to reacquire the lock. The awakened thread will reacquire the lock before it continues execution. The wait method may be called without parameters or with timeout values. A thread that uses one of the timed wait methods (that is, one that is given a timeout value) may be awakened by notification as just described, or by the system at some point after the timeout has expired. Upon being reawakened, it will reacquire the lock and continue normal execution. Unfortunately, there is no indication of whether a thread was awakened by a notification or a timeout. The third way that a thread can leave the wait set is if it is interrupted. This causes an InterruptedException to be thrown, whereupon the control flow in the thread follows the normal rules for handling exceptions.

We now continue describing the scenario started previously for Fig. C.6, at the point at which the thread has waited and been awakened. When awakened by some other thread executing notify or notifyAll on lockObject (or by a timeout), the thread will be removed from the wait set. At some point, it will be scheduled for execution and will attempt to reacquire the lock associated with lockObject. After the lock has been reacquired, the thread will recheck the condition and either
C.5. LOCKS
The semantics of synchronized blocks together with wait and notify have certain deficiencies when used in the straightforward way described in the previous section. Probably the worst problem is that there is no access to information about the state of the associated implicit lock. This means that a thread cannot determine whether or not a lock is available before attempting to acquire it. Further, a thread blocked waiting for the lock associated with a synchronized block cannot be interrupted.[5] Another problem is that only a single (implicit) condition variable is associated with each lock. Thus, threads waiting for different conditions to be established share a wait set, with notify possibly waking the wrong thread (and forcing the use of notifyAll).
[5] This is discussed further in the next section.
A lock is reentrant if it can be acquired multiple times by the same thread without causing deadlock.
//instantiate lock
private final ReentrantLock lock = new ReentrantLock();
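A minimal sketch of reentrancy in action (the class and method names are invented for illustration): the owning thread acquires the same ReentrantLock a second time without deadlocking, and getHoldCount() reports the nesting depth.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: a ReentrantLock may be reacquired by the thread that
// already holds it; each lock() must be matched by an unlock().
class ReentrancyDemo {
    private final ReentrantLock lock = new ReentrantLock();

    int nestedHoldCount() {
        lock.lock();                            // first acquisition
        try {
            lock.lock();                        // reacquired by the same thread
            try {
                return lock.getHoldCount();     // nesting depth is now 2
            } finally { lock.unlock(); }
        } finally { lock.unlock(); }
    }
}
```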
Figure C.7. A version of SharedQueue2 (see the Shared Queue pattern) using a Lock and Condition instead of synchronized blocks with wait and notify
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.locks.*;

class SharedQueue2 {
    class Node {
        Object task;
        Node next;
        Node(Object task) { this.task = task; next = null; }
    }

    private Node head = new Node(null);
    private Node last = head;
    Lock lock = new ReentrantLock();
    final Condition notEmpty = lock.newCondition();

    public void put(Object task) {
        //cannot insert null
        assert task != null : "Cannot insert null task";
        lock.lock();
        try {
            Node p = new Node(task);
            last.next = p;
            last = p;
            notEmpty.signalAll();
        } finally { lock.unlock(); }
    }

    public Object take() {
        //returns first task in queue, waits if queue is empty
        Object task = null;
        lock.lock();
        try {
            while (isEmpty()) {
                try { notEmpty.await(); }
                catch (InterruptedException error) {
                    assert false : "sq2: no interrupts here";
                }
            }
            Node first = head.next;
            task = first.task;
            first.task = null;
            head = first;
        } finally { lock.unlock(); }
        return task;
    }

    private boolean isEmpty() { return head.next == null; }
}
and should always be used with a try-finally block, such as
lock.lock(); // block until lock acquired
try {
    //critical section
}
finally { lock.unlock(); }
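The Lock interface also addresses the deficiencies of synchronized blocks noted earlier: tryLock acquires the lock only if it is currently free, so a thread can test availability without blocking, and lockInterruptibly allows a thread blocked on acquisition to be interrupted. A small sketch with invented class and method names:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the nonblocking and interruptible acquisition methods
// provided by Lock implementations such as ReentrantLock.
class LockVariants {
    private final ReentrantLock lock = new ReentrantLock();

    boolean tryCriticalSection() {
        if (lock.tryLock()) {           // acquire only if free; never blocks
            try {
                // critical section would go here
                return true;
            } finally { lock.unlock(); }
        }
        return false;                   // lock was busy; do something else
    }

    void interruptibleCriticalSection() throws InterruptedException {
        lock.lockInterruptibly();       // a blocked acquisition can be interrupted
        try {
            // critical section would go here
        } finally { lock.unlock(); }
    }
}
```

A timed variant, tryLock(timeout, unit), combines the two behaviors: it blocks for at most the given timeout and can be interrupted while waiting.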
Figure C.9. Program showing a parallel version of the sequential program in Fig. C.8 where each iteration of the big_comp loop is a separate task. A thread pool containing ten threads is used to execute the tasks. A CountDownLatch is used to ensure that all of the tasks have completed before executing the (still sequential) loop that combines the results.
import java.util.concurrent.*;

class ParallelLoop {
    static ExecutorService exec;
    static CountDownLatch done;
    static int num_iters = 1000;
    static double[] res = new double[num_iters];
    static double answer = 0.0;

    static void combine(int i) { ..... }
    static double big_comp(int i) { ..... }

    public static void main(String[] args) throws InterruptedException {
        /*create executor with pool of 10 threads */
        exec = Executors.newFixedThreadPool(10);
        /*create and initialize the countdown latch*/
        done = new CountDownLatch(num_iters);
        long startTime = System.nanoTime();
        for (int i = 0; i < num_iters; i++) {
            //only final local vars can be referenced in an anonymous class
            final int j = i;
            /*pass the executor a Runnable object to execute the
              loop body and decrement the CountDownLatch */
            exec.execute(new Runnable() {
                public void run() {
                    res[j] = big_comp(j);
                    done.countDown(); /*decrement the CountDownLatch*/
                }
            });
        }
        done.await(); //wait until all tasks have completed
        /*combine results using sequential loop*/
        for (int i = 0; i < num_iters; i++) {
            combine(i);
        }
        System.out.println(answer);
        /*cleanly shut down thread pool*/
        exec.shutdown();
    }
}
C.7. INTERRUPTS
Part of the state of a thread is its interrupt status. A thread can be interrupted using the interrupt method. This sets the interrupt status of the thread to interrupted.

If the thread is suspended (that is, it has executed a wait, sleep, join, or other command that suspends the thread), the suspension will be interrupted and an InterruptedException thrown. Because of this, the methods that can cause blocking, such as wait, throw this exception, and thus must either be called from a method that declares itself to throw the exception, or the call must be enclosed within a try-catch block. Because the signature of the run method in class Thread and interface Runnable does not include throwing this exception, a try-catch block must enclose, either directly or indirectly, any call to a blocking method invoked by a thread. This does not apply to the main thread, because the main method can be declared to throw an InterruptedException.

The interrupt status of a thread can be used to indicate that the thread should terminate. To enable this, the thread's run method should be coded to periodically check the interrupt status (using the isInterrupted or the interrupted method); if the thread has been interrupted, the thread should return from its run method in an orderly way. InterruptedExceptions can be caught and the handler used to ensure graceful termination if the thread is interrupted when waiting. In many parallel programs, provisions to externally stop a thread are not needed, and the catch blocks for InterruptedExceptions can either provide debugging information or simply be empty.

The Callable interface was introduced as an alternative to Runnable that allows an exception to be thrown (and also, as discussed previously, allows a result to be returned). This interface exploits the support for generic types.
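A minimal sketch of Callable in use (the class and method names are invented for illustration): call() is declared to throw a checked exception and returns a value of the generic type, which is retrieved through the Future returned by submit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: Callable<Integer> returns a result and may throw a checked
// exception, neither of which Runnable's run() signature permits.
class CallableDemo {
    static int square(int x) throws Exception {
        if (x < 0) throw new Exception("negative input");
        return x * x;
    }

    static int runTask(int x) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> task = () -> square(x);   // result type is generic
            Future<Integer> future = exec.submit(task);
            return future.get();  // blocks for the result; a task exception
                                  // is rethrown wrapped in ExecutionException
        } finally { exec.shutdown(); }
    }
}
```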
Glossary
Abstract data type (ADT). A data type given by its set of allowed values and the available operations on those values. The values and operations are defined independently of a particular representation of the values or implementation of the operations. In a programming language that directly supports ADTs, the interface of the type reveals the operations on it, but the implementation is hidden and can (in principle) be changed without affecting clients that use the type. The classic example of an ADT is a stack, which is defined by its operations, typically including push and pop. Many different internal representations are possible.
Amdahl's law. A law describing the limit on the speedup that can be obtained by running an algorithm on a system of P processors. If γ is the serial fraction of the computation, the maximum possible speedup is

S = 1 / (γ + (1 - γ)/P)
Autoboxing. A language feature, available beginning with Java 2 1.5, that provides automatic conversion of data of a primitive type to the corresponding wrapper type, for example, from int to Integer.
Bandwidth. The capacity of a system, usually expressed as items per second. In parallel computing, the most common usage of the term "bandwidth" is in reference to the number of bytes per second that can be moved across a network link. A parallel program that generates relatively small numbers of huge messages may be limited by the bandwidth of the network, in which case it is called a bandwidth-limited program. See also [bisection bandwidth]

Barrier. A synchronization mechanism applied to groups of UEs, with the property that no UE in the group can pass the barrier until all UEs in the group have reached the barrier. In other words, UEs arriving at the barrier suspend or block until all UEs have arrived; they can then all proceed.
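In Java, the barrier behavior described above is provided by java.util.concurrent.CyclicBarrier. In the following sketch (class and method names invented for illustration), each thread writes its contribution, waits at the barrier, and only then reads the other threads' contributions, which is safe precisely because the barrier guarantees all writes have completed.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: no thread passes barrier.await() until all nThreads arrive,
// so every thread sees the complete contrib array afterward.
class BarrierDemo {
    static int sumAfterBarrier(int nThreads) throws InterruptedException {
        int[] contrib = new int[nThreads];
        AtomicInteger total = new AtomicInteger();
        CyclicBarrier barrier = new CyclicBarrier(nThreads);
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                contrib[id] = id + 1;            // per-thread work
                try { barrier.await(); }         // suspend until all arrive
                catch (Exception e) { throw new RuntimeException(e); }
                int sum = 0;                     // all contributions now visible
                for (int c : contrib) sum += c;
                total.set(sum);                  // every thread computes the same sum
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        return total.get();
    }
}
```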
Cache. A relatively small region of memory that is local to a processor and is considerably faster than the computer's main memory. Cache hierarchies consisting of one or more levels of cache are essential in modern computer systems. Because processors are so much faster than the computer's main memory, a processor can run at a significant fraction of full speed only if the data can be loaded into cache before it is needed and that data can be reused during a calculation. Data is moved between the cache and the computer's main memory in small blocks of bytes called cache lines. An entire cache line is moved when any byte within the memory mapped to the cache line is accessed. Cache lines are removed from the cache according to some protocol when the cache becomes full and space is needed for other data, or when they are accessed by some other processor. Usually each processor has its own cache (though sometimes multiple processors share a level of cache), so keeping the caches coherent (that is, ensuring that all processors have the same view of memory through their distinct caches) is an issue that must be dealt with by computer architects and compiler writers. Programmers must be aware of caching issues when optimizing the performance of software.
ccNUMA. Cache-coherent NUMA. A NUMA model where data is coherent at the level of the cache. See also [NUMA]

Cluster. Any collection of distinct computers that are connected and used as a parallel computer, or to form a redundant system for higher availability. The computers in a cluster are not specialized to cluster computing and could, in principle, be used in isolation as standalone computers. In other words, the components making up the cluster, both the computers and the networks connecting them, are not custom-built for use in the cluster. Examples include Ethernet-connected workstation networks and rack-mounted workstations dedicated to parallel computing. See also [workstation farm]

Collective communication. A high-level operation involving a group of UEs and having at its core the cooperative exchange of information between the UEs. The high-level operation might be a pure communication event (for example, a broadcast) or it might include some computation (for example, a reduction). See also [broadcast] See also [reduction]

Concurrent execution. A condition in which two or more UEs are active and making progress simultaneously. This can be either because they are being executed at the same time on different PEs, or because the actions of the UEs are interleaved on the same PE.
Concurrent program. A program with multiple loci of control (threads, processes, etc.).
Condition variable. Condition variables are part of the monitor synchronization mechanism. A condition variable is used by a process or thread to delay until the monitor's state satisfies some condition; it is also used to awaken a delayed process when the condition becomes true. Associated with each condition variable is a wait set of suspended (delayed) processes or threads. Operations on a condition variable include wait (add this process or thread to the wait set for this variable) and signal or notify (awaken a process or thread on the wait set for this variable). See also [monitor]

Copy on write. A technique that ensures, using minimal synchronization, that concurrent threads will never see a data structure in an inconsistent state. To update the structure, a copy is made, modifications are made on the copy, and then the reference to the old structure is atomically replaced with a reference to the new one. This means that a thread holding a reference to the old structure may continue to read an old (consistent) version, but no thread will ever see the structure in an inconsistent state. Synchronization is only needed to acquire and update the reference to the structure, and to serialize the updates.
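The copy-on-write technique just described can be sketched in Java with an AtomicReference (the class and method names are invented for illustration): writers copy, modify the copy, and atomically swap the reference, while readers simply dereference and never block.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: updates build a modified copy and atomically replace the
// published reference; readers always see a consistent snapshot.
class CopyOnWriteBox {
    private final AtomicReference<List<String>> ref =
        new AtomicReference<>(Collections.<String>emptyList());

    void add(String item) {
        while (true) {
            List<String> old = ref.get();
            List<String> copy = new ArrayList<>(old);  // copy, then modify the copy
            copy.add(item);
            // atomically swap in the new version; retry if another writer won
            if (ref.compareAndSet(old, Collections.unmodifiableList(copy))) return;
        }
    }

    List<String> snapshot() {
        return ref.get();   // readers never block and never see a partial update
    }
}
```

The compare-and-set retry loop serializes the updates; java.util.concurrent's CopyOnWriteArrayList packages the same idea as a library class.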
Distributed shared memory (DSM). An address space shared among multiple UEs that is constructed from memory subsystems that are distinct and distributed about the system. There may be operating system and hardware support for the distributed shared memory system, or the shared memory may be implemented entirely in software as a separate middleware layer. See also [virtual shared memory]

DSM. See [distributed shared memory]

Eager evaluation. A scheduling strategy where the evaluation of an expression, or execution of a procedure, can occur as soon as (but not before) all of its arguments have been evaluated. Eager evaluation is typical for most programming environments and contrasts with lazy evaluation. Eager evaluation can sometimes lead to extra work (or even nontermination) when an argument that will not actually be needed in a computation must be computed anyway.
Efficiency. The efficiency E of a parallel computation is the speedup divided by the number of PEs used, E = S(P)/P, and indicates how effectively the resources in a parallel computer are used.
Fork. See [fork/join]
resources remain with the resource owners.
Homogeneous. The components of a homogeneous system are all of the same kind.
Hypercube. A multicomputer in which the nodes are placed at the vertices of a d-dimensional cube. The most frequently used configuration is a binary hypercube where each of 2^n nodes is connected to n others.
Incremental parallelism. Incremental parallelism is a technique for parallelizing an existing program, in which the parallelization is introduced as a sequence of incremental changes, parallelizing one loop at a time. Following each transformation, the program is tested to ensure that its behavior is the same as the original program, greatly decreasing the chances of introducing undetected bugs. See also [refactoring]

Java Virtual Machine (JVM). An abstract stack-based computing machine whose instruction set is called Java bytecode. Typically, Java programs are compiled into class files containing bytecode, a symbol table, and other information. The purpose of the JVM is to provide a consistent execution environment for class files regardless of the underlying platform.
Join. See [fork/join]

JVM. See [Java Virtual Machine]

Latency. The fixed cost of servicing a request, such as sending a message or accessing information from a disk. In parallel computing, the term most often is used to refer to the time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the empty message is received by the recipient. Programs that generate large numbers of small messages are sensitive to the latency and are called latency-bound programs.
Linda. A coordination language for parallel programming. See also [tuple space]

Load balance. In a parallel computation, tasks are assigned to UEs, which are then mapped onto PEs for execution. The work carried out by the collection of PEs is the "load" associated with the computation. Load balance refers to how that load is distributed among the PEs. In an efficient parallel program, the load is balanced so each PE spends about the same amount of time on the computation. In other words, in a program with good load balance, each PE finishes with its share of the load at about the same time.
Load balancing. The process of distributing work among the PEs so as to achieve good load balance. See also [load balance]
Multiprocessor. A parallel computer with multiple processors that share an address space.
Mutex. A mutual exclusion lock. A mutex serializes the execution of multiple threads.
Process. A collection of resources that enable the execution of program instructions. These resources can include virtual memory, I/O descriptors, a runtime stack, signal handlers, user and group IDs, and access control tokens. A more high-level view is that a process is a "heavyweight" UE with its own address space. See also [unit of execution] See also [thread]

Process migration. Changing the processor responsible for running a process during execution. Process migration is commonly used to dynamically balance the load on multiprocessor systems. It is also used to support fault-tolerant computing by moving processes away from failing processors.
Reader/writer locks. A lock that allows concurrent access by multiple readers but gives a writer exclusive access, useful for protecting data that is read far more often than it is written.
Refactoring. Refactoring is a software engineering technique in which a program is restructured carefully so as to alter its internal structure without changing its external behavior. The restructuring occurs through a series of small transformations (called refactorings) that can be verified as preserving behavior following each transformation. The system is fully working and verifiable following each transformation, greatly decreasing the chances of introducing serious, undetected bugs. Incremental parallelism can be viewed as an application of refactoring to parallel programming. See also [incremental parallelism]

Remote procedure call (RPC). A procedure invoked in a different address space than the caller, often on a different machine. Remote procedure calls are a popular approach for interprocess communication and launching remote processes in distributed client-server computing environments.
RPC. See [remote procedure call]

Semaphore. An ADT used to implement certain kinds of synchronization. A semaphore has a value that is constrained to be a nonnegative integer and two atomic operations. The allowable operations are V (sometimes called up) and P (sometimes called down). A V operation increases the value of the semaphore by one. A P operation decreases the value of the semaphore by one, provided that can be done without violating the constraint that the value be nonnegative. A P operation that is initiated when the value of the semaphore is 0 suspends. It may continue when the value is positive.
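In Java, java.util.concurrent.Semaphore provides exactly these operations, with acquire() playing the role of P and release() the role of V. A small sketch with an invented helper method:

```java
import java.util.concurrent.Semaphore;

// Sketch: acquire() is P (decrement, suspending while the value is 0);
// release() is V (increment); availablePermits() reads the current value.
class SemaphoreDemo {
    static int availableAfter(int permits, int acquires, int releases)
            throws InterruptedException {
        Semaphore sem = new Semaphore(permits);
        for (int i = 0; i < acquires; i++) sem.acquire();   // P operations
        for (int i = 0; i < releases; i++) sem.release();   // V operations
        return sem.availablePermits();
    }
}
```

A semaphore initialized to 1 behaves like a mutex; larger initial values bound how many threads may be inside a region at once.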
Serial fraction. The fraction of a program's execution time spent in portions of the code that must run serially. If the setup and finalization phases must execute serially, then the serial fraction would be

γ = (T_setup + T_finalization) / T_total
Shared address space. An addressable block of memory that is shared between a collection of UEs.
Speedup. A measure of the benefit of parallelism, S(n) = T(1)/T(n), where T(n) is the program's runtime on a system with n PEs. When the speedup equals the number of PEs in the parallel computer, the speedup is said to be perfectly linear.
by reformulating as a certain type of recurrence relation.
Thread. A fundamental unit of execution on certain computers. In a UNIX context, threads are associated with a process and share the process's environment. This makes the threads lightweight (that is, a context switch between threads is cheap). A more high-level view is that a thread is a "lightweight" unit of execution that shares an address space with other threads. See also [unit of execution] See also [process]

Transputer. The transputer is a microprocessor developed by Inmos Ltd. with on-chip support for parallel processing. Each processor contains four high-speed communication links that are easily connected to the links of other transputers and a very efficient built-in scheduler for multiprocessing.
As seen in these examples, the fields making up a tuple can hold integers, strings, variables, arrays, or any other value defined in the base programming language. Whereas traditional memory systems access objects through an address, tuples are accessed by association. The programmer working with a tuple space defines a template and asks the system to deliver tuples matching the template. Tuple spaces were created as part of the Linda coordination language [CG91]. The Linda language is small, with only a handful of primitives to insert tuples, remove tuples, and fetch a copy of a tuple. It is combined with a base language, such as C, C++, or Fortran, to create a combined parallel programming language. In addition to its original implementations on machines with a shared address space, Linda was also implemented with a virtual shared memory and used to communicate between UEs running on the nodes of distributed-memory computers. The idea of an associative virtual shared memory as inspired by Linda has been incorporated into JavaSpaces [FHA99].
independent sequential jobs as opposed to parallel computing.