But after many years of hard work, the fact of the matter is that "they" didn't come. The overwhelming majority of programmers will not invest the effort to write parallel software. A common view is that you can't teach old programmers new tricks, so the problem will not be solved until the old programmers fade away and a new generation takes over.

But we don't buy into that defeatist attitude. Programmers have shown a remarkable ability to adopt new software technologies over the years. Look at how many old Fortran programmers are now writing elegant Java programs with sophisticated object-oriented designs. The problem isn't with old programmers. The problem is with old parallel computing experts and the way they've tried to create a pool of capable parallel programmers.

And that's where this book comes in. We want to capture the essence of how expert parallel programmers think about parallel algorithms and communicate that essential understanding in a way professional programmers can readily master. The technology we've adopted to accomplish this task is a pattern language. We made this choice not because we started the project as devotees of design patterns looking for a new field to conquer, but because patterns have been shown to work in ways that would be applicable in parallel programming. For example, patterns have been very effective in the field of object-oriented design. They have provided a common language experts can use to talk about the elements of design and have been extremely effective at helping programmers master object-oriented design.
This book contains our pattern language for parallel programming. The book opens with a couple of chapters to introduce the key concepts in parallel computing. These chapters focus on the parallel computing concepts and jargon used in the pattern language as opposed to being an exhaustive introduction to the field.

The pattern language itself is presented in four parts corresponding to the four phases of creating a parallel program:

* Finding Concurrency. The programmer works in the problem domain to identify the available concurrency and expose it for use in the algorithm design.
* Algorithm Structure. The programmer works with high-level structures for organizing a parallel algorithm.
* Supporting Structures. We shift from algorithms to source code and consider how the parallel program will be organized and the techniques used to manage shared data.
* Implementation Mechanisms. The final step is to look at specific software constructs for implementing a parallel program.

The patterns making up these four design spaces are tightly linked. You start at the top (Finding Concurrency), work through the patterns, and by the time you get to the bottom (Implementation Mechanisms), you will have a detailed design for your parallel program.

If the goal is a parallel program, however, you need more than just a parallel algorithm. You also need a programming environment and a notation for expressing the concurrency within the program's source code. Programmers used to be confronted by a large and confusing array of parallel programming environments. Fortunately, over the years the parallel programming community has converged around three programming environments.
* OpenMP. A simple language extension to C, C++, or Fortran to write parallel programs for shared-memory computers.
* MPI. A message-passing library used on clusters and other distributed-memory computers.
* Java. An object-oriented programming language with language features supporting parallel programming on shared-memory computers and standard class libraries supporting distributed computing.

Many readers will already be familiar with one or more of these programming notations, but for readers completely new to parallel computing, we've included a discussion of these programming environments in the appendixes.

In closing, we have been working for many years on this pattern language. Presenting it as a book so people can start using it is an exciting development for us. But we don't see this as the end of this effort. We expect that others will have their own ideas about new and better patterns for parallel programming. We've assuredly missed some important features that really belong in this pattern language. We embrace change and look forward to engaging with the larger parallel computing community to iterate on this language. Over time, we'll update and improve the pattern language until it truly represents the consensus view of the parallel programming community. Then our real work will begin: using the pattern language to guide the creation of better parallel programming environments and helping people to use these technologies to write parallel software. We won't rest until the day sequential software is rare.

ACKNOWLEDGMENTS

We started working together on this pattern language in 1998. It's been a long and twisted road, starting with a vague idea about a new way to think about parallel algorithms and finishing with this book. We couldn't have done this without a great deal of help.

Mani Chandy, who thought we would make a good team, introduced Tim to Beverly and Berna. The National Science Foundation, Intel Corp., and Trinity University have supported this research at various times over the years. Help with the patterns themselves came from the people at the Pattern Languages of Programs (PLoP) workshops held in Illinois each summer. The format of these
Chapter 1. A Pattern Language for Parallel Programming
Section 1.1. INTRODUCTION
Section 1.2. PARALLEL PROGRAMMING
Section 1.3. DESIGN PATTERNS AND PATTERN LANGUAGES
Section 1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
Chapter 2. Background and Jargon of Parallel Computing
Section 2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
Section 2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
Section 2.3. PARALLEL PROGRAMMING ENVIRONMENTS
Section 2.4. THE JARGON OF PARALLEL COMPUTING
Section 2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
Section 2.6. COMMUNICATION
Section 2.7. SUMMARY
Chapter 3. The Finding Concurrency Design Space
Section 3.1. ABOUT THE DESIGN SPACE
Section 3.2. THE TASK DECOMPOSITION PATTERN
Section 3.3. THE DATA DECOMPOSITION PATTERN
Section 3.4. THE GROUP TASKS PATTERN
Section 3.5. THE ORDER TASKS PATTERN
Section 3.6. THE DATA SHARING PATTERN
Section 3.7. THE DESIGN EVALUATION PATTERN
Section 3.8. SUMMARY
Chapter 4. The Algorithm Structure Design Space
Section 4.1. INTRODUCTION
Section 4.2. CHOOSING AN ALGORITHM STRUCTURE PATTERN
Section 4.3. EXAMPLES
Section 4.4. THE TASK PARALLELISM PATTERN
Section 4.5. THE DIVIDE AND CONQUER PATTERN
Section 4.6. THE GEOMETRIC DECOMPOSITION PATTERN
Section 4.7. THE RECURSIVE DATA PATTERN
Section 4.8. THE PIPELINE PATTERN
Section 4.9. THE EVENT-BASED COORDINATION PATTERN
Chapter 5. The Supporting Structures Design Space
Section 5.1. INTRODUCTION
Section 5.2. FORCES
Section 5.3. CHOOSING THE PATTERNS
Section 5.4. THE SPMD PATTERN
Section 5.5. THE MASTER/WORKER PATTERN
Section 5.6. THE LOOP PARALLELISM PATTERN
Section 5.7. THE FORK/JOIN PATTERN
Section 5.8. THE SHARED DATA PATTERN
Section 5.9. THE SHARED QUEUE PATTERN
Section 5.10. THE DISTRIBUTED ARRAY PATTERN
Section 5.11. OTHER SUPPORTING STRUCTURES
Chapter 6. The Implementation Mechanisms Design Space
Section 6.1. OVERVIEW
Section 6.2. UE MANAGEMENT
Section 6.3. SYNCHRONIZATION
Section 6.4. COMMUNICATION
Endnotes
Appendix A: A Brief Introduction to OpenMP
Section A.1. CORE CONCEPTS
Section A.2. STRUCTURED BLOCKS AND DIRECTIVE FORMATS
Section A.3. WORKSHARING
Section A.4. DATA ENVIRONMENT CLAUSES
Section A.5. THE OpenMP RUNTIME LIBRARY
Section A.6. SYNCHRONIZATION
Section A.7. THE SCHEDULE CLAUSE
Section A.8. THE REST OF THE LANGUAGE
Appendix B: A Brief Introduction to MPI
Section B.1. CONCEPTS
Section B.2. GETTING STARTED
Section B.3. BASIC POINT-TO-POINT MESSAGE PASSING
Section B.4. COLLECTIVE OPERATIONS
Section B.5. ADVANCED POINT-TO-POINT MESSAGE PASSING
Section B.6. MPI AND FORTRAN
Section B.7. CONCLUSION
Appendix C: A Brief Introduction to Concurrent Programming in Java
Section C.1. CREATING THREADS
Section C.2. ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
Section C.3. SYNCHRONIZED BLOCKS
Section C.4. WAIT AND NOTIFY
Section C.5. LOCKS
Section C.6. OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
Section C.7. INTERRUPTS
Glossary
Bibliography
About the Authors
Index
1.1. INTRODUCTION
Computers are used to model physical systems in many fields of science, medicine, and engineering. Modelers, whether trying to predict the weather or render a scene in the next blockbuster movie, can usually use whatever computing power is available to make ever more detailed simulations. Vast amounts of data, whether customer shopping patterns, telemetry data from space, or DNA sequences, require analysis. To deliver the required power, computer designers combine multiple processing elements into a single larger system. These so-called parallel computers run multiple tasks simultaneously and solve bigger problems in less time.

Traditionally, parallel computers were rare and available for only the most critical problems. Since the mid-1990s, however, the availability of parallel computers has changed dramatically. With multithreading support built into the latest microprocessors and the emergence of multiple processor cores on a single silicon die, parallel computers are becoming ubiquitous. Now, almost every university computer science department has at least one parallel computer. Virtually all oil companies, automobile manufacturers, drug development companies, and special effects studios use parallel computing.

For example, in computer animation, rendering is the step where information from the animation files, such as lighting, textures, and shading, is applied to 3D models to generate the 2D image that makes up a frame of the film. Parallel computing is essential to generate the needed number of frames (24 per second) for a feature-length film. Toy Story, the first completely computer-generated feature-length film, released by Pixar in 1995, was processed on a "render farm" consisting of 100 dual-processor machines [PS00]. By 1999, for Toy Story 2, Pixar was using a 1,400-processor system with the improvement in processing power fully reflected in the improved details in textures, clothing, and atmospheric effects. Monsters, Inc. (2001) used a system of 250 enterprise servers each containing 14 processors for a total of 3,500 processors. It is interesting that the amount of time required to generate a frame has remained relatively constant; as computing power (both the number of processors and the speed of each processor) has increased, it has been exploited to improve the quality of the animation.

The biological sciences have taken dramatic leaps forward with the availability of DNA sequence information from a variety of organisms, including humans. One approach to sequencing, championed and used with success by Celera Corp., is called the whole genome shotgun algorithm. The idea is to break the genome into small segments, experimentally determine the DNA sequences of the segments, and then use a computer to construct the entire sequence from the segments by finding overlapping areas. The computing facilities used by Celera to sequence the human genome included 150 four-way servers plus a server with 16 processors and 64GB of memory. The calculation involved 500 million trillion base-to-base comparisons [Ein00].
The SETI@home project [SET, ACK02+] provides a fascinating example of the power of parallel computing. The project seeks evidence of extraterrestrial intelligence by scanning the sky with the world's largest radio telescope, the Arecibo Telescope in Puerto Rico. The collected data is then analyzed for candidate signals that might indicate an intelligent source. The computational task is beyond even the largest supercomputer, and certainly beyond the capabilities of the facilities available to the SETI@home project. The problem is solved with public resource computing, which turns PCs around the world into a huge parallel computer connected by the Internet. Data is broken up into work units and distributed over the Internet to client computers whose owners donate spare computing time to support the project. Each client periodically connects with the SETI@home server, downloads the data to analyze, and then sends the results back to the server. The client program is typically implemented as a screen saver so that it will devote CPU cycles to the SETI problem only when the computer is otherwise idle. A work unit currently requires an average of between seven and eight hours of CPU time on a client. More than 205,000,000 work units have been processed since the start of the project. More recently, similar technology to that demonstrated by SETI@home has been used for a variety of public resource computing projects as well as internal projects within large companies utilizing their idle PCs to solve problems ranging from drug screening to chip design validation.

Although computing in less time is beneficial, and may enable problems to be solved that couldn't be otherwise, it comes at a cost. Writing software to run on parallel computers can be difficult. Only a small minority of programmers have experience with parallel programming. If all these computers designed to exploit parallelism are going to achieve their potential, more programmers need to learn how to write parallel programs.
This book addresses this need by showing competent programmers of sequential machines how to design programs that can run on parallel computers. Although many excellent books show how to use particular parallel programming environments, this book is unique in that it focuses on how to think about and design parallel algorithms. To accomplish this goal, we will be using the concept of a pattern language. This highly structured representation of expert design experience has been heavily used in the object-oriented design community.
arithmetic is nonassociative. A good parallel programmer must take care to ensure that nondeterministic issues such as these do not affect the quality of the final answer. Creating safe parallel programs can take considerable effort from the programmer.

Even when a parallel program is "correct", it may fail to deliver the anticipated performance improvement from exploiting concurrency. Care must be taken to ensure that the overhead incurred by managing the concurrency does not overwhelm the program runtime. Also, partitioning the work among the processors in a balanced way is often not as easy as the summation example suggests. The effectiveness of a parallel algorithm depends on how well it maps onto the underlying parallel computer, so a parallel algorithm could be very effective on one parallel architecture and a disaster on another.

We will revisit these issues and provide a more quantitative view of parallel computation in the next chapter.
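The nonassociativity of floating-point arithmetic is easy to demonstrate. In the following sketch (Java, used here purely for illustration; the specific values are our own choices), the same three values are summed in two orders, as two different parallel schedules might sum them, and the results differ:

```java
public class NonAssociative {
    public static void main(String[] args) {
        // Summing the same three values in two different orders,
        // as two different parallel schedules might.
        double a = 1.0e16, b = -1.0e16, c = 1.0;

        double leftToRight = (a + b) + c;  // (1e16 - 1e16) + 1 = 1.0
        double rightToLeft = a + (b + c);  // b + c rounds to -1e16, so the sum is 0.0

        System.out.println(leftToRight);   // prints 1.0
        System.out.println(rightToLeft);   // prints 0.0
    }
}
```

A parallel reduction that sums partial results in a scheduling-dependent order can therefore produce slightly different answers from run to run, which is exactly the nondeterminism discussed above.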
Mensore PLoP (Japan). The proceedings of these workshops [Pat] provide a rich source of patterns covering a vast range of application domains in software development and have been used as a basis for several books [CS95, VCK96, MRB97, HFR99].

In his original work on patterns, Alexander provided not only a catalog of patterns, but also a pattern language that introduced a new approach to design. In a pattern language, the patterns are organized into a structure that leads the user through the collection of patterns in such a way that complex systems can be designed using the patterns. At each decision point, the designer selects an appropriate pattern. Each pattern leads to other patterns, resulting in a final design in terms of a web of patterns. Thus, a pattern language embodies a design methodology and provides domain-specific advice to the application designer. (In spite of the overlapping terminology, a pattern language is not a programming language.)
The Finding Concurrency design space is concerned with structuring the problem to expose exploitable concurrency. The designer working at this level focuses on high-level algorithmic issues and reasons about the problem to expose potential concurrency. The Algorithm Structure design space is concerned with structuring the algorithm to take advantage of potential concurrency. That is, the designer working at this level reasons about how to use the concurrency exposed in working with the Finding Concurrency patterns. The Algorithm Structure patterns describe overall strategies for exploiting concurrency. The Supporting Structures design space represents an intermediate stage between the Algorithm Structure and Implementation Mechanisms design spaces. Two important groups of patterns in this space are those that represent program-structuring approaches and those that represent commonly used shared data structures. The Implementation Mechanisms design space is concerned with how the patterns of the higher-level spaces are mapped into particular programming environments. We use it to provide descriptions of common mechanisms for process/thread management (for example, creating or destroying processes/threads) and process/thread interaction (for example, semaphores, barriers, or message passing). The items in this design space are not presented as patterns because in many cases they map directly onto elements within particular parallel programming environments. They are included in the pattern language anyway, however, to provide a complete path from problem description to code.
task a "slice" of the processor time, the operating system can allow multiple users to use the system as if each were using it alone (but with degraded performance).

Most modern operating systems can use multiple processors to increase the throughput of the system. The UNIX shell uses concurrency along with a communication abstraction known as pipes to provide a powerful form of modularity: Commands are written to accept a stream of bytes as input (the consumer) and produce a stream of bytes as output (the producer). Multiple commands can be chained together with a pipe connecting the output of one command to the input of the next, allowing complex commands to be built from simple building blocks. Each command is executed in its own process, with all processes executing concurrently. Because the producer blocks if buffer space in the pipe is not available, and the consumer blocks if data is not available, the job of managing the stream of results moving between commands is greatly simplified. More recently, with operating systems with windows that invite users to do more than one thing at a time, and the Internet, which often introduces I/O delays perceptible to the user, almost every program that contains a GUI incorporates concurrency.
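The blocking behavior that makes pipes such a convenient abstraction can be sketched with a bounded queue. The following minimal Java example (the queue capacity, item count, and -1 end-of-stream marker are illustrative assumptions, not from the text) connects a producer and a consumer the way a pipe connects two commands: the producer blocks when the buffer is full, and the consumer blocks when it is empty.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipeDemo {
    public static void main(String[] args) throws InterruptedException {
        // A small bounded buffer, playing the role of the pipe.
        BlockingQueue<Integer> pipe = new ArrayBlockingQueue<>(4);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    pipe.put(i);          // blocks when the buffer is full
                }
                pipe.put(-1);             // end-of-stream marker (illustrative)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        int sum = 0;
        for (int v; (v = pipe.take()) != -1; ) {  // take() blocks when empty
            sum += v;
        }
        producer.join();
        System.out.println(sum);          // 0+1+...+9 = 45
    }
}
```

As with shell pipes, neither side needs any explicit flow-control logic; the blocking operations manage the stream of results automatically.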
Although the fundamental concepts for safely handling concurrency are the same in parallel programs and operating systems, there are some important differences. For an operating system, the problem is not finding concurrency; the concurrency is inherent in the way the operating system functions in managing a collection of concurrently executing processes (representing users, applications, and background activities such as print spooling) and providing synchronization mechanisms so resources can be safely shared. However, an operating system must support concurrency in a robust and secure way: Processes should not be able to interfere with each other (intentionally or not), and the entire system should not crash if something goes wrong with one process. In a parallel program, finding and exploiting concurrency can be a challenge, while isolating processes from each other is not the critical concern it is with an operating system. Performance goals are different as well. In an operating system, performance goals are normally related to throughput or response time, and it may be acceptable to sacrifice some efficiency to maintain robustness and fairness in resource allocation. In a parallel program, the goal is to minimize the running time of a single program.
2.2.2. A Further Breakdown of MIMD

The MIMD category of Flynn's taxonomy is too broad to be useful on its own; this category is typically decomposed according to memory organization.

Shared memory. In a shared-memory system, all processes share a single address space and communicate with each other by writing and reading shared variables.

One class of shared-memory systems is called SMPs (symmetric multiprocessors). As shown in Fig. 2.4, all processors share a connection to a common memory and access all memory locations at equal speeds. SMP systems are arguably the easiest parallel systems to program because programmers do not need to distribute data structures among processors. Because increasing the number of processors increases contention for the memory, the processor/memory bandwidth is typically a limiting factor. Thus, SMP systems do not scale well and are limited to small numbers of processors.
Figure 2.4. The Symmetric Multiprocessor (SMP) architecture
The other main class of shared-memory systems is called NUMA (nonuniform memory access). As shown in Fig. 2.5, the memory is shared; that is, it is uniformly addressable from all processors, but some blocks of memory may be physically more closely associated with some processors than others. This reduces the memory bandwidth bottleneck and allows systems with more processors; however, as a result, the access time from a processor to a memory location can be significantly different depending on how "close" the memory location is to the processor. To mitigate the effects of nonuniform access, each processor has a cache, along with a protocol to keep cache entries coherent. Hence, another name for these architectures is cache-coherent nonuniform memory access systems (ccNUMA). Logically, programming a ccNUMA system is the same as programming an SMP, but to obtain the best performance, the programmer will need to be more careful about locality issues and cache effects.
Figure 2.5. An example of the nonuniform memory access (NUMA) architecture
Depending on the topology and technology used for the processor interconnection, communication speed can range from almost as fast as shared memory (in tightly integrated supercomputers) to orders of magnitude slower (for example, in a cluster of PCs interconnected with an Ethernet network). The programmer must explicitly program all the communication between processors and be concerned with the distribution of data.

Distributed-memory computers are traditionally divided into two classes: MPP (massively parallel processors) and clusters. In an MPP, the processors and the network infrastructure are tightly coupled and specialized for use in a parallel computer. These systems are extremely scalable, in some cases supporting the use of many thousands of processors in a single system [MSW96, IBM02].

Clusters are distributed-memory systems composed of off-the-shelf computers connected by an off-the-shelf network. When the computers are PCs running the Linux operating system, these clusters are called Beowulf clusters. As off-the-shelf networking technology improves, systems of this type are becoming more common and much more powerful. Clusters provide an inexpensive way for an organization to obtain parallel computing capabilities [Beo]. Preconfigured clusters are now available from many vendors. One frugal group even reported constructing a useful parallel system by using a cluster to harness the combined power of obsolete PCs that otherwise would have been discarded [HHS01].

Hybrid systems. These systems are clusters of nodes with separate address spaces in which each node contains several processors that share memory.

According to van der Steen and Dongarra's "Overview of Recent Supercomputers" [vdSD03], which contains a brief description of the supercomputers currently or soon to be commercially available, hybrid systems formed from clusters of SMPs connected by a fast network are currently the dominant trend in high-performance computing. For example, in late 2003, four of the five fastest computers in the world were hybrid systems [Top].
Grids. Grids are systems that use distributed, heterogeneous resources connected by LANs and/or WANs [FK03]. Often the interconnection network is the Internet. Grids were originally envisioned as a way to link multiple supercomputers to enable larger problems to be solved, and thus could be viewed as a special type of distributed-memory or hybrid MIMD machine. More recently, the idea of grid computing has evolved into a general way to share heterogeneous resources, such as computation servers, storage, application servers, information services, or even scientific instruments. Grids differ from clusters in that the various resources in the grid need not have a common point of administration. In most cases, the resources on a grid are owned by different organizations that maintain control over the policies governing use of the resources. This affects the way these systems are used, the middleware created to manage them, and most importantly for this discussion, the overhead incurred when communicating between resources within the grid.
Khoros, KOAN/Fortran-S, LAM, Legion, Lilac, Linda, LiPS, Locust, Lparx, Lucid, Maisie, Manifold, Mentat, MetaChaos, Midway, Millipede, Mirage, Modula-2*, Modula-P, MOSIX, MpC, MPC++, MPI, Multipol, Munin, Nano-Threads, NESL, NetClasses++, Papers, Para++, Paradigm, Parafrase2, Paralation, Parallaxis, Parallel Haskell, Parallel C++, ParC, ParLib++, ParLin, Parlog, Parmacs, Parti, pC, pC++, PCN, PCP:, PCU, PEACE, PENNY, PET, PETSc, PH, Phosphorus, POET, Polaris, POOL-T, SciTL, SDDA, SHMEM, SIMPLE, Sina, SISAL, SMI, SONiC, Split-C, SR, Sthreads, Strand, SUIF, SuperPascal, Synergy, TCGMSG, Telegraphos, The FORCE, Threads.h++, TRAPPER, TreadMarks, UC, uC++, UNITY, V, Vic*, Visifold V-NUS, VPE, Athapascan-0b, DSM-Threads, Aurora, Automap, bb_threads, Blaze, BlockComm, BSP, C*, C**, C4, CarlOS, Cashmere, CC++, Charlotte, Charm, Charm++, Chu, Cid, Cilk, CM Fortran, Code, Ease, ECO, Eilean, Emerald, EPL, Excalibur, Express, Falcon, Filaments, FLASH, FM, Fork, Fortran-M, FX, GA, GAMMA, Glenda, GLU, GUARD, HAsL, Concurrent ML, HORUS, Converse, COOL, CORRELATE, CparPar, CPS, CRL, CSP, Cthreads, HPC, HPC++, HPF, IMPACT, ISETL-Linda, ISIS, JADA, JADE
Fortunately, by the late 1990s, the parallel programming community converged predominantly on two environments for parallel programming: OpenMP [OMP] for shared memory and MPI [Mesb] for message passing.

OpenMP is a set of language extensions implemented as compiler directives. Implementations are currently available for Fortran, C, and C++. OpenMP is frequently used to incrementally add parallelism to sequential code. By adding a compiler directive around a loop, for example, the compiler can be instructed to generate code to execute the iterations of the loop in parallel. The compiler takes care of most of the details of thread creation and management. OpenMP programs tend to work very well on SMPs, but because its underlying programming model does not include a notion of nonuniform memory access times, it is less ideal for ccNUMA and distributed-memory machines.

MPI is a set of library routines that provide for process management, message passing, and some collective communication operations (these are operations that involve all the processes involved in a program, such as barrier, broadcast, and reduction). MPI programs can be difficult to write because the programmer is responsible for data distribution and explicit interprocess communication using messages. Because the programming model assumes distributed memory, MPI is a good choice for MPPs and other distributed-memory machines.

Neither OpenMP nor MPI is an ideal fit for hybrid architectures that combine multiprocessor nodes, each with multiple processes and a shared memory, into a larger system with separate address spaces for each node: The OpenMP model does not recognize nonuniform memory access times, so its data allocation can lead to poor performance on machines that are not SMPs, while MPI does not include constructs to manage data structures residing in a shared memory. One solution is a hybrid model in which OpenMP is used on each shared-memory node and MPI is used between the nodes. This works well, but it requires the programmer to work with two different programming models within a single program. Another option is to use MPI on both the shared-memory and distributed-memory portions of the algorithm and give up the advantages of a shared-memory programming model, even when the hardware directly supports it.
New high-level programming environments that simplify portable parallel programming and more accurately reflect the underlying parallel architectures are topics of current research [Cen]. Another approach, more popular in the commercial sector, is to extend MPI and OpenMP. In the mid-1990s, the MPI Forum defined an extended MPI called MPI 2.0, although implementations were not widely available at the time this was written. It is a large, complex extension to MPI that includes dynamic process creation, parallel I/O, and many other features. Of particular interest to programmers of modern hybrid architectures is the inclusion of one-sided communication. One-sided communication mimics some of the features of a shared-memory system by letting one process write into or read from the memory regions of other processes. The term "one-sided" refers to the fact that the read or write is launched by the initiating process without the explicit involvement of the other participating process.

A more sophisticated abstraction of one-sided communication is available as part of the Global Arrays [NHL96, NHK02+, Gloa] package. Global Arrays works together with MPI to help a programmer manage distributed array data. After the programmer defines the array and how it is laid out in memory, the program executes "puts" or "gets" into the array without needing to explicitly manage which MPI process "owns" the particular section of the array. In essence, the global array provides an abstraction of a globally shared array. This only works for arrays, but these are such common data structures in parallel computing that this package, although limited, can be very useful.

Just as MPI has been extended to mimic some of the benefits of a shared-memory environment, OpenMP has been extended to run in distributed-memory environments. The annual WOMPAT (Workshop on OpenMP Applications and Tools) workshops contain many papers discussing various approaches and experiences with OpenMP in clusters and ccNUMA environments.

MPI is implemented as a library of routines to be called from programs written in a sequential programming language, whereas OpenMP is a set of extensions to sequential programming languages.
They represent two of the possible categories of parallel programming environments (libraries and language extensions), and these two particular environments account for the overwhelming majority of parallel computing being done today. There is, however, one more category of parallel programming environments, namely languages with built-in features to support parallel programming. Java is such a language. Rather than being designed to support high-performance computing, Java is an object-oriented, general-purpose programming environment with features for explicitly specifying concurrent processing with shared memory. In addition, the standard I/O and network packages provide classes that make it easy for Java to perform interprocess communication between machines, thus making it possible to write programs based on both the shared-memory and the distributed-memory models. The newer java.nio packages support I/O in a way that is less convenient for the programmer, but gives significantly better performance, and Java 2 1.5 includes new support for concurrent programming, most significantly in the java.util.concurrent.* packages. Additional packages that support different approaches to parallel computing are widely available.

Although there have been other general-purpose languages, both prior to Java and more recent (for example, C#), that contained constructs for specifying concurrency, Java is the first to become widely used. As a result, it may be the first exposure for many programmers to concurrent and parallel programming. Although Java provides software engineering benefits, currently the performance of parallel Java programs cannot compete with OpenMP or MPI programs for typical scientific computing applications. The Java design has also been criticized for several deficiencies that matter in this domain (for example, a floating-point model that emphasizes portability and more reproducible results over exploiting the available floating-point hardware to the fullest, inefficient handling of
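As a small illustration of the java.util.concurrent support mentioned above, the following sketch splits a summation across a fixed pool of threads (the problem size, pool size, and class name are illustrative assumptions, not taken from the text):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    public static void main(String[] args) throws Exception {
        final int n = 1_000_000, tasks = 4;   // illustrative sizes
        ExecutorService pool = Executors.newFixedThreadPool(tasks);

        // Partition [0, n) into equal chunks, one Callable per chunk.
        List<Future<Long>> partials = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            final int lo = t * (n / tasks), hi = (t + 1) * (n / tasks);
            partials.add(pool.submit((Callable<Long>) () -> {
                long s = 0;
                for (int i = lo; i < hi; i++) s += i;
                return s;
            }));
        }

        long sum = 0;
        for (Future<Long> f : partials) sum += f.get();  // get() waits for each task
        pool.shutdown();

        System.out.println(sum);  // sum of 0..n-1 = n*(n-1)/2 = 499999500000
    }
}
```

The thread pool handles thread creation and scheduling, so the program expresses only the decomposition of the work and the combination of the partial results.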
Load balance and load balancing. To execute a parallel program, the tasks must be mapped to UEs, and the UEs to PEs. How the mappings are done can have a significant impact on the overall performance of a parallel algorithm. It is crucial to avoid the situation in which a subset of the PEs is doing most of the work while others are idle. Load balance refers to how well the work is distributed among PEs. Load balancing is the process of allocating work to PEs, either statically or dynamically, so that the work is distributed as evenly as possible.

Synchronization. In a parallel program, due to the nondeterminism of task scheduling and other factors, events in the computation might not always occur in the same order. For example, in one run, a task might read variable x before another task reads variable y; in the next run with the same input, the events might occur in the opposite order. In many cases, the order in which two events occur does not matter. In other situations, the order does matter, and to ensure that the program is correct, the programmer must introduce synchronization to enforce the necessary ordering constraints. The primitives provided for this purpose in our selected environments are discussed in the Implementation Mechanisms design space (Section 6.3).

Synchronous versus asynchronous. We use these two terms to qualitatively refer to how tightly coupled in time two events are. If two events must happen at the same time, they are synchronous; otherwise they are asynchronous. For example, message passing (that is, communication between UEs by sending and receiving messages) is synchronous if a message sent must be received before the sender can continue. Message passing is asynchronous if the sender can continue its computation regardless of what happens at the receiver, or if the receiver can continue computations while waiting for a receive to complete.
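Ordering constraints of the kind just described can be enforced with many different primitives. As a minimal sketch (Java is used here only for illustration; the latch, the field name, and the value 42 are our own assumptions), a latch makes one task wait until another has produced a value before reading it:

```java
import java.util.concurrent.CountDownLatch;

public class OrderingDemo {
    static int x;                                    // written by one task, read by another
    static final CountDownLatch ready = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Thread producer = new Thread(() -> {
            x = 42;                                  // illustrative value
            ready.countDown();                       // signal: x is now valid
        });
        producer.start();

        ready.await();                               // enforce the ordering constraint:
                                                     // do not read x until it is written
        System.out.println(x);                       // always 42, in every run
        producer.join();
    }
}
```

Without the latch, the read of x could be scheduled before the write, and the printed value would depend on the scheduler.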
Race conditions. A race condition is a kind of error peculiar to parallel programs. It occurs when the outcome of a program changes as the relative scheduling of UEs varies. Because the operating system and not the programmer controls the scheduling of the UEs, race conditions result in programs that potentially give different answers even when run on the same system with the same data. Race conditions are particularly difficult errors to debug because by their nature they cannot be reliably reproduced. Testing helps, but is not as effective as with sequential programs: A program may run correctly the first thousand times and then fail catastrophically on the thousand-and-first execution, and then run again correctly when the programmer attempts to reproduce the error as the first step in debugging.

Race conditions result from errors in synchronization. If multiple UEs read and write shared variables, the programmer must protect access to these shared variables so the reads and writes occur in a valid order regardless of how the tasks are interleaved. When many variables are shared or when they are accessed through multiple levels of indirection, verifying by inspection that no race conditions exist can be very difficult. Tools are available that help detect and fix race conditions, such as Thread Checker from Intel Corporation, and the problem remains an area of active and important research [NM92].

Deadlocks. Deadlocks are another type of error peculiar to parallel programs. A deadlock occurs when there is a cycle of tasks in which each task is blocked waiting for another to proceed. Because all are waiting for another task to do something, they will all be blocked forever. As a simple example, consider two tasks in a message-passing environment. Task A attempts to receive a message from task B, after which A will reply by sending a message of its own to task B. Meanwhile, task B attempts to
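A race condition of the kind described above is easy to produce with a shared counter. The following Java sketch (sizes and names are illustrative assumptions) increments one counter with no synchronization and one under a lock. The protected counter is always correct; the unprotected one frequently loses updates, and by how much varies from run to run, which is exactly the irreproducibility described above:

```java
public class RaceDemo {
    static int unsafeCount = 0;                 // incremented with no synchronization
    static int safeCount = 0;                   // incremented under a lock
    static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                unsafeCount++;                  // read-modify-write: can interleave badly
                synchronized (lock) {
                    safeCount++;                // reads and writes occur in a valid order
                }
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();

        System.out.println(safeCount);          // always 200000
        System.out.println(unsafeCount);        // often less than 200000, and varies
    }
}
```

Note that the buggy version may still print 200000 on some runs, which is what makes this class of error so hard to catch by testing.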
A related measure is the efficiency E, which is the speedup normalized by the number of PEs:

Equation 2.4

E(P) = \frac{S(P)}{P}

Substituting the definition of the speedup, S(P) = T_{total}(1)/T_{total}(P), gives:

Equation 2.5

E(P) = \frac{T_{total}(1)}{P \, T_{total}(P)}
Now,rewritingSintermsofthenewexpressionforTtotal(P),weobtainthefamousAmdahl'slaw: Equation2.8
Equation2.9
Equation 2.10

    lim_{P→∞} S(P) = 1/γ
Eq. 2.10 thus gives an upper bound on the speedup obtainable in an algorithm whose serial part represents γ of the total computation.

These concepts are vital to the parallel algorithm designer. In designing a parallel algorithm, it is important to understand the value of the serial fraction so that realistic expectations can be set for performance. It may not make sense to implement a complex, arbitrarily scalable parallel algorithm if 10% or more of the algorithm is serial (and 10% is fairly common).

Of course, Amdahl's law is based on assumptions that may or may not be true in practice. In real life, a number of factors may make the actual running time longer than this formula implies. For example, creating additional parallel tasks may increase overhead and the chances of contention for shared resources. On the other hand, if the original serial computation is limited by resources other than the availability of CPU cycles, the actual performance could be much better than Amdahl's law would predict. For example, a large parallel machine may allow bigger problems to be held in memory, thus reducing virtual memory paging, or multiple processors each with its own cache may allow much more of the problem to remain in the cache. Amdahl's law also rests on the assumption that for any given input, the parallel and serial implementations perform exactly the same number of computational steps. If the serial algorithm being used in the formula is not the best possible algorithm for the problem, then a clever parallel algorithm that structures the computation differently can reduce the total number of computational steps.

It has also been observed [Gus88] that the exercise underlying Amdahl's law, namely running exactly the same problem with varying numbers of processors, is artificial in some circumstances. If, say, the parallel application were a weather simulation, then when new processors were added, one would most likely increase the problem size by adding more details to the model while keeping the total execution time constant. If this is the case, then Amdahl's law, or fixed-size speedup, gives a pessimistic view of the benefits of additional processors.
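As a quick numerical illustration of Amdahl's law (a Python sketch; the 10% serial fraction is taken from the discussion above):

```python
def amdahl_speedup(gamma, P):
    """Amdahl's law: speedup on P PEs for serial fraction gamma."""
    return 1.0 / (gamma + (1.0 - gamma) / P)

# With a 10% serial fraction, speedup saturates well below the PE count;
# the upper bound is 1/gamma = 10 no matter how many PEs are added.
speedups = {P: round(amdahl_speedup(0.10, P), 2) for P in (2, 8, 64, 1024)}
# speedups == {2: 1.82, 8: 4.71, 64: 8.77, 1024: 9.91}
```

Going from 64 to 1024 PEs buys barely one extra unit of speedup, which is why the serial fraction must be understood before investing in an elaborately scalable design.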
To see this, we can reformulate the equation to give the speedup in terms of performance on a P-processor system. Earlier, in Eq. 2.2, we obtained the execution time for P processors, Ttotal(P), from the execution time of the serial terms and the execution time of the parallelizable part when executed on one processor. Here, we do the opposite and obtain Ttotal(1) from the serial and parallel terms when executed on P processors.

Equation 2.11

    Ttotal(1) = Tsetup + P · Tcompute(P) + Tfinalization
Now, we define the so-called scaled serial fraction, denoted γ_scaled, as
Equation 2.12

    γ_scaled = (Tsetup + Tfinalization) / Ttotal(P)
and then
Equation 2.13

    Ttotal(1) = (γ_scaled + P(1 − γ_scaled)) · Ttotal(P)
Rewriting the equation for speedup (Eq. 2.3) and simplifying, we obtain the scaled (or fixed-time) speedup:[1]

Equation 2.14

    S(P) = γ_scaled + P(1 − γ_scaled)
This gives exactly the same speedup as Amdahl's law, but allows a different question to be asked when the number of processors is increased. Since γ_scaled depends on P, the result of taking the limit isn't immediately obvious, but would give the same result as the limit in Amdahl's law. However, suppose we take the limit in P while holding Tcompute and thus γ_scaled constant. The interpretation is that we are increasing the size of the problem so that the total running time remains constant when more processors are added. (This contains the implicit assumption that the execution time of the serial terms does not change as the problem size grows.) In this case, the speedup is linear in P. Thus, while adding more processors to solve a fixed problem may hit the speedup limits of Amdahl's law with a relatively small number of processors, if the problem grows as more processors are added, Amdahl's law will be pessimistic. These two models of speedup, along with a fixed-memory version of speedup, are discussed in [SN90].
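The contrast with Amdahl's law is easy to see numerically (a Python sketch; the 10% scaled serial fraction is an assumed value for illustration):

```python
def scaled_speedup(gamma_scaled, P):
    """Fixed-time (scaled) speedup: S = gamma_scaled + P(1 - gamma_scaled)."""
    return gamma_scaled + P * (1.0 - gamma_scaled)

# Holding gamma_scaled at 10% while the problem grows with P, the speedup
# grows linearly in P rather than saturating at 1/gamma = 10:
speedups = {P: scaled_speedup(0.10, P) for P in (2, 8, 64, 1024)}
# approximately {2: 1.9, 8: 7.3, 64: 57.7, 1024: 921.7}
```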
2.6. COMMUNICATION
2.6.1. Latency and Bandwidth

A simple but useful model characterizes the total time for message transfer as the sum of a fixed cost plus a variable cost that depends on the length of the message.

Equation 2.15

    T_message-transfer(N) = α + N/β
The fixed cost α is called latency and is essentially the time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the data is received by the recipient. Latency (given in some appropriate time unit) includes overhead due to software and network hardware plus the time it takes for the message to traverse the communication medium. The bandwidth β (given in some measure of bytes per time unit) is a measure of the capacity of the communication medium. N is the length of the message.

The latency and bandwidth can vary significantly between systems depending on both the hardware used and the quality of the software implementing the communication protocols. Because these values can be measured with fairly simple benchmarks [DD97], it is sometimes worthwhile to measure values for α and β, as these can help guide optimizations to improve communication performance. For example, in a system in which α is relatively large, it might be worthwhile to try to restructure a program that sends many small messages to aggregate the communication into a few large messages instead. Data for several recent systems has been presented in [BBC+03].
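The payoff from aggregating messages falls directly out of this model. A small Python sketch with hypothetical values (the α and β below are made-up numbers for illustration, not measurements of any system):

```python
def transfer_time(n_bytes, alpha, beta):
    """Message-transfer model: T(N) = alpha + N/beta."""
    return alpha + n_bytes / beta

alpha = 50e-6   # hypothetical latency: 50 microseconds
beta = 100e6    # hypothetical bandwidth: 100 MB/s

many_small = 1000 * transfer_time(100, alpha, beta)  # 1000 messages of 100 bytes
one_large = transfer_time(100_000, alpha, beta)      # one 100 KB message
# many_small is about 0.051 s; one_large is about 0.00105 s.
# Aggregation pays the latency term once instead of 1000 times.
```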
2.6.2. Overlapping Communication and Computation and Latency Hiding

If we look more closely at the computation time within a single task on a single processor, it can roughly be decomposed into computation time, communication time, and idle time. The communication time is the time spent sending and receiving messages (and thus only applies to distributed-memory machines), whereas the idle time is time that no work is being done because the task is waiting for an event, such as the release of a resource held by another task.

A common situation in which a task may be idle is when it is waiting for a message to be transmitted through the system. This can occur when sending a message (as the UE waits for a reply before proceeding) or when receiving a message. Sometimes it is possible to eliminate this wait by restructuring the task to send the message and/or post the receive (that is, indicate that it wants to receive a message) and then continue the computation. This allows the programmer to overlap communication and computation. We show an example of this technique in Fig. 2.7. This style of message passing is more complicated for the programmer, because the programmer must take care to wait for the receive to complete after any work that can be overlapped with communication is completed.
Figure 2.7. Communication without (left) and with (right) support for overlapping communication and computation. Although UE 0 in the computation on the right still has some idle time waiting for the reply from UE 1, the idle time is reduced and the computation requires less total time because of UE 1's earlier start.
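The restructuring shown in Fig. 2.7 can be mimicked in any environment with nonblocking operations. A minimal Python sketch, using a thread-pool future to stand in for a posted receive (fetch_remote_data is a stand-in for a communication call, not a real API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_remote_data():
    time.sleep(0.05)          # stands in for message transit time
    return [1, 2, 3]

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fetch_remote_data)  # "post the receive" and continue
    local = sum(i * i for i in range(1000))   # computation overlapped with it
    remote = pending.result()                 # wait only when the data is needed
```

The programmer's obligation noted above shows up as the explicit result() call: the wait must happen after the overlappable work, but before the data is used.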
2.7. SUMMARY
This chapter has given a brief overview of some of the concepts and vocabulary used in parallel computing. Additional terms are defined in the glossary. We also discussed the major programming environments in use for parallel computing: OpenMP, MPI, and Java. Throughout the book, we will use these three programming environments for our examples. More details about OpenMP, MPI, and Java and how to use them to write parallel programs are provided in the appendixes.
Experienced designers working in a familiar domain may see the exploitable concurrency immediately and could move directly to the patterns in the Algorithm Structure design space.

3.1.1. Overview

Before starting to work with the patterns in this design space, the algorithm designer must first consider the problem to be solved and make sure the effort to create a parallel program will be justified: Is the problem large enough and the results significant enough to justify expending effort to solve it faster? If so, the next step is to make sure the key features and data elements within the problem are well understood. Finally, the designer needs to understand which parts of the problem are most computationally intensive, because the effort to parallelize the problem should be focused on those parts.

After this analysis is complete, the patterns in the Finding Concurrency design space can be used to start designing a parallel algorithm. The patterns in this design space can be organized into three groups.
* Decomposition Patterns. The two decomposition patterns, Task Decomposition and Data Decomposition, are used to decompose the problem into pieces that can execute concurrently.

* Dependency Analysis Patterns. This group contains three patterns that help group the tasks and analyze the dependencies among them: Group Tasks, Order Tasks, and Data Sharing. Nominally, the patterns are applied in this order. In practice, however, it is often necessary to work back and forth between them, or possibly even revisit the decomposition patterns.

* Design Evaluation Pattern. The final pattern in this space guides the algorithm designer through an analysis of what has been done so far before moving on to the patterns in the Algorithm Structure design space. This pattern is important because it often happens that the best design is not found on the first attempt, and the earlier design flaws are identified, the
Viewing the problem decomposition in terms of two distinct dimensions is somewhat artificial. A task decomposition implies a data decomposition and vice versa; hence, the two decompositions are really different facets of the same fundamental decomposition. We divide them into separate dimensions, however, because a problem decomposition usually proceeds most naturally by emphasizing one dimension of the decomposition over the other. By making them distinct, we make this design emphasis explicit and easier for the designer to understand.

3.1.3. Background for Examples

In this section, we give background information on some of the examples that are used in several patterns. It can be skipped for the time being and revisited later when reading a pattern that refers to one of the examples.
Medical imaging
PET (Positron Emission Tomography) scans provide an important diagnostic tool by allowing physicians to observe how a radioactive substance propagates through a patient's body. Unfortunately, the images formed from the distribution of emitted radiation are of low resolution, due in part to the scattering of the radiation as it passes through the body. It is also difficult to reason from the absolute radiation intensities, because different pathways through the body attenuate the radiation differently.

To solve this problem, models of how radiation propagates through the body are used to correct the images. A common approach is to build a Monte Carlo model, as described by Ljungberg and King [LK98]. Randomly selected points within the body are assumed to emit radiation (usually a gamma ray), and the trajectory of each ray is followed. As a particle (ray) passes through the body, it is attenuated by the different organs it traverses, continuing until the particle leaves the body and hits a camera model, thereby defining a full trajectory. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

This problem can be parallelized in two ways. Because each trajectory is independent, it is possible to parallelize the application by associating each trajectory with a task. This approach is discussed in the Examples section of the Task Decomposition pattern. Another approach would be to partition the body into sections and assign different sections to different processing elements. This approach is discussed in the Examples section of the Data Decomposition pattern.
Linear algebra
the pharmaceutical industry. It is also a useful test problem for computer scientists working on parallel computing: It is straightforward to understand, relevant to science at large, and difficult to parallelize effectively. As a result, it has been the subject of much research [Mat94, PH95, Pli95].

The basic idea is to treat a molecule as a large collection of balls connected by springs. The balls represent the atoms in the molecule, while the springs represent the chemical bonds between the atoms. The molecular dynamics simulation itself is an explicit time-stepping process. At each time step, the force on each atom is computed and then standard classical mechanics techniques are used to compute how the force moves the atoms. This process is carried out repeatedly to step through time and compute a trajectory for the molecular system.

The forces due to the chemical bonds (the "springs") are relatively simple to compute. These correspond to the vibrations and rotations of the chemical bonds themselves. These are short-range forces that can be computed with knowledge of the handful of atoms that share chemical bonds. The major difficulty arises because the atoms have partial electrical charges. Hence, while atoms only interact with a small neighborhood of atoms through their chemical bonds, the electrical charges cause every atom to apply a force on every other atom.

This is the famous N-body problem. On the order of N² terms must be computed to find these nonbonded forces. Because N is large (tens or hundreds of thousands) and the number of time steps in a simulation is huge (tens of thousands), the time required to compute these nonbonded forces dominates the computation. Several ways have been proposed to reduce the effort required to solve the N-body problem. We are only going to discuss the simplest one: the cutoff method.
The idea is simple. Even though each atom exerts a force on every other atom, this force decreases with the square of the distance between the atoms. Hence, it should be possible to pick a distance beyond which the force contribution is so small that it can be ignored. By ignoring the atoms that exceed this cutoff, the problem is reduced to one that scales as O(N × n), where n is the number of atoms within the cutoff volume, usually hundreds. The computation is still huge, and it dominates the overall runtime for the simulation, but at least the problem is tractable.

There are a host of details, but the basic simulation can be summarized as in Fig. 3.2. The primary data structures hold the atomic positions (atoms), the velocities of each atom (velocity), the forces exerted on each atom (forces), and lists of atoms within the cutoff distance of each atom (neighbors). The program itself is a time-stepping loop, in which each iteration computes the short-range force terms, updates the neighbor lists, and then finds the nonbonded forces. After the force on each atom has been computed, a simple ordinary differential equation is solved to update the positions and velocities. Physical properties based on atomic motions are then updated, and we go to the next time step.

There are many ways to parallelize the molecular dynamics problem. We consider the most common approach, starting with the task decomposition (discussed in the Task Decomposition pattern) and following with the associated data decomposition (discussed in the Data Decomposition pattern). This example shows how the two decompositions fit together to guide the design of the parallel algorithm.
Figure 3.2. Pseudocode for the molecular dynamics example

    Int const N                        // number of atoms
    Array of Real :: atoms(3,N)        // 3D coordinates
    Array of Real :: velocities(3,N)   // velocity vector
    Array of Real :: forces(3,N)       // force in each dimension
    Array of List :: neighbors(N)      // atoms in cutoff volume

    loop over time steps
      vibrational_forces(N, atoms, forces)
      rotational_forces(N, atoms, forces)
      neighbor_list(N, atoms, neighbors)
      non_bonded_forces(N, atoms, neighbors, forces)
      update_atom_positions_and_velocities(N, atoms, velocities, forces)
      physical_properties( ... Lots of stuff ... )
    end loop
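To make the neighbor-list idea concrete, here is a deliberately tiny Python sketch in one dimension (the function name follows the pseudocode loosely; a real code would use three dimensions and a cell or Verlet list rather than this O(N²) scan):

```python
def neighbor_list(positions, cutoff):
    """For each atom, the indices of the other atoms within the cutoff."""
    n = len(positions)
    return [[j for j in range(n)
             if j != i and abs(positions[i] - positions[j]) <= cutoff]
            for i in range(n)]

atoms = [0.0, 0.4, 2.0, 2.1]          # 1D positions of four atoms
neighbors = neighbor_list(atoms, cutoff=0.5)
# neighbors == [[1], [0], [3], [2]]: two well-separated pairs
```

The nonbonded force computation then only loops over each atom's (short) neighbor list instead of over all N atoms.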
Forces

The main forces influencing the design at this point are flexibility, efficiency, and simplicity.
* Flexibility. Flexibility in the design will allow it to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.

* Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization). For a task decomposition, this means we need enough tasks to keep all the PEs busy, with enough work per task to compensate for overhead incurred to manage dependencies. However, the drive for efficiency can lead to complex decompositions that lack flexibility.

* Simplicity. The task decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.
Solution

The key to an effective task decomposition is to ensure that the tasks are sufficiently independent so that managing dependencies takes only a small fraction of the program's overall execution time. It is also important to ensure that the execution of the tasks can be evenly distributed among the ensemble of PEs (the load-balancing problem).

In an ideal world, the compiler would find the tasks for the programmer. Unfortunately, this almost never happens. Instead, it must usually be done by hand based on knowledge of the problem and the code required to solve it. In some cases, it might be necessary to completely recast the problem into a form that exposes relatively independent tasks.

In a task-based decomposition, we look at the problem as a collection of distinct tasks, paying particular attention to
* In some cases, each task corresponds to a distinct call to a function. Defining a task for each function call leads to what is sometimes called a functional decomposition.

* Another place to find tasks is in distinct iterations of the loops within an algorithm. If the iterations are independent and there are enough of them, then it might work well to base a task decomposition on mapping each iteration onto a task. This style of task-based decomposition leads to what are sometimes called loop-splitting algorithms.

* Tasks also play a key role in data-driven decompositions. In this case, a large data structure is decomposed and multiple units of execution concurrently update different chunks of the data structure. In this case, the tasks are those updates on individual chunks.
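A loop-splitting decomposition is the easiest of these to show in code. A Python sketch for illustration (the book's environments would express this with, for example, an OpenMP parallel for; here a thread pool maps independent iterations onto tasks):

```python
from concurrent.futures import ThreadPoolExecutor

def body(i):
    return i * i          # stands in for one independent loop iteration

# Sequential version: results = [body(i) for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(body, range(8)))  # one task per iteration
# results == [0, 1, 4, 9, 16, 25, 36, 49], same as the sequential loop
```

Because the iterations share no data, the parallel version produces exactly the sequential result regardless of how the tasks are scheduled.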
Also keep in mind the forces given in the Forces section:
* Flexibility. The design needs to be flexible in the number of tasks generated. Usually this is done by parameterizing the number and size of tasks on some appropriate dimension. This will let the design be adapted to a wide range of parallel computers with different numbers of processors.

* Efficiency. There are two major efficiency issues to consider in the task decomposition. First, each task must include enough work to compensate for the overhead incurred by creating the tasks and managing their dependencies. Second, the number of tasks should be large enough so that all the units of execution are busy with useful work throughout the computation.

* Simplicity. Tasks should be defined in a way that makes debugging and maintenance simple. When possible, tasks should be defined so they reuse code from existing sequential programs that solve related problems.
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

It is natural to associate a task with each trajectory. These tasks are particularly simple to manage concurrently because they are completely independent. Furthermore, there are large numbers of trajectories, so there will be many tasks, making this decomposition suitable for a large range of computer systems, from a shared-memory system with a small number of processing elements to a large cluster with hundreds of processing elements.

With the basic tasks defined, we now consider the corresponding data decomposition, that is, we define the data associated with each task. Each task needs to hold the information defining the trajectory. But that is not all: The tasks need access to the model of the body as well. Although it might not be apparent from our description of the problem, the body model can be extremely large. Because it is a read-only model, this is no problem if there is an effective shared-memory system; each task can read data as needed. If the target platform is based on a distributed-memory architecture, however, the body model will need to be replicated on each PE. This can be very time consuming and can waste a great deal of memory. For systems with small memories per PE and/or with slow networks between PEs, a decomposition of the problem based on the body model might be more effective.

This is a common situation in parallel programming: Many problems can be decomposed primarily in terms of data or primarily in terms of tasks. If a task-based decomposition avoids the need to break up and distribute complex data structures, it will be a much simpler program to write and debug. On the other hand, if memory and/or network bandwidth is a limiting factor, a decomposition that focuses on
Consider the multiplication of two matrices (C = A · B), as described in Sec. 3.1.3. We can produce a task-based decomposition of this problem by considering the calculation of each element of the product matrix as a separate task. Each task needs access to one row of A and one column of B. This decomposition has the advantage that all the tasks are independent, and because all the data that is shared among tasks (A and B) is read-only, it will be straightforward to implement in a shared-memory environment.

The performance of this algorithm, however, would be poor. Consider the case where the three matrices are square and of order N. For each element of C, N elements from A and N elements from B would be required, resulting in 2N memory references for N multiply/add operations. Memory access time is slow compared to floating-point arithmetic, so the bandwidth of the memory subsystem would limit the performance.

A better approach would be to design an algorithm that maximizes reuse of data loaded into a processor's caches. We can arrive at this algorithm in two different ways. First, we could group together the element-wise tasks we defined earlier so the tasks that use similar elements of the A and B matrices run on the same UE (see the Group Tasks pattern). Alternatively, we could start with the data decomposition and design the algorithm from the beginning around the way the matrices fit into the caches. We discuss this example further in the Examples section of the Data Decomposition pattern.
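The element-wise task decomposition just described can be sketched as follows (Python for illustration: one task per element of C, with A and B shared read-only):

```python
from concurrent.futures import ThreadPoolExecutor

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
n = len(A)

def element_task(ij):
    """One task: compute C[i][j] from row i of A and column j of B."""
    i, j = ij
    return sum(A[i][k] * B[k][j] for k in range(n))

with ThreadPoolExecutor() as pool:
    flat = list(pool.map(element_task,
                         [(i, j) for i in range(n) for j in range(n)]))
C = [flat[i * n:(i + 1) * n] for i in range(n)]
# C == [[19, 22], [43, 50]]
```

Each task touches 2N elements to perform N multiply/add operations, which is exactly the poor compute-to-memory ratio discussed above.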
Molecular dynamics
    loop over time steps
      vibrational_forces(N, atoms, forces)
      rotational_forces(N, atoms, forces)
      neighbor_list(N, atoms, neighbors)
      non_bonded_forces(N, atoms, neighbors, forces)
      update_atom_positions_and_velocities(N, atoms, velocities, forces)
      physical_properties( ... Lots of stuff ... )
    end loop
Second, the physical_properties() function computes energies, correlation coefficients, and a host of interesting physical properties. These computations, however, are simple and do not significantly affect the program's overall runtime, so we will ignore them in this discussion.

Because the bulk of the computation time will be in non_bonded_forces(), we must pick a problem decomposition that makes that computation run efficiently in parallel. The problem is made easier by the fact that each of the functions inside the time loop has a similar structure: In the sequential version, each function includes a loop over atoms to compute contributions to the force vector. Thus, a natural task definition is the update required by each atom, which corresponds to a loop iteration in the sequential version. After performing the task decomposition, therefore, we obtain the following tasks.
Problem

How can a problem's data be decomposed into units that can be operated on relatively independently?

Context

The parallel algorithm designer must have a detailed understanding of the problem being solved. In addition, the designer should identify the most computationally intensive parts of the problem, the key data structures required to solve the problem, and how data is used as the problem's solution unfolds.

After the basic problem is understood, the parallel algorithm designer should consider the tasks that make up the problem and the data decomposition implied by the tasks. Both the task and data decompositions need to be addressed to create a parallel algorithm. The question is not which decomposition to do. The question is which one to start with. A data-based decomposition is a good starting point if the following is true.
* Flexibility. The size and number of data chunks should be flexible to support the widest range of parallel systems. One approach is to define chunks whose size and number are controlled by a small number of parameters. These parameters define granularity knobs that can be varied to modify the size of the data chunks to match the needs of the underlying hardware. (Note, however, that many designs are not infinitely adaptable with respect to granularity.)

  The easiest place to see the impact of granularity on the data decomposition is in the overhead required to manage dependencies between chunks. The time required to manage dependencies must be small compared to the overall runtime. In a good data decomposition, the dependencies scale at a lower dimension than the computational effort associated with each chunk. For example, in many finite difference programs, the cells at the boundaries between chunks, that is, the surfaces of the chunks, must be shared. The size of the set of dependent cells scales as the surface area, while the effort required in the computation scales as the volume of the chunk. This means that the computational effort can be scaled (based on the chunk's volume) to offset overheads associated with data dependencies (based on the surface area of the chunk).
* Efficiency. It is important that the data chunks be large enough that the amount of work to update the chunk offsets the overhead of managing dependencies. A more subtle issue to consider is how the chunks map onto UEs. An effective parallel algorithm must balance the load between UEs. If this isn't done well, some PEs might have a disproportionate amount of work, and the overall scalability will suffer. This may require clever ways to break up the problem. For example, if the problem clears the columns in a matrix from left to right, a column mapping of the matrix will cause problems as the UEs with the leftmost columns will finish their work before the others. A row-based block decomposition or even a block-cyclic decomposition (in which rows are assigned cyclically to PEs) would do a much better job of keeping all the processors fully occupied. These issues are discussed in more detail in the Distributed Array pattern.
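Two of the points above lend themselves to quick numeric sketches (Python, with made-up sizes): the surface-to-volume argument for chunk granularity, and the cyclic row mapping for load balance.

```python
# 1) Surface-to-volume: for a cubic chunk of side n, shared boundary cells
#    scale as the surface area (6 n^2) while the work scales as the volume
#    (n^3), so the dependency-overhead fraction falls as 6/n as chunks grow.
def surface_to_volume(n):
    return (6 * n * n) / n ** 3

ratios = {n: surface_to_volume(n) for n in (4, 8, 16)}
# ratios == {4: 1.5, 8: 0.75, 16: 0.375}

# 2) Cyclic mapping: assigning row r to PE (r mod P) keeps all PEs involved
#    even when the computation sweeps across the rows from one side.
def cyclic_owner(row, num_pes):
    return row % num_pes

assignment = [cyclic_owner(r, 4) for r in range(10)]
# assignment == [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```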
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

In a data-based decomposition of this problem, the body model is the large central data structure around which the computation can be organized. The model is broken into segments, and one or more segments are associated with each processing element. The body segments are only read, not written, during the trajectory computations, so there are no data dependencies created by the decomposition of the body model.

After the data has been decomposed, we need to look at the tasks associated with each data segment. In this case, each trajectory passing through the data segment defines a task. The trajectories are initiated and propagated within a segment. When a segment boundary is encountered, the trajectory must be passed between segments. It is this transfer that defines the dependencies between data chunks.

On the other hand, in a task-based approach to this problem (as discussed in the Task Decomposition pattern), the trajectories for each particle drive the algorithm design. Each PE potentially needs to access the full body model to service its set of trajectories. In a shared-memory environment, this is easy because the body model is a read-only data set. In a distributed-memory environment, however, this would require substantial startup overhead as the body model is broadcast across the system.
This is a common situation in parallel programming: Different points of view lead to different algorithms with potentially very different performance characteristics. The task-based algorithm is simple, but it only works if each processing element has access to a large memory and if the overhead incurred loading the data into memory is insignificant compared to the program's runtime. An algorithm driven by a data decomposition, on the other hand, makes efficient use of memory and (in distributed-memory environments) less use of network bandwidth, but it incurs more communication overhead during the concurrent part of computation and is significantly more complex. Choosing which is the appropriate approach can be difficult and is discussed further in the Design Evaluation pattern.
Matrix multiplication
Consider the standard multiplication of two matrices (C = A · B), as described in Sec. 3.1.3. Several data-based decompositions are possible for this problem. A straightforward one would be to decompose the product matrix C into a set of row blocks (sets of adjacent rows). From the definition of matrix multiplication, computing the elements of a row block of C requires the full B matrix, but only the corresponding row block of A. With such a data decomposition, the basic task in the algorithm becomes the computation of the elements in a row block of C.

An even more effective approach that does not require the replication of the full B matrix is to decompose all three matrices into submatrices or blocks. The basic task then becomes the update of a C block, with the A and B blocks being cycled among the tasks as the computation proceeds. This decomposition, however, is much more complex to program; communication and computation must be carefully coordinated during the most time-critical portions of the problem. We discuss this example further in the Geometric Decomposition and Distributed Array patterns.

One of the features of the matrix multiplication problem is that the ratio of floating-point operations (O(N³)) to memory references (O(N²)) is large, meaning each matrix element is reused many times. This implies that it is especially important to take into account the memory access patterns to maximize reuse of data from the cache. The most effective approach is to use the block (submatrix) decomposition and adjust the size of the blocks so the subproblems fit into cache. We could arrive at the same algorithm by carefully grouping together the element-wise tasks that were identified in the Examples section of the Task Decomposition pattern, but starting with a data decomposition and assigning a task to update each submatrix seems easier to understand.
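The block decomposition can be sketched as follows (Python for illustration, with B set to the identity so the result is easy to check; a real code would choose the block size s so the working set fits in cache and would distribute the blocks across UEs):

```python
N, s = 4, 2                                        # matrix order and block size
A = [[i * N + j for j in range(N)] for i in range(N)]
B = [[1 if i == j else 0 for j in range(N)] for i in range(N)]  # identity
C = [[0] * N for _ in range(N)]

for bi in range(0, N, s):                          # block row of C
    for bj in range(0, N, s):                      # block column of C: one task
        for bk in range(0, N, s):                  # A and B blocks cycle through
            for i in range(bi, bi + s):
                for j in range(bj, bj + s):
                    C[i][j] += sum(A[i][k] * B[k][j]
                                   for k in range(bk, bk + s))
# Since B is the identity, C ends up equal to A.
```

The innermost three loops touch only one block of each matrix at a time, which is the data reuse the text describes.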
Molecular dynamics
* Tasks that find the vibrational forces on an atom
* Tasks that find the rotational forces on an atom
* Tasks that find the nonbonded forces on an atom
* Tasks that update the position and velocity of an atom
* A task to update the neighbor list for all the atoms (which we will leave sequential)

The key data structures are

* An array of atom coordinates, one element per atom
* An array of atom velocities, one element per atom
* An array of lists, one per atom, each defining the neighborhood of atoms within the cutoff distance of the atom
* An array of forces on atoms, one element per atom
Data decompositions are very common in parallel scientific computing. The parallel linear algebra
waiting on one group to finish filling a file before another group can begin reading it), we can satisfy that constraint once for the whole group. If a group of tasks must work together on a shared data structure, the required synchronization can be worked out once for the whole group. If a set of tasks are independent, combining them into a single group and scheduling them for execution as a single large group can simplify the design and increase the available concurrency (thereby letting the solution scale to more PEs).

In each case, the idea is to define groups of tasks that share constraints and simplify the problem of managing constraints by dealing with groups rather than individual tasks.

Solution

Constraints among tasks fall into a few major categories.
* The easiest dependency to understand is a temporal dependency, that is, a constraint on the order in which a collection of tasks executes. If task A depends on the results of task B, for example, then task A must wait until task B completes before it can execute. We can usually think of this case in terms of data flow: Task A is blocked waiting for the data to be ready from task B; when B completes, the data flows into A. In some cases, A can begin computing as soon as data starts to flow from B (for example, pipeline algorithms as described in the Pipeline pattern).

* Another type of ordering constraint occurs when a collection of tasks must run at the same time. For example, in many data-parallel problems, the original problem domain is divided into multiple regions that can be updated in parallel. Typically, the update of any given region requires information about the boundaries of its neighboring regions. If all of the regions are not processed at the same time, the parallel program could stall or deadlock as some regions wait for data from inactive regions.

* In some cases, tasks in a group are truly independent of each other. These tasks do not have an ordering constraint among them. This is an important feature of a set of tasks because it means they can execute in any order, including concurrently, and it is important to clearly note when this holds.

The goal of this pattern is to group tasks based on these constraints, because of the following.

* By grouping tasks, we simplify the establishment of partial orders between tasks, since ordering constraints can be applied to groups rather than to individual tasks.

* Grouping tasks makes it easier to identify which tasks must execute concurrently.
First, look at how the original problem was decomposed. In most cases, a high-level operation (for example, solving a matrix) or a large iterative program structure (for example, a loop) plays a key role in defining the decomposition. This is the first place to look for grouping tasks. The tasks that correspond to a high-level operation naturally group together.

At this point, there may be many small groups of tasks. In the next step, we will look at the constraints shared between the tasks within a group. If the tasks share a constraint, usually in terms of the update of a shared data structure, keep them as a distinct group. The algorithm design will need to ensure that these tasks execute at the same time. For example, many problems involve the coordinated update of a shared data structure by a set of tasks. If these tasks do not run concurrently, the program could deadlock.
Next, we ask if any other task groups share the same constraint. If so, merge the groups together. Large task groups provide additional concurrency to keep more PEs busy and also provide extra flexibility in scheduling the execution of the tasks, thereby making it easier to balance the load between PEs (that is, ensure that each of the PEs spends approximately the same amount of time working on the problem).

The next step is to look at constraints between groups of tasks. This is easy when groups have a clear temporal ordering or when a distinct chain of data moves between groups. The more complex case, however, is when otherwise independent task groups share constraints between groups. In these cases, it can be useful to merge these into a larger group of independent tasks, once again because large task groups usually make for more scheduling flexibility and better scalability.
Examples
Molecular dynamics
This problem was described in Sec. 3.1.3, and we discussed its decomposition in the Task Decomposition and Data Decomposition patterns. We identified the following tasks:
these into a single group for bonded interactions. The other groups have distinct temporal or ordering constraints and therefore should not be merged.
Matrix multiplication
Theorderingmustberestrictiveenoughtosatisfyalltheconstraintssothattheresulting designiscorrect. Theorderingshouldnotbemorerestrictivethanitneedstobe.Overlyconstrainingthe solutionlimitsdesignoptionsandcanimpairprogramefficiency;thefewertheconstraints,the moreflexibilityyouhavetoshifttasksaroundtobalancethecomputationalloadamongPEs. Firstlookatthedatarequiredbyagroupoftasksbeforetheycanexecute.Afterthisdatahas beenidentified,findthetaskgroupthatcreatesitandanorderingconstraintwillbeapparent. Forexample,ifonegroupoftasks(callitA)buildsacomplexdatastructureandanother group(B)usesit,thereisasequentialorderingconstraintbetweenthesegroups.Whenthese twogroupsarecombinedinaprogram,theymustexecuteinsequence,firstAandthenB. Alsoconsiderwhetherexternalservicescanimposeorderingconstraints.Forexample,ifa programmustwritetoafileinacertainorder,thenthesefileI/Ooperationslikelyimposean orderingconstraint. Finally,itisequallyimportanttonotewhenanorderingconstraintdoesnotexist.Ifanumber oftaskgroupscanexecuteindependently,thereisamuchgreateropportunitytoexploit parallelism,soweneedtonotewhentasksareindependentaswellaswhentheyare dependent.
To identify ordering constraints, consider the following ways tasks can depend on each other.
As addressed in the Group Tasks and Order Tasks patterns, the starting point in a dependency analysis is to group tasks based on constraints among them and then determine what ordering constraints apply to groups of tasks. The next step, discussed here, is to analyze how data is shared among groups of tasks, so that access to shared data can be managed correctly.

Although the analysis that led to the grouping of tasks and the ordering constraints among them focuses primarily on the task decomposition, at this stage of the dependency analysis, the focus shifts to the data decomposition, that is, the division of the problem's data into chunks that can be updated independently, each associated with one or more tasks that handle the update of that chunk. This chunk of data is sometimes called task-local data (or just local data), because it is tightly coupled to the task(s) responsible for its update. It is rare, however, that each task can operate using only its own local data; data may need to be shared among tasks in many ways. Two of the most common situations are the following.
* In addition to task-local data, the problem's data decomposition might define some data that must be shared among tasks; for example, the tasks might need to cooperatively update a large shared data structure. Such data cannot be identified with any given task; it is inherently global to the problem. This shared data is modified by multiple tasks and therefore serves as a source of dependencies among the tasks.
* Data dependencies can also occur when one task needs access to some portion of another task's local data. The classic example of this type of data dependency occurs in finite difference methods parallelized using a data decomposition, where each point in the problem space is updated using values from nearby points, and therefore updates for one chunk of the decomposition require values from the boundaries of neighboring chunks.
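The second situation, where updates for one chunk need boundary values from neighboring chunks, can be illustrated with a toy 1-D stencil. This sketch is an assumed Jacobi-style averaging update, not an algorithm from the text; the point is that interior points of a chunk touch only task-local data, while each chunk's edge points read a neighboring chunk's boundary value.

```python
# Illustrative 1-D stencil update over a decomposition into chunks.
# Interior points read only task-local data; the first and last point of
# each chunk read a boundary value owned by a neighboring chunk.
def update_chunks(data, nchunks):
    n = len(data)
    size = n // nchunks  # assume nchunks evenly divides the data for simplicity
    new = data[:]
    for c in range(nchunks):
        lo, hi = c * size, (c + 1) * size
        for i in range(lo, hi):
            left = data[i - 1] if i > 0 else data[i]    # neighbor chunk's edge
            right = data[i + 1] if i < n - 1 else data[i]
            new[i] = (left + data[i] + right) / 3.0     # three-point average
    return new

print(update_chunks([0.0, 0.0, 3.0, 3.0], 2))  # [0.0, 1.0, 2.0, 3.0]
```

In a distributed-memory implementation, those edge reads become the "ghost cell" exchange between neighboring UEs.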
A barrier is a synchronization construct that defines a point in a program that a group of UEs must all reach before any of them are allowed to proceed.
Solution

The first step is to identify data that is shared among tasks.

This is most obvious when the decomposition is predominantly a data-based decomposition. For example, in a finite difference problem, the basic data is decomposed into blocks. The nature of the decomposition dictates that the data at the edges of the blocks is shared between neighboring blocks. In essence, the data sharing was worked out when the basic decomposition was done.

In a decomposition that is predominantly task-based, the situation is more complex. At some point in the definition of tasks, it was determined how data is passed into or out of the task and whether any data is updated in the body of the task. These are the sources of potential data sharing.

After the shared data has been identified, it needs to be analyzed to see how it is used. Shared data falls into one of the following three categories.
* Read-only. The data is read but not written. Because it is not modified, access to these values does not need to be protected. On some distributed-memory systems, it is worthwhile to replicate the read-only data so each unit of execution has its own copy.
* Effectively-local. The data is partitioned into subsets, each of which is accessed (for read or write) by only one of the tasks. (An example of this would be an array shared among tasks in such a way that its elements are effectively partitioned into sets of task-local data.) This case provides some options for handling the dependencies. If the subsets can be accessed independently (as would normally be the case with, say, array elements, but not necessarily with list elements), then it is not necessary to worry about protecting access to this data. On distributed-memory systems, such data would usually be distributed among UEs, with each UE having only the data needed by its tasks. If necessary, the data can be recombined into a single data structure at the end of the computation.
* Read-write. The data is both read and written and is accessed by more than one task. This is the general case, and includes arbitrarily complicated situations in which data is read from and written to by any number of tasks. It is the most difficult to deal with, because any access to the data (read or write) must be protected with some type of exclusive-access mechanism (locks, semaphores, etc.), which can be very expensive.

Two special cases of read-write data are common enough to deserve special mention:

* Accumulate. The data is being used to accumulate a result (for example, when computing a reduction). For each location in the shared data, the values are updated by multiple tasks, with the update taking place through some sort of associative accumulation operation. The most common accumulation operations are sum, minimum, and maximum, but any associative operation on pairs of operands can be used. For such data, each task (or, usually, each UE) has a separate copy; the accumulations occur into these local copies, which are then accumulated into a single global copy as a final step at the end of the accumulation.
* Multiple-read/single-write. The data is read by multiple tasks (all of which need its initial value), but modified by only one task (which can read and write its value arbitrarily often). Such variables occur frequently in algorithms based on data decompositions. For data of this type, at least two copies are needed, one to preserve the initial value and one to be used by the task that modifies it.
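The accumulate case can be sketched as follows. This is an illustrative Python fragment, not code from the text: each task sums into its own local copy, and the local copies are combined in a single reduction step at the end, so no lock is needed on every update.

```python
from concurrent.futures import ThreadPoolExecutor

# Accumulate-category sketch: per-task local accumulators, combined once
# at the end by an associative operation (here, sum).
def partial_sum(chunk):
    local = 0           # task-local copy of the accumulator
    for x in chunk:
        local += x      # no protection needed: only this task touches it
    return local

data = list(range(100))
chunks = [data[i::4] for i in range(4)]  # interleaved decomposition into 4 chunks
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))  # the final reduction step
print(total)  # 4950
```

The same shape (local copies plus a final reduction) is what a message-passing reduction does across UEs.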
The data sharing in this problem can be complicated. We summarize the data shared between groups in Fig. 3.5. The major shared data items are the following.
* The force array, used by each group except for the neighbor list update. This array is used as read-only data by the position update group and as accumulate data for the bonded and nonbonded force groups. Because the position update group must follow the force computations (as determined using the Order Tasks pattern), we can put this array in the accumulate category for the force groups and in the read-only category for the position update group.

The standard procedure for molecular dynamics simulations [MR95] begins by initializing the force array as a local array on each UE. Contributions to elements of the force array are then computed by each UE, with the precise terms computed being unpredictable because of the way the molecule folds in space. After all the forces have been computed, the local arrays are reduced into a single array, a copy of which is placed on each UE (see the discussion of reduction in Sec. 6.4.2 for more information).
Figure 3.5. Data sharing in molecular dynamics. We distinguish between sharing for reads, read-writes, and accumulations.
It is these four items that will guide the designer's work in the next design space (the Algorithm Structure patterns). Therefore, getting these items right and finding the best problem decomposition is important for producing a high-quality design.

In some cases, the concurrency is straightforward and there is clearly a single best way to decompose a problem. More often, however, multiple decompositions are possible. Hence, it is important before proceeding too far into the design process to evaluate the emerging design and make sure it meets the application's needs. Remember that algorithm design is an inherently iterative process, and designers should not expect to produce an optimum design on the first pass through the Finding Concurrency patterns.

Forces

The design needs to be evaluated from three perspectives.
* Suitability for the target platform. Issues such as the number of processors and how data structures are shared will influence the efficiency of any design, but the more the design depends on the target architecture, the less flexible it will be.
* Design quality. Simplicity, flexibility, and efficiency are all desirable, but possibly conflicting, attributes.
* Preparation for the next phase of the design. Are the tasks and dependencies regular or irregular (that is, are they similar in size, or do they vary)? Is the interaction between tasks synchronous or asynchronous (that is, do the interactions occur at regular intervals or at highly variable or even random times)? Are the tasks aggregated in an effective way? Understanding these issues will help choose an appropriate solution from the patterns in the Algorithm Structure design space.
Although it is desirable to delay mapping a program onto a particular target platform as long as possible, the characteristics of the target platform do need to be considered at least minimally while evaluating a design. Following are some issues relevant to the choice of target platform or platforms.

How many PEs are available? With some exceptions, having many more tasks than PEs makes it easier to keep all the PEs busy. Obviously we can't make use of more PEs than we have tasks, but having only one or a few tasks per PE can lead to poor load balance. For example, consider the case of a Monte Carlo simulation in which a calculation is repeated over and over for different sets of randomly chosen data, such that the time taken for the calculation varies considerably depending on the data. A natural approach to developing a parallel algorithm would be to treat each calculation (for a separate set of data) as a task; these tasks are then completely independent and can be scheduled however we like. But because the time for each task can vary considerably, unless there are many
more tasks than PEs, it will be difficult to achieve good load balance.

The exceptions to this rule are designs in which the number of tasks can be adjusted to fit the number of PEs in such a way that good load balance is maintained. An example of such a design is the block-based matrix multiplication algorithm described in the Examples section of the Data Decomposition pattern: Tasks correspond to blocks, and all the tasks involve roughly the same amount of computation, so adjusting the number of tasks to be equal to the number of PEs produces an algorithm with good load balance. (Note, however, that even in this case it might be advantageous to have more tasks than PEs. This might, for example, allow overlap of computation and communication.)

How are data structures shared among PEs? A design that involves large-scale or fine-grained data sharing among tasks will be easier to implement and more efficient if all tasks have access to the same memory. Ease of implementation depends on the programming environment; an environment based on a shared-memory model (all UEs share an address space) makes it easier to implement a design requiring extensive data sharing. Efficiency depends also on the target machine; a design involving extensive data sharing is likely to be more efficient on a symmetric multiprocessor (where access time to memory is uniform across processors) than on a machine that layers a shared-memory environment over physically distributed memory. In contrast, if the plan is to use a message-passing environment running on a distributed-memory architecture, a design involving extensive data sharing is probably not a good choice.

For example, consider the task-based approach to the medical imaging problem described in the Examples section of the Task Decomposition pattern. This design requires that all tasks have read access to a potentially very large data structure (the body model). This presents no problems in a shared-memory environment; it is also no problem in a distributed-memory environment in which each PE has a large memory subsystem and there is plenty of network bandwidth to handle broadcasting the large data set. However, in a distributed-memory environment with limited memory or network bandwidth, the more memory-efficient algorithm that emphasizes the data decomposition would be required.
A design that requires fine-grained data sharing (in which the same data structure is accessed repeatedly by many tasks, particularly when both reads and writes are involved) is also likely to be more efficient on a shared-memory machine, because the overhead required to protect each access is likely to be smaller than for a distributed-memory machine.

The exception to these principles would be a problem in which it is easy to group and schedule tasks in such a way that the only large-scale or fine-grained data sharing is among tasks assigned to the same unit of execution.

What does the target architecture imply about the number of UEs and how structures are shared among them? In essence, we revisit the preceding two questions, but in terms of UEs rather than PEs. This can be an important distinction to make if the target system depends on multiple UEs per PE to hide latency. There are two factors to keep in mind when considering whether a design using more than one UE per PE makes sense.

The first factor is whether the target system provides efficient support for multiple UEs per PE. Some systems do provide such support, such as the Cray MTA machines and machines built with Intel processors that utilize hyperthreading. This architectural approach provides hardware support for
extremely rapid context switching, making it practical to use in a far wider range of latency-hiding situations. Other systems do not provide good support for multiple UEs per PE. For example, an MPP system with slow context switching and/or one processor per node might run much better when there is only one UE per PE.

The second factor is whether the design can make good use of multiple UEs per PE. For example, if the design involves communication operations with high latency, it might be possible to mask that latency by assigning multiple UEs to each PE so some UEs can make progress while others are waiting on a high-latency operation. If, however, the design involves communication operations that are tightly synchronized (for example, pairs of blocking send/receives) and relatively efficient, assigning multiple UEs to each PE is more likely to interfere with ease of implementation (by requiring extra effort to avoid deadlock) than to improve efficiency.

On the target platform, will the time spent doing useful work in a task be significantly greater than the time taken to deal with dependencies? A critical factor in determining whether a design is effective is the ratio of time spent doing computation to time spent in communication or synchronization: The higher the ratio, the more efficient the program. This ratio is affected not only by the number and type of coordination events required by the design, but also by the characteristics of the target platform. For example, a message-passing design that is acceptably efficient on an MPP with a fast interconnect network and relatively slow processors will likely be less efficient, perhaps unacceptably so, on an Ethernet-connected network of powerful workstations.

Note that this critical ratio is also affected by problem size relative to the number of available PEs, because for a fixed problem size, the time spent by each processor doing computation decreases with the number of processors, while the time spent by each processor doing coordination might stay the same or even increase as the number of processors increases.
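The effect of this critical ratio on fixed-size problems can be made concrete with a small model. The numbers below are invented for illustration: total computation is fixed, per-PE coordination cost is assumed constant, and efficiency is computed as compute time over total time per PE.

```python
# Toy model of the computation-to-coordination ratio for a fixed problem size.
# As PEs are added, per-PE compute time shrinks while coordination cost
# (assumed constant here) does not, so efficiency falls.
def efficiency(total_compute, coord_per_pe, n_pes):
    compute_per_pe = total_compute / n_pes
    return compute_per_pe / (compute_per_pe + coord_per_pe)

# Hypothetical numbers: 1600 time units of work, 25 units of coordination per PE.
for p in (4, 16, 64):
    print(p, round(efficiency(total_compute=1600.0, coord_per_pe=25.0, n_pes=p), 2))
# 4 0.94
# 16 0.8
# 64 0.5
```

The same arithmetic explains why a design that scales well on a fast interconnect may scale poorly on a slow network: a larger `coord_per_pe` pushes the crossover to far fewer PEs.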
Design quality
* Is the decomposition flexible in the number of tasks generated? Such flexibility allows the design to be adapted to a wide range of parallel computers.
* Is the definition of tasks implied by the task decomposition independent of how they are scheduled for execution? Such independence makes the load-balancing problem easier to solve.
* Can the size and number of chunks in the data decomposition be parameterized? Such parameterization makes a design easier to scale for varying numbers of PEs.
* Does the algorithm handle the problem's boundary cases? A good design will handle all relevant cases, even unusual ones. For example, a common operation is to transpose a matrix so that a distribution in terms of blocks of matrix columns becomes a distribution in terms of
blocks of matrix rows. It is easy to write down the algorithm and code it for square matrices where the matrix order is evenly divided by the number of PEs. But what if the matrix is not square, or what if the number of rows is much greater than the number of columns and neither number is evenly divided by the number of PEs? This requires significant changes to the transpose algorithm. For a rectangular matrix, for example, the buffer that will hold the matrix block will need to be large enough to hold the larger of the two blocks. If either the row or column dimension of the matrix is not evenly divisible by the number of PEs, then the blocks will not be the same size on each PE. Can the algorithm deal with the uneven load that will result from having different block sizes on each PE?

Efficiency. The program should effectively utilize the available computing resources. The rest of this section gives a partial list of important factors to check. Note that typically it is not possible to simultaneously optimize all of these factors; design tradeoffs are inevitable.
* Can the computational load be evenly balanced among the PEs? This is easier if the tasks are independent, or if they are roughly the same size.
* Is the overhead minimized? Overhead can come from several sources, including creation and scheduling of the UEs, communication, and synchronization. Creation and scheduling of UEs involves overhead, so each UE needs to have enough work to do to justify this overhead. On the other hand, more UEs allow for better load balance.

Communication can also be a source of significant overhead, particularly on distributed-memory platforms that depend on message passing. As we discussed in Section 2.6, the time to transfer a message has two components: a latency cost, arising from operating-system overhead and message startup costs on the network, and a cost that scales with the length of the message. To minimize the latency costs, the number of messages to be sent should be kept to a minimum. In other words, a small number of large messages is better than a large number of small ones. The second term is related to the bandwidth of the network. These costs can sometimes be hidden by overlapping communication with computation.

On shared-memory machines, synchronization is a major source of overhead. When data is shared between UEs, dependencies arise requiring one task to wait for another to avoid race conditions. The synchronization mechanisms used to control this waiting are expensive compared to many operations carried out by a UE. Furthermore, some synchronization constructs generate significant memory traffic as they flush caches, buffers, and other system resources to make sure UEs see a consistent view of memory. This extra memory traffic can interfere with the explicit data movement within a computation. Synchronization overhead can be reduced by keeping data well localized to a task, thereby minimizing the frequency of synchronization operations.
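The two-component message cost can be checked with a quick calculation. The latency and bandwidth figures below are invented for illustration; the point is that latency is paid once per message, so sending the same data as one large message is much cheaper than as many small ones.

```python
# Two-component message cost model: time = n_messages * (alpha + beta * m),
# where alpha is per-message latency and beta is per-byte transfer cost.
# The alpha and beta values here are hypothetical, chosen only to illustrate.
def transfer_time(n_messages, bytes_per_message, alpha=1e-4, beta=1e-8):
    return n_messages * (alpha + beta * bytes_per_message)

total_bytes = 1_000_000
many_small = transfer_time(1000, total_bytes // 1000)  # 1000 messages of 1 KB
one_large = transfer_time(1, total_bytes)              # a single 1 MB message
print(many_small, one_large)       # 0.11 vs 0.0101 time units
print(many_small > one_large)      # True: latency dominates the small messages
```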
The problem decomposition carried out with the Finding Concurrency patterns defines the key components that will guide the design in the Algorithm Structure design space:
Before moving on in the design, consider these components relative to the following questions.

How regular are the tasks and their data dependencies? Regular tasks are similar in size and effort. Irregular tasks would vary widely among themselves. If the tasks are irregular, the scheduling of the tasks and their sharing of data will be more complicated and will need to be emphasized in the design.

In a regular decomposition, all the tasks are in some sense the same: roughly the same computation (on different sets of data), roughly the same dependencies on data shared with other tasks, etc. Examples include the various matrix multiplication algorithms described in the Examples sections of the Task Decomposition, Data Decomposition, and other patterns.

In an irregular decomposition, the work done by each task and/or the data dependencies vary among tasks. For example, consider a discrete-event simulation of a large system consisting of a number of distinct components. We might design a parallel algorithm for this simulation by defining a task for each component and having them interact based on the discrete events of the simulation. This would be a very irregular design in that there would be considerable variation among tasks with regard to work done and dependencies on other tasks.

Are interactions between tasks (or task groups) synchronous or asynchronous? In some designs, the interaction between tasks is also very regular with regard to time; that is, it is synchronous. For example, a typical approach to parallelizing a linear-algebra problem involving the update of a large matrix is to partition the matrix among tasks and have each task update its part of the matrix, using data from both its and other parts of the matrix. Assuming that all the data needed for the update is present at the start of the computation, these tasks will typically first exchange information and then compute independently. Another type of example is a pipeline computation (see the Pipeline pattern), in which we perform a multistep operation on a sequence of sets of input data by setting up an assembly line of tasks (one for each step of the operation), with data flowing from one task to the next as each task accomplishes its work. This approach works best if all of the tasks stay more or less in step; that is, if their interaction is synchronous.
In other designs, the interaction between tasks is not so chronologically regular. An example is the discrete-event simulation described previously, in which the events that lead to interaction between tasks can be chronologically irregular.

Are the tasks grouped in the best way? The temporal relations are easy: Tasks that can run at the same time are naturally grouped together. But an effective design will also group tasks together based on their logical relationship in the overall problem.

As an example of grouping tasks, consider the molecular dynamics problem discussed in the Examples sections of the Group Tasks, Order Tasks, and Data Sharing patterns. The grouping we eventually arrive at (in the Group Tasks pattern) is hierarchical: groups of related tasks based on the high-level operations of the problem, further grouped on the basis of which ones can execute concurrently. Such an approach makes it easier to reason about whether the design meets the necessary constraints (because the constraints can be stated in terms of the task groups defined by the high-level operations) while allowing for scheduling flexibility.
3.8. SUMMARY
Working through the patterns in the Finding Concurrency design space exposes the concurrency in your problem. The key elements following from that analysis are
4.1. INTRODUCTION
The first phase of designing a parallel algorithm consists of analyzing the problem to identify exploitable concurrency, usually by using the patterns of the Finding Concurrency design space. The output from the Finding Concurrency design space is a decomposition of the problem into design elements:
The key issue at this stage is to decide which pattern or patterns are most appropriate for the problem.
* Efficiency. It is crucial that a parallel program run quickly and make good use of the computer resources.
* Simplicity. A simple algorithm resulting in easy-to-understand code is easier to develop, debug, verify, and modify.
* Portability. Ideally, programs should run on the widest range of parallel computers. This will maximize the "market" for a particular program. More importantly, a program is used for many years, while any particular computer system is used for only a few years. Portable programs protect a software investment.
* Scalability. Ideally, an algorithm should be effective on a wide range of numbers of processing elements (PEs), from a few up to hundreds or even thousands.
These forces conflict in several ways, however.

Efficiency conflicts with portability: Making a program efficient almost always requires that the code take into account the characteristics of the specific system on which it is intended to run, which limits portability. A design that makes use of the special features of a particular system or programming environment may lead to an efficient program for that particular environment, but be unusable for a different platform, either because it performs poorly or because it is difficult or even impossible to implement for the new platform.

Efficiency also can conflict with simplicity: For example, to write efficient programs that use the Task Parallelism pattern, it is sometimes necessary to use complicated scheduling algorithms. These algorithms in many cases, however, make the program very difficult to understand.

Thus, a good algorithm design must strike a balance between (1) abstraction and portability and (2) suitability for a particular target architecture. The challenge faced by the designer, especially at this early phase of the algorithm design, is to leave the parallel algorithm design abstract enough to support portability while ensuring that it can eventually be implemented effectively for the parallel systems on which it will be executed.
an ideal world, however, and software designed without considering the major features of the target platform is unlikely to run efficiently.

The primary issue is how many units of execution (UEs) the system will effectively support, because an algorithm that works well for ten UEs may not work well for hundreds of UEs. It is not necessary to decide on a specific number (in fact, to do so would overly constrain the applicability of the design), but it is important to have in mind at this point an order of magnitude for the number of UEs.

Another issue is how expensive it is to share information among UEs. If there is hardware support for shared memory, information exchange takes place through shared access to common memory, and frequent data sharing makes sense. If the target is a collection of nodes connected by a slow network, however, the communication required to share information is very expensive and must be avoided wherever possible.

When thinking about both of these issues (the number of UEs and the cost of sharing information), avoid the tendency to overconstrain the design. Software typically outlives hardware, so over the course of a program's life it may be used on a tremendous range of target platforms. The goal is to obtain a design that works well on the original target platform, but at the same time is flexible enough to adapt to different classes of hardware.

Finally, in addition to multiple UEs and some way to share information among them, a parallel computer has one or more programming environments that can be used to implement parallel algorithms. Different programming environments provide different ways to create tasks and share information among UEs, and a design that does not map well onto the characteristics of the target programming environment will be difficult to implement.

4.2.2. Major Organizing Principle

When considering the concurrency in the problem, is there a particular way of looking at it that stands out and provides a high-level mechanism for organizing this concurrency?
The analysis carried out using the patterns of the Finding Concurrency design space describes the potential concurrency in terms of tasks and groups of tasks, data (both shared and task-local), and ordering constraints among task groups. The next step is to find an algorithm structure that represents how this concurrency maps onto the UEs. There is usually a major organizing principle implied by the concurrency. This usually falls into one of three camps: organization by tasks, organization by data decomposition, and organization by flow of data. We now consider each of these in more detail.

For some problems, there is really only one group of tasks active at one time, and the way the tasks within this group interact is the major feature of the concurrency. Examples include so-called embarrassingly parallel programs in which the tasks are completely independent, as well as programs in which the tasks in a single group cooperate to compute a result.

For other problems, the way data is decomposed and shared among tasks stands out as the major way to organize the concurrency. For example, many problems focus on the update of a few large data structures, and the most productive way to think about the concurrency is in terms of how this structure is decomposed and distributed among UEs. Programs to solve differential equations or carry out linear algebra computations often fall into this category because they are frequently based on updating large data structures.
Finally, for some problems, the major feature of the concurrency is the presence of well-defined interacting groups of tasks, and the key issue is how the data flows among the tasks. For example, in a signal-processing application, data may flow through a sequence of tasks organized as a pipeline, each performing a transformation on successive data elements. Or a discrete-event simulation might be parallelized by decomposing it into tasks interacting via "events". Here, the major feature of the concurrency is the way in which these distinct task groups interact.

Notice also that the most effective parallel algorithm design might make use of multiple algorithm structures (combined hierarchically, compositionally, or in sequence), and this is the point at which to consider whether such a design makes sense. For example, it often happens that the very top level of the design is a sequential composition of one or more Algorithm Structure patterns. Other designs might be organized hierarchically, with one pattern used to organize the interaction of the major task groups and other patterns used to organize tasks within the groups; for example, an instance of the Pipeline pattern in which individual stages are instances of the Task Parallelism pattern.

4.2.3. The Algorithm Structure Decision Tree

For each subset of tasks, which Algorithm Structure design pattern most effectively defines how to map the tasks onto UEs?
Having considered the questions raised in the preceding sections, we are now ready to select an algorithm structure, guided by an understanding of constraints imposed by the target platform, an appreciation of the role of hierarchy and composition, and a major organizing principle for the problem. The decision is guided by the decision tree shown in Fig. 4.2. Starting at the top of the tree, consider the concurrency and the major organizing principle, and use this information to select one of the three branches of the tree; then follow the upcoming discussion for the appropriate subtree. Notice again that for some problems, the final design might combine more than one algorithm structure: If no single structure seems suitable, it might be necessary to divide the tasks making up the problem into two or more groups, work through this procedure separately for each group, and then determine how to combine the resulting algorithm structures.
Figure 4.2. Decision tree for the Algorithm Structure design space
Organize By Tasks
Select the Organize By Flow of Data branch when the major organizing principle is how the flow of data imposes an ordering on the groups of tasks. This pattern group has two members, one that applies when this ordering is regular and static and one that applies when it is irregular and/or dynamic. Choose the Pipeline pattern when the flow of data among task groups is regular, one-way, and does not change during the algorithm (that is, the task groups can be arranged into a pipeline through which the data flows). Choose the Event-Based Coordination pattern when the flow of data is irregular, dynamic, and/or unpredictable (that is, when the task groups can be thought of as interacting via asynchronous events).

4.2.4. Re-evaluation

Is the Algorithm Structure pattern (or patterns) suitable for the target platform? It is important to frequently review decisions made so far to be sure the chosen pattern(s) are a good fit with the target platform.

After choosing one or more Algorithm Structure patterns to be used in the design, skim through their descriptions to be sure they are reasonably suitable for the target platform. (For example, if the target platform consists of a large number of workstations connected by a slow network, and one of the chosen Algorithm Structure patterns requires frequent communication among tasks, it might be difficult to implement the design efficiently.) If the chosen patterns seem wildly unsuitable for the target platform, try identifying a secondary organizing principle and working through the preceding
step again.
4.3. EXAMPLES
4.3.1. Medical Imaging

For example, consider the medical imaging problem described in Sec. 3.1.3. This application simulates a large number of gamma rays as they move through a body and out to a camera. One way to describe the concurrency is to define the simulation of each ray as a task. Because they are all logically equivalent, we put them into a single task group. The only data shared among the tasks is a large data structure representing the body, and since access to this data structure is read-only, the tasks do not depend on each other.

Because there are many independent tasks for this problem, it is less necessary than usual to consider the target platform: The large number of tasks should mean that we can make effective use of any (reasonable) number of UEs; the independence of the tasks should mean that the cost of sharing information among UEs will not have much effect on performance.

Thus, we should be able to choose a suitable structure by working through the decision tree shown previously in Fig. 4.2. Given that in this problem the tasks are independent, the only issue we really need to worry about as we select an algorithm structure is how to map these tasks onto UEs. That is, for this problem, the major organizing principle seems to be the way the tasks are organized, so we start by following the Organize By Tasks branch.

We now consider the nature of our set of tasks: whether they are arranged hierarchically or reside in an unstructured or flat set. For this problem, the tasks are in an unstructured set with no obvious hierarchical structure among them, so we choose the Task Parallelism pattern. Note that in this problem, the tasks are independent, a fact that we will be able to use to simplify the solution.

Finally, we review this decision in light of possible target-platform considerations. As we observed earlier, the key features of this problem (the large number of tasks and their independence) make it unlikely that we will need to reconsider on the grounds that the chosen structure would be difficult to implement on the target platform.

4.3.2. Molecular Dynamics

As a second example, consider the molecular dynamics problem described in Sec. 3.1.3. In the Task Decomposition pattern, we identified the following groups of tasks associated with this problem:
The tasks within each group are expressed as the iterations of a loop over the atoms within the molecular system.

We can choose a suitable algorithm structure by working through the decision tree shown earlier in Fig. 4.2. One option is to organize the parallel algorithm in terms of the flow of data among the groups of tasks. Note that only the first three task groups (the vibrational, rotational, and nonbonded force calculations) can execute concurrently; that is, they must finish computing the forces before the atomic positions, velocities, and neighbor lists can be updated. This is not very much concurrency to work with, so a different branch in Fig. 4.2 should be used for this problem.

Another option is to derive exploitable concurrency from the set of tasks within each group, in this case the iterations of a loop over atoms. This suggests an organization by tasks with a linear arrangement of tasks, or, based on Fig. 4.2, the Task Parallelism pattern should be used. Total available concurrency is large (on the order of the number of atoms), providing a great deal of flexibility in designing the parallel algorithm.

The target machine can have a major impact on the parallel algorithm for this problem. The dependencies discussed in the Data Decomposition pattern (replicated coordinates on each UE and a combination of partial sums from each UE to compute a global force array) suggest that on the order of 2 to 3 N terms (where N is the number of atoms) will need to be passed among the UEs. The computation, however, is of order nN, where n is the number of atoms in the neighborhood of each atom and considerably less than N. Hence, the communication and computation are of the same order, and management of communication overhead will be a key factor in designing the algorithm.
* Ray-tracing codes such as the medical imaging example described in the Task Decomposition pattern: Here the computation associated with each "ray" becomes a separate and completely independent task.
* The molecular dynamics example described in the Task Decomposition pattern: The update of the nonbonded force on each atom is a task. The dependencies among tasks are managed by replicating the force array on each UE to hold the partial sums for each atom. When all the tasks have completed their contributions to the nonbonded force, the individual force arrays are combined (or "reduced") into a single array holding the full summation of nonbonded forces for each atom.
* Branch-and-bound computations, in which the problem is solved by repeatedly removing a solution space from a list of such spaces, examining it, and either declaring it a solution, discarding it, or dividing it into smaller solution spaces that are then added to the list of spaces to examine. Such computations can be parallelized using this pattern by making each "examine and process a solution space" step a separate task. The tasks weakly depend on each other through the shared queue of tasks.
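The branch-and-bound shape (a pool of solution spaces that tasks examine, discard, or subdivide) can be sketched with a toy problem. This sequential driver is only an illustration of the control structure; in a parallel version, multiple tasks would pull spaces from the shared queue concurrently. The "solution space" here is an invented interval search for a target value.

```python
from collections import deque

# Branch-and-bound control structure: repeatedly remove a solution space
# from the pool, examine it, and declare, discard, or subdivide it.
def branch_and_bound(lo, hi, target):
    spaces = deque([(lo, hi)])           # the shared pool of spaces to examine
    while spaces:
        a, b = spaces.popleft()          # one "examine a space" task
        if a == b:
            if a == target:
                return a                 # declare it a solution
            continue                     # discard the space
        mid = (a + b) // 2
        if a <= target <= mid:           # subdivide: keep only the promising half
            spaces.append((a, mid))
        else:
            spaces.append((mid + 1, b))
    return None                          # pool exhausted, no solution

print(branch_and_bound(0, 100, 37))  # 37
```

Note that tasks arise dynamically here: each subdivision adds new spaces (tasks) to the pool, and the computation can stop as soon as an acceptable solution is declared.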
The common factor is that the problem can be decomposed into a collection of tasks that can execute concurrently. The tasks can be completely independent (as in the medical imaging example), or there can be dependencies among them (as in the molecular dynamics example). In most cases, the tasks will be associated with iterations of a loop, but it is possible to associate them with larger-scale program structures as well.

In many cases, all of the tasks are known at the beginning of the computation (the first two examples). However, in some cases, tasks arise dynamically as the computation unfolds, as in the branch-and-bound example.

Also, while it is usually the case that all tasks must be completed before the problem is done, for some problems, it may be possible to reach a solution without completing all of the tasks. For example, in the branch-and-bound example, we have a pool of tasks corresponding to solution spaces to be searched, and we might find an acceptable solution before all the tasks in this pool have been completed.

Forces
Ideally, the tasks into which the problem is decomposed should meet two criteria: First, there should be at least as many tasks as UEs, and preferably many more, to allow greater flexibility in scheduling. Second, the computation associated with each task must be large enough to offset the overhead associated with managing the tasks and handling any dependencies. If the initial decomposition does not meet these criteria, it is worthwhile to consider whether there is another way of decomposing the problem into tasks that does meet the criteria.

For example, in image-processing applications where each pixel update is independent, the task definition can be individual pixels, image lines, or even whole blocks in the image. On a system with a small number of nodes connected by a slow network, tasks should be large to offset high communication latencies, so basing tasks on blocks of the image is appropriate. The same problem on a system containing a large number of nodes connected by a fast (low-latency) network, however, would need smaller tasks to make sure enough work exists to keep all the UEs occupied. Notice that this imposes a requirement for a fast network, because otherwise the smaller amount of work per task will not be enough to compensate for communication overhead.
Dependencies
Dependencies among tasks have a major impact on the emerging algorithm design. There are two categories of dependencies: ordering constraints and dependencies related to shared data.

For this pattern, ordering constraints apply to task groups and can be handled by forcing the groups to execute in the required order. For example, in a task-parallel multidimensional Fast Fourier Transform, there is a group of tasks for each dimension of the transform, and synchronization or other program constructs are used to make sure computation on one dimension completes before the next dimension begins. Alternatively, we could simply think of such a problem as a sequential composition of task-parallel computations, one for each task group.

Shared-data dependencies are potentially more complicated. In the simplest case, there are no dependencies among the tasks. A surprisingly large number of problems can be cast into this form. Such problems are often called embarrassingly parallel. Their solutions are among the simplest of parallel programs; the main considerations are how the tasks are defined (as discussed previously) and scheduled (as discussed later). When data is shared among tasks, the algorithm can be much more complicated, although there are still some common cases that can be dealt with relatively easily. We can categorize dependencies as follows.
carried dependency. For example, consider the following simple loop:
    int ii = 0, jj = 0;
    for (int i = 0; i < N; i++) {
        ii = ii + 1;
        d[ii] = big_time_consuming_work(ii);
        jj = jj + i;
        a[jj] = other_big_calc(jj);
    }
"Separable"dependencies.Whenthedependenciesinvolveaccumulationintoashareddata structure,theycanbeseparatedfromthetasks("pulledoutsidetheconcurrentcomputation") byreplicatingthedatastructureatthebeginningofthecomputation,executingthetasks,and thencombiningthecopiesintoasingledatastructureafterthetaskscomplete.Oftenthe accumulationisareductionoperation,inwhichacollectionofdataelementsisreducedtoa singleelementbyrepeatedlyapplyingabinaryoperationsuchasadditionormultiplication. Inmoredetail,thesedependenciescanbemanagedasfollows:Acopyofthedatastructure usedintheaccumulationiscreatedoneachUE.Eachcopyisinitialized(inthecaseofa reduction,totheidentityelementforthebinaryoperationforexample,zeroforadditionand oneformultiplication).Eachtaskthencarriesouttheaccumulationintoitslocaldata structure,eliminatingtheshareddatadependency.Whenalltasksarecomplete,thelocaldata structuresoneachUEarecombinedtoproducethefinalglobalresult(inthecaseofa reduction,byapplyingthebinaryoperationagain).Asanexample,considerthefollowingloop tosumtheelementsofarrayf:
    for (int i = 0; i < N; i++) {
        sum = sum + f(i);
    }
Shared Data pattern.
Schedule
Two classes of schedules are used in parallel algorithms: static schedules, in which the distribution of tasks among UEs is determined at the start of the computation and does not change; and dynamic schedules, in which the distribution of tasks among UEs varies as the computation proceeds.

In a static schedule, the tasks are associated into blocks and then assigned to UEs. Block size is adjusted so each UE takes approximately the same amount of time to complete its tasks. In most applications using a static schedule, the computational resources available from the UEs are predictable and stable over the course of the computation, with the most common case being UEs that are identical (that is, the computing system is homogeneous). If the set of times required to complete each task is narrowly distributed about a mean, the sizes of the blocks should be proportional to the relative performance of the UEs (so, in a homogeneous system, they are all the same size). When the
effort associated with the tasks varies considerably, a static schedule can still be useful, but now the number of blocks assigned to UEs must be much greater than the number of UEs. By dealing out the blocks in a round-robin manner (much as a deck of cards is dealt among a group of card players), the load is balanced statistically.

Dynamic schedules are used when (1) the effort associated with each task varies widely and is unpredictable and/or (2) the capabilities of the UEs vary widely and unpredictably. The most common approach used for dynamic load balancing is to define a task queue to be used by all the UEs; when a UE completes its current task and is therefore ready to process more work, it removes a task from the task queue. Faster UEs or those receiving lighter-weight tasks will access the queue more often and thereby be assigned more tasks.

Another dynamic scheduling strategy uses work stealing, which works as follows. The tasks are distributed among the UEs at the start of the computation. Each UE has its own work queue. When the queue is empty, the UE will try to steal work from the queue on some other UE (where the other UE is usually randomly selected). In many cases, this produces an optimal dynamic schedule without incurring the overhead of maintaining a single global queue. In programming environments or packages that provide support for the construct, such as Cilk [BJK+96], Hood [BP99], or the FJTask framework [Lea00b, Lea], it is straightforward to use this approach. But with more commonly used programming environments such as OpenMP, MPI, or Java (without support such as the FJTask framework), this approach adds significant complexity and therefore is not often used.
Many task-parallel problems can be considered to be loop-based. Loop-based problems are, as the name implies, those in which the tasks are based on the iterations of a loop. The best solutions for such problems use the Loop Parallelism pattern. This pattern can be particularly simple to implement in programming environments that provide directives for automatically assigning loop iterations to UEs. For example, in OpenMP a loop can be parallelized by simply adding a "parallel for" directive with an appropriate schedule clause (one that maximizes efficiency). This solution is especially attractive because OpenMP then guarantees that the resulting program is semantically equivalent to the analogous sequential code (within roundoff error associated with different orderings of floating-point operations).

For problems in which the target platform is not a good fit with the Loop Parallelism pattern, or for problems in which the model of "all tasks known initially, all tasks must complete" does not apply (either because tasks can be created during the computation or because the computation can terminate without all tasks being complete), this straightforward approach is not the best choice. Instead, the best design makes use of a task queue; tasks are placed on the task queue as they are created and
removed by UEs until the computation is complete. The overall program structure can be based on either the Master/Worker pattern or the SPMD pattern. The former is particularly appropriate for problems requiring a dynamic schedule.

In the case in which the computation can terminate before all the tasks are complete, some care must be taken to ensure that the computation ends when it should. If we define the termination condition as the condition that, when true, means the computation is complete (either all tasks are complete or some other condition holds; for example, an acceptable solution has been found by one task), then we want to be sure that (1) the termination condition is eventually met (which, if tasks can be created dynamically, might mean building into it a limit on the total number of tasks created), and (2) when the termination condition is met, the program ends. How to ensure the latter is discussed in the Master/Worker and SPMD patterns.
Common idioms
Most problems for which this pattern is applicable fall into the following two categories.

Embarrassingly parallel problems are those in which there are no dependencies among the tasks. A wide range of problems fall into this category, ranging from rendering frames in a motion picture to statistical sampling in computational physics. Because there are no dependencies to manage, the focus is on scheduling the tasks to maximize efficiency. In many cases, it is possible to define schedules that automatically and dynamically balance the load among UEs.

Replicated data or reduction problems are those in which dependencies can be managed by "separating them from the tasks" as described earlier: replicating the data at the beginning of computation and combining results when the termination condition is met (usually "all tasks complete"). For these problems, the overall solution consists of three phases: one to replicate the data into local variables, one to solve the now-independent tasks (using the same techniques used for embarrassingly parallel problems), and one to recombine the results into a single result.

Examples

We will consider two examples of this pattern. The first example, an image-construction example, is embarrassingly parallel. The second example will build on the molecular dynamics example used in several of the Finding Concurrency patterns.
Image construction
where C and Z are complex numbers and the recurrence is started with Z0 = C. The image plots the
imaginary part of C on the vertical axis and the real part on the horizontal axis. The color of each pixel is black if the recurrence relation converges to a stable value, or is colored depending on how rapidly the relation diverges.

At the lowest level, the task is the update for a single pixel. First consider computing this set on a cluster of PCs connected by an Ethernet. This is a coarse-grained system; that is, the rate of communication is slow relative to the rate of computation. To offset the overhead incurred by the slow network, the task size needs to be large; for this problem, that might mean computing a full row of the image. The work involved in computing each row varies depending on the number of divergent pixels in the row. The variation, however, is modest and distributed closely around a mean value. Therefore, a static schedule with many more tasks than UEs will likely give an effective statistical balance of the load among nodes.

The remaining step in applying the pattern is choosing an overall structure for the program. On a shared-memory machine using OpenMP, the Loop Parallelism pattern described in the Supporting Structures design space is a good fit. On a network of workstations running MPI, the SPMD pattern (also in the Supporting Structures design space) is appropriate.

Before moving on to the next example, we consider one more target system: a cluster in which the nodes are not homogeneous; that is, some nodes are much faster than others. Assume also that the speed of each node may not be known when the work is scheduled. Because the time needed to compute the image for a row now depends both on the row and on which node computes it, a dynamic schedule is indicated. This in turn suggests that a general dynamic load-balancing scheme is indicated, which then suggests that the overall program structure should be based on the Master/Worker pattern.
Molecular dynamics
For our second example, we consider the computation of the nonbonded forces in a molecular dynamics computation. This problem is described in Sec. 3.1.3 and in [Mat95, PH95] and is used throughout the patterns in the Finding Concurrency design space. Pseudocode for this computation is shown in Fig. 4.4. The physics in this example is not relevant and is buried in code not shown here (the computation of the neighbors list and the force function). The basic computation structure is a loop over atoms, and then, for each atom, a loop over interactions with other atoms. The number of interactions per atom is computed separately when the neighbors list is determined. This routine (not shown here) computes the number of atoms within a radius equal to a preset cutoff distance. The neighbor list is also modified to account for Newton's third law: Because the force of atom i on atom j is the negative of the force of atom j on atom i, only half of the potential interactions need actually be computed. Understanding this detail is not important for understanding this example. The key is that this causes each loop over j to vary greatly from one atom to another, thereby greatly complicating the load-balancing problem. Indeed, for the purposes of this example, all that must really be understood is that calculating the force is an expensive operation and that the number of interactions per atom varies greatly. Hence, the computational effort for each iteration over i is difficult to predict in advance.
Figure 4.4. Pseudocode for the nonbonded computation in a typical molecular dynamics code

function non_bonded_forces (N, Atoms, neighbors, Forces)
    Int const N                     // number of atoms
    Array of Real :: atoms (3,N)    // 3D coordinates
    Array of Real :: forces (3,N)   // force in each dimension
    Array of List :: neighbors(N)   // atoms in cutoff volume
    Real :: forceX, forceY, forceZ

    loop [i] over atoms
        loop [j] over neighbors(i)
            forceX = non_bond_force(atoms(1,i), atoms(1,j))
            forceY = non_bond_force(atoms(2,i), atoms(2,j))
            forceZ = non_bond_force(atoms(3,i), atoms(3,j))
            force(1,i) += forceX;   force(1,j) -= forceX;
            force(2,i) += forceY;   force(2,j) -= forceY;
            force(3,i) += forceZ;   force(3,j) -= forceZ;
        end loop [j]
    end loop [i]
end function non_bonded_forces
Each component of the force term is an independent computation, meaning that each (i, j) pair is fundamentally an independent task. The number of atoms tends to be on the order of thousands, and squaring that gives a number of tasks that is more than enough for all but the largest parallel systems. Therefore, we can take the more convenient approach of defining a task as one iteration of the loop over i. The tasks, however, are not independent: The force array is read and written by each task. Inspection of the code shows that the arrays are only used to accumulate results from the computation, however. Thus, the full array can be replicated on each UE and the local copies combined (reduced) after the tasks complete.

After the replication is defined, the problem is embarrassingly parallel and the same approaches discussed previously apply. We will revisit this example in the Master/Worker, Loop Parallelism, and SPMD patterns. A choice among these patterns is normally made based on the target platforms.
Known uses
There are many application areas in which this pattern is useful, including the following.

Many ray-tracing programs use some form of partitioning, with individual tasks corresponding to scan lines in the final image [BKS91].

Applications written with coordination languages such as Linda are another rich source of examples of this pattern [BCM+91]. Linda [CG91] is a simple language consisting of only six operations that read and write an associative (that is, content-addressable) shared memory called a tuple space. The tuple space provides a natural way to implement a wide variety of shared-queue and master/worker algorithms.

Parallel computational chemistry applications also make heavy use of this pattern. In the quantum chemistry program GAMESS, the loops over two-electron integrals are parallelized with the task queue implied by the Nextval construct within TCGMSG. An early version of the distance geometry program DGEOM was parallelized with the master/worker form of this pattern. These
Forces
The traditional divide-and-conquer strategy is a widely useful approach to algorithm design. Sequential divide-and-conquer algorithms are almost trivial to parallelize based on the obvious exploitable concurrency.

As Fig. 4.5 suggests, however, the amount of exploitable concurrency varies over the life of the program. At the outermost level of the recursion (initial split and final merge), there is little or no exploitable concurrency, and the subproblems also contain split and merge sections. Amdahl's law (Chapter 2) tells us that the serial parts of a program can significantly constrain the speedup that can be achieved by adding more processors. Thus, if the split and merge computations are nontrivial compared to the amount of computation for the base cases, a program using this pattern might not be able to take advantage of large numbers of processors. Further, if there are many levels of recursion, the number of tasks can grow quite large, perhaps to the point that the overhead of managing the tasks overwhelms any benefit from executing them concurrently.

In distributed-memory systems, subproblems can be generated on one PE and executed by another, requiring data and results to be moved between the PEs. The algorithm will be more efficient if the amount of data associated with a computation (that is, the size of the parameter set and result for each subproblem) is small. Otherwise, large communication costs can dominate the performance.

In divide-and-conquer algorithms, the tasks are created dynamically as the computation proceeds, and in some cases, the resulting "task graph" will have an irregular and data-dependent structure. If this is the case, then the solution should employ dynamic load balancing.
Figure 4.6. Sequential pseudocode for the divide-and-conquer algorithm

    func solve returns Solution;     // a solution stage
    func baseCase returns Boolean;   // direct solution test
    func baseSolve returns Solution; // direct solution
    func merge returns Solution;     // combine subsolutions
    func split returns Problem[];    // split into subprobs
    Solution solve(Problem P) {
        if (baseCase(P))
            return baseSolve(P);
        else {
            Problem subProblems[N];
            Solution subSolutions[N];
            subProblems = split(P);
            for (int i = 0; i < N; i++)
                subSolutions[i] = solve(subProblems[i]);
            return merge(subSolutions);
        }
    }
A sequential divide-and-conquer algorithm has the structure shown in Fig. 4.6. The cornerstone of this structure is a recursively invoked function (solve()) that drives each stage in the solution. Inside solve, the problem is either split into smaller subproblems (using split()) or it is directly solved (using baseSolve()). In the classical strategy, recursion continues until the subproblems are simple enough to be solved directly, often with just a few lines of code each. However, efficiency can be improved by adopting the view that baseSolve() should be called when (1) the overhead of performing further splits and merges significantly degrades performance, or (2) the size of the problem is optimal for the target system (for example, when the data required for a baseSolve() fits entirely in cache).

The concurrency in a divide-and-conquer problem is obvious when, as is usually the case, the subproblems can be solved independently (and hence, concurrently). The sequential divide-and-conquer algorithm maps directly onto a task-parallel algorithm by defining one task for each invocation of the solve() function, as illustrated in Fig. 4.7. Note the recursive nature of the design, with each task in effect dynamically generating and then absorbing a task for each subproblem.
Figure 4.7. Parallelizing the divide-and-conquer strategy. Each dashed-line box represents a task.
Conceptually, this pattern follows a straightforward fork/join approach (see the Fork/Join pattern). One task splits the problem, then forks new tasks to compute the subproblems, waits until the subproblems are computed, and then joins with the subtasks to merge the results.

The easiest situation is when the split phase generates subproblems that are known to be about the same size in terms of needed computation. Then, a straightforward implementation of the fork/join strategy, mapping each task to a UE and stopping the recursion when the number of active subtasks is the same as the number of PEs, works well.

In many situations, the problem will not be regular, and it is best to create more, finer-grained tasks and use a master/worker structure to map tasks to units of execution. The implementation of this approach is described in detail in the Master/Worker pattern. The basic idea is to conceptually maintain a queue of tasks and a pool of UEs, typically one per PE. When a subproblem is split, the new tasks are placed in the queue. When a UE finishes a task, it obtains another one from the queue. In this way, all of the UEs tend to remain busy, and the solution shows a good load balance. Finer-grained tasks allow a better load balance at the cost of more overhead for task management.

Many parallel programming environments directly support the fork/join construct. For example, in OpenMP, we could easily produce a parallel application by turning the for loop of Fig. 4.6 into an OpenMP parallel for construct. Then the subproblems will be solved concurrently rather than
Mergesort is a well-known sorting algorithm based on the divide-and-conquer strategy, applied as follows to sort an array of N elements.
procedure recursively).
In the merge phase, the two (sorted) subarrays are recombined into a single sorted array.
where T1 and T2 are symmetric tridiagonal matrices (which can be diagonalized by recursive calls to the same procedure).
The merge phase recombines the diagonalizations of T1 and T2 into a diagonalization of T.
Details can be found in [DS87] or in [GL96].
Known uses
Any introductory algorithms text will have many examples of algorithms based on the divide-and-conquer strategy, most of which can be parallelized with this pattern.

Some algorithms frequently parallelized with this strategy include the Barnes-Hut [BH86] and Fast Multipole [GG90] algorithms used in N-body simulations; signal-processing algorithms, such as discrete Fourier transforms; algorithms for banded and tridiagonal linear systems, such as those found in the ScaLAPACK package [CD97, Sca]; and algorithms from computational geometry, such as convex hull and nearest neighbor.

A particularly rich source of problems that use the Divide and Conquer pattern is the FLAME project [GGHvdG01]. This is an ambitious project to recast linear algebra problems in recursive algorithms. The motivation is twofold. First, mathematically, these algorithms are naturally recursive; in fact, most pedagogical discussions of these algorithms are recursive. Second, these recursive algorithms have proven to be particularly effective at producing code that is both portable and highly optimized for the cache architectures of modern microprocessors.

Related Patterns

Just because an algorithm is based on a sequential divide-and-conquer strategy does not mean that it
must be parallelized with the Divide and Conquer pattern. A hallmark of this pattern is the recursive arrangement of the tasks, leading to a varying amount of concurrency and potentially high overheads on machines for which managing the recursion is expensive. If the recursive decomposition into subproblems can be reused, however, it might be more effective to do the recursive decomposition, and then use some other pattern (such as the Geometric Decomposition pattern or the Task Parallelism pattern) for the actual computation. For example, the first production-level molecular dynamics program to use the fast multipole method, PMD [Win95], used the Geometric Decomposition pattern to parallelize the fast multipole algorithm, even though the original fast multipole algorithm used divide and conquer. This worked because the multipole computation was carried out many times for each configuration of atoms.
Consider the multiplication of two square matrices (that is, compute C = AB). As discussed in
where at each step in the summation, we compute the matrix product AikBkj and add it to the running matrix sum.

This equation immediately implies a solution in terms of the Geometric Decomposition pattern; that is, one in which the algorithm is based on decomposing the data structure into chunks (square blocks here) that can be operated on concurrently.

To help visualize this algorithm more clearly, consider the case where we decompose all three matrices into square blocks, with each task "owning" corresponding blocks of A, B, and C. Each task will run through the sum over k to compute its block of C, with tasks receiving blocks from other tasks as needed. In Fig. 4.9, we illustrate two steps in this process, showing a block being updated (the solid block) and the matrix blocks required at two different steps (the shaded blocks), where blocks of the A matrix are passed across a row and blocks of the B matrix are passed around a column.
Figure 4.9. Data dependencies in the matrix-multiplication problem. Solid boxes indicate the "chunk" being updated (C); shaded boxes indicate the chunks of A (row) and B (column) required to update C at each of the two steps.
Forces
The granularity of the data decomposition has a significant impact on the efficiency of the program. In a coarse-grained decomposition, there are a smaller number of large chunks. This results in a smaller number of large messages, which can greatly reduce communication overhead. A fine-grained decomposition, on the other hand, results in a larger number of smaller chunks, in many cases leading to many more chunks than PEs. This results in a larger number of smaller messages (and hence increases communication overhead), but it greatly facilitates load balancing.

Although it might be possible in some cases to mathematically derive an optimum granularity for the data decomposition, programmers usually experiment with a range of chunk sizes to empirically determine the best size for a given system. This depends, of course, on the computational performance of the PEs and on the performance characteristics of the communication network. Therefore, the program should be implemented so that the granularity is controlled by parameters that can be easily changed at compile or runtime.

The shape of the chunks can also affect the amount of communication needed between tasks. Often, the data to share between tasks is limited to the boundaries of the chunks. In this case, the amount of shared information scales with the surface area of the chunks. Because the computation scales with the number of points within a chunk, it scales as the volume of the region. This surface-to-volume effect can be exploited to maximize the ratio of computation to communication. Therefore, higher-dimensional decompositions are usually preferred. For example, consider two different decompositions of an N by N matrix into four chunks. In one case, we decompose the problem into four column chunks of size N by N/4. In the second case, we decompose the problem into four square chunks of size N/2 by N/2. For the column-block decomposition, the surface area is 2N + 2(N/4), or 5N/2. For the square-chunk case, the surface area is 4(N/2), or 2N. Hence, the total amount of data that must be exchanged is less for the square-chunk decomposition.
In some cases, the preferred shape of the decomposition can be dictated by other concerns. It may be the case, for example, that existing sequential code can be more easily reused with a lower-dimensional decomposition, and the potential increase in performance is not worth the effort of reworking the code. Also, an instance of this pattern can be used as a sequential step in a larger computation. If the decomposition used in an adjacent step differs from the optimal one for this pattern in isolation, it may or may not be worthwhile to redistribute the data for this step. This is especially an issue in distributed-memory systems, where redistributing the data can require
significant communication that will delay the computation. Therefore, data decomposition decisions must take into account the capability to reuse sequential code and the need to interface with other steps in the computation. Notice that these considerations might lead to a decomposition that would be suboptimal under other circumstances.

Communication can often be more effectively managed by replicating the nonlocal data needed to update the data in a chunk. For example, if the data structure is an array representing the points on a mesh and the update operation uses a local neighborhood of points on the mesh, a common communication-management technique is to surround the data structure for the block with a ghost boundary to contain duplicates of data at the boundaries of neighboring blocks. So now each chunk has two parts: a primary copy owned by the UE (that will be updated directly) and zero or more ghost copies (also referred to as shadow copies). These ghost copies provide two benefits. First, their use may consolidate communication into potentially fewer, larger messages. On latency-sensitive networks, this can greatly reduce communication overhead. Second, communication of the ghost copies can be overlapped (that is, it can be done concurrently) with the update of parts of the array that don't depend on data within the ghost copy. In essence, this hides the communication cost behind useful computation, thereby reducing the observed communication overhead.

For example, in the case of the mesh-computation example discussed earlier, each of the chunks would be extended by one cell on each side. These extra cells would be used as ghost copies of the cells on the boundaries of the chunks. Fig. 4.10 illustrates this scheme.
Figure 4.10. A data distribution with ghost boundaries. Shaded cells are ghost copies; arrows point from primary copies to corresponding secondary copies.
A key factor in using this pattern correctly is ensuring that the nonlocal data required for the update operation is obtained before it is needed.

If all the data needed is present before the beginning of the update operation, the simplest approach is to perform the entire exchange before beginning the update, storing the required nonlocal data in a local data structure designed for that purpose (for example, the ghost boundary in a mesh computation). This approach is relatively straightforward to implement using either copying or message passing.

More sophisticated approaches in which computation and communication overlap are also possible. Such approaches are necessary if some data needed for the update is not initially available, and may improve performance in other cases as well. For example, in the example of a mesh computation, the exchange of ghost cells and the update of cells in the interior region (which do not depend on the ghost cells) can proceed concurrently. After the exchange is complete, the boundary layer (the values that do depend on the ghost cells) can be updated. On systems where communication and computation occur in parallel, the savings from such an approach can be significant. This is such a common feature of parallel algorithms that standard communication APIs (such as MPI) include whole classes of message-passing routines to overlap computation and communication. These are discussed in more
Updating the data structure is done by executing the corresponding tasks (each responsible for the update of one chunk of the data structures) concurrently. If all the needed data is present at the beginning of the update operation, and if none of this data is modified during the course of the update, parallelization is easier and more likely to be efficient.

If the required exchange of information has been performed before beginning the update operation, the update itself is usually straightforward to implement; it is essentially identical to the analogous update in an equivalent sequential program, particularly if good choices have been made about how to represent nonlocal data. If the exchange and update operations overlap, more care is needed to ensure that the update is performed correctly. If a system supports lightweight threads that are well integrated with the communication system, then overlap can be achieved via multithreading within a single task, with one thread computing while another handles communication. In this case, synchronization between the threads is required.

In some systems, for example MPI, nonblocking communication is supported by matching communication primitives: one to start the communication (without blocking), and the other (blocking) to complete the operation and use the results. For maximal overlap, communication should be started as soon as possible and completed as late as possible. Sometimes, operations can be reordered to allow more overlap without changing the algorithm semantics.
Data distribution and task scheduling
The final step in designing a parallel algorithm for a problem that fits this pattern is deciding how to map the collection of tasks (each corresponding to the update of one chunk) to UEs. Each UE can then be said to "own" a collection of chunks and the data they contain. Thus, we have a two-tiered scheme for distributing data among UEs: partitioning the data into chunks and then assigning these chunks to UEs. This scheme is flexible enough to represent a variety of popular schemes for distributing data among UEs.

In the simplest case, each task can be statically assigned to a separate UE; then all tasks can execute concurrently, and the intertask coordination needed to implement the exchange operation is straightforward. This approach is most appropriate when the computation times of the tasks are uniform and the exchange operation has been implemented to overlap communication and computation within each task.

The simple approach can lead to poor load balance in some situations, however. For example, consider a linear algebra problem in which elements of the matrix are successively eliminated as the
computation proceeds. Early in the computation, all the rows and columns of the matrix have numerous elements to work with, and decompositions based on assigning full rows or columns to UEs are effective. Later in the computation, however, rows or columns become sparse, the work per row becomes uneven, and the computational load becomes poorly balanced between UEs. The solution is to decompose the problem into many more chunks than there are UEs and to scatter them among the UEs with a cyclic or block-cyclic distribution. (Cyclic and block-cyclic distributions are discussed in the Distributed Array pattern.) Then, as chunks become sparse, there are (with high probability) other nonsparse chunks for any given UE to work on, and the load becomes well balanced. A rule of thumb is that one needs around ten times as many tasks as UEs for this approach to work well.

It is also possible to use dynamic load-balancing algorithms to periodically redistribute the chunks among the UEs to improve the load balance. These incur overhead that must be traded off against the improvement likely to occur from the improved load balance, as well as increased implementation costs. In addition, the resulting program is more complex than those that use one of the static methods. Generally, one should consider the (static) cyclic allocation strategy first.
Program structure
#include <stdio.h>
#include <stdlib.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]) {
    uk[0] = LEFTVAL; uk[NX-1] = RIGHTVAL;
    for (int i = 1; i < NX-1; ++i) uk[i] = 0.0;
    for (int i = 0; i < NX; ++i) ukp1[i] = uk[i];
}

void printValues(double uk[], int step) { /* NOT SHOWN */ }

int main(void) {
    /* pointers to arrays for two iterations of algorithm */
    double *uk = malloc(sizeof(double) * NX);
    double *ukp1 = malloc(sizeof(double) * NX);
    double *temp;
    double dx = 1.0/NX;
    double dt = 0.5*dx*dx;
    initialize(uk, ukp1);
    for (int k = 0; k < NSTEPS; ++k) {
        /* compute new values */
        for (int i = 1; i < NX-1; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* "copy" ukp1 to uk by swapping pointers */
        temp = ukp1; ukp1 = uk; uk = temp;
        printValues(uk, k);
    }
    return 0;
}
The schedule(static) clause decomposes the iterations of the parallel loop into one contiguous block per thread, with each block being approximately the same size. This schedule is important for Loop Parallelism programs implementing the Geometric Decomposition pattern. Good performance for most Geometric Decomposition problems (and mesh programs in particular) requires that the data in the processor's cache be used many times before it is displaced by data from new cache lines. Using large blocks of contiguous loop iterations increases the chance that multiple values fetched in a cache line will be utilized and that subsequent loop iterations are likely to find at least some of the required data in cache.

The last detail to discuss for the program in Fig. 4.12 is the synchronization required to safely copy the pointers. It is essential that all of the threads complete their work before the pointers they manipulate are swapped in preparation for the next iteration. In this program, this synchronization happens automatically due to the implied barrier (see Sec. 6.3.2) at the end of the parallel loop.

The program in Fig. 4.12 works well with a small number of threads. When large numbers of threads are involved, however, the overhead incurred by placing the thread creation and destruction inside the loop over k would be prohibitive. We can reduce thread-management overhead by splitting the parallel for directive into separate parallel and for directives and moving the thread creation outside the loop over k. This approach is shown in Fig. 4.13. Because the whole k loop is now inside a parallel region, we must be more careful about how data is shared between threads. The private clause causes the loop indices k and i to be local to each thread. The pointers uk and ukp1 are shared, however, so the swap operation must be protected. The easiest way to do this is to ensure that only one member of the team of threads does the swap. In OpenMP, this is most easily done by placing the update inside a single construct. As described in more detail in the OpenMP appendix, Appendix A, the first thread to encounter the construct will carry out the swap while the other threads wait at the end of the construct.
Figure 4.12. Parallel heat-diffusion program using OpenMP
#include <stdio.h>
#include <stdlib.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]) {
    uk[0] = LEFTVAL; uk[NX-1] = RIGHTVAL;
    for (int i = 1; i < NX-1; ++i) uk[i] = 0.0;
    for (int i = 0; i < NX; ++i) ukp1[i] = uk[i];
}

void printValues(double uk[], int step) { /* NOT SHOWN */ }

int main(void) {
    /* pointers to arrays for two iterations of algorithm */
    double *uk = malloc(sizeof(double) * NX);
    double *ukp1 = malloc(sizeof(double) * NX);
    double *temp;
    double dx = 1.0/NX;
    double dt = 0.5*dx*dx;
    initialize(uk, ukp1);
    for (int k = 0; k < NSTEPS; ++k) {
        /* compute new values */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < NX-1; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* "copy" ukp1 to uk by swapping pointers */
        temp = ukp1; ukp1 = uk; uk = temp;
        printValues(uk, k);
    }
    return 0;
}
MPI solution
An MPI-based program for this example is shown in Fig. 4.14 and Fig. 4.15. This program uses a data distribution with ghost cells and the SPMD pattern.

Each process is given a single chunk of the data domain of size NX/NP, where NX is the total size of the global data array and NP is the number of processes. For simplicity, we assume NX is evenly divided by NP.

The update of the chunk is straightforward and essentially identical to that of the sequential code. The greater length and complexity of this MPI program arise from two sources. First, the data initialization is more complex, because it must account for the data values at the edges of the first and last chunks. Second, message-passing routines are required inside the loop over k to exchange ghost cells.
Figure 4.13. Parallel heat-diffusion program using OpenMP. This version has less thread-management overhead.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]) { /* NOT SHOWN */ }
void printValues(double uk[], int step) { /* NOT SHOWN */ }

int main(void) {
    /* pointers to arrays for two iterations of algorithm */
    double *uk = malloc(sizeof(double) * NX);
    double *ukp1 = malloc(sizeof(double) * NX);
    double *temp;
    int i, k;
    double dx = 1.0/NX;
    double dt = 0.5*dx*dx;
    #pragma omp parallel private(k, i)
    {
        initialize(uk, ukp1);
        for (k = 0; k < NSTEPS; ++k) {
            #pragma omp for schedule(static)
            for (i = 1; i < NX-1; ++i) {
                ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
            }
            /* "copy" ukp1 to uk by swapping pointers */
            #pragma omp single
            {
                temp = ukp1; ukp1 = uk; uk = temp;
            }
        }
    }
    return 0;
}
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[], int numPoints,
                int numProcs, int myID) {
    for (int i = 1; i <= numPoints; ++i) uk[i] = 0.0;
    /* left endpoint */
    if (myID == 0) uk[1] = LEFTVAL;
    /* right endpoint */
    if (myID == numProcs-1) uk[numPoints] = RIGHTVAL;
    /* copy values to ukp1 */
    for (int i = 1; i <= numPoints; ++i) ukp1[i] = uk[i];
}

void printValues(double uk[], int step, int numPoints, int myID) { /* NOT SHOWN */ }

int main(int argc, char *argv[]) {
    /* pointers to arrays for two iterations of algorithm */
    double *uk, *ukp1, *temp;
    double dx = 1.0/NX;
    double dt = 0.5*dx*dx;
    int numProcs, myID, leftNbr, rightNbr, numPoints;
    MPI_Status status;

    /* MPI initialization */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myID);  // get own ID

    /* initialization of other variables */
    leftNbr = myID - 1;   // ID of left "neighbor" process
    rightNbr = myID + 1;  // ID of right "neighbor" process
    numPoints = (NX / numProcs);

    /* uk, ukp1 include a "ghost cell" at each end */
    uk = malloc(sizeof(double) * (numPoints+2));
    ukp1 = malloc(sizeof(double) * (numPoints+2));
    initialize(uk, ukp1, numPoints, numProcs, myID);
    /* continued in next figure */
    /* continued from Figure 4.14 */
    for (int k = 0; k < NSTEPS; ++k) {
        /* exchange boundary information */
        if (myID != 0)
            MPI_Send(&uk[1], 1, MPI_DOUBLE, leftNbr, 0, MPI_COMM_WORLD);
        if (myID != numProcs-1)
            MPI_Send(&uk[numPoints], 1, MPI_DOUBLE, rightNbr, 0, MPI_COMM_WORLD);
        if (myID != 0)
            MPI_Recv(&uk[0], 1, MPI_DOUBLE, leftNbr, 0, MPI_COMM_WORLD, &status);
        if (myID != numProcs-1)
            MPI_Recv(&uk[numPoints+1], 1, MPI_DOUBLE, rightNbr, 0,
                     MPI_COMM_WORLD, &status);
        /* compute new values for interior points */
        for (int i = 2; i < numPoints; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* compute new values for boundary points */
        if (myID != 0) {
            int i = 1;
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        if (myID != numProcs-1) {
            int i = numPoints;
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* "copy" ukp1 to uk by swapping pointers */
        temp = ukp1; ukp1 = uk; uk = temp;
        printValues(uk, k, numPoints, myID);
    }
    /* clean up and end */
    MPI_Finalize();
    return 0;
}
While the basic algorithm is the same, the communication is quite different. The immediate-mode communication routines, MPI_Isend and MPI_Irecv, are used to set up and then launch the communication events. These functions (described in more detail in the MPI appendix, Appendix B) return immediately. The update operations on the interior points can then take place because they don't depend on the results of the communication. We then call functions to wait until the communication is complete and update the edges of each UE's chunks using the results of the communication events. In this case, the messages are small in size, so it is unlikely that this version of the program would be any faster than our first one. But it is easy to imagine cases where large, complex communication events would be involved and being able to do useful work while the messages move across the computer network would result in significantly greater performance.
Figure 4.16. Parallel heat-diffusion program using MPI with overlapping communication/computation (continued from Fig. 4.14)
    /* continued */
    MPI_Request reqRecvL, reqRecvR, reqSendL, reqSendR;  // needed for
                                                         // nonblocking I/O
    for (int k = 0; k < NSTEPS; ++k) {
        /* initiate communication to exchange boundary information */
        if (myID != 0) {
            MPI_Irecv(&uk[0], 1, MPI_DOUBLE, leftNbr, 0,
                      MPI_COMM_WORLD, &reqRecvL);
            MPI_Isend(&uk[1], 1, MPI_DOUBLE, leftNbr, 0,
                      MPI_COMM_WORLD, &reqSendL);
        }
        if (myID != numProcs-1) {
            MPI_Irecv(&uk[numPoints+1], 1, MPI_DOUBLE, rightNbr, 0,
                      MPI_COMM_WORLD, &reqRecvR);
            MPI_Isend(&uk[numPoints], 1, MPI_DOUBLE, rightNbr, 0,
                      MPI_COMM_WORLD, &reqSendR);
        }
        /* compute new values for interior points */
        for (int i = 2; i < numPoints; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* wait for communication to complete */
        if (myID != 0) {
            MPI_Wait(&reqRecvL, &status); MPI_Wait(&reqSendL, &status);
        }
        if (myID != numProcs-1) {
            MPI_Wait(&reqRecvR, &status); MPI_Wait(&reqSendR, &status);
        }
        /* compute new values for boundary points */
        if (myID != 0) {
            int i = 1;
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        if (myID != numProcs-1) {
            int i = numPoints;
            ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
        }
        /* "copy" ukp1 to uk by swapping pointers */
        temp = ukp1; ukp1 = uk; uk = temp;
        printValues(uk, k, numPoints, myID);
    }
    /* clean up and end */
    MPI_Finalize();
    return 0;
}
#include <stdio.h>
#include <stdlib.h>

#define N 100
#define NB 4
#define blockstart(M,i,j,rows_per_blk,cols_per_blk,stride) \
    (M + ((i)*(rows_per_blk))*(stride) + (j)*(cols_per_blk))

int main(int argc, char *argv[]) {
    /* matrix dimensions */
    int dimN = N;
    int dimP = N;
    int dimM = N;
    /* block dimensions */
    int dimNb = dimN/NB;
    int dimPb = dimP/NB;
    int dimMb = dimM/NB;
    /* allocate memory for matrices */
    double *A = malloc(dimN*dimP*sizeof(double));
    double *B = malloc(dimP*dimM*sizeof(double));
    double *C = malloc(dimN*dimM*sizeof(double));
    /* Initialize matrices */
    initialize(A, B, dimN, dimP, dimM);
    /* Do the matrix multiplication */
    for (int ib=0; ib < NB; ++ib) {
        for (int jb=0; jb < NB; ++jb) {
            /* find block[ib][jb] of C */
            double *blockPtr = blockstart(C, ib, jb, dimNb, dimMb, dimM);
            /* clear block[ib][jb] of C (set all elements to zero) */
            matclear(blockPtr, dimNb, dimMb, dimM);
            for (int kb=0; kb < NB; ++kb) {
                /* compute product of block[ib][kb] of A and block[kb][jb]
                   of B and add to block[ib][jb] of C */
                matmul_add(blockstart(A, ib, kb, dimNb, dimPb, dimP),
                           blockstart(B, kb, jb, dimPb, dimMb, dimM),
                           blockPtr, dimNb, dimPb, dimMb, dimP, dimM, dimM);
            }
        }
    }
    /* Code to print results not shown */
    return 0;
}
Matrix multiplication
The matrix multiplication problem is described in the Context section. Fig. 4.17 presents a simple sequential program to compute the desired result, based on decomposing the N by N matrix into NB*NB square blocks. The notation block[i][j] in comments indicates the (i,j)-th block as described earlier. To simplify the coding in C, we represent the matrices as 1D arrays (internally arranged in row-major order) and define a macro blockstart to find the top left corner of a submatrix within one of these 1D arrays. We omit code for functions initialize (initialize matrices A and B), printMatrix (print a matrix's values), matclear (clear a matrix, that is, set all values to zero), and matmul_add (compute the matrix product of the two input matrices and add it to the output matrix). Parameters to most of these functions include matrix dimensions, plus a stride that denotes the distance from the start of one row of the matrix to the start of the next and allows us to apply the functions to submatrices as well as to whole matrices.
Figure 4.18. Sequential matrix multiplication, revised. We do not show the parts of the program that are not changed from the program in Fig. 4.17.

/* Declarations, initializations, etc. not shown -- same as first version */

/* Do the multiply */
matclear(C, dimN, dimM, dimM);  /* sets all elements to zero */
for (int kb=0; kb < NB; ++kb) {
    for (int ib=0; ib < NB; ++ib) {
        for (int jb=0; jb < NB; ++jb) {
            /* compute product of block[ib][kb] of A and block[kb][jb]
               of B and add to block[ib][jb] of C */
            matmul_add(blockstart(A, ib, kb, dimNb, dimPb, dimP),
                       blockstart(B, kb, jb, dimPb, dimMb, dimM),
                       blockstart(C, ib, jb, dimNb, dimMb, dimM),
                       dimNb, dimPb, dimMb, dimP, dimM, dimM);
        }
    }
}
essentially the same as those arising from the mesh program example, so we do not show program source code here.
MPI solution
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <mpi.h>

#define N 100
#define blockstart(M,i,j,rows_per_blk,cols_per_blk,stride) \
    (M + ((i)*(rows_per_blk))*(stride) + (j)*(cols_per_blk))

int main(int argc, char *argv[]) {
    /* matrix dimensions */
    int dimN = N;
    int dimP = N;
    int dimM = N;
    /* block dimensions */
    int dimNb, dimPb, dimMb;
    /* matrices */
    double *A, *B, *C;
    /* buffers for receiving sections of A, B from other processes */
    double *Abuffer, *Bbuffer;
    int numProcs, myID, myID_i, myID_j, NB;
    MPI_Status status;

    /* MPI initialization */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myID);

    /* initialize other variables */
    NB = (int) sqrt((double) numProcs);
    myID_i = myID / NB;
    myID_j = myID % NB;
    dimNb = dimN/NB;
    dimPb = dimP/NB;
    dimMb = dimM/NB;
    A = malloc(dimNb*dimPb*sizeof(double));
    B = malloc(dimPb*dimMb*sizeof(double));
    C = malloc(dimNb*dimMb*sizeof(double));
    Abuffer = malloc(dimNb*dimPb*sizeof(double));
    Bbuffer = malloc(dimPb*dimMb*sizeof(double));

    /* Initialize matrices */
    initialize(A, B, dimNb, dimPb, dimMb, NB, myID_i, myID_j);
    /* continued from previous figure */

    /* Do the multiply */
    matclear(C, dimNb, dimMb, dimMb);
    for (int kb=0; kb < NB; ++kb) {
        if (myID_j == kb) {
            /* send A to other processes in the same "row" */
            for (int jb=0; jb < NB; ++jb) {
                if (jb != myID_j)
                    MPI_Send(A, dimNb*dimPb, MPI_DOUBLE,
                             myID_i*NB + jb, 0, MPI_COMM_WORLD);
            }
            /* copy A to Abuffer */
            memcpy(Abuffer, A, dimNb*dimPb*sizeof(double));
        }
        else {
            MPI_Recv(Abuffer, dimNb*dimPb, MPI_DOUBLE,
                     myID_i*NB + kb, 0, MPI_COMM_WORLD, &status);
        }
        if (myID_i == kb) {
            /* send B to other processes in the same "column" */
            for (int ib=0; ib < NB; ++ib) {
                if (ib != myID_i)
                    MPI_Send(B, dimPb*dimMb, MPI_DOUBLE,
                             ib*NB + myID_j, 0, MPI_COMM_WORLD);
            }
            /* copy B to Bbuffer */
            memcpy(Bbuffer, B, dimPb*dimMb*sizeof(double));
        }
        else {
            MPI_Recv(Bbuffer, dimPb*dimMb, MPI_DOUBLE,
                     kb*NB + myID_j, 0, MPI_COMM_WORLD, &status);
        }
        /* compute product of block[ib][kb] of A and block[kb][jb]
           of B and add to block[ib][jb] of C */
        matmul_add(Abuffer, Bbuffer, C,
                   dimNb, dimPb, dimMb, dimPb, dimMb, dimMb);
    }
    /* Code to print results not shown */

    /* Clean up and end */
    MPI_Finalize();
    return 0;
}
Although the algorithm may seem complex at first, the overall idea is straightforward. The computation proceeds through a number of phases (the loop over kb). At each phase, the process whose row index equals the kb index sends its blocks of A across the row of processes. Likewise, the process whose column index equals kb sends its blocks of B along the column of processes. Following the communication operations, each process then multiplies the A and B blocks it received and sums the result into its block of C. After NB phases, the block of the C matrix on each process will hold the final product.

These types of algorithms are very common when working with MPI. The key to understanding these algorithms is to think in terms of the set of processes, the data owned by each process, and how data from neighboring processes flows among the processes as the calculation unfolds. We revisit these issues in the SPMD and Distributed Array patterns as well as in the MPI appendix.

A great deal of research has been carried out on parallel matrix multiplication and related linear algebra algorithms. A more sophisticated approach, in which the blocks of A and B circulate among processes, arriving at each process just in time to be used, is given in [FJL+88].
Known uses
[Sca, BCC+97] library are for the most part based on this pattern. These two classes of problems cover a large portion of all parallel applications in scientific computing.

Related Patterns

If the update required for each chunk can be done without data from other chunks, then this pattern reduces to the embarrassingly parallel algorithm described in the Task Parallelism pattern. As an example of such a computation, consider computing a 2D FFT (Fast Fourier Transform) by first applying a 1D FFT to each row of the matrix and then applying a 1D FFT to each column. Although the decomposition may appear data-based (by rows/by columns), in fact the computation consists of two instances of the Task Parallelism pattern.

If the data structure to be distributed is recursive in nature, then the Divide and Conquer or Recursive Data pattern may be applicable.
Figure 4.21. Finding roots in a forest. Solid lines represent the original parent-child relationships among nodes; dashed lines point from nodes to their successors.
What we have done is transform the original sequential calculation (find roots for nodes one "hop" from a root, then find roots for nodes two "hops" from a root, etc.) into a calculation that computes a partial result (successor) for each node and then repeatedly combines these partial results, first with neighboring results, then with results from nodes two hops away, then with results from nodes four hops away, and so on. This strategy can be applied to other problems that at first appear unavoidably sequential; the Examples section presents other examples. This technique is sometimes referred to as pointer jumping or recursive doubling.

An interesting aspect of this restructuring is that the new algorithm involves substantially more total work than the original sequential one (O(N log N) versus O(N)), but the restructured algorithm contains potential concurrency that, if fully exploited, reduces total running time to O(log N) (versus O(N)). Most strategies and algorithms based on this pattern similarly trade off an increase in total work for a potential decrease in execution time. Notice also that the exploitable concurrency can be extremely fine-grained (as in the previous example), which may limit the situations in which this pattern yields an efficient algorithm. Nevertheless, the pattern can still serve as an inspiration for lateral thinking about how to parallelize problems that at first glance appear to be inherently sequential.

Forces
In this pattern, the recursive data structure is completely decomposed into individual elements and each element is assigned to a separate UE. Ideally each UE would be assigned to a different PE, but it is also possible to assign multiple UEs to each PE. If the number of UEs per PE is too large, however, the overall performance will be poor because there will not be enough concurrency to overcome the increase in the total amount of work.

For example, consider the root-finding problem described earlier. We'll ignore overhead in our computations. If N = 1024 and t is the time to perform one step for one data element, then the running time of a sequential algorithm will be about 1024t. If each UE is assigned its own PE, then the running time of the parallel algorithm will be around (log N)t, or 10t. If only two PEs are available for the parallel algorithm, however, then all N log N, or 10240, computation steps must be performed on the two PEs, and the execution time will be at least 5120t, considerably more than the sequential algorithm.
Structure
then the parallel algorithm must ensure that next[k] is not updated before other UEs that need its value for their computation have received it. One common technique is to introduce a new variable, say next2, at each element. Even-numbered iterations then read next but update next2, while odd-numbered iterations read next2 and update next. The necessary synchronization is accomplished by placing a barrier (as described in the Implementation Mechanisms design space) between each successive pair of iterations. Notice that this can substantially increase the overhead associated with the parallel algorithm, which can overwhelm any speedup derived from the additional concurrency. This is most likely to be a factor if the calculation required for each element is trivial (which, alas, for many of the examples it is).

If there are fewer PEs than data elements, the program designer must decide whether to assign each data element to a UE and assign multiple UEs to each PE (thereby simulating some of the parallelism) or whether to assign multiple data elements to each UE and process them serially. The latter is less straightforward (requiring an approach similar to that sketched previously, in which variables involved in the simultaneous update are duplicated), but can be more efficient.

Examples
Partial sums of a linked list
Figure 4.23. Steps in finding partial sums of a list. Straight arrows represent links between elements; curved arrows indicate additions.
In combinatorial optimization, problems involving traversing all nodes in a graph or tree can often be solved with this pattern by first finding an ordering on the nodes to create a list. Euler tours and ear decomposition [EG88] are well-known techniques to compute this ordering.

JáJá [J92] also describes several applications of this pattern: finding the roots of trees in a forest of rooted directed trees, computing partial sums on a set of rooted directed trees (similar to the preceding example with linked lists), and list ranking (determining for each element of the list its distance from the start/end of the list).

Related Patterns

With respect to the actual concurrency, this pattern is very much like the Geometric Decomposition pattern, a difference being that in this pattern the data structure containing the elements to be operated on concurrently is recursive (at least conceptually). What makes it different is the emphasis on fundamentally rethinking the problem to expose fine-grained concurrency.
can be vectorized in a way that the special hardware can exploit. After a short startup, one a[i] value will be generated each clock cycle.
creates a three-stage pipeline, with one process for each command (cat, grep, and wc).

These examples and the assembly-line analogy have several aspects in common. All involve applying a sequence of operations (in the assembly-line case it is installing the engine, installing the windshield, etc.) to each element in a sequence of data elements (in the assembly line, the cars). Although there may be ordering constraints on the operations on a single data element (for example, it might be necessary to install the engine before installing the hood), it is possible to perform different operations on different data elements simultaneously (for example, one can install the engine on one car while installing the hood on another).

The possibility of simultaneously performing different operations on different data elements is the potential concurrency this pattern exploits. In terms of the analysis described in the Finding Concurrency patterns, each task consists of repeatedly applying an operation to a data element (analogous to an assembly-line worker installing a component), and the dependencies among tasks are ordering constraints enforcing the order in which operations must be performed on each data element (analogous to installing the engine before the hood).

Forces
Solution

The key idea of this pattern is captured by the assembly-line analogy, namely that the potential concurrency can be exploited by assigning each operation (stage of the pipeline) to a different worker and having them work simultaneously, with the data elements passing from one worker to the next as operations are completed. In parallel-programming terms, the idea is to assign each task (stage of the pipeline) to a UE and provide a mechanism whereby each stage of the pipeline can send data elements to the next stage. This strategy is probably the most straightforward way to deal with this type of ordering constraint. It allows the application to take advantage of special-purpose hardware by appropriate mapping of pipeline stages to PEs and provides a reasonable mechanism for handling errors, described later. It also is likely to yield a modular design that can later be extended or modified.

Before going further, it may help to illustrate how the pipeline is supposed to operate. Let Ci represent a multistep computation on data element i. Ci(j) is the j-th step of the computation. The idea is to map computation steps to pipeline stages so that each stage of the pipeline computes one step. Initially, the first stage of the pipeline performs C1(1). After that completes, the second stage of the pipeline receives the first data item and computes C1(2) while the first stage computes the first step of the second item, C2(1). Next, the third stage computes C1(3), while the second stage computes C2(2) and the first stage C3(1). Fig. 4.24 illustrates how this works for a pipeline consisting of four stages. Notice that concurrency is initially limited and some resources remain idle until all the stages are occupied with useful work. This is referred to as filling the pipeline. At the end of the computation (draining the pipeline), again there is limited concurrency and idle resources as the final item works its way through the pipeline. We want the time spent filling or draining the pipeline to be small compared to the total time of the computation. This will be the case if the number of stages is small compared to the number of items to be processed. Notice also that overall throughput/efficiency is maximized if the time taken to process a data element is roughly the same for each stage.
Figure 4.24. Operation of a pipeline. Each pipeline stage i computes the i-th step of the computation.
The amount of concurrency in a full pipeline is limited by the number of stages. Thus, a larger number of stages allows more concurrency. However, the data sequence must be transferred between the stages, introducing overhead to the calculation. Thus, we need to organize the computation into stages such that the work done by a stage is large compared to the communication overhead. What is "large enough" is highly dependent on the particular architecture. Specialized hardware (such as vector processors) allows very fine-grained parallelism.

The pattern works better if the operations performed by the various stages of the pipeline are all about equally computationally intensive. If the stages in the pipeline vary widely in computational effort, the slowest stage creates a bottleneck for the aggregate throughput.
Figure 4.26. Basic structure of a pipeline stage

    initialize
    while (more data) {
        receive data element from previous stage
        perform operation on data element
        send data element to next stage
    }
    finalize
How data flow between pipeline elements is represented depends on the target platform.

In a message-passing environment, the most natural approach is to assign one process to each operation (stage of the pipeline) and implement each connection between successive stages of the pipeline as a sequence of messages between the corresponding processes. Because the stages are hardly ever perfectly synchronized, and the amount of work carried out at different stages almost always varies, this flow of data between pipeline stages must usually be both buffered and ordered. Most message-passing environments (e.g., MPI) make this easy to do. If the cost of sending individual messages is high, it may be worthwhile to consider sending multiple data elements in each message; this reduces total communication cost at the expense of increasing the time needed to fill the pipeline.

If a message-passing programming environment is not a good fit with the target platform, the stages of the pipeline can be connected explicitly with buffered channels. Such a buffered channel can be implemented as a queue shared between the sending and receiving tasks, using the Shared Queue pattern.

If the individual stages are themselves implemented as parallel programs, then more sophisticated approaches may be called for, especially if some sort of data redistribution needs to be performed between the stages. This might be the case if, for example, the data needs to be partitioned along a different dimension or partitioned into a different number of subsets in the same dimension. For example, an application might include one stage in which each data element is partitioned into three subsets and another stage in which it is partitioned into four subsets. The simplest ways to handle such situations are to aggregate and disaggregate data elements between stages. One approach would be to have only one task in each stage communicate with tasks in other stages; this task would then be responsible for interacting with the other tasks in its stage to distribute input data elements and collect output data elements. Another approach would be to introduce additional pipeline stages to perform aggregation/disaggregation operations. Either of these approaches, however, involves a fair amount of communication. It may be preferable to have the earlier stage "know" about the needs of its successor and communicate with each task receiving part of its data directly rather than aggregating the data at one stage and then disaggregating at the next. This approach improves performance at the cost of reduced simplicity, modularity, and flexibility.

Less traditionally, networked file systems have been used for communication between stages in a pipeline running in a workstation cluster. The data is written to a file by one stage and read from the file by its successor. Network file systems are usually mature and fairly well optimized, and they provide for the visibility of the file at all PEs as well as mechanisms for concurrency control. Higher-level abstractions such as tuple spaces and blackboards implemented over networked file systems can also be used. File-system-based solutions are appropriate in large-grained applications in which the time needed to process the data at each stage is large compared with the time to access the file system.
Handling errors
The simplest approach is to allocate one PE to each stage of the pipeline. This gives good load balance if the PEs are similar and the amount of work needed to process a data element is roughly the same for each stage. If the stages have different requirements (for example, one is meant to be run on special-purpose hardware), this should be taken into consideration in assigning stages to PEs.

If there are fewer PEs than pipeline stages, then multiple stages must be assigned to the same PE, preferably in a way that improves, or at least does not much reduce, overall performance. Stages that do not share many resources can be allocated to the same PE; for example, a stage that writes to a disk and a stage that involves primarily CPU computation might be good candidates to share a PE. If the amount of work to process a data element varies among stages, stages involving less work may be allocated to the same PE, thereby possibly improving load balance. Assigning adjacent stages to the same PE can reduce communication costs. It might also be worthwhile to consider combining adjacent stages of the pipeline into a single stage.

If there are more PEs than pipeline stages, it is worthwhile to consider parallelizing one or more of the pipeline stages using an appropriate Algorithm Structure pattern, as discussed previously, and allocating more than one PE to the parallelized stage(s). This is particularly effective if the parallelized stage was previously a bottleneck (taking more time than the other stages and thereby dragging down overall performance).

Another way to make use of more PEs than pipeline stages, if there are no temporal constraints among the data items themselves (that is, it doesn't matter if, say, data item 3 is computed before data item 2), is to run multiple independent pipelines in parallel. This can be considered an instance of the Task Parallelism pattern. This will improve the throughput of the overall calculation, but it does not significantly improve the latency, since it still takes the same amount of time for a data element to traverse the pipeline.
There are a few more factors to keep in mind when evaluating whether a given design will produce acceptable performance.

In many situations where the Pipeline pattern is used, the performance measure of interest is the throughput, the number of data items per time unit that can be processed after the pipeline is already full. For example, if the output of the pipeline is a sequence of rendered images to be viewed as an animation, then the pipeline must have sufficient throughput (number of items processed per time unit) to generate the images at the required frame rate.

In another situation, the input might be generated from real-time sampling of sensor data. In this case, there might be constraints on both the throughput (the pipeline should be able to handle all the data as it comes in without backing up the input queue and possibly losing data) and the latency (the amount of time between the generation of an input and the completion of processing of that input). In this case, it might be desirable to minimize latency subject to a constraint that the throughput is sufficient to handle the incoming data.

Examples
Fourier-transform computations
* The first stage of the pipeline performs the initial Fourier transform; it repeatedly obtains one set of input data, performs the transform, and passes the result to the second stage of the pipeline.

* The second stage of the pipeline performs the desired elementwise manipulation; it repeatedly obtains a partial result (of applying the initial Fourier transform to an input set of data) from the first stage of the pipeline, performs its manipulation, and passes the result to the third stage of the pipeline. This stage can often itself be parallelized using one of the other Algorithm Structure patterns.

* The third stage of the pipeline performs the final inverse Fourier transform; it repeatedly obtains a partial result (of applying the initial Fourier transform and then the elementwise manipulation to an input set of data) from the second stage of the pipeline, performs the inverse Fourier transform, and outputs the result.
The figures for this example show a simple Java framework for pipelines and an example application. The framework consists of a base class for pipeline stages, PipelineStage, shown in Fig. 4.27, and a base class for pipelines, LinearPipeline, shown in Fig. 4.28. Applications provide a subclass of PipelineStage for each desired stage, implementing its three abstract methods to indicate what the stage should do on the initial step, the compute steps, and the final step, and a subclass of LinearPipeline that implements its abstract methods to create an array containing the desired pipeline stages and the desired queues connecting the stages. For the queues connecting the stages, we use LinkedBlockingQueue, an implementation of the BlockingQueue interface. These classes are found in the java.util.concurrent package and use generics to specify the type of objects the queue can hold. For example, new LinkedBlockingQueue<String> creates a BlockingQueue implemented by an underlying linked list that can hold Strings. The operations of interest are put, which adds an object to the queue, and take, which removes an object; take blocks if the queue is empty. The class CountDownLatch, also found in the java.util.concurrent package, is a simple barrier that allows the program to print a message when it has terminated. Barriers in general, and CountDownLatch in particular, are discussed in the Implementation Mechanisms design space.

The remaining figures show code for an example application, a pipeline to sort integers. Fig. 4.29 is the required subclass of LinearPipeline, and Fig. 4.30 is the required subclass of PipelineStage. Additional pipeline stages to generate or read the input and to handle the output are not shown.
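As a minimal, self-contained illustration of the two java.util.concurrent classes the framework relies on, here is a one-element producer/consumer exchange. This toy example is ours (using a lambda for brevity), not part of the book's framework:

```java
import java.util.concurrent.*;

public class QueueLatchDemo {
    // Runs a one-element producer/consumer exchange and returns the
    // item the consumer received.
    static String demo() throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        CountDownLatch done = new CountDownLatch(1);
        final String[] received = new String[1];

        // Consumer thread: take() blocks until an element is available.
        Thread consumer = new Thread(() -> {
            try {
                received[0] = queue.take();  // blocks on an empty queue
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done.countDown();            // signal termination
            }
        });
        consumer.start();

        queue.put("hello");  // add an element; this wakes the consumer
        done.await();        // wait until the consumer has finished
        return received[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("consumer got: " + demo());
    }
}
```

The same put/take/countDown/await choreography, scaled up to one thread and one queue per stage, is exactly what the framework classes below automate.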
Known uses
Figure 4.27. Base class for pipeline stages

import java.util.concurrent.*;

abstract class PipelineStage implements Runnable {

    BlockingQueue in;
    BlockingQueue out;
    CountDownLatch s;
    boolean done;

    //override to specify initialization step
    abstract void firstStep() throws Exception;

    //override to specify compute step
    abstract void step() throws Exception;

    //override to specify finalization step
    abstract void lastStep() throws Exception;

    void handleComputeException(Exception e) {
        e.printStackTrace();
    }

    public void run() {
        try {
            firstStep();
            while (!done) { step(); }
            lastStep();
        }
        catch (Exception e) { handleComputeException(e); }
        finally { s.countDown(); }
    }

    public void init(BlockingQueue in, BlockingQueue out, CountDownLatch s) {
        this.in = in;
        this.out = out;
        this.s = s;
    }
}
Fx [GOS94], a parallelizing Fortran compiler based on HPF [HPF97], has been used to develop
Figure 4.28. Base class for linear pipelines

import java.util.concurrent.*;

abstract class LinearPipeline {

    PipelineStage[] stages;
    BlockingQueue[] queues;
    int numStages;
    CountDownLatch s;

    //override method to create desired array of pipeline stage objects
    abstract PipelineStage[] getPipelineStages(String[] args);

    //override method to create desired array of BlockingQueues
    //element i of returned array contains queue between stages i and i+1
    abstract BlockingQueue[] getQueues(String[] args);

    LinearPipeline(String[] args) {
        stages = getPipelineStages(args);
        queues = getQueues(args);
        numStages = stages.length;
        s = new CountDownLatch(numStages);

        BlockingQueue in = null;
        BlockingQueue out = queues[0];
        for (int i = 0; i != numStages; i++) {
            stages[i].init(in, out, s);
            in = out;
            if (i < numStages-2) out = queues[i+1];
            else out = null;
        }
    }

    public void start() {
        for (int i = 0; i != numStages; i++) {
            new Thread(stages[i]).start();
        }
    }
}
[J92] presents some finer-grained applications of pipelining, including inserting a sequence of elements into a 2-3 tree and pipelined mergesort.

Related Patterns

This pattern is very similar to the Pipes and Filters pattern of [BMR+96]; the key difference is that this pattern explicitly discusses concurrency.

For applications in which there are no temporal dependencies between the data inputs, an alternative to this pattern is a design based on multiple sequential pipelines executing in parallel and using the Task Parallelism pattern.
Figure 4.29. Pipelined sort (main class)
import java.util.concurrent.*;

class SortingPipeline extends LinearPipeline {

    /* Creates an array of pipeline stages with the number of sorting
       stages given via args. Input and output stages are also included
       at the beginning and end of the array. Details are omitted. */
    PipelineStage[] getPipelineStages(String[] args) {
        // ....
        return stages;
    }

    /* Creates an array of LinkedBlockingQueues to serve as communication
       channels between the stages. For this example, the first is
       restricted to hold Strings, the rest can hold Comparables. */
    BlockingQueue[] getQueues(String[] args) {
        BlockingQueue[] queues = new BlockingQueue[totalStages - 1];
        queues[0] = new LinkedBlockingQueue<String>();
        for (int i = 1; i != totalStages - 1; i++) {
            queues[i] = new LinkedBlockingQueue<Comparable>();
        }
        return queues;
    }

    SortingPipeline(String[] args) { super(args); }

    public static void main(String[] args) throws InterruptedException {
        LinearPipeline l = new SortingPipeline(args); // create pipeline
        l.start();   // start threads associated with stages
        l.s.await(); // wait until all stage threads have terminated
        System.out.println("All threads terminated");
    }
}
At first glance, one might also expect that sequential solutions built using the Chain of Responsibility pattern [GHJV95] could be easily parallelized using the Pipeline pattern. In Chain of Responsibility, or COR, an "event" is passed along a chain of objects until one or more of the objects handle the event. This pattern is directly supported, for example, in the Java Servlet Specification[1] [SER] to enable filtering of HTTP requests. With Servlets, as well as other typical applications of COR, however, the reason for using the pattern is to support modular structuring of a program that will need to handle independent events in different ways depending on the event type. It may be that only one object in the chain will even handle the event. We expect that in most cases, the Task Parallelism pattern would be more appropriate than the Pipeline pattern. Indeed, Servlet container implementations already supporting multithreading to handle independent HTTP requests provide this solution for free.
[1]
A Servlet is a Java program invoked by a Web server. The Java Servlets technology is included in the Java 2 Enterprise Edition platform for Web server applications.
Figure 4.30. Pipelined sort (sorting stage)
class SortingStage extends PipelineStage {

    Comparable val = null;
    Comparable input = null;

    void firstStep() throws InterruptedException {
        input = (Comparable)in.take();
        done = (input.equals("DONE"));
        val = input;
        return;
    }

    void step() throws InterruptedException {
        input = (Comparable)in.take();
        done = (input.equals("DONE"));
        if (!done) {
            if (val.compareTo(input) < 0) {
                out.put(val);
                val = input;
            }
            else {
                out.put(input);
            }
        }
        else out.put(val);
    }

    void lastStep() throws InterruptedException {
        out.put("DONE");
    }
}
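To see why this stage logic sorts, note that each stage keeps the largest value it has seen so far and forwards the rest downstream, so stage 0 ends up holding the maximum, stage 1 the next largest, and so on. The following sequential simulation of that per-stage rule is our own sketch; it replaces the threads and queues with plain loops so the behavior is easy to check:

```java
import java.util.*;

public class PipelineSortSketch {
    // Sequentially simulates the per-stage logic of SortingStage: each
    // stage keeps the larger of (held value, arriving value) and forwards
    // the smaller one to the next stage. With n stages and n inputs,
    // stage 0 ends up holding the maximum, stage 1 the next largest, etc.
    static List<Integer> sort(List<Integer> input) {
        int n = input.size();
        Integer[] held = new Integer[n];          // value kept by each stage
        for (int x : input) {
            Integer moving = x;
            for (int s = 0; s < n && moving != null; s++) {
                if (held[s] == null) { held[s] = moving; moving = null; }
                else if (held[s] < moving) {      // keep larger, forward held
                    int tmp = held[s]; held[s] = moving; moving = tmp;
                }                                  // else forward moving as-is
            }
        }
        List<Integer> result = new ArrayList<>(Arrays.asList(held));
        Collections.reverse(result);              // ascending order
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sort(Arrays.asList(3, 1, 4, 1, 5)));  // [1, 1, 3, 4, 5]
    }
}
```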
Problem

Suppose the application can be decomposed into groups of semi-independent tasks interacting in an irregular fashion. The interaction is determined by the flow of data between them, which implies ordering constraints between the tasks. How can these tasks and their interaction be implemented so they can execute concurrently?

Context

Some problems are most naturally represented as a collection of semi-independent entities interacting in an irregular way. What this means is perhaps clearest if we compare this pattern with the Pipeline pattern. In the Pipeline pattern, the entities form a linear pipeline, each entity interacts only with the entities to either side, the flow of data is one-way, and interaction occurs at fairly regular and predictable intervals. In the Event-Based Coordination pattern, in contrast, there is no restriction to a linear structure, no restriction that the flow of data be one-way, and the interaction takes place at irregular and sometimes unpredictable intervals.

As a real-world analogy, consider a newsroom, with reporters, editors, fact checkers, and other employees collaborating on stories. As reporters finish stories, they send them to the appropriate editors; an editor can decide to send the story to a fact checker (who would then eventually send it back) or back to the reporter for further revision. Each employee is a semi-independent entity, and their interaction (for example, a reporter sending a story to an editor) is irregular.

Many other examples can be found in the field of discrete-event simulation, that is, simulation of a physical system consisting of a collection of objects whose interaction is represented by a sequence of discrete "events". An example of such a system is the car-wash facility described in [Mis86]: The facility has two car-wash machines and an attendant. Cars arrive at random times at the attendant.
Each car is directed by the attendant to a nonbusy car-wash machine if one exists, or queued if both machines are busy. Each car-wash machine processes one car at a time. The goal is to compute, for a given distribution of arrival times, the average time a car spends in the system (time being washed plus any time waiting for a nonbusy machine) and the average length of the queue that builds up at the attendant. The "events" in this system include cars arriving at the attendant, cars being directed to the car-wash machines, and cars leaving the machines. Fig. 4.31 sketches this example. Notice that it includes "source" and "sink" objects to make it easier to model cars arriving and leaving the facility. Notice also that the attendant must be notified when cars leave the car-wash machines so that it knows whether the machines are busy.
Figure 4.31. Discrete-event simulation of a car-wash facility. Arrows indicate the flow of events.
The basic structure of each task consists of receiving an event, processing it, and possibly generating new events.
To allow communication and computation to overlap, one generally needs a form of asynchronous communication of events in which a task can create (send) an event and then continue without waiting for the recipient to receive it. In a message-passing environment, an event can be represented by a message sent asynchronously from the task generating the event to the task that is to process it. In a shared-memory environment, a queue can be used to simulate message passing. Because each such queue will be accessed by more than one task, it must be implemented in a way that allows safe concurrent access, as described in the Shared Queue pattern. Other communication abstractions, such as tuple spaces as found in the Linda coordination language or JavaSpaces [FHA99], can also be used effectively with this pattern. Linda [CG91] is a simple language consisting of only six operations that read and write an associative (that is, content-addressable) shared memory called a tuple space. A tuple space is a conceptually shared repository for data containing objects called tuples that tasks use for communication in a distributed system.
Enforcing event ordering
The enforcement of ordering constraints may make it necessary for a task to process events in a different order from the order in which they are sent, or to wait to process an event until some other event from a given task has been received, so it is usually necessary to be able to look ahead in the queue or message buffer and remove elements out of order. For example, consider the situation in Fig. 4.33. Task 1 generates an event and sends it to task 2, which will process it, and also sends it to task 3, which is recording information about all events. Task 2 processes the event from task 1 and generates a new event, a copy of which is also sent to task 3. Suppose that the vagaries of the scheduling and underlying communication layer cause the event from task 2 to arrive before the event from task 1. Depending on what task 3 is doing with the events, this may or may not be problematic. If task 3 is simply tallying the number of events that occur, there is no problem. If task 3 is writing a log entry that should reflect the order in which events are handled, however, simply processing events in the order in which they arrive would in this case produce an incorrect result. If task 3 is controlling a gate,
In discrete-event simulations, a similar problem can occur because of the semantics of the application domain. An event arrives at a station (task) along with a simulation time when it should be scheduled. An event can arrive at a station before other events with earlier simulation times.

The first step is to determine whether, in a particular situation, out-of-order events can be a problem. There will be no problem if the "event" path is linear so that no out-of-order events will occur, or if, according to the application semantics, out-of-order events do not matter.

If out-of-order events may be a problem, then either an optimistic or pessimistic approach can be chosen. An optimistic approach requires the ability to roll back the effects of events that are mistakenly executed (including the effects of any new events that have been created by the out-of-order execution). In the area of distributed simulation, this approach is called time warp [Jef85]. Optimistic approaches are usually not feasible if an event causes interaction with the outside world.

Pessimistic approaches ensure that the events are always executed in order at the expense of increased latency and communication overhead. Pessimistic approaches do not execute events until it can be guaranteed "safe" to do so. In the figure, for example, task 3 cannot process an event from task 2 until it "knows" that no earlier event will arrive from task 1, and vice versa. Providing task 3 with that knowledge may require introducing null events that contain no information useful for anything except the event ordering. Many implementations of pessimistic approaches are based on timestamps that are consistent with the causality in the system [Lam78].

Much research and development effort has gone into frameworks that take care of the details of event ordering in discrete-event simulation for both optimistic [RMC+98] and pessimistic [CLL+99] approaches. Similarly, middleware is available that handles event-ordering problems in process groups caused by the communication system. An example is the Ensemble system developed at Cornell [vRBH+98].
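The pessimistic idea can be sketched in a few lines: a task merging timestamped streams may deliver an event only when every other stream has presented an event with a timestamp at least as large, since each stream is internally ordered. This simplified single-threaded sketch is ours, not from the pattern text:

```java
import java.util.*;

public class PessimisticMerge {
    // Merge two streams of timestamped events, delivering them in
    // timestamp order. An event from one stream is "safe" to deliver
    // only while the other stream's next event has a timestamp that is
    // not smaller (each stream is assumed internally ordered).
    static List<Long> merge(Deque<Long> a, Deque<Long> b) {
        List<Long> delivered = new ArrayList<>();
        while (!a.isEmpty() && !b.isEmpty()) {
            if (a.peekFirst() <= b.peekFirst()) delivered.add(a.pollFirst());
            else delivered.add(b.pollFirst());
        }
        // Remaining events are NOT safe: an earlier event could still
        // arrive on the now-empty stream. A real implementation would
        // wait, or ask that stream for a null event to advance its clock.
        return delivered;
    }

    public static void main(String[] args) {
        Deque<Long> a = new ArrayDeque<>(Arrays.asList(1L, 4L, 7L));
        Deque<Long> b = new ArrayDeque<>(Arrays.asList(2L, 3L));
        System.out.println(merge(a, b));  // [1, 2, 3]; 4 and 7 must wait
    }
}
```

The events held back at the end are exactly the situation null events resolve: a null event with timestamp 5 on the second stream would make event 4 safe to deliver.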
Avoiding deadlocks
can also be caused by problems in the model that is being simulated. In the latter case, the developer must rethink the solution.

If pessimistic techniques are used to control the order in which events are processed, then deadlocks can occur when an event is available and actually could be processed, but is not processed because the event is not yet known to be safe. The deadlock can be broken by exchanging enough information that the event can be safely processed. This is a very significant problem, as the overhead of dealing with deadlocks can cancel the benefits of parallelism and make the parallel algorithm slower than a sequential simulation. Approaches to dealing with this type of deadlock range from sending frequent enough "null messages" to avoid deadlocks altogether (at the cost of many extra messages) to using deadlock-detection schemes to detect the presence of a deadlock and then resolve it (at the cost of possibly significant idle time before the deadlock is detected and resolved). The approach of choice will depend on the frequency of deadlock. A middle-ground solution, often the best approach, is to use timeouts instead of accurate deadlock detection.
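The timeout middle ground maps directly onto java.util.concurrent: a blocking take is replaced by poll with a time limit, and a null result triggers whatever recovery the design calls for. A minimal sketch (ours, not from the pattern text):

```java
import java.util.concurrent.*;

public class TimeoutReceive {
    // Waits up to timeoutMillis for an event; returns null on timeout,
    // at which point the caller can start deadlock recovery (for
    // example, exchanging clock information or sending null messages).
    static String receive(BlockingQueue<String> queue, long timeoutMillis)
            throws InterruptedException {
        return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        queue.put("event-1");
        System.out.println(receive(queue, 50));  // event-1
        // The queue is now empty: poll times out instead of blocking
        // forever, as take() would.
        System.out.println(receive(queue, 50));  // null
    }
}
```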
Scheduling and processor allocation
A number of discrete-event simulation applications use this pattern. The DPAT simulation used to analyze air traffic control systems [Wie01] is a successful simulation that uses optimistic techniques.
digital circuits (that is, circuits without feedback paths). The circuit is partitioned into subcircuits; associated with each are input signal ports and output voltage ports, which are connected to form a representation of the whole circuit. The simulation of each subcircuit proceeds in a time-stepped fashion; at each time step, the subcircuit's behavior depends on its previous state and the values read at its input ports (which correspond to values at the corresponding output ports of other subcircuits at previous time steps). Simulation of these subcircuits can proceed concurrently, with ordering constraints imposed by the relationship between values generated for output ports and values read on input ports. The solution described in [YWC+96] fits the Event-Based Coordination pattern, defining a task for each subcircuit and representing the ordering constraints as events.

Related Patterns

This pattern is similar to the Pipeline pattern in that both patterns apply to problems in which it is natural to decompose the computation into a collection of semi-independent entities interacting in terms of a flow of data. There are two key differences. First, in the Pipeline pattern, the interaction among entities is fairly regular, with all stages of the pipeline proceeding in a loosely synchronous way, whereas in the Event-Based Coordination pattern there is no such requirement, and the entities can interact in very irregular and asynchronous ways. Second, in the Pipeline pattern, the overall structure (number of tasks and their interaction) is usually fixed, whereas in the Event-Based Coordination pattern, the problem structure can be more dynamic.
5.1. INTRODUCTION
The Finding Concurrency and Algorithm Structure design spaces focus on algorithm expression. At some point, however, algorithms must be translated into programs. The patterns in the Supporting Structures design space address that phase of the parallel program design process, representing an intermediate stage between the problem-oriented patterns of the Algorithm Structure design space and the specific programming mechanisms described in the Implementation Mechanisms design space. We call these patterns Supporting Structures because they describe software constructions or "structures" that support the expression of parallel algorithms. An overview of this design space and its place in the pattern language is shown in Fig. 5.1.
Figure 5.1. Overview of the Supporting Structures design space and its place in the pattern language
The two groups of patterns in this space are those that represent program-structuring approaches and those that represent commonly used shared data structures. These patterns are briefly described in the next section. In some programming environments, some of these patterns are so well supported that there is little work for the programmer. We nevertheless document them as patterns for two reasons: First, understanding the low-level details behind these structures is important for effectively using them. Second, describing these structures as patterns provides guidance for programmers who might need to implement them from scratch. The final section of this chapter describes structures that were not deemed important enough, for various reasons, to warrant a dedicated pattern, but which deserve mention for completeness.

5.1.1. Program Structuring Patterns

Patterns in this first group describe approaches for structuring source code. These patterns include the following.
* SPMD. In an SPMD (Single Program, Multiple Data) program, all UEs execute the same program, each operating on its own data.
* Master/Worker. A master process or thread sets up a pool of worker processes or threads and a bag of tasks. The workers execute concurrently, with each worker repeatedly removing a task from the bag of tasks and processing it, until all tasks have been processed or some other termination condition has been reached. In some implementations, no explicit master is present.

* Loop Parallelism. This pattern addresses the problem of transforming a serial program whose runtime is dominated by a set of compute-intensive loops into a parallel program where the different iterations of the loop are executed in parallel.

* Fork/Join. A main UE forks off some number of other UEs that then continue in parallel to accomplish some portion of the overall work. Often the forking UE waits until the child UEs terminate and join.
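As a concrete illustration of the Master/Worker structure just described, here is a minimal Java sketch (ours, with an assumed poison-pill convention for termination, not a definitive implementation): the master fills a shared bag of tasks, and each worker repeatedly removes and processes a task until it sees the termination marker.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class MasterWorkerSketch {
    static final int POISON = -1;  // termination marker (assumed convention)

    // "Processes" the tasks 1..numTasks by summing them across workers.
    static int runJob(int numWorkers, int numTasks) throws InterruptedException {
        BlockingQueue<Integer> bag = new LinkedBlockingQueue<>();
        AtomicInteger sum = new AtomicInteger();

        // Master: fill the bag of tasks, then one poison pill per worker.
        for (int t = 1; t <= numTasks; t++) bag.put(t);
        for (int w = 0; w < numWorkers; w++) bag.put(POISON);

        Thread[] workers = new Thread[numWorkers];
        for (int w = 0; w < numWorkers; w++) {
            workers[w] = new Thread(() -> {
                try {
                    int task;
                    // Each worker repeatedly removes and processes a task.
                    while ((task = bag.take()) != POISON) sum.addAndGet(task);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[w].start();
        }
        for (Thread w : workers) w.join();
        return sum.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sum = " + runJob(4, 100));  // 5050
    }
}
```

Note how the shared queue provides automatic load balancing: a worker that draws expensive tasks simply draws fewer of them.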
While we define each of these program structures as a distinct pattern, this is somewhat artificial. It is possible, for example, to implement the Master/Worker pattern using the Fork/Join pattern or the SPMD pattern. These patterns do not represent exclusive, unique ways to structure a parallel program. Rather, they define the major idioms used by experienced parallel programmers.

These patterns also inevitably express a bias rooted in the subset of parallel programming environments we consider in this pattern language. To an MPI programmer, for example, all program structure patterns are essentially a variation on the SPMD pattern. To an OpenMP programmer, however, there is a huge difference between programs that utilize thread IDs (that is, the SPMD pattern) versus programs that express all concurrency in terms of loop-level worksharing constructs (that is, the Loop Parallelism pattern).

Therefore, in using these patterns, don't think of them too rigidly. These patterns express important techniques and are worthy of consideration in isolation, but do not hesitate to combine them in different ways to meet the needs of a particular problem. For example, in the SPMD pattern, we will discuss parallel algorithms based on parallelizing loops but expressed with the SPMD pattern. It might seem that this indicates that the SPMD and Loop Parallelism patterns are not really distinct patterns, but in fact it shows how flexible the SPMD pattern is.

5.1.2. Patterns Representing Data Structures

Patterns in this second group have to do with managing data dependencies. The Shared Data pattern deals with the general case. The others describe specific frequently used data structures.
5.2. FORCES
All of the program structuring patterns address the same basic problem: how to structure source code to best support algorithm structures of interest. Unique forces are applicable to each pattern, but in designing a program around these structures, there are some common forces to consider in most cases:
Equation 5.2
Sequential equivalence. Where appropriate, does a program produce equivalent results when run with many UEs as when run with one? If not equivalent, is the relationship between them clear?

It is highly desirable that the results of an execution of a parallel program be the same regardless of the number of PEs used. This is not always possible, especially if bitwise equivalence is desired, because floating-point operations performed in a different order can produce small (or, for ill-conditioned algorithms, not-so-small) changes in the resulting values. However, if we know that the parallel program gives equivalent results when executed on one processor as on many, then we can reason about correctness and do most of the testing on the single-processor version. This is much easier, and thus, when possible, sequential equivalence is a highly desirable goal.
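The floating-point caveat is easy to demonstrate: adding the same terms in a different order can change the result, so bitwise sequential equivalence cannot be expected in general. A small demonstration of our own, with contrived magnitudes:

```java
public class FpOrder {
    // Adds one large value and n tiny values, large value first:
    // each tiny addend is below half an ulp of 1.0 and rounds away.
    static double largeFirst(int n, double tiny) {
        double sum = 1.0;
        for (int i = 0; i < n; i++) sum += tiny;
        return sum;
    }

    // Same terms, tiny values first: they accumulate to a value big
    // enough to survive the final addition to 1.0.
    static double tinyFirst(int n, double tiny) {
        double small = 0.0;
        for (int i = 0; i < n; i++) small += tiny;
        return small + 1.0;
    }

    public static void main(String[] args) {
        double a = largeFirst(1000, 1e-17);
        double b = tinyFirst(1000, 1e-17);
        System.out.println(a == b);  // false: same terms, different order
    }
}
```

A parallel reduction reorders exactly this kind of summation, which is why "equivalent" usually has to mean "equal to within rounding" rather than bitwise identical.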
[Table: relates the programming environments OpenMP, MPI, and Java (rows) to the Algorithm Structure patterns Task Parallelism, Divide and Conquer, Geometric Decomposition, Recursive Data, Pipeline, and Event-Based Coordination (columns); the ratings in the table body are not reproduced here.]
It is not that the available programming environments pushed SPMD; the force was the other way around. The programming environments for MIMD machines pushed SPMD because that is the way programmers wanted to write their programs. They wrote them this way because they found it to be the best way to get the logic correct and efficient for what the tasks do and how they interact. For example, the programming environment PVM, sometimes considered a predecessor to MPI, in addition to the SPMD program structure also supported running different programs on different UEs (sometimes called the MPMD program structure). The MPI designers, with the benefit of the PVM experience, chose to support only SPMD.

In addition to the advantages to the programmer, SPMD makes management of the solution much easier. It is much easier to keep a software infrastructure up to date and consistent if there is only one program to manage. This factor becomes especially important on systems with large numbers of PEs, which can grow to huge numbers. For example, the two fastest computers in the world according to the November 2003 top500 list [Top], the Earth Simulator at the Earth Simulator Center in Japan and the ASCI Q at Los Alamos National Labs, have 5120 and 8192 processors, respectively. If each PE runs a distinct program, managing the application software could quickly become prohibitively expensive.
* Initialize. The program is loaded onto each UE and opens with bookkeeping operations to establish a common context. The details of this procedure are tied to the parallel programming environment and typically involve establishing communication channels with other UEs.

* Obtain a unique identifier. Near the top of the program, an identifier is set that is unique to each UE. This is usually the UE's rank within the MPI group (that is, a number in the interval from 0 to N-1, where N is the number of UEs) or the thread ID in OpenMP. This unique identifier allows different UEs to make different decisions during program execution.

* Run the same program on each UE, using the unique ID to differentiate behavior on different UEs. The same program runs on each UE. Differences in the instructions executed by different UEs are usually driven by the identifier. (They could also depend on the UE's data.) There are many ways to specify that different UEs take different paths through the source code. The most common are (1) branching statements to give specific blocks of code to different UEs and (2) using the UE identifier in loop index calculations to split loop iterations among the UEs.

* Distribute data. The data operated on by each UE is specialized to that UE by one of two techniques: (1) decomposing global data into chunks and storing them in the local memory of each UE, and later, if required, recombining them into the globally relevant results or (2) sharing or replicating the program's major data structures and using the UE identifier to associate subsets of the data with particular UEs.

* Finalize. The program closes by cleaning up the shared context and shutting down the computation. If globally relevant data was distributed among UEs, it will need to be recombined.
Discussion
An important issue to keep in mind when developing SPMD programs is the clarity of abstraction, that is, how easy it is to understand the algorithm from reading the program's source code. Depending on how the data is handled, this can range from awful to good. If complex index algebra on the UE identifier is needed to determine the data relevant to a UE or the instruction branch, the algorithm can be almost impossible to follow from the source code. (The Distributed Array pattern discusses useful techniques for arrays.)

In some cases, a replicated data algorithm combined with simple loop splitting is the best option because it leads to a clear abstraction of the parallel algorithm within the source code and an algorithm with a high degree of sequential equivalence. Unfortunately, this simple approach might not scale well, and more complex solutions might be needed. Indeed, SPMD algorithms can be highly scalable, and algorithms requiring complex coordination between UEs and scaling out to several thousand UEs [PH95] have been written using this pattern. These highly scalable algorithms are usually extremely complicated, as they distribute the data across the nodes (that is, no simplifying replicated-data techniques) and generally include complex load-balancing logic. These algorithms, unfortunately, bear little resemblance to their serial counterparts, reflecting a common criticism of the SPMD pattern.

An important advantage of the SPMD pattern is that overheads associated with startup and termination are segregated at the beginning and end of the program, not inside time-critical loops. This contributes to efficient programs and results in the efficiency issues being driven by the communication overhead, the capability to balance the computational load among the UEs, and the amount of concurrency available in the algorithm itself.
SPMD programs are closely aligned with programming environments based on message passing. For example, most MPI or PVM programs use the SPMD pattern. Note, however, that it is possible to use the SPMD pattern with OpenMP [CPP01]. With regard to the hardware, the SPMD pattern does not assume anything concerning the address space within which the tasks execute. As long as each UE can run its own instruction stream operating on its own data (that is, the computer can be classified as MIMD), the SPMD structure is satisfied. This generality of SPMD programs is one of the strengths of this pattern.

Examples

The issues raised by application of the SPMD pattern are best discussed using three specific examples:
Numerical integration
Figure 5.2. Sequential program to carry out the numerical integration

#include <stdio.h>
#include <math.h>

int main () {
    int i;
    int num_steps = 1000000;
    double x, pi, step, sum = 0.0;

    step = 1.0/(double) num_steps;

    for (i = 0; i < num_steps; i++) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("pi %lf\n", pi);
    return 0;
}
A program to carry this calculation out on a single processor is shown in Fig. 5.2. To keep the program as simple as possible, we fix the number of steps to use in the integration at 1,000,000. The variable sum is initialized to 0 and the step size is computed as the range in x (equal to 1.0 in this case) divided by the number of steps. The area of each rectangle is the width (the step size) times the height (the value of the integrand at the center of the interval). Because the width is a constant, we pull it out of the summation and multiply the sum of the rectangle heights by the step size, step, to get our estimate of the definite integral.

We will look at several versions of the parallel algorithm. We can see all the elements of a classic SPMD program in the simple MPI version of this program, as shown in Fig. 5.3. The same program is run on each UE. Near the beginning of the program, the MPI environment is initialized and the ID for
each UE (my_id) is given by the process rank for each UE in the process group associated with the communicator MPI_COMM_WORLD (for information about communicators and other MPI details, see the MPI appendix, Appendix B). We use the number of UEs and the ID to assign loop ranges (i_start and i_end) to each UE. Because the number of steps may not be evenly divided by the number of UEs, we have to make sure the last UE runs up to the last step in the calculation. After the partial sums have been computed on each UE, we multiply by the step size, step, and then use the MPI_Reduce() routine to combine the partial sums into a global sum. (Reduction operations are described in more detail in the Implementation Mechanisms design space.) This global value will only be available in the process with my_id == 0, so we direct that process to print the answer.

In essence, what we have done in the example in Fig. 5.3 is to replicate the key data (in this case, the partial summation value, sum), use the UE's ID to explicitly split up the work into blocks with one block per UE, and then recombine the local results into the final global result. The challenge in applying this pattern is to (1) split up the data correctly, (2) correctly recombine the results, and (3) achieve an even distribution of the work. The first two steps were trivial in this example. The load balance, however, is a bit more difficult. Unfortunately, the simple procedure we used in Fig. 5.3 could result in significantly more work for the last UE if the number of UEs does not evenly divide the number of steps. For a more even distribution of the work, we need to spread out the extra iterations among multiple UEs. We show one way to do this in the program fragment in Fig. 5.4. We compute the number of iterations left over after dividing the number of steps by the number of processors (rem). We will increase the number of iterations computed by the first rem UEs to cover that amount of work. The code in Fig. 5.4 accomplishes that task. These sorts of index adjustments are the bane of programmers using the SPMD pattern. Such code is error-prone and the source of hours of frustration as program readers try to understand the reasoning behind this logic.
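The index adjustment described above can be captured in a small helper: give one extra iteration to each of the first rem UEs. This sketch is ours, written in Java rather than the C of the figures, and the function name is invented for illustration:

```java
public class BlockRange {
    // Returns {i_start, i_end} (half-open) for the given UE. The first
    // (numSteps % numUEs) UEs each take one extra iteration, so the
    // load across UEs differs by at most one iteration.
    static int[] range(int id, int numUEs, int numSteps) {
        int chunk = numSteps / numUEs;
        int rem = numSteps % numUEs;
        int start = id * chunk + Math.min(id, rem);  // shift by earlier extras
        int end = start + chunk + (id < rem ? 1 : 0);
        return new int[]{start, end};
    }

    public static void main(String[] args) {
        // 10 steps over 4 UEs: block sizes 3, 3, 2, 2 covering [0,10).
        for (int id = 0; id < 4; id++) {
            int[] r = range(id, 4, 10);
            System.out.println("UE " + id + ": [" + r[0] + "," + r[1] + ")");
        }
    }
}
```

The arithmetic is short, but as the text warns, a reader must still reconstruct why the `Math.min(id, rem)` shift makes the blocks contiguous; this is exactly the kind of index algebra that makes SPMD code hard to follow.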
Figure 5.3. MPI program to carry out a trapezoid rule integration in parallel by assigning one block of loop iterations to each UE and performing a reduction
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main (int argc, char *argv[]) {
    int i, i_start, i_end;
    int num_steps = 1000000;
    double x, pi, step, sum = 0.0;
    int my_id, numprocs;

    step = 1.0/(double) num_steps;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    i_start = my_id * (num_steps/numprocs);
    i_end = i_start + (num_steps/numprocs);
    if (my_id == (numprocs-1)) i_end = num_steps;

    for (i = i_start; i < i_end; i++) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    sum *= step;

    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (my_id == 0) printf("pi %lf\n", pi);

    MPI_Finalize();
    return 0;
}
Figure 5.5. MPI program to carry out a trapezoid rule integration in parallel using a simple loop-splitting algorithm with cyclic distribution of iterations and a reduction
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main (int argc, char *argv[]) {
    int i;
    int num_steps = 1000000;
    double x, pi, step, sum = 0.0;
    int my_id, numprocs;

    step = 1.0/(double) num_steps;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    for (i = my_id; i < num_steps; i += numprocs) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    sum *= step;

    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (my_id == 0) printf("pi %lf\n", pi);

    MPI_Finalize();
    return 0;
}
Figure 5.6. OpenMP program to carry out a trapezoid rule integration in parallel using the same SPMD algorithm used in Fig. 5.5

#include <stdio.h>
#include <math.h>
#include <omp.h>

int main () {
    int num_steps = 1000000;
    double pi, step, sum = 0.0;

    step = 1.0/(double) num_steps;

    #pragma omp parallel reduction(+:sum)
    {
        int i, id = omp_get_thread_num();
        int numthreads = omp_get_num_threads();
        double x;

        for (i = id; i < num_steps; i += numthreads) {
            x = (i+0.5)*step;
            sum += 4.0/(1.0+x*x);
        }
    } // end of parallel region

    pi = step * sum;
    printf("\n pi is %lf\n", pi);
    return 0;
}
SPMD programs can also be written using OpenMP and Java. In Fig. 5.6, we show an OpenMP version of this program.
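A Java rendering of the same SPMD algorithm uses explicit threads: each thread obtains a unique ID, takes a cyclic share of the iterations, and the partial results are combined after a join. This sketch is our own, not the book's Java example:

```java
public class SpmdPi {
    static double computePi(int numThreads, int numSteps) throws InterruptedException {
        double step = 1.0 / (double) numSteps;
        double[] partial = new double[numThreads];  // one slot per UE
        Thread[] threads = new Thread[numThreads];

        for (int t = 0; t < numThreads; t++) {
            final int id = t;  // the unique identifier for this UE
            threads[t] = new Thread(() -> {
                double sum = 0.0;
                // Cyclic distribution: thread id takes steps id, id+P, id+2P, ...
                for (int i = id; i < numSteps; i += numThreads) {
                    double x = (i + 0.5) * step;
                    sum += 4.0 / (1.0 + x * x);
                }
                partial[id] = sum * step;  // no contention: each UE owns a slot
            });
            threads[t].start();
        }

        double pi = 0.0;  // the "reduction": combine the partial results
        for (int t = 0; t < numThreads; t++) {
            threads[t].join();
            pi += partial[t];
        }
        return pi;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.printf("pi %f%n", computePi(4, 1000000));
    }
}
```

All the classic SPMD elements are visible: a unique ID, ID-driven loop splitting, replicated partial data, and a final recombination step.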
Throughout this pattern language, we have used molecular dynamics as a recurring example. Molecular dynamics simulates the motions of a large molecular system. It uses an explicit time-stepping methodology where at each time step, the force on each atom is computed and standard techniques from classical mechanics are used to compute how the forces change atomic motions.

This problem is ideal for presenting key concepts in parallel algorithms because there are so many ways to approach the problem based on the target computer system and the intended use of the program. In this discussion, we will follow the approach taken in [Mat95] and assume that (1) a sequential version of the program exists, (2) having a single program for sequential and parallel execution is important, and (3) the target system is a small cluster connected by a standard Ethernet LAN. More scalable algorithms for execution on massively parallel systems are discussed in [PH95].
Figure 5.7. Pseudocode for molecular dynamics example. This code is very similar to the version discussed earlier, but a few extra details have been included. To support more detailed pseudocode examples, the call to the function that initializes the force arrays has been made explicit. Also, the fact that the neighbor list is only occasionally updated is made explicit.

Int const N // number of atoms

Array of Real :: atoms (3,N)      //3D coordinates
Array of Real :: velocities (3,N) //velocity vector
Array of Real :: forces (3,N)     //force in each dimension
Array of List :: neighbors(N)     //atoms in cutoff volume
loop over time steps
    initialize_forces (N, forces)
    if (time to update neighbor list)
        neighbor_list (N, atoms, neighbors)
    end if
    vibrational_forces (N, atoms, forces)
    rotational_forces (N, atoms, forces)
    non_bonded_forces (N, atoms, neighbors, forces)
    update_atom_positions_and_velocities (N, atoms, velocities, forces)
    physical_properties ( ... Lots of stuff ... )
end loop
(Sec. 3.1.3).

2. In computing the non_bonded_force, each atom potentially interacts with all the other atoms. Hence, each UE needs read access to the full atomic position array. Also, due to Newton's third law, each UE will be scattering contributions to the force across the full force array (the Examples section of the Data Sharing pattern).

3. One way to decompose the MD problem into tasks is to focus on the computations needed for a particular atom; that is, we can parallelize this problem by assigning atoms to UEs (the Examples section of the Task Decomposition pattern).

Given that our target is a small cluster and from point (1) in the preceding list, we will only parallelize the force computations. Because the network is slow for parallel computing and given the data dependency in point (2), we will:
#include <mpi.h>

Int const N      // number of atoms
Int const LN     // maximum number of atoms assigned to a UE
Int ID           // an ID for each UE
Int num_UEs      // the number of UEs in the parallel computation

Array of Real :: atoms (3,N)        //3D coordinates
Array of Real :: velocities (3,N)   //velocity vector
Array of Real :: forces (3,N)       //force in each dimension
Array of Real :: final_forces (3,N) //globally summed force
Array of List :: neighbors(LN)      //atoms in cutoff volume
Array of Int  :: local_atoms(LN)    //atoms for this UE
ID = 0       // default ID (used by the serial code)
num_UEs = 1  // default num_UEs (used by the serial code)

MPI_Init()
MPI_Comm_rank(MPI_COMM_WORLD, &ID)
MPI_Comm_size(MPI_COMM_WORLD, &num_UEs)

loop over time steps
   initialize_forces (N, forces, final_forces)
   if(time to update neighbor list)
      neighbor_list (N, LN, atoms, neighbors)
   end if
   vibrational_forces (N, LN, local_atoms, atoms, forces)
   rotational_forces (N, LN, local_atoms, atoms, forces)
   non_bonded_forces (N, LN, atoms, local_atoms, neighbors, forces)
   MPI_Allreduce(forces, final_forces, 3*N, MPI_REAL,
                 MPI_SUM, MPI_COMM_WORLD)
   update_atom_positions_and_velocities(
          N, atoms, velocities, final_forces)
   physical_properties ( ... Lots of stuff ... )
end loop
MPI_Finalize()
Figure 5.9. Pseudocode for the nonbonded computation in a typical parallel molecular dynamics code. This code is almost identical to the sequential version of the function shown in Fig. 4.4. The only major change is a new array of integers holding the indices for the atoms assigned to this UE, local_atoms. We've also assumed that the neighbor list has been generated to hold only those atoms assigned to this UE. For the sake of allocating space for these arrays, we have added a parameter LN which is the largest number of atoms that can be assigned to a single UE.

function non_bonded_forces (N, LN, atoms, local_atoms, neighbors, forces)

   Int N   // number of atoms
   Int LN  // maximum number of atoms assigned to a UE
   Array of Real :: atoms (3,N)     //3D coordinates
   Array of Real :: forces (3,N)    //force in each dimension
   Array of List :: neighbors(LN)   //atoms in cutoff volume
   Array of Int  :: local_atoms(LN) //atoms assigned to this UE
   Real :: forceX, forceY, forceZ

   loop [i] over local_atoms
      loop [j] over neighbors(i)
         forceX = non_bond_force(atoms(1,i), atoms(1,j))
         forceY = non_bond_force(atoms(2,i), atoms(2,j))
         forceZ = non_bond_force(atoms(3,i), atoms(3,j))
         forces(1,i) += forceX;  forces(1,j) -= forceX;
         forces(2,i) += forceY;  forces(2,j) -= forceY;
         forces(3,i) += forceZ;  forces(3,j) -= forceZ;
      end loop [j]
   end loop [i]
end function non_bonded_forces
Only a few changes are made to the sequential functions. First, a second force array called final_forces is defined to hold the globally consistent force array appropriate for the update of the atomic positions and velocities. Second, a list of atoms assigned to the UE is created and passed to any function that will be parallelized. Third, the neighbor_list is modified to hold the list for only those atoms assigned to the UE. Finally, within each of the functions to be parallelized (the forces calculations), the loop over atoms is replaced by a loop over the list of local atoms.

We show an example of these simple changes in Fig. 5.9. This is almost identical to the sequential version of this function discussed in the Task Parallelism pattern. As discussed earlier, the following are the key changes.
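The scheme of private force arrays combined by a global sum can be illustrated serially. In this sketch (the names, sizes, and toy force function are ours, not the book's), each simulated UE accumulates forces only for its cyclically assigned atoms into its own array; element-wise summation of the per-UE arrays, which is what the MPI_Allreduce accomplishes in the real program, reproduces the serial result.

```c
#include <string.h>
#include <assert.h>

#define NATOMS 8
#define NUES   3   /* number of simulated UEs (illustrative) */

/* Toy pairwise force contribution on atom i from atom j. */
static double pair_force(int i, int j) { return (double)(j - i); }

/* Serial reference: forces on every atom in one pass. */
void forces_serial(double f[NATOMS]) {
    memset(f, 0, NATOMS * sizeof(double));
    for (int i = 0; i < NATOMS; i++)
        for (int j = 0; j < NATOMS; j++)
            if (j != i) f[i] += pair_force(i, j);
}

/* Work of one UE: forces only for its cyclic subset of atoms,
   written into a private force array. */
void forces_one_ue(int id, double f[NATOMS]) {
    memset(f, 0, NATOMS * sizeof(double));
    for (int i = id; i < NATOMS; i += NUES)
        for (int j = 0; j < NATOMS; j++)
            if (j != i) f[i] += pair_force(i, j);
}

/* Element-wise sum of the per-UE arrays: the Allreduce step. */
void reduce_forces(double final_f[NATOMS]) {
    double partial[NATOMS];
    memset(final_f, 0, NATOMS * sizeof(double));
    for (int id = 0; id < NUES; id++) {
        forces_one_ue(id, partial);
        for (int i = 0; i < NATOMS; i++) final_f[i] += partial[i];
    }
}
```

Because every atom is owned by exactly one UE under the cyclic assignment, each element of the reduced array is computed by the same inner loop as in the serial pass, so the results match exactly.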
function neighbor (N, LN, ID, cutoff, atoms, local_atoms, neighbors)

   Int N        // number of atoms
   Int LN       // max number of atoms assigned to a UE
   Real cutoff  // radius of sphere defining neighborhood
   Array of Real :: atoms (3,N)     //3D coordinates
   Array of List :: neighbors(LN)   //atoms in cutoff volume
   Array of Int  :: local_atoms(LN) //atoms assigned to this UE
   Real :: dist_squ

   initialize_lists (local_atoms, neighbors)
   loop [i] over atoms on UE  //split loop iterations among UEs
      add_to_list (i, local_atoms)
      loop [j] over atoms greater than i
         dist_squ = square(atom(1,i)-atom(1,j))
                  + square(atom(2,i)-atom(2,j))
                  + square(atom(3,i)-atom(3,j))
         if(dist_squ < (cutoff * cutoff))
            add_to_list (j, neighbors(i))
         end if
      end loop [j]
   end loop [i]
end function neighbor
The logic defining how the parallelism is distributed among the UEs is captured in the single loop in Fig. 5.10:
loop [i] over atoms on UE  //split loop iterations among UEs
   add_to_list (i, local_atoms)
The details of how this loop is split among UEs depend on the programming environment. An approach that works well with MPI is the cyclic distribution we used in Fig. 5.5:
for (i=id; i<number_of_atoms; i+=number_of_UEs){
   add_to_list (i, local_atoms)
}
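One property worth checking for any such distribution is that every atom lands in exactly one UE's local list. The small helper below is our own illustration, not part of the text; it replays the cyclic loop for each simulated UE and counts how many times each index is claimed.

```c
#include <string.h>
#include <assert.h>

/* Count how many UEs claim each index under a cyclic distribution;
   the inner loop mirrors the add_to_list(i, local_atoms) step that
   each UE executes with its own id. */
void cyclic_coverage(int n_atoms, int n_ues, int count[]) {
    memset(count, 0, n_atoms * sizeof(int));
    for (int id = 0; id < n_ues; id++)
        for (int i = id; i < n_atoms; i += n_ues)
            count[i]++;
}
```

If any count were 0 an atom's forces would never be computed; if any were 2 they would be double-counted by the global sum.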
#include <omp.h>

Int const N   // number of atoms
Int const LN  // maximum number of atoms assigned to a UE
Int ID        // an ID for each UE
Int num_UEs   // number of UEs in the parallel computation

Array of Real :: atoms(3,N)      //3D coordinates
Array of Real :: velocities(3,N) //velocity vector
Array of Real :: forces(3,N)     //force in each dim
Array of List :: neighbors(LN)   //atoms in cutoff volume
Array of Int  :: local_atoms(LN) //atoms for this UE

ID = 0
num_UEs = 1

#pragma omp parallel private (ID, num_UEs, local_atoms, forces)
{
   ID = omp_get_thread_num()
   num_UEs = omp_get_num_threads()

   loop over time steps
      initialize_forces (N, forces, final_forces)
      if(time to update neighbor list)
         neighbor_list (N, LN, atoms, neighbors)
      end if
      vibrational_forces (N, LN, local_atoms, atoms, forces)
      rotational_forces (N, LN, local_atoms, atoms, forces)
      non_bonded_forces (N, LN, atoms, local_atoms, neighbors, forces)

      #pragma omp critical
         final_forces += forces

      #pragma omp barrier

      #pragma omp single
      {
         update_atom_positions_and_velocities(
                N, atoms, velocities, final_forces)
         physical_properties ( ... Lots of stuff ... )
      } // remember, the end of a single implies a barrier
   end loop
} // end of OpenMP parallel region
A reduction clause on the parallel region cannot be used in this case because the result would not be available until the parallel region completes. The critical section produces the correct result, but the algorithm used has a runtime that is linear in the number of UEs and is hence suboptimal relative to other reduction algorithms as discussed in the Implementation Mechanisms design space. On systems with a modest number of processors, however, the reduction with a critical section works adequately.

The barrier following the critical section is required to make sure the reduction completes before the atomic positions and velocities are updated. We then use an OpenMP single construct to cause only one UE to do the update. An additional barrier is not needed following the single since the close of a single construct implies a barrier. The functions used to compute the forces are unchanged between the OpenMP and MPI versions of the program.
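As a concrete contrast with the linear critical-section reduction, here is a sketch (ours, not the book's) of a tree-style reduction over per-UE partial sums: values are combined pairwise, so only about log2(P) combining rounds are needed instead of P serialized updates. This is the kind of algorithm a runtime might use internally for a reduction; the serial code below just demonstrates the combining pattern and its result.

```c
#include <assert.h>

/* Pairwise (tree) reduction of p partial sums, in place.
   Each pass doubles the stride, so the number of passes is
   ceil(log2(p)) rather than the p serialized steps a
   critical-section reduction needs. Result ends up in partial[0]. */
double tree_reduce(double partial[], int p) {
    for (int stride = 1; stride < p; stride *= 2)
        for (int i = 0; i + stride < p; i += 2 * stride)
            partial[i] += partial[i + stride];
    return partial[0];
}
```

In a parallel implementation the additions within one pass are independent and can proceed concurrently, which is the source of the logarithmic depth.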
Mandelbrot set computation
Int const Nrows    // number of rows in the image
Int const RowSize  // number of pixels in a row
Int const M        // number of colors in color map
Real :: conv       // divergence rate for a pixel

Array of Int  :: color_map (M) // pixel color based on conv rate
Array of Int  :: row (RowSize) // Pixels to draw
Array of Real :: ranges(2)     // ranges in X and Y dimensions
manage_user_input(ranges, color_map) // input ranges, color map
initialize_graphics(RowSize, Nrows, M, ranges, color_map)

for (int i = 0; i<Nrows; i++){
   compute_Row (RowSize, ranges, row)
   graph(i, RowSize, M, color_map, ranges, row)
} // end loop [i] over rows
number of rows in the image, receiving rows as they finish and plotting them. UEs with rank other than 0 use a cyclic distribution of loop iterations and send the rows to the graphics UE as they finish.
Known uses
Figure 5.13. Pseudocode for a parallel MPI version of the Mandelbrot set generation program
#include <mpi.h>

Int const Nrows    // number of rows in the image
Int const RowSize  // number of pixels in a row
Int const M        // number of colors in color map
Real :: conv       // divergence rate for a pixel

Array of Int  :: color_map (M) // pixel color based on conv rate
Array of Int  :: row (RowSize) // Pixels to draw
Array of Real :: ranges(2)     // ranges in X and Y dimensions

Int :: inRowSize   // size of received row
Int :: ID          // ID of each UE (process)
Int :: num_UEs     // number of UEs (processes)
Int :: nworkers    // number of UEs computing rows
MPI_Status :: stat // MPI status parameter

MPI_Init()
MPI_Comm_rank(MPI_COMM_WORLD, &ID)
MPI_Comm_size(MPI_COMM_WORLD, &num_UEs)

// Algorithm requires at least two UEs since we are
// going to dedicate one to graphics
if (num_UEs < 2) MPI_Abort(MPI_COMM_WORLD, 1)

if (ID == 0){
   manage_user_input(ranges, color_map) // input ranges, color map
   initialize_graphics(RowSize, Nrows, M, ranges, color_map)
}

// Broadcast data from rank 0 process to all other processes
MPI_Bcast (ranges, 2, MPI_REAL, 0, MPI_COMM_WORLD);

if (ID == 0) {  // UE with rank 0 does graphics
   for (int i = 0; i<Nrows; i++){
      MPI_Recv(row, &inRowSize, MPI_REAL, MPI_ANY_SOURCE,
               MPI_ANY_TAG, MPI_COMM_WORLD, &stat)
      row_index = stat.MPI_TAG
      graph(row_index, RowSize, M, color_map, ranges, row)
   } // end loop over i
}
else {          // The other UEs compute the rows
   nworkers = num_UEs - 1
   for (int i = ID-1; i<Nrows; i+=nworkers){
      compute_Row (RowSize, ranges, row)
      MPI_Send (row, RowSize, MPI_REAL, 0, i, MPI_COMM_WORLD);
   } // end loop over i
}
MPI_Finalize()
* The workloads associated with the tasks are highly variable and unpredictable. If workloads are predictable, they can be sorted into equal-cost bins, statically assigned to UEs, and parallelized using the SPMD or Loop Parallelism patterns. But if they are unpredictable, static distributions tend to produce suboptimal load balance.

* The program structure for the computationally intensive portions of the problem doesn't map onto simple loops. If the algorithm is loop-based, one can usually achieve a statistically near-optimal workload by a cyclic distribution of iterations or by using a dynamic schedule on the loop (for example, in OpenMP, by using the schedule(dynamic) clause). But if the control structure in the program is more complex than a simple loop, more general approaches are required.

* The capabilities of the PEs available for the parallel computation vary across the parallel system, change over the course of the computation, or are unpredictable.
In some cases, tasks are tightly coupled (that is, they communicate or share read and write data) and must be active at the same time. In this case, the Master/Worker pattern is not applicable: The programmer has no choice but to explicitly size or group tasks onto UEs dynamically (that is, during the computation) to achieve an effective load balance. The logic to accomplish this can be difficult to implement, and if one is not careful, can add prohibitively large parallel overhead.

If the tasks are independent of each other, however, or if the dependencies can somehow be pulled out from the concurrent computation, the programmer has much greater flexibility in how to balance the load. This allows the load balancing to be done automatically and is the situation we address in this pattern.

This pattern is particularly relevant for problems using the Task Parallelism pattern when there are no dependencies among the tasks (embarrassingly parallel problems). It can also be used with the Fork/Join pattern for the cases where the mapping of tasks onto UEs is indirect.

Forces
* The work for each task, and in some cases even the capabilities of the PEs, varies unpredictably in these problems. Hence, explicit predictions of the runtime for any given task are not possible and the design must balance the load without them.

* Operations to balance the load impose communication overhead and can be very expensive. This suggests that scheduling should revolve around a smaller number of large tasks. However, large tasks reduce the number of ways tasks can be partitioned among the PEs, thereby making it more difficult to achieve good load balance.

* Logic to produce an optimal load can be convoluted and require error-prone changes to a program. Programmers need to make tradeoffs between the desire for an optimal distribution of the load and code that is easy to maintain.
Figure 5.14. The two elements of the Master/Worker pattern are the master and the worker. There is only one master, but there can be one or more workers. Logically, the master sets up the calculation and then manages a bag of tasks. Each worker grabs a task from the bag, carries out the work, and then goes back to the bag, repeating until the termination condition is met.
A straightforward approach to implementing the bag of tasks is with a single shared queue as described in the Shared Queue pattern. Many other mechanisms for creating a globally accessible structure where tasks can be inserted and removed are possible, however. Examples include a tuple space [CG91, FHA99], a distributed queue, or a monotonic counter (when the tasks can be specified with a set of contiguous integers).

Meanwhile, each worker enters a loop. At the top of the loop, the worker takes a task from the bag of tasks, does the indicated work, tests for completion, and then goes to fetch the next task. This continues until the termination condition is met, at which time the master wakes up, collects the results, and finishes the computation.

Master/worker algorithms automatically balance the load. By this, we mean the programmer does not explicitly decide which task is assigned to which UE. This decision is made dynamically by the master as a worker completes one task and accesses the bag of tasks for more work.
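The monotonic-counter variant of the bag of tasks is simple enough to sketch. In this serial simulation (our own illustration, not from the text), grab_task plays the role of an atomic fetch-and-add on a shared counter; in a real shared-memory program the increment would need to be atomic, and in a message-passing program a service like TCGMSG's nextval would provide it.

```c
#include <assert.h>

/* Bag of tasks as a monotonic counter. Each call claims the next
   unclaimed task index, or returns -1 when the bag is empty. In a
   real parallel program the read-and-increment must be atomic
   (e.g., a fetch-and-add); here it is a serial stand-in. */
static int next_task = 0;

int grab_task(int ntasks) {
    if (next_task >= ntasks) return -1;  /* bag is empty */
    return next_task++;
}
```

A worker simply loops calling grab_task until it sees -1; no worker is ever idle while tasks remain, which is exactly the automatic load balancing described above.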
Discussion
Master/worker algorithms have good scalability as long as the number of tasks greatly exceeds the number of workers and the costs of the individual tasks are not so variable that some workers take drastically longer than the others.

Management of the bag of tasks can require global communication, and the overhead for this can limit efficiency. This effect is not a problem when the work associated with the tasks on average is much greater than the time required for management. In some cases, the designer might need to increase the size of each task to decrease the number of times the global task bag needs to be accessed.

The Master/Worker pattern is not tied to any particular hardware environment. Programs using this pattern work well on everything from clusters to SMP machines. It is, of course, beneficial if the programming environment provides support for managing the bag of tasks.
Detecting completion
* In the simplest case, all tasks are placed in the bag before the workers begin. Then each worker continues until the bag is empty, at which point the workers terminate.

* Another approach is to use a queue to implement the task bag and arrange for the master or a worker to check for the desired termination condition. When it is detected, a poison pill, a special task that tells the workers to terminate, is created. The poison pill must be placed in the bag in such a way that it will be picked up on the next round of work. Depending on how the set of shared tasks is managed, it may be necessary to create one poison pill for each remaining worker to ensure that all workers receive the termination condition.

* Problems for which the set of tasks is not known initially produce unique challenges. This occurs, for example, when workers can add tasks as well as consume them (such as in applications of the Divide and Conquer pattern). In this case, it is not necessarily true that when a worker finishes a task and finds the task bag empty that there is no more work to do; another still-active worker could generate a new task. One must therefore ensure that the task bag is empty and all workers are finished. Further, in systems based on asynchronous message passing, it must be determined that there are no messages in transit that could, on their arrival, result in the creation of a new task. There are many known algorithms that solve this problem. For example, suppose the tasks are conceptually organized into a tree, where the root is the master task, and the children of a task are the tasks it generates. When all of the children of a task have terminated, the parent task can terminate. When all the children of the master task have terminated, the computation has terminated. Algorithms for termination detection are described in [BT89, Mat87, DS80].
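The poison-pill mechanism can be sketched with a trivial queue. This serial mock-up is our own illustration, with a plain ring buffer standing in for a thread-safe shared queue: the master enqueues real tasks followed by one pill per worker, and the worker loop exits the moment it dequeues a pill.

```c
#include <assert.h>

#define QCAP 64
#define POISON_PILL (-1)   /* sentinel task meaning "terminate" */

static int queue[QCAP];
static int q_head = 0, q_tail = 0;

void q_put(int v) { queue[q_tail++ % QCAP] = v; }
int  q_get(void)  { return queue[q_head++ % QCAP]; }

/* Worker loop: consume tasks until the poison pill appears.
   Returns the number of real tasks processed. */
int worker_drain(void) {
    int task, done = 0;
    while ((task = q_get()) != POISON_PILL)
        done++;   /* stand-in for doing the task's work */
    return done;
}
```

Because the pill is enqueued after all real tasks, a FIFO queue guarantees no work is left behind when a worker sees its pill; with other bag implementations, one pill per worker is needed, as noted above.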
Variations
There are several variations on this pattern. Because of the simple way it implements dynamic load balancing, this pattern is very popular, especially in embarrassingly parallel problems (as described in the Task Parallelism pattern). Here are a few of the more common variations.
* The master may turn into a worker after it has created the tasks. This is an effective technique when the termination condition can be detected without explicit action by the master (that is, the tasks can detect the termination condition on their own from the state of the bag of tasks).

* When the concurrent tasks map onto a simple loop, the master can be implicit and the pattern can be implemented as a loop with dynamic iteration assignment as described in the Loop Parallelism pattern.

* A centralized task queue can become a bottleneck, especially in a distributed-memory environment. An optimal solution [FLR98] is based on random work stealing. In this approach, each PE maintains a separate double-ended task queue. New tasks are placed in the front of the task queue of the local PE. When a task is completed, a subproblem is removed from the front of the local task queue. If the local task queue is empty, then another PE is chosen randomly, and a subproblem from the back of its task queue is "stolen". If that queue is also empty, then the PE tries again with another randomly chosen PE. This is particularly effective when used in problems based on the Divide and Conquer pattern. In this case, the tasks at the back of the queue were inserted earlier and hence represent larger subproblems. Thus, this approach tends to move large subproblems while handling the finer-grained subproblems at the PE where they were created. This helps the load balance and reduces overhead for the small tasks created in the deeper levels of recursion.

* The Master/Worker pattern can be modified to provide a modest level of fault tolerance [BDK95]. The master maintains two queues: one for tasks that still need to be assigned to workers and another for tasks that have already been assigned, but not completed. After the first queue is empty, the master can redundantly assign tasks from the "not completed" queue. Hence, if a worker dies and therefore can't complete its tasks, another worker will cover the unfinished tasks.
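The double-ended queue at the heart of the work-stealing variation can be sketched as follows. This is a serial illustration with invented names; a real implementation needs synchronization between owner and thief (and bounded-capacity handling). The owner pushes and pops at one end, so it works on its newest, smallest subproblems, while a thief steals from the other end, taking the oldest and typically largest ones.

```c
#include <assert.h>

#define DCAP 64

/* Tasks live in buf[lo..hi-1]. The owner works at the hi end
   (newest tasks); a thief steals at the lo end (oldest tasks). */
typedef struct { int buf[DCAP]; int lo, hi; } Deque;

void dq_init (Deque *d)       { d->lo = d->hi = 0; }
int  dq_empty(const Deque *d) { return d->lo == d->hi; }
void dq_push (Deque *d, int t){ d->buf[d->hi++] = t; }    /* owner adds   */
int  dq_pop  (Deque *d)       { return d->buf[--d->hi]; } /* owner: newest */
int  dq_steal(Deque *d)       { return d->buf[d->lo++]; } /* thief: oldest */
```

The asymmetry is the design point: in divide-and-conquer workloads, stealing from the old end moves a large subproblem (amortizing the steal's cost) while the owner keeps cache-warm, fine-grained work local.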
Figure 5.15. Master process for a master/worker program. This assumes a shared address space so the task and results queues are visible to all UEs. In this simple version, the master initializes the queue, launches the workers, and then waits for the workers to finish (that is, the ForkJoin command launches the workers and then waits for them to finish before returning). At that point, results are consumed and the computation completes.

Int const Ntasks    // Number of tasks
Int const Nworkers  // Number of workers

SharedQueue :: task_queue;     // task queue
SharedQueue :: global_results; // queue to hold results

void master()
{
   void worker();

   // Create and initialize shared data structures
   task_queue = new SharedQueue()
   global_results = new SharedQueue()
   for (int i = 0; i < Ntasks; i++)
      enqueue(task_queue, i)

   // Create Nworkers threads executing function worker()
   ForkJoin (Nworkers, worker)

   consume_the_results (Ntasks)
}
Figure 5.16. Worker process for a master/worker program. We assume a shared address space thereby making task_queue and global_results available to the master and all workers. A worker loops over the task_queue and exits when the end of the queue is encountered.

void worker()
{
   Int :: i
   Result :: res

   while (!empty(task_queue)) {
      i = dequeue(task_queue)
      res = do_lots_of_work(i)
      enqueue(global_results, res)
   }
}
executed by a set of threads whose run methods behave like the worker threads described previously: removing a Runnable object from the queue and executing its run method. The Executor interface in the java.util.concurrent package in Java 2 1.5 provides direct support for the Master/Worker pattern. Classes implementing the interface provide an execute method that takes a Runnable object and arranges for its execution. Different implementations of the Executor interface provide different ways of managing the Thread objects that actually do the work. The ThreadPoolExecutor implements the Master/Worker pattern by using a fixed pool of threads to execute the commands. To use Executor, the program instantiates an instance of a class implementing the interface, usually using a factory method in the Executors class. For example, the code in Fig. 5.17 sets up a ThreadPoolExecutor that creates num_threads threads. These threads execute tasks specified by Runnable objects that are placed in an unbounded queue.

After the Executor has been created, a Runnable object whose run method specifies the behavior of the task can be passed to the execute method, which arranges for its execution. For example, assume the Runnable object is referred to by a variable task. Then for the executor defined previously, exec.execute(task); will place the task in the queue, where it will eventually be serviced by one of the executor's worker threads.
Figure 5.17. Instantiating and initializing a pooled executor

/* create a ThreadPoolExecutor with an unbounded queue */
Executor exec = Executors.newFixedThreadPool(num_threads);
Generating the Mandelbrot set is described in detail in the Examples section of the SPMD pattern. The basic idea is to explore a quadratic recurrence relation at each point in a complex plane and color the point based on the rate at which the recursion converges or diverges. Each point in the complex plane can be computed independently and hence the problem is embarrassingly parallel (see the Task Parallelism pattern).

In Fig. 5.18, we reproduce the pseudocode given in the SPMD pattern for a sequential version of this problem. The program loops over the rows of the image, displaying one row at a time as they are computed.

On homogeneous clusters or lightly loaded shared-memory multiprocessor computers, approaches based on the SPMD or Loop Parallelism patterns are most effective. On a heterogeneous cluster or a multiprocessor system shared among many users (and hence with an unpredictable load on any given PE at any given time), a master/worker approach will be more effective.

We will create a master/worker version of a parallel Mandelbrot program based on the high-level structure described earlier. The master will be responsible for graphing the results. In some problems, the results generated by the workers interact and it can be important for the master to wait until all the workers have completed before consuming results. In this case, however, the results do not interact, so we split the fork and join operations and have the master plot results as they become available.

Following the Fork, the master must wait for results to appear on the global_results queue. Because we know there will be one result per row, the master knows in advance how many results to fetch and the termination condition is expressed simply in terms of the number of iterations of the loop. After all the results have been plotted, the master waits at the Join function until all the workers have completed, at which point the master completes. Code is shown in Fig. 5.19. Notice that this code is similar to the generic case discussed earlier, except that we have overlapped the processing of the results with their computation by splitting the Fork and Join. As the names imply, Fork launches UEs running the indicated function and Join causes the master to wait for the workers to cleanly terminate. See the Shared Queue pattern for more details about the queue.
Figure 5.18. Pseudocode for a sequential version of the Mandelbrot set generation program
Int const Nrows    // number of rows in the image
Int const RowSize  // number of pixels in a row
Int const M        // number of colors in color map
Real :: conv       // divergence rate for a pixel

Array of Int  :: color_map (M) // pixel color based on conv rate
Array of Int  :: row (RowSize) // Pixels to draw
Array of Real :: ranges(2)     // ranges in X and Y dimensions
manage_user_input(ranges, color_map) // input ranges, color map
initialize_graphics(RowSize, Nrows, M, ranges, color_map)

for (int i = 0; i<Nrows; i++){
   compute_Row (RowSize, ranges, row)
   graph(i, RowSize, M, color_map, ranges, row)
} // end loop [i] over rows
Figure 5.19. Master process for a master/worker parallel version of the Mandelbrot set generation program
Int const Ntasks   // number of tasks
Int const Nworkers // number of workers
Int const Nrows    // number of rows in the image
Int const RowSize  // number of pixels in a row
Int const M        // number of colors in color map

typedef Row :: struct of {
   int :: index
   array of int :: pixels (RowSize)
} temp_row;

Array of Int  :: color_map (M) // pixel color based on conv rate
Array of Real :: ranges(2)     // ranges in X and Y dimensions

SharedQueue of Int :: task_queue;     // task queue
SharedQueue of Row :: global_results; // queue to hold results

void master()
{
   void worker();

   manage_user_input(ranges, color_map) // input ranges, color map
   initialize_graphics(RowSize, Nrows, M, ranges, color_map)

   // Create and initialize shared data structures
   task_queue = new SharedQueue();
   global_results = new SharedQueue();
   for (int i = 0; i < Nrows; i++)
      enqueue(task_queue, i);

   // Create Nworkers threads executing function worker()
   Fork (Nworkers, worker);

   // Wait for results and graph them as they appear
   for (int i = 0; i < Nrows; i++) {
      while (empty(global_results)) { // wait for results
         wait
      }
      temp_row = dequeue(global_results)
      graph(temp_row.index, RowSize, M, color_map, ranges,
            temp_row.pixels)
   }

   // Terminate the worker UEs
   Join (Nworkers);
}
Figure 5.20. Worker process for a master/worker parallel version of the Mandelbrot set generation program. We assume a shared address space thereby making task_queue, global_results, and ranges available to the master and the workers.

void worker()
{
   Int i, irow;
   Row temp_row;

   while (!empty(task_queue)) {
      irow = dequeue(task_queue);
      compute_Row (RowSize, ranges, irow, temp_row.pixels)
      temp_row.index = irow
      enqueue(global_results, temp_row);
   }
}
This pattern is extensively used with the Linda programming environment. The tuple space in Linda is ideally suited to programs that use the Master/Worker pattern, as described in depth in [CG91] and in the survey paper [CGMS94].

The Master/Worker pattern is used in many distributed computing environments because these systems must deal with extreme levels of unpredictability in the availability of resources. The SETI@home project [SET] uses the Master/Worker pattern to utilize volunteers' Internet-connected computers to download and analyze radio telescope data as part of the Search for Extraterrestrial Intelligence (SETI). Programs constructed with the Calypso system [BDK95], a distributed computing framework which provides system support for dynamic changes in the set of PEs, also use the Master/Worker pattern. A parallel algorithm for detecting repeats in genomic data [RHB03] uses the Master/Worker pattern with MPI on a cluster of dual-processor PCs.

Related Patterns

This pattern is closely related to the Loop Parallelism pattern when the loops utilize some form of dynamic scheduling (such as when the schedule(dynamic) clause is used in OpenMP). Implementations of the Fork/Join pattern sometimes use the Master/Worker pattern behind the scenes.

This pattern is also closely related to algorithms that make use of the nextval function from TCGMSG [Har91, WSG95, LDSH95]. The nextval function implements a monotonic counter. If the bag of tasks can be mapped onto a fixed range of monotonic indices, the counter provides the bag of tasks and the function of the master is implied by the counter.

Finally, the owner-computes filter discussed in the molecular dynamics example in the SPMD pattern is essentially a variation on the master/worker theme. In such an algorithm, all the master would do is set up the bag of tasks (loop iterations) and assign them to UEs, with the assignment of tasks to UEs defined by the filter. Because the UEs can essentially perform this assignment themselves (by examining each task with the filter), no explicit master is needed.
Context

The overwhelming majority of programs used in scientific and engineering applications are expressed in terms of iterative constructs; that is, they are loop-based. Optimizing these programs by focusing strictly on the loops is a tradition dating back to the older vector supercomputers. Extending this approach to modern parallel computers suggests a parallel algorithm strategy in which concurrent tasks are identified as iterations of parallelized loops.

The advantage of structuring a parallel algorithm around parallelized loops is particularly important in problems for which well-accepted programs already exist. In many cases, it isn't practical to massively restructure an existing program to gain parallel performance. This is particularly important when the program (as is frequently the case) contains convoluted code and poorly understood algorithms.

This pattern addresses ways to structure loop-based programs for parallel computation. When existing code is available, the goal is to "evolve" a sequential program into a parallel program by a series of transformations on the loops. Ideally, all changes are localized to the loops with transformations that remove loop-carried dependencies and leave the overall program semantics unchanged. (Such transformations are called semantically neutral transformations.)

Not all problems can be approached in this loop-driven manner. Clearly, it will only work when the algorithm structure has most, if not all, of the computationally intensive work buried in a manageable number of distinct loops. Furthermore, the body of the loop must result in loop iterations that work well as parallel tasks (that is, they are computationally intensive, express sufficient concurrency, and are mostly independent).

Not all target computer systems align well with this style of parallel programming. If the code cannot be restructured to create effective distributed data structures, some level of support for a shared address space is essential in all but the most trivial cases. Finally, Amdahl's law and its requirement to minimize a program's serial fraction often means that loop-based approaches are only effective for systems with smaller numbers of PEs.
Even with these restrictions, this class of parallel algorithms is growing rapidly. Because loop-based algorithms are the traditional approach in high-performance computing and are still dominant in new programs, there is a large backlog of loop-based programs that need to be ported to modern parallel computers. The OpenMP API was created primarily to support parallelization of these loop-driven problems. Limitations on the scalability of these algorithms are serious, but acceptable, given that there are orders of magnitude more machines with two or four processors than machines with dozens or hundreds of processors.

This pattern is particularly relevant for OpenMP programs running on shared-memory computers and for problems using the Task Parallelism and Geometric Decomposition patterns.

Forces
* Find the bottlenecks. Locate the most computationally intensive loops either by inspection of the code, by understanding the performance needs of each subproblem, or through the use of program performance analysis tools. The amount of total runtime on representative data sets contained by these loops will ultimately limit the scalability of the parallel program (see Amdahl's law).

* Eliminate loop-carried dependencies. The loop iterations must be nearly independent. Find dependencies between iterations or read/write accesses and transform the code to remove or mitigate them. Finding and removing the dependencies is discussed in the Task Parallelism pattern, while protecting dependencies with synchronization constructs is discussed in the Shared Data pattern.

* Parallelize the loops. Split up the iterations among the UEs. To maintain sequential equivalence, use semantically neutral directives such as those provided with OpenMP (as described in the OpenMP appendix, Appendix A). Ideally, this should be done to one loop at a time with testing and careful inspection carried out at each point to make sure race conditions or other errors have not been introduced.

* Optimize the loop schedule. The iterations must be scheduled for execution by the UEs so the load is evenly balanced. Although the right schedule can often be chosen based on a clear understanding of the problem, frequently it is necessary to experiment to find the optimal schedule.
overcome parallel loop overhead, by (1) creating more concurrency to better utilize larger numbers of UEs, and (2) providing additional options for how the iterations are scheduled onto UEs.

Parallelizing the loops is easily done with OpenMP by using the omp parallel for directive. This directive tells the compiler to create a team of threads (the UEs in a shared-memory environment) and to split up loop iterations among the team. The last loop in Fig. 5.22 is an example of a loop parallelized with OpenMP. We describe this directive at a high level in the Implementation Mechanisms design space. Syntactic details are included in the OpenMP appendix, Appendix A.

Notice that in Fig. 5.22 we had to direct the system to create copies of the indices i and j local to each thread. The single most common error in using this pattern is to neglect to "privatize" key variables. If i and j are shared, then updates of i and j by different UEs can collide and lead to unpredictable results (that is, the program will contain a race condition). Compilers usually will not detect these errors, so programmers must take great care to make sure they avoid these situations.
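The earlier step of eliminating loop-carried dependencies can be illustrated with a small example of our own (not taken from the text). The first loop couples its iterations through an induction variable; rewriting the variable as a function of the loop index is a semantically neutral transformation that makes the iterations independent and therefore safe to split among UEs.

```c
#include <assert.h>

/* Original form: iterations depend on the running value of idx,
   a loop-carried dependency that blocks parallelization. */
void fill_dependent(int n, int out[]) {
    int idx = 0;
    for (int i = 0; i < n; i++) {
        out[i] = idx;
        idx += 3;       /* carried from one iteration to the next */
    }
}

/* Transformed form: idx is computed from i alone, so every
   iteration is independent and the loop can be parallelized
   (e.g., with an omp parallel for directive). */
void fill_independent(int n, int out[]) {
    for (int i = 0; i < n; i++)
        out[i] = 3 * i;
}
```

Because the two functions produce identical arrays for a single thread, the transformation is semantically neutral, and since the second version has no cross-iteration state, it is also sequentially equivalent when parallelized.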
Figure 5.21. Program fragment showing merging loops to increase the amount of work per iteration
#define N 20
#define Npoints 512

void FFT();    // a function to apply an FFT
void invFFT(); // a function to apply an inverse FFT
void filter(); // a frequency space filter
void setH();   // Set values of filter, H
int main()
{
   int i, j;
   double A[Npoints], B[Npoints], C[Npoints], H[Npoints];
   setH(Npoints, H);

   // do a bunch of work resulting in values for A and C

   // method one: distinct loops to compute A and C
   for(i=0; i<N; i++){
      FFT (Npoints, A, B);    // B = transformed A
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, A);  // A = inv transformed B
   }
   for(i=0; i<N; i++){
      FFT (Npoints, C, B);    // B = transformed C
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, C);  // C = inv transformed B
   }

   // method two: the above pair of loops combined into
   // a single loop
   for(i=0; i<N; i++){
      FFT (Npoints, A, B);    // B = transformed A
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, A);  // A = inv transformed B
      FFT (Npoints, C, B);    // B = transformed C
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, C);  // C = inv transformed B
   }
}
The key to the application of this pattern is to use semantically neutral modifications to produce sequentially equivalent code. A semantically neutral modification doesn't change the meaning of the single-threaded program. Techniques for loop merging and coalescing of nested loops described previously, when used appropriately, are examples of semantically neutral modifications. In addition, most of the directives in OpenMP are semantically neutral. This means that adding the directive and running it with a single thread will give the same result as running the original program without the OpenMP directive.

Two programs that are semantically equivalent (when run with a single thread) need not both be sequentially equivalent. Recall that sequentially equivalent means that the program will give the same result (subject to roundoff errors due to changing the order of floating-point operations) whether run with one thread or many. Indeed, the (semantically neutral) transformations that eliminate loop-carried dependencies are motivated by the desire to change a program that is not sequentially equivalent into one that is. When transformations are made to improve performance, even though the transformations are semantically neutral, one must be careful that sequential equivalence has not been lost.
Figure 5.22. Program fragment showing coalescing nested loops to produce a single loop with a larger number of iterations
#define N 20
#define M 10
extern double work(); // a time-consuming function

int main() {
   int i, j, ij;
   double A[M][N];

   // method one: nested loops
   for(j=0; j<N; j++){
      for(i=0; i<M; i++){
         A[i][j] = work(i,j);
      }
   }

   // method two: the above pair of nested loops combined into
   // a single loop.
   for(ij=0; ij<N*M; ij++){
      j = ij/M;
      i = ij%M;
      A[i][j] = work(i,j);
   }

   // method three: the above loop parallelized with OpenMP.
   // The omp pragma creates a team of threads and maps loop
   // iterations onto them. The private clause tells each thread
   // to maintain local copies of ij, j, and i.
   #pragma omp parallel for private(ij, j, i)
   for(ij=0; ij<N*M; ij++){
      j = ij/M;
      i = ij%M;
      A[i][j] = work(i,j);
   }
   return 0;
}
It is much more difficult to define sequentially equivalent programs when the code mentions either a thread ID or the number of threads. Algorithms that reference thread IDs and the number of threads tend to favor particular threads or even particular numbers of threads, a situation that is dangerous when the goal is a sequentially equivalent program.

When an algorithm depends on the thread ID, the programmer is using the SPMD pattern. This may be confusing. SPMD programs can be loop based. In fact, many of the examples in the SPMD pattern are indeed loop-based algorithms. But they are not instances of the Loop Parallelism pattern, because they display the hallmark trait of an SPMD program: namely, they use the UE ID to guide the algorithm.

Finally, we've assumed that a directive-based system such as OpenMP is available when using this pattern. It is possible, but clearly more difficult, to apply this pattern without such a directive-based programming environment. For example, in object-oriented designs, one can use the Loop Parallelism pattern by making clever use of anonymous classes with parallel iterators. Because the parallelism is buried in the iterators, the conditions of sequential equivalence can be met.
Performance considerations
some degree of nonuniformity in memory access times across the system. In many cases, these effects are of secondary concern, and we can ignore how a program's memory access patterns match up with the target system's memory hierarchy. In other cases, particularly on larger shared-memory machines, programs must be explicitly organized according to the needs of the memory hierarchy. The most common trick is to make sure the data access patterns during initialization of key data structures match those during later computation using these data structures. This is discussed in more detail in [Mat03, NA01] and later in this pattern as part of the mesh computation example.

Another performance problem is false sharing. This occurs when variables are not shared between UEs, but happen to reside on the same cache line. Hence, even though the program semantics implies independence, each access by each UE requires movement of a cache line between UEs. This can create huge overheads as cache lines are repeatedly invalidated and moved between UEs as these supposedly independent variables are updated. An example of a program fragment that would incur high levels of false sharing is shown in Fig. 5.23. In this code, we have a pair of nested loops. The outermost loop has a small iteration count that will map onto the number of UEs (which we assume is four in this case). The innermost loop runs over a large number of time-consuming iterations. Assuming the iterations of the innermost loop are roughly equal, this loop should parallelize effectively. But the updates to the elements of the A array inside the innermost loop mean each update requires the UE in question to own the indicated cache line. Although the elements of A are truly independent between UEs, they likely sit in the same cache line. Hence, every iteration in the innermost loop incurs an expensive cache-line invalidate-and-movement operation. It is not uncommon for this to not only destroy all parallel speedup, but even to cause the parallel program to become slower as more PEs are added. The solution is to create a temporary variable on each thread to accumulate values in the innermost loop. False sharing is still a factor, but only for the much smaller outermost loop, where the performance impact is negligible.
Figure 5.23. Program fragment showing an example of false sharing. The small array A is held in one or two cache lines. As the UEs access A inside the innermost loop, they will need to take ownership of the cache line back from the other UEs. This back-and-forth movement of the cache lines destroys performance. The solution is to use a temporary variable inside the innermost loop.
#include <omp.h>
#define N 4     // Assume this equals the number of UEs
#define M 1000
extern double work(int, int); // a time-consuming function

int main() {
   int i, j;
   double A[N] = {0.0}; // Initialize the array to zero

   // method one: a loop with false sharing from A since the elements
   // of A are likely to reside in the same cache line.
   #pragma omp parallel for private(j,i)
   for(j=0; j<N; j++){
      for(i=0; i<M; i++){
         A[j] += work(i,j);
      }
   }

   // method two: remove the false sharing by using a temporary
   // private variable in the innermost loop
   double temp;
   #pragma omp parallel for private(j,i,temp)
   for(j=0; j<N; j++){
      temp = 0.0;
      for(i=0; i<M; i++){
         temp += work(i,j);
      }
      A[j] += temp;
   }
   return 0;
}
Examples

As examples of this pattern in action, we will briefly consider the following:

* a numerical integration program to estimate the value of π
* the nonbonded force computation from a molecular dynamics program
* a Mandelbrot set computation
* a mesh computation for the 1D heat-diffusion equation
Each of these examples has been described elsewhere in detail. We will restrict our discussion in this pattern to the key loops and how they can be parallelized.
Numerical integration
Consider the problem of estimating the value of π using Eq. 5.5.

Equation 5.5

    π = ∫₀¹ 4/(1 + x²) dx
#include <stdio.h>
#include <math.h>

int main () {
   int i;
   int num_steps = 1000000;
   double x, pi, step, sum = 0.0;

   step = 1.0/(double) num_steps;
   for (i=0; i<num_steps; i++) {
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
   }
   pi = step * sum;
   printf("pi %lf\n", pi);
   return 0;
}
The OpenMP include file defines function prototypes and opaque data types used by OpenMP.
#pragma omp parallel for private(x) reduction(+:sum)
Molecular dynamics.
Throughout this book, we have used molecular dynamics as a recurring example. Molecular dynamics simulates the motions of a large molecular system. It uses an explicit time-stepping methodology where at each time step the force on each atom is computed and standard techniques from classical mechanics are used to compute how the forces change atomic motions.

The core algorithm, including pseudocode, was presented in Sec. 3.1.3 and in the SPMD pattern. The problem comes down to a collection of computationally expensive loops over the atoms within the molecular system. These are embedded in a top-level loop over time.

The loop over time cannot be parallelized because the coordinates and velocities from time step t-1 are the starting point for time step t. The individual loops over atoms, however, can be parallelized. The most important case to address is the nonbonded energy calculation. The code for this computation is shown in Fig. 5.25. Unlike the approach used in the examples from the SPMD pattern, we assume that the program and its data structures are unchanged from the serial case.
Figure 5.25. Pseudocode for the nonbonded computation in a typical parallel molecular dynamics code. This code is almost identical to the sequential version of the function shown previously in Fig. 4.4.

function non_bonded_forces (N, Atoms, neighbors, Forces)
   Int N                           // number of atoms
   Array of Real :: atoms (3,N)    // 3D coordinates
   Array of Real :: forces (3,N)   // force in each dimension
   Array of List :: neighbors(N)   // atoms in cutoff volume
   Real :: forceX, forceY, forceZ

   loop [i] over atoms
      loop [j] over neighbors(i)
         forceX = non_bond_force(atoms(1,i), atoms(1,j))
         forceY = non_bond_force(atoms(2,i), atoms(2,j))
         forceZ = non_bond_force(atoms(3,i), atoms(3,j))
         force(1,i) += forceX;   force(1,j) -= forceX;
         force(2,i) += forceY;   force(2,j) -= forceY;
         force(3,i) += forceZ;   force(3,j) -= forceZ;
      end loop [j]
   end loop [i]
end function non_bonded_forces
This schedule tells the compiler to group the loop iterations into blocks of size 10 and assign them dynamically to the UEs. The size of the blocks is arbitrary and chosen to balance dynamic scheduling overhead versus how effectively the load can be balanced.

OpenMP 2.0 for C/C++ does not support reductions over arrays, so the reduction would need to be done explicitly. This is straightforward and is shown in Fig. 5.11. A future release of OpenMP will correct this deficiency and support reductions over arrays for all languages that support OpenMP.

The same method used to parallelize the nonbonded force computation could be used throughout the molecular dynamics program. The performance and scalability will lag the analogous SPMD version of the program. The problem is that each time a parallel directive is encountered, a new team of threads is in principle created. Most OpenMP implementations use a thread pool, rather than actually creating a new team of threads for each parallel region, which minimizes thread creation and destruction overhead. However, this method of parallelizing the computation still adds significant overhead. Also, the reuse of data from caches tends to be poor for these approaches. In principle, each loop can access a different pattern of atoms on each UE. This eliminates the capability for UEs to make effective use of values already in cache.

Even with these shortcomings, however, these approaches are commonly used when the goal is extra parallelism on a small shared-memory system [BBE+99]. For example, one might use an SPMD version of the molecular dynamics program across a cluster and then use OpenMP to gain extra performance from dual processors or from microprocessors utilizing simultaneous multithreading [MPS02].
Mandelbrot set computation
C and Z are complex numbers and the recurrence is started with Z0 = C. The image plots the
Each pixel corresponds to a value of C in the quadratic recurrence. We compute this value based on the input range and the pixel indices.
Figure 5.26. Pseudocode for a sequential version of the Mandelbrot set generation program
Int const Nrows      // number of rows in the image
Int const RowSize    // number of pixels in a row
Int const M          // number of colors in color map
Real :: conv         // divergence rate for a pixel
Array of Int :: color_map (M)   // pixel color based on conv rate
Array of Int :: row (RowSize)   // Pixels to draw
Array of Real :: ranges(2)      // ranges in X and Y dimensions

manage_user_input(ranges, color_map)   // input ranges, color map
initialize_graphics(RowSize, Nrows, M, ranges, color_map)

for (int i = 0; i < Nrows; i++){
   compute_Row (RowSize, ranges, row)
   graph(i, RowSize, M, color_map, ranges, row)
} // end loop [i] over rows
The scheduling can be a bit tricky because the work associated with each row will vary considerably
Mesh computation

Consider a simple mesh computation that solves the 1D heat-diffusion equation. The details of this problem and its solution using OpenMP are presented in the Geometric Decomposition pattern. We reproduce this solution in Fig. 5.27.

This program would work well on most shared-memory computers. A careful analysis of the program performance, however, would expose two performance problems. First, the single directive required to protect the swapping of the shared pointers adds an extra barrier, thereby greatly increasing the synchronization overhead. Second, on NUMA computers, memory access overhead is likely to be high because we've made no effort to keep the arrays near the PEs that will be manipulating them.

We address both of these problems in Fig. 5.28. To eliminate the need for the single directive, we modify the program so each thread has its own copy of the pointers uk and ukp1. This can be done with a private clause, but to be useful, we need the new, private copies of uk and ukp1 to point to the shared arrays comprising the mesh of values. We do this with the firstprivate clause applied to the parallel directive that creates the team of threads.

The other performance issue we address, minimizing memory access overhead, is more subtle. As discussed earlier, to reduce memory traffic in the system, it is important to keep the data close to the PEs that will work with the data. On NUMA computers, this corresponds to making sure the pages of memory are allocated and "owned" by the PEs that will be working with the data contained in the page. The most common NUMA page placement algorithm is the "first touch" algorithm, in which
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]) {
   uk[0] = LEFTVAL; uk[NX-1] = RIGHTVAL;
   for (int i = 1; i < NX-1; ++i)
      uk[i] = 0.0;
   for (int i = 0; i < NX; ++i)
      ukp1[i] = uk[i];
}

void printValues(double uk[], int step) { /* NOT SHOWN */ }

int main(void) {
   /* pointers to arrays for two iterations of algorithm */
   double *uk = malloc(sizeof(double) * NX);
   double *ukp1 = malloc(sizeof(double) * NX);
   double *temp;
   int i, k;
   double dx = 1.0/NX;
   double dt = 0.5*dx*dx;

   #pragma omp parallel private (k, i)
   {
      initialize(uk, ukp1);
      for (k = 0; k < NSTEPS; ++k) {
         #pragma omp for schedule(static)
         for (i = 1; i < NX-1; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx))*(uk[i+1]-2*uk[i]+uk[i-1]);
         }
         /* "copy" ukp1 to uk by swapping pointers */
         #pragma omp single
         { temp = ukp1; ukp1 = uk; uk = temp; }
      }
   }
   return 0;
}
We do this by first changing the initialization slightly so the initialization loop is identical to the
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000

void initialize(double uk[], double ukp1[]) {
   int i;
   uk[0] = LEFTVAL; uk[NX-1] = RIGHTVAL;
   ukp1[0] = LEFTVAL; ukp1[NX-1] = RIGHTVAL;
   #pragma omp for schedule(static)
   for (i = 1; i < NX-1; ++i){
      uk[i] = 0.0;
      ukp1[i] = 0.0;
   }
}

void printValues(double uk[], int step) { /* NOT SHOWN */ }

int main(void) {
   /* pointers to arrays for two iterations of algorithm */
   double *uk = malloc(sizeof(double) * NX);
   double *ukp1 = malloc(sizeof(double) * NX);
   double *temp;
   int i, k;
   double dx = 1.0/NX;
   double dt = 0.5*dx*dx;

   #pragma omp parallel private (k, i, temp) firstprivate(uk, ukp1)
   {
      initialize(uk, ukp1);
      for (k = 0; k < NSTEPS; ++k) {
         #pragma omp for schedule(static)
         for (i = 1; i < NX-1; ++i) {
            ukp1[i] = uk[i] + (dt/(dx*dx))*(uk[i+1]-2*uk[i]+uk[i-1]);
         }
         /* "copy" ukp1 to uk by swapping pointers */
         temp = ukp1; ukp1 = uk; uk = temp;
      }
   }
   return 0;
}
Known uses
The Loop Parallelism pattern is heavily used by OpenMP programmers. Annual workshops are held in North America (Wompat: Workshop on OpenMP Applications and Tools), Europe (EWOMP: European Workshop on OpenMP), and Japan (WOMPEI: Workshop on OpenMP Experiences and Implementations) to discuss OpenMP and its use. Proceedings from many of these workshops are widely available [VJKT00, Sci03, EV01] and are full of examples of the Loop Parallelism pattern.

Most of the work on OpenMP has been restricted to shared-memory multiprocessor machines for problems that work well with a nearly flat memory hierarchy. Work has been done to extend OpenMP applications to more complicated memory hierarchies, including NUMA machines [NA01, SSGF00] and even clusters [HLCZ99, SHTS01].

Related Patterns

The concept of driving parallelism from a collection of loops is general and used with many patterns. In particular, many problems using the SPMD pattern are loop based. They use the UE ID, however, to drive the parallelization of the loop and hence don't perfectly map onto this pattern. Furthermore, problems using the SPMD pattern usually include some degree of parallel logic in between the loops. This allows them to decrease their serial fraction and is one of the reasons why SPMD programs tend to scale better than programs using the Loop Parallelism pattern.

Algorithms targeted for shared-memory computers that use the Task Parallelism or Geometric Decomposition patterns frequently use the Loop Parallelism pattern.
program continues to execute. In most cases, the relationships between tasks are simple, and dynamic task creation can be handled with parallel loops (as described in the Loop Parallelism pattern) or through task queues (as described in the Master/Worker pattern). In other cases, relationships between the tasks within the algorithm must be captured in the way the tasks are managed. Examples include recursively generated task structures, highly irregular sets of connected tasks, and problems where different functions are mapped onto different concurrent tasks. In each of these examples, tasks are forked and later joined with the parent task (that is, the task that executed the fork) and the other tasks created by the same fork. These problems are addressed in the Fork/Join pattern.

As an example, consider an algorithm designed using the Divide and Conquer pattern. As the program execution proceeds, the problem is split into subproblems and new tasks are recursively created (or forked) to concurrently execute subproblems; each of these tasks may in turn be further split. When all the tasks created to handle a particular split have terminated and joined with the parent task, the parent task continues the computation.

This pattern is particularly relevant for Java programs running on shared-memory computers and for problems using the Divide and Conquer and Recursive Data patterns. OpenMP can be used effectively with this pattern when the OpenMP environment supports nested parallel regions.

Forces
In principle, nested parallel regions in OpenMP programs also map onto this direct
Thread and process creation and destruction is one of the more expensive operations that occur in parallel programs. Thus, if a program contains repeated fork and join sections, the simple solution, which would require repeated destruction and creation of UEs, might not be efficient enough. Also, if at some point there are many more UEs than PEs, the program might incur unacceptable overhead due to the costs of context switches.

In this case, it is desirable to avoid the dynamic UE creation by implementing the fork/join paradigm using a thread pool. The idea is to create a (relatively) static set of UEs before the first fork operation. The number of UEs is usually the same as the number of PEs. The mapping of tasks to UEs then occurs dynamically using a task queue. The UEs themselves are not repeatedly created and destroyed, but simply mapped to dynamically created tasks as the need arises. This approach, although complicated to implement, usually results in efficient programs with good load balance. We will discuss a Java program that uses this approach in the Examples section of this pattern.

In OpenMP, there is some controversy over the best approach to use with this indirect mapping approach [Mat03]. An approach gaining credibility is one based on a new, proposed OpenMP workshare construct called a task queue [SHPT00]. The proposal actually defines two new constructs: a task queue and a task. As the name implies, the programmer uses a task queue construct to create the task queue. Inside the task queue construct, a task construct defines a block of code that will be packaged into a task and placed on the task queue. The team of threads (as usually created with a parallel construct), playing the role of a thread pool, pulls tasks off the queue and executes them until the queue is empty.

Unlike OpenMP parallel regions, task queues can be dependably nested to produce a hierarchy of task queues. The threads work across task queues using work stealing to keep all threads fully occupied until all of the queues are empty. This approach has been shown to work well [SHPT00] and is likely to be adopted in a future OpenMP specification.

Examples

As examples of this pattern, we will consider direct mapping and indirect mapping implementations
static void sort(final int[] A, final int lo, final int hi) {
   int n = hi - lo;
   //if not large enough to do in parallel, sort sequentially
   if (n <= THRESHOLD) {
      Arrays.sort(A, lo, hi);
      return;
   } else {
      //split array
      final int pivot = (hi+lo)/2;
      //create and start new thread to sort lower half
      Thread t = new Thread() {
         public void run() {
            sort(A, lo, pivot);
         }
      };
      t.start();
      //sort upper half in current thread
      sort(A, pivot, hi);
      //wait for other thread
      try { t.join(); }
      catch (InterruptedException e){ Thread.dumpStack(); }
      //merge sorted arrays
      int[] ws = new int[n];
      System.arraycopy(A, lo, ws, 0, n);
      int wpivot = pivot - lo;
      int wlo = 0;
      int whi = wpivot;
      for (int i = lo; i != hi; i++) {
         if ((wlo < wpivot) && (whi >= n || ws[wlo] <= ws[whi])) {
            A[i] = ws[wlo++];
         } else {
            A[i] = ws[whi++];
         }
      }
   }
}
The first step of the method is to compute the size of the segment of the array to be sorted. If the size of the problem is too small to make the overhead of sorting it in parallel worthwhile, then a sequential sorting algorithm is used (in this case, the tuned quicksort implementation provided by the Arrays class in the java.util package). If the sequential algorithm is not used, then a pivot point is computed to divide the segment to be sorted. A new thread is forked to sort the lower half of the array, while the parent thread sorts the upper half. The new task is specified by the run method of an anonymous inner subclass of the Thread class. When the new thread has finished sorting, it terminates. When the parent thread finishes sorting, it performs a join to wait for the child thread to terminate and then merges the two sorted segments together.

This simple approach may be adequate in fairly regular problems where appropriate threshold values can easily be determined. We stress that it is crucial that the threshold value be chosen appropriately: If too small, the overhead from too many UEs can make the program run even slower than a sequential version. If too large, potential concurrency remains unexploited.
Figure 5.30. Instantiating FJTaskRunnerGroup and invoking the master task
int groupSize = 4;  //number of threads
FJTaskRunnerGroup group = new FJTaskRunnerGroup(groupSize);
group.invoke(new FJTask() {
   public void run() {
      synchronized(this) {
         sort(A, 0, A.length);
      }
   }
});
This example uses the FJTask framework included as part of the public-domain package EDU.oswego.cs.dl.util.concurrent [Lea00b].[4] Instead of creating a new thread to execute each task, an instance of a (subclass of) FJTask is created. The package then dynamically maps the FJTask objects to a static set of threads for execution. Although less general than a Thread, an FJTask is a much lighter-weight object than a thread and is thus much cheaper to create and destroy. In Fig. 5.30 and Fig. 5.31, we show how to modify the mergesort example to use FJTasks instead of Java threads. The needed classes are imported from package EDU.oswego.cs.dl.util.concurrent. Before starting any FJTasks, a FJTaskRunnerGroup must be instantiated, as shown in Fig. 5.30. This creates the threads that will constitute the thread pool and takes the number of threads (group size) as a parameter. Once instantiated, the master task is invoked using the invoke method on the FJTaskRunnerGroup.
Because OpenMP is based on a fork/join programming model, one might expect heavy use of the
Fork/Join pattern by OpenMP programmers. The reality is, however, that most OpenMP programmers use either the Loop Parallelism or SPMD patterns because the current OpenMP standard provides poor support for true nesting of parallel regions. One of the few published accounts of using the Fork/Join pattern with standard OpenMP is a paper where nested parallelism was used to provide fine-grained parallelism in an implementation of LAPACK [ARv03].

Extending OpenMP so it can use the Fork/Join pattern in substantial applications is an active area of research. We've mentioned one of these lines of investigation for the case of the indirect mapping solution of the Fork/Join pattern (the task queue [SHPT00]). Another possibility is to support nested parallel regions with explicit groups of threads for the direct mapping solution of the Fork/Join pattern (the Nanos OpenMP compiler [GAM+00]).

Related Patterns

Algorithms that use the Divide and Conquer pattern use the Fork/Join pattern.

The Loop Parallelism pattern, in which threads are forked just to handle a single parallel loop, is an instance of the Fork/Join pattern.

The Master/Worker pattern, which in turn uses the Shared Queue pattern, can be used to implement the indirect mapping solution.
Given that the problem naturally decomposes into nearly independent tasks (one per set), the solution to this problem would use the Task Parallelism pattern. Using the pattern is complicated, however, by the fact that all tasks need both read and write access to the data structure of rejected sets. Also, because this data structure changes during the computation, we cannot use the replication technique described in the Task Parallelism pattern. Partitioning the data structure and basing a solution on this data decomposition, as described in the Geometric Decomposition pattern, might seem like a good alternative, but the way in which the elements are rejected is unpredictable, so any data decomposition is likely to lead to a poor load balance.

Similar difficulties can arise any time shared data must be explicitly managed inside a set of concurrent tasks. The common elements for problems that need the Shared Data pattern are (1) at least one data structure is accessed by multiple tasks in the course of the program's execution, (2) at least one task modifies the shared data structure, and (3) the tasks potentially need to use the modified value during the concurrent computation.

Forces
take (dequeue), and possibly other operations, such as a test for an empty queue or a test to see if a specified element is present. Each task will typically perform a sequence of these operations. These operations should have the property that if they are executed serially (that is, one at a time, without interference from other tasks), each operation will leave the data in a consistent state.

The implementation of the individual operations will most likely involve a sequence of lower-level actions, the results of which should not be visible to other UEs. For example, if we implemented the previously mentioned queue using a linked list, a "take" operation actually involves a sequence of lower-level operations (which may themselves consist of a sequence of even lower-level operations):

1. Use variable first to obtain a reference to the first object in the list.
2. From the first object, get a reference to the second object in the list.
3. Replace the value of first with the reference to the second object.
4. Update the size of the list.
5. Return the first element.

If two tasks are executing "take" operations concurrently, and these lower-level operations are interleaved (that is, the "take" operations are not being executed atomically), the result could easily be an inconsistent list.
Implement an appropriate concurrency-control protocol
After the ADT and its operations have been identified, the objective is to implement a concurrency-control protocol to ensure that these operations give the same results as if they were executed serially. There are several ways to do this; start with the first technique, which is the simplest, and then try the other more complex techniques if it does not yield acceptable performance. These more complex techniques can be combined if more than one is applicable.

One-at-a-time execution. The easiest solution is to ensure that the operations are indeed executed serially.

In a shared-memory environment, the most straightforward way to do this is to treat each operation as part of a single critical section and use a mutual-exclusion protocol to ensure that only one UE at a time is executing its critical section. This means that all of the operations on the data are mutually exclusive. Exactly how this is implemented will depend on the facilities of the target programming environment. Typical choices include mutex locks, synchronized blocks, critical sections, and semaphores. These mechanisms are described in the Implementation Mechanisms design space. If the programming language naturally supports the implementation of abstract data types, it is usually appropriate to implement each operation as a procedure or method, with the mutual-exclusion protocol implemented in the method itself.

In a message-passing environment, the most straightforward way to ensure serial execution is to assign the shared data structure to a particular UE. Each operation should correspond to a message type;
other processes request operations by sending messages to the UE managing the data structure, which processes them serially.

In either environment, this approach is usually not difficult to implement, but it can be overly conservative (that is, it might disallow concurrent execution of operations that would be safe to execute simultaneously), and it can produce a bottleneck that negatively affects the performance of the program. If this is the case, the remaining approaches described in this section should be reviewed to see whether one of them can reduce or eliminate this bottleneck and give better performance.

Noninterfering sets of operations. One approach to improving performance begins by analyzing the interference between the operations. We say that operation A interferes with operation B if A writes a variable that B reads. Notice that an operation may interfere with itself, which would be a concern if more than one task executes the same operation (for example, more than one task executes "take" operations on a shared queue). It may be the case, for example, that the operations fall into two disjoint sets, where the operations in different sets do not interfere with each other. In this case, the amount of concurrency can be increased by treating each of the sets as a different critical section. That is, within each set, operations execute one at a time, but operations in different sets can proceed concurrently.
Readers/writers. If there is no obvious way to partition the operations into disjoint sets, consider the type of interference. It may be the case that some of the operations modify the data, but others only read it. For example, if operation A is a writer (both reading and writing the data) and operation B is a reader (reading, but not writing, the data), A interferes with itself and with B, but B does not interfere with itself. Thus, if one task is performing operation A, no other task should be able to execute either A or B, but any number of tasks should be able to execute B concurrently. In such cases, it may be worthwhile to implement a readers/writers protocol that will allow this potential concurrency to be exploited. The overhead of managing the readers/writers protocol is greater than that of simple mutex locks, so the length of the readers' computation should be long enough to make this overhead worthwhile. In addition, there should generally be a larger number of concurrent readers than writers.

The java.util.concurrent package provides read/write locks to support the readers/writers protocol. The code in Fig. 5.32 illustrates how these locks are typically used: First instantiate a ReadWriteLock, and then obtain its read and write locks. ReentrantReadWriteLock is a class that implements the ReadWriteLock interface. To perform a read operation, the read lock must be locked. To perform a write operation, the write lock must be locked. The semantics of the locks are that any number of UEs can simultaneously hold the read lock, but the write lock is exclusive; that is, only one UE can hold the write lock, and if the write lock is held, no UEs can hold the read lock either.
Figure 5.32. Typical use of read/write locks. These locks are defined in the java.util.concurrent.locks package. Putting the unlock in the finally block ensures that the lock will be unlocked regardless of how the try block is exited (normally or with an exception) and is a standard idiom in Java programs that use locks rather than synchronized blocks.
class X {
   ReadWriteLock rw = new ReentrantReadWriteLock();
   // ...

   /* operation A is a writer */
   public void A() throws InterruptedException {
      rw.writeLock().lock();        //lock the write lock
      try {
         // ... do operation A
      }
      finally {
         rw.writeLock().unlock();   //unlock the write lock
      }
   }

   /* operation B is a reader */
   public void B() throws InterruptedException {
      rw.readLock().lock();         //lock the read lock
      try {
         // ... do operation B
      }
      finally {
         rw.readLock().unlock();    //unlock the read lock
      }
   }
}
Readers/writersprotocolsarediscussedin[And00]andmostoperatingsystemstexts. Reducingthesizeofthecriticalsection.Anotherapproachtoimprovingperformancebeginswith analyzingtheimplementationsoftheoperationsinmoredetail.Itmaybethecasethatonlypartofthe operationinvolvesactionsthatinterferewithotheroperations.Ifso,thesizeofthecriticalsectioncan bereducedtothatsmallerpart.Noticethatthissortofoptimizationisveryeasytogetwrong,soit shouldbeattemptedonlyifitwillgivesignificantperformanceimprovementsoversimpler approaches,andtheprogrammercompletelyunderstandstheinterferencesinquestion. Nestedlocks.Thistechniqueisasortofhybridbetweentwoofthepreviousapproaches, noninterferingoperationsandreducingthesizeofthecriticalsection.SupposewehaveanADTwith twooperations.OperationAdoesalotofworkbothreadingandupdatingvariablexandthenreads andupdatesvariableyinasinglestatement.OperationBreadsandwritesy.Someanalysisshows thatUEsexecutingAneedtoexcludeeachother,UEsexecutingBneedtoexcludeeachother,and becausebothoperationsreadandupdatey,technically,AandBneedtomutuallyexcludeeachother aswell.However,closerinspectionshowsthatthetwooperationsarealmostnoninterfering.Ifit weren'tforthatsinglestatementwhereAreadsandupdatesy,thetwooperationscouldbe implementedinseparatecriticalsectionsthatwouldallowoneAandoneBtoexecuteconcurrently.A solutionistousetwolocks,asshowninFig.5.33.AacquiresandholdslockAfortheentire operation.BacquiresandholdslockBfortheentireoperation.AacquireslockBandholdsitonly forthestatementupdatingy. Whenevernestedlockingisused,theprogrammershouldbeawareofthepotentialfordeadlocksand
double-check the code. (The classic example of deadlock, stated in terms of the previous example, is as follows: A acquires lockA and B acquires lockB. A then tries to acquire lockB and B tries to acquire lockA. Neither operation can now proceed.) Deadlocks can be avoided by assigning a partial order to the locks and ensuring that locks are always acquired in an order that respects the partial order. In the previous example, we would define the order to be lockA < lockB and ensure that lockA is never acquired by a UE already holding lockB.

Application-specific semantic relaxation. Yet another approach is to consider partially replicating shared data (the software caching described in [YWC+96]) and perhaps even allowing the copies to be inconsistent if this can be done without affecting the results of the computation. For example, a distributed-memory solution to the phylogeny problem described earlier might give each UE its own copy of the set of sets already rejected and allow these copies to be out of synch; tasks may do extra work (in rejecting a set that has already been rejected by a task assigned to a different UE), but this extra work will not affect the result of the computation, and it may be more efficient overall than the communication cost of keeping all copies in synch.
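This replicate-and-tolerate-staleness idea can be sketched in Java. Everything here (the class name, the two-worker setup, and the stand-in `isViable` test) is illustrative, not code from the phylogeny application; the point is only that when the rejection test is idempotent, each worker can keep a private, possibly out-of-sync copy of the rejected set, duplicating some work but never changing the answer.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of application-specific semantic relaxation: each worker keeps
// its own copy of the "already rejected" set. The copies are never kept
// in synch; a candidate rejected by one worker may be re-tested by
// another. That extra work is harmless because the test is idempotent.
public class RelaxedReplication {
    // Stand-in for an expensive viability test (illustrative only).
    static boolean isViable(int candidate) { return candidate % 7 != 0; }

    public static Set<Integer> rejectedAfterMerge(int numWorkers, int numCandidates)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        List<Callable<Set<Integer>>> workers = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            workers.add(() -> {
                // Private, possibly stale copy of the rejected set.
                Set<Integer> rejectedLocal = new HashSet<>();
                for (int c = 0; c < numCandidates; c++) {
                    if (!rejectedLocal.contains(c) && !isViable(c)) {
                        rejectedLocal.add(c); // may duplicate another worker's test
                    }
                }
                return rejectedLocal;
            });
        }
        // Merging the lagging copies yields the same set a single
        // globally synchronized copy would have produced.
        Set<Integer> merged = new HashSet<>();
        for (Future<Set<Integer>> f : pool.invokeAll(workers)) merged.addAll(f.get());
        pool.shutdown();
        return merged;
    }
}
```

Both workers independently reject the same candidates, so the merged result is identical to what a single shared set would contain, without any synchronization on the hot path.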
Figure 5.33. Example of nested locking using synchronized blocks with dummy objects lockA and lockB

class Y {
    Object lockA = new Object();
    Object lockB = new Object();

    void A() {
        synchronized(lockA) {
            ....compute....
            synchronized(lockB) {
                ....read and update y....
            }
        }
    }

    void B() throws InterruptedException {
        synchronized(lockB) {
            ....compute....
        }
    }
}
Review other considerations

Memory synchronization. Make sure memory is synchronized as required: Caching and compiler optimizations can result in unexpected behavior with respect to shared variables. For example, a stale value of a variable might be read from a cache or register instead of the newest value written by another task, or the latest value might not have been flushed to memory and thus would not be visible to other tasks. In most cases, memory synchronization is performed implicitly by higher-level synchronization primitives, but it is still necessary to be aware of the issue. Unfortunately, memory synchronization techniques are very platform-specific. In OpenMP, the flush directive can be used to synchronize memory explicitly; it is implicitly invoked by several other directives. In Java, memory is
implicitly synchronized when entering and leaving a synchronized block, and, in Java 2 1.5, when locking and unlocking locks. Also, variables marked volatile are implicitly synchronized with respect to memory. This is discussed in more detail in the Implementation Mechanisms design space.

Task scheduling. Consider whether the explicitly managed data dependencies addressed by this pattern affect task scheduling. A key goal in deciding how to schedule tasks is good load balance; in addition to the considerations described in the Algorithm Structure pattern being used, one should also take into account that tasks might be suspended waiting for access to shared data. It makes sense to try to assign tasks in a way that minimizes such waiting, or to assign multiple tasks to each UE in the hope that there will always be one task per UE that is not waiting for access to shared data.

Examples
Shared queues
Consider the GAFORT program from the SPEC OMP2001 benchmark suite [ADE+01]. GAFORT is a small Fortran program (around 1,500 lines) that implements a genetic algorithm for nonlinear optimization. The calculations are predominantly integer arithmetic, and the program's performance is dominated by the cost of moving large arrays of data through the memory subsystem.

The details of the genetic algorithm are not important for this discussion. We are going to focus on a single loop within GAFORT. Pseudocode for the sequential version of this loop, based on the discussion of GAFORT in [EM], is shown in Fig. 5.34. This loop shuffles the population of chromosomes and consumes on the order of 36 percent of the runtime in a typical GAFORT job [AE03].
Figure 5.34. Pseudocode for the population shuffle loop from the genetic algorithm program GAFORT

Int const NPOP      // number of chromosomes (~40000)
Int const NCHROME   // length of each chromosome

Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int  :: iparent(NCHROME, NPOP)
Array of Real :: fitness(NPOP)
Int :: j, iother

loop [j] over NPOP
   iother = rand(j)   // returns random value greater
                      // than or equal to zero but not
                      // equal to j and less than NPOP

   // Swap Chromosomes
   temp(1:NCHROME) = iparent(1:NCHROME, iother)
   iparent(1:NCHROME, iother) = iparent(1:NCHROME, j)
   iparent(1:NCHROME, j) = temp(1:NCHROME)

   // Swap fitness metrics
   tempScalar = fitness(iother)
   fitness(iother) = fitness(j)
   fitness(j) = tempScalar
end loop [j]
A parallel version of this program will be created by parallelizing the loop, using the Loop Parallelism pattern. In this example, the shared data consists of the iparent and fitness arrays. Within the body of the loop, calculations involving these arrays consist of swapping two elements of iparent and then swapping the corresponding elements of fitness. Examination of these operations shows that two swap operations interfere when at least one of the locations being swapped is the same in both operations.

Thinking about the shared data as an ADT helps us to identify and analyze the actions taken on the shared data. This does not mean, however, that the implementation itself always needs to reflect this structure. In some cases, especially when the data structure is simple and the programming language does not support ADTs well, it can be more effective to forgo the encapsulation implied in an ADT and work with the data directly. This example illustrates this.

As mentioned earlier, the chromosomes being swapped might interfere with each other; thus the loop over j cannot safely execute in parallel. The most straightforward approach is to enforce a "one at a time" protocol using a critical section, as shown in Fig. 5.35. It is also necessary to modify the random number generator so it produces a consistent set of pseudorandom numbers when called in parallel by many threads. The algorithms to accomplish this are well understood [Mas97], but will not be discussed here.
Figure 5.35. Pseudocode for an ineffective approach to parallelizing the population shuffle in the genetic algorithm program GAFORT
#include <omp.h>

Int const NPOP      // number of chromosomes (~40000)
Int const NCHROME   // length of each chromosome

Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int  :: iparent(NCHROME, NPOP)
Array of Real :: fitness(NPOP)
Int :: j, iother

#pragma omp parallel for
loop [j] over NPOP
   iother = par_rand(j)  // returns random value greater
                         // than or equal to zero but not
                         // equal to j and less than NPOP
#pragma omp critical
   {
      // Swap Chromosomes
      temp(1:NCHROME) = iparent(1:NCHROME, iother)
      iparent(1:NCHROME, iother) = iparent(1:NCHROME, j)
      iparent(1:NCHROME, j) = temp(1:NCHROME)

      // Swap fitness metrics
      tempScalar = fitness(iother)
      fitness(iother) = fitness(j)
      fitness(j) = tempScalar
   }
end loop [j]
The approach used in [ADE+01] is to create an OpenMP lock for each chromosome. Pseudocode for this solution is shown in Fig. 5.36. In the resulting program, most of the loop iterations do not actually interfere with each other. The total number of chromosomes, NPOP (40,000 in the SPEC OMP2001 benchmark), is much larger than the number of UEs, so there is only a slight chance that loop iterations will happen to interfere with another loop iteration.

OpenMP locks are described in the OpenMP appendix, Appendix A. The locks themselves use an opaque type, omp_lock_t, defined in the omp.h header file. The lock array is defined and later initialized in a separate parallel loop. Once inside the chromosome-swapping loop, the locks are set for the pair of swapping chromosomes, the swap is carried out, and the locks are unset. Nested locks are being used, so the possibility of deadlock must be considered. The solution here is to order the locks using the value of the indices of the array element associated with the lock. Always acquiring locks in this order will prevent deadlock when a pair of loop iterations happen to be swapping the same two elements at the same time. After the more efficient concurrency-control protocol is implemented, the program runs well in parallel.
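The same one-lock-per-element, lower-index-first discipline can be sketched in Java using an array of ReentrantLock objects. This is an illustrative analogue, not the GAFORT code (which uses OpenMP locks in Fortran); the class name and the integer payload are stand-ins for the chromosome and fitness data.

```java
import java.util.concurrent.locks.ReentrantLock;

// One lock per array element; pairs of elements are swapped under the
// two corresponding locks. Locks are always acquired in index order
// (the partial order lck(lo) < lck(hi)), which prevents deadlock when
// two threads swap overlapping pairs concurrently.
public class OrderedPairLocks {
    private final ReentrantLock[] lock;
    private final int[] data;  // stand-in for the per-chromosome data

    public OrderedPairLocks(int n) {
        lock = new ReentrantLock[n];
        data = new int[n];
        for (int i = 0; i < n; i++) {
            lock[i] = new ReentrantLock();
            data[i] = i;
        }
    }

    // Swap elements i and j, locking the lower index first.
    public void swap(int i, int j) {
        int lo = Math.min(i, j), hi = Math.max(i, j);
        lock[lo].lock();
        try {
            lock[hi].lock();
            try {
                int t = data[i];
                data[i] = data[j];
                data[j] = t;
            } finally { lock[hi].unlock(); }
        } finally { lock[lo].unlock(); }
    }

    public int get(int i) { return data[i]; }
}
```

Because every thread respects the same ordering, a cycle of threads each holding one lock and waiting for the other cannot form, exactly as argued for the lck(j)/lck(iother) ordering above.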
Known uses
combinations of elements of the master set to the master set (where they are used to generate new pairs). Different pairs can be processed concurrently, so one can define a task for each pair and
Figure 5.36. Pseudocode for a parallelized population shuffle using one lock per chromosome

#include <omp.h>

Int const NPOP      // number of chromosomes (~40000)
Int const NCHROME   // length of each chromosome
Array of omp_lock_t :: lck(NPOP)

Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int  :: iparent(NCHROME, NPOP)
Array of Real :: fitness(NPOP)
Int :: j, iother

// Initialize the locks
#pragma omp parallel for
for (j=0; j<NPOP; j++){
   omp_init_lock(&lck(j))
}

#pragma omp parallel for
for (j=0; j<NPOP; j++){
   iother = par_rand(j)  // returns random value >= 0, != j,
                         // < NPOP
   if (j < iother) {
      omp_set_lock(&lck(j));  omp_set_lock(&lck(iother))
   }
   else {
      omp_set_lock(&lck(iother));  omp_set_lock(&lck(j))
   }

   // Swap Chromosomes
   temp(1:NCHROME) = iparent(1:NCHROME, iother);
   iparent(1:NCHROME, iother) = iparent(1:NCHROME, j);
   iparent(1:NCHROME, j) = temp(1:NCHROME);

   // Swap fitness metrics
   tempScalar = fitness(iother)
   fitness(iother) = fitness(j)
   fitness(j) = tempScalar

   if (j < iother) {
      omp_unset_lock(&lck(iother));  omp_unset_lock(&lck(j))
   }
   else {
      omp_unset_lock(&lck(j));  omp_unset_lock(&lck(iother))
   }
} // end loop [j]
* Simple concurrency-control protocols provide greater clarity of abstraction and make it easier for the programmer to verify that the shared queue has been correctly implemented.
* Concurrency-control protocols that encompass too much of the shared queue in a single synchronization construct increase the chances UEs will remain blocked waiting to access the queue and will limit available concurrency.
* A concurrency-control protocol finely tuned to the queue and how it will be used increases the available concurrency, but at the cost of much more complicated, and more error-prone, synchronization constructs.
* Maintaining a single queue for systems with complicated memory hierarchies (as found on NUMA machines and clusters) can cause excess communication and increase parallel overhead. Solutions may in some cases need to break with the single-queue abstraction and use multiple or distributed queues.
java.util.concurrent package. Here we develop implementations from scratch to illustrate the concepts.

Implementing shared queues can be tricky. Appropriate synchronization must be utilized to avoid race conditions, and performance considerations, especially for problems where large numbers of UEs access the queue, can require sophisticated synchronization. In some cases, a noncentralized queue might be needed to eliminate performance bottlenecks.

However, if it is necessary to implement a shared queue, it can be done as an instance of the Shared Data pattern: First, we design an ADT for the queue by defining the values the queue can hold and the set of operations on the queue. Next, we consider the concurrency-control protocols, starting with the simplest "one at a time execution" solution and then applying a series of refinements. To make this discussion more concrete, we will consider the queue in terms of a specific problem: a queue to hold tasks in a master/worker algorithm. The solutions presented here, however, are general and can be easily extended to cover other applications of a shared queue.
The abstract data type (ADT)
An ADT is a set of values and the operations defined on that set of values. In the case of a queue, the values are ordered lists of zero or more objects of some type (for example, integers or task IDs). The operations on the queue are put (or enqueue) and take (or dequeue). In some situations, there might be other operations, but for the sake of this discussion, these two are sufficient.

We must also decide what happens when a take is attempted on an empty queue. What should be done depends on how termination will be handled by the master/worker algorithm. Suppose, for example, that all the tasks will be created at startup time by the master. In this case, an empty task queue will indicate that the UE should terminate, and we will want the take operation on an empty queue to return immediately with an indication that the queue is empty; that is, we want a nonblocking queue. Another possible situation is that tasks can be created dynamically and that UEs will terminate when they receive a special poison pill task. In this case, appropriate behavior might be for the take operation on an empty queue to wait until the queue is nonempty; that is, we want a block-on-empty queue.

Queue with "one at a time" execution

Nonblocking queue. Because the queue will be accessed concurrently, we must define a concurrency-control protocol to ensure that interference by multiple UEs will not occur. As recommended in the Shared Data pattern, the simplest solution is to make all operations on the ADT exclude each other. Because none of the operations on the queue can block, a straightforward implementation of mutual exclusion as described in the Implementation Mechanisms design space suffices. The Java implementation shown in Fig. 5.37 uses a linked list to hold the tasks in the queue. (We develop our own list class rather than using an unsynchronized library class such as java.util.LinkedList or a class from the java.util.concurrent package to illustrate how to add appropriate synchronization.) head refers to an always-present dummy node.[6]
[6] The code for take makes the old head node into a dummy node rather than simply manipulating next pointers to allow us to later optimize the code so that put and take can execute concurrently.

The first task in the queue (if any) is held in the node referred to by head.next. The isEmpty method is private, and only invoked inside a synchronized method. Thus, it need not be synchronized. (If it were public, it would need to be synchronized as well.) Of course, numerous ways of implementing the structure that holds the tasks are possible.

Block-on-empty queue. The second version of the shared queue is shown in Fig. 5.38. In this version of the queue, the take operation is changed so that a thread trying to take from an empty queue will wait for a task rather than returning immediately. The waiting thread needs to release its lock and reacquire it before trying again. This is done in Java using the wait and notify methods. These are described in the Java appendix, Appendix C. The Java appendix also shows the queue implemented using locks from the java.util.concurrent.locks package introduced in Java 2 1.5 instead of wait and notify. Similar primitives are available with POSIX threads (Pthreads) [But97, IEE], and techniques for implementing this functionality with semaphores and other basic primitives can be found in [And00].
Figure 5.37. Queue that ensures that at most one thread can access the data structure at one time. If the queue is empty, null is immediately returned.
public class SharedQueue1 {
    class Node  // inner class defines list nodes
    {
        Object task;
        Node next;
        Node(Object task) {this.task = task; next = null;}
    }

    private Node head = new Node(null);  // dummy node
    private Node last = head;

    public synchronized void put(Object task) {
        assert task != null: "Cannot insert null task";
        Node p = new Node(task);
        last.next = p;
        last = p;
    }

    public synchronized Object take() {
        // returns first task in queue or null if queue is empty
        Object task = null;
        if (!isEmpty()) {
            Node first = head.next;
            task = first.task;
            first.task = null;
            head = first;
        }
        return task;
    }

    private boolean isEmpty(){return head.next == null;}
}
In general, to change a method that returns immediately if a condition is false to one that waits until the condition is true, two changes need to be made: First, we replace a statement of the form
if (condition){do_something;}
Figure 5.38. Queue that ensures at most one thread can access the data structure at one time. Unlike the first shared queue example, if the queue is empty, the thread waits. When used in a master/worker algorithm, a poison pill would be required to signal termination to a thread.
public class SharedQueue2 {
    class Node
    {
        Object task;
        Node next;
        Node(Object task) {this.task = task; next = null;}
    }

    private Node head = new Node(null);
    private Node last = head;

    public synchronized void put(Object task) {
        assert task != null: "Cannot insert null task";
        Node p = new Node(task);
        last.next = p;
        last = p;
        notifyAll();
    }

    public synchronized Object take() {
        // returns first task in queue, waits if queue is empty
        Object task = null;
        while (isEmpty())
            {try{wait();}catch(InterruptedException ignore){}}
        {
            Node first = head.next;
            task = first.task;
            first.task = null;
            head = first;
        }
        return task;
    }

    private boolean isEmpty(){return head.next == null;}
}
with a loop[7]
[7]
The fact that wait can throw an InterruptedException must be dealt with; it is ignored here for clarity, but handled properly in the code examples.
while (!condition) {wait();} do_something;
with
while (isEmpty()) {try{wait();}catch(InterruptedException ignore){}}{....}
and, second, we add to any operation that can make the condition true a notification such as
if (w>0) notifyAll();
If the performance of the shared queue is inadequate, we must look for more efficient concurrency-control protocols.
public class SharedQueue3 {
    class Node
    {
        Object task;
        Node next;
        Node(Object task) {this.task = task; next = null;}
    }

    private Node head = new Node(null);
    private Node last = head;
    private Object putLock = new Object();
    private Object takeLock = new Object();

    public void put(Object task) {
        synchronized(putLock) {
            assert task != null: "Cannot insert null task";
            Node p = new Node(task);
            last.next = p;
            last = p;
        }
    }

    public Object take() {
        Object task = null;
        synchronized(takeLock) {
            if (!isEmpty()) {
                Node first = head.next;
                task = first.task;
                first.task = null;
                head = first;
            }
        }
        return task;
    }

    private boolean isEmpty(){return head.next == null;}
}
public class SharedQueue4 {
    class Node
    {
        Object task;
        Node next;
        Node(Object task) {this.task = task; next = null;}
    }

    Node head = new Node(null);
    Node last = head;
    int w;  // number of threads waiting in take
    Object putLock = new Object();
    Object takeLock = new Object();

    public void put(Object task) {
        synchronized(putLock) {
            assert task != null: "Cannot insert null task";
            Node p = new Node(task);
            last.next = p;
            last = p;
            if (w>0){putLock.notify();}
        }
    }

    public Object take() {
        Object task = null;
        synchronized(takeLock) {
            // returns first task in queue, waits if queue is empty
            while (isEmpty()) {
                try{synchronized(putLock){w++; putLock.wait(); w--;}
                } catch(InterruptedException error){assert false;}
            }
            {
                Node first = head.next;
                task = first.task;
                first.task = null;
                head = first;
            }
        }
        return task;
    }

    private boolean isEmpty(){return head.next == null;}
}
Another issue to note is that this solution has nested synchronized blocks in both take and put. Nested synchronized blocks should always be examined for potential deadlocks. In this case, there will be no deadlock because put only acquires one lock, putLock. More generally, we would define a partial order over all the locks and ensure that the locks are always acquired in an order consistent with our partial order. For example, here, we could define takeLock < putLock and make sure that the synchronized blocks are entered in a way that respects that partial order.

As mentioned earlier, several Java-based implementations of queues are included in Java 2 1.5 in the java.util.concurrent package, some based on the simple strategies discussed here and some based on more complex strategies that provide additional flexibility and performance.
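As a point of comparison, the two behaviors developed above correspond directly to real java.util.concurrent classes: poll() on a ConcurrentLinkedQueue is nonblocking and returns null when the queue is empty (like SharedQueue1's take), while take() on a LinkedBlockingQueue blocks until an element is available (like SharedQueue2's take). A minimal sketch:

```java
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Library analogues of the hand-built queues in this section.
public class LibraryQueues {
    // Nonblocking behavior: poll() on an empty queue returns null
    // immediately, mirroring SharedQueue1.take().
    public static String pollEmpty() {
        Queue<String> q = new ConcurrentLinkedQueue<>();
        return q.poll();
    }

    // Blocking behavior: take() waits until an element is available,
    // mirroring SharedQueue2.take(). Here an element is inserted first,
    // so take() returns at once.
    public static String putThenTake(String task) throws InterruptedException {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        q.put(task);
        return q.take();
    }
}
```

In production code these library classes are usually preferable to hand-rolled queues; the implementations in this section exist to show how such synchronization is built.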
Distributed shared queues
Task is an abstract class. Applications extend it and override its run method to indicate the functionality of a task in the computation. Methods offered by the class include fork and join.

TaskRunner extends Thread and provides the functionality of the threads in the thread pool. Each instance contains a shared task queue. The task-stealing code is in this class.
Figure 5.41. Nonblocking shared queue with takeLast operation
public class SharedQueue5 {
    class Node
    {
        Object task;
        Node next;
        Node prev;
        Node(Object task, Node prev)
        {this.task = task; next = null; this.prev = prev;}
    }

    private Node head = new Node(null, null);
    private Node last = head;

    public synchronized void put(Object task) {
        assert task != null: "Cannot insert null task";
        Node p = new Node(task, last);
        last.next = p;
        last = p;
    }

    public synchronized Object take() {
        // returns first task in queue or null if queue is empty
        Object task = null;
        if (!isEmpty()) {
            Node first = head.next;
            task = first.task;
            first.task = null;
            head = first;
        }
        return task;
    }

    public synchronized Object takeLast() {
        // returns last task in queue or null if queue is empty
        Object task = null;
        if (!isEmpty()) {
            task = last.task;
            last = last.prev;
            last.next = null;
        }
        return task;
    }

    private boolean isEmpty(){return head.next == null;}
}
TaskRunnerGroup manages the TaskRunners. It contains methods to initialize and shut down the thread pool.
Figure 5.42. The abstract Task class

public abstract class Task implements Runnable {
    // done indicates whether the task is finished
    private volatile boolean done;
    public final void setDone(){done = true;}
    public boolean isDone(){return done;}

    // returns the currently executing TaskRunner thread
    public static TaskRunner getTaskRunner()
    { return (TaskRunner)Thread.currentThread(); }

    // push this task on the local queue of current thread
    public void fork()
    { getTaskRunner().put(this); }

    // wait until this task is done
    public void join()
    { getTaskRunner().taskJoin(this); }

    // execute the run method of this task
    public void invoke()
    { if (!isDone()){run(); setDone();} }
}
We will now discuss these classes in more detail. Task is shown in Fig. 5.42. The only state associated with the abstract class is done, which is marked volatile to ensure that any thread that tries to access it will obtain a fresh value.

The TaskRunner class is shown in Fig. 5.43, Fig. 5.44, and Fig. 5.45. The thread, as specified in the run method, loops until the poison task is encountered. First it tries to obtain a task from the back of its local queue. If the local queue is empty, it attempts to steal a task from the front of a queue belonging to another thread.

The code for the TaskRunnerGroup class is shown in Fig. 5.46. The constructor for TaskRunnerGroup initializes the thread pool, given the number of threads as a parameter. Typically, this value would be chosen to match the number of processors in the system. The executeAndWait method starts a task by placing it in the task queue of thread 0. One use for this method is to get a computation started. Something like this is needed because we can't
Figure 5.43. Class defining behavior of threads in the thread pool (continued in Fig. 5.44)

import java.util.*;

class TaskRunner extends Thread {
    private final TaskRunnerGroup g;          // managing group
    private final Random chooseToStealFrom;   // random number generator
    private final Task poison;                // poison task
    protected volatile boolean active;        // state of thread
    final int id;                             // index of task in the TaskRunnerGroup
    private final SharedQueue5 q;             // nonblocking shared queue

    // operations relayed to queue
    public void put(Task t){q.put(t);}
    public Task take(){return (Task)q.take();}
    public Task takeLast(){return (Task)q.takeLast();}

    // constructor
    TaskRunner(TaskRunnerGroup g, int id, Task poison) {
        this.g = g;
        this.id = id;
        this.poison = poison;
        chooseToStealFrom = new Random(System.identityHashCode(this));
        setDaemon(true);
        q = new SharedQueue5();
    }

    protected final TaskRunnerGroup getTaskRunnerGroup(){return g;}
    protected final int getID(){return id;}

    /* continued in next figure */
All of this may be clearer from the usage of fork, join, and executeAndWait in the Fibonacci example in the Examples section.
Figure 5.44. Class defining behavior of threads in the thread pool (continued from Fig. 5.43 and continued in Fig. 5.45)
/* continued from previous figure */

    // Attempts to steal a task from another thread. First chooses a
    // random victim, then continues with other threads until either
    // a task has been found or all have been checked. If a task
    // is found, it is invoked. The parameter waitingFor is a task
    // on which this thread is waiting for a join. If steal is not
    // called as part of a join, use waitingFor = null.
    void steal(final Task waitingFor) {
        Task task = null;
        TaskRunner[] runners = g.getRunners();
        int victim = chooseToStealFrom.nextInt(runners.length);
        for (int i = 0; i != runners.length; ++i) {
            TaskRunner tr = runners[victim];
            if (waitingFor != null && waitingFor.isDone()){break;}
            else {
                if (tr != null && tr != this) task = (Task)tr.q.take();
                if (task != null) {break;}
                yield();
                victim = (victim + 1)%runners.length;
            }
        }
        // have either found a task or have checked all other queues;
        // if we have a task, invoke it
        if (task != null && !task.isDone()) {
            task.invoke();
        }
    }

/* continued in next figure */
Examples
Computing Fibonacci numbers
We show in Fig. 5.47 and Fig. 5.48 code that uses our distributed queue package.[8] Recall that the Fibonacci numbers are defined by

Equation 5.7. fib(0) = 0

Equation 5.8. fib(1) = 1

Equation 5.9. fib(n) = fib(n-1) + fib(n-2), for n > 1
Figure 5.45. Class defining behavior of threads in the thread pool (continued from Fig. 5.44)

/* continued from previous figure */

    // Main loop of thread. First attempts to find a task on local
    // queue and execute it. If not found, then tries to steal a task
    // from another thread. Performance may be improved by modifying
    // this method to back off using sleep or lowered priorities if the
    // thread repeatedly iterates without finding a task. The run
    // method, and thus the thread, terminates when it retrieves the
    // poison task from the task queue.
    public void run() {
        Task task = null;
        try {
            while (!poison.equals(task)) {
                task = (Task)q.takeLast();
                if (task != null) {
                    if (!task.isDone()){task.invoke();}
                }
                else { steal(null); }
            }
        }
        finally { active = false; }
    }

    // Looks for another task to run and continues when Task w is done.
    protected final void taskJoin(final Task w) {
        while (!w.isDone()) {
            Task task = (Task)q.takeLast();
            if (task != null) {
                if (!task.isDone()){ task.invoke(); }
            }
            else { steal(w); }
        }
    }
}
The run method defines the behavior of each task. Recursive parallel decomposition is done by creating a new Fib object for each subtask, invoking the fork method on each subtask to start their computation, calling the join method for each subtask to wait for the subtasks to complete, and then computing the sum of their results.

The main method drives the computation. It first reads procs (the number of threads to create), num (the value for which the Fibonacci number should be computed), and optionally the sequentialThreshold. The value of this last, optional parameter (the default is 0) is used to decide when the problem is too small to bother with a parallel decomposition and should therefore use a sequential algorithm. After these parameters have been obtained, the main method creates a TaskRunnerGroup with the indicated number of threads, and then creates a Fib object, initialized with num. The computation is initiated by passing the Fib object to the TaskRunnerGroup's executeAndWait method. When this returns, the computation is finished. The thread pool is shut down with the TaskRunnerGroup's cancel method. Finally, the result is retrieved from the Fib object and displayed.
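The fork/join structure described here closely parallels what the standard ForkJoinPool framework (added to java.util.concurrent in Java 7) provides out of the box, including the work-stealing deques that the TaskRunner classes build by hand. As a point of comparison, a minimal RecursiveTask version of the same recursive decomposition might look like this:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// The same recursive decomposition written against the standard
// fork/join framework: fork() pushes a subtask onto the current
// worker's deque, join() waits for (or steals work toward) its result.
public class FibFJ extends RecursiveTask<Integer> {
    final int n;
    FibFJ(int n) { this.n = n; }

    @Override
    protected Integer compute() {
        if (n <= 1) return n;            // base cases: fib(0)=0, fib(1)=1
        FibFJ f1 = new FibFJ(n - 1);
        FibFJ f2 = new FibFJ(n - 2);
        f1.fork();                       // run f1 asynchronously ...
        int r2 = f2.compute();           // ... while computing f2 directly
        return f1.join() + r2;           // combine results
    }

    public static int fib(int n) {
        return new ForkJoinPool().invoke(new FibFJ(n));
    }
}
```

A real program would also apply a sequential-threshold cutoff, as the Fib example does, since spawning a task per recursive call is far more expensive than the additions being performed.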
Figure 5.46. The TaskRunnerGroup class. This class initializes and manages the threads in the thread pool.
class TaskRunnerGroup {
    protected final TaskRunner[] threads;
    protected final int groupSize;
    protected final Task poison;

    public TaskRunnerGroup(int groupSize) {
        this.groupSize = groupSize;
        threads = new TaskRunner[groupSize];
        poison = new Task(){public void run(){assert false;}};
        poison.setDone();
        for (int i = 0; i != groupSize; i++)
            {threads[i] = new TaskRunner(this,i,poison);}
        for (int i = 0; i != groupSize; i++){ threads[i].start(); }
    }

    // start executing task t and wait for its completion.
    // The wrapper task is used in order to start t from within
    // a Task (thus allowing fork and join to be used)
    public void executeAndWait(final Task t) {
        final TaskRunnerGroup thisGroup = this;
        Task wrapper = new Task() {
            public void run() {
                t.fork();
                t.join();
                synchronized(thisGroup){ thisGroup.notify(); }
            }
        };
        // add wrapped task to queue of thread[0]
        threads[0].put(wrapper);
        // wait for notification that t has finished.
        synchronized(thisGroup) {
            try{thisGroup.wait();}
            catch(InterruptedException e){return;}
        }
    }

    // cause all threads to terminate. The programmer is responsible
    // for ensuring that the computation is complete.
    public void cancel() {
        for (int i = 0; i != groupSize; i++) {
            threads[i].put(poison);
        }
    }

    public TaskRunner[] getRunners(){return threads;}
}
Figure 5.47. Program to compute Fibonacci numbers (continued in Fig. 5.48)

public class Fib extends Task {
    volatile int number;  // number holds value to compute initially,
                          // after computation is replaced by answer

    Fib(int n) { number = n; }  // task constructor, initializes number

    // behavior of task
    public void run() {
        int n = number;
        // Handle base cases:
        if (n <= 1) { /* Do nothing: fib(0) = 0; fib(1) = 1 */ }
        // Use sequential code for small problems:
        else if (n <= sequentialThreshold) { number = seqFib(n); }
        // Otherwise use recursive parallel decomposition:
        else {
            // Construct subtasks:
            Fib f1 = new Fib(n - 1);
            Fib f2 = new Fib(n - 2);
            // Run them in parallel:
            f1.fork(); f2.fork();
            // Await completion:
            f1.join(); f2.join();
            // Combine results:
            number = f1.number + f2.number;
            // (We know numbers are ready, so directly access them.)
        }
    }

    // Sequential version for arguments less than threshold
    static int seqFib(int n) {
        if (n <= 1) return n;
        else return seqFib(n-1) + seqFib(n-2);
    }

    // method to retrieve answer after checking to make sure
    // computation has finished; note that done and isDone are
    // inherited from the Task class. done is set by the executing
    // (TaskRunner) thread when the run method is finished.
    int getAnswer() {
        if (!isDone()) throw new Error("Not yet computed");
        return number;
    }

    /* continued in next figure */
Note that when the tasks in a task queue map onto a consecutive sequence of integers, a monotonic shared counter, which would be much more efficient, can be used in place of a queue.
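A sketch of that counter-as-queue idea (the class name and the -1 convention are illustrative): an AtomicInteger hands out the task indices 0..N-1 exactly once across all threads, with -1 playing the role of the null returned by a nonblocking take on an empty queue.

```java
import java.util.concurrent.atomic.AtomicInteger;

// When tasks are just the integers 0..numTasks-1, an atomic counter
// replaces the shared queue: getAndIncrement() is a single hardware-
// supported atomic operation, far cheaper than a locked linked list.
public class CounterQueue {
    private final AtomicInteger next = new AtomicInteger(0);
    private final int numTasks;

    public CounterQueue(int numTasks) { this.numTasks = numTasks; }

    // Returns the next task index, or -1 when all tasks have been
    // handed out (the analogue of take() returning null when empty).
    public int take() {
        int t = next.getAndIncrement();
        return (t < numTasks) ? t : -1;
    }
}
```

Each index is returned to exactly one caller, no matter how many threads call take() concurrently, which is precisely the property the locked queue was providing.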
Figure 5.48. Program to compute Fibonacci numbers (continued from Fig. 5.47)
/* continued from previous figure */

    // Performance-tuning constant; sequential algorithm is used to
    // find Fibonacci numbers for values <= this threshold
    static int sequentialThreshold = 0;

    public static void main(String[] args) {
        int procs;  // number of threads
        int num;    // Fibonacci number to compute
        try {  // read parameters from command line
            procs = Integer.parseInt(args[0]);
            num = Integer.parseInt(args[1]);
            if (args.length > 2)
                sequentialThreshold = Integer.parseInt(args[2]);
        }
        catch (Exception e) {
            System.out.println("Usage: java Fib <threads> <number> "+
                               "[<sequentialThreshold>]");
            return;
        }
        // initialize thread pool
        TaskRunnerGroup g = new TaskRunnerGroup(procs);
        // create first task
        Fib f = new Fib(num);
        // execute it
        g.executeAndWait(f);
        // computation has finished, shutdown thread pool
        g.cancel();
        // show result
        long result;
        {result = f.getAnswer();}
        System.out.println("Fib: Size: " + num + " Answer: " + result);
    }
}
usually a system in which access times vary substantially depending on which PE is accessing which array element.

The challenge is to organize the arrays so that the elements needed by each UE are nearby at the right time in the computation. In other words, the arrays must be distributed about the computer so that the array distribution matches the flow of the computation.

This pattern is important for any parallel algorithm involving large arrays. It is particularly important when the algorithm uses the Geometric Decomposition pattern for its algorithm structure and the SPMD pattern for its program structure. Although this pattern is in some respects specific to distributed-memory environments in which global data structures must be somehow distributed among the ensemble of PEs, some of the ideas of this pattern apply if the single address space is implemented on a NUMA platform, in which all PEs have access to all memory locations, but access time varies. For such platforms, it is not necessary to explicitly decompose and distribute arrays, but it is still important to manage the memory hierarchy so that array elements stay close[9] to the PEs that need them. Because of this, on NUMA machines, MPI programs can sometimes outperform similar algorithms implemented using a native multithreaded API. Further, the ideas of this pattern can be used with a multithreaded API to keep memory pages close to the processors that will work with them. For example, if the target system uses a first-touch page management scheme, efficiency is improved if every array element is initialized by the PE that will be working with it. This strategy, however, breaks down if arrays need to be remapped in the course of the computation.
* Load balance. Because a parallel computation is not finished until all UEs complete their work, the computational load among the UEs must be distributed so each UE takes nearly the same time to compute.
* Effective memory management. Modern microprocessors are much faster than the computer's memory. To address this problem, high-performance computer systems include complex memory hierarchies. Good performance depends on making good use of this memory hierarchy, and this is done by ensuring that the memory references implied by a series of calculations are close to the processor making the calculation (that is, data reuse from the caches is high and needed pages stay accessible to the processor).
* Clarity of abstraction. Programs involving distributed arrays are easier to write, debug, and maintain if it is clear how the arrays are divided among UEs and mapped to local arrays.
Solution
Overview
The solution is simple to state at a high level; it is the details that make it complicated. The basic
approach is to partition the global array into blocks and then map those blocks onto the UEs. This mapping onto UEs should be done so that, as the computation unfolds, each UE has an equal amount of work to carry out (that is, the load must be well balanced). Unless all UEs share a single address space, each UE's blocks will be stored in an array that is local to a single UE. Thus, the code will access elements of the distributed array using indices into a local array. The mathematical description of the problem and solution, however, is based on indices into the global array. Thus, it must be clear how to move back and forth between two views of the array, one in which each element is referenced by global indices and one in which it is referenced by a combination of local indices and UE identifier. Making these translations clear within the text of the program is the challenge of using this pattern effectively.
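The translation between the two views can be made concrete for the simple case of a 1D block distribution. This sketch is illustrative (the class and method names are not from the pattern): with block size MB = ⌈M/P⌉, global index j lives on UE ⌊j/MB⌋ at local offset j mod MB, and the inverse map recovers the global index from (UE, local offset).

```java
// Global <-> (UE, local) index translation for a 1D block distribution
// of M elements over P UEs. All indices are zero-based.
public class BlockMap {
    final int M, P, MB;  // global size, number of UEs, block size

    BlockMap(int M, int P) {
        this.M = M;
        this.P = P;
        this.MB = (M + P - 1) / P;  // ceil(M/P) in integer arithmetic
    }

    int owner(int j)      { return j / MB; }   // which UE holds global index j
    int localIndex(int j) { return j % MB; }   // offset within that UE's block

    // Inverse translation: global index from UE identifier + local offset.
    int globalIndex(int ue, int local) { return ue * MB + local; }
}
```

Keeping these three helpers in one place, rather than scattering the arithmetic through the program text, is one way to meet the "make the translations clear" requirement above.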
Array distributions
Over the years, a small number of array distributions have become standard.
* One-dimensional (1D) block. The array is decomposed in one dimension only and distributed one block per UE. For a 2D matrix, for example, this corresponds to assigning a single block of contiguous rows or columns to each UE. This distribution is sometimes called a column block or row block distribution depending on which single dimension is distributed among the UEs. The UEs are conceptually organized as a 1D array.

* Two-dimensional (2D) block. As in the 1D block case, one block is assigned to each UE, but now the block is a rectangular subblock of the original global array. This mapping views the collection of UEs as a 2D array.

* Block-cyclic. The array is decomposed into blocks (using a 1D or 2D partition) such that there are more blocks than UEs. These blocks are then assigned round-robin to UEs, analogous to the way a deck of cards is dealt out. The UEs may be viewed as either a 1D or 2D array.
Next, we explore these distributions in more detail. For illustration, we use a square matrix A of order 8, as shown in Fig. 5.49.[10]
[10]
In this and the other figures in this pattern, we will use the following notational conventions: A matrix element will be represented as a lowercase letter with subscripts representing indices; for example, a1,2 is the element in row 1 and column 2 of matrix A. A submatrix will be represented as an uppercase letter with subscripts representing indices; for example, A0,0 is a submatrix containing the top left corner of A. When we talk about assigning parts of A to UEs, we will reference different UEs using UE and an index or indices in parentheses; for example, if we are regarding UEs as forming a 1D array, UE(0) is the conceptually leftmost UE, while if we are regarding UEs as forming a 2D array, UE(0,0) is the conceptually top left UE. Indices are all assumed to be zero-based (that is, the smallest index is 0).
We will use the notation "\" for integer division and "/" for normal division. Thus a\b = ⌊a/b⌋. Also, ⌊x⌋ (floor) is the largest integer at most x, and ⌈x⌉ (ceiling) is the smallest integer at least x. For example, ⌊4/3⌋ = 1, and ⌈4/3⌉ = 2.
Mapping to UEs. More generally, we could have an N x M matrix where the number of UEs, P, need not divide the number of columns evenly. In this case, MB is the maximum number of columns mapped to a UE, and all UEs except UE(P-1) contain MB columns. Then, MB = ⌈M/P⌉, and elements of column j are mapped to UE(⌊j/MB⌋).[12] (This reduces to the formula given earlier for the example, because in the special case where P evenly divides M, ⌈M/P⌉ = M/P and ⌊j/MB⌋ = j/MB.) Analogous formulas apply for row distributions.
[12]
Notice that this is not the only possible way to distribute columns among UEs when the number of UEs does not evenly divide the number of columns. Another approach, more complex to define but producing a more balanced distribution in some cases, is to first define the minimum number of columns per UE as ⌊M/P⌋, and then increase this number by one for the first (M mod P) UEs. For example, for M = 10 and P = 4, UE(0) and UE(1) would have three columns each and UE(2) and UE(3) would have two columns each.

Mapping to local indices. In addition to mapping the columns to UEs, we also need to map the global indices to local indices. In this case, matrix element (i,j) maps to local element (i, j mod MB). Given local indices (x,y) and UE(w), we can recover the global indices (x, w·MB + y). Again, analogous formulas apply for row distributions.

2D block. Fig. 5.51 shows a 2D block distribution of A onto a two-by-two array of UEs. Here, A is being decomposed along two dimensions, so for each subblock, the number of columns is the matrix order divided by the number of columns of UEs, and the number of rows is the matrix order divided by the number of rows of UEs. Matrix element (i,j) is assigned to UE(i\4, j\4).
Mapping to UEs. More generally, we map an N x M matrix to a PR x PC matrix of UEs. The maximum size of a subblock is NB x MB, where NB = ⌈N/PR⌉ and MB = ⌈M/PC⌉. Then, element (i,j) in the global matrix is stored in UE(⌊i/NB⌋, ⌊j/MB⌋).

Mapping to local indices. Global indices (i,j) map to local indices (i mod NB, j mod MB). Given local indices (x,y) on UE(z,w), the corresponding global indices are (z·NB + x, w·MB + y).

Block-cyclic. The main idea behind the block-cyclic distribution is to create more blocks than UEs and allocate them in a cyclic manner, similar to dealing out a deck of cards. Fig. 5.52 shows a 1D block-cyclic distribution of A onto a linear array of four UEs, illustrating how columns are assigned to UEs in a round-robin fashion. Here, matrix element (i,j) is assigned to UE(j mod 4) (where 4 is the number of UEs).
Figure 5.54. 2D block-cyclic distribution of A onto four UEs, part 2: Assigning submatrices to UEs
Mapping to UEs. In the general case, we have an N x M matrix to be mapped onto a PR x PC array of UEs. We choose block size NB x MB. Element (i,j) in the global matrix will be mapped to UE(z,w), where z = ⌊i/NB⌋ mod PR and w = ⌊j/MB⌋ mod PC.

Mapping to local indices. Because multiple blocks are mapped to the same UE, we can view the local indexing blockwise or elementwise.

In the blockwise view, each element on a UE is indexed locally by block indices (l,m) and indices (x,y) into the block. To restate this: In this scheme, the global matrix element (i,j) will be found on the UE within the local (l,m) block at the position (x,y), where (l,m) = (⌊i/(PR·NB)⌋, ⌊j/(PC·MB)⌋) and (x,y) = (i mod NB, j mod MB). Fig. 5.55 illustrates this for UE(0,0).
Figure 5.55. 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to UE(0,0). LAl,m is the block with block indices (l, m). Each element is labeled both with its original global indices (ai,j) and its indices within block LAl,m (lx,y).
For example, consider global matrix element a5,1. Because PR = PC = NB = MB = 2, this element will map to UE(0,0). There are four two-by-two blocks on this UE. From the figure, we see that this element appears in the block on the bottom left, or block LA1,0, and indeed, from the formulas, we obtain (l,m) = (⌊5/(2·2)⌋, ⌊1/(2·2)⌋) = (1,0). Finally, we need the local indices within the block. In this case, the indices within the block are (x,y) = (5 mod 2, 1 mod 2) = (1,1).

In the elementwise view (which requires that all the blocks for each UE form a contiguous matrix), global indices (i,j) are mapped elementwise to local indices (l·NB + x, m·MB + y), where l and m are defined as before. Fig. 5.56 illustrates this for UE(0,0).
Figure 5.56. 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to UE(0,0). Each element is labeled both with its original global indices ai,j and its local indices (x', y'). Local indices are with respect to the contiguous matrix used to store all blocks assigned to this UE.
To select which distribution to use for a problem, consider how the computational load on the UEs changes as the computation proceeds. For example, in many single-channel signal processing problems, the same set of operations is performed on each column of an array. The work does not vary as the computation proceeds, so a column block decomposition will produce both clarity of code and good load balance. If instead the amount of work varies by column, with higher-numbered columns requiring more work, a column block decomposition would lead to poor load balance, with the UEs processing lower-numbered columns finishing ahead of the UEs processing higher-numbered columns. In this case, a cyclic distribution would produce better load balance, because each UE is assigned a mix of low-numbered and high-numbered columns.

This same approach applies to higher dimensions as well. ScaLAPACK [Sca, BCC97+], the leading package of dense linear algebra software for distributed-memory computers, requires a 2D block-cyclic distribution. To see why this choice was made, consider Gaussian elimination, one of the more commonly used of the ScaLAPACK routines. In this algorithm, also known as LU decomposition, a dense square matrix is transformed into a pair of triangular matrices, an upper matrix U and a lower matrix L. At a high level, the algorithm proceeds from the upper left corner and works its way down
The examples in the preceding section illustrate how each element of the original (global) array is mapped to a UE and how each element in the global array, after distribution, is identified by both a set of global indices and a combination of UE identifier and local information. The original problem is typically stated in terms of global indices, but computation within each UE must be in terms of local indices. Applying this pattern effectively requires that the relationship between global indices and the combination of UE and local indices be as transparent as possible. In a quest for program efficiency, it is altogether too easy to bury these index mappings in the code in a way that makes the program painfully difficult to debug. A better approach is to use macros and inline functions to capture the index mappings; a human reader of the program then only needs to master the macro or function once. Such macros or functions also contribute to clarity of abstraction. The Examples section illustrates this strategy.
Aligning computation with locality
One of the cardinal rules of performance-oriented computing is to maximize reuse of data close to a UE. That is, the loops that update local data should be organized in a way that gets as much use as possible out of each memory reference. This objective can also influence the choice of array distribution.

For example, in linear algebra computations, it is possible to organize computations on a matrix into smaller computations over submatrices. If these submatrices fit into cache, dramatic performance gains can result. Similar effects apply to other levels of the memory hierarchy: minimizing misses in the translation lookaside buffer (TLB), page faults, and so on. A detailed discussion of this topic goes well beyond the scope of this book. An introduction can be found in [PH98].

Examples
Transposing a matrix stored as column blocks
/*******************************************************************
NAME:    trans_isend_ircv

PURPOSE: This function uses MPI Isend and Irecv to
         transpose a column-block distributed matrix.
*******************************************************************/
#include "mpi.h"
#include <stdio.h>

/*******************************************************************
** This function transposes a local block of a matrix. We don't
** display the text of this function as it is not relevant to the
** point of this example.
*******************************************************************/
void transpose(
     double* A, int Acols,        /* input matrix */
     double* B, int Bcols,        /* transposed mat */
     int sub_rows, int sub_cols); /* size of slice to transpose */

/*******************************************************************
** Define macros to compute process source and destinations and
** local indices
*******************************************************************/
#define TO(ID, PHASE, NPROC)   ((ID + PHASE        ) % NPROC)
#define FROM(ID, PHASE, NPROC) ((ID + NPROC - PHASE) % NPROC)
#define BLOCK(BUFF, ID)        (BUFF + (ID * block_size))

/* continued in next figure */
/* continued from previous figure */

void trans_isnd_ircv(double *buff, double *trans, int Block_order,
                     double *work, int my_ID, int num_procs)
{
   int iphase;
   int block_size;
   int send_to, recv_from;
   double *bblock;    /* pointer to current location in buff */
   double *tblock;    /* pointer to current location in trans */
   MPI_Status status;
   MPI_Request send_req, recv_req;

   block_size = Block_order * Block_order;

/*******************************************************************
** Do the transpose in num_procs phases.
**
** In the first phase, do the diagonal block. Then move out
** from the diagonal copying the local matrix into a communication
** buffer (while doing the local transpose) and send to process
** (diag+phase)%num_procs.
*******************************************************************/
   bblock = BLOCK(buff, my_ID);
   tblock = BLOCK(trans, my_ID);

   transpose(bblock, Block_order, tblock, Block_order,
             Block_order, Block_order);

   for (iphase=1; iphase<num_procs; iphase++){

      recv_from = FROM(my_ID, iphase, num_procs);
      tblock = BLOCK(trans, recv_from);
      MPI_Irecv (tblock, block_size, MPI_DOUBLE, recv_from,
                 iphase, MPI_COMM_WORLD, &recv_req);

      send_to = TO(my_ID, iphase, num_procs);
      bblock = BLOCK(buff, send_to);
      transpose(bblock, Block_order, work, Block_order,
                Block_order, Block_order);
      MPI_Isend (work, block_size, MPI_DOUBLE, send_to,
                 iphase, MPI_COMM_WORLD, &send_req);

      MPI_Wait(&recv_req, &status);
      MPI_Wait(&send_req, &status);
   }
}
BUFF is the start of the 1D array (buff for the original array, trans for the transpose) and ID is the second index of the block. So for example, we find the diagonal block of both arrays as follows:
bblock = BLOCK(buff, my_ID); tblock = BLOCK(trans, my_ID);
Likewise, we compute where the next block is coming from and which local index corresponds to that block:
recv_from = FROM(my_ID, iphase, num_procs); tblock = BLOCK(trans, recv_from);
This continues until all of the blocks have been transposed.
We use immediate (nonblocking) sends and receives in this example. (These primitives are described in more detail in the MPI appendix, Appendix B.) During each phase, each UE first posts a receive and then performs a transpose on the block it will send. After that transpose is complete, the UE sends the now-transposed block to the UE that should receive it. At the bottom of the loop and before moving to the next phase, functions are called to force the UE to wait until both the sends and receives complete. This approach lets us overlap communication and computation. More importantly (because in this case there isn't much computation to overlap with communication), it prevents deadlock: A more straightforward approach using regular sends and receives would be to first transpose the block to be sent, then send it, and then (wait to) receive a block from another UE. However, if the blocks to be sent are large, a regular send might block because there is insufficient buffer space for the message; in this case, such blocking could produce deadlock. By instead using nonblocking sends and receives and posting the receives first, we avoid this situation.
Known uses
This pattern is used throughout the scientific computing literature. The well-known ScaLAPACK
The PLAPACK package [ABE97+, PLA, vdG97] takes a different approach to array distribution. Rather than focusing on how to distribute the arrays, PLAPACK considers how the vectors operated upon by the arrays are organized. From these distributed vectors, the corresponding array distributions are derived. In many problems, these vectors correspond to the physical quantities in the problem domain, so the PLAPACK team refers to this as the physically based distribution.

Related Patterns

The Distributed Array pattern is often used together with the Geometric Decomposition and SPMD patterns.
times been important in parallel programming. They are only rarely used at this time, but it is still important to be aware of them. They can provide insights into different opportunities for finding and exploiting concurrency. And it is possible that as parallel architectures continue to evolve, the parallel programming techniques suggested by these patterns may become important.

In this section, we will briefly describe some of these additional patterns and their supporting structures: SIMD, MPMD, Client-Server, and Declarative Programming. We close with a brief discussion of problem-solving environments. These are not patterns, but they help programmers work within a targeted set of problems.

5.11.1. SIMD

A SIMD computer has a single stream of instructions operating on multiple streams of data. These machines were inspired by the belief that programmers would find it too difficult to manage multiple streams of instructions. Many important problems are data parallel; that is, the concurrency can be expressed in terms of concurrent updates across the problem's data domain. Carried to its logical extreme, the SIMD approach assumes that it is possible to express all parallelism in terms of the data. Programs would then have single-thread semantics, making understanding and hence debugging them much easier. The basic idea behind the SIMD pattern can be summarized as follows.
* Define a network of virtual PEs to be mapped onto the actual PEs. These virtual PEs are connected according to a well-defined topology. Ideally the topology is (1) well aligned with the way the PEs in the physical machine are connected and (2) effective for the communication patterns implied by the problem being solved.

* Express the problem in terms of arrays or other regular data structures that can be updated concurrently with a single stream of instructions.

* Associate these arrays with the local memories of the virtual PEs.

* Create a single stream of instructions that operates on slices of the regular data structures. These instructions may have an associated mask so they can be selectively skipped for subsets of array elements. This is critical for handling boundary conditions or other constraints.
When a problem is truly data parallel, this is an effective pattern. The resulting programs are relatively easy to write and debug [DKK90].

Unfortunately, most data problems contain subproblems that are not data parallel. Setting up the core data structures, dealing with boundary conditions, and postprocessing after a core data-parallel algorithm can all introduce logic that might not be strictly data parallel. Furthermore, this style of programming is tightly coupled to compilers that support data-parallel programming. These compilers have proven difficult to write and result in code that is difficult to optimize because it can be far removed from how a program runs on a particular machine. Thus, this style of parallel programming and the machines built around the SIMD concept have largely disappeared, except for a few special-purpose machines used for signal processing applications.

The programming environment most closely associated with the SIMD pattern is High Performance Fortran (HPF) [HPF97]. HPF is an extension of the array-based constructs in Fortran 90. It was created to support portable parallel programming across SIMD machines, but also to allow the SIMD
programming model to be used on MIMD computers. This required explicit control over data placement onto the PEs and the capability to remap the data during a calculation. Its dependence on a strictly data-parallel, SIMD model, however, doomed HPF by making it difficult to use with complex applications. The last large community of HPF users is in Japan [ZJS02+], where they have extended the language to relax the data-parallel constraints [HPF99].

5.11.2. MPMD

The Multiple Program, Multiple Data (MPMD) pattern, as the name implies, is used in a parallel algorithm when different programs run on different UEs. The basic approach is the following.
In many ways, the MPMD approach is not too different from an SPMD program using MPI. In fact, the runtime environments associated with the two most common implementations of MPI, MPICH [MPI] and LAM/MPI [LAM], support simple MPMD programming.

Applications of the MPMD pattern typically arise in one of two ways. First, the architecture of the UEs may be so different that a single program cannot be used across the full system. This is the case when using parallel computing across some type of computational grid [Glob, FK03] using multiple classes of high-performance computing architectures. The second (and from a parallel algorithm point of view more interesting) case occurs when completely different simulation programs are combined into a coupled simulation.

For example, climate emerges from a complex interplay between atmospheric and ocean phenomena. Well-understood programs for modeling the ocean and the atmosphere independently have been developed and highly refined over the years. Although an SPMD program could be created that implements a coupled ocean/atmospheric model directly, a more effective approach is to take the separate, validated ocean and atmospheric programs and couple them through some intermediate layer, thereby producing a new coupled model from well-understood component models.

Although both MPICH and LAM/MPI provide some support for MPMD programming, they do not allow different implementations of MPI to interact, so only MPMD programs using a common MPI implementation are supported. To address a wider range of MPMD problems spanning different architectures and different MPI implementations, a new standard called interoperable MPI (iMPI) was created. The general idea of coordinating UEs through the exchange of messages is common to MPI and iMPI, but the detailed semantics are extended in iMPI to address the unique challenges arising from programs running on widely differing architectures. These multiarchitecture issues can add significant communication overhead, so the part of an algorithm dependent on the performance of iMPI must be relatively coarse-grained.

MPMD programs are rare. As increasingly complicated coupled simulations grow in importance,
however, use of the MPMD pattern will increase. Use of this pattern will also grow as grid technology becomes more robust and more widely deployed.

5.11.3. Client-Server Computing

Client-server architectures are related to MPMD. Traditionally, these systems have comprised two or three tiers where the front end is a graphical user interface executed on a client's computer and a mainframe back end (often with multiple processors) provides access to a database. The middle tier, if present, dispatches requests from the clients to (possibly multiple) back ends. Web servers are a familiar example of a client-server system. More generally, a server might offer a variety of services to clients, an essential aspect of the system being that services have well-defined interfaces. Parallelism can appear at the server (which can service many clients concurrently or can use parallel processing to obtain results more quickly for single requests) and at the client (which can initiate requests at more than one server simultaneously).

Techniques used in client-server systems are especially important in heterogeneous systems. Middleware such as CORBA [COR] provides a standard for service interface specifications, enabling new programs to be put together by composing existing services, even if those services are offered on vastly different hardware platforms and implemented in different programming languages. CORBA also provides facilities to allow services to be located. The Java J2EE (Java 2 Platform, Enterprise Edition) [Javb] also provides significant support for client-server applications. In both of these cases, interoperability was a major design force.

Client-server architectures have traditionally been used in enterprise rather than scientific applications. Grid technology, which is heavily used in scientific computing, borrows from client-server technology, extending it by blurring the distinction between clients and servers. All resources in a grid, whether they are computers, instruments, file systems, or anything else connected to the network, are peers and can serve as clients and servers. The middleware provides standards-based interfaces to tie the resources together into a single system that spans multiple administrative domains.

5.11.4. Concurrent Programming with Declarative Languages

The overwhelming majority of programming is done with imperative languages such as C++, Java, or Fortran. This is particularly the case for traditional applications in science and engineering. The artificial intelligence community and a small subset of academic computer scientists, however, have developed and shown great success with a different class of languages, the declarative languages. In these languages, the programmer describes a problem, a problem domain, and the conditions solutions must satisfy. The runtime system associated with the language then uses these to find valid solutions. Declarative semantics impose a different style of programming that overlaps with the approaches discussed in this pattern language, but has some significant differences. There are two important classes of declarative languages: functional languages and logic programming languages.

Logic programming languages are based on formal rules of logical inference. The most common logic programming language by far is Prolog [SS94], a programming language based on first-order predicate calculus. When Prolog is extended to support expression of concurrency, the result is a concurrent logic programming language. Concurrency is exploited in one of three ways with these
Prolog extensions: AND-parallelism (execute multiple predicates), OR-parallelism (execute multiple guards), or through explicit mapping of predicates linked together through single-assignment variables [CG86].

Concurrent logic programming languages were a hot area of research in the late 1980s and early 1990s. They ultimately failed because most programmers were deeply committed to more traditional imperative languages. Even with the advantages of declarative semantics and the value of logic programming for symbolic reasoning, the learning curve associated with these languages proved prohibitive.

The older and more established class of declarative programming languages is based on functional programming models [Hud89]. LISP is the oldest and best known of the functional languages. In pure functional languages, there are no side effects from a function. Therefore, functions can execute as soon as their input data is available. The resulting algorithms express concurrency in terms of the flow of data through the program, thereby resulting in "dataflow" algorithms [Jag96].

The best-known concurrent functional languages are Sisal [FCO90], Concurrent ML [Rep99, Con] (an extension to ML), and Haskell [HPF]. Because mathematical expressions are naturally written down in a functional notation, Sisal was particularly straightforward to work with in science and engineering applications and proved to be highly efficient for parallel programming. However, just as with the logic programming languages, programmers were unwilling to part with their familiar imperative languages, and Sisal essentially died. Concurrent ML and Haskell have not made major inroads into high-performance computing, although both remain popular in the functional programming community.

5.11.5. Problem-Solving Environments

A discussion of supporting structures for parallel algorithms would not be complete without mentioning problem-solving environments (PSEs). A PSE is a programming environment specialized to the needs of a particular class of problems. When applied to parallel computing, PSEs imply a particular algorithm structure as well.
The motivation behind PSEs is to spare the application programmer the low-level details of the parallel system. For example, PETSc (Portable, Extensible Toolkit for Scientific Computation) [BGMS98] supports a variety of distributed data structures and functions required to use them for solving partial differential equations (typically for problems fitting the Geometric Decomposition pattern). The programmer needs to understand the data structures within PETSc, but is spared the need to master the details of how to implement them efficiently and portably. Other important PSEs are PLAPACK [ABE97+] (for dense linear algebra problems) and POOMA [RHC96+] (an object-oriented framework for scientific computing).
Within each of these categories, the most commonly used mechanisms are covered in this chapter. An overview of this design space and its place in the pattern language is shown in Fig. 6.1.
Figure 6.1. Overview of the Implementation Mechanisms design space and its place in the pattern language
In this chapter we also drop the formalism of patterns. Most of the implementation mechanisms are included within the major parallel programming environments. Hence, rather than use patterns, we provide a high-level description of each implementation mechanism and then investigate how the mechanism maps onto our three target programming environments: OpenMP, MPI, and Java. This mapping will in some cases be trivial and require little more than presenting an existing construct in an API or language. The discussion will become interesting when we look at operations native to one programming model, but foreign to another. For example, it is possible to do message passing in OpenMP. It is not pretty, but it works and can be useful at times.

We assume that the reader is familiar with OpenMP, MPI, and Java and how they are used for writing parallel programs. Although we cover specific features of these APIs in this chapter, the details of using them are left to the appendixes.
6.1. OVERVIEW
Parallel programs exploit concurrency by mapping instructions onto multiple UEs. At a very basic level, every parallel program needs to (1) create the set of UEs, (2) manage interactions between them and their access to shared resources, (3) exchange information between UEs, and (4) shut them down in an orderly manner. This suggests the following categories of implementation mechanisms.
UE management. The creation, destruction, and management of the processes and threads used in parallel computation.
6.2. UE MANAGEMENT
Let us revisit the definition of unit of execution, or UE. A UE is an abstraction for the entity that carries out computations and is managed for the programmer by the operating system. In modern parallel programming environments, there are two types of UEs: processes and threads.

A process is a heavyweight object that carries with it the state or context required to define its place in the system. This includes memory, program counters, registers, buffers, open files, and anything else required to define its context within the operating system. In many systems, different processes can belong to different users, and thus processes are well protected from each other. Creating a new process and swapping between processes is expensive because all that state must be saved and restored. Communication between processes, even on the same machine, is also expensive because the protection boundaries must be crossed.

A thread, on the other hand, is a lightweight UE. A collection of threads is contained in a process. Most of the resources, including the memory, belong to the process and are shared among the threads. The result is that creating a new thread and switching context between threads is less expensive, requiring only the saving of a program counter and some registers. Communication between threads belonging to the same process is also inexpensive because it can be done by accessing the shared memory.

The mechanisms for managing these two types of UEs are completely different. We handle them separately in the next two sections.

6.2.1. Thread Creation/Destruction

Threads require a relatively modest number of machine cycles to create. Programmers can reasonably create and destroy threads as needed inside a program, and as long as they do not do so inside tight loops or time-critical kernels, the program's overall runtime will be affected only modestly. Hence, most parallel programming environments make it straightforward to create threads inside a program, and the API supports thread creation and destruction.
OpenMP: thread creation/destruction
In OpenMP, threads are created with the parallel pragma:
#pragma omp parallel
{
   structured block
}
To create and launch the thread, one would write
Thread t = new MyThread();  // create thread object
t.start();                  // launch the thread
To create and execute the thread, we create a Runnable object and pass it to the Thread constructor. The thread is launched using the start method as before:
Thread t = new Thread(new MyRunnable()); // create Runnable
                                         // and Thread objects
t.start();                               // start the thread
In most cases, the second approach is preferred.[1]
[1]
In Java, a class can implement any number of interfaces, but is only allowed to extend a
MPI is fundamentally based on processes. It is thread-aware in that the MPI 2.0 API [Mesa] defines different levels of thread safety and provides functions to query a system at runtime as to the level of thread safety that is supported. The API, however, has no concept of creating and destroying threads. To mix threads into an MPI program, the programmer must use a thread-based programming model in addition to MPI.

6.2.2. Process Creation/Destruction

A process carries with it all the information required to define its place in the operating system. In addition to program counters and registers, a process includes a large block of memory (its address space), system buffers, and everything else required to define its state to the operating system. Consequently, creating and destroying processes is expensive and not done very often.
MPI: process creation/destruction
6.3. SYNCHRONIZATION
Synchronization is used to enforce a constraint on the order of events occurring in different UEs. There is a vast body of literature on synchronization [And00], and it can be complicated. Most programmers, however, use only a few synchronization methods on a regular basis.

6.3.1. Memory Synchronization and Fences

In a simple, classical model of a shared-memory multiprocessor, each UE executes a sequence of instructions that can read or write atomically from the shared memory. We can think of the computation as a sequence of atomic events, with the events from different UEs interleaved. Thus, if UE A writes a memory location and then UE B reads it, UE B will see the value written by UE A.
Suppose, for example, that UE A does some work and then sets a variable done to true. Meanwhile, UE B executes a loop:
while (!done) {/*do something but don't change done*/}
In the simple model, UE A will eventually set done, and then in the next loop iteration, UE B will read the new value and terminate the loop.

In reality, several things could go wrong. First of all, the value of the variable may not actually be written by UE A or read by UE B. The new value could be held in a cache instead of the main memory, and even in systems with cache coherency, the value could be (as a result of compiler optimizations, say) held in a register and not be made visible to UE B. Similarly, UE B may try to read the variable and obtain a stale value, or due to compiler optimizations, not even read the value more than once because it isn't changed in the loop. In general, many factors (properties of the memory system, the compiler, instruction reordering, and so on) can conspire to leave the contents of the memories (as seen by each UE) poorly defined.

A memory fence is a synchronization event that guarantees that the UEs will see a consistent view of memory. Writes performed before the fence will be visible to reads performed after the fence, as would be expected in the classical model, and all reads performed after the fence will obtain a value written no earlier than the latest write before the fence.

Clearly, memory synchronization is only an issue when there is shared context between the UEs. Hence, this is not generally an issue when the UEs are processes running in a distributed-memory environment. For threads, however, putting memory fences in the right locations can make the difference between a working program and a program riddled with race conditions.

Explicit management of memory fences is cumbersome and error-prone. Fortunately, most programmers, although needing to be aware of the issue, only rarely need to deal with fences explicitly because, as we will see in the next few sections, the memory fence is usually implied by higher-level synchronization constructs.
OpenMP: fences
In OpenMP, a memory fence is defined with the flush statement:
#pragma omp flush
OpenMP programmers only rarely use the flush construct because OpenMP's high-level
If a program uses synchronization among the full team, the synchronization will work independently of the size of the team, even if the team size is one. On the other hand, a program with pairwise synchronization will deadlock if run with a single thread. An OpenMP design goal was to encourage code that is equivalent whether run with one thread or many, a property called sequential equivalence. Thus, high-level constructs that are not sequentially equivalent, such as pairwise synchronization, were left out of the API.

In this program, each thread has two blocks of work to carry out concurrently with the other threads in the team. The work is represented by two functions: do_a_whole_bunch() and do_more_stuff(). The contents of these functions are irrelevant (and hence are not shown) for this example. All that matters is that for this example we assume that a thread cannot safely begin work on the second function do_more_stuff() until its neighbor has finished with the first function do_a_whole_bunch().

The program uses the SPMD pattern. The threads communicate their status for the sake of the pairwise synchronization by setting their value (indexed by the thread ID) of the flag array. Because this array must be visible to all of the threads, it needs to be a shared array. We create this within OpenMP by declaring the array in the sequential region (that is, prior to creating the team of threads). We create the team of threads with a parallel pragma:
#pragma omp parallel shared(flag)
When the work is done, the thread sets its flag to 1 to notify any interested threads that the work is done. This must be flushed to memory to ensure that other threads can see the updated value:
#pragma omp flush (flag)
Figure 6.2. Program showing one way to implement pairwise synchronization in OpenMP

The flush construct is vital. It forces the memory to be consistent, thereby making the updates to the flag array visible. For more details about the syntax of OpenMP, see the OpenMP appendix, Appendix A.
#include <omp.h>
#include <stdio.h>

#define MAX 10   // max number of threads

// Functions used in this program: the details of these
// functions are not relevant so we do not include the
// function bodies.
extern int  neighbor(int);   // return the ID for a thread's neighbor
extern void do_a_whole_bunch(int);
extern void do_more_stuff();

int main()
{
   int flag[MAX];   // Define an array of flags, one per thread
   int i;
   for (i = 0; i < MAX; i++) flag[i] = 0;

   #pragma omp parallel shared(flag)
   {
      int ID;
      ID = omp_get_thread_num();  // returns a unique ID for each thread

      do_a_whole_bunch(ID);       // Do a whole bunch of work

      flag[ID] = 1;               // signal that this thread has finished its work
      #pragma omp flush(flag)     // make sure all the threads have a chance to
                                  // see the updated flag value

      while (!flag[neighbor(ID)]) {  // wait to see if neighbor is done
         #pragma omp flush(flag)     // required to see any changes to flag
      }

      do_more_stuff();            // call a function that can't safely start until
                                  // the neighbor's work is complete
   } // end parallel region
}
The thread then waits until its neighbor is finished with do_a_whole_bunch() before moving on to finish its work with a call to do_more_stuff().
while (!flag[neighbor(ID)]) {  // wait to see if neighbor is done
   #pragma omp flush(flag)     // required to see any changes to flag
}
The flush operation in OpenMP only affects the thread-visible variables for the calling thread. If another thread writes a shared variable and forces it to be available to the other threads by using a flush, the thread reading the variable still needs to execute a flush to make sure it picks up the new value. Hence, the body of the while loop must include a flush(flag) construct to make sure the thread sees any new values in the flag array.

As we mentioned earlier, knowing when a flush is needed and when it is not can be challenging. In most cases, the flush is built into the synchronization construct. But when the standard constructs aren't adequate and custom synchronization is required, placing memory fences in the right locations is essential.
Java: fences
Java is one of the first languages where an attempt was made to specify its memory model precisely. The original specification has been criticized for imprecision as well as for not supporting certain synchronization idioms while at the same time disallowing some reasonable compiler optimizations. A new specification for Java 1.5 is described in [JSRa]. In this book, we assume the new specification.

Java allows variables to be declared as volatile. When a variable is marked volatile, all writes to the variable are guaranteed to be immediately visible, and all reads are guaranteed to obtain the last value written. Thus, the compiler takes care of memory synchronization issues for volatiles.[4] In a Java version of a program containing the fragment (where done is expected to be set by another thread)
[4]
we would mark done to be volatile when the variable is declared, as follows, and then ignore memory synchronization issues related to this variable in the rest of the program.
volatile boolean done = false;
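As a minimal sketch of how such a volatile flag behaves (the class name, the runDemo method, and the value 42 are our own illustration, not from the text), one thread publishes a result and sets the flag, while another spins until it sees the update:

```java
// Sketch: the done flag above, declared volatile so the writer's update
// is guaranteed to become visible to the spinning reader without any
// explicit memory fence.
public class VolatileDone {
    static volatile boolean done = false;
    static int result = 0;

    public static int runDemo() {
        done = false;
        result = 0;
        Thread writer = new Thread(() -> {
            result = 42;   // ordinary write, ordered before the volatile write
            done = true;   // volatile write publishes both updates
        });
        writer.start();
        while (!done) {    // volatile read: guaranteed to eventually see true
            Thread.yield();
        }
        return result;     // seeing done == true makes result visible too
    }

    public static void main(String[] args) {
        System.out.println("result = " + runDemo());
    }
}
```

Note that the ordinary write to result is also made visible by this idiom: under the new memory model, the volatile write to done happens-before the volatile read that observes it.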
Because the volatile keyword can be applied only to references to an array and not to the individual elements, the java.util.concurrent.atomic package introduced in Java 1.5 adds a notion of atomic arrays where the individual elements are accessed with volatile semantics. For example, in a Java version of the OpenMP example shown in Fig. 6.2, we would declare the flag array to be of type AtomicIntegerArray and update and read with that class's set and get methods.

Another technique for ensuring proper memory synchronization is synchronized blocks. A synchronized block appears as follows:
synchronized(some_object){/*do something with shared variables*/}
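Returning to the atomic-array approach mentioned above, the following is a minimal sketch (the class name, runDemo, the team size of 4, and the ring-style neighbor function are our own illustration, not the code of Fig. 6.2): each thread sets its own element with set and spins on its neighbor's element with get.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch: per-thread completion flags with volatile access semantics.
// AtomicIntegerArray.set() publishes an element and get() reads it with
// volatile semantics, playing the role of the flag array plus the
// explicit flushes in the OpenMP version.
public class AtomicFlags {
    static final int N = 4;
    static final AtomicIntegerArray flag = new AtomicIntegerArray(N); // zeros

    public static int runDemo() {
        Thread[] t = new Thread[N];
        for (int id0 = 0; id0 < N; id0++) {
            final int id = id0;
            t[id0] = new Thread(() -> {
                // ... first block of work would go here ...
                flag.set(id, 1);                  // signal completion
                int neighbor = (id + 1) % N;      // simple ring neighbor
                while (flag.get(neighbor) != 1) { // spin until neighbor done
                    Thread.yield();
                }
                // ... second block of work would go here ...
            });
            t[id0].start();
        }
        for (Thread th : t) {
            try { th.join(); } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        int sum = 0;
        for (int i = 0; i < N; i++) sum += flag.get(i);
        return sum;   // every thread set its flag, so this is N
    }

    public static void main(String[] args) {
        System.out.println("flags set: " + runDemo());
    }
}
```

Because every thread sets its own flag before waiting on its neighbor's, the spin loops cannot deadlock.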
A fence only arises in environments that include shared memory. In MPI specifications prior to MPI 2.0, the API did not expose shared memory to the programmer, and hence there was no need for a user-callable fence. MPI 2.0, however, includes one-sided communication constructs. These constructs create "windows" of memory visible to other processes in an MPI program. Data can be pushed to or pulled from these windows by a single process without the explicit cooperation of the process owning the memory region in question. These memory windows require some type of fence, but are not discussed here because implementations of MPI 2.0 are not widely available at the time this was written.

6.3.2. Barriers

A barrier is a synchronization point at which every member of a collection of UEs must arrive before any members can proceed. If a UE arrives early, it will wait until all of the other UEs have arrived.

A barrier is one of the most common high-level synchronization constructs. It has relevance both in process-oriented environments such as MPI and thread-based systems such as OpenMP and Java.
MPI: barriers
In MPI, a barrier is invoked by calling the function
MPI_Barrier(MPI_Comm comm)
#include <mpi.h>    // MPI include file
#include <stdio.h>

extern void runit();

int main(int argc, char **argv)
{
   int num_procs;  // number of processes in the group
   int ID;         // unique identifier ranging from 0 to (num_procs-1)
   double time_init, time_final, time_elapsed;

   //
   // Initialize MPI and set up the SPMD program
   //
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

   //
   // Ensure that all processes are set up and ready to go before timing runit()
   //
   MPI_Barrier(MPI_COMM_WORLD);

   time_init = MPI_Wtime();

   runit();   // a function that we wish to time on each process

   time_final = MPI_Wtime();
   time_elapsed = time_final - time_init;

   printf(" I am %d and my computation took %f seconds\n", ID, time_elapsed);

   MPI_Finalize();
   return 0;
}
In OpenMP, a simple pragma sets the barrier:
#pragma omp barrier
#include <omp.h>
#include <stdio.h>

extern void runit();

int main(int argc, char **argv)
{
   double time_init, time_final, time_elapsed;

   #pragma omp parallel private(time_init, time_final, time_elapsed)
   {
      int ID;
      ID = omp_get_thread_num();

      //
      // ensure that all threads are set up and ready to go before timing runit()
      //
      #pragma omp barrier

      time_init = omp_get_wtime();

      runit();   // a function that we wish to time on each thread

      time_final = omp_get_wtime();
      time_elapsed = time_final - time_init;

      printf(" I am %d and my computation took %f seconds\n", ID, time_elapsed);
   }
   return 0;
}
Barriers should be used where required to ensure correct program semantics, but for performance reasons should be used no more often than absolutely required.
Java: barriers
Java did not originally include a barrier primitive, although it is not difficult to create one using the facilities in the language, as was done in the public-domain util.concurrent package [Lea]. In Java 1.5, similar classes are provided in the java.util.concurrent package.

A CyclicBarrier is similar to the barrier described previously. The CyclicBarrier class contains two constructors: one that requires the number of threads that will synchronize on the barrier, and another that takes the number of threads along with a Runnable object whose run method will be executed by the last thread to arrive at the barrier. When a thread arrives at the barrier, it invokes the barrier's await method. If a thread "breaks" the barrier by terminating prematurely with an exception, the other threads will throw a BrokenBarrierException. A CyclicBarrier automatically resets itself when passed and can be used multiple times (in a loop, for example). In Fig. 6.5, we provide a Java version of the barrier examples given previously using a CyclicBarrier.

The java.util.concurrent package also provides a related synchronization primitive, CountDownLatch. A CountDownLatch is initialized to a particular value N. Each invocation of its countDown method decreases the count. A thread executing the await method blocks until the value of the latch reaches 0. The separation of countDown (analogous to "arriving at the barrier") and await (waiting for the other threads) allows more general situations than "all threads wait for all other threads to reach a barrier." For example, a single thread could wait for N events to happen, or N threads could wait for a single event to happen. A CountDownLatch cannot be reset and can only be used once. An example using a CountDownLatch is given in the Java appendix, Appendix C.

6.3.3. Mutual Exclusion

When memory or other system resources (for example, a file system) are shared, the program must ensure that multiple UEs do not interfere with each other. For example, if two threads try to update a shared data structure at the same time, a race condition results that can leave the structure in an inconsistent state. A critical section is a sequence of statements that conflict with a sequence of statements that may be executed by other UEs. Two sequences of statements conflict if both access the same data and at least one of them modifies the data. To protect the resources accessed inside the critical section, the programmer must use some mechanism that ensures that only one thread at a time will execute the code within the critical section. This is called mutual exclusion.
Figure 6.5. Java program containing a CyclicBarrier. This program is used to time the execution of function runit().
import java.util.concurrent.*;

public class TimerExample implements Runnable
{
   static int N;
   static CyclicBarrier barrier;
   final int ID;

   public void run()
   {
      //wait at barrier until all threads are ready
      System.out.println(ID + " at await");
      try { barrier.await(); }
      catch (InterruptedException ex)   { Thread.dumpStack(); }
      catch (BrokenBarrierException ex) { Thread.dumpStack(); }

      //record start time
      long time_init = System.currentTimeMillis();

      //execute function to be timed on each thread
      runit();

      //record ending time
      long time_final = System.currentTimeMillis();

      //print elapsed time
      System.out.println("Elapsed time for thread " + ID +
                         " = " + (time_final - time_init) + " msecs");
      return;
   }

   void runit(){...}   //definition of runit()

   TimerExample(int ID){ this.ID = ID; }

   public static void main(String[] args)
   {
      N = Integer.parseInt(args[0]);   //read number of threads
      barrier = new CyclicBarrier(N);  //instantiate barrier
      for (int i = 0; i != N; i++)     //create and start threads
         { new Thread(new TimerExample(i)).start(); }
   }
}
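The CountDownLatch discussed earlier separates arrival (countDown) from waiting (await). A minimal sketch of the "one thread waits for N events" case follows; the class name, runDemo, and the use of an AtomicInteger to count the events are our own illustration, not from the text.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: one thread waits for n events using a CountDownLatch.
// countDown() marks an event ("arrival") and never blocks; await()
// blocks until the internal count reaches zero.
public class LatchDemo {
    public static int runDemo(int n) {
        CountDownLatch latch = new CountDownLatch(n);
        AtomicInteger events = new AtomicInteger(0);
        for (int i = 0; i < n; i++) {
            new Thread(() -> {
                events.incrementAndGet(); // the "event" we are waiting for
                latch.countDown();        // signal arrival; never blocks
            }).start();
        }
        try {
            latch.await();                // block until all n events happened
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return events.get();              // all n increments are visible here
    }

    public static void main(String[] args) {
        System.out.println("events seen: " + runDemo(5));
    }
}
```

Unlike the CyclicBarrier in Fig. 6.5, this latch is spent after one use: a second round of events would require a new CountDownLatch.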
#include <omp.h>
#include <stdio.h>

#define N 1000

extern double big_computation(int, int);
extern void   consume_results(int, double, double *);

int main()
{
   double global_result[N];

   #pragma omp parallel shared(global_result)
   {
      double local_result;
      int i;
      int ID = omp_get_thread_num();   // set a thread ID

      #pragma omp for
      for (i = 0; i < N; i++) {
         local_result = big_computation(ID, i);   // carry out the UE's work
         #pragma omp critical
         {
            consume_results(ID, local_result, global_result);
         }
      }
   }
   return 0;
}
The for pragma is the OpenMP construct that tells the compiler to distribute the iterations of the loop among a team of threads. After big_computation() is complete, the results need to be combined into the global data structure that will hold the result.

While we don't show the code, assume the update within consume_results() can be done in any order, but the update by one thread must complete before another thread can execute an update. The critical pragma accomplishes this for us. The first thread to finish its big_computation() enters the enclosed block of code and calls consume_results(). If a thread arrives at the top of the critical section block while another thread is processing the block, it waits until the prior thread is finished.

The critical section is an expensive synchronization operation. Upon entry to a critical section, a thread flushes all visible variables to ensure that a consistent view of the memory is seen inside the critical section. At the end of the critical section, we need any memory updates occurring within the critical section to be visible to the other threads in the team, so a second flush of all thread-visible variables is required.

The critical section construct is not only expensive, it is not very general. It cannot be used among subsets of threads within a team or to provide mutual exclusion between different blocks of code. Thus, the OpenMP API provides a lower-level and more flexible construct for mutual exclusion called a lock.

Locks in different shared-memory APIs tend to be similar. The programmer declares the lock and initializes it. Only one thread at a time is allowed to hold the lock. Other threads trying to acquire the lock will block. Blocking while waiting for a lock is inefficient, so many lock APIs allow threads to test a lock's availability without trying to acquire it. Thus, a thread can opt to do useful work and come back to attempt to acquire the lock later.

Consider the use of locks in OpenMP. The example in Fig. 6.7 shows use of a simple lock to make sure only one thread at a time attempts to write to standard output.

The program first declares the lock to be of type omp_lock_t. This is an opaque object, meaning that as long as the programmer only manipulates lock objects through the OpenMP runtime library, the programmer can safely work with the locks without ever considering the details of the lock type. The lock is then initialized with a call to omp_init_lock.
Figure 6.7. Example of using locks in OpenMP

#include <omp.h>
#include <stdio.h>

int main()
{
   omp_lock_t lock;        // declare the lock using the lock
                           // type defined in omp.h
   omp_set_num_threads(5);
   omp_init_lock(&lock);   // initialize the lock

   #pragma omp parallel shared(lock)
   {
      int id = omp_get_thread_num();

      omp_set_lock(&lock);
      printf("\n only thread %d can do this print\n", id);
      omp_unset_lock(&lock);
   }
}
The Java language provides support for mutual exclusion with the synchronized block construct and also, in Java 1.5, with new lock classes contained in the package java.util.concurrent.locks. Every object in a Java program implicitly contains its own lock. Each synchronized block has an associated object, and a thread must acquire the lock on that object before executing the body of the block. When the thread exits the body of the synchronized block, whether normally or abnormally by throwing an exception, the lock is released.

In Fig. 6.8, we provide a Java version of the example given previously. The work done by the threads is specified by the run method in the nested Worker class. Note that because N is declared to be final (and is thus immutable), it can safely be accessed by any thread without requiring any synchronization.

For the synchronized blocks to exclude each other, they must be associated with the same object. In the example, the synchronized block in the run method uses this.getClass() as an argument. This expression returns a reference to the runtime object representing the Worker class. This is a convenient way to ensure that all callers use the same object. We could also have introduced a global instance of java.lang.Object (or an instance of any other class) and used that as the argument to the synchronized block. What is important is that all the threads synchronize on the same object. A potential mistake would be to use, say, this, which would not enforce the desired mutual exclusion because each worker thread would be synchronizing on itself and thus locking a different lock.

Because this is a very common misunderstanding and source of errors in multithreaded Java programs, we emphasize again that a synchronized block only protects a critical section from access by other threads whose conflicting statements are also enclosed in a synchronized block with the same object as an argument. Synchronized blocks associated with different objects do not exclude each other. (They also do not guarantee memory synchronization.) Also, the presence of a synchronized block in a method does not constrain code that is not in a synchronized block. Thus, forgetting a needed synchronized block or making a mistake with the argument to the synchronized block can have serious consequences.
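As a minimal sketch of the explicit lock classes mentioned above (the class name, runDemo, and the shared counter are our own illustration), a ReentrantLock protects a shared update the same way a synchronized block would; its tryLock method additionally lets a thread test availability without blocking, as discussed earlier for lock APIs in general:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: mutual exclusion with an explicit ReentrantLock. lock() blocks
// until the lock is acquired; the try/finally ensures the lock is always
// released, even if the critical section throws.
public class LockDemo {
    static final ReentrantLock lock = new ReentrantLock();
    static int counter = 0;   // shared data protected by the lock

    public static int runDemo(int nThreads, int nIncrements) {
        counter = 0;   // reset so repeated calls are deterministic
        Thread[] t = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            t[i] = new Thread(() -> {
                for (int k = 0; k < nIncrements; k++) {
                    lock.lock();        // enter the critical section
                    try {
                        counter++;      // one thread at a time updates this
                    } finally {
                        lock.unlock();  // always release the lock
                    }
                }
            });
            t[i].start();
        }
        for (Thread th : t) {
            try { th.join(); } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter;   // nThreads * nIncrements when exclusion holds
    }

    public static void main(String[] args) {
        System.out.println("counter = " + runDemo(4, 1000));
    }
}
```

Without the lock, the counter++ statements from different threads could interleave and lose updates; with it, the final count is deterministic.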
The code in Fig. 6.8 is structured similarly to the OpenMP example. A more common approach used in Java programs is to encapsulate a shared data structure in a class and provide access only through synchronized methods. A synchronized method is just a special case of a synchronized block that includes an entire method.
Figure 6.8. Java version of the OpenMP program in Fig. 6.6
public class Example
{
   static final int N = 10;
   static double[] global_result = new double[N];

   public static void main(String[] args) throws InterruptedException
   {
      //create and start N threads
      Thread[] t = new Thread[N];
      for (int i = 0; i != N; i++)
         { t[i] = new Thread(new Worker(i)); t[i].start(); }

      //wait for all N threads to finish
      for (int i = 0; i != N; i++){ t[i].join(); }

      //print the results
      for (int i = 0; i != N; i++)
         { System.out.print(global_result[i] + " "); }
      System.out.println("done");
   }

   static class Worker implements Runnable
   {
      double local_result;
      int i;
      int ID;

      Worker(int ID){ this.ID = ID; }

      //main work of threads
      public void run()
      {
         //perform the main computation
         local_result = big_computation(ID, i);

         //update global variables in synchronized block
         synchronized(this.getClass())
            { consume_results(ID, local_result, global_result); }
      }

      //define computation
      double big_computation(int ID, int i){ . . . }

      //define result update
      void consume_results(int ID, double local_result,
                           double[] global_result){ . . . }
   }
}
This moves the responsibility for protecting updates from the class defining the worker threads to the class owning the global_result array.
Figure 6.9. Java program showing how to implement mutual exclusion with a synchronized method
public class Example2
{
   static final int N = 10;
   private static double[] global_result = new double[N];

   public static void main(String[] args) throws InterruptedException
   {
      //create and start N threads
      Thread[] t = new Thread[N];
      for (int i = 0; i != N; i++)
         { t[i] = new Thread(new Worker(i)); t[i].start(); }

      //wait for all threads to terminate
      for (int i = 0; i != N; i++){ t[i].join(); }

      //print results
      for (int i = 0; i != N; i++)
         { System.out.print(global_result[i] + " "); }
      System.out.println("done");
   }

   //synchronized method serializing the consume_results update
   synchronized static void consume_results(int ID, double local_result)
   {
      global_result[ID] = . . .
   }

   static class Worker implements Runnable
   {
      double local_result;
      int i;
      int ID;

      Worker(int ID){ this.ID = ID; }

      public void run()
      {
         //perform the main computation
         local_result = big_computation(ID, i);   //carry out the UE's work

         //invoke method to update results
         Example2.consume_results(ID, local_result);
      }
   }
}
As is the case with most of the synchronization constructs, mutual exclusion is only needed when the statements execute within a shared context. Hence, a shared-nothing API such as MPI does not provide support for critical sections directly within the standard. Consider the OpenMP program in Fig. 6.6. If we want to implement a similar method in MPI with a complex data structure that has to be updated by one UE at a time, the typical approach is to dedicate a process to this update. The other processes would then send their contributions to the dedicated process. We show this situation in Fig. 6.10.

This program uses the SPMD pattern. As with the OpenMP program in Fig. 6.6, we have a loop to carry out N calls to big_computation(), the results of which are consumed and placed in a single global data structure. Updates to this data structure must be protected so that results from only one UE at a time are applied.

We arbitrarily choose the UE with the highest rank to manage the critical section. This process then executes a loop and posts N receives. By using the MPI_ANY_SOURCE and MPI_ANY_TAG values in the MPI_Recv() statement, the messages holding results from the calls to big_computation() are taken in any order. If the tag or ID are required, they can be recovered from the status variable returned from MPI_Recv().

The other UEs carry out the N calls to big_computation(). Because one UE has been dedicated to managing the critical section, the effective number of processes in the computation is decreased by one. We use a cyclic distribution of the loop iterations as was described in the Examples section of the SPMD pattern. This assigns the loop iterations in a round-robin fashion. After a UE completes its computation, the result is sent to the process managing the critical section.[5]
[5]
We used a synchronous send (MPI_Ssend()), which does not return until a matching MPI receive has been posted, to duplicate the behavior of shared-memory mutual exclusion as closely as possible. It is worth noting, however, that in a distributed-memory environment, making the sending process wait until the message has been received is only rarely needed and adds additional parallel overhead. Usually, MPI programmers go to great lengths to avoid parallel overheads and would only use synchronous message passing as a last resort. In this example, the standard-mode message-passing functions, MPI_Send() and MPI_Recv(), would be a better choice unless either (1) a condition external to the communication requires the two processes to satisfy an ordering constraint, hence forcing them to synchronize with each other, or (2) communication buffers or another system resource limit the capacity of the computer receiving the messages, thereby forcing the processes on the sending side to wait until the receiving side is ready.
Figure 6.10. Example of an MPI program with an update that requires mutual exclusion. A single process is dedicated to the update of this data structure.
#include <mpi.h>    // MPI include file
#include <stdio.h>

#define N 1000

extern void   consume_results(int, double, double *);
extern double big_computation(int, int);

int main(int argc, char **argv)
{
   int Tag1 = 1;  int Tag2 = 2;   // message tags
   int num_procs;                 // number of processes in group
   int ID;                        // unique identifier from 0 to (num_procs-1)
   double local_result, global_result[N];
   int i, ID_CRIT;
   MPI_Status stat;               // MPI status parameter

   // Initialize MPI and set up the SPMD program
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

   // Need at least two processes for this method to work
   if (num_procs < 2) MPI_Abort(MPI_COMM_WORLD, -1);

   // Dedicate the last process to managing update of final result
   ID_CRIT = num_procs - 1;

   if (ID == ID_CRIT) {
      int ID_sender;   // variable to hold ID of sender
      for (i = 0; i < N; i++) {
         MPI_Recv(&local_result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                  MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
         ID_sender = stat.MPI_SOURCE;
         consume_results(ID_sender, local_result, global_result);
      }
   }
   else {
      num_procs--;
      for (i = ID; i < N; i += num_procs) {   // cyclic distribution of iterations
         local_result = big_computation(ID, i);   // carry out UE's work

         // Send local result using a synchronous send - a send that doesn't
         // return until a matching receive has been posted.
         MPI_Ssend(&local_result, 1, MPI_DOUBLE, ID_CRIT, ID,
                   MPI_COMM_WORLD);
      }
   }
   MPI_Finalize();
   return 0;
}
6.4. COMMUNICATION
In most parallel algorithms, UEs need to exchange information as the computation proceeds. Shared-memory environments provide this capability by default, and the challenge in these systems is to synchronize access to shared memory so that the results are correct regardless of how the UEs are scheduled. In distributed-memory systems, however, it is the other way around: Because there are few, if any, shared resources, the need for explicit synchronization to protect these resources is rare. Communication, however, becomes a major focus of the programmer's effort.

6.4.1. Message Passing

A message is the most basic communication element. A message consists of a header containing data about the message (for example, source, destination, and a tag) and a series of bits to be communicated. Message passing is typically two-sided, meaning that a message is explicitly sent between a pair of UEs, from a specific source to a specific destination. In addition to direct message passing, there are communication events that involve multiple UEs (usually all of them) in a single communication event. We call this collective communication. These collective communication events often include computation as well, the classic example being a global summation where a set of values distributed about the system is summed into a single value on each node.

In the following sections, we will explore communication mechanisms in more detail. We start with basic message passing in MPI, OpenMP, and Java. Then, we consider collective communication, looking in particular at the reduction operation in MPI and OpenMP. We close with a brief look at other approaches to managing communication in parallel programs.
MPI: message passing
Message passing between a pair of UEs is the most basic of the communication operations and provides a natural starting point for our discussion. A message is sent by one UE and received by another. In its most basic form, the send and the receive operations are paired.

As an example, consider the MPI program in Fig. 6.11 and 6.12. In this program, a ring of processors work together to iteratively compute the elements within a field (field in the program). To keep the problem simple, we assume the dependencies in the update operation are such that each UE only needs information from its neighbor to the left to update the field.

We show only the parts of the program relevant to the communication. The details of the actual update operation, initialization of the field, or even the structure of the field and boundary data are omitted.

The program in Fig. 6.11 and 6.12 declares its variables and then initializes the MPI environment. This is an instance of the SPMD pattern where the logic within the parallel algorithm will be driven by the process rank ID and the number of processes in the team.

After initializing field, we set up the communication pattern by computing two variables, left and right, that identify each process's neighbors in the ring.
#include <stdio.h>
#include "mpi.h"    // MPI include file

#define IS_ODD(x) ((x)%2)   // test for an odd int

// prototypes for functions to initialize the problem, extract
// the boundary region to share, and perform the field update.
// The contents of these functions are not provided.
extern void init (int, int, double *, double *, double *);
extern void extract_boundary (int, int, int, int, double *);
extern void update (int, int, int, int, double *, double *);
extern void output_results (int, double *);

int main(int argc, char **argv)
{
   int Tag1 = 1;       // message tag
   int nprocs;         // the number of processes in the group
   int ID;             // the process rank
   int Nsize;          // Problem size (order of field matrix)
   int Bsize;          // Number of doubles in the boundary
   int Nsteps;         // Number of iterations
   double *field, *boundary, *incoming;
   int i, left, right;
   MPI_Status stat;    // MPI status parameter

   //
   // Initialize MPI and set up the SPMD program
   //
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

   init(Nsize, Bsize, field, boundary, incoming);

   /* continued in next figure */
return until the relevant receive has been posted.
OpenMP: message passing
/* continued from previous figure */

   // assume a ring of processors and a communication pattern
   // where boundaries are shifted to the right.
   left = (ID+1);  if (left > (nprocs-1)) left = 0;
   right = (ID-1); if (right < 0) right = nprocs-1;

   for (i = 0; i < Nsteps; i++) {
      extract_boundary(Nsize, Bsize, ID, nprocs, boundary);
      if (IS_ODD(ID)) {
         MPI_Send (boundary, Bsize, MPI_DOUBLE, right, Tag1,
                   MPI_COMM_WORLD);
         MPI_Recv (incoming, Bsize, MPI_DOUBLE, left, Tag1,
                   MPI_COMM_WORLD, &stat);
      }
      else {
         MPI_Recv (incoming, Bsize, MPI_DOUBLE, left, Tag1,
                   MPI_COMM_WORLD, &stat);
         MPI_Send (boundary, Bsize, MPI_DOUBLE, right, Tag1,
                   MPI_COMM_WORLD);
      }
      update(Nsize, Bsize, ID, nprocs, field, incoming);
   }
   output_results(Nsize, field);
   MPI_Finalize();
   return 0;
}
shared data structure. In MPI, the synchronization required by the problem is implied by the message-passing functions. In OpenMP, the synchronization is explicit. We start with a barrier to make sure all threads are at the same point (that is, ready to fill the boundary they will share). A second barrier ensures that all the threads have finished filling their boundary before the receiving side uses the data.

This strategy is extremely effective for problems that involve the same amount of work on each thread. When some threads are faster than other threads, because either the hardware is less loaded or perhaps the data is not evenly distributed, the use of barriers leads to excessive parallel overhead as threads wait at the barrier. The solution is to more finely tune the synchronization to the needs of the problem, which, in this case, would indicate that pairwise synchronization should be used.

The "message-passing" OpenMP program in Fig. 6.15 shows a simple way to introduce pairwise synchronization in OpenMP. This code is much more complicated than the OpenMP program in Fig. 6.13 and Fig. 6.14. Basically, we have added an array called done that a thread uses to indicate that its buffer (that is, the boundary data) is ready to use. A thread fills its buffer, sets the flag, and then flushes it to make sure other threads can see the updated value. The thread then checks the flag for its neighbor and waits (using a so-called spin lock) until the buffer from its neighbor is ready.
Figure 6.13. OpenMP program that uses a ring of threads and a communication pattern where information is shifted to the right (continued in Fig. 6.14)
#include <stdio.h>
#include <omp.h>    // OpenMP include file

#define MAX 10   // maximum number of threads

//
// prototypes for functions to initialize the problem,
// extract the boundary region to share, and perform the
// field update. Note: the initialize routine is different
// here in that it sets up a large shared array (that is, for
// the full problem), not just a local block.
//
extern void init (int, int, double *, double *, double *);
extern void extract_boundary (int, int, double *, double *);
extern void update (int, int, double *, double *);
extern void output_results (int, double *);

int main(int argc, char **argv)
{
   int Nsize;              // Problem size (order of field matrix)
   int Bsize;              // Number of doubles in the boundary
   int Nsteps;             // Number of iterations
   double *field;
   double *incoming;
   double *boundary[MAX];  // array of pointers to a buffer to hold
                           // boundary data
   //
   // Create Team of Threads
   //
   init (Nsize, Bsize, field, boundary, incoming);

   /* continued in next figure */
/* continued from previous figure */

   #pragma omp parallel shared(boundary, field, Bsize, Nsize)
   {
      //
      // Set up the SPMD program. Note: by declaring ID and nprocs
      // inside the parallel region, we make them private to each thread.
      //
      int ID, nprocs, i, left;
      ID = omp_get_thread_num();
      nprocs = omp_get_num_threads();
      if (nprocs > MAX) { exit(-1); }

      //
      // assume a ring of processors and a communication pattern
      // where boundaries are shifted to the right.
      //
      left = (ID-1); if (left < 0) left = nprocs-1;

      for (i = 0; i < Nsteps; i++) {
         #pragma omp barrier
         extract_boundary(Nsize, Bsize, ID, nprocs, boundary[ID]);
         #pragma omp barrier
         update(Nsize, Bsize, ID, nprocs, field, boundary[left]);
      }
   }
   int done[MAX];   // an array of flags dimensioned for the
                    // boundary data
   //
   // Create Team of Threads
   //
   init (Nsize, Bsize, field, boundary, incoming);

   #pragma omp parallel shared(boundary, field, Bsize, Nsize)
   {
      //
      // Set up the SPMD program. Note: by declaring ID and nprocs
      // inside the parallel region, we make them private to each thread.
      //
      int ID, nprocs, i, left;
      ID = omp_get_thread_num();
      nprocs = omp_get_num_threads();
      if (nprocs > MAX) { exit(-1); }

      //
      // assume a ring of processors and a communication pattern
      // where boundaries are shifted to the right.
      //
      left = (ID-1); if (left < 0) left = nprocs-1;

      done[ID] = 0;   // set flag stating "buffer ready to fill"

      for (i = 0; i < Nsteps; i++) {
         #pragma omp barrier   // all visible variables flushed, so we
                               // don't need to flush "done".
         extract_boundary(Nsize, Bsize, ID, nprocs, boundary[ID]);

         done[ID] = 1;   // flag that the buffer is ready to use
         #pragma omp flush (done)

         while (done[left] != 1) {
            #pragma omp flush (done)
         }

         update(Nsize, Bsize, ID, nprocs, field, boundary[left]);

         done[left] = 0;   // set flag stating "buffer ready to fill"
      }
      output_results(Nsize, field);
   }
Although the java.io, java.net, and java.rmi.* packages provide very convenient programming abstractions and work very well in the domain for which they were designed, they incur high parallel overheads and are considered to be inadequate for high-performance computing. High-performance computers typically use homogeneous networks of computers, so the general-purpose distributed-computing approaches supported by Java's TCP/IP support result in data conversions and checks that are not usually needed in high-performance computing. Another source of parallel overhead is the emphasis in Java on portability. This led to lowest-common-denominator facilities in the design of Java's networking support. A major problem is blocking I/O. This means, for example, that a read operation on a socket will block until data is available. Stalling the application is usually avoided in practice by creating a new thread to perform the read. Because a read operation can be applied to only one socket at a time, this leads to a cumbersome and nonscalable programming style with separate threads for each communication partner.

To remedy some of these deficiencies (which were also affecting high-performance enterprise servers), new I/O facilities in the java.nio packages were introduced in Java 1.4. Because the Java specification has now been split into three separate versions (enterprise, core, and mobile), the core Java and enterprise Java APIs are no longer restricted to supporting only features that can be supported on the least capable devices. Results reported by Pugh and Sacco [PS04] indicate that the new facilities promise to provide sufficient performance to make Java a reasonable choice for high-performance computing in clusters.

The java.nio packages provide nonblocking I/O and selectors so that one thread can monitor several socket connections. The mechanism for this is a new abstraction, called a channel, that serves as an open connection to sockets, files, hardware devices, etc. A SocketChannel is used for communication over TCP or UDP connections. A program may read or write to or from a SocketChannel into a ByteBuffer using nonblocking operations. Buffers are another new abstraction introduced in Java 1.4. They are containers for linear, finite sequences of primitive types.
They maintain a state containing a current position (along with some other information) and are accessed with "put" and "get" operations that put or get an element in the slot indicated by the current position. Buffers may be allocated as direct or indirect. The space for a direct buffer is allocated
outside of the usual memory space managed by the JVM and subject to garbage collection. As a result, direct buffers are not moved by the garbage collector, and references to them can be passed to the system-level network software, eliminating a copying step by the JVM. Unfortunately, this comes at the price of more complicated programming, because put and get operations are not especially convenient to use. Typically, however, one would not use a Buffer for the main computation in the program, but would use a more convenient data structure such as an array that would be bulk copied to or from the Buffer.[6]

Many parallel programmers desire higher-level, or at least more familiar, communication abstractions than those supported by the standard packages in Java. Several groups have implemented MPI-like bindings for Java using various approaches. One approach uses the JNI (Java Native Interface) to bind to existing MPI libraries. JavaMPI [Min97] and mpiJava [BCKL98] are examples. Other approaches use new communication systems written in Java. These are not widely used and are in various states of availability. Some [JCS98] attempt to provide an MPI-like experience for the programmer following proposed standards described in [BC00], while others experiment with alternative models [Man]. An overview is given in [AJMJS02]. These systems typically use the old java.io because the java.nio package has only recently become available. An exception is the work reported by Pugh and Sacco [PS04]. Although they have not, at this writing, provided downloadable software, they describe a package providing a subset of MPI with performance measurements that suggest an optimistic outlook for the potential of Java with java.nio for high-performance computing in clusters. One can expect that in the near future more packages built on java.nio that simplify parallel programming will become available.

6.4.2. Collective Communication

When multiple (more than two) UEs participate in a single communication event, the event is called a collective communication operation. MPI, as a message-passing library, includes most of the major collective communication operations.
A reduction operation reduces a collection of data items to a single data item by repeatedly combining the items with a binary operator. That is, given data items v0, ..., v(m-1), a reduction computes

v0 o v1 o ... o v(m-1)    (Eq. 6.1)

where o is a binary operator. The most general way to implement a reduction is to perform the calculation serially from left to right, an approach that offers no potential for concurrent execution. If o is associative, however, the calculation in Eq. 6.1 contains exploitable concurrency, in that partial results over subsets of v0, ..., v(m-1) can be calculated concurrently and then combined. If o is also commutative, the calculation can be not only regrouped, but also reordered, opening up additional possibilities for concurrent execution, as described later.

Not all reduction operators have these useful properties, however, and one question to be considered is whether the operator can be treated as if it were associative and/or commutative without significantly changing the result of the calculation. For example, floating-point addition is not strictly associative (because the finite precision with which numbers are represented can cause roundoff errors, especially if the difference in magnitude between operands is large), but if all the data items to be added have roughly the same magnitude, it is usually close enough to associative to permit the parallelization strategies discussed next. If the data items vary considerably in magnitude, this may not be the case.

Most parallel programming environments provide constructs that implement reduction.
As an example of a reduction, consider the MPI program in Fig. 6.16. This is a continuation of our earlier example where we used a barrier to enforce consistent timing (Fig. 6.3). The timing is measured for each UE independently. As an indication of the load imbalance for the program, it is useful to report the time as a minimum time, a maximum time, and an average time. We do this in Fig. 6.16 with three calls to MPI_Reduce: one with the MPI_MIN function, one with the MPI_MAX function, and one with the MPI_SUM function.

An example of reduction in OpenMP can be found in Fig. 6.17. This is an OpenMP version of the program in Fig. 6.16. The variables for the number of threads (num_threads) and the average time (ave_time) are declared prior to the parallel region so they can be visible both inside and following the parallel region. We use a reduction clause on the parallel pragma to indicate that we will be doing a reduction with the + operator on the variable ave_time. The compiler will create a local copy of the variable and initialize it to the value of the identity of the operator in question (zero for
addition). Each thread sums its values into the local copy. Following the close of the parallel region, all the local copies are combined with the global copy to produce the final value.

Because we need to compute the average time, we need to know how many threads were used in the parallel region. The only way to safely determine the number of threads is by a call to the omp_get_num_threads() function inside the parallel region. Multiple writes to the same variable can interfere with each other, so we place the call to omp_get_num_threads() inside a single construct, thereby ensuring that only one thread will assign a value to the shared variable num_threads. A single in OpenMP implies a barrier, so we do not need to explicitly include one to make sure all the threads enter the timed code at the same time. Finally, note that we compute only the average time in Fig. 6.17 because min and max operators are not included in the C/C++ OpenMP version 2.0.
Figure 6.16. MPI program to time the execution of a function called runit(). We use MPI_Reduce to find minimum, maximum, and average runtimes.
#include "mpi.h"   // MPI include file
#include <stdio.h>

int main(int argc, char **argv)
{
   int num_procs;   // the number of processes in the group
   int ID;          // a unique identifier ranging from 0 to (num_procs-1)
   double local_result;
   double time_init, time_final, time_elapsed;
   double min_time, max_time, ave_time;

   //
   // Initialize MPI and set up the SPMD program
   //
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

   //
   // Ensure that all processes are set up and ready to go before timing runit()
   //
   MPI_Barrier(MPI_COMM_WORLD);
   time_init = MPI_Wtime();

   runit();   // a function that we wish to time on each process

   time_final = MPI_Wtime();
   time_elapsed = time_final - time_init;

   MPI_Reduce(&time_elapsed, &min_time, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
   MPI_Reduce(&time_elapsed, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
   MPI_Reduce(&time_elapsed, &ave_time, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

   if (ID == 0) {
      ave_time = ave_time / (double)num_procs;
      printf(" min, ave and max times (secs): %f, %f, %f\n",
             min_time, ave_time, max_time);
   }
   MPI_Finalize();
   return 0;
}
#include <stdio.h>
#include <omp.h>   // OpenMP include file

extern void runit();

int main(int argc, char **argv)
{
   int num_threads;        // the number of threads in the team
   double ave_time = 0.0;

   #pragma omp parallel reduction(+ : ave_time) num_threads(10)
   {
      double time_init, time_final, time_elapsed;

      // The single construct causes one thread to set the value of
      // the num_threads variable. The other threads wait at the
      // barrier implied by the single construct.
      #pragma omp single
         num_threads = omp_get_num_threads();

      time_init = omp_get_wtime();

      runit();   // a function that we wish to time on each thread

      time_final = omp_get_wtime();
      time_elapsed = time_final - time_init;
      ave_time += time_elapsed;   // combined across threads by the reduction
   }
   ave_time = ave_time / (double)num_threads;
If the reduction operator is not associative, or cannot be treated as associative without significantly affecting the result, it will likely be necessary to perform the entire reduction serially in a single UE, as sketched in Fig. 6.18. If only one UE needs the result of the reduction, it is probably simplest to have that UE perform the operation; if all UEs need the result, the reduction operation can be followed by a broadcast operation to communicate the result to other UEs. For simplicity, the figure shows a situation in which there are as many UEs as data items. The solution can be extended to situations in which there are more data items than UEs, but the requirement that all the actual computation be done in a single UE still holds because of the lack of associativity. (Clearly this solution completely lacks concurrency; we mention it for the sake of completeness and because it might still be useful and appropriate if the reduction operation represents only a relatively small part of a computation whose other parts do have exploitable concurrency.)
Figure 6.18. Serial reduction to compute the sum of a(0) through a(3). sum(a(i:j)) denotes the sum of elements i through j of array a.
In this approach, the individual combine-two-elements operations must be performed in sequence, so it is simplest to have them all performed by a single task. Because of the data dependencies (indicated by the arrows in the figure), however, some caution is required if the reduction operation is performed as part of a larger calculation involving multiple concurrent UEs; the UEs not performing the reduction can continue with other work only if they can do so without affecting the computation of the reduction operation (for example, if multiple UEs share access to an array and one of them is computing the sum of the elements of the array, the others should not be simultaneously modifying elements of the array). In a message-passing environment, this can usually be accomplished using message passing to enforce the data dependency constraints (that is, the UEs not performing the reduction operation send their data to the single UE actually doing the computation). In other environments, the UEs not performing the reduction operation can be forced to wait by means of a barrier.
Tree-based reduction
If all of the UEs must know the result of the reduction operation, then the recursive-doubling scheme of Fig. 6.20 is better than the tree-based approach followed by a broadcast.
Figure 6.20. Recursive-doubling reduction to compute the sum of a(0) through a(3). sum (a(i:j)) denotes the sum of elements i through j of array a.
As with the tree-based code, if the number of UEs is equal to 2^n, then the algorithm proceeds in n steps. At the beginning of the algorithm, every UE has some number of values to contribute to the reduction. These are combined locally into a single value to contribute to the reduction. In the first step, the even-numbered UEs exchange their partial sums with their odd-numbered neighbors. In the second stage, instead of immediate neighbors exchanging values, UEs two steps away interact. At the next stage, UEs four steps away interact, and so forth, doubling the reach of the interaction at each step until the reduction is complete.

At the end of n steps, each UE has a copy of the reduced value. Comparing this to the previous strategy of using a tree-based algorithm followed by a broadcast, we see the following: The reduction and the broadcast take n steps each, and the broadcast cannot begin until the reduction is complete, so the elapsed time is O(2n), and during these 2n steps many of the UEs are idle. The recursive-doubling algorithm, however, involves all the UEs at each step and produces the single reduced value at every UE after only n steps.
6.4.3. Other Communication Constructs

We have looked at only the most common message-passing constructs: point-to-point message passing and collective communication. There are many important variations in message passing. In MPI, it is possible to fine-tune performance by changing how the communication buffers are handled or to overlap communication and computation. Some of these mechanisms are described in the MPI appendix, Appendix B.

Notice that in our discussion of communication, there is a receiving UE and a sending UE. This is called two-sided communication. For some algorithms (as discussed in the Distributed Array pattern), the calculations involved in computing indices into a distributed data structure and mapping them onto the process owning the desired data can be very complicated and difficult to debug. Some of these problems can be avoided by using one-sided communication.

One-sided communication occurs when a UE manages communication with another UE without the explicit involvement of the other UE. For example, a message can be directly "put" into a buffer on another node without the involvement of the receiving UE. The MPI 2.0 standard includes a one-sided communication API. Another example of one-sided communication is an older system called GA or Global Arrays [NHL94, NHL96, NHK+02, Gloa]. GA provides a simple one-sided communication environment specialized to the problem of distributed array algorithms.
Another option is to replace explicit communication with a virtual shared memory, where the term "virtual" is used because the physical memory could be distributed. An approach that was popular in the early 1990s was Linda [CG91]. Linda is based on an associative virtual shared memory called a tuple space. The operations in Linda "put", "take", or "read" a set of values bundled together into an object called a tuple. Tuples are accessed by matching against a template, making the memory content-addressable or associative. Linda is generally implemented as a coordination language, that is, a small set of instructions that extend a normal programming language (the so-called computation language). Linda is no longer used to any significant extent, but the idea of an associative virtual shared memory inspired by Linda lives on in JavaSpaces [FHA99].

More recent attempts to hide message passing behind a virtual shared memory are the collection of languages based on the Partitioned Global Address Space model: UPC [UPC], Titanium [Tita], and Co-Array Fortran [Co]. These are explicitly parallel dialects of C, Java, and Fortran (respectively) based on a virtual shared memory. Unlike other shared memory models, such as Linda or OpenMP, the shared memory in the Partitioned Global Address Space model is partitioned and includes the concept of affinity of shared memory to particular processors. UEs can read and write each others' memory and perform bulk transfers, but the programming model takes into account nonuniform memory access, allowing the model to be mapped onto a wide range of machines, from SMPs to NUMA machines to clusters.
Endnotes
6. Pugh and Sacco [PS04] have reported that it is actually more efficient to do bulk copies between buffers and arrays before a send than to eliminate the array and perform the calculation updates directly on the buffers using put and get.
The results may differ slightly due to the nonassociativity of floating-point operations.
Figure A.1. Fortran and C programs that print a simple string to standard output
Incremental parallelism refers to a style of parallel programming in which a program evolves from a sequential program into a parallel program. A programmer starts with a working sequential program and block by block finds pieces of code that are worthwhile to execute in parallel. Thus, parallelism is added incrementally. At each phase of the process, there is a working program that can be verified, greatly increasing the chances that the project will be successful.

It is not always possible to use incremental parallelism or to create sequentially equivalent OpenMP programs. Sometimes a parallel algorithm requires complete restructuring of the analogous sequential program. In other cases, the program is constructed from the beginning to be parallel and there is no sequential program to incrementally parallelize. Also, there are parallel algorithms that do not work with one thread and hence cannot be sequentially equivalent. Still, incremental parallelism and sequential equivalence guided the design of the OpenMP API and are recommended practices.
or in Fortran with the directive
C$OMP PARALLEL
Modern languages such as C and C++ are block structured. Fortran, however, is not. Hence, the Fortran OpenMP specification defines a directive to close the parallel block:
C$OMP END PARALLEL
This pattern is used with other Fortran constructs in OpenMP; that is, one form opens the structured block, and a matching form with the word END inserted after the OMP closes the block.

In Fig. A.2, we show parallel programs in which each thread prints a string to the standard output. When this program executes, the OpenMP runtime system creates a number of threads, each of which will execute the instructions inside the parallel construct. If the programmer doesn't specify the number of threads to create, a default number is used. We will later show how to control the default number of threads, but for the sake of this example, assume it was set to three.

OpenMP requires that I/O be thread-safe. Therefore, each output record printed by one thread is printed completely without interference from other threads. The output from the program in Fig. A.2 would then look like:
E pur si muove E pur si muove E pur si muove
Figure A.3. Fortran and C programs that print a simple string to standard output
Figure A.4. Simple program to show the difference between shared and local (or private) data

#include <stdio.h>
#include <omp.h>

int main()
{
   int i = 5;   // a shared variable

   #pragma omp parallel
   {
      int c;    // a variable local or private to each thread
      c = omp_get_thread_num();
      printf("c = %d, i = %d\n", c, i);
   }
}
The number of threads to be used can also be set inside the program with the num_threads clause. For example, to create a parallel region with three threads, the programmer would use the pragma
#pragma omp parallel num_threads(3)
where directive-name identifies the construct and the optional clauses[3] modify the construct. Some examples of OpenMP pragmas in C or C++ follow:
[3]
Throughout this appendix, we will use square brackets to indicate optional syntactic elements.
#pragma omp parallel private(ii, jj, kk) #pragma omp barrier #pragma omp for reduction(+:result)
For Fortran, the situation is more complicated. We will consider only the simplest case, fixed-form[4] Fortran code. In this case, an OpenMP directive has the following form:
[4]
Fixed form refers to the fixed column conventions for statements in older versions of Fortran (Fortran 77 and earlier).
sentinel directive-name [clause[[,]clause] ... ]
where sentinel can be one of:
C$OMP    !$OMP    *$OMP
A.3. WORKSHARING
When using the parallel construct alone, every thread executes the same block of statements. There are times, however, when we need different code to map onto different threads. This is called worksharing.
The most commonly used worksharing construct in OpenMP is the construct to split loop iterations between different threads. Designing a parallel algorithm around parallel loops is an old tradition in parallel programming [X393]. This style is sometimes called loop splitting and is discussed at length in the Loop Parallelism pattern. In this approach, the programmer identifies the most time-consuming loops in the program. Each loop is restructured, if necessary, so the loop iterations are largely independent. The program is then parallelized by mapping different groups of loop iterations onto different threads.

For example, consider the program in Fig. A.5. In this program, a computationally intensive function big_comp() is called repeatedly to compute results that are then combined into a single global answer. For the sake of this example, we assume the following.
Figure A.5. Fortran and C examples of a typical loop-oriented program
Figure A.6. Fortran and C examples of a typical loop-oriented program. In this version of the program, the computationally intensive loop has been isolated and modified so the iterations are independent.
* The combine() routine does not take much time to run.

* The combine() function must be called in the sequential order.
The first step is to make the loop iterations independent. One way to accomplish this is shown in Fig. A.6. Because the combine() function must be called in the same order as in the sequential program, there is an extra ordering constraint introduced into the parallel algorithm. This creates a dependency between the iterations of the loop. If we want to run the loop iterations in parallel, we need to remove this dependency.

In this example, we've assumed the combine() function is simple and doesn't take much time. Hence, it should be acceptable to run the calls to combine() outside the parallel region. We do this by placing each intermediate result computed by big_comp() into an element of an array. Then the array elements can be passed to the combine() function in the sequential order in a separate loop. This code transformation preserves the meaning of the original program (that is, the results are identical between the parallel code and the original version of the program).

With this transformation, the iterations of the first loop are independent and they can be safely computed in parallel. To divide the loop iterations among multiple threads, an OpenMP worksharing construct is used. This construct assigns loop iterations from the immediately following loop onto a team of threads. Later we will discuss how to control the way the loop iterations are scheduled, but for now, we leave it to the system to figure out how the loop iterations are to be mapped onto the threads. The parallel versions are shown in Fig. A.7.
Figure A.7. Fortran and C examples of a typical loop-oriented program parallelized with OpenMP
The OpenMP parallel construct is used to create the team of threads. This is followed by the worksharing construct to split up loop iterations among the threads: a DO construct in the case of Fortran and a for construct for C/C++. The program runs correctly in parallel and preserves sequential equivalence because no two threads update the same variable and any operations (such as calls to the combine() function) that do not commute or are not associative are carried out in the sequential order. Notice that, according to the rules we gave earlier concerning the sharing of variables, the loop control variable i would be shared between threads. The OpenMP specification, however, recognizes that it never makes sense to share the loop control index on a parallel loop, so it automatically creates a private copy of the loop control index for each thread.

By default, there is an implicit barrier at the end of any OpenMP workshare construct; that is, all the threads wait at the end of the construct and only proceed after all of the threads have arrived. This barrier can be removed by adding a nowait clause to the worksharing construct:
#pragma omp for nowait
As a shortcut, OpenMP defines a combined construct:
#pragma omp parallel for . . .
This is identical to the case where the parallel and for constructs are placed within separate pragmas.
int main ()
{
   int i;
   int num_steps = 1000000;
   double x, pi, step, sum = 0.0;

   step = 1.0/(double) num_steps;

   #pragma omp parallel for private(x) reduction(+:sum)
   for (i = 0; i < num_steps; i++) {
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
   }
   pi = step * sum;
   printf("pi %lf\n", pi);
   return 0;
}
Several other clauses change how variables are shared between threads. The most commonly used ones follow.
firstprivate(list). Just as with private, for each name appearing in the list, a new private variable for that name is created for each thread. Unlike private, however, the newly created variables are initialized with the value of the variable that was bound to the name in the region of code preceding the construct containing the firstprivate clause. This clause can be used on both parallel and the worksharing constructs.

lastprivate(list). Once again, private variables for each thread are created for each name in the list. In this case, however, the value of the private variable from the sequentially last loop iteration is copied out into the variable bound to the name in the region following the OpenMP construct containing the lastprivate clause. This clause can only be used with the loop-oriented workshare constructs.
Variables generally can appear in the list of only a single data clause. The exception is with lastprivate and firstprivate, because it is quite possible that a private variable will need both a well-defined initial value and a value exported to the region following the OpenMP construct in question.

An example using these clauses is provided in Fig. A.9. We declare four variables private to each thread and assign values to three of them: h = 1, j = 2, and k = 0. Following the parallel loop, we print the values of h, j, and k. The variable k is well defined. As each thread executes a loop iteration, the variable k is incremented. It is therefore a measure of how many iterations each thread handled.
Figure A.9. C program showing use of the private, firstprivate, and lastprivate clauses. This program is incorrect in that the variables h and j do not have well-defined values when the printf is called. Notice the use of a backslash to continue the OpenMP pragma onto a second line.

#include <stdio.h>
#include <omp.h>
#define N 1000

int main()
{
   int h, i, j, k;
   h = 1; j = 2; k = 0;

   #pragma omp parallel for private(h) firstprivate(j,k) \
                            lastprivate(j,k)
   for (i = 0; i < N; i++) {
      k++;
      j = h + i;   // ERROR: h, and therefore j, is undefined
   }
   printf("h = %d, j = %d, k = %d\n", h, j, k);  // ERROR: j and h
                                                 // are undefined
}
The value passed outside the loop because of the lastprivate clause is the value of k for whichever thread executed the sequentially last iteration of the loop (that is, the iteration for which i = 999). The values of both h and j are undefined, but for different reasons. The variable j is undefined because it was assigned the value from a sum with an uninitialized variable (h) inside the parallel loop. The problem with the variable h is more subtle. It was declared as a private variable, but its value was unchanged inside the parallel loop. OpenMP stipulates, however, that after a name appears in any of the private clauses, the variable associated with that name in the region of code following the OpenMP construct is undefined. Hence, the print statement following the parallel for does not have a well-defined value of h to print.

A few other clauses affect the way variables are shared, but we do not use them in this book and hence will not discuss them here.

The final clause we will discuss that affects how data is shared is the reduction clause. Reductions were discussed at length in the Implementation Mechanisms design space. A reduction is an operation that, using a binary, associative operator, combines a set of values into a single value. Reductions are very common and are included in most parallel programming environments. In OpenMP, the reduction clause defines a list of variable names and a binary operator. For each name in the list, a private variable is created and initialized with the value of the identity element for the binary operator (for example, zero for addition). Each thread carries out the reduction into its copy of the local variable associated with each name in the list. At the end of the construct containing the reduction clause, the local values are combined with the associated value prior to the OpenMP construct in question to define a single value. This value is assigned to the variable with the same name in the region following the OpenMP construct containing the reduction.

An example of a reduction was given in Fig. A.8. In the example, the reduction clause was applied with the + operator to compute a summation and leave the result in the variable sum.
Although the most common reduction involves summation, OpenMP also supports reductions in Fortran for the operators +, *, -, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, IOR, and IEOR. In C and C++, OpenMP supports reduction with the standard C/C++ operators +, *, -, &, |, ^, &&, and ||. C and C++ do not include a number of useful intrinsic functions such as "min" or "max" within the language definition. Hence, OpenMP cannot provide reductions in such cases; if they are required, the programmer must code them explicitly by hand. More details about reduction in OpenMP are given in the OpenMP specification [OMP].
The output records would be complete and nonoverlapping because OpenMP requires that the I/O libraries be thread-safe. Which thread prints its record when, however, is not specified, and any valid interleaving of the output records can occur.
A.6. SYNCHRONIZATION
Many OpenMP programs can be written using only the parallel and parallel for (parallel do in Fortran) constructs. There are algorithms, however, where one needs more careful control over how variables are shared. When multiple threads read and write shared data, the programmer must ensure that the threads do not interfere with each other, so that the program returns the same results regardless of how the threads are scheduled. This is of critical importance since as a multithreaded program runs, any semantically allowed interleaving of the instructions could actually occur. Hence, the programmer must manage reads and writes to shared variables to ensure that threads read the correct value and that multiple threads do not try to write to a variable at the same time.

Synchronization is the process of managing shared resources so that reads and writes occur in the correct order regardless of how the threads are scheduled. The concepts behind synchronization are discussed in detail in the section on synchronization (Section 6.3) in the Implementation Mechanisms design space. Our focus here will be on the syntax and use of synchronization in OpenMP.

Consider the loop-based program in Fig. A.5 earlier in this chapter. We used this example to introduce worksharing in OpenMP, at which time we assumed the combination of the computed results (res) did not take much time and had to occur in the sequential order. Hence, it was no problem to store intermediate results in a shared array and later (within a serial region) combine the results into the final answer.

In the more common case, however, results of big_comp() can be accumulated in any order as long as the accumulations do not interfere. To make things more interesting, we will assume that the combine() and big_comp() routines are both time-consuming and take unpredictable and widely varying amounts of time to execute. Hence, we need to bring the combine() function into the parallel region and use synchronization constructs to ensure that parallel calls to the combine() function do not interfere.

The major synchronization constructs in OpenMP are the following.
Typically it is needed only by programmers building their own low-level synchronization primitives.
where name is an identifier that can be used to support disjoint sets of critical sections. A critical section implies a call to flush on entry to and on exit from the critical section.
barrier provides a synchronization point at which the threads wait until every member of the team has arrived before any threads continue. The syntax of a barrier is
#pragma omp barrier
void omp_init_lock(omp_lock_t *lock). Initialize the lock.

void omp_destroy_lock(omp_lock_t *lock). Destroy the lock, thereby freeing any memory associated with the lock.

void omp_set_lock(omp_lock_t *lock). Set or acquire the lock. If the lock is free, then the thread calling omp_set_lock() will acquire the lock and continue. If the lock is held by another thread, the thread calling omp_set_lock() will wait on the call until the lock is available.

void omp_unset_lock(omp_lock_t *lock). Unset or release the lock so some other thread can acquire it.

int omp_test_lock(omp_lock_t *lock). Test or inquire if the lock is available. If it is, then the thread calling this function will acquire the lock and continue. The test and acquisition of the lock is done atomically. If it is not, the function returns false (zero) and the calling thread continues. This function is used so a thread can do useful work while waiting for a lock.
The lock functions guarantee that the lock variable itself is consistently updated between threads, but do not imply a flush of other variables. Therefore, programmers using locks must call flush explicitly as needed. An example of a program using OpenMP locks is shown in Fig. A.12. The program declares and then initializes the lock variables at the beginning of the program. Because this occurs prior to the parallel region, the lock variables are shared between the threads. Inside the parallel region, the first lock is used to make sure only one thread at a time tries to print a message to the standard output. The lock is needed to ensure that the two printf statements are executed together and not interleaved with those of other threads. The second lock is used to ensure that only one thread at a time executes the go_for_it() function, but this time, the omp_test_lock() function is used so a thread can do useful work while waiting for the lock. After the parallel region completes, the memory associated with the locks is freed by a call to omp_destroy_lock().
Figure A.12. Example showing how the lock functions in OpenMP are used
#include <stdio.h>
#include <omp.h>

extern void do_something_else(int);
extern void go_for_it(int);

int main()
{
   omp_lock_t lck1, lck2;
   int id;

   omp_init_lock(&lck1);
   omp_init_lock(&lck2);

   #pragma omp parallel shared(lck1, lck2) private(id)
   {
      id = omp_get_thread_num();

      omp_set_lock(&lck1);
      printf("thread %d has the lock \n", id);
      printf("thread %d ready to release the lock \n", id);
      omp_unset_lock(&lck1);

      while (! omp_test_lock(&lck2)) {
         do_something_else(id);  // do something useful while waiting
                                 // for the lock
      }
      go_for_it(id);             // Thread has the lock
      omp_unset_lock(&lck2);
   }
   omp_destroy_lock(&lck1);
   omp_destroy_lock(&lck2);
}
schedule(static [,chunk]). The iteration space is divided into blocks of size chunk. If chunk is omitted, then the block size is selected to provide one approximately equal-sized block per thread. The blocks are dealt out to the threads that make up a team in a round-robin fashion. For example, a chunk size of 2 for 3 threads and 12 iterations will create 6 blocks containing iterations (0,1), (2,3), (4,5), (6,7), (8,9), and (10,11) and assign them to threads as [(0,1), (6,7)] to one thread, [(2,3), (8,9)] to another thread, and [(4,5), (10,11)] to the last thread.

schedule(dynamic [,chunk]). The iteration space is divided into blocks of size chunk. If chunk is omitted, the block size is set to 1. Each thread is initially given one block of iterations to work with. The remaining blocks are placed in a queue. When a thread finishes with its current block, it pulls the next block of iterations that need to be computed off the queue. This continues until all the iterations have been computed.

schedule(guided [,chunk]). This is a variation of the dynamic schedule optimized to decrease scheduling overhead. As with dynamic scheduling, the loop iterations are divided into blocks, and each thread is assigned one block initially and then receives an additional block when it finishes the current one. The difference is the size of the blocks: The size of the first block is implementation dependent, but large; block size is decreased rapidly for subsequent blocks, down to the value specified by chunk. This method has the benefits of the dynamic schedule, but by starting with large block sizes, the number of scheduling decisions at runtime, and hence the parallel overhead, is greatly reduced.

schedule(runtime). The runtime schedule stipulates that the actual schedule and chunk size for the loop is to be taken from the value of the environment variable OMP_SCHEDULE. This lets a programmer try different schedules without having to recompile for each trial.
Consider the loop-based program in Fig. A.13. Because the schedule clause is the same for both Fortran and C, we will only consider the case for C. If the runtimes associated with different loop iterations change unpredictably as the program runs, a static schedule is probably not going to be effective. We will therefore use the dynamic schedule. Scheduling overhead is a serious problem, however, so to minimize the number of scheduling decisions, we will start with a block size of 10 iterations per scheduling decision. There are no firm rules, however, and OpenMP programmers usually experiment with a range of schedules and chunk sizes until the optimum values are found. For example, parallel loops in programs such as the one in Fig. A.13 can also be effectively scheduled with a static schedule as long as the chunk size is small enough that work is equally distributed among threads.
Figure A.13. Parallel version of the program in Fig. A.11, modified to show the use of the schedule clause

#include <stdio.h>
#include <omp.h>
#define N 1000
extern void combine(double,double);
extern double big_comp(int);

int main()
{
   int i;
   double answer, res;
   answer = 0.0;
   #pragma omp parallel for private(res) schedule(dynamic,10)
   for (i=0; i<N; i++){
      res = big_comp(i);
      #pragma omp critical
      combine(answer,res);
   }
   printf("%f\n", answer);
}
distributed among a collection of processes, scatter data across a collection of processes, and much more.
[1]
The UEs are processes in MPI.
MPI was created in the early 1990s to provide a common message-passing environment that could run on clusters, MPPs, and even shared-memory machines. MPI is distributed in the form of a library, and the official specification defines bindings for C and Fortran, although bindings for other languages have been defined as well. The overwhelming majority of MPI programmers today use MPI version 1.1 (released in 1995). The specification of an enhanced version, MPI 2.0, with parallel I/O, dynamic process management, one-sided communication, and other advanced features was released in 1997. Unfortunately, it was such a complex addition to the standard that at the time this is being written (more than six years after the standard was defined), only a handful of MPI implementations support MPI 2.0. For this reason, we will focus on MPI 1.1 in this discussion.

There are several implementations of MPI in common use. The two most common are LAM/MPI [LAM] and MPICH [MPI]. Both may be downloaded free of charge and are straightforward to install using the instructions that come with them. They support a wide range of parallel computers, including Linux clusters, NUMA computers, and SMPs.
B.1. CONCEPTS
The basic idea of passing a message is deceptively simple: One process sends a message and another one receives it. Digging deeper, however, the details behind message passing become much more complicated: How are messages buffered within the system? Can a process do useful work while it is sending or receiving messages? How can messages be identified so that sends are always paired with their intended receives?

The long-term success of MPI is due to its elegant solution to these (and other) problems. The approach is based on two core elements of MPI: process groups and a communication context. A process group is a set of processes involved in a computation. In MPI, all the processes involved in the computation are launched together when the program starts and belong to a single group. As the computation proceeds, however, the programmer can divide the processes into subgroups and precisely control how the groups interact.

A communication context provides a mechanism for grouping together sets of related communications. In any message-passing system, messages must be labeled so they can be delivered to the intended destination or destinations. The message labels in MPI consist of the ID of the sending process, the ID of the intended receiver, and an integer tag. A receive statement includes parameters indicating a source and tag, either or both of which may be wildcards. The result, then, of executing a receive statement at process i is the delivery of a message with destination i whose source and tag match those in the receive statement.

While straightforward, identifying messages with source, destination, and tag may not be adequate in complex applications, particularly those that include libraries or other functions reused from other programs. Often, the application programmer doesn't know any of the details about this borrowed code, and if the library includes calls to MPI, the possibility exists that messages in the application and the library might accidentally share tags, destination IDs, and source IDs. This could lead to
Next, the MPI environment must be initialized. As part of the initialization, the communicator shared by all the processes is created. As mentioned earlier, this is called MPI_COMM_WORLD:
MPI_Init(&argc, &argv);
Figure B.2. Parallel program in which each process prints a simple string to the standard output

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
   int num_procs; // the number of processes in the group
   int ID;        // a unique identifier ranging from 0 to (num_procs-1)

   if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
      // print error message and exit
   }
   MPI_Comm_size (MPI_COMM_WORLD, &num_procs);
   MPI_Comm_rank (MPI_COMM_WORLD, &ID);
   printf("\n Never miss a good chance to shut up %d \n", ID);
   MPI_Finalize();
}
The rank is an integer ranging from zero to the number of processes in the group minus one. It indicates the position of each process within the process group.
int num_procs; // the number of processes in the group
int ID;        // a unique identifier ranging from 0 to (num_procs-1)

MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
MPI_Comm_rank(MPI_COMM_WORLD, &ID);
When the MPI program finishes running, the environment needs to be cleanly shut down. This is done with this function:
MPI_Finalize();
Figure B.3. The standard blocking point-to-point communication routines in the C binding for MPI 1.1
#include <stdio.h>   // standard I/O include file
#include "memory.h"  // standard include file with function
                     // prototypes for memory management
#include "mpi.h"     // MPI include file

int main(int argc, char **argv)
{
   int Tag1 = 1;
   int Tag2 = 2;           // message tags
   int num_procs;          // the number of processes in the group
   int ID;                 // a unique identifier ranging from 0
                           // to (num_procs-1)
   int buffer_count = 100; // number of items in message to bounce
   long *buffer;           // buffer to bounce between processes
   int i;
   MPI_Status stat;        // MPI status parameter

   MPI_Init(&argc, &argv);                    // Initialize MPI environment
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);        // Find rank of this
                                              // process in the group
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs); // Find number of
                                              // processes in the group

   if (num_procs != 2) MPI_Abort(MPI_COMM_WORLD, 1);

   buffer = (long *)malloc(buffer_count * sizeof(long));
   for (i=0; i<buffer_count; i++)  // fill buffer with some values
      buffer[i] = (long) i;

   if (ID == 0) {
      MPI_Send (buffer, buffer_count, MPI_LONG, 1, Tag1, MPI_COMM_WORLD);
      MPI_Recv (buffer, buffer_count, MPI_LONG, 1, Tag2, MPI_COMM_WORLD,
                &stat);
   }
   else {
      MPI_Recv (buffer, buffer_count, MPI_LONG, 0, Tag1, MPI_COMM_WORLD,
                &stat);
      MPI_Send (buffer, buffer_count, MPI_LONG, 0, Tag2, MPI_COMM_WORLD);
   }
   MPI_Finalize();
}
The program opens with the regular include files and declarations. We initialize MPI as the first executable statement in the program. This could be done later, the only firm requirement being that initialization must occur before any MPI routines are called. Next, we determine the rank of each process and the size of the process group. To keep the example short and simple, the program has been written assuming there will only be two processes in the group. We verify this fact (aborting the program if necessary) and then allocate memory to hold the buffer.

The communication itself is split into two parts. In the process with rank equal to 0, we send and then receive. In the other process, we receive and then send. Matching up sends and receives in this way (sending and then receiving on one side of a pairwise exchange of messages while receiving and then sending on the other) can be important with blocking sends and large messages. To understand the issues, consider a problem where two processes (ID 0 and ID 1) both need to send a message (from buffer outgoing) and receive a message (into buffer incoming). Because this is an SPMD program, it is tempting to write this as
int neigh; // the rank of the neighbor process

if (ID == 0) neigh = 1;
else neigh = 0;

MPI_Send (outgoing, buffer_count, MPI_LONG, neigh, Tag1, MPI_COMM_WORLD);
// ERROR: possibly deadlocking program
MPI_Recv (incoming, buffer_count, MPI_LONG, neigh, Tag2,
          MPI_COMM_WORLD, &stat);
MPI_Barrier. A barrier defines a synchronization point at which all processes must arrive before any of them are allowed to proceed. For MPI, this means that every process using the indicated communicator must call the barrier function before any of them proceed. This function is described in detail in the Implementation Mechanisms design space.

MPI_Bcast. A broadcast sends a message from one process to all the processes in a group.

MPI_Reduce. A reduction operation takes a set of values (in the buffer pointed to by inbuff) spread out around a process group and combines them using the indicated binary operation. To be meaningful, the operation in question must be associative. The most common examples for the binary function are summation and finding the maximum or minimum of a set of values. Notice that the final reduced value (in the buffer pointed to by outbuff) is only available in the indicated destination process. If the value is needed by all processes, there is a variant of this routine called MPI_Allreduce(). Reductions are described in detail in the Implementation Mechanisms design space.
Figure B.6. Program to time the ring function as it passes messages around a ring of processes (continued in Fig. B.7). The program returns the time from the process that takes the longest elapsed time to complete the communication. The code to the ring function is not relevant for this example, but it is included in Fig. B.8.
//
// Ring communication test.
// Command-line arguments define the size of the message
// and the number of times it is shifted around the ring:
//
//     a.out msg_size num_shifts
//
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <memory.h>

int main(int argc, char **argv)
{
   int num_shifts = 0;  // number of times to shift the message
   int buff_count = 0;  // number of doubles in the message
   int num_procs = 0;   // number of processes in the ring
   int ID;              // process (or node) id
   int buff_size_bytes; // message size in bytes
   int i;
   double t0;           // wall-clock time in seconds
   double ring_time;    // ring comm time - this process
   double max_time;     // ring comm time - max time for all processes
   double *x;           // the outgoing message
   double *incoming;    // the incoming message
   MPI_Status stat;

   // Initialize the MPI environment
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

   // Process, test, and broadcast input parameters
   if (ID == 0){
      if (argc != 3){
         printf("Usage: %s <size of message> <Num of shifts> \n", *argv);
         fflush(stdout);
         MPI_Abort(MPI_COMM_WORLD, 1);
      }
      buff_count = atoi(*++argv);
      num_shifts = atoi(*++argv);
      printf(": shift %d doubles %d times \n", buff_count, num_shifts);
      fflush(stdout);
   }
   // Continued in the next figure
Figure B.7. Program to time the ring function as it passes messages around a ring of processes (continued from Fig. B.6)
// Continued from the previous figure // Broadcast data from rank 0 process to all other processes MPI_Bcast (&buff_count, 1, MPI_INT, 0, MPI_COMM_WORLD); MPI_Bcast (&num_shifts, 1, MPI_INT, 0, MPI_COMM_WORLD); // Allocate space and fill the outgoing ("x") and "incoming" vectors. buff_size_bytes = buff_count * sizeof(double); x = (double*)malloc(buff_size_bytes); incoming = (double*)malloc(buff_size_bytes); for(i=0;i<buff_count;i++){ x[i] = (double) i; incoming[i] = -1.0; } // Do the ring communication tests. MPI_Barrier(MPI_COMM_WORLD); t0 = MPI_Wtime(); /* code to pass messages around a ring */ ring (x,incoming,buff_count,num_procs,num_shifts,ID); ring_time = MPI_Wtime() - t0; // Analyze results MPI_Barrier(MPI_COMM_WORLD); MPI_Reduce(&ring_time, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD); if(ID == 0){ printf("\n Ring test took %f seconds", max_time); } MPI_Finalize(); }
Figure B.5. The major collective communication routines in the C binding to MPI 1.1 (MPI_Barrier, MPI_Bcast, and MPI_Reduce)
MPI_Wtime() returns the time in seconds since some arbitrary point in the past. Usually the time interval of interest is computed by calling this function twice.

This program begins as most MPI programs do, with declarations, MPI initialization, and finding the rank and number of processes. We then process the command-line arguments to determine the message size and number of times to shift the message around the ring of processes. Every process will need these values, so the MPI_Bcast() function is called to broadcast these values.

We then allocate space for the outgoing and incoming vectors to be used in the ring test. To produce consistent results, every process must complete the initialization before any processes enter the timed section of the program. This is guaranteed by calling the MPI_Barrier() function just before the timed section of code. The time function is then called to get an initial time, the ring test itself is called, and then the time function is called a second time. The difference between these two calls to the time function is the elapsed time this process spent passing messages around the ring.
Figure B.8. Function to pass a message around a ring of processes. It is deadlock-free because the sends and receives are split between the even and odd processes.
/*******************************************************************
NAME:     ring
PURPOSE:  This function does the ring communication, with the
          odd numbered processes sending then receiving while the
          even processes receive and then send. The sends are
          blocking sends, but this version of the ring test still is
          deadlock-free since each send always has a posted receive.
*******************************************************************/
#define IS_ODD(x) ((x)%2)  /* test for an odd int */
#include "mpi.h"

ring( double *x,        /* message to shift around the ring */
      double *incoming, /* buffer to hold incoming message */
      int buff_count,   /* size of message */
      int num_procs,    /* total number of processes */
      int num_shifts,   /* numb of times to shift message */
      int my_ID)        /* process id number */
{
   int next;  /* process id of the next process */
   int prev;  /* process id of the prev process */
   int i;
   MPI_Status stat;

   /****************************************************************
   ** In this ring method, odd processes snd/rcv and even
   ** processes rcv/snd.
   ****************************************************************/
   next = (my_ID + 1) % num_procs;
   prev = ((my_ID == 0) ? (num_procs-1) : (my_ID-1));

   if( IS_ODD(my_ID) ){
      for(i=0; i<num_shifts; i++){
         MPI_Send (x, buff_count, MPI_DOUBLE, next, 3, MPI_COMM_WORLD);
         MPI_Recv (incoming, buff_count, MPI_DOUBLE, prev, 3,
                   MPI_COMM_WORLD, &stat);
      }
} else{ for(i=0;i<num_shifts; i++){ MPI_Recv (incoming, buff_count, MPI_DOUBLE, prev, 3, MPI_COMM_WORLD, &stat); MPI_Send (x, buff_count, MPI_DOUBLE, next, 3, MPI_COMM_WORLD); } } }
Figure B.10. Program using nonblocking communication to iteratively update a field using an algorithm that requires only communication around a ring (shifting messages to the right)
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

// Update a distributed field with a local N by N block on each process
// (held in the array U). The point of this example is to show
// communication overlapped with computation, so code for other
// functions is not included.
#define N 100  // size of an edge in the square field.

extern void initialize(int, double*);
extern void extract_boundary(int, double*, double*);
extern void update_interior(int, double*);
extern void update_edge(int, double*, double*);

int main(int argc, char **argv)
{
   double *U, *B, *inB;
   int i, num_procs, ID, left, right, Nsteps = 100;
   MPI_Status status;
   MPI_Request req_recv, req_send;

   // Initialize the MPI environment
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &ID);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

   // allocate space for the field (U), and the buffers
   // to send and receive the edges (B, inB)
   U   = (double*)malloc(N*N * sizeof(double));
   B   = (double*)malloc(N * sizeof(double));
   inB = (double*)malloc(N * sizeof(double));

   // Initialize the field and set up a ring communication pattern
   initialize(N, U);
   right = ID + 1;  if (right > (num_procs-1)) right = 0;
   left  = ID - 1;  if (left < 0) left = num_procs-1;

   // Iteratively update the field U
   for(i=0; i<Nsteps; i++){
      MPI_Irecv(inB, N, MPI_DOUBLE, left, i, MPI_COMM_WORLD, &req_recv);
      extract_boundary(N, U, B);  // Copy the edge of U into B
      MPI_Isend(B, N, MPI_DOUBLE, right, i, MPI_COMM_WORLD, &req_send);
      update_interior(N, U);
      MPI_Wait(&req_recv, &status);
      MPI_Wait(&req_send, &status);
      update_edge(ID, inB, U);
   }
   MPI_Finalize();
}
Figure B.11. Function to pass a message around a ring of processes using persistent communication
/*******************************************************************
NAME:     ring_persistent
PURPOSE:  This function uses the persistent communication request
          mechanism to implement the ring communication in MPI.
*******************************************************************/
#include "mpi.h"
#include <stdio.h>

ring_persistent(
      double *x,        /* message to shift around the ring */
      double *incoming, /* buffer to hold incoming message */
      int buff_count,   /* size of message */
      int num_procs,    /* total number of processes */
      int num_shifts,   /* numb of times to shift message */
      int my_ID)        /* process id number */
{
   int next;  /* process id of the next process */
   int prev;  /* process id of the prev process */
   int i;
   MPI_Request snd_req;  /* handle to the persistent send */
   MPI_Request rcv_req;  /* handle to the persistent receive */
   MPI_Status stat;

   /****************************************************************
   ** In this ring method, first post all the sends and then pick up
   ** the messages with the receives.
   ****************************************************************/
   next = (my_ID + 1) % num_procs;
   prev = ((my_ID == 0) ? (num_procs-1) : (my_ID-1));

   MPI_Send_init(x, buff_count, MPI_DOUBLE, next, 3,
                 MPI_COMM_WORLD, &snd_req);
   MPI_Recv_init(incoming, buff_count, MPI_DOUBLE, prev, 3,
                 MPI_COMM_WORLD, &rcv_req);

   for(i=0; i<num_shifts; i++){
      MPI_Start(&snd_req);
      MPI_Start(&rcv_req);
      MPI_Wait(&snd_req, &stat);
      MPI_Wait(&rcv_req, &stat);
   }
}
Most of the MPI examples in this book use standard mode. We used the synchronous mode communication to implement mutual exclusion in the Implementation Mechanisms design space. Information about the other modes can be found in the MPI specification [Mesb].
For example, we show C and Fortran versions of MPI's reduction routine in Fig. B.12.
Figure B.12. Comparison of the C and Fortran language bindings for the reduction routine in MPI 1.1
print *, "Process ", ID, " out of ", Nprocs call MPI_FINALIZE( ierr ) end
B.7. CONCLUSION
MPI is by far the most commonly used API for parallel programming. It is often called the "assembly code" of parallel programming. MPI's low-level constructs are closely aligned to the MIMD model of parallel computers. This allows MPI programmers to precisely control how the parallel computation unfolds and write highly efficient programs. Perhaps even more important, this lets programmers write portable parallel programs that run well on shared-memory machines, massively parallel supercomputers, clusters, and even over a grid.

Learning MPI can be intimidating. It is huge, with more than 125 different functions in MPI 1.1. The large size of MPI does make it complex, but most programmers avoid this complexity and use only a small subset of MPI. Many parallel programs can be written with just six functions: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv, and MPI_Finalize. Good sources of more information about MPI include [Pac96], [GLS99], and [GS98]. Versions of MPI are available for most computer systems, usually in the form of open source software readily available online. The most commonly used versions of MPI are LAM/MPI [LAM] and MPICH [MPI].
C.3 SYNCHRONIZED BLOCKS
C.4 WAIT AND NOTIFY
C.5 LOCKS
C.6 OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
C.7 INTERRUPTS

Java is an object-oriented programming language that provides language support for expressing concurrency in shared-memory programs. Java's support for polymorphism can be exploited to write frameworks directly supporting some of the patterns described in this book. The framework provides the infrastructure for the pattern; the application programmer adds subclasses containing the application-specific code. An example is found in the Examples section of the Pipeline pattern. Because Java programs are typically compiled to an intermediate language called Java bytecode, which is then compiled and/or interpreted on the target machine, Java programs enjoy a high degree of portability.

Java can also be used for distributed computing, and the standard libraries provide support for several mechanisms for interprocess communication in distributed systems. A brief overview of these is given in the Implementation Mechanisms design space. In addition, there is an ever-growing collection of packages that allow concurrency and interprocess communication to be expressed at higher levels than with the facilities provided by the language and core packages. Currently, the most common types of applications that exploit Java's facilities for concurrency are multithreaded server-side applications, graphical user interfaces, and programs that are naturally distributed due to their use of data and/or resources in different locations.

The performance available from current implementations of Java is not as good as that from programming languages more typically used in high-performance scientific computing. However, the ubiquity and portability of Java, along with the language support for concurrency, make it an important platform. For many people, a Java program may be their first concurrent program. In addition, continually improving compiler technology and libraries should reduce the performance gap in the future.
Figure C.1. A class holding pairs of objects of an arbitrary type. Without generic types, this would have been done by declaring x and y to be of type Object, requiring casting the returned values of getX and getY. In addition to less-verbose programs, this allows type errors to be found by the compiler rather than throwing a ClassCastException at runtime.
/* class holding two objects of arbitrary type, providing swap and
   accessor methods */
class Pair<Gentype>  //Gentype is a type variable
{
   private Gentype x,y;

   void swap(){Gentype temp = x; x = y; y = temp;}

   public Gentype getX(){return x;}
   public Gentype getY(){return y;}

   Pair(Gentype x, Gentype y){this.x = x; this.y = y;}

   public String toString(){return "(" + x + "," + y + ")";}
}

/* The following method, defined in some class, instantiates
   two Pair objects */
public static void test()
{
   /*create a Pair holding Strings.*/
   Pair<String> p1 = new Pair<String>("hello", "goodbye");

   /*create a Pair holding Integers. Autoboxing automatically creates
     Integer objects from the int literals, replacing
     new Pair<Integer>(new Integer(1), new Integer(2)) */
   Pair<Integer> p2 = new Pair<Integer>(1,2);

   /*do something with the Pairs.*/
   System.out.println(p1); p1.swap(); System.out.println(p1);
   System.out.println(p2); p2.swap(); System.out.println(p2);
}
The version available at press time was J2SE 1.5.0 Beta 1.
Java 2 1.5 also provides significantly enhanced support for concurrent programming via several changes in the java.lang and java.util packages and the introduction of new packages java.util.concurrent, java.util.concurrent.atomic, and java.util.concurrent.locks. A new, more precise specification of the memory model has also been incorporated. The facilities for concurrent programming are described in [JSRb, JSRc] and the new memory model is described in [JSRa].

It isn't feasible to give a complete introduction to Java in this appendix. Hence, we assume that the reader has some familiarity with the language and focus on those aspects that are most relevant to shared-memory parallel programming, including new features introduced in Java 2 1.5. Introductions to Java and its (pre-Java 2 1.5) libraries can be found in [AGH00, HC02, HC01]. An excellent discussion of concurrent programming in Java is [Lea00a].
A thread can access all variables visible to its run method according to the usual Java scope rules. Thus, by making variables visible to multiple threads, memory can be shared between them. By encapsulating variables, memory can be protected from access by other threads. In addition, a variable can be marked final, which means that, once initialized, its value will not change. Provided the initialization is done properly, final variables can be accessed by multiple threads without requiring synchronization. (However, marking a reference final does not guarantee immutability of the referenced object; care must be taken to ensure that it is thread-safe as well.)

There are two different ways to specify the run method that a thread will execute. One is to extend the Thread class and override its run method. Then, one simply instantiates an object of the subclass and calls its inherited start method. The more common approach is to instantiate a Thread object using the constructor that takes a Runnable object as a parameter. As before, invoking the Thread object's start method begins the thread's execution. The Runnable interface contains a public void run method; the newly created thread executes the Runnable object's run method. Because the run method is parameterless, information is typically passed to the thread via data members of the Thread subclass or the Runnable. These are typically set in the constructor.

In the following example, the class ThinkParallel implements the Runnable interface (that is, it declares that it implements the interface and provides a public void run() method). The main method creates and starts four Thread objects. Each of these is passed a ThinkParallel object, whose id field has been set to the value of loop counter i. This causes four threads to be created, each executing the run method of the ThinkParallel class.
Figure C.2. Program to create four threads, passing a Runnable in the Thread constructor. Thread-specific data is held in a field of the Runnable object.
class ThinkParallel implements Runnable
{
   int id;  //thread-specific variable containing thread ID

   /*The run method defines the thread's behavior*/
   public void run()
   {
      System.out.println(id + ": Are we there yet?");
   }

   /*Constructor sets id*/
   ThinkParallel(int id){this.id = id;}

   /*main method instantiates and starts the threads*/
   public static void main(String[] args)
   {
      /*create and start 4 Thread objects, passing each
        a ThinkParallel object */
      for(int i = 0; i != 4; i++)
      {
         new Thread(new ThinkParallel(i)).start();
      }
   }
}
C.1.1. Anonymous Inner Classes

A common programming idiom found in several of the examples uses an anonymous class to define the Runnables or Threads at the point where they are created. This often makes reading the source code more convenient and avoids file and class clutter. We show how to rewrite the example of Fig. C.2 using this idiom in Fig. C.3. Variables that are local to a method cannot be mentioned in an anonymous class unless declared final, and are thus immutable after being assigned a value. This is the reason for introducing the seemingly redundant variable j.

C.1.2. Executors and Factories

The java.util.concurrent package provides the Executor interface and its subinterface ExecutorService along with several implementations. An Executor executes Runnable objects while hiding the details of thread creation and scheduling.[2] Thus, instead of instantiating a Thread object and passing it a Runnable to execute a task, one would instantiate an Executor object and then use the Executor's execute method to execute the Runnable. For example, a ThreadPoolExecutor manages a pool of threads and arranges for the execution of submitted Runnables using one of the pooled threads. The classes implementing the interfaces provide adjustable parameters, but the Executors class provides factory methods that create various ExecutorServices preconfigured for the most common scenarios. For example, the factory method Executors.newCachedThreadPool() returns an ExecutorService with an unbounded thread pool and automatic thread reclamation. This will attempt to use a thread from the pool if one is available, and, if not, create a new thread. Idle threads are removed from the pool after a certain amount of time. Another example is Executors.newFixedThreadPool(int nThreads), which creates a fixed-size thread pool containing nThreads threads. It contains an unbounded queue for waiting tasks. Other configurations are possible, both using additional factory methods not described here and also by directly instantiating a class implementing ExecutorService and adjusting the parameters manually.
[2]
for(int i = 0; i != 4; i++)
{
   final int j = i;
   new Thread(
      new Runnable()  //define Runnable objects anonymously
      {
         int id = j;  //references the final local variable j
         public void run()
         { System.out.println(id + ": Are we there yet?"); }
      }
   ).start();
}
}
}
As an example, we could rewrite the previous example as shown in Fig. C.4. Note that to change to a different thread management policy, we would only need to invoke a different factory method for the instantiation of the Executor. The rest of the code would remain the same.

The run method in the Runnable interface does not return a value and cannot be declared to throw exceptions. To correct this deficiency, the Callable interface, which contains a method call that throws an Exception and returns a result, was introduced. The interface definition exploits the new support for generic types. Callables can arrange execution by using the submit method of an object that implements the ExecutorService interface. The submit method returns a Future object, which represents the result of an asynchronous computation. The Future object provides methods to check whether the computation is complete, to wait for completion, and to get the result. The type enclosed within angle brackets specifies that the class is to be specialized to that type.
Figure C.4. Program using a ThreadPoolExecutor instead of creating threads directly

import java.util.concurrent.*;

class ThinkParalleln implements Runnable
{
   int id;  //thread-specific variable containing thread ID

   /*The run method defines the task's behavior*/
   public void run()
   {
      System.out.println(id + ": Are we there yet?");
   }

   /*Constructor sets id*/
   ThinkParalleln(int id){this.id = id;}

   /*main method creates an Executor to manage the tasks*/
   public static void main(String[] args)
   {
      /*create an Executor using a factory method in Executors*/
      ExecutorService executor = Executors.newCachedThreadPool();

      // send each task to the executor
      for(int i = 0; i != 4; i++)
      {
         executor.execute(new ThinkParalleln(i));
      }
   }
}
The curly braces delimit the block. The code for acquiring and releasing the lock is generated by the compiler.

Suppose we add a variable static int count to the ThinkParallel class to be incremented by each thread after it prints its message. This is a static variable, so there is one per class (not one per object), and it is visible to and thus shared by all the threads. To avoid race conditions, count could be accessed in a synchronized block.[4] To provide protection, all threads must use the same lock, so we use the object associated with the class itself. For any class X, X.class is a reference to the unique object representing class X, so we could write the following:
[4]
is shorthand for
public void updateSharedVariables(...) { synchronized(this){....body....} }
condition (which no longer holds) and corrupt the program state. Thus, checking the condition and performing the action need to be placed inside a synchronized block. On the other hand, the thread cannot hold the lock associated with the synchronized block while waiting because this would block other threads from access, preventing the condition from ever being established. What is needed is a way for a thread waiting on a condition to release the lock and then, after the condition has been satisfied, reacquire the lock before rechecking the condition and performing its action. Traditional monitors [Hoa74] were proposed to handle this situation. Java provides similar facilities.

The Object class, along with the previously mentioned lock, also contains an implicit wait set that serves as a condition variable. The Object class provides several versions of wait methods that cause the calling thread to implicitly release the lock and add itself to the wait set. Threads in the wait set are suspended and not eligible to be scheduled to run.
Figure C.6. Basic idiom for using wait. Because wait throws an InterruptedException, it should somehow be enclosed in a try-catch block, omitted here.

synchronized(lockObject)
{
   while( !condition ){ lockObject.wait(); }
   action;
}
The basic idiom for using wait is shown in Fig. C.6. The scenario is as follows: The thread acquires the lock associated with lockObject. It checks condition. If the condition does not hold, then the body of the while loop, the wait method, is executed. This causes the lock to be released, suspends the thread, and places it in the wait set belonging to lockObject. If the condition does hold, the thread performs action and leaves the synchronized block. On leaving the synchronized block, the lock is released.

Threads leave the wait set in one of three ways. First, the Object class methods notify and notifyAll awaken one or all threads, respectively, in the wait set of that object. These methods are intended to be invoked by a thread that establishes the condition being waited upon. An awakened thread leaves the wait set and joins the threads waiting to reacquire the lock. The awakened thread will reacquire the lock before it continues execution. The wait method may be called without parameters or with timeout values. A thread that uses one of the timed wait methods (that is, one that is given a timeout value) may be awakened by notification as just described, or by the system at some point after the timeout has expired. Upon being reawakened, it will reacquire the lock and continue normal execution. Unfortunately, there is no indication of whether a thread was awakened by a notification or a timeout. The third way that a thread can leave the wait set is if it is interrupted. This causes an InterruptedException to be thrown, whereupon the control flow in the thread follows the normal rules for handling exceptions.

We now continue describing the scenario started previously for Fig. C.6, at the point at which the thread has waited and been awakened. When awakened by some other thread executing notify or notifyAll on lockObject (or by a timeout), the thread will be removed from the wait set. At some point, it will be scheduled for execution and will attempt to reacquire the lock associated with lockObject. After the lock has been reacquired, the thread will recheck the condition and either
C.5. LOCKS
The semantics of synchronized blocks together with wait and notify have certain deficiencies when used in the straightforward way described in the previous section. Probably the worst problem is that there is no access to information about the state of the associated implicit lock. This means that a thread cannot determine whether or not a lock is available before attempting to acquire it. Further, a thread blocked waiting for the lock associated with a synchronized block cannot be interrupted.[5] Another problem is that only a single (implicit) condition variable is associated with each lock. Thus, threads waiting for different conditions to be established share a wait set, with notify possibly waking the wrong thread (and forcing the use of notifyAll).
[5] This is discussed further in the next section.
A lock is reentrant if it can be acquired multiple times by the same thread without causing deadlock.
//instantiate lock
private final ReentrantLock lock = new ReentrantLock();
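A minimal sketch of reentrancy in action (the class and method names are invented for illustration): the owning thread acquires the same ReentrantLock a second time without deadlocking, and getHoldCount() reports the nesting depth.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: a ReentrantLock may be reacquired by the thread that
// already holds it; each lock() must be matched by an unlock().
class ReentrancyDemo {
    private final ReentrantLock lock = new ReentrantLock();

    int nestedHoldCount() {
        lock.lock();                            // first acquisition
        try {
            lock.lock();                        // reacquired by the same thread
            try {
                return lock.getHoldCount();     // nesting depth is now 2
            } finally { lock.unlock(); }
        } finally { lock.unlock(); }
    }
}
```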
Figure C.7. A version of SharedQueue2 (see the Shared Queue pattern) using a Lock and Condition instead of synchronized blocks with wait and notify
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.locks.*;

class SharedQueue2 {
    class Node {
        Object task;
        Node next;
        Node(Object task) { this.task = task; next = null; }
    }

    private Node head = new Node(null);
    private Node last = head;
    Lock lock = new ReentrantLock();
    final Condition notEmpty = lock.newCondition();

    public void put(Object task) {
        //cannot insert null
        assert task != null : "Cannot insert null task";
        lock.lock();
        try {
            Node p = new Node(task);
            last.next = p;
            last = p;
            notEmpty.signalAll();
        } finally { lock.unlock(); }
    }

    public Object take() {
        //returns first task in queue, waits if queue is empty
        Object task = null;
        lock.lock();
        try {
            while (isEmpty()) {
                try { notEmpty.await(); }
                catch (InterruptedException error) {
                    assert false : "sq2: no interrupts here";
                }
            }
            Node first = head.next;
            task = first.task;
            first.task = null;
            head = first;
        } finally { lock.unlock(); }
        return task;
    }

    private boolean isEmpty() { return head.next == null; }
}
and should always be used with a try-finally block, such as
lock.lock(); // block until lock acquired
try {
    //critical section
}
finally { lock.unlock(); }
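The Lock interface also addresses the deficiencies of synchronized blocks noted earlier: tryLock acquires the lock only if it is currently free, so a thread can test availability without blocking, and lockInterruptibly allows a thread blocked on acquisition to be interrupted. A small sketch with invented class and method names:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the nonblocking and interruptible acquisition methods
// provided by Lock implementations such as ReentrantLock.
class LockVariants {
    private final ReentrantLock lock = new ReentrantLock();

    boolean tryCriticalSection() {
        if (lock.tryLock()) {           // acquire only if free; never blocks
            try {
                // critical section would go here
                return true;
            } finally { lock.unlock(); }
        }
        return false;                   // lock was busy; do something else
    }

    void interruptibleCriticalSection() throws InterruptedException {
        lock.lockInterruptibly();       // a blocked acquisition can be interrupted
        try {
            // critical section would go here
        } finally { lock.unlock(); }
    }
}
```

A timed variant, tryLock(timeout, unit), combines the two behaviors: it blocks for at most the given timeout and can be interrupted while waiting.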
Figure C.9. Program showing a parallel version of the sequential program in Fig. C.8 where each iteration of the big_comp loop is a separate task. A thread pool containing ten threads is used to execute the tasks. A CountDownLatch is used to ensure that all of the tasks have completed before executing the (still sequential) loop that combines the results.
import java.util.concurrent.*;

class ParallelLoop {
    static ExecutorService exec;
    static CountDownLatch done;
    static int num_iters = 1000;
    static double[] res = new double[num_iters];
    static double answer = 0.0;

    static void combine(int i) { ..... }
    static double big_comp(int i) { ..... }

    public static void main(String[] args) throws InterruptedException {
        /*create executor with pool of 10 threads */
        exec = Executors.newFixedThreadPool(10);
        /*create and initialize the countdown latch*/
        done = new CountDownLatch(num_iters);
        long startTime = System.nanoTime();
        for (int i = 0; i < num_iters; i++) {
            //only final local vars can be referenced in an anonymous class
            final int j = i;
            /*pass the executor a Runnable object to execute the
              loop body and decrement the CountDownLatch */
            exec.execute(new Runnable() {
                public void run() {
                    res[j] = big_comp(j);
                    done.countDown(); /*decrement the CountDownLatch*/
                }
            });
        }
        done.await(); //wait until all tasks have completed
        /*combine results using sequential loop*/
        for (int i = 0; i < num_iters; i++) {
            combine(i);
        }
        System.out.println(answer);
        /*cleanly shut down thread pool*/
        exec.shutdown();
    }
}
C.7. INTERRUPTS
Part of the state of a thread is its interrupt status. A thread can be interrupted using the interrupt method. This sets the interrupt status of the thread to interrupted.

If the thread is suspended (that is, it has executed a wait, sleep, join, or other command that suspends the thread), the suspension will be interrupted and an InterruptedException thrown. Because of this, the methods that can cause blocking, such as wait, throw this exception, and thus must either be called from a method that declares itself to throw the exception, or the call must be enclosed within a try-catch block. Because the signature of the run method in class Thread and interface Runnable does not include throwing this exception, a try-catch block must enclose, either directly or indirectly, any call to a blocking method invoked by a thread. This does not apply to the main thread, because the main method can be declared to throw an InterruptedException.

The interrupt status of a thread can be used to indicate that the thread should terminate. To enable this, the thread's run method should be coded to periodically check the interrupt status (using the isInterrupted or the interrupted method); if the thread has been interrupted, the thread should return from its run method in an orderly way. InterruptedExceptions can be caught and the handler used to ensure graceful termination if the thread is interrupted when waiting. In many parallel programs, provisions to externally stop a thread are not needed, and the catch blocks for InterruptedExceptions can either provide debugging information or simply be empty.

The Callable interface was introduced as an alternative to Runnable that allows an exception to be thrown (and also, as discussed previously, allows a result to be returned). This interface exploits the support for generic types.
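A minimal sketch of Callable in use (the class and method names are invented for illustration): call() is declared to throw a checked exception and returns a value of the generic type, which is retrieved through the Future returned by submit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: Callable<Integer> returns a result and may throw a checked
// exception, neither of which Runnable's run() signature permits.
class CallableDemo {
    static int square(int x) throws Exception {
        if (x < 0) throw new Exception("negative input");
        return x * x;
    }

    static int runTask(int x) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> task = () -> square(x);   // result type is generic
            Future<Integer> future = exec.submit(task);
            return future.get();  // blocks for the result; a task exception
                                  // is rethrown wrapped in ExecutionException
        } finally { exec.shutdown(); }
    }
}
```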
Glossary
Abstract data type (ADT). A data type given by its set of allowed values and the available operations on those values. The values and operations are defined independently of a particular representation of the values or implementation of the operations. In a programming language that directly supports ADTs, the interface of the type reveals the operations on it, but the implementation is hidden and can (in principle) be changed without affecting clients that use the type. The classic example of an ADT is a stack, which is defined by its operations, typically including push and pop. Many different internal representations are possible.
Amdahl's law. A law describing the limit on the speedup that can be obtained by running an algorithm on a system of P processors. If γ is the serial fraction of the computation, the maximum possible speedup is

S = 1 / (γ + (1 - γ)/P)
Autoboxing. A language feature, available beginning with Java 2 1.5, that provides automatic conversion of data of a primitive type to the corresponding wrapper type, for example, from int to Integer.
Bandwidth. The capacity of a system, usually expressed as items per second. In parallel computing, the most common usage of the term "bandwidth" is in reference to the number of bytes per second that can be moved across a network link. A parallel program that generates relatively small numbers of huge messages may be limited by the bandwidth of the network, in which case it is called a bandwidth-limited program. See also [bisection bandwidth]

Barrier. A synchronization mechanism applied to groups of UEs, with the property that no UE in the group can pass the barrier until all UEs in the group have reached the barrier. In other words, UEs arriving at the barrier suspend or block until all UEs have arrived; they can then all proceed.
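In Java, the barrier behavior described above is provided by java.util.concurrent.CyclicBarrier. In the following sketch (class and method names invented for illustration), each thread writes its contribution, waits at the barrier, and only then reads the other threads' contributions, which is safe precisely because the barrier guarantees all writes have completed.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: no thread passes barrier.await() until all nThreads arrive,
// so every thread sees the complete contrib array afterward.
class BarrierDemo {
    static int sumAfterBarrier(int nThreads) throws InterruptedException {
        int[] contrib = new int[nThreads];
        AtomicInteger total = new AtomicInteger();
        CyclicBarrier barrier = new CyclicBarrier(nThreads);
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                contrib[id] = id + 1;            // per-thread work
                try { barrier.await(); }         // suspend until all arrive
                catch (Exception e) { throw new RuntimeException(e); }
                int sum = 0;                     // all contributions now visible
                for (int c : contrib) sum += c;
                total.set(sum);                  // every thread computes the same sum
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        return total.get();
    }
}
```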
Cache. A relatively small region of memory that is local to a processor and is considerably faster than the computer's main memory. Cache hierarchies consisting of one or more levels of cache are essential in modern computer systems. Because processors are so much faster than the computer's main memory, a processor can run at a significant fraction of full speed only if the data can be loaded into cache before it is needed and that data can be reused during a calculation. Data is moved between the cache and the computer's main memory in small blocks of bytes called cache lines. An entire cache line is moved when any byte within the memory mapped to the cache line is accessed. Cache lines are removed from the cache according to some protocol when the cache becomes full and space is needed for other data, or when they are accessed by some other processor. Usually each processor has its own cache (though sometimes multiple processors share a level of cache), so keeping the caches coherent (that is, ensuring that all processors have the same view of memory through their distinct caches) is an issue that must be dealt with by computer architects and compiler writers. Programmers must be aware of caching issues when optimizing the performance of software.
ccNUMA. Cache-coherent NUMA. A NUMA model where data is coherent at the level of the cache. See also [NUMA]

Cluster. Any collection of distinct computers that are connected and used as a parallel computer, or to form a redundant system for higher availability. The computers in a cluster are not specialized to cluster computing and could, in principle, be used in isolation as standalone computers. In other words, the components making up the cluster, both the computers and the networks connecting them, are not custom-built for use in the cluster. Examples include Ethernet-connected workstation networks and rack-mounted workstations dedicated to parallel computing. See also [workstation farm]

Collective communication. A high-level operation involving a group of UEs and having at its core the cooperative exchange of information between the UEs. The high-level operation might be a pure communication event (for example, a broadcast) or it might include some computation (for example, a reduction). See also [broadcast] See also [reduction]

Concurrent execution. A condition in which two or more UEs are active and making progress simultaneously. This can be either because they are being executed at the same time on different PEs, or because the actions of the UEs are interleaved on the same PE.
Concurrent program. A program with multiple loci of control (threads, processes, etc.).
Condition variable. Condition variables are part of the monitor synchronization mechanism. A condition variable is used by a process or thread to delay until the monitor's state satisfies some condition; it is also used to awaken a delayed process when the condition becomes true. Associated with each condition variable is a wait set of suspended (delayed) processes or threads. Operations on a condition variable include wait (add this process or thread to the wait set for this variable) and signal or notify (awaken a process or thread on the wait set for this variable). See also [monitor]

Copy on write. A technique that ensures, using minimal synchronization, that concurrent threads will never see a data structure in an inconsistent state. To update the structure, a copy is made, modifications are made on the copy, and then the reference to the old structure is atomically replaced with a reference to the new one. This means that a thread holding a reference to the old structure may continue to read an old (consistent) version, but no thread will ever see the structure in an inconsistent state. Synchronization is only needed to acquire and update the reference to the structure, and to serialize the updates.
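The copy-on-write technique just described can be sketched in Java with an AtomicReference (the class and method names are invented for illustration): writers copy, modify the copy, and atomically swap the reference, while readers simply dereference and never block.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: updates build a modified copy and atomically replace the
// published reference; readers always see a consistent snapshot.
class CopyOnWriteBox {
    private final AtomicReference<List<String>> ref =
        new AtomicReference<>(Collections.<String>emptyList());

    void add(String item) {
        while (true) {
            List<String> old = ref.get();
            List<String> copy = new ArrayList<>(old);  // copy, then modify the copy
            copy.add(item);
            // atomically swap in the new version; retry if another writer won
            if (ref.compareAndSet(old, Collections.unmodifiableList(copy))) return;
        }
    }

    List<String> snapshot() {
        return ref.get();   // readers never block and never see a partial update
    }
}
```

The compare-and-set retry loop serializes the updates; java.util.concurrent's CopyOnWriteArrayList packages the same idea as a library class.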
Distributed shared memory (DSM). An address space shared among multiple UEs that is constructed from memory subsystems that are distinct and distributed about the system. There may be operating system and hardware support for the distributed shared memory system, or the shared memory may be implemented entirely in software as a separate middleware layer. See also [virtual shared memory]

DSM. See [distributed shared memory]

Eager evaluation. A scheduling strategy where the evaluation of an expression, or execution of a procedure, can occur as soon as (but not before) all of its arguments have been evaluated. Eager evaluation is typical for most programming environments and contrasts with lazy evaluation. Eager evaluation can sometimes lead to extra work (or even nontermination) when an argument that will not actually be needed in a computation must be computed anyway.
Efficiency. The efficiency E of a parallel computation is the speedup divided by the number of PEs used, E = S(P)/P, and indicates how effectively the resources in a parallel computer are used.
Fork. See [fork/join]
resources remain with the resource owners.
Homogeneous. The components of a homogeneous system are all of the same kind.
Hypercube. A multicomputer in which the nodes are placed at the vertices of a d-dimensional cube. The most frequently used configuration is a binary hypercube where each of 2^n nodes is connected to n others.
Incremental parallelism. Incremental parallelism is a technique for parallelizing an existing program, in which the parallelization is introduced as a sequence of incremental changes, parallelizing one loop at a time. Following each transformation, the program is tested to ensure that its behavior is the same as the original program, greatly decreasing the chances of introducing undetected bugs. See also [refactoring]

Java Virtual Machine (JVM). An abstract stack-based computing machine whose instruction set is called Java bytecode. Typically, Java programs are compiled into class files containing bytecode, a symbol table, and other information. The purpose of the JVM is to provide a consistent execution environment for class files regardless of the underlying platform.
Join. See [fork/join]

JVM. See [Java Virtual Machine]

Latency. The fixed cost of servicing a request, such as sending a message or accessing information from a disk. In parallel computing, the term most often is used to refer to the time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the empty message is received by the recipient. Programs that generate large numbers of small messages are sensitive to the latency and are called latency-bound programs.
Linda. A coordination language for parallel programming. See also [tuple space]

Load balance. In a parallel computation, tasks are assigned to UEs, which are then mapped onto PEs for execution. The work carried out by the collection of PEs is the "load" associated with the computation. Load balance refers to how that load is distributed among the PEs. In an efficient parallel program, the load is balanced so each PE spends about the same amount of time on the computation. In other words, in a program with good load balance, each PE finishes with its share of the load at about the same time.
Load balancing. The process of distributing work among the PEs so as to achieve good load balance. See also [load balance]
Multiprocessor. A parallel computer with multiple processors that share an address space.
Mutex. A mutual exclusion lock. A mutex serializes the execution of multiple threads.
Process. A collection of resources that enable the execution of program instructions. These resources can include virtual memory, I/O descriptors, a runtime stack, signal handlers, user and group IDs, and access control tokens. A more high-level view is that a process is a "heavyweight" UE with its own address space. See also [unit of execution] See also [thread]

Process migration. Changing the processor responsible for running a process during execution. Process migration is commonly used to dynamically balance the load on multiprocessor systems. It is also used to support fault-tolerant computing by moving processes away from failing processors.
Reader/writer locks. A lock that allows concurrent access by multiple readers but gives a writer exclusive access, useful for protecting data that is read far more often than it is written.
Refactoring. Refactoring is a software engineering technique in which a program is restructured carefully so as to alter its internal structure without changing its external behavior. The restructuring occurs through a series of small transformations (called refactorings) that can be verified as preserving behavior following each transformation. The system is fully working and verifiable following each transformation, greatly decreasing the chances of introducing serious, undetected bugs. Incremental parallelism can be viewed as an application of refactoring to parallel programming. See also [incremental parallelism]

Remote procedure call (RPC). A procedure invoked in a different address space than the caller, often on a different machine. Remote procedure calls are a popular approach for interprocess communication and launching remote processes in distributed client-server computing environments.
RPC. See [remote procedure call]

Semaphore. An ADT used to implement certain kinds of synchronization. A semaphore has a value that is constrained to be a nonnegative integer and two atomic operations. The allowable operations are V (sometimes called up) and P (sometimes called down). A V operation increases the value of the semaphore by one. A P operation decreases the value of the semaphore by one, provided that can be done without violating the constraint that the value be nonnegative. A P operation that is initiated when the value of the semaphore is 0 suspends. It may continue when the value is positive.
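In Java, java.util.concurrent.Semaphore provides exactly these operations, with acquire() playing the role of P and release() the role of V. A small sketch with an invented helper method:

```java
import java.util.concurrent.Semaphore;

// Sketch: acquire() is P (decrement, suspending while the value is 0);
// release() is V (increment); availablePermits() reads the current value.
class SemaphoreDemo {
    static int availableAfter(int permits, int acquires, int releases)
            throws InterruptedException {
        Semaphore sem = new Semaphore(permits);
        for (int i = 0; i < acquires; i++) sem.acquire();   // P operations
        for (int i = 0; i < releases; i++) sem.release();   // V operations
        return sem.availablePermits();
    }
}
```

A semaphore initialized to 1 behaves like a mutex; larger initial values bound how many threads may be inside a region at once.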
Serial fraction. The fraction of a program's execution time spent in portions of the code that must run serially. If the setup and finalization phases must execute serially, then the serial fraction would be

γ = (T_setup + T_finalization) / T_total
Shared address space. An addressable block of memory that is shared between a collection of UEs.
Speedup. A measure of the benefit of parallelism, S(n) = T(1)/T(n), where T(n) is the program's runtime on a system with n PEs. When the speedup equals the number of PEs in the parallel computer, the speedup is said to be perfectly linear.
by reformulating as a certain type of recurrence relation.
Thread. A fundamental unit of execution on certain computers. In a UNIX context, threads are associated with a process and share the process's environment. This makes the threads lightweight (that is, a context switch between threads is cheap). A more high-level view is that a thread is a "lightweight" unit of execution that shares an address space with other threads. See also [unit of execution] See also [process]

Transputer. The transputer is a microprocessor developed by Inmos Ltd. with on-chip support for parallel processing. Each processor contains four high-speed communication links that are easily connected to the links of other transputers and a very efficient built-in scheduler for multiprocessing.
As seen in these examples, the fields making up a tuple can hold integers, strings, variables, arrays, or any other value defined in the base programming language. Whereas traditional memory systems access objects through an address, tuples are accessed by association. The programmer working with a tuple space defines a template and asks the system to deliver tuples matching the template. Tuple spaces were created as part of the Linda coordination language [CG91]. The Linda language is small, with only a handful of primitives to insert tuples, remove tuples, and fetch a copy of a tuple. It is combined with a base language, such as C, C++, or Fortran, to create a combined parallel programming language. In addition to its original implementations on machines with a shared address space, Linda was also implemented with a virtual shared memory and used to communicate between UEs running on the nodes of distributed-memory computers. The idea of an associative virtual shared memory as inspired by Linda has been incorporated into JavaSpaces [FHA99].
independent sequential jobs as opposed to parallel computing.