
ACM Parallel Computing Tech Pack
Journeyman's Programming Tour

November, 2010

Parallel Computing Committee

Paul Steinberg, Intel, Co-Chair
Matthew Wolf, CERCS, Georgia Tech, Co-Chair
Judith Bishop, Microsoft
Clay Breshears, Intel
Barbara Mary Chapman, University of Houston
Daniel J. Ernst, University of Wisconsin-Eau Claire
Andrew FitzGibbon, Shodor Foundation
Dan Garcia, University of California, Berkeley
Benedict Gaster, AMD
Katherine Hartsell, Oracle
Tom Murphy, Contra Costa College
Steven Parker, NVIDIA
Charlie Peck, Earlham College
Jennifer Teal, Intel

Special thanks to Abi Sundaram, Intel

Table of Contents

Introduction

The Basics of Parallel Computing
  Parallelism
  Parallel computing
  Is concurrency the same as parallelism?

Parallel Decompositions
  Introduction
  Task decomposition
  Data decomposition
  Pipeline parallelism

Parallel Hardware
  Memory systems
  Processing characteristics
  Coordination
  Scalability
  Heterogeneous architectures

Parallel Programming Models, Libraries, and Interfaces
  Introduction
  Shared memory model programming
    Posix threads
    Win32 threads
    Java
    OpenMP
    Threading Building Blocks (TBB)
  Distributed memory model programming
    Message passing interface (MPI)
  General purpose GPU programming
    The Open Compute Language (OpenCL)
    CUDA
  Hybrid parallel software architectures
  Parallel languages
    Concurrent ML and Concurrent Haskell

Tools
  Compilers
  Auto-parallelization
  Thread debuggers
  Tuners/performance profilers
  Memory tools

PARALLEL COMPUTING: JOURNEYMAN'S PROGRAMMING TOUR

INTRODUCTION

In every domain the tools that allow us to tackle the big problems, and execute the complex calculations that are necessary to solve them, are computer based. The evolution of computer architecture towards hardware parallelism means that software/computational parallelism has become a necessary part of the computer scientist's and engineer's core knowledge. Indeed, understanding and applying computational parallelism is essential to achieving sustained performance on modern computers. Going forward, performance computing will be even more dependent on scaling across many computing cores and on handling the increasingly complex nature of the computing task. This is true irrespective of whether the domain problem is predicting climate change, analyzing protein folding, or producing the latest animated blockbuster.

The Parallelism Tech Pack is a collection of guided references to help students, practitioners, and educators come to terms with the large and dynamic body of knowledge that goes by the name parallelism. We have organized it as a series of tours; each tour in the tech pack corresponds to one particular guided route through that body of knowledge. This particular tour is geared towards those who have reasonable skills as practitioners of serial programming but who have not yet explored parallelism in any coherent way. All of the tours in the Parallelism Tech Pack are living documents that provide pointers to resources for the novice and the advanced programmer, for the student and the working engineer. Future tours within the Tech Pack will address other topics.

The authors of this Tech Pack are drawn from both industry and academia. Despite this group's wide variety of experiences in utilizing parallel platforms, interfaces, and applications, we all agree that parallelism is now a fundamental concept for all of computing.

Scope of Tour: This tour approaches parallelism from the point of view of someone comfortable with programming but not yet familiar with parallel concepts. It was designed to ease into the topic with some introductory context, followed by links to references for further study. The topics presented are by no means exhaustive. Instead, the topics were chosen so that a careful reader should achieve a reasonably complete feel for the fundamental concepts and paradigms used in parallel computing across many platforms. Exciting areas like transactional memory, parallelism in functional languages, distributed shared memory constructs, and so on will be addressed in other tours but also should be seen as building on the foundations put forth here.

Online Readings
Herb Sutter. 2005. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal 30, 3 (March). http://www.gotw.ca/publications/concurrency-ddj.htm.

James Larus. 2009. Spending Moore's dividend. Commun. ACM 52, 5 (May). http://doi.acm.org/10.1145/1506409.1506425.


1. THE BASICS OF PARALLEL COMPUTING

Parallelism is a property of a computation in which portions of the calculations are independent of each other, allowing them to be executed at the same time. The more parallelism a particular problem contains, the more opportunity there is for using parallel systems and parallel language features to exploit this parallelism and gain an overall performance improvement. For example, consider the following pseudocode:


float a = E + A;
float b = E + B;
float c = E + C;
float d = E + D;
float r = a + b + c + d;

The first four assignments are independent of each other, and the expressions E+A, E+B, E+C, and E+D can all be calculated in parallel, that is, at the same time, which can potentially provide a performance improvement over executing them sequentially, that is, one at a time.

Parallel computing is defined as the simultaneous use of more than one processor to solve a problem, exploiting that program's parallelism to speed up its execution time.

Is concurrency the same as parallelism? While concurrency and parallelism are related, they are not the same! Concurrency mostly involves a set of programming abstractions to arbitrate communication between multiple processing entities (like processes or threads). These techniques are often used to build user interfaces and other asynchronous tasks. While concurrency does not preclude running tasks in parallel (and these abstractions are used in many types of parallel programming), it is not a necessary component. Parallelism, on the other hand, is concerned with the execution of multiple operations in parallel, that is, at the same time. The following diagram shows parallel programs as a subset of concurrent ones, together forming a subset of all possible programs:

(Diagram: parallel programs are a subset of concurrent programs, which in turn are a subset of all programs.)

2. PARALLEL DECOMPOSITIONS

Introduction
There are a number of decomposition models that are helpful to think about when breaking computation into independent work. Sometimes it is clear which model to pick. At other times it is more of a judgment call, depending on the nature of the problem, how the programmer views the problem, and the programmer's familiarity with the available toolsets. For example, if you need to grade final exams for a course with hundreds of students, there are many different ways to organize the job with multiple graders so as to finish in the shortest amount of time.

Tutorials
The EPCC centre at Edinburgh has a number of good tutorials. The tutorials most useful in this context are the following:

Introduction to High Performance Computing and Decomposing the Potentially Parallel. http://www2.epcc.ed.ac.uk/computing/training/document_archive/.

Blaise Barney. An Introduction to Parallel Computing. Lawrence Livermore National Labs. https://computing.llnl.gov/tutorials/parallel_comp/.

Videos
Introduction to Parallel Programming Video Lecture Series: Part 02, Parallel Decomposition Methods. This video presents three methods for dividing computation into independent work: task decomposition, data decomposition, and pipelining. http://software.intel.com/en-us/courseware/course/view.php?id=381.

Introduction to Parallel Programming Video Lecture Series: Part 04, Shared Memory Considerations. This video provides the viewer with a description of the shared memory model of parallel programming. Implementation strategies for domain decomposition and task decomposition problems using threads within a shared memory execution environment are illustrated. http://software.intel.com/en-us/courseware/course/view.php?id=249.

Task decomposition, sometimes called functional decomposition, divides the problem by the type of task to be done and then assigns a particular task to each parallel worker. As an example, to grade hundreds of final exams, all test papers can be piled onto a table and a group of graders can each be assigned a single question or type of question to score, which is the task to be executed. So one grader has the task of scoring all essay questions, another grader would score the multiple choice questions, and another would score the true/false questions.
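To make this concrete in code, here is a minimal sketch of a task decomposition using OpenMP sections (OpenMP itself is introduced in Section 4; the grading functions are hypothetical stand-ins for the three question types):

#include <stdio.h>

/* Hypothetical grading tasks, one per question type. */
void grade_essays(void)          { printf("essay questions scored\n"); }
void grade_multiple_choice(void) { printf("multiple choice scored\n"); }
void grade_true_false(void)      { printf("true/false scored\n"); }

int main(void) {
    /* Each section is an independent task; the OpenMP runtime may
       assign each one to a different parallel worker. */
    #pragma omp parallel sections
    {
        #pragma omp section
        grade_essays();
        #pragma omp section
        grade_multiple_choice();
        #pragma omp section
        grade_true_false();
    }
    return 0;
}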

Videos
Introduction to Parallel Programming Video Lecture Series: Part 09, Implementing a Task Decomposition. http://software.intel.com/en-us/courseware/course/view.php?id=378. This video describes how to design and implement a task decomposition solution. An illustrative example for solving the 8-Queens problem is used. Multiple approaches are presented with the pros and cons of each described. After the approach is decided upon, code modifications using OpenMP are presented. Potential data race errors with a shared stack data structure holding board configurations (the tasks to be processed) are offered, and a solution is found and implemented.

Data decomposition, sometimes called domain decomposition, divides the problem into elements to be processed and then assigns a subset of the elements to each parallel worker. As an example, to grade hundreds of final exams, all test papers can be stacked onto a table and divided into piles of equal size. Each grader would then take a stack of exams and grade the entire set of questions.
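The same grading job expressed as a data decomposition might look like the following sketch (again using OpenMP ahead of its introduction in Section 4; the exam count and the grade_exam function are illustrative assumptions):

#include <stdio.h>

#define NUM_EXAMS 600   /* illustrative class size */

/* Hypothetical: score every question on one exam. */
void grade_exam(int exam_id) { /* ... */ }

int main(void) {
    /* The pile of exams is divided among the parallel workers;
       each worker grades its share of exams in full. */
    #pragma omp parallel for
    for (int i = 0; i < NUM_EXAMS; i++)
        grade_exam(i);
    printf("all %d exams graded\n", NUM_EXAMS);
    return 0;
}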

Tutorials
Blaise Barney. An Introduction to Parallel Computing. Lawrence Livermore National Lab. https://computing.llnl.gov/tutorials/parallel_comp/#DesignPartitioning.

Pipeline parallelism is a special form of task decomposition where the output from one process, or stage, as they are often called, serves directly as the input to the next process. This imposes a much more tightly coordinated structure on the program than is typically found in either plain task or data decompositions. As an example, to grade hundreds of final exams, all test papers can be piled onto a table and a group of graders arranged in a line. The first grader takes a paper from the pile, scores all questions on the first page, and passes the paper to the second grader; the second grader receives a paper from the first grader, scores all the questions on the second page, and passes the paper to the third grader, and so on, until the exam is fully graded.

3. PARALLEL HARDWARE

The previous section described some of the categories of parallel computation. In order to discuss parallel computing, however, we also need to address the way that computing hardware can express parallelism.

Memory systems. From a very basic architecture standpoint, there are several general classifications of parallel computing systems:

In a shared memory system, the processing elements all share a global memory address space. Popular shared memory systems include multicore CPUs and manycore GPUs (Graphics Processing Units).

In a distributed memory system, multiple individual computing systems with their own memory spaces are connected to each other through a network.

These system types are not mutually exclusive. Hybrid systems, the category into which modern computational clusters fall, consist of distributed memory nodes, each of which is a shared memory system.

Processing characteristics. In a parallel application, calculations are performed in the same way they are in the serial case, on a CPU of some kind. However, in parallel computing there are multiple processing entities (tasks, threads, or processes) instead of one. This results in a need for these entities to communicate values with each other as they compute. This communication happens across a network of some kind. Coordination, such as managing access to shared data structures in a threaded environment, is also a form of communication. In either case, communication adds a cost to the runtime of a program, in an amount that varies greatly based on the design of the program.

Ideally, parallel programmers want to minimize the amount of communication done (compared to the amount of computation).

Scalability. An important characteristic of parallel programs is their ability to scale, both in terms of the computing resources used by the program and the size of the data set processed by the program. There are two types of scaling we consider when analyzing parallel programs: strong and weak scaling. Strong scaling examines the behavior of the program when the size of the data set is held constant while the number of processing units increases. Weak scaling examines what happens when the size of the data set is increased proportionally as the number of processing units increases. Generally speaking, it is easier to design parallel programs that do well with weak scaling than it is to design programs that do well with strong scaling.

Heterogeneous architectures (e.g., the IBM Cell architecture, AMD's Fusion architecture, and Intel's Sandy Bridge architecture). Heterogeneous systems may consist of many different devices, each with its own capabilities and performance properties, all exposed within a single system. While not new (embedded system-on-a-chip designs have been around for over two decades), these architectures are becoming more prevalent in mainstream desktop and supercomputing environments. This is due to the emergence of accelerators such as the IBM Cell Broadband Engine and, more recently, the wide adoption of the general-purpose computing on graphics processing units (GPGPU) programming model, where CPUs and GPUs are connected to form a single system. NVIDIA's Compute Unified Device Architecture (CUDA) devices are the most common GPGPUs in use currently.

4. PARALLEL PROGRAMMING MODELS, LIBRARIES, AND INTERFACES

Introduction

This material is grouped by parallel programming model. The first section covers libraries and interfaces designed to be used in a shared memory model; the second covers tools for the distributed memory model; and the third covers tools for the GPGPU model. Another component of this tour covers hybrid models, where two or more of these models may be combined in a single parallel application.

Shared memory model programming.

Posix threads are a standard set of threading primitives: a low-level threading method that underlies many of the more modern threading abstractions like OpenMP and TBB.

The following are some resources to assist you in understanding Posix threads better.
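First, though, a minimal sketch of the basic Pthreads pattern: create worker threads, each running a start routine, then join them all. (The thread count and the body of the worker are illustrative assumptions.)

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4   /* illustrative choice */

/* Start routine: each thread receives its id through the void* argument. */
void *worker(void *arg) {
    long id = (long)arg;
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);   /* wait for each worker to finish */
    return 0;
}

On most systems this is compiled with something like cc -pthread.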

Tutorials
POSIX thread (pthread) libraries. http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html

Books
David Butenhof. 1997. Programming with POSIX Threads, Addison-Wesley. http://www.amazon.com/Programming-POSIX-Threads-David-Butenhof/dp/0201633922
This book offers an in-depth description of the IEEE operating system interface standard, POSIX (Portable Operating System Interface) threads, commonly called Pthreads. It's written for experienced C programmers, but assumes no previous knowledge of threads, and explains basic concepts such as asynchronous programming, the life cycle of a thread, and synchronization. Abbreviated Publisher's Abstract.

Bradford Nichols, Dick Buttlar, and Jacqueline Proulx Farrell. 1996. Pthreads Programming: A POSIX Standard for Better Multiprocessing, O'Reilly. http://oreilly.com/catalog/9781565921153
POSIX threads, or pthreads, allow multiple tasks to run concurrently within the same program. This book discusses when to use threads and how to make them efficient. It features realistic examples, a look behind the scenes at the implementation and performance issues, and special topics such as DCE and real-time extensions. Abbreviated Publisher's Abstract.

Joe Duffy. 2008. Concurrent Programming on Windows, Addison-Wesley. http://www.amazon.com/Concurrent-Programming-Windows-Joe-Duffy/dp/032143482X
This book offers an in-depth description of the issues with concurrency, introducing general mechanisms and techniques, and covering details of implementations within the .NET framework on Windows. There are numerous examples of good and bad practice, and details on how to implement your own concurrent data structures and algorithms.

Win32 threads, also called native threading by Windows developers, are still the default method used by many to introduce parallelism into code in Windows environments. Native threading can be difficult to implement and maintain. Microsoft has a rich body of material available on the Microsoft Developer Network.

Material that provides more information about threads follows.

Online Resources
Microsoft Developer Network. An online introduction to Windows threading concepts. http://msdn.microsoft.com/en-us/library/ms684841(VS.85).aspx.

Books
Johnson M. Hart. 2010. Windows System Programming (4th ed.), Addison-Wesley. http://www.amazon.com/Windows-Programming-Addison-Wesley-Microsoft-Technology/dp/0321657748
This book contains extensive new coverage of 64-bit programming, parallelism, multicore systems, and other crucial topics. Johnson Hart's robust code examples have been debugged and tested in both 32-bit and 64-bit versions, on single and multiprocessor systems, and under Windows 7, Vista, Server 2008, and Windows XP. Hart covers Windows externals at the API level, presenting practical coverage of all the services Windows programmers need, and emphasizing how Windows functions actually behave and interact in real-world applications. Abbreviated Publisher's Abstract.

Java: Since version 5.0, concurrency support has been a fundamental component of the Java specification. Java follows a similar approach to POSIX and other threading APIs, introducing thread creation and synchronization primitives into the language as a high-level API, through the package java.util.concurrent. There are a number of approaches to introducing parallelism into Java code, but conventionally they follow a standard pattern: each thread is created as an instance of the class Thread, defining a class that implements the Runnable interface (think of this as the POSIX entry point for a function), which must implement the method public void run(). Just like POSIX, the thread is terminated when the method returns.

There are a large number of resources to help with understanding Java threads better, and the following is just a small selection.

Tutorials
Oracle has a large number of Java online tutorials, including one that introduces Java threads. http://download.oracle.com/javase/tutorial/essential/concurrency/index.html.

For the developer new to Java and/or concurrency, the Java for Beginners portal provides an excellent set of tutorials, including one specifically on the threading model. http://www.javabeginner.com/learn-java/java-threads-tutorial.

Books
Scott Oaks and Henry Wong. 2004. Java Threads, O'Reilly. http://oreilly.com/catalog/9780596007829
This is a well-developed book that, while not the most up-to-date resource on the subject, provides an excellent reference guide.

OpenMP: OpenMP is a directive-based, shared memory parallel programming model. It is most useful for parallelizing independent loop iterations in both C and Fortran.

New facilities in OpenMP 3.0 allow for independent tasks to execute in parallel. To be used, OpenMP must be supported by your compiler. The standard is limited in the types of parallelism that you can implement, but it is easy to use and a good starting point for learning parallel programming. Some resources to help with understanding OpenMP better are listed below.
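As a taste of the directive style, here is a minimal sketch: a single pragma asks the compiler to divide the independent loop iterations among threads. (The array size and loop body are illustrative.)

#include <omp.h>
#include <stdio.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void) {
    /* Each iteration is independent, so the runtime may split the
       iteration space across the available threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}

With gcc, for example, this is compiled as gcc -fopenmp; without the flag the pragma is ignored and the program runs serially, which is part of OpenMP's appeal.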

Tutorials
Blaise Barney. OpenMP Tutorial, Lawrence Livermore National Lab. https://computing.llnl.gov/tutorials/openMP/.
This excellent tutorial is geared to those who are new to parallel programming with OpenMP. Basic understanding of parallel programming in C/C++ or Fortran is assumed.

OpenMP Exercises. Tim Mattson and Larry Meadows. Intel Corporation. This tutorial provides an excellent introduction to OpenMP, including code and examples. http://openmp.org/mp-documents/OMP_Exercises.zip and http://openmp.org/mp-documents/omp-hands-on-SC08.pdf.

Getting Started with OpenMP. Text-based tutorial; read and learn with examples. http://software.intel.com/en-us/articles/getting-started-with-openmp/.

An Introduction to OpenMP 3.0. This deck contains more advanced techniques (e.g., inclusion of wait statements) that would need more explanation to be used safely. https://iwomp.zih.tu-dresden.de/downloads/2.Overview_OpenMP.pdf.

Videos
An Introduction to Parallel Programming: Video Lecture Series. http://software.intel.com/en-us/courseware/course/view.php?id=224.
This multipart introduction contains many units on OpenMP, and includes coding exercises and code samples.

Community sites
www.openmp.org. Contains the current and past OpenMP language specifications, lists of compilers that support OpenMP, references, and other resources.

Cheat sheets

Reference cards for FORTRAN and C/C++:
C/C++: http://www.openmp.org/mp-documents/OpenMP3.0-SummarySpec.pdf.
FORTRAN: http://www.openmp.org/mp-documents/OpenMP3.0-FortranCard.pdf.

Books
Barbara Chapman, Gabriele Jost, and Ruud van der Pas. 2007. Using OpenMP: Portable Shared Memory Parallel Programming, MIT Press.
ACM members, read it here: http://learning.acm.org/books/book_detail.cfm?isbn=9780262533027&type=24.

Michael J. Quinn. 2004. Parallel Programming in C with MPI and OpenMP, McGraw-Hill.
This book addresses the needs of students and professionals who want to learn how to design, analyze, implement, and benchmark parallel programs in C using MPI and/or OpenMP. It introduces a design methodology with coverage of the most important MPI functions and OpenMP directives. It also demonstrates, through a wide range of examples, how to develop parallel programs that will execute efficiently on today's parallel platforms. Abbreviated Publisher's Abstract.

Clay Breshears. 2009. The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications, O'Reilly. http://oreilly.com/catalog/9780596521547
This book contains numerous examples of applied OpenMP code. Written by an Intel engineer with over two decades of parallel and concurrent programming experience, The Art of Concurrency is one of the few resources to focus on implementing algorithms in the shared memory model of multicore processors, rather than just theoretical models or distributed memory architectures. The book provides detailed explanations and usable samples to help you transform algorithms from serial to parallel code, along with advice and analysis for avoiding mistakes that programmers typically make when first attempting these computations.

Rohit Chandra, Ramesh Menon, Leo Dagum, David Kohr, Dror Maydan, and Jeff McDonald. 2001. Parallel Programming in OpenMP, Morgan Kaufmann.
Aimed at the working researcher or scientific C/C++ or Fortran programmer, Parallel Programming in OpenMP both explains what the OpenMP standard is and how to use it to create software that takes full advantage of parallel computing. By adding a handful of compiler directives (or pragmas) in Fortran or C/C++, plus a few optional library calls, programmers can parallelize existing software without completely rewriting it. This book starts with simple examples of how to parallelize loops: iterative code that in scientific software might work with very large arrays. Sample code relies primarily on Fortran (the language of choice for high-end numerical software) with descriptions of the equivalent calls and strategies in C/C++. Abbreviated Publisher's Abstract.

Threading Building Blocks (TBB). Intel Threading Building Blocks (Intel TBB) is a threading library used to introduce parallelism into C/C++ code. TBB is a relatively easy way to introduce loop-level parallelism, especially for programmers familiar with templated code. TBB is available both as an open source project and as a commercial product from the Intel Corporation.
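A minimal sketch of TBB's loop-level style follows (the arrays and their contents are illustrative; the lambda form assumes TBB 2.2 or later with a C++0x-capable compiler, while older code uses a body class with an operator() instead):

#include <vector>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

int main() {
    std::vector<float> a(1000000, 1.0f), b(1000000, 2.0f), c(1000000);
    // parallel_for recursively splits the range and hands the pieces
    // to worker threads managed by the TBB runtime.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, c.size()),
        [&](const tbb::blocked_range<size_t> &r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                c[i] = a[i] + b[i];
        });
    return 0;
}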

Tutorials
Intel Threading Building Blocks Tutorial. http://www.threadingbuildingblocks.org/uploads/81/91/Latest%20Open%20Source%20Documentation/Tutorial.pdf.
Written by Intel Corporation, this is a thorough introduction to the threading library. This tutorial teaches you how to use Intel Threading Building Blocks (Intel TBB), a library that helps you leverage multicore performance without having to be a threading expert.

Multicoreinfo.com brings together a number of TBB tutorials. http://www.multicoreinfo.com/2009/07/parprog-part-6/.

Code examples

The TBB.org website contains a thorough set of coding examples. http://www.threadingbuildingblocks.org/codesamples.php.

Code examples from a recent coding-with-TBB contest. http://software.intel.com/en-us/articles/coding-with-intel-tbb-sweepstakes/.

Community sites
Contains product announcements (releases and updates), links to code samples, blogs, and forums on TBB. http://www.threadingbuildingblocks.org.

Intel site for the commercial version of Intel Threading Building Blocks. http://www.threadingbuildingblocks.com.

Books
James Reinders. 2007. Intel Threading Building Blocks, O'Reilly. http://oreilly.com/catalog/9780596514808
This guide explains how to maximize the benefits of multicore processors through a portable C++ library that works on Windows, Linux, Macintosh, and Unix systems. With it, you'll learn how to use Intel Threading Building Blocks (TBB) effectively for parallel programming, without having to be a threading expert. Written by James Reinders, Chief Evangelist of Intel Software Products, and based on the experience of Intel's developers and customers, this book explains the key tasks in multithreading and how to accomplish them with TBB in a portable and robust manner. Abbreviated Publisher's Abstract.

Clay Breshears. 2009. The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications, O'Reilly. http://oreilly.com/catalog/9780596521547
This book contains numerous examples of applied TBB code.

Distributed memory model programming. The preceding libraries and interfaces assume that the results from one thread of the overall computation can be made directly available to any other thread. However, some parallel hardware (such as clusters) forbids direct access from one memory space to another. Instead, processes must cooperate by sending each other messages containing the data to be exchanged.

Message passing interface (MPI). MPI is a library specification that supports message passing between program images running on distributed memory machines, typically clusters of some type. A number of different organizations develop and support implementations of the MPI standard, which specifies interfaces for C/C++ and FORTRAN; however, bindings for Perl, Python, and many other languages also exist. MPI provides routines that manage the transmission of data from the memory space of one process to the memory space of another process. Distributed memory machines require the use of MPI or another message-passing library by the parallel program in order to use multiple processes running on more than one node.

Getting started: Start with the six basic commands, exercised in the sketch below:

MPI_Init(): initialize the MPI world
MPI_Finalize(): terminate the MPI world
MPI_Comm_rank(): which process am I?
MPI_Comm_size(): how many processes exist?
MPI_Send(): send data
MPI_Recv(): receive data

Move on to more complex communication models as needed, that is, to collective communication (one-to-many, many-to-one, many-to-many) and/or advanced communication techniques: synchronous vs. asynchronous communication, blocking vs. non-blocking communication.
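Here is a minimal sketch using only those six calls (the message contents are illustrative): every process runs the same program image, and rank 0 collects one value from each of the other ranks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                  /* enter the MPI world */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes exist? */
    if (rank != 0) {
        /* Every non-zero rank sends its rank number to rank 0. */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 1; src < size; src++) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", value, src);
        }
    }
    MPI_Finalize();                          /* leave the MPI world */
    return 0;
}

Typically this is compiled with mpicc and launched with something like mpirun -np 4 ./a.out.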

Online Readings
Moodles with slides and code examples. NCSI parallel and distributed workshop. http://moodle.sc-education.org/course/category.php?id=17.

Tutorials
William Gropp, Rusty Lusk, Rob Ross, and Rajeev Thakur. 2005. Advanced MPI: I/O and One-Sided Communication. http://www.mcs.anl.gov/research/projects/mpi/tutorial/.

SuperComputing in Plain English (SIPE). http://www.oscer.ou.edu/Workshops/DistributedParallelism/sipe_distribmem_20090324.pdf.

Cheat sheets

http://wiki.sc-education.org/index.php/MPI_Cheat_Sheet.

Books

Peter Pacheco. 1997. Parallel Programming with MPI. http://www.amazon.com/Parallel-Programming-MPI-Peter-Pacheco/dp/1558603395
A hands-on introduction to parallel programming based on the Message Passing Interface (MPI) standard, the de facto industry standard adopted by major vendors of commercial parallel systems. This textbook/tutorial, based on the C language, contains many fully developed examples and exercises. The complete source code for the examples is available in both C and Fortran 77. Students and professionals will find that the portability of MPI, combined with a thorough grounding in parallel programming principles, will allow them to program any parallel system, from a network of workstations to a parallel supercomputer. Abbreviated Publisher's Abstract.

General purpose GPU programming. In contrast to the threading models presented earlier, accelerator-based hardware parallelism (like GPUs) focuses on the fact that, although results may be shareable, that is, can be sent from one part of a computation to another, the cost of accessing the memory may not be uniform: CPUs see CPU memory better than GPU memory, and vice versa.

The Open Compute Language (OpenCL) is an open standard for heterogeneous computing, developed by the Khronos OpenCL working group. Implementations are currently available from a broad selection of hardware and software vendors, including AMD, Apple, NVIDIA, and IBM.

OpenCL is intended as a low-level programming model, designed around the notion of a host application, commonly running on a CPU, driving a set of associated compute devices, where parallel computations can be performed.

A key design feature of OpenCL is its use of asynchronous command queues, associated with individual devices, that provide the ability to enqueue work (e.g., data transfers and parallel code execution) and build up complex graphs describing the dependencies between tasks.

The execution model supports both data parallel and task parallel styles, but OpenCL was developed specifically with an eye toward today's many-core, throughput-oriented, GPU-style architectures, and hence exposes a complex memory structure that provides a number of levels of software-managed memories. This is in contrast to the traditional single address space model of languages like C and C++, backed by large caches on general-purpose CPUs.

The OpenCL standard is currently in its second iteration, at version 1.1, and includes both a C API, for programming the host, and a new C++ Wrapper API, added in 1.1 and intended to be used for OpenCL C++ development. By exposing multiple address spaces, OpenCL provides a very powerful programming model to access the full potential of many-core architectures, but this comes at the cost of abstraction!

This is particularly true in the case of performance portability, and it is often difficult to achieve good performance on two different architectures with the same source code. This can be even more evident between the different types of OpenCL devices (e.g., GPUs and CPUs). This should not come as a surprise, as the OpenCL specification itself states that it is a low-level programming language, and given that these devices can have very different compute capabilities, careful tuning is often required to get close to peak performance.
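To make the host-driven model concrete, here is a highly condensed sketch of the canonical OpenCL C API flow: pick a device, build a kernel, create buffers, enqueue work on a command queue, and read back the result. (Error checking is omitted, and the vector-add kernel is an illustrative assumption.)

#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void vadd(__global const float *a,"
    "                   __global const float *b,"
    "                   __global float *c) {"
    "    size_t i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

int main(void) {
    enum { N = 1024 };
    float a[N], b[N], c[N];
    size_t n = N;
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Discover a platform and device; build a context and command queue. */
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Compile the kernel source for the chosen device. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    /* Create device buffers, copying the inputs from host memory. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(a), a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(b), b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);

    /* Enqueue the kernel over a 1D index space, then read the result back. */
    clSetKernelArg(k, 0, sizeof(da), &da);
    clSetKernelArg(k, 1, sizeof(db), &db);
    clSetKernelArg(k, 2, sizeof(dc), &dc);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);
    printf("c[1] = %f\n", c[1]);
    return 0;
}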

Online Resources

OpenCL 1.1 Specification (revision 33, June 11, 2010). http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf

OpenCL 1.1 C++ Wrapper API Specification (revision 4, June 14, 2010). http://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.1.pdf

OpenCL 1.1 Online Manual Pages. http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/

OpenCL Quick Reference Card. http://www.khronos.org/opencl/.

Tutorials

An excellent beginner's "hello world" tutorial introduction using OpenCL 1.1's C++ API. http://developer.amd.com/GPU/ATISTREAMSDK/pages/TutorialOpenCL.aspx

Videos

ATI Stream OpenCL Technical Overview Video Series. http://developer.amd.com/DOCUMENTATION/VIDEOS/OPENCLTECHNICALOVERVIEWVIDEOSERIES/Pages/default.aspx
This five-part video tutorial series provides an excellent introduction to the basics of OpenCL, including its execution and memory models, and the OpenCL C device programming language.

Community sites
The Khronos Group's main page, http://www.khronos.org, keeps track of major events around OpenCL and its other languages such as OpenGL and WebGL, along with some useful discussion forums on these. http://www.khronos.org/opencl.

The websites http://www.beyond3d.com and Mark Harris's http://www.gpgpu.org are full of information about many-core programming, in particular the modern GPUs of AMD and NVIDIA, and provide vibrant discussions on OpenCL and CUDA (the details of this language will follow), among other interesting areas and topics.

There is an ever-growing set of examples that can be found all over the web, and each of the major vendors provides excellent examples with their corresponding SDKs.

CUDA. Compute Unified Device Architecture (CUDA) is NVIDIA's parallel computing architecture, which enables acceleration in computing performance by harnessing the power of the GPU (graphics processing unit). CUDA is in essence a data parallel model, sharing a lot in common with the other popular GPGPU language, OpenCL: kernels (similar to functions) are executed over a 3D iteration space, and each index is executed concurrently, possibly in parallel.
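A minimal sketch of the kernel-over-iteration-space idea follows, written in CUDA's C/C++ dialect (the vector-add kernel and the launch configuration are illustrative assumptions):

#include <cstdio>

// Kernel: one GPU thread handles one index of the 1D iteration space.
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a = new float[n], *b = new float[n], *c = new float[n];
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Copy inputs to device memory, launch the kernel, and copy back.
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    // Enough 256-thread blocks to cover all n indices.
    vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);
    return 0;
}

This is compiled with NVIDIA's nvcc compiler, distributed with the CUDA SDK.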

Online Resources
NVIDIA maintains a collection of featured tutorials, presentations, and exercises on the CUDA Developer Zone. http://developer.nvidia.com/object/cuda_training.html.

Online Readings
For details on NVIDIA CUDA hardware and the underlying programming models, the following articles are relevant:

Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. 2008. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro 28, 2 (March), 39-55.

NVIDIA GF100. 2010. http://www.nvidia.com/object/IO_89569.htm. (Whitepaper.)

Code examples

The CUDA SDK includes numerous code examples, along with CUDA versions of popular libraries (cuBLAS and cuFFT). http://developer.nvidia.com/object/cuda_3_1_downloads.html

Books

David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann. http://www.amazon.com/dp/0123814723
This is a recent and popular textbook for teaching CUDA. This book shows both student and professional alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. Abbreviated Publisher's Abstract.

Hybrid parallel software architectures. Programs which use a hybrid parallel architecture combine two or more libraries/models/languages (described earlier in this section) into a single program; the motivation for this extra complexity is to allow a single parallel program image to harness additional computational resources.

The most common forms of hybrid models combine MPI with OpenMP or MPI with CUDA. MPI/OpenMP is appropriate for use on cluster resources where the nodes are multicore machines; MPI is used to move data and results among the distributed memories, and OpenMP is used to leverage the compute power of the cores on the individual nodes. MPI/CUDA is appropriate for use on cluster resources where the nodes are equipped with NVIDIA's GPGPU cards. Again, MPI is used to move data and results among the distributed memories, and CUDA is used to leverage the resources of each GPGPU card.

Online Resources

MPI/OpenMP: The Louisiana Optical Network Initiative (LONI) has a nice tutorial on building hybrid MPI/OpenMP applications. It can be found at https://docs.loni.org/wiki/Introduction_to_Programming_HybridApplications_UsingOpenMP_and_MPI. This includes pointers to LONI's OpenMP as well as MPI tutorials.

MPI/CUDA: The National Center for Supercomputing Applications (NCSA) has a tutorial that includes information about this; see the section Combining MPI and CUDA in http://www.ncsa.illinois.edu/UserInfo/Training/Workshops/CUDA/presentations/tutorialCUDA.html.
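A minimal sketch of the MPI/OpenMP combination (the printed message is illustrative): MPI starts one process per node, and within each process OpenMP fans work out across that node's cores. MPI calls here are made only outside the OpenMP parallel region, which is the simplest and safest hybrid pattern.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);                /* one process per node */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Within the process, OpenMP threads use the node's cores. */
    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }
    MPI_Finalize();
    return 0;
}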

Parallel Languages

We touch here only briefly on the topic of inherently parallel languages. There are a variety of efforts, ranging from extensions to existing languages to radically new approaches. Many, such as Cilk or UPC, can be relatively easily understood in terms of the libraries and techniques described above. A fuller discussion of the variety of language efforts and tools will appear in a future Tech Pack. Because it is sufficiently different, however, a quick look at how parallelism is incorporated into functional languages is helpful.


Concurrent ML and Concurrent Haskell: Functional programming languages, such as Standard ML and Haskell, provide a strong foundation for building concurrent and parallel programming abstractions, for the single reason that they are declarative. Being declarative in a parallel world, that is, avoiding the issues (e.g., race conditions) that updating global state in a shared memory model can cause, provides a strong foundation on which to build concurrency abstractions.

Concurrent ML is a high-level message-passing language that supports the construction of first-class synchronous abstractions called events, embedded into Standard ML. It provides a rich set of concurrency mechanisms built on the notion of spawning new threads that communicate via channels.

Concurrent Haskell is an extension to the functional language Haskell for describing the creation of threads that have the potential to execute in parallel with other computations. Unlike Concurrent ML, Concurrent Haskell provides a limited form of shared memory, introducing MVars (mutable variables), which can be used to atomically communicate information between threads. Unlike more relaxed shared memory models (e.g., OpenMP and OpenCL, discussed earlier), Concurrent Haskell's runtime system ensures that the operations for reading from and writing to MVars occur atomically.

Tutorials
Simon Peyton Jones and Satnam Singh. A Tutorial on Parallel and Concurrent Programming in Haskell. Lecture Notes from Advanced Functional Programming Summer School 2008. http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel/AFP08-notes.pdf

Books


John H. Reppy. 1999. Concurrent Programming in ML, Cambridge University Press.

Simon Peyton Jones. 2007. Beautiful Concurrency. In Beautiful Code, edited by Greg Wilson, O'Reilly. http://research.microsoft.com/en-us/um/people/simonpj/papers/stm/index.htm#beautiful

5. TOOLS

There is a variety of tools available to assist programmers in creating, debugging, and running parallel codes. This section summarizes the categories of tools; a more exhaustive list of tools that run on different hardware and software platforms will be included in a subsequent addition to the Tech Pack.

Compilers

Many of the compiler families today, both commercial and open source, directly support some form of explicit parallelism (OpenMP, threads, etc.).

Auto-parallelization

The holy grail of many-core support would be a compiler that could automatically extract parallelism at compile time. Unfortunately, this is still a work in progress. That said, a number of compilers can add utility through vectorization and the identification of obvious parallelism in simple loops.

Online Resources

http://en.wikipedia.org/wiki/Automatic_parallelization.

Thread debuggers

Intel Thread Checker: http://software.intel.com/en-us/intel-thread-checker/.

Intel Parallel Inspector: http://software.intel.com/en-us/intel-parallel-inspector/.

Microsoft Visual Studio 2010 tools: http://www.microsoft.com/visualstudio/en-us/.

Helgrind: http://valgrind.org/docs/manual/hg-manual.html. A Valgrind tool for detecting synchronization errors in C, C++, and FORTRAN programs that use the POSIX pthreads threading primitives. The main abstractions in POSIX pthreads are a set of threads sharing a common address space, thread creation, thread joining, thread exit, mutexes (locks), condition variables (inter-thread event notifications), reader-writer locks, spinlocks, semaphores, and barriers.

Tuners/performance profilers

Intel VTune Performance Analyzer & Intel Thread Profiler 3.1 for Windows. The Thread Profiler component of VTune helps tune multithreaded applications for performance. The Intel Thread Profiler timeline view shows what the threads are doing and how they interact. http://software.intel.com/en-us/intel-vtune/.

Intel Parallel Amplifier. A tool to help find multicore performance bottlenecks without needing to know the processor architecture or assembly code. http://software.intel.com/en-us/intel-parallel-amplifier/.

Microsoft Visual Studio 2010 tools. http://www.microsoft.com/visualstudio/en-us/.

gprof: the GNU Profiler. http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html.

Memory tools

Hoard: http://www.hoard.org/. The Hoard memory allocator is a fast, scalable, and memory-efficient memory allocator. It runs on a variety of platforms, including Linux, Solaris, and Windows. Hoard is a drop-in replacement for malloc() that can dramatically improve application performance, especially for multithreaded programs running on multiprocessors. No change to your source is necessary. Just link it in or set just one environment variable.
