
8/25/2015

Hadoop vs Spark 2015 - Who looks the big winner in the big data world?

Hadoop MapReduce vs. Apache Spark - Who Wins the Battle?
12 Nov 2014
There are several workloads in the world of big data, such as iterative data processing, interactive queries and ad-hoc queries, for which Apache Hadoop is not the perfect choice. Every Hadoop user is aware of the fact that the Hadoop MapReduce framework is meant mainly for batch processing, and thus using Hadoop MapReduce for machine learning processes, ad-hoc data exploration and other similar workloads is not apt.
Most of the big data vendors have been making efforts to find an ideal solution to this challenging problem, which has paved the way for the advent of a very popular alternative named Apache Spark. Spark makes development a far more pleasurable activity and has a better-performing execution engine than MapReduce, while using the same storage engine, Hadoop HDFS, for processing huge data sets.

Apache Spark has gained great hype in the past few months and is now regarded as the most active project in the Hadoop ecosystem. Before we get into further discussion on what empowers Apache Spark over Hadoop MapReduce, let us get a brief understanding of what Apache Spark actually is, and then move on to understanding the differences between the two.

Introduction to the User-Friendly Face of Hadoop - Apache Spark

Spark is a fast cluster computing system developed through the contributions of nearly 250 developers from 50 companies at UC Berkeley's AMPLab, with the aim of making data analytics faster, and easier both to write and to run.
Apache Spark is open source and available for free download, making it a user-friendly face of distributed programming for big data. Spark follows a general execution model that supports in-memory computing and optimization of arbitrary operator graphs, so that querying data becomes much faster when compared to disk-based engines like MapReduce.
Apache Spark has a well-designed application programming interface that consists of various parallel collections with methods such as groupByKey, map and reduce, so that you get a feel as though you are programming locally. With Apache Spark you can write collection-oriented algorithms using the functional programming language Scala.
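To make the collection-oriented style concrete, here is a minimal plain-Python sketch (not Spark itself; `group_by_key` is an invented local stand-in for Spark's groupByKey) of a small map/group/reduce computation:

```python
from collections import defaultdict
from functools import reduce

def group_by_key(pairs):
    """Group (key, value) pairs into {key: [values]} -- a local stand-in
    for Spark's groupByKey operation."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Page-view records: (page, seconds spent on the page).
visits = [("home", 3), ("docs", 40), ("home", 7), ("docs", 20)]

# "Program locally": group by key, then reduce each group to a total.
grouped = group_by_key(visits)
totals = {page: reduce(lambda a, b: a + b, secs) for page, secs in grouped.items()}
print(totals)  # {'home': 10, 'docs': 60}
```

In Spark the same chain of operations would run over a distributed collection, but the code reads the same way as this local version.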

Why was Apache Spark developed?
Hadoop MapReduce, which was envisioned at Google and successfully implemented in Apache Hadoop, is an extremely famous and widely used execution engine. You will find several applications that are on familiar terms with how to decompose their work into a sequence of MapReduce jobs, and all these applications will be able to continue operating without any change.
However, users have been consistently complaining about the high-latency problem with Hadoop MapReduce, stating that the batch-mode response is highly painful for applications that need to process and analyze data in real time.

Now this paved the way for Apache Spark, a successor system that is more powerful and flexible than Hadoop MapReduce. Even though it might not be possible for all existing or future applications to completely abandon Hadoop MapReduce, there is scope for most future applications to make use of a general-purpose execution engine such as Spark, which comes with many more innovative features, to accomplish much more than is possible with Hadoop MapReduce.

Apache Spark vs Hadoop - What makes Spark superior over Hadoop?
Apache Spark is an open source standalone project that was developed to function together with HDFS. By now Apache Spark has a huge community of vocal contributors and users, for the reason that programming with Spark using Scala is much easier, and Spark is much faster than the Hadoop MapReduce framework both on disk and in memory.

Thus, Spark is just the apt choice for future big data applications that would possibly require lower-latency queries, iterative computation and real-time processing on similar data.

Spark has lots of advantages over the Hadoop MapReduce framework in terms of the wide range of computing workloads it can deal with and the speed at which it executes batch processing jobs.
Click here to know more about our IBM Certified Hadoop Developer course

Difference between MapReduce and Spark

[Image: MapReduce vs Spark comparison. Src: www.tapad.com]

i) Hadoop vs Spark - Performance
Spark has been said to execute batch processing jobs about 10 to 100 times faster than the Hadoop MapReduce framework, merely by cutting down on the number of reads and writes to disk.
In the case of MapReduce there are Map and Reduce tasks, subsequent to which there is a synchronization barrier at which the data needs to be preserved to disk. This feature of the MapReduce framework was developed with the intent that jobs can be recovered in case of failure, but the drawback is that it does not leverage the memory of the Hadoop cluster to the maximum.

Nevertheless, with Spark the concept of RDDs (Resilient Distributed Datasets) lets you keep data in memory and persist it to disk only if required, and there are no synchronization barriers that could possibly slow down the process. Thus the general execution engine of Spark is much faster than Hadoop MapReduce thanks to this use of memory.
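As a rough illustration of the lazy, in-memory chaining that RDDs enable, here is a toy single-machine sketch in plain Python (`MiniRDD` is an invented name; real RDDs are distributed and fault tolerant):

```python
class MiniRDD:
    """A toy, single-machine illustration of lazy, in-memory chaining.

    It only mimics how transformations are composed without materializing
    intermediate results to disk between steps.
    """

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # pending transformations, applied lazily

    def map(self, fn):
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        """The 'action': only now is the whole pipeline evaluated, in memory."""
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Note that no intermediate list of squares is ever written out; the two steps are fused into one in-memory pass, which is the contrast with MapReduce's write-to-disk barrier between stages.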
ii) Hadoop MapReduce vs Spark - Easy Management
It is now easy for organizations to simplify their data processing infrastructure, as with Spark it is possible to perform streaming, batch processing and machine learning all in the same cluster.

Most real-time applications use Hadoop MapReduce for generating reports that help in finding answers to historical queries, and then deploy a different system that deals with stream processing so as to get key metrics in real time. Thus organizations have to manage and maintain separate systems and then develop applications for both computational models.
However, with Spark all of this complexity can be eliminated, as it is possible to implement both stream and batch processing on the same system, which simplifies the development, deployment and maintenance of applications. With Spark it is also possible to control different kinds of workloads, so if there is an interaction between various workloads in the same process, it is easier to manage and secure such workloads, which is a limitation with MapReduce.
iii) Spark vs MapReduce - Real-Time Method to Process Streams
In the case of Hadoop MapReduce you only get to process a batch of stored data, but with Spark it is also possible to process data in real time through Spark Streaming.
With Spark Streaming it is possible to pass data through various software functions, for instance performing data analytics as and when it is collected.

Developers can now also make use of Apache Spark for graph processing, which maps the relationships in data among various entities such as people and objects. Organizations can also use Apache Spark with predefined machine learning code libraries so that machine learning can be performed on data stored in various Hadoop clusters.
iv) Spark vs MapReduce - Caching
Spark ensures lower-latency computations by caching partial results in the memory of its distributed workers, unlike MapReduce, which is completely disk oriented. Spark is slowly turning out to be a huge productivity boost in comparison to writing complex Hadoop MapReduce pipelines.
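The benefit of caching for iterative jobs can be sketched in plain Python (a deliberately simplified stand-in; `expensive_base_dataset` is invented and merely simulates a costly scan that a disk-oriented engine would redo from disk on every pass):

```python
compute_calls = 0  # counts how often the "expensive" dataset is rebuilt

def expensive_base_dataset():
    """Stands in for a costly scan or recomputation of the base data."""
    global compute_calls
    compute_calls += 1
    return [x * 2 for x in range(1000)]

# Without caching: 5 iterations trigger 5 full recomputations.
for _ in range(5):
    total = sum(expensive_base_dataset())
assert compute_calls == 5

# With caching: materialize once (like rdd.cache() plus a first action),
# then every later iteration hits memory.
compute_calls = 0
cached = expensive_base_dataset()
for _ in range(5):
    total = sum(cached)
print(compute_calls, total)  # 1 999000
```

The iterative algorithms common in machine learning, which re-read the same training data many times, are exactly the workloads where this difference compounds.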

v) Spark vs MapReduce - Ease of Use
Writing Spark code is always more compact than writing Hadoop MapReduce code. Here is a Spark MapReduce example: the images below show the word count program code in Spark and in Hadoop MapReduce. If we look at the images, it is clearly evident that the Hadoop MapReduce code is more verbose and lengthy.
[Image: Spark MapReduce Example - Wordcount Program in Spark]
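The original code screenshots did not survive this copy. As a rough stand-in, here is the same word-count logic in plain Python, condensed to the flatMap / map / reduceByKey shape that the Spark version follows (Spark's own API would typically be written in Scala):

```python
from collections import Counter

lines = ["to be or not to be", "to do or not to do"]

# flatMap: split every line into words; map + reduceByKey: count each word.
words = [word for line in lines for word in line.split()]
counts = Counter(words)
print(dict(counts))  # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'do': 2}
```

The equivalent Hadoop MapReduce program needs a Mapper class, a Reducer class and job-configuration boilerplate in Java, which is the verbosity the article is pointing at.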

[Image: Spark MapReduce Example - Wordcount Program in Hadoop MapReduce]

Spark MapReduce Comparison - The Bottom Line

- Hadoop MapReduce is meant for data that does not fit in memory, whereas Apache Spark has better performance for data that fits in memory, particularly on dedicated clusters.
- Hadoop MapReduce can be an economical option because of the Hadoop-as-a-service offering (HaaS) and the availability of more personnel. According to the benchmarks, Apache Spark is more cost effective, but staffing would be more expensive in the case of Spark.
- Apache Spark and Hadoop MapReduce are both failure tolerant, but comparatively Hadoop MapReduce is more failure tolerant than Spark.
- Spark and Hadoop MapReduce both have similar compatibility in terms of data types and data sources.
- Programming in Apache Spark is easier as it has an interactive mode, whereas Hadoop MapReduce requires core Java programming skills; however, there are several utilities that make programming in Hadoop MapReduce easier.
Will Apache Spark Eliminate Hadoop MapReduce?
Hadoop MapReduce is condemned by many users as a logjam in Hadoop clustering, for the reason that MapReduce executes all jobs in batch mode, which implies that analyzing data in real time is not possible. With the advent of Spark, which has proven to be a great alternative to Hadoop MapReduce, the biggest question on the minds of data scientists is: Hadoop vs. Spark, who wins the battle?

Apache Spark executes jobs in micro-batches that are very short, say approximately 5 seconds or less. Over time, Apache Spark has been successful in providing more stability when compared to real-time, stream-oriented Hadoop frameworks.
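A toy sketch of the micro-batch idea in plain Python (`micro_batches` is an invented helper; real Spark Streaming additionally handles distribution, state and fault tolerance): timestamped events are grouped into fixed 5-second windows, and each window is processed as one small batch.

```python
def micro_batches(events, window=5):
    """Group (timestamp, value) events into consecutive `window`-second
    batches, each of which would be processed as one small batch job."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // window, []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0, "a"), (1, "b"), (4, "c"), (6, "d"), (11, "e")]
print(micro_batches(events))  # [['a', 'b', 'c'], ['d'], ['e']]
```

Each inner list is one micro-batch; shrinking the window trades throughput for lower latency, which is the knob the article's "approximately 5 seconds or less" refers to.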

Nevertheless, every coin has two faces, and so Spark too comes with some drawbacks, such as the inability to handle cases where the intermediate data is greater than the memory size of the node, problems in case of node failure, and most important of all, the cost factor.
Spark makes use of journaling (also known as recomputation) for providing resiliency in case of node failure; as a result we can conclude that the recovery behavior on node failure is similar to that of Hadoop MapReduce, except that the recovery process is much faster.
Spark also has a spill-to-disk feature: if for a particular node there is insufficient RAM for storing the data partitions, it provides graceful degradation to disk-based data handling. When it comes to cost, with street RAM prices at 5 USD per GB, we can have about 1 TB of RAM for 5K USD, making memory a very minor fraction of the overall node cost.
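The cost arithmetic above checks out; a quick sketch using the article's quoted street price:

```python
price_per_gb_usd = 5   # street price of RAM quoted in the article
ram_gb = 1024          # roughly 1 TB
cost = price_per_gb_usd * ram_gb
print(cost)  # 5120, i.e. about 5K USD for roughly 1 TB of RAM
```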

One great advantage Hadoop MapReduce has over Apache Spark is that if the data size is greater than memory, then under such circumstances Apache Spark will not be able to leverage its cache, and it is very likely to be far slower than the batch processing of MapReduce.
Confused - Hadoop vs. Spark: Which One to Choose?
If the question leaving you confused is whether to choose Hadoop MapReduce or Apache Spark, or rather disk-based computing or RAM-based computing, then the answer is straightforward: it all depends, and the variables on which this decision depends keep changing dynamically with time.

Nevertheless, the current trends are in favor of in-memory techniques like Apache Spark, as the industry seems to be giving positive feedback for them. So, to conclude, we can state that the choice of Hadoop MapReduce vs. Apache Spark depends on the use case, and we cannot make an autonomous choice.
Hadoop vs Spark 2015 - Who looks the big winner in the big data world? Let us know in the comments below!

4 Comments

SAnnuAne - 2 months ago
superrrrr

Sudip - 8 months ago
important article

Stratos - 9 months ago
Good point regarding the shorter time that MR will need to recover compared to Spark. But what about the performance? Have you seen the recent record that Spark set on sorting without using in-memory at all? https://databricks.com/blog/20... Can you think of any use case (maybe other than sorting) where MapReduce will perform better even when the working data set size is way bigger than the available memory?

MrinalSinghi - 9 months ago
Good article
