
Introduction to MapReduce

ECE7610

The Age of Big-Data


Big-data age:
  Facebook collects 500 terabytes a day (2011)
  Google collects 20,000 PB a day (2011)
Data is an important asset to any organization
  Finance companies, insurance companies, internet companies
We need new algorithms, data structures, and programming models

What to do? (Word Count)


Consider a large data collection and count the occurrences of the different words.

Example collection: {web, weed, green, sun, moon, land, part, web, green, ...}

[Diagram: a single Main thread drives one WordCounter (parse(), count()) over the
DataCollection, filling a ResultTable: web=2, weed=1, green=2, sun=1, moon=1, land=1, part=1]
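The single-threaded design in the diagram (one WordCounter with parse() and count()) can be sketched in plain Java. The class and method names mirror the diagram; everything else is illustrative:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sequential word count: one thread parses the whole collection
// and updates a single result table.
public class WordCounter {
    // parse(): split the raw collection into tokens
    public static List<String> parse(String data) {
        List<String> words = new ArrayList<>();
        for (String w : data.trim().split("\\s+")) {
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }

    // count(): build the ResultTable (word -> occurrences)
    public static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String w : words) {
            result.merge(w, 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        String data = "web weed green sun moon land part web green";
        Map<String, Integer> table = count(parse(data));
        System.out.println(table); // {web=2, weed=1, green=2, sun=1, moon=1, land=1, part=1}
    }
}
```

This works, but only one processor ever touches the data — the motivation for the variants on the next slides.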

What to do? (Word Count)

Multi-threaded: lock on shared data.

[Diagram: Main spawns 1..* WordCounter threads (parse(), count()) over the shared
DataCollection; all threads update a single shared ResultTable under a lock:
web=2, weed=1, green=2, sun=1, moon=1, land=1, part=1]
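A minimal sketch of the multi-threaded variant, assuming a fixed thread count and a single lock guarding the shared result table (names are illustrative, not from the slides' code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Multi-threaded word count: 1..* worker threads share one result table
// and must take a lock on every update ("lock on shared data").
public class ThreadedWordCount {
    public static Map<String, Integer> count(List<String> words, int nThreads) {
        Map<String, Integer> result = new HashMap<>();
        Object lock = new Object();                      // guards the shared ResultTable
        List<Thread> workers = new ArrayList<>();
        int chunk = (words.size() + nThreads - 1) / nThreads;

        for (int t = 0; t < nThreads; t++) {
            int from = Math.min(words.size(), t * chunk);
            int to = Math.min(words.size(), from + chunk);
            Thread worker = new Thread(() -> {
                for (int i = from; i < to; i++) {
                    synchronized (lock) {                // every update contends here
                        result.merge(words.get(i), 1, Integer::sum);
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> data = List.of("web", "weed", "green", "sun", "moon",
                                    "land", "part", "web", "green");
        System.out.println(count(data, 3)); // same counts as the sequential version
    }
}
```

Note how every update serializes on the one lock — the shared result table is the bottleneck, which motivates the "separate counters, separate data" design next.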

What to do? (Word Count)

A single machine cannot serve all the data: you need a distributed, special (file) system.
  Large number of commodity hardware disks: say, 1000 disks, 1 TB each
  Critical aspects: fault tolerance + replication + load balancing, monitoring
  Exploit the parallelism afforded by splitting parsing and counting
  Provision and locate computing at the data locations

What to do? (Word Count)

Separate data, separate counters.

[Diagram: the data collection is broken into several splits; Main spawns 1..* Parser
threads and 1..* Counter threads; each parser builds a WordList from its split, and the
counters aggregate into the ResultTable: web=2, weed=1, green=2, sun=1, moon=1, land=1, part=1]


[Diagram: per-split KEY/VALUE tables with keys web, weed, green, sun, moon, land, part]

It is not easy to parallelize...


Different programming models
  Message passing, shared memory
Fundamental issues
  Scheduling, data distribution, synchronization, interprocess communication, robustness, fault tolerance, ...
Architectural issues
  Flynn's taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, cache coherence, ...
Different programming constructs
  Mutexes, condition variables, barriers, masters/slaves, producers/consumers, work queues, ...
Common problems
  Livelock, deadlock, data starvation, priority inversion, dining philosophers, sleeping barbers, cigarette smokers, ...

Actually, a programmer's nightmare.



MapReduce: Automates It for You

An important distributed parallel programming paradigm for large-scale applications, and
one of the core technologies powering big IT companies like Google, IBM, Yahoo and Facebook.
The framework runs on a cluster of machines, automatically partitions a job into a number
of small tasks, and processes them in parallel.
Features: fairness, task data locality, fault-tolerance.

MapReduce

MAP: input data → <key, value> pairs

Split the data to supply multiple processors: Data Collection split 1, split 2, ..., split n,
with one Map task per split.

[Diagram: each Map task emits <key, 1> pairs, e.g. KEY = web, weed, green, sun, moon,
land, part; VALUE = 1 for each occurrence]
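For word count, the MAP step (input data → <key, value> pairs) just emits <word, 1> for every token in its split. A minimal sketch (class and method names are illustrative):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// MAP: turn one input split into a list of <key, value> pairs.
// For word count the value is always 1.
public class MapPhase {
    public static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1)); // emit <word, 1>
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(map("web weed web")); // [web=1, weed=1, web=1]
    }
}
```

Each Map task runs this independently on its own split, so no coordination is needed during the map phase.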

MapReduce

MAP: input data → <key, value> pairs
REDUCE: <key, value> pairs → <result>

Split the data to supply multiple processors: each split feeds a Map task, and the Map
outputs feed the Reduce tasks.
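The whole pipeline above — map each split to <word, 1> pairs, group the pairs by key, then reduce each key's values by summing — can be sketched end-to-end in plain Java. This is a single-process simulation of what the framework does across a cluster; all names are illustrative:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// End-to-end word count in the MapReduce style:
// map each split to <word, 1> pairs, shuffle pairs by key,
// then reduce each key's list of values by summing it.
public class MiniMapReduce {

    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : split.trim().split("\\s+")) {
            if (!w.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return pairs;
    }

    // Shuffle: group all values with the same key together.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce: <key, [v1, v2, ...]> -> <key, sum>
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static Map<String, Integer> run(List<String> splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) mapOutputs.add(map(split)); // on a cluster: in parallel
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapOutputs).forEach((k, vs) -> result.put(k, reduce(vs)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("web weed green", "sun moon land", "part web green")));
        // {green=2, land=1, moon=1, part=1, sun=1, web=2, weed=1}
    }
}
```

The framework's value is that the `for` loops over splits and keys become parallel tasks scheduled across machines, with the shuffle done over the network.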

Large-scale data splits

Map <key, 1> → Reducers (say, Count)

[Diagram: each split goes through parse-hash, so pairs with the same key are routed to
the same Count reducer]
C. Xu @ Wayne State


MapReduce


How to store the data?

[Diagram: racks of compute nodes]

What's the problem here?

Distributed File System

Don't move data to workers! Move workers to the data!
  Store data on the local disks of nodes in the cluster
  Start up the workers on the node that has the data local

Why?
  Not enough RAM to hold all the data in memory
  The network is the bottleneck; disk throughput is good

A distributed file system is the answer:
  GFS (Google File System)
  HDFS for Hadoop

GFS/HDFS Design

Commodity hardware over "exotic" hardware
  High component failure rates
Files stored as chunks
  Fixed size (64 MB)
Reliability through replication
  Each chunk replicated across 3+ chunk servers
Single master to coordinate access and keep metadata
  Simple centralized management
No data caching
  Little benefit due to large data sets, streaming reads
Simplified API
  Push some of the issues onto the client
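The chunking-plus-replication bookkeeping can be illustrated with a toy placement routine: split a file into fixed 64 MB chunks and assign each chunk to 3 servers round-robin. Real GFS/HDFS placement also considers racks, load, and free space; this only shows the idea:

```java
import java.util.ArrayList;
import java.util.List;

// Toy chunk placement: split a file into fixed 64 MB chunks and assign each
// chunk to 3 chunk servers round-robin.
public class ChunkPlacement {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB
    static final int REPLICAS = 3;

    public static List<List<Integer>> place(long fileSize, int nServers) {
        int nChunks = (int) ((fileSize + CHUNK_SIZE - 1) / CHUNK_SIZE);
        List<List<Integer>> placement = new ArrayList<>();
        for (int c = 0; c < nChunks; c++) {
            List<Integer> servers = new ArrayList<>();
            for (int r = 0; r < REPLICAS; r++) {
                // consecutive server ids, so replicas of one chunk are distinct
                servers.add((c * REPLICAS + r) % nServers);
            }
            placement.add(servers);
        }
        return placement;
    }

    public static void main(String[] args) {
        // A 200 MB file needs ceil(200/64) = 4 chunks, each on 3 of 10 servers.
        System.out.println(place(200L * 1024 * 1024, 10));
        // [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]
    }
}
```

Losing any one server still leaves 2 replicas of every chunk it held, which is why a high component failure rate is tolerable.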

(F1/?DF1


MapReduce Data Locality

Master scheduling policy:
  Asks HDFS for the locations of replicas of the input file blocks
  Map tasks are typically split into 64 MB pieces (== GFS block size)
  Locality levels: node locality / rack locality / off-rack
  Each map task is scheduled as close to its input data as possible

Effect: thousands of machines read input at local disk speed. Without this, rack switches
limit the read rate and network bandwidth becomes the bottleneck.
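The three locality levels can be expressed as a small classifier the scheduler might consult when deciding whether a free worker should get a given map task. The names and data shapes here are illustrative, not Hadoop's actual API:

```java
import java.util.List;
import java.util.Map;

// Toy locality classifier for scheduling a map task: prefer a worker holding
// a replica of the input block (node-local), then a worker in the same rack
// as some replica (rack-local), else off-rack.
public class LocalityLevel {
    public static String classify(String worker,
                                  List<String> replicaNodes,
                                  Map<String, String> rackOf) {
        if (replicaNodes.contains(worker)) return "node-local";
        for (String replica : replicaNodes) {
            if (rackOf.get(replica).equals(rackOf.get(worker))) return "rack-local";
        }
        return "off-rack";
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = Map.of("n1", "r1", "n2", "r1", "n3", "r2");
        List<String> replicas = List.of("n1");    // the block lives on n1
        System.out.println(classify("n1", replicas, rackOf)); // node-local
        System.out.println(classify("n2", replicas, rackOf)); // rack-local
        System.out.println(classify("n3", replicas, rackOf)); // off-rack
    }
}
```

A scheduler preferring node-local over rack-local over off-rack assignments is what lets thousands of machines read at local disk speed.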

MapReduce Fault-Tolerance

Reactive way

Worker failure
  Heartbeat: workers are periodically pinged by the master
    No response → failed node
  If the processor of a worker fails, the tasks of that worker are reassigned to another worker.

Master failure
  The master writes periodic checkpoints
  Another master can be started from the last checkpointed state
  If the master eventually dies, the job will be aborted
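The heartbeat mechanism can be sketched as a tiny monitor the master keeps: record each worker's last ping time, and declare any worker failed (so its tasks can be reassigned) once no heartbeat has arrived within the timeout. This is an illustrative sketch, not the framework's code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy heartbeat tracker: the master records the last ping time of each
// worker and declares a worker failed once no heartbeat has arrived
// within the timeout window.
public class HeartbeatMonitor {
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final long timeoutMillis;

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void onHeartbeat(String worker, long now) {
        lastHeartbeat.put(worker, now);
    }

    // Workers whose tasks should be reassigned to other workers.
    public List<String> failedWorkers(long now) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > timeoutMillis) failed.add(e.getKey());
        }
        return failed;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10_000);
        monitor.onHeartbeat("worker-1", 0);
        monitor.onHeartbeat("worker-2", 0);
        monitor.onHeartbeat("worker-2", 8_000);            // worker-2 keeps pinging
        System.out.println(monitor.failedWorkers(15_000)); // [worker-1]
    }
}
```

Map tasks are idempotent and their inputs are replicated in the file system, which is what makes this "just rerun it elsewhere" strategy safe.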

MapReduce Fault-Tolerance

Proactive way (Speculative Execution)

The problem of "stragglers" (slow workers)
  Other jobs consuming resources on the machine
  Bad disks with soft errors transfer data very slowly
  Weird things: processor caches disabled (!!)

When the computation is almost done, reschedule in-progress tasks.
Whenever either the primary or the backup execution finishes, mark the task as completed.

MapReduce Scheduling

Fair Sharing: conducts fair scheduling, using a greedy method to maintain data locality
Delay: uses a delay scheduling algorithm to achieve good data locality by slightly
compromising the fairness restriction
LATE (Longest Approximate Time to End): improves MapReduce applications' performance in
heterogeneous environments, such as virtualized environments, through accurate
speculative execution
Capacity: introduced by Yahoo; supports multiple queues for shared users and guarantees
each queue a fraction of the capacity of the cluster

MapReduce Cloud Service

Providing MapReduce frameworks as a service in clouds has become an attractive usage
model for enterprises. A MapReduce cloud service allows users to cost-effectively access
a large amount of computing resources without creating their own cluster. Users are able
to adjust the scale of their MapReduce clusters in response to changes in the resource
demand of applications.

Amazon Elastic MR

0. Allocate Hadoop cluster (EC2)
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
   4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
7. Clean up!

[Diagram: You ↔ Hadoop cluster running on EC2]

New Challenges

Interference between co-hosted VMs
  Slows down the job 1.5-7 times
Locality-preserving policy no longer effective
  Lose more than 20% locality (depends)

Need a specifically designed scheduler for virtual MapReduce clusters:
  Interference-aware
  Locality-aware

MapReduce Programming

Hadoop: an implementation of MR in Java (version 1.0.4)
WordCount example: hadoop-1.0.4/src/examples/org/apache/hadoop/examples/WordCount.java

MapReduce )rogra**ing


Map

Implement your own map class extending the Mapper class.
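Hadoop's real Mapper lives in org.apache.hadoop.mapreduce and needs the Hadoop jars on the classpath, so here is a self-contained stand-in with the same shape: a generic base class whose map() you override, emitting pairs through a context object. SimpleMapper, Context, and TokenizerMapper below are our simplified imitations, not Hadoop's classes:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified stand-in for Hadoop's Mapper: extend the base class and
// override map(), writing <key, value> pairs through the context.
abstract class SimpleMapper<KIN, VIN, KOUT, VOUT> {
    public static class Context<K, V> {
        public final List<Map.Entry<K, V>> output = new ArrayList<>();
        public void write(K key, V value) {
            output.add(new AbstractMap.SimpleEntry<>(key, value));
        }
    }
    public abstract void map(KIN key, VIN value, Context<KOUT, VOUT> context);
}

// WordCount's mapper: tokenize the input line and emit <word, 1>.
public class TokenizerMapper extends SimpleMapper<Long, String, String, Integer> {
    @Override
    public void map(Long offset, String line, Context<String, Integer> context) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) context.write(word, 1);
        }
    }

    public static void main(String[] args) {
        Context<String, Integer> ctx = new Context<>();
        new TokenizerMapper().map(0L, "web weed web", ctx);
        System.out.println(ctx.output); // [web=1, weed=1, web=1]
    }
}
```

In real Hadoop the key is the byte offset of the line (LongWritable), the value is the line (Text), and the mapper emits Text/IntWritable pairs — the structure is the same.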

Reduce

Implement your own reducer class extending the Reducer class.
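The reducer side has the same shape: extend a base class and override reduce(), which receives one key together with all values grouped under it. As above, SimpleReducer and IntSumReducer are simplified stand-ins for Hadoop's org.apache.hadoop.mapreduce.Reducer, not its real API:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-in for Hadoop's Reducer: extend the base class and
// override reduce(), which gets a key and all values grouped under it.
abstract class SimpleReducer<KIN, VIN, KOUT, VOUT> {
    public static class Context<K, V> {
        public final Map<K, V> output = new TreeMap<>();
        public void write(K key, V value) { output.put(key, value); }
    }
    public abstract void reduce(KIN key, List<VIN> values, Context<KOUT, VOUT> context);
}

// WordCount's reducer: sum all the 1s emitted for a word.
public class IntSumReducer extends SimpleReducer<String, Integer, String, Integer> {
    @Override
    public void reduce(String word, List<Integer> counts, Context<String, Integer> context) {
        int sum = 0;
        for (int c : counts) sum += c;
        context.write(word, sum);
    }

    public static void main(String[] args) {
        Context<String, Integer> ctx = new Context<>();
        new IntSumReducer().reduce("web", List.of(1, 1, 1), ctx);
        System.out.println(ctx.output); // {web=3}
    }
}
```

Because addition is associative, the same class can also serve as a combiner running on the map side to shrink shuffle traffic — which is exactly how the Hadoop WordCount example uses its IntSumReducer.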

Main()

De*o
