
Instructor:
  Dr. H Bo Quc

Student group:
  inh Th Lng    1011036
  on Cao Ngha   1011043
  Hng Xun Vin   1011067

Contents

Introduction
  Practical needs
  What is Hadoop?
  Development history

Hadoop components
  Hadoop Common, HDFS, MapReduce

Practical needs

Storing and processing data at exabyte scale (1 exabyte = 10^18 bytes)
  Reading and transferring the data is very slow
Many low-cost storage nodes are required
  Node hardware failures happen every day
  Cluster size is not fixed
A common infrastructure is needed
  Efficient and reliable

Two-tier architecture

The nodes are PCs
The nodes are grouped into racks (about 40 PCs per rack)

What is Hadoop?

A software platform that supports distributed applications working on very large data sets.
Scale: terabytes of data, thousands of nodes.
Components:
  Storage: HDFS (Hadoop Distributed File System)
  Processing: MapReduce
Supports the Map/Reduce programming model

Development history

2002-2004: Doug Cutting introduces Nutch
12/2004: the papers on GFS and MapReduce are published
05/2005: Nutch uses MapReduce and DFS
02/2006: Hadoop becomes a subproject of Lucene
04/2007: Yahoo runs a 1000-node cluster
01/2008: Hadoop becomes a top-level Apache project
07/2008: Yahoo tests a 4000-node cluster

Hadoop Common

A collection of utilities that support the Hadoop subprojects
Includes utilities for file system access, RPC, ...

Hadoop Distributed File System

What is a distributed file system?
What is HDFS?
HDFS architecture
How HDFS stores data and recovers from errors

File system

[Diagram: Application -> File system (NTFS) -> Physical hard disk]

Distributed file system

[Diagram: Application -> Distributed file system -> three local file systems, each on its own physical hard disk]

Goals of HDFS

Store very large files (terabytes)
Streaming data access
Simple coherency model
  Write once, read many times
Move the computation instead of the data
Run on diverse commodity hardware
Detect failures automatically and recover data quickly

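Weaknesses of HDFS

Applications that need low-latency access are a poor fit
  HDFS is optimized for access to very large files
A cluster cannot hold too many files
  The NameNode keeps its metadata in memory, so more files need more memory
No support for multiple writers or for modifying data at arbitrary positions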

HDFS architecture

Concepts
  Block: the smallest unit of data storage
    Hadoop uses 64 MB per block by default
    One file is split into many blocks
    A block can be stored on any node in the cluster
  NameNode
    Manages the metadata of every file in the cluster
  DataNode
    Manages the data blocks
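The block concept can be made concrete with a little arithmetic. The sketch below assumes the 64 MB default quoted on the slide; the class and method names (`BlockMath`, `blockCount`, `lastBlockSize`) are invented for illustration and are not part of the Hadoop API.

```java
/** Illustrative only, not Hadoop code: how a file maps onto fixed-size blocks. */
final class BlockMath {
    // Default block size quoted on the slide: 64 MB.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    /** Number of blocks needed to hold fileSize bytes (ceiling division). */
    static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    /** Size in bytes of the final block, which may be shorter than blockSize. */
    static long lastBlockSize(long fileSize, long blockSize) {
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long fileSize = 1024L * 1024 * 1024 + 1; // hypothetical file: 1 GiB plus one byte
        System.out.println(blockCount(fileSize, BLOCK_SIZE));    // prints 17
        System.out.println(lastBlockSize(fileSize, BLOCK_SIZE)); // prints 1
    }
}
```

A file one byte larger than 1 GiB needs a 17th block that holds only that byte, which is why the last block of a file is usually smaller than the others.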

NameNode

The critical component of HDFS
Manages and executes operations on the file namespace
  Close, open, rename
Manages the locations of the blocks

DataNode

Manages the blocks
Performs operations on the data
  Creates, deletes, and reports blocks
  Serves data processing requests

Storage and error detection

Data replicas
  Each file has multiple replicas, stored as multiple replicas of each block
The NameNode decides how replicas are created
  It receives Heartbeat and Blockreport messages from the DataNodes
    Heartbeat: the health status of a DataNode
    Blockreport: the list of blocks on a DataNode
Replica placement policy
  The mechanism that determines which node each block belongs to

Replica placement policy for blocks

Extremely important
  Determines the stability, safety, and operability of the system
Takes a lot of time and experience to tune
Must take the physical architecture into account: racks, bandwidth
Common policy (not optimal)
  Keep 3 replicas of each block
  Store one replica on a node in the local rack and the other two on two different nodes in another rack (remote rack)
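The common three-replica policy can be sketched as a small selection routine. This is a toy model under simplifying assumptions: it takes the first remote rack that has two nodes, whereas a real cluster would also weigh load and free space. The names `ReplicaPlacement`, `chooseTargets`, and the rack and node labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a toy version of the 3-replica policy, not HDFS's placement code. */
final class ReplicaPlacement {
    /**
     * Picks three nodes for one block: the writer's node in the local rack,
     * plus two distinct nodes taken from one other rack.
     * racks maps a rack name to the node names it contains.
     */
    static List<String> chooseTargets(Map<String, List<String>> racks,
                                      String localRack, String localNode) {
        List<String> targets = new ArrayList<>();
        targets.add(localNode); // replica 1: local rack
        for (Map.Entry<String, List<String>> rack : racks.entrySet()) {
            boolean isRemote = !rack.getKey().equals(localRack);
            if (isRemote && rack.getValue().size() >= 2) {
                targets.add(rack.getValue().get(0)); // replica 2: remote rack
                targets.add(rack.getValue().get(1)); // replica 3: same remote rack, different node
                return targets;
            }
        }
        throw new IllegalStateException("no remote rack with two nodes available");
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("rack1", Arrays.asList("n1", "n2"));
        racks.put("rack2", Arrays.asList("n3", "n4"));
        System.out.println(chooseTargets(racks, "rack1", "n1")); // prints [n1, n3, n4]
    }
}
```

Putting two replicas in a single remote rack survives the loss of a whole rack while sending the block across the inter-rack link only once.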

Robustness of HDFS

Main goal: keep data correct even when system failures occur
Three main kinds of failure:
  NameNode failure
  DataNode failure
  Network partition

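Robustness of HDFS (cont.)

Each DataNode periodically sends a Heartbeat to the NameNode
  A node is marked as failed if the NameNode stops receiving its Heartbeat
  The failed DataNode is taken out of service and the NameNode tries to create replacement replicas
Cluster rebalancing
  Blocks are moved to other DataNodes when the free space on a DataNode drops below a configured threshold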
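Heartbeat-based failure detection comes down to remembering when each DataNode last reported and comparing that against a timeout. The sketch below is a minimal model of that idea; the class `HeartbeatMonitor`, its methods, and the 10-second timeout are assumptions made for illustration, not HDFS defaults or HDFS code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: timeout-based failure detection in the spirit of the slide. */
final class HeartbeatMonitor {
    private final long timeoutMillis; // hypothetical threshold chosen by the caller
    private final Map<String, Long> lastSeen = new TreeMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records that a heartbeat from dataNode arrived at time nowMillis. */
    void onHeartbeat(String dataNode, long nowMillis) {
        lastSeen.put(dataNode, nowMillis);
    }

    /** Returns the DataNodes whose last heartbeat is older than the timeout. */
    List<String> deadNodes(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10_000);
        monitor.onHeartbeat("dn1", 0);
        monitor.onHeartbeat("dn2", 0);
        monitor.onHeartbeat("dn1", 9_000); // dn1 keeps reporting, dn2 goes silent
        System.out.println(monitor.deadNodes(12_000)); // prints [dn2]
    }
}
```

In a real system the nodes returned by `deadNodes` would then trigger re-replication of the blocks they held.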

Robustness of HDFS (cont.)

NameNode failure
  Can make the whole HDFS system unusable
  Mitigation: keep multiple copies of the FsImage and EditLog
  When the NameNode restarts, the system uses the most recent copy

How it works

Reading data:
  The client program asks the NameNode for the data
  The NameNode returns the locations of the data's blocks
  The client then requests the data directly from those nodes
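The read steps can be traced with a small in-memory model. This is a sketch, not the HDFS client API: the maps standing in for NameNode metadata and DataNode storage, the file name, and the block ids are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: an in-memory model of the read path, not the HDFS client API. */
final class ReadPathSketch {
    // "NameNode" metadata: file -> ordered block ids, and block id -> DataNodes holding it.
    static final Map<String, List<String>> fileBlocks = new HashMap<>();
    static final Map<String, List<String>> blockLocations = new HashMap<>();
    // "DataNode" storage: node name -> (block id -> block contents).
    static final Map<String, Map<String, String>> dataNodes = new HashMap<>();

    /** Loads a hypothetical two-block file replicated across three nodes. */
    static void loadSample() {
        fileBlocks.put("/logs/a.txt", Arrays.asList("b0", "b1"));
        blockLocations.put("b0", Arrays.asList("dn1", "dn2"));
        blockLocations.put("b1", Arrays.asList("dn2", "dn3"));
        Map<String, String> dn1 = new HashMap<>();
        dn1.put("b0", "hello ");
        Map<String, String> dn2 = new HashMap<>();
        dn2.put("b0", "hello ");
        dn2.put("b1", "world");
        Map<String, String> dn3 = new HashMap<>();
        dn3.put("b1", "world");
        dataNodes.put("dn1", dn1);
        dataNodes.put("dn2", dn2);
        dataNodes.put("dn3", dn3);
    }

    /** Steps 1 and 2: ask the "NameNode" where each block of the file lives. */
    static List<List<String>> getBlockLocations(String file) {
        List<List<String>> result = new ArrayList<>();
        for (String block : fileBlocks.get(file)) {
            result.add(blockLocations.get(block));
        }
        return result;
    }

    /** Step 3: fetch every block directly from a "DataNode"; the NameNode serves no data. */
    static String read(String file) {
        List<String> blocks = fileBlocks.get(file);
        List<List<String>> locations = getBlockLocations(file);
        StringBuilder contents = new StringBuilder();
        for (int i = 0; i < blocks.size(); i++) {
            String node = locations.get(i).get(0); // first replica, for simplicity
            contents.append(dataNodes.get(node).get(blocks.get(i)));
        }
        return contents.toString();
    }

    public static void main(String[] args) {
        loadSample();
        System.out.println(read("/logs/a.txt")); // prints hello world
    }
}
```

The point of the split is that the NameNode only answers small metadata queries, while the bulk data flows between the client and the DataNodes.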


How it works (cont.)

Writing data:
  Writes go through a pipeline
    The client program sends the write request to the NameNode
    The NameNode checks write permission and makes sure the file does not already exist
    The replicas of a block form a pipeline through which the data is written in sequence
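The write path can be modelled in the same spirit. The sketch below covers only two parts of the description: the existence check on create, and the node-to-node forwarding along the pipeline. It omits permissions, acknowledgements, and failure handling, and every name in it (`WritePipelineSketch`, `create`, `writeBlock`) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: a toy write pipeline, not the HDFS client protocol. */
final class WritePipelineSketch {
    /** "NameNode" check: returns false if the file already exists in the namespace. */
    static boolean create(Set<String> namespace, String file) {
        return namespace.add(file);
    }

    /** Sends one block down the pipeline and returns the order in which nodes stored it. */
    static List<String> writeBlock(String block, List<String> pipeline) {
        List<String> log = new ArrayList<>();
        forward(block, pipeline, 0, log);
        return log;
    }

    // Each node stores the block, then hands it to the next node in the pipeline.
    private static void forward(String block, List<String> pipeline, int index, List<String> log) {
        if (index == pipeline.size()) {
            return; // end of the pipeline
        }
        log.add(pipeline.get(index) + " stored " + block);
        forward(block, pipeline, index + 1, log);
    }

    public static void main(String[] args) {
        Set<String> namespace = new HashSet<>();
        System.out.println(create(namespace, "/logs/a.txt")); // prints true
        System.out.println(create(namespace, "/logs/a.txt")); // prints false
        System.out.println(writeBlock("b0", Arrays.asList("dn1", "dn2", "dn3")));
        // prints [dn1 stored b0, dn2 stored b0, dn3 stored b0]
    }
}
```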


MapReduce

Why MapReduce?
What is MapReduce?
The MapReduce model
Execution
Hadoop MapReduce
Demo

Why MapReduce?

Processing data at large scale
  We want to use 1000 CPUs
  We want a simple model for managing them
The MapReduce architecture
  Manages parallel and distributed processes
  Manages and schedules I/O access
  Tracks the state of the data
  Manages large amounts of interdependent data
  Handles failures
  Hides all of this from programmers behind an abstraction

What is MapReduce?

A programming model
  MapReduce is built on ideas from functional programming and parallel programming
A distributed computing system
  Speeds up data processing
  Solves many kinds of problems
  Hides the details of implementation and management
    Failure handling
    Grouping and sorting
    Scheduling

What is MapReduce? (cont.)

Approach: divide and conquer
  Split a large problem into small problems
  Process the small pieces in parallel
  Combine the results

Typical flow
  Read a large data set
  Extract the needed information from each element (Map)
  Shuffle and sort the intermediate results
  Aggregate the intermediate results (Reduce)
  Produce the final result


The MapReduce model

Processing goes through two phases: Map and Reduce
MapReduce represents data as <key,value> pairs
  Map: <k1,v1> -> list(<k2,v2>)
  Reduce: <k2,list(<v2>)> -> <k3,v3>

(input) <k1,v1> -> map -> <k2,v2> -> combine -> <k2,v2> -> reduce -> <k3,v3> (output)
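The type signatures above translate almost directly into generic interfaces. The following single-process sketch runs a map function, groups the intermediate pairs by key, and calls a reduce function once per key. It is not the Hadoop API: the `MiniMapReduce` class and its `Mapper`, `Reducer`, `run`, and `wordCount` members are invented here, and the combine step is left out.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: the <k1,v1> -> list(<k2,v2>) -> <k3,v3> flow in one process. */
final class MiniMapReduce {
    interface Mapper<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    interface Reducer<K2, V2, K3, V3> {
        Map.Entry<K3, V3> reduce(K2 key, List<V2> values);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            Mapper<K1, V1, K2, V2> mapper,
            Reducer<K2, V2, K3, V3> reducer) {
        // Map phase: each input pair yields zero or more intermediate pairs,
        // which are grouped by key (a TreeMap keeps the keys sorted).
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> kv : mapper.map(in.getKey(), in.getValue())) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // Reduce phase: one call per distinct intermediate key.
        List<Map.Entry<K3, V3>> output = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            output.add(reducer.reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    /** Word count expressed with the two interfaces; the input key is the line number. */
    static List<Map.Entry<String, Integer>> wordCount(List<String> lines) {
        List<Map.Entry<Integer, String>> input = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            input.add(new SimpleEntry<>(i, lines.get(i)));
        }
        return MiniMapReduce.<Integer, String, String, Integer, String, Integer>run(
            input,
            (lineNo, line) -> {
                List<Map.Entry<String, Integer>> out = new ArrayList<>();
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.add(new SimpleEntry<>(word, 1));
                    }
                }
                return out;
            },
            (word, counts) -> {
                int sum = 0;
                for (int count : counts) {
                    sum += count;
                }
                return new SimpleEntry<>(word, sum);
            });
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c"))); // prints [a=2, b=2, c=1]
    }
}
```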

The MapReduce model (cont.)

The Map function
  Each element of the input data is passed to the Map function as a <key,value> pair
  The Map function emits one or more <key,value> pairs

The MapReduce model (cont.)

After the Map phase, the intermediate values are collected into lists, one list per key
The Reduce function
  Combines, processes, and transforms the values
  The output is a processed <key,value> pair

Example: word count

Mapper
  Input: one line of text
  Output: key: word, value: 1
Reducer
  Input: key: word, values: the set of counts for that word
  Output: key: word, value: the total

Parallel computation

Map functions run in parallel, producing different intermediate values from different input data sets
Reduce functions also run in parallel, each reducer handling a different set of keys
  All values are processed independently
Bottleneck: the Reduce phase can start only after the Map phase has finished
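The barrier between the two phases can be shown with a thread pool. In this sketch each map task counts the words of its own shard in parallel, and the merge step starts only once every task has finished. For brevity the reduce side is a single sequential merge rather than several parallel reducers. The class `ParallelPhases` and its methods are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: parallel map tasks followed by a barrier, then a merge. */
final class ParallelPhases {
    /** One map task: counts the words of a single shard. */
    static Map<String, Integer> mapShard(String shard) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : shard.split("\\s+")) {
            if (!word.isEmpty()) {
                partial.merge(word, 1, Integer::sum);
            }
        }
        return partial;
    }

    static Map<String, Integer> countWords(List<String> shards, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Map<String, Integer>>> mapTasks = new ArrayList<>();
            for (String shard : shards) {
                mapTasks.add(() -> mapShard(shard));
            }
            // invokeAll returns only after every map task has completed: the barrier.
            List<Future<Map<String, Integer>>> finished = pool.invokeAll(mapTasks);
            // "Reduce": merge the partial counts, sorted by word.
            Map<String, Integer> total = new TreeMap<>();
            for (Future<Map<String, Integer>> future : finished) {
                for (Map.Entry<String, Integer> e : future.get().entrySet()) {
                    total.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("map phase failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("a b", "b c", "a a"), 2)); // prints {a=3, b=2, c=1}
    }
}
```

Because the merge cannot begin until `invokeAll` returns, one slow shard delays the whole job, which is the bottleneck the slide describes.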

MapReduce execution

Execution (step 1)

The user program, through the MapReduce library, splits the input data into shards

[Diagram: User Program; Input Data split into Shard 0 to Shard 6]

Execution (step 2)

MapReduce copies the program onto the machines of the cluster (the master and the workers)

[Diagram: User Program copied to the Master and to the Workers]

Execution (step 3)

The master assigns M map tasks and R reduce tasks to idle workers
The master assigns tasks based on where the data is located

[Diagram: Master sends Message(Do_map_task) to an Idle Worker]

Execution (step 4)

Each map-task worker reads the data partition assigned to it and emits intermediate <key,value> pairs
This data is buffered temporarily in RAM

[Diagram: Shard 0 -> Map worker -> Key/value pairs]

Execution (step 5)

Each worker partitions its intermediate data into R regions, writes them to disk, clears the buffer, and notifies the master

[Diagram: Map worker -> Local Storage; Map worker reports Disk locations to the Master]
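Splitting intermediate data into R regions is commonly done by hashing the key modulo R, so that every pair with the same key lands in the same region and therefore reaches the same reduce task. The sketch below shows that hash-modulo approach; the slide does not say which partitioning function is used, so treat the formula as an assumption. The names `PartitionSketch`, `partitionFor`, and `partition` are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: hash-modulo partitioning of intermediate keys into r regions. */
final class PartitionSketch {
    /** Maps a key to a region in [0, r); masking the sign bit keeps the result non-negative. */
    static int partitionFor(String key, int r) {
        return (key.hashCode() & Integer.MAX_VALUE) % r;
    }

    /** Distributes keys into r buckets, one bucket per reduce task. */
    static List<List<String>> partition(List<String> keys, int r) {
        List<List<String>> regions = new ArrayList<>();
        for (int i = 0; i < r; i++) {
            regions.add(new ArrayList<>());
        }
        for (String key : keys) {
            regions.get(partitionFor(key, r)).add(key);
        }
        return regions;
    }

    public static void main(String[] args) {
        // "a", "b", "c" hash to 97, 98, 99, so with r = 2 they land in regions 1, 0, 1.
        System.out.println(partition(Arrays.asList("a", "b", "c", "a"), 2)); // prints [[b], [a, c, a]]
    }
}
```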

Execution (step 6)

The master assigns the intermediate data to the reduce tasks and tells them where that data is located

[Diagram: Master sends Disk locations to a Reduce worker, which reads from remote storage]

Execution (step 7)

Each reduce-task worker sorts the keys, calls the reduce function, and writes the output

[Diagram: Partition -> Reduce worker (sorts data) -> Output file]

Execution (step 8)

The master wakes up the user program to report that the job has completed
The output data is stored in R files

[Diagram: Master wakes up the User Program; Output files]

Hadoop - MapReduce

It is a framework
It uses HDFS
Master/slave architecture

          DFS        MapReduce
Master    NameNode   JobTracker
Slave     DataNode   TaskTracker

Hadoop - MapReduce (cont.)

The client submits a MapReduce job
The JobTracker coordinates the execution of the job
The TaskTrackers execute the tasks the job has been split into


Job submission

Request an ID for the new job (1)
Check the input and output directories
Split the input data
Copy the job resources, including the program (JAR), the configuration files, and the input splits, to the jobtracker's filesystem (3)
Tell the jobtracker that the job is ready to run (4)

Job initialization

Add the job to the queue and initialize its resources (5)
Create the list of tasks (6)

Task assignment

Each TaskTracker periodically reports that it is ready to accept new tasks (7)
The JobTracker assigns tasks to a fixed number of slots on each TaskTracker (for example, one TaskTracker runs 2 map tasks and 2 reduce tasks at the same time)

Task execution

The TaskTracker copies the program (JAR file) and the data it needs from the shared filesystem
It creates a TaskRunner process to execute the task

Status updates

Status is updated while the job runs
  How much of its input has a task processed?
  Did the task complete successfully?
  Did the task fail?
The task process reports to the TaskTracker every 3 seconds
The TaskTracker reports to the JobTracker every 5 seconds
The JobTracker aggregates the reports and sends them to the JobClient once per second

Job completion

When the JobTracker receives the completion signal of the last task:
  The JobTracker sends a success signal to the JobClient
  The JobClient notifies the user's program
  The JobTracker cleans up and discards the intermediate results

Fault tolerance

The master detects failures
Task failure
  Causes: the task throws an exception, is killed by the JVM, or hangs
  The JobTracker hands the task to another TaskTracker, up to a fixed number of attempts
  The JobTracker limits the new tasks it gives to a TaskTracker on which tasks have failed
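Reassigning a failed task within a bounded number of attempts can be sketched as a simple loop. This is a minimal model, not JobTracker code: the class `TaskRetrySketch`, the `attempt` callback, and the attempt limits used in `main` are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

/** Illustrative only: bounded reassignment of a failed task to other trackers. */
final class TaskRetrySketch {
    /**
     * Tries the task on successive trackers until one succeeds or maxAttempts is used up.
     * attempt.test(tracker, task) returns true when the task succeeds on that tracker.
     * Returns the tracker that succeeded, or null if every allowed attempt failed.
     */
    static String runWithRetry(String task, List<String> trackers, int maxAttempts,
                               BiPredicate<String, String> attempt) {
        for (int i = 0; i < maxAttempts && i < trackers.size(); i++) {
            String tracker = trackers.get(i); // a tracker that failed this task is never reused
            if (attempt.test(tracker, task)) {
                return tracker;
            }
        }
        return null; // the task is declared failed
    }

    public static void main(String[] args) {
        List<String> trackers = Arrays.asList("tt1", "tt2", "tt3");
        // Hypothetical scenario: the task only succeeds on tt3.
        BiPredicate<String, String> onlyTt3 = (tracker, task) -> tracker.equals("tt3");
        System.out.println(runWithRetry("map-0", trackers, 4, onlyTt3)); // prints tt3
        System.out.println(runWithRetry("map-0", trackers, 2, onlyTt3)); // prints null
    }
}
```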

Fault tolerance (cont.)

TaskTracker failure
  The TaskTracker crashes, runs slowly, or fails to report to the JobTracker on time
  The JobTracker removes the TaskTracker from its task schedule and adds it to the blacklist
  The JobTracker reschedules the tasks that had been given to the failed TaskTracker

Fault tolerance (cont.)

JobTracker failure
  Serious
  No solution yet

Optimization

Reduce starts only after every Map task has finished
  A slow disk on a single node can hold up the whole job
Network bandwidth

Optimization (cont.)

Introduce a combiner function
  It can run on the same machine as the mapper
  It runs independently of the other mappers
  It acts as a mini reducer that shrinks the output of the Map phase
  It saves bandwidth
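The effect of a combiner is easy to see on word count: instead of sending one pair per token, a mapper sends one pair per distinct word. The sketch below compares the two sizes for a sample line. The class `CombinerSketch` and its methods are made up for illustration and do not use the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: how a combiner shrinks one mapper's output for word count. */
final class CombinerSketch {
    /** Raw map output: one entry per token, each standing for the pair (word, 1). */
    static List<String> mapOutput(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Combiner: sums the counts per word locally, before anything crosses the network. */
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : words) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> raw = mapOutput("to be or not to be");
        Map<String, Integer> combined = combine(raw);
        System.out.println(raw.size());      // prints 6
        System.out.println(combined.size()); // prints 4
        System.out.println(combined);        // prints {be=2, not=1, or=1, to=2}
    }
}
```

This works for word count because addition is associative and commutative, so summing partial counts early does not change the final totals.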

Applications

Distributed data sorting
Statistical analysis
Web ranking
Machine translation
Indexing
...

Summary

A simple model for processing large amounts of data on a distributed system
Lets the programmer focus on the main problem to be solved

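Demo Word Count (1)

Map

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Raw (non-generic) types are kept exactly as on the slide.
    public static class MapClass extends MapReduceBase implements Mapper {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every whitespace-separated token in the input line.
        public void map(WritableComparable key, Writable value,
                        OutputCollector output, Reporter reporter) throws IOException {
            String line = ((Text) value).toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }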

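Demo Word Count (2)

Reduce

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public static class Reduce extends MapReduceBase implements Reducer {
        // Sums the counts collected for one word and emits (word, total).
        public void reduce(WritableComparable key, Iterator values,
                           OutputCollector output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += ((IntWritable) values.next()).get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }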

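Demo Word Count (3)

Main

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public static void main(String[] args) throws IOException {
        // argument checking goes here
        JobConf conf = new JobConf();
        // Types of the final output pairs.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // The reducer doubles as the combiner.
        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        // args[0] is the input path, args[1] the output path.
        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));
        JobClient.runJob(conf);
    }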
