
Instructor: Dr. H Bo Quc

Students:
inh Th Lng (1011036)
on Cao Ngha (1011043)
Hng Xun Vin (1011067)

Contents

Introduction
    Practical needs
    What is Hadoop?
    History
Hadoop components
    Hadoop Common, HDFS, MapReduce

Practical needs

Storing and processing data at exabyte scale (1 exabyte = 10^18 bytes)
    Reading and transferring data is very slow
Many low-cost storage nodes are needed
    Node hardware failures happen daily
    Cluster size is not fixed
A common infrastructure is needed
    Efficient and reliable

Two-tier architecture

The nodes are PCs
They are grouped into racks (about 40 PCs per rack)

What is Hadoop?

An application platform that supports distributed applications working on very large data.
Scale: terabytes of data, thousands of nodes.

Components:
    Storage: HDFS (Hadoop Distributed Filesystem)
    Processing: MapReduce
Supports the Map/Reduce programming model

History

2002-2004: Doug Cutting introduces Nutch
12/2004: Google's MapReduce paper is published (the GFS paper had appeared in 2003)
05/2005: Nutch uses MapReduce and DFS
02/2006: Hadoop becomes a subproject of Lucene
04/2007: Yahoo runs a 1000-node cluster
01/2008: Hadoop becomes an Apache top-level project
07/2008: Yahoo tests a 4000-node cluster

Hadoop Common

A set of utilities that support the Hadoop subprojects
Includes: filesystem access utilities, RPC, ...

Hadoop Distributed File System

What is a distributed file system?
What is HDFS?
HDFS architecture
How data is stored and how failures are repaired

File system

[Diagram: Application -> File system (NTFS) -> Physical disk]

Distributed file system

[Diagram: Application -> Distributed file system -> three local file systems, each on its own physical disk]

Goals of HDFS

Store very large files (terabytes)
Streaming data access
Simple coherency model
    Write once, read many times
Move the computation instead of the data
Run on commodity, heterogeneous hardware
Detect failures automatically and recover data very quickly

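Weaknesses of HDFS

Poor fit for applications that need low-latency access
    HDFS is optimized for reading very large files
Cannot store too many files in one cluster
    The NameNode keeps file metadata in memory, so more files need more memory
No support for multiple writers or for modifying data at arbitrary offsets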

HDFS architecture

HDFS architecture (cont.)

Concepts
    Block: the smallest unit of data storage
        Hadoop uses 64 MB per block by default
        One file is split into many blocks
        Blocks may be stored on any node in the cluster
    NameNode
        Manages the metadata of every file in the cluster
    DataNode
        Manages the data blocks
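
The block arithmetic above can be sketched in a few lines. This is a minimal illustration, not HDFS code: the class and method names are hypothetical, and the only figure taken from the slide is the 64 MB default block size.

```java
/** Sketch: how a file maps onto fixed-size blocks (names are illustrative, not HDFS API). */
final class BlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default cited on the slide

    /** Number of blocks needed for a file of the given size (ceiling division). */
    static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    /** Size in bytes of the last block, which may be shorter than blockSize. */
    static long lastBlockSize(long fileSize, long blockSize) {
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGb, BLOCK_SIZE));     // 16
        System.out.println(blockCount(oneGb + 1, BLOCK_SIZE)); // 17
    }
}
```

A 1 GB file therefore occupies 16 blocks, and a single extra byte adds a 17th block that holds only that byte.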


NameNode

The critical component of HDFS
Manages and executes operations on file names
    Open, close, rename
Tracks the location of every block

DataNode

Manages the blocks
Performs operations on the data
    Creates, deletes, and identifies blocks
    Serves data processing requests

Storage and failure detection

Data replication:
    Each file is replicated, which means its blocks are replicated
The NameNode decides when replicas are created
    It receives Heartbeat and Blockreport messages from the DataNodes
        Heartbeat: the DataNode is functioning
        Blockreport: the list of blocks on the DataNode
It sets the replica placement policy
    The mechanism that determines which node holds each block


Block replica placement policy

Extremely important
    Determines the stability, safety, and operability of the system
Takes a lot of time and experience to tune
Must take the physical architecture into account: racks, bandwidth
Common policy (not optimal)
    Keep 3 replicas of each block
    Store one on a node in the local rack and the other two on two different nodes in another rack (remote rack)
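
As a rough illustration of the common policy above, the sketch below picks three nodes for a block: the writer's own node, then two distinct nodes from one other rack. Everything here is hypothetical (class name, method name, rack and node labels), and the choice is deterministic to keep the example testable, whereas a real cluster would also weigh load and free space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a deterministic version of the "1 local + 2 remote" policy. */
final class ReplicaPlacement {

    /** Returns the local node followed by two distinct nodes of the first suitable remote rack. */
    static List<String> place(Map<String, List<String>> racks, String localRack, String localNode) {
        List<String> chosen = new ArrayList<>();
        chosen.add(localNode);
        for (Map.Entry<String, List<String>> rack : racks.entrySet()) {
            boolean remote = !rack.getKey().equals(localRack);
            if (remote && rack.getValue().size() >= 2) {
                chosen.add(rack.getValue().get(0));
                chosen.add(rack.getValue().get(1));
                return chosen;
            }
        }
        throw new IllegalStateException("no remote rack with at least two nodes");
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("rack1", Arrays.asList("n1", "n2"));
        racks.put("rack2", Arrays.asList("n3", "n4"));
        System.out.println(place(racks, "rack1", "n1")); // [n1, n3, n4]
    }
}
```

Putting two replicas in the same remote rack survives the loss of a whole rack while crossing the inter-rack link only once per block write.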

HDFS robustness

Main goal: keep the data correct even when system failures occur

3 main types of failure:
    NameNode failure
    DataNode failure
    Network partition

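HDFS robustness (cont.)

DataNodes periodically send a Heartbeat to the NameNode
    A node is considered failed if the NameNode stops receiving its Heartbeat
    The failed DataNode is taken out of the cluster and the NameNode tries to create new replicas elsewhere
Cluster rebalancing
    Blocks are moved to other DataNodes when a DataNode's free space falls below a configured threshold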
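
Heartbeat-based failure detection can be sketched as a table of last-seen timestamps. The class below is a hypothetical, single-threaded illustration: the timeout value, the node names, and the method names are assumptions, not the NameNode's real implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: mark a DataNode as dead when its last heartbeat is older than a timeout. */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records that a heartbeat from the given DataNode arrived at nowMillis. */
    void onHeartbeat(String dataNode, long nowMillis) {
        lastSeen.put(dataNode, nowMillis);
    }

    /** Returns every known DataNode whose last heartbeat is older than the timeout. */
    List<String> deadNodes(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(1000); // hypothetical 1 s timeout
        monitor.onHeartbeat("dn1", 0);
        monitor.onHeartbeat("dn2", 900);
        System.out.println(monitor.deadNodes(1500)); // [dn1]
    }
}
```

Once a node shows up in `deadNodes`, the blocks it held are the ones that need new replicas.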

HDFS robustness (cont.)

NameNode failure
    Can render the whole HDFS unusable
    Mitigation: keep multiple copies of the FsImage and EditLog
    When the NameNode restarts, it uses the most recent copy

How it works

Reading data:
    The client asks the NameNode for the file it wants to read
    The NameNode returns the locations of the file's blocks
    The client then requests the data directly from the DataNodes

How it works (cont.)

How it works (cont.)

Writing data:
    Writes go through a pipeline
    The client sends the write request to the NameNode
    The NameNode checks write permission and makes sure the file does not already exist
    The replicas of a block form a pipeline through which the data is written sequentially

How it works (cont.)

MapReduce

Why MapReduce?
What is MapReduce?
The MapReduce model
Execution
Hadoop MapReduce
Demo

Why MapReduce?

Large-scale data processing
    We want to use 1000 CPUs
    We want a simple management model
The MapReduce architecture
    Manages parallel and distributed processes
    Manages and schedules I/O access
    Tracks the status of the data
    Manages large amounts of interdependent data
    Handles failures
    Hides all of this from programmers behind an abstraction

What is MapReduce?

A programming model
    MapReduce builds on functional programming and parallel programming
A distributed computing system
    Speeds up data processing
    Solves many kinds of problems
    Hides implementation and management details
        Fault handling
        Grouping and sorting
        Scheduling

What is MapReduce?

Approach: divide and conquer
    Split a large problem into small ones
    Process the small tasks in parallel
    Combine the results
Read a large data set
Extract the needed information from each element (Map)
Shuffle and sort the intermediate results
Aggregate the intermediate results (Reduce)
Produce the final result

What is MapReduce?

The MapReduce model

Consists of two phases: Map and Reduce
MapReduce represents data as <key,value> pairs
Map: <k1,v1> -> list(<k2,v2>)
Reduce: <k2,list(<v2>)> -> <k3,v3>

(input) <k1,v1> -> map -> <k2,v2> -> combine -> <k2,v2> -> reduce -> <k3,v3> (output)
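
The type signatures above translate almost directly into generic Java interfaces. The following is a minimal single-process sketch, not Hadoop's API: `MiniMapReduce`, its `Mapper` and `Reducer` interfaces, and the `maxPerKey` example with its `name:number` record format are all hypothetical. Intermediate pairs are grouped by key in a sorted map before the reduce step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of the model: map <k1,v1> -> list(<k2,v2>), reduce <k2,list(v2)> -> <k3,v3>. */
final class MiniMapReduce {
    interface Mapper<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    interface Reducer<K2, V2, K3, V3> {
        Map.Entry<K3, V3> reduce(K2 key, List<V2> values);
    }

    /** Runs map over every input pair, groups the output by key, then reduces each group. */
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            Mapper<K1, V1, K2, V2> mapper,
            Reducer<K2, V2, K3, V3> reducer) {
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : mapper.map(in.getKey(), in.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        List<Map.Entry<K3, V3>> result = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            result.add(reducer.reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    /** Example job: records look like "name:number"; the result is the maximum per name. */
    static List<Map.Entry<String, Integer>> maxPerKey(List<String> records) {
        List<Map.Entry<Integer, String>> input = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            input.add(new SimpleEntry<>(i, records.get(i)));
        }
        Mapper<Integer, String, String, Integer> parse = (lineNo, text) -> {
            String[] parts = text.split(":");
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            out.add(new SimpleEntry<>(parts[0], Integer.parseInt(parts[1])));
            return out;
        };
        Reducer<String, Integer, String, Integer> max =
                (name, values) -> new SimpleEntry<>(name, Collections.max(values));
        return run(input, parse, max);
    }

    public static void main(String[] args) {
        System.out.println(maxPerKey(Arrays.asList("a:3", "b:5", "a:7"))); // [a=7, b=5]
    }
}
```

The job author writes only the two lambdas; grouping and iteration stay inside `run`, which is the part a real framework distributes across machines.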

The MapReduce model

The Map function
    Each element of the input is passed to the Map function as a <key,value> pair
    The Map function emits zero or more <key,value> pairs

The MapReduce model

After the Map phase, the intermediate values are gathered into lists, one list per key

The Reduce function
    Combines, processes, and transforms the values
    The output is a processed <key,value> pair

Word count example

Word count example (cont.)

Mapper
    Input: one line of text
    Output: key: word, value: 1
Reducer
    Input: key: word, values: the collected counts for that word
    Output: key: word, value: total
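
To make the mapper and reducer contracts above concrete, here is a plain-Java sketch that runs in a single process with no Hadoop dependency. The class `WordCountSketch` and its method names are hypothetical; tokens are split on whitespace only.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Sketch of word count: map emits (word, 1), reduce sums the counts of each word. */
final class WordCountSketch {

    /** Mapper: one line in, one (word, 1) pair out per whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return out;
    }

    /** Reducer: all counts collected for one word in, their total out. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    /** Drives map, group-by-key, and reduce over all input lines. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or", "not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```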

Parallel computation

Map functions run in parallel, producing different intermediate values from different parts of the input
Reduce functions also run in parallel, each reducer handling a different set of keys
All values are processed independently
Bottleneck: the Reduce phase can start only after the Map phase has finished
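
The barrier between the two phases can be illustrated with a thread pool. In this hypothetical sketch each shard is mapped by its own task, and the reduce step (a simple sum) runs only after `invokeAll` has returned, which happens when every map task is done. The class name and the "sum of string lengths" job are assumptions chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch: parallel map tasks, then a reduce step that waits for all of them. */
final class ParallelPhases {

    /** Sums the lengths of all strings; each shard is processed by its own map task. */
    static long totalLength(List<List<String>> shards, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> mapTasks = new ArrayList<>();
            for (List<String> shard : shards) {
                mapTasks.add(() -> {
                    long n = 0;
                    for (String s : shard) {
                        n += s.length();
                    }
                    return n;
                });
            }
            // invokeAll returns only when every map task has completed: the barrier.
            long total = 0;
            for (Future<Long> partial : pool.invokeAll(mapTasks)) {
                total += partial.get(); // reduce: add up the partial sums
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("map phase failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> shards = Arrays.asList(
                Arrays.asList("ab", "c"),
                Arrays.asList("defg"));
        System.out.println(totalLength(shards, 2)); // 7
    }
}
```

Because the reduce loop cannot begin until the slowest map task finishes, one slow shard delays the whole job, which is the bottleneck the slide points out.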

MapReduce execution

Execution (step 1)

The user program, through the MapReduce library, splits the input data into shards

[Diagram: User Program splits Input Data into Shard 0 through Shard 6]

Execution (step 2)

MapReduce copies the program to the machines of the cluster (the master and the workers)

[Diagram: User Program copied to the Master and to five Workers]

Execution (step 3)

The master assigns M map tasks and R reduce tasks to idle workers
The master assigns tasks based on where the data is located

[Diagram: Master sends Message(Do_map_task) to an Idle Worker]

Execution (step 4)

Each map-task worker reads the input shard assigned to it and emits intermediate <key,value> pairs
These pairs are buffered temporarily in RAM

[Diagram: Shard 0 -> Map worker -> Key/value pairs]

Execution (step 5)

Each worker partitions its intermediate data into R regions, writes them to disk, clears the buffer, and notifies the master

[Diagram: Map worker writes to Local Storage and sends the disk locations to the Master]
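
One common way to split intermediate keys into R regions is to hash the key and take the remainder modulo R, the same hash-modulo idea that Hadoop's default `HashPartitioner` uses. The sketch below is a standalone illustration with hypothetical class and method names; masking with `Integer.MAX_VALUE` clears the sign bit so that negative hash codes still yield a valid region index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: assign each intermediate key to one of R regions by hash modulo R. */
final class PartitionSketch {

    /** Region index in [0, numReduceTasks) for the given key. */
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Splits the keys into r regions; equal keys always land in the same region. */
    static List<List<String>> split(List<String> keys, int r) {
        List<List<String>> regions = new ArrayList<>();
        for (int i = 0; i < r; i++) {
            regions.add(new ArrayList<>());
        }
        for (String key : keys) {
            regions.get(partition(key, r)).add(key);
        }
        return regions;
    }

    public static void main(String[] args) {
        System.out.println(split(Arrays.asList("a", "b", "c", "a"), 3));
        // [[c], [a, a], [b]]
    }
}
```

Since the region depends only on the key, every map worker sends a given key to the same reduce task, which is what lets each reducer see all values for its keys.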

Execution (step 6)

The master assigns the intermediate data to the reduce tasks and tells them where it is located

[Diagram: Master sends the disk locations to a Reduce worker, which reads from remote storage]

Execution (step 7)

Each reduce-task worker sorts the data by key, calls the reduce function, and writes the output

[Diagram: Reduce worker sorts the data of its partition and writes an output file]
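
Sorting first means all pairs with the same key become adjacent, so the worker can reduce one run of equal keys at a time without holding a map of every key. The sketch below assumes a summing reduce function and a tab-separated output line; the class and method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch of a reduce worker: sort by key, then reduce each run of equal keys. */
final class ReduceWorkerSketch {

    /** Returns one "key<TAB>sum" line per distinct key, in key order. */
    static List<String> sortAndReduce(List<Map.Entry<String, Integer>> pairs) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(pairs);
        sorted.sort(Map.Entry.comparingByKey());

        List<String> output = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String key = sorted.get(i).getKey();
            int sum = 0;
            // Consume the whole run of pairs that share this key.
            while (i < sorted.size() && sorted.get(i).getKey().equals(key)) {
                sum += sorted.get(i).getValue();
                i++;
            }
            output.add(key + "\t" + sum);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("b", 1));
        pairs.add(new SimpleEntry<>("a", 1));
        pairs.add(new SimpleEntry<>("b", 1));
        for (String line : sortAndReduce(pairs)) {
            System.out.println(line);
        }
    }
}
```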

Execution (step 8)

The master wakes up the user program to report that the job has completed
The output data is stored in R files

[Diagram: Master sends a wakeup to the User Program; the Output files are available]

Hadoop - MapReduce

A framework
Uses HDFS
Master/slave architecture

            DFS         MapReduce
Master      NameNode    JobTracker
Slave       DataNode    TaskTracker

Hadoop - MapReduce

The client submits a MapReduce job
The JobTracker coordinates the execution of the job
The TaskTrackers run the tasks the job has been split into

Hadoop - MapReduce

Job submission

Request an ID for the new job (1)
Check the input and output directories
Compute the input splits
Copy the job resources, including the program (JAR), the configuration files, and the input splits, to the JobTracker's filesystem (3)
Tell the JobTracker that the job is ready to run (4)

Job initialization

Add the job to the queue and initialize its resources (5)
Create the list of tasks (6)

Task assignment

TaskTrackers periodically report that they are ready to accept new tasks (7)
The JobTracker assigns tasks up to a fixed number of slots per TaskTracker (for example, one TaskTracker runs 2 map tasks and 2 reduce tasks at the same time)

Task execution

The TaskTracker copies the executable (JAR file) and the required data from the shared filesystem
It creates a TaskRunner process to run the task

Status updates

Status is updated while the job runs
    How much of its input has the task processed?
    Did the task complete successfully?
    Did the task fail?
The task process reports to its TaskTracker every 3 seconds
The TaskTracker reports to the JobTracker every 5 seconds
The JobTracker aggregates the reports and sends them to the JobClient once per second

Status updates

Job completion

When the JobTracker receives the completion signal of the last task:
    The JobTracker sends a success signal to the JobClient
    The JobClient notifies the user's program
    The JobTracker cleans up and discards the intermediate results

Fault tolerance

The master detects failures
Task failure
    Causes: the task throws an exception, is killed by the JVM, or hangs
    The JobTracker reassigns the task to another TaskTracker, up to a fixed number of attempts
    The JobTracker avoids giving new tasks to a TaskTracker on which tasks have failed
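
The retry rule for failed tasks can be sketched as a loop over candidate TaskTrackers with an attempt limit. Everything below is hypothetical: the class, the `attempt` callback standing in for "run this task on that tracker", and the limit passed in by the caller. It only illustrates the two rules on the slide, which are to retry elsewhere and to stop after a bounded number of attempts.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

/** Sketch: retry a failed task on other trackers, up to a maximum number of attempts. */
final class RetrySketch {

    /**
     * Tries the task on the trackers in order, never twice on a tracker where it failed.
     * Returns the tracker on which the task succeeded, or null if the limit was reached.
     */
    static String runWithRetry(String task, List<String> trackers, int maxAttempts,
                               BiPredicate<String, String> attempt) {
        Set<String> failedOn = new HashSet<>();
        int attempts = 0;
        for (String tracker : trackers) {
            if (attempts == maxAttempts) {
                break;
            }
            if (failedOn.contains(tracker)) {
                continue; // do not send the task back to a tracker where it failed
            }
            attempts++;
            if (attempt.test(task, tracker)) {
                return tracker;
            }
            failedOn.add(tracker);
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> trackers = Arrays.asList("tt1", "tt2", "tt3");
        // Pretend the task only succeeds on tt2.
        String winner = runWithRetry("map-0", trackers, 4, (task, tracker) -> tracker.equals("tt2"));
        System.out.println(winner); // tt2
    }
}
```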

Fault tolerance

TaskTracker failure
    The TaskTracker crashes, runs slowly, or fails to report to the JobTracker on time
    The JobTracker removes the TaskTracker from its task-scheduling pool and adds it to a blacklist
    The JobTracker reschedules the tasks that had been assigned to the failed TaskTracker

Fault tolerance

JobTracker failure
    Serious
    No solution yet

Optimization

Reduce starts only after all Map tasks have finished
    A slow disk on a single node can hold up the whole job
Network bandwidth

Optimization

Introduce a combiner function
    Can run on the same machine as the mapper
    Runs independently of the other mappers
    Acts as a mini reducer that shrinks the output of the Map phase and saves bandwidth
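
For word count, a combiner simply pre-sums the (word, 1) pairs of one mapper before anything crosses the network. The sketch below is a hypothetical standalone illustration; it is valid as a combiner only because addition is associative and commutative, so summing partial sums gives the same total.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: local pre-aggregation of one mapper's output before the shuffle. */
final class CombinerSketch {

    /** Collapses a mapper's emitted words into (word, partialCount) pairs. */
    static Map<String, Integer> combine(List<String> mapperWords) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : mapperWords) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("a", "b", "a", "a");
        // Four records from the mapper shrink to two records sent to the reducers.
        System.out.println(combine(emitted)); // {a=3, b=1}
    }
}
```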

Applications

Distributed data sorting
Statistical analysis
Web ranking
Machine translation
Indexing
...

Summary

A simple model for processing large volumes of data on a distributed system
Lets developers focus on the core problem they need to solve

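Demo Word Count (1)

Map

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Emits <word, 1> for every whitespace-separated token of an input line.
    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // key: byte offset of the line in the file; value: the line itself
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }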

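Demo Word Count (2)

Reduce

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Sums the counts collected for one word and emits <word, total>.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }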

Demo Word Count (3)

Main

    public static void main(String[] args) throws IOException {
        // Argument checking goes here (args[0]: input path, args[1]: output path)
        JobConf conf = new JobConf();
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MapClass.class);
        // Summation is associative, so the reducer can also serve as the combiner.
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Submits the job and blocks until it completes.
        JobClient.runJob(conf);
    }
