Hadoop
Dr. Hồ Bảo Quốc
1011036
Contents
Introduction
Real-world needs
What is Hadoop?
Development history
Real-world needs
The nodes are PCs
Divided into many racks (about 40 PCs per rack)
What is Hadoop?
Distributed processing of very large data.
Scale: terabytes of data, thousands of nodes.
Components:
Storage: HDFS (Hadoop Distributed Filesystem)
Processing: MapReduce
Supports the Map/Reduce programming model
Hadoop Common
Distributed file systems?
HDFS?
HDFS architecture
How data is stored and how failures are repaired
File system
[Figure: an application, a file system, and the physical hard disk beneath it]
Goals of HDFS
Weaknesses of HDFS
Concepts
Block: the smallest unit of data storage
Hadoop uses 64 MB blocks by default
A file is split into multiple blocks
NameNode
Manages information about every file in the cluster
DataNode
Manages the data blocks
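The block arithmetic above can be sketched in a few lines of plain Java. This is only an illustration, not Hadoop code: the class name BlockMath and the 200 MB file are assumptions made for the example, and the 64 MB figure is the default quoted on the slide.

```java
final class BlockMath {
    // Default block size quoted in the slides: 64 MB.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    // Number of blocks needed for a file of the given size (ceiling division).
    static long blockCount(long fileSizeBytes) {
        if (fileSizeBytes < 0) {
            throw new IllegalArgumentException("negative file size");
        }
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Size of the final block; it may be smaller than BLOCK_SIZE.
    static long lastBlockSize(long fileSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        long remainder = fileSizeBytes % BLOCK_SIZE;
        return remainder == 0 ? BLOCK_SIZE : remainder;
    }

    public static void main(String[] args) {
        long size = 200L * 1024 * 1024; // a hypothetical 200 MB file
        System.out.println(blockCount(size) + " blocks, last block "
                + lastBlockSize(size) / (1024 * 1024) + " MB");
        // prints "4 blocks, last block 8 MB"
    }
}
```

A 200 MB file therefore occupies three full 64 MB blocks plus one 8 MB block.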
NameNode
The critical component of HDFS
Manages and performs operations on file names
Open, close, rename
Tracks the location of the blocks
DataNode
Data replication:
Each file is replicated, so each of its blocks has several replicas
Replica placement is critically important
It takes a lot of time and experience to tune
Take the physical architecture into account: racks, bandwidth
The common policy (not optimal)
Keep 3 replicas of each block
Store one replica on a node in the local rack and the other 2 on 2 different nodes in another (remote) rack
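The placement policy described above (one replica in the local rack, two on different nodes of one remote rack) could look roughly like this. It is a minimal sketch under stated assumptions: the Node class, the chooseTargets method and the rule "take the first remote rack found" are all hypothetical and are not how HDFS itself is implemented.

```java
import java.util.ArrayList;
import java.util.List;

final class ReplicaPlacement {
    // Hypothetical model of a cluster node: a name plus the rack it sits in.
    static final class Node {
        final String name;
        final String rack;

        Node(String name, String rack) {
            this.name = name;
            this.rack = rack;
        }
    }

    // Picks up to three targets: the writer's own node (local rack), then
    // two distinct nodes that share one remote rack. Assumes the cluster
    // list holds no duplicate nodes. Returns fewer than three targets if
    // the chosen remote rack has fewer than two nodes.
    static List<Node> chooseTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer); // replica 1: local rack

        String remoteRack = null;
        for (Node candidate : cluster) {
            if (targets.size() == 3) {
                break;
            }
            if (candidate.rack.equals(writer.rack)) {
                continue; // skip every node in the local rack
            }
            if (remoteRack == null) {
                remoteRack = candidate.rack; // first remote rack seen
            }
            if (candidate.rack.equals(remoteRack)) {
                targets.add(candidate); // replicas 2 and 3
            }
        }
        return targets;
    }
}
```

Putting two replicas in the same remote rack means the block survives the loss of a whole rack while only one copy has to cross between racks.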
Robustness of HDFS
3 main kinds of failure:
NameNode failure
DataNode failure
Network partition
Cluster rebalancing
Move blocks to other DataNodes when a DataNode's free space drops below the configured threshold
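The rebalancing rule above can be illustrated with a toy model. Everything here is an assumption made for the sketch: capacity is counted in whole blocks, the class and method names are invented, and the real rebalancing logic in HDFS also has to respect replica placement, which this version ignores.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class Rebalancer {
    // Hypothetical DataNode whose capacity is measured in blocks.
    static final class DataNode {
        final String name;
        final long capacity;
        final Deque<String> blocks = new ArrayDeque<>();

        DataNode(String name, long capacity) {
            this.name = name;
            this.capacity = capacity;
        }

        long free() {
            return capacity - blocks.size();
        }
    }

    // Moves blocks off every node whose free space is below minFree onto
    // the node with the most free space. Returns the number of blocks moved.
    static int rebalance(List<DataNode> nodes, long minFree) {
        int moved = 0;
        for (DataNode source : nodes) {
            while (source.free() < minFree && !source.blocks.isEmpty()) {
                DataNode target = null;
                for (DataNode n : nodes) {
                    // Only accept targets that stay at or above the
                    // threshold after receiving one more block.
                    if (n != source && n.free() > minFree
                            && (target == null || n.free() > target.free())) {
                        target = n;
                    }
                }
                if (target == null) {
                    break; // no node has room to spare
                }
                target.blocks.add(source.blocks.poll());
                moved++;
            }
        }
        return moved;
    }
}
```

With a 10-block node holding 9 blocks, a second 10-block node holding 2, and a threshold of 3 free blocks, two blocks move and the nodes end up with 3 and 6 free blocks.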
NameNode failure
Can leave the whole HDFS unusable
Keep copies of the FsImage and the EditLog
When the NameNode restarts, the system picks the most recent copy.
How it works
Reading data:
The client program asks the NameNode to read data
node.
How it works (cont.)
Writing data:
Writes go through a pipeline
The program sends the write request to the NameNode
The NameNode checks write permission and makes sure the file does not already exist
The replicas of a block form a pipeline through which the data is written sequentially
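The write path above can be mimicked in memory to show the order of events. This is a sketch only: the NameNode and DataNode classes below are stand-ins invented for the example, the permission check is left out, and the integer returned by receive merely plays the role of the acknowledgement chain.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WritePipeline {
    // Stand-in for the NameNode's namespace check.
    static final class NameNode {
        private final Set<String> files = new HashSet<>();

        // Grants the write only if the path does not exist yet.
        boolean create(String path) {
            return files.add(path);
        }
    }

    // Stand-in for a DataNode holding one replica.
    static final class DataNode {
        final String name;
        final List<String> stored = new ArrayList<>();
        DataNode next; // downstream node, or null for the last one

        DataNode(String name) {
            this.name = name;
        }

        // Stores the packet locally, then forwards it downstream.
        // Returns how many nodes, this one included, stored the packet.
        int receive(String packet) {
            stored.add(packet);
            return 1 + (next == null ? 0 : next.receive(packet));
        }
    }

    // Chains the nodes in list order and returns the head of the pipeline.
    static DataNode build(List<DataNode> nodes) {
        for (int i = 0; i + 1 < nodes.size(); i++) {
            nodes.get(i).next = nodes.get(i + 1);
        }
        return nodes.get(0);
    }
}
```

The client only talks to the head of the pipeline; each node passes the packet on, so with three replicas a successful write reports a count of 3.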
Map Reduce
Why MapReduce?
What is MapReduce?
The MapReduce model
Execution
Hadoop MapReduce
Demo
Large-scale data processing
We want to use 1000 CPUs
We want a simple management model
What is MapReduce?
A programming model
MapReduce builds on the functional and parallel programming models
Read a large volume of data
Extract the needed information from each element (Map)
Shuffle and sort the intermediate results
The Map function
Each element of the input is passed to the Map function as a <key,value> pair
The Map function emits one or more <key,value> pairs
The Reduce function
Combines, processes and transforms the values
The output is a processed <key,value> pair
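The contract between the two functions can be written down as a pair of Java interfaces with a tiny in-memory driver. This is an illustrative sketch, not the Hadoop API: MiniMapReduce, MapFn, ReduceFn and run are names made up for the example, and the whole thing runs in a single thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniMapReduce {
    // Map: one input <key,value> pair in, a list of <key,value> pairs out.
    interface MapFn<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    // Reduce: a key and all values collected for it in, one result out.
    interface ReduceFn<K2, V2, V3> {
        V3 reduce(K2 key, List<V2> values);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, V3> run(
            List<Map.Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapFn,
            ReduceFn<K2, V2, V3> reduceFn) {
        // Map phase, with the values grouped by key (keys kept sorted).
        TreeMap<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> out : mapFn.map(record.getKey(), record.getValue())) {
                groups.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        // Reduce phase: one call per distinct key.
        Map<K2, V3> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            result.put(group.getKey(), reduceFn.reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, Integer>> input =
                List.of(Map.entry(0, 1), Map.entry(1, 2), Map.entry(2, 3), Map.entry(3, 4));
        Map<String, Integer> sums = MiniMapReduce.<Integer, Integer, String, Integer, Integer>run(
                input,
                (k, v) -> List.of(Map.entry(v % 2 == 0 ? "even" : "odd", v)),
                (k, vs) -> vs.stream().mapToInt(Integer::intValue).sum());
        System.out.println(sums); // prints {even=6, odd=4}
    }
}
```

The example in main tags each number as even or odd in the map step and sums each group in the reduce step.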
Word count example
Mapper
Input: one line of text
Output: key: word, value: 1
Reducer
Input: key: word, values: the collection of counts recorded for that word
Output: key: word, value: total
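The Mapper and Reducer described above can be tried out without a cluster. The sketch below keeps the same inputs and outputs but runs in plain Java; the class name WordCountSketch, the wordCount helper and the sample sentences are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class WordCountSketch {
    // Mapper: one line of text in, one <word, 1> pair out per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(Map.entry(itr.nextToken(), 1));
        }
        return out;
    }

    // Reducer: a word and every count collected for it in, the total out.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Glue that stands in for the framework: map every line, group the
    // pairs by word, then reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        grouped.forEach((word, counts) -> totals.put(word, reduce(word, counts)));
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick brown fox", "the lazy dog", "the fox")));
        // prints {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
    }
}
```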
All values are processed independently
MR execution
Execution (step 1)
The user program, through the MapReduce library, splits the input data into shards
[Figure: User Program; Input Data split into Shard 0 to Shard 6]
Execution (step 2)
MapReduce copies this program to ...
[Figure: Master]
Execution (step 3)
[Figure: Master, Idle Worker]
Execution (step 4)
Each map-task worker reads the data partition assigned to it and emits intermediate <key,value> pairs
This data is buffered temporarily in RAM
[Figure: Shard 0, Map worker, Key/value pairs]
Execution (step 5)
Each worker partitions the intermediate data into R regions, writes them to disk, clears the in-memory buffer and notifies the Master
[Figure: Map worker, Local Storage, Disk locations, Master]
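Splitting the intermediate data into R regions is commonly done by hashing the key and taking the result modulo R, so that every pair with the same key lands in the same region. The sketch below assumes that scheme; the class and method names are invented for the example and the slides do not say which partitioning function is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class Partitioning {
    // Region index for a key: non-negative hash code modulo r.
    static int partition(String key, int r) {
        return (key.hashCode() & Integer.MAX_VALUE) % r;
    }

    // Splits the buffered intermediate pairs into r lists, one per reduce task.
    static List<List<Map.Entry<String, Integer>>> split(
            List<Map.Entry<String, Integer>> buffer, int r) {
        List<List<Map.Entry<String, Integer>>> regions = new ArrayList<>();
        for (int i = 0; i < r; i++) {
            regions.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> kv : buffer) {
            regions.get(partition(kv.getKey(), r)).add(kv);
        }
        return regions;
    }
}
```

Masking with Integer.MAX_VALUE clears the sign bit, which keeps the index non-negative even when hashCode() is negative.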
Execution (step 6)
[Figure: Master, Disk locations, Reduce worker, remote Storage]
Execution (step 7)
[Figure: Reduce worker, Sorts data]
Execution (step 8)
The Master wakes up the user program to report that the job has finished
[Figure: User Program, Output files]
DFS MapReduce
Job Submission
Request an ID for the new job (1)
Check the input and output directories
Split the input data
Copy the job resources, including the program (JAR), the configuration files and the input splits, to the jobtracker's filesystem (3)
Job initialization
Task assignment
Task execution
The TaskTracker copies the executable (JAR file) and the data it needs from the shared filesystem
Job completion
... the final task
Fault tolerance
TaskTracker failure
Crashes, runs slowly, or fails to report to the JobTracker on time
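A tracker that stops reporting on time is usually detected with a timeout on its last report. The sketch below shows that idea only; HeartbeatMonitor, its methods and the 10-second timeout used in the note underneath are assumptions for the illustration, not the JobTracker's actual code or settings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a tracker reports in.
    void heartbeat(String tracker, long nowMillis) {
        lastSeen.put(tracker, nowMillis);
    }

    // Trackers whose last report is older than the timeout. A scheduler
    // would treat these as failed and hand their tasks to other trackers.
    List<String> expired(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        return dead;
    }
}
```

With a 10-second timeout, a tracker last heard from at t = 0 is reported as expired at t = 12 s, while one heard from at t = 8 s is not.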
JobTracker failure
Serious
No solution yet
Optimization
Introduce a combiner function
Can run on the same machine as the mappers
Runs independently of the other mappers
A mini reducer that shrinks the output of the Map phase
Saves bandwidth
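The effect of a combiner can be seen by pre-aggregating one mapper's output before it leaves the machine. This is a plain-Java sketch with invented names (CombinerSketch, combine); it relies on the fact that summing is associative and commutative, which is what makes it safe to add partial counts early.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CombinerSketch {
    // Sums the counts per word within ONE mapper's output, so fewer
    // pairs have to cross the network to the reducers.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> local = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> kv : mapperOutput) {
            local.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return local;
    }
}
```

A mapper that emitted <the,1>, <fox,1>, <the,1>, <the,1> would send only <the,3> and <fox,1> after combining: two pairs instead of four.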
Applications
Distributed data sorting
Statistical analysis
Web ranking
Machine translation
Indexing
...
Summary
Focus on the core problem to be solved
Map
public static class MapClass extends MapReduceBase implements Mapper {
    // Reused for every token, so no new object is allocated per word.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
        // Each input value is one line of text.
        String line = ((Text) value).toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one); // emit <word, 1>
        }
    }
}
Reduce
public static class Reduce extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
        // Add up every count collected for this word.
        int sum = 0;
        while (values.hasNext()) {
            sum += ((IntWritable) values.next()).get();
        }
        output.collect(key, new IntWritable(sum)); // emit <word, total>
    }
}
Main
public static void main(String[] args) throws IOException {
    // checking goes here
    JobConf conf = new JobConf();
    // Types of the <key,value> pairs the job emits.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    // The reducer doubles as the combiner.
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    // args[0] is the input path, args[1] the output path.
    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));
    JobClient.runJob(conf);
}