From the outside, files stored on HDFS look just like files stored on Windows or Linux: we can create, delete, move, and rename them. In reality, however, the data is split into blocks stored across many DataNodes, and each block has several replicas (3 by default) kept on different DataNodes, so that the system keeps working normally if one DataNode fails. In addition, there is one (and only one) NameNode, which manages the data and coordinates the commands that operate on it.

MapReduce, for its part, makes parallel processing convenient. It consists of at least three parts: a Map function that breaks the data into (key, value) pairs; a Reduce function that groups those pairs by key and produces the result; and a Main function that coordinates the job. Each Map or Reduce operation is a task executed by a TaskTracker. TaskTrackers usually run on the DataNodes to reduce network traffic. The JobTracker uses the block location information to start tasks on the TaskTrackers of suitable DataNodes. The JobTracker does not have to run on the same machine as the NameNode.

Consider a simple example: counting how often each word occurs in "hello world, hello hadoop". One TaskTracker maps the segment "hello world" to (hello, 1)(world, 1). Another TaskTracker maps the segment "hello hadoop" to (hello, 1)(hadoop, 1). Then another TaskTracker reduces these to the result (hello, 2)(world, 1)(hadoop, 1).
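The word-count example can be sketched in plain Python to make the three parts concrete. This is only a single-process illustration, not Hadoop code: the function names (map_words, shuffle, reduce_counts, main) are hypothetical, and the shuffle step stands in for the grouping by key that the framework would do between Map and Reduce.

```python
from collections import defaultdict


def map_words(segment):
    """Map step: emit a (word, 1) pair for every word in one input segment."""
    return [(word, 1) for word in segment.split()]


def shuffle(pairs):
    """Group values by key (done here by hand; an assumption of this sketch)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum all the counts collected for one word."""
    return key, sum(values)


def main(segments):
    """Driver: run Map on each segment, group by key, then run Reduce."""
    mapped = [pair for segment in segments for pair in map_words(segment)]
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


result = main(["hello world", "hello hadoop"])
print(result)  # {'hello': 2, 'world': 1, 'hadoop': 1}
```

In a real cluster each call to map_words would run as a separate task near the block that holds its segment; here everything runs in one process, purely to show the data flow.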
The figure below illustrates an example with weather data. From a store of temperature readings taken at various times over the years, we want to find the highest temperature for each year:
In the figure above, the raw data (logs recorded by the devices) is parsed and converted into intermediate data, after which the map and reduce steps do their work to produce the required result. In practice, for complex processing requirements, map and reduce may be invoked many times until the required result is obtained.
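The weather example follows the same pattern and can be sketched as below. The log format ("YYYY-MM-DD temperature"), the sample values, and the function names are all assumptions made for illustration; the text does not specify what the device logs actually look like.

```python
from collections import defaultdict

# Hypothetical raw log lines; the real device log format is not given in the text.
RAW_LOG = [
    "1949-03-01 11.1",
    "1949-07-15 7.8",
    "1950-01-10 0.0",
    "1950-06-21 2.2",
    "1950-09-02 -1.1",
]


def map_record(line):
    """Map step: parse one raw line into an intermediate (year, temperature) pair."""
    date, temp = line.split()
    return int(date[:4]), float(temp)


def reduce_max(year, temps):
    """Reduce step: keep the highest temperature seen for one year."""
    return year, max(temps)


def max_temperature_by_year(lines):
    """Driver: map every line, group the pairs by year, then reduce each group."""
    grouped = defaultdict(list)
    for year, temp in map(map_record, lines):
        grouped[year].append(temp)
    return dict(reduce_max(y, t) for y, t in sorted(grouped.items()))


print(max_temperature_by_year(RAW_LOG))  # {1949: 11.1, 1950: 2.2}
```

A more complex analysis would chain several such passes, feeding the output of one reduce into the next map.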