Installing a Hadoop Cluster
1. Preparation
2. Installing Hadoop
3. Configuring the Hadoop cluster parameters
4. Formatting HDFS
5. Starting the system
6. Verifying the installation
1. Preparation
1.1. Create the hadoop user
Run the following commands on all servers (master and slaves):
useradd hadoop
passwd hadoop
hadoop ALL=(ALL) NOPASSWD: ALL   (sudoers entry that lets the hadoop user run every command with root privileges)
1.2. Install Java 1.7 (installed by the QTHT team)
If it is not installed yet, see:
http://timarcher.com/node/59
http://www.roseindia.net/linux/tutorial/installingjdk5onlinux.shtml
1.3. Configure SSH
Check whether SSH is already installed on the machine:
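The check command itself did not survive in this copy. A minimal sketch, assuming OpenSSH with its default key paths; the Hadoop start scripts need passwordless SSH from the master to the hadoop user on every node:

```shell
# Is the ssh client installed? (on RPM-based systems: rpm -qa | grep -i openssh)
command -v ssh || echo "ssh is not installed"

# Create a passwordless key pair for the hadoop user (if one does not exist yet)
# and authorize it, so the start scripts can log in without a password prompt.
mkdir -p "$HOME/.ssh"
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -q -t rsa -N "" -f "$HOME/.ssh/id_rsa"
cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
# Copy the public key to each slave as well, for example:
#   ssh-copy-id hadoop@slave01
```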
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$JAVA_HOME/bin:$PATH:$HADOOP_PREFIX/bin
File $HADOOP_HOME/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.http.address</name>
<value>0.0.0.0:9070</value>
</property>
<property>
<name>dfs.secondary.http.address</name>
<value>0.0.0.0:9090</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:9010</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:9075</value>
</property>
<property>
<name>dfs.datanode.https.address</name>
<value>0.0.0.0:9475</value>
</property>
<property>
<name>dfs.datanode.ipc.address</name>
<value>0.0.0.0:9020</value>
</property>
<property>
<name>dfs.https.address</name>
<value>0.0.0.0:9470</value>
</property>
</configuration>
File $HADOOP_HOME/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>master:9311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.local.dir</name>
<value>/u01/hadoop-0.20.203.0/tempdir</value>
</property>
<property>
<name>mapred.map.child.java.opts</name>
<value>-Xmx512M -Djava.io.tmpdir=/u01/hadoop-0.20.203.0/tempdir</value>
<description>Larger heap-size for child jvms of maps.
</description>
</property>
<property>
<name>mapred.reduce.child.java.opts</name>
<value>-Xmx512M -Djava.io.tmpdir=/u01/hadoop-0.20.203.0/tempdir</value>
<description>Larger heap-size for child jvms of reduces.
</description>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>0.0.0.0:9030</value>
<description>The host and port that the job tracker http server listens on.
</description>
</property>
<property>
<name>mapred.task.tracker.http.address</name>
<value>0.0.0.0:9060</value>
<description>The host and port that the task tracker http server listens on.
</description>
</property>
</configuration>
File $HADOOP_HOME/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
File name                  Format           Description
hadoop-env.sh              Bash script      Holds the environment variables used when running the daemons of the Hadoop cluster.
core-site.xml              XML              Configuration file for the hadoop core parameters.
hdfs-site.xml              XML              Configuration file for the daemons that run HDFS: namenode, datanode, secondary namenode.
mapred-site.xml            XML              Configuration file for the daemons that run MapReduce: jobtracker, tasktracker.
masters                    Plain text       List of the IP addresses (or hostnames, if DNS is set up) of the machines running the secondary namenode.
slaves                     Plain text       List of the IP addresses (or hostnames, if DNS is set up) of the machines running the datanode and tasktracker daemons.
hadoop-metrics.properties  Java properties  Configures the metrics, i.e. the way Hadoop reports the operational information of the cluster.
log4j.properties           Java properties  Configures the logging properties for the namenode, datanode, jobtracker and tasktracker.
hadoop-env.sh
This file holds the environment variables used to run the Hadoop daemons. The Hadoop daemons are Namenode/Datanode, Jobtracker/TaskTracker and Secondary Namenode. Some important parameters:

Name: JAVA_HOME
Default value: (none)
Meaning: Environment variable holding the Java home directory. This is a very important variable.

Name: HADOOP_LOG_DIR
Default value: $HADOOP_HOME/log
Meaning: Directory holding the log files written while the daemons are running.

Name: HADOOP_HEAPSIZE
Default value: 1000 MB
Meaning: Maximum amount of memory allocated for running each daemon.
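As an illustration, the three variables above might be set in conf/hadoop-env.sh as follows (the JAVA_HOME path is a placeholder, not a value taken from this guide):

```shell
# conf/hadoop-env.sh -- excerpt (sketch)
export JAVA_HOME=/usr/java/jdk1.7.0        # placeholder path; mandatory, there is no default
export HADOOP_LOG_DIR=${HADOOP_HOME}/log   # directory for the daemon log files
export HADOOP_HEAPSIZE=1000                # maximum heap per daemon, in MB (the default)
```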
core-site.xml
Configures the parameters for the daemons that run Hadoop:
Name: fs.default.name
Default value: file://
Meaning: The default file system; for HDFS it is hdfs://<hostname or IP of the namenode>.

Name: hadoop.tmp.dir
Default value: /tmp
Meaning: By default, this is the base storage directory for all the other Hadoop directories.

Name: dfs.replication
Default value: 3
Meaning: This parameter sets the default replication level for a file when it is created on HDFS. As we know, the replication level of a file is the number of copies of each of its blocks: if the replication level of a file F is n, then every block of the file is stored as n copies spread across the cluster.

Name: dfs.name.dir
Default value: ${hadoop.tmp.dir}/dfs/name
Meaning: List of directories storing data on the namenode. This is where the file names (the namespace) of HDFS are kept.

Name: dfs.data.dir
Default value: ${hadoop.tmp.dir}/dfs/data
Meaning: List of directories where the datanode daemon stores data. This is where the data is actually kept.

Name: fs.checkpoint.dir
Default value: ${hadoop.tmp.dir}/dfs/namesecondary
Meaning: Directories where the secondary namenode daemon stores its checkpoints.
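The core-site.xml listing itself was lost from this copy. A sketch of what it would contain for this cluster, assuming the namenode runs on the host master and picking 9000 as a placeholder port (neither value comes from the lost page):

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- placeholder host:port; point this at your namenode -->
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- temp directory used elsewhere in this guide -->
    <value>/u01/hadoop-0.20.203.0/tempdir</value>
  </property>
</configuration>
```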
mapred-site.xml
Configures the daemons that run Map/Reduce. Important parameters:

Name: mapred.job.tracker
Default value: localhost:8021
Meaning: Hostname (or IP) and port of the JobTracker daemon. As we know, a cluster runs a single JobTracker daemon; its default port is 8021.

Name: mapred.local.dir
Meaning: Where the MapReduce processes, JobTracker and TaskTracker, store data on the local file system.
Distributing the installation and configuration to every node of the cluster
Use the scp command to copy the whole /u01/hadoop/hadoop_installation directory to the corresponding directories on slave01 and slave02.
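The copy can be scripted as a loop over the slaves; a sketch shown as a dry run (the echo only prints each command; drop it to actually copy; the hadoop user and the hostnames are the ones used in this guide):

```shell
# Push the installation directory to every slave (dry run).
SRC=/u01/hadoop/hadoop_installation
for host in slave01 slave02; do
  echo scp -r "$SRC" "hadoop@$host:$SRC"
done
```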
4. Format HDFS
Note: the following command must be run from the NameNode.
Format the namenode:
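The format command itself is missing from this copy; on Hadoop 0.20.x it is normally hadoop namenode -format. Shown as a dry run because formatting erases any existing HDFS metadata (drop the echo to execute it, as the hadoop user, on the NameNode only):

```shell
HADOOP_HOME=${HADOOP_HOME:-/u01/hadoop-0.20.203.0}   # install path used in this guide
echo "$HADOOP_HOME/bin/hadoop namenode -format"      # dry run; remove echo to format
```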
5. Starting the system
Starting Hadoop
Note: the following commands must be run from the namenode.
Before starting, make sure the firewall is turned off on all the nodes.
Start MapReduce:
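The start commands did not survive here either; on Hadoop 0.20.x the standard scripts are start-dfs.sh for HDFS and start-mapred.sh for MapReduce. A dry-run sketch (drop the echos to execute from the namenode):

```shell
HADOOP_HOME=${HADOOP_HOME:-/u01/hadoop-0.20.203.0}
echo "$HADOOP_HOME/bin/start-dfs.sh"      # NameNode, SecondaryNameNode and DataNodes
echo "$HADOOP_HOME/bin/start-mapred.sh"   # JobTracker and TaskTrackers
```

Afterwards, jps on the master should list NameNode, SecondaryNameNode and JobTracker, and jps on each slave should list DataNode and TaskTracker.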
File core-site.xml

Name: hadoop.tmp.dir
Default value: /tmp/hadoop-${user.name}
Description: The temporary directories on the nodes of the cluster.

Name: fs.default.name
Default value: file:///
Description: The name of the default file system, including the scheme and authority fields of the URI; the authority consists of host and port. The default is the local file system; for HDFS it is hdfs://.

Name: fs.checkpoint.size
Default value: 67108864
Description: The size of the edit log (in bytes) that triggers a new checkpoint.

Name: local.cache.size
Default value: 10737418240
Description: The maximum cache size you want to keep (10 GB by default).
HDFS file system configuration parameters
File hdfs-site.xml
The hdfs-site.xml configuration file is used to configure the settings of the HDFS file system. See also http://hadoop.apache.org/common/docs/current/hdfs-default.html
Name: dfs.namenode.logging.level
Default value: info
Description: The logging level for the namenode. If the value is "dir", changes to the namespace are logged; if "block", information about replicas and block create/delete operations is logged; "all" logs everything.

Name: dfs.secondary.http.address
Default value: 0.0.0.0:50090
Description: The http address of the Secondary Namenode server. If the port is 0, the server runs on an arbitrary port.

Name: dfs.datanode.address
Default value: 0.0.0.0:50010
Description: The address the datanode server uses to listen for connections. If the port is 0, the server runs on an arbitrary port.

Name: dfs.datanode.http.address
Default value: 0.0.0.0:50075
Description: The http address of the datanode server.

Name: dfs.datanode.handler.count
Default value: 3
Description: The number of server threads for the datanode.

Name: dfs.http.address
Default value: 0.0.0.0:50070
Description: The address and port of the web interface of the dfs namenode, used to listen for connections. If the port is 0, the server runs on an arbitrary port.

Name: dfs.name.dir
Default value: ${hadoop.tmp.dir}/dfs/name
Description: The directory on the local file system where the DFS Namenode stores the fsimage file. If there are several directories, the fsimage file is replicated in all of them.

Name: dfs.name.edits.dir
Default value: ${dfs.name.dir}
Description: The directory on the local file system where the DFS Namenode stores the transaction file (the edits file). If there are several directories, the file is replicated in all of them.

Name: dfs.permissions
Default value: TRUE
Description: Enables permission checking on HDFS.

Name: dfs.data.dir
Default value: ${hadoop.tmp.dir}/dfs/data
Description: The directory on the local file system where a DFS Datanode stores its block files. If there are several directories, the blocks are replicated in all of them. Directories that do not exist are ignored.

Name: dfs.replication
Default value: 3
Description: The default number of replicas of a block.

Name: dfs.replication.max
Default value: 512
Description: The maximum number of replicas of a block.

Name: dfs.replication.min
Default value: 1
Description: The minimum number of replicas of a block.

Name: dfs.block.size
Default value: 67108864
Description: The default size of a block (64 MB).

Name: dfs.heartbeat.interval
Default value: 3
Description: The interval at which a datanode sends heartbeats to the Namenode (seconds).

Name: dfs.namenode.handler.count
Default value: 10
Description: The number of server threads on the Namenode.

Name: dfs.replication.interval
Default value: 3
Description: The period (in seconds) after which the namenode recomputes the replica counts for the datanodes.
File masters
This file defines the host that acts as the Secondary Namenode. Each line in the file is the IP address or the hostname of that host.
File slaves
This file defines the hosts that act as DataNodes as well as TaskTrackers. Each line in the file is the IP address or the hostname of such a host.
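For the hostnames used in this guide (master, slave01, slave02), and assuming the Secondary Namenode runs on the master host, the two files would look like this:

```
conf/masters:
master

conf/slaves:
slave01
slave02
```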
Configuration parameters for the Hadoop MapReduce framework
File mapred-site.xml
The mapred-site.xml configuration file is used to configure the settings of the MapReduce framework. See also http://hadoop.apache.org/common/docs/current/mapred-default.html
Name: mapred.job.tracker
Default value: local
Description: The host and port that the MapReduce job tracker runs at. If "local", jobs are run in-process as a single map and reduce task.

Name: mapred.job.tracker.http.address
Default value: 0.0.0.0:50030
Description: The address and port of the job tracker http server.

Name: mapred.local.dir
Default value: ${hadoop.tmp.dir}/mapred/local
Description: The local directory where MapReduce stores its intermediate data files. It may be a comma-separated list of directories on different devices in order to spread disk I/O. The directories must exist.

Name: mapred.system.dir
Default value: ${hadoop.tmp.dir}/mapred/system
Description: The shared directory where MapReduce stores its control files.

Name: mapred.temp.dir
Default value: ${hadoop.tmp.dir}/mapred/temp
Description: A shared directory for temporary files.

Name: mapred.map.tasks
Default value: 2
Description: The number of map tasks per job. Has no effect when mapred.job.tracker is "local".

Name: mapred.reduce.tasks
Default value: 1
Description: The number of reduce tasks per job. Has no effect when mapred.job.tracker is "local".

Name: mapred.child.java.opts
Default value: -Xmx200m
Description: The Java options for the child processes of the TaskTracker. The value sets the heap size for a task.

Name: mapred.job.reuse.jvm.num.tasks
Default value: 1
Description: The number of tasks to run on each JVM. If the value is -1, the number of tasks is not limited.

Name: mapred.task.tracker.http.address