Delving Deeper into the Hadoop API
Chapter 4.1
Chapter Topics
Why Use ToolRunner?
How to Implement ToolRunner: Complete Driver
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
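Only the imports and the job-configuration calls of the driver appear above; the sketch below assembles a complete ToolRunner-based WordCount driver from the fragments shown on this and the following slides. The input/output path handling and the waitForCompletion call are assumptions, not copied from the slides.

public class WordCount extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(exitCode);
  }

  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf(
          "Usage: %s [generic options] <input dir> <output dir>\n",
          getClass().getSimpleName());
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    // Assumed path handling: args[0] is the input directory, args[1] the output directory
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Assumed job submission: block until the job finishes and report success/failure
    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }
}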
How to Implement ToolRunner: Imports
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
Note: import statements are omitted in future slides for brevity.
public int run(String[] args) throws Exception {
  if (args.length != 2) {
    System.out.printf(
        "Usage: %s [generic options] <input dir> <output dir>\n",
        getClass().getSimpleName());
    return -1;
  }
  Job job = new Job(getConf());
  …
How to Implement ToolRunner: Driver Class Definition
The driver class implements the Tool interface and extends the Configured class.

public class WordCount extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf(
          "Usage: %s [generic options] <input dir> <output dir>\n",
          getClass().getSimpleName());
      return -1;
    }
    Job job = new Job(getConf());
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
How to Implement ToolRunner: Main Method
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
...
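The main method itself is not visible above; for reference, a ToolRunner-style main method matching the complete driver sketch earlier (the class name WordCount is taken from the other slides in this chapter):

public static void main(String[] args) throws Exception {
  // ToolRunner parses the generic options, then calls run() on the driver
  int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
  System.exit(exitCode);
}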
How to Implement ToolRunner: Run Method
...
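A minimal sketch of how such a run method typically finishes, continuing the fragment shown earlier; the path handling and the waitForCompletion call are assumptions, not copied from the slide.

    // args[0] and args[1] are the input and output directories checked above
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job and wait for it to finish
    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }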
ToolRunner Command Line Options
§ ToolRunner allows the user to specify configuration options on the command line (example below)
§ Commonly used to specify Hadoop properties using the -D flag
  – Will override any default or site properties in the configuration
  – But will not override those set in the driver code
§ Note that -D options must appear before any additional program arguments
§ Can specify an XML configuration file with -conf
§ Can specify the default filesystem with -fs uri
  – Shortcut for -D fs.default.name=uri
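For example, a ToolRunner-based driver might be invoked like this; the jar name, property, and directory names are placeholders, not from the slide:

$ hadoop jar wordcount.jar WordCount -D mapred.reduce.tasks=2 indir outdir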
Chapter Topics
The setup Method

§ It is common to want your Mapper or Reducer to execute some code before the map or reduce method is called for the first time
  – Initialize data structures
  – Read data from an external file
  – Set parameters
§ The setup method is run before the map or reduce method is called for the first time (see the sketch below)
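A minimal sketch of overriding setup in a Mapper (not from the slide; the minLength field and the minlength property name are hypothetical):

@Override
protected void setup(Context context)
    throws IOException, InterruptedException {
  // Runs once per task, before the first call to map()
  // minLength is assumed to be a field of the Mapper (hypothetical)
  minLength = context.getConfiguration().getInt("minlength", 0);
}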
The cleanup Method

§ Similarly, you may wish to perform some action(s) after all the records have been processed by your Mapper or Reducer
§ The cleanup method is called before the Mapper or Reducer terminates (see the sketch below)
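A minimal sketch of overriding cleanup (not from the slide):

@Override
protected void cleanup(Context context)
    throws IOException, InterruptedException {
  // Runs once per task, after the final call to map() (or reduce()),
  // e.g. to emit totals accumulated across all records or to release resources
}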
Passing Parameters
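A minimal sketch of one common way to pass parameters to Mappers and Reducers: set a property on the job's Configuration in the driver and read it back in setup. The property name minlength and its value are hypothetical, chosen only for illustration.

// In the driver, before submitting the job:
Configuration conf = job.getConfiguration();
conf.setInt("minlength", 5);   // hypothetical parameter

// In the Mapper or Reducer:
@Override
protected void setup(Context context) {
  // minLength is assumed to be a field of the class (hypothetical)
  minLength = context.getConfiguration().getInt("minlength", 0);
}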
Chapter Topics
The Combiner
The Combiner

§ Output from the Combiners is passed to the Reducers

[Diagram: Map phase — three InputFormats feed three Mappers; each Mapper's output passes through a Combiner, and the Combiner output is sent to the two Reducers]
WordCount Revisited

[Diagram: Node 1's Mapper processes "the cat sat on the mat" and Node 2's Mapper processes "the aardvark sat on the sofa", each emitting (word, 1) pairs. The shuffle delivers all values for a given word to one Reducer, which sums them to produce: aardvark 1, cat 1, mat 1, on 2, sat 2, sofa 1, the 4]
4-‐19
WordCount With Combiner

[Diagram: the same two Mappers, but each Mapper's output now passes through a Combiner before the shuffle. On Node 1 the Combiner collapses "the 1, the 1" into "the 2" (and likewise on Node 2), so fewer (word, count) pairs cross the network; the Reducers produce the same final result: aardvark 1, cat 1, mat 1, on 2, sat 2, sofa 1, the 4]
Writing a Combiner
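A Combiner is written using the Reducer API. A sketch of a summing reducer like the SumReducer referenced elsewhere in this chapter; the implementation shown here is an assumption, not copied from the slide:

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum the partial counts for this word
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}

Because summing is associative and commutative, the same class can safely be used as both the Combiner and the Reducer.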
Combiners and Reducers

[Diagram: for key foo with values 2, 3, 5, and 6 — combining partial sums first, Sum(2,3)=5 and Sum(5,6)=11, then Sum(5,11)=16, gives the same result as Sum(2,3,5,6)=16]
Specifying a Combiner

§ Specify the Combiner class to be used in your MapReduce code in the driver
  – Use the setCombinerClass method, e.g.:

job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setCombinerClass(SumReducer.class);

§ Input and output data types for the Combiner and the Reducer for a job must be identical
§ VERY IMPORTANT: The Combiner may run once, or more than once, on the output from any given Mapper
  – Do not put code in the Combiner which could influence your results if it runs more than once
Chapter Topics
Accessing HDFS Programmatically

§ In addition to using the command-line shell, you can access HDFS programmatically
  – Useful if your code needs to read or write 'side data' in addition to the standard MapReduce inputs and outputs
  – Or for programs outside of Hadoop which need to read the results of MapReduce jobs
§ Beware: HDFS is not a general-purpose filesystem!
  – Files cannot be modified once they have been written, for example
§ Hadoop provides the FileSystem abstract base class
  – Provides an API to generic file systems
  – Could be HDFS
  – Could be your local file system
  – Could even be, for example, Amazon S3
The FileSystem API (1)

§ In order to use the FileSystem API, retrieve an instance of it (see the sketch below)
§ The conf object has read in the Hadoop configuration files, and therefore knows the address of the NameNode
§ A file in HDFS is represented by a Path object
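A minimal sketch of the usual pattern; the file path is a placeholder:

Configuration conf = new Configuration();
// FileSystem.get returns the filesystem named in the configuration (HDFS here)
FileSystem fs = FileSystem.get(conf);
Path p = new Path("/user/fred/file.txt");   // placeholder path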
The FileSystem API (2)
The FileSystem API: Directory Listing
// do something interesting
}
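A sketch of a typical listing loop consistent with the fragment above, assuming the fs handle obtained on the earlier slide; the directory path is a placeholder:

// listStatus returns one FileStatus per entry in the directory
FileStatus[] listing = fs.listStatus(new Path("/user/fred"));
for (FileStatus entry : listing) {
  // do something interesting, e.g. entry.getPath(), entry.getLen()
}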
The FileSystem API: Writing Data
// write an int
out.writeInt(getInt());
...
out.close();
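A sketch of how the out stream in the fragment above is typically obtained; the path is a placeholder, and fs is a FileSystem handle as before:

// create() returns an FSDataOutputStream, which extends java.io.DataOutputStream,
// so methods such as writeInt() are available
FSDataOutputStream out = fs.create(new Path("/user/fred/output.dat"));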
Chapter Topics
The Distributed Cache: Motivation

§ A common requirement is for a Mapper or Reducer to need access to some 'side data'
  – Lookup tables
  – Dictionaries
  – Standard configuration values
§ One option: read directly from HDFS in the setup method
  – Using the API seen in the previous section
  – Works, but is not scalable
§ The Distributed Cache provides an API to push data to all slave nodes
  – Transfer happens behind the scenes before any task is executed
  – Data is only transferred once to each node, rather than once per task
  – Note: Distributed Cache is read-only
  – Files in the Distributed Cache are automatically deleted from slave nodes when the job finishes
Using the Distributed Cache: The Difficult Way

– .jar files added with addFileToClassPath will be added to your Mapper or Reducer's classpath
– Files added with addCacheArchive will automatically be dearchived/decompressed (see the sketch below)
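A minimal sketch of adding files to the Distributed Cache from driver code using the DistributedCache class; the paths are placeholders, and the files are assumed to already be in HDFS. This is one possible form of the calls, not necessarily the exact code shown on the slide.

// requires: import java.net.URI;
//           import org.apache.hadoop.filecache.DistributedCache;
Configuration conf = job.getConfiguration();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat"), conf);          // an ordinary file
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), conf);    // added to the task classpath
DistributedCache.addCacheArchive(new URI("/myapp/dictionaries.zip"), conf); // dearchived on the nodes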
Using the DistributedCache: The Easy Way

§ If you are using ToolRunner, you can add files to the Distributed Cache directly from the command line when you run the job
  – No need to copy the files to HDFS first
§ Use the -files option to add files (example below)
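For example (the jar, driver, file, and directory names are placeholders):

$ hadoop jar wordcount.jar WordCount -files lookup.dat,dictionary.txt indir outdir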
Accessing Files in the Distributed Cache

§ Files added to the Distributed Cache are made available in your task's local working directory
  – Access them from your Mapper or Reducer the way you would read any ordinary local file (see the sketch below)
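A minimal sketch of reading such a file in a Mapper's setup method; the file name matches the -files example above and is a placeholder:

@Override
protected void setup(Context context)
    throws IOException, InterruptedException {
  // The cached file appears in the task's working directory under its original name
  BufferedReader reader = new BufferedReader(new FileReader("lookup.dat"));
  String line;
  while ((line = reader.readLine()) != null) {
    // e.g. populate an in-memory lookup structure
  }
  reader.close();
}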
Chapter Topics
Reusable Classes for the New API
Key Points
Bibliography

The following offer more information on topics discussed in this chapter:

§ Combiners are discussed in TDG 3e on pages 33-36.
§ A table describing available filesystems in Hadoop is on pages 52-53 of TDG 3e.
§ The HDFS API is described in TDG 3e on pages 55-67.
§ Distributed cache: See pages 289-295 of TDG 3e for more details.
© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.