© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Common MapReduce Algorithms
Chapter 7.1

Introduction
§ MapReduce jobs tend to be relatively short in terms of lines of code
§ It is typical to combine multiple small MapReduce jobs together in a single workflow
  – Often using Oozie (see later)
§ You are likely to find that many of your MapReduce jobs use very similar code
§ In this chapter we present some very common MapReduce algorithms
  – These algorithms are frequently the basis for more complex MapReduce jobs

Sorting (1)
§ MapReduce is very well suited to sorting large data sets
§ Recall: keys are passed to the Reducer in sorted order
§ Assuming the file to be sorted contains lines with a single value:
  – Mapper is merely the identity function for the value: (k, v) → (v, _)
  – Reducer is the identity function: (k, _) → (k, '')

Sorting (2)
§ For multiple Reducers, need to choose a partitioning function such that if k1 < k2, partition(k1) <= partition(k2)
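
The ordering condition on the partitioning function can be illustrated with a small simulation. This is a minimal sketch in plain Java with no Hadoop classes; the class name, the `splitPoints` array, and the sample data are assumptions made for illustration, not part of the course material.

```java
import java.util.*;

// Plain-Java stand-in for a MapReduce sort job; no Hadoop classes are used.
class RangePartitionSortSketch {

    // Order-preserving partition function: k1 < k2 implies partition(k1) <= partition(k2).
    // splitPoints (hypothetical) must be sorted ascending; n split points give n + 1 partitions.
    static int partition(String key, String[] splitPoints) {
        int p = 0;
        while (p < splitPoints.length && key.compareTo(splitPoints[p]) >= 0) {
            p++;
        }
        return p;
    }

    // Identity map on the value, per-partition sort (the shuffle), identity reduce.
    static List<String> sort(List<String> lines, String[] splitPoints) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String line : lines) {
            // Mapper: (k, v) -> (v, _), routed by the partition function.
            partitions.get(partition(line, splitPoints)).add(line);
        }
        List<String> output = new ArrayList<>();
        for (List<String> part : partitions) {
            Collections.sort(part);  // each simulated Reducer sees its keys in sorted order
            output.addAll(part);     // concatenate Reducer outputs in partition order
        }
        return output;
    }

    public static void main(String[] args) {
        String[] splits = {"h", "p"};
        System.out.println(sort(Arrays.asList("pear", "apple", "kiwi", "zebra", "fig"), splits));
    }
}
```

Because the partition function preserves key order, concatenating the per-partition outputs yields a globally sorted result. A hash-based partition function would sort each partition correctly but would not give a global order.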

Sorting as a Speed Test of Hadoop
§ Sorting is frequently used as a speed test for a Hadoop cluster
  – Mapper and Reducer are trivial
  – Therefore sorting is effectively testing the Hadoop framework's I/O
§ Good way to measure the increase in performance if you enlarge your cluster
  – Run and time a sort job before and after you add more nodes
  – terasort is one of the sample jobs provided with Hadoop
    – Creates and sorts very large files

Searching
§ Assume the input is a set of files containing lines of text
§ Assume the Mapper has been passed the pattern for which to search as a special parameter
  – We saw how to pass parameters to a Mapper in a previous chapter
§ Algorithm:
  – Mapper compares the line against the pattern
  – If the pattern matches, Mapper outputs (line, _)
    – Or (filename+line, _), or …
  – If the pattern does not match, Mapper outputs nothing
  – Reducer is the Identity Reducer
    – Just outputs each intermediate key
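
The search algorithm above can be sketched as follows. This is a plain-Java simulation, not Hadoop API code: the class and method names are hypothetical, the pattern is treated as a regular expression, and the filename+line key is joined with a colon purely for illustration.

```java
import java.util.*;
import java.util.regex.Pattern;

// Plain-Java simulation of the search job; names here are illustrative only.
class SearchSketch {

    // Mapper: emit (filename + line, _) when the pattern matches, nothing otherwise.
    static void map(String filename, String line, Pattern pattern, List<String> emitted) {
        if (pattern.matcher(line).find()) {
            emitted.add(filename + ":" + line);
        }
    }

    // Runs the mapper over every line, then sorts the keys as the shuffle would.
    // The identity Reducer simply passes each intermediate key through.
    static List<String> search(Map<String, List<String>> files, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                map(file.getKey(), line, pattern, keys);
            }
        }
        Collections.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("File2", Arrays.asList("I never hope to see one"));
        files.put("File1", Arrays.asList("I never saw a", "purple cow;"));
        System.out.println(search(files, "never"));
    }
}
```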

Indexing
§ Assume the input is a set of files containing lines of text
§ Key is the byte offset of the line, value is the line itself
§ We can retrieve the name of the file using the Context object
  – More details on how to do this in the Exercise

Inverted Index Algorithm
§ Mapper:
  – For each word in the line, emit (word, filename)
§ Reducer:
  – Identity function
    – Collect together all values for a given key (i.e., all filenames for a particular word)
    – Emit (word, filename_list)
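
A minimal sketch of the inverted index algorithm, simulated in plain Java rather than written against the Hadoop API. The class name, the lower-casing, the split on non-word characters, and the de-duplication of filenames through a set are all assumptions made for this example.

```java
import java.util.*;

// Plain-Java simulation of the inverted index job; names are illustrative only.
class InvertedIndexSketch {

    // files maps a filename to its lines; the result maps each word to its filename list.
    static SortedMap<String, SortedSet<String>> build(Map<String, List<String>> files) {
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                // Mapper: for each word in the line, emit (word, filename).
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        // Reducer: collect all filenames for a given word.
                        index.computeIfAbsent(word, w -> new TreeSet<>()).add(file.getKey());
                    }
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("File1", Arrays.asList("I never saw a", "purple cow;"));
        files.put("File2", Arrays.asList("I never hope to see one"));
        build(files).forEach((word, names) -> System.out.println(word + "\t" + names));
    }
}
```

Using a sorted set means a filename appears once per word even if the word occurs several times in that file. Whether duplicates should be kept is a design choice the slides leave open.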
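
Inverted Index: Dataflow
[Figure: dataflow for an inverted index over two input files, File1 ("I never saw a purple cow;") and File2. The Mapper emits (word, filename) pairs such as (i, File1), (never, File1), (saw, File1), (a, File1), (purple, File1), (cow, File1), (hope, File2), (one, File2). The Reducer output lists the filenames for each word, e.g. (never, File1, File2).]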

Aside: Word Count
§ Recall the WordCount example we used earlier in the course
  – For each word, Mapper emitted (word, 1)
  – Very similar to the inverted index
§ This is a common theme: reuse of existing Mappers, with minor modifications

Term Frequency – Inverse Document Frequency

TF-IDF: Motivation
§ Merely counting the number of occurrences of a word in a document is not a good enough measure of its relevance
  – If the word appears in many other documents, it is probably less relevant
  – Some words appear too frequently in all documents to be relevant
    – Known as 'stopwords'
    – e.g. a, the, this, to, from, etc.
§ TF-IDF considers both the frequency of a word in a given document and the number of documents which contain the word

TF-IDF: Data Mining Example

TF-IDF Formally Defined

Computing TF-IDF

Computing TF-IDF With MapReduce

Computing TF-IDF: Job 1 – Compute tf
§ Mapper
  – Input: (docid, contents)
  – For each term in the document, generate a (term, docid) pair
    – i.e., we have seen this term in this document once
  – Output: ((term, docid), 1)
§ Reducer
  – Sums counts for word in document
  – Outputs ((term, docid), tf)
    – i.e., the term frequency of term in docid is tf
§ We can add a Combiner, which will use the same code as the Reducer

Computing TF-IDF: Job 2 – Compute n
§ Mapper
  – Input: ((term, docid), tf)
  – Output: (term, (docid, tf, 1))
§ Reducer
  – Sums 1s to compute n (number of documents containing term)
  – Note: need to buffer (docid, tf) pairs while we are doing this (more later)
  – Outputs ((term, docid), (tf, n))

Computing TF-IDF: Job 3 – Compute TF-IDF
§ Mapper
  – Input: ((term, docid), (tf, n))
  – Assume N is known (easy to find)
  – Output: ((term, docid), TF × IDF)
§ Reducer
  – The identity function
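
The three jobs can be traced end to end with a small in-memory simulation. This is only a sketch under stated assumptions: the formal definition slide did not survive in this copy, so the code assumes the common form TF × IDF = tf × ln(N / n). The class name, the `term@docid` output key, and the tokenization are likewise hypothetical.

```java
import java.util.*;

// Plain-Java simulation of the three TF-IDF jobs; not Hadoop API code.
class TfIdfSketch {

    // docs maps docid to contents; the result maps "term@docid" to its TF-IDF score.
    static Map<String, Double> compute(Map<String, String> docs) {
        // Job 1: sum the 1s emitted for each (term, docid) to get tf.
        Map<String, Map<String, Integer>> tfByTerm = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    tfByTerm.computeIfAbsent(term, t -> new TreeMap<>())
                            .merge(doc.getKey(), 1, Integer::sum);
                }
            }
        }
        int totalDocs = docs.size(); // N, assumed known
        Map<String, Double> scores = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : tfByTerm.entrySet()) {
            // Job 2: n is the number of documents containing the term.
            // The inner map plays the role of the buffered (docid, tf) pairs.
            int n = e.getValue().size();
            for (Map.Entry<String, Integer> perDoc : e.getValue().entrySet()) {
                // Job 3: TF x IDF, with IDF assumed to be ln(N / n).
                double tfIdf = perDoc.getValue() * Math.log((double) totalDocs / n);
                scores.put(e.getKey() + "@" + perDoc.getKey(), tfIdf);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "purple cow cow");
        docs.put("d2", "brown cow");
        compute(docs).forEach((key, score) -> System.out.println(key + "\t" + score));
    }
}
```

Under this assumed formula, a term that occurs in every document scores zero however often it appears, which matches the motivation for discounting very common words.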

Computing TF-IDF: Working At Scale
§ Job 2: We need to buffer (docid, tf) pairs while summing 1s (to compute n)
  – Possible problem: pairs may not fit in memory!
    – In how many documents does the word "the" occur?
§ Possible solutions
  – Ignore very-high-frequency words
  – Write out intermediate data to a file
  – Use another MapReduce pass

TF-IDF: Final Thoughts

Word Co-Occurrence: Motivation

Word Co-Occurrence: Algorithm
§ Mapper
map(docid a, doc d) {
  foreach w in d do
    foreach u near w do
      emit(pair(w, u), 1)
}
§ Reducer
  – Sums the frequencies for each pair
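
The Mapper pseudocode and the summing Reducer can be sketched in plain Java as below. The slides do not define "near", so this example assumes it means "within `window` positions of w in the same document". The class name and the `w,u` string used for the pair key are also assumptions.

```java
import java.util.*;

// Plain-Java simulation of the word co-occurrence job; names are illustrative only.
class CoOccurrenceSketch {

    // Counts ordered pairs (w, u) where u lies within `window` positions of w.
    static SortedMap<String, Integer> count(List<String> docs, int window) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (String doc : docs) {
            List<String> words = new ArrayList<>();
            for (String w : doc.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
            for (int i = 0; i < words.size(); i++) {
                int from = Math.max(0, i - window);
                int to = Math.min(words.size() - 1, i + window);
                for (int j = from; j <= to; j++) {
                    if (j != i) {
                        // Mapper: emit(pair(w, u), 1). Reducer: sum the 1s for each pair.
                        counts.merge(words.get(i) + "," + words.get(j), 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("I never saw a purple cow"), 1));
    }
}
```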

Secondary Sort: Motivation (1)
§ Recall that keys are passed to the Reducer in sorted order
§ The list of values for a particular key is not sorted
  – Order may well change between different runs of the MapReduce job

Secondary Sort: Motivation (2)
§ Sometimes a job needs to receive the values for a particular key in a sorted order
  – This is known as a secondary sort
§ Example: Sort by Last Name, then First Name

Secondary Sort: Motivation (3)
§ Example: Find the latest birth year for each surname in a list
§ Naïve solution
  – Reducer loops through all values, keeping track of the latest year
  – Finally, emit the latest year
§ Better solution
  – Pass the values sorted by year in descending order to the Reducer, which can then just emit the first value
[Figure: example record "Addams Gomez 1964-09-18"]

Implementing Secondary Sort: Composite Keys
let map(k, v) =
  emit(new Pair(v.getPrimaryKey(), v.getSecondaryKey()), v)

Implementing Secondary Sort: Partitioning Composite Keys
[Figure: the Partitioner distributes the following composite-key records across Partition 0 and Partition 1]
  Jones#1947   Jones David 1947-Jan-08
  Jones#1901   Jones Asa 1901-Aug-08
  Addams#1860  Addams Jane 1860-Sep-06
  Jones#1945   Jones David 1945-Dec-30
  Addams#1964  Addams Gomez 1964-Sep-18

Implementing Secondary Sort: Sorting Composite Keys

Implementing Secondary Sort: Sort Comparator
§ Sort Comparator
  – Sorts the input to the Reducer
  – Uses the full composite key: compares natural key first; if equal, compares secondary key

Implementing Secondary Sort: Grouping Comparator
§ Grouping Comparator
  – Uses 'natural' key only
  – Determines which keys and values are passed in a single call to the Reducer
Addams#1860 = Addams#1964
Addams#1860 < Jones#1945
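
The interplay of the two comparators can be sketched with ordinary `java.util.Comparator` objects. This is a plain-Java simulation rather than Hadoop comparator classes: `NameYear`, `SORT`, `GROUP`, and `latestYearBySurname` are hypothetical names, and the descending order on the year follows the "latest birth year" example.

```java
import java.util.*;

// Plain-Java simulation of secondary sort; not Hadoop comparator classes.
class SecondarySortSketch {

    // Composite key: natural key (last name) plus secondary key (birth year).
    static final class NameYear {
        final String lastName;
        final int year;

        NameYear(String lastName, int year) {
            this.lastName = lastName;
            this.year = year;
        }
    }

    // Sort comparator: natural key first; if equal, secondary key in descending order.
    static final Comparator<NameYear> SORT = (a, b) -> {
        int byName = a.lastName.compareTo(b.lastName);
        return byName != 0 ? byName : Integer.compare(b.year, a.year);
    };

    // Grouping comparator: natural key only, so Addams#1860 "equals" Addams#1964.
    static final Comparator<NameYear> GROUP = (a, b) -> a.lastName.compareTo(b.lastName);

    // Simulated reduce phase: the first key of each group carries the latest year.
    static Map<String, Integer> latestYearBySurname(List<NameYear> keys) {
        List<NameYear> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, Integer> latest = new LinkedHashMap<>();
        NameYear previous = null;
        for (NameYear key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                latest.put(key.lastName, key.year); // start of a new reduce() call
            }
            previous = key;
        }
        return latest;
    }

    public static void main(String[] args) {
        List<NameYear> keys = Arrays.asList(
                new NameYear("Jones", 1947), new NameYear("Jones", 1901),
                new NameYear("Addams", 1860), new NameYear("Jones", 1945),
                new NameYear("Addams", 1964));
        System.out.println(latestYearBySurname(keys));
    }
}
```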

Implementing Secondary Sort: Setting Comparators

Secondary Sort: Summary
1. Mapper emits composite keys
2. Custom Partitioner partitions by natural key
3. Sort Comparator sorts composite key
4. Grouping Comparator groups by natural key for reduce() calls

[Figure: example dataflow. Mapper output (composite key, value):]
  Turing#1912  Turing Alan 1912-Jun-23
  Jones#1947   Jones David 1947-Jan-08
  Addams#1860  Addams Jane 1860-Sep-06
  Jones#1901   Jones Asa 1901-Aug-08
  Jones#1945   Jones David 1945-Dec-30
  Addams#1964  Addams Gomez 1964-Sep-18
[Partition 0 after sorting and grouping:]
  Jones#1947   Jones David 1947-Jan-08
  Jones#1945   Jones David 1945-Dec-30
  Jones#1901   Jones Asa 1901-Aug-08
  Turing#1912  Turing Alan 1912-Jun-23

Key Points (1)

Key Points (2)
§ Word co-occurrence
  – Mapper: emits pairs of "close" words as keys, their frequencies as values
  – Reducer: sum frequencies for each pair
§ Secondary Sort
  – Define a composite key type with natural key and secondary key
  – Partition by natural key
  – Define comparators for sorting (by both keys) and grouping (by natural key)

Bibliography
The following offer more information on topics discussed in this chapter
§ For more information on TF-IDF, see
  – http://marcellodesales.wordpress.com/2009/12/31/tf-idf-in-hadoop-part-1-word-frequency-in-doc/
§ The secondary sort is described in TDG 3e on pages 277-283.

Joining Data Sets in MapReduce Jobs
Chapter 7.2

Introduction
§ We frequently need to join data together from two sources as part of a MapReduce job, such as
  – Lookup tables
  – Data from database tables
§ There are two fundamental approaches: Map-side joins and Reduce-side joins
§ Map-side joins are easier to write, but have potential scaling issues
§ We will investigate both types of joins in this chapter

But First…
§ Avoid writing joins in Java MapReduce if you can!
§ Tools such as Impala, Hive, and Pig are much easier to use
  – Save hours of programming
§ If you are dealing with text-based data, there really is no reason not to use Impala, Hive, or Pig

Map-Side Joins: The Algorithm

Map-Side Joins: Problems, Possible Solutions

Reduce-Side Joins: The Basic Concept

Reduce-Side Joins: Example
Join Employees with Locations on locid

Employees
empid  empname            locid
001    Elizabeth Windsor  4
002    Peter Parker       5
003    Levi Strauss       2
004    Francis Bacon      4
…      …                  …

Locations
locid  location
1      Chicago
2      San Francisco
3      Amsterdam
4      London
5      New York

Example Record Data Structure
§ A data structure to hold a record could look like this:
class Record {
  enum RecType { emp, loc };
  RecType type;
  String empId;
  String empName;
  int locId;
  String locName;
}
§ Example records
  type:    emp             type:    loc
  empId:   003             empId:   <null>
  empName: Levi Strauss    empName: <null>
  locId:   2               locId:   4
  locName: <null>          locName: London

Reduce-Side Join: Mapper
void map(k, v) {
  Record r = parse(v);
  emit(r.locId, r);
}

Reduce-Side Join: Shuffle and Sort

Reduce-Side Join: Reducer

Reduce-Side Join: Reducer Grouping

Scalability Problems With Our Reducer
§ Solution: Ensure the location record is the first one to arrive at the Reducer
  – Using a Secondary Sort

A Better Intermediate Key (1)
class LocKey {
  int locId;
  boolean isLocation;

A Better Intermediate Key (4)
class LocKey {
  int locId;
  boolean isLocation;

  // Order by location ID; for equal IDs, the location record sorts before employee records
  public int compareTo(LocKey k) {
    if (locId != k.locId) {
      return Integer.compare(locId, k.locId);
    } else {
      return Boolean.compare(k.isLocation, isLocation);
    }
  }
}
§ The hashCode method only looks at the location ID portion of the record. This ensures that all records with the same key will go to the same Reducer. This is an alternative to providing a custom Partitioner.
  – locId: 4, isLocation: true  ==  locId: 4, isLocation: false
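
The hashCode approach described for this key can be sketched as follows. This is an illustrative plain-Java version, not the course's actual `LocKey`: the constructor, the `equals` method, and the `partition` helper are assumptions. The `partition` helper mirrors the way Hadoop's default `HashPartitioner` maps a hash code to a Reducer index.

```java
// Illustrative composite key; the real key would also implement WritableComparable.
class LocKeyHashSketch {

    static final class LocKey implements Comparable<LocKey> {
        final int locId;
        final boolean isLocation;

        LocKey(int locId, boolean isLocation) {
            this.locId = locId;
            this.isLocation = isLocation;
        }

        // Sort by location ID, then put the location record ahead of employee records.
        @Override
        public int compareTo(LocKey k) {
            if (locId != k.locId) {
                return Integer.compare(locId, k.locId);
            }
            return Boolean.compare(k.isLocation, isLocation);
        }

        // Only the location ID contributes, so both record types hash alike.
        @Override
        public int hashCode() {
            return Integer.hashCode(locId);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof LocKey)) {
                return false;
            }
            LocKey k = (LocKey) o;
            return locId == k.locId && isLocation == k.isLocation;
        }
    }

    // Hash-based partitioning: non-negative hash modulo the number of Reducers.
    static int partition(LocKey key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(partition(new LocKey(4, true), 3));
        System.out.println(partition(new LocKey(4, false), 3));
    }
}
```

Both calls in `main` print the same partition number, because the `isLocation` flag never reaches the hash.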

A Better Mapper
void map(k, v) {
  Record r = parse(v);
  // Build the composite key: join key plus a flag marking location records
  LocKey newkey = new LocKey();
  newkey.locId = r.locId;
  if (r.type == Record.RecType.emp) {
    newkey.isLocation = false;
  } else {
    newkey.isLocation = true;
  }
  emit(newkey, r);
}

Map input                  Map output (key, value)
001 Elizabeth Windsor 4    4#false  001 Elizabeth Windsor
003 Levi Strauss 2         2#false  003 Levi Strauss
004 Francis Bacon 4        4#false  004 Francis Bacon
1 Chicago                  1#true   Chicago
2 San Francisco            2#true   San Francisco
3 Amsterdam                3#true   Amsterdam
4 London                   4#true   London

Create a Sort Comparator…
§ Create a sort comparator to ensure that the location record is the first one in the list of records passed in each Reducer call
class LocKeySortComparator

…And a Grouping Comparator…
§ Create a Grouping Comparator to ensure that all records for a given location are passed in a single call to the reduce() method
class LocKeyGroupingComparator

…And Configure Hadoop To Use It In The Driver
job.setSortComparatorClass(LocKeySortComparator.class);
job.setGroupingComparatorClass(LocKeyGroupingComparator.class);

A Better Reducer
Record thisLoc;
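
The body of this Reducer did not survive in this copy, so the following is only a sketch of how the pieces might fit together, simulated in plain Java without Hadoop. The `Rec` class, the `join` method, and the tab-separated output are assumptions. The one element taken from the slide is the `thisLoc` variable, which remembers the location record that the secondary sort delivers first.

```java
import java.util.*;

// Plain-Java simulation of the reduce-side join; not the course's Reducer.
class ReduceSideJoinSketch {

    // Simplified common record type: text holds empName or locName.
    static final class Rec {
        final boolean isLocation;
        final int locId;
        final String text;

        Rec(boolean isLocation, int locId, String text) {
            this.isLocation = isLocation;
            this.locId = locId;
            this.text = text;
        }
    }

    static List<String> join(List<Rec> input) {
        List<Rec> sorted = new ArrayList<>(input);
        // Shuffle and sort: by locId, with the location record first within each locId.
        sorted.sort((a, b) -> a.locId != b.locId
                ? Integer.compare(a.locId, b.locId)
                : Boolean.compare(b.isLocation, a.isLocation));
        List<String> output = new ArrayList<>();
        Rec thisLoc = null;
        for (Rec r : sorted) {
            if (r.isLocation) {
                thisLoc = r; // first record of each group: remember the location
            } else if (thisLoc != null && thisLoc.locId == r.locId) {
                // Employee record: join with the remembered location, no buffering needed.
                output.add(r.text + "\t" + thisLoc.text);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(
                new Rec(false, 4, "Elizabeth Windsor"), new Rec(false, 2, "Levi Strauss"),
                new Rec(false, 4, "Francis Bacon"), new Rec(true, 1, "Chicago"),
                new Rec(true, 2, "San Francisco"), new Rec(true, 3, "Amsterdam"),
                new Rec(true, 4, "London"));
        join(input).forEach(System.out::println);
    }
}
```

Because the location record arrives first, the Reducer holds only one record in memory at a time, which is the scalability benefit the secondary sort is meant to provide. Employee records with no matching location are silently dropped in this sketch, giving inner-join behaviour.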

A Better Reducer: Output with Correct Sorting and Grouping

Key Points
§ Joins are usually best done using Impala, Hive, or Pig
§ Map-side joins are simple but don't scale well
§ Use reduce-side joins when both datasets are large
  – Mapper:
    – Merges both data sets into a common record type
    – Use a composite key (custom WritableComparable) with join key/record type
  – Shuffle and sort:
    – Secondary sort so that 'primary' records are processed first
    – Custom Partitioner to ensure records are sent to the correct Reducer (or hack the hashCode of the composite key)
  – Reducer:
    – Group by join key (custom grouping comparator)
    – Write out 'secondary' records joined with 'primary' record data