Contents
1.
MapReduce
MapReduce Basics
Some simple examples
Executing MapReduce for large data
Hadoop
2.
What is MapReduce?
MapReduce
most broadly, MapReduce refers to the following pattern of computation
MapReduce in natural language:
input:
    a set of records X;
    a user-defined function f, called a mapper;
    a user-defined function r, called a reducer;
map phase:
    apply f to each record in X, which produces a list of ⟨key, value⟩ pairs;
reduce phase:
    for each distinct key produced, collect all the values paired with that key and apply r to the key and values;
[Figure: inputs -> map phase -> intermediate key,value pairs <k,v> -> collect values of the same key, <k1, [v,v,...]>, <k2, [v,v,...]>, <k3, [v,v,...]> -> reduce phase -> outputs]
def map_reduce(X, f, r):
    # hashtable:
    # key -> values
    KV = {}
    for x in X:
        for k, v in f(x):
            add(KV, k, v)
    for k in KV:
        r(k, KV[k])

def add(KV, k, v):
    if k not in KV:
        KV[k] = []
    # add v to list
    KV[k].append(v)
Dataflow in MapReduce
[Figure: inputs (map) -> map phase -> intermediate key,value pairs <k,v> -> collect values of the same key, <k1, [v,v,...]>, <k2, [v,v,...]>, <k3, [v,v,...]>, in the key space (reduce) -> reduce phase -> outputs]
1. word count:
find a histogram of word occurrences in text
2. inverted index:
find a mapping from each word to the positions where it occurs in text
3. integration:
compute a numerical integral
Word count
Problem description:
given a number of documents, output the histogram of word
occurrence
e.g.
input: I am chukky! kill you. wait wait. ...
output:
    .       3
    !       1
    am      1
    chukky  1
    heaven  2
    ...
Serial pseudo code:
    for each document:
        for each word in document:
            WC[word] = WC[word] + 1
    return WC
MapReduce:
    a record: a single document
    f(document) = [ ⟨word, 1⟩ for each word in document ]
    r(word, V) = |V|
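The word count formulation above can be tried out with a small in-memory version of map_reduce. This is only a sketch: it returns the reducer results in a dict (the slide version discards them), and the tokenizer (each punctuation mark counted as its own word) is an assumption chosen to match the example histogram.

```python
import re
from collections import defaultdict

def map_reduce(X, f, r):
    """Apply mapper f to every record, group values by key, apply reducer r per key."""
    KV = defaultdict(list)
    for x in X:
        for k, v in f(x):
            KV[k].append(v)
    # Assumption: collect the reducer results so the caller can inspect them.
    return {k: r(k, vs) for k, vs in KV.items()}

def tokenize(text):
    # Assumption: a word is a run of letters/digits; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def wc_mapper(document):
    # f(document) = [ <word, 1> for each word in document ]
    return [(word, 1) for word in tokenize(document)]

def wc_reducer(word, V):
    # r(word, V) = |V|
    return len(V)

documents = ["I am chukky!", "kill you.", "wait wait.", "heaven the heaven."]
histogram = map_reduce(documents, wc_mapper, wc_reducer)
print(histogram)
```

Each document is one record, so the mapper never needs to see more than one document at a time.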
Inverted index
Problem description:
given a number of documents, output for each word the list
of documents it occurred in.
e.g.
input:
    doc id  body
    0       I am chukky!
    1       kill you.
    2       wait wait.
    3       heaven the heaven.
output:
    .       1,2,3
    !       0
    am      0
    chukky  0
    heaven  3,3
    I       0
    kill    1
    the     3
    wait    2,2
    you     1
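The slides do not spell out the mapper and reducer for the inverted index, so the following is one plausible formulation rather than the original one: a record is a (doc id, body) pair, the mapper emits ⟨word, doc id⟩ for each word, and the reducer returns the collected doc ids. The helper names and the tokenizer are assumptions.

```python
import re
from collections import defaultdict

def map_reduce(X, f, r):
    """Apply mapper f to every record, group values by key, apply reducer r per key."""
    KV = defaultdict(list)
    for x in X:
        for k, v in f(x):
            KV[k].append(v)
    return {k: r(k, vs) for k, vs in KV.items()}

def ii_mapper(record):
    # Assumed mapper: emit <word, doc_id> for every token of the body.
    doc_id, body = record
    return [(word, doc_id) for word in re.findall(r"\w+|[^\w\s]", body)]

def ii_reducer(word, V):
    # Assumed reducer: the sorted list of doc ids, duplicates kept as in the example.
    return sorted(V)

docs = [(0, "I am chukky!"), (1, "kill you."), (2, "wait wait."), (3, "heaven the heaven.")]
index = map_reduce(docs, ii_mapper, ii_reducer)
for word, doc_ids in sorted(index.items()):
    print(word, doc_ids)
```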
Integration
Problem description:
Find an approximate value of the integral
$$\int_a^b g(x)\,dx \approx \sum_{i=0}^{N-1} g(x_i)\,\Delta x,$$
where $\Delta x = (b - a)/N$ and $x_i = a + i\,\Delta x$.
Integration in MapReduce
Serial pseudo code:
    Δx = (b - a) / N
    s = 0
    for i in 0..N-1:
        s += g(a + i Δx) Δx
    return s
MapReduce:
    a record: index i
    f(i) = [ ⟨0, g(a + i Δx) Δx⟩ ]
    r(k, V) = $\sum_{y \in V} y$
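A minimal sketch of the integration example, assuming the same in-memory map_reduce as a stand-in for a real framework. Every record i maps to the single key 0, so one reducer call sums all the terms. The integrand and bounds in the usage line are arbitrary choices for illustration.

```python
from collections import defaultdict

def map_reduce(X, f, r):
    """Apply mapper f to every record, group values by key, apply reducer r per key."""
    KV = defaultdict(list)
    for x in X:
        for k, v in f(x):
            KV[k].append(v)
    return {k: r(k, vs) for k, vs in KV.items()}

def integrate(g, a, b, N):
    dx = (b - a) / N

    def f(i):
        # f(i) = [ <0, g(a + i*dx) * dx> ]
        return [(0, g(a + i * dx) * dx)]

    def r(k, V):
        # r(k, V) = sum of all y in V
        return sum(V)

    return map_reduce(range(N), f, r)[0]

# Illustrative input: the integral of x^2 over [0, 1], whose exact value is 1/3.
approx = integrate(lambda x: x * x, 0.0, 1.0, 1000)
print(approx)
```

Since all values share one key, the reduce phase has no parallelism here; only the map phase can be spread over workers.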
parallel execution
processing large (out-of-core) data
fault tolerant execution
Parallel execution
[Figure: inputs (map) -> f applied to each input x in the map phase -> intermediate key,value pairs <k,v> -> collect values of the same key, <k1, [v,v,...]>, <k2, [v,v,...]>, <k3, [v,v,...]> -> reduce phase -> outputs]
def map_reduce(X, f, r):
    KV = {}
    for x in X:                # (1)
        for k, v in f(x):
            add(KV, k, v)
    for k in KV:               # (2)
        r(k, KV[k])            # (3)
def map_reduce(X, f, r):
    forall i in 0..M-1:  # M map tasks
        for j in 0..R-1: KV_{i,j} = {}
        for x in i-th chunk of X:
            for k, v in f(x):
                j = partition(k)
                add(KV_{i,j}, k, v)
    forall j in 0..R-1:  # R reduce tasks
        # all-all communication
        KV_j = ∪_i KV_{i,j}  # key-wise concat
        for k in KV_j:
            r(k, KV_j[k])
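The partitioned pseudo code above can be simulated in a single process. The sketch below runs the M map tasks and R reduce tasks one after another instead of in parallel; the hash-based partition function, the round-robin chunking, and the default values of M and R are assumptions made for illustration.

```python
from collections import defaultdict

def partition(k, R):
    # Assumption: hash partitioning (hash() is stable within one Python process).
    return hash(k) % R

def map_task(chunk, f, R):
    """One map task: build R tables, table j holding the keys owned by reduce task j."""
    KV_i = [defaultdict(list) for _ in range(R)]
    for x in chunk:
        for k, v in f(x):
            KV_i[partition(k, R)][k].append(v)
    return KV_i

def reduce_task(j, map_outputs, r):
    """One reduce task: key-wise concatenation of KV_{i,j} over all i, then apply r."""
    KV_j = defaultdict(list)
    for KV_i in map_outputs:
        for k, vs in KV_i[j].items():
            KV_j[k].extend(vs)
    return {k: r(k, vs) for k, vs in KV_j.items()}

def map_reduce(X, f, r, M=3, R=2):
    chunks = [X[i::M] for i in range(M)]  # the i-th chunk of X
    # Both comprehensions/loops below correspond to the "forall" loops;
    # here they run sequentially.
    map_outputs = [map_task(chunk, f, R) for chunk in chunks]
    results = {}
    for j in range(R):
        results.update(reduce_task(j, map_outputs, r))
    return results

docs = ["I am chukky", "kill you", "wait wait", "heaven the heaven"]
counts = map_reduce(docs, lambda d: [(w, 1) for w in d.split()], lambda k, V: len(V))
print(counts)
```

Because every key is sent to exactly one partition j, the reduce tasks never share a key and their results can simply be merged.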
Hadoop
$ ls confs/local
$ export HADOOP_CONF_DIR=`pwd`/confs/local
$ cat input/chukky
I am chukky
kill you
wait wait
heaven the heaven
$ hadoop jar \
    hadoop-0.20.2-cdh3u5/hadoop-examples-0.20.2-cdh3u5.jar \
    wordcount input/chukky out
12/11/19 15:03:00 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
...
http://wiki.apache.org/hadoop/WordCount
all right-minded people wish for alternatives: streaming
hadoop jar hadoop-streaming-0.20.2-cdh3u5.jar \
    HadoopStreaming -mapper ./map.py -reducer ./red.py \
    -input input/chukky -output out
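The contents of map.py and red.py are not shown here, so the following is a hypothetical sketch of what a word count pair could look like. It relies on the Hadoop Streaming convention that a mapper writes one "key<TAB>value" line per pair and that the reducer receives its input lines sorted by key. Both functions are kept in one block so they can be dry-run locally; in real scripts each would read sys.stdin and print its output lines.

```python
from itertools import groupby

def mapper(lines):
    """Body of a hypothetical map.py: emit 'word<TAB>1' for every word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Body of a hypothetical red.py: sum the counts of each run of equal keys."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__":
    # Local dry run: sorted() stands in for the shuffle between the two phases.
    # In a real streaming job, map.py would do
    #     for out in mapper(sys.stdin): print(out)
    # and red.py the same with reducer.
    text = ["I am chukky", "kill you", "wait wait", "heaven the heaven"]
    for out in reducer(sorted(mapper(text))):
        print(out)
```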
create table words (doc_id, word);
word count:
select word, count(*) from words group by word;
inverted index:
select word, group_concat(doc_id) from words group by word;
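The two queries can be checked quickly with Python's built-in sqlite3 module, used here only as a convenient stand-in for whatever SQL engine is intended. The sample rows are an assumption based on the chukky example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table words (doc_id, word)")

# Assumed sample data: one row per (document id, word) occurrence.
docs = {0: "I am chukky", 1: "kill you", 2: "wait wait", 3: "heaven the heaven"}
rows = [(doc_id, w) for doc_id, body in docs.items() for w in body.split()]
conn.executemany("insert into words values (?, ?)", rows)

# word count
word_count = dict(conn.execute(
    "select word, count(*) from words group by word"))

# inverted index: comma-separated doc ids per word
inverted_index = dict(conn.execute(
    "select word, group_concat(doc_id) from words group by word"))

print(word_count)
print(inverted_index)
```

The group by clause plays the role of "collect values of the same key", and the aggregate plays the role of the reducer.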
create table interval (i);
select sum(g(a + i * dx) * dx) from interval;
computes an approximation of $\int_a^b g(x)\,dx$
MapReduce-equivalent in SQL
assume the following table
create table X (a, b, c, ...);
select k(a, b, c, ...) as key, r(v(a, b, c, ...))
from X group by key
where
    k is a user-defined function that returns the key component of f(x)
    v is a user-defined function that returns the value component of f(x)
    r is a user-defined aggregate that returns r(V) for V
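As a sketch of this pattern, Python's sqlite3 module allows registering user-defined functions and aggregates, which is enough to play the roles of k, v and r. The table contents, the two-column shape of X, and the column alias mr_key (used in place of key) are assumptions for illustration; the example is a case-insensitive word count.

```python
import sqlite3

class CountValues:
    """User-defined aggregate r: returns |V| for the values V of one key."""
    def __init__(self):
        self.n = 0

    def step(self, value):
        self.n += 1

    def finalize(self):
        return self.n

conn = sqlite3.connect(":memory:")
# k and v return the key and value components of f(x) for a row x = (doc_id, word).
conn.create_function("k", 2, lambda doc_id, word: word.lower())
conn.create_function("v", 2, lambda doc_id, word: 1)
conn.create_aggregate("r", 1, CountValues)

conn.execute("create table X (doc_id, word)")
conn.executemany(
    "insert into X values (?, ?)",
    [(0, "Wait"), (0, "wait"), (1, "heaven"), (1, "Heaven"), (1, "the")],
)

result = dict(conn.execute(
    "select k(doc_id, word) as mr_key, r(v(doc_id, word)) from X group by mr_key"))
print(result)
```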
Hive
$ HIVE_HOME=... HADOOP_HOME=... \
    HADOOP_CONF_DIR=empty-dir hive \
    --hiveconf hive.metastore.warehouse.dir=dir \
    -f program.sql
CREATE TABLE table (column-name type, ...)
LOAD DATA [LOCAL] INPATH file [OVERWRITE] INTO TABLE table
LOCAL says the file is in your regular file system (not HDFS)
the default column delimiter is ^A (Ctrl-A, \001) and the row delimiter is \n, so a typical text file becomes a table having a single column and as many rows as lines
Example
docs.txt:
0^AI have a pen. It is a nice weather. See you.
1^AI am Ken. It is a bad weather. Good bye.
hive> create table docs (doc_id int, body string);
hive> load data local inpath "docs.txt" overwrite into table docs;
hive> select * from docs;
OK
0       I have a pen. It is a nice weather. See you.
1       I am Ken. It is a bad weather. Good bye.
Time taken: 0.186 seconds
select sum(x * y) from vector;
select count(*) from lines;
nested query:
SELECT ... FROM (query) table;
FROM (query) table SELECT ...;
Transform function
SELECT TRANSFORM (column, ...) USING command
AS column-alias, ...;
where ws.py is
import sys
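The rest of ws.py is not reproduced here. Purely as an illustration, a transform script that splits a body column into words might look like the sketch below. It assumes the usual TRANSFORM convention that each input row arrives as one line of tab-separated columns and that each output line is split on tabs into the aliased columns. The function name, the two-column input, and the sample row are assumptions, not the original script.

```python
def split_words(lines):
    """For each input row 'doc_id<TAB>body', emit one 'doc_id<TAB>word' row per word."""
    for line in lines:
        doc_id, body = line.rstrip("\n").split("\t", 1)
        for word in body.split():
            yield f"{doc_id}\t{word}"

# In an actual transform script the input would be sys.stdin:
#     for row in split_words(sys.stdin): print(row)
# Here a sample row is fed in directly.
sample = ["0\tI have a pen .\n"]
for row in split_words(sample):
    print(row)
```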
select explode(split(body, " ")) from ...
NG (explode cannot appear alongside other columns in the select list):
select line number, explode(split(body, " ")) from ...
Lateral view
SELECT ...
FROM table LATERAL VIEW table-generating-expression table-alias as column-alias
SELECT * FROM foo LATERAL VIEW explode(split(V, " ")) bar as X;
foo:
    I  J  V
    1  2  a b c
    3  4  d e
    5  6  f
result:
    I  J  V      X
    1  2  a b c  a
    1  2  a b c  b
    1  2  a b c  c
    3  4  d e    d
    3  4  d e    e
    5  6  f      f
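What the lateral view computes can be mimicked in a few lines of Python: every row of foo is paired with each element produced by exploding its V column. This is an emulation for intuition only, not how Hive executes the query; the function name is hypothetical and the V values are assumed to be space-separated strings.

```python
def lateral_view_explode(rows, column, alias, sep=" "):
    """Pair every row with each piece of row[column].split(sep), stored under alias."""
    return [
        {**row, alias: piece}
        for row in rows
        for piece in row[column].split(sep)
    ]

foo = [
    {"I": 1, "J": 2, "V": "a b c"},
    {"I": 3, "J": 4, "V": "d e"},
    {"I": 5, "J": 6, "V": "f"},
]
result = lateral_view_explode(foo, "V", "X")
for row in result:
    print(row)
```

Unlike a bare explode in the select list, the original columns I, J and V stay available next to each exploded value X.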
s = 0
for (i = 0; i < n; i++)
    s += x[i] * y[i]

m = ∞
for (i = 0; i < n; i++)
    for (j = i + 1; j < n; j++)
        m = min(m, fabs(x[i] - x[j]))
General trick
create table X (i, x);
create table Y (i, y);
X natural join Y is the following table
    i    x     y
    i0   x_i0  y_i0
    i1   x_i1  y_i1
    i2   x_i2  y_i2
    ...  ...   ...
the point is to bring items accessed in the same iteration into a single row
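As a sketch of the trick, the inner product loop can be run through Python's sqlite3 module: the natural join lines up x[i] and y[i] in one row, and sum replaces the loop. The vector values are arbitrary sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table X (i, x)")
conn.execute("create table Y (i, y)")

# Assumed sample vectors x = (1, 2, 3) and y = (4, 5, 6), indexed by i.
conn.executemany("insert into X values (?, ?)", [(0, 1.0), (1, 2.0), (2, 3.0)])
conn.executemany("insert into Y values (?, ?)", [(0, 4.0), (1, 5.0), (2, 6.0)])

# The natural join matches rows on the shared column i.
(s,) = conn.execute("select sum(x * y) from X natural join Y").fetchone()
print(s)
```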
assume
create table X (i, v);
select min(abs(A.v - B.v))
from X A join X B on A.i < B.i;
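The doubly nested loop for the minimum distance maps onto a self-join in the same way. The sketch below runs it with Python's sqlite3 module on arbitrary sample values; the condition A.i < B.i corresponds to the inner loop starting at j = i + 1, so each unordered pair appears once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table X (i, v)")

# Assumed sample data: four values indexed by i.
conn.executemany("insert into X values (?, ?)",
                 [(0, 1.0), (1, 7.5), (2, 4.0), (3, 9.0)])

# Self-join: every pair (A, B) with A.i < B.i shares one row.
(m,) = conn.execute(
    "select min(abs(A.v - B.v)) from X A join X B on A.i < B.i").fetchone()
print(m)
```

The join produces on the order of n^2 rows, just as the loop performs on the order of n^2 iterations.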