
Programming in Hadoop and Hive


Kenjiro Taura
University of Tokyo

1 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

2 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

3 / 59

What is MapReduce?

originally proposed by Google
the goal is to process large data easily in parallel, and in a fault-tolerant manner
it encompasses a commonly occurring pattern of parallel processing

4 / 59

MapReduce
most broadly, MapReduce refers to the following pattern of computation

MapReduce in natural language:
  input:
    a set of records X;
    a user-defined function f, called a mapper;
    a user-defined function r, called a reducer;
  map phase:
    apply f to each record in X, which produces a list of ⟨key, value⟩ pairs;
  reduce phase:
    for each distinct key produced, collect all the values paired with that key and apply r to the key and the values;


5 / 59

MapReduce expressed in graphs


[Figure: inputs → map phase applies f to produce intermediate ⟨k,v⟩ pairs → values of the same key are collected into ⟨k, [v,v,...]⟩ groups → reduce phase produces the outputs]

6 / 59

Serial MapReduce expressed in Python


Working description of MapReduce in Python:

    def map_reduce(X, f, r):
        # hashtable: key -> values
        KV = {}
        for x in X:
            for k, v in f(x):
                add(KV, k, v)
        for k in KV:
            r(k, KV[k])

an auxiliary function add is:

    def add(KV, k, v):
        if k not in KV:
            KV[k] = []
        # add v to the list of values for key k
        KV[k].append(v)

[Figure: the dataflow of the previous slide, annotated with where f runs (map phase) and where values of the same key are collected (reduce phase)]

7 / 59

Dataflow in MapReduce
[Figure: the same dataflow viewed along two axes; the inputs axis is partitioned in the map phase and the key space is partitioned in the reduce phase]

8 / 59

Why MapReduce is important?

1. abundant problems are expressible in MapReduce (or combinations thereof)
   though it was originally proposed in the context of processing text/web data, it is not limited to text/web data nor even to strings
2. it can be efficiently parallelized
3. it can be efficient even when data are large (i.e., do not fit in memory)
4. it can be made fault tolerant efficiently

9 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

10 / 59

Some problems expressible in MapReduce

1. word count: find a histogram of word occurrences in text
2. inverted index: find a mapping from each word to the positions where it occurs in text
3. integration: compute a numeric integration

less trivial examples later

11 / 59

Word count
Problem description:
  given a number of documents, output the histogram of word occurrences

e.g., for the input "I am chukky! kill you. wait wait. heaven the heaven."

  word    count
  .       3
  !       1
  am      1
  chukky  1
  heaven  2
  I       1
  kill    1
  the     1
  wait    2
  you     1
12 / 59

Word count in MapReduce

Serial pseudo code:


    for each document:
        for each word in document:
            WC[word] = WC[word] + 1
    return WC


MapReduce formulation:
  a record: a single document
  f(document) = [ ⟨word, 1⟩ for each word in document ]
  r(word, V) = |V|
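As a concrete sketch, here are the mapper and reducer plugged into the serial map_reduce function defined earlier (the toy documents are made up for illustration):

    def wc_mapper(document):
        # emit <word, 1> for every word occurrence
        return [(word, 1) for word in document.split()]

    def wc_reducer(word, values):
        # r(word, V) = |V|: the count is the number of collected 1s
        print(word, len(values))

    docs = ["I am chukky! kill you.", "wait wait.", "heaven the heaven."]
    map_reduce(docs, wc_mapper, wc_reducer)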

13 / 59

Inverted index
Problem description:
given a number of documents, output for each word the list
of documents it occurred in.
e.g., for the documents

  doc id  body
  0       I am chukky!
  1       kill you.
  2       wait wait.
  3       heaven the heaven.

the inverted index is

  word    doc ids
  .       1,2,3
  !       0
  am      0
  chukky  0
  heaven  3,3
  I       0
  kill    1
  the     3
  wait    2,2
  you     1
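A sketch in the same style as word count, again assuming the serial map_reduce function from earlier (the mapper emits a ⟨word, doc id⟩ pair per word occurrence; the reducer's value list is the document list):

    def ii_mapper(record):
        doc_id, body = record
        return [(word, doc_id) for word in body.split()]

    def ii_reducer(word, doc_ids):
        print(word, doc_ids)

    docs = [(0, "I am chukky!"), (1, "kill you."),
            (2, "wait wait."), (3, "heaven the heaven.")]
    map_reduce(docs, ii_mapper, ii_reducer)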
14 / 59

Integration
Problem description:
  Find an approximate value of the integral

    ∫_a^b g(x) dx

  by the following summation:

    Σ_{i=0}^{N−1} g(x_i) Δx,

  where Δx = (b − a)/N and x_i = a + i Δx.
15 / 59

Integration in MapReduce
Serial pseudo code:


    dx = (b - a) / N
    s = 0
    for i in 0..N-1:
        s += g(a + i * dx) * dx
    return s


MapReduce formulation:
  a record: index i
  f(i) = [ ⟨0, g(a + i Δx) Δx⟩ ]
  r(k, V) = Σ_{y∈V} y
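A sketch with the serial map_reduce function from earlier; the integrand, bounds, and N are made up for illustration (sin on [0, π], whose integral is 2):

    import math

    a, b, N = 0.0, math.pi, 100000
    dx = (b - a) / N

    def int_mapper(i):
        # every record maps to the single key 0
        return [(0, math.sin(a + i * dx) * dx)]

    def int_reducer(key, values):
        print(sum(values))   # prints roughly 2.0

    map_reduce(range(N), int_mapper, int_reducer)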

16 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

17 / 59

MapReduce for large data

parallel execution
processing large (out-of-core) data
fault tolerant execution

18 / 59

Parallel execution
[Figure: the map/reduce dataflow, with f applied to each input and the parallelizable steps (1)-(3) marked in the code below]

    def map_reduce(X, f, r):
        KV = {}
        for x in X:             # (1)
            for k, v in f(x):
                add(KV, k, v)
        for k in KV:            # (2)
            r(k, KV[k])         # (3)

parallelism is abundant in MapReduce:

1. applying f to all records can be done in parallel, provided accesses (add) to the same key are properly arbitrated (see the sketch below)
2. applying r to all distinct keys can be done in parallel
3. plus, not apparent from the code above, r for a single key is almost always a reduction, which can be parallelized too
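A minimal sketch of point 1, assuming Python threads and a single lock to arbitrate concurrent access to the shared table (a real implementation partitions the key space instead, as the next slide makes explicit):

    from concurrent.futures import ThreadPoolExecutor
    import threading

    def map_reduce_parallel(X, f, r):
        KV = {}
        lock = threading.Lock()
        def map_one(x):
            for k, v in f(x):
                with lock:                      # arbitrate adds to the shared table
                    KV.setdefault(k, []).append(v)
        with ThreadPoolExecutor() as pool:
            list(pool.map(map_one, X))          # (1) map records in parallel
        for k in KV:                            # (2)(3) could be parallelized too
            r(k, KV[k])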
19 / 59

Parallel/distributed execution made explicit

the map phase partitions the inputs; the reduce phase partitions the key space; this induces all-to-all communication between the two

[Figure: inputs (map) partitioned one way, key space (reduce) partitioned the other; every map task sends data to every reduce task]



    def map_reduce(X, f, r):
        forall i in 0..M-1:                 # M map tasks
            for j in 0..R-1: KV[i,j] = {}
            for x in i-th chunk of X:
                for k, v in f(x):
                    j = partition(k)
                    add(KV[i,j], k, v)
        forall j in 0..R-1:                 # R reduce tasks
            # all-to-all communication
            KV[j] = ∪_i KV[i,j]             # key-wise concat
            for k in KV[j]:
                r(k, KV[j][k])
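A runnable serial rendering of the same algorithm (hash(k) % R stands in for partition, and the forall loops are written as ordinary loops, though conceptually each iteration is an independent task):

    def map_reduce_partitioned(X, f, r, M=4, R=4):
        # map phase: map task i fills one table per reduce task j
        KV = [[{} for j in range(R)] for i in range(M)]
        for i in range(M):                      # conceptually a forall
            for x in X[i::M]:                   # i-th chunk of X
                for k, v in f(x):
                    j = hash(k) % R             # partition(k)
                    KV[i][j].setdefault(k, []).append(v)
        # reduce phase: task j gathers KV[i][j] from every i (all-to-all)
        for j in range(R):                      # conceptually a forall
            KVj = {}
            for i in range(M):
                for k, vs in KV[i][j].items():
                    KVj.setdefault(k, []).extend(vs)
            for k in KVj:
                r(k, KVj[k])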

20 / 59

Processing large data



    def map_reduce(X, f, r):
        forall i in 0..M-1:                 # M map tasks
            for j in 0..R-1: KV[i,j] = {}
            for x in i-th chunk of X:
                for k, v in f(x):
                    j = partition(k)
                    add(KV[i,j], k, v)
        forall j in 0..R-1:                 # R reduce tasks
            # all-to-all communication
            KV[j] = ∪_i KV[i,j]             # key-wise concat
            for k in KV[j]:
                r(k, KV[j][k])


we would like to run this algorithm when the inputs X or the intermediate data KV do not fit in memory
21 / 59

Processing large data

1. X does not fit in memory:
   simply scan records from disk (not a big deal)
2. KV does not fit in memory; either
   1. sort all key-value pairs (out-of-core sort), or
   2. use a data structure supporting index search (B-tree, hash, etc.) on disk

22 / 59

Processing large data



    def map_reduce(X, f, r):
        forall i in 0..M-1:                 # M map tasks
            for j in 0..R-1: KV[i,j] = output stream
            for x in i-th chunk of X:
                for k, v in f(x):
                    j = partition(k)
                    KV[i,j].write(⟨k, v⟩)
        forall j in 0..R-1:                 # R reduce tasks
            # all-to-all communication
            KV[j] = sort(∪_i KV[i,j])       # list concat + sort
            # this is a sequential scan
            for k in unique keys in KV[j]:
                V = values paired with k
                r(k, V)
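The reduce-phase scan, sketched in Python with in-memory lists standing in for the on-disk streams (itertools.groupby performs exactly this one-pass grouping over sorted pairs):

    import itertools

    def reduce_task(streams, r):
        # list concat + sort (out-of-core in a real implementation)
        kv = sorted(sum(streams, []))
        # sequential scan: each run of equal keys forms one group
        for k, group in itertools.groupby(kv, key=lambda p: p[0]):
            V = [v for _, v in group]
            r(k, V)

    reduce_task([[("b", 2), ("a", 1)], [("a", 3)]],
                lambda k, V: print(k, V))   # a [1, 3] / b [2]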


23 / 59

Fault tolerant execution


the KV[i,j] serve as intermediate checkpoints: when a map/reduce task fails, it suffices to redo just that task

[Figure: the map/reduce dataflow; only the failed task's portion needs to be recomputed]

24 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

25 / 59

Hadoop

Hadoop is an open source implementation of MapReduce
  http://hadoop.apache.org/ and http://www.cloudera.com/
the Hadoop system takes care of parallelization on distributed memory machines, input/output, and fault tolerance
its native API is Java; you write mappers and reducers as Java methods
it also supports Hadoop streaming; you can use arbitrary programs that read standard input and write standard output

26 / 59

Hadoop operations and HDFS

HDFS: a file system for Hadoop, which distributes files across clusters
three modes of operation:
  local: work locally on regular files, without daemons
  pseudo-distributed: work on HDFS and run daemons, but still locally
  distributed: use clusters

27 / 59

Running Hadoop programs in local mode


point the config directory (HADOOP_CONF_DIR) to an empty directory
run a class in a jar file
e.g.

    $ ls confs/local
    $ export HADOOP_CONF_DIR=$PWD/confs/local
    $ cat input/chukky
    I am chukky
    kill you
    wait wait
    heaven the heaven
    $ hadoop jar hadoop-0.20.2-cdh3u5/hadoop-examples-0.20.2-cdh3u5.jar \
        wordcount input/chukky out
    12/11/19 15:03:00 INFO jvm.JvmMetrics: Initializing JVM Metrics with
        processName=JobTracker, sessionId=
    ...


28 / 59

Running Hadoop programs in local mode



    $ ls out
    _SUCCESS  part-r-00000
    $ cat out/part-r-00000
    I       1
    am      1
    chukky  1
    heaven  2
    kill    1
    the     1
    wait    2
    you     1


29 / 59

Inside wordcount class?

http://wiki.apache.org/hadoop/WordCount
it is written in Java; those who would rather use another language have an alternative: Hadoop streaming (next)

30 / 59

Writing Hadoop streaming


you write two commands in any language, one for the mapper, the other for the reducer
run the HadoopStreaming class in the jar file contrib/streaming/hadoop-streaming-xxx.jar

    hadoop jar hadoop-streaming-0.20.2-cdh3u5.jar HadoopStreaming \
        -mapper ./map.py -reducer ./red.py \
        -input input/chukky -output out



mapper (./map.py): read data from stdin; print whitespace-separated key-value pairs, one per line
reducer (./red.py): read key-value pairs from stdin, grouped (sorted) by key; print the final results to stdout
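A minimal sketch of what the two commands might look like for word count (hypothetical contents for the map.py and red.py named above; Hadoop streaming sorts the mapper output by key before feeding the reducer):

    # map.py
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    # red.py
    import sys
    counts = {}
    for line in sys.stdin:
        word, n = line.split("\t")
        counts[word] = counts.get(word, 0) + int(n)
    for word, n in counts.items():
        print("%s\t%d" % (word, n))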

31 / 59

(Pseudo) distributed mode


work on files in HDFS, the Hadoop-tailored distributed file system
for which you need to get daemons up and running:
  namenode: master of HDFS, on one node
  datanode: data servers of HDFS, on each node listed as slaves
  jobtracker: master of MapReduce execution, on one node
  tasktracker: oversees MapReduce execution, on each node listed as slaves
and for which you need to write several config files
I will cover the details when more appropriate

32 / 59

MapReduce-like computation in other languages

MapReduce is a very common pattern in parallel processing
no surprise that other languages more or less support a similar pattern already:
  parallel languages have parallel loops and reductions
  SQL can express many MapReduce computations naturally
  Hive is specifically designed to support an SQL-like syntax and execute it on Hadoop
SQL is particularly noteworthy, as it supports out-of-core data and is more flexible/declarative than MapReduce

33 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

34 / 59

Reduction in parallel languages

OpenMP: parallel, for, reduction (of scalars)
TBB: parallel_reduce
Chapel: reduction expressions (25.4.1)

35 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

36 / 59

SQL and MapReduce

SQL can express many MapReduce computations naturally
basic idea:
  map phase → select + (user-defined) functions
  reduce phase → group by + (user-defined) aggregates

37 / 59

MapReduce examples in SQL


assume the following table

    create table words (doc_id, word);

containing a word per record

word count:

    select word, count(*) from words group by word;

inverted index:

    select word, group_concat(doc_id) from words group by word;


38 / 59

MapReduce examples in SQL


assume the following table

    create table interval (i);

containing the sequence numbers 0, 1, 2, ..., N − 1 (how would you fill it?)

integration (for pedagogical purposes only):

    select sum(g(a + i * dx) * dx) from interval;

computes ∫_a^b g(x) dx, where g is a user-defined function that returns g(x) for x

39 / 59

MapReduce-equivalent in SQL
assume the following table

    create table X (a, b, c, ...);

MapReduce with mapper f and reducer r:

    select k(a,b,c,...) as key, r(v(a,b,c,...))
    from X group by key;

where
  k is a user-defined function that returns the key component of f(x)
  v is a user-defined function that returns the value component of f(x)
  r is a user-defined aggregate that returns r(V) for the values V of each key

a parallel database should be able to execute this query in parallel for large data
40 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

41 / 59

Hive

an SQL-like language running on top of Hadoop
  https://cwiki.apache.org/Hive/home.html
important departures from SQL:
  table-generating functions
  lateral view
I will explain the workflow of typical file processing tasks
see https://cwiki.apache.org/Hive/languagemanual.html for the language reference

42 / 59

Running Hive programs in local mode


trick: set hive.metastore.warehouse.dir to the absolute path of an existing local directory
(it is likely to fail with the default setting, /usr/hive/warehouse)
e.g.


    $ HIVE_HOME=... HADOOP_HOME=... \
        HADOOP_CONF_DIR=empty-dir hive \
        --hiveconf hive.metastore.warehouse.dir=dir \
        -f program.sql


no arguments: open an interactive session
-f file: run the hive statements in file
-i file: run the hive statements in file and then open an interactive session
43 / 59

Hive: a typical workflow

we outline what you need to know to process data in regular files:

1. create a table
2. load data from a file into the table
3. process the data by queries, using
   table-generating functions
   user-defined operations via external programs (TRANSFORM)
   lateral view

44 / 59

Create table and load data into table


creating a table is almost the usual SQL:

    CREATE TABLE table (column-name type, ...)

loading data from a file into a table:

    LOAD DATA [LOCAL] INPATH file [OVERWRITE] INTO TABLE table

LOCAL says the file is a file in your regular file system (not HDFS)
the default column delimiter is ^A (Ctrl-A) and the row delimiter is \n, so typical files become a table having a single column and as many rows as lines
45 / 59

Example
docs.txt:

    0^AI have a pen. It is a nice weather. See you.
    1^AI am Ken. It is a bad weather. Good bye.

    hive> create table docs (doc_id int, body string);
    hive> load data local inpath "docs.txt" overwrite into table docs;
    hive> select * from docs;
    OK
    0    I have a pen. It is a nice weather. See you.
    1    I am Ken. It is a bad weather. Good bye.
    Time taken: 0.186 seconds



46 / 59

Simple and nested queries

simple query: almost the usual SQL; e.g.

    select sum(x * y) from vector;
    select count(*) from lines;

nested query:

    SELECT ... FROM (query) table;
    FROM (query) table SELECT ...;





47 / 59

Table-generating functions: functions generating multiple rows from one

regular SQL, even with user-defined functions, can generate only one row from each input row:

    select doc_id, whatever(body) from ...

Hive's table-generating functions allow a single row to expand to multiple rows
  exactly what we need to count the individual words in a single row
explode is one such example; it takes an array and generates a row for each item in the array (split is a function that splits a string into an array):

    select explode(split(line, " ")) as word from a_file;



48 / 59

Transform function

transform is a table-generating function that applies an external program (just like streaming):

    SELECT TRANSFORM (column, ...) USING command AS column-alias, ...;

the specified columns are sent to command
the outputs from command become rows

49 / 59

explode(split()) equivalent by transform



    select transform (line) using './ws.py' from a_file;

where ws.py is:

    import sys
    for line in sys.stdin:
        for w in line.split():
            print(w)





50 / 59

A limitation of table generating functions

when using a table-generating function (explode, transform, etc.), it must be the only expression in the select clause
OK:

    select explode(split(body, " ")) from ...

NG:

    select line_number, explode(split(body, " ")) from ...

lateral view is a mechanism to overcome this

51 / 59

Lateral view

    SELECT ...
    FROM table LATERAL VIEW table-generating-expression table-alias AS column-alias

1. adds a new column column-alias to table
2. for each row generated by the table-generating-expression, it generates a row whose column-alias is the generated value

    SELECT * FROM foo LATERAL VIEW explode(split(V, " ")) bar AS X;

foo:

    I  J  V
    1  2  a b c
    3  4  d e
    5  6  f

bar:

    I  J  V      X
    1  2  a b c  a
    1  2  a b c  b
    1  2  a b c  c
    3  4  d e    d
    3  4  d e    e
    5  6  f      f
52 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

53 / 59

Less trivial problems in MapReduce


generally, working on multiple data (or on a single datum but through different access streams) is not trivial to express, even if it is trivial in other languages
examples:
  inner product of two vectors:

    s = 0;
    for (i = 0; i < n; i++)
        s += x[i] * y[i];

  observe it would be easy if we were to compute x · x
  minimum distance among points:

    m = ∞;
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            m = min(m, fabs(x[i] - x[j]));






54 / 59

General trick

the problem: a single iteration accesses two (or more) pieces of data
  x[i] and y[i]
  x[i] and x[j]
yet MapReduce (or any out-of-core algorithm) does not allow random access
the trick is to make a stream that brings them together:
  in SQL, a table join
  in MapReduce, another MapReduce job which assigns the same key to the items that need to be accessed together

55 / 59

Inner product using SQL join


assume the following tables

    create table X (i, x);
    create table Y (i, y);

and we would like to compute X · Y
the following query does it:

    select sum(x * y) from X natural join Y;

X natural join Y is the following table:

    i    x      y
    i0   x_i0   y_i0
    i1   x_i1   y_i1
    i2   x_i2   y_i2
    ...  ...    ...

the point is to bring the items accessed in the same iteration into a single row
56 / 59

Inner product using MapReduce


assume X is in file X and Y in file Y, each of the format:

    i0   v_i0
    i1   v_i1
    i2   v_i2
    ...  ...

goal: bring the two lines "i x_i" and "i y_i" together
ask MapReduce to do the job! run a MapReduce to get a single stream sorted (clustered) by i:
  inputs: X and Y (arbitrarily interleaved)
  f(i, v) = [ ⟨i, v⟩ ]
  r(i, XY) = ⟨i, XY[0], XY[1]⟩
i.e., a join in MapReduce
then another MapReduce:
  f(i, x, y) = [ ⟨0, xy⟩ ]
  r(k, P) = Σ_{v∈P} v
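A sketch of the two jobs chained through the serial map_reduce function from earlier (the first reducer appends to a list instead of writing a file; the vectors are made up):

    X = [(0, 1.0), (1, 2.0), (2, 3.0)]   # (i, x_i)
    Y = [(0, 4.0), (1, 5.0), (2, 6.0)]   # (i, y_i)

    # job 1: bring x_i and y_i together under key i (a join)
    joined = []
    def r1(i, XY):
        joined.append((i, XY[0], XY[1]))
    map_reduce(X + Y, lambda rec: [rec], r1)

    # job 2: multiply each pair and sum under the single key 0
    def f2(rec):
        i, x, y = rec
        return [(0, x * y)]
    map_reduce(joined, f2, lambda k, P: print(sum(P)))   # prints 32.0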
57 / 59

Minimum distance using SQL join

assume

    create table X (i, v);

taking a cross product is much simpler in SQL:

    select min(abs(A.v - B.v))
    from X A join X B on A.i < B.i;





58 / 59

Minimum distance using MapReduce



    m = ∞;
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            m = min(m, fabs(x[i] - x[j]));

goal: generate rows containing ⟨x_i, x_j⟩ for all i < j
  easy to generate each; but how to bring x[i] and x[j] together?
  let MapReduce do the job:
    f(i, x) = [ ⟨⟨j, i⟩, x⟩ for j = 0, ..., i − 1 ]
            + [ ⟨⟨i, j⟩, x⟩ for j = i + 1, ..., n − 1 ]
    r(k, V) = ⟨k, V[0], V[1]⟩
then another MapReduce:
    f(k, x_i, x_j) = [ ⟨0, abs(x_i − x_j)⟩ ]
    r(k, V) = min_{v∈V} v
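A sketch of the two jobs with the serial map_reduce function from earlier (toy data; the key ⟨min(i,j), max(i,j)⟩ pairs every x_i with every x_j exactly once):

    n = 4
    x = [0.0, 5.0, 2.0, 9.0]

    # job 1: key each value by the (i, j) pairs it participates in
    pairs = []
    def f1(rec):
        i, xi = rec
        return ([((j, i), xi) for j in range(i)] +
                [((i, j), xi) for j in range(i + 1, n)])
    def r1(k, V):
        pairs.append((k, V[0], V[1]))
    map_reduce(list(enumerate(x)), f1, r1)

    # job 2: one distance per pair, minimum under the single key 0
    def f2(rec):
        k, xi, xj = rec
        return [(0, abs(xi - xj))]
    map_reduce(pairs, f2, lambda k, V: print(min(V)))   # prints 2.0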
59 / 59
