
Programming in Hadoop and Hive


Kenjiro Taura
University of Tokyo

1 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

2 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

3 / 59

What is MapReduce?

originally proposed by Google
the goal is to process large data easily in parallel, and in a fault-tolerant manner
it encompasses a commonly occurring pattern of parallel processing

4 / 59

MapReduce
most broadly, MapReduce refers to the following pattern of computation

MapReduce in natural language:
  input:
    a set of records X;
    a user-defined function f, called a mapper;
    a user-defined function r, called a reducer;
  map phase:
    apply f to each record in X, which produces a list of ⟨key, value⟩ pairs;
  reduce phase:
    for each distinct key produced, collect all the values paired with that key and apply r to the key and the values;


5 / 59

MapReduce expressed in graphs


[Figure: inputs → map phase applies f to produce intermediate ⟨k,v⟩ pairs → values of the same key are collected into ⟨k, [v,v,...]⟩ groups → reduce phase produces the outputs]

6 / 59

Serial MapReduce expressed in Python


Working description of MapReduce in Python:

    def map_reduce(X, f, r):
        # hashtable: key -> values
        KV = {}
        for x in X:
            for k, v in f(x):
                add(KV, k, v)
        for k in KV:
            r(k, KV[k])

an auxiliary function add is:

    def add(KV, k, v):
        if k not in KV:
            KV[k] = []
        # add v to the list of values for key k
        KV[k].append(v)

[Figure: the dataflow of the previous slide, annotated with where f runs (map phase) and where values of the same key are collected (reduce phase)]

7 / 59

Dataflow in MapReduce
[Figure: the same dataflow viewed along two axes; the inputs axis is partitioned in the map phase and the key space is partitioned in the reduce phase]

8 / 59

Why MapReduce is important?

1. abundant problems are expressible in MapReduce (or combinations thereof)
   though it was originally proposed in the context of processing text/web data, it is not limited to text/web data nor even to strings
2. it can be efficiently parallelized
3. it can be efficient even when data are large (i.e., do not fit in memory)
4. it can be made fault tolerant efficiently

9 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

10 / 59

Some problems expressible in MapReduce

1. word count: find a histogram of word occurrences in text
2. inverted index: find a mapping from each word to the positions where it occurs in text
3. integration: compute a numeric integration

less trivial examples later

11 / 59

Word count
Problem description:
  given a number of documents, output the histogram of word occurrences

e.g., for the input "I am chukky! kill you. wait wait. heaven the heaven."

  word    count
  .       3
  !       1
  am      1
  chukky  1
  heaven  2
  I       1
  kill    1
  the     1
  wait    2
  you     1
12 / 59

Word count in MapReduce

Serial pseudo code:


    for each document:
        for each word in document:
            WC[word] = WC[word] + 1
    return WC


MapReduce formulation:
  a record: a single document
  f(document) = [ ⟨word, 1⟩ for each word in document ]
  r(word, V) = |V|
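As a concrete sketch, here are the mapper and reducer plugged into the serial map_reduce function defined earlier (the toy documents are made up for illustration):

    def wc_mapper(document):
        # emit <word, 1> for every word occurrence
        return [(word, 1) for word in document.split()]

    def wc_reducer(word, values):
        # r(word, V) = |V|: the count is the number of collected 1s
        print(word, len(values))

    docs = ["I am chukky! kill you.", "wait wait.", "heaven the heaven."]
    map_reduce(docs, wc_mapper, wc_reducer)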

13 / 59

Inverted index
Problem description:
given a number of documents, output for each word the list
of documents it occurred in.
e.g., for the documents

  doc id  body
  0       I am chukky!
  1       kill you.
  2       wait wait.
  3       heaven the heaven.

the inverted index is

  word    doc ids
  .       1,2,3
  !       0
  am      0
  chukky  0
  heaven  3,3
  I       0
  kill    1
  the     3
  wait    2,2
  you     1
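A sketch in the same style as word count, again assuming the serial map_reduce function from earlier (the mapper emits a ⟨word, doc id⟩ pair per word occurrence; the reducer's value list is the document list):

    def ii_mapper(record):
        doc_id, body = record
        return [(word, doc_id) for word in body.split()]

    def ii_reducer(word, doc_ids):
        print(word, doc_ids)

    docs = [(0, "I am chukky!"), (1, "kill you."),
            (2, "wait wait."), (3, "heaven the heaven.")]
    map_reduce(docs, ii_mapper, ii_reducer)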
14 / 59

Integration
Problem description:
  Find an approximate value of the integral

    ∫_a^b g(x) dx

  by the following summation:

    Σ_{i=0}^{N−1} g(x_i) Δx,

  where Δx = (b − a)/N and x_i = a + i Δx.
15 / 59

Integration in MapReduce
Serial pseudo code:


    dx = (b - a) / N
    s = 0
    for i in 0..N-1:
        s += g(a + i * dx) * dx
    return s


MapReduce formulation:
  a record: index i
  f(i) = [ ⟨0, g(a + i Δx) Δx⟩ ]
  r(k, V) = Σ_{y∈V} y
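A sketch with the serial map_reduce function from earlier; the integrand, bounds, and N are made up for illustration (sin on [0, π], whose integral is 2):

    import math

    a, b, N = 0.0, math.pi, 100000
    dx = (b - a) / N

    def int_mapper(i):
        # every record maps to the single key 0
        return [(0, math.sin(a + i * dx) * dx)]

    def int_reducer(key, values):
        print(sum(values))   # prints roughly 2.0

    map_reduce(range(N), int_mapper, int_reducer)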

16 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

17 / 59

MapReduce for large data

parallel execution
processing large (out-of-core) data
fault tolerant execution

18 / 59

Parallel execution
[Figure: the map/reduce dataflow, with f applied to each input and the parallelizable steps (1)-(3) marked in the code below]

    def map_reduce(X, f, r):
        KV = {}
        for x in X:             # (1)
            for k, v in f(x):
                add(KV, k, v)
        for k in KV:            # (2)
            r(k, KV[k])         # (3)

parallelism is abundant in MapReduce:

1. applying f to all records can be done in parallel, provided accesses (add) to the same key are properly arbitrated (see the sketch below)
2. applying r to all distinct keys can be done in parallel
3. plus, not apparent from the code above, r for a single key is almost always a reduction, which can be parallelized too
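A minimal sketch of point 1, assuming Python threads and a single lock to arbitrate concurrent access to the shared table (a real implementation partitions the key space instead, as the next slide makes explicit):

    from concurrent.futures import ThreadPoolExecutor
    import threading

    def map_reduce_parallel(X, f, r):
        KV = {}
        lock = threading.Lock()
        def map_one(x):
            for k, v in f(x):
                with lock:                      # arbitrate adds to the shared table
                    KV.setdefault(k, []).append(v)
        with ThreadPoolExecutor() as pool:
            list(pool.map(map_one, X))          # (1) map records in parallel
        for k in KV:                            # (2)(3) could be parallelized too
            r(k, KV[k])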
19 / 59

Parallel/distributed execution made explicit

the map phase partitions the inputs; the reduce phase partitions the key space; this induces all-to-all communication between the two

[Figure: inputs (map) partitioned one way, key space (reduce) partitioned the other; every map task sends data to every reduce task]



    def map_reduce(X, f, r):
        forall i in 0..M-1:                 # M map tasks
            for j in 0..R-1: KV[i,j] = {}
            for x in i-th chunk of X:
                for k, v in f(x):
                    j = partition(k)
                    add(KV[i,j], k, v)
        forall j in 0..R-1:                 # R reduce tasks
            # all-to-all communication
            KV[j] = ∪_i KV[i,j]             # key-wise concat
            for k in KV[j]:
                r(k, KV[j][k])
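A runnable serial rendering of the same algorithm (hash(k) % R stands in for partition, and the forall loops are written as ordinary loops, though conceptually each iteration is an independent task):

    def map_reduce_partitioned(X, f, r, M=4, R=4):
        # map phase: map task i fills one table per reduce task j
        KV = [[{} for j in range(R)] for i in range(M)]
        for i in range(M):                      # conceptually a forall
            for x in X[i::M]:                   # i-th chunk of X
                for k, v in f(x):
                    j = hash(k) % R             # partition(k)
                    KV[i][j].setdefault(k, []).append(v)
        # reduce phase: task j gathers KV[i][j] from every i (all-to-all)
        for j in range(R):                      # conceptually a forall
            KVj = {}
            for i in range(M):
                for k, vs in KV[i][j].items():
                    KVj.setdefault(k, []).extend(vs)
            for k in KVj:
                r(k, KVj[k])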

20 / 59

Processing large data



    def map_reduce(X, f, r):
        forall i in 0..M-1:                 # M map tasks
            for j in 0..R-1: KV[i,j] = {}
            for x in i-th chunk of X:
                for k, v in f(x):
                    j = partition(k)
                    add(KV[i,j], k, v)
        forall j in 0..R-1:                 # R reduce tasks
            # all-to-all communication
            KV[j] = ∪_i KV[i,j]             # key-wise concat
            for k in KV[j]:
                r(k, KV[j][k])


we would like to run this algorithm when the inputs X or the intermediate data KV do not fit in memory
21 / 59

Processing large data

1. X does not fit in memory:
   simply scan records from disk (not a big deal)
2. KV does not fit in memory; either
   1. sort all key-value pairs (out-of-core sort), or
   2. use a data structure supporting index search (B-tree, hash, etc.) on disk

22 / 59

Processing large data



    def map_reduce(X, f, r):
        forall i in 0..M-1:                 # M map tasks
            for j in 0..R-1: KV[i,j] = output stream
            for x in i-th chunk of X:
                for k, v in f(x):
                    j = partition(k)
                    KV[i,j].write(⟨k, v⟩)
        forall j in 0..R-1:                 # R reduce tasks
            # all-to-all communication
            KV[j] = sort(∪_i KV[i,j])       # list concat + sort
            # this is a sequential scan
            for k in unique keys in KV[j]:
                V = values paired with k
                r(k, V)
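The reduce-phase scan, sketched in Python with in-memory lists standing in for the on-disk streams (itertools.groupby performs exactly this one-pass grouping over sorted pairs):

    import itertools

    def reduce_task(streams, r):
        # list concat + sort (out-of-core in a real implementation)
        kv = sorted(sum(streams, []))
        # sequential scan: each run of equal keys forms one group
        for k, group in itertools.groupby(kv, key=lambda p: p[0]):
            V = [v for _, v in group]
            r(k, V)

    reduce_task([[("b", 2), ("a", 1)], [("a", 3)]],
                lambda k, V: print(k, V))   # a [1, 3] / b [2]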


23 / 59

Fault tolerant execution


the KV[i,j] serve as intermediate checkpoints: when a map/reduce task fails, it suffices to redo just that task

[Figure: the map/reduce dataflow; only the failed task's portion needs to be recomputed]

24 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

25 / 59

Hadoop

Hadoop is an open source implementation of MapReduce
  http://hadoop.apache.org/ and http://www.cloudera.com/
the Hadoop system takes care of parallelization on distributed memory machines, input/output, and fault tolerance
its native API is Java; you write mappers and reducers as Java methods
it also supports Hadoop streaming; you can use arbitrary programs that read standard input and write standard output

26 / 59

Hadoop operations and HDFS

HDFS: a file system for Hadoop, which distributes files across clusters
three modes of operation:
  local: work locally on regular files, without daemons
  pseudo-distributed: work on HDFS and run daemons, but still locally
  distributed: use clusters

27 / 59

Running Hadoop programs in local mode


point the config directory (HADOOP_CONF_DIR) to an empty directory
run a class in a jar file
e.g.

    $ ls confs/local
    $ export HADOOP_CONF_DIR=$PWD/confs/local
    $ cat input/chukky
    I am chukky
    kill you
    wait wait
    heaven the heaven
    $ hadoop jar hadoop-0.20.2-cdh3u5/hadoop-examples-0.20.2-cdh3u5.jar \
        wordcount input/chukky out
    12/11/19 15:03:00 INFO jvm.JvmMetrics: Initializing JVM Metrics with
        processName=JobTracker, sessionId=
    ...


28 / 59

Running Hadoop programs in local mode



    $ ls out
    _SUCCESS  part-r-00000
    $ cat out/part-r-00000
    I       1
    am      1
    chukky  1
    heaven  2
    kill    1
    the     1
    wait    2
    you     1


29 / 59

Inside wordcount class?

http://wiki.apache.org/hadoop/WordCount
it is written in Java; those who would rather use another language have an alternative: Hadoop streaming (next)

30 / 59

Writing Hadoop streaming


you write two commands in any language, one for the mapper, the other for the reducer
run the HadoopStreaming class in the jar file contrib/streaming/hadoop-streaming-xxx.jar

    hadoop jar hadoop-streaming-0.20.2-cdh3u5.jar HadoopStreaming \
        -mapper ./map.py -reducer ./red.py \
        -input input/chukky -output out



mapper (./map.py): read data from stdin; print whitespace-separated key-value pairs, one per line
reducer (./red.py): read key-value pairs from stdin, grouped (sorted) by key; print the final results to stdout
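A minimal sketch of what the two commands might look like for word count (hypothetical contents for the map.py and red.py named above; Hadoop streaming sorts the mapper output by key before feeding the reducer):

    # map.py
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    # red.py
    import sys
    counts = {}
    for line in sys.stdin:
        word, n = line.split("\t")
        counts[word] = counts.get(word, 0) + int(n)
    for word, n in counts.items():
        print("%s\t%d" % (word, n))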

31 / 59

(Pseudo) distributed mode


work on files in HDFS, the Hadoop-tailored distributed file system
for which you need to get daemons up and running:
  namenode: master of HDFS, on one node
  datanode: data servers of HDFS, on each node listed as slaves
  jobtracker: master of MapReduce execution, on one node
  tasktracker: oversees MapReduce execution, on each node listed as slaves
and for which you need to write several config files
I will cover the details when more appropriate

32 / 59

MapReduce-like computation in other languages

MapReduce is a very common pattern in parallel processing
no surprise that other languages more or less support a similar pattern already:
  parallel languages have parallel loops and reductions
  SQL can express many MapReduce computations naturally
  Hive is specifically designed to support an SQL-like syntax and execute it on Hadoop
SQL is particularly noteworthy, as it supports out-of-core data and is more flexible/declarative than MapReduce

33 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

34 / 59

Reduction in parallel languages

OpenMP: parallel, for, reduction (of scalars)
TBB: parallel_reduce
Chapel: reduction expressions (25.4.1)

35 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

36 / 59

SQL and MapReduce

SQL can express many MapReduce computations naturally
basic idea:
  map phase → select + (user-defined) functions
  reduce phase → group by + (user-defined) aggregates

37 / 59

MapReduce examples in SQL


assume the following table

    create table words (doc_id, word);

containing a word per record

word count:

    select word, count(*) from words group by word;

inverted index:

    select word, group_concat(doc_id) from words group by word;


38 / 59

MapReduce examples in SQL


assume the following table

    create table interval (i);

containing the sequence numbers 0, 1, 2, ..., N − 1 (how would you fill it?)

integration (for pedagogical purposes only):

    select sum(g(a + i * dx) * dx) from interval;

computes ∫_a^b g(x) dx, where g is a user-defined function that returns g(x) for x

39 / 59

MapReduce-equivalent in SQL
assume the following table

    create table X (a, b, c, ...);

MapReduce with mapper f and reducer r:

    select k(a,b,c,...) as key, r(v(a,b,c,...))
    from X group by key;

where
  k is a user-defined function that returns the key component of f(x)
  v is a user-defined function that returns the value component of f(x)
  r is a user-defined aggregate that returns r(V) for the values V of each key

a parallel database should be able to execute this query in parallel for large data
40 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

41 / 59

Hive

an SQL-like language running on top of Hadoop
  https://cwiki.apache.org/Hive/home.html
important departures from SQL:
  table-generating functions
  lateral view
I will explain the workflow of typical file processing tasks
see https://cwiki.apache.org/Hive/languagemanual.html for the language reference

42 / 59

Running Hive programs in local mode


trick: set hive.metastore.warehouse.dir to the absolute path of an existing local directory
(it is likely to fail with the default setting, /usr/hive/warehouse)
e.g.


    $ HIVE_HOME=... HADOOP_HOME=... \
        HADOOP_CONF_DIR=empty-dir hive \
        --hiveconf hive.metastore.warehouse.dir=dir \
        -f program.sql


no arguments: open an interactive session
-f file: run the hive statements in file
-i file: run the hive statements in file and then open an interactive session
43 / 59

Hive: a typical workflow

we outline what you need to know to process data in regular files:

1. create a table
2. load data from a file into the table
3. process the data by queries, using
   table-generating functions
   user-defined operations via external programs (TRANSFORM)
   lateral view

44 / 59

Create table and load data into table


creating a table is almost the usual SQL:

    CREATE TABLE table (column-name type, ...)

loading data from a file into a table:

    LOAD DATA [LOCAL] INPATH file [OVERWRITE] INTO TABLE table

LOCAL says the file is a file in your regular file system (not HDFS)
the default column delimiter is ^A (Ctrl-A) and the row delimiter is \n, so typical files become a table having a single column and as many rows as lines
45 / 59

Example
docs.txt:

    0^AI have a pen. It is a nice weather. See you.
    1^AI am Ken. It is a bad weather. Good bye.

    hive> create table docs (doc_id int, body string);
    hive> load data local inpath "docs.txt" overwrite into table docs;
    hive> select * from docs;
    OK
    0    I have a pen. It is a nice weather. See you.
    1    I am Ken. It is a bad weather. Good bye.
    Time taken: 0.186 seconds



46 / 59

Simple and nested queries

simple query: almost the usual SQL; e.g.

    select sum(x * y) from vector;
    select count(*) from lines;

nested query:

    SELECT ... FROM (query) table;
    FROM (query) table SELECT ...;





47 / 59

Table-generating functions: functions generating multiple rows from one

regular SQL, even with user-defined functions, can generate only one row from each input row:

    select doc_id, whatever(body) from ...

Hive's table-generating functions allow a single row to expand to multiple rows
  exactly what we need to count the individual words in a single row
explode is one such example; it takes an array and generates a row for each item in the array (split is a function that splits a string into an array):

    select explode(split(line, " ")) as word from a_file;



48 / 59

Transform function

transform is a table-generating function that applies an external program (just like streaming):

    SELECT TRANSFORM (column, ...) USING command AS column-alias, ...;

the specified columns are sent to command
the outputs from command become rows

49 / 59

explode(split()) equivalent by transform



    select transform (line) using './ws.py' from a_file;

where ws.py is:

    import sys
    for line in sys.stdin:
        for w in line.split():
            print(w)





50 / 59

A limitation of table generating functions

when using a table-generating function (explode, transform, etc.), it must be the only expression in the select clause
OK:

    select explode(split(body, " ")) from ...

NG:

    select line_number, explode(split(body, " ")) from ...

lateral view is a mechanism to overcome this

51 / 59

Lateral view

    SELECT ...
    FROM table LATERAL VIEW table-generating-expression table-alias AS column-alias

1. adds a new column column-alias to table
2. for each row generated by the table-generating-expression, it generates a row whose column-alias is the generated value

    SELECT * FROM foo LATERAL VIEW explode(split(V, " ")) bar AS X;

foo:

    I  J  V
    1  2  a b c
    3  4  d e
    5  6  f

bar:

    I  J  V      X
    1  2  a b c  a
    1  2  a b c  b
    1  2  a b c  c
    3  4  d e    d
    3  4  d e    e
    5  6  f      f
52 / 59

Contents
1. MapReduce
   MapReduce Basics
   Some simple examples
   Executing MapReduce for large data
   Hadoop
2. MapReduce-like computation in other languages
   Reduction in parallel languages
   SQL
   Hive
   Other problems expressible in MapReduce, but less trivially

53 / 59

Less trivial problems in MapReduce


generally, working on multiple data (or on a single datum but through different access streams) is not trivial to express, even if it is trivial in other languages
examples:
  inner product of two vectors:

    s = 0;
    for (i = 0; i < n; i++)
        s += x[i] * y[i];

  observe it would be easy if we were to compute x · x
  minimum distance among points:

    m = ∞;
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            m = min(m, fabs(x[i] - x[j]));






54 / 59

General trick

the problem: a single iteration accesses two (or more) pieces of data
  x[i] and y[i]
  x[i] and x[j]
yet MapReduce (or any out-of-core algorithm) does not allow random access
the trick is to make a stream that brings them together:
  in SQL, a table join
  in MapReduce, another MapReduce job which assigns the same key to the items that need to be accessed together

55 / 59

Inner product using SQL join


assume the following tables

    create table X (i, x);
    create table Y (i, y);

and we would like to compute X · Y
the following query does it:

    select sum(x * y) from X natural join Y;

X natural join Y is the following table:

    i    x      y
    i0   x_i0   y_i0
    i1   x_i1   y_i1
    i2   x_i2   y_i2
    ...  ...    ...

the point is to bring the items accessed in the same iteration into a single row
56 / 59

Inner product using MapReduce


assume X is in file X and Y in file Y, each of the format:

    i0   v_i0
    i1   v_i1
    i2   v_i2
    ...  ...

goal: bring the two lines "i x_i" and "i y_i" together
ask MapReduce to do the job! run a MapReduce to get a single stream sorted (clustered) by i:
  inputs: X and Y (arbitrarily interleaved)
  f(i, v) = [ ⟨i, v⟩ ]
  r(i, XY) = ⟨i, XY[0], XY[1]⟩
i.e., a join in MapReduce
then another MapReduce:
  f(i, x, y) = [ ⟨0, xy⟩ ]
  r(k, P) = Σ_{v∈P} v
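A sketch of the two jobs chained through the serial map_reduce function from earlier (the first reducer appends to a list instead of writing a file; the vectors are made up):

    X = [(0, 1.0), (1, 2.0), (2, 3.0)]   # (i, x_i)
    Y = [(0, 4.0), (1, 5.0), (2, 6.0)]   # (i, y_i)

    # job 1: bring x_i and y_i together under key i (a join)
    joined = []
    def r1(i, XY):
        joined.append((i, XY[0], XY[1]))
    map_reduce(X + Y, lambda rec: [rec], r1)

    # job 2: multiply each pair and sum under the single key 0
    def f2(rec):
        i, x, y = rec
        return [(0, x * y)]
    map_reduce(joined, f2, lambda k, P: print(sum(P)))   # prints 32.0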
57 / 59

Minimum distance using SQL join

assume

    create table X (i, v);

taking a cross product is much simpler in SQL:

    select min(abs(A.v - B.v))
    from X A join X B on A.i < B.i;





58 / 59

Minimum distance using MapReduce



    m = ∞;
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            m = min(m, fabs(x[i] - x[j]));

goal: generate rows containing ⟨x_i, x_j⟩ for all i < j
  easy to generate each; but how to bring x[i] and x[j] together?
  let MapReduce do the job:
    f(i, x) = [ ⟨⟨j, i⟩, x⟩ for j = 0, ..., i − 1 ]
            + [ ⟨⟨i, j⟩, x⟩ for j = i + 1, ..., n − 1 ]
    r(k, V) = ⟨k, V[0], V[1]⟩
then another MapReduce:
    f(k, x_i, x_j) = [ ⟨0, abs(x_i − x_j)⟩ ]
    r(k, V) = min_{v∈V} v
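A sketch of the two jobs with the serial map_reduce function from earlier (toy data; the key ⟨min(i,j), max(i,j)⟩ pairs every x_i with every x_j exactly once):

    n = 4
    x = [0.0, 5.0, 2.0, 9.0]

    # job 1: key each value by the (i, j) pairs it participates in
    pairs = []
    def f1(rec):
        i, xi = rec
        return ([((j, i), xi) for j in range(i)] +
                [((i, j), xi) for j in range(i + 1, n)])
    def r1(k, V):
        pairs.append((k, V[0], V[1]))
    map_reduce(list(enumerate(x)), f1, r1)

    # job 2: one distance per pair, minimum under the single key 0
    def f2(rec):
        k, xi, xj = rec
        return [(0, abs(xi - xj))]
    map_reduce(pairs, f2, lambda k, V: print(min(V)))   # prints 2.0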
59 / 59
