Professional Documents
Culture Documents
computing
Jiaheng Lu
Department of Computer Science
Renmin University of China
www.jiahenglu.net
Hadoop/Hive
Search Engine
Googles Solution
Spider
Batch Processing System
on top of Hadoop
Internet
Runtime
SE Web Server
Business
Intelligence
Database
Web Server
Hadoop History
Hadoop Introduction
http://hadoop.apache.org/
Book: http://oreilly.com/catalog/9780596521998/index.html
Written in Java
Runs on
Largest Cluster
Cloudera
Hadoop Components
Hadoop Streaming
Pig / JAQL / Hive
HBase
Goals of HDFS
Load balancing
Node failures
Cluster expansion
HDFS Details
Data Coherency
Intelligent Client
Java API
Command Line
hadoop dfs -mkdir /foodir
hadoop dfs -cat /foodir/myfile.txt
hadoop dfs -rm /foodir myfile.txt
hadoop dfsadmin -report
hadoop dfsadmin -decommission datanodename
Web Interface
http://host:port/dfshealth.jsp
Log processing
Web index building
<nk1, nv1>
<nk2, nv2>
<nk3, nv3>
Local
Map
<nk1, nv1>
<nk3, nv3>
<nk1, nv6>
<nk1, nv1>
<nk1, nv6>
<nk3, nv3>
Local
Sort
Global
Shuffle
<nk1, 2>
<nk3, 1>
Local
Reduce
Machine 2
<k4, v4>
<k5, v5>
<k6, v6>
<nk2, nv4>
<nk2, nv5>
<nk1, nv6>
<nk2, nv4>
<nk2, nv5>
<nk2, nv2>
<nk2, nv4>
<nk2, nv5>
<nk2, nv2>
<nk2, 3>
Physical Flow
Example Code
Hadoop Streaming
hadoop streaming
-input /user/zshao/articles
-mapper tr \n
-reducer uniq -c
-output /user/zshao/
-numReduceTasks 32
Map-Reduce is scalable
Hive
http://svn.apache.org/repos/asf/hadoop/hive/trunk/
Type system
Primitive types
Recursively build up using Composition/Maps/Lists
Schema Evolution
MetaStore
Thrift API
Hive CLI
DDL:
Browsing:
Loading Data
Queries
MetaStore UI:
HiPal:
Philosophy
SQL
Map-Reduce with custom scripts (hadoop streaming)
Query Operators
Projections
Equi-joins
Group by
Sampling
Order By
Extended SQL:
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script' AS (dt, uid)
CLUSTER BY dt) map
INSERT INTO TABLE pv_users_reduced
REDUCE map.dt, map.uid
USING 'reduce_script' AS (date, count);
Hive Architecture
Map Reduce
Web UI
Mgmt, etc
Hive CLI
Browsing
Queries
DDL
Hive QL
MetaStore
Parser
Planner
Execution
SerDe
Thrift API
HDFS
Hive QL Join
page_view
page
id
1
2
1
user
id
111
111
222
time
9:08:01
9:08:13
9:08:14
pv_users
user
page age
id
1
25
2
25
SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
32
page
id
1
2
1
user
id
111
111
222
time
9:08:01
key
9:08:13
111
9:08:14
111
222
Map
user
key
user age gender
111
id
222
111 25 female
222 32
male
value
<1,1>
<1,2>
<1,1>
value
<2,25>
<2,32>
key value
111 <1,1>
111 <1,2>
Shuffle
Sort
111
<2,25>
Reduce
key value
222 <1,1>
222 <2,32>
Hive QL Group By
pv_users
page age
id
1
25
2
25
1
2
32
25
SQL:
pageid_age_sum
pageid
1
2
age
25
25
Count
1
2
32
page age
id
1
25
2
25
Map
page age
id
1
32
2
25
key value
<1,2
1
5>
<2,2
1
5>
key value
<1,3
1
2>
<2,2
1
5>
Shuffle
Sort
key value
pa
<1,2
1
5>
<1,3
1
Reduce
2>
key value
<2,2
1
5>
<2,2
1
5>
pa
page
id
1
2
1
2
useri
d
111
111
222
111
time
9:08:01
9:08:13
9:08:14
result
pagei count_distinct_u
d
serid
1
2
2
1
9:08:20
SQL
pagei useri
d
d
1
111
2
111
time
9:08:01
9:08:13
Shuffle
and
Sort
key
<1,111>
<1,222
>
pagei coun
d
t
1
2
Reduce
key
<2,111>
<2,111>
pagei coun
d
t
2
1
pagei useri
key
v
d
d
<1,111> 9:08:0
1
111
1
2
111
<2,111> 9:08:1
3 Reduce
key
v
<1,222 9:08:1
>
4
<2,111> 9:08:2
0
pagei useri
d
d
1
222
2
111
9:
9:
Hive Optimizations
Efficient Execution of SQL on top of Map-Reduce
<nk1, nv1>
<nk2, nv2>
<nk3, nv3>
Local
Map
<nk1, nv1>
<nk3, nv3>
<nk1, nv6>
Global
Shuffle
<nk1, nv1>
<nk1, nv6>
<nk3, nv3>
Local
Sort
<nk1, 2>
<nk3, 1>
Local
Reduce
Machine 2
<k4, v4>
<k5, v5>
<k6, v6>
<nk2, nv4>
<nk2, nv5>
<nk1, nv6>
<nk2, nv4>
<nk2, nv5>
<nk2, nv2>
<nk2, nv4>
<nk2, nv5>
<nk2, nv2>
<nk2, 3>
key av
1 111
AB
key bv
1 222
SQL:
Map Reduce
ke av bv
y
1 111 222
C
key cv
1 333
ABC
Map Reduce
ke av bv cv
y
1 111 222 333
page age
id
1
25
2
32
Map Reduce
page cou
id
nt
1
1
2
1
Extended SQL
page age
id
1
25
2
32
Map Reduce
age cou
nt
25
1
32
1
FROM pv_users
INSERT INTO TABLE
pv_pageid_sum
SELECT pageid, count(1)
GROUP BY pageid
INSERT INTO TABLE pv_age_sum
SELECT age, count(1)
GROUP BY age;
pv_users
page
id
1
1
ag
e
25
25
1
2
1
25
32
25
pageid_age_partial_sum
pageid_age_sum
Map-Reduce
page
id
1
2
1
ag
ag cou
cou
ee nt
nt
25
25 24
32
32 11
25
<male, 343>
<female, 128>
Local
Map
<male, 343>
<male, 123>
Global
Shuffle
<male, 343>
<male, 123>
Local
Sort
<male, 466>
Local
Reduce
Machine 2
<k4, v4>
<k5, v5>
<k6, v6>
<male, 123>
<female, 244>
<female, 128>
<female, 244>
<female, 128>
<female, 244>
<female, 372>
Query Rewrite
Predicate Push-down
Column Pruning