Map/Reduce, Hadoop & Pig

Data mining applied to the enterprise
Definitions
 Data mining is the process of extracting
patterns from data. It is commonly used in a
wide range of profiling practices, such as
marketing, surveillance, fraud detection
and scientific discovery.
 A framework is a reusable design for a
software system (or subsystem). A software
framework may include support programs,
code libraries, a scripting language, or
other software to help develop and glue
together the different components of a
software project. Various parts of the
framework may be exposed through an API.
Map/Reduce
 Framework for processing huge datasets on certain
kinds of distributable problems using a large
number of computers.
 MapReduce provides
 Automatic parallelization & distribution
 Fault tolerance
 I/O scheduling
 Status monitoring
 Use cases
 Document clustering
 Machine learning
 Inverted index construction
 Was used to completely regenerate Google's index of
the World Wide Web
Map/Reduce
"Map" step: The master node takes the
input, chops it up into smaller sub-
problems, and distributes those to worker
nodes. A worker node may do this again in
turn, leading to a multi-level tree structure.

"Reduce" step: The master node then takes
the answers to all the sub-problems and
combines them in a way to get the output -
the answer to the problem it was originally
trying to solve.

Defined with respect to data structured in
(key, value) pairs
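As an illustration of the two steps over (key, value) pairs, here is a minimal single-process sketch in Python that builds an inverted index, one of the use cases listed earlier. The names `map_fn`, `reduce_fn` and `map_reduce` and the sample documents are assumptions made for this example; a real framework would run the map and reduce calls on many machines.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_fn(word, doc_ids):
    """Reduce: combine all document ids seen for one word into a sorted list."""
    return word, sorted(doc_ids)

def map_reduce(records, map_fn, reduce_fn):
    # Group intermediate values by key; this stands in for the shuffle
    # that a distributed framework performs between the two steps.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

# Hypothetical input: (document id, document text) pairs.
docs = [(1, "hadoop runs map reduce"), (2, "pig runs on hadoop")]
index = map_reduce(docs, map_fn, reduce_fn)
```

Here `index["hadoop"]` lists both documents, while `index["pig"]` lists only the second one.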
Map/Reduce
Map/Reduce – Data flow sections

 Input reader

 Map function

 Partition function

 Compare function

 Reduce function

 Output writer
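The six sections above can be sketched as six small Python functions chained by a driver. This is a single-process illustration under stated assumptions, not Hadoop's or Google's API: all function names, the word-count job, and the character-sum partitioner are hypothetical choices made for this example.

```python
from collections import defaultdict

def input_reader(lines):
    # Input reader: turn raw input into (key, value) records,
    # here (line number, line text).
    return list(enumerate(lines))

def map_function(_line_no, line):
    # Map function: emit (word, 1) for each word in the line.
    return [(word, 1) for word in line.split()]

def partition_function(key, num_reducers):
    # Partition function: decide which reducer receives a key.
    # A deterministic character sum stands in for a real hash function.
    return sum(ord(c) for c in key) % num_reducers

def compare_function(pair):
    # Compare function: defines the sort order of keys inside a partition.
    return pair[0]

def reduce_function(key, values):
    # Reduce function: fold all values of one key into a single result.
    return key, sum(values)

def output_writer(results):
    # Output writer: formats lines here instead of writing to storage.
    return [f"{key}\t{value}" for key, value in results]

def run_job(lines, num_reducers=2):
    partitions = [[] for _ in range(num_reducers)]
    for key, value in input_reader(lines):
        for out_key, out_value in map_function(key, value):
            target = partition_function(out_key, num_reducers)
            partitions[target].append((out_key, out_value))
    output = []
    for partition in partitions:
        grouped = defaultdict(list)
        for key, value in sorted(partition, key=compare_function):
            grouped[key].append(value)
        results = [reduce_function(k, vs) for k, vs in grouped.items()]
        output.extend(output_writer(results))
    return output

lines_out = run_job(["a b a", "b c"])
```

Each partition is sorted and reduced independently, which is why the output is ordered within a partition but not across partitions.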
Map/Reduce -> Hadoop

Google calls it…     Hadoop equivalent…
MapReduce            Hadoop MapReduce
GFS                  HDFS
Sawzall              Hive, Pig
BigTable             HBase
Chubby               ZooKeeper
Hadoop
 Java Map/Reduce implementation

 Framework that schedules tasks, monitors
them, and re-executes the failed
ones.

 Single master JobTracker

 Several slave TaskTrackers, one per node

 Hadoop DFS (not explicitly required)

 Add-ons: Hive (Facebook dev), Pig (Yahoo! dev)
Hadoop example
 A program that takes web server access log
files and counts the number of hits in each
minute slot over a week

 Differentiate the input and output phases: Map and
Reduce
 Map phase: reads the access log files
 Reduce phase: receives the key set plus an iterator
over the values of each key
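The hit-counting job described above can be sketched in Python as follows. This is not the Java code from the slides; it is a single-process illustration that assumes the access logs use the Common Log Format timestamp layout. The names `map_phase`, `reduce_phase`, `count_hits` and the sample log lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumed timestamp layout: [10/Oct/2000:13:55:36 -0700]
# The captured group drops the seconds, leaving the minute slot.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]")

def map_phase(line):
    """Emit (minute_slot, 1) for each log line with a parsable timestamp."""
    match = TIMESTAMP.search(line)
    if match:
        yield match.group(1), 1

def reduce_phase(minute_slot, counts):
    """Sum the hits of one minute slot; counts iterates over that key's values."""
    return minute_slot, sum(counts)

def count_hits(lines):
    # Group the mapped values by key, then reduce each key subset.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    return dict(reduce_phase(key, iter(values)) for key, values in grouped.items())

# Hypothetical sample lines.
sample_log = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:59 -0700] "GET /b HTTP/1.0" 200 512',
    '127.0.0.1 - - [10/Oct/2000:13:56:02 -0700] "GET /a HTTP/1.0" 404 0',
]
hits = count_hits(sample_log)
```

Because the key keeps the date as well as the hour and minute, the same minute on different days of the week lands in different slots.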
Hadoop Example
[Diagram: Input → Map → Reduce → Output]
Hadoop Example – Map Phase
Hadoop Example – Reduce Phase
Hadoop Example – Main code
Problems
 Hadoop Map/Reduce is very powerful, but…

 Requires a Java Programmer

 User has to reinvent the wheel every time some
functionality is needed (join, filter, etc.)

 Harder to write, harder to maintain

 Optimization is left to the user
Pig
 Platform for analyzing large data sets

 High-level language + infrastructure
(compiler)

 Pig Latin
 Data flow language rather than procedural or
declarative

 Ease of programming

 Optimization opportunities

 Extensibility
Pig - Advantages
 Increases productivity. In one test:
 10 lines of Pig Latin ≈ 200 lines of Java.
 What took 4 hours to write in Java took 15
minutes in Pig Latin.

 Opens the system to non-Java programmers.

 Provides common operations like join, group,
filter, sort.
Pig – How it works
Pig Example

 Start a terminal and run
$ cd /usr/share/cloudera/pig/
$ bin/pig -x local
 You should see a prompt like:
 grunt>
Pig – Example - Aggregation
 Let’s count the number of times each user
appears in a given data set.
 log = LOAD 'excite-small.log' AS (user, timestamp, query);
 grpd = GROUP log BY user;
 cntd = FOREACH grpd GENERATE group, COUNT(log);
 STORE cntd INTO 'output';
 Results:
 002BB5A52580A8ED 18
 005BD9CD3AC6BB38 18
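For readers who do not know Pig Latin, the aggregation above corresponds roughly to the following Python sketch. It assumes tab-separated rows of (user, timestamp, query), which is Pig's default field delimiter; the function name `count_by_user` and the sample rows are hypothetical and not taken from excite-small.log.

```python
from collections import Counter

def count_by_user(lines):
    """Mirror LOAD / GROUP BY user / COUNT for tab-separated rows."""
    counts = Counter()
    for line in lines:
        # Split into the three declared fields: user, timestamp, query.
        user, _timestamp, _query = line.rstrip("\n").split("\t", 2)
        counts[user] += 1
    return counts

# Hypothetical rows, not real data from the log file.
rows = [
    "USER_A\t970916105432\tyahoo chat",
    "USER_B\t970916001949\tminiature trains",
    "USER_A\t970916105501\tyahoo mail",
]
cntd = count_by_user(rows)
```

The Pig version expresses the same grouping and counting in four statements and leaves the parallel execution to the Pig compiler.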
Pig
 Supports several functions
 Aggregation
 Grouping
 Filtering
 Ordering
 Joins & Anti-Joins
 Cogrouping (grouping generalization)
 Several data types:
Pig - Commands
Pig Command      What it does
load             Read data from file system.
store            Write data to file system.
foreach          Apply expression to each record and output one or more records.
filter           Apply predicate and remove records that do not return true.
group/cogroup    Collect records with the same key from one or more inputs.
join             Join two or more inputs based on a key.
order            Sort records based on a key.
distinct         Remove duplicate records.
union            Merge two data sets.
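To make the semantics of these commands concrete, the sketch below mimics most of them on small in-memory lists in Python. It is an analogy only, not how Pig executes them; the record layouts and sample values are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (user, query)
searches = [("ann", "pig"), ("bob", "hadoop"), ("ann", "pig"), ("ann", "hive")]
# Hypothetical records: (user, country)
users = [("ann", "ES"), ("bob", "US")]

# filter: keep only records for which the predicate is true
filtered = [r for r in searches if r[0] == "ann"]

# distinct: remove duplicate records (first appearance is kept)
distinct = list(dict.fromkeys(searches))

# order: sort records based on a key (here the query)
ordered = sorted(distinct, key=itemgetter(1))

# group: collect records with the same key (here the user)
grouped = {k: list(g) for k, g in groupby(sorted(searches), key=itemgetter(0))}

# foreach: apply an expression to each group
counts = {user: len(records) for user, records in grouped.items()}

# join: combine two inputs based on a key
country_of = dict(users)
joined = [(u, q, country_of[u]) for u, q in distinct if u in country_of]

# union: merge two data sets
merged = filtered + [("cat", "zookeeper")]
```

Note that `groupby` only groups adjacent records, which is why the input is sorted first.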
Possible Applications at vLex
 Faster and improved, parallelized document
indexing

 Targeted advertisement

 Recommendation system

 Trending topics

 Better search tools (Search assist)
Questions?