Map/Reduce, Hadoop & Pig

Data mining applied to the enterprise

Definitions
Data mining is the process of extracting patterns from data. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

A framework is a re-usable design for a software system (or subsystem). A software framework may include support programs, code libraries, a scripting language, or other software to help develop and glue together the different components of a software project. Various parts of the framework may be exposed through an API.

Map/Reduce
Framework for processing huge datasets on certain kinds of distributable problems using a large number of computers.
- MapReduce provides
  - Automatic parallelization & distribution
  - Fault tolerance
  - I/O scheduling
  - Status monitoring
- Use cases
  - Document clustering
  - Machine learning
  - Inverted index construction
- Was used to completely regenerate Google's index of the World Wide Web

Map/Reduce
"Map" step: The master node takes the input, chops it up into smaller subproblems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them to produce the output, which is the answer to the problem it was originally trying to solve.

Both steps are defined with respect to data structured in (key, value) pairs.
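The two steps can be illustrated with a minimal, single-process sketch. It is not Hadoop or Google code: the function names `map_fn`, `shuffle`, `reduce_fn` and `map_reduce` are hypothetical, and word counting is chosen only because it is the simplest (key, value) problem.

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, the work done between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: combine all values for one key into a single answer."""
    return key, sum(values)

def map_reduce(documents):
    # Each document stands in for a sub-problem handed to a worker node.
    emitted = [pair for doc in documents for pair in map_fn(doc)]
    return dict(reduce_fn(k, v) for k, v in shuffle(emitted).items())

counts = map_reduce(["the quick fox", "the lazy dog", "The end"])
print(counts)
```

In a real cluster the calls to `map_fn` would run in parallel on different nodes; here they run one after another so the data flow is easy to follow.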

Map/Reduce – Data flow sections
- Input reader
- Map function
- Partition function
- Compare function
- Reduce function
- Output writer
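As a rough sketch of how these six sections fit together, the following builds an inverted index (one of the use cases mentioned earlier) in plain Python. Every function name and the two-document corpus are assumptions made for illustration; a real framework runs each section on many machines.

```python
from collections import defaultdict

def input_reader(corpus):
    """Input reader: split the input into (doc_id, text) records."""
    for doc_id, text in corpus.items():
        yield doc_id, text

def map_function(doc_id, text):
    """Map function: emit (word, doc_id) for each distinct word."""
    for word in set(text.lower().split()):
        yield word, doc_id

def partition_function(key, num_reducers):
    """Partition function: pick the reducer that receives this key."""
    return sum(key.encode("utf-8")) % num_reducers

def compare_function(pair):
    """Compare function: sort key used to order each reducer's input."""
    return pair[0]

def reduce_function(word, doc_ids):
    """Reduce function: collapse the doc ids of one word into a sorted list."""
    return word, sorted(set(doc_ids))

def output_writer(results):
    """Output writer: here it only collects the results into a dict."""
    return dict(results)

def run(corpus, num_reducers=2):
    partitions = [[] for _ in range(num_reducers)]
    for doc_id, text in input_reader(corpus):
        for key, value in map_function(doc_id, text):
            partitions[partition_function(key, num_reducers)].append((key, value))
    results = []
    for bucket in partitions:
        grouped = defaultdict(list)
        for key, value in sorted(bucket, key=compare_function):
            grouped[key].append(value)
        results.extend(reduce_function(k, v) for k, v in grouped.items())
    return output_writer(results)

# Hypothetical two-document corpus.
index = run({"d1": "hadoop runs map reduce", "d2": "pig runs on hadoop"})
```

The partition function uses a byte sum rather than Python's built-in `hash` so that the same key lands on the same reducer on every run.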

Map/Reduce -> Hadoop

Google calls it…    Hadoop equivalent…
MapReduce           Hadoop MapReduce
GFS                 HDFS
Sawzall             Hive, Pig
BigTable            HBase
Chubby              ZooKeeper

Hadoop

- Java Map/Reduce implementation
- Framework that schedules tasks, monitors them, and re-executes the failed ones
- Single master JobTracker
- Several slave TaskTrackers, one per node
- Hadoop DFS (not explicitly required)
- Add-ons: Hive (Facebook dev), Pig (Yahoo!)

Hadoop example

- A program that takes web server access log files and counts the number of hits in each minute slot over a week
- Differentiate input & output phases: Map & Reduce
  - Map phase: access log files
  - Reduce phase: key set + iterator over each key subset
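The original Java code for this example is not reproduced here, so the following is only a Python sketch of the same idea. It assumes the log lines carry timestamps in the common `[10/Oct/2000:13:55:36 -0700]` form; the names `map_phase`, `reduce_phase` and the sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: timestamps look like [10/Oct/2000:13:55:36 -0700].
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]")

def map_phase(log_lines):
    """Emit (minute_slot, 1) for each line that carries a timestamp."""
    for line in log_lines:
        match = TIMESTAMP.search(line)
        if match:
            yield match.group(1), 1

def reduce_phase(pairs):
    """Group by minute slot, then sum the values of each key subset."""
    grouped = defaultdict(list)
    for minute, hit in pairs:
        grouped[minute].append(hit)
    return {minute: sum(hits) for minute, hits in grouped.items()}

# Hypothetical sample lines, not real log data.
SAMPLE_LOG = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:59 -0700] "GET /b HTTP/1.0" 200 512',
    '127.0.0.1 - - [10/Oct/2000:13:56:02 -0700] "GET /a HTTP/1.0" 404 0',
]
hits_per_minute = reduce_phase(map_phase(SAMPLE_LOG))
print(hits_per_minute)
```

The key is the timestamp truncated to the minute, so every hit within the same minute falls into the same key subset seen by the reduce phase.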

Hadoop Example
[Diagram: Input -> Map -> Reduce -> Output]

Hadoop Example – Map Phase

Hadoop Example – Reduce Phase

Hadoop Example – Main code

Problems

Hadoop Map/Reduce is very powerful, but…
- Requires a Java programmer
- User has to reinvent the wheel every time a functionality is needed (join, filter, etc.)
- Harder to write, harder to maintain
- Optimization is left to the user

Pig

- Platform for analyzing large data sets
- High-level language + infrastructure (compiler)
- Pig Latin
  - Data flow language rather than procedural or declarative
  - Ease of programming
  - Optimization opportunities
  - Extensibility

Pig - Advantages

- Increases productivity. In one test:
  - 10 lines of Pig Latin ≈ 200 lines of Java.
  - What took 4 hours to write in Java took 15 minutes in Pig Latin.
- Opens the system to non-Java programmers.
- Provides common operations like join, group, filter, sort.

Pig – How it works

Pig Example

Start a terminal and run:
    $ cd /usr/share/cloudera/pig/
    $ bin/pig -x local

You should see a prompt like:
    grunt>

Pig – Example - Aggregation

Let's count the number of times each user appears in a given data set.
    log = LOAD 'excite-small.log' AS (user, timestamp, query);
    grpd = GROUP log BY user;
    cntd = FOREACH grpd GENERATE group, COUNT(log);
    STORE cntd INTO 'output';

Results:
    002BB5A52580A8ED    18
    005BD9CD3AC6BB38    18
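For readers who do not know Pig Latin, the same aggregation can be sketched in plain Python. This assumes the records are tab-separated (user, timestamp, query) lines; the function name `count_by_user` and the sample rows are hypothetical and are not taken from excite-small.log.

```python
from collections import Counter

def count_by_user(lines):
    """Mirror of LOAD / GROUP BY user / COUNT on tab-separated records."""
    counts = Counter()
    for line in lines:
        user, timestamp, query = line.rstrip("\n").split("\t", 2)
        counts[user] += 1
    return counts

# Hypothetical rows, not taken from excite-small.log.
sample = [
    "userA\t970916105432\tpig latin",
    "userB\t970916105501\thadoop",
    "userA\t970916105610\tmap reduce",
]
cntd = count_by_user(sample)
print(dict(cntd))
```

The Pig version expresses the same grouping and counting in four statements and leaves the parallel execution to the platform.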

Pig

Supports several functions
- Aggregation
- Grouping
- Filtering
- Ordering
- Joins & Anti-Joins
- Cogrouping (grouping generalization)
- Several data types:

Pig - Commands
Pig Command      What it does
load             Read data from file system.
store            Write data to file system.
foreach          Apply expression to each record and output one or more records.
filter           Apply predicate and remove records that do not return true.
group/cogroup    Collect records with the same key from one or more inputs.
join             Join two or more inputs based on a key.
order            Sort records based on a key.
distinct         Remove duplicate records.
union            Merge two data sets.
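To make a few of these commands concrete, here is a small in-memory Python analogue of filter, distinct, join, order and union. It is only a sketch of what each operation means, not of how Pig executes it; the records and variable names are hypothetical, and the join assumes each user appears once in `users`.

```python
# Hypothetical records: (user, query) and (user, country).
queries = [("u1", "pig"), ("u2", "hadoop"), ("u1", "pig"), ("u3", "")]
users = [("u1", "ES"), ("u2", "US")]

# filter: keep only the records for which the predicate is true.
non_empty = [rec for rec in queries if rec[1] != ""]

# distinct: remove duplicate records (order of first appearance kept).
unique = list(dict.fromkeys(non_empty))

# join: pair records from both inputs that share the same key.
country_of = dict(users)
joined = [(user, query, country_of[user])
          for user, query in unique if user in country_of]

# order: sort records based on a key (here, the query text).
ordered = sorted(joined, key=lambda rec: rec[1])

# union: merge two data sets (plain concatenation in this sketch).
merged = queries + [("u4", "hive")]

print(ordered)
```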

Possible Applications at vLex

- Faster and improved, parallelized document indexing
- Targeted advertisement
- Recommendation system
- Trending topics
- Better search tools (Search assist)

Questions?
