
BIG DATA TRAINING

Concepts, Use Cases, Architecture, Hands-on


Training & Exercises
Course Overview and Expectations
This training provides a comprehensive overview of Big Data and an in-depth discussion of Big Data from the perspective of current market trends. The training will go over key Big Data use cases in financial services, Big Data concepts and the Hadoop technology architecture used to address Big Data issues.

Upon completion of this training, participants should be able to:
o Understand what Big Data means and how Big Data issues are manifesting themselves in financial services.
o Understand the Big Data reference architecture.
o Get an in-depth understanding of the Hadoop architecture platform for Big Data.
o Understand how to build solutions for Big Data problems
o Hands-on training on core Big Data concepts such as MapReduce, Pig, and Hive, etc.

Day 1 Duration
Breakfast 8:30 - 9:00 AM
Big Data Introduction and Concepts 9:00 - 11:00 AM
Break 10:45 - 11:00 AM
Big Data Reference Architecture 11:00 - 12:00 PM
Lunch Break 12:00 to 1:00 PM
Introduction to HDFS and MapReduce 1:00 - 2:00 PM
Pig, HBase 2:00 - 4:00 PM
Break 3:30 - 3:45 PM
Hive, Sqoop 3:45 - 5:00 PM

Day 2 Duration
Industry use cases end to end 10:00 - 11:30 AM
Break 11:30 - 11:45 AM
Hadoop Installation & HDFS Hands-on 11:45 - 1:00 PM
Lunch Break 1:00 to 2:00 PM
Map Reduce Deep Dive 2:00 - 3:00 PM
Map Reduce Hands on development & Deployment 3:00 - 4:30 PM
Break 4:30 - 4:45 PM
Map Reduce Hands on development & Deployment 4:45 - 6:00 PM

Day 3 Duration
Breakfast 8:30 - 9:00 AM
PIG deep dive Theory 9:00 - 10:00 AM
PIG hands on 10:00 - 10:30 AM
Break 10:30 - 10:45 AM
Pig hands on, Hive deep dive Theory 10:45 - 12:00 PM
Lunch Break 12:00 to 1:00 PM
Hive hands on 1:00 - 2:30 PM
Break 2:30 - 2:45 PM
Hadoop administration & troubleshooting 2:45 - 4:00 PM
Training Conclusion 4:00 - 4:30 PM

Page 2
Big Data Introduction & Concepts (Day 1)

Page 3
What is Big Data?
Big Data constitutes datasets that become so large as to render themselves unmanageable using traditional data platforms, e.g. RDBMS, flat file systems, OO databases etc. This unmanageability stems from the complexity in capture, storage, search, retrieval, sharing and analytics of these datasets due to their sheer size.

[Figure: the three dimensions of Big Data: VOLUME, VARIETY, VELOCITY]
o Data sources shown: Click Stream, Active/Passive Sensor, Log, Event, Printed Corpus, Speech, Social Media, Traditional
o VARIETY: Unstructured, Semi Structured, Structured
o VELOCITY: Velocity of Generation, Velocity of Analysis (How Fast is Fast? Why Does it Matter?)
Page 4
Who Uses BIG Data?

Page 5
Market Trends in Data... What is the Context?
Neil Armstrong lands on the moon with 32KB of data (1969)
Google processes 24PB of data every day (2010) ~ 240K 100GB hard drives

So What is a PetaByte?
o 1 PB = 1 000 000 000 000 000 bytes = 1 million gigabytes = 1 thousand terabytes
o Large Hadron Collider produces ~ 15 PB per year
o The movie Avatar took ~ 1 PB of storage for rendering 3D CGI graphics
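
The hard drive comparison above is plain arithmetic. A minimal check, assuming decimal (SI) units as in the 1 PB definition; the variable names are illustrative only:

```python
# Decimal (SI) units, matching "1 PB = 1 000 000 000 000 000 bytes".
GB = 10**9
TB = 10**12
PB = 10**15

daily_volume = 24 * PB     # daily processing volume quoted above (2010)
drive_size = 100 * GB      # one 100GB hard drive

drives_needed = daily_volume // drive_size
print(drives_needed)       # number of 100GB drives filled per day

print(PB // GB, PB // TB)  # gigabytes per petabyte, terabytes per petabyte
```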

Twitter ~ 7TB
Facebook ~ 10B
Bank of America ~ Who knows? Why don't we know? Because they don't have the ability to store and process all the unstructured / semi-structured data their eco-system generates. But can they use it?

Spending 80% of your life with 20% of the data!!!


o 80% of new world data generated is unstructured and semi-structured
o Relational structures | strict schemas | canonical encodings will persist, but will progressively constitute less and less of the market data share
o Analysis requirements therefore are shifting towards unstructured and semi-structured data
o Traditional analysis platforms cannot scale to meet the new data paradigm

How Fast is Fast?


o Velocity of data generation is increasing exponentially: proliferation of automatic sensors in software and hardware
o Necessitates proportional increase in the velocity of data analysis
   Streams Computing: Analytics in Motion (on-the-wire)
   Analytics at Rest: Partial Today

[Figure: Velocity of Data Generation -> Velocity of Data Flow -> Restful Data]

o WHY need Velocity of Analytics? Finding arbitrage opportunities in capital markets before asset prices balance

Page 6
What is Big Data Used For?
o Search
Yahoo, Google, Amazon, Zevents

o Log Processing
Yahoo, Facebook, Google, Twitter, LinkedIn
o Recommendation Systems
Yahoo, Facebook, Google, Twitter, LinkedIn

o Data Warehousing
Facebook, AOL

o Sentiment Analysis
Analyze TB volumes of emails and transaction logs against an a priori established sentiment ML model to gauge which customers are likely to leave, then offer an incentive program to reduce customer attrition

o System Outage Prediction


Analyze application logs and join with call center data to predict system outages.

o Security Profiling
Collect event data from the enterprise event cloud and run it against an a priori established ML model to detect security breaches and unforeseen correlations between enterprise events.

o Anti - Money Laundering | Fraud Analysis

o Correlation | Link Analysis


Google, Facebook, LinkedIn, eHarmony

o Multimedia Analysis
New York Times, Veoh

Page 7
Confluence of Influences
80% of new data generated is either completely unstructured or semi-structured. Much of this data is being generated at very high velocity, which presents tremendous challenges for storage, search, retrieval and analysis. Traditional data platforms and analysis capabilities are unable to meet these evolving demands!

[Figure: capabilities that converge on Big Data]
o Analytics-in-Motion - High Velocity of Data Generation
o Consolidate ALL Enterprise Data in a Single Place - Email, Chat, Voice, Logs, IVR, Image, Video
o Open Data Format / Flexible Schema
o Deep Predictive Analysis On ALL Data - Unstructured / Semi Structured Data
o Extremely High Throughput - Massively Parallel & Fault Tolerant
o Store All Your Data, Never Delete
o No Specialized Hardware Required - Low Cost Per Compute
o Massive Datasets - 100s of GBytes!!
o Complete Support for Integration with Structured Data - DW, ODS, Transaction Data

Page 8
Digression But Illustrative of Big Data Problems!!!
Building extremely large scale search engines or equivalently processing 1000s of terabytes of data in financial
services requires a large number of machines and a complex (aka costly) and scalable processing engine.

Yahoo! Use Case

Yahoo! Search architecture consists of four primary components


Crawler: Crawls the web and downloads pages from web servers across the entire net.
WebMap: Builds a graph of the known web. A graph is an interconnected mesh of web pages as they reference
each other.
Indexer: Builds a reverse index to the best pages for quick search retrieval across multiple facets.
Runtime: Responds to real time queries.

So what is the complexity?


A daily execution of a Yahoo! net crawl yields a graph of 1 trillion (10^12) edges and 100 billion (10^11) nodes.
Each edge represents a web link.
Each node represents a distinct URL.
Analyzing this graph would require traversing each leg of this graph, and the graph has 1 trillion legs!!!
The analysis has to be done in finite time; typically less than a day.

So what kind of platform and framework would be needed to enable this kind of computation?
1) A large number of machines/nodes running in parallel.
2) Logically dividing the analysis in smaller chunks of work that can be processed in parallel.
3) Recombining the smaller pieces of work into a cohesive whole at the end.
4) A fault tolerant platform that can survive node failures.
5) A latency sensitive platform that can work on local data instead of requiring remote data.

Page 9
Who is Hadoop?
Hadoop is an open source software framework that supports very large scale, data intensive processing (Hadoop, 2013)[1]. It is composed of two primary components: 1. Hadoop Distributed File System (HDFS) and 2. MapReduce.

Hadoop Distributed File System (HDFS)
o Very large distributed file system (10K+ nodes, 100+ million files, 10 - 100 PB)
o Runs on commodity hardware. (Does not require expensive, highly reliable hardware)
o Complete fault tolerance and failover. (Fault tolerance is built into the framework)
o Scale to hundreds of nodes in a cluster.
o Data localization.
o Computation moves to data.
o Provides very high aggregate bandwidth.
o Provides support for streaming data access. (Write once read many times)

HDFS does not work well for the following use cases:
o Lots of small files
o Multiple writers
o Arbitrary file writes
Note: The last two limitations, i.e. multiple writers and arbitrary file writes, are being addressed in the new release of Hadoop

MapReduce
o Programming model for large scale data processing
o Parallel programming model.
o Breaks data processing problems into a Map phase and a Reduce phase.
o Each Map process works on a small subset of data ideally local to the process. (Data localization is the responsibility of HDFS not MapReduce)
o MapReduce processes data as key-value pairs.
o HDFS takes care of splitting and moving data to MapReduce functions.
o Can be represented in any language of choice e.g. Java, C++, Python, Shell etc.
o High order abstraction languages such as Pig Latin have been created to hide lower level details.

Page 10
Why Hadoop?
Which Data Problems are Hadoopable?
Analysis of structured, semi-structured and unstructured datasets from a variety of data sources.
When the entire population dataset requires analysis instead of merely sampling a subset and extrapolating results.
Ideal for iterative and exploratory analysis when business measures on data are not predetermined.

Coexisting Data Profiles - DW IT Investment is Not Lost

In many cases the value-per-byte of data is unknown until after money is spent storing it in the DW. Hadoop removes this uncertainty.

[Figure: Hadoop alongside the DW and its data marts (DM); High Value Data flows to the DW]

Hadoop
o Analyze all data (Structured, unstructured, semi-structured)
o Inherent data discovery and data value analysis
o Analytics-at-rest & analytics on-the-wire
o Multiple disparate data sources
o Store all data (retain fidelity of transactions, logs, posts etc.)
o Store data in native object format
o Flexible or no data transport encoding
o Low cost-per-compute
o Minimal performance concerns due to massive parallelism

DW
o Cleansed, enriched, matched data
o Structured data analysis
o Analytics-at-rest
o Produces insight with known and stable measurements
o Defined based on pre-determined corpus of questions
o Inflexibility in structure due to rigid data structure design
o Rigorous data quality controls
o Performance envelope constrained due to functional limits
o High cost-per-compute
o High value-per-byte
o Data retained based on perceived business value

Page 11
A Little Bit of History
2002 - 2003
Apache Nutch
o Open source web search engine.
o Building one which can index 1 billion web pages is ambitious and costly at the very least.
o Doug Cutting wrote a Nutch crawler but the architecture would not scale to a billion pages.
Google
o Meanwhile Google's site index is growing exponentially based on Sergey and Larry's PageRank algorithm.
o Query semantics and composition are becoming increasingly complex.
o They are facing similar scale issues as Nutch. Their Oracle RAC is just not scaling.
o At this time Google is looking for a technological miracle, else a possible demise.
2004
o By Dec 2004 Google has published its seminal papers on distributed computing: the Google File System (GFS, 2003) and MapReduce (Dec 2004).
o GFS would solve Google's storage needs, free up time spent on node management & enable huge indexing & crawling jobs.
o GFS runs all of Google search including all the utility functions NOW.
2005
o Doug Cutting picks up the GFS paper and implements an open source version called NDFS.
o Nutch realizes that NDFS is applicable to a wide array of computing issues beyond merely search.
2006
o NDFS is moved out as a top level Apache project and renamed to Hadoop (after the toy elephant of Doug's kid)
o Yahoo! hires Doug Cutting and adopts Hadoop as their main computing platform.
o Yahoo! implements 10GB/node sort benchmark on 188 nodes in 47.9 hours
2008
o Yahoo! wins the 1 terabyte sort benchmark in 209 seconds on 900 nodes
o Yahoo! loads 10 terabytes of data per day on a cluster of 1000 Hadoop nodes.
2009
o Yahoo! has 17 Hadoop clusters with 24,000 nodes. Wins the minute sort by sorting 500GB in 59 seconds on 1400 nodes.

Page 12
Where Can Our Clients Use Big Data?
In the new data paradigm, Big Data constitutes the fundamental enabler of value-add predictive analytics. Big Data enables analytics-at-rest and analytics-on-the-wire.

Domain: Customer Retention
   Metric: Increased customer attrition
   Problem: Customers are moving to competitors
   Question: What is the probability of a customer leaving?
   Solution: Customer sentiment analysis using unstructured data (voice, email, logs, chat) on Big Data platform using Natural Language Processing (NLP)

Domain: Fraud Analysis
   Metric: High rate of AML alerts
   Problem: Correlating high volume of potential AML alerts
   Question: How do we successfully identify a fraud network?
   Solution: Link analysis using Big Data and Machine Learning

Domain: Operational Continuity
   Metric: High rate of operational events (Errors | Alerts)
   Problem: Successfully classifying significant operational errors with high precision and developing an operational outage model
   Question: How do we successfully predict future system outages?
   Solution: Link analysis using Big Data and Machine Learning. Source ALL enterprise event data & run it against ML model

Domain: Security
   Metric: High rate of security breaches, enterprise security policy violations
   Problem: Successfully correlating security breaches and issuing alerts
   Question: How do we capture and analyze ALL enterprise events? Develop a threat profile? Issue alerts?
   Solution: Massive parallelism of enterprise event data using Big Data and Link Analysis to isolate threat profiles

Domain: Loss Management
   Metric: High rate of insurance claims
   Problem: Automated detection of insurance fraud
   Question: How do we train an automated system to detect fraud on insurance claims?
   Solution: NLP processing of unstructured text corpus against a self training fraud model

Consolidation of Enterprise Business Intelligence Capabilities
Page 13
Big Data Reference Architecture (Day 1)

Page 14
Big Data Reference Architecture
1 Connectors: Different methods to connect external source/target systems to the Hadoop platform.
   (ETL, DBMS, Middleware, BI, Analytics, Visualization)

2 Analytics: Data mining algorithms for performing clustering, regression and statistical modeling, and to implement them using the Map Reduce model.
   (Text Analytics, Machine Learning, Object Correlation, Path & Pattern Analysis)

3 Security: Kerberos Authentication, Role Based Authorization, Audit, Encryption.

4 Data Access | Pipelining | Serialization:
   HBase - A non-relational database that allows for low-latency, quick lookups and adds transactional capabilities to Hadoop.
   Hive - Allows users to write queries in a SQL-like language called HiveQL, which are then converted to Map Reduce.
   PIG - A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
   Sqoop - A connectivity tool for moving data from non-Hadoop data stores such as relational databases and data warehouses into Hadoop.
   Avro - A data serialization system that allows for encoding the schema of Hadoop files.

5 Resource Mgmt. & Orchestration:
   Yarn - Resource Manager consists of the Scheduler and the Applications Manager; resource mgmt. and job scheduling/monitoring.
   Zookeeper - Centralized service to maintain configuration information, provides distributed synchronization and group services.
   Oozie - Workflow processing system that lets users define a series of jobs written in multiple languages such as Map Reduce, Pig and Hive, then intelligently link them to one another.
   Flume - A framework for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

6 Near Real Time Access: IMDB is a database where data is stored in main memory to facilitate faster response times with near real time access. Object immutability, with every state available for real time query and retrieval.
   (In-Memory Database | Cache, Object Immutability, Graphing)

7 Map Reduce -- Data Processing: A programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A Map Reduce program comprises a Map procedure that performs filtering and sorting and a Reduce procedure that performs a summary operation. Orchestrates by marshaling the distributed servers, running the various tasks in parallel.

8 HDFS | Cassandra File System (CFS) | GPFS -- Storage: Highly fault-tolerant distributed file system designed to run on commodity hardware. HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

The stack runs on the Java Virtual Machine over the Operating System (Linux).
Page 15
Big Data Extended Eco System

Page 16
Lunch Break (Day 1)

Page 17
Introduction to HDFS & MapReduce (Day 1)

Page 18
HDFS Architecture Overview
Hadoop Distributed File System Architecture (Borthakur, 2013)[2]

[Figure: HDFS cluster. The Client sends metadata operations to the NameNode (master), which holds the block map, and performs block operations (read, write) directly against the DataNodes (workers). DataNodes are grouped into racks, and each HDFS block is replicated across DataNodes.]

Page 19
HDFS Architecture What is an HDFS Block?
A Block is the most basic level of persistent storage, more basic than a file. In fact, a file is composed of one or more blocks.
Alternatively, you can think of a Block as the minimum amount of data that can be read from or written to a storage platform.

All filesystems (e.g. NFS, NTFS, FAT, Apple HFS) are designed using the block paradigm
Most filesystem blocks are usually a few KB
HDFS is also organized using blocks
Files in HDFS are broken down into equal sized data blocks
An HDFS block is >= 64MB. That is a very large data block

Why is the HDFS data block so large?

o To minimize the cost of seeks
o Map tasks usually operate on a single block.
o Making the block size too small would result in too many Map tasks and wasted memory
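
To make the seek-cost argument concrete, here is a rough sketch in Python. The 10 ms seek time and 100 MB/s transfer rate are assumed, illustrative disk figures (not from the training material), and the function names are hypothetical.

```python
import math

MB = 1024 * 1024

def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return math.ceil(file_size_bytes / block_size_bytes)

def seek_overhead(block_size_bytes, seek_time_s=0.010, transfer_rate_bps=100 * MB):
    """Fraction of the time to read one block that is spent seeking (assumed disk figures)."""
    transfer_time = block_size_bytes / transfer_rate_bps
    return seek_time_s / (seek_time_s + transfer_time)

one_gb_file = 1024 * MB
for block_size in (4 * 1024, 64 * MB):   # a typical filesystem block vs. an HDFS block
    print(block_size,
          block_count(one_gb_file, block_size),
          round(seek_overhead(block_size), 4))
```

Under these assumptions a 1 GB file needs only 16 blocks of 64MB, so few Map tasks are created and almost all of the read time is spent transferring data rather than seeking.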

Block based storage architecture allows HDFS to enable the following (Borthakur, 2007)[3]:
o Blocks from same file can be stored on different servers allowing cluster based fault tolerance
o Blocks can be replicated to enable high availability
o Popular files can be set to a high replication factor (meaning replicate blocks to more servers) to enable load balancing and high
throughput
[Figure: Clients 1, 2 and 3 reading from three DataNodes (workers). Blocks 1 & 2 have a high replication factor and are replicated across DataNodes.]


Page 20
HDFS Architecture NameNode The Master
Manages the filesystem namespace. This means all CRUD operations against DataNode(s) are managed by the NameNode

o Namespace is a name that uniquely identifies an HDFS instance e.g. hdfs://ernst&young/stas/ali/filename


o HDFS namespaces are defined using URIs

Maintains the filesystem tree


Maintains metadata for all the files and directories for an HDFS cluster
Knows all the DataNodes in the HDFS cluster
Holds the in-memory location of each data block for all files within an HDFS cluster

o Since block locations change during the life of an HDFS cluster, they are not persistently held by the NameNode
o In-memory caching of block locations imposes a limitation on the number of files that can be stored in an HDFS cluster

Clients access the NameNode when they want to read or write a file. The NameNode directs client requests to the appropriate DataNode(s)
[Figure: NameNode (master) block map for DataNode A and DataNode B]
Name  | Type | Server | Block
File1 | File | A      | 1 - 2
File1 | File | B      | 3 - 4
File2 | File | A      | 1 - 2
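
The bookkeeping described above can be sketched as two lookup tables. This is a toy model in Python, not the real Hadoop classes; the class name, method names and block ids are hypothetical. It assumes block locations are rebuilt from reports sent by the DataNodes, which is why they need not be persisted.

```python
class NameNodeMetadata:
    """Toy model of NameNode bookkeeping (illustrative only)."""

    def __init__(self):
        self.file_to_blocks = {}   # namespace: file name -> ordered block ids (persisted)
        self.block_to_nodes = {}   # block id -> set of DataNodes (held in memory only)

    def add_file(self, name, block_ids):
        self.file_to_blocks[name] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """A DataNode reports the blocks it holds; locations are rebuilt from these."""
        for block_id in block_ids:
            self.block_to_nodes.setdefault(block_id, set()).add(datanode)

    def locate(self, name):
        """Return [(block id, sorted DataNode names)] for every block of a file."""
        return [(b, sorted(self.block_to_nodes.get(b, ())))
                for b in self.file_to_blocks[name]]

meta = NameNodeMetadata()
meta.add_file("File1", ["File1-1", "File1-2", "File1-3", "File1-4"])
meta.add_file("File2", ["File2-1", "File2-2"])
meta.block_report("DataNodeA", ["File1-1", "File1-2", "File2-1", "File2-2"])
meta.block_report("DataNodeB", ["File1-3", "File1-4"])
print(meta.locate("File1"))
```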
Without NameNode an HDFS cluster cannot function

Even though Hadoop provides fault tolerance; how do you make Hadoop itself fault tolerant?
o Make NameNode resilient by writing persistent state of HDFS cluster to a remote filesystem e.g. NFS
o Run a secondary NameNode

Page 21
HDFS Architecture Anatomy of an HDFS Read
[Figure: HDFS read path between the Client JVM/Node, the NameNode (master), DataNode A (worker, blocks 1 and 2, primary proximity) and DataNode B (worker, blocks 3 and 4, secondary proximity)]
2 - Remote Procedure Call: DistributedFileSystem contacts the NameNode
3 - The NameNode returns the DataNode addresses for each block, sorted by proximity to client: http://DataNodeA (blocks 1, 2), http://DataNodeB (blocks 3, 4)
4 - Read: Client reads from FSDataInputStream
5 - Read: Data is streamed from DataNode A to the client continuously until the end of its last block is reached (Block 2); the client connection to DataNode A is then closed
6 - Read: Client connects to DataNode B and repeats the same process
7 - Close: Client closes FSDataInputStream

How does this design scale?
o Clients retrieve data directly from DataNode(s)
o NameNode merely serves up location information
o HDFS can therefore scale to multiple concurrent clients since data traffic is spread across multiple DataNode(s)


And where is the fault tolerance of Hadoop?
o If an error occurs while reading from a DataNode, HDFS will automatically redirect the request to the next closest DataNode that holds the block.
o HDFS will remember the DataNode(s) that have failed and will automatically exclude them from future client requests.
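
The failover behaviour above can be sketched as a loop over replicas in proximity order. This is a simplified illustration in Python, not the HDFS client code; `read_block`, `fake_fetch`, `DataNodeDown` and the cluster layout are all hypothetical.

```python
class DataNodeDown(Exception):
    """Raised by the (simulated) fetch when a DataNode cannot be read."""

def read_block(block_id, replicas, fetch, dead_nodes):
    """Try replicas in proximity order, skipping and recording failed nodes."""
    for node in replicas:
        if node in dead_nodes:
            continue                  # known-bad node: do not retry it
        try:
            return fetch(node, block_id)
        except DataNodeDown:
            dead_nodes.add(node)      # remembered for all later blocks
    raise IOError(f"no live replica for block {block_id}")

# Simulated cluster in which DataNodeA is down.
def fake_fetch(node, block_id):
    if node == "DataNodeA":
        raise DataNodeDown(node)
    return f"block{block_id}@{node}"

dead = set()
locations = {1: ["DataNodeA", "DataNodeB"], 2: ["DataNodeA", "DataNodeC"]}
data = [read_block(b, locations[b], fake_fetch, dead) for b in sorted(locations)]
print(data, dead)
```

Block 1 falls through to DataNodeB after the failure; for block 2 the failed node is skipped without another attempt.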

Page 22
HDFS Architecture The Concept of Node Proximity
HDFS represents the network as a Tree and the distance between two nodes is the sum of their distances to their common ancestor.

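[Figure: two data centers (Data Center 1, Data Center 2), each containing racks (Rack 1 - Rack 4) of nodes; distances measured from a block B on Node 1]
d=0  Same node (Node 1)
d=2  Different node on the same rack (Node 2)
d=4  Different rack in the same data center (Node 3)
d=6  Different data center (Node 4)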
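
The distance rule can be written down directly. A minimal sketch in Python, assuming each node is named by a hypothetical path of the form /datacenter/rack/node:

```python
def distance(a, b):
    """Tree distance: hops from each node up to their closest common ancestor."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

print(distance("/dc1/rack1/node1", "/dc1/rack1/node1"))  # same node
print(distance("/dc1/rack1/node1", "/dc1/rack1/node2"))  # same rack
print(distance("/dc1/rack1/node1", "/dc1/rack2/node3"))  # same data center
print(distance("/dc1/rack1/node1", "/dc2/rack3/node4"))  # different data center
```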

How does HDFS choose which DataNode to store block replicas on?
o The contention is between balancing a) Reliability b) Write bandwidth and c) Read bandwidth.
o First replica goes on the same node as the client.
o Second replica goes off-rack in the same data center, chosen randomly.
o Third replica goes on the same rack as the second but on a different node.
o Further replicas are placed on random nodes within the cluster.
o This strategy provides:
   Reliability - Blocks stored on two separate racks
   Write bandwidth - Writes only traverse a single network switch
   Read bandwidth - Choice of reading from two racks or more
   Block distribution - Client writes a single block on local rack

Bandwidth degradation between DataNode(s), from highest to lowest bandwidth:
o Same node
o Different node same rack
o Different rack same data center
o Different data center
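
The placement rules can be sketched as follows. This is an illustration in Python of the policy as stated on the slide, not Hadoop's actual placement code; it assumes the client runs on a cluster node and that every rack has at least two nodes. The function name and the cluster layout are hypothetical.

```python
import random

def place_replicas(client, rack_of, replication=3, rng=random):
    """Choose DataNodes for one block; rack_of maps node name -> rack name."""
    first = client                                    # 1st: the writer's own node
    off_rack = [n for n in rack_of if rack_of[n] != rack_of[first]]
    second = rng.choice(off_rack)                     # 2nd: random node on another rack
    peers = [n for n in rack_of
             if rack_of[n] == rack_of[second] and n != second]
    third = rng.choice(peers)                         # 3rd: same rack as 2nd, other node
    chosen = [first, second, third]
    others = [n for n in rack_of if n not in chosen]
    extra = min(max(replication - 3, 0), len(others))
    return (chosen + rng.sample(others, extra))[:replication]  # extras: random nodes

cluster = {"n1": "rack1", "n2": "rack1", "n3": "rack1",
           "n4": "rack2", "n5": "rack2", "n6": "rack2"}
placement = place_replicas("n1", cluster, replication=3, rng=random.Random(42))
print(placement)
```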

Page 23
HDFS Architecture Anatomy of an HDFS Write
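Replication Factor = 3
[Figure: HDFS write path. Client JVM/Node -> DistributedFileSystem -> NameNode (master); FSDataOutputStream -> Data Queue -> DataStreamer -> pipeline of DataNode A, DataNode B and DataNode C (workers), each storing blocks 1 and 2; acknowledgements are tracked in the Ack Queue.]
2 - Remote Procedure Call: DistributedFileSystem asks the NameNode to create a new file in the file system namespace, e.g. hdfs://eny/stas/ali/myfile.doc
3 - Write: Client writes to FSDataOutputStream
5 - Write Packet: DataStreamer sends each packet from the Data Queue to DataNode A
6 - Replicate Packet: DataNode A forwards the packet to DataNode B, and DataNode B to DataNode C (DataNode pipelining)
7 - Acknowledge: DataNode C acknowledges to DataNode B, and DataNode B to DataNode A
8 - Acknowledgement Packet: the acknowledgement packet is returned to the client's Ack Queue
9 - Close: Client closes FSDataOutputStream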

So what happens when all of this goes to hell? When a DataNode dies !!
o HDFS closes the pipeline
o Packets in the Ack. Queue are added to the front of Data Queue (Why?)
o Current blocks on the good DataNodes are assigned new identity and communicated to NameNode
o Partial blocks on failed DataNodes are deleted when the node comes back up
o Failed DataNode is removed from the pipeline
o NameNode notices the under-replication and assigns a new healthy DataNode for replication
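
The "(Why?)" above has a short answer: packets in the Ack Queue have been sent but not confirmed by every DataNode, so they must be resent first, and in their original order, for no data to be lost. A minimal sketch of the two queues in Python (function names and packet labels are hypothetical; this is not the HDFS client implementation):

```python
from collections import deque

data_queue = deque(["p1", "p2", "p3", "p4", "p5"])  # packets waiting to be sent
ack_queue = deque()                                  # sent, but not yet acknowledged

def send_next():
    """Move the next packet from the Data Queue into the pipeline."""
    packet = data_queue.popleft()
    ack_queue.append(packet)
    return packet

def acknowledge():
    """Every DataNode in the pipeline confirmed the oldest in-flight packet."""
    return ack_queue.popleft()

def on_datanode_failure():
    """Requeue all unacknowledged packets at the front, keeping their order."""
    while ack_queue:
        data_queue.appendleft(ack_queue.pop())

send_next()
send_next()
send_next()              # p1, p2 and p3 are now in flight
acknowledge()            # p1 is safely stored on all replicas
on_datanode_failure()    # p2 and p3 may not have reached every DataNode
print(list(data_queue))
```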

Page 24
MapReduce (Day 1)

Page 25
What is MapReduce?
MapReduce is a programming model pioneered by Google & Yahoo to process extremely large datasets (> 100s of Giga Bytes)
MapReduce parallelizes the data processing problem into smaller chunks of work (MapReduce, 2013)[7]

[Figure: MapReduce parallelism. The dataset is divided into splits; each split feeds a Map task that emits (key, value) pairs; a Reduce task combines the (key, value) pairs from all Map tasks into the result.]

o User specifies the problem in terms of Map and Reduce tasks
o A Map task consumes raw input data and converts it into another dataset represented as tuples of key/value pairs
o A Reduce task combines data from multiple Map tasks into a smaller set of tuples

Map: Input data -> <key, value>
Reduce: <key, value> -> <result>
The underlying Hadoop runtime automatically provides the following capabilities:
o Slicing of large datasets into smaller equally sized splits
o Merging of Map results into fewer inputs for Reduce tasks
o Fault tolerance and automatic failover
o Automatic data movement between Map and Reduce tasks
o Automatic data replication between DataNode(s)
o Localized processing by moving Map & Reduce tasks to where the data resides
o Load balancing of the base application
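
The Map and Reduce contract can be illustrated in a single process. The sketch below is plain Python, not the Hadoop API: `run_job` is a hypothetical stand-in for the runtime, and word counting is used only because it is the simplest job to follow.

```python
from collections import defaultdict

def map_words(line):
    """Map: one input record -> list of (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(key, values):
    """Reduce: all values for one key -> a single result."""
    return key, sum(values)

def run_job(splits, mapper, reducer):
    """Stand-in for the runtime: run the mapper per split, group by key, reduce."""
    grouped = defaultdict(list)
    for split in splits:                      # each split would get its own Map task
        for record in split:
            for key, value in mapper(record):
                grouped[key].append(value)    # shuffle: group values by key
    return dict(reducer(k, v) for k, v in sorted(grouped.items()))

splits = [["big data big"], ["data moves to compute"], ["compute moves to data"]]
result = run_job(splits, map_words, reduce_counts)
print(result)
```

On a real cluster the grouping step moves data across the network between Map and Reduce tasks; here it is just a dictionary.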

Page 26
MapReduce Architecture How Does the Data Flow?
[Figure: MapReduce data flow. The NameNode (master) and Job Tracker handle MapReduce job management; a Task Tracker runs on each DataNode.]
o Map side (DataNodes 1 - 3): each split of the source dataset (Split 1, Split 2, Split 3) feeds one Map task, whose output is sorted and written as partitions
o Reduce side (DataNodes 5 and 6): partitions from the Map tasks are merged and passed to a Reduce task
o Each Reduce result (Result 1, Result 2) is written to HDFS and replicated

MapReduce Task Distribution


o 1 Split = 1 Map = 1 Processor
o Fine grained splits improve performance and quality of load balancing
o Map tasks utilize data locality optimization
o Map tasks partition their output into multiple segments each segment targeted for a different Reducer
o Reduce output is stored on HDFS as per the principles of node proximity
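
Partitioning of Map output is typically done by hashing the key, so that every occurrence of a key reaches the same Reducer. A minimal sketch in Python; `zlib.crc32` is used here only as a stable stand-in hash, and the function names are hypothetical:

```python
import zlib

def partition(key, num_reducers):
    """Pick the Reducer for a key: stable hash modulo the number of Reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_map_output(pairs, num_reducers):
    """Split one Map task's output into one segment per Reducer."""
    segments = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        segments[partition(key, num_reducers)].append((key, value))
    return segments

map_output = [("MSFT", 32.20), ("AAPL", 544.21), ("XRX", 8.25),
              ("MSFT", 31.87), ("AAPL", 543.90), ("AAPL", 521.36)]
segments = partition_map_output(map_output, num_reducers=3)
for reducer_id, segment in enumerate(segments):
    print(reducer_id, segment)
```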

Page 27
Making MapReduce Real Example from Capital Markets
New York Stock Exchange executes 1.1 billion trades per year. How do you find valuable information in such a large dataset?

Problem Statement I - Find the highest traded stock price for each company registered on NYSE within a given year
Problem Statement II - Find the spread between trade prices that are within 1, 2 and 3 for each listing on NYSE

MapReduce tasks are defined in terms of (key, value) pairs (Stock Ticker, Trade Price)

MapReduce Parallelism

Original dataset (1.1 billion tuples), divided into split(s):
MSFT 32.20
AAPL 544.21
XRX 8.25
MSFT 31.87
AAPL 543.90
AAPL 521.36

Each split is handled by one Map task (tickers A - H, I - P, Q - Z). Map output, grouped by key:
Key   | Value
AAPL  | (544.21, 543.90, 521.36)
MSFT  | (32.20, 31.87)

Reduce output:
Key   | Value
AAPL  | 544.21
MSFT  | 32.20
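
Problem Statement I maps onto this model directly: the key is the stock ticker, the value is the trade price, and the Reduce step keeps the maximum. A single-process sketch in Python using the six sample trades from the slide (function names are illustrative):

```python
from collections import defaultdict

trades = [("MSFT", 32.20), ("AAPL", 544.21), ("XRX", 8.25),
          ("MSFT", 31.87), ("AAPL", 543.90), ("AAPL", 521.36)]

def map_trade(trade):
    """Map: emit (Stock Ticker, Trade Price)."""
    ticker, price = trade
    return ticker, price

def reduce_max(ticker, prices):
    """Reduce: keep the highest price seen for one ticker."""
    return ticker, max(prices)

grouped = defaultdict(list)
for ticker, price in map(map_trade, trades):
    grouped[ticker].append(price)          # group values by key

highest = dict(reduce_max(t, p) for t, p in grouped.items())
print(highest)
```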

Page 28
Sorting in MapReduce Adding Velocity to Data Processing
The ability to sort data is at the heart of MapReduce. It helps to organize data and improves the data processing speed in MR.

How do we create balanced splits that feed into Mappers? - (Consider the NYSE example)

Even splits are important so that no one Mapper can dominate the overall job time.
The overall job is as fast as the slowest Mapper or Reducer

Understand the distribution of data in the source before split segmentation

Sort Algorithm in MapReduce

Source: 1.1 billion tuple dataset, e.g. MSFT 32.20, AAPL 544.21, XRX 8.25, MSFT 31.87, AAPL 543.90, AAPL 521.36

Approach I - Count every occurrence of a ticker in the source, e.g. count(MSFT) = 100K, count(AAPL) = 230K, count(XRX) = 400K, then perform split segmentation
o Brute force!
o High latency! Requires full source dataset scan
o Cost prohibitive

Approach II - Sample a subset of tuples to estimate the population key distribution
o Hadoop InputSampler
o Even split segmentation
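
Approach II can be sketched as follows: sort a small random sample of the keys and take its quantiles as split boundaries. This is a Python illustration of the idea, not the Hadoop InputSampler API; the function names, the numeric sort keys and the sample size are assumptions made for the example.

```python
import bisect
import random

def sample_split_points(keys, num_splits, sample_size, seed=0):
    """Estimate split boundaries from a random sample instead of a full scan."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced quantiles of the sample become the boundaries.
    return [sample[len(sample) * i // num_splits] for i in range(1, num_splits)]

def assign_split(key, split_points):
    """Index of the split a key falls into."""
    return bisect.bisect_right(split_points, key)

rng = random.Random(1)
prices = [round(rng.uniform(1, 600), 2) for _ in range(10_000)]  # stand-in sort keys
points = sample_split_points(prices, num_splits=4, sample_size=500)

sizes = [0] * 4
for price in prices:
    sizes[assign_split(price, points)] += 1
print(points, sizes)
```

Only 500 of the 10,000 keys are examined, yet the four splits come out roughly equal in size, and every key in split i sorts before every key in split i + 1.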

Page 29
Elaborating MapReduce Sort Why is Sorting Important?
Why is Sorting Important?
o Improves data processing speed by producing more even splits to be distributed across multiple Mappers
o Facilitates joining multiple datasets together for improved analysis capabilities
o Produces a globally sorted output that is consumable by downstream Mappers and Reducers

Steps in MapReduce Sort


This approach preserves the total order of the population dataset:
1) Start with the unsorted population dataset (P)
2) Sample the population (n)
3) Estimate the key distribution
4) Determine the number of splits
5) Each split is sorted (A to Z) by a Map task
6) Reduce combines the sorted Map outputs into the sorted population dataset

Page 30
Pig & HBase (Day 1)

Page 31
Introducing Pig
What is Pig ? (Pig, 2013)[8]
o High-level data processing language
o Hadoop extension that simplifies Hadoop programming
o 40% of all Yahoo jobs are run using Pig. Twitter is another well known user of Pig

Why call it Pig ? There is a reason that we will get to later

Why use Pig ?


o Programming in MapReduce is Non-Trivial
o Pig simplifies Hadoop programming by providing a data processing language
o Provides data structures that are much richer, with powerful transformation capabilities

Pig has two major components:


o Pig Latin - A high-level data processing language
o Execution Environment - Translates Pig Latin scripts into MapReduce jobs and then runs them.

Execution Modes: Pig has two execution types or modes: local mode and Hadoop mode

Local Mode (Pig Latin -> Single JVM Runtime)
o Single JVM
o Accesses local file system
o Suitable for small data sets

Hadoop Mode (Pig Latin -> Hadoop Cluster)
o Runs on Hadoop cluster
o Suitable for large data sets
o Queries are translated to MapReduce jobs

Page 32
Where Does a Pig Fit ?
Data processing usually involves three higher level tasks (Data Collection, Data Preparation & Data Presentation)

Data Collection -> Data Preparation (ETL or Data Factory) -> Data Presentation (Data Warehouse)

Data Factory Use Cases:

o Pipeline - Bring in a data feed, then clean and transform it. An example is logs from Yahoo web servers. Logs undergo a cleaning step where bots, company internal views and clicks are removed.
o Iterative Processing - Typical processing on a large data set involves bringing in small new pieces of data that will change the state of the large data set
o Research - Quickly write a script to test a theory or gain deeper insight by combing through petabytes of data

Page 33
Pig Latin The Language
Pig Latin provides a higher order abstraction for implementing MapReduce jobs. It constitutes a data flow language which is
made up of a series of operations and transformations that are applied to the input data

Key features of Pig Latin


o Ease of programming - Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as
data flow sequences, making them easy to write, understand, and maintain.
o Optimization opportunities - The way in which tasks are encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather than efficiency.
o Extensibility - Users can create their own functions to do special-purpose processing.

Pig Latin Statement


o Pig Latin statements are the basic constructs you use to process data using Pig.
o A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
o Pig Latin statements may include expressions and schemas.

How are Pig Latin Statements Organized?
Pig Latin statements are organized as a sequence of steps such that each step represents a transformation applied to some data:
LOAD - Load statements read data from the file system
TRANSFORMATION - Transformation statements process data read from the file system
DUMP / STORE - DUMP statement to display results; STORE statement to save the results

Page 34
Understanding Pig Data Model
Pig Latin is a relatively simple language that executes statements

Bag -> Pig Latin Statement -> Bag
o Bag (input): A bag is a relation, similar to a table that you'll find in a relational database (where tuples represent the rows, and individual tuples are made up of fields)
o Pig Latin Statement: A statement is an operation that takes input (such as a bag, which represents a set of tuples) and emits another bag as its output.
o Bag (output): The output is another bag with the result set of the processing

Pig Latin has 3 Complex Data Types and 1 Simple Data Type (Atom - a simple atomic value such as a string or number)

TUPLE - (5,Big Data,-2)
   A tuple is an ordered set of fields. It's represented by fields separated by commas, all enclosed by parentheses.
BAG - {(5,Big Data,-2), (6.5,PIG,10)}
   A bag is represented by a collection of tuples separated by commas, all enclosed by curly brackets. Tuples in a bag aren't required to have the same schema or even have the same number of fields.
MAP - [key # value]
   A map is a set of key/value pairs. Keys must be unique and be a string. The value can be any type.

Page 35
Pig Data Model Expressions

tup = (puppy, {(wagging,1), (chewing,2)}, [age#4])

Let the fields of the tuple tup be called t1, t2, t3

Expression Type          Example                                  Value for tup
Field by position        $0                                       puppy
Field by name            t3                                       [age#4]
Projection               t2.$0                                    {(wagging), (chewing)}
Map Lookup               t3#'age'                                 4
Functional Evaluation    SUM(t2.$1)                               1 + 2 = 3
Conditional Expression   t2.$0.$0 == 'wagging' ? 'Dog' : 'cat'    Dog
Flattening               FLATTEN(t2)                              wagging,1
                                                                  chewing,2

The table above shows the expression types in Pig Latin and how they operate. The Pig data model is very flexible and
permits arbitrary nesting.
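To make the expression table concrete, here is a minimal sketch that mimics each expression type in plain Python. The mapping of Pig types to Python types (tuple to tuple, bag to list of tuples, map to dict) is an assumption made for illustration only; it is not how Pig itself is implemented.

```python
# Hypothetical Python stand-ins for the Pig data model (illustration only):
# Pig tuple -> Python tuple, Pig bag -> list of tuples, Pig map -> dict.
tup = ("puppy", [("wagging", 1), ("chewing", 2)], {"age": 4})
t1, t2, t3 = tup  # field names as used on the slide

field_by_position = tup[0]                                # $0
field_by_name = t3                                        # t3
projection = [(row[0],) for row in t2]                    # t2.$0
map_lookup = t3["age"]                                    # t3#'age'
functional_eval = sum(row[1] for row in t2)               # SUM(t2.$1)
conditional = "Dog" if t2[0][0] == "wagging" else "cat"   # bincond (? :)
flattened_rows = list(t2)                                 # FLATTEN(t2): one row per tuple in the bag

print(field_by_position, projection, map_lookup, functional_eval, conditional)
```

Nesting works the same way in the sketch as in Pig: a bag can hold tuples that themselves hold bags or maps.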

Page 36
Pig Runtime

Executing Pig Programs


(Both Local and Hadoop mode)

Script
o File or command line based
o Use the pig command to run a file
o Use the pig -e option for the command line

Grunt
o Interactive shell to run commands
o Use the exec or run command

Embedded
o Run Pig programs from Java
o Same way as using JDBC to call SQL

The Grunt shell allows you to enter Pig commands manually.

o Grunt shell can be used for ad hoc data analysis or during the interactive cycles of program development
o Grunt remembers command history and can recall lines in the history buffer using Ctrl-P or Ctrl-N
o Pig programs are similar to SQL queries, and Pig provides a PigServer class that allows any Java program to
  execute Pig queries. Conceptually this is analogous to using JDBC to execute SQL queries

Pig Editors
o Pig Pen - script text editor

Page 37
Ride the Pig !!!
Example - How would you find the maximum temperature in a given year using Pig Latin

records = LOAD 'examples/sample.txt' AS (year:chararray, temperature:int, quality:int);

filtered_records = FILTER records BY temperature != 9999
    AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);

grouped_records = GROUP filtered_records BY year;

max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);

DUMP max_temp;

Decoding the Hieroglyphics !!!


1. LOAD produces a set of (year, temperature, quality) tuples that are present in the input file
2. FILTER removes records that have a missing temperature (indicated by a value of 9999) or that fail the data quality check
3. GROUP groups the filtered_records relation by year
4. FOREACH processes every row to generate a derived set of rows
5. MAX is a built-in function for calculating the maximum value of fields in a bag
6. Examine the contents using the DUMP operator
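For readers who think more easily in a general-purpose language, the same LOAD, FILTER, GROUP, FOREACH/MAX, DUMP flow can be sketched in Python. The sample rows below are made up for illustration, and the function name is hypothetical; the quality codes are the ones used in the Pig script.

```python
from collections import defaultdict

# Made-up sample rows in the (year, temperature, quality) layout of the Pig script.
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
    ("1949", 9999, 1),  # missing reading, dropped by the filter
    ("1950", 300, 2),   # quality code 2 fails the quality check
]
GOOD_QUALITY = {0, 1, 4, 5, 9}


def max_temperature_by_year(rows):
    # FILTER: drop missing readings and readings with a bad quality code.
    filtered = [r for r in rows if r[1] != 9999 and r[2] in GOOD_QUALITY]
    # GROUP ... BY year: collect the temperatures of each year into a "bag".
    grouped = defaultdict(list)
    for year, temperature, _quality in filtered:
        grouped[year].append(temperature)
    # FOREACH ... GENERATE group, MAX(...): one output row per year.
    return {year: max(temps) for year, temps in grouped.items()}


max_temp = max_temperature_by_year(records)
print(sorted(max_temp.items()))  # DUMP
```

Unlike this in-memory sketch, Pig compiles the same steps into MapReduce jobs so they can run over data that does not fit on one machine.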

Page 38
Pig Latin Programming
Statements
A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation, or a command.
For example, a DUMP operation is a type of statement, e.g. DUMP max_temp;
A command to list the files in a Hadoop file system is another example of a statement, e.g. ls /

Statement Execution
1. Each statement is parsed in turn
2. For syntax errors and other semantic problems the interpreter will halt and display an error message
3. The interpreter builds a logical plan for every relational operation, which forms the core of a Pig Latin program
4. The logical plan for the statement is added to the logical plan for the program so far, then the interpreter moves
   on to the next statement.
5. The trigger for Pig to start processing is the DUMP statement (a STORE statement also triggers processing).
6. At that point, the logical plan is compiled into a physical plan and executed.

Flow: Parsing -> Errors? -> (Yes) Display Error Message
                         -> (No) Build Logical Plan for the statement -> DUMP/STORE (trigger) -> Run Logical Plan for the Program
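The deferred execution described in these steps can be illustrated with a small Python sketch: each call only appends a step to a plan, and nothing runs until dump() is called. The LazyPlan class and its method names are hypothetical and exist only for this illustration; Pig's real logical and physical plans are far richer.

```python
class LazyPlan:
    """Toy model of a plan that is built statement by statement and run on DUMP."""

    def __init__(self, rows):
        self._rows = rows
        self._steps = []   # the "logical plan": (name, function) pairs
        self.executed = 0  # number of steps that have actually run

    def filter(self, predicate):
        # Only record the step; no data is touched yet.
        self._steps.append(("FILTER", lambda rows: [r for r in rows if predicate(r)]))
        return self

    def foreach(self, fn):
        self._steps.append(("FOREACH", lambda rows: [fn(r) for r in rows]))
        return self

    def describe(self):
        return [name for name, _step in self._steps]

    def dump(self):
        # The trigger: run every recorded step in order.
        rows = self._rows
        for _name, step in self._steps:
            rows = step(rows)
            self.executed += 1
        return rows


plan = LazyPlan([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).foreach(lambda x: x * 10)
steps_run_before_dump = plan.executed  # still 0: only the plan exists
result = plan.dump()
print(plan.describe(), steps_run_before_dump, result)
```

Deferring execution like this is what gives the system room to optimize the whole plan before any data is read.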

Comments
Pig Latin has two forms of comments.
Double hyphens are single-line comments. Everything from the first hyphen to the end of the line is ignored by the Pig Latin
interpreter. Ex: -- DUMP max_temp
C-style comments are more flexible since they delimit the beginning and end of the comment block with /* and */ markers.
They can span lines or be embedded in a single line. Ex: /* . */

Page 39
Pig Functions Beyond the Barn
Functions
1. Eval function
o Takes one or more expressions and returns another expression
o Some Eval functions are aggregate functions such as MAX, which returns the maximum value of the
entries in a bag
2. Filter function
o Removes unwanted rows
o Returns a logical Boolean result
o Example: IsEmpty, which tests whether a bag or a map contains any items
3. Comparison function
o Can impose an ordering on a pair of tuples.
4. Load function
o Specifies how to load data into a relation from external storage
5. Store function
o A function that specifies how to save the contents of a relation to external storage

User Defined Functions (UDF)


1. Pig Latin is extensible through user-defined functions (UDFs), and there's a well-defined set of APIs for writing UDFs.
2. PiggyBank - an online repository for users to share their functions

REGISTER piggybank/java/piggybank.jar;
b = FOREACH a GENERATE org.apache.pig.piggybank.evaluation.string.UPPER($0);

PIG -> UDF -> Jar File: UDFs are written in Java and packaged in a jar file

Page 40
Analyzing Logs using Pig Where Can we Use This?
Churning through voluminous log files and extracting meaningful insight. Very common for click stream data, active and
passive sensor data, exception stack-trace information etc.

Much of what Google, Yahoo, Twitter & Facebook do

Processing Log Files with Pig Latin


o REGISTER - Register PiggyBank
o DEFINE - Define Log Loader, Define Load Extractor
o LOAD - Load log files into Pig

Log File Ready for Analysis

Find Unique Website Hits per Day

o GROUP - Group records based on days
o FOREACH - Iterate to get unique number of user identifiers
o ORDER - Sort the records by unique user identifiers
o STORE - Store the final result
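The GROUP, FOREACH, ORDER sequence for unique hits per day can be sketched in Python as follows. The (day, user) records and the function name are invented for illustration, and sorting in descending order of unique users is an assumption, since the slide does not state a sort direction.

```python
from collections import defaultdict

# Invented (day, user identifier) pairs, as they might look after the log is loaded.
log_records = [
    ("2013-05-01", "alice"),
    ("2013-05-01", "bob"),
    ("2013-05-01", "alice"),  # repeat visit, counted once
    ("2013-05-02", "carol"),
]


def unique_hits_per_day(records):
    # GROUP: collect the user identifiers seen on each day.
    users_by_day = defaultdict(set)
    for day, user in records:
        users_by_day[day].add(user)
    # FOREACH: count the distinct users of each day.
    counts = [(day, len(users)) for day, users in users_by_day.items()]
    # ORDER: busiest day first (assumed direction).
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


hits = unique_hits_per_day(log_records)
print(hits)
```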
Page 41
HBase 1
Day

Page 42
Yet another Storage Mechanism?
What is a Database?
o An organized collection of data that supports a particular use (for example, storing and finding a list of conference rooms available)
o Databases are intended to be used by multiple programs and several different users at the same time
o Databases have a history of more than 50 years, with several technological advancements along the way

The Evolution of Database

1960s     Traditional Files
1970s     Hierarchical, Network
1980s     Relational
1990s     Object Oriented, Object relational
2000 +    ?

Page 43
Why Enough is not Enough?
At present, an RDBMS still makes sense in most cases
o As a Persistence layer for front end applications
o Store relational data, strong consistency (ACID properties) and Referential Integrity
o Random access for structured data
o Limited number of records

So Why is an RDBMS not enough?


o Large dataset
o Scalability story
o Store both Structured and Semi- Structured data
o There is no cost effective way to store everything.

Figure: scaling an RDBMS as enterprise data needs grow
o Enterprise Data Needs - Consistent updates, referential integrity, stored procedures, ACID properties
o Scale up ! - Normalize data, use foreign keys, use indexes
o Scale up !! - Thousands of users: add more application servers, master-slave architecture, caching layer
o Scale up !!! - Offload reads to in-memory systems
o Out of Options - Cease stored procedures, materialized views, schema de-normalization, partition data across multiple databases

Big Data Revolution - Store everything & there is no one size that fits all

Page 44
BigTable The Backdrop
A Distributed Storage System for Structured Data

Quick Facts
o Started as an internal initiative at Google Labs by 2004
o Used by many Google production applications (Google Analytics, Google Maps, Google Earth)

Motivation
o Lots of (semi) structured data at Google
o Scale is large
o Need data for offline data processing and online serving

Big Table
o Wide applicability (many Google products)
o High Availability
o Scalability - Thousands of Servers (Hundreds of TB to PB)
o High Performance (Very high reads/writes)

Page 45
Terminology and Complexity

From Google's BigTable Paper

"A BigTable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key,
column key, and a timestamp; each value in the map is an uninterpreted array of bytes."

From Hadoop wiki

"HBase uses a data model very similar to that of BigTable. Users store data rows in labeled
tables. A data row has a sortable key and an arbitrary number of columns. The table is stored
sparsely, so that rows in the same table can have crazily-varying columns, if the user likes."

Page 46
Demystifying HBase and BigTable
HBase is an open source implementation of BigTable (HBase, 2013)[4]

Sorted Map
HBase maintains maps of Keys to Values (Key -> Value). Each of these mappings is called a Cell.
Ex:1
{
  1 : Dave,
  2 : Joe,
}
The cells are sorted by the key. This allows for searching (ex: retrieve all values for which the key is between x and y),
rather than just retrieving a value for a known key.

Multi-dimensional
A map of maps. The key itself has structure. Each key consists of the following parts: row-key, column family, column
and time-stamp.
Ex:2 (the outer keys are rows; each inner map is a column family)
{
  1 : {
    Name : Dave,
    Group : STAS
  },
  2 : {
    Name : Joe,
    Group : ITAS
  }
}
(rowkey, column family, column, timestamp) -> value
Values can be of any length, no predefined names or widths.

Sparse
HBase does not follow a spreadsheet model. In HBase a given row can have any number of columns in each column family
or none at all.

Distributed
HBase and BigTable are built over distributed file systems so that the underlying storage can be spread out among an
array of independent machines. This provides a layer of protection against a node within the cluster failing.
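The "sorted, multi-dimensional, sparse map" idea can be sketched in a few lines of Python. This is a toy, single-process model written for this training; the SortedCellStore class and its methods are hypothetical and are not the HBase client API. Keys are (row, family, qualifier, timestamp) tuples kept in sorted order, so range scans and "newest version" lookups are both simple.

```python
import bisect


class SortedCellStore:
    """Toy model of a sorted map keyed by (row, family, qualifier, timestamp)."""

    def __init__(self):
        # Timestamps are stored negated so newer versions sort first.
        self._keys = []
        self._values = {}

    def put(self, row, family, qualifier, timestamp, value):
        key = (row, family, qualifier, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, family, qualifier):
        """Return the newest version of a cell, or None if the cell is absent (sparse)."""
        probe = (row, family, qualifier, float("-inf"))
        i = bisect.bisect_left(self._keys, probe)
        if i < len(self._keys) and self._keys[i][:3] == (row, family, qualifier):
            return self._values[self._keys[i]]
        return None

    def scan(self, start_row, stop_row):
        """Return all cells with start_row <= row < stop_row, in key order."""
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (stop_row,))
        return [(k[0], k[1], k[2], -k[3], self._values[k]) for k in self._keys[lo:hi]]


store = SortedCellStore()
store.put("1", "data", "Name", 1, "Dave")
store.put("1", "data", "Group", 1, "STAS")
store.put("2", "data", "Name", 1, "Joe")
store.put("2", "data", "Group", 1, "ITAS")
store.put("1", "data", "Name", 2, "David")  # a newer version of an existing cell

newest_name = store.get("1", "data", "Name")
missing_cell = store.get("2", "data", "Email")  # never written: no storage, no value
row_one_cells = store.scan("1", "2")
print(newest_name, missing_cell, row_one_cells)
```

Because only cells that were written take up an entry, a row with two columns and a row with two million columns can live in the same table.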

Page 47
Storage Layout Differences by Example
ROW-ORIENTED
Row-ID Name Birthdate Salary Dept
1 Joe 6-Apr 70000 SAL
2 Jane 12-Feb 55000 ACCT
3 Bob 13-Jul 120000 ENG
4 Mike 17-Sep 115000 MKT

SCHEMA   ID (Integer)   Name (Varchar)   Birthdate (String)   Salary (Integer)   Dept (String)

COLUMN-ORIENTED
Name Column Row-ID Value
1 Joe
2 Jane
3 Bob
4 Mike

Birthdate Column Row-ID Value


1 6-Apr
2 12-Feb
3 13-Jul
4 17-Sep
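The difference between the two layouts can be shown with the same four records in Python. This is only a sketch of the logical layout using in-memory structures; the to_columns helper is hypothetical, and real column stores add encoding and compression on top.

```python
# Row-oriented: each record is stored together.
rows = [
    {"row_id": 1, "name": "Joe", "birthdate": "6-Apr", "salary": 70000, "dept": "SAL"},
    {"row_id": 2, "name": "Jane", "birthdate": "12-Feb", "salary": 55000, "dept": "ACCT"},
    {"row_id": 3, "name": "Bob", "birthdate": "13-Jul", "salary": 120000, "dept": "ENG"},
    {"row_id": 4, "name": "Mike", "birthdate": "17-Sep", "salary": 115000, "dept": "MKT"},
]


def to_columns(records):
    """Column-oriented: each column is stored together as (row_id, value) pairs."""
    columns = {}
    for record in records:
        for field, value in record.items():
            if field == "row_id":
                continue
            columns.setdefault(field, []).append((record["row_id"], value))
    return columns


columns = to_columns(rows)

# An aggregate over one column only touches that column's data.
total_salary = sum(value for _row_id, value in columns["salary"])
print(columns["name"], total_salary)
```

Reading a whole record is cheaper in the row layout; scanning or aggregating a single attribute is cheaper in the column layout.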

Page 48
Storage Layout Differences by Example (contd..)
Converting a Relational Model to HBase

Example Business Application: Shopping Cart


Problem Statement:
Build an Online Shopping Cart Application

Data Model:
A data model representation by conventional RDBMS:

Customer         CustomerID UUID PK, Name Text, Email Text
Orders           OrderID UUID PK, CustomerID FK, date Timestamp, Total Double
Products         ProductID UUID PK, Name Text, Price Double
Order_Products   OrderID FK, ProductID FK

General RDBMS methodology:
o Normalize data
o Use foreign keys
o Use indexes

Model Description:

Table            Purpose
Customer         Contains list of customers
Orders           Orders placed by customers
Products         List of products available in the store
Order_Products   Mapping between orders placed and products in the order

Design description:
1. Customer table is indexed on the CustomerID field for fast lookup
2. Orders table is indexed on the OrderID for fast lookup
3. Products table is indexed on the ProductID
4. Customer and Orders tables are related through a foreign key relation on the CustomerID
5. Orders table and Products table are mapped to each other by the Order_Products mapping table

Page 49
Storage Layout Differences by Example (contd..)
Representing Shopping Cart application in HBase

Example Business Application: Shopping Cart


Problem Statement:
Build an Online Shopping Cart Application

Data Model:
A data model representation in HBase:

Table: Customers
  Row Key    Customers
  Family     data (Columns: Content, Orders), Stats-3 mins, Stats-2 mins, Stats-1 mins

Table: Products
  Row Key    Products
  Family     data (Columns: Content), Stats-3 mins, Stats-2 mins, Stats-1 mins

Table: Orders
  Row Key    Orders
  Family     data (Columns: Content, Products), Stats-3 mins, Stats-2 mins, Stats-1 mins

Model Description:

Table      Purpose
Customer   Stores each customer, associated order usage statistics, various time ranges in separate
           column-families with distinct TTL settings
Products   Stores content associated with products
Orders     Stores all orders and products under each order

Design description:
1. Similar number of tables as the RDBMS model
2. Mapping table is absorbed as part of the Customers table
3. Statistics are stored with date as a key, so they can be accessed sequentially

Important Changes:
1. Wide tables and column-oriented design eliminate JOINs
2. Compound keys are essential
3. Data partitioning is based on keys, so a proper understanding of the keys is essential

Page 50
Storage Layout Differences by Example (contd..)
Where RDBMS makes sense
1. Joining
o In a single query get all products in an order with their product information
2. Secondary Indexing
o Get Customer Id by Email
3. Referential Integrity
o Deleting an order would delete links out of order_products

Where Hbase makes sense


1. Dataset Scale
o We have 1M customers and 100M orders
o Product information includes large text datasheets and PDF files
o Want to track every time a customer looks at a product page
2. Read/Write Scale
o Writes are extremely fast since there are no index updates
o Tables distributed across nodes means reads/writes are fully distributed
3. Replication
o Comes free with Hadoop
4. Batch Analysis
o Massive queries executed serially become efficient MapReduce jobs distributed and
executed in parallel

Conclusion
For small instances of simple, straightforward systems, relational databases offer a much more convenient way to
model and access data.
If you need to scale to larger proportions, the properties and flexibility of HBase can relieve you from the headaches
associated with scaling an RDBMS.
An RDBMS provides tremendous functionality out of the box but is extremely difficult and costly to scale. HBase provides
barebones functionality out of the box, but scaling is built in and inexpensive.

Page 51
HBase Building Blocks

Row
o Rows are composed of columns
o Can have millions of columns
o Can be compressed or tagged to stay in memory

Table
o Collection of rows
o All rows are always sorted lexicographically by their row key
o Keys are compared on binary level from left to right
o Rows are always unique and can be thought of as a primary index on the row key

Column Families
o Columns are grouped into column families
o Semantic boundaries between data
o Defined when table is created and should not be changed often
o Number of column families should be reasonable (?)

Column
o The most basic unit of HBase is a Column
o Each column may have multiple versions, with each distinct value contained in a separate cell
o One or more columns form a row, that is addressed uniquely by a row key
o Column name is called qualifier.
o Reference = family:qualifier

Cell
o Every column value is called a Cell. It is time stamped.
o Can be used to save multiple versions of a value. Versions are stored in decreasing timestamp, most recent ones first

Region
o Basic unit of scalability and load balancing
o Contiguous ranges of rows stored together.
o Dynamically split by the system when they become too large
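Because row keys are compared byte by byte from left to right, numeric suffixes do not sort the way people often expect. The short Python sketch below shows the effect and one common workaround, zero padding. The key format "row-N" and the padded_key helper are made up for illustration.

```python
# Byte strings in Python compare lexicographically, byte by byte, which is the same
# kind of ordering described for row keys above.
row_keys = [b"row-1", b"row-2", b"row-10", b"row-11"]
sorted_keys = sorted(row_keys)
# "row-10" and "row-11" sort before "row-2" because the byte "1" is smaller than "2".


def padded_key(n, width=4):
    """Hypothetical helper: zero-pad the number so byte order matches numeric order."""
    return f"row-{n:0{width}d}".encode("ascii")


padded_sorted = sorted(padded_key(n) for n in (1, 2, 10, 11))
print(sorted_keys)
print(padded_sorted)
```

Since regions hold contiguous ranges of sorted row keys, the choice of key format directly decides which rows end up stored next to each other.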

Page 52
HBase Logical Architecture

o HBase uses HDFS as its reliable storage layer, which provides failover, replication
o Native Java API, Gateway for REST, Thrift, Avro (Avro, 2013)[12]
o Two types of HBase nodes (Master and Region Server)
o Master manages the cluster and is responsible for monitoring the region servers and load balancing of regions
o Region Server manages data
o ZooKeeper is used for coordination

Figure: HBase API on top of the Master and the Region Servers (HFile, memStore, Write Ahead Log), which sit on HDFS
and ZooKeeper

Page 53
HBase Distribution Architecture

Figure: Table - Logical View. The rows of a table (A ... H ... Q ... Z) are split into regions with key ranges
[A-C], [C-F], [F-I], [I-M], [M-T] and [T-Z], and the regions are spread across Region Server 1, Region Server 2 and
Region Server 3.

Distribution
o Unit of Scalability in HBase is a Region
o Sorted and Contiguous rows
o Spread randomly across Region Servers
o Moved around for Load balancing and failover
o Split automatically or manually to scale growing data

Page 54
Hive & Sqoop 1
Day

Page 55
Introducing Hive
Many thanks to Facebook for making Hadoop data files look like SQL tables!!!

Hive is a Petabyte scale data warehousing infrastructure for managing and querying structured data built
on top of Hadoop (Hive, 2013)[5]
o Map-Reduce for execution
o Simple query language called Hive QL, which is based on SQL
o Plug in custom mappers and reducers for sophisticated analysis that may not be supported by the built-in capabilities of the
language

Data Summarization Adhoc Querying Data Analysis

Hive
(Built by Facebook)

Large Volumes of Data

How is Hive different from Oracle or other popular databases ?


1. Not designed for online transaction processing
2. Does not offer real time queries and row level updates. More suited for batch processing over large sets of immutable data
3. Provides much richer data structures and powerful transformation capabilities

Page 56
Motivation For Using Hive

Quick Facts
o Data, data and more data
o On average, data is increasing at 8X yearly
o Platform scalability is the major limiting factor in supporting this data growth

Motivation
o Users expect faster response time on fresher data
o Fast, faster and real-time

Realization that more insights are derived from simple algorithms on more data

Existing data warehousing systems do not meet all the requirements in a scalable, agile and cost effective way

Can I use Hadoop?

Page 57
Rationale For Hive

Pros and Cons analysis

Pros
1. Hadoop provides superior availability/scalability/manageability
2. Command-line interface for end users
3. Partial availability/resilience/scale more important than ACID

Cons
1. Map Reduce is hard to program
2. Need to publish data in well known schemas

HIVE
Hive> select employee_number, count(1) from employee_records where employee_number > 100 group by employee_number

versus hand-written Map Reduce
$cat > /tmp/mapreducer.sh
$cat > /tmp/map.sh
... multiple lines of map reduce code

Page 58
Where Does Hive Help?
Hive Principles that Address Key Challenges

Data Growth - How Hive addresses Data Growth challenges?
1. Hive table can be defined directly on existing HDFS files
2. Tables can be partitioned and bucketed and data can be loaded to each partition
3. Scale out instead of Scale up

Performance - How Hive addresses Performance challenges?
1. Tools to load data into Hive table in near real time
2. Various optimization techniques
3. Pull simple short tasks to the client side

Interoperability - How Hive addresses Interoperability challenges?
1. Schemas are stored in RDBMS
2. Column types could be complex types
3. Tables and Partitions can be altered
4. Views to be available soon

Extensibility - How Hive addresses Extensibility challenges?
1. Plug in Custom Mappers / Reducer
2. Data Source can come from web services
3. JDBC / ODBC drivers

Page 59
Hive Architecture

Figure: Hive components - JDBC, ODBC, Command Line Interface, Web Interface, Thrift Server, Metastore and
Driver (Compiler, Optimizer, Executor), running on Hadoop (Map Reduce + HDFS): Job Tracker, Name Node,
Data Node + Task Tracker

Data Model
o Tables have typed columns (int, float, string, date, boolean)
o Partitions
o Buckets (hash partitions useful for sampling, join optimization)

Metastore
o Namespace containing set of tables
o Holds partition definitions
o Statistics
o Runs on Derby, MySQL and many other relational databases

Physical Layout
o Warehouse directory in HDFS
o Table row data stored in subdirectories of warehouse
o Partitions form subdirectories of table directories
o Actual data stored in flat files
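The physical layout described above (warehouse directory, one subdirectory per table, one nested subdirectory per partition value) can be sketched as a path-building function. The warehouse path, table name and partition columns below are example values, and the bucket function is a deliberate simplification for integer keys; both helpers are hypothetical, not Hive code.

```python
import posixpath


def partition_path(warehouse_dir, table, partition_spec):
    """Build the directory for one partition: <warehouse>/<table>/<col>=<value>/..."""
    parts = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(warehouse_dir, table, *parts)


def bucket_for(key, num_buckets):
    """Simplified bucketing for integer keys: hash partition by modulo."""
    return key % num_buckets


path = partition_path(
    "/user/hive/warehouse",                      # example warehouse directory
    "page_views",                                # example table name
    [("dt", "2013-05-01"), ("country", "US")],   # example partition columns
)
bucket = bucket_for(17, 4)
print(path, bucket)
```

A query that filters on the partition columns only needs to read the matching directories, which is why partitioning helps with data growth.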

Page 60
Hive Query Language
Features

SQL Dialect          MySQL like extensions, MapReduce extensions
Data Types           (Simple), (Complex)
Built-in Functions   (Show Functions), (Describe Functions)

Capabilities
o Can point to external tables or existing data directories in HDFS
o Sub Queries
o Equi Joins
o Multi-table Insert
o Multi-group by

Select Query Syntax


SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

Page 61
Hive Usage @ Facebook and Beyond
Quick Statistics
12 TB of compressed new data added / day
135 TB of compressed data scanned / day
7500 + Hive jobs / day
Analysts (non-engineers) use Hadoop through Hive
95% of jobs at Facebook are Hive jobs

Applications & Jobs


Reporting
o Daily / Weekly aggregations of impression / click counts
o Measures of user engagement
o Micro strategy reports
Ad Hoc Analysis
Machine Learning
Etc Etc.

Other Real World Use Cases


CNET
o Data Mining, Log Analysis, Adhoc queries, eLearning
Hi5
o Analytics, Machine Learning, Social graph analysis

Page 62
Sqoop 1
Day

Page 63
Sqoop - SQL to Hadoop
Sqoop is an open source tool that allows users to extract data from a relational database into Hadoop

o A great strength of the Hadoop platform is its ability to work with data in different forms and parse ad hoc data
  formats, extracting relevant information (Sqoop, 2013)[9]
o A considerable amount of valuable data in an organization is stored in relational database systems

Features
o Written in Java
o JDBC based interface
o Automatic datatype generation
o Uses MapReduce to read tables from the database
o Supports most JDBC standard types
o Provides ability to import from SQL databases straight to the Hive data warehouse

Figure: Database Table -> Sqoop -> HDFS. Sqoop auto-generates the datatype definitions; custom MapReduce programs
interpret the data.

Page 64
Sqooping Large Objects
Importing Large Objects
o Database queries usually read all columns of each row from disk to identify rows that match the query criteria
o If large objects were stored inline in this fashion, it would adversely affect the performance of the scans

MapReduce typically materializes every record before passing it along to the mappers. To avoid this
performance degradation, large objects are often stored externally from their rows

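Figure: each row holds an ID in place of the large object; the large objects themselves are kept in separate storage

Row 1:   Col 1 | ID (A) | Col 3 | Col 4   -> LOB A
Row 2:   Col 1 | ID (B) | Col 3 | Col 4   -> LOB B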

o Sqoop will store imported large objects in a separate file called a LobFile
o The LobFile format can store individual records of very large size (64 bit address).
o This format allows clients to hold a reference to a record without accessing the record contents

Page 65
Conclusion 1
Day

Questions?

Page 66
Industry Use Cases End to End 2
Day

Page 67
Use Case 1:How do you Gauge Customer Sentiment?
Use Case - Reducing customer attrition via a priori semantic analysis of customer sentiments using textual data

Problem Statement
o Ability to ingest a large volume of unstructured data from multiple sources
o Apply a priori established rule models to the ingested data
o Derive a structured Result-Set that can correlate sentiments to subjects

Limitations of traditional semantic text analysis
o Limited accuracy
o Latency (limited speed)
o Low expressiveness
o Low granularity
o Inability to perform bi-directional raw text analysis

Challenges in the semantic analysis of human sentiment!!
o Accuracy of the result-set is the major challenge in unstructured textual analysis. Much of it is manifested in
  the form of Precision and Recall
o Precision - Percentage of relevant items in the result-set, i.e. are the results valid? Correlation is a better measure.
o Recall - Percentage of relevant results retrieved from unstructured data, i.e. are all sentiments relevant to a
  subject retrieved successfully?
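Precision and recall as described above can be computed directly once the retrieved results and the truly relevant results are known. The following is a minimal sketch; the sentiment identifiers (s1, s2, ...) are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: the extractor returned four sentiments, three are truly relevant.
retrieved_sentiments = {"s1", "s2", "s3", "s4"}
relevant_sentiments = {"s1", "s2", "s5"}

precision, recall = precision_recall(retrieved_sentiments, relevant_sentiments)
print(precision, recall)
```

In this example two of the four results are valid (precision 0.5) and two of the three relevant sentiments were found (recall of about 0.67); tuning an extractor usually trades one measure against the other.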

Figure: CHANNELS -> MODEL RUNTIME -> STRUCTURED OUTPUT
o Channels: Email (Text Corpus), IVR (Speech Corpus), Online Help (Text Corpus)
o Model runtime: Extractor, Optimizer
o Structured output:

Subject   Satisfaction Index
John      Irritating [-1]
Chris     Unhappy [-3]
Adam      Tired [-5]

Page 68
Using NLP to assess Customer Sentiment
Natural Language Processing (NLP) constitutes the process of extracting meaningful information via computation
from a natural language e.g. English. This requires collaboration between computation, linguistics and statistics.

Mahout
NLP constructs that we use to answer the Customer Sentiment question using Mahout (Mahout, 2011)[6]:
o Coreference Resolution - Given a sentence, determine which words refer to the same object/entity.
o Named Entity Recognition - Given a stream of text, determine which item in the text maps to a proper name & type.
o Natural Language Generation - Convert rich media to human readable format. Conversion of IVR data to text.
o Part of Speech Tagging - Given a sentence, determine the part of speech for each word. At Large Bank, customer
  interactions captured in a non-inflectional language such as English introduce ambiguity because many words can
  be used both as nouns and verbs, e.g. book, set etc.
o Sentence Boundary Disambiguation - Given a stream of text, determine sentence boundaries.
o Word Sense Disambiguation - Given a stream of text, determine the context in which each word is used, for words with
  multiple meanings under different contexts.
o Sentiment Analysis - Given a sentence, classify its polarity given basic states such as positive, negative or neutral.
  Further sub-classification into advanced emotional states based on a scaling model, e.g. -5 to 5, can also be derived.

We can use a Hadoop platform that comprises:
o Text Analytics Toolkit
o Query Language (QL)
o HDFS implementation

Page 69
Extractor Definition
Semantic Extractor Definition
Sample Customer Email Extract
My name is John Doe and I maintain two wealth management accounts with you. On 28th Feb. I
called customer regarding erroneous fees applied to my account, however after speaking with
multiple agents and wasting a lot of time. I came off very disappointed with the level of
service I received. My issue has not been resolved and I am considering moving my accounts
to another institution

Extractor Rules Expression


<Named Entity>
<Person Entity> <Person Entity> <Emotive Expression> <Emotive Expression>

Single Sentence Boundary

Data Stream Boundary

Boundary      Person     Emotion                        Satisfaction Index
Data Stream   John Doe   wasting, disappointed, issue
Sentence      I          wasting                        -3
Sentence      I          disappointed                   -4
Sentence      My         issue                          -2
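A rule-based extractor of this kind can be approximated with a small lexicon lookup. The sketch below is a simplification written for this training, not the text analytics engine itself: the emotion scores come from the table above, while the sample text, the naive sentence splitter and the function names are assumptions.

```python
import re

# Emotion words and satisfaction index values from the table above.
EMOTION_SCORES = {"wasting": -3, "disappointed": -4, "issue": -2}


def split_sentences(text):
    # Naive sentence boundary rule: split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentences(text):
    """Return (sentence, emotion word, satisfaction index) for each match."""
    results = []
    for sentence in split_sentences(text):
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word in EMOTION_SCORES:
                results.append((sentence, word, EMOTION_SCORES[word]))
    return results


# Shortened, made-up variant of the sample email.
email = (
    "I spoke with multiple agents and ended up wasting a lot of time. "
    "I came off very disappointed with the level of service. "
    "My issue has not been resolved."
)
matches = score_sentences(email)
scores = [score for _sentence, _word, score in matches]
print(scores, sum(scores))
```

The naive splitter would break on abbreviations such as "Feb.", which is exactly the sentence boundary disambiguation problem listed among the NLP constructs.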

Page 70
Introducing Hadoop Parallelism for Sentiment Analysis
We can use a Hadoop platform that includes:
o An advanced Text Analytics Engine
o Annotator Query Language (AQL), which is a fast, declarative & fine grained expressive language for text analysis
o AQL is used for defining text analytics rules to build text extractors
o The runtime compiles an AQL extractor into an Analytics Object Graph (AOG), which represents the text extraction rules
  in a tree structure in memory
o A separate instance of the AOG is deployed on each Mapper, which runs a full instance of the Text Analytics Engine
o A single Mapper operates on a single data Split, running it through the rules defined in the AOG
o Output from all Mappers is reduced via a Reducer that coalesces negative emotive expressions exclusively
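The mapper-per-split and single-reducer arrangement described above can be simulated in plain Python. This is a sketch of the data flow only: the word list stands in for the compiled extraction rules, the input texts are invented, and the mapper and reducer functions are hypothetical rather than Hadoop API calls.

```python
from collections import defaultdict

# Stand-in for the extraction rules that each mapper would carry (assumed values).
EMOTION_SCORES = {"irritating": -1, "unhappy": -3, "tired": -5, "delighted": 4}


def mapper(split):
    """Run the rules over one data split and emit (subject, (expression, score)) pairs."""
    for subject, text in split:
        for word in text.lower().split():
            word = word.strip(".,!?")
            if word in EMOTION_SCORES:
                yield subject, (word, EMOTION_SCORES[word])


def reducer(mapped_pairs):
    """Coalesce the mapper output, keeping negative emotive expressions only."""
    negatives = defaultdict(list)
    for subject, (word, score) in mapped_pairs:
        if score < 0:
            negatives[subject].append((word, score))
    return dict(negatives)


# Two invented data splits; on a cluster each would go to its own mapper.
splits = [
    [("John", "The hold music was irritating."), ("Chris", "I am unhappy with the fees.")],
    [("Adam", "I am tired of calling."), ("Chris", "I was delighted by the branch staff.")],
]
mapped = [pair for split in splits for pair in mapper(split)]
satisfaction = reducer(mapped)
print(satisfaction)
```

Because every mapper holds its own copy of the rules and sees only its own split, adding mappers scales the extraction without any coordination until the reduce step.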

Hadoop Parallelism - Extractor per Mapper

Figure: AQL -> Optimizer & Runtime -> AOG -> Mappers (each Mapper runs an Analytics AOG instance) -> Reducer
(reduces negative emotive expressions) -> structured output

Subject   Satisfaction Index
John      Irritating [-1]
Chris     Unhappy [-3]
Adam      Tired [-5]

Page 71
Use Case 2: Anti Money Laundering Using Hadoop Fraud Use Case
Money Laundering is the practice of concealing the source of illegally obtained money

Process of Money Laundering


o Placement - Act of transforming illicit liquid assets (currency) into any other asset class by a subject.
o Layering - Attempt by a subject to obscure the trail of an illicit source of funds via the establishment of complex
  financial transactions, e.g. shell companies, offshore accounts, tax havens etc.
o Integration - The aggregation of illicit funds from various legitimate commercial activities and financial systems.
  This stage converts illicit money into clean money.

Figure: Placement -> Layering -> Integration. Funds ($) move from the Subject in Region A to Region B through
intermediaries: Money Changer, Remittance Agents, Bank, Agents, Offshore Bank A/Cs and a Tax Haven.

Page 72
Notional Model for Event Based Detection of ML
Using Hadoop for AML Detection
Figure: notional AML detection pipeline
o Enterprise Event Cloud -> Sqoop and Avro serialization -> Hadoop (HDFS, MapReduce parallelism, Mahout)
o Event Record: Subject, Time, Location, Activity, ...
o ML Object (example): Subject John Doe; 14:35 April 5 2009; Seattle, WA; Company Setup; CFO
o Historical Associations & Rankings: OFAC Lists, CFTC Lists
o Knowledge Base: Put Ranking & Association, Get Ranking & Association
o ML Detection Algorithms: iterations to isolate sub-associations across Event Records, producing an ML Object Graph
  with Subject Ranking, Cluster Association and Ranking Assignment

AML Modeling Characteristics
o ML Event - Is a significant occurrence that presents an ML activity at a single point in space-time.
o Cluster - Defines all the associations for a given subject.
o ML Object - Defines a subject with corresponding attributes including time, illicit activity and location of activity.
o Association - Defines the relationship between two subjects.
o Association Degree - Defines the degree of relationship closeness.
o Subject Rank - Defines the degree of suspicion of a subject.

Heuristics for AML Modeling
o Corruption Network
o Illicit Activity Period
o Illicit Transaction
o Suspicious Entity
o Suspicious Region
o Subject's Career
o Subject's Age Profile
o Transaction Behavior

Figure: Parallel Cluster Evaluation - clusters A and B are each evaluated by a separate MAP task

Page 73
Use Case 3: Major System Outages in Recent Past

Friday Aug 05 2011


Lloyds Banking Group's wholesale banking division suffered several hours of downtime in its trading
system as a server cooling system failed and traders were forced to revert to pen and paper and
telephones during a day of turmoil in the stock market. The downtime came during a busy day on the
stock markets as billions of pounds were wiped off the value of shares. Lloyds Bank cooling system
failure to blame for trading downtime
Source: Computer Weekly

Monday Oct 10 2011


BlackBerry users have just started to report issues with BlackBerry Messenger and email, which
suggest that the network is down. This is not the first time that users complain of BlackBerry outage,
but this time seems to be a global issue as we've seen reports from all over the world and BlackBerry
owners in our team seem to be affected as well..
Source: CNN Money

Page 74
Monitoring & Decisioning Man vs. Machine
The variance of accuracy is inversely proportional to the increase in the breadth of the data

Man
o Current Situation + Past Knowledge -> Perception, Interpretation, Motivation, Judgment -> Decision -> Action
o Decisions change based on each individual's past, interpretation, perception and judgmental capabilities
o Consistency and comprehension of the breadth of data is humanly impossible

Machine
o Current Situation -> Trainable Model (the more the data, the more the accuracy) -> Consistent Results
o Perception, Interpretation and Judgment remain the same

Page 75
Notional Model for System Outage Prediction
Use Case - Designing an automated framework that collects, correlates and classifies enterprise events, thereby
generating alert notifications identifying potential outage scenarios

Problem Statement
o Collection, classification and correlation of enterprise events
o Autonomous, self-training, intelligent event framework

Critical factors in automated predictive analysis
o Real Time - Faster high frequency data capture, and analysis as soon as the data is captured.
o Data - Volumes of historic data, both structured and unstructured, to train the system
o White Noise - Percentage of relevant results retrieved from the collected data
o Autonomous Decisioning - More intelligent automated decision making.

Limitations of traditional reactive methodology
o Manned effort
o Prone to errors
o Nearly impossible to have consistent results
o Difficult to comprehend the breadth of data
o Impossible to meet and maintain the SLAs

[Diagram: Data Sources (Log Data, Events, Historic Data) feed the MODEL, which drives Autonomous Actions (Notifications, Corrective Actions, System Actions). Model stages shown: Collect, Aggregate, Enrich, Classify, Apply Rules, Filter White Noise, Pattern Matching within a Time Box, Decide/Re-train.]

Page 76
Intelligent Outage Learning
We can use a Big Data Platform to build a system outage framework that uses ML to predict outages

System logs, exception stack traces, enterprise events and customer call data are fed into a Mahout ML model (baseline)
ML Base model is continuously trained by regressing all exogenous & endogenous variables against test data set(s)
Refined base model is established as the yardstick for the prediction of future outages
Real time events are sent through the trained ML model at times t+n and variances from the fitted curve are observed
Event cloud input is multiplexed across multiple Mappers; each running an instance of the trained Mahout ML Model
Outlier events are identified by their violation of the normal variance threshold from the fitted curve of the ML Model
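
The last bullet can be illustrated with a minimal sketch, assuming the trained model has already produced a fitted (expected) value for each event. The class name, the method names and the "k times sigma" threshold rule are illustrative assumptions, not part of Mahout or of the framework described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: flags events that stray too far from the fitted curve.
final class OutageOutlierSketch {

    /** Standard deviation of the residuals (observed - fitted) on training data. */
    static double residualStdDev(double[] observed, double[] fitted) {
        double mean = 0.0;
        for (int i = 0; i < observed.length; i++) {
            mean += observed[i] - fitted[i];
        }
        mean /= observed.length;
        double sumSq = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double d = (observed[i] - fitted[i]) - mean;
            sumSq += d * d;
        }
        return Math.sqrt(sumSq / observed.length);
    }

    /** Indices of events whose deviation from the fitted value exceeds k * sigma. */
    static List<Integer> outliers(double[] observed, double[] fitted, double sigma, double k) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < observed.length; i++) {
            if (Math.abs(observed[i] - fitted[i]) > k * sigma) {
                flagged.add(i);
            }
        }
        return flagged;
    }
}
```

In this sketch sigma is estimated once from historic data and then applied to real-time events; a production system would re-estimate it whenever the base model is retrained.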

System Outage Predictive Framework

Page 77
Hadoop Installation & HDFS Hands-on (Day 2)

Page 78
Setting up the VM

For this demonstration, we will be using Cloudera's QuickStart VM. This 64-bit VM contains a single-node Apache
Hadoop cluster. It runs on CentOS and includes:

o CDH4.6
o Cloudera Manager
o Cloudera Impala
o Cloudera Search

For the installation, we will be:


Installing VirtualBox (the virtualization application)
Downloading & Importing Cloudera Quickstart VM
Adjusting VM RAM and Copy/Paste
Optional: Shared Folder, Internet Settings
Cleaning Up and Starting the VM

Page 79
Setting up the VM
Installing VirtualBox

Step 1
Install Chocolatey from Command Prompt (as administrator). Chocolatey is a utility for easy command-line installation.

C:\> @powershell -NoProfile -ExecutionPolicy unrestricted -Command "iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))" && SET PATH=%PATH%;%systemdrive%\chocolatey\bin

Step 2
Install VirtualBox

C:\> cinst virtualbox

Step 3
Install 7Zip (will be used later to extract the VM)

C:\> cinst 7zip

Page 80
Setting up the VM
Downloading & Importing Cloudera Quickstart VM

Step 1
Download Cloudera Quickstart VM (approximately 8 minutes)

http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo&version=2.1

Step 2
Unzip VM
Locate the downloaded VM zip file (cloudera-quickstart-vm-4.4.0-1-virtualbox.7z)
Move file to desired folder (optional)
Right-click on the file and from the 7-Zip sub-menu, select "Extract Here" (approximately 3 minutes)

Step 3
Import Appliance
Open VirtualBox (Oracle VM VirtualBox Manager)
From the File menu, select "Import Appliance"
Locate the extracted VM folder, expand it, and select the *.ovf file
Hit Next then Import

Page 81
Setting up the VM
Adjusting RAM Settings
It is advisable to allocate at least 2GB of RAM for your host machine OS. CDH4.6 requires 4GB of RAM. Before loading the VM, it is
important to make sure the RAM settings leave at least 2GB of RAM space for your host OS to run. If you keep it at the default it is set to,
Windows and the VM will continuously fight for resources -- you don't really want Windows to become inoperable. This is done by:

Steps:
1. In Oracle VM VirtualBox Manager, with "cloudera-quicksta..." highlighted, click Settings
2. Select the "System" tab
3. Reduce the "Base Memory" of the VM to leave enough memory for your host OS
4. Hit OK

Page 82
Setting up the VM
Adjusting Copy/Paste Settings

Steps:
1. In Oracle VM VirtualBox Manager, with "cloudera-quicksta..." highlighted, click Settings
2. Select the Advanced sub-tab within General
3. Change the Shared Clipboard drop-down to Bidirectional
4. Hit OK

Page 83
Setting up the VM
Cleaning Up and Starting the VM

Step 1
Clean up

You can remove the *.7z and the extracted folder (5GB of hard drive savings)

NOTE: The VM has already been loaded (i.e., installed) on your machine in:
c:\users\[user]\VirtualBox VMs\cloudera-quickstart-vm-4.4.0-1-virtualbox
This is where the functional machine exists (do not remove it).

Step 2
Start up the VM
With the "cloudera-quicksta..." appliance selected, click the Start button.

o Within Firefox, the bookmarks on top are for the Cloudera/Hadoop managers
o Eclipse (shortcut on the desktop) already has a training folder set up with a blank MapReduce template
o To shut down the VM, go to System -> Shutdown

Page 84
Setting up the VM
Creating a Shared Folder
You can set up a shared folder to transfer files from your host machine to the VM. This will be useful when we wish to transfer our
tutorial files from our host to the VM My Documents folder.

Steps:
1. Be sure to have your virtual machine shut down
2. In Oracle VM VirtualBox Manager, with "cloudera-quicksta..." highlighted, click Settings
3. Select the Shared Folders tab
4. Click the icon to add a shared folder.
5. Select Other for the folder path, then select the folder under your My Documents folder on the host machine that you created to share documents between your machine and the VM.
6. Provide the Folder Name, which is the name the shared folder will have within the VM.
7. Ensure that Auto-mount is checked.

Page 85
Setting up the VM
Transferring Files into VM

Step 0: Start the VM and Open a Terminal window

Step 1: Escalate privileges

$ sudo su

Step 2: Move the tutorial file from the shared folder to the home folder
o The shared folders are located within the /media/ path.
o The folder name provided in the VM Settings is prefixed with sf_

# mv /media/sf_VMShare/NYSE.tar.gz /home/cloudera/NYSE.tar.gz

Step 3: Change the owner of the file to cloudera

# chown cloudera /home/cloudera/NYSE.tar.gz

Step 4: Exit escalated privileges


# exit

Page 86
Setting up the VM
Internet Settings: Adjusting the Network Connection Settings

Note: Internet connectivity is not required for this tutorial. You may omit this section.

Steps:
1. Check the internet connection
o With the cloudera-quicksta... VM selected, start the VM by clicking Start
o Open Firefox within the VM and attempt to browse to a website (www.google.com).
o If the connection succeeds, your internet connection is working and you do not need to do anything else.
o If the connection fails, shut down the VM (System -> Shutdown) and follow the next steps.

2. Copy the VM's assigned MAC address.
o In the settings window of the VM, navigate to the Network tab.
o Copy the MAC address
Page 87
Setting up the VM
Internet Settings: Adjusting the Network Connection Settings (Continued)

Steps:
3. Adjust VM Network Connections
o Start the VM
o Under System -> Preferences, select Network Connections. If only System eth0 exists, click Add to add eth01 as a new Wired connection.
o If prompted, the password for root is cloudera.
o The MAC addresses of both connections should be the same as the one copied in Step 2, and the first two checkboxes of each connection should be checked.
o Under IPv4 Settings:
The connection method for eth01 should be Automatic (DHCP) addresses only.
The host IPv4 DNS Server address should be provided as the DNS Server
o Now that the internet connection works, you can use the web to download the training materials zip file to the My Documents folder.

Page 88
Setting up the VM
Screen shots of Linux and Windows popup windows

o For both the System eth0 and eth01 connections, make sure that you have checked the appropriate boxes and provided the Device MAC address (highlighted above).
o In Windows, record the IPv4 DNS Server address (highlighted above) for the connected network. Select the current network from Network and Sharing Center in Control Panel, then click Details.
o Ensure that the connection Method for eth01 is Automatic (DHCP) addresses only. Provide the recorded IPv4 DNS Server from Windows (highlighted above).

Page 89
Hadoop Hands-On Training: Step-by-Step Examples
HDFS Hands-on (Day 2)

Page 90
HDFS

HDFS (Hadoop Distributed File System) is a distributed file system that stores data on commodity machines, providing
high aggregate bandwidth across the cluster. HDFS lets us treat the Hadoop cluster as a single file system, without
worrying about where a file is actually stored.

For the intro on HDFS, we will be:


Setting up the user folder in Hadoop
Transferring data files into Hadoop
Optional: HDFS shell commands

The File System (FS) is invoked by:

% hadoop fs <args>

Page 91
HDFS
Setting up the user folder in Hadoop

Step 0: Start the VM and Open a Terminal window


o We will do this every time we run commands.

Step 1: Escalate privileges for HDFS

$ sudo su hdfs

Step 2: Create the cloudera user folder. (If this errors, simply proceed)

$ hadoop fs -mkdir /user/cloudera

Step 3: Change the owner of the folder to cloudera

$ hadoop fs -chown cloudera /user/cloudera

Step 4: Exit escalated privileges

$ exit

Page 92
HDFS
Shell Command Basics
All the FS shell commands take path URIs as arguments
The URI format is scheme://authority/path
For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional. If
not specified, the default scheme specified in the configuration is used
An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as
/parent/child (given that your configuration is set to point to hdfs://namenodehost)
Most of the commands in FS shell behave like corresponding Unix commands. Differences are described with each of
the commands
Error information is sent to stderr and the output is sent to stdout
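
As a small illustration of the scheme://authority/path format, the sketch below uses the JDK's java.net.URI class (not the Hadoop API) to pull the three parts out of a path argument. The class name and the "(default)" placeholder text are illustrative assumptions; in Hadoop the missing parts would be filled in from the configured default filesystem.

```java
import java.net.URI;

// Hypothetical helper that shows which URI parts are present in a path argument.
final class HdfsUriSketch {

    /** Returns "scheme|authority|path"; parts that are not specified appear as "(default)". */
    static String describe(String location) {
        URI uri = URI.create(location);
        String scheme = uri.getScheme() == null ? "(default)" : uri.getScheme();
        String authority = uri.getAuthority() == null ? "(default)" : uri.getAuthority();
        return scheme + "|" + authority + "|" + uri.getPath();
    }
}
```

For example, describe("hdfs://namenodehost/parent/child") yields "hdfs|namenodehost|/parent/child", while describe("/parent/child") yields "(default)|(default)|/parent/child".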

Page 93
HDFS
Shell Commands List

Recall: % hadoop fs <args>

dus - Displays a summary of file lengths.
-dus <args>

expunge - Empties the trash.
-expunge

get - Copies files to the local file system. The CRC options relate to the cyclic redundancy check.
-get [-ignorecrc] [-crc] <src> <localdst>

getmerge - Concatenates the files in src into the destination local file. Specifying addnl adds a newline between files.
-getmerge <src> <localdst> [addnl]

ls - For a file, returns stat on the file; for a directory, returns a list of its direct children.
-ls <args>

lsr - Recursive version of ls. Similar to Unix ls -R.
-lsr <args>

mkdir - Takes path URIs as arguments and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path.
-mkdir <paths>

mv - Moves files from source to destination within HDFS. For multiple sources, the destination needs to be a directory.
-mv URI [URI ...] <dest>

Page 94
HDFS
Shell Commands List (continued)

put - Copies a single source or multiple sources from the local file system to HDFS. Specifying - as the source reads from stdin.
-put <localsrc> ... <dst>

rm - Deletes the files specified as arguments. Use rmr to delete directories recursively.
-rm URI [URI ...]

rmr - Recursive version of delete.
-rmr URI [URI ...]

setrep - Changes the replication factor of a file. Provide -R for recursive.
-setrep <args> [-R] <path>

stat - Returns stat information on the path.
-stat URI [URI ...]

tail - Returns the last kilobyte of the file to stdout.
-tail [-f] URI

test - Tests whether the file exists (-e), is zero length (-z), or is a directory (-d).
-test -[ezd] URI

text - Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
-text <src>

touchz - Creates a file of zero length.
-touchz URI [URI ...]

Page 95
Lunch (Day 2)

Page 96
MapReduce Deep Dive (Day 2)

Page 97
Features of MapReduce
Automatic parallelization and distribution
A clean abstraction for programmers
o MapReduce programs are usually written in Java
Can be written in any language using Hadoop Streaming (see later)
All of Hadoop is written in Java
o MapReduce abstracts all the housekeeping away from the developer
Developer can concentrate simply on writing the Map and Reduce functions
Automatic Fault tolerance
Status and monitoring tools

Page 98
MapReduce
JobTracker

MapReduce jobs are controlled by a software daemon known as the JobTracker


The JobTracker resides on a master node
o Clients submit MapReduce jobs to the JobTracker
o The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
o These nodes each run a separate daemon known as the TaskTracker
o The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting
progress back to the JobTracker

Page 99
MapReduce
Terminology

A job is a full program
A task is the execution of a single Mapper or Reducer over a slice of data
A task attempt is a particular instance of an attempt to execute a task
o There will be at least as many task attempts as there are tasks
o If a task attempt fails, another will be started by the JobTracker
o Speculative execution can also result in more task attempts than completed tasks

Page 100
MapReduce
Mapper

Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to
avoid network traffic
o Multiple Mappers run in parallel, each processing a portion of the input data
The Mapper reads data in the form of key/value pairs
It outputs zero or more key/value pairs (pseudo code):
map(in_key, in_value) -> (inter_key, inter_value) list

The Mapper may use or completely ignore the input key

For example, a standard pattern is to read a line of a file at a time
o The key is the byte offset into the file at which the line starts
o The value is the contents of the line itself
o Typically the key is considered irrelevant


If the Mapper writes anything out, the output must be in the form of key/value pairs
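
The map(in_key, in_value) contract above can be sketched in plain Java without the Hadoop API. In this sketch the Mapper ignores the byte-offset key and emits one (word, 1) pair per word; the class name, the method names and the word-count logic are illustrative assumptions rather than the course's own Mapper.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for a Mapper; it does not use any Hadoop classes.
final class MapperSketch {

    /** map(in_key, in_value) -> list of (inter_key, inter_value); the offset key is ignored. */
    static List<Map.Entry<String, Integer>> map(long byteOffset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word.toLowerCase(), 1));
            }
        }
        return out; // may be empty: a Mapper can emit zero pairs
    }

    /** Feeds each line to map(), keyed by the byte offset at which the line starts. */
    static List<Map.Entry<String, Integer>> runOverText(String text) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            all.addAll(map(offset, line));
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the '\n'
        }
        return all;
    }
}
```

Calling runOverText("the cat\nthe hat") produces four pairs, one per word, in input order.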

Page 101
MapReduce
Reducer
After the Map phase is over, all the intermediate values for a given intermediate key are
combined together into a list
This list is given to a Reducer

o There may be a single Reducer, or multiple Reducers

This is specified as part of the job configuration (see later)

o All values associated with a particular intermediate key are guaranteed to go to the same Reducer

o The intermediate keys, and their value lists, are passed to the Reducer in sorted key order

o This step is known as the shuffle and sort

The Reducer outputs zero or more final key/value pairs

o These are written to HDFS

o In practice, the Reducer usually emits a single key/value pair for each input key
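
A minimal plain-Java sketch of the shuffle and sort followed by a reduce is shown below. It groups intermediate (key, value) pairs by key in sorted key order, then sums each value list. The class name, the method names and the choice of summing are illustrative assumptions; no Hadoop classes are used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the shuffle-and-sort step and a summing Reducer.
final class ShuffleReduceSketch {

    /** Shuffle and sort: collects all values for each key, with keys in sorted order. */
    static TreeMap<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** reduce(key, values): emits a single final (key, sum) pair for each input key. */
    static Map<String, Integer> reduceAll(TreeMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            out.put(entry.getKey(), sum);
        }
        return out;
    }
}
```

Given the pairs (the, 1), (cat, 1), (the, 1), the grouped form is cat -> [1], the -> [1, 1], and the reduced output is cat = 1, the = 2.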

Page 102
MapReduce
Data Locality
Whenever possible, Hadoop will attempt to ensure that a Map task on a node is working on a block
of data stored locally on that node via HDFS
If this is not possible, the Map task will have to transfer the data across the network as it processes
that data
Once the Map tasks have finished, data is then transferred across the network to the Reducers

o Although the Reducers may run on the same physical machines as the Map tasks, there is no
concept of data locality for the Reducers

All Mappers will, in general, have to communicate with all Reducers

Page 103 Copyright 2010/2013 EY. All rights reserved. Not to be reproduced without prior written consent.
MapReduce: Bigger Picture?
[Diagram: MapReduce data flow, shown identically on Node 1 and Node 2]
Files loaded from local HDFS stores -> InputFormat -> Splits -> RecordReaders (RR) -> Input (k, v) pairs -> map -> Intermediate (k, v) pairs -> Partitioner -> Shuffling process (intermediate (k, v) pairs exchanged by all nodes) -> Sort -> Reduce -> Final (k, v) pairs -> OutputFormat -> Write-back to local HDFS store

Page 104
MapReduce
Is Shuffle and Sort a Bottleneck?
It appears that the shuffle and sort phase is a bottleneck
o The reduce method in the Reducers cannot start until all Mappers have finished
In practice, Hadoop will start to transfer data from Mappers to Reducers as the Mappers finish work
o This mitigates a huge amount of data transfer starting as soon as the last Mapper finishes
o Note that this behavior is configurable
The developer can specify the percentage of Mappers which should finish before Reducers start retrieving data
o The developer's reduce method still does not start until all intermediate data has been transferred and sorted

Page 105
MapReduce
Is a Slow Mapper a Bottleneck?

It is possible for one Map task to run more slowly than the others

o Perhaps due to faulty hardware, or just a very slow machine


It would appear that this would create a bottleneck

o The reduce method in the Reducer cannot start until every Mapper has finished
Hadoop uses speculative execution to mitigate this
o If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
o The results of the first Mapper to finish will be used
o Hadoop will kill off the Mapper which is still running
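
The idea behind speculative execution can be sketched with the JDK's ExecutorService.invokeAny, which returns the result of the first attempt to complete successfully and cancels the rest. This is only an analogy running on local threads, not how Hadoop schedules attempts; the class name, method names and delay values are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Local-thread analogy for speculative execution; no Hadoop classes are used.
final class SpeculativeExecutionSketch {

    /** One attempt at the task: sleeps to mimic node speed, then sums the split. */
    static Callable<Integer> attempt(int[] split, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis);
            int sum = 0;
            for (int value : split) {
                sum += value;
            }
            return sum;
        };
    }

    /** Runs a slow and a fast attempt on the same data; the first to finish wins. */
    static int runSpeculatively(int[] split) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // invokeAny returns the first successful result and cancels the other attempt.
            return pool.invokeAny(Arrays.asList(attempt(split, 2000), attempt(split, 10)));
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("no attempt completed", e);
        } finally {
            pool.shutdownNow(); // kills off the attempt that is still running
        }
    }
}
```

Because both attempts operate on the same data, the result is the same whichever one finishes first; only the elapsed time differs.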

Page 106
MapReduce
The Five Hadoop Daemons
Hadoop comprises five separate daemons
NameNode

o Holds the metadata for HDFS


Secondary NameNode

o Performs housekeeping functions for the NameNode

o Is not a backup or hot standby for the NameNode!


DataNode

o Stores actual HDFS data blocks


JobTracker

o Manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker

TaskTracker

o Instantiates and monitors individual Map and Reduce tasks

Page 107
MapReduce
The Five Hadoop Daemons (Continued)

Each daemon runs in its own Java Virtual Machine (JVM)


No node on a real cluster will run all five daemons

o Although this is technically possible

We can consider nodes to be in two different categories:

o Master Nodes

Run the NameNode, Secondary NameNode, JobTracker daemons

Only one of each of these daemons runs on the cluster

o Slave Nodes

Run the DataNode and TaskTracker daemons

A slave node will run both of these daemons

Page 108
MapReduce
Submitting a Job
When a client submits a job, its configuration information is packaged into an XML file
This file, along with the .jar file containing the actual program code, is handed to the JobTracker

o The JobTracker then parcels out individual tasks to TaskTracker nodes

o When a TaskTracker receives a request to run a task, it instantiates a separate JVM for that task

o TaskTracker nodes can be configured to run multiple tasks at the same time, if the node has enough processing power and memory
The intermediate data is held on the TaskTracker's local disk
As Reducers start up, the intermediate data is distributed across the network to the Reducers
Reducers write their final output to HDFS
Once the job has completed, the TaskTracker can delete the intermediate data from its local disk

o Note that the intermediate data is not deleted until the entire job completes

Page 109
Configuration Properties
Filename | Format | Description

hadoop-env.sh | Bash script | Environment variables that are used in the scripts to run Hadoop.
core-site.xml | Hadoop configuration XML | Configuration settings for Hadoop core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Hadoop configuration XML | Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml | Hadoop configuration XML | Configuration settings for MapReduce daemons: the jobtracker and the tasktrackers.
masters | Plain text | A list of machines (one per line) that each run a secondary namenode.
slaves | Plain text | A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties | Java properties | Properties controlling how metrics are published in Hadoop.
log4j.properties | Java properties | Properties for the system logfiles, the namenode audit log, and the task log for the tasktracker child process.

Source: Hadoop Definitive Guide, Tom White

Page 110
Configuration Properties
Property Name | Type | Default Value | Description

fs.default.name (core-site.xml) | URI | file:/// | The default filesystem. The URI defines the hostname and port that the namenode's RPC server runs on. The default port is 8020. This property is set in core-site.xml.
dfs.name.dir (hdfs-site.xml) | Comma-separated directory names | ${hadoop.tmp.dir}/dfs/name | The list of directories where the namenode stores its persistent metadata. The namenode stores a copy of the metadata in each directory in the list.
dfs.data.dir (hdfs-site.xml) | Comma-separated directory names | ${hadoop.tmp.dir}/dfs/data | The list of directories where the datanode stores blocks. Each block is stored in only one of these directories.
fs.checkpoint.dir (hdfs-site.xml) | Comma-separated directory names | ${hadoop.tmp.dir}/dfs/namesecondary | The list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.

Source: Hadoop Definitive Guide, Tom White

Page 111
Configuration Properties
Property Name | Type | Default Value | Description

mapred.job.tracker (mapred-site.xml) | Hostname and port | local | The hostname and port that the jobtracker's RPC server runs on. If set to the default value of local, then the jobtracker is run in-process on demand when you run a MapReduce job (you do not need to start the jobtracker in this case, and in fact you will get an error if you try to start it in this mode).
mapred.local.dir (mapred-site.xml) | Comma-separated directory names | ${hadoop.tmp.dir}/mapred/local | The list of directories where MapReduce stores intermediate data for jobs. The data is removed when the job completes.
mapred.system.dir (mapred-site.xml) | URI | ${hadoop.tmp.dir}/mapred/system | The directory relative to fs.default.name where shared files are stored during a job run.
mapred.tasktracker.map.tasks.maximum (mapred-site.xml) | Int | 2 | The number of map tasks that may be run on a tasktracker at any one time.
mapred.tasktracker.reduce.tasks.maximum (mapred-site.xml) | Int | 2 | The number of reduce tasks that may be run on a tasktracker at any one time.
mapred.child.java.opts (mapred-site.xml) | String | -Xmx200m | The JVM options used to launch the tasktracker child process that runs map and reduce tasks. This property can be set on a per-job basis, which can be useful for setting JVM properties for debugging, for example.

Source: Hadoop Definitive Guide, Tom White

Page 112
Configuration Properties
Property Name | Type | Default Value | Description

mapred.map.java.opts (mapred-site.xml) | String | -Xmx200m | The JVM options used for the child process that runs map tasks. From 0.21.
mapred.reduce.java.opts (mapred-site.xml) | String | -Xmx200m | The JVM options used for the child process that runs reduce tasks. From 0.21.

Source: Hadoop Definitive Guide, Tom White

Page 113
MapReduce
Submitting a Job
This consists of three portions

o The driver code

Code that runs on the client to configure and submit the job

o The Mapper

o The Reducer
Before we look at the code, we need to cover some basic Hadoop API concepts

Page 114
MapReduce
Getting Data to the Mapper
The data passed to the Mapper is specified by an InputFormat

o Specified in the driver code

o Defines the location of the input data

A file or directory, for example

o Determines how to split the input data into input splits

Each Mapper deals with a single input split

o InputFormat is a factory for RecordReader objects to extract (key, value) records from the input source
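
How an input file is divided into input splits can be sketched as follows. This is a deliberately simplified model that cuts the file into fixed-size ranges; it assumes a caller-supplied split size and ignores the block-location and rounding details that Hadoop's own FileInputFormat applies. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of input splitting; not the Hadoop implementation.
final class InputSplitSketch {

    /** Returns {start, length} pairs that together cover a file of the given length. */
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last split may be shorter than splitSize.
            splits.add(new long[] { start, Math.min(splitSize, fileLength - start) });
        }
        return splits;
    }
}
```

With a 300-byte file and a 128-byte split size, this yields three splits, (0, 128), (128, 128) and (256, 44), so three Mappers would each process one of them.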

Page 115
MapReduce Hands-on Development & Deployment (Day 2)

Page 116
Executing a MapReduce Program

MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. It
includes:
o A Mapper class containing the map() function
o A Reducer class containing the reduce() function
o A driver class which configures the MapReduce job in the main() function

For this demonstration, we will be:


Compiling and exporting a MapReduce job using commands
Running the MapReduce program using Hadoop
Tracking a MapReduce job using the Hadoop JobTracker
Examining the output of a MapReduce execution

Page 117
Executing a MapReduce Program
Compiling and exporting an MR job using commands

Step 1
Change the directory to where we will be compiling and exporting the MapReduce program.
This folder already contains the *.class and *.jar files but we will overwrite them for the sake of this exercise.
$ cd /home/cloudera/NYSE/bin

Step 2
Compile the *.class files using javac (Java compiler)

$ javac -classpath /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*:/home/cloudera/NYSE/src /home/cloudera/NYSE/src/AvgHigh.java /home/cloudera/NYSE/src/AvgHighMapper.java /home/cloudera/NYSE/src/AvgHighReducer.java

The arguments are, in order: the included library folders (colon delimited), the driver source file, the Mapper source file and the Reducer source file.

Step 3
Export the class files into a jar file.
$ jar cvf averageHigh.jar AvgHigh.class AvgHighMapper.class AvgHighReducer.class

Page 118
Executing a MapReduce Program
Running the MapReduce program using Hadoop

Step 1
Change the directory to where the jar file exists.
$ cd /home/cloudera/NYSE/bin

Step 2
Set HADOOP_CLASSPATH environment variable to the jar file

$ export HADOOP_CLASSPATH=averageHigh.jar

Step 3
Execute the MapReduce program using Hadoop
$ hadoop AvgHigh /user/cloudera/NYSE/data/EOD2013.txt output/AvgHigh

The arguments are, in order: the application name within the jar file, the input data on HDFS and the output folder on HDFS.

[Screenshot: beginning of the output from execution, with the job ID highlighted and the map() and reduce() function calls indicated.]

Page 119
Executing a MapReduce Program
Tracking a MapReduce Job using the Hadoop JobTracker

Page 120
Executing a MapReduce Program
Tracking a MapReduce Job using the Hadoop JobTracker

Page 121
Executing a MapReduce Program
Examining the output of a MapReduce execution
Step 1
Examine the files contained in the output folder specified when the MapReduce program was executed.
$ hadoop fs -ls output/AvgHigh

Step 2
Display the output (in HDFS) to the screen (stdout).
$ hadoop fs -cat output/AvgHigh/part-r-00000

Step 3
Make a local output folder.
$ mkdir /home/cloudera/NYSE/output

Step 4
Copy the output to your local (VM) file system.
$ hadoop fs -get output/AvgHigh /home/cloudera/NYSE/output/AvgHigh

Step 5
Display the output on your local system (VM) to the screen (stdout).
$ cat /home/cloudera/NYSE/output/AvgHigh/part-r-00000

Page 122
InputFormat - Hierarchy

Class hierarchy (org.apache.hadoop.mapred):

<<Interface>> InputFormat <K, V>
    FileInputFormat <K, V>
        CombineFileInputFormat <K, V>
        TextInputFormat
        KeyValueTextInputFormat
            StreamInputFormat
        NLineInputFormat
        SequenceFileInputFormat <K, V>
            SequenceFileAsBinaryInputFormat
            SequenceFileAsTextInputFormat
            SequenceFileInputFilter <K, V>
    <<Interface>> ComposableInputFormat <K, V>
        CompositeInputFormat <K, V>
    DBInputFormat <T>
    EmptyInputFormat <K, V>

FileInputFormat
o The base class used for all file-based InputFormats

TextInputFormat
o The default
o Treats each \n-terminated line of a file as a value.
o Key is the byte offset within the file of that line.

KeyValueTextInputFormat
o Maps \n-terminated lines as "key SEP value".
o By default the separator is a tab.

SequenceFileInputFormat
o Binary file of (key, value) pairs with some additional metadata

SequenceFileAsTextInputFormat
o Similar, but maps (key.toString(), value.toString())
Source: Hadoop Definitive Guide, Tom White
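
The "key SEP value" mapping used by KeyValueTextInputFormat can be sketched in plain Java as a split at the first separator. The class and method names are hypothetical, and the handling of a line with no separator (whole line as key, empty value) is an assumption of this sketch rather than something stated on the slide.

```java
// Plain-Java stand-in for splitting one text line into a key and a value.
final class KeyValueLineSketch {

    /** Splits a line at the first separator into {key, value}. */
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // Assumption: with no separator, the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }
}
```

Only the first separator matters: split("IBM\t190.5\tNYSE", '\t') gives the key "IBM" and the value "190.5\tNYSE". The sample line is made up for illustration.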

Page 123
OutputFormat - Hierarchy

Class hierarchy (org.apache.hadoop.mapred):

<<Interface>> OutputFormat <K, V>
    FileOutputFormat <K, V>
        TextOutputFormat <K, V>
        SequenceFileOutputFormat <K, V>
            SequenceFileAsBinaryOutputFormat
        MapFileOutputFormat
        MultipleOutputFormat <K, V>
            MultipleTextOutputFormat <K, V>
            MultipleSequenceFileOutputFormat
    NullOutputFormat <K, V>
    DBOutputFormat <K, V>
    FilterOutputFormat <K, V>
        LazyOutputFormat <K, V>
Source: Hadoop Definitive Guide, Tom White

Page 124
Writable & WritableComparable?
Hadoop defines its own box classes for strings, integers and so on
o IntWritable for ints
o LongWritable for longs
o FloatWritable for floats
o DoubleWritable for doubles
o Text for strings
o Etc.

Keys and values in Hadoop are Objects
o Values are objects which implement Writable
o Keys are objects which implement WritableComparable

The Writable interface makes serialization quick and easy for Hadoop

Any value's type must implement the Writable interface

Two WritableComparables can be compared against each other to determine their order
Keys must be WritableComparables because they are passed to the Reducer in sorted order
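
What a custom key type has to provide can be sketched without the Hadoop jars. Below, WritableLike is a local stand-in that mirrors the two methods of Hadoop's Writable interface (write and readFields), and SymbolDateKey is a hypothetical key that is both serializable and comparable. In a real job the class would implement Hadoop's WritableComparable instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Local stand-in mirroring the two methods of Hadoop's Writable interface. */
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

/** Hypothetical key type: serializable like a Writable and orderable like a Comparable. */
final class SymbolDateKey implements WritableLike, Comparable<SymbolDateKey> {
    String symbol = "";
    int yyyymmdd;

    SymbolDateKey() { }

    SymbolDateKey(String symbol, int yyyymmdd) {
        this.symbol = symbol;
        this.yyyymmdd = yyyymmdd;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(symbol);
        out.writeInt(yyyymmdd);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        symbol = in.readUTF();   // fields are read in the order they were written
        yyyymmdd = in.readInt();
    }

    @Override
    public int compareTo(SymbolDateKey other) {
        int bySymbol = symbol.compareTo(other.symbol);
        return bySymbol != 0 ? bySymbol : Integer.compare(yyyymmdd, other.yyyymmdd);
    }

    /** Serializes the key to bytes and reads it back into a fresh object. */
    static SymbolDateKey roundTrip(SymbolDateKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            SymbolDateKey copy = new SymbolDateKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The compareTo method is what would let keys of this type reach a Reducer in sorted order: first by symbol, then by date.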

Page 125
Driver & Jobs
The Driver Code
o The driver code runs on the client machine
o It configures the job, then submits it to the cluster

Creating a New Job Object
o The Job class allows you to set configuration options for your MapReduce job
o The classes to be used for your Mapper and Reducer
o The input and output directories
o Many other options
o Any options not explicitly set in your driver code will be read from your Hadoop configuration files
o Usually located in /etc/hadoop/conf
o Any options not specified in your configuration files will receive Hadoop's default values
o You can also use the Job object to submit the job, control its execution, and query its state

Determining Which Files to Read
o By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to Mappers
o Exceptions: items whose names begin with a period (.) or underscore (_)
o Globs can be specified to restrict input, e.g.: /2010/*/01/*
o Alternatively, FileInputFormat.addInputPath() can be called multiple times, specifying a single file or directory each time
o More advanced filtering can be performed by implementing a PathFilter

Specifying the InputFormat
o The default InputFormat (TextInputFormat) will be used unless you specify otherwise
o To use an InputFormat other than the default, use e.g.:
job.setInputFormatClass(KeyValueTextInputFormat.class)

Specifying Final Output with OutputFormat
o FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output
o The driver can also specify the format of the output data
o Default is a plain text file
o Could be explicitly written as
job.setOutputFormatClass(TextOutputFormat.class)

Running the Job
There are two ways to run your MapReduce job:
o job.waitForCompletion()
Blocks (waits for the job to complete before continuing)
o job.submit()
Does not block (driver code continues as the job is running)
The job determines the proper division of input data into InputSplits, and then sends the job information to the JobTracker daemon on the cluster
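
The default rule for which files are read (everything in the directory except names beginning with a period or underscore) can be sketched in plain Java. The class and method names are hypothetical, and the sample paths are made up; the sketch only mirrors the rule stated above and does not call the Hadoop PathFilter API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the default hidden-file rule applied to input paths.
final class InputFileFilterSketch {

    /** True when the last path component starts with '.' or '_'. */
    static boolean isHidden(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.startsWith(".") || name.startsWith("_");
    }

    /** Keeps only the paths that would be sent to Mappers under the default rule. */
    static List<String> visibleInputs(List<String> paths) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (!isHidden(path)) {
                kept.add(path);
            }
        }
        return kept;
    }
}
```

This is why the output folder of one job can be used directly as the input of the next: marker and log entries whose names begin with an underscore or period would be skipped, leaving only the part files.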

Page 126
MapReduce Architecture

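[Diagram: job submission flow between the Client Node (MapReduce program and JobClient in the client JVM), the JobTracker Node, the TaskTracker Node and the Shared Filesystem (e.g., HDFS)]
1. The MapReduce program runs the job through the JobClient
2. The JobClient gets a new job ID from the JobTracker
3. The JobClient copies the job resources to the shared filesystem
4. The JobClient submits the job to the JobTracker
5. The JobTracker initializes the job
6. The JobTracker retrieves the input splits from the shared filesystem
7. The TaskTracker sends a heartbeat to the JobTracker (which returns tasks)
8. The TaskTracker retrieves the job resources from the shared filesystem
9. The TaskTracker launches a child JVM
10. The child runs the MapTask or ReduceTask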

Page 127
Building a MapReduce Program

We will be utilizing Eclipse as the IDE to build our MapReduce program.

o IDE stands for Integrated Development Environment.


o Eclipse has become popular because it is free and can be used to program in various languages.
o Eclipse can be extended to customize the working environment for a specific programming language.
o Common elements of an IDE include a source code editor, build automation tools, and a debugger.

This demonstration covers:

Creating a new Java Project
The Eclipse workspace environment
Examining the data
Writable classes (data types)
Building the Mapper class
Building the Reducer class
Building the driver class
Exporting the JAR package
Executing the MapReduce job

Page 128
Building a MapReduce Program
Creating a new Java Project

Steps:
1. Open Eclipse
2. Go to File > New, select Java Project
3. Provide NYSE as the project name and modify the Location (uncheck Use default location) to:

/home/cloudera/NYSE/src

4. After clicking Next on the Create a Java Project window, browse to the Libraries tab. Click on Add External JARs
The external JARs are the compiled Hadoop libraries that let the program resolve its references to Hadoop / MapReduce code.

Page 129
Building a MapReduce Program
Creating a new Java Project (continued)

Steps (continued):
5. Select the *.jar files in the /usr/lib/hadoop folder
Click on File System
Open the usr folder, then the lib folder, then the hadoop folder

Page 130
Building a MapReduce Program
Creating a new Java Project (continued)

Steps (continued):
6. After clicking OK, again click on Add External JARs and select the *.jar files in the /usr/lib/hadoop/client-0.20 folder.
Click OK, then Finish to return to the workspace.

Page 131
Building a MapReduce Program
Eclipse Working Environment

[Figure: annotated Eclipse workspace. Labelled areas: open files, project, package, files, classes, functions, imported libraries, comments, class definition, function definition, libraries included when compiling]

Page 132
Building a MapReduce Program
Examining the Data

Preview the first 10 lines of the data file.


$ head -10 /home/cloudera/NYSE/data/EOD2013.txt

You'll notice the following output:

The fields of the EOD2013.txt data file are:


o Symbol, Date, Open, High, Low, Close, Volume

Page 133
Building a MapReduce Program
Writable classes (data types)

Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package:

o There are Writable wrappers for all the Java primitive types except char, which can be stored in an IntWritable.
o All wrappers have a get() and a set() method for retrieving and storing the wrapped value.

Java primitive   Writable class    Bytes
boolean          BooleanWritable   1
byte             ByteWritable      1
short            ShortWritable     2
int              IntWritable       4
                 VIntWritable      1-5
float            FloatWritable     4
long             LongWritable      8
                 VLongWritable     1-9
double           DoubleWritable    8

Other Writable classes:
NullWritable
Text
BytesWritable
MD5Hash
ObjectWritable
GenericWritable
ArrayWritable
ArrayPrimitiveWritable
TwoDArrayWritable
AbstractMapWritable
o MapWritable
o SortedMapWritable
EnumSetWritable
CompressedWritable
VersionedWritable
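
The 1-5 and 1-9 byte ranges listed for VIntWritable and VLongWritable come from a variable-length encoding in which small values take fewer bytes. The sketch below shows one size rule that reproduces those ranges. It is written from memory of how Hadoop sizes such values and has not been checked against a specific release, so the thresholds and the class name VarIntSize should be treated as assumptions.

```java
// Sketch only: a size rule for variable-length integers that reproduces the
// 1-5 byte (int) and 1-9 byte (long) ranges in the table. Thresholds are an
// assumption, not verified against a specific Hadoop release.
class VarIntSize {

    // Number of bytes a variable-length encoding of 'value' would occupy.
    static int encodedSize(long value) {
        if (value >= -112 && value <= 127) {
            return 1; // small values fit in a single byte
        }
        if (value < 0) {
            value ^= -1L; // one's complement, so the magnitude bits can be counted
        }
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(value);
        return (dataBits + 7) / 8 + 1; // payload bytes plus one length byte
    }

    public static void main(String[] args) {
        System.out.println(encodedSize(100));               // 1
        System.out.println(encodedSize(Integer.MAX_VALUE)); // 5
        System.out.println(encodedSize(Long.MAX_VALUE));    // 9
    }
}
```

The practical point is that fixed-width IntWritable and LongWritable always cost 4 and 8 bytes, while the variable-length forms trade a little CPU for smaller output when most values are small.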

Page 134
Building a MapReduce Program
Mapper Class: Import Commands

Import commands are used to include references to functionality used in code.


java.io.IOException: used to throw an error when needed
org.apache.hadoop.io.DoubleWritable: include the Hadoop defined DoubleWritable data type
org.apache.hadoop.io.LongWritable: include the Hadoop defined LongWritable data type
org.apache.hadoop.io.Text: include the Hadoop defined Text data type
org.apache.hadoop.mapreduce.Mapper: include the Mapper template which we will extend to make our own Mapper

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

Page 135
Building a MapReduce Program
Mapper Class: Building the Mapper class

The AvgHighMapper is a subclass of the Hadoop defined Mapper class


Our AvgHighMapper subclass extends the Mapper class
The Mapper takes an input <key, value> and outputs a <key, value>
The first parameter (LongWritable) is the input key data type
The second parameter (Text) is the input value data type
The third parameter (Text) is the output key data type
The fourth parameter (DoubleWritable) is the output value data type
The output <key, value> will be the input <key, value> for the Reducer

public class AvgHighMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

Page 136
Building a MapReduce Program
Mapper Class: Building the map function

The map function is the Mapper method that Hadoop calls once for each input record
The map function takes three parameters.
The key parameter is assigned automatically by the input format; with the default text input it is the byte offset of the line within the file. For most purposes, this is an arbitrary value.
The value parameter is the line that Hadoop is processing.
The context parameter is the object through which the Mapper writes its output.

public class AvgHighMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

    }
}

Page 137
Building a MapReduce Program
Mapper Class: Mapper skeleton

The skeleton below represents the key elements that are found in each Mapper.
The purple highlighted elements can be changed to work with your needs.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AvgHighMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // CODE GOES HERE
    }
}

Page 138
Building a MapReduce Program
Mapper Class: The code
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AvgHighMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Fields: Symbol, Date, Open, High, Low, Close, Volume
        String[] line = value.toString().split(",");
        // Skip lines with too few fields; emit (symbol, daily high)
        if (line.length > 3) {
            context.write(new Text(line[0]),
                    new DoubleWritable(Double.parseDouble(line[3])));
        }
    }
}
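
The parsing step inside map can be tried without a cluster. The sketch below repeats the same split-and-pick logic in plain Java; HighParser is a hypothetical helper, the sample record is made up, and returning null for a non-numeric High field (for example a header row) is an added assumption rather than something the Mapper above does.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper: the parsing the map method performs, minus Hadoop types.
class HighParser {

    // Returns (symbol, high), or null when the line has too few fields
    // or the High field is not a number.
    static Map.Entry<String, Double> parse(String line) {
        String[] fields = line.split(",");
        if (fields.length <= 3) {
            return null;
        }
        try {
            return new SimpleEntry<>(fields[0], Double.valueOf(fields[3]));
        } catch (NumberFormatException e) {
            return null; // e.g. a header row
        }
    }

    public static void main(String[] args) {
        // Made-up record in Symbol,Date,Open,High,Low,Close,Volume order
        Map.Entry<String, Double> kv = parse("AAA,2013-01-02,10.0,12.5,9.5,11.0,1000");
        System.out.println(kv.getKey() + " " + kv.getValue()); // AAA 12.5
    }
}
```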

Page 139
Building a MapReduce Program
Reducer Class: Import Commands

Import commands are used to include references to functionality used in code.


java.io.IOException: used to throw an error when needed
org.apache.hadoop.io.DoubleWritable: include the Hadoop defined DoubleWritable data type
org.apache.hadoop.io.Text: include the Hadoop defined Text data type
org.apache.hadoop.mapreduce.Reducer: include the Reducer template which we will extend to make our own Reducer

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

Page 140
Building a MapReduce Program
Reducer Class: Building the Reducer class

The AvgHighReducer is a subclass of the Hadoop defined Reducer class


Our AvgHighReducer subclass extends the Reducer class
The Reducer takes an input <key, list of values> and outputs a <key, value>
The first parameter (Text) is the input key data type and matches the output key data type of the Mapper
The second parameter (DoubleWritable) is the input value data type and matches the output value data type of the Mapper
The third parameter (Text) is the output key data type
The fourth parameter (DoubleWritable) is the output value data type

public class AvgHighReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

Page 141
Building a MapReduce Program
Reducer Class: Building the reduce function

The reduce function is the Reducer method that Hadoop calls once for each distinct key
The reduce function takes three parameters.
The key parameter is an intermediate key emitted by the Mapper (here, the ticker symbol).
The values parameter iterates over every value the Mappers emitted for that key.
The context parameter is the object through which the Reducer writes its output.

public class AvgHighReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {

    }
}

Page 142
Building a MapReduce Program
Reducer Class: Reducer skeleton

The skeleton below represents the key elements that are found in each Reducer.
The purple highlighted elements can be changed to work with your needs.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgHighReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        // CODE GOES HERE
    }
}

Page 143
Building a MapReduce Program
Reducer Class: The code
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgHighReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double symbolCount = 0.0;
        double dailyHighs = 0.0;
        for (DoubleWritable value : values) {
            dailyHighs += value.get();
            symbolCount++;
        }
        context.write(key, new DoubleWritable(dailyHighs / symbolCount));
    }
}
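
To see what the Reducer receives and produces, the grouping done by the framework and the averaging done in reduce can be imitated locally. This is a minimal sketch in plain Java: AvgHighLocal, group and average are hypothetical names, the TreeMap only stands in for the shuffle and sort, and the input pairs are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the shuffle plus AvgHighReducer; no Hadoop types.
class AvgHighLocal {

    // Shuffle stand-in: gathers every value emitted for the same key, keys sorted.
    static Map<String, List<Double>> group(List<String> keys, List<Double> values) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.size(); i++) {
            grouped.computeIfAbsent(keys.get(i), k -> new ArrayList<>()).add(values.get(i));
        }
        return grouped;
    }

    // Same arithmetic as the reduce method: sum the highs, divide by their count.
    static double average(Iterable<Double> highs) {
        double count = 0.0;
        double sum = 0.0;
        for (double high : highs) {
            sum += high;
            count++;
        }
        return sum / count;
    }

    public static void main(String[] args) {
        // Made-up mapper output: (symbol, daily high)
        Map<String, List<Double>> grouped = group(
                Arrays.asList("AAA", "BBB", "AAA"),
                Arrays.asList(10.0, 40.0, 20.0));
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + " " + average(e.getValue()));
        }
        // prints "AAA 15.0" then "BBB 40.0"
    }
}
```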

Page 144
Building a MapReduce Program
Driver Class: Import Commands

Import commands are used to include references to functionality used in code.


org.apache.hadoop.fs.Path: Hadoop defined Path data type
org.apache.hadoop.io.DoubleWritable: Hadoop DoubleWritable data type
org.apache.hadoop.io.Text: Hadoop defined Text data type
org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Input file
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat: Output file
org.apache.hadoop.mapreduce.Job: Hadoop MapReduce Job specifications

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

Page 145
Building a MapReduce Program
Driver Class: The code
public class AvgHigh {
    public static void main(String[] args) throws Exception {
        // Expect exactly two arguments: the input and output locations
        if (args.length != 2) {
            System.out.printf("Usage: AvgHigh <input dir> <output dir>\n");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(AvgHigh.class);
        job.setJobName("Average Ticker Highs");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(AvgHighMapper.class);
        job.setReducerClass(AvgHighReducer.class);
        // Output types of the Reducer (and, by default, of the Mapper)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        // Block until the job finishes; exit 0 on success, 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Page 146
Building a MapReduce Program
Exporting the JAR package

Steps:
1. Right click the (default
package) and select the Export option
2. Select JAR File under the Java folder and
press Next

Page 147
Building a MapReduce Program
Exporting the JAR package (continued)

Steps: (continued)
3. Provide the following as the destination of the JAR file and hit Finish.

/home/cloudera/NYSE/bin/averageHigh.jar

At this point, we are at the same point we were after the "Executing a MapReduce Program: Compiling and exporting an MR job using commands" slide.
We have exported a *.jar file using Eclipse instead of doing it from the Terminal as we did before.

Page 148
Executing a MapReduce Program
Executing the MapReduce Job
Step 1
Change to the directory where the jar file exists.
$ cd /home/cloudera/NYSE/bin

Step 2
Set HADOOP_CLASSPATH environment variable to the jar file
$ export HADOOP_CLASSPATH=averageHigh.jar

Step 3
Remove the output/AvgHigh folder in HDFS that was created before. The -skipTrash option removes it immediately (no trash).
$ hadoop fs -rm -r -skipTrash output/AvgHigh

Step 4
Execute the MapReduce program using Hadoop
$ hadoop AvgHigh /user/cloudera/NYSE/data/EOD2013.txt output/AvgHigh

Step 5
Review the results
$ hadoop fs -cat output/AvgHigh/part-r-00000

Page 149
The Streaming API

Many organizations have developers skilled in languages other than Java, such as:

o Ruby
o Python
o Perl

The Streaming API allows developers to use any language they wish to write Mappers and Reducers
As long as the language can read from standard input and write to standard output

Advantages of the Streaming API
o No need for non-Java coders to learn Java
o Fast development time
o Ability to use existing code libraries

Disadvantages of the Streaming API
o Performance
o Primarily suited for handling data that can be represented as text
o Streaming jobs can use excessive amounts of RAM or fork excessive numbers of processes
o Although Mappers and Reducers can be written using the Streaming API, Partitioners, InputFormats etc. must still be written in Java
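
The contract a Streaming Mapper has to honour is small: read records from standard input and write key and value, separated by a tab, to standard output. The sketch below shows that shape. It is written in Java only to stay consistent with the rest of this deck (in practice the script would usually be Ruby, Python, or Perl), and the class name StdinStdoutMapper and the Symbol,Date,Open,High,... field order are assumptions carried over from the NYSE example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch of a Streaming-style mapper: records in on stdin, "key<TAB>value" out on stdout.
class StdinStdoutMapper {

    // Turns one "Symbol,Date,Open,High,..." line into "symbol<TAB>high", or null to skip it.
    static String mapLine(String line) {
        String[] fields = line.split(",");
        return fields.length > 3 ? fields[0] + "\t" + fields[3] : null;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            String out = mapLine(line);
            if (out != null) {
                System.out.println(out);
            }
        }
    }
}
```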

Page 150
More on the Hadoop API

The ToolRunner class


How to improve the efficiency of intermediate data with Combiners
The setup and cleanup methods
How to write custom Partitioners for better load balancing
How to access HDFS programmatically
How to use the distributed cache

Page 151
Why Use ToolRunner?

You can use ToolRunner in MapReduce driver classes
o This is not required, but is a best practice

ToolRunner uses the GenericOptionsParser class internally
o Allows you to specify configuration options on the command line
o Also allows you to specify items for the Distributed Cache on the command line (see later)

Page 152
The Combiner

Often, Mappers produce large amounts of intermediate data
o That data must be passed to the Reducers
o This can result in a lot of network traffic

It is often possible to specify a Combiner
o Like a mini-Reducer
o Runs locally on a single Mapper's output
o Output from the Combiner is sent to the Reducers
o Input and output data types for the Combiner/Reducer must be identical

Combiner and Reducer code are often identical

A Combiner would decrease the amount of data sent to the Reducer
Combiners decrease the amount of network traffic required during the shuffle and sort phase

Page 153
Specifying a Combiner
To specify the Combiner class to be used in your MapReduce code, put the following line
in your Driver:
job.setCombinerClass(YourCombinerClass.class);

The Combiner uses the same interface as the Reducer
o Takes in a key and a list of values
o Outputs zero or more (key, value) pairs
o The actual method called is the reduce method in the class

VERY IMPORTANT: The Combiner may run zero times, once, or more than once, on the output from any given Mapper
o Do not put code in the Combiner which could influence your results if it runs more than once
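
Averaging is a case where this warning matters: an average of partial averages is generally not the overall average, so a Reducer that averages, such as the AvgHighReducer earlier in this deck, would change the result if it were reused unchanged as a Combiner. One common way around it is to have the Combiner pass along (sum, count) pairs and divide only in the Reducer. The sketch below illustrates the arithmetic in plain Java; the class name and the sample values are made up.

```java
// Sketch: why an averaging Reducer is not safe to reuse as a Combiner,
// and how carrying (sum, count) pairs avoids the problem.
class CombinerSafety {

    static double average(double... values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Partial aggregate for one batch of values: {sum, count}.
    static double[] partial(double... values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return new double[] {sum, values.length};
    }

    // Merging partials is associative, so it can safely run any number of times.
    static double[] merge(double[] a, double[] b) {
        return new double[] {a[0] + b[0], a[1] + b[1]};
    }

    public static void main(String[] args) {
        // Made-up highs for one symbol: Mapper 1 saw 10, 20, 30; Mapper 2 saw 40.
        double averageOfAverages = average(average(10, 20, 30), average(40));
        double[] merged = merge(partial(10, 20, 30), partial(40));
        System.out.println(averageOfAverages);     // 30.0 (wrong)
        System.out.println(merged[0] / merged[1]); // 25.0 (correct)
    }
}
```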

Page 154
The setup and cleanup Methods
It is common to want your Mapper or Reducer to execute some code before the map or reduce method is called
o Initialize data structures
o Read data from an external file
o Set parameters

The setup method is run before the map or reduce method is called for the first time

public void setup(Context context)

Similarly, you may wish to perform some action(s) after all the records have been processed by your Mapper or Reducer
The cleanup method is called before the Mapper or Reducer terminates

public void cleanup(Context context) throws IOException, InterruptedException

Page 155
What Does The Partitioner Do?
The Partitioner divides up the keyspace
o Controls which Reducer each intermediate key and its associated values goes to

Often, the default behavior is fine
o Default is the HashPartitioner

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
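
The bit mask in getPartition is there because hashCode() can be negative, and in Java a negative number modulo a positive one is negative, which would not be a valid partition index. The sketch below repeats the same expression in plain Java so it can be run on its own; HashPartitionDemo is a hypothetical class name and the keys are made up.

```java
// Sketch: the HashPartitioner expression on its own, without Hadoop types.
class HashPartitionDemo {

    static int partition(Object key, int numReduceTasks) {
        // Clearing the sign bit keeps the result in [0, numReduceTasks),
        // even when hashCode() is negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // An Integer's hashCode is its own value, so -7 is a convenient negative hash.
        System.out.println(-7 % 4);                          // -3, not a usable index
        System.out.println(partition(Integer.valueOf(-7), 4)); // 1
        // The same key always lands in the same partition.
        System.out.println(partition("AAA", 4) == partition("AAA", 4)); // true
    }
}
```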

Page 156
Creating a Custom Partitioner
Step 1: Create a class for the custom Partitioner that extends Partitioner

public class MyPartitioner<K, V> extends Partitioner<K, V> {
    // variables and methods
}

Step 2: Create a method in the class called getPartition
Receives the key, the value, and the number of Reducers
Should return an int between 0 and one less than the number of Reducers
e.g., if it is told there are 10 Reducers, it should return an int between 0 and 9

public int getPartition(K key, V value, int numReduceTasks) { /* do something */ }

Step 3: Specify the custom Partitioner in your driver code

job.setPartitionerClass(MyPartitioner.class);
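
As an illustration of the getPartition contract, the sketch below assigns ticker symbols to Reducers by the first letter of the symbol, so that each Reducer receives a contiguous alphabetical range. It is plain Java without the Partitioner base class; the class name and the first-letter rule are hypothetical, and whether such a rule balances load well depends entirely on the data.

```java
// Sketch of custom partition logic: contiguous alphabetical ranges by first letter.
class FirstLetterPartitionSketch {

    // Returns a value between 0 and numReduceTasks - 1.
    static int getPartition(String symbol, int numReduceTasks) {
        if (symbol.isEmpty()) {
            return 0;
        }
        char first = Character.toUpperCase(symbol.charAt(0));
        if (first < 'A' || first > 'Z') {
            return 0; // digits and punctuation are parked in partition 0
        }
        return (first - 'A') * numReduceTasks / 26;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("AAPL", 10)); // 0
        System.out.println(getPartition("MSFT", 10)); // 4
        System.out.println(getPartition("ZNGA", 10)); // 9
    }
}
```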

Page 157
The MapReduce Flow: Shuffle and Sort
[Figure: MapReduce data flow on two nodes (Node 1 and Node 2). Each node runs the same pipeline:]

1. Files are loaded from local HDFS stores
2. The InputFormat divides the files into Splits
3. RecordReaders (RR) turn each Split into input (k, v) pairs
4. map tasks produce intermediate (k, v) pairs
5. The Partitioner assigns intermediate pairs to Reducers
6. Shuffling process: intermediate (k, v) pairs are exchanged by all nodes
7. Sort
8. Reduce produces final (k, v) pairs
9. The OutputFormat handles write-back to the local HDFS store

Page 158
The FileSystem API
Some useful API methods:
o FSDataOutputStream create(...)
Extends java.io.DataOutputStream
Provides methods for writing primitives, raw bytes etc.
o FSDataInputStream open(...)
Extends java.io.DataInputStream
Provides methods for reading primitives, raw bytes etc.
o boolean delete(...)
o boolean mkdirs(...)
o void copyFromLocalFile(...)
o void copyToLocalFile(...)
o FileStatus[] listStatus(...)

Copyright 2010-2013 EY. All rights reserved. Not to be reproduced without prior written consent.

Page 159
The FileSystem API: Directory Listing
Get a directory listing:

Path p = new Path("/my/path");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus[] fileStats = fs.listStatus(p);

for (int i = 0; i < fileStats.length; i++) {
    Path f = fileStats[i].getPath();
    // do something
}

Page 160
The Distributed Cache

A common requirement is for a Mapper or Reducer to need access to some side data
o Lookup tables
o Dictionaries
o Standard configuration values

Option 1: read directly from HDFS in the setup method
o Works, but is not scalable

Option 2: The Distributed Cache provides an API to push data to all slave nodes
o Transfer happens behind the scenes before any task is executed
o Note: Distributed Cache is read-only
o Files in the Distributed Cache are automatically deleted from slave nodes when the job finishes

Page 161
Using the Distributed Cache: Via coding
Place the files into HDFS
Configure the Distributed Cache in your driver code

Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat"), conf);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), conf);

JAR files added with addFileToClassPath will be added to your Mapper's or Reducer's classpath
Files added with addCacheArchive will automatically be dearchived/decompressed

Page 162
Using the DistributedCache: Command line
If you are using ToolRunner, you can add files to the Distributed Cache directly from the command line when you run the job
o No need to copy the files to HDFS first

Use the -files option to add files (a comma-separated list with no spaces)

hadoop jar myjar.jar MyDriver -files file1,file2,file3,...

The -archives flag adds archived files, and automatically unarchives them on the destination machines
The -libjars flag adds jar files to the classpath

Page 163
Accessing Files in the Distributed Cache
Files added to the Distributed Cache are made available in your task's local working directory
o Access them from your Mapper or Reducer the way you would read any ordinary local file

File f = new File("file_name_here");
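
Because the cached file looks like an ordinary local file, loading it into memory (typically from the setup method) needs nothing Hadoop-specific. The sketch below reads a tab-separated lookup file into a map. LookupLoader, the tab-separated layout, and the sample entries are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for loading side data, e.g. from a Mapper's setup method.
class LookupLoader {

    // Parses "key<TAB>value" lines into a map; lines without a tab are ignored.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) {
                table.put(kv[0], kv[1]);
            }
        }
        return table;
    }

    // Reads the cached file by its bare name from the task's working directory.
    static Map<String, String> load(String fileName) throws IOException {
        return parse(Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Made-up lookup content standing in for a cached file
        Map<String, String> names = parse(Arrays.asList("AAA\tAcme Corp", "BBB\tBeta Inc"));
        System.out.println(names.get("AAA")); // Acme Corp
    }
}
```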

Page 164
Reusable Classes for the New API
The org.apache.hadoop.mapreduce.lib.* packages contain a library of Mappers, Reducers, and Partitioners supporting the new API

Example classes:
o InverseMapper: Swaps keys and values
o RegexMapper: Extracts text based on a regular expression
o IntSumReducer, LongSumReducer: Add up all values for a key
o TotalOrderPartitioner: Reads a previously-created partition file and partitions based on the data from that file
Sample the data first to create the partition file
Allows you to partition your data into n partitions without hard-coding the partitioning information
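
The idea behind partitioning from a file of sampled split points can be sketched with a binary search: n sorted split points define n + 1 key ranges, and a key goes to the range it falls into, so partition i only ever holds keys smaller than those in partition i + 1. This is a simplified illustration, not the TotalOrderPartitioner implementation; the class name is hypothetical and the split points are hard-coded here rather than sampled.

```java
import java.util.Arrays;

// Sketch of range partitioning driven by sorted split points.
class RangePartitionSketch {

    // splitPoints must be sorted; n split points define n + 1 partitions.
    // A key equal to a split point goes to the partition on its right.
    static int getPartition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] splitPoints = {"G", "P"}; // made-up; normally derived by sampling
        System.out.println(getPartition("AAPL", splitPoints)); // 0
        System.out.println(getPartition("IBM", splitPoints));  // 1
        System.out.println(getPartition("XOM", splitPoints));  // 2
    }
}
```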

Page 165
Most Common InputFormats
Most common InputFormats:
o TextInputFormat

o KeyValueTextInputFormat

o SequenceFileInputFormat

Others are available


o NLineInputFormat
Every n lines of an input file is treated as a separate InputSplit
Configure in the driver code by setting:
mapreduce.input.lineinputformat.linespermap
o MultiFileInputFormat
Abstract class that manages the use of multiple files in a single task
You must supply a getRecordReader() implementation

Page 166
How FileInputFormat Works
All file-based InputFormats inherit from FileInputFormat
FileInputFormat computes InputSplits based on the size of each file, in bytes
o HDFS block size is used as upper bound for InputSplit size
o Lower bound can be specified in your driver code
o This means that an InputSplit typically correlates to an HDFS block

So the number of Mappers will equal the number of HDFS blocks of input data to be processed

Important: InputSplits do not respect record boundaries!
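
The relationship between block size, the configurable bounds, and the number of Mappers can be worked through numerically. The sketch below assumes the split size is the block size clamped between a minimum and a maximum, which is how the new-API FileInputFormat is generally described; the class name is hypothetical, and the split count ignores the small slack real implementations allow on the last split.

```java
// Sketch: how split size and split count follow from block size and bounds.
class SplitMath {

    // Block size clamped to [minSize, maxSize]; a large minSize wins over the block size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified count: one split per full or partial splitSize chunk of the file.
    static long countSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(64 * mb, 1, Long.MAX_VALUE);
        System.out.println(splitSize / mb);                    // 64
        System.out.println(countSplits(200 * mb, splitSize));  // 4
    }
}
```

With a 64 MB block size and default bounds, a 200 MB file therefore yields four splits, and so four Mappers: three full blocks and one 8 MB remainder.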

Page 167
What RecordReaders Do
InputSplits are handed to the RecordReaders
o Specified by the path, starting position offset, and length

RecordReaders must:
o Ensure each (key, value) pair is processed
o Ensure no (key, value) pair is processed more than once
o Handle (key, value) pairs which are split across Input Splits Not a good idea
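
One way a line-oriented reader can meet all three requirements is a simple convention: every split except the first throws away its first (possibly partial) line, and every split keeps reading past its own end until the line it has started is complete. The sketch below applies that convention to an in-memory string. It is modelled on the general approach and is not Hadoop's own reader; the class name and the sample data are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: reading whole lines from byte-range splits that ignore line boundaries.
class SplitLineReader {

    // Returns the lines owned by the split [start, start + length).
    static List<String> readSplit(String data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // The first line here belongs to the previous split's reader: skip it.
            int newline = data.indexOf('\n', pos);
            pos = (newline < 0) ? data.length() : newline + 1;
        }
        List<String> lines = new ArrayList<>();
        // A line that starts at or before 'end' is read in full, even past 'end'.
        while (pos <= end && pos < data.length()) {
            int newline = data.indexOf('\n', pos);
            int lineEnd = (newline < 0) ? data.length() : newline;
            lines.add(data.substring(pos, lineEnd));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String data = "AAA,1\nBBB,2\nCCC,3\n";
        // The boundary at offset 8 falls in the middle of the "BBB,2" line.
        System.out.println(readSplit(data, 0, 8));  // [AAA,1, BBB,2]
        System.out.println(readSplit(data, 8, 10)); // [CCC,3]
    }
}
```

Every line is produced exactly once, no matter where the boundary falls, at the cost of a split occasionally reading a few bytes that are stored in the next block.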

Page 168
OutputFormat
OutputFormats work much like InputFormat classes
Custom OutputFormats must provide a RecordWriter implementation

Page 169
Compressions
Compression Format   Tool    Algorithm   File Extension   Splittable
DEFLATE (a)          N/A     DEFLATE     .deflate         No
gzip                 gzip    DEFLATE     .gz              No
bzip2                bzip2   bzip2       .bz2             Yes
LZO                  lzop    LZO         .lzo             No (b)
Snappy               N/A     Snappy      .snappy          No

(a) DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available command line tool for producing files in DEFLATE format, as gzip is normally used. (Note that the gzip file format is DEFLATE with extra headers and a footer.) The .deflate file extension is a Hadoop convention.

(b) LZO files are splittable if they have been indexed in a pre-processing step.

Compression Format   Hadoop Compression Codec
DEFLATE              org.apache.hadoop.io.compress.DefaultCodec
gzip                 org.apache.hadoop.io.compress.GzipCodec
bzip2                org.apache.hadoop.io.compress.BZip2Codec
LZO                  org.apache.hadoop.io.compress.LzopCodec
Snappy               org.apache.hadoop.io.compress.SnappyCodec

Source: Hadoop Definitive Guide, Tom White

Page 170
Hadoop and Compressed Files
Hadoop understands a variety of file compression formats
o Including GZip
If a compressed file is included as one of the files to be processed, Hadoop will automatically decompress it and pass the decompressed contents to the Mapper
o There is no need for the developer to worry about decompressing the file

However, GZip is not a splittable file format
o A GZipped file can only be decompressed by starting at the beginning of the file and continuing on to the end
o You cannot start decompressing the file part of the way through it
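
This restriction can be observed with the JDK's own gzip classes, independently of Hadoop. The sketch below compresses some made-up records and then tries to decompress once from the start and once from just past the 10-byte gzip header; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: a gzip stream can be read from its beginning but not from the middle.
class GzipSplitDemo {

    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Tries to decompress starting 'offset' bytes into the compressed data.
    static boolean canDecompressFrom(byte[] compressed, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(compressed, offset, compressed.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // drain the stream
            }
            return true;
        } catch (IOException e) {
            return false; // no valid gzip header at this position
        }
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            text.append("AAA,2013-01-02,10.0,12.5,9.5,11.0,1000\n"); // made-up record
        }
        byte[] compressed = gzip(text.toString().getBytes(StandardCharsets.UTF_8));
        System.out.println(canDecompressFrom(compressed, 0));  // true
        System.out.println(canDecompressFrom(compressed, 10)); // false
    }
}
```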

Page 171
Non-Splittable File Formats and Hadoop
If the MapReduce framework receives a non-splittable file (such as a GZipped file) it passes the entire file to a single Mapper
This can result in one Mapper running for far longer than the others
o It is dealing with an entire file, while the others are dealing with smaller portions of files
o Speculative execution could occur
Although this will provide no benefit

Typically it is not a good idea to use GZip to compress files which will be processed by MapReduce

Page 172
Snappy Codec
Splittable Compression for SequenceFiles and Avro Files Using the Snappy Codec

Snappy is a relatively new compression codec

o Developed at Google

o Very fast
Snappy does not compress a SequenceFile and produce, e.g., a file with a .snappy extension
o Instead, it is a codec that can be used to compress data within a file
o That data can be decompressed automatically by Hadoop (or other programs) when the file is read
o Works well with SequenceFiles, Avro files


Snappy is now preferred over LZO

Page 173
Map Reduce Patterns
Summarization Patterns
o Numerical Summarization
o Inverted Index Summarization
o Counting with Hadoop Counters
Filtering Patterns
o Normal Filtering
o Bloom Filtering
o Top Ten Filtering
Data Organization Patterns
o Structured to Hierarchical
o Partitioning
o Binning
o Total Order Sorting
o Shuffling
Join Patterns
o Reduce Side Join
o Replicated Join
o Composite Join
o Cartesian Product
Meta Patterns
o Job Chaining
o Chain Folding
o Job Merging
Input Output Patterns
o Customizing Input & Output
o Generating Data
o External Source Output
o External Source Input
o Partition Pruning

Page 174
Map-Side and Reduce-Side Joins: Pattern Overview
We frequently need to join data together from two sources as part of a MapReduce job, such as
o Lookup tables
o Data from database tables

There are two fundamental approaches: Map-side joins and Reduce-side joins
Map-side joins are easier to write, but have potential scaling issues
But first...
Avoid writing joins in Java MapReduce if you can!
Abstractions such as Pig and Hive are much easier to use
o Save hours of programming

If you are dealing with text-based data, there really is no reason not to use Pig or Hive
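
For readers who do need to write one by hand, the simplest Map-side variant is the replicated join from the pattern list: the smaller dataset is held in memory and each record of the larger dataset is matched against it as it streams through the Mapper. The sketch below shows that logic in plain Java. The class name, the "symbol,high" record layout, and the company names are made up, and the approach assumes the lookup table fits in each task's memory, which is one source of the scaling issues mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a replicated (map-side) inner join against an in-memory lookup table.
class ReplicatedJoinSketch {

    // Joins "symbol,rest" records with the lookup; unmatched records are dropped.
    static List<String> join(List<String> records, Map<String, String> lookup) {
        List<String> joined = new ArrayList<>();
        for (String record : records) {
            String[] fields = record.split(",", 2);
            String name = lookup.get(fields[0]);
            if (name != null && fields.length == 2) {
                joined.add(fields[0] + "," + name + "," + fields[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> lookup = new HashMap<>();
        lookup.put("AAA", "Acme Corp"); // made-up side data
        List<String> records = Arrays.asList("AAA,12.5", "ZZZ,1.0");
        System.out.println(join(records, lookup)); // [AAA,Acme Corp,12.5]
    }
}
```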

Page 175
Conclusion
Day 2

Questions?

Page 176
Pig deep dive - Theory
Day 3

Page 177
Hive and Pig: Why?
MapReduce code is typically written in Java

o Although it can be written in other languages using Hadoop Streaming


Requires:

o A programmer

o Who is a good Java programmer

o Who understands how to think in terms of MapReduce

o Who understands the problem they're trying to solve

o Who has enough time to write and test the code

o Who will be available to maintain and update the code in the future as requirements change

Page 178
Hive and Pig
Many organizations have only a few developers who can write good MapReduce code
Meanwhile, many other people want to analyze data

o Business analysts

o Data scientists

o Statisticians

o Data analysts
What's needed is a higher level of abstraction on top of MapReduce

o Providing the ability to query the data without needing to know MapReduce

o Hive and Pig address these needs

Page 179
Hive and Pig
Introduction

Pig was originally created at Yahoo! to answer a similar need to Hive

o Many developers did not have the Java and/or MapReduce knowledge required to write standard
MapReduce programs

o But still needed to query data


Pig is a high level platform for creating MapReduce programs

o Language is called PigLatin

o Relatively simple syntax

o Under the covers, PigLatin scripts are turned into MapReduce jobs and executed on the cluster
Installation of Pig requires no modification to the cluster
The Pig interpreter runs on the client machine

o Turns PigLatin into standard Java MapReduce jobs, which are then submitted to the JobTracker
There is (currently) no shared metadata, so no need for a shared metastore of any kind

Page 180
Pig
Pig Philosophy

Pigs eat anything


o Pig can operate on data whether it has metadata or not, and whether it is relational, nested, or unstructured
Pig lives anywhere
o Pig is intended to be a language for parallel data processing. It is not tied to any
framework
Pigs are domestic animals
o It is designed to be easily controlled and modified by its users
o Allows integration of users' code wherever possible
o Supports user provided load and store functions
o Supports external executables via its stream command and MapReduce via JARs
Pigs Fly
o Processes data quickly.

Page 181
Pig Overview
Pig Latin provides a higher order abstraction for implementing MapReduce jobs. It constitutes a data flow language which is
made up of a series of operations and transformations that are applied to the input data

Key features of Pig Latin


o Ease of programming - Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as
data flow sequences, making them easy to write, understand, and maintain.
o Optimization opportunities - The way in which tasks are encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather than efficiency.
o Extensibility - Users can create their own functions to do special-purpose processing.

Pig Latin Statement


o Pig Latin statements are the basic constructs you use to process data using Pig.
o A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
o Pig Latin statements may include expressions and schemas.

How are Pig Latin statements organized?
Pig Latin statements are organized as a sequence of steps such that each step represents a transformation applied to some data

LOAD: Load statements read data from the file system
TRANSFORMATION: Transformation statements process data read from the file system
DUMP / STORE: The DUMP statement displays results; the STORE statement saves the results

Page 182
Pig Vs. SQL
PigLatin is a data flow language
o Allows users to describe how data from one or more inputs should be read, processed and then stored to one or more outputs in parallel
o A Pig Latin script describes a directed acyclic graph (DAG), where edges are data flows and the nodes are operators that process the data
o Allows users to describe how to process the input data
o Schemas are dynamic

SQL is a query language
o Allows users to form queries
o Allows users to describe what questions they want answered, not how
o Focused around answering one question. Complex processing requires temp tables, sub queries or multiple procedures
o Schemas are static and constraints are enforced

Page 183
PigLatin Concepts
Concepts
o In Pig, a single element of data is an atom
o A collection of atoms, such as a row or a partial row, is a tuple
o Tuples are collected together into bags
o Typically, a PigLatin script starts by loading one or more datasets into bags, and then creates new bags by modifying those it already has

Features
o Pig supports many features which allow developers to perform complex data analysis without having to write Java MapReduce code
Joining datasets
Grouping data
Referring to elements by position rather than name
o Useful for datasets with many elements
Loading non-delimited data using a custom SerDe
Creation of user-defined functions written in Java
And more

Page 184
PigLatin
Sample Pig Script

tkrs = LOAD 'NYSE_Daily' AS (sym, open, high, low, close);
tops = FILTER tkrs BY high > 100;
srtd = ORDER tops BY close DESC;
STORE srtd INTO 'best_tickers';

Here, we load a directory of data into a bag called tkrs

Then we create a new bag called tops which contains just those records where the high portion is greater than 100

Next, we sort tops by close in descending order to create the bag srtd

Finally, we write the contents of the srtd bag to a new directory in HDFS
o By default, the data will be written in tab-separated format

Alternatively, to write the contents of a bag to the screen, say

DUMP srtd;
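
For readers who think more readily in Java, the same data flow can be written as a local filter followed by a sort. The sketch below is only an analogy for what the script expresses (it runs on an in-memory list, not on a cluster); the Tick class, the method name, and the sample rows are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Local analogy for the script: FILTER by high, then ORDER by close descending.
class BestTickersSketch {

    // One tuple with the fields named in the LOAD ... AS clause.
    static final class Tick {
        final String sym;
        final double open, high, low, close;

        Tick(String sym, double open, double high, double low, double close) {
            this.sym = sym;
            this.open = open;
            this.high = high;
            this.low = low;
            this.close = close;
        }
    }

    static List<Tick> bestTickers(List<Tick> tkrs) {
        List<Tick> tops = new ArrayList<>();
        for (Tick t : tkrs) {
            if (t.high > 100) { // FILTER tkrs BY high > 100
                tops.add(t);
            }
        }
        // ORDER tops BY close DESC
        tops.sort(Comparator.comparingDouble((Tick t) -> t.close).reversed());
        return tops;
    }

    public static void main(String[] args) {
        List<Tick> tkrs = Arrays.asList(
                new Tick("AAA", 95, 120, 90, 110),
                new Tick("BBB", 40, 50, 35, 45),
                new Tick("CCC", 140, 150, 130, 145));
        for (Tick t : bestTickers(tkrs)) {
            System.out.println(t.sym + " " + t.close); // like DUMP srtd
        }
        // prints "CCC 145.0" then "AAA 110.0"
    }
}
```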

Page 185
PigLatin
Data Types/Functions/Operations

Scalar Types
o INT
o LONG
o FLOAT
o DOUBLE
o CHARARRAY
o BYTEARRAY

Complex Types
o MAP - chararray to data element mapping
o TUPLE - fixed length, ordered collection
o BAG - unordered collection of tuples

Relational Operations
o Foreach
o Filter
o Group
o Order by
o Distinct
o Join
o Sample
o Parallel

Input/Output
o Load
o Store
o Dump

UDF

Page 186
PigLatin
Using the Grunt Shell to Run PigLatin

Starting Grunt

$ pig
grunt>

Useful Commands

$ pig -help (or -h)


$ pig -version (-i)
$ pig -execute (-e)
$ pig script.pig

Page 188
PigLatin
More PigLatin

To view the structure of a bag:

DESCRIBE bagname;

Joining two datasets


data1 = LOAD 'data1' AS (col1, col2, col3, col4);
data2 = LOAD 'data2' AS (colA, colB, colC);
jnd = JOIN data1 BY col3, data2 BY colA;
STORE jnd INTO 'outfile';

Expressions in FOREACH

beginning = FOREACH prices GENERATE ..open;        -- every field up to and including open
middle    = FOREACH prices GENERATE open..close;   -- fields from open through close
ending    = FOREACH prices GENERATE volume..;      -- volume and every field after it
gain      = FOREACH prices GENERATE close - open;  -- fields referenced by name
gain2     = FOREACH prices GENERATE $4 - $1;       -- fields referenced by position

Page 189
PigLatin
FOREACH

The FOREACH...GENERATE statement iterates over the members of a bag

justnames = FOREACH emps GENERATE name;

Can be combined with COUNT:

summedUp = FOREACH grpd GENERATE group, COUNT(bag1) AS elementCount;

Page 190
PigLatin
Grouping

Grouping

grpd = GROUP bag1 BY elementX;

Creates a new bag

o Each tuple in grpd has an element called group, and an element called bag1

o The group element holds one distinct value of elementX from bag1

o The bag1 element is itself a bag, containing all the tuples from bag1 with that value for elementX
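
The shape of a grouped bag is easier to see with data. The sketch below models GROUP followed by FOREACH ... GENERATE group, COUNT(bag1) in plain Python. It is only an illustration: the rows in bag1 are hypothetical, elementX is assumed to be the first field, and null handling (Pig's COUNT skips tuples whose first field is null) is not modeled.

```python
from collections import defaultdict


def group_by(bag, key_index):
    """Mimic GROUP bag BY field: return (group, inner bag of matching tuples) pairs."""
    groups = defaultdict(list)
    for tup in bag:
        groups[tup[key_index]].append(tup)
    # Sorted only so the output is deterministic; Pig makes no ordering promise here.
    return sorted(groups.items())


def count_per_group(grouped):
    """Mimic FOREACH grpd GENERATE group, COUNT(bag1) AS elementCount."""
    return [(group, len(inner_bag)) for group, inner_bag in grouped]


# Hypothetical bag1 with elementX in position 0.
bag1 = [("x", 1), ("y", 2), ("x", 3)]
grpd = group_by(bag1, 0)
summed_up = count_per_group(grpd)
print(grpd)
print(summed_up)
```

Note that every tuple of grpd still carries the full original tuples in its inner bag, so the grouping key appears twice: once as group and once inside each tuple.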

Page 191
Pig hands on (Day 3)

Page 192
Pig
Overview

Pig Latin is a data flow programming language, whereas SQL is a declarative programming language. There are three ways of executing Pig programs, all of which work in both local and MapReduce modes:
Script: Pig can run a script file that contains Pig commands (pig script.pig).
Grunt: Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run and the -e (or -execute, for direct command execution) option is not used. It is also possible to run Pig scripts from within Grunt using run and exec.
Embedded: You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java. For programmatic access to Grunt, use PigRunner.

For this demonstration, we will be:

Loading and filtering data using Grunt
Naming fields
Joining data
Grouping and using aggregate functions
Splitting records

Page 193
Pig
Load and filter using Grunt

Step 1: Start Pig

$ pig

Step 2: Load the symbols data into the symbols variable

grunt> symbols = LOAD 'NYSE/data/symbols.txt';

Step 3: Dump the data and the schema just loaded to the screen

grunt> DUMP symbols;


grunt> DESCRIBE symbols;

Step 4: Retrieve the records where the first field ($0) starts with M
grunt> symbols_m = FILTER symbols BY $0 matches 'M.*';
grunt> DUMP symbols_m;
grunt> DESCRIBE symbols_m;

Page 194
Pig
Naming Fields

Step 1: Reload the data, this time specifying the field names and types
The >> prompt indicates a continuation line; statements end with a semicolon (;)

grunt> symbols = LOAD 'NYSE/data/symbols.txt'


>> AS (ticker:chararray, name:chararray);

Step 2: Dump the data and the schema just loaded to the screen
grunt> DUMP symbols;
grunt> DESCRIBE symbols;

Step 3: Retrieve the records where the ticker field starts with M
Fields can be referred to by name or by position; positions start at $0, so the i-th field is $(i-1)

grunt> symbols_m = FILTER symbols BY ticker matches 'M.*';


grunt> DUMP symbols_m;
grunt> DESCRIBE symbols_m;

Page 195
Pig
Joining Data

Step 1: Load the end-of-day (EOD) data.

grunt> eoddata = LOAD 'NYSE/data/EOD2013.txt' USING PigStorage(',') AS (ticker:chararray,
>> close_date:chararray, price_open:double, price_high:double, price_low:double,
>> price_close:double, volume:long);

Step 2: Examine the schema just loaded on to the screen


grunt> DESCRIBE eoddata;

Step 3: Join the record sets

grunt> sym_eod = JOIN symbols BY ticker, eoddata BY ticker;

Step 4: Examine the schema of join

grunt> DESCRIBE sym_eod;

Page 196
Pig
Grouping and Aggregate Functions

Step 1 : Group the symbols records by ticker symbol (dump to review)

grunt> symbols_grp = GROUP symbols BY ticker;


grunt> DUMP symbols_grp;

Step 2: Co-Group the symbols and EOD data (describe to review)


grunt> symeod_grp = COGROUP symbols BY ticker, eoddata BY ticker;
grunt> DESCRIBE symeod_grp;

Step 3: For each co-grouped record, find the maximum open price

grunt> eod_maxopen = FOREACH symeod_grp GENERATE group, MAX(eoddata.price_open), symbols.name;


grunt> DESCRIBE eod_maxopen;

Step 4: Sort the results (dump to view the final record set)

grunt> eod_maxopen_sorted = ORDER eod_maxopen BY $1 DESC;


grunt> DUMP eod_maxopen_sorted;
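
COGROUP differs from JOIN in that it keeps one bag per input for every key instead of flattening the matches. The plain-Python sketch below models Steps 2 and 3 under simplifying assumptions: the rows are invented, the EOD tuples are shortened to (ticker, close_date, price_open), and an empty bag is taken to yield None in place of a null maximum.

```python
from collections import defaultdict


def cogroup(left, right, left_key=0, right_key=0):
    """Mimic COGROUP left BY k, right BY k: one entry per key, one bag per input."""
    out = defaultdict(lambda: ([], []))
    for tup in left:
        out[tup[left_key]][0].append(tup)
    for tup in right:
        out[tup[right_key]][1].append(tup)
    return sorted(out.items())  # sorted by key only to make the output deterministic


def max_open_per_ticker(cogrouped, open_index=2):
    """Mimic FOREACH ... GENERATE group, MAX(eoddata.price_open), symbols.name."""
    result = []
    for ticker, (sym_bag, eod_bag) in cogrouped:
        opens = [row[open_index] for row in eod_bag]
        max_open = max(opens) if opens else None  # empty bag: no maximum
        names = [row[1] for row in sym_bag]       # symbols.name is itself a bag
        result.append((ticker, max_open, names))
    return result


# Hypothetical data: (ticker, name) and shortened (ticker, close_date, price_open).
symbols = [("MS", "Morgan Stanley"), ("XYZ", "Example Co")]
eoddata = [("MS", "20131001", 27.0), ("MS", "20131002", 28.5)]

eod_maxopen = max_open_per_ticker(cogroup(symbols, eoddata))
print(eod_maxopen)
```

The ticker with no EOD rows still appears in the result, which is the behavior that distinguishes a co-group from an inner join.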

Page 197
Pig
Splitting Records

Step 1 : Split the data into good and bad relations

grunt> SPLIT eod_maxopen INTO eod_maxopen_good IF $1 >= 5.0, eod_maxopen_bad IF $1 < 5.0;

Step 2: Examine the Good schema and results


grunt> DESCRIBE eod_maxopen_good;
grunt> STORE eod_maxopen_good INTO 'output/pig/eod_maxopen_good';

Step 3: Examine the Bad schema and results

grunt> DESCRIBE eod_maxopen_bad;
grunt> STORE eod_maxopen_bad INTO 'output/pig/eod_maxopen_bad';
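
SPLIT evaluates every condition against every record, so a record lands in each relation whose condition holds and in none if no condition holds (for example when the compared value is null). A minimal Python sketch of that routing, using invented rows and hypothetical helper names, might look like this:

```python
def split(records, conditions):
    """Route each record to every output whose predicate is true (possibly none)."""
    outputs = {name: [] for name in conditions}
    for rec in records:
        for name, predicate in conditions.items():
            if predicate(rec):
                outputs[name].append(rec)
    return outputs


def at_least(threshold):
    """Predicate for $1 >= threshold; a missing value never matches."""
    return lambda rec: rec[1] is not None and rec[1] >= threshold


def below(threshold):
    """Predicate for $1 < threshold; a missing value never matches."""
    return lambda rec: rec[1] is not None and rec[1] < threshold


# Hypothetical (ticker, max_open) records, one of them with no maximum.
eod_maxopen = [("A", 7.5), ("B", 3.0), ("C", None)]
parts = split(eod_maxopen, {"good": at_least(5.0), "bad": below(5.0)})
print(parts)
```

Record C ends up in neither output, which is worth remembering when the good and bad counts do not add up to the input count.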

Page 198
Pig
Running Pig as Script

Step 1 : myscript.pig (stored in NYSE/pig)

symbols = LOAD 'NYSE/data/symbols.txt' AS (ticker:chararray, name:chararray);


eoddata = LOAD 'NYSE/data/EOD2013.txt' USING PigStorage(',') AS (ticker:chararray,
close_date:chararray, price_open:double, price_high:double, price_low:double,
price_close:double, volume:long);
symbols_ms = FILTER symbols BY ticker matches 'MS';
eoddata_ms = FILTER eoddata BY ticker matches 'MS';
symeod_ms = JOIN symbols_ms BY ticker, eoddata_ms BY ticker;
SPLIT symeod_ms INTO symeod_ms_good IF price_open <= price_close, symeod_ms_bad IF price_open >
price_close;
STORE symeod_ms_good INTO 'output/pig/symeod_ms_good';
STORE symeod_ms_bad INTO 'output/pig/symeod_ms_bad';

Step 2: Execute Script


% pig NYSE/pig/myscript.pig

Page 199
Pig
Running Pig as local Script

Step 1: myscript-local.pig (stored in NYSE/pig)

symbols = LOAD '/home/cloudera/NYSE/data/symbols.txt' AS (ticker:chararray, name:chararray);


eoddata = LOAD '/home/cloudera/NYSE/data/EOD2013.txt' USING PigStorage(',') AS
(ticker:chararray, close_date:chararray, price_open:double, price_high:double,
price_low:double, price_close:double, volume:long);
symbols_ms = FILTER symbols BY ticker matches 'MS';
eoddata_ms = FILTER eoddata BY ticker matches 'MS';
symeod_ms = JOIN symbols_ms BY ticker, eoddata_ms BY ticker;
SPLIT symeod_ms INTO symeod_ms_good IF price_open <= price_close, symeod_ms_bad IF price_open >
price_close;
STORE symeod_ms_good INTO '/home/cloudera/NYSE/output/pig/symeod_ms_good';
STORE symeod_ms_bad INTO '/home/cloudera/NYSE/output/pig/symeod_ms_bad';

Step 2: Execute Script


% pig -x local NYSE/pig/myscript-local.pig

Page 200
Pig Latin
Operators

Category                    Operators

Loading and storing         LOAD, STORE, DUMP
Filtering                   FILTER, DISTINCT, FOREACH...GENERATE, MAPREDUCE, STREAM, SAMPLE
Grouping and joining        JOIN, COGROUP, GROUP, CROSS
Sorting                     ORDER, LIMIT
Combining and splitting     UNION, SPLIT
Diagnostic operators        DESCRIBE, EXPLAIN, ILLUSTRATE
Macro and UDF statements    REGISTER, DEFINE, IMPORT

Page 201
Pig Latin
Commands

Category             Commands

Hadoop Filesystem    cat, cd, copyFromLocal, copyToLocal, cp, fs, ls, mkdir, mv, pwd, rm, rmf
Hadoop MapReduce     kill
Utility              exec, help, quit, run, set, sh

Page 202
Pig Latin
Expressions

Category                Expression

Constant                Literal
Field (by position)     $n
Field (by name)         f
Field (disambiguate)    r::f
Projection              c.$n, c.f
Map lookup              m#k
Cast                    (t) f
Arithmetic              x + y, x - y
                        x * y, x / y
                        x % y
                        +x, -x
Conditional             x ? y : z
Comparison              x == y, x != y
                        x > y, x < y
                        x >= y, x <= y
                        x matches y
                        x is null
                        x is not null
Boolean                 x or y, x and y
                        not x
Functional              fn(f1,f2,...)
Flatten                 FLATTEN(f)

Page 203
Pig Latin
Data Types

Category   Type        Description                                               Literal example

Numeric    int         32-bit signed integer                                     1
           long        64-bit signed integer                                     1L
           float       32-bit floating-point number                              1.0F
           double      64-bit floating-point number                              1.0
Text       chararray   Character array in UTF-16 format                          'a'
Binary     bytearray   Byte array                                                Not supported
Complex    tuple       Sequence of fields of any type                            (1,'boy')
           bag         An unordered collection of tuples, possibly with          {(1,'boy'),(2)}
                       duplicates
           map         A set of key-value pairs; keys must be character          ['a'#'boy']
                       arrays, but values may be any type

Page 204
Pig Latin
Built-in Functions

Category      Functions

Eval          AVG, CONCAT, COUNT, COUNT_STAR, DIFF, MAX, MIN, SIZE, SUM, TOBAG, TOKENIZE, TOMAP, TOP, TOTUPLE
Filter        IsEmpty
Load/Store    PigStorage, BinStorage, TextLoader, JsonLoader, JsonStorage, HBaseStorage

Page 205
Hive deep dive - Theory (Day 3)

Page 206
Hive Introduction
o Hive was originally developed at Facebook
Provides a very SQL-like language
Can be used by people who know SQL
Under the covers, generates MapReduce jobs that run on the Hadoop cluster

Enabling Hive requires almost no extra work by the system administrator

Page 207
Hive Architecture
[Diagram: Hive architecture. Components shown: JDBC, ODBC, Command Line Interface, Web Interface, Thrift Server, Metastore, Driver (Compiler, Optimizer, Executor); underlying Hadoop layer (MapReduce + HDFS): Job Tracker, Name Node, Data Node + Task Tracker]

Data Model
Tables have typed columns (int, float, string, date, boolean)
Partitions
Buckets (hash partitions, useful for sampling and join optimization)

Metastore
Namespace containing a set of tables
Holds partition definitions
Statistics
Runs on Derby, MySQL and many other relational databases

Physical Layout
Warehouse directory in HDFS
Table row data stored in subdirectories of the warehouse
Partitions form subdirectories of table directories
Actual data stored in flat files

Copyright 2010-2013 EY. All rights reserved. Not to be reproduced without prior written consent.


Page 208
Hive Metastore

o Hive's Metastore is a database containing table definitions and other metadata
By default, stored locally on the client machine in a Derby database
If multiple people will be using Hive, the system administrator should create a shared Metastore
Usually in MySQL or some other relational database server

Page 209
Hive Data Model and Data Types

Hive layers table definitions on top of data in HDFS

o Tables
Typed columns (int, float, string, boolean and so on)
Also array, struct, map (for JSON-like data)
o Partitions
e.g., to range-partition tables by date
o Buckets
Hash partitions within ranges (useful for sampling, join optimization)

Type constructors:
ARRAY < primitive-type >
MAP < primitive-type, data-type >
STRUCT < col-name : data-type, ... >

Primitive types:
TINYINT
SMALLINT
INT
BIGINT
FLOAT
BOOLEAN
DOUBLE
STRING
BINARY (available starting in CDH4)
TIMESTAMP (available starting in CDH4)

Page 210
Hive Sample Commands

CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/dinesh/data.txt' INTO TABLE managed_table;
DROP TABLE managed_table;

CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/dinesh/external_table';
LOAD DATA INPATH '/user/dinesh/data.txt' INTO TABLE external_table;

CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
LOAD DATA LOCAL INPATH 'input/live/partitions/file1' INTO TABLE logs
PARTITION (dt='2001-10-10', country='US');

SHOW PARTITIONS logs;

SELECT ts, dt, line FROM logs WHERE country='US';

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
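
As a rough illustration of how CLUSTERED BY (id) INTO 4 BUCKETS could spread rows, the Python sketch below assigns a bucket from the clustering column. It rests on two assumptions that should be checked against the Hive version in use: that an INT column hashes to its own value, and that the bucket is the non-negative hash modulo the number of buckets. The function name bucket_for is hypothetical.

```python
def bucket_for(user_id, num_buckets=4):
    """Assumed scheme: (hash & 0x7FFFFFFF) % num_buckets, with hash(int) == int."""
    non_negative_hash = user_id & 0x7FFFFFFF  # clear the sign bit, as for a 32-bit int
    return non_negative_hash % num_buckets


# Hypothetical ids, shown with the bucket each would fall into under this scheme.
for user_id in (0, 5, 8, 11):
    print(user_id, "-> bucket", bucket_for(user_id))
```

Under this scheme rows with the same id always land in the same bucket, which is what makes bucketed tables useful for sampling and for joins on the clustering column.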

Page 211
Storage Format
Default
Delimited text with a row per line

CREATE TABLE ...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

JOINS:

SELECT Sales.*, things.* FROM Sales JOIN things ON (Sales.ID = things.ID);

OUTER JOIN / CROSS JOIN / VIEWS / GROUP BY / UDF: all are possible with Hive

EXPLAIN
EXPLAIN SELECT ... (the entire query): provides details about the execution plan for the query, including the MapReduce jobs

Page 212
Hive Data
Physical Layout

o Hive tables are stored in Hive's warehouse directory in HDFS
By default, /user/hive/warehouse
o Tables are stored in subdirectories of the warehouse directory
Partitions form subdirectories of tables
o Possible to create external tables if the data is already in HDFS and should not be moved from its current location
o Actual data is stored in flat files
Control-character-delimited text, or SequenceFiles
Can be in arbitrary format with the use of a custom Serializer/Deserializer (SerDe)
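
The directory layout described above can be sketched as a small path-building function. This is an illustration only: it assumes a managed table in the default database and the usual column=value naming of partition subdirectories, and the function name partition_dir is hypothetical.

```python
import posixpath

WAREHOUSE_DIR = "/user/hive/warehouse"  # default warehouse location stated above


def partition_dir(table, partition_spec, warehouse=WAREHOUSE_DIR):
    """Build the HDFS directory that holds one partition of a managed table."""
    parts = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(warehouse, table, *parts)


# The logs table partitioned by (dt, country), as in the sample commands.
print(partition_dir("logs", [("dt", "2001-10-10"), ("country", "US")]))
# prints /user/hive/warehouse/logs/dt=2001-10-10/country=US
```

Because partition values live in directory names rather than in the data files, a query that filters on a partition column only needs to read the matching subdirectories.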

Limitations

o Not all standard SQL is supported


Subqueries are only supported in the FROM clause

No correlated subqueries

o No support for UPDATE or DELETE

o No support for INSERTing single rows

Page 213
Hive vs. Pig
Choosing between Pig and Hive

o If you want an abstraction on top of MapReduce, use either Hive or Pig

o Which one is chosen depends on the skillset of the target users
Those with an SQL background will naturally gravitate towards Hive
Those who do not know SQL will often choose Pig

o It is also possible to use both
Pig deals better with less structured data, so Pig is used to manipulate the data into a more structured form, then Hive is used to query that structured data

o Pig is better suited for data pipelines
Where the schema is unknown, incomplete or inconsistent, or where you need to manage nested data or work on data before cleaning it (researchers often prefer this)

Page 214
Lunch Break (Day 3)

Page 215
Hive hands on (Day 3)

Page 216
Hive
Overview

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution.
Originally developed by Facebook.
Supports analysis of large datasets stored in HDFS and compatible file systems such as the Amazon S3 filesystem.
Provides an SQL-like language called HiveQL.
HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions not in SQL (multitable inserts, CREATE TABLE AS SELECT).
To accelerate queries, Hive provides indexes, including bitmap indexes.

For this demonstration, we will be:

Reviewing Hive file types
Loading data from a file into Hive
Loading data from a Hive table into another Hive table
Working through another data loading example

Page 217
Hive
Hive File Types
Four file types are supported in Hive: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE
Text files compressed with Gzip or Bzip2 can be imported directly into a TEXTFILE table
Compression is detected automatically and the file is decompressed on the fly

CREATE TABLE raw (line STRING)


ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

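Default File Type

TEXTFILE is the default storage format. However, Hadoop cannot split a Gzip-compressed text file into blocks, so it cannot be processed by multiple map tasks in parallel
Recommended practice is to insert the data into another table that is stored as SEQUENCEFILE.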

CREATE TABLE raw (line STRING)


ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- one of NONE, RECORD, BLOCK
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;

Page 218
Hive
ORC File Type

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data.
o An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file a postscript holds compression parameters and the size of the compressed footer.
o The default stripe size is 250 MB. Large stripe sizes enable large, efficient reads from HDFS.
o The file footer contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum.
[Diagram: ORC file structure]

Page 219
Hive
Loading data from File to Hive

Step 1 : Copy the working data files for use with Hive

$ hadoop fs -mkdir NYSE/data/hive


$ hadoop fs -cp NYSE/data/*.txt NYSE/data/hive

Step 2: Start Hive


$ hive

Step 3: Create the temporary table to load the Hive data into.

hive> CREATE TABLE symbols_tmp (symbol STRING, description STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t' STORED AS TEXTFILE;

Step 4: Load the data from the source file into the temporary table.

hive> LOAD DATA INPATH 'NYSE/data/hive/symbols.txt' INTO TABLE symbols_tmp;

Data can also be loaded from the OS file system using the LOCAL option

hive> LOAD DATA LOCAL INPATH '/home/cloudera/NYSE/data/symbols.txt' INTO TABLE symbols_tmp;

Page 220
Hive
Loading Data from Hive table into another Hive Table

Step 1 : Create the destination table

hive> CREATE TABLE symbols (symbol STRING, description STRING, exchange STRING) STORED AS
SEQUENCEFILE;

Step 2: Set the Hive Compression parameters


hive> SET hive.exec.compress.output=true;
hive> SET io.seqfile.compression.type=BLOCK;

Step 3: Insert the data into destination table.

hive> INSERT OVERWRITE TABLE symbols SELECT symbol, description, 'NYSE' FROM symbols_tmp;

Step 4: Drop the temporary table we previously created.

hive> DROP TABLE symbols_tmp;

Page 221
Hive
Another data loading example

Steps

hive> CREATE TABLE eod_data_tmp (ticker STRING, close_date STRING, price_open DOUBLE, price_high
DOUBLE, price_low DOUBLE, price_close DOUBLE, volume BIGINT) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' STORED AS TEXTFILE;
hive> LOAD DATA INPATH 'NYSE/data/hive/EOD2013.txt' INTO TABLE eod_data_tmp;
hive> CREATE TABLE eod_data (ticker STRING, close_date timestamp, price_open DOUBLE, price_high
DOUBLE, price_low DOUBLE, price_close DOUBLE, volume BIGINT) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET io.seqfile.compression.type=BLOCK;
hive> INSERT OVERWRITE TABLE eod_data SELECT ticker,
from_unixtime(unix_timestamp(CONCAT(close_date,'163000000'), 'yyyyMMddHHmmssSSS')), price_open,
price_high, price_low, price_close, volume FROM eod_data_tmp;
hive> DROP TABLE eod_data_tmp;
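
The INSERT above turns a yyyyMMdd string into a timestamp by appending a fixed time of day (16:30:00.000) before parsing. The Python sketch below reproduces just that string manipulation so the pattern is easier to follow. The function name close_timestamp is hypothetical, and time zone handling, which Hive applies during the conversion, is deliberately not modeled.

```python
from datetime import datetime


def close_timestamp(close_date):
    """Append 16:30:00.000 to a yyyyMMdd string and parse the combined value."""
    combined = close_date + "163000000"  # HHmmss = 163000, SSS = 000
    return datetime.strptime(combined, "%Y%m%d%H%M%S%f")


print(close_timestamp("20131001"))
# prints 2013-10-01 16:30:00
```

Storing the value as a timestamp is what allows the later queries to use year() and month() on close_date.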

Page 222
Hive
Running Select Queries

Step 1: Get end-of-day data for MS for October 2013

hive> SELECT * FROM eod_data WHERE ticker = 'MS' AND year(close_date) = 2013 AND
month(close_date) = 10;

Step 2: Same as above using date ranges and unix timestamp


hive> SELECT * FROM eod_data WHERE ticker = 'MS' AND close_date >= unix_timestamp('2013-10-01
00:00:00') AND close_date < unix_timestamp('2013-11-01 00:00:00');

Step 3: Get the average monthly close prices for MS.

hive> SELECT MONTH(close_date), AVG(price_close) FROM eod_data WHERE ticker = 'MS' GROUP BY
month(close_date);

Page 223
Hive
Aggregate Function and Joins

Step 1: Find the difference between the highest and lowest close prices for all stocks on the NYSE in 2013

hive> SELECT s.symbol, s.description, MAX(d.price_close) - MIN(d.price_close) AS spread FROM


eod_data d JOIN symbols s ON d.ticker = s.symbol WHERE s.exchange = 'NYSE' AND
year(d.close_date) = 2013 GROUP BY s.symbol, s.description ORDER BY spread DESC;

Output: Stage 1

Page 224
Hive
Aggregate Function and Joins

Output: Stage 2 and Stage 3

Page 225
Hive
Aggregate Function and Joins

Output: Stage 2 and Stage 3 ( Continued )

Page 226
Hive
Saving Query Results

Option 1 : Saving results into HDFS

hive> INSERT OVERWRITE DIRECTORY '/user/cloudera/NYSE/hive/udaf' SELECT s.symbol,


s.description, MAX(d.price_close) - MIN(d.price_close) AS spread FROM eod_data d JOIN symbols s
ON d.ticker = s.symbol WHERE s.exchange = 'NYSE' AND year(d.close_date) = 2013 GROUP BY
s.symbol, s.description ORDER BY spread DESC;

Option 2: Saving results to local file system

hive> INSERT OVERWRITE LOCAL DIRECTORY '~/NYSE/hive/udaf' SELECT s.symbol, s.description,


MAX(d.price_close) - MIN(d.price_close) AS spread FROM eod_data d JOIN symbols s ON d.ticker =
s.symbol WHERE s.exchange = 'NYSE' AND year(d.close_date) = 2013 GROUP BY s.symbol,
s.description ORDER BY spread DESC;

Page 227
Hive
Examining Database Results

Database : Show database

hive> show databases;

Tables: Show tables

hive> show tables;

Columns: Show columns

hive> show columns in eod_data;


hive> show columns in symbols;

Exit: Exit Hive

hive> exit;

Page 228
Administration (Day 3)

Page 229
Hadoop Administration & Troubleshooting

Managing Hadoop Processes

Starting and Stopping Processes with Init Scripts

Starting and Stopping Processes Manually

MapReduce Maintenance Tasks

Adding a Tasktracker

Decommissioning a Tasktracker

Killing a MapReduce Job

Dealing with a Blacklisted Tasktracker


HDFS Maintenance Tasks

Adding a Datanode

Decommissioning a Datanode

Checking Filesystem Integrity with fsck

Balancing HDFS Block Data

Dealing with a Failed Disk

Page 230
Conclusion (Day 3)

Questions?

Page 236
Appendix
o Oozie
Why Oozie
Oozie use cases

o Many problems cannot be solved with a single MapReduce job

o Instead, a workflow of jobs must be created

o Simple workflow:
Run Job A
Use output of Job A as input to Job B
Use output of Job B as input to Job C
Output of Job C is the final required output

o Easy if the workflow is linear like this
Can be created as standard Driver code

o If the workflow is more complex, Driver code becomes much more difficult to maintain

o Example: running multiple jobs in parallel, using the output from all of those jobs as the input to the next job

o Example: including Hive or Pig jobs as part of the workflow

[Diagram: Start Data -> Job A -> Job B -> Job C]

Page 238
What is Oozie?
How it works?
Oozie is a workflow engine

Runs on a server
Typically outside the cluster

Runs workflows of Hadoop jobs
Including Pig, Hive, Sqoop jobs
Submits those jobs to the cluster based on a workflow definition

Workflow definitions are submitted via HTTP

Jobs can be run at specific times
One-off or recurring jobs

Jobs can be run when data is present in a directory

Page 239
Oozie Workflow Basics?
Workflow Overview
Oozie workflows are written in XML

Workflow is a collection of actions
MapReduce jobs, Pig jobs, Hive jobs etc.

A workflow consists of control flow nodes and action nodes

Control flow nodes define the beginning and end of a workflow
They provide methods to determine the workflow execution path
Example: Run multiple jobs simultaneously

Action nodes trigger the execution of a processing task, such as
A MapReduce job
A Pig job
A Sqoop data import job

[Diagram: Start -> Job (MapReduce, Pig, Hive) -> End on success; Error on failure]

Page 240
Oozie Workflow Sample?
Workflow XML Anatomy
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">  <!-- 1 -->
  <start to='wordcount'/>                                          <!-- 2 -->
  <action name='wordcount'>
    <map-reduce>                                                   <!-- 3 -->
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.myorg.WordCount.Map</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.myorg.WordCount.Reduce</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>                                                 <!-- 4 -->
    <error to='kill'/>
  </action>
  <kill name='kill'>                                               <!-- 5 -->
    <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
  </kill>
  <end name='end'/>                                                <!-- 6 -->
</workflow-app>

1 o A workflow is wrapped in the workflow-app entity
2 o The start node is the control node which tells Oozie which workflow node should be run first. There must be one start node in an Oozie workflow. In our example, we are telling Oozie to start by transitioning to the wordcount workflow node.
3 o The action node defines the type of job; it is a map-reduce job in this example. Within the action we define the job properties.
4 o We specify what to do if the action ends successfully, and what to do if it fails. In this example, if the job is successful we go to the end node. If it fails we go to the kill node.
5 o If the workflow reaches a kill node, it will kill all running actions and then terminate with an error. A workflow can have zero or more kill nodes.
6 o Every workflow must have an end node. This indicates that the workflow has completed successfully.

Page 241
Oozie Other control nodes
Control nodes overview
o A decision control node allows Oozie to determine the workflow execution path based on some criteria
Similar to a switch/case statement

o fork and join control nodes split one execution path into multiple execution paths which run concurrently
fork splits the execution path
join waits for all concurrent execution paths to complete before proceeding
fork and join are used in pairs

o Oozie can also be called from within a Java program
Via the Oozie client API

o To submit an Oozie workflow using the command line tool:

$ oozie job -oozie http://<oozie_server>/oozie -config config_file -run
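
The fork and join behavior described on this slide can be pictured with ordinary threads. The Python sketch below is an analogy only, not Oozie code: the job functions are hypothetical stand-ins for workflow actions, and run_fork_join is a made-up helper that starts every branch, waits for all of them, and only then runs the next step.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fork_join(branches, next_step):
    """Run every branch concurrently (fork), wait for all of them (join), then continue."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch) for branch in branches]
        # Join: result() blocks until that branch has finished.
        results = [future.result() for future in futures]
    return next_step(results)


# Hypothetical stand-ins for two parallel jobs and the job that consumes their output.
def job_a():
    return "output-a"


def job_b():
    return "output-b"


def job_c(inputs):
    return "combined:" + ",".join(inputs)


final = run_fork_join([job_a, job_b], job_c)
print(final)
```

In a real workflow the same structure is declared in the workflow XML, and Oozie rather than the driver program takes care of waiting for the parallel actions.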

Page 242
Oozie Action Nodes
Action Nodes Overview

Node Name     Description

map-reduce    Runs either a Java MapReduce or Streaming job
fs            Creates directories, moves or deletes files or directories
java          Runs the main() method in the specified Java class as a single-Map, Map-only job on the cluster
pig           Runs a Pig job
hive          Runs a Hive job
sqoop         Runs a Sqoop job
email         Sends an e-mail message

Page 243
Bibliography
Hadoop. (2013, 03 06)[1]. Welcome to Apache Hadoop. Retrieved from Apache Hadoop: http://hadoop.apache.org/

Borthakur, D. (2013, 02 13)[2]. HDFS Architecture Guide. Retrieved from Apache Hadoop:
http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html

Borthakur, D. (2007)[3]. The Hadoop Distributed File System:Architecture and Design. Retrieved from Apache Hadoop:
http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf

HBase. (2013, 03 14)[4]. Apache HBase. Retrieved from Apache HBase: http://hbase.apache.org/

Hive. (2013, 02 07)[5]. Welcome to Hive. Retrieved from Hadoop: http://hive.apache.org/

Mahout. (2011)[6]. What is Apache Mahout? Retrieved from mahout: http://mahout.apache.org/

MapReduce. (2013, 03 12)[7]. MapReduce. Retrieved from Wikipedia: http://en.wikipedia.org/wiki/MapReduce

Pig. (2013, 02 21)[8]. Welcome to Apache Pig. Retrieved from Apache Hadoop: http://pig.apache.org/

Sqoop. (2013, 03 08)[9]. Sqoop. Retrieved from Apache Sqoop: http://sqoop.apache.org/

Zookeeper. (2010)[10]. Apache ZooKeeper. Retrieved from Apache ZooKeeper: http://zookeeper.apache.org/

oozie. (2013, 01 23)[11]. Apache Oozie Workflow Scheduler for Hadoop. Retrieved from oozie: http://oozie.apache.org/

Avro. (2013, 02 26)[12]. Welcome to Apache Avro. Retrieved from Avro: http://avro.apache.org/

Informatica Corporation. (2012)[13]. Lean Integration. Retrieved from Informatica: http://www.informatica.com/us/vision/best-practices/lean-integration/

MarkLogic Corporation. (2013)[14]. Featured Clients. Retrieved from MarkLogic: http://www.marklogic.com/

MarkLogic Corporation. (2011)[15]. What's New in MarkLogic 5.


Page 244