BigData - Hadoop - Spark - ES - 3 - Day Training

Big Data and Hadoop
(3 Day Training)
For any support you need please feel free to contact:
Ashish Baghel : abaghel@impetus.com
AVP & Head –Banking and Financial Services (BFSI)
913-638-2948 (Cell)
408-213-3310 –Ext 567 (Office)
Sachneet Singh : sachneets.bains@impetus.co.in
Impetus Technologies - Confidential

Class Schedule
• Schedule
– 9:30 a.m. – 5:30 p.m.
• Short tea breaks
• As and when you need one 
• Lunch Break

Introductions
• Your name
• Job responsibilities
• Any Hadoop experience
• Expectations

Course Outline
• Day 1
– Big data concepts, context and challenges
– Hadoop Overview
– The MapReduce Framework and YARN
– The Hadoop Distributed File System (HDFS)
– MapR File system
– PIG Programming : Concepts
• Day 2
– PIG Programming : Loading Data and Querying
– Debugging and Macros
– Hive Introduction and Architecture
– Hive Programming : Load data, Define Schema, Query Data
– HBase : Introduction , Concepts and Architecture

Course Outline
• Day 3
– HBase : Advanced Concepts
– Spark: Introduction, Concepts
– Elastic Search : Introduction and Architecture
– Elastic Search : Concepts of Index and Search

Big Data Facts
1. In what timeframe do we now create the same amount of information that we created from the dawn of civilization until 2003?
2 days
2. 90% of the world's data was created in the last how many years?
2 years
3. What is 1024 petabytes also known as?
An exabyte
4. Companies monitoring Twitter to measure "sentiment" analyze 12 terabytes of tweets every day.
Source: https://www.linkedin.com/pulse/20140925030713-64875646-big-data-the-eye-opening-facts-everyone-should-know

What is Big Data?
"Big Data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to
enable enhanced decision making, insight discovery and
process optimization." – Gartner, 2012

3 V's of Big Data
• Volume: Big Data comes in one size – large (terabytes / petabytes / exabytes)
• Velocity: streaming data and time-sensitive data (batch, near-real-time and real-time streams)
• Variety: structured, semi-structured and unstructured data (text, audio/video, click streams, log files, etc.)
• Together these deliver value to business

6 Key Hadoop DATA TYPES
1. Sentiment
How your customers feel
2. Clickstream
Website visitors’ data
3. Sensor/Machine
Data from remote sensors and machines
4. Geographic
Location-based data
5. Server Logs

6. Text
Millions of web pages, emails, and documents
Source: www.Hortonworks.com

Sentiment Use Case
• Analyze customer sentiment on the days leading up to and following the release of each 'Game of Thrones' season.
• Look for answers to:
– What is the response to the release?
– What did the public like the most and dislike the most?
– What are the sentiments around each character, event and season?

Sentiment Use case
- Select a source of data
- Select a tool to bring the data from source to Hadoop
- Use HCatalog to Define a Schema
- Use Hive to Determine Sentiment
- Third party tools for NLP etc. may be required
- Use BI tools to connect to Hadoop and show reports

Sentiment use case
• Twitter is one of the sources
• Collect all the tweets using a tool called 'Flume'
• Bring the data to Hadoop and process it
• Connect to BI tools for reports
Flume is used to stream data into Hadoop.
[Diagram: Flume Agent → Hadoop cluster]

Geolocation Use Case
Geolocation use cases involve vehicles, devices or people moving across a map or similar surface.
Example:
– A company has a fleet of trucks.
– Each truck has a sensor that logs location and event data.
– The collected data is put onto Hadoop for analysis.
• The company's goal with Hadoop is to:
– Identify unsafe driving events and improve driver safety
– Find improvements that save fuel and increase efficiency

The Geolocation Data
• Here is what the collected data from the trucks’ sensors
looks like:
– truckid
– driverid
– event
– latitude
– longitude
– city
– state
– velocity
– event_indicator (0 or 1)
– idling_indicator (0 or 1)
Source: https://github.com/hortonworks/hadoop-tutorials/blob/master/Sandbox/T15_Analyzing_Geolocation_Data.md

The Geolocation Data : Getting data
Raw sensor data:
truckid  driverid  event      latitude  longitude  city        state       velocity  event_indicator  idling_indicator
A54      A54       normal     38.44047  -122.714   Santa Rosa  California  17        0                0
A20      A20       normal     36.97717  -121.899   Aptos       California  27        0                0
A40      A40       overspeed  37.9577   -121.291   Stockton    California  77        1                0
[Diagram: raw sensor data → Flume Agent]
Source (Data): https://github.com/hortonworks/hadoop-tutorials/blob/master/Sandbox/T15_Analyzing_Geolocation_Data.md

The Truck Data
• The truck data is stored in a database and looks like:
– driverid
– truckid
– model
– monthly_miles
– monthly_gas
– total_miles
– total_gas
– mileage
• The miles and gas figures are for a few years.

Truck Data from RDBMS into Hadoop
[Diagram: RDBMS data (info of the trucks) → a Sqoop job → Hadoop]

Data Analysis
• Analysis can help us answer:
– Is fuel being wasted through idling?
– Which unsafe events on the road could result in accidents?
– Which drivers have such events, and how frequently?
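
The kind of question listed above can be prototyped on a handful of records before the data is loaded into Hadoop. The following is only a sketch in plain Python, not the Hive-based analysis used in the course; it assumes tab-separated records in the field order listed under "The Geolocation Data", and the helper name risky_events_per_driver is made up for illustration.

```python
from collections import Counter

FIELDS = ["truckid", "driverid", "event", "latitude", "longitude",
          "city", "state", "velocity", "event_indicator", "idling_indicator"]

# Sample rows in the field order above (the tab separator is an assumption)
RAW = [
    "A54\tA54\tnormal\t38.44047\t-122.714\tSanta Rosa\tCalifornia\t17\t0\t0",
    "A20\tA20\tnormal\t36.97717\t-121.899\tAptos\tCalifornia\t27\t0\t0",
    "A40\tA40\toverspeed\t37.9577\t-121.291\tStockton\tCalifornia\t77\t1\t0",
]

def parse(line):
    """Turn one raw sensor line into a field-name -> value dict."""
    return dict(zip(FIELDS, line.split("\t")))

def risky_events_per_driver(lines):
    """Count records flagged with event_indicator == 1 for each driver."""
    counts = Counter()
    for rec in map(parse, lines):
        if rec["event_indicator"] == "1":
            counts[rec["driverid"]] += 1
    return counts

print(risky_events_per_driver(RAW))
```

The same filter-then-count shape is what a GROUP BY query over the sensor table would express at cluster scale.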

Risk Factors Viewed on a Map

BigData – Use Cases

Potential Use Cases for Big Data
Source: http://athenaanalytics.tumblr.com/post/19544048414/cyberlabe-potential-use-cases-for-big-data

Big Data - Use Cases-Telecommunication
Telecom Vendors
• Can I offer something better?
• Why am I losing customers?
• 24x7 service: can I predict the failures?
• What plans should I offer to my customers?
Subscribers
• Better/best deals?
• Is the network reliable?
• Which plans are good for me?

Big Data - Use Cases- Financial Services
Customer: "Wanted to buy some products"
• Are there any offers?
• Are these offers good for me?
• Which stores are providing these offers?
"Ah! There are so many offers… hard to find the relevant ones…"
Merchant: "I am launching some new offers"
• How do I make the best use of it?
• How to attract the relevant/interested customers?

BigData – Challenges

BigData Challenges
• Data processing
– Processing and analyzing large data: terabytes and beyond
– Massively scalable and parallel
– Moving computation is easier than moving data
– Support partial failure
• Data storage
– Doesn't fit on one node; requires a cluster
– Flexible, schema-less structure
– Data replication, partitioning and sharding

Challenges of Distributed Processing
• Production deployments need to be carefully planned
– Unavailability of one node should not impact the rest of the cluster
• Need high-speed networks
• Data replication involves data conflicts
• Troubleshooting and diagnosing
• Geographically distributed
• Consistency and reliability

Big Data – Hadoop

History
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005
• Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant
• Originally developed to support distribution for the Nutch search engine project

What is Hadoop?
• A Batch processing Framework for distributed processing of
large data sets on a network of commodity hardware.
• Designed to scale out
• Fault - tolerant – At Application level
• Open source + Commodity hardware = Reduction in Cost
• Hadoop is very fast for very large jobs
• Hadoop is not fast for small jobs
• Designed for hardware and software failures

What is Hadoop 2.0?
• The Apache Hadoop 2.0 project consists of the following
modules:
– Hadoop Common: the utilities that provide support for the other
Hadoop modules.
– HDFS: the Hadoop Distributed File System
– YARN: a framework for job scheduling and cluster resource
management.
– MapReduce: for processing large data sets in a scalable and
parallel fashion.

What’s New in Hadoop 2.0?
• YARN is a re-architecture of Hadoop that allows multiple
applications to run on the same platform
Source: http://hortonworks.com/blog/apache-hadoop-2-is-ga/

The Hadoop Ecosystem
Source: http://www.apache.org/

Typical Hadoop Based Solution
Source: http://hortonworks.com/blog/webinar-series-building-a-modern-data-architecture-with-hadoop/

Relational Databases vs. Hadoop
               Relational                    Hadoop
Schema         Required on write             Required on read
Speed          Reads are fast                Writes are fast
Governance     Standards and structured      Not strict and standard yet
Data types     Structured                    Multi and unstructured
Best fit use   Interactive OLAP analytics    Data discovery
               Complex ACID transactions     Processing unstructured data
               Operational data store        Massive storage/processing

HDFS

Hadoop Distributed File System
• Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware

– Failure is expected, rather than exceptional
• Streaming Data Access

– Write-Once, Read-Many pattern
• Batch processing
• Node failure - Replication

HDFS Components
• NameNode
– The “master” node of HDFS
– Determines and maintains how the chunks of data are
distributed across the DataNodes
• DataNode
– Stores the chunks of data, and is responsible for
replicating the chunks across other DataNodes

HDFS Namenode Metadata
• Meta-data in Memory
– The entire metadata is in main memory
– No demand paging of meta-data
• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g creation time, replication factor
• A Transaction Log
– Records file creations, file deletions. etc

HDFS DataNode
• A Block Server
– Stores data in the local file system.
– Stores meta-data of a block.
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data

– Forwards data to other specified DataNodes

The NameNode
1. The NameNode persists metadata as two files on disk:
a. Namespace image file (fsimage_N)
b. Transaction log (edits_N)
2. On startup, the NameNode reads the fsimage_N and edits_N files into memory. During this phase the NameNode is in safemode, a read-only mode.
3. The transactions in edits_N are merged with fsimage_N.
4. A newly created fsimage_N+1 is written to disk, and a new, empty edits_N+1 is created.
5. Client applications can now create a new file in HDFS.
6. The NameNode journals that create transaction in the edits_N+1 file.
[Diagram: NameNode with fsimage and edit logs; namespace journaling]
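
The startup sequence above amounts to replaying a log onto a snapshot. As a rough illustration only (a toy model in Python, not the NameNode's actual on-disk format or code), with the namespace reduced to a set of paths and the operation names "create" and "delete" chosen for the example:

```python
def checkpoint(fsimage, edits):
    """Replay the edit log onto the namespace image.

    Returns the merged image (fsimage_N+1) and a new, empty edit log (edits_N+1).
    """
    namespace = set(fsimage)
    for op, path in edits:
        if op == "create":
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
    return namespace, []

# fsimage_N and edits_N as read from disk at startup (made-up contents)
fsimage_n = {"/data/a.txt", "/data/b.txt"}
edits_n = [("create", "/data/c.txt"), ("delete", "/data/a.txt")]

fsimage_n1, edits_n1 = checkpoint(fsimage_n, edits_n)

# Step 6: once clients can write again, new transactions are journaled
edits_n1.append(("create", "/data/d.txt"))
print(sorted(fsimage_n1), edits_n1)
```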

The DataNodes
[Diagram: Data Nodes 1 to 4 each report to the NameNode: "Here is my heartbeat and block report!" Block 25 is shown as a sample block.]

The DataNodes
Source: http://www.slideshare.net/hdhappy001/nicholashdfs-what-is-new-in-hadoop-2

DataNode Failure
[Diagram: the other Data Nodes keep sending heartbeats and block reports. NameNode: "No heartbeat from DN3, it's dead. Replicate its data." Block 25 is shown as a sample block.]

HDFS Writes
Source: Hadoop definitive guide

Data Organization
• HDFS is designed to support very large files
• HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB (configurable; Hadoop 2.x defaults to 128 MB).
• A client application writes data into HDFS in blocks, each of the configured block size.
• Replication between the DataNodes is pipelined to eliminate replication overhead from the client application
[Diagram: Client Application → DN1 → DN2 → DN3]
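
To make the block arithmetic concrete, here is a small back-of-the-envelope sketch in Python. It uses the 64 MB block size quoted on this slide; the replication factor of 3 is an assumption (it is the usual HDFS default, but the slide does not state it), and hdfs_footprint is a made-up helper name.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, total block replicas) for a single file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

print(hdfs_footprint(1024))  # a 1 GB file
print(hdfs_footprint(200))   # the last block is only partly filled
```

A 1 GB file therefore occupies 16 blocks, and with three replicas the cluster stores 48 block copies.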

HDFS Architecture
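[Diagram: clients send metadata ops to the Namenode, which holds the metadata (name, replicas, ...; e.g. /home/foo/data, 6, ...). The Namenode sends block ops to the Datanodes. Clients read from and write blocks to Datanodes directly; blocks are replicated between Datanodes in Rack1 and Rack2.]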
Source: https://hadoop.apache.org

HDFS Reads
Source: Hadoop definitive guide

Rack Awareness
[Diagram: NameNode metadata: File.txt = Blk A on DN 1, 7, 8; Blk B on DN 8, 12, 14. DataNodes 1 to 5 sit in Rack1, 6 to 10 in Rack2 and 11 to 15 in Rack3, connected through the core.]

Recovery using Replication
• DataNode sends heartbeat to NameNode periodically
• No heartbeat? NameNode marks DataNode as dead

– Stops forwarding any new I/O requests to the DataNode
– Data registered to a dead DataNode is “lost” to HDFS
• NameNode’s process for data recovery:

– Determine which blocks were on the lost node
– Find other DataNodes with copies of these blocks
– Instruct these DataNodes to copy the blocks to other nodes
(whenever possible)
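
The recovery steps above can be sketched as a small planning function. This is a simplified Python illustration, not the NameNode's real algorithm: it ignores rack awareness and load, and the block and node names are invented.

```python
def plan_rereplication(block_locations, dead_node, live_nodes):
    """For each block that lived on the dead node, pick (source, target) for a new copy."""
    plan = {}
    for block, nodes in block_locations.items():
        if dead_node not in nodes:
            continue  # this block lost no replica
        survivors = [n for n in nodes if n != dead_node]       # nodes that still hold a copy
        candidates = [n for n in live_nodes if n not in nodes]  # nodes that could take one
        if survivors and candidates:
            plan[block] = (survivors[0], candidates[0])
    return plan

locations = {
    "blk_25": ["DN1", "DN3", "DN4"],
    "blk_26": ["DN1", "DN2", "DN4"],
}
print(plan_rereplication(locations, "DN3", ["DN1", "DN2", "DN4"]))
```

Only blk_25 had a replica on DN3, so only that block is scheduled for copying, from a surviving holder to a node that does not yet have it.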

HDFS Commands
hadoop fs -command [args]
A few commands:
-ls, -ls -R: list files/directories (recursively with -R)
-cat: display file content (uncompressed)
-chgrp, -chmod, -chown: change a file's group, permissions and owner
-put, -get, -copyFromLocal, -copyToLocal: copy files from the local file system to HDFS and vice versa
-mv, -moveFromLocal, -moveToLocal: move files

MapR-FS

Architectural Differences with MapR FS
Source: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots#.VZoux_lViko

MapR's Containers
Files/directories are sharded into blocks, which are placed into containers on disks.
Containers are 16-32 GB segments of disk, placed on nodes.
Source: http://www.slideshare.net/jayshao/nyc-hadoop-meetup-mapr-architecture-philosophy-and-applications
Container locations and replication
The container location database (CLDB) keeps track of the nodes hosting each container and the replication chain order.
[Diagram: the CLDB lists node pairs per container, such as (N1, N2), (N3, N2), (N1, N3), spread across nodes N1, N2 and N3. Source: MapR]
Source: http://www.slideshare.net/jayshao/nyc-hadoop-meetup-mapr-architecture-philosophy-and-applications

Random Writing in MapR
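[Diagram: a client writing data asks the CLDB for a 64M block. The CLDB creates a container and picks a master and 2 replica slaves. The client attaches and writes, then writes the next chunk to S2. Replica sets shown across servers S1 to S5: (S1, S2, S4), (S1, S3), (S1, S4, S5), (S2, S4, S5), (S2, S3, S5).]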
Source: http://www.slideshare.net/mcsrivas/design-scale-and-performance-of-maprs-distribution-for-hadoop

Map-Reduce

Understanding map-reduce
Source: https://developer.yahoo.com/hadoop/tutorial/module4.html

Understanding Map-Reduce
[Diagram: three cluster nodes, each a Data Node with its own Input Format. Five input splits feed Map Tasks 1 to 5. Each map task emits <key, value> pairs (keys such as key1 and key5) and writes them to its own map file (map files 1 to 5).]

Understanding Map-Reduce
[Diagram: map files 1 to 5 from the three cluster nodes pass through Shuffle and Sort, which groups the values by key. Reduce Task 1 receives <key1, value, value, value>, <key5, value>, <key7, value, value>, <key8, value> and <key9, value>. Reduce Task 2 receives <key2, value>, <key3, value, value, value>, <key4, value, value> and <key6, value, value, value>.]

Keys for the reduce phase
• A reducing function turns a large list of values into
one (or a few) output values.
• All of the values with the same key are presented
to a single reducer together.
• A partitioner decides which reducer receives each key.
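
A rough sketch of how a partitioner keeps all values for a key together, written in Python for illustration. Hadoop's default partitioner hashes the key and takes it modulo the number of reducers; here zlib.crc32 stands in for the key's hash code (an assumption made so the result is stable between runs), and the sample pairs are invented.

```python
import zlib

def partition(key, num_reducers):
    """Stable hash partitioning: the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

pairs = [("key1", 1), ("key2", 5), ("key1", 7), ("key3", 2), ("key1", 4)]

# One dict per reducer, mapping key -> list of values received
reducers = {r: {} for r in range(2)}
for key, value in pairs:
    reducers[partition(key, 2)].setdefault(key, []).append(value)

print(reducers)
```

Whichever reducer key1 lands on, all three of its values arrive there together, which is what lets the reduce function see the complete value list.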

Detailed MapReduce Flow

Hadoop Map Reduce – Word Count
Source: http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
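
The word-count flow can be imitated in a few lines of single-process Python. This is a teaching sketch of the map, shuffle/sort and reduce phases, not Hadoop API code; the function names and input lines are made up.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle and sort: group values by key, keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(key, values):
    """Reduce: collapse the list of counts for one word into a total."""
    return key, sum(values)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a cluster the three phases run on different machines, but the data shapes at each step are the same as in this sketch.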

The Combiner

Mapper Output process
1. Input splits supply <k1,v1> pairs
2. Mapper: the map method outputs <k2,v2> pairs
3. The pairs go into the map output buffer
4. Records are sorted and spilled to disk when the buffer reaches a threshold
5. Spill files are merged into a single file (mapper output = reducer input)

Reducer Output Process
1. The Reducer fetches the data from the Mappers (mapper output = reducer input)
2. Fetched data is held in an in-memory buffer
3. The buffer is spilled to files
4. Spill files are merged into the reducer input
5. The Reducer processes the merged input and writes its output to HDFS

Key-Value pairs
<K1, V1>
<K2, V2>
Mapper Shuffle/Sort
<K3, V3> Reducer

<K2, (V2,V2,V2,V2)>

Hadoop MR Job Flow
[Diagram: a Client submits a job to the Job Tracker, shown alongside the Name Node and Secondary Name Node. Each of three Cluster Nodes runs a Task Tracker that manages several Tasks.]


Image Courtesy: http://www.apache.org/

What is YARN?
• YARN = Yet Another Resource Negotiator
• YARN splits up the functionality of the JobTracker in Hadoop 1.x into two separate processes:
– ResourceManager: for allocating resources and scheduling applications
– ApplicationMaster: for executing applications and providing failover

Why YARN ?
• Hadoop 1.x – ONLY MapReduce
• Not every problem is a MapReduce
• Hadoop 1.x limitations
Limited Scalability
Low cluster resources utilization
Lack of support for alternative frameworks/
paradigms

Hadoop 1.x Architecture
Source: http://www.ibm.com/developerworks/library/bd-hadoopyarn/

JobTracker responsibilities
• Manages the computational resources in terms of map and
reduce slots
• Schedules the submitted jobs
• Monitors the executions of the TaskTrackers
• Restarts failed tasks
• Performs speculative execution of tasks
• Calculates the Job Counters
Clearly the JobTracker alone does many things at once and is overloaded with work.

Hadoop 1.x limitations
• Lacks support for alternate paradigms and services
– Forces everything to look like MapReduce
– Iterative applications in MapReduce are 10x slower
• Scalability
– Max cluster size ~5,000 nodes
– Max concurrent tasks ~40,000
• Availability
– Failure kills queued and running jobs
• Hard partition of resources into map and reduce slots
– Non-optimal resource utilization

Hadoop 1.x Architecture Redesign
Source: http://www.ibm.com/developerworks/library/bd-hadoopyarn/

Hadoop 1.x Architecture Redesign
• YARN splits up the two major functions of the JobTracker into:
– Global ResourceManager (RM): cluster resource management and scheduling responsibilities
– ApplicationMaster (AM): application life cycle management, i.e., job execution and monitoring (per application)
• The AM coordinates the logical plan of a single job.
• The AM negotiates resource containers from the Scheduler, tracking their status and monitoring progress.
• The AM itself runs as a normal container.
• The TaskTracker is replaced by the NodeManager (NM), a new per-node slave responsible for:
1. launching the applications' containers,
2. monitoring their resource usage (CPU, memory, disk, network),
3. reporting to the RM.

YARN Benefits
• Scalability
• Optimal Cluster Utilization
• Support for Alternative frameworks
• Multi-tenancy
• Flexible Resource Model

Where else is YARN used?
• Apache Tez on YARN
• Storm on YARN
• HOYA (Apache HBase on YARN)
• Apache Samza on YARN
• Apache Giraph for graph processing
• Apache Accumulo on YARN

How Applications Work in YARN
Source: http://doc.mapr.com/display/MapR/YARN

Multi-Tenancy is Built-in
• Queues
• Economics as queue-capacity
– Hierarchical Queues
• SLAs
– Cooperative Preemption
• Resource Isolation
– Linux: cgroups
• Administration
– Queue ACLs
Default Capacity Scheduler supports all features
[Diagram: the ResourceManager runs the Capacity Scheduler over hierarchical queues. Under root: Mrkting 20%, Adhoc 10%, DW 70%. Second level: Dev 20% / Prod 80% and Dev 10% / Reserved 20% / Prod 70%. Third level: P0 70% / P1 30%.]
Source: http://www.slideshare.net/hortonworks/apache-hadoop-yarn-understanding-the-data-operating-system-of-hadoop

RM High Availability
Source: http://blog.cloudera.com/blog/2014/05/how-apache-hadoop-yarn-ha-works/

Key YARN Takeaways
• YARN is a platform to build/run Multiple Distributed Applications in
Hadoop
• YARN is completely Backwards Compatible for existing MapReduce apps
• YARN enables Fine Grained Resource Management via Generic Resource
Containers.
• YARN has built-in support for multi-tenancy to share cluster resources
and increase cost efficiency
• YARN provides a cluster operating system like abstraction for a modern
data architecture
Source: http://www.slideshare.net/hortonworks/apache-hadoop-yarn-understanding-the-data-operating-system-of-hadoop

Apache Sqoop

What is Sqoop ?
A tool designed to transfer data between Hadoop and
relational databases
- import data from a relational database management system
(RDBMS) such as MySQL or Oracle or a mainframe into the Hadoop
Distributed File System
- Export the data into an RDBMS

Overview of Sqoop
Source: http://hortonworks.com/hadoop/sqoop/

The Sqoop Import Tool
• The import command has the following requirements:
– Must specify a connect string using the --connect argument
– Credentials can be included in the connect string, or supplied using the --username and --password arguments
– Must specify either a table to import using --table, or the result of a SQL query using --query

Importing a Table
sqoop import \
  --connect jdbc:mysql://host/nyse \
  --table StockPrices \
  --target-dir /data/stockprice/ \
  --as-textfile

Importing Specific Columns
sqoop import \
  --connect jdbc:mysql://host/bse \
  --table Stocks \
  --columns "StockSymbol,Volume,High,ClosingPrice" \
  --target-dir /data/stocks/ \
  --as-textfile \
  --split-by StockSymbol \
  -m 10

Importing from a Query
sqoop import \
  --connect jdbc:mysql://host/bse \
  --query "SELECT * FROM Stocks s WHERE s.Volume >= 1000000 AND \$CONDITIONS" \
  --target-dir /data/stocks/ \
  --as-textfile \
  --direct \
  --split-by StockSymbol

The Sqoop Export Tool
• The export command transfers data from HDFS to
a database:
– Use --table to specify the database table
– Use --export-dir to specify the data to export
• Rows are appended to the table by default
• If you define --update-key, then existing rows will be
updated with the new data
• Use --call to invoke a stored procedure (instead of
specifying the --table argument)

Exporting to a Table
sqoop export \
  --connect jdbc:mysql://host/mylogs \
  --table LogData \
  --export-dir /data/logfiles/ \
  --input-fields-terminated-by "\t"

Sqoop Labs

Apache Flume

What is Flume ?
Source: http://www.slideshare.net/aprabhakar/apache-flume-datadaytexas

Flume Agent design
Source: https://blogs.apache.org/flume/

Events
Flume’s core data movement atom: the Event
• An Event has a simple schema

– Event header: Map<String, String>
• similar in spirit to HTTP headers
– Event body: array of bytes
Source: http://www.slideshare.net/Hadoop_Summit/percy-june26-455room211

Channels
• Passive Component
• Channel type determines the reliability guarantees
• Stock channel types:
– JDBC
– Memory – lower latency for small writes, but not durable
– File – provides durability; most people use this

Sources
• Event-driven or polling-based
• Most sources can accept batches of events
• Source implementations:
o Avro-RPC – other Java-based Flume agents can send data to this source port
o Thrift-RPC
o HTTP – post via a REST service (extensible)
o Netcat
o Scribe
o Spooling Directory – parse and ingest completed log files
o Exec – execute shell commands and ingest the output

Sinks
• All sinks are polling-based
• Most sinks can process batches of events at a time
• Sink implementations :
o HDFS
o HBase
o SolrCloud
o ElasticSearch
o Avro-RPC, Thrift-RPC
o File Roller
o Null, Logger, Seq

Data flow model
Source: http://flume.apache.org/FlumeUserGuide.html

Anatomy of a Flume agent

Flume component interactions
• Source: Puts events into the local channel

• Channel: Store events until someone takes them
• Sink: Takes events from the local channel
– On failure, sinks back off and retry until they succeed
Source: http://www.slideshare.net/Hadoop_Summit/percy-june26-455room211
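
The interaction between the three components can be modelled with a bounded queue. The Python below is only a conceptual sketch, not Flume code: the MemoryChannel class, the event layout (a headers map plus a byte-array body, as described on the Events slide) and the sample log lines are all illustrative.

```python
from collections import deque

class MemoryChannel:
    """Passive buffer: sources put events in, sinks take them out."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        if len(self.events) >= self.capacity:
            raise OverflowError("channel full")
        self.events.append(event)

    def take(self):
        """Return the oldest event, or None when the channel is empty."""
        return self.events.popleft() if self.events else None

channel = MemoryChannel(capacity=10000)

# Source side: wrap each incoming line in an event and put it on the channel
for line in ["GET /a", "GET /b"]:
    channel.put({"headers": {"host": "web01"}, "body": line.encode()})

# Sink side: poll the channel until it is drained
delivered = []
while (event := channel.take()) is not None:
    delivered.append(event["body"])

print(delivered)
```

Because the channel is the only thing the source and sink share, either side can be slow or restarted without the other needing to know.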

A Flume Example
agent.sources = webserver
agent.channels = memoryChannel
agent.sinks = mycluster
agent.sources.webserver.type = exec
agent.sources.webserver.command = tail -F /var/log/hadoop/hdfs/audit.log
agent.sources.webserver.batchSize = 1
agent.sources.webserver.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
agent.sinks.mycluster.type = hdfs
agent.sinks.mycluster.channel = memoryChannel
agent.sinks.mycluster.hdfs.path = hdfs://127.0.0.1:8020/hdfsaudit/

Transactional Data Exchange

Routing and Replication

Multiplexing the flow

Multi-agent flow

Consolidation

Custom Source and Sink
• Custom Source
– A custom source is your own implementation of the Source interface
• Custom Sink
– A custom sink is your own implementation of the Sink interface

Load Balancing Sink
• Ability to load-balance flow over multiple sinks
– maintains an indexed list of active sinks on which the
load must be distributed
– Supports distributing load using either round_robin or random selection mechanisms
– Custom selection mechanisms are supported via custom
classes

Flume Interceptors
• Flume has the capability to modify/drop events in-flight
• An interceptor can modify or even drop events
based on any criteria chosen by the developer of
the interceptor
• Flume supports chaining of interceptors
• The order in which the interceptors are specified is
the order in which they are invoked
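
Interceptor chaining can be pictured as a list of functions applied in order, where returning nothing drops the event. This Python sketch only illustrates the idea and is not the Flume Interceptor interface; the two interceptors and the event layout are invented for the example.

```python
def add_host(event):
    """Modify: tag the event with a (made-up) host header."""
    event["headers"]["host"] = "web01"
    return event

def drop_debug(event):
    """Drop: discard events whose body contains DEBUG."""
    return None if b"DEBUG" in event["body"] else event

def run_chain(event, interceptors):
    """Apply interceptors in the order given; None means the event was dropped."""
    for interceptor in interceptors:
        event = interceptor(event)
        if event is None:
            return None
    return event

chain = [add_host, drop_debug]
kept = run_chain({"headers": {}, "body": b"INFO started"}, chain)
dropped = run_chain({"headers": {}, "body": b"DEBUG noisy"}, chain)
print(kept, dropped)
```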

Apache Pig

Pig
• Provides a high level of abstraction over MapReduce for processing large datasets.
• Pig can process both structured and unstructured data ("Pig eats anything")
• Pig is designed to be extensible via UDFs
• It is made up of:
– Pig Latin: the language
– Pig programs can be run using
• Pig binary
• Grunt shell
• Java programs

Pig Latin
• High-level data flow scripting language
• Pig program is a series of operations or
transformations on input data to produce output
• Pig executes in a unique fashion:
– During execution each statement is processed by the Pig
interpreter
– If a statement is valid, it gets added to a logical plan
built by the interpreter
– The steps in the logical plan do not actually execute until
a DUMP or STORE command

Pig Latin – Define Relation Names
• A relation is created as a result of an operation or

transformation in pig script
• An alias is a name assigned to a relation so that it
can be used in subsequent operations or
transformations
• For example, data is an alias:
data = LOAD 'data.txt' USING TextLoader();

Pig Latin – Define Relation with Fields
• Relations can define and use field names, which are associated with an alias
• For example:
salaries = LOAD 'salary.data' USING PigStorage(',') AS (gender, age, income, zip);
highsalaries = FILTER salaries BY income > 100000;

Pig Latin - Data Types
• int • long
• float • double
• chararray • bytearray
• boolean • datetime
• bigdecimal • biginteger

Pig Latin - Complex Types
• Tuple: ordered set of values

• (OH,Mark,Twain,31225)
• Bag: unordered collection of tuples
{
(OH,Mark,Twain,31225),
(UK,Charles,Dickens,42207),
(ME,Robert,Frost,11496)
}
• Map: collection of key value pairs
[state#OH,name#Mark Twain,zip#31225]

Pig Latin – Define Relation with Schema
customers = LOAD 'customer.data' AS (
firstName: chararray,
lastname: chararray,
houseno: int,
street: chararray,
phone: long,
payment: double);
salaries = LOAD 'salaries.txt' AS (
gender: chararray,
details: bag {b: (age: int, salary: double, zip: long)});

Pig Latin - GROUP Operator
Input relation salaries:
gender  age  salary  zip
F       25   35000   95103
M       30   45000   95102
F       35   60000   95103
F       30   48000   95105
M       30   47000   95102
M       25   39000   95103
salariesbyage = GROUP salaries BY age;
Result salariesbyage:
group  salaries
25     {(F, 25, 35000, 95103), (M, 25, 39000, 95103)}
30     {(M, 30, 45000, 95102), (F, 30, 48000, 95105), (M, 30, 47000, 95102)}
35     {(F, 35, 60000, 95103)}
grunt> DESCRIBE salariesbyage;
salariesbyage: {group: int, salaries: {(gender: chararray, age: int, salary: double, zip: int)}}
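
For readers more familiar with general-purpose languages, the effect of GROUP ... BY can be mimicked in Python. This is an analogy only, not how Pig executes the statement: a relation becomes a list of tuples, and each group key maps to a "bag" (here a list) of the original tuples. The group_by helper is a made-up name.

```python
from collections import defaultdict

# (gender, age, salary, zip), the same rows as the salaries relation
salaries = [("F", 25, 35000.0, 95103), ("M", 30, 45000.0, 95102),
            ("F", 35, 60000.0, 95103), ("F", 30, 48000.0, 95105),
            ("M", 30, 47000.0, 95102), ("M", 25, 39000.0, 95103)]

def group_by(relation, field_index):
    """Collect whole tuples into one bag per distinct value of the chosen field."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[field_index]].append(row)
    return dict(groups)

salariesbyage = group_by(salaries, 1)
for age, bag in sorted(salariesbyage.items()):
    print(age, bag)
```

Note that, as in Pig, each bag keeps the complete original tuples, including the field that was grouped on.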

Pig Latin - GROUP ALL Operator
Input relation salaries:
gender  age  salary  zip
F       25   35000   95103
M       30   45000   95102
F       35   60000   95103
F       30   48000   95105
M       30   47000   95102
M       25   39000   95103
salariesgroupall = GROUP salaries ALL;
Result salariesgroupall:
group  salaries
all    {(F, 25, 35000, 95103), (M, 25, 39000, 95103), (M, 30, 45000, 95102), (F, 30, 48000, 95105), (M, 30, 47000, 95102), (F, 35, 60000, 95103)}
grunt> DESCRIBE salariesgroupall;
salariesgroupall: {group: chararray, salaries: {(gender: chararray, age: int, salary: double, zip: int)}}

Pig Latin – Relation without schema
Input relation salaries (no schema, fields addressed by position):
$0  $1  $2     $3
F   25  35000  95103
M   30  45000  95102
F   35  60000  95103
F   30  48000  95105
M   30  47000  95102
M   25  39000  95103
salariesbyzip = GROUP salaries BY $3;
Result salariesbyzip:
group  salaries
95102  {(M, 30, 45000, 95102), (M, 30, 47000, 95102)}
95103  {(F, 25, 35000, 95103), (F, 35, 60000, 95103), (M, 25, 39000, 95103)}
95105  {(F, 30, 48000, 95105)}
grunt> DESCRIBE salariesbyzip;
salariesbyzip: {group: bytearray, salaries: {()}}

Pig Latin – FOREACH GENERATE Operator
salaries A
gender age salary zip age salary
F 25 35000.00 95103 25 35000.00
M 30 45000.00 95102 30 45000.00
F 35 60000.00 95103 35 60000.00
F 30 48000.00 95105 30 48000.00
M 30 47000.00 95102 30 47000.00
M 25 39000.00 95103 25 39000.00
A = FOREACH salaries GENERATE age, salary;
grunt > DESCRIBE A;
A: { age: int, salary: double }

Pig Latin – Specifying Ranges in FOREACH
salaries B/C
gender age salary zip age salary zip
F 25 35000.00 95103 25 35000.00 95103
M 30 45000.00 95102 30 45000.00 95102
F 35 60000.00 95103 35 60000.00 95103
F 30 48000.00 95105 30 48000.00 95105
M 30 47000.00 95102 30 47000.00 95102
M 25 39000.00 95103 25 39000.00 95103
grunt> salaries = LOAD 'salaries.txt' USING PigStorage(',') AS (gender: chararray, age: int, salary: double, zip: int);
grunt> B = FOREACH salaries GENERATE age .. zip;
grunt> C = FOREACH salaries GENERATE age .. ;
grunt> D = FOREACH salaries GENERATE .. salary;

Pig Latin – Specifying Ranges in FOREACH
salaries = LOAD 'salaries.txt' USING PigStorage(',') AS

(gender:chararray, age:int,salary:double,zip:int);
C = FOREACH salaries GENERATE age..zip;
D = FOREACH salaries GENERATE age..;
E = FOREACH salaries GENERATE ..salary;
customer = LOAD 'data/customers';

F = FOREACH customer GENERATE $12..$23;

Pig Latin – FILTER Operator
Input relation salaries:
gender  age  salary    zip
F       25   35000.00  95103
M       30   45000.00  95102
F       35   60000.00  95103
F       30   48000.00  95105
M       30   47000.00  95102
M       25   39000.00  95103
G = FILTER salaries BY salary > 45000;
Result G:
gender  age  salary    zip
M       30   47000.00  95102
F       30   48000.00  95105
F       35   60000.00  95103

Pig Latin – CASE Operator
Input relation salaries:
gender  age  salary    zip
F       25   35000.00  95103
M       30   45000.00  95102
F       35   60000.00  95103
F       30   48000.00  95105
M       30   47000.00  95102
M       25   39000.00  95103
bonus = FOREACH salaries GENERATE salary, (
CASE
WHEN salary <= 40000 THEN salary * 0.3
WHEN salary > 40000 AND salary <= 50000 THEN salary * 0.2
WHEN salary > 50000 THEN salary * 0.1
END
) AS bonus;
Result bonus:
salary    bonus
35000.00  10500
45000.00  9000
60000.00  6000
48000.00  9600
47000.00  9400
39000.00  11700
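
The three salary bands in the CASE expression can be checked by hand with an equivalent function. The Python below is just a cross-check of the arithmetic, not Pig code; the bonus function is a made-up stand-in for the CASE expression, and the salaries are listed in the same order as the input relation.

```python
def bonus(salary):
    """Mirror the CASE bands: 30% up to 40000, 20% up to 50000, 10% above."""
    if salary <= 40000:
        return salary * 0.3
    if salary <= 50000:
        return salary * 0.2
    return salary * 0.1

salary_column = [35000.0, 45000.0, 60000.0, 48000.0, 47000.0, 39000.0]
for salary in salary_column:
    # round() hides floating-point noise in the products
    print(salary, round(bonus(salary)))
```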

Pig Latin - Using PARALLEL
• PARALLEL sets the number of reducers used by a particular operation
• Without PARALLEL, Pig estimates one reducer per GB of input data, capped at 999 reducers by default
A = LOAD 'data1';
B = LOAD 'data2';
C = JOIN A BY $1, B BY $3 PARALLEL 20;
D = ORDER C BY $0 PARALLEL 5;

Pig Latin – Inner Join
Input relation location:
state  name
CA     James
AZ     Bill
NY     John
MN     Bond
Input relation department:
fname  dept
John   Sales
Bond   Finance
James  Support
Bill   IT
joined = JOIN location BY name, department BY fname;
Result joined:
location::state  location::name  department::fname  department::dept
CA               James           James              Support
AZ               Bill            Bill               IT
NY               John            John               Sales
MN               Bond            Bond               Finance

Pig Latin - Replicated Joins
• Loads one dataset into memory to perform a map-side join
• Specify 'replicated', which is applied to the second data set listed; that data set must be small enough to fit in memory
grunt> replicatedjoin = JOIN location BY name, department BY fname USING 'replicated';
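
The idea behind a replicated (map-side) join is a hash join: the small relation is turned into an in-memory lookup table and the large relation is streamed past it. The Python below sketches that idea with the location and department rows from the inner join slide; it is an illustration of the technique, not Pig's implementation, and replicated_join is a made-up helper name.

```python
location = [("CA", "James"), ("AZ", "Bill"), ("NY", "John"), ("MN", "Bond")]
department = [("John", "Sales"), ("Bond", "Finance"),
              ("James", "Support"), ("Bill", "IT")]

def replicated_join(large, small):
    """Hash the small relation in memory, then stream the large one past it."""
    lookup = {}
    for fname, dept in small:
        lookup.setdefault(fname, []).append((fname, dept))
    return [(state, name) + match
            for state, name in large
            for match in lookup.get(name, [])]

for row in replicated_join(location, department):
    print(row)
```

No shuffle is needed because every mapper holds the whole small relation, which is why the second relation has to fit in memory.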

Pig Latin – User Defined Functions
• Pig UDF is implemented in java by:
– Implementing a Java class that extends EvalFunc.
– Deploying the class in a JAR file.
– Registering the JAR file in the Pig script using the
REGISTER command.
– Optionally define an alias for the UDF using the DEFINE
command

Pig Latin - UDF in Java Example
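package com.impetus.udfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Upper-cases the first field of the input tuple
public class UpperCaseUdf extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for empty or null input instead of throwing
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().trim().toUpperCase();
    }
}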

Pig Latin – UDF Invocation
UDF Registration
grunt> REGISTER myudf.jar;
UDF Invocation (fully qualified class name)
grunt> x = FOREACH location GENERATE com.impetus.udfs.UpperCaseUdf(name);
Define UDF alias
grunt> DEFINE TO_UPPER_UDF com.impetus.udfs.UpperCaseUdf();
grunt> x = FOREACH location GENERATE TO_UPPER_UDF(name);

DataFu Library
• DataFu is a collection of Pig UDFs for data analysis
on Hadoop
• Started by LinkedIn and open sourced under the
Apache 2.0 license

Tips for Optimizing Pig Scripts
• Filter early and often
• Project early and often
• Drop nulls before a join
• Use replicated joins whenever possible
• Use PARALLEL properly
• Use compression
• Choose the right data types
• Use .pigbootup for global settings of Pig scripts

Macros
Pig Latin supports the definition, expansion, and
import of macros
• A macro definition can appear anywhere in a Pig script as
long as it appears prior to the first use
• Recursive references are not allowed.
• Macros are not allowed inside a FOREACH nested block.
• Macros cannot contain Grunt shell commands.
• Macros cannot include a user-defined schema that has a
name collision with an alias in the macro

Limit Operator
• Limits the number of output tuples for a relation
• It does not guarantee that the same tuples are returned
on each execution
• Applying ORDER BY immediately before LIMIT
guarantees the same result on each execution
employees = LOAD 'pigdemo.txt' AS (state:chararray, name:chararray);
emp_group = GROUP employees BY state;
L = LIMIT emp_group 3;

SAMPLE
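• Selects a random sample of a relation; the argument is the
fraction of tuples to keep (0.05 keeps roughly 5%)
• The number of tuples returned is approximate and can vary
between executions
• Useful when working with large relations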
employees = LOAD 'pigdemo.txt' AS (state:chararray, name:chararray);
employee_subset = SAMPLE employees 0.05;

Built-in Functions
PluckTuple
– Allows the user to specify a string prefix, and then filter for the
columns in a relation that begin with that prefix.
IsEmpty
– Checks if a bag or map is empty.
SIZE
– Computes the number of elements based on any Pig data type.
SUBTRACT
– Bags subtraction, SUBTRACT(bag1, bag2) = bags composed of bag1
elements not in bag2
TOKENIZE
– Splits a string and outputs a bag of words.

Type of Loaders and Storage
TextLoader
– Loads unstructured data in UTF-8 format.
PigStorage
– Loads and stores data as structured text files.
PigDump
– Stores data in UTF-8 format.
JsonLoader, JsonStorage
– Load or store JSON data

Type of Loaders and Storage
BinStorage
– Loads and stores data in machine-readable format.
HBaseStorage
– Loads and stores data from an HBase table
AvroStorage
– Loads and stores data from Avro files.

Handling Compression
• Support for compression is determined by the
load/store function
• PigStorage and TextLoader support gzip and bzip2
compression for both read (load) and write (store)
– gzip files cannot be split across multiple maps
– bzip2 files can be split across multiple maps
• BinStorage does not support compression.

Debugging Pig Latin
Pig Latin provides operators that can help you debug
your Pig Latin statements:
• Use the DUMP operator to display results to your terminal
screen.
• Use the DESCRIBE operator to review the schema of a
relation.
• Use the EXPLAIN operator to view the logical, physical, or
map reduce execution plans to compute a relation.
• Use the ILLUSTRATE operator to view the step-by-step
execution of a series of statements.

Explain Command
Review the logical, physical, and map reduce
execution plans that are used to compute the
specified relation
• The logical plan shows a pipeline of operators to be
executed to build the relation.
• Type checking and backend-independent optimizations
(such as applying filters early on) also apply
• The physical plan shows how the logical operators are
translated to backend-specific physical operators. Some
backend optimizations also apply.
• The mapreduce plan shows how the physical operators are
grouped into map reduce jobs.

Illustrate Command
Displays a step-by-step execution of a sequence of
statements.
• Review how data is transformed through a sequence of Pig
Latin statements
• Test your programs on small datasets and get faster
turnaround times
• Works by retrieving a small sample of the input data and
then propagating this data through the pipeline
• The algorithm may automatically generate example data in near
real-time. You might therefore see data propagating through the
pipeline that was not found in the original input data
– This usually happens for joins

Pig Statistics
Pig Statistics is a framework for collecting and storing
script-level statistics for Pig Latin
• Complex Pig scripts often generate many MapReduce jobs.
• To help you debug a script, Pig prints a summary of the
execution that shows which relations (aliases) are mapped
to each MapReduce job.
• Pig statistics and the existing Hadoop statistics can also be
accessed via the Hadoop job history file
• Piggybank has a HadoopJobHistoryLoader which acts as an
example of using Pig itself to query these statistics

Timing UDFs
Method for approximately measuring how much time
is spent in different user-defined functions (UDFs)
and Loaders
• Set the pig.udf.profile property to true
• Measures:
– the approximate amount of time spent in a UDF
– the approximate number of times the UDF was invoked

Execution control
• Output location strict check
– Avoid writing to the same output location. To enforce strict checking of
output locations and fail fast, set pig.location.check.strict=true
• Disabling Pig commands and operators
– Blacklist and/or whitelist certain commands and operators
– Blacklisting
• For example, pig.blacklist=rm,kill,cross prevents users from executing the
"rm" and "kill" commands and the "cross" operator.
– Whitelisting
• Disables all commands and operators that are not part of the whitelist.
• For example, pig.whitelist=load,filter,store disallows every command and
operator other than "load", "filter" and "store".
– The blacklist and whitelist must not conflict. Keep them entirely distinct
or Pig will complain.

Optimization Rules in Pig
• FilterLogicExpressionSimplifier
• PartitionFilterOptimizer
– Push the filter condition to loader
A = LOAD 'input' as (dt, state, event) using HCatLoader();
B = FILTER A BY dt=='201310' AND state=='CA';
The filter condition is pushed to the loader if the loader supports it:
A = LOAD 'input' as (dt, state, event) using HCatLoader();
--Filter is removed

Optimization Rules in Pig
• SplitFilter
• PushUpFilter
• LimitOptimizer
• PushDownForEachFlatten

Pig Labs

Apache Hive

What is Hive?
• Data warehouse system for Hadoop
• Hive organizes data into tables by maintaining
metadata information about Big Data stored on
HDFS
• Metadata, such as the schema, is stored in a database called
the Metastore (Derby by default, or MySQL)
• Perform SQL-like operations on the data using a
query language called HiveQL

Comparing Hive to SQL
SQL Datatypes
– INT
– TINYINT/SMALLINT/BIGINT
– BOOLEAN
– FLOAT
– DOUBLE
– STRING
– BINARY
– TIMESTAMP
– ARRAY, MAP, STRUCT, UNION
– DECIMAL
– CHAR
– VARCHAR
– DATE
SQL Semantics
– SELECT, LOAD, INSERT from query
– Expressions in WHERE and HAVING
– GROUP BY, ORDER BY, SORT BY
– CLUSTER BY, DISTRIBUTE BY
– Sub-queries in FROM clause
– ROLLUP and CUBE
– UNION
– LEFT, RIGHT and FULL INNER/OUTER JOIN
– CROSS JOIN, LEFT SEMI JOIN
– Windowing functions (OVER, RANK, etc.)
– Sub-queries for IN/NOT IN, HAVING
– EXISTS / NOT EXISTS
– INTERSECT, EXCEPT

Hive Architecture
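[Figure: Hive clients (HQL CLI, JDBC/ODBC, Web UI) connect to Hive Server 2 and the Metastore; the Hive Driver (Compiler, Optimizer, Executor) runs on Hadoop YARN (MR + HDFS), with the Resource Manager and Name Node as masters and Data Node + Node Manager on the workers]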

Hive Services
• Services available with Hive:
– Hive CLI (command line interface)
• Traditional client used to connect to a HiveServer instance
$ hive -h hostname
hive>
– hiveserver
• Runs Hive as a server exposing a Thrift service, enabling access
from a range of clients (JDBC, ODBC)
– HWI (Hive Web Interface)
• Allows running Hive queries using a web interface
(http://localhost:9999/hwi)
– jar
• Hive equivalent to "hadoop jar", a convenient way to run Java
applications
– metastore
• By default it runs in the same process as the Hive server

Ways to Submit Hive Queries
• Using Hive CLI
– Traditional Hive client that connects to a HiveServer
instance
$ hive -h hostname
hive>
• Using Beeline
– A new command line client that connects to a
HiveServer2 instance using the Hive JDBC driver
$ beeline
Hive version 0.11.0-SNAPSHOT by Apache
beeline> !connect jdbc:hive2://hostname:10000 username password org.apache.hive.jdbc.HiveDriver

Hive Tables
• Logically made up of
– Data stored on HDFS
– Metadata describing the layout of the data in a
Metastore (which is an RDBMS)
• Two types of tables:
– Managed Tables (default)
• Hive moves data into its warehouse directory on HDFS
– External Tables
• Hive refers to data in an existing location on HDFS outside
warehouse directory

Define a Hive-Managed Table
• Use “create table” command to define a Hive-
Managed table
• This command only creates meta-data information in
metastore
CREATE TABLE customer (
id INT,
fName STRING,
lName STRING,
birthday TIMESTAMP
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Define a Hive-External Table
• Use the "create external table" command with the "LOCATION"
keyword to define a Hive-External table
• This command only creates meta-data information
in the metastore
CREATE EXTERNAL TABLE salaries (
gender STRING,
age INT,
salary DOUBLE,
zip INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/data/salaries/';

Loading Data in Hive
• Use the following commands to load data into Hive
tables
• LOAD DATA only copies (LOCAL) or moves (HDFS) files into the Hive
warehouse directory for Managed tables, or into the LOCATION
directory for External tables; the data is not parsed or transformed
LOAD DATA LOCAL INPATH '/tmp/customer.csv' OVERWRITE INTO TABLE customer;
LOAD DATA INPATH '/tmp/customer.csv' OVERWRITE INTO TABLE customer;
INSERT INTO TABLE birthdays
SELECT fName, lName, birthday FROM customer WHERE
birthday IS NOT NULL;

Query Data in Hive
SELECT * FROM customer;
FROM customer
SELECT fName, lName, birthday
WHERE birthday IS NOT NULL;
SELECT customer.*, orders.*
FROM customer JOIN orders ON
(customer.id = orders.customerId);

Use Hive to Save Results to file
INSERT OVERWRITE DIRECTORY
'/user/data/ca_or_sd'
FROM names
SELECT name, state
WHERE state = 'CA' or state = 'SD';
INSERT OVERWRITE LOCAL DIRECTORY
'/tmp/data'
SELECT * FROM bucketnames
ORDER BY age;

Hive Partitions
• Hive allows a table to be partitioned, dividing it
into coarse-grained parts based on the value of a
partition column
• Partitions enable faster queries on smaller slices of
data
CREATE TABLE employee (id INT, name STRING, salary DOUBLE)
PARTITIONED BY (dept STRING);
• Sub-folders are created based on partition values
/data/hive/warehouse/employee
/dept=presales/
/dept=admin/
/dept=support/

Hive Buckets
• Buckets impose extra structure on the table
• Buckets enable more efficient sampling of data by
dividing data into smaller parts
• Create buckets on column with high cardinality
[Figure: the bucket column value of each input record is hashed, and the table's data is divided up into buckets (bucket 0, bucket 1, bucket 2)]
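The hashing step can be sketched in a few lines of Java. This is a minimal illustration under assumptions: Object.hashCode stands in for whatever hash Hive applies to a given column type, and the class name BucketSketch is made up for this example.

```java
// Illustrative only: assigns rows to buckets by hashing the bucket column.
class BucketSketch {

    // Masking with Integer.MAX_VALUE clears the sign bit, so the bucket
    // number is never negative even when hashCode() is.
    static int bucketFor(Object columnValue, int numBuckets) {
        return (columnValue.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 3;
        for (int id : new int[] {100, 101, 102, 103}) {
            System.out.println("id " + id + " -> bucket " + bucketFor(id, numBuckets));
        }
    }
}
```

A high-cardinality column spreads rows more evenly because many distinct hash values are available to fill the buckets.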

Hive Skew Tables
• Skewed tables help with columns that have an uneven
distribution of values, where some values occur far more
often than others
• Skewed values are stored in separate directories to
enable efficient queries
CREATE TABLE customer (
id INT,
name STRING,
zip INT
) SKEWED BY (zip) ON (92121, 92120)
STORED AS DIRECTORIES;

Hive Join Strategies
Shuffle Join
– Approach: Join keys are shuffled using MapReduce and joins are performed on the reduce side.
– Pros: Works regardless of data size or layout.
– Cons: Most resource-intensive and slowest join type.
Map (Broadcast) Join
– Approach: Small tables are loaded into memory in all nodes; the mapper scans through the large table and joins.
– Pros: Very fast, single scan through largest table.
– Cons: All but one table must be small enough to fit in RAM.
Sort-Merge-Bucket Join
– Approach: Mappers take advantage of co-location of keys to do efficient joins.
– Pros: Very fast for tables of any size.
– Cons: Data must be sorted and bucketed ahead of time.
Source: http://www.slideshare.net/ye.mikez/hive-tuning

Hive UDF
• With UDFs, users can plug in custom processing code and
invoke it from a query
• Three types of UDF
– UDF (User Defined Functions)
• Operate on single row to output a single row
– UDAF (User Defined Aggregate Functions)
• Operate on multiple rows to output a single row
– UDTF (User Defined Table Functions)
• Operate on a single row to output multiple rows
• UDF have to be written in Java

Hive UDF
• A UDF must extend the UDF class and implement an evaluate
method
package com.impetus.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}

Invoking Hive UDF
• Register the UDF jar file with Hive
• Create a FUNCTION to access the UDF. Temporary functions
are defined only for the duration of the session
ADD JAR /myapp/lib/myhiveudf.jar;
CREATE TEMPORARY FUNCTION ToLower AS
'com.impetus.hive.udf.Lower';
FROM customer
SELECT ToLower(fName),
ToLower(lName);

Hive Views
• View in Hive is defined by SELECT statement
• Views help to:
– Reduce complexity of a query
– Restrict access to subset of actual Hive table
CREATE VIEW 2014_VISITORS AS
SELECT fname, lname, logdate, infoComments
FROM visitor_log
WHERE cast(substring(logdate, 6,4) AS INT) = 2014;

Hive File Formats
• Hive supports different file formats
– Text file
– SequenceFile
– RCFile (Row Columnar file)
– ORC File
• The file format is defined using the "STORED AS" keyword
CREATE TABLE names (fname STRING, lname STRING)
STORED AS RCFile;

Hive ORC Files
• The Optimized Row Columnar (ORC) file format
provides a highly efficient way to store Hive data.
• Using ORC format improves Hive performance
when reading, writing and processing data
CREATE TABLE tablename (

…
) STORED AS ORC;
ALTER TABLE tablename SET FILEFORMAT ORC;
SET hive.default.fileformat=Orc;

HCatalog in the Ecosystem
[Figure: HCatalog sits between processing frameworks such as Java MapReduce and storage systems such as HDFS and HBase]

Hive Labs

Defining Indexes
CREATE INDEX city_index ON TABLE Customers (city) AS
'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX city_index ON Customers REBUILD;
SHOW INDEX ON Customers;
DROP INDEX city_index ON Customers;

Overview of Indexes
1. An index is defined on the city column of a table named Customers.
2. An index table (city_index) is created in the Hive metastore for the city column.
3. The index table now knows which blocks in HDFS contain each city (for example, City = Phoenix).
[Figure: table Customers (Col1: name, Col2: age, Col3: city), the city_index entry in the Hive Metastore, and the matching blocks in HDFS]

Stinger Initiative

Vectorization
Vectorization is a new feature that allows Hive to
process a batch of 1024 rows together instead of
one row at a time
• Table needs to be in ORC format
• Needs to be enabled
hive.vectorized.execution.enabled=true
• Each batch consists of a column vector and operations
are performed on the column vector
• Hive examines the query and data to decide whether it
can be used or not

SQL Standard Based Hive Authorization
Privileges:
– SELECT privilege – gives read access to an object.
– INSERT privilege – gives ability to add data to an object (table).
– UPDATE privilege – gives ability to run update queries on an object
(table).
– DELETE privilege – gives ability to delete data in an object (table).
– ALL PRIVILEGES – gives all privileges (gets translated into all the
above privileges).
Objects
– The privileges apply to table and views. The above privileges are not
supported on databases.

SQL Standard Based Hive Authorization
Users and Roles

– Privileges can be granted to users as well as roles.
– Users can belong to one or more roles.
– There are two roles with special meaning –
public and admin
– When a user runs a Hive query or command, the
privileges granted to the user and her "current roles" are
checked

Understanding Hive on Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
• Tez avoids unneeded writes to HDFS
[Figure: Hive – MapReduce vs. Hive – Tez execution of the query above. Both sides show map (M) and reduce (R) stages for the SELECT steps, JOIN (a, c), JOIN (a, b), and GROUP BY a.state with COUNT(*) and AVG(c.price); the MapReduce side writes to HDFS between stages]
Source : http://www.slideshare.net/ye.mikez/hive-tuning

Transactions

Transaction Use Cases

Transaction Limitations
• BEGIN, COMMIT, and ROLLBACK are not yet
supported
• Only ORC file format is supported
• Tables must be bucketed

HBase

What Is HBase?
• HBase is a NoSQL datastore built on top of HDFS (Hadoop)
• An Apache Top Level Project
• Based on Google’s BigTable paper
• Open-source, distributed, versioned
• Key Features
– Distributed storage
– Strictly consistent random reads and writes
– Schema-less data model
– Automatic and configurable sharding of tables
– Automatic failover support between RegionServers

When to use?
• Big Data with random read and writes
• Storing large amounts of data (TB/PB)
• High throughput for a large number of requests
• Storing unstructured or variable column data

HBase Data Model

Data Model Terminology
• Table
– An HBase table consists of multiple rows.
• Row
– A row in HBase consists of a row key and one or more columns with
values associated with them
– Rows are sorted lexicographically by row key as they are stored
• Column
– A column in HBase consists of a column family and a column
qualifier, which are delimited by a : (colon) character.
• Column Family
– Column families physically colocate a set of columns and their
values

Data Model Terminology
• Column Qualifier
– A column qualifier is added to a column family to provide the index for a
given piece of data
• Cell
– A cell is a combination of row, column family, and column qualifier, and
contains a value and a timestamp, which represents the value's version.
• Timestamp
– A timestamp is written alongside each value, and is the identifier for a given
version of a value.
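One way to picture these terms is as nested sorted maps: row key, then family:qualifier, then timestamp, then value. The Java sketch below is an in-memory model written for illustration only; CellModelSketch and its methods are hypothetical names and are not part of the HBase client API.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: an in-memory model of the logical layout, not the HBase API.
class CellModelSketch {

    // row key -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
        new TreeMap<>();

    void put(String rowKey, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(rowKey,
                k -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(family + ":" + qualifier,
                k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if the cell does not exist.
    String get(String rowKey, String family, String qualifier) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(rowKey);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(family + ":" + qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Row keys come back in sorted order because the outer map is sorted.
    Iterable<String> rowKeys() {
        return rows.keySet();
    }

    public static void main(String[] args) {
        CellModelSketch table = new CellModelSketch();
        table.put("row2", "info", "city", 1L, "Phoenix");
        table.put("row1", "info", "city", 1L, "Austin");
        table.put("row1", "info", "city", 2L, "Boston");
        System.out.println(table.get("row1", "info", "city"));
        for (String key : table.rowKeys()) {
            System.out.println(key);
        }
    }
}
```

A row that never stores a given column simply has no entry for it, which matches the idea that unused columns carry no storage penalty.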

HBase: Keys and Column Families
Each record is divided into Column Families
Each row has a Key
Each column family consists of one or more Columns

Source: http://www.slideshare.net/aillonianilreddy/hadoop-32452974

HBase Logical View

Rows and Columns
No storage penalty for unused columns
Row keys identify a row

Each Column Family can have many columns

HBase Architecture

Terminology
• Node
– Physical server
• Cluster
– Group of nodes
• Master Node
– Co-ordinates the nodes in the cluster
• Slave Node
– Worker nodes that perform tasks
• Daemon
– A process or service

Daemons
• HBase Master
• RegionServer
• ZooKeeper
• HDFS
– NameNode/Standby NameNode
– DataNode

Basic Concepts
• HBase data is stored in Tables
– Similar to RDBMS tables
• Table data is stored on HDFS
– Split into blocks and stored across nodes in the cluster
• Architecturally, tables are big, sorted, distributed
maps
• Tables are sharded/partitioned and replicated

HBase Regions
• HBase tables are split into regions
– A region is a piece of a table
– Created through sharding or partitioning
• RegionServer daemons serve regions
– Run on each slave node in the cluster
– Serve multiple regions belonging to several different tables
– Usually one table is split among multiple RegionServers
– Communicate with the client and handle data-related operations
– Handle read and write requests for all the regions under them
– Decide the size of a region by following the region size thresholds

HBase Master
• Master that co-ordinates the Region servers
– Co-ordinates which regions are served by which
Region server
– Handles new table creation and region
movement
– Interface for all metadata changes
– In a distributed cluster, the Master typically runs
on the NameNode.
– Handles load balancing of the regions across
region servers. It unloads the busy servers and
shifts the regions to less occupied servers.

HBase and Zookeeper
• An HBase cluster may have multiple masters for HA
– Only one master controls the cluster
– ZooKeeper co-ordinates the masters
• ZooKeeper service runs on each master
– Upon start-up all masters connect to ZooKeeper
– They all are eligible to get control
– The first to connect gets the control
– If the controlling master fails, the rest of the masters
compete to get control and one of them gets control

HBase Regions
Source: http://hbase.apache.org/

Store
• A Store hosts a MemStore and 0 or more StoreFiles
(HFiles).
• A Store corresponds to a column family for a table for a
given region
– MemStore
• The MemStore holds in-memory modifications to the Store
– StoreFile (HFile)
• Memstore is flushed to StoreFiles
• StoreFiles are where your data lives.
• HFile Format
– The HFile file format is based on the SSTable from BigTable
– Blocks
• StoreFiles are composed of blocks. The blocksize is configured on a
per-ColumnFamily basis

RegionServer Architecture
Source: http://www.slideshare.net/xefyr/h-base-for-architectspptx

Compaction
Compaction is an operation which reduces the
number of StoreFiles in a Store, by merging them
together
– After the MemStore reaches a given size it flushes its
contents to a StoreFile.
– The number of StoreFiles in a Store increases over time
– Two categories
• Minor compactions usually select a small number of
small, adjacent StoreFiles and rewrite them as a
single StoreFile
• Major compaction results in a single StoreFile per
Store

HBase Data Access

Four primary data model operations
• Gets
– Gets a row’s data based on the row key
• Puts
– Upserts a row with data based on the row key
• Scans
– Finds all matching rows based on the row key
– Scan logic can be refined by using filters
• Delete
– deletes are handled by creating new markers
called tombstones

Write Path
1. Client asks: which RegionServer is serving the Region?
2. Write to RegionServer
[Figure: Client; master nodes running NameNode, Standby NameNode and HBase Master; a ZooKeeper ensemble; worker nodes each running DataNode and RegionServer]
Source: http://www.slideshare.net/HBaseCon/hbase-just-the-basics

Write Path
Source: http://blog.cloudera.com/blog/2012/06/hbase-write-path/

Read Path
1. Client asks: which RegionServer is serving the Region?
2. Read from RegionServer
[Figure: Client; master nodes running NameNode, Standby NameNode and HBase Master; a ZooKeeper ensemble; worker DataNodes]

Puts
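// ROW_KEY_BYTES, COLFAM_BYTES and COLDESC_BYTES are byte[] constants
Put p = new Put(ROW_KEY_BYTES);
// Add one cell: column family, column qualifier, value
p.add(COLFAM_BYTES, COLDESC_BYTES, Bytes.toBytes("value"));
table.put(p);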

Gets
Get g = new Get(ROW_KEY_BYTES);
Result r = table.get(g);
// Read one cell by column family and column qualifier
byte[] byteArray = r.getValue(COLFAM_BYTES, COLDESC_BYTES);
String columnValue = Bytes.toString(byteArray);

Filters
Generally used via the Java API
• PrefixFilter
– prefix of a row key
• ColumnPrefixFilter
– a column prefix
• InclusiveStopFilter
– row key on which to stop scanning
• FamilyFilter
• QualifierFilter
• ColumnRangeFilter

HBase CoProcessor

Coprocessor Framework
A framework
– that provides a library and runtime environment for
executing user code within the HBase region server and
master processes
– flexible and generic extension of HBase functionality
– distributed computation directly within the HBase server
processes
Characteristics :
– Arbitrary code can run at each RegionServer
– High-level call interface for clients
– Calls are addressed to rows or ranges of rows and the
coprocessor client library resolves them to actual locations;
– Calls across multiple rows are automatically split into multiple
parallelized RPC
– Provides a very flexible model for building distributed
services

Coprocessor Types (deployment)
Based on deployment
• System Coprocessors
– loaded globally on all tables and regions hosted by the
region server
• Table coprocessors
– loaded on all regions for a table on a per-table basis

Coprocessor Types (functionality)
Based on functionality
• Observers
– Can be thought of like database triggers
– User code inserted by overriding upcall methods provided
by the coprocessor framework
– Functions are executed from core HBase code when
certain events occur
– framework handles all of the details of invoking callbacks
• Endpoint
– Resembling stored procedures
– Can be invoked at any time from the client
– executed remotely at the target region or regions

Observer Types
Three kinds of Observers
• RegionObserver
– hooks for data manipulation events, Get, Put, Delete, Scan
– an instance of a RegionObserver coprocessor for every
table region
• WALObserver
– hooks for write-ahead log (WAL) related operations
– one such context per region server
• MasterObserver
– hooks for DDL-type operation, i.e., create, delete, modify
table, etc.
– runs within the context of the HBase master.

Region Observer
Provides callbacks for
• preOpen, postOpen:
– Called before and after the region is reported as online to the master.
• preFlush, postFlush:
– Called before and after the memstore is flushed into a new store file.
• preGet, postGet:
– Called before and after a client makes a Get request.
• preExists, postExists:
– Called before and after the client tests for existence using a Get.
• prePut and postPut:
– Called before and after the client stores a value.
• preDelete and postDelete:
– Called before and after the client deletes a value.

Master and WAL Observer
MasterObserver provides upcalls for:
• preCreateTable/postCreateTable:
– Called before and after a table is created.
• preDeleteTable/postDeleteTable:
– Called before and after a table is deleted.
WALObserver provides upcalls for:
• preWALWrite/postWALWrite:
– Called before and after a WALEdit is written to the WAL.

Endpoint
• Endpoint is an interface for dynamic RPC extension
• Installed on the server side and can then be
invoked with HBase RPC
• The client library provides convenience
methods for invoking such dynamic
interfaces.

Endpoint Invocation
Source: https://blogs.apache.org/hbase/entry/coprocessor_introduction

Loading Coprocessor
• Load through configuration entries
– hbase.coprocessor.region.classes: for RegionObservers and Endpoints
– hbase.coprocessor.master.classes: for MasterObservers
– hbase.coprocessor.wal.classes: for WALObservers
– The jar file must reside on the server-side HBase classpath
• Load from shell
– Load on a per-table basis, via the shell command "alter" with "table_att"
– A coprocessor attribute is added to the table. It contains:
• File path: The jar file containing the coprocessor implementation
• Class name: The full class name of the coprocessor.
• Priority: An integer.
• Arguments: This field is passed to the coprocessor implementation.
Source: https://blogs.apache.org/hbase/entry/coprocessor_introduction

HBase Schema Design

Schema Design
• Access patterns must be known up front
• Denormalize to improve performance
– Fewer, bigger tables
• HBase does not do well with more than two or three
column families
• Rows are sorted lexicographically; keep similar rows
together but avoid hotspotting
• Use salting or hashing of row keys to spread writes
• Minimize row and column sizes
• Keep ColumnFamily names as small as possible, preferably one character
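Salting can be sketched as prefixing each row key with a small bucket number derived from the key itself. The Java below is a minimal illustration under assumptions: SaltedKeySketch, the two-digit prefix format and the use of String.hashCode are choices made for this example, not an HBase API.

```java
// Illustrative only: prefixes a row key with a salt derived from the key itself.
class SaltedKeySketch {

    // The salt is deterministic, so a reader can recompute it to issue a Get.
    static String salt(String rowKey, int numSaltBuckets) {
        int bucket = Math.floorMod(rowKey.hashCode(), numSaltBuckets);
        return String.format("%02d-%s", bucket, rowKey);
    }

    public static void main(String[] args) {
        // Monotonically increasing keys would otherwise sort next to each other.
        String[] keys = {"20150101120000", "20150101120001", "20150101120002"};
        for (String key : keys) {
            System.out.println(salt(key, 8));
        }
    }
}
```

The trade-off is on the read side: a range scan over the original key order has to be issued once per salt bucket and the results merged.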

Schema Design
• Prefer shorter attribute names
• Use TTL wherever possible
• Factor the potential for joins into the schema design
• Use monotonically increasing row keys/time-series
data carefully

Elastic Search

Apache Lucene
• Fast, high performance, scalable search/IR library
• Open source
• Originally written in Java; ported
to Delphi, Perl, C#, C++, Python, Ruby, and PHP
• Initially developed by Doug Cutting (also the author of
Hadoop)
• Indexing and Searching
• Used by companies like Twitter, LinkedIn, Wikipedia etc.

Apache Lucene … Features !
• Full text search; fielded searching (e.g. title, author,
contents)
• Powerful query types: phrase queries, wildcard
queries, proximity queries, range queries
• Ranked searching
• Fast, memory-efficient and typo-tolerant suggesters
• Sorting by any field

Type of Queries
• Term Query
– useful for retrieving documents by a key.
• Prefix Query
– matches documents containing terms beginning with a specified string.
• Range Query
– facilitates searches from a starting term through an ending term.
• Boolean Query
– allows for logical AND, OR, and NOT combinations.
• Phrase Query
– matches a sequence of terms in order, using the positional information of terms stored in the index.
• Fuzzy Query
– matches terms similar to a specified term.
• Boost Query
– Boost a particular term

Lucene in a search system
[Figure: indexing side: Raw Content → Acquire content → Build document → Analyze document → Index document → Index; search side: Users → Search UI → Build query → Run query → Render results]
Source: http://web.stanford.edu/class/cs276/handouts/lecture-lucene.pptx

How Lucene models content
The fundamental concepts in Lucene are index,
document, field and term.
– An index contains a sequence of documents.
– A document is a sequence of fields.
– A field is a named sequence of terms.
– A term is a sequence of bytes.

How Lucene models content ….
• A Document is the atomic unit of indexing and
searching
– A Document contains Fields
• Fields have a name and a value
– You have to translate raw content into Fields
• Examples: Title, author, date, abstract, body, URL, keywords
– Different documents can have different fields
– Search a field using name:term, e.g., title:lucene
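The index, document, field and term concepts can be illustrated with a toy inverted index keyed by field:term. This Java sketch is an assumption-laden illustration; InvertedIndexSketch and its methods are hypothetical and do not reflect Lucene's actual classes or file format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a toy field-scoped inverted index, not Lucene's API.
class InvertedIndexSketch {

    // "field:term" -> document numbers in ascending order
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int nextDocNumber = 0;

    // Adds one document (field name -> raw text) and returns its document number.
    int addDocument(Map<String, String> fields) {
        int docNumber = nextDocNumber++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String text = field.getValue().toLowerCase(Locale.ROOT);
            for (String term : text.split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> docs = postings.computeIfAbsent(
                    field.getKey() + ":" + term, k -> new ArrayList<Integer>());
                // Avoid duplicate postings when a term repeats within one field.
                if (docs.isEmpty() || docs.get(docs.size() - 1) != docNumber) {
                    docs.add(docNumber);
                }
            }
        }
        return docNumber;
    }

    // Equivalent of a name:term query such as title:lucene.
    List<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "Lucene in a search system");
        index.addDocument(doc);
        System.out.println(index.search("title", "lucene"));
    }
}
```

Document numbers are handed out from zero in insertion order, mirroring the way the first document added to an index is numbered zero.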

Fields
• Fields may
– Be indexed or not
• Indexed fields may or may not be analyzed (i.e., tokenized with an
Analyzer)
• Non-analyzed fields view the entire value as a single token (useful
for URLs, paths, dates, social security numbers, ...)
– Be stored or not
• Useful for fields that you'd like to display to users
– Optionally store term vectors
• Like a positional index on the Field's terms
• Useful for highlighting, finding similar documents, categorization

Index Format and structure
• Segments
– Lucene indexes may be composed of multiple sub-indexes,
or segments.
– Each segment is a fully independent index
– New segments created for newly added documents.
– Existing segments are merged
• Document Numbers
– Internally, Lucene refers to documents by an integer document
number
– first document added to an index is numbered zero and so on..

Index Format and structure …
• Each segment index maintains
– Segment info
• This contains metadata about a segment, such as the number of
documents and what files it uses.
– Field names
– Stored Field values
– Term dictionary
• A dictionary containing all of the terms used in all of the indexed fields of all of
the documents.
– Term Frequency data
– Term Proximity data
– Normalization factors
– Term Vectors

Index Format and structure …
• Stores terms and documents in arrays
– Binary search
Source: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal

Index Format and structure …Insertion & Merge !
• Insertion = write a new Segment
• Merge Segments when there are too many of them
– Concatenate docs, merge term dictionaries and posting lists (merge sort!)

Index Format and structure … Deletion !
• Deletion = turn a bit off
• Ignore deleted docs when searching and merging
• Merge policies favor segments with many deletions
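Insertion, deletion and merging can be sketched with a list of documents plus a bitmap of live documents. This is a toy Java model for illustration; SegmentSketch is a hypothetical class and leaves out term dictionaries and posting lists entirely.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative only: a toy segment with a live-docs bitmap, not Lucene's classes.
class SegmentSketch {
    final List<String> docs = new ArrayList<>();
    final BitSet live = new BitSet();

    // Adds a document and returns its document number within this segment.
    int add(String doc) {
        docs.add(doc);
        int docNumber = docs.size() - 1;
        live.set(docNumber); // new documents start out live
        return docNumber;
    }

    // Deletion only turns a bit off; the stored document stays in place.
    void delete(int docNumber) {
        live.clear(docNumber);
    }

    // Merging copies live documents only, so deleted ones are finally dropped.
    static SegmentSketch merge(List<SegmentSketch> segments) {
        SegmentSketch merged = new SegmentSketch();
        for (SegmentSketch s : segments) {
            for (int i = s.live.nextSetBit(0); i >= 0; i = s.live.nextSetBit(i + 1)) {
                merged.add(s.docs.get(i));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        SegmentSketch first = new SegmentSketch();
        first.add("doc a");
        first.add("doc b");
        first.delete(0);
        SegmentSketch second = new SegmentSketch();
        second.add("doc c");
        List<SegmentSketch> all = new ArrayList<>();
        all.add(first);
        all.add(second);
        System.out.println(merge(all).docs);
    }
}
```

Because space is reclaimed only at merge time, segments with many cleared bits are attractive merge candidates.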

Indexing , Analyzing and Searching

Indexing Pipeline
• Analyzer : creates tokens using a Tokenizer and/or applying Filters (Token Filters)
• Each field can define an Analyzer at index time/query time or both
Source: http://www.slideshare.net/otisg/lucene-introduction

Indexing and Querying Summary
Source: http://www.slideshare.net/saumitra121/apache-solr-workshop

Core Analysis : Four main parts
• Analyzer
– Responsible for supplying a TokenStream which can be consumed by the
indexing and searching processes
• CharFilter
– Used to transform the text before it is tokenized
• Tokenizer
– Responsible for breaking up incoming text into tokens.
• TokenFilter
– Responsible for modifying tokens that have been created by the
Tokenizer

Core Analysis : Post-Tokenization
Many post-tokenization steps :
• Stemming
– Replacing words with their stems
• Stop Words Filtering
– Removing common words like "the", "and" and "a"
• Text Normalization
– Stripping accents and other character markings
• Synonym Expansion
– Adding in synonyms at the same token position
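A tokenizer followed by a chain of token filters can be sketched in plain Java. The example is hand-rolled for illustration and makes loud simplifications: AnalysisSketch is a hypothetical class, the stop word list is just the three words named above, and the "stemmer" only strips a trailing "s".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a hand-rolled analysis chain, not Lucene's Analyzer API.
class AnalysisSketch {

    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "and", "a"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenizer: split on anything that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Token filter 1: normalize case.
            String token = raw.toLowerCase(Locale.ROOT);
            // Token filter 2: drop stop words.
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            // Token filter 3: naive stemming; real stemmers do far more.
            if (token.length() > 3 && token.endsWith("s")) {
                token = token.substring(0, token.length() - 1);
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The cats and a dog"));
    }
}
```

The same chain has to be applied at index time and at query time, otherwise a query term would not match the stored form of the token.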

Analyzer Overview

Searching Data : Basic Concepts

What is Elastic Search?
• Elastic Search is an open source (Apache 2), distributed
search engine built on top of Apache Lucene
• Elastic Search functionality can be accessed through an API and a
RESTful service interface
• Elastic Search is built to be distributed from the ground up, so it
easily scales from one machine to hundreds
• It provides features like fault tolerance and high availability

Elastic Search - Terms
• Index
– An index is a collection of documents that have somewhat similar
characteristics
– Identified by a name (that must be all lowercase)
– This name is used to refer to the index when performing indexing,
search, update, and delete operations against the documents in it
• Document
– A document is a basic unit of information that can be indexed
– Expressed in JSON
• Document Type
– Each index can store different types of documents
– Organizing documents into types helps in data manipulation

• Cluster
– Collection of one or more nodes (servers) that together holds your
entire data
– Provides federated indexing and search capabilities across all
nodes
– Identified by a unique name which by default is "elasticsearch"
– This name is important because a node can only be part of a cluster
if the node is set up to join the cluster by its name
• Node
– A node is a single server that is part of your cluster, stores your
data, and participates in the cluster’s indexing and search
capabilities
– Node is identified by a name
– Name is important for administration purposes

• Shard
– Subdivide an index into multiple pieces called shards
– While creating an index, define the number of shards
– Each shard is in itself a fully-functional and independent "index"
• Replica
– Copy of the primary shard
– Number of shards and replicas can be defined per index at the time
the index is created
– Each shard is in itself a fully-functional and independent "index"

Elastic Search – The Cluster
• Elasticsearch provides a very comprehensive and powerful
REST API
– Check cluster, node, and index health, status, and statistics
– Administer cluster, node, and index data and metadata
– Perform CRUD (Create, Read, Update, and Delete) and search
operations against indexes
– Execute advanced search operations such as paging, sorting,
filtering, scripting, faceting, aggregations, and many others

Elastic Search – Index API
Adds document into the "twitter" index, under a type
called "tweet" with an id of 1:
Source: https://www.elastic.co/products/elasticsearch

Elastic Search – Get API
Get a document from the "twitter" index, under a type
called "tweet" with an id of 1:

Elastic Search – Delete API
Delete document from the "twitter" index, under a
type called "tweet" with an id of 1:
Delete by query
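
The three document operations above map onto HTTP verbs against a `/{index}/{type}/{id}` URL, which is the layout used by the typed Elasticsearch versions these slides describe. The sketch below only builds the requests and never sends them; the host `localhost:9200` and the document fields are assumptions.

```python
import json
from urllib.request import Request

BASE = "http://localhost:9200"      # assumed local node

def doc_url(index, doc_type, doc_id):
    return f"{BASE}/{index}/{doc_type}/{doc_id}"

def index_request(index, doc_type, doc_id, document):
    """Index API: PUT the JSON document under index/type/id."""
    body = json.dumps(document).encode("utf-8")
    return Request(doc_url(index, doc_type, doc_id), data=body, method="PUT",
                   headers={"Content-Type": "application/json"})

def get_request(index, doc_type, doc_id):
    """Get API: GET the document by id."""
    return Request(doc_url(index, doc_type, doc_id), method="GET")

def delete_request(index, doc_type, doc_id):
    """Delete API: DELETE the document by id."""
    return Request(doc_url(index, doc_type, doc_id), method="DELETE")

# Hypothetical document body, for illustration only.
put = index_request("twitter", "tweet", 1, {"user": "someone", "message": "hello"})
get = get_request("twitter", "tweet", 1)
delete = delete_request("twitter", "tweet", 1)

for request in (put, get, delete):
    print(request.get_method(), request.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would send it to a running node.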

Elastic Search – Query DSL
• JSON-style domain-specific language used to execute
queries
• Example request bodies for localhost:9200/bank/_search
• Match all and return the first result
– { "query": { "match_all": {} }, "size": 1 }
• Match all and return results 11-20
– { "query": { "match_all": {} }, "from": 10, "size": 10 }
• Match all and return sorted results
– { "query": { "match_all": {} }, "sort": { "balance": { "order": "desc" } } }
• Match all and return only two fields
– { "query": { "match_all": {} }, "_source": ["account_number", "balance"] }
• Match documents with a specific value for a field
– { "query": { "match": { "account_number": 20 } } }
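
To make the `from`, `size`, `sort`, and `_source` options above concrete, here is a small in-memory imitation of their effect. It is a sketch over a hypothetical list of account documents, not a client for Elasticsearch; the `search` helper and its parameters are assumptions.

```python
# Hypothetical documents standing in for the "bank" index.
accounts = [
    {"account_number": n, "balance": b, "state": s}
    for n, b, s in [(1, 300, "CA"), (20, 900, "TX"), (37, 500, "CA"), (44, 700, "NY")]
]

def search(docs, query=None, sort=None, source=None, from_=0, size=10):
    """Filter by exact field values, then sort, page, and project fields."""
    hits = [d for d in docs
            if query is None or all(d.get(k) == v for k, v in query.items())]
    if sort:
        field, order = sort
        hits = sorted(hits, key=lambda d: d[field], reverse=(order == "desc"))
    page = hits[from_:from_ + size]          # "from" and "size"
    if source:
        page = [{k: d[k] for k in source} for d in page]   # "_source"
    return page

first = search(accounts, size=1)
richest = search(accounts, sort=("balance", "desc"),
                 source=["account_number", "balance"], size=2)
one = search(accounts, query={"account_number": 20})
print(first, richest, one, sep="\n")
```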

• Match where address contains "mill" OR "lane"
– { "query": { "match": { "address": "mill lane" } } }
• Match where address contains the phrase "mill lane"
– { "query": { "match_phrase": { "address": "mill lane" } } }
• Boolean: must (address contains "mill" AND "lane")
– { "query": { "bool": { "must": [ { "match": { "address": "mill" } }, {
"match": { "address": "lane" } } ] } } }
• Boolean: should (address contains "mill" OR "lane")
– { "query": { "bool": { "should": [ { "match": { "address": "mill" } }, {
"match": { "address": "lane" } } ] } } }
• Filtering
– { "query": { "filtered": { "query": { "match_all": {} }, "filter": {
"range": { "balance": { "gte": 20000, "lte": 30000 } } } } } }

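The difference between `match`, `match_phrase`, and the `bool` clauses is easier to see on a handful of documents. The sketch below imitates the matching rules in plain Python over hypothetical addresses; it ignores scoring and analysis beyond lowercasing and whitespace splitting.

```python
# Hypothetical address field values keyed by document id.
addresses = {
    1: "880 Holmes Lane",
    2: "198 Mill Lane",
    3: "990 Mill Road",
    4: "12 Lane Mill Court",
}

def terms(text):
    return text.lower().split()

def match(field_text, query):
    """match: any query term appears in the field (OR)."""
    return any(t in terms(field_text) for t in terms(query))

def match_phrase(field_text, query):
    """match_phrase: the query terms appear consecutively, in order."""
    doc, q = terms(field_text), terms(query)
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))

def bool_must(field_text, queries):
    """bool/must: every sub-query has to match (AND)."""
    return all(match(field_text, q) for q in queries)

def ids(predicate):
    return sorted(i for i, a in addresses.items() if predicate(a))

any_term = ids(lambda a: match(a, "mill lane"))
phrase = ids(lambda a: match_phrase(a, "mill lane"))
both = ids(lambda a: bool_must(a, ["mill", "lane"]))
print(any_term, phrase, both)
```

Document 4 contains both words but not as a phrase, so it satisfies `must` while failing `match_phrase`.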
• Aggregations: group all the accounts by state, then return the top 10
(default) states sorted by count descending (also default)
– { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state" } } } }
• Aggregations: calculate the average account balance by state (again only for the
top 10 states sorted by count in descending order)
– { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state" },
"aggs": { "average_balance": { "avg": { "field": "balance" } } } } } }

Elastic Search – Distributed Store
• Contains one index that has two primary shards.
Each primary shard has two replicas.
• Copies of the same shard are never allocated to
the same node

Elastic Search – Creating , indexing , deleting
• Create, index, and delete requests are write
operations, which must be successfully completed
on the primary shard before they can be copied to
any associated replica shards

Elastic Search – Creating , indexing , deleting
• The client sends a create, index, or delete request to Node
1.
• The node uses the document’s _id to determine that the
document belongs to shard 0. It forwards the request to
Node 3, where the primary copy of shard 0 is currently
allocated.
• Node 3 executes the request on the primary shard.
• If it is successful, it forwards the request in parallel to the
replica shards on Node 1 and Node 2.
• Once all of the replica shards report success, Node 3
reports success to the requesting node, which reports
success to the client.
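
The write path above can be sketched as follows. The Elasticsearch documentation describes routing as `shard = hash(routing) % number_of_primary_shards`, with the document `_id` as the default routing value. The CRC32 hash used here is only a stand-in for illustration, and the node layout is a hypothetical one matching the two-shard, two-replica example.

```python
import zlib

def shard_for(doc_id, number_of_primary_shards):
    """Pick the shard for a document id (CRC32 is a stand-in hash)."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % number_of_primary_shards

# Hypothetical allocation: shard number -> node holding the primary / replicas.
PRIMARY_NODE = {0: "Node 3", 1: "Node 1"}
REPLICA_NODES = {0: ["Node 1", "Node 2"], 1: ["Node 2", "Node 3"]}

def write(doc_id, number_of_primary_shards=2):
    """Return the shard chosen and the ordered steps of a write request."""
    shard = shard_for(doc_id, number_of_primary_shards)
    steps = [f"execute on primary shard {shard} at {PRIMARY_NODE[shard]}"]
    steps += [f"replicate shard {shard} to {node}" for node in REPLICA_NODES[shard]]
    steps.append("report success to the client")
    return shard, steps

shard, steps = write("1")
for step in steps:
    print(step)
```

Because the shard is derived from the number of primary shards, that number cannot change after index creation without moving documents.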

Elastic Search – Replication
• The default value for replication is sync. This
causes the primary shard to wait for successful
responses from the replica shards before
returning.
• If you set replication to async, it will return success
to the client as soon as the request has been
executed on the primary shard.

Elastic Search – Consistency
• By default, the primary shard requires a quorum,
or majority, of shard copies (where a shard copy
can be a primary or a replica shard) to be available
before even attempting a write operation.
• This is to prevent writing data to the “wrong side”
of a network partition. A quorum is defined as
follows:
int( (primary + number_of_replicas) / 2 ) + 1
• The allowed values for consistency are one
o (just the primary shard),
o all (the primary and all replicas),
o or the default quorum, or majority, of shard copies.
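
As a worked example of the quorum formula above, the following sketch computes how many shard copies must be available for each consistency setting. The function names are illustrative, not part of any Elasticsearch client.

```python
def quorum(number_of_replicas, primary=1):
    """int( (primary + number_of_replicas) / 2 ) + 1"""
    return int((primary + number_of_replicas) / 2) + 1

def required_copies(consistency, number_of_replicas):
    """Shard copies that must be available before a write is attempted."""
    if consistency == "one":
        return 1                              # just the primary shard
    if consistency == "all":
        return 1 + number_of_replicas         # the primary and all replicas
    if consistency == "quorum":
        return quorum(number_of_replicas)
    raise ValueError(f"unknown consistency value: {consistency}")

for replicas in (2, 3, 4):
    print(replicas, "replicas -> quorum of", quorum(replicas))
```

With two replicas there are three copies in total, so a write proceeds only when at least two of them are available.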

Elastic Search – Retrieving
• A document can be retrieved from a primary shard
or from any of its replicas

Elastic Search – Retrieving
• The client sends a get request to Node 1.
• The node uses the document’s _id to determine
that the document belongs to shard 0. Copies of
shard 0 exist on all three nodes. On this occasion,
it forwards the request to Node 2.
• Node 2 returns the document to Node 1, which
returns the document to the client.
• For read requests, the requesting node will choose
a different shard copy on every request in order to
balance the load; it round-robins through all shard
copies.
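
The round-robin choice of shard copy for reads can be illustrated with `itertools.cycle`. The class name and node names are assumptions for the sketch.

```python
from itertools import cycle

class ShardCopies:
    """Hands out the copies of one shard in round-robin order."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def next_copy(self):
        return next(self._nodes)

# Copies of shard 0 exist on all three nodes in the example above.
shard0 = ShardCopies(["Node 1", "Node 2", "Node 3"])
chosen = [shard0.next_copy() for _ in range(4)]
print(chosen)
```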
Spark

Apache Spark
• Originally developed in 2009 in UC Berkeley's AMP Lab
• Fully open sourced in 2010
• Now a Top Level Project at the Apache Software Foundation
spark.apache.org
github.com/apache/spark
user@spark.apache.org

Spark is the Most Active Open Source Project in Big Data
[Bar chart: project contributors in past year; other projects shown include Giraph, Storm, and Tez]

Unified Platform
• Spark SQL (SQL)
• Spark Streaming (Streaming)
• MLlib (Machine learning)
• GraphX (Graph computation)
• All built on Spark (General execution engine)

Easy and Fast Big Data
• Easy to Develop
– Rich APIs in Java, Scala, Python
– Interactive shell
– 2-5× less code
• Fast to Run
– General execution graphs
– In-memory storage
– Up to 10× faster on disk, 100× in memory

Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop
InputFormat
• HBase

Deploying Spark – Cluster Manager Types
• Mesos
• Standalone mode
• YARN

Quick Terminology
• Tasks: Fundamental unit of work
• Stage: Set of tasks that run in parallel
• DAG: Logical graph of RDD operations
• RDD: Parallel dataset with partitions

Key Concept: RDDs
Write programs in terms of operations on distributed datasets
• Resilient Distributed Datasets
– Collections of objects spread across a cluster, stored in RAM or on disk
– Built through parallel transformations
– Automatically rebuilt on failure
• Operations
– Transformations (e.g. map, filter, groupBy)
– Actions (e.g. count, collect, save)

RDD Components
• Lineage
– Set of partitions ("splits" in Hadoop)
– List of dependencies on parent RDDs
– Function to compute a partition given its parents
• Optimized execution
– (Optional) partitioner (hash, range)
– (Optional) preferred location for each partition

More RDD Operators
• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
• sample, take, first, partitionBy, mapWith, pipe, save, ...

Spark Execution

Spark Execution – Typical flow
[Diagram: one Driver and three Executors; each Executor sits next to one data block (Block 1, Block 2, Block 3)]

Spark Execution – Typical flow (contd..)
1. The Driver submits stages of tasks to the low-level scheduler (e.g. YARN, Mesos, Standalone)
2. Executors read HDFS local data
3. Executors process and cache data
4. Executors report back the results to the Driver
5. Executors process the next stage from cache
6. Executors report back the results to the Driver

Word Count Example
Program in Scala
val input = sc.textFile("hdfs://name.txt")
val count = input.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
count.saveAsTextFile("hdfs://wordcount.txt")

Word Count Example
Tasks and DAG of operations for the input line "big data big camp"
Stage 1
• sc.textFile("hdfs://name.txt") → HadoopRDD: "big data big camp"
• flatMap(line => line.split(" ")) → MapRDD: big, data, big, camp
• map(word => (word, 1)) → MapRDD: (big, 1) (data, 1) (big, 1) (camp, 1)
Stage 2
• reduceByKey(_ + _) (Transformation): (big, [1, 1]) (data, [1]) (camp, [1]) → (big, 2) (data, 1) (camp, 1)
• count.saveAsTextFile("hdfs://wordcount.txt") (Action): res = [(big,2),(data,1),(camp,1)]
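
The intermediate results of the two stages can be reproduced without a cluster. The following plain-Python sketch mirrors flatMap, map, and reduceByKey on the single input line used above; it imitates the data flow only and does not use Spark.

```python
from collections import defaultdict
from itertools import chain

lines = ["big data big camp"]

# Stage 1: flatMap splits lines into words, map pairs each word with 1.
words = list(chain.from_iterable(line.split(" ") for line in lines))
pairs = [(word, 1) for word in words]

# Stage 2: group the values by key, then reduce each group with +.
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)
counts = {word: sum(ones) for word, ones in grouped.items()}

print(pairs)
print(dict(grouped))
print(counts)
```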

Job Execution
1. Build an operator DAG
2. Split graph into stages of tasks
3. Schedule and execute tasks
[Diagram: an Executor holding a Cache and a Task, reading a Block]

Spark Execution flow – example 1
sc.textFile("/some-hdfs-data")               // RDD[String]
  .map(line => line.split("\t"))             // RDD[Array[String]]
  .map(parts => (parts(0), parts(1).toInt))  // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)                     // RDD[(String, Int)]
  .collect()                                 // Array[(String, Int)]
Operator chain: textFile → map → map → reduceByKey → collect

Source: https://spark-summit.org/2013/talk/wendell-understanding-the-performance-of-spark-applications/

Directed Acyclic Graph (DAG)
• Directed: edges point in a single direction
• Acyclic: no looping
• Supports fault tolerance
• Pipeline as much as possible
• Split into stages of tasks
[Diagram: operators such as Join and GroupBy grouped into Stage 1 and Stage 2]

Split graph into stages of tasks (contd…)
• Stage 1
– read HDFS split
– apply both maps
– partial reduce
– write shuffle data
• Stage 2
– read shuffle data
– final reduce
– send result to driver

Stage execution
– Create a task for each partition in the new RDD
– Serialize task
– Schedule and ship task to slaves
[Diagram: Stage 1 split into Task 1, Task 2, Task 3, and Task 4]

Task execution
– Task is the fundamental unit of execution in Spark
- A. Fetch input from InputFormat or a shuffle
- B. Execute the task
- C. Materialize task output as shuffle or driver result
– Pipelined execution: fetch input, execute task, and write output overlap

Spark execution flow – example 2
Source: https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
[Slides reproduced from the source above: Build an operator DAG; Split graph into stages of tasks; Schedule and execute tasks]

Shuffle
• Redistribute data among partitions
• Hash keys into buckets
• Optimizations
– Avoided when possible, if data is already partitioned
– Partial aggregation reduces data movement
[Diagram: data shuffled from Stage 1 to Stage 2]

Shuffle (cont.)
• Pull-based not push based
• Write intermediate files to disk
[Diagram: Stage 1 writes intermediate files; Stage 2 pulls them]
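
The two shuffle ideas above, hashing keys into buckets and partial aggregation before data moves, can be sketched in plain Python. CRC32 is used as a stable stand-in hash, and the partition contents are hypothetical; this imitates the data flow rather than Spark's shuffle implementation.

```python
import zlib
from collections import Counter

def bucket(key, num_reducers):
    """Hash a key into one of the reducer buckets (stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_side_combine(partition):
    """Partial aggregation: collapse repeated keys before the shuffle."""
    return Counter(partition)

def shuffle(map_outputs, num_reducers):
    """Each reducer pulls the records whose key hashes to its bucket."""
    reducers = [Counter() for _ in range(num_reducers)]
    for combined in map_outputs:
        for key, count in combined.items():
            reducers[bucket(key, num_reducers)][key] += count
    return reducers

map_partitions = [["to", "be", "to"], ["be", "to", "be"]]
combined = [map_side_combine(p) for p in map_partitions]
reducers = shuffle(combined, num_reducers=2)
totals = sum(reducers, Counter())

records_before = sum(len(p) for p in map_partitions)
records_after = sum(len(c) for c in combined)
print(records_before, "records reduced to", records_after, "before the shuffle")
print(dict(totals))
```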

Spark Programming Model

RDD
Transformed RDD

SparkContext
• Main entry point to Spark functionality
• Available in shell as variable sc
• In standalone programs, you’d make your own (see
later for details)

Creating RDD
• A Spark program is a series of transformations on RDDs
• RDD can be created by
– Using parallelize() on in memory dataset or collection
– Reading data from external dataset like HDFS, S3, local
file system
– Transformations on existing RDD
• Parallelize method
– Invoke parallelize method on a collection
– Elements of collection are copied to create distributed
dataset which can be operated in parallel

Creating RDD (cont)
• Parallelize method (cont.)
– Example
• Scala
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
• Java
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
– How many partitions are created?
• Specify the number of partitions as a second argument to parallelize()
• Otherwise, Spark decides the number of partitions automatically based on the
cluster configuration
• Spark runs a separate task for each partition

Creating RDD (cont)
• External Dataset
– Spark can create distributed datasets from any storage
source supported by Hadoop, including your local file
system, HDFS, Cassandra, HBase, Amazon S3, etc.
– Spark supports Text files, Sequence Files, and any other
Hadoop Input Format
– APIs to read data from external sources
• sc.wholeTextFiles()
– Reads a directory containing multiple small text files, and returns each of
them as a (filename, content) pair
• sc.sequenceFile[K, V]()
– Reads sequence files, where K and V are the types of the keys and values in the
file
• sc.hadoopRDD()
– Reads files through a Hadoop InputFormat

Creating RDD (cont)
• Reading External Dataset
– Example
• Scala
val distFile = sc.textFile("file.txt")
val distFile = sc.textFile("directory/*.txt")
val distFile = sc.textFile("hdfs://namenode:9000/path/file")
• Java
JavaRDD<String> distFile = sc.textFile("file.txt");
JavaRDD<String> distFile = sc.textFile("directory/*.txt");
JavaRDD<String> distFile = sc.textFile("hdfs://namenode:9000/path/file");

RDD Operations
• RDDs support two types of operations:
– transformations, which create a new dataset from an
existing one, and
– actions, which return a value to the driver program after
running a computation on the dataset.
• For example,
– map() is a transformation that passes each dataset
element through a function and returns a new RDD
representing the results.
– reduce() is an action that aggregates all the elements of
the RDD using some function and returns the final result
to the driver program

RDD Operations
• All transformations in Spark are lazy,
– Only computed when an action requires a result to be
returned to the driver program.
• RDD can be persisted in memory, disk or both
– RDD are kept in memory for faster access in subsequent
transformations
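
Lazy evaluation and persistence can be illustrated with a toy class that records work and runs it only when an action is called. `LazyDataset` is a hypothetical stand-in written for this sketch; it is not how Spark implements RDDs.

```python
import functools

class LazyDataset:
    """Toy stand-in for an RDD: work is recorded, then run by an action."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function yielding elements
        self._persisted = False
        self._cache = None
        self.evaluations = 0           # how many times the data was computed

    def _elements(self):
        if self._cache is not None:
            return self._cache
        self.evaluations += 1
        data = list(self._compute())
        if self._persisted:
            self._cache = data
        return data

    def map(self, f):                  # transformation: nothing runs yet
        return LazyDataset(lambda: (f(x) for x in self._elements()))

    def persist(self):
        self._persisted = True
        return self

    def reduce(self, f):               # action: triggers the computation
        return functools.reduce(f, self._elements())

    def count(self):                   # action
        return len(self._elements())

lines = LazyDataset(lambda: ["to be", "or not to be"])
line_lengths = lines.map(len).persist()
evaluations_before_action = line_lengths.evaluations

total_length = line_lengths.reduce(lambda a, b: a + b)
number_of_lines = line_lengths.count()   # served from the persisted copy
print(total_length, number_of_lines, line_lengths.evaluations)
```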

RDD Operations - Example
val lines = sc.textFile("data.txt")                    // Define base RDD
val lineLengths = lines.map(s => s.length)             // Transformation: map
lineLengths.persist()                                  // Persist: store lineLengths once it is computed
val totalLength = lineLengths.reduce((a, b) => a + b)  // Action: reduce

Basic Transformations
// Create base RDD
> val data = Array(1, 2, 3)
> val nums = sc.parallelize(data)
// Pass each element through a function
> val squares = nums.map(x => x * x) // {1, 4, 9}
// Keep elements passing a predicate
> val even = squares.filter(x => x % 2 == 0) // {4}

Basic Actions
> val nums = sc.parallelize(Array(1, 2, 3))
// Retrieve RDD contents as a local collection
> nums.collect() // => Array(1, 2, 3)
// Return first K elements
> nums.take(2) // => Array(1, 2)
// Count number of elements
> nums.count() // => 3
// Merge elements with an associative function
> nums.reduce((x, y) => x + y) // => 6
// Write elements to a text file
> nums.saveAsTextFile("hdfs://192.168.91.128/exampleaction.txt")

Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b
Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b
Java: Tuple2 pair = new Tuple2(a, b);
pair._1() // => a
pair._2() // => b

Some Key-Value Operations
> val petsdata = Array(("cat", 1), ("dog", 1), ("cat", 2))
> val pets = sc.parallelize(petsdata)
> val reduceCounts = pets.reduceByKey((x, y) => x + y)
> reduceCounts.collect()
// Array((dog,1), (cat,3))
> val groupCounts = pets.groupByKey()
> groupCounts.collect()
// e.g. Array((dog,ArrayBuffer(1)), (cat,ArrayBuffer(1, 2)))
> val sortCounts = pets.sortByKey()
> sortCounts.collect()
// Array((cat,1), (cat,2), (dog,1))
reduceByKey also automatically implements combiners on the map side

Example: Word Count
> val lines = sc.textFile("hdfs://192.168.91.128/4300.txt")
> val counts = lines.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)
> counts.collect().foreach(println)
> counts.saveAsTextFile("hdfs://192.168.91.128:8020/tmp/wordcount1")
Data flow for the two input lines "to be or" and "not to be":
• "to be or" → "to", "be", "or" → (to, 1) (be, 1) (or, 1)
• "not to be" → "not", "to", "be" → (not, 1) (to, 1) (be, 1)
• After reduceByKey: (be, 2) (not, 1) (or, 1) (to, 2)

Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
Lineage: HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD
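
Lineage-based recovery can be sketched as replaying the recorded steps for just the partition that was lost. The partition contents below are hypothetical log lines, and the list of `(name, function)` steps is an assumption standing in for the RDD dependency graph.

```python
# Two partitions of a hypothetical tab-separated log file.
source = [
    ["ERROR\tdisk\tfull", "INFO\tok\tfine"],
    ["ERROR\tnet\tdown"],
]

# Lineage: the ordered steps that derive msgs from the source.
lineage = [
    ("filter", lambda part: [s for s in part if s.startswith("ERROR")]),
    ("map", lambda part: [s.split("\t")[2] for s in part]),
]

def compute_partition(index):
    """Rebuild one partition by replaying the lineage on its source data."""
    part = source[index]
    for _name, step in lineage:
        part = step(part)
    return part

msgs = [compute_partition(i) for i in range(len(source))]

msgs[1] = None                      # simulate losing one partition
msgs[1] = compute_partition(1)      # recompute only the lost partition
print(msgs)
```

Only the lost partition is recomputed; the surviving partition is left untouched.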

How to Run Spark

Language Support
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
• Standalone Programs: Python, Scala, & Java
• Interactive Shells: Python & Scala
• Performance: Java & Scala are faster due to static typing, but Python is often fine

Interactive Shell
• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster…
• OR can run locally

… or a Standalone Application
import sys
from pyspark import SparkContext
if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])

Create a SparkContext
Scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))
Java
import org.apache.spark.api.java.JavaSparkContext;
JavaSparkContext sc = new JavaSparkContext(
    "masterUrl", "name", "sparkHome", new String[] {"app.jar"});
Python
from pyspark import SparkContext
sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])
Constructor arguments, in order:
• Cluster URL, or local / local[N]
• App name
• Spark install path on cluster
• List of JARs with app code (to ship)

Spark Architecture

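Spark Architecture
[Diagram: a client program (Scala/Java/Python) runs the Driver, which holds the RDD graph and the Scheduler. Through a Cluster Manager or local threads, the Driver launches Tasks on workers. Each worker has a Cache and reads Blocks from HDFS/HBase/Storage.]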

Spark Architecture (Client mode)
Source: Hadoop Definitive Guide

Spark Architecture (Cluster mode)
Source: Hadoop Definitive Guide

Spark Streaming

What is Spark Streaming?
• Extends Spark for doing large scale stream processing
• Scales to 100s of nodes and achieves second-scale latencies
• Efficient and fault-tolerant stateful stream processing
• Integrates with Spark's batch and interactive processing
• Provides a simple batch-like API for implementing complex algorithms
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617

Spark Streaming
● Scalable, high-throughput, fault-tolerant stream processing

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
• Batch sizes as low as ½ second, latency ~ 1 second
• Potential for combining batch processing and streaming processing in the same system

Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data
[Diagram: the Twitter Streaming API feeds the tweets DStream as batch @ t, batch @ t+1, batch @ t+2; each batch is stored in memory as an RDD (immutable, distributed dataset)]

val hashTags = tweets.flatMap(status => getTags(status))
transformation: modify data in one DStream to create another DStream
[Diagram: flatMap applied to each batch of the tweets DStream yields the hashTags DStream, e.g. [#cat, #dog, …]; new RDDs are created for every batch]

hashTags.saveAsHadoopFiles("hdfs://...")
output operation: to push data to external storage
[Diagram: each batch of the hashTags DStream is saved; every batch is saved to HDFS]

Example 2 – Count the hashtags over last 1 min
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
• window is a sliding window operation
• Minutes(1) is the window length
• Seconds(1) is the sliding interval
[Diagram: a window of the given length slides over the hashTags DStream (t-1, t, t+1, t+2, t+3) by the sliding interval; countByValue produces tagCounts, a count over all the data in the window]
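
A sliding window followed by countByValue can be imitated on a list of batches. In this sketch the window length and sliding interval are measured in batches rather than in time, which is a simplifying assumption, and the hashtag batches are hypothetical.

```python
from collections import Counter, deque

def windowed_counts(batches, window_length, slide_interval=1):
    """Count values over the last `window_length` batches, every `slide_interval` batches."""
    window = deque(maxlen=window_length)     # old batches fall out automatically
    results = []
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide_interval == 0:
            counts = Counter()
            for b in window:
                counts.update(b)
            results.append(counts)
    return results

batches = [["#cat", "#dog"], ["#cat"], ["#dog", "#dog"], ["#cat"]]
results = windowed_counts(batches, window_length=3)
for t, counts in enumerate(results):
    print(t, dict(counts))
```

By the fourth batch the first one has left the window, so its hashtags no longer contribute to the counts.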
Key concepts
• DStream – sequence of RDDs representing a stream of data
– Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP
sockets
• Transformations – modify data from one DStream to
another
– Standard RDD operations – map, countByValue,
reduceByKey, join, …
– Stateful operations – window,
countByValueAndWindow, …
• Output Operations – send data to external entity
– saveAsHadoopFiles – saves to HDFS
– foreach – do anything with each batch of results
Arbitrary Stateful Computations
• Maintain arbitrary state, track sessions
– Maintain per-user mood as state, and update it with his/her tweets
moods = tweets.updateStateByKey(tweet => updateMood(tweet))
updateMood(newTweets, lastMood) => newMood
[Diagram: at each interval (t-1, t, t+1, t+2, t+3) the tweets DStream updates the moods DStream]
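
The per-key state update can be sketched in plain Python. The mood scoring rule and the sample tweets are hypothetical, and unlike a real stateful stream this toy only touches keys that received new data in the current batch.

```python
def update_mood(new_tweets, last_mood):
    """updateMood(newTweets, lastMood) => newMood, with a toy scoring rule."""
    score = last_mood or 0
    for tweet in new_tweets:
        if ":)" in tweet:
            score += 1
        elif ":(" in tweet:
            score -= 1
    return score

def update_state_by_key(state, batch, update):
    """Group one batch by key and fold it into the previous state."""
    grouped = {}
    for user, tweet in batch:
        grouped.setdefault(user, []).append(tweet)
    new_state = dict(state)
    for user, tweets in grouped.items():
        new_state[user] = update(tweets, state.get(user))
    return new_state

batches = [
    [("ann", "great day :)"), ("bob", "ugh :(")],
    [("ann", "still great :)")],
]

moods = {}
for batch in batches:
    moods = update_state_by_key(moods, batch, update_mood)
print(moods)
```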
Combine Batch and Stream Processing
• Do arbitrary Spark RDD computation within DStream
– Join incoming tweets with a spam file to filter
out bad tweets
tweets.transform(tweetsRDD => {
tweetsRDD.join(spamHDFSFile).filter(...)
})
Fault-tolerance
• RDDs remember the operations that created them
• Batches of input data are replicated in memory for fault-tolerance
• Data lost due to worker failure can be recomputed from replicated input data
• Therefore, all transformed data is fault-tolerant
[Diagram: tweets RDD (input data replicated in memory) → flatMap → hashTags RDD (lost partitions recomputed on other workers)]

Spark SQL

Spark SQL
• Spark's interface to work with structured or semi-structured data
• Structured data
o known set of fields for each record - schema
• Main capabilities
o load data from variety of structured sources
o query the data with SQL
o integration between Spark (Java, Scala and Python API)
and SQL (joining RDDs and SQL tables, using SQL
functionality)
Source: http://www.slideshare.net/PetrZapletal1/spark-concepts-spark-sql-graphx-streaming

DataFrames (SchemaRDD)
• RDD of row objects, each representing a record
• Known schema (i.e. data fields) of its rows
• Behaves like a regular RDD, stored in a more efficient manner
• Adds new operations, especially running SQL queries
• Can be created from
o external data sources
o results of queries
o regular RDD
• Used in ML Pipeline API

SQLContext
● Entry points:
o HiveContext
– superset functionality, Hive related
o SQLContext

MLlib
• MLlib is Spark's scalable machine learning library
• Common learning algorithms and utilities:
– Classification
– Regression
– Clustering
– Collaborative filtering
– Dimensionality reduction

GraphX
• Spark API for graphs and graph-parallel computation
• Resilient Distributed Property Graph (RDPG, extends RDD)
– directed multigraph ( -> parallel edges)
– properties attached to each vertex and edge
• Common graph operations (subgraph computation, joining vertices, ...)
• Growing collection of graph algorithms

Thank You
For any support you need please feel free to contact:
Ashish Baghel : abaghel@impetus.com

AVP & Head – Banking and Financial Services (BFSI)
913-638-2948 (Cell)
408-213-3310 – Ext 567 (Office)
Sachneet Singh: sachneets.bains@impetus.co.in
