
Big Data and Hadoop

(3 Day Training)
For any support you need please feel free to contact:
Ashish Baghel : abaghel@impetus.com
AVP & Head –Banking and Financial Services (BFSI)
913-638-2948 (Cell)
408-213-3310 –Ext 567 (Office)

Sachneet Singh : sachneets.bains@impetus.co.in

1 Impetus Technologies - Confidential


Class Schedule
• Schedule
– 9:30 a.m. – 5:30 p.m.
• Short tea breaks
• As and when you need one 
• Lunch Break



Introductions
• Your name
• Job responsibilities
• Any Hadoop experience
• Expectations



Course Outline
• Day 1
– Big data concepts, context and challenges
– Hadoop Overview
– The MapReduce Framework and YARN
– The Hadoop Distributed File System (HDFS)
– MapR File system
– PIG Programming : Concepts
• Day 2
– PIG Programming : Loading Data and Querying
– Debugging and Macros
– Hive Introduction and Architecture
– Hive Programming : Load data, Define Schema, Query Data
– HBase : Introduction , Concepts and Architecture



Course Outline
• Day 3
– HBase : Advanced Concepts
– Spark: Introduction, Concepts
– Elastic Search : Introduction and Architecture
– Elastic Search : Concepts of Index and Search



Big Data Facts
1. In what timeframe do we now create the same
amount of information that we created from
the dawn of civilization until 2003?
2 days
2. 90% of the world’s data was created in the last
(how many years)?
2 years
3. What is 1024 petabytes also known as?
An exabyte
4. Companies monitoring Twitter to
measure “sentiment” analyze 12 terabytes of
tweets every day.

Source: https://www.linkedin.com/pulse/20140925030713-64875646-big-data-the-eye-opening-facts-everyone-should-know



What is Big Data?
"Big Data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to
enable enhanced decision making, insight discovery and
process optimization." – Gartner, 2012



3 V's of Big Data
• Volume
– Big data comes in one size: large
– Terabytes / petabytes / exabytes
• Velocity
– Streaming data, time-sensitive data
– Batch, near-real-time and real-time streams
• Variety
– Structured, semi-structured, unstructured
– Text, audio/video, click streams, log files etc.
• Value to business



6 Key Hadoop DATA TYPES
1. Sentiment
How your customers feel

2. Clickstream
Website visitors’ data

3. Sensor/Machine
Data from remote sensors and machines

4. Geographic
Location-based data

5. Server Logs


6. Text
Millions of web pages, emails, and documents

Source: www.Hortonworks.com



Sentiment Use Case
• Analyze customer sentiment on
the days leading up to and
following the release of each
‘Game of Thrones’ season.

• Look for answers to
– What is the response to the release?
– What did the public like the most and
dislike the most?
– Sentiments around each character,
event and season?



Sentiment Use case
- Select a source of data
- Select a tool to bring the data from source to Hadoop
- Use HCatalog to Define a Schema
- Use Hive to Determine Sentiment
- Third party tools for NLP etc. may be required
- Use BI tools to connect to Hadoop and show reports



Sentiment use case

• Twitter is one of the sources
• Collect all the tweets using a tool called ‘Flume’
• Bring the data to Hadoop and process
• Connect to BI tools for reports

Flume is used to stream data into Hadoop.

[Figure: Flume Agent → Hadoop cluster]



Geolocation Use Case
Geo Location use cases involve vehicles, devices or people
moving across a map or similar surface.
Example:
– A company has a fleet of trucks.
– Each truck has a sensor to log location and event data
– The collected data is put onto Hadoop for analysis
• The company’s goal with Hadoop is to:
– Identify unsafe driving events and improve driver safety
– Figure out improvements to save fuel and increase efficiency



The Geolocation Data
• Here is what the collected data from the trucks’ sensors
looks like:
– truckid
– driverid
– event
– latitude
– longitude
– city
– state
– velocity
– event_indicator (0 or 1)
– idling_indicator (0 or 1)

Source: https://github.com/hortonworks/hadoop-tutorials/blob/master/Sandbox/T15_Analyzing_Geolocation_Data.md



The Geolocation Data : Getting data

Raw sensor data

truckid  driverid  event      latitude  longitude  city        state       velocity  event_indicator  idling_indicator
A54      A54       normal     38.44047  -122.714   Santa Rosa  California  17        0                0
A20      A20       normal     36.97717  -121.899   Aptos       California  27        0                0
A40      A40       overspeed  37.9577   -121.291   Stockton    California  77        1                0

[Figure: the raw sensor data is ingested through a Flume Agent]

Source (Data): https://github.com/hortonworks/hadoop-tutorials/blob/master/Sandbox/T15_Analyzing_Geolocation_Data.md



The Truck Data
• The truck data is stored in a database and looks like:
– driverid
– truckid
– model
– monthly_miles
– monthly_gas
– total_miles
– total_gas
– mileage
• The miles and gas figures are for a few years.

Source: https://github.com/hortonworks/hadoop-tutorials/blob/master/Sandbox/T15_Analyzing_Geolocation_Data.md



Truck Data from RDBMS into Hadoop

[Figure: RDBMS data (info of the trucks) is brought into Hadoop by a Sqoop job]



Data Analysis
• Analysis can help us answer
– Is there any fuel wastage due to idling?
– Which unsafe events on the road could result in accidents?
– Which drivers have such events, and how frequently?



Risk Factors Viewed on a Map

Source: https://github.com/hortonworks/hadoop-tutorials/blob/master/Sandbox/T15_Analyzing_Geolocation_Data.md



BigData – Use Cases



Potential Use Cases for Big Data

Source: http://athenaanalytics.tumblr.com/post/19544048414/cyberlabe-potential-use-cases-for-big-data



Big Data - Use Cases-Telecommunication

Telecom Vendors
• Can I offer something better?
• Why am I losing customers?
• 24x7 Service - predict the failures?
• What plans should I offer to my customers?

Subscribers
• Better/best deals?
• Is the network reliable?
• Which plans are good for me?



Big Data - Use Cases- Financial Services

Customer: wanted to buy some products
• Are there any offers?
• Are these offers good for me?
• Which stores are providing these offers?
Ah! There are so many offers… hard to find the relevant ones…

Merchant: I am launching some new offers
• How do I make the best use of it?
• How to attract the relevant/interested customers?



BigData – Challenges



BigData Challenges
• Data Processing
– Processing & analyzing large data – Terabyte++
– Massively scalable and parallel
– Moving computation is easier than moving data
– Support partial failure

• Data Storage
– Doesn’t fit on 1 node, requires a cluster
– Flexible and schema-less structure
– Data replication, partitioning and sharding



Challenges of Distributed Processing
• Production deployments need to be carefully planned
– Unavailability of 1 node should not impact the system
• Need high-speed networks
• Data replication involves data conflicts
• Troubleshooting and diagnosing
• Geographically distributed
• Consistency & reliability



Big Data – Hadoop



History
• Hadoop was created by Doug Cutting and Mike
Cafarella in 2005; Cutting joined Yahoo! in 2006,
where much of its early development took place
• Cutting named it after his son's toy elephant
• Originally developed to support distribution for
the Nutch search engine project



What is Hadoop?
• A Batch processing Framework for distributed processing of
large data sets on a network of commodity hardware.
• Designed to scale out
• Fault-tolerant – at application level
• Open source + Commodity hardware = Reduction in Cost
• Hadoop is very fast for very large jobs
• Hadoop is not fast for small jobs
• Designed for hardware and software failures



What is Hadoop 2.0?
• The Apache Hadoop 2.0 project consists of the following
modules:
– Hadoop Common: the utilities that provide support for the other
Hadoop modules.
– HDFS: the Hadoop Distributed File System
– YARN: a framework for job scheduling and cluster resource
management.
– MapReduce: for processing large data sets in a scalable and
parallel fashion.



What’s New in Hadoop 2.0?
• YARN is a re-architecture of Hadoop that allows multiple
applications to run on the same platform

Source: http://hortonworks.com/blog/apache-hadoop-2-is-ga/



The Hadoop Ecosystem

Source: http://www.apache.org/



Typical Hadoop Based Solution

Source: http://hortonworks.com/blog/webinar-series-building-a-modern-data-architecture-with-hadoop/



Relational Databases vs. Hadoop

                Relational                    Hadoop
Schema          Required on write             Required on read
Speed           Reads are fast                Writes are fast
Governance      Standards and structured      Not strict and standard yet
Data types      Structured                    Multi and unstructured
Best fit use    Interactive OLAP analytics    Data discovery
                Complex ACID transactions     Processing unstructured data
                Operational data store        Massive storage/processing



HDFS



Hadoop Distributed File System
• Large Distributed File System
– 10K nodes, 100 million files, 10 PB

• Assumes Commodity Hardware


– Failure is expected, rather than exceptional

• Streaming Data Access


– Write-Once, Read-Many pattern

• Batch processing
• Node failure - Replication



HDFS Components
• NameNode
– The “master” node of HDFS
– Determines and maintains how the chunks of data are
distributed across the DataNodes
• DataNode
– Stores the chunks of data, and is responsible for
replicating the chunks across other DataNodes



HDFS Namenode Metadata
• Meta-data in Memory
– The entire metadata is in main memory
– No demand paging of meta-data

• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor

• A Transaction Log
– Records file creations, file deletions, etc.



HDFS DataNode
• A Block Server
– Stores data in the local file system.
– Stores meta-data of a block.
– Serves data and meta-data to Clients

• Block Report
– Periodically sends a report of all existing blocks to the NameNode

• Facilitates Pipelining of Data


– Forwards data to other specified DataNodes



The NameNode
1. The NameNode persists metadata
information as two files on disk:
a. Namespace image file (fsimage_N)
b. Transaction log (edits_N)
2. On startup, the NameNode reads the
fsimage_N and edits_N files into memory.
3. The transactions in edits_N are merged
with fsimage_N.
4. A newly created fsimage_N+1 is written
to disk, and a new, empty edits_N+1 is
created.
5. Client applications can now create files in HDFS.
6. The NameNode journals that create
transaction in the edits_N+1 file.

During startup the NameNode is in safemode, a read-only mode.

[Figure: the NameNode with its fsimage (namespace) and edit logs (journaling)]
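
The startup sequence above can be pictured with a small in-memory simulation. This is only a sketch, not HDFS code: the set-of-paths namespace, the `(op, path)` edit format and the helper names `apply_edit` and `start_namenode` are all assumptions made for illustration.

```python
# Illustrative model of the fsimage/edits merge at NameNode startup.
# Not actual HDFS code: data structures and names are hypothetical.

def apply_edit(namespace, edit):
    """Apply one journaled transaction to an in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown op: {op}")


def start_namenode(fsimage_n, edits_n):
    """Merge edits_N into fsimage_N; return (fsimage_N+1, empty edits_N+1)."""
    namespace = set(fsimage_n)      # load the image into memory
    for edit in edits_n:            # replay the transaction log
        apply_edit(namespace, edit)
    return namespace, []            # new image plus a new, empty log


fsimage = {"/data/a.txt", "/data/b.txt"}
edits = [("create", "/data/c.txt"), ("delete", "/data/a.txt")]

new_image, new_edits = start_namenode(fsimage, edits)

# After startup: a client creates a file and the NameNode journals it.
txn = ("create", "/data/d.txt")
new_edits.append(txn)
apply_edit(new_image, txn)

print(sorted(new_image))
print(new_edits)
```

The point of the sketch is that the image is only rewritten at merge time; between merges, every change lands in the edit log first.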



The DataNodes

[Figure: Data Nodes 1 to 4 each send "Here is my heartbeat !!" to the NameNode; some also send a block report along with the heartbeat. The figure also labels Block 25.]



The DataNodes

Source: http://www.slideshare.net/hdhappy001/nicholashdfs-what-is-new-in-hadoop-2



DataNode Failure
[Figure: Data Nodes 1 to 4 send heartbeats and block reports to the NameNode. NameNode: "No heartbeat from DN3, it's dead. Replicate its data." The figure also labels Block 25.]



HDFS Writes

Source: Hadoop definitive guide



Data Organization
• HDFS is designed to support very large files

• HDFS supports write-once-read-many semantics on files. A typical block
size used by HDFS is 64 MB (configurable; Hadoop 2.x defaults to 128 MB).

• Data is written into HDFS from the client application in blocks. Each
block is of the configured block size (the last block of a file may be smaller).

• Replication between the DataNodes is pipelined to eliminate
replication overhead from the client application

[Figure: Client Application → DN1 → DN2 → DN3]
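
As a rough illustration of the block arithmetic and the write pipeline, the sketch below splits a file into blocks and lists the hops one block takes. The 64 MB figure is the block size quoted on this slide; the function names and the 200 MB example file are hypothetical.

```python
# Sketch of HDFS-style block splitting and a replication pipeline.
# Illustration only: names and the example file size are assumptions.

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file occupies."""
    if file_size == 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def pipeline_transfers(pipeline):
    """List the hops a block takes: client -> first DN -> second DN -> ..."""
    hops = ["client"] + list(pipeline)
    return list(zip(hops, hops[1:]))


blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
print(len(blocks), [b // (1024 * 1024) for b in blocks])
print(pipeline_transfers(["DN1", "DN2", "DN3"]))
```

Because each DataNode forwards the block to the next one, the client sends every block only once, regardless of the replication factor.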



HDFS Architecture

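[Figure: HDFS architecture. Clients send metadata ops to the Namenode, which holds the metadata (Name, replicas..; e.g. /home/foo/data,6. ..), and read from and write to Datanodes on Rack1 and Rack2. The figure also shows block ops to the Datanodes and replication of blocks between Datanodes.]
Source: https://hadoop.apache.org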



HDFS Reads

Source: Hadoop definitive guide



Rack Awareness
NameNode metadata:
File.txt =
Blk A: DN 1, 7, 8
Blk B: DN 8, 12, 14

Rack1: DataNodes 1 to 5
Rack2: DataNodes 6 to 10
Rack3: DataNodes 11 to 15

[Figure: the three racks are connected through a core switch]
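
The placement of Blk A (one replica on Rack1, two on Rack2) matches the usual HDFS default policy: first replica on the writer's node, the other two on different nodes of one other rack. Below is a minimal sketch of that policy over the 15-node layout of this slide; the function names are hypothetical and the random choices are an assumption about how ties are broken.

```python
import random

# Rack layout from the slide: 15 DataNodes over three racks.
RACKS = {
    "Rack1": [1, 2, 3, 4, 5],
    "Rack2": [6, 7, 8, 9, 10],
    "Rack3": [11, 12, 13, 14, 15],
}


def rack_of(node):
    """Return the name of the rack that hosts the given DataNode."""
    for rack, nodes in RACKS.items():
        if node in nodes:
            return rack
    raise KeyError(node)


def place_replicas(writer_node, rng=random):
    """Pick three DataNodes: the writer's node, then two nodes on one other rack."""
    first = writer_node
    other_racks = [r for r in RACKS if r != rack_of(first)]
    remote_rack = rng.choice(other_racks)
    second, third = rng.sample(RACKS[remote_rack], 2)
    return [first, second, third]


placement = place_replicas(1, random.Random(0))
print(placement, [rack_of(n) for n in placement])
```

With this layout a block survives the loss of a whole rack, while only one of the two replica copies has to cross the core switch.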



Recovery using Replication
• DataNode sends heartbeat to NameNode periodically

• No heartbeat? NameNode marks DataNode as dead


– Stops forwarding any new I/O requests to the DataNode
– Data registered to a dead DataNode is “lost” to HDFS

• NameNode’s process for data recovery:


– Determine which blocks were on the lost node
– Find other DataNodes with copies of these blocks
– Instruct these DataNodes to copy the blocks to other nodes
(whenever possible)
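
The recovery process above can be sketched as two small functions: one that declares nodes dead after a missed-heartbeat timeout, and one that plans the copies. The timeout value, the block map layout and all names here are assumptions for illustration, not the NameNode's actual data structures.

```python
# Sketch of heartbeat-based failure detection and re-replication planning.
# Hypothetical names and values; not the NameNode implementation.

HEARTBEAT_TIMEOUT = 630  # seconds; an assumed value for this sketch


def dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return {dn for dn, t in last_heartbeat.items() if now - t > timeout}


def plan_recovery(block_map, dead, live_nodes, replication=3):
    """Return {block: (source, target)} copy instructions for under-replicated blocks."""
    plan = {}
    for block, holders in block_map.items():
        survivors = [dn for dn in holders if dn not in dead]
        if len(survivors) >= replication or not survivors:
            continue  # fully replicated, or no copy left to read from
        candidates = [dn for dn in live_nodes if dn not in survivors]
        if candidates:  # "whenever possible"
            plan[block] = (survivors[0], candidates[0])
    return plan


last_heartbeat = {"DN1": 1000, "DN2": 1000, "DN3": 100, "DN4": 1000}
block_map = {"blk_25": ["DN1", "DN2", "DN3"], "blk_26": ["DN1", "DN2", "DN4"]}

dead = dead_nodes(last_heartbeat, now=1003)
live = sorted(set(last_heartbeat) - dead)
plan = plan_recovery(block_map, dead, live)
print(dead)
print(plan)
```

Here DN3 has been silent for too long, so blk_25 is copied from a surviving holder to DN4, while blk_26 needs no action.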



HDFS Commands

hadoop fs -command [args]

A few commands:
-ls, -ls -R: list files/directories (-R lists recursively)
-cat: display file content (uncompressed)
-chgrp, -chmod, -chown: change a file's group, permissions and owner
-put, -get, -copyFromLocal, -copyToLocal: copy files from the local file
system to HDFS and vice versa
-mv, -moveFromLocal, -moveToLocal: move files



MapR-FS



Architectural Differences with MapR FS

Source: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots#.VZoux_lViko



MapR's Containers
Files/directories are sharded into blocks, which
are placed into containers on disks

Containers are 16-32 GB segments of disk, placed on nodes

Source: http://www.slideshare.net/jayshao/nyc-hadoop-meetup-mapr-architecture-philosophy-and-applications
Container locations and replication

Container location database (CLDB) keeps track of nodes
hosting each container and replication chain order

[Figure: the CLDB maps each container to a list of nodes (N1, N2; N3, N2; N1, N3; ...) across nodes N1, N2 and N3. Source: MapR]
Source: http://www.slideshare.net/jayshao/nyc-hadoop-meetup-mapr-architecture-philosophy-and-applications



Random Writing in MapR
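[Figure: a client writing data asks the CLDB for a 64M block; the CLDB creates a container and picks a master and 2 replica slaves; the client attaches and writes the next chunk to S2. Containers are shown with replica lists such as S1, S2, S4; S1, S3; S1, S4, S5; S2, S4, S5; S2, S3, S5 across servers S1 to S5.]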

Source: http://www.slideshare.net/mcsrivas/design-scale-and-performance-of-maprs-distribution-for-hadoop



Map-Reduce



Understanding map-reduce

Source: https://developer.yahoo.com/hadoop/tutorial/module4.html



Understanding Map-Reduce
[Figure: Cluster Nodes 1 to 3, each running a Data Node. On each node the Input Format divides the input into splits, five in total, one per map task (Map Task 1 to Map Task 5). Each map task emits <key, value> pairs (keys key1 to key9) and writes them to its own map file (map file 1 to map file 5).]



Understanding Map-Reduce
[Figure: map files 1 to 5 on Cluster Nodes 1 to 3 pass through Shuffle and Sort, and the reduce output is written to the Data Nodes.]

Reduce Task 1 receives:
<key1, value, value, value>
<key5, value>
<key7, value, value>
<key8, value>
<key9, value>

Reduce Task 2 receives:
<key2, value>
<key3, value, value, value>
<key4, value, value>
<key6, value, value, value>



Keys for the reduce phase
• A reducing function turns a large list of values into
one (or a few) output values.
• All of the values with the same key are presented
to a single reducer together.
• This is done by a partitioner.
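
To make the role of the partitioner concrete, here is a single-process word count that mimics the map, partition, shuffle/sort and reduce steps. It is a teaching sketch, not Hadoop API code: the function names, the two-reducer setup and the CRC32-based partition function are assumptions.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumed number of reduce tasks for this sketch


def map_fn(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def partition(key, num_reducers=NUM_REDUCERS):
    """Deterministic hash so the same key always goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Turn the list of values for one key into a single output value."""
    return key, sum(values)


def run_job(lines):
    # Map phase: each emitted pair is routed to a partition by its key.
    partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for line in lines:
        for key, value in map_fn(line):
            partitions[partition(key)][key].append(value)
    # Shuffle/sort and reduce: each reducer sees its keys in sorted order.
    output = {}
    for part in partitions:
        for key in sorted(part):
            k, v = reduce_fn(key, part[key])
            output[k] = v
    return output


counts = run_job(["the quick brown fox", "the lazy dog", "The fox"])
print(counts)
```

Since the partition depends only on the key, all values for "the" end up in the same reducer no matter which map task produced them.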

Source: https://developer.yahoo.com/hadoop/tutorial/module4.html



Detailed MapReduce Flow

Source: https://developer.yahoo.com/hadoop/tutorial/module4.html



Hadoop Map Reduce – Word Count

Source: http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html



The Combiner

Source: https://developer.yahoo.com/hadoop/tutorial/module4.html



Mapper Output process
1. Input splits supply <k1,v1> records
2. Mapper: the map method outputs <k2,v2> pairs
3. Map output buffer: holds the emitted pairs
4. Records are sorted and spilled to disk when the buffer reaches a threshold
5. Spill files are merged into a single file (Mapper output = Reducer input)
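
The sort, spill and merge behaviour described on this slide can be sketched in a few lines. This is a simplification under stated assumptions: the buffer limit counts records rather than bytes, spills are kept in memory instead of on disk, and `map_side_sort` is a hypothetical name.

```python
import heapq


def map_side_sort(pairs, buffer_limit):
    """Buffer map output, sort and spill when the buffer is full, then merge the spills."""
    spills, buffer = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= buffer_limit:   # threshold reached: sort and spill
            spills.append(sorted(buffer))
            buffer = []
    if buffer:                            # final spill for what is left
        spills.append(sorted(buffer))
    merged = list(heapq.merge(*spills))   # merge sorted spills into one sorted run
    return spills, merged


pairs = [("k3", 1), ("k1", 1), ("k2", 1), ("k1", 2), ("k4", 1)]
spills, merged = map_side_sort(pairs, buffer_limit=2)
print(len(spills))
print(merged)
```

Each spill is already sorted, so the final merge only has to interleave them, which is why the merged file can be handed to the reducers in key order.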



Reducer Output Process
1. The Reducer fetches the data from the Mappers (Mapper output = Reducer input)
2. Fetched data is held in an in-memory buffer and written to spill files
3. The spill files are combined into merged input
4. The Reducer processes the merged input
5. The Reducer output is written to HDFS

[Figure: two Reducers, each with an in-memory buffer, spill files and merged input, writing to HDFS]



Key-Value pairs

<K1, V1> → Mapper → <K2, V2> → Shuffle/Sort → <K2, (V2,V2,V2,V2)> → Reducer → <K3, V3>



Hadoop MR Job Flow
[Figure: a Client submits work to the Job tracker; the Name Node and S. Name Node sit alongside it. Each of the three Cluster Nodes runs a Task Tracker with several Tasks, plus a Data Node.]



The Hadoop Ecosystem



The Hadoop Ecosystem

Image Courtesy: http://www.apache.org/



What is YARN?
• YARN = Yet Another Resource Negotiator
• YARN splits up the functionality of the JobTracker in
Hadoop 1.x into two separate processes:
– ResourceManager: for allocating resources and
scheduling applications
– ApplicationMaster: for executing applications and
providing failover



Why YARN ?
• Hadoop 1.x – ONLY MapReduce
• Not every problem is a MapReduce problem
• Hadoop 1.x limitations
– Limited scalability
– Low cluster resource utilization
– Lack of support for alternative frameworks/paradigms



Hadoop 1.x Architecture

Source: http://www.ibm.com/developerworks/library/bd-hadoopyarn/



JobTracker responsibilities
• Manages the computational resources in terms of map and
reduce slots
• Schedules the submitted jobs
• Monitors the executions of the TaskTrackers
• Restarts failed tasks
• Performs speculative execution of tasks
• Calculates the Job Counters

Clearly, the JobTracker alone handles many responsibilities at once and is
overloaded with work.



Hadoop 1.x limitations
• Lacks support for alternate paradigms and services
– Forces everything to look like MapReduce
– Iterative applications in MapReduce are 10x slower
• Scalability
– Max cluster size ~5,000 nodes
– Max concurrent tasks ~40,000
• Availability
– JobTracker failure kills queued & running jobs
• Hard partition of resources into map and reduce slots
– Non-optimal resource utilization



Hadoop 1.x Architecture Redesign

Source: http://www.ibm.com/developerworks/library/bd-hadoopyarn/



Hadoop 1.x Architecture Redesign
• YARN splits up the two major functions of the JobTracker into:
– Global Resource Manager (RM): cluster resource management and
scheduling responsibilities
– Application Master (AM): application life cycle management, i.e.
job execution & monitoring (per app)
o The AM coordinates the logical plan of a single job.
o The AM negotiates resource containers from the Scheduler, tracking their
status and monitoring for progress.
o The AM itself runs as a normal container.
• TaskTracker is replaced by the NodeManager (NM), a new per-node slave that is
responsible for:
1. launching the applications’ containers,
2. monitoring their resource usage (cpu, memory, disk, network),
3. reporting to the RM



YARN Benefits
• Scalability
• Optimal Cluster Utilization
• Support for Alternative frameworks
• Multi-tenancy
• Flexible Resource Model



Where else is YARN used?
• Apache Tez on YARN
• Storm on YARN
• HOYA (Apache HBase on YARN)
• Apache Samza on YARN
• Apache Giraph for graph processing
• Apache Accumulo on YARN



How Applications Work in YARN

Source: http://doc.mapr.com/display/MapR/YARN



Multi-Tenancy is Built-in
• Queues
• Economics as queue-capacity
– Hierarchical Queues
• SLAs
– Cooperative Preemption
• Resource Isolation
– Linux: cgroups
• Administration
– Queue ACLs

Default Capacity Scheduler supports all features

[Figure: ResourceManager with Capacity Scheduler and hierarchical queues. Under root: Mrkting 20%, Adhoc 10%, DW 70%. Next level: Dev 20%, Prod 80%, Dev 10%, Reserved 20%, Prod 70%. Lowest level: P0 70%, P1 30%.]
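
With hierarchical queues, each percentage is a share of the parent queue, so a queue's share of the whole cluster is the product of the percentages along its path. The sketch below works this out for the queue percentages in the figure above. The parent/child assignment (Dev and Prod under Mrkting; Dev, Reserved and Prod under DW; P0 and P1 under DW's Prod) is an assumption read off the figure layout, and the dictionary format is hypothetical, not Capacity Scheduler configuration syntax.

```python
# Worked example: absolute capacity of hierarchical queues.
# The hierarchy below is an assumed reading of the figure.

QUEUES = {
    "root": {"Mrkting": 20, "Adhoc": 10, "DW": 70},
    "root.Mrkting": {"Dev": 20, "Prod": 80},
    "root.DW": {"Dev": 10, "Reserved": 20, "Prod": 70},
    "root.DW.Prod": {"P0": 70, "P1": 30},
}


def absolute_capacities(queues, parent="root", parent_share=100.0):
    """Return {queue path: percent of the whole cluster} for every queue below parent."""
    result = {}
    for child, percent in queues.get(parent, {}).items():
        path = f"{parent}.{child}"
        share = parent_share * percent / 100.0
        result[path] = share
        result.update(absolute_capacities(queues, path, share))
    return result


caps = absolute_capacities(QUEUES)
for path, share in sorted(caps.items()):
    print(f"{path}: {share:.1f}% of the cluster")
```

Under this reading, DW's Prod queue is entitled to 70% of 70%, that is 49% of the cluster, and P0 to 70% of that.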

Source: http://www.slideshare.net/hortonworks/apache-hadoop-yarn-understanding-the-data-operating-system-of-hadoop



RM High Availability

Source: http://blog.cloudera.com/blog/2014/05/how-apache-hadoop-yarn-ha-works/



Key YARN Takeaways
• YARN is a platform to build/run Multiple Distributed Applications in
Hadoop
• YARN is completely Backwards Compatible for existing MapReduce apps
• YARN enables Fine Grained Resource Management via Generic Resource
Containers.
• YARN has built-in support for multi-tenancy to share cluster resources
and increase cost efficiency
• YARN provides a cluster operating system like abstraction for a modern
data architecture

Source: http://www.slideshare.net/hortonworks/apache-hadoop-yarn-understanding-the-data-operating-system-of-hadoop



Apache Sqoop



What is Sqoop ?
A tool designed to transfer data between Hadoop and
relational databases
- import data from a relational database management system
(RDBMS) such as MySQL or Oracle or a mainframe into the Hadoop
Distributed File System
- Export the data into an RDBMS



Overview of Sqoop

Source: http://hortonworks.com/hadoop/sqoop/



The Sqoop Import Tool
• The import command has the following requirements:
– Must specify a connect string using the --connect argument
– Credentials can be included in the connect string, or supplied using the
--username and --password arguments
– Must specify either a table to import using --table, or the result of a
SQL query using --query



Importing a Table
sqoop import \
  --connect jdbc:mysql://host/nyse \
  --table StockPrices \
  --target-dir /data/stockprice/ \
  --as-textfile



Importing Specific Columns
sqoop import \
  --connect jdbc:mysql://host/bse \
  --table Stocks \
  --columns "StockSymbol,Volume,High,ClosingPrice" \
  --target-dir /data/stocks/ \
  --as-textfile \
  --split-by StockSymbol \
  -m 10



Importing from a Query
sqoop import \
  --connect jdbc:mysql://host/bse \
  --query "SELECT * FROM Stocks s WHERE s.Volume >= 1000000 AND \$CONDITIONS" \
  --target-dir /data/stocks/ \
  --as-textfile \
  --direct \
  --split-by StockSymbol



The Sqoop Export Tool
• The export command transfers data from HDFS to
a database:
– Use --table to specify the database table
– Use --export-dir to specify the data to export
• Rows are appended to the table by default
• If you define --update-key, then existing rows will be
updated with the new data
• Use --call to invoke a stored procedure (instead of
specifying the --table argument)



Exporting to a Table
sqoop export \
  --connect jdbc:mysql://host/mylogs \
  --table LogData \
  --export-dir /data/logfiles/ \
  --input-fields-terminated-by "\t"



Sqoop Labs



Apache Flume



What is Flume ?

Source: http://www.slideshare.net/aprabhakar/apache-flume-datadaytexas



Flume Agent design

Source: https://blogs.apache.org/flume/



Events
Flume’s core data movement atom: the Event

• An Event has a simple schema


– Event header: Map<String, String>
• similar in spirit to HTTP headers
– Event body: array of bytes

Source: http://www.slideshare.net/Hadoop_Summit/percy-june26-455room211



Channels
• Passive Component
• Channel type determines the reliability guarantees
• Stock channel types:
– JDBC
– Memory – lower latency for small writes, but not durable
– File – provides durability; most people use this



Sources
• Event-driven or polling-based
• Most sources can accept batches of events
• Source implementations:
o Avro-RPC – other Java-based Flume agents can send data to this source port
o Thrift-RPC
o HTTP – post via a REST service (extensible)
o Netcat
o Scribe
o Spooling Directory – parse and ingest completed log files
o Exec – execute shell commands and ingest the output



Sinks
• All sinks are polling-based
• Most sinks can process batches of events at a time
• Sink implementations :
o HDFS
o HBase
o SolrCloud
o ElasticSearch
o Avro-RPC, Thrift-RPC
o File Roller
o Null, Logger, Seq



Data flow model

Source: http://flume.apache.org/FlumeUserGuide.html



Anatomy of a Flume agent

Source: http://www.slideshare.net/aprabhakar/apache-flume-datadaytexas



Flume component interactions

• Source: Puts events into the local channel
• Channel: Stores events until someone takes them
• Sink: Takes events from the local channel
– On failure, sinks back off and retry forever until success
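
The interaction between the three components can be imitated with a queue and a retry loop. This is a toy model, not the Flume API: the `Channel` class, `run_sink` and the flaky destination are hypothetical, and the sink gives up after `max_attempts` so the example terminates, whereas the slide says a real sink retries forever.

```python
from collections import deque


class Channel:
    """Passive buffer: holds events until a sink takes them."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        if len(self.events) >= self.capacity:
            raise OverflowError("channel full")
        self.events.append(event)

    def peek(self):
        return self.events[0] if self.events else None

    def commit_take(self):
        self.events.popleft()


def run_sink(channel, deliver, max_attempts=10):
    """Drain the channel; an event leaves it only after deliver() succeeds."""
    attempts = 0
    while channel.peek() is not None and attempts < max_attempts:
        attempts += 1
        try:
            deliver(channel.peek())
        except IOError:
            continue  # a real sink would back off before retrying
        channel.commit_take()
    return attempts


delivered = []
failures = {"left": 1}  # the destination fails once, then recovers


def flaky_deliver(event):
    if failures["left"] > 0:
        failures["left"] -= 1
        raise IOError("destination unavailable")
    delivered.append(event)


channel = Channel(capacity=100)
for body in (b"line 1", b"line 2"):  # the source puts events into the channel
    channel.put({"headers": {"host": "web01"}, "body": body})

attempts = run_sink(channel, flaky_deliver)
print(attempts, [e["body"] for e in delivered])
```

Because the event stays in the channel until delivery succeeds, a failing destination delays events but does not lose them, as long as the channel itself is durable.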

Source: http://www.slideshare.net/Hadoop_Summit/percy-june26-455room211



A Flume Example
agent.sources = webserver
agent.channels = memoryChannel
agent.sinks = mycluster

agent.sources.webserver.type = exec
agent.sources.webserver.command = tail -F /var/log/hadoop/hdfs/audit.log
agent.sources.webserver.batchSize = 1
agent.sources.webserver.channels = memoryChannel

agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000

agent.sinks.mycluster.type = hdfs
agent.sinks.mycluster.channel = memoryChannel
agent.sinks.mycluster.hdfs.path = hdfs://127.0.0.1:8020/hdfsaudit/



Transactional Data Exchange

Source: http://www.slideshare.net/aprabhakar/apache-flume-datadaytexas



Routing and Replication

Source: http://www.slideshare.net/aprabhakar/apache-flume-datadaytexas



Multiplexing the flow

Source: http://flume.apache.org/FlumeUserGuide.html



Multi-agent flow

Source: http://flume.apache.org/FlumeUserGuide.html



Consolidation

Source: http://flume.apache.org/FlumeUserGuide.html



Custom Source and Sink
• Custom Source
– A custom source is your own implementation of the Source interface
• Custom Sink
– A custom sink is your own implementation of the Sink interface



Load Balancing Sink
• Ability to load-balance flow over multiple sinks
– maintains an indexed list of active sinks on which the
load must be distributed
– Supports distributing load using either round_robin
or random selection mechanisms
– Custom selection mechanisms are supported via custom
classes



Flume Interceptors
• Flume has the capability to modify/drop events in-flight
• An interceptor can modify or even drop events
based on any criteria chosen by the developer of
the interceptor
• Flume supports chaining of interceptors
• The order in which the interceptors are specified is
the order in which they are invoked
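
A chain of interceptors can be pictured as a list of functions applied in order, where returning `None` drops the event. The sketch below is an illustration of that idea only; it does not use the Flume Interceptor interface, and `add_host`, `drop_debug` and the header value are hypothetical.

```python
# Sketch of an interceptor chain: each interceptor may modify or drop an event.
# Hypothetical functions; not the Flume Interceptor API.

def add_host(event):
    """Modify the event in flight by adding a header."""
    event["headers"]["host"] = "web01"   # hypothetical header value
    return event


def drop_debug(event):
    """Drop events whose body contains DEBUG; pass the rest through."""
    return None if b"DEBUG" in event["body"] else event


def apply_chain(interceptors, events):
    """Run every event through the interceptors in the order they are listed."""
    out = []
    for event in events:
        for intercept in interceptors:
            event = intercept(event)
            if event is None:      # dropped: later interceptors never see it
                break
        if event is not None:
            out.append(event)
    return out


events = [
    {"headers": {}, "body": b"INFO started"},
    {"headers": {}, "body": b"DEBUG noisy"},
]
kept = apply_chain([drop_debug, add_host], events)
print(kept)
```

Order matters: putting `drop_debug` first means dropped events are never passed to `add_host`.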



Apache Pig



Pig
• Provides a high level of abstraction over Map-Reduce
for processing large datasets.
• Pig is used to process both structured and
unstructured data (Pig eats anything)
• Pig is designed to be extensible via UDFs
• It is made up of
– Pig Latin: a language
– Pig programs can be run using
• Pig binary
• Grunt shell
• Java programs



Pig Latin
• High-level data flow scripting language
• Pig program is a series of operations or
transformations on input data to produce output
• Pig executes in a unique fashion:
– During execution each statement is processed by the Pig
interpreter
– If a statement is valid, it gets added to a logical plan
built by the interpreter
– The steps in the logical plan do not actually execute until
a DUMP or STORE command



Pig Latin – Define Relation Names

• A relation is created as a result of an operation or
transformation in a Pig script
• An alias is a name assigned to a relation so that it
can be used in subsequent operations or
transformations
• For example, data is an alias:

data = LOAD 'data.txt' USING TextLoader();



Pig Latin – Define Relation with Fields

• Relations can define and use field names, which
are associated with an alias
• For example:

salaries = LOAD 'salary.data' USING PigStorage(',')
AS (gender, age, income, zip);

highsalaries = FILTER salaries BY income > 100000;



Pig Latin - Data Types
• int • long
• float • double
• chararray • bytearray
• boolean • datetime
• bigdecimal • biginteger



Pig Latin - Complex Types

• Tuple: ordered set of values
(OH,Mark,Twain,31225)
• Bag: unordered collection of tuples
{
(OH,Mark,Twain,31225),
(UK,Charles,Dickens,42207),
(ME,Robert,Frost,11496)
}
• Map: collection of key value pairs
[state#OH,name#Mark Twain,zip#31225]



Pig Latin – Define Relation with Schema

customers = LOAD 'customer.data' AS (
firstName: chararray,
lastname: chararray,
houseno: int,
street: chararray,
phone: long,
payment: double);

salaries = LOAD 'salaries.txt' AS (
gender: chararray,
details: bag { b ( age:int, salary: double, zip: long) });



Pig Latin - GROUP Operator
salaries
gender age salary zip
F 25 35000 95103
M 30 45000 95102
F 35 60000 95103
F 30 48000 95105
M 30 47000 95102
M 25 39000 95103

salariesbyage = GROUP salaries BY age;

salariesbyage
group salaries
25 {(F, 25, 35000, 95103), (M, 25, 39000, 95103)}
30 {(M, 30, 45000, 95102), (F, 30, 48000, 95105), (M, 30, 47000, 95102)}
35 {(F, 35, 60000, 95103)}

grunt> DESCRIBE salariesbyage;
salariesbyage: {group: int, salaries: {(gender: chararray, age: int, salary: double, zip: int)}}
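
To make the shape of a GROUP result concrete, the same grouping can be mimicked in plain Java. This is only a sketch of the semantics, not how Pig executes it; `GroupByAgeSketch` and its nested `Salary` type are hypothetical names. Each key maps to a "bag" holding the complete original tuples, which is why the age field still appears inside every grouped tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupByAgeSketch {

    /** One tuple of the salaries relation (hypothetical helper type). */
    static final class Salary {
        final String gender;
        final int age;
        final double salary;
        final int zip;

        Salary(String gender, int age, double salary, int zip) {
            this.gender = gender;
            this.age = age;
            this.salary = salary;
            this.zip = zip;
        }
    }

    /** Mirrors GROUP salaries BY age: one entry per key, holding a bag of whole tuples. */
    static Map<Integer, List<Salary>> groupByAge(List<Salary> salaries) {
        // TreeMap is used only to get a stable print order; Pig bags are unordered.
        Map<Integer, List<Salary>> groups = new TreeMap<>();
        for (Salary s : salaries) {
            groups.computeIfAbsent(s.age, k -> new ArrayList<>()).add(s);
        }
        return groups;
    }

    /** The six rows shown in the salaries table. */
    static List<Salary> sampleData() {
        return Arrays.asList(
            new Salary("F", 25, 35000, 95103),
            new Salary("M", 30, 45000, 95102),
            new Salary("F", 35, 60000, 95103),
            new Salary("F", 30, 48000, 95105),
            new Salary("M", 30, 47000, 95102),
            new Salary("M", 25, 39000, 95103));
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, List<Salary>> e : groupByAge(sampleData()).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " tuple(s)");
        }
    }
}
```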



Pig Latin - GROUP ALL Operator
salaries
gender age salary zip
F 25 35000 95103
M 30 45000 95102
F 35 60000 95103
F 30 48000 95105
M 30 47000 95102
M 25 39000 95103

salariesgroupall = GROUP salaries ALL;

salariesgroupall
group salaries
all {(F, 25, 35000, 95103), (M, 25, 39000, 95103), (M, 30, 45000, 95102), (F, 30, 48000, 95105), (M, 30, 47000, 95102), (F, 35, 60000, 95103)}

grunt> DESCRIBE salariesgroupall;
salariesgroupall: {group: chararray, salaries: {(gender: chararray, age: int, salary: double, zip: int)}}



Pig Latin – Relation without schema
salaries (no schema)
$0 $1 $2 $3
F 25 35000 95103
M 30 45000 95102
F 35 60000 95103
F 30 48000 95105
M 30 47000 95102
M 25 39000 95103

salariesbyzip = GROUP salaries BY $3;

salariesbyzip
group salaries
95102 {(M, 30, 45000, 95102), (M, 30, 47000, 95102)}
95103 {(F, 25, 35000, 95103), (F, 35, 60000, 95103), (M, 25, 39000, 95103)}
95105 {(F, 30, 48000, 95105)}

grunt> DESCRIBE salariesbyzip;
salariesbyzip: {group: bytearray, salaries: {()}}



Pig Latin – FOREACH GENERATE Operator
salaries
gender age salary zip
F 25 35000.00 95103
M 30 45000.00 95102
F 35 60000.00 95103
F 30 48000.00 95105
M 30 47000.00 95102
M 25 39000.00 95103

A = FOREACH salaries GENERATE age, salary;

A
age salary
25 35000.00
30 45000.00
35 60000.00
30 48000.00
30 47000.00
25 39000.00

grunt> DESCRIBE A;
A: {age: int, salary: double}



Pig Latin – Specifying Ranges in FOREACH
salaries
gender age salary zip
F 25 35000.00 95103
M 30 45000.00 95102
F 35 60000.00 95103
F 30 48000.00 95105
M 30 47000.00 95102
M 25 39000.00 95103

grunt> salaries = LOAD 'salaries.txt' USING PigStorage(',') AS
(gender: chararray, age: int, salary: double, zip: int);
grunt> B = FOREACH salaries GENERATE age .. zip;
grunt> C = FOREACH salaries GENERATE age .. ;
grunt> D = FOREACH salaries GENERATE .. salary;

B and C (same result: every field from age to the last field)
age salary zip
25 35000.00 95103
30 45000.00 95102
35 60000.00 95103
30 48000.00 95105
30 47000.00 95102
25 39000.00 95103



Pig Latin – Specifying Ranges in FOREACH

salaries = LOAD 'salaries.txt' USING PigStorage(',') AS
(gender:chararray, age:int, salary:double, zip:int);
C = FOREACH salaries GENERATE age..zip;
D = FOREACH salaries GENERATE age..;
E = FOREACH salaries GENERATE ..salary;

customer = LOAD 'data/customers';
F = FOREACH customer GENERATE $12..$23;



Pig Latin – FILTER Operator

salaries
gender age salary zip
F 25 35000.00 95103
M 30 45000.00 95102
F 35 60000.00 95103
F 30 48000.00 95105
M 30 47000.00 95102
M 25 39000.00 95103

G = FILTER salaries BY salary > 45000;

G
gender age salary zip
M 30 47000.00 95102
F 30 48000.00 95105
F 35 60000.00 95103



Pig Latin – CASE Operator
salaries
gender age salary zip
F 25 35000.00 95103
M 30 45000.00 95102
F 35 60000.00 95103
F 30 48000.00 95105
M 30 47000.00 95102
M 25 39000.00 95103

bonus = FOREACH salaries GENERATE salary, (
CASE
WHEN salary <= 40000 THEN salary * 0.3
WHEN salary > 40000 AND salary <= 50000 THEN salary * 0.2
WHEN salary > 50000 THEN salary * 0.1
END
) AS bonus;

bonus
salary bonus
35000.00 10500
45000.00 9000
60000.00 6000
48000.00 9600
47000.00 9400
39000.00 11700
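
The three salary bands in the CASE expression can be checked with a few lines of Java. This is a sketch for verifying the arithmetic only; `BonusSketch` is a hypothetical name, and the sample salaries are the six rows of the salaries table.

```java
final class BonusSketch {

    /** Same three salary bands as the CASE expression: 30%, 20%, 10%. */
    static double bonus(double salary) {
        if (salary <= 40000) {
            return salary * 0.3;
        } else if (salary <= 50000) {
            return salary * 0.2;
        } else {
            return salary * 0.1;
        }
    }

    public static void main(String[] args) {
        double[] salaries = {35000, 45000, 60000, 48000, 47000, 39000};
        for (double s : salaries) {
            System.out.printf("%.2f -> %.0f%n", s, bonus(s));
        }
    }
}
```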



Pig Latin - Using PARALLEL
• PARALLEL determines the number of reducers to
use in a particular operation
• Without PARALLEL, Pig estimates one reducer per
GB of input data, up to a maximum of 999 reducers

A = LOAD 'data1';
B = LOAD 'data2';
C = JOIN A BY $1, B BY $3 PARALLEL 20;
D = ORDER C BY $0 PARALLEL 5;
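
When PARALLEL is omitted, the reducer count is derived from the input size. The sketch below shows that arithmetic under two assumptions taken from Pig's documented defaults: about 1 GB of input per reducer (property `pig.exec.reducers.bytes.per.reducer`) and a cap of 999 reducers (property `pig.exec.reducers.max`). The class name `ReducerEstimateSketch` is hypothetical, and the rounding shown here is an illustration rather than a copy of Pig's code.

```java
final class ReducerEstimateSketch {
    // Assumed default: roughly 1 GB of input per reducer.
    static final long BYTES_PER_REDUCER = 1_000_000_000L;
    // Cap on the estimated reducer count, as quoted on the slide.
    static final int MAX_REDUCERS = 999;

    /** Estimates reducers as ceil(input / bytesPerReducer), clamped to [1, MAX_REDUCERS]. */
    static int estimateReducers(long totalInputBytes) {
        long wanted = (totalInputBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER; // ceiling division
        wanted = Math.max(1L, wanted);
        return (int) Math.min((long) MAX_REDUCERS, wanted);
    }

    public static void main(String[] args) {
        System.out.println(estimateReducers(20_000_000_000L));    // a 20 GB input
        System.out.println(estimateReducers(5_000_000_000_000L)); // a 5 TB input hits the cap
    }
}
```

An explicit PARALLEL clause, as in the JOIN and ORDER statements above, overrides this estimate for that operator.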



Pig Latin – Inner Join
location
state name
CA James
AZ Bill
NY John
MN Bond

department
fname dept
John Sales
Bond Finance
James Support
Bill IT

joined = JOIN location BY name, department BY fname;

joined
location::state location::name department::fname department::dept
CA James James Support
AZ Bill Bill IT
NY John John Sales
MN Bond Bond Finance



Pig Latin - Replicated Joins
• Loads one dataset into memory to perform a map-side
join
• Specify 'replicated'; the second (last) data set listed is the
one loaded into memory, so it must be small

grunt> replicatedjoin = JOIN location BY name,
department BY fname
USING 'replicated';
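
The idea behind a replicated join is an in-memory hash join: the small relation is indexed once, and the large relation is streamed past it, so no shuffle to reducers is needed. The following Java sketch illustrates that idea on the location and department data from the inner join slide. It is not Pig's implementation, and `ReplicatedJoinSketch` and its helper methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReplicatedJoinSketch {

    /**
     * Joins rows of {state, name} with rows of {fname, dept} on name == fname.
     * The second (small) relation is held entirely in memory.
     */
    static List<String[]> join(List<String[]> location, List<String[]> department) {
        // Build phase: index the small relation by its join key.
        Map<String, List<String[]>> byFname = new HashMap<>();
        for (String[] d : department) {
            byFname.computeIfAbsent(d[0], k -> new ArrayList<>()).add(d);
        }
        // Probe phase: stream the large relation once and look up matches.
        List<String[]> out = new ArrayList<>();
        for (String[] l : location) {
            List<String[]> matches = byFname.get(l[1]);
            if (matches == null) {
                continue; // inner join: rows without a match are dropped
            }
            for (String[] d : matches) {
                out.add(new String[] {l[0], l[1], d[0], d[1]});
            }
        }
        return out;
    }

    static List<String[]> sampleLocation() {
        return Arrays.asList(
            new String[] {"CA", "James"},
            new String[] {"AZ", "Bill"},
            new String[] {"NY", "John"},
            new String[] {"MN", "Bond"});
    }

    static List<String[]> sampleDepartment() {
        return Arrays.asList(
            new String[] {"John", "Sales"},
            new String[] {"Bond", "Finance"},
            new String[] {"James", "Support"},
            new String[] {"Bill", "IT"});
    }

    public static void main(String[] args) {
        for (String[] row : join(sampleLocation(), sampleDepartment())) {
            System.out.println(String.join(" ", row));
        }
    }
}
```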



Pig Latin – User Defined Functions
• A Pig UDF is implemented in Java by:
– Implementing a Java class that extends EvalFunc.
– Deploying the class in a JAR file.
– Registering the JAR file in the Pig script using the
REGISTER command.
– Optionally defining an alias for the UDF using the DEFINE
command



Pig Latin - UDF in Java Example
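package com.impetus.udfs;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that trims its first argument and converts it to upper case.
public class UpperCaseUdf extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for missing input instead of failing the whole job.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String inputStr = input.get(0).toString().trim();
        return inputStr.toUpperCase();
    }
}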



Pig Latin – UDF Invocation
UDF Registration

grunt> REGISTER myudf.jar;

UDF Invocation (fully qualified class name)

grunt> x = FOREACH location GENERATE com.impetus.udfs.UpperCaseUdf(name);

Define UDF alias

grunt> DEFINE TO_UPPER_UDF com.impetus.udfs.UpperCaseUdf();

grunt> x = FOREACH location GENERATE TO_UPPER_UDF(name);



DataFu Library
• DataFu is a collection of Pig UDFs for data analysis
on Hadoop
• Started by LinkedIn and open sourced under the
Apache 2.0 license



Tips for Optimizing Pig Scripts
• Filter early and often
• Project early and often
• Drop nulls before a join
• Use replicated joins whenever possible
• Use PARALLEL properly
• Use compression
• Choose the right data types
• Use .pigbootup for global settings of Pig scripts



Macros
Pig Latin supports the definition, expansion, and
import of macros
• A macro definition can appear anywhere in a Pig script as
long as it appears prior to the first use
• Recursive references are not allowed.
• Macros are not allowed inside a FOREACH nested block.
• Macros cannot contain Grunt shell commands.
• Macros cannot include a user-defined schema that has a
name collision with an alias in the macro



Limit Operator
• Limits the number of output tuples for a relation
• Without ordering, there is no guarantee that the same
tuples are returned on each execution
• Applying ORDER BY immediately before LIMIT
guarantees the same result on each execution
employees = LOAD 'pigdemo.txt' AS (state:chararray, name:chararray);

emp_group = GROUP employees BY state;

L = LIMIT emp_group 3;



SAMPLE
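• Selects a random sample of a relation's tuples, given a
sampling fraction between 0 and 1 (0.05 below is about 5%)
• The result is a new relation, not console output; the exact
number of tuples returned can vary between runs
• Useful when working with large relations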

employees = LOAD 'pigdemo.txt' AS (state:chararray, name:chararray);

employee_subset = SAMPLE employees 0.05;
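
Sampling by fraction can be pictured as keeping each tuple independently with the given probability, which is why the size of the result is only approximately the requested fraction. The Java sketch below illustrates that behaviour; it is not Pig's code, and `SampleSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

final class SampleSketch {

    /** Keeps each tuple independently with probability {@code fraction} (0.0 to 1.0). */
    static <T> List<T> sample(List<T> relation, double fraction, Random rng) {
        List<T> out = new ArrayList<>();
        for (T tuple : relation) {
            // nextDouble() is in [0.0, 1.0), so fraction 0 keeps nothing and 1 keeps everything.
            if (rng.nextDouble() < fraction) {
                out.add(tuple);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> employees = Arrays.asList("OH,Mark", "UK,Charles", "ME,Robert", "CA,James");
        // The size of the result varies from run to run.
        System.out.println(sample(employees, 0.5, new Random()));
    }
}
```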



Built-in Functions
PluckTuple
– Allows the user to specify a string prefix, and then filter for the
columns in a relation that begin with that prefix.
IsEmpty
– Checks if a bag or map is empty.
SIZE
– Computes the number of elements based on any Pig data type.
SUBTRACT
– Bag subtraction: SUBTRACT(bag1, bag2) returns a bag composed of the bag1
elements that are not in bag2
TOKENIZE
– Splits a string and outputs a bag of words.



Type of Loaders and Storage
TextLoader
– Loads unstructured data in UTF-8 format.

PigStorage
– Loads and stores data as structured text files.

PigDump
– Stores data in UTF-8 format.

JsonLoader, JsonStorage
– Load or store JSON data



Type of Loaders and Storage
BinStorage
– Loads and stores data in machine-readable format.

HBaseStorage
– Loads and stores data from an HBase table

AvroStorage
– Loads and stores data from Avro files.



Handling Compression
• Support for compression is determined by the
load/store function
• PigStorage and TextLoader support gzip and bzip2
compression for both read (load) and write (store)
– gzip
• Gzipped files cannot be split across multiple maps
– bzip2
• Bzipped files can be split across multiple maps
• BinStorage does not support compression.



Debugging Pig Latin
Pig Latin provides operators that can help you debug
your Pig Latin statements:
• Use the DUMP operator to display results to your terminal
screen.
• Use the DESCRIBE operator to review the schema of a
relation.
• Use the EXPLAIN operator to view the logical, physical, or
map reduce execution plans to compute a relation.
• Use the ILLUSTRATE operator to view the step-by-step
execution of a series of statements.



Explain Command
Review the logical, physical, and map reduce
execution plans that are used to compute the
specified relation
• The logical plan shows a pipeline of operators to be
executed to build the relation. Type checking and
backend-independent optimizations (such as applying
filters early on) also apply at this stage.
• The physical plan shows how the logical operators are
translated to backend-specific physical operators. Some
backend optimizations also apply.
• The mapreduce plan shows how the physical operators are
grouped into map reduce jobs.



Illustrate Command
Displays a step-by-step execution of a sequence of
statements.
• Review how data is transformed through a sequence of Pig
Latin statements
• Test your programs on small datasets and get faster
turnaround times
• Works by retrieving a small sample of the input data and
then propagating this data through the pipeline
• The algorithm may generate example data automatically, in near
real time, so you might see data propagating through the
pipeline that was not found in the original input data.
This usually happens for joins



Pig Statistics
Pig Statistics is a framework for collecting and storing
script-level statistics for Pig Latin
• Complex Pig scripts often generate many MapReduce jobs.
• To help you debug a script, Pig prints a summary of the
execution that shows which relations (aliases) are mapped
to each MapReduce job.
• Pig statistics and the existing Hadoop statistics can also be
accessed via the Hadoop job history file
• Piggybank has a HadoopJobHistoryLoader which acts as an
example of using Pig itself to query these statistics



Timing UDFs
A method for approximately measuring how much time
is spent in different user-defined functions (UDFs)
and Loaders
• Set the pig.udf.profile property to true
• Measures:
– the approximate amount of time spent in a UDF
– the approximate number of times the UDF was invoked



Execution control
• Output location strict check
– Avoid writing to the same output location. To enforce strict checking of
output locations and fail fast, set pig.location.check.strict=true
• Disabling Pig commands and operators
– Blacklist and/or whitelist certain commands and operators
– Blacklisting
• For example, pig.blacklist=rm,kill,cross prevents users from executing the
"rm" and "kill" commands and the "cross" operator.
– Whitelisting
• Disables all commands and operators that are not part of the whitelist.
• For example, pig.whitelist=load,filter,store disallows every command and
operator other than "load", "filter" and "store".

The blacklist and whitelist must not conflict. Keep them
entirely distinct or Pig will complain.



Optimization Rules in Pig
• FilterLogicExpressionSimplifier

• PartitionFilterOptimizer
– Pushes the filter condition to the loader
A = LOAD 'input' USING HCatLoader() AS (dt, state, event);
B = FILTER A BY dt=='201310' AND state=='CA';

– If the loader supports it, the filter condition is pushed to the
loader and the script becomes:

A = LOAD 'input' USING HCatLoader() AS (dt, state, event);
--Filter is removed



Optimization Rules in Pig
• SplitFilter

• PushUpFilter

• LimitOptimizer

• PushDownForEachFlatten



Pig Labs



Apache Hive



What is Hive?
• Data warehouse system for Hadoop
• Hive organizes data into tables by maintaining
metadata information about Big Data stored on
HDFS
• Metadata, such as the schema, is stored in a database called the
Metastore (embedded Derby by default; MySQL is a common choice)
• Perform SQL-like operations on the data using a
scripting language called HiveQL



Comparing Hive to SQL
SQL Datatypes
• INT
• TINYINT/SMALLINT/BIGINT
• BOOLEAN
• FLOAT
• DOUBLE
• STRING
• BINARY
• TIMESTAMP
• ARRAY, MAP, STRUCT, UNION
• DECIMAL
• CHAR
• VARCHAR
• DATE

SQL Semantics
• SELECT, LOAD, INSERT from query
• Expressions in WHERE and HAVING
• GROUP BY, ORDER BY, SORT BY
• CLUSTER BY, DISTRIBUTE BY
• Sub-queries in FROM clause
• ROLLUP and CUBE
• UNION
• LEFT, RIGHT and FULL INNER/OUTER JOIN
• CROSS JOIN, LEFT SEMI JOIN
• Windowing functions (OVER, RANK, etc.)
• Sub-queries for IN/NOT IN, HAVING
• EXISTS / NOT EXISTS
• INTERSECT, EXCEPT



Hive Architecture
[Diagram: clients (Hive HQL CLI, JDBC/ODBC, Web UI) connect to Hive Server 2 and the Metastore; the Hive Driver (Compiler, Optimizer, Executor) submits work to Hadoop (MR + HDFS, YARN), made up of the Resource Manager, the Name Node, and Data Nodes with Node Managers]



Hive Services
• Services available with Hive:
– Hive CLI (command line interface)
• Traditional client used to connect to a HiveServer instance
$ hive -h hostname
hive>
– hiveserver
• Runs hive as a server exposing a Thrift service, enabling access
from a range of clients – JDBC, ODBC
– HWI (Hive Web Interface)
• Allows running hive queries using a web interface
(http://localhost:9999/hwi)
– Jar
• Hive equivalent to “hadoop jar”, a convenient way to run Java
applications
– metastore
• By default it runs in the same process as Hive server



Ways to Submit Hive Queries
• Using Hive CLI
– Traditional Hive client that connects to a HiveServer
instance
$ hive -h hostname
hive>

• Using Beeline
– A new command line client that connects to a
HiveServer2 instance using Hive JDBC driver
$ beeline
Hive version 0.11.0-SNAPSHOT by Apache
beeline> !connect jdbc:hive2://hostname:10000 username password org.apache.hive.jdbc.HiveDriver



Hive Tables
• Logically made up of
– Data stored on HDFS
– Metadata describing the layout of the data in a
Metastore (which is an RDBMS)
• Two types of tables:
– Managed Tables (default)
• Hive moves data into its warehouse directory on HDFS
– External Tables
• Hive refers to data in an existing location on HDFS outside
warehouse directory



Define a Hive-Managed Table
• Use the CREATE TABLE command to define a
Hive-managed table
• This command only creates metadata in the
metastore

CREATE TABLE customer (
id INT,
fName STRING,
lName STRING,
birthday TIMESTAMP
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';



Define a Hive External Table
• Use the CREATE EXTERNAL TABLE command with the LOCATION
keyword to define a Hive external table
• This command only creates metadata in the
metastore

CREATE EXTERNAL TABLE salaries (
gender STRING,
age INT,
salary DOUBLE,
zip INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/data/salaries/';



Loading Data in Hive
• Use the following commands to load data into Hive
tables
• LOAD DATA only moves files (or copies them, with LOCAL)
into the Hive warehouse directory for managed tables
and the LOCATION directory for external tables;
INSERT ... SELECT runs a query to populate the table

LOAD DATA LOCAL INPATH '/tmp/customer.csv' OVERWRITE INTO
TABLE customer;
LOAD DATA INPATH '/tmp/customer.csv' OVERWRITE INTO TABLE
customer;
INSERT INTO TABLE birthdays
SELECT fName, lName, birthday FROM customer WHERE
birthday IS NOT NULL;



Query Data in Hive

SELECT * FROM customer;

FROM customer
SELECT fName, lName, birthday
WHERE birthday IS NOT NULL;

SELECT customer.*, orders.*
FROM customer JOIN orders ON
(customer.id = orders.customerId);



Use Hive to Save Results to file
INSERT OVERWRITE DIRECTORY
'/user/data/ca_or_sd'
SELECT name, state
FROM names
WHERE state = 'CA' OR state = 'SD';

INSERT OVERWRITE LOCAL DIRECTORY
'/tmp/data'
SELECT * FROM bucketnames
ORDER BY age;



Hive Partitions
• Hive allows a table to be partitioned, dividing it
into coarse-grained parts based on the value of a
partition column
• Partitions enable faster queries on smaller slices of
data
CREATE TABLE employee (id INT, name STRING, salary DOUBLE)
PARTITIONED BY (dept STRING);

• Sub-folders are created based on partition values
/data/hive/warehouse/employee
/dept=presales/
/dept=admin/
/dept=support/



Hive Buckets
• Buckets impose extra structure on the table
• Buckets enable more efficient sampling of data by
dividing data into smaller parts
• Create buckets on a column with high cardinality

[Diagram: the bucket column value of each input record is hashed, and the table's data is divided up into buckets 0, 1 and 2]
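
The hashing step can be sketched in a few lines of Java: hash the bucket column value, then take the result modulo the number of buckets. This is an illustration under stated assumptions, not Hive's code: Java's `hashCode()` stands in for Hive's own type-specific hash, and `BucketSketch` is a hypothetical name.

```java
final class BucketSketch {

    /** Maps a bucket column value to a bucket index in [0, numBuckets). */
    static int bucketFor(Object columnValue, int numBuckets) {
        // hashCode() is a stand-in for Hive's hash function (an assumption of this sketch).
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (columnValue.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int[] customerIds = {7, 8, 9, 10};
        for (int id : customerIds) {
            System.out.println("id " + id + " -> bucket " + bucketFor(id, 3));
        }
    }
}
```

Because equal values always land in the same bucket, a high-cardinality column spreads rows evenly, while a low-cardinality column would leave some buckets empty.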



Hive Skew Tables
• A skewed table helps with columns that have an uneven
distribution of values, where some values occur far more
often than others
• Skewed values are stored in separate directories to
enable efficient queries

CREATE TABLE customer (
id INT,
name STRING,
zip INT
) SKEWED BY (zip) ON (92121, 92120)
STORED AS DIRECTORIES;



Hive Join Strategies

• Shuffle Join
– Approach: Join keys are shuffled using MapReduce and joins are performed on the reduce side.
– Pros: Works regardless of data size or layout.
– Cons: Most resource-intensive and slowest join type.
• Map (Broadcast) Join
– Approach: Small tables are loaded into memory in all nodes, mapper scans through the large table and joins.
– Pros: Very fast, single scan through largest table.
– Cons: All but one table must be small enough to fit in RAM.
• Sort-Merge-Bucket Join
– Approach: Mappers take advantage of co-location of keys to do efficient joins.
– Pros: Very fast for tables of any size.
– Cons: Data must be sorted and bucketed ahead of time.

Source: http://www.slideshare.net/ye.mikez/hive-tuning



Hive UDF
• UDFs let users plug in custom processing code and
invoke it from a query
• Three types of UDF
– UDF (User Defined Functions)
• Operate on a single row to output a single row
– UDAF (User Defined Aggregate Functions)
• Operate on multiple rows to output a single row
– UDTF (User Defined Table Functions)
• Operate on a single row to output multiple rows
• UDFs are written in Java



Hive UDF
• A UDF must extend the UDF class and implement an evaluate
method

package com.impetus.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Lower-cases a string value; the class name matches the one
// used when the function is created as ToLower.
public final class ToLower extends UDF {

    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}



Invoking Hive UDF
• Register the UDF jar file with Hive
• Create a FUNCTION to access the UDF. Temporary functions
are defined only for the duration of the session

ADD JAR /myapp/lib/myhiveudf.jar;

CREATE TEMPORARY FUNCTION ToLower AS
'com.impetus.hive.udf.ToLower';

FROM customer
SELECT ToLower(fName),
ToLower(lName);



Hive Views
• A view in Hive is defined by a SELECT statement
• Views help to:
– Reduce the complexity of a query
– Restrict access to a subset of the actual Hive table

CREATE VIEW 2014_VISITORS AS
SELECT fname, lname, logdate, infoComments
FROM visitor_log
WHERE cast(substring(logdate, 6,4) AS INT) = 2014;



Hive File Formats
• Hive supports different file formats
– Text file
– SequenceFile
– RCFile (Record Columnar File)
– ORC File
• The file format is defined using the STORED AS keyword

CREATE TABLE names (fname STRING, lname STRING)
STORED AS RCFile;



Hive ORC Files
• The Optimized Row Columnar (ORC) file format
provides a highly efficient way to store Hive data.
• Using ORC format improves Hive performance
when reading, writing and processing data

CREATE TABLE tablename (



) STORED AS ORC;

ALTER TABLE tablename SET FILEFORMAT ORC;

SET hive.default.fileformat=Orc;



HCatalog in the Ecosystem

[Diagram: Java MapReduce sits on top of HCatalog, which in turn sits over HDFS and HBase]



Hive Labs



Defining Indexes
CREATE INDEX city_index ON TABLE Customers (city) AS
'COMPACT' WITH DEFERRED REBUILD;

ALTER INDEX city_index ON Customers REBUILD;

SHOW INDEX ON Customers;

DROP INDEX city_index ON Customers;



Overview of Indexes
1. An index is defined on the city column of a table named Customers.
2. An index table is created in the Hive metastore for the city column.
3. The index table now knows which blocks in HDFS contain each city.

[Diagram: table Customers (Col1: name, Col2: age, Col3: city) is linked to city_index in the Hive Metastore, which points to the HDFS blocks for a value such as City = Phoenix]



Stinger Initiative



Vectorization
Vectorization is a new feature that allows Hive to
process a batch of 1024 rows together instead of
one row at a time
• The table needs to be in ORC format
• Needs to be enabled:
SET hive.vectorized.execution.enabled = true;
• Each batch consists of column vectors, and operations
are performed on a whole column vector at a time
• Hive examines the query and data to decide whether
vectorization can be used or not
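
The benefit of batching can be pictured with a column stored as a primitive array and processed in chunks of 1024 rows, so that the inner loop is a tight scan over one column vector. The sketch below sums a column this way. It is only an illustration of the idea, not Hive's vectorized execution engine, and `VectorizedSumSketch` is a hypothetical name.

```java
final class VectorizedSumSketch {
    static final int BATCH_SIZE = 1024; // batch size quoted on the slide

    /** Sums one slice of a column vector in a tight loop over primitives. */
    static long sumBatch(long[] column, int start, int end) {
        long total = 0;
        for (int i = start; i < end; i++) {
            total += column[i];
        }
        return total;
    }

    /** Processes the column batch by batch instead of row by row. */
    static long sum(long[] column) {
        long total = 0;
        for (int start = 0; start < column.length; start += BATCH_SIZE) {
            int end = Math.min(start + BATCH_SIZE, column.length);
            total += sumBatch(column, start, end);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] prices = new long[2500];
        java.util.Arrays.fill(prices, 2L);
        System.out.println(sum(prices)); // three batches: 1024 + 1024 + 452 rows
    }
}
```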



SQL Standard Based Hive Authorization

Privileges:
– SELECT privilege – gives read access to an object.
– INSERT privilege – gives ability to add data to an object (table).
– UPDATE privilege – gives ability to run update queries on an object
(table).
– DELETE privilege – gives ability to delete data in an object (table).
– ALL PRIVILEGES – gives all privileges (gets translated into all the
above privileges).
Objects
– The privileges apply to tables and views. The above privileges are not
supported on databases.



SQL Standard Based Hive Authorization

Users and Roles


– Privileges can be granted to users as well as roles.
– Users can belong to one or more roles.
– There are two roles with special meaning –
public and admin
– When a user runs a Hive query or command, the
privileges granted to the user and her "current roles" are
checked



Understanding Hive on Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Tez avoids unneeded writes to HDFS

[Diagram: the query above planned twice. Hive on MapReduce: separate map (M) and reduce (R) stages for SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN (a, c); JOIN (a, b); GROUP BY a.state with COUNT(*) and AVG(c.price), with writes to HDFS between stages. Hive on Tez: the same stages connected directly, without the intermediate HDFS writes.]

Source: http://www.slideshare.net/ye.mikez/hive-tuning


Transactions



Transaction Use Cases



Transaction Limitations
• BEGIN, COMMIT, and ROLLBACK are not yet
supported
• Only the ORC file format is supported
• Tables must be bucketed



HBase



What Is HBase?
• HBase is a NoSQL datastore built on top of HDFS (Hadoop)
• An Apache Top Level Project
• Based on Google’s BigTable paper
• Open-source, distributed, versioned
• Key Features
– Distributed storage
– Strictly consistent random reads and writes
– Schema-less data model
– Automatic and configurable sharding of tables
– Automatic failover support between RegionServers



When to use?
• Big Data with random read and writes
• Storing large amounts of data (TB/PB)
• High throughput for a large number of requests
• Storing unstructured or variable column data



HBase Data Model



Data Model Terminology
• Table
– An HBase table consists of multiple rows.
• Row
– A row in HBase consists of a row key and one or more columns with
values associated with them
– Rows are sorted lexicographically by row key as they are stored
• Column
– A column in HBase consists of a column family and a column
qualifier, which are delimited by a : (colon) character.
• Column Family
– Column families physically colocate a set of columns and their
values



Data Model Terminology
• Column Qualifier
– A column qualifier is added to a column family to provide the index for a
given piece of data
• Cell
– A cell is a combination of row, column family, and column qualifier, and
contains a value and a timestamp, which represents the value’s version.
• Timestamp
– A timestamp is written alongside each value, and is the identifier for a given
version of a value.



HBase: Keys and Column Families

Each record is divided into Column Families

Each row has a Key

Each column family consists of one or more Columns


Source: http://www.slideshare.net/aillonianilreddy/hadoop-32452974



HBase Logical View

Source: http://www.slideshare.net/aillonianilreddy/hadoop-32452974



Rows and Columns

No storage penalty for unused columns

Row keys identify a row


Each Column Family can have many columns

Source: http://www.slideshare.net/aillonianilreddy/hadoop-32452974



HBase Architecture



Terminology
• Node
– Physical server
• Cluster
– Group of nodes
• Master Node
– Co-ordinates the nodes in the cluster
• Slave Node
– Worker nodes that perform tasks
• Daemon
– A process or service



Daemons
• HBase Master
• RegionServer
• ZooKeeper
• HDFS
– NameNode/Standby NameNode
– DataNode



Basic Concepts
• HBase data is stored in Tables
– Similar to RDBMS tables
• Table data is stored on HDFS
– Split into blocks and stored across nodes in the cluster
• Architecturally, tables are big, sorted, distributed
maps
• Tables are sharded/partitioned and replicated
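
The "sorted map" view of a table can be made concrete with nested sorted maps: row key, then column (family:qualifier), then timestamp, then value. The Java sketch below models a single in-memory table this way. It ignores distribution, regions and persistence entirely, and `SortedMapTableSketch` is a hypothetical name rather than an HBase class.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

final class SortedMapTableSketch {
    // row key -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
        new TreeMap<>();

    /** Writes one cell version; existing versions of the cell are kept. */
    void put(String rowKey, String column, long timestamp, String value) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) {
            row = new TreeMap<>();
            rows.put(rowKey, row);
        }
        NavigableMap<Long, String> versions = row.get(column);
        if (versions == null) {
            versions = new TreeMap<Long, String>(Collections.<Long>reverseOrder());
            row.put(column, versions);
        }
        versions.put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if the cell does not exist. */
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) {
            return null;
        }
        NavigableMap<Long, String> versions = row.get(column);
        if (versions == null || versions.isEmpty()) {
            return null;
        }
        return versions.firstEntry().getValue(); // newest first because of the reverse ordering
    }

    /** Rows are kept sorted by key, so the first key is the smallest one. */
    String firstRowKey() {
        return rows.isEmpty() ? null : rows.firstKey();
    }

    public static void main(String[] args) {
        SortedMapTableSketch table = new SortedMapTableSketch();
        table.put("row2", "info:city", 1L, "Phoenix");
        table.put("row1", "info:city", 1L, "Austin");
        table.put("row1", "info:city", 2L, "Denver");
        System.out.println(table.firstRowKey());
        System.out.println(table.get("row1", "info:city"));
    }
}
```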



HBase Regions
• HBase tables are split into regions
– A region is a piece of a table
– Created through sharding or partitioning
• RegionServer daemons serve regions
– Run on each slave node in the cluster
– Serve multiple regions belonging to several different tables
– Usually one table is split among multiple RegionServers
• RegionServers also
– Communicate with the client and handle data-related
operations
– Handle read and write requests for all the regions
under them
– Decide the size of a region by following the region
size thresholds



HBase Master

• The Master co-ordinates the RegionServers
– Co-ordinates which regions are served by which
RegionServer
– Handles new table creation and region
movement
– Interface for all metadata changes
– In a distributed cluster, the Master typically runs
on the NameNode.
– Handles load balancing of the regions across
RegionServers. It unloads the busy servers and
shifts the regions to less occupied servers.



HBase and ZooKeeper

• An HBase cluster may have multiple masters
for HA
– Only one master controls the cluster
– ZooKeeper co-ordinates the masters
• ZooKeeper service runs on each master
– Upon start-up all masters connect to ZooKeeper
– They all are eligible to get control
– The first to connect gets the control
– If the controlling master fails, the rest of the masters
compete to get control and one of them gets control



HBase Regions

Source: http://hbase.apache.org/



Store
• A Store hosts a MemStore and 0 or more StoreFiles
(HFiles).
• A Store corresponds to a column family for a table for a
given region
– MemStore
• The MemStore holds in-memory modifications to the Store
– StoreFile (HFile)
• Memstore is flushed to StoreFiles
• StoreFiles are where your data lives.
• HFile Format
– The HFile file format is based on the SSTable from BigTable
– Blocks
• StoreFiles are composed of blocks. The blocksize is configured on a
per-ColumnFamily basis



RegionServer Architecture

Source: http://www.slideshare.net/xefyr/h-base-for-architectspptx



Compaction
Compaction is an operation which reduces the
number of StoreFiles in a Store, by merging them
together
– After the MemStore reaches a given size, it flushes its
contents to a StoreFile.
– The number of StoreFiles in a Store increases over time
– Two categories
• Minor compactions usually select a small number of
small, adjacent StoreFiles and rewrite them as a
single StoreFile
• Major compaction results in a single StoreFile per
Store
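
The effect of a major compaction can be sketched as merging several sorted files into one sorted file, with newer entries replacing older ones for the same key. The Java sketch below is deliberately simplified: it keeps a single value per key and ignores versions, tombstones and block layout, and `CompactionSketch` is a hypothetical name rather than an HBase class.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class CompactionSketch {

    /**
     * Merges store files (ordered oldest to newest) into one sorted file.
     * When the same key appears in several files, the newest value wins.
     */
    static SortedMap<String, String> compact(List<SortedMap<String, String>> storeFiles) {
        SortedMap<String, String> merged = new TreeMap<>();
        for (SortedMap<String, String> file : storeFiles) {
            merged.putAll(file); // later (newer) files overwrite keys from earlier ones
        }
        return merged;
    }

    public static void main(String[] args) {
        SortedMap<String, String> older = new TreeMap<>();
        older.put("row1", "v1");
        older.put("row3", "v1");
        SortedMap<String, String> newer = new TreeMap<>();
        newer.put("row1", "v2");
        newer.put("row2", "v1");
        System.out.println(compact(Arrays.asList(older, newer)));
    }
}
```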



HBase Data Access



Four primary data model operations
• Gets
– Gets a row’s data based on the row key
• Puts
– Upserts a row with data based on the row key
• Scans
– Finds all matching rows based on the row key
– Scans can be refined by using filters
• Deletes
– Deletes are handled by creating new markers
called tombstones



Write Path

[Diagram: a Client and a cluster made up of a NameNode, a Standby NameNode, three Masters, three ZooKeeper nodes, and DataNodes each co-located with a RegionServer]
1. Which RegionServer is serving the Region?
2. Write to RegionServer

Source: http://www.slideshare.net/HBaseCon/hbase-just-the-basics



Write Path

Source: http://blog.cloudera.com/blog/2012/06/hbase-write-path/



Read Path

[Diagram: a Client and a cluster made up of a NameNode, a Standby NameNode, three Masters, three ZooKeeper nodes, and DataNodes each co-located with a RegionServer]
1. Which RegionServer is serving the Region?
2. Read from RegionServer

Source: http://www.slideshare.net/HBaseCon/hbase-just-the-basics



Puts

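// Create a Put for the row identified by the row key bytes
Put p = new Put(ROW_KEY_BYTES);
// Add one cell: column family, column qualifier, value
p.add(COLFAM_BYTES, COLDESC_BYTES,
Bytes.toBytes("value"));
// Send the Put to the table
table.put(p);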

Source: http://www.slideshare.net/HBaseCon/hbase-just-the-basics



Gets

// Fetch the row, then read one cell and decode it as a String
Get g = new Get(ROW_KEY_BYTES);
Result r = table.get(g);
byte[] byteArray = r.getValue(COLFAM_BYTES, COLDESC_BYTES);
String columnValue = Bytes.toString(byteArray);

Source: http://www.slideshare.net/HBaseCon/hbase-just-the-basics



Filters
Generally used via the Java API
• PrefixFilter
– prefix of a row key
• ColumnPrefixFilter
– a column prefix
• InclusiveStopFilter
– row key on which to stop scanning
• FamilyFilter
• QualifierFilter
• ColumnRangeFilter



HBase CoProcessor



Coprocessor Framework
A framework
– that provides a library and runtime environment for
executing user code within the HBase region server and
master processes
– flexible and generic extension of HBase functionality
– distributed computation directly within the HBase server
processes

Characteristics:
– Arbitrary code can run at each RegionServer
– High-level call interface for clients
– Calls are addressed to rows or ranges of rows, and the
coprocessor client library resolves them to actual locations
– Calls across multiple rows are automatically split into multiple
parallelized RPCs
– Provides a very flexible model for building distributed
services
Coprocessor Types (deployment)
Based on deployment
• System Coprocessors
– loaded globally on all tables and regions hosted by the
region server
• Table coprocessors
– loaded on all regions for a table on a per-table basis



Coprocessor Types (functionality)
Based on functionality
• Observers
– Can be thought of like database triggers
– User code inserted by overriding upcall methods provided
by the coprocessor framework
– Functions are executed from core HBase code when
certain events occur
– framework handles all of the details of invoking callbacks
• Endpoint
– Resembling stored procedures
– Can be invoked at any time from the client
– executed remotely at the target region or regions



Observer Types
Three kinds of observers
• RegionObserver
– hooks for data manipulation events, Get, Put, Delete, Scan
– an instance of a RegionObserver coprocessor for every
table region
• WALObserver
– hooks for write-ahead log (WAL) related operations
– one such context per region server
• MasterObserver
– hooks for DDL-type operations, i.e., create, delete, modify
table, etc.
– runs within the context of the HBase master



Region Observer
Provides callbacks for
• preOpen, postOpen:
• Called before and after the region is reported as online to the master.
• preFlush, postFlush:
• Called before and after the memstore is flushed into a new store file.
• preGet, postGet:
• Called before and after a client makes a Get request.
• preExists, postExists:
• Called before and after the client tests for existence using a Get.
• prePut and postPut:
• Called before and after the client stores a value.
• preDelete and postDelete:
• Called before and after the client deletes a value.
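
The pre/post callback pattern can be sketched as follows. This is a toy model in Python: `RegionObserver`, `pre_put`, `post_put` and `ToyRegion` are hypothetical names that only mirror the idea of prePut/postPut, not the real Java coprocessor interfaces.

```python
class RegionObserver:
    """Base class with no-op hooks; subclasses override the ones they need."""

    def pre_put(self, row_key, value):
        return value  # may return a modified value

    def post_put(self, row_key, value):
        pass

class TrimObserver(RegionObserver):
    def pre_put(self, row_key, value):
        # Runs before the value is stored, much like a BEFORE trigger.
        return value.strip()

class AuditObserver(RegionObserver):
    def __init__(self):
        self.audit_log = []

    def post_put(self, row_key, value):
        # Runs after the value is stored, much like an AFTER trigger.
        self.audit_log.append((row_key, value))

class ToyRegion:
    def __init__(self, observers):
        self.rows = {}
        self.observers = list(observers)

    def put(self, row_key, value):
        # The framework, not the client, invokes every registered callback.
        for observer in self.observers:
            value = observer.pre_put(row_key, value)
        self.rows[row_key] = value
        for observer in self.observers:
            observer.post_put(row_key, value)

audit = AuditObserver()
region = ToyRegion([TrimObserver(), audit])
region.put("row1", "  hello  ")
```

The client only calls `put`; both observers run inside the region, which is the property that makes observers comparable to database triggers.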



Master and WAL Observer

MasterObserver provides upcalls for:
• preCreateTable/postCreateTable:
– Called before and after a table is created.
• preDeleteTable/postDeleteTable:
– Called before and after a table is deleted.

WALObserver provides upcalls for:
• preWALWrite/postWALWrite:
– Called before and after a WALEdit is written to the WAL.



Endpoint

• Endpoint is an interface for dynamic RPC
extension
• Installed on the server side and can then be
invoked with HBase RPC
• The client library provides convenience
methods for invoking such dynamic
interfaces.



Endpoint Invocation

Source: https://blogs.apache.org/hbase/entry/coprocessor_introduction



Loading Coprocessor

• Load through configuration entries
– hbase.coprocessor.region.classes: for RegionObservers and Endpoints
– hbase.coprocessor.master.classes: for MasterObservers
– hbase.coprocessor.wal.classes: for WALObservers
– The jar file must reside on the server-side HBase classpath
• Load from shell
– Load on a per-table basis, via the shell command "alter" + "table_att"
– A coprocessor attribute is added to the table. It contains:
• File path: the jar file containing the coprocessor implementation
• Class name: the full class name of the coprocessor
• Priority: an integer
• Arguments: passed to the coprocessor implementation

Source: https://blogs.apache.org/hbase/entry/coprocessor_introduction



HBase Schema Design



Schema Design
• Access patterns must be known and ascertained up front
• Denormalize to improve performance
– Fewer, bigger tables
• HBase does not do well with more than two or three
column families
• Rows are sorted lexicographically: keep similar rows
together, but don’t hotspot
– Use salting or hashing
• Minimize row and column sizes
– Keep ColumnFamily names as small as possible, preferably one character



Schema Design
• Prefer shorter attribute names
• Use TTL wherever possible
• Factor the potential for joins into the schema design
• Use monotonically increasing row keys/time-series
data carefully
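
One way to picture salting for monotonically increasing keys is the sketch below. It assumes four buckets, a SHA-256 based prefix and a "NN-" key layout; all three are illustrative choices rather than anything prescribed by HBase.

```python
import hashlib

def salted_key(row_key, num_buckets=4):
    """Prefix the row key with a bucket derived from a hash of the key itself,
    so that consecutive keys are spread over `num_buckets` key ranges."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    bucket = digest[0] % num_buckets
    return f"{bucket:02d}-{row_key}"

def scan_prefixes(num_buckets=4):
    """A reader has to fan out over every bucket prefix to see all rows."""
    return [f"{bucket:02d}-" for bucket in range(num_buckets)]

# Sixty consecutive timestamps that would otherwise land in one region.
keys = [salted_key(f"2015-06-01T12:00:{second:02d}") for second in range(60)]
```

The trade-off is that writes no longer pile onto one region, but a time-range scan becomes one scan per bucket prefix.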



Elastic Search



Apache Lucene
• Fast, high-performance, scalable search/IR library
• Open source
• Originally written in Java; ported
to Delphi, Perl, C#, C++, Python, Ruby, and PHP
• Initially developed by Doug Cutting (also the author of
Hadoop)
• Indexing and searching
• Used by companies like Twitter, LinkedIn, Wikipedia etc.



Apache Lucene … Features !

• Full text search; fielded searching (e.g. title, author,
contents)
• Powerful query types: phrase queries, wildcard
queries, proximity queries, range queries
• Ranked searching
• Fast, memory-efficient and typo-tolerant suggesters
• Sorting by any field



Type of Queries
• Term Query
– useful for retrieving documents by a key.

• Prefix Query
– matches documents containing terms beginning with a specified string.

• Range Query
– facilitates searches from a starting term through an ending term.

• Boolean Query
– allows for logical AND, OR, and NOT combinations.

• Phrase Query
– matches documents containing terms in a particular sequence, using the positional information of terms stored in the index.

• Fuzzy Query
– matches terms similar to a specified term.
• Boost Query
– boosts the relevance of a particular term
Lucene in a search system

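[Diagram: components of a search system built around the Lucene index
– Indexing side: Raw Content → Acquire content → Build document → Analyze document → Index document
– Search side: Users → Search UI → Build query → Run query → Render results]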

Source: http://web.stanford.edu/class/cs276/handouts/lecture-lucene.pptx



How Lucene models content
The fundamental concepts in Lucene are index,
document, field and term.
– An index contains a sequence of documents.

– A document is a sequence of fields.

– A field is a named sequence of terms.

– A term is a sequence of bytes.



How Lucene models content ….
• A Document is the atomic unit of indexing and
searching
– A Document contains Fields

• Fields have a name and a value


– You have to translate raw content into Fields
Examples: Title, author, date, abstract, body, URL, keywords
– Different documents can have different fields
– Search a field using name:term, e.g., title:lucene
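
The index/document/field/term model can be sketched as a small inverted index, assuming whitespace tokenization and lowercasing. `ToyIndex` and its methods are hypothetical and far simpler than Lucene's own classes.

```python
from collections import defaultdict

class ToyIndex:
    def __init__(self):
        self.docs = []                     # document number -> stored fields
        self.postings = defaultdict(list)  # (field name, term) -> document numbers

    def add_document(self, fields):
        doc_id = len(self.docs)            # the first document is numbered zero
        self.docs.append(fields)
        for name, value in fields.items():
            # Each distinct term of each field points back at the document.
            for term in set(value.lower().split()):
                self.postings[(name, term)].append(doc_id)
        return doc_id

    def search(self, query):
        """Accepts the name:term form, e.g. title:lucene."""
        name, term = query.split(":", 1)
        return self.postings.get((name, term.lower()), [])

index = ToyIndex()
index.add_document({"title": "Lucene in Action", "author": "Erik"})
index.add_document({"title": "Hadoop Guide", "body": "uses lucene"})
```

Note that the two documents carry different fields, and that `title:lucene` and `body:lucene` resolve to different documents.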



Fields
• Fields may
– Be indexed or not
• Indexed fields may or may not be analyzed (i.e., tokenized with an
Analyzer)
– Non-analyzed fields view the entire value as a single token (useful
for URLs, paths, dates, social security numbers, ...)

– Be stored or not
• Useful for fields that you’d like to display to users

– Optionally store term vectors


• Like a positional index on the Field’s terms
• Useful for highlighting, finding similar documents, categorization



Index Format and structure
• Segments
– Lucene indexes may be composed of multiple sub-indexes,
or segments.
– Each segment is a fully independent index
– New segments created for newly added documents.
– Existing segments are merged

• Document Numbers
– Internally, Lucene refers to documents by an integer document
number
– first document added to an index is numbered zero and so on..





Index Format and structure …

• Each segment index maintains


– Segment info.
• This contains metadata about a segment, such as the number of
documents, what files it uses.
– Field names
– Stored Field values
– Term dictionary.
• A dictionary containing all of the terms used in all of the indexed fields of all of
the documents.

– Term Frequency data


– Term Proximity data
– Normalization factors
– Term Vectors
Index Format and structure …

• Stores terms and documents in arrays


– Binary search

Source: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal



Index Format and structure …Insertion & Merge !

• Insertion = write a new Segment


• Merge Segments when there are too many of them
– Concatenate docs, merge terms dicts and posting lists (merge sort !)

Source: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal



Index Format and structure … Deletion !

• Deletion = turn a bit off


• Ignore deleted docs when searching and merging
• Merge policies favor segments with many deletions
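
Insertion, deletion and merging can be sketched together, assuming documents are plain strings and a merge is triggered once a fixed segment count is exceeded. `Segment`, `ToySegmentedIndex` and `max_segments` are hypothetical names; real merge policies are considerably more selective.

```python
class Segment:
    def __init__(self, docs):
        self.docs = list(docs)
        self.live = [True] * len(self.docs)  # one "live" bit per document

class ToySegmentedIndex:
    def __init__(self, max_segments=3):
        self.segments = []
        self.max_segments = max_segments

    def add_batch(self, docs):
        # Insertion = write a new segment; existing segments are never modified.
        self.segments.append(Segment(docs))
        if len(self.segments) > self.max_segments:
            self.merge()

    def delete(self, doc):
        # Deletion = turn a bit off; the document stays in its segment for now.
        for segment in self.segments:
            for i, existing in enumerate(segment.docs):
                if existing == doc:
                    segment.live[i] = False

    def merge(self):
        # Merging concatenates live documents and drops the deleted ones.
        survivors = [doc for segment in self.segments
                     for doc, alive in zip(segment.docs, segment.live) if alive]
        self.segments = [Segment(survivors)]

    def search(self, word):
        # Deleted documents are ignored at search time.
        return [doc for segment in self.segments
                for doc, alive in zip(segment.docs, segment.live)
                if alive and word in doc.split()]

idx = ToySegmentedIndex(max_segments=2)
idx.add_batch(["a b"])
idx.add_batch(["b c"])
idx.delete("a b")
hits_before_merge = idx.search("b")
segments_before_merge = len(idx.segments)
idx.add_batch(["c d"])  # a third segment exceeds the limit and triggers a merge
```

After the merge the deleted document is physically gone, which is why merging segments with many deletions reclaims the most space.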

Source: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal



Indexing , Analyzing and Searching



Indexing Pipeline

• Analyzer: creates tokens using a Tokenizer and/or applying Filters (Token
Filters)
• Each field can define an Analyzer at index time, query time, or both

Source: http://www.slideshare.net/otisg/lucene-introduction



Indexing and Querying Summary

Source: http://www.slideshare.net/saumitra121/apache-solr-workshop



Core Analysis : Four main parts
• Analyzer
– Responsible for supplying a TokenStream which can be consumed by the
indexing and searching processes

• CharFilter
– Used to transform the text before it is tokenized

• Tokenizer
– Responsible for breaking up incoming text into tokens.

• TokenFilter
– Responsible for modifying tokens that have been created by the
Tokenizer



Core Analysis : Post-Tokenization
Many post-tokenization steps :
• Stemming
– Replacing words with their stems

• Stop Words Filtering
– Removing common words like "the", "and" and "a"

• Text Normalization
– Stripping accents and other character markings

• Synonym Expansion
– Adding in synonyms at the same token position
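
The analysis chain (CharFilter, then Tokenizer, then TokenFilters) can be sketched as a composition of small functions. The stop-word list, the "&" replacement and the crude suffix-stripping stemmer below are assumptions made for illustration, not Lucene's analyzers.

```python
import re
import unicodedata

STOP_WORDS = {"the", "and", "a"}

def char_filter(text):
    # Transforms the raw text before it is tokenized.
    return text.replace("&", " and ")

def tokenizer(text):
    # Breaks the incoming text into tokens.
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

def accent_filter(tokens):
    # Text normalization: strip accents and other combining marks.
    stripped = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        stripped.append("".join(ch for ch in decomposed if not unicodedata.combining(ch)))
    return stripped

def stop_filter(tokens):
    return [token for token in tokens if token not in STOP_WORDS]

def stem_filter(tokens):
    # Deliberately crude stemming: drop a trailing "ing" or "s".
    stems = []
    for token in tokens:
        if token.endswith("ing") and len(token) > 5:
            token = token[:-3]
        elif token.endswith("s") and len(token) > 3:
            token = token[:-1]
        stems.append(token)
    return stems

def analyze(text):
    tokens = tokenizer(char_filter(text))
    for token_filter in (lowercase_filter, accent_filter, stop_filter, stem_filter):
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("The Café & a running dogs")
```

The same `analyze` function would have to be applied to query text as well, otherwise the query terms would not match the terms stored in the index.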



Analyzer Overview

Source: http://www.slideshare.net/saumitra121/apache-solr-workshop
Searching Data : Basic Concepts

Source: http://www.slideshare.net/saumitra121/apache-solr-workshop
What is Elastic Search?
• Elastic Search is an open source (Apache 2), distributed
search engine built on top of Apache Lucene
• Elastic Search functionality can be accessed through an API and a
RESTful service interface
• Elastic Search is built to be distributed from the ground up, so it
can easily scale from one to hundreds of machines
• It provides features like fault tolerance and high availability



Elastic Search - Terms
• Index
– An index is a collection of documents that have somewhat similar
characteristics
– Identified by a name (that must be all lowercase)
– This name is used to refer to the index when performing indexing,
search, update, and delete operations against the documents in it
• Document
– A document is a basic unit of information that can be indexed
– Expressed in JSON
• Document Type
– Each index can store different types of documents
– Organizing documents into types helps in data manipulation



Elastic Search - Terms
• Cluster
– Collection of one or more nodes (servers) that together hold your
entire data
– Provides federated indexing and search capabilities across all
nodes
– Identified by a unique name, which by default is "elasticsearch"
– This name is important because a node can only be part of a cluster
if the node is set up to join the cluster by its name
• Node
– A node is a single server that is part of your cluster, stores your
data, and participates in the cluster’s indexing and search
capabilities
– Node is identified by a name
– Name is important for administration purposes



Elastic Search - Terms
• Shard
– An index can be subdivided into multiple pieces called shards
– The number of shards is defined when the index is created
– Each shard is in itself a fully functional and independent "index"
• Replica
– Copy of a primary shard
– The number of shards and replicas can be defined per index at the time
the index is created
– Each shard is in itself a fully functional and independent "index"



Elastic Search – The Cluster
• Elasticsearch provides a very comprehensive and powerful
REST API
– Check cluster, node, and index health, status, and statistics
– Administer cluster, node, and index data and metadata
– Perform CRUD (Create, Read, Update, and Delete) and search
operations against indexes
– Execute advanced search operations such as paging, sorting,
filtering, scripting, faceting, aggregations, and many others



Elastic Search – Index API
Adds document into the "twitter" index, under a type
called "tweet" with an id of 1:

Source: https://www.elastic.co/products/elasticsearch



Elastic Search – Get API
Get a document from the "twitter" index, under a type
called "tweet" with an id of 1:



Elastic Search – Delete API
Delete document from the "twitter" index, under a
type called "tweet" with an id of 1:

Delete by query

Source: https://www.elastic.co/products/elasticsearch



Elastic Search – Query DSL
• JSON-style domain-specific language used to execute
queries
• Request bodies sent to localhost:9200/bank/_search
• Match all and return the first result
– { "query": { "match_all": {} }, "size": 1 }
• Match all and return results 11-20
– { "query": { "match_all": {} }, "from": 10, "size": 10 }
• Match all and return sorted results
– { "query": { "match_all": {} }, "sort": { "balance": { "order": "desc" } } }
• Match all and return only two fields
– { "query": { "match_all": {} }, "_source": ["account_number", "balance"] }
• Match a specific value for a field
– { "query": { "match": { "account_number": 20 } } }



Elastic Search – Query DSL
• Match where address contains "mill" OR "lane"
– { "query": { "match": { "address": "mill lane" } } }
• Match where address contains the phrase "mill lane"
– { "query": { "match_phrase": { "address": "mill lane" } } }
• Boolean
– must: every clause has to match ("mill" AND "lane")
{ "query": { "bool": { "must": [ { "match": { "address": "mill" } }, { "match": { "address": "lane" } } ] } } }
– should: at least one clause has to match ("mill" OR "lane")
{ "query": { "bool": { "should": [ { "match": { "address": "mill" } }, { "match": { "address": "lane" } } ] } } }
• Filtering
– { "query": { "filtered": { "query": { "match_all": {} }, "filter": { "range": { "balance": { "gte": 20000, "lte": 30000 } } } } } }



Elastic Search – Query DSL
• Aggregations
– Group all the accounts by state, and return the top 10
(default) states sorted by count descending (also default)
{ "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state" } } } }
– Calculate the average account balance by state (again only for the
top 10 states sorted by count in descending order)
{ "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state" }, "aggs": { "average_balance": { "avg": { "field": "balance" } } } } } }

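
What the nested terms/avg aggregation computes can be sketched in plain Python over a handful of in-memory account records. The sample data, the function name and the tie-breaking rule (alphabetical by key) are assumptions for illustration; no Elasticsearch client is involved.

```python
from collections import defaultdict

def terms_with_avg(docs, field, metric, size=10):
    """Group documents by `field`, then average `metric` inside each bucket.
    Buckets are ordered by document count, descending, and cut to `size`."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[field]].append(doc[metric])
    buckets = [
        {"key": key, "doc_count": len(values), "average_balance": sum(values) / len(values)}
        for key, values in groups.items()
    ]
    buckets.sort(key=lambda bucket: (-bucket["doc_count"], bucket["key"]))
    return buckets[:size]

# Hypothetical account records.
accounts = [
    {"state": "TX", "balance": 100},
    {"state": "TX", "balance": 300},
    {"state": "CA", "balance": 50},
]
buckets = terms_with_avg(accounts, "state", "balance")
```

The outer grouping plays the role of the "terms" aggregation and the per-bucket mean plays the role of the nested "avg" aggregation.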


Elastic Search – Distributed Store
• Contains one index that has two primary shards.
Each primary shard has two replicas.
• Copies of the same shard are never allocated to
the same node

Source: https://www.elastic.co/products/elasticsearch



Elastic Search – Creating , indexing , deleting
• Create, index, and delete requests are write
operations, which must be successfully completed
on the primary shard before they can be copied to
any associated replica shards

Source: https://www.elastic.co/products/elasticsearch



Elastic Search – Creating , indexing , deleting
• The client sends a create, index, or delete request to Node
1.
• The node uses the document’s _id to determine that the
document belongs to shard 0. It forwards the request to
Node 3, where the primary copy of shard 0 is currently
allocated.
• Node 3 executes the request on the primary shard.
• If it is successful, it forwards the request in parallel to the
replica shards on Node 1 and Node 2.
• Once all of the replica shards report success, Node 3
reports success to the requesting node, which reports
success to the client.
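
The routing and write sequence above can be sketched as follows, assuming the shard is chosen as hash(_id) modulo the number of primary shards. CRC32 stands in for the real hash function, and shards are modelled as plain dictionaries; both are simplifications.

```python
import zlib

def shard_for(doc_id, number_of_primary_shards):
    # Stand-in hash: any stable hash of the _id works for the illustration.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards

def index_document(doc_id, doc, primaries, replicas):
    """Apply a write to the primary shard first, then to each of its replicas."""
    shard = shard_for(doc_id, len(primaries))
    primaries[shard][doc_id] = doc      # must succeed on the primary first
    for replica in replicas[shard]:     # then the request is forwarded to the replicas
        replica[doc_id] = doc
    return shard

# Two primary shards, each with two replicas, as in the distributed store example.
primaries = [{}, {}]
replicas = [[{}, {}], [{}, {}]]
shard = index_document("1", {"user": "alice"}, primaries, replicas)
```

Because the shard is derived from the document's _id and the number of primary shards, the same _id always routes to the same shard as long as that number stays fixed.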



Elastic Search – Replication
• The default value for replication is sync. This
causes the primary shard to wait for successful
responses from the replica shards before
returning.
• If you set replication to async, it will return success
to the client as soon as the request has been
executed on the primary shard.



Elastic Search – Consistency
• By default, the primary shard requires a quorum,
or majority, of shard copies (where a shard copy
can be a primary or a replica shard) to be available
before even attempting a write operation.
• This is to prevent writing data to the “wrong side”
of a network partition. A quorum is defined as
follows:
int( (primary + number_of_replicas) / 2 ) + 1
• The allowed values for consistency are
o one (just the primary shard),
o all (the primary and all replicas),
o or the default quorum, or majority, of shard copies.
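
The quorum formula can be checked with a few lines of arithmetic. The helper names below are hypothetical; only the formula and the three consistency values come from the slide.

```python
def quorum(number_of_replicas, primary=1):
    # int( (primary + number_of_replicas) / 2 ) + 1
    return int((primary + number_of_replicas) / 2) + 1

def write_allowed(active_copies, number_of_replicas, consistency="quorum"):
    """Return True if enough shard copies are available to attempt a write."""
    if consistency == "one":
        required = 1
    elif consistency == "all":
        required = 1 + number_of_replicas
    elif consistency == "quorum":
        required = quorum(number_of_replicas)
    else:
        raise ValueError(f"unknown consistency value: {consistency}")
    return active_copies >= required

# One primary with two replicas: int(3 / 2) + 1 = 2 copies are required.
required_for_two_replicas = quorum(2)
```

With two replicas, a write is attempted when two of the three copies are available, so one node can be lost without blocking writes.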



Elastic Search – Retrieving
• A document can be retrieved from a primary shard
or from any of its replicas

Source: https://www.elastic.co/products/elasticsearch



Elastic Search – Retrieving
• The client sends a get request to Node 1.
• The node uses the document’s _id to determine
that the document belongs to shard 0. Copies of
shard 0 exist on all three nodes. On this occasion,
it forwards the request to Node 2.
• Node 2 returns the document to Node 1, which
returns the document to the client.
• For read requests, the requesting node will choose
a different shard copy on every request in order to
balance the load; it round-robins through all shard
copies.
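
Round-robin selection of a shard copy can be sketched with `itertools.cycle`. The `ShardCopies` class is hypothetical and the node names simply follow the example above.

```python
import itertools

class ShardCopies:
    """Hands out the copies of one shard in turn, so reads are spread evenly."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)

    def next_copy(self):
        return next(self._cycle)

# Copies of shard 0 exist on all three nodes.
shard0 = ShardCopies(["Node 1", "Node 2", "Node 3"])
chosen = [shard0.next_copy() for _ in range(4)]
```

Each read request goes to the next copy in the cycle, so the fourth request lands on the same node as the first.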
Spark



Apache Spark
• Originally developed in 2009 in UC Berkeley’s AMP Lab
• Fully open sourced in 2010
– now a Top Level Project at the Apache Software Foundation

spark.apache.org
github.com/apache/spark
user@spark.apache.org



Spark is the Most Active Open Source Project in Big Data

[Bar chart: project contributors in past year, on an axis from 20 to 140; labelled bars include Giraph, Storm and Tez]



Unified Platform

Spark SQL (SQL) | Spark Streaming (Streaming) | MLlib (Machine learning) | GraphX (Graph computation)

Spark (General execution engine)



Easy and Fast Big Data

• Easy to Develop
– Rich APIs in Java, Scala, Python
– Interactive shell
– 2-5× less code
• Fast to Run
– General execution graphs
– In-memory storage
– Up to 10× faster on disk, 100× in memory



Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop
InputFormat
• HBase



Deploying Spark – Cluster Manager Types
• Mesos
• Standalone mode
• YARN



Quick Terminology
• Tasks: Fundamental unit of work
• Stage: Set of tasks that run in parallel
• DAG: Logical graph of RDD operations
• RDD: Parallel dataset with partitions



Key Concept: RDDs

Write programs in terms of operations on distributed datasets

Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)



RDD Components

Lineage
• Set of partitions ("splits" in Hadoop)
• List of dependencies on parent RDDs
• Function to compute a partition given its parents

Optimized execution
• (Optional) partitioner (hash, range)
• (Optional) preferred location for each partition



More RDD Operators

• map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin
• reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip
• sample • take • first • partitionBy • mapWith • pipe • save • ...



Spark Execution



Spark Execution – Typical flow

[Diagram sequence: a Driver above three Executors; each Executor sits next to one data block (Block 1, Block 2, Block 3) and, from step 3 onward, holds a Cache]

1. The Driver submits stages of tasks to the low-level scheduler, e.g. YARN, Mesos, Standalone
2. Executors read HDFS-local data
3. Executors process and cache the data
4. Executors report back the results
5. The next stage is processed from cache
6. Executors report back the results




Word Count Example
Program in Scala

val input = sc.textFile("hdfs://name.txt")
val count = input.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
count.saveAsTextFile("hdfs://wordcount.txt")
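
The same flatMap, map and reduceByKey steps can be traced on a single machine with plain Python. This sketch only mirrors the data flow; it is not Spark code, and `word_count` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import chain

def word_count(lines):
    # flatMap: every line yields several words, flattened into one stream.
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair every word with the count 1.
    pairs = ((word, 1) for word in words)
    # reduceByKey(_ + _): add up the counts that share a key.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

result = word_count(["big data big camp"])
```

For the input line "big data big camp" this yields a count of 2 for "big" and 1 each for "data" and "camp".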



Word Count Example
Stage 1 (tasks and DAG of operations)
• sc.textFile("hdfs://name.txt") → HadoopRDD
– big data big camp
• flatMap(line => line.split(" ")) → MapRDD
– big, data, big, camp
• map(word => (word, 1)) → MapRDD
– (big, 1) (data, 1) (big, 1) (camp, 1)

Stage 2
• reduceByKey(_ + _) (a transformation that requires a shuffle)
– (big, [1, 1]) (data, [1]) (camp, [1])
– (big, 2) (data, 1) (camp, 1)
• count.saveAsTextFile("hdfs://wordcount.txt") (action)
– res = [(big,2),(data,1),(camp,1)]



Job Execution

1. Build an operator DAG
2. Split graph into stages of tasks
3. Schedule and execute tasks

[Diagram: an Executor holding a Cache, a Task and a Block]



Spark Execution flow – example 1

sc.textFile("/some-hdfs-data")               // RDD[String]
  .map(line => line.split("\t"))             // RDD[Array[String]]
  .map(parts => (parts(0), parts(1).toInt))  // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)                     // RDD[(String, Int)]
  .collect()                                 // Array[(String, Int)]

textFile → map → map → reduceByKey → collect


Source: https://spark-summit.org/2013/talk/wendell-understanding-the-performance-of-spark-applications/



Directed Acyclic Graph (DAG)
• Directed: edges point in a single direction
• Acyclic: no looping
• Supports fault tolerance

[Diagram: example DAG containing Join and GroupBy operators]



Split graph into stages of tasks
• Pipeline as much as possible
• Split into stages of tasks

Operator graph: textFile → map → map → reduceByKey → collect

Split: Stage 1 | Stage 2, with the boundary at the shuffle inside reduceByKey


Source: https://spark-summit.org/2013/talk/wendell-understanding-the-performance-of-spark-applications/



Split graph into stages of tasks (contd…)

textFile → map → map → reduceByKey → collect

Stage 1
• read HDFS split
• apply both maps
• partial reduce
• write shuffle data

Stage 2
• read shuffle data
• final reduce
• send result to driver

Source: https://spark-summit.org/2013/talk/wendell-understanding-the-performance-of-spark-applications/



Stage execution

Stage 1: Task 1, Task 2, Task 3, Task 4

– Create a task for each partition in the new RDD
– Serialize task
– Schedule and ship task to slaves

Source: https://spark-summit.org/2013/talk/wendell-understanding-the-performance-of-spark-applications/



Task execution

– Task is the fundamental unit of execution in Spark
A. Fetch input from an InputFormat or a shuffle
B. Execute the task
C. Materialize task output as shuffle data or a driver result

[Diagram: pipelined execution, in which fetching input, executing the task and writing output overlap]

Source: https://spark-summit.org/2013/talk/wendell-understanding-the-performance-of-spark-applications/



Spark execution flow – example 2

[Slide sequence of diagrams; images not reproduced]
• Spark execution flow – example 2 (2 slides)
• Build an operator DAG (2 slides)
• Split graph into stages of tasks (2 slides)
– Pipeline as much as possible
– Split into stages of tasks
• Schedule and execute tasks (5 slides)

Source: https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf

Shuffle
• Redistribute data among partitions
• Hash keys into buckets
• Optimizations
– Avoided when possible, if data is already partitioned
– Partial aggregation reduces data movement
[Diagram: data redistributed between the partitions of Stage 1 and Stage 2]



Shuffle (cont.)
• Pull-based not push based
• Write intermediate files to disk
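
A single-process sketch of a hash shuffle with partial aggregation, assuming word counting as the workload. CRC32 is used as a stable stand-in for the partitioner's hash, and the function names are hypothetical; no data is actually written to disk here.

```python
import zlib
from collections import Counter

def bucket_of(key, num_reducers):
    # Hash keys into buckets; a stable hash keeps the result reproducible.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_side(partition, num_reducers):
    """Stage 1: aggregate locally, then split the partial counts into one bucket per reducer."""
    partial = Counter(partition)  # partial aggregation reduces data movement
    buckets = [dict() for _ in range(num_reducers)]
    for key, count in partial.items():
        buckets[bucket_of(key, num_reducers)][key] = count
    return buckets

def reduce_side(map_outputs, reducer):
    """Stage 2: pull this reducer's bucket from every map output and finish the sum."""
    total = Counter()
    for buckets in map_outputs:
        total.update(buckets[reducer])
    return dict(total)

partitions = [["big", "data", "big"], ["camp", "big"]]
map_outputs = [map_side(partition, 2) for partition in partitions]
results = [reduce_side(map_outputs, reducer) for reducer in range(2)]
```

Because of the local aggregation, the first partition ships a single ("big", 2) record instead of two separate ("big", 1) records.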

[Diagram: Stage 2 pulls the intermediate files written by Stage 1]



Spark Programming Model



RDD

Transformed RDD



SparkContext
• Main entry point to Spark functionality
• Available in shell as variable sc
• In standalone programs, you’d make your own (see
later for details)



Creating RDD
• A Spark program is a series of transformations on RDDs
• An RDD can be created by
– Using parallelize() on an in-memory dataset or collection
– Reading data from an external dataset such as HDFS, S3, or the local
file system
– Applying transformations to an existing RDD
• Parallelize method
– Invoke the parallelize method on a collection
– Elements of the collection are copied to create a distributed
dataset that can be operated on in parallel



Creating RDD (cont)
• Parallelize method (cont.)
– Example
• Scala
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
• Java
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
– How many partitions are created?
• Specify the number of partitions to create as input to parallelize()
• Otherwise, Spark automatically decides the number of partitions based on
cluster configuration
• Spark runs a separate task for each partition



Creating RDD (cont)
• External Dataset
– Spark can create distributed datasets from any storage
source supported by Hadoop, including your local file
system, HDFS, Cassandra, HBase, Amazon S3, etc.
– Spark supports Text files, Sequence Files, and any other
Hadoop Input Format
– APIs to read data from external sources
• sc.wholeTextFiles()
– Reads a directory containing multiple small text files and returns each of
them as a (filename, content) pair
• sc.sequenceFile[K, V]()
– Reads sequence files, where K and V are the types of the keys and values in the
file
• sc.hadoopRDD()
– Reads data through a Hadoop InputFormat



Creating RDD (cont)
• Reading External Dataset
– Example
• Scala
val distFile = sc.textFile("file.txt")
val distFile = sc.textFile("directory/*.txt")
val distFile = sc.textFile("hdfs://namenode:9000/path/file")
• Java
JavaRDD<String> distFile = sc.textFile("file.txt");
JavaRDD<String> distFile = sc.textFile("directory/*.txt");
JavaRDD<String> distFile = sc.textFile("hdfs://namenode:9000/path/file");



RDD Operations
• RDDs support two types of operations:
– transformations, which create a new dataset from an
existing one, and
– actions, which return a value to the driver program after
running a computation on the dataset.
• For example,
– map() is a transformation that passes each dataset
element through a function and returns a new RDD
representing the results.
– reduce() is an action that aggregates all the elements of
the RDD using some function and returns the final result
to the driver program



RDD Operations
• All transformations in Spark are lazy
– They are only computed when an action requires a result to be
returned to the driver program
• An RDD can be persisted in memory, on disk, or both
– Persisted RDDs are kept available for faster access in subsequent
operations
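
Lazy evaluation and persistence can be illustrated with a toy class in plain Python. This is a sketch under stated assumptions, not how Spark is implemented: LazyDataset and its methods are hypothetical names that only mimic the behaviour described above.

```python
import functools


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function producing a list
        self._persisted = False
        self._cache = None

    def map(self, fn):
        # Transformation: returns a new dataset; fn is not called yet.
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def persist(self):
        # Mark the dataset so its first computed result is kept for reuse.
        self._persisted = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        result = self._compute()
        if self._persisted:
            self._cache = result
        return result

    def reduce(self, fn):
        # Action: triggers computation of the whole chain.
        return functools.reduce(fn, self._materialize())


calls = []


def length(s):
    calls.append(s)    # record each invocation to make laziness visible
    return len(s)


lines = LazyDataset(lambda: ["to be or", "not to be"])
line_lengths = lines.map(length).persist()
calls_before_action = len(calls)       # nothing has run yet

total_length = line_lengths.reduce(lambda a, b: a + b)
calls_after_first = len(calls)         # the map ran once per line

longest = line_lengths.reduce(max)
calls_after_second = len(calls)        # unchanged: the persisted result was reused
```

The counters show that the mapping function runs only when the first action is called, and is not re-run for the second action because the result was persisted.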



RDD Operations - Example

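• val lines = sc.textFile("data.txt")   // Define base RDD

• val lineLengths = lines.map(s => s.length)   // Transformation: map

• val totalLength = lineLengths.reduce((a, b) => a + b)   // Action: reduce

• lineLengths.persist()   // Persist: keep lineLengths for reuse by later actions
                          // (call it before the first action so that action also fills the cache)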



Basic Transformations
// Create base RDD
> val data = Array(1, 2, 3)
> val nums = sc.parallelize(data)

// Pass each element through a function
> val squares = nums.map(x => x * x) // {1, 4, 9}

// Keep elements passing a predicate
> val even = squares.filter(x => x % 2 == 0) // {4}



Basic Actions

> val nums = sc.parallelize(Array(1, 2, 3))

// Retrieve RDD contents as a local collection
> nums.collect() // => Array(1, 2, 3)
// Return first K elements
> nums.take(2) // => Array(1, 2)
// Count number of elements
> nums.count() // => 3
// Merge elements with an associative function
> nums.reduce((x, y) => x + y) // => 6
// Write elements to a text file
> nums.saveAsTextFile("hdfs://192.168.91.128/exampleaction.txt")



Working with Key-Value Pairs

Spark’s “distributed reduce” transformations operate on RDDs of key-value pairs

Python: pair = (a, b)
pair[0] # => a
pair[1] # => b

Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b

Java: Tuple2 pair = new Tuple2(a, b);
pair._1() // => a
pair._2() // => b
Some Key-Value Operations

> val petsdata = Array(("cat", 1), ("dog", 1), ("cat", 2))
> val pets = sc.parallelize(petsdata)

> val reduceCounts = pets.reduceByKey((x, y) => x + y)
> reduceCounts.collect()
// Array((dog,1), (cat,3))

> val groupCounts = pets.groupByKey()
> groupCounts.collect()
// Array((dog,ArrayBuffer(1)), (cat,ArrayBuffer(1, 2)))

> val sortCounts = pets.sortByKey()
> sortCounts.collect()
// Array((cat,1), (cat,2), (dog,1))

reduceByKey also automatically implements combiners on the map side
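
What a map-side combiner buys can be sketched in plain Python. This is an illustration only, not Spark's shuffle: the function names are hypothetical, and the two lists stand in for two partitions of the same pair data.

```python
from collections import defaultdict


def combine_locally(pairs, fn):
    """Merge values per key within one collection of (key, value) pairs."""
    combined = {}
    for key, value in pairs:
        combined[key] = fn(combined[key], value) if key in combined else value
    return list(combined.items())


def reduce_by_key(partitions, fn):
    # Each partition is combined first, so fewer pairs need to be "shuffled".
    shuffled = [pair for p in partitions for pair in combine_locally(p, fn)]
    return dict(combine_locally(shuffled, fn)), len(shuffled)


def group_by_key(partitions):
    # No local combine: every pair is "shuffled" as-is.
    shuffled = [pair for p in partitions for pair in p]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return dict(groups), len(shuffled)


partitions = [[("cat", 1), ("cat", 2)], [("dog", 1)]]
reduced, reduced_shuffled = reduce_by_key(partitions, lambda x, y: x + y)
grouped, grouped_shuffled = group_by_key(partitions)

print(reduced, reduced_shuffled)
print(grouped, grouped_shuffled)
```

In this toy run the reduce path moves two pairs between stages while the group path moves three, which is the reason reduceByKey is usually preferred when only an aggregate is needed.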



Example: Word Count

> val lines = sc.textFile("hdfs://192.168.91.128/4300.txt")
> val counts = lines.flatMap(line => line.split(" "))
                    .map(word => (word, 1))
                    .reduceByKey((x, y) => x + y)
> counts.collect()
> counts.collect().foreach(println)
> counts.saveAsTextFile("hdfs://192.168.91.128:8020/tmp/wordcount1")

"to be or"  → "to", "be", "or"  → (to, 1), (be, 1), (or, 1)
"not to be" → "not", "to", "be" → (not, 1), (to, 1), (be, 1)
reduceByKey → (be, 2), (not, 1), (or, 1), (to, 2)
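
The same three steps can be traced without a cluster. The following plain-Python sketch mimics flatMap, map and reduceByKey on the two sample lines; the local reduce_by_key helper is hypothetical and only stands in for the Spark operation.

```python
from itertools import chain

lines = ["to be or", "not to be"]

# flatMap: split every line and flatten the results into one list of words.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: turn each word into a (word, 1) pair.
pairs = [(word, 1) for word in words]


def reduce_by_key(pairs, fn):
    """Merge the values of identical keys with fn."""
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return out


counts = reduce_by_key(pairs, lambda x, y: x + y)
print(sorted(counts.items()))
```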



Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD
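
Lineage-based recovery can be sketched in plain Python. This is a toy model under stated assumptions, not Spark internals: Source and Node are hypothetical classes, and the log lines are made-up sample data. Each derived dataset remembers its parent and the function that produced it, so a lost partition is rebuilt from its parent partition only.

```python
class Source:
    """Stable input, standing in for an HDFS file split into partitions."""

    def __init__(self, partitions):
        self._partitions = partitions

    def get(self, index):
        return self._partitions[index]


class Node:
    """A derived dataset: a parent plus the function that derives each partition."""

    def __init__(self, parent, fn):
        self.parent = parent
        self.fn = fn
        self.partitions = {}    # partition index -> computed data (may be lost)

    def get(self, index):
        if index not in self.partitions:
            # Lost or never computed: rebuild from the parent's partition.
            self.partitions[index] = self.fn(self.parent.get(index))
        return self.partitions[index]


log = Source([
    ["ERROR\tdb\ttimeout", "INFO\tweb\tok"],
    ["ERROR\tweb\t500", "WARN\tdb\tslow"],
])
filtered = Node(log, lambda part: [s for s in part if s.startswith("ERROR")])
msgs = Node(filtered, lambda part: [s.split("\t")[2] for s in part])

before = [msgs.get(0), msgs.get(1)]

# Simulate a worker failure that loses partition 1 of both derived datasets.
del msgs.partitions[1]
del filtered.partitions[1]

recovered = msgs.get(1)    # recomputed by replaying filter and map on partition 1
```

Only the lost partition is recomputed; partition 0 stays in place and is never touched by the recovery.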



How to Run Spark



Language Support

Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();

Standalone Programs
• Python, Scala, & Java
Interactive Shells
• Python & Scala
Performance
• Java & Scala are faster due to static typing
• …but Python is often fine



Interactive Shell

• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster…
• …or can run locally



… or a Standalone Application

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])

    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)

    counts.saveAsTextFile(sys.argv[2])



Create a SparkContext

Scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))

Arguments, in order:
• Cluster URL, or local / local[N]
• App name
• Spark install path on cluster
• List of JARs with app code (to ship)

Java
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(
    "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python
from pyspark import SparkContext

sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])



Spark Architecture



Spark Architecture
[Figure: Client (Scala/Java/Python) → Driver (RDD graph, scheduler) → cluster manager or local threads → Executors (each running tasks, with a cache) → blocks in HDFS/HBase/storage]



Deploying Spark – Cluster Manager Types
• Mesos
• Standalone mode
• YARN



Spark Architecture (Client mode)

Source: Hadoop Definitive Guide



Spark Architecture (Cluster mode)

Source: Hadoop Definitive Guide



Spark Streaming



What is Spark Streaming?
• Extends Spark for doing large-scale stream
processing
• Scales to 100s of nodes and achieves second-scale
latencies
• Efficient and fault-tolerant stateful stream
processing
• Integrates with Spark’s batch and interactive
processing
• Provides a simple batch-like API for implementing
complex algorithms
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617



Spark Streaming

● Scalable, high-throughput, fault-tolerant stream processing



Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs

• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD operations
• Finally, the processed results of the RDD operations are returned in batches

[Figure: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
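
The chopping step can be sketched in plain Python. This is an illustration only: the discretize helper and the sample events are hypothetical, and timestamps are assumed to be seconds since the start of the stream.

```python
def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for timestamp, value in events:
        batch_index = int(timestamp // batch_seconds)
        batches.setdefault(batch_index, []).append(value)
    if not batches:
        return []
    # Include empty batches so that every interval is represented.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_seconds=1)

# Treat every batch like a small dataset and apply the same batch operation.
results = [[value.upper() for value in batch] for batch in batches]

print(batches)
print(results)
```

Because each batch is an ordinary collection, the per-batch step is the same code that would be used for a one-off batch job.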



Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs

• Batch sizes as low as ½ second, latency ~ 1 second
• Potential for combining batch processing and streaming processing in the same system

[Figure: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617



Example 1 – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data

[Figure: Twitter Streaming API → tweets DStream, shown as batch @ t, batch @ t+1, batch @ t+2; each batch is stored in memory as an RDD (immutable, distributed dataset)]

Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617



Example 1 – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream (here, the new DStream hashTags)

[Figure: tweets DStream → flatMap applied to batch @ t, batch @ t+1, batch @ t+2 → hashTags DStream; new RDDs, e.g. [#cat, #dog, … ], are created for every batch]

Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617



Example 1 – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: to push data to external storage

[Figure: tweets DStream → flatMap → hashTags DStream → save, for batch @ t, batch @ t+1, batch @ t+2; every batch is saved to HDFS]
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617



Example 2 – Count the hashtags over last 1 min

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()

sliding window operation: window(window length, sliding interval)

[Figure: a window of the given window length moves over the DStream of data, advancing by the sliding interval]
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Example 2 – Count the hashtags over last 1 min

val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()

[Figure: the sliding window covers several batches (t-1 … t+3) of hashTags; countByValue produces tagCounts, a count over all the data in the window]

Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
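
A windowed count can be sketched in plain Python. This is an illustration, not the Spark Streaming API: windowed_counts is a hypothetical helper, and for simplicity the window length and sliding interval are expressed in number of batches rather than in time.

```python
from collections import Counter


def windowed_counts(batches, window_length, slide_interval):
    """Count values over the last window_length batches, every slide_interval batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window = [tag for batch in batches[start:end] for tag in batch]
        results.append(dict(Counter(window)))
    return results


hash_tag_batches = [["#cat"], ["#dog", "#cat"], ["#cat"], ["#dog"]]
tag_counts = windowed_counts(hash_tag_batches, window_length=3, slide_interval=1)

for counts in tag_counts:
    print(counts)
```

The last result covers only the three most recent batches, so the "#cat" from the first batch has dropped out of the count.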
Key concepts

• DStream – sequence of RDDs representing a stream of data
– Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
• Transformations – modify data from one DStream to another
– Standard RDD operations – map, countByValue, reduceByKey, join, …
– Stateful operations – window, countByValueAndWindow, …
• Output Operations – send data to external entity
– saveAsHadoopFiles – saves to HDFS
– foreach – do anything with each batch of results
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Arbitrary Stateful Computations

• Maintain arbitrary state, track sessions
– Maintain per-user mood as state, and update it with their tweets

moods = tweets.updateStateByKey(tweet => updateMood(tweet))
updateMood(newTweets, lastMood) => newMood

[Figure: for each batch t-1 … t+3, the tweets of that batch update the moods state]

Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
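
Per-key state carried across batches can be sketched in plain Python. This is a simplified illustration, not the Spark API: update_state_by_key is a hypothetical helper that only updates keys appearing in the current batch, and the mood scoring rule is invented for the example.

```python
def update_state_by_key(batches, update_fn):
    """Fold each batch of (key, value) pairs into per-key state.

    Returns a snapshot of the state after every batch.
    """
    state = {}
    snapshots = []
    for batch in batches:
        new_values = {}
        for key, value in batch:
            new_values.setdefault(key, []).append(value)
        for key, values in new_values.items():
            state[key] = update_fn(values, state.get(key))
        snapshots.append(dict(state))
    return snapshots


def update_mood(new_tweets, last_mood):
    # Hypothetical scoring: +1 per ":)" and -1 per ":(" in the new tweets.
    score = last_mood or 0
    for tweet in new_tweets:
        score += tweet.count(":)") - tweet.count(":(")
    return score


tweet_batches = [
    [("ann", "great day :)"), ("bob", "ugh :(")],
    [("ann", "still good :)")],
    [("bob", "better now :)"), ("bob", "nice :)")],
]
moods = update_state_by_key(tweet_batches, update_mood)

for snapshot in moods:
    print(snapshot)
```

The update function receives the new values for a key together with the previous state, which matches the updateMood(newTweets, lastMood) shape shown above.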
Combine Batch and Stream Processing

• Do arbitrary Spark RDD computation within DStream
– Join incoming tweets with a spam file to filter out bad tweets

tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})

Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
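
Applying a batch-style join to every batch of a stream can be sketched in plain Python. This is an illustration under stated assumptions: the spam_users lookup stands in for the spam file, the data is made up, and the filter criterion (drop tweets whose user has a spam entry) is an assumption, since the original filter is elided.

```python
# Static "batch" data, standing in for the spam file.
spam_users = {"spambot42": "known spammer"}


def drop_spam(tweets_batch):
    """Left-join one batch on user, then keep tweets without a spam match."""
    joined = [(user, (text, spam_users.get(user))) for user, text in tweets_batch]
    return [(user, text) for user, (text, flag) in joined if flag is None]


def transform(batches, fn):
    # Apply an arbitrary batch-level function to each batch of the stream.
    return [fn(batch) for batch in batches]


tweet_batches = [
    [("ann", "hello"), ("spambot42", "buy now")],
    [("spambot42", "click here"), ("bob", "hi")],
]
clean = transform(tweet_batches, drop_spam)

print(clean)
```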
Fault-tolerance

• RDDs remember the operations that created them
• Batches of input data are replicated in memory for fault-tolerance
• Data lost due to worker failure can be recomputed from replicated input data
• Therefore, all transformed data is fault-tolerant

[Figure: tweets RDD (input data replicated in memory) → flatMap → hashTags RDD (lost partitions recomputed on other workers)]

Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Spark SQL



Spark SQL

• Spark’s interface to work with structured or semi-structured data
• Structured data
– known set of fields for each record (a schema)
• Main capabilities
– load data from a variety of structured sources
– query the data with SQL
– integration between Spark (Java, Scala and Python API)
and SQL (joining RDDs and SQL tables, using SQL
functionality)

Source: http://www.slideshare.net/PetrZapletal1/spark-concepts-spark-sql-graphx-streaming



DataFrames (SchemaRDD)

• RDD of row objects, each representing a record
• Known schema (i.e. data fields) of its rows
• Behaves like a regular RDD, but is stored in a more efficient manner
• Adds new operations, especially running SQL queries
• Can be created from
– external data sources
– results of queries
– regular RDDs
• Used in the ML Pipeline API

Source: http://www.slideshare.net/PetrZapletal1/spark-concepts-spark-sql-graphx-streaming

SQLContext
• Entry points:
– HiveContext
• superset functionality, Hive related
– SQLContext

MLlib

• MLlib is Spark’s scalable machine learning library
• Common learning algorithms and utilities:
– Classification
– Regression
– Clustering
– Collaborative filtering
– Dimensionality reduction



GraphX

• Spark API for graphs and graph-parallel computation
• Resilient Distributed Property Graph (RDPG, extends RDD)
– directed multigraph ( -> parallel edges)
– properties attached to each vertex and edge
• Common graph operations (subgraph computation, joining vertices, ...)
• Growing collection of graph algorithms
Source: http://www.slideshare.net/PetrZapletal1/spark-concepts-spark-sql-graphx-streaming
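
A directed property multigraph can be sketched with plain Python data structures. This is only an illustration of the data model, not the GraphX API: Vertex, Edge, subgraph and out_degrees are hypothetical names, and the sample data is made up.

```python
from collections import namedtuple

Vertex = namedtuple("Vertex", ["id", "attr"])
Edge = namedtuple("Edge", ["src", "dst", "attr"])

vertices = [Vertex(1, "ann"), Vertex(2, "bob"), Vertex(3, "cat")]
edges = [
    Edge(1, 2, "follows"),
    Edge(1, 2, "mentions"),    # parallel edge: same endpoints, different property
    Edge(2, 3, "follows"),
]


def subgraph(vertices, edges, edge_pred):
    """Keep all vertices and only the edges that satisfy edge_pred."""
    return vertices, [e for e in edges if edge_pred(e)]


def out_degrees(edges):
    """Count outgoing edges per source vertex (parallel edges count separately)."""
    degrees = {}
    for e in edges:
        degrees[e.src] = degrees.get(e.src, 0) + 1
    return degrees


_, follows_only = subgraph(vertices, edges, lambda e: e.attr == "follows")
degrees = out_degrees(edges)

print(follows_only)
print(degrees)
```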



Thank You
For any support you need please feel free to contact:

Ashish Baghel : abaghel@impetus.com


AVP & Head – Banking and Financial Services (BFSI)
913-638-2948 (Cell)
408-213-3310 – Ext 567 (Office)

Sachneet Singh: sachneets.bains@impetus.co.in
