(3 Day Training)
For any support you need, please feel free to contact:
Ashish Baghel: abaghel@impetus.com
AVP & Head – Banking and Financial Services (BFSI)
913-638-2948 (Cell)
408-213-3310 – Ext. 567 (Office)
Source: https://www.linkedin.com/pulse/20140925030713-64875646-big-data-the-eye-opening-facts-everyone-should-know
Variety – structured, semi-structured, and unstructured: text, audio/video, clickstreams, log files, etc.
2. Clickstream – website visitors' data
3. Sensor/Machine – data from remote sensors and machines
4. Geographic – location-based data
Source: www.Hortonworks.com
Flume Agent
Source: https://github.com/hortonworks/hadoop-tutorials/blob/master/Sandbox/T15_Analyzing_Geolocation_Data.md
Raw sensor data (truck events):
A54 A54 normal    38.44047 -122.714 Santa Rosa California 17 0 0
A20 A20 normal    36.97717 -121.899 Aptos      California 27 0 0
A40 A40 overspeed 37.9577  -121.291 Stockton   California 77 1 0
Flume Agent
Source: https://github.com/hortonworks/hadoop-tutorials/blob/master/Sandbox/T15_Analyzing_Geolocation_Data.md
A Sqoop job imports the RDBMS data (information about the trucks).
Source: https://github.com/hortonworks/hadoop-tutorials/blob/master/Sandbox/T15_Analyzing_Geolocation_Data.md
Source: http://athenaanalytics.tumblr.com/post/19544048414/cyberlabe-potential-use-cases-for-big-data
[Diagram: big data use cases connecting Subscribers, Telecom, Vendors, Merchants, and Customers]
• Data Storage
– Doesn't fit on one node; requires a cluster
– Flexible, schemaless structure
– Data replication, partitioning, and sharding
Source: http://hortonworks.com/blog/apache-hadoop-2-is-ga/
Source: http://www.apache.org/
Source: http://hortonworks.com/blog/webinar-series-building-a-modern-data-architecture-with-hadoop/
Relational vs. Hadoop:
– Relational: schema required on write
– Hadoop: schema required on read
• Batch processing
• Node failure – handled via replication
• Types of Metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g., creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
• Block Report
– Each DataNode periodically sends a report of all its blocks to the NameNode
NameNode
[Diagram: each DataNode reports "Here is my heartbeat and block report!!" (e.g., Block 25) to the NameNode]
Source: http://www.slideshare.net/hdhappy001/nicholashdfs-what-is-new-in-hadoop-2
• Data is written into HDFS from the client application in blocks; each block is of the configured block size.
[Diagram: HDFS architecture – the client application performs metadata ops against the NameNode, which holds metadata (name, replicas, e.g., /home/foo/data, 6, ...); block reads/writes and replication happen directly between the client and DataNodes DN1, DN2, DN3]
Source: https://hadoop.apache.org
Metadata example:
File.txt = Blk A: DN 1, 7, 8; Blk B: DN 8, 12, 14
[Diagram: blocks A and B replicated across a cluster of DataNodes 1–15]
A few commands:
-ls, -ls -R: list files/directories
-cat: display file content (uncompressed)
-chgrp,-chmod,-chown: changes file permissions
-put,-get,-copyFromLocal,-copyToLocal: copies files from the local file
system to the HDFS and vice-versa.
-mv,-moveFromLocal,-moveToLocal: moves files
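The same operations are also available programmatically. A minimal sketch using the Hadoop FileSystem API; the NameNode address and paths are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutAndList {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -put data.txt /user/train/data.txt
        fs.copyFromLocalFile(new Path("data.txt"), new Path("/user/train/data.txt"));

        // Equivalent of: hdfs dfs -ls /user/train
        for (FileStatus status : fs.listStatus(new Path("/user/train"))) {
            System.out.println(status.getPath() + "  replication=" + status.getReplication());
        }
        fs.close();
    }
}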
Source: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots#.VZoux_lViko
Source: http://www.slideshare.net/jayshao/nyc-hadoop-meetup-mapr-architecture-philosophy-and-applications
Container locations and replication
[Diagram: containers replicated across nodes N1, N2, and N3]
The container location database (CLDB) keeps track of the nodes hosting each container and the replication chain order. (Source: MapR)
Source: http://www.slideshare.net/jayshao/nyc-hadoop-meetup-mapr-architecture-philosophy-and-applications
Source: http://www.slideshare.net/mcsrivas/design-scale-and-performance-of-maprs-distribution-for-hadoop
Source: https://developer.yahoo.com/hadoop/tutorial/module4.html
[Diagram: Map Tasks 1–5 each writing their own map output file]
Source: https://developer.yahoo.com/hadoop/tutorial/module4.html
Source: https://developer.yahoo.com/hadoop/tutorial/module4.html
Source: http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
Source: https://developer.yahoo.com/hadoop/tutorial/module4.html
Shuffle and sort between Mapper and Reducer:
1. The Reducer fetches the data from the Mappers (Mapper output = Reducer input).
2. Fetched data is held in an in-memory buffer, which spills to files on disk.
3. The spill files are merged to form the Reducer input.
4. The Reducer processes the grouped pairs.
5. The Reducer writes its output to HDFS.
The Mapper transforms <K1, V1> into <K2, V2> before the Shuffle/Sort.
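To make the <K1, V1> → <K2, V2> contract concrete, here is a minimal WordCount sketch against the standard Hadoop MapReduce API; the class names and whitespace tokenization are illustrative choices, not the deck's own code:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: <offset, line> -> <word, 1>
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            ctx.write(word, ONE);
        }
    }
}

// Reduce: <word, [1, 1, ...]> -> <word, count>; the shuffle/sort groups keys in between.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}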
Source: http://www.ibm.com/developerworks/library/bd-hadoopyarn/
• Scalability
– Max Cluster size ~5,000 nodes
– Max concurrent tasks ~40,000
• Availability
– Failure Kills Queued & Running Jobs
Source: http://www.ibm.com/developerworks/library/bd-hadoopyarn/
Source: http://doc.mapr.com/display/MapR/YARN
– Linux: cgroups
• Administration
– Queue capacities, e.g., Marketing 20%, Ad hoc 10%, DW 70%
Source: http://www.slideshare.net/hortonworks/apache-hadoop-yarn-understanding-the-data-operating-system-of-hadoop
Source: http://blog.cloudera.com/blog/2014/05/how-apache-hadoop-yarn-ha-works/
Source: http://www.slideshare.net/hortonworks/apache-hadoop-yarn-understanding-the-data-operating-system-of-hadoop
Source: http://hortonworks.com/hadoop/sqoop/
Source: http://www.slideshare.net/aprabhakar/apache-flume-datadaytexas
Source: https://blogs.apache.org/flume/
Source: http://www.slideshare.net/Hadoop_Summit/percy-june26-455room211
Source: http://flume.apache.org/FlumeUserGuide.html
Source: http://www.slideshare.net/aprabhakar/apache-flume-datadaytexas
Source: http://www.slideshare.net/Hadoop_Summit/percy-june26-455room211
# Name the agent's components (assumed lines; the original slide omitted these declarations)
agent.sources = webserver
agent.channels = memoryChannel
agent.sinks = mycluster

# Source: tail the HDFS audit log
agent.sources.webserver.type = exec
agent.sources.webserver.command = tail -F /var/log/hadoop/hdfs/audit.log
agent.sources.webserver.batchSize = 1
agent.sources.webserver.channels = memoryChannel

# Channel: in-memory buffer of up to 10,000 events
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000

# Sink: write events to HDFS
agent.sinks.mycluster.type = hdfs
agent.sinks.mycluster.channel = memoryChannel
agent.sinks.mycluster.hdfs.path = hdfs://127.0.0.1:8020/hdfsaudit/
Source: http://www.slideshare.net/aprabhakar/apache-flume-datadaytexas
Source: http://www.slideshare.net/aprabhakar/apache-flume-datadaytexas
Source: http://flume.apache.org/FlumeUserGuide.html
Source: http://flume.apache.org/FlumeUserGuide.html
Source: http://flume.apache.org/FlumeUserGuide.html
salariesbyage: { group: int, salaries: {(gender: chararray, age: int, salary: double, zip: int)} }

salaries:
gender  age  salary    zip
F       25   35000.00  95103
M       30   45000.00  95102
F       35   60000.00  95103
F       30   48000.00  95105
M       30   47000.00  95102
M       25   39000.00  95103

G:
gender  age  salary    zip
M       30   47000.00  95102
F       30   48000.00  95105
F       35   60000.00  95103
A = LOAD 'data1';
B = LOAD 'data2';
C = JOIN A BY $1, B BY $3 PARALLEL 20;
D = ORDER C BY $0 PARALLEL 5;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {  // assumed class declaration; the slide showed only the method
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) return null;
        String inputStr = input.get(0).toString().trim();
        return inputStr.toUpperCase();
    }
}
L = LIMIT emp_group 3;
PigStorage
– Loads and stores data as structured text files.
PigDump
– Stores data in UTF-8 format.
JsonLoader, JsonStorage
– Load or store JSON data
HBaseStorage
– Loads and stores data from an HBase table
AvroStorage
– Loads and stores data from Avro files.
There should not be any conflicts between blacklist and whitelist. Make sure to
have them entirely distinct or Pig will complain.
• PartitionFilterOptimizer
– Push the filter condition to loader
A = LOAD 'input' as (dt, state, event) using HCatLoader();
B = FILTER A BY dt=='201310' AND state=='CA';
• PushUpFilter
• LimitOptimizer
• PushDownForEachFlatten
Hive Driver
[Diagram: Hive Driver running on Hadoop YARN (MR + HDFS) – ResourceManager and NameNode as masters, DataNode + NodeManager on each worker]
• Using Beeline
– A new command-line client that connects to a HiveServer2 instance using the Hive JDBC driver

$ beeline
Hive version 0.11.0-SNAPSHOT by Apache
beeline> !connect jdbc:hive2://hostname:10000 username password org.apache.hive.jdbc.HiveDriver
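The same HiveServer2 endpoint can also be reached from Java via JDBC; a minimal sketch reusing the hostname and credential placeholders from the beeline example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Same endpoint that beeline connects to above.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hostname:10000", "username", "password");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT fName, lName FROM customer LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + " " + rs.getString(2));
        }
        conn.close();
    }
}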
FROM customer
SELECT fName, lName, birthday
WHERE birthday IS NOT NULL;
Source: http://www.slideshare.net/ye.mikez/hive-tuning
package com.impetus.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class ToLower extends UDF {  // assumed class body; the slide showed only the header
    public Text evaluate(Text s) { return s == null ? null : new Text(s.toString().toLowerCase()); }
}
FROM customer
SELECT ToLower(fName),
ToLower(lName);
SET hive.default.fileformat=Orc;
[Diagram: HCatalog exposing table access (e.g., Table: Customers) to Java MapReduce jobs over data stored in HDFS and HBase]
Privileges:
– SELECT privilege – gives read access to an object.
– INSERT privilege – gives ability to add data to an object (table).
– UPDATE privilege – gives ability to run update queries on an object
(table).
– DELETE privilege – gives ability to delete data in an object (table).
– ALL PRIVILEGES – gives all privileges (gets translated into all the
above privileges).
Objects
– The privileges apply to tables and views. The above privileges are not
supported on databases.
[Diagram: MapReduce plans for a multi-table query – map tasks project a.state, b.id, c.itemId, and c.price; reducers perform JOIN(a, c) and JOIN(a, b), then GROUP BY a.state computing COUNT(*) and AVG(c.price), with intermediate results written to HDFS between jobs]
• Cell
– A cell is a combination of row, column family, and column qualifier, and
contains a value and a timestamp, which represents the value's version.
• Timestamp
– A timestamp is written alongside each value, and is the identifier for a given
version of a value.
Source: http://www.slideshare.net/aillonianilreddy/hadoop-32452974
Source: http://www.slideshare.net/aillonianilreddy/hadoop-32452974
Source: http://hbase.apache.org/
Source: http://www.slideshare.net/xefyr/h-base-for-architectspptx
Client
[Diagram: HBase on HDFS – 1. ask which NameNode is active (a Standby NameNode waits for failover); 2. the RegionServer writes to the DataNodes]
Source: http://www.slideshare.net/HBaseCon/hbase-just-the-basics
Source: http://blog.cloudera.com/blog/2012/06/hbase-write-path/
Write walkthrough, steps 3–4: table.put(p); …
Source: http://www.slideshare.net/HBaseCon/hbase-just-the-basics
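Filling in the steps around that table.put(p) call, a minimal client sketch; the table name, column family, and values are assumptions, not the tutorial's actual schema:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("trucks"))) {
            // Write one cell: row key + column family + qualifier -> value (timestamped).
            Put p = new Put(Bytes.toBytes("A40"));
            p.addColumn(Bytes.toBytes("events"), Bytes.toBytes("type"),
                        Bytes.toBytes("overspeed"));
            table.put(p);

            // Read it back; each cell carries the timestamp that versions the value.
            Result r = table.get(new Get(Bytes.toBytes("A40")));
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("events"), Bytes.toBytes("type"))));
        }
    }
}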
Characteristics:
– Arbitrary code can run at each RegionServer
– High-level call interface for clients
– Calls are addressed to rows or ranges of rows, and the coprocessor
client library resolves them to actual locations
– Calls across multiple rows are automatically split into multiple
parallelized RPCs
– Provides a very flexible model for building distributed services
Coprocessor Types
Based on deployment
• System Coprocessors
– loaded globally on all tables and regions hosted by the
region server
• Table coprocessors
– loaded on all regions for a table on a per-table basis
Source: https://blogs.apache.org/hbase/entry/coprocessor_introduction
Source: https://blogs.apache.org/hbase/entry/coprocessor_introduction
• Prefix Query
– matches documents containing terms beginning with a specified string.
• Range Query
– facilitates searches from a starting term through an ending term.
• Boolean Query
– allows for logical AND, OR, and NOT combinations.
• Phrase Query
– An index contains positional information of terms.
• Fuzzy Query
– matches terms similar to a specified term.
• Boost Query
– Boost a particular term
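As an illustration, here is how these query types can be constructed with recent versions of the Lucene API; field names and terms are made up:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class QueryExamples {
    public static void main(String[] args) {
        // Prefix: terms beginning with "dat" (data, database, ...).
        Query prefix = new PrefixQuery(new Term("body", "dat"));

        // Range: terms from "apple" through "banana", inclusive.
        Query range = TermRangeQuery.newStringRange("body", "apple", "banana", true, true);

        // Fuzzy: terms similar to "hadop" (within edit distance).
        Query fuzzy = new FuzzyQuery(new Term("body", "hadop"));

        // Phrase: relies on positional information in the index.
        PhraseQuery phrase = new PhraseQuery("body", "big", "data");

        // Boolean: logical AND / OR / NOT combinations.
        BooleanQuery bool = new BooleanQuery.Builder()
                .add(prefix, BooleanClause.Occur.MUST)       // AND
                .add(fuzzy, BooleanClause.Occur.SHOULD)      // OR
                .add(range, BooleanClause.Occur.MUST_NOT)    // NOT
                .build();

        // Boost: weight one query higher than the others.
        Query boosted = new BoostQuery(phrase, 2.0f);
    }
}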
Lucene in a search system
[Diagram: indexing – acquire raw content → build document → analyze document → index; searching – Search UI → build query → run query → render results]
Source: http://web.stanford.edu/class/cs276/handouts/lecture-lucene.pptx
– A field can be stored or not
• Storing is useful for fields that you'd like to display to users
• Document Numbers
– Internally, Lucene refers to documents by an integer document
number
– The first document added to an index is numbered zero, and so on.
• Segments
– Lucene indexes may be composed of multiple sub-indexes,
or segments.
– Each segment is a fully independent index
– New segments created for newly added documents.
– Existing segments are merged
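A minimal indexing sketch tying these pieces together; each commit may create a new segment, which Lucene later merges. The index path and field name are assumptions:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexExample {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document();
        // Stored, so the raw text can be displayed to users in results.
        doc.add(new TextField("body", "Big data training notes", Field.Store.YES));
        writer.addDocument(doc);  // this document gets the next document number

        writer.commit();  // flushes to a new segment
        writer.close();
    }
}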
Source: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal
Source: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal
Source: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal
Source: http://www.slideshare.net/otisg/lucene-introduction
Source: http://www.slideshare.net/saumitra121/apache-solr-workshop
• CharFilter
– Used to transform the text before it is tokenized
• Tokenizer
– Responsible for breaking up incoming text into tokens.
• TokenFilter
– Responsible for modifying tokens that have been created by the
Tokenizer
• Text Normalization
– Stripping accents and other character markings
• Synonym Expansion
– Adding in synonyms at the same token position
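To see a Tokenizer/TokenFilter chain at work, a small sketch that prints the tokens StandardAnalyzer produces for a sample string; the text is arbitrary:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();  // tokenizer + lowercasing, etc.
        try (TokenStream ts = analyzer.tokenStream("body", "Hadoop, Pig & Hive!")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());  // hadoop, pig, hive
            }
            ts.end();
        }
    }
}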
Source: http://www.slideshare.net/saumitra121/apache-solr-workshop
Searching Data: Basic Concepts
Source: http://www.slideshare.net/saumitra121/apache-solr-workshop
What is Elasticsearch?
• Elasticsearch is an open-source (Apache 2.0), distributed search engine built on top of Apache Lucene
• Elasticsearch functionality can be accessed through an API and a RESTful service interface
• Elasticsearch is built to be distributed from the ground up, so it easily scales from one machine to hundreds
• It provides features like fault tolerance and high availability
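Because the service interface is RESTful, any HTTP client works. Below is a sketch using the Elasticsearch low-level Java REST client (available in recent releases); the host, index, and document contents are assumptions:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class EsRestExample {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build();

        // Index a document...
        Request index = new Request("PUT", "/tweets/_doc/1");
        index.setJsonEntity("{\"user\":\"kim\",\"text\":\"#hadoop rocks\"}");
        client.performRequest(index);

        // ...then search for it.
        Response resp = client.performRequest(new Request("GET", "/tweets/_search?q=user:kim"));
        System.out.println(resp.getStatusLine());

        client.close();
    }
}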
Source: https://www.elastic.co/products/elasticsearch
Delete by query
Source: https://www.elastic.co/products/elasticsearch
Source: https://www.elastic.co/products/elasticsearch
Source: https://www.elastic.co/products/elasticsearch
Source: https://www.elastic.co/products/elasticsearch
spark.apache.org
github.com/apache/spark
user@spark.apache.org
[Bar chart: project contributors in the past year (0–140 scale), comparing Spark with Giraph, Storm, and Tez]
• Function to compute each partition given its parent
• (Optional) partitioner (hash, range)
• (Optional) preferred locations for each partition
These properties enable optimized execution.
Driver
Stage 2: reduceByKey(_ + _), then saveAsTextFile (Action)
res = [(big,2), (data,1), (camp,1)]
[Diagram: the Driver schedules and executes tasks on Executors, which cache RDD blocks]

sc.textFile("/some-hdfs-data")                 // RDD[String]
  .map(parts => (parts(0), parts(1).toInt))    // RDD[(String, Int)]
[Diagram: Join and GroupBy introduce a shuffle boundary between Stage 1 and Stage 2]
Stage 1: read HDFS split → apply both maps → partial reduce → write shuffle data
Stage 2: read shuffle data → final reduce → send result to driver
Source: https://spark-summit.org/2013/talk/wendell-understanding-the-performance-of-spark-applications/
[Diagram: Stage 1 broken into Tasks 1–4]
Source: https://spark-summit.org/2013/talk/wendell-understanding-the-performance-of-spark-applications/
Pipelined execution within a task: fetch input → execute task → write output
Source: https://spark-summit.org/2013/talk/wendell-understanding-the-performance-of-spark-applications/
Source: https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf (applies to the slides that follow)

Spark execution flow – example 2 (contd…)

Build an operator DAG

Build an operator DAG (contd…)

Split graph into stages of tasks
• Pipeline as much as possible
• Split into stages of tasks

Split graph into stages of tasks (contd…)

Schedule and execute tasks

Schedule and execute tasks (contd…)
Shuffle
• Redistribute data among partitions
• Hash keys into buckets
• Optimizations
– Avoided when possible, if data is already partitioned
– Partial aggregation reduces data movement
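Partial aggregation is what reduceByKey provides: values are combined within each map-side partition before the shuffle, so less data moves. A small sketch with the Java API; the sample data is made up:

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class PartialAggDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "PartialAggDemo");
        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("big", 1), new Tuple2<>("data", 1), new Tuple2<>("big", 1)));

        // Values for each key are pre-summed within each partition,
        // so less data crosses the shuffle boundary.
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
        counts.collect().forEach(System.out::println);  // (data,1) (big,2)
        sc.stop();
    }
}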
Transformed RDD:
• val totalLength = lineLengths.reduce((a, b) => a + b)   // Action: reduce
• lineLengths.persist()                                   // Persist: store
"to be or"  → "to", "be", "or"  → (to,1), (be,1), (or,1)
"not to be" → "not", "to", "be" → (not,1), (to,1), (be,1)
After reduce: (to,2), (be,2), (or,1), (not,1)
Performance
• Java & Scala are faster due to static typing
• …but Python is often fine

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
    Boolean call(String s) {
        return s.contains("error");
    }
}).count();
Python:
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount")
    lines = sc.textFile(sys.argv[1])
    # Assumed completion: the word-count transformation the slide omitted.
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(sys.argv[2])
Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
HDFS/HBase/Storage
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
tweets DStream
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
tweets DStream → hashTags DStream
New RDDs (e.g., [#cat, #dog, …]) are created for every batch.
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
hashTags DStream – every batch is saved to HDFS
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Sliding window operations over a DStream of data are defined by a window length and a sliding interval.
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Example 2 – Count the hashtags over last 1 min
hashTags → sliding window (1 min) → countByValue
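A sketch of such a windowed count using the Java streaming API; the socket source, host, and port are assumptions for illustration:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class HashTagWindow {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("HashTagWindow");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaDStream<String> words = ssc.socketTextStream("localhost", 9999)
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaDStream<String> hashTags = words.filter(w -> w.startsWith("#"));

        // Count each tag over the last 1 minute, recomputed every 5 seconds.
        JavaPairDStream<String, Long> counts =
                hashTags.countByValueAndWindow(Durations.minutes(1), Durations.seconds(5));
        counts.print();

        ssc.start();
        ssc.awaitTermination();
    }
}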
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Key concepts
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Combine Batch and Stream Processing
tweets.transform(tweetsRDD => {
tweetsRDD.join(spamHDFSFile).filter(...)
})
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Fault-tolerance
Source: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Spark SQL
Source: http://www.slideshare.net/PetrZapletal1/spark-concepts-spark-sql-graphx-streaming
Source: http://www.slideshare.net/PetrZapletal1/spark-concepts-spark-sql-graphx-streaming