Introduction to Big Data and Hadoop
S.Mohan Kumar
Other sources
• IoT
• Smart meter monitoring
• Fraud detection
• Life sciences research
• Advertising analysis
• Equipment monitoring
Trendwise Analytics
Who is using?
Early adopters:
• Facebook
• LinkedIn
• Yahoo: More than 100,000 CPUs in >40,000 computers running Hadoop.
Biggest cluster: 4,500 nodes (2*4 CPU boxes with 4*1TB disk & 16GB RAM)
• Inmobi: Running Apache Hadoop on around 700 nodes (16,800 cores, 5+ PB)
in 6 data centers for ETL, analytics, data science and machine learning
• Ebay: 532-node cluster (8 * 532 cores, 5.3PB)
Mainstream...
Other Examples:
UPS uses Big Data to track 16.3 million packages per day for 8.8 million
customers, handling an average of 39.5 million tracking requests from
customers per day. The company stores over 16 petabytes of data.
Source: http://www.csc.com/big_data/success_stories
Some more...
Closer home..
TECHNOLOGY
Technology options
Hadoop – most popular
NoSQL Databases – Cassandra, MongoDB
SAP Hana - Proprietary
Spark
…..
What is Hadoop?
Open source project started by Doug Cutting
A platform to manage Big Data
Helps in Distributed computing
Runs on Commodity Hardware
What is Hadoop?
Hadoop uses simple, robust techniques on inexpensive computer systems to
deliver very high data availability and to analyze enormous amounts of
information quickly. Hadoop offers enterprises a powerful new tool for
managing big data.
Hadoop Is:
• Open source, written in Java
• An Apache project
• Targeted at clusters of commodity PCs
• Cost-effective bulk computing
• A distributed file system
• Modeled on GFS
• A distributed processing framework
• Using the Map/Reduce metaphor
HDFS (general):
• A scalable, fault-tolerant, high-performance distributed file system
(storage)
• No RAID required
• Access from C, Java, FUSE, WebDAV, Thrift
• Single namespace for the entire cluster
• Managed by a single Namenode
• Hierarchical directories
• Optimized for streaming reads of large files
Hive
• A data warehouse infrastructure for Hadoop providing data summarization
and SQL-like querying (HiveQL)
Pig
• A high-level data-flow language and execution framework for parallel
computation
• Makes Map Reduce programs simple to write
• Abstracts you from specific details
• Lets you focus on data processing
• Data flow
• For data manipulation
Sqoop
• A tool designed to help users import data from existing relational
databases into their Hadoop clusters
• Automatic data import
• "SQL to Hadoop"
• Generates code for use in MapReduce applications
Zookeeper
• A high-performance coordination service for
distributed applications
• Zookeeper is a centralized service for
maintaining configuration information, naming,
providing distributed synchronization, and
providing group services
Avro
• A data serialization system that provides
dynamic integration with scripting languages
• Avro Data is smaller and faster
• Avro RPC
Chukwa
• A data collection system for managing large
distributed systems
• Built on HDFS and MapReduce
• Toolkit for displaying, monitoring and analyzing log files
HBase
• HBase is the Hadoop database. Think of it as a
distributed scalable Big Data store.
• HBase can be used when you require random,
real-time read/write access to your Big Data
• HBase is an open-source, distributed, versioned,
column-oriented store modeled after Google's
Bigtable: A Distributed Storage System
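The versioned, column-oriented model above can be made concrete with a tiny plain-Java sketch. This is not the HBase client API: the row and column names are invented for illustration, and real HBase adds column families, regions, and persistence on HDFS.

```java
// Plain-Java sketch of HBase's data model: a table maps
// rowKey -> column -> timestamp -> value, with timestamps kept
// newest-first so a read returns the latest version by default.
import java.util.Comparator;
import java.util.TreeMap;

public class MiniColumnStore {
    // rowKey -> column -> (timestamp, newest first) -> value
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();

    public void put(String row, String column, long timestamp, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
             .put(timestamp, value);
    }

    // Read the newest version of a cell, or null if absent.
    public String get(String row, String column) {
        TreeMap<String, TreeMap<Long, String>> columns = table.get(row);
        if (columns == null) return null;
        TreeMap<Long, String> versions = columns.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        MiniColumnStore store = new MiniColumnStore();
        store.put("row1", "cf:name", 1L, "v1");
        store.put("row1", "cf:name", 2L, "v2");
        System.out.println(store.get("row1", "cf:name")); // newest version: v2
    }
}
```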
Benefits of Hadoop
• Hadoop is designed to run on cheap commodity
hardware
• It automatically handles data replication and node
failure
• Handles large volumes of unstructured data easily
• Last but not least – it's free! (open source)
Source: http://pivotalhd.docs.pivotal.io/docs/getting-started.html
Modes of Hadoop
• Distributed mode – cluster (for production)
• Pseudo-distributed mode – single node
• Local (standalone) mode – single Java process; not very common
Hadoop 2.x
• Introduction of YARN
• HDFS Federation
• Windows OS support
Hadoop distributions:
• Cloudera
• Hortonworks
• MapR
• Greenplum, A Division of EMC
• IBM InfoSphere BigInsights
Default web UI ports:
Namenode 50070
Jobtracker 50030
Tasktracker 50060
Job Tracker
Task Tracker
HDFS
Hardware Failure
Detection of faults and quick, automatic recovery from them is a core architectural
goal of HDFS.
HDFS is designed for batch processing rather than interactive use by users.
The emphasis is on high throughput of data access rather than low latency of data
access.
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes
to terabytes in size. Thus, HDFS is tuned to support large files.
HDFS applications need a write-once-read-many access model for files. A file once
created, written, and closed need not be changed. This assumption simplifies data
coherency issues and enables high throughput data access.
HDFS has been designed to be easily portable from one platform to another.
Replication Factor
Heartbeat
Blockreport
Rack awareness
High Availability
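Rack awareness in particular drives HDFS's default replica placement: first replica on the writer's node, second on a node in a different rack, third on another node in that second rack. The sketch below is a simplified plain-Java illustration, not Hadoop code; node names like rack1/dn1 are invented, and real HDFS also weighs node load and free space.

```java
// Simplified sketch of HDFS's default rack-aware replica placement.
// Nodes are encoded as "rack/host" strings for this illustration.
import java.util.ArrayList;
import java.util.List;

public class RackAwareness {
    static String rackOf(String node) { return node.split("/")[0]; }

    public static List<String> placeReplicas(String writer, List<String> nodes) {
        List<String> replicas = new ArrayList<>();
        replicas.add(writer); // 1st replica: the writer's own node
        // 2nd replica: any node on a different rack than the writer
        String second = nodes.stream()
                .filter(n -> !rackOf(n).equals(rackOf(writer)))
                .findFirst().orElseThrow();
        replicas.add(second);
        // 3rd replica: a different node on the same rack as the 2nd
        String third = nodes.stream()
                .filter(n -> rackOf(n).equals(rackOf(second)) && !n.equals(second))
                .findFirst().orElseThrow();
        replicas.add(third);
        return replicas;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("rack1/dn1", "rack1/dn2", "rack2/dn3", "rack2/dn4");
        System.out.println(placeReplicas("rack1/dn1", nodes));
        // [rack1/dn1, rack2/dn3, rack2/dn4]
    }
}
```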
HDFS Components
Name Node
Data Node
HDFS Architecture
Namenode
– Hardware server
– Software Daemon
Namenode
In addition to keeping the metadata in RAM, the Namenode saves it to disk
from time to time in two files:
– fsimage
– edits
Name Node
This is the controller, or master, of the cluster, which manages all operations
requested by clients. The Name Node performs the following major operations:
• Allocation of blocks to files
• Monitoring Data Nodes for failures and new Data Node additions
• Replication management
• User request management – e.g. writing and reading files
• Transaction tracking and logging of transactions
Meta Data
This contains all information about the file system, like: file and directory
names, the mapping of files to blocks, block locations, and permissions.
Secondary Namenode
Datanode
It consists of :
– Software Daemon
Datanode
– Replication – default 3
– Configurable
Data Node
This is where data is stored in HDFS. The Data Node is not aware of which file a
block belongs to; it writes data to the local file system in the form of blocks.
The Data Node has the following major functionalities:
• Write/read blocks to/from local files
• Perform operations as directed by the Name Node
• Register/heartbeat itself with the Name Node and provide block reports to the Name Node
Replication – Factor 3
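The heartbeat relationship above can be sketched in plain Java. This is an illustration, not Hadoop code; the 10-minute timeout mirrors HDFS's default dead-node interval, after which the Name Node re-replicates the dead node's blocks.

```java
// Plain-Java sketch of how the Name Node could track Data Node
// heartbeats: each Data Node reports in periodically, and a node
// whose last heartbeat is older than the timeout is treated as dead.
import java.util.HashMap;
import java.util.Map;

public class HeartbeatMonitor {
    // 10 minutes, mirroring HDFS's default dead-node interval.
    public static final long TIMEOUT_MS = 10L * 60 * 1000;

    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    // Record a heartbeat from a Data Node at the given clock time.
    public void heartbeat(String dataNode, long nowMs) {
        lastHeartbeat.put(dataNode, nowMs);
    }

    // A node is dead if it never registered or its heartbeat is too old.
    public boolean isDead(String dataNode, long nowMs) {
        Long last = lastHeartbeat.get(dataNode);
        return last == null || nowMs - last > TIMEOUT_MS;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor();
        monitor.heartbeat("dn1", 0L);
        System.out.println(monitor.isDead("dn1", 60_000L));         // false: 1 min ago
        System.out.println(monitor.isDead("dn1", 11L * 60 * 1000)); // true: 11 min ago
    }
}
```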
Client
The client is whatever communicates with HDFS using a predefined API: it
could be the FS utility provided with Hadoop, the Java API provided by
Hadoop, or some other API that communicates with Hadoop in a
language-independent manner.
• The Client breaks File.txt into (3) blocks. For each block, the Client
consults the Name Node and receives a list of (3) Data Nodes that should
have a copy of this block.
• The Client then writes the block directly to the first Data Node.
• After the Data Node completes writing the block, it replicates the same
block to the other Data Nodes.
• The cycle repeats for the remaining blocks.
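The write path above can be sketched in plain Java. This is an illustration, not the Hadoop client API; the 128 MB block size matches a common HDFS default, but the round-robin node choice is a simplifying assumption (real HDFS placement is rack-aware).

```java
// Sketch: how a client might split a file into fixed-size blocks and
// ask a Name Node stand-in for 3 Data Nodes per block.
import java.util.ArrayList;
import java.util.List;

public class WritePathSketch {
    public static final long BLOCK_SIZE = 128L * 1024 * 1024; // common HDFS default
    public static final int REPLICATION = 3;                  // default replication factor

    // Number of blocks a file of the given size is split into.
    public static int blockCount(long fileSize) {
        return (int) ((fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
    }

    // Stand-in for the Name Node: pick REPLICATION Data Nodes for one
    // block by rotating round-robin through the cluster's node list.
    public static List<String> chooseDataNodes(List<String> cluster, int blockIndex) {
        List<String> replicas = new ArrayList<>();
        for (int r = 0; r < REPLICATION; r++) {
            replicas.add(cluster.get((blockIndex + r) % cluster.size()));
        }
        return replicas;
    }

    public static void main(String[] args) {
        List<String> cluster = List.of("dn1", "dn2", "dn3", "dn4");
        long fileSize = 300L * 1024 * 1024; // 300 MB -> 3 blocks
        for (int b = 0; b < blockCount(fileSize); b++) {
            System.out.println("block " + b + " -> " + chooseDataNodes(cluster, b));
        }
    }
}
```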
Copy single src, or multiple srcs from local file system to the destination
filesystem. Also reads input from stdin and writes to destination filesystem.
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile
Reads the input from stdin.
chmod
Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]
Change the permissions of files. With -R, make the change recursively through the
directory structure. The user must be the owner of the file, or else a super-user.
• The address and port where the web interface listens can be changed by
setting dfs.http.address in conf/hadoop-site.xml. It must be of the form
address:port. To accept requests on all addresses, use 0.0.0.0.
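As a concrete example of the setting described above, moving the NameNode web UI to port 8070 on all interfaces might look like this (the port 8070 is an arbitrary choice for illustration):

```xml
<!-- conf/hadoop-site.xml -->
<!-- Move the NameNode web UI to port 8070, listening on all addresses. -->
<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:8070</value>
</property>
```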
• From this interface, you can browse HDFS itself with a basic file-browser
interface. Each DataNode exposes its file browser interface on port 50075.
You can override this by setting the dfs.datanode.http.address
configuration key to a setting other than 0.0.0.0:50075. Log files
generated by the Hadoop daemons can be accessed through this interface,
which is useful for distributed debugging and troubleshooting.
HDFS commands
2. Transferring Files
hadoop fs -get (transfer a file from HDFS to local)
hadoop fs -put (transfer a file from local to HDFS)
hadoop fs -cp (copy a file within HDFS)
hadoop fs -copyFromLocal (same as put)
hadoop fs -copyToLocal (same as get)
3. Other commands
hadoop fs -cat (display the contents of a file within HDFS)
hadoop fs -du <file_name> (display the file size)
hadoop fs -tail <file_name> (display the last kilobyte of the file)
hadoop fs -rm (remove a particular file in HDFS)
hadoop fs -mkdir <directory_name> (create a new directory in HDFS)
hadoop fs -dus <directory_name> (display the directory size)
hadoop fs -rmr (remove a directory in HDFS)
Hands on
MapReduce
MapReduce
Jobtracker (master)
Tasktracker (slave)
Jobtracker
Tasktracker
MapReduce Program
– Mapper
– Reducer
Driver
InputFormat:
– TextInputFormat ( Default )
– KeyValueTextInputFormat
– SequenceFileInputFormat
– SequenceFileAsTextInputFormat
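To see what the default TextInputFormat hands to a mapper, here is a plain-Java approximation: the key is the byte offset where each line starts and the value is the line itself. Real Hadoop uses LongWritable/Text types and handles input-split boundaries; this sketch ignores both.

```java
// Sketch of TextInputFormat's record semantics: one record per line,
// keyed by the byte offset of the line's start in the file.
import java.util.LinkedHashMap;
import java.util.Map;

public class TextInputSketch {
    public static Map<Long, String> records(String fileContents) {
        Map<Long, String> recs = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            recs.put(offset, line);
            offset += line.length() + 1; // +1 for the '\n' separator
        }
        return recs;
    }

    public static void main(String[] args) {
        System.out.println(records("Hello World\nHello Trendwise"));
        // {0=Hello World, 12=Hello Trendwise}
    }
}
```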
Mapper output:
Hello, 1
World, 1
Hello, 1
Trendwise, 1
Reducer output:
Hello, 2
World, 1
Trendwise, 1
part-r-00000
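The Mapper/Reducer flow above can be mimicked end-to-end in a few lines of plain Java, with no Hadoop required: map emits (word, 1) pairs, the shuffle groups them by key, and reduce sums each group, reproducing the counts shown.

```java
// Plain-Java sketch of the WordCount data flow: map -> shuffle -> reduce.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map phase: one (word, 1) pair per word in the input line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle + reduce: group pairs by key and sum the values.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(map("Hello World Hello Trendwise")));
        // {Hello=2, Trendwise=1, World=1}
    }
}
```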
Demo
Yarn Architecture