
Trendwise Analytics

Introduction to Big Data and Hadoop
S. Mohan Kumar

Copyright Trendwise Analytics



Internet Minute ....


Other sources

• A commercial aircraft generates 3GB of flight sensor data in 1 hour
• An ERP system for a mid-size company grows by 1-2TB annually
• A video surveillance camera generates 1-3TB of data in 3 months
• Airtel or Vodafone generates 3TB of Call Detail Records (CDR) every day
• Every day 2.5 quintillion (2.5×10^18) bytes of data is created, i.e., 2,500,000TB
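The daily-volume figure above is easy to sanity-check. A minimal sketch, assuming decimal (SI) units where 1 TB = 10^12 bytes, which is what the slide's 2,500,000TB figure implies:

```python
# Sanity check of the slide's figure: 2.5 quintillion bytes per day.
# Assumes decimal (SI) units: 1 TB = 10**12 bytes.
def bytes_to_tb(n_bytes: int) -> float:
    """Convert a byte count to terabytes (decimal units)."""
    return n_bytes / 10**12

daily_bytes = int(2.5e18)              # 2.5 quintillion bytes created per day
daily_tb = bytes_to_tb(daily_bytes)
print(f"{daily_tb:,.0f} TB per day")   # → 2,500,000 TB per day
```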


IoT


How are Organizations using Big Data Technology?

Across All Industries

IT infrastructure Legal Social network Traffic flow Web app


optimization discovery analysis optimization optimization

Churn Natural resource Weather Healthcare


analysis exploration forecasting outcomes

Smart meter
Fraud Life sciences Advertising Equipment
monitoring
detection research analysis monitoring
Trendwise Analytics

What are the types of business problems?

Source: Cloudera “Ten Common Hadoopable Problems”



Watson wins Jeopardy!

Feb 14th 2011 – Watson wins Jeopardy!, beating its human opponents.
Watson is IBM’s supercomputer built using Big Data Technology.

Who is using?
Early adopters:


Who is using?

Facebook – 2 major clusters:
• An 1100-machine cluster with 8800 cores and about 12 PB raw storage.
• A 300-machine cluster with 2400 cores and about 3 PB raw storage.

LinkedIn:
• ~800 Westmere-based HP SL 170x, with 2x4 cores, 24GB RAM, 6x2TB SATA
• ~1900 Westmere-based SuperMicro X8DTT-H, with 2x6 cores, 24GB RAM, 6x2TB SATA
• ~1400 Sandy Bridge-based SuperMicro with 2x6 cores, 32GB RAM, 6x2TB SATA

Who is Using?

Yahoo:
• More than 100,000 CPUs in >40,000 computers running Hadoop
• Biggest cluster: 4500 nodes (2*4 CPU boxes with 4*1TB disk & 16GB RAM)

Inmobi:
• Running Apache Hadoop on around 700 nodes (16,800 cores, 5+ PB) in 6 Data Centers for ETL, Analytics, Data Science and Machine Learning

Ebay:
• A 532-node cluster (8 * 532 cores, 5.3PB).

Mainstream...
Other Examples:
UPS implemented Big Data to track data on 16.3 million
packages per day for 8.8 million customers, with an
average of 39.5 million tracking requests from customers
per day. The company stores over 16 petabytes of data.

Source: http://www.csc.com/big_data/success_stories


Some more...

More detailed list


Big Data usage

• New $1B corporate center for software and analytics
• Hiring 400 data scientists
• Includes financial and marketing applications, but with a special focus on industrial uses of big data (e.g., when will this gas turbine need maintenance?)

Ford:
• Collects and aggregates data from the 4 million vehicles that use in-car sensing and remote app management software
• The data allows Ford to glean information on a range of issues, from how drivers are using their vehicles to the driving environment, which could help improve vehicle quality
• Partnered with Microsoft to develop SYNC

Amazon:
• Has been collecting customer information for years – not just addresses and payment information, but the identity of everything that a customer ever bought or even looked at
• Uses that data to build customer relationships

AT&T:
• Has 300 million customers
• A team of researchers is working to turn data collected through the company’s cellular network into a trove of information for policymakers, urban planners and traffic engineers
• The researchers want to see how the city changes hourly by looking at calls and text messages relayed through cell towers around the region, noting that certain towers see more activity at different times

Closer home..

Aadhar – Govt. of India's UIDAI project


TECHNOLOGY


Technology options
• Hadoop – most popular
• NoSQL databases – Cassandra, MongoDB
• SAP HANA – proprietary
• Spark
• …

What is Hadoop?
• Open source project started by Doug Cutting
• A platform to manage Big Data
• Helps in distributed computing
• Runs on commodity hardware

Why Hadoop - What is the motivation?

• Digital information produced in 2011 – 1,800 Exabytes
• It will be ten times that by 2017
• This data will be “unstructured” – complex data poorly suited to management by structured storage systems like relational databases.

Why Hadoop - What is the motivation?

Unstructured data comes from many sources and takes many forms:
• web logs
• text files
• sensor readings
• product reviews
• text messages
• audio
• video
• photos
• etc.

Why Hadoop - What is the motivation?

Large volumes of complex data can hide important insights:
• Are there buying patterns in point-of-sale data that can forecast demand for products at particular stores?
• Do user logs from a web site, or calling records in a mobile network, contain information about relationships among individual customers?
• Can a collection of nucleotide sequences be assembled into a single gene?

Dealing with big data requires two things:
• Inexpensive, reliable storage; and
• New tools for analyzing unstructured and structured data.

Why Hadoop - What is the motivation?

Apache Hadoop is a powerful open source software platform that addresses both of these problems. Hadoop enables a computing solution that is:
• Scalable – New nodes can be added as needed, without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
• Cost effective – Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
• Flexible – Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
• Fault tolerant – When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

What is Hadoop?

Hadoop uses simple, robust techniques on inexpensive computer systems to deliver very high data availability and to analyze enormous amounts of information quickly. Hadoop offers enterprises a powerful new tool for managing big data.

Hadoop is:
• Open source, written in Java
• An Apache project
• Targeted at clusters of commodity PCs
• Cost-effective bulk computing
• A distributed file system, modeled on GFS
• A distributed processing framework, using the Map/Reduce metaphor

How Hadoop came into Existence – Origins of Hadoop

Hadoop got its start in Nutch.
• Problem:
– A few enthusiastic developers were attempting to build an open source web search engine
– And were having trouble managing computations running on even a handful of computers
• Light:
– Once Google published their GoogleFS and MapReduce whitepapers, the way forward became clear
– Google had devised systems to solve precisely the problems the Nutch project was facing
• Birth:
– Thus, Hadoop was born

History of Hadoop – Hadoop Timeline

• Seeds of Hadoop were planted in 2002
• Started by Doug Cutting and Mike Cafarella

What Hadoop is used for?


• Searching
• Log processing
• Recommendation systems
• Business Intelligence / Data Warehousing
• Video and Image analysis
• Archiving


Who uses Hadoop?


• Facebook
• Google
• Yahoo
• IBM
• Twitter
• Amazon
• AOL
• FOX
• The New York Times


Core components of Hadoop

• Data storage (HDFS)
– Runs on commodity hardware
– Horizontally scalable
• Processing (MapReduce)
– Parallelized (scalable) processing
– Fault tolerant
• Other tools / frameworks
– HBase, Hive, Pig, Mahout

Hadoop Component - HDFS

• General
– A scalable, fault-tolerant, high-performance distributed file system (storage)
– No RAID required
– Access from C, Java, FUSE, WebDAV, Thrift
• Single namespace for the entire cluster
– Managed by a single name node
• Hierarchical directories
• Optimized for streaming reads of large files

Hadoop Component - HDFS


Hadoop Component – Map / Reduce


• Software framework for distributed computation
• Input | Map() | Copy/Sort | Reduce() | Output
• The JobTracker (which usually runs alongside the NameNode) schedules and manages jobs
• A TaskTracker executes individual map() and reduce() tasks on each DataNode


Hadoop Component – Map / Reduce


Introduction to the Hadoop Eco-System


Hive

• Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
• MetaStore – metadata, tables, etc.
• Hive Query Language – SQL-like: SELECT, GROUP BY, JOIN


Pig
• A high-level data-flow language and execution framework for parallel computation
• Makes it simple to write MapReduce programs
• Abstracts you from specific MapReduce details
• Focused on data processing and data flow
• Used for data manipulation


Sqoop
• Sqoop is a tool designed to help users import data from existing relational databases into their Hadoop clusters
• Automatic data import
• SQL to Hadoop
• Generates code for use in MapReduce applications


Zookeeper
• A high-performance coordination service for
distributed applications
• Zookeeper is a centralized service for
maintaining configuration information, naming,
providing distributed synchronization, and
providing group services


Avro
• A data serialization system that provides
dynamic integration with scripting languages
• Avro Data is smaller and faster
• Avro RPC


Chukwa
• A data collection system for managing large
distributed systems
• Built on HDFS and MapReduce
• Toolkit for displaying, monitoring and analyzing the log files


HBase
• HBase is the Hadoop database. Think of it as a
distributed scalable Big Data store.
• HBase can be used when you require random, real-time read/write access to your Big Data
• HBase is an open-source, distributed, versioned,
column-oriented store modeled after Google's
Bigtable: A Distributed Storage System


Benefits of Hadoop
• Hadoop is designed to run on cheap commodity
hardware
• It automatically handles data replication and node
failure
• Handles large volumes of unstructured data easily
• Last but not least – it’s free! (open source)

Divide and conquer!

Source:http://pivotalhd.docs.pivotal.io/docs/getting-started.html

Modes of Hadoop
• Distributed mode – cluster (for production)
• Pseudo-distributed mode – single node
• Local (standalone) mode – a single Java process; not very common

Hadoop Components (2.X)


What's new in Hadoop 2.0

• Introduction of YARN
• High Availability – Standby Name Node
• HDFS Federation
• Windows OS support

Commercial Hadoop Distributions

• Cloudera
• Hortonworks
• MapR
• Greenplum, A Division of EMC
• IBM InfoSphere BigInsights


Hadoop UI – Browser based

Hadoop Daemon    Port
Namenode         50070
Jobtracker       50030
Tasktracker      50060

Web Interface – NameNode Tracker



Job Tracker

Task Tracker

HDFS

HDFS Design: Assumptions and Goals

Hardware Failure
Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Streaming Data Access
HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files.

HDFS Design: Assumptions and Goals

Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access.

“Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system.

Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another.

HDFS: Important Terms and Concepts

• Replication Factor
• Heartbeat
• Blockreport
• Rack awareness
• High Availability

HDFS Components

• Name Node
• Secondary Name Node
• Standby Name Node
• Data Node

HDFS Architecture
HDFS ARCHITECTURE AND COMPONENTS

Namenode

This is the Master component:
– A hardware server
– A software daemon

• Stores all the information about the entire cluster
• Stores only metadata – not the actual data
• Stores the information in RAM for quick retrieval
– Hence the system needs to have more memory
– Storage (disk space) can be less

*IMP: The Namenode is the single point of failure of Hadoop. This problem is solved to some extent in Hadoop 2.0.

Namenode

In addition to keeping it in RAM, the Namenode saves the information to disk from time to time in two files:
– fsimage
– edits

*edits is the log file – it stores log information about files created/deleted/updated.

In case of a Namenode restart the following sequence is followed:
1. Read the fsimage and load it into RAM
2. Apply the edits to the fsimage
3. Write the updated fsimage to disk
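The restart sequence above can be sketched as a toy model. This is illustrative only: the dict-based image and (operation, path) edit records stand in for Hadoop's real on-disk fsimage and edits formats:

```python
# Toy model of the Namenode restart sequence: load fsimage, replay the
# edits log in order, then persist the merged image.
def replay_edits(fsimage: dict, edits: list) -> dict:
    """Apply logged create/delete operations to an in-memory image."""
    image = dict(fsimage)           # 1. fsimage loaded into RAM
    for op, path in edits:          # 2. apply edits in order
        if op == "create":
            image[path] = {}
        elif op == "delete":
            image.pop(path, None)
    return image                    # 3. caller writes the updated image to disk

image = replay_edits({"/user": {}}, [("create", "/user/a.txt"),
                                     ("delete", "/user/a.txt"),
                                     ("create", "/user/b.txt")])
print(sorted(image))  # → ['/user', '/user/b.txt']
```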

Name Node

This is the controller or Master of the cluster, which manages all operations requested by clients. The Name Node performs the following major operations:
• Allocation of blocks to files
• Monitoring Data Nodes for failures and new Data Node additions
• Replication management
• User request management – like writing and reading files
• Transaction tracking and logging of transactions

Meta Data

This contains all information about the file system, like:
• where the blocks related to a file are
• where blocks are replicated
• the number of replica copies
• where data nodes are running
• what space is available on each
• the directory structure, etc.

It also keeps track of transactions happening in the system in the Edit Log.

Secondary Namenode

• The Secondary Namenode is a software daemon
• It runs on a machine different from the Namenode
– It could be one of the data nodes
• It helps in restarting the Namenode faster when it fails
– But it is not a hot standby
• Its job is to create checkpoints of the filesystem at regular intervals

This is how it works:
1. Get the fsimage from the Namenode
2. Get the edits log file from the Namenode
3. Apply the edit logs to the fsimage
4. Copy the updated fsimage back to the Namenode

Secondary Namenode

Datanode

• The Datanode is the slave component
• It consists of:
– A software daemon
– Running on a datanode system (hardware)
• A cluster consists of several datanodes
• All the data is distributed and stored on the datanodes
• Datanode hardware needs more disk storage and less RAM

Datanode

• Data is broken into blocks and stored on the Datanodes
• Default block size is 64MB (configurable)
• Multiple copies of each block are stored on different nodes
– Replication factor – default 3, configurable
• A Datanode regularly communicates with the Namenode
– This is known as the Heartbeat – default every 3 seconds
– It also sends the block report
• If a Datanode misses 10 heartbeats, the Namenode marks it as “Dead”
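The block and replication arithmetic above can be sketched directly. A minimal sketch that mirrors the defaults quoted on this slide (64MB blocks, replication 3); it is numbers only, not a real HDFS client:

```python
# Sketch of the slide's numbers: a file is split into 64MB blocks and
# each block is stored replication-factor times across different nodes.
BLOCK_SIZE_MB = 64   # default block size (configurable)
REPLICATION = 3      # default replication factor (configurable)

def block_count(file_size_mb: int, block_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks needed for a file; the last block may be partial."""
    return -(-file_size_mb // block_mb)   # ceiling division

blocks = block_count(200)        # a 200 MB file -> 3 full blocks + 1 partial
copies = blocks * REPLICATION    # physical copies stored across the cluster
print(blocks, copies)            # → 4 12
```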



Data Node

This is where data is stored in HDFS. The Data Node is not aware of which file its data belongs to; it writes data to the local file system in the form of blocks. The Data Node has the following major functionalities:
• Write/read blocks to/from local files
• Perform operations as directed by the Name Node
• Register/heartbeat itself with the Name Node and provide block reports to the Name Node

Replication – Factor 3

Client

The client is whatever communicates with HDFS using a predefined API: the FS utility provided in Hadoop, the Java API provided by Hadoop, or some other API that talks to Hadoop in a language-independent manner.

The following section covers the flow of different operations and where each component comes into the picture, which will make clear how HDFS works internally.

HDFS Write Operation


HDFS Write Operation

• The client breaks File.txt into (3) blocks. For each block, the client consults the Name Node and receives a list of (3) Data Nodes that should have a copy of this block.
• The client then writes the block directly to the Data Node.
• After completing the write of a block, the Data Node replicates that block to the other Data Nodes.
• The cycle repeats for the remaining blocks.

Introducing the Hadoop HDFS power -> Pipelined Writes

Preparing - Pipelined Writes


• Client tells the Name Node the directory and file (File.txt) it wishes to write
to HDFS.
• Name Node checks permissions and responds to Client with grant
(Writeable).
• Client sends Add Block request to Name Node for Block (1) of File.txt.
• Name Node responds with a list of (3) Data Nodes, including IP address,
port number, hostnames, rack numbers etc
• Client initiates connection to Data Node (1) in the list
• Client sends Data Node (1) the list of the other (2) Data Nodes
( including IP address, port, etc.)


Preparing - Pipelined Writes


• Data Node (1) initiates connection to Data Node (2)
• Data Node (1) provides Data Node (2) information about Data Node (3)
(including IP address, port, etc.)
• Data Node (2) initiates connection to Data Node (3)
• Data Node (2) provides information to Data Node (3)
• Data Node (3) ACKs by sending a message to Data Node (2)
• Data Node (2) ACKs by sending a message to Data Node (1)
• Data Node (1) ACKs by sending Client same 3 Byte string of 0’s (000000) –
“pipeline ready” message



Writing the Data to Hadoop HDFS


• Client begins sending block data to Data Node (1).
• Client sends Data Node (1) block data.
• Data Node (1) begins sending block data to Data Node (2).
• Data Node (2) begins sending block data to Data Node (3).
• When completed, each Data Node reports to Name Node “block received”
with block info
• Data Node (1) sends a string completion ACK to Client
• Client reports “success” to Name Node.
• Name Node knows (3) replicas of the block exist.

~Entire process repeats for remaining blocks of File.txt~


Next block is not transferred to HDFS until previous block
successfully completes.
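The write pipeline described above can be mimicked with a small simulation. The node names and block payload here are made up for illustration; real HDFS streams packets and ACKs over TCP connections between the nodes:

```python
# Minimal simulation of a pipelined write: the client streams a block to
# Data Node 1, each node forwards it downstream, and ACKs travel back up
# the chain to the client.
def pipeline_write(block: bytes, nodes: list) -> list:
    """Forward a block down the pipeline; return node ACKs in the order
    they propagate back to the client (last node first)."""
    stores = {}
    for node in nodes:                  # dn1 -> dn2 -> dn3 forwarding
        stores[node] = block
    acks = list(reversed(nodes))        # ACKs: dn3 -> dn2 -> dn1 -> client
    assert all(stores[n] == block for n in nodes)   # every replica written
    return acks

acks = pipeline_write(b"block-1 data", ["dn1", "dn2", "dn3"])
print(acks)  # → ['dn3', 'dn2', 'dn1']
```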
 

HDFS Read Operation


HDFS Read Operation


• Client asks Name Node for the block locations of File.txt
• Name Node provides Client a unique list of (3) Data Nodes for each block
• Client chooses the first Data Node in each list for reading the block
• Blocks are read sequentially
• Reading subsequent blocks does not begin until the previous block finishes


Hadoop – “Rack Awareness”


Hadoop shell commands


• HDFS organizes its data in files and
directories
• It provides a command line interface
called the FS shell that lets the user
interact with data in the HDFS.
• The syntax of the commands is similar to
bash and csh.


Hadoop shell commands


Cat command
 
Usage: hadoop fs -cat URI [URI …]
 
Copies source paths to stdout.
 
Example:
 
hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -cat file:///file3 /user/hadoop/file4
 


Hadoop shell commands


mkdir
 
Usage: hadoop fs -mkdir <paths>
 
Takes path uri's as argument and creates directories. The behavior is much
like unix mkdir -p creating parent directories along the path.
 
Example:
 
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir
hdfs://nn2.example.com/user/hadoop/dir


Hadoop shell commands


ls
 
Usage: hadoop fs -ls <args>
 
For a file returns stat on the file with the following format:
 
permissions number_of_replicas userid groupid filesize modification_date
modification_time filename
 


Hadoop shell commands


 
For a directory it returns the list of its direct children, as in Unix. A directory is listed as:
 
permissions userid groupid modification_date modification_time dirname
 
Example:
 
hadoop fs -ls /user/hadoop/file1


Hadoop shell commands


put
 
Usage: hadoop fs -put <localsrc> ... <dst>

 Copy single src, or multiple srcs from local file system to the destination
filesystem. Also reads input from stdin and writes to destination filesystem.
 
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile
Reads the input from stdin.


Hadoop shell commands


copyFromLocal
 
Usage: hadoop fs -copyFromLocal <localsrc> URI
 
Similar to put command, except that the source is
restricted to a local file reference.
 


Hadoop shell commands


get
 
Usage: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
 
Copy files to the local file system. Files that fail the CRC check may be copied
with the -ignorecrc option. Files and CRCs may be copied using the -crc
option.
 
Example:
 
hadoop fs -get /user/hadoop/file localfile
hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile


Hadoop shell commands


copyToLocal
 
Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI
<localdst>
 
Similar to get command, except that the destination is
restricted to a local file reference.
 


Hadoop shell commands


moveFromLocal
 
Usage: hadoop fs -moveFromLocal <localsrc> <dst>
 
Similar to put command, except that the source localsrc
is deleted after it's copied.


Hadoop shell commands


rm
 
Usage: hadoop fs -rm URI [URI …]
 
Delete files specified as args. Only deletes files and empty directories.
Refer to rmr for recursive deletes.
Example:
 
hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir
 
 


Hadoop shell commands


test
 
Usage: hadoop fs -test -[ezd] URI
 
Options:
-e check to see if the file exists. Return 0 if true.
-z check to see if the file is zero length. Return 0 if true.
-d check to see if the path is directory. Return 0 if true.
 
Example:
 
hadoop fs -test -e filename

Hadoop shell commands – Change Permissions

chgrp
 
Usage: hadoop fs -chgrp [-R] GROUP URI [URI …]
 
Change group association of files. With -R, make the change recursively
through the directory structure. The user must be the owner of files, or
else a super-user.


Hadoop shell commands – Change Permissions

chmod
 
Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]
 
Change the permissions of files. With -R, make the change recursively through the
directory structure. The user must be the owner of the file, or else a super-user.


Hadoop File System Web Interface


• HDFS exposes a web server which is capable of performing basic status
monitoring and file browsing operations. By default this is exposed on port
50070 on the NameNode. Accessing http://namenode:50070/ with a web
browser will return a page containing overview information about the
health, capacity, and usage of the cluster (similar to the information
returned by bin/hadoop dfsadmin -report).

• The address and port where the web interface listens can be changed by
setting dfs.http.address in conf/hadoop-site.xml. It must be of the form
address:port. To accept requests on all addresses, use 0.0.0.0.
• From this interface, you can browse HDFS itself with a basic file-browser
interface. Each DataNode exposes its file browser interface on port 50075.
You can override this by setting the dfs.datanode.http.address
configuration key to a setting other than 0.0.0.0:50075. Log files
generated by the Hadoop daemons can be accessed through this interface,
which is useful for distributed debugging and troubleshooting.


HDFS commands

1. File and directory display
hadoop fs -ls (lists the files inside HDFS)
hadoop fs -lsr <directory_name> (displays the whole directory recursively)

2. Transferring files
hadoop fs -get (file transfer from HDFS to local)
hadoop fs -put (file transfer from local to HDFS)
hadoop fs -cp (file transfer within HDFS)
hadoop fs -copyFromLocal (same as put)
hadoop fs -copyToLocal (same as get)

3. Other commands
hadoop fs -cat (display the content of a file within HDFS)
hadoop fs -du <File_name> (display the file size)
hadoop fs -tail <File_name> (display the last kilobyte of the file)
hadoop fs -rm (remove a particular file in HDFS)
hadoop fs -mkdir <directory_name> (create a new directory in HDFS)
hadoop fs -dus <directory_name> (display the directory size)
hadoop fs -rmr (remove a file directory in HDFS)

Hands on

MapReduce

MapReduce

• MapReduce is Hadoop’s processing engine.
• It is a framework – not a programming language.
• We can use programming languages like Java and Python.
• It consists of the following daemons:
– Jobtracker (master)
– Tasktracker (slave)


Introduction to the Map / Reduce Concept

• User submits a MapReduce job
• Data is a sequence of keys and values
• The framework partitions the job into lots of tasks
• Mapper transforms:
– Input: key1, value1 pairs
– Output: key2, value2 pairs
• Reducer combines:
– Input: key2, stream of value2
– Output: key3, value3 pairs
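The key/value flow above can be illustrated without Hadoop at all. A toy, in-memory sketch of the map → sort/group → reduce pipeline; the mapper and reducer here are placeholders, not a real Hadoop job:

```python
# In-memory illustration of the MapReduce key/value flow:
# (key1, value1) -> mapper -> (key2, value2) -> sort/group -> reducer -> (key3, value3)
from itertools import groupby

def run_mapreduce(records, mapper, reducer):
    """Apply mapper to each record, group intermediate pairs by key,
    then hand each (key2, stream-of-value2) group to the reducer."""
    intermediate = [pair for rec in records for pair in mapper(rec)]
    intermediate.sort(key=lambda kv: kv[0])        # the Copy/Sort phase
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=lambda kv: kv[0])]

# Word count expressed as mapper/reducer functions:
out = run_mapreduce(["a b", "b c"],
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda key, values: (key, sum(values)))
print(out)  # → [('a', 1), ('b', 2), ('c', 1)]
```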

Introduction to Map / Reduce Concept


Jobtracker

• This daemon usually runs on the same system as the Namenode.
• It is responsible for accepting job requests from clients.
• Jobs are broken into tasks.
• It assigns tasks to the tasktrackers on the slave nodes where the corresponding data is available.
• It re-assigns failed tasks.
• It ensures all jobs are successfully completed.


Tasktracker

• This daemon runs on the same systems as the datanodes.
• It accepts tasks from the jobtracker.
• It communicates with the jobtracker regularly (like a datanode):
– Heartbeat
– Availability of free processing slots
– Status of the tasks – in progress, completed, failed, etc.


MapReduce Job – Logical View

Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png


MapReduce Example - WordCount



Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png

MapReduce Execution Pipeline



MapReduce Coordination Mechanism



MapReduce Execution Pipeline



A View of Hadoop (from Hortonworks)

Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI



Developing MapReduce applications

1. Java is predominantly used for developing MapReduce applications.
2. Any IDE can be used – Eclipse is popular.
3. Eclipse provides a Hadoop plugin.
4. The MapReduce framework consists of several APIs that are extended to develop applications.


MapReduce Program

Three major components of a MapReduce program:
– Driver
– Mapper
– Reducer


Driver

• This is the entry point of the MapReduce program.
• Configuration details like input, output, InputFormat etc. are provided here.
• main() resides in this class.

InputFormat options:
– TextInputFormat (default)
– KeyValueTextInputFormat
– SequenceFileInputFormat
– SequenceFileAsTextInputFormat
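What the default TextInputFormat hands to the mapper can be sketched in a few lines. An approximation only (it assumes ASCII text and "\n" line endings): key1 is the byte offset where each line starts, and value1 is the line's text:

```python
# Mimic TextInputFormat: yield (byte_offset, line_text) pairs, the
# (key1, value1) records the default InputFormat gives each mapper.
def text_input_format(data: str):
    """Yield (byte_offset, line) pairs for a chunk of text."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))   # advance by the line's byte length

pairs = list(text_input_format("Hello World\nHello Trendwise\n"))
print(pairs)  # → [(0, 'Hello World'), (12, 'Hello Trendwise')]
```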


Mapper

Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records.

Example of a map task:

Input: 000, Hello World Hello Trendwise

Output:
Hello, 1
World, 1
Hello, 1
Trendwise, 1
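The map step above, sketched in plain Python (real Hadoop mappers are typically written in Java; this shows only the transformation itself):

```python
# Word-count map step: emit (word, 1) for every word in the input line.
# The offset argument mirrors the key TextInputFormat would supply.
def word_count_map(offset, line):
    """Transform one input record into intermediate (key2, value2) pairs."""
    return [(word, 1) for word in line.split()]

pairs = word_count_map(0, "Hello World Hello Trendwise")
print(pairs)
# → [('Hello', 1), ('World', 1), ('Hello', 1), ('Trendwise', 1)]
```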


Reducer

Reducer reduces a set of intermediate values which share a key to a smaller set of values.

A reducer is not always mandatory!

Example of reducer output:
Hello, 2
World, 1
Trendwise, 1
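The matching reduce step, again sketched in plain Python: each key arrives with the stream of values the mappers emitted for it (after the framework has sorted and grouped them), and the reducer collapses that stream:

```python
# Word-count reduce step: sum the value stream that arrives for one key.
def word_count_reduce(word, counts):
    """Collapse a key's value stream into a single (word, total) pair."""
    return word, sum(counts)

print(word_count_reduce("Hello", [1, 1]))   # → ('Hello', 2)
print(word_count_reduce("World", [1]))      # → ('World', 1)
```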


Result in the output folder

• _SUCCESS – a 0-byte file, present in case of success
• _logs – a directory with log files
• part-r-00000 – the output data


Demo


Hadoop 2.0 – YARN and High Availability


Supports frameworks other than MapReduce as well


Yarn Architecture


Main components of Yarn

