
BIG DATA AND ANALYTICS

Subject Code : 18CS72 CIE Marks : 40

Lecture Hours : 50 SEE Marks : 60

Credits : 04

Arun Kumar P
Dept. of ISE, JNNCE, Shivamogga
10 Hours

Introduction to Hadoop, HDFS and Essential Tools

Introduction to Hadoop

• An Apache-initiated project for developing a storage and processing
framework for Big Data
• Creators: Doug Cutting and Michael J. Cafarella
Two components:
1. Data store: data is stored in blocks across the clusters
2. Computation: processing runs at each individual cluster in parallel with the others

The components are written in Java, with part of the code in C.

Introduction to Hadoop

• Hadoop is a computing environment in which input data is stored, processed, and
the results are stored back.

• It consists of clusters, across which the data and computations are distributed.

• Each cluster holds a collection of files made up of data blocks.

Introduction to Hadoop

The infrastructure consists of a cloud of clusters.

A cluster consists of a set of computers.
Hadoop provides a low-cost Big Data platform:
a scalable, self-healing, self-manageable and distributed file system.
Ex: Yahoo has more than 100,000 CPUs across over 40,000 servers running Hadoop
Facebook has two major clusters:
1. 1100 machines with 8800 cores and about 12 PB raw storage
2. 300 machines with 2400 cores and about 3 PB raw storage

Hadoop core components

Spark

- Open-source cluster-computing framework
- Provides in-memory analytics
- Enables OLAP and real-time processing
- Processes Big Data faster
- Adopted by Amazon, eBay and Yahoo

Features of Hadoop

- Fault-tolerant, scalable, flexible and modular design
- Robust design of HDFS
- Stores and processes Big Data
- Distributed cluster computing model with data locality
- Hardware fault tolerance
- Open-source framework
- Java and Linux based

Hadoop Ecosystem components

- Refers to the combination of technologies
- Supports:
  Storage
  Processing
  Access
  Analysis
  Governance
  Security and operations for Big Data

Hadoop Streaming

- Hadoop Streaming is a utility that comes with the Hadoop distribution and is used to
run Big Data analysis programs written in other programming languages as the
Mapper and Reducer
- Spark and Flink enable in-stream processing
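For illustration, a streaming job might be launched as shown below; the jar path and the mapper/reducer script names are only placeholders and depend on the local installation:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hdfs/input \
    -output /user/hdfs/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py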

Hadoop Pipes

- Hadoop Pipes is the C++ interface to MapReduce
- Data streams into the Mapper input and aggregated results flow out as outputs
- Pipes does not use standard I/O when communicating with the Mapper and
Reducer code

Example: IBM PowerLinux enables working with Hadoop Pipes and its libraries

HDFS: Data Storage

- HDFS is a core component of Hadoop
- Designed to run on clusters of computers and servers
- HDFS stores data ranging from GBs to PBs
- Stores data in a distributed manner
- Stores data in any format
- Provides high-throughput access

HDFS: Data Storage

- A cluster is organized into racks
- Each rack has many DataNodes
- Each DataNode has many data blocks
- Racks are distributed across the cluster
- A file is divided into data blocks
- Data block size is 64 MB
HDFS: Data Storage

Features
- Create, append, delete, rename and attribute-modification functions
- The content of a file cannot be modified, but data can be appended at the end
- Write once, use many times during usage and processing
- Average file size can be more than 500 MB

HDFS: Physical Organization

The NameNode stores all information related to the file system:

- which part of the cluster each file section is stored in
- the last access time for files
- user permissions, i.e., which users have access to a file

HDFS: Physical Organization

The Secondary NameNode stores

- a copy of the NameNode metadata, so the metadata can be rebuilt easily
in case of NameNode failure

The JobTracker coordinates the parallel processing of data.

The master, slave and Hadoop client nodes load the data into the cluster.

Hadoop 2
- The single NameNode in Hadoop 1 is a single point of operational failure
- Scaling is also restricted beyond a few thousand nodes per cluster
- Hadoop 2 provides multiple NameNodes, which enables higher
resource availability
Each master node has the following components:
- An associated NameNode
- A Zookeeper coordination client, which functions as a centralized repository for
distributed applications
Zookeeper provides synchronization, serialization and coordination activities

Hadoop 2

- An associated JournalNode, which keeps records of the state, the resources assigned,
and the intermediate results of application tasks

- Distributed applications can write data to and read data from the JournalNode

HDFS Commands

- The HDFS shell is not POSIX-compliant

- So the shell does not interact exactly like a UNIX/Linux shell

HDFS Commands
- Commands for interacting with files in HDFS take the form
/bin/hdfs dfs <args>
-copyToLocal : copies a file from HDFS to the local file system
-cat : copies a file to standard output (stdout)

All Hadoop commands are invoked by the bin/hadoop script, e.g.

% hadoop fsck / -files -blocks
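For example, assuming a file named test already exists in the user's HDFS home directory (the name is only illustrative):

hdfs dfs -copyToLocal test test-local
hdfs dfs -cat test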

MapReduce Framework and Programming Model

Mapper: software for carrying out the assigned task after organizing the data blocks
imported using the keys
Key: specified in the command line of the Mapper
Command: maps the key to the data

MapReduce Framework and Programming Model

Reducer: software for reducing the mapped data using aggregation, query or a
user-specified function
The Reducer provides a concise, cohesive response for the application
Aggregation: groups the values of multiple rows together into a single value of
more significant meaning or measurement
Ex: count, sum, max, min, deviation and standard deviation
Query function: finds the desired values
Ex: find the student who performed best in an exam
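A minimal word-count sketch of the model (the input lines are only illustrative):

Map input lines: "cat sat", "cat ran"
Map output pairs: (cat, 1) (sat, 1) (cat, 1) (ran, 1)
Shuffle groups by key: (cat, [1, 1]) (sat, [1]) (ran, [1])
Reduce with the sum aggregation: (cat, 2) (sat, 1) (ran, 1)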

MapReduce Framework and Programming Model

Features:
1. Provides automatic parallelization and distribution of computation
2. Processes data stored on distributed clusters of DataNodes and racks
3. Allows processing of large amounts of data in parallel
4. Provides scalability for the use of large numbers of servers
5. Provides the batch-oriented MapReduce programming model in Hadoop v1
6. Provides additional processing modes in the Hadoop 2 YARN-based system
and enables the required parallel processing

Hadoop MapReduce Framework

MapReduce provides two important functions:

1. The distribution of a job, based on a client application task or a
user query, to the various nodes within a cluster

2. Organizing and reducing the results from each node into a
cohesive response to the application or an answer to the query

Hadoop MapReduce Framework

MapReduce enables job scheduling and task execution as follows:

A client node submits an application request to the JobTracker, which then
1. Estimates the resources needed for processing the request
2. Analyzes the states of the slave nodes
3. Places the map tasks in a queue
4. Monitors the progress of the tasks
On failure, it restarts the task

Hadoop MapReduce Framework

Job execution is controlled by two types of processes in MapReduce:

1. The Mapper deploys map tasks on the slots; tasks are assigned to nodes
2. Hadoop sends the Map and Reduce jobs to the appropriate servers in the cluster

Centralized Data (Shared Data)
Distributed Computing


Distributed Data & Distributed Computing


Distributed Computing with No Shared Data


Data Block: a stuData file of size < 64 MB fits in a single 64 MB data block.


Cluster layout (figure): data blocks of size 64 MB; DataNodes DN 1 to DN 240, each of
size 64 GB, arranged two per rack in Racks 1 to 120; the stuData file is stored on a DataNode.

Each stuData file fits in one 64 MB data block.
- Each DataNode (64 GB) can store 64 GB / 64 MB = 1024 data blocks = 1024 student files
- Each rack can store 2 x 64 GB / 64 MB = 2048 data blocks = 2048 student files
- Each data block is replicated 3 times across the DataNodes
- 120 racks x 2048 / 3 = 81920
- So a maximum of 81920 stuData_IDN files (N = 1 to 81920) can be distributed per cluster

• Each DataNode's capacity is 64 GB and each rack has 2 DataNodes,
so the capacity of one rack is 2 x 64 GB = 128 GB

• There are 120 racks in the cluster

• Total capacity of the cluster:

120 x 128 GB = 15360 GB = 15 TB



When the stuData file size is < 128 MB, 2 data blocks (DB1 and DB2) are required to
store one stuData file.


Each DataNode (64 GB) holds 64 GB / 64 MB = 1024 data blocks
(DataNodes DN 1 to DN 240 of 64 GB each, two per rack, Racks 1 to 120, as in the earlier figure).

- Each stuData file requires 2 data blocks,
so 1024 / 2 = 512 student files can be stored in a DataNode

- Each rack can store 2 x 64 GB / 64 MB = 2048 data blocks / 2
= 1024 student files

- Each data block is replicated 3 times across the DataNodes
- 120 x 1024 / 3 = 40960

- So a maximum of 40960 stuData_IDN files (N = 1 to 40960) can be
distributed per cluster
Hadoop YARN

Yet Another Resource Negotiator

YARN is a resource management platform that manages the computing resources of the cluster
- Responsible for providing computational resources such as
CPU, memory and network I/O
- YARN manages the schedules for running sub-tasks
- Each sub-task uses resources in its allotted time slot
- YARN enables the running of multi-threaded applications
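The nodes and applications that YARN manages can be inspected from the command line, for example:

yarn node -list
yarn application -list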

Hadoop YARN based Execution Model

• Client Node
• Resource Manager
• Node Manager
• App Master
• Containers

Hadoop YARN based Execution Model
(Figure: Master Node with Resource Manager and Job History Server)

• The Client Node submits an application request to the Resource Manager (RM)

• One RM exists per cluster
• The RM keeps information about all the slave Node Managers: their location
(rack number) and the resources (data blocks and servers) they have
• Multiple NameNodes can exist in a cluster
• A Node Manager creates an Application Master Instance (AMI) and starts it up
• The AMI initializes itself and registers with the RM
• Multiple AMIs can be created in an AM
Hadoop YARN based Execution Model

• The AMI performs the role of the Application Master (AM)

• It estimates the resource requirements for running an application program or sub-task
• The AM sends its requests for the necessary resources to the RM
• Each Node Manager includes several containers for use by the sub-tasks of an application
• The Node Manager is a slave; it signals whenever it initializes
• All active NMs send a controlling signal periodically to the RM, signalling their presence
• Each NM assigns a container for each AMI
• The RM allots the resources to the AM
• The AM uses the assigned containers on the same or another Node Manager

Hadoop Ecosystem Tools

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

• Apache Zookeeper is a coordination service

• Enables synchronization among distributed applications in a cluster
• Manages jobs in the cluster

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

Zookeeper's main coordination services are:

1. Name service: maps a name to the information associated with that name
Ex: DNS maps a domain name to an IP address
2. Concurrency control: controls concurrent access to shared resources
in the distributed system
3. Configuration management: a newly joining node can pick up the up-to-date
centralized configuration from Zookeeper when it joins the system
4. Failure handling: automatic recovery strategy by selecting an alternate node
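A small sketch of using these services through Zookeeper's command-line shell (the znode path and values are only illustrative):

zkCli.sh -server localhost:2181
create /myapp ""
create /myapp/config "version-1"
get /myapp/config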

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

- Apache Oozie is an open-source project that schedules Hadoop jobs

- Provides a way to package and bundle multiple coordinator and workflow jobs,
and manage the life cycle of those jobs
Two basic functions are:
1. Oozie workflow jobs are represented as Directed Acyclic Graphs (DAGs)
specifying a sequence of actions to execute
2. Oozie coordinator jobs are recurrent Oozie workflow jobs that are
triggered by time and data availability

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

Oozie provides the following:

1. Integrates multiple jobs in a sequential manner
2. Stores and supports Hadoop jobs for MapReduce, Hive, Pig and Sqoop
3. Runs workflow jobs based on time and data triggers
4. Manages batch coordination for the applications
5. Manages timely execution of jobs

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

• Apache Sqoop efficiently loads voluminous data between Hadoop and external
repositories that reside on enterprise servers or relational databases.
• Sqoop works with relational databases such as Oracle, MySQL, PostgreSQL and DB2
• Sqoop provides a mechanism for importing data from an external data store into HDFS

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

• Apache Flume provides a distributed, reliable and available service

• Flume collects, aggregates and transfers large streaming data into HDFS
• Flume enables uploading large files into Hadoop clusters
• Provides a robust and fault-tolerant service
• Useful for logs of network traffic, sensor data, geo-location data, e-mails and
social media messages

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

Apache Flume components:

1. Sources: accept data from a server or an application
2. Sinks: receive data and store it in the HDFS repository or transmit it to another source
3. Channels: connect a source and a sink by queuing event data for transactions
4. Agents: run the sources and sinks in Flume

Ambari

Apache Ambari is a management platform for Hadoop

• Enables planning, securely installing, managing and maintaining Hadoop clusters
• Provides advanced cluster security through Kerberos

Ambari

Features:
1. Simplifies installation, configuration and management
2. Enables easy, efficient, repeatable and automated creation of clusters
3. Manages and monitors scalable clusters
4. Provides an intuitive web interface and REST API
5. Visualizes the health of clusters and critical metrics for their operations
6. Enables detection of faulty node links
7. Provides extensibility and customizability

HBase

• HBase is the Hadoop database system

• Created for large tables
• HBase is an open-source, distributed, versioned and non-relational (NoSQL) database
Features:
1. Uses a partial columnar data schema on top of Hadoop and HDFS
2. Supports large tables of billions of rows and millions of columns
3. Handles sparse data: small amounts of useful information taken from large data sets
that would otherwise store empty or unneeded values
4. Supports data compression algorithms
5. Provides in-memory, column-based transactions

HBase

Features:
6. Accesses rows serially
7. Provides random, real-time read/write access to Big Data
8. Fault-tolerant storage
9. Similar to Google BigTable

Hive

• Hive is open-source data warehouse software

• Facilitates reading, writing and managing large datasets stored in
distributed Hadoop files
• Hive provides batch processing of large data
• Used for managing web logs
• Does not process real-time queries and does not update row-based data tables
• Enables serialization/deserialization
• Supports different storage types: text files, sequence files (binary key/value pairs),
RCFiles (Record Columnar Files),
ORC (Optimized Row Columnar) files and HBase

Hive

Three major functions:

1. Data summarization
2. Query
3. Analysis

- Hive interacts with structured data stored in HDFS using the Hive Query Language (HQL)
- HQL translates SQL-like queries into MapReduce jobs
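For illustration, a minimal HiveQL session for web-log style data might look like this (the table, column and file names are only placeholders):

CREATE TABLE logs (ip STRING, url STRING, hits INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA INPATH '/user/hdfs/logs.csv' INTO TABLE logs;
SELECT url, SUM(hits) FROM logs GROUP BY url;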

Pig

• Open-source, high-level language platform

• Developed for analysing large data sets
• Executes queries on large data sets using Hadoop
• The language used in Pig is known as Pig Latin
• Pig Latin is similar to SQL, but applies to larger data sets
Features
1. Loads data after applying filters and dumps data in the desired format
2. Requires a JRE for executing Pig Latin programs
3. Converts all operations into Map and Reduce tasks
4. Allows complete operations irrespective of the Mapper and Reducer functions to
produce output results
Mahout

• An Apache project with a library of scalable ML algorithms

• Apache implemented Mahout on top of Hadoop
• Provides learning tools to automate the finding of meaningful patterns
in the Big Data sets stored in HDFS
Supports four main areas:
1. Collaborative filtering, which mines user behaviour and makes product
recommendations
2. Clustering of data: organizes items of a class into groups
3. Classification: assigns items to the best category
4. Frequent item-set mining: identifies which items usually occur together

HDFS Basics

• HDFS is the backbone of Hadoop MapReduce processing

• HDFS is designed for Big Data processing
• Large files, write-once / read-many, and append-only
• No random writes to HDFS files; bytes are appended to the end of the stream
• HDFS block size is 64 MB or 128 MB
• An interesting feature is data locality:
moving the computation to the data rather than moving the data to the computation
• HDFS is designed to run on the same hardware as the compute portion of the cluster
• HDFS has a redundant design to handle failure

HDFS Components

The design is based on two types of nodes: NameNode and DataNode
• A single NameNode manages all the metadata needed to store and
retrieve the actual data from the DataNodes

• No data is actually stored on the NameNode

• The master (NameNode) manages the file system namespace and
regulates access to files by clients

HDFS Components
Based on two types of nodes: NameNode and DataNode
• File system namespace operations (opening, closing and renaming files
and directories) are all managed by the NameNode

• The NameNode determines the mapping of blocks to DataNodes
and handles DataNode failures

• The slaves (DataNodes) are responsible for serving read and write requests from
the file system's clients
• The NameNode manages block creation, deletion and replication

HDFS Components

The NameNode keeps its metadata in the fsimage_* and edits_* files.

HDFS Block Replication

• When HDFS writes a file, it is replicated across the cluster

• The amount of replication is based on the value of dfs.replication in hdfs-site.xml
• The default value can be overruled with the hdfs dfs -setrep command
• For clusters with more than 8 DataNodes, a replication value of 3 is usual
• For clusters with 8 or fewer DataNodes (but more than one), a replication factor of 2 is adequate
• The HDFS default block size is 64 MB; in a typical OS it is 4 KB or 8 KB
• The HDFS block size is not a minimum block size: a smaller file does not occupy a full block
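For example, the replication factor is set in hdfs-site.xml and can be changed for an existing file or directory with -setrep (the path below is only illustrative):

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

hdfs dfs -setrep -w 2 /user/hdfs/test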

HDFS Safe Mode

• When the NameNode starts, it enters a read-only safe mode, where blocks
cannot be replicated or deleted
Safe mode enables the NameNode to perform two important processes:
1. The previous file system state is reconstructed by loading the fsimage file into memory
and replaying the edit log
2. The mapping between blocks and DataNodes is created by waiting for enough
DataNodes to register
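Safe mode can be inspected and, if necessary, controlled by an administrator:

hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave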

Rack Awareness

Deals with data locality; there are three levels:

1. Data resides on the local machine (best)
2. Data resides in the same rack (better)
3. Data resides in a different rack (good)

When the YARN scheduler assigns MapReduce containers to work as mappers,
it will try to place the container
- first on the local machine
- then on the same rack
- finally on another rack

Name Node high availability

Early Hadoop suffered from the following:

- The NameNode was a single point of failure that could bring down the entire cluster

To prevent such failure:

- redundant power supplies were employed for the NameNode hardware
- redundant storage was provided

But it was still susceptible to failure, so

- the solution was to implement NameNode High Availability (HA)

HDFS Name Node Federation

- Older versions of HDFS provided a single namespace for the entire cluster,
managed by a single NameNode
- The resources of that single NameNode determined the size of the namespace
- Federation addresses this limitation by adding support for multiple
NameNodes/namespaces to the HDFS file system
Benefits are:
1. Namespace scalability: cluster storage scales horizontally
2. Better performance: improved read/write throughput
3. System isolation: different applications/users can be served by separate NameNodes

HDFS Checkpoints and Backups

- The CheckpointNode (formerly the SecondaryNameNode) periodically fetches edits from the
NameNode, merges them, and returns an updated fsimage to the NameNode

- The BackupNode is similar, but also maintains an up-to-date copy of the
file system namespace, both in memory and on disk
The BackupNode does not need to download the fsimage and edits files from the
active NameNode.
A NameNode supports one BackupNode at a time.

HDFS Snapshots

- Similar to backups, but created by administrators using the
hdfs dfs -createSnapshot command
- HDFS snapshots are read-only point-in-time copies of the file system; features are:
• Snapshots can be taken of a sub-tree of the file system or of the entire file system
• Snapshots can be used for data backup, protection against user errors,
and disaster recovery
• Snapshot creation is instantaneous
• The data blocks on the DataNodes are not copied; a snapshot only records the block list and file sizes
• Snapshots do not adversely affect regular HDFS operations.
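A sketch of snapshot usage (the directory and snapshot names are only illustrative); a directory must first be made snapshottable by an administrator:

hdfs dfsadmin -allowSnapshot /user/hdfs/data
hdfs dfs -createSnapshot /user/hdfs/data snap-1
hdfs dfs -ls /user/hdfs/data/.snapshot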

HDFS NFS Gateway

- The HDFS NFS Gateway supports NFSv3 and enables HDFS to be mounted as
part of a client's local file system

- Users can easily download/upload files from/to the HDFS file system
to/from their local file system

- Users can stream data directly to HDFS through the mount point
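Assuming the NFS gateway service is running on the NameNode host, a client might mount HDFS roughly as follows (the host name and mount point are placeholders):

mount -t nfs -o vers=3,proto=tcp,nolock namenode-host:/ /mnt/hdfs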

HDFS User Commands

hdfs : the command-line interface used with Hadoop version 2

hadoop dfs : was used with version 1 and is now deprecated

HDFS User Commands

Usage:

hdfs [--config confdir] COMMAND

where COMMAND is one of:

  dfs                  run a file system command on the file systems supported in Hadoop
  namenode -format     format the DFS file system
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode

HDFS User Commands

  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  haadmin              run a DFS HA admin client
  fsck                 run a DFS file system checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode
  mover                run a utility to move block replicas across storage types

HDFS User Commands
  oiv                  apply the offline fsimage viewer to an fsimage
  oiv_legacy           apply the offline fsimage viewer to a legacy fsimage
  oev                  apply the offline edits viewer to an edits file
  fetchdt              fetch a delegation token from the NameNode
  getconf              get config values from configuration
  groups               get the groups which users belong to
  snapshotDiff         diff two snapshots of a directory or diff the current directory
  lsSnapshottableDir   list all snapshottable dirs owned by the current user

Use -help to see options.

HDFS User Commands
  portmap              run a portmap service
  nfs3                 run an NFS version 3 gateway
  cacheadmin           configure the HDFS cache
  crypto               configure HDFS encryption zones
  storagepolicies      get all the existing block storage policies
  version              print the version

Most commands print help when invoked without parameters.

HDFS User Commands: List Files in HDFS
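For example, to list the files in the user's HDFS home directory and in the HDFS root:

hdfs dfs -ls
hdfs dfs -ls /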

HDFS User Commands

Make a Directory in HDFS

hdfs dfs -mkdir ise

Copy file to HDFS

hdfs dfs -put test ise

Copy files from HDFS

hdfs dfs -get ise/test test-local

Copy files within HDFS

hdfs dfs -cp ise/test test.hdfs

HDFS User Commands

Delete a file within HDFS

hdfs dfs -rm test.hdfs

Moved: 'hdfs://limulus:8020/user/hdfs/ise/test' to trash at: hdfs://limulus:8020/user/hdfs/.Trash/Current

hdfs dfs -rm -skipTrash ise/test

Delete a directory in HDFS

hdfs dfs -rm -r -skipTrash ise

HDFS User Commands

Get an HDFS status report
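The report is produced with the dfsadmin command (full details may require administrator privileges):

hdfs dfsadmin -report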

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>   /* for the O_WRONLY and O_CREAT flags */
#include "hdfs.h"    /* libhdfs C API */

int main(int argc, char **argv)
{
    /* Connect to the default HDFS file system named in the configuration */
    hdfsFS fs = hdfsConnect("default", 0);
    const char* writePath = "/tmp/testfile.txt";
    /* Open the HDFS file for writing, creating it if it does not exist */
    hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
    if (!writeFile)
    {
        fprintf(stderr, "Failed to open %s for writing!\n", writePath);
        exit(-1);
    }

    /* Write a small buffer to the file and flush it to HDFS */
    char* buffer = "Hello, World!\n";
    tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer)+1);
    if (hdfsFlush(fs, writeFile))
    {
        fprintf(stderr, "Failed to 'flush' %s\n", writePath);
        exit(-1);
    }
    hdfsCloseFile(fs, writeFile);
    hdfsDisconnect(fs);
    return 0;
}
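As a rough sketch (paths and flags depend on the local installation; hdfs_write.c is a placeholder file name), the program can be compiled against libhdfs and run with the Hadoop class path exported:

gcc hdfs_write.c -I$HADOOP_HOME/include -L$HADOOP_HOME/lib/native -lhdfs -o hdfs_write
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath --glob)
./hdfs_write
hdfs dfs -cat /tmp/testfile.txt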

Essential Hadoop Tools

Essential Hadoop Tools

• The Hadoop ecosystem offers many tools to help with data input, high-level
processing, workflow management and the creation of huge databases
• Each tool is managed as a separate Apache Software Foundation project
• But all are designed to operate with the core Hadoop services, including HDFS, YARN
and MapReduce
• Background on each tool, with a start-to-finish example, is given here

Using Apache Pig

• Apache Pig (Pig Latin) is a high-level language

• Enables writing complex MapReduce transformations using a simple
scripting language
• Defines aggregate, join and sort transformations on data sets
• Used for extract, transform and load (ETL) data pipelines,
quick research on raw data, and iterative data processing
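A minimal Pig Latin sketch (the file, field and relation names are only placeholders):

logs  = LOAD '/user/hdfs/logs.csv' USING PigStorage(',')
            AS (ip:chararray, url:chararray, hits:int);
big   = FILTER logs BY hits > 100;
byurl = GROUP big BY url;
out   = FOREACH byurl GENERATE group, SUM(big.hits);
DUMP out;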

Using Apache Pig

• Local mode: all processing is done on the local machine

• Non-local (cluster) mode: executes the job on the cluster using either the MapReduce engine
or the optimized Tez engine
• Interactive and batch modes: enable Pig applications to be developed locally in
interactive mode using small amounts of data,
and run at scale on the cluster in a production mode.

Using Apache Hive

• Apache Hive is a data warehouse infrastructure


• Built on top of Hadoop for providing
1. Data summarization
2. Ad hoc queries
3. Analysis of large data sets

using a SQL-like language called HiveQL

Using Apache Hive

Apache Hive offers following features


• Tools to enable easy data extraction, transformation and loading (ETL)
• Mechanism to impose structure on variety of data formats
• Access to files stored either directly in HDFS or in other data storage
systems such as HBase
• Query execution via MapReduce and Tez

Using Apache Sqoop to Acquire Relational Data

Apache Sqoop is a tool designed to transfer data between Hadoop and relational
databases.
• Sqoop can be used to import data from an RDBMS into HDFS
• transform the data in Hadoop
• and export the data back into an RDBMS
• Can be used with any JDBC-compliant database
• Has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle

Using Apache Sqoop to Acquire Relational Data

Version 1 vs. Version 2
-------------------------------------------------------------------------------------------------------
1. Connectors: Version 1 accessed data using connectors written for specific
   databases; Version 2 does not support these specialized connectors.

2. Data transfer: Version 1 could move data directly from an RDBMS to Hive or HBase,
   and from Hive or HBase back to an RDBMS; Version 2 provides more generalized
   ways to accomplish these tasks.

Using Apache Sqoop to Acquire Relational Data
Import Methods
Step 1: Sqoop examines the database to gather the necessary metadata
for the data to be imported

Step 2: Sqoop submits a map-only (no reduce step) Hadoop job to the cluster
• The job does the actual data transfer using the metadata captured in step 1
• The imported data are saved in an HDFS directory
• Once placed in HDFS, the data are ready for processing
Using Apache Sqoop to Acquire Relational Data
Export Methods
Step 1: Sqoop examines the database metadata

Step 2: Sqoop uses a map-only Hadoop job to write the data to the database
• Sqoop divides the input data set into splits

• Then uses individual map tasks to push the splits to the database.
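For illustration, an import and an export might look like the following (the host, database, table and directory names are placeholders):

sqoop import --connect jdbc:mysql://db-host/retail --table customers \
      --username sqoopuser -P --target-dir /user/hdfs/customers -m 4
sqoop export --connect jdbc:mysql://db-host/retail --table customer_summary \
      --username sqoopuser -P --export-dir /user/hdfs/customer_summary -m 4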

Using Apache Flume to Acquire Data Streams

• Apache Flume is an independent agent designed to collect,


transport, and store data into HDFS

• Data transport involves a number of Flume agents that may


traverse a series of machines and locations

• Used for log files, social media-generated data, email messages, and
just about any continuous data source

Using Apache Flume to Acquire Data Streams

Source: the source component receives data and sends it to a channel.
It can send the data to more than one channel. The input data can be from a
real-time source (e.g., a weblog) or another Flume agent.

Channel: a channel is a data queue that forwards the source data to the
sink destination. It acts as a buffer that manages the input (source) and
output (sink) flow rates.

Sink: the sink delivers data to a destination such as HDFS, a local file, or
another Flume agent.
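A minimal single-agent configuration sketch (the agent and component names a1, r1, c1, k1 and the HDFS path are only illustrative):

a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The agent would then be started with something like: flume-ng agent -n a1 -f a1.conf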
Using Apache Flume to Acquire Data Streams
• Flume agents may be placed in a pipeline, possibly to traverse several machines or
domains
• This configuration is normally used when data are collected on one machine (e.g.,
a web server) and sent to another machine that has access to HDFS.
• The data transfer format used by Flume is called Apache Avro, which provides several
useful features:
1. Avro is a data serialization/deserialization system that uses a compact
binary format
2. The schema is sent as part of the data exchange and is defined using
JSON
3. Avro also uses RPCs to send data;
that is, an Avro sink will contact an Avro source to send data.

Using Apache Flume to Acquire Data Streams

Flume is used to consolidate


several data sources before
committing them to HDFS

Manage Hadoop Workflows with Apache Oozie

• Oozie is a workflow director system


• Designed to run and manage multiple related Apache Hadoop jobs
• Oozie workflow jobs are represented as DAGs of actions

Manage Hadoop Workflows with Apache Oozie

Three types of Oozie jobs are permitted:


1. Workflow—a specified sequence of Hadoop jobs with outcome-based
decision points and control dependency. Progress from one action to
another cannot happen until the first action is completed.
2. Coordinator—a scheduled workflow job that can run at various time intervals
or when data become available.
3. Bundle—a higher-level Oozie abstraction that will batch a set of coordinator
jobs.

Manage Hadoop Workflows with Apache Oozie

• Oozie is integrated with the rest of the Hadoop stack, supporting several
types of Hadoop jobs out of the box (e.g., Java MapReduce, Streaming
MapReduce, Pig, Hive, and Sqoop)
• As well as system-specific jobs (e.g., Java programs and shell scripts).
• Oozie also provides a CLI and a web UI for monitoring jobs.

Manage Hadoop Workflows with Apache Oozie

• Control flow nodes define the beginning and the end of a workflow. They
include start, end, and optional fail nodes.

• Action nodes are where the actual processing tasks are defined. When an
action node finishes, the remote systems notify Oozie and the next node in
the workflow is executed.

Manage Hadoop Workflows with Apache Oozie

• Fork/join nodes enable parallel execution of tasks in the workflow. The fork
node enables two or more tasks to run at the same time. A join node
represents a rendezvous point that must wait until all forked tasks complete.
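A skeletal workflow definition might look like the following (the names are placeholders, and a real action node would contain a concrete action such as a map-reduce, pig or hive element):

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="first-action"/>
  <action name="first-action">
    <!-- the concrete action (e.g., map-reduce, pig, hive) is defined here -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>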


Using Apache HBase

• Apache HBase is an open source, distributed, versioned, nonrelational


database
• HBase leverages the distributed data storage provided by the underlying
distributed file systems spread across commodity servers
• Apache HBase provides Bigtable-like capabilities on top of Hadoop and
HDFS

Using Apache HBase

Important features include the following capabilities:


• Linear and modular scalability
• Strictly consistent reads and writes
• Automatic and configurable sharding of tables
• Automatic failover support between RegionServers
• Convenient base classes for backing Hadoop MapReduce jobs with
Apache HBase tables
• Easy-to-use Java API for client access

Using Apache HBase

HBase Data Model Overview


• A table in HBase is similar to other databases, having rows and columns
• Columns in HBase are grouped into column families, all with the same
prefix. Ex: price:open, price:close, price:low, and price:high
• A column does not need to be a family. Ex: volume
• All column family members are stored together in the physical file system
• Specific HBase cell values are identified by a row key, column (column
family and column), and version (timestamp).

Using Apache HBase
HBase Data Model Overview
• It is possible to have many versions of data within an HBase cell.
• Almost anything can serve as a row key, from strings to binary representations
of longs to serialized data structures.
• Rows are lexicographically sorted with the lowest order appearing first in a table.
• The empty byte array denotes both the start and the end of a table’s
namespace.
• All table accesses are via the table row key, which is
considered its primary key.
Using Apache HBase
Create the Database
hbase(main):006:0> create 'apple', 'price' , 'volume'
0 row(s) in 0.8150 seconds

• Table name is apple, and two columns are defined.


• The date will be used as the row key.
• The price column is a family of four values (open, close, low, high).

Using Apache HBase

The put command is used to add data to the database from within the shell.

put 'apple','6-May-15','price:open','126.56'
put 'apple','6-May-15','price:high','126.75'
put 'apple','6-May-15','price:low','123.36'
put 'apple','6-May-15','price:close','125.01'
put 'apple','6-May-15','volume','71820387'

Using Apache HBase
Inspect the Database :The entire database can be listed using the scan
command.
hbase(main):006:0> scan 'apple'
ROW COLUMN+CELL
6-May-15 column=price:close, timestamp=1430955128359, value=125.01
6-May-15 column=price:high, timestamp=1430955126024, value=126.75
6-May-15 column=price:low, timestamp=1430955126053, value=123.36
6-May-15 column=price:open, timestamp=1430955125977, value=126.56
6-May-15 column=volume:, timestamp=1430955141440, value=71820387
Using Apache HBase
Get a Row You can use the row key to access an individual row.
hbase(main):008:0> get 'apple', '6-May-15'
COLUMN CELL
price:close timestamp=1430955128359, value=125.01
price:high timestamp=1430955126024, value=126.75
price:low timestamp=1430955126053, value=123.36
price:open timestamp=1430955125977, value=126.56
volume: timestamp=1430955141440, value=71820387
5 row(s) in 0.0130 seconds
Using Apache HBase

Get Table Cells A single cell can be accessed using the get command and the
COLUMN option:

hbase(main):013:0> get 'apple', '5-May-15', {COLUMN => 'price:low'}


COLUMN CELL
price:low timestamp=1431020767444, value=125.78
1 row(s) in 0.0080 seconds

Using Apache HBase

Get Table Cells :multiple columns can be accessed as follows:

hbase(main):012:0> get 'apple', '5-May-15', {COLUMN => ['price:low', 'price:high']}


COLUMN CELL
price:high timestamp=1431020767444, value=128.45
price:low timestamp=1431020767444, value=125.78
2 row(s) in 0.0070 seconds

Using Apache HBase

Delete a Cell A specific cell can be deleted using the following command:
hbase(main):009:0> delete 'apple', '6-May-15' , 'price:low'
Delete a Row You can delete an entire row by giving the deleteall command
hbase(main):009:0> deleteall 'apple', '6-May-15'
Remove a Table To remove (drop) a table, you must first disable it. The following
two commands remove the apple table from Hbase:
hbase(main):009:0> disable 'apple'
hbase(main):010:0> drop 'apple'

Using Apache HBase

Adding Data in Bulk

• There are several ways to efficiently load bulk data into HBase
• The ImportTsv utility loads data in tab-separated values (tsv) format into
HBase
It has two distinct usage modes:
1. Loading data from a tsv-format file in HDFS into HBase via the put command
2. Preparing StoreFiles to be loaded via the completebulkload utility
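For illustration, the first mode might be invoked roughly as follows for the apple table used above (the column mapping and the input path are placeholders):

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
     -Dimporttsv.columns=HBASE_ROW_KEY,price:open,price:high,price:low,price:close \
     apple /user/hdfs/apple-prices.tsv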
