
BIG DATA AND ANALYTICS

Subject Code : 18CS72 CIE Marks : 40

Lecture Hours : 50 SEE Marks : 60

Credits : 04

Arun Kumar P
Dept. of ISE, JNNCE, Shivamogga
10 Hours

Introduction to Hadoop, HDFS and Essential Tools

Introduction to Hadoop

• An Apache-initiated project for developing a storage and processing
framework for Big Data
• Creators: Doug Cutting and Michael J. Cafarella
Two components:
1. Data store: data is stored in blocks across the clusters
2. Computation: processing runs at each individual cluster in parallel with the others

The components are written in Java, with part of the code in C.

Introduction to Hadoop

• Hadoop is a computing environment in which input data is stored, processed, and
the results are stored back.

• It consists of clusters, across which the data and computations are distributed.

• Each cluster holds a collection of files made up of data blocks.

Introduction to Hadoop

The infrastructure consists of a cloud of clusters.

A cluster consists of a set of computers.
Hadoop provides a low-cost Big Data platform:
a scalable, self-healing, self-manageable and distributed file system.
Ex: Yahoo has more than 100,000 CPUs across over 40,000 servers running Hadoop
Facebook has two major clusters:
1. 1100 machines with 8800 cores and about 12 PB raw storage
2. 300 machines with 2400 cores and about 3 PB raw storage

Hadoop core components

Spark

- Open-source cluster-computing framework
- Provides in-memory analytics
- Enables OLAP and real-time processing
- Processes Big Data faster
- Adopted by Amazon, eBay and Yahoo

Features of Hadoop

- Fault-tolerant, scalable, flexible and modular design
- Robust design of HDFS
- Stores and processes Big Data
- Distributed cluster computing model with data locality
- Hardware fault tolerance
- Open-source framework
- Java and Linux based

Hadoop Ecosystem components

- Refers to the combination of technologies
- Supports:
  Storage
  Processing
  Access
  Analysis
  Governance
  Security and operations for Big Data

Hadoop Streaming

- Hadoop Streaming is a utility that comes with the Hadoop distribution and is used to
run Big Data analysis programs written in other programming languages as the
Mapper and Reducer
- Spark and Flink enable in-stream processing
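For illustration, a streaming job might be launched as shown below; the jar path and the mapper/reducer script names are only placeholders and depend on the local installation:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hdfs/input \
    -output /user/hdfs/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py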

Hadoop Pipes

- Hadoop Pipes is the C++ interface to MapReduce
- Data streams into the Mapper input and aggregated results flow out as outputs
- Pipes does not use standard I/O when communicating with the Mapper and
Reducer code

Example: IBM PowerLinux enables working with Hadoop Pipes and its libraries

HDFS: Data Storage

- HDFS is a core component of Hadoop
- Designed to run on clusters of computers and servers
- HDFS stores data ranging from GBs to PBs
- Stores data in a distributed manner
- Stores data in any format
- Provides high-throughput access

HDFS: Data Storage

- A cluster is organized into racks
- Each rack has many DataNodes
- Each DataNode has many data blocks
- Racks are distributed across the cluster
- A file is divided into data blocks
- Data block size is 64 MB
HDFS: Data Storage

Features
- Create, append, delete, rename and attribute-modification functions
- The content of a file cannot be modified, but data can be appended at the end
- Write once, use many times during usage and processing
- Average file size can be more than 500 MB

HDFS: Physical Organization

The NameNode stores all information related to the file system:

- which part of the cluster each file section is stored in
- the last access time for files
- user permissions, i.e., which users have access to a file

HDFS: Physical Organization

The Secondary NameNode stores

- a copy of the NameNode metadata, so the metadata can be rebuilt easily
in case of NameNode failure

The JobTracker coordinates the parallel processing of data.

The master, slave and Hadoop client nodes load the data into the cluster.

Hadoop 2
- The single NameNode in Hadoop 1 is a single point of operational failure
- Scaling is also restricted beyond a few thousand nodes per cluster
- Hadoop 2 provides multiple NameNodes, which enables higher
resource availability
Each master node has the following components:
- An associated NameNode
- A Zookeeper coordination client, which functions as a centralized repository for
distributed applications
Zookeeper provides synchronization, serialization and coordination activities

Hadoop 2

- An associated JournalNode, which keeps records of the state, the resources assigned,
and the intermediate results of application tasks

- Distributed applications can write data to and read data from the JournalNode

HDFS Commands

- The HDFS shell is not POSIX-compliant

- So the shell does not interact exactly like a UNIX/Linux shell

HDFS Commands
- Commands for interacting with files in HDFS take the form
/bin/hdfs dfs <args>
-copyToLocal : copies a file from HDFS to the local file system
-cat : copies a file to standard output (stdout)

All Hadoop commands are invoked by the bin/hadoop script, e.g.

% hadoop fsck / -files -blocks
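For example, assuming a file named test already exists in the user's HDFS home directory (the name is only illustrative):

hdfs dfs -copyToLocal test test-local
hdfs dfs -cat test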

MapReduce Framework and Programming Model

Mapper: software for carrying out the assigned task after organizing the data blocks
imported using the keys
Key: specified in the command line of the Mapper
Command: maps the key to the data

MapReduce Framework and Programming Model

Reducer: software for reducing the mapped data using aggregation, query or a
user-specified function
The Reducer provides a concise, cohesive response for the application
Aggregation: groups the values of multiple rows together into a single value of
more significant meaning or measurement
Ex: count, sum, max, min, deviation and standard deviation
Query function: finds the desired values
Ex: find the student who performed best in an exam
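A minimal word-count sketch of the model (the input lines are only illustrative):

Map input lines: "cat sat", "cat ran"
Map output pairs: (cat, 1) (sat, 1) (cat, 1) (ran, 1)
Shuffle groups by key: (cat, [1, 1]) (sat, [1]) (ran, [1])
Reduce with the sum aggregation: (cat, 2) (sat, 1) (ran, 1)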

MapReduce Framework and Programming Model

Features:
1. Provides automatic parallelization and distribution of computation
2. Processes data stored on distributed clusters of DataNodes and racks
3. Allows processing of large amounts of data in parallel
4. Provides scalability for the use of large numbers of servers
5. Provides the batch-oriented MapReduce programming model in Hadoop v1
6. Provides additional processing modes in the Hadoop 2 YARN-based system
and enables the required parallel processing

Hadoop MapReduce Framework

MapReduce provides two important functions:

1. The distribution of a job, based on a client application task or a
user query, to the various nodes within a cluster

2. Organizing and reducing the results from each node into a
cohesive response to the application or an answer to the query

Hadoop MapReduce Framework

MapReduce enables job scheduling and task execution as follows:

A client node submits an application request to the JobTracker, which then
1. Estimates the resources needed for processing the request
2. Analyzes the states of the slave nodes
3. Places the map tasks in a queue
4. Monitors the progress of the tasks
On failure, it restarts the task

Hadoop MapReduce Framework

Job execution is controlled by two types of processes in MapReduce:

1. The Mapper deploys map tasks on the slots; tasks are assigned to nodes
2. Hadoop sends the Map and Reduce jobs to the appropriate servers in the cluster

Centralized Data (Shared Data)
Distributed Computing


Distributed Data & Distributed Computing


Distributed Computing with No Shared Data


Data Block: a stuData file of size < 64 MB fits in a single 64 MB data block.


Cluster layout (figure): data blocks of size 64 MB; DataNodes DN 1 to DN 240, each of
size 64 GB, arranged two per rack in Racks 1 to 120; the stuData file is stored on a DataNode.

Each stuData file fits in one 64 MB data block.
- Each DataNode (64 GB) can store 64 GB / 64 MB = 1024 data blocks = 1024 student files
- Each rack can store 2 x 64 GB / 64 MB = 2048 data blocks = 2048 student files
- Each data block is replicated 3 times across the DataNodes
- 120 racks x 2048 / 3 = 81920
- So a maximum of 81920 stuData_IDN files (N = 1 to 81920) can be distributed per cluster

• Each DataNode's capacity is 64 GB and each rack has 2 DataNodes,
so the capacity of one rack is 2 x 64 GB = 128 GB

• There are 120 racks in the cluster

• Total capacity of the cluster:

120 x 128 GB = 15360 GB = 15 TB



When the stuData file size is < 128 MB, 2 data blocks (DB1 and DB2) are required to
store one stuData file.


Each DataNode (64 GB) holds 64 GB / 64 MB = 1024 data blocks
(DataNodes DN 1 to DN 240 of 64 GB each, two per rack, Racks 1 to 120, as in the earlier figure).

- Each stuData file requires 2 data blocks,
so 1024 / 2 = 512 student files can be stored in a DataNode

- Each rack can store 2 x 64 GB / 64 MB = 2048 data blocks / 2
= 1024 student files

- Each data block is replicated 3 times across the DataNodes
- 120 x 1024 / 3 = 40960

- So a maximum of 40960 stuData_IDN files (N = 1 to 40960) can be
distributed per cluster
Hadoop YARN

Yet Another Resource Negotiator

YARN is a resource management platform that manages the computing resources of the cluster
- Responsible for providing computational resources such as
CPU, memory and network I/O
- YARN manages the schedules for running sub-tasks
- Each sub-task uses resources in its allotted time slot
- YARN enables the running of multi-threaded applications
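The nodes and applications that YARN manages can be inspected from the command line, for example:

yarn node -list
yarn application -list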

Hadoop YARN based Execution Model

• Client Node
• Resource Manager
• Node Manager
• App Master
• Containers

Hadoop YARN based Execution Model
(Figure: Master Node with Resource Manager and Job History Server)

• The Client Node submits an application request to the Resource Manager (RM)

• One RM exists per cluster
• The RM keeps information about all the slave Node Managers: their location
(rack number) and the resources (data blocks and servers) they have
• Multiple NameNodes can exist in a cluster
• A Node Manager creates an Application Master Instance (AMI) and starts it up
• The AMI initializes itself and registers with the RM
• Multiple AMIs can be created in an AM
Hadoop YARN based Execution Model

• The AMI performs the role of the Application Master (AM)

• It estimates the resource requirements for running an application program or sub-task
• The AM sends its requests for the necessary resources to the RM
• Each Node Manager includes several containers for use by the sub-tasks of an application
• The Node Manager is a slave; it signals whenever it initializes
• All active NMs send a controlling signal periodically to the RM, signalling their presence
• Each NM assigns a container for each AMI
• The RM allots the resources to the AM
• The AM uses the assigned containers on the same or another Node Manager

Hadoop Ecosystem Tools

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

• Apache Zookeeper is a coordination service

• Enables synchronization among distributed applications in a cluster
• Manages jobs in the cluster

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

Zookeeper's main coordination services are:

1. Name service: maps a name to the information associated with that name
Ex: DNS maps a domain name to an IP address
2. Concurrency control: controls concurrent access to shared resources
in the distributed system
3. Configuration management: a newly joining node can pick up the up-to-date
centralized configuration from Zookeeper when it joins the system
4. Failure handling: automatic recovery strategy by selecting an alternate node
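A small sketch of using these services through Zookeeper's command-line shell (the znode path and values are only illustrative):

zkCli.sh -server localhost:2181
create /myapp ""
create /myapp/config "version-1"
get /myapp/config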

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

- Apache Oozie is an open-source project that schedules Hadoop jobs

- Provides a way to package and bundle multiple coordinator and workflow jobs,
and manage the life cycle of those jobs
Two basic functions are:
1. Oozie workflow jobs are represented as Directed Acyclic Graphs (DAGs)
specifying a sequence of actions to execute
2. Oozie coordinator jobs are recurrent Oozie workflow jobs that are
triggered by time and data availability

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

Oozie provides the following:

1. Integrates multiple jobs in a sequential manner
2. Stores and supports Hadoop jobs for MapReduce, Hive, Pig and Sqoop
3. Runs workflow jobs based on time and data triggers
4. Manages batch coordination for the applications
5. Manages timely execution of jobs

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

• Apache Sqoop efficiently loads voluminous data between Hadoop and external
repositories that reside on enterprise servers or relational databases.
• Sqoop works with relational databases such as Oracle, MySQL, PostgreSQL and DB2
• Sqoop provides a mechanism for importing data from an external data store into HDFS

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

• Apache Flume provides a distributed, reliable and available service

• Flume collects, aggregates and transfers large streaming data into HDFS
• Flume enables uploading large files into Hadoop clusters
• Provides a robust and fault-tolerant service
• Useful for logs of network traffic, sensor data, geo-location data, e-mails and
social media messages

Hadoop Ecosystem: Zookeeper, Oozie, Sqoop and Flume

Apache Flume components:

1. Sources: accept data from a server or an application
2. Sinks: receive data and store it in the HDFS repository or transmit it to another source
3. Channels: connect a source and a sink by queuing event data for transactions
4. Agents: run the sources and sinks in Flume

Ambari

Apache Ambari is a management platform for Hadoop

• Enables planning, securely installing, managing and maintaining Hadoop clusters
• Provides advanced cluster security through Kerberos

Ambari

Features:
1. Simplifies installation, configuration and management
2. Enables easy, efficient, repeatable and automated creation of clusters
3. Manages and monitors scalable clusters
4. Provides an intuitive web interface and REST API
5. Visualizes the health of clusters and critical metrics for their operations
6. Enables detection of faulty node links
7. Provides extensibility and customizability

HBase

• HBase is the Hadoop database system

• Created for large tables
• HBase is an open-source, distributed, versioned and non-relational (NoSQL) database
Features:
1. Uses a partial columnar data schema on top of Hadoop and HDFS
2. Supports large tables of billions of rows and millions of columns
3. Handles sparse data: small amounts of useful information taken from large data sets
that would otherwise store empty or unneeded values
4. Supports data compression algorithms
5. Provides in-memory, column-based transactions

HBase

Features:
6. Accesses rows serially
7. Provides random, real-time read/write access to Big Data
8. Fault-tolerant storage
9. Similar to Google BigTable

Hive

• Hive is open-source data warehouse software

• Facilitates reading, writing and managing large datasets stored in
distributed Hadoop files
• Hive provides batch processing of large data
• Used for managing web logs
• Does not process real-time queries and does not update row-based data tables
• Enables serialization/deserialization
• Supports different storage types: text files, sequence files (binary key/value pairs),
RCFiles (Record Columnar Files),
ORC (Optimized Row Columnar) files and HBase

Hive

Three major functions:

1. Data summarization
2. Query
3. Analysis

- Hive interacts with structured data stored in HDFS using the Hive Query Language (HQL)
- HQL translates SQL-like queries into MapReduce jobs
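For illustration, a minimal HiveQL session for web-log style data might look like this (the table, column and file names are only placeholders):

CREATE TABLE logs (ip STRING, url STRING, hits INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA INPATH '/user/hdfs/logs.csv' INTO TABLE logs;
SELECT url, SUM(hits) FROM logs GROUP BY url;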

Pig

• Open-source, high-level language platform

• Developed for analysing large data sets
• Executes queries on large data sets using Hadoop
• The language used in Pig is known as Pig Latin
• Pig Latin is similar to SQL, but applies to larger data sets
Features
1. Loads data after applying filters and dumps data in the desired format
2. Requires a JRE for executing Pig Latin programs
3. Converts all operations into Map and Reduce tasks
4. Allows complete operations irrespective of the Mapper and Reducer functions to
produce output results
Mahout

• An Apache project with a library of scalable ML algorithms

• Apache implemented Mahout on top of Hadoop
• Provides learning tools to automate the finding of meaningful patterns
in the Big Data sets stored in HDFS
Supports four main areas:
1. Collaborative filtering, which mines user behaviour and makes product
recommendations
2. Clustering of data: organizes items of a class into groups
3. Classification: assigns items to the best category
4. Frequent item-set mining: identifies which items usually occur together

HDFS Basics

• HDFS is the backbone of Hadoop MapReduce processing

• HDFS is designed for Big Data processing
• Large files, write-once / read-many, and append-only
• No random writes to HDFS files; bytes are appended to the end of the stream
• HDFS block size is 64 MB or 128 MB
• An interesting feature is data locality:
moving the computation to the data rather than moving the data to the computation
• HDFS is designed to run on the same hardware as the compute portion of the cluster
• HDFS has a redundant design to handle failure

HDFS Components

The design is based on two types of nodes: NameNode and DataNode
• A single NameNode manages all the metadata needed to store and
retrieve the actual data from the DataNodes

• No data is actually stored on the NameNode

• The master (NameNode) manages the file system namespace and
regulates access to files by clients

HDFS Components
Based on two types of nodes: NameNode and DataNode
• File system namespace operations (opening, closing and renaming files
and directories) are all managed by the NameNode

• The NameNode determines the mapping of blocks to DataNodes
and handles DataNode failures

• The slaves (DataNodes) are responsible for serving read and write requests from
the file system's clients
• The NameNode manages block creation, deletion and replication

HDFS Components

The NameNode keeps its metadata in the fsimage_* and edits_* files.

HDFS Block Replication

• When HDFS writes a file, it is replicated across the cluster

• The amount of replication is based on the value of dfs.replication in hdfs-site.xml
• The default value can be overruled with the hdfs dfs -setrep command
• For clusters with more than 8 DataNodes, a replication value of 3 is usual
• For clusters with 8 or fewer DataNodes (but more than one), a replication factor of 2 is adequate
• The HDFS default block size is 64 MB; in a typical OS it is 4 KB or 8 KB
• The HDFS block size is not a minimum block size: a smaller file does not occupy a full block
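For example, the replication factor is set in hdfs-site.xml and can be changed for an existing file or directory with -setrep (the path below is only illustrative):

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

hdfs dfs -setrep -w 2 /user/hdfs/test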

HDFS Safe Mode

• When the NameNode starts, it enters a read-only safe mode, where blocks
cannot be replicated or deleted
Safe mode enables the NameNode to perform two important processes:
1. The previous file system state is reconstructed by loading the fsimage file into memory
and replaying the edit log
2. The mapping between blocks and DataNodes is created by waiting for enough
DataNodes to register
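Safe mode can be inspected and, if necessary, controlled by an administrator:

hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave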

Rack Awareness

Deals with data locality; there are three levels:

1. Data resides on the local machine (best)
2. Data resides in the same rack (better)
3. Data resides in a different rack (good)

When the YARN scheduler assigns MapReduce containers to work as mappers,
it will try to place the container
- first on the local machine
- then on the same rack
- finally on another rack

Name Node high availability

Early Hadoop suffered from the following:

- The NameNode was a single point of failure that could bring down the entire cluster

To prevent such failure:

- redundant power supplies were employed for the NameNode hardware
- redundant storage was provided

But it was still susceptible to failure, so

- the solution was to implement NameNode High Availability (HA)

HDFS Name Node Federation

- Older versions of HDFS provided a single namespace for the entire cluster,
managed by a single NameNode
- The resources of that single NameNode determined the size of the namespace
- Federation addresses this limitation by adding support for multiple
NameNodes/namespaces to the HDFS file system
Benefits are:
1. Namespace scalability: cluster storage scales horizontally
2. Better performance: improved read/write throughput
3. System isolation: different applications/users can be served by separate NameNodes

HDFS Checkpoints and Backups

- The CheckpointNode (formerly the SecondaryNameNode) periodically fetches edits from the
NameNode, merges them, and returns an updated fsimage to the NameNode

- The BackupNode is similar, but also maintains an up-to-date copy of the
file system namespace, both in memory and on disk
The BackupNode does not need to download the fsimage and edits files from the
active NameNode.
A NameNode supports one BackupNode at a time.

HDFS Snapshots

- Similar to backups, but created by administrators using the
hdfs dfs -createSnapshot command
- HDFS snapshots are read-only point-in-time copies of the file system; features are:
• Snapshots can be taken of a sub-tree of the file system or of the entire file system
• Snapshots can be used for data backup, protection against user errors,
and disaster recovery
• Snapshot creation is instantaneous
• The data blocks on the DataNodes are not copied; a snapshot only records the block list and file sizes
• Snapshots do not adversely affect regular HDFS operations.
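A sketch of snapshot usage (the directory and snapshot names are only illustrative); a directory must first be made snapshottable by an administrator:

hdfs dfsadmin -allowSnapshot /user/hdfs/data
hdfs dfs -createSnapshot /user/hdfs/data snap-1
hdfs dfs -ls /user/hdfs/data/.snapshot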

HDFS NFS Gateway

- The HDFS NFS Gateway supports NFSv3 and enables HDFS to be mounted as
part of a client's local file system

- Users can easily download/upload files from/to the HDFS file system
to/from their local file system

- Users can stream data directly to HDFS through the mount point
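Assuming the NFS gateway service is running on the NameNode host, a client might mount HDFS roughly as follows (the host name and mount point are placeholders):

mount -t nfs -o vers=3,proto=tcp,nolock namenode-host:/ /mnt/hdfs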

HDFS User Commands

hdfs : the command-line interface used with Hadoop version 2

hadoop dfs : was used with version 1 and is now deprecated

HDFS User Commands

Usage:

hdfs [--config confdir] COMMAND

where COMMAND is one of:

  dfs                  run a file system command on the file systems supported in Hadoop
  namenode -format     format the DFS file system
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode

HDFS User Commands

  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  haadmin              run a DFS HA admin client
  fsck                 run a DFS file system checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode
  mover                run a utility to move block replicas across storage types

HDFS User Commands
  oiv                  apply the offline fsimage viewer to an fsimage
  oiv_legacy           apply the offline fsimage viewer to a legacy fsimage
  oev                  apply the offline edits viewer to an edits file
  fetchdt              fetch a delegation token from the NameNode
  getconf              get config values from configuration
  groups               get the groups which users belong to
  snapshotDiff         diff two snapshots of a directory or diff the current directory
  lsSnapshottableDir   list all snapshottable dirs owned by the current user

Use -help to see options.

HDFS User Commands
  portmap              run a portmap service
  nfs3                 run an NFS version 3 gateway
  cacheadmin           configure the HDFS cache
  crypto               configure HDFS encryption zones
  storagepolicies      get all the existing block storage policies
  version              print the version

Most commands print help when invoked without parameters.

HDFS User Commands: List Files in HDFS
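For example, to list the files in the user's HDFS home directory and in the HDFS root:

hdfs dfs -ls
hdfs dfs -ls /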

HDFS User Commands

Make a Directory in HDFS

hdfs dfs -mkdir ise

Copy file to HDFS

hdfs dfs -put test ise

Copy files from HDFS

hdfs dfs -get ise/test test-local

Copy files within HDFS

hdfs dfs -cp ise/test test.hdfs

HDFS User Commands

Delete a file within HDFS

hdfs dfs -rm test.hdfs

Moved: 'hdfs://limulus:8020/user/hdfs/ise/test' to trash at: hdfs://limulus:8020/user/hdfs/.Trash/Current

hdfs dfs -rm -skipTrash ise/test

Delete a directory in HDFS

hdfs dfs -rm -r -skipTrash ise

HDFS User Commands

Get an HDFS status report
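The report is produced with the dfsadmin command (full details may require administrator privileges):

hdfs dfsadmin -report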

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>   /* for the O_WRONLY and O_CREAT flags */
#include "hdfs.h"    /* libhdfs C API */

int main(int argc, char **argv)
{
    /* Connect to the default HDFS file system named in the configuration */
    hdfsFS fs = hdfsConnect("default", 0);
    const char* writePath = "/tmp/testfile.txt";
    /* Open the HDFS file for writing, creating it if it does not exist */
    hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
    if (!writeFile)
    {
        fprintf(stderr, "Failed to open %s for writing!\n", writePath);
        exit(-1);
    }

    /* Write a small buffer to the file and flush it to HDFS */
    char* buffer = "Hello, World!\n";
    tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer)+1);
    if (hdfsFlush(fs, writeFile))
    {
        fprintf(stderr, "Failed to 'flush' %s\n", writePath);
        exit(-1);
    }
    hdfsCloseFile(fs, writeFile);
    hdfsDisconnect(fs);
    return 0;
}
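As a rough sketch (paths and flags depend on the local installation; hdfs_write.c is a placeholder file name), the program can be compiled against libhdfs and run with the Hadoop class path exported:

gcc hdfs_write.c -I$HADOOP_HOME/include -L$HADOOP_HOME/lib/native -lhdfs -o hdfs_write
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath --glob)
./hdfs_write
hdfs dfs -cat /tmp/testfile.txt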

Essential Hadoop Tools

Essential Hadoop Tools

• The Hadoop ecosystem offers many tools to help with data input, high-level
processing, workflow management and the creation of huge databases
• Each tool is managed as a separate Apache Software Foundation project
• But all are designed to operate with the core Hadoop services, including HDFS, YARN
and MapReduce
• Background on each tool, with a start-to-finish example, is given here

Using Apache Pig

• Apache Pig (Pig Latin) is a high-level language

• Enables writing complex MapReduce transformations using a simple
scripting language
• Defines aggregate, join and sort transformations on data sets
• Used for extract, transform and load (ETL) data pipelines,
quick research on raw data, and iterative data processing
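A minimal Pig Latin sketch (the file, field and relation names are only placeholders):

logs  = LOAD '/user/hdfs/logs.csv' USING PigStorage(',')
            AS (ip:chararray, url:chararray, hits:int);
big   = FILTER logs BY hits > 100;
byurl = GROUP big BY url;
out   = FOREACH byurl GENERATE group, SUM(big.hits);
DUMP out;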

Using Apache Pig

• Local mode: all processing is done on the local machine

• Non-local (cluster) mode: executes the job on the cluster using either the MapReduce engine
or the optimized Tez engine
• Interactive and batch modes: enable Pig applications to be developed locally in
interactive mode using small amounts of data,
and run at scale on the cluster in a production mode.

Using Apache Hive

• Apache Hive is a data warehouse infrastructure


• Built on top of Hadoop for providing
1. Data summarization
2. Ad hoc queries
3. Analysis of large data sets

using a SQL-like language called HiveQL

Using Apache Hive

Apache Hive offers following features


• Tools to enable easy data extraction, transformation and loading (ETL)
• Mechanism to impose structure on variety of data formats
• Access to files stored either directly in HDFS or in other data storage
systems such as HBase
• Query execution via MapReduce and Tez

Using Apache Sqoop to Acquire Relational Data

Apache Sqoop is a tool designed to transfer data between Hadoop and relational
databases.
• Sqoop can be used to import data from an RDBMS into HDFS
• transform the data in Hadoop
• and export the data back into an RDBMS
• Can be used with any JDBC-compliant database
• Has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle

Using Apache Sqoop to Acquire Relational Data

Version 1 vs. Version 2
-------------------------------------------------------------------------------------------------------
1. Connectors: Version 1 accessed data using connectors written for specific
   databases; Version 2 does not support these specialized connectors.

2. Data transfer: Version 1 could move data directly from an RDBMS to Hive or HBase,
   and from Hive or HBase back to an RDBMS; Version 2 provides more generalized
   ways to accomplish these tasks.

Using Apache Sqoop to Acquire Relational Data
Import Methods
Step 1: Sqoop examines the database to gather the necessary metadata
for the data to be imported

Step 2: Sqoop submits a map-only (no reduce step) Hadoop job to the cluster
• The job does the actual data transfer using the metadata captured in step 1
• The imported data are saved in an HDFS directory
• Once placed in HDFS, the data are ready for processing
Using Apache Sqoop to Acquire Relational Data
Export Methods
Step 1: Sqoop examines the database metadata

Step 2: Sqoop uses a map-only Hadoop job to write the data to the database
• Sqoop divides the input data set into splits

• Then uses individual map tasks to push the splits to the database.
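For illustration, an import and an export might look like the following (the host, database, table and directory names are placeholders):

sqoop import --connect jdbc:mysql://db-host/retail --table customers \
      --username sqoopuser -P --target-dir /user/hdfs/customers -m 4
sqoop export --connect jdbc:mysql://db-host/retail --table customer_summary \
      --username sqoopuser -P --export-dir /user/hdfs/customer_summary -m 4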

Using Apache Flume to Acquire Data Streams

• Apache Flume is an independent agent designed to collect,


transport, and store data into HDFS

• Data transport involves a number of Flume agents that may


traverse a series of machines and locations

• Used for log files, social media-generated data, email messages, and
just about any continuous data source

Using Apache Flume to Acquire Data Streams

Source: the source component receives data and sends it to a channel.
It can send the data to more than one channel. The input data can be from a
real-time source (e.g., a weblog) or another Flume agent.

Channel: a channel is a data queue that forwards the source data to the
sink destination. It acts as a buffer that manages the input (source) and
output (sink) flow rates.

Sink: the sink delivers data to a destination such as HDFS, a local file, or
another Flume agent.
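A minimal single-agent configuration sketch (the agent and component names a1, r1, c1, k1 and the HDFS path are only illustrative):

a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The agent would then be started with something like: flume-ng agent -n a1 -f a1.conf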
Using Apache Flume to Acquire Data Streams
• Flume agents may be placed in a pipeline, possibly to traverse several machines or
domains
• This configuration is normally used when data are collected on one machine (e.g.,
a web server) and sent to another machine that has access to HDFS.
• The data transfer format used by Flume is called Apache Avro, which provides several
useful features:
1. Avro is a data serialization/deserialization system that uses a compact
binary format
2. The schema is sent as part of the data exchange and is defined using
JSON
3. Avro also uses RPCs to send data;
that is, an Avro sink will contact an Avro source to send data.

Using Apache Flume to Acquire Data Streams

Flume is used to consolidate


several data sources before
committing them to HDFS

Manage Hadoop Workflows with Apache Oozie

• Oozie is a workflow director system


• Designed to run and manage multiple related Apache Hadoop jobs
• Oozie workflow jobs are represented as DAGs of actions

Manage Hadoop Workflows with Apache Oozie

Three types of Oozie jobs are permitted:


1. Workflow—a specified sequence of Hadoop jobs with outcome-based
decision points and control dependency. Progress from one action to
another cannot happen until the first action is completed.
2. Coordinator—a scheduled workflow job that can run at various time intervals
or when data become available.
3. Bundle—a higher-level Oozie abstraction that will batch a set of coordinator
jobs.

Manage Hadoop Workflows with Apache Oozie

• Oozie is integrated with the rest of the Hadoop stack, supporting several
types of Hadoop jobs out of the box (e.g., Java MapReduce, Streaming
MapReduce, Pig, Hive, and Sqoop)
• As well as system-specific jobs (e.g., Java programs and shell scripts).
• Oozie also provides a CLI and a web UI for monitoring jobs.

Manage Hadoop Workflows with Apache Oozie

• Control flow nodes define the beginning and the end of a workflow. They
include start, end, and optional fail nodes.

• Action nodes are where the actual processing tasks are defined. When an
action node finishes, the remote systems notify Oozie and the next node in
the workflow is executed.

Manage Hadoop Workflows with Apache Oozie

• Fork/join nodes enable parallel execution of tasks in the workflow. The fork
node enables two or more tasks to run at the same time. A join node
represents a rendezvous point that must wait until all forked tasks complete.
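A skeletal workflow definition might look like the following (the names are placeholders, and a real action node would contain a concrete action such as a map-reduce, pig or hive element):

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="first-action"/>
  <action name="first-action">
    <!-- the concrete action (e.g., map-reduce, pig, hive) is defined here -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>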


Using Apache HBase

• Apache HBase is an open source, distributed, versioned, nonrelational


database
• HBase leverages the distributed data storage provided by the underlying
distributed file systems spread across commodity servers
• Apache HBase provides Bigtable-like capabilities on top of Hadoop and
HDFS

Using Apache HBase

Important features include the following capabilities:


• Linear and modular scalability
• Strictly consistent reads and writes
• Automatic and configurable sharding of tables
• Automatic failover support between RegionServers
• Convenient base classes for backing Hadoop MapReduce jobs with
Apache HBase tables
• Easy-to-use Java API for client access

Using Apache HBase

HBase Data Model Overview


• A table in HBase is similar to other databases, having rows and columns
• Columns in HBase are grouped into column families, all with the same
prefix. Ex: price:open, price:close, price:low, and price:high
• A column does not need to be a family. Ex: volume
• All column family members are stored together in the physical file system
• Specific HBase cell values are identified by a row key, column (column
family and column), and version (timestamp).

Using Apache HBase
HBase Data Model Overview
• It is possible to have many versions of data within an HBase cell.
• Almost anything can serve as a row key, from strings to binary representations
of longs to serialized data structures.
• Rows are lexicographically sorted with the lowest order appearing first in a table.
• The empty byte array denotes both the start and the end of a table’s
namespace.
• All table accesses are via the table row key, which is
considered its primary key.
Using Apache HBase
Create the Database
hbase(main):006:0> create 'apple', 'price' , 'volume'
0 row(s) in 0.8150 seconds

• Table name is apple, and two columns are defined.


• The date will be used as the row key.
• The price column is a family of four values (open, close, low, high).

Using Apache HBase

The put command is used to add data to the database from within the shell.

put 'apple','6-May-15','price:open','126.56'
put 'apple','6-May-15','price:high','126.75'
put 'apple','6-May-15','price:low','123.36'
put 'apple','6-May-15','price:close','125.01'
put 'apple','6-May-15','volume','71820387'

Using Apache HBase
Inspect the Database :The entire database can be listed using the scan
command.
hbase(main):006:0> scan 'apple'
ROW COLUMN+CELL
6-May-15 column=price:close, timestamp=1430955128359, value=125.01
6-May-15 column=price:high, timestamp=1430955126024, value=126.75
6-May-15 column=price:low, timestamp=1430955126053, value=123.36
6-May-15 column=price:open, timestamp=1430955125977, value=126.56
6-May-15 column=volume:, timestamp=1430955141440, value=71820387
Using Apache HBase
Get a Row You can use the row key to access an individual row.
hbase(main):008:0> get 'apple', '6-May-15'
COLUMN CELL
price:close timestamp=1430955128359, value=125.01
price:high timestamp=1430955126024, value=126.75
price:low timestamp=1430955126053, value=123.36
price:open timestamp=1430955125977, value=126.56
volume: timestamp=1430955141440, value=71820387
5 row(s) in 0.0130 seconds
Using Apache HBase

Get Table Cells A single cell can be accessed using the get command and the
COLUMN option:

hbase(main):013:0> get 'apple', '5-May-15', {COLUMN => 'price:low'}


COLUMN CELL
price:low timestamp=1431020767444, value=125.78
1 row(s) in 0.0080 seconds

Using Apache HBase

Get Table Cells :multiple columns can be accessed as follows:

hbase(main):012:0> get 'apple', '5-May-15', {COLUMN => ['price:low', 'price:high']}


COLUMN CELL
price:high timestamp=1431020767444, value=128.45
price:low timestamp=1431020767444, value=125.78
2 row(s) in 0.0070 seconds

Using Apache HBase

Delete a Cell A specific cell can be deleted using the following command:
hbase(main):009:0> delete 'apple', '6-May-15' , 'price:low'
Delete a Row You can delete an entire row by giving the deleteall command
hbase(main):009:0> deleteall 'apple', '6-May-15'
Remove a Table To remove (drop) a table, you must first disable it. The following
two commands remove the apple table from Hbase:
hbase(main):009:0> disable 'apple'
hbase(main):010:0> drop 'apple'

Using Apache HBase

Adding Data in Bulk

• There are several ways to efficiently load bulk data into HBase
• The ImportTsv utility loads data in tab-separated values (tsv) format into
HBase
It has two distinct usage modes:
1. Loading data from a tsv-format file in HDFS into HBase via the put command
2. Preparing StoreFiles to be loaded via the completebulkload utility
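For illustration, the first mode might be invoked roughly as follows for the apple table used above (the column mapping and the input path are placeholders):

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
     -Dimporttsv.columns=HBASE_ROW_KEY,price:open,price:high,price:low,price:close \
     apple /user/hdfs/apple-prices.tsv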
