Unit II EBDP 2022 PDF
7/28/2022 1
Filesystem limits (max file size / max volume size):
FAT32 (Windows 95, …)
NTFS: 16 GB / 16 EB
Ext3 (2001): 2 TB / 32 TB
Ext4: 16 TB / 1 EB
XFS: 8 EB / 8 EB
Basic Features: HDFS
• Applications with large data sets
– Very Large Distributed File System -- 10K nodes, 100 million files
– Supports Hadoop clusters that can store and process petabytes of data
• Streaming access to file system data
– Built on Write-Once, Read-Many-times pattern. The time to read the
whole dataset is more important than the latency in reading the first
record.
• Highly fault-tolerant
– Files are replicated to handle hardware failure
– Detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS
• High throughput
– Optimized for batch processing
– Data locations exposed so that computations can move to where data
resides
• Can be built out of commodity hardware
– Hadoop doesn’t require expensive, highly reliable hardware.
– It’s designed to run on clusters of commodity hardware
Applications for which HDFS is not suitable
• Low Latency
– HDFS is optimized for delivering a high throughput of data, and this may
be at the expense of latency
• Lots of Small files
– Built on Master/Slave architecture.
– The namenode(Master) holds filesystem metadata in memory, the
limit to the number of files in a filesystem is governed by the amount
of memory on the namenode.
• Multiple writers, arbitrary file modifications
– Files in HDFS may be written to by a single writer.
– There is no support for multiple writers or for modifications at
arbitrary offsets in the file.
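The "lots of small files" limit can be made concrete with a back-of-the-envelope calculation. A minimal sketch, assuming the commonly cited rule of thumb of roughly 150 bytes of namenode memory per namespace object (file, directory, or block); the exact figure is an approximation and varies by Hadoop version:

```python
# Rough estimate of how many namespace objects a NameNode heap can
# track. BYTES_PER_OBJECT is an assumed rule of thumb, not an exact
# HDFS constant.
BYTES_PER_OBJECT = 150

def max_objects(heap_bytes: int) -> int:
    """Approximate number of files/directories/blocks a heap can hold."""
    return heap_bytes // BYTES_PER_OBJECT

# With a 4 GB heap, the namenode can track on the order of tens of
# millions of objects -- so many small files exhaust namenode memory
# long before the datanodes run out of disk.
print(max_objects(4 * 1024**3))
```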
HDFS Concepts
• Block
• Namenode, Datanode and Secondary
namenode
• HDFS Federation
• HDFS High Availability
• Command Line
HDFS concepts : Block
• Blocks: "A disk has a block size, which is the minimum amount of
data that it can read or write."
• Filesystem block sizes are integral multiples of the disk block size.
A disk block size is normally 512 bytes in Linux.
• In HDFS, blocks are much larger (128 MB by default). Unlike a single-disk
filesystem, files in HDFS are broken into block-sized chunks, which are
stored as independent units.
• Why are blocks in HDFS so large?
– NTFS 4 KB, ext4 4 KB on Ubuntu, HDFS 128 MB
– To minimize the cost of seeks.
• The time to transfer the data from the disk can be made to be
significantly larger than the time to seek to the start of the block.
• Time to transfer blocks operates at the disk transfer rate.
– Small metadata volumes, low seek time.
– Blocks simplifies management of storage subsystem
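The seek-versus-transfer argument above can be sketched numerically. The 10 ms seek time and 100 MB/s transfer rate below are illustrative assumptions for a spinning disk, not HDFS parameters:

```python
# Why HDFS blocks are large: make transfer time dominate seek time.
SEEK_TIME_S = 0.01            # ~10 ms average seek (assumed)
TRANSFER_RATE = 100 * 10**6   # ~100 MB/s sustained transfer (assumed)

def block_size_for_overhead(seek_fraction: float) -> float:
    """Block size (bytes) so that seek time is `seek_fraction` of transfer time."""
    return SEEK_TIME_S * TRANSFER_RATE / seek_fraction

# To keep seeks at 1% of transfer time, blocks must be ~100 MB, which
# is why the HDFS default (128 MB) dwarfs the 4 KB blocks of ext4/NTFS.
print(block_size_for_overhead(0.01) / 10**6, "MB")
```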
HDFS Block
Benefits of HDFS Blocks
• Making the unit of abstraction a block rather than a file.
– Allows a block storage on any of the disks in the cluster.
– Simplifies the storage subsystem.
• Since blocks are a fixed size, it is easy to calculate how many
can be stored on a given disk and
• Eliminates metadata concerns ( such as file permissions etc.)
– Fits well with replication for providing fault tolerance and
availability.
HDFS Cluster Nodes
• HDFS Cluster – two types of nodes operating in a master-slave pattern
• Namenode :
• Manages the Filesystem namespace
• Maintains the filesystem hierarchy
• Maintains the Metadata for all the files
and directories
• Cluster Configuration Management
• Datanode
– Stores data in the local file system
– Stores metadata of a block
– Serves data and metadata to Clients
– Periodically reports its status to Name Node
• Secondary Namenode
– Can be used to restore a failed Namenode
Namenode Functionality
• The Namenode manages the file system namespace and regulates
access to files by clients.
• Stores all metadata in memory
– Includes List of files, List of Blocks, List of DataNodes for each block,
File attributes creation time, replication factor, permissions, size of
file, etc.
• Metadata is maintained in RAM to serve read requests
– Size of metadata is limited to available RAM storage
– Running out of RAM can cause the NameNode to crash
• NN maintains two things on its local file system
– Metadata of the filesystem tree is persistently stored in
a file called the FsImage
– File creations, deletions, and modifications (i.e., a log of activities on
HDFS) are recorded in a file called the EditLog
Name Node
• Keeps image of entire file system namespace and file
Block map in memory.
• 4GB of local RAM is sufficient to support the above
data structures that represent the huge number of
files and directories.
• When the Namenode starts up, it reads the FsImage and
EditLog from its local file system, applies the EditLog
transactions to the FsImage, and then stores a copy of the
updated FsImage on the filesystem as a checkpoint.
• Checkpointing is done periodically, so that the system can
recover to the last checkpointed state in case of a
crash.
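The startup/checkpoint sequence above can be sketched as a replay of logged operations over a namespace snapshot. The dict-based namespace and tuple-based log format are illustrative assumptions, not HDFS's real on-disk formats:

```python
# Sketch of NameNode checkpointing: apply the EditLog to the FsImage
# and persist the result as the new checkpoint.

def apply_edit_log(fsimage: dict, edit_log: list) -> dict:
    """Replay logged operations against the namespace snapshot."""
    ns = dict(fsimage)
    for op, path, payload in edit_log:
        if op == "create":
            ns[path] = payload
        elif op == "delete":
            ns.pop(path, None)
        elif op == "modify":
            ns[path] = payload
    return ns

fsimage = {"/user/a.txt": {"replication": 3}}
edit_log = [
    ("create", "/user/b.txt", {"replication": 3}),
    ("delete", "/user/a.txt", None),
]
checkpoint = apply_edit_log(fsimage, edit_log)  # becomes the new FsImage
print(checkpoint)
```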
Data Nodes
• There is usually more than one Data Node per cluster.
• Serves Read, Write requests, performs Block Creation, Deletion
and Replication upon instruction from Namenode.
• The DataNodes manage storage attached to the nodes that they
run on.
• A Data Node has no knowledge of HDFS files. It stores each block of HDFS
data in a separate file in its local file system.
• Does not create all files in the same directory.
• Uses heuristics to determine optimal number of files per directory
and creates directories appropriately.
• At the File System start-up it generates a list of all HDFS blocks stored
at its local file system and sends a report to Namenode (Block
Report).
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
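The Block Report step above can be sketched as a scan of the DataNode's local storage directory. The `blk_` file-naming convention mirrors real DataNode storage, but the flat directory layout here is a simplification:

```python
# Sketch of a DataNode block report: scan local storage and report the
# block IDs found (metadata files are excluded).
import os
import tempfile

def generate_block_report(storage_dir: str) -> list:
    """List block IDs found under the local storage directory."""
    report = []
    for root, _dirs, files in os.walk(storage_dir):
        for name in files:
            if name.startswith("blk_") and not name.endswith(".meta"):
                report.append(name)
    return sorted(report)

# Simulate a DataNode storage directory holding two blocks.
d = tempfile.mkdtemp()
for blk in ("blk_1001", "blk_1002", "blk_1001.meta"):
    open(os.path.join(d, blk), "w").close()
print(generate_block_report(d))  # ['blk_1001', 'blk_1002']
```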
Namenode Failure
• In the event of NN failure, a new Name Node can be built if the cluster
is configured to use one of the following
– Replicating the FSImage and Editlog on multiple machines
– Secondary Name Node.
Secondary NameNode
• It can be used to restore a failed NameNode
– Just copies current directory to new NameNode
• The Secondary NameNode merges the FsImage and EditLog
– Every couple of minutes, the secondary namenode copies the new edit log from
the primary NN
– Merges the EditLog into the FsImage
– Copies the newly merged FsImage back to the primary namenode
• Not used as standby or mirror node
• Runs on a separate machine
• Memory requirements are the same as NameNode
• Directory structure is same as NameNode
– It also keeps previous checkpoint version in addition to current
• Faster startup time
Secondary Name Node
HDFS Federation
• The prior HDFS architecture allows only a single namespace for
the entire cluster. In that configuration, a single Namenode
manages the namespace.
• HDFS Federation allows a cluster to scale by adding multiple
Namenodes/namespaces to HDFS, each of which manages a
portion of the filesystem namespace. For example,
– one namenode might manage all the files rooted under /user, say, and a
second namenode might handle files under /share.
• In HDFS federation, the Namenodes are independent and do not
require coordination with each other. The Datanodes are used as
common storage for blocks by all the Namenodes. Each
Datanode registers with all the Namenodes in the cluster.
Datanodes send periodic heartbeats and block reports. They also
handle commands from the Namenodes.
Federated Namenode (HDFS2)
• New in Hadoop 2: Namenodes can be federated
– Historically Namenodes would become a bottleneck on huge clusters
– One million blocks or ~100TB of data require roughly one GB of RAM in
Namenode
• Blockpools
– Administrator can create separate blockpools/namespaces with different
namenodes
– Datanodes register on all Namenodes
– Datanodes store data of all blockpools (otherwise you could set up separate
clusters)
– New ClusterID identifies all namenodes in a cluster.
– A Namespace and its block pool together are called Namespace Volume
– You define which blockpool to use by connecting to a specific Namenode
– Each Namenode still has its own separate backup/secondary /checkpoint
node
• Benefits
– One Namenode failure will not impact other Blockpools
– Better scalability for large numbers of file operations
High Availability
• HDFS-2 adds Namenode High Availability
• Standby Namenode needs filesystem transactions and
block locations for fast failover
• Every filesystem modification is logged to at least 3
quorum journal nodes by active Namenode
– Standby Node applies changes from journal nodes as they occur
– Majority of journal nodes define reality
– Split Brain is avoided by Journalnodes ( They will only allow one
Namenode to write to them )
• Datanodes send block locations and heartbeats to both
Namenodes
• Memory state of Standby Namenode is very close to Active
Namenode
• Much faster failover than cold start
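The majority-of-journal-nodes rule can be sketched as follows. Journal nodes are modelled as plain lists, and the epoch/fencing machinery real JournalNodes use to enforce a single writer is omitted:

```python
# Sketch of quorum-based edit logging: a write is durable once a
# majority of journal nodes acknowledge it.

def quorum_write(journals: list, record: str, failed: set) -> bool:
    """Append `record`; succeed only if a majority of journals ack."""
    acks = 0
    for i, journal in enumerate(journals):
        if i in failed:
            continue  # unreachable journal node
        journal.append(record)
        acks += 1
    return acks > len(journals) // 2  # majority defines reality

journals = [[], [], []]
print(quorum_write(journals, "mkdir /user", failed=set()))   # all 3 ack
print(quorum_write(journals, "rm /tmp/x", failed={2}))       # 2 of 3 ack
print(quorum_write(journals, "lost", failed={1, 2}))         # no quorum
```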
Failover and Fencing
• The transition from the active namenode to the standby is
managed by an entity called the failover controller.
– Default implementation uses ZooKeeper to ensure that only one
namenode is active.
• The HA implementation ensures that the previously active
namenode is prevented from doing any damage and causing
corruption, a method known as fencing.
• Fencing includes
– Revoking the namenode’s access to the shared storage directory
– Disabling its network port via a remote management command
– Forcibly powering down the host machine, known as STONITH, or
“shoot the other node in the head”
HDFS - Replication
• Blocks are replicated for fault tolerance.
• Blocks of data are replicated to multiple
nodes
– Behavior is controlled by replication
factor, configurable per file
– Set by dfs.replication parameter in
hdfs-site.xml
– Default is 3 replicas
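Following the property-file convention shown later in these slides, the replication factor could be set cluster-wide in hdfs-site.xml like this (the value shown is the default):

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```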
Rack Awareness (1 of 2)
• Default installation assumes that all nodes belong to one Rack
• Computers belonging to the same rack are on the same network switch
• Cluster administrator determines which computer belongs to which rack
and sets the topology using topology.script.file.name referenced in
core-site.xml file
• Example of property:
<property>
<name>topology.script.file.name</name>
<value>/opt/ibm/biginsights/hadoop-conf/rack-aware.sh</value>
</property>
• Each node’s position in the cluster is represented by a string with syntax
similar to a file name.
• The network topology script (topology.script.file.name in the above ex)
receives as arguments one or more IP addresses of nodes in the
cluster. It returns on stdout a list of rack names, one for each input.
The input and output order must be consistent
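A topology script like the rack-aware.sh above can be sketched in Python. The IP-to-rack table is a made-up example, and /default-rack is the conventional fallback rack name:

```python
# Sketch of a rack-topology script: given node IP addresses as
# arguments, print one rack name per input, in the same order.
import sys

RACK_MAP = {  # assumed cluster layout, for illustration only
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
}
DEFAULT_RACK = "/default-rack"

def resolve(addresses):
    """Return one rack name per address; unknown nodes get the default rack."""
    return [RACK_MAP.get(a, DEFAULT_RACK) for a in addresses]

if __name__ == "__main__":
    print(" ".join(resolve(sys.argv[1:])))
```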
Rack Awareness (2 of 2)
Distance
• For example, imagine a node n1 on rack r1 in data center d1. This
can be represented as /d1/r1/n1. Using this notation, the distances
for the four scenarios are:
– distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
– distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
– distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
– distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
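The distance rule (hops up to the closest common ancestor and back down) can be sketched as:

```python
# Network distance between two nodes written as /datacenter/rack/node
# paths, reproducing the standard HDFS distance values (0, 2, 4, 6).

def distance(a: str, b: str) -> int:
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0  # length of the common prefix of the two paths
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```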
HDFS Read Anatomy
• A client initiates a read request by calling the 'open()' method of the
DistributedFileSystem object.
• This object connects to Namenode using RPC and gets locations of first
few blocks of the file and addresses of the DataNodes having a copy of
that block.
• Once addresses of DataNodes are received, an object of type
FSDataInputStream is returned to the client. FSDataInputStream
contains a DFSInputStream, which takes care of interactions with the
DataNodes and the NameNode.
• The client then invokes the ‘read()’ method, which causes DFSInputStream
to establish a connection with the first DataNode holding the first block of
the file. Data is read in the form of streams: the client invokes ‘read()’
repeatedly until it reaches the end of the block.
• Once the end of a block is reached, DFSInputStream closes the connection
and moves on to locate the next DataNode for the next block
• Once the client is done reading, it calls the close() method.
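The read flow above can be sketched with stand-in NameNode/DataNode classes. These are illustrative models, not the real Hadoop client API:

```python
# Sketch of the HDFS read path: get block locations from the NameNode,
# then stream each block from a DataNode in order.

class FakeNameNode:
    def __init__(self, block_map):
        self.block_map = block_map  # path -> [(block_id, [datanodes])]

    def get_block_locations(self, path):
        return self.block_map[path]

class FakeDataNode:
    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, datanodes, path):
    """Concatenate blocks in order, reading each from its first replica."""
    data = b""
    for block_id, locations in namenode.get_block_locations(path):
        dn = datanodes[locations[0]]  # the closest replica would be chosen here
        data += dn.read_block(block_id)
    return data

nn = FakeNameNode({"/f": [("b1", ["dn1"]), ("b2", ["dn2"])]})
dns = {"dn1": FakeDataNode({"b1": b"hello "}), "dn2": FakeDataNode({"b2": b"world"})}
print(read_file(nn, dns, "/f"))  # b'hello world'
```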
Replica Selection - Read
• Replica selection for READ operation: HDFS tries to minimize the
bandwidth consumption and latency.
• If there is a replica on the Reader node then that is preferred.
• HDFS cluster may span multiple data centers: replica in the local
data center is preferred over the remote one.
HDFS Data Flow : Anatomy of File Write
HDFS Write Anatomy
• A client initiates a write operation by calling the 'create()' method of the
DistributedFileSystem object, which creates a new file.
• DistributedFileSystem object connects to the NameNode using RPC call
and initiates new file creation.
– This file create operation does not associate any blocks with the file.
– It is the responsibility of NameNode to verify that the file does not exist
already and a client has correct permissions to create a new file.
– If a file already exists or client does not have sufficient permission to create
a new file, then IOException is thrown to the client.
– Otherwise, the operation succeeds and a new record for the file is created
by the NameNode.
• Once a new record in NameNode is created, an object of type
FSDataOutputStream (FSDOS) is returned to the client. A client uses it to
write data into the HDFS.
• FSDOS contains a DFSOutputStream (DFSOS) object that takes care of the
interaction between the DataNodes and the NameNode.
– While the client continues writing data, DFSOS keeps creating packets
from this data. These packets are enqueued into a queue called
the DataQueue.
HDFS Write Anatomy
• There is a component called DataStreamer(DS) which consumes this
DataQueue. DataStreamer also asks NameNode for allocation of new
blocks thereby picking desirable DataNodes to be used for replication.
• The process of replication starts by creating a pipeline of DataNodes.
With a replication level of 3, there are 3 DataNodes in the
pipeline.
• The DataStreamer pours packets into the first DataNode in the pipeline.
• Every DataNode in the pipeline stores each packet it receives and forwards
it to the next DataNode in the pipeline.
• Another queue, 'Ack Queue' is maintained by DFSOutputStream to store
packets which are waiting for acknowledgment from DataNodes.
• Once acknowledgment for a packet in the queue is received from all
DataNodes in the pipeline, it is removed from the 'Ack Queue'.
– In the event of any DataNode failure, packets from this queue are
used to reinitiate the operation.
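The DataQueue/AckQueue bookkeeping above can be sketched as follows. Pipelining and failure handling are simplified to the happy path, where every DataNode acknowledges each packet:

```python
# Sketch of the write path's two queues: packets move to the ack queue
# when sent down the pipeline and are dropped only after every DataNode
# acknowledges them.
from collections import deque

def write_packets(packets, pipeline):
    """Send packets through the pipeline, tracking unacked packets."""
    data_queue = deque(packets)
    ack_queue = deque()
    stored = {dn: [] for dn in pipeline}
    while data_queue:
        pkt = data_queue.popleft()
        ack_queue.append(pkt)        # awaiting acknowledgment
        for dn in pipeline:          # each node stores and forwards
            stored[dn].append(pkt)
        ack_queue.popleft()          # all replicas acked -> drop packet
    return stored, list(ack_queue)

stored, pending = write_packets(["p1", "p2"], ["dn1", "dn2", "dn3"])
print(stored["dn3"], pending)  # ['p1', 'p2'] []
```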
HDFS Write Anatomy
• After the client is done writing data, it calls
close(), which flushes the remaining data packets
to the pipeline and then waits for acknowledgments.
Robustness
• Data Node failures: a server can fail, a disk can
crash, or data can become corrupted.
• Network failures: data corruption can also result from
network or disk issues, and a network failure can
affect many data nodes at a time.
• NameNode failures: disk failure on the name node
host, or corruption of the name node’s own
metadata.
Robustness
• Periodic Heartbeat : From DataNode to
NameNode.
• DataNodes without a recent heartbeat:
– Any new I/O is not directed to this node;
instead it is directed to the nodes where the blocks of
this datanode are replicated.
Terminology
• Staging: the process of holding data in a temporary location until it
accumulates to a full block size, at which point it can be copied to a
datanode.
• Replication Pipelining : While writing data to datanodes, the client
process flushes its block in small pieces (4K) to the first replica,
that in turn copies it to the next replica and so on, thus data is
pipelined from one Datanode to the next.
• HeartBeats: DataNodes send a heartbeat once every 3 seconds to
the NameNode. The NameNode uses heartbeats to detect DataNode
health, i.e., working/failed.
• Safe Mode: On startup Namenode enters Safemode. Each
DataNode checks in with Heartbeat and BlockReport. Namenode
verifies that each block has acceptable number of replicas. After a
configurable percentage of safely replicated blocks check in with
the Namenode, Namenode exits Safemode.
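The Safemode exit check above can be sketched as a threshold test. The 0.999 default is meant to mirror the dfs.namenode.safemode.threshold-pct property; the replica counts here are toy values:

```python
# Sketch of the Safemode exit rule: leave Safemode once a configured
# fraction of blocks have reached their minimum replica count.

def can_exit_safemode(reported_replicas, min_replicas=1, threshold=0.999):
    """reported_replicas: list of replica counts, one entry per block."""
    if not reported_replicas:
        return True
    safe = sum(1 for r in reported_replicas if r >= min_replicas)
    return safe / len(reported_replicas) >= threshold

print(can_exit_safemode([3, 3, 2, 1]))  # every block has a replica
print(can_exit_safemode([3, 3, 0, 1]))  # 75% safe < 99.9% threshold
```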
Notes - Replication
• Replication of data blocks does not occur in Safemode.
• Once the Name Node exits Safemode, it makes a list of blocks
that need to be replicated, and then proceeds to replicate
these blocks on other Datanodes.
HDFS at a High Level
Application Programming Interface
• HDFS provides a Java API for applications to use.
• Python access is also used in many
applications.
• A C language wrapper for Java API is also
available.
• An HTTP browser can be used to browse the
files of an HDFS instance.
FS Shell, Admin and Browser Interface
• HDFS organizes its data in files and directories.
• It provides a command line interface called the FS
shell that lets the user interact with data in the
HDFS.
• The syntax of the commands is similar to bash and
csh.
• Example: to create a directory
Local FS: mkdir /foodir
HDFS: hadoop fs -mkdir /foodir
• There is also DFSAdmin interface available
• Browser interface is also available to view the
namespace.
HDFS Commands
jps : JVM Process Status tool.
By typing jps we can see the daemons running on the host
machine.
5 daemon services will be running on a Hadoop machine,
namely Name Node, Data Node, Secondary Name Node,
Resource Manager and Node Manager.
Syntax : jps
To start all Daemons:
Syntax :start-all.sh
To stop all Daemons:
Syntax : stop-all.sh
HDFS Commands
Hadoop make a Directory:
Syntax:
hadoop fs -mkdir /directoryname
Example :
hadoop fs -mkdir /griet
Hadoop copy to local:
Syntax:
hadoop fs -copyToLocal <source path HDFS> <destination path LFS>
Example:
hadoop fs -copyToLocal /griet/input.log /home/griet/Documents
HDFS Commands
• Cat – Usage: hadoop fs -cat URI [URI …]
– Copies source paths to stdout.
– Example:
• hadoop fs -cat hdfs:/mydir/test_file1 hdfs:/mydir/test_file2
• hadoop fs -cat file:///file3 /user/hadoop/file4
• Chgrp - Usage: hadoop fs -chgrp [-R] GROUP URI [URI ..]
– Change group association of files.
– With -R, make the change recursively through the
directory structure.
• Chmod - Usage: hadoop fs -chmod [-R]
<MODE[,MODE]... | OCTALMODE> URI [URI …]
– Change the permissions of files.
– With -R, make the change recursively through the
directory structure.
HDFS Commands
• Count – Usage: hadoop fs -count [-q] <paths>
– Count the number of directories, files and bytes under the
paths that match the specified file pattern.
– The output columns are: DIR_COUNT, FILE_COUNT,
CONTENT_SIZE, FILE_NAME.
Hadoop FileSystem
• Hadoop has an abstract notion of filesystems, of which HDFS is
just one implementation.
• The Java abstract class org.apache.hadoop.fs.FileSystem
represents the client interface to a filesystem in Hadoop.
– In order to run HDFS on a machine, we need to follow some instructions
for setting up the Hadoop file system.
– fs.default.name is set to hdfs://localhost/, which sets a default
filesystem for Hadoop.
– Filesystems are specified by a URI, and here we have used an hdfs URI to
configure Hadoop to use HDFS by default.
• The HDFS daemons will use this property to determine the host and port
for the HDFS namenode.
– We’ll be running it on localhost, on the default HDFS port, 8020/9000. And
HDFS clients will use this property to find out where the namenode is
running so they can connect to it.
Hadoop File System Interfaces
• HTTP
• C
• NFS
• FUSE
Interfaces
• HTTP: By exposing its filesystem interface as a Java API,
Hadoop makes it awkward for non-Java applications to
access HDFS.
• The HTTP REST API exposed by the WebHDFS protocol
makes it easier for other languages to interact with HDFS.
• Hadoop provides a C library called libhdfs that mirrors the
Java FileSystem interface. You can find the header file,
hdfs.h, in the include directory of the Apache Hadoop
binary tarball distribution.
– The Apache Hadoop binary tarball comes with prebuilt libhdfs
binaries for 64-bit Linux, but for other platforms we need to
build them ourselves by following the BUILDING.txt instructions
at the top level of the source tree.
Interfaces
• NFS : It is possible to mount HDFS on a local client’s
filesystem using Hadoop’s NFSv3 gateway. You can then
use Unix utilities (such as ls and cat) to interact with the
filesystem, upload files, and in general use POSIX libraries
to access the filesystem from any programming language.
Sqoop: Key Observations
• Uses the database to describe the schema of the data
• Uses MapReduce to import and export the data
– The import process creates a Java class
• From the metadata of the table it generates a class that maps each column
to the nearest Java datatype, thus encapsulating each row of the imported table
– The source code of the class is provided
• Can help to quickly develop MapReduce applications that use HDFS-
stored records
• Transfers data between Hadoop and relational database
– Uses JDBC/ODBC drivers
• To establish a connection for import and export between structured
sources and Hadoop.
– Must copy the JDBC driver JAR files for any relational
databases to $SQOOP_HOME/lib
• Eg: MYSQL/db2 the target RDBMS specific connector jar file should be
part of sqoop installed lib directory.
Import
• Imports data from relational tables into HDFS
– Each row in the table becomes a separate record in HDFS
• Data can be stored
– --target-dir <dir> hdfs destination directory
• If omitted, the name of the directory is the name of the table
– Text files, binary files
• --as-avrodatafile, --as-sequencefile, --as-textfile
• --fields-terminated-by <char>, --lines-terminated-by <char>
– Into HBase
• --hbase-create-table
– Into Hive
• --hive-import
• Imported data
– Can be all rows of a table
– Can limit the rows and columns
– Can specify your own query to access relational data
Sqoop Connection
• Database connection requirements are same for
import and export
• sqoop import/export --connect
jdbc:db2://your.db2.com:50000/yourDB
--username db2user --password yourpassword …
• sqoop import --connect
jdbc:mysql://localhost:3306/db --username foo
--password admin --table Test
• sqoop import --connect
jdbc:db2://your.db2.com:50000/yourDB \
--username db2user --password db2password \
--table db2table --target-dir sqoopdata
Sqoop Commands
• List Databases:
$ sqoop list-databases --connect jdbc:mysql://localhost:3306/
--username root --password root
• List Tables:
$ sqoop list-tables --connect jdbc:mysql://localhost:3306/mydb
--username root --password root
• Eval: allows users to preview their Sqoop import queries to
ensure they import the data they expect.
$ sqoop eval --connect jdbc:mysql://localhost:3306/mydb
--username root --password root --query "select * from emp limit 5"
Data Ingestion Commands
• sqoop import --connect jdbc:mysql://localhost:3306/mydb
--username root --password root --table emp
• sqoop import --connect jdbc:mysql://localhost:3306/mydb
--username root --password root --table emp
--target-dir /home/sqoop_data
Options for import:
1. --target-dir import data to the selected directory
2. --split-by tbl_primarykey
sqoop import --connect jdbc:mysql://localhost:3306/mydb --username root
--password root --table emp --target-dir /home/sqoop_data
--fields-terminated-by '|' --where 'sal > 15000' --split-by eid
Import to HIVE
sqoop import --connect
jdbc:mysql://localhost:3306/mydb --username
root --password root --table emp --hive-table
mysqltohive --create-hive-table --hive-import
-m 1
Sqoop Export
• Exports a set of files from HDFS to a relational database
system
– Table must already exist
– Records are parsed based upon user's specifications
• Default mode is insert
– Inserts rows into the table
• Update mode
– Generates update statements
– Replaces existing rows in the table
– Does not generate an upsert
• Missing rows are not inserted
– Not detected as an error
• Call mode
– Makes a stored procedure call for each record
– --export-dir
• Specifies the directory in HDFS from which to read the data
export
• Basic export from files in a directory to a table
– sqoop export --connect jdbc:db2://your.db2.com:50000/yourDB \
--username db2user --password db2password --table employee \
--export-dir /home/sqoop_data
What is Flume
Flume Overview
Flume Architecture
Building Blocks of Flume
What is Flume
Mechanism to collect, aggregate, and move large
amounts of streaming data into HDFS.
Primarily designed for log aggregation from across
servers into a central point on HDFS.
Flume Overview
Streaming data Ingestion into HDFS
Apache Flume is an open source, powerful, reliable, and flexible
system for efficiently collecting, aggregating, and moving large
amounts of unstructured data from multiple data sources into
HDFS/HBase.
1. Designed to capture data as it is generated.
2. Channel it to HDFS for Storage for Subsequent Processing.
Flume Architecture
Building Blocks of Flume
Event : A basic unit of data that needs to be
transferred from source to destination, e.g., a log
record, an Avro object, etc. Normally around ~4 KB.
External sources send events to the Flume source in a
format that is recognized by the target source.
Agent : A JVM process that receives events from
clients or from other Flume agents and passes them to
a destination or to other agents.
Flume Agent Contains 3 main components
Source
Channel
Sink
Flume Agent - Source
The source is responsible for sending the event to the
channel it is connected to.
A Flume source receives events and transfers them to one
or more channels. The channel acts as a store which
keeps the event until it is consumed by the Flume
sink.
A source may have logic for reading data, translating it to an event,