
Hadoop Distributed File System

Filesystem file-size / volume limits:
• FAT32 (Windows 95, …), NTFS: 16 GB / 16 EB volume limit
• Ext3 (2001): 2 TB / 32 TB
• Ext4: 16 TB / 1 EB
• XFS: 8 EB
• HFS (Mac Hierarchical FS): 2 GB / 2 TB
• HFS+ (Mac OS 8.1): 8 EB
• Distributed filesystems spanning many machines (m1, m2, m3, …): HDFS, GFS
HDFS
• A distributed system is a collection of independent
computers that appears to its users as a single coherent
system.
• A filesystem is the methods and data structures that an
operating system uses to keep track of files on a disk or
partition; that is, the way the files are organized on the
disk.
• Filesystems that manage the storage across a network of
machines are called distributed filesystems.
• HDFS is a Hadoop Distributed Filesystem designed for
storing very large files with streaming data access patterns,
running on clusters of commodity hardware.

Basic Features: HDFS
• Applications with large data sets
– Very Large Distributed File System -- 10K nodes, 100 million files
– Supports Hadoop clusters that hold petabytes of data
• Streaming access to file system data
– Built on Write-Once, Read-Many-times pattern. The time to read the
whole dataset is more important than the latency in reading the first
record.
• Highly fault-tolerant
– Files are replicated to handle hardware failure
– Detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS
• High throughput
– Optimized for batch processing
– Data locations exposed so that computations can move to where data
resides
• Can be built out of commodity hardware
– Hadoop doesn’t require expensive, highly reliable hardware.
– It’s designed to run on clusters of commodity hardware
Applications for which HDFS is not suitable
• Low Latency
– HDFS is for delivering a high throughput of data, and this may be at
the expense of latency
• Lots of small files
– HDFS is built on a master/slave architecture.
– The namenode (master) holds filesystem metadata in memory, so the
limit on the number of files in a filesystem is governed by the amount
of memory on the namenode.
• Multiple writers, arbitrary file modifications
– Files in HDFS may be written to by a single writer.
– There is no support for multiple writers or for modifications at
arbitrary offsets in the file.

HDFS Concepts
• Block
• Namenode, Datanode and Secondary
namenode
• HDFS Federation
• HDFS High Availability
• Command Line

HDFS concepts : Block
• Blocks: “A disk has a block size, which is the minimum amount of
data that it can read or write.”
• Filesystem blocks are an integral multiple of the disk block size;
a disk block is normally 512 bytes on Linux.
• In HDFS, blocks are much larger (128 MB by default). Files in HDFS
are broken into block-sized chunks, which are
stored as independent units.
• Why are blocks in HDFS so large?
– NTFS: 4 KB, ext4 (Ubuntu): 4 KB, HDFS: 128 MB
– To minimize the cost of seeks.
• The time to transfer the data from the disk can be made
significantly larger than the time to seek to the start of the block.
• Transferring a large block then operates at the disk transfer rate.
– Small metadata volume, low seek time.
– Blocks simplify the management of the storage subsystem
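As a quick check, the configured block size and the blocks of an individual file can be inspected from the command line; a minimal sketch, assuming an HDFS client is on the PATH and reusing the /griet/input.log path from the later command examples:
$ hdfs getconf -confKey dfs.blocksize
$ hdfs fsck /griet/input.log -files -blocks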
HDFS Block

Benefits of HDFS Blocks
• Making the unit of abstraction a block rather than a file.
– Allows a block storage on any of the disks in the cluster.
– Simplifies the storage subsystem.
• Since blocks are a fixed size, it is easy to calculate how many
can be stored on a given disk and
• Eliminates metadata concerns ( such as file permissions etc.)
– Fits well with replication for providing fault tolerance and
availability.

– Files are distributed on HDFS based on dfs.blocksize property


– HDFS’s fsck command understands blocks. For example,
running:
% hadoop fsck / -files -blocks
will list the blocks that make up each file in the file system.

HDFS Cluster Nodes
• HDFS Cluster – Two types of
Nodes Master-Slave pattern
• Namenode :
• Manages the Filesystem namespace
• Maintains the filesystem hierarchy
• Maintains the Metadata for all the files
and directories
• Cluster Configuration Management
• Datanode
– Stores data in the local file system
– Stores metadata of a block
– Serves data and metadata to Clients
– Periodically reports its status to Name Node
• Secondary Namenode
– Can be used to restore a failed Namenode
Namenode Functionality
• Namenode, Manages the file system namespace and regulates
access to files by clients.
• Stores all metadata in memory
– Includes List of files, List of Blocks, List of DataNodes for each block,
File attributes creation time, replication factor, permissions, size of
file, etc.
• Metadata is maintained in RAM to serve read requests
– Size of metadata is limited to available RAM storage
– Running out of RAM can cause the NameNode to crash
• The NN maintains two files on its local file system:
– Metadata for the filesystem tree is persistently stored in
a file called the FsImage
– A record of file creations, deletions and modifications (i.e. a log of activity on
HDFS) is stored in a file called the EditLog

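Both files can be inspected offline with Hadoop's image and edits viewers; a minimal sketch, where the checkpoint file names are hypothetical placeholders from the NameNode's storage directory:
$ hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml
$ hdfs oev -p XML -i edits_0000000000000000001-0000000000000000042 -o edits.xml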
Name Node
• Keeps image of entire file system namespace and file
Block map in memory.
• 4GB of local RAM is sufficient to support the above
data structures that represent the huge number of
files and directories.
• When the Namenode starts up, it reads the FsImage and
EditLog from its local file system, applies the EditLog
transactions to the FsImage, and then stores a copy of the updated
FsImage on the filesystem as a checkpoint.
• Periodic checkpointing is done so that the system can
recover to the last checkpointed state in case of a
crash.
Data Nodes
• Usually more than one Data Nodes per Cluster.
• Serves Read, Write requests, performs Block Creation, Deletion
and Replication upon instruction from Namenode.
• The DataNodes manage storage attached to the nodes that they
run on.
• A DataNode has no knowledge about HDFS files. It stores each block of HDFS
data in a separate file in its local file system.
• Does not create all files in the same directory.
• Uses heuristics to determine optimal number of files per directory
and creates directories appropriately.
• At the File System start-up it generates a list of all HDFS blocks stored
at its local file system and sends a report to Namenode (Block
Report).
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes

Namenode Failure
• In the event of NN failure, a new Name Node can be built only if HDFS is
configured to use one of the following:
– Replicating the FsImage and EditLog on multiple machines
– A Secondary Name Node

• FsImage and EditLog are central data structures of HDFS.


• A corruption of these files can cause a HDFS instance to be
non-functional.
• For this reason, a Namenode can be configured to maintain
multiple copies of the FsImage and EditLog.
• Multiple copies of the FsImage and EditLog files are updated
synchronously.

Secondary NameNode
• It can be used to restore a failed NameNode
– Just copies current directory to new NameNode
• The Secondary NameNode merges the FsImage and EditLog
– Every few minutes, the secondary namenode copies the new edit log from the
primary NN
– Merges the EditLog into the FsImage
– Copies the newly merged FsImage back to the primary namenode
• Not used as standby or mirror node
• Runs on a separate machine
• Memory requirements are the same as NameNode
• Directory structure is same as NameNode
– It also keeps previous checkpoint version in addition to current
• Faster startup time

Secondary Name Node

HDFS Federation
• The prior HDFS architecture allows only a single namespace for
the entire cluster. In that configuration, a single Namenode
manages the namespace.
• HDFS Federation allows cluster to scale by adding multiple
Namenodes/namespaces to HDFS. Each of which manages a
portion of the filesystem namespace. For example,
– one namenode might manage all the files rooted under /user, say, and a
second namenode might handle files under /share.
• In HDFS federation, the Namenodes are independent and do not
require coordination with each other. The Datanodes are used as
common storage for blocks by all the Namenodes. Each
Datanode registers with all the Namenodes in the cluster.
Datanodes send periodic heartbeats and block reports. They also
handle commands from the Namenodes.
Federated Namenode (HDFS2)
• New in Hadoop2 Namenodes can be federated
– Historically Namenodes would become a bottleneck on huge clusters
– One million blocks or ~100TB of data require roughly one GB of RAM in
Namenode
• Blockpools
– Administrator can create separate blockpools/namespaces with different
namenodes
– Datanodes register on all Namenodes
– Datanodes store data of all blockpools ( otherwise you could setup separate
clusters)
– New ClusterID identifies all namenodes in a cluster.
– A Namespace and its block pool together are called Namespace Volume
– You define which blockpool to use by connecting to a specific Namenode
– Each Namenode still has its own separate backup/secondary /checkpoint
node
• Benefits
– One Namenode failure will not impact other Blockpools
– Better scalability for large numbers of file operations
High Availability
• HDFS-2 adds Namenode High Availability
• Standby Namenode needs filesystem transactions and
block locations for fast failover
• Every filesystem modification is logged to at least 3
quorum journal nodes by active Namenode
– Standby Node applies changes from journal nodes as they occur
– Majority of journal nodes define reality
– Split Brain is avoided by Journalnodes ( They will only allow one
Namenode to write to them )
• Datanodes send block locations and heartbeats to both
Namenodes
• Memory state of Standby Namenode is very close to Active
Namenode
• Much faster failover than cold start
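For reference, the failover state can be queried and a manual failover triggered with the HA admin tool; a short sketch, where nn1 and nn2 stand for whatever NameNode IDs are configured for the nameservice in hdfs-site.xml:
$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -failover nn1 nn2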
Failover and Fencing
• The transition from the active namenode to the standby is
managed by an entity called the failover controller.
– Default implementation uses ZooKeeper to ensure that only one
namenode is active.
• The HA implementation ensures that the previously active
namenode is prevented from doing any damage and causing
corruption, a technique known as fencing.
• Fencing includes
– Revoking the namenode’s access to the shared storage directory
and
– disabling its network port via a remote management command.
– forcibly power down the host machine known as STONITH, or
“shoot the other node in the head”
HDFS - Replication
• Blocks are replicated for fault tolerance.
• Blocks of data are replicated to multiple
nodes
– Behavior is controlled by replication
factor, configurable per file
– Set by dfs.replication parameter in
hdfs-site.xml
– Default is 3 replicas

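A minimal sketch of the corresponding hdfs-site.xml entry, assuming the default of 3 replicas (per-file values can still be changed later with hadoop fs -setrep):
<property>
<name>dfs.replication</name>
<value>3</value>
</property>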
Rack Awareness (1 of 2)
• Default installation assumes that all nodes belong to one Rack
• Computers belong to the same rack are on the same network switch
• Cluster administrator determines which computer belongs to which rack
and sets the topology using topology.script.file.name referenced in
core-site.xml file
• Example of property:
<property>
<name>topology.script.file.name</name>
<value>/opt/ibm/biginsights/hadoop-conf/rack-aware.sh</value>
</property>
• Each node’s position in the cluster is represented by a string with syntax
similar to a file name.
• The network topology script (topology.script.file.name in the above ex)
receives as arguments one or more IP addresses of nodes in the
cluster. It returns on stdout a list of rack names, one for each input.
The input and output order must be consistent
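A minimal sketch of such a topology script, assuming a hypothetical rack-aware.sh that maps node IP addresses to rack names and falls back to /default-rack; it prints one rack name per input argument, in the same order:
#!/bin/bash
# rack-aware.sh: print one rack name per input IP/hostname, in the same order as the inputs
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo "/dc1/rack1" ;;
    10.1.2.*) echo "/dc1/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done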
Rack Awareness(2 of 2)

Distance
• For example, imagine a node n1 on rack r1 in data center d1. This
can be represented as /d1/r1/n1. Using this notation, here are the
distances for the four scenarios:

– distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)


– distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
– distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same
data center)
– distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)

• The DataNode sends its location to the NameNode as part of the
registration information.
• The distance between two computers can be calculated by summing
their distances up to their closest common ancestor.
• distance(/d1/r1/n3, /d2/r4/n1) in the previous slide is 6.
Replica Placement
• The placement of the replicas is critical to HDFS reliability and
performance.
• Rack-aware replica placement:
– Goal: improve reliability, availability and network bandwidth utilization
– Research topic
• Communication between racks are through switches.
• Network bandwidth between machines on the same rack is greater than
those in different racks.
– Namenode determines the rack id for each DataNode.
– Replicas are typically placed on unique racks
– Common case : replica placement
• One replica on one node in the local rack (say rack 1)
• The 2nd replica on a node in a different rack (say rack 2)
• The 3rd replica on a different node in the same rack as the 2nd (rack 2)
• This cuts the inter-rack write traffic, which improves write
performance
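The actual placement of a file's replicas can be verified with fsck; a sketch, reusing the /griet/input.log path from the later examples:
$ hdfs fsck /griet/input.log -files -blocks -locations -racks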
HDFS : Data Flow Anatomy of a File Read

HDFS Read Anatomy
• A client initiates read request by calling 'open()' method of
DistributedFileSystem object.
• This object connects to Namenode using RPC and gets locations of first
few blocks of the file and addresses of the DataNodes having a copy of
that block.
• Once the addresses of the DataNodes are received, an object of type
FSDataInputStream is returned to the client. FSDataInputStream contains a
DFSInputStream, which takes care of interactions with the DataNodes and
the NameNode.
• In step 4, a client invokes ‘read()’ method which causes DFSInputStream
to establish a connection with the first DataNode with the first block of a
file. Data is read in the form of streams wherein client invokes ‘read()’
method repeatedly till it reaches the end of block.
• Once the end of a block is reached, DFSInputStream closes the connection
and moves on to locate the next DataNode for the next block
• Once the client is done with reading, it calls the close() method.
Replica Selection - Read
• Replica selection for READ operation: HDFS tries to minimize the
bandwidth consumption and latency.
• If there is a replica on the Reader node then that is preferred.
• HDFS cluster may span multiple data centers: replica in the local
data center is preferred over the remote one.

Replica – Node Failure
• NameNode detects DataNode failures
• Chooses new DataNodes for new replicas
• Balances disk usage
• Balances communication traffic to DataNodes
HDFS Data Flow : Anatomy of File Write

HDFS Write Anatomy
• A client initiates write operation by calling 'create()' method of
DistributedFileSystem object which creates a new file.
• DistributedFileSystem object connects to the NameNode using RPC call
and initiates new file creation.
– This file create operation does not associate any blocks with the file.
– It is the responsibility of NameNode to verify that the file does not exist
already and a client has correct permissions to create a new file.
– If a file already exists or client does not have sufficient permission to create
a new file, then IOException is thrown to the client.
– Otherwise, the operation succeeds and a new record for the file is created
by the NameNode.
• Once a new record in NameNode is created, an object of type
FSDataOutputStream (FSDOS) is returned to the client. A client uses it to
write data into the HDFS.
• FSDOS contains a DFSOutputStream (DFSOS) object, which takes care of the
interaction between the DataNodes and the NameNode.
– While the client continues writing data, DFSOS continues creating packets
from this data. These packets are enqueued into a queue which is called
the DataQueue.
HDFS Write Anatomy
• There is a component called DataStreamer(DS) which consumes this
DataQueue. DataStreamer also asks NameNode for allocation of new
blocks thereby picking desirable DataNodes to be used for replication.
• The process of replication starts by creating a pipeline using DataNodes.
Here replication level is 3 and hence there are 3 DataNodes in the
pipeline.
• The DataStreamer pours packets into the first DataNode in the pipeline.
• Every DataNode in the pipeline stores the packet it receives and forwards
it to the next DataNode in the pipeline.
• Another queue, 'Ack Queue' is maintained by DFSOutputStream to store
packets which are waiting for acknowledgment from DataNodes.
• Once acknowledgment for a packet in the queue is received from all
DataNodes in the pipeline, it is removed from the 'Ack Queue'.
– In the event of any DataNode failure, packets from this queue are
used to reinitiate the operation.
HDFS Write Anatomy
• After the client is done writing data, it calls
close(), which flushes the remaining data packets
to the pipeline and then waits for acknowledgment.

• Once the final acknowledgment is received, the NameNode is
contacted to tell it that the file write operation is
complete.
Robustness
• DataNode failures: a server can fail, a disk can
crash, or data can become corrupted
• Network failures: network or disk issues can corrupt
data, and a single network failure
can affect many DataNodes at a time
• NameNode failures: the NameNode host can fail, its
disk can fail, or its metadata can
become corrupted

Robustness
• Periodic Heartbeat : From DataNode to
NameNode.
• DataNodes without a recent heartbeat:
– Any new I/O is no longer directed to such a node;
it is directed instead to the nodes where the blocks of
that datanode are replicated.

Terminology
• Staging : the process of holding data in a temporary location until it
accumulates to a full block size, at which point it can be copied to a
datanode.
• Replication pipelining : while writing data to datanodes, the client
process flushes its block in small pieces (4 KB) to the first replica,
which in turn copies it to the next replica, and so on; thus data is
pipelined from one Datanode to the next.
• Heartbeats : DataNodes send a heartbeat to the NameNode once every
3 seconds (block reports are sent separately, at a longer interval).
The NameNode uses heartbeats to detect DataNode
health, i.e. working/failed.
• Safe mode : on startup the Namenode enters Safemode. Each
DataNode checks in with a Heartbeat and a BlockReport. The Namenode
verifies that each block has an acceptable number of replicas. After a
configurable percentage of safely replicated blocks check in with
the Namenode, the Namenode exits Safemode (see the dfsadmin commands below).
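Safemode can be checked or exited manually with the dfsadmin tool; a short sketch:
$ hdfs dfsadmin -safemode get
$ hdfs dfsadmin -safemode leave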
Notes - Replication
• Replication of data blocks does not occur in Safemode.
• Once the Name Node exits Safemode, it makes a list of blocks
that need to be replicated and then proceeds to replicate
these blocks to other Datanodes.

• The necessity for re-replication may arise due to:


• A Datanode may become unavailable,
• A replica may become corrupted,
• A hard disk on a Datanode may fail, or
• The replication factor on the block may be increased.

• Space Reclamation in case of Replication


• When the replication factor is reduced, the Namenode selects
excess replicas that can be deleted.
• The next heartbeat reply transfers this information to the Datanode,
which then removes those blocks and frees the space.
Notes : Datanode Failure and Space Reclamation
• Datanode failure : a network partition can cause a subset of
Datanodes to lose connectivity with the Namenode. The Namenode
detects this condition by the absence of a Heartbeat message.
The Namenode marks Datanodes without a Heartbeat and does not
send any I/O requests to them. Any data registered to the failed
Datanode is no longer available to HDFS. The death of a Datanode can
cause the replication factor of some of the blocks to fall below their
specified value.
• Space reclamation : when a file is deleted by a client, HDFS
renames (moves) the file into the /trash directory for a configurable
amount of time. A client can request an undelete within this
allowed time. After the specified time the file is deleted and the
space is reclaimed.

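A sketch of the trash behaviour from the shell, assuming trash is enabled (fs.trash.interval > 0) and reusing the /griet paths from the later examples; old.log is a hypothetical second file, and the exact trash location depends on the user running the command:
$ hadoop fs -rm /griet/input.log
$ hadoop fs -ls /user/$USER/.Trash/Current/griet
$ hadoop fs -rm -skipTrash /griet/old.log
The -skipTrash option deletes the file immediately, bypassing the trash directory.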
HDFS at a High Level

Application Programming Interface
• HDFS provides a Java API for applications to use.
• Python access is also used in many
applications.
• A C language wrapper for the Java API is also
available.
• An HTTP browser can be used to browse the
files of an HDFS instance.

FS Shell, Admin and Browser Interface
• HDFS organizes its data in files and directories.
• It provides a command line interface called the FS
shell that lets the user interact with data in the
HDFS.
• The syntax of the commands is similar to bash and
csh.
• Example: to create a directory
LFS : mkdir /foodir
HDFS: hadoop fs -mkdir /foodir
• There is also DFSAdmin interface available
• Browser interface is also available to view the
namespace.

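The DFSAdmin and browser interfaces mentioned above can be reached as follows; a sketch, where the NameNode host is an assumption and the web UI port depends on the version (50070 on Hadoop 2, 9870 on Hadoop 3):
$ hdfs dfsadmin -report
Browse the namespace at http://<namenode-host>:50070/ in a web browser.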
HDFS Commands
jps : JVM Process Status.
By typing jps we can see the daemons running on the host
machine.
Five daemon services will be running on a Hadoop machine,
namely Name Node, Data Node, Secondary Name Node,
Resource Manager and Node Manager.
Syntax : jps
To start all daemons:
Syntax : start-all.sh
To stop all daemons:
Syntax : stop-all.sh
HDFS Commands
Hadoop make a Directory:
Syntax:
hadoop fs -mkdir /directoryname
Example :
hadoop fs -mkdir /griet

List the directory in HDFS :


Syntax:
hadoop fs -ls /directoryname
Example :
hadoop fs -ls /griet
HDFS Commands
Load Data from LFS to HDFS:
Syntax: hadoop fs -put <lfsfile> </hdfs_DIR>
Example: hadoop fs -put input.log /griet

To See list of files in particular directory in HDFS:


Syntax: hadoop fs -ls </hdfs_DIR>
Example: hadoop fs -ls /griet

To see the content of the file which is located under a


particular location:
Syntax: hadoop fs -cat </hdfs_filename>
Example: hadoop fs -cat /griet/input.log
HDFS Commands
To move file from LFS TO HDFS:
Syntax:
hadoop fs -moveFromLocal <<file name>> <</hdfsdir>>
Example:
hadoop fs -moveFromLocal input.log /griet

To copy data from LFS:


Syntax:
hadoop fs -copyFromLocal <<file name>> << /hdfsdir>>
Example:
hadoop fs -copyFromLocal input.log /griet
HDFS Commands
To Move HDFS DIR1 TO HDFS DIR2
Syntax:
hadoop fs -mv <HDFS_Directory/filename> </hdfs_dir>
Example:
hadoop fs -mv /griet/input.log /grietnew

To Create Empty File in HDFS


Syntax:
hadoop fs -touchz <HDFS_Directory/filename>
Example:
hadoop fs -touchz /grietnew/abc.txt
HDFS Commands
To Get Data HDFS to LFS :
Syntax:
hadoop fs -get <source path HDFS> <destination path LFS>
Example:
hadoop fs -get /griet/input.log /home/griet/Documents

Syntax:
hadoop fs -copyToLocal <source path HDFS> <destination path LFS>
Example:
hadoop fs -copyToLocal /griet/input.log /home/griet/Documents
HDFS Commands
• Cat – Usage: hadoop fs -cat URI [URI …]
– Copies source paths to stdout.
– Example:
• hadoop fs -cat hdfs:/mydir/test_file1 hdfs:/mydir/test_file2
• hadoop fs -cat file:///file3 /user/hadoop/file4
• Chgrp - Usage: hadoop fs -chgrp [-R] GROUP URI [URI ..]
– Change group association of files.
– With -R, make the change recursively through the
directory structure.
• Chmod - Usage: hadoop fs -chmod [-R]
<MODE[,MODE]... | OCTALMODE> URI [URI …]
– Change the permissions of files.
– With -R, make the change recursively through the
directory structure.
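For example, group ownership and permissions can be changed recursively on the /griet directory used throughout these examples (a sketch; the group name hadoopusers is hypothetical):
hadoop fs -chgrp -R hadoopusers /griet
hadoop fs -chmod -R 750 /griet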
HDFS Commands
• Count – Usage: hadoop fs -count [-q] <paths>
– Count the number of directories, files and bytes under the
paths that match the specified file pattern.
– The output columns are: DIR_COUNT, FILE_COUNT,
CONTENT_SIZE, FILE_NAME.

– The output columns with -q are: QUOTA,


REMAINING_QUOTA, SPACE_QUOTA,
REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT,
CONTENT_SIZE, FILE_NAME.
– Example:
• hadoop fs -count hdfs:/mydir/test_file1 hdfs:/mydir/test_file2
• hadoop fs -count -q hdfs:/mydir/test_file1
Hdfs commands
• du – Usage: hadoop fs -du URI [URI …]
– Displays aggregate length of files contained in the directory or
the length of a file in case its just a file.
• Example:
– hadoop fs -du file:///home/hdpadmin/test_file hdfs:/mydir
• dus – Usage: hadoop fs -dus <args>
– Displays a summary of file lengths.
• expunge – Usage: hadoop fs -expunge
– Empty the Trash
• getmerge – Usage: hadoop fs -getmerge <src> <localdst>
[addnl]
– Takes a source directory and a destination file as input and
concatenates files in the source into the destination local file.
– An additional option can be set to enable adding a newline
character at the end of each file.
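A short getmerge sketch, reusing the /griet directory from the earlier examples and writing to a hypothetical local file:
hadoop fs -getmerge /griet /tmp/griet-merged.txt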
Hdfs commands
• test– Usage: hadoop fs -test -[ezd] URI
– Options:
-e check to see if the file exists. Return 0 if true.
-z check to see if the file is zero length. Return 0 if true.
-d check to see if the path is directory. Return 0 if true.
Example –
hadoop fs -test -e hdfs:/mydir/test_file
• setrep – Usage: hadoop fs -setrep [-R] [-w] <rep> <path>
– Changes the replication factor of a file (for a directory, of all files under it)
– Example:
hadoop fs -setrep -R -w 5 hdfs:/user/hadoop/dir1
HDFS command : stat
Prints statistics about the file/directory at <path> in the specified
format. The format accepts the file size in bytes (%b), group name (%g), ….
• hadoop fs -stat "%n" /tmp/messages
• hadoop fs -stat "%b %F %g %n %y %Y" /tmp/messages

• %b Size of file in bytes


• %F Will return "file", "directory", or "symlink" depending on the type of inode
• %g Group name
• %n Filename
• %o HDFS Block size in bytes ( 128MB by default )
• %r Replication factor
• %u Username of owner
• %y Formatted mtime of inode
• %Y UNIX Epoch mtime of inode

Hadoop FileSystem
• Hadoop has an abstract notion of filesystems, of which HDFS is
just one implementation.
• The Java abstract class org.apache.hadoop.fs.FileSystem
represents the client interface to a filesystem in Hadoop.
– In order to run HDFS on a machine we need to do some configuration to
set up the Hadoop filesystem:
– fs.default.name is set to hdfs://localhost/, which is used to set a default
filesystem for Hadoop (see the core-site.xml sketch below).
– Filesystems are specified by a URI, and here we have used an hdfs URI to
configure Hadoop to use HDFS by default.
• The HDFS daemons will use this property to determine the host and port
for the HDFS namenode.
– We’ll be running it on localhost, on the default HDFS port, 8020/9000. And
HDFS clients will use this property to find out where the namenode is
running so they can connect to it.
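A minimal core-site.xml sketch for such a single-machine setup; the port shown (9000) is one common choice, 8020 being the other default mentioned above:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>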
Hadoop File System Interfaces

HTTP
C
NFS
FUSE
Interfaces
• HTTP: By exposing its filesystem interface as a Java API,
Hadoop makes it awkward for non-Java applications to
access HDFS.
• The HTTP REST API exposed by the WebHDFS protocol
makes it easier for other languages to interact with HDFS.
• Hadoop provides a C library called libhdfs that mirrors the
Java FileSystem interface. You can find the header file,
hdfs.h, in the include directory of the Apache Hadoop
binary tarball distribution.
– The Apache Hadoop binary tarball comes with prebuilt libhdfs
binaries for 64-bit Linux, but for other platforms we need to
build them ourself by following the BUILDING.txt instructions
at the top level of the source tree.
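A minimal WebHDFS sketch using curl; the namenode host name is an assumption, and the HTTP port differs by version (50070 on Hadoop 2, 9870 on Hadoop 3). OPEN redirects to a DataNode, hence the -L flag:
$ curl -i "http://namenode:50070/webhdfs/v1/griet?op=LISTSTATUS"
$ curl -i -L "http://namenode:50070/webhdfs/v1/griet/input.log?op=OPEN"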
Interfaces
• NFS : It is possible to mount HDFS on a local client’s
filesystem using Hadoop’s NFSv3 gateway. You can then
use Unix utilities (such as ls and cat) to interact with the
filesystem, upload files, and in general use POSIX libraries
to access the filesystem from any programming language.

• FUSE : Filesystem in Userspace (FUSE) allows filesystems


that are implemented in user space to be integrated as
Unix filesystems. Hadoop’s Fuse-DFS contrib module
allows HDFS (or any Hadoop filesystem) to be mounted as
a standard local filesystem. Fuse-DFS is implemented in C
using libhdfs as the interface to HDFS.
Data Ingestion
• Getting data into a Hadoop cluster is a critical
issue in a big data deployment.
• Data ingestion is important in any big data
project because the volume of data is
generally very huge.
• The two major data ingestion
tools among the Hadoop ecosystem components are:
1. SQOOP
2. FLUME
Sqoop
• Apache Sqoop is a tool designed for efficiently transferring
bulk data between Hadoop (HDFS) and structured data
stores, such as an RDBMS.
– Import data from relational database tables into HDFS
• Target destination can be HDFS, Hive or HBase
• Provides options to control the format: Avro, Sequence, or delimited
text files
– Export data from HDFS into relational database tables
• Target destination is MySQL, Oracle, DB2, …
Key Features
• Parallel transfer : uses the YARN framework for efficient
data transfer and fault tolerance
• Connectors for all major RDBMSs
• Incremental load : allows parts of a table's data to be transferred
to reflect updates
• Compression : data can be compressed using gzip (or
by specifying the compression-codec argument), and
then transferred. The compressed data can
be loaded directly into Hive.
• Hive/HBase : data can be loaded directly into Hive for
data analysis and can also be dumped into HBase

Key Observations
• Uses the database to describe the schema of the data
• Uses MapReduce to import and export the data
– The import process creates a Java class
• From the table's metadata it creates a class, mapping each column to the nearest Java
datatype, and thus encapsulates each row of the imported table
– The source code of the class is provided
• Can help to quickly develop MapReduce applications that use HDFS-
stored records
• Transfers data between Hadoop and relational databases
– Uses JDBC/ODBC drivers
• To establish a connection for import and export between structured
sources and Hadoop.
– Must copy the JDBC driver JAR files for any relational
databases to $SQOOP_HOME/lib
• E.g. for MySQL/DB2, the target RDBMS-specific connector JAR file should be
part of Sqoop's installed lib directory.
Import
• Imports data from relational tables into HDFS
– Each row in the table becomes a separate record in HDFS
• Data can be stored
– --target-dir <dir>   HDFS destination directory
• If omitted, the name of the directory is the name of the table
– Text files, binary files
• --as-avrodatafile, --as-sequencefile, --as-textfile
• --fields-terminated-by <char>, --lines-terminated-by <char>
– Into HBase
• --hbase-create-table
– Into Hive
• --hive-import
• Imported data
– Can be all rows of a table
– Can limit the rows and columns
– Can specify your own query to access relational data
Sqoop Connection
• Database connection requirements are same for
import and export
• sqoop import/export --connect
jdbc:db2://your.db2.com:50000/yourDB
--username db2user --password yourpassword …
• sqoop import --connect
jdbc:mysql://localhost:3306/db --username foo
--password admin --table Test
• sqoop import --connect jdbc:db2://your.db2.com:50000/yourDB \
--username db2user --password db2password \
--table db2table --target-dir sqoopdata
Sqoop Commands
• List databases:
$ sqoop list-databases --connect jdbc:mysql://localhost:3306/ \
--username root --password root
• List tables:
$ sqoop list-tables --connect jdbc:mysql://localhost:3306/mydb \
--username root --password root
• Eval: allows users to preview their Sqoop import queries to
ensure they import the data they expect.
$ sqoop eval --connect jdbc:mysql://localhost:3306/mydb \
--username root --password root --query "select * from emp limit 5"
Data Ingestion Commands
• sqoop import --connect jdbc:mysql://localhost:3306/mydb \
--username root --password root --table emp
• sqoop import --connect jdbc:mysql://localhost:3306/mydb \
--username root --password root --table emp \
--target-dir /home/sqoop_data
Options import
1. --target-dir   import data to the selected directory
2. --split-by tbl_primarykey
sqoop import --connect jdbc:mysql://localhost:3306/mydb --username root \
--password root --table emp --target-dir /home/sqoop_data \
--fields-terminated-by '|' --where 'sal > 15000' --split-by eid

3. --columns   import selected columns ("column names")
--columns "empno,empname,salary"
4. --where   column-name condition
--where "salary > 40000"
5. --fields-terminated-by <sep>   (e.g. ',' or '|')
6. --query 'SELECT e.empno, e.empname, d.deptname FROM
employee e JOIN department d on (e.deptnum = d.deptnum)'
7. -m <n>   number of mappers
sqoop import --connect jdbc:mysql://localhost:3306/mydb --username root \
--password root --table emp --target-dir /home/sqoop_data \
--fields-terminated-by '|' --where 'sal > 15000' -m 3
Options import
8. Importing data with compression techniques
sqoop import --connect jdbc:mysql://localhost:3306/mydb \
--username root --password root --table emp \
--compression-codec GZipCodec -m 2 --target-dir /home/Compre_sqoop_data

9. sqoop import-all-tables …  imports all tables into HDFS

10. Import data into different formats
--as-textfile
--as-avrodatafile
--as-sequencefile
Ex. sqoop import --connect jdbc:mysql://localhost:3306/mydb \
--username root --password root --table emp --as-textfile \
--target-dir /home/sqoop_data
Import to HIVE
sqoop import --connect
jdbc:mysql://localhost:3306/mydb --username
root --password root --table emp --hive-table
mysqltohive --create-hive-table --hive-import \
-m 1
Sqoop Export
• Exports a set of files from HDFS to a relational database
system
– Table must already exist
– Records are parsed based upon user's specifications
• Default mode is insert
– Inserts rows into the table
• Update mode
– Generates update statements
– Replaces existing rows in the table
– Does not generate an upsert
• Missing rows are not inserted
– Not detected as an error
• Call mode
– Makes a stored procedure call for each record
– --export-dir
• Specifies the directory in HDFS from which to read the data
export
• Basic export from files in a directory to a table
– sqoop export --connect jdbc:db2://your.db2.com:50000/yourDB \
--username db2user --password db2password --table employee \
--export-dir /home/sqoop_data

• Example calling a stored procedure
– sqoop export --connect jdbc:db2://your.db2.com:50000/yourDB \
--username db2user --password db2password --call empproc \
--export-dir /home/sqoop_data

• Example updating a table
– sqoop export --connect jdbc:db2://your.db2.com:50000/yourDB \
--username db2user --password db2password --table employee \
--update-key empno --export-dir /home/sqoop_data
Flume
• What is Flume
• Flume Overview
• Flume Architecture
• Building Blocks of Flume
What is Flume
• A mechanism to collect, aggregate and move large
amounts of streaming data into HDFS.
• Primarily designed for log aggregation from across
servers into a central point on HDFS.
Flume Overview
• Streaming data ingestion into HDFS
Apache Flume is an open source, powerful, reliable and flexible
system for efficiently collecting, aggregating and moving large
amounts of unstructured data from multiple data sources into
HDFS/HBase.
1. Designed to capture data as it is generated.
2. Channel it to HDFS for storage and subsequent processing.
Flume Architecture
Building Blocks of Flume
• Event : a basic unit of data that needs to be
transferred from source to destination, e.g. a log
record, an Avro object, etc. Normally around ~4 KB.
• The external sources send events to the Flume source in a
format that is recognized by the target source.
• Agent : a JVM process that receives events from
clients or from another Flume agent and passes them to a
destination or to other agents.
• A Flume agent contains 3 main components:
– Source
– Channel
– Sink
Flume Agent - Source
• The source is responsible for sending the event to the
channel it is connected to.
• A Flume source receives events and transfers them to one
or more channels. The channel acts as a store which
keeps the event until it is consumed by the Flume
sink.
• May have logic for reading data, translating it to events, and
handling failures. It has no control over how the event is
stored in the channel.
• Flume supports Netcat, Twitter, exec, Avro, sequence-file
generator, TCP, UDP, and protocol buffers as sources
of data.
Flume Agent - Channel
• Channel : receives the data or events from the Flume
source and buffers them till the sinks consume them.
• Connects the sources and the sinks.
• The writing rate of the sink should be faster than the
ingest rate from the sources.
• A ChannelException might lead to data loss.
• A durable channel is a must for recoverability.
• The channel acts as a buffer with configurable capacity (see the
configuration sketch below).
• Flume supports different types of channels.
• Example channels : in-memory, or persistent (file / file system or
database (Derby)) channels, etc.
• A file channel uses the local file system to store these events.
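A hedged sketch of how a channel's type and capacity might be declared in the agent configuration; the property names below are the standard memory-channel ones, and agent1/channel1 follow the naming used in the configuration files later in this section:
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 1000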
Flume Agent - Sink
Sink:
• Receives events from a channel and stores them into the
destination. The destination can be a centralized store like
HDFS or other Flume agents.
• Waits for events from the configured channel.
• Responsible for sending the event to the desired destination.
• Manages issues like timeouts and retries.
• Can set up sink groups (a group of prioritized sinks to manage
sink failures).
• As long as one sink in the group is available the agent will
function.
• There can be multiple Flume agents, in which case the Flume
sink forwards the event to the Flume source of the next Flume
agent in the flow.
Flume Architecture
• Events generated by external sources are consumed
by Flume data sources.
• The external sources send events to the Flume source
in a format that is recognized by the target source.
• The Flume source receives an event and stores it into
one or more channels. The channel acts as a store
which keeps the event until it is consumed by the
Flume sink.
• The channel may use the local file system in order to
store these events.
Flume Agent and Data Flow
• A source in Flume captures events and delivers them to
the channel, which stores the events until they are
forwarded to the sink.
Consolidation
Replicating and Multiplexing
Configuration
• Flume components are defined in a configuration file
– Multiple agents running on the same node can be defined in
the same configuration file
• For each agent, you define the components
– The source(s)
– The sink(s)
– The channel(s)
• Then define the properties for each component
• Then define the relationships between the components
• The configuration file resembles a Java properties format
(Figure: Flume Sources, Flume Channel, Flume Sink)
Spool-to-logger.properties
# Naming the components of the agent: source, sink and channel
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Describing/configuring the source: source type and spool directory
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir
# Describing/configuring the sink type
agent1.sinks.sink1.type = logger
# Describing/configuring the channel
agent1.channels.channel1.type = file
# Binding the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
Spool-to-hdfs.properties
# Naming the components of the agent: source, sink and channel
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Describing/configuring the source: source type and spool directory
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir
# Describing/configuring the sink type and HDFS destination
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /tmp/flume
agent1.sinks.sink1.hdfs.filePrefix = events
agent1.sinks.sink1.hdfs.fileSuffix = .log
agent1.sinks.sink1.hdfs.inUsePrefix = _
agent1.sinks.sink1.hdfs.fileType = DataStream
# Describing/configuring the channel
agent1.channels.channel1.type = file
# Binding the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
Flume execution steps
1. mkdir /tmp/spooldir
2. hadoop fs -mkdir /tmp/flume
3. Create the spool-to-logger.properties file
4. Save this file to the flume/conf directory
5. The flume path is obtained by typing cd $BIGINSIGHTS_HOME/flume (OR)
by typing the following command the path can be reached:
cd /opt/ibm/biginsights/flume/conf
Note: the spool-to-logger.properties file should be placed in this path.
6. bin/flume-ng agent -n agent1 -f conf/spool-to-logger.properties \
-Dflume.root.logger=INFO,console
7. Open a new terminal
8. cat > test.txt
....
....
Ctrl+D
9. cp test.txt /tmp/spooldir
10. As soon as the file is added, it gets processed.
ls /tmp/spooldir
We will see test.txt.COMPLETED
Flume v/s Sqoop
Flume:
• The data source is streaming data (e.g. log data).
• Used for collecting and aggregating data, i.e. log data (e.g. from web servers).
Sqoop:
• The source is any kind of RDBMS.
• Sqoop transfers the data in parallel while making the connection.