
HDFS

Big Data Challenges
• Storage (storing big data is difficult due to its massive volume and variety)
• Processing (processing takes longer due to the huge volume of data)

Solution : Hadoop HDFS
• Storage problem → solved by HDFS, Hadoop's distributed storage layer.
• Processing problem → solved by Hadoop's processing layer (MapReduce), which works on the data stored in HDFS.

Vertical Scaling
(Increase the size of the instance on the same machine – more RAM, CPU, etc.)

Example: a single machine holding 10TB on its local file system.

Horizontal Scaling
(Add more instances)

Example: four machines of 10TB each, combined into a Distributed File System (HDFS).

What is Hadoop HDFS
HDFS is a file system specially designed for storing massive amounts of data on commodity hardware.

Advantages of HDFS:
1. Distributed storage for Big Data.
2. Cost-effective, as it uses commodity hardware.
3. Fault-tolerant, as copies of the data are kept – Replication.
4. Secure, as HDFS provides access control through file permissions.

HDFS Components
HDFS follows a Master–Slave architecture.

NameNode (Master Daemon)
• Manages the DataNodes.
• Only one active NameNode at a time.
• Runs on enterprise-grade hardware.
• Stores all metadata.
• Regulates clients' access to files.
• Executes file system operations – renaming, opening, closing files, etc.

DataNode (Slave Daemon)
• Follows NameNode instructions for block creation, deletion, replication, etc.
• Stores the actual data in blocks.
• A cluster can have multiple DataNodes.
• Runs on commodity hardware.
• Performs read/write operations as per client requests.

MetaData

NameNode Metadata (file locations, block size, replication, etc.)
• editlog – keeps track of ONLY the recent changes made on HDFS.
• fsimage – keeps track of every change made on HDFS since the beginning.

Secondary NameNode
• Maintains copies of the editlog and fsimage (editlog.copy, fsimage.copy).
• Combines the two to get an updated version of the fsimage (fsimage.ckpt).
• Creates these checkpoints periodically – every 1 hour by default; the interval can be changed.
• Updates the NameNode with the new fsimage.
• Note: despite the name, it is a checkpointing helper, not an automatic standby that takes over when the NameNode fails.

HDFS Data Blocks
• HDFS splits massive files into smaller pieces – called Data Blocks.
• Default block size: 128MB in Hadoop 2.x, 64MB in Hadoop 1.x.
• Can be configured in /opt/hadoop/etc/hadoop/hdfs-site.xml.
• Block size too small – too many data blocks, hence lots of metadata, which causes a lot of overhead.
• Block size too large – the processing time of each block increases.

Example: File.txt (400MB) is split into four blocks:

Block A (128 MB) | Block B (128 MB) | Block C (128 MB) | Block D (16 MB)
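A quick sketch of inspecting and overriding the block size from the CLI (dfs.blocksize is the Hadoop 2.x property name; the file name and values here are illustrative):

### Show the block size currently in effect (in bytes)
hdfs getconf -confKey dfs.blocksize          # e.g. 134217728 = 128MB

### Override the block size for a single copy instead of editing hdfs-site.xml
hadoop fs -D dfs.blocksize=67108864 -put File.txt practice/    # 64MB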

Replication

Without replication, each block exists on only one DataNode – if that node fails, its block is lost:

DataNode 1: Block A | DataNode 2: Block B | DataNode 3: Block C | DataNode 4: Block D

Replication
Default replication factor = 3. Can be set in hdfs-site.xml (property dfs.replication).

With replication, every block is stored on three different DataNodes:

DataNode 1: Blocks A, D, B | DataNode 2: Blocks B, A, C | DataNode 3: Blocks C, D, B | DataNode 4: Blocks D, A, C
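A small sketch of checking and overriding the replication factor from the CLI (the practice/ paths assume the data loaded in the exercises later in this deck):

### Show the configured default replication factor
hdfs getconf -confKey dfs.replication        # e.g. 3

### Write a file with a non-default replication factor
hadoop fs -D dfs.replication=2 -put sample.txt practice/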

Rack Awareness
• A concept that helps decide on which DataNode a copy of a data block should be stored.
• A rack is a collection of 30–40 DataNodes.
• Not all replicas of a block should be created on the same rack – below, the replicas of Block A span Rack 1 and Rack 2.

Rack 1: DN1 (Block A), DN2 (Block C), DN3 (Block C), DN4
Rack 2: DN5 (Block B), DN6 (Block A), DN7, DN8 (Block A)
Rack 3: DN9 (Block B), DN10 (Block C), DN11 (Block B), DN12

Exercise - HDFS Read Mechanism
1. open() – the client opens the file through the DistributedFileSystem object.
2. RPC – the DistributedFileSystem asks the NameNode for the locations of blocks A and B.
3. The NameNode authenticates the client.
4. The NameNode provides the block locations (IP addresses of the DataNodes holding blocks A and B, plus a token).
5. read() – the client reads the data directly from the DataNodes through an FSDataInputStream.
6. close() – the client closes the stream when done.

Data layout in this example:
DataNode 1: Blocks B, A, C | DataNode 2: Blocks B, A, C | DataNode 3: Blocks C, D, B
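The same flow can be driven from the CLI – any read command goes through open()/read()/close() under the hood (paths assume the practice data loaded in the next exercises):

hadoop fs -cat practice/retail_db/orders/part-00000 | head -5
hadoop fs -get practice/retail_db/orders/part-00000 /tmp/orders_copy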


Exercise – (HDFS CLI Help Commands)
Hadoop provides a command-line interface (CLI) to interact with HDFS.

✓ List all HDFS Commands.


hadoop fs or hdfs dfs

✓ Basic usage for any command.


hadoop fs -usage ls

✓ Full detailed information for any command.


hadoop fs -help ls

Exercise (Get Data from GitHub to HDFS)
### Get data files from GitHub to our Unix System
git clone https://github.com/sibaramKumar/dataFiles

### Go into the cloned folder


cd dataFiles

### Unzip the Files


sudo apt install unzip
unzip SalesData.zip
ls -lrt
rm SalesData.zip

### Go back to the parent folder (the HDFS copy below refers to dataFiles/*)
cd ..

### Create a Folder at HDFS


hadoop fs -mkdir -p practice/retail_db/

### Copy the Files from Local to HDFS


hadoop fs -put dataFiles/* practice/retail_db/

Exercise (List/Sort Files in HDFS)
✓ Command – ls
✓ Use a pattern to select files.
✓ -R : Recursively list the contents of directories.
✓ -C : Display the paths of files and directories only.

By default, ls sorts entries in ascending order by name.

✓ ls -r : Reverse the order of the sort.
✓ ls -S : Sort files by size.
✓ ls -t : Sort files by modification time (most recent first).

See the example runs below.
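Illustrative runs against the practice data (the extra -h flag prints human-readable sizes):

hadoop fs -ls practice/retail_db/or*        # select by pattern
hadoop fs -ls -R practice                   # recursive listing
hadoop fs -ls -C practice/retail_db         # paths only
hadoop fs -ls -S -h practice/retail_db      # largest files first
hadoop fs -ls -t practice/retail_db         # most recently modified first
hadoop fs -ls -t -r practice/retail_db      # oldest first (reversed)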

Exercise (Creating/Deleting Directories in HDFS)
✓ Command : rmdir
• Removes a directory only if it is empty.
• --ignore-fail-on-non-empty : Suppress error messages if the folder you are trying to remove is not empty.
✓ Command : rm
• rm removes files.
• With option -r, recursively deletes directories.
• With option -skipTrash, bypasses the trash, if enabled.
• With option -f, no error message even if the file does not exist. Check with the Unix variable $? – it returns the exit status of the last command: 0 means success, non-zero means failure.
✓ Command : mkdir
• Creates a directory.
• -p : Do not fail if the directory already exists; also lets us create nested folders in one go.

Example commands below.
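A short sketch combining the three commands (paths are illustrative):

### Create nested directories in one go
hadoop fs -mkdir -p practice/tmp/a/b

### rmdir removes only empty directories
hadoop fs -rmdir practice/tmp/a/b

### Recursive, forced delete that bypasses the trash
hadoop fs -rm -r -f -skipTrash practice/tmp
echo $?        # 0 on success, non-zero on failure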

Exercise - Copy from HDFS to Local
✓ Command - copyToLocal or get
hadoop fs -get practice/retail_db/orders .

✓ Error if the destination path already exists. To overwrite, use the -f flag.


hadoop fs -get practice/retail_db/orders .
hadoop fs -get -f practice/retail_db/orders .

✓ -p flag to preserve access and modification times, ownership and the mode.
hadoop fs -get -p practice/retail_db/orders .

✓ To copy only the files, without the enclosing folder, use a pattern.


hadoop fs -get practice/retail_db/orders/* .

✓ When copying multiple files, the destination must be a directory.


mkdir copyHere
hadoop fs -get practice/retail_db/orders/* practice/sample.txt copyHere

Exercise - Copy data from Local to HDFS
✓ Command - copyFromLocal or put
hadoop fs -mkdir -p practice/retail_db
hadoop fs -put dataFiles/* practice/retail_db/

hadoop fs -mkdir -p practice/retail_db1


hadoop fs -put dataFiles practice/retail_db1/   # Creates a subfolder dataFiles under retail_db1

✓ Error if the destination path already exists. To overwrite, use the -f flag.


hadoop fs -put -f dataFiles/* practice/retail_db/

✓ -p flag to preserve timestamps, ownership and the mode.


hadoop fs -put -p dataFiles/* practice/retail_db/

✓ We can also copy multiple files.


hadoop fs -put -f dataFiles/* sample.txt practice/retail_db/

Showing Data in HDFS
✓ Command – head
• Shows the first 1KB of the file.
✓ Command – tail
• Shows the last 1KB of the file.
• Option -f shows appended data as the file grows.
✓ Command – cat
• Fetches the whole file.
• Not recommended for big files.
✓ First 10 lines
hadoop fs -cat practice/retail_db/orders/part-00000 | head -10

✓ Last 10 lines
hadoop fs -cat practice/retail_db/orders/part-00000 | tail -10
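Direct forms of the commands described above (hadoop fs -head assumes a recent Hadoop release; -tail and -tail -f have been available much longer):

hadoop fs -head practice/retail_db/orders/part-00000      # first 1KB
hadoop fs -tail practice/retail_db/orders/part-00000      # last 1KB
hadoop fs -tail -f practice/retail_db/orders/part-00000   # follow appended data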

Knowing Statistics in HDFS
✓ Command – stat
• Prints statistics for any file/directory.
✓ default or %y - Modification Time
✓ %b - File Size in Bytes
✓ %F - Type of Object
✓ %o - Block Size
✓ %r - Replication
✓ %u - User Name
✓ %a - File Permission in Octal
✓ %A - File Permission in Symbolic

Examples below.
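A sketch of stat with and without a format string (the path assumes the practice data):

### Default format (%y): modification time
hadoop fs -stat practice/retail_db/orders/part-00000

### Combine several specifiers in one format string
hadoop fs -stat "type=%F size=%b block=%o rep=%r user=%u perm=%a" practice/retail_db/orders/part-00000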

Knowing Storage in HDFS
✓ Command – df
• Shows the capacity, free and used space of the HDFS file system.
• hadoop fs -df
• -h : Human-readable format
✓ Command – du
• Shows the amount of space, in bytes, used by the files that match the specified file pattern.
• hadoop fs -du practice/retail_db
• -h : Human-readable format
• -v : Display a header line
• -s : Summarize to a total size
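Illustrative runs combining the flags above:

hadoop fs -df -h                           # whole-cluster capacity/used/free
hadoop fs -du -h -v practice/retail_db     # per-entry sizes with a header
hadoop fs -du -s -h practice/retail_db     # one summarized total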

File Metadata
✓ Command – fsck
✓ Even if a file is smaller than 128MB, it still occupies one block (though a block only consumes as much disk space as the data it holds).
✓ Help: hadoop fsck -help
### Print the fsck high-level report
hadoop fsck practice/retail_db

### -files → Print a detailed file-level report


hadoop fsck practice/retail_db -files

### -files -blocks → Print a detailed file and block report


hadoop fsck practice/retail_db -files -blocks

### -files -blocks -locations → Print out locations for every block
hadoop fsck practice/retail_db -files -blocks -locations

### -files -blocks -racks → Print out the network topology for DataNode locations
hadoop fsck practice/retail_db -files -blocks -racks

HDFS File Permission
• HDFS file permissions are similar to Linux file permissions, with separate bits for User, Group and Others.

• For an HDFS file,


r → Read permission
w → Write or append
x → No meaning in HDFS (files are not executable)

• For an HDFS directory,


r → Able to list the contents
w → Able to create or delete entries within the directory
x → Able to access a child of the directory

• Change permissions using -chmod
• Octal mode: -chmod 755
hadoop fs -chmod 755 practice/retail_db/orders/part-00000

• Symbolic mode: -chmod g+w


hadoop fs -chmod g+w practice/retail_db/orders/part-00000
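To verify a change, read the permissions back (a small sketch reusing commands from earlier slides):

hadoop fs -ls practice/retail_db/orders/part-00000         # long listing shows e.g. -rwxrwxr-x
hadoop fs -stat %A practice/retail_db/orders/part-00000    # symbolic permissions only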

HDFS Override Properties
✓ 1. Change the properties in hdfs-site.xml or core-site.xml (cluster-wide defaults).

✓ 2. Override the default properties while copying the files into HDFS (-D or -conf).

✓ 3. Change properties after copying the files into HDFS (-setrep for changing replication).
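A minimal sketch of options 2 and 3 (file names and values are illustrative):

### 2. Per-command override via the generic -D option
hadoop fs -D dfs.replication=2 -D dfs.blocksize=67108864 -put sample.txt practice/

### 3. Change the replication factor of data already in HDFS; -w waits until done
hadoop fs -setrep -w 2 practice/sample.txt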
