[Figure: Storage Problem and Processing Problem, shown with 10TB data volumes; Distributed File System (HDFS)]
Advantages of HDFS:
1. Distributed storage for Big Data.
2. Cost-effective, as it uses commodity hardware.
3. Fault-tolerant, as copies of the data are available through replication.
4. Secure, as it provides data security.
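The trade-off behind points 2 and 3 can be put in numbers: every extra copy costs raw disk. The sketch below is only an illustration. It assumes a replication factor of 3 (the usual `dfs.replication` default) and reuses the 10TB figure from the slide; `raw_storage_tb` is a hypothetical helper, not a Hadoop command.

```shell
#!/usr/bin/env bash
# Raw cluster storage consumed when HDFS keeps several copies of each block.
# Usage: raw_storage_tb <logical_size_tb> <replication_factor>
raw_storage_tb() {
  local logical_tb=$1 replication=$2
  echo $(( logical_tb * replication ))
}

# Assumption: replication factor 3; the 10TB size comes from the slide.
echo "10TB of data at replication 3 needs $(raw_storage_tb 10 3)TB of raw disk"
```

With three copies, the data stays readable after losing any two of the nodes that hold a given block, at the price of three times the raw disk.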
Data Node
• Slave daemon
• Follows NameNode instructions for block creation, deletion, replication, etc.
• Stores the actual data in blocks
• A cluster can have multiple Data Nodes
• Runs on commodity hardware
• Performs read-write operations as per client request.
editlog.copy: keeps track of only the recent changes made on HDFS
fsimage.copy: keeps track of every change made on HDFS since the beginning
fsimage.ckpt
File.txt 400MB
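How a 400MB file such as File.txt is cut into blocks can be sketched with simple arithmetic. This is a rough illustration and it assumes a 128MB block size (the `dfs.blocksize` default in Hadoop 2 and later). Your cluster may be configured differently. `block_sizes` is a hypothetical helper, not a Hadoop command.

```shell
#!/usr/bin/env bash
# Print the size of each block a file would occupy, one block per line.
# Usage: block_sizes <file_size_mb> <block_size_mb>
block_sizes() {
  local remaining=$1 block=$2
  while [ "$remaining" -gt 0 ]; do
    if [ "$remaining" -ge "$block" ]; then
      echo "$block"                      # a full block
      remaining=$(( remaining - block ))
    else
      echo "$remaining"                  # the last, partial block
      remaining=0
    fi
  done
}

# Assumption: 128MB block size; 400MB is the file size from the slide.
block_sizes 400 128
```

Under that assumption the file occupies three full 128MB blocks and one 16MB block. The last block only takes as much disk as the data it holds.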
[Figure: HDFS file read through FSDataInputStream; labeled steps include Read() and Close()]
✓ The -p flag preserves access and modification times, ownership and the mode.
hadoop fs -get -p practice/retail_db/orders .
✓ Last 10 lines of a file
hadoop fs -cat practice/retail_db/orders/part-00000 | tail -10
✓ Command – du
• Show the amount of space, in bytes, used by the files that match the specified file pattern.
• hadoop fs -du practice/retail_db
• -h : Human-readable format
• -v : Displays a header line
• -s : Summary of the total size
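The flags above can be combined. The sketch below shows two common combinations on the same practice/retail_db path. `run_or_print` is a hypothetical helper written for this example (not part of Hadoop) that only prints the command when the tool is not installed, so the snippet can be tried safely on a machine without a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the command if its executable is on PATH,
# otherwise print what would have been run.
run_or_print() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "would run: $*"
  fi
}

# Size of each entry directly under the directory, in human-readable units.
run_or_print hadoop fs -du -h practice/retail_db

# One summarized total for the whole directory, in human-readable units.
run_or_print hadoop fs -du -s -h practice/retail_db
```

The -v header flag may not be available on older Hadoop releases, so it is left out here.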
### -files -blocks -locations → Print out locations for every block
hadoop fsck practice/retail_db -files -blocks -locations
### -files -blocks -racks → Print out network topology for data-node locations
hadoop fsck practice/retail_db -files -blocks -racks
✓ 2. Override the default properties while copying the files into HDFS (-D or -conf)
✓ 3. Change the properties after copying the files into HDFS (-setrep for changing replication)
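Options 2 and 3 might look like the sketch below. The local file name `orders.csv` and the replication value 2 are assumptions chosen only for illustration. `run_or_print` is a hypothetical helper (not part of Hadoop) that prints the command instead of running it when the tool is not installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the command if its executable is on PATH,
# otherwise print what would have been run.
run_or_print() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "would run: $*"
  fi
}

# Assumption: orders.csv is a placeholder for any local file.
LOCAL_FILE=orders.csv

# Option 2: override dfs.replication for this one copy only.
# The generic -D option goes before the -put subcommand.
run_or_print hadoop fs -D dfs.replication=2 -put "$LOCAL_FILE" practice/retail_db/

# Option 3: change replication of files already in HDFS.
# -w waits until the new replication level has been reached.
run_or_print hadoop fs -setrep -w 2 practice/retail_db/orders
```

Option 2 affects only the files written by that command, while option 3 rewrites the replication factor of existing files and, on a directory, applies to every file under it.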