COMMANDS.
Theory:
The Hadoop Distributed File System (HDFS) is based on a distributed file system design and runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault tolerant and is designed to work with low-cost hardware.
Features of HDFS
HDFS Architecture:
The architecture of the Hadoop Distributed File System is described below.
HDFS follows a master-slave architecture, and its main components are as follows:
1. NameNode is the master of the system. It maintains the namespace, i.e. the directories and files, and manages the blocks that are stored on the DataNodes.
2. DataNodes are the slaves which are deployed on each machine and provide the actual
storage. They are responsible for serving read and write requests for the clients.
3. Blocks are the units in which HDFS stores files. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later), and it can be changed as needed in the HDFS configuration (the dfs.blocksize property).
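As a quick orientation, a few standard HDFS shell commands (assuming a running cluster with the `hdfs` client on the PATH; the file path below is illustrative) show the NameNode/DataNode split and the configured block size:

```shell
# List the DataNodes registered with the NameNode, with capacity and usage
hdfs dfsadmin -report

# Print the configured block size in bytes (dfs.blocksize)
hdfs getconf -confKey dfs.blocksize

# Show how a file's blocks are distributed and replicated across DataNodes
hdfs fsck /user/hadoop/sample.txt -files -blocks -locations
```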
Goals of HDFS
1. Fault detection and recovery: Since HDFS runs on a large number of commodity machines, component failure is frequent. HDFS therefore needs mechanisms for quick, automatic fault detection and recovery.
2. Huge datasets: HDFS should scale to hundreds of nodes per cluster to support applications with huge datasets.
3. Hardware at data: A requested task can be executed efficiently when the computation takes place near the data. Especially where huge datasets are involved, moving computation to the data reduces network traffic and increases throughput.
Key tools in the Hadoop ecosystem include:
1. MapReduce: This framework divides the input into small pieces, distributes them across
many machines in the cluster, and combines the outputs from all machines into one result.
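Purely for illustration, the same divide, group, and combine pattern can be mimicked with a Unix pipeline: `tr` plays the map step (emit one word per line), `sort` the shuffle (group identical keys), and `uniq -c` the reduce (count each group) — the classic word-count example:

```shell
# Map: split text into one word per line
# Shuffle: sort brings identical words together
# Reduce: uniq -c counts each group; sort -rn ranks by count
printf 'big data big cluster data big\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
# → 3 big / 2 data / 1 cluster
```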
2. Pig: This tool allows you to write scripts in a language called Pig Latin that can be
used to query large datasets stored in Hadoop Distributed File System (HDFS).
3. Hive: This tool allows users to store data in tables similar to those in SQL
databases; however, the data is stored as files on HDFS rather than in a relational
database management system (RDBMS).
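As a sketch of this table-over-HDFS idea (assuming a working Hive installation; the `employees` table and its columns are hypothetical), queries can be run from the shell with `hive -e`:

```shell
# Create a Hive table whose rows are stored as delimited files on HDFS
hive -e "CREATE TABLE IF NOT EXISTS employees (id INT, name STRING, city STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"

# Query it with familiar SQL syntax; Hive reads the underlying HDFS files
hive -e "SELECT city, COUNT(*) FROM employees GROUP BY city;"
```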
The Hadoop ecosystem architecture is made up of four main components: data storage,
data processing, data access, and data management.
1. Data Storage : The first layer of the Hadoop ecosystem is where all your
raw data is stored. It could be on local hard drives or in the cloud.
2. Data Processing: The second phase of the Hadoop ecosystem in Big Data involves
analyzing your data and transforming it into something meaningful that can be used for
further analysis.
3. Data Access : In this third phase of the Hadoop ecosystem, you can use tools like
Hive or Pig to query your data sets and perform actions like filtering out specific rows,
sorting them by certain columns or values within them (such as location or birthdate), etc.
4. Data Management : Finally, the last phase of the Hadoop ecosystem architecture
involves taking all the work we've done on data sets in previous phases and storing it
safely somewhere so we can return to it later if needed.
Hadoop and its ecosystem include many tools for data processing and analysis. Some of
these tools are used to collect data from various sources, while others are used to store
and analyze the data.
1. Oozie – Workflow Monitoring: Oozie is a workflow management system that allows users
to monitor and control workflows. It can be used to automate tasks for a variety of
purposes, including data processing, system administration, and debugging.
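A typical Oozie interaction from the command line (assuming a running Oozie server at the URL shown, and a `job.properties` file pointing at a workflow definition on HDFS; the job id is illustrative) looks like:

```shell
# Submit and start the workflow described by job.properties
oozie job -oozie http://localhost:11000/oozie -config job.properties -run

# Check the status of a workflow by its job id
oozie job -oozie http://localhost:11000/oozie -info 0000001-200101000000000-oozie-W
```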
5. Hive – SQL: Hive is a data warehouse system for Hadoop that allows users to query
data using an SQL-like language (HiveQL). It can also be used to create and modify
tables and views, grant privileges to users, and so on.
6. Pig – Dataflow: Pig is a high-level language for writing data transformation programs.
It provides a way to express data analysis programs in a step-by-step dataflow style,
close to how people describe their work. Pig programs are compiled into MapReduce jobs
that run on the Hadoop infrastructure.
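As an illustrative sketch of that dataflow style (assuming Pig is installed and an `input.txt` exists; all file names here are placeholders), a small Pig Latin word-count script can be written and run from the shell:

```shell
# Write a small Pig Latin word-count script
cat > wordcount.pig <<'EOF'
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;
EOF

# Run it; omit '-x local' to compile and execute as MapReduce jobs on the cluster
pig -x local wordcount.pig
```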
13. To copy a file or folder from the local file system to the HDFS store, use hdfs dfs -put (or hdfs dfs -copyFromLocal).
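For example (paths are illustrative, assuming the `hdfs` client is configured for the cluster):

```shell
# Copy a single local file into an HDFS directory
hdfs dfs -put localfile.txt /user/hadoop/

# -copyFromLocal is equivalent for local sources; -f overwrites an existing target
hdfs dfs -copyFromLocal -f localdir /user/hadoop/localdir

# Verify the copy
hdfs dfs -ls /user/hadoop/
```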
CONCLUSION: This lab manual offers a practical and comprehensive exploration of the
Hadoop ecosystem, covering HDFS fundamentals, an overview of ecosystem tools, installation,
file operations, and programming exercises. Through these hands-on experiences, I have
acquired essential skills for proficiently using Hadoop in a variety of data processing tasks.