
AIM: TO STUDY THE HADOOP ECOSYSTEM AND TO DEMONSTRATE BASIC HADOOP COMMANDS.

Theory:
The Hadoop Distributed File System (HDFS) was developed using distributed file system design principles. It runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault tolerant and is designed to work with low-cost hardware.

Features of HDFS

1. It is suitable for distributed storage and processing.
2. Hadoop provides a command interface to interact with HDFS.
3. The built-in servers of the NameNode and DataNode help users easily check the status of the cluster (see the example after this list).
4. It provides streaming access to file system data.
5. HDFS provides file permissions and authentication.
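
For example, the command interface and the NameNode status servers mentioned above can be exercised as follows (a minimal sketch, assuming a running Hadoop 2.x/3.x cluster with the hdfs binary on the PATH; default web UI ports differ between versions):

hdfs dfs -ls /            # list the HDFS root through the command interface
hdfs dfsadmin -report     # summary of live/dead DataNodes, capacity and usage
# The NameNode web UI also shows cluster status
# (http://<namenode-host>:50070 on Hadoop 2.x, :9870 on Hadoop 3.x)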

HDFS Architecture:
HDFS follows a master-slave architecture, and its main components are described below:

1. NameNode is the master of the system. It maintains the namespace, i.e. the directories and files, and manages the blocks that are present on the DataNodes.

2. DataNodes are the slaves, deployed on each machine, that provide the actual storage. They are responsible for serving read and write requests from clients.

3. Secondary NameNode is responsible for performing periodic checkpoints. In the event of a NameNode failure, the NameNode can be restarted using the latest checkpoint.

4. Block is the unit in which HDFS stores data. The default block size is 64 MB (128 MB in Hadoop 2.x and later), but it can be changed as needed in the HDFS configuration (see the sketch below).
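
As a hedged illustration of changing the block size (assuming Hadoop 2.x or later, where the property is dfs.blocksize; the file and path names are placeholders), the block size can be overridden per file and the resulting blocks inspected:

hdfs dfs -D dfs.blocksize=268435456 -put bigfile.dat /user/hadoop/   # write with a 256 MB block size
hdfs fsck /user/hadoop/bigfile.dat -files -blocks -locations         # show how the file was split into blocks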

Goals of HDFS
1. Fault detection and recovery: Since HDFS consists of a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery (a sketch of checking file system health is shown after this list).

2. Huge datasets: HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.

3. Hardware at data: A requested task can be carried out efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
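
As an illustration of goal 1, HDFS ships with a file system checking utility; the following minimal sketch (assuming a running cluster) shows how block health would be inspected:

hdfs fsck /                          # report corrupt, missing and under-replicated blocks
hdfs fsck / -list-corruptfileblocks  # list files that currently have corrupt blocks
# Under-replicated blocks are re-replicated automatically from the surviving copies.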

Explain Hadoop Ecosystem:


Hadoop consists of several tools that work together to process and analyze vast amounts of data. It is an open-source framework whose architecture is based on distributing storage and computation across a cluster of machines. These tools include:

1. MapReduce: This tool divides the input into small pieces, distributes them across
many machines in the cluster, and combines the output from all machines into one file.

2. Pig: This tool allows you to write scripts in a language called Pig Latin that can be
used to query large datasets stored in Hadoop Distributed File System (HDFS).

3. Hive: This tool allows users to store data in tables similar to those in SQL databases; however, the data is stored as files on HDFS instead of in relational database management systems (RDBMS).

The Hadoop ecosystem architecture is made up of four main components: data storage,
data processing, data access, and data management.
1. Data Storage: The first part of the Hadoop ecosystem is where all your raw data is stored. It could be on a local hard drive or in the cloud.

2. Data Processing: The second phase of the Hadoop ecosystem in Big Data involves
analyzing your data and transforming it into something meaningful that can be used for
further analysis.

3. Data Access : In this third phase of the Hadoop ecosystem, you can use tools like
Hive or Pig to query your data sets and perform actions like filtering out specific rows,
sorting them by certain columns or values within them (such as location or birthdate), etc.

4. Data Management : Finally, the last phase of the Hadoop ecosystem architecture
involves taking all the work we've done on data sets in previous phases and storing it
safely somewhere so we can return to it later if needed.

Hadoop and its ecosystem include many tools for data processing and analysis. Some of
these tools are used to collect data from various sources, while others are used to store
and analyze the data.

1. Oozie – Workflow Monitoring: Oozie is a workflow management system that allows users to monitor and control workflows. It can be used to automate tasks for a variety of purposes, including data processing, system administration, and debugging.
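
For example, a workflow might be submitted and monitored from the Oozie command-line client as sketched below (assuming an Oozie server at its default URL and a job.properties file pointing to a workflow definition on HDFS; <job-id> is a placeholder):

oozie job -oozie http://localhost:11000/oozie -config job.properties -run   # submit and start the workflow
oozie job -oozie http://localhost:11000/oozie -info <job-id>                # check the status of a running workflow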

2. Chukwa – Monitoring: Chukwa is an open-source distributed monitoring system for high-performance computing clusters. The tool collects data from Hadoop Distributed File System (HDFS), MapReduce, and YARN applications. It provides a web interface to view the data collected by Chukwa agents running on each node in the cluster.

3. Flume – Log Collection: Flume is an open-source distributed log collection system that stores log events from sources such as web servers or application servers into HDFS or other systems.
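
A minimal sketch of starting such an agent (the agent name a1 and the file example.conf, which would define a source, a channel and an HDFS sink, are illustrative placeholders):

flume-ng agent --conf ./conf --conf-file example.conf --name a1   # start a Flume agent named a1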

4. ZooKeeper – Management: ZooKeeper is a management tool that helps with the configuration management, data synchronization, and service discovery functions of Hadoop clusters.
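
For example, ZooKeeper can be exercised through its command-line client as sketched below (assuming a server on the default port 2181; /demo is an illustrative znode):

zkCli.sh -server localhost:2181   # connect to the ZooKeeper ensemble
# Inside the client:
#   create /demo "some-config-value"   # store a small piece of configuration
#   get /demo                          # read it back
#   ls /                               # list znodes under the root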

5. Hive – SQL: Hive is a data warehouse system for Hadoop that allows users to query data using an SQL-like language (HiveQL). It can also be used to create and modify tables and views, grant privileges to users, and so on.
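
For example, a table over HDFS files could be created and queried as sketched below (assuming the hive command-line client is installed; the employees table and its columns are illustrative):

hive -e "CREATE TABLE IF NOT EXISTS employees (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
hive -e "SELECT name FROM employees LIMIT 10;"   # runs as a job over the files backing the table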

6. Pig – Dataflow: Pig is a high-level language for writing data transformation programs. It provides a way to express data analysis programs in a style close to how people describe the steps of their work. Pig programs are compiled into MapReduce jobs that run on the Hadoop infrastructure.
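
A minimal Pig Latin sketch (the script name, input path, field names and filter condition are illustrative placeholders):

# transform.pig (illustrative script):
#   A = LOAD '/user/hadoop/input.csv' USING PigStorage(',') AS (name:chararray, age:int);
#   B = FILTER A BY age > 30;
#   STORE B INTO '/user/hadoop/output';
pig transform.pig        # compile the script into MapReduce jobs and run them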

7. Mahout – Machine Learning: Mahout is a suite of machine learning libraries that run on top of Hadoop. It includes implementations of many standard algorithms such as k-means clustering, naïve Bayes classification, logistic regression, support vector machines (SVM), random forests, etc.

8. MapReduce – Distributed Processing: MapReduce is a programming model frequently used for processing and managing large datasets. It has two phases:
● Map phase: the input data is divided into chunks and processed in parallel.
● Reduce phase: each group of intermediate key-value pairs is passed to a reducer, which computes the final output based on the values in that group.
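
For example, the word-count example that ships with Hadoop runs both phases over files in HDFS (a sketch; the exact path of the examples jar depends on the installation, and the input/output paths are placeholders):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/input /user/hadoop/wc-out
#   map phase: each mapper tokenizes its input split and emits (word, 1) pairs
#   reduce phase: each reducer sums the counts for the words assigned to it
hdfs dfs -cat /user/hadoop/wc-out/part-r-00000     # inspect the final output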

9. HBase – Column DB Storage: HBase (Hadoop Database) is an open-source, column-oriented database that uses HDFS as its underlying storage system. It provides a NoSQL storage solution for storing large amounts of unstructured data in a scalable manner.
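
For example, a small table could be created and queried from the HBase shell as sketched below (the table users and column family info are illustrative names):

hbase shell                # open the interactive HBase shell
# Inside the shell:
#   create 'users', 'info'                          # table with one column family
#   put 'users', 'row1', 'info:name', 'alice'       # insert a cell
#   get 'users', 'row1'                             # read one row
#   scan 'users'                                    # scan the whole table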
Basic Hadoop Commands:
The following operations were demonstrated; the corresponding commands are sketched after this list.

1. Check that all Hadoop components (daemons) are working fine.
2. Print the present working directory.
3. Create a directory at a given path.
4. Create a directory in the present working directory.
5. Display the list of files in the current directory.
6. Display the list of files at a given path.
7. Display the files under the root directory.
8. Display the list of files on the local machine.
9. Create a directory inside another directory.
10. Create multiple directories at a time.
11. Create a new file in the specified directory.
12. Create a file on the local system.
13. Copy a file/folder from the local file system to the HDFS store.
14. Print the content of a file stored in HDFS.
15. Copy a file from HDFS to a file path on the local OS.
16. Move a file from the local file system to HDFS.
17. Copy files within HDFS.
18. Move files within HDFS.
19. Get the size of each file in a directory.
20. Get the total size of a directory/file.
21. Change the replication factor of a file/directory in HDFS.
22. Put a file on HDFS.
23. Get a file from HDFS.
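
The commands below are a hedged sketch of how each of the steps above is typically carried out (assuming a working Hadoop installation with the hadoop and hdfs binaries on the PATH; directory and file names such as /user/hadoop/dir1, sample.txt and notes.txt are illustrative placeholders):

jps                                                    # 1. check that NameNode, DataNode and other daemons are running
pwd                                                    # 2. present working directory on the local machine
hdfs dfs -mkdir /user/hadoop/dir1                      # 3. create a directory at a given HDFS path
hdfs dfs -mkdir dir2                                   # 4. create a directory in the present (home) HDFS directory
hdfs dfs -ls                                           # 5. list the current HDFS home directory
hdfs dfs -ls /user/hadoop/dir1                         # 6. list a given HDFS path
hdfs dfs -ls /                                         # 7. files under the HDFS root directory
ls -l                                                  # 8. list files on the local machine
hdfs dfs -mkdir -p /user/hadoop/dir1/subdir            # 9. create a directory inside another directory
hdfs dfs -mkdir /user/hadoop/dirA /user/hadoop/dirB    # 10. create multiple directories at a time
hdfs dfs -touchz /user/hadoop/dir1/empty.txt           # 11. create a new (empty) file in the specified HDFS directory
cat > sample.txt                                       # 12. create a file on the local system (type text, then Ctrl+D)
hdfs dfs -copyFromLocal sample.txt /user/hadoop/dir1/  # 13. copy a file/folder from the local file system to HDFS
hdfs dfs -cat /user/hadoop/dir1/sample.txt             # 14. print the content of a file stored in HDFS
hdfs dfs -copyToLocal /user/hadoop/dir1/sample.txt /tmp/       # 15. copy a file from HDFS to a local path
hdfs dfs -moveFromLocal notes.txt /user/hadoop/dir2/           # 16. move a file from the local file system to HDFS
hdfs dfs -cp /user/hadoop/dir1/sample.txt /user/hadoop/dirA/   # 17. copy files within HDFS
hdfs dfs -mv /user/hadoop/dirA/sample.txt /user/hadoop/dirB/   # 18. move files within HDFS
hdfs dfs -du /user/hadoop/dir1                         # 19. size of each file in the directory
hdfs dfs -du -s /user/hadoop/dir1                      # 20. total size of the directory/file
hdfs dfs -setrep 2 /user/hadoop/dir1/sample.txt        # 21. change the replication factor of a file/directory
hdfs dfs -put sample.txt /user/hadoop/dir2/            # 22. put a file on HDFS
hdfs dfs -get /user/hadoop/dir1/sample.txt /tmp/       # 23. get a file from HDFS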

CONCLUSION: In conclusion, this lab manual offers a practical and comprehensive exploration
of the Hadoop ecosystem, covering HDFS fundamentals, tool overview, installation, file
operations, and programming exercises. Through these hands-on experiences, I have acquired
essential skills for proficiently utilizing Hadoop in various data processing tasks.
