
Introduction to Hadoop

•Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of commodity hardware.

•It was initially created by Doug Cutting and Mike Cafarella in 2005 and is now maintained by the Apache Software Foundation.

•The primary goal of Hadoop is to enable the processing of vast amounts of data in a cost-effective and efficient manner. It provides a way to store, manage, and analyze data that is too large or complex to be handled by traditional relational databases or single machines.

Hadoop's key components include the Hadoop Distributed File System (HDFS) and the
MapReduce programming model.
•Hadoop Distributed File System (HDFS): HDFS is a distributed file system designed
to store large datasets across multiple machines. It divides files into blocks and
replicates them across the cluster to ensure fault tolerance and data availability. HDFS
supports high throughput data access, making it suitable for applications that require
large sequential reads.
•MapReduce Programming Model: MapReduce is a programming paradigm that allows developers to process large datasets in parallel across a distributed cluster. It comprises two main phases: the "Map" phase, where data is processed and filtered, and the "Reduce" phase, where the processed data is aggregated and summarized. This model abstracts the complexities of parallel programming and enables scalable and fault-tolerant data processing.
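
To make the Map and Reduce phases concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). It is an illustrative sketch, not a tuned production job; the input and output paths passed on the command line are placeholders.

// Minimal word-count job: the Map phase emits (word, 1) pairs,
// the Reduce phase sums the counts for each word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: tokenize each input line and emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}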



YARN (Yet Another Resource Negotiator): YARN is a resource management
platform that manages and allocates resources across the cluster, enabling different
processing frameworks to run concurrently on the same cluster. It decouples resource
management from the MapReduce programming model, allowing for more diverse
workloads.
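
As a small illustration of YARN's role as a cluster-wide resource manager, the sketch below uses the YarnClient Java API to connect to the ResourceManager and list the applications it is tracking, regardless of which framework submitted them. It assumes a reachable cluster whose yarn-site.xml and core-site.xml are on the classpath.

// Sketch: querying the YARN ResourceManager with the YarnClient API.
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());   // reads the cluster configuration from the classpath
    yarnClient.start();                         // connects to the ResourceManager

    // YARN tracks applications from any framework (MapReduce, Spark, Tez, ...),
    // not only MapReduce jobs.
    List<ApplicationReport> apps = yarnClient.getApplications();
    for (ApplicationReport app : apps) {
      System.out.println(app.getApplicationId() + "  "
          + app.getApplicationType() + "  "
          + app.getYarnApplicationState());
    }
    yarnClient.stop();
  }
}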

Comparison of Hadoop and RDBMS


Hadoop and Relational Database Management Systems (RDBMS) are two distinct
technologies used for managing and analyzing large volumes of data, but they have
different architectures and use cases. Here's a comparison of the two:
1.Data Storage Model:
Hadoop: Hadoop is designed for distributed storage and processing of large datasets. It
uses a distributed file system called HDFS (Hadoop Distributed File System) to store
data in a fault-tolerant and scalable manner. Hadoop is well-suited for handling
unstructured and semi-structured data.
RDBMS: RDBMS stores data in structured tables with predefined schemas. It uses SQL
(Structured Query Language) for defining, querying, and manipulating data. RDBMS is
suitable for structured data with well-defined relationships between tables.



2. Scalability:
Hadoop: Hadoop provides horizontal scalability, allowing you to add more nodes to the cluster to handle larger datasets and workloads. It achieves scalability through its distributed architecture.
RDBMS: Traditional RDBMS systems can also scale, but their scalability is often vertical (adding more resources to a single server) and might have limitations when it comes to handling extremely large datasets and high throughput.

3.Data Processing Paradigm:


Hadoop: Hadoop uses a batch processing paradigm through its MapReduce framework. It processes data in parallel across the cluster by dividing tasks into map and reduce phases. The broader Hadoop ecosystem also supports near-real-time and streaming processing through technologies like Apache Spark and Apache Flink.
RDBMS: RDBMS primarily uses a query-based approach for data retrieval and
manipulation. It's optimized for transactional operations and structured queries.



4. Data Processing Speed:
Hadoop: Hadoop's batch processing approach can be slower for certain types of
queries compared to RDBMS, especially for interactive and real-time queries.
RDBMS: RDBMS can provide faster response times for structured queries due to
its indexing and optimization techniques.

5.Data Types and Flexibility:


Hadoop: Hadoop can handle a wide variety of data types, including unstructured and
semi-structured data like text, images, and log files. It offers more flexibility in data
storage and processing.
RDBMS: RDBMS is better suited for structured data with well-defined schemas. While some RDBMS systems have added support for semi-structured data, they are generally not as flexible as Hadoop for handling diverse data types.

6.Use Cases:
Hadoop: Hadoop is commonly used for big data analytics, processing large-scale log data, machine learning, and data warehousing for unstructured and semi-structured data.
RDBMS: RDBMS is well-suited for transactional systems, managing structured
business data, and supporting applications that require ACID (Atomicity,
Consistency, Isolation, Durability) compliance.



Brief history of Hadoop

Hadoop is an open-source framework for distributed storage and processing of large datasets. It was created to address the challenges of dealing with massive amounts of data that exceed the capabilities of traditional data processing systems. Here's a brief history of Hadoop:

1. Early Development (2004-2005): The origins of Hadoop trace back to 2004, when Doug Cutting and Mike Cafarella began working on an open-source web search engine called Nutch. They needed a way to process and analyze large volumes of data efficiently. Inspired by Google's MapReduce paper and the Google File System, they developed a framework called the Nutch Distributed File System (NDFS) and implemented a basic version of MapReduce.

2. Hadoop Project Initiation (2006): In 2006, Doug Cutting joined Yahoo, and the Hadoop project was formally initiated as an Apache open-source project. Hadoop was named after Doug Cutting's son's toy elephant. The project aimed to create a reliable, scalable, and distributed computing framework for processing and storing large datasets.


3. Hadoop Core Components (2006-2007): The initial development of Hadoop focused on two key components: Hadoop Distributed File System (HDFS) for distributed storage and Hadoop MapReduce for distributed processing. These components formed the foundation of Hadoop's architecture.

4. Apache Hadoop Becomes Official (2008): In 2008, Hadoop graduated from the Apache Incubator and became a top-level Apache project, solidifying its status as an open-source and community-driven initiative.

5. Ecosystem Expansion (2009-2012): As Hadoop gained popularity, an ecosystem of related projects emerged to enhance its capabilities. Apache Pig and Apache Hive provided higher-level query languages and data processing abstractions. HBase introduced a distributed, scalable NoSQL database. ZooKeeper offered distributed coordination and synchronization. These projects, along with others, extended Hadoop's functionality for different use cases.

6. Commercial Adoption (2010s): Hadoop started gaining significant attention
from enterprises looking to manage and analyze large datasets. Many companies,
including Cloudera, Hortonworks, and MapR, began offering commercial
distributions of Hadoop, along with tools and services to simplify deployment and
management.

7. YARN and Hadoop 2 (2012): Hadoop 2, released in 2012, introduced YARN (Yet Another Resource Negotiator), a resource management framework that decoupled the processing layer (MapReduce) from resource management. YARN allowed Hadoop to support multiple processing frameworks, enabling more diverse workloads.

8. Apache Spark and Beyond (2014-2015): Apache Spark, a fast and versatile data
processing engine, gained prominence in the Hadoop ecosystem as an alternative
to MapReduce. Spark provided in-memory processing, making certain workloads
significantly faster. Hadoop's ecosystem continued to evolve with projects like
Apache Tez for optimizing data processing pipelines.


9. Hadoop 3 and Current Developments (2017-Present): Hadoop 3, released in 2017, introduced several improvements, including support for erasure coding in HDFS, enhanced resource management in YARN, and better performance optimizations. Hadoop's ecosystem continues to expand and adapt to the changing landscape of data processing and storage technologies.

Throughout its history, Hadoop has played a pivotal role in enabling organizations
to handle and analyze massive datasets. While newer technologies and platforms
have emerged, Hadoop remains an important tool in the realm of big data
processing and analytics.



Apache Hadoop EcoSystem
The Apache Hadoop ecosystem is a collection of open-source projects and tools that
work together to extend and enhance the capabilities of the Hadoop platform. These
projects provide various functionalities for data storage, processing, analysis, and
management. Here are some key components of the Apache Hadoop ecosystem:

1. Hadoop Common: The foundational module that provides the common utilities and libraries required by other Hadoop modules.

2. Hadoop Distributed File System (HDFS): A distributed and scalable file system designed to store large volumes of data across commodity hardware.

3. YARN (Yet Another Resource Negotiator): A resource management and job scheduling framework that allows multiple processing engines (such as MapReduce, Apache Spark, and Apache Tez) to share cluster resources efficiently.

4. MapReduce: A programming model and processing framework for distributed computation, widely used for batch processing of large datasets.

5. Apache Spark: A fast and general-purpose cluster computing system that supports in-memory processing and provides APIs for various data processing tasks, including batch processing, real-time streaming, machine learning, and graph processing.

6. Apache Hive: A data warehousing system with a SQL-like query language that provides a high-level abstraction over Hadoop and allows users to query and analyze data using SQL queries.

7. Apache Pig: A platform for analyzing large datasets using a high-level scripting language called Pig Latin, which simplifies data processing tasks.

8. Apache HBase: A distributed, scalable NoSQL database that provides random access to large amounts of structured data.

9. Apache ZooKeeper: A distributed coordination service that helps manage and synchronize distributed applications and services.

10. Apache Oozie: A workflow scheduling system for managing and executing
complex Hadoop job workflows.

11. Apache Sqoop: A tool for transferring data between Hadoop and relational
databases, facilitating the import and export of data.

12. Apache Flume: A distributed, reliable, and available system for efficiently
collecting, aggregating, and moving large amounts of log and event data.

13. Apache Kafka: A distributed streaming platform that allows applications to publish, subscribe to, and process streams of records in real time.

14. Apache Mahout: A machine learning library for building scalable and effective machine learning algorithms on top of Hadoop.

15. Apache Drill: A schema-free SQL query engine for big data exploration across various data sources.

16. Apache Ambari: A management and monitoring tool that provides a web-based interface for provisioning, managing, and monitoring Hadoop clusters.

17. Apache Knox: A gateway for securing and accessing REST APIs and UIs of Hadoop clusters.

18. Apache Ranger: A framework for managing security and access control policies across the Hadoop ecosystem.

19. Apache Sentry: A system for fine-grained authorization to data and metadata stored on a Hadoop cluster.

20. Apache NiFi: A data integration and dataflow automation tool that enables the creation of data pipelines to move and process data between systems.

These are just some of the many projects in the Apache Hadoop ecosystem. The
ecosystem continues to evolve, with new projects and updates being developed to
address various big data challenges and use cases.



ARCHITECTURE OF HDFS
Now, let's delve into each HDFS component and their interactions:

•Client Application:
- The "Client Application" represents any program or user interacting with HDFS to read data from or write data to files.
- Read and write requests are initiated by the client application to perform operations on files stored in HDFS.
- The client communicates with the NameNode to obtain necessary information about the file, such as its structure and the locations of its data blocks.
•Name Node:
- The "Name Node" is a crucial component in HDFS. It acts as the central metadata server and manager of the file system's namespace.
- "Metadata" refers to information about files and directories, including their names, permissions, hierarchy, and the mapping of data blocks to Data Nodes.
- The Name Node maintains the fsimage (checkpoint) and edit logs, which record changes to the file system's namespace over time.
- When a client request comes in, the Name Node responds with block locations for the requested file's data blocks, enabling the client to directly communicate with the appropriate Data Nodes.



•DataNodes (Slave Nodes):
- "DataNodes" are worker nodes responsible for storing and managing actual data blocks.
- DataNodes periodically send heartbeat signals to the NameNode to indicate their availability and health.
- They also send block reports to the NameNode, providing information about the data blocks they store.
- Each data block can have multiple replicas (usually three) stored across different DataNodes for data reliability and fault tolerance.
- When a client reads data, it directly accesses the DataNodes holding the replicas of the requested data blocks.
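
The snippet below is a small sketch of this read path from the client's side, using the HDFS Java FileSystem API. The client code only asks for metadata and opens a stream; the library contacts the NameNode for block locations and then streams the blocks from the DataNodes that hold the replicas. The file path shown is an illustrative placeholder.

// Sketch: reading a file from HDFS with the FileSystem API.
// The NameNode lookup and the DataNode reads happen inside the library;
// client code just sees an input stream.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);              // connects to the default file system (HDFS)

    Path file = new Path("/user/hadoop/hdfsfile.txt"); // placeholder path

    // Metadata comes from the NameNode: file status and block locations.
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + " stored on " + String.join(", ", block.getHosts()));
    }

    // Actual data is streamed from the DataNodes holding the replicas.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}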



The HDFS architecture consists of the following components:

1. NameNode: The NameNode is the master server in HDFS. It manages the file
system namespace and stores metadata about the files and directories in the
system. The NameNode also tracks the location of the data blocks that make up
each file.
2. DataNode: The DataNodes are the slave servers in HDFS. They store the actual
data blocks that make up the files in the system. The DataNodes are responsible
for serving read and write requests for files from clients.
3. Secondary NameNode: Despite its name, the secondary NameNode is not a hot standby for the NameNode. It periodically merges the NameNode's fsimage and edit logs into a new checkpoint, which keeps the edit log from growing indefinitely; its checkpoint can help rebuild the namespace if the NameNode's metadata is lost, but it cannot automatically take over for a failed NameNode.
4. Checkpoint Node: The checkpoint node is used to store a copy of the NameNode's
metadata for disaster recovery. The checkpoint node is not a mandatory
component of HDFS, but it is recommended for production deployments.

The NameNode and secondary NameNode are typically deployed on separate servers. The
DataNodes can be deployed on any number of servers. The number of DataNodes depends
on the amount of data that needs to be stored in HDFS.

The NameNode is responsible for the following tasks:

•Managing the file system namespace
•Storing metadata about files and directories
•Tracking the location of data blocks
•Handling client requests for files
•Re-replicating data blocks if a DataNode fails

The DataNodes are responsible for the following tasks:

•Storing the actual data blocks that make up files
•Serving read and write requests from clients
•Reporting to the NameNode about the status of their data blocks
•Re-replicating data blocks, as directed by the NameNode, if a DataNode fails
The secondary NameNode is responsible for the following tasks:

• Periodically fetching the fsimage and edit logs from the NameNode
• Merging them into a new checkpoint and returning it to the NameNode, which keeps the edit log from growing without bound
• Note that, despite its name, the secondary NameNode does not serve client requests and cannot automatically take over if the NameNode fails

The checkpoint node is responsible for the following tasks:

• Periodically downloading the NameNode's fsimage and edit logs and merging them into a new checkpoint
• Storing a copy of the NameNode's metadata
• Helping restore the NameNode's metadata if the NameNode fails



Working with HDFS (Commands)

Working with Hadoop Distributed File System (HDFS) involves using a set of
commands to interact with the file system. These commands are executed through the
Hadoop command-line interface (CLI) or other interfaces provided by Hadoop
distributions. Below are some commonly used HDFS commands along with
explanations of their functionality:
1. hadoop fs -ls <path>:
Lists the contents of a directory in HDFS.
Example: hadoop fs -ls /user/hadoop

2. hadoop fs -mkdir <path>:
Creates a directory in HDFS.
Example: hadoop fs -mkdir /user/newdir
3. hadoop fs -copyFromLocal <local_path> <hdfs_path>:
Copies a file from the local file system to HDFS.
Example: hadoop fs -copyFromLocal localfile.txt /user/hadoop/hdfsfile.txt
4. hadoop fs -copyToLocal <hdfs_path> <local_path>:
Copies a file from HDFS to the local file system.
Example: hadoop fs -copyToLocal /user/hadoop/hdfsfile.txt localfile.txt
5. hadoop fs -mv <src> <dest>:
Moves a file or directory within HDFS.
Example: hadoop fs -mv /user/hadoop/file1.txt /user/hadoop/archive/file1.txt

6. hadoop fs -rm <path>:
Removes a file from HDFS (use -rm -r to remove a directory).
Example: hadoop fs -rm /user/hadoop/unwantedfile.txt
7. hadoop fs -du -h <path>:
Displays the disk usage of files and directories in HDFS in a human-readable
format.
Example: hadoop fs -du -h /user/hadoop
8. hadoop fs -cat <path>:
Displays the contents of a file in HDFS.
Example: hadoop fs -cat /user/hadoop/hdfsfile.txt
9. hadoop fs -get <hdfs_path> <local_path>:
Downloads a file from HDFS to the local file system.
Example: hadoop fs -get /user/hadoop/hdfsfile.txt localfile.txt



10. hadoop fs -put <local_path> <hdfs_path>:
Uploads a file from the local file system to HDFS.
Example: hadoop fs -put localfile.txt /user/hadoop/hdfsfile.txt
11. hadoop fs -getmerge <src> <local_path>:
Merges files in a directory in HDFS and copies the result to the local file system.
Example: hadoop fs -getmerge /user/hadoop/output local_merged.txt

12. hadoop fs -chmod <mode> <path>:
Changes the permissions of a file or directory in HDFS.
Example: hadoop fs -chmod 755 /user/hadoop/hdfsfile.txt

13. hadoop fs -chown <owner>:<group> <path>:
Changes the owner and group of a file or directory in HDFS.
Example: hadoop fs -chown newowner:newgroup /user/hadoop/hdfsfile.txt

Note: These are just a few of the many HDFS commands available. Commands are executed through the hadoop fs command-line interface, followed by the desired operation and its arguments. The exact commands and syntax may vary depending on the Hadoop distribution and version you are using, and you can run hadoop fs -help to get more information about each command's usage and options.
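
As a complement to the shell commands above, the same operations can also be performed programmatically. The sketch below uses the HDFS Java FileSystem API to mirror a few of the commands (mkdir, put, ls, rm); the paths are illustrative placeholders and the cluster configuration is assumed to be on the classpath.

// Sketch: programmatic equivalents of a few hadoop fs commands.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // hadoop fs -mkdir /user/newdir
    fs.mkdirs(new Path("/user/newdir"));

    // hadoop fs -put localfile.txt /user/hadoop/hdfsfile.txt
    fs.copyFromLocalFile(new Path("localfile.txt"),
                         new Path("/user/hadoop/hdfsfile.txt"));

    // hadoop fs -ls /user/hadoop
    for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    // hadoop fs -rm /user/hadoop/unwantedfile.txt  (false = non-recursive delete)
    fs.delete(new Path("/user/hadoop/unwantedfile.txt"), false);

    fs.close();
  }
}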
