•Hadoop was initially created by Doug Cutting and Mike Cafarella in 2005 and is now
maintained by the Apache Software Foundation.
•The primary goal of Hadoop is to enable the processing of vast amounts of data in a
cost-effective and efficient manner. It provides a way to store, manage, and analyze data
that is too large or complex to be handled by traditional relational databases or single
machines.
•Use Cases:
Hadoop: Hadoop is commonly used for big data analytics, processing large-scale
log data, machine learning, and data warehousing for unstructured and
semi-structured data.
RDBMS: RDBMS is well-suited for transactional systems, managing structured
business data, and supporting applications that require ACID (Atomicity,
Consistency, Isolation, Durability) compliance.
•Early Development (2004-2005): The origins of Hadoop trace back to 2004 when
Doug Cutting and Mike Cafarella began working on an open-source web search
engine called Nutch. They needed a way to process and analyze large volumes of
data efficiently. Inspired by Google's MapReduce paper and Google File System,
they developed a framework called Nutch Distributed File System (NDFS) and
implemented a basic version of MapReduce.
•Hadoop Project Initiation (2006): In 2006, Doug Cutting joined Yahoo, and the
Hadoop project was formally initiated as an Apache open-source project. Hadoop
was named after Doug Cutting's son's toy elephant. The project aimed to create a
reliable, scalable, and distributed computing framework for processing and storing
large datasets.
•Apache Hadoop Becomes Official (2008): In 2008, Hadoop graduated from the
Apache Incubator and became a top-level Apache project, solidifying its status as
an open-source and community-driven initiative.
•Apache Spark and Beyond (2014-2015): Apache Spark, a fast and versatile data
processing engine, gained prominence in the Hadoop ecosystem as an alternative
to MapReduce. Spark provided in-memory processing, making certain workloads
significantly faster. Hadoop's ecosystem continued to evolve with projects like
Apache Tez for optimizing data processing pipelines.
Throughout its history, Hadoop has played a pivotal role in enabling organizations
to handle and analyze massive datasets. While newer technologies and platforms
have emerged, Hadoop remains an important tool in the realm of big data
processing and analytics.
•Hadoop Common: The foundational module that provides the common utilities and
libraries required by other Hadoop modules.
•Hadoop Distributed File System (HDFS): A distributed and scalable file system
designed to store large volumes of data across commodity hardware.
•Apache Hive: A data warehousing layer that provides a SQL-like query language
(HiveQL) as a high-level abstraction over Hadoop, allowing users to query and
analyze data with familiar SQL syntax (see the example commands after this list).
•Apache Pig: A platform for analyzing large datasets using a high-level scripting
language called Pig Latin, which simplifies data processing tasks.
•Apache Oozie: A workflow scheduling system for managing and executing
complex Hadoop job workflows.
•Apache Sqoop: A tool for transferring data between Hadoop and relational
databases, facilitating the import and export of data (see the example commands
after this list).
•Apache Flume: A distributed, reliable, and available system for efficiently
collecting, aggregating, and moving large amounts of log and event data.
•Apache Mahout: A machine learning library for building scalable and effective
machine learning algorithms on top of Hadoop.
•Apache Drill: A schema-free SQL query engine for big data exploration across
various data sources.
•Apache Knox: A gateway for securing and accessing REST APIs and UIs of
Hadoop clusters.
•Apache Ranger: A framework for managing security and access control policies
across the Hadoop ecosystem.
•Apache Sentry: A system for fine-grained authorization to data and metadata stored
on a Hadoop cluster.
•Apache NiFi: A data integration and dataflow automation tool that enables the
creation of data pipelines to move and process data between systems.
These are just some of the many projects in the Apache Hadoop ecosystem. The
ecosystem continues to evolve, with new projects and updates being developed to
address various big data challenges and use cases.
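To make this more concrete, here are two minimal command sketches showing how Hive
and Sqoop are typically invoked from the shell; the table, database, and path names
are hypothetical, not taken from any particular deployment:
Example (Hive): hive -e "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page;"
This runs a HiveQL aggregation over data stored in HDFS, written just as one would in SQL.
Example (Sqoop): sqoop import --connect jdbc:mysql://dbhost/sales --table orders --target-dir /user/hadoop/orders
This copies the rows of a relational table into files under an HDFS directory, from
which tools like Hive or Pig can then read them.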
•Client Application:
The "Client Application" represents any program or user interacting with HDFS
to read data from or write data to files.
Read and write requests are initiated by the client application to perform
operations on files stored in HDFS.
The client communicates with the NameNode to obtain necessary information
about the file, such as its structure and the locations of its data blocks.
•HDFS Components:
1. NameNode: The NameNode is the master server in HDFS. It manages the file
system namespace and stores metadata about the files and directories in the
system. The NameNode also tracks the location of the data blocks that make up
each file.
2. DataNode: The DataNodes are the slave servers in HDFS. They store the actual
data blocks that make up the files in the system. The DataNodes are responsible
for serving read and write requests for files from clients.
3. Secondary NameNode: Despite its name, the secondary NameNode is not a hot
standby for the NameNode. It periodically merges the NameNode's edit log into the
file system image (a process called checkpointing), which keeps the metadata files
compact and shortens NameNode restart time; it cannot automatically take over if
the primary NameNode fails.
4. Checkpoint Node: The checkpoint node performs the same periodic checkpointing
of the NameNode's metadata and keeps a local copy of the merged namespace, which
can help with disaster recovery. The checkpoint node is not a mandatory
component of HDFS, but it is recommended for production deployments.
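To tie these components together, consider what happens during a simple read; this
is a minimal sketch, assuming a file at a hypothetical path:
Example: hadoop fs -cat /user/hadoop/hdfsfile.txt
Behind this single command, the client first asks the NameNode for the file's block
locations, then streams each block directly from the DataNodes that hold it; the
NameNode serves only metadata and never the file contents themselves.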
Working with Hadoop Distributed File System (HDFS) involves using a set of
commands to interact with the file system. These commands are executed through the
Hadoop command-line interface (CLI) or other interfaces provided by Hadoop
distributions. Below are some commonly used HDFS commands along with
explanations of their functionality:
1. hadoop fs -ls <path>: Lists the files and directories at the given path in HDFS.
Example: hadoop fs -ls /user/hadoop/archive/file1.txt
2. hadoop fs -getmerge <src> <localdst>: Merges the files in a directory in HDFS and
copies the result to the local file system.
Example: hadoop fs -getmerge /user/hadoop/output local_merged.txt
3. hadoop fs -chown <owner>:<group> <path>: Changes the owner and group of a file or
directory in HDFS.
Example: hadoop fs -chown newowner:newgroup /user/hadoop/hdfsfile.txt
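A few more everyday commands in the same spirit (the paths are illustrative):
4. hadoop fs -put <localsrc> <dst>: Copies a file from the local file system into HDFS.
Example: hadoop fs -put local_data.txt /user/hadoop/
5. hadoop fs -get <src> <localdst>: Copies a file from HDFS to the local file system.
Example: hadoop fs -get /user/hadoop/hdfsfile.txt .
6. hadoop fs -mkdir <path>: Creates a directory in HDFS.
Example: hadoop fs -mkdir /user/hadoop/newdir
7. hadoop fs -rm <path>: Deletes a file in HDFS.
Example: hadoop fs -rm /user/hadoop/old_data.txt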
Note: These are just a few examples of the many HDFS commands available. The
commands are executed using the hadoop fs command-line interface followed by the desired
operation and its arguments. The specific commands and syntax might vary depending on
the Hadoop distribution and version you are using, and you can use the -help flag with
any command to get more information about its usage and options.