BAD601-Big Data Analytics SVCE

Module-2 Introduction to Hadoop


Today, Big Data seems to be the buzzword! Enterprises the world over are beginning to realize that there is
a huge volume of untapped information before them in the form of structured, semi-structured, and
unstructured data. This wide variety of data is spread across their networks.

This text introduces Hadoop, a technology used to handle Big Data - very large amounts of information that
come from different sources every day.

It explains how massive amounts of data are generated every second by different platforms and services:

1. Every day:

1. Stock markets generate data on billions of traded shares.


2. Facebook stores billions of comments and likes.

3. Google processes huge amounts of data (like millions of GB!).

2. Every minute:

1. Facebook users share 2.5 million posts.

2. People tweet about 300,000 times.

3. Instagram users upload 220,000 photos.

4. YouTube users upload 72 hours of video.

5. Apple users download 50,000 apps.


6. Over 200 million emails are sent.

7. Amazon makes sales of over $80,000.

8. Google receives 4 million search queries.

3. Every second:

1. Banking applications process over 10,000 credit card transactions.

This shows how much data is generated worldwide, and Hadoop helps to store, process, and manage all of it efficiently.

Data: The Treasure Trove


Data helps businesses – It enables companies to recommend products, create new ideas, and analyze the
market, improving their success.

Data gives early insights – Businesses can use data to predict trends and make better decisions before
competitors.

More data means better accuracy – The more data businesses have, the more precise their analysis
becomes, leading to better strategies and outcomes.


Challenges of Big Data


The challenges of Big Data can be illustrated in terms of the three V's: Volume, Variety, and Velocity.

Volume (Size of Data)


▪ One person says: “I am inundated with data. How to store terabytes of mounting data?”
▪ This highlights the challenge of storing and managing massive amounts of data.
Variety (Different Data Types)

▪ Another person says: “I have data in varied sources... structured, semi-structured, and unstructured.
How to work with data that is so very different?”
▪ This shows the challenge of handling different types of data from various sources, such as text, images,
videos, or databases.

Velocity (Speed of Processing)

▪ The third person says: “I need this data to be processed quickly. My decision is pending. How to access the information quickly?”
▪ This reflects the challenge of processing and retrieving data quickly to make timely decisions.

Why Hadoop?
▪ Low cost: Hadoop is an open-source framework and uses commodity hardware (relatively inexpensive and easily available hardware) to store enormous quantities of data.
▪ Computing Power: Hadoop uses many computers together to process large amounts of data quickly.
The more computers (nodes) in the system, the faster the processing.
▪ Scalability: You can easily add more computers (nodes) to the system as your data grows, and it doesn't
require much effort to manage.
▪ Storage Flexibility: Unlike traditional databases, Hadoop doesn't need data to be structured or
processed before storing it. You can store different types of data, such as images, videos, and text, and
decide later how to use it.
▪ Inherent Data Protection: Hadoop automatically protects your data in case of hardware failure. If
one computer (node) stops working, Hadoop shifts the task to another working computer. It also stores
multiple copies of data across different computers to ensure no data is lost.
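
As a small illustration of the inherent data protection point above, the replication factor of a file can be inspected or changed through Hadoop's FileSystem Java API. This is only a minimal sketch: it assumes a running HDFS cluster reachable through the default configuration, and the path /sample/test.txt is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Connect using the cluster settings found in core-site.xml / hdfs-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/sample/test.txt");   // hypothetical file used for illustration

        // Read the replication factor currently recorded for this file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Ask HDFS to keep three copies of the file's blocks (the usual default).
        fs.setReplication(file, (short) 3);
    }
}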


Why not RDBMS?


▪ Not good for large files: RDBMS struggles to handle large files like images and videos.
▪ Not ideal for advanced analytics: It is not the best choice for machine learning and big data analysis.
▪ Expensive as data grows: Managing large amounts of data in RDBMS requires high investment in
storage and resources.

RDBMS Vs Hadoop
Feature | Hadoop | RDBMS
Data Variety | Supports structured, semi-structured, and unstructured data (e.g., XML, JSON, text files). | Supports only structured data.
Data Storage | Handles very large datasets (terabytes to petabytes). | Handles smaller datasets (usually gigabytes).
Querying | Uses HiveQL. | Uses SQL.
Query Response | Slower due to batch processing. | Faster, with immediate responses.
Schema Enforcement | Schema is enforced at read time (Schema on Read). | Schema is enforced at write time (Schema on Write).
Cost | Open-source, scalable, and cost-effective for big data processing. | Available as both proprietary (Oracle, SQL Server, IBM DB2) and open-source (MySQL, PostgreSQL).
Use Cases | Best for big data analytics and data discovery. | Best for OLTP (Online Transaction Processing), managing daily business transactions.


History of Hadoop
▪ Hadoop is an open-source software developed by Apache.

▪ It is written in Java and was created by Doug Cutting in 2005.

▪ He named it after his son's toy elephant. At that time, he was working at Yahoo.

▪ Hadoop was originally built to support "Nutch," a text search engine.

▪ It is based on technologies like Google's MapReduce (which helps process large amounts of data) and
Google File System (which helps store data across many computers).

▪ Today, Hadoop is widely used by big companies like Yahoo, Facebook, LinkedIn, and Twitter as part
of their data storage and computing systems.

Hadoop overview
Hadoop is an open-source software framework used to store and process large amounts of data in a
distributed manner across clusters of commodity hardware.

Basically, Hadoop accomplishes two tasks:

1. Massive data storage.

2. Faster data processing.

Key aspects of Hadoop


Hadoop Components

HDFS (Hadoop Distributed File System) – Stores large data across multiple machines.
YARN (Yet Another Resource Negotiator) – Manages resources and job scheduling.

MapReduce – Processes data in parallel across multiple machines.


Apache Hive – A data warehouse that allows SQL-like querying on Hadoop.

Apache HBase – A NoSQL database for real-time data access.

Apache Spark – A fast, in-memory data processing engine.

Apache Pig – A high-level scripting language for processing large datasets.

Apache Sqoop – Transfers data between Hadoop and relational databases (RDBMS).

Apache Flume – Collects and transfers large amounts of log data.

Apache Zookeeper – Manages and coordinates distributed applications.


Apache Mahout – A machine learning library for building scalable algorithms on Hadoop.

Apache Oozie – A workflow scheduler for managing and coordinating Hadoop jobs.

Hadoop Conceptual Layer

Hadoop is conceptually divided into two layers:


1. Data Storage Layer – Stores huge volumes of data.
2. Data Processing Layer – Processes data in parallel to extract richer and meaningful insights.


High-Level Architecture of Hadoop

Hadoop follows a Master-Slave Architecture:


1. The Master Node is called NameNode.
2. The Slave Nodes are called DataNodes.
In Hadoop:
Master Node (NameNode) → Manages and controls the overall system, keeps track of where data is stored.
Slave Nodes (DataNodes) → Store actual data and perform processing tasks as directed by the Master.

USE CASE OF HADOOP


ClickStream Data

ClickStream data (mouse clicks) helps you to understand the purchasing behavior of customers. ClickStream
analysis helps online marketers to optimize their product web pages, promotional content, etc. to improve
their business.

Three Key Benefits of ClickStream Analysis Using Hadoop


Combining Data Sources: Hadoop can merge ClickStream data with other sources like CRM (Customer
Relationship Management) data, customer demographics, sales data, and ad campaign information.
This helps businesses understand customer behavior better.

Scalability & Cost Efficiency: Hadoop can store years of data with minimal extra cost.

It allows businesses to analyze long-term trends in ClickStream data, helping them stay ahead of
competitors.

Use of Apache Pig & Apache Hive: Business analysts use Apache Pig and Apache Hive for website data
analysis.

These tools help organize, refine, and prepare ClickStream data for visualization and analytics.


HDFS (HADOOP DISTRIBUTED FILE SYSTEM)


Key points of Hadoop Distributed File System

1. Storage component of Hadoop: It is the storage system used in Hadoop.

2. Distributed File System: Data is stored across multiple computers instead of a single machine.

3. Modeled after Google File System: It follows the same design principles used by Google for handling
big data.

4. Optimized for high throughput: It handles large data efficiently by storing it in big chunks and
processing it close to where it is stored.

5. Data Replication: It keeps multiple copies of the data to prevent loss in case of failure.

6. Automatic Data Replication: If a node (computer) fails, HDFS automatically creates new copies of
lost data on other nodes.

7. Optimized for Large Files: HDFS works best when handling big files (gigabytes or more) rather
than small ones.

8. Uses Existing File Systems: HDFS runs on top of traditional file systems like ext3 and ext4 (used in
Linux).

HDFS Daemons (for storage)


1. NameNode (Master Node) – Manages the file system and keeps track of where data is stored.
2. DataNodes (Worker Nodes) – Store actual data and process tasks.
3. Secondary NameNode – Assists NameNode by taking checkpoints and preventing data loss.
NameNode

▪ HDFS breaks a large file into smaller pieces called blocks.


▪ NameNode uses a rack ID to identify DataNodes in the rack. A rack is a collection of DataNodes
within the cluster.
▪ NameNode keeps track of blocks of a file as it is placed on various DataNodes.
▪ NameNode manages file-related operations such as read, write, create, and delete. Its main job is
managing the File System Namespace.
▪ The file system namespace, which includes the mapping of blocks to files and file properties, is stored in a file called FsImage.
▪ NameNode uses an EditLog (transaction log) to record every transaction that happens to the file
system metadata.


DataNode

A DataNode is a storage node in HDFS where actual data blocks are stored.

There are multiple DataNodes in a Hadoop cluster.

Heartbeat Mechanism:

▪ Each DataNode sends a "heartbeat" signal to the NameNode at regular intervals.

▪ This ensures that the DataNode is active and working.

What Happens If a DataNode Fails?


▪ If a DataNode stops sending heartbeats, the NameNode marks it as dead.

▪ NameNode then creates a new copy of lost data on other available DataNodes to maintain reliability.

Secondary NameNode
▪ It takes periodic snapshots (backups) of HDFS metadata to prevent data loss.

▪ It helps in recovering the system if the NameNode fails.

▪ Since the memory requirements of Secondary NameNode are the same as NameNode, it is better to
run NameNode and Secondary NameNode on different machines.

Why is it needed?

▪ The NameNode stores important metadata about files, but if it crashes, this metadata might be lost.

▪ The Secondary NameNode keeps backups to help in recovery.

Limitations of the Secondary NameNode:

▪ It does not replace the NameNode in real-time.


▪ It must be manually configured to take over if the NameNode fails.


Anatomy of File Read

1. The client opens the file that it wishes to read by calling open() on the DistributedFileSystem.
2. DistributedFileSystem communicates with the NameNode to get the locations of the data blocks. The
NameNode returns the addresses of the DataNodes that the data blocks are stored on. The
DistributedFileSystem then returns an FSDataInputStream to the client to read from the file.
3. The client then calls read() on the stream. The DFSInputStream, which holds the addresses of the
DataNodes for the first few blocks of the file, connects to the closest DataNode for the first block in the file.
4. The client calls read() repeatedly to stream the data from the DataNode.
5. When the end of a block is reached, DFSInputStream closes the connection to that DataNode and repeats
these steps to find the best DataNode for the next block and for subsequent blocks.
6. When the client completes reading the file, it calls close() on the FSDataInputStream to close
the connection.
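
The same sequence is visible from client code. The following is a minimal sketch, assuming a reachable HDFS cluster configured through core-site.xml/hdfs-site.xml and reusing the hypothetical path /sample/test.txt from the HDFS commands section; FileSystem.get() returns the DistributedFileSystem, and open() returns the FSDataInputStream described in the steps above.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // DistributedFileSystem when fs.defaultFS is hdfs://
        Path file = new Path("/sample/test.txt");      // hypothetical file path

        // Steps 1-2: open() asks the NameNode for block locations and returns an FSDataInputStream.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            // Steps 3-5: read() streams the blocks from the closest DataNodes, block by block.
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }   // Step 6: close() ends the connection.
    }
}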


Anatomy of File Write

1. The client starts by calling create() on the DistributedFileSystem to make a new file.
2. The NameNode is contacted via an RPC call to check if the file already exists and to create it if not.
Initially, the file is created without any data blocks. Then, the DistributedFileSystem returns an
FSDataOutputStream to allow the client to write data.
3. As the client writes data, the DFSOutputStream breaks it into smaller pieces (packets) and puts them
in a data queue. A component called DataStreamer takes the data from the queue and manages the
writing process.
4. The NameNode selects a set of DataNodes to store copies (replicas) of the data. These DataNodes
form a pipeline. By default, there are three DataNodes in the pipeline.
5. The DataStreamer sends data packets to the first DataNode, which stores the packet and forwards it to
the second DataNode, which then forwards it to the third DataNode.
6. The DFSOutputStream also keeps track of packets waiting for confirmation (acknowledgment) from
the DataNodes. A packet is removed from the queue only after all DataNodes have confirmed it.
7. When the client finishes writing, it calls close() on the stream. This ensures that all remaining packets
are sent, acknowledgments are received, and the NameNode is updated to confirm that the file creation
is complete.
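
A matching client-side sketch of the write path is given below, under the same assumptions as the read example (a reachable cluster and a hypothetical target path /sample/output.txt): create() returns the FSDataOutputStream described above, and close() flushes the remaining packets and finalizes the file with the NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/sample/output.txt");    // hypothetical target path

        // Steps 1-2: create() contacts the NameNode via RPC and returns an FSDataOutputStream.
        try (FSDataOutputStream out = fs.create(file)) {
            // Steps 3-6: the data is split into packets and pushed through the DataNode pipeline.
            out.writeBytes("hello hdfs\n");
        }   // Step 7: close() sends the remaining packets and completes the file.
    }
}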

Replica Placement Strategy


Hadoop Default Replica Placement Strategy

1. The first replica is placed on the same node where the client is running.

2. The second replica is placed on a node that is in a different rack.


3. The third replica is placed on the same rack as the second replica but on a different node.

4. After placing the replicas, a pipeline is created for data flow.


5. This strategy ensures high reliability in case of failures.


Working with HDFS Commands


1. hadoop fs -ls / : To get the list of directories and files at the root of HDFS.
2. hadoop fs -ls -R / : To get the list of complete directories and files of HDFS.
3. hadoop fs -mkdir /sample : To create a directory (say, sample) in HDFS.
4. hadoop fs -put /root/sample/test.txt /sample/test.txt : To copy a file from local file system to HDFS.
5. hadoop fs -get /sample/test.txt /root/sample/testsample.txt : To copy a file from HDFS to local file
system.
6. hadoop fs -copyFromLocal /root/sample/test.txt /sample/testsample.txt : To copy a file from local
file system to HDFS via copyFromLocal command.
7. hadoop fs -copyToLocal /sample/test.txt /root/sample/testsample1.txt : To copy a file from Hadoop
file system to local file system via copyToLocal command.
8. hadoop fs -cat /sample/test.txt: To display the contents of an HDFS file on console.
9. hadoop fs -cp /sample/test.txt /sample1: To copy a file from one directory to another on HDFS.
10. hadoop fs -rm -r /sample1: To remove a directory from HDFS.
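
The same file-system operations can also be performed programmatically through Hadoop's FileSystem Java API. The sketch below mirrors a few of the shell commands listed above; it is only an illustrative example that assumes a running HDFS cluster and reuses the example paths from the command list.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/sample"));                               // hadoop fs -mkdir /sample
        fs.copyFromLocalFile(new Path("/root/sample/test.txt"),
                             new Path("/sample/test.txt"));           // hadoop fs -put / -copyFromLocal
        for (FileStatus status : fs.listStatus(new Path("/"))) {      // hadoop fs -ls /
            System.out.println(status.getPath());
        }
        fs.copyToLocalFile(new Path("/sample/test.txt"),
                           new Path("/root/sample/testsample.txt"));  // hadoop fs -get / -copyToLocal
        fs.delete(new Path("/sample"), true);                         // hadoop fs -rm -r /sample
    }
}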

Processing Data with Hadoop


How It Works:
▪ The input data is divided into smaller parts called chunks.
▪ Map tasks process these chunks in parallel.
▪ The output from the map tasks is stored temporarily.
▪ The system sorts and organizes this data using keys.
▪ This sorted data is then sent to reduce tasks.
▪ Reduce tasks combine and process the data to generate the final output.


Why Is It Efficient?
▪ Data Locality: Tasks are scheduled on nodes where the data is already stored. This
avoids unnecessary data transfer and improves speed.
▪ It also handles failures, re-executes failed tasks, and manages job scheduling
automatically.
Important Components:
▪ JobTracker (Master): Manages and schedules tasks.
▪ TaskTracker (Slave): Executes the assigned tasks.
Job Execution:
▪ A job client submits a job to the JobTracker.
▪ The JobTracker schedules tasks to TaskTrackers and monitors their progress.
MapReduce Framework
In Hadoop, there are two main components (daemons) that help in processing data using the
MapReduce framework:
1. JobTracker (Master Node)
▪ Think of it as the manager or supervisor of the system.

▪ When you submit a job (code) to Hadoop, the JobTracker decides how to divide the work and assign
it to different computers (nodes).

▪ It keeps track of all running tasks and reschedules them if any task fails.

▪ Each Hadoop cluster has only one JobTracker, which manages the entire MapReduce job.
2. TaskTracker (Worker Nodes)

▪ TaskTrackers are like workers who execute the actual tasks.

▪ Each node in the cluster has a TaskTracker, which runs the tasks assigned by the JobTracker.

▪ It runs multiple Map or Reduce tasks in parallel using Java Virtual Machines (JVMs).

▪ TaskTracker continuously sends a heartbeat signal to JobTracker to inform that it is still working.

▪ If a TaskTracker fails, the JobTracker assumes it is dead and assigns the task to another available node.


How Does MapReduce Work?


MapReduce divides a data analysis task into two parts – map and reduce.

Figure 5.23 depicts how MapReduce programming works. In this example, there are two mappers and one
reducer. Each mapper works on the partial dataset stored on its node, and the reducer combines the
output from the mappers to produce the reduced result set.


Figure 5.24 describes the working model of MapReduce Programming. The following steps describe how
MapReduce performs its task.
1. First, the input dataset is split into multiple pieces of data (several small subsets).

2. Next, the framework creates a master process and several worker processes, and executes the worker
processes remotely.

3. Several map tasks run simultaneously, each reading the piece of data assigned to it. The map worker
applies the map function to the data present on its server and generates key/value pairs for the
extracted data.

4. The map worker uses a partitioner function to divide the data into regions; the partitioner decides
which reducer should receive the output of a given mapper.

5. When the map workers complete their work, the master instructs the reduce workers to begin their
work. The reduce workers in turn contact the map workers to get the key/value data for their partition.
The data thus received is shuffled and sorted as per keys.

6. The reduce worker then calls the reduce function for every unique key. This function writes the output to the file.

7. When all the reduce workers complete their work, the master transfers the control to the user program.
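
As a concrete illustration of these steps, the classic word-count program is sketched below using the standard Hadoop MapReduce Java API. It is a minimal example: the mapper emits a (word, 1) pair for every word in its input split, the reducer sums the counts for each unique key after the shuffle and sort, and the input and output paths are supplied as command-line arguments (placeholders, e.g. an HDFS input directory and a not-yet-existing output directory).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split assigned to this mapper.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each unique key after shuffle and sort.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}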

