BAD601-Big Data Analytics SVCE

Module-2 Introduction to Hadoop


Today, Big Data seems to be the buzzword! Enterprises the world over are beginning to realize that there is
a huge volume of untapped information before them in the form of structured, semi-structured, and
unstructured data. This wide variety of data is spread across their networks.

This text introduces Hadoop, a technology used to handle Big Data - very large amounts of information that
come from different sources every day.

It explains how massive amounts of data are generated every second by different platforms and services:

1. Every day:

1. Stock markets generate data on billions of traded shares.


2. Facebook stores billions of comments and likes.

3. Google processes huge amounts of data (like millions of GB!).

2. Every minute:

1. Facebook users share 2.5 million posts.

2. People tweet about 300,000 times.

3. Instagram users upload 220,000 photos.

4. YouTube users upload 72 hours of video.

5. Apple users download 50,000 apps.


6. Over 200 million emails are sent.

7. Amazon makes sales of over $80,000.

8. Google receives 4 million search queries.

3. Every second:

1. Banking applications process over 10,000 credit card transactions.

This shows how much data is generated worldwide, and Hadoop helps to store, process, and manage all of it efficiently.

Data: The Treasure Trove


Data helps businesses – It enables companies to recommend products, create new ideas, and analyze the
market, improving their success.

Data gives early insights – Businesses can use data to predict trends and make better decisions before
competitors.

More data means better accuracy – The more data businesses have, the more precise their analysis
becomes, leading to better strategies and outcomes.


Challenges of Big Data


The challenges of Big Data can be illustrated in terms of the three V's: Volume, Variety, and Velocity.

Volume (Size of Data)


▪ One person says: “I am inundated with data. How to store terabytes of mounting data?”
▪ This highlights the challenge of storing and managing massive amounts of data.
Variety (Different Data Types)

▪ Another person says: “I have data in varied sources... structured, semi-structured, and unstructured.
How to work with data that is so very different?”
▪ This shows the challenge of handling different types of data from various sources, such as text, images,
videos, or databases.

Velocity (Speed of Processing)

▪ The third person says: “I need this data to be processed quickly. My decision is pending. How to access the information quickly?”
▪ This reflects the challenge of processing and retrieving data quickly to make timely decisions.

Why Hadoop?
▪ Low cost: Hadoop is an open-source framework and uses commodity hardware (relatively inexpensive and easily available hardware) to store enormous quantities of data.
▪ Computing Power: Hadoop uses many computers together to process large amounts of data quickly.
The more computers (nodes) in the system, the faster the processing.
▪ Scalability: You can easily add more computers (nodes) to the system as your data grows, and it doesn't
require much effort to manage.
▪ Storage Flexibility: Unlike traditional databases, Hadoop doesn't need data to be structured or
processed before storing it. You can store different types of data, such as images, videos, and text, and
decide later how to use it.
▪ Inherent Data Protection: Hadoop automatically protects your data in case of hardware failure. If
one computer (node) stops working, Hadoop shifts the task to another working computer. It also stores
multiple copies of data across different computers to ensure no data is lost.
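
As a small illustration of the inherent data protection point above, the replication factor of a file can be inspected or changed through Hadoop's FileSystem Java API. This is only a minimal sketch: it assumes a running HDFS cluster reachable through the default configuration, and the path /sample/test.txt is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Connect using the cluster settings found in core-site.xml / hdfs-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/sample/test.txt");   // hypothetical file used for illustration

        // Read the replication factor currently recorded for this file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Ask HDFS to keep three copies of the file's blocks (the usual default).
        fs.setReplication(file, (short) 3);
    }
}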


Why not RDBMS?


▪ Not good for large files: RDBMS struggles to handle large files like images and videos.
▪ Not ideal for advanced analytics: It is not the best choice for machine learning and big data analysis.
▪ Expensive as data grows: Managing large amounts of data in RDBMS requires high investment in
storage and resources.

RDBMS Vs Hadoop
Feature | Hadoop | RDBMS
Data Variety | Supports structured, semi-structured, and unstructured data (e.g., XML, JSON, text files). | Supports only structured data.
Data Storage | Handles very large datasets (terabytes to petabytes). | Handles smaller datasets (usually gigabytes).
Querying | Uses HiveQL. | Uses SQL.
Query Response | Slower due to batch processing. | Faster, with immediate responses.
Schema Enforcement | Schema is enforced at read time (Schema on Read). | Schema is enforced at write time (Schema on Write).
Cost | Open-source, scalable, and cost-effective for big data processing. | Available as both proprietary (Oracle, SQL Server, IBM DB2) and open-source (MySQL, PostgreSQL).
Use Cases | Best for big data analytics and data discovery. | Best for OLTP (Online Transaction Processing), managing daily business transactions.


History of Hadoop
▪ Hadoop is an open-source software developed by Apache.

▪ It is written in Java and was created by Doug Cutting in 2005.

▪ He named it after his son's toy elephant. At that time, he was working at Yahoo.

▪ Hadoop was originally built to support "Nutch," a text search engine.

▪ It is based on technologies like Google's MapReduce (which helps process large amounts of data) and
Google File System (which helps store data across many computers).

▪ Today, Hadoop is widely used by big companies like Yahoo, Facebook, LinkedIn, and Twitter as part
of their data storage and computing systems.

Hadoop overview
Hadoop is an open-source software framework used to store and process large amounts of data in a
distributed manner across clusters of commodity hardware.

Basically, Hadoop accomplishes two tasks:

1. Massive data storage.

2. Faster data processing.

Key aspects of Hadoop


Hadoop Components

HDFS (Hadoop Distributed File System) – Stores large data across multiple machines.
YARN (Yet Another Resource Negotiator) – Manages resources and job scheduling.

MapReduce – Processes data in parallel across multiple machines.


Apache Hive – A data warehouse that allows SQL-like querying on Hadoop.

Apache HBase – A NoSQL database for real-time data access.

Apache Spark – A fast, in-memory data processing engine.

Apache Pig – A high-level scripting language for processing large datasets.

Apache Sqoop – Transfers data between Hadoop and relational databases (RDBMS).

Apache Flume – Collects and transfers large amounts of log data.

Apache Zookeeper – Manages and coordinates distributed applications.


Apache Mahout – A machine learning library for building scalable algorithms on Hadoop.

Apache Oozie – A workflow scheduler for managing and coordinating Hadoop jobs.

Hadoop Conceptual Layer

Hadoop is conceptually divided into two layers:


1. Data Storage Layer – Stores huge volumes of data.
2. Data Processing Layer – Processes data in parallel to extract richer and meaningful insights.


High-Level Architecture of Hadoop

Hadoop follows a Master-Slave Architecture:


1. The Master Node is called NameNode.
2. The Slave Nodes are called DataNodes.
In Hadoop:
Master Node (NameNode) → Manages and controls the overall system, keeps track of where data is stored.
Slave Nodes (DataNodes) → Store actual data and perform processing tasks as directed by the Master.

USE CASE OF HADOOP


ClickStream Data

ClickStream data (mouse clicks) helps you to understand the purchasing behavior of customers. ClickStream
analysis helps online marketers to optimize their product web pages, promotional content, etc. to improve
their business.

Three Key Benefits of ClickStream Analysis Using Hadoop


Combining Data Sources: Hadoop can merge ClickStream data with other sources like CRM (Customer
Relationship Management) data, customer demographics, sales data, and ad campaign information.
This helps businesses understand customer behavior better.

Scalability & Cost Efficiency: Hadoop can store years of data with minimal extra cost.

It allows businesses to analyze long-term trends in ClickStream data, helping them stay ahead of
competitors.

Use of Apache Pig & Apache Hive: Business analysts use Apache Pig and Apache Hive for website data
analysis.

These tools help organize, refine, and prepare ClickStream data for visualization and analytics.


HDFS (HADOOP DISTRIBUTED FILE SYSTEM)


Key points of Hadoop Distributed File System

1. Storage component of Hadoop: It is the storage system used in Hadoop.

2. Distributed File System: Data is stored across multiple computers instead of a single machine.

3. Modeled after Google File System: It follows the same design principles used by Google for handling
big data.

4. Optimized for high throughput: It handles large data efficiently by storing it in big chunks and
processing it close to where it is stored.

5. Data Replication: It keeps multiple copies of the data to prevent loss in case of failure.

6. Automatic Data Replication: If a node (computer) fails, HDFS automatically creates new copies of
lost data on other nodes.

7. Optimized for Large Files: HDFS works best when handling big files (gigabytes or more) rather
than small ones.

8. Uses Existing File Systems: HDFS runs on top of traditional file systems like ext3 and ext4 (used in
Linux).

HDFS Daemons (for storage)


1. NameNode (Master Node) – Manages the file system and keeps track of where data is stored.
2. DataNodes (Worker Nodes) – Store actual data and process tasks.
3. Secondary NameNode – Assists NameNode by taking checkpoints and preventing data loss.
NameNode

▪ HDFS breaks a large file into smaller pieces called blocks.


▪ NameNode uses a rack ID to identify DataNodes in the rack. A rack is a collection of DataNodes
within the cluster.
▪ NameNode keeps track of blocks of a file as it is placed on various DataNodes.
▪ NameNode manages file-related operations such as read, write, create, and delete. Its main job is
managing the File System Namespace.
▪ The file system namespace, which includes the mapping of blocks to files and file properties, is stored in a file called FsImage.
▪ NameNode uses an EditLog (transaction log) to record every transaction that happens to the file
system metadata.


DataNode

A DataNode is a storage node in HDFS where actual data blocks are stored.

There are multiple DataNodes in a Hadoop cluster.

Heartbeat Mechanism:

▪ Each DataNode sends a "heartbeat" signal to the NameNode at regular intervals.

▪ This ensures that the DataNode is active and working.

What Happens If a DataNode Fails?


▪ If a DataNode stops sending heartbeats, the NameNode marks it as dead.

▪ NameNode then creates a new copy of lost data on other available DataNodes to maintain reliability.

Secondary NameNode
▪ It takes periodic snapshots (backups) of HDFS metadata to prevent data loss.

▪ It helps in recovering the system if the NameNode fails.

▪ Since the memory requirements of Secondary NameNode are the same as NameNode, it is better to
run NameNode and Secondary NameNode on different machines.

Why is it needed?

▪ The NameNode stores important metadata about files, but if it crashes, this metadata might be lost.

▪ The Secondary NameNode keeps backups to help in recovery.

Limitations of the Secondary NameNode:

▪ It does not replace the NameNode in real-time.


▪ It must be manually configured to take over if the NameNode fails.


Anatomy of File Read

1. The client opens the file that it wishes to read by calling open() on the DistributedFileSystem.
2. DistributedFileSystem communicates with the NameNode to get the locations of the data blocks. The
NameNode returns the addresses of the DataNodes that the data blocks are stored on. The
DistributedFileSystem then returns an FSDataInputStream to the client to read from the file.
3. The client then calls read() on the stream. The DFSInputStream, which holds the addresses of the
DataNodes for the first few blocks of the file, connects to the closest DataNode for the first block in the file.
4. The client calls read() repeatedly to stream the data from the DataNode.
5. When the end of a block is reached, DFSInputStream closes the connection to that DataNode and repeats
these steps to find the best DataNode for the next block and for subsequent blocks.
6. When the client completes reading the file, it calls close() on the FSDataInputStream to close
the connection.
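
The same sequence is visible from client code. The following is a minimal sketch, assuming a reachable HDFS cluster configured through core-site.xml/hdfs-site.xml and reusing the hypothetical path /sample/test.txt from the HDFS commands section; FileSystem.get() returns the DistributedFileSystem, and open() returns the FSDataInputStream described in the steps above.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // DistributedFileSystem when fs.defaultFS is hdfs://
        Path file = new Path("/sample/test.txt");      // hypothetical file path

        // Steps 1-2: open() asks the NameNode for block locations and returns an FSDataInputStream.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            // Steps 3-5: read() streams the blocks from the closest DataNodes, block by block.
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }   // Step 6: close() ends the connection.
    }
}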


Anatomy of File Write

1. The client starts by calling create() on the DistributedFileSystem to make a new file.
2. The NameNode is contacted via an RPC call to check if the file already exists and to create it if not.
Initially, the file is created without any data blocks. Then, the DistributedFileSystem returns an
FSDataOutputStream to allow the client to write data.
3. As the client writes data, the DFSOutputStream breaks it into smaller pieces (packets) and puts them
in a data queue. A component called DataStreamer takes the data from the queue and manages the
writing process.
4. The NameNode selects a set of DataNodes to store copies (replicas) of the data. These DataNodes
form a pipeline. By default, there are three DataNodes in the pipeline.
5. The DataStreamer sends data packets to the first DataNode, which stores the packet and forwards it to
the second DataNode, which then forwards it to the third DataNode.
6. The DFSOutputStream also keeps track of packets waiting for confirmation (acknowledgment) from
the DataNodes. A packet is removed from the queue only after all DataNodes have confirmed it.
7. When the client finishes writing, it calls close() on the stream. This ensures that all remaining packets
are sent, acknowledgments are received, and the NameNode is updated to confirm that the file creation
is complete.
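
A matching client-side sketch of the write path is given below, under the same assumptions as the read example (a reachable cluster and a hypothetical target path /sample/output.txt): create() returns the FSDataOutputStream described above, and close() flushes the remaining packets and finalizes the file with the NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/sample/output.txt");    // hypothetical target path

        // Steps 1-2: create() contacts the NameNode via RPC and returns an FSDataOutputStream.
        try (FSDataOutputStream out = fs.create(file)) {
            // Steps 3-6: the data is split into packets and pushed through the DataNode pipeline.
            out.writeBytes("hello hdfs\n");
        }   // Step 7: close() sends the remaining packets and completes the file.
    }
}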

Replica Placement Strategy


Hadoop Default Replica Placement Strategy

1. The first replica is placed on the same node where the client is running.

2. The second replica is placed on a node that is in a different rack.


3. The third replica is placed on the same rack as the second replica but on a different node.

4. After placing the replicas, a pipeline is created for data flow.


5. This strategy ensures high reliability in case of failures.


Working with HDFS Commands


1. hadoop fs -ls / : To get the list of directories and files at the root of HDFS.
2. hadoop fs -ls -R / : To get the list of complete directories and files of HDFS.
3. hadoop fs -mkdir /sample : To create a directory (say, sample) in HDFS.
4. hadoop fs -put /root/sample/test.txt /sample/test.txt : To copy a file from local file system to HDFS.
5. hadoop fs -get /sample/test.txt /root/sample/testsample.txt : To copy a file from HDFS to local file
system.
6. hadoop fs -copyFromLocal /root/sample/test.txt /sample/testsample.txt : To copy a file from local
file system to HDFS via copyFromLocal command.
7. hadoop fs -copyToLocal /sample/test.txt /root/sample/testsample1.txt : To copy a file from Hadoop
file system to local file system via copyToLocal command.
8. hadoop fs -cat /sample/test.txt: To display the contents of an HDFS file on console.
9. hadoop fs -cp /sample/test.txt /sample1: To copy a file from one directory to another on HDFS.
10. hadoop fs -rm -r /sample1: To remove a directory from HDFS.
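
The same file-system operations can also be performed programmatically through Hadoop's FileSystem Java API. The sketch below mirrors a few of the shell commands listed above; it is only an illustrative example that assumes a running HDFS cluster and reuses the example paths from the command list.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/sample"));                               // hadoop fs -mkdir /sample
        fs.copyFromLocalFile(new Path("/root/sample/test.txt"),
                             new Path("/sample/test.txt"));           // hadoop fs -put / -copyFromLocal
        for (FileStatus status : fs.listStatus(new Path("/"))) {      // hadoop fs -ls /
            System.out.println(status.getPath());
        }
        fs.copyToLocalFile(new Path("/sample/test.txt"),
                           new Path("/root/sample/testsample.txt"));  // hadoop fs -get / -copyToLocal
        fs.delete(new Path("/sample"), true);                         // hadoop fs -rm -r /sample
    }
}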

Processing Data with Hadoop


How It Works:
▪ The input data is divided into smaller parts called chunks.
▪ Map tasks process these chunks in parallel.
▪ The output from the map tasks is stored temporarily.
▪ The system sorts and organizes this data using keys.
▪ This sorted data is then sent to reduce tasks.
▪ Reduce tasks combine and process the data to generate the final output.


Why Is It Efficient?
▪ Data Locality: Tasks are scheduled on nodes where the data is already stored. This
avoids unnecessary data transfer and improves speed.
▪ It also handles failures, re-executes failed tasks, and manages job scheduling
automatically.
Important Components:
▪ JobTracker (Master): Manages and schedules tasks.
▪ TaskTracker (Slave): Executes the assigned tasks.
Job Execution:
▪ A job client submits a job to the JobTracker.
▪ The JobTracker schedules tasks to TaskTrackers and monitors their progress.
MapReduce Framework
In Hadoop, there are two main components (daemons) that help in processing data using the
MapReduce framework:
1. JobTracker (Master Node)
▪ Think of it as the manager or supervisor of the system.

▪ When you submit a job (code) to Hadoop, the JobTracker decides how to divide the work and assign
it to different computers (nodes).

▪ It keeps track of all running tasks and reschedules them if any task fails.

▪ Each Hadoop cluster has only one JobTracker, which manages the entire MapReduce job.
2. TaskTracker (Worker Nodes)

▪ TaskTrackers are like workers who execute the actual tasks.

▪ Each node in the cluster has a TaskTracker, which runs the tasks assigned by the JobTracker.

▪ It runs multiple Map or Reduce tasks in parallel using Java Virtual Machines (JVMs).

▪ TaskTracker continuously sends a heartbeat signal to JobTracker to inform that it is still working.

▪ If a TaskTracker fails, the JobTracker assumes it is dead and assigns the task to another available node.


How Does MapReduce Work?


MapReduce divides a data analysis task into two parts – map and reduce.

Figure 5.23 depicts how MapReduce programming works. In this example, there are two mappers and one
reducer. Each mapper works on the partial dataset stored on its node, and the reducer combines the
output from the mappers to produce the reduced result set.


Figure 5.24 describes the working model of MapReduce Programming. The following steps describe how
MapReduce performs its task.
1. First, the input dataset is split into multiple pieces of data (several small subsets).

2. Next, the framework creates a master process and several worker processes, and executes the worker
processes remotely.

3. Several map tasks run simultaneously, each reading the piece of data assigned to it. The map worker
applies the map function to the data present on its server and generates key/value pairs for the
extracted data.

4. The map worker uses a partitioner function to divide the data into regions; the partitioner decides
which reducer should receive the output of a given mapper.

5. When the map workers complete their work, the master instructs the reduce workers to begin their
work. The reduce workers in turn contact the map workers to get the key/value data for their partition.
The data thus received is shuffled and sorted as per keys.

6. The reduce worker then calls the reduce function for every unique key. This function writes the output to the file.

7. When all the reduce workers complete their work, the master transfers the control to the user program.
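
As a concrete illustration of these steps, the classic word-count program is sketched below using the standard Hadoop MapReduce Java API. It is a minimal example: the mapper emits a (word, 1) pair for every word in its input split, the reducer sums the counts for each unique key after the shuffle and sort, and the input and output paths are supplied as command-line arguments (placeholders, e.g. an HDFS input directory and a not-yet-existing output directory).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split assigned to this mapper.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each unique key after shuffle and sort.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}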

