
Chapter 2

Introduction to Hadoop

Introduction

1. What is Hadoop?
2. Core Hadoop Components
3. Hadoop Ecosystem
4. Physical Architecture
5. Hadoop Limitations

Hadoop

• An open-source software framework.
• Used for storing and processing Big Data.
• Runs on commodity hardware instead of high-end hardware.
Why Hadoop?

• Scalability
• No pre-processing of data required
• Handles unstructured data
• Divides data into blocks and chunks and stores them across multiple servers
• Processing is done in parallel across multiple connected machines
• Protection against hardware failure
• High throughput (optimized for throughput rather than low latency)
Hadoop Goals

• Scalable
– From a single server to thousands of servers
• Fault tolerant
• Economical
• Handles hardware failure
– Ability to detect and handle failures at the application layer
Hadoop Assumptions

1. Hardware will fail, since Hadoop runs on a large cluster of computers.
2. Processing will run in batches, so the aim is high throughput as opposed to low latency.
3. Applications that run on HDFS have large datasets, typically gigabytes to terabytes in size.
4. It should support tens of millions of files in a single instance.
5. Applications need a write-once-read-many access model.
6. Portability is important.
Core Components of Hadoop

1. Hadoop Common Package
• Provides file system and OS-level abstraction
• Contains libraries and utilities required by Hadoop modules

2. Hadoop Distributed File System (HDFS)
• Provides a limited interface for managing the file system

3. Hadoop MapReduce
• Key algorithm used to distribute work around a cluster

4. Hadoop YARN (Resource Management Platform)
• Responsible for managing computing resources
Hadoop Common Package

• Consists of the necessary JAR files and scripts needed to start Hadoop.
• Contains libraries and utilities required by the other Hadoop modules.
• Provides file system and operating-system-level abstraction.
Hadoop Distributed File System

• Manages storage and retrieval of the data and metadata required for computation.
• Creates multiple replicas of each data block and distributes them across computers to enable reliable and rapid access.
• When a file is loaded into HDFS, it is replicated and fragmented into "blocks" of data, which are stored across the cluster nodes (DataNodes).
• The NameNode is responsible for the storage and management of metadata.
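As a rough illustration of how a client program writes a file into HDFS, the sketch below uses Hadoop's Java FileSystem API; the NameNode URI, file path and replication factor are illustrative values, not taken from these slides.

// Minimal sketch: write a small file to HDFS through the Java FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // assumption: keep 3 replicas of each block

        // "hdfs://namenode:9000" is a placeholder NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/data/example.txt");   // illustrative path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");   // the file is split into blocks and replicated by HDFS
        }

        // The block size reported here comes from the NameNode's metadata.
        System.out.println("Block size: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}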
Main components of HDFS

1. Name Node

2. Data Node
Name Node

• Master node; contains the metadata.
• Maintains directories and files and manages the blocks which are present on the DataNodes.

• Functions of the NameNode:
1. Manages the namespace of the file system in memory
2. Maintains inode information
3. Maps inodes to the list of blocks and their locations
4. Ensures authorization and authentication
5. Creates checkpoints and logs the namespace changes
Data Node

• Slave node; provides the actual storage.
• Responsible for processing read and write requests from clients.

• Functions of the DataNode:
1. Handles block storage on multiple volumes
2. Maintains block integrity
3. Periodically sends heartbeat signals and block reports to the NameNode
Hadoop Map-Reduce

• An algorithm / programming model.
• Enables parallel processing.
• Two phases (illustrated by the word-count sketch below):
1. Map Phase:
– The input is formed into a set of key-value pairs.
– The desired function is executed over each key-value pair to generate a set of intermediate key-value pairs.
2. Reduce Phase:
– The intermediate key-value pairs are grouped by key, and the values are combined according to the reduce algorithm provided by the user.
• HDFS is the storage system for both the input and the output of MapReduce jobs.
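A minimal word-count sketch of the two phases, using Hadoop's standard org.apache.hadoop.mapreduce API; the class names and whitespace tokenization are illustrative choices, not part of the original slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input line is turned into intermediate (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);   // emit intermediate pair
        }
    }
}

// Reduce phase: the framework groups intermediate pairs by key; values are combined here.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final (word, count) pair written to HDFS
    }
}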
Components of MapReduce

1. Job Tracker:
– Master which manages the jobs and resources in the cluster.
– It schedules each map task on a Task Tracker.
– There is one Job Tracker per cluster.

2. Task Tracker:
– Slaves which run on every machine in the cluster.
– Responsible for running Map and Reduce tasks as instructed by the Job Tracker.

3. JobHistoryServer:
– Daemon that saves historical information about tasks.
Yet Another Resource Negotiator (YARN)

• YARN is the processing framework in Hadoop. It provides:
• Resource management
• Job scheduling
• Job monitoring
Hadoop Ecosystem

• HDFS
1. It is the foundation for many other big data frameworks.
2. It provides scalable and reliable storage.
3. As the size of the data increases, we can add commodity hardware to increase storage capacity.

• YARN
1. Provides flexible scheduling and resource management over the HDFS storage.
2. Used at Yahoo to schedule jobs across 40,000 servers.

• MapReduce
1. A programming model.
2. Simplifies parallel computing.
3. Instead of dealing with the complexities of synchronization and scheduling, MapReduce deals with only two functions:
– Map()
– Reduce()
4. Used by Google for indexing websites.
Hadoop Ecosystem

• HIVE
1. A programming model.
2. Created at Facebook to issue SQL-like queries using MapReduce on their data in HDFS.
3. It is basically a data warehouse that provides ad-hoc queries, data summarization and analysis of huge data sets.

• PIG
1. A high-level programming model.
2. Processes and analyses big data using user-defined functions, reducing programming effort.
3. Provides a bridge to query data on Hadoop, but unlike HIVE it uses a script-based implementation to make Hadoop data accessible to developers.
4. Created at Yahoo to model data-flow-based programs using MapReduce.

• Giraph
1. A specialized model for graph processing.
2. Used by Facebook to analyze the social graph.
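As a rough illustration of HIVE's SQL-like interface, the sketch below submits a HiveQL query through the Hive JDBC driver; the server address, table name and column names are invented for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Requires the Hive JDBC jar on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 typically listens on port 10000; host and database are placeholders.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "user", "");
        Statement stmt = con.createStatement();

        // HiveQL looks like SQL but is compiled into distributed jobs over data in HDFS.
        ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS visits FROM web_logs GROUP BY page");
        while (rs.next()) {
            System.out.println(rs.getString("page") + " -> " + rs.getLong("visits"));
        }
        con.close();
    }
}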
Hadoop Ecosystem

• Spark
1. Real-time, in-memory data processing.
2. In-memory processing can be up to 100x faster for some tasks.
3. Spark provides an easier-to-use alternative to MapReduce and offers performance up to 10 times faster for certain applications.
4. To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python.

• Storm
1. Storm is a complex event processor (CEP).
2. It also works as a distributed computation framework for processing fast, large streams of data.
3. Real-time, in-memory data processing.

• Flink
1. Flink is a data processing system and an alternative to MapReduce.
2. It comes with its own runtime, rather than building on top of MapReduce.
3. Real-time, in-memory data processing.
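To illustrate why Spark's API is considered more concise than raw MapReduce, here is a word-count sketch using Spark's Java API; the application name, master setting and the input/output paths are illustrative assumptions.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // "local[*]" runs Spark on the local machine; a real cluster would use YARN or standalone mode.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/data/example.txt");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // map: split into words
                .mapToPair(word -> new Tuple2<>(word, 1))                        // emit (word, 1) pairs
                .reduceByKey(Integer::sum);                                      // reduce: sum counts per word

        counts.saveAsTextFile("hdfs://namenode:9000/data/wordcount-output");
        sc.close();
    }
}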
Hadoop Ecosystem

• HBase
1. It is the Hadoop database.
2. A NoSQL / non-relational distributed database.
3. It serves as a backing store for the outputs of MapReduce jobs.
4. HBase is column-oriented rather than row-oriented, for fast processing.
5. Facebook has also used HBase for messaging.

• Cassandra
1. A free and open-source, distributed, wide-column database management system designed to handle large amounts of data across many commodity servers.
2. It provides high availability with no single point of failure.
3. A NoSQL / non-relational distributed database.
4. MapReduce can retrieve data from Cassandra.

• MongoDB
1. A NoSQL database.
2. A document-oriented database system.
3. It stores structured data as JSON-like documents.
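A small sketch of HBase's column-oriented data model using the HBase Java client API; the table name, column family, qualifier and row key are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Data is addressed by row key, column family and column qualifier, not by SQL rows.
            Put put = new Put(Bytes.toBytes("user-123"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            Get get = new Get(Bytes.toBytes("user-123"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}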
Hadoop Ecosystem

• Zookeeper
1. A coordination service that gives you the tools you need to write correct distributed applications.
2. Manages the cluster.
3. Running all of these tools requires a centralized management system for synchronization, configuration and to ensure high availability.

• Mahout, Spark MLlib -> Machine learning
• Apache Drill -> SQL on Hadoop
• Oozie -> Job scheduling
• Flume, Sqoop -> Data ingestion services
• Solr & Lucene -> Searching & indexing
• Ambari -> Provision, monitor and maintain the cluster
Physical Architecture

• Combining a cloud environment with big data processing tools such as Hadoop provides the high-performance computing power needed to analyze vast amounts of data efficiently and cost-effectively.

• Typical machine configuration for storage and compute servers:
1. 32 GB memory
2. 4-core processors
3. 200-320 GB hard disk

• Running Hadoop in virtualized environments continues to develop and mature with initiatives from open-source software projects.
Cloud Computing Infrastructure to Support Big Data Analytics

[Figure: a cloud integration environment connecting storage nodes and compute nodes over a cloud management VLAN, with HBase VMs, Zookeeper VMs, an LDAP VM, a database, and a web console/web server; each machine is configured with 32 GB memory, 4 cores and a 200-320 GB disk.]
Hadoop Cluster: Small/Large

• Small Hadoop Cluster:

[Figure: a small Hadoop cluster with one master node running the JobTracker, TaskTracker, NameNode and a DataNode, and several worker nodes each running a DataNode and a TaskTracker.]

• Every Hadoop-compatible file system should provide location awareness (as well as data awareness) for effective scheduling of work.
• A Hadoop application uses this information to find the DataNode holding the data and run the task there.
• HDFS replicates data so that copies are kept on different racks, reducing the impact of a rack power or switch failure.
Hadoop-compatible file systems provide location awareness

[Figure: a client requests File.txt; the NameNode's metadata maps Block A to DataNodes 3 and 4, Block B to DataNode 5, and Block C to DataNode 6. The DataNodes sit behind separate switches: Rack 5 holds DataNodes 3 and 5, Rack 7 holds DataNode 4, and Rack 9 holds DataNode 6.]
Hadoop Limitations

1. Security concerns
• Hadoop's security model is disabled by default due to its sheer complexity.
• It does not provide encryption at the storage and network levels.

2. Vulnerable by nature
• Written entirely in Java.
• Java is one of the languages most widely exploited by cyber criminals.

3. Not fit for small data
• Not all big data platforms are suitable for handling small files.
• Due to its high-capacity design, HDFS handles small files inefficiently.
• Not recommended for small-scale industries.
Hadoop Limitations (contd.)

4. Potential stability issues
• Like all open-source software, Hadoop has stability issues, although improvements are constantly being made.
• To avoid these issues, organizations are strongly recommended to make sure they are running the latest stable version, or to run it under a third-party vendor equipped to handle such problems.

5. General limitations
• Google has noted that Hadoop may not be the only answer for big data.
• Google offers its own Cloud Dataflow as a possible alternative.
• Companies could be missing out on many other benefits by using Hadoop alone.
Hadoop

Thank You
