
Chapter 2

Introduction to Hadoop

Introduction

1. What is Hadoop?
2. Core Hadoop Components
3. Hadoop Ecosystem
4. Physical Architecture
5. Hadoop Limitations

Hadoop

• An open-source software framework.
• Used for storing and processing Big Data.
• Runs on commodity hardware instead of high-end hardware.
Why Hadoop?

• Scalability
• No pre-processing of data required
• Handles unstructured data
• Divides data into blocks and chunks and stores them across multiple servers
• Processing is done in parallel across multiple connected machines
• Protection against hardware failure
• High throughput (optimized for throughput rather than low latency)
Hadoop Goals

• Scalable
– From a single server to thousands of servers
• Fault tolerant
• Economical
• Handles hardware failure
– Ability to detect and handle failures at the application layer
Hadoop Assumptions

1. Hardware will fail, since Hadoop runs on a large cluster of computers.
2. Processing will run in batches, so the aim is high throughput as opposed to low latency.
3. Applications that run on HDFS have large datasets, typically gigabytes to terabytes in size.
4. It should support tens of millions of files in a single instance.
5. Applications need a write-once-read-many access model.
6. Portability is important.
Core Components of Hadoop

1. Hadoop Common Package
• Provides file system and OS-level abstraction
• Contains libraries and utilities required by Hadoop modules

2. Hadoop Distributed File System (HDFS)
• Provides a limited interface for managing the file system

3. Hadoop MapReduce
• Key algorithm used to distribute work around a cluster

4. Hadoop YARN (Resource Management Platform)
• Responsible for managing computing resources
Hadoop Common Package

• Consists of the necessary JAR files and scripts needed to start Hadoop.
• Contains libraries and utilities required by the other Hadoop modules.
• Provides file system and operating-system-level abstraction.
Hadoop Distributed File System

• Manages storage and retrieval of the data and metadata required for computation.
• Creates multiple replicas of each data block and distributes them across computers to enable reliable and rapid access.
• When a file is loaded into HDFS, it is replicated and fragmented into "blocks" of data, which are stored across the cluster nodes (DataNodes).
• The NameNode is responsible for the storage and management of metadata.
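As a rough illustration of how a client program writes a file into HDFS, the sketch below uses Hadoop's Java FileSystem API; the NameNode URI, file path and replication factor are illustrative values, not taken from these slides.

// Minimal sketch: write a small file to HDFS through the Java FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // assumption: keep 3 replicas of each block

        // "hdfs://namenode:9000" is a placeholder NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/data/example.txt");   // illustrative path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");   // the file is split into blocks and replicated by HDFS
        }

        // The block size reported here comes from the NameNode's metadata.
        System.out.println("Block size: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}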
Main components of HDFS

1. Name Node

2. Data Node
Name Node

• Master node; contains the metadata.
• Maintains directories and files and manages the blocks which are present on the DataNodes.

• Functions of the NameNode:
1. Manages the namespace of the file system in memory
2. Maintains inode information
3. Maps inodes to the list of blocks and their locations
4. Ensures authorization and authentication
5. Creates checkpoints and logs the namespace changes
Data Node

• Slave node; provides the actual storage.
• Responsible for processing read and write requests from clients.

• Functions of the DataNode:
1. Handles block storage on multiple volumes
2. Maintains block integrity
3. Periodically sends heartbeat signals and block reports to the NameNode
Hadoop Map-Reduce

• An algorithm / programming model.
• Enables parallel processing.
• Two phases (illustrated by the word-count sketch below):
1. Map Phase:
– The input is formed into a set of key-value pairs.
– The desired function is executed over each key-value pair to generate a set of intermediate key-value pairs.
2. Reduce Phase:
– The intermediate key-value pairs are grouped by key, and the values are combined according to the reduce algorithm provided by the user.
• HDFS is the storage system for both the input and the output of MapReduce jobs.
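A minimal word-count sketch of the two phases, using Hadoop's standard org.apache.hadoop.mapreduce API; the class names and whitespace tokenization are illustrative choices, not part of the original slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input line is turned into intermediate (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);   // emit intermediate pair
        }
    }
}

// Reduce phase: the framework groups intermediate pairs by key; values are combined here.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final (word, count) pair written to HDFS
    }
}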
Components of MapReduce

1. Job Tracker:
– Master which manages the jobs and resources in the cluster.
– It schedules each map task on a Task Tracker.
– There is one Job Tracker per cluster.

2. Task Tracker:
– Slaves which run on every machine in the cluster.
– Responsible for running Map and Reduce tasks as instructed by the Job Tracker.

3. JobHistoryServer:
– Daemon that saves historical information about tasks.
Yet Another Resource Negotiator (YARN)

• YARN is the processing framework in Hadoop. It provides:
• Resource management
• Job scheduling
• Job monitoring
Hadoop Ecosystem

• HDFS
1. It is the foundation for many other big data frameworks.
2. It provides scalable and reliable storage.
3. As the size of the data increases, we can add commodity hardware to increase storage capacity.

• YARN
1. Provides flexible scheduling and resource management over the HDFS storage.
2. Used at Yahoo to schedule jobs across 40,000 servers.

• MapReduce
1. A programming model.
2. Simplifies parallel computing.
3. Instead of dealing with the complexities of synchronization and scheduling, MapReduce deals with only two functions:
– Map()
– Reduce()
4. Used by Google for indexing websites.
Hadoop Ecosystem

• HIVE
1. A programming model.
2. Created at Facebook to issue SQL-like queries using MapReduce on their data in HDFS.
3. It is basically a data warehouse that provides ad-hoc queries, data summarization and analysis of huge data sets.

• PIG
1. A high-level programming model.
2. Processes and analyses big data using user-defined functions, reducing programming effort.
3. Provides a bridge to query data on Hadoop, but unlike HIVE it uses a script-based implementation to make Hadoop data accessible to developers.
4. Created at Yahoo to model data-flow-based programs using MapReduce.

• Giraph
1. A specialized model for graph processing.
2. Used by Facebook to analyze the social graph.
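As a rough illustration of HIVE's SQL-like interface, the sketch below submits a HiveQL query through the Hive JDBC driver; the server address, table name and column names are invented for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Requires the Hive JDBC jar on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 typically listens on port 10000; host and database are placeholders.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "user", "");
        Statement stmt = con.createStatement();

        // HiveQL looks like SQL but is compiled into distributed jobs over data in HDFS.
        ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS visits FROM web_logs GROUP BY page");
        while (rs.next()) {
            System.out.println(rs.getString("page") + " -> " + rs.getLong("visits"));
        }
        con.close();
    }
}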
Hadoop Ecosystem

• Spark
1. Real-time, in-memory data processing.
2. In-memory processing can be up to 100x faster for some tasks.
3. Spark provides an easier-to-use alternative to MapReduce and offers performance up to 10 times faster for certain applications.
4. To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python.

• Storm
1. Storm is a complex event processor (CEP).
2. It also works as a distributed computation framework for processing fast, large streams of data.
3. Real-time, in-memory data processing.

• Flink
1. Flink is a data processing system and an alternative to MapReduce.
2. It comes with its own runtime, rather than building on top of MapReduce.
3. Real-time, in-memory data processing.
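To illustrate why Spark's API is considered more concise than raw MapReduce, here is a word-count sketch using Spark's Java API; the application name, master setting and the input/output paths are illustrative assumptions.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // "local[*]" runs Spark on the local machine; a real cluster would use YARN or standalone mode.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/data/example.txt");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // map: split into words
                .mapToPair(word -> new Tuple2<>(word, 1))                        // emit (word, 1) pairs
                .reduceByKey(Integer::sum);                                      // reduce: sum counts per word

        counts.saveAsTextFile("hdfs://namenode:9000/data/wordcount-output");
        sc.close();
    }
}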
Hadoop Ecosystem

• HBase
1. It is the Hadoop database.
2. A NoSQL / non-relational distributed database.
3. It serves as a backing store for the outputs of MapReduce jobs.
4. HBase is column-oriented rather than row-oriented, for fast processing.
5. Facebook has also used HBase for messaging.

• Cassandra
1. A free and open-source, distributed, wide-column database management system designed to handle large amounts of data across many commodity servers.
2. It provides high availability with no single point of failure.
3. A NoSQL / non-relational distributed database.
4. MapReduce can retrieve data from Cassandra.

• MongoDB
1. A NoSQL database.
2. A document-oriented database system.
3. It stores structured data as JSON-like documents.
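A small sketch of HBase's column-oriented data model using the HBase Java client API; the table name, column family, qualifier and row key are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Data is addressed by row key, column family and column qualifier, not by SQL rows.
            Put put = new Put(Bytes.toBytes("user-123"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            Get get = new Get(Bytes.toBytes("user-123"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}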
Hadoop Ecosystem

• Zookeeper
1. A coordination service that gives you the tools you need to write correct distributed applications.
2. Manages the cluster.
3. Running all of these tools requires a centralized management system for synchronization, configuration and to ensure high availability.

• Mahout, Spark MLlib -> Machine learning
• Apache Drill -> SQL on Hadoop
• Oozie -> Job scheduling
• Flume, Sqoop -> Data ingestion services
• Solr & Lucene -> Searching & indexing
• Ambari -> Provision, monitor and maintain the cluster
Physical Architecture

• Combining a cloud environment with big data processing tools such as Hadoop provides the high-performance computing power needed to analyze vast amounts of data efficiently and cost-effectively.

• Typical machine configuration for storage and compute servers:
1. 32 GB memory
2. 4-core processors
3. 200-320 GB hard disk

• Running Hadoop in virtualized environments continues to develop and mature with initiatives from open-source software projects.
Cloud Computing Infrastructure to Support Big Data Analytics

[Figure: a cloud integration environment connecting storage nodes and compute nodes over a cloud management VLAN, with HBase VMs, Zookeeper VMs, an LDAP VM, a database, and a web console/web server; each machine is configured with 32 GB memory, 4 cores and a 200-320 GB disk.]
Hadoop Cluster: Small/Large

• Small Hadoop Cluster:

[Figure: a small Hadoop cluster with one master node running the JobTracker, TaskTracker, NameNode and a DataNode, and several worker nodes each running a DataNode and a TaskTracker.]

• Every Hadoop-compatible file system should provide location awareness (as well as data awareness) for effective scheduling of work.
• A Hadoop application uses this information to find the DataNode holding the data and run the task there.
• HDFS replicates data so that copies are kept on different racks, reducing the impact of a rack power or switch failure.
Hadoop-compatible file systems provide location awareness

[Figure: a client requests File.txt; the NameNode's metadata maps Block A to DataNodes 3 and 4, Block B to DataNode 5, and Block C to DataNode 6. The DataNodes sit behind separate switches: Rack 5 holds DataNodes 3 and 5, Rack 7 holds DataNode 4, and Rack 9 holds DataNode 6.]
Hadoop Limitations

1. Security concerns
• Hadoop's security model is disabled by default due to its sheer complexity.
• It does not provide encryption at the storage and network levels.

2. Vulnerable by nature
• Written entirely in Java.
• Java is one of the languages most widely exploited by cyber criminals.

3. Not fit for small data
• Not all big data platforms are suitable for handling small files.
• Due to its high-capacity design, HDFS handles small files inefficiently.
• Not recommended for small-scale industries.
Hadoop Limitations (contd.)

4. Potential stability issues
• Like all open-source software, Hadoop has stability issues, although improvements are constantly being made.
• To avoid these issues, organizations are strongly recommended to make sure they are running the latest stable version, or to run it under a third-party vendor equipped to handle such problems.

5. General limitations
• Google has noted that Hadoop may not be the only answer for big data.
• Google offers its own Cloud Dataflow as a possible alternative.
• Companies could be missing out on many other benefits by using Hadoop alone.
Hadoop

Thank You
