
UNIT-II

Syllabus

Main challenges of Big Data
• Storage of data
• Processing of data
What is Hadoop?
• Hadoop is an open-source software framework that is used for storing and processing
large amounts of data in a distributed computing environment.
• It is designed to handle big data and is based on the MapReduce programming model,
which allows for the parallel processing of large datasets.
• It is written in the Java programming language
• Hadoop is not a database
• The storage solution in Hadoop is named HDFS (Hadoop Distributed File System)
• The processing solution in Hadoop is named MapReduce
• It was developed by Doug Cutting and Michael J. Cafarella.
History of Hadoop

History of Hadoop?
• Hadoop was developed based on the paper on the Google File System published by Google in October 2003.
• This paper explained how to store massive amounts of data.
• In 2004, Google released one more paper, on MapReduce.
• This paper explained how to process massive amounts of data.
• A few individuals (Doug Cutting and Mike Cafarella) at Yahoo implemented these papers and developed a framework which was named Hadoop.
• In April 2006, Hadoop 0.1.0 was released.
• Later on, Hadoop was handed over to the Apache Software Foundation.
Big Data Storage Systems
• Distributed file systems
• Sharding across multiple databases
• Key-value storage systems
• Parallel and distributed databases
Distributed File System (DFS)
Distributed File Systems
• A distributed file system stores data across a large collection of machines, but provides
single file-system view.
• A Distributed File System (DFS) is a file system that is distributed on multiple file servers
or multiple locations.
• Highly scalable distributed file system for large data-intensive applications.
• E.g., 10K nodes, 100 million files, 10 PB
• Provides redundant storage of massive amounts of data on cheap and unreliable computers
• Files are replicated to handle hardware failure
• Detects failures and recovers from them
• Examples:
• Google File System (GFS)
• Hadoop Distributed File System (HDFS)
Google File System (GFS)
Google File System
• Google File System is a distributed file system developed by Google to provide efficient, reliable access to
data.
• GFS provides fault tolerance, reliability, scalability, availability, and performance to large networks and connected nodes
• GFS is made up of several storage systems built from low-cost commodity hardware.
• GFS is not an open-source framework, as it was developed for Google's own requirements.
• To meet the needs of growing Big Data, others also required such a framework, which led to Hadoop.

Hadoop Distributed File System (HDFS)
HDFS
• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run
on commodity hardware.
• It has many similarities with existing distributed file systems.
• However, the differences from other distributed file systems are significant.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
• HDFS provides high throughput access to application data and is suitable for applications
that have large data sets.
• HDFS was originally built as infrastructure for the Apache Nutch web search engine
project.
• HDFS is now an Apache Hadoop subproject.

Hadoop File System Architecture
• Single Namespace for the entire cluster
• Files are broken up into blocks
  • Typically 64 MB block size
  • Each block replicated on multiple DataNodes
• Client
  • Finds location of blocks from NameNode
  • Accesses data directly from DataNode
Hadoop Distributed File System (HDFS)

• NameNode
• Maps a filename to list of Block IDs
• Maps each Block ID to DataNodes containing a replica of the
block
• DataNode: Maps a Block ID to a physical location on disk
• Data Coherency
• Write-once-read-many access model
• Client can only append to existing files
• Distributed file systems are good for millions of large files
• But they have very high overheads and poor performance with billions of smaller tuples
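To make the client flow concrete, here is a minimal sketch (not part of the original slides) of reading a file through the Hadoop Java FileSystem API; the file path is a hypothetical example.

// Minimal sketch: reading a file from HDFS via the Java FileSystem API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);              // the client asks the NameNode for metadata
        Path file = new Path("/user/demo/sample.txt");     // hypothetical file path

        // Block locations come from the NameNode; the block contents are then
        // streamed directly from the DataNodes that hold the replicas.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}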
Components of Hadoop
Components of Hadoop
• There are three components of Hadoop:

1. HDFS - Hadoop Distributed File System (HDFS) is the storage unit. Files are broken into blocks and stored in nodes over the distributed architecture.

2. MapReduce - Hadoop MapReduce is the processing unit. It is a framework which helps Java programs to do parallel computation on data using key-value pairs.

3. YARN - Yet Another Resource Negotiator (YARN) is the resource management unit. It is used for job scheduling and managing the cluster.
Features/Advantages of Hadoop
• Open Source
• Ability to store a large amount of data
• Fault tolerance
• Cost effective
• Compatible with all platforms
• Distributed computing
• Parallel processing
Data Format
What is a data format?
• A data format is the arrangement of data fields into a specific structure.
• A file format is the definition of how information is stored in HDFS.
• Hadoop does not have a default file format and the choice of a format
depends on its use.
• Basic file formats are: Text format, Key-Value format, Sequence
format
• Other formats which are used and are well known are: Avro,
Parquet, RC or Row-Columnar format, ORC or Optimized Row
Columnar format

Needs of a Data Format
• A file should
  • Get read fast
  • Get written fast
  • Be splittable, i.e. multiple tasks can run in parallel on parts of the file
  • Support advanced compression through the various available compression codecs
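As a hedged sketch (not from the original slides) of how these needs can be addressed, the snippet below configures a MapReduce job to write block-compressed SequenceFile output, which stays splittable; the Snappy codec is an assumption and must be available on the cluster.

// Sketch: compressed, splittable SequenceFile output for a MapReduce job.
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatConfig {
    public static void configure(Job job) {
        // Store output as a binary SequenceFile instead of plain text.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // Enable output compression with an available codec (Snappy as an assumption).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

        // Block-level compression keeps the file splittable for parallel tasks.
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    }
}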
Text File Format (CSV, TSV)
Sequence File Format
Avro File Format
Row Columnar (RC) File Format
ORC File Format
Scaling Up Vs. Scaling Out
Scaling Up Vs. Scaling Out
• Once a decision has been made for data scaling, the specific scaling
approach must be chosen.
• There are two commonly used types of data scaling:
1. Up
2. Out
• Scaling up, or vertical scaling:
• It involves obtaining a faster server with more powerful processors and more memory.
• This solution uses less network hardware and consumes less power.
• But for many platforms, it may only provide a short-term fix, especially if continued growth is expected.
Scaling Up Vs. Scaling Out
• Scaling out or horizontal scaling :
• It involves adding servers for parallel computing.
• The scale-out technique is a long-term solution, as more and more servers
may be added when needed.
• But going from one monolithic system to this type of cluster may be difficult, although it is an extremely effective solution.
How MapReduce Works?
• We have an Input Reader which is responsible for reading the input data and producing the list of key-value pairs.
• We can read data in .csv format, in delimited format, from a database table, image data (.jpg, .png), audio data, etc.
• This list of key-value pairs is fed to the Map phase, and the Mapper works on each of these key-value pairs to generate intermediate key-value pairs.
• After shuffling and sorting, the intermediate key-value pairs are fed to the Reducer; the final output produced by the Reducer is then written to HDFS. This is how a simple MapReduce job works.
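To make this flow concrete, below is a minimal word-count sketch (the canonical example, not taken from the original slides) using the standard Hadoop MapReduce Java API: the Mapper emits (word, 1) pairs, and after shuffle and sort the Reducer sums the counts for each word. The input and output HDFS paths are supplied as command-line arguments.

// Word count: Mapper emits (word, 1); Reducer sums counts per word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit an intermediate (word, 1) pair for every token in the line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // After shuffle and sort, all counts for the same word arrive together.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}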
Unit Tests with MR Unit
• MRUnit is a JUnit-based Java library that allows us to unit test Hadoop MapReduce programs.
• MRUnit supports testing Mappers and Reducers separately as well as testing MapReduce computations as a whole.
• MRUnit provides a powerful and light-weight approach to do test-driven development.
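As a hedged sketch (not from the original slides), the test below exercises the word-count Mapper and Reducer from the earlier example with MRUnit drivers, so no running Hadoop cluster is needed.

// Unit tests for the word-count example using MRUnit's Map and Reduce drivers.
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountTest {
    private MapDriver<Object, Text, Text, IntWritable> mapDriver;
    private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
        reduceDriver = ReduceDriver.newReduceDriver(new WordCount.IntSumReducer());
    }

    @Test
    public void mapperEmitsOnePerWord() {
        // The mapper should emit (word, 1) for every token, in order.
        mapDriver.withInput(new LongWritable(0), new Text("big data big"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .runTest();
    }

    @Test
    public void reducerSumsCounts() {
        // The reducer should sum all counts that arrive for the same word.
        reduceDriver.withInput(new Text("big"),
                               Arrays.asList(new IntWritable(1), new IntWritable(1)))
                    .withOutput(new Text("big"), new IntWritable(2))
                    .runTest();
    }
}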
Map Reduce – Job Tracker and Task Tracker
Hadoop Streaming
• It is a utility or feature that comes with a Hadoop distribution
• It is a Hadoop Library which makes it possible to use any binary as mapper or
reducer
• It allows developers or programmers to write the Map-Reduce program using
different programming languages like Ruby, Perl, Python, C++, etc
• We can use any language that can read from the standard input(STDIN) and
write using standard output(STDOUT).
• We all know the Hadoop Framework is completely written in Java, but programs for Hadoop do not necessarily need to be coded in Java.
• It uses Unix streams as the interface between Hadoop and our MapReduce program.
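As an illustration (not from the original slides), a streaming job can be launched from the command line roughly as shown below. The jar location varies by Hadoop version, and the input/output paths are assumptions; here the standard Unix tools cat and wc act as mapper and reducer, reading from STDIN and writing to STDOUT.

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/demo/input \
    -output /user/demo/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc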
Hadoop Pipes
• Hadoop Pipes is the name of the C++ interface to Hadoop
MapReduce.
• Unlike Streaming, it does not use standard input and output to communicate with the map and reduce code.
• Instead, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce function.
Hadoop Eco-System
Hadoop Ecosystem
• Hadoop Ecosystem is a platform or a suite which provides various services to
solve the big data problems.
• It is neither a programming language nor a service; it is a platform or framework which solves big data problems.
• As we know, Hadoop is a framework that manages big data storage. The Hadoop ecosystem covers Hadoop itself and other related big data tools.
• The Apache Hadoop ecosystem refers to the various components of the Hadoop software library; it includes open source projects and a complete range of tools.
Hadoop Ecosystem
• The components that collectively form a Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based data processing
Spark: In-memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: Machine learning algorithm libraries
Solr, Lucene: Searching and indexing
Zookeeper: Managing the cluster
Oozie: Job scheduling
Hadoop Ecosystem
HDFS:
• HDFS is the primary or major component of Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes and
thereby maintaining the metadata in the form of log files.
• HDFS consists of two core components, i.e.
  • Name Node
  • Data Node
• Name Node is the prime node which contains metadata (data about data)
• Data Node contains the actual data
YARN:
• Yet Another Resource Negotiator
• It performs scheduling and resource allocation for the Hadoop system.
• Consists of three major components, i.e.
  • Resource Manager
  • Node Manager
  • Application Manager
PIG:
• Pig helps to achieve ease of programming and optimization.
• Pig was originally developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
• It is a platform for structuring the data flow, and for processing and analyzing huge data sets.
• Pig does the work of executing commands, and after the processing, Pig stores the result in HDFS.
• The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
Mahout:
• Mahout adds machine learning capability to a system or application.
• Machine Learning, as the name suggests, helps the system to develop itself based on patterns, user/environmental interaction, or algorithms.
• It provides various libraries or functionalities such as collaborative filtering, clustering, and classification, which are concepts of Machine Learning.
• It allows invoking algorithms as per our need with the help of its own libraries.
Apache HBase:
• It is a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database.
• It provides the capabilities of Google's BigTable, and is thus able to work on Big Data sets effectively.
• At times when we need to search or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time.
• At such times, HBase comes in handy as it gives us a fault-tolerant way of storing such data.
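As a hedged illustration (not from the original slides) of this kind of quick point lookup, here is a minimal sketch using the HBase Java client API; the table name, row key, and column family/qualifier are hypothetical.

// Minimal point-lookup sketch with the HBase Java client; names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookupExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Fetch a single row by key instead of scanning the whole data set.
            Get get = new Get(Bytes.toBytes("user#42"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(name == null ? "not found" : Bytes.toString(name));
        }
    }
}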
Zookeeper:
• There was a huge issue with the management of coordination and synchronization among the resources or components of Hadoop, which often resulted in inconsistency.
• Zookeeper overcame all these problems by performing synchronization, inter-component communication, grouping, and maintenance.
Oozie:
• Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit.
• There are two kinds of jobs, i.e.
  • Oozie workflow jobs and
  • Oozie coordinator jobs.
• Oozie workflow jobs are jobs that need to be executed in a sequentially ordered manner.
• Oozie coordinator jobs are those that are triggered when some data or external stimulus is given to them.
HIVE:
• With the help of SQL methodology and interface, HIVE performs reading and writing of large data sets.
• Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time processing and batch processing.
• All the SQL data types are supported by Hive, thus making query processing easier.
• Similar to other query processing frameworks, HIVE comes with two components: JDBC Drivers and the HIVE Command Line.
• JDBC, along with ODBC drivers, works on establishing data storage permissions and connections, whereas the HIVE Command Line helps in the processing of queries.
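For illustration only (not from the original slides), a small HQL sketch is shown below; the table and column names are hypothetical.

-- Define a simple table over delimited text files (hypothetical schema).
CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Familiar SQL-style query; Hive compiles it into distributed jobs on the cluster.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;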
