
• Big Data Analytics (BDA)

Unit-2
Hadoop
 Outline
• History of Hadoop
• Hadoop Distributed File System
• Developing a Map Reduce Application
• Hadoop Environment
• Hadoop Configuration
• Security in Hadoop
• Administering Hadoop
• Monitoring & Maintenance
• Hadoop Benchmarks
• Hadoop in the cloud
History of Hadoop
 Hadoop is an open-source software framework for storing and processing large datasets ranging in size from gigabytes to petabytes.
 Hadoop is developed and maintained as a project of the Apache Software Foundation.
 In 2008, Hadoop won the terabyte sort benchmark, becoming the fastest system at the time for sorting terabytes of data.
 There are basically two components in Hadoop:
1. Hadoop Distributed File System (HDFS):
 It allows you to store data of various formats across a cluster.
2. YARN:
 It handles resource management in Hadoop and allows parallel processing over the data stored across HDFS.

Basics of Hadoop
 Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
 It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
 Unlike data residing in the local file system of a personal computer, data in Hadoop resides in a distributed file system, called the Hadoop Distributed File System (HDFS).
 The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) that contain the data.
 This computational logic is nothing but a compiled version of a program written in a high-level language such as Java.
 Such a program processes data stored in Hadoop HDFS.

Advantages & Disadvantages of Hadoop
 Advantages:
 Varied Data Sources
 Cost-effective
 Performance
 Fault-Tolerant
 Highly Available
 Low Network Traffic
 High Throughput
 Open Source
 Scalable
 Ease of Use
 Compatibility
 Multiple Languages Supported
 Disadvantages:
 Issue With Small Files
 Vulnerable By Nature
 Processing Overhead
 Supports Only Batch Processing
 Iterative Processing
 Security

Why Is Hadoop Required?

Why Is Hadoop Required? - Traditional Restaurant Scenario

Why Is Hadoop Required? - Traditional Scenario

Why Is Hadoop Required? - Distributed Processing Scenario

Why Is Hadoop Required? - Distributed Processing Scenario Failure

Why Is Hadoop Required? - Solution to the Restaurant Problem

Why Is Hadoop Required? - Hadoop in the Restaurant Analogy
 Hadoop functions in much the same fashion as Bob's restaurant.
 Just as the food shelf is distributed across Bob's restaurant, in Hadoop the data is stored in a distributed fashion, with replication, to provide fault tolerance.
 For parallel processing, the data is first processed by the slave nodes where it is stored, producing intermediate results, and those intermediate results are then merged by the master node to produce the final result.

Hadoop Distributed File System
 The Hadoop Distributed File System (HDFS) was developed using a distributed file system design.
 It runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault-tolerant and designed for low-cost hardware.
 HDFS holds a very large amount of data and provides easy access. To store such huge data, the files are stored across multiple machines.
 These files are stored in a redundant fashion to protect the system from possible data loss in case of failure.
 HDFS also makes applications available for parallel processing.
 Features of HDFS
 It is suitable for distributed storage and processing.
 Hadoop provides a command interface to interact with HDFS (the equivalent programmatic access is sketched after this list).
 The built-in servers of the namenode and datanode help users easily check the status of the cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.
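A minimal sketch of that programmatic access through the HDFS Java FileSystem API; the NameNode address and the hadoop-client dependency are assumptions, not something stated on the slides:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            // List the contents of the HDFS root directory, similar to `hdfs dfs -ls /`.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}
```
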

HDFS Master-Slave Architecture
 HDFS follows a master-slave architecture and has the following elements.

Hadoop Core Components - HDFS
 Name Node: The NameNode represents every file and directory used in the namespace.
 Data Node: The DataNode helps you manage the state of an HDFS node and allows you to interact with its blocks.
 Secondary Node: The Secondary NameNode constantly reads all the file system metadata from the RAM of the NameNode and writes it to the hard disk.

Name Node
 The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software.
 It is software that can be run on commodity hardware.
 The system hosting the namenode acts as the master server, and it performs the following tasks:
 It executes file system namespace operations such as renaming, closing, and opening files and directories.
 It records each and every change that takes place to the file system metadata.
 It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
 It keeps a record of all the blocks in HDFS and the DataNodes on which they are stored.

Data Node
 It is the slave daemon/process that runs on each slave machine.
 The actual data is stored on the DataNodes.
 It is responsible for serving read and write requests from the clients.
 It is also responsible for creating blocks, deleting blocks, and replicating them based on the decisions taken by the NameNode.
 It sends heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.

Secondary Node
 The Secondary NameNode works concurrently with the primary NameNode as a helper daemon/process.
 It constantly reads all the file system metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
 It is responsible for combining the EditLogs with the FsImage from the NameNode.
 It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage.
 The new FsImage is copied back to the NameNode and is used the next time the NameNode is started.
 Hence, the Secondary NameNode performs regular checkpoints in HDFS. Therefore, it is also called the CheckpointNode.

Block
 A block is the minimum amount of data that HDFS can read or write.
 HDFS blocks are 128 MB by default, and this is configurable.
 Files in HDFS are broken into block-sized chunks, which are stored as independent units.

HDFS Architecture

Hadoop Cluster
 To Improve Network Performance
 The communication between nodes residing on different racks is directed via a switch.
 In general, you will find greater network bandwidth between machines in the same rack than between machines residing in different racks.
 This helps reduce write traffic between different racks, and thus provides better write performance.
 You also gain increased read performance, because you are using the bandwidth of multiple racks.
 To Prevent Loss of Data
 We don't have to worry about the data even if an entire rack fails because of a switch failure or power failure. Never put all your eggs in the same basket.

HDFS – Write Pipeline

Data Streaming and Replication

Shutdown of Pipeline or Acknowledgement stage

HDFS Write Architecture
 An HDFS client wants to write a file named "example.txt" of size 248 MB.
 Assume that the system block size is configured to 128 MB (the default).
 So, the client will divide the file "example.txt" into 2 blocks – one of 128 MB (Block A) and the other of 120 MB (Block B).
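A tiny illustrative sketch of that block-splitting arithmetic (a hypothetical helper, not a Hadoop API):

```java
public class BlockSplit {
    // Print the block layout for a file of the given size and block size.
    static void printBlocks(long fileSizeBytes, long blockSizeBytes) {
        int block = 0;
        for (long offset = 0; offset < fileSizeBytes; offset += blockSizeBytes) {
            long len = Math.min(blockSizeBytes, fileSizeBytes - offset);
            System.out.println("Block " + (char) ('A' + block++) + ": " + (len / (1024 * 1024)) + " MB");
        }
    }

    public static void main(String[] args) {
        // 248 MB file, 128 MB blocks -> "Block A: 128 MB", "Block B: 120 MB"
        printBlocks(248L * 1024 * 1024, 128L * 1024 * 1024);
    }
}
```
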

HDFS Write Architecture – Cont.
 Writing Process Steps:
 First, the HDFS client will reach out to the NameNode with a write request for the two blocks, say Block A and Block B.
 The NameNode will then grant the client write permission and will provide the IP addresses of the DataNodes to which the file blocks will eventually be copied.
 The selection of DataNode IP addresses is randomized, based on availability, replication factor, and rack awareness.
 The replication factor is set to the default, i.e. 3. Therefore, for each block the NameNode will provide the client a list of three DataNode IP addresses. The list will be unique for each block.
 For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}
 For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of DataNode 9}
 Each block will be copied to three different DataNodes to keep the replication factor consistent throughout the cluster.
 Now the whole data copy process will happen in three stages:
 Set-up of the Pipeline
 Data streaming and replication
 Shutdown of the Pipeline (Acknowledgement stage)

Hadoop Ecosystem
 Hadoop is a framework that can process large data sets across clusters of machines.
 As a framework, Hadoop is composed of multiple modules that are compatible with a large technology ecosystem.
 The Hadoop ecosystem is a platform or suite that provides various services to solve big data problems.
 It includes Apache projects and various commercial tools and solutions.
 Hadoop has four main elements, namely HDFS, MapReduce, YARN, and Hadoop Common.
 Most tools or solutions are used to supplement or support these core elements.
 All of these tools work together to provide services such as data ingestion, analysis, storage, and maintenance.

Hadoop Ecosystem
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming-based data processing
 Spark: In-memory data processing
 Pig, Hive: Query-based processing of data services
 HBase: NoSQL database
 Mahout, Spark MLlib: Machine learning algorithm libraries
 Solr, Lucene: Searching and indexing
 ZooKeeper: Cluster management
 Oozie: Job scheduling

Hadoop Ecosystem Distribution

HDFS
 The Hadoop Distributed File System is the core component, or, you can say, the backbone of the Hadoop Ecosystem.
 HDFS makes it possible to store large data sets of different types (i.e., structured, unstructured, and semi-structured data).
 HDFS creates a level of abstraction over the resources, from which we can see the whole of HDFS as a single unit.
 It helps us in storing our data across various nodes and in maintaining the log file about the stored data (the metadata).
 HDFS has two main components: the NameNode and the DataNode.

YARN
 Think of YARN as the brain of your Hadoop Ecosystem.
 It performs all your processing activities by allocating resources and scheduling tasks.
 As the name suggests, YARN is a resource negotiator that helps manage all the resources in the cluster.
 In short, it performs scheduling and resource allocation for the Hadoop system.
 It consists of three main components (a client-side sketch follows this list):
1. The Resource Manager has the authority to allocate resources to the applications running in the system.
2. The Node Manager is responsible for managing resources such as CPU, memory, and bandwidth on each machine and reporting them to the Resource Manager.
3. The Application Master acts as an interface between the Resource Manager and the Node Manager and negotiates between the two according to the application's requirements.
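A minimal sketch of asking the ResourceManager, through the YARN client API, for the applications and node resources it manages (a reachable cluster and the hadoop-yarn-client library are assumed; this is not from the slides):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnInfoExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration()); // picks up yarn-site.xml from the classpath
        yarnClient.start();

        // Applications currently known to the ResourceManager.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + "  " + app.getYarnApplicationState());
        }
        // Resources reported by each running NodeManager.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + "  " + node.getCapability());
        }
        yarnClient.stop();
    }
}
```
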

Map Reduce
 MapReduce is the core processing component of the Hadoop Ecosystem, as it provides the logic of processing.
 MapReduce is a software framework that helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.
 By making use of distributed and parallel algorithms, MapReduce carries the processing logic over to the data and helps to write applications that transform big data sets into manageable ones.
 MapReduce uses two functions, namely Map() and Reduce(), whose tasks are:
 The Map() function performs actions like filtering, grouping, and sorting.
 The Reduce() function aggregates and summarizes the results produced by the Map() function.

Pig
 Pig was developed by Yahoo. It uses the Pig Latin language, which is a query-based language similar to SQL.
 It is a platform for structuring data flows and for processing and analyzing massive data sets.
 Pig is responsible for executing the commands, and all the MapReduce activities are processed in the background. After processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework, which runs on Pig Runtime, just the way Java runs on the JVM.
 Pig helps simplify programming and optimization and is therefore an important part of the Hadoop ecosystem.
 Pig works like this: first, the load command loads the data; then we perform various functions on it, such as grouping, filtering, joining, and sorting.
 At last, you can either dump the data to the screen or store the result back in HDFS (a sketch of this flow, embedded in Java, follows).
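A minimal sketch of that load → filter/group → store flow, driven from Java through Pig's embedded PigServer API (the input file, field names, and output path are assumptions):

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlowExample {
    public static void main(String[] args) throws Exception {
        // Local mode for the sketch; ExecType.MAPREDUCE would run the same flow on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Each registered statement is Pig Latin: load, filter, group, aggregate.
        pig.registerQuery("orders = LOAD 'orders.csv' USING PigStorage(',') AS (id:int, city:chararray, amount:double);");
        pig.registerQuery("big = FILTER orders BY amount > 100.0;");
        pig.registerQuery("by_city = GROUP big BY city;");
        pig.registerQuery("totals = FOREACH by_city GENERATE group, SUM(big.amount);");
        // Store the result (locally here; on a cluster this would be an HDFS path).
        pig.store("totals", "totals_out");
        pig.shutdown();
    }
}
```
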

Hive
 With the help of SQL methodology and its interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).

HIVE + SQL = HQL

 It is highly scalable, as it supports both real-time processing and batch processing.
 In addition, Hive supports all SQL data types, making query processing easier.
 Similar to other query processing frameworks, Hive has two components: the JDBC driver and the Hive command line.
 The JDBC driver is used together with the ODBC driver to set up connections and data storage permissions, whereas the Hive command line facilitates query processing (a JDBC sketch follows).
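A minimal sketch of querying Hive from Java over the JDBC driver mentioned above; the HiveServer2 URL, credentials, and the employees table are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (requires the hive-jdbc dependency on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint, default database, and credentials.
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             // HQL looks like SQL; 'employees' is an assumed table.
             ResultSet rs = stmt.executeQuery(
                     "SELECT department, COUNT(*) FROM employees GROUP BY department")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```
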

Mahout
 Mahout provides machine learning for systems or applications.
 Machine learning, as the name suggests, helps systems develop themselves based on certain patterns, user/environment interactions, or algorithm-based fundamentals.
 It provides various libraries and functions, such as collaborative filtering, clustering, and classification, which are all machine learning concepts.
 It allows us to invoke algorithms as per our needs with the help of its own libraries.

Apache Spark
 Apache Spark is a framework for real-time data analytics in a distributed computing environment.
 Spark is written in Scala and was originally developed at the University of California, Berkeley.
 It executes in-memory computations to increase the speed of data processing over MapReduce.
 It can be up to 100x faster than Hadoop MapReduce for large-scale data processing, by exploiting in-memory computations and other optimizations. Therefore, it requires higher processing power (and memory) than MapReduce.
 Spark is better suited for real-time processing, while Hadoop is designed to store unstructured data and perform batch processing on it.
 When we combine the capabilities of Apache Spark with the low-cost operation of Hadoop on commodity hardware, we get the best results.
 So, many companies use Spark and Hadoop together to process and analyze big data stored in HDFS (a word-count sketch in Spark's Java API follows).
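For contrast with the MapReduce programming model shown later, a minimal word-count sketch using Spark's Java API (the input and output paths are assumptions):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Local mode for the sketch; drop setMaster() when submitting to a cluster.
        SparkConf conf = new SparkConf().setAppName("spark-word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt"); // illustrative path
            // Split lines into words and count each word, keeping intermediate data in memory.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///user/demo/output"); // output directory must not exist yet
        }
    }
}
```
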
Apache HBase
 HBase is an open-source, non-relational, distributed database. In other words, it is a NoSQL database.
 It supports all types of data, which is why it can handle anything within the Hadoop ecosystem.
 It is modelled on Google's BigTable, which is a distributed storage system designed to handle large data sets.
 HBase is designed to run on top of HDFS and provides BigTable-like capabilities.
 It gives us a fault-tolerant way of storing sparse data, which is common in most big data use cases.
 HBase is written in Java, and HBase applications can be written using the REST, Avro, and Thrift APIs.
 Let us take an example: you have billions of customer emails and you need to find out the number of customers who have used the word "complaint" in their emails.
 The request needs to be processed quickly, and this kind of fast, random read over huge data is what HBase is built for (a client sketch follows).
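A minimal sketch of storing and randomly reading a cell with the HBase Java client; the ZooKeeper quorum, the emails table, and its column family are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost"); // hypothetical quorum
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("emails"))) { // assumed table
            // Store one cell: row key = email id, column family "d", qualifier "body".
            Put put = new Put(Bytes.toBytes("email-0001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"), Bytes.toBytes("... complaint ..."));
            table.put(put);

            // Random, low-latency read by row key - the access pattern described above.
            Result result = table.get(new Get(Bytes.toBytes("email-0001")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("body"))));
        }
    }
}
```
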

Zookeeper
 There was a huge issue with the management of coordination and synchronization among the resources and components of Hadoop, which often resulted in inconsistency.
 Before ZooKeeper, it was very difficult and time-consuming to coordinate between the different services in the Hadoop Ecosystem.
 The services earlier had many problems with interactions, such as sharing common configuration while synchronizing data.
 Even when services are configured, changes in the configurations of the services make the setup complex and difficult to handle.
 Grouping and naming were also time-consuming.
 ZooKeeper overcame all these problems by performing synchronization, inter-component communication, grouping, and maintenance (see the sketch below).
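A minimal sketch of the kind of coordination ZooKeeper provides, using its Java client to publish one piece of shared configuration as a znode; the ensemble address, znode path, and value are assumptions:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to a (hypothetical) ZooKeeper ensemble and wait until the session is established.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a small piece of shared configuration that every component can read.
        if (zk.exists("/demo-config", false) == null) {
            zk.create("/demo-config", "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```
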

Oozie
 Oozie simply performs the task of a scheduler: it schedules jobs and binds them together as a single unit.
 There are two kinds of jobs:
1. Oozie workflow jobs
2. Oozie coordinator jobs
 Oozie workflow jobs are those that need to be executed in a sequentially ordered manner.
 Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.

Developing a Map Reduce Application
 MapReduce is a programming model for building applications that can process big data in parallel on multiple nodes.
 It provides analytical capabilities for analyzing large amounts of complex data.
 The traditional model is not suitable for processing huge volumes of data that cannot be accommodated by standard database servers.
 Google solved this problem using the MapReduce algorithm.
 MapReduce is a distributed data processing algorithm, introduced by Google.
 It is influenced by the functional programming model. In a cluster environment, the MapReduce algorithm is used to process large volumes of data efficiently, reliably, and in parallel.
 It uses a divide-and-conquer approach to process large volumes of data.
 It divides the input task into manageable sub-tasks that are executed in parallel.

MapReduce Architecture
Example:
Input to the MapReduce task:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad

Output of the MapReduce task:
bad → 1
Class → 1
good → 1
Hadoop → 3
is → 2
to → 1
Welcome → 1
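The output above is what the classic word-count job produces. A minimal sketch of that job with the Hadoop MapReduce Java API, following the standard WordCount pattern (input and output paths are passed as arguments):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: after shuffle and sort, sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local aggregation (the combiner)
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```
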

How MapReduce Works
 MapReduce divides a task into small parts and assigns them to many computers.
 The results are collected at one place and integrated to form the result dataset.
 The MapReduce algorithm contains two important tasks:
1. Map - Splits & Mapping
2. Reduce - Shuffling, Reducing
 The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as input and combines those data tuples (key-value pairs) into a smaller set of tuples.
 The Reduce task is always performed after the Map job.

Map (Splits & Mapping) & Reduce (Shuffling, Reducing)

How MapReduce Works – Cont.
 The complete execution process (execution of both the Map and Reduce tasks) is controlled by two types of entities:
1. Job Tracker: acts like a master (responsible for the complete execution of a submitted job).
2. Multiple Task Trackers: act like slaves, each of them performing part of the job.
 For every job submitted for execution in the system, there is one JobTracker, which resides on the NameNode, and there are multiple TaskTrackers, which reside on the DataNodes.

How MapReduce Works – Cont.
 A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
 It is the responsibility of the JobTracker to coordinate the activity by scheduling tasks to run on different data nodes.
 Execution of an individual task is then looked after by the TaskTracker, which resides on every data node executing part of the job.
 The TaskTracker's responsibility is to send progress reports to the JobTracker.
 In addition, the TaskTracker periodically sends a 'heartbeat' signal to the JobTracker to notify it of the current state of the system.
 Thus, the JobTracker keeps track of the overall progress of each job. In the event of a task failure, the JobTracker can reschedule it on a different TaskTracker.

MapReduce Algorithm



MapReduce Algorithm – Cont.
 Input Phase
 We have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
 Map
 Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
 Intermediate Keys
 The key-value pairs generated by the mapper are known as intermediate keys.
 Combiner
 A combiner is a type of local reducer that groups similar data from the map phase into identifiable sets.
 It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper.
 It is not a part of the main MapReduce algorithm; it is optional.

MapReduce Algorithm – Cont.
 Shuffle and Sort
 The Reducer task starts with the Shuffle and Sort step.
 It downloads the grouped key-value pairs onto the local machine where the Reducer is running.
 The individual key-value pairs are sorted by key into a larger data list.
 The data list groups equivalent keys together so that their values can be iterated over easily in the Reducer task.
 Reducer
 The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them.
 Here, the data can be aggregated, filtered, and combined in a number of ways, which may require a wide range of processing.
 Once the execution is over, it gives zero or more key-value pairs to the final step.
 Output Phase
 In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them to a file using a record writer.

MapReduce Features
 Scalability
 Flexibility
 Security & Authentication
 Cost-Effective Solution
 Fast

Security in Hadoop
 When Hadoop was first released in 2007, it was intended to manage large amounts of web data in a trusted environment, so security was not a significant concern or focus.
 As Hadoop evolved into an enterprise technology, the need to add security to Hadoop grew; it now supports encryption at the disk, file system, database, and application levels.
 Apache Hadoop supports HDFS encryption.
 The first step in securing an Apache Hadoop cluster is to enable encryption in transit and at rest.
 Authentication and Kerberos rely on secure communications, so before you even go down the road of enabling authentication and Kerberos, you must enable encryption of data in transit.
 To achieve secure communications, you need to enable the secure versions of the protocols used:
 RPC secured with SASL
 Access to files in HDFS controlled by file permissions and Access Control Lists (a Kerberos login sketch follows)
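A minimal sketch of how a client logs in to a Kerberos-secured cluster using Hadoop's UserGroupInformation API; the principal, keytab path, and test path are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the Hadoop client that the cluster expects Kerberos authentication.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Log in from a keytab instead of an interactive kinit (values are illustrative).
        UserGroupInformation.loginUserFromKeytab("analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");

        // Subsequent HDFS calls run as the authenticated principal, subject to file permissions/ACLs.
        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println(fs.exists(new Path("/user/analyst")));
        }
    }
}
```
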

Administering Hadoop
 Hadoop administration includes both HDFS and MapReduce administration.
 HDFS administration includes monitoring the HDFS file structure, locations, and updated files.
 MapReduce administration includes monitoring the list of applications, the configuration of nodes, application status, etc.

Monitoring
 HDFS Monitoring
 HDFS (Hadoop Distributed File System) contains the user directories, input files, and output files.
 Use the HDFS shell commands put and get for storing and retrieving files (the equivalent Java calls are sketched below).
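A minimal sketch of those put and get operations expressed as FileSystem copy calls (the paths are assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutGetExample {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // Equivalent of: hdfs dfs -put input.txt /user/demo/input.txt
            fs.copyFromLocalFile(new Path("input.txt"), new Path("/user/demo/input.txt"));
            // Equivalent of: hdfs dfs -get /user/demo/output/part-r-00000 result.txt
            fs.copyToLocalFile(new Path("/user/demo/output/part-r-00000"), new Path("result.txt"));
        }
    }
}
```
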
 MapReduce Job Monitoring
 A MapReduce application is a collection of jobs (Map job, Combiner, Partitioner, and Reduce job).
 It is mandatory to monitor and maintain the following:
 The configuration of the datanodes where the application runs.
 The number of datanodes and the resources used per application.

Maintenance
Hadoop Administration & Maintenance
 Hadoop Admin roles and responsibilities include setting up Hadoop clusters.
 Other duties involve backup, recovery, and maintenance.
 Hadoop administration requires good knowledge of hardware systems and an excellent understanding of the Hadoop architecture.
 With the increased adoption of Hadoop in traditional enterprise IT solutions and the increased number of Hadoop implementations in production environments, there is a need for Hadoop operations and administration experts to take care of large Hadoop clusters.
 What does a Hadoop Admin do day to day?
 Installation and configuration
 Cluster maintenance
 Resource management
 Security management
 Troubleshooting
 Cluster monitoring
 Backup and recovery tasks
 Aligning with the systems engineering team to propose and deploy new hardware and software environments required for Hadoop and to expand existing environments.

Hadoop in the Cloud
 The cloud is ideally suited to provide the big data computation power required for processing these large parallel data sets.
 The cloud has the ability to provide the flexible and agile computing platform required for big data, as well as the ability to call on massive amounts of computing power.
 It can scale as needed and is an ideal platform for on-demand analysis of structured and unstructured workloads.
 It is easy to understand what the jargony phrase "Hadoop in the cloud" means: it is running Hadoop clusters on resources offered by a cloud provider.

Hadoop in the Cloud – Cont.
 This practice is normally compared with running Hadoop clusters on your own hardware,
called on-premises clusters.
 If you are already familiar with running Hadoop clusters on-prem, you will find that a lot of your
knowledge and practices carry over to the cloud.
 After all, a cloud instance is supposed to act almost exactly like an ordinary server you connect
to remotely, with root access, and some number of CPU cores, and some amount of disk space,
and so on.
 Once instances are networked together properly and made accessible, you can imagine that they
are running in a regular data center, as opposed to a cloud provider’s own data center.

Reasons to Run Hadoop in the Cloud
 Lack of space
 Your organization may need Hadoop clusters, but you don’t have anywhere to keep racks of physical servers,
along with the necessary power and cooling.
 Flexibility
 Without physical servers to rack up or cables to run, it is much easier to reorganize instances, or expand or
contract your footprint, for changing business needs. Everything is controlled through cloud provider APIs and
web consoles.
 New usage patterns
 The flexibility of making changes in the cloud leads to new usage patterns that are otherwise impractical. For
example, individuals can have their own instances, clusters, and even networks, without much managerial
overhead.
 Speed of change
 It is much faster to launch new cloud instances or allocate new database servers than to purchase, unpack,
rack, and configure physical computers.

Reasons to Run Hadoop in the Cloud – Cont.
 Lower risk
 In the cloud, you can quickly and easily change how many resources you use, so there is little risk of undercommitment or overcommitment.
 If some resource malfunctions, you don’t need to fix it; you can discard it and allocate a new one.
 Focus
 An organization using a cloud provider to rent resources, instead of spending time and effort on the logistics
of purchasing and maintaining its own physical hardware and networks, is free to focus on its core
competencies, like using Hadoop clusters to carry out its business.
 Worldwide availability
 The largest cloud providers have data centers around the world, ready for you from the start. You can use
resources close to where you work, or close to where your customers are, for the best performance.
 You can set up redundant clusters, or even entire computing environments, in multiple data centers, so that if
local problems occur in one data center, you can shift to working elsewhere.

Reasons to Run Hadoop in the Cloud – Cont.
 Data storage requirements
 If you have data that is required by law to be stored within specific geographic areas, you can keep it in
clusters that are hosted in data centers in those areas.
 Cloud provider features
 Each major cloud provider offers an ecosystem of features to support the core functions of computing,
networking, and storage.
 To use those features most effectively, your clusters should run in the cloud provider as well.
 Capacity
 Few customers tax the infrastructure of a major cloud provider. You can establish large systems in the cloud
that are not nearly as easy to put together, not to mention maintain, on-prem.

• Big Data Analytics (BDA)

Thank You
