
HADOOP
• Apache Hadoop is an open-source, scalable, and fault-tolerant framework written in Java.
• It efficiently processes large volumes of data on a cluster of commodity hardware.
• Hadoop is not only a storage system but a platform for large-scale data storage as well as processing.
• Much of Hadoop's code has been contributed by Yahoo!, IBM, Facebook, and Cloudera.
• It provides an efficient framework for running jobs on the multiple nodes of a cluster.
• A cluster is a group of systems connected via a LAN. Apache Hadoop processes data in parallel, as it works on multiple machines simultaneously.
What is Hadoop used for?
i) For processing really BIG DATA - If the business use case you are tackling has at least terabytes or petabytes of data, then Hadoop is your go-to framework of choice. There are plenty of other tools available for not-so-large datasets.
ii) For storing diverse data - Hadoop is used for storing and processing any kind of data, be it plain text files, binary files, or images.
iii) Hadoop is used for parallel data-processing use cases.
HADOOP Key Components
• Hadoop consists of three key parts –
• Hadoop Distributed File System (HDFS) – It is the storage
layer of Hadoop.
• Map-Reduce – It is the data processing layer of Hadoop.
• YARN – It is the resource management layer of Hadoop.
• Why Hadoop?
• Let us now understand in this Hadoop tutorial why Hadoop is so popular and why it has captured more than 90% of the big data market.
• Apache Hadoop is not only a storage system but is a platform
for data storage as well as processing.
• It is scalable (we can add more nodes on the fly) and fault tolerant (even if a node goes down, its data can be processed by another node).
• Following characteristics of Hadoop make it a unique
platform:
• Flexibility to store and mine any type of data, whether it is structured, semi-structured or unstructured. It is not bound by a single schema.
• Excels at processing data of a complex nature. Its scale-out architecture divides workloads across many nodes. Another added advantage is that its flexible file system eliminates ETL bottlenecks.
• Scales economically: as discussed, it can be deployed on commodity hardware. Apart from this, its open-source nature guards against vendor lock-in.
HADOOP CORE COMPONENTS
HDFS
MAIN COMPONENTS OF HDFS
• What are NameNodes and DataNodes?
• The NameNode is the master node that holds all the metadata. It contains information such as the number of blocks, the size of blocks, the number of vacant blocks, the number of replicated blocks, etc.
• The DataNode is a slave node; it sends information to the NameNode about the files and blocks it stores and responds to the NameNode for all file operations.
DataNode Configuration
• Sample DataNode configuration in a Hadoop architecture:
• Processors: 2 quad-core CPUs running @ 2 GHz
• Network: 10 Gigabit Ethernet
• RAM: 64 GB
• Hard disk: 12-24 x 1 TB SATA
• Functions of the NameNode:
1. To store all the metadata (data about data) of all the slave nodes in a Hadoop cluster,
e.g., file name, file path, number of blocks, block IDs, block locations, and slave-related configuration.
That is, it knows where and what data is stored.
This metadata is kept in memory for faster retrieval, to avoid the latency that disk seeks would cause.
Hence, it is recommended that the master node on which the NameNode daemon runs be very reliable hardware with a high-end configuration and plenty of RAM.
2. To keep track of all the slave nodes (whether they are alive or dead). This is done using the heartbeat mechanism.
3. Replication (provides high availability, reliability and fault tolerance).
4. Balancing: the NameNode balances data replication, i.e., blocks of data should be neither under- nor over-replicated.
• Role of the DataNode:
1. The DataNode is a daemon (a process that runs in the background) that runs on the slave nodes in a Hadoop cluster.
2. In HDFS, a file is broken into small chunks called blocks (the default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x).
3. These blocks of data are stored on the slave nodes.
4. It stores the actual data, so a large number of disks are required.
5. The read/write operations to disk are performed by the DataNode. Commodity hardware can be used for hosting DataNodes.
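To make the NameNode/DataNode split concrete, here is a minimal Java sketch (not from the original slides) that uses the HDFS client API to ask the NameNode for the block locations of a file; the path /user/demo/sample.txt is a hypothetical example, and a reachable cluster configuration on the classpath is assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used only for illustration
        Path file = new Path("/user/demo/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers with the blocks of the file and the DataNodes hosting them
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```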
Features of Hadoop
• Open Source
• Apache Hadoop is an open source project. It means its code
can be modified according to business requirements.
• Distributed Processing
• As data is stored in a distributed manner in HDFS across the
cluster, data is processed in parallel on a cluster of nodes.
Features of Hadoop
• Fault Tolerance
• This is one of the very important features of Hadoop.
• By default, 3 replicas of each block are stored across the cluster in Hadoop, and this can be changed as per the requirement.
• So if any node goes down, data on that node can be recovered from other nodes easily thanks to this characteristic.
• Failures of nodes or tasks are recovered automatically by the framework. This is how Hadoop is fault tolerant.
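As a hedged illustration of how the replication factor can be tuned, the Java sketch below sets the client-side default and changes the replication of one existing file through the HDFS FileSystem API; the file path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created by this client (3 is the HDFS default)
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        // Request a replication factor of 2 for an existing (hypothetical) file
        boolean accepted = fs.setReplication(new Path("/user/demo/sample.txt"), (short) 2);
        System.out.println("Replication change accepted: " + accepted);
        fs.close();
    }
}
```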
Features of Hadoop
• Reliability
• Due to the replication of data in the cluster, data is stored reliably on the cluster of machines despite machine failures. If your machine goes down, your data will still be stored reliably thanks to this characteristic of Hadoop.
• High Availability
• Data is highly available and accessible despite hardware failure, due to the multiple copies of the data. If a machine or a piece of hardware crashes, the data will be accessed via another path.
Features of Hadoop
• Scalability
• Hadoop is highly scalable: new hardware can easily be added to the nodes. Hadoop also provides horizontal scalability, which means new nodes can be added on the fly without any downtime.
• Economic
• Apache Hadoop is not very expensive, as it runs on a cluster of commodity hardware. We do not need any specialized machines for it. Hadoop also provides huge cost savings, since it is very easy to add more nodes on the fly. So if requirements increase, you can add nodes without downtime and without much pre-planning.
Features of Hadoop
• Easy to use
• The client does not need to deal with distributed computing; the framework takes care of all of it. This makes Hadoop easy to use.
Limitations of Hadoop
• Although Hadoop is the most powerful tool for big data, it has various limitations:
• Hadoop is not suited for small files,
• it cannot handle live (streaming) data well,
• its processing speed is slow,
• it is not efficient for iterative processing,
• it is not efficient for caching, etc.
• Issue with Small Files
• Hadoop is not suited for small files. The Hadoop Distributed File System (HDFS) lacks the ability to efficiently support random reads of small files because of its high-capacity design.
• Small files are a major problem in HDFS. A small file is one significantly smaller than the HDFS block size (default 128 MB). HDFS cannot handle huge numbers of such files, as it was designed to work with a small number of large files storing large data sets, rather than a large number of small files. If there are too many small files, the NameNode becomes overloaded, since it stores the entire HDFS namespace in memory.
Limitations of Hadoop
• Solution-
• The solution to this drawback of Hadoop is simple: just merge the small files into bigger files and then copy the bigger files to HDFS (see the sketch below).
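One common way to merge small files is to pack them into a single Hadoop SequenceFile, with the original file name as the key and the raw bytes as the value. The sketch below is illustrative only; the local directory small-files and the HDFS output path /user/demo/packed.seq are assumed names.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.Objects;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("/user/demo/packed.seq"); // hypothetical output path in HDFS

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            // Pack every small local file under ./small-files (assumed to exist)
            File dir = new File("small-files");
            for (File f : Objects.requireNonNull(dir.listFiles())) {
                byte[] content = Files.readAllBytes(f.toPath());
                // key = original file name, value = raw file contents
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        }
    }
}
```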
Limitations of Hadoop
• Slow Processing Speed
• In Hadoop, MapReduce processes large data sets with a parallel, distributed algorithm. Two kinds of tasks need to be performed, Map and Reduce, and MapReduce requires a lot of time to perform these tasks, thereby increasing latency (a minimal Map/Reduce sketch follows below). Data is distributed and processed over the cluster in MapReduce, which increases the time and reduces processing speed.
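To show what the Map and Reduce tasks look like in practice, here is a minimal word-count Mapper and Reducer written against the standard org.apache.hadoop.mapreduce API; job setup and the input and output paths are omitted from this sketch.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: breaks each input line into (word, 1) pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce task: sums the counts emitted for each word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```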
• As a solution to this limitation of Hadoop, Spark overcomes the issue by processing data in memory. In-memory processing is faster, as no time is spent moving data and processes in and out of disk. Spark can be up to 100 times faster than MapReduce because it processes everything in memory. Flink is also used, as it processes data faster than Spark thanks to its streaming architecture; Flink can be instructed to process only the parts of the data that have actually changed, which significantly increases job performance.
Limitations of Hadoop
• Support for Batch Processing only
• Hadoop supports batch processing only; it does not process streamed data, and hence the overall performance is slower. The MapReduce framework of Hadoop does not leverage the memory of the Hadoop cluster to the maximum.
• Solution-
• To address this limitation of Hadoop, Spark is used, which improves performance; however, Spark's stream processing is not as efficient as Flink's, since Spark uses micro-batch processing. Flink improves the overall performance, as it provides a single runtime for streaming as well as batch processing. Flink uses native closed-loop iteration operators, which make machine learning and graph processing faster.
Limitations of Hadoop
• No Real-time Data Processing
• Apache Hadoop is designed for batch processing: it takes a huge amount of data as input, processes it, and produces the result. Although batch processing is very efficient for processing high volumes of data, depending on the size of the data being processed and the computational power of the system, the output can be delayed significantly. Hadoop is not suitable for real-time data processing.
• Solution-
• Apache Spark supports stream processing. Stream processing involves continuous input and output of data. It emphasizes the velocity of the data, and data is processed within a small period of time.
• Apache Flink provides a single runtime for streaming as well as batch processing, so one common runtime is utilized for data-streaming applications and batch-processing applications. Flink is a stream-processing system that is able to process data row after row in real time (a minimal sketch follows below).
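As a rough illustration of row-by-row stream processing, the Flink DataStream sketch below reads lines from a local socket and handles each record as it arrives; the localhost:9999 source (e.g. started with nc -lk 9999) is an assumption used only for this example.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingSketch {
    public static void main(String[] args) throws Exception {
        // One runtime handles the (potentially unbounded) stream of records
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical source: text lines arriving on a local socket
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Each record is processed as it arrives instead of waiting for a batch
        lines.filter(line -> !line.trim().isEmpty())
             .print();

        env.execute("Real-time processing sketch");
    }
}
```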
Limitations of Hadoop
• No Delta Iteration
• Hadoop is not very efficient for iterative processing, as it does not support cyclic data flow (i.e. a chain of stages in which the output of each stage is the input to the next).
• Solution-
• Apache Spark can be used to overcome this limitation of Hadoop, as it accesses data from RAM instead of disk, which dramatically improves the performance of iterative algorithms that access the same dataset repeatedly. Spark iterates over its data in batches; for iterative processing in Spark, each iteration has to be scheduled and executed separately (see the sketch below).
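The following is a hedged sketch of iterative processing over a cached dataset using Spark's Java API; the tiny in-memory input and the averaging loop are invented purely to show that every iteration re-reads the cached RDD from memory rather than from disk.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("IterativeSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Small in-memory dataset standing in for a large input; cache() keeps it in memory
        JavaRDD<Double> values = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0)).cache();

        double estimate = 0.0;
        // Each iteration is scheduled separately but reads the cached RDD, not the disk
        for (int i = 0; i < 10; i++) {
            final double current = estimate;
            estimate = values.map(v -> (v + current) / 2.0)
                             .reduce(Double::sum) / values.count();
        }
        System.out.println("Estimate after 10 iterations: " + estimate);
        sc.close();
    }
}
```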
Limitations of Hadoop
• No Caching
• Hadoop is not efficient for caching. In Hadoop, MapReduce cannot cache intermediate data in memory for later use, which diminishes Hadoop's performance.
• Solution-
• Spark and Flink overcome this limitation of Hadoop, as both cache data in memory for further iterations, which enhances the overall performance.
Limitations of Hadoop
• Latency
• In Hadoop, the MapReduce framework is comparatively slow, since it is designed to support different formats, structures and huge volumes of data. In MapReduce, Map takes a set of data and converts it into another set of data in which individual elements are broken down into key-value pairs, and Reduce takes the output of the Map as its input and processes it further. MapReduce requires a lot of time to perform these tasks, thereby increasing latency.
• Solution-
• Spark is used to reduce this limitation of Hadoop. Apache Spark is yet another batch system, but it is relatively faster since it caches much of the input data in memory using RDDs (Resilient Distributed Datasets) and keeps intermediate data in memory itself.
• Flink's data streaming achieves low latency and high throughput.
Limitations of Hadoop
• Security
• Hadoop can be challenging for managing complex applications. If the user managing the platform does not know how to enable its security features, the data could be at huge risk. At the storage and network levels, Hadoop is missing encryption, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage.
• HDFS supports access control lists (ACLs) and a traditional file-permissions model. However, third-party vendors have enabled organizations to leverage Active Directory Kerberos and LDAP for authentication.
• Solution-
• Spark provides a security bonus to overcome these limitations of Hadoop. If we run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN, giving it the capability of using Kerberos authentication.
Limitations of Hadoop
• Lengthy Line of Code
• Hadoop has about 120,000 lines of code. More lines of code mean more potential bugs, and it takes more time to execute the program.
• Solution-
• Spark and Flink are implemented in Scala (with Java APIs), so they have far fewer lines of code than Hadoop. They therefore take less time to execute and avoid the lengthy-lines-of-code limitation of Hadoop.
Limitations of Hadoop
• No Abstraction
• Hadoop does not have any type of abstraction, so MapReduce developers need to hand-code each and every operation, which makes it very difficult to work with.
• Solution-
• To overcome this drawback of Hadoop, Spark is used, which provides the RDD abstraction for batch processing. Flink has the DataSet abstraction.
• What is the difference between the JobTracker and the TaskTracker?
• The JobTracker is responsible for taking in requests from a client and assigning TaskTrackers the tasks to be performed, whereas the TaskTracker accepts tasks from the JobTracker. The TaskTracker keeps sending a heartbeat message to the JobTracker to notify it that it is alive.
• In HDFS, why is it suggested to have a few large files rather than many small files?
• The NameNode contains metadata about each and every file in HDFS. The more files there are, the more metadata there is. The NameNode loads all the metadata into memory for speed, so having many small files makes the metadata large enough to exceed the memory available on the NameNode.
• Does Hadoop require passwordless SSH access?
• Apache Hadoop in itself does not require passwordless SSH access, but the Hadoop-provided shell scripts such as start-mapred.sh and start-dfs.sh make use of SSH to start and stop daemons. This matters in particular when there is a large Hadoop cluster to be managed. However, the daemons can also be started manually on individual nodes without the need for the SSH scripts.
Practice MCQ
• Which of the following are challenges of big data? A. Storage Capacity B. Analysis C. Searching D. All of the above
• What license is Hadoop distributed under? a) Apache License b) Mozilla Public License c) Shareware d) Commercial
• Identify the odd one out with respect to Big Data. A. MapReduce B. Hadoop C. 3 V's of Big Data D. RDBMS
• Which of the following characteristics of Big Data states that the total amount of information is growing exponentially every year? A. Volume B. Velocity C. Veracity D. Value
• Which of the following characteristics of Big Data states that data comes in various forms such as Twitter feeds, audio files, MRI images, web pages, and web logs? A. Volume B. Variety C. Veracity D. Value
Practice MCQ
• In which category of data would one place an XML file? A. Structured Data B. Unstructured Data C. Semi-Structured Data D. Unclassified Data
• _______ refers to the trustworthiness of the data with respect
to Big Data. A. Value B. Veracity C. Velocity D. Volume
• ________ is the frequency of incoming data that needs to be
processed with respect to Big Data. A. Velocity B. Volume C.
Variety D. Value
• In which type of data would one place photographs? A. Unstructured Data B. Semi-structured Data C. Structured Data D. None of the above
Practice MCQ
• Which of the following commands is used to update the packages while installing Hadoop on Linux? A. sudo apt-get update B. sudo update package C. sudo package update D. sudo install package
• Which of the following files is used to store environment variables during Hadoop installation? A. BASHRC B. HOSTS C. MAPRED D. HDFS-SITE
• Which of the following commands needs to be executed so that your system recognizes the newly created environment variables? A. source ~/.bashrc B. gedit ~/.bashrc C. save ~/.bashrc D. execute ~/.bashrc
Hadoop 1.0 vs Hadoop 2.0
• Hadoop 1 has a master-slave architecture. It consists of a single master and multiple slaves. If the master node crashes, then irrespective of your best slave nodes, the cluster is destroyed. Rebuilding that cluster, i.e. copying system files, image files, etc. onto another system, is far too time-consuming, which organizations today will not tolerate.
• Hadoop 2 is also a master-slave architecture, but it consists of multiple masters (i.e. active NameNodes and standby NameNodes) and multiple slaves. If a master node crashes, a standby master node takes over. You can configure multiple active-standby combinations. Thus Hadoop 2 eliminates the problem of a single point of failure.
• Hadoop 1.x has the following limitations/drawbacks:
• It is only suitable for batch processing of huge amounts of data that are already in the Hadoop system.
• It is not suitable for real-time data processing.
• It supports up to 4,000 nodes per cluster.
• It has a single component, the JobTracker, to perform many activities such as resource management, job scheduling, job monitoring, re-scheduling of jobs, etc.
• The JobTracker is a single point of failure.
• It does not support multi-tenancy.
• Hadoop 1.x supports a maximum of 4,000 nodes per cluster, whereas Hadoop 2.x supports more than 10,000 nodes per cluster.
• Hadoop 1.x supports one and only one programming model: MapReduce. Hadoop 2.x supports multiple programming models with the YARN component, such as MapReduce, interactive, streaming, graph, Spark, Storm, etc.
