Syllabus
Data, Data Storage and Analysis, Comparison with Other Systems: RDBMS, Grid Computing,
Volunteer Computing, A Brief History of Hadoop, Apache Hadoop and the Hadoop Ecosystem,
Hadoop Releases
We live in the data age. It’s not easy to measure the total volume of data stored electronically,
but an International Data Corporation (IDC) estimate put the size of the “digital universe”
at 1.48 zettabytes in 2016, forecasting growth to 4.8 zettabytes by 2020. A zettabyte is
equivalent to one thousand exabytes, one million petabytes, or one billion terabytes.
That is roughly the same order of magnitude as one disk drive for every person in the world.
Today’s rapidly growing big data is sometimes called a data flood. It represents an immense
opportunity for forward-thinking marketers. But to fully leverage the potential within massive
streams of structured and unstructured data, organizations must quickly optimize ad delivery,
evaluate campaign results, improve site selection and retarget ads.
– New York Stock Exchange generates 1 TB of new trade data per day
– Internet Archive stores around 2 PB, and is growing at a rate of 20 TB per month
Big Data Analytics Module – 3: Meet Hadoop
Most of the data is locked up in the largest web properties (like search engines), or scientific or financial
institutions, isn’t it? Does the advent of “Big Data,” as it is being called, affect smaller organizations or
individuals?
• The amount of data generated by machines will be even greater than that generated by people
– Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions
• The volume of data being made publicly available increases every year, too.
Problem: the difficulty of implementing storage and processing support for big data analytics.
• The problem is simple: while the storage capacities of hard drives have increased massively
over the years, access speeds—the rate at which data can be read from drives— have not
kept up.
• The transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all
the data off a one-terabyte disk.
• This is a long time to read all data on a single drive—and writing is even slower.
Solution: The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had
100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under
two minutes.
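The arithmetic behind these figures can be checked with a short sketch (this assumes a one-terabyte drive, which is what the "two and a half hours" figure implies):

```python
# Back-of-the-envelope check: reading one terabyte at 100 MB/s sequentially,
# versus 100 drives each reading one hundredth of the data in parallel.

DISK_SIZE_MB = 1_000_000       # 1 TB expressed in MB
TRANSFER_RATE_MB_S = 100       # sequential transfer rate per drive

def read_time_minutes(total_mb, drives):
    """Minutes to read total_mb, split evenly across `drives` drives."""
    per_drive_mb = total_mb / drives
    return per_drive_mb / TRANSFER_RATE_MB_S / 60

single = read_time_minutes(DISK_SIZE_MB, 1)      # one drive, whole dataset
parallel = read_time_minutes(DISK_SIZE_MB, 100)  # 100 drives in parallel

print(f"one drive: {single:.0f} minutes (~{single/60:.1f} hours)")
print(f"100 drives: {parallel:.1f} minutes")
```

One drive takes about 167 minutes (roughly 2.8 hours), while 100 drives working in parallel finish in under two minutes, matching the figures above.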
Implementing parallel reads and writes raises some problems:
The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the
chance that one will fail is fairly high.
A common way of avoiding data loss is through replication: redundant copies of the data are
kept by the system so that in the event of failure, there is another copy available. This is how
RAID works, for example.
Hadoop’s filesystem, the Hadoop Distributed File System (HDFS), takes a slightly different approach.
The second problem is combining the data for analysis: data read from one disk may need to be
combined with the data from any of the other 99 disks. Various distributed systems allow data to be
combined from multiple sources, but doing this correctly is notoriously challenging.
MapReduce provides a programming model that abstracts the problem from disk reads and
writes, transforming it into a computation over sets of keys and values.
Like HDFS, MapReduce has built-in reliability.
This, in a nutshell, is what Hadoop provides: reliable storage via HDFS and analysis via MapReduce.
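The key-value programming model can be sketched in plain Python. This is a toy, single-machine word count, not how Hadoop itself is implemented; it only illustrates that the programmer supplies a map function and a reduce function, and the framework handles grouping by key in between:

```python
# A minimal sketch of the MapReduce programming model:
# map emits (key, value) pairs, a "shuffle" groups values by key,
# and reduce combines the values collected for each key.
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def run_job(lines):
    # Shuffle phase: group all mapped values by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce phase: one call per distinct key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_job(["the quick brown fox", "the lazy dog"])
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In real Hadoop, the map and reduce calls run on many machines, and the shuffle moves data across the network; the programmer's job stays the same.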
The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire
dataset—or at least a good portion of it—is processed for each query. But this is its power.
MapReduce is a batch query processor, and the ability to run an ad hoc query against your
whole dataset and get the results in a reasonable time is transformative.
It changes the way you think about data and unlocks data that was previously archived on tape
or disk.
Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce
needed?
The answer to these questions comes from another trend in disk drives:
Seek time is improving more slowly than transfer rate. Seeking is the process of moving the
disk’s head to a particular place on the disk to read or write data.
It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s
bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write large portions
of the dataset than streaming through it, which operates at the transfer rate.
On the other hand, for updating a small proportion of records in a database, a traditional B-tree
(the data structure used in relational databases, which is limited by the rate at which it can
perform seeks) works well.
For updating the majority of a database, a B-tree is less efficient than MapReduce, which uses
sort/merge to rebuild the database.
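The seek-versus-streaming trade-off can be made concrete with illustrative numbers (the dataset size, record size, seek time and transfer rate below are assumptions for the sketch, not figures from the text):

```python
# Seek-dominated updates (B-tree style) vs a streaming rewrite (sort/merge
# style), using illustrative numbers: a 100 GB dataset of 100-byte records,
# 10 ms per seek, 100 MB/s transfer rate.

SEEK_TIME_S = 0.010
TRANSFER_MB_S = 100
DATASET_MB = 100_000            # 100 GB
N_RECORDS = 1_000_000_000       # one billion 100-byte records

def update_by_seeking(fraction):
    """Seconds to update `fraction` of records with one seek each."""
    return fraction * N_RECORDS * SEEK_TIME_S

def rewrite_streaming():
    """Seconds to read and rewrite the whole dataset at the transfer rate."""
    return 2 * DATASET_MB / TRANSFER_MB_S

print(round(update_by_seeking(0.00001)))  # update 0.001% of records by seeking
print(round(rewrite_streaming()))         # full sort/merge-style rewrite
print(round(update_by_seeking(0.01)))     # update 1% of records by seeking
```

Under these assumptions, updating a tiny fraction of records (100 s) beats a full rewrite (2,000 s), but updating even 1% by seeking (100,000 s) is far slower than streaming through the whole dataset, which is the crossover the text describes.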
Comparison with Other Systems: RDBMS
In many ways, MapReduce can be seen as a complement to an RDBMS. The differences
between the two systems are shown in Table 3.2.
Table 3.2: Comparison between MapReduce and RDBMS
Some more differences between MapReduce and RDBMS are listed in Table 3.3.
Table 3.3: Differences between MapReduce and RDBMS
# | MapReduce | RDBMS
1 | Good fit for problems that analyse the whole data set in a batch fashion | Good for point queries or updates, where the data set has been indexed to deliver low-latency retrieval
2 | Suits applications where the data is written once and read many times | Good for data that is continuously updated
3 | Works on semi-structured or unstructured data | Operates on structured data
4 | Examples: spreadsheets, images, text | Examples: database tables, XML documents
5 | Designed to interpret the data at processing time (schema on read) | Designed to interpret the data at load time (schema on write)
6 | Normalization poses a problem in Hadoop because it makes reading a record a non-local operation; Hadoop instead makes it possible to perform streaming reads and writes | Data is often normalized to avoid redundancy and retain integrity
7 | Can process the data in parallel | Parallel processing is not inherent in traditional SQL query execution
Hadoop systems such as Hive are becoming more interactive and adding features like indexing and
transactions that make them look more and more like a traditional RDBMS.
Grid Computing
Definition: Grid computing is the collection of computer resources from multiple locations to reach
a common goal. The grid can be thought of as a distributed system with non-interactive workloads that
involve a large number of files.
In grid computing, the computers on the network can work on a task together, thus functioning as a
supercomputer.
The High Performance Computing (HPC) and Grid Computing communities have been doing
large-scale data processing for years, using such APIs as Message Passing Interface (MPI).
The approach in HPC is to distribute the work across a cluster of machines, which access a
shared filesystem hosted by a SAN (storage area network).
HPC works well for predominantly compute-intensive jobs, but becomes a problem when nodes
need to access larger data volumes (hundreds of gigabytes) since the network bandwidth is the
bottleneck and compute nodes become idle.
MapReduce tries to collocate the data with the compute node, so data access is fast because it is
local. Data locality is at the heart of MapReduce and is the reason for its good performance.
Recognizing that network bandwidth is the most precious resource in a data center environment
(it is easy to saturate network links by copying data around), MapReduce implementations go
to great lengths to conserve it by explicitly modelling network topology.
MPI gives great control to the programmer, but requires the programmer to explicitly handle
the mechanics of the data flow.
MapReduce operates only at the higher level: the programmer thinks in terms of functions of
key and value pairs, and the data flow is implicit.
Coordinating the processes in grid computing faces two major challenges:
Large-scale distributed computation itself.
Handling partial failure (when you don’t know whether a remote process has failed or not)
while still making progress with the overall computation.
MapReduce spares the programmer from having to think about failure: the implementation detects
failed map or reduce tasks and reschedules replacements on machines that are healthy. MapReduce is
able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence
on one another.
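The rescheduling idea can be sketched with a toy simulation. The failure rate and task structure below are invented for illustration; the point is only that a shared-nothing task can be rerun anywhere, because it depends on nothing but its own input split:

```python
# Toy illustration of shared-nothing rescheduling: each task depends only
# on its own input split, so a failed attempt is simply rerun elsewhere.
import random

random.seed(42)

def run_task(split):
    """Process one input split; fail 30% of the time to mimic bad hardware."""
    if random.random() < 0.3:
        raise RuntimeError("machine failed")
    return sum(split)  # the "work": summing this split

def run_job(splits, max_attempts=5):
    results = {}
    for i, split in enumerate(splits):
        for attempt in range(max_attempts):
            try:
                results[i] = run_task(split)  # a retry is just another call
                break
            except RuntimeError:
                continue  # the framework reschedules on a healthy machine
    return results

print(run_job([[1, 2], [3, 4], [5, 6]]))  # every split still gets summed
```

Despite simulated machine failures along the way, every split is eventually processed, and no task ever had to know about any other task.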
So from the programmer’s point of view, the order in which the tasks run doesn’t matter. By contrast,
MPI programs have to explicitly manage their own checkpointing and recovery, which gives more
control to the programmer but makes them more difficult to write.
MapReduce might sound like quite a restrictive programming model, and in a sense it is: you are
limited to key and value types that are related in specified ways, and mappers and reducers run
with very limited coordination between one another.
Volunteer Computing
SETI, the Search for Extra-Terrestrial Intelligence, runs a project called SETI@home in
which volunteers donate CPU time from their otherwise idle computers to analyze radio
telescope data for signs of intelligent life outside earth.
Other projects include the Great Internet Mersenne Prime Search (to search for large prime
numbers) and Folding@home (to understand protein folding and how it relates to disease).
Volunteer computing projects work by breaking the problem they are trying to solve into
chunks called work units, which are sent to computers around the world to be analyzed.
For example, a SETI@home work unit is about 0.35 MB of radio telescope data, and takes
hours or days to analyze on a typical home computer.
When the analysis is completed, the results are sent back to the server, and the client gets
another work unit.
As a precaution to combat cheating, each work unit is sent to three different machines and
needs at least two results to agree to be accepted.
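The anti-cheating scheme just described is a simple majority vote, sketched here (the example answers are invented):

```python
# Each work unit is sent to three machines; a result is accepted only
# when at least two of the three returned answers agree.
from collections import Counter

def accept_result(answers):
    """Return the agreed answer if at least 2 of the 3 copies match, else None."""
    value, count = Counter(answers).most_common(1)[0]
    return value if count >= 2 else None

print(accept_result(["signal", "signal", "noise"]))  # accepted: 'signal'
print(accept_result(["a", "b", "c"]))                # rejected: no two agree
```

Sending every work unit to three machines triples the compute cost, which is affordable precisely because volunteers donate CPU cycles for free.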
MapReduce also works in a similar way, breaking a problem into independent pieces that are
worked on in parallel.
However, the volunteer computing problem is very CPU-intensive, which makes it suitable for
running on hundreds of thousands of computers across the world, because the time to transfer
the work unit is dwarfed by the time to run the computation on it.
Volunteers are donating CPU cycles, not bandwidth.
MapReduce, by contrast, is designed to run jobs that last minutes or hours on trusted, dedicated
hardware running in a single data center with very high aggregate bandwidth interconnects,
whereas volunteer computing runs a perpetual computation on untrusted machines across the
Internet with highly variable connection speeds and no data locality.
– Created by Doug Cutting, the creator of Apache Lucene, the widely used text search library
– Has its origin in Apache Nutch, an open source web search engine, a part of the Lucene
project.
– ‘Hadoop’ was the name that Doug’s kid gave to a stuffed yellow elephant toy
3.4 History
HDFS
The Hadoop Distributed File System (HDFS) is the core component, or backbone, of the Hadoop
ecosystem. HDFS is what makes it possible to store different types of large data sets
(structured, unstructured and semi-structured data).
HDFS creates a level of abstraction over the resources, from where we can see the whole HDFS
as a single unit.
It helps us store our data across various nodes and maintain a log file about the stored
data (metadata).
HDFS has two core components, i.e. NameNode and DataNode.
1. The NameNode is the master node and it doesn’t store the actual data. It contains
metadata, like a log file or a table of contents. Therefore, it requires
less storage but high computational resources.
2. On the other hand, all your data is stored on the DataNodes and hence it requires more
storage resources. These DataNodes are commodity hardware (like your laptops and
desktops) in the distributed environment. That’s the reason, why Hadoop solutions are
very cost effective.
3. You always communicate with the NameNode while writing data. It then tells the
client which DataNodes to store and replicate the data on.
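The metadata/data split described above can be sketched as a toy model. The class and file names here are illustrative only; real HDFS adds block sizes, heartbeats, and replication pipelines:

```python
# Toy model of the HDFS split: the NameNode holds only metadata (which
# blocks make up a file and where the replicas live); the DataNodes hold
# the actual bytes. No file data ever flows through the NameNode.

class NameNode:
    def __init__(self):
        self.metadata = {}  # filename -> list of (block_id, [DataNodes])

    def add_file(self, name, blocks):
        self.metadata[name] = blocks

    def locate(self, name):
        """A client asks where to read; the NameNode returns locations only."""
        return self.metadata[name]

class DataNode:
    def __init__(self):
        self.blocks = {}    # block_id -> raw bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

# One file, two blocks, each replicated on two DataNodes
dn1, dn2 = DataNode(), DataNode()
for dn in (dn1, dn2):
    dn.store("blk_1", b"hello ")
    dn.store("blk_2", b"hdfs")

nn = NameNode()
nn.add_file("/logs/app.txt", [("blk_1", [dn1, dn2]), ("blk_2", [dn1, dn2])])

# Read path: ask the NameNode for block locations, then fetch from DataNodes
data = b"".join(nodes[0].blocks[bid] for bid, nodes in nn.locate("/logs/app.txt"))
print(data)  # b'hello hdfs'
```

Notice that the NameNode's table stays tiny regardless of how big the blocks are, which is why it needs little storage but must be fast and highly available.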
MapReduce
It is the core processing component of the Hadoop ecosystem, as it provides the logic of processing. In
other words, MapReduce is a software framework which helps in writing applications that process
large data sets using distributed and parallel algorithms inside the Hadoop environment.
Apache Pig
PIG has two parts: Pig Latin, the language, and the Pig runtime, the execution
environment. You can think of the relationship as similar to Java and the JVM.
It supports Pig Latin, a language with a SQL-like command structure.
The compiler internally converts Pig Latin into a sequence of MapReduce jobs, an
abstraction that works like a black box.
PIG was initially developed by Yahoo!.
It gives you a platform for building data flows for ETL (Extract, Transform and Load),
processing and analysing huge data sets.
How does Pig work?
In PIG, the load command first loads the data. Then we perform various operations on it, such as
grouping, filtering, joining and sorting. Finally, you can either dump the result to the screen or
store it back in HDFS.
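The load, filter, group, dump flow just described can be sketched in plain Python (in Pig Latin the same pipeline would be written declaratively, and the compiler would translate it into MapReduce jobs; the sample records are invented):

```python
# The Pig-style pipeline, step by step: LOAD -> FILTER -> GROUP -> DUMP.
from itertools import groupby

records = [("alice", 3), ("bob", 7), ("alice", 5), ("carol", 2)]  # LOAD

filtered = [r for r in records if r[1] > 2]        # FILTER out small values
filtered.sort(key=lambda r: r[0])                  # groupby needs sorted input

grouped = {k: [v for _, v in g]                    # GROUP BY the first field
           for k, g in groupby(filtered, key=lambda r: r[0])}

print(grouped)  # DUMP: {'alice': [3, 5], 'bob': [7]}
```

Each step corresponds to one Pig Latin statement, which is why Pig pipelines read as a linear series of transformations rather than as low-level MapReduce code.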
Apache Hive
Facebook created HIVE for people who are fluent in SQL. Thus, HIVE makes them feel at
home while working in the Hadoop ecosystem.
Basically, HIVE is a data warehousing component which performs reading, writing and
managing large data sets in a distributed environment using SQL-like interface.
HIVE + SQL = HQL
The query language of Hive is called Hive Query Language (HQL), which is very similar to
SQL.
It has two basic components: the Hive command line and the JDBC/ODBC driver.
The Hive command line interface is used to execute HQL commands,
while Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used
to establish a connection to the data storage.
Hive is also highly scalable: it can serve both large data set processing
(batch query processing) and real-time processing (interactive query
processing).
It supports all primitive data types of SQL.
You can use predefined functions, or write tailored user-defined functions (UDFs), to
accomplish your specific needs.
Apache Mahout
Now, let us talk about Mahout which is renowned for machine learning. Mahout provides an
environment for creating machine learning applications which are scalable.
Apache Spark
Apache Spark is a framework for real-time data analytics in a distributed computing
environment.
Spark is written in Scala and was originally developed at the University of California,
Berkeley.
It executes in-memory computations to increase the speed of data processing over MapReduce.
It can be up to 100x faster than Hadoop for large-scale data processing by exploiting in-memory
computations and other optimizations. It therefore requires higher processing power than
MapReduce.
Spark comes packed with high-level libraries, including support for R, SQL, Python,
Scala and Java. These standard libraries make for seamless integration in complex workflows.
It also allows various services to integrate with it, such as MLlib, GraphX, SQL + DataFrames
and Streaming, to increase its capabilities.
Apache HBase
Apache Zookeeper
Apache Zookeeper is the coordinator of any Hadoop job that involves a combination of
various services in the Hadoop ecosystem.
Apache Zookeeper coordinates with various services in a distributed environment.
Before Zookeeper, coordinating between the different services in the Hadoop ecosystem was very
difficult and time-consuming. The services had many problems with interactions, such as sharing
common configuration while synchronizing data. Even once services are configured, changes in their
configurations make coordination complex and difficult to handle. Grouping and naming services
was also time-consuming.
Due to the above problems, Zookeeper was introduced. It saves a lot of time by
performing synchronization, configuration maintenance, grouping and naming.
Although it’s a simple service, it can be used to build powerful solutions.
Big names like Rackspace, Yahoo! and eBay use this service in many of their use cases, which
gives an idea of the importance of Zookeeper.
Apache Oozie
Consider Apache Oozie as a clock and alarm service inside the Hadoop ecosystem. For Hadoop jobs,
Oozie acts as a scheduler: it schedules Hadoop jobs and binds them together as one logical unit of work.
There are two kinds of Oozie jobs:
1. Oozie workflow: a sequential set of actions to be executed. You can think of it as a
relay race, where each athlete waits for the previous one to complete their leg.
2. Oozie Coordinator: These are the Oozie jobs which are triggered when the data is made
available to it. Think of this as the response-stimuli system in our body. In the same manner as
we respond to an external stimulus, an Oozie coordinator responds to the availability of data
and it rests otherwise.
Apache Flume
Ingesting data is an important part of our Hadoop Ecosystem.
Flume is a service which helps in ingesting unstructured and semi-structured data into
HDFS.
It gives us a reliable and distributed solution that helps in collecting,
aggregating and moving large amounts of data.
It helps us ingest online streaming data from various sources, such as network traffic,
social media, email messages and log files, into HDFS.
Apache Sqoop
Now, let us talk about another data ingesting service, Sqoop. The major difference between Flume
and Sqoop is that Flume ingests unstructured and semi-structured data into HDFS, while Sqoop
imports and exports structured data from relational databases and enterprise data warehouses.
When we submit a Sqoop command, the main task gets divided into sub-tasks, each handled by an
individual map task internally. Each map task imports part of the data into the Hadoop
ecosystem; collectively, all the map tasks import the whole data set.
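The splitting idea can be sketched as follows. The function below is an illustration of dividing an import into per-map-task key ranges, not Sqoop's actual implementation (real Sqoop computes splits from the table's split column, typically the primary key):

```python
# Divide an inclusive key range [lo, hi] into contiguous chunks, one per
# (hypothetical) map task; each task would then import only its own range.

def split_key_range(lo, hi, n_tasks):
    """Return n_tasks contiguous (start, end) ranges covering [lo, hi]."""
    step = (hi - lo + 1) / n_tasks
    splits = []
    for i in range(n_tasks):
        start = lo + round(i * step)
        end = lo + round((i + 1) * step) - 1
        splits.append((start, end))
    return splits

print(split_key_range(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Because each range is disjoint, the map tasks can run in parallel with no coordination, and together they cover the whole table.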
Apache Ambari
Ambari is an Apache Software Foundation Project which aims at making Hadoop ecosystem more
manageable.
It includes software for provisioning, managing and monitoring Apache Hadoop clusters.
Hadoop Releases
There are a few active release series. The 1.x release series is a continuation of the 0.20 release series,
and contains the most stable versions of Hadoop currently available. This series includes secure
Kerberos authentication, which prevents unauthorized access to Hadoop data. Almost all production
clusters use these releases, or derived versions (such as commercial distributions).
The 0.22 and 0.23 release series are currently marked as alpha releases (as of early 2012), but this is
likely to change by the time you read this as they get more real-world testing and become more stable
(consult the Apache Hadoop releases page for the latest status). 0.23 includes several major new
features:
A new MapReduce runtime, called MapReduce 2, implemented on a new system called YARN
(Yet Another Resource Negotiator), which is a general resource management system for
running distributed applications. MapReduce 2 replaces the classic runtime in previous
releases. It is described in more depth in “YARN”.
HDFS federation, which partitions the HDFS namespace across multiple namenodes to support
clusters with very large numbers of files.
HDFS high-availability, which removes the namenode as a single point of failure by supporting
standby namenodes for failover.
The following figure shows the configuration of Hadoop 1.0 and Hadoop 2.0.
Table 3.5 covers features in HDFS and MapReduce. Other projects in the Hadoop ecosystem are
continually evolving too, and picking a combination of components that work well together can be a
challenge.
Table 3.5: Features Supported by Hadoop Release Series