
Big Data Analytics Module – 3: Meet Hadoop

Module 3: Meet Hadoop

Syllabus
Data, Data Storage and Analysis, Comparison with Other Systems (RDBMS, Grid Computing,
Volunteer Computing), A Brief History of Hadoop, Apache Hadoop and the Hadoop Ecosystem,
Hadoop Releases and Responses

3.1 Data Flooding

We live in the data age. It’s not easy to measure the total volume of data stored electronically,
but an International Data Corporation (IDC) estimate put the size of the “digital universe”
at 1.48 zettabytes in 2016, with a forecast of more than threefold growth to 4.8 zettabytes by 2020. A
zettabyte is equivalent to one thousand exabytes, one million petabytes, or one billion terabytes.
That is roughly the same order of magnitude as one disk drive for every person in the world.

Today’s rapidly growing volume of big data is often called the data flood. It represents an immense
opportunity for forward-thinking marketers, but to fully leverage the potential within massive
streams of structured and unstructured data, organizations must quickly optimize ad delivery,
evaluate campaign results, improve site selection and retarget ads.

Some of the sources of data flood are:

– The New York Stock Exchange generates 1 TB of new trade data per day
– Facebook hosts about 10 billion photos, taking up 1 PB (= 1,000 TB) of storage
– The Internet Archive stores around 2 PB of data, and is growing at a rate of 20 TB per month
– Ancestry.com, the genealogy site, stores around 2.5 petabytes of data
– The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year


Most of this data is locked up in the largest web properties (like search engines) or in scientific and
financial institutions. Does the advent of “Big Data,” as it is being called, affect smaller organizations
or individuals?

3.1.1 Effect of ‘Big Data’ on smaller organizations and individuals

• Digital photos and individuals’ interactions – phone calls, emails, documents – are captured and
stored for later access.
• The amount of data generated by machines will be even greater than that generated by people:
machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions, and so on.
• The volume of data being made publicly available increases every year, too.
• Data can be shared for anyone to download and analyse:
– Public Data Sets on Amazon Web Services, Infochimps.org, theinfo.org
– The Astrometry.net project, which watches the Astrometry group on Flickr for new photos of
the night sky, analyses each image, and identifies which part of the sky it is from.
• The Astrometry.net project shows the kinds of things that are possible when data is made available
and used for something that was not anticipated by its creator.

3.1.2 Effect of big data in recent years

• Around 6 million developers are currently working on big data worldwide.
• Spending on big data technology is expected to reach $57 billion this year.
• The worldwide business intelligence and big data analytics market was worth $18.5 billion in 2017.
• By 2020, 50% of BI software queries will be made via search, natural language processing or voice
recognition.
• By 2020, about 1.7 MB of data will be created every second for every person on earth.
The good news is that Big Data is here. The bad news is that we are struggling to store and analyse it.

3.2 Data Storage and Analysis

Problem: Storing big data, and supporting the analytics process over it, is difficult.
• The problem is simple: while the storage capacities of hard drives have increased massively
over the years, access speeds (the rate at which data can be read from drives) have not
kept up.


• A typical one-terabyte drive has a transfer speed of around 100 MB/s, so it takes more than two
and a half hours to read all the data off the disk.
• This is a long time to read all the data on a single drive, and writing is even slower.
Solution: The obvious way to reduce the time is to read from multiple disks at once. Imagine we had
100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under
two minutes (the short sketch below reproduces this arithmetic).
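
The arithmetic behind these figures is straightforward; this small Java sketch reproduces it, assuming
the decimal convention 1 TB = 1,000,000 MB and the 100 MB/s transfer rate quoted above.

    public class DriveReadTime {
        public static void main(String[] args) {
            double driveMB = 1_000_000;            // one 1 TB drive, expressed in MB (decimal units)
            double transferMBps = 100;             // sustained transfer rate assumed above
            double oneDriveSec = driveMB / transferMBps;     // ~10,000 seconds
            double hundredDrivesSec = oneDriveSec / 100;     // 100 drives working in parallel
            System.out.printf("One drive:  %.1f hours%n", oneDriveSec / 3600);        // ~2.8 hours
            System.out.printf("100 drives: %.1f minutes%n", hundredDrivesSec / 60);   // ~1.7 minutes
        }
    }
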
There are two main problems to solve when reading from and writing to many disks in parallel:

The first problem is hardware failure: as soon as you start using many pieces of hardware, the
chance that one will fail is fairly high.
A common way of avoiding data loss is through replication: redundant copies of the data are
kept by the system so that in the event of failure, there is another copy available. This is how
RAID works.
Hadoop’s filesystem, the Hadoop Distributed File System (HDFS), takes a slightly different
approach to the same problem.

The second problem is combining the data for analysis: data read from one disk may need to be
combined with the data from any of the other 99 disks. Various distributed systems allow data to be
combined from multiple sources, but doing this correctly is challenging.
MapReduce provides a programming model that abstracts the problem from disk reads and
writes, transforming it into a computation over sets of keys and values.
Like HDFS, MapReduce has built-in reliability.
This, in a nutshell, is what Hadoop provides: reliable storage (HDFS) and an analysis system (MapReduce).

3.3 Comparison of MapReduce with Other Systems

The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire
dataset—or at least a good portion of it—is processed for each query. But this is its power.
MapReduce is a batch query processor, and the ability to run an ad hoc query against your
whole dataset and get the results in a reasonable time is transformative.
It changes the way you think about data and unlocks data that was previously archived on tape
or disk.


It gives people the opportunity to innovate with data.


For example, Mailtrust, Rackspace’s mail division, used Hadoop for processing email logs.
MapReduce is compared with other systems such as RDBMS, grid computing and volunteer
computing as follows:

3.3.1 MapReduce Compared with RDBMS

Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce
needed?
The answer to these questions comes from another trend in disk drives:
Seek time is improving more slowly than transfer rate. Seeking is the process of moving the
disk’s head to a particular place on the disk to read or write data.
It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s
bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write large portions
of the dataset than streaming through it, which operates at the transfer rate.
On the other hand, for updating a small proportion of records in a database, a traditional B-Tree
(the data structure used in relational databases, which is limited by the rate it can perform seeks)
works well.
For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses
Sort/Merge to rebuild the database.
In many ways, MapReduce can be seen as a complement to an RDBMS. The differences
between the two systems are shown in Table 3.2.
Table 3.2: Comparison between MapReduce and RDBMS

Some more differences between MapReduce and RDBMS are listed in Table 3.3.
Table 3.3: Differences between MapReduce and RDBMS
1. MapReduce: a good fit for problems that analyse the whole dataset in a batch fashion.
   RDBMS: good for point queries or updates, where the dataset has been indexed to deliver
   low-latency retrieval.
2. MapReduce: suits applications where the data is written once and read many times.
   RDBMS: good for data that is continually updated.
3. MapReduce: works on semi-structured or unstructured data.
   RDBMS: operates on structured data.
4. MapReduce examples: spreadsheets, images, text, etc.
   RDBMS examples: database tables, XML documents, etc.
5. MapReduce: designed to interpret the data at processing time (schema-on-read).
   RDBMS: designed to interpret the data at load time (schema-on-write).
6. MapReduce: normalization creates a problem in Hadoop because reading a record becomes a
   non-local operation; instead, Hadoop makes it possible to perform streaming reads and writes.
   RDBMS: data is often normalized to avoid redundancy and to retain integrity.
7. MapReduce: can process the data in parallel across a cluster of machines.
   RDBMS: traditional SQL queries do not offer the same degree of parallel processing.

Hadoop systems such as Hive are becoming more interactive and are adding features like indexes and
transactions that make them look more and more like a traditional RDBMS.

3.3.2 Comparison between Grid Computing and MapReduce

Definition: Grid computing is the collection of computer resources from multiple locations to reach
a common goal. The grid can be thought of as a distributed system with non-interactive workloads that
involve a large number of files.
In grid computing, the computers on the network can work on a task together, thus functioning as a
supercomputer.
The High Performance Computing (HPC) and Grid Computing communities have been doing
large-scale data processing for years, using such APIs as Message Passing Interface (MPI).
The approach in HPC is to distribute the work across a cluster of machines, which access a shared
filesystem hosted by a Storage Area Network (SAN).
HPC works well for predominantly compute-intensive jobs, but becomes a problem when nodes
need to access larger data volumes (hundreds of gigabytes) since the network bandwidth is the
bottleneck and compute nodes become idle.
MapReduce tries to collocate the data with the compute node, so data access is fast because it is
local. Data locality is at the heart of MapReduce and is the reason for its good performance.
Recognizing that network bandwidth is the most precious resource in a data center environment
(it is easy to saturate network links by copying data around), MapReduce implementations go
to great lengths to conserve it by explicitly modelling network topology.
MPI gives great control to the programmer, but requires that the programmer explicitly handle the
mechanics of the data flow.
MapReduce operates only at the higher level: the programmer thinks in terms of functions of
key and value pairs, and the data flow is implicit.


Coordinating the processes in grid computing raises two major challenges:
Orchestrating a large-scale distributed computation.
Handling partial failure (when you don’t know whether a remote process has failed or not) while
still making progress with the overall computation.
MapReduce spares the programmer from having to think about failure, since the implementation detects
failed map or reduce tasks and reschedules replacements on machines that are healthy. MapReduce is
able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on
one another.
So, from the programmer’s point of view, the order in which the tasks run doesn’t matter. By contrast,
MPI programs have to explicitly manage their own checkpointing and recovery, which gives more
control to the programmer but makes them more difficult to write.
MapReduce might sound like quite a restrictive programming model, and in a sense it is: you are limited
to key and value types that are related in specified ways, and mappers and reducers run with very limited
coordination between one another.

Figure 3.1: Working of Grid Computing


The differences between grid computing (MPI) and MapReduce are listed in Table 3.4.
Table 3.4: Differences between Grid Computing (MPI) and MapReduce
1. Grid computing/MPI: works well for predominantly compute-intensive jobs, but becomes a problem
   when nodes access larger data volumes, since network bandwidth is the bottleneck and compute
   nodes sit idle.
   MapReduce: Hadoop tries to collocate the data with the compute nodes, so data access is fast
   because of data locality.
2. Grid computing/MPI: programs have to explicitly manage their own checkpointing and recovery,
   which gives more control to the programmer but makes them more difficult to write.
   MapReduce: the implementation detects failed tasks and reschedules replacements on healthy
   machines.
3. Grid computing/MPI: relies on a shared filesystem (a shared storage architecture).
   MapReduce: a shared-nothing architecture.
4. Grid computing/MPI: gives great control to the programmer, but requires that the programmer
   explicitly handle the mechanics of the data flow.
   MapReduce: operates at a higher level; the programmer thinks in terms of functions of key and
   value pairs, and the data flow is implicit.


3.3.3 Volunteer computing

Volunteer computing projects work by breaking the problem they are trying to solve into small
chunks called work units, which are sent to computers around the world to be analysed.
Example: SETI@home.
SETI, the Search for Extra-Terrestrial Intelligence, runs a project called SETI@home in
which volunteers donate CPU time from their otherwise idle computers to analyze radio
telescope data for signs of intelligent life outside earth.
A SETI@home work unit is about 0.35 MB of radio telescope data, and takes hours or days to
analyze on a typical home computer. When the analysis is completed, the results are sent back
to the server, and the client gets another work unit.
As a precaution against cheating, each work unit is sent to three different machines, and at least
two of the results must agree before it is accepted.
Other volunteer computing projects include the Great Internet Mersenne Prime Search (to search
for large prime numbers) and Folding@home (to understand protein folding and how it relates
to disease).

Now the question is: how is MapReduce different from volunteer computing?

MapReduce also works by breaking a problem into independent pieces that are processed in
parallel.
The volunteer computing problem is very CPU-intensive, which makes it suitable for running on
hundreds of thousands of computers across the world, because the time to transfer the work
unit is dwarfed by the time to run the computation on it.
Volunteers are donating CPU cycles, not bandwidth.
MapReduce, by contrast, is designed to run jobs that last minutes or hours on trusted, dedicated
hardware running in a single data center with very high aggregate bandwidth interconnects.
Volunteer computing runs a perpetual computation on untrusted machines on the Internet with
highly variable connection speeds and no data locality.

3.4 A Brief History of Hadoop

– Created by Doug Cutting, the creator of Apache Lucene, the widely used text search library


– Has its origin in Apache Nutch, an open source web search engine, a part of the Lucene
project.
– ‘Hadoop’ was the name that Doug’s kid gave to a stuffed yellow elephant toy

3.4.1 Timeline

– In 2002, Nutch was started


– A working crawler and search system emerged
– Its architecture wouldn’t scale to the billions of pages on the Web
– In 2003, Google published a paper describing the architecture of Google’s distributed
filesystem, GFS
– In 2004, Nutch project implemented the GFS idea into the Nutch Distributed
Filesystem, NDFS
– In 2004, Google published the paper introducing MapReduce
– In 2005, the Nutch developers had a working MapReduce implementation in Nutch
– By the middle of that year, all the major Nutch algorithms had been ported to
run using MapReduce and NDFS
– In Jan. 2006, Doug Cutting joined Yahoo!
– In Feb. 2006, NDFS and MapReduce were moved out of Nutch to form Hadoop, an
independent subproject of Lucene
– Yahoo! provided a dedicated team and the resources to turn Hadoop into a
system that ran at web scale
– In Feb. 2008, Yahoo! announced that its search index was being generated by a 10,000-core
Hadoop cluster
– In Apr. 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data
– In Nov. 2008, Google reported that its MapReduce implementation sorted one terabyte
in 68 seconds
– In May 2009, Yahoo! used Hadoop to sort one terabyte in 62 seconds

3.5 Apache Hadoop and the Hadoop Ecosystem

The Hadoop ecosystem is neither a programming language nor a service; it is a platform, or framework,
for solving big data problems. You can consider it a suite that encompasses a number of services
(ingesting, storing, analysing and maintaining data).
Figure 3.2 shows a typical view of the Hadoop ecosystem:


Figure 3.2: The Hadoop Ecosystem


The components of Hadoop 1.0 and Hadoop 2.0 are listed as follows:
– Common – a set of components and interfaces for filesystems and I/O
– Avro – a serialization system for RPC and persistent data storage
– MapReduce – a distributed data processing model
– YARN – Yet Another Resource Negotiator
– HDFS – a distributed filesystem that runs on large clusters of machines
– Pig – a data flow language and execution environment for large datasets
– Hive – a distributed data warehouse providing a SQL-like query language
– HBase – a distributed, column-oriented database
– Mahout, Spark MLlib – machine learning
– Apache Drill – SQL on Hadoop
– ZooKeeper – a distributed, highly available coordination service
– Sqoop – a tool for efficiently moving data between relational databases and HDFS
– Oozie – job scheduling
– Flume – data ingestion service
– Solr & Lucene – searching and indexing
– Ambari – provisioning, monitoring and maintaining the cluster

The description of each component is as follows:


HDFS

The Hadoop Distributed File System is the core component, or backbone, of the Hadoop ecosystem.
HDFS makes it possible to store different types of large data sets (structured, unstructured
and semi-structured data).


HDFS creates a level of abstraction over the underlying resources, so that we can see the whole of
HDFS as a single unit.
It helps us store our data across various nodes while maintaining metadata about the stored data.
HDFS has two core components: the NameNode and the DataNodes.
1. The NameNode is the main node and it does not store the actual data. It contains
metadata, rather like a log file or a table of contents. Therefore, it requires
less storage but more computational resources.
2. All the actual data is stored on the DataNodes, which therefore require more
storage. DataNodes are commodity hardware (like ordinary laptops and
desktops) in the distributed environment, which is one reason Hadoop solutions are
very cost effective.
3. A client always contacts the NameNode first when writing data; the NameNode responds with
the DataNodes on which the data should be stored, and the client then writes to those
DataNodes, where the data is replicated (a minimal client sketch is shown below).
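
As a rough illustration of how a client interacts with HDFS, the Java sketch below uses the standard
org.apache.hadoop.fs.FileSystem API to write a small file and read it back. The NameNode address and
the path are placeholders; in practice they come from your cluster's configuration files.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
            // conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/hello.txt");        // hypothetical path

            // The client asks the NameNode where to write; the bytes themselves go to DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("stored on DataNodes, tracked by the NameNode");
            }
            // Reading follows the same pattern: metadata from the NameNode, data from DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }
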

MapReduce
MapReduce is the core processing component of the Hadoop ecosystem, as it provides the logic of
processing. In other words, MapReduce is a software framework which helps in writing applications
that process large data sets using distributed and parallel algorithms inside the Hadoop environment.

In a MapReduce program, Map() and Reduce() are two functions.


1. The Map function performs actions like filtering, grouping and sorting.
2. The Reduce function aggregates and summarizes the results produced by the Map
function.
3. The results generated by the Map function are key-value pairs (K, V), which act as the
input for the Reduce function (the classic WordCount job sketched below shows both stages).
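
The sketch below is the standard WordCount example written against the org.apache.hadoop.mapreduce
API: the mapper emits (word, 1) pairs and the reducer sums the counts for each word. The input and
output paths are supplied on the command line and are otherwise assumptions.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);                // emit (word, 1) for every token
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get(); // add up all the 1s for this word
                context.write(key, new IntWritable(sum));    // emit (word, total count)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);       // local aggregation before the shuffle
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
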
Apache PIG

Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution
environment. You can think of the relationship as similar to that between Java and the JVM.


It supports the Pig Latin language, which has a SQL-like command structure.

The compiler internally converts Pig Latin into MapReduce. It produces a sequential set of
MapReduce jobs, and that conversion is an abstraction (it works like a black box).
Pig was initially developed by Yahoo!.

It gives you a platform for building data flows for ETL (Extract, Transform and Load), and for
processing and analysing huge data sets.
How does Pig work?
In Pig, the LOAD command first loads the data. Then we perform various operations on it, such as
grouping, filtering, joining and sorting. Finally, you can either dump the result to the screen or store
it back in HDFS.

Apache Hive

Facebook created Hive for people who are fluent in SQL, so Hive makes them feel at
home while working in the Hadoop ecosystem.
Basically, Hive is a data warehousing component which performs reading, writing and
managing of large data sets in a distributed environment using a SQL-like interface.
HIVE + SQL = HQL

The query language of Hive is called Hive Query Language (HQL), and it is very similar to
SQL.
Hive has two basic components: the Hive command line and the JDBC/ODBC driver.
The Hive command line interface is used to execute HQL commands.
The Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers are used
to establish connections from applications to the data store (a minimal JDBC sketch is given below).
Hive is also highly scalable: it can serve both large data set processing (batch query processing)
and interactive query processing.
It supports all the primitive data types of SQL.
You can use predefined functions, or write tailored user-defined functions (UDFs), to
accomplish your specific needs.
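
To make the JDBC driver mentioned above concrete, here is a minimal Java sketch that connects to a
HiveServer2 instance and runs an HQL query. The host, port, credentials and the sales table are all
assumptions, and the Hive JDBC driver jar must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Placeholder HiveServer2 URL: jdbc:hive2://<host>:<port>/<database>
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {
                // HQL reads like SQL; the 'sales' table is hypothetical.
                ResultSet rs = stmt.executeQuery(
                        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region");
                while (rs.next()) {
                    System.out.println(rs.getString("region") + "\t" + rs.getDouble("total"));
                }
            }
        }
    }
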

Apache Mahout
Now, let us talk about Mahout, which is renowned for machine learning. Mahout provides an
environment for creating scalable machine learning applications.
Apache Spark


Apache Spark is a framework for real-time data analytics in a distributed computing
environment.
Spark is written in Scala and was originally developed at the University of California,
Berkeley.
It executes in-memory computations to increase the speed of data processing over MapReduce.
It can be up to 100x faster than Hadoop MapReduce for large-scale data processing by exploiting
in-memory computations and other optimizations; as a consequence, it needs more memory and
processing power than MapReduce.
Spark comes packed with high-level libraries, including support for R, SQL, Python, Scala and Java.
These standard libraries make it easier to build complex workflows. On top of this, it also integrates
with services such as MLlib, GraphX, SQL + DataFrames and Spark Streaming to increase its
capabilities. A minimal example of Spark's in-memory style of processing follows.
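
The sketch below uses the Java RDD API to cache a dataset in memory and run two queries over it. The
log path and the "ERROR"/"timeout" filters are illustrative assumptions, and local[*] simply runs Spark
inside the JVM for testing.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ErrorCount {
        public static void main(String[] args) {
            // local[*] runs Spark in-process; on a cluster you would use spark-submit instead.
            SparkConf conf = new SparkConf().setAppName("error-count").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");       // hypothetical path
                // cache() keeps the filtered RDD in memory so later queries can reuse it.
                JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR")).cache();
                long total = errors.count();                                       // first action loads and caches
                long timeouts = errors.filter(l -> l.contains("timeout")).count(); // served from memory
                System.out.println(total + " errors, " + timeouts + " timeouts");
            }
        }
    }
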

Apache HBase

HBase is an open source, non-relational, distributed database. In other words, it is a NoSQL
database.
It supports all types of data, which is why it is capable of handling anything and everything
inside a Hadoop ecosystem.
It is modelled after Google’s BigTable, a distributed storage system designed to cope
with large data sets.
HBase was designed to run on top of HDFS and provides BigTable-like capabilities.
It gives us a fault-tolerant way of storing sparse data, which is common in many Big Data use
cases.
HBase itself is written in Java, and HBase applications can also be accessed through REST, Avro and
Thrift APIs.
For better understanding, let us take an example. You have billions of customer emails and you need to
find out the number of customers who have used the word “complaint” in their emails. The request needs
to be processed quickly (i.e. in real time). So, here we are handling a large data set while retrieving a
small amount of data. HBase was designed for solving this kind of problem; a small client sketch follows.
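
The Java sketch below shows the HBase client API: it writes one sparse row keyed by customer id and
reads it back by row key, which is the low-latency access pattern described above. The table name
"emails", the column family "m" and the cluster settings in hbase-site.xml are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseComplaintLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();     // reads hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("emails"))) { // hypothetical table
                // Write one sparse row: row key = customer id, one column in family "m".
                Put put = new Put(Bytes.toBytes("customer-42"));
                put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("subject"),
                              Bytes.toBytes("complaint about billing"));
                table.put(put);
                // Random read by row key: the fast, small-result lookup HBase is built for.
                Result result = table.get(new Get(Bytes.toBytes("customer-42")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("m"), Bytes.toBytes("subject"))));
            }
        }
    }
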

Apache Zookeeper

Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of
various services in a Hadoop Ecosystem.
Apache Zookeeper coordinates with various services in a distributed environment.
Before ZooKeeper, it was very difficult and time consuming to coordinate the different services in the
Hadoop ecosystem. Services had many problems with interactions, such as sharing common
configuration while synchronizing data. Even once services are configured, changes in their
configurations are complex and difficult to handle, and grouping and naming were also
time-consuming.
Because of these problems, ZooKeeper was introduced. It saves a lot of time by
performing synchronization, configuration maintenance, grouping and naming.
Although it is a simple service, it can be used to build powerful solutions.
Big names like Rackspace, Yahoo! and eBay use this service in many of their use cases, which gives you
an idea of the importance of ZooKeeper. A small client sketch follows.
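
As a rough illustration of the configuration-maintenance use case, the Java sketch below connects to a
ZooKeeper ensemble, publishes a configuration value as a znode and reads it back. The ensemble
address, the znode path and the stored value are all assumptions.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher.Event.KeeperState;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkSharedConfig {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Placeholder ensemble address; point it at your ZooKeeper quorum.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
                if (event.getState() == KeeperState.SyncConnected) connected.countDown();
            });
            connected.await();
            // Publish a piece of shared configuration as a persistent znode...
            if (zk.exists("/app-config", false) == null) {
                zk.create("/app-config", "batch.size=128".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            // ...and read it back, as any other process in the cluster could.
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }
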

Apache Oozie
Consider Apache Oozie as a clock and alarm service inside the Hadoop ecosystem. For Hadoop jobs,
Oozie acts as a scheduler: it schedules Hadoop jobs and binds them together as one logical unit of work.
There are two kinds of Oozie jobs:

1. Oozie Workflow: a sequential set of actions to be executed. You can think of it as a
relay race, where each runner waits for the previous one to complete their leg.
2. Oozie Coordinator: Oozie jobs that are triggered when data is made available to them.
Think of this as the stimulus-response system in our body: in the same way that we respond
to an external stimulus, an Oozie coordinator responds to the availability of data and rests
otherwise.

Apache Flume
Ingesting data is an important part of our Hadoop Ecosystem.

Flume is a service which helps in ingesting unstructured and semi-structured data into
HDFS.
It gives us a solution which is reliable and distributed, and helps us in collecting,
aggregating and moving large amounts of data.
It helps us ingest online streaming data from various sources such as network traffic, social
media, email messages and log files into HDFS.

Apache Sqoop
Now, let us talk about another data ingestion service, Sqoop. The major difference between Flume
and Sqoop is that:

Flume only ingests unstructured or semi-structured data into HDFS.

Sqoop can both import and export structured data between an RDBMS or enterprise data
warehouse and HDFS.


When we submit a Sqoop command, the main task is divided into subtasks, each of which is handled by
an individual map task internally. Each map task imports part of the data into the Hadoop
ecosystem; collectively, the map tasks import the whole dataset.
Apache Ambari
Ambari is an Apache Software Foundation project which aims at making the Hadoop ecosystem more
manageable.
It includes software for provisioning, managing and monitoring Apache Hadoop clusters.
Ambari provides:

1. Hadoop cluster provisioning:


It gives us a step-by-step process for installing Hadoop services across a number of hosts.
It also handles the configuration of Hadoop services over a cluster.
2. Hadoop cluster management:
It provides a central management service for starting, stopping and re-configuring
Hadoop services across the cluster.
3. Hadoop cluster monitoring:
For monitoring health and status, Ambari provides us with a dashboard.
The Ambari Alert framework is an alerting service which notifies the user whenever
attention is needed, for example if a node goes down or a node is running low on disk
space.

3.6 Hadoop Releases and Responses

There are a few active release series. The 1.x release series is a continuation of the 0.20 release series,
and contains the most stable versions of Hadoop currently available. This series includes secure
Kerberos authentication, which prevents unauthorized access to Hadoop data. Almost all production
clusters use these releases, or derived versions (such as commercial distributions).

The 0.22 and 0.23 release series are currently marked as alpha releases (as of early 2012), but this is
likely to change by the time you read this as they get more real-world testing and become more stable
(consult the Apache Hadoop releases page for the latest status). 0.23 includes several major new
features:
A new MapReduce runtime, called MapReduce 2, implemented on a new system called YARN
(Yet Another Resource Negotiator), which is a general resource management system for
running distributed applications. MapReduce 2 replaces the classic runtime in previous
releases. It is described in more depth in “YARN”.
HDFS federation, which partitions the HDFS namespace across multiple namenodes to support
clusters with very large numbers of files.


HDFS high-availability, which removes the namenode as a single point of failure by supporting
standby namenodes for failover.
The following figure shows the configuration of Hadoop 1.0 and Hadoop 2.0.

Table 3.5 covers features in HDFS and MapReduce. Other projects in the Hadoop ecosystem are
continually evolving too, and picking a combination of components that work well together can be a
challenge.
Table 3.5: Features Supported by Hadoop Release Series

********** End of Module 3**********
