
BIG DATA ANALYTICS

Unit-1

Introduction to Big Data: What is Big Data? Why Big Data is Important? Meet Hadoop, Data,
Data Storage and Analysis, Comparison with other systems, History of Apache Hadoop, Hadoop
Ecosystem, VMWare Installation of Hadoop. Analyzing the Data with Hadoop, Scaling Out.

What is Big Data? Why Big Data is Important?

In today's world, social applications are used extensively, and this results in rapid data growth.

On social media platforms, billions of users connect daily to share information and upload
images, videos, and more. This rising volume of Big Data is no longer just an overhead. Companies
are using it to achieve growth and outperform their competitors.

What is Big Data?

Big Data refers to massive amounts of data produced by different sources such as social media
platforms, web logs, sensors, IoT devices, and many more. It can be structured (like tables
in a DBMS), semi-structured (like XML files), or unstructured (like audio, video, and images).

Traditional database management systems are not able to handle this vast amount of data.

Big data deployments can involve terabytes, petabytes, and even exabytes of data captured over
time.

In simple language, big data is a collection of data that is larger and more complex than traditional
data, and is growing exponentially with time. It is so huge that no traditional data management
software or tool can manage, store, or process it efficiently. So, it needs to be processed step
by step via different methodologies.

Importance of Big data

Every company uses the data it collects in its own way. The more effectively a company uses its
data, the more rapidly it grows.

Companies in the present market need to collect and analyze their data because:
1. Cost Savings

Big Data tools like Apache Hadoop and Spark bring cost savings to businesses that have to store
large amounts of data. These tools also help organizations identify more effective ways of doing
business.

2. Time-Saving

Real-time, in-memory analytics helps companies collect data from various sources. Tools like
Hadoop help them analyze that data immediately, which supports quick decisions based on
what is learned.

3. Understand the market conditions

Big Data analysis helps businesses to get a better understanding of market situations.

For example, analyzing customer purchasing behavior helps a company identify the products that
sell the most and produce them accordingly. This helps companies get ahead of
their competitors.

4. Social Media Listening

Companies can perform sentiment analysis using Big Data tools. This gives them
feedback about their company, that is, who is saying what about the company.

Companies can also use Big Data tools to improve their online presence.

5. Boost Customer Acquisition and Retention

Customers are a vital asset on which any business depends. No business can achieve
success without building a robust customer base. But even with a solid customer base,
companies cannot ignore the competition in the market.

If a company does not know what its customers want, its success will suffer. The result
is a loss of clientele, which has an adverse effect on business growth.

Big data analytics helps businesses identify customer-related trends and patterns. Customer
behavior analysis leads to a profitable business.
6. Solve Advertisers Problem and Offer Marketing Insights

Big data analytics shapes all business operations. It enables companies to fulfill customer
expectations, helps them refine their product lines, and supports more
effective marketing campaigns.

7. The driver of Innovations and Product Development

Big data enables companies to innovate and redevelop their products.

Real-Time Benefits of Big Data

Big Data analytics has spread into every field. As a result, Big Data is used in a
wide range of industries including finance and banking, healthcare, education, government,
retail, manufacturing, and many more.

Many companies such as Amazon, Netflix, Spotify, LinkedIn, and Swiggy use big
data analytics. The banking sector makes extensive use of Big Data analytics. The education sector
is also using data analytics to enhance students' performance as well as to make teaching easier
for instructors.

The Applications of Big Data are

 Banking and Securities

 Communications, Media and Entertainment

 Healthcare Providers

 Education

 Manufacturing and Natural Resources

 Government

 Insurance

 Retail and Wholesale trade

 Transportation

 Energy and Utilities


The Uses of Big Data are

 Location Tracking

 Precision Medicine

 Fraud Detection & Handling

 Advertising

 Entertainment & Media

Real World Big Data Examples

 Discovering consumer shopping habits.

 Personalized marketing.

 Fuel optimization tools for the transportation industry.

 Monitoring health conditions through data from wearables.

 Live road mapping for autonomous vehicles.

 Streamlined media streaming.

 Predictive inventory ordering

Meet Hadoop

Data

We live in the data age. It's not easy to measure the total volume of data stored electronically,
but an IDC estimate put the size of the "digital universe" at 4.4 zettabytes in 2013, with a
forecast of tenfold growth by 2020, to 44 zettabytes. A zettabyte is 10^21 bytes, or equivalently
one thousand exabytes, one million petabytes, or one billion terabytes. That's more than one disk
drive for every person in the world.

This flood of data is coming from many sources. Consider the following:
 The New York Stock Exchange generates about 4−5 terabytes of data per day.
 Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.
 Ancestry.com, the genealogy site, stores around 10 petabytes of data.
 The Internet Archive stores around 18.5 petabytes of data.
 The Large Hadron Collider near Geneva, Switzerland, produces about 30 petabytes of
data per year.

So there’s a lot of data out there. But you are probably wondering how it affects you. Most of the
data is locked up in the largest web properties (like search engines) or in scientific or financial
institutions, isn’t it? Does the advent of big data affect smaller organizations or individuals?

I argue that it does. Take photos, for example. My wife’s grandfather was an avid photographer
and took photographs throughout his adult life. His entire corpus of medium format, slide, and
35mm film, when scanned in at high resolution, occupies around 10 gigabytes. Compare this to
the digital photos my family took in 2008, which take up about 5 gigabytes of space. My family
is producing photographic data at 35 times the rate my wife's grandfather did, and the rate is
increasing every year as it becomes easier to take more and more photos.

The good news is that big data is here. The bad news is that we are struggling to store and
analyze it.
Data Storage and Analysis
The problem is simple: although the storage capacities of hard drives have increased massively
over the years, access speeds—the rate at which data can be read from drives— have not kept up.
One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4
MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years
later, 1-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more
than two and a half hours to read all the data off the disk.

This is a long time to read all data on a single drive—and writing is even slower. The obvious
way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each
holding one hundredth of the data. Working in parallel, we could read the data in under two
minutes.
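
As a rough check of these figures, treating 1 TB as roughly 1,000,000 MB:

1,370 MB ÷ 4.4 MB/s ≈ 311 seconds, i.e. around five minutes
1,000,000 MB ÷ 100 MB/s = 10,000 seconds, i.e. roughly two and three-quarter hours
(1,000,000 MB ÷ 100 drives) ÷ 100 MB/s = 100 seconds per drive in parallel, i.e. well under two minutes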

There is more to reading and writing data in parallel to or from multiple disks than just the speedup, though.
The first problem to solve is hardware failure: as soon as you start using many pieces of
hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is
through replication: redundant copies of the data are kept by the system so that in the event of
failure, there is another copy available. This is how RAID works, for instance, although
Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different
approach.

The second problem is that most analysis tasks need to be able to combine the data in some way,
and data read from one disk may need to be combined with data from any of the other 99 disks.
Various distributed systems allow data to be combined from multiple sources, but doing this
correctly is notoriously challenging. MapReduce provides a programming model that abstracts
the problem from disk reads and writes, transforming it into a computation over sets of keys
and values. The important point for the present discussion is that there are two parts to the
computation—the map and the reduce—and it’s the interface between the two where the
“mixing” occurs. Like HDFS, MapReduce has built-in reliability.

In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis.
What’s more, because it runs on commodity hardware and is open source, Hadoop is affordable.

Comparison with Other Systems


Hadoop isn’t the first distributed system for data storage and analysis, but it has some unique
properties that set it apart from other systems that may seem similar. Here we look at some of
them.
Relational Database Management Systems
Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop
needed?

The answer to these questions comes from another trend in disk drives: seek time is improving
more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular
place on the disk to read or write data. It characterizes the latency of a disk operation, whereas
the transfer rate corresponds to a disk’s bandwidth.
In many ways, MapReduce can be seen as a complement to a Relational Database Management
System (RDBMS). (The differences between the two systems are shown in Table 1-1.)

Table 1-1. RDBMS compared to MapReduce

1. MapReduce suits applications where the data is written once and read many times
(for example, you post a photo to your Facebook profile once and that picture is
viewed by your friends many times), whereas an RDBMS is good for datasets that are
continually updated.
2. An RDBMS suits applications where the data size is limited, say gigabytes, whereas
MapReduce suits applications where the data size is in petabytes.
3. An RDBMS supports both interactive and batch access to data, whereas MapReduce
accesses data in batch mode.
4. The RDBMS schema is static, whereas the MapReduce schema is dynamic.
5. An RDBMS works with structured datasets, whereas MapReduce works with
unstructured (and semi-structured) datasets.
6. RDBMS scaling is nonlinear, whereas MapReduce scaling is linear.

Grid Computing

1. Grid computing works well for predominantly compute-intensive jobs, but it
becomes a problem when nodes need to access larger data volumes (hundreds of
gigabytes), since network bandwidth is the bottleneck and compute nodes
become idle. Hadoop, in contrast, tries to co-locate the data with the compute nodes,
so data access is fast because it is local. This feature, known as data locality, is at
the heart of Hadoop.
2. In grid computing, the data flow is exposed through low-level programming, whereas
Hadoop works at a higher level of programming.
3. Grid computing programs must explicitly manage their own checkpointing and recovery
of tasks, whereas in Hadoop this is managed by the MapReduce processing engine.
4. Grid computing resources are highly expensive as compared to Hadoop.

Volunteer Computing
Volunteer computing projects work by breaking the problem they are trying to solve into small
chunks called work units, which are sent to computers around the world to analyse. Take the
example of how SETI@home works: it sends a work unit of 0.35 MB of radio telescope data,
which takes hours or days to analyse on a typical home computer. When the analysis is completed,
the results are sent back to the server.

Now the question is how MapReduce differs from volunteer computing. MapReduce also works
in a similar way, breaking a problem into independent pieces that are worked on in parallel.
The volunteer computing problem, however, is very CPU-intensive, which makes it suitable for
running on hundreds of thousands of computers across the world because the time to transfer the
work unit is dwarfed by the time to run the computation. Volunteers are donating CPU cycles, not
bandwidth.

History of Apache Hadoop
The work that led to Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin
was the Google File System paper published by Google.

Let's focus on the history of Hadoop in the following steps: -


o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It
is an open source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing that data would
have cost a great deal, which became a problem for the project. This problem
became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on MapReduce. This technique simplifies
data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS
(Nutch Distributed File System). This file system also included MapReduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project,
Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS
(Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo was running two clusters of 1,000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node
cluster in 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.

Hadoop Ecosystem
The Hadoop ecosystem is a platform or suite which provides various services to solve big
data problems. It includes Apache projects and various commercial tools and solutions. There
are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common.
Most of the other tools or solutions are used to supplement or support these major elements. All
these tools work collectively to provide services such as ingestion, analysis, storage and
maintenance of data.

Following are the components that collectively form a Hadoop ecosystem:


 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling

All these toolkits or components revolve around one thing, i.e. data. That is the beauty of
Hadoop: everything revolves around data, which makes processing and analysis easier.
HDFS:

 HDFS is the primary or major component of Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files.
 HDFS consists of two core components i.e.
1. Name node
2. Data Node
 Name Node is the prime node, which contains metadata (data about data) and requires
comparatively fewer resources than the data nodes, which store the actual data. These data
nodes are commodity hardware in the distributed environment, which undoubtedly makes
Hadoop cost effective.
 HDFS maintains all the coordination between the clusters and hardware, thus working at
the heart of the system.

YARN:

 Yet Another Resource Negotiator: as the name implies, YARN helps to
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop system.
 It consists of three major components, i.e.
1. Resource Manager
2. Nodes Manager
3. Application Manager
 The resource manager has the privilege of allocating resources for the applications in the
system, whereas node managers work on the allocation of resources such as CPU, memory, and
bandwidth per machine and later acknowledge the resource manager. The application
manager works as an interface between the resource manager and node managers and
performs negotiations as per the requirements of the two.
MapReduce:

 By making use of distributed and parallel algorithms, MapReduce makes it possible to carry
the processing logic over large data sets and helps to write applications which transform big
data sets into manageable ones.
 MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are (a minimal
sketch in code follows this list):
1. Map() performs sorting and filtering of data, thereby organizing it into groups. Map
generates key-value pair results which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped
data. In short, Reduce() takes the output generated by Map() as input and combines
those tuples into a smaller set of tuples.
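
To make the division of labor between Map() and Reduce() concrete, here is a minimal word-count
sketch written against the standard org.apache.hadoop.mapreduce API. The class and field names
(WordCount, TokenMapper, SumReducer) are illustrative rather than taken from any particular
distribution.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map(): splits each input line into words and emits a (word, 1) pair per word.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(): receives (word, [1, 1, ...]) after the framework groups by key
    // and emits (word, total count).
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

The framework sorts and groups the mapper's (word, 1) pairs by key before calling the reducer,
which is exactly the sort-and-group step described in the list above.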

PIG:
Pig was developed by Yahoo. It works with Pig Latin, a query-based dataflow language broadly
similar to SQL.

 It is a platform for structuring data flows and for processing and analyzing huge data sets.
 Pig does the work of executing commands, and in the background all the activities of
MapReduce are taken care of. After processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on Pig Runtime,
just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a major segment
of the Hadoop Ecosystem.

HIVE:

 With the help of an SQL-like methodology and interface, HIVE performs reading and writing of
large data sets. Its query language is called HQL (Hive Query Language).
 It is highly scalable, as it allows both real-time and batch processing. Also, all
the SQL datatypes are supported by Hive, making query processing easier.
 Like other query-processing frameworks, HIVE comes with two
components: JDBC drivers and the HIVE command line.
 The JDBC and ODBC drivers establish data storage permissions and
connections, whereas the HIVE command line helps in the processing of queries.

Mahout:

 Mahout adds machine-learning capability to a system or application. Machine learning, as the
name suggests, helps the system to develop itself based on patterns,
user/environment interaction, or algorithms.
 It provides various libraries and functionalities such as collaborative filtering, clustering, and
classification, which are core concepts of machine learning. It allows invoking
algorithms as per our need with the help of its own libraries.

Apache Spark:

 It is a platform that handles process-intensive tasks such as batch processing,
interactive or iterative real-time processing, graph processing, and visualization.
 It works on in-memory resources, which makes it faster than disk-based MapReduce in terms of
optimization.
 Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured
data or batch processing, so many companies use the two together.

Apache HBase:

 It is a NoSQL database which supports all kinds of data and is thus capable of handling
anything in a Hadoop database. It provides capabilities similar to Google's BigTable, and so is
able to work on big data sets effectively.
 At times when we need to search for or retrieve a small number of records in a huge
database, the request must be processed within a short span of time. At such times,
HBase comes in handy, as it gives us a fault-tolerant way of storing sparse data.

Other Components: Apart from all of these, there are some other components too that carry
out a huge task in order to make Hadoop capable of processing large datasets. They are as
follows:

 Solr, Lucene: These are two services that perform the task of searching and indexing
with the help of Java libraries. Lucene in particular is a Java library that also provides a
spell-check mechanism; Solr is built on top of Lucene.
 Zookeeper: There was a huge issue of managing coordination and synchronization
among the resources and components of Hadoop, which often resulted in inconsistency.
Zookeeper overcame these problems by providing synchronization, inter-component
communication, grouping, and maintenance.
 Oozie: Oozie performs the task of a scheduler, scheduling jobs and binding
them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow and Oozie
coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially
ordered manner, whereas Oozie coordinator jobs are those that are triggered when some
data or an external stimulus is given to them.

VMWare Installation of Hadoop

Let us start Setup and Installation of VMWare:

1. Install VMware Workstation from the link given below. There are two download options:
one for Windows and the other for Linux. My base operating system is Windows 8, so I chose
VMware for Windows. If your base OS is Linux, choose the VMware for Linux link.

2.Check your VMware Properties.


3. Go to Download Folder.

4. Click the downloaded VMware file and install it.


5. Right-click the VMware software icon and choose "Pin to Taskbar".

6. Open the VMware software and click Next in the installation wizard.

7. Read and Accept the VMware End User license agreement.

Click Next to Continue.

8. Specify the installation directory. You can also enable the enhanced keyboard driver here.

Click Next to continue.

9. You can enable checking for product updates on startup and join the VMware Customer
Experience Improvement Program here.

Click Next to Continue.

10. Select the shortcuts you want to create for easy access to VMware Workstation.

Click Next to Continue.


11. Click Install button to start the installation.

12. The installation will take just a few seconds to complete.

If you have a license key, click License to enter it, or click Finish to
exit the installer.

Installation of Hadoop:
Hadoop Prerequisite Installation

Install JDK

ubuntu@ubuntu-VM$sudo apt update

$sudo apt install openjdk-8-jdk -y

$java -version

$ javac -version

Install OpenSSH server/client

$sudo apt install openssh-server -y

$sudo apt install openssh-client -y

Set up a non-root user for Hadoop

$sudo adduser hadoop

$sudo usermod -aG sudo hadoop

$su - hadoop

Passwordless SSH setup for the hadoop user

hadoop@ubuntu-VM$ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

$cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$chmod 0600 ~/.ssh/authorized_keys

$ssh localhost

Hadoop Installation

Download hadoop from hadoop.apache.org

 Download the binary from https://hadoop.apache.org/releases.html


 Use browser to download or use following wget command:
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
$ls
hadoop-3.2.2.tar.gz

 Untar the file using the "tar" command:


$tar xfz hadoop-3.2.2.tar.gz
$ls
hadoop-3.2.2 hadoop-3.2.2.tar.gz

Hadoop installation/configuration

We need to modify the following files to configure Hadoop successfully:

 bashrc
 hadoop-env.sh
 core-site.xml
 hdfs-site.xml
 mapred-site.xml
 yarn-site.xml

bashrc

$cd /home/hadoop/hadoop-3.2.2

$sudo nano ~/.bashrc

Go to the end of the file and add the following lines:


#Hadoop Related Options
export HADOOP_HOME=/home/hadoop/hadoop-3.2.2
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Then press Ctrl+X, then Y, then Enter to save and exit.

$source ~/.bashrc

hadoop-env.sh

Get path to java home directory

$which javac

Note the output of the above command which will be like /usr/bin/javac

$readlink -f /usr/bin/javac
Note the output of the above command to be used as java home path in the next command.

$sudo nano etc/hadoop/hadoop-env.sh

Go to the line in the file containing the text "#export JAVA_HOME=" and add the following line
after it (using the Java home path from the readlink output above, without the trailing /bin/javac):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

core-site.xml

 Make necessary directories


$mkdir tmpdata

$sudo nano etc/hadoop/core-site.xml

#Add the below lines in this file (between "<configuration>" and "</configuration>")


<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop-3.2.2/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
<description>The name of the default file system.</description>
</property>

hdfs-site.xml

 Make necessary directories


$mkdir -p dfsdata/namenode
$mkdir -p dfsdata/datanode

$sudo nano etc/hadoop/hdfs-site.xml

#Add the below lines in this file (between "<configuration>" and "</configuration>")

<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/hadoop-3.2.2/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/hadoop-3.2.2/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

mapred-site.xml

$sudo nano etc/hadoop/mapred-site.xml

#Add the below lines in this file (between "<configuration>" and "</configuration>")


<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

yarn-site.xml

$sudo nano etc/hadoop/yarn-site.xml

#Add the below lines in this file (between "<configuration>" and "</configuration>")


<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>

Format HDFS name node

$hdfs namenode -format


After a successful format, the namenode will shut down.

Start hadoop cluster

$sbin/start-dfs.sh

Start the YARN resource manager and node managers

$sbin/start-yarn.sh

Check if all daemons are active and running

$jps

Access HADOOP UI from Browser

 Name node interface


http://localhost:9870
 Individual data nodes
http://localhost:9864
 Yarn resource manager
http://localhost:8088

Analyzing the Data with Hadoop

Map and Reduce

MapReduce works by breaking the processing into two phases: the map phase and the reduce
phase. Each phase has key-value pairs as input and output, the types of which may be chosen by
the programmer. The programmer also specifies two functions: the map function and the reduce
function.

The input to our map phase is the raw NCDC (National Climatic Data Center) weather data. We choose a text input format that gives us
each line in the dataset as a text value. The key is the offset of the beginning of the line from the
beginning of the file, but as we have no need for this, we ignore it. Our map function is simple.
We pull out the year and the air temperature, because these are the only fields we are interested
in. In this case, the map function is just a data preparation phase, setting up the data in such a
way that the reduce function can do its work on it: finding the maximum temperature for each
year. The map function is also a good place to drop bad records: here we filter out temperatures
that are missing, suspect, or erroneous. To visualize the way the map works, consider the
following sample lines of input data (some unused columns have been dropped to fit the page,
indicated by ellipses):

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

These lines are presented to the map function as the key-value pairs:

(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)

(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)

(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)

(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)

(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)

The keys are the line offsets within the file, which we ignore in our map function. The map
function merely extracts the year and the air temperature, and emits them
as its output (the temperature values have been interpreted as integers):

(1950, 0)

(1950, 22)

(1950, −11)

(1949, 111)

(1949, 78)

The output from the map function is processed by the MapReduce framework before being sent
to the reduce function. This processing sorts and groups the key-value pairs by key. So,
continuing the example, our reduce function sees the following input:
(1949, [111, 78])

(1950, [0, 22, −11])

Each year appears with a list of all its air temperature readings. All the reduce function has to
do now is iterate through the list and pick up the maximum reading:

(1949, 111)

(1950, 22)

This is the final output: the maximum global temperature recorded in each year. The whole data
flow is illustrated in Figure 2-1. At the bottom of the diagram is a Unix pipeline, which mimics
the whole MapReduce flow and which we will see again later in this chapter when we look at
Hadoop Streaming.

Java MapReduce

Having run through how the MapReduce program works, the next step is to express it in code.
We need three things: a map function, a reduce function, and some code to run the job. The map
function is represented by the Mapper class, which declares an abstract map() method. Example
2-3 shows the implementation of our map function.

Example 2-3. Mapper for the maximum temperature example
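
A sketch of the mapper along the lines of Example 2-3 is shown below, written against the
standard org.apache.hadoop.mapreduce API. The column offsets used (15-19 for the year, 87-92 for
the temperature, 92-93 for the quality code) follow the NCDC record layout assumed in this
example and should be checked against your own data.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Emit (year, temperature) only for readings that are present and of good quality.
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
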
The Mapper class is a generic type, with four formal type parameters that specify the input key,
input value, output key, and output value types of the map function. For the present example, the
input key is a long integer offset, the input value is a line of text, the output key is a year, and the
output value is an air temperature (an integer).

Rather than using built-in Java types, Hadoop provides its own set of basic types that are
optimized for network serialization. These are found in the org.apache.hadoop.io package. Here
we use LongWritable, which corresponds to a Java Long, Text (like Java String), and
IntWritable (like Java Integer).

The map() method is passed a key and a value. We convert the Text value containing the line of
input into a Java String, then use its substring() method to extract the columns we are interested
in. The map() method also provides an instance of Context to write the output to. In this case, we
write the year as a Text object (since we are just using it as a key), and the temperature is
wrapped in an IntWritable. We write an output record only if the temperature is present and the
quality code indicates the temperature reading is OK. The reduce function is similarly defined
using a Reducer, as illustrated in Example 2-4.
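
A matching sketch of the reducer along the lines of Example 2-4, again using the standard
org.apache.hadoop.mapreduce API:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Scan the list of temperatures for this year and keep the largest.
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}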

Again, four formal type parameters are used to specify the input and output types, this time for
the reduce function. The input types of the reduce function must match the output types of the
map function: Text and IntWritable. And in this case, the output types of the reduce function are
Text and IntWritable, for a year and its maximum temperature, which we find by iterating
through the temperatures and comparing each with a record of the highest found so far. The third
piece of code runs the MapReduce job (see Example 2-5).
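
A sketch of the driver along the lines of Example 2-5; the class name MaxTemperature and the
job name string are illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        Job job = Job.getInstance();
        job.setJarByClass(MaxTemperature.class); // locate the JAR containing this class
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}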

A Job object forms the specification of the job and gives you control over how the job is run.
When we run this job on a Hadoop cluster, we will package the code into a JAR file (which
Hadoop will distribute around the cluster). Rather than explicitly specifying the name of the JAR
file, we can pass a class in the Job’s setJarByClass() method, which Hadoop will use to locate
the relevant JAR file by looking for the JAR file containing this class.
Having constructed a Job object, we specify the input and output paths. An input path is
specified by calling the static addInputPath() method on FileInputFormat, and it can be a single
file, a directory (in which case, the input forms all the files in that directory), or a file pattern. As
the name suggests, addInputPath() can be called more than once to use input from multiple paths.

The output path (of which there is only one) is specified by the static setOutputPath() method on
FileOutputFormat. It specifies a directory where the output files from the reduce function are
written. The directory shouldn’t exist before running the job because Hadoop will complain and
not run the job. This precaution is to prevent data loss (it can be very annoying to accidentally
overwrite the output of a long job with that of another).

Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods.
The setOutputKeyClass() and setOutputValueClass() methods control the output types for the
reduce function, and must match what the Reducer class produces. The map output types default
to the same types, so they do not need to be set if the mapper produces the same types as the
reducer (as it does in our case). However, if they are different, the map output types must be set
using the setMapOutputKeyClass() and setMapOutputValueClass() methods.

The input types are controlled via the input format, which we have not explicitly set because we
are using the default TextInputFormat.

After setting the classes that define the map and reduce functions, we are ready to run the job.
The waitForCompletion() method on Job submits the job and waits for it to finish. The single
argument to the method is a flag indicating whether verbose output is generated. When true, the
job writes information about its progress to the console.

The return value of the waitForCompletion() method is a Boolean indicating success (true) or
failure (false), which we translate into the program’s exit code of 0 or 1.

A test run

% export HADOOP_CLASSPATH=hadoop-examples.jar

% hadoop MaxTemperature input/ncdc/sample.txt output

The output was written to the output directory, which contains one output file per reducer. The
job had a single reducer, so we find a single file, named part-r-00000:

% cat output/part-r-00000

1949 111

1950 22

Scaling Out

You’ve seen how MapReduce works for small inputs; now it’s time to take a bird’s-eye view of
the system and look at the data flow for large inputs.

Data Flow
Hadoop does its best to run the map task on a node where the input data resides in HDFS,
because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization.
Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s input split
are running other map tasks, so the job scheduler will look for a free map slot on a node in the
same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node
is used, which results in an inter-rack network transfer. The three possibilities are illustrated in
Figure 2-2.

Dataflow in mapreduce framework:

 The whole data flow with a single reduce task is illustrated in Figure 2-3. The dotted
boxes indicate nodes, the dotted arrows show data transfers on a node, and the solid
arrows show data transfers between nodes.
 The data to be processed through MapReduce is stored in HDFS. The data is
stored in different blocks, in a distributed format.
 Each mapper processes one split at a time. Developers can put their own
business logic in these mappers. Mappers run in parallel on all of the machines.
 The output of the mapper is stored on the local disk, not in HDFS. This output is
shuffled to the reduce tasks.
 When all of the mappers have completed their tasks, their outputs are sorted and merged.
 The reducer takes that data and performs the reduce task, and then the output is stored in
HDFS.

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4

Finally, it’s also possible to have zero reduce tasks. This can be appropriate when you don’t need
the shuffle because the processing can be carried out entirely in parallel. In this case, the only
off-node data transfer is when the map tasks write to HDFS (see Figure 2-5).
Combiner Functions

The Hadoop mapper can create a huge amount of intermediate data, and sending this data to the
reducers can create massive network congestion. Combiners are used
to overcome this problem.

The combiner, also known as a mini-reducer, processes the intermediate data from the mapper. Use
of the combiner is optional.

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to
minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a
combiner function to be run on the map output, and the combiner function’s output forms the
input to the reduce function. Because the combiner function is an optimization, Hadoop does not
provide a guarantee of how many times it will call it for a particular map output record, if at all.
In other words, calling the combiner function zero, one, or many times should produce the same
output from the reducer.

The contract for the combiner function constrains the type of function that may be used. This is
best illustrated with an example. Suppose that for the maximum temperature example, readings
for the year 1950 were processed by two maps (because they were in different splits).

Imagine the first map produced the output:


(1950, 0)

(1950, 20)

(1950, 10)

and the second produced:

(1950, 25)

(1950, 15)

The reduce function would be called with a list of all the values:

(1950, [0, 20, 10, 25, 15])

with output:

(1950, 25)

since 25 is the maximum value in the list. We could use a combiner function that, just like the
reduce function, finds the maximum temperature for each map output. The reduce function
would then be called with: (1950, [20, 25]) and would produce the same output as before. More
succinctly, we may express the function calls on the temperature values in this case as follows:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

Specifying a combiner function

Going back to the Java MapReduce program, the combiner function is defined using the Reducer
class, and for this application, it is the same implementation as the reduce function in
MaxTemperatureReducer. The only change we need to make is to set the combiner class on the Job
(see Example 2-6).

Example 2-6. Application to find the maximum temperature, using a combiner function for
efficiency
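
A sketch along the lines of Example 2-6, which differs from the earlier driver only in the
setCombinerClass() call; the class name MaxTemperatureWithCombiner is illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperatureWithCombiner <input path> <output path>");
            System.exit(-1);
        }

        Job job = Job.getInstance();
        job.setJarByClass(MaxTemperatureWithCombiner.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        // The combiner runs on the map output, cutting down the data shuffled to the reducers.
        job.setCombinerClass(MaxTemperatureReducer.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}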
Running a Distributed MapReduce Job

The same program will run, without alteration, on a full dataset. This is the point of MapReduce:
it scales to the size of your data and the size of your hardware. Here’s one data point: on a 10-
node EC2 cluster running High-CPU Extra Large instances, the program took six minutes to run.
We'll go through the mechanics of running programs on a cluster in the second part of Unit 2.
