RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA, BHOPAL


New Scheme Based On AICTE Flexible Curricula
Computer Science and Engineering, V-Semester
Departmental Elective CS- 503 (A) Data Analytics

UNIT-I:
DESCRIPTIVE STATISTICS: Probability Distributions, Inferential Statistics, Inferential
Statistics through hypothesis tests, Regression & ANOVA, Regression, ANOVA (Analysis of
Variance)
UNIT-II:
INTRODUCTION TO BIG DATA: Big Data and its Importance, Four V’s of Big Data,
Drivers for Big Data, Introduction to Big Data Analytics, Big Data Analytics applications.
BIG DATA TECHNOLOGIES: Hadoop’s Parallel World, Data discovery, Open source
technology for Big Data Analytics, cloud and Big Data, Predictive Analytics, Mobile
Business Intelligence and Big Data, Crowd Sourcing Analytics, Inter- and Trans-Firewall
Analytics, Information Management.
UNIT-III:
PROCESSING BIG DATA: Integrating disparate data stores, Mapping data to the
programming framework, Connecting and extracting data from storage, Transforming data
for processing, subdividing data in preparation for Hadoop Map Reduce.
UNIT-IV:
HADOOP MAPREDUCE: Employing Hadoop Map Reduce, Creating the components of
Hadoop Map Reduce jobs, Distributing data processing across server farms, Executing
Hadoop Map Reduce jobs, monitoring the progress of job flows, The Building Blocks of
Hadoop Map Reduce Distinguishing Hadoop daemons, Investigating the Hadoop Distributed
File System Selecting appropriate execution modes: local, pseudo-distributed, fully
distributed.
UNIT-V:
BIG DATA TOOLS AND TECHNIQUES: Installing and Running Pig, Comparison with
Databases, Pig Latin, User-Defined Functions, Data Processing Operators, Installing and
Running Hive, HiveQL, Querying Data, User-Defined Functions, Oracle Big Data.

CS-503 (A) Data Analytics Notes by Dr. Kapil Chaturvedi, Associate Professor, DoCSE, SIRT, Bhopal, MP

UNIT -1

DESCRIPTIVE STATISTICS

Introduction
Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can
be a representation of either the entire population or a sample of a population. Descriptive statistics are
broken down into measures of central tendency and measures of variability (spread). Measures of
central tendency include the mean, median, and mode, while measures of variability include the
standard deviation, variance, the minimum and maximum values, and the kurtosis and
skewness.

PROBABILITY DISTRIBUTIONS
6 Common Probability Distributions every data science professional
should know.
Example.

Suppose you are a teacher at a university. After checking assignments for a week, you graded all
the students. You gave these graded papers to a data entry clerk at the university and told him to
create a spreadsheet containing the grades of all the students. But he stored only the grades
and not the corresponding students.

He made another blunder: he missed a couple of entries in a hurry, and we have no idea whose
grades are missing. Let's find a way to solve this.

One way is that you visualize the grades and see if you can find a trend in the data.


The graph that you have plotted is called the frequency distribution of the data. You see that there is
a smooth curve-like structure that defines our data, but do you notice an anomaly? We have an
abnormally low frequency at a particular score range. So the best guess would be that the missing
values lie in that score range, which would remove the dent in the distribution.

This is how you would try to solve a real-life problem using data analysis. For any data
scientist, student, or practitioner, distributions are a must-know concept. They provide the basis for
analytics and inferential statistics.

While the concept of probability gives us the mathematical calculations, distributions help us
actually visualize what's happening underneath.

The following sections cover some important probability distributions, explained in a
lucid as well as comprehensive manner.

Note: This material assumes a basic knowledge of probability.

Types of Distributions
1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal Distribution
5. Poisson Distribution
6. Exponential Distribution

Common Data Types


Before we jump to the explanation of distributions, let's see what kinds of data we can
encounter. Data can be discrete or continuous.

Discrete data, as the name suggests, can take only specified values. For example, when you roll
a die, the possible outcomes are 1, 2, 3, 4, 5 or 6 and not 1.5 or 2.45.

Continuous data can take any value within a given range. The range may be finite or infinite.
For example, a girl's weight or height, or the length of a road. A girl's weight can be any
value, such as 54 kg, 54.5 kg, or 54.5436 kg.

Now let us start with the types of distributions.

Types of Distributions

1. Bernoulli Distribution
Let’s start with the easiest distribution that is Bernoulli Distribution. It is actually easier to
understand than it sounds!

All you cricket junkies out there! At the beginning of any cricket match, how do you decide who
is going to bat or bowl? A toss! It all depends on whether you win or lose the toss, right? Let's say
if the toss results in a head, you win. Else, you lose. There's no midway.

A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and
a single trial. So the random variable X which has a Bernoulli distribution can take value 1 with
the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.

Here, the occurrence of a head denotes success, and the occurrence of a tail denotes failure.
Probability of getting a head = 0.5 = Probability of getting a tail since there are only two possible
outcomes.

The probability mass function is given by: P(X = x) = p^x (1 − p)^(1−x), where x ∈ {0, 1}.
It can also be written as P(X = 1) = p and P(X = 0) = 1 − p.

The probabilities of success and failure need not be equally likely, like the result of a fight
between me and Undertaker. He is pretty much certain to win. So in this case the probability of my
success is 0.15 while that of my failure is 0.85.

Here, the probability of success (p) is not the same as the probability of failure. A chart of this
example would show the Bernoulli distribution of our fight.


Here, the probability of success = 0.15 and the probability of failure = 0.85. The expected value is
exactly what it sounds like. If I punch you, I may expect you to punch me back. Basically, the expected
value of any distribution is the mean of the distribution. The expected value of a random variable
X from a Bernoulli distribution is found as follows:

E(X) = 1*p + 0*(1-p) = p

The variance of a random variable from a Bernoulli distribution is:

V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)

There are many examples of the Bernoulli distribution, such as whether it is going to rain tomorrow
or not, where rain denotes success and no rain denotes failure, and winning (success) or losing
(failure) a game.
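
As a quick numerical check (an illustrative sketch, not part of the original notes), the Bernoulli PMF, mean and variance above can be verified in Python; the value p = 0.15 is taken from the fight example, and SciPy is assumed to be installed.

# Illustrative sketch: Bernoulli PMF, mean and variance (p = 0.15 from the fight example)
from scipy.stats import bernoulli

p = 0.15
print(bernoulli.pmf(1, p))   # P(X = 1) = p       -> 0.15
print(bernoulli.pmf(0, p))   # P(X = 0) = 1 - p   -> 0.85
print(bernoulli.mean(p))     # E(X) = p           -> 0.15
print(bernoulli.var(p))      # V(X) = p(1 - p)    -> 0.1275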

2. Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are
equally likely, and that is the basis of a uniform distribution. Unlike the Bernoulli distribution, all
n possible outcomes of a uniform distribution are equally likely.

A variable X is said to be uniformly distributed if the density function is:

f(x) = 1 / (b − a)  for a ≤ x ≤ b, and 0 otherwise.

The graph of a uniform distribution curve is flat over the interval [a, b].


You can see that the shape of the uniform distribution curve is rectangular, which is why the
uniform distribution is also called the rectangular distribution.

For a Uniform Distribution, a and b are the parameters.

The number of bouquets sold daily at a flower shop is uniformly distributed with a maximum of
40 and a minimum of 10.

Let’s try calculating the probability that the daily sales will fall between 15 and 30.

The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5

Similarly, the probability that daily sales are greater than 20 is (40 − 20) * (1/(40 − 10)) = 0.667

The mean and variance of X following a uniform distribution is:

Mean -> E(X) = (a+b)/2

Variance -> V(X) = (b-a)²/12

The standard uniform density has parameters a = 0 and b = 1, so the PDF for the standard uniform
density is given by: f(x) = 1 for 0 ≤ x ≤ 1, and 0 otherwise.
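
The flower-shop probabilities above can be reproduced with a short, illustrative Python sketch (SciPy is assumed; note that scipy.stats.uniform is parameterized by loc = a and scale = b − a, which is an implementation detail, not part of the notes).

# Illustrative sketch: uniform distribution for the flower-shop example (a = 10, b = 40)
from scipy.stats import uniform

a, b = 10, 40
daily_sales = uniform(loc=a, scale=b - a)          # X ~ Uniform(10, 40)

print(daily_sales.cdf(30) - daily_sales.cdf(15))   # P(15 < X <= 30) = 0.5
print(1 - daily_sales.cdf(20))                     # P(X > 20) = 0.667 (approx.)
print(daily_sales.mean(), daily_sales.var())       # (a+b)/2 = 25, (b-a)^2/12 = 75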

3. Binomial Distribution
Let's get back to cricket. Suppose that you won the toss today, and this indicates a successful
event. You toss again, but you lose this time. If you win a toss today, this does not mean that
you will win the toss tomorrow. Let's assign a random variable, say X, to the number of times
you won the toss. What can be the possible value of X? It can be any number depending on the
number of times you tossed a coin.

There are only two possible outcomes: a head denoting success and a tail denoting failure.
Therefore, the probability of getting a head = 0.5, and the probability of failure can be easily
computed as q = 1 − p = 0.5.


A distribution where only two outcomes are possible, such as success or failure, gain or loss, win
or lose, and where the probability of success and failure is the same for all the trials, is called a
Binomial Distribution.

The outcomes need not be equally likely. Remember the example of a fight between me and
Undertaker? So, if the probability of success in an experiment is 0.2 then the probability of
failure can be easily computed as q = 1 – 0.2 = 0.8.

Each trial is independent since the outcome of the previous toss doesn’t determine or affect the
outcome of the current toss. An experiment with only two possible outcomes repeated n number
of times is called binomial. The parameters of a binomial distribution are n and p where n is the
total number of trials and p is the probability of success in each trial.

On the basis of the above explanation, the properties of a Binomial Distribution are

Each trial is independent.

There are only two possible outcomes in a trial- either a success or a failure.

A total number of n identical trials are conducted.

The probability of success and failure is the same for all trials. (Trials are identical.)

The mathematical representation of the binomial distribution is given by:

P(X = x) = C(n, x) * p^x * (1 − p)^(n−x),  for x = 0, 1, 2, …, n

where C(n, x) = n! / (x!(n − x)!) is the number of ways of choosing x successes from n trials.

When the probability of success does not equal the probability of failure (p ≠ 0.5), the graph of the
binomial distribution is skewed; when the probability of success equals the probability of failure
(p = 0.5), the graph is symmetric.


The mean and variance of a binomial distribution are given by:

Mean -> µ = n*p

Variance -> Var(X) = n*p*q
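
As an illustrative sketch (the values n = 10 tosses and p = 0.5 are assumed, not taken from the notes), the binomial PMF, mean and variance can be computed with SciPy.

# Illustrative sketch: binomial distribution with n = 10 trials and p = 0.5
from scipy.stats import binom

n, p = 10, 0.5
print(binom.pmf(4, n, p))   # P(X = 4) = C(10,4) * 0.5^4 * 0.5^6, approx. 0.205
print(binom.mean(n, p))     # mean = n*p = 5.0
print(binom.var(n, p))      # variance = n*p*q = 2.5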

4. Normal Distribution
Normal distribution represents the behavior of most of the situations in the universe (that is why
it's called a "normal" distribution, I guess!). The sum of a large number of (small) random variables often
turns out to be normally distributed, which contributes to its widespread application. A distribution
is known as a normal distribution if it has the following characteristics:

The mean, median and mode of the distribution coincide.

The curve of the distribution is bell-shaped and symmetrical about the line x=μ.

The total area under the curve is 1.

Exactly half of the values are to the left of the center and the other half to the right.

A normal distribution is quite different from a binomial distribution. However, if the number of
trials approaches infinity, then the shapes become quite similar.

The PDF of a random variable X following a normal distribution is given by:

f(x) = (1 / (σ √(2π))) e^(−(x − µ)² / (2σ²)),  for −∞ < x < ∞

The mean and variance of a random variable X which is said to be normally distributed is given
by:

Mean -> E(X) = µ

Variance -> Var(X) = σ^2


Here, µ (mean) and σ (standard deviation) are the parameters.


The graph of a random variable X ~ N(µ, σ) is the familiar bell-shaped curve centered at µ.

A standard normal distribution is defined as the distribution with mean 0 and standard deviation
1. For such a case, the PDF becomes:

f(x) = (1 / √(2π)) e^(−x² / 2)
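
The standard normal density can be checked numerically with a small, illustrative Python sketch (µ = 0 and σ = 1 are the standard normal parameters; SciPy is assumed).

# Illustrative sketch: normal and standard normal density values
import math
from scipy.stats import norm

mu, sigma = 0, 1
print(norm.pdf(0, loc=mu, scale=sigma))   # density at the mean, 1/sqrt(2*pi), approx. 0.3989
print(1 / math.sqrt(2 * math.pi))         # same value computed directly from the formula
print(norm.cdf(0, loc=mu, scale=sigma))   # half the total area lies to the left of the mean -> 0.5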

5. Poisson Distribution
Suppose you work at a call center; approximately how many calls do you get in a day? It can be
any number. The total number of calls at a call center in a day can be modeled by a Poisson
distribution. Some more examples are:

The number of emergency calls recorded at a hospital in a day.

The number of thefts reported in an area on a day.

The number of customers arriving at a salon in an hour.

The number of suicides reported in a particular city.

The number of printing errors on each page of a book.

You can now think of many examples following the same course. Poisson Distribution is
applicable in situations where events occur at random points of time and space wherein our
interest lies only in the number of occurrences of the event.


A distribution is called a Poisson distribution when the following assumptions are valid:

1. The occurrence of one event does not influence the occurrence of another event (events are independent).
2. Events occur at a constant average rate: the probability of an event in an interval is proportional to the
length of the interval.
3. The probability of two or more events occurring in a very small interval is negligible (it approaches zero
as the interval becomes smaller).

Now, if any distribution validates the above assumptions then it is a Poisson distribution. Some
notations used in Poisson distribution are:

λ is the rate at which an event occurs,

t is the length of a time interval,

And X is the number of events in that time interval.

Here, X is called a Poisson Random Variable and the probability distribution of X is called
Poisson distribution.

Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.

The PMF of X following a Poisson distribution is given by:

P(X = x) = e^(−µ) µ^x / x!,  for x = 0, 1, 2, …

The mean µ is the parameter of this distribution; µ is defined as λ times the length of the
interval. Comparing Poisson curves for different means shows how the curve shifts as the mean increases.


It is perceptible that as the mean increases, the curve shifts to the right.

The mean and variance of X following a Poisson distribution:

Mean -> E(X) = µ


Variance -> Var(X) = µ
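
As a hedged illustration (the rate λ = 5 calls per hour and interval t = 2 hours are invented for this sketch), the Poisson PMF, mean and variance can be evaluated with SciPy.

# Illustrative sketch: Poisson distribution with lambda = 5 events/hour over t = 2 hours
from scipy.stats import poisson

lam, t = 5, 2
mu = lam * t                # mean number of events in the interval
print(poisson.pmf(8, mu))   # P(X = 8) = e^(-mu) * mu^8 / 8!
print(poisson.mean(mu))     # E(X) = mu = 10
print(poisson.var(mu))      # Var(X) = mu = 10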

6. Exponential Distribution
Let's consider the call center example one more time. What about the interval of time between
the calls? Here, the exponential distribution comes to our rescue. The exponential distribution models
the interval of time between the calls.

Other examples are:

1. Length of time between metro arrivals,


2. Length of time between arrivals at a gas station
3. The life of an Air Conditioner

Exponential distribution is widely used for survival analysis. From the expected life of a machine
to the expected life of a human, exponential distribution successfully delivers the result.

A random variable X is said to have an exponential distribution with PDF:

f(x) = λ e^(−λx) for x ≥ 0, and f(x) = 0 otherwise,

with parameter λ > 0, which is also called the rate.

For survival analysis, λ is called the failure rate of a device at any time t, given that it has
survived up to t.

Mean and Variance of a random variable X following an exponential distribution:

Mean -> E(X) = 1/λ


Variance -> Var(X) = (1/λ)²

Also, the greater the rate, the faster the curve drops, and the lower the rate, the flatter the curve;
a plot of exponential densities for different rates makes this easy to see.

To ease the computation, there are some formulas given below.

P{X ≤ x} = 1 − e^(−λx), which corresponds to the area under the density curve to the left of x.

P{X > x} = e^(−λx), which corresponds to the area under the density curve to the right of x.

P{x1 < X ≤ x2} = e^(−λx1) − e^(−λx2), which corresponds to the area under the density curve between x1 and x2.
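
The three formulas above can be verified with an illustrative Python sketch (the rate λ = 0.5 and the point x = 3 are assumed values; SciPy parameterizes the exponential by scale = 1/λ).

# Illustrative sketch: exponential distribution probabilities for an assumed rate lambda = 0.5
import math
from scipy.stats import expon

lam = 0.5
dist = expon(scale=1 / lam)

x = 3
print(dist.cdf(x), 1 - math.exp(-lam * x))   # P(X <= x) = 1 - e^(-lambda*x), both approx. 0.7769
print(dist.sf(x), math.exp(-lam * x))        # P(X > x) = e^(-lambda*x), both approx. 0.2231
print(dist.mean(), dist.var())               # 1/lambda = 2.0 and (1/lambda)^2 = 4.0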


Mean, Median, Mode, and Range


Mean, median, and mode are three kinds of "averages". There are many "averages" in statistics,
but these are, I think, the three most common, and are certainly the three you are most likely to
encounter in your pre-statistics courses, if the topic comes up at all.

The "mean" is the "average" you're used to, where you add up all the numbers and then divide by
the number of numbers. The "median" is the "middle" value in the list of numbers. To find the
median, your numbers have to be listed in numerical order from smallest to largest, so you may
have to rewrite your list before you can find the median. The "mode" is the value that occurs
most often. If no number in the list is repeated, then there is no mode for the list.

Find the mean, median, mode, and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13

The mean is the usual average, so I'll add and then divide:

(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15

Note that the mean, in this case, isn't a value from the original list. This is a common result. You
should not assume that your mean will be one of your original numbers.

The median is the middle value, so first I'll have to rewrite the list in numerical order:

13, 13, 13, 13, 14, 14, 16, 18, 21

There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th
number:

13, 13, 13, 13, 14, 14, 16, 18, 21

So the median is 14.

The mode is the number that is repeated more often than any other, so 13 is the mode.

The largest value in the list is 21, and the smallest is 13, so the range is 21 – 13 = 8.

mean: 15
median: 14
mode: 13
range: 8
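
The same answers can be reproduced with Python's standard statistics module (an illustrative sketch using the list of values from the example above).

# Illustrative sketch: mean, median, mode and range for the example list
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]
print(statistics.mean(values))     # 15
print(statistics.median(values))   # 14
print(statistics.mode(values))     # 13
print(max(values) - min(values))   # range = 8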


REGRESSION
A regression analysis is a statistical procedure that allows you to make a prediction about an
outcome (or criterion) variable based on knowledge of some predictor variable. To create a
regression model, you first need to collect (a lot of) data on both variables, similar to what you
would do if you were conducting a correlation. Then you would determine the contribution of the
predictor variable to the outcome variable. Once you have the regression model, you would be
able to input an individual’s score on the predictor variable to get a prediction of their score on
the outcome variable.

- Example: You want to try to predict whether a student will come back for a second year
based on how many on-campus activities s/he attended. You would have to collect data on how
many activities students attended and then whether or not those students returned for a second
year. If activity attendance and retention are significantly related to each other, then you can
generate a regression model where you could identify at-risk students (in terms of retention)
based on how many activities they have attended.

- Example: You want to try to identify students who are at risk of failing College Algebra
based on their scores on a math assessment so you can direct them to special services on campus.
You would administer the math assessment at the start of the semester and then match each
student’s score on the math assessment to their final grade in the course. Eventually, your data
may show that the math assessment is significantly correlated to their final grade, and you can
create a regression model to identify those at-risk students so you can direct them to tutors and
other resources on campus.

- Thus, use regression when:

You want to be able to make a prediction about an outcome given what you already know about
some related factor.

Another option with regression is to do a multiple regression, which allows you to make a
prediction about an outcome based on more than just one predictor variable. Many retention
models are essentially multiple regressions that consider factors such as GPA, level of
involvement, and attitude towards academics and learning.

The Linear Regression Equation


Linear regression is a way to model the relationship between two variables. You might also
recognize the equation as the slope-intercept form of a line. The equation has the form Y = a + bX, where Y is the
dependent variable (the variable that goes on the Y axis), X is the independent variable
(i.e. it is plotted on the X axis), b is the slope of the line and a is the y-intercept.

Y’=a+bX


HOW TO FIND A LINEAR REGRESSION EQUATION: STEPS

Step 1: Make a chart of your data, filling in the columns in the same way as you would fill in the chart if you were
finding the Pearson’s Correlation Coefficient.

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY      X²      Y²

1         43        99                  4257    1849    9801
2         21        65                  1365    441     4225
3         25        79                  1975    625     6241
4         42        75                  3150    1764    5625
5         57        87                  4959    3249    7569
6         59        81                  4779    3481    6561
Σ         247       486                 20485   11409   40022

From the above table, Σx = 247, Σy = 486, Σxy = 20485, Σx² = 11409, Σy² = 40022. n is the sample size (6, in our
case).

Step 2: Use the following equations to find a and b.

Y’=a+bX

a = 65.1416
b = .385225
Find a:
 a = (Σy Σx² − Σx Σxy) / (n Σx² − (Σx)²)
 = ((486 × 11,409) − (247 × 20,485)) / (6 × 11,409 − 247²)
 = 484,979 / 7,445
 = 65.14
Find b:
 b = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
 = ((6 × 20,485) − (247 × 486)) / (6 × 11,409 − 247²)
 = (122,910 − 120,042) / (68,454 − 61,009)
 = 2,868 / 7,445
 = 0.385225
Step 3: Insert the values into the equation.
y’ = a + bx
y’ = 65.14 + .385225x
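
The intercept a and slope b computed above can be cross-checked with a short, illustrative Python sketch; numpy.polyfit fits a least-squares straight line to the age and glucose values from the table (NumPy is assumed to be installed).

# Illustrative sketch: least-squares line for the age/glucose data in the table above
import numpy as np

age = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])

b, a = np.polyfit(age, glucose, 1)   # degree-1 fit returns [slope, intercept]
print(a, b)                          # a approx. 65.14, b approx. 0.3852
print(a + b * 43)                    # predicted glucose level for a 43-year-old subject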


Pearson Correlation Coefficient


The Pearson correlation coefficient is a very helpful statistical formula that measures the strength
of the relationship between two variables. In the field of statistics, this formula is often referred to as
the Pearson R test. When conducting a statistical test between two variables, it is a good idea to
compute the Pearson correlation coefficient value to determine just how strong the relationship is
between those two variables.

Formula-
In order to determine how strong the relationship is between two variables, a formula must be
followed to produce what is referred to as the coefficient value. The coefficient value can range
between -1.00 and 1.00. If the coefficient value is in the negative range, then that means the
relationship between the variables is negatively correlated, or as one value increases, the other
decreases. If the value is in the positive range, then that means the relationship between the
variables is positively correlated, or both values increase or decrease together. Let's look at the
formula for conducting the Pearson correlation coefficient value.

Step 1: Make a chart with your data for two variables, labeling the variables (x) and (y), and
add three more columns labeled (xy), (x²), and (y²). Complete the chart using basic
multiplication of the variable values. A completed data chart might look like this:

Person Age (x) Score (y) (xy) (x^2) (y^2)


1 20 30 600 400 900
2 24 20 480 576 400
3 17 27 459 289 729

Step-2: After you have multiplied all the values to complete the chart, add up all of the columns
from top to bottom.
Person Age (x) Score (y) (xy) (x^2) (y^2)
1 20 30 600 400 900
2 24 20 480 576 400
3 17 27 459 289 729
Total 61 77 1539 1265 2029


Step 3: Use the Pearson correlation coefficient formula (shown in Step 6 of the worked example below) to find the value.

Sample question: Find the value of the correlation coefficient from the following
table:

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)
1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81
Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY   X²   Y²
1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81
Step 2: Multiply x and y together to fill the xy column. For example, row 1 would
be 43 × 99 = 4,257.

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY   X²   Y²
1 43 99 4257
2 21 65 1365
3 25 79 1975
4 42 75 3150
5 57 87 4959
6 59 81 4779
Step 3: Take the square of the numbers in the x column, and put the result in the
x² column.

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY   X²   Y²
1 43 99 4257 1849
2 21 65 1365 441
3 25 79 1975 625
4 42 75 3150 1764
5 57 87 4959 3249
6 59 81 4779 3481
Step 4: Take the square of the numbers in the y column, and put the result in the
y² column.


SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY   X²   Y²
1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
Step 5: Add up all of the numbers in the columns and put the result at the bottom
of each column. The Greek letter sigma (Σ) is a short way of saying "sum of."

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY   X²   Y²
1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
Σ 247 486 20485 11409 40022
Step 6: Use the following correlation coefficient formula:

r = [n Σxy − (Σx)(Σy)] / √{[n Σx² − (Σx)²] × [n Σy² − (Σy)²]}

The answer is: 2,868 / 5,413.27 = 0.529809


From our table:

 Σx = 247
 Σy = 486
 Σxy = 20,485
 Σx² = 11,409
 Σy² = 40,022
 n is the sample size, in our case = 6

The correlation coefficient =

[6(20,485) − (247 × 486)] / √{[6(11,409) − 247²] × [6(40,022) − 486²]}


= 0.5298

The range of the correlation coefficient is from -1 to 1. Our result is 0.5298 or


52.98%, which means the variables have a moderate positive correlation.
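
The hand-computed value of r can be confirmed with an illustrative Python sketch (NumPy and SciPy are assumed; the data are the age/glucose pairs from the table above).

# Illustrative sketch: Pearson correlation coefficient for the age/glucose data
import numpy as np
from scipy.stats import pearsonr

age = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])

r, p_value = pearsonr(age, glucose)
print(r)                                 # approx. 0.5298, a moderate positive correlation
print(np.corrcoef(age, glucose)[0, 1])   # same value read from the correlation matrix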

INFERENTIAL STATISTICS
What is the main purpose of inferential statistics?
The main purpose of inferential statistics is to estimate a population characteristic based on a
sample and to determine whether the sample data adequately represent the population; in contrast,
descriptive statistics merely summarize data in a useful and informative manner.
Inferential statistics allows you to make inferences about the population from the sample data.

Population & Sample


A sample is a representative subset of a population. Conducting a census of a population is an
ideal but impractical approach in most cases. Sampling is much more practical; however, it
is prone to sampling error. A sample that is not representative of the population is said to be biased,
and the method chosen for such sampling is called sampling bias. Convenience bias, judgement bias, size
bias, and response bias are the main types of sampling bias. The best technique for reducing bias in
sampling is randomization. Simple random sampling is the simplest of the randomization
techniques; cluster sampling and stratified sampling are other systematic sampling techniques.
Here are two main areas of inferential statistics:
1. Estimating parameters. This means taking a statistic from your sample data (for example
the sample mean) and using it to say something about a population parameter (i.e. the
population mean).
2. Hypothesis tests. This is where you can use sample data to answer research questions. For
example, you might be interested in knowing if a new cancer drug is effective. Or if
breakfast helps children perform better in schools.
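
To make the first area, estimating parameters, concrete, here is a minimal illustrative sketch (the sample values are invented for demonstration) that estimates a population mean and a 95% confidence interval from a sample using SciPy's t distribution.

# Illustrative sketch: estimating a population mean with a 95% confidence interval
import statistics
from scipy import stats

sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]   # invented sample data
sample_mean = statistics.mean(sample)
std_error = stats.sem(sample)                               # standard error of the mean
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=sample_mean, scale=std_error)
print(sample_mean, ci)   # point estimate of the population mean and its 95% interval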

When you have quantitative data, you can analyze it using either descriptive or inferential
statistics. Descriptive statistics do exactly what it sounds like – they describe the data.
Descriptive statistics include measures of central tendency (mean, median, mode), measures of
variation (standard deviation, variance), and relative position (quartiles, percentiles). There are
times, however, when you want to draw conclusions about the data. This may include making
comparisons across time, comparing different groups, or trying to make predictions based on


data that has been collected. Inferential statistics are used when you want to move beyond simple
description or characterization of your data and draw conclusions based on your data. There are
several kinds of inferential statistics that you can calculate; hypothesis testing, covered next, is one
of the most common.

INFERENTIAL STATISTICS THROUGH HYPOTHESIS TESTS

Q. What Is Hypothesis Testing?


Ans.

Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding


a population parameter. The methodology employed by the analyst depends on the nature
of the data used and the reason for the analysis. Hypothesis testing is used to infer the
result of a hypothesis performed on sample data from a larger population.

Q. What is an inferential hypothesis?


Ans.
Inferential statistics arises from sampling theory, which makes inferences about
populations based upon samples taken from those populations. ... The null hypothesis is a
statement that indicates there is no difference between the sample and the population (or
between the population means for the treatment and control groups).

Real World Example of Hypothesis Testing


If, for example, a person wants to test whether a penny has exactly a 50% chance of landing
on heads, the null hypothesis would be that it does, and the alternative hypothesis would be that it
does not. Mathematically, the null hypothesis would be represented as Ho:
P = 0.5. The alternative hypothesis would be denoted as "Ha" and be identical to the null
hypothesis, except with the equal sign struck through, meaning that P does not equal
0.5 (Ha: P ≠ 0.5).

A random sample of 100 coin flips is taken from a random population of coin flippers,
and the null hypothesis is then tested. If it is found that the 100 coin flips were distributed
as 40 heads and 60 tails, the analyst would assume that a penny does not have a 50%
chance of landing on heads and would reject the null hypothesis and accept the
alternative hypothesis. Afterward, a new hypothesis would be tested, this time that a
penny has a 40% chance of landing on heads.

Four Steps of Hypothesis Testing


All hypotheses are tested using a four-step process:

1. The first step is for the analyst to state the two hypotheses so that only one can be
right.
2. The next step is to formulate an analysis plan, which outlines how the data will be
evaluated.
3. The third step is to carry out the plan and physically analyze the sample data.
4. The fourth and final step is to analyze the results and either accept or reject the
null hypothesis.
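
A hedged sketch of the coin-flip example above: SciPy's binomtest (available in SciPy 1.7 and later) computes the probability of seeing a result at least as extreme as 40 heads in 100 flips if the coin were actually fair.

# Illustrative sketch: testing H0: P = 0.5 against Ha: P != 0.5 with 40 heads in 100 flips
from scipy.stats import binomtest   # requires SciPy >= 1.7

result = binomtest(k=40, n=100, p=0.5, alternative='two-sided')
print(result.pvalue)   # approx. 0.057; compare against the chosen significance level
# If the p-value falls below the significance level (for example 0.05), reject the null hypothesis;
# otherwise, fail to reject it.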


ANOVA (ANALYSIS OF VARIANCE)

Analysis of Variance (ANOVA) is a parametric statistical technique used to compare datasets.


This technique was invented by R.A. Fisher, and is thus often referred to as Fisher’s ANOVA, as
well. It is similar in application to techniques such as t-test and z-test, in that it is used to
compare means and the relative variance between them. However, analysis of variance
(ANOVA) is best applied where more than 2 populations or samples are meant to be compared.


The use of this parametric statistical technique involves certain key assumptions, including the
following:

1. Independence of case: Independence of case assumption means that the case of the
dependent variable should be independent or the sample should be selected randomly. There
should not be any pattern in the selection of the sample.

2. Normality: Distribution of each group should be normal. The Kolmogorov-Smirnov or the


Shapiro-Wilk test may be used to confirm normality of the group.

3. Homogeneity: Homogeneity means variance between the groups should be the same.
Levene’s test is used to test the homogeneity between groups.

If particular data follows the above assumptions, then the analysis of variance (ANOVA) is the
best technique to compare the means of two, or more, populations.

Analysis of variance (ANOVA) has three types:

One-way analysis: When we are comparing three or more groups based on one factor
variable, it is said to be a one-way analysis of variance (ANOVA). For example, we may want to
compare whether or not the mean output of three workers is the same, based on the working
hours of the three workers.

Two-way analysis: When two factor variables are involved, it is said to be a two-way
analysis of variance (ANOVA). For example, based on working conditions and working hours,
we can compare whether or not the mean output of three workers is the same.

K-way analysis: When there are k factor variables, it is said to be a k-way analysis of variance
(ANOVA).
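
As a minimal illustration of a one-way ANOVA (the output figures for the three workers are invented, not taken from the notes), scipy.stats.f_oneway compares the group means directly.

# Illustrative sketch: one-way ANOVA on invented output figures for three workers
from scipy.stats import f_oneway

worker_a = [52, 55, 53, 58, 54]
worker_b = [49, 51, 50, 48, 52]
worker_c = [60, 62, 59, 61, 63]

f_stat, p_value = f_oneway(worker_a, worker_b, worker_c)
print(f_stat, p_value)   # a small p-value suggests the mean outputs are not all equal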


UNIT-2

What is Big Data? And Why Is It Important to Me?

According to Gartner:
Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making
From Wikipedia:
Big Data is a broad term for data sets so large or complex that they are difficult to process
using traditional data processing applications. Challenges include analysis, capture,
curation, search, sharing, storage, transfer, visualization, and information privacy.

Four V’s of Big Data


The IT industry, in an attempt to quantify what is and isn't Big Data, has come up with
what are known as the "V's" of Big Data. The foundational three are volume, velocity and
variety; the further V's listed after them are often added:

 Volume: The amount of data is immense. Each day 2.3 trillion gigabytes of new
data is being created.
 Velocity: The speed of data (always in flux) and processing (analysis of streaming
data to produce near or real time results)
 Variety: The different types of data, structured, as well as, unstructured.
 Visibility: This dimension refers to a customer's ability to see and track their experience
or order through the operations process. High-visibility examples include courier
companies, where you can track your package online, or
a retail store where you pick up the goods and purchase them over the counter.
 Value: Value is the end game. After addressing volume, velocity, variety,
variability, veracity, and visualization – which takes a lot of time, effort and
resources – you want to be sure your organization is getting value from the data.
 Variability: Variability is different from variety. A coffee shop may offer 6
different blends of coffee, but if you get the same blend every day and it tastes
different every day, that is variability. The same is true of data; if the meaning is
constantly changing it can have a huge impact on your data homogenization.


It is the combination of these factors, high-volume, high-velocity and high-variety that


serves as the basis for data to be termed Big Data. Big Data platforms and solutions
provide the tools, methods and technologies used to capture, curate, store and search
& analyze the data to find new correlations, relationships and trends that were previously
unavailable.

Introduction to Big Data Analytics

Big data analytics examines large amounts of data to uncover hidden patterns,
correlations and other insights. With today’s technology, it’s possible to analyze your
data and get answers from it immediately. Big Data Analytics helps you to understand
your organization better. With the use of Big data analytics, one can make informed
decisions without blindly relying on guesses.

And it can help answer the following types of questions:

 What actually happened?


 How or why did it happen?
 What’s happening now?
 What is likely to happen next?

Big Data Analytics Applications:

The primary goal of Big Data applications is to help companies make more informed
business decisions by analyzing large volumes of data. The data could include web server logs,
Internet clickstream data, social media content and activity reports, text from customer
emails, mobile phone call details and machine data captured by multiple sensors.

Organisations from different domains are investing in Big Data applications for
examining large data sets to uncover hidden patterns, unknown correlations, market
trends, customer preferences and other useful business information. Here we will
be covering:

 Big Data Applications in Healthcare


 Big Data Applications in Manufacturing
 Big Data Applications in Media & Entertainment
 Big Data Applications in IoT
 Big Data Applications in Government


Big Data Applications: Healthcare

 Big Data Applications: Manufacturing


o Product quality and defects tracking
o Supply planning
o Manufacturing process defect tracking
o Output forecasting
o Increasing energy efficiency
o Testing and simulation of new manufacturing processes
o Support for mass-customization of manufacturing

 Big Data Applications: Media & Entertainment

o Predicting what the audience wants


o Scheduling optimization
o Increasing acquisition and retention


o Ad targeting
o Content monetization and new product development

Big Data Applications: Internet of Things (IoT)

Big Data Applications: Government

o Cyber security & Intelligence
o Crime Prediction and Prevention
o Pharmaceutical Drug Evaluation


o Scientific Research
o Weather Forecasting

BIG DATA TECHNOLOGIES

The list of technology vendors offering big data solutions is seemingly infinite. Many of
the big data solutions that are particularly popular right now fit into one of the following
15 categories:

1. The Hadoop Ecosystem

While Apache Hadoop may not be as dominant as it once was, it's nearly impossible to
talk about big data without mentioning this open source framework for distributed
processing of large data sets. Last year, Forrester predicted, "100% of all large enterprises
will adopt it (Hadoop and related technologies such as Spark) for big data
analytics within the next two years."

Over the years, Hadoop has grown to encompass an entire ecosystem of related software,
and many commercial big data solutions are based on Hadoop. In fact, Zion Market
Research forecasts that the market for Hadoop-based products and services will continue
to grow at a 50 percent CAGR through 2022, when it will be worth $87.14 billion, up
from $7.69 billion in 2016.

Key Hadoop vendors include Cloudera, Hortonworks and MapR, and the leading public
clouds all offer services that support the technology.

2. Spark

Apache Spark is part of the Hadoop ecosystem, but its use has become so widespread that
it deserves a category of its own. It is an engine for processing big data within Hadoop,
and it's up to one hundred times faster than the standard Hadoop engine, MapReduce.

In the AtScale 2016 Big Data Maturity Survey, 25 percent of respondents said that they
had already deployed Spark in production, and 33 percent more had Spark projects in
development. Clearly, interest in the technology is sizable and growing, and many
vendors with Hadoop offerings also offer Spark-based products.

3. R

R, another open source project, is a programming language and software environment designed for
working with statistics. The darling of data scientists, it is managed by the R Foundation
and available under the GPL 2 license. Many popular integrated development
environments (IDEs), including Eclipse and Visual Studio, support the language.
Several organizations that rank the popularity of various programming languages say that
R has become one of the most popular languages in the world. For example,
the IEEE says that R is the fifth most popular programming language, and


both Tiobe and RedMonk rank it 14th. This is significant because the programming
languages near the top of these charts are usually general-purpose languages that can be
used for many different kinds of work. For a language that is used almost exclusively for
big data projects to be so near the top demonstrates the significance of big data and the
importance of this language in its field.

4. Data Lakes

To make it easier to access their vast stores of data, many enterprises are setting up data
lakes. These are huge data repositories that collect data from many different sources and
store it in its natural state. This is different than a data warehouse, which also collects
data from disparate sources, but processes it and structures it for storage. In this case, the
lake and warehouse metaphors are fairly accurate. If data is like water, a data lake is
natural and unfiltered like a body of water, while a data warehouse is more like a
collection of water bottles stored on shelves.

Data lakes are particularly attractive when enterprises want to store data but aren't yet
sure how they might use it. A lot of Internet of Things (IoT) data might fit into that
category, and the IoT trend is playing into the growth of data lakes.

MarketsandMarkets predicts that data lake revenue will grow from $2.53 billion in 2016
to $8.81 billion by 2021.

5. NoSQL Databases

Traditional relational database management systems (RDBMSes) store information


in structured, defined columns and rows. Developers and database administrators query,
manipulate and manage the data in those RDBMSes using a special language known as
SQL.

NoSQL databases specialize in storing unstructured data and providing fast performance,
although they don't provide the same level of consistency as RDBMSes. Popular NoSQL
databases include MongoDB, Redis, Cassandra, Couchbase and many others; even the
leading RDBMS vendors like Oracle and IBM now also offer NoSQL databases.

NoSQL databases have become increasingly popular as the big data trend has grown.
According to Allied Market Research the NoSQL market could be worth $4.2 billion by
2020. However, the market for RDBMSes is still much, much larger than the market for
NoSQL.

MongoDB is one of several well-known NoSQL databases.

6. Predictive Analytics
Predictive analytics is a sub-set of big data analytics that attempts to forecast future
events or behavior based on historical data. It draws on data mining, modeling and
machine learning techniques to predict what will happen next. It is often used for fraud
detection, credit scoring, marketing, finance and business analysis purposes.


In recent years, advances in artificial intelligence have enabled vast improvements in the
capabilities of predictive analytics solutions. As a result, enterprises have begun to invest
more in big data solutions with predictive capabilities. Many vendors, including
Microsoft, IBM, SAP, SAS, Statistica, RapidMiner, KNIME and others, offer predictive
analytics solutions. Zion Market Research says the Predictive Analytics market generated
$3.49 billion in revenue in 2016, a number that could reach $10.95 billion by 2022.

7. In-Memory Databases

In any computer system, the memory, also known as the RAM, is orders of magnitude
faster than the long-term storage. If a big data analytics solution can process data that is
stored in memory, rather than data stored on a hard drive, it can perform dramatically
faster. And that's exactly what in-memory database technology does.

Many of the leading enterprise software vendors, including SAP, Oracle, Microsoft and
IBM, now offer in-memory database technology. In addition, several smaller companies
like Teradata, Tableau, VoltDB and DataStax offer in-memory database solutions.
Research from MarketsandMarkets estimates that total sales of in-memory technology
were $2.72 billion in 2016 and may grow to $6.58 billion by 2021.

8. Big Data Security Solutions

Because big data repositories present an attractive target to hackers and advanced
persistent threats, big data security is a large and growing concern for enterprises. In the
AtScale survey, security was the second fastest-growing area of concern related to big
data.

According to the IDG report, the most popular types of big data security solutions include
identity and access controls (used by 59 percent of respondents), data encryption (52
percent) and data segregation (42 percent). Dozens of vendors offer big data security
solutions, and Apache Ranger, an open source project from the Hadoop ecosystem, is
also attracting growing attention.

9. Big Data Governance Solutions

Closely related to the idea of security is the concept of governance. Data governance is a
broad topic that encompasses all the processes related to the availability, usability and
integrity of data. It provides the basis for making sure that the data used for big data
analytics is accurate and appropriate, as well as providing an audit trail so that business
analysts or executives can see where data originated.

In the NewVantage Partners survey, 91.8 percent of the Fortune 1000 executives
surveyed said that governance was either critically important (52.5 percent) or important
(39.3 percent) to their big data initiatives. Vendors offering big data governance tools
include Collibra, IBM, SAS, Informatica, Adaptive and SAP.

10. Self-Service Capabilities


With data scientists and other big data experts in short supply — and commanding large
salaries — many organizations are looking for big data analytics tools that allow business
users to self-service their own needs. In fact, a report from Research and
Markets estimates that the self-service business intelligence market generated $3.61
billion in revenue in 2016 and could grow to $7.31 billion by 2021. And Gartner has
noted, "The modern BI and analytics platform emerged in the last few years to meet new
organizational requirements for accessibility, agility and deeper analytical insight,
shifting the market from IT-led, system-of-record reporting to business-led, agile
analytics including self-service."

Hoping to take advantage of this trend, multiple business intelligence and big data
analytics vendors, such as Tableau, Microsoft, IBM, SAP, Splunk, Syncsort, SAS,
TIBCO, Oracle and others, have added self-service capabilities to their solutions. Time will
tell whether any or all of the products turn out to be truly usable by non-experts and
whether they will provide the business value organizations are hoping to achieve with
their big data initiatives.

11. Artificial Intelligence

While the concept of artificial intelligence (AI) has been around nearly as long as there
have been computers, the technology has only become truly usable within the past couple
of years. In many ways, the big data trend has driven advances in AI, particularly in two
subsets of the discipline: machine learning and deep learning.

The standard definition of machine learning is that it is technology that gives "computers
the ability to learn without being explicitly programmed." In big data analytics, machine
learning technology allows systems to look at historical data, recognize patterns, build
models and predict future outcomes. It is also closely associated with predictive analytics.

Deep learning is a type of machine learning technology that relies on artificial neural
networks and uses multiple layers of algorithms to analyze data. As a field, it holds a lot
of promise for allowing analytics tools to recognize the content in images and videos and
then process it accordingly.

Experts say this area of big data tools seems poised for a dramatic takeoff. IDC has
predicted, "By 2018, 75 percent of enterprise and ISV development will include
cognitive/AI or machine learning functionality in at least one application, including all
business analytics tools."

Leading AI vendors with tools related to big data include Google, IBM, Microsoft and
Amazon Web Services, and dozens of small startups are developing AI technology (and
getting acquired by the larger technology vendors).

12. Streaming analytics

As organizations have become more familiar with the capabilities of big data analytics
solutions, they have begun demanding faster and faster access to insights. For these


enterprises, streaming analytics with the ability to analyze data as it is being created, is
something of a holy grail. They are looking for solutions that can accept input from
multiple disparate sources, process it and return insights immediately — or as close to it
as possible. This is particularly desirable when it comes to new IoT deployments, which are
helping to drive the interest in streaming big data analytics.

Several vendors offer products that promise streaming analytics capabilities. They
include IBM, Software AG, SAP, TIBCO, Oracle, DataTorrent, SQLstream, Cisco,
Informatica and others. MarketsandMarkets believes the streaming analytics solutions
brought in $3.08 billion in revenue in 2016, which could increase to $13.70 billion by
2021.

13. Edge Computing

In addition to spurring interest in streaming analytics, the IoT trend is also generating
interest in edge computing. In some ways, edge computing is the opposite of cloud
computing. Instead of transmitting data to a centralized server for analysis, edge
computing systems analyze data very close to where it was created — at the edge of the
network.

The advantage of an edge computing system is that it reduces the amount of information
that must be transmitted over the network, thus reducing network traffic and related costs.
It also decreases demands on data centers or cloud computing facilities, freeing up
capacity for other workloads and eliminating a potential single point of failure.

While the market for edge computing, and more specifically for edge computing
analytics, is still developing, some analysts and venture capitalists have begun calling the
technology the "next big thing."

14. Blockchain

Also a favorite with forward-looking analysts and venture capitalists, blockchain is the
distributed database technology that underlies Bitcoin digital currency. The unique
feature of a blockchain database is that once data has been written, it cannot be deleted or
changed after the fact. In addition, it is highly secure, which makes it an excellent choice
for big data applications in sensitive industries like banking, insurance, health care, retail
and others.

Blockchain technology is still in its infancy and use cases are still developing. However,
several vendors, including IBM, AWS, Microsoft and multiple startups, have rolled out
experimental or introductory solutions built on blockchain technology.

15. Prescriptive Analytics

Many analysts divide big data analytics tools into four big categories. The first,
descriptive analytics, simply tells what happened. The next type, diagnostic analytics,
goes a step further and provides a reason for why events occurred. The third type,


predictive analytics, discussed in depth above, attempts to determine what will happen
next. This is as sophisticated as most analytics tools currently on the market can get.

However, there is a fourth type of analytics that is even more sophisticated, although very
few products with these capabilities are available at this time. Prescriptive analytics
offers advice to companies about what they should do in order to make a desired result
happen. For example, while predictive analytics might give a company a warning that the
market for a particular product line is about to decrease, prescriptive analytics will
analyze various courses of action in response to those market changes and forecast the
most likely results.

Currently, very few enterprises have invested in prescriptive analytics, but many analysts
believe this will be the next big area of investment after organizations begin experiencing
the benefits of predictive analytics.

HADOOP’S PARALLEL WORLD


HADOOP

Hadoop is an open source distributed processing framework that manages data processing
and storage for big data applications running in clustered systems. It is at the center of a
growing ecosystem of big data technologies that are primarily used to support advanced
analytics initiatives, including predictive analytics, data mining and machine learning
applications. Hadoop can handle various forms of structured and unstructured data,
giving users more flexibility for collecting, processing and analyzing data than relational
databases and data warehouses provide.

HADOOP AND BIG DATA

Hadoop runs on clusters of commodity servers and can scale up to support thousands of
hardware nodes and massive amounts of data. It uses a namesake distributed file system
that's designed to provide rapid data access across the nodes in a cluster, plus fault-
tolerant capabilities so applications can continue to run if individual nodes fail.
Consequently, Hadoop became a foundational data management platform for big data
analytics uses after it emerged in the mid-2000s.

HISTORY OF HADOOP

Hadoop was created by computer scientists Doug Cutting and Mike Cafarella, initially to
support processing in the Nutch open source search engine and web crawler. After
Google published technical papers detailing its Google File System (GFS) and


MapReduce programming framework in 2003 and 2004, Cutting and Cafarella modified
earlier technology plans and developed a Java-based MapReduce implementation and a
file system modeled on Google's.

In early 2006, those elements were split off from Nutch and became a separate Apache
subproject, which Cutting named Hadoop after his son's stuffed elephant. At the same
time, Cutting was hired by internet services company Yahoo, which became the first
production user of Hadoop later in 2006.

Use of the framework grew over the next few years, and three independent Hadoop
vendors were founded: Cloudera in 2008, MapR a year later and Hortonworks as a Yahoo
spinoff in 2011. In addition, AWS launched a Hadoop cloud service called Elastic
MapReduce in 2009. That was all before Apache released Hadoop 1.0.0, which became
available in December 2011 after a succession of 0.x releases.

HOW HADOOP WORKS AND ITS IMPORTANCE

Put simply: Hadoop has two main components. The first component, the Hadoop
Distributed File System (HDFS), helps split the data, put it on different nodes, replicate it and
manage it. The second component, MapReduce, processes the data on each node in
parallel and calculates the results of the job; a minimal word-count sketch is given after the
list below. There is also a resource manager, YARN, which schedules and manages the data
processing jobs across the cluster.

Hadoop is important because:

 it can store and process vast amounts of structured and unstructured data, quickly.

 application and data processing are protected against hardware failure. So if one node
goes down, jobs are redirected automatically to other nodes to ensure that the distributed
computing doesn't fail.

 the data doesn't have to be preprocessed before it's stored. Organizations can store as
much data as they want, including unstructured data, such as text, videos and images, and
decide how to use it later.

 it's scalable, so companies can add nodes to enable their systems to handle more data.

 it can analyze data in real time to enable better decision making.
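
To make the MapReduce idea concrete, here is a minimal, illustrative word-count mapper and reducer written in Python for Hadoop Streaming (a hedged sketch, not taken from these notes; Hadoop Streaming pipes input lines through the scripts via standard input and output, and sorts the mapper output by key before the reducer runs).

# Illustrative sketch: word-count mapper and reducer for Hadoop Streaming
import sys
from itertools import groupby

def mapper(lines):
    # Emit one tab-separated (word, 1) pair per word.
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    # Input arrives sorted by key, so counts for a word can be summed with groupby.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{word}\t{total}")

if __name__ == "__main__":
    # Run as "python wordcount.py map" for the mapper or "python wordcount.py reduce" for the reducer.
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)

On a real cluster this pair would typically be submitted through the Hadoop Streaming jar with its -mapper, -reducer, -input and -output options; the exact invocation depends on the installation.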

HADOOP APPLICATIONS

YARN greatly expanded the applications that Hadoop clusters can handle to include
stream processing and real-time analytics applications run in tandem with processing
engines, like Apache Spark and Apache Flink. For example, some manufacturers are
using real-time data that's streaming into Hadoop in predictive maintenance applications
to try to detect equipment failures before they occur. Fraud detection, website
personalization and customer experience scoring are other real-time use cases.

Because Hadoop can process and store such a wide assortment of data, it enables
organizations to set up data lakes as expansive reservoirs for incoming streams of
information. In a Hadoop data lake, raw data is often stored as is so data scientists and
other analysts can access the full data sets if need be; the data is then filtered and
prepared by analytics or IT teams as needed to support different applications.

Data lakes generally serve different purposes than traditional data warehouses that hold
cleansed sets of transaction data. But, in some cases, companies view their Hadoop data
lakes as modern-day data warehouses. Either way, the growing role of big data analytics
in business decision-making has made effective data governance and data security
processes a priority in data lake deployments.

Customer analytics -- examples include efforts to predict customer churn, analyze
clickstream data to better target online ads to web users, and track customer sentiment
based on comments about a company on social networks. Insurers use Hadoop for
applications such as analyzing policy pricing and managing safe driver discount
programs. Healthcare organizations look for ways to improve treatments and patient
outcomes with Hadoop's aid.

Risk management -- financial institutions use Hadoop clusters to develop more accurate
risk analysis models for their customers. Financial services companies can use Hadoop to
build and run applications to assess risk, build investment models and develop trading
algorithms.

Predictive maintenance -- with input from IoT devices feeding data into big data
programs, companies in the energy industry can use Hadoop-powered analytics to help
predict when equipment might fail to determine when maintenance should be performed.

Operational intelligence -- Hadoop can help telecommunications firms get a better
understanding of switching, frequency utilization and capacity use for capacity planning
and management. By analyzing how services are consumed as well as the bandwidth in
specific regions, they can determine the best places to locate new cell towers, for
example. In addition, by capturing and analyzing the data that’s produced by the
infrastructure and by sensors, telcos can more quickly respond to problems in the
network.

Supply chain risk management -- manufacturing companies, for example, can track the
movement of goods and vehicles so they can determine the costs of various transportation
options. Using Hadoop, manufacturers can analyze large amounts of historical, time-
stamped location data as well as map out potential delays so they can optimize their
delivery routes.

BEST BIG DATA ANALYTICS TOOLS


Big Data Analytics software is widely used in providing meaningful analysis of a large
set of data. This software helps in finding current market trends, customer preferences,
and other information.

Here are the 8 top Big Data Analytics tools with their key features and download links.

1. Apache Hadoop
Apache Hadoop is the long-standing champion in the field of Big Data processing, well known for its
huge-scale data processing capabilities. This open source Big Data framework can run
on-prem or in the cloud and has quite low hardware requirements. The main Hadoop
benefits and features are as follows:

 HDFS — Hadoop Distributed File System, oriented at working with huge-scale bandwidth

 MapReduce — a highly configurable model for Big Data processing

 YARN — a resource scheduler for Hadoop resource management

 Hadoop Libraries — the needed glue for enabling third party modules to work
with Hadoop

2. Apache Spark
Apache Spark is the alternative — and in many aspects the successor — to Apache
Hadoop. Spark was built to address the shortcomings of Hadoop and it does this
incredibly well. For example, it can process both batch data and real-time data, and
can operate up to 100 times faster than MapReduce. Spark provides in-memory data
processing capabilities, which are far faster than the disk-based processing used by
MapReduce. In addition, Spark works with HDFS, OpenStack and Apache Cassandra,
both in the cloud and on-prem, adding another layer of versatility to big data operations
for your business.
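
To make Spark's programming model concrete, here is a minimal PySpark word-count sketch (illustrative only; the input path and application name are assumptions, not part of the original notes):

from pyspark.sql import SparkSession

# Start a local Spark session; on a real cluster the master would be YARN or a
# standalone Spark master instead of local[*].
spark = SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate()
sc = spark.sparkContext

# Read a text file (hypothetical path), split lines into words, map each word to
# (word, 1) and sum the counts per word -- intermediate data stays in memory.
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

spark.stop()

Because the intermediate (word, 1) pairs are kept in memory rather than written to disk between stages, iterative workloads avoid the repeated disk I/O of classic MapReduce.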

3. Apache Storm
Storm is another Apache product, a real-time framework for data stream processing,
which supports any programming language. Storm scheduler balances the workload
between multiple nodes based on topology configuration and works well with Hadoop
HDFS. Apache Storm has the following benefits:

 Great horizontal scalability

 Built-in fault-tolerance

 Auto-restart on crashes


 Clojure-written

 Works with Direct Acyclic Graph(DAG) topology

 Output files are in JSON format

4. Apache Cassandra
Apache Cassandra is one of the pillars behind Facebook’s massive success, as it allows
processing of structured data sets distributed across a huge number of nodes across the globe. It
works well under heavy workloads due to its architecture without single points of failure
and boasts capabilities that few other NoSQL or relational databases offer, such as:

 Great linear scalability

 Simplicity of operations due to a simple query language used

 Constant replication across nodes

 Simple adding and removal of nodes from a running cluster

 High fault tolerance

 Built-in high-availability
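
As a hedged illustration of the "simple query language" point above, the sketch below uses the DataStax cassandra-driver for Python against a locally running node; the keyspace and table names are invented for the example:

from cassandra.cluster import Cluster

# Connect to a local Cassandra node (assumed to listen on 127.0.0.1:9042).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# CQL is deliberately SQL-like; the 'demo' keyspace and 'users' table are hypothetical.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (user_id int PRIMARY KEY, name text)")
session.execute("INSERT INTO demo.users (user_id, name) VALUES (%s, %s)", (1, "Priya"))

for row in session.execute("SELECT user_id, name FROM demo.users"):
    print(row.user_id, row.name)

cluster.shutdown()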

5. MongoDB (https://www.guru99.com/mongodb-tutorials.html)
MongoDB is another great example of an open source NoSQL database with rich
features, which is cross-platform compatible with many programming languages. IT Svit
uses MongoDB in a variety of cloud computing and monitoring solutions, and we
specifically developed a module for automated MongoDB backups using Terraform. The
most prominent MongoDB features are:

 Stores any type of data, from text and integer to strings, arrays, dates and boolean

 Cloud-native deployment and great flexibility of configuration

 Data partitioning across multiple nodes and data centers

 Significant cost savings, as dynamic schemas enable data processing on the go
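
To illustrate the schema flexibility mentioned in the list above, a minimal pymongo sketch (the database and collection names are assumptions) might look like this:

from datetime import datetime
from pymongo import MongoClient

# Connect to a locally running MongoDB instance on the default port.
client = MongoClient("mongodb://localhost:27017/")
collection = client["demo"]["events"]   # hypothetical database and collection

# Documents in the same collection can freely mix text, numbers, arrays,
# dates and booleans; no table definition is required up front.
collection.insert_one({
    "user": "priya",
    "score": 42,
    "tags": ["big data", "nosql"],
    "created_at": datetime.utcnow(),
    "active": True,
})

print(collection.find_one({"user": "priya"}))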

6. R Programming Environment
R is mostly used along with the JuPyteR stack (Julia, Python, R) for enabling wide-scale
statistical analysis and data visualization. Jupyter Notebook is one of the 4 most popular Big
Data visualization tools, as it allows composing literally any analytical model from more
than 9,000 CRAN (Comprehensive R Archive Network) algorithms and modules,
running it in a convenient environment, adjusting it on the go and inspecting the analysis
results at once. The main benefits of using R are as follows:

 R can run inside the SQL server

 R runs on both Windows and Linux servers

 R supports Apache Hadoop and Spark

 R is highly portable

 R easily scales from a single test machine to vast Hadoop data lakes

7. Neo4j
Neo4j is an open source graph database that stores data as interconnected nodes and
relationships, with properties following a key-value pattern. IT Svit has recently built a resilient
AWS infrastructure with Neo4j for one of our customers, and the database performs well
under a heavy workload of network data and graph-related requests. Main Neo4j features
are as follows:

 Built-in support for ACID transactions

 Cypher graph query language

 High-availability and scalability

 Flexibility due to the absence of schemas

 Integration with other databases
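
A small, illustrative sketch of the Cypher query language driven from Python is shown below; it assumes a local Neo4j instance and made-up credentials, so treat it as a sketch rather than part of the notes:

from neo4j import GraphDatabase

# Connection details are assumptions for the example.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two nodes and a relationship between them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Alice", b="Bob",
    )
    # Query the graph back with Cypher.
    result = session.run("MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name")
    for record in result:
        print(record["a.name"], "knows", record["b.name"])

driver.close()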

8. Apache SAMOA
This is another of the Apache family of tools used for Big Data processing. SAMOA
specializes in building distributed streaming algorithms for successful Big Data mining.
This tool is built with a pluggable architecture and must be used atop other Apache
products, like the Apache Storm engine mentioned earlier. Its features used for Machine
Learning include the following:

 Clustering

 Classification

 Normalization


 Regression

 Programming primitives for building custom algorithms

Using Apache Samoa enables the distributed stream processing engines to provide such
tangible benefits:

 Program once, use anywhere

 Reuse the existing infrastructure for new projects

 No reboot or deployment downtime

 No need for backups or time-consuming updates

Final thoughts on the list of hot Big Data tools for 2018
The Big Data industry and data science are evolving rapidly and have progressed a great deal lately, with
multiple Big Data projects and tools launched in 2017. This is one of the hottest IT trends
of 2018, along with IoT, blockchain, AI & ML.

PREDICTIVE ANALYTICS

 Predictive analytics is a form of advanced analytics that uses both new and
historical data to forecast activity, behavior and trends. It involves applying
statistical analysis techniques, analytical queries and automated machine learning
algorithms to data sets to create predictive models that place a numerical value --
or score -- on the likelihood of a particular event happening.
 Predictive analytics software applications use variables that can be measured and
analyzed to predict the likely behavior of individuals, machinery or other entities.
 For example, an insurance company is likely to take into account potential driving
safety variables, such as age, gender, location, type of vehicle and driving record,
when pricing and issuing auto insurance policies.
 Multiple variables are combined into a predictive model capable of assessing
future probabilities with an acceptable level of reliability. The software relies
heavily on advanced algorithms and methodologies, such as logistic regression
models, time series analysis and decision trees (a minimal scoring sketch in Python
follows this list).
 Predictive analytics has grown in prominence alongside the emergence of big data
systems. As enterprises have amassed larger and broader pools of data in Hadoop
clusters and other big data platforms, they have created increased data mining
opportunities to gain predictive insights. Heightened development and
commercialization of machine learning tools by IT vendors has also helped
expand predictive analytics capabilities.
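
As a minimal sketch of the scoring idea described above (not taken from these notes), a logistic regression model in scikit-learn can turn a few measured variables into a probability score for an event such as an insurance claim; the feature names and data below are invented:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy historical records: [age, vehicle_age, past_claims] per policyholder (invented values).
X = np.array([[22, 1, 2], [45, 5, 0], [30, 2, 1], [60, 8, 0], [25, 3, 3], [50, 6, 1]])
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = filed a claim, 0 = did not

# Fit a simple predictive model on the historical data.
model = LogisticRegression()
model.fit(X, y)

# Score a new applicant: the model outputs the probability of the event.
new_applicant = np.array([[28, 2, 1]])
print("Claim probability:", model.predict_proba(new_applicant)[0][1])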

MOBILE BUSINESS INTELLIGENCE


Que. What does Mobile Business Intelligence (Mobile BI) mean?

Ans. Mobile business intelligence (mobile BI) refers to the ability to provide business
and data analytics services to mobile/handheld devices and/or remote users. MBI enables
users with limited computing capacity to use and receive the same or similar features,
capabilities and processes as those found in a desktop-based business intelligence
software solution.

 One of the major problems customers face when using mobile devices for
information retrieval is the fact that mobile BI is no longer as simple as the pure
display of BI content on a mobile device. Moreover, a mobile strategy has to be
defined to cope with different suppliers and systems as well as private phones.
Besides attempts to standardize with the same supplier, companies are also
concerned that solutions should have robust security features. These points have
led many to the conclusion that a proper concept and strategy must be in place
before supplying corporate information to mobile devices.
 The first major benefit is the ability for end users to access information in their
mobile BI system at any time and from any location. This enables them to get
data and analytics in ‘real time’, which improves their daily operations and
means they can react more quickly to a wider range of events.

 The integration of mobile BI functions into operational business processes
increases the penetration of BI within organizations and often brings benefits in
the form of additional information.

 This speeds up the decision-making process by extending information and
reducing the time spent searching for relevant information. With this real-time
access to data, operational efficiency is improved and organizational
collaboration is enforced.

 Overall, mobile BI brings about greater availability of information, faster reaction
speed and more efficient working, as well as improving internal communication
and shortening workflows.
 Finally, with the provision of proper mobile applications to all mobile device
users, information can be used by people who previously did not use BI systems.
This in turn leads to a higher BI penetration rate within companies.

MOBILE BUSINESS INTELLIGENCE with BIG DATA

MBI works much like a standard BI software/solution but it is designed specifically for
handheld users. Typically, MBI requires a client end utility to be installed on mobile
devices, which remotely/wirelessly connect over the Internet or a mobile network to the
primary business intelligence application server. Upon connection, MBI users can
perform queries, and request and receive data. Similarly, clientless MBI solutions can be
accessed through a cloud server that provides Software as a Service business intelligence
(SaaS BI) or Real-Time Business Intelligence (RTBI, or Real-Time BI).

Mobile BI and Analytics


 Take Your Data Everywhere You Go
Modern business is not confined to the physical boundaries of the office.
The prevalence of mobile and handheld devices means you can take your work
anywhere, from the local cafe to an international flight.
 Never Lose Track of Your Business
Mobile BI lets you access your business intelligence dashboards anywhere
and at any time, using any mobile device. Going on a long business trip? Keep a
close eye on how the office is holding up by regularly checking on the status of
important KPIs using your smartphone.
 Fully Responsive, Mobile-Friendly Environment
Sisense Mobile BI App was designed according to the latest standards of
“mobile first”. This means that everything was designed to work on any screen
and on any device — unlike many applications that provide a buggy mobile
interface as a last-minute afterthought
 Start Taking Your Data To-Go
Mobile data analytics software lets you take your data and your business
with you wherever you may be, freeing you from the constraints of your
workstation and making the world a truly smaller place

WHAT IS CROWDSOURCING?

Crowdsourcing is a term used to describe the process of getting work or funding
from a large group of people in an online setting. The basic concept behind this term is to
use a large group of people for their skills, ideas and participation to generate content or
help facilitate the creation of content or products.

In a sense, crowdsourcing is the distribution of problem solving. If a company needs
funding for a project, marketing content for an upcoming campaign or even research for a
new product, the crowd is a powerful resource capable of generating vast amounts of
money, content and information.

Crowdsourcing data collection consists of building data sets with the help of a large
group of people. There is a source, and there are data suppliers who are willing to enrich the data
with relevant, missing, or new information.
This method originates from the scientific world. One of the first ever cases of
crowdsourcing is the Oxford English Dictionary. The project aimed to list all the words
that enjoy any recognized lifespan in the standard English language, with their definitions
and explanations of usage. That was a gigantic task, so the dictionary creators invited the
crowd to help them on a voluntary basis.


Sounds familiar? Think no further than Wikipedia.

Wikipedia is a free, web-based, multilingual and collaborative encyclopedia built on a
not-for-profit business model. The platform has more than 100,000 active volunteer
contributors who add new knowledge to the system daily.

Another great example of crowdsourcing in practice is OpenStreetMap — an alternative
to GoogleMaps.

More than 1 million mappers work together to collect and supply data to OpenStreetMap
making it full of valuable information about the specified location.

THE IMPORTANCE OF CROWDSOURCING


The Internet is now a melting pot of user-generated content, from blogs to Wikipedia
entries to YouTube videos. The distinction between producer and consumer is no longer
so clear-cut, as everyone is equipped with the tools needed to create as
well as consume.
As a business strategy, soliciting customer input isn’t new, and open source software has
proven the productivity possible through a large group of individuals.
The history of crowdsourcing

While the idea behind crowdsourcing isn’t new, its active use online as a business
building strategy has only been around since 2006. The phrase was initially coined by
Jeff Howe, who described a world in which people outside of a company contribute
work toward that project’s success. Video games have been utilizing crowdsourcing for
many years through their beta invitations. By granting players early access to the game,
studios request only that these passionate gamers report bugs and issues with gameplay as
they encounter them, before the finished product is released for sale and distribution.
Companies utilize crowdsourcing not only in a research and development capacity, but
also to simply get help from anyone for anything, whether it's word-of-mouth marketing,
creating content or giving feedback.

THE BENEFITS OF CROWDSOURCING


Crowdsourcing is a powerful business marketing tool as it allows an organization to
leverage the creativity and resources of its own audience in promoting and growing the
company for free. From designing marketing campaigns to researching new products to
solving difficult business roadblocks, an organization’s consumers can likely provide
important guidance and answers. And, best of all, all the consumer wants in return for
their opinion and effort is some recognition or even a simple reward.

Crowdsourcing increases the productivity of a company while minimizing labor
expenses. The Internet is a time-proven strategy for soliciting feedback from an active
and passionate consumer base. Customers today want to be involved in the companies
they buy from, which makes crowdsourcing an incredibly effective tool.

THE DOWNSIDES OF CROWDSOURCING


At the same time, consumers aren’t employees, which means organizations can’t contain
or control them. Leveraging the interaction and resources of your audience can put an
organization at risk from a public relations standpoint as things can get ugly quickly
when not properly handled. Crowds may not ask for cash or free product, but they
will demand satisfaction in one form or another, whether it’s recognition, freedom or
honesty.

INFORMATION MANAGEMENT

For US higher authorities, information management was a daily practice of managing
information, i.e. determining the nature of the information, what type of people could
access and read it, and with whom it should or should not be shared.


In the 1970s, when fourth-generation computing was in its early phase, computer scientists
developed various concepts to secure data (to prevent unauthorized access), including the
concept of the object, which focused on data security rather than just logic. Before objects
there were entities called structures and unions that were used to implement data structure
algorithms; these were quite similar to objects, but they could not encapsulate behaviour the
way an object encapsulates both its attributes and its behaviour. So the concept of object
orientation developed with the first object-oriented language, SIMULA 67, and it gained more
attention when Bjarne Stroustrup introduced the same concept with the release of C++.
Need for Information Management in Web Applications
Web applications of the 20th century were not as secure as the web applications of today. No
system is perfect, and object-oriented systems also have certain limitations, but they have
largely resolved the problem of data security at the level of program logic; it is now very hard
to access data as an unauthorized user.
In the 1990s, when Sabeer Bhatia created a web-based email service called Hotmail, people
thought many times before using it, because the chance of information leakage was extreme.
That is why the object-oriented approach was used in web languages such as ASP and PHP to
ensure data security and to build applications that work within a self-established environment,
i.e. frameworks, libraries of classes and functions that make programming easier. We needed
to be secure because someone could use our information to track us or use it for illegal
purposes. So this concept came to be used by almost every popular programming language.

Today’s Information Management

Information management is an essential part of today's web development; it ensures that data
is shared only with authorized users. It is not an easy task to manage the whole thing, delegate
authority to users and manage privacy.
Facebook is a good example of an information system. Facebook provides privacy controls to
its users, which is one of the capabilities a well-designed information system should offer. It
grants authority to the graph nodes connected with you (i.e., your friends) to access the
information that you have made available for such circumstances.
On Facebook, nobody can see your private information except those to whom you have
granted the authority to see it.


UNIT-3


Introduction to BIG DATA: What is, Types, Characteristics & Example

What is Data?

The quantities, characters, or symbols on which operations are performed by a computer, which may
be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or
mechanical recording media.

What is Big Data?

Big Data is also data but with a huge size. Big Data is a term used to describe a collection of data
that is huge in size and yet growing exponentially with time. In short such data is so large and
complex that none of the traditional data management tools are able to store it or process it
efficiently.

In this tutorial, you will learn,

 Examples Of Big Data


 Types Of Big Data
 Characteristics Of Big Data
 Advantages Of Big Data Processing

Examples Of Big Data


Following are some of the examples of Big Data:

The New York Stock Exchange generates about one terabyte of new trade data per day.

Social Media

The statistic shows that 500+terabytes of new data get ingested into the databases of social media
site Facebook, every day. This data is mainly generated in terms of photo and video uploads,
message exchanges, putting comments etc.


A single Jet engine can generate 10+terabytes of data in 30 minutes of flight time. With many
thousand flights per day, generation of data reaches up to many Petabytes.

Types Of Big Data


Big Data can be found in three forms:

1. Structured
2. Unstructured
3. Semi-structured

Structured

Any data that can be stored, accessed and processed in the form of a fixed format is termed
'structured' data. Over the period of time, talent in computer science has achieved greater success in
developing techniques for working with such kind of data (where the format is well known in
advance) and also deriving value out of it. However, nowadays, we are foreseeing issues when the size
of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.

Do you know? 10^21 bytes equal 1 zettabyte; in other words, one billion terabytes form a zettabyte.

Looking at these figures one can easily understand why the name Big Data is given and imagine the
challenges involved in its storage and processing.

Do you know? Data stored in a relational database management system is one example of
a 'structured' data.


Examples Of Structured Data

An 'Employee' table in a database is an example of Structured Data

Employee_ID Employee_Name Gender Department Salary_In_lacs

2365 Rajesh Kulkarni Male Finance 650000

3398 Pratibha Joshi Female Admin 650000

7465 Shushil Roy Male Admin 500000

7500 Shubhojit Das Male Finance 500000

7699 Priya Sane Female Finance 550000

Unstructured
Any data with unknown form or structure is classified as unstructured data. In addition to the size
being huge, unstructured data poses multiple challenges in terms of processing it to derive value
out of it. A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos etc. Nowadays organizations have a wealth of data
available with them but, unfortunately, they don't know how to derive value out of it since this data is
in its raw form or unstructured format.

Examples Of Un-structured Data

The output returned by 'Google Search'

Semi-structured

Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is actually not defined with, for example, a table definition as in a relational DBMS.
An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
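
Because the records above are well-formed XML, a generic parser can read them even though no table schema exists. A minimal sketch using Python's standard library follows (the file name, and the assumption that the records are wrapped in a single root element, are not part of the notes):

import xml.etree.ElementTree as ET

# Assume the <rec> elements shown above are wrapped in one root element
# and saved as 'people.xml'.
tree = ET.parse("people.xml")
for rec in tree.getroot().findall("rec"):
    name = rec.findtext("name")
    sex = rec.findtext("sex")
    age = int(rec.findtext("age"))
    print(name, sex, age)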

Data Growth over the years

Please note that web application data, which is unstructured, consists of log files, transaction history
files etc. OLTP systems are built to work with structured data wherein data is stored in relations
(tables).

Characteristics Of Big Data


(i) Volume – The name Big Data itself is related to a size which is enormous. Size of data plays a
very crucial role in determining value out of data. Also, whether a particular data can actually be
considered as a Big Data or not, is dependent upon the volume of data. Hence, 'Volume' is one
characteristic which needs to be considered while dealing with Big Data.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data considered by most of
the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs,
audio, etc. are also being considered in the analysis applications. This variety of unstructured data
poses certain issues for storage, mining and analyzing data.

(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data.

Big Data Velocity deals with the speed at which data flows in from sources like business processes,
application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is
massive and continuous.

(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.

Benefits of Big Data Processing

Ability to process Big Data brings in multiple benefits, such as-

o Businesses can utilize outside intelligence while taking decisions


Access to social data from search engines and sites like Facebook and Twitter is enabling organizations
to fine-tune their business strategies.

o Improved customer service

Traditional customer feedback systems are getting replaced by new systems designed with Big Data
technologies. In these new systems, Big Data and natural language processing technologies are being
used to read and evaluate consumer responses.

o Early identification of risk to the product/services, if any


o Better operational efficiency

Big Data technologies can be used for creating a staging area or landing zone for new data before
identifying what data should be moved to the data warehouse. In addition, such integration of Big
Data technologies and data warehouse helps an organization to offload infrequently accessed data.

Summary

 Big Data is a term used to describe a collection of data that is huge in size and yet growing
exponentially with time.
 Examples of Big Data generation include stock exchanges, social media sites, jet engines,
etc.
 Big Data could be 1) Structured, 2) Unstructured, 3) Semi-structured
 Volume, Variety, Velocity, and Variability are a few characteristics of Big Data
 Improved customer service, better operational efficiency, and better decision making are a few
advantages of Big Data

What do you mean by data processing?

Data processing is the conversion of data into a usable and desired form. This conversion or
“processing” is carried out using a predefined sequence of operations, either manually or
automatically. Most data processing is done by using computers and is thus done automatically.
The output or “processed” data can be obtained in different forms like image, graph, table, vector
file, audio, charts or any other desired format depending on the software or method of data
processing used. When carried out by a computer on its own, it is referred to as automatic data
processing. Continue reading below to understand more about what data processing is.

Fundamentals of data processing & how data is processed

Data processing is undertaken in any activity which requires a collection of data. The collected data
needs to be stored, sorted, processed, analyzed and presented. This complete process can be divided
into 6 simple primary stages, which are:

 Data collection
 Storage of data


 Sorting of data
 Processing of data
 Data analysis
 Data presentation and conclusions

Once the data is collected, the need for data entry emerges so that the data can be stored. Storage can be done
in physical form by the use of paper, in notebooks or in any other physical form. With the emergence and
growing emphasis on computer systems, Big Data and data mining, data collection is large, and since a
number of operations need to be performed for meaningful analysis and presentation, the data is
stored in digital form. Having the raw data and processed data in digital form enables the user to
perform a large number of operations in a short time and allows conversion into different types. The
user can thus select the output which best suits the requirement.

This continuous use and processing of data follows a cycle, called the data processing
cycle or information processing cycle, which might provide instant results or take time depending
upon the need of the processing. The complexity in the field of data processing is increasing, which
is creating a need for advanced techniques.

Storage of data is followed by sorting and filtering. This stage is profoundly affected by the format in
which data is stored and further depends on the software used. General day-to-day and non-complex
data can be stored as text files, tables or a combination of both in Microsoft Excel or similar
software. As tasks become complex and require specific and specialized
operations to be performed, they require different data processing tools and software meant to cater
to their particular needs.

Storing, sorting, filtering and processing of data can be done by single software or a combination of
software whichever feasible and required. Data processing thus carried out by software is done as per
the predefined set of operations. Most of the modern-day software allows users to perform different
actions based on the analysis or study to be carried out. Data processing provides the output file in
various formats.

Different types of output files obtained as “processed” data

 Plain text file – These constitute the simplest form of processed data. Most of these files are
user readable and easy to comprehend. Very negligible or no further processing is needed for these
types of files. These are exported as Notepad or WordPad files.
 Table/ spreadsheet – This file format is most suitable for numeric data. Having digits in rows
and columns allows the user to perform various operations like filtering & sorting in
ascending/descending order to make it easy to understand and use. Various mathematical
operations can be applied when using this file output.
 Charts & Graphs – The option to get the output in the form of charts and graphs is handy and now
forms a standard feature in most software. This option is beneficial when dealing with
numerical values reflecting trends and growth/decline. Though ample charts and
graphs are available to match diverse requirements, there exist situations where a
user-defined option is needed. In case no inbuilt chart or graph is available, the option to
create your own charts, i.e., custom charts/graphs, comes in handy.
 Maps/Vector or image file – When dealing with spatial data the option to export the processed
data into maps, vector and image files is of great use. Having the information on maps is of
particular use for urban planners who work on different types of maps. Image files are obtained
when dealing with graphics and do not constitute any human readable input.
 Other formats/ raw files – These are software-specific file formats which can be used and
processed by specialized software. These output files may not be a complete product and may
require further processing. Thus there may be a need to perform multiple rounds of data processing.

Methods of data processing

 Manual data processing: In this method data is processed manually without the use of a
machine, tool or electronic device. Data is processed manually, and all the calculations and
logical operations are performed manually on the data.
 Mechanical data processing – Data processing is done by use of a mechanical device or very
simple electronic devices like calculator and typewriters. When the need for processing is
simple, this method can be adopted.
 Electronic data processing – This is the modern technique to process data. Electronic data
processing is the fastest and best available method, with the highest reliability and accuracy.
The technology used is the latest, as this method uses computers and is employed in most
agencies. The use of software forms part of this type of data processing. The data is
processed through a computer; data and a set of instructions are given to the computer as input,
and the computer automatically processes the data according to the given set of instructions.
The computer is also known as an electronic data processing machine.

Types of data processing on the basis of process/steps performed

There are various types of data processing, some of the most popular types are as follows:

 Batch Processing
 Real-time processing
 Online Processing
 Multiprocessing
 Time-sharing

What makes processing of data important

Nowadays more and more data is collected for academic and scientific research, private and personal use,
institutional use and commercial use. This collected data needs to be stored, sorted, filtered, analyzed
and presented, and may even require data transfer, for it to be of any use. This process can be simple or
complex depending on the scale at which data collection is done and the complexity of the results
which are required to be obtained. The time consumed in obtaining the desired result depends on the
operations which need to be performed on the collected data and on the nature of the output file
required to be obtained. This problem becomes starker when dealing with very large volumes of
data such as those collected by multinational companies about their users, sales, manufacturing, etc.
Data processing services and companies dealing with personal information and other sensitive
information must be careful about data protection.

The need for data processing becomes more and more critical in such cases. In such cases, data
mining and data management come into play without which optimal results cannot be obtained. Each
stage starting from data collection to presentation has a direct effect on the output and usefulness of
the processed data. Sharing the dataset with a third party must be done carefully and as per a written data
processing agreement & service agreement. This prevents data theft, misuse and loss of data.

What type of data needs to be processed

Data in any form and of any type requires processing most of the time. This data can be categorised
as personal information, financial transactions, tax credits, banking details, computational data,
images and almost anything else you can think of. The quantum of processing required will
depend on the specialized processing which the data requires and, subsequently, on the
output that you require. With the increase in demand and the requirement for automatic data
processing & electronic data processing, a competitive market for data services has emerged.

Six stages of data processing

1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources,
including data lakes and data warehouses. It is important that the data sources available are
trustworthy and well-built so the data collected (and later used as information) is of the highest
possible quality.

2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to
as “pre-processing” is the stage at which raw data is cleaned up and organized for the following stage
of data processing. During preparation, raw data is diligently checked for any errors. The purpose of
this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create high-
quality data for the best business intelligence.

3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data
warehouse like Redshift), and translated into a language that it can understand. Data input is the
first stage in which raw data begins to take the form of usable information.

4. Processing
During this stage, the data inputted to the computer in the previous stage is actually processed for
interpretation. Processing is done using machine learning algorithms, though the process itself
may vary slightly depending on the source of data being processed (data lakes, social networks,
connected devices etc.) and its intended use (examining advertising patterns, medical diagnosis
from connected devices, determining customer needs, etc.).

5. Data output/interpretation


The output/interpretation stage is the stage at which data is finally usable to non-data scientists. It is
translated, readable, and often in the form of graphs, videos, images, plain text, etc.). Members of the
company or institution can now begin to self-serve the data for their own data analytics projects.

6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for
future use. While some information may be put to use immediately, much of it will serve a purpose
later on. Plus, properly stored data is a necessity for compliance with data protection legislation
like GDPR. When data is properly stored, it can be quickly and easily accessed by members of the
organization when needed.
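
A compressed, illustrative sketch of these six stages using pandas (the file names and column names are invented) could look like this:

import pandas as pd

# 1. Collection: pull raw records from an assumed CSV export.
raw = pd.read_csv("sales_raw.csv")

# 2. Preparation: drop incomplete and duplicate rows (the "bad data").
clean = raw.dropna().drop_duplicates()

# 3. Input: the cleaned rows now sit in the working structure (here, a DataFrame).
# 4. Processing: aggregate revenue per region (hypothetical columns).
summary = clean.groupby("region")["revenue"].sum()

# 5. Output/interpretation: present the result in a readable form.
print(summary)

# 6. Storage: persist the processed result for later use.
summary.to_csv("sales_by_region.csv")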

The future of data processing


The future of data processing lies in the cloud. Cloud technology builds on the convenience of
current electronic data processing methods and accelerates its speed and effectiveness. Faster,
higher-quality data means more data for each organization to utilize and more valuable insights
to extract.


UNIT-4
Creating the components of Hadoop Map Reduce jobs
Hadoop components and Daemons
Every framework needs two important components:

 Storage: The place where code, data, executables etc are stored.

 Compute: The logic by which code is executed and data is acted upon.

Two main components of Hadoop framework are also on the same lines:

1. HDFS: This is the storage in which all the data is stored. It is a file system that is required by
Hadoop to run various map reduce jobs.
2. MapReduce: This is the compute logic based on which Hadoop runs. MapReduce is the
fundamental algorithm behind the success of Hadoop. This provides very fast processing.

There are 2 layers in Hadoop – HDFS layer and Map-Reduce layer and 5 daemons which run on Hadoop
in these 2 layers. Daemons are the processes that run in the background.

1) Namenode – It runs on master node for HDFS.

2) Datanode – It runs on slave nodes for HDFS.

3) Resource Manager– It runs on YARN master node for MapReduce.

4) Node Manager – It runs on YARN slave node for MapReduce.

5) Secondary-namenode – It is back-up for namenode and runs on a different system (other than master
and slave nodes but can be configured on slave node also)

These 5 daemons run for Hadoop to be functional.


HDFS provides the storage layer and Map Reduce provides the computation layer in Hadoop. There is 1
namenode and several datanodes on the storage layer, i.e. HDFS. Similarly, there is a resource manager and
several node managers on the computation layer, i.e. Map Reduce.
Namenode (HDFS) and resource manager (Map-Reduce) run on the master, while datanodes (HDFS) and
node managers (Map-Reduce) run on the slaves.

Job Tracker
 Is a service with Hadoop system
 It is like a scheduler
 Client application is sent to the JobTracker
 It talks to the Namenode, locates the TaskTracker near the data (remember the data has been
populated already).


 JobTracker moves the work to the chosen TaskTracker node.


 TaskTracker monitors the execution of the task and updates the JobTracker through heartbeat.
Any failure of a task is detected through missing heartbeat.
 Intermediate merging on the nodes are also taken care of by the JobTracker

TaskTracker
 It accepts tasks (Map, Reduce, Shuffle, etc.) from JobTracker
 Each TaskTracker has a number of slots for the tasks; these are execution slots available on the
machine or machines on the same rack;
 It spawns a separate JVM for execution of the tasks;
 It indicates the number of available slots through the heartbeat message to the JobTracker

The Execution Framework


 A MapReduce program, referred to as a job, consists of code for mappers, reducers and others
packaged together with configuration parameters (such as IO locations).
 The developer submits the job to the submission node of a cluster (in Hadoop, this is called the
jobtracker).
 Execution framework (sometimes called the “runtime”) takes care of everything else: it
transparently handles all other aspects of distributed code execution, on clusters ranging from a
single node to a few thousand nodes.

Responsibilities of the Execution Framework


 Scheduling
◦ Each MapReduce job is divided into smaller units called tasks
◦ Essentially the key space is shared among the # of Mappers
◦ Maintain a queue in case # tasks > # mappers, reducers, etc.
◦ Coordination among multiple jobs and users.
 Data/code co-location:
 Synchronization
 Error and fault handling
 Partitioners, Combiners


JobTracker

The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the
cluster, ideally the nodes that have the data, or at least are in the same rack.

1. Client applications submit jobs to the Job tracker.


2. The JobTracker talks to the NameNode to determine the location of the data
3. The JobTracker locates TaskTracker nodes with available slots at or near the data
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are
deemed to have failed and the work is scheduled on a different TaskTracker.
6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it
may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may
even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
8. Client applications can poll the JobTracker for information.

The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs
are halted.

JobTracker and TaskTracker


JobTracker and TaskTracker are 2 essential processes involved in MapReduce execution in MRv1 (or
Hadoop version 1). Both processes are now deprecated in MRv2 (or Hadoop version 2) and replaced by
the Resource Manager, Application Master and Node Manager daemons.

Job Tracker –
1. JobTracker process runs on a separate node and not usually on a DataNode.
2. JobTracker is an essential Daemon for MapReduce execution in MRv1. It is replaced by
ResourceManager/ApplicationMaster in MRv2.
3. JobTracker receives the requests for MapReduce execution from the client.
4. JobTracker talks to the NameNode to determine the location of the data.
5. JobTracker finds the best TaskTracker nodes to execute tasks based on the data locality
(proximity of the data) and the available slots to execute a task on a given node.
6. JobTracker monitors the individual TaskTrackers and then submits the overall status of the
job back to the client.
7. JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.
8. When the JobTracker is down, HDFS will still be functional but the MapReduce execution can
not be started and the existing MapReduce jobs will be halted.
TaskTracker –
1. TaskTracker runs on DataNode. Mostly on all DataNodes.
2. TaskTracker is replaced by Node Manager in MRv2.
3. Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
4. TaskTrackers will be assigned Mapper and Reducer tasks to execute by JobTracker.
5. TaskTracker will be in constant communication with the JobTracker signalling the progress of the
task in execution.
6. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive,
JobTracker will assign the task executed by the TaskTracker to another node.


Google File System


Many datasets are too large to fit on a single machine. Unstructured data may not be easy
to insert into a database. Distributed file systems store data across a large number of
servers. The Google File System (GFS) is a distributed file system used by Google in the
early 2000s. It is designed to run on a large number of cheap servers.

The purpose behind GFS was the ability to store and access large files, and by large I
mean files that can’t be stored on a single hard drive. The idea is to divide these files into
manageable chunks of 64 MB and store these chunks on multiple nodes, having a
mapping between these chunks also stored inside the file system.

GFS assumes that it runs on many inexpensive commodity components that can often fail,
therefore it should consistently perform failure monitoring and recovery. It can store many
large files simultaneously and allows for two kinds of reads to them: small random reads
and large streaming reads. Instead of rewriting files, GFS is optimized towards appending
data to existing files in the system.

The GFS master node stores the index of files, while GFS chunk servers store the actual
chunks in the filesystems on multiple Linux nodes. The chunks that are stored in the GFS
are replicated, so the system can tolerate chunk server failures. Data corruption is also
detected using checksums, and GFS tries to compensate for these events as soon as
possible.

Here’s a brief history of the Google File System:


 2003: Google File System paper was released.

 2004: MapReduce framework was released. It is a programming model and an
associated implementation for processing and generating big data sets with a
parallel, distributed algorithm on a cluster.

 2006: Hadoop, which provides a software framework for distributed storage and
processing of big data using the MapReduce programming model, was created.
All the modules in Hadoop are designed with a fundamental assumption that
hardware failures are common occurrences and should be automatically handled
by the framework.

 2007: HBase, an open-source, non-relational, distributed database modeled after
Google’s Bigtable and written in Java, was born. It is developed as part of the
Apache Hadoop project and runs on top of HDFS.

 2008: Hadoop wins the TeraSort contest. TeraSort is a popular benchmark that
measures the amount of time to sort one terabyte of randomly distributed data on
a given computer system

 2009: Spark, an open-source distributed general-purpose cluster-computing
framework, was built. It provides an interface for programming entire clusters
with implicit data parallelism and fault tolerance.

 2010: Hive, a data warehouse software project built on top of Apache Hadoop for
providing data query and analysis, was created. It gives a SQL-like interface to
query data stored in various databases and file systems that integrate with
Hadoop.

Hadoop Distributed File System


The Hadoop Distributed File System (HDFS) is a distributed file system designed to run
on commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. HDFS is
highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides
high throughput access to application data and is suitable for applications that have large
data sets. In fact, HDFS deployments with thousands of nodes exist.


In HDFS, files are divided into blocks, and file access follows multi-reader, single-writer
semantics. To meet the fault-tolerance requirement, multiple replicas of a block are stored
on different DataNodes. The number of replicas is called the replication factor. When a
new file block is created, or an existing file is opened for append, the HDFS write
operation creates a pipeline of DataNodes to receive and store the replicas.
(The replication factor generally determines the number of DataNodes in the pipeline.)
Subsequent writes to that block go through the pipeline. For reading operations the client
chooses one of the DataNodes holding copies of the block and requests a data transfer
from it.
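
As an illustrative sketch only (not from the notes): the WebHDFS-based Python package hdfs can write and read a replicated file from a running cluster; the NameNode URL, port, user and paths below are all assumptions:

from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (host, port and user are assumptions).
client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Write a small file; HDFS splits it into blocks and replicates them across DataNodes.
client.write("/tmp/demo.txt", data=b"hello hdfs", overwrite=True)

# Read it back through whichever DataNode holds a replica.
with client.read("/tmp/demo.txt") as reader:
    print(reader.read())

# List the directory contents.
print(client.list("/tmp"))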

MapReduce
MapReduce is a programming model which consists of
writing map and reduce functions. Map accepts key/value pairs and produces a sequence
of key/value pairs. Then, the data is shuffled to group keys together. After that, we reduce
the accepted values with the same key and produce a new key/value pair.


During the execution, the Map tasks are assigned to machines based on input data. Then
those Map tasks produce their output. Next, the mapper output is shuffled and sorted.
Then, the Reduce tasks are scheduled and run. The Reduce output is finally stored to disk.

MapReduce in Python
Let’s walk through some code. The following program is from Michael Noll’s tutorial on
writing a Hadoop MapReduce program in Python.

The code below is the Map function. It will read data from STDIN, split it into words and
output a list of lines mapping words to their (intermediate) counts to STDOUT. The Map
script will not compute an (intermediate) sum of a word’s occurrences though. Instead, it
will output <word> 1 tuple immediately — even though a specific word might occur
multiple times in the input. In our case, we let the subsequent Reduce step do the final sum
count.
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))


The code below is the Reduce function. It will read the results from the map step from
STDIN and sum the occurrences of each word to a final count, and then output its results
to STDOUT.
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
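
The excerpt above stops at the input-parsing step, before any counting happens. A sketch of the remaining aggregation logic, reconstructed here (not copied from the notes) in the spirit of the word-count reducer the text describes, continues inside the same loop:

    # --- continuation of the loop body above (reconstructed sketch) ---
    # this IF-switch only works because Hadoop sorts the map output
    # by key (here: by word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write the finished (word, total) pair to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# after the loop: do not forget to output the last word!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))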

MapReduce Example
Here is a real-world use case of MapReduce:

Facebook has a list of friends (note that friends are a bi-directional thing on Facebook. If
I’m your friend, you’re mine). They also have lots of disk space and they serve hundreds
of millions of requests every day. They’ve decided to pre-compute calculations when they
can to reduce the processing time of requests. One common processing request is the
“You and Joe have 230 friends in common” feature. When you visit someone’s profile,
you see a list of friends that you have in common. This list doesn’t change frequently so
it’d be wasteful to recalculate it every time you visited the profile (sure you could use a
decent caching strategy, but then I wouldn’t be able to continue writing about MapReduce
for this problem). We’re going to use MapReduce so that we can calculate every one’s
common friends once a day and store those results. Later on, it’s just a quick lookup.
We’ve got lots of disk, it’s cheap.

Assume the friends are stored as Person->[List of Friends], our friends list is then:
A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D

Each line will be an argument to a mapper. For every friend in the list of friends, the
mapper will output a key-value pair. The key will be a friend along with the person. The
value will be the list of friends. The key will be sorted so that the friends are in order,
causing all pairs of friends to go to the same reducer. This is hard to explain with text, so
let’s just do it and see if you can see the pattern. After all the mappers are done running,
you’ll have a list like this:
For map(A -> B C D):
(A B) -> B C D
(A C) -> B C D
(A D) -> B C D

For map(B -> A C D E): (Note that A comes before B in the key)
(A B) -> A C D E
(B C) -> A C D E
(B D) -> A C D E
(B E) -> A C D E

For map(C -> A B D E):
(A C) -> A B D E
(B C) -> A B D E
(C D) -> A B D E
(C E) -> A B D E

For map(D -> A B C E):
(A D) -> A B C E
(B D) -> A B C E
(C D) -> A B C E
(D E) -> A B C E

And finally for map(E -> B C D):
(B E) -> B C D
(C E) -> B C D
(D E) -> B C D

Before we send these key-value pairs to the reducers, we group them by their keys and get:

(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)


Each line will be passed as an argument to a reducer. The reduce function will simply
intersect the lists of values and output the same key with the result of the intersection. For
example, reduce((A B) -> (A C D E) (B C D)) will output (A B) : (C D) and means that
friends A and B have C and D as common friends.

The result after reduction is:


(A B) -> (C D)
(A C) -> (B D)
(A D) -> (B C)
(B C) -> (A D E)
(B D) -> (A C E)
(B E) -> (C D)
(C D) -> (A B E)
(C E) -> (B D)
(D E) -> (B C)

Now when D visits B’s profile, we can quickly look up (B D) and see that they have three
friends in common, (A C E).
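To make the pattern concrete, here is a minimal plain-Python sketch of this job's map and reduce logic (run on a single machine just to reproduce the example data; it is not Hadoop code):

def map_friends(person, friends):
    # emit one record per friend; the key is the alphabetically sorted pair
    for friend in friends:
        yield tuple(sorted((person, friend))), set(friends)

def reduce_friends(pair, friend_lists):
    # the common friends are the intersection of the two emitted friend lists
    return pair, sorted(set.intersection(*friend_lists))

friends = {'A': ['B', 'C', 'D'], 'B': ['A', 'C', 'D', 'E'],
           'C': ['A', 'B', 'D', 'E'], 'D': ['A', 'B', 'C', 'E'],
           'E': ['B', 'C', 'D']}

grouped = {}
for person, fl in friends.items():               # "map" phase
    for key, value in map_friends(person, fl):
        grouped.setdefault(key, []).append(value)

for key in sorted(grouped):                      # "reduce" phase
    print(reduce_friends(key, grouped[key]))     # e.g. (('A', 'B'), ['C', 'D'])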

MapReduce in MongoDB
We can also use Map-Reduce in MongoDB via the mapReduce database command.
Consider the following map-reduce operation:

In this map-reduce operation, MongoDB applies the map phase to each input document
(i.e. the documents in the collection that match the query condition). The map function
emits key-value pairs. For those keys that have multiple values, MongoDB applies
the reduce phase, which collects and condenses the aggregated data. MongoDB then
stores the results in a collection. Optionally, the output of the reduce function may pass
through a finalize function to further condense or process the results of the aggregation.

All map-reduce functions in MongoDB are JavaScript and run within the mongod process.
Map-reduce operations take the documents of a single collection as the input and can
perform any arbitrary sorting and limiting before beginning the map
stage. mapReduce can return the results of a map-reduce operation as a document or may
write the results to collections. The input and the output collections may be sharded.
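As a rough illustration, a mapReduce command issued from Python might look like the sketch below (a hedged example: it assumes a local mongod, the pymongo driver, and a made-up orders collection with cust_id, amount and status fields):

from pymongo import MongoClient
from bson.code import Code
from bson.son import SON

client = MongoClient()          # assumes a mongod running locally
db = client.test

# map: emit one (cust_id, amount) pair per matching order document
map_fn = Code("function () { emit(this.cust_id, this.amount); }")
# reduce: sum all the amounts emitted for the same cust_id
reduce_fn = Code("function (key, values) { return Array.sum(values); }")

db.command(SON([("mapReduce", "orders"),      # input collection (assumed)
                ("map", map_fn),
                ("reduce", reduce_fn),
                ("query", {"status": "A"}),   # only map documents matching this filter
                ("out", "order_totals")]))    # store the results in this collection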


Apache Spark
As we discussed above, some algorithms cannot be executed in a single MapReduce job.
To resolve that, MapReduce jobs can be chained together to provide a solution. This often
happens with algorithms which iterate until convergence (such as k-means, PageRank, etc.).
The big disadvantage is that the Reduce output must be written to disk and read back again
at the start of each new job; in some cases, the input data is read from disk many times.
Thus, there is no easy way to share the work done between iterations.

In Apache Spark, the computation model is much richer than just MapReduce.
Transformations on input data can be written lazily and batched together. Intermediate
results can be cached and reused in future calculations. There is a series of
lazy transformations which are followed by actions that force evaluation of all
transformations. Notably, each step in the Spark model produces a resilient distributed
dataset (RDD). Intermediate results can be cached in memory or on disk, optionally serialized.

For each RDD, we keep a lineage, which is the sequence of operations that created it. Spark can
then recompute any data that is lost without writing every result to disk. We can still decide to keep
a copy in memory or on disk for performance.

Let's delve into the Spark model through an example. Say we have a program that
reads in a .csv file and does some data wrangling.

Such a program first performs transformations on the data. Spark transformations
include function calls such as distinct(), filter(fn), intersection(other), join(other), map(fn),
and union(other).


It then performs actions on the data. Spark actions include function calls such as
collect(), count(), first(), take(n), reduce(fn), foreach(fn), and saveAsTextFile().
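A hedged PySpark sketch combining both transformations and actions (the file name and column layout are assumptions, not the original example from the notes):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrangling-sketch").getOrCreate()
sc = spark.sparkContext

# transformations: nothing runs yet, Spark only records the lineage of each RDD
lines  = sc.textFile("data.csv")                     # assumed input file
rows   = lines.map(lambda line: line.split(","))     # map(fn)
valid  = rows.filter(lambda cols: len(cols) == 3)    # filter(fn)
valid.cache()                                        # keep this intermediate RDD in memory
cities = valid.map(lambda cols: cols[2]).distinct()  # distinct()

# actions: these force evaluation of all the transformations above
print(cities.count())                                # count()
print(cities.take(5))                                # take(n)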

We can represent this Spark model as a task directed acyclic graph (DAG): the lazy
transformations form one branch of the graph, and the actions that force their evaluation
form the other.

Let's say we want to implement k-means clustering in Spark. The process goes like
this:

1. Load the data and “persist.”

2. Randomly sample for initial clusters.


3. Broadcast the initial clusters.

4. Loop over the data and find the new centroids.

5. Repeat from step 3 until the algorithm converges.

Unlike MapReduce, we can load the input once and keep it in memory. The broadcast in
step 3 means that we can quickly send the current centers to all machines, and the cluster
assignments do not need to be written to disk on every iteration.
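A compact PySpark sketch of these five steps (the input file of "x,y" lines, k = 3, and the fixed iteration count are assumptions; empty clusters and convergence tests are ignored to keep it short):

from pyspark import SparkContext

sc = SparkContext(appName="kmeans-sketch")

def closest(point, centers):
    # index of the nearest center by squared Euclidean distance
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

# 1. load the data and persist it in memory
points = (sc.textFile("points.txt")
            .map(lambda line: [float(v) for v in line.split(",")])
            .persist())

# 2. randomly sample the initial clusters
centers = points.takeSample(False, 3)

for _ in range(10):                    # 5. iterate (a fixed count instead of a convergence test)
    b_centers = sc.broadcast(centers)  # 3. broadcast the current centers to every executor
    # 4. assign each point to its nearest center and average the points per cluster
    totals = (points.map(lambda p: (closest(p, b_centers.value), (p, 1)))
                    .reduceByKey(lambda a, b: ([x + y for x, y in zip(a[0], b[0])],
                                               a[1] + b[1]))
                    .collectAsMap())
    centers = [[x / n for x in s] for s, n in totals.values()]

print(centers)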

Spark applications run as independent sets of processes on a cluster, coordinated by the
SparkContext object in your main program (called the driver program).

Specifically, to run on a cluster, the SparkContext can connect to several types of cluster
managers (either Spark’s own standalone cluster manager, Mesos or YARN), which
allocate resources across applications. Once connected, Spark acquires executors on nodes
in the cluster, which are processes that run computations and store data for your
application. Next, it sends your application code (defined by JAR or Python files passed to
SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

This is the Spark architecture in outline. There are several useful things to note about this
architecture:

1. Each application gets its own executor processes, which stay up for the duration
of the whole application and run tasks in multiple threads. This has the benefit of
isolating applications from each other, on both the scheduling side (each driver
schedules its own tasks) and executor side (tasks from different applications run
in different JVMs). However, it also means that data cannot be shared across
different Spark applications (instances of SparkContext) without writing it to an
external storage system.


2. Spark is agnostic to the underlying cluster manager. As long as it can acquire
executor processes, and these communicate with each other, it is relatively easy
to run it even on a cluster manager that also supports other applications (e.g.
Mesos/YARN).

3. The driver program must listen for and accept incoming connections from its
executors throughout its lifetime. As such, the driver program must be network
addressable from the worker nodes.

4. Because the driver schedules tasks on the cluster, it should be run close to the
worker nodes, preferably on the same local area network. If you’d like to send
requests to the cluster remotely, it’s better to open an RPC to the driver and have
it submit operations from nearby than to run a driver far away from the worker
nodes.

The system currently supports several cluster managers:

 Standalone — a simple cluster manager included with Spark that makes it easy to
set up a cluster.

 Apache Mesos — a general cluster manager that can also run Hadoop MapReduce
and service applications.

 Hadoop YARN — the resource manager in Hadoop 2.

 Kubernetes — an open-source system for automating deployment, scaling, and
management of containerized applications.
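For illustration, the cluster manager is normally selected through the master URL when the SparkSession (and its SparkContext) is created; the values below are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")                            # or "local[*]", "spark://host:7077", "mesos://...", "k8s://..."
         .appName("my-app")
         .config("spark.executor.instances", "4")   # resources requested from the cluster manager
         .getOrCreate())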

Overall, Apache Spark is much more flexible since we can also run distributed SQL
queries. It also contains many libraries for machine learning, stream processing, etc.
Furthermore, Spark can connect to a number of different data sources.


Other Apache Platforms


Apache Hive is data warehouse software built on top of Hadoop. It converts SQL
queries into MapReduce or Spark jobs, and the data is stored in files on HDFS.

Apache Flink is another system designed for distributed analytics like Apache Spark. It
executes everything as a stream. Iterative computations can be written natively with cycles
in the data flow. It has a very similar architecture to Spark, including (1) a client that
optimizes and constructs data flow graph, (2) a job manager that receives jobs, and (3) a
task manager that executes jobs.


Apache Pig is another platform for analyzing large datasets. It provides the Pig Latin
language, which is easy to write but runs as MapReduce. Pig Latin code is shorter and
faster to develop than the equivalent Java code.

Here is a Pig Latin code example that counts words:
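A minimal word-count sketch in Pig Latin (the input file name is an assumption):

lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;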

A Pig Latin example that aggregates page visits can be written in the same style.


Apache Pig is installed locally and can send jobs to any Hadoop cluster. It is slower than
Spark, but doesn’t require any software on the Hadoop cluster. We can write user-defined
functions for more complex operations (intersection, union, etc.)

Apache HBase is quite similar to Apache Cassandra — the wide column store that we
discussed above. It is essentially a large sorted map that we can update. It uses Apache
Zookeeper to ensure consistent updates to the data.
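For a rough idea of what "a large sorted map that we can update" means in practice, here is a hedged sketch using the happybase Python client (it assumes an HBase Thrift server on localhost; the table, row key and column names are made up):

import happybase

connection = happybase.Connection('localhost')   # assumed Thrift server
table = connection.table('users')                # assumed table

# an update is simply a put of new cell values under a row key
table.put(b'row-42', {b'info:name': b'Alice', b'info:city': b'Bhopal'})

# a read returns the sorted column -> value map for that row
print(table.row(b'row-42'))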

Next, I want to mention Apache Calcite, a framework that can parse and optimize SQL
queries and process data. It powers query optimization in Flink, Hive, Druid, and others,
and it provides many of the pieces needed to implement a database engine; a number of
companies and projects are powered by Calcite.

More importantly, Calcite can connect to a variety of database systems including Spark,
Druid, Elastic Search, Cassandra, MongoDB, and Java JDBC.


Compared with a conventional database architecture, Calcite takes away the client, server,
and parser layers and lets the optimizer do the heavy work of processing metadata. The
Calcite optimizer uses more than 100 rewrite rules to optimize queries. Queries are expressed
in relational algebra but can operate on non-relational data sources, and Calcite aims to find
the lowest-cost way to execute each query.

And that's the end of this short overview of distributed data processing.


UNIT-5
Installing and running PIG
Prerequisites
It is essential that you have Hadoop and Java installed on your system before you go for Apache
Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps
given in the following link −
http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm

Download Apache Pig


First of all, download the latest version of Apache Pig from the following website
− https://pig.apache.org/

Step 1
Open the homepage of Apache Pig website. Under the section News, click on the link release
page as shown in the following snapshot.


Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this
page, under the Download section, you will have two links, namely, Pig 0.8 and later and Pig
0.7 and before. Click on the link Pig 0.8 and later, then you will be redirected to the page
having a set of mirrors.

Step 3
Choose and click any one of these mirrors as shown below.


Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of
Apache Pig. Click the latest version among them.

Step 5


Within these folders, you will have the source and binary files of Apache Pig in various
distributions. Download the tar files of the source and binary files of Apache Pig
0.15, pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

Install Apache Pig


After downloading the Apache Pig software, install it in your Linux environment by following
the steps given below.

Step 1
Create a directory with the name Pig in the same directory where the installation directories
of Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig
directory in the user named Hadoop).

$ mkdir Pig

Step 2
Extract the downloaded tar files as shown below.

$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz


Step 3
Move the contents of the extracted pig-0.15.0-src directory to the Pig directory created
earlier as shown below.

$ mv pig-0.15.0-src/* /home/Hadoop/Pig/

Configure Apache Pig


After installing Apache Pig, we have to configure it. To configure, we need to edit two files
− bashrc and pig.properties.

.bashrc file
In the .bashrc file, set the following variables −

 PIG_HOME to the Apache Pig installation folder,
 PATH environment variable to include Pig's bin folder, and
 PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).

export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf

pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can
set various parameters as given below.
pig -h properties
The following properties are supported −
Logging:
verbose = true|false; default is false. This property is the same as the -v switch.
brief = true|false; default is false. This property is the same as the -b switch.
debug = OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as the -d switch.
aggregate.warning = true|false; default is true. If true, prints the count of warnings of each type rather than logging each warning.

Performance tuning:
pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory). Note that this memory is shared across all large bags used by the application.
pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory). Specifies the fraction of heap available for the reducer to perform the join.
pig.exec.nocombiner = true|false; default is false. Only disable the combiner as a temporary workaround for problems.
opt.multiquery = true|false; multiquery is on by default.
Only disable multiquery as a temporary workaround for problems.
opt.fetch=true|false; fetch is on by default.
Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR
jobs.
pig.tmpfilecompression = true|false; compression is off by default.
Determines whether output of intermediate jobs is compressed.
pig.tmpfilecompression.codec = lzo|gzip; default is gzip.
Used in conjunction with pig.tmpfilecompression. Defines compression type.
pig.noSplitCombination = true|false. Split combination is on by default.
Determines if multiple small files are combined into a single map.

pig.exec.mapPartAgg = true|false. Default is false. Determines if partial aggregation is done within the map phase, before records are sent to the combiner.
pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
If the in-map partial aggregation does not reduce the output num records by this factor, it
gets disabled.

Miscellaneous:
exectype = mapreduce|tez|local; default is mapreduce. This property is the same as the -x switch.
pig.additional.jars.uris=<comma separated list of jars>. Used in place of the register command.
udf.import.list=<comma separated list of imports>. Used to avoid package names in UDFs.
stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of
the host.
Determines the timezone used to handle datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.

Verifying the Installation


Verify the installation of Apache Pig by typing the version command. If the installation is
successful, you will get the version of Apache Pig as shown below.
$ pig -version

Apache Pig version 0.15.0 (r1682971)


compiled Jun 01 2015, 11:44:35


Apache Pig Execution Modes

You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig
Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.

Apache Pig Execution Mechanisms

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.

Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode
using the Grunt shell. In this shell, you can enter the Pig Latin statements and get
the output (using Dump operator).

Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig
Latin script in a single file with .pig extension.

Embedded Mode (UDF) − Apache Pig provides the provision of defining our own
functions (User Defined Functions) in programming languages such as Java, and
using them in our script.
Invoking the Grunt Shell

You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as shown below.

Local mode:
$ ./pig -x local

MapReduce mode:
$ ./pig -x mapreduce

Either of these commands gives you the Grunt shell prompt as shown below.
grunt>

You can exit the Grunt shell using ‘ctrl + d’.

After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin statements in it.

grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode

You can write an entire Pig Latin script in a file and execute it using the -x option. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.

Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
Dump student;

Now, you can execute the script in the above file as shown below.

Local mode:
$ pig -x local Sample_script.pig

MapReduce mode:
$ pig -x mapreduce Sample_script.pig


Installing and Running HIVE


All Hadoop sub-projects such as Hive, Pig, and HBase support Linux operating system.
Therefore, you need to install any Linux flavored OS. The following simple steps are executed
for Hive installation:

Step 1: Verifying JAVA Installation


Java must be installed on your system before installing Hive. Let us verify java installation
using the following command:
$ java -version
If Java is already installed on your system, you get to see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If java is not installed in your system, then follow the steps given below for installing java.

Installing Java
Step I:

Download java (JDK <latest version> - X64.tar.gz) by visiting the following


link http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.
Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.

Step II:

Generally you will find the downloaded java file in the Downloads folder. Verify it and extract
the jdk-7u71-linux-x64.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step III:

To make java available to all the users, you have to move it to the location “/usr/local/”. Open
root, and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit


Step IV:

For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc
file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc

Step V:

Use the following commands to configure java alternatives:


# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2
# alternatives --install /usr/bin/javac javac /usr/local/java/bin/javac 2
# alternatives --install /usr/bin/jar jar /usr/local/java/bin/jar 2

# alternatives --set java /usr/local/java/bin/java
# alternatives --set javac /usr/local/java/bin/javac
# alternatives --set jar /usr/local/java/bin/jar


Now verify the installation using the command java -version from the terminal as explained
above.

Step 2: Verifying Hadoop Installation


Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop
installation using the following command:
$ hadoop version
If Hadoop is already installed on your system, then you will get the following response:
Hadoop 2.4.1 Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, then proceed with the following steps:

Downloading Hadoop
Download and extract Hadoop 2.4.1 from Apache Software Foundation using the following
commands.
$ su


password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1/* hadoop/
# exit

Installing Hadoop in Pseudo Distributed Mode


The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.

Step I: Setting up Hadoop

You can set Hadoop environment variables by appending the following commands
to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc

Step II: Hadoop Configuration

You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those configuration
files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs using java, you have to reset the java environment
variables in hadoop-env.sh file by replacing JAVA_HOME value with the location of java in
your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
Given below are the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop instance,
memory allocated for the file system, memory limit for storing the data, and the size of
Read/Write buffers.
Open the core-site.xml and add the following properties in between the <configuration> and
</configuration> tags.


<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, the namenode
path, and the datanode path of your local file systems. It means the place where you want to
store the Hadoop infra.
Let us assume the following data.
dfs.replication (data replication value) = 1

(In the following path, /hadoop/ is the user name; hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode
Open this file and add the following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>

</configuration>
Note: In the above file, all the property values are user-defined and you can make changes
according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the
following properties in between the <configuration>, </configuration> tags in this file.
<configuration>


<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop
contains a template of this file named mapred-site.xml.template. First of all, you need to copy
the file from mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

</configuration>

Verifying Hadoop Installation


The following steps are used to verify the Hadoop installation.

Step I: Name Node Setup

Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0


10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:


/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step II: Verifying Hadoop dfs

The following command is used to start dfs. Executing this command will start your Hadoop
file system.
$ start-dfs.sh
The expected output is as follows:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-
namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-
localhost.out
Starting secondary namenodes [0.0.0.0]

Step III: Verifying Yarn Script

The following command is used to start the yarn script. Executing this command will start your
yarn daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-
resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-
nodemanager-localhost.out

Step IV: Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the following url to get Hadoop
services on your browser.
http://localhost:50070/


Step V: Verify all applications for cluster

The default port number to access all applications of cluster is 8088. Use the following url to
visit this service.
http://localhost:8088/

Step 3: Downloading Hive

We use hive-0.14.0 in this tutorial. You can download it by visiting the following
link http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded onto the
/Downloads directory. Here, we download Hive archive named “apache-hive-0.14.0-bin.tar.gz”
for this tutorial. The following command is used to verify the download:
$ cd Downloads
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin.tar.gz

Step 4: Installing Hive


The following steps are required for installing Hive on your system. Let us assume the Hive
archive is downloaded onto the /Downloads directory.

Extracting and verifying Hive Archive


The following command is used to verify the download and extract the hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

Copying files to /usr/local/hive directory

We need to switch to the super user with "su -" to copy the files. The following commands are used to copy
the files from the extracted directory to the /usr/local/hive directory.
$ su -
passwd:

# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit

Setting up environment for Hive

You can set up the Hive environment by appending the following lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
The following command is used to execute ~/.bashrc file.
$ source ~/.bashrc

Step 5: Configuring Hive


To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in
the $HIVE_HOME/conf directory. The following commands redirect to Hive config folder
and copy the template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully. Now you require an external database server to
configure Metastore. We use Apache Derby database.

Step 6: Downloading and Installing Apache Derby


Follow the steps given below to download and install Apache Derby:


Downloading Apache Derby

The following command is used to download Apache Derby. It takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
The following command is used to verify the download:
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin.tar.gz

Extracting and verifying Derby archive

The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz

Copying files to /usr/local/derby directory

We need to switch to the super user with "su -" to copy the files. The following commands are used to copy the files
from the extracted directory to the /usr/local/derby directory:
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit

Setting up environment for Derby

You can set up the Derby environment by appending the following lines to ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
The following command is used to execute ~/.bashrc file:
$ source ~/.bashrc

Create a directory to store Metastore


Create a directory named data in $DERBY_HOME directory to store Metastore data.


$ mkdir $DERBY_HOME/data
Derby installation and environmental setup is now complete.

Step 7: Configuring Metastore of Hive


Configuring Metastore means specifying to Hive where the database is stored. You can do this
by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all, copy
the template file using the following command:
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the <configuration> and
</configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine

Step 8: Verifying Hive Installation


Before running Hive, you need to create the /tmp folder and a separate Hive warehouse folder
(here, /user/hive/warehouse) in HDFS, and give these newly created folders group write
permission (chmod g+w). Use the following commands:


$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp


$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
The following commands are used to verify Hive installation:
$ cd $HIVE_HOME
$ bin/hive
On successful installation of Hive, you get to see the following response:
Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-
0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>
The following sample command is executed to display all the tables:
hive> show tables;
OK
Time taken: 2.798 seconds
hive>

 What is HiveQL (HQL)?

The Hive query language provides the basic SQL-like operations. Here are a few of the tasks
that HQL can do easily.

 Create and manage tables and partitions


 Support various Relational, Arithmetic and Logical Operators
 Evaluate functions
 Download the contents of a table to a local directory or result of queries to HDFS
directory

Here are examples of HQL queries:

SELECT upper(name), salesprice
FROM sales;

SELECT category, count(1)
FROM products
GROUP BY category;


When you look at the above queries, you can see that they are very similar to standard SQL queries.

HIVE User Defined Functions

 Hive UDF versus UDAF

In Hive, you can define two main kinds of custom functions:

UDF
A UDF processes one or several columns of one row and outputs one value. For example:

o SELECT lower(str) from table

For each row in "table," the "lower" UDF takes one argument, the value of "str", and
outputs one value, the lowercase representation of "str".

o SELECT datediff(date_begin, date_end) from table

For each row in "table," the "datediff" UDF takes two arguments, the value of
"date_begin" and "date_end", and outputs one value, the difference in time between these
two dates.

Each argument of a UDF can be:


 A column of the table
 A constant value
 The result of another UDF
 The result of an arithmetic computation
For example, a single query can combine several of these argument types:
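(A hedged HiveQL sketch; the sales table and its columns are made up for illustration.)

SELECT lower(product_name),          -- a column of the table
       upper('flagship'),            -- a constant value
       lower(trim(product_name)),    -- the result of another UDF
       round(unit_price * 1.18, 2)   -- the result of an arithmetic computation
FROM sales;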

UDAF
A UDAF processes one or several columns of several input rows and outputs one value.
It is commonly used together with the GROUP BY operator. For example:

o SELECT sum(price) from table GROUP by customer;

The Hive Query executor will group rows by customer, and for each group, call the
UDAF with all price values. The UDAF then outputs one value for the output record (one
output record per customer);

o SELECT total_customer_value(quantity, unit_price, day) from table group by customer;


For each record of each group, the UDAF receives the values of the three selected columns,
and outputs one value for the output record.

 Oracle Big Data SQL

Oracle Big Data SQL extends Oracle SQL to Hadoop and NoSQL, and extends the security of
Oracle Database to all your data. It also includes a unique Smart Scan service that minimizes
data movement and maximizes performance by parsing and intelligently filtering data where it resides.

What are the main differences between Hive, Pig, and SQL?

HIVE:

1. Hive is a data warehouse system for Hadoop that facilitates easy data summarisation, ad-hoc queries, and analysis of large datasets stored in Hadoop-compatible file systems.
2. Hive provides a mechanism to query the data using an SQL-like language called HiveQL or HQL.
3. Hive enables developers not familiar with MapReduce to write data queries that are translated into MapReduce jobs in Hadoop.

Hive limitations (Compared to SQL Languages)

1. No support for update or delete.
2. No support for inserting single rows.
3. Limited number of built-in functions.
4. Not all standard SQL is supported.

PIG

1. An abstraction over the complexity of MapReduce programming, the Pig platform includes an execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets.
2. Its compiler translates Pig Latin into sequences of MapReduce programs.
