UNIT-I:
DESCRIPTIVE STATISTICS: Probability Distributions, Inferential Statistics, Inferential Statistics through Hypothesis Tests, Regression & ANOVA, Regression, ANOVA (Analysis of Variance)
UNIT-II:
INTRODUCTION TO BIG DATA: Big Data and its Importance, Four V’s of Big Data,
Drivers for Big Data, Introduction to Big Data Analytics, Big Data Analytics applications.
BIG DATA TECHNOLOGIES: Hadoop’s Parallel World, Data discovery, Open source
technology for Big Data Analytics, cloud and Big Data, Predictive Analytics, Mobile
Business Intelligence and Big Data, Crowd Sourcing Analytics, Inter- and Trans-Firewall
Analytics, Information Management.
UNIT-III:
PROCESSING BIG DATA: Integrating disparate data stores, Mapping data to the
programming framework, Connecting and extracting data from storage, Transforming data
for processing, subdividing data in preparation for Hadoop Map Reduce.
UNIT-IV:
HADOOP MAPREDUCE: Employing Hadoop Map Reduce, Creating the components of
Hadoop Map Reduce jobs, Distributing data processing across server farms, Executing
Hadoop Map Reduce jobs, monitoring the progress of job flows, The Building Blocks of
Hadoop Map Reduce Distinguishing Hadoop daemons, Investigating the Hadoop Distributed
File System Selecting appropriate execution modes: local, pseudo-distributed, fully
distributed.
UNIT-V:
BIG DATA TOOLS AND TECHNIQUES: Installing and Running Pig, Comparison with
Databases, Pig Latin, User- Define Functions, Data Processing Operators, Installing and
Running Hive, Hive QL, Querying Data, User-Defined Functions, Oracle Big Data.
CS- 503 (A) Data Analytics Notes By -Dr. Kapil Chaturvedi, Associate Professor, DoCSE, SIRT, BHOPAL, MP
UNIT -1
DESCRIPTIVE STATISTICS
Introduction
Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be a representation of either the entire population or a sample of it. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, while measures of variability include the standard deviation, variance, the minimum and maximum values, and the kurtosis and skewness.
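These measures can be computed directly with Python's standard library; the data set below is a hypothetical sample used only for illustration:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical sample

mean = statistics.mean(data)        # central tendency
median = statistics.median(data)
mode = statistics.mode(data)
spread = statistics.pstdev(data)    # variability (population standard deviation)

print(mean, median, mode, spread)   # 5 4.5 4 2.0
```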
PROBABILITY DISTRIBUTIONS
6 Common Probability Distributions every data science professional
should know.
Example.
Suppose you are a teacher at a university. After checking assignments for a week, you graded all the students. You gave these graded papers to a data entry operator at the university and told him to create a spreadsheet containing the grades of all the students. But he stored only the grades, not the corresponding students.
He made another blunder: he missed a couple of entries in a hurry, and we have no idea whose grades are missing. Let's find a way to solve this.
One way is that you visualize the grades and see if you can find a trend in the data.
The graph that you have plotted is called the frequency distribution of the data. You see that there is a smooth, curve-like structure that defines our data, but do you notice an anomaly? We have an abnormally low frequency at a particular score range. So the best guess would be that the missing values lie in that range, which would remove the dent in the distribution.
This is how you would try to solve a real-life problem using data analysis. For any data scientist, whether a student or a practitioner, distributions are a must-know concept. They provide the basis for analytics and inferential statistics.
While the concept of probability gives us the mathematical calculations, distributions help us
actually visualize what’s happening underneath.
In this article, I have covered some important probability distributions which are explained in a
lucid as well as comprehensive manner.
Note: This article assumes you have a basic knowledge of probability. If not, you may want to review an introduction to probability first.
Types of Distributions
1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal Distribution
5. Poisson Distribution
6. Exponential Distribution
Before we jump to the explanation of distributions, let's see what kinds of data we can encounter. The data can be discrete or continuous.
Discrete Data, as the name suggests, can take only specified values. For example, when you roll
a die, the possible outcomes are 1, 2, 3, 4, 5 or 6 and not 1.5 or 2.45.
Continuous Data can take any value within a given range. The range may be finite or infinite. For example, a girl's weight or height, or the length of a road. The weight of a girl can be any value: 54 kg, 54.5 kg, or 54.5436 kg.
Types of Distributions
1. Bernoulli Distribution
Let’s start with the easiest distribution that is Bernoulli Distribution. It is actually easier to
understand than it sounds!
All you cricket junkies out there! At the beginning of any cricket match, how do you decide who is going to bat or bowl? A toss! It all depends on whether you win or lose the toss, right? Let's say if the toss results in a head, you win. Else, you lose. There's no midway.
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and
a single trial. So the random variable X which has a Bernoulli distribution can take value 1 with
the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.
Here, the occurrence of a head denotes success, and the occurrence of a tail denotes failure.
Probability of getting a head = 0.5 = Probability of getting a tail since there are only two possible
outcomes.
The probability mass function is given by: p^x (1 − p)^(1−x), where x ∈ {0, 1}.
It can also be written as: P(X = 1) = p and P(X = 0) = 1 − p.
The probabilities of success and failure need not be equally likely, like the result of a fight between me and Undertaker. He is pretty much certain to win. So in this case the probability of my success is 0.15, while that of my failure is 0.85.
Here, the probability of success (p) is not the same as the probability of failure. So, the chart below shows the Bernoulli Distribution of our fight.
Here, the probability of success = 0.15 and the probability of failure = 0.85. The expected value is exactly what it sounds like. If I punch you, I may expect you to punch me back. Basically, the expected value of any distribution is the mean of the distribution. The expected value of a random variable X from a Bernoulli distribution is found as follows: E(X) = 1·p + 0·(1 − p) = p.
There are many examples of the Bernoulli distribution, such as whether it's going to rain tomorrow or not, where rain denotes success and no rain denotes failure, or winning (success) versus losing (failure) a game.
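As a short sketch, the Bernoulli PMF and its mean can be written in a few lines of Python, with p = 0.15 taken from the fight example above:

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

p = 0.15                      # probability of my success in the fight
print(bernoulli_pmf(1, p))    # 0.15
print(bernoulli_pmf(0, p))    # 0.85
print(1 * p + 0 * (1 - p))    # expected value E(X) = p = 0.15
```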
2. Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are equally likely, and that is the basis of a uniform distribution. Unlike the Bernoulli distribution, all n possible outcomes of a uniform distribution are equally likely.
You can see that the shape of the uniform distribution curve is rectangular, which is why the uniform distribution is also called the rectangular distribution.
The number of bouquets sold daily at a flower shop is uniformly distributed with a maximum of
40 and a minimum of 10.
Let’s try calculating the probability that the daily sales will fall between 15 and 30.
The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5
Similarly, the probability that daily sales are greater than 20 is (40 − 20) × (1/(40 − 10)) = 0.667.
The standard uniform density has parameters a = 0 and b = 1, so its PDF is given by: f(x) = 1 for 0 ≤ x ≤ 1, and 0 otherwise.
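The bouquet example above can be checked numerically; a minimal sketch, assuming a continuous uniform distribution on [10, 40]:

```python
a, b = 10, 40                     # minimum and maximum daily sales
density = 1 / (b - a)             # constant PDF on [a, b]

p_15_to_30 = (30 - 15) * density  # P(15 < X < 30)
p_gt_20 = (b - 20) * density      # P(X > 20)

print(round(p_15_to_30, 3))  # 0.5
print(round(p_gt_20, 3))     # 0.667
```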
3. Binomial Distribution
Let's get back to cricket. Suppose that you won the toss today and this indicates a successful event. You toss again but you lose this time. If you win a toss today, this does not mean that you will win the toss tomorrow. Let's assign a random variable, say X, to the number of times you won the toss. What can be the possible values of X? It can be any number depending on the number of times you tossed a coin.
There are only two possible outcomes. Head denoting success and tail denoting failure.
Therefore, probability of getting a head = 0.5 and the probability of failure can be easily
computed as: q = 1- p = 0.5.
A distribution where only two outcomes are possible, such as success or failure, gain or loss, win
or lose and where the probability of success and failure is same for all the trials is called a
Binomial Distribution.
The outcomes need not be equally likely. Remember the example of a fight between me and
Undertaker? So, if the probability of success in an experiment is 0.2 then the probability of
failure can be easily computed as q = 1 – 0.2 = 0.8.
Each trial is independent since the outcome of the previous toss doesn’t determine or affect the
outcome of the current toss. An experiment with only two possible outcomes repeated n number
of times is called binomial. The parameters of a binomial distribution are n and p where n is the
total number of trials and p is the probability of success in each trial.
On the basis of the above explanation, the properties of a Binomial Distribution are:
1. There are only two possible outcomes in a trial: either a success or a failure.
2. The probability of success and failure is the same for all trials. (Trials are identical.)
A binomial distribution graph where the probability of success does not equal the probability of failure is asymmetric, skewed toward the more likely outcome. When the probability of success equals the probability of failure, the graph of the binomial distribution is symmetric.
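A short Python sketch of the binomial PMF; the 5-toss scenario is a hypothetical extension of the coin-toss example:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of winning exactly 3 of 5 fair tosses (p = 0.5)
print(binomial_pmf(3, 5, 0.5))  # 0.3125

# Mean (n*p) and variance (n*p*(1-p)) of the distribution
n, p = 5, 0.5
print(n * p, n * p * (1 - p))   # 2.5 1.25
```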
4. Normal Distribution
Normal distribution represents the behavior of most of the situations in the universe (That is why
it’s called a “normal” distribution. I guess!). The large sum of (small) random variables often
turns out to be normally distributed, contributing to its widespread application. Any distribution
is known as Normal distribution if it has the following characteristics:
The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
Exactly half of the values are to the left of the center and the other half to the right.
A normal distribution is very different from a binomial distribution. However, if the number of trials approaches infinity, then the shapes will be quite similar.
A normally distributed random variable X has two parameters, the mean µ and the variance σ², and its PDF is given by: f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²)).
A standard normal distribution is defined as the distribution with mean 0 and standard deviation 1. For such a case, the PDF becomes: f(x) = (1/√(2π)) e^(−x²/2).
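The normal PDF can be sketched directly from the formula; evaluating it at x = 0 for the standard case gives the height of the peak of the bell curve:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x) = exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))"""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Height of the standard normal curve at its centre x = mu = 0:
print(round(normal_pdf(0), 4))  # 0.3989
```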
5. Poisson Distribution
Suppose you work at a call center; approximately how many calls do you get in a day? It can be any number. The total number of calls at a call center in a day is modeled by a Poisson distribution. Some more examples are the number of thefts reported in an area in a day, or the number of customers arriving at a salon in an hour.
You can now think of many examples following the same course. Poisson Distribution is
applicable in situations where events occur at random points of time and space wherein our
interest lies only in the number of occurrences of the event.
A distribution is called a Poisson distribution when the following assumptions are valid:
1. Any successful event should not influence the outcome of another successful event.
2. The probability of success over an interval is proportional to the length of the interval.
3. The probability of more than one success in a very small interval is negligible (it approaches zero as the interval becomes smaller).
Now, if any distribution validates the above assumptions then it is a Poisson distribution. Some notations used in the Poisson distribution are: λ is the rate at which an event occurs, t is the length of a time interval, and X is the number of events in that interval.
Here, X is called a Poisson random variable and the probability distribution of X is called a Poisson distribution, with PMF P(X = x) = e^(−µ) µ^x / x! for x = 0, 1, 2, ….
Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.
The mean µ is the parameter of this distribution. µ is also defined as the λ times length of that
interval. The graph of a Poisson distribution is shown below:
The graph shown below illustrates the shift in the curve due to increase in mean.
It is perceptible that as the mean increases, the curve shifts to the right.
6. Exponential Distribution
Let's consider the call center example one more time. What about the interval of time between the calls? Here, the exponential distribution comes to our rescue: it models the interval of time between the calls.
Exponential distribution is widely used for survival analysis. From the expected life of a machine
to the expected life of a human, exponential distribution successfully delivers the result.
f(x) = λe^(−λx) for x ≥ 0, and f(x) = 0 otherwise.
For survival analysis, λ is called the failure rate of a device at any time t, given that it has
survived up to t.
Also, the greater the rate, the faster the curve drops, and the lower the rate, the flatter the curve. This is explained better with the graph shown below.
P{X > x} = e^(−λx) corresponds to the area under the density curve to the right of x.
P{x1 < X ≤ x2} = e^(−λx1) − e^(−λx2) corresponds to the area under the density curve between x1 and x2.
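These two probabilities can be sketched in Python; the rate λ = 0.5 is an assumed value used only for illustration:

```python
from math import exp

lam = 0.5  # assumed rate parameter

def survival(x, lam):
    """P(X > x) = e**(-lam * x), the area to the right of x."""
    return exp(-lam * x)

def prob_between(x1, x2, lam):
    """P(x1 < X <= x2) = e**(-lam * x1) - e**(-lam * x2)."""
    return exp(-lam * x1) - exp(-lam * x2)

print(round(survival(2, lam), 4))         # 0.3679
print(round(prob_between(1, 2, lam), 4))  # 0.2387
```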
The "mean" is the "average" you're used to, where you add up all the numbers and then divide by
the number of numbers. The "median" is the "middle" value in the list of numbers. To find the
median, your numbers have to be listed in numerical order from smallest to largest, so you may
have to rewrite your list before you can find the median. The "mode" is the value that occurs
most often. If no number in the list is repeated, then there is no mode for the list.
Find the mean, median, mode, and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13
The mean is the usual average, so I'll add and then divide:
(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15
Note that the mean, in this case, isn't a value from the original list. This is a common result. You
should not assume that your mean will be one of your original numbers.
The median is the middle value, so first I'll have to rewrite the list in numerical order:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th number, which is 14.
The mode is the number that is repeated more often than any other, so 13 is the mode.
The largest value in the list is 21, and the smallest is 13, so the range is 21 – 13 = 8.
mean: 15
median: 14
mode: 13
range: 8
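The worked example above can be verified with Python's statistics module:

```python
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
value_range = max(values) - min(values)

print(mean, median, mode, value_range)  # 15 14 13 8
```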
REGRESSION
A regression analysis is a statistical procedure that allows you to make a prediction about an
outcome (or criterion) variable based on knowledge of some predictor variable. To create a
regression model, you first need to collect (a lot of) data on both variables, similar to what you
would do if you were conducting a correlation. Then you would determine the contribution of the
predictor variable to the outcome variable. Once you have the regression model, you would be
able to input an individual’s score on the predictor variable to get a prediction of their score on
the outcome variable.
- Example: You want to try to predict whether a student will come back for a second year
based on how many on-campus activities s/he attended. You would have to collect data on how
many activities students attended and then whether or not those students returned for a second
year. If activity attendance and retention are significantly related to each other, then you can
generate a regression model where you could identify at-risk students (in terms of retention)
based on how many activities they have attended.
- Example: You want to try to identify students who are at risk of failing College Algebra
based on their scores on a math assessment so you can direct them to special services on campus.
You would administer the math assessment at the start of the semester and then match each
student’s score on the math assessment to their final grade in the course. Eventually, your data
may show that the math assessment is significantly correlated to their final grade, and you can
create a regression model to identify those at-risk students so you can direct them to tutors and
other resources on campus.
You want to be able to make a prediction about an outcome given what you already know about
some related factor.
Another option with regression is to do a multiple regression, which allows you to make a
prediction about an outcome based on more than just one predictor variable. Many retention
models are essentially multiple regressions that consider factors such as GPA, level of
involvement, and attitude towards academics and learning.
Y’=a+bX
Step 1: Make a chart of your data, filling in the columns in the same way as you would fill in the chart if you were
finding the Pearson’s Correlation Coefficient.
Y’=a+bX
a = 65.1416
b = .385225
Step 2: Find a and b.
Find a:
a = (486 × 11,409 − 247 × 20,485) / (6 × 11,409 − 247²)
= 484,979 / 7,445
= 65.14
Find b:
b = (6 × 20,485 − 247 × 486) / (6 × 11,409 − 247²)
= (122,910 − 120,042) / (68,454 − 61,009)
= 2,868 / 7,445
= 0.385225
Step 3: Insert the values into the equation.
y’ = a + bx
y’ = 65.14 + .385225x
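The values of a and b above can be reproduced from the summary sums (Σx = 247, Σy = 486, Σxy = 20,485, Σx² = 11,409, n = 6):

```python
n = 6
sx, sy = 247, 486           # Σx, Σy
sxy, sxx = 20485, 11409     # Σxy, Σx²

denom = n * sxx - sx**2                 # 7,445
b = (n * sxy - sx * sy) / denom         # slope
a = (sy * sxx - sx * sxy) / denom       # intercept

print(round(a, 2), round(b, 6))  # 65.14 0.385225
```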
Formula-
In order to determine how strong the relationship is between two variables, a formula must be
followed to produce what is referred to as the coefficient value. The coefficient value can range
between -1.00 and 1.00. If the coefficient value is in the negative range, then that means the
relationship between the variables is negatively correlated, or as one value increases, the other
decreases. If the value is in the positive range, then that means the relationship between the
variables is positively correlated, or both values increase or decrease together. Let's look at the
formula for conducting the Pearson correlation coefficient value.
Step 1: Make a chart with your data for two variables, labeling the variables (x) and (y), and add three more columns labeled (xy), (x^2), and (y^2). A simple data chart might look like this:
Step 2: Complete the chart using basic multiplication of the variable values.
Step 3: After you have multiplied all the values to complete the chart, add up all of the columns from top to bottom.
Person Age (x) Score (y) (xy) (x^2) (y^2)
1 20 30 600 400 900
2 24 20 480 576 400
3 17 27 459 289 729
Total 61 77 1539 1265 2029
Step 4: Use this formula to find the Pearson correlation coefficient value:
r = (n·Σxy − Σx·Σy) / √[(n·Σx² − (Σx)²) (n·Σy² − (Σy)²)]
Sample question: Find the value of the correlation coefficient from the following
table:
SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)
1   43   99
2   21   65
3   25   79
4   42   75
5   57   87
6   59   81
Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².
SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY   X²   Y²
1   43   99
2   21   65
3   25   79
4   42   75
5   57   87
6   59   81
Step 2: Multiply x and y together to fill the xy column. For example, row 1 would
be 43 × 99 = 4,257.
SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY   X²   Y²
1   43   99   4257
2   21   65   1365
3   25   79   1975
4   42   75   3150
5   57   87   4959
6   59   81   4779
Step 3: Take the square of the numbers in the x column, and put the result in the x² column.
SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY   X²   Y²
1   43   99   4257   1849
2   21   65   1365   441
3   25   79   1975   625
4   42   75   3150   1764
5   57   87   4959   3249
6   59   81   4779   3481
Step 4: Take the square of the numbers in the y column, and put the result in the y² column.
SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY   X²   Y²
1   43   99   4257   1849   9801
2   21   65   1365   441   4225
3   25   79   1975   625   6241
4   42   75   3150   1764   5625
5   57   87   4959   3249   7569
6   59   81   4779   3481   6561
Step 5: Add up all of the numbers in the columns and put the result at the bottom
of the column. The Greek letter sigma (Σ) is a short way of saying “sum of.”
SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY   X²   Y²
1   43   99   4257   1849   9801
2   21   65   1365   441   4225
3   25   79   1975   625   6241
4   42   75   3150   1764   5625
5   57   87   4959   3249   7569
6   59   81   4779   3481   6561
Σ   247   486   20485   11409   40022
Step 6: Use the following correlation coefficient formula:
r = (n·Σxy − Σx·Σy) / √[(n·Σx² − (Σx)²) (n·Σy² − (Σy)²)]
Σx = 247
Σy = 486
Σxy = 20,485
Σx² = 11,409
Σy² = 40,022
n is the sample size, in our case 6.
The correlation coefficient =
2,868 / 5,413.27 = 0.5298
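The same summary sums give the coefficient directly; a minimal sketch:

```python
from math import sqrt

n = 6
sx, sy = 247, 486                # Σx, Σy
sxy, sxx, syy = 20485, 11409, 40022  # Σxy, Σx², Σy²

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 4))  # 0.5298
```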
INFERENTIAL STATISTICS
What is the main purpose of inferential statistics?
The main purpose of inferential statistics is to estimate a population characteristic based on a sample. In other words, inferential statistics allows you to make inferences about the population from the sample data.
When you have quantitative data, you can analyze it using either descriptive or inferential
statistics. Descriptive statistics do exactly what it sounds like – they describe the data.
Descriptive statistics include measures of central tendency (mean, median, mode), measures of
variation (standard deviation, variance), and relative position (quartiles, percentiles). There are
times, however, when you want to draw conclusions about the data. This may include making
comparisons across time, comparing different groups, or trying to make predictions based on
data that has been collected. Inferential statistics are used when you want to move beyond simple
description or characterization of your data and draw conclusions based on your data. There are
several kinds of inferential statistics that you can calculate; here are a few of the more common
types:
A random sample of 100 coin flips is taken from a random population of coin flippers,
and the null hypothesis is then tested. If it is found that the 100 coin flips were distributed
as 40 heads and 60 tails, the analyst would assume that a penny does not have a 50%
chance of landing on heads and would reject the null hypothesis and accept the
alternative hypothesis. Afterward, a new hypothesis would be tested, this time that a
penny has a 40% chance of landing on heads.
1. The first step is for the analyst to state the two hypotheses so that only one can be
right.
2. The next step is to formulate an analysis plan, which outlines how the data will be
evaluated.
3. The third step is to carry out the plan and physically analyze the sample data.
4. The fourth and final step is to analyze the results and either accept or reject the
null hypothesis.
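The four steps above can be sketched as an exact two-sided binomial test of the coin-flip example; the use of math.comb (rather than any particular statistics package) is an implementation choice here:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Step 1: H0: the penny is fair (p = 0.5); H1: it is not.
# Step 2: analysis plan - an exact two-sided binomial test.
# Step 3: analyze the sample data - 40 heads observed in 100 flips.
n, heads, p0 = 100, 40, 0.5
# Because p0 = 0.5 the distribution is symmetric, so the two-sided
# p-value is twice the lower-tail probability P(X <= 40).
p_value = 2 * sum(binom_pmf(k, n, p0) for k in range(heads + 1))
# Step 4: compare p_value against the chosen significance level to
# decide whether to reject the null hypothesis.
```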
ANOVA (ANALYSIS OF VARIANCE)
Analysis of variance (ANOVA) is a parametric statistical technique used to compare the means of two or more groups.
The use of this parametric statistical technique involves certain key assumptions, including the
following:
1. Independence of case: Independence of case assumption means that the case of the
dependent variable should be independent or the sample should be selected randomly. There
should not be any pattern in the selection of the sample.
2. Normality: The distribution of the dependent variable should be approximately normal within each group.
3. Homogeneity: Homogeneity means the variance between the groups should be the same. Levene's test is used to test the homogeneity between groups.
If particular data follows the above assumptions, then the analysis of variance (ANOVA) is the
best technique to compare the means of two, or more, populations.
One way analysis: When we are comparing two or more groups based on one factor variable, it is said to be a one way analysis of variance (ANOVA). For example, we may want to compare whether or not the mean output of three workers is the same based on the working hours of the three workers.
Two way analysis: When there are two factor variables, it is said to be a two way analysis of variance (ANOVA). For example, based on working condition and working hours, we can compare whether or not the mean output of three workers is the same.
K-way analysis: When there are k factor variables, it is said to be a k-way analysis of variance (ANOVA).
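A one-way ANOVA F statistic can be sketched by hand; the three workers' output figures below are hypothetical:

```python
# Hypothetical daily output for 3 workers (one factor: worker)
groups = [
    [22, 24, 25, 23],   # worker 1
    [28, 30, 29, 31],   # worker 2
    [18, 20, 19, 21],   # worker 3
]

k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total observations
grand_mean = sum(sum(g) for g in groups) / n

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

# F = (SS_between / (k - 1)) / (SS_within / (n - k))
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_stat, 2))  # 60.8
```

A large F value relative to the F distribution with (k − 1, n − k) degrees of freedom indicates that the group means differ.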
UNIT-2
INTRODUCTION TO BIG DATA
According to Gartner:
Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making
From Wikipedia:
Big Data is a broad term for data sets so large or complex that they are difficult to process
using traditional data processing applications. Challenges include analysis, capture,
curation, search, sharing, storage, transfer, visualization, and information privacy.
Volume: The amount of data is immense. Each day 2.3 trillion gigabytes of new
data is being created.
Velocity: The speed of data (always in flux) and processing (analysis of streaming
data to produce near or real time results)
Variety: The different types of data, structured, as well as, unstructured.
Visibility: This dimension refers to a customer's ability to see and track their experience or order through the operations process. High visibility examples include courier companies where you can track your package online, or a retail store where you pick up the goods and purchase them over the counter.
Value: Value is the end game. After addressing volume, velocity, variety,
variability, veracity, and visualization – which takes a lot of time, effort and
resources – you want to be sure your organization is getting value from the data.
Variability: Variability is different from variety. A coffee shop may offer 6
different blends of coffee, but if you get the same blend every day and it tastes
different every day, that is variability. The same is true of data; if the meaning is
constantly changing it can have a huge impact on your data homogenization.
Big data analytics examines large amounts of data to uncover hidden patterns,
correlations and other insights. With today’s technology, it’s possible to analyze your
data and get answers from it immediately. Big Data Analytics helps you to understand
your organization better. With the use of Big data analytics, one can make informed
decisions without blindly relying on guesses.
The primary goal of Big Data applications is to help companies make more informative
business decisions by analyzing large volumes of data. It could include web server logs,
Internet click stream data, social media content and activity reports, text from customer
emails, mobile phone call details and machine data captured by multiple sensors.
Organisations from different domains are investing in Big Data applications to examine large data sets and uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.
Big Data Analytics applications include:
o Ad targeting
o Content monetization and new product development
o Scientific research
o Weather forecasting
The list of technology vendors offering big data solutions is seemingly infinite. Many of
the big data solutions that are particularly popular right now fit into one of the following
15 categories:
1. Hadoop
While Apache Hadoop may not be as dominant as it once was, it's nearly impossible to
talk about big data without mentioning this open source framework for distributed
processing of large data sets. Last year, Forrester predicted, "100% of all large enterprises
will adopt it (Hadoop and related technologies such as Spark) for big data
analytics within the next two years."
Over the years, Hadoop has grown to encompass an entire ecosystem of related software,
and many commercial big data solutions are based on Hadoop. In fact, Zion Market
Research forecasts that the market for Hadoop-based products and services will continue
to grow at a 50 percent CAGR through 2022, when it will be worth $87.14 billion, up
from $7.69 billion in 2016.
Key Hadoop vendors include Cloudera, Hortonworks and MapR, and the leading public
clouds all offer services that support the technology.
2. Spark
Apache Spark is part of the Hadoop ecosystem, but its use has become so widespread that
it deserves a category of its own. It is an engine for processing big data within Hadoop,
and it's up to one hundred times faster than the standard Hadoop engine, MapReduce.
In the AtScale 2016 Big Data Maturity Survey, 25 percent of respondents said that they
had already deployed Spark in production, and 33 percent more had Spark projects in
development. Clearly, interest in the technology is sizable and growing, and many
vendors with Hadoop offerings also offer Spark-based products.
3. R
R, an open source project, is a programming language and software environment designed for
working with statistics. The darling of data scientists, it is managed by the R Foundation
and available under the GPL 2 license. Many popular integrated development
environments (IDEs), including Eclipse and Visual Studio, support the language.
Several organizations that rank the popularity of various programming languages say that
R has become one of the most popular languages in the world. For example,
the IEEE says that R is the fifth most popular programming language, and
both Tiobe and RedMonk rank it 14th. This is significant because the programming
languages near the top of these charts are usually general-purpose languages that can be
used for many different kinds of work. For a language that is used almost exclusively for
big data projects to be so near the top demonstrates the significance of big data and the
importance of this language in its field.
4. Data Lakes
To make it easier to access their vast stores of data, many enterprises are setting up data
lakes. These are huge data repositories that collect data from many different sources and
store it in its natural state. This is different than a data warehouse, which also collects
data from disparate sources, but processes it and structures it for storage. In this case, the
lake and warehouse metaphors are fairly accurate. If data is like water, a data lake is
natural and unfiltered like a body of water, while a data warehouse is more like a
collection of water bottles stored on shelves.
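The metaphor can be made concrete with a small Python sketch (the schema and field names are invented for illustration): a lake accepts records as-is, while a warehouse validates and structures each record on write.

```python
# A data lake keeps records in their natural, unfiltered state, while a
# warehouse validates and structures each record before storing it.

def store_in_lake(lake, record):
    lake.append(record)                      # accept any record as-is

def store_in_warehouse(warehouse, record):
    # enforce a fixed (illustrative) schema on write
    if not {"id", "amount"} <= record.keys():
        raise ValueError("record does not match warehouse schema")
    warehouse.append({"id": int(record["id"]), "amount": float(record["amount"])})

lake, warehouse = [], []
store_in_lake(lake, {"id": 1, "amount": "9.5"})
store_in_lake(lake, {"tweet": "unstructured text"})     # fine in a lake
store_in_warehouse(warehouse, {"id": 1, "amount": "9.5"})
print(warehouse)   # [{'id': 1, 'amount': 9.5}]
```

Note how the second record, which has no fixed structure, is perfectly acceptable in the lake but would be rejected by the warehouse.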
Data lakes are particularly attractive when enterprises want to store data but aren't yet
sure how they might use it. A lot of Internet of Things (IoT) data might fit into that
category, and the IoT trend is playing into the growth of data lakes.
MarketsandMarkets predicts that data lake revenue will grow from $2.53 billion in 2016
to $8.81 billion by 2021.
5. NoSQL Databases
NoSQL databases specialize in storing unstructured data and providing fast performance,
although they don't provide the same level of consistency as RDBMSes. Popular NoSQL
databases include MongoDB, Redis, Cassandra, Couchbase and many others; even the
leading RDBMS vendors like Oracle and IBM now also offer NoSQL databases.
NoSQL databases have become increasingly popular as the big data trend has grown.
According to Allied Market Research the NoSQL market could be worth $4.2 billion by
2020. However, the market for RDBMSes is still much, much larger than the market for
NoSQL.
6. Predictive Analytics
Predictive analytics is a subset of big data analytics that attempts to forecast future
events or behavior based on historical data. It draws on data mining, modeling and
machine learning techniques to predict what will happen next. It is often used for fraud
detection, credit scoring, marketing, finance and business analysis purposes.
In recent years, advances in artificial intelligence have enabled vast improvements in the
capabilities of predictive analytics solutions. As a result, enterprises have begun to invest
more in big data solutions with predictive capabilities. Many vendors, including
Microsoft, IBM, SAP, SAS, Statistica, RapidMiner, KNIME and others, offer predictive
analytics solutions. Zion Market Research says the Predictive Analytics market generated
$3.49 billion in revenue in 2016, a number that could reach $10.95 billion by 2022.
7. In-Memory Databases
In any computer system, the memory, also known as RAM, is orders of magnitude
faster than the long-term storage. If a big data analytics solution can process data that is
stored in memory, rather than data stored on a hard drive, it can perform dramatically
faster. And that's exactly what in-memory database technology does.
Many of the leading enterprise software vendors, including SAP, Oracle, Microsoft and
IBM, now offer in-memory database technology. In addition, several smaller companies
like Teradata, Tableau, VoltDB and DataStax offer in-memory database solutions.
Research from MarketsandMarkets estimates that total sales of in-memory technology
were $2.72 billion in 2016 and may grow to $6.58 billion by 2021.
8. Big Data Security
Because big data repositories present an attractive target to hackers and advanced
persistent threats, big data security is a large and growing concern for enterprises. In the
AtScale survey, security was the second fastest-growing area of concern related to big
data.
According to the IDG report, the most popular types of big data security solutions include
identity and access controls (used by 59 percent of respondents), data encryption (52
percent) and data segregation (42 percent). Dozens of vendors offer big data security
solutions, and Apache Ranger, an open source project from the Hadoop ecosystem, is
also attracting growing attention.
9. Data Governance
Closely related to the idea of security is the concept of governance. Data governance is a
broad topic that encompasses all the processes related to the availability, usability and
integrity of data. It provides the basis for making sure that the data used for big data
analytics is accurate and appropriate, as well as providing an audit trail so that business
analysts or executives can see where data originated.
In the NewVantage Partners survey, 91.8 percent of the Fortune 1000 executives
surveyed said that governance was either critically important (52.5 percent) or important
(39.3 percent) to their big data initiatives. Vendors offering big data governance tools
include Collibra, IBM, SAS, Informatica, Adaptive and SAP.
10. Self-Service Capabilities
With data scientists and other big data experts in short supply — and commanding large
salaries — many organizations are looking for big data analytics tools that allow business
users to serve their own needs. In fact, a report from Research and
Markets estimates that the self-service business intelligence market generated $3.61
billion in revenue in 2016 and could grow to $7.31 billion by 2021. And Gartner has
noted, "The modern BI and analytics platform emerged in the last few years to meet new
organizational requirements for accessibility, agility and deeper analytical insight,
shifting the market from IT-led, system-of-record reporting to business-led, agile
analytics including self-service."
Hoping to take advantage of this trend, multiple business intelligence and big data
analytics vendors, such as Tableau, Microsoft, IBM, SAP, Splunk, Syncsort, SAS,
TIBCO, Oracle and other have added self-service capabilities to their solutions. Time will
tell whether any or all of the products turn out to be truly usable by non-experts and
whether they will provide the business value organizations are hoping to achieve with
their big data initiatives.
11. Artificial Intelligence
While the concept of artificial intelligence (AI) has been around nearly as long as there
have been computers, the technology has only become truly usable within the past couple
of years. In many ways, the big data trend has driven advances in AI, particularly in two
subsets of the discipline: machine learning and deep learning.
The standard definition of machine learning is that it is technology that gives "computers
the ability to learn without being explicitly programmed." In big data analytics, machine
learning technology allows systems to look at historical data, recognize patterns, build
models and predict future outcomes. It is also closely associated with predictive analytics.
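As a minimal sketch of "learning from historical data" (the transaction features, labels and threshold-free rule here are invented for illustration, and real systems use far more sophisticated models), a one-nearest-neighbor rule predicts the outcome of a new case from the most similar past case:

```python
# A minimal "learn from history" sketch: a 1-nearest-neighbor classifier
# that predicts the label of a new point from the closest past example.
# The (amount, hour-of-day) features and outcomes are made up.

def predict(history, features):
    # pick the past example closest to the new one (squared Euclidean distance)
    closest = min(history, key=lambda ex: sum((a - b) ** 2
                                              for a, b in zip(ex[0], features)))
    return closest[1]

history = [((100.0, 1.0), "legit"),     # (amount, hour-of-day), outcome
           ((9500.0, 3.0), "fraud"),
           ((120.0, 14.0), "legit")]

print(predict(history, (9000.0, 2.0)))   # → fraud
```

The pattern is the same one large systems follow: historical examples in, a model of similarity or structure, a prediction out.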
Deep learning is a type of machine learning technology that relies on artificial neural
networks and uses multiple layers of algorithms to analyze data. As a field, it holds a lot
of promise for allowing analytics tools to recognize the content in images and videos and
then process it accordingly.
Experts say this area of big data tools seems poised for a dramatic takeoff. IDC has
predicted, "By 2018, 75 percent of enterprise and ISV development will include
cognitive/AI or machine learning functionality in at least one application, including all
business analytics tools."
Leading AI vendors with tools related to big data include Google, IBM, Microsoft and
Amazon Web Services, and dozens of small startups are developing AI technology (and
getting acquired by the larger technology vendors).
12. Streaming Analytics
As organizations have become more familiar with the capabilities of big data analytics
solutions, they have begun demanding faster and faster access to insights. For these
enterprises, streaming analytics, with the ability to analyze data as it is being created, is
something of a holy grail. They are looking for solutions that can accept input from
multiple disparate sources, process it and return insights immediately — or as close to it
as possible. This is particularly desirable when it comes to new IoT deployments, which are
helping to drive the interest in streaming big data analytics.
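The core idea, producing an up-to-the-moment insight per event without storing the whole stream, can be sketched with a Python generator (the sensor readings are invented):

```python
# Streaming analytics sketch: compute a running average over events as
# they arrive, without ever materializing the full stream in memory.

def running_average(stream):
    total, count = 0.0, 0
    for value in stream:          # each event is processed as it arrives
        total += value
        count += 1
        yield total / count       # an insight is available immediately

readings = [20.0, 22.0, 21.0, 25.0]       # e.g. temperature sensor events
print(list(running_average(readings)))    # [20.0, 21.0, 21.0, 22.0]
```

Production streaming engines apply the same principle at scale, maintaining small incremental state per window or key instead of batch-scanning stored data.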
Several vendors offer products that promise streaming analytics capabilities. They
include IBM, Software AG, SAP, TIBCO, Oracle, DataTorrent, SQLstream, Cisco,
Informatica and others. MarketsandMarkets believes the streaming analytics solutions
brought in $3.08 billion in revenue in 2016, which could increase to $13.70 billion by
2021.
13. Edge Computing
In addition to spurring interest in streaming analytics, the IoT trend is also generating
interest in edge computing. In some ways, edge computing is the opposite of cloud
computing. Instead of transmitting data to a centralized server for analysis, edge
computing systems analyze data very close to where it was created — at the edge of the
network.
The advantage of an edge computing system is that it reduces the amount of information
that must be transmitted over the network, thus reducing network traffic and related costs.
It also decreases demands on data centers or cloud computing facilities, freeing up
capacity for other workloads and eliminating a potential single point of failure.
While the market for edge computing, and more specifically for edge computing
analytics, is still developing, some analysts and venture capitalists have begun calling the
technology the "next big thing."
14. Blockchain
Also a favorite with forward-looking analysts and venture capitalists, blockchain is the
distributed database technology that underlies Bitcoin digital currency. The unique
feature of a blockchain database is that once data has been written, it cannot be deleted or
changed after the fact. In addition, it is highly secure, which makes it an excellent choice
for big data applications in sensitive industries like banking, insurance, health care, retail
and others.
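A toy hash chain (pure Python, not any production blockchain) shows why written data cannot be silently changed: each block stores its predecessor's hash, so editing an earlier block invalidates everything after it.

```python
import hashlib

# Each block records the hash of the previous block, chaining them
# together; altering any past block breaks every hash that follows it.

def make_block(prev_hash, data):
    digest = hashlib.sha256((prev_hash + data).encode()).hexdigest()
    return {"prev": prev_hash, "data": data, "hash": digest}

def chain_is_valid(chain):
    for i, block in enumerate(chain):
        prev = chain[i - 1]["hash"] if i else "genesis"
        expected = hashlib.sha256((prev + block["data"]).encode()).hexdigest()
        if block["prev"] != prev or block["hash"] != expected:
            return False
    return True

chain = [make_block("genesis", "pay Alice 10")]
chain.append(make_block(chain[-1]["hash"], "pay Bob 5"))
assert chain_is_valid(chain)

chain[0]["data"] = "pay Alice 1000"       # tamper with history...
assert not chain_is_valid(chain)          # ...and the chain no longer verifies
```

Real blockchains add distribution and consensus on top of this structure, but the tamper-evidence property comes from exactly this hash linking.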
Blockchain technology is still in its infancy and use cases are still developing. However,
several vendors, including IBM, AWS, Microsoft and multiple startups, have rolled out
experimental or introductory solutions built on blockchain technology.
15. Prescriptive Analytics
Many analysts divide big data analytics tools into four big categories. The first,
descriptive analytics, simply tells what happened. The next type, diagnostic analytics,
goes a step further and provides a reason for why events occurred. The third type,
predictive analytics, discussed in depth above, attempts to determine what will happen
next. This is as sophisticated as most analytics tools currently on the market can get.
However, there is a fourth type of analytics that is even more sophisticated, although very
few products with these capabilities are available at this time. Prescriptive analytics
offers advice to companies about what they should do in order to make a desired result
happen. For example, while predictive analytics might give a company a warning that the
market for a particular product line is about to decrease, prescriptive analytics will
analyze various courses of action in response to those market changes and forecast the
most likely results.
Currently, very few enterprises have invested in prescriptive analytics, but many analysts
believe this will be the next big area of investment after organizations begin experiencing
the benefits of predictive analytics.
HADOOP
Hadoop is an open source distributed processing framework that manages data processing
and storage for big data applications running in clustered systems. It is at the center of a
growing ecosystem of big data technologies that are primarily used to support advanced
analytics initiatives, including predictive analytics, data mining and machine learning
applications. Hadoop can handle various forms of structured and unstructured data,
giving users more flexibility for collecting, processing and analyzing data than relational
databases and data warehouses provide.
Hadoop runs on clusters of commodity servers and can scale up to support thousands of
hardware nodes and massive amounts of data. It uses a namesake distributed file system
that's designed to provide rapid data access across the nodes in a cluster, plus fault-
tolerant capabilities so applications can continue to run if individual nodes fail.
Consequently, Hadoop became a foundational data management platform for big data
analytics uses after it emerged in the mid-2000s.
HISTORY OF HADOOP
Hadoop was created by computer scientists Doug Cutting and Mike Cafarella, initially to
support processing in the Nutch open source search engine and web crawler. After
Google published technical papers detailing its Google File System (GFS) and
MapReduce programming framework in 2003 and 2004, Cutting and Cafarella modified
earlier technology plans and developed a Java-based MapReduce implementation and a
file system modeled on Google's.
In early 2006, those elements were split off from Nutch and became a separate Apache
subproject, which Cutting named Hadoop after his son's stuffed elephant. At the same
time, Cutting was hired by internet services company Yahoo, which became the first
production user of Hadoop later in 2006.
Use of the framework grew over the next few years, and three independent Hadoop
vendors were founded: Cloudera in 2008, MapR a year later and Hortonworks as a Yahoo
spinoff in 2011. In addition, AWS launched a Hadoop cloud service called Elastic
MapReduce in 2009. That was all before Apache released Hadoop 1.0.0, which became
available in December 2011 after a succession of 0.x releases.
Put simply: Hadoop has two main components. The first component, the Hadoop
Distributed File System, helps split the data, put it on different nodes, replicate it and
manage it. The second component, MapReduce, processes the data on each node in
parallel and calculates the results of the job. A third component, the YARN resource
manager, helps schedule and manage the data processing jobs.
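The map and reduce phases can be sketched in a few lines of single-process Python; a real Hadoop job runs the same phases across many nodes, with the framework handling the shuffle between them. The word-count job below is the classic textbook illustration:

```python
from collections import defaultdict

# MapReduce in miniature: map each input split to (key, value) pairs,
# group pairs by key (the shuffle/sort step), then reduce each group.

def map_phase(line):                      # one mapper per input split
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):           # one reducer per distinct key
    return word, sum(counts)

splits = ["big data tools", "big data analytics"]

grouped = defaultdict(list)               # the shuffle/sort step
for line in splits:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)   # {'big': 2, 'data': 2, 'tools': 1, 'analytics': 1}
```

Because each mapper sees only its own split and each reducer only its own key's values, both phases can be distributed across a cluster without the functions changing.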
Why is Hadoop important?
Storage and processing power -- it can store and process vast amounts of structured and
unstructured data, quickly.
Fault tolerance -- application and data processing are protected against hardware failure. If
one node goes down, jobs are redirected automatically to other nodes to ensure that the
distributed computing doesn't fail.
Flexibility -- the data doesn't have to be preprocessed before it's stored. Organizations can
store as much data as they want, including unstructured data, such as text, videos and
images, and decide how to use it later.
Scalability -- companies can add nodes to enable their systems to handle more data.
HADOOP APPLICATIONS
YARN greatly expanded the applications that Hadoop clusters can handle to include
stream processing and real-time analytics applications run in tandem with processing
engines, like Apache Spark and Apache Flink. For example, some manufacturers are
using real-time data that's streaming into Hadoop in predictive maintenance applications
to try to detect equipment failures before they occur. Fraud detection, website
personalization and customer experience scoring are other real-time use cases.
Because Hadoop can process and store such a wide assortment of data, it enables
organizations to set up data lakes as expansive reservoirs for incoming streams of
information. In a Hadoop data lake, raw data is often stored as is so data scientists and
other analysts can access the full data sets if need be; the data is then filtered and
prepared by analytics or IT teams as needed to support different applications.
Data lakes generally serve different purposes than traditional data warehouses that hold
cleansed sets of transaction data. But, in some cases, companies view their Hadoop data
lakes as modern-day data warehouses. Either way, the growing role of big data analytics
in business decision-making has made effective data governance and data security
processes a priority in data lake deployments.
Risk management -- financial institutions use Hadoop clusters to develop more accurate
risk analysis models for their customers. Financial services companies can use Hadoop to
build and run applications to assess risk, build investment models and develop trading
algorithms.
Predictive maintenance -- with input from IoT devices feeding data into big data
programs, companies in the energy industry can use Hadoop-powered analytics to help
predict when equipment might fail and determine when maintenance should be performed.
Supply chain risk management -- manufacturing companies, for example, can track the
movement of goods and vehicles so they can determine the costs of various transportation
options. Using Hadoop, manufacturers can analyze large amounts of historical, time-
stamped location data as well as map out potential delays so they can optimize their
delivery routes.
Big Data analytics software is widely used to provide meaningful analysis of large data
sets. This software helps in finding current market trends, customer preferences,
and other information.
Here are 8 top Big Data analytics tools with their key features.
1. Apache Hadoop
Apache Hadoop is the long-standing champion in the field of Big Data processing, well
known for its huge-scale data processing capabilities. This open source Big Data framework
can run on-premises or in the cloud and has quite low hardware requirements. The main
Hadoop benefits and features are as follows:
Hadoop Libraries — the needed glue for enabling third party modules to work
with Hadoop
2. Apache Spark
Apache Spark is the alternative — and in many aspects the successor — of Apache
Hadoop. Spark was built to address the shortcomings of Hadoop and it does this
incredibly well. For example, it can process both batch data and real-time data, and
operates up to 100 times faster than MapReduce. Spark provides in-memory data
processing capabilities, which are far faster than the disk-based processing leveraged by
MapReduce. In addition, Spark works with HDFS, OpenStack and Apache Cassandra,
both in the cloud and on-prem, adding another layer of versatility to big data operations
for your business.
3. Apache Storm
Storm is another Apache product, a real-time framework for data stream processing,
which supports any programming language. Storm scheduler balances the workload
between multiple nodes based on topology configuration and works well with Hadoop
HDFS. Apache Storm has the following benefits:
Built-in fault-tolerance
Auto-restart on crashes
Clojure-written
4. Apache Cassandra
Apache Cassandra is one of the pillars behind Facebook’s massive success, as it allows
processing of structured data sets distributed across a huge number of nodes across the globe. It
works well under heavy workloads due to its architecture without single points of failure
and boasts unique capabilities no other NoSQL or relational DB has, such as:
Built-in high-availability
5. MongoDB (https://www.guru99.com/mongodb-tutorials.html)
MongoDB is another great example of an open source NoSQL database with rich
features, which is cross-platform compatible with many programming languages. IT Svit
uses MongoDB in a variety of cloud computing and monitoring solutions, and we
specifically developed a module for automated MongoDB backups using Terraform. The
most prominent MongoDB features are:
Stores any type of data, from text and integers to strings, arrays, dates and booleans
6. R Programming Environment
R is mostly used along with the JupyteR stack (Julia, Python, R) for enabling wide-scale
statistical analysis and data visualization. JupyteR Notebook is one of the 4 most popular Big
Data visualization tools, as it allows composing literally any analytical model from more
R is highly portable
R easily scales from a single test machine to vast Hadoop data lakes
7. Neo4j
Neo4j is an open source graph database that stores data as interconnected nodes and
relationships, with properties kept as key-value pairs. IT Svit has recently built a resilient
AWS infrastructure with Neo4j for one of our customers and the database performs well
under heavy workload of network data and graph-related requests. Main Neo4j features
are as follows:
8. Apache SAMOA
This is another of the Apache family of tools used for Big Data processing. SAMOA
specializes in building distributed streaming algorithms for successful Big Data mining.
This tool is built with a pluggable architecture and must be used atop other Apache
products like Apache Storm we mentioned earlier. Its other features used for Machine
Learning include the following:
Clustering
Classification
Normalization
Regression
Using Apache SAMOA enables distributed stream processing engines to provide tangible
benefits.
Final thoughts on the list of hot Big Data tools for 2018
The Big Data industry and data science evolve rapidly and have progressed a great deal
lately, with multiple Big Data projects and tools launched in 2017. This is one of the hottest
IT trends of 2018, along with IoT, blockchain, AI & ML.
PREDICTIVE ANALYTICS
Predictive analytics is a form of advanced analytics that uses both new and
historical data to forecast activity, behavior and trends. It involves applying
statistical analysis techniques, analytical queries and automated machine learning
algorithms to data sets to create predictive models that place a numerical value --
or score -- on the likelihood of a particular event happening.
Predictive analytics software applications use variables that can be measured and
analyzed to predict the likely behavior of individuals, machinery or other entities.
For example, an insurance company is likely to take into account potential driving
safety variables, such as age, gender, location, type of vehicle and driving record,
when pricing and issuing auto insurance policies.
Multiple variables are combined into a predictive model capable of assessing
future probabilities with an acceptable level of reliability. The software relies
heavily on advanced algorithms and methodologies, such as logistic regression
models, time series analysis and decision trees.
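As a toy illustration of the idea (not any vendor's product), the sketch below fits a least-squares line to invented historical sales figures and scores the next period; real solutions use much richer models such as logistic regression, time series methods or decision trees.

```python
# A bare-bones predictive model: fit a least-squares line to historical
# monthly sales (invented numbers) and forecast the next period.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x    # (slope, intercept)

months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept             # score the next month
print(forecast)   # → 18.0
```

The forecast value plays the role of the predictive "score": a number summarizing how likely or how large the next outcome is expected to be, given the historical pattern.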
Predictive analytics has grown in prominence alongside the emergence of big data
systems. As enterprises have amassed larger and broader pools of data in Hadoop
clusters and other big data platforms, they have created increased data mining
opportunities to gain predictive insights. Heightened development and
commercialization of machine learning tools by IT vendors have also helped
expand predictive analytics capabilities.
MOBILE BUSINESS INTELLIGENCE
Mobile business intelligence (mobile BI) refers to the ability to provide business
and data analytics services to mobile/handheld devices and/or remote users. MBI enables
users with limited computing capacity to use and receive the same or similar features,
capabilities and processes as those found in a desktop-based business intelligence
software solution.
One of the major problems customers face when using mobile devices for
information retrieval is that mobile BI is no longer as simple as the pure
display of BI content on a mobile device. A mobile strategy has to be
defined to cope with different suppliers and systems as well as private phones.
Besides attempts to standardize with the same supplier, companies are also
concerned that solutions should have robust security features. These points have
led many to the conclusion that a proper concept and strategy must be in place
before supplying corporate information to mobile devices.
The first major benefit is the ability for end users to access information in their
mobile BI system at any time and from any location. This enables them to get
data and analytics in ‘real time’, which improves their daily operations and
means they can react more quickly to a wider range of events.
MBI works much like a standard BI software/solution but it is designed specifically for
handheld users. Typically, MBI requires a client end utility to be installed on mobile
devices, which remotely/wirelessly connect over the Internet or a mobile network to the
primary business intelligence application server. Upon connection, MBI users can
perform queries, and request and receive data. Similarly, clientless MBI solutions can be
accessed through a cloud server that provides software-as-a-service business intelligence
(SaaS BI) or real-time business intelligence (RTBI).
WHAT IS CROWDSOURCING?
Crowdsourcing data collection consists of building data sets with the help of a large
group of people. There is a source, and there are data suppliers who are willing to enrich the
data with relevant, missing, or new information.
This method originates from the scientific world. One of the first ever cases of
crowdsourcing was the Oxford English Dictionary. The project aimed to list all the words
that enjoyed any recognized lifespan in standard English, with their definitions
and explanations of usage. That was a gigantic task, so the dictionary creators invited the
crowd to help them on a voluntary basis.
More than 1 million mappers work together to collect and supply data to OpenStreetMap,
making it full of valuable information about any specified location.
The Internet is now a melting pot of user-generated content from blogs to Wikipedia
entries to YouTube videos. The distinction between producer and consumer is no longer
so clear-cut, as everyone is equipped with the tools needed to create as
well as consume.
As a business strategy, soliciting customer input isn’t new, and open source software has
proven the productivity possible through a large group of individuals.
The history of crowdsourcing
While the idea behind crowdsourcing isn’t new, its active use online as a business-building
strategy has only been around since 2006. The phrase was coined by Jeff Howe, who
described a world in which people outside of a company contribute work toward that
company's projects. Video games have been utilizing crowdsourcing for many years through
their beta invitations. Granting players early access to the game, studios request only that
these passionate gamers report bugs and issues with gameplay as they encounter them,
before the finished product is released for sale and distribution.
Companies utilize crowdsourcing not only in a research and development capacity, but
also to simply get help from anyone for anything, whether it's word-of-mouth marketing,
creating content or giving feedback.
INFORMATION MANAGEMENT
Information management is an essential part of today’s web development; it ensures the
security of data that should be shared only among authorized users. It is not an easy task
to manage the whole system, delegate authority to users and maintain privacy.
Facebook is a good example of an information system. Facebook provides privacy controls
to its users, which is one of the capabilities a well-designed information system should offer.
It passes authority to the graph nodes (i.e., your friends) connected with you to access the
information that is set for such circumstances.
Nobody can see your private information on Facebook except those to whom you grant
authority to see it.
UNIT-III
What is Data?
The quantities, characters, or symbols on which operations are performed by a computer, which may
be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or
mechanical recording media.
Big Data is a term used to describe a collection of data that is huge in size and yet growing
exponentially with time. In short, such data is so large and complex that none of the
traditional data management tools can store or process it efficiently.
Examples of Big Data
The New York Stock Exchange generates about one terabyte of new trade data per day.
Social Media
Statistics show that 500+ terabytes of new data get ingested into the databases of the social
media site Facebook every day. This data is mainly generated in terms of photo and video
uploads, message exchanges, posting comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many
thousand flights per day, data generation reaches many petabytes.
Types of Big Data
Big Data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed and processed in the form of a fixed format is termed
'structured' data. Over time, talent in computer science has achieved great success in
developing techniques for working with such data (where the format is well known in
advance) and also in deriving value out of it. However, nowadays we are foreseeing issues as
the size of such data grows to a huge extent, with typical sizes in the range of multiple
zettabytes.
Do you know? 10²¹ bytes (one billion terabytes) equal one zettabyte.
Looking at these figures one can easily understand why the name Big Data is given and imagine the
challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of
a 'structured' data.
Unstructured
Any data with unknown form or structure is classified as unstructured data. In addition to its
huge size, unstructured data poses multiple challenges when it comes to processing it for
deriving value. A typical example of unstructured data is a heterogeneous data source
containing a combination of simple text files, images, videos etc. Nowadays organizations
have a wealth of data available with them but, unfortunately, they don't know how to derive
value out of it, since this data is in its raw form or unstructured format.
Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data may appear
structured in form, but it is actually not defined with, for example, a table definition as in a
relational DBMS. An example of semi-structured data is data represented in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
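Using Python's standard library, records like those above can be parsed by tag name even though no table schema exists, which is exactly what makes this data "semi-structured" rather than unstructured:

```python
import xml.etree.ElementTree as ET

# The tags carry enough structure to pull out fields by name,
# even though no relational table definition exists.
xml = """<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</people>"""

root = ET.fromstring(xml)
people = [(rec.findtext("name"), int(rec.findtext("age")))
          for rec in root.findall("rec")]
print(people)   # [('Prashant Rao', 35), ('Seema R.', 41)]
```

A wrapping `<people>` element is added here only so the fragment is well-formed XML for the parser.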
Please note that web application data, which is unstructured, consists of log files, transaction history
files etc. OLTP systems are built to work with structured data wherein data is stored in relations
(tables).
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data considered by most
applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs,
audio, etc. is also being considered in analysis applications. This variety of unstructured data
poses certain issues for storing, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like business processes,
application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is
massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations
to fine-tune their business strategies.
Traditional customer feedback systems are getting replaced by new systems designed with Big Data
technologies. In these new systems, Big Data and natural language processing technologies are being
used to read and evaluate consumer responses.
Big Data technologies can be used for creating a staging area or landing zone for new data before
identifying what data should be moved to the data warehouse. In addition, such integration of Big
Data technologies and data warehouse helps an organization to offload infrequently accessed data.
Summary
Big Data is a term used to describe a collection of data that is huge in size and yet grows
exponentially with time.
Examples of Big Data generation includes stock exchanges, social media sites, jet engines,
etc.
Big Data could be 1) Structured, 2) Unstructured, 3) Semi-structured
Volume, Variety, Velocity, and Variability are a few characteristics of Big Data
Improved customer service, better operational efficiency, and better decision making are a few
advantages of Big Data
Data processing is the conversion of data into a usable and desired form. This conversion or
“processing” is carried out using a predefined sequence of operations, either manually or
automatically. Most data processing is done using computers and is thus performed automatically;
this is referred to as automatic data processing. The output or “processed” data can be obtained in
different forms, such as an image, graph, table, vector file, audio, chart or any other desired format,
depending on the software or method of data processing used.
Data processing is undertaken in any activity which requires a collection of data. The data collected
needs to be stored, sorted, processed, analyzed and presented. This complete process can be divided
into 6 simple primary stages, which are:
Data collection
Storage of data
Sorting of data
Processing of data
Data analysis
Data presentation and conclusions
Once the data is collected, the need for data entry emerges so the data can be stored. Storage can be
in physical form, using paper, notebooks or any other physical medium. With the emergence of, and
growing emphasis on, computer systems, Big Data and data mining, data collection is large and a
number of operations need to be performed for meaningful analysis and presentation, so the data is
stored in digital form. Having the raw and processed data in digital form enables the user to perform
a large number of operations in a short time and allows conversion into different formats. The user
can thus select the output which best suits the requirement.
This continuous use and processing of data follows a cycle, called the data processing cycle or
information processing cycle, which might provide instant results or take time, depending upon the
processing needed. The complexity in the field of data processing is increasing, which is creating a
need for advanced techniques.
Storage of data is followed by sorting and filtering. This stage is profoundly affected by the format in
which data is stored, and further depends on the software used. General day-to-day, non-complex
data can be stored as text files, tables or a combination of both in Microsoft Excel or similar
software. As tasks become complex and require specific and specialized operations, they require
different data processing tools and software meant to cater to those peculiar needs.
Storing, sorting, filtering and processing of data can be done by a single piece of software or a
combination of software, whichever is feasible and required. Data processing carried out by software
follows a predefined set of operations. Most modern software allows users to perform different
actions based on the analysis or study to be carried out, and provides the output file in various
formats.
Plain text file – These constitute the simplest form of processed data. Most of these files are
user readable and easy to comprehend. Very little or no further processing is required for these
types of files. They are exported as Notepad or WordPad files.
Table/spreadsheet – This file format is most suitable for numeric data. Having values in rows
and columns allows the user to perform various operations, such as filtering and sorting in
ascending/descending order, to make the data easier to understand and use. Various
mathematical operations can be applied to this kind of output.
Charts & graphs – The option to get the output in the form of charts and graphs is handy and
now forms a standard feature in most software. This option is beneficial when dealing with
numerical values reflecting trends and growth/decline. Though ample charts and graphs are
available to match diverse requirements, there are situations that call for a user-defined option.
When no inbuilt chart or graph fits, the option to create one's own, i.e., custom charts/graphs,
comes in handy.
Maps/vector or image file – When dealing with spatial data, the option to export the processed
data into maps, vector and image files is of great use. Having the information on maps is of
particular use for urban planners who work with different types of maps. Image files are obtained
when dealing with graphics and do not constitute human-readable input.
Other formats/raw files – These are software-specific file formats which can be used and
processed by specialized software. These output files may not be a complete product and may
require further processing, so multiple rounds of data processing may be needed.
Manual data processing: In this method data is processed manually without the use of a
machine, tool or electronic device. Data is processed manually, and all the calculations and
logical operations are performed manually on the data.
Mechanical data processing – Data processing is done by use of mechanical devices or very
simple electronic devices like calculators and typewriters. When the processing needs are
simple, this method can be adopted.
Electronic data processing – This is the modern technique for processing data. Electronic data
processing is the fastest and best available method, with the highest reliability and accuracy.
It uses the latest technology, relies on computers, and is employed in most agencies. The use
of software forms part of this type of data processing. The data is processed through a
computer: data and a set of instructions are given to the computer as input, and the computer
automatically processes the data according to the given set of instructions. The computer is
also known as an electronic data processing machine.
There are various types of data processing, some of the most popular types are as follows:
Batch Processing
Real-time processing
Online Processing
Multiprocessing
Time-sharing
Nowadays more and more data is collected for academic and scientific research, private and personal
use, institutional use, and commercial use. This collected data needs to be stored, sorted, filtered,
analyzed and presented, and may even require data transfer, before it is of any use. This process can
be simple or
complex depending on the scale at which data collection is done and the complexity of the results
which are required to be obtained. The time consumed in obtaining the desired result depends on the
operations which need to be performed on the collected data and on the nature of the output file
required to be obtained. This problem becomes starker when dealing with the very large volume of
data such as those collected by multinational companies about their users, sales, manufacturing, etc.
Data processing services and companies dealing with personal information and other sensitive
information must be careful about data protection.
The need for data processing becomes more and more critical in such cases. In such cases, data
mining and data management come into play without which optimal results cannot be obtained. Each
stage starting from data collection to presentation has a direct effect on the output and usefulness of
the processed data. Sharing a dataset with a third party must be done carefully, and as per a written
data processing agreement and service agreement. This prevents data theft, misuse and loss of data.
Data in any form and of any type requires processing most of the time. This data can be categorised
as personal information, financial transactions, tax credits, banking details, computational data,
images, and almost anything else you can think of. The quantum of processing required will depend
on the specialized processing which the data requires, and subsequently on the output that you
require. With the increase in demand and the requirement for automatic data processing and
electronic data processing, a competitive market for data services has emerged.
1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources,
including data lakes and data warehouses. It is important that the available data sources are
trustworthy and well-built, so that the data collected (and later used as information) is of the
highest possible quality.
2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to
as “pre-processing”, is the stage at which raw data is cleaned up and organized for the following stage
of data processing. During preparation, raw data is diligently checked for any errors. The purpose of
this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create high-
quality data for the best business intelligence.
3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data
warehouse like Redshift), and translated into a language that it can understand. Data input is the
first stage in which raw data begins to take the form of usable information.
4. Processing
During this stage, the data inputted to the computer in the previous stage is actually processed for
interpretation. Processing is done using machine learning algorithms, though the process itself
may vary slightly depending on the source of data being processed (data lakes, social networks,
connected devices etc.) and its intended use (examining advertising patterns, medical diagnosis
from connected devices, determining customer needs, etc.).
5. Data output/interpretation
The output/interpretation stage is the stage at which data finally becomes usable to non-data
scientists. It is translated, readable, and often in the form of graphs, videos, images, plain text, etc.
Members of the company or institution can now begin to self-serve the data for their own data
analytics projects.
6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for
future use. While some information may be put to use immediately, much of it will serve a purpose
later on. Plus, properly stored data is a necessity for compliance with data protection legislation
like GDPR. When data is properly stored, it can be quickly and easily accessed by members of the
organization when needed.
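The six stages above can be strung together in a minimal sketch; the stage functions and the sample records below are invented purely for illustration, not a real system:

```python
def collect():
    # stage 1: pull raw records from some source
    return ["  alice,42 ", "bob,x", "carol,35"]

def prepare(raw):
    # stage 2: clean up and drop bad (non-numeric) records
    rows = [r.strip().split(",") for r in raw]
    return [(name, age) for name, age in rows if age.isdigit()]

def process(rows):
    # stages 3-4: enter the data and compute on it
    return {name: int(age) for name, age in rows}

def output(data):
    # stage 5: present a readable result
    return "average age: %.1f" % (sum(data.values()) / len(data))

store = {}                        # stage 6: keep results for later use
data = process(prepare(collect()))
store["ages"] = data
print(output(data))               # average age: 38.5
```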
UNIT-4
Creating the components of Hadoop Map Reduce jobs
Hadoop components and Daemons
Every framework needs two important components:
Storage: The place where code, data, executables etc are stored.
Compute: The logic by which code is executed and data is acted upon.
The two main components of the Hadoop framework are along the same lines:
1. HDFS: This is the storage in which all the data is stored. It is a file system that is required by
Hadoop to run various map reduce jobs.
2. MapReduce: This is the compute logic based on which Hadoop runs. MapReduce is the
fundamental algorithm behind the success of Hadoop. This provides very fast processing.
There are 2 layers in Hadoop – the HDFS layer and the MapReduce layer – and 5 daemons which run
on Hadoop across these 2 layers. Daemons are the processes that run in the background:
1) Namenode – the master daemon of the HDFS layer; it maintains the file system namespace and the
mapping of file blocks to Datanodes.
2) Datanode – the slave daemon of the HDFS layer; it stores the actual data blocks on each slave node.
3) JobTracker – the master daemon of the MapReduce layer; it schedules and monitors jobs (described
in detail below).
4) TaskTracker – the slave daemon of the MapReduce layer; it executes the individual tasks (described
in detail below).
5) Secondary namenode – a helper for the Namenode, often described as its back-up; it runs on a
different system (other than the master and slave nodes, but can be configured on a slave node also).
Job Tracker
It is a service within the Hadoop system.
It acts like a scheduler.
The client application is sent to the JobTracker.
It talks to the Namenode and locates the TaskTracker nearest to the data (remember, the data has
already been populated).
TaskTracker
It accepts tasks (Map, Reduce, Shuffle, etc.) from the JobTracker.
Each TaskTracker has a number of slots for the tasks; these are execution slots available on the
machine or on machines in the same rack.
It spawns a separate JVM for the execution of each task.
It indicates the number of available slots through the heartbeat message to the JobTracker.
JobTracker
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the
cluster, ideally the nodes that have the data, or at least are in the same rack.
The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all
running jobs are halted.
Job Tracker –
1. JobTracker process runs on a separate node and not usually on a DataNode.
2. JobTracker is an essential Daemon for MapReduce execution in MRv1. It is replaced by
ResourceManager/ApplicationMaster in MRv2.
3. JobTracker receives the requests for MapReduce execution from the client.
4. JobTracker talks to the NameNode to determine the location of the data.
5. JobTracker finds the best TaskTracker nodes to execute tasks based on the data locality
(proximity of the data) and the available slots to execute a task on a given node.
6. JobTracker monitors the individual TaskTrackers and submits the overall status of the job back
to the client.
7. JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.
8. When the JobTracker is down, HDFS will still be functional, but MapReduce execution cannot
be started and existing MapReduce jobs will be halted.
TaskTracker –
1. TaskTracker runs on DataNode. Mostly on all DataNodes.
2. TaskTracker is replaced by Node Manager in MRv2.
3. Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
4. TaskTrackers will be assigned Mapper and Reducer tasks to execute by JobTracker.
5. The TaskTracker is in constant communication with the JobTracker, signalling the progress of the
task in execution.
6. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive,
JobTracker will assign the task executed by the TaskTracker to another node.
The purpose behind GFS was the ability to store and access large files, and by large I
mean files that can't be stored on a single hard drive. The idea is to divide these files into
manageable chunks of 64 MB and store these chunks on multiple nodes, with a
mapping between these chunks also stored inside the file system.
GFS assumes that it runs on many inexpensive commodity components that can often fail,
therefore it should consistently perform failure monitoring and recovery. It can store many
large files simultaneously and allows for two kinds of reads to them: small random reads
and large streaming reads. Instead of rewriting files, GFS is optimized towards appending
data to existing files in the system.
The GFS master node stores the index of files, while GFS chunk servers store the actual
chunks in the filesystems on multiple Linux nodes. The chunks that are stored in the GFS
are replicated, so the system can tolerate chunk server failures. Data corruption is also
detected using checksums, and GFS tries to compensate for these events as soon as
possible.
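The 64 MB chunking can be sketched in a few lines of Python; the file size and the helper function are illustrative only:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def chunk_ranges(file_size, chunk_size=CHUNK_SIZE):
    # Split a file of file_size bytes into (offset, length) chunks;
    # this mapping is what the file system itself keeps track of.
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]

ranges = chunk_ranges(200 * 1024 * 1024)  # a 200 MB file
print(len(ranges))  # 4 chunks: three of 64 MB plus one of 8 MB
```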
2006: Hadoop, which provides a software framework for distributed storage and
processing of big data using the MapReduce programming model, was created.
All the modules in Hadoop are designed with a fundamental assumption that
hardware failures are common occurrences and should be automatically handled
by the framework.
2008: Hadoop wins the TeraSort contest. TeraSort is a popular benchmark that
measures the amount of time taken to sort one terabyte of randomly distributed data on
a given computer system.
2010: Hive, a data warehouse software project built on top of Apache Hadoop for
providing data query and analysis, was created. It gives a SQL-like interface to
query data stored in various databases and file systems that integrate with
Hadoop.
In HDFS, files are divided into blocks, and file access follows multi-reader, single-writer
semantics. To meet the fault-tolerance requirement, multiple replicas of a block are stored
on different DataNodes. The number of replicas is called the replication factor. When a
new file block is created, or an existing file is opened for append, the HDFS write
operation creates a pipeline of DataNodes to receive and store the replicas.
(The replication factor generally determines the number of DataNodes in the pipeline.)
Subsequent writes to that block go through the pipeline. For reading operations the client
chooses one of the DataNodes holding copies of the block and requests a data transfer
from it.
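A toy sketch of the write-pipeline idea follows; the node names and the naive head-of-list placement are invented for illustration, and real HDFS placement is rack-aware:

```python
def build_pipeline(datanodes, replication_factor=3):
    # one DataNode per replica forms the write pipeline
    return datanodes[:replication_factor]

def write_block(block, datanodes, replication_factor=3):
    # each node stores the block and forwards it down the pipeline;
    # here we simply record which nodes hold a copy
    pipeline = build_pipeline(datanodes, replication_factor)
    return {node: block for node in pipeline}

stored = write_block(b"block-0 bytes", ["dn1", "dn2", "dn3", "dn4"])
print(sorted(stored))  # ['dn1', 'dn2', 'dn3']
```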
MapReduce
MapReduce is a programming model which consists of writing map and reduce functions.
Map accepts key/value pairs and produces a sequence of key/value pairs. The data is then
shuffled to group keys together, and the values accepted for the same key are reduced to
produce a new key/value pair.
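The model just described can be simulated for a word count in plain Python; this is only an illustration of the map/shuffle/reduce phases, not Hadoop's actual API:

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Map: emit a (word, 1) pair for every word in the line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce: combine the values accepted for one key
    return key, sum(values)

lines = ["big data is big", "data is everywhere"]
mapped = chain.from_iterable(map_fn(line) for line in lines)
result = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```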
During the execution, the Map tasks are assigned to machines based on input data. Then
those Map tasks produce their output. Next, the mapper output is shuffled and sorted.
Then, the Reduce tasks are scheduled and run. The Reduce output is finally stored to disk.
MapReduce in Python
Let’s walk through some code. The following program is from Michael Noll’s tutorial on
writing a Hadoop MapReduce program in Python.
The code below is the Map function. It will read data from STDIN, split it into words and
output a list of lines mapping words to their (intermediate) counts to STDOUT. The Map
script will not compute an (intermediate) sum of a word's occurrences, though. Instead, it
will output a '<word> 1' tuple immediately, even though a specific word might occur
multiple times in the input. In our case, we let the subsequent Reduce step do the final sum
count.
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
The code below is the Reduce function. It will read the results from the map step from
STDIN and sum the occurrences of each word to a final count, and then output its results
to STDOUT.
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this works because Hadoop sorts the map output by key
    # before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
MapReduce Example
Here is a real-world use case of MapReduce:
Facebook has a list of friends (note that friends are a bi-directional thing on Facebook. If
I’m your friend, you’re mine). They also have lots of disk space and they serve hundreds
of millions of requests every day. They’ve decided to pre-compute calculations when they
can to reduce the processing time of requests. One common processing request is the
“You and Joe have 230 friends in common” feature. When you visit someone’s profile,
you see a list of friends that you have in common. This list doesn’t change frequently so
it’d be wasteful to recalculate it every time you visited the profile (sure you could use a
decent caching strategy, but then I wouldn’t be able to continue writing about MapReduce
for this problem). We're going to use MapReduce so that we can calculate everyone's
common friends once a day and store those results. Later on, it's just a quick lookup.
We've got lots of disk; it's cheap.
Assume the friends are stored as Person->[List of Friends], our friends list is then:
A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D
Each line will be an argument to a mapper. For every friend in the list of friends, the
mapper will output a key-value pair. The key will be a friend along with the person. The
value will be the list of friends. The key will be sorted so that the friends are in order,
causing all pairs of friends to go to the same reducer. This is hard to explain with text, so
let’s just do it and see if you can see the pattern. After all the mappers are done running,
you’ll have a list like this:
For map(A -> B C D):
(A B) -> B C D
(A C) -> B C D
(A D) -> B C D
For map(B -> A C D E): (Note that A comes before B in the key)
(A B) -> A C D E
(B C) -> A C D E
(B D) -> A C D E
(B E) -> A C D E
For map(C -> A B D E):
(A C) -> A B D E
(B C) -> A B D E
(C D) -> A B D E
(C E) -> A B D E
For map(D -> A B C E):
(A D) -> A B C E
(B D) -> A B C E
(C D) -> A B C E
(D E) -> A B C E
And finally for map(E -> B C D):
(B E) -> B C D
(C E) -> B C D
(D E) -> B C D
Before we send these key-value pairs to the reducers, we group them by their keys and get:
(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)
Each line will be passed as an argument to a reducer. The reduce function will simply
intersect the lists of values and output the same key with the result of the intersection. For
example, reduce((A B) -> (A C D E) (B C D)) will output (A B) : (C D) and means that
friends A and B have C and D as common friends.
Now when D visits B’s profile, we can quickly look up (B D) and see that they have three
friends in common, (A C E).
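The whole common-friends computation above can be reproduced in a few lines of plain Python; again an illustrative simulation of the map/shuffle/reduce steps, not production code:

```python
from collections import defaultdict
from itertools import chain

friends = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
}

def map_fn(person, friend_list):
    # emit ((friend1, friend2), friend_list) with the pair sorted so
    # both members of the pair end up at the same reducer
    return [(tuple(sorted((person, f))), friend_list) for f in friend_list]

def reduce_fn(pair, lists):
    # intersect the friend lists to get the common friends
    return pair, sorted(set(lists[0]).intersection(*lists[1:]))

groups = defaultdict(list)
for key, value in chain.from_iterable(map_fn(p, fl) for p, fl in friends.items()):
    groups[key].append(value)

common = dict(reduce_fn(k, v) for k, v in groups.items())
print(common[("B", "D")])  # ['A', 'C', 'E'] -- three friends in common
```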
MapReduce in MongoDB
We can also use Map-Reduce in MongoDB via the mapReduce database command.
Consider the following map-reduce operation:
In this map-reduce operation, MongoDB applies the map phase to each input document
(i.e. the documents in the collection that match the query condition). The map function
emits key-value pairs. For those keys that have multiple values, MongoDB applies
the reduce phase, which collects and condenses the aggregated data. MongoDB then
stores the results in a collection. Optionally, the output of the reduce function may pass
through a finalize function to further condense or process the results of the aggregation.
All map-reduce functions in MongoDB are JavaScript and run within the mongod process.
Map-reduce operations take the documents of a single collection as the input and can
perform any arbitrary sorting and limiting before beginning the map
stage. mapReduce can return the results of a map-reduce operation as a document or may
write the results to collections. The input and the output collections may be sharded.
Apache Spark
So as we discussed above, MapReduce is an iterative process. Sometimes an algorithm
cannot be executed in a single MapReduce job. To resolve that, MapReduce jobs can be
chained together to provide a solution. This often happens with algorithms which iterate
until convergence (such as k-means, PageRank, etc.). But the big disadvantage is that the
Reduce output must be read from disk again with each new job; and in some cases, the
input data is read from disk many times. Thus, there is no easy way to share the work done
between iterations.
In Apache Spark, the computation model is much richer than just MapReduce.
Transformations on input data can be written lazily and batched together. Intermediate
results can be cached and reused in future calculations. There is a series of
lazy transformations which are followed by actions that force evaluation of all
transformations. Notably, each step in the Spark model produces a resilient distributed
dataset (RDD). Intermediate results can be cached in memory or on disk, optionally serialized.
For each RDD, we keep a lineage, which is the operations which created it. Spark can
then recompute any data which is lost without storing to disk. We can still decide to keep
a copy in memory or on disk for performance.
Let’s delve into the Spark model through an example. Say we have the code below that
reads in a .csv data file and does some data wrangling.
The code highlighted below performs transformations of the data. Spark transformations
include these function calls: distinct(), filter(fn), intersection(other), join(other), map(fn),
union(other).
The code highlighted below performs actions on the data. Spark actions include these
function calls: collect(), count(), first(), take(n), reduce(fn), foreach(fn), saveAsTextFile().
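Since the highlighted code itself is not reproduced in these notes, a toy Python class can stand in for the lazy-transformation/action split; MiniRDD is invented for illustration and is not Spark's API:

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations are only
    recorded; an action forces evaluation of the whole lineage."""

    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []

    # transformations: return a new MiniRDD, computing nothing yet
    def map(self, fn):
        return MiniRDD(self.data, self.ops + [("map", fn)])

    def filter(self, fn):
        return MiniRDD(self.data, self.ops + [("filter", fn)])

    # action: runs the recorded lineage over the data
    def collect(self):
        items = iter(self.data)
        for kind, fn in self.ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# nothing has executed yet; collect() is the action that triggers it
result = rdd.collect()
print(result)  # [0, 4, 16, 36, 64]
```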
We can represent this Spark model via a Task Directed Acyclic Graph — as illustrated
below:
Let’s say we want to implement k-means clustering in Spark. The process follows like
this:
Unlike MapReduce, we can keep the input in memory and load them once. The
broadcasting in step 3 means that we can quickly send the centers to all machines. All
cluster assignments do not need to be written to disk every time.
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster
managers (either Spark’s own standalone cluster manager, Mesos or YARN), which
allocate resources across applications. Once connected, Spark acquires executors on nodes
in the cluster, which are processes that run computations and store data for your
application. Next, it sends your application code (defined by JAR or Python files passed to
SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
Seen above is the Spark architecture. There are several useful things to note about this
architecture:
1. Each application gets its own executor processes, which stay up for the duration
of the whole application and run tasks in multiple threads. This has the benefit of
isolating applications from each other, on both the scheduling side (each driver
schedules its own tasks) and executor side (tasks from different applications run
in different JVMs). However, it also means that data cannot be shared across
different Spark applications (instances of SparkContext) without writing it to an
external storage system.
2. Spark is agnostic to the underlying cluster manager; as long as it can acquire executor
processes that communicate with each other, it can run on any of the supported cluster
managers.
3. The driver program must listen for and accept incoming connections from its
executors throughout its lifetime. As such, the driver program must be network
addressable from the worker nodes.
4. Because the driver schedules tasks on the cluster, it should be run close to the
worker nodes, preferably on the same local area network. If you’d like to send
requests to the cluster remotely, it’s better to open an RPC to the driver and have
it submit operations from nearby than to run a driver far away from the worker
nodes.
The cluster managers supported by Spark include:
Standalone — a simple cluster manager included with Spark that makes it easy to
set up a cluster.
Apache Mesos — a general cluster manager that can also run Hadoop MapReduce
and service applications.
Hadoop YARN — the resource manager in Hadoop 2.
Overall, Apache Spark is much more flexible since we can also run distributed SQL
queries. It also contains many libraries for machine learning, stream processing, etc.
Furthermore, Spark can connect to a number of different data sources.
Apache Flink is another system designed for distributed analytics like Apache Spark. It
executes everything as a stream. Iterative computations can be written natively with cycles
in the data flow. It has a very similar architecture to Spark, including (1) a client that
optimizes and constructs data flow graph, (2) a job manager that receives jobs, and (3) a
task manager that executes jobs.
Apache Pig is another platform for analyzing large datasets. It provides the Pig Latin
language, which is easy to write but runs as MapReduce. Pig Latin code is shorter and
faster to develop than the equivalent Java code.
Apache Pig is installed locally and can send jobs to any Hadoop cluster. It is slower than
Spark but doesn't require any extra software on the Hadoop cluster. We can write user-defined
functions for more complex operations (intersection, union, etc.).
Apache HBase is quite similar to Apache Cassandra — the wide column store discussed
above. It is essentially a large sorted map that we can update. It uses Apache
ZooKeeper to ensure consistent updates to the data.
Next, I want to mention Apache Calcite, a framework that can parse and optimize SQL
queries and process data. It powers query optimization in Flink, Hive, Druid, and others,
and it provides many of the pieces needed to implement a database engine.
More importantly, Calcite can connect to a variety of database systems including Spark,
Druid, Elastic Search, Cassandra, MongoDB, and Java JDBC.
In the Calcite architecture, the client, server, and parser are optional pieces, and the
optimizer does the heavy work of processing metadata. The Calcite optimizer uses
more than 100 rewrite rules to optimize queries. Queries are expressed in relational
algebra but can operate over non-relational data sources. Calcite aims to find the
lowest-cost way to execute a query.
UNIT-5
Installing and Running Pig
Prerequisites
It is essential that you have Hadoop and Java installed on your system before you go for Apache
Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps
given in the following link −
http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
Step 1
Open the homepage of the Apache Pig website. Under the section News, click on the link release
page.
Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this
page, under the Download section, you will have two links, namely, Pig 0.8 and later and Pig
0.7 and before. Click on the link Pig 0.8 and later, then you will be redirected to the page
having a set of mirrors.
Step 3
Choose and click any one of these mirrors.
Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of
Apache Pig. Click the latest version among them.
Step 5
Within these folders, you will have the source and binary files of Apache Pig in various
distributions. Download the tar files of the source and binary files of Apache Pig
0.15, pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
Installing Apache Pig
Step 1
Create a directory with the name Pig in the same directory where the installation directories
of Hadoop, Java, and other software were installed. (In this tutorial, we have created the Pig
directory in the home directory of the user named Hadoop.)
$ mkdir Pig
Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the content of the extracted pig-0.15.0-src directory to the Pig directory created earlier
as shown below.
$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
.bashrc file
In the .bashrc file, set the following variables (the paths shown assume the directories used above) −
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can
set various parameters as given below.
pig -h properties
The following properties are supported −
Logging options:
verbose = true|false; default is false. Same as the -v switch.
brief = true|false; default is false. Same as the -b switch.
debug = OFF|ERROR|WARN|INFO|DEBUG; default is INFO. Same as the -d switch.
aggregate.warning = true|false; default is true. If true, prints a count of warnings of each type
rather than logging each warning.
You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system.
There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig
Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.
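Conceptually, each Pig Latin statement in this mode compiles down to map and reduce phases over the records in HDFS. The following Python sketch is a simulation only (not Pig or Hadoop code; the sample records and grouping key are invented for illustration) showing the two phases behind a typical GROUP-then-count operation:

```python
from collections import defaultdict

# Toy input records, standing in for the lines of a file in HDFS.
records = ["delhi", "bhopal", "delhi", "mumbai", "bhopal", "delhi"]

# Map phase: emit a (key, 1) pair for every record.
mapped = [(city, 1) for city in records]

# Shuffle: group pairs by key, as Hadoop does between map and reduce.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate the values of each group.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'delhi': 3, 'bhopal': 2, 'mumbai': 1}
```

The MapReduce job that Pig invokes behind the scenes performs these same phases, but distributed across the cluster's nodes.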
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode
using the Grunt shell. In this shell, you can enter the Pig Latin statements and get
the output (using Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig
Latin script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our own
functions (User Defined Functions) in programming languages such as Java, and
using them in our script.
Invoking the Grunt Shell
You can invoke the Grunt shell in a desired mode (local/MapReduce) using the −x option as shown below.
Command (Local mode) −
$ ./pig -x local
Command (MapReduce mode) −
$ ./pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin statements in it.
You can write an entire Pig Latin script in a file and execute it using the -x option. Let us
suppose we have a Pig script in a file named Sample_script.pig as shown below.
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);
Dump student;
Now, you can execute the script in the above file as shown below.
Local mode:
$ pig -x local Sample_script.pig
MapReduce mode:
$ pig -x mapreduce Sample_script.pig
Installing Java
Step I:
Download the JDK archive (jdk-7u71-linux-x64.gz) from the Oracle Java downloads page.
Step II:
Generally you will find the downloaded java file in the Downloads folder. Verify it and extract
the jdk-7u71-linux-x64.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz
Step III:
To make Java available to all users, you have to move it to the location "/usr/local/". Switch
to the root user and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step IV:
For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc
file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step V:
Verify the installation using the command java -version from the terminal.
Downloading Hadoop
Download and extract Hadoop 2.4.1 from Apache Software Foundation using the following
commands.
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/
hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1/* hadoop/
# exit
You can set Hadoop environment variables by appending the following commands
to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those configuration
files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs using java, you have to reset the java environment
variables in hadoop-env.sh file by replacing JAVA_HOME value with the location of java in
your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
Given below are the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop instance,
memory allocated for the file system, memory limit for storing the data, and the size of
Read/Write buffers.
Open the core-site.xml and add the following properties in between the <configuration> and
</configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, the namenode
path, and the datanode path of your local file systems, i.e., the place where you want to
store the Hadoop infrastructure.
Let us assume the following data.
dfs.replication (data replication value) = 1
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
Note: In the above file, all the property values are user-defined and you can make changes
according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the
following properties in between the <configuration> and </configuration> tags in this file.
<configuration>
CS- 503 (A) Data Analytics Notes By -Dr. Kapil Chaturvedi, Associate Professor, DoCSE, SIRT, BHOPAL, MP
86
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop
contains a template of mapred-site.xml. First of all, you need to copy the file mapred-
site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
The following command is used to start dfs. Executing this command will start your Hadoop
file system.
$ start-dfs.sh
The expected output is as follows:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-
namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-
localhost.out
Starting secondary namenodes [0.0.0.0]
The following command is used to start the yarn script. Executing this command will start your
yarn daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-
resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-
nodemanager-localhost.out
The default port number to access Hadoop is 50070. Use the following url to get Hadoop
services on your browser.
http://localhost:50070/
The default port number to access all applications of cluster is 8088. Use the following url to
visit this service.
http://localhost:8088/
We use hive-0.14.0 in this tutorial. You can download it by visiting the following
link http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded onto the
Downloads directory. Here, we download the Hive archive named “apache-hive-0.14.0-bin.tar.gz”
for this tutorial. The following command is used to verify the download:
$ cd Downloads
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin.tar.gz
The following command is used to verify the download and extract the hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
We need to copy the files as the superuser ("su -"). The following commands are used to copy
the files from the extracted directory to the /usr/local/hive directory.
$ su -
passwd:
# cd /home/user/Downloads
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
You can set up the Hive environment by appending the following lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
The following command is used to execute ~/.bashrc file.
$ source ~/.bashrc
The following command is used to download Apache Derby. It takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
The following command is used to verify the download:
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin.tar.gz
The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
We need to copy the files as the superuser ("su -"). The following commands are used to copy
the files from the extracted directory to the /usr/local/derby directory:
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
You can set up the Derby environment by appending the following lines to ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
The following command is used to execute ~/.bashrc file:
$ source ~/.bashrc
Configuring the Metastore of Hive: create a file named jpox.properties and add the following lines to it.
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
The Hive query language provides the basic SQL-like operations. Here are a few of the tasks
which HQL can do easily (the table and column names are illustrative):
SELECT * FROM sales;
SELECT category, count(*) FROM products GROUP BY category;
When you look at the above queries, you can see that they are very similar to SQL queries.
UDF
A UDF processes one or several columns of one row and outputs one value. For example:
SELECT lower(str) FROM table;
For each row in "table", the "lower" UDF takes one argument, the value of "str", and
outputs one value, the lowercase representation of "str". Similarly:
SELECT datediff(date_end, date_begin) FROM table;
For each row in "table", the "datediff" UDF takes two arguments, the values of
"date_begin" and "date_end", and outputs one value, the difference in time between these
two dates.
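The row-at-a-time behaviour of a UDF can be imitated in plain Python (a sketch only, not Hive code; the sample column values are invented, and `lower` here is Python's built-in string method standing in for Hive's function of the same name):

```python
# Each input row carries one value of the "str" column (sample data).
table = ["Delhi", "BHOPAL", "Mumbai"]

# A UDF maps exactly one input row to one output value: here, lowercasing.
result = [s.lower() for s in table]
print(result)  # ['delhi', 'bhopal', 'mumbai']
```

Note the one-in, one-out shape: the output has the same number of rows as the input, which is what distinguishes a UDF from a UDAF.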
UDAF
A UDAF processes one or several columns of several input rows and outputs one value.
It is commonly used together with the GROUP BY operator. For example (the table and
column names are illustrative):
SELECT customer, sum(price) FROM sales GROUP BY customer;
The Hive query executor will group rows by customer and, for each group, call the
UDAF with all price values. The UDAF then outputs one value for the output record (one
output record per customer).
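The many-rows-to-one-value behaviour of a UDAF such as sum can be sketched in plain Python (a simulation only; the customer/price rows are invented sample data):

```python
from collections import defaultdict

# Sample (customer, price) rows standing in for the "sales" table.
rows = [("alice", 10.0), ("bob", 5.0), ("alice", 7.5), ("bob", 2.5)]

# GROUP BY customer: collect all price values of each customer.
groups = defaultdict(list)
for customer, price in rows:
    groups[customer].append(price)

# The UDAF (here, sum) reduces each group's prices to one output value.
totals = {customer: sum(prices) for customer, prices in groups.items()}
print(totals)  # {'alice': 17.5, 'bob': 7.5}
```

Unlike a UDF, the output has one record per group rather than one per input row.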
For each record of each group, the UDAF will receive the values of the selected columns,
and output one value for the group's output record.
Oracle Big Data SQL extends Oracle SQL to Hadoop and NoSQL, and the security of Oracle
Database to all your data. It also includes a unique Smart Scan service that minimizes data
movement and maximizes performance by parsing and intelligently filtering data where it resides.