
Columns: SL. NO., Module, Questions, Options (A, B, C, D), Correct answer, Difficulty level, Mapped to CO
1 (Module 1). A manager at Gampco Inc. wishes to know the company's revenue and profit in its previous quarter. Which of the following business analytics will help the manager?
Options:
   Prescriptive analytics
   Normative analytics
   Descriptive analytics
   Predictive analytics
Correct answer: Descriptive analytics
Difficulty level: Easy; Mapped to CO: 1

2 (Module 1). Predictive analytics:
Options:
   identifies the best alternatives to minimize or maximize an objective.
   summarizes data into meaningful charts and reports that can be standardized or customized.
   uses data to determine a course of action to be executed in a given situation.
   detects patterns in historical data and extrapolates them forward in time.
Correct answer: detects patterns in historical data and extrapolates them forward in time.
Difficulty level: Easy; Mapped to CO: 1
3 (Module 1). Which of the following questions will prescriptive analytics help a company address?
Options:
   How many and what types of complaints did they resolve?
   What is the best way of shipping goods from their factories to minimize costs?
   What do they expect to pay for fuel over the next several months?
   What will happen if demand falls by 10% or if supplier prices go up 5%?
Correct answer: What is the best way of shipping goods from their factories to minimize costs?
Difficulty level: Difficult; Mapped to CO: 1

4 (Module 1). Which of the following best defines objective functions in an optimization problem?
Options:
   They are limitations, requirements, or other restrictions that are imposed on any solution in an optimization model.
   They are quantities that an optimization model seeks to maximize or minimize.
   They are quantities for which no feasible solutions exist in an optimization model.
   They are unknown values that an optimization model seeks to determine.
Correct answer: They are quantities that an optimization model seeks to maximize or minimize.
Difficulty level: Moderate; Mapped to CO: 1
5 (Module 1). Roger wants to compare values across categories using vertical rectangles. Which of the following charts must Roger use?
Options:
   Clustered column chart
   Line chart
   Pie chart
   Stacked column chart
Correct answer: Clustered column chart
Difficulty level: Moderate; Mapped to CO: 2

6 (Module 1). Which of the following charts provides a useful means for displaying data over time?
Options:
   A doughnut chart
   Scatter chart
   Pie chart
   Line chart
Correct answer: Line chart
Difficulty level: Easy; Mapped to CO: 2

7 (Module 1). Philip wishes to understand the relative proportion of each data source to the total. Which of the following charts must Philip use?
Options:
   Scatter chart
   Pie chart
   Bar chart
   Column chart
Correct answer: Pie chart
Difficulty level: Easy; Mapped to CO: 2
8 (Module 1). Observations consisting of pairs of variable data are required to construct a ________ chart.
Options:
   doughnut
   scatter
   radar
   line
Correct answer: scatter
Difficulty level: Moderate; Mapped to CO: 2

9 (Module 1). Which of the following charts allows plotting of multiple dimensions of several data series?
Options:
   Doughnut chart
   Bubble chart
   Area chart
   Radar chart
Correct answer: Radar chart
Difficulty level: Difficult; Mapped to CO: 2
10 (Module 1). Which of the following is true about quartiles?
Options:
   The 25th percentile is called the fourth quartile.
   One-fourth of the data fall below the fourth quartile.
   Three-fourths of the data are below the third quartile.
   The 50th quartile is the third percentile.
Correct answer: Three-fourths of the data are below the third quartile.
Difficulty level: Easy; Mapped to CO: 2

11 (Module 2). Which of the following measures of location is calculated using the formula $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where n is the number of observations?
Options:
   midrange
   sample mean
   mode
   median
Correct answer: sample mean
Difficulty level: Easy; Mapped to CO: 1,3
12 (Module 2). Which of the following is a difference between a mean and a median?
Options:
   A median is not affected by outliers; a mean is affected by outliers.
   A median is not meaningful for ratio data; a mean is meaningful to ratio data.
   A mean divides the data half above it and half below it; a median does not.
   A mean is an observation that occurs most frequently; a median is the average of all observations.
Correct answer: A median is not affected by outliers; a mean is affected by outliers.
Difficulty level: Moderate; Mapped to CO: 1,3
13 (Module 2). The ________ is the observation that occurs most frequently.
Options:
   mode
   mean
   outlier
   median
Correct answer: mode
Difficulty level: Easy; Mapped to CO: 1,3

14 (Module 2). Which of the following is an example of a measure of dispersion?
Options:
   median
   mode
   variance
   midrange
Correct answer: variance
Difficulty level: Easy; Mapped to CO: 1,3

15 (Module 2). The ________ measures the degree of asymmetry of observations around the mean.
Options:
   coefficient of variation
   return to risk factor
   coefficient of kurtosis
   coefficient of skewness
Correct answer: coefficient of skewness
Difficulty level: Easy; Mapped to CO: 1,3

16 (Module 2). In statistics, ________ refers to the peakedness or flatness of a histogram.
Options:
   Sharpe ratio
   entropy rate
   kurtosis
   Markov chain
Correct answer: kurtosis
Difficulty level: Easy; Mapped to CO: 1,3

17 (Module 2). Which of the following propositions describes an existing theory or belief?
Options:
   standard deviation
   null hypothesis
   alternative hypothesis
   proportion
Correct answer: null hypothesis
Difficulty level: Easy; Mapped to CO: 1,3

18 (Module 2). Which of the following is a Type I error?
Options:
   the null hypothesis is actually true, and the hypothesis test correctly fails to reject it
   the null hypothesis is actually true, but the hypothesis test incorrectly rejects it
   the null hypothesis is actually false, but the test incorrectly fails to reject it
   the null hypothesis is actually false, and the test correctly rejects it
Correct answer: the null hypothesis is actually true, but the hypothesis test incorrectly rejects it
Difficulty level: Difficult; Mapped to CO: 1,3

19 (Module 2). In order to reject the null hypothesis, the F-test statistic must be greater than the ________.
Options:
   p-value
   variance
   df
   F crit
Correct answer: F crit
Difficulty level: Moderate; Mapped to CO: 1,3

20 (Module 2). Which of the following tests is used to determine if two categorical variables are independent?
Options:
   t-test
   z-test
   ANOVA
   Chi-square test
Correct answer: Chi-square test
Difficulty level: Moderate; Mapped to CO: 1,3

INTERNAL EXAMINATION: SEP-OCT 2019

MBA Batch 2019 – 21: Semester - III


Subject Code: 18JBS315 Subject Name: Marketing Analytics

Duration: 2 Hrs Marks: 50

SECTION –A

1. Answer any TWO of the following Three Questions: 2X8=16


a. An advertising campaign was carried out by a consumer product company in various media to increase brand
awareness for their range of detergents. A paired sample t-test was conducted by measuring the pre- and post-
campaign awareness in percentage. The table below shows the results of the t-test. Construct suitable hypotheses
and, by interpreting the results in the table, determine whether the advertising campaign was effective.

t-Test: Paired Two Sample for Means

                               post_advt      pre_advt
Mean                           59.98          52.23
Variance                       655.3865327    105.1227136
Observations                   200            200
Pearson Correlation            0.42304019
Hypothesized Mean Difference   0
df                             199
t Stat                         4.723372724
P(T<=t) one-tail               2.18728E-06
t Critical one-tail            1.652546746
P(T<=t) two-tail               4.37456E-06
t Critical two-tail            1.971956544

H0: There is no significant difference in awareness before and after the advertising campaign.
Ha: There is a significant difference in awareness before and after the advertising campaign.

Since the two-tail p value (0.00000437) is less than 0.05, the null hypothesis is rejected. Mean awareness rose
from 52.23% before the campaign to 59.98% after it, so we can conclude that advertising had a significant impact
in increasing brand awareness.
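
As a cross-check, the t statistic in the table can be rebuilt from the summary values alone. The sketch below is illustrative only: it assumes the usual identity for the variance of paired differences, and all variable names are made up for this example.

```python
import math

# Summary values copied from the t-test output above.
mean_post, mean_pre = 59.98, 52.23
var_post, var_pre = 655.3865327, 105.1227136
r = 0.42304019   # Pearson correlation between the paired scores
n = 200

# Variance of paired differences: var(X - Y) = var(X) + var(Y) - 2*r*sd(X)*sd(Y)
var_diff = var_post + var_pre - 2 * r * math.sqrt(var_post * var_pre)
std_error = math.sqrt(var_diff / n)
t_stat = (mean_post - mean_pre) / std_error

t_critical_two_tail = 1.971956544  # from the table, df = 199
reject_h0 = abs(t_stat) > t_critical_two_tail
print(round(t_stat, 4), reject_h0)
```

The computed statistic should agree with the "t Stat" row, which confirms that the decision to reject H0 follows from the reported means, variances and correlation.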

b.Explain the components of summary statistics.


Mean
Median
Mode
Maximum
Minimum

Range
Skewness
Kurtosis
Standard Deviation
Variance
Standard Error
25th Percentile
50th Percentile
75th Percentile
Inter Quartile Range
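
The components listed above can be computed with Python's standard library. This is a minimal sketch, using the ten ski-rental counts that appear in the Module 2 notes as sample data. The skewness and kurtosis lines use simple moment-based formulas, which is an assumption: spreadsheet packages apply small-sample corrections and will give slightly different values.

```python
import math
import statistics

sample = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)                   # sample standard deviation (n - 1)
pop_sd = statistics.pstdev(sample)              # population form, used for the moments
q1, q2, q3 = statistics.quantiles(sample, n=4)  # default "exclusive" method

summary = {
    "mean": mean,
    "median": statistics.median(sample),
    "mode": statistics.mode(sample),
    "maximum": max(sample),
    "minimum": min(sample),
    "range": max(sample) - min(sample),
    "variance": statistics.variance(sample),
    "standard deviation": sd,
    "standard error": sd / math.sqrt(n),
    "25th percentile": q1,
    "50th percentile": q2,
    "75th percentile": q3,
    "interquartile range": q3 - q1,
    # Moment-based skewness and excess kurtosis (no small-sample correction).
    "skewness": sum((x - mean) ** 3 for x in sample) / n / pop_sd ** 3,
    "kurtosis": sum((x - mean) ** 4 for x in sample) / n / pop_sd ** 4 - 3,
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```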

c. The figure below shows the home selling price distributions of five cities in the USA: Chicago,
Las Vegas, New York, Texas and Washington. Discuss and compare the home prices of these cities.

The lowest median selling price is in Washington and the highest is in Texas. The median prices rank
as follows: Texas > Chicago > Las Vegas > New York > Washington.

Las Vegas has a large 4th quartile, indicating that many homes have high selling prices. This is also reflected in
one outlier price. The dispersion ranks as follows: Las Vegas > Chicago > Texas > Washington > New York.
New York and Washington have the least SD, and most of their home prices are clustered around the
median.
There are two outliers in New York home prices, on both the higher and the lower side, and one outlier in
Las Vegas.

SECTION-B
2. Answer any TWO of the following Three questions: 2X12=24

a.Classify Business Analytics into 3 categories of Descriptive, Predictive and Prescriptive and discuss each one
of them.

Descriptive Analytics
Most businesses start with descriptive analytics, the use of data to understand past and current business
performance and make informed decisions. These techniques categorize, characterize, consolidate, and
classify data to convert it into useful information for the purposes of understanding and analyzing
business performance. Descriptive analytics summarizes data into meaningful charts and reports, for
example about budgets, sales, revenues, or cost. This process allows managers to obtain standard and
customized reports and then drill down into the data and make queries to understand the impact of an
advertising campaign for example, review business performance to find problems or areas of
opportunity, and identify patterns and trends in data. Typical questions that descriptive analytics helps
answer are “How much did we sell in each region?” and “What was our revenue and profit last quarter?”

Predictive Analytics
Predictive analytics seeks to predict the future by examining historical data, detecting patterns or
relationships in these data, and then extrapolating these relationships forward in time. For example, a
marketer might wish to predict the response of different customer segments to an advertising campaign,
or a fashion manufacturer might want to predict next season’s demand for fashion of a specific color
and size. Using advanced techniques, predictive analytics can help to detect hidden patterns in large
quantities of data to segment and group data into coherent sets to predict behavior and detect trends.

Prescriptive Analytics
Prescriptive analytics uses optimization to identify the best alternative to minimize or maximize some
objective. Prescriptive analytics is used in many areas of business including operations, marketing, and
finance. For example, we may determine the best pricing and advertising strategy to maximize revenue,
the optimal amount of cash to store in an ATM, or the best mix of investments in a retirement portfolio
to manage risk. Prescriptive analytics addresses questions such as “How much should we produce to
maximize profit?” and “What is the best way of shipping goods from our factories to minimize costs?”
b. A sample of 100 individuals were asked to evaluate their preferences for three new proposed energy drinks
in a blind taste test. The sample space consists of two types of outcomes corresponding to each individual:
gender and brand preference. The table below shows the cross tabulation of the results.

Count of Respondent (Observed)
Row Labels           Brand 1   Brand 2   Brand 3   Grand Total
Female               9         6         22        37
Male                 25        17        21        63
Grand Total          34        23        43        100

Expected Frequency   Brand 1   Brand 2   Brand 3   Grand Total
Female               12.58     8.51      15.91     37
Male                 21.42     14.49     27.09     63
Grand Total          34        23        43        100

Calculate the Chi-square value and determine, at the 5% level of significance, whether the proportion of males
who prefer a particular brand is different from the proportion of females.

Chi Square Statistic   Brand 1   Brand 2   Brand 3   Grand Total
Female                 1.02      0.74      2.33      4.09
Male                   0.60      0.43      1.37      2.40
Grand Total            1.62      1.18      3.70      6.49

Chi square critical value 5.99146455

Because the calculated chi-square value (6.49) exceeds the critical value (5.99 at the 5% level with
(2 - 1) × (3 - 1) = 2 degrees of freedom), we reject the null hypothesis that brand preference is not
influenced by gender.
What would be your recommendations for advertising campaigns based on the results?
The advertising campaigns should be separate for males and females as the brand preference is influenced by
Gender.
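
The chi-square calculation above can be reproduced in a few lines. This is a minimal sketch in plain Python; the variable names are illustrative, and the critical value is simply copied from the chi-square table (df = 2, 0.05 column) rather than computed.

```python
observed = {
    "Female": [9, 6, 22],
    "Male": [25, 17, 21],
}

row_totals = {gender: sum(counts) for gender, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

chi_square = 0.0
for gender, counts in observed.items():
    for j, obs in enumerate(counts):
        # Expected count under independence: row total * column total / grand total
        expected = row_totals[gender] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(col_totals) - 1)
critical_value = 5.991  # chi-square table, df = 2, alpha = 0.05
reject_h0 = chi_square > critical_value
print(round(chi_square, 2), df, reject_h0)
```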
Chi-square table

df 0.995 0.99 0.975 0.95 0.9 0.1 0.05 0.025 0.01 0.005
1 -- -- 0.001 0.004 0.016 2.706 3.841 5.024 6.635 7.879
2 0.01 0.02 0.051 0.103 0.211 4.605 5.991 7.378 9.21 10.6
3 0.072 0.115 0.216 0.352 0.584 6.251 7.815 9.348 11.35 12.84
4 0.207 0.297 0.484 0.711 1.064 7.779 9.488 11.14 13.28 14.86
5 0.412 0.554 0.831 1.145 1.61 9.236 11.07 12.83 15.09 16.75
6 0.676 0.872 1.237 1.635 2.204 10.65 12.59 14.45 16.81 18.55
7 0.989 1.239 1.69 2.167 2.833 12.02 14.07 16.01 18.48 20.28
8 1.344 1.646 2.18 2.733 3.49 13.36 15.51 17.54 20.09 21.96
9 1.735 2.088 2.7 3.325 4.168 14.68 16.92 19.02 21.67 23.59
10 2.156 2.558 3.247 3.94 4.865 15.99 18.31 20.48 23.21 25.19

c. The marketing manager of DataCom Inc. wants to predict the annual revenues generated by its customers
given certain characteristics. The manager runs a regression model on Years of Loyalty, Years
Employed, Years of Marriage, Gender and Average Number of Products Purchased. The output of the
regression model is given below:

Regression Statistics
Multiple R          0.874778
R Square            0.765237
Adjusted R Square   0.735892
Standard Error      5652.67
Observations        46

ANOVA
             df   SS              MS            F             Significance F
Regression   5    4166144430.17   833228886     26.07696863   0.00
Residual     40   1278106972.78   31952674.32
Total        45   5444251402.96

                                       Coefficients   Standard Error   t Stat         P-value   Lower 95%       Upper 95%
Intercept                              25801.11       2488.444433      10.36836776    0.00      20771.77322     30830.44083
Years of Loyalty                       -208.634       237.4500522      -0.878644716   0.38      -688.5386908    271.2702232
Years Employed                         638.4924       141.0373752      4.527114617    0.00      353.4451946     923.5395309
Years of Marriage                      1604.962       408.1861998      3.931935175    0.00      779.9865937     2429.93676
Gender*                                -1627.26       1753.634545      -0.927935442   0.36      -5171.487268    1916.967976
Average Number of Products Purchased   183.2034       97.86866728      1.871931586    0.07      -14.59650537    381.0034045

*Note: Female = 0, Male = 1


a) Build the model to predict the annual revenues of the company.
Annual Revenues = 25801.11 + 638.4924 * Years Employed + 1604.962 * Years of
Marriage + 183.2034 * Average Number of Products Purchased
b) Interpret the model output.
R Square = 0.765. 76.5% of the variation in the dependent variable is explained by the independent variables.
The intercept, Years Employed and Years of Marriage are significant at the 5% level (p less than 0.05).
Average Number of Products Purchased has p = 0.07, so it is significant only at the 10% level. Years of
Loyalty (p = 0.38) and Gender (p = 0.36) are not significant.
c) According to the model, is there a difference between the mean revenues earned by male and female
customers at DataCom? Justify your answer.
There is no significant difference in revenues earned by male and female customers, as Gender is an
insignificant variable: its p value (0.36) is more than 0.05 and its 95% interval (-5171.49 to 1916.97)
includes zero.
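d) Predict a 95% confidence interval of the annual revenue of a female customer who has been purchasing for
the last 10 years, has been married for 8 years and has been employed for 12 years.
Point estimate (Gender = 0 for a female customer; the number of products purchased is not given, so that term is omitted):
Annual Revenue = 25801.11 + 638.4924*12 + 1604.962*8 = 25801.11 + 7661.9088 + 12839.696 = 46302.71
A rough 95% interval uses the regression standard error (5652.67) and the t value for 40 residual degrees of
freedom (about 2.021): 46302.71 ± 2.021 * 5652.67, that is approximately 34878.66 to 57726.76. This
approximation ignores the uncertainty in the estimated coefficients.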

SECTION-C
3. Case study - Compulsory: 1X10=10

KUMAR SOFT DRINK BOTTLING COMPANY

Kumar Soft Drink Bottling Company came into operation in 1984 and was operating in the NCR of Delhi and
in the states of Punjab and Haryana. The turnover of the company was Rs. 1.5 crore in 2010 and it was growing
at the rate of 10 per cent per annum. The chairman of the company, Mr. Kumar, wanted to examine whether
the flavour of the soft drink and the price level had any impact upon the sales. He wanted this because the
results could have implications for changing the product mix if required. Three types of flavours were
considered, namely pineapple, mango and orange. Further, three levels of price were taken into
consideration: Rs.10, Rs.12 and Rs.14. An experiment was conducted by randomly choosing a sample of 18
stores where the flavour of the soft drink and the price level were varied. The experiment period was one
month.
Coding for flavor: Pineapple =1
Mango =2
Orange =3
Coding for price: Rs.10/- =1
Rs.12/- =2
Rs.14/- =3

Questions

Is there any impact of the flavor or the price level independently upon the sales? Conduct the test using a 5
percent level of significance. Construct the hypotheses and interpret the results.
A one way ANOVA was run between flavor and sales. The results are summarized below:

Anova: Single Factor

SUMMARY
Groups      Count   Sum    Average       Variance
Pineapple   6       23.3   3.883333333   2.169666667
Mango       6       19.9   3.316666667   0.885666667
Orange      6       19.6   3.266666667   1.430666667

ANOVA
Source of Variation   SS            df   MS            F             P-value       F crit
Between Groups        1.407777778   2    0.703888889   0.470723733   0.633470362   3.68232
Within Groups         22.43         15   1.495333333
Total                 23.83777778   17

H0: Sales is not dependent on type of flavor.
Ha: Sales is dependent on type of flavor.
Since the p value (0.633) is more than 0.05, H0 cannot be rejected. Hence flavor has no significant influence on sales.

A one way ANOVA was run between price and sales. The results are summarized below.
Anova: Single Factor

SUMMARY
Groups   Count   Sum    Average    Variance
Rs10     6       28.7   4.783333   0.661667
Rs12     6       20     3.333333   0.326667
Rs14     6       14.1   2.35       0.183

ANOVA
Source of Variation   SS            df   MS         F          P-value    F crit
Between Groups        17.98111111   2    8.990556   23.02647   0.000027   3.68232
Within Groups         5.856666667   15   0.390444
Total                 23.83777778   17

H0: Sales is not dependent on price.
Ha: Sales is dependent on price.
Since the p value (0.000027) is less than 0.05, H0 is rejected. Hence sales is influenced by price.
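
Both ANOVA tables can be rebuilt from the SUMMARY rows (count, average, variance) without the raw store data. The sketch below is an illustration with made-up function names. The p-value helper relies on a closed form of the F distribution that holds only when the between-groups df is 2 (three groups), which is the case here; for other designs a statistics library would be needed.

```python
def one_way_anova(groups):
    """groups: list of (count, mean, variance) tuples, one per factor level."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * v for n, _, v in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def p_value_for_two_numerator_df(f_stat, df_within):
    # Upper-tail probability of F(2, df_within); valid only for df_between == 2.
    return (1 + 2 * f_stat / df_within) ** (-df_within / 2)


flavor = [(6, 3.883333333, 2.169666667), (6, 3.316666667, 0.885666667), (6, 3.266666667, 1.430666667)]
price = [(6, 4.783333, 0.661667), (6, 3.333333, 0.326667), (6, 2.35, 0.183)]

f_flavor, _, dfw_flavor = one_way_anova(flavor)
f_price, _, dfw_price = one_way_anova(price)
p_flavor = p_value_for_two_numerator_df(f_flavor, dfw_flavor)
p_price = p_value_for_two_numerator_df(f_price, dfw_price)
print(round(f_flavor, 4), round(p_flavor, 4))
print(round(f_price, 4), round(p_price, 6))
```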

CMS Business School

MBA: 3rd Semester Code 17JBS315 Marketing Analytics

Max Marks : 50 Time : 2 Hours


SECTION A
Answer any 4 out of 6 Questions 4X5=20 Marks

a) What are the steps involved in data driven decision making process?
I. Identify the problem or opportunity
II. Identify sources of data (primary as well as secondary data)
III. Process the data for missing and incorrect data. Prepare the data for analytics model building
IV. Build the analytical models
V. Communicate the data analysis output and decisions effectively.
VI. Implement Solution/Decision

b) Why are organizations moving towards use of Analytics in business decisions?

I. Humans are inherently not good at making decisions


a. Monty Hall Problem
b. The Travelling Salesman Problem (Akshaya Patra Foundation)
II. Striking correlation between an organization’s analytics sophistication and its competitive
performance.
III. Business Analytics is being used as a competitive strategy

c) How do we determine outliers in data? Illustrate with a suitable diagram.


Outliers in data can be identified with a box plot.

A number of households have incomes above the upper whisker of the box plot, that is, more than 1.5 times the
interquartile range above the third quartile. All those cases are outliers. The median income, shown by the dark
black line, is approximately $70,000; the mean income can be determined through descriptive
statistics.
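
The box plot rule can also be applied numerically. The sketch below flags values more than 1.5 × IQR beyond the quartiles, which is the common whisker convention and is assumed here. The income figures are hypothetical and are not taken from the exam data set.

```python
import statistics


def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical household incomes in $ thousands.
incomes = [32, 45, 51, 58, 64, 70, 76, 83, 95, 110, 240]
print(iqr_outliers(incomes))
```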
d) An independent samples t-test was conducted to compare household income between retired and not retired
households. Is there a significant difference in household income between retired and not retired households?
Independent Samples Test (Household income in thousands)

                              Levene's Test for
                              Equality of Variances   t-test for Equality of Means
                              F        Sig.           t        df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed       75.345   .000           10.185   6398      .000              46.45395          4.56093                 37.51300       55.39489
Equal variances not assumed                           25.137   631.444   .000              46.45395          1.84806                 42.82485       50.08304

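H0: There is no significant difference in household income of retired and not retired households.
Ha: There is a significant difference in household income of retired and not retired households.
Levene's test is significant (Sig. = .000), so equal variances cannot be assumed and the second row of the table
applies. Since its p value (.000) is less than 0.05, we reject the null hypothesis: household income differs
significantly between retired and not retired households.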

e) The Sales Model for a firm is given below


Sales = 500 – 0.05(price) + 30(coupons) + 0.08(advertising) + 0.25(price)(advertising)
If the price is $6.99, no coupons are offered, and advertising of $150 is done, calculate the estimated
sales as determined by the model.

Sales = 500 - 0.05*(6.99) + 30*0 + 0.08*(150) + 0.25*6.99*150 = 500 - 0.3495 + 0 + 12 + 262.125 =
773.7755
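
The same substitution can be written as a small function, which makes it easy to try other prices or advertising levels. The function and argument names are illustrative; the coefficients are those of the model above.

```python
def estimated_sales(price, coupons, advertising):
    """Sales model with a price x advertising interaction term."""
    return (500
            - 0.05 * price
            + 30 * coupons
            + 0.08 * advertising
            + 0.25 * price * advertising)


sales = estimated_sales(price=6.99, coupons=0, advertising=150)
print(round(sales, 4))
```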

f) What are the 3 different types of forecast errors used to determine the accuracy of demand forecasting
methods?

Mean absolute deviation (MAD) = Ʃ |At - Ft| / n
Root mean square error (RMSE) = Sqrt(Ʃ (At - Ft)^2 / n)
Mean absolute percentage error (MAPE) = Ʃ (|At - Ft| / At) / n
where At = actual demand, Ft = forecasted demand, and n = number of periods compared
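
A minimal sketch of the three error measures follows. The function names and the three-period demand series are hypothetical and serve only to show the calculation; MAPE is returned as a fraction, so multiply by 100 for a percentage.

```python
import math


def mad(actual, forecast):
    """Mean absolute deviation."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean square error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual, forecast):
    """Mean absolute percentage error, as a fraction of actual demand."""
    return sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical actual and forecasted demand for three periods.
actual = [100, 120, 110]
forecast = [90, 125, 110]
print(mad(actual, forecast), rmse(actual, forecast), mape(actual, forecast))
```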

SECTION B
Answer any 2 out of 3 questions 2X10=20 Marks

1.Classify Business Analytics into 3 categories of Descriptive, Predictive and Prescriptive and discuss each one
of them.

Descriptive Analytics
Most businesses start with descriptive analytics, the use of data to understand past and current business
performance and make informed decisions. These techniques categorize, characterize, consolidate, and
classify data to convert it into useful information for the purposes of understanding and analyzing
business performance. Descriptive analytics summarizes data into meaningful charts and reports, for
example about budgets, sales, revenues, or cost. This process allows managers to obtain standard and
customized reports and then drill down into the data and make queries to understand the impact of an
advertising campaign for example, review business performance to find problems or areas of
opportunity, and identify patterns and trends in data. Typical questions that descriptive analytics helps
answer are “How much did we sell in each region?” and “What was our revenue and profit last quarter?”

Predictive Analytics
Predictive analytics seeks to predict the future by examining historical data, detecting patterns or
relationships in these data, and then extrapolating these relationships forward in time. For example, a
marketer might wish to predict the response of different customer segments to an advertising campaign,
or a fashion manufacturer might want to predict next season’s demand for fashion of a specific color
and size. Using advanced techniques, predictive analytics can help to detect hidden patterns in large
quantities of data to segment and group data into coherent sets to predict behavior and detect trends.

Prescriptive Analytics
Prescriptive analytics uses optimization to identify the best alternative to minimize or maximize some
objective. Prescriptive analytics is used in many areas of business including operations, marketing, and
finance. For example, we may determine the best pricing and advertising strategy to maximize revenue,
the optimal amount of cash to store in an ATM, or the best mix of investments in a retirement portfolio
to manage risk. Prescriptive analytics addresses questions such as “How much should we produce to
maximize profit?” and “What is the best way of shipping goods from our factories to minimize costs?”

2. A property broking company wanted to understand the factors which influence the price of a property. To
this end, it conducted a regression to understand the impact of the area of property in square feet and the age of
the house to determine the property prices. The table below shows the summary of regression output. Interpret
the results (R square, significant variables and residual Plot) and construct an equation for Property price.
SUMMARY OUTPUT

Regression Statistics

Multiple R 0.745494776

R Square 0.555762462

Adjusted R Square 0.532981049

Standard Error 7211.848497

Observations 42

Coefficients Standard Error t Stat P-value

Intercept 47331.38154 13884.34664 3.408974347 0.001527831

House Age -825.1612203 607.3128421 -1.358708664 0.18204591

Square Feet 40.91106845 6.696523994 6.109299165 3.65101E-07

[Figure: Square Feet Residual Plot. x-axis: Square Feet (0 to 2,500); y-axis: Residuals (-50000 to 50000)]

R Square = 0.556
About 55.6% of the variation in the dependent variable (property price) is explained by the independent variables.
The intercept and Square Feet are significant, as their p values are less than 0.05. House Age is not significant (p = 0.18).
Price of property = 47331.38 + 40.91 * Square Feet

3. Monthly Demand at an electronics retailer for LED Televisions is as follows:

Month Demand(Units)
Jan 2016 1000
Feb 2016 1113
March 2016 1271
April 2016 1445
May 2016 1558
June 2016 1648
July 2016 1724
August 2016 1850
Sept 2016 1864
Oct 2016 2076
Nov 2016 2167
Dec 2016 2191

Use the naïve, 3-month moving average and cumulative (average of all past months) methods to forecast demand
for January 2017. Use MAD to determine which method provides the highest accuracy.

Each forecast shown in a row is the forecast for the following month. Each absolute error compares that month's
demand with the forecast from the previous row.

         Demand    Forecast for next month         Absolute error |At - Ft|
Month    (Units)   Naïve   3MA       Cumulative    Naïve      3MA      Cumulative
Jan-16   1000      1000              1000.00
Feb-16   1113      1113              1056.50       113                 113.00
Mar-16   1271      1271    1128.00   1128.00       158                 214.50
Apr-16   1445      1445    1276.33   1207.25       174        317.00   317.00
May-16   1558      1558    1424.67   1277.40       113        281.67   350.75
Jun-16   1648      1648    1550.33   1339.17       90         223.33   370.60
Jul-16   1724      1724    1643.33   1394.14       76         173.67   384.83
Aug-16   1850      1850    1740.67   1451.13       126        206.67   455.86
Sep-16   1864      1864    1812.67   1497.00       14         123.33   412.88
Oct-16   2076      2076    1930.00   1554.90       212        263.33   579.00
Nov-16   2167      2167    2035.67   1610.55       91         237.00   612.10
Dec-16   2191      2191    2144.67   1658.92       24         155.33   580.45
MAD                                                108.2727   220.15   399.179093

Mean absolute deviation (MAD) = Ʃ |At - Ft| / n

Forecast for January 2017: naïve = 2191, 3-month moving average = 2144.67, cumulative = 1658.92.
The naïve method has the lowest MAD (108.27), so it provides the highest accuracy for this data.
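
The table can be reproduced with a short walk-forward loop: at each month, every method forecasts from the demand seen so far, and the absolute error is recorded. The sketch below is illustrative and the function names are made up for this example; it uses the demand series given in the question.

```python
demand = [1000, 1113, 1271, 1445, 1558, 1648, 1724, 1850, 1864, 2076, 2167, 2191]


def naive(history):
    return history[-1]                  # last observed demand


def moving_average_3(history):
    return sum(history[-3:]) / 3        # mean of the last three months


def cumulative_mean(history):
    return sum(history) / len(history)  # mean of all months so far


def mad(series, method, min_history):
    """Mean absolute deviation of one-step-ahead forecasts."""
    errors = [abs(series[t] - method(series[:t]))
              for t in range(min_history, len(series))]
    return sum(errors) / len(errors)


methods = {
    "naive": (naive, 1),
    "3MA": (moving_average_3, 3),
    "cumulative": (cumulative_mean, 1),
}
results = {
    name: {"jan_2017_forecast": fn(demand), "mad": mad(demand, fn, k)}
    for name, (fn, k) in methods.items()
}
best = min(results, key=lambda name: results[name]["mad"])

for name, r in results.items():
    print(name, round(r["jan_2017_forecast"], 2), round(r["mad"], 2))
print("lowest MAD:", best)
```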

SECTION C

Case Study (Compulsory) 1X10=10 Marks

MALHOTRA SPICES COMPANY PVT.LTD

Malhotra Spices Company came into operation in 1960 and has operations in all parts of the country. It was
in the business of manufacturing and selling spices suitable for the Indian kitchen. They ventured into the
export markets in the 1980s as there was a huge demand for the spices in North America, Europe, Australia
and in the Middle East. This is because the number of the Indians residing in these countries had been
increasing at an exponential rate. The spices were packed into tetrapacks containing spices in different
quantities like 100, 150,200,250 and 500 gm. The 500 gm packages were mostly used by restaurants and
hoteliers. Mr K P Malhotra, Chairman of Malhotra Spices, was wondering whether they should change the
packaging from tetrapack to plastic or glass bottle packaging. Before taking a final decision, as an
experiment, the company introduced plastic and glass bottle packaging in addition to the existing tetra packs
packaging in the national capital region (NCR) of Delhi. Mr Malhotra was thinking that switching over to a
new packaging would involve a huge investment and if the results were not different for the other two types
of packaging, they would drop the idea of change in packaging.

The company on an experimental basis came up with three types of packaging-plastic, glass bottles and
tetrapacks-for the NCR market. They wanted to observe the sales of spices for the three types of
packaging. Mr Malhotra's younger brother told him that it is not only the type of packaging that influenced
the sales but also some external factors like the size of the store selling the spices.
Type of packaging
1= Plastic
2 = Glass
3 = Tetrapacks
Type of store
1 = large store
2 = Medium store
3 = Small store

A one way ANOVA was run between sales and type of packaging. The results are summarized below:

ANOVA
Source of Variation   SS            df   MS            F             P-value     F crit
Between Groups        3808.866667   2    1904.433333   16.37309346   0.0000220   3.354130829
Within Groups         3140.5        27   116.3148148
Total                 6949.366667   29

Construct the hypothesis and interpret the results.

H0: Sales is not dependent on type of packaging.
Ha: Sales is dependent on type of packaging.
Since the p value (0.000022) is less than 0.05, H0 is rejected, i.e. sales is dependent on the type of packaging.
A one way ANOVA was run between sales and type of store. The results are summarized below:

ANOVA
Source of Variation   SS            df   MS            F             P-value       F crit
Between Groups        112.2555556   2    56.12777778   0.221650632   0.802638612   3.354130829
Within Groups         6837.111111   27   253.2263374
Total                 6949.366667   29

Construct the hypothesis and interpret the results.


H0: Sales is not dependent on type of store.
Ha: Sales is dependent on type of store.
Since the p value (0.8026) is more than 0.05, H0 cannot be rejected, i.e. sales is not dependent on type of store.

Module 1
1) Steps involved in “Data driven decision making”

a) Identify the problem or opportunity


b) Identify sources of data (primary as well as secondary data)
c) Process the data for missing and incorrect data. Prepare the data for analytics model building
d) Build the analytical models
e) Communicate the data analysis output and decisions effectively.
f) Implement Solution/Decision

2) Different types of analytics

Data is a collection of facts, such as numbers, words, measurements, observations or just
descriptions of things.
Raw data is the data that is measured and collected directly from a machine, the web, etc.
Processed data is data derived from raw data. Usually some kind of
cleaning or transformation is performed to convert the raw data into a format that can be analyzed and
visualized.
A database is a collection of related data organized in a way that the data can be easily accessed,
managed and updated. A database can be software based or hardware based, with one sole purpose:
storing data.

Different types of scales


A decision variable is a quantity that the decision-maker controls. In other words, Decision
variables can be selected at the discretion of the decision maker.

The total cost of reaching consumers (C) depends upon the number of consumers (N), advertising
costs (A), and transportation costs (T). The linear cost prediction model is represented as: C = c -
nN + aA + tT where c, n, a, and t are constants and c estimates the total cost when the remaining
variables are zero.
Inference: Change in variables N, A, and T will not reflect any changes in c.

The manager at Soul Walk Inc., a shoe manufacturing company, wants to set a new price (P) for a
shoe model to maximize total profit. The demand (D) as a function of price is represented as: D =
1,500 - 2.5P. The total cost (C) as a function of demand is represented as: C = 3,200 + 3.5D. Which
of the following is a model for total profit as a function of price?

Total profit = (1,508.75 × price) - (2.5 × price^2) - 8,450
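
The profit model follows from profit = revenue - cost = P × D - C. The sketch below, with illustrative function names, builds profit from the demand and cost functions and checks it against the expanded form: P(1,500 - 2.5P) - 3,200 - 3.5(1,500 - 2.5P) = 1,508.75P - 2.5P^2 - 8,450.

```python
def demand(price):
    return 1500 - 2.5 * price


def total_cost(units):
    return 3200 + 3.5 * units


def profit(price):
    """Profit = revenue - cost, built from the two given functions."""
    units = demand(price)
    return price * units - total_cost(units)


def profit_expanded(price):
    """The expanded model from the answer."""
    return 1508.75 * price - 2.5 * price ** 2 - 8450


for p in (0, 100, 250):
    print(p, profit(p), profit_expanded(p))
```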

An optimization problem is the problem of finding the best solution from all feasible solutions.
A problem with continuous variables is known as a continuous optimization problem.

Decision variables in Optimization problem are unknown values that the model seeks to determine
while constraints are the limitations, requirements, or other restrictions that are imposed on any
solution.
Different Types of Graphs

Quartiles
Note: The difference between the first and third quartiles is referred to as the interquartile range

Module 2
Mean, Mode, Median, and Standard Deviation

The Mean and Mode

The sample mean is the average and is computed as the sum of all the observed
outcomes from the sample divided by the total number of observations. We use x̄ (x-bar) as the
symbol for the sample mean. In math terms,

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

where n is the sample size and the x_i correspond to the observed values.
Example

Suppose you randomly sampled six acres in the Desolation Wilderness for a non-
indigenous weed and came up with the following counts of this weed in this region:

34, 43, 81, 106, 106 and 115

We compute the sample mean by adding the counts and dividing by the number of samples, 6.

$\frac{34 + 43 + 81 + 106 + 106 + 115}{6} = 80.83$

We can say that the sample mean of non-indigenous weed is 80.83.

The mode of a set of data is the number with the highest frequency. In the above
example 106 is the mode, since it occurs twice and the rest of the outcomes occur only
once.
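
The same figures can be checked with Python's standard `statistics` module. This is a small illustrative sketch; the variable names are made up.

```python
import statistics

weed_counts = [34, 43, 81, 106, 106, 115]

sample_mean = statistics.mean(weed_counts)   # sum of the counts divided by 6
sample_mode = statistics.mode(weed_counts)   # most frequent value

print(round(sample_mean, 2), sample_mode)
```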

The population mean is the average of the entire population and is usually impossible
to compute. We use the Greek letter μ for the population mean.

Note: The similarity between a midrange and mean is that both are affected by outliers.

Median, and Trimmed Mean

One problem with using the mean is that it often does not depict the typical outcome. If
there is one outcome that is very far from the rest of the data, then the mean will be
strongly affected by this outcome. Such an outcome is called an outlier. An
alternative measure is the median. The median is the middle score. If we have an even
number of observations we take the average of the two middle values. The median is better for
describing the typical value. It is often used for income and home prices.

Example

Suppose you randomly selected 10 house prices in the South Lake Tahoe area. You
are interested in the typical house price. In units of $100,000 the prices were

2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8

If we computed the mean, we would say that the average house price
is $744,000. Although this number is correct, it does not reflect the price of available
housing in South Lake Tahoe. A closer look at the data shows that the house valued
at 40.8 x $100,000 = $4.08 million skews the data. Instead, we use the median. Since
there is an even number of outcomes, we take the average of the middle two:

(3.7 + 4.1) / 2 = 3.9

The median house price is $390,000. This better reflects what house shoppers should
expect to spend.
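
The comparison between the two measures can be reproduced with a short sketch. It assumes Python's standard `statistics` module, and the variable names are illustrative.

```python
import statistics

# House prices in units of $100,000, as listed in the example above.
prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]

mean_price = statistics.mean(prices)      # pulled upward by the 40.8 outlier
median_price = statistics.median(prices)  # average of the two middle values

print(f"mean:   ${mean_price * 100_000:,.0f}")
print(f"median: ${median_price * 100_000:,.0f}")
```

The single $4.08 million house nearly doubles the mean, while the median stays close to what most of the listed houses cost.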

There is an alternative value that is also resistant to outliers. This is called the trimmed
mean, which is the mean after removing the outliers, or after removing a fixed share of the
data such as the top 5% and the bottom 5%. We can also use the trimmed mean if we are
concerned about outliers skewing the data; however, the median is used more often since
more people understand it.
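
A trimmed mean might be sketched as follows. The function name `trimmed_mean` and the 10% trimming proportion are assumptions made for illustration: with only ten house prices, a 5% trim would remove nothing, so 10% is used to drop one value from each end.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# House prices in units of $100,000 from the median example.
prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]

result = trimmed_mean(prices, 0.10)  # drops 2.7 and 40.8
print(round(result, 4))              # 3.8625
```

The trimmed mean of about $386,000 lands close to the $390,000 median, because the $4.08 million house no longer enters the calculation.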

Example:

At a ski rental shop data was collected on the number of rentals on each of ten
consecutive Saturdays:

44, 50, 38, 96, 42, 47, 40, 39, 46, 50.

To find the sample mean, add them and divide by 10:

(44 + 50 + 38 + 96 + 42 + 47 + 40 + 39 + 46 + 50) / 10 = 492 / 10 = 49.2

Notice that the mean value is not a value of the sample.

To find the median, first sort the data:

38, 39, 40, 42, 44, 46, 47, 50, 50, 96

Notice that there are two middle numbers, 44 and 46. To find the median we take the
average of the two:

Median = (44 + 46) / 2 = 45

Notice also that the mean is larger than all but three of the data points. The mean is
influenced by outliers while the median is robust.

Note: A median is not affected by outliers; a mean is affected by outliers.

Variance, Standard Deviation and Coefficient of Variation

The mean, mode, median, and trimmed mean do a nice job of telling where the center
of the data set is, but often we are interested in more. For example, a pharmaceutical
engineer develops a new drug that regulates sugar in the blood. Suppose she finds out
that the average sugar content after taking the medication is the optimal level. This
does not mean that the drug is effective. There is a possibility that half of the patients
have dangerously low sugar content while the other half have dangerously high
content. Instead of the drug being an effective regulator, it is a deadly poison. What
the engineer needs is a measure of how far the data is spread apart. This is what the
variance and standard deviation do. First we show the formulas for these
measurements. Then we will go through the steps on how to use the formulas.

We define the variance to be

s² = ∑(x - x̄)² / (n - 1)

and the standard deviation to be

s = √( ∑(x - x̄)² / (n - 1) )

Variance and Standard Deviation: Step by Step

1. Calculate the mean, x̄.
2. Write a table that subtracts the mean from each observed value.
3. Square each of the differences.
4. Add this column.
5. Divide by n - 1, where n is the number of items in the sample. This is the variance.
6. To get the standard deviation, take the square root of the variance.

Example

The owner of the Ches Tahoe restaurant is interested in how much people spend at the
restaurant. He examines 10 randomly selected receipts for parties of four and writes
down the following data.

44, 50, 38, 96, 42, 47, 40, 39, 46, 50

He calculated the mean by adding and dividing by 10 to get

x̄ = 49.2

Below is the table for getting the standard deviation:

x        x - 49.2     (x - 49.2)²
44         -5.2          27.04
50          0.8           0.64
38        -11.2         125.44
96         46.8        2190.24
42         -7.2          51.84
47         -2.2           4.84
40         -9.2          84.64
39        -10.2         104.04
46         -3.2          10.24
50          0.8           0.64
Total                  2599.60

Now

2599.6 / (10 - 1) = 288.8

Hence the variance is about 289 and the standard deviation is the square root of 289, which is 17.
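
The six steps can be traced in a few lines of Python. This is a sketch using the receipts from the example, with illustrative variable names; the final line cross-checks the hand calculation against the standard library's `statistics.variance`, which also divides by n - 1.

```python
import math
import statistics

# The ten receipts from the restaurant example.
bills = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

n = len(bills)
mean = sum(bills) / n                             # step 1: the mean
squared_diffs = [(x - mean) ** 2 for x in bills]  # steps 2 and 3: subtract, then square
total = sum(squared_diffs)                        # step 4: add the column
variance = total / (n - 1)                        # step 5: divide by n - 1
std_dev = math.sqrt(variance)                     # step 6: square root

print(round(total, 1), round(variance, 1), round(std_dev, 1))  # 2599.6 288.8 17.0

# Cross-check against the library's sample variance.
assert math.isclose(variance, statistics.variance(bills))
```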

Since the standard deviation can be thought of as measuring how far the data values lie
from the mean, we take the mean and move one standard deviation in either
direction. The mean for this example was 49.2 and the standard deviation was about
17. We have:

49.2 - 17 = 32.2

and

49.2 + 17 = 66.2

What this means is that most of the patrons probably spend between $32.20 and $66.20.

The sample standard deviation will be denoted by s and the population standard
deviation will be denoted by the Greek letter σ.

The sample variance will be denoted by s² and the population variance will be denoted
by σ².

The variance and standard deviation describe how spread out the data is. If the data all
lies close to the mean, then the standard deviation will be small, while if the data is
spread out over a large range of values, s will be large. Having outliers will increase
the standard deviation.

One of the flaws of the standard deviation is that it depends on the units
that are used. One way of handling this difficulty is the coefficient of
variation, which is the standard deviation divided by the mean, times 100%:

CV = (s / x̄) × 100%

In the above example, it is

(17 / 49.2) × 100% = 34.6%

This tells us that the standard deviation of the restaurant bills is 34.6% of the mean.
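
A short sketch of the same calculation, assuming Python's standard `statistics` module. Because it uses the unrounded standard deviation rather than 17, the result differs slightly from the 34.6% above.

```python
import statistics

# The ten receipts from the restaurant example.
bills = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

# Coefficient of variation: sample standard deviation relative to the mean, in percent.
cv = statistics.stdev(bills) / statistics.mean(bills) * 100

print(f"{cv:.1f}%")  # 34.5%
```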

Dispersion is the degree of variation in the data.

A z-score (also called a standard score) gives you an idea of how far from the mean a
data point is. More technically, it is a measure of how many standard
deviations below or above the population mean a raw score is.

A z-score can be placed on a normal distribution curve. Most z-scores fall between -3
standard deviations (the far left of the normal distribution curve) and +3 standard
deviations (the far right of the normal distribution curve), although more extreme
values are possible. In order to use a z-score, you need to know the mean μ and also
the population standard deviation σ.

Z-scores are a way to compare results to a “normal” population. Results from tests or
surveys have thousands of possible results and units; those results can often seem
meaningless. For example, knowing that someone’s weight is 150 pounds might be
good information, but if you want to compare it to the “average” person’s weight,
looking at a vast table of data can be overwhelming (especially if some weights are
recorded in kilograms). A z-score can tell you where that person’s weight is compared
to the average population’s mean weight.

z = (x – μ) / σ
For example, let’s say you have a test score of 190. The test has a mean (μ) of 150 and
a standard deviation (σ) of 25. Assuming a normal distribution, your z score would be:
• z = (x – μ) / σ
• = (190 – 150) / 25 = 1.6.
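
The formula translates directly into code. In this sketch the function name `z_score` is illustrative:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Test score of 190, with mean 150 and standard deviation 25, as in the example above.
z = z_score(190, 150, 25)
print(z)  # 1.6
```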

What is the difference between population and sample in inferential statistics?


From the population we take a sample. We usually cannot work with the whole population,
either because of computational cost or because not all of its data points are available.

From the sample we calculate the statistics.

From the sample statistics we draw conclusions about the population.

What is the difference between inferential statistics and descriptive statistics?

Descriptive statistics – provides exact and accurate information about the data at hand.

Inferential statistics – uses information from a sample to reach a conclusion about the
population.

Most common characteristics used in descriptive statistics?

• Center – middle of the data. Mean / Median / Mode are the most commonly used
measures.
  • Mean – average of all the numbers
  • Median – the number in the middle
  • Mode – the number that occurs the most. The disadvantage of using Mode is that
    there may be more than one mode.
• Spread – how the data is dispersed. Range / IQR / Standard Deviation / Variance are the
most commonly used measures.
  • Range = Max – Min
  • Inter Quartile Range (IQR) = Q3 – Q1
  • Standard Deviation (σ) = √(∑(x - µ)² / n)
  • Variance = σ²
• Shape – the shape of the data can be symmetric or skewed
  • Symmetric – the part of the distribution on the left side of the median is the same
    as the part of the distribution on the right side of the median
  • Left skewed – the left tail is longer than the right tail
  • Right skewed – the right tail is longer than the left tail
• Outlier – an outlier is an abnormal value
  • Keep the outlier based on judgement
  • Remove the outlier based on judgement
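
The spread measures listed above can be computed with Python's standard `statistics` module. This is a sketch on the ski rental data from earlier in these notes. One assumption to flag: quartiles can be defined in several ways, and `statistics.quantiles` uses its default "exclusive" method here, so other tools may report slightly different values for Q1, Q3 and the IQR.

```python
import statistics

# Ski rental counts for ten consecutive Saturdays.
data = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

data_range = max(data) - min(data)            # Range = Max - Min
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points (default method)
iqr = q3 - q1                                 # IQR = Q3 - Q1
sigma = statistics.pstdev(data)               # population form: divides by n
variance = statistics.pvariance(data)         # sigma squared

print(data_range, iqr, round(sigma, 2), round(variance, 2))
```

Note that `pstdev` and `pvariance` divide by n, matching the σ formula in the list, whereas the sample versions (`stdev`, `variance`) divide by n - 1.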

What is the meaning of standard deviation?


It represents how far the data points are from the mean.

σ = √(∑(x - µ)² / n)

Variance is the square of the standard deviation.

What is left skewed distribution and right skewed distribution?

• Left skewed
  • The left tail is longer than the right tail
  • Typically, mean < median < mode
• Right skewed
  • The right tail is longer than the left tail
  • Typically, mode < median < mean

What is the relationship between mean and median in normal distribution?

In the normal distribution, the mean is equal to the median.

What does it mean by bell curve distribution and Gaussian distribution?

The normal distribution is also called the bell curve distribution or the Gaussian distribution.

It is called the bell curve because it has the shape of a bell.

It is called the Gaussian distribution because it is named after Carl Friedrich Gauss.

What is an outlier?

An outlier is an abnormal value (it lies at an abnormal distance from the rest of the data points).

What can I do with outlier?

• Remove outlier
  • When we know the data point is wrong (negative age of a person)
  • When we have lots of data
  • We should provide two analyses: one with outliers and another without outliers.
• Keep outlier
  • When there are a lot of outliers (skewed data)
  • When results are critical
  • When outliers have meaning (fraud data)

What is the difference between 95% confidence level and 99% confidence level?

The confidence interval widens as we move from a 95% confidence level to a 99%
confidence level.

What do you mean by degree of freedom?

DF is defined as the number of options we have, that is, the number of values that are
free to vary.

DF is used with the t-distribution and not with the Z-distribution.

For a series, DF = n - 1 (where n is the number of observations in the series).

When to use t distribution and when to use z distribution?

• The following conditions must be satisfied to use the Z-distribution:
  • The population standard deviation σ is known.
  • The sample size is greater than 30.
  • CI = x̄ – Z*σ/√n to x̄ + Z*σ/√n
• Otherwise we should use the t-distribution (with n - 1 degrees of freedom):
  • CI = x̄ – t*s/√n to x̄ + t*s/√n
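
The Z-based interval can be sketched with the standard library's `NormalDist`. The function name `z_interval` and the input figures (sample mean 50, σ = 10, n = 100) are hypothetical, chosen only for illustration. The t-based interval has the same shape but needs a t critical value, which would come from a t table or a statistics library and is not shown here.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical inputs: sample mean 50, known sigma 10, sample size 100.
low95, high95 = z_interval(50.0, 10.0, 100, 0.95)
low99, high99 = z_interval(50.0, 10.0, 100, 0.99)

print(round(low95, 2), round(high95, 2))  # 48.04 51.96
print(round(low99, 2), round(high99, 2))  # 47.42 52.58
```

The 99% interval is wider than the 95% interval: a higher confidence level requires a larger critical value and therefore a larger margin.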

What is H0 and H1? What is H0 and H1 for two-tail test?

• H0 is known as the null hypothesis. It is the normal case / default case.
  • For one-tail test x <= µ
  • For two-tail test x = µ
• H1 is known as the alternate hypothesis. It is the other case.
  • For one-tail test x > µ
  • For two-tail test x ≠ µ

What is p-value in hypothesis testing?

• The p-value is the probability, assuming H0 is true, of observing a result at least as
  extreme as the one obtained. It is compared with the significance level α (here 0.05).
• If the p-value is more than the significance level, then we fail to reject the H0
  • If p-value = 0.055 (α = 0.05) – weak evidence against H0
• If the p-value is less than the significance level, then we reject the H0
  • If p-value = 0.015 (α = 0.05) – strong evidence against H0
  • If p-value = 0.005 (α = 0.05) – even stronger evidence against H0
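
The decision rule is a single comparison. In this sketch the function name `decide` and the default significance level of 0.05 are illustrative choices:

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule at significance level alpha."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

for p in (0.055, 0.015, 0.005):
    print(p, "->", decide(p))
```

Only the first value, 0.055, is above 0.05, so it is the only one for which H0 is not rejected.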

What is the difference between one tail and two tail hypothesis testing?

• 2-tail test: Critical region is on both sides of the distribution
  • H0: x = µ
  • H1: x ≠ µ
• 1-tail test: Critical region is on one side of the distribution
  • H0: x <= µ
  • H1: x > µ

Skewness and Kurtosis

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. Kurtosis is a
measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That
is, data sets with high kurtosis tend to have heavy tails, or outliers. A general guideline
for skewness is that if the number is greater than +1 or lower than –1, this is an indication of a
substantially skewed distribution. For kurtosis, the general guideline is that if the number is
greater than +1, the distribution is too peaked.
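
Both measures can be computed from the central moments of the data. The sketch below uses the simple moment-based (population) formulas and reports excess kurtosis, meaning kurtosis minus 3, so that a normal distribution scores 0. This choice of formula is an assumption: statistical packages often apply small-sample corrections and will report somewhat different numbers.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis

# Ski rental data from earlier in these notes; 96 is a high outlier.
rentals = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
skew, kurt = skewness_and_excess_kurtosis(rentals)
print(round(skew, 2), round(kurt, 2))
```

For this data both values come out above +1, which matches the guideline: the single large value produces a long right tail and a sharply peaked shape.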

Correlation coefficient

The strength of the linear association between two variables is quantified by the correlation
coefficient. The correlation coefficient always takes a value between -1 and 1, with 1 or -1
indicating perfect correlation (all points would lie along a straight line in this case).
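
A minimal sketch of the Pearson correlation coefficient, written out by hand so that each term is visible. The function name `pearson_r` and the sample points are illustrative:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points on a rising straight line
r_down = pearson_r(xs, [10 - 3 * x for x in xs])  # points on a falling straight line
print(r_up, r_down)
```

Points lying exactly on a rising line give a coefficient of 1, and points on a falling line give -1.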

Null Hypothesis and Alternative Hypothesis

The null hypothesis is a general statement that states that there is no relationship between two
phenomena under consideration or that there is no association between two groups. An alternative
hypothesis is a statement that describes that there is a relationship between two selected variables
in a study. The null hypothesis is the one to be tested and the alternative is everything else. For
example, the null hypothesis could be: The mean data scientist salary is 113,000 dollars.
The alternative would then be: The mean data scientist salary is not 113,000 dollars.

In an F-test, in order to reject the null hypothesis, the F-test statistic must be greater
than the F-critical value.

Five Steps in Hypothesis Testing:
• Specify the Null Hypothesis.
• Specify the Alternative Hypothesis.
• Set the Significance Level (α).
• Calculate the Test Statistic and Corresponding P-Value.
• Draw a Conclusion.
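
The five steps can be walked through for the salary hypothesis with a one-sample, two-tailed z-test. All sample figures below (sample mean, population standard deviation, sample size) are hypothetical and chosen only to illustrate the procedure. A z-test is assumed here because the population standard deviation is treated as known.

```python
import math
from statistics import NormalDist

# Steps 1 and 2: H0: mean salary = 113,000; H1: mean salary != 113,000.
mu_0 = 113_000

# Step 3: significance level.
alpha = 0.05

# Hypothetical sample summary (not from the original notes).
x_bar = 100_200  # sample mean salary
sigma = 15_000   # population standard deviation, assumed known
n = 30           # sample size

# Step 4: test statistic and two-tailed p-value.
z = (x_bar - mu_0) / (sigma / math.sqrt(n))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: conclusion.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(round(z, 2), decision)
```

With these made-up figures the sample mean is far below 113,000 relative to its standard error, so the null hypothesis is rejected.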

Type 1 and Type 2 errors in Hypothesis testing

A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually
true in the population; a type II error (false-negative) occurs if the investigator fails to reject a
null hypothesis that is actually false in the population.

Rejection Region

The rejection region is the set of values, measured in the sampling distribution of the
statistic under study, that leads to rejection of the null hypothesis H0 in a hypothesis
test.

Module 3
Parametric test and non-parametric test

A parametric statistical test is one that makes assumptions about the parameters (defining
properties) of the population distribution(s) from which one's data are drawn, while a
non-parametric test is one that makes no such assumptions.

The t-test determines whether the means of two populations are statistically different from each
other, whereas ANOVA is typically used to determine whether the means of three or more
populations are statistically different from each other. The one-way analysis of variance (ANOVA)
tests whether there are any statistically significant differences between the means of two or more
independent (unrelated) groups.
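
The F statistic behind one-way ANOVA compares the variation between group means with the variation within groups. The sketch below computes it by hand. The function name `one_way_f` and the three small groups are hypothetical, and the critical value needed to draw a conclusion would still have to come from an F table.

```python
def one_way_f(groups):
    """F statistic for a one-way ANOVA over a list of groups."""
    all_values = [x for g in groups for x in g]
    n = len(all_values)  # total number of observations
    k = len(groups)      # number of groups
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical data
f_stat = one_way_f(groups)
print(round(f_stat, 2))  # 13.0
```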

The Chi-Square Test of Independence determines whether there is an association between
categorical variables (i.e., whether the variables are independent or related). It is a
nonparametric test. This test is also known as the Chi-Square Test of Association. A
chi-square (χ²) statistic is a measure of the difference between the observed and expected
frequencies of the outcomes of a set of events or variables. χ² depends on the size of the
difference between observed and expected values, the degrees of freedom, and the sample size.
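
For a contingency table, the statistic sums (observed - expected)² / expected over all cells, where each expected count is row total × column total / grand total. In this sketch the function name and the 2×2 table of counts are hypothetical:

```python
def chi_square_statistic(observed):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof

# Hypothetical 2x2 table of counts.
observed = [[20, 30],
            [30, 20]]
chi2, dof = chi_square_statistic(observed)
print(chi2, dof)  # 4.0 1
```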

Simple linear graph

Power functions: These are the mathematical functions used in predictive analytical models
to describe phenomena that increase at a specific rate. They are represented by the formula
y = ax^b

Polynomial functions: y = ax^3 + bx^2 + cx + d

Exponential function: y = ab^x, y rises or falls at constantly increasing rates.

Regression
Regression is a statistical method used in finance, investing, and other disciplines that attempts to
determine the strength and character of the relationship between one dependent variable (usually
denoted by Y) and a series of other variables (known as independent variables).

R-squared (R2) is a statistical measure that represents the proportion of the variance for a
dependent variable that's explained by an independent variable or variables in a regression model.
It is also known as the coefficient of determination. In regression analysis, a higher
R-squared generally means that the model explains more of the changes in your outcome
variable.

Dummy variable for categorical data

A dummy variable, also called an indicator variable, is a numeric variable that represents
categorical data, such as gender, race, political affiliation, etc. Technically, dummy variables are
dichotomous, quantitative variables. Their range of values is small; they can take on only two
quantitative values (usually 0 and 1).

Note: If a categorical variable has n categories, n – 1 dummy variables are created.
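
A minimal sketch of this encoding in plain Python. The function name `dummy_encode`, the `is_` prefix and the category labels are illustrative. The first category in sorted order is treated as the baseline and gets no column of its own, which is why n categories yield n – 1 dummy variables.

```python
def dummy_encode(values):
    """Encode a categorical variable as n - 1 dummy (0/1) variables."""
    categories = sorted(set(values))
    baseline, *others = categories  # the baseline category gets no column
    return [{f"is_{c}": int(v == c) for c in others} for v in values]

# Hypothetical categorical variable with three categories.
affiliation = ["C", "B", "A", "B"]
rows = dummy_encode(affiliation)
for value, row in zip(affiliation, rows):
    print(value, row)
```

A row for the baseline category "A" has 0 in every column, so no information is lost by leaving its column out.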

Regression equation

y' = b0 + b1x, where b0 is the y-intercept and b1 is the slope.
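
The coefficients b0 and b1 are usually estimated by ordinary least squares. The sketch below fits the line by hand and also computes R-squared. The function names and the five data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of the intercept b0 and slope b1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

def r_squared(xs, ys, b0, b1):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
r2 = r_squared(xs, ys, b0, b1)
print(round(b0, 2), round(b1, 2), round(r2, 2))  # 2.2 0.6 0.6
```

Here the fitted equation is y' = 2.2 + 0.6x, and the line explains 60% of the variance in y.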

Principal component Analysis

Principal component analysis (PCA) is the process of computing the principal components and
using them to perform a change of basis on the data, sometimes using only the first few principal
components and ignoring the rest. Principal Component Analysis, or PCA, is a dimensionality-
reduction method that is often used to reduce the dimensionality of large data sets, by
transforming a large set of variables into a smaller one that still contains most of the information
in the large set. Eigenvalues, the scree plot, and the rotated factor matrix are used to
decide how many components to retain in PCA.

Note: The correlation between the factor scores and the original variables is used to decide
which factors to keep in the final solution.

Varimax rotation provides an orthogonal rotation of the original solution which maximizes the
variance of loadings in each factor.
