Sl. No. 1 | Module 1 | Difficulty level: Easy | Mapped to CO: 1
A manager at Gampco Inc. wishes to know the company's revenue and profit in its previous quarter. Which of the following business analytics will help the manager?
A. Prescriptive analytics
B. Normative analytics
C. Descriptive analytics
D. Predictive analytics
Correct answer: Descriptive analytics

Sl. No. 2 | Module 1 | Difficulty level: Easy | Mapped to CO: 1
Predictive analytics:
A. identifies the best alternatives to minimize or maximize an objective.
B. summarizes data into meaningful charts and reports that can be standardized or customized.
C. uses data to determine a course of action to be executed in a given situation.
D. detects patterns in historical data and extrapolates them forward in time.
Correct answer: detects patterns in historical data and extrapolates them forward in time.

Sl. No. 3 | Module 1 | Difficulty level: Difficult | Mapped to CO: 1
Which of the following questions will prescriptive analytics help a company address?
A. How many and what types of complaints did they resolve?
B. What is the best way of shipping goods from their factories to minimize costs?
C. What do they expect to pay for fuel over the next several months?
D. What will happen if demand falls by 10% or if supplier prices go up 5%?
Correct answer: What is the best way of shipping goods from their factories to minimize costs?

Sl. No. 4 | Module 1 | Difficulty level: Moderate | Mapped to CO: 1
Which of the following best defines objective functions in an optimization problem?
A. They are limitations, requirements, or other restrictions that are imposed on any solution in an optimization model.
B. They are quantities that an optimization model seeks to maximize or minimize.
C. They are quantities for which no feasible solutions exist in an optimization model.
D. They are unknown values that an optimization model seeks to determine.
Correct answer: They are quantities that an optimization model seeks to maximize or minimize.

Sl. No. 5 | Module 1 | Difficulty level: Moderate | Mapped to CO: 2
Roger wants to compare values across categories using vertical rectangles. Which of the following charts must Roger use?
A. Line chart
B. Pie chart
C. Clustered column chart
D. Stacked column chart
Correct answer: Clustered column chart

Sl. No. 6 | Module 1 | Difficulty level: Easy | Mapped to CO: 2
Which of the following charts provides a useful means for displaying data over time?
A. Scatter chart
B. Pie chart
C. A doughnut chart
D. Line chart
Correct answer: Line chart

Sl. No. 7 | Module 1 | Difficulty level: Easy | Mapped to CO: 2
Philip wishes to understand the relative proportion of each data source to the total. Which of the following charts must Philip use?
Options: Pie chart; Bar chart; Scatter chart; Column chart
Correct answer: Pie chart

Sl. No. 8 | Module 1 | Difficulty level: Moderate | Mapped to CO: 2
Observations consisting of pairs of variable data are required to construct a ________ chart.
A. doughnut
B. scatter
C. radar
D. line
Correct answer: scatter

Sl. No. 11 | Module 2 | Difficulty level: Easy | Mapped to CO: 1,3
Which of the following measures of location is calculated using the formula x̄ = (∑x) / n, where n is the number of observations?
A. midrange
B. sample mean
C. mode
D. median
Correct answer: sample mean

Sl. No. 12 | Module 2 | Difficulty level: Moderate | Mapped to CO: 1,3
Which of the following is a difference between a mean and a median?
A. A median is not affected by outliers; a mean is affected by outliers.
B. A mean divides the data half above it and half below it; a median does not.
C. A median is not meaningful for ratio data; a mean is meaningful to ratio data.
D. A mean is an observation that occurs most frequently; a median is the average of all observations.
Correct answer: A median is not affected by outliers; a mean is affected by outliers.

Sl. No. 13 | Module 2 | Difficulty level: Easy | Mapped to CO: 1,3
The ________ is the observation that occurs most frequently.
A. mode
B. mean
C. outlier
D. median
Correct answer: mode

Sl. No. 14 | Module 2 | Difficulty level: Easy | Mapped to CO: 1,3
Which of the following is an example of a measure of dispersion?
A. median
B. mode
C. variance
D. midrange
Correct answer: variance

Sl. No. 15 | Module 2 | Difficulty level: Easy | Mapped to CO: 1,3
The ________ measures the degree of asymmetry of observations around the mean.
Options: coefficient of variation; return to risk factor; coefficient of kurtosis; coefficient of skewness
Correct answer: coefficient of skewness

Sl. No. 16 | Module 2 | Difficulty level: Easy | Mapped to CO: 1,3
In statistics, ________ refers to the peakedness or flatness of a histogram.
Options: Sharpe ratio; entropy rate; Markov chain; kurtosis
Correct answer: kurtosis

Sl. No. 17 | Module 2 | Difficulty level: Easy | Mapped to CO: 1,3
Which of the following propositions describes an existing theory or belief?
Options: standard deviation; proportion; null hypothesis; alternative hypothesis
Correct answer: null hypothesis

Sl. No. 18 | Module 2 | Difficulty level: Difficult | Mapped to CO: 1,3
Which of the following is a Type I error?
Options:
- the null hypothesis is actually true, and the hypothesis test correctly fails to reject it
- the null hypothesis is actually true, but the hypothesis test incorrectly rejects it
- the null hypothesis is actually false, but the test incorrectly fails to reject it
- the null hypothesis is actually false, and the test correctly rejects it
Correct answer: the null hypothesis is actually true, but the hypothesis test incorrectly rejects it

Sl. No. 19 | Module 2 | Difficulty level: Moderate | Mapped to CO: 1,3
In order to reject the null hypothesis, the F-test statistic must be greater than the ________.
Options: p-value; variance; df; F crit
Correct answer: F crit

Sl. No. 20 | Module 2 | Difficulty level: Moderate | Mapped to CO: 1,3
Which of the following tests is used to determine if two categorical variables are independent?
Options: t-test; z-test; ANOVA; Chi-square test
Correct answer: Chi-square test
INTERNAL EXAMINATION: SEP–OCT 2019
SECTION-A
post_advt pre_advt
Mean 59.98 52.23
Variance 655.3865327 105.1227136
Observations 200 200
Pearson Correlation 0.42304019
Hypothesized Mean Difference 0
df 199
t Stat 4.723372724
P(T<=t) one-tail 2.18728E-06
t Critical one-tail 1.652546746
P(T<=t) two-tail 4.37456E-06
t Critical two-tail 1.971956544
H0: There is no significant difference in awareness before and after the advertising campaign.
Ha: There is a significant difference in awareness before and after the advertising campaign.
Since the two-tail p-value (0.00000437) is less than 0.05, the null hypothesis is rejected. Because mean awareness rose from 52.23 to 59.98, we can conclude that advertising had a significant impact in increasing brand awareness.
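As a cross-check, the t statistic in the table can be recomputed from the summary statistics alone. The sketch below assumes the output comes from a paired two-sample t-test (suggested by the Pearson correlation row and df = n - 1); the variable names are illustrative, not part of the original output.

```python
import math

# Summary statistics copied from the output table above.
mean_post, mean_pre = 59.98, 52.23
var_post, var_pre = 655.3865327, 105.1227136
n = 200
r = 0.42304019  # Pearson correlation between the paired samples

# Variance of the paired differences:
# Var(X - Y) = Var(X) + Var(Y) - 2 * r * sd(X) * sd(Y)
var_diff = var_post + var_pre - 2 * r * math.sqrt(var_post * var_pre)
std_error = math.sqrt(var_diff / n)
t_stat = (mean_post - mean_pre) / std_error

# Two-tail critical value taken from the table (df = 199, alpha = 0.05).
t_critical_two_tail = 1.971956544
reject_h0 = abs(t_stat) > t_critical_two_tail

print(f"t = {t_stat:.4f}, reject H0: {reject_h0}")
```

The recomputed value should agree with the reported t Stat to several decimal places, which supports reading the table as a paired test.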
Range
Skewness
Kurtosis
Standard Deviation
Variance
Standard Error
25th Percentile
50th Percentile
75th Percentile
Inter Quartile Range
c. The figure below shows the home selling price distributions of five cities in the USA: Chicago, Las Vegas, New York, Texas and Washington. Discuss and compare the home prices of these cities.
The lowest median selling price is in Washington and the highest is in Texas. The medians rank as follows: Texas > Chicago > Las Vegas > New York > Washington.
Las Vegas has a large 4th quartile, indicating that many homes have high selling prices. This is also reflected in one outlier price. The dispersion (d) ranks as follows: Las Vegas > Chicago > Texas > Washington > New York.
New York and Washington have the smallest standard deviations, and most of their home prices are clustered around the median.
There are two outliers in the New York home prices, on both the higher side and the lower side, and one outlier in the case of Las Vegas.
SECTION-B
2. Answer any TWO of the following three questions: 2X12=24
a. Classify Business Analytics into 3 categories of Descriptive, Predictive and Prescriptive and discuss each one of them.
Descriptive Analytics
Most businesses start with descriptive analytics: the use of data to understand past and current business
performance and make informed decisions. These techniques categorize, characterize, consolidate, and
classify data to convert it into useful information for the purposes of understanding and analyzing
business performance. Descriptive analytics summarizes data into meaningful charts and reports, for
example about budgets, sales, revenues, or cost. This process allows managers to obtain standard and
customized reports and then drill down into the data and make queries to understand the impact of an
advertising campaign for example, review business performance to find problems or areas of
opportunity, and identify patterns and trends in data. Typical questions that descriptive analytics help
answer are “How much did we sell in each region?” “ What was our revenue and profit last quarter?”
Predictive Analytics
Predictive analytics seeks to predict the future by examining historical data, detecting patterns or
relationships in these data, and then extrapolating these relationships forward in time. For example, a
marketer might wish to predict the response of different customer segments to an advertising campaign,
or a fashion manufacturer might want to predict next season’s demand for fashion of a specific color
and size. Using advanced techniques, predictive analytics can help to detect hidden patterns in large
quantities of data to segment and group data into coherent sets to predict behavior and detect trends.
Prescriptive Analytics
Prescriptive analytics uses optimization to identify the best alternative to minimize or maximize some
objective. Prescriptive analytics is used in many areas of business including operations, marketing, and
finance. For example, we may determine the best pricing and advertising strategy to maximize revenue,
the optimal amount of cash to store in an ATM, or the best mix of investments in a retirement portfolio
to manage risk. Prescriptive analytics addresses questions such as “How much should we produce to
maximize profit?” and “What is the best way of shipping goods from our factories to minimize costs?”
b. A sample of 100 individuals was asked to evaluate their preferences for three new proposed energy drinks
in a blind taste test. The sample space consists of two types of outcomes corresponding to each individual:
gender and brand preference. The table below shows the cross tabulation of the results.
Count of Respondent Column Labels
(Observed)
Row Labels Brand 1 Brand 2 Brand 3 Grand Total
Female 9 6 22 37
Male 25 17 21 63
Grand Total 34 23 43 100
Expected Frequency Brand 1 Brand 2 Brand 3 Grand Total
Female 12.58 8.51 15.91 37
Male 21.42 14.49 27.09 63
Grand Total 34 23 43 100
Calculate the chi-square value and determine whether the proportion of males who prefer a particular brand is no
different from the proportion of females, at the 5% level of significance.
Because the estimated chi-square value exceeds the critical value, we reject the null hypothesis that the brand
preference is not influenced by Gender.
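The chi-square statistic behind this conclusion can be reproduced from the observed counts. The following is a minimal sketch, assuming the standard test of independence with expected count = row total x column total / grand total; the critical value 5.991 is read from the chi-square table for df = 2 at the 0.05 level.

```python
# Observed counts from the cross tabulation (Brand 1, Brand 2, Brand 3).
observed = {"Female": [9, 6, 22], "Male": [25, 17, 21]}

row_totals = {gender: sum(counts) for gender, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

chi_square = 0.0
for gender, counts in observed.items():
    for j, obs in enumerate(counts):
        # Expected frequency under independence of gender and brand.
        expected = row_totals[gender] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

# Degrees of freedom = (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(col_totals) - 1)
critical_value = 5.991  # chi-square table, df = 2, alpha = 0.05

print(f"chi-square = {chi_square:.2f}, df = {df}, "
      f"reject H0: {chi_square > critical_value}")
```

With these counts the statistic comes out at roughly 6.49, which exceeds 5.991 and is consistent with rejecting the null hypothesis.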
What would be your recommendations for advertising campaigns based on the results?
The advertising campaigns should be separate for males and females as the brand preference is influenced by
Gender.
Chi-square table
df 0.995 0.99 0.975 0.95 0.9 0.1 0.05 0.025 0.01 0.005
1 -- -- 0.001 0.004 0.016 2.706 3.841 5.024 6.635 7.879
2 0.01 0.02 0.051 0.103 0.211 4.605 5.991 7.378 9.21 10.6
3 0.072 0.115 0.216 0.352 0.584 6.251 7.815 9.348 11.35 12.84
4 0.207 0.297 0.484 0.711 1.064 7.779 9.488 11.14 13.28 14.86
5 0.412 0.554 0.831 1.145 1.61 9.236 11.07 12.83 15.09 16.75
6 0.676 0.872 1.237 1.635 2.204 10.65 12.59 14.45 16.81 18.55
7 0.989 1.239 1.69 2.167 2.833 12.02 14.07 16.01 18.48 20.28
8 1.344 1.646 2.18 2.733 3.49 13.36 15.51 17.54 20.09 21.96
9 1.735 2.088 2.7 3.325 4.168 14.68 16.92 19.02 21.67 23.59
10 2.156 2.558 3.247 3.94 4.865 15.99 18.31 20.48 23.21 25.19
c. The marketing manager of DataCom Inc. wants to predict the annual revenues generated by its customers
given certain of their characteristics. The manager runs a regression model on Years of Loyalty, Years
Employed, Years of Marriage, Gender and Average Number of Products Purchased. The output of the
regression model is given below:
Regression Statistics
Multiple R 0.874778
R Square 0.765237
Observations 46
ANOVA
df SS MS F Significance F
Total 45 5444251402.96
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 25801.11 2488.444433 10.36836776 0.00 20771.77322 30830.44083
Years of Loyalty -208.634 237.4500522 -0.878644716 0.38 -688.5386908 271.2702232
Years Employed 638.4924 141.0373752 4.527114617 0.00 353.4451946 923.5395309
Gender* -1627.26 1753.634545 -0.927935442 0.36 -5171.487268 1916.967976
SECTION-C
3. Case study - Compulsory: 1X10=10
Kumar Soft Drink Bottling Company came into operation in 1984 and was operating in the NCR of Delhi and
in the states of Punjab and Haryana. The turnover of the company was Rs. 1.5 crore in 2010 and it was growing
at the rate of 10 per cent per annum. The chairman of the company, Mr. Kumar, wanted to examine whether
the flavour of the soft drink and the price level had any impact upon the sales. He wanted this because the
results could have implications for changing the product mix if required. Three types of flavours were
considered, namely, pineapple, mango and orange. Further, three levels of prices were taken into
consideration: Rs.10, Rs.12 and Rs.14. An experiment was conducted by randomly choosing a sample of 18
stores where the flavour of the soft drink and the price level were varied. The experiment period was one
month.
Coding for flavor: Pineapple =1
Mango =2
Orange =3
Coding for price: Rs.10/- =1
Rs.12/- =2
Rs.14/- =3
Questions
Is there any impact of the flavor or the price level independently upon the sales? Conduct the test using a 5
percent level of significance. Construct the hypotheses and interpret the results.
A one way ANOVA was run between flavor and sales. The results are summarized below:
Anova: Single
Factor
SUMMARY
Groups Count Sum Average Variance
Pineapple 6 23.3 3.883333333 2.169666667
Mango 6 19.9 3.316666667 0.885666667
Orange 6 19.6 3.266666667 1.430666667
ANOVA
Source of
Variation SS df MS F P-value F crit
Between Groups 1.407777778 2 0.703888889 0.470723733 0.633470362 3.68232
Within Groups 22.43 15 1.495333333
Total 23.83777778 17
A one way ANOVA was run between price and sales. The results are summarized below.
Anova: Single
Factor
SUMMARY
Groups Count Sum Average Variance
Rs10 6 28.7 4.783333 0.661667
Rs12 6 20 3.333333 0.326667
Rs 14 6 14.1 2.35 0.183
ANOVA
Source of
Variation SS df MS F P-value F crit
Between Groups 17.98111111 2 8.990556 23.02647 0.000027 3.68232
Within Groups 5.856666667 15 0.390444
Total 23.83777778 17
Ha: Sales is dependent on type of Price.
Since p value (0.000027) is less than 0.05, H0 can be rejected. Hence Sales is influenced by Price.
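Both F values above can be reproduced from the SUMMARY blocks alone. The sketch below is one way to do that, assuming a standard one-way ANOVA; the function name `one_way_anova_f` is hypothetical, and F crit is copied from the tables.

```python
def one_way_anova_f(counts, means, variances):
    """Return the one-way ANOVA F statistic from per-group summary statistics."""
    total_n = sum(counts)
    grand_mean = sum(n * m for n, m in zip(counts, means)) / total_n
    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(counts, means))
    # Within-groups sum of squares: pooled from the sample variances.
    ss_within = sum((n - 1) * v for n, v in zip(counts, variances))
    df_between = len(counts) - 1
    df_within = total_n - len(counts)
    return (ss_between / df_between) / (ss_within / df_within)

F_CRIT = 3.68232  # from the ANOVA tables (df = 2 and 15, alpha = 0.05)

f_flavor = one_way_anova_f(
    [6, 6, 6],
    [3.883333333, 3.316666667, 3.266666667],
    [2.169666667, 0.885666667, 1.430666667],
)
f_price = one_way_anova_f(
    [6, 6, 6],
    [4.783333, 3.333333, 2.35],
    [0.661667, 0.326667, 0.183],
)

print(f"Flavor: F = {f_flavor:.4f}, significant: {f_flavor > F_CRIT}")
print(f"Price:  F = {f_price:.4f}, significant: {f_price > F_CRIT}")
```

Only the price F statistic exceeds F crit, which matches the conclusion that sales are influenced by price while the flavor effect is not significant at the 5 percent level.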
a) What are the steps involved in data driven decision making process?
I. Identify the problem or opportunity
II. Identify sources of data(primary as well as secondary data)
III. Process the data for missing and incorrect values, and prepare the data for analytics model building
IV. Build the analytical models
V. Communicate the data analysis output and decisions effectively.
VI. Implement Solution/Decision
There are a number of households whose income is above the 4th quartile. All those cases are outliers. The
mean income, shown by the dark black line, is approximately $70,000, which can be determined through descriptive
statistics.
d) An independent sample t-test was conducted between Household income and retired households. Is
there a significant difference between HH income and retired and not retired Households?
Independent Samples Test (Household income in thousands)
                              F        Sig.   t        df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed       75.345   .000   10.185   6398      .000              46.45395          4.56093                 37.51300       55.39489
Equal variances not assumed   --       --     25.137   631.444   .000              46.45395          1.84806                 42.82485       50.08304
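H0: There is no significant difference in household income of retired and not retired households.
Ha: There is a significant difference in household income of retired and not retired households.
Since the p-value (0.000) is less than 0.05, we reject the null hypothesis and accept the alternative. The test for equality of variances is also significant (F = 75.345, Sig. = .000), so the "equal variances not assumed" row (t = 25.137) applies; the conclusion is the same under either row.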
Sales = 500 – 0.05 × 6.99 + 30 × 0 + 0.08 × 150 + 0.25 × 6.99 × 150 = 500 – 0.3495 + 0 + 12 + 262.125 = 773.7755
f) What are the 3 different types of forecast errors used to determine the accuracy in demand forecasting
methods?
SECTION B
Answer any 2 out of 3 questions 2X10=20 Marks
1. Classify Business Analytics into 3 categories of Descriptive, Predictive and Prescriptive and discuss each one
of them.
Descriptive Analytics
Most businesses start with descriptive analytics: the use of data to understand past and current business
performance and make informed decisions. These techniques categorize, characterize, consolidate, and
classify data to convert it into useful information for the purposes of understanding and analyzing
business performance. Descriptive analytics summarizes data into meaningful charts and reports, for
example about budgets, sales, revenues, or cost. This process allows managers to obtain standard and
customized reports and then drill down into the data and make queries to understand the impact of an
advertising campaign for example, review business performance to find problems or areas of
opportunity, and identify patterns and trends in data. Typical questions that descriptive analytics help
answer are “How much did we sell in each region?” “ What was our revenue and profit last quarter?”
Predictive Analytics
Predictive analytics seeks to predict the future by examining historical data, detecting patterns or
relationships in these data, and then extrapolating these relationships forward in time. For example, a
marketer might wish to predict the response of different customer segments to an advertising campaign,
or a fashion manufacturer might want to predict next season’s demand for fashion of a specific color
and size. Using advanced techniques, predictive analytics can help to detect hidden patterns in large
quantities of data to segment and group data into coherent sets to predict behavior and detect trends.
Prescriptive Analytics
Prescriptive analytics uses optimization to identify the best alternative to minimize or maximize some
objective. Prescriptive analytics is used in many areas of business including operations, marketing, and
finance. For example, we may determine the best pricing and advertising strategy to maximize revenue,
the optimal amount of cash to store in an ATM, or the best mix of investments in a retirement portfolio
to manage risk. Prescriptive analytics addresses questions such as “How much should we produce to
maximize profit?” and “What is the best way of shipping goods from our factories to minimize costs?”
2. A property broking company wanted to understand the factors which influence the price of a property. To
this end, it conducted a regression to understand the impact of the area of property in square feet and the age of
the house to determine the property prices. The table below shows the summary of regression output. Interpret
the results (R square, significant variables and residual Plot) and construct an equation for Property price.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.745494776
R Square 0.555762462
Observations 42
[Figure: Square Feet Residual Plot. x-axis: Square Feet (0 to 2,500); y-axis: Residuals (-50000 to 50000)]
R Square = 0.556
About 55.6% of the variation in the dependent variable (property price) is explained by the regression model.
Intercept and Square Feet are significant variables, as their p-values are less than 0.05.
Price of property = 47331.38 + 40.91 * Square feet
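To show how the fitted equation is used, here is a small sketch that evaluates it for a given area. The function name and the 1,500 square feet input are illustrative assumptions; only the two coefficients come from the answer above.

```python
INTERCEPT = 47331.38       # from the fitted equation above
SLOPE_PER_SQ_FT = 40.91    # price change per additional square foot

def predict_price(square_feet):
    """Predicted property price for a given area, using the fitted line."""
    return INTERCEPT + SLOPE_PER_SQ_FT * square_feet

# Hypothetical input: a 1,500 sq ft property (inside the plotted 0-2,500 range).
example_price = predict_price(1500)
print(f"Predicted price for 1,500 sq ft: {example_price:,.2f}")
```

Predictions are only reasonable within the range of areas used to fit the model, and with R Square near 0.56 a sizeable share of price variation remains unexplained.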
Month Demand(Units)
Jan 2016 1000
Feb 2016 1113
March 2016 1271
April 2016 1445
May 2016 1558
June 2016 1648
July 2016 1724
August 2016 1850
Sept 2016 1864
Oct 2016 2076
Nov 2016 2167
Dec 2016 2191
Use the naïve, 3-month moving average and cumulative methods to forecast demand for January 2017. Use MAD
to determine which method provides the highest accuracy.
Forecasts (computed from demand up to and including each month) and absolute deviations used for MAD
Month    Demand (Units)   Naïve   3MA       Cumulative   Abs. dev. Naïve   Abs. dev. 3MA   Abs. dev. Cumulative
Mar-16   1271             1271    1128.00   1128.00      158               --              214.50
Apr-16   1445             1445    1276.33   1207.25      174               317.00          317.00
May-16   1558             1558    1424.67   1277.40      113               281.67          350.75
Jun-16   1648             1648    1550.33   1339.17      90                223.33          370.60
Jul-16   1724             1724    1643.33   1394.14      76                173.67          384.83
Aug-16   1850             1850    1740.67   1451.13      126               206.67          455.86
Sep-16   1864             1864    1812.67   1497.00      14                123.33          412.88
Oct-16   2076             2076    1930.00   1554.90      212               263.33          579.00
Nov-16   2167             2167    2035.67   1610.55      91                237.00          612.10
Dec-16   2191             2191    2144.67   1658.92      24                155.33          580.45
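The three forecasting methods and their MAD values can be sketched in a few lines. This is an illustration rather than the examiner's worksheet: function names are hypothetical, and MAD is computed here over every month for which a method has enough history (an assumption, since the table above starts in March).

```python
# Monthly demand for Jan-Dec 2016, copied from the question.
demand = [1000, 1113, 1271, 1445, 1558, 1648,
          1724, 1850, 1864, 2076, 2167, 2191]

def naive(history):
    """Forecast = most recent observed demand."""
    return history[-1]

def moving_average_3(history):
    """Forecast = mean of the last three observations."""
    return sum(history[-3:]) / 3

def cumulative_mean(history):
    """Forecast = mean of all observations so far."""
    return sum(history) / len(history)

def mad(series, method, min_history):
    """Mean absolute deviation of one-step-ahead forecasts."""
    errors = [abs(series[t] - method(series[:t]))
              for t in range(min_history, len(series))]
    return sum(errors) / len(errors)

# Each method is paired with the minimum history it needs.
methods = {
    "naive": (naive, 1),
    "3MA": (moving_average_3, 3),
    "cumulative": (cumulative_mean, 1),
}

jan_2017_forecast = {name: fn(demand) for name, (fn, _) in methods.items()}
mad_by_method = {name: mad(demand, fn, k) for name, (fn, k) in methods.items()}
best_method = min(mad_by_method, key=mad_by_method.get)

for name in methods:
    print(f"{name}: Jan 2017 forecast = {jan_2017_forecast[name]:.2f}, "
          f"MAD = {mad_by_method[name]:.2f}")
print("Lowest MAD:", best_method)
```

Because demand trends upward all year, the methods that average over older months lag behind, so the naïve method ends up with the lowest MAD on this data.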
SECTION C
Malhotra Spices Company came into operation in 1960 and has operations in all parts of the country. It was
in the business of manufacturing and selling spices suitable for the Indian kitchen. They ventured into the
export markets in the 1980s as there was a huge demand for the spices in North America, Europe, Australia
and in the Middle East. This is because the number of the Indians residing in these countries had been
increasing at an exponential rate. The spices were packed into tetrapacks containing spices in different
quantities like 100, 150,200,250 and 500 gm. The 500 gm packages were mostly used by restaurants and
hoteliers. Mr K P Malhotra, Chairman of Malhotra Spices, was wondering whether they should change the
packaging from tetrapack to plastic or glass bottle packaging. Before taking a final decision, as an
experiment, the company introduced plastic and glass bottle packaging in addition to the existing tetra packs
packaging in the national capital region (NCR) of Delhi. Mr Malhotra was thinking that switching over to a
new packaging would involve a huge investment and if the results were not different for the other two types
of packaging, they would drop the idea of change in packaging.
The company on an experimental basis came up with three types of packaging-plastic, glass bottles and
tetrapacks-for the NCR market. They wanted to observe the sales of spices for the three types of
packaging. Mr Malhotra's younger brother told him that it is not only the type of packaging that influenced
the sales but also some external factors like the size of the store selling the spices.
Type of packaging
1= Plastic
2 = Glass
3 = Tetrapacks
Type of store
1 = large store
2 = Medium store
3 = Small store
A one way ANOVA was run between sales and type of packaging. The results are summarized below:
ANOVA
Source of
Variation SS df MS F P-value F crit
Total 6949.366667 29
ANOVA
Source of
Variation SS df MS F P-value F crit
Total 6949.366667 29
Module 1
1) Steps involved in “Data driven decision making”
The total cost of reaching consumers (C) depends upon the number of consumers (N), advertising
costs (A), and transportation costs (T). The linear cost prediction model is represented as: C = c -
nN + aA + tT where c, n, a, and t are constants and c estimates the total cost when the remaining
variables are zero.
Inference: Change in variables N, A, and T will not reflect any changes in c.
The manager at Soul Walk Inc., a shoe manufacturing company, wants to set a new price (P) for a
shoe model to maximize total profit. The demand (D) as a function of price is represented as: D =
1,500 - 2.5P. The total cost (C) as a function of demand is represented as: C = 3,200 + 3.5D. Which
of the following is a model for total profit as a function of price?
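One way to build the profit model is to substitute the demand function into revenue and cost. The sketch below assumes profit = revenue - cost with revenue = P × D (a standard assumption, not stated in the question); expanding gives Profit = -2.5P² + 1,508.75P - 8,450, and the code checks that both forms agree.

```python
def demand(price):
    """Demand as a function of price: D = 1,500 - 2.5P."""
    return 1500 - 2.5 * price

def total_cost(units):
    """Total cost as a function of demand: C = 3,200 + 3.5D."""
    return 3200 + 3.5 * units

def profit(price):
    """Profit = revenue - cost, assuming revenue = price * demand."""
    units = demand(price)
    return price * units - total_cost(units)

def profit_expanded(price):
    """Same model after substituting D and collecting terms in P."""
    return -2.5 * price ** 2 + 1508.75 * price - 8450

# Illustrative prices; both forms should give the same profit.
for p in (0, 50, 100, 300):
    print(p, profit(p), profit_expanded(p))
```

Writing the model both ways makes the algebra easy to verify before using it to search for the profit-maximizing price.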
Optimization problem is the problem of finding the best solution from all feasible solutions.
A problem with continuous variables is known as a continuous optimization problem.
Decision variables in Optimization problem are unknown values that the model seeks to determine
while constraints are the limitations, requirements, or other restrictions that are imposed on any
solution.
Different Types of Graphs
Quartiles
Note: The difference between the first and third quartiles is referred to as the interquartile range
Module 2
Mean, Mode, Median, and Standard Deviation
The sample mean is the average and is computed as the sum of all the observed
outcomes from the sample divided by the total number of events. We use x̄ as the
symbol for the sample mean. In math terms,
x̄ = (∑x) / n
where n is the sample size and the x correspond to the observed values.
Example
Suppose you randomly sampled six acres in the Desolation Wilderness for a non-
indigenous weed and came up with the following counts of this weed in this region:
We compute the sample mean by adding and dividing by the number of samples, 6.
The mode of a set of data is the number with the highest frequency. In the above
example 106 is the mode, since it occurs twice and the rest of the outcomes occur only
once.
The population mean is the average of the entire population and is usually impossible
to compute. We use the Greek letter μ for the population mean.
Note: The similarity between a midrange and mean is that both are affected by outliers.
One problem with using the mean is that it often does not depict the typical outcome. If
there is one outcome that is very far from the rest of the data, then the mean will be
strongly affected by this outcome. Such an outcome is called an outlier. An
alternative measure is the median. The median is the middle score. If we have an even
number of events we take the average of the two middles. The median is better for
describing the typical value. It is often used for income and home prices.
Example
Suppose you randomly selected 10 house prices in the South Lake Tahoe area. You
are interested in the typical house price. In units of $100,000, the prices were
2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8
If we computed the mean, we would say that the average house price
is $744,000. Although this number is true, it does not reflect the price for available
housing in South Lake Tahoe. A closer look at the data shows that the house valued
at 40.8 x $100,000 = $4.08 million skews the data. Instead, we use the median. Since
there is an even number of outcomes, we take the average of the middle two:
(3.7 + 4.1) / 2 = 3.9
The median house price is $390,000. This better reflects what house shoppers should
expect to spend.
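The house price example can be checked with Python's standard `statistics` module. This is only an illustration of the calculation above; the variable names are arbitrary.

```python
import statistics

# House prices in units of $100,000, from the example above.
prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]

mean_price = statistics.mean(prices)      # pulled upward by the 40.8 outlier
median_price = statistics.median(prices)  # average of the two middle values

print(f"Mean:   ${mean_price * 100_000:,.0f}")
print(f"Median: ${median_price * 100_000:,.0f}")
```

Dropping the single 40.8 value would move the mean substantially but would shift the median only from 3.9 to 3.7, which is why the median is the more robust summary here.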
There is an alternative value that also is resistant to outliers. This is called the trimmed
mean which is the mean after getting rid of the outliers or 5% on the top and 5% on the
bottom. We can also use the trimmed mean if we are concerned with outliers skewing
the data, however the median is used more often since more people understand it.
Example:
At a ski rental shop data was collected on the number of rentals on each of ten
consecutive Saturdays:
44, 50, 38, 96, 42, 47, 40, 39, 46, 50.
Mean = (44 + 50 + 38 + 96 + 42 + 47 + 40 + 39 + 46 + 50) / 10 = 49.2
Sorted, the data are 38, 39, 40, 42, 44, 46, 47, 50, 50, 96. Notice that there are two middle numbers, 44 and 46. To find the median we take the
average of the two.
Median = (44 + 46) / 2 = 45
Notice also that the mean is larger than all but three of the data points. The mean is
influenced by outliers while the median is robust.
The mean, mode, median, and trimmed mean do a nice job in telling where the center
of the data set is, but often we are interested in more. For example, a pharmaceutical
engineer develops a new drug that regulates sugar in the blood. Suppose she finds out
that the average sugar content after taking the medication is the optimal level. This
does not mean that the drug is effective. There is a possibility that half of the patients
have dangerously low sugar content while the other half have dangerously high
content. Instead of the drug being an effective regulator, it is a deadly poison. What
the engineer needs is a measure of how far the data is spread apart. This is what the
variance and standard deviation do. First we show the formulas for these
measurements. Then we will go through the steps on how to use the formulas.
s² = ∑(x − x̄)² / (n − 1)
s = √(s²)
Example
The owner of the Ches Tahoe restaurant is interested in how much people spend at the
restaurant. He examines 10 randomly selected receipts for parties of four and writes
down the following data.
44, 50, 38, 96, 42, 47, 40, 39, 46, 50
x̄ = 49.2
x x − 49.2 (x − 49.2)²
44 -5.2 27.04
50 0.8 0.64
38 -11.2 125.44
96 46.8 2190.24
42 -7.2 51.84
47 -2.2 4.84
40 -9.2 84.64
39 -10.2 104.04
46 -3.2 10.24
50 0.8 0.64
Total 2599.6
Now
s² = 2599.6 / (10 − 1) = 288.8
Hence the variance is about 289 and the standard deviation is the square root of 289 = 17.
Since the standard deviation can be thought of measuring how far the data values lie
from the mean, we take the mean and move one standard deviation in either
direction. The mean for this example was about 49.2 and the standard deviation was
17. We have:
49.2 - 17 = 32.2
and
49.2 + 17 = 66.2
What this means is that most of the patrons probably spend between $32.20 and $66.20.
The sample standard deviation will be denoted by s and the population standard
deviation will be denoted by the Greek letter σ.
The sample variance will be denoted by s² and the population variance will be denoted
by σ².
The variance and standard deviation describe how spread out the data is. If the data all
lies close to the mean, then the standard deviation will be small, while if the data is
spread out over a large range of values, s will be large. Having outliers will increase
the standard deviation.
One of the flaws of the standard deviation is that it depends on the units
that are used. One way of handling this difficulty is the coefficient of
variation, which is the standard deviation divided by the mean, times 100%:
CV = (s / x̄) × 100%
For the restaurant data:
CV = (17 / 49.2) × 100% = 34.6%
This tells us that the standard deviation of the restaurant bills is 34.6% of the mean.
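The variance, standard deviation and coefficient of variation for the restaurant data can be recomputed with the standard `statistics` module. This is a sketch using the ten values from the x column of the worked table; because it keeps full precision instead of rounding s to 17, its results can differ slightly in the last digit from the hand calculation.

```python
import statistics

# Restaurant bills from the x column of the worked table above.
bills = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

mean_bill = statistics.mean(bills)
sample_variance = statistics.variance(bills)  # divides by n - 1
sample_sd = statistics.stdev(bills)           # square root of the sample variance
cv_percent = sample_sd / mean_bill * 100      # coefficient of variation

print(f"mean = {mean_bill:.1f}")
print(f"variance = {sample_variance:.1f}, sd = {sample_sd:.1f}")
print(f"CV = {cv_percent:.1f}%")
print(f"one sd around the mean: {mean_bill - sample_sd:.1f} to {mean_bill + sample_sd:.1f}")
```

Note that `statistics.variance` and `statistics.stdev` use the sample (n - 1) formulas, while `statistics.pvariance` and `statistics.pstdev` divide by n for a full population.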
A z-score (also called a standard score) gives you an idea of how far from the mean a
data point is. But more technically it’s a measure of how many standard
deviations below or above the population mean a raw score is.
A z-score can be placed on a normal distribution curve. Most z-scores fall between -3 standard
deviations (the far left of the normal distribution curve) and +3
standard deviations (the far right of the normal distribution curve).
In order to use a z-score, you need to know the mean μ and also the population standard
deviation σ.
Z-scores are a way to compare results to a “normal” population. Results from tests or
surveys have thousands of possible results and units; those results can often seem
meaningless. For example, knowing that someone’s weight is 150 pounds might be
good information, but if you want to compare it to the “average” person’s weight,
looking at a vast table of data can be overwhelming (especially if some weights are
recorded in kilograms). A z-score can tell you where that person’s weight is compared
to the average population’s mean weight.
z = (x – μ) / σ
For example, let’s say you have a test score of 190. The test has a mean (μ) of 150 and
a standard deviation (σ) of 25. Assuming a normal distribution, your z score would be:
• z = (x – μ) / σ
• = (190 – 150) / 25 = 1.6.
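The z-score calculation above is simple to express as a function. A minimal sketch follows; the function name and the guard against a non-positive standard deviation are illustrative choices.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Test score example from the text: x = 190, mean = 150, standard deviation = 25.
z = z_score(190, 150, 25)
print(f"z = {z:.1f}")
```

A positive result means the raw score is above the mean; a score of 110 on the same test would give z = -1.6.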
• Center – middle of the data. Mean / Median / Mode are the most commonly used as
measures.
• Mean – average of all the numbers
• Median – the number in the middle
• Mode – the number that occurs the most. The disadvantage of using Mode is that
there may be more than one mode.
• Spread – How the data is dispersed. Range / IQR / Standard Deviation / Variance are the
most commonly used as measures.
• Range = Max – Min
• Inter Quartile Range (IQR) = Q3 – Q1
• Standard Deviation (σ) = √(∑(x − µ)² / n)
• Variance = σ²
• Shape – the shape of the data can be symmetric or skewed
• Symmetric – the part of the distribution that is on the left side of the median is same
as the part of the distribution that is on the right side of the median
• Left skewed – the left tail is longer than the right side
• Right skewed – the right tail is longer than the left side
• Outlier – An outlier is an abnormal value
• Keep the outlier based on judgement
• Remove the outlier based on judgement
• Left skewed
• The left tail is longer than the right side
• Mean < median < mode
• Right skewed
• The right tail is longer than the left side
• Mode < median < mean
What is an outlier?
An outlier is an abnormal value (It is at an abnormal distance from rest of the data points).
• Remove outlier
• When we know the data-point is wrong (negative age of a person)
• When we have lots of data
• We should provide two analyses. One with outliers and another without outliers.
• Keep outlier
• When there are a lot of outliers (skewed data)
• When results are critical
• When outliers have meaning (fraud data)
• What is the difference between 95% confidence level and 99% confidence level?
• The confidence interval widens as we move from a 95% confidence level to a 99%
confidence level
• If the p-value is greater than the significance level (α), then we fail to reject H0
• If p-value = 0.055 (α = 0.05) – weak evidence against H0
• If the p-value is less than the significance level (α), then we reject H0
• If p-value = 0.015 (α = 0.05) – evidence against H0
• If p-value = 0.005 (α = 0.05) – strong evidence against H0
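The decision rule can be written as a one-line comparison of the p-value with the significance level. The sketch below uses a hypothetical helper name and the 0.05 level from the bullets above.

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule at significance level alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# The example p-values discussed above.
for p in (0.005, 0.015, 0.055):
    print(f"p = {p}: {decide(p)}")
```

A p-value just above the threshold, such as 0.055, does not show that H0 is true; it only means the evidence is not strong enough to reject it at the chosen level.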
What is the difference between one-tailed and two-tailed hypothesis testing?
Correlation coefficient
The strength of the linear association between two variables is quantified by the correlation
coefficient. The correlation coefficient always takes a value between -1 and 1, with 1 or -1
indicating perfect correlation (all points would lie along a straight line in this case).
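As an illustration, the correlation coefficient can be computed from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
r_perfect_positive = pearson_r(xs, [2, 4, 6, 8, 10])   # points on a rising line
r_perfect_negative = pearson_r(xs, [10, 8, 6, 4, 2])   # points on a falling line
r_noisy = pearson_r(xs, [2, 1, 4, 3, 5])               # positive but imperfect
print(r_perfect_positive, r_perfect_negative, r_noisy)
```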
The null hypothesis is a general statement that there is no relationship between two
phenomena under consideration or that there is no association between two groups. An alternative
hypothesis is a statement that there is a relationship between two selected variables
in a study. The null hypothesis is the one to be tested and the alternative is everything else. In
our example, the null hypothesis would be: the mean data scientist salary is 113,000 dollars.
The alternative would be: the mean data scientist salary is not 113,000 dollars.
In order to reject the null hypothesis, the F-test statistic must be greater than the F-critical value.
Five Steps in Hypothesis Testing:
• Specify the Null Hypothesis.
• Specify the Alternative Hypothesis.
• Set the Significance Level (α).
• Calculate the Test Statistic and Corresponding P-Value.
• Draw a Conclusion.
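The five steps above can be sketched on the salary example. The sample mean, standard deviation and sample size below are hypothetical, and treating the population standard deviation as known (so that a z-test applies) is an assumption made to keep the example within the standard library. The sketch also reports a one-tailed p-value, to show how it differs from the two-tailed one.

```python
import math
from statistics import NormalDist

# Step 1: null hypothesis H0: mean salary = 113,000.
mu_0 = 113_000
# Step 2: alternative hypothesis H1: mean salary != 113,000 (two-tailed).
# Step 3: significance level.
alpha = 0.05

# Hypothetical sample summary; sigma is assumed known.
sample_mean = 115_500
sigma = 15_000
n = 100

# Step 4: test statistic and p-value.
z = (sample_mean - mu_0) / (sigma / math.sqrt(n))
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
# One-tailed version, for H1: mean salary > 113,000.
p_one_tailed = 1 - NormalDist().cdf(z)

# Step 5: conclusion.
reject_two_tailed = p_two_tailed < alpha
reject_one_tailed = p_one_tailed < alpha
print(z, p_two_tailed, p_one_tailed, reject_two_tailed, reject_one_tailed)
```

With these made-up numbers the two-tailed test fails to reject H0 while the one-tailed test rejects it. A one-tailed test puts all of α in a single tail, so the direction of the alternative has to be chosen before looking at the data.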
A type I error (false positive) occurs if an investigator rejects a null hypothesis that is actually true in
the population; a type II error (false negative) occurs if the investigator fails to reject a
null hypothesis that is actually false in the population.
Rejection Region
The rejection region is the range of values, measured in the sampling distribution of the
statistic under study, that leads to rejection of the null hypothesis H0 in a hypothesis
test.
Module 3
Parametric test and non-parametric test
A parametric statistical test is one that makes assumptions about the parameters (defining
properties) of the population distribution(s) from which one's data are drawn, while a non-
parametric test is one that makes no such assumptions.
The t-test is a method that determines whether the means of two populations are statistically
different from each other, whereas ANOVA is typically used to determine whether the means of three
or more populations are statistically different from each other. The one-way analysis of variance
(ANOVA) is used to determine whether there are any statistically significant differences between
the means of two or more independent (unrelated) groups.
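As an illustration of one-way ANOVA, the F statistic is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it by hand; the function name `one_way_anova_f` and the group data are hypothetical. The F-critical value is not computed here and would come from an F table or a statistics package.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of numbers."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - mean) ** 2
        for group, mean in zip(groups, group_means)
        for v in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical groups.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

H0 (all group means equal) is rejected when this F statistic exceeds the F-critical value for (df_between, df_within) at the chosen significance level.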
Power functions: They are the mathematical functions used in predictive analytical models
which define phenomena that increase at a specific rate, and are represented by the formula
y = ax^b
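A short sketch of the power function follows. Taking logarithms turns y = ax^b into the straight line ln y = ln a + b ln x, so a and b can be recovered from two points. Both function names and the points are hypothetical, and the approach assumes positive x and y values.

```python
import math

def power_function(x, a, b):
    """Evaluate y = a * x ** b."""
    return a * x ** b

def fit_power_from_two_points(p1, p2):
    """Recover (a, b) from two (x, y) points, assuming x > 0 and y > 0."""
    (x1, y1), (x2, y2) = p1, p2
    b = math.log(y2 / y1) / math.log(x2 / x1)   # slope on the log-log scale
    a = y1 / x1 ** b
    return a, b

a, b = fit_power_from_two_points((1, 2), (2, 16))
y_at_3 = power_function(3, a, b)
print(a, b, y_at_3)
```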
Regression
Regression is a statistical method used in finance, investing, and other disciplines that attempts to
determine the strength and character of the relationship between one dependent variable (usually
denoted by Y) and a series of other variables (known as independent variables).
R-squared (R²) is a statistical measure that represents the proportion of the variance for a
dependent variable that's explained by an independent variable or variables in a regression model.
It is also known as the coefficient of determination. In regression analysis, a higher
R-squared generally means the model explains more of the variation in the outcome
variable, although a high value alone does not guarantee a good model.
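As a sketch, R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the observed and predicted values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    return 1 - ss_residual / ss_total

observed = [3.0, 5.0, 7.0, 9.0]

r2_perfect = r_squared(observed, [3.0, 5.0, 7.0, 9.0])    # predictions match exactly
r2_mean_only = r_squared(observed, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
r2_close = r_squared(observed, [3.5, 4.5, 7.5, 8.5])      # small errors
print(r2_perfect, r2_mean_only, r2_close)
```

A model that only predicts the mean explains none of the variance (R² = 0), while perfect predictions give R² = 1.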
A dummy variable, also called an indicator variable, is a numeric variable that represents
categorical data, such as gender, race, political affiliation, etc. Technically, dummy variables are
dichotomous, quantitative variables. Their range of values is small; they can take on only two
quantitative values (typically 0 and 1).
Note: If a categorical variable has n categories, n − 1 dummy variables are created
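The n − 1 rule can be sketched as follows: one category serves as the baseline and is represented by all zeros, and each remaining category gets its own 0/1 column. The function name `dummy_encode` and the choice of the alphabetically first category as baseline are assumptions for illustration.

```python
def dummy_encode(values, baseline=None):
    """Encode a categorical column as n - 1 dummy (0/1) columns.

    Returns the kept category names and one row of 0/1 flags per value.
    The baseline category is encoded as all zeros.
    """
    categories = sorted(set(values))
    if baseline is None:
        baseline = categories[0]          # assumed default: first alphabetically
    kept = [c for c in categories if c != baseline]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

colours = ["red", "green", "blue", "green"]
columns, encoded = dummy_encode(colours)
print(columns)
print(encoded)
```

Dropping one category avoids perfect collinearity among the dummy columns when the regression also includes an intercept.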
Regression equation
y' = b0 + b1x where b0 is the y-intercept and b1 is the slope.
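The coefficients of the regression equation can be estimated by ordinary least squares. This is a minimal sketch for a single independent variable; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the line y' = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted_at_5 = b0 + b1 * 5
print(b0, b1, predicted_at_5)
```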
Principal component analysis (PCA) is the process of computing the principal components and
using them to perform a change of basis on the data, sometimes using only the first few principal
components and ignoring the rest. It is a dimensionality-reduction method that is often used on
large data sets: it transforms a large set of variables into a smaller one that still contains
most of the information in the large set. Eigenvalues, the scree plot and the rotated factor
matrix are used to decide how many components (factors) to retain in PCA.
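To make the role of eigenvalues concrete, the sketch below runs PCA on just two variables, where the eigenvalues of the 2 × 2 covariance matrix have a closed form. Each eigenvalue is the variance captured by one principal component, and these are the values a scree plot displays. The function name `pca_2d_eigenvalues` and the data are hypothetical; real analyses with many variables would use a linear algebra library instead.

```python
import math

def pca_2d_eigenvalues(xs, ys):
    """Eigenvalues (largest first) of the sample covariance matrix of x and y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

    # Roots of the characteristic polynomial of [[var_x, cov_xy], [cov_xy, var_y]].
    trace = var_x + var_y
    det = var_x * var_y - cov_xy ** 2
    disc = math.sqrt(max(trace ** 2 / 4 - det, 0.0))
    return trace / 2 + disc, trace / 2 - disc

# Hypothetical data: y is roughly twice x, so one component should dominate.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

eig_1, eig_2 = pca_2d_eigenvalues(xs, ys)
explained = [eig_1 / (eig_1 + eig_2), eig_2 / (eig_1 + eig_2)]
print(eig_1, eig_2, explained)
```

When the first component explains nearly all the variance, the second can be dropped with little loss of information, which is the dimensionality reduction described above.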
Note: The correlations between the factor scores and the original variables (the factor loadings)
are used to decide which final factors to keep
Varimax rotation provides an orthogonal rotation of the original solution which maximizes the
variance of the squared loadings in each factor.