BADM
12/9/20 1
• Retail Store / Grocery store:
The number of items sold in a day is positively correlated with the
number of customers who visited the store.
• Finance / Banking
The profit of a financial institution is negatively correlated with
the number of bad loans.
• Education
The number of students enrolled in a university is related to the
number of professors and resources.
• Defense
The number of troops/cops assigned to a particular area is related to
the number of criminal activities in the area.
• Healthcare
The more patients there are with a certain disease or symptom, the
more new clinical trials there are related to that disease.
Height & weight
• Price and demand
• Volume and pressure of a perfect gas
Correlation
• Positive Correlation:
• Two variables are said to be positively correlated if they
deviate in the same direction.
• e.g. height & weight, income & expenditure
• Negative Correlation
• Two variables are said to be negatively correlated if they
deviate in opposite directions.
• e.g. volume and pressure of a perfect gas, price and demand
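• No Correlation
• Two variables are said to be uncorrelated if there is no linear
relationship between them. Statistically independent variables are
always uncorrelated, but uncorrelated variables are not necessarily
independent.
[Figure: scatter plot with one axis labelled "Intelligence level"]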
Karl Pearson’s coefficient of correlation
• The correlation coefficient between two random variables X and Y, usually denoted
by r(X,Y), is defined as
$$ r(X,Y) = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}} $$
Note:
• If r = +1, the correlation is perfectly positive.
• If r = -1, the correlation is perfectly negative.
• If r = 0, the variables are uncorrelated.
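$$ r = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} $$
$$ r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}} $$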
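The two forms of the correlation coefficient (raw sums and deviations from the mean) can be checked against each other with a short sketch. The height and weight values below are made up for illustration and are not from the slides.

```python
import math

def pearson_raw_sums(x, y):
    """r from the raw-sum formula: n*Sxy - Sx*Sy over the root of the two variance terms."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

def pearson_deviations(x, y):
    """r from deviations about the means of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Hypothetical data: height (cm) and weight (kg)
heights = [150, 160, 165, 172, 180]
weights = [52, 58, 63, 70, 79]

r1 = pearson_raw_sums(heights, weights)
r2 = pearson_deviations(heights, weights)
print(round(r1, 4), round(r2, 4))  # the two formulas agree
```

Both functions return a value between -1 and +1; a value near +1 here reflects the positive height and weight relationship used as an example earlier.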
REGRESSION ANALYSIS
Example of MLR
Suppose I have data on my monthly spending, monthly income, and the
number of trips per month for the last three years. I now need to answer
the following questions:
• What will my monthly spending be next year?
• Which factor (monthly income or number of trips per month) is
more important in determining my monthly spending?
• How are monthly income and trips per month correlated with
monthly spending?
Example of MLR
• In the credit card industry, a financial company may be
interested in minimizing the risk in its portfolio and may want to
understand the top five factors that cause a customer to default.
Example of MLR
A company wanted to be able to
estimate or predict how much fuel
they needed to transport building
materials to their oil wells so that
they could line them with concrete.
The data provided was:
• Number of wells
• Depth of wells
• Distance to wells
• Weight of materials
• Tonne kilometres
• Fuel costs
Example
• The table below shows some data from the early days
of the Italian clothing company Benetton.
Simple Regression Analysis
The General Idea
Simple regression considers the relation between a single explanatory
variable and a response variable.
In math courses, the slope-intercept form of the
equation of a line often takes the form
y=mx+b
where
m = slope of the line
b = y intercept of the line
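$$ \hat{y} = b_0 + b_1 x $$
$$ b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}, \qquad b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n} $$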
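As a minimal sketch, the slope and intercept formulas can be computed directly from the column sums. The function name `fit_line` and the data are illustrative assumptions, not part of the slides.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    # b1 = (Sxy - Sx*Sy/n) / (Sxx - Sx^2/n)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    # b0 = mean(y) - b1 * mean(x)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

# Hypothetical data lying exactly on the line y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

b0, b1 = fit_line(x, y)
print(b0, b1)

# Prediction for a new x value
x_new = 8
print(b0 + b1 * x_new)
```

Because the example points lie exactly on a line, the fit recovers that line; with real data the same function gives the line that minimizes the sum of squared residuals.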
Regression Nomenclature
Dependent Variable      Independent Variable
Explained Variable      Explanatory Variable
Regressand              Regressor
Predictand              Predictor
Endogenous Variable     Exogenous Variable
Controlled Variable     Control Variable
Target Variable         Stimulus Variable
Response Variable       Explanatory Variable
The sign of each coefficient indicates the direction of the relationship
between a predictor variable and the response variable.
Classwork Problems
The relationship between money spent on research and development and a chemical
firm's annual profit for the preceding 6 years is given below:
1. Use these data to develop a regression model to predict annual
profit.
2. Plot the scatter plot.
3. Which type of relationship is it?
4. Predict annual profit when the money spent on research and
development is $8 million.
$$ b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} $$
$$ b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n} $$
Residuals
o The residual, or error, of the regression model is the difference
between the y value and the predicted value. Each data point has
one residual.
Residual = Observed value - Predicted value
Residual = y - ŷ
o Because a linear regression model is not always appropriate for the
data, you should assess the appropriateness of the model by
defining residuals and examining residual plots.
o For a least-squares fit that includes an intercept, both the sum and the mean of the residuals are equal to zero.
RESIDUAL SUM OF SQUARES (RSS)
• The residual sum of squares (RSS), also known as the sum of squared
residuals (SSR) or the sum of squared errors of prediction (SSE), is the
sum of the squares of residuals.
The Standard Error of Estimate
The standard error of estimate measures the variability, or
scatter, of the observed values around the regression line.
$$ s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}} = \sqrt{\frac{RSS}{n-2}} $$
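The residuals, the RSS and the standard error of estimate can be sketched together as follows. The data and the helper `fit_line` are illustrative assumptions; the fit uses the least-squares slope and intercept formulas given earlier.

```python
import math

def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

# Hypothetical data
x = [2, 3, 5, 7, 9, 10]
y = [4.1, 5.2, 8.8, 11.9, 16.3, 17.0]

b0, b1 = fit_line(x, y)

# Residual = observed value - predicted value, one per data point
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# RSS is the sum of squared residuals
rss = sum(e ** 2 for e in residuals)

# Standard error of estimate uses n - 2 degrees of freedom
se = math.sqrt(rss / (len(x) - 2))

print(sum(residuals))  # zero up to floating-point rounding
print(rss, se)
```

The printed residual sum is zero only up to rounding error, which is the numerical counterpart of the statement that least-squares residuals sum to zero.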
Interpreting the Standard Error of Estimate
• We shall use the standard error of estimate as a tool in the same way that we
can use the standard deviation.
• The larger the standard error of estimate, the greater the scattering (or
dispersion) of points around the regression line.
• The smaller the standard error of estimate, the smaller the scattering (or
dispersion) of points around the regression line.
$$ R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} $$
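A small sketch of the R-squared formula, comparing two sets of predictions for the same observations. All values and the function name `r_squared` are hypothetical.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST for observed y and predicted y_hat."""
    y_bar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - y_bar) ** 2 for a in y)
    return 1 - sse / sst

# Hypothetical observations and two candidate sets of predictions
y = [10, 12, 15, 19, 24]
good_fit = [10.5, 11.8, 15.2, 18.6, 23.9]
poor_fit = [14, 15, 17, 16, 18]

print(r_squared(y, good_fit))  # close to 1
print(r_squared(y, poor_fit))  # noticeably lower
```

Predicting the mean of y for every point gives SSE = SST and therefore R-squared of 0, while perfect predictions give R-squared of 1.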
The coefficient of determination: R Square Value
In general, the higher the R-squared (close to 1, or
close to 100%), the better the model fits your data.
Which model is better?
[Figure: two fitted models, one with R² = 15% and one with R² = 85%]
Which model is better?
[Figure: two fitted models, one with R² = 38% and one with R² = 87.4%]
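$$ b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}, \qquad b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n} $$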
Classwork Problems
1. Calculate RSS and standard error of estimate.
2. Analyze the residuals.
3. Calculate the coefficient of determination (R square).
4. Comment on value of R square.
Measures of Variation
Total variation is made up of two parts:
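$$ SST = SSR + SSE $$
$$ SST = \sum (Y_i - \bar{Y})^2, \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2, \qquad SSE = \sum (Y_i - \hat{Y}_i)^2 $$
[Figure: for a point (Xi, Yi), the distances of Yi and the fitted value Ŷi from the mean Ȳ, illustrating SST, SSR and SSE]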
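The split of total variation into an explained part and an unexplained part can be verified numerically. This is a sketch on made-up data, with `fit_line` as an assumed helper implementing the least-squares formulas given earlier.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

# Hypothetical data
x = [1, 2, 3, 4, 5, 6]
y = [2.3, 4.1, 5.8, 8.4, 9.9, 12.2]

b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]
y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by the regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained (residual)

print(sst, ssr + sse)  # equal up to rounding
print(ssr / sst)       # share of variation explained, i.e. R-squared
```

The identity SST = SSR + SSE holds for a least-squares line with an intercept; it is not guaranteed for arbitrary predictions.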
Assumptions of Linear Regression
o Assumption 1: Linearity
o Assumption 2: Variables are measured without error (reliably)
o Assumption 3: Outliers/influential cases
o Assumption 4: Auto Correlation
o Assumption 5: Multicollinearity
o Assumption 6: Homoscedasticity
o Assumption 7: Residuals should be normally distributed
Assumption 1: Linearity
o First, linear regression requires the relationship
between the independent and dependent
variables to be linear.
o If data are given in pairs, then the scatter diagram of the data
is just the points plotted on the xy-plane.
o The scatter plot above represents the age vs. size of a plant.
o It is clear from the scatter plot that as the plant ages, its size tends to
increase.
o If it seems to be the case that the points follow a linear pattern well,
then we say that there is a high linear correlation, while if it seems
that the data do not follow a linear pattern, we say that there is no
linear correlation.
o If the data somewhat follow a linear path, then we say that there is a
moderate linear correlation.
o Given a scatter plot, we can draw the line that best fits the data
o Types of Regression Models:
Assumption 2: Variables are measured without error
(reliably):
Using Residuals to Test the Assumptions of the Regression Model
• A residual plot is a graph in which residuals are on
the vertical axis and the independent variable is on
the horizontal axis.
• If the dots are randomly dispersed around the
horizontal axis then a linear regression model is
appropriate for the data; otherwise, choose a
non-linear model.
• The following examples show a few patterns in residual plots.
Independence of Errors
Error values are statistically independent
If your points are following a
clear pattern, it might
indicate that the errors are
influencing each other. The
following image shows two
linear regression lines; on
the left, the points are
scattered randomly. On the
right, the points are clearly
influencing each other.
Assumption 3: Outliers/influential cases
Example:
Let us consider a dataset where
• y = foot length (cm) and
• x = height (in)
• for n = 33 male students in a statistics class.
A scatterplot of the male foot length and height data
shows one point labeled as an outlier.
• An outlier is a data point that lies very far from
the rest of the data.
• An outlier is an observation of data that does not fit the rest
of the data.
• It is sometimes called an extreme value.
• When you graph an outlier, it will appear not to fit the
pattern of the graph.
• Some outliers are due to mistakes (for example, writing
down 50 instead of 500) while others may indicate that
something unusual is happening.
Most common causes of outliers in a data set:
• Data entry errors (human errors)
• Measurement errors (instrument errors)
• Experimental errors (data extraction or experiment
planning/executing errors)
• Intentional (dummy outliers made to test detection methods)
• Data processing errors (data manipulation or data set unintended
mutations)
• Sampling errors (extracting or mixing data from wrong or various
sources)
Residual Plots to Detect Multivariate Outliers
Box Plot Diagram to identify Outliers
• A box plot, also called a box-and-whisker plot, is a graphical method based on
quartiles and the interquartile range that helps define the upper and lower
limits beyond which any data point is considered an outlier.
• One purpose of this diagram is to identify outliers and discard them from the
data series before any further analysis, so that the conclusions drawn from the
study are more accurate and are not influenced by extreme or abnormal values.
• Box plots can be used as an initial screening tool for outliers, as they provide a
graphical depiction of the data distribution and extreme values.
Example:
• The box part of the chart is as described above, except that the mean is shown as
an ×.
• The whiskers extend up from the top of the box to the largest data element that is
less than or equal to Q3 + 1.5 × IQR (the third quartile plus 1.5 times the
interquartile range) and down from the bottom of the box to the smallest data
element that is greater than or equal to Q1 - 1.5 × IQR.
• Values outside this range are considered to be outliers and are represented by
dots.
• The boundaries of the box and whiskers are as calculated by the values and
formulas shown in Figure.
• The only outlier is the value 1850 for Brand B, which is higher than the upper
whisker, and so is shown as a dot.
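The whisker rule can be sketched with the standard library. The data below are made up (only the idea of a single high value standing apart echoes the example), and the quartile method is an assumption: different tools compute quartiles slightly differently, so the fences may not match a given spreadsheet exactly.

```python
import statistics

def whiskers_and_outliers(data):
    """Return (lower whisker, upper whisker, outliers) using the 1.5 * IQR rule."""
    # "inclusive" quartiles are one common convention; others exist
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    # Whiskers end at the most extreme data points still inside the fences
    return min(inside), max(inside), outliers

# Hypothetical measurements with one unusually high value
values = [1020, 1100, 1150, 1200, 1250, 1300, 1850]

lower, upper, outliers = whiskers_and_outliers(values)
print(lower, upper, outliers)
```

Points returned in `outliers` are the ones a box plot would draw as separate dots beyond the whiskers.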
Assumption 4: Auto Correlation
• Economic activities of the past often have a strong effect on the present
and future economic activities.
• Students from the same class might perform more similarly to each
other than students from different classes
• Linear regression analysis requires that there is
little or no autocorrelation in the data.
How to check:
• Look at the Durbin-Watson (DW) statistic.
• It always lies between 0 and 4.
• DW = 2 implies no autocorrelation.
• 0 < DW < 2 implies positive autocorrelation.
• 2 < DW < 4 implies negative autocorrelation.
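The slides do not give the formula for the statistic; the sketch below assumes the usual definition, the sum of squared differences between successive residuals divided by the sum of squared residuals. The residual sequences are made up to show the two extremes.

```python
def durbin_watson(residuals):
    """DW = sum over t of (e_t - e_{t-1})^2, divided by the sum of e_t^2."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals that drift slowly: neighbours look alike
trending = [-3, -2, -1, 1, 2, 3]

# Hypothetical residuals that flip sign every step
alternating = [1, -1, 1, -1, 1, -1]

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(dw_trending)     # below 2: positive autocorrelation
print(dw_alternating)  # above 2: negative autocorrelation
```

In practice the residuals would come from the fitted regression, taken in time order.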
Remedial Measure for Autocorrelation:
• To remove the problem of autocorrelation from
the data, we have to transform the original data
and then apply the OLS technique to estimate the
parameters.
Assumption 5: Multicollinearity
• For example, you may be looking at contributions to town
charity organizations using a model that
includes the population of the town and the
total gross income of the town. These two predictors are likely to be
highly correlated, because towns with larger populations tend to have
higher total income.
• For example, you may be looking at customer loyalty to a shop using
a model that includes several different measures of satisfaction.
[Diagram: customer loyalty to a shop, predicted by satisfaction with quality of product and satisfaction with the network]
• You identify that two of these measures of
satisfaction (satisfaction with quality of product and
satisfaction with the network) are highly correlated
and determine that this is because customers don't tend
to distinguish their satisfaction in that way. Rather, both
measures of satisfaction are really a reflection of the
same measure of overall satisfaction.
• In this case, you could simply use overall
satisfaction as a predictor variable instead of the
separate measures of satisfaction.
[Diagram: customer loyalty to a shop, predicted by overall satisfaction]
• Multicollinearity occurs when independent variables in a regression
model are correlated.
• Remedial measure: drop one or several
predictor variables in order to lessen the
multicollinearity.
How to detect Multicollinearity?
The variance inflation factor for the jth predictor is:
$$ VIF_j = \frac{1}{1 - R_j^2} $$
where $R_j^2$ is the R²-value obtained by regressing the jth predictor on
the remaining predictors.
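For the special case of exactly two predictors, regressing one on the other gives an R² equal to their squared correlation, so the VIF can be sketched without a regression library. The town figures below are made up, and the two-predictor restriction is an assumption of this sketch; with more predictors each R²j needs a multiple regression.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2), where R^2 = r^2 when there are only two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical towns: population (thousands) and total gross income (millions)
population = [12, 25, 40, 58, 75]
income = [30, 61, 95, 150, 180]

vif = vif_two_predictors(population, income)
print(vif)  # large value: the two predictors carry nearly the same information
```

A VIF of 1 means a predictor is uncorrelated with the others; the larger the VIF, the more the variance of that coefficient is inflated by multicollinearity.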
Assumption 6: Homoscedasticity
The plot shows a violation of this assumption
• Homoscedasticity (constant variance of the residuals across all values of
the predictors) can also be tested using a scatter plot of residuals vs. fitted values.
Residual Analysis for Equal Variance
[Figure: plots of Y vs. x and of residuals vs. x for two cases, non-constant variance and constant variance]
Assumption 7: Assumption of Normality
Inferences About the Slope
The standard error of the regression slope coefficient (b1) is estimated by
$$ S_{b_1} = \frac{S_e}{\sqrt{\sum (X_i - \bar{X})^2}} $$
where:
$S_{b_1}$ = estimate of the standard error of the slope
$S_e = \sqrt{\frac{SSE}{n-2}}$ = standard error of the estimate
t Test
Is there a linear relationship between X and Y?
Null and alternative hypotheses
◦ H0: β1 = 0 (no linear relationship)
◦ H1: β1 ≠ 0 (linear relationship does exist)
Test statistic
$$ t_{STAT} = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad \text{d.f.} = n - 2 $$
where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope
If the p-value of your regression estimate is less than 0.05 (or
5%), then you can conclude that, in the population from which
the sample is drawn, there is a true, non-zero relationship
between Y and X; that is, the predictor variable is statistically
significant.
t Test Example
H0: β1 = 0
H1: β1 ≠ 0
$$ t_{STAT} = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{0.10977 - 0}{0.03297} = 3.32938 $$
d.f. = 10 - 2 = 8
α/2 = 0.025 in each tail, so the critical values are ±tα/2 = ±2.3060
Test statistic: tSTAT = 3.329, which is greater than 2.3060
Decision: Reject H0
There is sufficient evidence that square footage affects house price.
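The arithmetic of this example can be reproduced in a few lines. The slope, its standard error, the sample size and the critical value are taken from the example; the variable names are illustrative, and the critical value is entered by hand rather than computed from the t distribution.

```python
# Values from the t test example
b1 = 0.10977        # estimated slope
beta1_h0 = 0.0      # hypothesized slope under H0
s_b1 = 0.03297      # standard error of the slope
n = 10              # sample size
t_crit = 2.3060     # two-tailed critical value for alpha = 0.05, d.f. = 8

t_stat = (b1 - beta1_h0) / s_b1
df = n - 2

# Two-tailed test: reject H0 when |t| exceeds the critical value
reject_h0 = abs(t_stat) > t_crit

print(round(t_stat, 3), df, reject_h0)
```

The decision depends only on comparing the absolute value of the test statistic with the critical value for n - 2 degrees of freedom.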
• It is common in regression analysis to compute an F test to assess the
overall significance of the model.
Alternative hypothesis:
• HA: at least one coefficient is not 0
• This implies that at least one of the explanatory variables is a
significant predictor of the response variable, i.e., the model is significant.
• If the p-value from the F test is less than the level of significance (LOS),
the null hypothesis is rejected and we should continue the analysis.
F Test for Significance
F test statistic:
$$ F_{STAT} = \frac{MSR}{MSE} $$
where
$$ MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1} $$
ANOVA
             df   SS            MS            F         Significance F
Regression   1    18934.9348    18934.9348    11.0848   0.01039
Residual     8    13665.5652    1708.1957
Total        9    32600.5000
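The ANOVA entries can be recomputed from the two sums of squares. The sketch assumes k is the number of explanatory variables (1 here) and n is the sample size (10), which matches the degrees of freedom in the table; the significance level of 0.05 is also an assumption.

```python
# Sums of squares from the ANOVA table
ssr = 18934.9348    # regression sum of squares
sse = 13665.5652    # residual (error) sum of squares
k = 1               # number of explanatory variables (assumed from df = 1)
n = 10              # sample size (assumed from total df = 9)

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse

sst = ssr + sse     # matches the Total row

significance_f = 0.01039   # p-value reported in the table
alpha = 0.05               # assumed level of significance
model_is_significant = significance_f < alpha

print(round(mse, 4), round(f_stat, 4), round(sst, 4), model_is_significant)
```

Since the reported Significance F is below the assumed 0.05 level, the null hypothesis is rejected and the model is taken to be significant.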
F Test for Significance
(continued)