Simple Regression
[Scatter plots of SalePrice against LotArea, Age, and LivingArea]
Covariance

Cov(X,Y) = SXY = Σᵢ (Xi − X̄)(Yi − Ȳ) / (n − 1), where the sum runs over i = 1, …, n
              SalePrice       LivingArea     LotArea        Age
SalePrice     6306788585
LivingArea    0.708624478     1
LotArea       0.263843354     0.263116167    1
Age           -0.523350418    -0.200302496   -0.014831559   1
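As an illustration, a minimal pandas sketch (hypothetical values, not the actual housing data set) of how the covariance defined above and a pairwise correlation table like the one shown here could be computed:

```python
import pandas as pd

# Hypothetical data frame with the housing variables referenced above
df = pd.DataFrame({
    "SalePrice":  [210000, 180000, 225000, 140000, 250000],
    "LivingArea": [1700, 1260, 1790, 1720, 2200],
    "LotArea":    [8400, 9600, 11300, 9500, 14300],
    "Age":        [5, 31, 7, 91, 8],
})

# Sample covariance: sum((Xi - Xbar)(Yi - Ybar)) / (n - 1)
print("Cov(SalePrice, LivingArea) =", df["SalePrice"].cov(df["LivingArea"]))

# Pairwise correlation matrix: dimensionless, each entry lies between -1 and 1
print(df.corr())
```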
Probable Error
• P.E. = 0.6745 × (1 − r²) / √n
• If r < P.E., then the correlation is not significant
• If r > 6 P.E., then the correlation is certain
Correlation
• Correlation is dimensionless - it is standardized using standard deviations
• It always takes a value between -1 and 1
• Close to +1 implies a linear relationship with a positive slope
• Close to -1 implies a linear relationship with a negative slope
• Close to 0 implies that there is no linear relationship
• Correlation describes the direction and strength of the relationship but cannot readily be used for “predictive” purposes
• e.g. If the annual salary goes up by $1,000, how much do we expect entertainment spending to change?
Correlation
• Correlation is not the same as causation; it can result from “lurking” variables
• e.g. Fire damage and number of fire engines
• Correlation captures the association between two variables at a time
• Cause-effect relationships are captured by regression
Regression Analysis
• How to estimate a linear fit?
• How to interpret the slope and the intercept of the fitted line?
• How to quantify the goodness of fit?
Learning Objectives
• What is a simple regression model (SRM)?
• How to draw statistical inference about the model parameters?
• How to construct prediction intervals for the response variable?
• What are the key assumptions required on the population for inference and prediction?
• What important diagnostic checks should be run before interpreting regression output?
Linear Fit: Beyond Correlation
[Scatter plot: Profit vs Ad Exp]
Interpretation of b0 and b1
[Scatter plot: PBT vs AdEx]

Yi = b0 + b1·Xi + ei
[Scatter plot: PBT vs AdEx, marking an observed value, its estimated (fitted) value, and the deviation between them]

Estimated: Ŷi = b0 + b1·Xi
Choosing the Right Line
• The error in estimation is given by ei = Yi − Ŷi = Yi − (b0 + b1·Xi)
• b0 = −310.62; b1 = 7.0679
• PBT = −310.62 + 7.0679 AdEx
Ordinary Least Squares
• Choose b0 and b1 to minimize the sum of squared errors: Σ ei² = Σ (Yi − b0 − b1·Xi)²
• The resulting estimators b0 and b1 are called the Ordinary Least Squares (OLS) estimates
• b1 = SXY / SX² = Cov(X,Y) / Var(X), and b0 = Ȳ − b1·X̄
Properties of the OLS Line
• Useful properties of the OLS linear fit (see the sketch below):
• Intercept: the regression line passes through (X̄, Ȳ)
• Slope: the sign of Cov(X,Y) determines the direction of the line (+, −)
• Sum of the residuals around the best fitted line is zero: Σ ei = 0
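A minimal numpy sketch (made-up AdEx/PBT-style numbers, not the course data) showing how the OLS estimates follow from the covariance formula above and how these properties can be checked:

```python
import numpy as np

# Hypothetical advertising expenditure (x) and profit before tax (y) values
x = np.array([56.0, 59.0, 61.0, 64.0, 67.0, 70.0, 73.0, 75.0])
y = np.array([88.0, 105.0, 120.0, 142.0, 160.0, 178.0, 201.0, 215.0])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope: SXY / SX^2
b0 = y.mean() - b1 * x.mean()                         # intercept: Ybar - b1 * Xbar
residuals = y - (b0 + b1 * x)

print("b0 =", round(b0, 4), ", b1 =", round(b1, 4))
print("Line passes through (Xbar, Ybar):", np.isclose(b0 + b1 * x.mean(), y.mean()))
print("Sum of residuals (should be ~0):", round(residuals.sum(), 10))
```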
How Good is the Model Fit?
Ŷi = −310.62 + 7.0679·Xi
[Scatter plot: PBT vs AdEx showing, for a point (Xi, Yi), the residual ei = Yi − Ŷi, the deviation of the fitted value from the mean (Ŷi − Ȳ), the sample mean Ȳ, and the fitted line Ŷi = b0 + b1·Xi]
Sum of Squares Total (SST) = Sum of Squares Regression (SSR) + Sum of Squares Error (SSE)

Σ (Yi − Ȳ)² = Σ (Ŷi − Ȳ)² + Σ (Yi − Ŷi)²

df: (n − 1) = 1 + (n − 2)
R² = SSR / SST = [Correl(Y, Ŷ)]² = [Correl(Y, X)]²

RMSE = √( SSE / (n − 2) ) = √( (e1² + e2² + … + en²) / (n − 2) )
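A small continuation sketch of how the sums of squares, R², and RMSE defined above could be computed from the observed and fitted values of any simple regression:

```python
import numpy as np

def fit_summary(y: np.ndarray, yhat: np.ndarray) -> dict:
    """Return SST, SSR, SSE, R-squared, and RMSE for a simple regression fit."""
    sst = np.sum((y - y.mean()) ** 2)     # total variation in Y
    ssr = np.sum((yhat - y.mean()) ** 2)  # variation explained by the regression
    sse = np.sum((y - yhat) ** 2)         # unexplained (residual) variation
    n = len(y)
    return {
        "SST": sst, "SSR": ssr, "SSE": sse,
        "R2": ssr / sst,                  # equals Correl(Y, X)^2 in simple regression
        "RMSE": np.sqrt(sse / (n - 2)),   # residual standard error with n - 2 df
    }
```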
Excel regression output (PBT on Advertising Expenditure):

Regression Statistics
Multiple R          0.9379
R Square            0.8796
Adjusted R Square   0.8646
Standard Error      11.7646
Observations        10

ANOVA
             df   SS        MS        F       Significance F
Regression   1    8092.75   8092.75   58.47   6.04E-05
Residual     8    1107.25   138.41
Total        9    9200.00

                          Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept                 -310.6173      62.9636          -4.9333   0.0011    -455.8115   -165.4230
Advertising Expenditure   7.0679         0.9243           7.6466    0.0001    4.9364      9.1994
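For reference, a minimal sketch of producing this kind of output in Python with statsmodels instead of Excel (the adex and pbt arrays below are hypothetical placeholders for the actual data):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical arrays standing in for the AdEx and PBT columns of the data set
adex = np.array([56.0, 59.0, 61.0, 64.0, 67.0, 70.0, 73.0, 75.0])
pbt = np.array([88.0, 105.0, 120.0, 142.0, 160.0, 178.0, 201.0, 215.0])

X = sm.add_constant(adex)     # adds the intercept column
model = sm.OLS(pbt, X).fit()  # ordinary least squares fit

# summary() reports R-squared, the ANOVA-style F statistic, coefficients,
# standard errors, t statistics, p-values, and 95% confidence limits
print(model.summary())
```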
22
Obtaining Model Fit from Excel Output
Regression Statistics
Multiple R 0.9307
R Square 0.8662
Observations 20 ANOVA
df SS MS F Significance F
Regression
1 48674530.5 48674530 116.6 2.7102E-09
Residual
18 7515863.32 417548
Total 19 56190393.8
= 𝑆𝑆𝑅 = 48674530.5 =0.8662
𝑅
2
𝑆𝑆𝑇 56190393.8
Approximately 86% of the variation in PBT is explained by variation in Sales.
Simple Linear Regression Model
• Use a linear equation to model the population relationship between the variables:
Y = β0 + β1·X + ε
• We can use the sampling distributions of the estimators b0 and b1 for making inferences about the relationship between X and Y in the population
• Confidence intervals
• Hypothesis tests
• We can also use these distributions to construct prediction intervals for values of the response variable (Y) for a given value of the predictor variable (X)
Variances and Standard Errors
• Var(ei) = MSE = 138.41
• SE(ei) = se = RMSE = √MSE = √138.41 = 11.7646

ANOVA
             df   SS        MS
Regression   1    8092.75   8092.75
Residual     8    1107.25   138.41
Total        9    9200.00

Regression Statistics
Multiple R          0.9379
R Square            0.8796
Adjusted R Square   0.8646
Standard Error      11.7646
Observations        10

                          Coefficients   Standard Error
Intercept                 -310.6173      62.9636
Advertising Expenditure   7.0679         0.9243
Inference (II): Confidence Intervals
• We can use the point estimate b1 and se(b1) to construct a confidence interval for the slope parameter
• What is the 95% confidence interval for the slope parameter? (See the worked sketch below.)
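A worked sketch using the coefficient table above and the standard formula b1 ± t(0.025, n−2) · se(b1); the resulting limits agree with the Lower 95% / Upper 95% columns reported by Excel:

```python
from scipy import stats

b1, se_b1, n = 7.0679, 0.9243, 10      # slope estimate, its standard error, sample size
t_crit = stats.t.ppf(0.975, df=n - 2)  # two-sided 95% critical value with 8 df (about 2.306)

lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(f"95% CI for slope: [{lower:.4f}, {upper:.4f}]")  # approximately [4.9364, 9.1994]
```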
[Figure: fitted regression line with a 95% confidence interval band for the mean response and a wider 95% prediction interval band for individual values]
95% Confidence Intervals (Mean Values) and Prediction Intervals (Individual Values)
“Approximate” Prediction Interval (Individual Values)
• What is the 95% prediction interval for PBT for AdEx = 75 lakhs? (See the sketch below.)
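A minimal sketch, assuming the usual rule of thumb of roughly ±2·RMSE around the fitted value for an individual prediction (an exact prediction interval would also account for the estimation error in b0 and b1):

```python
# Fitted line and residual standard error taken from the regression output above
b0, b1, rmse = -310.62, 7.0679, 11.7646

adex = 75               # advertising expenditure (lakhs)
y_hat = b0 + b1 * adex  # point prediction for PBT, about 219.5

lower, upper = y_hat - 2 * rmse, y_hat + 2 * rmse
print(f"Approximate 95% PI for PBT: [{lower:.1f}, {upper:.1f}]")  # roughly [196, 243]
```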
2. Variance of ε is a constant for all values of X (Homoskedasticity): Var[ε|X] = σε². The variance of Y about the regression line is the same for all values of X and equals σε².
3. Values of εi are independent: Corr[εi, εj] = 0. The value of Y for a particular value of X is not related to the value of Y for another value of X. This condition will generally be satisfied for a simple random sample (SRS).
4. Error term is normally distributed: (ε|X) ~ N(0, σε²). The dependent variable Y is normally distributed for a given value of X, i.e., (Y|X) ~ N(β0 + β1·X, σε²).
Visual Interpretation of Assumptions
Diagnostic Checks: Using OLS Residuals
• We need to check the appropriateness of the following main assumptions (a plotting sketch follows the list)
1. E[ε|X] = 0
2. Homoskedasticity: Var[ε|X] = σε²
3. Corr[εi, εj] = 0 for all i ≠ j
4. Normality of errors: ε|X ~ N(0, σε²)
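A minimal matplotlib/scipy sketch of the two residual plots typically used for these checks; x (the predictor) and residuals (the OLS residuals) are assumed to come from an earlier fit:

```python
import matplotlib.pyplot as plt
from scipy import stats

def residual_diagnostics(x, residuals):
    """Plot residuals vs the predictor (checks E[e|X] = 0 and constant variance)
    and a normal quantile plot of the residuals (checks normality)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    ax1.scatter(x, residuals)
    ax1.axhline(0, linestyle="--")  # residuals should scatter randomly around zero
    ax1.set_xlabel("X")
    ax1.set_ylabel("Residual")

    stats.probplot(residuals, dist="norm", plot=ax2)  # points near the line suggest normality

    plt.tight_layout()
    plt.show()
```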
[Figure: scatter plot of Price ($000) vs Square Feet, and the corresponding residual plot (residuals vs Square Feet)]
• Standard errors reported by statistical packages are lower than the actual ones
Problem 2: Dependence and Autocorrelation
• The errors may be correlated with each other if data were collected over time
– e.g. return of a stock over time
• Often shows up as a pattern in the residuals, if plotted in chronological order
• Errors can also be correlated when the data structure is hierarchical or nested
– e.g. Salary of MBA students across different b-schools and GMAT scores
• Standard errors reported by statistical packages are lower than the actual ones
Problem 3: Departures from Normality
• Construct a quantile plot of the residuals instead of the original variables
• Inferences (hypothesis tests and confidence intervals) work pretty well even when the residuals are not strictly normal
[Figure: normal quantile plot of the residuals]
Problem 4: Outliers, Leverage Points, Influential Observations
• In the case of regression, outliers (unusual observations) can occur in the y or x variables
• Unusual observations in the x variable are called leverage points
• Influential observations are those that substantially change the OLS fit depending on whether they are included or not
• Typically, leverage points are suspect for being influential observations, as OLS penalizes large errors more (due to squaring)
• Not all leverage points are necessarily influential observations
[Figure: green line – best fit without the leverage point; red line – best fit with the leverage point]
Problem 5: Linearity Assumption
• In many applications of interest, the relationship between the dependent and the independent variable might be nonlinear
• Look for a transformation of the data (either X or Y or both variables) such that the relationship between the transformed variables is approximately linear (see the sketch below)
• The transformed data should satisfy the OLS assumptions
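A minimal statsmodels sketch of fitting a line to log-transformed variables, in the spirit of the log-log specification used in the pet food example that follows; the function and variable names are placeholders:

```python
import numpy as np
import statsmodels.api as sm

def fit_log_log(price: np.ndarray, volume: np.ndarray):
    """Fit ln(volume) = b0 + b1 * ln(price) by OLS and return the fitted model."""
    X = sm.add_constant(np.log(price))  # intercept + ln(price)
    return sm.OLS(np.log(volume), X).fit()

# Example usage with made-up numbers:
# model = fit_log_log(price=np.array([0.7, 0.9, 1.1, 1.3]),
#                     volume=np.array([120000.0, 90000.0, 70000.0, 55000.0]))
# print(model.params)  # [b0, b1]; in a log-log model b1 is read as a price elasticity
```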
Example: Pet Food Demand Curve Estimation
• The manager would like to estimate the relationship between demand and price using 2 years of data
[Figure: Sales Volume and Avg Price ($) over the 2-year period]
[Figure: scatter plot of Sales Volume vs Avg Price ($), and the corresponding residual plot (residuals vs Avg Price)]
Regression Statistics
Multiple R          0.9102
R Square            0.8285
Adjusted R Square   0.8268
Standard Error      6991.41
Observations        104

                Coefficients   Standard Error   t Stat    P-value
Intercept       190483.4       6226.106         30.5943   3.4E-53
Avg Price ($)   -125188.7      5640.396         -22.195   7.74E-41
Log-Log Transformation: Is there a pattern now?
• Obtain the logarithm transform for both sales and average price
[Figure: scatter plot of Ln(Sales_Volume) vs Ln(Avg_Price), and residuals vs predicted Ln(Sales_Volume)]
Interpreting the Estimates in Log Models

Model        Specification            Interpretation of β1
Log-Linear   ln(Y) = β0 + β1·X + ε    A one-unit change in X is associated with a 100·β1 % increase in Y
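A brief derivation sketch of this interpretation, comparing expected values at X and X + 1 and using the approximation ln(1 + p) ≈ p for small p:

```latex
\ln Y_{X+1} - \ln Y_{X} = \beta_1
\quad\Rightarrow\quad
\ln\frac{Y_{X+1}}{Y_{X}} = \beta_1
\quad\Rightarrow\quad
\frac{Y_{X+1} - Y_{X}}{Y_{X}} \approx \beta_1
```

so a one-unit change in X corresponds to roughly a 100·β1 % change in the expected value of Y.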