Simple Regression
[Scatter plots of SalePrice against LotArea, Age, and LivingArea]
Covariance

Cov(X,Y) = SXY = Σᵢ (Xi − X̄)(Yi − Ȳ) / (n − 1), where the sum runs over i = 1, …, n
              SalePrice       LivingArea     LotArea        Age
SalePrice     6306788585
LivingArea    0.708624478     1
LotArea       0.263843354     0.263116167    1
Age           -0.523350418    -0.200302496   -0.014831559   1
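As an illustration, a minimal pandas sketch (hypothetical values, not the actual housing data set) of how the covariance defined above and a pairwise correlation table like the one shown here could be computed:

```python
import pandas as pd

# Hypothetical data frame with the housing variables referenced above
df = pd.DataFrame({
    "SalePrice":  [210000, 180000, 225000, 140000, 250000],
    "LivingArea": [1700, 1260, 1790, 1720, 2200],
    "LotArea":    [8400, 9600, 11300, 9500, 14300],
    "Age":        [5, 31, 7, 91, 8],
})

# Sample covariance: sum((Xi - Xbar)(Yi - Ybar)) / (n - 1)
print("Cov(SalePrice, LivingArea) =", df["SalePrice"].cov(df["LivingArea"]))

# Pairwise correlation matrix: dimensionless, each entry lies between -1 and 1
print(df.corr())
```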
Probable Error
• P.E. = 0.6745 × (1 − r²) / √n
• If r < P.E., then the correlation is not significant
• If r > 6 P.E., then the correlation is certain
Correlation
• Correlation is dimensionless - it is standardized using standard deviations
• It always takes a value between -1 and 1
• Close to +1 implies a linear relationship with a positive slope
• Close to -1 implies a linear relationship with a negative slope
• Close to 0 implies that there is no linear relationship
• Correlation describes the direction and strength of the relationship but cannot readily be used for “predictive” purposes
• e.g. If the annual salary goes up by $1,000, how much do we expect entertainment spending to change?
Correlation
• Correlation is not the same as causation; it can result from “lurking” variables
• e.g. Fire damage and number of fire engines
• Correlation captures the association between two variables at a time
• Cause-effect relationships are captured by regression
Regression Analysis
• How to estimate a linear fit?
• How to interpret the slope and the intercept of the fitted line?
• How to quantify the goodness of fit?
Learning Objectives
• What is a simple regression model (SRM)?
• How to draw statistical inference about the model parameters?
• How to construct prediction intervals for the response variable?
• What are the key assumptions required on the population for inference and prediction?
• What important diagnostic checks should be run before interpreting regression output?
Linear Fit: Beyond Correlation
[Scatter plot: Profit vs Ad Exp]
Interpretation of b0 and b1
[Scatter plot: PBT vs AdEx]

Yi = b0 + b1·Xi + ei
[Scatter plot: PBT vs AdEx, marking an observed value, its estimated (fitted) value, and the deviation between them]

Estimated: Ŷi = b0 + b1·Xi
Choosing the Right Line
• The error in estimation is given by ei = Yi − Ŷi = Yi − (b0 + b1·Xi)
• b0 = −310.62; b1 = 7.0679
• PBT = −310.62 + 7.0679 AdEx
Ordinary Least Squares
• Choose b0 and b1 to minimize the sum of squared errors: Σ ei² = Σ (Yi − b0 − b1·Xi)²
• The resulting estimators b0 and b1 are called the Ordinary Least Squares (OLS) estimates
• b1 = SXY / SX² = Cov(X,Y) / Var(X), and b0 = Ȳ − b1·X̄
Properties of the OLS Line
• Useful properties of the OLS linear fit (see the sketch below):
• Intercept: the regression line passes through (X̄, Ȳ)
• Slope: the sign of Cov(X,Y) determines the direction of the line (+, −)
• Sum of the residuals around the best fitted line is zero: Σ ei = 0
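A minimal numpy sketch (made-up AdEx/PBT-style numbers, not the course data) showing how the OLS estimates follow from the covariance formula above and how these properties can be checked:

```python
import numpy as np

# Hypothetical advertising expenditure (x) and profit before tax (y) values
x = np.array([56.0, 59.0, 61.0, 64.0, 67.0, 70.0, 73.0, 75.0])
y = np.array([88.0, 105.0, 120.0, 142.0, 160.0, 178.0, 201.0, 215.0])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope: SXY / SX^2
b0 = y.mean() - b1 * x.mean()                         # intercept: Ybar - b1 * Xbar
residuals = y - (b0 + b1 * x)

print("b0 =", round(b0, 4), ", b1 =", round(b1, 4))
print("Line passes through (Xbar, Ybar):", np.isclose(b0 + b1 * x.mean(), y.mean()))
print("Sum of residuals (should be ~0):", round(residuals.sum(), 10))
```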
How Good is the Model Fit?
Ŷi = −310.62 + 7.0679·Xi
[Scatter plot: PBT vs AdEx showing, for a point (Xi, Yi), the residual ei = Yi − Ŷi, the deviation of the fitted value from the mean (Ŷi − Ȳ), the sample mean Ȳ, and the fitted line Ŷi = b0 + b1·Xi]
Sum of Squares Total (SST) = Sum of Squares Regression (SSR) + Sum of Squares Error (SSE)

Σ (Yi − Ȳ)² = Σ (Ŷi − Ȳ)² + Σ (Yi − Ŷi)²

df: (n − 1) = 1 + (n − 2)
R² = SSR / SST = [Correl(Y, Ŷ)]² = [Correl(Y, X)]²

RMSE = √( SSE / (n − 2) ) = √( (e1² + e2² + … + en²) / (n − 2) )
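A small continuation sketch of how the sums of squares, R², and RMSE defined above could be computed from the observed and fitted values of any simple regression:

```python
import numpy as np

def fit_summary(y: np.ndarray, yhat: np.ndarray) -> dict:
    """Return SST, SSR, SSE, R-squared, and RMSE for a simple regression fit."""
    sst = np.sum((y - y.mean()) ** 2)     # total variation in Y
    ssr = np.sum((yhat - y.mean()) ** 2)  # variation explained by the regression
    sse = np.sum((y - yhat) ** 2)         # unexplained (residual) variation
    n = len(y)
    return {
        "SST": sst, "SSR": ssr, "SSE": sse,
        "R2": ssr / sst,                  # equals Correl(Y, X)^2 in simple regression
        "RMSE": np.sqrt(sse / (n - 2)),   # residual standard error with n - 2 df
    }
```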
Excel regression output (PBT on Advertising Expenditure):

Regression Statistics
Multiple R          0.9379
R Square            0.8796
Adjusted R Square   0.8646
Standard Error      11.7646
Observations        10

ANOVA
             df   SS        MS        F       Significance F
Regression   1    8092.75   8092.75   58.47   6.04E-05
Residual     8    1107.25   138.41
Total        9    9200.00

                          Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept                 -310.6173      62.9636          -4.9333   0.0011    -455.8115   -165.4230
Advertising Expenditure   7.0679         0.9243           7.6466    0.0001    4.9364      9.1994
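For reference, a minimal sketch of producing this kind of output in Python with statsmodels instead of Excel (the adex and pbt arrays below are hypothetical placeholders for the actual data):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical arrays standing in for the AdEx and PBT columns of the data set
adex = np.array([56.0, 59.0, 61.0, 64.0, 67.0, 70.0, 73.0, 75.0])
pbt = np.array([88.0, 105.0, 120.0, 142.0, 160.0, 178.0, 201.0, 215.0])

X = sm.add_constant(adex)     # adds the intercept column
model = sm.OLS(pbt, X).fit()  # ordinary least squares fit

# summary() reports R-squared, the ANOVA-style F statistic, coefficients,
# standard errors, t statistics, p-values, and 95% confidence limits
print(model.summary())
```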
22
Obtaining Model Fit from Excel Output
Regression Statistics
Multiple R 0.9307
R Square 0.8662
Observations 20 ANOVA
df SS MS F Significance F
Regression
1 48674530.5 48674530 116.6 2.7102E-09
Residual
18 7515863.32 417548
Total 19 56190393.8
= 𝑆𝑆𝑅 = 48674530.5 =0.8662
𝑅
2
𝑆𝑆𝑇 56190393.8
Approximately 86% of the variation in PBT is explained by variation in Sales.
Simple Linear Regression Model
• Use a linear equation to model the population relationship between the variables:
Y = β0 + β1·X + ε
• We can use the sampling distributions of the estimators b0 and b1 for making inferences about the relationship between X and Y in the population
• Confidence intervals
• Hypothesis tests
• We can also use these distributions to construct prediction intervals for values of the response variable (Y) for a given value of the predictor variable (X)
Variances and Standard Errors
• Var(ei) = MSE = 138.41
• SE(ei) = se = RMSE = √MSE = √138.41 = 11.7646

ANOVA
             df   SS        MS
Regression   1    8092.75   8092.75
Residual     8    1107.25   138.41
Total        9    9200.00

Regression Statistics
Multiple R          0.9379
R Square            0.8796
Adjusted R Square   0.8646
Standard Error      11.7646
Observations        10

                          Coefficients   Standard Error
Intercept                 -310.6173      62.9636
Advertising Expenditure   7.0679         0.9243
Inference (II): Confidence Intervals
• We can use the point estimate b1 and se(b1) to construct a confidence interval for the slope parameter
• What is the 95% confidence interval for the slope parameter? (See the worked sketch below.)
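A worked sketch using the coefficient table above and the standard formula b1 ± t(0.025, n−2) · se(b1); the resulting limits agree with the Lower 95% / Upper 95% columns reported by Excel:

```python
from scipy import stats

b1, se_b1, n = 7.0679, 0.9243, 10      # slope estimate, its standard error, sample size
t_crit = stats.t.ppf(0.975, df=n - 2)  # two-sided 95% critical value with 8 df (about 2.306)

lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(f"95% CI for slope: [{lower:.4f}, {upper:.4f}]")  # approximately [4.9364, 9.1994]
```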
[Figure: fitted regression line with a 95% confidence interval band for the mean response and a wider 95% prediction interval band for individual values]
95% Confidence Intervals (Mean Values) and Prediction Intervals (Individual Values)
“Approximate” Prediction Interval (Individual Values)
• What is the 95% prediction interval for PBT for AdEx = 75 lakhs? (See the sketch below.)
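A minimal sketch, assuming the usual rule of thumb of roughly ±2·RMSE around the fitted value for an individual prediction (an exact prediction interval would also account for the estimation error in b0 and b1):

```python
# Fitted line and residual standard error taken from the regression output above
b0, b1, rmse = -310.62, 7.0679, 11.7646

adex = 75               # advertising expenditure (lakhs)
y_hat = b0 + b1 * adex  # point prediction for PBT, about 219.5

lower, upper = y_hat - 2 * rmse, y_hat + 2 * rmse
print(f"Approximate 95% PI for PBT: [{lower:.1f}, {upper:.1f}]")  # roughly [196, 243]
```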
2. Variance of ε is a constant for all values of X (Homoskedasticity): Var[ε|X] = σε². The variance of Y about the regression line is the same for all values of X and equals σε².
3. Values of εi are independent: Corr[εi, εj] = 0. The value of Y for a particular value of X is not related to the value of Y for another value of X. This condition will generally be satisfied for a simple random sample (SRS).
4. Error term is normally distributed: (ε|X) ~ N(0, σε²). The dependent variable Y is normally distributed for a given value of X, i.e., (Y|X) ~ N(β0 + β1·X, σε²).
Visual Interpretation of Assumptions
Diagnostic Checks: Using OLS Residuals
• We need to check the appropriateness of the following main assumptions (a plotting sketch follows the list)
1. E[ε|X] = 0
2. Homoskedasticity: Var[ε|X] = σε²
3. Corr[εi, εj] = 0 for all i ≠ j
4. Normality of errors: ε|X ~ N(0, σε²)
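A minimal matplotlib/scipy sketch of the two residual plots typically used for these checks; x (the predictor) and residuals (the OLS residuals) are assumed to come from an earlier fit:

```python
import matplotlib.pyplot as plt
from scipy import stats

def residual_diagnostics(x, residuals):
    """Plot residuals vs the predictor (checks E[e|X] = 0 and constant variance)
    and a normal quantile plot of the residuals (checks normality)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    ax1.scatter(x, residuals)
    ax1.axhline(0, linestyle="--")  # residuals should scatter randomly around zero
    ax1.set_xlabel("X")
    ax1.set_ylabel("Residual")

    stats.probplot(residuals, dist="norm", plot=ax2)  # points near the line suggest normality

    plt.tight_layout()
    plt.show()
```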
[Figure: scatter plot of Price ($000) vs Square Feet, and the corresponding residual plot (residuals vs Square Feet)]
• Standard errors reported by statistical packages are lower than the actual ones
Problem 2: Dependence and Autocorrelation
• The errors may be correlated with each other if data were collected over time
– e.g. return of a stock over time
• Often shows up as a pattern in the residuals, if plotted in chronological order
• Errors can also be correlated when the data structure is hierarchical or nested
– e.g. Salary of MBA students across different b-schools and GMAT scores
• Standard errors reported by statistical packages are lower than the actual ones
Problem 3: Departures from Normality
• Construct a quantile plot of the residuals instead of the original variables
• Inferences (hypothesis tests and confidence intervals) work pretty well even when the residuals are not strictly normal
[Figure: normal quantile plot of the residuals]
Problem 4: Outliers, Leverage Points, Influential Observations
• In the case of regression, outliers (unusual observations) can occur in the y or x variables
• Unusual observations in the x variable are called leverage points
• Influential observations are those that substantially change the OLS fit depending on whether they are included or not
• Typically, leverage points are suspect for being influential observations, as OLS penalizes large errors more (due to squaring)
• Not all leverage points are necessarily influential observations
[Figure: green line – best fit without the leverage point; red line – best fit with the leverage point]
Problem 5: Linearity Assumption
• In many applications of interest, the relationship between the dependent and the independent variable might be nonlinear
• Look for a transformation of the data (either X or Y or both variables) such that the relationship between the transformed variables is approximately linear (see the sketch below)
• The transformed data should satisfy the OLS assumptions
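A minimal statsmodels sketch of fitting a line to log-transformed variables, in the spirit of the log-log specification used in the pet food example that follows; the function and variable names are placeholders:

```python
import numpy as np
import statsmodels.api as sm

def fit_log_log(price: np.ndarray, volume: np.ndarray):
    """Fit ln(volume) = b0 + b1 * ln(price) by OLS and return the fitted model."""
    X = sm.add_constant(np.log(price))  # intercept + ln(price)
    return sm.OLS(np.log(volume), X).fit()

# Example usage with made-up numbers:
# model = fit_log_log(price=np.array([0.7, 0.9, 1.1, 1.3]),
#                     volume=np.array([120000.0, 90000.0, 70000.0, 55000.0]))
# print(model.params)  # [b0, b1]; in a log-log model b1 is read as a price elasticity
```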
Example: Pet Food Demand Curve Estimation
• The manager would like to estimate the relationship between demand and price using 2 years of data
[Figure: Sales Volume and Avg Price ($) over the 2-year period]
[Figure: scatter plot of Sales Volume vs Avg Price ($), and the corresponding residual plot (residuals vs Avg Price)]
Regression Statistics
Multiple R          0.9102
R Square            0.8285
Adjusted R Square   0.8268
Standard Error      6991.41
Observations        104

                Coefficients   Standard Error   t Stat    P-value
Intercept       190483.4       6226.106         30.5943   3.4E-53
Avg Price ($)   -125188.7      5640.396         -22.195   7.74E-41
Log-Log Transformation: Is there a pattern now?
• Obtain the logarithm transform for both sales and average price
[Figure: scatter plot of Ln(Sales_Volume) vs Ln(Avg_Price), and residuals vs predicted Ln(Sales_Volume)]
Interpreting the Estimates in Log Models

Model        Specification            Interpretation of β1
Log-Linear   ln(Y) = β0 + β1·X + ε    A one-unit change in X is associated with a 100·β1 % increase in Y
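A brief derivation sketch of this interpretation, comparing expected values at X and X + 1 and using the approximation ln(1 + p) ≈ p for small p:

```latex
\ln Y_{X+1} - \ln Y_{X} = \beta_1
\quad\Rightarrow\quad
\ln\frac{Y_{X+1}}{Y_{X}} = \beta_1
\quad\Rightarrow\quad
\frac{Y_{X+1} - Y_{X}}{Y_{X}} \approx \beta_1
```

so a one-unit change in X corresponds to roughly a 100·β1 % change in the expected value of Y.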