
Chapter 4

Simple Linear Regression

Copyright © 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.
Learning Objectives
• Understand the goals of simple linear regression analysis
• Consider what the error term contains
• Define the population regression model and the sample regression function
• Estimate the sample regression function
• Interpret the estimated sample regression function
• Predict outcomes based on our estimated sample regression function
• Assess the goodness-of-fit of the estimated sample regression function
• Understand how to read regression output in Excel
• Understand the difference between correlation and causation
Understand the Goals of Simple
Linear Regression Analysis
Regression analysis is used to:
– Obtain the marginal effect that a one-unit change in the independent variable has on the dependent variable
– Predict the value of the dependent variable based on the value of the independent variable

Dependent variable: the variable we wish to explain (also called the explained variable)

Independent variable: the variable used to explain the dependent variable (also called the explanatory variable)
Simple Linear Regression Model
• The term simple refers to the fact that there is only one independent variable, x
• The relationship between x and y is described by a linear function
• Regression refers to the manner in which the relationship is estimated
• Changes in y are assumed to be caused by changes in x (although this is not typically the case)
Types of Regression Models
[Figures: Positive Linear Relationship · Negative Linear Relationship · Relationship NOT Linear · No Relationship]
Population Linear Regression Model

The population regression model:

y = β₀ + β₁x + ε

where y is the dependent variable, β₀ the population y-intercept, β₁ the population slope coefficient, x the independent variable, and ε the random error term (or residual). β₀ + β₁x is the linear component; ε is the random error component.
Consider What the Random Error
Component, ε, Contains
• Omitted Variables – independent variables that are related to the dependent variable, y, but are not in the regression model (i.e., they are omitted).

• Measurement Error – the difference between the measured value of an observation and its true value. This can occur if there is a data entry error or if a person, firm, etc. does not know the true value and instead reports an incorrect value.
Consider What the Random Error
Component, ε, Contains
• Incorrect Functional Form – the wrong model is fit to the data. For example, a linear function is fit between y and x but the true relationship is quadratic.

• Random Component – the variable being studied is inherently random. Even if two people have the same number of years of education, they may earn different salaries due to random factors aside from the omitted factors listed above.
Estimated Regression Function
The sample regression line provides an estimate of the population regression line.

ŷᵢ = β̂₀ + β̂₁xᵢ

where ŷᵢ is the estimated (or predicted) value of y, β̂₀ the estimate of the regression intercept, β̂₁ the estimate of the regression slope, and xᵢ the independent variable.
What is a Residual?
A residual is the difference between the observed value of y and the predicted value of y. It is an estimate of the error term, ε: the error term resides in the population, while the residual comes from the sample.

eᵢ = yᵢ − ŷᵢ = observed value − predicted value
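A minimal Python sketch of this definition (the observed and predicted values below are hypothetical):

```python
# Residual for a single observation: e_i = y_i - y_hat_i
y_obs = 60_000   # hypothetical observed salary
y_hat = 58_900   # hypothetical predicted salary
e = y_obs - y_hat
print(e)  # 1100: this point lies above the fitted line
```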
Graph of the Sample Regression
Function

Graph of Predictions and Residuals
for Multiple Observations

Estimate the Sample Regression
Function
• β̂₀ and β̂₁ are obtained by minimizing the sum of the squared residuals with respect to β̂₀ and β̂₁:

min Σ(yᵢ − ŷᵢ)² = Σeᵢ² = Σ(yᵢ − β̂₀ − β̂₁xᵢ)²   (sums over i = 1, …, n)
The Least Squares Equation
• The formulas for β̂₁ and β̂₀ are:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = Cov(x, y) / Var(x)

and

β̂₀ = ȳ − β̂₁x̄
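As a sketch of these formulas in code (Python; the data here are small hypothetical lists, not the salary.xls values):

```python
# Least squares estimates for simple linear regression,
# following the formulas above.
def least_squares(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # beta1_hat = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = num / den
    # beta0_hat = y_bar - beta1_hat * x_bar
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(least_squares(x, y))  # intercept near 0, slope near 2 for this data
```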
Interpretation of the
Slope and the Intercept

• β̂₀ is, on average, the estimated value of y when x is equal to zero

• β̂₁ is, on average, the estimated change in the value of y resulting from a one-unit change in x
Salary (y) vs. Education (x)
Example in salary.xls
Σ(xᵢ − x̄)(yᵢ − ȳ) = 743,000

Σ(xᵢ − x̄)² = 66

x̄ = 16 and ȳ = 58,800
Example continued
β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 743,000 / 66 = 11,257.5758

β̂₀ = ȳ − β̂₁x̄ = 58,800 − (11,257.5758)(16) = −121,321.2121
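A quick Python check of this arithmetic (the sums and means are the slide values above):

```python
# Slide values for salary.xls
sxy = 743_000            # sum of (xi - x_bar)(yi - y_bar)
sxx = 66                 # sum of (xi - x_bar)^2
x_bar, y_bar = 16, 58_800

beta1_hat = sxy / sxx                  # 11257.5758
beta0_hat = y_bar - beta1_hat * x_bar  # -121321.2121
print(round(beta1_hat, 4), round(beta0_hat, 4))
```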
A Graphical Representation of the
Estimated Regression Line
[Figure: "Salary (Dollars) vs. Experience" – salary in dollars (0 to 160,000) plotted against years (10 to 22) with the fitted regression line]
Using Excel to Compute the Estimated
Regression Equation in a Scatter Plot
• Create a scatter diagram in Excel
• Position the mouse over any data point and right-click
• Select the Add Trendline option
• When the Add Trendline dialog box appears:
– On the Type tab select Linear (it is the default)
– On the Options tab select the Display equation on chart box (note the equation is displayed with the slope first and the intercept second)
– Click OK
Interpret the Estimated Sample
Regression Function

β̂₁: On average, if education goes up by one year, then salary will go up by $11,257.58.

β̂₀: On average, if an individual has 0 years of education, then their estimated salary is −$121,321.21 (this estimate is obviously ridiculous, since x = 0 lies far below the range of the data).
Predict Outcomes Based on our
Estimated Sample Regression Function
Say we want to predict salary for a person with 12 years of education. We would put this value of x into the sample regression function:

ŷ = −121,321.21 + 11,257.58(12) = $13,769.75

We predict a salary of $13,769.75 for a person with 12 years of education.
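The same prediction as a short Python sketch (coefficients are the estimates above):

```python
# Sample regression function with the estimated coefficients
def predict_salary(years_of_education):
    return -121_321.21 + 11_257.58 * years_of_education

print(predict_salary(12))  # 13769.75
```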
Assess the Goodness-of-Fit of the
Estimated Regression Function

Goodness-of-fit measures how well the regression model describes the observed data.

Two measures of goodness-of-fit:
(1) R-squared
(2) The standard error of the regression
Comparing the Goodness-of-Fit of
Two Hypothetical Data Sets

A Venn Diagram Demonstrating
Joint Variation between y and x

The Sample Regression Function
Explains None of the Variation in y

The Sample Regression Function
Explains Some of the Variation in y
The Sample Regression Function
Explains All of the Variation in y

Explained and Unexplained
Variation
• Total variation in the dependent variable is made up of two parts:

TSS = ESS + USS

(Total Sum of Squares = Explained Sum of Squares + Unexplained Sum of Squares)

TSS = Σ(y − ȳ)²   ESS = Σ(ŷ − ȳ)²   USS = Σ(y − ŷ)²

where:
ȳ = average value of the dependent variable
y = observed values of the dependent variable
ŷ = estimated value of y for the given x value
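These sums are straightforward to compute; a minimal Python sketch (y_hat is assumed to hold the least squares fitted values, for which TSS = ESS + USS holds exactly):

```python
# Decomposition of the total variation in y
def sums_of_squares(y, y_hat):
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total
    ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained
    uss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained
    return tss, ess, uss
```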
Explained and Unexplained
Variation
• TSS = total sum of squares
– Measures the total variation of the yᵢ values around the mean of y; this is the numerator of the variance of y
• ESS = explained sum of squares
– The variation in y that is explained by the independent variable x
• USS = unexplained sum of squares
– The variation in y attributable to factors other than the relationship between x and y
Explained and Unexplained
Variation
[Figure: for a single observation (xᵢ, yᵢ), the vertical distances from yᵢ to ȳ, from ŷᵢ to ȳ, and from yᵢ to ŷᵢ illustrate TSS = Σ(yᵢ − ȳ)², ESS = Σ(ŷᵢ − ȳ)², and USS = Σ(yᵢ − ŷᵢ)²]
Coefficient of Determination, R²

• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable

• The coefficient of determination is also called R-squared and is denoted R²

R² = ESS/TSS = 1 − USS/TSS, where 0 ≤ R² ≤ 1
Coefficient of Determination, R²

R² = ESS/TSS = (sum of squares explained by the regression) / (total sum of squares)

Note: In simple linear regression, R² is equal to the correlation coefficient squared:

R² = r²ₓᵧ

where:
R² = coefficient of determination
rₓᵧ = correlation coefficient between x and y
How are the Correlation Coefficient
and the Coefficient of Determination
Related?
R² = r²ₓᵧ

Note that this relationship holds only for simple linear regression.

rₓᵧ = (sign of β̂₁)√R²
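A small Python sketch of this sign rule (the R² and slope are the salary-example values; the negative slope is shown only for contrast):

```python
import math

# r_xy = (sign of the slope estimate) * sqrt(R^2)
def corr_from_r2(r2, beta1_hat):
    return math.copysign(math.sqrt(r2), beta1_hat)

print(corr_from_r2(0.6373, 11_257.58))   # ~ 0.798
print(corr_from_r2(0.6373, -11_257.58))  # ~ -0.798 if the slope were negative
```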
What is the Intuition Behind This
Relationship?
• In the case of a linear relationship between two variables, both the coefficient of determination and the sample correlation coefficient provide measures of the strength of the relationship.
• The coefficient of determination provides a measure
between 0 and 1 whereas the correlation coefficient
provides a measure between -1 and 1.
• The coefficient of determination can be used for nonlinear
relationships and for relationships that have two or more
independent variables.
• Why might the correlation coefficient be preferred to the
coefficient of determination?
Examples of Approximate R² Values

R² = 1

[Figures: data falling exactly on an upward-sloping line and exactly on a downward-sloping line]

Perfect linear relationship between x and y: 100% of the variation in y is explained by variation in x. Note that R² = +1 even if the line has a negative slope.
Examples of Approximate R² Values

0 < R² < 1

[Figures: data scattered around the fitted line]

Weaker linear relationship between x and y: some but not all of the variation in y is explained by variation in x.
Examples of Approximate R² Values

R² = 0

[Figure: data with no linear pattern; the fitted line is horizontal]

No linear relationship between x and y: the value of y does not depend on x (none of the variation in y is explained by variation in x).
What does R² mean?
• R² means that R² × 100% of the variation in y is explained by x.

• For example, if R² = 0.85, we would say that 85% of the variation in y is explained by x.
Calculating R² for the salary.xls example

R² = ESS/TSS = 8,364,378,788 / 13,125,600,000 = 0.6373

This says 63.73% of the variation in salary is explained by education.
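The division checks out in Python:

```python
ess = 8_364_378_788
tss = 13_125_600_000
print(round(ess / tss, 4))  # 0.6373
```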
Using Excel to Compute the
Coefficient of Determination
• Position the mouse pointer over any data point in the scatter diagram and right-click to display the chart menu
• Select the Add Trendline option
• When the Add Trendline dialog box appears: On the Options tab, select the Display R-squared value on chart box and click OK
The Standard Error of the Estimated
Sample Regression Function
The standard error of the regression measures, on average, how far the points fall from the regression line.

s_y|x = √(Unexplained SS / (n − k − 1)) = √(Σ(y − ŷ)² / (n − k − 1))

where k = the number of explanatory variables. In simple linear regression, k = 1.
Calculation of the Standard Error for
the salary.xls Example

s_y|x = √(Unexplained SS / (n − k − 1)) = √(4,761,221,212 / (10 − 1 − 1)) = 24,395.75
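The same calculation as a short Python check:

```python
import math

uss = 4_761_221_212   # unexplained sum of squares from the slide
n, k = 10, 1          # 10 observations, 1 explanatory variable
s_yx = math.sqrt(uss / (n - k - 1))
print(round(s_yx, 2))  # 24395.75
```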
Reading Regression Output in Excel:
Intercept and Slope

̂ 0
ˆ1
4-45
Reading Regression Output in Excel: R²

R² = ESS/TSS = 8,364,378,788 / 13,125,600,000 = 0.6373

63.73% of the variation in salary is explained by the variation in education.

[Excel ANOVA table: Regression (explained), Residual (unexplained), and Total rows]
Reading Regression Output in Excel:
Standard Error

s_y|x = √(USS / (n − k − 1)) = √(4,761,221,212 / 8) = 24,395.75

[Excel ANOVA table: Regression (explained), Residual (unexplained), and Total rows]
Excel’s Regression Tool
• Select the Tools menu
• Choose the Data Analysis option
• Choose Regression from the list of Analysis Tools
• Input y into the Input Y Range
• Input x into the Input X Range
• Select Labels
• Select Output Range in the sheet
• Click OK
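Outside Excel, the same estimates can be obtained programmatically; a sketch using SciPy's linregress with hypothetical education and salary data (not the salary.xls values):

```python
from scipy.stats import linregress

# Hypothetical data standing in for the salary.xls columns
education = [12, 14, 16, 16, 18, 20]
salary = [30_000, 45_000, 60_000, 55_000, 80_000, 95_000]

result = linregress(education, salary)
print(result.intercept, result.slope)  # beta0_hat and beta1_hat
print(result.rvalue ** 2)              # R-squared
```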
Understand the Difference between
Correlation and Causation
Correlation exists when there is a linear relationship between two random variables.
Causation occurs between two random variables when changes in one variable (say x) cause changes in another variable (say y).
Spurious correlation occurs when the correlation between two random variables results from each one's relationship with a third random variable.
Understand the Difference
between Correlation and Causation
Just because there is correlation between two random variables does not mean there is causation.
Examples:
• Having more firemen at a fire is linked to increased monetary damage from the fire.
• The number of shark attacks and ice cream sales are positively related.
• Students who are tutored tend to get worse grades than students who are not tutored.
See Google Correlate for more real-world examples of this phenomenon.
