
SMU Classification: Restricted

SESSION 10*
SIMPLE LINEAR REGRESSION
WMY Chapter 9, Parts 1-5
(Chapter 11 of Notes)

*Most Slides from Prof Yang Zhenlin



Learning Objectives
• Simple Linear Regression Model
• Least Squares Method of Estimation
• Measuring Goodness of Fit
• Inference on Regression Coefficients
• Predicting with the Model


Introduction
We are interested in the relationship between two numerical
variables X and Y.
• One of these variables, say X, is known in advance, called
the explanatory variable, or independent variable.
• The other variable, Y, is a random variable whose values, or general random behavior, are of interest. For this reason, Y is called the response variable, or dependent variable.
• If there is a strong relationship between X and Y, one can predict a future random variable Y, based on the known future value of X, through such a “relationship”.
• To study the relation, n pairs of observations on (X, Y) are collected, denoted (x1, y1), (x2, y2), . . . , (xn, yn).
• The Least Squares Method helps find such a relation.

Describing the Relationship

Scatter diagram: a plot of the pairs of observed values (x1, y1), (x2, y2), . . . , (xn, yn) of variables X and Y. It is a very effective graphical tool for “revealing” the relationship between variables.

[Figure: scatter diagram of Y against X]

Example 1
Prices of used cars and their odometer readings.
• A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
• A random sample of 100 cars is selected, and the data recorded.
• Construct a scatter plot of the data.

Car   Odometer (X)   Price (Y)
1     37388          14636
2     44758          14122
3     45833          14016
4     30862          15590
5     31705          15568
6     34010          14718
...   ...            ...

(Only the first six rows of the full 100-car data set are shown.)

Example 1 (continued)

[Figure: scatter plot of Price against Odometer]

The plot indeed shows a negative linear relation between the price and the odometer reading.

Summary Statistics
Besides the graphical display of the data, some numerical measures can be used to describe the direction and strength of the linear relationship between two variables.

• Sample means: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

• Sample variances: $s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ and $s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$

• Sample covariance: $\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

• Sample correlation coefficient: $r = \dfrac{\mathrm{Cov}(X, Y)}{s_X s_Y}$

This is called the ‘five statistics summary’ of the data.
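As a quick illustration, the five statistics can be computed directly in a few lines. The sketch below uses small made-up numbers (not the textbook's car data, which is not reproduced here) and, assuming NumPy is available, cross-checks against NumPy's built-in estimators, which also use the n − 1 divisor.

```python
import numpy as np

# Hypothetical sample for illustration only.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([5.0, 4.0, 3.5, 2.0, 1.0])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()                     # sample means
s2_x = ((x - x_bar) ** 2).sum() / (n - 1)             # sample variance of X
s2_y = ((y - y_bar) ** 2).sum() / (n - 1)             # sample variance of Y
cov_xy = ((x - x_bar) * (y - y_bar)).sum() / (n - 1)  # sample covariance
r = cov_xy / np.sqrt(s2_x * s2_y)                     # sample correlation

# Cross-check against NumPy's built-ins (ddof=1 gives the n-1 divisor).
assert np.isclose(s2_x, x.var(ddof=1))
assert np.isclose(cov_xy, np.cov(x, y, ddof=1)[0, 1])
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```

Here the covariance and correlation come out negative, matching the downward trend of the made-up data.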



Formulas for Covariance

$\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

Shortcut formulas:

$\mathrm{Cov}(X, Y) = \frac{1}{n-1}\left[\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}\right]$

$s_X^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right]; \quad s_Y^2 = \frac{1}{n-1}\left[\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right]$

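A minimal numerical check that the shortcut forms agree with the definitional forms, on arbitrary made-up values (assuming NumPy is available):

```python
import numpy as np

# Arbitrary illustrative sample.
x = np.array([1.0, 3.0, 4.0, 7.0, 9.0, 12.0])
y = np.array([2.0, 3.0, 6.0, 8.0, 9.0, 14.0])
n = len(x)

# Definitional forms (deviations from the means).
cov_def = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
s2x_def = ((x - x.mean()) ** 2).sum() / (n - 1)

# Shortcut forms (raw sums only, no deviations needed).
cov_short = (np.sum(x * y) - x.sum() * y.sum() / n) / (n - 1)
s2x_short = (np.sum(x ** 2) - x.sum() ** 2 / n) / (n - 1)

assert np.isclose(cov_def, cov_short)
assert np.isclose(s2x_def, s2x_short)
```

The shortcut forms are convenient when only running totals of $x_i$, $y_i$, $x_i^2$, $y_i^2$, and $x_i y_i$ are kept, since no second pass over the data is needed.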

Example 1 (continued)
Continuing with Example 1, find the five statistics summary and comment on the linear relationship between price and odometer reading.

Solution:

$\bar{x} = 36{,}009.45; \quad s_X^2 = 43{,}528{,}690; \quad s_Y^2 = 259{,}996$

$\bar{y} = 14{,}822.823; \quad \mathrm{Cov}(X, Y) = -2{,}712{,}511; \quad r = -0.8063$

As r = -0.8063, there exists a strong negative linear relation between price and odometer reading.


Sample Coefficient of Correlation

• r near +1: strong positive linear relationship; the scatter diagram shows a clear upward trend. Cov(X, Y) > 0.
• r = 0: no linear relationship; the scatter diagram shows either no pattern, or a non-linear pattern. Cov(X, Y) = 0.
• r near -1: strong negative linear relationship; the scatter diagram shows a clear downward trend. Cov(X, Y) < 0.

Simple Linear Regression Model

The simple linear regression model takes the form:

$Y = \beta_0 + \beta_1 x + \epsilon$

where
  Y = dependent variable
  x = independent variable
  $\beta_0$ = intercept parameter
  $\beta_1$ = slope parameter
  $\epsilon$ = random error / random disturbance

[Figure: a straight line with intercept $\beta_0$ and slope $\beta_1$ = Rise/Run]

It has 2 parts. The 1st part is the straight line given by $\beta_0 + \beta_1 x$.

$\beta_0$ and $\beta_1$ are unknown population parameters, and therefore need to be estimated from the data.

Simple Linear Regression Model

Since the actual y values do not fall on the straight line, we add the 2nd part of the model, which is the error term $\epsilon$.
$\epsilon$ is a random variable assumed to be normally distributed with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$.

$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

[Figure: for a given $x_i$, the observed value of Y differs from the predicted value on the line (intercept $\beta_0$, slope $\beta_1$) by the random error $\epsilon_i$]

Simple Linear Regression Model

To estimate the parameters $\beta_0$ and $\beta_1$, a random sample of n experimental units is selected, and the values of (X, Y) for each unit are observed to give (x1, y1), (x2, y2), . . . , (xn, yn). These n pairs of observations satisfy:

$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, 2, \ldots, n$

i.e.
$Y_1 = \beta_0 + \beta_1 x_1 + \epsilon_1$
$Y_2 = \beta_0 + \beta_1 x_2 + \epsilon_2$
and so on.
Note there is no change to the intercept and slope.

Simple Linear Regression Model

Y is a random variable since $\epsilon$ is one. Due to the random sampling mechanism, the {$Y_i$} must be independent, and so are the {$\epsilon_i$}. Further, it is reasonable to assume that
$E(\epsilon_i) = 0, \quad i = 1, 2, \ldots, n.$
Thus, $E(Y_i) = \beta_0 + \beta_1 x_i, \quad i = 1, 2, \ldots, n.$

[Figure: as before — the observed value of Y for $x_i$ differs from the predicted value on the line (intercept $\beta_0$, slope $\beta_1$) by the random error $\epsilon_i$]

Learning Objectives
• Simple Linear Regression Model
• Least Squares Method of Estimation
• Measuring Goodness of Fit
• Inference about Regression Coefficients
• Predicting with the Model


Least Squares Estimation

Based on the observed data, we are seeking a line that best fits the data when two variables are related to one another:

$\hat{y} = b_0 + b_1 x, \qquad e_i = y_i - \hat{y}_i$

We define the “best fit line” as the line for which the sum of squared differences between it and the data points is minimized.

Least Squares Estimation

Different lines generate different errors, and thus different sums of squared errors:

$SSE = \sum_{i=1}^{n} e_i^2$

[Figure: two scatter plots with different candidate lines; each line leaves different errors between the points and the line]

There is a line that minimizes the sum of squared errors, and in this sense it is the best line.

Least Squares Estimation

Let $\hat{y} = b_0 + b_1 x$ be a fitted line. Finding the best line that minimizes the sum of squared errors (SSE) is equivalent to finding the intercept $b_0$ and the slope $b_1$ which minimize

$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

where $y_i$ is the actual Y value of point i, and $\hat{y}_i = b_0 + b_1 x_i$ is the value of point i calculated from the equation. That is, to minimize

$SSE = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$

Least Squares Estimators

$b_1 = \dfrac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \dfrac{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{\mathrm{Cov}(X, Y)}{s_X^2}$

$b_0 = \bar{y} - b_1 \bar{x}$

This gives the least squares equation: $\hat{y} = b_0 + b_1 x$
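Plugging the Example 1 summary statistics quoted on these slides into the estimator formulas reproduces the coefficient estimates directly. This is a sketch using only the summary figures, since the raw 100-car data set is not reproduced here.

```python
# Summary statistics for Example 1 as reported on the slides.
x_bar, y_bar = 36_009.45, 14_822.82
s2_x, s2_y = 43_528_690.0, 259_996.0
cov_xy = -2_712_511.0

b1 = cov_xy / s2_x        # slope estimate: Cov(X, Y) / s_X^2
b0 = y_bar - b1 * x_bar   # intercept estimate: y-bar - b1 * x-bar
r = cov_xy / (s2_x * s2_y) ** 0.5   # sample correlation, for reference
```

This reproduces b1 ≈ -0.0623, b0 ≈ 17,067, and r ≈ -0.8063 from the slides.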


Example 1 (continued)
Continuing with Example 1, find the least squares line relating odometer reading to the price of the used car.
Solution: The estimated coefficients are

$b_1 = \dfrac{\mathrm{Cov}(X, Y)}{s_X^2} = \dfrac{-2{,}712{,}511}{43{,}528{,}690} = -0.06232$

$b_0 = \bar{y} - b_1 \bar{x} = 14{,}822.82 - (-0.06232)(36{,}009.45) = 17{,}067$

The least squares equation is

$\hat{y} = b_0 + b_1 x = 17{,}067 - 0.0623x$

Interpretation of $b_1 = -0.0623$: for one additional mile on the odometer, it is estimated that the average price of the cars decreases by $0.0623.

Interpreting the Linear Regression Equation

$\hat{y} = 17{,}067 - 0.0623x$

• The intercept is estimated as $17,067. Do not interpret the intercept as the “price of cars that have not been driven”!
• The estimated slope of the line is -0.0623: for each additional mile on the odometer, the price decreases by an average of $0.0623.

Interpreting the Linear Regression Equation

[Figure: Odometer Line Fit Plot — the fitted line meets the price axis at $17,067, but there are no data near an odometer reading of 0, so the intercept should not be interpreted there.]

Properties of the Least Squares Estimators
For the simple linear regression model:

$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, 2, \ldots, n$

where the {$\epsilon_i$} are independent with $E(\epsilon_i) = 0$, the least squares estimators $b_0$ and $b_1$ are unbiased estimators of $\beta_0$ and $\beta_1$.
Under certain assumptions of the model, the least squares estimators are the best linear unbiased estimators (BLUE).

Learning Objectives
• Simple Linear Regression Model
• Least Squares Method of Estimation
• Measuring Goodness of Fit
• Inference about Regression Coefficients
• Predicting with the Model


Coefficient of Determination R2

[Figure: scatter plot with fitted line, decomposing the deviations:
  $\sum (y_i - \hat{y}_i)^2$ — deviation of each point from the fitted line (SSE),
  $\sum (y_i - \bar{y})^2$ — total deviation from the mean (SST),
  $\sum (\hat{y}_i - \bar{y})^2$ — deviation of each fitted value from the mean (SSR)]

Coefficient of Determination R2
To understand the significance of the coefficient of determination, note:

$\underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{SSE}$

SST: Total corrected sum of squares. Represents the variation in the response values that ideally would be explained by the model.
SSR: Regression sum of squares. Reflects the amount of variation in the y-values explained by the model.
SSE: Error sum of squares. Is the variation in response due to the error, or variation unexplained.
It follows that R2 = 1 - SSE/SST = SSR/SST.
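The decomposition can be checked numerically. The sketch below, assuming NumPy is available, fits a least-squares line to arbitrary made-up data with np.polyfit and confirms SST = SSR + SSE; the identity holds precisely because the fitted values come from the least-squares fit.

```python
import numpy as np

# Arbitrary illustrative data (not from the textbook).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = b0 + b1 * x            # fitted values

sst = np.sum((y - y.mean()) ** 2)      # total corrected sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # error sum of squares

r2 = ssr / sst   # coefficient of determination
```

For simple linear regression, r2 here also equals the squared sample correlation between x and y.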

Coefficient of Determination R2
It is a measure of the strength of the linear relationship between the response Y and the explanatory variable(s) X, and is defined as

$R^2 = 1 - \dfrac{SSE}{\sum_i (y_i - \bar{y})^2} \quad \text{or} \quad R^2 = \dfrac{\mathrm{Cov}(X, Y)^2}{s_X^2 s_Y^2}$

where $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$.

• The first definition is a general one and applies to linear regression models with multiple predictors.
• It simplifies to the second definition when there is only one predictor X.
• In the case of simple linear regression, R2 is also the square of the sample correlation coefficient r.

Coefficient of Determination R2

• R2 measures the proportion of variability in the response Y explained by the variation in X, or by the fitted model.
• R2 takes on any value between zero and one.
  R2 = 1: fit is perfect; there is a perfect match between the line and the data points.
  R2 = 0: there is no linear relationship between X and Y.

Sum of Squares due to Errors (SSE)

This is the sum of squared differences between the points and the regression line. It can serve as a measure of how well the line fits the data. SSE is defined by

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

A shortcut formula:

$SSE = (n - 1)\left[ s_Y^2 - \dfrac{\mathrm{Cov}(X, Y)^2}{s_X^2} \right]$
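Applying the shortcut to the Example 1 summary statistics quoted earlier on these slides gives the SSE and the estimated error standard deviation in one step. This is a sketch from the summary figures only; the small difference from the slide's value of 9,005,450 comes from rounding in the quoted covariance.

```python
import math

# Example 1 summary statistics as reported on the slides.
n = 100
s2_y = 259_996.0          # sample variance of price
s2_x = 43_528_690.0       # sample variance of odometer reading
cov_xy = -2_712_511.0     # sample covariance

sse = (n - 1) * (s2_y - cov_xy**2 / s2_x)   # shortcut SSE
s_e = math.sqrt(sse / (n - 2))              # estimated error standard deviation
```

This reproduces SSE ≈ 9,005,450 and s_e ≈ 303.1 from the slides.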

Example 1 (continued)

$SST = (n - 1) s_Y^2 = 99 \times 259{,}996$

$R^2 = 1 - SSE/SST = 1 - 9{,}005{,}450/(99 \times 259{,}996) = 0.6501$

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.

Learning Objectives
• Simple Linear Regression Model
• Least Squares Method of Estimation
• Measuring Goodness of Fit
• Inference about Regression Coefficients
• Predicting with the Model


Estimate of Error Standard Deviation

• The mean error is equal to zero. If the error standard deviation $\sigma_\epsilon$ is small, the errors tend to be close to zero (close to the mean error), and the model fits the data well.
• Therefore, we can also use $\sigma_\epsilon$ as a measure of the suitability of using a linear model.
• However, $\sigma_\epsilon$ is unknown and has to be estimated.

Estimate of the error standard deviation:

$s_e = \sqrt{\dfrac{SSE}{n-2}}, \qquad s_e^2 = \dfrac{\sum_i (y_i - \hat{y}_i)^2}{n-2}$

Compare with $s_Y^2 = \dfrac{\sum_i (y_i - \bar{y})^2}{n-1}$.

Example 1 (continued)
Calculate the estimate of the error standard deviation and the coefficient of determination for Example 1, and describe what they tell you about the model fit.
Solution:

$s_Y^2 = \dfrac{\sum_i (y_i - \bar{y})^2}{n-1} = 259{,}996$ (calculated earlier)

$SSE = (n - 1)\left[ s_Y^2 - \dfrac{\mathrm{Cov}(X, Y)^2}{s_X^2} \right] = 99\left[ 259{,}996 - \dfrac{(-2{,}712{,}511)^2}{43{,}528{,}690} \right] = 9{,}005{,}450$

$s_e = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{9{,}005{,}450}{98}} = 303.13$

It is hard to assess the model based on $s_e$, even when compared with the mean value of Y: $s_e = 303.1$, $\bar{y} = 14{,}823$.

Testing the Slope

Testing whether there is a linear relationship between X and Y is the same as testing whether the slope = 0.

[Figure: scatter diagram with a flat (zero-slope) fitted line, illustrating no linear relationship between X and Y]

Testing the Slope

We can draw inference about $\beta_1$ from $b_1$ by testing
H0: $\beta_1 = 0$ versus H1: $\beta_1 \neq 0$ (or < 0, or > 0)
The implication of this test is clear: if H0 is rejected, one can conclude that there is sufficient evidence to show that Y and X are linearly related; otherwise, there is not sufficient evidence of a linear relation. The same question can be answered by constructing a confidence interval for $\beta_1$. From the theoretical result and our knowledge of the t-distribution, it is immediate to see that

$\dfrac{b_1 - \beta_1}{s_e / \sqrt{(n-1) s_X^2}} \sim t_{n-2}$

This is a statistic for testing the slope parameter or constructing a confidence interval for it. (Note: TB uses)

Confidence Interval of Slope Coefficient

Apparently, the quantity $\dfrac{s_e}{\sqrt{(n-1) s_X^2}}$ is an estimate of the standard deviation of $b_1$, and is thus referred to as the estimated standard error of $b_1$.

A 100(1 - α)% confidence interval for $\beta_1$ is given as

$\left( b_1 - t_{\alpha/2,\, n-2} \dfrac{s_e}{\sqrt{(n-1) s_X^2}},\ \ b_1 + t_{\alpha/2,\, n-2} \dfrac{s_e}{\sqrt{(n-1) s_X^2}} \right)$

Inference concerning the intercept parameter $\beta_0$ can be carried out in a similar manner, but it is not as interesting and important as for the slope parameter $\beta_1$.

Example 1 (continued)
Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old cars. Use α = 5%.

Solution: H0: $\beta_1 = 0$ vs H1: $\beta_1 \neq 0$

$b_1 = -0.0623$

$\dfrac{s_e}{\sqrt{(n-1) s_X^2}} = \dfrac{303.1}{\sqrt{(99)(43{,}528{,}690)}} = 0.00462$

$t = \dfrac{b_1 - 0}{s_e / \sqrt{(n-1) s_X^2}} = \dfrac{-0.0623 - 0}{0.00462} = -13.49$

Example 1 (continued)
With $\nu = n - 2 = 98$, the rejection region is
$t > t_{0.025,\,98}$ or $t < -t_{0.025,\,98}$, where $t_{0.025,\,98} \approx 1.984$.

As t = -13.49 < -1.984, reject H0 at the 5% level of significance. Yes, there is enough evidence to infer a linear relationship between price and odometer reading.

A 95% CI for $\beta_1$:
$-0.0623 \pm 1.984 \times 0.00462 = (-0.0715,\ -0.0531)$
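The whole test and interval can be reproduced from the summary figures on these slides. This is a sketch; the critical value 1.984 ≈ t_{0.025, 98} is taken from the slide rather than computed from a t-distribution routine.

```python
import math

# Example 1 summary figures as reported on the slides.
n = 100
b1 = -0.0623              # slope estimate
s_e = 303.1               # estimated error standard deviation
s2_x = 43_528_690.0       # sample variance of odometer reading
t_crit = 1.984            # t_{0.025, 98}, taken from the slide

se_b1 = s_e / math.sqrt((n - 1) * s2_x)   # estimated standard error of b1
t_stat = (b1 - 0) / se_b1                 # test statistic for H0: beta1 = 0

ci_low = b1 - t_crit * se_b1              # 95% CI for beta1, lower bound
ci_high = b1 + t_crit * se_b1             # 95% CI for beta1, upper bound

reject_h0 = abs(t_stat) > t_crit          # two-sided test at the 5% level
```

This reproduces t ≈ -13.49 and the 95% CI (-0.0715, -0.0531); since 0 lies outside the interval, the CI leads to the same rejection of H0.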


TEXTBOOK REFERENCES

Chapter 9: Linear Regression
Relevant Sections: 1 - 5

Section   Remarks
1
2
3         Excluding: “Mean and Variance of Estimators”, “Statistical Inference on the Intercept”
4
5         Partitioning of the total corrected sum of squares of y only (pg 397-398)
