
Lecture 14.

Chapter 17
Simple Linear Regression and Correlation

17.1 Regression model
17.2 Estimating the coefficients
17.3 Error variable: required conditions
17.4 Assessing the regression model

17.1 Regression Model

• Regression analysis is used to predict the value of one
  variable (the dependent variable) on the basis of other
  variables (the independent variables).
  – Dependent variable: denoted Y.
  – Independent variables: denoted X1, X2, …, Xk.

• This chapter examines the relationship between two variables:
  a dependent variable Y and a single independent variable X.
  This case is called simple linear regression.
Regression Model Types

• Mathematical equations used in regression analysis are called
  regression models; they fall into two types: deterministic
  and probabilistic.

• Deterministic model: an equation or set of equations that
  allows us to fully determine the value of the dependent
  variable from the values of the independent variables.

• Probabilistic model: a method used to capture the randomness
  that is part of a real-life process.
Deterministic Model: Example

• The cost of building a new house is about $6,000 per square
  meter, and most lots sell for about $210,000. Hence the
  approximate selling price y would be

      y = 210000 + 6000x

  where x is the size of the house in square meters.

[Figure: house price vs. house size. The line has y-intercept
$210,000 (most lots sell for $210,000); in this model the price
of the house is completely determined by its size.]
Probabilistic Model

• In real life, however, the price will vary even among houses
  of the same size:

      House price = 210000 + 6000(Size) + ε

[Figure: house price vs. house size, with points scattered
around the line. The same house size corresponds to different
price points (e.g. décor options, cabinet upgrades, lot
location, …), and the variability may be lower or higher.]

Probabilistic Model: Random Term

• We now represent the price of a house as a function of its
  size in this probabilistic model:

      y = 210000 + 6000x + ε

  where ε (Greek letter epsilon) is the random term (a.k.a.
  error variable). It is the difference between the actual
  selling price and the price estimated from the size of the
  house. Its value varies from house sale to house sale, even
  if the area of the house (i.e. x) remains the same, due to
  other factors such as the location, age, décor, etc. of the
  house.

Simple Linear Regression Model

      y = β0 + β1x + ε

  y  = dependent variable
  x  = independent variable
  β0 = y-intercept
  β1 = slope of the line (rise/run)
  ε  = error variable

β0 and β1 are population parameters which are usually unknown,
and therefore are estimated from the data.

[Figure: the line y = β0 + β1x, with β0 the y-intercept and
β1 = rise/run the slope.]

17.2 Estimating the Coefficients

• In much the same way we base estimates of μ on x̄, we
  estimate β0 and β1 by β̂0 and β̂1, the y-intercept and slope
  (respectively) of the least squares or regression line:

      ŷ = β̂0 + β̂1x

• Recall: this is an application of the least squares method,
  which produces the straight line that minimizes the sum of
  the squared differences between the points and the line.

Least Squares Method

• The question is: which straight line fits best?
• The least squares line minimizes the sum of squared
  differences between the points and the line.

[Figure: scatter of data points with a candidate straight line
drawn through them.]

Least Squares Method: Starting Example

• The best line is the one that minimizes the sum of squared
  vertical differences between the points and the line.

  Data points: (1, 2), (2, 4), (3, 1.5), (4, 3.2)

  Line 1 (y = x):
    Sum of squared differences
      = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
  Line 2 (y = 2.5):
    Sum of squared differences
      = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

• The smaller the sum of squared differences, the better the
  fit of the line to the data: Line 2 better fits the data.
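The two sums of squared differences can be checked with a short
Python sketch (Line 1 is read off the figure as y = x, and
Line 2 as the horizontal line y = 2.5):

```python
# Starting example: compare the two candidate lines by their sum
# of squared vertical differences from the data points.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(points, line):
    """Sum of squared vertical distances from the points to line(x)."""
    return sum((y - line(x)) ** 2 for x, y in points)

print(round(sum_sq_diff(points, lambda x: x), 2))    # Line 1: 7.89
print(round(sum_sq_diff(points, lambda x: 2.5), 2))  # Line 2: 3.99
```

The same helper works for any candidate line, which makes the
"smaller sum = better fit" comparison concrete.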
Least Squares Estimates

To calculate the estimates of the coefficients that minimize
the sum of squared differences between the data points and the
line, use the formulas:

    β̂1 = [Σxiyi − (Σxi)(Σyi)/n] / [Σxi² − (Σxi)²/n]
       = [Σxiyi − n·x̄·ȳ] / [Σxi² − n·x̄²]

    β̂0 = ȳ − β̂1·x̄

Now we define:

    SSx  = Σxi² − (Σxi)²/n     = Σxi² − n·x̄²,       sx² = SSx/(n − 1)
    SSy  = Σyi² − (Σyi)²/n     = Σyi² − n·ȳ²,       sy² = SSy/(n − 1)
    SSxy = Σxiyi − (Σxi)(Σyi)/n = Σxiyi − n·x̄·ȳ,    sxy = SSxy/(n − 1)

Least Squares Estimates (contd.)

Then

    β̂1 = SSxy / SSx      (equivalently β̂1 = sxy / sx²)
    β̂0 = ȳ − β̂1·x̄

The estimated simple linear regression equation that estimates
the equation of the first-order linear model is:

    ŷ = β̂0 + β̂1x

• Note: the textbook introduces the symbols sxy, sx², sy²,
  which are proportional to SSxy, SSx, SSy, respectively, with
  multiplier 1/(n − 1). The calculated values of β̂1 and β̂0
  remain unchanged.
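The formulas above can be sketched in Python on a small made-up
data set (the numbers here are illustrative, not from the
textbook):

```python
# Least squares estimates from the shortcut formulas.
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

SS_x  = sum(x * x for x in xs) - n * x_bar**2
SS_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

b1 = SS_xy / SS_x            # slope estimate
b0 = y_bar - b1 * x_bar      # intercept estimate
print(round(b1, 4), round(b0, 4))  # 0.11 2.4
```

For these four points the least squares line is ŷ = 2.4 + 0.11x.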

Example 17.3, page 717

• A car dealer wants to find the relationship between the
  odometer reading and the selling price of used cars.

    Car   Odometer   Price
     1      37.4     16.0
     2      44.8     15.2
     3      45.8     15.0
     4      30.9     17.4
     5      31.7     17.4
     6      34.0     16.1
     .        .        .
     .        .        .

• A random sample of 100 cars is selected and the data are
  recorded in the file Example 17.3.
• Find the regression line.
  (Independent variable x: odometer; dependent variable y:
  price.)

Example 17.3, Solution

• To calculate β̂0 and β̂1 we first need several statistics
  (with n = 100):

    x̄ = 36.01;   SSx  = Σxi² − (Σxi)²/n     = 4307.378
    ȳ = 16.24;   SSxy = Σxiyi − (Σxi)(Σyi)/n = −403.6207

    β̂1 = SSxy / SSx = −403.6207 / 4307.378 = −0.0937
    β̂0 = ȳ − β̂1·x̄ = 16.24 − (−0.0937)(36.01) = 19.611

    ŷ = β̂0 + β̂1x = 19.611 − 0.0937x
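The arithmetic above can be reproduced from the summary
statistics alone (we do not have the raw 100-car data file
here); note the small effect of using the rounded x̄ and ȳ on
the intercept:

```python
# Example 17.3: recompute the coefficient estimates from the
# lecture's summary statistics, n = 100.
x_bar, y_bar = 36.01, 16.24
SS_x, SS_xy = 4307.378, -403.6207

b1 = SS_xy / SS_x              # slope estimate
b0 = y_bar - b1 * x_bar        # intercept estimate
print(round(b1, 4))            # -0.0937
print(round(b0, 3))            # 19.614 (19.611 when computed from
                               # the unrounded data, as in Excel)
```

The tiny intercept discrepancy (19.614 vs. 19.611) comes purely
from rounding x̄, ȳ, SSx, and SSxy before the calculation.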

Using the Computer (see file Example17.3.xls)

Data > Data Analysis > Regression >
[Highlight the y range and x range] > OK

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.808265
  R Square            0.653292
  Adjusted R Square   0.649754
  Standard Error      0.452567
  Observations        100

ANOVA
               df     SS        MS        F         Significance F
  Regression    1   37.82108  37.82108  184.6583   2.84212E-24
  Residual     98   20.07202   0.204817
  Total        99   57.8931

               Coefficients   Standard Error   t Stat    P-value   Lower 95%     Upper 95%
  Intercept      19.61139       0.25241       77.69655   7.53E-90  19.11049321   20.1122924
  Odometer (x)   -0.0937        0.006896     -13.5889    2.84E-24  -0.107388721  -0.0800203

    ŷ = 19.611 − 0.0937x

[Figure: scatter plot of Price (y) versus Odometer (x) with the
fitted line ŷ = 19.611 − 0.0937x. The data cover roughly
x = 15 to 55; there are no data near x = 0.]

• The intercept is β̂0 = 19.611; the slope is β̂1 = −0.0937.
• For each additional km on the odometer, the price decreases
  by an average of $0.094.
• Do not interpret the intercept as the "price of cars that
  have not been driven": there are no data near x = 0.
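The fitted line is mainly used for prediction within the
observed x range. A minimal sketch, assuming (as the data table
suggests) that odometer readings are in thousands of km and
prices in $1,000s:

```python
# Predict the mean selling price from the fitted line
# y-hat = 19.611 - 0.0937x (both variables in thousands, an
# assumption about the units used in the example).
b0, b1 = 19.611, -0.0937

def predicted_price(odometer):
    return b0 + b1 * odometer

p = predicted_price(40)   # x = 40 lies inside the data range (15-55)
print(round(p, 3))        # 15.863, i.e. about $15,863 on average
# predicted_price(0) would be extrapolation: no data near x = 0.
```

This also makes the intercept warning concrete: evaluating the
line at x = 0 is an extrapolation outside the data, not a price
for undriven cars.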

17.3 Error Variable: Required Conditions

• The error ε is a critical part of the regression model.
• Five requirements involving the distribution of ε must be
  satisfied:
  – The probability distribution of ε is normal.
  – The mean of ε is zero: E(ε) = 0.
  – The standard deviation of ε is a constant (σε) for all
    values of x.
  – The errors are independent.
  – The errors are independent of the independent variable x.

• From the first three assumptions we have: given a particular
  value of X, say X = x, Y is a random variable which is
  normally distributed with mean

      E(Y | X = x) = β0 + β1x

  and constant standard deviation σε.
  For simplicity we denote E(Y | X = x) as E(y|x).

[Figure: three normal curves centred on the line at x1, x2, x3.
The mean of Y, E(y|xi) = β0 + β1xi, changes with the value
X = x, while the standard deviation remains constant.]

17.4 Assessing the Model

• The least squares method will produce a regression line
  whether or not there is a linear relationship between x
  and y.
• Consequently, it is important to assess how well the linear
  model fits the data.
• Several methods are used to assess the model:
  – testing and/or estimating the coefficients;
  – using descriptive measurements.

Sum of Squares for Errors (SSE)

• This is the sum of squared differences between the points
  and the regression line.
• It can serve as a measure of how well the line fits the
  data:

      SSE = Σ(yi − ŷi)²,   or equivalently
      SSE = SSy − SSxy²/SSx = (n − 1)(sy² − sxy²/sx²)

• This statistic plays a role in every statistical technique
  we employ to assess the model.
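The shortcut formula can be checked against the direct
definition of SSE on a small made-up data set:

```python
# Verify SSE = SS_y - SS_xy**2 / SS_x against the direct sum of
# squared residuals for the least squares line.
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
SS_x  = sum(x * x for x in xs) - n * x_bar**2
SS_y  = sum(y * y for y in ys) - n * y_bar**2
SS_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

b1 = SS_xy / SS_x
b0 = y_bar - b1 * x_bar

sse_direct   = sum((y - (b0 + b1 * x))**2 for x, y in zip(xs, ys))
sse_shortcut = SS_y - SS_xy**2 / SS_x
print(abs(sse_direct - sse_shortcut) < 1e-9)  # True
```

The identity holds only for the least squares line itself; for
any other line the direct sum of squared residuals is larger.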

Standard Error of Estimate

• The mean error is equal to zero.
• If σε is small, the errors tend to be close to zero (close
  to the mean error), and the model fits the data well.
• Therefore we can use σε as a measure of the suitability of
  using a linear model.
• An estimator of σε is the standard error of estimate:

      sε = √( SSE / (n − 2) )

Example 17.4 (Example 17.3, contd.)

• Calculate the standard error of estimate for Example 17.3
  and describe what it tells you about the model fit.

Solution

    SSy = Σyi² − (Σyi)²/n = 57.8931   (calculated before)

    SSE = SSy − SSxy²/SSx
        = 57.8931 − (−403.6207)²/4307.378 = 20.072

Thus,

    sε = √( SSE / (n − 2) ) = √( 20.072 / 98 ) = 0.4526

• Note: the textbook's symbols sxy, sx², sy² are proportional
  to SSxy, SSx, SSy, respectively, with multiplier 1/(n − 1);
  the calculated values of SSE and sε remain unchanged.
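The Example 17.4 numbers can be reproduced directly from the
summary statistics:

```python
import math

# Example 17.4: standard error of estimate from the lecture's
# summary statistics.
n = 100
SS_y, SS_x, SS_xy = 57.8931, 4307.378, -403.6207

sse = SS_y - SS_xy**2 / SS_x
s_eps = math.sqrt(sse / (n - 2))
print(round(sse, 3))             # 20.072
print(round(s_eps, 4))           # 0.4526
print(round(s_eps / 16.24, 3))   # 0.028: s is about 2.8% of y-bar
```

The last line anticipates the next slide: sε is judged relative
to the sample mean of y.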

Example 17.4: Solution (contd.)

    sε = 0.4526,   ȳ = 16.24

• We judge the value of the standard error of estimate, sε, by
  comparing it to the values of the dependent variable y, or
  more specifically to the sample mean of y.
• In this example sε is only about 2.8% of the sample mean of
  y, so we can conclude that the standard error of estimate is
  reasonably small.
• sε cannot be used alone as an absolute measure of the
  model's utility, but it can be used to compare models.

Testing the Slope

• When no linear relationship exists between two variables,
  the regression line should be horizontal.

[Figure: two scatter plots. Left: the slope is not equal to
zero — a linear relationship; different inputs (x) yield
different outputs (y). Right: the slope is equal to zero — no
linear relationship; different inputs (x) yield the same
output (y).]
• We can draw inferences about β1 from β̂1 by testing

      H0: β1 = 0
      HA: β1 ≠ 0   (or < 0, or > 0)

  – The test statistic is

        t = (β̂1 − β1) / sβ̂1,   where   sβ̂1 = sε / √SSx

    is the standard error of β̂1.
  – If the error variable is normally distributed, the
    statistic has a Student t distribution with d.f. = n − 2.

Example 17.6 (Example 17.3, contd.)

• Test, at the 5% significance level, whether there is enough
  evidence to infer that a linear relationship exists between
  the price and the odometer reading.

Solution (solving by hand)
  – To compute t we need the values of β̂1 and sβ̂1:

        β̂1 = −0.0937
        sβ̂1 = sε / √SSx = 0.4526 / √4307.378 = 0.0069

        t = (β̂1 − β1) / sβ̂1 = (−0.0937 − 0) / 0.0069 = −13.59
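The hand calculation can be verified in a few lines of Python:

```python
import math

# Example 17.6: t statistic for H0: beta1 = 0.
b1_hat, s_eps, SS_x = -0.0937, 0.4526, 4307.378

s_b1 = s_eps / math.sqrt(SS_x)   # standard error of the slope estimate
t = (b1_hat - 0) / s_b1
print(round(s_b1, 4))            # 0.0069
print(round(t, 2))               # -13.59
print(abs(t) > 1.984)            # True: reject H0 at the 5% level
```

The critical value 1.984 is t(0.025, 98) from the next slide's
decision rule.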

Decision rule
  – Reject H0 if |t| > t0.025,98 = 1.984.
  – Since the calculated t-value (−13.59) satisfies |t| > 1.984,
    we reject H0 and conclude that the odometer reading does
    affect the selling price.

Using the computer (Excel regression output)

               Coefficients    Standard Error   t Stat    P-value
  Intercept     19.61139281     0.252410094    77.69655   7.53E-90
  Odometer (x)  -0.093704502    0.006895663   -13.5889    2.84E-24

• Looking at the p-value, there is overwhelming evidence to
  infer that the odometer reading affects the auction selling
  price.

Coefficient of Determination

• When we want to measure the strength of the linear
  relationship, we use the coefficient of determination:

      R² = SSxy² / (SSx·SSy)   or   R² = 1 − SSE/SSy

  (in textbook notation, R² = sxy²/(sx²·sy²) or
   R² = 1 − SSE/[(n − 1)sy²]).

• To understand the significance of this coefficient, note
  that the overall variability of y is:
  – partly explained by the regression model;
  – partly unexplained by the regression model (i.e. explained
    by random errors).

• Two data points (x1, y1) and (x2, y2) of a certain sample
  are shown.

[Figure: the two points, the fitted line ŷ, and the mean ȳ,
illustrating the decomposition of each deviation yi − ȳ.]

  Total variation in y
    = variation explained by the regression line
      + unexplained variation (error):

  (y1 − ȳ)² + (y2 − ȳ)²
    = [(ŷ1 − ȳ)² + (ŷ2 − ȳ)²] + [(y1 − ŷ1)² + (y2 − ŷ2)²]

  SST = SSR + SSE,   where SST = variation in y.

• R² measures the proportion of the variation in y that is
  explained by the variation in x:

      R² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST = 1 − SSE/SSy

• R² takes on any value between zero and one.
  – R² = 1: perfect match between the line and the data
    points.
  – R² = 0: there is no linear relationship between x and y.
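The decomposition SST = SSR + SSE, and hence R², can be
verified numerically on a small made-up data set:

```python
# Verify SST = SSR + SSE and R^2 = SSR/SST for the least squares
# line on illustrative data (not from the textbook).
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
SS_x  = sum(x * x for x in xs) - n * x_bar**2
SS_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
b1 = SS_xy / SS_x
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar)**2 for y in ys)              # total variation
ssr = sum((f - y_bar)**2 for f in fitted)          # explained
sse = sum((y - f)**2 for y, f in zip(ys, fitted))  # unexplained

print(abs(sst - (ssr + sse)) < 1e-9)  # True: SST = SSR + SSE
r2 = ssr / sst
print(round(r2, 4))                   # 0.0156
```

Here R² is near zero: in this toy data the variation in x
explains almost none of the variation in y, i.e. a very weak
linear relationship.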

Example 17.7 (Example 17.3, contd.)

• Find the coefficient of determination for Example 17.3; what
  does this statistic tell you about the model?

Solution (by hand, or from the regression output):

      R² = 1 − SSE/SSy = 1 − 20.07/57.89 = 0.6533

• About 65% of the variation in the auction selling price is
  explained by the variation in odometer reading. The rest
  (35%) remains unexplained by this model.

Summary: pages 763 - 764

Home assignment:
- Section 17.1 Exercises, page 711: 17.1, 17.3
- Section 17.2 Exercises, pages 720 - 721: 17.4, 17.7, 17.10,
  17.13
- Section 17.4 Exercises, pages 739 - 740: 17.31, 17.33, 17.35
- Supplementary exercise, page 766: 17.76