
Lecture 14.

Chapter 17
Simple Linear Regression and Correlation

17.1 Regression model
17.2 Estimating the coefficients
17.3 Error variable: required conditions
17.4 Assessing the regression model

17.1 Regression Model

• Regression analysis is used to predict the value of one
  variable (the dependent variable) on the basis of other
  variables (the independent variables).
  – Dependent variable: denoted Y.
  – Independent variables: denoted X1, X2, …, Xk.

• This chapter examines the relationship between two variables:
  a dependent variable Y and a single independent variable X.
  This case is called simple linear regression.
Regression Model Types

• Mathematical equations used in regression analysis are called
  regression models; they fall into two types: deterministic
  and probabilistic.

• Deterministic model: an equation or set of equations that
  allows us to fully determine the value of the dependent
  variable from the values of the independent variables.

• Probabilistic model: a method used to capture the randomness
  that is part of a real-life process.
Deterministic Model: Example

• The cost of building a new house is about $6,000 per square
  meter, and most lots sell for about $210,000. Hence the
  approximate selling price y would be

      y = 210000 + 6000x

  where x is the size of the house in square meters.

[Figure: house price vs. house size. The line has y-intercept
$210,000 (most lots sell for $210,000); in this model the price
of the house is completely determined by its size.]
Probabilistic Model

• In real life, however, the price will vary even among houses
  of the same size:

      House price = 210000 + 6000(Size) + ε

[Figure: house price vs. house size, with points scattered
around the line. The same house size corresponds to different
price points (e.g. décor options, cabinet upgrades, lot
location, …), and the variability may be lower or higher.]

Probabilistic Model: Random Term

• We now represent the price of a house as a function of its
  size in this probabilistic model:

      y = 210000 + 6000x + ε

  where ε (Greek letter epsilon) is the random term (a.k.a.
  error variable). It is the difference between the actual
  selling price and the price estimated from the size of the
  house. Its value varies from house sale to house sale, even
  if the area of the house (i.e. x) remains the same, due to
  other factors such as the location, age, décor, etc. of the
  house.

Simple Linear Regression Model

      y = β0 + β1x + ε

  y  = dependent variable
  x  = independent variable
  β0 = y-intercept
  β1 = slope of the line (rise/run)
  ε  = error variable

β0 and β1 are population parameters which are usually unknown,
and therefore are estimated from the data.

[Figure: the line y = β0 + β1x, with β0 the y-intercept and
β1 = rise/run the slope.]

17.2 Estimating the Coefficients

• In much the same way we base estimates of μ on x̄, we
  estimate β0 and β1 by β̂0 and β̂1, the y-intercept and slope
  (respectively) of the least squares or regression line:

      ŷ = β̂0 + β̂1x

• Recall: this is an application of the least squares method,
  which produces the straight line that minimizes the sum of
  the squared differences between the points and the line.

Least Squares Method

• The question is: which straight line fits best?
• The least squares line minimizes the sum of squared
  differences between the points and the line.

[Figure: scatter of data points with a candidate straight line
drawn through them.]

Least Squares Method: Starting Example

• The best line is the one that minimizes the sum of squared
  vertical differences between the points and the line.

  Data points: (1, 2), (2, 4), (3, 1.5), (4, 3.2)

  Line 1 (y = x):
    Sum of squared differences
      = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
  Line 2 (y = 2.5):
    Sum of squared differences
      = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

• The smaller the sum of squared differences, the better the
  fit of the line to the data: Line 2 better fits the data.
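The two sums of squared differences can be checked with a short
Python sketch (Line 1 is read off the figure as y = x, and
Line 2 as the horizontal line y = 2.5):

```python
# Starting example: compare the two candidate lines by their sum
# of squared vertical differences from the data points.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(points, line):
    """Sum of squared vertical distances from the points to line(x)."""
    return sum((y - line(x)) ** 2 for x, y in points)

print(round(sum_sq_diff(points, lambda x: x), 2))    # Line 1: 7.89
print(round(sum_sq_diff(points, lambda x: 2.5), 2))  # Line 2: 3.99
```

The same helper works for any candidate line, which makes the
"smaller sum = better fit" comparison concrete.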
Least Squares Estimates

To calculate the estimates of the coefficients that minimize
the sum of squared differences between the data points and the
line, use the formulas:

    β̂1 = [Σxiyi − (Σxi)(Σyi)/n] / [Σxi² − (Σxi)²/n]
       = [Σxiyi − n·x̄·ȳ] / [Σxi² − n·x̄²]

    β̂0 = ȳ − β̂1·x̄

Now we define:

    SSx  = Σxi² − (Σxi)²/n     = Σxi² − n·x̄²,       sx² = SSx/(n − 1)
    SSy  = Σyi² − (Σyi)²/n     = Σyi² − n·ȳ²,       sy² = SSy/(n − 1)
    SSxy = Σxiyi − (Σxi)(Σyi)/n = Σxiyi − n·x̄·ȳ,    sxy = SSxy/(n − 1)

Least Squares Estimates (contd.)

Then

    β̂1 = SSxy / SSx      (equivalently β̂1 = sxy / sx²)
    β̂0 = ȳ − β̂1·x̄

The estimated simple linear regression equation that estimates
the equation of the first-order linear model is:

    ŷ = β̂0 + β̂1x

• Note: the textbook introduces the symbols sxy, sx², sy²,
  which are proportional to SSxy, SSx, SSy, respectively, with
  multiplier 1/(n − 1). The calculated values of β̂1 and β̂0
  remain unchanged.
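The formulas above can be sketched in Python on a small made-up
data set (the numbers here are illustrative, not from the
textbook):

```python
# Least squares estimates from the shortcut formulas.
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

SS_x  = sum(x * x for x in xs) - n * x_bar**2
SS_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

b1 = SS_xy / SS_x            # slope estimate
b0 = y_bar - b1 * x_bar      # intercept estimate
print(round(b1, 4), round(b0, 4))  # 0.11 2.4
```

For these four points the least squares line is ŷ = 2.4 + 0.11x.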

Example 17.3, page 717

• A car dealer wants to find the relationship between the
  odometer reading and the selling price of used cars.

    Car   Odometer   Price
     1      37.4     16.0
     2      44.8     15.2
     3      45.8     15.0
     4      30.9     17.4
     5      31.7     17.4
     6      34.0     16.1
     .        .        .
     .        .        .

• A random sample of 100 cars is selected and the data are
  recorded in the file Example 17.3.
• Find the regression line.
  (Independent variable x: odometer; dependent variable y:
  price.)

Example 17.3, Solution

• To calculate β̂0 and β̂1 we first need several statistics
  (with n = 100):

    x̄ = 36.01;   SSx  = Σxi² − (Σxi)²/n     = 4307.378
    ȳ = 16.24;   SSxy = Σxiyi − (Σxi)(Σyi)/n = −403.6207

    β̂1 = SSxy / SSx = −403.6207 / 4307.378 = −0.0937
    β̂0 = ȳ − β̂1·x̄ = 16.24 − (−0.0937)(36.01) = 19.611

    ŷ = β̂0 + β̂1x = 19.611 − 0.0937x
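The arithmetic above can be reproduced from the summary
statistics alone (we do not have the raw 100-car data file
here); note the small effect of using the rounded x̄ and ȳ on
the intercept:

```python
# Example 17.3: recompute the coefficient estimates from the
# lecture's summary statistics, n = 100.
x_bar, y_bar = 36.01, 16.24
SS_x, SS_xy = 4307.378, -403.6207

b1 = SS_xy / SS_x              # slope estimate
b0 = y_bar - b1 * x_bar        # intercept estimate
print(round(b1, 4))            # -0.0937
print(round(b0, 3))            # 19.614 (19.611 when computed from
                               # the unrounded data, as in Excel)
```

The tiny intercept discrepancy (19.614 vs. 19.611) comes purely
from rounding x̄, ȳ, SSx, and SSxy before the calculation.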

Using the Computer (see file Example17.3.xls)

Data > Data Analysis > Regression >
[Highlight the y range and x range] > OK

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.808265
  R Square            0.653292
  Adjusted R Square   0.649754
  Standard Error      0.452567
  Observations        100

ANOVA
               df     SS        MS        F         Significance F
  Regression    1   37.82108  37.82108  184.6583   2.84212E-24
  Residual     98   20.07202   0.204817
  Total        99   57.8931

               Coefficients   Standard Error   t Stat    P-value   Lower 95%     Upper 95%
  Intercept      19.61139       0.25241       77.69655   7.53E-90  19.11049321   20.1122924
  Odometer (x)   -0.0937        0.006896     -13.5889    2.84E-24  -0.107388721  -0.0800203

    ŷ = 19.611 − 0.0937x

[Figure: scatter plot of Price (y) versus Odometer (x) with the
fitted line ŷ = 19.611 − 0.0937x. The data cover roughly
x = 15 to 55; there are no data near x = 0.]

• The intercept is β̂0 = 19.611; the slope is β̂1 = −0.0937.
• For each additional km on the odometer, the price decreases
  by an average of $0.094.
• Do not interpret the intercept as the "price of cars that
  have not been driven": there are no data near x = 0.
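The fitted line is mainly used for prediction within the
observed x range. A minimal sketch, assuming (as the data table
suggests) that odometer readings are in thousands of km and
prices in $1,000s:

```python
# Predict the mean selling price from the fitted line
# y-hat = 19.611 - 0.0937x (both variables in thousands, an
# assumption about the units used in the example).
b0, b1 = 19.611, -0.0937

def predicted_price(odometer):
    return b0 + b1 * odometer

p = predicted_price(40)   # x = 40 lies inside the data range (15-55)
print(round(p, 3))        # 15.863, i.e. about $15,863 on average
# predicted_price(0) would be extrapolation: no data near x = 0.
```

This also makes the intercept warning concrete: evaluating the
line at x = 0 is an extrapolation outside the data, not a price
for undriven cars.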

17.3 Error Variable: Required Conditions

• The error ε is a critical part of the regression model.
• Five requirements involving the distribution of ε must be
  satisfied:
  – The probability distribution of ε is normal.
  – The mean of ε is zero: E(ε) = 0.
  – The standard deviation of ε is a constant (σε) for all
    values of x.
  – The errors are independent.
  – The errors are independent of the independent variable x.

• From the first three assumptions we have: given a particular
  value of X, say X = x, Y is a random variable which is
  normally distributed with mean

      E(Y | X = x) = β0 + β1x

  and constant standard deviation σε.
  For simplicity we denote E(Y | X = x) as E(y|x).

[Figure: three normal curves centred on the line at x1, x2, x3.
The mean of Y, E(y|xi) = β0 + β1xi, changes with the value
X = x, while the standard deviation remains constant.]

17.4 Assessing the Model

• The least squares method will produce a regression line
  whether or not there is a linear relationship between x
  and y.
• Consequently, it is important to assess how well the linear
  model fits the data.
• Several methods are used to assess the model:
  – testing and/or estimating the coefficients;
  – using descriptive measurements.

Sum of Squares for Errors (SSE)

• This is the sum of squared differences between the points
  and the regression line.
• It can serve as a measure of how well the line fits the
  data:

      SSE = Σ(yi − ŷi)²,   or equivalently
      SSE = SSy − SSxy²/SSx = (n − 1)(sy² − sxy²/sx²)

• This statistic plays a role in every statistical technique
  we employ to assess the model.
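The shortcut formula can be checked against the direct
definition of SSE on a small made-up data set:

```python
# Verify SSE = SS_y - SS_xy**2 / SS_x against the direct sum of
# squared residuals for the least squares line.
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
SS_x  = sum(x * x for x in xs) - n * x_bar**2
SS_y  = sum(y * y for y in ys) - n * y_bar**2
SS_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

b1 = SS_xy / SS_x
b0 = y_bar - b1 * x_bar

sse_direct   = sum((y - (b0 + b1 * x))**2 for x, y in zip(xs, ys))
sse_shortcut = SS_y - SS_xy**2 / SS_x
print(abs(sse_direct - sse_shortcut) < 1e-9)  # True
```

The identity holds only for the least squares line itself; for
any other line the direct sum of squared residuals is larger.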

Standard Error of Estimate

• The mean error is equal to zero.
• If σε is small, the errors tend to be close to zero (close
  to the mean error), and the model fits the data well.
• Therefore we can use σε as a measure of the suitability of
  using a linear model.
• An estimator of σε is the standard error of estimate:

      sε = √( SSE / (n − 2) )

Example 17.4 (Example 17.3, contd.)

• Calculate the standard error of estimate for Example 17.3
  and describe what it tells you about the model fit.

Solution

    SSy = Σyi² − (Σyi)²/n = 57.8931   (calculated before)

    SSE = SSy − SSxy²/SSx
        = 57.8931 − (−403.6207)²/4307.378 = 20.072

Thus,

    sε = √( SSE / (n − 2) ) = √( 20.072 / 98 ) = 0.4526

• Note: the textbook's symbols sxy, sx², sy² are proportional
  to SSxy, SSx, SSy, respectively, with multiplier 1/(n − 1);
  the calculated values of SSE and sε remain unchanged.
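The Example 17.4 numbers can be reproduced directly from the
summary statistics:

```python
import math

# Example 17.4: standard error of estimate from the lecture's
# summary statistics.
n = 100
SS_y, SS_x, SS_xy = 57.8931, 4307.378, -403.6207

sse = SS_y - SS_xy**2 / SS_x
s_eps = math.sqrt(sse / (n - 2))
print(round(sse, 3))             # 20.072
print(round(s_eps, 4))           # 0.4526
print(round(s_eps / 16.24, 3))   # 0.028: s is about 2.8% of y-bar
```

The last line anticipates the next slide: sε is judged relative
to the sample mean of y.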

Example 17.4: Solution (contd.)

    sε = 0.4526,   ȳ = 16.24

• We judge the value of the standard error of estimate, sε, by
  comparing it to the values of the dependent variable y, or
  more specifically to the sample mean of y.
• In this example sε is only about 2.8% of the sample mean of
  y, so we can conclude that the standard error of estimate is
  reasonably small.
• sε cannot be used alone as an absolute measure of the
  model's utility, but it can be used to compare models.

Testing the Slope

• When no linear relationship exists between two variables,
  the regression line should be horizontal.

[Figure: two scatter plots. Left: the slope is not equal to
zero — a linear relationship; different inputs (x) yield
different outputs (y). Right: the slope is equal to zero — no
linear relationship; different inputs (x) yield the same
output (y).]
• We can draw inferences about β1 from β̂1 by testing

      H0: β1 = 0
      HA: β1 ≠ 0   (or < 0, or > 0)

  – The test statistic is

        t = (β̂1 − β1) / sβ̂1,   where   sβ̂1 = sε / √SSx

    is the standard error of β̂1.
  – If the error variable is normally distributed, the
    statistic has a Student t distribution with d.f. = n − 2.

Example 17.6 (Example 17.3, contd.)

• Test, at the 5% significance level, whether there is enough
  evidence to infer that a linear relationship exists between
  the price and the odometer reading.

Solution (solving by hand)
  – To compute t we need the values of β̂1 and sβ̂1:

        β̂1 = −0.0937
        sβ̂1 = sε / √SSx = 0.4526 / √4307.378 = 0.0069

        t = (β̂1 − β1) / sβ̂1 = (−0.0937 − 0) / 0.0069 = −13.59
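The hand calculation can be verified in a few lines of Python:

```python
import math

# Example 17.6: t statistic for H0: beta1 = 0.
b1_hat, s_eps, SS_x = -0.0937, 0.4526, 4307.378

s_b1 = s_eps / math.sqrt(SS_x)   # standard error of the slope estimate
t = (b1_hat - 0) / s_b1
print(round(s_b1, 4))            # 0.0069
print(round(t, 2))               # -13.59
print(abs(t) > 1.984)            # True: reject H0 at the 5% level
```

The critical value 1.984 is t(0.025, 98) from the next slide's
decision rule.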

Decision rule
  – Reject H0 if |t| > t0.025,98 = 1.984.
  – Since the calculated t-value (−13.59) satisfies |t| > 1.984,
    we reject H0 and conclude that the odometer reading does
    affect the selling price.

Using the computer (Excel regression output)

               Coefficients    Standard Error   t Stat    P-value
  Intercept     19.61139281     0.252410094    77.69655   7.53E-90
  Odometer (x)  -0.093704502    0.006895663   -13.5889    2.84E-24

• Looking at the p-value, there is overwhelming evidence to
  infer that the odometer reading affects the auction selling
  price.

Coefficient of Determination

• When we want to measure the strength of the linear
  relationship, we use the coefficient of determination:

      R² = SSxy² / (SSx·SSy)   or   R² = 1 − SSE/SSy

  (in textbook notation, R² = sxy²/(sx²·sy²) or
   R² = 1 − SSE/[(n − 1)sy²]).

• To understand the significance of this coefficient, note
  that the overall variability of y is:
  – partly explained by the regression model;
  – partly unexplained by the regression model (i.e. explained
    by random errors).

• Two data points (x1, y1) and (x2, y2) of a certain sample
  are shown.

[Figure: the two points, the fitted line ŷ, and the mean ȳ,
illustrating the decomposition of each deviation yi − ȳ.]

  Total variation in y
    = variation explained by the regression line
      + unexplained variation (error):

  (y1 − ȳ)² + (y2 − ȳ)²
    = [(ŷ1 − ȳ)² + (ŷ2 − ȳ)²] + [(y1 − ŷ1)² + (y2 − ŷ2)²]

  SST = SSR + SSE,   where SST = variation in y.

• R² measures the proportion of the variation in y that is
  explained by the variation in x:

      R² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST = 1 − SSE/SSy

• R² takes on any value between zero and one.
  – R² = 1: perfect match between the line and the data
    points.
  – R² = 0: there is no linear relationship between x and y.
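The decomposition SST = SSR + SSE, and hence R², can be
verified numerically on a small made-up data set:

```python
# Verify SST = SSR + SSE and R^2 = SSR/SST for the least squares
# line on illustrative data (not from the textbook).
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
SS_x  = sum(x * x for x in xs) - n * x_bar**2
SS_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
b1 = SS_xy / SS_x
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar)**2 for y in ys)              # total variation
ssr = sum((f - y_bar)**2 for f in fitted)          # explained
sse = sum((y - f)**2 for y, f in zip(ys, fitted))  # unexplained

print(abs(sst - (ssr + sse)) < 1e-9)  # True: SST = SSR + SSE
r2 = ssr / sst
print(round(r2, 4))                   # 0.0156
```

Here R² is near zero: in this toy data the variation in x
explains almost none of the variation in y, i.e. a very weak
linear relationship.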

Example 17.7 (Example 17.3, contd.)

• Find the coefficient of determination for Example 17.3; what
  does this statistic tell you about the model?

Solution (by hand, or from the regression output):

      R² = 1 − SSE/SSy = 1 − 20.07/57.89 = 0.6533

• About 65% of the variation in the auction selling price is
  explained by the variation in odometer reading. The rest
  (35%) remains unexplained by this model.

Summary: pages 763 - 764

Home assignment:
- Section 17.1 Exercises, page 711: 17.1, 17.3
- Section 17.2 Exercises, pages 720 - 721: 17.4, 17.7, 17.10,
  17.13
- Section 17.4 Exercises, pages 739 - 740: 17.31, 17.33, 17.35
- Supplementary exercise, page 766: 17.76