Chapter 17
Simple Linear Regression and Correlation

17.1 Regression model
17.2 Estimating the coefficients
17.3 Error variable: required conditions
17.4 Assessing the regression model
Regression Model Types

Mathematical equations used in regression analysis are called regression models, and they fall into two types: deterministic or probabilistic.

[Figure: deterministic model of house price as a function of house size]
Probabilistic Model

In real life, however, the house price will vary even among houses of the same size:

  House price = 210000 + 6000(Size) + ε
  y = 210000 + 6000x + ε

[Figure: distributions of house price around the line, showing lower vs. higher variability]
Simple Linear Regression Model

  y = β₀ + β₁x + ε

where
  y  = dependent variable
  x  = independent variable
  β₀ = y-intercept
  β₁ = slope of the line (rise/run)
  ε  = error variable

β₀ and β₁ are population parameters which are usually unknown and therefore are estimated from the data:

  ŷ = b̂₀ + b̂₁x

[Figure: the regression line, with y-intercept β₀ and slope β₁ = rise/run]
Recall: this is an application of the least squares
method and it produces a straight line that minimizes
the sum of the squared differences between the points
and the line.
Least Squares Method

[Figure: scatter of the points (1, 2), (2, 4), (3, 1.5), (4, 3.2) with two candidate lines drawn through them]

The smaller the sum of squared differences, the better the fit of the line to the data. Line 2 better fits the data.
Least Squares Estimates

To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:

  b̂₁ = [Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n] / [Σxᵢ² − (Σxᵢ)²/n]   or   b̂₁ = (Σxᵢyᵢ − n x̄ ȳ) / (Σxᵢ² − n x̄²)

  b̂₀ = ȳ − b̂₁ x̄

Now we define:

  SSx  = Σxᵢ² − (Σxᵢ)²/n  = Σxᵢ² − n x̄²             sx²  = SSx / (n − 1)
  SSy  = Σyᵢ² − (Σyᵢ)²/n  = Σyᵢ² − n ȳ²             sy²  = SSy / (n − 1)
  SSxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = Σxᵢyᵢ − n x̄ ȳ       sxy  = SSxy / (n − 1)

so that b̂₁ = SSxy / SSx.
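As a sketch, the estimation formulas can be coded directly. The four (x, y) points below are the illustrative ones from the least squares scatter above, not data from the textbook examples:

```python
# Least squares estimates: b1_hat = SS_xy / SS_x and b0_hat = ybar - b1_hat * xbar.
# The four (x, y) points are the illustrative ones from the scatter figure.
x = [1, 2, 3, 4]
y = [2, 4, 1.5, 3.2]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

SS_x = sum(xi**2 for xi in x) - n * xbar**2                     # sum of squares of x
SS_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar  # sum of cross-products

b1_hat = SS_xy / SS_x            # slope estimate
b0_hat = ybar - b1_hat * xbar    # intercept estimate
print(round(b0_hat, 4), round(b1_hat, 4))   # 2.4 0.11
```

So for these four points the fitted line is ŷ = 2.4 + 0.11x.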
Example 17.3, page 717

[Scatter plot of the data, with price as the dependent variable y]
Using the computer (see file Example17.3.xls)
Data > Data Analysis > Regression >
[Highlight the data y range and x range] > OK
SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.808265
  R Square            0.653292
  Adjusted R Square   0.649754
  Standard Error      0.452567
  Observations        100

                Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
  Intercept       19.61139       0.25241        77.69655   7.53E-90   19.11049321    20.1122924
  Odometer (x)    -0.0937        0.006896      -13.5889    2.84E-24   -0.107388721   -0.0800203
[Scatter plot: Price (y) vs. Odometer (x), with the fitted line crossing the price axis at 19.611]

  ŷ = 19.611 − 0.094x
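Once the coefficients are estimated, a point prediction comes from plugging x into the fitted line. A minimal sketch using the coefficients from the regression output; the odometer value 40 is just an illustrative input in the same units as the data:

```python
# Fitted line from the regression output: y_hat = b0_hat + b1_hat * x
b0_hat = 19.611     # intercept
b1_hat = -0.0937    # slope per unit of odometer reading

def predict(x):
    """Point prediction of price for a given odometer reading."""
    return b0_hat + b1_hat * x

print(round(predict(40), 3))   # 15.863
```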
17.3 Error variable: Required conditions

• The error ε is a critical part of the regression model.
• Five requirements involving the distribution of ε must be satisfied:
  – The probability distribution of ε is normal.
  – The mean of ε is zero: E(ε) = 0.
  – The standard deviation of ε is a constant (σ_ε) for all values of x.
  – The errors are independent.
  – The errors are independent of the independent variable x.
[Figure: normal distributions of y centered at μᵢ = β₀ + β₁xᵢ, shown at x₁, x₂, x₃; the standard deviation remains constant. For simplicity we denote E(Y|X = x) as E(y|x).]
17.4 Assessing the model

• The least squares method will produce a regression line whether or not there is a linear relationship between x and y.
• The sum of squares for error measures how well the line fits the data:

  SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

  A shortcut formula: SSE = SSy − SSxy² / SSx

  (or SSE = (n − 1)(sy² − sxy² / sx²))

• This statistic plays a role in every statistical technique we employ to assess the model.
Standard error of estimate

The standard error of estimate is s_ε = √(SSE / (n − 2)).
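Continuing the sketch with the four illustrative points from the least squares scatter, (1, 2), (2, 4), (3, 1.5), (4, 3.2), the shortcut formula for SSE and the standard error of estimate look like this:

```python
import math

# SSE via the shortcut SSE = SS_y - SS_xy**2 / SS_x, and the
# standard error of estimate s_eps = sqrt(SSE / (n - 2)).
# The data are the illustrative four points, not the textbook example.
x = [1, 2, 3, 4]
y = [2, 4, 1.5, 3.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

SS_x = sum(xi**2 for xi in x) - n * xbar**2
SS_y = sum(yi**2 for yi in y) - n * ybar**2
SS_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

SSE = SS_y - SS_xy**2 / SS_x
s_eps = math.sqrt(SSE / (n - 2))
print(round(SSE, 4), round(s_eps, 4))   # 3.807 1.3797
```

The large s_ε relative to ȳ ≈ 2.675 signals a poor fit here, which matches the scatter: these four points are barely linear.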
Example 17.4: Solution (contd.)

  s_ε = 0.4526,  ȳ = 16.24

• We judge the value of the standard error of estimate, s_ε, by comparing it to the values of the dependent variable, y, or more specifically to the sample mean of y.
• In this example, s_ε is only 2.8% of the sample mean of y, so we can conclude that the standard error of estimate is reasonably small.
• s_ε cannot be used alone as an absolute measure of the model's utility, but it can be used to compare models.
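The 2.8% figure is simply the ratio of s_ε to ȳ, using the two values quoted above:

```python
# Relative size of the standard error of estimate: s_eps / ybar,
# with the values from Example 17.4.
s_eps, ybar = 0.4526, 16.24
ratio = s_eps / ybar
print(f"{ratio:.1%}")   # 2.8%
```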
• We can draw inference about β₁ from b̂₁ by testing
  H₀: β₁ = 0
  Hₐ: β₁ ≠ 0 (or < 0, or > 0)
– The test statistic is:

  t = (b̂₁ − β₁) / s_b̂₁   where   s_b̂₁ = s_ε / √SSx
Solution

Solving by hand
– To compute t we need the values of b̂₁ and s_b̂₁:

  b̂₁ = −0.0937
  s_b̂₁ = s_ε / √SSx = 0.4526 / √4,307.378 = 0.0069

  t = (b̂₁ − β₁) / s_b̂₁ = (−0.0937 − 0) / 0.0069 = −13.59
Decision rule
– Reject H₀ if |t| > t₀.₀₂₅,₉₈ = 1.984.
– Comparing the decision rule with the calculated t-value (−13.59), we reject H₀ and conclude that the odometer reading does affect the sale price.
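The hand computation of the test statistic can be sketched directly, using s_ε, SSx, and b̂₁ from the example:

```python
import math

# t test for the slope: t = (b1_hat - beta1_0) / s_b1,
# with s_b1 = s_eps / sqrt(SS_x). Values are from the example above.
b1_hat = -0.0937
s_eps = 0.4526
SS_x = 4307.378

s_b1 = s_eps / math.sqrt(SS_x)    # standard error of b1_hat
t = (b1_hat - 0) / s_b1           # test statistic under H0: beta1 = 0
print(round(s_b1, 4), round(t, 2))   # 0.0069 -13.59
```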
Using the computer
Excel regression output:

                Coefficients    Standard Error   t Stat     P-value
  Intercept      19.61139281     0.252410094     77.69655   7.53E-90
  Odometer (x)   -0.093704502    0.006895663    -13.5889    2.84E-24
Coefficient of determination

• When we want to measure the strength of the linear relationship, we use the coefficient of determination:

  R² = SSxy² / (SSx · SSy)   or   R² = 1 − SSE / SSy

  (equivalently, R² = sxy² / (sx² sy²) or R² = 1 − SSE / [(n − 1) sy²])
Two data points (x₁, y₁) and (x₂, y₂) of a certain sample are shown.

[Figure: the two points and the regression line, decomposing each point's deviation from ȳ into an explained part and an unexplained part]

Total variation in y = variation explained by the regression line + unexplained variation (error)
Example 17.7 (Example 17.3, contd.)

Find the coefficient of determination for Example 17.3; what does this statistic tell you about the model?

Solution

Solving by hand
– R² = 1 − SSE / SSy = 1 − 20.07 / 57.89 = 0.6533

Using the computer
– From the regression output we have:

  Regression Statistics
    Multiple R          0.8063
    R Square            0.6501
    Adjusted R Square   0.6466
    Standard Error      151.57
    Observations        100

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
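The hand computation above is one line of arithmetic; as a sketch with the SSE and SSy values from the example:

```python
# Coefficient of determination via R^2 = 1 - SSE / SS_y,
# using the values from the hand computation in Example 17.7.
SSE = 20.07
SS_y = 57.89

R2 = 1 - SSE / SS_y
print(round(R2, 4))   # 0.6533
```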
Home assignment: 32