Hsuhl LRClass Chap2
許湘伶
Applied Linear Regression Models
(Kutner, Nachtsheim, Neter, Li)
Yi = β0 + β1 Xi + εi
Illustration
A market research analyst studying the relation between sales (Y) and advertising expenditures (X):
to obtain an interval estimate of β1, which would provide information as to how many additional sales dollars, on average, are generated by an additional dollar of advertising expenditure
Test:
H0 : β1 = 0;
Ha : β1 ≠ 0.
Point estimator b1:
b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
Mean: E{b1} = β1
Variance: σ²{b1} = σ² / Σ(Xi − X̄)²
- b1 is a linear combination of the Yi:
  b1 = Σ ki Yi, where ki = (Xi − X̄) / Σ(Xi − X̄)² is a function of the Xi only
- Properties of the ki:
  Σ ki = 0
  Σ ki Xi = 1
  Σ ki² = 1 / Σ(Xi − X̄)²
Normality
The sampling distribution of b1 is normal, with
Mean: E{b1} = β1
Variance: σ²{b1} = σ² / Σ(Xi − X̄)²
Estimate σ² by MSE:
MSE = SSE/(n − 2) = Σ(Yi − Ŷi)² / (n − 2)
s²{b1} = MSE / Σ(Xi − X̄)²
Theorem 1
The estimator b1 has minimum variance among all unbiased linear estimators of the form
β̂1 = Σ ci Yi
ci: arbitrary constants with Σ ci = 0 and Σ ci Xi = 1
- Unbiased: E{β̂1} = β1
- σ²{β̂1} = σ² Σ ci²
Write ci = ki + di, where ki = (Xi − X̄) / Σ(Xi − X̄)² and the di are arbitrary constants. Then:
Σ ki di = 0
σ² Σ ki² = σ²{b1}
σ²{β̂1} is at a minimum when Σ di² = 0 ⇔ all di = 0 ⇔ ci = ki
hsuhl (NUK) LR Chap 2 13 / 118
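The minimizing step in Theorem 1 rests on a variance decomposition that the slide states only implicitly; written out from the quantities defined above:

```latex
\sigma^2\{\hat\beta_1\}
  = \sigma^2 \sum c_i^2
  = \sigma^2 \sum (k_i + d_i)^2
  = \sigma^2 \sum k_i^2 + \sigma^2 \sum d_i^2 + 2\sigma^2 \sum k_i d_i .
```

Since Σ ki di = 0, this gives σ²{β̂1} = σ²{b1} + σ² Σ di², which is smallest exactly when all di = 0, i.e. ci = ki.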
Sampling distribution of (b1 − β1 )/s{b1 }
Theorem 2
(b1 − β1)/s{b1} ∼ t(n − 2)

Proof of (b1 − β1)/s{b1} ∼ t(n − 2):
Theorem 3
For regression model (2.1),
SSE/σ² ∼ χ²(n − 2), and is independent of b0 and b1.
Theorem 4
t Distribution
Let Z and χ²(ν) be independent random variables (standard normal and χ², respectively). A t random variable is defined as follows:
t(ν) = Z / √(χ²(ν)/ν), where Z and χ²(ν) are indep.
Then
(b1 − β1)/s{b1} = [(b1 − β1)/σ{b1}] ÷ [s{b1}/σ{b1}] ∼ Z / √(χ²(n − 2)/(n − 2))
⇒ (b1 − β1)/s{b1} ∼ t(n − 2)
P{t(α/2; n − 2) ≤ (b1 − β1)/s{b1} ≤ t(1 − α/2; n − 2)} = 1 − α
⇒ P{b1 − t(1 − α/2; n − 2) s{b1} ≤ β1 ≤ b1 + t(1 − α/2; n − 2) s{b1}} = 1 − α
## method 1
sb1<-sqrt(MSE/sum((Size-mean(Size))^2)) # s{b1}
tdf<-qt(1-0.05/2,n-2)                   # t(0.975; n-2), needed below
conf<-c(b1-tdf*sb1,b1+tdf*sb1)          # 95% C.I. for beta1
conf
## method 2
fitreg<-lm(Hrs~Size)
summary(fitreg)
confint(fitreg)
s²{b1} = MSE / Σ(Xi − X̄)² = 0.120404
s{b1} = 0.3470
t(0.975; 23) = 2.069
The 95 percent confidence interval:
2.85 ≤ β1 ≤ 4.29
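Putting the stated values together, the 95 percent limits follow from b1 ± t(0.975; 23) s{b1}:

```latex
3.5702 \pm 2.069(0.3470) = 3.5702 \pm 0.718
\;\Rightarrow\; 2.85 \le \beta_1 \le 4.29 .
```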
Tests concerning β1
(b1 − β1)/s{b1} ∼ t(n − 2)
Two-Sided Test
H0 : β1 = 0 vs. Ha : β1 ≠ 0
Tests concerning β1
α = 0.05; n = 25
t(0.975; 23) = 2.069
b1 = 3.5702
|t*| = |3.5702/0.3470| = 10.29 > 2.069 ⇒ conclude Ha : β1 ≠ 0
p-value: P{t(23) > t* = 10.29} < 0.0005
One-Sided Test
H0 : β1 ≤ 0 vs. Ha : β1 > 0
Sampling distribution of b0
Theorem 5
For regression model (2.1), the sampling distribution of b0 is
normal, with mean and variance:
E{b0 } = β0
σ²{b0} = σ²[1/n + X̄² / Σ(Xi − X̄)²]
An estimator of σ²{b0}:
s²{b0} = MSE[1/n + X̄² / Σ(Xi − X̄)²]
Theorem 6
(b0 − β0)/s{b0} ∼ t(n − 2)
## method 1
sb0<-sqrt(MSE*(1/n+mean(Size)^2/sum((Size-mean(Size))^2))) # s{b0}
b0<-mean(Hrs)-b1*mean(Size)
tdf0<-qt(1-0.1/2,n-2)
conf0<-c(b0-tdf0*sb0,b0+tdf0*sb0)
conf0
## method 2
fitreg<-lm(Hrs~Size)
confint(fitreg,level = 0.9)
Confidence interval for β0
Toluca Company
Range of X : [20,120]
t(0.95; 23) = 1.714
s²{b0} = MSE[1/n + X̄² / Σ(Xi − X̄)²] = 685.34
s{b0} = 26.18
The 90% C.I. for β0 :
17.5 ≤ β0 ≤ 107.2
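Using b0 = 62.37 (the intercept of the fitted Toluca line quoted later in these notes), the 90% limits follow from b0 ± t(0.95; 23) s{b0}:

```latex
62.37 \pm 1.714(26.18) = 62.37 \pm 44.87
\;\Rightarrow\; 17.5 \le \beta_0 \le 107.2 .
```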
Power of Tests
The power of the test at level α: the probability that the decision rule leads to conclusion Ha when Ha in fact holds
Power = P{|t*| > t(1 − α/2; n − 2) | δ}
δ = |β1 − β10| / σ{b1}   (Appendix Table B.5)
Common objective
To estimate the mean of one or more probability distributions of Y.
Illustration
A study of the relation between level of piecework pay (X) and worker productivity (Y).
The mean productivity at high and medium levels of piecework pay may be of particular interest for purposes of analyzing the benefits obtained from an increase in the pay.
Variance:
σ²{Ŷh} = σ²[1/n + (Xh − X̄)² / Σ(Xi − X̄)²]
Properties of Ŷh
When Xh = 0: Var(Ŷh) = Var(b0) and s²{Ŷh} = s²{b0}
b1 and Ȳ are uncorrelated ⇔ σ{Ȳ, b1} = 0
s²{Ŷh} = MSE[1/n + (Xh − X̄)² / Σ(Xi − X̄)²]
Theorem 7
(Ŷh − E{Yh})/s{Ŷh} ∼ t(n − 2)
Example 1
Toluca Company example
Find a 90 percent confidence interval for E{Yh} when the lot size is Xh = 65 units.
Ŷh = 62.37 + 3.5702(65) = 294.4
s²{Ŷh} = 2,384[1/25 + (65 − 70.00)²/19,800] = 98.37 ⇒ s{Ŷh} = 9.918
t(.95; 23) = 1.7141
277.4 ≤ E{Yh } ≤ 311.4
Conclude: the mean number of work hours required when
lots of 65 units are produced is somewhere between 277.4 and
311.4 hours.
the estimate of the mean number of work hours is moderately
precise.
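The interval quoted above follows from Ŷh ± t(0.95; 23) s{Ŷh}, using the values just computed:

```latex
294.4 \pm 1.7141(9.918) = 294.4 \pm 17.0
\;\Rightarrow\; 277.4 \le E\{Y_h\} \le 311.4 .
```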
## method 1
EYhCI<-c(hYh-tdf90*sYh,hYh+tdf90*sYh) # 90% C.I. for E{Yh}
## method 2
fitreg<-lm(Hrs~Size,data=toluca)
summary(fitreg)
predXCI<-predict(fitreg,data.frame(Size = c(Xh)),
interval = "confidence", se.fit = FALSE, level = 0.9)
plot(Size, Hrs)
abline(fitreg,col="blue")
Example 2
Toluca Company example
Estimate E{Yh } for lots with Xh = 100 units with a 90
percent confidence interval
Ŷh = 62.37 + 3.5702(100) = 419.4
s²{Ŷh} = 2,384[1/25 + (100 − 70.00)²/19,800] = 203.72
s{Ŷh} = 14.27
t(.95; 23) = 1.7141
394.9 ≤ E{Yh } ≤ 443.9
Conclude: the confidence interval is somewhat wider than that for Example 1, since the Xh level here is substantially farther from the mean X̄ = 70 than is Xh = 65.
β0 = 0.10 β1 = 0.95
E{Y } = 0.10 + 0.95X
σ = 0.12
An applicant whose high school GPA is Xh = 3.5:
E{Yh} = 0.10 + 0.95(3.5) = 3.425
E{Yh} ± 3σ: 3.425 ± 3(0.12) ⇒ 3.065 ≤ Yh(new) ≤ 3.785
(Yh(new) − Ŷh)/s{pred} ∼ t(n − 2) for normal error regression model (2.1)
s²{pred} = MSE + s²{Ŷh} = MSE[1 + 1/n + (Xh − X̄)² / Σ(Xi − X̄)²]
Example
Toluca Company: Xh = 100
90 percent prediction interval: t(0.95; 23) = 1.714
## Example (p59)
fitreg<-lm(Hrs~Size,data=toluca)
Xh<-100
fitnew<-predict.lm(fitreg,data.frame(Size = c(Xh)),
se.fit = TRUE, level = 0.9)
s2pred<-fitnew$se.fit^2+fitnew$residual.scale^2 # s2{pred} = s2{Yhat_h} + MSE
c(s2pred,sqrt(s2pred))
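Combining the quantities above (MSE = 2,384 and s²{Ŷh} = 203.72 from Example 2) gives the 90 percent prediction limits:

```latex
s^2\{\mathrm{pred}\} = 2{,}384 + 203.72 = 2{,}587.72,
\qquad s\{\mathrm{pred}\} = 50.87,
```

so Ŷh ± t(0.95; 23) s{pred} = 419.4 ± 1.714(50.87), i.e. 332.2 ≤ Yh(new) ≤ 506.6.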
Toluca Company:
The prediction interval for Yh(new) is wider than the C.I. for E{Yh}: in predicting the work hours required for a new lot, we encounter both the variability of Ŷh from sample to sample and the lot-to-lot variation within the probability distribution of Y.
(2.38a):
The prediction interval is wider the further Xh is from X̄
Example
Toluca Company: Xh = 100
90 percent prediction interval for the mean number of work
hours Ȳh(new) in three new production runs
Previous results:
Ŷh = 419.4; s²{Ŷh} = 203.72
MSE = 2,384; t(0.95; 23) = 1.714
⇒ s²{predmean} = MSE/m + s²{Ŷh} = 2,384/3 + 203.72 = 998.4   (m = 3 new runs)
s{predmean} = 31.60
⇒ 419.4 ± 1.714(31.60), i.e. 365.2 ≤ Ȳh(new) ≤ 473.6
SSTO = Σ(Yi − Ȳ)²
If all Yi are the same ⇒ SSTO = 0
The greater the variation among the Yi, the larger is SSTO.
SSE: error sum of squares
SSE = Σ(Yi − Ŷi)²
If all Yi fall on the fitted regression line ⇒ SSE = 0
The greater the variation of the Yi around the fitted regression line, the larger is SSE.
SSR = Σ(Ŷi − Ȳ)²
The regression line is horizontal ⇒ SSR = 0; otherwise SSR > 0
a measure associated with the regression line
The larger SSR is in relation to SSTO, the greater is the
effect of the regression relation in accounting for the total
variation in the Yi observations.
Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)
(total deviation) = (deviation of fitted regression value around mean) + (deviation around fitted regression line)
Two components:
- The deviation of the fitted value Ŷi around the mean Ȳ.
- The deviation of the observation Yi around the fitted regression line.
The cross-product term vanishes: 2Σ(Ŷi − Ȳ)(Yi − Ŷi) = 0
SSR = b1² Σ(Xi − X̄)²
Degrees of freedom: SSR has 1; SSE has n − 2 (∵ β0, β1 are estimated in Ŷi)
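The identity SSR = b1² Σ(Xi − X̄)² follows by substituting b0 = Ȳ − b1 X̄ into the fitted values:

```latex
SSR = \sum (\hat Y_i - \bar Y)^2
    = \sum \bigl(b_0 + b_1 X_i - \bar Y\bigr)^2
    = \sum \bigl(b_1 (X_i - \bar X)\bigr)^2
    = b_1^2 \sum (X_i - \bar X)^2 .
```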
Mean Squares
SS      df      MS
SSR     1       MSR = SSR/1 (regression mean square)
SSE     n − 2   MSE = SSE/(n − 2) (error mean square)
SSTO    n − 1   MSTO = SSTO/(n − 1) ≠ MSR + MSE
ANOVA table
SSTO = Σ(Yi − Ȳ)² = ΣYi² − nȲ²
SS(correction for mean) = nȲ²
E{MSE} = σ²
E{MSR} = σ² + β1² Σ(Xi − X̄)²
F test of β1 = 0 vs. β1 ≠ 0
H0 : β1 = 0
Ha : β1 ≠ 0
Test statistic:
F* = MSR/MSE
Large values of F* ⇒ Ha; values of F* near 1 ⇒ H0
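For simple linear regression the F test and the two-sided t test of β1 = 0 are equivalent, since

```latex
F^* = \frac{MSR}{MSE}
    = \frac{b_1^2 \sum (X_i - \bar X)^2}{MSE}
    = \frac{b_1^2}{s^2\{b_1\}}
    = (t^*)^2 ,
```

and correspondingly [t(1 − α/2; n − 2)]² = F(1 − α; 1, n − 2). For the Toluca example, (10.29)² ≈ 105.9.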
Analysis of Variance Approach to Regression Analysis
Cochran’s theorem
If all n observations Yi come from the same normal distribution with mean µ and variance σ², and SSTO is decomposed into k sums of squares SSr, each with degrees of freedom dfr, then the SSr/σ² terms are independent χ² variables with dfr degrees of freedom if
Σ (r = 1 to k) dfr = n − 1
Property
If β1 = 0 so that all Yi have the same mean µ = β0 and the
same variance σ 2 , SSE/σ 2 and SSR/σ 2 are independent χ2
variables.
When H0 holds:
F* = MSR/MSE = [(SSR/σ²)/1] ÷ [(SSE/σ²)/(n − 2)] ∼ [χ²(1)/1] ÷ [χ²(n − 2)/(n − 2)] = F(1, n − 2)
When H0 holds, F* ∼ F(1, n − 2); when Ha holds, F* follows the noncentral F distribution.
The decision rule (α = Type I error rate):
If F* ≤ F(1 − α; 1, n − 2), conclude H0; if F* > F(1 − α; 1, n − 2), conclude Ha
Example
Toluca Company
Earlier test on β1 : (the t test, p46)
Two-Sided Test
H0 : β1 = β10
If |t*| = |(b1 − β10)/s{b1}| ≤ t(1 − α/2; n − 2), conclude H0
If |t*| > t(1 − α/2; n − 2), conclude Ha
Example
H0 : β1 = β10 = 0
Ha : β1 ≠ β10 = 0
We have
F* = MSR/MSE = 252,378/2,384 = 105.9
What is the conclusion?
H0 : β1 = 0   Ha : β1 ≠ 0
1. Full Model
Yi = β0 + β1 Xi + εi
SSE(F):
SSE(F) = Σ[Yi − (b0 + b1 Xi)]² = Σ(Yi − Ŷi)² = SSE
2. Reduced Model
Hypothesis:
H0 : β1 = 0   Ha : β1 ≠ 0
Under H0 the model reduces to Yi = β0 + εi
SSE(R):
SSE(R) = Σ(Yi − b0)² = Σ(Yi − Ȳ)² = SSTO   (in the reduced model, b0 = Ȳ)
3. Test Statistic
SSE(F ) ≤ SSE(R)
The more parameters are in the model, the better one can fit
the data and the smaller are the deviations around the fitted
regression function.
When SSE(F) is not much less than SSE(R), using the full model does not account for much more of the variability of the Yi than does the reduced model: the variation of the observations around the fitted regression function for the full model is almost as great as that around the fitted regression function for the reduced model. This suggests that H0 holds.
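The comparison just described is formalized by the general linear test statistic; specialized to simple regression (dfR = n − 1, dfF = n − 2) it reduces to the ANOVA F*:

```latex
F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F}
    = \frac{SSTO - SSE}{(n-1)-(n-2)} \div \frac{SSE}{n-2}
    = \frac{MSR}{MSE} .
```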
Decision rule:
If F* ≤ F(1 − α; dfR − dfF, dfF), conclude H0; if F* > F(1 − α; dfR − dfF, dfF), conclude Ha
Coefficient of Determination
R² = SSR/SSTO = 1 − SSE/SSTO
R² = 0 ⇒ no linear relation between X and Y:
There is no linear association between X and Y in the sample
data, and the predictor variable X is of no help in reducing the
variation in Yi with linear regression.
R² → 1 ⇒ a stronger linear relation between X and Y:
The closer it is to 1, the greater is said to be the degree of linear
association between X and Y .
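As a worked example, using the Toluca Company values quoted in these notes (SSR = 252,378, and SSTO = 307,203 as reported in the text's ANOVA table for this data set):

```latex
R^2 = \frac{SSR}{SSTO} = \frac{252{,}378}{307{,}203} \approx 0.822 ,
```

i.e., about 82 percent of the variation in work hours is accounted for by lot size.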
Coefficient of Correlation
Marginal Distribution
Y1, Y2 ∼ N2(µ1, µ2, σ1, σ2, ρ12):
Y1 ∼ N(µ1, σ1²) ⇒ f1(Y1) = [1/(√(2π) σ1)] exp{−(1/2)[(Y1 − µ1)/σ1]²}
Y2 ∼ N(µ2, σ2²) ⇒ f2(Y2) = [1/(√(2π) σ2)] exp{−(1/2)[(Y2 − µ2)/σ2]²}
ρ12 = σ{Y1, Y2}/(σ{Y1} σ{Y2}) = σ12/(σ1 σ2)
Y1 ⊥ Y2 ⇒ σ12 = 0 ⇒ ρ12 = 0
If Y1 and Y2 are positively related⇒ σ12 and ρ12 are positive.
−1 ≤ ρ12 ≤ 1
α1|2 = µ1 − µ2 ρ12 (σ1/σ2)
β12 = ρ12 (σ1/σ2)
σ²1|2 = σ1²(1 − ρ12²)
(ρ12 = 0 ⇒ Y1 ⊥ Y2)
Test statistic:
t* = r12 √(n − 2) / √(1 − r12²) ∼ t(n − 2)
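This t* is numerically identical to the t* for testing β1 = 0, since r12² equals R² in simple regression:

```latex
(t^*)^2 = \frac{(n-2)\, r_{12}^2}{1 - r_{12}^2}
        = \frac{(n-2)\, R^2}{1 - R^2}
        = \frac{SSR}{SSE/(n-2)}
        = \frac{MSR}{MSE}
        = F^* .
```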
With z′ = (1/2) ln[(1 + r12)/(1 − r12)] (the Fisher z transformation), ς the corresponding transform of ρ12, and σ{z′} = 1/√(n − 3):
(z′ − ς)/σ{z′} ∼ N(0, 1) approximately
⇒ z′ ± z(1 − α/2) σ{z′}
A C.I. for ρ12 can be employed to test whether or not ρ12 has
a specified value. (ex. 0.5)
0 ≤ ρ212 ≤ 1: measures the relative reduction in the variability
of Y2 associated with the use of variable Y1 .
ρ12² = (σ1² − σ²1|2)/σ1² = (σ2² − σ²2|1)/σ2²
Test: (two-sided)