
Chapter 2.

Simple linear regression

1. Definition of simple linear regression

1.1. Standard model


Y = β0 + β1 X + ε
• Y dependent variable
• X independent variable
• ε random error term
• Parameters
o β0 intercept
o β1 slope

(a) Data setting


yi = β0 + β1 xi + εi, i =1, 2, …, n

• β0, β1
o constant parameters
• xi
o nonrandom, observed with negligible error
• εi
o Random
o Zero mean, E(εi) = 0
o Constant variance, Var(εi) = σ2
o Uncorrelated, Cov(εi, εj) = 0, for i ≠ j
o Usually assumed independent identically distributed (i.i.d.)

(b) Properties
(i) Dependent variable yi
• yi is a random variable
• E(yi) = β0 + β1 xi
• E(Y|X) = β0 + β1 X
o Mean of Y conditional on X
• Var(yi) = σ2
o Independent of xi
• Cov(yi, yj) = 0, for i ≠ j
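A minimal simulation sketch of these properties in Python. The parameter values β0 = 10, β1 = 2, σ = 3 and the value x = 40 are illustrative choices, not taken from any data set in these notes: at a fixed x, yi is random with mean β0 + β1 x and variance σ².

import numpy as np

# Simulate the model y_i = beta0 + beta1*x_i + eps_i at one fixed x value.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 10.0, 2.0, 3.0   # illustrative values
x_fixed = 40.0                          # nonrandom, observed without error

# Many replications of y at the same x: the errors are i.i.d. with mean 0
# and variance sigma^2, so y is random even though x is not.
eps = rng.normal(0.0, sigma, size=100_000)
y = beta0 + beta1 * x_fixed + eps

print(y.mean())   # ~ beta0 + beta1*x = 90   (E(y_i) = beta0 + beta1*x_i)
print(y.var())    # ~ sigma^2 = 9            (Var(y_i) = sigma^2, free of x_i)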

(ii) Regression parameters


• β0 and β1 are called regression coefficients
o Depend on the units used for Y and X
• β1 is the change in E(Y) per unit increase of X
• The interpretation of β0 depends on the range of X in the data
o When X = 0 is included, β0 = E(Y|X=0)
o When X = 0 is far away from the data, β0 has no particular meaning
o Usually just call β0 the intercept

1.2. Alternative model


• Dummy variable
yi = β0 xi0 + β1 xi + εi
o xi0 ≡ 1 for all i

• Centered linear regression model
yi = β0* + β1(xi − x̄) + εi
o β0* = ?  Express it in terms of the components of the base model.

2. Estimation

2.1. Least squares estimates for β0 and β1


• The linear regression model assumes that the relation between X and Y is linear.
o What would be the exact relation between X and Y?
• β0 and β1 are the true but unknown parameters under the linear relationship.
o What are the actual values of the parameters?

(a) Definitions
• b0 = β̂0 and b1 = β̂1 are the estimators of β0 and β1
• Fitted value
ŷi = b0 + b1 xi
• Residual
ei = yi − ŷi
• Fitted regression line
ŷ = b0 + b1 x
o Is it the true relation once the realizations of b0 and b1 are obtained?

(b) Least squares method


• Estimate β0 and β1 so that the fitted regression line (based on the estimated parameters) lies
“closest” to the data
• The parameter estimates are obtained by minimizing the residual sum of squares
SSE = ∑_{i=1}^{n} (yi − ŷi)² = ∑_{i=1}^{n} (yi − f(xi | b0, b1))²

o In the simple linear regression model, f(x | b0, b1) = b0 + b1 x


o The estimators are called the least squares estimators
o SSE is the sum of the squared vertical distances from the observed values to the fitted line

• For the simple linear regression model, the LSE b0 and b1 satisfy
∂/∂b0 [∑_{i=1}^{n} (yi − b0 − b1 xi)²] = 0
∂/∂b1 [∑_{i=1}^{n} (yi − b0 − b1 xi)²] = 0

⇒ −∑_{i=1}^{n} 2(yi − b0 − b1 xi) = 0
  −∑_{i=1}^{n} 2 xi (yi − b0 − b1 xi) = 0

⇒ ∑_{i=1}^{n} yi = n b0 + b1 ∑_{i=1}^{n} xi
  ∑_{i=1}^{n} xi yi = b0 ∑_{i=1}^{n} xi + b1 ∑_{i=1}^{n} xi²   (the normal equations)

o From the 1st equation


b0 = (1/n)(∑_{i=1}^{n} yi − b1 ∑_{i=1}^{n} xi) = ȳ − b1 x̄
o Substitute b0 into the 2nd equation
∑_{i=1}^{n} xi yi = ∑_{i=1}^{n} (ȳ − b1 x̄) xi + b1 ∑_{i=1}^{n} xi²
∑_{i=1}^{n} xi (yi − ȳ) = b1 ∑_{i=1}^{n} xi (xi − x̄)
∑_{i=1}^{n} (xi − x̄)(yi − ȳ) + x̄ ∑_{i=1}^{n} (yi − ȳ) = b1 [∑_{i=1}^{n} (xi − x̄)(xi − x̄) + x̄ ∑_{i=1}^{n} (xi − x̄)]
SXY = b1 SXX      (since ∑_{i=1}^{n} (yi − ȳ) = ∑_{i=1}^{n} (xi − x̄) = 0)
b1 = SXY / SXX
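A small Python sketch of the closed-form solution just derived; the function name least_squares is our own, not part of any library. Fitted values are then b0 + b1·x and residuals y − (b0 + b1·x).

import numpy as np

def least_squares(x, y):
    # b1 = SXY / SXX and b0 = ybar - b1*xbar, as derived above
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    sxy = np.sum((x - xbar) * (y - ybar))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1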

• The least squares estimators b0 and b1 are linear estimators as they are linear combinations
of yi
b1 = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / SXX = [∑_{i=1}^{n} (xi − x̄) yi − ȳ ∑_{i=1}^{n} (xi − x̄)] / SXX = ∑_{i=1}^{n} (xi − x̄) yi / SXX = ∑_{i=1}^{n} ki yi

o where ki = (xi − x̄) / SXX is independent of yi

b0 = (1/n)(∑_{i=1}^{n} yi − b1 ∑_{i=1}^{n} xi) = (1/n)[∑_{i=1}^{n} yi − (∑_{i=1}^{n} ki yi)(∑_{i=1}^{n} xi)] = ∑_{i=1}^{n} (1/n − ki x̄) yi
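A quick numerical check in Python, using arbitrary made-up data, that b1 is indeed a linear combination of the yi with weights ki = (xi − x̄)/SXX:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=15)                 # arbitrary fixed design
y = 3.0 + 0.5 * x + rng.normal(0, 1, size=15)   # arbitrary responses

xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)
k = (x - xbar) / sxx                            # weights depend only on the x's

b1_direct = np.sum((x - xbar) * (y - y.mean())) / sxx
b1_linear = np.sum(k * y)                       # same value: b1 is linear in the y_i
print(b1_direct, b1_linear)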

(c) Distribution of the least squares estimators when the linear model is true
• b0 and b1 are unbiased estimators for β0 and β1
E(b1) = E(∑_{i=1}^{n} (xi − x̄) yi / SXX) = ∑_{i=1}^{n} (xi − x̄) E(yi) / SXX = ∑_{i=1}^{n} (xi − x̄)(β0 + β1 xi) / SXX
= [β0 ∑_{i=1}^{n} (xi − x̄) + β1 ∑_{i=1}^{n} (xi − x̄) xi] / SXX = β1 ∑_{i=1}^{n} (xi − x̄)² / SXX
= β1

E(b0) = E(ȳ − b1 x̄) = E(ȳ) − x̄ E(b1)
= (1/n) ∑_{i=1}^{n} E(yi) − x̄ β1 = (1/n) ∑_{i=1}^{n} (β0 + β1 xi) − β1 x̄
= β0 + β1 (1/n) ∑_{i=1}^{n} xi − β1 x̄
= β0
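A Monte Carlo sketch of the unbiasedness just shown, in Python with illustrative parameter values (β0 = 5, β1 = 1.5, σ = 2): averaging b0 and b1 over many simulated samples from the model recovers β0 and β1.

import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 5.0, 1.5, 2.0      # illustrative values
x = np.linspace(0, 10, 20)               # fixed design, reused in every replication

b0_hat, b1_hat = [], []
for _ in range(20_000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = sxy / sxx
    b0 = y.mean() - b1 * x.mean()
    b0_hat.append(b0)
    b1_hat.append(b1)

print(np.mean(b0_hat), np.mean(b1_hat))  # ~ (5.0, 1.5): E(b0) = beta0, E(b1) = beta1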

Example

Westwood company data


• Man-hours: dependent variable
• Lot size: independent variable
• Regression model
o Man-hours = β0 + β1 Lot size + ε
• Least squares estimates
o b1 = SXY / SXX = 6800 / 3400 = 2
o b0 = ȳ − b1 x̄ = 110 − 2 × 50 = 10
• Estimated regression line
o Man-hours = 10 + 2 Lot size
[Scatter plot of Man-hours against Lot size with the fitted line y = 2x + 10]
• b1 = +2
o Man-hours increase with lot size
o When lot size increases by 1 unit, man-hours increase by 2 units
• b0 = 10
o When lot size = 0, man-hours = 10 units
o Not reliable as the data range for X excludes zero
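A sketch reproducing these estimates in Python from the Westwood data (the ten observations tabulated in Section 2.2 below):

import numpy as np

# Westwood company data (lot size x, man-hours y), as tabulated in Section 2.2
x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)

xbar, ybar = x.mean(), y.mean()            # 50, 110
sxx = np.sum((x - xbar) ** 2)              # 3400
sxy = np.sum((x - xbar) * (y - ybar))      # 6800
b1 = sxy / sxx                             # 2.0
b0 = ybar - b1 * xbar                      # 10.0
print(b0, b1)                              # fitted line: Man-hours = 10 + 2*Lot size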

Example

Shocks data
• All observations are considered
• Time: dependent variable
• Shocks: independent variable
• Regression model
o Time = β0 + β1 Shocks + ε
• Least squares estimates
o b1 = SXY / SXX = −208.4 / 340 = −0.6129
o b0 = ȳ − b1 x̄ = 5.8875 − (−0.6129 × 7.5) = 10.4846
• Estimated regression line
o Time = 10.4846 − 0.6129 × Shocks
[Scatter plot of Time against number of Shocks with the fitted line]
• b1 = −0.6129
o Time decreases with the number of shocks
o When the number of shocks increases by 1, time decreases by 0.6129 seconds
• b0 = 10.48
o When the number of shocks = 0, time = 10.48 seconds
o Data range for X includes zero

• Variance of the sampling distribution of b1


Var(b1) = Var(∑_{i=1}^{n} (xi − x̄) yi / SXX) = ∑_{i=1}^{n} (xi − x̄)² Var(yi) / (SXX)² = σ² ∑_{i=1}^{n} (xi − x̄)² / (SXX)² = σ² / SXX

o since Cov(yi, yj) = 0, i ≠ j

• Variance of the sampling distribution of b0


o Consider
Cov(ȳ, b1) = Cov(ȳ, ∑_{i=1}^{n} (xi − x̄) yi / SXX) = (1/SXX) ∑_{i=1}^{n} (xi − x̄) Cov(ȳ, yi)
= (1/SXX) ∑_{i=1}^{n} (xi − x̄) Cov((1/n) ∑_{j=1}^{n} yj, yi) = (1/SXX) ∑_{i=1}^{n} (xi − x̄) Cov(yi/n, yi)
= (σ² / (n SXX)) ∑_{i=1}^{n} (xi − x̄)
= 0
o Therefore
Var(b0) = Var(ȳ − b1 x̄)
= Var(ȳ) + x̄² Var(b1) − 2 x̄ Cov(ȳ, b1)
= σ² (1/n + x̄²/SXX)

• Covariance between b0 and b1
Cov(b0, b1) = Cov(ȳ − b1 x̄, b1) = Cov(ȳ, b1) − x̄ Cov(b1, b1) = −x̄ σ² / SXX
o We obtain SXX and x̄ from the data, but how about σ²?
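A Monte Carlo sketch in Python, with illustrative parameter values, checking the three sampling formulas above: Var(b1) = σ²/SXX, Var(b0) = σ²(1/n + x̄²/SXX) and Cov(b0, b1) = −x̄σ²/SXX.

import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, sigma = 5.0, 1.5, 2.0      # illustrative values
x = np.linspace(0, 10, 20)
n, xbar = x.size, x.mean()
sxx = np.sum((x - xbar) ** 2)

est = []
for _ in range(20_000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * xbar
    est.append((b0, b1))
est = np.array(est)

print(est[:, 1].var(), sigma**2 / sxx)                              # Var(b1)
print(est[:, 0].var(), sigma**2 * (1/n + xbar**2 / sxx))            # Var(b0)
print(np.cov(est[:, 0], est[:, 1])[0, 1], -xbar * sigma**2 / sxx)   # Cov(b0, b1)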

Gauss-Markov theorem

The least squares estimators b0 and b1 are unbiased and have the minimum variance among all unbiased linear estimators. (Exercise)
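A Monte Carlo illustration (not a proof) in Python: the two-group comparison estimator below is our own choice of an alternative linear unbiased estimator of β1, and the parameter values are illustrative. Both estimators average to β1, but the least squares estimator has the smaller variance.

import numpy as np

# Compare the LSE of beta1 with another linear unbiased estimator:
# (mean y of the upper half of the x's - mean y of the lower half) divided by
# the corresponding difference in mean x.
rng = np.random.default_rng(3)
beta0, beta1, sigma = 5.0, 1.5, 2.0
x = np.linspace(0, 10, 20)
lo, hi = x < np.median(x), x >= np.median(x)

b1_lse, b1_grp = [], []
for _ in range(20_000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    b1_lse.append(np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2))
    b1_grp.append((y[hi].mean() - y[lo].mean()) / (x[hi].mean() - x[lo].mean()))

print(np.mean(b1_lse), np.mean(b1_grp))   # both ~ 1.5: both estimators are unbiased
print(np.var(b1_lse), np.var(b1_grp))     # the LSE has the smaller variance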

(d) Remarks
• Inference and prediction based on the fitted line are valid only for X values within the range of the data set.
• A linear relationship between two variables can exist without causation.
• The simple linear regression model applies only if the true relationship between the two
variables is a straight-line relationship.
• When the magnitude of the slope estimate b1 is close to zero, the fitted regression line will be nearly parallel to the x-axis. The explanatory variable X will then be of little use for predicting Y.

2.2. Estimate of error variance


• The error variance
σ² = Var(εi) = Var(yi − (β0 + β1 xi))
• Error mean square (or mean square error, MSE) is defined as
MSE = s² = ∑_{i=1}^{n} (yi − ŷi)² / (n − 2)
o Error (or residual) degrees of freedom (df) = n − 2
o s² is unbiased under the important assumption that the model is correct: E(s²) = σ²
• Estimates of the variances and covariance of b0 and b1
o Obtained by replacing σ² with s²
s²{b1} = s² / SXX
s²{b0} = s² (1/n + x̄²/SXX)
Ĉov(b0, b1) = −x̄ s² / SXX
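A Python sketch collecting the estimation formulas of this subsection; the function name is our own. Applied to the Westwood data it reproduces the values quoted in the example that follows (s² = 7.5, SE(b1) ≈ 0.047, SE(b0) ≈ 2.503).

import numpy as np

def fit_with_standard_errors(x, y):
    # Least squares fit plus s^2-based estimates of Var(b1), Var(b0), Cov(b0, b1)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, xbar, ybar = x.size, x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - ybar)) / sxx
    b0 = ybar - b1 * xbar
    resid = y - (b0 + b1 * x)
    s2 = np.sum(resid ** 2) / (n - 2)              # MSE, df = n - 2
    var_b1 = s2 / sxx
    var_b0 = s2 * (1.0 / n + xbar ** 2 / sxx)
    cov_b0_b1 = -xbar * s2 / sxx
    return b0, b1, s2, var_b0, var_b1, cov_b0_b1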

Example

Westwood company data


• b0 = 10, b1 = 2
o ŷi = 10 + 2 xi

Production run  Lot size (X)  Man-hours (Y)  Predicted Man-hours (Ŷ)
1 30 73 70
2 20 50 50
3 60 128 130
4 80 170 170
5 40 87 90
6 50 108 110
7 60 135 130
8 30 69 70
9 70 148 150
10 60 132 130

• s² = (1/(10 − 2)) [(73 − 70)² + (50 − 50)² + ⋯ + (132 − 130)²] = 60/8 = 7.5
o s = √7.5 = 2.74
o Degrees of freedom = n − 2 = 8
• Sample variance of b1
o s²{b1} = s²/SXX = 7.5/3400 = 0.002206
o SE(b1) = √0.002206 = 0.046967
• Sample variance of b0
o s²{b0} = s² (1/n + x̄²/SXX) = 7.5 × (1/10 + 50²/3400) = 6.264706
o SE(b0) = √6.264706 = 2.502939
• Sample covariance of b0 and b1
o Ĉov(b0, b1) = −x̄ s²/SXX = −(50 × 7.5)/3400 = −0.1103

Example

Shocks data
• b0 = 10.4846, b1 = -0.6129
o ŷ = 10.4846 − 0.6129 x

X Y Predicted Y
0 11.4 10.4846
1 11.9 9.8716
2 7.1 9.2587
3 14.2 8.6457
4 5.9 8.0328
5 6.1 7.4199
… … …

• s² = (1/(16 − 2)) [(11.4 − 10.48)² + (11.9 − 9.87)² + ⋯] = 5.0943
o s = √5.0943 = 2.257

o df = 14

• Sample variance for b1


o s²{b1} = 5.0943/340 = 0.0150
o SE(b1) = √0.0150 = 0.1224
• Sample variance for b0
o s²{b0} = 5.0943 × (1/16 + 7.5²/340) = 1.1612
o SE(b0) = √1.1612 = 1.0776
• Sample covariance of b0 and b1
o Ĉov(b0, b1) = −(7.5 × 5.0943)/340 = −0.1124

2.3. Maximum likelihood estimation


(a) Likelihood function
• Assume εi ~ i.i.d. N(0, σ²) with probability density function
f(εi) = φ(εi)
• The joint density function / likelihood function
L = ∏_{i=1}^{n} f(εi) = ∏_{i=1}^{n} φ(εi)

• The pdf for the normal distribution with mean 0 is


φ(x) = (1/((2π)^{1/2} σ)) exp(−x²/(2σ²))
• The likelihood function is
L = L(β0, β1, σ² | x, y)
  = (1/((2π)^{n/2} σⁿ)) exp(−(1/(2σ²)) ∑_{i=1}^{n} εi²)
  = (1/((2π)^{n/2} σⁿ)) exp(−(1/(2σ²)) ∑_{i=1}^{n} (yi − β0 − β1 xi)²)

(b) Maximum likelihood estimates (MLE) for β0 and β1


• The MLEs of β0 and β1 maximize L, i.e. they minimize
(1/(2σ²)) ∑_{i=1}^{n} (yi − β0 − β1 xi)²
o Equivalent to minimizing SSE
o Under the normal theory assumption, the MLEs of the regression coefficients β0 and β1 are the least squares estimators

(c) MLE for error variance


• The log-likelihood is given as
l = log(L) = k − (n/2) log(σ²) − (1/(2σ²)) ∑_{i=1}^{n} (yi − β0 − β1 xi)²

o where k is free of σ²

• Substituting b0 and b1:
l = k − (n/2) log(σ²) − (1/(2σ²)) ∑_{i=1}^{n} (yi − ŷi)²

• The MLE of σ² maximizes l:
∂l/∂σ² = −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^{n} (yi − ŷi)² = 0
−n + (1/σ̂²) ∑_{i=1}^{n} (yi − ŷi)² = 0
σ̂² = ∑_{i=1}^{n} (yi − ŷi)² / n = ((n − 2)/n) s²
o σ̂² is a biased estimator of σ²
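A numerical sketch (Python with scipy, using the Westwood data from Section 2.2) that maximizing the normal likelihood reproduces the least squares estimates and gives σ̂² = ((n − 2)/n) s² = 0.8 × 7.5 = 6. The starting values and optimizer settings are our own illustrative choices.

import numpy as np
from scipy.optimize import minimize

# Westwood data (lot size x, man-hours y), as tabulated in Section 2.2
x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)
n = x.size

def neg_log_lik(theta):
    # negative log-likelihood, dropping the constant k; sigma^2 kept positive via log scale
    b0, b1, log_sig2 = theta
    sig2 = np.exp(log_sig2)
    resid = y - b0 - b1 * x
    return 0.5 * n * np.log(sig2) + np.sum(resid ** 2) / (2.0 * sig2)

res = minimize(neg_log_lik,
               x0=np.array([y.mean(), 0.0, np.log(y.var())]),
               method="Nelder-Mead",
               options={"maxfev": 20000, "xatol": 1e-10, "fatol": 1e-10})
b0_mle, b1_mle, sig2_mle = res.x[0], res.x[1], np.exp(res.x[2])
print(b0_mle, b1_mle)   # ~ (10, 2): the least squares estimates
print(sig2_mle)         # ~ 6.0 = (n - 2)/n * s^2, smaller than s^2 = 7.5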
