
Regression Estimation – Least Squares and Maximum Likelihood

Dr. Frank Wood (fwood@stat.columbia.edu)
Linear Regression Models, Lecture 3


Least Squares Max(min)imization
• Function to minimize w.r.t. β0, β1:

  Q = \sum_{i=1}^{n} \left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2

• Minimize this by maximizing −Q
• Find partials and set both equal to zero

go to board
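A minimal numerical sketch of this minimization (not from the original slides), using scipy.optimize.minimize on synthetic data; the generating line y = 2x + 9 and the noise level are assumptions chosen to echo the later plots:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 20)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)  # assumed generating line y = 2x + 9

def Q(beta):
    # Least squares criterion as a function of (b0, b1)
    b0, b1 = beta
    return np.sum((Y - (b0 + b1 * X)) ** 2)

# Minimize Q directly (equivalently, maximize -Q)
result = minimize(Q, x0=np.array([0.0, 0.0]))
print(result.x)  # approximately (b0, b1)
```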



Normal Equations
• The results of this maximization step are called
  the normal equations. b0 and b1 are called
  point estimators of β0 and β1 respectively.

  \sum_i Y_i = n b_0 + b_1 \sum_i X_i

  \sum_i X_i Y_i = b_0 \sum_i X_i + b_1 \sum_i X_i^2

• This is a system of two equations and two unknowns.
  The solution is given by…
Write these on board
Solution to Normal Equations
• After a lot of algebra one arrives at

  b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}

  b_0 = \bar{Y} - b_1 \bar{X}

  \bar{X} = \frac{\sum_i X_i}{n}, \qquad \bar{Y} = \frac{\sum_i Y_i}{n}
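A short sketch (not part of the slides) of these closed-form estimates on synthetic data; the data-generating values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 20)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)  # assumed true line y = 2x + 9

Xbar, Ybar = X.mean(), Y.mean()
b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
b0 = Ybar - b1 * Xbar
print(b0, b1)  # point estimates of beta_0 and beta_1
```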



Least Squares Fit
[Figure: least squares fit to the data. Estimate: y = 2.09x + 8.36, MSE 4.15; True: y = 2x + 9, MSE 4.22. Axes: Predictor/Input vs. Response/Output.]



Guess #1
[Figure: Guess: y = 0x + 21.2, MSE 37.1; True: y = 2x + 9, MSE 4.22. Axes: Predictor/Input vs. Response/Output.]



Guess #2
[Figure: Guess: y = 1.5x + 13, MSE 7.84; True: y = 2x + 9, MSE 4.22. Axes: Predictor/Input vs. Response/Output.]



Looking Ahead: Matrix Least Squares
   
  \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
  =
  \begin{bmatrix} X_1 & 1 \\ X_2 & 1 \\ \vdots & \vdots \\ X_n & 1 \end{bmatrix}
  \begin{bmatrix} \beta_1 \\ \beta_0 \end{bmatrix}

• The solution to this equation is the solution to least
  squares linear regression (and to maximum likelihood
  under the normal error distribution assumption)
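A possible sketch of this matrix formulation with numpy.linalg.lstsq (the data are synthetic and illustrative); the column order [X, 1] and coefficient order (β1, β0) follow the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 20)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)

# Design matrix with columns [X, 1], matching the slide's ordering (beta_1 first, then beta_0)
A = np.column_stack([X, np.ones_like(X)])
coef, residuals, rank, sv = np.linalg.lstsq(A, Y, rcond=None)
b1, b0 = coef
print(b0, b1)
```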



Questions to Ask
• Is the relationship really linear?
• What is the distribution of the “errors”?
• Is the fit good?
• How much of the variability of the response is
accounted for by including the predictor
variable?
• Is the chosen predictor variable the best one?



Is This Better?
[Figure: 7th-order polynomial fit to the same data, MSE 3.18. Axes: Predictor/Input vs. Response/Output.]



Goals for First Half of Course
• How to do linear regression
– Self-familiarization with software tools
• How to interpret standard linear regression
results
• How to derive tests
• How to assess and address deficiencies in
regression models



Properties of Solution
• The ith residual is defined to be

  e_i = Y_i - \hat{Y}_i

• The sum of the residuals is zero:

  \sum_i e_i = \sum_i (Y_i - b_0 - b_1 X_i)
             = \sum_i Y_i - n b_0 - b_1 \sum_i X_i
             = 0, \text{ by the first normal equation}



Properties of Solution
• The sum of the observed values Yi equals the
sum of the fitted values Ŷi
 
  \sum_i Y_i = \sum_i \hat{Y}_i
             = \sum_i (b_1 X_i + b_0)
             = \sum_i (b_1 X_i + \bar{Y} - b_1 \bar{X})
             = b_1 \sum_i X_i + n\bar{Y} - b_1 n\bar{X}
             = b_1 n\bar{X} + \sum_i Y_i - b_1 n\bar{X}
Properties of Solution
• The sum of the weighted residuals is zero
when the residual in the ith trial is weighted by
the level of the predictor variable in the ith trial
 
  \sum_i X_i e_i = \sum_i X_i (Y_i - b_0 - b_1 X_i)
                 = \sum_i X_i Y_i - b_0 \sum_i X_i - b_1 \sum_i X_i^2
                 = 0, \text{ by the second normal equation}



Properties of Solution
• The sum of the weighted residuals is zero
when the residual in the ith trial is weighted by
the fitted value of the response variable for
the ith trial
 
  \sum_i \hat{Y}_i e_i = \sum_i (b_0 + b_1 X_i) e_i
                       = b_0 \sum_i e_i + b_1 \sum_i e_i X_i
                       = 0, \text{ by the previous properties}



Properties of Solution
• The regression line always goes through the
point
(X̄, Ȳ)
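A quick numerical check of these five properties on a synthetic fit (not in the original slides; the data and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 20)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Yhat = b0 + b1 * X
e = Y - Yhat

print(np.isclose(e.sum(), 0))                      # residuals sum to zero
print(np.isclose(Y.sum(), Yhat.sum()))             # observed and fitted sums agree
print(np.isclose((X * e).sum(), 0))                # X-weighted residuals sum to zero
print(np.isclose((Yhat * e).sum(), 0))             # fitted-value-weighted residuals sum to zero
print(np.isclose(b0 + b1 * X.mean(), Y.mean()))    # line passes through (Xbar, Ybar)
```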



Estimating Error Term Variance σ²
• Review estimation in non-regression setting.
• Show estimation results for regression
setting.



Estimation Review
• An estimator is a rule that tells how to
calculate the value of an estimate based on
the measurements contained in a sample

• e.g., the sample mean

  \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i



Point Estimators and Bias
• Point estimator

θ̂ = f ({Y1 , . . . , Yn })
• Unknown quantity / parameter

θ
• Definition: Bias of estimator

B(θ̂) = E(θ̂) − θ



One Sample Example
[Figure: samples drawn with µ = 5, σ = 0.75; the true θ and the estimated θ are marked.]

run bias_example_plot.m
Distribution of Estimator
• If the estimator is a function of the samples
and the distribution of the samples is known
then the distribution of the estimator can
(often) be determined
– Methods
• Distribution (CDF) functions
• Transformations
• Moment generating functions
• Jacobians (change of variable)



Example
• Samples from a Normal(µ, σ²) distribution

  Y_i \sim \mathrm{Normal}(\mu, \sigma^2)

• Estimate the population mean

  \theta = \mu, \qquad \hat{\theta} = \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i



Sampling Distribution of the Estimator
• First moment
  E(\hat{\theta}) = E\!\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right)
                  = \frac{1}{n} \sum_{i=1}^{n} E(Y_i) = \frac{n\mu}{n} = \theta
• This is an example of an unbiased estimator

B(θ̂) = E(θ̂) − θ = 0
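A small Monte Carlo check of this unbiasedness (illustrative; the values µ = 5, σ = 0.75 echo the earlier example plot):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 0.75, 10, 100_000

# Draw many samples of size n; each row yields one realization of the estimator Ybar
estimates = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
print(estimates.mean() - mu)  # empirical bias, close to 0
```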



Variance of Estimator
• Definition: Variance of estimator

  V(\hat{\theta}) = E\!\left([\hat{\theta} - E(\hat{\theta})]^2\right)

• Remember:

  V(cY) = c^2 V(Y)

  V\!\left(\sum_{i=1}^{n} Y_i\right) = \sum_{i=1}^{n} V(Y_i)
  \quad \text{(only if the } Y_i \text{ are independent with finite variance)}



Example Estimator Variance
• For N(0,1) mean estimator
  V(\hat{\theta}) = V\!\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right)
                  = \frac{1}{n^2} \sum_{i=1}^{n} V(Y_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}

• Note assumptions



Distribution of sample mean estimator
[Figure: histogram of the sample-mean estimator computed from 1000 simulated samples.]
Bias Variance Trade-off
• The mean squared error of an estimator

  MSE(\hat{\theta}) = E\!\left([\hat{\theta} - \theta]^2\right)

• Can be re-expressed

  MSE(\hat{\theta}) = V(\hat{\theta}) + \big(B(\hat{\theta})\big)^2



MSE = VAR + BIAS²
• Proof

  MSE(\hat{\theta}) = E\!\left((\hat{\theta} - \theta)^2\right)
    = E\!\left(([\hat{\theta} - E(\hat{\theta})] + [E(\hat{\theta}) - \theta])^2\right)
    = E\!\left([\hat{\theta} - E(\hat{\theta})]^2\right)
      + 2\,E\!\left([E(\hat{\theta}) - \theta][\hat{\theta} - E(\hat{\theta})]\right)
      + E\!\left([E(\hat{\theta}) - \theta]^2\right)
    = V(\hat{\theta}) + 2\left(E(\hat{\theta})\,E[\hat{\theta} - E(\hat{\theta})]
      - \theta\,E[\hat{\theta} - E(\hat{\theta})]\right) + \big(B(\hat{\theta})\big)^2
    = V(\hat{\theta}) + 2(0 + 0) + \big(B(\hat{\theta})\big)^2
    = V(\hat{\theta}) + \big(B(\hat{\theta})\big)^2
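A simulation sketch verifying this identity for two estimators, the sample mean and a deliberately biased shrunk mean; the shrinkage factor 0.8 and the parameter values are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 3.0, 10, 200_000
ybar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

for name, est in [("sample mean (unbiased)", ybar),
                  ("shrunk mean 0.8*Ybar (biased)", 0.8 * ybar)]:
    bias = est.mean() - mu
    var = est.var()
    mse = np.mean((est - mu) ** 2)
    print(name, "var + bias^2 =", var + bias**2, "mse =", mse)  # the two columns agree
```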



Trade-off
• Think of variance as confidence and bias as
correctness.
– Intuitions (largely) apply
• Sometimes a biased estimator can produce
lower MSE if it lowers the variance.



Estimating Error Term Variance σ²
• Regression model
• Variance of each observation Yi is σ² (the
same as for the error term εi)
• Each Yi comes from a different probability
distribution with different means that depend
on the level Xi
• The deviation of an observation Yi must be
calculated around its own estimated mean.



s² estimator for σ²

  s^2 = MSE = \frac{SSE}{n-2} = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum_i e_i^2}{n-2}

• MSE is an unbiased estimator of σ²

  E(MSE) = \sigma^2

• The sum of squares SSE has n − 2 degrees of
  freedom associated with it.
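A brief sketch computing s² from a least squares fit on synthetic data (the noise level σ = 2 is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 30)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)  # errors with sigma = 2

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)

s2 = np.sum(e**2) / (len(Y) - 2)  # MSE = SSE / (n - 2)
print(s2)                         # should be near sigma^2 = 4
```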



Normal Error Regression Model
• No matter how the error terms εi are
distributed, the least squares method
provides unbiased point estimators of β0 and β1
– that also have minimum variance among all
unbiased linear estimators
• To set up interval estimates and make tests
we need to specify the distribution of the εi
• We will assume that the εi are normally
distributed.



Normal Error Regression Model
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

• Yi is the value of the response variable in the ith trial
• β0 and β1 are parameters
• Xi is a known constant, the value of the
predictor variable in the ith trial
• εi ~iid N(0, σ²)
• i = 1,…,n
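Below is a small simulation sketch of this model (not in the slides); the parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 9.0, 2.0, 2.0    # assumed parameter values (echoing the earlier plots)
n = 20
X = np.linspace(1, 10, n)              # known constants
eps = rng.normal(0.0, sigma, size=n)   # iid N(0, sigma^2) errors
Y = beta0 + beta1 * X + eps            # responses from the normal error regression model
```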



Notational Convention
• When you see εi ~iid N(0, σ²)

• It is read as: the εi are independently and
identically distributed according to a normal
distribution with mean 0 and variance σ²

• Examples
– θ ~ Poisson(λ)
– z ~ G(θ)



Maximum Likelihood Principle
• The method of maximum likelihood chooses
as estimates those values of the parameters
that are most consistent with the sample data.



Likelihood Function
• If

  X_i \sim F(\Theta), \quad i = 1, \ldots, n

  then the likelihood function is

  L(\{X_i\}_{i=1}^{n}, \Theta) = \prod_{i=1}^{n} F(X_i; \Theta)



Example, N(10,3) Density, Single Obs.
[Figure: samples overlaid on the N(10, 3) density. N = 10, −log likelihood = 4.3038.]



Example, N(10,3) Density, Single Obs. Again
[Figure: samples overlaid on the N(10, 3) density. N = 10, −log likelihood = 4.3038.]



Example, N(10,3) Density, Multiple Obs.
[Figure: samples overlaid on the N(10, 3) density, multiple observations. N = 10, −log likelihood = 36.2204.]



Maximum Likelihood Estimation
• The likelihood function can be maximized
  w.r.t. the parameter(s) Θ; doing this, one can
  arrive at estimators for the parameters as well.

  L(\{X_i\}_{i=1}^{n}, \Theta) = \prod_{i=1}^{n} F(X_i; \Theta)

• To do this, find solutions (analytically or by
  following the gradient) to

  \frac{dL(\{X_i\}_{i=1}^{n}, \Theta)}{d\Theta} = 0
Important Trick
• Never (almost) maximize the likelihood
  function itself; maximize the log likelihood
  function instead.
  \log L(\{X_i\}_{i=1}^{n}, \Theta) = \log\!\left(\prod_{i=1}^{n} F(X_i; \Theta)\right)
                                    = \sum_{i=1}^{n} \log F(X_i; \Theta)
Quite often the log of the density is easier to
work with mathematically.
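A minimal sketch of this trick, assuming scipy is available; the sample size and the reading of N(10, 3) as mean 10, standard deviation 3 are assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=10)  # assumed: 10 draws, treating 3 as the standard deviation

# The product of densities underflows easily; the sum of log densities does not.
loglik = np.sum(norm.logpdf(x, loc=10.0, scale=3.0))
print(-loglik)  # negative log likelihood
```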



ML Normal Regression
• Likelihood function
  L(\beta_0, \beta_1, \sigma^2)
    = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}}
      \exp\!\left(-\frac{1}{2\sigma^2}\,(Y_i - \beta_0 - \beta_1 X_i)^2\right)
    = \frac{1}{(2\pi\sigma^2)^{n/2}}
      \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2\right)

which, if you maximize (how?) w.r.t. the
parameters, gives you…



Maximum Likelihood Estimator(s)
• β0
  – b0, the same as in the least squares case
• β1
  – b1, the same as in the least squares case
• σ²

  \hat{\sigma}^2 = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n}

• Note that the ML estimator σ̂² is biased, whereas s² is unbiased, and

  s^2 = MSE = \frac{n}{n-2}\,\hat{\sigma}^2
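A sketch (not from the slides) that maximizes the normal log likelihood numerically and checks these relationships against the closed-form least squares fit; scipy and the synthetic data are assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 30)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)

def neg_log_lik(params):
    b0, b1, log_sigma = params           # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(Y, loc=b0 + b1 * X, scale=sigma))

res = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0])
b0_ml, b1_ml, sigma_ml = res.x[0], res.x[1], np.exp(res.x[2])

# Closed-form least squares fit for comparison
b1_ls = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0_ls = Y.mean() - b1_ls * X.mean()
e = Y - (b0_ls + b1_ls * X)
n = len(Y)

print(b0_ml, b1_ml)                                        # ~ equal to b0_ls, b1_ls
print(sigma_ml**2, np.sum(e**2) / n)                       # ML variance estimate = SSE / n
print(np.sum(e**2) / (n - 2), n / (n - 2) * sigma_ml**2)   # s^2 = (n / (n - 2)) * sigma_hat^2
```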



Comments
• Least squares minimizes the squared error
between the prediction and the true output
• The normal distribution is fully characterized
by its first two central moments (mean and
variance)

• Food for thought:


– What does the bias in the ML estimator of the
error variance mean? And where does it come
from?

