
Regression Estimation – Least Squares and Maximum Likelihood

Dr. Frank Wood (fwood@stat.columbia.edu)
Linear Regression Models, Lecture 3


Least Squares Max(min)imization
• Function to minimize w.r.t. β0, β1:

  Q = \sum_{i=1}^{n} \left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2

• Minimize this by maximizing −Q
• Find partials and set both equal to zero

go to board
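A minimal numerical sketch of this minimization (not from the original slides), using scipy.optimize.minimize on synthetic data; the generating line y = 2x + 9 and the noise level are assumptions chosen to echo the later plots:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 20)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)  # assumed generating line y = 2x + 9

def Q(beta):
    # Least squares criterion as a function of (b0, b1)
    b0, b1 = beta
    return np.sum((Y - (b0 + b1 * X)) ** 2)

# Minimize Q directly (equivalently, maximize -Q)
result = minimize(Q, x0=np.array([0.0, 0.0]))
print(result.x)  # approximately (b0, b1)
```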



Normal Equations
• The results of this maximization step are called
  the normal equations. b0 and b1 are called
  point estimators of β0 and β1 respectively.

  \sum_i Y_i = n b_0 + b_1 \sum_i X_i

  \sum_i X_i Y_i = b_0 \sum_i X_i + b_1 \sum_i X_i^2

• This is a system of two equations and two unknowns.
  The solution is given by…
Write these on board
Solution to Normal Equations
• After a lot of algebra one arrives at

  b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}

  b_0 = \bar{Y} - b_1 \bar{X}

  \bar{X} = \frac{\sum_i X_i}{n}, \qquad \bar{Y} = \frac{\sum_i Y_i}{n}
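A short sketch (not part of the slides) of these closed-form estimates on synthetic data; the data-generating values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 20)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)  # assumed true line y = 2x + 9

Xbar, Ybar = X.mean(), Y.mean()
b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
b0 = Ybar - b1 * Xbar
print(b0, b1)  # point estimates of beta_0 and beta_1
```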



Least Squares Fit
[Figure: least squares fit to the data. Estimate: y = 2.09x + 8.36, MSE 4.15; True: y = 2x + 9, MSE 4.22. Axes: Predictor/Input vs. Response/Output.]



Guess #1
[Figure: Guess: y = 0x + 21.2, MSE 37.1; True: y = 2x + 9, MSE 4.22. Axes: Predictor/Input vs. Response/Output.]



Guess #2
[Figure: Guess: y = 1.5x + 13, MSE 7.84; True: y = 2x + 9, MSE 4.22. Axes: Predictor/Input vs. Response/Output.]



Looking Ahead: Matrix Least Squares
   
  \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
  =
  \begin{bmatrix} X_1 & 1 \\ X_2 & 1 \\ \vdots & \vdots \\ X_n & 1 \end{bmatrix}
  \begin{bmatrix} \beta_1 \\ \beta_0 \end{bmatrix}

• The solution to this equation is the solution to least
  squares linear regression (and to maximum likelihood
  under the normal error distribution assumption)
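A possible sketch of this matrix formulation with numpy.linalg.lstsq (the data are synthetic and illustrative); the column order [X, 1] and coefficient order (β1, β0) follow the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 20)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)

# Design matrix with columns [X, 1], matching the slide's ordering (beta_1 first, then beta_0)
A = np.column_stack([X, np.ones_like(X)])
coef, residuals, rank, sv = np.linalg.lstsq(A, Y, rcond=None)
b1, b0 = coef
print(b0, b1)
```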



Questions to Ask
• Is the relationship really linear?
• What is the distribution of the “errors”?
• Is the fit good?
• How much of the variability of the response is
accounted for by including the predictor
variable?
• Is the chosen predictor variable the best one?



Is This Better?
[Figure: 7th-order polynomial fit to the same data, MSE 3.18. Axes: Predictor/Input vs. Response/Output.]



Goals for First Half of Course
• How to do linear regression
– Self-familiarization with software tools
• How to interpret standard linear regression
results
• How to derive tests
• How to assess and address deficiencies in
regression models



Properties of Solution
• The ith residual is defined to be

  e_i = Y_i - \hat{Y}_i

• The sum of the residuals is zero:

  \sum_i e_i = \sum_i (Y_i - b_0 - b_1 X_i)
             = \sum_i Y_i - n b_0 - b_1 \sum_i X_i
             = 0, \text{ by the first normal equation}



Properties of Solution
• The sum of the observed values Yi equals the
sum of the fitted values Ŷi
 
  \sum_i Y_i = \sum_i \hat{Y}_i
             = \sum_i (b_1 X_i + b_0)
             = \sum_i (b_1 X_i + \bar{Y} - b_1 \bar{X})
             = b_1 \sum_i X_i + n\bar{Y} - b_1 n\bar{X}
             = b_1 n\bar{X} + \sum_i Y_i - b_1 n\bar{X}
Properties of Solution
• The sum of the weighted residuals is zero
when the residual in the ith trial is weighted by
the level of the predictor variable in the ith trial
 
  \sum_i X_i e_i = \sum_i X_i (Y_i - b_0 - b_1 X_i)
                 = \sum_i X_i Y_i - b_0 \sum_i X_i - b_1 \sum_i X_i^2
                 = 0, \text{ by the second normal equation}



Properties of Solution
• The sum of the weighted residuals is zero
when the residual in the ith trial is weighted by
the fitted value of the response variable for
the ith trial
 
  \sum_i \hat{Y}_i e_i = \sum_i (b_0 + b_1 X_i) e_i
                       = b_0 \sum_i e_i + b_1 \sum_i e_i X_i
                       = 0, \text{ by the previous properties}



Properties of Solution
• The regression line always goes through the
point
(X̄, Ȳ)
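A quick numerical check of these five properties on a synthetic fit (not in the original slides; the data and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 20)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Yhat = b0 + b1 * X
e = Y - Yhat

print(np.isclose(e.sum(), 0))                      # residuals sum to zero
print(np.isclose(Y.sum(), Yhat.sum()))             # observed and fitted sums agree
print(np.isclose((X * e).sum(), 0))                # X-weighted residuals sum to zero
print(np.isclose((Yhat * e).sum(), 0))             # fitted-value-weighted residuals sum to zero
print(np.isclose(b0 + b1 * X.mean(), Y.mean()))    # line passes through (Xbar, Ybar)
```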



Estimating Error Term Variance σ²
• Review estimation in non-regression setting.
• Show estimation results for regression
setting.



Estimation Review
• An estimator is a rule that tells how to
calculate the value of an estimate based on
the measurements contained in a sample

• e.g., the sample mean

  \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i



Point Estimators and Bias
• Point estimator

θ̂ = f ({Y1 , . . . , Yn })
• Unknown quantity / parameter

θ
• Definition: Bias of estimator

B(θ̂) = E(θ̂) − θ



One Sample Example
[Figure: samples drawn with µ = 5, σ = 0.75; the true θ and the estimated θ are marked.]

run bias_example_plot.m
Distribution of Estimator
• If the estimator is a function of the samples
and the distribution of the samples is known
then the distribution of the estimator can
(often) be determined
– Methods
• Distribution (CDF) functions
• Transformations
• Moment generating functions
• Jacobians (change of variable)



Example
• Samples from a Normal(µ, σ²) distribution

  Y_i \sim \mathrm{Normal}(\mu, \sigma^2)

• Estimate the population mean

  \theta = \mu, \qquad \hat{\theta} = \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i



Sampling Distribution of the Estimator
• First moment
  E(\hat{\theta}) = E\!\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right)
                  = \frac{1}{n} \sum_{i=1}^{n} E(Y_i) = \frac{n\mu}{n} = \theta
• This is an example of an unbiased estimator

B(θ̂) = E(θ̂) − θ = 0
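A small Monte Carlo check of this unbiasedness (illustrative; the values µ = 5, σ = 0.75 echo the earlier example plot):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 0.75, 10, 100_000

# Draw many samples of size n; each row yields one realization of the estimator Ybar
estimates = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
print(estimates.mean() - mu)  # empirical bias, close to 0
```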



Variance of Estimator
• Definition: Variance of estimator

  V(\hat{\theta}) = E\!\left([\hat{\theta} - E(\hat{\theta})]^2\right)

• Remember:

  V(cY) = c^2 V(Y)

  V\!\left(\sum_{i=1}^{n} Y_i\right) = \sum_{i=1}^{n} V(Y_i)
  \quad \text{(only if the } Y_i \text{ are independent with finite variance)}



Example Estimator Variance
• For N(0,1) mean estimator
  V(\hat{\theta}) = V\!\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right)
                  = \frac{1}{n^2} \sum_{i=1}^{n} V(Y_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}

• Note assumptions



Distribution of sample mean estimator
[Figure: histogram of the sample-mean estimator computed from 1000 simulated samples.]
Bias Variance Trade-off
• The mean squared error of an estimator

  MSE(\hat{\theta}) = E\!\left([\hat{\theta} - \theta]^2\right)

• Can be re-expressed

  MSE(\hat{\theta}) = V(\hat{\theta}) + \big(B(\hat{\theta})\big)^2



MSE = VAR + BIAS²
• Proof

  MSE(\hat{\theta}) = E\!\left((\hat{\theta} - \theta)^2\right)
    = E\!\left(([\hat{\theta} - E(\hat{\theta})] + [E(\hat{\theta}) - \theta])^2\right)
    = E\!\left([\hat{\theta} - E(\hat{\theta})]^2\right)
      + 2\,E\!\left([E(\hat{\theta}) - \theta][\hat{\theta} - E(\hat{\theta})]\right)
      + E\!\left([E(\hat{\theta}) - \theta]^2\right)
    = V(\hat{\theta}) + 2\left(E(\hat{\theta})\,E[\hat{\theta} - E(\hat{\theta})]
      - \theta\,E[\hat{\theta} - E(\hat{\theta})]\right) + \big(B(\hat{\theta})\big)^2
    = V(\hat{\theta}) + 2(0 + 0) + \big(B(\hat{\theta})\big)^2
    = V(\hat{\theta}) + \big(B(\hat{\theta})\big)^2
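A simulation sketch verifying this identity for two estimators, the sample mean and a deliberately biased shrunk mean; the shrinkage factor 0.8 and the parameter values are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 3.0, 10, 200_000
ybar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

for name, est in [("sample mean (unbiased)", ybar),
                  ("shrunk mean 0.8*Ybar (biased)", 0.8 * ybar)]:
    bias = est.mean() - mu
    var = est.var()
    mse = np.mean((est - mu) ** 2)
    print(name, "var + bias^2 =", var + bias**2, "mse =", mse)  # the two columns agree
```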



Trade-off
• Think of variance as confidence and bias as
correctness.
– Intuitions (largely) apply
• Sometimes a biased estimator can produce
lower MSE if it lowers the variance.



Estimating Error Term Variance σ²
• Regression model
• Variance of each observation Yi is σ² (the
same as for the error term εi)
• Each Yi comes from a different probability
distribution with different means that depend
on the level Xi
• The deviation of an observation Yi must be
calculated around its own estimated mean.



s² estimator for σ²

  s^2 = MSE = \frac{SSE}{n-2} = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum_i e_i^2}{n-2}

• MSE is an unbiased estimator of σ²

  E(MSE) = \sigma^2

• The sum of squares SSE has n − 2 degrees of
  freedom associated with it.
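A brief sketch computing s² from a least squares fit on synthetic data (the noise level σ = 2 is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 30)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)  # errors with sigma = 2

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)

s2 = np.sum(e**2) / (len(Y) - 2)  # MSE = SSE / (n - 2)
print(s2)                         # should be near sigma^2 = 4
```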



Normal Error Regression Model
• No matter how the error terms εi are
distributed, the least squares method
provides unbiased point estimators of β0 and β1
– that also have minimum variance among all
unbiased linear estimators
• To set up interval estimates and make tests
we need to specify the distribution of the εi
• We will assume that the εi are normally
distributed.



Normal Error Regression Model
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

• Yi is the value of the response variable in the ith trial
• β0 and β1 are parameters
• Xi is a known constant, the value of the
predictor variable in the ith trial
• εi ~iid N(0, σ²)
• i = 1,…,n
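Below is a small simulation sketch of this model (not in the slides); the parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 9.0, 2.0, 2.0    # assumed parameter values (echoing the earlier plots)
n = 20
X = np.linspace(1, 10, n)              # known constants
eps = rng.normal(0.0, sigma, size=n)   # iid N(0, sigma^2) errors
Y = beta0 + beta1 * X + eps            # responses from the normal error regression model
```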



Notational Convention
• When you see εi ~iid N(0, σ²)

• It is read as: the εi are independently and
identically distributed according to a normal
distribution with mean 0 and variance σ²

• Examples
– θ ~ Poisson(λ)
– z ~ G(θ)



Maximum Likelihood Principle
• The method of maximum likelihood chooses
as estimates those values of the parameters
that are most consistent with the sample data.



Likelihood Function
• If

  X_i \sim F(\Theta), \quad i = 1, \ldots, n

  then the likelihood function is

  L(\{X_i\}_{i=1}^{n}, \Theta) = \prod_{i=1}^{n} F(X_i; \Theta)



Example, N(10,3) Density, Single Obs.
[Figure: samples overlaid on the N(10, 3) density. N = 10, −log likelihood = 4.3038.]



Example, N(10,3) Density, Single Obs. Again
[Figure: samples overlaid on the N(10, 3) density. N = 10, −log likelihood = 4.3038.]



Example, N(10,3) Density, Multiple Obs.
[Figure: samples overlaid on the N(10, 3) density, multiple observations. N = 10, −log likelihood = 36.2204.]



Maximum Likelihood Estimation
• The likelihood function can be maximized
  w.r.t. the parameter(s) Θ; doing this, one can
  arrive at estimators for the parameters as well.

  L(\{X_i\}_{i=1}^{n}, \Theta) = \prod_{i=1}^{n} F(X_i; \Theta)

• To do this, find solutions (analytically or by
  following the gradient) to

  \frac{dL(\{X_i\}_{i=1}^{n}, \Theta)}{d\Theta} = 0
Important Trick
• Never (almost) maximize the likelihood
  function itself; maximize the log likelihood
  function instead.
  \log L(\{X_i\}_{i=1}^{n}, \Theta) = \log\!\left(\prod_{i=1}^{n} F(X_i; \Theta)\right)
                                    = \sum_{i=1}^{n} \log F(X_i; \Theta)
Quite often the log of the density is easier to
work with mathematically.
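A minimal sketch of this trick, assuming scipy is available; the sample size and the reading of N(10, 3) as mean 10, standard deviation 3 are assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=10)  # assumed: 10 draws, treating 3 as the standard deviation

# The product of densities underflows easily; the sum of log densities does not.
loglik = np.sum(norm.logpdf(x, loc=10.0, scale=3.0))
print(-loglik)  # negative log likelihood
```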



ML Normal Regression
• Likelihood function
  L(\beta_0, \beta_1, \sigma^2)
    = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}}
      \exp\!\left(-\frac{1}{2\sigma^2}\,(Y_i - \beta_0 - \beta_1 X_i)^2\right)
    = \frac{1}{(2\pi\sigma^2)^{n/2}}
      \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2\right)

which, if you maximize (how?) w.r.t. the
parameters, gives you…



Maximum Likelihood Estimator(s)
• β0
  – b0, the same as in the least squares case
• β1
  – b1, the same as in the least squares case
• σ²

  \hat{\sigma}^2 = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n}

• Note that the ML estimator σ̂² is biased, whereas s² is unbiased, and

  s^2 = MSE = \frac{n}{n-2}\,\hat{\sigma}^2
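A sketch (not from the slides) that maximizes the normal log likelihood numerically and checks these relationships against the closed-form least squares fit; scipy and the synthetic data are assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 30)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)

def neg_log_lik(params):
    b0, b1, log_sigma = params           # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(Y, loc=b0 + b1 * X, scale=sigma))

res = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0])
b0_ml, b1_ml, sigma_ml = res.x[0], res.x[1], np.exp(res.x[2])

# Closed-form least squares fit for comparison
b1_ls = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0_ls = Y.mean() - b1_ls * X.mean()
e = Y - (b0_ls + b1_ls * X)
n = len(Y)

print(b0_ml, b1_ml)                                        # ~ equal to b0_ls, b1_ls
print(sigma_ml**2, np.sum(e**2) / n)                       # ML variance estimate = SSE / n
print(np.sum(e**2) / (n - 2), n / (n - 2) * sigma_ml**2)   # s^2 = (n / (n - 2)) * sigma_hat^2
```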



Comments
• Least squares minimizes the squared error
between the prediction and the true output
• The normal distribution is fully characterized
by its first two central moments (mean and
variance)

• Food for thought:


– What does the bias in the ML estimator of the
error variance mean? And where does it come
from?

