
Maximum Likelihood Estimation (MLE)

Regularizations

Faculty of Computer Science


University of Information Technology (UIT)
Vietnam National University - Ho Chi Minh City (VNU-HCM)

Math for Computer Science, Fall 2023



References

The contents of this document are taken mainly from the following source:
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction.
https://probml.github.io/pml-book/book1.html
Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization



Introduction
The process of estimating θ from D is called model fitting, or training, and is at the heart of machine learning.
There are many methods for estimating θ, and they involve an optimization problem of the form

    θ̂ = argmin_θ L(θ)

where L(θ) is some kind of loss function or objective function.
The process of quantifying uncertainty about an unknown quantity estimated from a finite sample of data is called inference.
In deep learning, the term "inference" refers to "prediction", namely computing p(y|x, θ̂).


Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization



Maximum Likelihood Estimation
The most common approach to parameter estimation is to pick the parameters that assign the highest probability to the training data. This is called maximum likelihood estimation or MLE:

    θ̂mle = argmax_θ p(D|θ)

We usually assume the training examples are "independent and identically distributed", i.e., sampled from the same distribution (the iid assumption). The conditional likelihood then factorizes as

    p(D|θ) = p(y1, y2, . . . , yN | x1, x2, . . . , xN, θ) = ∏_{n=1}^N p(yn|xn, θ)

(Each factor measures how likely yn is, given the feature vector xn and the model parameters θ.)
We usually work with the log likelihood, which decomposes into a sum of terms, one per example:

    LL(θ) = log p(D|θ) = log ∏_{n=1}^N p(yn|xn, θ) = ∑_{n=1}^N log p(yn|xn, θ)


Maximum Likelihood Estimation
The MLE is given by

    θ̂mle = argmax_θ ∑_{n=1}^N log p(yn|xn, θ)

Because most optimization algorithms are designed to minimize cost functions, we redefine the objective function to be the conditional negative log likelihood or NLL:

    NLL(θ) = − log p(D|θ) = − ∑_{n=1}^N log p(yn|xn, θ)

Minimizing this will give the MLE:

    θ̂mle = argmin_θ − ∑_{n=1}^N log p(yn|xn, θ)


Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization



Bernoulli Random Variables

A Bernoulli r.v. X takes two possible values, usually 0 and 1, modeling random experiments that have two possible outcomes (e.g., "success" and "failure").
e.g., tossing a coin. The outcome is either Head or Tail.
e.g., taking an exam. The result is either Pass or Fail.
e.g., classifying images. An image is either Cat or Non-cat.


Bernoulli Random Variables

Definition
A random variable X is a Bernoulli random variable with parameter p ∈ [0, 1], written as X ∼ Bernoulli(p), if its PMF is given by

    P_X(x) = p          for x = 1
    P_X(x) = 1 − p      for x = 0

[Figure: bar plot of the Bernoulli PMF, with a bar of height p at x = 1 and height 1 − p at x = 0.]

Example

A bag contains 3 balls; each ball is either red or blue.
The number of blue balls θ can be 0, 1, 2, or 3.
Choose 4 balls randomly with replacement ("with replacement" means that after a ball is chosen, it is returned to the bag before the next draw).
Random variables X1, X2, X3, X4 are defined as

    Xi = 1 if the i-th chosen ball is blue
    Xi = 0 if the i-th chosen ball is red

After doing the experiment, the following values of the Xi's are observed: x1 = 1, x2 = 0, x3 = 1, x4 = 1.
Note that the Xi's are i.i.d. (independent and identically distributed) and Xi ∼ Bernoulli(θ/3). For which value of θ is the probability of the observed sample the largest?


Example
    P_Xi(x) = θ/3           for x = 1
    P_Xi(x) = 1 − θ/3       for x = 0

Since the Xi's are independent, the joint PMF of X1, X2, X3, X4 can be written as

    P_{X1 X2 X3 X4}(x1, x2, x3, x4) = P_X1(x1) P_X2(x2) P_X3(x3) P_X4(x4)

    P_{X1 X2 X3 X4}(1, 0, 1, 1) = (θ/3) · (1 − θ/3) · (θ/3) · (θ/3) = (θ/3)³ (1 − θ/3)

    θ     P_{X1 X2 X3 X4}(1, 0, 1, 1; θ)
    0     0
    1     0.0247
    2     0.0988
    3     0

The observed data is most likely to occur for θ = 2.
We may choose θ̂ = 2 as our estimate of θ.
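
A quick numeric check of this table, as a minimal sketch in plain Python (the function name `likelihood` is just illustrative):

def likelihood(theta):
    # P(X1=1, X2=0, X3=1, X4=1; theta) = (theta/3)^3 * (1 - theta/3)
    p_blue = theta / 3
    return p_blue ** 3 * (1 - p_blue)

for theta in range(4):  # theta = 0, 1, 2, 3
    print(theta, round(likelihood(theta), 4))
# prints 0 0.0, 1 0.0247, 2 0.0988, 3 0.0 -- the maximum is at theta = 2
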
MLE for the Bernoulli distribution
Suppose Y is a random variable representing a coin toss.
The event Y = 1 corresponds to heads, Y = 0 corresponds to tails.
The probability distribution for this rv is the Bernoulli. The NLL for the Bernoulli distribution is

    NLL(θ) = − log ∏_{n=1}^N p(yn|θ) = − log ∏_{n=1}^N θ^I(yn=1) (1 − θ)^I(yn=0)
           = − ∑_{n=1}^N [ I(yn = 1) log θ + I(yn = 0) log(1 − θ) ]
           = −[N1 log θ + N0 log(1 − θ)]

where N1 = ∑_{n=1}^N I(yn = 1) is the number of heads, N0 = ∑_{n=1}^N I(yn = 0) is the number of tails, and N = N0 + N1 is the sample size.

MLE for the Bernoulli distribution

NLL(θ) = −[N1 log θ + N0 log(1 − θ)]

The derivative of the NLL is

    d/dθ NLL(θ) = −N1/θ + N0/(1 − θ)

The MLE can be found by solving d/dθ NLL(θ) = 0.
The MLE is given by

    θ̂mle = N1 / (N0 + N1)

which is the empirical fraction of heads.
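
As a minimal sketch (assuming NumPy and SciPy are available; the sample below is made up for illustration), the closed-form MLE N1/(N0 + N1) can be checked against a direct numeric minimization of the NLL:

import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1, 1])   # hypothetical coin tosses (1 = heads)
N1, N0 = y.sum(), len(y) - y.sum()

theta_closed = N1 / (N1 + N0)                   # empirical fraction of heads

# NLL(theta) = -[N1 log(theta) + N0 log(1 - theta)]
nll = lambda t: -(N1 * np.log(t) + N0 * np.log(1 - t))
res = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")

print(theta_closed, res.x)                      # both approximately 0.7 for this sample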


MLE for the categorical distribution
Suppose we roll a K-sided dice N times.
Let Yn ∈ {1, . . . , K} be the n-th outcome, where Yn ∼ Cat(θ).
We want to estimate θ from the dataset D = {yn : n = 1 : N}.
The NLL is given by

    NLL(θ) = − ∑_k Nk log θk

where Nk is the number of times the event Y = k is observed.
To compute the MLE, we have to minimize the NLL subject to the constraint that

    ∑_{k=1}^K θk = 1


MLE for the categorical distribution
We use the method of Lagrange multipliers. The Lagrangian is

    L(θ, λ) = − ∑_k Nk log θk − λ (1 − ∑_k θk)

Taking derivatives with respect to λ yields the original constraint:

    ∂L/∂λ = 1 − ∑_k θk = 0

Taking derivatives with respect to θk yields

    ∂L/∂θk = −Nk/θk + λ = 0  →  Nk = λθk

We can solve for λ using the sum-to-one constraint:

    ∑_k Nk = N = λ ∑_k θk = λ

Thus the MLE is given by θ̂k = Nk/λ = Nk/N, the empirical fraction of times event k occurs.
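
A minimal sketch of the same result for a die (NumPy assumed; the rolls below are made up for illustration):

import numpy as np

rolls = np.array([1, 3, 3, 6, 2, 3, 5, 6, 1, 3])   # hypothetical rolls of a K = 6 sided die
K = 6

counts = np.bincount(rolls, minlength=K + 1)[1:]   # N_k for k = 1, ..., K
theta_hat = counts / counts.sum()                  # empirical fractions N_k / N

print(theta_hat)                                   # sums to 1 by construction
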
Standard Normal (Gaussian) Random Variable N (0, 1)

[Figure: PDF of the standard normal, fX(x) = (1/√(2π)) e^(−x²/2), plotted for x ∈ [−3, 3].]

    ∫_{−∞}^{∞} e^(−x²/2) dx = √(2π)

    fX(x) = (1/√(2π)) e^(−x²/2)


General Normal (Gaussian) Random Variable N (µ, σ 2 )

[Figure: PDF of a general normal N(µ, σ²).]

    fX(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

    E[X] = µ        Var(X) = σ²

General Normal (Gaussian) Random Variable N (µ, σ 2 )

[Figure: normal PDFs with σ = 0.5, 1, 2, 3.]

Smaller σ, narrower PDF.
Let Y = aX + b, where X ∼ N(µ, σ²).
Then E[Y] = aE[X] + b = aµ + b and Var(Y) = a²σ² (always true for a linear transformation).
But also, Y ∼ N(aµ + b, a²σ²).


Example

We have N = 3 data points y1 = 1, y2 = 0.5, y3 = 1.5, which are independent and Gaussian with unknown mean µ and variance 1:

    yi ∼ N(µ, 1)

Likelihood: P(y1, y2, y3|µ) = P(y1|µ) P(y2|µ) P(y3|µ).
Consider two guesses, µ = 1.0 and µ = 2.5. Which has higher likelihood?
Finding the µ that maximizes the likelihood is equivalent to moving the Gaussian until the product P(y1|µ) P(y2|µ) P(y3|µ) is maximized.
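
A minimal sketch of this comparison (assuming SciPy is available):

import numpy as np
from scipy.stats import norm

y = np.array([1.0, 0.5, 1.5])

for mu in (1.0, 2.5):
    lik = np.prod(norm.pdf(y, loc=mu, scale=1.0))   # P(y1|mu) P(y2|mu) P(y3|mu)
    print(mu, lik)
# mu = 1.0 gives a likelihood of about 0.049, mu = 2.5 about 0.0017,
# so mu = 1.0 is the better of the two guesses.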


MLE for the univariate Gaussian
Let Y ∼ N(µ, σ²) and let D = {yn : n = 1 : N} be an iid sample of size N.

    p(y|θ) = N(y|µ, σ²) = (1/√(2πσ²)) exp(−(y − µ)²/(2σ²))

We can estimate the parameters θ = (µ, σ²) using MLE.
We derive the NLL, which is given by

    NLL(µ, σ²) = − ∑_{n=1}^N log [ (1/√(2πσ²)) exp(−(yn − µ)²/(2σ²)) ]
               = (1/(2σ²)) ∑_{n=1}^N (yn − µ)² + (N/2) log(2πσ²)

The minimum of this function must satisfy the following conditions:

    ∂NLL(µ, σ²)/∂µ = 0,        ∂NLL(µ, σ²)/∂σ² = 0


MLE for the univariate Gaussian
The solution is given by

    µ̂mle = (1/N) ∑_{n=1}^N yn = ȳ

    σ̂²mle = (1/N) ∑_{n=1}^N (yn − µ̂mle)² = (1/N) ∑_{n=1}^N (yn² + µ̂²mle − 2 yn µ̂mle) = s² − ȳ²

    s² ≜ (1/N) ∑_{n=1}^N yn²

The quantities ȳ and s² are called the sufficient statistics of the data, because they are sufficient to compute the MLE.
Sometimes we might see the estimate for the variance written as

    σ̂² = (1/(N − 1)) ∑_{n=1}^N (yn − µ̂mle)²

which is not the MLE, but a different kind of estimate (the unbiased estimate).
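
A minimal sketch of these estimators (NumPy assumed; the sample is made up for illustration):

import numpy as np

y = np.array([1.0, 0.5, 1.5, 2.0, 0.8])       # hypothetical sample

mu_mle  = y.mean()                            # y-bar
var_mle = np.mean((y - mu_mle) ** 2)          # MLE of the variance
var_alt = np.mean(y ** 2) - mu_mle ** 2       # same value, via s^2 - y-bar^2
var_unb = y.var(ddof=1)                       # unbiased estimate, divides by N - 1

print(mu_mle, var_mle, var_alt, var_unb)
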
Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization



MLE for linear regression

We can make the parameters of the Gaussian be functions of some input variables:

    p(y|x; θ) = N(y | fµ(x; θ), fσ(x; θ)²)

where fµ(x; θ) ∈ R predicts the mean, and fσ(x; θ) ∈ R₊ predicts the variance.
It is common to assume that the variance is fixed and independent of the input. This is called homoscedastic regression.
Furthermore, it is common to assume the mean is a linear function of the input. The resulting model is called linear regression:

    p(y|x; θ) = N(y | wᵀx + b, σ²)

where θ = (w, b, σ).


MLE for linear regression
(The goal of MLE in linear regression is to find the weight vector w for which the residual sum of squares is smallest, so that the model's predictions are as close to the observed data as possible.)

Figure: Linear regression using Gaussian output with mean µ(x) = b + wx and fixed variance σ².
The figure plots the 95% predictive interval [µ(x) − 2σ, µ(x) + 2σ].
This is the uncertainty in the predicted observation y given x, and captures the variability in the blue dots.

MLE for linear regression
Linear regression model:

    p(y|x; θ) = N(y | wᵀx, σ²)

where θ = (w, σ²), and w = (b, w1, w2, . . . , wD).
Assuming that σ² is fixed, we estimate the weights w. The NLL is

    NLL(w) = − ∑_{n=1}^N log [ (1/√(2πσ²)) exp(−(yn − wᵀxn)²/(2σ²)) ]

Dropping the irrelevant additive constants gives the simplified objective, known as the residual sum of squares or RSS:

    RSS(w) = ∑_{n=1}^N (yn − wᵀxn)² = ∑_{n=1}^N rn²

where rn = yn − wᵀxn is the n-th residual error.

MLE for linear regression
Residual sum of squares or RSS:

    RSS(w) = ∑_{n=1}^N (yn − wᵀxn)²

Mean squared error or MSE:

    MSE(w) = (1/N) ∑_{n=1}^N (yn − wᵀxn)²

Root mean squared error or RMSE:

    RMSE(w) = √MSE(w) = √[ (1/N) ∑_{n=1}^N (yn − wᵀxn)² ]

(RSS is simply the sum of the squared errors, MSE is their average, and RMSE is the square root of the MSE.)
We can compute the MLE by minimizing the NLL, RSS, MSE, or RMSE. All give the same result.
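
These three objectives are straightforward to express in code; a minimal sketch (NumPy assumed, names illustrative):

import numpy as np

def rss(w, X, y):
    r = y - X @ w                 # residuals r_n = y_n - w^T x_n
    return np.sum(r ** 2)

def mse(w, X, y):
    return rss(w, X, y) / len(y)

def rmse(w, X, y):
    return np.sqrt(mse(w, X, y))
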
MLE for linear regression
The RSS can be written in matrix notation as follows:

    RSS(w) = ∑_{n=1}^N (yn − wᵀxn)² = ‖Xw − y‖₂² = (Xw − y)ᵀ(Xw − y)

The gradient is given by

    ∇w RSS(w) = XᵀXw − Xᵀy

Setting the gradient to zero, ∇w RSS(w) = 0, and solving gives

    XᵀXw = Xᵀy

These are known as the normal equations.
The MLE solution ŵmle is called the ordinary least squares (OLS) solution:

    ŵmle = argmin_w RSS(w) = (XᵀX)⁻¹Xᵀy
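
A minimal sketch of solving the normal equations on synthetic data (NumPy assumed; in practice np.linalg.lstsq is usually preferred over forming (XᵀX)⁻¹ explicitly):

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])   # design matrix with a bias column
w_true = np.array([1.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=50)                # noisy targets

w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)           # solve X^T X w = X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)           # numerically more robust route

print(w_normal_eq, w_lstsq)                               # both close to [1.0, 2.0]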


MLE for linear regression

    ŵmle = argmin_w RSS(w) = (XᵀX)⁻¹Xᵀy

The quantity X† = (XᵀX)⁻¹Xᵀ is the (left) pseudo-inverse of the (non-square) matrix X.
Is the solution ŵmle unique?
The gradient is ∇w RSS(w) = XᵀXw − Xᵀy. Then the Hessian is

    H(w) = ∂²RSS(w)/∂w² = XᵀX

If X is full rank (i.e., the columns of X are linearly independent), then H is positive definite, since for any v ≠ 0 we have

    vᵀ(XᵀX)v = (Xv)ᵀ(Xv) = ‖Xv‖² > 0

In the full rank case, RSS(w) has a unique global minimum.

Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization



Overfitting
MLE will try to pick parameters that minimize loss on the training
set, but this may not result in a model that has low loss on future
data. This is called overfitting.
Ex: We want to predict the probability of heads when tossing a coin.
We toss it N = 3 times and observe 3 heads. The MLE is

    θ̂mle = N1 / (N0 + N1) = 3 / (3 + 0) = 1

If we use this Ber(y|θ̂mle) to make predictions, we will predict that all future coin tosses will also be heads!
The model has enough parameters to perfectly fit the observed training data, so it can perfectly match the empirical distribution.
In most cases, the empirical distribution is not the same as the true distribution. Putting all the probability mass on the observed set of N examples leaves no probability for novel data in the future. The model may not generalize.

Example: MLE for Linear Regression

Example 1:
Training data: x1 = (1, 0)ᵀ, y1 = 1 and x2 = (1, ε)ᵀ, y2 = 1.

    X = [ 1  0 ]        y = [ 1 ]
        [ 1  ε ]            [ 1 ]

    ŵmle = (XᵀX)⁻¹Xᵀy = ?

Example 2:
Training data: x1 = (1, 0)ᵀ, y1 = 1 + ε and x2 = (1, ε)ᵀ, y2 = 1.

    X = [ 1  0 ]        y = [ 1 + ε ]
        [ 1  ε ]            [ 1     ]

    ŵmle = (XᵀX)⁻¹Xᵀy = ?


Example: MLE for Linear Regression

Example 1:
Training data: x1 = (1, 0)ᵀ, y1 = 1 and x2 = (1, ε)ᵀ, y2 = 1.

    X = [ 1  0 ]        y = [ 1 ]
        [ 1  ε ]            [ 1 ]

    ŵmle = (XᵀX)⁻¹Xᵀy

    XᵀX = [ 1  1 ] [ 1  0 ] = [ 2  ε  ]
          [ 0  ε ] [ 1  ε ]   [ ε  ε² ]

    (XᵀX)⁻¹ = [  1     −1/ε  ]
              [ −1/ε   2/ε²  ]

    ŵmle = [  1     −1/ε  ] [ 1  1 ] [ 1 ] = [ 1 ]
           [ −1/ε   2/ε²  ] [ 0  ε ] [ 1 ]   [ 0 ]


Example: MLE for Linear Regression

Example 2:
Training data: x1 = (1, 0)ᵀ, y1 = 1 + ε and x2 = (1, ε)ᵀ, y2 = 1.

    X = [ 1  0 ]        y = [ 1 + ε ]
        [ 1  ε ]            [ 1     ]

    ŵmle = (XᵀX)⁻¹Xᵀy

    XᵀX = [ 1  1 ] [ 1  0 ] = [ 2  ε  ]
          [ 0  ε ] [ 1  ε ]   [ ε  ε² ]

    (XᵀX)⁻¹ = [  1     −1/ε  ]
              [ −1/ε   2/ε²  ]

    ŵmle = [  1     −1/ε  ] [ 1  1 ] [ 1 + ε ] = [ 1 + ε ]
           [ −1/ε   2/ε²  ] [ 0  ε ] [ 1     ]   [  −1   ]
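
A quick numeric check of both examples, as a minimal sketch (NumPy assumed; ε = 0.1 is an arbitrary choice, matching the value used later in the deck):

import numpy as np

eps = 0.1
X = np.array([[1.0, 0.0],
              [1.0, eps]])

for y in (np.array([1.0, 1.0]),            # Example 1
          np.array([1.0 + eps, 1.0])):     # Example 2
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    print(w_mle)
# Example 1 gives [1, 0]; Example 2 gives [1 + eps, -1] = [1.1, -1]:
# a small perturbation of y produces a large change in the fitted weights.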


Regularization
The main solution to overfitting is to use regularization.
We add a penalty term to the NLL (or empirical risk):

    L(θ; λ) = [ (1/N) ∑_{n=1}^N ℓ(yn, f(xn; θ)) ] + λ C(θ)

where λ ≥ 0 is the regularization parameter, and C(θ) is some form of complexity penalty.
A common complexity penalty is C(θ) = − log p(θ), where p(θ) is the prior for θ.
If ℓ is the log loss, the regularized objective becomes

    L(θ; λ) = −(1/N) ∑_{n=1}^N log p(yn|xn, θ) − λ log p(θ)


Maximum a posteriori estimation (MAP)
    L(θ; λ) = −(1/N) ∑_{n=1}^N log p(yn|xn, θ) − λ log p(θ)

By setting λ = 1 and rescaling p(θ) appropriately, we can equivalently minimize the following:

    L(θ; λ) = −[ ∑_{n=1}^N log p(yn|xn, θ) + log p(θ) ] = −[log p(D|θ) + log p(θ)]

Minimizing this is equivalent to maximizing the log posterior:

    θ̂map = argmax_θ log p(θ|D) = argmax_θ log [ p(D|θ) p(θ) / p(D) ]
          = argmax_θ [ log p(D|θ) + log p(θ) − const ]

This is MAP estimation, or maximum a posteriori estimation.

MAP estimation for Bernoulli distribution
Coin tossing. If we observe just one head, the MLE is θ̂mle = 1.
To avoid this, we can add a penalty to θ to discourage “extreme”
values, such as θ = 0 or θ = 1.
We can use a beta distribution as our prior, p(θ) = Beta(θ|a, b), where a, b > 1 encourage values of θ near a/(a + b).
If a = b = 1, we get the uniform distribution.
If a and b are both less than 1, we get a bimodal distribution.
If a and b are both greater than 1, the distribution is unimodal.

    mean = a/(a + b)        var = ab/[(a + b)²(a + b + 1)]


MAP estimate for the Bernoulli distribution

Using the beta distribution as our prior, p(θ) = Beta(θ|a, b), the log likelihood plus log prior becomes

    LL(θ) = log p(D|θ) + log p(θ)
          = [N1 log θ + N0 log(1 − θ)] + [(a − 1) log θ + (b − 1) log(1 − θ)]

The MAP estimate is

    θ̂map = (N1 + a − 1) / (N1 + N0 + a + b − 2)

If we set a = b = 2, which weakly favors a value of θ near 0.5, the estimate becomes

    θ̂map = (N1 + 1) / (N1 + N0 + 2)

This is called add-one smoothing; it avoids the zero-count problem.
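
A minimal sketch of the MAP estimate versus the MLE (plain Python; function names are illustrative):

def bernoulli_mle(n_heads, n_tails):
    return n_heads / (n_heads + n_tails)

def bernoulli_map(n_heads, n_tails, a=2.0, b=2.0):
    # MAP estimate under a Beta(a, b) prior; a = b = 2 gives add-one smoothing
    return (n_heads + a - 1) / (n_heads + n_tails + a + b - 2)

# One toss, one head: the MLE says heads is certain, the MAP estimate does not.
print(bernoulli_mle(1, 0))   # 1.0
print(bernoulli_map(1, 0))   # (1 + 1) / (1 + 0 + 2) = 0.667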


Black swan paradox

The zero-count problem, and overfitting, is analogous to the black swan paradox.
It is used to illustrate the problem of induction: how to draw general conclusions about the future from specific observations from the past.
The solution to the paradox is to admit that induction is in general impossible. The best we can do is to make plausible guesses by combining the empirical data with prior knowledge.


Weight decay

Polynomial regression with too many degrees of freedom can result in overfitting. One solution is to reduce the degree of the polynomial.
A more general solution is to penalize the magnitude of the weights (regression coefficients).
We use a zero-mean Gaussian prior p(w). The MAP estimate is

    ŵmap = argmin_w NLL(w) + λ‖w‖₂²

where ‖w‖₂² = ∑_{d=1}^D wd². We penalize the magnitude of the weight vector w, rather than the bias term b.
This penalty is called ℓ2 regularization or weight decay.
The larger the value of λ, the more the parameters are penalized for being large (i.e., for deviating from the zero-mean prior), and thus the less flexible the model.


Ridge regression

In the case of linear regression, the weight decay penalization scheme is called ridge regression.
Consider polynomial regression, where the predictor has the form

    f(x; w) = ∑_{d=0}^D wd x^d = wᵀ[1, x, x², . . . , x^D]

Suppose we use a high-degree polynomial, say D = 14, even though we have a small dataset with just N = 21 examples.
MLE for the parameters will enable the model to fit the data very well, but the resulting function is very "wiggly", thus resulting in overfitting.
Increasing λ can reduce overfitting.


Ridge regression

[Figure: ridge regression illustration.]


Ridge regression
MAP estimation with a zero-mean Gaussian prior p(w) = N(w|0, τ²I):

    ŵmap = argmin_w (1/(2σ²))(y − Xw)ᵀ(y − Xw) + (1/(2τ²)) wᵀw
         = argmin_w RSS(w) + λ‖w‖₂²

where λ = σ²/τ² is proportional to the strength of the prior, and

    ‖w‖₂ = √( ∑_{d=1}^D |wd|² ) = √(wᵀw)

is the ℓ2 norm of the vector w.
We do not penalize the offset w0, since that only affects the global mean of the output, and does not contribute to the overfitting.

Ridge Regression

The MAP estimate corresponds to minimizing the penalized objective:

    J(w) = (y − Xw)ᵀ(y − Xw) + λ‖w‖₂²

where λ = σ²/τ² is the strength of the regularizer.
The derivative is given by

    ∇w J(w) = 2(XᵀXw − Xᵀy + λw)

Therefore,

    ŵmap = (XᵀX + λI_D)⁻¹Xᵀy = ( ∑_n xn xnᵀ + λI_D )⁻¹ ( ∑_n yn xn )
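
A minimal sketch of this closed form (NumPy assumed; the function name is illustrative). Note that, as written, the penalty λI_D also shrinks the bias term; to follow the earlier advice of leaving w0 unpenalized, one common option is to center the data first or to zero out the corresponding diagonal entry of the penalty:

import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge / MAP solution: (X^T X + lam * I)^{-1} X^T y
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)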


Example: MAP for Linear Regression

Maximum likelihood estimation. Let ε = 0.1.

    Ex. 1: ŵmle = [  1     −1/ε  ] [ 1  1 ] [ 1 ] = [ 1 ]
                  [ −1/ε   2/ε²  ] [ 0  ε ] [ 1 ]   [ 0 ]

    Ex. 2: ŵmle = [  1     −1/ε  ] [ 1  1 ] [ 1 + ε ] = [ 1 + ε ] = [ 1.1 ]
                  [ −1/ε   2/ε²  ] [ 0  ε ] [ 1     ]   [  −1   ]   [ −1  ]

Maximum a posteriori estimation. Let λ = 0.05.

    XᵀX + λI_D = [ 2 + λ    ε     ] = [ 2.05  0.1  ]
                 [   ε    ε² + λ  ]   [ 0.1   0.06 ]

    (XᵀX + λI_D)⁻¹ = [  0.531    −0.885  ]
                     [ −0.885   18.1416  ]

    Ex. 1: ŵmap = [  0.531    −0.885  ] [ 1  1 ] [ 1 ] = [ 0.9735 ]
                  [ −0.885   18.1416  ] [ 0  ε ] [ 1 ]   [ 0.0442 ]

    Ex. 2: ŵmap = [  0.531    −0.885  ] [ 1  1 ] [ 1 + ε ] = [  1.0265 ]
                  [ −0.885   18.1416  ] [ 0  ε ] [ 1     ]   [ −0.0442 ]
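
A minimal sketch reproducing these numbers (NumPy assumed):

import numpy as np

eps, lam = 0.1, 0.05
X = np.array([[1.0, 0.0],
              [1.0, eps]])
A = X.T @ X + lam * np.eye(2)

for y in (np.array([1.0, 1.0]),            # Ex. 1
          np.array([1.0 + eps, 1.0])):     # Ex. 2
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    w_map = np.linalg.solve(A, X.T @ y)
    print(w_mle, w_map)
# Ex. 1: w_mle ~ [1, 0],      w_map ~ [0.9735, 0.0442]
# Ex. 2: w_mle ~ [1.1, -1],   w_map ~ [1.0265, -0.0442]
# The small ridge penalty keeps the MAP weights stable under the perturbation eps.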
