
Maximum Likelihood Estimation (MLE)

Regularizations

Faculty of Computer Science


University of Information Technology (UIT)
Vietnam National University - Ho Chi Minh City (VNU-HCM)

Math for Computer Science, Fall 2023



References

The contents of this document are taken mainly from the following source:
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction.
https://probml.github.io/pml-book/book1.html
Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization



Introduction
The process of estimating θ from D is called model fitting, or training, and is at the heart of machine learning.
There are many methods for estimating θ, and they involve an optimization problem of the form

    θ̂ = argmin_θ L(θ)

where L(θ) is some kind of loss function or objective function.
The process of quantifying uncertainty about an unknown quantity estimated from a finite sample of data is called inference.
In deep learning, the term "inference" refers to "prediction", namely computing p(y|x, θ̂).


Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization



Maximum Likelihood Estimation
The most common approach to parameter estimation is to pick the parameters that assign the highest probability to the training data. This is called maximum likelihood estimation or MLE:

    θ̂mle = argmax_θ p(D|θ)

We usually assume the training examples are "independent and identically distributed", i.e., sampled from the same distribution (the iid assumption). The conditional likelihood then factorizes as

    p(D|θ) = p(y1, y2, . . . , yN | x1, x2, . . . , xN, θ) = ∏_{n=1}^N p(yn|xn, θ)

(Each factor measures how likely yn is, given the feature vector xn and the model parameters θ.)
We usually work with the log likelihood, which decomposes into a sum of terms, one per example:

    LL(θ) = log p(D|θ) = log ∏_{n=1}^N p(yn|xn, θ) = ∑_{n=1}^N log p(yn|xn, θ)


Maximum Likelihood Estimation
The MLE is given by

    θ̂mle = argmax_θ ∑_{n=1}^N log p(yn|xn, θ)

Because most optimization algorithms are designed to minimize cost functions, we redefine the objective function to be the conditional negative log likelihood or NLL:

    NLL(θ) = − log p(D|θ) = − ∑_{n=1}^N log p(yn|xn, θ)

Minimizing this will give the MLE:

    θ̂mle = argmin_θ − ∑_{n=1}^N log p(yn|xn, θ)


Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization



Bernoulli Random Variables

A Bernoulli r.v. X takes two possible values, usually 0 and 1, modeling random experiments that have two possible outcomes (e.g., "success" and "failure").
e.g., tossing a coin. The outcome is either Head or Tail.
e.g., taking an exam. The result is either Pass or Fail.
e.g., classifying images. An image is either Cat or Non-cat.


Bernoulli Random Variables

Definition
A random variable X is a Bernoulli random variable with parameter p ∈ [0, 1], written as X ∼ Bernoulli(p), if its PMF is given by

    P_X(x) = p          for x = 1
    P_X(x) = 1 − p      for x = 0

[Figure: bar plot of the Bernoulli PMF, with a bar of height p at x = 1 and height 1 − p at x = 0.]

Example

A bag contains 3 balls; each ball is either red or blue.
The number of blue balls θ can be 0, 1, 2, or 3.
Choose 4 balls randomly with replacement ("with replacement" means that after a ball is chosen, it is returned to the bag before the next draw).
Random variables X1, X2, X3, X4 are defined as

    Xi = 1 if the i-th chosen ball is blue
    Xi = 0 if the i-th chosen ball is red

After doing the experiment, the following values of the Xi's are observed: x1 = 1, x2 = 0, x3 = 1, x4 = 1.
Note that the Xi's are i.i.d. (independent and identically distributed) and Xi ∼ Bernoulli(θ/3). For which value of θ is the probability of the observed sample the largest?


Example
    P_Xi(x) = θ/3           for x = 1
    P_Xi(x) = 1 − θ/3       for x = 0

Since the Xi's are independent, the joint PMF of X1, X2, X3, X4 can be written as

    P_{X1 X2 X3 X4}(x1, x2, x3, x4) = P_X1(x1) P_X2(x2) P_X3(x3) P_X4(x4)

    P_{X1 X2 X3 X4}(1, 0, 1, 1) = (θ/3) · (1 − θ/3) · (θ/3) · (θ/3) = (θ/3)³ (1 − θ/3)

    θ     P_{X1 X2 X3 X4}(1, 0, 1, 1; θ)
    0     0
    1     0.0247
    2     0.0988
    3     0

The observed data is most likely to occur for θ = 2.
We may choose θ̂ = 2 as our estimate of θ.
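
A quick numeric check of this table, as a minimal sketch in plain Python (the function name `likelihood` is just illustrative):

def likelihood(theta):
    # P(X1=1, X2=0, X3=1, X4=1; theta) = (theta/3)^3 * (1 - theta/3)
    p_blue = theta / 3
    return p_blue ** 3 * (1 - p_blue)

for theta in range(4):  # theta = 0, 1, 2, 3
    print(theta, round(likelihood(theta), 4))
# prints 0 0.0, 1 0.0247, 2 0.0988, 3 0.0 -- the maximum is at theta = 2
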
MLE for the Bernoulli distribution
Suppose Y is a random variable representing a coin toss.
The event Y = 1 corresponds to heads, Y = 0 corresponds to tails.
The probability distribution for this rv is the Bernoulli. The NLL for the Bernoulli distribution is

    NLL(θ) = − log ∏_{n=1}^N p(yn|θ) = − log ∏_{n=1}^N θ^I(yn=1) (1 − θ)^I(yn=0)
           = − ∑_{n=1}^N [ I(yn = 1) log θ + I(yn = 0) log(1 − θ) ]
           = −[N1 log θ + N0 log(1 − θ)]

where N1 = ∑_{n=1}^N I(yn = 1) is the number of heads, N0 = ∑_{n=1}^N I(yn = 0) is the number of tails, and N = N0 + N1 is the sample size.

MLE for the Bernoulli distribution

NLL(θ) = −[N1 log θ + N0 log(1 − θ)]

The derivative of the NLL is

    d/dθ NLL(θ) = −N1/θ + N0/(1 − θ)

The MLE can be found by solving d/dθ NLL(θ) = 0.
The MLE is given by

    θ̂mle = N1 / (N0 + N1)

which is the empirical fraction of heads.
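
As a minimal sketch (assuming NumPy and SciPy are available; the sample below is made up for illustration), the closed-form MLE N1/(N0 + N1) can be checked against a direct numeric minimization of the NLL:

import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1, 1])   # hypothetical coin tosses (1 = heads)
N1, N0 = y.sum(), len(y) - y.sum()

theta_closed = N1 / (N1 + N0)                   # empirical fraction of heads

# NLL(theta) = -[N1 log(theta) + N0 log(1 - theta)]
nll = lambda t: -(N1 * np.log(t) + N0 * np.log(1 - t))
res = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")

print(theta_closed, res.x)                      # both approximately 0.7 for this sample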


MLE for the categorical distribution
Suppose we roll a K-sided dice N times.
Let Yn ∈ {1, . . . , K} be the n-th outcome, where Yn ∼ Cat(θ).
We want to estimate θ from the dataset D = {yn : n = 1 : N}.
The NLL is given by

    NLL(θ) = − ∑_k Nk log θk

where Nk is the number of times the event Y = k is observed.
To compute the MLE, we have to minimize the NLL subject to the constraint that

    ∑_{k=1}^K θk = 1


MLE for the categorical distribution
We use the method of Lagrange multipliers. The Lagrangian is

    L(θ, λ) = − ∑_k Nk log θk − λ (1 − ∑_k θk)

Taking derivatives with respect to λ yields the original constraint:

    ∂L/∂λ = 1 − ∑_k θk = 0

Taking derivatives with respect to θk yields

    ∂L/∂θk = −Nk/θk + λ = 0  →  Nk = λθk

We can solve for λ using the sum-to-one constraint:

    ∑_k Nk = N = λ ∑_k θk = λ

Thus the MLE is given by θ̂k = Nk/λ = Nk/N, the empirical fraction of times event k occurs.
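
A minimal sketch of the same result for a die (NumPy assumed; the rolls below are made up for illustration):

import numpy as np

rolls = np.array([1, 3, 3, 6, 2, 3, 5, 6, 1, 3])   # hypothetical rolls of a K = 6 sided die
K = 6

counts = np.bincount(rolls, minlength=K + 1)[1:]   # N_k for k = 1, ..., K
theta_hat = counts / counts.sum()                  # empirical fractions N_k / N

print(theta_hat)                                   # sums to 1 by construction
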
Standard Normal (Gaussian) Random Variable N (0, 1)

[Figure: PDF of the standard normal, fX(x) = (1/√(2π)) e^(−x²/2), plotted for x ∈ [−3, 3].]

    ∫_{−∞}^{∞} e^(−x²/2) dx = √(2π)

    fX(x) = (1/√(2π)) e^(−x²/2)


General Normal (Gaussian) Random Variable N (µ, σ 2 )

[Figure: PDF of a general normal N(µ, σ²).]

    fX(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

    E[X] = µ        Var(X) = σ²

General Normal (Gaussian) Random Variable N (µ, σ 2 )

[Figure: normal PDFs with σ = 0.5, 1, 2, 3.]

Smaller σ, narrower PDF.
Let Y = aX + b, where X ∼ N(µ, σ²).
Then E[Y] = aE[X] + b = aµ + b and Var(Y) = a²σ² (always true for a linear transformation).
But also, Y ∼ N(aµ + b, a²σ²).


Example

We have N = 3 data points y1 = 1, y2 = 0.5, y3 = 1.5, which are independent and Gaussian with unknown mean µ and variance 1:

    yi ∼ N(µ, 1)

Likelihood: P(y1, y2, y3|µ) = P(y1|µ) P(y2|µ) P(y3|µ).
Consider two guesses, µ = 1.0 and µ = 2.5. Which has higher likelihood?
Finding the µ that maximizes the likelihood is equivalent to moving the Gaussian until the product P(y1|µ) P(y2|µ) P(y3|µ) is maximized.
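
A minimal sketch of this comparison (assuming SciPy is available):

import numpy as np
from scipy.stats import norm

y = np.array([1.0, 0.5, 1.5])

for mu in (1.0, 2.5):
    lik = np.prod(norm.pdf(y, loc=mu, scale=1.0))   # P(y1|mu) P(y2|mu) P(y3|mu)
    print(mu, lik)
# mu = 1.0 gives a likelihood of about 0.049, mu = 2.5 about 0.0017,
# so mu = 1.0 is the better of the two guesses.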


MLE for the univariate Gaussian
Let Y ∼ N(µ, σ²) and let D = {yn : n = 1 : N} be an iid sample of size N.

    p(y|θ) = N(y|µ, σ²) = (1/√(2πσ²)) exp(−(y − µ)²/(2σ²))

We can estimate the parameters θ = (µ, σ²) using MLE.
We derive the NLL, which is given by

    NLL(µ, σ²) = − ∑_{n=1}^N log [ (1/√(2πσ²)) exp(−(yn − µ)²/(2σ²)) ]
               = (1/(2σ²)) ∑_{n=1}^N (yn − µ)² + (N/2) log(2πσ²)

The minimum of this function must satisfy the following conditions:

    ∂NLL(µ, σ²)/∂µ = 0,        ∂NLL(µ, σ²)/∂σ² = 0


MLE for the univariate Gaussian
The solution is given by

    µ̂mle = (1/N) ∑_{n=1}^N yn = ȳ

    σ̂²mle = (1/N) ∑_{n=1}^N (yn − µ̂mle)² = (1/N) ∑_{n=1}^N (yn² + µ̂²mle − 2 yn µ̂mle) = s² − ȳ²

    s² ≜ (1/N) ∑_{n=1}^N yn²

The quantities ȳ and s² are called the sufficient statistics of the data, because they are sufficient to compute the MLE.
Sometimes we might see the estimate for the variance written as

    σ̂² = (1/(N − 1)) ∑_{n=1}^N (yn − µ̂mle)²

which is not the MLE, but a different kind of estimate (the unbiased estimate).
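
A minimal sketch of these estimators (NumPy assumed; the sample is made up for illustration):

import numpy as np

y = np.array([1.0, 0.5, 1.5, 2.0, 0.8])       # hypothetical sample

mu_mle  = y.mean()                            # y-bar
var_mle = np.mean((y - mu_mle) ** 2)          # MLE of the variance
var_alt = np.mean(y ** 2) - mu_mle ** 2       # same value, via s^2 - y-bar^2
var_unb = y.var(ddof=1)                       # unbiased estimate, divides by N - 1

print(mu_mle, var_mle, var_alt, var_unb)
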
Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization



MLE for linear regression

We can make the parameters of the Gaussian be functions of some input variables:

    p(y|x; θ) = N(y | fµ(x; θ), fσ(x; θ)²)

where fµ(x; θ) ∈ R predicts the mean, and fσ(x; θ) ∈ R₊ predicts the variance.
It is common to assume that the variance is fixed and independent of the input. This is called homoscedastic regression.
Furthermore, it is common to assume the mean is a linear function of the input. The resulting model is called linear regression:

    p(y|x; θ) = N(y | wᵀx + b, σ²)

where θ = (w, b, σ).


MLE for linear regression
(The goal of MLE in linear regression is to find the weight vector w for which the residual sum of squares is smallest, so that the model's predictions are as close to the observed data as possible.)

Figure: Linear regression using Gaussian output with mean µ(x) = b + wx and fixed variance σ².
The figure plots the 95% predictive interval [µ(x) − 2σ, µ(x) + 2σ].
This is the uncertainty in the predicted observation y given x, and captures the variability in the blue dots.

MLE for linear regression
Linear regression model:

    p(y|x; θ) = N(y | wᵀx, σ²)

where θ = (w, σ²), and w = (b, w1, w2, . . . , wD).
Assuming that σ² is fixed, we estimate the weights w. The NLL is

    NLL(w) = − ∑_{n=1}^N log [ (1/√(2πσ²)) exp(−(yn − wᵀxn)²/(2σ²)) ]

Dropping the irrelevant additive constants gives the simplified objective, known as the residual sum of squares or RSS:

    RSS(w) = ∑_{n=1}^N (yn − wᵀxn)² = ∑_{n=1}^N rn²

where rn = yn − wᵀxn is the n-th residual error.

MLE for linear regression
Residual sum of squares or RSS:

    RSS(w) = ∑_{n=1}^N (yn − wᵀxn)²

Mean squared error or MSE:

    MSE(w) = (1/N) ∑_{n=1}^N (yn − wᵀxn)²

Root mean squared error or RMSE:

    RMSE(w) = √MSE(w) = √[ (1/N) ∑_{n=1}^N (yn − wᵀxn)² ]

(RSS is simply the sum of the squared errors, MSE is their average, and RMSE is the square root of the MSE.)
We can compute the MLE by minimizing the NLL, RSS, MSE, or RMSE. All give the same result.
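
These three objectives are straightforward to express in code; a minimal sketch (NumPy assumed, names illustrative):

import numpy as np

def rss(w, X, y):
    r = y - X @ w                 # residuals r_n = y_n - w^T x_n
    return np.sum(r ** 2)

def mse(w, X, y):
    return rss(w, X, y) / len(y)

def rmse(w, X, y):
    return np.sqrt(mse(w, X, y))
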
MLE for linear regression
The RSS can be written in matrix notation as follows:

    RSS(w) = ∑_{n=1}^N (yn − wᵀxn)² = ‖Xw − y‖₂² = (Xw − y)ᵀ(Xw − y)

The gradient is given by

    ∇w RSS(w) = XᵀXw − Xᵀy

Setting the gradient to zero, ∇w RSS(w) = 0, and solving gives

    XᵀXw = Xᵀy

These are known as the normal equations.
The MLE solution ŵmle is called the ordinary least squares (OLS) solution:

    ŵmle = argmin_w RSS(w) = (XᵀX)⁻¹Xᵀy
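
A minimal sketch of solving the normal equations on synthetic data (NumPy assumed; in practice np.linalg.lstsq is usually preferred over forming (XᵀX)⁻¹ explicitly):

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])   # design matrix with a bias column
w_true = np.array([1.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=50)                # noisy targets

w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)           # solve X^T X w = X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)           # numerically more robust route

print(w_normal_eq, w_lstsq)                               # both close to [1.0, 2.0]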


MLE for linear regression

    ŵmle = argmin_w RSS(w) = (XᵀX)⁻¹Xᵀy

The quantity X† = (XᵀX)⁻¹Xᵀ is the (left) pseudo-inverse of the (non-square) matrix X.
Is the solution ŵmle unique?
The gradient is ∇w RSS(w) = XᵀXw − Xᵀy. Then the Hessian is

    H(w) = ∂²RSS(w)/∂w² = XᵀX

If X is full rank (i.e., the columns of X are linearly independent), then H is positive definite, since for any v ≠ 0 we have

    vᵀ(XᵀX)v = (Xv)ᵀ(Xv) = ‖Xv‖² > 0

In the full rank case, RSS(w) has a unique global minimum.

Table of Contents

1 Introduction

2 Maximum Likelihood Estimation

3 Examples

4 MLE for Linear Regression

5 Regularization



Overfitting
MLE will try to pick parameters that minimize loss on the training
set, but this may not result in a model that has low loss on future
data. This is called overfitting.
Ex: We want to predict the probability of heads when tossing a coin.
We toss it N = 3 times and observe 3 heads. The MLE is

    θ̂mle = N1 / (N0 + N1) = 3 / (3 + 0) = 1

If we use this Ber(y|θ̂mle) to make predictions, we will predict that all future coin tosses will also be heads!
The model has enough parameters to perfectly fit the observed training data, so it can perfectly match the empirical distribution.
In most cases, the empirical distribution is not the same as the true distribution. Putting all the probability mass on the observed set of N examples leaves no probability for novel data in the future. The model may not generalize.

Example: MLE for Linear Regression

Example 1:
Training data: x1 = (1, 0)ᵀ, y1 = 1 and x2 = (1, ε)ᵀ, y2 = 1.

    X = [ 1  0 ]        y = [ 1 ]
        [ 1  ε ]            [ 1 ]

    ŵmle = (XᵀX)⁻¹Xᵀy = ?

Example 2:
Training data: x1 = (1, 0)ᵀ, y1 = 1 + ε and x2 = (1, ε)ᵀ, y2 = 1.

    X = [ 1  0 ]        y = [ 1 + ε ]
        [ 1  ε ]            [ 1     ]

    ŵmle = (XᵀX)⁻¹Xᵀy = ?


Example: MLE for Linear Regression

Example 1:
Training data: x1 = (1, 0)ᵀ, y1 = 1 and x2 = (1, ε)ᵀ, y2 = 1.

    X = [ 1  0 ]        y = [ 1 ]
        [ 1  ε ]            [ 1 ]

    ŵmle = (XᵀX)⁻¹Xᵀy

    XᵀX = [ 1  1 ] [ 1  0 ] = [ 2  ε  ]
          [ 0  ε ] [ 1  ε ]   [ ε  ε² ]

    (XᵀX)⁻¹ = [  1     −1/ε  ]
              [ −1/ε   2/ε²  ]

    ŵmle = [  1     −1/ε  ] [ 1  1 ] [ 1 ] = [ 1 ]
           [ −1/ε   2/ε²  ] [ 0  ε ] [ 1 ]   [ 0 ]


Example: MLE for Linear Regression

Example 2:
Training data: x1 = (1, 0)ᵀ, y1 = 1 + ε and x2 = (1, ε)ᵀ, y2 = 1.

    X = [ 1  0 ]        y = [ 1 + ε ]
        [ 1  ε ]            [ 1     ]

    ŵmle = (XᵀX)⁻¹Xᵀy

    XᵀX = [ 1  1 ] [ 1  0 ] = [ 2  ε  ]
          [ 0  ε ] [ 1  ε ]   [ ε  ε² ]

    (XᵀX)⁻¹ = [  1     −1/ε  ]
              [ −1/ε   2/ε²  ]

    ŵmle = [  1     −1/ε  ] [ 1  1 ] [ 1 + ε ] = [ 1 + ε ]
           [ −1/ε   2/ε²  ] [ 0  ε ] [ 1     ]   [  −1   ]
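
A quick numeric check of both examples, as a minimal sketch (NumPy assumed; ε = 0.1 is an arbitrary choice, matching the value used later in the deck):

import numpy as np

eps = 0.1
X = np.array([[1.0, 0.0],
              [1.0, eps]])

for y in (np.array([1.0, 1.0]),            # Example 1
          np.array([1.0 + eps, 1.0])):     # Example 2
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    print(w_mle)
# Example 1 gives [1, 0]; Example 2 gives [1 + eps, -1] = [1.1, -1]:
# a small perturbation of y produces a large change in the fitted weights.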


Regularization
The main solution to overfitting is to use regularization.
We add a penalty term to the NLL (or empirical risk):

    L(θ; λ) = [ (1/N) ∑_{n=1}^N ℓ(yn, f(xn; θ)) ] + λ C(θ)

where λ ≥ 0 is the regularization parameter, and C(θ) is some form of complexity penalty.
A common complexity penalty is C(θ) = − log p(θ), where p(θ) is the prior for θ.
If ℓ is the log loss, the regularized objective becomes

    L(θ; λ) = −(1/N) ∑_{n=1}^N log p(yn|xn, θ) − λ log p(θ)


Maximum a posteriori estimation (MAP)
    L(θ; λ) = −(1/N) ∑_{n=1}^N log p(yn|xn, θ) − λ log p(θ)

By setting λ = 1 and rescaling p(θ) appropriately, we can equivalently minimize the following:

    L(θ; λ) = −[ ∑_{n=1}^N log p(yn|xn, θ) + log p(θ) ] = −[log p(D|θ) + log p(θ)]

Minimizing this is equivalent to maximizing the log posterior:

    θ̂map = argmax_θ log p(θ|D) = argmax_θ log [ p(D|θ) p(θ) / p(D) ]
          = argmax_θ [ log p(D|θ) + log p(θ) − const ]

This is MAP estimation, or maximum a posteriori estimation.

MAP estimation for Bernoulli distribution
Coin tossing. If we observe just one head, the MLE is θ̂mle = 1.
To avoid this, we can add a penalty to θ to discourage “extreme”
values, such as θ = 0 or θ = 1.
We can use a beta distribution as our prior, p(θ) = Beta(θ|a, b), where a, b > 1 encourage values of θ near a/(a + b).
If a = b = 1, we get the uniform distribution.
If a and b are both less than 1, we get a bimodal distribution.
If a and b are both greater than 1, the distribution is unimodal.

    mean = a/(a + b)        var = ab/[(a + b)²(a + b + 1)]


MAP estimate for the Bernoulli distribution

Using the beta distribution as our prior, p(θ) = Beta(θ|a, b), the log likelihood plus log prior becomes

    LL(θ) = log p(D|θ) + log p(θ)
          = [N1 log θ + N0 log(1 − θ)] + [(a − 1) log θ + (b − 1) log(1 − θ)]

The MAP estimate is

    θ̂map = (N1 + a − 1) / (N1 + N0 + a + b − 2)

If we set a = b = 2, which weakly favors a value of θ near 0.5, the estimate becomes

    θ̂map = (N1 + 1) / (N1 + N0 + 2)

This is called add-one smoothing; it avoids the zero-count problem.
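
A minimal sketch of the MAP estimate versus the MLE (plain Python; function names are illustrative):

def bernoulli_mle(n_heads, n_tails):
    return n_heads / (n_heads + n_tails)

def bernoulli_map(n_heads, n_tails, a=2.0, b=2.0):
    # MAP estimate under a Beta(a, b) prior; a = b = 2 gives add-one smoothing
    return (n_heads + a - 1) / (n_heads + n_tails + a + b - 2)

# One toss, one head: the MLE says heads is certain, the MAP estimate does not.
print(bernoulli_mle(1, 0))   # 1.0
print(bernoulli_map(1, 0))   # (1 + 1) / (1 + 0 + 2) = 0.667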


Black swan paradox

The zero-count problem, and overfitting, is analogous to the black swan paradox.
It is used to illustrate the problem of induction: how to draw general conclusions about the future from specific observations from the past.
The solution to the paradox is to admit that induction is in general impossible. The best we can do is to make plausible guesses by combining the empirical data with prior knowledge.


Weight decay

Polynomial regression with too many degrees of freedom can result in overfitting. One solution is to reduce the degree of the polynomial.
A more general solution is to penalize the magnitude of the weights (regression coefficients).
We use a zero-mean Gaussian prior p(w). The MAP estimate is

    ŵmap = argmin_w NLL(w) + λ‖w‖₂²

where ‖w‖₂² = ∑_{d=1}^D wd². We penalize the magnitude of the weight vector w, rather than the bias term b.
This penalty is called ℓ2 regularization or weight decay.
The larger the value of λ, the more the parameters are penalized for being large (i.e., for deviating from the zero-mean prior), and thus the less flexible the model.


Ridge regression

In the case of linear regression, the weight decay penalization scheme is called ridge regression.
Consider polynomial regression, where the predictor has the form

    f(x; w) = ∑_{d=0}^D wd x^d = wᵀ[1, x, x², . . . , x^D]

Suppose we use a high-degree polynomial, say D = 14, even though we have a small dataset with just N = 21 examples.
MLE for the parameters will enable the model to fit the data very well, but the resulting function is very "wiggly", thus resulting in overfitting.
Increasing λ can reduce overfitting.


Ridge regression

[Figure: ridge regression illustration.]


Ridge regression
MAP estimation with a zero-mean Gaussian prior p(w) = N(w|0, τ²I):

    ŵmap = argmin_w (1/(2σ²))(y − Xw)ᵀ(y − Xw) + (1/(2τ²)) wᵀw
         = argmin_w RSS(w) + λ‖w‖₂²

where λ = σ²/τ² is proportional to the strength of the prior, and

    ‖w‖₂ = √( ∑_{d=1}^D |wd|² ) = √(wᵀw)

is the ℓ2 norm of the vector w.
We do not penalize the offset w0, since that only affects the global mean of the output, and does not contribute to the overfitting.

Ridge Regression

The MAP estimate corresponds to minimizing the penalized objective:

    J(w) = (y − Xw)ᵀ(y − Xw) + λ‖w‖₂²

where λ = σ²/τ² is the strength of the regularizer.
The derivative is given by

    ∇w J(w) = 2(XᵀXw − Xᵀy + λw)

Therefore,

    ŵmap = (XᵀX + λI_D)⁻¹Xᵀy = ( ∑_n xn xnᵀ + λI_D )⁻¹ ( ∑_n yn xn )
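
A minimal sketch of this closed form (NumPy assumed; the function name is illustrative). Note that, as written, the penalty λI_D also shrinks the bias term; to follow the earlier advice of leaving w0 unpenalized, one common option is to center the data first or to zero out the corresponding diagonal entry of the penalty:

import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge / MAP solution: (X^T X + lam * I)^{-1} X^T y
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)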


Example: MAP for Linear Regression

Maximum likelihood estimation. Let ε = 0.1.

    Ex. 1: ŵmle = [  1     −1/ε  ] [ 1  1 ] [ 1 ] = [ 1 ]
                  [ −1/ε   2/ε²  ] [ 0  ε ] [ 1 ]   [ 0 ]

    Ex. 2: ŵmle = [  1     −1/ε  ] [ 1  1 ] [ 1 + ε ] = [ 1 + ε ] = [ 1.1 ]
                  [ −1/ε   2/ε²  ] [ 0  ε ] [ 1     ]   [  −1   ]   [ −1  ]

Maximum a posteriori estimation. Let λ = 0.05.

    XᵀX + λI_D = [ 2 + λ    ε     ] = [ 2.05  0.1  ]
                 [   ε    ε² + λ  ]   [ 0.1   0.06 ]

    (XᵀX + λI_D)⁻¹ = [  0.531    −0.885  ]
                     [ −0.885   18.1416  ]

    Ex. 1: ŵmap = [  0.531    −0.885  ] [ 1  1 ] [ 1 ] = [ 0.9735 ]
                  [ −0.885   18.1416  ] [ 0  ε ] [ 1 ]   [ 0.0442 ]

    Ex. 2: ŵmap = [  0.531    −0.885  ] [ 1  1 ] [ 1 + ε ] = [  1.0265 ]
                  [ −0.885   18.1416  ] [ 0  ε ] [ 1     ]   [ −0.0442 ]
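
A minimal sketch reproducing these numbers (NumPy assumed):

import numpy as np

eps, lam = 0.1, 0.05
X = np.array([[1.0, 0.0],
              [1.0, eps]])
A = X.T @ X + lam * np.eye(2)

for y in (np.array([1.0, 1.0]),            # Ex. 1
          np.array([1.0 + eps, 1.0])):     # Ex. 2
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    w_map = np.linalg.solve(A, X.T @ y)
    print(w_mle, w_map)
# Ex. 1: w_mle ~ [1, 0],      w_map ~ [0.9735, 0.0442]
# Ex. 2: w_mle ~ [1.1, -1],   w_map ~ [1.0265, -0.0442]
# The small ridge penalty keeps the MAP weights stable under the perturbation eps.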
