
Assignment 1: Probability (Partial Solution)

AIGS/CSED515 Machine Learning


Instructor: Jungseul Ok
jungseul@postech.ac.kr

Due: 20:00, Sep 30, 2020

Remarks

• Group study and open discussion via the LMS board are encouraged; however, the
assignment you hand in must be your own work and hand-written.

• Submit a scanned copy of your answers via LMS as a single PDF file.

• Delayed submission incurs a score penalty: 5% off for a delay of 0–4 hours;
20% off for a delay of 4–24 hours; submissions delayed by more than 24 hours will not be accepted.

1. [20pt; marginalization] Consider the following bivariate distribution p(x, y) of two
   discrete random variables X and Y.

              Y = 1   Y = 2   Y = 3
      X = 1    0.01    0.05    0.10
      X = 2    0.02    0.10    0.05
      X = 3    0.03    0.05    0.03
      X = 4    0.10    0.07    0.05
      X = 5    0.10    0.20    0.04

Compute:

(a) The marginal distributions p(x) and p(y).
(b) The expectations E[X] and E[Y].
(c) The conditional distributions p(x | Y = 1) and p(y | X = 3).
(d) The conditional expectations E[X | Y = 1] and E[Y | X = 3].
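For a quick numerical cross-check of parts (a)-(d), a minimal NumPy sketch (illustrative
only, not a substitute for the hand-written computation) that evaluates these quantities
directly from the table:

    import numpy as np

    # Joint distribution p(x, y); rows are X = 1..5, columns are Y = 1..3.
    P = np.array([[0.01, 0.05, 0.10],
                  [0.02, 0.10, 0.05],
                  [0.03, 0.05, 0.03],
                  [0.10, 0.07, 0.05],
                  [0.10, 0.20, 0.04]])
    x_vals = np.arange(1, 6)
    y_vals = np.arange(1, 4)

    p_x = P.sum(axis=1)                 # marginal p(x)
    p_y = P.sum(axis=0)                 # marginal p(y)
    E_x = (x_vals * p_x).sum()          # E[X]
    E_y = (y_vals * p_y).sum()          # E[Y]

    p_x_given_y1 = P[:, 0] / p_y[0]     # p(x | Y = 1)
    p_y_given_x3 = P[2, :] / p_x[2]     # p(y | X = 3)
    E_x_given_y1 = (x_vals * p_x_given_y1).sum()   # E[X | Y = 1]
    E_y_given_x3 = (y_vals * p_y_given_x3).sum()   # E[Y | X = 3]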

2. [10pt; Bayes’ theorem; from Murphy’s book] After your yearly checkup, the doctor
has bad news and good news. The bad news is that you tested positive for a serious
disease, and that the test is 99% accurate (i.e., the probability of testing positive given
that you have the disease is 0.99, as is the probability of testing negative given that
you don’t have the disease). The good news is that this is a rare disease, striking only
one in 10,000 people. What are the chances that you actually have the disease? (Show
your calculations as well as giving the final result.)
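As an arithmetic sanity check, a short Python sketch of the Bayes'-theorem computation
(illustrative only; the hand-in should show the calculation by hand):

    # P(disease) = 1/10,000; the test is 99% accurate in both directions.
    p_disease = 1e-4
    p_pos_given_disease = 0.99
    p_pos_given_healthy = 1 - 0.99      # false-positive rate

    p_pos = (p_pos_given_disease * p_disease
             + p_pos_given_healthy * (1 - p_disease))
    p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
    print(p_disease_given_pos)          # ≈ 0.0098, i.e., below 1%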

3. [20pt] Consider two random variables X, Y with joint distribution p(x, y) and finite
supports X and Y. Prove:

(a) The expected value of the conditional expected value of X given Y is the same
as the expected value of X, i.e.,

E[X] = E[E[X|Y ]] .

sol) From the definition of E[E[X | Y]], we have

    E[E[X | Y]] = Σ_y E[X | Y = y] P[Y = y]
                = Σ_y Σ_x x · P[X = x | Y = y] P[Y = y]
                = Σ_y Σ_x x · P[X = x, Y = y]
                = Σ_y Σ_x x · p(x, y)
                = Σ_x x ( Σ_y p(x, y) )
                = Σ_x x · P[X = x] = E[X] ,

which concludes the proof.
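As a numerical sanity check of this identity, a small sketch over a randomly generated
finite joint distribution (the 4×3 support and the seed are arbitrary choices made only
for illustration):

    import numpy as np

    # Numerical check of E[E[X|Y]] = E[X] on a random finite joint distribution.
    rng = np.random.default_rng(0)
    P = rng.random((4, 3))
    P /= P.sum()                                   # joint p(x, y) over a 4x3 support
    x_vals = np.arange(1, 5)

    E_x = (x_vals * P.sum(axis=1)).sum()           # E[X]
    p_y = P.sum(axis=0)                            # marginal p(y)
    E_x_given_y = (x_vals[:, None] * P / p_y).sum(axis=0)   # E[X | Y = y] for each y
    print(np.isclose(E_x, (E_x_given_y * p_y).sum()))       # True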


(b) The covariance can be computed as follows:

cov(X, Y ) = E[XY ] − E[X] E[Y ] .

sol) From the definition of cov(X, Y), it follows that

    cov(X, Y) = E[(X − E[X])(Y − E[Y])]
              = E[XY − X E[Y] − E[X] Y + E[X] E[Y]]
              = E[XY] − E[X] E[Y] − E[X] E[Y] + E[X] E[Y]
              = E[XY] − E[X] E[Y] ,

where the third equality uses linearity of expectation and the fact that E[X], E[Y], and
E[X] E[Y] are constants. This concludes the proof.
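The identity can likewise be checked numerically; a minimal sketch on synthetic, correlated
samples (the data-generating choices are purely illustrative):

    import numpy as np

    # Check cov(X, Y) = E[XY] - E[X] E[Y] on synthetic samples.
    rng = np.random.default_rng(1)
    x = rng.integers(0, 10, size=100_000).astype(float)
    y = 2 * x + rng.normal(size=x.size)     # y is correlated with x
    lhs = np.cov(x, y, bias=True)[0, 1]     # sample covariance (normalized by N)
    rhs = (x * y).mean() - x.mean() * y.mean()
    print(np.isclose(lhs, rhs))             # True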

4. [40pt; Bayes' theorem; from Murphy's book] Consider a data set D = {x_n}_{n=1,...,N},
   where each x_n ∈ {0, 1} is independently drawn from a Bernoulli distribution with mean θ,
   i.e., p(x_n = 1 | θ) = θ. Denote by x̄ the sample mean, i.e., x̄ = (1/N) Σ_{n=1}^N x_n,
   and answer the following questions:

(a) Obtain the maximum likelihood estimator θ̂MLE of θ for given D.

sol)
Let L(θ) be the log-likelihood:

    L(θ) = log ( Π_{n=1}^N θ^{x_n} (1 − θ)^{1 − x_n} )
         = ( Σ_{n=1}^N x_n ) log θ + ( N − Σ_{n=1}^N x_n ) log(1 − θ) .

By setting the derivative to zero, i.e.,

    ∂L(θ)/∂θ = (1/θ) Σ_{n=1}^N x_n − (1/(1 − θ)) ( N − Σ_{n=1}^N x_n ) = 0 ,

we obtain

    θ̂MLE = (1/N) Σ_{n=1}^N x_n = x̄ .
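A brief numerical illustration (the simulated data and the grid are assumptions made only
for this sketch) that the sample mean indeed maximizes the Bernoulli log-likelihood:

    import numpy as np

    # Compare the grid maximizer of the log-likelihood with the sample mean.
    rng = np.random.default_rng(2)
    x = rng.binomial(1, 0.3, size=1000)          # Bernoulli data, true theta = 0.3

    thetas = np.linspace(1e-3, 1 - 1e-3, 999)    # grid over (0, 1)
    loglik = x.sum() * np.log(thetas) + (x.size - x.sum()) * np.log(1 - thetas)
    print(thetas[loglik.argmax()], x.mean())     # both ≈ x̄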

(b) Compute (i) the bias E[θ̂MLE − θ], (ii) the variance E[(θ̂MLE − E[θ̂MLE])²], and (iii) the
mean squared error E[(θ̂MLE − θ)²] of the maximum likelihood estimator θ̂MLE for given θ.
sol)
(i) The bias of θ̂MLE can be computed as follows:

    E[θ̂MLE − θ] = E[ (1/N) Σ_{n=1}^N x_n ] − θ = θ − θ = 0 .

(ii) To obtain the variance of θ̂MLE, we first compute the variance of an individual
observation:

    E[(x_n − θ)²] = E[x_n² − 2 θ x_n + θ²]
                  = E[x_n − 2 θ x_n + θ²]
                  = θ − 2θ² + θ² = θ(1 − θ) ,

where the second equality is from the fact that x_n is binary, i.e., x_n² = x_n.

Therefore, the variance of θ̂MLE is obtained as follows:

    E[(θ̂MLE − E[θ̂MLE])²] = E[(θ̂MLE − θ)²]
                          = E[ ( (1/N) Σ_{n=1}^N x_n − θ )² ]
                          = E[ ( (1/N) Σ_{n=1}^N (x_n − θ) )² ]
                          = (1/N²) E[ Σ_{n=1}^N Σ_{m=1}^N (x_n − θ)(x_m − θ) ]
                          = (1/N²) E[ Σ_{n=1}^N (x_n − θ)² ] = θ(1 − θ)/N ,

where the second-to-last equality follows from the facts that different x_n's are independent
and E[x_n − θ] = 0, and the last one from the computation of the individual variance above.

(iii) Since the bias is zero, the mean squared error E[(θ̂MLE − θ)²] equals the variance:

    E[(θ̂MLE − θ)²] = θ(1 − θ)/N .
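A Monte Carlo sketch (with illustrative values θ = 0.3 and N = 50) checking the bias,
variance, and MSE formulas above:

    import numpy as np

    # Monte Carlo check of the bias, variance, and MSE of the MLE (sample mean).
    rng = np.random.default_rng(3)
    theta, N, trials = 0.3, 50, 200_000
    theta_hat = rng.binomial(N, theta, size=trials) / N   # MLE over many data sets

    print(theta_hat.mean() - theta)           # bias ≈ 0
    print(theta_hat.var())                    # ≈ theta * (1 - theta) / N = 0.0042
    print(((theta_hat - theta) ** 2).mean())  # MSE ≈ variance, since the bias is 0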
(c) Recall that a random variable y ∈ [0, 1] from Beta(α, β) with α, β > 0 has a
probability density function of the following form:

    p(y | α, β) = ( Γ(α + β) / (Γ(α) Γ(β)) ) y^{α−1} (1 − y)^{β−1}
                ∝ y^{α−1} (1 − y)^{β−1} ,

where Γ(α) = ∫₀^∞ z^{α−1} e^{−z} dz, and thus the expected value of y is α/(α + β). Assume
that θ is drawn from Beta(v, v) for a given v > 0. Obtain the Bayes estimator θ̂BE = E[θ | D, v].
sol)
To obtain the Bayes estimator, we write the posterior as follows:

    p(θ | D, v) ∝ p(D | θ) p(θ | v)
                ∝ ( Π_{n=1}^N θ^{x_n} (1 − θ)^{1 − x_n} ) θ^{v−1} (1 − θ)^{v−1}
                = θ^{v−1+N x̄} (1 − θ)^{v−1+N−N x̄} ,

which is a Beta distribution with (α, β) = (v + N x̄, v + N(1 − x̄)). Recalling that the
expected value of a random variable from Beta(α, β) is α/(α + β), we have

    θ̂BE = (N x̄ + v) / (N + 2v) .
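A minimal sketch of the resulting estimator (the helper name bayes_estimate and the sample
data are illustrative assumptions):

    import numpy as np

    # Posterior-mean (Bayes) estimator under a Beta(v, v) prior, as derived above.
    def bayes_estimate(x, v):
        # Posterior is Beta(v + sum(x), v + N - sum(x)); return its mean.
        n, s = len(x), x.sum()
        return (s + v) / (n + 2 * v)

    x = np.random.default_rng(4).binomial(1, 0.3, size=100)
    print(bayes_estimate(x, v=1.0), x.mean())   # Bayes estimate vs. the MLE x̄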

(d) Compute (i) the bias E[θ̂BE − θ], (ii) the variance E[(θ̂BE − E[θ̂BE])²], and (iii) the mean
squared error E[(θ̂BE − θ)²] of the Bayes estimator θ̂BE for given θ.
sol)
(i) The bias of θ̂BE can be computed as follows:

    E[θ̂BE − θ] = E[ (N x̄ + v)/(N + 2v) − θ ]
                = (N θ + v)/(N + 2v) − θ = v(1 − 2θ)/(N + 2v) .

(ii) Recall the variance of the MLE:

    E[ ( (1/N) Σ_{n=1}^N x_n − θ )² ] = θ(1 − θ)/N .

Then, the variance of θ̂BE is obtained as follows:

    E[(θ̂BE − E[θ̂BE])²] = E[ ( (N x̄ + v)/(N + 2v) − (N θ + v)/(N + 2v) )² ]
                        = ( N/(N + 2v) )² E[(x̄ − θ)²]
                        = ( N/(N + 2v) )² · θ(1 − θ)/N
                        = N θ(1 − θ)/(N + 2v)² .

(iii) The mean squared error E[(θ̂BE − θ)²] is given as follows:

    E[(θ̂BE − θ)²] = E[ ( (θ̂BE − E[θ̂BE]) + (E[θ̂BE] − θ) )² ]
                   = E[(θ̂BE − E[θ̂BE])²] + (E[θ̂BE] − θ)² + 2 E[θ̂BE − E[θ̂BE]] (E[θ̂BE] − θ)
                   = E[(θ̂BE − E[θ̂BE])²] + (E[θ̂BE] − θ)²          (variance plus squared bias)
                   = N θ(1 − θ)/(N + 2v)² + ( v(1 − 2θ)/(N + 2v) )²
                   = ( N θ(1 − θ) + v²(1 − 2θ)² ) / (N + 2v)² ,

where the cross term vanishes because E[θ̂BE − E[θ̂BE]] = 0 and E[θ̂BE] − θ is a constant.
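A Monte Carlo sketch (illustrative values θ = 0.3, N = 50, v = 2) checking the three
formulas above:

    import numpy as np

    # Monte Carlo check of the bias, variance, and MSE of the Bayes estimator.
    rng = np.random.default_rng(5)
    theta, N, v, trials = 0.3, 50, 2.0, 200_000
    x_bar = rng.binomial(N, theta, size=trials) / N
    theta_be = (N * x_bar + v) / (N + 2 * v)

    print(theta_be.mean() - theta)            # ≈ v*(1 - 2*theta)/(N + 2*v) ≈ 0.0148
    print(theta_be.var())                     # ≈ N*theta*(1 - theta)/(N + 2*v)**2 ≈ 0.0036
    print(((theta_be - theta) ** 2).mean())   # ≈ variance + bias**2 ≈ 0.0038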

(e) (i) Show that the maximum likelihood estimator θ̂MLE is a special case of the
Bayes estimator θ̂BE with a particular choice of v, and (ii) discuss the role of
hyper-parameter v in the Bayes estimator.
sol)

(i) As v → 0, the Bayes estimator converges to the maximum likelihood estimator:
θ̂BE = (N x̄ + v)/(N + 2v) → x̄ = θ̂MLE.
(ii) The hyper-parameter v controls the strength of the prior belief that the true parameter
θ is around 1/2: θ̂BE is a weighted average of x̄ and 1/2 with weights N and 2v, so as v
increases, the prior distribution of θ concentrates around 1/2 and the estimate is shrunk
more strongly toward 1/2, whereas for large N the data dominate and θ̂BE approaches x̄.
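A short illustrative sketch of this shrinkage effect (the data set and the values of v are
arbitrary choices for demonstration):

    import numpy as np

    # Larger v shrinks the estimate toward 1/2; v = 0 recovers the MLE x̄.
    x = np.random.default_rng(6).binomial(1, 0.2, size=20)
    for v in [0.0, 1.0, 10.0, 100.0]:
        est = (x.sum() + v) / (len(x) + 2 * v)
        print(v, round(est, 3))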
