
STAT732: Solutions for Homework 2

Due: Wednesday, Feb 14

1. (10 pt) Suppose we take one observation, X, from the discrete distribution,

x             −2          −1     0     1            2
Pr(X = x|θ)   (1 − θ)/4   θ/12   1/2   (3 − θ)/12   θ/4,        0 ≤ θ ≤ 1

Find an unbiased estimator of θ. Obtain the maximum likelihood estimator (MLE)
θ̂(x) of θ and show that it is not unique. Is any choice of MLE unbiased?

solutions:

(a) Let T(2) = 4 and T(x) = 0 when x ≠ 2. Then

E[T(X)] = 4 × θ/4 + 0 × ((1 − θ)/4 + θ/12 + 1/2 + (3 − θ)/12) = θ,

so T is an unbiased estimator of θ.
(b) Denote the MLE θ̂. Since there’s only one observation:

θ̂(−2) = arg max((1 − θ)/4) = 0
θ̂(−1) = arg max(θ/12) = 1
θ̂(0)  = arg max(1/2) = any c ∈ (0, 1)
θ̂(1)  = arg max((3 − θ)/12) = 0
θ̂(2)  = arg max(θ/4) = 1

Since θ̂(0) can take multiple values, the MLE is not unique, and every choice of MLE
is biased, since

E[θ̂] = 0 × (1 − θ)/4 + 1 × θ/12 + θ̂(0) × 1/2 + 0 × (3 − θ)/12 + 1 × θ/4 = θ/3 + θ̂(0)/2 ≠ θ

for all θ, whatever constant value θ̂(0) is chosen to be.
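As a quick numerical sanity check (not required for the solution), the following sketch estimates E[T(X)] and E[θ̂(X)] by Monte Carlo for a few values of θ; the function name and the tie-breaking choice θ̂(0) = 1/2 are my own arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=200_000):
    """Draw X from the table, then evaluate T(X) and one choice of MLE."""
    support = np.array([-2, -1, 0, 1, 2])
    probs = np.array([(1 - theta) / 4, theta / 12, 1 / 2, (3 - theta) / 12, theta / 4])
    x = rng.choice(support, size=n, p=probs)

    T = np.where(x == 2, 4.0, 0.0)               # unbiased estimator from part (a)
    mle = np.select([x == -2, x == -1, x == 0, x == 1, x == 2],
                    [0.0, 1.0, 0.5, 0.0, 1.0])   # MLE with the tie at x = 0 broken at 1/2
    return T.mean(), mle.mean()

for theta in [0.2, 0.5, 0.9]:
    t_bar, mle_bar = simulate(theta)
    print(f"theta={theta:.1f}  E[T]~{t_bar:.3f}  E[MLE]~{mle_bar:.3f}")

The first column should track θ closely, while the MLE column should sit near θ/3 + 1/4, confirming the bias.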

2. (10 pt) Bickel & Doksum pg 153, problem 2.3.8


solutions:

(a) It is enough to show that the negative log likelihood function ℓ(θ) = − log ∏_i f(x_i | θ)
is a strictly convex function of θ ∈ R^p. Since it's the sum of the negative log
likelihoods for each X_j, and a sum of strictly convex functions is strictly convex,
it's enough to consider a single observation, n = 1.
To show the negative log likelihood is strictly convex it is enough to show that
its Hessian, the matrix H_ij = ∂²ℓ(θ)/∂θ_i∂θ_j of second derivatives, is positive-definite
everywhere (except possibly at θ̂ = x). Let's compute the necessary derivatives:

ℓ(θ) = − log c_α + ( Σ_{i=1}^p (x_i − θ_i)² )^{α/2}

∂ℓ(θ)/∂θ_k = −α ( Σ_{i=1}^p (x_i − θ_i)² )^{(α−2)/2} (x_k − θ_k)

∂²ℓ(θ)/∂θ_k² = α ( Σ_{i=1}^p (x_i − θ_i)² )^{(α−2)/2}
                 + α(α − 2) ( Σ_{i=1}^p (x_i − θ_i)² )^{(α−4)/2} (x_k − θ_k)²
             = α ( Σ_{i=1}^p (x_i − θ_i)² )^{(α−4)/2} [ Σ_{i=1}^p (x_i − θ_i)² + (α − 2)(x_k − θ_k)² ]

∂²ℓ(θ)/∂θ_i∂θ_k = α ( Σ_{j=1}^p (x_j − θ_j)² )^{(α−4)/2} (α − 2)(x_i − θ_i)(x_k − θ_k),   i ≠ k

The Hessian H = α |x − θ|^{α−4} A is the product of a constant positive factor
α |x − θ|^{α−4} and a matrix A whose on- and off-diagonal entries are:

A_kk = Σ_{i=1}^p (x_i − θ_i)² + (α − 2)(x_k − θ_k)²
A_kj = (α − 2)(x_k − θ_k)(x_j − θ_j)

If we introduce the notation ∆ = (x − θ) ∈ R^p for the vector with components
∆_i = (x_i − θ_i), and I_p for the p × p identity matrix, we can write A in the form

A = |∆|² I_p + (α − 2)∆∆′

The matrix A satisfies A∆ = λ∆ with eigenvalue λ = |∆|²(α − 1), strictly
positive since α > 1 (except at ∆ = 0, i.e., θ = θ̂ = x, which is okay). The
other eigenvectors are orthogonal to ∆, all with eigenvalues λ′ = |∆|², which
are also strictly positive. Thus A is positive-definite and so is the Hessian,
H = α|∆|^{α−4} A.
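As an optional numerical illustration (a minimal sketch; the dimension, the value α = 1.5, and the random ∆ are all arbitrary choices), one can build A = |∆|² I_p + (α − 2)∆∆′ and confirm its eigenvalues are (α − 1)|∆|² and |∆|², all positive:

import numpy as np

rng = np.random.default_rng(1)
p, alpha = 4, 1.5                        # any alpha > 1 works; 1.5 is arbitrary
delta = rng.normal(size=p)               # Delta = x - theta, assumed nonzero

A = np.dot(delta, delta) * np.eye(p) + (alpha - 2) * np.outer(delta, delta)
eigvals = np.linalg.eigvalsh(A)          # A is symmetric, so eigvalsh applies

print("eigenvalues of A:", np.round(np.sort(eigvals), 6))
print("predicted       :", np.round(sorted([(alpha - 1) * delta @ delta] +
                                            [delta @ delta] * (p - 1)), 6))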

(b) First consider the case where α = 1 in dimension p = 1, with n = 2m even.
WLOG order the data x1 ≤ x2 ≤ · · · ≤ xn . The log likelihood function is given
by
ℓ(θ) = n log c(α) − Σ_i |x_i − θ|,
a continuous function whose derivative does not exist at the data points θ ∈ {xi }
and which elsewhere satisfies

dℓ(θ)/dθ = − Σ_i d|x_i − θ|/dθ
         = [ Σ_{x_i < θ} (−1) ] + [ Σ_{x_i > θ} (+1) ]
         = (the number of x_i > θ) minus (the number of x_i < θ)
(note that the derivative of |x − θ| is −1 on the interval θ ∈ (−∞, x), +1 on the
interval θ ∈ (x, ∞), and undefined at the point θ = x). Thus ℓ(θ) is increasing
when θ < xm , when more than half the {xi } exceed θ, and is decreasing when
θ > xm+1 , when fewer than half the {xi } exceed θ. In the interval xm < θ < xm+1
the derivative is zero, so ℓ(θ) is constant there and equal to its maximum value.
In case n = 2m − 1 is odd, the same argument shows that ℓ(θ) achieves a unique
maximum at the median xm .
In dimension p > 2 a similar argument holds, except that θ̂ now may be any value in
the “median rectangle” (or “median block”), where each of its components is a
median of the corresponding components of the x_i’s. A small numerical illustration follows.
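The sketch below (my own illustration, with made-up data and p = 1) evaluates ℓ(θ) = −Σ|x_i − θ| (dropping the constant) on a grid for an even sample and shows that it is flat and maximal exactly on the interval between the two middle order statistics:

import numpy as np

x = np.array([0.3, 1.1, 2.0, 4.7])          # n = 4, so m = 2; made-up data
grid = np.linspace(-1, 6, 1401)

# log likelihood up to the constant n*log c(alpha), for alpha = 1
loglik = -np.abs(grid[:, None] - x[None, :]).sum(axis=1)

best = grid[np.isclose(loglik, loglik.max())]
print("maximizers lie in about [%.2f, %.2f]" % (best.min(), best.max()))
print("middle order statistics:", np.sort(x)[1], np.sort(x)[2])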

3. (10 pt) Let (X1, · · · , Xn) be a random sample from the uniform distribution on the
interval (θ − 1/2, θ + 1/2), where θ ∈ R is unknown. Let X(j) be the j-th order statistic.
(a) Show that (X(1) + X(n))/2 is strongly consistent for θ, i.e., that
lim_{n→∞} (X(1) + X(n))/2 = θ a.s.
(b) Show that X̄_n := (X(1) + · · · + X(n))/n is L2 consistent.
solution:
(a) For any ε ∈ (0, 1),

P(|X(1) − (θ − 1/2)| > ε) = P(X(1) > ε + (θ − 1/2))
                          = [P(X1 > ε + (θ − 1/2))]^n
                          = (1 − ε)^n

and

P(|X(n) − (θ + 1/2)| > ε) = P(X(n) < −ε + (θ + 1/2))
                          = [P(X1 < −ε + (θ + 1/2))]^n
                          = (1 − ε)^n

Since Σ_{n=1}^∞ (1 − ε)^n < ∞, by the corollary of the Borel–Cantelli lemma we conclude
that lim_n X(1) = θ − 1/2 a.s. and lim_n X(n) = θ + 1/2 a.s. Hence lim_{n→∞} (X(1) +
X(n))/2 = θ a.s.
(b) X̄_n := (X(1) + · · · + X(n))/n = (X1 + · · · + Xn)/n. Since E(X1) = θ, E[X̄_n] = θ,
and

Var[X̄_n] = E[(X̄_n − θ)²] = (1/n) E[(X1 − θ)²] = 1/(12n) → 0 as n → ∞.

Therefore, X̄_n is L2 consistent.
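A quick simulation (illustrative only; the value θ = 0.7 and the replication counts are arbitrary) shows both behaviours: the midrange sits very close to θ even for moderate n, and Var(X̄_n) tracks 1/(12n).

import numpy as np

rng = np.random.default_rng(2)
theta = 0.7                                        # arbitrary true value for the demo

for n in [10, 100, 1000, 10000]:
    reps = 2000
    x = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, n))
    midrange = (x.min(axis=1) + x.max(axis=1)) / 2
    xbar = x.mean(axis=1)
    print(f"n={n:6d}  mean|midrange-theta|={np.mean(np.abs(midrange - theta)):.5f}  "
          f"Var(xbar)={xbar.var():.6f}  1/(12n)={1/(12*n):.6f}")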

4. (10 pt) Let {Xi} ∼ iid No(µ, σ²) be a random sample from the normal distribution with
unknown mean µ ∈ R and known variance σ² > 0. For fixed t ≠ 0, find the Uniform
Minimum-Variance Unbiased Estimator (UMVUE) of e^{tµ} and show that its variance
is larger than the Cramér-Rao lower bound. Show that the ratio of its variance to
the Cramér-Rao lower bound converges to 1 as n → ∞.
solutions:
(a) The sample mean X̄ is complete and sufficient for µ. Since E[e^{tX̄}] = e^{µt + σ²t²/(2n)},
the UMVUE of e^{tµ} is T(X̄) = e^{−σ²t²/(2n) + tX̄}.
The Fisher information is I(µ) = n/σ². Then the Cramér-Rao lower bound is
(d e^{tµ}/dµ)² / I(µ) = σ²t²e^{2tµ}/n. On the other hand,

Var(T) = e^{−σ²t²/n} E[e^{2tX̄}] − e^{2tµ} = (e^{σ²t²/n} − 1) e^{2tµ} > σ²t²e^{2tµ}/n,

the Cramér-Rao lower bound, since e^x − 1 > x for x > 0.
(b) The ratio of the variance of the UMVUE over the Cramér-Rao lower bound is
(e^{σ²t²/n} − 1)/(σ²t²/n), which converges to 1 as n → ∞, since lim_{x→0} (e^x − 1)/x = 1.
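The convergence of this ratio is easy to see numerically; the short sketch below (the values of σ and t are arbitrary) tabulates (e^{σ²t²/n} − 1)/(σ²t²/n) for increasing n.

import numpy as np

sigma, t = 2.0, 1.5                      # arbitrary illustrative values
for n in [1, 10, 100, 1000, 10000]:
    x = sigma**2 * t**2 / n              # the exponent sigma^2 t^2 / n
    ratio = np.expm1(x) / x              # Var(UMVUE) / CRLB = (e^x - 1)/x
    print(f"n={n:6d}  ratio={ratio:.6f}")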

5. (10 pt) Let X have density function f(x|θ), x ∈ R^d, and let θ have (proper) prior
density π(θ). Let δ^π(x) denote the Bayes estimate of θ under π(θ) for squared-error
loss (i.e., δ^π(x) := E[θ | X = x]) and suppose δ^π has finite Bayes risk r(π, δ^π) =
E[|δ^π(X) − θ|²]. Show that for any other estimator δ(x),

r(π, δ) − r(π, δ^π) = ∫ (δ(x) − δ^π(x))² f(x) dx

where f (x) is the marginal density of X. If f (x|θ) is normal No(θ, 1), consider the
collection of estimators of the form δ(X) := cX + d. Show that whenever 0 ≤ c < 1
these estimators are all proper Bayes estimators, and hence admissible [Hint: Find
a prior π for which δ π (X) = cX + d]. Show that if c > 1 the resulting estimator is
inadmissible.
solutions:

(a)

r(π, δ) − r(π, δ^π) = ∫∫ [(θ − δ(x))² − (θ − δ^π(x))²] f(x|θ) π(θ) dx dθ
                    = ∫∫ [(θ − δ(x))² − (θ − δ^π(x))²] π(θ|x) f(x) dθ dx
                    = ∫ f(x) [ ∫ ((θ − δ(x))² − (θ − δ^π(x))²) π(θ|x) dθ ] dx
                    = ∫ f(x) [ δ(x)² − δ^π(x)² + 2δ^π(x)² − 2δ(x)δ^π(x) ] dx
                    = ∫ (δ(x) − δ^π(x))² f(x) dx

(the fourth line uses E[θ | X = x] = δ^π(x) when expanding the squares).

(b) Take π(θ) = No(µ, τ²), a normal prior with mean µ and variance τ². Then

E[θ|x] = τ²/(τ² + 1) x + 1/(τ² + 1) µ.

Let c = τ²/(τ² + 1) and d = µ/(τ² + 1). Whenever 0 ≤ c < 1 we can solve for the prior
parameters, taking τ² = c/(1 − c) and µ = d/(1 − c), so cX + d is a Bayes estimator
with respect to the proper prior No(µ, τ²).
(c) When c > 1, MSE(cX + d) = R(θ, δ_c) = (cθ + d − θ)² + c² Var(X) > 1, while the
MSE of δ_0(X) = X is 1, so the estimator δ_0(X) = X is uniformly better than
cX + d. Similarly, when c = 1 and d ≠ 0, X is uniformly better than X + d.
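To see the inadmissibility claim concretely, the sketch below (a numerical illustration with arbitrary c and d) evaluates the exact risks R(θ, cX + d) = ((c − 1)θ + d)² + c² and R(θ, X) = 1 over a grid of θ:

import numpy as np

c, d = 1.3, -0.4                              # any c > 1 and any d; arbitrary choices
theta = np.linspace(-5, 5, 11)

risk_cxd = ((c - 1) * theta + d) ** 2 + c**2  # MSE of cX + d when X ~ No(theta, 1)
risk_x = np.ones_like(theta)                  # MSE of X itself

print("min risk of cX+d over grid:", risk_cxd.min())   # always >= c^2 > 1
print("risk of X:", risk_x[0])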

6. (10 pt) Bickel & Doksum pg 197, problem 3.2.1


solutions:

(a) One way to show this is to notice that it is a limiting case of homework problem
1.2.14 from the first homework set, taking the limit as τ 2 goes to infinity. Then
the posterior distribution of µ approaches a normal distribution with mean x̄ and
variance σ²/n. If we were “bad” and interpreted a frequentist confidence
interval for µ in a Bayesian way, we would be implicitly assuming that our prior
knowledge of µ was utterly nil—that all real values were equally likely in our
minds (a questionable assumption, since it is not a proper prior). Here is a
slightly more detailed development of the solution.
We know that if X_j ∼ iid No(θ, σ²), then the sufficient statistic X̄ ∼ No(θ, σ²/n).
So the likelihood function for θ is given by:

ℓ(θ) = f(x̄|θ) = (2πσ²/n)^{−1/2} exp( −(x̄ − θ)²/(2σ²/n) )
The posterior distribution of θ given X is given by:

π(θ|X) ∝ π(θ)f (X|θ)

 
π(θ|X) ∝ 1 · (2πσ²/n)^{−1/2} exp( −(x̄ − θ)²/(2σ²/n) )
This function, which came about because it was the pdf of x̄, is also the kernel of
a normal density for θ, and it is easy to see that the mean is x̄ and the variance
is σ²/n. So θ|X ∼ No(x̄, σ²/n).
Under squared error loss, the Bayes estimator is the posterior mean, which in
this case is x̄.
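As an informal check (a sketch under the flat improper prior π(θ) ∝ 1, with made-up data; σ, n, and the grid are arbitrary), a grid approximation of the posterior reproduces mean ≈ x̄ and variance ≈ σ²/n:

import numpy as np

rng = np.random.default_rng(3)
sigma, n = 2.0, 25
x = rng.normal(1.0, sigma, size=n)            # data from No(1, sigma^2); the true mean is arbitrary
xbar = x.mean()

theta = np.linspace(xbar - 5, xbar + 5, 20001)
dtheta = theta[1] - theta[0]
log_post = -n * (theta - xbar) ** 2 / (2 * sigma**2)   # flat prior: kernel depends on data only via x̄
post = np.exp(log_post - log_post.max())
post /= post.sum() * dtheta                             # normalize on the grid

mean = (theta * post).sum() * dtheta
var = ((theta - mean) ** 2 * post).sum() * dtheta
print(f"grid posterior mean {mean:.4f} vs xbar {xbar:.4f}")
print(f"grid posterior var  {var:.4f} vs sigma^2/n {sigma**2/n:.4f}")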

7. (10 pt) Bickel & Doksum pg 197, problem 3.2.4


solutions:
(a) We begin with λ = θ/(1 − θ). If we take the prior on λ to be the improper prior
π_λ(λ) = 1, we find the induced prior on θ as follows:

π_θ(θ) ∝ π_λ( θ/(1 − θ) ) · d/dθ [ θ/(1 − θ) ] = 1 · 1/(1 − θ)² = (1 − θ)^{−2}
Using this prior for θ, we find the posterior distribution of θ thus:

π(θ|X) ∝ π(θ) π(X|θ)
       = (1 − θ)^{−2} θ^S (1 − θ)^{n−S}
       = θ^S (1 − θ)^{n−S−2}

This is the kernel of a Be(S + 1, n − S − 1) density, so long as the conditions are
met for such a density to exist, which are that S + 1 > 0 and n − S − 1 > 0.
The first of these will always be true, but the second will only be true when
n − S > 1, i.e., when at least two “failures” are observed. (Should we have
n − S = 1 or n − S = 0, then the expression θ^S (1 − θ)^{n−S−2} would not have a
finite integral, so the posterior distribution on θ would be improper, a decided
no-no.)
Under the squared error loss function, the Bayes estimator is the posterior mean
of θ, which for a Be(S + 1, n − S − 1) density is known to be
(S + 1)/((S + 1) + (n − S − 1)) = (S + 1)/n.
• Note that some students calculated the Bayes rule for λ instead of for θ, which
I also marked correct. The Bayes rule for λ is (S + 1)/(n − S − 2); the posterior
exists when S < n − 1, and this mean is finite when S < n − 2.
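As a final sanity check (illustrative only; the counts n and S are arbitrary, with n − S > 1 so the posterior is proper), the posterior mean of θ can be evaluated on a grid and compared with (S + 1)/n:

import numpy as np

n, S = 12, 7                                    # arbitrary counts with n - S > 1
theta = np.linspace(1e-6, 1 - 1e-6, 200001)
dtheta = theta[1] - theta[0]

kernel = theta**S * (1 - theta) ** (n - S - 2)  # posterior kernel theta^S (1-theta)^(n-S-2)
post = kernel / (kernel.sum() * dtheta)         # normalized Be(S+1, n-S-1) density on the grid

print("grid posterior mean of theta:", (theta * post).sum() * dtheta)
print("(S + 1) / n                 :", (S + 1) / n)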
