Stephen G. Walker
arXiv:2106.06597v1 [math.ST] 11 Jun 2021
Department of Mathematics
University of Texas at Austin, USA
e-mail: s.g.walker@math.utexas.edu
Abstract
One important aspect of statistical inference is quantifying the uncertainty in statistics; for example, the sampling distribution of the maximum likelihood estimator arising from a model and data. If approximations are required, it is the asymptotic normal distribution which is often, if not always, used. In this paper we show that if the log–likelihood is strictly concave in the parameter for all data sets, then an improved asymptotic distribution is available. The density estimate has similar properties to a second order Edgeworth expansion, which uses up to three derivatives of the log–likelihood (see [8] and [5]); whereas we obtain this using only one derivative. It is the concavity of the log–likelihood which facilitates this. It is also clear how the asymptotic normal distribution is recovered from this new asymptotic distribution.
Key words and phrases: Asymptotic distribution; Central limit theorem; Exponential family; Weighted likelihood bootstrap.

1 Introduction
Consider the family of density functions $f(x;\theta)$, with respect to some dominating measure, which will be either the counting measure or the Lebesgue measure. Here $x \in X$ and $\theta \in \Theta \subset \mathbb{R}$. We write
$$l(x;\theta) = -\log f(x;\theta),$$
so that $l'(x;\theta)$ is the negative of the score function, and assume that $l'(x;\theta) = \partial l(x;\theta)/\partial\theta$ exists for all $\theta$ and $x$, and that $l(x;\theta)$ is strictly convex in $\theta$ for all $x$; i.e. for all $\theta \neq \theta_0$ it holds that
$$l(x;\theta) > l(x;\theta_0) + (\theta - \theta_0)\, l'(x;\theta_0).$$
An example of such is the exponential family; see, for example, [6]. So, for some functions $c(x)$ and $t(x)$, and with $b(\theta)$ the cumulant function,
$$f(x;\theta) = c(x)\, \exp\{\theta\, t(x) - b(\theta)\}.$$
(c) For all $x \in A$ the density $f(x;\theta)$ is three times differentiable with respect to $\theta$, and the third derivative is continuous in $\theta$.

(f) For $\theta^* \in \Theta$ there exist $c > 0$ and a function $M(x)$ such that $|l'''(x;\theta)| \leq M(x)$ for all $x \in A$ and $|\theta - \theta^*| < c$, and $E_{\theta^*}[M(x)] < \infty$.
$$(1.1)\qquad \widehat F^{(N)}_{\hat\theta}(z) = \Phi\left(\sqrt{n}\,(z - \hat\theta)\sqrt{I(\hat\theta)}\right)$$
as the estimator of the distribution function for the mle. Here the superscript $N$ refers to the normal approximation.
In Section 2 we provide the new asymptotic distributions for the mle under the concavity condition. Section 3 then presents three illustrations, and Section 4 uses the same technique to find the exact distribution for the weighted likelihood bootstrap sampler. Finally, Section 5 concludes with some ideas for future work and considers the multivariate case.
Define
$$T_n(z) = \frac{1}{n}\sum_{i=1}^n l'(x_i;z), \qquad D(z,\theta^*) = E_{\theta^*}[l'(x;z)], \qquad V(z,\theta^*) = E_{\theta^*}[l'(x;z)^2],$$
and note that $T_n(z) = -S_n(z)/n$, where $S_n(z)$ is the usual score function. The asymptotic normality of $T_n(z)$, for each $z \in \Theta$, implies
$$A_n(z) = \sqrt{n}\,\frac{T_n(z) - D(z,\theta^*)}{\sqrt{V(z,\theta^*) - D^2(z,\theta^*)}} \;\to_d\; \mathrm{N}(0,1).$$
Theorem 2.1. Under Assumptions (a), (b), (d) and (e), combined with $l(x;\theta)$ being strictly convex in $\theta$ for all $x$, it holds that
$$(2.1)\qquad \widehat F_{\hat\theta}(z) = \Phi\left(\sqrt{n}\,\frac{D(z,\theta^*)}{\sqrt{V(z,\theta^*) - D^2(z,\theta^*)}}\right).$$
Proof. Since $l(x;\theta)$ is strictly convex in $\theta$ for all data sets, $l'(x;\cdot)$ is increasing, and so $\hat\theta \leq z$ precisely when the derivative of the objective function at $z$ is non-negative; that is,
$$P\big(\hat\theta \leq z\big) = P\big(T_n(z) \geq 0\big).$$
Hence,
$$P\big(\hat\theta \leq z\big) = P\left(A_n(z) \geq -\sqrt{n}\,\frac{D(z,\theta^*)}{\sqrt{V(z,\theta^*) - D^2(z,\theta^*)}}\right).$$
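As a numerical check of this identity, the following minimal sketch (assuming the exponential model $f(x;\theta) = \theta e^{-x\theta}$ of Section 3.1, with illustrative choices of $n$, $\theta^*$, $z$ and seed) confirms that the events $\{\hat\theta \leq z\}$ and $\{T_n(z) \geq 0\}$ always coincide:

```python
import numpy as np

# Check that {theta_hat <= z} = {T_n(z) >= 0} for the exponential model
# f(x; theta) = theta * exp(-x * theta), where l(x; theta) = x*theta - log(theta)
# and l'(x; theta) = x - 1/theta.
rng = np.random.default_rng(0)
n, theta_star, z = 10, 2.0, 2.5          # assumed values for illustration

reps, agree = 10_000, 0
for _ in range(reps):
    x = rng.exponential(1.0 / theta_star, size=n)
    mle = 1.0 / x.mean()                 # maximum likelihood estimator
    Tn = np.mean(x) - 1.0 / z            # T_n(z) = n^{-1} sum_i l'(x_i; z)
    agree += (mle <= z) == (Tn >= 0)

print(agree / reps)                      # prints 1.0: the events coincide
```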
We can see clearly how to get (1.1) from (2.1); it requires the approximations
$$D(z,\theta^*) \approx D(\theta^*,\theta^*) + (z-\theta^*)\,\frac{\partial D}{\partial z}(\theta^*,\theta^*) = (z-\theta^*)\, I(\theta^*),$$
and $V(z,\theta^*) - D^2(z,\theta^*) \approx V(\theta^*,\theta^*)$, noting that $D(\theta^*,\theta^*) = 0$ and $\partial D(\theta^*,\theta^*)/\partial z = V(\theta^*,\theta^*) = I(\theta^*)$. This involves some rather loose approximations and suggests the normal approximation should not necessarily work well for $z$ away from $\theta^*$. Indeed we see this phenomenon in an illustration which follows.
Before this we see how (2.1) is comparable to an Edgeworth expansion. The following standard expansion is to be found in Chapter 16 of [3]:
$$P\left(\sqrt{n}\,(\hat\theta - \theta^*)\sqrt{I(\theta^*)} \leq x\right) = \Phi(x) + \phi(x)\,(a + b x^2)/\sqrt{n},$$
where $a$ and $b$ use up to the third derivatives of $l(x;\theta)$, and are based on expectations with respect to $f(x;\theta^*)$.
where
$$c = \frac{\partial^2 D}{\partial z^2}(\theta^*,\theta^*)\, V(\theta^*,\theta^*)^{-3/2} - \frac{\partial V}{\partial z}(\theta^*,\theta^*)\, V(\theta^*,\theta^*)^{-1}.$$
Proof. The proof of this uses
$$D\left(\theta^* + \frac{d}{\sqrt{n}},\, \theta^*\right) = \frac{d}{\sqrt{n}}\,\frac{\partial D}{\partial z}(\theta^*,\theta^*) + \frac{1}{2}\,\frac{d^2}{n}\,\frac{\partial^2 D}{\partial z^2}(\theta^*,\theta^*) + O(n^{-3/2})$$
and
$$V\left(\theta^* + \frac{d}{\sqrt{n}},\, \theta^*\right) = V(\theta^*,\theta^*) + \frac{d}{\sqrt{n}}\,\frac{\partial V}{\partial z}(\theta^*,\theta^*) + O(n^{-1}).$$
3 Illustrations
3.1 Exponential family
Consider the exponential family with functions $t(x)$ and $b(\theta)$, so that
$$D(\theta,\theta^*) = b'(\theta) - b'(\theta^*) \quad \text{and} \quad V(\theta,\theta^*) = b''(\theta^*) + \big(b'(\theta) - b'(\theta^*)\big)^2.$$
Therefore,
$$\widehat F_{\hat\theta}(z) = \Phi\left(\sqrt{n}\,\frac{b'(z) - b'(\theta^*)}{\sqrt{b''(\theta^*)}}\right).$$
On the other hand, the asymptotic normal distribution is given by
$$\widehat F^{(N)}_{\hat\theta}(z) \approx \Phi\left(\sqrt{n}\,(z - \theta^*)\sqrt{b''(\theta^*)}\right).$$
In particular, suppose $f(x;\theta) = \theta\, e^{-x\theta}$, with $x > 0$ and $\theta > 0$. Then $b(\theta) = -\log\theta$, so $b'(\theta) = -1/\theta$ and $b''(\theta) = 1/\theta^2$.
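The three distribution functions compared in Figure 1 can be computed directly for this model; the following is a minimal sketch (the values $n = 10$ and $\theta^* = 2$ are illustrative assumptions), using the facts that $\hat\theta = 1/\bar{x}$ and that $\sum_i x_i$ has a gamma distribution with shape $n$ and scale $1/\theta^*$:

```python
import numpy as np
from scipy.stats import norm, gamma

n, ts = 10, 2.0                          # assumed sample size and theta*
z = np.linspace(0.5, 5.0, 200)

# New asymptotic distribution (2.1): D = b'(z) - b'(ts), V - D^2 = b''(ts)
F_new = norm.cdf(np.sqrt(n) * (-1.0/z + 1.0/ts) / np.sqrt(1.0/ts**2))

# Usual normal approximation: Phi(sqrt(n) * (z - ts) * sqrt(b''(ts)))
F_normal = norm.cdf(np.sqrt(n) * (z - ts) * np.sqrt(1.0/ts**2))

# Exact: theta_hat = 1/xbar and sum_i x_i ~ Gamma(n, scale=1/ts),
# so P(theta_hat <= z) = P(sum_i x_i >= n/z)
F_exact = gamma.sf(n / z, a=n, scale=1.0/ts)
```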
Figure 1: (i) Dotted line: $F^*_{\hat\theta}(z)$; (ii) solid line: $\widehat F_{\hat\theta}(z)$; (iii) dashed line: $\widehat F^{(N)}_{\hat\theta}(z)$.
For this model the normal approximation becomes
$$\widehat F^{(N)}_{\hat\theta}(z) \approx \Phi\left(\sqrt{n}\,\frac{z - \theta^*}{\theta^*}\right),$$
which is the usual asymptotic normal distribution. The true distribution for $\hat\theta$ is available in this case, since $\hat\theta = 1/\bar{x}$ and $\sum_{i=1}^n x_i \sim \mathrm{Gamma}(n,\theta^*)$.

Consider now the model with density
$$f(x;\theta) = \frac{\theta\, x^{\theta-1}}{(1 + x^\theta)^2}, \qquad x > 0,$$
for which
$$l'(x;\theta) = \frac{2\, x^\theta \log x}{1 + x^\theta} - \log x - \frac{1}{\theta}.$$
Figure 2: (i) Solid line: $F^*_{\hat\theta}(z)$; (ii) dashed line: $\widehat F_{\hat\theta}(z)$.
The aim here is to compare the true distribution of $\hat\theta$, i.e. $F^*_{\hat\theta}(z)$, based on a sample of size $n = 10$, with the estimate given by (2.1). We obtain $F^*_{\hat\theta}(z)$ by simulating samples of size 10 with a true $\theta^* = 2$. Repeating this multiple times and maximizing the likelihood each time yields a sample of mle's from which we construct the empirical distribution. On the other hand, we compute $\widehat F_{\hat\theta}(z)$ by estimating $D(z,\theta^*)$ and $V(z,\theta^*)$ arbitrarily accurately using Monte Carlo methods. The two distributions are plotted in Fig. 2; the solid line is the true distribution while the dashed line is (2.1). As can be seen, they are remarkably close for a sample of size $n = 10$.
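A minimal sketch of this Monte Carlo computation follows (inverse-CDF sampling is used, since the model distribution function is $x^\theta/(1+x^\theta)$; the Monte Carlo size and seed are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, ts = 10, 2.0                          # as in the text: n = 10, theta* = 2

def lprime(x, th):
    # l'(x; theta) for this model
    return 2.0 * x**th * np.log(x) / (1.0 + x**th) - np.log(x) - 1.0/th

# Draws from f(x; theta*): the distribution function is x^theta/(1 + x^theta),
# so inverse-CDF sampling gives x = (u/(1-u))^(1/theta)
u = rng.uniform(size=1_000_000)
x = (u / (1.0 - u))**(1.0 / ts)

def F_hat(z):
    # Monte Carlo estimates of D(z, theta*) and V(z, theta*), plugged into (2.1)
    g = lprime(x, z)
    D, V = g.mean(), np.mean(g**2)
    return norm.cdf(np.sqrt(n) * D / np.sqrt(V - D**2))
```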
$$T = \sqrt{n}\,\frac{D(\hat\theta, 0)}{\sqrt{V(\hat\theta, 0) - D^2(\hat\theta, 0)}}.$$
Figure 3: Density histograms of $T$ (left panel) and $T_N$ (right panel).
4 Weighted likelihood bootstrap

The weighted likelihood bootstrap generates a posterior sample $\tilde\theta$ by minimizing
$$l_w(\theta) = \sum_{i=1}^n w_i\, l(x_i;\theta),$$
where $l(x;\theta) = -\log f(x;\theta)$, and the $w = (w_i)_{i=1:n}$ are from a Dirichlet distribution with all parameters set to 1; i.e.
$$p(w) \propto \mathbf{1}\left(w_i \geq 0,\; \sum_{i=1}^n w_i = 1\right).$$
Equivalently, we can write $w_i = v_i / \sum_{j=1}^n v_j$, where the $(v_i)$ are independent and identically distributed as standard exponential, and minimize
$$l_v(\theta) = \sum_{i=1}^n v_i\, l(x_i;\theta).$$
Note that the randomness is now generated by the weights rather than the data.
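For illustration, a minimal sketch of this sampler for the exponential model $f(x;\theta) = \theta e^{-x\theta}$, where the minimizer of $l_v(\theta) = \sum_i v_i (x_i\theta - \log\theta)$ is available in closed form as $\sum_i v_i / \sum_i v_i x_i$ (the data and seed below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(0.5, size=10)        # assumed data from f(x; theta*), theta* = 2

# Weighted likelihood bootstrap: l_v(theta) = sum_i v_i (x_i*theta - log theta)
# is minimized at theta = sum_i v_i / sum_i v_i x_i
samples = np.empty(5_000)
for b in range(samples.size):
    v = rng.exponential(1.0, size=x.size)    # iid standard exponential weights
    samples[b] = v.sum() / (v @ x)

# `samples` approximates the weighted likelihood bootstrap posterior for theta
```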
The aim is to find $F(z) = P(\tilde\theta \leq z)$. This result relies partly on knowing the distribution of sums of independent exponentials; e.g.
$$S = \sum_{i=1}^n v_i\, \psi_i$$
for fixed constants $(\psi_i)$.
As with the mle, the convexity of $l(x;\theta)$ implies that $\tilde\theta \leq z$ if and only if the derivative of $l_v$ at $z$ is non-negative. Hence, we are interested in the distribution of
$$S(z) = \sum_{i=1}^n w_i\, \gamma_i(z),$$
and in particular
$$F(z) = P(S(z) \geq 0).$$
Since we are only interested in the probability of $S(z)$ being positive, we can represent the $(w_i)$ without their normalizing constant $\sum_{i=1:n} v_i$, and so we can take them as independent and identically distributed standard exponential random variables, $(v_i)$.
Now let us arrange $S(z) = S_1(z) - S_2(z)$, where
$$S_1(z) = \sum_{\gamma_i(z) > 0} v_i\, \gamma_i(z) \quad \text{and} \quad S_2(z) = \sum_{\gamma_i(z) < 0} v_i\, |\gamma_i(z)|,$$
and
$$\gamma_i(z) = l'(x_i; z),$$
where $l'$ denotes differentiation with respect to $\theta$. If we now arrange the labels so that $\gamma_i(z) > 0$ for $i = 1,\ldots,m$, and $\gamma_i(z) < 0$ for $i = m+1,\ldots,n$, for some $m \in \{0,\ldots,n\}$, then define $\lambda_i(z) = 1/|\gamma_i(z)|$. We can assume all the $\lambda_i$ are mutually distinct, since the $(x_i)$ are continuous random variables.
The density function for $S_1(z)$ is
$$f_1(t;\, \gamma_1(z),\ldots,\gamma_m(z)) = \left[\prod_{i=1}^m \lambda_i(z)\right] \sum_{i=1}^m q_{1i}(z)\, e^{-\lambda_i(z)\, t},$$
where, for $i = 1,\ldots,m$,
$$q_{1i}(z) = \prod_{k=1:m,\; k \neq i} \frac{1}{\lambda_k(z) - \lambda_i(z)}.$$
Similarly, the density function for $S_2(z)$ is of the same form with the rates $\lambda_{m+1}(z),\ldots,\lambda_n(z)$, where, for $i = m+1,\ldots,n$,
$$q_{2i}(z) = \prod_{k=m+1:n,\; k \neq i} \frac{1}{\lambda_k(z) - \lambda_i(z)}.$$
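Combining these, $F(z) = P(S_1(z) \geq S_2(z))$ follows by integrating the survival function of $S_1(z)$, which is also a mixture of exponential terms, against the density of $S_2(z)$. A minimal sketch of this computation (the function name is ours, and no attempt is made to control the numerical instability of the $q$ products for larger $n$):

```python
import numpy as np

def wlb_cdf_exact(gamma):
    # Exact F(z) = P(sum_i v_i * gamma_i >= 0) for iid standard exponential v_i,
    # with gamma_i = l'(x_i; z) and the |gamma_i| assumed mutually distinct.
    lam = 1.0 / gamma[gamma > 0]             # rates lambda_i of the terms in S1(z)
    mu = 1.0 / (-gamma[gamma < 0])           # rates of the terms in S2(z)
    if lam.size == 0:
        return 0.0                           # all gamma_i < 0: S(z) < 0 surely
    if mu.size == 0:
        return 1.0                           # all gamma_i > 0: S(z) > 0 surely
    q1 = [np.prod(1.0 / (np.delete(lam, i) - lam[i])) for i in range(lam.size)]
    q2 = [np.prod(1.0 / (np.delete(mu, j) - mu[j])) for j in range(mu.size)]
    # F(z) = int_0^inf P(S1 >= t) f2(t) dt, with both factors exponential mixtures
    c = np.prod(lam) * np.prod(mu)
    return c * sum(q1[i] * q2[j] / (lam[i] * (lam[i] + mu[j]))
                   for i in range(lam.size) for j in range(mu.size))
```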
Figure 4: Exact (bold line) and estimated (dashed red line) weighted likelihood bootstrap posterior distribution for the beta model.
for small $|\hat\theta - \theta|$, where $\hat\theta$ is the maximum likelihood estimator. Given that
$$\frac{1}{n}\sum_{i=1}^n l''(x_i;\theta) \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^n l'(x_i;\theta)^2$$
both estimate $I(\theta)$, we obtain
$$(4.3)\qquad F(z) \approx \Phi(T_n(z)),$$
where
$$T_n(z) = \sqrt{n}\,(z - \hat\theta)\sqrt{I(z)}.$$
It is possible to see (4.3) as a Bayesian probability matching type procedure. See, for example, [10]. In particular, in Section 3 of [12], the authors consider
$$T_n(\theta) = \sqrt{n}\,(\theta - \hat\theta)\sqrt{I(\theta)}.$$
5 Discussion
If $l(x;\theta)$ is strictly convex in $\theta$ for all $x$ then we can obtain an accurate estimate of the distribution of the maximum likelihood estimator using only $l'(x;\theta)$. For the multivariate case, i.e. $\Theta \subset \mathbb{R}^d$, it is not easy in general to find
$$(5.1)\qquad F_{\hat\theta}(z_1,\ldots,z_d) = P\left(\sum_{i=1}^n \frac{\partial}{\partial\theta_1} l(x_i; z_1) \geq 0,\; \ldots,\; \sum_{i=1}^n \frac{\partial}{\partial\theta_d} l(x_i; z_d) \geq 0\right).$$
Using a multivariate normal approximation, we would have $F_{\hat\theta}(z) \approx P(Y(z) \geq 0)$ with $Y(z) \sim \mathrm{MVN}_d(\mu(z), \Sigma(z))$. Here
$$\mu_j(z) = \int \frac{\partial}{\partial\theta_j} l(x; z_j)\, f(x;\theta^*)\, dx, \qquad \Sigma_{jk} = \int \frac{\partial}{\partial\theta_j} l(x; z_j)\, \frac{\partial}{\partial\theta_k} l(x; z_k)\, f(x;\theta^*)\, dx.$$
Approximating $P(Y(z) \geq 0)$ in multiple dimensions has been considered, for example, by [2], who could only find adequate approximations up to 3 dimensions.
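For small $d$, one option is direct numerical evaluation of this orthant probability; a minimal sketch using scipy (the function name is ours, with $\mu(z)$ and $\Sigma(z)$ assumed to be supplied):

```python
import numpy as np
from scipy.stats import multivariate_normal

def orthant_prob(mu, Sigma):
    # P(Y >= 0) for Y ~ MVN_d(mu, Sigma), computed as P(-Y <= 0);
    # scipy evaluates the MVN CDF numerically, practical only for small d
    mu = np.asarray(mu)
    return multivariate_normal(mean=-mu, cov=Sigma).cdf(np.zeros(mu.size))
```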
An approximate strategy for sampling from (5.1) would involve the parametric bootstrap; see, for example, [4]. To sample $z$ from (5.1) approximately, take a sample $\tilde x = (\tilde x_1,\ldots,\tilde x_n)$ from $f(\cdot;\hat\theta)$ and take $z$ as the mle with data $\tilde x$; i.e. take
$$z = \arg\min_\theta \sum_{1 \leq i \leq n} l(\tilde x_i;\theta),$$
where $z = (z_1,\ldots,z_d) = (z_j, z_{-j})$. Hence, we can easily find the conditional density equivalent of (4.2); i.e. $F(z_j \mid z_{-j})$ for each $j \in \{1,\ldots,d\}$.
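A minimal sketch of this parametric bootstrap (the function and argument names are ours; the negative log-likelihood and model sampler are assumed to be supplied by the user):

```python
import numpy as np
from scipy.optimize import minimize

def parametric_bootstrap(n, theta_hat, neg_loglik, sample_model, B=1000, rng=None):
    # Approximate draws z from (5.1): simulate data from f(.; theta_hat)
    # and re-minimize sum_i l(x_i; theta) each time.
    # neg_loglik(theta, data) and sample_model(theta, n, rng) are assumed,
    # user-supplied functions for the model at hand.
    rng = rng or np.random.default_rng()
    draws = []
    for _ in range(B):
        x_tilde = sample_model(theta_hat, n, rng)
        res = minimize(neg_loglik, x0=np.atleast_1d(theta_hat), args=(x_tilde,))
        draws.append(res.x)
    return np.array(draws)               # rows are bootstrap mle's z = (z_1, ..., z_d)
```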
References
[1] A. Azzalini, A class of distributions which includes the normal ones. Scandinavian Journal of Statistics (1985) 12, 171–178.

[4] B. Efron, Bayesian inference and the parametric bootstrap. Annals of Applied Statistics (2012) 6, 1971–1997.

[9] E. Belitser, On coverage and local radial rates of credible sets. Annals of Statistics (2017) 45, 1124–1151.