
Mathematical Statistics II 2nd Semester, 2021-2022

Introduction to M-Estimator
Date: 2022/05/31 Scribe: Ying Niu 21110840012

1 Empirical Process

Setting: Suppose that random elements X1 , · · · , Xn are i.i.d. on a measurable space (X , A) with common
distribution P . Denote Qf = ∫ f dQ for a given measurable function f and signed measure Q.
Definition 1. (Empirical measure) The empirical measure Pn of a sample X1 , · · · , Xn is the discrete
random measure given by
$$P_n(C) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \in C).$$
Remark 1. If points are measurable, Pn can be described as the random measure that puts mass 1/n at each
observation. We can write the empirical measure as the linear combination $P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$, where $\delta_{X_i}$ is
the Dirac measure at the observation Xi .
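For concreteness, here is a minimal numerical sketch of Pn (not part of the original notes; the function name empirical_measure and the use of NumPy are our own choices):

    import numpy as np

    def empirical_measure(sample, indicator):
        """P_n(C): the fraction of observations X_i falling in the set C."""
        return np.mean(indicator(np.asarray(sample)))

    # Example: P_n((-inf, 0]) for a standard normal sample of size 100
    rng = np.random.default_rng(0)
    x = rng.standard_normal(100)
    print(empirical_measure(x, lambda t: t <= 0.0))   # roughly 1/2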

Given a collection F of measurable functions f : X → R, the empirical measure induces a map
$$\mathcal{F} \to \mathbb{R}, \qquad f \mapsto P_n f = \frac{1}{n}\sum_{i=1}^{n} f(X_i).$$

Definition 2. (Empirical process) The empirical process at f , denoted as Gn f , is the centered and scaled
version of Pn f given by
$$f \mapsto \mathbb{G}_n f = \sqrt{n}\,(P_n - P)f = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \bigl(f(X_i) - Pf\bigr).$$
Remark 2. The signed measure $\mathbb{G}_n = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} (\delta_{X_i} - P)$ will be identified with the empirical process.

For a given function f , it follows from the SLLN and the CLT that
$$P_n f \xrightarrow{a.s.} P f, \qquad \mathbb{G}_n f \xrightarrow{d} N\bigl(0, P(f - Pf)^2\bigr),$$
provided P f exists and P f 2 < ∞, respectively.
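Both limits are easy to check by simulation. The following sketch is illustrative only (the choices f(x) = x², P = N(0, 1), hence Pf = 1 and P(f − Pf)² = 2, and the use of NumPy are assumptions made here, not part of the notes):

    import numpy as np

    rng = np.random.default_rng(1)
    f = lambda t: t ** 2        # a fixed measurable function
    Pf = 1.0                    # P f when P = N(0, 1) and f(x) = x^2

    # SLLN: P_n f converges to P f as n grows
    for n in (10 ** 2, 10 ** 4, 10 ** 6):
        print(n, np.mean(f(rng.standard_normal(n))))          # P_n f, approaching 1.0

    # CLT: G_n f = sqrt(n) (P_n f - P f) is approximately N(0, P(f - P f)^2) = N(0, 2)
    n, reps = 1000, 5000
    Gn = [np.sqrt(n) * (np.mean(f(rng.standard_normal(n))) - Pf) for _ in range(reps)]
    print(np.var(Gn))                                         # close to 2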
Goal: When f varies over a class F, can we similarly obtain uniform versions of the LLN and CLT, and what
conditions do we need? (Glivenko-Cantelli theorem and Donsker theorem)
Methods:

• entropy numbers (covering number, bracketing number);
• maximal inequalities (Hoeffding, Bernstein);
• symmetrization;
• VC dimension;
· · ·

2 M-Estimator

The study of M-estimators and their asymptotic behavior can be seen as an application of empirical process theory.
Suppose that we are interested in estimating a parameter θ of a distribution based on the observations
X1 , · · · , Xn . The most important method of constructing statistical estimators is to choose the estimator to
maximize a certain criterion function.
Definition 3. (M-estimator) An M-estimator θ̂n = θ̂n (X1 , · · · , Xn ) maximizes a criterion function
$$\theta \mapsto M_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} m_\theta(X_i),$$
where mθ : X → R̄ are known functions.
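As an illustrative sketch (not from the source; the helper name m_estimate, the choice mθ(x) = −(x − θ)², the bounds, and the use of SciPy are assumptions), an M-estimator can be computed by numerically maximizing θ ↦ Pn mθ:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def m_estimate(sample, m_theta, bounds):
        """Maximize M_n(theta) = P_n m_theta over the given interval (illustrative helper)."""
        sample = np.asarray(sample)
        Mn = lambda theta: np.mean(m_theta(sample, theta))   # P_n m_theta
        return minimize_scalar(lambda th: -Mn(th), bounds=bounds, method="bounded").x

    # With m_theta(x) = -(x - theta)^2 the M-estimator is the sample mean.
    rng = np.random.default_rng(2)
    x = rng.normal(loc=3.0, scale=1.0, size=500)
    print(m_estimate(x, lambda x, th: -(x - th) ** 2, bounds=(-10.0, 10.0)), x.mean())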

The maximizing value is often sought by setting a derivative (or the set of partial derivatives in the multidi-
mensional case) equal to zero. In other words, estimators that maximize a certain map also solve a system
of estimating equations.
Definition 4. (Z-estimator) A Z-estimator solves the estimating equations
$$\Psi_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \psi_\theta(X_i) = 0,$$
where ψθ are known vector-valued maps. For instance, if θ is k-dimensional, then ψθ typically has k coordi-
nate functions ψθ = (ψθ,1 , · · · , ψθ,k ), and in fact the estimating equations are
$$\sum_{i=1}^{n} \psi_{\theta,j}(X_i) = 0, \qquad j = 1, \cdots, k.$$

Remark 3. A Z-estimator doesn’t necessarily correspond to a maximization problem; nevertheless, the name
M-estimator is widely used for both types of estimators.
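Analogously, a one-dimensional Z-estimator can be sketched as a root of θ ↦ Pn ψθ. The sketch below is illustrative only (the helper name z_estimate, the bracketing interval, and the choice ψθ(x) = x − θ are assumptions):

    import numpy as np
    from scipy.optimize import brentq

    def z_estimate(sample, psi_theta, lo, hi):
        """Solve Psi_n(theta) = P_n psi_theta = 0 by root-finding on the bracket [lo, hi]."""
        sample = np.asarray(sample)
        Psi_n = lambda theta: np.mean(psi_theta(sample, theta))
        return brentq(Psi_n, lo, hi)   # requires a sign change of Psi_n over [lo, hi]

    # With psi_theta(x) = x - theta the Z-estimator is again the sample mean.
    rng = np.random.default_rng(3)
    x = rng.normal(loc=3.0, scale=1.0, size=500)
    print(z_estimate(x, lambda x, th: x - th, lo=-10.0, hi=10.0), x.mean())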

Sometimes the maximum of the criterion function Mn is not attained, or the estimating equation has no
exact solution. It is then natural to use as estimator a value that almost maximizes the criterion function
or nearly solves the estimating equation. This yields an approximate M-estimator or approximate Z-estimator.
Estimators that are sufficiently close to being a point of maximum or a zero often have the same asymptotic behavior.
In terms of the empirical process, we can write
$$M_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} m_\theta(X_i) = P_n m_\theta, \qquad \Psi_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \psi_\theta(X_i) = P_n \psi_\theta.$$

Now we are interested in the limiting distribution of the M-estimator; the analysis is separated into three steps:

• proving consistency;
• deriving the convergence rate;
• establishing the limit distribution.

3 Examples

Example 1. (MLE) Suppose X1 , · · · , Xn are i.i.d. and have a common density pθ , and θ is k-dimensional.
Then the MLE maximizes the log likelihood
$$\theta \mapsto \sum_{i=1}^{n} \log p_\theta(X_i).$$
Thus, the MLE is an M-estimator with mθ = log pθ . If pθ is partially differentiable with respect to θ for
each fixed x, then the MLE is also a Z-estimator with $\psi_{\theta,j} = \frac{\partial}{\partial \theta_j} \log p_\theta$.
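A hedged numerical illustration (not part of the source): treating the MLE as an M-estimator with mθ = log pθ for the Exp(θ) density pθ(x) = θe^{−θx}, and comparing the numerical maximizer of Pn mθ with the closed form 1/X̄ (sample size, seed, and true rate are arbitrary choices):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(4)
    x = rng.exponential(scale=1 / 2.5, size=1000)      # true rate theta = 2.5

    # m_theta = log p_theta for the Exp(theta) density p_theta(x) = theta * exp(-theta * x)
    log_lik = lambda theta: np.mean(np.log(theta) - theta * x)   # P_n m_theta

    mle = minimize_scalar(lambda th: -log_lik(th), bounds=(1e-6, 50.0), method="bounded").x
    print(mle, 1 / x.mean())   # numerical maximizer vs. the closed-form MLE 1 / sample mean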
However, the M-estimator and the Z-estimator are not always equivalent. For instance, if X1 , · · · , Xn are
i.i.d. U [0, θ], we can obtain the MLE by maximizing the log likelihood
$$\theta \mapsto \sum_{i=1}^{n} \bigl(\log 1_{[0,\theta]}(X_i) - \log\theta\bigr).$$
Defining log 0 = −∞, the MLE of θ is θ̂n = X(n) . But this function is not differentiable with respect to θ,
so there is no corresponding system of estimating equations set equal to zero. In this example, the definition
as the location of a maximum is more fundamental than the definition as a zero.
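A tiny simulation check (the values θ = 4 and n = 200 are arbitrary choices for illustration) that the maximizing value is indeed X(n):

    import numpy as np

    rng = np.random.default_rng(6)
    theta_true = 4.0
    x = rng.uniform(0.0, theta_true, size=200)

    # The log likelihood equals -inf for theta < max(x) and decreases in theta above it,
    # so the maximizing value is the largest observation X_(n).
    print(x.max())   # close to (and never above) theta_true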
Example 2. (Location estimators) Let X1 , · · · , Xn be a random sample of real-valued observations, and sup-
pose we want to estimate the location of their distribution. We can view a Z-estimator defined by
$$\sum_{i=1}^{n} \psi(X_i - \theta) = 0$$
as a "location" estimator, where ψ(x) is monotone. Here are four examples of location estimators (a
numerical sketch of all four is given at the end of this example).


1. Mean. Take ψ(x) = x; then the estimator of the mean solves $\sum_{i=1}^{n} (X_i - \theta) = 0$.


2. Median. Take ψ(x) = sign(x); then the estimator of the median solves $\sum_{i=1}^{n} \mathrm{sign}(X_i - \theta) = 0$. (Assume
   that there are no tied observations in the middle.)
3. Huber estimators. The exact values of the largest and smallest observations have very little influence
on the value of the median, but a proportional influence on the mean. Therefore, the sample mean is
considered nonrobust against outliers, while the sample median is considered robust. How to limit the
   influence of the outliers on the estimate? The Huber estimators are motivated by studies in robust
   statistics. Take
   $$\psi(x) = [x]_{-k}^{k} = \begin{cases} -k & \text{if } x < -k,\\ x & \text{if } |x| \le k,\\ k & \text{if } x > k. \end{cases}$$
Depending on the value of k, the Huber estimators behave more like the mean (large k) or more like
the median (small k) and thus bridge the gap between the nonrobust mean and very robust median.
4. Quantiles. A pth sample quantile is roughly a point θ such that pn observations are less than θ and
(1 − p)n observations are greater than θ. That is,

   $$\sum_{i=1}^{n} \bigl((1 - p)\,1\{X_i < \theta\} - p\,1\{X_i > \theta\}\bigr) = 0.$$

   It implies that we can take
   $$\psi(x) = \begin{cases} 1 - p & \text{if } x < 0,\\ 0 & \text{if } x = 0,\\ -p & \text{if } x > 0 \end{cases}$$
   to construct a Z-estimator. However, except for special combinations of p and n, it is almost impossible
   to find an exact zero, because the function $\theta \mapsto \sum_{i=1}^{n} \psi(X_i - \theta)$ is discontinuous, with jumps at the
   observations. Now we consider an approximate Z-estimator.
   If no observations are tied, then all jumps are of size one, and we call a pth sample quantile any θ̂ that
   solves the inequalities
   $$-1 < \sum_{i=1}^{n} \bigl((1 - p)\,1\{X_i < \hat\theta\} - p\,1\{X_i > \hat\theta\}\bigr) < 1.$$

At least one solution θ̂ to the inequalities exists. There may be more than one solution, and all solutions
turn out to have the same asymptotic behavior. If tied observations are present, it may be necessary
to increase the interval (−1, 1) to ensure the existence of solutions.


Figure 1: The functions $\theta \mapsto \sum_{i=1}^{n} \psi(X_i - \theta)$ for these four estimators, for a sample of size 15 from N (5, 5).

All the estimators can also be defined as solutions of a maximization problem. The mean, median, Huber
estimators, and quantiles minimize $\sum_{i=1}^{n} m(X_i - \theta)$ for m(x) equal to $x^2$, $|x|$, $x^2 1_{|x|\le k} + (2k|x| - k^2)1_{|x|>k}$, and
$(1 - p)x^- + p x^+$, respectively (equivalently, they maximize the negative of this sum).
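The following simulation sketch of the four ψ-based location estimators is illustrative only (the Huber cutoff k = 1.345, the bracketing interval, and the sample merely mimic the setting of Figure 1); the mean and Huber equations are solved with SciPy's brentq, while the median and quantile are reported via NumPy because their ψ-sums are step functions:

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(5)
    x = rng.normal(loc=5.0, scale=np.sqrt(5.0), size=15)   # a sample loosely mimicking Figure 1

    k = 1.345                                   # Huber cutoff; an arbitrary illustrative choice
    psi_mean = lambda t: t
    psi_huber = lambda t: np.clip(t, -k, k)     # [t]_{-k}^{k}

    def z_location(psi, sample, lo=-100.0, hi=100.0):
        """Root of theta -> sum_i psi(X_i - theta); valid when this sum is continuous in theta."""
        return brentq(lambda th: np.sum(psi(sample - th)), lo, hi)

    print("mean  :", z_location(psi_mean, x), "vs", x.mean())
    print("huber :", z_location(psi_huber, x))

    # For the median and the p-th quantile, theta -> sum_i psi(X_i - theta) is a step function,
    # so an exact zero rarely exists; np.median / np.quantile return values satisfying the
    # approximate (inequality) version of the estimating equation.
    print("median:", np.median(x))
    print("0.25-quantile:", np.quantile(x, 0.25))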

