
VI: Asymptotic theory and parametric inference

A modern crash course in intermediate Statistics and Probability

Paul Rognon

Barcelona School of Economics


Universitat Pompeu Fabra
Universitat Politècnica de Catalunya

Convergence of random variables
Modes of convergence of functions
Let {fn}n∈N be a sequence of functions, fn : S → R for all n, and let f be another
function.
• fn converges pointwise to the function f if:
  ∀x ∈ S, fn(x) → f(x) as n → ∞.
• fn converges to the function f with respect to a norm ∥·∥ if:
  ∥fn − f∥ → 0 as n → ∞.
• fn converges to the function f in measure µ if, for every ϵ > 0:
  µ({s ∈ S : |fn(s) − f(s)| > ϵ}) → 0 as n → ∞.
• A common family of norms for functions is the Lp norms:
  ∥f∥p = ( ∫ |f(x)|^p dµ(x) )^{1/p}
• The convergence of {fn} according to one definition does not necessarily
imply its convergence according to another.
Modes of convergence of random variables
Intuitively we should be able to write something like "N(0, 1/n) → 0", but
at the same time, if Xn ∼ N(0, 1/n), then P(Xn = 0) = 0 for all n. Just like for
functions there are different ways to define convergence of random
variables.
Convergence almost surely (≈ pointwise convergence)
(Xn) converges almost surely to X, noted Xn →a.s. X, if:
the event {ω ∈ Ω : Xn(ω) → X(ω) as n → ∞} has probability 1.

Convergence in probability (≈ convergence in measure)
(Xn) converges in probability to X, noted Xn →p X, if for every ϵ > 0:
P(|Xn − X| > ϵ) → 0 as n → ∞.

Convergence in r-th mean (≈ convergence in norm)
(Xn) converges in r-th mean to X, r ∈ N∗, noted Xn →r X, if:
E(|Xn − X|^r) → 0 as n → ∞.
Modes of convergence of random variables
Convergence in distribution
(Xn) converges in distribution to X, noted Xn →d X, if:

lim_{n→∞} FXn(t) = FX(t)

at every point t where FX is continuous.
Relationships between modes of convergence
Almost sure convergence and convergence in r-th mean each imply convergence in probability, which in turn implies convergence in distribution.
Example

Example 5.3, Wasserman, All of Statistics. Let Xn ∼ N(0, 1/n). We have
said that intuitively it should concentrate around 0. Let Y be the
random variable with point mass distribution at 0.

1. Does Xn converge in distribution to Y ?


2. Does Xn converge in probability to 0?
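To make the questions concrete, here is a minimal Monte Carlo sketch (assuming NumPy; the tolerance ϵ = 0.1, the sample sizes and the replication count are arbitrary choices) that estimates P(|Xn| > ϵ) for growing n.

```python
import numpy as np

# Monte Carlo check that Xn ~ N(0, 1/n) concentrates around 0:
# estimate P(|Xn| > eps) for increasing n (eps = 0.1 is an arbitrary tolerance).
rng = np.random.default_rng(0)
eps, reps = 0.1, 100_000

for n in [1, 10, 100, 1000]:
    xn = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n), size=reps)
    print(f"n={n:5d}  P(|Xn| > {eps}) ~ {np.mean(np.abs(xn) > eps):.4f}")
# The estimated probabilities shrink toward 0, illustrating convergence
# in probability to 0 (and hence in distribution to the point mass Y).
```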

Law of large numbers and central limit theorem
Let X1 , . . . , Xn be independent identically distributed random variables
with mean µ and variance σ².
Law of large numbers
• Strong law: X̄n →a.s. µ
• Weak law: X̄n →p µ and X̄n →d µ

Central Limit Theorem (CLT)

Zn := (X̄n − E(X̄n)) / √var(X̄n) = √n (X̄n − µ)/σ →d Z ∼ N(0, 1).

This implies that X̄n ≈ N(µ, σ²/n).
Under stronger assumptions, the CLT can be extended to the case of
different variances σj².
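As an illustration, the sketch below (NumPy and SciPy; the Exponential(1) distribution, sample sizes and replication count are arbitrary choices, not part of the slides) compares empirical quantiles of Zn with those of N(0, 1).

```python
import numpy as np
from scipy import stats

# Illustrate the CLT with a skewed distribution: Exponential(1), so mu = sigma = 1.
rng = np.random.default_rng(0)
mu, sigma, reps = 1.0, 1.0, 50_000

for n in [5, 30, 200]:
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    zn = np.sqrt(n) * (xbar - mu) / sigma
    # Compare a few empirical quantiles of Zn with N(0, 1) quantiles.
    qs = np.quantile(zn, [0.05, 0.5, 0.95])
    ref = stats.norm.ppf([0.05, 0.5, 0.95])
    print(f"n={n:4d}  quantiles {np.round(qs, 2)}  vs N(0,1) {np.round(ref, 2)}")
```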

Fundamental concepts in
statistical inference
Statistical inference
Statistical inference (learning) is the process of using data to infer the
distribution that generated the data. We observe a sample X1 , . . . , Xn
from a distribution F . We may want to infer F or only some feature of F
such as its mean.
Statistical model
A statistical model M is a family of probability distributions that we pick
for our data.
A statistical model is parametric if it can be parameterized by a finite
number of parameters θ: M = {f(x; θ) : θ ∈ Θ ⊆ R^d}, where Θ is the
parameter space.
When M cannot be parameterized by a finite number of parameters, e.g.
if d = ∞, it is a nonparametric model.
Examples - parametric or nonparametric?
We toss a biased coin, X1 , . . . , Xn ∼ Bern(p). Task: learn p.
X1 , . . . , Xn observed from some univariate distribution F . Try to infer F
with no assumption on F.
Fundamental problems in statistical inference
Given a parametric model M with parameter set Θ suppose that the true
distribution of the sample X1 , . . . , Xn is given by θ∗ ∈ Θ (unknown).
Point estimation
Using the data, we provide an estimate of the parameter θ with the help
of an estimator. An estimator is any function of the sample,
θ̂n = θ̂n (X1 , . . . , Xn ) with values in Θ. For example: X̄n for E(X ).
Confidence intervals
Using the data, we estimate an interval or region of values which is likely
to contain the parameter θ. A (1 − α)-confidence interval, Cn = (a, b), is
such that
P(θ∗ ∈ Cn ) ≥ 1 − α.

Hypothesis testing/model selection


We test whether a hypothesis on θ is supported by the sample data:
null hypothesis H0 : θ∗ ∈ Θ0, alternative H1 : θ∗ ∉ Θ0.
e.g. testing whether a coin is fair, H0 : p = 1/2. Reject H0 if T = |p̂n − 1/2| is "large".
Point estimation
Maximum Likelihood Estimator (MLE)
Let X1 , . . . , Xn be a random sample (iid) from f (x; θ). An intuitive
estimator of θ is the value of θ that maximizes the chances to observe
the sample data we have.
The likelihood function gives for any value θ the likelihood (or probability
in the discrete case) of observing the sample data:
Ln(θ) = f(X1, . . . , Xn; θ) = ∏_{i=1}^n f(Xi; θ) (by independence)

The maximum likelihood estimator θ̂n is the global maximizer of Ln(θ):

θ̂n = arg max_{θ∈Θ} Ln(θ).

It is more convenient to work with the log-likelihood (same optima):
ℓn(θ) = log Ln(θ) = ∑_{i=1}^n log f(Xi; θ).
Some properties of the MLE
• Consistency: θ̂n →p θ∗.

• Equivariance: if θ̂n is the MLE of θ, then g (θ̂n ) is the MLE of g (θ).


Example: Bernoulli trials

Let X1 , . . . , Xn ∼ Bernoulli(p) iid for p ∈ Θ = (0, 1),

f(x; p) = p^x (1 − p)^{1−x} for x = 0, 1,

Ln(p) = ∏_i f(xi; p) = ∏_i p^{xi} (1 − p)^{1−xi} = p^{∑_{i=1}^n xi} (1 − p)^{n − ∑_{i=1}^n xi}.

ℓn(p) = (∑_{i=1}^n xi) log(p) + (n − ∑_{i=1}^n xi) log(1 − p),

dℓn/dp (p) = (∑_{i=1}^n xi)/p − (n − ∑_{i=1}^n xi)/(1 − p).

We have dℓn/dp (p) = 0 if (∑_{i=1}^n xi)(1 − p) = (n − ∑_{i=1}^n xi) p, so the MLE is

p̂n = (1/n) ∑_{i=1}^n xi.
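A minimal numerical check (NumPy/SciPy; the true p = 0.3 and the sample size are arbitrary) that the closed-form MLE p̂n = x̄ agrees with a direct numerical maximization of ℓn(p):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.3, size=500)  # Bernoulli(0.3) sample

def neg_loglik(p):
    # Negative Bernoulli log-likelihood: -(sum x_i) log p - (n - sum x_i) log(1-p)
    s, n = x.sum(), x.size
    return -(s * np.log(p) + (n - s) * np.log(1 - p))

closed_form = x.mean()
numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(closed_form, numeric)  # the two estimates agree up to numerical tolerance
```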

Example: Multivariate Gaussian distribution

The density function of the multivariate Gaussian distribution is:


f(x; µ, Σ) = (2π)^{−p/2} (det Σ)^{−1/2} exp{−(1/2)(x − µ)^T Σ^{−1} (x − µ)}.

Up to constant factors the log-likelihood for an iid sample of size n is:

ℓn(µ, Σ) = −(n/2) log det(Σ) − (1/2) ∑_{i=1}^n (xi − µ)^T Σ^{−1} (xi − µ).

Independently of Σ, maximizing over µ always yields µ̂n = x̄n. Now

ℓn(x̄n, Σ) = −(n/2) log det(Σ) − (1/2) trace((n − 1) Sn Σ^{−1}),

which is a concave function of K = Σ^{−1}. Then Σ̂ = ((n − 1)/n) Sn, where
Sn = (1/(n − 1)) ∑_{i=1}^n (xi − x̄n)(xi − x̄n)^T.
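As a sanity check, the sketch below (NumPy only; the dimension, mean and covariance used for simulation are arbitrary) verifies that (n − 1)/n · Sn equals the direct MLE formula (1/n) ∑(xi − x̄)(xi − x̄)^T:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 3
true_mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(p, p))
true_sigma = A @ A.T + np.eye(p)          # an arbitrary positive definite covariance
x = rng.multivariate_normal(true_mu, true_sigma, size=n)

mu_hat = x.mean(axis=0)                    # MLE of mu is the sample mean
S_n = np.cov(x, rowvar=False)              # unbiased sample covariance (divides by n-1)
sigma_hat = (n - 1) / n * S_n              # MLE rescales S_n by (n-1)/n
print(np.round(mu_hat, 2))
print(np.allclose(sigma_hat, (x - mu_hat).T @ (x - mu_hat) / n))  # True
```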

Methods of moments

Let X1, . . . , Xn be a random sample (iid) from f(x; θ). The first k
moments (αj)j=1,...,k of the distribution are functions of θ. The idea of
the method of moments is to find the value of θ that makes the
theoretical values of the moments equal to the values of the sample
moments (α̂j)j=1,...,k.

The method of moments estimator θ̂n is defined to be the value of θ that
solves the system:

α1(θ̂n) = α̂1
α2(θ̂n) = α̂2
...
αk(θ̂n) = α̂k

Example
Let X1 , . . . , Xn be iid binomial (k, p). Here we assume that both k and p
are unknown and we desire point estimators for both parameters.
Equating the first two sample moments to those of the population yields
the system of equations
X̄ = kp
(1/n) ∑_{i=1}^n Xi² = kp(1 − p) + k²p²

which now must be solved for k and p.


After some algebra, we obtain the method of moments estimators

k̂ = X̄² / (X̄ − (1/n) ∑_{i=1}^n (Xi − X̄)²)

p̂ = X̄ / k̂
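A numerical sketch of these estimators (NumPy; the true k = 12, p = 0.4 and the sample size are arbitrary choices); note that the binomial method of moments estimator of k is known to be unstable in small samples:

```python
import numpy as np

rng = np.random.default_rng(0)
true_k, true_p = 12, 0.4
x = rng.binomial(true_k, true_p, size=5000)

xbar = x.mean()
m2 = np.mean(x**2)                 # second sample moment (1/n) * sum(X_i^2)
var_term = xbar - (m2 - xbar**2)   # denominator: X_bar - (1/n) * sum((X_i - X_bar)^2)

k_hat = xbar**2 / var_term         # method of moments estimator of k
p_hat = xbar / k_hat               # method of moments estimator of p
print(round(k_hat, 2), round(p_hat, 3))
```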

How to evaluate an estimator?
Bias
We define: bias(θ̂n ) := E(θ̂n ) − θ
θ̂n is said to be unbiased if bias(θ̂n ) = 0.
Consistency
θ̂n is consistent if θ̂n →p θ.
Standard error of θ̂n
We define: se(θ̂n) = √varθ(θ̂n). se(θ̂n) is often not computable, we use
an estimated standard error ŝe.
Example
X1, . . . , Xn ∼ Bern(p) iid. We define p̂n = (1/n) ∑_i Xi as an estimator of p:

• p̂n is unbiased: E(p̂n) = (1/n) ∑_{i=1}^n E(Xi) = (1/n) np = p

• se(p̂n) = √var(p̂n) = √((1/n²) var(∑_i Xi)) = √((1/n²) np(1 − p)) = √(p(1 − p)/n).

We use ŝe = √(p̂n(1 − p̂n)/n).
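A small simulation (NumPy; the true p, n and the replication count are arbitrary) comparing the empirical standard deviation of p̂n across replications with the true se and the average estimated ŝe:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.3, 200, 20_000

p_hat = rng.binomial(n, p, size=reps) / n          # replicated estimates p_hat_n
se_true = np.sqrt(p * (1 - p) / n)                 # se(p_hat_n) using the true p
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)          # estimated standard errors
print(round(p_hat.std(), 4), round(se_true, 4), round(se_hat.mean(), 4))
# empirical sd of p_hat, the true se, and the average estimated se agree closely
```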
Mean squared error (MSE)
We define:
MSE = E[(θ̂n − θ)²]

Note: MSE ≠ var(θ̂n) = E[(θ̂n − E(θ̂n))²]. Actually:
MSE = bias²(θ̂n) + var(θ̂n)

Example: Gaussian MSE


E(X̄n) = µ, E(Sn) = σ², and

E[(X̄n − µ)²] = σ²/n,  E[(Sn − σ²)²] = 2σ⁴/(n − 1).

MSE and consistency


If MSE → 0 as n → ∞ (mean square convergence) then θ̂n is consistent.
Exercise

Let X1 , . . . , Xn be a random sample and θ̂n be an estimator of θ.

1. Show that MSE = bias²(θ̂n) + var(θ̂n).


2. Show that if MSE → 0 as n → ∞ then θ̂n is consistent.
3. Does finding the θ̂n that has minimum MSE seem like a good idea?

Statistical decision theory
There are plenty of ways to find reasonably good estimators. How can we
compare them systematically? That’s what statistical decision theory
does, it studies the optimality of estimators.
We first need to measure the discrepancy between the true θ and θ̂n .
This is done through a loss function. Common choices are:
• absolute error loss: L(θ, θ̂n ) = |θ̂n − θ|,
• squared error loss: L(θ, θ̂n) = (θ̂n − θ)² (large errors penalized more)
• Lp loss: L(θ, θ̂n) = |θ − θ̂n|^p
• zero-one loss: L(θ, θ̂n) = 0 if θ = θ̂n, 1 if θ ≠ θ̂n
Risk of an estimator
We define the risk of estimator θ̂n for a given loss function L as:

R(θ, θ̂n ) = E[L(θ, θ̂n )]

We will want to minimize that risk.


The MSE is the risk for the squared error loss.
Best estimator of variance with different loss functions

We consider estimators of the type tSn with t > 0, to estimate the
variance σ² of an iid sample from N(µ, σ²).

The MSE of tSn is: var(tSn) + (E(tSn) − σ²)² = [2t²/(n − 1) + (t − 1)²] σ⁴.

The estimator that minimizes the MSE is obtained taking t = (n − 1)/(n + 1). It is
not the traditional unbiased Sn!

Now, consider Stein's loss function

L(σ², σ̂²) = σ̂²/σ² − 1 − log(σ̂²/σ²).

For the estimator tSn the risk is R(σ², tSn) = t − 1 − log t − E[log(Sn/σ²)],
which is minimized for t = 1.
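The comparison can be checked by simulation. The sketch below (NumPy; n = 10, σ² = 2 and the replication count are arbitrary) estimates the MSE and the Stein risk of tSn for t = 1, t = (n − 1)/n and t = (n − 1)/(n + 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 2.0, 200_000

x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
s_n = x.var(axis=1, ddof=1)                       # unbiased sample variance S_n

def stein_loss(est, true):
    return est / true - 1 - np.log(est / true)

for label, t in [("unbiased t=1", 1.0), ("MLE t=(n-1)/n", (n - 1) / n),
                 ("MSE-optimal t=(n-1)/(n+1)", (n - 1) / (n + 1))]:
    est = t * s_n
    mse = np.mean((est - sigma2) ** 2)
    stein = np.mean(stein_loss(est, sigma2))
    print(f"{label:28s}  MSE={mse:.3f}  Stein risk={stein:.3f}")
# t=(n-1)/(n+1) wins under squared error loss, t=1 wins under Stein's loss.
```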

Asymptotic properties of the
maximum likelihood estimator
Asymptotic normality of the MLE
Score function: u(θ; X1, . . . , Xn) = ∇θ ℓn(θ)  (u ∈ R^p)
Let s(Xi; θ) := ∇θ log f(Xi; θ), then
u(θ; X1, . . . , Xn) = s(X1; θ) + · · · + s(Xn; θ).
And so the score function is a sum of n independent random variables.
Moments of s(X; θ)
the mean: E(s(X; θ)) = 0,
the covariance, or Fisher information matrix:

I(θ) := var(s(X; θ)) = −E(∇∇^T log f(X; θ))

Fisher information matrix of a sample
In(θ) = varθ(u(θ; X1, . . . , Xn)) = varθ(∑_{i=1}^n s(Xi; θ)) = nI(θ)
Asymptotic distribution of MLE (under some regularity conditions)

√n (θ̂n − θ∗) →d N(0_d, (I(θ̂n))^{−1}).
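A quick Monte Carlo illustration for the Bernoulli model, where I(p) = 1/(p(1 − p)) (NumPy; the true p, n and replication count are arbitrary):

```python
import numpy as np

# Check that sqrt(n)*(p_hat - p) has standard deviation close to
# sqrt(I(p)^(-1)) = sqrt(p(1-p)) for the Bernoulli model.
rng = np.random.default_rng(0)
p, n, reps = 0.3, 400, 50_000

p_hat = rng.binomial(n, p, size=reps) / n
z = np.sqrt(n) * (p_hat - p)
print(round(z.std(), 3), round(np.sqrt(p * (1 - p)), 3))  # both close to 0.458
```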
Delta Method (univariate case)

We define a new parameter τ as a smooth function of θ ∈ R: τ = g (θ).


If √n (θ̂n − θ∗) →d N(0, σ²), then the Delta Method gives us the
asymptotic distribution of τ̂n = g(θ̂n) as:

√n (τ̂n − τ) →d N(0, g′(θ)² σ²)

Application to MLEs
By invariance, if θ̂n is the MLE of θ then τ̂n = g(θ̂n) is the MLE of τ.
We also have that √n (θ̂n − θ) →d N(0, (I(θ̂n))^{−1}), then:

√n (τ̂n − τ) →d N(0, g′(θ̂n)² (I(θ̂n))^{−1})

Example

Let X1, . . . , Xn ∼ Bern(p) and let ψ = g(p) = log(p/(1 − p)).

The Fisher information function is I(p) = 1/(p(1 − p)). The MLE of p is
p̂n = X̄n, and I^{−1}(p̂n) = p̂n(1 − p̂n).
By invariance the MLE of ψ is ψ̂n = log(p̂n/(1 − p̂n)). Since
g′(p) = 1/(p(1 − p)), according to the delta method

√n (ψ̂n − ψ) →d N(0, 1/(p̂n(1 − p̂n)))
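A simulation check of this delta-method variance (NumPy; the true p, n and replication count are arbitrary):

```python
import numpy as np

# Check the delta-method variance for psi_hat = log(p_hat / (1 - p_hat)).
rng = np.random.default_rng(0)
p, n, reps = 0.3, 500, 50_000

p_hat = rng.binomial(n, p, size=reps) / n
psi_hat = np.log(p_hat / (1 - p_hat))
psi = np.log(p / (1 - p))

empirical_var = np.var(np.sqrt(n) * (psi_hat - psi))
delta_var = 1 / (p * (1 - p))              # g'(p)^2 / I(p) = 1 / (p(1-p))
print(round(empirical_var, 3), round(delta_var, 3))   # both close to 4.76
```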

MLE and KL divergence

The Kullback-Leibler divergence is a concept from information theory. It
can be understood as a "distance" between two distributions p and q:

KL(p∥q) = − ∫ p(x) ln(q(x)/p(x)) dx

If:

• pθ(x) belongs to our statistical model, the family of distributions we
are working with,
• pθ∗(x) is the true probability distribution from which our data was
generated,

then finding θ̂ that minimizes KL(pθ∗∥pθ) is asymptotically equivalent to
maximizing the likelihood and finding the MLE.
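A minimal numerical illustration for the Bernoulli model (NumPy; the true θ∗ = 0.3, the sample size and the grid are arbitrary): the exact KL(pθ∗∥pθ) and the sample average log-likelihood are optimized at nearly the same θ.

```python
import numpy as np

# Compare the minimizer of KL(p_theta* || p_theta) with the maximizer of
# the sample average log-likelihood, over a grid of theta values.
rng = np.random.default_rng(0)
theta_star = 0.3
x = rng.binomial(1, theta_star, size=2000)
grid = np.linspace(0.01, 0.99, 981)

kl = (theta_star * np.log(theta_star / grid)
      + (1 - theta_star) * np.log((1 - theta_star) / (1 - grid)))
avg_loglik = x.mean() * np.log(grid) + (1 - x.mean()) * np.log(1 - grid)

print(round(grid[np.argmin(kl)], 3))        # the KL minimizer is theta* = 0.3
print(round(grid[np.argmax(avg_loglik)], 3))  # the MLE x_bar, close to 0.3
```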

Regression
Linear regression

Regression function
Regression is a method for studying the relationship between a response
variable Y and covariates X . The covariates are also called predictor
variables or features.
The function r(X) that minimizes the mean squared error is the
conditional expectation, also called the regression function.
r (X1 , . . . , Xp ) = E[Y |X1 , . . . , Xp ]

Linear regression
In linear regression, we assume a linear form for r (X ):

Yi = β0 + β1 X1,i + · · · + βp Xp,i + ϵi = Xi β + ϵi

where E(ϵi|Xi) = 0 and var(ϵi|Xi) = σ² for all i (assumption of
homoscedasticity).
Least squares estimator for linear regression
We observe y ∈ R^n and X ∈ R^{n×(p+1)}; suppose rank(X) = p + 1.

In the least squares approach for linear regression, we estimate β with β̂


that minimizes the residual sum of squares RSS = ∥y − X β∥2 .

We saw in the chapter on algebra that the minimizer is


β̂ = (X T X )−1 X T y and that X β̂ is the orthogonal projection of y onto
the column span of X .

If var(ϵ) = σ²In then
var(β̂) = σ²(X^T X)^{−1}
Moreover, by the CLT,
β̂ ≈ N(β, σ²(X^T X)^{−1})
An unbiased estimator of σ² is:

σ̂² = (1/(n − p − 1)) ∥y − X β̂∥²
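A minimal sketch of the least squares computations (NumPy only; the simulated design, coefficients and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design with intercept
beta = np.array([1.0, 2.0, -1.0, 0.5])
sigma = 0.7
y = X @ beta + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # least squares estimate
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)         # unbiased estimate of sigma^2
var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)

print(np.round(beta_hat, 2))                         # close to (1, 2, -1, 0.5)
print(round(sigma2_hat, 3))                          # close to 0.49
print(np.round(np.sqrt(np.diag(var_beta_hat)), 3))   # estimated standard errors
```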
Gaussian errors in linear regression

A common additional assumption is the normality of errors:

ϵi|Xi ∼ N(0, σ²) or equivalently Yi|Xi ∼ N(Xiβ, σ²).

In that case, we can estimate β using the MLE. The log-likelihood is


ℓn(β) = −n log σ − (1/(2σ²)) ∥y − X β∥².

Under the assumption of normality of errors, the least squares estimator
of β is also the maximum likelihood estimator. This remains true if σ is
unknown.
The MLE of σ² is slightly different:

σ̂² = (1/n) ∥y − X β̂∥²

Logistic regression
When Yi ∈ {0, 1} is binary, we preferably use a logistic regression model.
The name logistic comes from the logistic function:
g(x) = e^x / (1 + e^x)
In logistic regression, the regression function is the composition of the
logistic function and a linear function:

E[Yi |Xi = xi ] = P(Yi = 1|Xi = xi ) = g (xi β) = pi

That is: Yi |Xi = xi ∼ Bern(g (Xi β))


We have that:
 
Xi β = ln(pi/(1 − pi)) = logit(pi), the log-odds of pi

The model can be estimated by MLE. The MLE β̂ has to be obtained by
maximizing the likelihood numerically (there is no closed-form solution).
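A sketch of this numerical maximization (NumPy/SciPy rather than a dedicated GLM routine; the simulated design and coefficients are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 covariates
beta_true = np.array([-0.5, 1.0, 2.0])
p = 1 / (1 + np.exp(-X @ beta_true))                          # logistic function g
y = rng.binomial(1, p)

def neg_loglik(beta):
    eta = X @ beta
    # Bernoulli log-likelihood: sum_i [y_i*eta_i - log(1 + exp(eta_i))], written stably
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

beta_hat = minimize(neg_loglik, x0=np.zeros(3), method="BFGS").x
print(np.round(beta_hat, 2))   # close to (-0.5, 1.0, 2.0)
```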
Generalized linear model (GLM)
Both the linear regression with Gaussian errors and the logistic regression
are special cases of generalized linear models (GLMs).
A GLM specifies a parametric statistical model for the conditional
distribution Yi |Xi = xi . It consists of three elements:
1. a family of probability distributions for the response, Y ∼ fY (y )
2. a linear predictor: Xi^T β
3. a link function g(·) such that E(Yi | Xi = xi) = g^{−1}(xi^T β)
Exercise
Give the family fY and the link function g for the linear regression with
Gaussian errors and the logistic regression.
Other common GLMs
• Poisson regression: Poisson distribution for the response, logarithm ln
as link function
• Multinomial regression: Multinomial distribution for the response, logit
as link function
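As an illustration of the first bullet, a minimal Poisson regression fitted by numerical MLE (NumPy/SciPy; the simulated design and coefficients are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

# Poisson regression: Y_i | X_i ~ Poisson(mu_i) with log link, log(mu_i) = x_i^T beta.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))

def neg_loglik(beta):
    eta = X @ beta
    # Poisson log-likelihood up to the constant -sum(log(y_i!))
    return -np.sum(y * eta - np.exp(eta))

beta_hat = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x
print(np.round(beta_hat, 2))   # close to (0.5, 0.8)
```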