Paul Rognon
Convergence of random variables
Modes of convergence of functions
Let {fₙ}ₙ∈ℕ be a sequence of functions, fₙ : S → ℝ for all n, and let f be another function.
• fₙ converges pointwise to f if: ∀x ∈ S, fₙ(x) → f(x) as n → ∞.
• fₙ converges to f with respect to a norm ∥·∥ if: ∥fₙ − f∥ → 0 as n → ∞.
• fₙ converges to f in measure µ if, for every ϵ > 0: µ({s ∈ S : |fₙ(s) − f(s)| > ϵ}) → 0 as n → ∞.
• For random variables, Xₙ converges in distribution to X if Fₙ(x) → F(x) as n → ∞ at every point x where F, the distribution function of X, is continuous.
Relationships between modes of convergence
Almost-sure convergence implies convergence in probability, which in turn implies convergence in distribution; convergence in Lᵖ norm also implies convergence in probability. None of the converse implications holds in general.
Law of large numbers and central limit theorem
Let X₁, …, Xₙ be independent identically distributed random variables with mean µ and variance σ².
Law of large numbers
• Strong law: X̄ₙ → µ almost surely.
• Weak law: X̄ₙ → µ in probability, and hence also X̄ₙ → µ in distribution.
Central limit theorem
√n (X̄ₙ − µ) → N(0, σ²) in distribution; equivalently, X̄ₙ is approximately N(µ, σ²/n) for large n.
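As an illustration (my sketch, not from the slides), the following NumPy simulation checks both statements; the distribution, sample sizes, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0  # assumed true mean and standard deviation

# Law of large numbers: the sample mean approaches mu as n grows.
for n in (10, 1_000, 100_000):
    print(n, rng.normal(mu, sigma, size=n).mean())

# Central limit theorem: sqrt(n) * (X_bar - mu) is approximately N(0, sigma^2).
n, reps = 500, 10_000
xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbars - mu)
print(z.mean(), z.std())  # close to 0 and to sigma = 3
```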
Fundamental concepts in statistical inference
Statistical inference
Statistical inference (learning) is the process of using data to infer the distribution that generated the data. We observe a sample X₁, …, Xₙ from a distribution F. We may want to infer F itself or only some feature of F, such as its mean.
Statistical model
A statistical model M is a family of probability distributions that we pick
for our data.
A statistical model is parametric if it can be parameterized by a finite number of parameters θ: M = {f(x; θ) : θ ∈ Θ ⊆ ℝᵈ}, where Θ is the parameter space.
When M cannot be parameterized by a finite number of parameters (e.g. if d = ∞), it is a nonparametric model.
Examples - parametric or nonparametric?
• We toss a biased coin, X₁, …, Xₙ ∼ Bern(p). Task: learn p. This is parametric: Θ = [0, 1] ⊆ ℝ.
• X₁, …, Xₙ observed from some univariate distribution F. Try to infer F with no assumption on F. This is nonparametric: the family of all distributions cannot be indexed by finitely many parameters.
Fundamental problems in statistical inference
Given a parametric model M with parameter space Θ, suppose that the true distribution of the sample X₁, …, Xₙ is given by some unknown θ* ∈ Θ.
Point estimation
Using the data, we provide an estimate of the parameter θ with the help of an estimator. An estimator is any function of the sample, θ̂ₙ = θ̂ₙ(X₁, …, Xₙ), with values in Θ. For example: X̄ₙ for E(X).
Confidence intervals
Using the data, we estimate an interval or region of values which is likely to contain the parameter θ. A (1 − α)-confidence interval, Cₙ = (a, b), is such that
P(θ* ∈ Cₙ) ≥ 1 − α.
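As a sketch of how such an interval is computed in practice (my example, not the slides'): a normal-approximation (Wald) interval for a Bernoulli p, with the true p, sample size, and seed chosen arbitrarily.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=200)  # Bern(p) sample; true p = 0.3 is an assumption

alpha = 0.05
p_hat = x.mean()                                # point estimate of p
se_hat = np.sqrt(p_hat * (1 - p_hat) / len(x))  # estimated standard error
z = norm.ppf(1 - alpha / 2)                     # ~1.96 for alpha = 0.05
print((p_hat - z * se_hat, p_hat + z * se_hat)) # random interval; covers p ~95% of the time
```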
Example: maximum likelihood for Bernoulli
For X₁, …, Xₙ ∼ Bern(p), the log-likelihood is ℓₙ(p) = (∑ᵢ₌₁ⁿ xᵢ) log p + (n − ∑ᵢ₌₁ⁿ xᵢ) log(1 − p).
We have dℓₙ/dp (p) = 0 if (∑ᵢ₌₁ⁿ xᵢ)(1 − p) = (n − ∑ᵢ₌₁ⁿ xᵢ) p, so the MLE is
p̂ₙ = (1/n) ∑ᵢ₌₁ⁿ xᵢ.
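A quick numerical check of this derivation (a sketch with simulated data; the true p and sample size are my choices): maximizing ℓₙ(p) directly recovers the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.6, size=1_000)  # simulated Bern(0.6) sample

# Negative Bernoulli log-likelihood -l_n(p)
def negloglik(p):
    s = x.sum()
    return -(s * np.log(p) + (len(x) - s) * np.log(1 - p))

res = minimize_scalar(negloglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())  # the numerical maximizer agrees with p_hat = x_bar
```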
Method of moments
The method of moments estimates θ by equating the first sample moments (1/n) ∑ᵢ Xᵢʲ to the corresponding population moments E(Xʲ) and solving the resulting equations for the parameters.
Example
Let X₁, …, Xₙ be iid Binomial(k, p). Here we assume that both k and p are unknown and we desire point estimators for both parameters. Equating the first two sample moments to those of the population yields the system of equations
X̄ = kp,
(1/n) ∑ᵢ Xᵢ² = kp(1 − p) + k²p².
Solving for k and p gives
k̂ = X̄² / (X̄ − (1/n) ∑ᵢ (Xᵢ − X̄)²),
p̂ = X̄ / k̂.
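A minimal simulation sketch of these estimators (the true k and p are arbitrary assumptions); note that the denominator of k̂ can be close to zero in small samples, so the estimator can be unstable.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(10, 0.4, size=5_000)  # assumed truth: k = 10, p = 0.4

xbar = x.mean()
s2 = ((x - xbar) ** 2).mean()          # (1/n) * sum of (X_i - X_bar)^2

k_hat = xbar**2 / (xbar - s2)          # method-of-moments estimate of k
p_hat = xbar / k_hat                   # method-of-moments estimate of p
print(k_hat, p_hat)                    # close to (10, 0.4) for large n
```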
How to evaluate an estimator?
Bias
We define: bias(θ̂ₙ) := E(θ̂ₙ) − θ.
θ̂ₙ is said to be unbiased if bias(θ̂ₙ) = 0.
Consistency
θ̂ₙ is consistent if θ̂ₙ → θ in probability.
Standard error of θ̂ₙ
We define: se(θ̂ₙ) = √(varθ(θ̂ₙ)). se(θ̂ₙ) is often not computable, so we use an estimated standard error ŝe.
Example
X₁, …, Xₙ ∼ Bern(p) iid. We define p̂ₙ = (1/n) ∑ᵢ Xᵢ as an estimator of p. Then E(p̂ₙ) = p, so p̂ₙ is unbiased, se(p̂ₙ) = √(p(1 − p)/n), and the estimated standard error is ŝe = √(p̂ₙ(1 − p̂ₙ)/n).
Mean squared error: MSE(θ̂ₙ) := E[(θ̂ₙ − θ)²].
Warning: MSE ≠ var(θ̂ₙ) = E[(θ̂ₙ − E(θ̂ₙ))²]. Actually:
MSE(θ̂ₙ) = bias²(θ̂ₙ) + var(θ̂ₙ).
Example: for X₁, …, Xₙ iid N(µ, σ²),
E[(X̄ₙ − µ)²] = σ²/n,  E[(Sₙ² − σ²)²] = 2σ⁴/(n − 1),
where Sₙ² is the unbiased sample variance.
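A simulation sketch of the decomposition for p̂ₙ (the choices of p, n, and the number of replications are mine): bias, variance, and MSE estimated from repeated samples.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 0.3, 50, 100_000

# Sampling distribution of p_hat over many repeated samples of size n
p_hats = rng.binomial(1, p, size=(reps, n)).mean(axis=1)

bias = p_hats.mean() - p
var = p_hats.var()
mse = ((p_hats - p) ** 2).mean()
print(bias, var, mse)   # bias ~ 0, var ~ p(1-p)/n = 0.0042
print(bias**2 + var)    # matches mse: MSE = bias^2 + var
```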
Statistical decision theory
There are plenty of ways to find reasonably good estimators. How can we compare them systematically? That is what statistical decision theory does: it studies the optimality of estimators.
We first need to measure the discrepancy between the true θ and θ̂n .
This is done through a loss function. Common choices are:
• absolute error loss: L(θ, θ̂ₙ) = |θ̂ₙ − θ|,
• squared error loss: L(θ, θ̂ₙ) = (θ̂ₙ − θ)² (large errors penalized more),
• Lᵖ loss: L(θ, θ̂ₙ) = |θ − θ̂ₙ|ᵖ,
• zero-one loss: L(θ, θ̂ₙ) = 0 if θ = θ̂ₙ and 1 if θ ≠ θ̂ₙ.
Risk of an estimator
We define the risk of estimator θ̂ₙ for a given loss function L as:
R(θ, θ̂ₙ) = Eθ[L(θ, θ̂ₙ)].
Example: to estimate a variance σ², consider the loss
L(σ², σ̂²) = σ̂²/σ² − 1 − log(σ̂²/σ²).
For the estimator tSₙ² (the sample variance rescaled by t > 0), the risk is R(σ², tSₙ²) = t − 1 − log t − E[log(Sₙ²/σ²)], which is minimized for t = 1.
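This minimization can be checked by Monte Carlo; a minimal sketch assuming Gaussian data, with σ², n, and the number of replications chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, n, reps = 2.0, 20, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)  # unbiased sample variance S_n^2 of each sample

def mc_risk(t):
    est = t * s2
    loss = est / sigma2 - 1 - np.log(est / sigma2)  # the loss above
    return loss.mean()                              # Monte Carlo estimate of the risk

for t in (0.8, 0.9, 1.0, 1.1, 1.2):
    print(t, mc_risk(t))  # smallest value near t = 1
```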
Asymptotic properties of the maximum likelihood estimator
Asymptotic normality of the MLE
Score function: u(θ; X₁, …, Xₙ) = ∇θ ℓₙ(θ) (u ∈ ℝᵖ).
Let s(Xᵢ; θ) := ∇θ log f(Xᵢ; θ); then
u(θ; X₁, …, Xₙ) = s(X₁; θ) + ⋯ + s(Xₙ; θ),
and so the score function is a sum of n independent random variables.
Moments of s(X; θ)
The mean: E(s(X; θ)) = 0.
The covariance, i.e. the Fisher information matrix:
I(θ) := var(s(X; θ)) = −E(∇∇ᵀ log f(X; θ)).
Application to MLEs
By invariance, if θ̂ₙ is the MLE of θ then τ̂ₙ = g(θ̂ₙ) is the MLE of τ = g(θ).
We also have that √n (θ̂ₙ − θ) → N(0, I(θ̂ₙ)⁻¹) in distribution; then, by the delta method:
√n (τ̂ₙ − τ) → N(0, g′(θ̂ₙ)² I(θ̂ₙ)⁻¹) in distribution.
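A simulation sketch of the asymptotic normality for the Bernoulli MLE, where I(p) = 1/(p(1 − p)); the values of p, n, and the replication count are my choices.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 0.4, 400, 50_000

# MLE p_hat = X_bar over many replications
p_hats = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (p_hats - p)

# Empirical variance of sqrt(n)(p_hat - p) vs I(p)^{-1} = p(1-p)
print(z.var(), p * (1 - p))  # both ~0.24
```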
Example
For X₁, …, Xₙ ∼ Bern(p), consider the log-odds ψ = g(p) = log(p/(1 − p)), with MLE ψ̂ₙ = g(p̂ₙ). Since g′(p) = 1/(p(1 − p)) and I(p)⁻¹ = p(1 − p), the delta method gives
√n (ψ̂ₙ − ψ) → N(0, 1/(p̂ₙ(1 − p̂ₙ))) in distribution.
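A simulation check of this limit (p, n, and the number of replications are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, reps = 0.4, 1_000, 50_000

p_hats = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
psi_hats = np.log(p_hats / (1 - p_hats))  # plug-in MLE of the log-odds
psi = np.log(p / (1 - p))

z = np.sqrt(n) * (psi_hats - psi)
print(z.var(), 1 / (p * (1 - p)))  # empirical variance ~ 1/(p(1-p)) ~ 4.17
```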
MLE and KL divergence
Maximizing the expected log-likelihood Eθ*[log f(X; θ)] over θ is equivalent to minimizing the Kullback-Leibler divergence KL(f(·; θ*) ∥ f(·; θ)); under regularity conditions, this is what makes the MLE consistent for θ*.
Regression
Linear regression
Regression function
Regression is a method for studying the relationship between a response
variable Y and covariates X . The covariates are also called predictor
variables or features.
The function r(X) that minimizes the mean squared error E[(Y − r(X))²] is the conditional expectation, also called the regression function:
r(X₁, …, Xₚ) = E[Y | X₁, …, Xₚ].
Linear regression
In linear regression, we assume a linear form for r(X):
Yᵢ = β₀ + β₁X₁,ᵢ + ⋯ + βₚXₚ,ᵢ + ϵᵢ = Xᵢβ + ϵᵢ,
where E(ϵᵢ | Xᵢ) = 0 and var(ϵᵢ | Xᵢ) = σ² for all i (the homoscedasticity assumption).
Least squares estimator for linear regression
We observe y ∈ ℝⁿ and X ∈ ℝ^(n×(p+1)); suppose rank(X) = p + 1. The least squares estimator is β̂ = (XᵀX)⁻¹Xᵀy, which is unbiased: E(β̂ | X) = β.
If var(ϵ | X) = σ²Iₙ then
var(β̂ | X) = σ²(XᵀX)⁻¹.
Moreover, by the CLT,
β̂ ≈ N(β, σ²(XᵀX)⁻¹).
An unbiased estimator of σ² is:
σ̂² = ∥y − Xβ̂∥² / (n − p − 1).
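A NumPy sketch of these formulas on simulated data (the design, β, and σ are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + 2 covariates
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.7, size=n)                # homoscedastic errors, sigma = 0.7

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                # least squares estimate
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)                    # unbiased estimate of sigma^2
cov_hat = sigma2_hat * np.linalg.inv(X.T @ X)               # estimated var(beta_hat)
print(beta_hat, np.sqrt(sigma2_hat), np.sqrt(np.diag(cov_hat)))
```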
Gaussian errors in linear regression
If, in addition, the errors are Gaussian, ϵ | X ∼ N(0, σ²Iₙ), then β̂ is exactly (not only asymptotically) distributed as N(β, σ²(XᵀX)⁻¹), and the least squares estimator coincides with the maximum likelihood estimator of β.
Logistic regression
When Yᵢ ∈ {0, 1} is binary, we preferably use a logistic regression model. The name comes from the logistic function:
g(x) = eˣ / (1 + eˣ).
In logistic regression, the regression function is the composition of the logistic function and a linear function:
r(Xᵢ) = P(Yᵢ = 1 | Xᵢ) = g(Xᵢβ) = e^(Xᵢβ) / (1 + e^(Xᵢβ)).
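A sketch fitting this model by maximum likelihood on simulated data (one covariate; the true coefficients, sample size, and seed are assumptions of the example):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
n = 500
x = rng.normal(size=n)
beta_true = np.array([-1.0, 2.0])  # assumed (beta_0, beta_1)

def g(t):  # logistic function
    return 1.0 / (1.0 + np.exp(-t))

y = rng.binomial(1, g(beta_true[0] + beta_true[1] * x))

# Negative Bernoulli log-likelihood of (beta_0, beta_1)
def negloglik(beta):
    prob = g(beta[0] + beta[1] * x)
    return -np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))

res = minimize(negloglik, x0=np.zeros(2))
print(res.x)  # close to (-1.0, 2.0)
```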