Maximum Likelihood in Linear Regression

The document discusses statistical models, particularly focusing on maximum likelihood estimation (MLE) for multiple linear regression and the derivation of the log-likelihood function. It includes theorems related to confidence intervals and Wilks' theorem, which relates to the chi-squared distribution in hypothesis testing. Additionally, it defines the general linear model and ordinary least squares estimation within the context of multivariate normal data.



The derivative of the log-likelihood function (4) at β̂ with respect to σ² is

$$
\begin{split}
\frac{\mathrm{d}\mathrm{LL}(\hat{\beta}, \sigma^2)}{\mathrm{d}\sigma^2}
&= \frac{\mathrm{d}}{\mathrm{d}\sigma^2} \left[ -\frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} (y - X\hat{\beta})^\mathrm{T} V^{-1} (y - X\hat{\beta}) \right] \\
&= -\frac{n}{2} \cdot \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} (y - X\hat{\beta})^\mathrm{T} V^{-1} (y - X\hat{\beta}) \\
&= -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} (y - X\hat{\beta})^\mathrm{T} V^{-1} (y - X\hat{\beta})
\end{split} \tag{8}
$$

and setting this derivative to zero gives the MLE for σ²:

$$
\begin{split}
\frac{\mathrm{d}\mathrm{LL}(\hat{\beta}, \hat{\sigma}^2)}{\mathrm{d}\sigma^2} &= 0 \\
0 &= -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2(\hat{\sigma}^2)^2} (y - X\hat{\beta})^\mathrm{T} V^{-1} (y - X\hat{\beta}) \\
\frac{n}{2\hat{\sigma}^2} &= \frac{1}{2(\hat{\sigma}^2)^2} (y - X\hat{\beta})^\mathrm{T} V^{-1} (y - X\hat{\beta}) \\
\frac{2(\hat{\sigma}^2)^2}{n} \cdot \frac{n}{2\hat{\sigma}^2} &= \frac{2(\hat{\sigma}^2)^2}{n} \cdot \frac{1}{2(\hat{\sigma}^2)^2} (y - X\hat{\beta})^\mathrm{T} V^{-1} (y - X\hat{\beta}) \\
\hat{\sigma}^2 &= \frac{1}{n} (y - X\hat{\beta})^\mathrm{T} V^{-1} (y - X\hat{\beta})
\end{split} \tag{9}
$$

Together, (7) and (9) constitute the MLE for multiple linear regression.
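As a numerical illustration (an addition, not part of the source), both estimates can be computed directly in NumPy. Equation (7) is not reproduced in this excerpt; the sketch assumes it has the usual weighted-least-squares form β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y, and σ̂² then follows (9). The data-generating example at the bottom is purely hypothetical.

```python
import numpy as np

def mle_linear_regression(y, X, V):
    """ML estimates for y = X beta + eps, eps ~ N(0, sigma^2 V).

    Assumes the beta estimate from (7) has the weighted-least-squares form
    (X^T V^-1 X)^-1 X^T V^-1 y (not shown in this excerpt); sigma^2 follows (9).
    """
    P = np.linalg.inv(V)                        # V^-1
    beta_hat = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
    resid = y - X @ beta_hat
    sigma2_hat = (resid @ P @ resid) / len(y)   # (1/n) * weighted RSS
    return beta_hat, sigma2_hat

# toy example with V = I_n (uncorrelated observations)
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)
print(mle_linear_regression(y, X, np.eye(n)))
```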

1.5.24 Maximum log-likelihood


Theorem: Consider a linear regression model (→ III/1.5.1) m with correlation structure (→ I/1.14.5) V,

$$m : \; y = X\beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 V) . \tag{1}$$

Then, the maximum log-likelihood (→ I/4.1.4) for this model is

$$\mathrm{MLL}(m) = -\frac{n}{2} \log\left( \frac{\mathrm{RSS}}{n} \right) - \frac{n}{2} \left[ 1 + \log(2\pi) \right] \tag{2}$$

under uncorrelated observations (→ III/1.5.1), i.e. if V = Iₙ, and

$$\mathrm{MLL}(m) = -\frac{n}{2} \log\left( \frac{\mathrm{wRSS}}{n} \right) - \frac{n}{2} \left[ 1 + \log(2\pi) \right] - \frac{1}{2} \log|V| \tag{3}$$

in the general case, i.e. if V ≠ Iₙ, where RSS is the residual sum of squares (→ III/1.5.9) and wRSS
is the weighted residual sum of squares (→ III/1.5.22).
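As a sketch (an addition, not from the source), equations (2) and (3) can be evaluated numerically once the residual sum of squares of the (weighted) least-squares fit is available; the function name and interface below are illustrative.

```python
import numpy as np

def max_log_likelihood(y, X, V=None):
    """Maximum log-likelihood of a linear regression model, equations (2)/(3).

    V=None corresponds to uncorrelated observations (V = I_n); otherwise the
    weighted RSS and the -(1/2) log|V| term of equation (3) are used.
    """
    n = len(y)
    if V is None:
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        rss = np.sum((y - X @ beta_hat) ** 2)
        return -n / 2 * np.log(rss / n) - n / 2 * (1 + np.log(2 * np.pi))
    P = np.linalg.inv(V)
    beta_hat = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
    r = y - X @ beta_hat
    wrss = r @ P @ r
    return (-n / 2 * np.log(wrss / n)
            - n / 2 * (1 + np.log(2 * np.pi))
            - np.linalg.slogdet(V)[1] / 2)
```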

Proof: The likelihood function (→ I/5.1.2) for multiple linear regression is given by (→ III/1.5.23)

where 1 − α is the confidence level and $\chi^2_{1,1-\alpha}$ is the (1 − α)-quantile of the chi-squared distribution
(→ II/3.7.1) with 1 degree of freedom.

Proof: The confidence interval (→ I/3.2.1) is defined as the interval that, under infinitely repeated
random experiments (→ I/1.1.1), contains the true parameter value with a certain probability.
Let us define the likelihood ratio (→ I/4.1.6)

$$\Lambda(\phi) = \frac{p(y \mid \phi, \hat{\lambda})}{p(y \mid \hat{\phi}, \hat{\lambda})} \quad \text{for all} \quad \phi \in \Phi \tag{4}$$

and compute the log-likelihood ratio (→ I/4.1.7)

$$\log \Lambda(\phi) = \log p(y \mid \phi, \hat{\lambda}) - \log p(y \mid \hat{\phi}, \hat{\lambda}) . \tag{5}$$


Wilks’ theorem states that, when comparing two statistical models with parameter spaces Θ₁ and
Θ₀ ⊂ Θ₁, as the sample size approaches infinity, the quantity calculated as −2 times the log-ratio of
maximum likelihoods follows a chi-squared distribution (→ II/3.7.1), if the null hypothesis is true:

$$H_0 : \theta \in \Theta_0 \quad \Rightarrow \quad -2 \log \frac{\max_{\theta \in \Theta_0} p(y \mid \theta)}{\max_{\theta \in \Theta_1} p(y \mid \theta)} \sim \chi^2_{\Delta k} \quad \text{as} \quad n \to \infty \tag{6}$$

where Δk is the difference in dimensionality between Θ₀ and Θ₁. Applied to our example in (5), we
note that Θ₁ = {ϕ, ϕ̂} and Θ₀ = {ϕ}, such that Δk = 1 and Wilks’ theorem implies:

$$-2 \log \Lambda(\phi) \sim \chi^2_1 . \tag{7}$$


Using the quantile function (→ I/1.9.1) $\chi^2_{k,p}$ of the chi-squared distribution (→ II/3.7.1), a (1 − α)-confidence interval is therefore given by all values ϕ that satisfy

$$-2 \log \Lambda(\phi) \leq \chi^2_{1,1-\alpha} . \tag{8}$$


Applying (5) and rearranging, we can evaluate

$$
\begin{split}
-2 \left[ \log p(y \mid \phi, \hat{\lambda}) - \log p(y \mid \hat{\phi}, \hat{\lambda}) \right] &\leq \chi^2_{1,1-\alpha} \\
\log p(y \mid \phi, \hat{\lambda}) - \log p(y \mid \hat{\phi}, \hat{\lambda}) &\geq -\frac{1}{2} \chi^2_{1,1-\alpha} \\
\log p(y \mid \phi, \hat{\lambda}) &\geq \log p(y \mid \hat{\phi}, \hat{\lambda}) - \frac{1}{2} \chi^2_{1,1-\alpha}
\end{split} \tag{9}
$$
which is equivalent to the confidence interval given by (3).
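As an illustrative sketch (an addition, not from the source), the rule in (8) can be turned into a numerical procedure: scan candidate values of ϕ, compute the log-likelihood ratio against the MLE, and keep the values whose log-likelihood lies within ½χ²₁,₁₋α of the maximum. The example below does this for the mean of a normal sample with a known standard deviation; the simulated data and the parameter grid are assumptions for the demonstration.

```python
import numpy as np
from scipy.stats import chi2, norm

rng = np.random.default_rng(1)
y = rng.normal(loc=5.0, scale=2.0, size=50)   # simulated data; sigma treated as known
sigma = 2.0

def log_lik(mu):
    return norm.logpdf(y, loc=mu, scale=sigma).sum()

mu_hat = y.mean()                              # MLE of the mean
threshold = chi2.ppf(0.95, df=1) / 2           # (1/2) * chi^2 quantile, alpha = 0.05

grid = np.linspace(mu_hat - 3.0, mu_hat + 3.0, 2001)
inside = [mu for mu in grid if log_lik(mu) >= log_lik(mu_hat) - threshold]
print(min(inside), max(inside))                # likelihood-ratio 95% CI for mu
```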


Sources:
• Wikipedia (2020): “Confidence interval”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-19; URL: https://en.wikipedia.org/wiki/Confidence_interval#Methods_of_derivation.
• Wikipedia (2020): “Likelihood-ratio test”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-19; URL: https://en.wikipedia.org/wiki/Likelihood-ratio_test#Definition.
• Wikipedia (2020): “Wilks’ theorem”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-19;
URL: https://en.wikipedia.org/wiki/Wilks%27_theorem.

 
$$\mathrm{Cov}(z) = \begin{bmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{bmatrix} = I_n . \tag{7}$$
2) Next, consider an n × n matrix A solving the equation $AA^\mathrm{T} = \Sigma$. Such a matrix exists, because Σ is defined to be positive definite (→ II/4.1.1). Then, x can be represented as a linear transformation (→ II/4.1.13) of z:

$$x = Az + \mu \sim \mathcal{N}(A 0_n + \mu, \, A I_n A^\mathrm{T}) = \mathcal{N}(\mu, \Sigma) . \tag{8}$$


Thus, the covariance (→ I/1.13.1) of x can be written as:

Cov(x) = Cov(Az + µ) . (9)


With the invariance of the covariance matrix under addition (→ I/1.13.14)

Cov(x + a) = Cov(x) (10)


and the scaling of the covariance matrix upon multiplication (→ I/1.13.15)

Cov(Ax) = A Cov(x) Aᵀ , (11)


this becomes:

$$
\begin{split}
\mathrm{Cov}(x) &= \mathrm{Cov}(Az + \mu) \\
&\overset{(10)}{=} \mathrm{Cov}(Az) \\
&\overset{(11)}{=} A\,\mathrm{Cov}(z)\,A^\mathrm{T} \\
&\overset{(7)}{=} A I_n A^\mathrm{T} \\
&= A A^\mathrm{T} \\
&= \Sigma .
\end{split} \tag{12}
$$


Sources:
• Rosenfeld, Meni (2016): “Deriving the Covariance of Multivariate Gaussian”; in: StackExchange
Mathematics, retrieved on 2022-09-15; URL: https://math.stackexchange.com/questions/1905977/
deriving-the-covariance-of-multivariate-gaussian.

4.1.11 Differential entropy


Theorem: Let x follow a multivariate normal distribution (→ II/4.1.1)

x ∼ N (µ, Σ) . (1)
Then, the differential entropy (→ I/2.2.1) of x in nats is
$$h(x) = \frac{n}{2} \ln(2\pi) + \frac{1}{2} \ln|\Sigma| + \frac{1}{2} n . \tag{2}$$
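As an illustrative sketch (an addition, not from the source), equation (2) can be evaluated with a log-determinant; the function name and the example Σ are assumptions.

```python
import numpy as np

def mvn_entropy_nats(Sigma):
    """Differential entropy of N(mu, Sigma) in nats, following equation (2).

    The mean does not enter the formula; only the dimension n and |Sigma| do.
    """
    n = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    assert sign > 0, "Sigma must be positive definite"
    return n / 2 * np.log(2 * np.pi) + logdet / 2 + n / 2

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_entropy_nats(Sigma))
```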

2 Multivariate normal data


2.1 General linear model
2.1.1 Definition
Definition: Let Y be an n × v matrix and let X be an n × p matrix. Then, a statement asserting a
linear mapping from X to Y with parameters B and matrix-normally distributed (→ II/5.1.1) errors E,

$$Y = XB + E, \quad E \sim \mathcal{MN}(0, V, \Sigma) , \tag{1}$$

is called a multivariate linear regression model or simply a “general linear model”.
• Y is called “data matrix”, “set of dependent variables” or “measurements”;
• X is called “design matrix”, “set of independent variables” or “predictors”;
• B are called “regression coefficients” or “weights”;
• E is called “noise matrix” or “error terms”;
• V is called “covariance across rows”;
• Σ is called “covariance across columns”;
• n is the number of observations;
• v is the number of measurements;
• p is the number of predictors.
When rows of Y correspond to units of time, e.g. subsequent measurements, V is called “temporal
covariance”. When columns of Y correspond to units of space, e.g. measurement channels, Σ is called
“spatial covariance”.
When the covariance matrix V is a scalar multiple of the n × n identity matrix, this is called a general
linear model with independent and identically distributed (i.i.d.) observations:

$$V = \lambda I_n \quad \Rightarrow \quad E \sim \mathcal{MN}(0, \lambda I_n, \Sigma) \quad \Rightarrow \quad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \lambda \Sigma) . \tag{2}$$
Otherwise, it is called a general linear model with correlated observations.
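To make the roles of V (covariance across rows) and Σ (covariance across columns) concrete, here is a small simulation sketch (an addition, not part of the definition). It relies on the standard construction E = L_V Z L_Σᵀ with i.i.d. standard normal entries in Z, where L_V and L_Σ are Cholesky factors of V and Σ; all names and toy dimensions are illustrative.

```python
import numpy as np

def simulate_glm(X, B, V, Sigma, rng):
    """Draw one data matrix Y = X B + E with E ~ MN(0, V, Sigma)."""
    n, v = X.shape[0], Sigma.shape[0]
    L_V = np.linalg.cholesky(V)          # V = L_V L_V^T (covariance across rows)
    L_S = np.linalg.cholesky(Sigma)      # Sigma = L_S L_S^T (covariance across columns)
    Z = rng.standard_normal((n, v))      # i.i.d. standard normal entries
    E = L_V @ Z @ L_S.T                  # matrix-normal noise
    return X @ B + E

rng = np.random.default_rng(3)
n, p, v = 50, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix
B = rng.normal(size=(p, v))                              # regression coefficients
Y = simulate_glm(X, B, np.eye(n), np.eye(v), rng)        # i.i.d. special case, lambda = 1
print(Y.shape)                                           # (n, v)
```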

Sources:
• Wikipedia (2020): “General linear model”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-21; URL: https://en.wikipedia.org/wiki/General_linear_model.

2.1.2 Ordinary least squares


Theorem: Given a general linear model (→ III/2.1.1) with independent observations

$$Y = XB + E, \quad E \sim \mathcal{MN}(0, \sigma^2 I_n, \Sigma) , \tag{1}$$

the ordinary least squares (→ III/1.5.3) parameter estimates are given by

$$\hat{B} = (X^\mathrm{T} X)^{-1} X^\mathrm{T} Y . \tag{2}$$
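Equation (2) translates directly into NumPy. The sketch below (an addition, not from the source) solves the normal equations instead of forming (XᵀX)⁻¹ explicitly; the simulated design and coefficients are hypothetical.

```python
import numpy as np

def ols(X, Y):
    """Ordinary least squares estimate B_hat = (X^T X)^-1 X^T Y, equation (2)."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# simulated check: B_hat should be close to B
rng = np.random.default_rng(4)
n, v = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])
B = np.array([[1.0, 0.0, -1.0],
              [0.5, 2.0, 0.3]])
Y = X @ B + rng.normal(scale=0.5, size=(n, v))
print(ols(X, Y))
```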

Proof: Let B̂ be the ordinary least squares (→ III/1.5.3) (OLS) solution and let Ê = Y − X B̂ be
the resulting matrix of residuals. According to the exogeneity assumption of OLS, the errors have
conditional mean (→ I/1.10.1) zero

1 Probability theory
1.1 Random experiments
1.1.1 Random experiment
Definition: A random experiment is any repeatable procedure that results in one (→ I/1.2.2) out
of a well-defined set of possible outcomes.
• The set of possible outcomes is called sample space (→ I/1.1.2).
• A set of zero or more outcomes is called a random event (→ I/1.2.1).
• A function that maps from events to probabilities is called a probability function (→ I/1.5.1).
Together, sample space (→ I/1.1.2), event space (→ I/1.1.3) and probability function (→ I/1.1.4)
characterize a random experiment.

Sources:
• Wikipedia (2020): “Experiment (probability theory)”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-19; URL: https://en.wikipedia.org/wiki/Experiment_(probability_theory).

1.1.2 Sample space


Definition: Given a random experiment (→ I/1.1.1), the set of all possible outcomes from this
experiment is called the sample space of the experiment. A sample space is usually denoted as Ω and
specified using set notation.

Sources:
• Wikipedia (2021): “Sample space”; in: Wikipedia, the free encyclopedia, retrieved on 2021-11-26;
URL: https://en.wikipedia.org/wiki/Sample_space.

1.1.3 Event space


Definition: Given a random experiment (→ I/1.1.1), an event space E is any set of events, where
an event (→ I/1.2.1) is any set of zero or more elements from the sample space (→ I/1.1.2) Ω of this
experiment.

Sources:
• Wikipedia (2021): “Event (probability theory)”; in: Wikipedia, the free encyclopedia, retrieved on
2021-11-26; URL: https://en.wikipedia.org/wiki/Event_(probability_theory).

1.1.4 Probability space


Definition: Given a random experiment (→ I/1.1.1), a probability space (Ω, E, P) is a triple consisting of
• the sample space (→ I/1.1.2) Ω, i.e. the set of all possible outcomes from this experiment;
• an event space (→ I/1.1.3) E ⊆ 2^Ω, i.e. a set of subsets of the sample space, called events (→
I/1.2.1);
• a probability measure P : E → [0, 1], i.e. a function mapping from the event space (→ I/1.1.3)
to the real numbers, observing the axioms of probability (→ I/1.4.1).
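As a toy illustration (an addition, not from the source), the triple (Ω, E, P) can be written down explicitly for a fair six-sided die, taking the event space to be the full power set of Ω; the uniform probability measure is an assumption of this example.

```python
from fractions import Fraction
from itertools import combinations

# Finite probability space (Omega, E, P) for a fair six-sided die.
Omega = frozenset({1, 2, 3, 4, 5, 6})

def power_set(s):
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

E = power_set(Omega)                    # event space: all subsets of Omega

def P(event):
    """Probability measure P: E -> [0, 1], uniform over the six outcomes."""
    return Fraction(len(event), len(Omega))

assert frozenset({2, 4, 6}) in E
print(P(frozenset({2, 4, 6})))          # P("even number") = 1/2
print(P(Omega), P(frozenset()))         # P(Omega) = 1, P(empty event) = 0
```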
