The derivative of the log-likelihood function (4) at β̂ with respect to σ² is

$$\begin{aligned}
\frac{\mathrm{dLL}(\hat{\beta}, \sigma^2)}{\mathrm{d}\sigma^2}
&= \frac{\mathrm{d}}{\mathrm{d}\sigma^2} \left[ -\frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} (y - X\hat{\beta})^{\mathrm{T}} V^{-1} (y - X\hat{\beta}) \right] \\
&= -\frac{n}{2} \cdot \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} (y - X\hat{\beta})^{\mathrm{T}} V^{-1} (y - X\hat{\beta}) \\
&= -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} (y - X\hat{\beta})^{\mathrm{T}} V^{-1} (y - X\hat{\beta})
\end{aligned} \qquad (8)$$
and setting this derivative to zero gives the MLE for σ 2 :
$$\begin{aligned}
\frac{\mathrm{dLL}(\hat{\beta}, \hat{\sigma}^2)}{\mathrm{d}\sigma^2} &= 0 \\
0 &= -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2(\hat{\sigma}^2)^2} (y - X\hat{\beta})^{\mathrm{T}} V^{-1} (y - X\hat{\beta}) \\
\frac{n}{2\hat{\sigma}^2} &= \frac{1}{2(\hat{\sigma}^2)^2} (y - X\hat{\beta})^{\mathrm{T}} V^{-1} (y - X\hat{\beta}) \qquad (9) \\
\frac{2(\hat{\sigma}^2)^2}{n} \cdot \frac{n}{2\hat{\sigma}^2} &= \frac{2(\hat{\sigma}^2)^2}{n} \cdot \frac{1}{2(\hat{\sigma}^2)^2} (y - X\hat{\beta})^{\mathrm{T}} V^{-1} (y - X\hat{\beta}) \\
\hat{\sigma}^2 &= \frac{1}{n} (y - X\hat{\beta})^{\mathrm{T}} V^{-1} (y - X\hat{\beta})
\end{aligned}$$
Together, (7) and (9) constitute the MLE for multiple linear regression.
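To see both estimators in action, here is a minimal numerical sketch in Python (the function name, simulated data and use of NumPy are illustrative, not from the text). Equation (7) is cited but not reproduced in this excerpt; the sketch assumes its standard weighted least squares form β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y and computes σ̂² from (9):

```python
import numpy as np

def mle_linear_regression(y, X, V):
    """MLE for y = X beta + eps, eps ~ N(0, sigma^2 V)."""
    V_inv = np.linalg.inv(V)
    # beta-hat: weighted least squares estimate (assumed form of equation (7))
    beta_hat = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
    # sigma^2-hat: equation (9)
    resid = y - X @ beta_hat
    return beta_hat, (resid @ V_inv @ resid) / len(y)

# illustrative usage: n = 100 observations, p = 3 predictors, V = I_n
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(100)
beta_hat, sigma2_hat = mle_linear_regression(y, X, np.eye(100))
print(beta_hat, sigma2_hat)  # close to (1, -2, 0.5) and 1, respectively
```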
1.5.24 Maximum log-likelihood
Theorem: Consider a linear regression model (→ III/1.5.1) m with correlation structure (→ I/1.14.5) V

$$m: \; y = X\beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 V) . \qquad (1)$$
Then, the maximum log-likelihood (→ I/4.1.4) for this model is
$$\mathrm{MLL}(m) = -\frac{n}{2} \log\left(\frac{\mathrm{RSS}}{n}\right) - \frac{n}{2} \left[1 + \log(2\pi)\right] \qquad (2)$$

under uncorrelated observations (→ III/1.5.1), i.e. if V = Iₙ, and

$$\mathrm{MLL}(m) = -\frac{n}{2} \log\left(\frac{\mathrm{wRSS}}{n}\right) - \frac{n}{2} \left[1 + \log(2\pi)\right] - \frac{1}{2} \log |V| \qquad (3)$$

in the general case, i.e. if V ≠ Iₙ, where RSS is the residual sum of squares (→ III/1.5.9) and wRSS is the weighted residual sum of squares (→ III/1.5.22).
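Before the proof, a quick numerical plausibility check of (2) for the uncorrelated case V = Iₙ (a sketch only; variable names and the use of SciPy are illustrative choices): the closed-form maximum log-likelihood should match the log-density evaluated directly at the maximum likelihood estimates.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n, X = 50, rng.standard_normal((50, 2))
y = X @ np.array([0.7, -1.2]) + rng.standard_normal(n)

# MLEs under V = I_n: OLS beta-hat and sigma^2-hat = RSS / n
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_hat) ** 2)

# equation (2): MLL = -(n/2) log(RSS/n) - (n/2) [1 + log(2 pi)]
mll_formula = -n/2 * np.log(rss/n) - n/2 * (1 + np.log(2*np.pi))

# direct evaluation of the log-likelihood at the MLEs
mll_direct = multivariate_normal.logpdf(y, mean=X @ beta_hat,
                                        cov=(rss/n) * np.eye(n))
print(np.isclose(mll_formula, mll_direct))  # True
```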
Proof: The likelihood function (→ I/5.1.2) for multiple linear regression is given by (→ III/1.5.23)
where 1 − α is the confidence level and $\chi^2_{1,1-\alpha}$ is the (1 − α)-quantile of the chi-squared distribution (→ II/3.7.1) with 1 degree of freedom.
Proof: The confidence interval (→ I/3.2.1) is defined as the interval that, under infinitely repeated
random experiments (→ I/1.1.1), contains the true parameter value with a certain probability.
Let us define the likelihood ratio (→ I/4.1.6)
$$\Lambda(\phi) = \frac{p(y \mid \phi, \hat{\lambda})}{p(y \mid \hat{\phi}, \hat{\lambda})} \quad \text{for all} \quad \phi \in \Phi \qquad (4)$$
and compute the log-likelihood ratio (→ I/4.1.7)
log Λ(ϕ) = log p(y|ϕ, λ̂) − log p(y|ϕ̂, λ̂) . (5)
Wilks’ theorem states that, when comparing two statistical models with parameter spaces Θ1 and
Θ0 ⊂ Θ1 , as the sample size approaches infinity, the quantity calculated as −2 times the log-ratio of
maximum likelihoods follows a chi-squared distribution (→ II/3.7.1), if the null hypothesis is true:
$$H_0: \theta \in \Theta_0 \quad \Rightarrow \quad -2 \log \frac{\max_{\theta \in \Theta_0} p(y \mid \theta)}{\max_{\theta \in \Theta_1} p(y \mid \theta)} \sim \chi^2_{\Delta k} \quad \text{as} \quad n \to \infty \qquad (6)$$
where Δk is the difference in dimensionality between Θ₀ and Θ₁. Applied to our example in (5), we note that Θ₁ = {ϕ, ϕ̂} and Θ₀ = {ϕ}, such that Δk = 1 and Wilks' theorem implies:

$$-2 \log \Lambda(\phi) \sim \chi^2_1 . \qquad (7)$$
Using the quantile function (→ I/1.9.1) $\chi^2_{k,p}$ of the chi-squared distribution (→ II/3.7.1), a (1 − α)-confidence interval is therefore given by all values ϕ that satisfy

$$-2 \log \Lambda(\phi) \leq \chi^2_{1,1-\alpha} . \qquad (8)$$
Applying (5) and rearranging, we can evaluate
$$\begin{aligned}
-2 \left[ \log p(y \mid \phi, \hat{\lambda}) - \log p(y \mid \hat{\phi}, \hat{\lambda}) \right] &\leq \chi^2_{1,1-\alpha} \\
\log p(y \mid \phi, \hat{\lambda}) - \log p(y \mid \hat{\phi}, \hat{\lambda}) &\geq -\frac{1}{2} \chi^2_{1,1-\alpha} \qquad (9) \\
\log p(y \mid \phi, \hat{\lambda}) &\geq \log p(y \mid \hat{\phi}, \hat{\lambda}) - \frac{1}{2} \chi^2_{1,1-\alpha}
\end{aligned}$$
which is equivalent to the confidence interval given by (3).
■
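The construction in (8)–(9) can be illustrated numerically for a simple case, taking ϕ to be the mean of normal data and the nuisance parameter λ (here, the standard deviation) fixed at its MLE as in (4); the model, grid and names below are illustrative choices, not part of the original proof.

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(3)
y = rng.normal(loc=2.0, scale=1.5, size=200)

phi_hat = y.mean()   # MLE of the parameter of interest
lam_hat = y.std()    # MLE of the nuisance (ddof=0)

def loglik(phi):
    # log p(y | phi, lambda-hat), nuisance fixed at its MLE as in (4)
    return norm.logpdf(y, loc=phi, scale=lam_hat).sum()

# threshold from (9): log-likelihood may drop by at most chi2(1) quantile / 2
alpha = 0.05
threshold = loglik(phi_hat) - chi2.ppf(1 - alpha, df=1) / 2

# keep all candidate values satisfying (8); their range is the interval
grid = np.linspace(phi_hat - 1, phi_hat + 1, 2001)
mask = np.array([loglik(g) >= threshold for g in grid])
inside = grid[mask]
print(inside.min(), inside.max())  # approx. phi_hat +/- 1.96 lam_hat / sqrt(200)
```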
Sources:
• Wikipedia (2020): “Confidence interval”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-19; URL: https://en.wikipedia.org/wiki/Confidence_interval#Methods_of_derivation.
• Wikipedia (2020): “Likelihood-ratio test”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-19; URL: https://en.wikipedia.org/wiki/Likelihood-ratio_test#Definition.
• Wikipedia (2020): “Wilks’ theorem”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-19; URL: https://en.wikipedia.org/wiki/Wilks%27_theorem.
$$\mathrm{Cov}(z) = \begin{bmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{bmatrix} = I_n . \qquad (7)$$
2) Next, consider an n × n matrix A solving the equation AAᵀ = Σ. Such a matrix exists, because Σ is defined to be positive definite (→ II/4.1.1). Then, x can be represented as a linear transformation (→ II/4.1.13) of z:

$$x = Az + \mu \sim \mathcal{N}(A 0_n + \mu, \, A I_n A^{\mathrm{T}}) = \mathcal{N}(\mu, \Sigma) . \qquad (8)$$
Thus, the covariance (→ I/1.13.1) of x can be written as:
Cov(x) = Cov(Az + µ) . (9)
With the invariance of the covariance matrix under addition (→ I/1.13.14)
Cov(x + a) = Cov(x) (10)
and the scaling of the covariance matrix upon multiplication (→ I/1.13.15)
Cov(Ax) = A Cov(x) Aᵀ ,  (11)
this becomes:
$$\begin{aligned}
\mathrm{Cov}(x) &= \mathrm{Cov}(Az + \mu) \\
&\overset{(10)}{=} \mathrm{Cov}(Az) \\
&\overset{(11)}{=} A \, \mathrm{Cov}(z) \, A^{\mathrm{T}} \\
&\overset{(7)}{=} A I_n A^{\mathrm{T}} \\
&= A A^{\mathrm{T}} \\
&= \Sigma .
\end{aligned} \qquad (12)$$
■
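A short simulation makes step 2) concrete (a sketch; the Cholesky factor is one convenient choice of A with AAᵀ = Σ, and the numbers are arbitrary): draw z ∼ N(0, Iₙ), form x = Az + µ as in (8), and check that the empirical covariance approaches Σ as in (12).

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# A with A A^T = Sigma; the Cholesky factor is one such matrix
A = np.linalg.cholesky(Sigma)

# x = A z + mu with z ~ N(0, I_n), cf. equation (8)
z = rng.standard_normal((3, 100_000))
x = A @ z + mu[:, None]

# the empirical covariance should be close to Sigma, cf. equation (12)
print(np.round(np.cov(x), 2))
```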
Sources:
• Rosenfeld, Meni (2016): “Deriving the Covariance of Multivariate Gaussian”; in: StackExchange Mathematics, retrieved on 2022-09-15; URL: https://math.stackexchange.com/questions/1905977/deriving-the-covariance-of-multivariate-gaussian.
4.1.11 Differential entropy
Theorem: Let x follow a multivariate normal distribution (→ II/4.1.1)
x ∼ N (µ, Σ) . (1)
Then, the differential entropy (→ I/2.2.1) of x in nats is
$$\mathrm{h}(x) = \frac{n}{2} \ln(2\pi) + \frac{1}{2} \ln |\Sigma| + \frac{1}{2} n . \qquad (2)$$
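As a sanity check of (2), the closed-form expression can be compared with SciPy's built-in entropy of the multivariate normal, which is likewise reported in nats (a minimal sketch; the example Σ is arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
n = Sigma.shape[0]

# equation (2): h(x) = (n/2) ln(2 pi) + (1/2) ln|Sigma| + n/2
h_formula = n/2 * np.log(2*np.pi) + np.log(np.linalg.det(Sigma))/2 + n/2

# SciPy computes the same quantity in nats
h_scipy = multivariate_normal(mean=np.zeros(n), cov=Sigma).entropy()
print(np.isclose(h_formula, h_scipy))  # True
```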
2 Multivariate normal data
2.1 General linear model
2.1.1 Definition
Definition: Let Y be an n × v matrix and let X be an n × p matrix. Then, a statement asserting a
linear mapping from X to Y with parameters B and matrix-normally distributed (→ II/5.1.1) errors
E
Y = XB + E, E ∼ MN (0, V, Σ) (1)
is called a multivariate linear regression model or simply, “general linear model”.
• Y is called “data matrix”, “set of dependent variables” or “measurements”;
• X is called “design matrix”, “set of independent variables” or “predictors”;
• B are called “regression coefficients” or “weights”;
• E is called “noise matrix” or “error terms”;
• V is called “covariance across rows”;
• Σ is called “covariance across columns”;
• n is the number of observations;
• v is the number of measurements;
• p is the number of predictors.
When rows of Y correspond to units of time, e.g. subsequent measurements, V is called “temporal
covariance”. When columns of Y correspond to units of space, e.g. measurement channels, Σ is called
“spatial covariance”.
When the covariance matrix V is a scalar multiple of the n×n identity matrix, this is called a general
linear model with independent and identically distributed (i.i.d.) observations:
$$V = \lambda I_n \quad \Rightarrow \quad E \sim \mathcal{MN}(0, \lambda I_n, \Sigma) \quad \Rightarrow \quad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \lambda \Sigma) . \qquad (2)$$
Otherwise, it is called a general linear model with correlated observations.
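A minimal simulation of model (1) in the i.i.d. case (2) might look as follows (a sketch; the dimensions, Σ and λ are arbitrary illustrative choices, and each row of E is drawn as εᵢ ∼ N(0, λΣ)):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, v = 200, 3, 2   # observations, predictors, measurements
lam = 0.5             # scalar lambda in V = lambda * I_n

X = rng.standard_normal((n, p))        # design matrix
B = rng.standard_normal((p, v))        # regression coefficients
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])         # covariance across columns

# with V = lambda * I_n, the rows of E are i.i.d. N(0, lambda * Sigma), cf. (2)
E = rng.multivariate_normal(np.zeros(v), lam * Sigma, size=n)
Y = X @ B + E                          # the general linear model (1)
```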
Sources:
• Wikipedia (2020): “General linear model”; in: Wikipedia, the free encyclopedia, retrieved on 2020-03-21; URL: https://en.wikipedia.org/wiki/General_linear_model.
2.1.2 Ordinary least squares
Theorem: Given a general linear model (→ III/2.1.1) with independent observations

$$Y = XB + E, \quad E \sim \mathcal{MN}(0, \sigma^2 I_n, \Sigma) , \qquad (1)$$

the ordinary least squares (→ III/1.5.3) parameter estimates are given by

$$\hat{B} = (X^{\mathrm{T}} X)^{-1} X^{\mathrm{T}} Y . \qquad (2)$$
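A numerical sketch of (2) (illustrative names; for simplicity the errors are drawn i.i.d. per entry, i.e. with Σ proportional to the identity): the closed-form estimate coincides with NumPy's least squares solution applied to all columns of Y at once.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((500, 4))
B_true = rng.standard_normal((4, 3))
Y = X @ B_true + 0.1 * rng.standard_normal((500, 3))

# equation (2): B-hat = (X^T X)^{-1} X^T Y, solved without explicit inversion
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# the same estimate via NumPy's least squares routine
B_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(B_hat, B_lstsq))  # True
```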
Proof: Let B̂ be the ordinary least squares (→ III/1.5.3) (OLS) solution and let Ê = Y − X B̂ be
the resulting matrix of residuals. According to the exogeneity assumption of OLS, the errors have
conditional mean (→ I/1.10.1) zero
1 Probability theory
1.1 Random experiments
1.1.1 Random experiment
Definition: A random experiment is any repeatable procedure that results in one (→ I/1.2.2) out
of a well-defined set of possible outcomes.
• The set of possible outcomes is called sample space (→ I/1.1.2).
• A set of zero or more outcomes is called a random event (→ I/1.2.1).
• A function that maps from events to probabilities is called a probability function (→ I/1.5.1).
Together, sample space (→ I/1.1.2), event space (→ I/1.1.3) and probability function (→ I/1.1.4)
characterize a random experiment.
Sources:
• Wikipedia (2020): “Experiment (probability theory)”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-19; URL: https://en.wikipedia.org/wiki/Experiment_(probability_theory).
1.1.2 Sample space
Definition: Given a random experiment (→ I/1.1.1), the set of all possible outcomes from this
experiment is called the sample space of the experiment. A sample space is usually denoted as Ω and
specified using set notation.
Sources:
• Wikipedia (2021): “Sample space”; in: Wikipedia, the free encyclopedia, retrieved on 2021-11-26;
URL: https://en.wikipedia.org/wiki/Sample_space.
1.1.3 Event space
Definition: Given a random experiment (→ I/1.1.1), an event space E is any set of events, where
an event (→ I/1.2.1) is any set of zero or more elements from the sample space (→ I/1.1.2) Ω of this
experiment.
Sources:
• Wikipedia (2021): “Event (probability theory)”; in: Wikipedia, the free encyclopedia, retrieved on
2021-11-26; URL: https://en.wikipedia.org/wiki/Event_(probability_theory).
1.1.4 Probability space
Definition: Given a random experiment (→ I/1.1.1), a probability space (Ω, E, P ) is a triple con-
sisting of
• the sample space (→ I/1.1.2) Ω, i.e. the set of all possible outcomes from this experiment;
• an event space (→ I/1.1.3) E ⊆ 2^Ω, i.e. a set of subsets from the sample space, called events (→ I/1.2.1);
• a probability measure P : E → [0, 1], i.e. a function mapping from the event space (→ I/1.1.3)
to the real numbers, observing the axioms of probability (→ I/1.4.1).
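For example (a minimal instance, not part of the original definition): a single toss of a fair coin is described by the probability space with sample space Ω = {h, t}, event space E = 2^Ω = {∅, {h}, {t}, {h, t}}, and probability measure P(∅) = 0, P({h}) = P({t}) = 1/2, P({h, t}) = 1.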