
Notes

August 30, 2023


Chapter 1

Legendre Equation

1.1 Generalized Legendre equation


The generalized Legendre equation

\[
(1 - x^2)\,y'' - 2x\,y' + \left(\lambda - \frac{m^2}{1 - x^2}\right)y = 0 \tag{1.1}
\]
arises when the equation ∆u = f (ρ)u is solved with separation of variables in
spherical coordinates. The function y(cos θ) describes the polar part of the solution
of ∆u = f (ρ)u.
The Legendre equation

\[
(1 - x^2)\,y'' - 2x\,y' + \lambda\,y = 0 \tag{1.2}
\]

is the special case when m = 0. Because 0 is an ordinary point of the equation, it


is natural to attempt a series solution. We describe the solution of the equation as
\[
y(x) = \sum_{n=0}^{+\infty} c_n x^n \tag{1.3}
\]

where the c_n are unknown coefficients to be determined by the boundary conditions.


If we substitute eq. (1.3) into eq. (1.2) we obtain
\[
(1 - x^2)\sum_{n=2}^{+\infty} c_n\, n(n-1)\,x^{n-2} - 2x\sum_{n=1}^{+\infty} c_n\, n\,x^{n-1} + \lambda\sum_{n=0}^{+\infty} c_n\, x^n = 0 \tag{1.4}
\]

1.2 Even Numbered Coefficients, l Even


\[
c_{k+2} = -\frac{(l - k)(l + k + 1)}{(k + 2)(k + 1)}\,c_k \tag{1.5}
\]
The first terms are:
\[
\begin{aligned}
c_2 &= -\frac{l(l+1)}{2!}\,c_0 \\
c_4 &= -\frac{(l-2)(l+3)}{(2+2)(2+1)}\,c_2 = \frac{l(l-2)(l+3)(l+1)}{4!}\,c_0 \\
c_6 &= -\frac{(l-4)(l+5)}{(4+2)(4+1)}\,c_4 = -\frac{l(l-2)(l-4)(l+5)(l+3)(l+1)}{6!}\,c_0 \\
&\;\;\vdots \\
c_{2n} &= (-1)^n\,\frac{l(l-2)\cdots\bigl(l-2(n-1)\bigr)\,(l+2n-1)(l+2n-3)\cdots(l+1)}{(2n)!}\,c_0
\end{aligned}
\]
where we set k = 2n because k is an even number. From the expanded coefficient we now derive the closed formula for the Legendre polynomial. Multiplying and dividing by suitable factorials and using
\[
l(l-2)\cdots\bigl(l-2(n-1)\bigr) = 2^n\,\frac{\bigl(\tfrac{l}{2}\bigr)!}{\bigl(\tfrac{l}{2}-n\bigr)!},
\qquad
(l+2n-1)(l+2n-3)\cdots(l+1) = \frac{(l+2n)!\,\bigl(\tfrac{l}{2}\bigr)!}{l!\;2^n\,\bigl(\tfrac{l}{2}+n\bigr)!},
\]
we obtain
\[
c_{2n} = (-1)^n\,\frac{(l+2n)!\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2}{(2n)!\;l!\;\bigl(\tfrac{l}{2}+n\bigr)!\;\bigl(\tfrac{l}{2}-n\bigr)!}\,c_0 \tag{1.6}
\]
Therefore
\[
c_{2n} = (-1)^n\,\frac{(l+2n)!\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2}{(2n)!\;l!\;\bigl(\tfrac{l}{2}+n\bigr)!\;\bigl(\tfrac{l}{2}-n\bigr)!}\,c_0 \tag{1.7}
\]

Now, for l even, the series terminates: the terms are non-zero only up to 2n = l, that is, the index n assumes all the integer values between 0 and l/2. Therefore, we can substitute the index 2n with l − 2k, i.e. we count the terms backwards. With this change eq. (1.7) becomes


\[
\begin{aligned}
c_{l-2k} &= (-1)^{\frac{l-2k}{2}}\,\frac{(2l-2k)!\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2}{(l-2k)!\;l!\;\bigl(\tfrac{l}{2}+\tfrac{l-2k}{2}\bigr)!\;\bigl(\tfrac{l}{2}-\tfrac{l-2k}{2}\bigr)!}\,c_0 \\
&= (-1)^{\frac{l}{2}}(-1)^{-k}\,\frac{(2l-2k)!\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2}{(l-2k)!\;l!\;(l-k)!\;k!}\,c_0 \\
&= (-1)^{k}\,\frac{(2l-2k)!}{2^l\,(l-2k)!\,(l-k)!\,k!}\;(-1)^{\frac{l}{2}}\,\frac{2^l\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2}{l!}\,c_0
\end{aligned} \tag{1.8}
\]
By choosing
\[
c_0 = (-1)^{\frac{l}{2}}\,\frac{l!}{2^l\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2} \tag{1.9}
\]
we obtain
\[
P_l(x) = \sum_{k=0}^{l/2} (-1)^k\,\frac{(2l-2k)!}{2^l\,(l-2k)!\,(l-k)!\,k!}\;x^{l-2k} \tag{1.10}
\]
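As a quick sanity check of eq. (1.10), the short Python sketch below evaluates the closed-form sum for a few even values of l and compares it against scipy.special.eval_legendre. The sum is taken directly from eq. (1.10); the use of numpy/scipy and the test grid are choices made only for this check.

```python
# Numerical check of eq. (1.10) for even l (a sketch; assumes numpy and scipy are available).
from math import factorial

import numpy as np
from scipy.special import eval_legendre  # reference Legendre polynomials


def legendre_sum(l, x):
    """P_l(x) from the closed-form sum in eq. (1.10), l even."""
    total = 0.0
    for k in range(l // 2 + 1):
        coeff = (-1) ** k * factorial(2 * l - 2 * k) / (
            2 ** l * factorial(l - 2 * k) * factorial(l - k) * factorial(k)
        )
        total += coeff * x ** (l - 2 * k)
    return total


x = np.linspace(-1.0, 1.0, 7)
for l in (0, 2, 4, 6):
    assert np.allclose(legendre_sum(l, x), eval_legendre(l, x))
```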
Chapter 2

Statistics

2.1 Definitions
Probability Space: is a measure space such that the measure of the whole space is
equal to one. It can be represented as (Ω, F , P) where:

• Ω is the sample space (an arbitrary non-empty set)

• F ⊆ 2^Ω is a set of subsets of Ω, called "events", that forms a σ-algebra, i.e.

  – Ω ∈ F
  – F is closed under complements: if A ∈ F then (Ω \ A) ∈ F
  – F is closed under countable unions: if A_i ∈ F for i ∈ N then (∪_i A_i) ∈ F
  – F is closed under countable intersections: if A_i ∈ F for i ∈ N then (∩_i A_i) ∈ F

• P is countably additive: if {A_i} ⊆ F is a countable collection of pairwise disjoint sets, then P(⨆_i A_i) = Σ_i P(A_i), where ⨆ denotes the disjoint union

• the measure of the space is equal to 1: P(Ω) = 1

Random Variable: given a probability space (Ω, F, P) one can define a random variable as a function X : Ω → R which is measurable, in the sense that the preimage of any Borel set B ⊆ R belongs to F. The interpretation is that if ω is an experiment, then X(ω) measures an observable quantity of the experiment. A random variable possesses an expectation value E[X]. With the expectation it is possible to define the notion of variance, Var(X) = E[X²] − E[X]², and of standard deviation, σ_X = √Var(X). The correlation Cor(X, Y) = Cov(X, Y)/(σ_X σ_Y) of two random variables X, Y with positive variance is a quantity that tells how much the random variable X is related to the random variable Y.


Stochastic Process: a collection of random variables {X_t}_{t∈T}, indexed by a parameter t ∈ T called "time", defines a stochastic process. With a more general index set, such as R^n, the collection of random variables is called a "random field".

2.2 Maximum Likelihood Estimator


Let y = (y_1, y_2, ..., y_n)^T be a vector of iid random variables from one family of distributions on R^n, indexed by a p-dimensional parameter θ = (θ_1, θ_2, ..., θ_p)^T, where θ ∈ Θ ⊂ R^p and p ≤ n. Typically, we are interested in estimating parametric
models of the form
y ∼ f (θ, y) (2.1)
where θ is a vector of parameters and f is some specific functional form. That is
to say, given y we want to make inferences about the value of θ.
Note that the problem that we face is the opposite of the typical probability
problem. A typical probability problem is to know something about the distribution
of y (the outcomes) given the parameters of your model (θ). In this case we want
to know p(Data|Model), or rather we want to know f (y|θ).
However, in our case we have the data but want to learn about the model, specif-
ically the model’s parameters. In other words, we want to know the distribution of
the unknown parameters conditional on the observed data, i.e. p(Model|Data); this is called the "inverse probability problem".

2.2.1 Bayes’s Theorem


Recall the following identities

p(θ, y) = p(θ)p(y|θ) (2.2)


p(θ, y) = p(y)p(θ|y) (2.3)

therefore the conditional density p(θ|y) is
\[
p(\theta|y) = \frac{p(\theta, y)}{p(y)} = \frac{p(\theta)\,p(y|\theta)}{p(y)} \tag{2.4}
\]
Note that the denominator, p(y), is just a function of the data. Since it only makes
sense to compare these conditional densities for the same data, we can essentially
ignore the denominator. This means that we can rewrite eq. (2.4) in its more famil-
iar form:
p(θ|y) ∝ p(θ)p(y|θ) (2.5)
where p(y) is the constant of proportionality, p(θ) is the prior density of θ, p(y|θ) is
the likelihood, and p(θ|y) is the posterior density of θ. The likelihood is the sample
information that transforms a prior into a posterior density of θ.
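As a purely numerical illustration of eq. (2.5), the sketch below evaluates prior × likelihood on a grid and normalises it. The Beta(2, 2) prior, the binomial likelihood and the data (7 successes out of 10) are invented for the example and are not part of the derivation above.

```python
# Grid illustration of eq. (2.5): posterior is proportional to prior times likelihood.
import numpy as np
from scipy.stats import beta, binom

theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter
prior = beta.pdf(theta, 2, 2)            # p(theta), an invented Beta(2, 2) prior
likelihood = binom.pmf(7, 10, theta)     # p(y | theta), 7 successes in 10 trials (invented data)
unnormalised = prior * likelihood        # eq. (2.5), up to the constant p(y)

dtheta = theta[1] - theta[0]
posterior = unnormalised / (unnormalised.sum() * dtheta)   # normalise numerically

print("posterior mean:", (theta * posterior).sum() * dtheta)  # close to (2 + 7)/(2 + 2 + 10) ~ 0.64
```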

It is important to note that the prior, p(θ), is fixed before our observations and
so can be treated as invariant to our problem. This means that we can rewrite
eq. (2.5) as
\[
p(\theta|y) \propto k(y)\,p(y|\theta) \tag{2.6}
\]
where k(y) = p(θ)/p(y) is an unknown function of the data. Since k(y) is not a function of θ, it is treated as an unknown positive constant. Put differently, for a given set of observed data, k(y) remains the same over all possible hypothetical values of θ.

2.3 Likelihood
Without knowing the prior, or making assumptions about the prior, we cannot calculate the inverse probability in eq. (2.6). This is the point where the notion of likelihood and the Likelihood Axiom were introduced. The likelihood is defined as

L(θ|y) = k(y)p(y|θ) ∝ p(y|θ) (2.7)

The likelihood is proportional to the probability of observing the data, treating the
parameters of the distribution as variables and the data as fixed. The advantage of
likelihood is that it can be calculated from a traditional probability, p(y|θ), whereas
an inverse probability cannot be calculated in any way. Note that we can only
compare likelihoods for the same set of data and the same prior.
The best estimator θ̂, is whatever value of θ maximizes

L(θ|y) = p(y|θ) (2.8)

In effect, we are looking for the θ̂ that maximizes the likelihood of observing our sample. Because of the proportional relationship, the θ̂ that maximizes L(θ|y) will also maximize p(θ|y), i.e. the posterior probability of θ given the data. This is what we wanted from the beginning.
Since the logarithm is a monotonically increasing function and the likelihood
is a non-negative function, they have the maximum at the same position. The
logarithm of the likelihood is commonly used to calculate the maximum of L.
For a ML estimator, the information matrix is defined as the negative of the expected value of the Hessian of the likelihood with respect to the parameters:
\[
I(\theta) = -E[H(\theta)] \tag{2.9}
\]
where
\[
H_{ij}(\theta) = \frac{\partial^2 L}{\partial\theta_i\,\partial\theta_j} \tag{2.10}
\]

The variance of an ML estimator θ̂_ML is calculated by the inverse of the information matrix:
\[
\operatorname{var}(\hat\theta) = [I(\theta)]^{-1} = -\bigl(E[H(\theta)]\bigr)^{-1}
= -\left(E\!\left[\frac{\partial^2 L}{\partial\theta_i\,\partial\theta_j}\right]\right)^{-1} \tag{2.11}
\]

This is in fact a lower bound, as stated by the Cramér-Rao theorem:
\[
\operatorname{var}(\hat\theta) \ge -\bigl(E[H(\theta)]\bigr)^{-1} \tag{2.12}
\]

This means that any unbiased estimator that achieves this lower bound is efficient
and no better unbiased estimator is possible. The inverse of the information matrix
for MLE is exactly the same as the Cramer-Rao lower bound. This means that
MLE is efficient.
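The following sketch illustrates the idea on a normal sample with invented data: the negative log-likelihood is minimised numerically and the variance of the estimator is approximated through the inverse Hessian returned by the optimiser, in the spirit of eq. (2.11). The BFGS inverse Hessian is only an approximation to [I(θ)]⁻¹, so this is a rough check rather than an exact computation.

```python
# MLE sketch for a normal sample (invented data); the variance of the estimator is
# approximated by the inverse Hessian of the negative log-likelihood (cf. eq. (2.11)).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=2.0, size=200)   # synthetic data

def negloglik(params):
    mu, log_sigma = params                      # work with log(sigma) so that sigma stays positive
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

res = minimize(negloglik, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# BFGS returns an approximation to the inverse Hessian of the objective,
# i.e. an approximation to the inverse observed information at the optimum.
print("estimates:", mu_hat, sigma_hat)
print("approx. var(mu_hat):", res.hess_inv[0, 0], "  asymptotic value sigma^2/n:", sigma_hat**2 / len(y))
```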

2.4 Ordinary Kriging


Suppose (Z_0, Z_1, ..., Z_n) is a vector assumed to have a multivariate distribution of unknown mean (µ, µ, ..., µ) and known variance-covariance matrix Σ:
\[
\Sigma = \begin{pmatrix}
\operatorname{Var}(Z_0, Z_0) & \cdots & \operatorname{Cov}(Z_0, Z_n) \\
\vdots & \ddots & \vdots \\
\operatorname{Cov}(Z_n, Z_0) & \cdots & \operatorname{Var}(Z_n, Z_n)
\end{pmatrix}
\]

We observe (z1 , z2 , . . . , zn ) from this distribution and wish to predict z0 from


this information using an unbiased linear predictor.
Linear means the prediction must take the form ẑ0 = λ1 z1 + λ2 z2 + · · · + λn zn for
coefficients λi to be determined. These coefficients can depend at most on what is
known in advance: namely, the entries of Σ. This predictor can also be considered
a random variable Zˆ0 = λ1 Z1 + λ2 Z2 + · · · + λn Zn .
Unbiased means the expectation of Ẑ0 equals its (unknown) mean µ.
Writing things out gives some information about the coefficients:

µ = E[Zˆ0 ] = E[λ1 Z1 + λ2 Z2 + · · · + λn Zn ]
= λ1 E[Z1 ] + λ2 E[Z2 ] + · · · + λn E[Zn ]
= λ1 µ + · · · + λn µ
= (λ1 + · · · + λn ) µ (2.13)

Here the second equality follows from the linearity of expectation E[·]. Because this procedure is supposed to work regardless of the value of µ, evidently the coefficients have to sum to unity. Writing the coefficients in vector notation
λ = (λi )T , this can be neatly written 1 · λ = 1.

Among the set of all such unbiased linear predictors, we seek the one that deviates as little as possible from the real value, measured in the root mean square sense. This, again, is a computation. It relies on the bilinearity and symmetry of covariance, whose application is responsible for the summations in the second line:
\[
E[(\hat Z_0 - Z_0)^2] = E[(\lambda_1 Z_1 + \lambda_2 Z_2 + \cdots + \lambda_n Z_n - Z_0)^2]
= \sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j\,\Sigma_{i,j} - 2\sum_{i=1}^{n}\lambda_i\operatorname{Cov}(Z_i, Z_0) + \operatorname{Var}(Z_0, Z_0) \tag{2.14}
\]

That can be written in matrix notation as:

\[
E[(\hat Z_0 - Z_0)^2] = \lambda^T\Sigma\lambda - 2\,\lambda\cdot\operatorname{Cov}(\mathbf{Z}, Z_0) + \operatorname{Var}(Z_0, Z_0) \tag{2.15}
\]

where Cov(Z, Z0 ) = (Cov (Z1 , Z0 ) , . . . , Cov (Zn , Z0 )).


Whence the coefficients can be obtained by minimizing this quadratic form
subject to the (linear) constraint 1 · λ = 1. This is readily solved using the method
of Lagrange multipliers, yielding a linear system of equations, the ”Kriging equa-
tions.”
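A minimal numerical sketch of those kriging equations is given below, assuming the covariance matrix Σ is known exactly; the exponential covariance model and the 1-D locations are invented for the example.

```python
# Ordinary kriging sketch: solve the constrained minimisation via a Lagrange multiplier.
# The exponential covariance model and the 1-D locations are invented for this example.
import numpy as np

def cov(a, b, sill=1.0, corr_len=3.0):
    """Assumed covariance model: sill * exp(-|a - b| / corr_len)."""
    return sill * np.exp(-np.abs(a - b) / corr_len)

x = np.array([1.0, 2.5, 4.0, 7.0])      # observation locations
z = np.array([0.8, 1.1, 0.9, 1.6])      # observed values z_1..z_n
x0 = 3.0                                 # prediction location

n = len(x)
Sigma = cov(x[:, None], x[None, :])      # Cov(Z_i, Z_j)
c0 = cov(x, x0)                          # Cov(Z_i, Z_0)

# Kriging equations: [[Sigma, 1], [1^T, 0]] [lambda; m] = [c0; 1]
A = np.block([[Sigma, np.ones((n, 1))], [np.ones((1, n)), np.zeros((1, 1))]])
b = np.concatenate([c0, [1.0]])
sol = np.linalg.solve(A, b)
lam, m = sol[:n], sol[n]

z0_hat = lam @ z                                  # unbiased linear predictor
sigma2_ok = cov(x0, x0) - lam @ c0 - m            # minimised prediction variance
print("weights:", lam, "sum:", lam.sum())         # weights sum to 1
print("prediction:", z0_hat, "kriging variance:", sigma2_ok)
```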
In the application, Z is a spatial stochastic process (”random field”). This
means that for any given set of fixed (not random) locations x0 , . . . , xn , the vec-
tor of values of Z at those locations, (Z(x0 ), . . . , Z(xn )) is random with some kind
of a multivariate distribution. Write Zi = Z(xi ) and apply the foregoing analysis,
assuming the means of the process at all n + 1 locations xi are the same and assum-
ing the covariance matrix of the process values at these n + 1 locations is known
with certainty.
Let’s interpret this. Under the assumptions (including constant mean and known
covariance), the coefficients determine the minimum variance attainable by any
linear estimator. Let’s call this variance σ2OK (”OK” is for ”ordinary kriging”). It
depends solely on the matrix Σ. It tells us that if we were to repeatedly sample from
(Z0 , . . . , Zn ) and use these coefficients to predict the z0 values from the remaining
values each time, then

• On the average our predictions would be correct.

• Typically, our predictions of the z0 would deviate about σOK from the actual
values of the z0 .

Much more needs to be said before this can be applied to practical situations
like estimating a surface from punctual data: we need additional assumptions about
how the statistical characteristics of the spatial process vary from one location to
another and from one realization to another (even though, in practice, usually only
one realization will ever be available). But this exposition should be enough to
follow how the search for a ”Best” Unbiased Linear Predictor (”BLUP”) leads
straightforwardly to a system of linear equations.

By the way, kriging as usually practiced is not quite the same as least squares
estimation, because Σ is estimated in a preliminary procedure (known as ”variog-
raphy”) using the same data. That is contrary to the assumptions of this derivation,
which assumed Σ was known (and a fortiori independent of the data). Thus, at the
very outset, kriging has some conceptual and statistical flaws built into it. Thought-
ful practitioners have always been aware of this and found various creative ways to
(try to) justify the inconsistencies. (Having lots of data can really help.) Procedures
now exist for simultaneously estimating Σ and predicting a collection of values at
unknown locations. They require slightly stronger assumptions (multivariate nor-
mality) in order to accomplish this feat.

2.5 DACE
DACE stands for "Design and Analysis of Computer Experiments" and it has become a common name for a way to build a response surface for a generic process.
Suppose we have a generic output of a deterministic experiment, y(x), where y(x) is the value of the output of the experiment (a physical property, for example) at position x. In DACE this deterministic outcome is treated as the realization of a random function (i.e. a stochastic process) Y(x) described by a regression model:
\[
Y(x) = \sum_{j=1}^{N} \beta_j\,f_j(x) + Z(x) \tag{2.16}
\]

where Z(·) is a random process with zero mean and covariance between Z(x1 ) and
Z(x2 ) given by:
Cov (x1 , x2 ) = σ2 Cor(x1 , x2 ) (2.17)
where σ2 is the process variance and Cor(x1 , x2 ) is the correlation between the two
points.
Given a number M of observations y = (y(x_1), ..., y(x_M))^T we consider the linear predictor
\[
\hat y(x) = c^T(x)\cdot y \tag{2.18}
\]
where x (without indices) represents some untried position (i.e. some position not contained in the set {x_1, ..., x_M}), and c(x) = (c_1(x), ..., c_M(x))^T. The frequentist and the Bayesian approaches lead to two different ways to predict ŷ(x):

• In the frequentist viewpoint we replace y by the corresponding random quantity Y = (Y(x_1), ..., Y(x_M))^T, treat ŷ(x) as a random variable as well, and compute the mean squared error of this predictor averaged over the random process. The BLUP is obtained by choosing the vector c(x) that minimizes the mean squared error (MSE):
\[
\operatorname{MSE}[\hat y(x)] = E\bigl[c^T(x)\cdot Y - Y(x)\bigr]^2 \tag{2.19}
\]

subject to the unbiasedness constraint
\[
E\bigl[c^T(x)\,Y\bigr] = E[Y(x)] \tag{2.20}
\]

• The Bayesian approach would predict y(x) by the posterior mean:
\[
\hat y(x) = E\bigl[Y(x)\,\big|\,y\bigr] \tag{2.21}
\]

In general the Bayesian and frequentist approaches lead to different methods and results, but in the special case of a Gaussian process for Z(·) and improper uniform priors on the β's the methods and results are the same.
Let us now derive the expression for kriging. Define the vector f(x) = (f_1(x), ..., f_N(x))^T of the N functions in the regression, the M × N design matrix
\[
F = \begin{pmatrix} f^T(x_1) \\ \vdots \\ f^T(x_M) \end{pmatrix}
  = \begin{pmatrix} f_1(x_1) & \cdots & f_N(x_1) \\ \vdots & \ddots & \vdots \\ f_1(x_M) & \cdots & f_N(x_M) \end{pmatrix},
\]
the M × M matrix R of stochastic-process correlations between the Z's at the design sites,
\[
R_{i,j} = \operatorname{Cor}(x_i, x_j), \qquad 1 \le i \le M,\; 1 \le j \le M, \tag{2.22}
\]
and the vector of correlations between the design sites and the untried input x,
\[
r(x) = \bigl(\operatorname{Cor}(x_1, x), \ldots, \operatorname{Cor}(x_M, x)\bigr)^T. \tag{2.23}
\]
Starting from eq. (2.19) we obtain:
\[
\begin{aligned}
\operatorname{MSE}[\hat y(x)] &= E\bigl[c^T(x)\cdot Y - Y(x)\bigr]^2
= E\Bigl[\bigl(c^T(x)\cdot Y\bigr)^2 + Y^2(x) - 2\,Y(x)\,\bigl(c^T(x)\cdot Y\bigr)\Bigr] \\
&= E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)\,c_j(x)\,Y_i Y_j - 2\,Y(x)\sum_{i=1}^{M} c_i(x)\,Y_i + Y^2(x)\Bigr]
\end{aligned} \tag{2.24}
\]

From eq. (2.16) we can write
\[
E[Y(x)] = E\Bigl[\sum_{j=1}^{N}\beta_j f_j(x) + Z(x)\Bigr]
= \sum_{j=1}^{N}\beta_j f_j(x) + E[Z(x)]
= \sum_{j=1}^{N}\beta_j f_j(x) \tag{2.25}
\]

because of the linearity of the expectation E[·] and the fact that the f_j's are not random variables, the c_j's are constants and Z(·) is a stochastic process with zero mean. In the same way we can write, from eq. (2.16):
\[
\begin{aligned}
E[Y(x_i)\,Y(x_j)] &= E\Bigl[\Bigl(\sum_{k=1}^{N}\beta_k f_k(x_i) + Z(x_i)\Bigr)\Bigl(\sum_{h=1}^{N}\beta_h f_h(x_j) + Z(x_j)\Bigr)\Bigr] \\
&= \sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_k(x_i) f_h(x_j) + E[Z(x_i)Z(x_j)] \\
&= \sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_k(x_i) f_h(x_j) + \sigma^2\operatorname{Cor}(x_i, x_j)
\end{aligned} \tag{2.26}
\]
where the last term in the last equality follows from eq. (2.17) and from the fact that
\[
\operatorname{Cov}(Z_i, Z_j) = E\bigl[(Z_i - E[Z_i])(Z_j - E[Z_j])\bigr] = E[Z_iZ_j] - E[Z_i]\,E[Z_j] \tag{2.27}
\]
and, because we have assumed a random process with zero mean for Z(·), we eventually obtain
\[
\operatorname{Cov}(Z_i, Z_j) = E[Z_iZ_j] \tag{2.28}
\]
Coming back to eq. (2.24) we can now write:
\[
\begin{aligned}
\operatorname{MSE}[\hat y(x)] ={}& E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)c_j(x)\,Y_iY_j - 2\,Y(x)\sum_{i=1}^{M} c_i(x)\,Y_i + Y^2(x)\Bigr] \\
={}& E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)c_j(x)\,f_k(x_i)f_h(x_j)\Bigr]
+ E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)c_j(x)\,Z(x_i)Z(x_j)\Bigr] \\
&- 2\,E\Bigl[\sum_{i=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)\,f_k(x_i)f_h(x)\Bigr]
- 2\,E\Bigl[\sum_{i=1}^{M} c_i(x)\,Z(x_i)Z(x)\Bigr] \\
&+ E\Bigl[\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_k(x)f_h(x)\Bigr] + E[Z(x)Z(x)]
\end{aligned} \tag{2.29}
\]

To further simplify the equation we must consider the unbiasedness constraint F^T c(x) = f(x), which allows us to write:
\[
\begin{aligned}
E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)c_j(x)\,f_k(x_i)f_h(x_j)\Bigr]
&= \sum_{i=1}^{M}\sum_{j=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)c_j(x)\,f_k(x_i)f_h(x_j) \\
&= \Bigl(\sum_{k=1}^{N}\beta_k\sum_{i=1}^{M} c_i(x)f_k(x_i)\Bigr)\Bigl(\sum_{h=1}^{N}\beta_h\sum_{j=1}^{M} c_j(x)f_h(x_j)\Bigr) \\
&= \Bigl(\sum_{k=1}^{N}\beta_k f_k(x)\Bigr)\Bigl(\sum_{h=1}^{N}\beta_h f_h(x)\Bigr)
= \sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_k(x)f_h(x)
\end{aligned} \tag{2.30}
\]

and
\[
E\Bigl[\sum_{i=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)\,f_k(x_i)f_h(x)\Bigr]
= \sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_h(x)\sum_{i=1}^{M} c_i(x)f_k(x_i)
= \sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_h(x)f_k(x) \tag{2.31}
\]

Therefore, from eq. (2.29),
\[
E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)c_j(x)\,f_k(x_i)f_h(x_j)\Bigr]
- 2\,E\Bigl[\sum_{i=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)\,f_k(x_i)f_h(x)\Bigr]
+ E\Bigl[\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_h(x)f_k(x)\Bigr] = 0 \tag{2.32}
\]

and
\[
\begin{aligned}
\operatorname{MSE}[\hat y(x)]
&= E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)c_j(x)\,Z(x_i)Z(x_j)\Bigr] - 2\,E\Bigl[\sum_{i=1}^{M} c_i(x)\,Z(x_i)Z(x)\Bigr] + E[Z(x)Z(x)] \\
&= \sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)c_j(x)\,E[Z(x_i)Z(x_j)] - 2\sum_{i=1}^{M} c_i(x)\,E[Z(x_i)Z(x)] + E[Z(x)Z(x)] \\
&= \sigma^2\sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)c_j(x)\operatorname{Cor}(x_i, x_j) - 2\sigma^2\sum_{i=1}^{M} c_i(x)\operatorname{Cor}(x_i, x) + \sigma^2\operatorname{Cor}(x, x) \\
&= \sigma^2\bigl(1 + c^T R\,c - 2\,c^T r\bigr)
\end{aligned} \tag{2.33}
\]

To minimize eq. (2.33) subject to the condition eq. (2.20) we use the method of Lagrange multipliers, defining the function Λ(x):
\[
\Lambda(x) = \operatorname{MSE}[\hat y(x)] + \lambda(x)\cdot\bigl(F^T c(x) - f(x)\bigr) \tag{2.34}
\]
where λ(x) is the vector of Lagrange multipliers.

A property predicted at an untried position comes with an error. Therefore we should have:
\[
Q^R = \hat Q + s(x^\star) \tag{2.35}
\]
where Q is the charge, the superscript R means that it is the real value, \hat Q is the predicted value and s(x^\star) is the error at the unknown position x^\star. The error is given by the square root of the MSE (mean squared error):
\[
s(x^\star) = \sqrt{MSE} \tag{2.36}
\]
where
\[
MSE = s(x^\star)^2 = \sigma^2\left(1 - r^T R^{-1} r + \frac{\bigl(1 - r^T R^{-1} r\bigr)^2}{\mathbf{1}^T R^{-1}\mathbf{1}}\right) \tag{2.37}
\]

The derivative of the property (using the notation of Mills and Popelier (JCTC 2014)) can be calculated as:
\[
\frac{\partial Q}{\partial\alpha_i^\Omega} = \frac{\partial \hat Q}{\partial\alpha_i^\Omega} + \frac{\partial s}{\partial\alpha_i^\Omega} \tag{2.38}
\]
The derivative of \hat Q is described in the paper; here we derive the derivative of s, which can be written as:
\[
\frac{\partial s}{\partial\alpha_i^\Omega} = \sum_{k=1}^{N_f}\frac{\partial s}{\partial f_k}\,\frac{\partial f_k}{\partial\alpha_i^\Omega} \tag{2.39}
\]

The only unknown term is ∂s/∂f_k. The derivative of s with respect to the features can be written as:
\[
\begin{aligned}
\frac{\partial s}{\partial f_k}
&= -\sigma^2\left[\frac{\partial r^T}{\partial f_k}R^{-1}r + r^T R^{-1}\frac{\partial r}{\partial f_k}
+ 2\,\frac{r^T R^{-1} r}{\mathbf{1}^T R^{-1}\mathbf{1}}\left(\frac{\partial r^T}{\partial f_k}R^{-1}r + r^T R^{-1}\frac{\partial r}{\partial f_k}\right)\right] \\
&= -\sigma^2\left(2\,\frac{r^T R^{-1} r}{\mathbf{1}^T R^{-1}\mathbf{1}} + 1\right)\left(\frac{\partial r^T}{\partial f_k}R^{-1}r + r^T R^{-1}\frac{\partial r}{\partial f_k}\right)
\end{aligned} \tag{2.40}
\]

The force can be written as:
\[
F_i^\Omega = \sum_{AB}\left[\frac{\partial\bigl(\hat Q_A + s_A\bigr)}{\partial\alpha_i^\Omega}\,T_{AB}\,\bigl(\hat Q_B + s_B\bigr)
+ \bigl(\hat Q_A + s_A\bigr)\frac{\partial T_{AB}}{\partial\alpha_i^\Omega}\bigl(\hat Q_B + s_B\bigr)
+ \bigl(\hat Q_A + s_A\bigr)\,T_{AB}\,\frac{\partial\bigl(\hat Q_B + s_B\bigr)}{\partial\alpha_i^\Omega}\right] \tag{2.41}
\]
which results in:
\[
\begin{aligned}
&\frac{\partial\bigl(\hat Q_A + s_A\bigr)}{\partial\alpha_i^\Omega}\,T_{AB}\,\bigl(\hat Q_B + s_B\bigr)
+ \bigl(\hat Q_A + s_A\bigr)\frac{\partial T_{AB}}{\partial\alpha_i^\Omega}\bigl(\hat Q_B + s_B\bigr)
+ \bigl(\hat Q_A + s_A\bigr)\,T_{AB}\,\frac{\partial\bigl(\hat Q_B + s_B\bigr)}{\partial\alpha_i^\Omega} \\
&\quad= \frac{\partial\hat Q_A}{\partial\alpha_i^\Omega}\,T_{AB}\,\bigl(\hat Q_B + s_B\bigr)
+ \frac{\partial s_A}{\partial\alpha_i^\Omega}\,T_{AB}\,\bigl(\hat Q_B + s_B\bigr)
+ \bigl(\hat Q_A + s_A\bigr)\frac{\partial T_{AB}}{\partial\alpha_i^\Omega}\bigl(\hat Q_B + s_B\bigr) \\
&\qquad+ \bigl(\hat Q_A + s_A\bigr)\,T_{AB}\,\frac{\partial\hat Q_B}{\partial\alpha_i^\Omega}
+ \bigl(\hat Q_A + s_A\bigr)\,T_{AB}\,\frac{\partial s_B}{\partial\alpha_i^\Omega}
\end{aligned} \tag{2.42}
\]

2.6 Log Likelihood


The kriging model has 2k + 2 parameters: µ, σ², θ_1, θ_2, ..., θ_k, p_1, p_2, ..., p_k. The parameters are chosen by maximising the likelihood of the sample. Let y = (y_1, ..., y_n)^T be the n-vector of observed function values, R the n × n matrix whose (i, j) entry is Corr(ε(x_i), ε(x_j)), and 1 an n-vector of ones. The likelihood function is:
\[
L(\mu, \sigma^2, \theta_1, \ldots, \theta_k, p_1, \ldots, p_k)
= \frac{1}{\bigl((2\pi\sigma^2)^n\,|R|\bigr)^{1/2}}\;
e^{-\frac{(y - \mathbf{1}\mu)^T R^{-1}(y - \mathbf{1}\mu)}{2\sigma^2}} \tag{2.43}
\]
Given the correlation parameters θ_h and p_h for h = 1, ..., k, we can solve for the values of µ and σ² that maximize the likelihood function in closed form:
\[
\hat\mu = \frac{\mathbf{1}^T R^{-1} y}{\mathbf{1}^T R^{-1}\mathbf{1}} \tag{2.44}
\]
and
\[
\hat\sigma^2 = \frac{(y - \mathbf{1}\hat\mu)^T R^{-1}(y - \mathbf{1}\hat\mu)}{n} \tag{2.45}
\]
Substituting eqs. (2.44) and (2.45) into eq. (2.43) we get the so-called 'concentrated log-likelihood' function, which depends only upon the parameters θ_h and p_h for h = 1, ..., k. The function that we have to maximize is therefore:
\[
L(\hat\mu, \hat\sigma^2, \theta_1, \ldots, \theta_k, p_1, \ldots, p_k)
= \frac{1}{\bigl((2\pi\hat\sigma^2)^n\,|R|\bigr)^{1/2}}\;e^{-n/2} \tag{2.46}
\]
Because the logarithm is a monotonically increasing function, optimizing the likelihood is the same as maximising the log of the likelihood. Therefore, the function to maximize is (after ignoring the constant terms):
\[
-\frac{n}{2}\log\hat\sigma^2 - \frac{1}{2}\log|R| \tag{2.47}
\]
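A compact numerical sketch of eqs. (2.44)-(2.47) is given below for a one-dimensional sample with correlation R_ij = exp(−θ|x_i − x_j|^p); the toy data, the fixed p = 2 and the small nugget added for numerical stability are choices made only for the example.

```python
# Concentrated log-likelihood sketch, eqs. (2.44)-(2.47), for a 1-D sample with
# R_ij = exp(-theta * |x_i - x_j|^p) and p fixed to 2 (a choice made for the example).
import numpy as np

x = np.array([0.0, 0.3, 0.7, 1.2, 2.0, 2.4])          # sample sites (invented)
y = np.sin(2.0 * x) + 0.1 * np.cos(5.0 * x)           # observed function values (invented)
n = len(y)
one = np.ones(n)

def concentrated_loglik(theta, p=2.0, nugget=1e-10):
    R = np.exp(-theta * np.abs(x[:, None] - x[None, :]) ** p) + nugget * np.eye(n)
    Rinv_y = np.linalg.solve(R, y)
    Rinv_1 = np.linalg.solve(R, one)
    mu = (one @ Rinv_y) / (one @ Rinv_1)               # eq. (2.44)
    resid = y - mu * one
    sigma2 = resid @ np.linalg.solve(R, resid) / n     # eq. (2.45)
    sign, logdet = np.linalg.slogdet(R)
    return -0.5 * n * np.log(sigma2) - 0.5 * logdet    # eq. (2.47), constants dropped

thetas = np.linspace(0.1, 20.0, 200)
best = max(thetas, key=concentrated_loglik)
print("theta maximising the concentrated log-likelihood:", best)
```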
Before calculating the derivative we consider two properties of the derivative of a matrix:
\[
\frac{\partial\log|R|}{\partial\gamma} = \operatorname{tr}\!\left(R^{-1}\frac{\partial R}{\partial\gamma}\right) \tag{2.48}
\]
\[
\frac{\partial R^{-1}}{\partial\gamma} = -R^{-1}\frac{\partial R}{\partial\gamma}R^{-1} \tag{2.49}
\]
where ∂R/∂γ is the matrix of elementwise derivatives.

2.6.1 Proof of eq. (2.49)

If I is the identity matrix, eq. (2.49) can be derived as follows:
\[
\frac{\partial I}{\partial\gamma} = 0 = \frac{\partial(R^{-1}R)}{\partial\gamma}
= \frac{\partial R^{-1}}{\partial\gamma}R + R^{-1}\frac{\partial R}{\partial\gamma}
\;\Longrightarrow\;
\frac{\partial R^{-1}}{\partial\gamma} = -R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}
\]
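A quick finite-difference check of eqs. (2.48) and (2.49) is sketched below; the parametrisation R(γ) used for the check is invented.

```python
# Finite-difference check of eqs. (2.48) and (2.49); the parametrisation R(gamma) is invented.
import numpy as np

def R_of(gamma):
    pts = np.array([0.0, 0.4, 1.1])
    d = np.abs(np.subtract.outer(pts, pts))
    return np.exp(-gamma * d**2)

gamma, h = 1.3, 1e-6
R = R_of(gamma)
dR = (R_of(gamma + h) - R_of(gamma - h)) / (2 * h)          # elementwise dR/dgamma

lhs_logdet = (np.linalg.slogdet(R_of(gamma + h))[1] - np.linalg.slogdet(R_of(gamma - h))[1]) / (2 * h)
print(np.isclose(lhs_logdet, np.trace(np.linalg.solve(R, dR)), rtol=1e-4))   # eq. (2.48)

lhs_inv = (np.linalg.inv(R_of(gamma + h)) - np.linalg.inv(R_of(gamma - h))) / (2 * h)
rhs_inv = -np.linalg.inv(R) @ dR @ np.linalg.inv(R)
print(np.allclose(lhs_inv, rhs_inv, atol=1e-4))                              # eq. (2.49)
```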

2.6.2 First derivative of the Loglikelihood function

The derivative of the log-likelihood function can be calculated term by term. In the first part we use a variable γ to indicate indifferently θ_j or p_j; afterwards we write the derivative explicitly with respect to θ_j or p_j.
\[
\begin{aligned}
\frac{\partial\ln\sigma^2}{\partial\gamma}
&= \frac{1}{\sigma^2}\frac{\partial\sigma^2}{\partial\gamma}
= \frac{1}{n\sigma^2}\frac{\partial}{\partial\gamma}\Bigl[(y - \mathbf{1}\mu)^T R^{-1}(y - \mathbf{1}\mu)\Bigr] \\
&= \frac{1}{n\sigma^2}\left[\frac{\partial(y - \mathbf{1}\mu)^T}{\partial\gamma}R^{-1}(y - \mathbf{1}\mu)
+ (y - \mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}(y - \mathbf{1}\mu)
+ (y - \mathbf{1}\mu)^T R^{-1}\frac{\partial(y - \mathbf{1}\mu)}{\partial\gamma}\right]
\end{aligned} \tag{2.50}
\]

Considering that y does not depend on any of the θ_j or p_j we can write
\[
\frac{\partial(y - \mathbf{1}\mu)}{\partial\gamma} = -\mathbf{1}\frac{\partial\mu}{\partial\gamma},
\qquad
\frac{\partial(y - \mathbf{1}\mu)^T}{\partial\gamma} = \left(\frac{\partial(y - \mathbf{1}\mu)}{\partial\gamma}\right)^T = -\left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T
\]
and eq. (2.50) becomes:
\[
\frac{\partial\ln\sigma^2}{\partial\gamma}
= \frac{1}{n\sigma^2}\left[-\left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T R^{-1}(y - \mathbf{1}\mu)
+ (y - \mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}(y - \mathbf{1}\mu)
- (y - \mathbf{1}\mu)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right] \tag{2.51}
\]

Now let us calculate the derivative of µ with respect to γ. Recalling eq. (2.44):
\[
\frac{\partial\mu}{\partial\gamma} = \frac{\partial}{\partial\gamma}\left(\frac{\mathbf{1}^T R^{-1} y}{\mathbf{1}^T R^{-1}\mathbf{1}}\right) \tag{2.52}
\]
Since 1 is a constant vector,
\[
\frac{\partial\mu}{\partial\gamma}
= \frac{\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\,y}{\mathbf{1}^T R^{-1}\mathbf{1}}
- \frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}
\]

From eq. (2.49) we have:
\[
\frac{\partial R^{-1}}{\partial\gamma} = -R^{-1}\frac{\partial R}{\partial\gamma}R^{-1} \tag{2.53}
\]
From eq. (2.48) we obtain the derivative of the first term in the log-likelihood function. Now everything is written as a function of the derivative of the R matrix. The last thing to do is to write explicitly the derivative of the entries of the R matrix with respect to θ_h and p_h for h = 1, ..., k.
Let us start with θ = (θ_1, ..., θ_k).
\[
\frac{\partial R_{ij}}{\partial\theta_h}
= \frac{\partial}{\partial\theta_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
= -\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.54}
\]
whereas for p = (p_1, ..., p_k) we need one more step:
\[
\frac{\partial R_{ij}}{\partial p_h}
= \frac{\partial}{\partial p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
= -\frac{\partial}{\partial p_h}\Bigl(\theta_h\left|x_h^i - x_h^j\right|^{p_h}\Bigr)\,
e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.55}
\]
By using the following identity we can write the derivative in eq. (2.55) explicitly:
\[
\theta_h\left|x_h^i - x_h^j\right|^{p_h} \equiv e^{\ln\theta_h + p_h\ln\left|x_h^i - x_h^j\right|}
\]
Therefore the derivative in eq. (2.55) is:
\[
\frac{\partial}{\partial p_h}\Bigl(\theta_h\left|x_h^i - x_h^j\right|^{p_h}\Bigr)
= \theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}
\]
and the complete derivative for p_h is:
\[
\frac{\partial R_{ij}}{\partial p_h}
= -\theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,
e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.56}
\]
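The sketch below checks eqs. (2.54) and (2.56) against central finite differences for a one-dimensional, two-point example; the sample points and parameter values are invented.

```python
# Finite-difference check of eqs. (2.54) and (2.56) for a 1-D, two-point example (invented values).
import numpy as np

xi, xj = 0.4, 1.7          # two sample points, one dimension (N_d = 1)
d = abs(xi - xj)

def R_ij(theta, p):
    return np.exp(-theta * d**p)

theta, p, h = 0.8, 1.6, 1e-6

dR_dtheta = -d**p * R_ij(theta, p)                         # eq. (2.54)
dR_dp = -theta * np.log(d) * d**p * R_ij(theta, p)         # eq. (2.56)

fd_theta = (R_ij(theta + h, p) - R_ij(theta - h, p)) / (2 * h)
fd_p = (R_ij(theta, p + h) - R_ij(theta, p - h)) / (2 * h)
print(np.isclose(dR_dtheta, fd_theta), np.isclose(dR_dp, fd_p))
```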

2.6.3 Second derivative of the Loglikelihood function


Starting from eq. (2.47) we can calculate the second derivative of the log-likelihood function term by term. In this case we have to consider the different cases:
\[
\frac{\partial^2}{\partial\theta^2},\;\frac{\partial^2}{\partial p^2} \;\longrightarrow\; \frac{\partial^2}{\partial\gamma^2},
\qquad\qquad
\frac{\partial^2}{\partial\theta\,\partial p} = \frac{\partial^2}{\partial p\,\partial\theta} \;\longrightarrow\; \frac{\partial^2}{\partial\gamma\,\partial\phi}
\]
\[
\frac{\partial^2\log|R|}{\partial\gamma^2}
= \frac{\partial}{\partial\gamma}\operatorname{tr}\!\left(R^{-1}\frac{\partial R}{\partial\gamma}\right)
= \operatorname{tr}\!\left(\frac{\partial R^{-1}}{\partial\gamma}\frac{\partial R}{\partial\gamma} + R^{-1}\frac{\partial^2 R}{\partial\gamma^2}\right)
= \operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\gamma} + R^{-1}\frac{\partial^2 R}{\partial\gamma^2}\right) \tag{2.57}
\]

and
\[
\frac{\partial^2\log|R|}{\partial\gamma\,\partial\phi}
= \frac{\partial}{\partial\gamma}\operatorname{tr}\!\left(R^{-1}\frac{\partial R}{\partial\phi}\right)
= \operatorname{tr}\!\left(\frac{\partial R^{-1}}{\partial\gamma}\frac{\partial R}{\partial\phi} + R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}\right)
= \operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi} + R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}\right) \tag{2.58}
\]

For the second term we start from the first derivative (eq. (2.50)):
\[
\frac{\partial^2\ln\sigma^2}{\partial\gamma^2}
= \frac{\partial}{\partial\gamma}\left(\frac{1}{\sigma^2}\frac{\partial\sigma^2}{\partial\gamma}\right)
= \left[\frac{\partial^2\sigma^2}{\partial\gamma^2} - \frac{1}{\sigma^2}\left(\frac{\partial\sigma^2}{\partial\gamma}\right)^2\right]\frac{1}{\sigma^2} \tag{2.59}
\]

and
\[
\frac{\partial^2\ln\sigma^2}{\partial\gamma\,\partial\phi}
= \frac{\partial}{\partial\gamma}\left(\frac{1}{\sigma^2}\frac{\partial\sigma^2}{\partial\phi}\right)
= \left[\frac{\partial^2\sigma^2}{\partial\gamma\,\partial\phi} - \frac{1}{\sigma^2}\frac{\partial\sigma^2}{\partial\gamma}\frac{\partial\sigma^2}{\partial\phi}\right]\frac{1}{\sigma^2} \tag{2.60}
\]

Note that since σ² is a scalar quantity the order of multiplication is not important; the same is not true for the derivatives of matrices and vectors. The only quantity left to calculate is the second derivative of σ²:
\[
\begin{aligned}
\frac{\partial^2(n\sigma^2)}{\partial\gamma^2}
={}& \frac{\partial}{\partial\gamma}\left[-\left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T R^{-1}(y-\mathbf{1}\mu)
+ (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}(y-\mathbf{1}\mu)
- (y-\mathbf{1}\mu)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right] \\
={}& -\left(\mathbf{1}\frac{\partial^2\mu}{\partial\gamma^2}\right)^T R^{-1}(y-\mathbf{1}\mu)
- \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T\frac{\partial R^{-1}}{\partial\gamma}(y-\mathbf{1}\mu)
+ \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\gamma} \\
&- \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T\frac{\partial R^{-1}}{\partial\gamma}(y-\mathbf{1}\mu)
+ (y-\mathbf{1}\mu)^T\frac{\partial^2 R^{-1}}{\partial\gamma^2}(y-\mathbf{1}\mu)
- (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\frac{\partial\mu}{\partial\gamma} \\
&+ \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\gamma}
- (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\frac{\partial\mu}{\partial\gamma}
- (y-\mathbf{1}\mu)^T R^{-1}\mathbf{1}\frac{\partial^2\mu}{\partial\gamma^2}
\end{aligned} \tag{2.61}
\]
It can be written in matrix form as:
\[
\frac{\partial^2(n\sigma^2)}{\partial\gamma^2} = \mathbf{1}^T K\,\mathbf{1} \tag{2.62}
\]
where K is a 3 × 3 symmetric matrix whose entries are:
\[
K_{ij} = \frac{\partial^{a^1_{ij}}(y-\mathbf{1}\mu)^T}{\partial\gamma^{a^1_{ij}}}\;
\frac{\partial^{a^2_{ij}}R^{-1}}{\partial\gamma^{a^2_{ij}}}\;
\frac{\partial^{a^3_{ij}}(y-\mathbf{1}\mu)}{\partial\gamma^{a^3_{ij}}} \tag{2.63}
\]
and where
\[
a^h_{ij} = \delta^h_i + \delta^h_j, \qquad h = 1, 2, 3, \tag{2.64}
\]
where δ^h_i is the Kronecker delta and with the convention that the 0-th derivative is the function itself.
The mixed derivative in this case is:
\[
\begin{aligned}
\frac{\partial^2(n\sigma^2)}{\partial\gamma\,\partial\phi}
={}& \frac{\partial}{\partial\gamma}\left[-\left(\mathbf{1}\frac{\partial\mu}{\partial\phi}\right)^T R^{-1}(y-\mathbf{1}\mu)
+ (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\phi}(y-\mathbf{1}\mu)
- (y-\mathbf{1}\mu)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\phi}\right] \\
={}& -\left(\mathbf{1}\frac{\partial^2\mu}{\partial\gamma\,\partial\phi}\right)^T R^{-1}(y-\mathbf{1}\mu)
- \left(\mathbf{1}\frac{\partial\mu}{\partial\phi}\right)^T\frac{\partial R^{-1}}{\partial\gamma}(y-\mathbf{1}\mu)
+ \left(\mathbf{1}\frac{\partial\mu}{\partial\phi}\right)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\gamma} \\
&- \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T\frac{\partial R^{-1}}{\partial\phi}(y-\mathbf{1}\mu)
+ (y-\mathbf{1}\mu)^T\frac{\partial^2 R^{-1}}{\partial\gamma\,\partial\phi}(y-\mathbf{1}\mu)
- (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\phi}\mathbf{1}\frac{\partial\mu}{\partial\gamma} \\
&+ \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\phi}
- (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\frac{\partial\mu}{\partial\phi}
- (y-\mathbf{1}\mu)^T R^{-1}\mathbf{1}\frac{\partial^2\mu}{\partial\gamma\,\partial\phi}
\end{aligned} \tag{2.65}
\]
It can be written in matrix form as:
\[
\frac{\partial^2(n\sigma^2)}{\partial\gamma\,\partial\phi} = \mathbf{1}^T K'\,\mathbf{1} \tag{2.66}
\]

where K' is a 3 × 3 symmetric matrix whose entries are:
\[
K'_{ij} = \frac{\partial^{a^1_{ij}}(y-\mathbf{1}\mu)^T}{\partial\gamma^{\delta^1_i}\,\partial\phi^{\delta^1_j}}\;
\frac{\partial^{a^2_{ij}}R^{-1}}{\partial\gamma^{\delta^2_i}\,\partial\phi^{\delta^2_j}}\;
\frac{\partial^{a^3_{ij}}(y-\mathbf{1}\mu)}{\partial\gamma^{\delta^3_i}\,\partial\phi^{\delta^3_j}} \tag{2.67}
\]
and where a^h_{ij} has the same meaning as in eq. (2.64).


The second derivative of the inverse of the R matrix can be obtained as:
\[
\frac{\partial^2 R^{-1}}{\partial\gamma^2} = -\frac{\partial}{\partial\gamma}\left(R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\right) \tag{2.68}
\]
\[
\begin{aligned}
&= -\left(\frac{\partial R^{-1}}{\partial\gamma}\frac{\partial R}{\partial\gamma}R^{-1}
+ R^{-1}\frac{\partial^2 R}{\partial\gamma^2}R^{-1}
+ R^{-1}\frac{\partial R}{\partial\gamma}\frac{\partial R^{-1}}{\partial\gamma}\right) \\
&= 2\,R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}
- R^{-1}\frac{\partial^2 R}{\partial\gamma^2}R^{-1}
= 2\left(R^{-1}\frac{\partial R}{\partial\gamma}\right)^2 R^{-1} - R^{-1}\frac{\partial^2 R}{\partial\gamma^2}R^{-1}
\end{aligned} \tag{2.69}
\]

and the mixed derivative of the inverse of the R matrix can be obtained as:
\[
\frac{\partial^2 R^{-1}}{\partial\gamma\,\partial\phi} = -\frac{\partial}{\partial\gamma}\left(R^{-1}\frac{\partial R}{\partial\phi}R^{-1}\right) \tag{2.70}
\]
\[
\begin{aligned}
&= -\left(\frac{\partial R^{-1}}{\partial\gamma}\frac{\partial R}{\partial\phi}R^{-1}
+ R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}R^{-1}
+ R^{-1}\frac{\partial R}{\partial\phi}\frac{\partial R^{-1}}{\partial\gamma}\right) \\
&= R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi}R^{-1}
- R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}R^{-1}
+ R^{-1}\frac{\partial R}{\partial\phi}R^{-1}\frac{\partial R}{\partial\gamma}R^{-1} \\
&= 2\,R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi}R^{-1}
- R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}R^{-1}
\end{aligned} \tag{2.71}
\]
where the last equality follows from lemma 3.



The second derivative of µ is
\[
\begin{aligned}
\frac{\partial^2\mu}{\partial\gamma^2}
&= \frac{\partial}{\partial\gamma}\left[\frac{\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\,y}{\mathbf{1}^T R^{-1}\mathbf{1}}
- \frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}\right] \\
&= \frac{\mathbf{1}^T\frac{\partial^2 R^{-1}}{\partial\gamma^2}\,y}{\mathbf{1}^T R^{-1}\mathbf{1}}
- 2\,\frac{\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\,y\right)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}
- \frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial^2 R^{-1}}{\partial\gamma^2}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}
+ 2\,\frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)^2}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^3}
\end{aligned} \tag{2.72}
\]

and the mixed counterpart is:
\[
\begin{aligned}
\frac{\partial^2\mu}{\partial\gamma\,\partial\phi}
&= \frac{\partial}{\partial\gamma}\left[\frac{\mathbf{1}^T\frac{\partial R^{-1}}{\partial\phi}\,y}{\mathbf{1}^T R^{-1}\mathbf{1}}
- \frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\phi}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}\right] \\
&= \frac{\mathbf{1}^T\frac{\partial^2 R^{-1}}{\partial\gamma\,\partial\phi}\,y}{\mathbf{1}^T R^{-1}\mathbf{1}}
- \frac{\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\phi}\,y\right)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)
+ \left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\,y\right)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\phi}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2} \\
&\quad- \frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial^2 R^{-1}}{\partial\gamma\,\partial\phi}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}
+ 2\,\frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\phi}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^3}
\end{aligned} \tag{2.73}
\]

Now that everything is written in terms of the (first or second) derivative of the R matrix, we only need to calculate explicitly the (second) derivative of the R matrix. Let us start with θ = (θ_1, ..., θ_k).

\[
\frac{\partial^2 R_{ij}}{\partial\theta_h^2}
= \frac{\partial}{\partial\theta_h}\left(-\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right)
= \left|x_h^i - x_h^j\right|^{2p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.74}
\]

whereas for p = (p_1, ..., p_k)
\[
\begin{aligned}
\frac{\partial^2 R_{ij}}{\partial p_h^2}
&= \frac{\partial}{\partial p_h}\left(-\theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right) \\
&= -\theta_h\left(\ln\left|x_h^i - x_h^j\right|\right)^2\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
+ \theta_h^2\left(\ln\left|x_h^i - x_h^j\right|\right)^2\left|x_h^i - x_h^j\right|^{2p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \\
&= \left(\theta_h\left|x_h^i - x_h^j\right|^{p_h} - 1\right)\theta_h\left(\ln\left|x_h^i - x_h^j\right|\right)^2\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
\end{aligned} \tag{2.75}
\]

To calculate the mixed derivatives we must distinguish four cases:

• θ_h p_l with h ≠ l
• θ_h p_l with h = l → θ_h p_h
• p_h p_l with h ≠ l
• θ_h θ_l with h ≠ l

θ_h p_l with h ≠ l:
\[
\frac{\partial^2 R_{ij}}{\partial\theta_h\,\partial p_l}
= \frac{\partial}{\partial p_l}\left(-\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right)
= \theta_l\,\ln\left|x_l^i - x_l^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\left|x_l^i - x_l^j\right|^{p_l}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.76}
\]
The order of derivation should be irrelevant, therefore we check eq. (2.76) by calculating the same derivative starting from the derivative with respect to p_l:
\[
\frac{\partial^2 R_{ij}}{\partial p_l\,\partial\theta_h}
= \frac{\partial}{\partial\theta_h}\left(-\theta_l\,\ln\left|x_l^i - x_l^j\right|\,\left|x_l^i - x_l^j\right|^{p_l}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right)
= \theta_l\,\ln\left|x_l^i - x_l^j\right|\,\left|x_l^i - x_l^j\right|^{p_l}\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.77}
\]

θ_h p_l with h = l:
\[
\begin{aligned}
\frac{\partial^2 R_{ij}}{\partial\theta_h\,\partial p_h}
&= \frac{\partial}{\partial p_h}\left(-\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right) \\
&= -\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
+ \theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{2p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \\
&= \left(\theta_h\left|x_h^i - x_h^j\right|^{p_h} - 1\right)\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
\end{aligned} \tag{2.78}
\]

Again we check the consistency of the derivative by inverting the order of derivation:
\[
\begin{aligned}
\frac{\partial^2 R_{ij}}{\partial p_h\,\partial\theta_h}
&= \frac{\partial}{\partial\theta_h}\left(-\theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right) \\
&= -\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
+ \theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{2p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \\
&= \left(\theta_h\left|x_h^i - x_h^j\right|^{p_h} - 1\right)\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
\end{aligned} \tag{2.79}
\]

In this case the second derivative of the R matrix is a symmetric matrix with
null principal diagonal.

p_h p_l with h ≠ l:
\[
\frac{\partial^2 R_{ij}}{\partial p_h\,\partial p_l}
= \frac{\partial}{\partial p_l}\left(-\theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right)
= \theta_h\theta_l\,\ln\left|x_h^i - x_h^j\right|\,\ln\left|x_l^i - x_l^j\right|\,\left|x_l^i - x_l^j\right|^{p_l}\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.80}
\]

θ_h θ_l with h ≠ l:
\[
\frac{\partial^2 R_{ij}}{\partial\theta_h\,\partial\theta_l}
= \frac{\partial}{\partial\theta_l}\left(-\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right)
= \left|x_h^i - x_h^j\right|^{p_h}\left|x_l^i - x_l^j\right|^{p_l}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.81}
\]

2.7 Some Theorems and properties used throughout the derivation

Theorem 1. The trace is invariant under cyclic permutations: tr(ABC) = tr(CAB) = tr(BCA) (hence it is invariant under similarity transformations).

Theorem 2. The trace of a matrix is a linear operator.

Lemma 3. The two matrices R⁻¹ ∂R/∂γ and R⁻¹ ∂R/∂φ commute.
Proof. Let's start with the second derivative of the log of the determinant of the matrix R, eq. (2.58):
\[
\frac{\partial^2\log|R|}{\partial\gamma\,\partial\phi}
= \frac{\partial}{\partial\gamma}\operatorname{tr}\!\left(R^{-1}\frac{\partial R}{\partial\phi}\right)
= \operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi} + R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}\right)
\]
but, considering the symmetry of mixed derivatives,
\[
\frac{\partial^2 f(\gamma,\phi)}{\partial\gamma\,\partial\phi} = \frac{\partial^2 f(\gamma,\phi)}{\partial\phi\,\partial\gamma},
\]
we must have
\[
\frac{\partial^2\log|R|}{\partial\gamma\,\partial\phi} = \frac{\partial^2\log|R|}{\partial\phi\,\partial\gamma}
\]
therefore:
\[
\operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi} + R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}\right)
= \operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\phi}R^{-1}\frac{\partial R}{\partial\gamma} + R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}\right)
\]
from which it follows that
\[
\operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi}\right)
= \operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\phi}R^{-1}\frac{\partial R}{\partial\gamma}\right)
\]
The equality holds if the arguments are equal, therefore we obtain:
\[
R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi} = R^{-1}\frac{\partial R}{\partial\phi}R^{-1}\frac{\partial R}{\partial\gamma}
\qquad\text{or}\qquad
\left[R^{-1}\frac{\partial R}{\partial\gamma},\,R^{-1}\frac{\partial R}{\partial\phi}\right] = 0. \qquad\square
\]
Chapter 3

Thermodynamics

3.1 Helmholtz Free Energy


Starting with the first law:
dU = δW + δq (3.1)
at constant temperature and volume we have

dU = δq (3.2)

which, by using the second law of thermodynamics, leads for a generic transformation to:
dU ≤ T dS (3.3)
Because the temperature is constant we can write:

d(U − T S ) ≤ 0 (3.4)

We can define a new state function A called Helmholtz free energy as A = U − T S .

Appendices

Appendix A

Miscellanea

A.1 Shift invariance of the convolution of periodic functions


It is easily shown that for f and g, both 2π-periodic functions on [−π, π], we have
\[
(f * g)(x) = \int_{-\pi}^{\pi} f(x - y)\,g(y)\,dy = \int_{-\pi}^{\pi} f(z)\,g(x - z)\,dz = (g * f)(x),
\]
by using the substitution z = x − y.


The point is to check the bounds of the integral: since y ranges from −π to π, z = x − y ranges from x + π down to x − π. Therefore:
\[
\int_{-\pi}^{\pi} f(x - y)\,g(y)\,dy
= -\int_{x+\pi}^{x-\pi} f(z)\,g(x - z)\,dz
= \int_{x-\pi}^{x+\pi} f(z)\,g(x - z)\,dz
= \int_{-\pi}^{\pi} f(z)\,g(x - z)\,dz
\]
In the second-to-last step we swap the two bounds of the integral by changing the sign. In the final step both bounds are shifted by −x, which does not change the value because we are integrating a 2π-periodic function over an interval of length 2π.
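A small numerical check of the commutativity (f ∗ g)(x) = (g ∗ f)(x) is sketched below; the two 2π-periodic functions and the quadrature grid are chosen only for the example.

```python
# Numerical check that (f * g)(x) = (g * f)(x) for 2*pi-periodic functions (example functions).
import numpy as np

f = lambda t: np.sin(t) + 0.3 * np.cos(2 * t)
g = lambda t: np.exp(np.cos(t))                 # both functions are 2*pi-periodic

y = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
dy = y[1] - y[0]

def conv(a, b, x):
    return np.sum(a(x - y) * b(y)) * dy         # Riemann sum over one period

for x in (0.0, 1.1, -2.3):
    print(np.isclose(conv(f, g, x), conv(g, f, x)))
```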

A.2 Stirling Approximation

The correct definition is:
\[
\lim_{n\to+\infty}\frac{\sqrt{2\pi n}\,\bigl(\tfrac{n}{e}\bigr)^n}{n!} = 1 \tag{A.1}
\]
In approximate form, for n an integer,
\[
\ln n! \approx \left(n + \frac{1}{2}\right)\ln n - n + \frac{1}{2}\ln(2\pi) \tag{A.2}
\]
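A short numerical check of eq. (A.2), using the identity ln n! = lgamma(n + 1):

```python
# Numerical check of eq. (A.2): ln n! vs the Stirling approximation.
import math

for n in (5, 20, 100):
    exact = math.lgamma(n + 1)                    # ln n!
    approx = (n + 0.5) * math.log(n) - n + 0.5 * math.log(2 * math.pi)
    print(n, exact, approx, exact - approx)       # the difference shrinks roughly like 1/(12 n)
```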

Appendix B

Multivariate Gaussian

A vector-valued random variable Y = [Y_1, Y_2, ..., Y_n] is said to have a multivariate normal (or Gaussian) distribution with mean µ ∈ R^n and covariance matrix Σ ∈ S^n_{++} if its probability density function is given by
\[
p(x; \mu, \Sigma) = \frac{1}{\bigl((2\pi)^n\,|\Sigma|\bigr)^{1/2}}\;e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)} \tag{B.1}
\]
where S^n_{++} = {A ∈ R^{n×n} : A = A^T and x^T A x > 0 for all x ∈ R^n such that x ≠ 0} is the space of symmetric positive definite n × n matrices. We write Y ∼ N(µ, Σ).
The density function of a univariate normal (Gaussian) distribution is given by
\[
p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\;e^{-\frac{1}{2\sigma^2}(x-\mu)^2}. \tag{B.2}
\]
Here the argument of the exponential function, −\frac{1}{2\sigma^2}(x − µ)², is a quadratic function of the variable x. The coefficient in front, \frac{1}{\sqrt{2\pi}\,\sigma}, is a normalization factor which ensures that
\[
\int_{-\infty}^{+\infty}\frac{1}{\sqrt{2\pi}\,\sigma}\;e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\,dx = 1 \tag{B.3}
\]
In the case of the multivariate Gaussian density, the argument of the exponential, −\frac{1}{2}(x − µ)^T Σ^{-1}(x − µ), is a quadratic form in the variable x. Since Σ is positive definite, and since the inverse of any positive definite matrix is also positive definite, we have z^T Σ^{-1} z > 0 for any non-zero vector z. This implies that for any vector x ≠ µ,
\[
(x - \mu)^T\Sigma^{-1}(x - \mu) > 0,
\qquad
-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu) < 0.
\]

As in the univariate case, the coefficient in front is a normalization constant:
\[
\frac{1}{\bigl((2\pi)^n\,|\Sigma|\bigr)^{1/2}}
\int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty}\!\!\cdots\int_{-\infty}^{+\infty}
e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}\,dx_1\,dx_2\cdots dx_n = 1 \tag{B.4}
\]


B.1 Covariance Matrix


The concept of the covariance matrix is very important in the context of the multivariate Gaussian distribution. For a pair of random variables X and Y their covariance is defined as
\[
\operatorname{Cov}[X, Y] = E\bigl[(X - E[X])(Y - E[Y])\bigr] = E[XY] - E[X]\,E[Y] \tag{B.5}
\]

The covariance matrix Σ is the n × n matrix whose (i, j)-th entry is Cov[Xi , X j ].

Proposition 4. For any random vector X with mean µ and covariance matrix Σ,
\[
\Sigma = E\bigl[(X - \mu)(X - \mu)^T\bigr] = E[XX^T] - \mu\mu^T \tag{B.6}
\]

Proof. We prove the first of the two equalities in eq. (B.6); the other is very similar.
\[
\begin{aligned}
\Sigma &= \begin{pmatrix}
\operatorname{Cov}[X_1, X_1] & \cdots & \operatorname{Cov}[X_1, X_n] \\
\vdots & \ddots & \vdots \\
\operatorname{Cov}[X_n, X_1] & \cdots & \operatorname{Cov}[X_n, X_n]
\end{pmatrix} \\
&= \begin{pmatrix}
E[(X_1-\mu_1)^2] & \cdots & E[(X_1-\mu_1)(X_n-\mu_n)] \\
\vdots & \ddots & \vdots \\
E[(X_n-\mu_n)(X_1-\mu_1)] & \cdots & E[(X_n-\mu_n)^2]
\end{pmatrix} \\
&= E\begin{pmatrix}
(X_1-\mu_1)^2 & \cdots & (X_1-\mu_1)(X_n-\mu_n) \\
\vdots & \ddots & \vdots \\
(X_n-\mu_n)(X_1-\mu_1) & \cdots & (X_n-\mu_n)^2
\end{pmatrix} \\
&= E\left[\begin{pmatrix} X_1-\mu_1 \\ \vdots \\ X_n-\mu_n \end{pmatrix}
\begin{pmatrix} X_1-\mu_1 & \cdots & X_n-\mu_n \end{pmatrix}\right]
= E\bigl[(X - \mu)(X - \mu)^T\bigr]
\end{aligned} \tag{B.7}
\]
where we used the fact that the expectation of a matrix is simply the matrix obtained by taking the componentwise expectation of each entry. □

The symmetric positive definite property of a covariance matrix derives from the following

Proposition 5. Suppose Σ is the covariance matrix corresponding to some random


vector X. Then Σ is symmetric positive definite.

Proof. The symmetry of Σ follows immediately from its definition. Next, for any vector z ∈ R^n, observe that
\[
\begin{aligned}
z^T\Sigma z &= \sum_{i=1}^{n}\sum_{j=1}^{n}\Sigma_{ij}\,z_i z_j
= \sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{Cov}[X_i, X_j]\,z_i z_j \\
&= \sum_{i=1}^{n}\sum_{j=1}^{n} E\bigl[(X_i - E[X_i])(X_j - E[X_j])\bigr]\,z_i z_j
= E\Bigl[\sum_{i=1}^{n}\sum_{j=1}^{n}(X_i - E[X_i])(X_j - E[X_j])\,z_i z_j\Bigr]
\end{aligned} \tag{B.8}
\]
where we used the formula for expanding quadratic forms and the linearity of the expectation operator. To complete the proof, we observe that the quantity inside the expectation is of the form Σ_i Σ_j x_i x_j z_i z_j = (x^T z)² ≥ 0. Therefore the quantity inside the expectation is always non-negative, and hence the expectation itself must be non-negative. We conclude that z^T Σ z ≥ 0. For Σ^{-1} to exist, as required in the definition of the multivariate Gaussian density, Σ must be invertible and hence full rank. Since any full-rank symmetric positive semidefinite matrix is necessarily positive definite, it follows that Σ must be symmetric positive definite. □
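The following sketch illustrates Propositions 4 and 5 numerically: a sample covariance matrix is symmetric with non-negative eigenvalues, and, being non-singular here, it can be used in the density of eq. (B.1). The correlated data are random numbers generated only for the example.

```python
# Numerical illustration of Propositions 4 and 5 with random data (invented for the example).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
L = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
X = rng.normal(size=(500, 3)) @ L.T             # correlated 3-D sample

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)                 # sample covariance matrix

print("symmetric:", np.allclose(Sigma, Sigma.T))
print("eigenvalues >= 0:", np.all(np.linalg.eigvalsh(Sigma) >= -1e-12))   # z^T Sigma z >= 0

# With an invertible Sigma the density of eq. (B.1) is well defined:
print("density at the mean:", multivariate_normal(mean=mu, cov=Sigma).pdf(mu))
```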
Appendix C

Polarization Formula

The polarization formula allows one to write the inner product of two generic elements of a Hilbert space¹ as a sum of norms in the same space. We derive only the complex case, since the real case is just a special case of it.
Let H(x, y) be a Hermitian form, so H(y, x) = \overline{H(x, y)}, and the single-variable function Q(x) = H(x, x) satisfies Q(cx) = H(cx, cx) = c\bar{c}\,H(x, x) = |c|^2\,Q(x) for c ∈ C. Let's start by obtaining the real part of H(x, y) by looking at Q(x + y) and Q(x − y):

\[
Q(x + y) = H(x + y, x + y) = H(x, x) + H(x, y) + H(y, x) + H(y, y) = Q(x) + 2\,\Re\bigl(H(x, y)\bigr) + Q(y) \tag{C.1}
\]
and
\[
Q(x - y) = H(x - y, x - y) = H(x, x) - H(x, y) - H(y, x) + H(y, y) = Q(x) - 2\,\Re\bigl(H(x, y)\bigr) + Q(y). \tag{C.2}
\]
Therefore we can solve for the real part of H(x, y) by subtracting the second result from the first:
\[
Q(x + y) - Q(x - y) = 4\,\Re\bigl(H(x, y)\bigr)
\;\Longrightarrow\;
\Re\bigl(H(x, y)\bigr) = \frac{1}{4}\bigl(Q(x + y) - Q(x - y)\bigr).
\]
That is the first part of the complex polarization formula. Of course, if H(x, y) is a real quantity then ℜ(H(x, y)) = H(x, y) and we have proved the polarization formula in the real case. To get a formula for the imaginary part, note that ℑ(H(x, y)) = ℜ(−iH(x, y)) = ℜ(H(x, iy)), so if we run through the above work with iy in place of y then we get
\[
\Im\bigl(H(x, y)\bigr) = \Re\bigl(H(x, iy)\bigr) = \frac{1}{4}\bigl(Q(x + iy) - Q(x - iy)\bigr).
\]
¹ I am studying the Hilbert space now, so I am using it only for ti


If we write H(x, y) as ℜ(H(x, y)) + iℑ(H(x, y)) and feed the formulas for the real and imaginary parts of H(x, y) into this, the complex polarization formula appears:
\[
4\,H(x, y) = Q(x + y) - Q(x - y) + i\,Q(x + iy) - i\,Q(x - iy) \tag{C.3}
\]

We don't have to use Q(x + y) and Q(x − y), or Q(x + iy) and Q(x − iy); just one of each would suffice: since Q(x + y) = Q(x) + 2ℜ(H(x, y)) + Q(y), we have
\[
\Re\bigl(H(x, y)\bigr) = \frac{1}{2}\bigl(Q(x + y) - Q(x) - Q(y)\bigr), \tag{C.4}
\]
and
\[
\Im\bigl(H(x, y)\bigr) = \Re\bigl(H(x, iy)\bigr) = \frac{1}{2}\bigl(Q(x + iy) - Q(x) - Q(iy)\bigr), \tag{C.5}
\]
and Q(iy) = H(iy, iy) = i\,H(y, iy) = i\,\overline{H(iy, y)} = i\,\overline{i\,H(y, y)} = -i\,i\,\overline{H(y, y)} = Q(y), so
\[
\begin{aligned}
H(x, y) &= \Re\bigl(H(x, y)\bigr) + i\,\Im\bigl(H(x, y)\bigr) \\
&= \frac{1}{2}\bigl(Q(x + y) - Q(x) - Q(y) + i\,(Q(x + iy) - Q(x) - Q(y))\bigr) \\
&= \frac{1}{2}\bigl(Q(x + y) + i\,Q(x + iy) - (1 + i)\,Q(x) - (1 + i)\,Q(y)\bigr). \tag{C.6}
\end{aligned}
\]
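A quick numerical check of eq. (C.3) on Cⁿ is sketched below, using the Hermitian form H(x, y) = Σ_k x_k ȳ_k (linear in the first argument, conjugate-linear in the second, matching the convention used above); the random vectors are generated only for the check.

```python
# Numerical check of the polarization formula eq. (C.3) on C^n.
# H(x, y) = sum_k x_k * conj(y_k): linear in x, conjugate-linear in y (the convention used above).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=4) + 1j * rng.normal(size=4)
y = rng.normal(size=4) + 1j * rng.normal(size=4)

H = lambda a, b: np.vdot(b, a)       # np.vdot conjugates its first argument
Q = lambda a: H(a, a).real           # Q(a) = H(a, a) is real

rhs = Q(x + y) - Q(x - y) + 1j * Q(x + 1j * y) - 1j * Q(x - 1j * y)
print(np.isclose(4 * H(x, y), rhs))  # eq. (C.3)
```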
2
Finally, what happens for a Hermitian form on a vector space over an arbitrary field? Let V be a vector space over a field F and σ be an automorphism of F with order 2. If H : V × V → F is σ-Hermitian, so H(w, v) = σ(H(v, w)), then is H determined by the function Q(v) = H(v, v)? For a ∈ F,
\[
Q(av + w) = (N a)\,Q(v) + Q(w) + \operatorname{Tr}\bigl(a\,H(v, w)\bigr),
\]
where N and Tr are the norm and trace maps F → F^σ (where F^σ is the fixed field of σ on F, so the extension F/F^σ has degree 2 and is Galois). From the
nondegeneracy of the trace pairing on separable extensions, such as F/F σ , the
above formula as a varies in F shows Q-values determine H-values, even if F has
characteristic 2, when the usual polarization formula over C, with division by 4,
makes no sense.
