
Notes

August 30, 2023


Chapter 1

Legendre Equation

1.1 Generalized Legendre equation


The generalized Legendre equation

\[
(1 - x^2)\,y'' - 2x\,y' + \left(\lambda - \frac{m^2}{1 - x^2}\right)y = 0 \tag{1.1}
\]
arises when the equation ∆u = f (ρ)u is solved with separation of variables in
spherical coordinates. The function y(cos θ) describes the polar part of the solution
of ∆u = f (ρ)u.
The Legendre equation

\[
(1 - x^2)\,y'' - 2x\,y' + \lambda\,y = 0 \tag{1.2}
\]

is the special case when m = 0. Because 0 is an ordinary point of the equation, it


is natural to attempt a series solution. We describe the solution of the equation as
\[
y(x) = \sum_{n=0}^{+\infty} c_n x^n \tag{1.3}
\]

where the c_n are unknown coefficients to be determined by the boundary conditions.


If we substitute eq. (1.3) into eq. (1.2) we obtain
\[
(1 - x^2)\sum_{n=2}^{+\infty} c_n\, n(n-1)\,x^{n-2} - 2x\sum_{n=1}^{+\infty} c_n\, n\,x^{n-1} + \lambda\sum_{n=0}^{+\infty} c_n\, x^n = 0 \tag{1.4}
\]

1.2 Even Numbered Coefficients, l Even


\[
c_{k+2} = -\frac{(l - k)(l + k + 1)}{(k + 2)(k + 1)}\,c_k \tag{1.5}
\]
The first terms are:
\[
\begin{aligned}
c_2 &= -\frac{l(l+1)}{2!}\,c_0 \\
c_4 &= -\frac{(l-2)(l+3)}{(2+2)(2+1)}\,c_2 = \frac{l(l-2)(l+3)(l+1)}{4!}\,c_0 \\
c_6 &= -\frac{(l-4)(l+5)}{(4+2)(4+1)}\,c_4 = -\frac{l(l-2)(l-4)(l+5)(l+3)(l+1)}{6!}\,c_0 \\
&\;\;\vdots \\
c_{2n} &= (-1)^n\,\frac{l(l-2)\cdots\bigl(l-2(n-1)\bigr)\,(l+2n-1)(l+2n-3)\cdots(l+1)}{(2n)!}\,c_0
\end{aligned}
\]
where we set k = 2n because k is an even number. From the expanded coefficient we now derive the closed formula for the Legendre polynomial. Multiplying and dividing by suitable factorials and using
\[
l(l-2)\cdots\bigl(l-2(n-1)\bigr) = 2^n\,\frac{\bigl(\tfrac{l}{2}\bigr)!}{\bigl(\tfrac{l}{2}-n\bigr)!},
\qquad
(l+2n-1)(l+2n-3)\cdots(l+1) = \frac{(l+2n)!\,\bigl(\tfrac{l}{2}\bigr)!}{l!\;2^n\,\bigl(\tfrac{l}{2}+n\bigr)!},
\]
we obtain
\[
c_{2n} = (-1)^n\,\frac{(l+2n)!\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2}{(2n)!\;l!\;\bigl(\tfrac{l}{2}+n\bigr)!\;\bigl(\tfrac{l}{2}-n\bigr)!}\,c_0 \tag{1.6}
\]
Therefore
\[
c_{2n} = (-1)^n\,\frac{(l+2n)!\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2}{(2n)!\;l!\;\bigl(\tfrac{l}{2}+n\bigr)!\;\bigl(\tfrac{l}{2}-n\bigr)!}\,c_0 \tag{1.7}
\]

Now, for l even, the series terminates: the terms are non-zero only up to 2n = l, that is, the index n assumes all the integer values between 0 and l/2. Therefore, we can substitute the index 2n with l − 2k, i.e. we count the terms backwards. With this change eq. (1.7) becomes


\[
\begin{aligned}
c_{l-2k} &= (-1)^{\frac{l-2k}{2}}\,\frac{(2l-2k)!\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2}{(l-2k)!\;l!\;\bigl(\tfrac{l}{2}+\tfrac{l-2k}{2}\bigr)!\;\bigl(\tfrac{l}{2}-\tfrac{l-2k}{2}\bigr)!}\,c_0 \\
&= (-1)^{\frac{l}{2}}(-1)^{-k}\,\frac{(2l-2k)!\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2}{(l-2k)!\;l!\;(l-k)!\;k!}\,c_0 \\
&= (-1)^{k}\,\frac{(2l-2k)!}{2^l\,(l-2k)!\,(l-k)!\,k!}\;(-1)^{\frac{l}{2}}\,\frac{2^l\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2}{l!}\,c_0
\end{aligned} \tag{1.8}
\]
By choosing
\[
c_0 = (-1)^{\frac{l}{2}}\,\frac{l!}{2^l\,\Bigl[\bigl(\tfrac{l}{2}\bigr)!\Bigr]^2} \tag{1.9}
\]
we obtain
\[
P_l(x) = \sum_{k=0}^{l/2} (-1)^k\,\frac{(2l-2k)!}{2^l\,(l-2k)!\,(l-k)!\,k!}\;x^{l-2k} \tag{1.10}
\]
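As a quick sanity check of eq. (1.10), the short Python sketch below evaluates the closed-form sum for a few even values of l and compares it against scipy.special.eval_legendre. The sum is taken directly from eq. (1.10); the use of numpy/scipy and the test grid are choices made only for this check.

```python
# Numerical check of eq. (1.10) for even l (a sketch; assumes numpy and scipy are available).
from math import factorial

import numpy as np
from scipy.special import eval_legendre  # reference Legendre polynomials


def legendre_sum(l, x):
    """P_l(x) from the closed-form sum in eq. (1.10), l even."""
    total = 0.0
    for k in range(l // 2 + 1):
        coeff = (-1) ** k * factorial(2 * l - 2 * k) / (
            2 ** l * factorial(l - 2 * k) * factorial(l - k) * factorial(k)
        )
        total += coeff * x ** (l - 2 * k)
    return total


x = np.linspace(-1.0, 1.0, 7)
for l in (0, 2, 4, 6):
    assert np.allclose(legendre_sum(l, x), eval_legendre(l, x))
```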
Chapter 2

Statistics

2.1 Definitions
Probability Space: is a measure space such that the measure of the whole space is
equal to one. It can be represented as (Ω, F , P) where:

• Ω is the sample space (an arbitrary non-empty set)

• F ⊆ 2^Ω is a set of subsets of Ω, called "events", that forms a σ-algebra, i.e.

  – Ω ∈ F
  – F is closed under complements: if A ∈ F then (Ω \ A) ∈ F
  – F is closed under countable unions: if A_i ∈ F for i ∈ N then (∪_i A_i) ∈ F
  – F is closed under countable intersections: if A_i ∈ F for i ∈ N then (∩_i A_i) ∈ F

• P is countably additive: if {A_i} ⊆ F is a countable collection of pairwise disjoint sets, then P(⨆_i A_i) = Σ_i P(A_i), where ⨆ denotes the disjoint union

• the measure of the space is equal to 1: P(Ω) = 1

Random Variable: given a probability space (Ω, F, P) one can define a random variable as a function X : Ω → R which is measurable, in the sense that the preimage of any Borel set B ⊆ R belongs to F. The interpretation is that if ω is an experiment, then X(ω) measures an observable quantity of the experiment. A random variable possesses an expectation value E[X]. With the expectation it is possible to define the notion of variance, Var(X) = E[X²] − E[X]², and of standard deviation, σ_X = √Var(X). The correlation Cor(X, Y) = Cov(X, Y)/(σ_X σ_Y) of two random variables X, Y with positive variance is a quantity that tells how much the random variable X is related to the random variable Y.


Stochastic Process: a collection of random variables {X_t}_{t∈T}, indexed by a parameter t ∈ T called "time", defines a stochastic process. With a more general index set, such as R^n, the collection of random variables is called a "random field".

2.2 Maximum Likelihood Estimator


Let y = (y_1, y_2, ..., y_n)^T be a vector of iid random variables from one family of distributions on R^n, indexed by a p-dimensional parameter θ = (θ_1, θ_2, ..., θ_p)^T, where θ ∈ Θ ⊂ R^p and p ≤ n. Typically, we are interested in estimating parametric
models of the form
y ∼ f (θ, y) (2.1)
where θ is a vector of parameters and f is some specific functional form. That is
to say, given y we want to make inferences about the value of θ.
Note that the problem that we face is the opposite of the typical probability
problem. A typical probability problem is to know something about the distribution
of y (the outcomes) given the parameters of your model (θ). In this case we want
to know p(Data|Model), or rather we want to know f (y|θ).
However, in our case we have the data but want to learn about the model, specif-
ically the model’s parameters. In other words, we want to know the distribution of
the unknown parameters conditional on the observed data, i.e. p(Model|Data); this is called the "inverse probability problem".

2.2.1 Bayes’s Theorem


Recall the following identities

p(θ, y) = p(θ)p(y|θ) (2.2)


p(θ, y) = p(y)p(θ|y) (2.3)

therefore the conditional density p(θ|y) is
\[
p(\theta|y) = \frac{p(\theta, y)}{p(y)} = \frac{p(\theta)\,p(y|\theta)}{p(y)} \tag{2.4}
\]
Note that the denominator, p(y), is just a function of the data. Since it only makes
sense to compare these conditional densities for the same data, we can essentially
ignore the denominator. This means that we can rewrite eq. (2.4) in its more famil-
iar form:
p(θ|y) ∝ p(θ)p(y|θ) (2.5)
where p(y) is the constant of proportionality, p(θ) is the prior density of θ, p(y|θ) is
the likelihood, and p(θ|y) is the posterior density of θ. The likelihood is the sample
information that transforms a prior into a posterior density of θ.
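As a purely numerical illustration of eq. (2.5), the sketch below evaluates prior × likelihood on a grid and normalises it. The Beta(2, 2) prior, the binomial likelihood and the data (7 successes out of 10) are invented for the example and are not part of the derivation above.

```python
# Grid illustration of eq. (2.5): posterior is proportional to prior times likelihood.
import numpy as np
from scipy.stats import beta, binom

theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter
prior = beta.pdf(theta, 2, 2)            # p(theta), an invented Beta(2, 2) prior
likelihood = binom.pmf(7, 10, theta)     # p(y | theta), 7 successes in 10 trials (invented data)
unnormalised = prior * likelihood        # eq. (2.5), up to the constant p(y)

dtheta = theta[1] - theta[0]
posterior = unnormalised / (unnormalised.sum() * dtheta)   # normalise numerically

print("posterior mean:", (theta * posterior).sum() * dtheta)  # close to (2 + 7)/(2 + 2 + 10) ~ 0.64
```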

It is important to note that the prior, p(θ), is fixed before our observations and
so can be treated as invariant to our problem. This means that we can rewrite
eq. (2.5) as
\[
p(\theta|y) \propto k(y)\,p(y|\theta) \tag{2.6}
\]
where k(y) = p(θ)/p(y) is an unknown function of the data. Since k(y) is not a function of θ, it is treated as an unknown positive constant. Put differently, for a given set of observed data, k(y) remains the same over all possible hypothetical values of θ.

2.3 Likelihood
Without knowing the prior, or making assumptions about the prior, we cannot calculate the inverse probability in eq. (2.6). This is the point where the notion of likelihood and the Likelihood Axiom were introduced. The likelihood is defined as

L(θ|y) = k(y)p(y|θ) ∝ p(y|θ) (2.7)

The likelihood is proportional to the probability of observing the data, treating the
parameters of the distribution as variables and the data as fixed. The advantage of
likelihood is that it can be calculated from a traditional probability, p(y|θ), whereas
an inverse probability cannot be calculated in any way. Note that we can only
compare likelihoods for the same set of data and the same prior.
The best estimator θ̂, is whatever value of θ maximizes

L(θ|y) = p(y|θ) (2.8)

In effect, we are looking for the θ̂ that maximizes the likelihood of observing our sample. Because of the proportional relationship, the θ̂ that maximizes L(θ|y) will also maximize p(θ|y), i.e. the posterior probability of θ given the data. This is what we wanted from the beginning.
Since the logarithm is a monotonically increasing function and the likelihood
is a non-negative function, they have the maximum at the same position. The
logarithm of the likelihood is commonly used to calculate the maximum of L.
For a ML estimator, the information matrix is defined as the negative of the expected value of the Hessian of the likelihood with respect to the parameters:
\[
I(\theta) = -E[H(\theta)] \tag{2.9}
\]
where
\[
H_{ij}(\theta) = \frac{\partial^2 L}{\partial\theta_i\,\partial\theta_j} \tag{2.10}
\]

The variance of an ML estimator θ̂_ML is calculated by the inverse of the information matrix:
\[
\operatorname{var}(\hat\theta) = [I(\theta)]^{-1} = -\bigl(E[H(\theta)]\bigr)^{-1}
= -\left(E\!\left[\frac{\partial^2 L}{\partial\theta_i\,\partial\theta_j}\right]\right)^{-1} \tag{2.11}
\]

This is in fact a lower bound, as stated by the Cramér-Rao theorem:
\[
\operatorname{var}(\hat\theta) \ge -\bigl(E[H(\theta)]\bigr)^{-1} \tag{2.12}
\]

This means that any unbiased estimator that achieves this lower bound is efficient
and no better unbiased estimator is possible. The inverse of the information matrix
for MLE is exactly the same as the Cramer-Rao lower bound. This means that
MLE is efficient.
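The following sketch illustrates the idea on a normal sample with invented data: the negative log-likelihood is minimised numerically and the variance of the estimator is approximated through the inverse Hessian returned by the optimiser, in the spirit of eq. (2.11). The BFGS inverse Hessian is only an approximation to [I(θ)]⁻¹, so this is a rough check rather than an exact computation.

```python
# MLE sketch for a normal sample (invented data); the variance of the estimator is
# approximated by the inverse Hessian of the negative log-likelihood (cf. eq. (2.11)).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=2.0, size=200)   # synthetic data

def negloglik(params):
    mu, log_sigma = params                      # work with log(sigma) so that sigma stays positive
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

res = minimize(negloglik, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# BFGS returns an approximation to the inverse Hessian of the objective,
# i.e. an approximation to the inverse observed information at the optimum.
print("estimates:", mu_hat, sigma_hat)
print("approx. var(mu_hat):", res.hess_inv[0, 0], "  asymptotic value sigma^2/n:", sigma_hat**2 / len(y))
```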

2.4 Ordinary Kriging


Suppose (Z_0, Z_1, ..., Z_n) is a vector assumed to have a multivariate distribution of unknown mean (µ, µ, ..., µ) and known variance-covariance matrix Σ:
\[
\Sigma = \begin{pmatrix}
\operatorname{Var}(Z_0, Z_0) & \cdots & \operatorname{Cov}(Z_0, Z_n) \\
\vdots & \ddots & \vdots \\
\operatorname{Cov}(Z_n, Z_0) & \cdots & \operatorname{Var}(Z_n, Z_n)
\end{pmatrix}
\]

We observe (z1 , z2 , . . . , zn ) from this distribution and wish to predict z0 from


this information using an unbiased linear predictor.
Linear means the prediction must take the form ẑ0 = λ1 z1 + λ2 z2 + · · · + λn zn for
coefficients λi to be determined. These coefficients can depend at most on what is
known in advance: namely, the entries of Σ. This predictor can also be considered
a random variable Zˆ0 = λ1 Z1 + λ2 Z2 + · · · + λn Zn .
Unbiased means the expectation of Ẑ0 equals its (unknown) mean µ.
Writing things out gives some information about the coefficients:

µ = E[Zˆ0 ] = E[λ1 Z1 + λ2 Z2 + · · · + λn Zn ]
= λ1 E[Z1 ] + λ2 E[Z2 ] + · · · + λn E[Zn ]
= λ1 µ + · · · + λn µ
= (λ1 + · · · + λn ) µ (2.13)

Here the second equality follows from the linearity of expectation E[·]. Because this procedure is supposed to work regardless of the value of µ, evidently the coefficients have to sum to unity. Writing the coefficients in vector notation
λ = (λi )T , this can be neatly written 1 · λ = 1.

Among the set of all such unbiased linear predictors, we seek the one that deviates as little as possible from the real value, measured in the root mean square sense. This, again, is a computation. It relies on the bilinearity and symmetry of covariance, whose application is responsible for the summations in the second line:
\[
E[(\hat Z_0 - Z_0)^2] = E[(\lambda_1 Z_1 + \lambda_2 Z_2 + \cdots + \lambda_n Z_n - Z_0)^2]
= \sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j\,\Sigma_{i,j} - 2\sum_{i=1}^{n}\lambda_i\operatorname{Cov}(Z_i, Z_0) + \operatorname{Var}(Z_0, Z_0) \tag{2.14}
\]

That can be written in matrix notation as:

\[
E[(\hat Z_0 - Z_0)^2] = \lambda^T\Sigma\lambda - 2\,\lambda\cdot\operatorname{Cov}(\mathbf{Z}, Z_0) + \operatorname{Var}(Z_0, Z_0) \tag{2.15}
\]

where Cov(Z, Z0 ) = (Cov (Z1 , Z0 ) , . . . , Cov (Zn , Z0 )).


Whence the coefficients can be obtained by minimizing this quadratic form
subject to the (linear) constraint 1 · λ = 1. This is readily solved using the method
of Lagrange multipliers, yielding a linear system of equations, the ”Kriging equa-
tions.”
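A minimal numerical sketch of those kriging equations is given below, assuming the covariance matrix Σ is known exactly; the exponential covariance model and the 1-D locations are invented for the example.

```python
# Ordinary kriging sketch: solve the constrained minimisation via a Lagrange multiplier.
# The exponential covariance model and the 1-D locations are invented for this example.
import numpy as np

def cov(a, b, sill=1.0, corr_len=3.0):
    """Assumed covariance model: sill * exp(-|a - b| / corr_len)."""
    return sill * np.exp(-np.abs(a - b) / corr_len)

x = np.array([1.0, 2.5, 4.0, 7.0])      # observation locations
z = np.array([0.8, 1.1, 0.9, 1.6])      # observed values z_1..z_n
x0 = 3.0                                 # prediction location

n = len(x)
Sigma = cov(x[:, None], x[None, :])      # Cov(Z_i, Z_j)
c0 = cov(x, x0)                          # Cov(Z_i, Z_0)

# Kriging equations: [[Sigma, 1], [1^T, 0]] [lambda; m] = [c0; 1]
A = np.block([[Sigma, np.ones((n, 1))], [np.ones((1, n)), np.zeros((1, 1))]])
b = np.concatenate([c0, [1.0]])
sol = np.linalg.solve(A, b)
lam, m = sol[:n], sol[n]

z0_hat = lam @ z                                  # unbiased linear predictor
sigma2_ok = cov(x0, x0) - lam @ c0 - m            # minimised prediction variance
print("weights:", lam, "sum:", lam.sum())         # weights sum to 1
print("prediction:", z0_hat, "kriging variance:", sigma2_ok)
```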
In the application, Z is a spatial stochastic process (”random field”). This
means that for any given set of fixed (not random) locations x0 , . . . , xn , the vec-
tor of values of Z at those locations, (Z(x0 ), . . . , Z(xn )) is random with some kind
of a multivariate distribution. Write Zi = Z(xi ) and apply the foregoing analysis,
assuming the means of the process at all n + 1 locations xi are the same and assum-
ing the covariance matrix of the process values at these n + 1 locations is known
with certainty.
Let’s interpret this. Under the assumptions (including constant mean and known
covariance), the coefficients determine the minimum variance attainable by any
linear estimator. Let’s call this variance σ2OK (”OK” is for ”ordinary kriging”). It
depends solely on the matrix Σ. It tells us that if we were to repeatedly sample from
(Z0 , . . . , Zn ) and use these coefficients to predict the z0 values from the remaining
values each time, then

• On the average our predictions would be correct.

• Typically, our predictions of the z0 would deviate about σOK from the actual
values of the z0 .

Much more needs to be said before this can be applied to practical situations
like estimating a surface from punctual data: we need additional assumptions about
how the statistical characteristics of the spatial process vary from one location to
another and from one realization to another (even though, in practice, usually only
one realization will ever be available). But this exposition should be enough to
follow how the search for a ”Best” Unbiased Linear Predictor (”BLUP”) leads
straightforwardly to a system of linear equations.

By the way, kriging as usually practiced is not quite the same as least squares
estimation, because Σ is estimated in a preliminary procedure (known as ”variog-
raphy”) using the same data. That is contrary to the assumptions of this derivation,
which assumed Σ was known (and a fortiori independent of the data). Thus, at the
very outset, kriging has some conceptual and statistical flaws built into it. Thought-
ful practitioners have always been aware of this and found various creative ways to
(try to) justify the inconsistencies. (Having lots of data can really help.) Procedures
now exist for simultaneously estimating Σ and predicting a collection of values at
unknown locations. They require slightly stronger assumptions (multivariate nor-
mality) in order to accomplish this feat.

2.5 DACE
DACE stands for "Design and Analysis of Computer Experiments" and it has become a common name for a way to build a response surface for a generic process.
Suppose we have a generic output of a deterministic experiment, y(x), where y(x) is the value of the output of the experiment (a physical property, for example) at position x. In DACE this deterministic outcome is treated as the realization of a random function (i.e. a stochastic process) Y(x) described by a regression model:
\[
Y(x) = \sum_{j=1}^{N} \beta_j\,f_j(x) + Z(x) \tag{2.16}
\]

where Z(·) is a random process with zero mean and covariance between Z(x1 ) and
Z(x2 ) given by:
Cov (x1 , x2 ) = σ2 Cor(x1 , x2 ) (2.17)
where σ2 is the process variance and Cor(x1 , x2 ) is the correlation between the two
points.
Given a number M of observations y = (y(x_1), ..., y(x_M))^T we consider the linear predictor
\[
\hat y(x) = c^T(x)\cdot y \tag{2.18}
\]
where x (without indices) represents some untried position (i.e. some position not contained in the set {x_1, ..., x_M}), and c(x) = (c_1(x), ..., c_M(x))^T. The frequentist and the Bayesian approaches lead to two different ways to predict ŷ(x):

• In the frequentist viewpoint we replace y by the corresponding random quantity Y = (Y(x_1), ..., Y(x_M))^T, treat ŷ(x) as a random variable as well, and compute the mean squared error of this predictor averaged over the random process. The BLUP is obtained by choosing the vector c(x) that minimizes the mean squared error (MSE):
\[
\operatorname{MSE}[\hat y(x)] = E\bigl[c^T(x)\cdot Y - Y(x)\bigr]^2 \tag{2.19}
\]

subject to the unbiasedness constraint
\[
E\bigl[c^T(x)\,Y\bigr] = E[Y(x)] \tag{2.20}
\]

• The Bayesian approach would predict y(x) by the posterior mean:
\[
\hat y(x) = E\bigl[Y(x)\,\big|\,y\bigr] \tag{2.21}
\]

In general the Bayesian and frequentist approaches lead to different methods and results, but in the special case of a Gaussian process for Z(·) and improper uniform priors on the β's the methods and results are the same.
Let us now derive the expression for kriging. Define the vector f(x) = (f_1(x), ..., f_N(x))^T of the N functions in the regression, the M × N design matrix
\[
F = \begin{pmatrix} f^T(x_1) \\ \vdots \\ f^T(x_M) \end{pmatrix}
  = \begin{pmatrix} f_1(x_1) & \cdots & f_N(x_1) \\ \vdots & \ddots & \vdots \\ f_1(x_M) & \cdots & f_N(x_M) \end{pmatrix},
\]
the M × M matrix R of stochastic-process correlations between the Z's at the design sites,
\[
R_{i,j} = \operatorname{Cor}(x_i, x_j), \qquad 1 \le i \le M,\; 1 \le j \le M, \tag{2.22}
\]
and the vector of correlations between the design sites and the untried input x,
\[
r(x) = \bigl(\operatorname{Cor}(x_1, x), \ldots, \operatorname{Cor}(x_M, x)\bigr)^T. \tag{2.23}
\]
Starting from eq. (2.19) we obtain:
\[
\begin{aligned}
\operatorname{MSE}[\hat y(x)] &= E\bigl[c^T(x)\cdot Y - Y(x)\bigr]^2
= E\Bigl[\bigl(c^T(x)\cdot Y\bigr)^2 + Y^2(x) - 2\,Y(x)\,\bigl(c^T(x)\cdot Y\bigr)\Bigr] \\
&= E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)\,c_j(x)\,Y_i Y_j - 2\,Y(x)\sum_{i=1}^{M} c_i(x)\,Y_i + Y^2(x)\Bigr]
\end{aligned} \tag{2.24}
\]

From eq. (2.16) we can write
\[
E[Y(x)] = E\Bigl[\sum_{j=1}^{N}\beta_j f_j(x) + Z(x)\Bigr]
= \sum_{j=1}^{N}\beta_j f_j(x) + E[Z(x)]
= \sum_{j=1}^{N}\beta_j f_j(x) \tag{2.25}
\]

because of the linearity of the expectation E[·] and the fact that the f_j's are not random variables, the c_j's are constants and Z(·) is a stochastic process with zero mean. In the same way we can write, from eq. (2.16):
\[
\begin{aligned}
E[Y(x_i)\,Y(x_j)] &= E\Bigl[\Bigl(\sum_{k=1}^{N}\beta_k f_k(x_i) + Z(x_i)\Bigr)\Bigl(\sum_{h=1}^{N}\beta_h f_h(x_j) + Z(x_j)\Bigr)\Bigr] \\
&= \sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_k(x_i) f_h(x_j) + E[Z(x_i)Z(x_j)] \\
&= \sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_k(x_i) f_h(x_j) + \sigma^2\operatorname{Cor}(x_i, x_j)
\end{aligned} \tag{2.26}
\]
where the last term in the last equality follows from eq. (2.17) and from the fact that
\[
\operatorname{Cov}(Z_i, Z_j) = E\bigl[(Z_i - E[Z_i])(Z_j - E[Z_j])\bigr] = E[Z_iZ_j] - E[Z_i]\,E[Z_j] \tag{2.27}
\]
and, because we have assumed a random process with zero mean for Z(·), we eventually obtain
\[
\operatorname{Cov}(Z_i, Z_j) = E[Z_iZ_j] \tag{2.28}
\]
Coming back to eq. (2.24) we can now write:
\[
\begin{aligned}
\operatorname{MSE}[\hat y(x)] ={}& E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)c_j(x)\,Y_iY_j - 2\,Y(x)\sum_{i=1}^{M} c_i(x)\,Y_i + Y^2(x)\Bigr] \\
={}& E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)c_j(x)\,f_k(x_i)f_h(x_j)\Bigr]
+ E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)c_j(x)\,Z(x_i)Z(x_j)\Bigr] \\
&- 2\,E\Bigl[\sum_{i=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)\,f_k(x_i)f_h(x)\Bigr]
- 2\,E\Bigl[\sum_{i=1}^{M} c_i(x)\,Z(x_i)Z(x)\Bigr] \\
&+ E\Bigl[\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_k(x)f_h(x)\Bigr] + E[Z(x)Z(x)]
\end{aligned} \tag{2.29}
\]

To further simplify the equation we must consider the unbiasedness constraint F^T c(x) = f(x), which allows us to write:
\[
\begin{aligned}
E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)c_j(x)\,f_k(x_i)f_h(x_j)\Bigr]
&= \sum_{i=1}^{M}\sum_{j=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)c_j(x)\,f_k(x_i)f_h(x_j) \\
&= \Bigl(\sum_{k=1}^{N}\beta_k\sum_{i=1}^{M} c_i(x)f_k(x_i)\Bigr)\Bigl(\sum_{h=1}^{N}\beta_h\sum_{j=1}^{M} c_j(x)f_h(x_j)\Bigr) \\
&= \Bigl(\sum_{k=1}^{N}\beta_k f_k(x)\Bigr)\Bigl(\sum_{h=1}^{N}\beta_h f_h(x)\Bigr)
= \sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_k(x)f_h(x)
\end{aligned} \tag{2.30}
\]

and
\[
E\Bigl[\sum_{i=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)\,f_k(x_i)f_h(x)\Bigr]
= \sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_h(x)\sum_{i=1}^{M} c_i(x)f_k(x_i)
= \sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_h(x)f_k(x) \tag{2.31}
\]

Therefore, from eq. (2.29),
\[
E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)c_j(x)\,f_k(x_i)f_h(x_j)\Bigr]
- 2\,E\Bigl[\sum_{i=1}^{M}\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,c_i(x)\,f_k(x_i)f_h(x)\Bigr]
+ E\Bigl[\sum_{k=1}^{N}\sum_{h=1}^{N}\beta_k\beta_h\,f_h(x)f_k(x)\Bigr] = 0 \tag{2.32}
\]

and
\[
\begin{aligned}
\operatorname{MSE}[\hat y(x)]
&= E\Bigl[\sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)c_j(x)\,Z(x_i)Z(x_j)\Bigr] - 2\,E\Bigl[\sum_{i=1}^{M} c_i(x)\,Z(x_i)Z(x)\Bigr] + E[Z(x)Z(x)] \\
&= \sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)c_j(x)\,E[Z(x_i)Z(x_j)] - 2\sum_{i=1}^{M} c_i(x)\,E[Z(x_i)Z(x)] + E[Z(x)Z(x)] \\
&= \sigma^2\sum_{i=1}^{M}\sum_{j=1}^{M} c_i(x)c_j(x)\operatorname{Cor}(x_i, x_j) - 2\sigma^2\sum_{i=1}^{M} c_i(x)\operatorname{Cor}(x_i, x) + \sigma^2\operatorname{Cor}(x, x) \\
&= \sigma^2\bigl(1 + c^T R\,c - 2\,c^T r\bigr)
\end{aligned} \tag{2.33}
\]

To minimize eq. (2.33) subject to the condition eq. (2.20) we use the method of Lagrange multipliers, defining the function Λ(x):
\[
\Lambda(x) = \operatorname{MSE}[\hat y(x)] + \lambda(x)\cdot\bigl(F^T c(x) - f(x)\bigr) \tag{2.34}
\]
where λ(x) is the vector of Lagrange multipliers.

A property predicted at an untried position comes with an error. Therefore we should have:
\[
Q^R = \hat Q + s(x^\star) \tag{2.35}
\]
where Q is the charge, the superscript R means that it is the real value, \hat Q is the predicted value and s(x^\star) is the error at the unknown position x^\star. The error is given by the square root of the MSE (mean squared error):
\[
s(x^\star) = \sqrt{MSE} \tag{2.36}
\]
where
\[
MSE = s(x^\star)^2 = \sigma^2\left(1 - r^T R^{-1} r + \frac{\bigl(1 - r^T R^{-1} r\bigr)^2}{\mathbf{1}^T R^{-1}\mathbf{1}}\right) \tag{2.37}
\]

The derivative of the property (using the notation of Mills and Popelier (JCTC 2014)) can be calculated as:
\[
\frac{\partial Q}{\partial\alpha_i^\Omega} = \frac{\partial \hat Q}{\partial\alpha_i^\Omega} + \frac{\partial s}{\partial\alpha_i^\Omega} \tag{2.38}
\]
The derivative of \hat Q is described in the paper; here we derive the derivative of s, which can be written as:
\[
\frac{\partial s}{\partial\alpha_i^\Omega} = \sum_{k=1}^{N_f}\frac{\partial s}{\partial f_k}\,\frac{\partial f_k}{\partial\alpha_i^\Omega} \tag{2.39}
\]

The only unknown term is ∂s/∂f_k. The derivative of s with respect to the features can be written as:
\[
\begin{aligned}
\frac{\partial s}{\partial f_k}
&= -\sigma^2\left[\frac{\partial r^T}{\partial f_k}R^{-1}r + r^T R^{-1}\frac{\partial r}{\partial f_k}
+ 2\,\frac{r^T R^{-1} r}{\mathbf{1}^T R^{-1}\mathbf{1}}\left(\frac{\partial r^T}{\partial f_k}R^{-1}r + r^T R^{-1}\frac{\partial r}{\partial f_k}\right)\right] \\
&= -\sigma^2\left(2\,\frac{r^T R^{-1} r}{\mathbf{1}^T R^{-1}\mathbf{1}} + 1\right)\left(\frac{\partial r^T}{\partial f_k}R^{-1}r + r^T R^{-1}\frac{\partial r}{\partial f_k}\right)
\end{aligned} \tag{2.40}
\]

The force can be written as:
\[
F_i^\Omega = \sum_{AB}\left[\frac{\partial\bigl(\hat Q_A + s_A\bigr)}{\partial\alpha_i^\Omega}\,T_{AB}\,\bigl(\hat Q_B + s_B\bigr)
+ \bigl(\hat Q_A + s_A\bigr)\frac{\partial T_{AB}}{\partial\alpha_i^\Omega}\bigl(\hat Q_B + s_B\bigr)
+ \bigl(\hat Q_A + s_A\bigr)\,T_{AB}\,\frac{\partial\bigl(\hat Q_B + s_B\bigr)}{\partial\alpha_i^\Omega}\right] \tag{2.41}
\]
which results in:
\[
\begin{aligned}
&\frac{\partial\bigl(\hat Q_A + s_A\bigr)}{\partial\alpha_i^\Omega}\,T_{AB}\,\bigl(\hat Q_B + s_B\bigr)
+ \bigl(\hat Q_A + s_A\bigr)\frac{\partial T_{AB}}{\partial\alpha_i^\Omega}\bigl(\hat Q_B + s_B\bigr)
+ \bigl(\hat Q_A + s_A\bigr)\,T_{AB}\,\frac{\partial\bigl(\hat Q_B + s_B\bigr)}{\partial\alpha_i^\Omega} \\
&\quad= \frac{\partial\hat Q_A}{\partial\alpha_i^\Omega}\,T_{AB}\,\bigl(\hat Q_B + s_B\bigr)
+ \frac{\partial s_A}{\partial\alpha_i^\Omega}\,T_{AB}\,\bigl(\hat Q_B + s_B\bigr)
+ \bigl(\hat Q_A + s_A\bigr)\frac{\partial T_{AB}}{\partial\alpha_i^\Omega}\bigl(\hat Q_B + s_B\bigr) \\
&\qquad+ \bigl(\hat Q_A + s_A\bigr)\,T_{AB}\,\frac{\partial\hat Q_B}{\partial\alpha_i^\Omega}
+ \bigl(\hat Q_A + s_A\bigr)\,T_{AB}\,\frac{\partial s_B}{\partial\alpha_i^\Omega}
\end{aligned} \tag{2.42}
\]

2.6 Log Likelihood


The kriging model has 2k + 2 parameters: µ, σ², θ_1, θ_2, ..., θ_k, p_1, p_2, ..., p_k. The parameters are chosen by maximising the likelihood of the sample. Let y = (y_1, ..., y_n)^T be the n-vector of observed function values, R the n × n matrix whose (i, j) entry is Corr(ε(x_i), ε(x_j)), and 1 an n-vector of ones. The likelihood function is:
\[
L(\mu, \sigma^2, \theta_1, \ldots, \theta_k, p_1, \ldots, p_k)
= \frac{1}{\bigl((2\pi\sigma^2)^n\,|R|\bigr)^{1/2}}\;
e^{-\frac{(y - \mathbf{1}\mu)^T R^{-1}(y - \mathbf{1}\mu)}{2\sigma^2}} \tag{2.43}
\]
Given the correlation parameters θ_h and p_h for h = 1, ..., k, we can solve for the values of µ and σ² that maximize the likelihood function in closed form:
\[
\hat\mu = \frac{\mathbf{1}^T R^{-1} y}{\mathbf{1}^T R^{-1}\mathbf{1}} \tag{2.44}
\]
and
\[
\hat\sigma^2 = \frac{(y - \mathbf{1}\hat\mu)^T R^{-1}(y - \mathbf{1}\hat\mu)}{n} \tag{2.45}
\]
Substituting eqs. (2.44) and (2.45) into eq. (2.43) we get the so-called 'concentrated log-likelihood' function, which depends only upon the parameters θ_h and p_h for h = 1, ..., k. The function that we have to maximize is therefore:
\[
L(\hat\mu, \hat\sigma^2, \theta_1, \ldots, \theta_k, p_1, \ldots, p_k)
= \frac{1}{\bigl((2\pi\hat\sigma^2)^n\,|R|\bigr)^{1/2}}\;e^{-n/2} \tag{2.46}
\]
Because the logarithm is a monotonically increasing function, optimizing the likelihood is the same as maximising the log of the likelihood. Therefore, the function to maximize is (after ignoring the constant terms):
\[
-\frac{n}{2}\log\hat\sigma^2 - \frac{1}{2}\log|R| \tag{2.47}
\]
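A compact numerical sketch of eqs. (2.44)-(2.47) is given below for a one-dimensional sample with correlation R_ij = exp(−θ|x_i − x_j|^p); the toy data, the fixed p = 2 and the small nugget added for numerical stability are choices made only for the example.

```python
# Concentrated log-likelihood sketch, eqs. (2.44)-(2.47), for a 1-D sample with
# R_ij = exp(-theta * |x_i - x_j|^p) and p fixed to 2 (a choice made for the example).
import numpy as np

x = np.array([0.0, 0.3, 0.7, 1.2, 2.0, 2.4])          # sample sites (invented)
y = np.sin(2.0 * x) + 0.1 * np.cos(5.0 * x)           # observed function values (invented)
n = len(y)
one = np.ones(n)

def concentrated_loglik(theta, p=2.0, nugget=1e-10):
    R = np.exp(-theta * np.abs(x[:, None] - x[None, :]) ** p) + nugget * np.eye(n)
    Rinv_y = np.linalg.solve(R, y)
    Rinv_1 = np.linalg.solve(R, one)
    mu = (one @ Rinv_y) / (one @ Rinv_1)               # eq. (2.44)
    resid = y - mu * one
    sigma2 = resid @ np.linalg.solve(R, resid) / n     # eq. (2.45)
    sign, logdet = np.linalg.slogdet(R)
    return -0.5 * n * np.log(sigma2) - 0.5 * logdet    # eq. (2.47), constants dropped

thetas = np.linspace(0.1, 20.0, 200)
best = max(thetas, key=concentrated_loglik)
print("theta maximising the concentrated log-likelihood:", best)
```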
Before calculating the derivative we consider two properties of the derivative of a matrix:
\[
\frac{\partial\log|R|}{\partial\gamma} = \operatorname{tr}\!\left(R^{-1}\frac{\partial R}{\partial\gamma}\right) \tag{2.48}
\]
\[
\frac{\partial R^{-1}}{\partial\gamma} = -R^{-1}\frac{\partial R}{\partial\gamma}R^{-1} \tag{2.49}
\]
where ∂R/∂γ is the matrix of elementwise derivatives.

2.6.1 Proof of eq. (2.49)

If I is the identity matrix, eq. (2.49) can be derived as follows:
\[
\frac{\partial I}{\partial\gamma} = 0 = \frac{\partial(R^{-1}R)}{\partial\gamma}
= \frac{\partial R^{-1}}{\partial\gamma}R + R^{-1}\frac{\partial R}{\partial\gamma}
\;\Longrightarrow\;
\frac{\partial R^{-1}}{\partial\gamma} = -R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}
\]
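A quick finite-difference check of eqs. (2.48) and (2.49) is sketched below; the parametrisation R(γ) used for the check is invented.

```python
# Finite-difference check of eqs. (2.48) and (2.49); the parametrisation R(gamma) is invented.
import numpy as np

def R_of(gamma):
    pts = np.array([0.0, 0.4, 1.1])
    d = np.abs(np.subtract.outer(pts, pts))
    return np.exp(-gamma * d**2)

gamma, h = 1.3, 1e-6
R = R_of(gamma)
dR = (R_of(gamma + h) - R_of(gamma - h)) / (2 * h)          # elementwise dR/dgamma

lhs_logdet = (np.linalg.slogdet(R_of(gamma + h))[1] - np.linalg.slogdet(R_of(gamma - h))[1]) / (2 * h)
print(np.isclose(lhs_logdet, np.trace(np.linalg.solve(R, dR)), rtol=1e-4))   # eq. (2.48)

lhs_inv = (np.linalg.inv(R_of(gamma + h)) - np.linalg.inv(R_of(gamma - h))) / (2 * h)
rhs_inv = -np.linalg.inv(R) @ dR @ np.linalg.inv(R)
print(np.allclose(lhs_inv, rhs_inv, atol=1e-4))                              # eq. (2.49)
```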

2.6.2 First derivative of the Loglikelihood function

The derivative of the log-likelihood function can be calculated term by term. In the first part we use a variable γ to indicate indifferently θ_j or p_j; afterwards we write the derivative explicitly with respect to θ_j or p_j.
\[
\begin{aligned}
\frac{\partial\ln\sigma^2}{\partial\gamma}
&= \frac{1}{\sigma^2}\frac{\partial\sigma^2}{\partial\gamma}
= \frac{1}{n\sigma^2}\frac{\partial}{\partial\gamma}\Bigl[(y - \mathbf{1}\mu)^T R^{-1}(y - \mathbf{1}\mu)\Bigr] \\
&= \frac{1}{n\sigma^2}\left[\frac{\partial(y - \mathbf{1}\mu)^T}{\partial\gamma}R^{-1}(y - \mathbf{1}\mu)
+ (y - \mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}(y - \mathbf{1}\mu)
+ (y - \mathbf{1}\mu)^T R^{-1}\frac{\partial(y - \mathbf{1}\mu)}{\partial\gamma}\right]
\end{aligned} \tag{2.50}
\]

Considering that y does not depend on any of the θ_j or p_j we can write
\[
\frac{\partial(y - \mathbf{1}\mu)}{\partial\gamma} = -\mathbf{1}\frac{\partial\mu}{\partial\gamma},
\qquad
\frac{\partial(y - \mathbf{1}\mu)^T}{\partial\gamma} = \left(\frac{\partial(y - \mathbf{1}\mu)}{\partial\gamma}\right)^T = -\left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T
\]
and eq. (2.50) becomes:
\[
\frac{\partial\ln\sigma^2}{\partial\gamma}
= \frac{1}{n\sigma^2}\left[-\left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T R^{-1}(y - \mathbf{1}\mu)
+ (y - \mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}(y - \mathbf{1}\mu)
- (y - \mathbf{1}\mu)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right] \tag{2.51}
\]

Now let us calculate the derivative of µ with respect to γ. Recalling eq. (2.44):
\[
\frac{\partial\mu}{\partial\gamma} = \frac{\partial}{\partial\gamma}\left(\frac{\mathbf{1}^T R^{-1} y}{\mathbf{1}^T R^{-1}\mathbf{1}}\right) \tag{2.52}
\]
Since 1 is a constant vector,
\[
\frac{\partial\mu}{\partial\gamma}
= \frac{\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\,y}{\mathbf{1}^T R^{-1}\mathbf{1}}
- \frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}
\]

From eq. (2.49) we have:
\[
\frac{\partial R^{-1}}{\partial\gamma} = -R^{-1}\frac{\partial R}{\partial\gamma}R^{-1} \tag{2.53}
\]
From eq. (2.48) we obtain the derivative of the first term in the log-likelihood function. Now everything is written as a function of the derivative of the R matrix. The last thing to do is to write explicitly the derivative of the entries of the R matrix with respect to θ_h and p_h for h = 1, ..., k.
Let us start with θ = (θ_1, ..., θ_k).
\[
\frac{\partial R_{ij}}{\partial\theta_h}
= \frac{\partial}{\partial\theta_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
= -\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.54}
\]
whereas for p = (p_1, ..., p_k) we need one more step:
\[
\frac{\partial R_{ij}}{\partial p_h}
= \frac{\partial}{\partial p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
= -\frac{\partial}{\partial p_h}\Bigl(\theta_h\left|x_h^i - x_h^j\right|^{p_h}\Bigr)\,
e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.55}
\]
By using the following identity we can write the derivative in eq. (2.55) explicitly:
\[
\theta_h\left|x_h^i - x_h^j\right|^{p_h} \equiv e^{\ln\theta_h + p_h\ln\left|x_h^i - x_h^j\right|}
\]
Therefore the derivative in eq. (2.55) is:
\[
\frac{\partial}{\partial p_h}\Bigl(\theta_h\left|x_h^i - x_h^j\right|^{p_h}\Bigr)
= \theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}
\]
and the complete derivative for p_h is:
\[
\frac{\partial R_{ij}}{\partial p_h}
= -\theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,
e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.56}
\]
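The sketch below checks eqs. (2.54) and (2.56) against central finite differences for a one-dimensional, two-point example; the sample points and parameter values are invented.

```python
# Finite-difference check of eqs. (2.54) and (2.56) for a 1-D, two-point example (invented values).
import numpy as np

xi, xj = 0.4, 1.7          # two sample points, one dimension (N_d = 1)
d = abs(xi - xj)

def R_ij(theta, p):
    return np.exp(-theta * d**p)

theta, p, h = 0.8, 1.6, 1e-6

dR_dtheta = -d**p * R_ij(theta, p)                         # eq. (2.54)
dR_dp = -theta * np.log(d) * d**p * R_ij(theta, p)         # eq. (2.56)

fd_theta = (R_ij(theta + h, p) - R_ij(theta - h, p)) / (2 * h)
fd_p = (R_ij(theta, p + h) - R_ij(theta, p - h)) / (2 * h)
print(np.isclose(dR_dtheta, fd_theta), np.isclose(dR_dp, fd_p))
```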

2.6.3 Second derivative of the Loglikelihood function


Starting from eq. (2.47) we can calculate the second derivative of the log-likelihood function term by term. In this case we have to consider the different cases:
\[
\frac{\partial^2}{\partial\theta^2},\;\frac{\partial^2}{\partial p^2} \;\longrightarrow\; \frac{\partial^2}{\partial\gamma^2},
\qquad\qquad
\frac{\partial^2}{\partial\theta\,\partial p} = \frac{\partial^2}{\partial p\,\partial\theta} \;\longrightarrow\; \frac{\partial^2}{\partial\gamma\,\partial\phi}
\]
\[
\frac{\partial^2\log|R|}{\partial\gamma^2}
= \frac{\partial}{\partial\gamma}\operatorname{tr}\!\left(R^{-1}\frac{\partial R}{\partial\gamma}\right)
= \operatorname{tr}\!\left(\frac{\partial R^{-1}}{\partial\gamma}\frac{\partial R}{\partial\gamma} + R^{-1}\frac{\partial^2 R}{\partial\gamma^2}\right)
= \operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\gamma} + R^{-1}\frac{\partial^2 R}{\partial\gamma^2}\right) \tag{2.57}
\]

and
\[
\frac{\partial^2\log|R|}{\partial\gamma\,\partial\phi}
= \frac{\partial}{\partial\gamma}\operatorname{tr}\!\left(R^{-1}\frac{\partial R}{\partial\phi}\right)
= \operatorname{tr}\!\left(\frac{\partial R^{-1}}{\partial\gamma}\frac{\partial R}{\partial\phi} + R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}\right)
= \operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi} + R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}\right) \tag{2.58}
\]

For the second term we start from the first derivative (eq. (2.50)):
\[
\frac{\partial^2\ln\sigma^2}{\partial\gamma^2}
= \frac{\partial}{\partial\gamma}\left(\frac{1}{\sigma^2}\frac{\partial\sigma^2}{\partial\gamma}\right)
= \left[\frac{\partial^2\sigma^2}{\partial\gamma^2} - \frac{1}{\sigma^2}\left(\frac{\partial\sigma^2}{\partial\gamma}\right)^2\right]\frac{1}{\sigma^2} \tag{2.59}
\]

and
\[
\frac{\partial^2\ln\sigma^2}{\partial\gamma\,\partial\phi}
= \frac{\partial}{\partial\gamma}\left(\frac{1}{\sigma^2}\frac{\partial\sigma^2}{\partial\phi}\right)
= \left[\frac{\partial^2\sigma^2}{\partial\gamma\,\partial\phi} - \frac{1}{\sigma^2}\frac{\partial\sigma^2}{\partial\gamma}\frac{\partial\sigma^2}{\partial\phi}\right]\frac{1}{\sigma^2} \tag{2.60}
\]

Note that since σ² is a scalar quantity the order of multiplication is not important; the same is not true for the derivatives of matrices and vectors. The only quantity left to calculate is the second derivative of σ²:
\[
\begin{aligned}
\frac{\partial^2(n\sigma^2)}{\partial\gamma^2}
={}& \frac{\partial}{\partial\gamma}\left[-\left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T R^{-1}(y-\mathbf{1}\mu)
+ (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}(y-\mathbf{1}\mu)
- (y-\mathbf{1}\mu)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right] \\
={}& -\left(\mathbf{1}\frac{\partial^2\mu}{\partial\gamma^2}\right)^T R^{-1}(y-\mathbf{1}\mu)
- \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T\frac{\partial R^{-1}}{\partial\gamma}(y-\mathbf{1}\mu)
+ \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\gamma} \\
&- \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T\frac{\partial R^{-1}}{\partial\gamma}(y-\mathbf{1}\mu)
+ (y-\mathbf{1}\mu)^T\frac{\partial^2 R^{-1}}{\partial\gamma^2}(y-\mathbf{1}\mu)
- (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\frac{\partial\mu}{\partial\gamma} \\
&+ \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\gamma}
- (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\frac{\partial\mu}{\partial\gamma}
- (y-\mathbf{1}\mu)^T R^{-1}\mathbf{1}\frac{\partial^2\mu}{\partial\gamma^2}
\end{aligned} \tag{2.61}
\]
It can be written in matrix form as:
\[
\frac{\partial^2(n\sigma^2)}{\partial\gamma^2} = \mathbf{1}^T K\,\mathbf{1} \tag{2.62}
\]
where K is a 3 × 3 symmetric matrix whose entries are:
\[
K_{ij} = \frac{\partial^{a^1_{ij}}(y-\mathbf{1}\mu)^T}{\partial\gamma^{a^1_{ij}}}\;
\frac{\partial^{a^2_{ij}}R^{-1}}{\partial\gamma^{a^2_{ij}}}\;
\frac{\partial^{a^3_{ij}}(y-\mathbf{1}\mu)}{\partial\gamma^{a^3_{ij}}} \tag{2.63}
\]
and where
\[
a^h_{ij} = \delta^h_i + \delta^h_j, \qquad h = 1, 2, 3, \tag{2.64}
\]
where δ^h_i is the Kronecker delta and with the convention that the 0-th derivative is the function itself.
The mixed derivative in this case is:
\[
\begin{aligned}
\frac{\partial^2(n\sigma^2)}{\partial\gamma\,\partial\phi}
={}& \frac{\partial}{\partial\gamma}\left[-\left(\mathbf{1}\frac{\partial\mu}{\partial\phi}\right)^T R^{-1}(y-\mathbf{1}\mu)
+ (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\phi}(y-\mathbf{1}\mu)
- (y-\mathbf{1}\mu)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\phi}\right] \\
={}& -\left(\mathbf{1}\frac{\partial^2\mu}{\partial\gamma\,\partial\phi}\right)^T R^{-1}(y-\mathbf{1}\mu)
- \left(\mathbf{1}\frac{\partial\mu}{\partial\phi}\right)^T\frac{\partial R^{-1}}{\partial\gamma}(y-\mathbf{1}\mu)
+ \left(\mathbf{1}\frac{\partial\mu}{\partial\phi}\right)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\gamma} \\
&- \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T\frac{\partial R^{-1}}{\partial\phi}(y-\mathbf{1}\mu)
+ (y-\mathbf{1}\mu)^T\frac{\partial^2 R^{-1}}{\partial\gamma\,\partial\phi}(y-\mathbf{1}\mu)
- (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\phi}\mathbf{1}\frac{\partial\mu}{\partial\gamma} \\
&+ \left(\mathbf{1}\frac{\partial\mu}{\partial\gamma}\right)^T R^{-1}\mathbf{1}\frac{\partial\mu}{\partial\phi}
- (y-\mathbf{1}\mu)^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\frac{\partial\mu}{\partial\phi}
- (y-\mathbf{1}\mu)^T R^{-1}\mathbf{1}\frac{\partial^2\mu}{\partial\gamma\,\partial\phi}
\end{aligned} \tag{2.65}
\]
It can be written in matrix form as:
\[
\frac{\partial^2(n\sigma^2)}{\partial\gamma\,\partial\phi} = \mathbf{1}^T K'\,\mathbf{1} \tag{2.66}
\]

where K' is a 3 × 3 symmetric matrix whose entries are:
\[
K'_{ij} = \frac{\partial^{a^1_{ij}}(y-\mathbf{1}\mu)^T}{\partial\gamma^{\delta^1_i}\,\partial\phi^{\delta^1_j}}\;
\frac{\partial^{a^2_{ij}}R^{-1}}{\partial\gamma^{\delta^2_i}\,\partial\phi^{\delta^2_j}}\;
\frac{\partial^{a^3_{ij}}(y-\mathbf{1}\mu)}{\partial\gamma^{\delta^3_i}\,\partial\phi^{\delta^3_j}} \tag{2.67}
\]
and where a^h_{ij} has the same meaning as in eq. (2.64).


The second derivative of the inverse of the R matrix can be obtained as:
\[
\frac{\partial^2 R^{-1}}{\partial\gamma^2} = -\frac{\partial}{\partial\gamma}\left(R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\right) \tag{2.68}
\]
\[
\begin{aligned}
&= -\left(\frac{\partial R^{-1}}{\partial\gamma}\frac{\partial R}{\partial\gamma}R^{-1}
+ R^{-1}\frac{\partial^2 R}{\partial\gamma^2}R^{-1}
+ R^{-1}\frac{\partial R}{\partial\gamma}\frac{\partial R^{-1}}{\partial\gamma}\right) \\
&= 2\,R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}
- R^{-1}\frac{\partial^2 R}{\partial\gamma^2}R^{-1}
= 2\left(R^{-1}\frac{\partial R}{\partial\gamma}\right)^2 R^{-1} - R^{-1}\frac{\partial^2 R}{\partial\gamma^2}R^{-1}
\end{aligned} \tag{2.69}
\]

and the mixed derivative of the inverse of the R matrix can be obtained as:
\[
\frac{\partial^2 R^{-1}}{\partial\gamma\,\partial\phi} = -\frac{\partial}{\partial\gamma}\left(R^{-1}\frac{\partial R}{\partial\phi}R^{-1}\right) \tag{2.70}
\]
\[
\begin{aligned}
&= -\left(\frac{\partial R^{-1}}{\partial\gamma}\frac{\partial R}{\partial\phi}R^{-1}
+ R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}R^{-1}
+ R^{-1}\frac{\partial R}{\partial\phi}\frac{\partial R^{-1}}{\partial\gamma}\right) \\
&= R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi}R^{-1}
- R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}R^{-1}
+ R^{-1}\frac{\partial R}{\partial\phi}R^{-1}\frac{\partial R}{\partial\gamma}R^{-1} \\
&= 2\,R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi}R^{-1}
- R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}R^{-1}
\end{aligned} \tag{2.71}
\]
where the last equality follows from lemma 3.



The second derivative of µ is
\[
\begin{aligned}
\frac{\partial^2\mu}{\partial\gamma^2}
&= \frac{\partial}{\partial\gamma}\left[\frac{\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\,y}{\mathbf{1}^T R^{-1}\mathbf{1}}
- \frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}\right] \\
&= \frac{\mathbf{1}^T\frac{\partial^2 R^{-1}}{\partial\gamma^2}\,y}{\mathbf{1}^T R^{-1}\mathbf{1}}
- 2\,\frac{\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\,y\right)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}
- \frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial^2 R^{-1}}{\partial\gamma^2}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}
+ 2\,\frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)^2}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^3}
\end{aligned} \tag{2.72}
\]

and the mixed counterpart is:
\[
\begin{aligned}
\frac{\partial^2\mu}{\partial\gamma\,\partial\phi}
&= \frac{\partial}{\partial\gamma}\left[\frac{\mathbf{1}^T\frac{\partial R^{-1}}{\partial\phi}\,y}{\mathbf{1}^T R^{-1}\mathbf{1}}
- \frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\phi}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}\right] \\
&= \frac{\mathbf{1}^T\frac{\partial^2 R^{-1}}{\partial\gamma\,\partial\phi}\,y}{\mathbf{1}^T R^{-1}\mathbf{1}}
- \frac{\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\phi}\,y\right)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)
+ \left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\,y\right)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\phi}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2} \\
&\quad- \frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial^2 R^{-1}}{\partial\gamma\,\partial\phi}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^2}
+ 2\,\frac{\bigl(\mathbf{1}^T R^{-1}y\bigr)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\gamma}\mathbf{1}\right)\left(\mathbf{1}^T\frac{\partial R^{-1}}{\partial\phi}\mathbf{1}\right)}{\bigl(\mathbf{1}^T R^{-1}\mathbf{1}\bigr)^3}
\end{aligned} \tag{2.73}
\]

Now that everything is written in terms of the (first or second) derivative of the R matrix, we only need to calculate explicitly the (second) derivative of the R matrix. Let us start with θ = (θ_1, ..., θ_k).

\[
\frac{\partial^2 R_{ij}}{\partial\theta_h^2}
= \frac{\partial}{\partial\theta_h}\left(-\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right)
= \left|x_h^i - x_h^j\right|^{2p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.74}
\]

whereas for p = (p_1, ..., p_k)
\[
\begin{aligned}
\frac{\partial^2 R_{ij}}{\partial p_h^2}
&= \frac{\partial}{\partial p_h}\left(-\theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right) \\
&= -\theta_h\left(\ln\left|x_h^i - x_h^j\right|\right)^2\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
+ \theta_h^2\left(\ln\left|x_h^i - x_h^j\right|\right)^2\left|x_h^i - x_h^j\right|^{2p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \\
&= \left(\theta_h\left|x_h^i - x_h^j\right|^{p_h} - 1\right)\theta_h\left(\ln\left|x_h^i - x_h^j\right|\right)^2\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
\end{aligned} \tag{2.75}
\]

To calculate the mixed derivatives we must distinguish four cases:

• θ_h p_l with h ≠ l
• θ_h p_l with h = l → θ_h p_h
• p_h p_l with h ≠ l
• θ_h θ_l with h ≠ l

θ_h p_l with h ≠ l:
\[
\frac{\partial^2 R_{ij}}{\partial\theta_h\,\partial p_l}
= \frac{\partial}{\partial p_l}\left(-\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right)
= \theta_l\,\ln\left|x_l^i - x_l^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\left|x_l^i - x_l^j\right|^{p_l}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.76}
\]
The order of derivation should be irrelevant, therefore we check eq. (2.76) by calculating the same derivative starting from the derivative with respect to p_l:
\[
\frac{\partial^2 R_{ij}}{\partial p_l\,\partial\theta_h}
= \frac{\partial}{\partial\theta_h}\left(-\theta_l\,\ln\left|x_l^i - x_l^j\right|\,\left|x_l^i - x_l^j\right|^{p_l}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right)
= \theta_l\,\ln\left|x_l^i - x_l^j\right|\,\left|x_l^i - x_l^j\right|^{p_l}\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.77}
\]

θ_h p_l with h = l:
\[
\begin{aligned}
\frac{\partial^2 R_{ij}}{\partial\theta_h\,\partial p_h}
&= \frac{\partial}{\partial p_h}\left(-\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right) \\
&= -\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
+ \theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{2p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \\
&= \left(\theta_h\left|x_h^i - x_h^j\right|^{p_h} - 1\right)\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
\end{aligned} \tag{2.78}
\]

Again we check the consistency of the derivative by inverting the order of derivation:
\[
\begin{aligned}
\frac{\partial^2 R_{ij}}{\partial p_h\,\partial\theta_h}
&= \frac{\partial}{\partial\theta_h}\left(-\theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right) \\
&= -\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
+ \theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{2p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \\
&= \left(\theta_h\left|x_h^i - x_h^j\right|^{p_h} - 1\right)\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}
\end{aligned} \tag{2.79}
\]

In this case the second derivative of the R matrix is a symmetric matrix with
null principal diagonal.

p_h p_l with h ≠ l:
\[
\frac{\partial^2 R_{ij}}{\partial p_h\,\partial p_l}
= \frac{\partial}{\partial p_l}\left(-\theta_h\,\ln\left|x_h^i - x_h^j\right|\,\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right)
= \theta_h\theta_l\,\ln\left|x_h^i - x_h^j\right|\,\ln\left|x_l^i - x_l^j\right|\,\left|x_l^i - x_l^j\right|^{p_l}\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.80}
\]

θ_h θ_l with h ≠ l:
\[
\frac{\partial^2 R_{ij}}{\partial\theta_h\,\partial\theta_l}
= \frac{\partial}{\partial\theta_l}\left(-\left|x_h^i - x_h^j\right|^{p_h}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}}\right)
= \left|x_h^i - x_h^j\right|^{p_h}\left|x_l^i - x_l^j\right|^{p_l}\,e^{-\sum_{k=1}^{N_d}\theta_k\left|x_k^i - x_k^j\right|^{p_k}} \tag{2.81}
\]

2.7 Some Theorems and properties used throughout the derivation

Theorem 1. The trace is invariant under cyclic permutations: tr(ABC) = tr(CAB) = tr(BCA) (hence it is invariant under similarity transformations).

Theorem 2. The trace of a matrix is a linear operator.

Lemma 3. The two matrices R⁻¹ ∂R/∂γ and R⁻¹ ∂R/∂φ commute.
Proof. Let's start with the second derivative of the log of the determinant of the matrix R, eq. (2.58):
\[
\frac{\partial^2\log|R|}{\partial\gamma\,\partial\phi}
= \frac{\partial}{\partial\gamma}\operatorname{tr}\!\left(R^{-1}\frac{\partial R}{\partial\phi}\right)
= \operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi} + R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}\right)
\]
but, considering the symmetry of mixed derivatives,
\[
\frac{\partial^2 f(\gamma,\phi)}{\partial\gamma\,\partial\phi} = \frac{\partial^2 f(\gamma,\phi)}{\partial\phi\,\partial\gamma},
\]
we must have
\[
\frac{\partial^2\log|R|}{\partial\gamma\,\partial\phi} = \frac{\partial^2\log|R|}{\partial\phi\,\partial\gamma}
\]
therefore:
\[
\operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi} + R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}\right)
= \operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\phi}R^{-1}\frac{\partial R}{\partial\gamma} + R^{-1}\frac{\partial^2 R}{\partial\gamma\,\partial\phi}\right)
\]
from which it follows that
\[
\operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi}\right)
= \operatorname{tr}\!\left(-R^{-1}\frac{\partial R}{\partial\phi}R^{-1}\frac{\partial R}{\partial\gamma}\right)
\]
The equality holds if the arguments are equal, therefore we obtain:
\[
R^{-1}\frac{\partial R}{\partial\gamma}R^{-1}\frac{\partial R}{\partial\phi} = R^{-1}\frac{\partial R}{\partial\phi}R^{-1}\frac{\partial R}{\partial\gamma}
\qquad\text{or}\qquad
\left[R^{-1}\frac{\partial R}{\partial\gamma},\,R^{-1}\frac{\partial R}{\partial\phi}\right] = 0. \qquad\square
\]
Chapter 3

Thermodynamics

3.1 Helmholtz Free Energy


Starting with the first law:
dU = δW + δq (3.1)
at constant temperature and volume we have

dU = δq (3.2)

which, by using the second law of thermodynamics, leads for a generic transformation to:
dU ≤ T dS (3.3)
Because the temperature is constant we can write:

d(U − T S ) ≤ 0 (3.4)

We can define a new state function A called Helmholtz free energy as A = U − T S .

Appendices

Appendix A

Miscellanea

A.1 Shift invariance of the convolution of periodic functions


It is easily shown that for f and g, both 2π-periodic functions on [−π, π], we have
\[
(f * g)(x) = \int_{-\pi}^{\pi} f(x - y)\,g(y)\,dy = \int_{-\pi}^{\pi} f(z)\,g(x - z)\,dz = (g * f)(x),
\]
by using the substitution z = x − y.


The point is to check the bounds of the integral: since y ranges from −π to π, z = x − y ranges from x + π down to x − π. Therefore:
\[
\int_{-\pi}^{\pi} f(x - y)\,g(y)\,dy
= -\int_{x+\pi}^{x-\pi} f(z)\,g(x - z)\,dz
= \int_{x-\pi}^{x+\pi} f(z)\,g(x - z)\,dz
= \int_{-\pi}^{\pi} f(z)\,g(x - z)\,dz
\]
In the second-to-last step we swap the two bounds of the integral by changing the sign. In the final step both bounds are shifted by −x, which does not change the value because we are integrating a 2π-periodic function over an interval of length 2π.
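A small numerical check of the commutativity (f ∗ g)(x) = (g ∗ f)(x) is sketched below; the two 2π-periodic functions and the quadrature grid are chosen only for the example.

```python
# Numerical check that (f * g)(x) = (g * f)(x) for 2*pi-periodic functions (example functions).
import numpy as np

f = lambda t: np.sin(t) + 0.3 * np.cos(2 * t)
g = lambda t: np.exp(np.cos(t))                 # both functions are 2*pi-periodic

y = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
dy = y[1] - y[0]

def conv(a, b, x):
    return np.sum(a(x - y) * b(y)) * dy         # Riemann sum over one period

for x in (0.0, 1.1, -2.3):
    print(np.isclose(conv(f, g, x), conv(g, f, x)))
```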

A.2 Stirling Approximation

The correct definition is:
\[
\lim_{n\to+\infty}\frac{\sqrt{2\pi n}\,\bigl(\tfrac{n}{e}\bigr)^n}{n!} = 1 \tag{A.1}
\]
In approximate form, for n an integer,
\[
\ln n! \approx \left(n + \frac{1}{2}\right)\ln n - n + \frac{1}{2}\ln(2\pi) \tag{A.2}
\]
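A short numerical check of eq. (A.2), using the identity ln n! = lgamma(n + 1):

```python
# Numerical check of eq. (A.2): ln n! vs the Stirling approximation.
import math

for n in (5, 20, 100):
    exact = math.lgamma(n + 1)                    # ln n!
    approx = (n + 0.5) * math.log(n) - n + 0.5 * math.log(2 * math.pi)
    print(n, exact, approx, exact - approx)       # the difference shrinks roughly like 1/(12 n)
```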

Appendix B

Multivariate Gaussian

A vector-valued random variable Y = [Y_1, Y_2, ..., Y_n] is said to have a multivariate normal (or Gaussian) distribution with mean µ ∈ R^n and covariance matrix Σ ∈ S^n_{++} if its probability density function is given by
\[
p(x; \mu, \Sigma) = \frac{1}{\bigl((2\pi)^n\,|\Sigma|\bigr)^{1/2}}\;e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)} \tag{B.1}
\]
where S^n_{++} = {A ∈ R^{n×n} : A = A^T and x^T A x > 0 for all x ∈ R^n such that x ≠ 0} is the space of symmetric positive definite n × n matrices. We write Y ∼ N(µ, Σ).
The density function of a univariate normal (Gaussian) distribution is given by
\[
p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\;e^{-\frac{1}{2\sigma^2}(x-\mu)^2}. \tag{B.2}
\]
Here the argument of the exponential function, −\frac{1}{2\sigma^2}(x − µ)², is a quadratic function of the variable x. The coefficient in front, \frac{1}{\sqrt{2\pi}\,\sigma}, is a normalization factor which ensures that
\[
\int_{-\infty}^{+\infty}\frac{1}{\sqrt{2\pi}\,\sigma}\;e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\,dx = 1 \tag{B.3}
\]
In the case of the multivariate Gaussian density, the argument of the exponential, −\frac{1}{2}(x − µ)^T Σ^{-1}(x − µ), is a quadratic form in the variable x. Since Σ is positive definite, and since the inverse of any positive definite matrix is also positive definite, we have z^T Σ^{-1} z > 0 for any non-zero vector z. This implies that for any vector x ≠ µ,
\[
(x - \mu)^T\Sigma^{-1}(x - \mu) > 0,
\qquad
-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu) < 0.
\]

As in the univariate case, the coefficient in front is a normalization constant:
\[
\frac{1}{\bigl((2\pi)^n\,|\Sigma|\bigr)^{1/2}}
\int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty}\!\!\cdots\int_{-\infty}^{+\infty}
e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}\,dx_1\,dx_2\cdots dx_n = 1 \tag{B.4}
\]


B.1 Covariance Matrix


The concept of the covariance matrix is very important in the context of the multivariate Gaussian distribution. For a pair of random variables X and Y their covariance is defined as
\[
\operatorname{Cov}[X, Y] = E\bigl[(X - E[X])(Y - E[Y])\bigr] = E[XY] - E[X]\,E[Y] \tag{B.5}
\]

The covariance matrix Σ is the n × n matrix whose (i, j)-th entry is Cov[Xi , X j ].

Proposition 4. For any random vector X with mean µ and covariance matrix Σ,
\[
\Sigma = E\bigl[(X - \mu)(X - \mu)^T\bigr] = E[XX^T] - \mu\mu^T \tag{B.6}
\]

Proof. We prove the first of the two equalities in eq. (B.6); the other is very similar.
\[
\begin{aligned}
\Sigma &= \begin{pmatrix}
\operatorname{Cov}[X_1, X_1] & \cdots & \operatorname{Cov}[X_1, X_n] \\
\vdots & \ddots & \vdots \\
\operatorname{Cov}[X_n, X_1] & \cdots & \operatorname{Cov}[X_n, X_n]
\end{pmatrix} \\
&= \begin{pmatrix}
E[(X_1-\mu_1)^2] & \cdots & E[(X_1-\mu_1)(X_n-\mu_n)] \\
\vdots & \ddots & \vdots \\
E[(X_n-\mu_n)(X_1-\mu_1)] & \cdots & E[(X_n-\mu_n)^2]
\end{pmatrix} \\
&= E\begin{pmatrix}
(X_1-\mu_1)^2 & \cdots & (X_1-\mu_1)(X_n-\mu_n) \\
\vdots & \ddots & \vdots \\
(X_n-\mu_n)(X_1-\mu_1) & \cdots & (X_n-\mu_n)^2
\end{pmatrix} \\
&= E\left[\begin{pmatrix} X_1-\mu_1 \\ \vdots \\ X_n-\mu_n \end{pmatrix}
\begin{pmatrix} X_1-\mu_1 & \cdots & X_n-\mu_n \end{pmatrix}\right]
= E\bigl[(X - \mu)(X - \mu)^T\bigr]
\end{aligned} \tag{B.7}
\]
where we used the fact that the expectation of a matrix is simply the matrix obtained by taking the componentwise expectation of each entry. □

The symmetric positive definite property of a covariance matrix derives from the following

Proposition 5. Suppose Σ is the covariance matrix corresponding to some random


vector X. Then Σ is symmetric positive definite.

Proof. The symmetry of Σ follows immediately from its definition. Next, for any vector z ∈ R^n, observe that
\[
\begin{aligned}
z^T\Sigma z &= \sum_{i=1}^{n}\sum_{j=1}^{n}\Sigma_{ij}\,z_i z_j
= \sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{Cov}[X_i, X_j]\,z_i z_j \\
&= \sum_{i=1}^{n}\sum_{j=1}^{n} E\bigl[(X_i - E[X_i])(X_j - E[X_j])\bigr]\,z_i z_j
= E\Bigl[\sum_{i=1}^{n}\sum_{j=1}^{n}(X_i - E[X_i])(X_j - E[X_j])\,z_i z_j\Bigr]
\end{aligned} \tag{B.8}
\]
where we used the formula for expanding quadratic forms and the linearity of the expectation operator. To complete the proof, we observe that the quantity inside the expectation is of the form Σ_i Σ_j x_i x_j z_i z_j = (x^T z)² ≥ 0. Therefore the quantity inside the expectation is always non-negative, and hence the expectation itself must be non-negative. We conclude that z^T Σ z ≥ 0. For Σ^{-1} to exist, as required in the definition of the multivariate Gaussian density, Σ must be invertible and hence full rank. Since any full-rank symmetric positive semidefinite matrix is necessarily positive definite, it follows that Σ must be symmetric positive definite. □
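The following sketch illustrates Propositions 4 and 5 numerically: a sample covariance matrix is symmetric with non-negative eigenvalues, and, being non-singular here, it can be used in the density of eq. (B.1). The correlated data are random numbers generated only for the example.

```python
# Numerical illustration of Propositions 4 and 5 with random data (invented for the example).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
L = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
X = rng.normal(size=(500, 3)) @ L.T             # correlated 3-D sample

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)                 # sample covariance matrix

print("symmetric:", np.allclose(Sigma, Sigma.T))
print("eigenvalues >= 0:", np.all(np.linalg.eigvalsh(Sigma) >= -1e-12))   # z^T Sigma z >= 0

# With an invertible Sigma the density of eq. (B.1) is well defined:
print("density at the mean:", multivariate_normal(mean=mu, cov=Sigma).pdf(mu))
```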
Appendix C

Polarization Formula

The polarization formula allows one to write the inner product of two generic elements of a Hilbert space¹ as a sum of norms in the same space. We derive only the complex case, since the real case is just a special case of it.
Let H(x, y) be a Hermitian form, so H(y, x) = \overline{H(x, y)}, and the single-variable function Q(x) = H(x, x) satisfies Q(cx) = H(cx, cx) = c\bar{c}\,H(x, x) = |c|^2\,Q(x) for c ∈ C. Let's start by obtaining the real part of H(x, y) by looking at Q(x + y) and Q(x − y):

\[
Q(x + y) = H(x + y, x + y) = H(x, x) + H(x, y) + H(y, x) + H(y, y) = Q(x) + 2\,\Re\bigl(H(x, y)\bigr) + Q(y) \tag{C.1}
\]
and
\[
Q(x - y) = H(x - y, x - y) = H(x, x) - H(x, y) - H(y, x) + H(y, y) = Q(x) - 2\,\Re\bigl(H(x, y)\bigr) + Q(y). \tag{C.2}
\]
Therefore we can solve for the real part of H(x, y) by subtracting the second result from the first:
\[
Q(x + y) - Q(x - y) = 4\,\Re\bigl(H(x, y)\bigr)
\;\Longrightarrow\;
\Re\bigl(H(x, y)\bigr) = \frac{1}{4}\bigl(Q(x + y) - Q(x - y)\bigr).
\]
That is the first part of the complex polarization formula. Of course, if H(x, y) is a real quantity then ℜ(H(x, y)) = H(x, y) and we have proved the polarization formula in the real case. To get a formula for the imaginary part, note that ℑ(H(x, y)) = ℜ(−iH(x, y)) = ℜ(H(x, iy)), so if we run through the above work with iy in place of y then we get
\[
\Im\bigl(H(x, y)\bigr) = \Re\bigl(H(x, iy)\bigr) = \frac{1}{4}\bigl(Q(x + iy) - Q(x - iy)\bigr).
\]
¹ I am studying the Hilbert space now, so I am using it only for ti


If we write H(x, y) as ℜ(H(x, y)) + iℑ(H(x, y)) and feed the formulas for the real and imaginary parts of H(x, y) into this, the complex polarization formula appears:
\[
4\,H(x, y) = Q(x + y) - Q(x - y) + i\,Q(x + iy) - i\,Q(x - iy) \tag{C.3}
\]

We don't have to use Q(x + y) and Q(x − y), or Q(x + iy) and Q(x − iy); just one of each would suffice: since Q(x + y) = Q(x) + 2ℜ(H(x, y)) + Q(y), we have
\[
\Re\bigl(H(x, y)\bigr) = \frac{1}{2}\bigl(Q(x + y) - Q(x) - Q(y)\bigr), \tag{C.4}
\]
and
\[
\Im\bigl(H(x, y)\bigr) = \Re\bigl(H(x, iy)\bigr) = \frac{1}{2}\bigl(Q(x + iy) - Q(x) - Q(iy)\bigr), \tag{C.5}
\]
and Q(iy) = H(iy, iy) = i\,H(y, iy) = i\,\overline{H(iy, y)} = i\,\overline{i\,H(y, y)} = -i\,i\,\overline{H(y, y)} = Q(y), so
\[
\begin{aligned}
H(x, y) &= \Re\bigl(H(x, y)\bigr) + i\,\Im\bigl(H(x, y)\bigr) \\
&= \frac{1}{2}\bigl(Q(x + y) - Q(x) - Q(y) + i\,(Q(x + iy) - Q(x) - Q(y))\bigr) \\
&= \frac{1}{2}\bigl(Q(x + y) + i\,Q(x + iy) - (1 + i)\,Q(x) - (1 + i)\,Q(y)\bigr). \tag{C.6}
\end{aligned}
\]
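A quick numerical check of eq. (C.3) on Cⁿ is sketched below, using the Hermitian form H(x, y) = Σ_k x_k ȳ_k (linear in the first argument, conjugate-linear in the second, matching the convention used above); the random vectors are generated only for the check.

```python
# Numerical check of the polarization formula eq. (C.3) on C^n.
# H(x, y) = sum_k x_k * conj(y_k): linear in x, conjugate-linear in y (the convention used above).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=4) + 1j * rng.normal(size=4)
y = rng.normal(size=4) + 1j * rng.normal(size=4)

H = lambda a, b: np.vdot(b, a)       # np.vdot conjugates its first argument
Q = lambda a: H(a, a).real           # Q(a) = H(a, a) is real

rhs = Q(x + y) - Q(x - y) + 1j * Q(x + 1j * y) - 1j * Q(x - 1j * y)
print(np.isclose(4 * H(x, y), rhs))  # eq. (C.3)
```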
2
Finally, what happens for a Hermitian form on a vector space over an arbitrary field? Let V be a vector space over a field F and σ be an automorphism of F with order 2. If H : V × V → F is σ-Hermitian, so H(w, v) = σ(H(v, w)), then is H determined by the function Q(v) = H(v, v)? For a ∈ F,
\[
Q(av + w) = (N a)\,Q(v) + Q(w) + \operatorname{Tr}\bigl(a\,H(v, w)\bigr),
\]
where N and Tr are the norm and trace maps F → F^σ (where F^σ is the fixed field of σ on F, so the extension F/F^σ has degree 2 and is Galois). From the
nondegeneracy of the trace pairing on separable extensions, such as F/F σ , the
above formula as a varies in F shows Q-values determine H-values, even if F has
characteristic 2, when the usual polarization formula over C, with division by 4,
makes no sense.
