Chapter 1
Legendre Equation
The Legendre equation

(1 − x²) y″ − 2x y′ + ( λ − m²/(1 − x²) ) y = 0   (1.1)

arises when the equation ∆u = f(ρ)u is solved with separation of variables in spherical coordinates. The function y(cos θ) describes the polar part of the solution of ∆u = f(ρ)u.
c_2 = − [ l(l + 1) / 2! ] c_0

c_4 = − [ (l − 2)(l + 3) / ((2 + 2)(2 + 1)) ] c_2 = [ l(l − 2)(l + 3)(l + 1) / 4! ] c_0

c_6 = − [ (l − 4)(l + 5) / ((4 + 2)(4 + 1)) ] c_4 = − [ l(l − 2)(l − 4)(l + 5)(l + 3)(l + 1) / 6! ] c_0

⋮

c_{2n} = (−1)^n [ l(l − 2) · · · (l − 2(n − 1)) · (l + 2n − 1)(l + 2n − 3) · · · (l + 1) / (2n)! ] c_0
where we considered k = 2n because k is an even number. From the expanded coefficients we now derive a closed formula for the Legendre polynomials:
c_{2n} = (−1)^n [ l(l − 2) · · · (l − 2(n − 1)) · (l + 2n − 1)(l + 2n − 3) · · · (l + 1) / (2n)! ] c_0

Using the identities

l(l − 2) · · · (l − 2(n − 1)) = 2^n (l/2)! / (l/2 − n)!
(l + 2n − 1)(l + 2n − 3) · · · (l + 1) = (l + 2n)! (l/2)! / ( l! 2^n (l/2 + n)! )

the powers of 2 cancel and the products collapse into factorials:

c_{2n} = (−1)^n [ (l + 2n)! ((l/2)!)² / ( (2n)! l! (l/2 + n)! (l/2 − n)! ) ] c_0   (1.6)

Therefore

c_{2n} = (−1)^n [ (l + 2n)! ((l/2)!)² / ( (2n)! l! (l/2 + n)! (l/2 − n)! ) ] c_0   (1.7)
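The chain of coefficients above amounts to the two-term recursion c_{k+2} = −(l − k)(l + k + 1)/((k + 2)(k + 1)) c_k. As a quick numerical cross-check (a sketch, not part of the original derivation), the snippet below compares that recursion with the closed formula (1.7) for a few even values of l:

from math import factorial

def c2n_recursion(l, n, c0=1.0):
    # c_{2n} from the recursion c_{k+2} = -(l - k)(l + k + 1)/((k + 2)(k + 1)) c_k
    c = c0
    for k in range(0, 2 * n, 2):
        c *= -(l - k) * (l + k + 1) / ((k + 2) * (k + 1))
    return c

def c2n_closed(l, n, c0=1.0):
    # c_{2n} from the closed formula (1.7); valid for even l with 2n <= l
    half = l // 2
    return ((-1) ** n * factorial(l + 2 * n) * factorial(half) ** 2 /
            (factorial(2 * n) * factorial(l) * factorial(half + n) * factorial(half - n))) * c0

for l in (4, 6, 8):
    for n in range(l // 2 + 1):
        assert abs(c2n_recursion(l, n) - c2n_closed(l, n)) < 1e-9 * max(1.0, abs(c2n_closed(l, n)))
print("closed formula (1.7) agrees with the recursion for even l")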
Now, for l even, the series terms are non-zero until 2n = l, that is to say that in the series the index n assumes all the integer values between 0 and l/2. Therefore, we can substitute the index 2n with l − 2k, i.e. we are counting the
Chapter 2
Statistics
2.1 Definitions
Probability Space: a measure space such that the measure of the whole space is equal to one. It can be represented as (Ω, F, P) where:
– Ω ∈ F
– F is closed under complements: if A ∈ F then (Ω \ A) ∈ F
– F is closed under countable unions: if A_i ∈ F with i ∈ N then (∪_i A_i) ∈ F
– F is closed under countable intersections: if A_i ∈ F with i ∈ N then (∩_i A_i) ∈ F
Random Variable: given a probability space (Ω, F, P) one can define a random variable as a function X : Ω → R which is measurable in the sense that the preimage of any Borel set B in R is in F. The interpretation is that if ω is an experiment, then X(ω) measures an observable quantity of the experiment. A random variable possesses an expectation value E[X]. With the expectation it is possible to define the notions of variance Var[X] = E[X²] − E[X]² and standard deviation σ_X = √(Var[X]). The correlation Cor(X, Y) = Cov(X, Y)/(σ_X σ_Y) of two random variables X, Y with positive variance is a quantity that tells how much the random variable X is related to the random variable Y.
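A minimal numerical illustration of these definitions (the data are simulated and the variable names are invented for the example): estimating Var[X], σ_X and Cor(X, Y) from samples with NumPy.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)              # samples of X
y = 0.7 * x + rng.normal(size=10_000)    # Y is partially driven by X

var_x = np.mean(x ** 2) - np.mean(x) ** 2        # Var[X] = E[X^2] - E[X]^2
sigma_x = np.sqrt(var_x)                         # standard deviation of X
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
cor_xy = cov_xy / (sigma_x * np.std(y))          # Cor(X, Y) = Cov(X, Y)/(sigma_X sigma_Y)

print(var_x, sigma_x, cor_xy)                    # cor_xy is close to np.corrcoef(x, y)[0, 1]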
It is important to note that the prior, p(θ), is fixed before our observations and so can be treated as invariant to our problem. This means that we can rewrite eq. (2.5) as

p(θ|y) ∝ k(y) p(y|θ)   (2.6)

where k(y) = p(θ)/p(y) is an unknown function of the data. Since k(y) is not a function of θ, it is treated as an unknown positive constant. Put differently, for a given set of observed data, k(y) remains the same over all possible hypothetical values of θ.
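As an illustration of why the unknown constant k(y) is harmless, here is a small sketch (the binomial model and the numbers are invented for the example): comparing hypothetical values of θ only requires the unnormalised product prior × likelihood, because k(y) rescales every value by the same amount.

import numpy as np
from scipy.stats import binom

y_obs, n_trials = 7, 10                      # observed data (made up)
theta = np.linspace(0.01, 0.99, 99)          # hypothetical parameter values

prior = np.ones_like(theta)                  # p(theta), fixed before seeing the data
unnorm = prior * binom.pmf(y_obs, n_trials, theta)   # proportional to p(theta|y)
posterior = unnorm / np.trapz(unnorm, theta)         # normalising only rescales the curve

print(theta[np.argmax(unnorm)], theta[np.argmax(posterior)])   # same argmax either way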
2.3 Likelihood
Without knowing the prior or making assumptions about it, we cannot calculate the inverse probability in eq. (2.6). This is the point where the notion of likelihood and the Likelihood Axiom were introduced. The likelihood is defined to be proportional to the probability of observing the data, treating the parameters of the distribution as variables and the data as fixed. The advantage of the likelihood is that it can be calculated from a traditional probability, p(y|θ), whereas an inverse probability cannot be calculated in any way. Note that we can only compare likelihoods for the same set of data and the same prior.
The best estimator, θ̂, is whatever value of θ maximizes the likelihood, θ̂ = arg max_θ L(θ|y).
In effect, we are looking for the θ̂ that maximizes the likelihood of observing our sample. Because of the proportional relationship, the θ̂ that maximizes L(θ|y) will also maximize p(θ|y), i.e. the posterior probability of the hypothesis given the data. This is what we wanted from the beginning.
Since the logarithm is a monotonically increasing function and the likelihood is a non-negative function, they attain their maximum at the same position. The logarithm of the likelihood is therefore commonly used to calculate the maximum of L.
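A short sketch of maximum-likelihood estimation through the log-likelihood (the Gaussian sample below is assumed purely for illustration): maximising log L gives the same θ̂ as maximising L and is numerically much better behaved.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.5, size=500)     # synthetic data, true (mu, sigma) = (2, 1.5)

def neg_log_likelihood(params, data):
    mu, log_sigma = params                       # optimise log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(((data - mu) / sigma) ** 2) + data.size * np.log(sigma)

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(y,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                         # close to the sample mean and standard deviation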
For a ML estimator, the Information matrix is defined as the negative of the expected value of the Hessian of the likelihood with respect to the parameters, I(θ) = −E[H(θ)], where

H_ij(θ) = ∂²L / (∂θ_i ∂θ_j)   (2.10)
var(θ) = [I(θ)]⁻¹ = −( E[H(θ)] )⁻¹ = −( E[ ∂²L / (∂θ_i ∂θ_j) ] )⁻¹   (2.11)
This means that any unbiased estimator that achieves this lower bound is efficient
and no better unbiased estimator is possible. The inverse of the information matrix
for MLE is exactly the same as the Cramer-Rao lower bound. This means that
MLE is efficient.
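A hedged numerical illustration of the efficiency statement (the Gaussian-mean example is an assumption, not taken from the text): for y_i ~ N(µ, σ²) with known σ, the information is I(µ) = n/σ² and the variance of the ML estimator (the sample mean) reaches the Cramér-Rao bound 1/I(µ).

import numpy as np

rng = np.random.default_rng(2)
n, sigma, mu, n_rep = 50, 2.0, 1.0, 20_000

# the ML estimator of mu for N(mu, sigma^2) with known sigma is the sample mean
mu_hats = rng.normal(mu, sigma, size=(n_rep, n)).mean(axis=1)

fisher_info = n / sigma ** 2          # I(mu) = -E[d^2 log L / d mu^2]
cramer_rao = 1.0 / fisher_info        # lower bound on Var(mu_hat)
print(mu_hats.var(), cramer_rao)      # the two numbers agree closely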
µ = E[Ẑ_0] = E[λ_1 Z_1 + λ_2 Z_2 + · · · + λ_n Z_n]
  = λ_1 E[Z_1] + λ_2 E[Z_2] + · · · + λ_n E[Z_n]
  = λ_1 µ + · · · + λ_n µ
  = (λ_1 + · · · + λ_n) µ   (2.13)

where the second equality follows from the linearity of expectation E[·]. Because this procedure is supposed to work regardless of the value of µ, the coefficients evidently have to sum to unity. Writing the coefficients in vector notation λ = (λ_i)ᵀ, this can be neatly written 1 · λ = 1.
Among the set of all such unbiased linear predictors, we seek one that deviates as little from the real value as possible, measured in the root mean square. This, again, is a computation. It relies on the bilinearity and symmetry of covariance, whose application is responsible for the summations in the second line:

E[(Ẑ_0 − Z_0)²] = E[(λ_1 Z_1 + λ_2 Z_2 + · · · + λ_n Z_n − Z_0)²]
               = Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j Σ_{i,j} − 2 Σ_{i=1}^{n} λ_i Cov(Z_i, Z_0) + Var(Z_0)   (2.14)
• Typically, our predictions of z_0 would deviate by about σ_OK from the actual values of z_0.
Much more needs to be said before this can be applied to practical situations like estimating a surface from point data: we need additional assumptions about how the statistical characteristics of the spatial process vary from one location to another and from one realization to another (even though, in practice, usually only one realization will ever be available). But this exposition should be enough to follow how the search for a "Best" Unbiased Linear Predictor ("BLUP") leads straightforwardly to a system of linear equations.
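To make "a system of linear equations" concrete, here is a minimal ordinary-kriging sketch under the assumptions of this derivation (Σ known, constant unknown mean); the covariance model and the numbers are invented for the example. Minimising eq. (2.14) subject to 1 · λ = 1 with a Lagrange multiplier m leads to the block system solved below.

import numpy as np

def cov(d, sill=1.0, corr_len=2.0):
    # assumed covariance model for the example: exponential in distance
    return sill * np.exp(-np.abs(d) / corr_len)

x_data = np.array([0.0, 1.0, 2.5, 4.0])           # data locations
z_data = np.array([1.2, 0.7, 0.3, 0.9])           # observed values
x0 = 1.8                                           # prediction location

Sigma = cov(x_data[:, None] - x_data[None, :])     # Cov(Z_i, Z_j)
c0 = cov(x_data - x0)                              # Cov(Z_i, Z_0)

# kriging system: [[Sigma, 1], [1^T, 0]] [lambda, m]^T = [c0, 1]^T
A = np.block([[Sigma, np.ones((4, 1))], [np.ones((1, 4)), np.zeros((1, 1))]])
sol = np.linalg.solve(A, np.append(c0, 1.0))
lam, m = sol[:-1], sol[-1]

z0_hat = lam @ z_data                              # the BLUP of Z_0
sigma_ok2 = cov(0.0) - lam @ c0 - m                # kriging variance sigma_OK^2
print(z0_hat, sigma_ok2, lam.sum())                # the weights sum to 1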
By the way, kriging as usually practiced is not quite the same as least squares estimation, because Σ is estimated in a preliminary procedure (known as "variography") using the same data. That is contrary to the assumptions of this derivation,
which assumed Σ was known (and a fortiori independent of the data). Thus, at the
very outset, kriging has some conceptual and statistical flaws built into it. Thought-
ful practitioners have always been aware of this and found various creative ways to
(try to) justify the inconsistencies. (Having lots of data can really help.) Procedures
now exist for simultaneously estimating Σ and predicting a collection of values at
unknown locations. They require slightly stronger assumptions (multivariate nor-
mality) in order to accomplish this feat.
2.5 DACE
DACE stands for "Design and Analysis of Computer Experiments" and it has become a common name for methods that build a response surface for a generic process. Suppose we have the output y(x) of a deterministic experiment, where y(x) is the value of the output (a physical property, for example) at position x. In DACE this deterministic outcome is treated as the realization of a random function (i.e. a stochastic process) Y(x) described by a regression model:
Y(x) = Σ_{j=1}^{N} β_j f_j(x) + Z(x)   (2.16)
where Z(·) is a random process with zero mean and covariance between Z(x1 ) and
Z(x2 ) given by:
Cov (x1 , x2 ) = σ2 Cor(x1 , x2 ) (2.17)
where σ2 is the process variance and Cor(x1 , x2 ) is the correlation between the two
points.
Given a number M of observations y = (y(x_1), . . . , y(x_M))ᵀ we consider the linear predictor

ŷ(x) = cᵀ(x) · y   (2.18)

where x (without indices) represents some untried position (i.e. some position not contained in the set {x_1, . . . , x_M}), and c(x) = (c_1(x), . . . , c_M(x))ᵀ. The frequentist and the Bayesian approaches lead to two different ways to determine ŷ(x):
ŷ(x) = E[ Y(x) | y ]   (2.21)
In general the Bayesian and frequentist approaches lead to different methods and results, but in the special case of a Gaussian process for Z(·) and improper uniform priors on the β's the methods and results will be the same.
Let us now derive the expression for kriging. Define the vector f(x) = (f_1(x), . . . , f_N(x))ᵀ of the N functions in the regression, the M × N matrix

F = [ f(x_1)ᵀ ]   [ f_1(x_1) · · · f_N(x_1) ]
    [    ⋮    ] = [     ⋮      ⋱      ⋮     ]
    [ f(x_M)ᵀ ]   [ f_1(x_M) · · · f_N(x_M) ]

the matrix R of stochastic-process correlations between the Z's at the design sites, R_ij = Cor(x_i, x_j), and the vector r(x) of correlations between the design sites and the untried input x, r_i(x) = Cor(x_i, x).
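A short sketch of the objects just introduced (F, R and r(x)), using the power-exponential correlation family that appears later in eq. (2.54) and a constant regression term f(x) = 1; the design sites and the parameter values are invented for the example.

import numpy as np

def corr(x1, x2, theta, p):
    # power-exponential correlation: exp(-sum_k theta_k |x1_k - x2_k|^p_k)
    return np.exp(-np.sum(theta * np.abs(x1 - x2) ** p))

X = np.array([[0.0, 0.0], [0.5, 0.2], [1.0, 0.8], [0.2, 0.9], [0.7, 0.5]])  # M = 5 design sites
theta, p = np.array([1.5, 0.8]), np.array([2.0, 2.0])

F = np.ones((len(X), 1))                                     # constant trend: f_1(x) = 1
R = np.array([[corr(xi, xj, theta, p) for xj in X] for xi in X])
x_new = np.array([0.4, 0.4])                                 # an untried position
r = np.array([corr(xi, x_new, theta, p) for xi in X])        # correlations with the design sites

print(R.shape, r.shape, np.allclose(R, R.T))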
Starting from eq. (2.19) we obtain:

MSE[ŷ(x)] = E[ (cᵀ(x) · Y − Y(x))² ]
          = E[ (cᵀ(x) · Y)² + Y²(x) − 2Y(x)(cᵀ(x) · Y) ]
          = E[ c_1²(x)Y_1² + . . . + c_M²(x)Y_M²
              + 2c_1Y_1(c_2Y_2 + . . . + c_MY_M) + . . .
              + 2c_MY_M(c_1Y_1 + . . . + c_{M−1}Y_{M−1})
              + Y²(x) − 2Y(x)(c_1(x)Y_1 + . . . + c_M(x)Y_M) ]
          = E[ Σ_{i=1}^{M} Σ_{j=1}^{M} c_i(x)c_j(x)Y_iY_j − 2Y(x) Σ_{i=1}^{M} c_i(x)Y_i + Y²(x) ]   (2.24)
because of the linearity of the expectation operator E[·] and the facts that the f_j's are not random variables, the c_j's are constants, and Z(·) is a stochastic process with zero mean. In the same way, from eq. (2.16), we can write:
E[ Y(x_i)Y(x_j) ] = E[ ( Σ_{k=1}^{N} β_k f_k(x_i) + Z(x_i) ) ( Σ_{k=1}^{N} β_k f_k(x_j) + Z(x_j) ) ]
                 = E[ Σ_{k=1}^{N} β_k f_k(x_i) · Σ_{h=1}^{N} β_h f_h(x_j) ] + E[ Z(x_i)Z(x_j) ]
                 = Σ_{k=1}^{N} Σ_{h=1}^{N} β_k β_h f_k(x_i) f_h(x_j) + σ² Cor(x_i, x_j)   (2.26)
where the last term in the last equality follows from eq. (2.17) and the fact that:

Cov[Z_i, Z_j] = E[ (Z_i − E[Z_i])(Z_j − E[Z_j]) ]
             = E[Z_iZ_j] − E[Z_i]E[Z_j] − E[Z_j]E[Z_i] + E[Z_i]E[Z_j]
             = E[Z_iZ_j] − E[Z_i]E[Z_j]   (2.27)

and because we have assumed a random process with zero mean for Z(·) we eventually obtain:

Cov[Z_i, Z_j] = E[Z_iZ_j]   (2.28)
Coming back to eq. (2.24) we can now write:

MSE[ŷ(x)] = E[ Σ_{i=1}^{M} Σ_{j=1}^{M} c_i(x)c_j(x)Y_iY_j − 2Y(x) Σ_{i=1}^{M} c_i(x)Y_i + Y²(x) ]
and
E[ Σ_{i=1}^{M} Σ_{k=1}^{N} Σ_{h=1}^{N} β_k β_h c_i(x) f_k(x_i) f_h(x) ]
  = Σ_{i=1}^{M} Σ_{k=1}^{N} Σ_{h=1}^{N} β_k β_h c_i(x) f_k(x_i) f_h(x)
  = Σ_{k=1}^{N} Σ_{h=1}^{N} β_k β_h f_h(x) Σ_{i=1}^{M} c_i(x) f_k(x_i)
  = Σ_{k=1}^{N} Σ_{h=1}^{N} β_k β_h f_h(x) f_k(x)   (2.31)

where the last step uses the unbiasedness condition Σ_{i=1}^{M} c_i(x) f_k(x_i) = f_k(x), i.e. Fᵀc(x) = f(x) of eq. (2.20),
and
MSE[ŷ(x)] = E[ Σ_{i=1}^{M} Σ_{j=1}^{M} c_i(x)c_j(x)Z(x_i)Z(x_j) ] − 2 E[ Σ_{i=1}^{M} c_i(x)Z(x_i)Z(x) ] + E[ Z(x)Z(x) ]
          = Σ_{i=1}^{M} Σ_{j=1}^{M} c_i(x)c_j(x) E[ Z(x_i)Z(x_j) ] − 2 Σ_{i=1}^{M} c_i(x) E[ Z(x_i)Z(x) ] + E[ Z(x)Z(x) ]
          = σ² ( Σ_{i=1}^{M} Σ_{j=1}^{M} c_i(x)c_j(x) Cor(x_i, x_j) − 2 Σ_{i=1}^{M} c_i(x) Cor(x_i, x) + Cor(x, x) )
          = σ² ( 1 + cᵀRc − 2cᵀr )   (2.33)
To minimize eq. (2.33) subject to the condition of eq. (2.20) we use the method of Lagrange multipliers, defining the function Λ(x):

Λ(x) = MSE[ŷ(x)] + λ(x) · ( Fᵀc(x) − f(x) )   (2.34)
The resulting mean squared error of the predictor is

MSE = s(x⋆)² = σ² ( 1 − rᵀR⁻¹r + (1 − rᵀR⁻¹r)² / (1ᵀR⁻¹1) )   (2.37)
The derivative of the property (using the notation of Mills and Popelier, JCTC 2014) can be calculated as:

∂Q^T_Ω/∂α_i = ∂Q̂_Ω/∂α_i + ∂s_Ω/∂α_i   (2.38)
The only unknown term is ∂s/∂f_k. The derivative of s with respect to the features can be written as:
∂s/∂f_k = −σ² ( ∂rᵀ/∂f_k R⁻¹r + rᵀR⁻¹ ∂r/∂f_k + 2 (rᵀR⁻¹r)/(1ᵀR⁻¹1) ( ∂rᵀ/∂f_k R⁻¹r + rᵀR⁻¹ ∂r/∂f_k ) )
        = −σ² ( 2 (rᵀR⁻¹r)/(1ᵀR⁻¹1) + 1 ) ( ∂rᵀ/∂f_k R⁻¹r + rᵀR⁻¹ ∂r/∂f_k )   (2.40)
The likelihood of the observations is

L(µ, σ², θ_1, . . . , θ_k, p_1, . . . , p_k) = 1 / ((2πσ²)ⁿ |R|)^{1/2} · exp( −(y − 1µ)ᵀR⁻¹(y − 1µ) / (2σ²) )   (2.43)
Given the correlation parameters θ_h and p_h for h = 1, . . . , k we can solve for the values of µ and σ² that maximize the likelihood function in closed form:

µ̂ = (1ᵀR⁻¹y) / (1ᵀR⁻¹1)   (2.44)

and

σ̂² = (y − 1µ̂)ᵀR⁻¹(y − 1µ̂) / n   (2.45)
Substituting eqs. (2.44) and (2.45) in eq. (2.43) we get the so-called 'concentrated log-likelihood' function, which depends only upon the parameters θ_h and p_h for h = 1, . . . , k. The function that we have to maximize is therefore:

L(µ̂, σ̂², θ_1, . . . , θ_k, p_1, . . . , p_k) = 1 / ((2πσ̂²)ⁿ |R|)^{1/2} · e^{−n/2}   (2.46)
Because the logarithm is a monotonically increasing function, optimizing the likelihood is the same as maximizing the log of the likelihood. Therefore, the function to maximize is (after ignoring the constant terms):

−(n/2) log σ̂² − (1/2) log |R|   (2.47)
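A sketch of how eqs. (2.44), (2.45) and (2.47) would be evaluated for a given set of correlation parameters; the data, θ and p below are invented, and the correlation family is the one of eq. (2.54).

import numpy as np

def concentrated_log_likelihood(theta, p, X, y):
    # mu_hat (2.44), sigma2_hat (2.45) and the concentrated log-likelihood (2.47)
    n = len(y)
    d = np.abs(X[:, None, :] - X[None, :, :])          # pairwise |x_k^i - x_k^j|
    R = np.exp(-np.sum(theta * d ** p, axis=2))
    R += 1e-10 * np.eye(n)                             # tiny nugget for numerical conditioning
    Rinv_y = np.linalg.solve(R, y)
    Rinv_1 = np.linalg.solve(R, np.ones(n))
    mu = np.ones(n) @ Rinv_y / (np.ones(n) @ Rinv_1)   # eq. (2.44)
    resid = y - mu
    sigma2 = resid @ np.linalg.solve(R, resid) / n     # eq. (2.45)
    logdetR = np.linalg.slogdet(R)[1]
    return -0.5 * n * np.log(sigma2) - 0.5 * logdetR   # eq. (2.47)

X = np.array([[0.0], [0.3], [0.55], [0.8], [1.0]])
y = np.sin(6 * X[:, 0])
print(concentrated_log_likelihood(np.array([5.0]), np.array([2.0]), X, y))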
Before calculating the derivatives we consider two properties of the derivative of a matrix:

∂ log |R| / ∂γ = tr( R⁻¹ ∂R/∂γ )   (2.48)

∂R⁻¹/∂γ = −R⁻¹ (∂R/∂γ) R⁻¹   (2.49)

where ∂R/∂γ is the matrix of elementwise derivatives.
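Both identities can be checked numerically with finite differences; a minimal sketch (the one-parameter Gaussian correlation matrix is assumed only to have something concrete to differentiate):

import numpy as np

x = np.linspace(0.0, 1.0, 6)
D2 = (x[:, None] - x[None, :]) ** 2

R = lambda g: np.exp(-g * D2)            # R(gamma)
dR = lambda g: -D2 * np.exp(-g * D2)     # elementwise derivative dR/dgamma

g, h = 1.3, 1e-6
lhs_logdet = (np.linalg.slogdet(R(g + h))[1] - np.linalg.slogdet(R(g - h))[1]) / (2 * h)
rhs_logdet = np.trace(np.linalg.solve(R(g), dR(g)))               # tr(R^-1 dR/dgamma), eq. (2.48)

lhs_inv = (np.linalg.inv(R(g + h)) - np.linalg.inv(R(g - h))) / (2 * h)
rhs_inv = -np.linalg.inv(R(g)) @ dR(g) @ np.linalg.inv(R(g))      # eq. (2.49)

print(abs(lhs_logdet - rhs_logdet), np.abs(lhs_inv - rhs_inv).max())   # both differences are tiny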
Indeed, differentiating the identity R⁻¹R = I gives

∂I/∂γ = 0 = ∂(R⁻¹R)/∂γ = (∂R⁻¹/∂γ) R + R⁻¹ (∂R/∂γ)   ⟹   ∂R⁻¹/∂γ = −R⁻¹ (∂R/∂γ) R⁻¹
∂ ln σ² / ∂γ = (1/σ²) ∂σ²/∂γ
            = (1/(nσ²)) ∂/∂γ [ (y − 1µ)ᵀ R⁻¹ (y − 1µ) ]
            = (1/(nσ²)) [ (∂(y − 1µ)ᵀ/∂γ) R⁻¹ (y − 1µ) + (y − 1µ)ᵀ (∂R⁻¹/∂γ) (y − 1µ) + (y − 1µ)ᵀ R⁻¹ (∂(y − 1µ)/∂γ) ]   (2.50)
∂(y − 1µ)/∂γ = −1 ∂µ/∂γ

∂(y − 1µ)ᵀ/∂γ = ( ∂(y − 1µ)/∂γ )ᵀ = −(∂µ/∂γ) 1ᵀ   (2.51)
Now let us calculate the derivative of µ with respect to γ. Recalling eq. (2.44):

∂µ/∂γ = ∂/∂γ [ (1ᵀR⁻¹y) / (1ᵀR⁻¹1) ]   (2.52)
1 is a constant, therefore:

∂µ/∂γ = [ 1ᵀ (∂R⁻¹/∂γ) y ] (1ᵀR⁻¹1) / (1ᵀR⁻¹1)² − [ 1ᵀR⁻¹y ] [ 1ᵀ (∂R⁻¹/∂γ) 1 ] / (1ᵀR⁻¹1)²

with

∂R⁻¹/∂γ = −R⁻¹ (∂R/∂γ) R⁻¹   (2.53)
From eq. (2.48) we obtain the derivative of the first term in the log-likelihood function. Now everything is written as a function of the derivative of the R matrix. The last thing to do is to make explicit the derivatives of the entries of R with respect to θ_h and p_h for h = 1, . . . , k. Let us start with θ = (θ_1, . . . , θ_n).
∂R_ij/∂θ_h = ∂/∂θ_h exp( −Σ_{k=1}^{N_d} θ_k |x_k^i − x_k^j|^{p_k} )
           = −|x_h^i − x_h^j|^{p_h} exp( −Σ_{k=1}^{N_d} θ_k |x_k^i − x_k^j|^{p_k} )   (2.54)
∂R_ij/∂p_h = ∂/∂p_h exp( −Σ_{k=1}^{N_d} θ_k |x_k^i − x_k^j|^{p_k} )
           = −θ_h ( ∂/∂p_h |x_h^i − x_h^j|^{p_h} ) exp( −Σ_{k=1}^{N_d} θ_k |x_k^i − x_k^j|^{p_k} )   (2.55)
By using the following identity we can make the derivative in eq. (2.55) explicit:

θ_h |x_h^i − x_h^j|^{p_h} ≡ e^{ ln θ_h + p_h ln |x_h^i − x_h^j| }

∂/∂p_h ( θ_h |x_h^i − x_h^j|^{p_h} ) = θ_h ln |x_h^i − x_h^j| · e^{ p_h ln |x_h^i − x_h^j| }
and the complete derivative for p_h is:

∂R_ij/∂p_h = −θ_h ln |x_h^i − x_h^j| · e^{ p_h ln |x_h^i − x_h^j| } · exp( −Σ_{k=1}^{N_d} θ_k |x_k^i − x_k^j|^{p_k} )   (2.56)
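The analytic derivatives (2.54) and (2.56) can be checked against finite differences; a minimal sketch for a single pair of points (the coordinates and parameters are invented):

import numpy as np

xi, xj = np.array([0.3, 0.7]), np.array([0.9, 0.2])
theta, p = np.array([1.2, 0.6]), np.array([1.8, 1.4])
h = 0                                      # differentiate with respect to the first coordinate

def Rij(theta, p):
    return np.exp(-np.sum(theta * np.abs(xi - xj) ** p))

d = np.abs(xi[h] - xj[h])
dR_dtheta = -d ** p[h] * Rij(theta, p)                        # eq. (2.54)
dR_dp = -theta[h] * np.log(d) * d ** p[h] * Rij(theta, p)     # eq. (2.56), using e^{p_h ln d} = d^{p_h}

eps = 1e-7
e = np.eye(2)[h]
fd_theta = (Rij(theta + eps * e, p) - Rij(theta - eps * e, p)) / (2 * eps)
fd_p = (Rij(theta, p + eps * e) - Rij(theta, p - eps * e)) / (2 * eps)
print(abs(dR_dtheta - fd_theta), abs(dR_dp - fd_p))           # both differences are tiny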
∂² log |R| / ∂γ² = ∂/∂γ tr( R⁻¹ ∂R/∂γ ) = tr( ∂/∂γ ( R⁻¹ ∂R/∂γ ) )
                = tr( (∂R⁻¹/∂γ)(∂R/∂γ) + R⁻¹ ∂²R/∂γ² )
                = tr( −R⁻¹ (∂R/∂γ) R⁻¹ (∂R/∂γ) + R⁻¹ ∂²R/∂γ² )   (2.57)
and
∂² log |R| / ∂γ∂φ = ∂/∂γ tr( R⁻¹ ∂R/∂φ ) = tr( ∂/∂γ ( R⁻¹ ∂R/∂φ ) )
                 = tr( (∂R⁻¹/∂γ)(∂R/∂φ) + R⁻¹ ∂²R/∂γ∂φ )
                 = tr( −R⁻¹ (∂R/∂γ) R⁻¹ (∂R/∂φ) + R⁻¹ ∂²R/∂γ∂φ )   (2.58)
For the second term we start from the first derivative, eq. (2.50):

∂² ln σ² / ∂γ² = ∂/∂γ ( (1/σ²) ∂σ²/∂γ )
              = (1/σ²) ∂²σ²/∂γ² − (1/σ²)² ( ∂σ²/∂γ )²   (2.59)
and
∂² ln σ² / ∂γ∂φ = ∂/∂γ ( (1/σ²) ∂σ²/∂φ )   (2.60)
               = (1/σ²) ∂²σ²/∂γ∂φ − (1/σ²)² (∂σ²/∂γ)(∂σ²/∂φ)
Note that since σ² is a scalar quantity the order of multiplication is not important. The same is not true for the derivatives of the matrices and vectors. The only quantity left to calculate is the second derivative of σ²:
∂²(nσ²)/∂γ² = ∂/∂γ [ −(∂µ/∂γ) 1ᵀR⁻¹(y − 1µ) + (y − 1µ)ᵀ (∂R⁻¹/∂γ) (y − 1µ) − (y − 1µ)ᵀ R⁻¹ 1 (∂µ/∂γ) ]
  = −(∂²µ/∂γ²) 1ᵀR⁻¹(y − 1µ) − (∂µ/∂γ) 1ᵀ (∂R⁻¹/∂γ) (y − 1µ) + (∂µ/∂γ) 1ᵀR⁻¹1 (∂µ/∂γ)
    − (∂µ/∂γ) 1ᵀ (∂R⁻¹/∂γ) (y − 1µ) + (y − 1µ)ᵀ (∂²R⁻¹/∂γ²) (y − 1µ) − (y − 1µ)ᵀ (∂R⁻¹/∂γ) 1 (∂µ/∂γ)
    + (∂µ/∂γ) 1ᵀR⁻¹1 (∂µ/∂γ) − (y − 1µ)ᵀ (∂R⁻¹/∂γ) 1 (∂µ/∂γ) − (y − 1µ)ᵀ R⁻¹ 1 (∂²µ/∂γ²)   (2.61)
It can be written in matrix form as:

∂²(nσ²)/∂γ² = 1ᵀ K 1   (2.62)
where K is a 3 × 3 symmetric matrix whose entries are:

K_ij = ( ∂^{a¹_ij}(y − 1µ)ᵀ / ∂γ^{a¹_ij} ) ( ∂^{a²_ij}R⁻¹ / ∂γ^{a²_ij} ) ( ∂^{a³_ij}(y − 1µ) / ∂γ^{a³_ij} )   (2.63)

and where

a^h_ij = δ^h_i + δ^h_j,   h = 1, 2, 3   (2.64)

where δ^h_i is the Kronecker delta and with the convention that the 0-th derivative is the function itself.
The mixed derivative in this case is:

∂²(nσ²)/∂γ∂φ = ∂/∂γ [ −(∂µ/∂φ) 1ᵀR⁻¹(y − 1µ) + (y − 1µ)ᵀ (∂R⁻¹/∂φ) (y − 1µ) − (y − 1µ)ᵀ R⁻¹ 1 (∂µ/∂φ) ]
  = −(∂²µ/∂γ∂φ) 1ᵀR⁻¹(y − 1µ) − (∂µ/∂φ) 1ᵀ (∂R⁻¹/∂γ) (y − 1µ) + (∂µ/∂φ) 1ᵀR⁻¹1 (∂µ/∂γ)
    − (∂µ/∂γ) 1ᵀ (∂R⁻¹/∂φ) (y − 1µ) + (y − 1µ)ᵀ (∂²R⁻¹/∂γ∂φ) (y − 1µ) − (y − 1µ)ᵀ (∂R⁻¹/∂γ) 1 (∂µ/∂φ)
    + (∂µ/∂φ) 1ᵀR⁻¹1 (∂µ/∂γ) − (y − 1µ)ᵀ (∂R⁻¹/∂φ) 1 (∂µ/∂γ) − (y − 1µ)ᵀ R⁻¹ 1 (∂²µ/∂γ∂φ)   (2.65)
It can be written in matrix form as:

∂²(nσ²)/∂γ∂φ = 1ᵀ K′ 1   (2.66)
where K′ is a 3 × 3 matrix whose entries, by analogy with eq. (2.63), are

K′_ij = ( ∂^{δ¹_i+δ¹_j}(y − 1µ)ᵀ / ∂γ^{δ¹_i}∂φ^{δ¹_j} ) ( ∂^{δ²_i+δ²_j}R⁻¹ / ∂γ^{δ²_i}∂φ^{δ²_j} ) ( ∂^{δ³_i+δ³_j}(y − 1µ) / ∂γ^{δ³_i}∂φ^{δ³_j} )   (2.67)
and the mixed derivative of the inverse of the R matrix can be obtained as:

∂²R⁻¹/∂γ∂φ = −∂/∂γ ( R⁻¹ (∂R/∂φ) R⁻¹ )   (2.70)
           = −( (∂R⁻¹/∂γ)(∂R/∂φ) R⁻¹ + R⁻¹ (∂²R/∂γ∂φ) R⁻¹ + R⁻¹ (∂R/∂φ)(∂R⁻¹/∂γ) )
           = R⁻¹ (∂R/∂γ) R⁻¹ (∂R/∂φ) R⁻¹ − R⁻¹ (∂²R/∂γ∂φ) R⁻¹ + R⁻¹ (∂R/∂φ) R⁻¹ (∂R/∂γ) R⁻¹   (2.71)
Differentiating eq. (2.53) once more gives the second derivative of µ:

∂²µ/∂γ² = [ 1ᵀ (∂²R⁻¹/∂γ²) y (1ᵀR⁻¹1) + 1ᵀ (∂R⁻¹/∂γ) y · 1ᵀ (∂R⁻¹/∂γ) 1 ] / (1ᵀR⁻¹1)²
          − 2 [ 1ᵀ (∂R⁻¹/∂γ) y (1ᵀR⁻¹1) · 1ᵀ (∂R⁻¹/∂γ) 1 ] / (1ᵀR⁻¹1)³
          − [ 1ᵀ (∂R⁻¹/∂γ) y · 1ᵀ (∂R⁻¹/∂γ) 1 + 1ᵀR⁻¹y · 1ᵀ (∂²R⁻¹/∂γ²) 1 ] / (1ᵀR⁻¹1)²
          + 2 [ 1ᵀR⁻¹y · ( 1ᵀ (∂R⁻¹/∂γ) 1 )² ] / (1ᵀR⁻¹1)³   (2.72)
and the mixed derivative:

∂²µ/∂γ∂φ = [ 1ᵀ (∂²R⁻¹/∂γ∂φ) y (1ᵀR⁻¹1) + 1ᵀ (∂R⁻¹/∂φ) y · 1ᵀ (∂R⁻¹/∂γ) 1 ] / (1ᵀR⁻¹1)²
           − 2 [ 1ᵀ (∂R⁻¹/∂φ) y (1ᵀR⁻¹1) · 1ᵀ (∂R⁻¹/∂γ) 1 ] / (1ᵀR⁻¹1)³
           − [ 1ᵀ (∂R⁻¹/∂γ) y · 1ᵀ (∂R⁻¹/∂φ) 1 + 1ᵀR⁻¹y · 1ᵀ (∂²R⁻¹/∂γ∂φ) 1 ] / (1ᵀR⁻¹1)²
           + 2 [ 1ᵀR⁻¹y · 1ᵀ (∂R⁻¹/∂φ) 1 · 1ᵀ (∂R⁻¹/∂γ) 1 ] / (1ᵀR⁻¹1)³   (2.73)
For the second derivatives of the entries of R, let us start again with θ = (θ_1, . . . , θ_n):

∂²R_ij/∂θ_h² = ∂/∂θ_h ( −|x_h^i − x_h^j|^{p_h} exp( −Σ_{k=1}^{N_d} θ_k |x_k^i − x_k^j|^{p_k} ) )
             = |x_h^i − x_h^j|^{2p_h} exp( −Σ_{k=1}^{N_d} θ_k |x_k^i − x_k^j|^{p_k} )   (2.74)
∂²R_ij/∂p_h² = ∂/∂p_h ( −θ_h ln |x_h^i − x_h^j| e^{ p_h ln |x_h^i − x_h^j| } exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} ) )
  = −θ_h ( ln |x_h^i − x_h^j| )² e^{ p_h ln |x_h^i − x_h^j| } exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} )
    + θ_h ln |x_h^i − x_h^j| e^{ p_h ln |x_h^i − x_h^j| } · θ_h ln |x_h^i − x_h^j| e^{ p_h ln |x_h^i − x_h^j| } exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} )
  = ( θ_h |x_h^i − x_h^j|^{p_h} − 1 ) θ_h ( ln |x_h^i − x_h^j| )² |x_h^i − x_h^j|^{p_h} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} )   (2.75)
The remaining mixed second derivatives fall into four cases:
• θ_h p_l with h ≠ l
• θ_h p_l with h = l → θ_h p_h
• p_h p_l with h ≠ l
• θ_h θ_l with h ≠ l
θ_h p_l with h ≠ l:

∂²R_ij/∂θ_h ∂p_l = ∂/∂p_l ( −|x_h^i − x_h^j|^{p_h} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} ) )
                 = θ_l ln |x_l^i − x_l^j| · |x_h^i − x_h^j|^{p_h} |x_l^i − x_l^j|^{p_l} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} )   (2.76)
The order of derivation should be irrelevant, therefore we check eq. (2.76) by calculating the same derivative starting from the derivative with respect to p_l:

∂²R_ij/∂p_l ∂θ_h = ∂/∂θ_h ( −θ_l ln |x_l^i − x_l^j| |x_l^i − x_l^j|^{p_l} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} ) )
                 = θ_l ln |x_l^i − x_l^j| |x_l^i − x_l^j|^{p_l} |x_h^i − x_h^j|^{p_h} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} )   (2.77)
θ_h p_l with h = l:

∂²R_ij/∂θ_h ∂p_h = ∂/∂p_h ( −|x_h^i − x_h^j|^{p_h} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} ) )
  = −ln |x_h^i − x_h^j| |x_h^i − x_h^j|^{p_h} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} )
    + θ_h ln |x_h^i − x_h^j| |x_h^i − x_h^j|^{2p_h} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} )   (2.78)
  = ( θ_h |x_h^i − x_h^j|^{p_h} − 1 ) ln |x_h^i − x_h^j| |x_h^i − x_h^j|^{p_h} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} )   (2.79)
In this case the second derivative of the R matrix is a symmetric matrix with
null principal diagonal.
p_h p_l with h ≠ l:

∂²R_ij/∂p_h ∂p_l = ∂/∂p_l ( −θ_h ln |x_h^i − x_h^j| |x_h^i − x_h^j|^{p_h} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} ) )
                 = θ_h θ_l ln |x_h^i − x_h^j| |x_h^i − x_h^j|^{p_h} ln |x_l^i − x_l^j| |x_l^i − x_l^j|^{p_l} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} )   (2.80)
θ_h θ_l with h ≠ l:

∂²R_ij/∂θ_h ∂θ_l = ∂/∂θ_l ( −|x_h^i − x_h^j|^{p_h} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} ) )
                 = |x_h^i − x_h^j|^{p_h} |x_l^i − x_l^j|^{p_l} exp( −Σ_{k} θ_k |x_k^i − x_k^j|^{p_k} )   (2.81)
Chapter 3
Thermodynamics
dU = δq (3.2)
which, by using the second law of thermodynamics, leads for a generic transformation to:

dU ≤ T dS   (3.3)
Because the temperature is constant we can write:
d(U − T S ) ≤ 0 (3.4)
Appendices
Appendix A
Miscellanea
In the second-last step, we can swap the two bounds of the integral by changing the sign. In the final step, both integration bounds are shifted by −x, which does not change the value because we are integrating over an interval of length 2π and the function is 2π-periodic.
Appendix B
Multivariate Gaussian
The covariance matrix Σ is the n × n matrix whose (i, j)-th entry is Cov[Xi , X j ].
Proposition 4. For any random vector X with mean µ and covariance matrix Σ:

Σ = E[ (X − µ)(X − µ)ᵀ ] = E[XXᵀ] − µµᵀ   (B.6)
Proof. We prove the first of the two equalities in eq. (B.6); the other is very similar.

Σ = [ Cov[X_1, X_1] · · · Cov[X_1, X_n] ]
    [        ⋮        ⋱        ⋮        ]
    [ Cov[X_n, X_1] · · · Cov[X_n, X_n] ]

  = E[ (X_1 − µ_1)(X_1 − µ_1) · · · (X_1 − µ_1)(X_n − µ_n) ]
     [            ⋮             ⋱             ⋮            ]
     [ (X_n − µ_n)(X_1 − µ_1) · · · (X_n − µ_n)(X_n − µ_n) ]

  = E[ (X − µ)(X − µ)ᵀ ]   (B.7)

where we used the fact that the expectation of a matrix is simply the matrix obtained by taking the componentwise expectation of each entry.
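A quick numerical check of eq. (B.6) on simulated data (the distribution is arbitrary; sample averages stand in for the expectations):

import numpy as np

rng = np.random.default_rng(3)
A = np.array([[1.0, 0.4], [0.0, 0.8]])
X = rng.normal(size=(100_000, 2)) @ A.T + np.array([1.0, -2.0])   # random vectors X

mu = X.mean(axis=0)
Sigma_def = ((X - mu)[:, :, None] * (X - mu)[:, None, :]).mean(axis=0)        # E[(X - mu)(X - mu)^T]
Sigma_alt = (X[:, :, None] * X[:, None, :]).mean(axis=0) - np.outer(mu, mu)   # E[X X^T] - mu mu^T

print(np.abs(Sigma_def - Sigma_alt).max())   # essentially zero: the two expressions coincide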
The symmetric positive semidefinite property of a covariance matrix derives from the following.

Proof. The symmetry of Σ follows immediately from its definition. Next, for any vector z ∈ Rⁿ,
zᵀΣz = zᵀ E[ (X − µ)(X − µ)ᵀ ] z = E[ zᵀ(X − µ)(X − µ)ᵀz ]   (B.8)
where we used the formula for expanding quadratic forms and the linearity of the expectation operator. To complete the proof, we observe that the quantity inside the brackets is of the form Σ_i Σ_j x_i x_j z_i z_j = (xᵀz)² ≥ 0 (with x = X − µ). Therefore the quantity inside the expectation is always non-negative, and hence the expectation itself must be non-negative. We conclude that zᵀΣz ≥ 0. For Σ⁻¹ to exist, as required in the definition of the multivariate Gaussian density, Σ must be invertible and hence full rank. Since any full-rank symmetric positive semidefinite matrix is necessarily positive definite, it follows that Σ must be symmetric positive definite.
Appendix C
Polarization Formula
The polarization formula allows one to write the inner product of two generic elements of a Hilbert space¹ as a sum of norms in the same space.
We derive only the complex case, since the real case is just a special case of it.
Let H(x, y) be a Hermitian form, so H(x, y) = H(y, x)* (the star denoting complex conjugation), and the single-variable function Q(x) = H(x, x) satisfies Q(cx) = H(cx, cx) = c c* H(x, x) = |c|² Q(x) for c ∈ C. Let's start by obtaining the real part of H(x, y) by looking at Q(x + y) and Q(x − y):
Q(x + y) = H(x + y, x + y)
         = H(x, x) + H(x, y) + H(y, x) + H(y, y)
         = Q(x) + 2 Re(H(x, y)) + Q(y)   (C.1)

and

Q(x − y) = H(x − y, x − y)
         = H(x, x) − H(x, y) − H(y, x) + H(y, y)
         = Q(x) − 2 Re(H(x, y)) + Q(y).   (C.2)
Therefore we can solve for the real part of H(x, y) by subtracting the second result from the first:

Q(x + y) − Q(x − y) = 4 Re(H(x, y))   ⟹   Re(H(x, y)) = (1/4)(Q(x + y) − Q(x − y)).
That is the first part of the complex polarization formula. Of course, if H(x, y) is a real quantity then Re(H(x, y)) = H(x, y) and we have proved the polarization formula in the real case. To get a formula for the imaginary part, note that Im(H(x, y)) = Re(−iH(x, y)) = Re(H(x, iy)), so if we run through the above work with iy in place of y then we get

Im(H(x, y)) = Re(H(x, iy)) = (1/4)(Q(x + iy) − Q(x − iy)).
¹ I am studying Hilbert spaces at the moment, so I am using them here only for this.
If we write H(x, y) as Re(H(x, y)) + i Im(H(x, y)) and feed the formulas for the real and imaginary parts of H(x, y) into this, the complex polarization formula appears:

H(x, y) = (1/4)(Q(x + y) − Q(x − y)) + (i/4)(Q(x + iy) − Q(x − iy)).   (C.3)
We don't have to use Q(x + y) and Q(x − y), or Q(x + iy) and Q(x − iy); just one of each would suffice: since Q(x + y) = Q(x) + 2 Re(H(x, y)) + Q(y), we have

Re(H(x, y)) = (1/2)(Q(x + y) − Q(x) − Q(y)),   (C.4)
and

Im(H(x, y)) = Re(H(x, iy)) = (1/2)(Q(x + iy) − Q(x) − Q(iy)),   (C.5)
and Q(iy) = H(iy, iy) = |i|² Q(y) = Q(y) (by the property Q(cx) = |c|² Q(x) noted above), so

Im(H(x, y)) = (1/2)(Q(x + iy) − Q(x) − Q(y)).

The same idea works over a field F equipped with an automorphism σ of order 2, where N and Tr are the norm and trace maps F → F^σ (F^σ being the fixed field of σ on F, so the extension F/F^σ has degree 2 and is Galois). From the nondegeneracy of the trace pairing on separable extensions, such as F/F^σ, the corresponding formula, as a varies in F, shows that the Q-values determine the H-values even if F has characteristic 2, where the usual polarization formula over C, with its division by 4, makes no sense.
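As a numerical sanity check (not in the original text), the complex polarization formula can be verified for the standard inner product on Cⁿ, taken here as H(x, y) = Σ_k x_k y_k*, which is linear in x and conjugate-linear in y as assumed above:

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=5) + 1j * rng.normal(size=5)
y = rng.normal(size=5) + 1j * rng.normal(size=5)

H = lambda u, v: np.sum(u * np.conj(v))   # Hermitian form, conjugate-linear in the second slot
Q = lambda u: H(u, u).real                # Q(u) = H(u, u) = ||u||^2

polar = 0.25 * (Q(x + y) - Q(x - y)) + 0.25j * (Q(x + 1j * y) - Q(x - 1j * y))
print(abs(polar - H(x, y)))               # ~1e-15: the polarization formula reproduces H(x, y)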