
Unsupervised and Composite Gaussian

Processes

Carl Henrik Ek - che29@cam.ac.uk


September 15, 2020
http://carlhenrik.com
Learning Theory

• $\mathcal{F}$ — a space of functions
• $\mathcal{A}$ — a learning algorithm
• $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ — a training sample
• $S \sim P(\mathcal{X} \times \mathcal{Y})$
• $\ell(\mathcal{A}_{\mathcal{F}}(S), x, y)$ — a loss function

1
Statistical Learning

$$e(S, \mathcal{A}, \mathcal{F}) = \mathbb{E}_{P(\mathcal{X} \times \mathcal{Y})}\left[\ell(\mathcal{A}_{\mathcal{F}}(S), x, y)\right] \approx \frac{1}{M} \sum_{n=1}^{M} \ell(\mathcal{A}_{\mathcal{F}}(S), x_n, y_n)$$

3
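The Monte Carlo approximation above is just an average of the loss over held-out samples. A minimal Python sketch, where `predictor` and `loss` are hypothetical stand-ins for $\mathcal{A}_{\mathcal{F}}(S)$ and $\ell$:

```python
import numpy as np

def empirical_risk(predictor, X_test, y_test, loss):
    """Monte Carlo estimate of e(S, A, F): average loss over M test points."""
    return np.mean([loss(predictor(x), y) for x, y in zip(X_test, y_test)])

# e.g. with a squared-error loss
squared_loss = lambda f_x, y: (f_x - y) ** 2
```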
No Free Lunch

We can come up with a combination of $\{S, \mathcal{A}, \mathcal{F}\}$ that makes
$e(S, \mathcal{A}, \mathcal{F})$ take an arbitrary value.

4
Example

[figure slides]

5
Data and Knowledge

6
Assumptions: Algorithms

Statistical Learning

$\mathcal{A}_{\mathcal{F}}(S)$

7
Assumptions: Biased Sample

Statistical Learning

$\mathcal{A}_{\mathcal{F}}(S)$

8
Assumptions: Hypothesis space

Statistical Learning

$\mathcal{A}_{\mathcal{F}}(S)$

9
The No Free Lunch

• There seems to be a narrative that the more flexible a model is, the better it is
• This is not true
• The best possible model has infinite support (nothing is excluded) but very focused mass
• Your solution can only ever be interpreted in the light of your assumptions

10
GPSS

Iudicium Posterium Discipulus Est Prioris¹

¹ The posterior is the student of the prior

11
Gaussian Processes
Gaussian Processes

[figure slides 12–22]
Conditional Gaussians

" # " #! " # " #! " # " #!


0 1 0.5 0 1 0.9 0 1 0
N , N , N ,
0 0.5 1 0 0.9 1 0 0 1

23
Gaussian Processes

[figure slides 24–25]
The Gaussian Identities

$$p(x_1, x_2) \qquad p(x_1) = \int p(x_1, x_2)\,\mathrm{d}x_2 \qquad p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)}$$

Gaussian Identities

26
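As a concrete illustration of these identities, here is a minimal numpy sketch that marginalises and conditions one of the bivariate Gaussians shown earlier (the particular mean, covariance and conditioning value are just example choices):

```python
import numpy as np

mu = np.array([0.0, 0.0])
K = np.array([[1.0, 0.9],
              [0.9, 1.0]])

# Marginal: p(x1) = N(mu_1, k_11)
mu1, k11 = mu[0], K[0, 0]

# Conditional: p(x1 | x2) = N(mu_1 + k_12 k_22^{-1} (x2 - mu_2), k_11 - k_12 k_22^{-1} k_21)
x2 = 1.0
cond_mean = mu[0] + K[0, 1] / K[1, 1] * (x2 - mu[1])
cond_var = K[0, 0] - K[0, 1] ** 2 / K[1, 1]
```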
Stochastic Processes
Kolmogorov's Existence Theorem

For all permutations $\pi$, measurable sets $F_i \subseteq \mathbb{R}^n$ and probability measure $\nu$:

1. Exchangeable
$$\nu_{t_{\pi(1)} \cdots t_{\pi(k)}}\left(F_{\pi(1)} \times \cdots \times F_{\pi(k)}\right) = \nu_{t_1 \cdots t_k}\left(F_1 \times \cdots \times F_k\right)$$

2. Marginal
$$\nu_{t_1 \cdots t_k}\left(F_1 \times \cdots \times F_k\right) = \nu_{t_1 \cdots t_k\, t_{k+1} \cdots t_{k+m}}\left(F_1 \times \cdots \times F_k \times \mathbb{R}^n \times \cdots \times \mathbb{R}^n\right)$$

In this case the finite-dimensional probability measures are realisations of an underlying stochastic process.

27
Gaussian Distribution - Exchangeable

$$p(x_1, x_2) = \mathcal{N}\!\left(\begin{bmatrix}x_1\\x_2\end{bmatrix} \middle|\, \begin{bmatrix}\mu_1\\\mu_2\end{bmatrix}\!,\begin{bmatrix}k_{11} & k_{12}\\k_{21} & k_{22}\end{bmatrix}\right) = p(x_2, x_1) = \mathcal{N}\!\left(\begin{bmatrix}x_2\\x_1\end{bmatrix} \middle|\, \begin{bmatrix}\mu_2\\\mu_1\end{bmatrix}\!,\begin{bmatrix}k_{22} & k_{21}\\k_{12} & k_{11}\end{bmatrix}\right)$$

28
Gaussian Distribution - Exchangeable

[figure slide 29]
Gaussian Distribution - Marginal

$$p(x_1, x_2) = \mathcal{N}\!\left(\begin{bmatrix}x_1\\x_2\end{bmatrix} \middle|\, \begin{bmatrix}\mu_1\\\mu_2\end{bmatrix}\!,\begin{bmatrix}k_{11} & k_{12}\\k_{21} & k_{22}\end{bmatrix}\right) \;\Rightarrow\; p(x_1) = \int_{x_2} p(x_1, x_2) = \mathcal{N}(x_1 \mid \mu_1, k_{11})$$

$$p(x_1, x_2, \ldots, x_N) = \mathcal{N}\!\left(\begin{bmatrix}x_1\\x_2\\\vdots\\x_N\end{bmatrix} \middle|\, \begin{bmatrix}\mu_1\\\mu_2\\\vdots\\\mu_N\end{bmatrix}\!,\begin{bmatrix}k_{11} & k_{12} & \cdots & k_{1N}\\k_{21} & k_{22} & \cdots & k_{2N}\\\vdots & \vdots & \ddots & \vdots\\k_{N1} & k_{N2} & \cdots & k_{NN}\end{bmatrix}\right) \;\Rightarrow\; p(x_1) = \int_{x_2, \ldots, x_N} p(x_1, x_2, \ldots, x_N) = \mathcal{N}(x_1 \mid \mu_1, k_{11})$$

30
Gaussian Distribution - Marginal

31
Gaussian processes

$$\mathcal{GP}(\cdot, \cdot) \;\longrightarrow\; \mathcal{N}(\cdot, \cdot) \qquad M \in \mathbb{R}^{\infty \times N} \qquad \infty \to N$$

The Gaussian distribution is the projection of the infinite Gaussian process.

32
Unsupervised Gaussian Processes
Unsupervised Learning

[graphical models: $x \to f_i \to y_i$ with parameters $\theta$, plate over $D$ — left: $p(\mathbf{y} \mid \mathbf{x})$ with $x$ observed, right: $p(\mathbf{y})$ with $x$ latent]

33
Unsupervised Learning

[figure slides 34–35]
Priors

36
Priors

$$p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{x})\,p(\mathbf{x})\,\mathrm{d}\mathbf{f}\,\mathrm{d}\mathbf{x}$$

1. Priors that make sense
   • $p(\mathbf{f})$ describes our beliefs/assumptions and defines our notion of complexity in the function
   • $p(\mathbf{x})$ expresses our beliefs/assumptions and defines our notion of complexity in the latent space
2. Now let's churn the handle

37
Relationship between x and data

$$p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{x})\,p(\mathbf{x})\,\mathrm{d}\mathbf{f}\,\mathrm{d}\mathbf{x}$$

• GP prior
$$p(\mathbf{f} \mid \mathbf{x}) \sim \mathcal{N}(\mathbf{0}, K) \propto e^{-\frac{1}{2}\mathbf{f}^{\mathrm{T}} K^{-1} \mathbf{f}}, \qquad K_{ij} = e^{-(x_i - x_j)^{\mathrm{T}} M^{\mathrm{T}} M (x_i - x_j)}$$

• Likelihood
$$p(\mathbf{y} \mid \mathbf{f}) \sim \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \beta) \propto e^{-\frac{1}{2\beta}\operatorname{tr}\left((\mathbf{y}-\mathbf{f})^{\mathrm{T}}(\mathbf{y}-\mathbf{f})\right)}$$

38
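A minimal numpy sketch of the covariance above, $K_{ij} = e^{-(x_i - x_j)^{\mathrm{T}} M^{\mathrm{T}} M (x_i - x_j)}$; the shapes of `X` and `M` used here are illustrative assumptions:

```python
import numpy as np

def kernel(X, M):
    """X: (N, D) inputs, M: (D, D) linear map; returns the (N, N) Gram matrix."""
    Z = X @ M.T                                                  # rows are M x_i
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)   # ||M x_i - M x_j||^2
    return np.exp(-d2)

X = np.random.randn(5, 2)
K = kernel(X, np.eye(2))   # M = I gives a unit-lengthscale RBF kernel
```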
Laplace Integration

"Nature laughs at the difficulties of integrations"


– Pierre-Simon Laplace

39
Approximate Inference
Machine Learning

p(y)

• Given some observed data y


• Find a probabilistic model such that the probability of the data
is maximised
• Idea: find an approximate model q that we can integrate

40
Lower Bound

[graphical models: $x \to y$, with variational parameters $\theta$]

$$p(\mathbf{y}) = \int_{\mathbf{x}} p(\mathbf{y} \mid \mathbf{x})\,p(\mathbf{x}) = \frac{p(\mathbf{y} \mid \mathbf{x})\,p(\mathbf{x})}{p(\mathbf{x} \mid \mathbf{y})} \qquad q_\theta(\mathbf{x}) \approx p(\mathbf{x} \mid \mathbf{y})$$

41
Deterministic Approximation

[figure: the log marginal likelihood $\log p(\mathbf{y})$ decomposed into $\mathcal{L}(q)$ and $\mathrm{KL}(q \,\|\, p)$]

42
Variational Bayes

$$\begin{aligned}
\log p(\mathbf{y}) &= \log p(\mathbf{y}) + \log \int \frac{p(\mathbf{x} \mid \mathbf{y})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x} \\
&= \int q(\mathbf{x}) \log p(\mathbf{y})\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log \frac{p(\mathbf{x} \mid \mathbf{y})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x} \\
&= \int q(\mathbf{x}) \log \frac{p(\mathbf{x} \mid \mathbf{y})\,p(\mathbf{y})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x} \\
&= \int q(\mathbf{x}) \log \frac{1}{q(\mathbf{x})}\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log p(\mathbf{x}, \mathbf{y})\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x}
\end{aligned}$$

43
Jensen Inequality

Convex Function

$$\lambda f(x_0) + (1 - \lambda) f(x_1) \geq f(\lambda x_0 + (1 - \lambda) x_1), \qquad x \in [x_{\min}, x_{\max}],\; \lambda \in [0, 1]$$

44
Jensen Inequality

$$\mathbb{E}[f(x)] \geq f(\mathbb{E}[x])$$

$$\int f(x)\,p(x)\,\mathrm{d}x \geq f\!\left(\int x\,p(x)\,\mathrm{d}x\right)$$

45
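A quick numerical sanity check of $\mathbb{E}[f(x)] \geq f(\mathbb{E}[x])$ for the convex function $f(x) = x^2$ (the choice of sampling distribution here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
f = lambda t: t ** 2

print(np.mean(f(x)) >= f(np.mean(x)))   # True: Jensen's inequality holds
```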
Jensen Inequality in Variational Bayes

$$\int \log(x)\,p(x)\,\mathrm{d}x \leq \log\!\left(\int x\,p(x)\,\mathrm{d}x\right)$$

Since $\log$ is concave, moving the log inside the integral gives a lower bound on the log of the integral.
46
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)

47
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
p(x|y)
Z
= − q(x) log dx
q(x)

47
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
p(x|y)
Z
= − q(x) log dx
q(x)
Z
≥ −log p(x|y)dx = −log1 = 0

47
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)

48
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
= {Lets assume that q(x) = p(x|y)}

48
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
= {Lets assume that q(x) = p(x|y)}
p(x|y)
Z
= p(x|y) log dx
p(x|y)
| {z }
=1

48
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
= {Lets assume that q(x) = p(x|y)}
p(x|y)
Z
= p(x|y) log dx
p(x|y)
| {z }
=1

=0

48
Kullback-Leibler Divergence

$$\mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x} \mid \mathbf{y})) = \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x}$$

• Measure of divergence between distributions


• Not a metric (not symmetric)
• KL(q(x)||p(x|y)) = 0 ⇔ q(x) = p(x|y)
• KL(q(x)||p(x|y)) ≥ 0

49
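These properties are easy to see numerically. A small sketch using the closed-form KL divergence between two univariate Gaussians (the particular means and variances are arbitrary):

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """KL(N(m1, s1^2) || N(m2, s2^2)) in closed form."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0.0, 1.0, 1.0, 2.0))   # non-negative
print(kl_gauss(1.0, 2.0, 0.0, 1.0))   # different value: not symmetric
print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0 exactly when the distributions match
```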
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y)
Z
= p(x|y)log dx
p(x|y)

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)
p(x|y)
Z Z
= p(x|y)log dx + p(x|y)logp(y)dx
p(x|y)
| {z }
=1

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)
p(x|y)
Z Z
= p(x|y)log dx + p(x|y)logp(y)dx
p(x|y)
| {z }
=1
Z
= p(x|y)dx logp(y)
| {z }
=1

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)
p(x|y)
Z Z
= p(x|y)log dx + p(x|y)logp(y)dx
p(x|y)
| {z }
=1
Z
= p(x|y)dx logp(y) = logp(y)
| {z }
=1

50
Variational Bayes

$$\log p(\mathbf{y}) = \int q(\mathbf{x}) \log \frac{1}{q(\mathbf{x})}\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log p(\mathbf{x}, \mathbf{y})\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x}$$

$$\log p(\mathbf{y}) \geq -\int q(\mathbf{x}) \log q(\mathbf{x})\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log p(\mathbf{x}, \mathbf{y})\,\mathrm{d}\mathbf{x}$$

• The Evidence Lower BOund (ELBO)
• Tight if $q(\mathbf{x}) = p(\mathbf{x} \mid \mathbf{y})$

51
Deterministic Approximation

[figure: the log marginal likelihood $\log p(\mathbf{y})$ decomposed into $\mathcal{L}(q)$ and $\mathrm{KL}(q \,\|\, p)$]

52
ELBO

$$\log p(\mathbf{y}) \geq -\int q(\mathbf{x}) \log q(\mathbf{x})\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log p(\mathbf{x}, \mathbf{y})\,\mathrm{d}\mathbf{x} = \mathbb{E}_{q(\mathbf{x})}[\log p(\mathbf{x}, \mathbf{y})] + H(q(\mathbf{x})) = \mathcal{L}(q(\mathbf{x}))$$

• If we maximise the ELBO we:
  • find an approximate posterior
  • lower bound the marginal likelihood
• maximising $p(\mathbf{y})$ is learning
• finding $q(\mathbf{x}) \approx p(\mathbf{x} \mid \mathbf{y})$ is prediction

53
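A minimal sketch of estimating $\mathcal{L}(q)$ by Monte Carlo for a Gaussian $q(x) = \mathcal{N}(m, s^2)$; the toy log-joint used here is an assumption purely for illustration, with constants dropped:

```python
import numpy as np

def elbo(log_joint, m, s, n_samples=1000, seed=0):
    """E_q[log p(x, y)] + H(q) for q(x) = N(m, s^2), estimated by sampling."""
    rng = np.random.default_rng(seed)
    x = m + s * rng.standard_normal(n_samples)        # samples from q
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)   # H(q) for a Gaussian
    return np.mean(log_joint(x)) + entropy

# Toy joint (up to constants): p(x) = N(0, 1), p(y = 1.2 | x) = N(x, 0.1^2)
log_joint = lambda x: -0.5 * x**2 - 0.5 * ((1.2 - x) / 0.1) ** 2
print(elbo(log_joint, m=1.0, s=0.3))
```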
Lower Bound

[graphical models: $x \to y$, with variational parameters $\theta$]

$$p(\mathbf{y}) = \int_{\mathbf{x}} p(\mathbf{y} \mid \mathbf{x})\,p(\mathbf{x}) = \frac{p(\mathbf{y} \mid \mathbf{x})\,p(\mathbf{x})}{p(\mathbf{x} \mid \mathbf{y})} \qquad q_\theta(\mathbf{x}) \approx p(\mathbf{x} \mid \mathbf{y})$$

54
Why is this useful?

Why is this a sensible thing to do?

• If we can’t formulate the joint distribution there isn’t much we can do
• Taking the expectation of a log is usually easier than the log of an expectation
• We are allowed to choose the distribution to take the expectation over

– Ryan Adams²

² Talking Machines Podcast
55
How to choose Q?

$$\mathcal{L}(q(\mathbf{x})) = \mathbb{E}_{q(\mathbf{x})}[\log p(\mathbf{x}, \mathbf{y})] + H(q(\mathbf{x}))$$

• We have to be able to compute an expectation over the joint


distribution
• The second term should be trivial

56
Lower Bound³

$$\begin{aligned}
\mathcal{L} &= \int_{\mathbf{x}} q(\mathbf{x}) \log \frac{p(\mathbf{y}, \mathbf{f}, \mathbf{x})}{q(\mathbf{x})} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \log \frac{p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{x})\,p(\mathbf{x})}{q(\mathbf{x})} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \log p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{x}) - \int_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})} \\
&= \tilde{\mathcal{L}} - \mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x}))
\end{aligned}$$

³ Damianou, 2015
57
Lower Bound

$$\tilde{\mathcal{L}} = \int q(\mathbf{x}) \log p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{x})\,\mathrm{d}\mathbf{f}\,\mathrm{d}\mathbf{x}$$

• This has not alleviated the problem at all: $\mathbf{x}$ still needs to go through $\mathbf{f}$ to reach the data
• Idea of sparse approximations⁴

⁴ Candela et al., 2005
58
Lower Bound⁵

$$p(\mathbf{f}, \mathbf{u} \mid \mathbf{x}, \mathbf{z}) = p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,p(\mathbf{u} \mid \mathbf{z}) = \mathcal{N}(\mathbf{f} \mid K_{fu} K_{uu}^{-1} \mathbf{u},\; K_{ff} - K_{fu} K_{uu}^{-1} K_{uf})\,\mathcal{N}(\mathbf{u} \mid \mathbf{0}, K_{uu})$$

• Add another set of samples from the same prior
• Conditional distribution

⁵ Titsias et al., 2010
59
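A small numpy sketch of the conditional above, $p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z}) = \mathcal{N}(K_{fu} K_{uu}^{-1}\mathbf{u},\, K_{ff} - K_{fu} K_{uu}^{-1} K_{uf})$; the kernel function and the jitter term used here are illustrative assumptions:

```python
import numpy as np

def conditional(X, Z, u, k, jitter=1e-6):
    """Mean and covariance of f at inputs X, given inducing values u at inputs Z."""
    Kff = k(X, X)
    Kfu = k(X, Z)
    Kuu = k(Z, Z) + jitter * np.eye(len(Z))
    A = np.linalg.solve(Kuu, Kfu.T).T       # A = K_fu K_uu^{-1}
    return A @ u, Kff - A @ Kfu.T

# e.g. a unit-lengthscale RBF kernel on (N, D) arrays
k = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
```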
Lower Bound

$$p(\mathbf{y}, \mathbf{f}, \mathbf{u}, \mathbf{x} \mid \mathbf{z}) = p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{u}, \mathbf{x})\,p(\mathbf{u} \mid \mathbf{z})\,p(\mathbf{x})$$

• we have done nothing to the model, just projected an additional set of marginals from the GP
• however, we will now interpret $\mathbf{u}$ and $\mathbf{z}$ not as random variables but as variational parameters
• i.e. the variational distribution $q(\cdot)$ is parametrised by these

60
Lower Bound

• Variational distributions are approximations to intractable posteriors,

$$q(\mathbf{u}) \approx p(\mathbf{u} \mid \mathbf{y}, \mathbf{x}, \mathbf{z}, \mathbf{f}) \qquad q(\mathbf{f}) \approx p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z}, \mathbf{y}) \qquad q(\mathbf{x}) \approx p(\mathbf{x} \mid \mathbf{y})$$

• The bound is tight if $\mathbf{u}$ completely represents $\mathbf{f}$, i.e. $\mathbf{u}$ is a sufficient statistic for $\mathbf{f}$:

$$q(\mathbf{f}) \approx p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z}, \mathbf{y}) = p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})$$

61
Lower Bound

$$\begin{aligned}
\tilde{\mathcal{L}} &= \int_{\mathbf{x}, \mathbf{f}, \mathbf{u}} q(\mathbf{f})\,q(\mathbf{u})\,q(\mathbf{x}) \log \frac{p(\mathbf{y}, \mathbf{f}, \mathbf{u} \mid \mathbf{x}, \mathbf{z})}{q(\mathbf{f})\,q(\mathbf{u})} \\
&= \int_{\mathbf{x}, \mathbf{f}, \mathbf{u}} q(\mathbf{f})\,q(\mathbf{u})\,q(\mathbf{x}) \log \frac{p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,p(\mathbf{u} \mid \mathbf{z})}{q(\mathbf{f})\,q(\mathbf{u})}
\end{aligned}$$

• Assume that $\mathbf{u}$ is a sufficient statistic for $\mathbf{f}$:

$$q(\mathbf{f}) = p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})$$

62
Lower Bound

$$\begin{aligned}
\tilde{\mathcal{L}} &= \int_{\mathbf{x}, \mathbf{f}, \mathbf{u}} q(\mathbf{f})\,q(\mathbf{u})\,q(\mathbf{x}) \log \frac{p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,p(\mathbf{u} \mid \mathbf{z})}{q(\mathbf{f})\,q(\mathbf{u})} \\
&= \int_{\mathbf{x}, \mathbf{f}, \mathbf{u}} p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,q(\mathbf{u})\,q(\mathbf{x}) \log \frac{p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,p(\mathbf{u} \mid \mathbf{z})}{p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,q(\mathbf{u})} \\
&= \int_{\mathbf{x}, \mathbf{f}, \mathbf{u}} p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,q(\mathbf{u})\,q(\mathbf{x}) \log \frac{p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{u} \mid \mathbf{z})}{q(\mathbf{u})} \\
&= \mathbb{E}_{p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})}\left[\log p(\mathbf{y} \mid \mathbf{f})\right] - \mathrm{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u} \mid \mathbf{z}))
\end{aligned}$$

63
Lower Bound

$$\mathcal{L} = \mathbb{E}_{p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})}\left[\log p(\mathbf{y} \mid \mathbf{f})\right] - \mathrm{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u} \mid \mathbf{z})) - \mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x}))$$

• Expectation tractable (for some covariances)
• Allows us to place priors, not "regularisers", over the latent representation
• Stochastic inference (Hensman et al., 2013)
• Importantly, $p(\mathbf{x})$ only appears in the $\mathrm{KL}(\cdot \,\|\, \cdot)$ term!

64
Latent Space Priors

p(x) ∼ N (0, I)

65
Automatic Relevance Determination

$$k(x_i, x_j) = \sigma\, e^{-\sum_{d}^{D} \alpha_d (x_{i,d} - x_{j,d})^2}$$

GPy Code

```python
RBF(..., ARD=True)
Matern32(..., ARD=True)
```

66
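A hedged usage sketch of ARD in GPy: the data matrix `Y`, the latent dimensionality `Q` and the number of inducing points below are placeholder choices made here for illustration, not values from the slides. After optimisation, a large lengthscale for a latent dimension suggests that dimension is irrelevant:

```python
import numpy as np
import GPy

Y = np.random.randn(100, 12)                  # placeholder (N, D) data
Q = 5                                         # assumed latent dimensionality
kern = GPy.kern.RBF(Q, ARD=True)              # one lengthscale per latent dimension
model = GPy.models.BayesianGPLVM(Y, Q, kernel=kern, num_inducing=20)
model.optimize(messages=False)
print(kern.lengthscale)                       # relevance of each latent dimension
```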
Dynamic Prior

p(x | t) = N (µt , Kt )

67
Structured Latent Spaces
Explaining Away

[graphical model: $x$ and $\epsilon$ both feed into $y$]

$$y = f(x) + \epsilon \qquad\Longleftrightarrow\qquad y - \epsilon = f(x)$$

68
Factor Analysis

[graphical model: $x_1, x_2, x_3$ and $\epsilon$ feed into $y$]

$$y = f(x_1, x_2, x_3) + \epsilon$$

69
Alignments

[figure slides: examples labelled {duck} and {cat}, grouped as Relevant, Irrelevant and Ambiguous]

70
IBFA with GP-LVM⁶

[graphical model: shared latent $x$ with view-specific weights, mappings $f_1, f_2$ with parameters $\theta_1, \theta_2$, observations $y_1, y_2$ over plates $D_1, D_2$]

$$y_1 = f(w_1^{\mathrm{T}} x) \qquad y_2 = f(w_2^{\mathrm{T}} x)$$

⁶ Damianou et al., 2016
71
GP-DP⁷

⁷ Lawrence et al., 2019
72
Constrained Latent Space⁸

[graphical model: $y$ and $\epsilon$]

$$y = f(g(y)) + \epsilon$$

⁸ Lawrence et al., 2006
73
Geometry

74
Latent GP-Regression⁹

$$p(\mathbf{Y} \mid \mathbf{X}) = \int p(\mathbf{Y} \mid \mathbf{F})\, p(\mathbf{F} \mid \mathbf{X}, \mathbf{X}^{(C)})\, p(\mathbf{X}^{(C)})\,\mathrm{d}\mathbf{F}\,\mathrm{d}\mathbf{X}^{(C)}$$

⁹ Bodin et al., 2017; Yousefi et al., 2016
75
Discrete

76
Continuous

77
Composite Gaussian Processes
Composite Gaussian Processes¹⁰

[graphical model: $x_2 \to f_2 \to x_1 \to f_1 \to y$]

¹⁰ Damianou et al., 2013
78
Composite Functions

$$y = f_k(f_{k-1}(\ldots f_1(x))) = f_k \circ f_{k-1} \circ \cdots \circ f_1(x)$$

79
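A toy sketch of what composing GP draws looks like in one dimension: each layer is one sample from a zero-mean GP prior, evaluated at the previous layer's output. The kernel, lengthscale and number of layers are arbitrary choices made here for illustration:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def sample_gp_layer(x, rng, jitter=1e-8):
    """One draw from a zero-mean GP prior evaluated at the inputs x."""
    K = rbf(x, x) + jitter * np.eye(len(x))
    return rng.multivariate_normal(np.zeros(len(x)), K)

rng = np.random.default_rng(0)
h = np.linspace(-3, 3, 200)          # the original inputs x
for _ in range(3):                   # y = f_3(f_2(f_1(x)))
    h = sample_gp_layer(h, rng)      # feed each layer's output into the next
y = h
```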
When do I want Composite Functions

$$y = f_k \circ f_{k-1} \circ \cdots \circ f_1(x)$$

1. My generative process is composite
   • my prior knowledge is composite
2. I want to "re-parametrise" my kernel in a learning setting
   • I have knowledge of the re-parametrisation

80
Because we lack "models"?

81
Composite Functions

82
Composite functions

$$y = f_k(f_{k-1}(\ldots f_1(x))) = f_k \circ f_{k-1} \circ \cdots \circ f_1(x)$$

$$\operatorname{Kern}(f_1) \subseteq \operatorname{Kern}(f_{k-1} \circ \cdots \circ f_2 \circ f_1) \subseteq \operatorname{Kern}(f_k \circ f_{k-1} \circ \cdots \circ f_2 \circ f_1)$$

$$\operatorname{Im}(f_k \circ f_{k-1} \circ \cdots \circ f_2 \circ f_1) \subseteq \operatorname{Im}(f_k \circ f_{k-1} \circ \cdots \circ f_2) \subseteq \cdots \subseteq \operatorname{Im}(f_k)$$

83
Sampling

[figure slides 84–86]
Change of Variables

[figure slides 87–93]
"I’m Bayesian therefore I am superior"

94
Because we want to hang out with the cool kids

Deep Learning is a bit like smoking: you know that it's
wrong but you do it anyway because you want to look
cool.
– Fantomens Djungelordspråk (the Phantom's old jungle saying)

95
MacKay plot

96
Composite Functions

[figure slides 97–107]
The Final Composition

108
Remember why we did this in the first place

109
These damn plots

110
It gets worse

111
It gets even worse

112
Approximate Inference

[graphical model: deep GP with inputs $x$, layers $f_1, \ldots, f_L$ and inducing variables $u_1, \ldots, u_L$]

• Sufficient statistics

$$q(\mathbf{F})\,q(\mathbf{U})\,q(\mathbf{X}) = p(\mathbf{F} \mid \mathbf{Y}, \mathbf{U}, \mathbf{X}, \mathbf{Z})\,q(\mathbf{U})\,q(\mathbf{X}) = p(\mathbf{F} \mid \mathbf{U}, \mathbf{X}, \mathbf{Z})\,q(\mathbf{U})\,q(\mathbf{X})$$

• Mean-Field

$$q(\mathbf{U}) = \prod_i^{L} q(\mathbf{U}_i)$$

113
The effect

114
What have we lost

• Our priors are not reflected correctly
  • → we cannot interpret the results
• No intermediate uncertainties
  • → we cannot do sequential decision making
• We are paying a massive computational overhead for very little benefit
• ". . . throwing out the baby with the bathwater. . . "

115
What we really want¹¹

[figure: panels "Inputs", "Rotation 1", "Translation 1", "Rotation 2", "Translation 2"]

¹¹ Ustyuzhaninov et al., 2019
116
What we really want¹²

¹² Ustyuzhaninov et al., 2019
117
Summary
Summary

• Unsupervised learning¹³ is very hard.
• It's actually not, it's really really easy.
• Relevant assumptions are needed to learn anything useful
• Strong assumptions are needed to learn anything from "sensible" amounts of data
• Stochastic processes such as GPs provide strong, interpretable assumptions that align well with our intuitions, allowing us to make relevant assumptions

¹³ I would argue that there is no such thing
118
Summary II

• Composite functions cannot model more things
• However, they can easily warp the input space to model fewer things
• This leads to high requirements on data
• Even bigger need for uncertainty propagation; we cannot assume noiseless data
• We need to think about correlated uncertainty, not marginals

119
eof

120
Reference
References i

References

Bodin, Erik, Neill D. F. Campbell, and Carl Henrik Ek (2017).


Latent Gaussian Process Regression.
Candela, Joaquin Quiñonero and Carl Edward Rasmussen (2005).
“A Unifying View of Sparse Approximate Gaussian Process
Regression”. In: Journal of Machine Learning Research 6,
pp. 1939–1959.
Damianou, Andreas, Neil D Lawrence, and Carl Henrik Ek (2016).
“Multi-view Learning as a Nonparametric Nonlinear Inter-Battery
Factor Analysis”. In: arXiv preprint arXiv:1604.04939.

121
References ii

Damianou, Andreas C (Feb. 2015). “Deep Gaussian Processes and


Variational Propagation of Uncertainty”. PhD thesis. University
of Sheffield.
Damianou, Andreas C and Neil D Lawrence (2013). “Deep
Gaussian Processes”. In: International Conference on Artificial
Intelligence and Statistics, pp. 207–215.
Hensman, James, N Fusi, and Neil D Lawrence (2013). “Gaussian
Processes for Big Data”. In: Uncertainty in Artificial Intelligence.

122
References iii

Lawrence, Andrew R., Carl Henrik Ek, and Neill W. Campbell


(2019). “DP-GP-LVM: A Bayesian Non-Parametric Model for
Learning Multivariate Dependency Structures”. In: Proceedings of
the 36th International Conference on Machine Learning, ICML
2019, 9-15 June 2019, Long Beach, California, USA,
pp. 3682–3691.
Lawrence, Neil D. and Joaquin Quiñonero Candela (2006). “Local
Distance Preservation in the GP-LVM Through Back
Constraints”. In: Proceedings of the 23rd International
Conference on Machine Learning. ICML ’06. Pittsburgh,
Pennsylvania, USA: ACM, pp. 513–520.

123
References iv

Titsias, Michalis and Neil D Lawrence (2010). “Bayesian Gaussian


Process Latent Variable Model”. In: International Conference on
Artificial Intelligence and Statistics, pp. 844–851.
Ustyuzhaninov, Ivan et al. (2019). “Compositional Uncertainty in
Deep Gaussian Processes”. In: CoRR.
Yousefi, Fariba, Zhenwen Dai, Carl Henrik Ek, and Neil Lawrence
(2016). “Unsupervised Learning With Imbalanced Data Via
Structure Consolidation Latent Variable Model”. In: CoRR.

124
