
Unsupervised and Composite Gaussian

Processes

Carl Henrik Ek - che29@cam.ac.uk


September 15, 2020
http://carlhenrik.com
Learning Theory

• $\mathcal{F}$ — a space of functions
• $\mathcal{A}$ — a learning algorithm
• $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ — a training sample
• $S \sim P(\mathcal{X} \times \mathcal{Y})$
• $\ell(\mathcal{A}_{\mathcal{F}}(S), x, y)$ — a loss function

1
Statistical Learning

$$e(S, \mathcal{A}, \mathcal{F}) = \mathbb{E}_{P(\mathcal{X} \times \mathcal{Y})}\left[\ell(\mathcal{A}_{\mathcal{F}}(S), x, y)\right] \approx \frac{1}{M} \sum_{n=1}^{M} \ell(\mathcal{A}_{\mathcal{F}}(S), x_n, y_n)$$

3
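The Monte Carlo approximation above is just an average of the loss over held-out samples. A minimal Python sketch, where `predictor` and `loss` are hypothetical stand-ins for $\mathcal{A}_{\mathcal{F}}(S)$ and $\ell$:

```python
import numpy as np

def empirical_risk(predictor, X_test, y_test, loss):
    """Monte Carlo estimate of e(S, A, F): average loss over M test points."""
    return np.mean([loss(predictor(x), y) for x, y in zip(X_test, y_test)])

# e.g. with a squared-error loss
squared_loss = lambda f_x, y: (f_x - y) ** 2
```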
No Free Lunch

We can come up with a combination of $\{S, \mathcal{A}, \mathcal{F}\}$ that makes
$e(S, \mathcal{A}, \mathcal{F})$ take an arbitrary value.

4
Example

[figure slides]

5
Data and Knowledge

6
Assumptions: Algorithms

Statistical Learning

$\mathcal{A}_{\mathcal{F}}(S)$

7
Assumptions: Biased Sample

Statistical Learning

$\mathcal{A}_{\mathcal{F}}(S)$

8
Assumptions: Hypothesis space

Statistical Learning

$\mathcal{A}_{\mathcal{F}}(S)$

9
The No Free Lunch

• There seems to be a narrative that the more flexible a model is, the better it is
• This is not true
• The best possible model has infinite support (nothing is excluded) but very focused mass
• Your solution can only ever be interpreted in the light of your assumptions

10
GPSS

Iudicium Posterium Discipulus Est Prioris¹

¹ The posterior is the student of the prior

11
Gaussian Processes
Gaussian Processes

[figure slides 12–22]
Conditional Gaussians

" # " #! " # " #! " # " #!


0 1 0.5 0 1 0.9 0 1 0
N , N , N ,
0 0.5 1 0 0.9 1 0 0 1

23
Gaussian Processes

[figure slides 24–25]
The Gaussian Identities

$$p(x_1, x_2) \qquad p(x_1) = \int p(x_1, x_2)\,\mathrm{d}x_2 \qquad p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)}$$

Gaussian Identities

26
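As a concrete illustration of these identities, here is a minimal numpy sketch that marginalises and conditions one of the bivariate Gaussians shown earlier (the particular mean, covariance and conditioning value are just example choices):

```python
import numpy as np

mu = np.array([0.0, 0.0])
K = np.array([[1.0, 0.9],
              [0.9, 1.0]])

# Marginal: p(x1) = N(mu_1, k_11)
mu1, k11 = mu[0], K[0, 0]

# Conditional: p(x1 | x2) = N(mu_1 + k_12 k_22^{-1} (x2 - mu_2), k_11 - k_12 k_22^{-1} k_21)
x2 = 1.0
cond_mean = mu[0] + K[0, 1] / K[1, 1] * (x2 - mu[1])
cond_var = K[0, 0] - K[0, 1] ** 2 / K[1, 1]
```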
Stochastic Processes
Kolmogorov's Existence Theorem

For all permutations $\pi$, measurable sets $F_i \subseteq \mathbb{R}^n$ and probability measure $\nu$:

1. Exchangeable
$$\nu_{t_{\pi(1)} \cdots t_{\pi(k)}}\left(F_{\pi(1)} \times \cdots \times F_{\pi(k)}\right) = \nu_{t_1 \cdots t_k}\left(F_1 \times \cdots \times F_k\right)$$

2. Marginal
$$\nu_{t_1 \cdots t_k}\left(F_1 \times \cdots \times F_k\right) = \nu_{t_1 \cdots t_k\, t_{k+1} \cdots t_{k+m}}\left(F_1 \times \cdots \times F_k \times \mathbb{R}^n \times \cdots \times \mathbb{R}^n\right)$$

In this case the finite-dimensional probability measures are realisations of an underlying stochastic process.

27
Gaussian Distribution - Exchangeable

$$p(x_1, x_2) = \mathcal{N}\!\left(\begin{bmatrix}x_1\\x_2\end{bmatrix} \middle|\, \begin{bmatrix}\mu_1\\\mu_2\end{bmatrix}\!,\begin{bmatrix}k_{11} & k_{12}\\k_{21} & k_{22}\end{bmatrix}\right) = p(x_2, x_1) = \mathcal{N}\!\left(\begin{bmatrix}x_2\\x_1\end{bmatrix} \middle|\, \begin{bmatrix}\mu_2\\\mu_1\end{bmatrix}\!,\begin{bmatrix}k_{22} & k_{21}\\k_{12} & k_{11}\end{bmatrix}\right)$$

28
Gaussian Distribution - Exchangeable

[figure slide 29]
Gaussian Distribution - Marginal

$$p(x_1, x_2) = \mathcal{N}\!\left(\begin{bmatrix}x_1\\x_2\end{bmatrix} \middle|\, \begin{bmatrix}\mu_1\\\mu_2\end{bmatrix}\!,\begin{bmatrix}k_{11} & k_{12}\\k_{21} & k_{22}\end{bmatrix}\right) \;\Rightarrow\; p(x_1) = \int_{x_2} p(x_1, x_2) = \mathcal{N}(x_1 \mid \mu_1, k_{11})$$

$$p(x_1, x_2, \ldots, x_N) = \mathcal{N}\!\left(\begin{bmatrix}x_1\\x_2\\\vdots\\x_N\end{bmatrix} \middle|\, \begin{bmatrix}\mu_1\\\mu_2\\\vdots\\\mu_N\end{bmatrix}\!,\begin{bmatrix}k_{11} & k_{12} & \cdots & k_{1N}\\k_{21} & k_{22} & \cdots & k_{2N}\\\vdots & \vdots & \ddots & \vdots\\k_{N1} & k_{N2} & \cdots & k_{NN}\end{bmatrix}\right) \;\Rightarrow\; p(x_1) = \int_{x_2, \ldots, x_N} p(x_1, x_2, \ldots, x_N) = \mathcal{N}(x_1 \mid \mu_1, k_{11})$$

30
Gaussian Distribution - Marginal

31
Gaussian processes

$$\mathcal{GP}(\cdot, \cdot) \;\longrightarrow\; \mathcal{N}(\cdot, \cdot) \qquad M \in \mathbb{R}^{\infty \times N} \qquad \infty \to N$$

The Gaussian distribution is the projection of the infinite Gaussian process.

32
Unsupervised Gaussian Processes
Unsupervised Learning

[graphical models: $x \to f_i \to y_i$ with parameters $\theta$, plate over $D$ — left: $p(\mathbf{y} \mid \mathbf{x})$ with $x$ observed, right: $p(\mathbf{y})$ with $x$ latent]

33
Unsupervised Learning

[figure slides 34–35]
Priors

36
Priors

$$p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{x})\,p(\mathbf{x})\,\mathrm{d}\mathbf{f}\,\mathrm{d}\mathbf{x}$$

1. Priors that make sense
   • $p(\mathbf{f})$ describes our beliefs/assumptions and defines our notion of complexity in the function
   • $p(\mathbf{x})$ expresses our beliefs/assumptions and defines our notion of complexity in the latent space
2. Now let's churn the handle

37
Relationship between x and data

$$p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{x})\,p(\mathbf{x})\,\mathrm{d}\mathbf{f}\,\mathrm{d}\mathbf{x}$$

• GP prior
$$p(\mathbf{f} \mid \mathbf{x}) \sim \mathcal{N}(\mathbf{0}, K) \propto e^{-\frac{1}{2}\mathbf{f}^{\mathrm{T}} K^{-1} \mathbf{f}}, \qquad K_{ij} = e^{-(x_i - x_j)^{\mathrm{T}} M^{\mathrm{T}} M (x_i - x_j)}$$

• Likelihood
$$p(\mathbf{y} \mid \mathbf{f}) \sim \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \beta) \propto e^{-\frac{1}{2\beta}\operatorname{tr}\left((\mathbf{y}-\mathbf{f})^{\mathrm{T}}(\mathbf{y}-\mathbf{f})\right)}$$

38
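A minimal numpy sketch of the covariance above, $K_{ij} = e^{-(x_i - x_j)^{\mathrm{T}} M^{\mathrm{T}} M (x_i - x_j)}$; the shapes of `X` and `M` used here are illustrative assumptions:

```python
import numpy as np

def kernel(X, M):
    """X: (N, D) inputs, M: (D, D) linear map; returns the (N, N) Gram matrix."""
    Z = X @ M.T                                                  # rows are M x_i
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)   # ||M x_i - M x_j||^2
    return np.exp(-d2)

X = np.random.randn(5, 2)
K = kernel(X, np.eye(2))   # M = I gives a unit-lengthscale RBF kernel
```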
Laplace Integration

"Nature laughs at the difficulties of integrations"


– Pierre-Simon Laplace

39
Approximate Inference
Machine Learning

p(y)

• Given some observed data y


• Find a probabilistic model such that the probability of the data
is maximised
• Idea: find an approximate model q that we can integrate

40
Lower Bound

[graphical models: $x \to y$, with variational parameters $\theta$]

$$p(\mathbf{y}) = \int_{\mathbf{x}} p(\mathbf{y} \mid \mathbf{x})\,p(\mathbf{x}) = \frac{p(\mathbf{y} \mid \mathbf{x})\,p(\mathbf{x})}{p(\mathbf{x} \mid \mathbf{y})} \qquad q_\theta(\mathbf{x}) \approx p(\mathbf{x} \mid \mathbf{y})$$

41
Deterministic Approximation

[figure: the log marginal likelihood $\log p(\mathbf{y})$ decomposed into $\mathcal{L}(q)$ and $\mathrm{KL}(q \,\|\, p)$]

42
Variational Bayes

$$\begin{aligned}
\log p(\mathbf{y}) &= \log p(\mathbf{y}) + \log \int \frac{p(\mathbf{x} \mid \mathbf{y})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x} \\
&= \int q(\mathbf{x}) \log p(\mathbf{y})\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log \frac{p(\mathbf{x} \mid \mathbf{y})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x} \\
&= \int q(\mathbf{x}) \log \frac{p(\mathbf{x} \mid \mathbf{y})\,p(\mathbf{y})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x} \\
&= \int q(\mathbf{x}) \log \frac{1}{q(\mathbf{x})}\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log p(\mathbf{x}, \mathbf{y})\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x}
\end{aligned}$$

43
Jensen Inequality

Convex Function

$$\lambda f(x_0) + (1 - \lambda) f(x_1) \geq f(\lambda x_0 + (1 - \lambda) x_1), \qquad x \in [x_{\min}, x_{\max}],\; \lambda \in [0, 1]$$

44
Jensen Inequality

$$\mathbb{E}[f(x)] \geq f(\mathbb{E}[x])$$

$$\int f(x)\,p(x)\,\mathrm{d}x \geq f\!\left(\int x\,p(x)\,\mathrm{d}x\right)$$

45
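A quick numerical sanity check of $\mathbb{E}[f(x)] \geq f(\mathbb{E}[x])$ for the convex function $f(x) = x^2$ (the choice of sampling distribution here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
f = lambda t: t ** 2

print(np.mean(f(x)) >= f(np.mean(x)))   # True: Jensen's inequality holds
```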
Jensen Inequality in Variational Bayes

$$\int \log(x)\,p(x)\,\mathrm{d}x \leq \log\!\left(\int x\,p(x)\,\mathrm{d}x\right)$$

Since $\log$ is concave, moving the log inside the integral gives a lower bound on the log of the integral.
46
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)

47
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
p(x|y)
Z
= − q(x) log dx
q(x)

47
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
p(x|y)
Z
= − q(x) log dx
q(x)
Z
≥ −log p(x|y)dx = −log1 = 0

47
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)

48
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
= {Lets assume that q(x) = p(x|y)}

48
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
= {Lets assume that q(x) = p(x|y)}
p(x|y)
Z
= p(x|y) log dx
p(x|y)
| {z }
=1

48
The "posterior" term

q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
= {Lets assume that q(x) = p(x|y)}
p(x|y)
Z
= p(x|y) log dx
p(x|y)
| {z }
=1

=0

48
Kullback-Leibler Divergence

$$\mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x} \mid \mathbf{y})) = \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x}$$

• Measure of divergence between distributions


• Not a metric (not symmetric)
• KL(q(x)||p(x|y)) = 0 ⇔ q(x) = p(x|y)
• KL(q(x)||p(x|y)) ≥ 0

49
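These properties are easy to see numerically. A small sketch using the closed-form KL divergence between two univariate Gaussians (the particular means and variances are arbitrary):

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """KL(N(m1, s1^2) || N(m2, s2^2)) in closed form."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0.0, 1.0, 1.0, 2.0))   # non-negative
print(kl_gauss(1.0, 2.0, 0.0, 1.0))   # different value: not symmetric
print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0 exactly when the distributions match
```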
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y)
Z
= p(x|y)log dx
p(x|y)

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)
p(x|y)
Z Z
= p(x|y)log dx + p(x|y)logp(y)dx
p(x|y)
| {z }
=1

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)
p(x|y)
Z Z
= p(x|y)log dx + p(x|y)logp(y)dx
p(x|y)
| {z }
=1
Z
= p(x|y)dx logp(y)
| {z }
=1

50
The "other terms"

1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)
p(x|y)
Z Z
= p(x|y)log dx + p(x|y)logp(y)dx
p(x|y)
| {z }
=1
Z
= p(x|y)dx logp(y) = logp(y)
| {z }
=1

50
Variational Bayes

$$\log p(\mathbf{y}) = \int q(\mathbf{x}) \log \frac{1}{q(\mathbf{x})}\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log p(\mathbf{x}, \mathbf{y})\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x} \mid \mathbf{y})}\,\mathrm{d}\mathbf{x}$$

$$\log p(\mathbf{y}) \geq -\int q(\mathbf{x}) \log q(\mathbf{x})\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log p(\mathbf{x}, \mathbf{y})\,\mathrm{d}\mathbf{x}$$

• The Evidence Lower BOund (ELBO)
• Tight if $q(\mathbf{x}) = p(\mathbf{x} \mid \mathbf{y})$

51
Deterministic Approximation

[figure: the log marginal likelihood $\log p(\mathbf{y})$ decomposed into $\mathcal{L}(q)$ and $\mathrm{KL}(q \,\|\, p)$]

52
ELBO

$$\log p(\mathbf{y}) \geq -\int q(\mathbf{x}) \log q(\mathbf{x})\,\mathrm{d}\mathbf{x} + \int q(\mathbf{x}) \log p(\mathbf{x}, \mathbf{y})\,\mathrm{d}\mathbf{x} = \mathbb{E}_{q(\mathbf{x})}[\log p(\mathbf{x}, \mathbf{y})] + H(q(\mathbf{x})) = \mathcal{L}(q(\mathbf{x}))$$

• If we maximise the ELBO we:
  • find an approximate posterior
  • lower bound the marginal likelihood
• maximising $p(\mathbf{y})$ is learning
• finding $q(\mathbf{x}) \approx p(\mathbf{x} \mid \mathbf{y})$ is prediction

53
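A minimal sketch of estimating $\mathcal{L}(q)$ by Monte Carlo for a Gaussian $q(x) = \mathcal{N}(m, s^2)$; the toy log-joint used here is an assumption purely for illustration, with constants dropped:

```python
import numpy as np

def elbo(log_joint, m, s, n_samples=1000, seed=0):
    """E_q[log p(x, y)] + H(q) for q(x) = N(m, s^2), estimated by sampling."""
    rng = np.random.default_rng(seed)
    x = m + s * rng.standard_normal(n_samples)        # samples from q
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)   # H(q) for a Gaussian
    return np.mean(log_joint(x)) + entropy

# Toy joint (up to constants): p(x) = N(0, 1), p(y = 1.2 | x) = N(x, 0.1^2)
log_joint = lambda x: -0.5 * x**2 - 0.5 * ((1.2 - x) / 0.1) ** 2
print(elbo(log_joint, m=1.0, s=0.3))
```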
Lower Bound

[graphical models: $x \to y$, with variational parameters $\theta$]

$$p(\mathbf{y}) = \int_{\mathbf{x}} p(\mathbf{y} \mid \mathbf{x})\,p(\mathbf{x}) = \frac{p(\mathbf{y} \mid \mathbf{x})\,p(\mathbf{x})}{p(\mathbf{x} \mid \mathbf{y})} \qquad q_\theta(\mathbf{x}) \approx p(\mathbf{x} \mid \mathbf{y})$$

54
Why is this useful?

Why is this a sensible thing to do?

• If we can’t formulate the joint distribution there isn’t much we can do
• Taking the expectation of a log is usually easier than the log of an expectation
• We are allowed to choose the distribution to take the expectation over

– Ryan Adams²

² Talking Machines Podcast
55
How to choose Q?

$$\mathcal{L}(q(\mathbf{x})) = \mathbb{E}_{q(\mathbf{x})}[\log p(\mathbf{x}, \mathbf{y})] + H(q(\mathbf{x}))$$

• We have to be able to compute an expectation over the joint


distribution
• The second term should be trivial

56
Lower Bound³

$$\begin{aligned}
\mathcal{L} &= \int_{\mathbf{x}} q(\mathbf{x}) \log \frac{p(\mathbf{y}, \mathbf{f}, \mathbf{x})}{q(\mathbf{x})} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \log \frac{p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{x})\,p(\mathbf{x})}{q(\mathbf{x})} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \log p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{x}) - \int_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})} \\
&= \tilde{\mathcal{L}} - \mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x}))
\end{aligned}$$

³ Damianou, 2015
57
Lower Bound

$$\tilde{\mathcal{L}} = \int q(\mathbf{x}) \log p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{x})\,\mathrm{d}\mathbf{f}\,\mathrm{d}\mathbf{x}$$

• This has not alleviated the problem at all: $\mathbf{x}$ still needs to go through $\mathbf{f}$ to reach the data
• Idea of sparse approximations⁴

⁴ Candela et al., 2005
58
Lower Bound⁵

$$p(\mathbf{f}, \mathbf{u} \mid \mathbf{x}, \mathbf{z}) = p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,p(\mathbf{u} \mid \mathbf{z}) = \mathcal{N}(\mathbf{f} \mid K_{fu} K_{uu}^{-1} \mathbf{u},\; K_{ff} - K_{fu} K_{uu}^{-1} K_{uf})\,\mathcal{N}(\mathbf{u} \mid \mathbf{0}, K_{uu})$$

• Add another set of samples from the same prior
• Conditional distribution

⁵ Titsias et al., 2010
59
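A small numpy sketch of the conditional above, $p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z}) = \mathcal{N}(K_{fu} K_{uu}^{-1}\mathbf{u},\, K_{ff} - K_{fu} K_{uu}^{-1} K_{uf})$; the kernel function and the jitter term used here are illustrative assumptions:

```python
import numpy as np

def conditional(X, Z, u, k, jitter=1e-6):
    """Mean and covariance of f at inputs X, given inducing values u at inputs Z."""
    Kff = k(X, X)
    Kfu = k(X, Z)
    Kuu = k(Z, Z) + jitter * np.eye(len(Z))
    A = np.linalg.solve(Kuu, Kfu.T).T       # A = K_fu K_uu^{-1}
    return A @ u, Kff - A @ Kfu.T

# e.g. a unit-lengthscale RBF kernel on (N, D) arrays
k = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
```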
Lower Bound

$$p(\mathbf{y}, \mathbf{f}, \mathbf{u}, \mathbf{x} \mid \mathbf{z}) = p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{u}, \mathbf{x})\,p(\mathbf{u} \mid \mathbf{z})\,p(\mathbf{x})$$

• we have done nothing to the model, just projected an additional set of marginals from the GP
• however, we will now interpret $\mathbf{u}$ and $\mathbf{z}$ not as random variables but as variational parameters
• i.e. the variational distribution $q(\cdot)$ is parametrised by these

60
Lower Bound

• Variational distributions are approximations to intractable posteriors,

$$q(\mathbf{u}) \approx p(\mathbf{u} \mid \mathbf{y}, \mathbf{x}, \mathbf{z}, \mathbf{f}) \qquad q(\mathbf{f}) \approx p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z}, \mathbf{y}) \qquad q(\mathbf{x}) \approx p(\mathbf{x} \mid \mathbf{y})$$

• The bound is tight if $\mathbf{u}$ completely represents $\mathbf{f}$, i.e. $\mathbf{u}$ is a sufficient statistic for $\mathbf{f}$:

$$q(\mathbf{f}) \approx p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z}, \mathbf{y}) = p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})$$

61
Lower Bound

$$\begin{aligned}
\tilde{\mathcal{L}} &= \int_{\mathbf{x}, \mathbf{f}, \mathbf{u}} q(\mathbf{f})\,q(\mathbf{u})\,q(\mathbf{x}) \log \frac{p(\mathbf{y}, \mathbf{f}, \mathbf{u} \mid \mathbf{x}, \mathbf{z})}{q(\mathbf{f})\,q(\mathbf{u})} \\
&= \int_{\mathbf{x}, \mathbf{f}, \mathbf{u}} q(\mathbf{f})\,q(\mathbf{u})\,q(\mathbf{x}) \log \frac{p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,p(\mathbf{u} \mid \mathbf{z})}{q(\mathbf{f})\,q(\mathbf{u})}
\end{aligned}$$

• Assume that $\mathbf{u}$ is a sufficient statistic for $\mathbf{f}$:

$$q(\mathbf{f}) = p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})$$

62
Lower Bound

$$\begin{aligned}
\tilde{\mathcal{L}} &= \int_{\mathbf{x}, \mathbf{f}, \mathbf{u}} q(\mathbf{f})\,q(\mathbf{u})\,q(\mathbf{x}) \log \frac{p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,p(\mathbf{u} \mid \mathbf{z})}{q(\mathbf{f})\,q(\mathbf{u})} \\
&= \int_{\mathbf{x}, \mathbf{f}, \mathbf{u}} p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,q(\mathbf{u})\,q(\mathbf{x}) \log \frac{p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,p(\mathbf{u} \mid \mathbf{z})}{p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,q(\mathbf{u})} \\
&= \int_{\mathbf{x}, \mathbf{f}, \mathbf{u}} p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})\,q(\mathbf{u})\,q(\mathbf{x}) \log \frac{p(\mathbf{y} \mid \mathbf{f})\,p(\mathbf{u} \mid \mathbf{z})}{q(\mathbf{u})} \\
&= \mathbb{E}_{p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})}\left[\log p(\mathbf{y} \mid \mathbf{f})\right] - \mathrm{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u} \mid \mathbf{z}))
\end{aligned}$$

63
Lower Bound

$$\mathcal{L} = \mathbb{E}_{p(\mathbf{f} \mid \mathbf{u}, \mathbf{x}, \mathbf{z})}\left[\log p(\mathbf{y} \mid \mathbf{f})\right] - \mathrm{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u} \mid \mathbf{z})) - \mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x}))$$

• Expectation tractable (for some covariances)
• Allows us to place priors, not "regularisers", over the latent representation
• Stochastic inference (Hensman et al., 2013)
• Importantly, $p(\mathbf{x})$ only appears in the $\mathrm{KL}(\cdot \,\|\, \cdot)$ term!

64
Latent Space Priors

p(x) ∼ N (0, I)

65
Automatic Relevance Determination

$$k(x_i, x_j) = \sigma\, e^{-\sum_{d}^{D} \alpha_d (x_{i,d} - x_{j,d})^2}$$

GPy Code

```python
RBF(..., ARD=True)
Matern32(..., ARD=True)
```

66
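A hedged usage sketch of ARD in GPy: the data matrix `Y`, the latent dimensionality `Q` and the number of inducing points below are placeholder choices made here for illustration, not values from the slides. After optimisation, a large lengthscale for a latent dimension suggests that dimension is irrelevant:

```python
import numpy as np
import GPy

Y = np.random.randn(100, 12)                  # placeholder (N, D) data
Q = 5                                         # assumed latent dimensionality
kern = GPy.kern.RBF(Q, ARD=True)              # one lengthscale per latent dimension
model = GPy.models.BayesianGPLVM(Y, Q, kernel=kern, num_inducing=20)
model.optimize(messages=False)
print(kern.lengthscale)                       # relevance of each latent dimension
```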
Dynamic Prior

p(x | t) = N (µt , Kt )

67
Structured Latent Spaces
Explaining Away

[graphical model: $x$ and $\epsilon$ both feed into $y$]

$$y = f(x) + \epsilon \qquad\Longleftrightarrow\qquad y - \epsilon = f(x)$$

68
Factor Analysis

[graphical model: $x_1, x_2, x_3$ and $\epsilon$ feed into $y$]

$$y = f(x_1, x_2, x_3) + \epsilon$$

69
Alignments

[figure slides: examples labelled {duck} and {cat}, grouped as Relevant, Irrelevant and Ambiguous]

70
IBFA with GP-LVM⁶

[graphical model: shared latent $x$ with view-specific weights, mappings $f_1, f_2$ with parameters $\theta_1, \theta_2$, observations $y_1, y_2$ over plates $D_1, D_2$]

$$y_1 = f(w_1^{\mathrm{T}} x) \qquad y_2 = f(w_2^{\mathrm{T}} x)$$

⁶ Damianou et al., 2016
71
GP-DP⁷

⁷ Lawrence et al., 2019
72
Constrained Latent Space⁸

[graphical model: $y$ and $\epsilon$]

$$y = f(g(y)) + \epsilon$$

⁸ Lawrence et al., 2006
73
Geometry

74
Latent GP-Regression⁹

$$p(\mathbf{Y} \mid \mathbf{X}) = \int p(\mathbf{Y} \mid \mathbf{F})\, p(\mathbf{F} \mid \mathbf{X}, \mathbf{X}^{(C)})\, p(\mathbf{X}^{(C)})\,\mathrm{d}\mathbf{F}\,\mathrm{d}\mathbf{X}^{(C)}$$

⁹ Bodin et al., 2017; Yousefi et al., 2016
75
Discrete

76
Continuous

77
Composite Gaussian Processes
Composite Gaussian Processes¹⁰

[graphical model: $x_2 \to f_2 \to x_1 \to f_1 \to y$]

¹⁰ Damianou et al., 2013
78
Composite Functions

$$y = f_k(f_{k-1}(\ldots f_1(x))) = f_k \circ f_{k-1} \circ \cdots \circ f_1(x)$$

79
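A toy sketch of what composing GP draws looks like in one dimension: each layer is one sample from a zero-mean GP prior, evaluated at the previous layer's output. The kernel, lengthscale and number of layers are arbitrary choices made here for illustration:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def sample_gp_layer(x, rng, jitter=1e-8):
    """One draw from a zero-mean GP prior evaluated at the inputs x."""
    K = rbf(x, x) + jitter * np.eye(len(x))
    return rng.multivariate_normal(np.zeros(len(x)), K)

rng = np.random.default_rng(0)
h = np.linspace(-3, 3, 200)          # the original inputs x
for _ in range(3):                   # y = f_3(f_2(f_1(x)))
    h = sample_gp_layer(h, rng)      # feed each layer's output into the next
y = h
```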
When do I want Composite Functions

$$y = f_k \circ f_{k-1} \circ \cdots \circ f_1(x)$$

1. My generative process is composite
   • my prior knowledge is composite
2. I want to "re-parametrise" my kernel in a learning setting
   • I have knowledge of the re-parametrisation

80
Because we lack "models"?

81
Composite Functions

82
Composite functions

$$y = f_k(f_{k-1}(\ldots f_1(x))) = f_k \circ f_{k-1} \circ \cdots \circ f_1(x)$$

$$\operatorname{Kern}(f_1) \subseteq \operatorname{Kern}(f_{k-1} \circ \cdots \circ f_2 \circ f_1) \subseteq \operatorname{Kern}(f_k \circ f_{k-1} \circ \cdots \circ f_2 \circ f_1)$$

$$\operatorname{Im}(f_k \circ f_{k-1} \circ \cdots \circ f_2 \circ f_1) \subseteq \operatorname{Im}(f_k \circ f_{k-1} \circ \cdots \circ f_2) \subseteq \cdots \subseteq \operatorname{Im}(f_k)$$

83
Sampling

[figure slides 84–86]
Change of Variables

[figure slides 87–93]
"I’m Bayesian therefore I am superior"

94
Because we want to hang out with the cool kids

Deep Learning is a bit like smoking: you know that it's
wrong but you do it anyway because you want to look
cool.
– Fantomens Djungelordspråk (the Phantom's old jungle saying)

95
MacKay plot

96
Composite Functions

[figure slides 97–107]
The Final Composition

108
Remember why we did this in the first place

109
These damn plots

110
It gets worse

111
It gets even worse

112
Approximate Inference

[graphical model: deep GP with inputs $x$, layers $f_1, \ldots, f_L$ and inducing variables $u_1, \ldots, u_L$]

• Sufficient statistics

$$q(\mathbf{F})\,q(\mathbf{U})\,q(\mathbf{X}) = p(\mathbf{F} \mid \mathbf{Y}, \mathbf{U}, \mathbf{X}, \mathbf{Z})\,q(\mathbf{U})\,q(\mathbf{X}) = p(\mathbf{F} \mid \mathbf{U}, \mathbf{X}, \mathbf{Z})\,q(\mathbf{U})\,q(\mathbf{X})$$

• Mean-Field

$$q(\mathbf{U}) = \prod_i^{L} q(\mathbf{U}_i)$$

113
The effect

114
What have we lost

• Our priors are not reflected correctly
  • → we cannot interpret the results
• No intermediate uncertainties
  • → we cannot do sequential decision making
• We are paying a massive computational overhead for very little benefit
• ". . . throwing out the baby with the bathwater. . . "

115
What we really want¹¹

[figure: panels "Inputs", "Rotation 1", "Translation 1", "Rotation 2", "Translation 2"]

¹¹ Ustyuzhaninov et al., 2019
116
What we really want¹²

¹² Ustyuzhaninov et al., 2019
117
Summary
Summary

• Unsupervised learning¹³ is very hard.
• It's actually not, it's really really easy.
• Relevant assumptions are needed to learn anything useful
• Strong assumptions are needed to learn anything from "sensible" amounts of data
• Stochastic processes such as GPs provide strong, interpretable assumptions that align well with our intuitions, allowing us to make relevant assumptions

¹³ I would argue that there is no such thing
118
Summary II

• Composite functions cannot model more things
• However, they can easily warp the input space to model fewer things
• This leads to high requirements on data
• Even bigger need for uncertainty propagation; we cannot assume noiseless data
• We need to think about correlated uncertainty, not marginals

119
eof

120
Reference
References i

References

Bodin, Erik, Neill D. F. Campbell, and Carl Henrik Ek (2017).


Latent Gaussian Process Regression.
Candela, Joaquin Quiñonero and Carl Edward Rasmussen (2005).
“A Unifying View of Sparse Approximate Gaussian Process
Regression”. In: Journal of Machine Learning Research 6,
pp. 1939–1959.
Damianou, Andreas, Neil D Lawrence, and Carl Henrik Ek (2016).
“Multi-view Learning as a Nonparametric Nonlinear Inter-Battery
Factor Analysis”. In: arXiv preprint arXiv:1604.04939.

121
References ii

Damianou, Andreas C (Feb. 2015). “Deep Gaussian Processes and


Variational Propagation of Uncertainty”. PhD thesis. University
of Sheffield.
Damianou, Andreas C and Neil D Lawrence (2013). “Deep
Gaussian Processes”. In: International Conference on Artificial
Intelligence and Statistics, pp. 207–215.
Hensman, James, N Fusi, and Neil D Lawrence (2013). “Gaussian
Processes for Big Data”. In: Uncertainty in Artificial Intelligence.

122
References iii

Lawrence, Andrew R., Carl Henrik Ek, and Neill W. Campbell


(2019). “DP-GP-LVM: A Bayesian Non-Parametric Model for
Learning Multivariate Dependency Structures”. In: Proceedings of
the 36th International Conference on Machine Learning, ICML
2019, 9-15 June 2019, Long Beach, California, USA,
pp. 3682–3691.
Lawrence, Neil D. and Joaquin Quiñonero Candela (2006). “Local
Distance Preservation in the GP-LVM Through Back
Constraints”. In: Proceedings of the 23rd International
Conference on Machine Learning. ICML ’06. Pittsburgh,
Pennsylvania, USA: ACM, pp. 513–520.

123
References iv

Titsias, Michalis and Neil D Lawrence (2010). “Bayesian Gaussian


Process Latent Variable Model”. In: International Conference on
Artificial Intelligence and Statistics, pp. 844–851.
Ustyuzhaninov, Ivan et al. (2019). “Compositional Uncertainty in
Deep Gaussian Processes”. In: CoRR.
Yousefi, Fariba, Zhenwen Dai, Carl Henrik Ek, and Neil Lawrence
(2016). “Unsupervised Learning With Imbalanced Data Via
Structure Consolidation Latent Variable Model”. In: CoRR.

124
