• F — a space of functions
• A — a learning algorithm
• S = {(x1, y1), ..., (xN, yN)} — a training sample
• S ∼ P(X × Y)
• ℓ(A_F(S), x, y) — a loss function
Statistical Learning

No Free Lunch
Example
Data and Knowledge
Assumptions: Algorithms

Assumptions: Biased Sample

Assumptions: Hypothesis space
The No Free Lunch
GPSS
The posterior is the student of the prior
Gaussian Processes
Conditional Gaussians
The Gaussian Identities

p(x1) = ∫ p(x1, x2) dx2
p(x1 | x2) = p(x1, x2) / p(x2)
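As a quick numerical sanity check (my own sketch, not from the slides; the joint mean and covariance are made-up numbers), the conditioning identity can be verified against the standard Gaussian conditional formulas:

```python
import numpy as np
from numpy.linalg import inv, det

def gauss_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Illustrative joint Gaussian over (x1, x2); the numbers are arbitrary
mu = np.array([0.5, -1.0])
K = np.array([[2.0, 0.8],
              [0.8, 1.5]])

def joint_pdf(x):
    d = x - mu
    return np.exp(-0.5 * d @ inv(K) @ d) / (2 * np.pi * np.sqrt(det(K)))

# Conditioning identity: p(x1 | x2) = p(x1, x2) / p(x2)
x1, x2 = 1.2, 0.3
cond_mu = mu[0] + K[0, 1] / K[1, 1] * (x2 - mu[1])      # conditional mean
cond_var = K[0, 0] - K[0, 1] ** 2 / K[1, 1]             # conditional variance
lhs = gauss_pdf(x1, cond_mu, cond_var)                  # Gaussian-identity form
rhs = joint_pdf(np.array([x1, x2])) / gauss_pdf(x2, mu[1], K[1, 1])
assert abs(lhs - rhs) < 1e-9
```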
Stochastic Processes

Kolmogorov's Existence Theorem

1. Exchangeable:
ν_{t_π(1) ··· t_π(k)}(F_π(1) × · · · × F_π(k)) = ν_{t_1 ··· t_k}(F_1 × · · · × F_k)
2. Marginal
Gaussian Distribution - Exchangeable

p(x1, x2) = N([x1; x2] | [µ1; µ2], [k11 k12; k21 k22])
          = p(x2, x1) = N([x2; x1] | [µ2; µ1], [k22 k21; k12 k11])
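Exchangeability is easy to check numerically (a sketch with illustrative numbers of my choosing): permuting the variables together with the corresponding mean entries and covariance blocks leaves the density unchanged.

```python
import numpy as np
from numpy.linalg import inv, det

def mvn_pdf(x, mu, K):
    """Multivariate Gaussian density N(x | mu, K)."""
    d = x - mu
    return np.exp(-0.5 * d @ inv(K) @ d) / np.sqrt((2 * np.pi) ** len(x) * det(K))

# Illustrative 3-D Gaussian (arbitrary numbers)
mu = np.array([0.5, -1.0, 2.0])
K = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.1],
              [0.3, 0.1, 1.0]])
x = np.array([0.1, -0.4, 1.7])

# Permute the variables together with the mean and covariance blocks
perm = np.array([2, 0, 1])
p_orig = mvn_pdf(x, mu, K)
p_perm = mvn_pdf(x[perm], mu[perm], K[np.ix_(perm, perm)])
assert abs(p_orig - p_perm) < 1e-9
```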
Gaussian Distribution - Marginal

p(x1, x2) = N([x1; x2] | [µ1; µ2], [k11 k12; k21 k22])
⇒ p(x1) = ∫ p(x1, x2) dx2 = N(x1 | µ1, k11)

p(x1, x2, ..., xN) = N([x1; x2; ...; xN] | [µ1; µ2; ...; µN], [k11 k12 ··· k1N; k21 k22 ··· k2N; ...; kN1 kN2 ··· kNN])
⇒ p(x1) = ∫ p(x1, x2, ..., xN) dx2 ··· dxN = N(x1 | µ1, k11)
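The marginalisation property can also be checked by brute force (my own sketch; the joint is the same illustrative bivariate Gaussian as above): integrating the joint density over x2 on a wide grid recovers N(x1 | µ1, k11), read straight off the corresponding blocks.

```python
import numpy as np
from numpy.linalg import inv, det

# Illustrative joint Gaussian over (x1, x2)
mu = np.array([0.5, -1.0])
K = np.array([[2.0, 0.8],
              [0.8, 1.5]])
Kinv = inv(K)

# Numerically integrate out x2: p(x1) = ∫ p(x1, x2) dx2
x1 = 0.9
x2_grid = np.linspace(-15.0, 15.0, 4001)
dx2 = x2_grid[1] - x2_grid[0]
d1 = x1 - mu[0]
d2 = x2_grid - mu[1]
quad = Kinv[0, 0] * d1 ** 2 + 2 * Kinv[0, 1] * d1 * d2 + Kinv[1, 1] * d2 ** 2
joint = np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(det(K)))
marginal_numeric = joint.sum() * dx2

# The Gaussian marginal just reads off the blocks: N(x1 | mu1, k11)
marginal_exact = np.exp(-0.5 * d1 ** 2 / K[0, 0]) / np.sqrt(2 * np.pi * K[0, 0])
assert abs(marginal_numeric - marginal_exact) < 1e-5
```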
Gaussian processes

GP(·, ·) is the infinite-dimensional analogue of N(·, ·): think of M ∈ R^{∞×N}, with marginalisation taking us from the ∞ dimensions of the process to the N we observe.
Unsupervised Gaussian Processes

Unsupervised Learning

[Figure: two graphical models — latent x, mappings f_i with hyperparameters θ, observations y_i on a plate over D — contrasting supervised learning, p(y|x), with unsupervised learning, p(y).]
Unsupervised Learning
Priors

p(y) = ∫ p(y|f) p(f|x) p(x) df dx
Relationship between x and data

p(y) = ∫ p(y|f) p(f|x) p(x) df dx

• GP prior
p(f|x) ∼ N(0, K) ∝ e^{−(1/2) fᵀ K⁻¹ f},   K_ij = e^{−(x_i − x_j)ᵀ Mᵀ M (x_i − x_j)}
• Likelihood
p(y|f) ∼ N(y | f, β) ∝ e^{−(1/(2β)) tr((y−f)ᵀ(y−f))}
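To make the GP prior concrete, here is a sketch (my own, not from the slides) that builds the kernel above, K_ij = exp(−(x_i − x_j)ᵀ Mᵀ M (x_i − x_j)), for a one-dimensional input grid and draws one prior sample; M, the grid, and the jitter are all free choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)[:, None]   # 50 one-dimensional inputs (illustrative)
M = np.array([[1.0]])                      # the kernel's M matrix; an arbitrary choice

# K_ij = exp(-(x_i - x_j)^T M^T M (x_i - x_j))
diff = x[:, None, :] - x[None, :, :]
K = np.exp(-np.einsum('ijk,kl,ijl->ij', diff, M.T @ M, diff))

# Draw f ~ N(0, K); a small jitter keeps the Cholesky factorisation stable
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
f = L @ rng.normal(size=len(x))
assert K.shape == (50, 50) and f.shape == (50,)
assert np.allclose(K, K.T)                 # kernel matrices are symmetric
```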
Laplace Integration

Approximate Inference
Lower Bound

p(y) = ∫ p(y|x) p(x) dx = p(y|x) p(x) / p(x|y),   q_θ(x) ≈ p(x|y)
Deterministic Approximation

[Figure: decomposition of log p(y) into L(q) and KL(q||p).]
Variational Bayes

log p(y) = ∫ q(x) log p(y) dx + ∫ q(x) log [p(x|y)/p(x|y)] dx
         = ∫ q(x) log [p(x|y) p(y) / p(x|y)] dx
         = ∫ q(x) log [q(x)/q(x)] dx + ∫ q(x) log p(x, y) dx + ∫ q(x) log [1/p(x|y)] dx
         = ∫ q(x) log [1/q(x)] dx + ∫ q(x) log p(x, y) dx + ∫ q(x) log [q(x)/p(x|y)] dx
Jensen Inequality

Convex function:
λ f(x0) + (1 − λ) f(x1) ≥ f(λ x0 + (1 − λ) x1),   x ∈ [xmin, xmax],   λ ∈ [0, 1]
Jensen Inequality in Variational Bayes

∫ log(x) p(x) dx ≤ log ∫ x p(x) dx
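The inequality above is just Jensen applied to the concave log: E[log X] ≤ log E[X]. A quick sample-based check (the gamma distribution is an arbitrary choice of positive random variable, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(2.0, 1.0, size=100_000)   # any positive random variable will do

lhs = np.mean(np.log(x))                # E[log X]
rhs = np.log(np.mean(x))                # log E[X]
assert lhs < rhs                        # Jensen: log is concave
```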
The "posterior" term
q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
p(x|y)
Z
= − q(x) log dx
q(x)
47
The "posterior" term
q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
p(x|y)
Z
= − q(x) log dx
q(x)
Z
≥ −log p(x|y)dx = −log1 = 0
47
The "posterior" term
q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
48
The "posterior" term
q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
= {Lets assume that q(x) = p(x|y)}
48
The "posterior" term
q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
= {Lets assume that q(x) = p(x|y)}
p(x|y)
Z
= p(x|y) log dx
p(x|y)
| {z }
=1
48
The "posterior" term
q(x)
Z
KL(q(x)||p(x|y)) = q(x) log dx
p(x|y)
= {Lets assume that q(x) = p(x|y)}
p(x|y)
Z
= p(x|y) log dx
p(x|y)
| {z }
=1
=0
48
Kullback-Leibler Divergence

KL(q(x)||p(x|y)) = ∫ q(x) log [q(x)/p(x|y)] dx
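For two one-dimensional Gaussians the KL divergence has a standard closed form, which makes the two properties used here easy to verify numerically (the function name and the numbers are mine, for illustration): it is zero exactly when the distributions coincide, positive otherwise, and not symmetric in its arguments.

```python
import numpy as np

def kl_gauss(m0, v0, m1, v1):
    """Closed-form KL( N(m0, v0) || N(m1, v1) ) for 1-D Gaussians."""
    return 0.5 * (np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

assert kl_gauss(0.0, 1.0, 0.0, 1.0) == 0.0   # q = p  =>  KL = 0
assert kl_gauss(1.0, 2.0, -1.0, 0.5) > 0.0   # otherwise strictly positive
# KL is not symmetric: a divergence, not a distance
assert kl_gauss(1.0, 2.0, -1.0, 0.5) != kl_gauss(-1.0, 0.5, 1.0, 2.0)
```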
The "other terms"
1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
50
The "other terms"
1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
50
The "other terms"
1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
50
The "other terms"
1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y)
Z
= p(x|y)log dx
p(x|y)
50
The "other terms"
1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)
50
The "other terms"
1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)
p(x|y)
Z Z
= p(x|y)log dx + p(x|y)logp(y)dx
p(x|y)
| {z }
=1
50
The "other terms"
1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)
p(x|y)
Z Z
= p(x|y)log dx + p(x|y)logp(y)dx
p(x|y)
| {z }
=1
Z
= p(x|y)dx logp(y)
| {z }
=1
50
The "other terms"
1
Z Z
q(x)log dx + q(x)log p(x, y)dx =
q(x)
p(x, y)
Z
= q(x)log dx
q(x)
= {Lets assume that q(x) = p(x|y)}
p(x, y) p(x|y)p(y)
Z Z
= p(x|y)log dx = p(x|y)log dx
p(x|y) p(x|y)
p(x|y)
Z Z
= p(x|y)log dx + p(x|y)logp(y)dx
p(x|y)
| {z }
=1
Z
= p(x|y)dx logp(y) = logp(y)
| {z }
=1
50
Variational Bayes

log p(y) = ∫ q(x) log [1/q(x)] dx + ∫ q(x) log p(x, y) dx + ∫ q(x) log [q(x)/p(x|y)] dx
         ≥ − ∫ q(x) log q(x) dx + ∫ q(x) log p(x, y) dx
ELBO

log p(y) ≥ − ∫ q(x) log q(x) dx + ∫ q(x) log p(x, y) dx
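The bound can be watched in action on a toy conjugate model (this model and the quadrature are my own sketch, not from the slides): with prior x ∼ N(0, 1) and likelihood y|x ∼ N(x, 1), the evidence is N(y | 0, 2) and the exact posterior is N(y/2, 1/2). The ELBO sits strictly below log p(y) for an arbitrary Gaussian q, and is tight when q equals the posterior.

```python
import numpy as np

y = 1.3                                    # one observed datum (illustrative)

def logpdf(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

# Exact evidence: with x ~ N(0,1) and y|x ~ N(x,1), y ~ N(0, 2)
log_py = logpdf(y, 0.0, 2.0)

def elbo(m, s):
    """ELBO for q(x) = N(m, s), computed by quadrature on a wide grid."""
    grid = np.linspace(m - 12 * np.sqrt(s), m + 12 * np.sqrt(s), 8001)
    dx = grid[1] - grid[0]
    log_q = logpdf(grid, m, s)
    log_joint = logpdf(grid, 0.0, 1.0) + logpdf(y, grid, 1.0)
    return np.sum(np.exp(log_q) * (log_joint - log_q)) * dx

loose = elbo(0.0, 1.0)       # an arbitrary q: strictly below log p(y)
tight = elbo(y / 2, 0.5)     # q = exact posterior: the bound is tight
assert loose < log_py
assert abs(tight - log_py) < 1e-5
```

The gap between `log_py` and `loose` is exactly KL(q || p(x|y)), which is why maximising the ELBO over q drives q toward the posterior.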
Why is this useful?

– Ryan Adams²
² Talking Machines Podcast
How to choose Q?
Lower Bound³

L = ∫ q(x) log [p(y, f, x)/q(x)] dx
  = ∫ q(x) log [p(y | f) p(f | x) p(x)/q(x)] dx
  = ∫ q(x) log p(y | f) p(f | x) dx − ∫ q(x) log [q(x)/p(x)] dx
  = L̃ − KL(q(x) ‖ p(x))

³ Damianou, 2015
Lower Bound⁴

L̃ = ∫ q(x) log p(y|f) p(f|x) df dx

⁴ Candela et al., 2005
Lower Bound⁵

p(f, u | x, z)

⁵ Titsias et al., 2010
Lower Bound

q(u) ≈ p(u | y, x, z, f)
q(f) ≈ p(f | u, x, z, y)
q(x) ≈ p(x | y)
Lower Bound

L̃ = ∫ q(f) q(u) q(x) log [p(y, f, u | x, z)/(q(f) q(u))] df du dx
  = ∫ q(f) q(u) q(x) log [p(y | f) p(f | u, x, z) p(u | z)/(q(f) q(u))] df du dx

q(f) = p(f | u, x, z)
Latent Space Priors

p(x) ∼ N(0, I)
Automatic Relevance Determination

k(x_i, x_j) = σ e^{− Σ_{d=1}^{D} α_d (x_{i,d} − x_{j,d})²}

In GPy: RBF(..., ARD=True), Matern32(..., ARD=True)
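A minimal NumPy sketch of the ARD kernel above (the function name and numbers are mine; GPy's kernels expose the same behaviour via `ARD=True`): each input dimension gets its own weight α_d, and α_d = 0 switches that dimension off entirely.

```python
import numpy as np

def ard_kernel(X, Z, sigma, alpha):
    """k(x, z) = sigma * exp(-sum_d alpha_d * (x_d - z_d)^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2 * alpha).sum(axis=-1)
    return sigma * np.exp(-d2)

X = np.random.default_rng(0).normal(size=(5, 3))
alpha = np.array([1.0, 0.1, 0.0])         # third input dimension switched off

K_full = ard_kernel(X, X, 1.0, alpha)
K_drop = ard_kernel(X[:, :2], X[:, :2], 1.0, alpha[:2])
assert np.allclose(K_full, K_drop)        # alpha_d = 0 => dimension d is irrelevant
```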
Dynamic Prior

p(x | t) = N(µ_t, K_t)
Structured Latent Spaces

Explaining Away

y = f(x) + ε
y − ε = f(x)
Factor Analysis

y = f(x1, x2, x3) + ε
Alignments

[Figure: latent points labelled {duck} and {cat}, separated into Relevant, Irrelevant, and Ambiguous dimensions.]
IBFA with GP-LVM⁶

y1 = f(w1ᵀ x)    y2 = f(w2ᵀ x)

⁶ Damianou et al., 2016
GP-DP⁷

⁷ Lawrence et al., 2019
Constrained Latent Space⁸

y = f(g(y)) + ε

⁸ Lawrence et al., 2006
Geometry
Latent GP-Regression⁹

p(Y|X) = ∫ p(Y|F) p(F|X, X^(C)) p(X^(C)) dF dX^(C)

⁹ Bodin et al., 2017; Yousefi et al., 2016
Discrete

Continuous
Composite Gaussian Processes¹⁰

x2 → f2 → x1 → f1 → y

¹⁰ Damianou et al., 2013
Composite Functions

When do I want Composite Functions?

y = f_k ∘ f_{k−1} ∘ · · · ∘ f_1(x)
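Such a composition can be sampled by feeding each layer's GP draw in as the next layer's input (a sketch of my own; the squared-exponential kernel, depth, and grid are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_gp(x):
    """One function draw at inputs x from a zero-mean squared-exponential GP."""
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * d2) + 1e-6 * np.eye(len(x))   # jitter for stable Cholesky
    return np.linalg.cholesky(K) @ rng.normal(size=len(x))

x = np.linspace(-2.0, 2.0, 100)
h = x
for _ in range(3):                 # y = f3 ∘ f2 ∘ f1 (x), each layer a GP draw
    h = sample_gp(h)
assert h.shape == (100,)
```

Even with smooth layers, the composed function can be far less smooth than any single GP draw, which is the point of stacking.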
Because we lack "models"?
Composite Functions
Sampling
Change of Variables
"I’m Bayesian therefore I am superior"

Because we want to hang out with the cool kids

MacKay plot
Composite Functions
The Final Composition

Remember why we did this in the first place

These damn plots

It gets worse

It gets even worse
Approximate Inference

• Sufficient statistics
• Mean-Field: q(U) = ∏_{i=1}^{L} q(U_i)
The effect

What have we lost
What we really want¹¹

¹¹ Ustyuzhaninov et al., 2019
Summary

¹³ I would argue that there is no such thing
Summary II
eof
References