
When a neural net meets a Gaussian process

Guofei Pang
05/11/18
Crunch seminar
Neural net (fully connected)

[Figure: fully connected network with an input layer, a first hidden layer, ..., a last hidden layer, and an output layer. Neuron $i$ in layer $l$ carries a pre-activation $x_i^l$ and an activation $y_i^l$; layer $l$ is connected to layer $l+1$ through weights $w_{i,j}^{l+1,l}$ and biases $b_i^{l+1}$.]

$$y_i^0 = x_i^0, \qquad i = 1, 2, \dots, N_0$$

$$x_i^1 = \sum_{j=1}^{N_0} w_{i,j}^{1,0}\, y_j^0 + b_i^1, \qquad y_i^1 = \phi(x_i^1), \qquad i = 1, 2, \dots, N_1$$

$$x_i^{L+1} = \sum_{j=1}^{N_L} w_{i,j}^{L+1,L}\, y_j^L + b_i^{L+1}, \qquad y_i^{L+1} = \phi(x_i^{L+1}), \qquad i = 1, 2, \dots, N_{L+1}$$
Neural net (fully connected)

[Same network figure as on the previous slide.]

$$y_i^0 = x_i^0, \qquad x_i^{l+1} = \sum_{j=1}^{N_l} w_{i,j}^{l+1,l}\, y_j^l + b_i^{l+1}, \qquad y_i^{l+1} = \phi(x_i^{l+1}),$$
$$i = 1, 2, \dots, N_{l+1}, \qquad l = 0, 1, \dots, L$$

Each output component is therefore a function of the inputs and the parameters:
$$y_1^{L+1} = f_1\!\left(x_1^0, x_2^0, \dots, x_{N_0}^0;\ \{\mathbf{W}^{l+1,l}\}, \{\mathbf{b}^{l+1}\}\right)$$
$$y_2^{L+1} = f_2\!\left(x_1^0, x_2^0, \dots, x_{N_0}^0;\ \{\mathbf{W}^{l+1,l}\}, \{\mathbf{b}^{l+1}\}\right)$$
$$\cdots$$
$$y_{N_{L+1}}^{L+1} = f_{N_{L+1}}\!\left(x_1^0, x_2^0, \dots, x_{N_0}^0;\ \{\mathbf{W}^{l+1,l}\}, \{\mathbf{b}^{l+1}\}\right)$$
Neural net (fully connected)

[Same network figure, now written in vector form:]
$$\mathbf{x}^0 \xrightarrow{\ I\ } \mathbf{y}^0 \xrightarrow{\ \mathbf{W}^{1,0},\,\mathbf{b}^1\ } \mathbf{x}^1 \xrightarrow{\ \phi\ } \mathbf{y}^1 \longrightarrow \cdots \longrightarrow \mathbf{x}^{L+1} \xrightarrow{\ \phi\ } \mathbf{y}^{L+1}$$

$$\mathbf{y}^0 = \mathbf{x}^0, \qquad \mathbf{x}^{l+1} = \mathbf{W}^{l+1,l}\,\mathbf{y}^l + \mathbf{b}^{l+1}, \qquad \mathbf{y}^{l+1} = \phi(\mathbf{x}^{l+1}), \qquad l = 0, 1, \dots, L$$

$$\mathbf{y} = \mathbf{y}^{L+1} = \mathbf{f}_{NN}\!\left(\mathbf{x} = \mathbf{x}^0;\ \mathbf{W} = \{\mathbf{W}^{l+1,l}\},\ \mathbf{b} = \{\mathbf{b}^{l+1}\}\right)$$
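To make the vector-form recursion concrete, here is a minimal NumPy sketch of the forward pass. The layer widths, the tanh nonlinearity, and the random parameter values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def forward(x0, Ws, bs, phi=np.tanh):
    """Fully connected forward pass:
    y^0 = x^0,  x^{l+1} = W^{l+1,l} y^l + b^{l+1},  y^{l+1} = phi(x^{l+1})."""
    y = x0
    for W, b in zip(Ws, bs):
        x = W @ y + b          # pre-activation x^{l+1}
        y = phi(x)             # activation y^{l+1}
    return y                   # y^{L+1} = f_NN(x^0; W, b)

# Illustrative sizes: N0 = 3 inputs, two hidden layers of width 50, one output
rng = np.random.default_rng(0)
widths = [3, 50, 50, 1]
Ws = [rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)   # 1/sqrt(N_l) scaling, mirroring the priors used later
      for n_in, n_out in zip(widths[:-1], widths[1:])]
bs = [rng.standard_normal(n_out) for n_out in widths[1:]]

x0 = np.array([0.1, -0.2, 0.3])
print(forward(x0, Ws, bs))     # one output vector y^{L+1}
```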
Gaussian process
Multivariate normal distribution:
$$\mathbf{Y} = [Y_1, Y_2, \dots, Y_N]^T, \qquad \mathbf{m} = \mathrm{E}(\mathbf{Y}) = [\mathrm{E}(Y_1), \mathrm{E}(Y_2), \dots, \mathrm{E}(Y_N)]^T, \qquad \mathbf{K} = [K_{ij}] = [\mathrm{cov}(Y_i, Y_j)]$$

$$p(\mathbf{Y}) = \frac{1}{|2\pi \mathbf{K}|^{1/2}} \exp\!\left\{-\frac{1}{2}(\mathbf{Y}-\mathbf{m})^T \mathbf{K}^{-1} (\mathbf{Y}-\mathbf{m})\right\}$$

• The index set of the random vector $\mathbf{Y}$ is a discrete set, e.g. $\{1, 2, \dots, N\}$.
• What if the index set is continuous, say the unit interval $[0, 1]$?
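As a quick check of the density formula above, the following sketch evaluates $p(\mathbf{Y})$ for a small, made-up zero-mean example, once with SciPy and once directly from the formula (the dimension and covariance entries are arbitrary).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 3-dimensional example: zero mean and an arbitrary covariance matrix
m = np.zeros(3)
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
Y = np.array([0.3, -0.1, 0.4])

print(multivariate_normal(mean=m, cov=K).pdf(Y))

# The same value from p(Y) = |2 pi K|^{-1/2} exp{-(1/2)(Y-m)^T K^{-1} (Y-m)}
quad = (Y - m) @ np.linalg.solve(K, Y - m)
print(np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * K)))
```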
Y  ℕ(m, K) Yx := fGP(x)  GP(m(x), k(x,x’))

Discrete index i  {1,2,…, 10} Continuous index x  [0, 1]


Zero mean m = 0 Zero mean function m(x) = 0
Covariance matrix K Covariance function k(x, x’) is
is Kij = cov (Yi , Yj) = exp{- (i –j)2 * 10} cov (Yx , Yx’) = exp{- (x– x’)2 * 10}

6/18
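A short sketch contrasting the two cases: sampling the 10-dimensional Gaussian vector, and sampling the GP on a fine grid of [0, 1], both with the squared-exponential covariance from this slide. The grid size and the jitter term are implementation choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def k(a, b):
    """Covariance from the slide: exp{-(a - b)^2 * 10}."""
    return np.exp(-np.subtract.outer(a, b) ** 2 * 10)

# Discrete index set {1, ..., 10}: an ordinary 10-dimensional multivariate normal
i = np.arange(1, 11)
Y_discrete = rng.multivariate_normal(mean=np.zeros(10), cov=k(i, i))

# Continuous index set [0, 1]: look at the GP through a fine grid of index points
x = np.linspace(0, 1, 200)
K = k(x, x) + 1e-10 * np.eye(x.size)          # small jitter for numerical stability
Y_continuous = rng.multivariate_normal(mean=np.zeros(x.size), cov=K)

print(Y_discrete.shape, Y_continuous.shape)    # (10,) (200,)
```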
Neural net (fully connected) & Gaussian process

Deterministic function $f_{NN}$: $\quad \mathbf{y} = \mathbf{f}_{NN}(\mathbf{x};\ \mathbf{W}, \mathbf{b})$

Stochastic function $f_{GP}$: $\quad y = f_{GP}(\mathbf{x};\ m, k) \sim GP(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$
Neural net (fully connected) & Gaussian process

$$\mathbf{y} = \mathbf{f}_{NN}(\mathbf{x};\ \mathbf{W}, \mathbf{b}) \sim GP\!\left(0,\ k_{\sigma_w^2,\sigma_b^2}(\mathbf{x}, \mathbf{x}')\right)$$

provided the parameters are drawn i.i.d. as
$$\mathbf{W}^{l+1,l} \sim D\!\left(\mathbf{0},\ \frac{\sigma_w^2}{N_l}\,\mathbf{I}\right), \qquad \mathbf{b}^{l+1} \sim D\!\left(\mathbf{0},\ \sigma_b^2\,\mathbf{I}\right),$$
and the hidden-layer widths satisfy $N_l \to \infty$.

This is exactly a Gaussian-process model $y = f_{GP}(\mathbf{x};\ m, k) \sim GP(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$ with zero mean function and an NN-induced covariance function.
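A quick empirical illustration of this statement, as a sketch only: Gaussian draws are used as a concrete instance of the distribution $D$, the nonlinearity is tanh, and $\sigma_w = \sigma_b = 1$, $N_1 = 2000$ are arbitrary choices. Many independent wide networks are sampled and the joint statistics of the scalar output at two fixed inputs are inspected; the empirical mean is close to zero and the covariance settles to a value that depends only on the inputs and $(\sigma_w^2, \sigma_b^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma_w, sigma_b = 1.0, 1.0
N0, N1 = 2, 2000              # input dimension and a large hidden-layer width

def sample_outputs(x, xp):
    """Draw one random single-hidden-layer net (Gaussian priors as an instance of D)
    and evaluate its scalar output at both x and x'."""
    W1 = rng.normal(0.0, sigma_w / np.sqrt(N0), size=(N1, N0))
    b1 = rng.normal(0.0, sigma_b, size=N1)
    W2 = rng.normal(0.0, sigma_w / np.sqrt(N1), size=(1, N1))
    b2 = rng.normal(0.0, sigma_b, size=1)

    def net(z):
        return (W2 @ np.tanh(W1 @ z + b1) + b2)[0]

    return net(x), net(xp)

x, xp = np.array([0.3, -0.5]), np.array([0.8, 0.1])
samples = np.array([sample_outputs(x, xp) for _ in range(3000)])
print("empirical mean      :", samples.mean(axis=0))   # close to [0, 0]
print("empirical covariance:\n", np.cov(samples.T))    # approaches the NN-induced kernel as N1 grows
```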
Single hidden layer ($L = 1$): evaluate the same random network at two inputs $\mathbf{x}$ and $\mathbf{x}'$.

[Figure: two copies of the single-hidden-layer network, one fed $\mathbf{x}$ and one fed $\mathbf{x}'$:
$\mathbf{x} \xrightarrow{\ I\ } \mathbf{y}^0 \xrightarrow{\ \mathbf{W}^{1,0},\,\mathbf{b}^1\ } \mathbf{x}^1 \xrightarrow{\ \phi\ } \mathbf{y}^1 \xrightarrow{\ \mathbf{W}^{2,1},\,\mathbf{b}^2\ } \mathbf{x}^2 \xrightarrow{\ I\ } \mathbf{y}^2$.]

$$y_i^0 = x_i^0, \qquad x_i^1 = \sum_{j=1}^{N_0} w_{i,j}^{1,0}\, y_j^0 + b_i^1, \qquad y_i^1 = \phi(x_i^1), \qquad x_i^2 = \sum_{j=1}^{N_1} w_{i,j}^{2,1}\, y_j^1 + b_i^2, \qquad y_i^2 = x_i^2$$

Then $x_i^1(\mathbf{x}) \sim GP(0, k^0(\mathbf{x}, \mathbf{x}'))$ and $x_i^2(\mathbf{x}) \sim GP(0, k^1(\mathbf{x}, \mathbf{x}'))$, with

$$k^0(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}\!\left(x_i^1(\mathbf{x}), x_i^1(\mathbf{x}')\right) = \sigma_b^2 + \frac{\sigma_w^2}{N_0}\,\mathbf{x}\cdot\mathbf{x}' \qquad \text{(regardless of the subscript } i\text{)}$$

$$k^1(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}\!\left(x_i^2(\mathbf{x}), x_i^2(\mathbf{x}')\right) = \sigma_b^2 + \sigma_w^2\, F\!\left(k^0(\mathbf{x}, \mathbf{x}'),\ k^0(\mathbf{x}, \mathbf{x}),\ k^0(\mathbf{x}', \mathbf{x}')\right)$$
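A numerical sanity check of the formula for $k^0$, as a sketch: Gaussian draws stand in for the generic distribution $D$, and $\sigma_w$, $\sigma_b$, and the two test points are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_w, sigma_b = 1.5, 0.5
N0 = 3
x  = np.array([ 0.2, -0.4, 0.7])
xp = np.array([-0.1,  0.5, 0.3])

# Monte Carlo over one row of W^{1,0} and one entry of b^1
M = 200_000
W = rng.normal(0.0, sigma_w / np.sqrt(N0), size=(M, N0))   # M independent draws of the row
b = rng.normal(0.0, sigma_b, size=M)
s_x, s_xp = W @ x + b, W @ xp + b                          # x_i^1(x) and x_i^1(x') for each draw

print("empirical  Cov(x_i^1(x), x_i^1(x'))        :", np.cov(s_x, s_xp)[0, 1])
print("analytical sigma_b^2 + sigma_w^2/N0 * x.x' :",
      sigma_b**2 + sigma_w**2 / N0 * (x @ xp))
```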
Neural net (fully connected) & Gaussian process

$$\mathbf{y} = \mathbf{f}_{NN}(\mathbf{x};\ \mathbf{W}, \mathbf{b}) \sim GP\!\left(0,\ k_{\sigma_w^2,\sigma_b^2}(\mathbf{x}, \mathbf{x}')\right)$$

with the same i.i.d. priors as before,
$$\mathbf{W}^{l+1,l} \sim D\!\left(\mathbf{0},\ \frac{\sigma_w^2}{N_l}\,\mathbf{I}\right), \qquad \mathbf{b}^{l+1} \sim D\!\left(\mathbf{0},\ \sigma_b^2\,\mathbf{I}\right), \qquad N_l \to \infty \ \text{for the hidden layers}.$$

The covariance function is built layer by layer:
$$k_{\sigma_w^2,\sigma_b^2}(\mathbf{x}, \mathbf{x}') = k^L(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}\!\left(x_i^{L+1}(\mathbf{x}),\ x_i^{L+1}(\mathbf{x}')\right)$$
$$k^l(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_w^2\, F\!\left(k^{l-1}(\mathbf{x}, \mathbf{x}'),\ k^{l-1}(\mathbf{x}, \mathbf{x}),\ k^{l-1}(\mathbf{x}', \mathbf{x}')\right), \qquad l = 1, 2, \dots, L$$
$$k^0(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}\!\left(x_i^1(\mathbf{x}),\ x_i^1(\mathbf{x}')\right) = \sigma_b^2 + \frac{\sigma_w^2}{N_0}\,\mathbf{x}\cdot\mathbf{x}'$$
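A minimal implementation sketch of this recursion. The slides leave the map $F$ generic (it depends on the nonlinearity $\phi$); here, purely for illustration, $F$ is taken to be the closed form for the ReLU nonlinearity (the arc-cosine kernel used by Cho & Saul, 2009, and Lee et al., 2018). The depth and the values of $\sigma_w^2$, $\sigma_b^2$ are arbitrary.

```python
import numpy as np

def F_relu(k_xy, k_xx, k_yy):
    """Closed-form F for phi = ReLU:
    F = sqrt(k_xx*k_yy) * (sin(theta) + (pi - theta)*cos(theta)) / (2*pi),
    with theta the angle implied by the previous-layer covariances."""
    cos_theta = np.clip(k_xy / np.sqrt(k_xx * k_yy), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return np.sqrt(k_xx * k_yy) * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)

def nngp_kernel(x, xp, L, sigma_w2, sigma_b2, F=F_relu):
    """k^0 = sigma_b^2 + sigma_w^2/N0 * x.x';
    k^l = sigma_b^2 + sigma_w^2 * F(k^{l-1}(x,x'), k^{l-1}(x,x), k^{l-1}(x',x'))."""
    N0 = x.size
    k_xy = sigma_b2 + sigma_w2 / N0 * (x @ xp)
    k_xx = sigma_b2 + sigma_w2 / N0 * (x @ x)
    k_yy = sigma_b2 + sigma_w2 / N0 * (xp @ xp)
    for _ in range(L):                     # one update per hidden layer
        k_xy, k_xx, k_yy = (sigma_b2 + sigma_w2 * F(k_xy, k_xx, k_yy),
                            sigma_b2 + sigma_w2 * F(k_xx, k_xx, k_xx),
                            sigma_b2 + sigma_w2 * F(k_yy, k_yy, k_yy))
    return k_xy                            # k^L(x, x') = Cov(x_i^{L+1}(x), x_i^{L+1}(x'))

x  = np.array([0.2, -0.4, 0.7])
xp = np.array([-0.1, 0.5, 0.3])
print(nngp_kernel(x, xp, L=3, sigma_w2=1.6, sigma_b2=0.1))
```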
NNGP: Gaussian process with NN-induced covariance function

• Discovering the equivalence between infinitely wide NNs and GPs
  (R. M. Neal, Bayesian Learning for Neural Networks, 1994)
• Deriving analytical GP kernels for single-hidden-layer NNs with specific nonlinearities (the error function or a Gaussian nonlinearity)
  (C. K. I. Williams, Computing with Infinite Networks, 1997)
• Deriving the GP kernel for multi-hidden-layer NNs with general nonlinearities, based on signal propagation theory (Cho & Saul, 2009) (NNGP)
  (J. Lee et al., Deep Neural Networks as Gaussian Processes, 2018)
NNGP: Gaussian process with NN-induced covariance function

|              | NN         | GP           | Bayesian NN | NNGP         |
|--------------|------------|--------------|-------------|--------------|
| Expressivity | High       | Intermediate | High        | High         |
| Uncertainty  | No         | Yes          | Yes         | Yes          |
| Cost         | Low        | Intermediate | High        | Intermediate |
| Accuracy     | It depends | It depends   | It depends  | It depends   |
Expressivity
• Deep neural networks (DNNs) can compactly express highly complex functions over the input space in a way that shallow networks with one hidden layer and the same number of neurons cannot. (Regression)
• Deep neural networks can disentangle highly curved manifolds in input space into flattened manifolds in hidden space. (Classification)
  (Poole B., Lahiri S., Raghu M., Sohl-Dickstein J., Ganguli S. Exponential Expressivity in Deep Neural Networks through Transient Chaos. Advances in Neural Information Processing Systems, 2016, pp. 3360-3368.)
• A GP model is essentially a kernel learning machine whose performance relies on the choice of the covariance kernel.
• NNGP enjoys the high expressivity of DNNs.
Uncertainty quantification / surrogate accuracy

Accuracy (Regression problem)

Accuracy (Classification problem, J Lee et al, 2018)

Computational cost

• NN: the loss function is generally non-convex, but stochastic gradient descent optimization works well in practice. Training cost is typically $O(n)$, where $n$ is the number of neurons.

• GP: the loss function (the negative log marginal likelihood) is non-convex, so gradient-based optimizers need multiple starting points. Inverting the covariance matrix costs $O(N^3)$, where $N$ is the number of training data points (see the Cholesky-based sketch below).

• Bayesian NN: exact Bayesian inference is NP-hard, and approximate Bayesian inference can also be NP-hard. The cost of MCMC for approximating the posterior distribution of the network parameters can be $O(m^2)$, where $m$ is the number of parameters.

• NNGP: same cost as a GP.
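To make the $O(N^3)$ term concrete, here is a sketch of standard GP regression with a Cholesky factorization of the $N \times N$ covariance matrix. The regression pipeline, the synthetic data, and the use of the squared-exponential kernel from the earlier slides are all assumptions for illustration; for NNGP one would plug in the NN-induced kernel instead.

```python
import numpy as np

def k(a, b):
    """Squared-exponential covariance from the GP slides; for NNGP, replace this
    with the NN-induced kernel k_{sigma_w^2, sigma_b^2}(x, x')."""
    return np.exp(-np.subtract.outer(a, b) ** 2 * 10)

# Synthetic 1-D training data (illustrative only)
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * X) + 0.05 * rng.standard_normal(X.size)
Xs = np.linspace(0, 1, 100)                     # test inputs
noise = 1e-4

# Training cost is dominated by factorizing the N x N matrix: O(N^3)
K = k(X, X) + noise * np.eye(X.size)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

# Posterior mean and variance at the test inputs
Ks = k(Xs, X)
mean = Ks @ alpha
v = np.linalg.solve(L, Ks.T)
var = k(Xs, Xs).diagonal() - np.sum(v ** 2, axis=0)
print(mean[:5], var[:5])
```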
Future work – to improve, apply, and generalize NNGP

• Training an NNGP
• NNGP for solving PDEs (forward and inverse problems)
• Developing NNGP for other types of NNs, e.g., CNNs, RNNs, GANs
• ...
