
When a neural net meets a Gaussian process

Guofei Pang
05/11/18
Crunch seminar
Neural net (fully connected)

[Figure: fully connected network with an input layer, a first hidden layer, ..., a last hidden layer, and an output layer. Neuron $i$ in layer $l$ carries a pre-activation $x_i^l$ and an activation $y_i^l$; layer $l$ is connected to layer $l+1$ through weights $w_{i,j}^{l+1,l}$ and biases $b_i^{l+1}$.]

$$y_i^0 = x_i^0, \qquad i = 1, 2, \dots, N_0$$

$$x_i^1 = \sum_{j=1}^{N_0} w_{i,j}^{1,0}\, y_j^0 + b_i^1, \qquad y_i^1 = \phi(x_i^1), \qquad i = 1, 2, \dots, N_1$$

$$x_i^{L+1} = \sum_{j=1}^{N_L} w_{i,j}^{L+1,L}\, y_j^L + b_i^{L+1}, \qquad y_i^{L+1} = \phi(x_i^{L+1}), \qquad i = 1, 2, \dots, N_{L+1}$$
Neural net (fully connected)

[Same network figure as on the previous slide.]

$$y_i^0 = x_i^0, \qquad x_i^{l+1} = \sum_{j=1}^{N_l} w_{i,j}^{l+1,l}\, y_j^l + b_i^{l+1}, \qquad y_i^{l+1} = \phi(x_i^{l+1}),$$
$$i = 1, 2, \dots, N_{l+1}, \qquad l = 0, 1, \dots, L$$

Each output component is therefore a function of the inputs and the parameters:
$$y_1^{L+1} = f_1\!\left(x_1^0, x_2^0, \dots, x_{N_0}^0;\ \{\mathbf{W}^{l+1,l}\}, \{\mathbf{b}^{l+1}\}\right)$$
$$y_2^{L+1} = f_2\!\left(x_1^0, x_2^0, \dots, x_{N_0}^0;\ \{\mathbf{W}^{l+1,l}\}, \{\mathbf{b}^{l+1}\}\right)$$
$$\cdots$$
$$y_{N_{L+1}}^{L+1} = f_{N_{L+1}}\!\left(x_1^0, x_2^0, \dots, x_{N_0}^0;\ \{\mathbf{W}^{l+1,l}\}, \{\mathbf{b}^{l+1}\}\right)$$
Neural net (fully connected)

[Same network figure, now written in vector form:]
$$\mathbf{x}^0 \xrightarrow{\ I\ } \mathbf{y}^0 \xrightarrow{\ \mathbf{W}^{1,0},\,\mathbf{b}^1\ } \mathbf{x}^1 \xrightarrow{\ \phi\ } \mathbf{y}^1 \longrightarrow \cdots \longrightarrow \mathbf{x}^{L+1} \xrightarrow{\ \phi\ } \mathbf{y}^{L+1}$$

$$\mathbf{y}^0 = \mathbf{x}^0, \qquad \mathbf{x}^{l+1} = \mathbf{W}^{l+1,l}\,\mathbf{y}^l + \mathbf{b}^{l+1}, \qquad \mathbf{y}^{l+1} = \phi(\mathbf{x}^{l+1}), \qquad l = 0, 1, \dots, L$$

$$\mathbf{y} = \mathbf{y}^{L+1} = \mathbf{f}_{NN}\!\left(\mathbf{x} = \mathbf{x}^0;\ \mathbf{W} = \{\mathbf{W}^{l+1,l}\},\ \mathbf{b} = \{\mathbf{b}^{l+1}\}\right)$$
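To make the vector-form recursion concrete, here is a minimal NumPy sketch of the forward pass. The layer widths, the tanh nonlinearity, and the random parameter values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def forward(x0, Ws, bs, phi=np.tanh):
    """Fully connected forward pass:
    y^0 = x^0,  x^{l+1} = W^{l+1,l} y^l + b^{l+1},  y^{l+1} = phi(x^{l+1})."""
    y = x0
    for W, b in zip(Ws, bs):
        x = W @ y + b          # pre-activation x^{l+1}
        y = phi(x)             # activation y^{l+1}
    return y                   # y^{L+1} = f_NN(x^0; W, b)

# Illustrative sizes: N0 = 3 inputs, two hidden layers of width 50, one output
rng = np.random.default_rng(0)
widths = [3, 50, 50, 1]
Ws = [rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)   # 1/sqrt(N_l) scaling, mirroring the priors used later
      for n_in, n_out in zip(widths[:-1], widths[1:])]
bs = [rng.standard_normal(n_out) for n_out in widths[1:]]

x0 = np.array([0.1, -0.2, 0.3])
print(forward(x0, Ws, bs))     # one output vector y^{L+1}
```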
Gaussian process
Multivariate normal distribution:
$$\mathbf{Y} = [Y_1, Y_2, \dots, Y_N]^T, \qquad \mathbf{m} = \mathrm{E}(\mathbf{Y}) = [\mathrm{E}(Y_1), \mathrm{E}(Y_2), \dots, \mathrm{E}(Y_N)]^T, \qquad \mathbf{K} = [K_{ij}] = [\mathrm{cov}(Y_i, Y_j)]$$

$$p(\mathbf{Y}) = \frac{1}{|2\pi \mathbf{K}|^{1/2}} \exp\!\left\{-\frac{1}{2}(\mathbf{Y}-\mathbf{m})^T \mathbf{K}^{-1} (\mathbf{Y}-\mathbf{m})\right\}$$

• The index set of the random vector $\mathbf{Y}$ is a discrete set, e.g. $\{1, 2, \dots, N\}$.
• What if the index set is continuous, say the unit interval $[0, 1]$?
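As a quick check of the density formula above, the following sketch evaluates $p(\mathbf{Y})$ for a small, made-up zero-mean example, once with SciPy and once directly from the formula (the dimension and covariance entries are arbitrary).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 3-dimensional example: zero mean and an arbitrary covariance matrix
m = np.zeros(3)
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
Y = np.array([0.3, -0.1, 0.4])

print(multivariate_normal(mean=m, cov=K).pdf(Y))

# The same value from p(Y) = |2 pi K|^{-1/2} exp{-(1/2)(Y-m)^T K^{-1} (Y-m)}
quad = (Y - m) @ np.linalg.solve(K, Y - m)
print(np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * K)))
```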
Y  ℕ(m, K) Yx := fGP(x)  GP(m(x), k(x,x’))

Discrete index i  {1,2,…, 10} Continuous index x  [0, 1]


Zero mean m = 0 Zero mean function m(x) = 0
Covariance matrix K Covariance function k(x, x’) is
is Kij = cov (Yi , Yj) = exp{- (i –j)2 * 10} cov (Yx , Yx’) = exp{- (x– x’)2 * 10}

6/18
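A short sketch contrasting the two cases: sampling the 10-dimensional Gaussian vector, and sampling the GP on a fine grid of [0, 1], both with the squared-exponential covariance from this slide. The grid size and the jitter term are implementation choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def k(a, b):
    """Covariance from the slide: exp{-(a - b)^2 * 10}."""
    return np.exp(-np.subtract.outer(a, b) ** 2 * 10)

# Discrete index set {1, ..., 10}: an ordinary 10-dimensional multivariate normal
i = np.arange(1, 11)
Y_discrete = rng.multivariate_normal(mean=np.zeros(10), cov=k(i, i))

# Continuous index set [0, 1]: look at the GP through a fine grid of index points
x = np.linspace(0, 1, 200)
K = k(x, x) + 1e-10 * np.eye(x.size)          # small jitter for numerical stability
Y_continuous = rng.multivariate_normal(mean=np.zeros(x.size), cov=K)

print(Y_discrete.shape, Y_continuous.shape)    # (10,) (200,)
```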
Neural net (fully connected) & Gaussian process

Deterministic function $f_{NN}$: $\quad \mathbf{y} = \mathbf{f}_{NN}(\mathbf{x};\ \mathbf{W}, \mathbf{b})$

Stochastic function $f_{GP}$: $\quad y = f_{GP}(\mathbf{x};\ m, k) \sim GP(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$
Neural net (fully connected) & Gaussian process

$$\mathbf{y} = \mathbf{f}_{NN}(\mathbf{x};\ \mathbf{W}, \mathbf{b}) \sim GP\!\left(0,\ k_{\sigma_w^2,\sigma_b^2}(\mathbf{x}, \mathbf{x}')\right)$$

provided the parameters are drawn i.i.d. as
$$\mathbf{W}^{l+1,l} \sim D\!\left(\mathbf{0},\ \frac{\sigma_w^2}{N_l}\,\mathbf{I}\right), \qquad \mathbf{b}^{l+1} \sim D\!\left(\mathbf{0},\ \sigma_b^2\,\mathbf{I}\right),$$
and the hidden-layer widths satisfy $N_l \to \infty$.

This is exactly a Gaussian-process model $y = f_{GP}(\mathbf{x};\ m, k) \sim GP(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$ with zero mean function and an NN-induced covariance function.
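A quick empirical illustration of this statement, as a sketch only: Gaussian draws are used as a concrete instance of the distribution $D$, the nonlinearity is tanh, and $\sigma_w = \sigma_b = 1$, $N_1 = 2000$ are arbitrary choices. Many independent wide networks are sampled and the joint statistics of the scalar output at two fixed inputs are inspected; the empirical mean is close to zero and the covariance settles to a value that depends only on the inputs and $(\sigma_w^2, \sigma_b^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma_w, sigma_b = 1.0, 1.0
N0, N1 = 2, 2000              # input dimension and a large hidden-layer width

def sample_outputs(x, xp):
    """Draw one random single-hidden-layer net (Gaussian priors as an instance of D)
    and evaluate its scalar output at both x and x'."""
    W1 = rng.normal(0.0, sigma_w / np.sqrt(N0), size=(N1, N0))
    b1 = rng.normal(0.0, sigma_b, size=N1)
    W2 = rng.normal(0.0, sigma_w / np.sqrt(N1), size=(1, N1))
    b2 = rng.normal(0.0, sigma_b, size=1)

    def net(z):
        return (W2 @ np.tanh(W1 @ z + b1) + b2)[0]

    return net(x), net(xp)

x, xp = np.array([0.3, -0.5]), np.array([0.8, 0.1])
samples = np.array([sample_outputs(x, xp) for _ in range(3000)])
print("empirical mean      :", samples.mean(axis=0))   # close to [0, 0]
print("empirical covariance:\n", np.cov(samples.T))    # approaches the NN-induced kernel as N1 grows
```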
Single hidden layer ($L = 1$): evaluate the same random network at two inputs $\mathbf{x}$ and $\mathbf{x}'$.

[Figure: two copies of the single-hidden-layer network, one fed $\mathbf{x}$ and one fed $\mathbf{x}'$:
$\mathbf{x} \xrightarrow{\ I\ } \mathbf{y}^0 \xrightarrow{\ \mathbf{W}^{1,0},\,\mathbf{b}^1\ } \mathbf{x}^1 \xrightarrow{\ \phi\ } \mathbf{y}^1 \xrightarrow{\ \mathbf{W}^{2,1},\,\mathbf{b}^2\ } \mathbf{x}^2 \xrightarrow{\ I\ } \mathbf{y}^2$.]

$$y_i^0 = x_i^0, \qquad x_i^1 = \sum_{j=1}^{N_0} w_{i,j}^{1,0}\, y_j^0 + b_i^1, \qquad y_i^1 = \phi(x_i^1), \qquad x_i^2 = \sum_{j=1}^{N_1} w_{i,j}^{2,1}\, y_j^1 + b_i^2, \qquad y_i^2 = x_i^2$$

Then $x_i^1(\mathbf{x}) \sim GP(0, k^0(\mathbf{x}, \mathbf{x}'))$ and $x_i^2(\mathbf{x}) \sim GP(0, k^1(\mathbf{x}, \mathbf{x}'))$, with

$$k^0(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}\!\left(x_i^1(\mathbf{x}), x_i^1(\mathbf{x}')\right) = \sigma_b^2 + \frac{\sigma_w^2}{N_0}\,\mathbf{x}\cdot\mathbf{x}' \qquad \text{(regardless of the subscript } i\text{)}$$

$$k^1(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}\!\left(x_i^2(\mathbf{x}), x_i^2(\mathbf{x}')\right) = \sigma_b^2 + \sigma_w^2\, F\!\left(k^0(\mathbf{x}, \mathbf{x}'),\ k^0(\mathbf{x}, \mathbf{x}),\ k^0(\mathbf{x}', \mathbf{x}')\right)$$
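A numerical sanity check of the formula for $k^0$, as a sketch: Gaussian draws stand in for the generic distribution $D$, and $\sigma_w$, $\sigma_b$, and the two test points are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_w, sigma_b = 1.5, 0.5
N0 = 3
x  = np.array([ 0.2, -0.4, 0.7])
xp = np.array([-0.1,  0.5, 0.3])

# Monte Carlo over one row of W^{1,0} and one entry of b^1
M = 200_000
W = rng.normal(0.0, sigma_w / np.sqrt(N0), size=(M, N0))   # M independent draws of the row
b = rng.normal(0.0, sigma_b, size=M)
s_x, s_xp = W @ x + b, W @ xp + b                          # x_i^1(x) and x_i^1(x') for each draw

print("empirical  Cov(x_i^1(x), x_i^1(x'))        :", np.cov(s_x, s_xp)[0, 1])
print("analytical sigma_b^2 + sigma_w^2/N0 * x.x' :",
      sigma_b**2 + sigma_w**2 / N0 * (x @ xp))
```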
Neural net (fully connected) & Gaussian process

$$\mathbf{y} = \mathbf{f}_{NN}(\mathbf{x};\ \mathbf{W}, \mathbf{b}) \sim GP\!\left(0,\ k_{\sigma_w^2,\sigma_b^2}(\mathbf{x}, \mathbf{x}')\right)$$

with the same i.i.d. priors as before,
$$\mathbf{W}^{l+1,l} \sim D\!\left(\mathbf{0},\ \frac{\sigma_w^2}{N_l}\,\mathbf{I}\right), \qquad \mathbf{b}^{l+1} \sim D\!\left(\mathbf{0},\ \sigma_b^2\,\mathbf{I}\right), \qquad N_l \to \infty \ \text{for the hidden layers}.$$

The covariance function is built layer by layer:
$$k_{\sigma_w^2,\sigma_b^2}(\mathbf{x}, \mathbf{x}') = k^L(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}\!\left(x_i^{L+1}(\mathbf{x}),\ x_i^{L+1}(\mathbf{x}')\right)$$
$$k^l(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_w^2\, F\!\left(k^{l-1}(\mathbf{x}, \mathbf{x}'),\ k^{l-1}(\mathbf{x}, \mathbf{x}),\ k^{l-1}(\mathbf{x}', \mathbf{x}')\right), \qquad l = 1, 2, \dots, L$$
$$k^0(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}\!\left(x_i^1(\mathbf{x}),\ x_i^1(\mathbf{x}')\right) = \sigma_b^2 + \frac{\sigma_w^2}{N_0}\,\mathbf{x}\cdot\mathbf{x}'$$
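A minimal implementation sketch of this recursion. The slides leave the map $F$ generic (it depends on the nonlinearity $\phi$); here, purely for illustration, $F$ is taken to be the closed form for the ReLU nonlinearity (the arc-cosine kernel used by Cho & Saul, 2009, and Lee et al., 2018). The depth and the values of $\sigma_w^2$, $\sigma_b^2$ are arbitrary.

```python
import numpy as np

def F_relu(k_xy, k_xx, k_yy):
    """Closed-form F for phi = ReLU:
    F = sqrt(k_xx*k_yy) * (sin(theta) + (pi - theta)*cos(theta)) / (2*pi),
    with theta the angle implied by the previous-layer covariances."""
    cos_theta = np.clip(k_xy / np.sqrt(k_xx * k_yy), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return np.sqrt(k_xx * k_yy) * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)

def nngp_kernel(x, xp, L, sigma_w2, sigma_b2, F=F_relu):
    """k^0 = sigma_b^2 + sigma_w^2/N0 * x.x';
    k^l = sigma_b^2 + sigma_w^2 * F(k^{l-1}(x,x'), k^{l-1}(x,x), k^{l-1}(x',x'))."""
    N0 = x.size
    k_xy = sigma_b2 + sigma_w2 / N0 * (x @ xp)
    k_xx = sigma_b2 + sigma_w2 / N0 * (x @ x)
    k_yy = sigma_b2 + sigma_w2 / N0 * (xp @ xp)
    for _ in range(L):                     # one update per hidden layer
        k_xy, k_xx, k_yy = (sigma_b2 + sigma_w2 * F(k_xy, k_xx, k_yy),
                            sigma_b2 + sigma_w2 * F(k_xx, k_xx, k_xx),
                            sigma_b2 + sigma_w2 * F(k_yy, k_yy, k_yy))
    return k_xy                            # k^L(x, x') = Cov(x_i^{L+1}(x), x_i^{L+1}(x'))

x  = np.array([0.2, -0.4, 0.7])
xp = np.array([-0.1, 0.5, 0.3])
print(nngp_kernel(x, xp, L=3, sigma_w2=1.6, sigma_b2=0.1))
```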
NNGP: Gaussian process with NN-induced covariance function

• Discovering the equivalence between infinitely wide NNs and GPs
  (R. M. Neal, Bayesian Learning for Neural Networks, 1994)
• Deriving analytical GP kernels for single-hidden-layer NNs with specific nonlinearities (the error function or a Gaussian nonlinearity)
  (C. K. I. Williams, Computing with Infinite Networks, 1997)
• Deriving the GP kernel for multi-hidden-layer NNs with general nonlinearities, based on signal propagation theory (Cho & Saul, 2009) (NNGP)
  (J. Lee et al., Deep Neural Networks as Gaussian Processes, 2018)
NNGP: Gaussian process with NN-induced covariance function

|              | NN         | GP           | Bayesian NN | NNGP         |
|--------------|------------|--------------|-------------|--------------|
| Expressivity | High       | Intermediate | High        | High         |
| Uncertainty  | No         | Yes          | Yes         | Yes          |
| Cost         | Low        | Intermediate | High        | Intermediate |
| Accuracy     | It depends | It depends   | It depends  | It depends   |
Expressivity
• Deep neural networks (DNNs) can compactly express highly complex functions over the input space in a way that shallow networks with one hidden layer and the same number of neurons cannot. (Regression)
• Deep neural networks can disentangle highly curved manifolds in input space into flattened manifolds in hidden space. (Classification)
  (Poole B., Lahiri S., Raghu M., Sohl-Dickstein J., Ganguli S. Exponential Expressivity in Deep Neural Networks through Transient Chaos. Advances in Neural Information Processing Systems, 2016, pp. 3360-3368.)
• A GP model is essentially a kernel learning machine whose performance relies on the choice of the covariance kernel.
• NNGP enjoys the high expressivity of DNNs.
Uncertainty quantification / surrogate accuracy

Accuracy (Regression problem)

Accuracy (Classification problem, J Lee et al, 2018)

Computational cost

• NN: the loss function is generally non-convex, but stochastic gradient descent optimization works well in practice. Training cost is typically $O(n)$, where $n$ is the number of neurons.

• GP: the loss function (the negative log marginal likelihood) is non-convex, so gradient-based optimizers need multiple starting points. Inverting the covariance matrix costs $O(N^3)$, where $N$ is the number of training data points (see the Cholesky-based sketch below).

• Bayesian NN: exact Bayesian inference is NP-hard, and approximate Bayesian inference can also be NP-hard. The cost of MCMC for approximating the posterior distribution of the network parameters can be $O(m^2)$, where $m$ is the number of parameters.

• NNGP: same cost as a GP.
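To make the $O(N^3)$ term concrete, here is a sketch of standard GP regression with a Cholesky factorization of the $N \times N$ covariance matrix. The regression pipeline, the synthetic data, and the use of the squared-exponential kernel from the earlier slides are all assumptions for illustration; for NNGP one would plug in the NN-induced kernel instead.

```python
import numpy as np

def k(a, b):
    """Squared-exponential covariance from the GP slides; for NNGP, replace this
    with the NN-induced kernel k_{sigma_w^2, sigma_b^2}(x, x')."""
    return np.exp(-np.subtract.outer(a, b) ** 2 * 10)

# Synthetic 1-D training data (illustrative only)
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * X) + 0.05 * rng.standard_normal(X.size)
Xs = np.linspace(0, 1, 100)                     # test inputs
noise = 1e-4

# Training cost is dominated by factorizing the N x N matrix: O(N^3)
K = k(X, X) + noise * np.eye(X.size)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

# Posterior mean and variance at the test inputs
Ks = k(Xs, X)
mean = Ks @ alpha
v = np.linalg.solve(L, Ks.T)
var = k(Xs, Xs).diagonal() - np.sum(v ** 2, axis=0)
print(mean[:5], var[:5])
```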
Future work – to improve, apply, and generalize NNGP

• Training an NNGP
• NNGP for solving PDEs (forward and inverse problems)
• Developing NNGP for other types of NNs, e.g., CNNs, RNNs, GANs
• ...
