
Ultrawide DNNs after training and GP regression

Here we show that wide DNNs, where the width $N$ is the largest scale in the problem, trained with a vanishing learning rate, weight decay, and white noise added to the gradients, are simply an efficient way of performing GP regression (GPR), with the kernel $K$ being that of the DNN at initialization.

The training algorithm is GD + Noise rather than standard SGD: full-batch gradient descent on a finite number of data points, with learning rate $dt$, weight decay, and added white noise:

$w_{t+dt} = w_t - \partial_w L(w_t)\,dt - \gamma w_t\,dt + \sigma\sqrt{dt}\,\epsilon_t$

where $L_{train}(w) = \sum_\mu (f_w(x_\mu) - y_\mu)^2$ is the train loss, $\gamma$ is the weight decay, and the $\epsilon_t$ are i.i.d. standard Gaussian random variables (the added noise), $\langle \epsilon_t \rangle = 0$, $\langle \epsilon_t \epsilon_{t'} \rangle = \delta_{t,t'}$.
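A minimal numpy sketch of one such update step (the function names and the toy quadratic loss are my own, for illustration):

```python
import numpy as np

def gd_noise_step(w, grad_loss, dt, gamma, sigma, rng):
    """One GD + Noise step: full-batch gradient descent with weight
    decay gamma and white Gaussian noise of strength sigma."""
    eps = rng.standard_normal(w.shape)  # <eps> = 0, unit variance
    return w - (grad_loss(w) + gamma * w) * dt + sigma * np.sqrt(dt) * eps

# usage with a hypothetical quadratic train loss L(w) = |w - 1|^2 / 2
rng = np.random.default_rng(0)
w = rng.standard_normal(5)
for _ in range(10_000):
    w = gd_noise_step(w, lambda v: v - 1.0, dt=1e-3, gamma=0.1,
                      sigma=0.3, rng=rng)
```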

In the limit $dt \to 0$, this discrete dynamics for GD + Noise converges to the stochastic differential equation (Langevin dynamics)

$\dot w(t) = -\partial_w \left[ L(w) + \tfrac{\gamma}{2}\|w\|^2 \right] + \sigma\,\xi(t), \qquad \langle \xi(t)\,\xi(t') \rangle = \delta(t - t')$

Compare this with Langevin's equation at zero mass (the overdamped limit):

$m\ddot x = -\partial_x V(x) - \eta \dot x + \sqrt{2\eta k_B T}\,\xi(t), \qquad m \to 0$

So essentially we can view the state of the DNN as the position of an overdamped particle in the potential $V(w) = L(w) + \tfrac{\gamma}{2}\|w\|^2$, at temperature $T = \sigma^2/2$ (taking $\eta = 1$, $k_B = 1$).

This motivates the following educated guess for the distribution predicted by the DNN at long times.

Equilibrium in weight space:

$P_{post}(w) \propto e^{-\left[ L(w) + \frac{\gamma}{2}\|w\|^2 \right]/T}$

(cf. Bayesian neural networks).
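As a quick numerical sanity check (a toy I am adding, not from the notes): for a quadratic loss $L(w) = \tfrac{k}{2}w^2$ the Gibbs measure above predicts $\mathrm{Var}(w) = T/(k+\gamma)$ with $T = \sigma^2/2$, and the discrete GD + Noise dynamics indeed reproduces it:

```python
import numpy as np

rng = np.random.default_rng(0)
k, gamma, sigma, dt = 1.0, 0.5, 0.7, 1e-3
T = sigma**2 / 2                      # temperature implied by the noise

w = np.zeros(10_000)                  # ensemble of independent 1-d "networks"
for _ in range(50_000):               # ~75 relaxation times 1/(k + gamma)
    eps = rng.standard_normal(w.shape)
    w += -(k + gamma) * w * dt + sigma * np.sqrt(dt) * eps

print(w.var())          # empirical variance, ~0.163
print(T / (k + gamma))  # Gibbs prediction: 0.1633...
```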

How long is long enough? How ergodic is it?

Viewing each data point as a constraint, DNNs in the overparametrized regime have far more degrees of freedom than constraints. As a rough analogy, imagine a spin system with far more spins than terms in the Hamiltonian: we don't expect spin-glass behavior.

Extremum points in high dimensions tend to be saddles, so we don't expect local minima (Dauphin et al. 2014). Empirical works find essentially no barriers in the energy landscape of overparametrized DNNs: the landscape is essentially a high-dimensional percolating zero-loss manifold, with peaks that are easy to drop from in between (Draxler et al. 2018).

E.g., a network with 13,286,336 parameters trained on 1,461,406 data points: roughly ten times more parameters than data points.

Example: 3 parameters, 1 data point. [Hand-drawn sketch in the original notes.]

In practice we find that $O(10{,}000)$ epochs at a very small learning rate are needed to equilibrate, though the details could be problem specific.

Moving to function space

Weight space in general is quite difficult to work with in the overparametrized limit: since many weight configurations can generate equivalent input-output maps, it is hard to expect any simple, non-multi-modal partition function in weight space. Let's try function space.

Changing variables from weights to functions, the equilibrium distribution induces

$P[f] \propto \int dw\, \delta[f - f_w]\, e^{-\left[ L(w) + \frac{\gamma}{2}\|w\|^2 \right]/T} = e^{-L[f]/T} \int dw\, \delta[f - f_w]\, e^{-\frac{\gamma}{2T}\|w\|^2}$

(up to normalization), where we used the fact that the loss depends on $w$ only through the function $f = f_w$. The remaining integral looks like the prior of a DNN at initialization: a random DNN with Gaussian weights of variance $T/\gamma$ is a GP with kernel $K$. Hence, up to normalization,

$P[f] \propto e^{-L[f]/T}\, e^{-\frac{1}{2}\int\!\!\int f\, K^{-1} f}$

(Ben-David & Ringel 2019).

This is a very appealing and generic theoretical result, the best one could have hoped for given the disorder:

- A quadratic (free) theory, analytically tractable.
- Visualization: a stochastic elastic manifold pinned at the data points $(x_\mu, y_\mu)$.
- DNN details enter through $K$, which controls the elasticity of the manifold. In the usual elastic setting $K^{-1} \propto -\nabla^2$; here it is a much more complicated elastic energy.

Roughly speaking, the DNN will extrapolate between the training points in the way which is most energy efficient, i.e. most probable according to its GP prior. This is also what GP regression does.
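Since this is exactly GP regression, a minimal numerical sketch may help (my illustration; the polynomial toy kernel and the use of `sigma2` as an effective observation noise are assumptions, not from the notes):

```python
import numpy as np

def gpr_mean(kernel, X_train, y_train, X_test, sigma2):
    """Posterior mean of GP regression: the function-space analogue of
    the ensemble-averaged, equilibrated DNN output."""
    K_tt = kernel(X_train, X_train) + sigma2 * np.eye(len(X_train))
    K_st = kernel(X_test, X_train)
    return K_st @ np.linalg.solve(K_tt, y_train)

# toy dot-product kernel standing in for the DNN's kernel K
kernel = lambda A, B: (1.0 + A @ B.T) ** 2

rng = np.random.default_rng(0)
X_train, X_test = rng.standard_normal((20, 3)), rng.standard_normal((5, 3))
y_train = np.sin(X_train[:, 0])
print(gpr_mean(kernel, X_train, y_train, X_test, sigma2=0.1))
```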

Making predictions

We now have sufficient analytical machinery to predict the average MSE test loss of ensembles of wide DNNs trained using a low learning rate and weight decay, without early stopping, i.e. trained until the train and validation scores equilibrate.

Exercise: Explain why all the green-marked requirements are necessary.

Needed ingredients:

- A data set.
- An assumed/given measure $d\mu$ from which $X$ is a typical draw.
- The kernel $K(x, x')$ associated with the DNN.
- The eigenvalue decomposition $\lambda_k, \phi_k(x)$ of $K$ w.r.t. $d\mu$.

None of these is trivial to get in real-world settings. Reasonable strategies are (a numerical route for the eigendecomposition is sketched after this list):

- Toy models with dot-product kernels: e.g. $K(x, y)$ being a function of $x \cdot y$, with $d\mu$ supported on some ellipsoid in input space. Our least real-world but most pedagogical and tractable choice.
- Using random DNNs as generators for $d\mu$ (hidden manifold models in input space).
- Modeling the data set via template models.
- Empirical modeling of $\lambda_k$ and $\phi_k(x)$, which typically follow a power law.
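For the eigendecomposition itself, one generic numerical route (my sketch, not part of the notes) is a Monte Carlo / Nystrom discretization: sample points from $d\mu$ and diagonalize the resulting Gram matrix.

```python
import numpy as np

def kernel_eig(kernel, mu_sampler, m=2000, seed=0):
    """Approximate eigenvalues lam_k and eigenfunctions phi_k (at the
    sample points) of the operator (Tf)(x) = int K(x,x') f(x') dmu(x')."""
    rng = np.random.default_rng(seed)
    X = mu_sampler(m, rng)                    # m draws from dmu
    lam, U = np.linalg.eigh(kernel(X, X) / m)  # discretized operator
    # descending order; sqrt(m) normalizes phi_k to unit norm w.r.t. dmu
    return lam[::-1], np.sqrt(m) * U[:, ::-1], X

def mu_sphere(m, rng, d=5):
    """Toy choice: dmu uniform on the unit sphere in d dimensions."""
    Z = rng.standard_normal((m, d))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

kernel = lambda A, B: (1.0 + A @ B.T) ** 2   # toy dot-product kernel
lam, phi, X = kernel_eig(kernel, mu_sphere)
print(lam[:5])   # leading kernel eigenvalues w.r.t. dmu
```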

The EK approximation

The posterior $P[f] \propto e^{-S[f]}$, with $Z = \int \mathcal{D}f\, e^{-S[f]}$, has the action

$S[f] = \frac{1}{2\sigma^2} \sum_{\mu=1}^{n} (f(x_\mu) - y_\mu)^2 + \frac{1}{2} \int dx \int dx'\, f(x)\, K^{-1}(x, x')\, f(x')$

The Equivalent Kernel (EK) approximation replaces the empirical sum over the $n$ training points by $n$ times an integral over the measure:

$\frac{1}{2\sigma^2} \sum_\mu (f(x_\mu) - y_\mu)^2 \;\to\; \frac{n}{2\sigma^2} \int d\mu(x)\, (f(x) - g(x))^2$

where $g(x)$ is the target function.

Next we decompose $f(x) = \sum_k f_k \phi_k(x)$ and $g(x) = \sum_k g_k \phi_k(x)$, and find

$S = \frac{1}{2} \sum_k \left[ \frac{f_k^2}{\lambda_k} + \frac{n}{\sigma^2}(f_k - g_k)^2 \right]$

Considering the test error $\int d\mu\, (g(x) - f(x))^2 = \sum_k (g_k - f_k)^2$, we find

$\left\langle \sum_k (g_k - f_k)^2 \right\rangle = \sum_k (g_k - \langle f_k \rangle)^2 + \sum_k \left\langle (f_k - \langle f_k \rangle)^2 \right\rangle = \text{bias} + \text{variance}$

Bias term: Since the action is quadratic, the average value is also the extremal value:

$\frac{\partial S}{\partial f_k} = 0 \;\Rightarrow\; \frac{f_k}{\lambda_k} + \frac{n}{\sigma^2}(f_k - g_k) = 0 \;\Rightarrow\; \langle f_k \rangle = g_k \frac{\lambda_k}{\lambda_k + \sigma^2/n}$

so that $g_k - \langle f_k \rangle = g_k \frac{\sigma^2/n}{\lambda_k + \sigma^2/n}$.

Features of $g$ with $\lambda_k \gg \sigma^2/n$ are learned well on average, whereas features with $\lambda_k \ll \sigma^2/n$ are projected out.

Variance term: Since the action is quadratic, the variance is independent of $g$:

$\left\langle (f_k - \langle f_k \rangle)^2 \right\rangle = \left( \frac{1}{\lambda_k} + \frac{n}{\sigma^2} \right)^{-1} = \frac{\lambda_k\,\sigma^2/n}{\lambda_k + \sigma^2/n}$

- Modes of $f$ with $\lambda_k \ll \sigma^2/n$ are essentially frozen and fixed to their average value.
- Modes of $f$ with $\lambda_k \gg \sigma^2/n$ are essentially the ones affected by the training data.
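Combining the bias and variance terms gives a computable learning-curve prediction. Below is a short sketch (my addition), assuming toy power-law models for $\lambda_k$ and $g_k$ in the spirit of the modeling strategies above:

```python
import numpy as np

def ek_test_error(lam, g, n, sigma2):
    """EK prediction for the average MSE test error:
    bias  sum_k g_k^2 (s / (lam_k + s))^2, with s = sigma2 / n,
    plus variance  sum_k lam_k s / (lam_k + s)."""
    s = sigma2 / n
    bias = np.sum(g**2 * (s / (lam + s))**2)
    variance = np.sum(lam * s / (lam + s))
    return bias + variance

# toy power-law model for the spectrum and the target coefficients
k = np.arange(1, 1001)
lam, g = 1.0 / k**2, 1.0 / k
for n in [10, 100, 1000, 10_000]:
    print(n, ek_test_error(lam, g, n, sigma2=0.1))  # error decays with n
```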

Corrections to the EK approach

The EK approximation as applied above is uncontrolled. It also gives the false impression that for $n \to \infty$ performance becomes perfect regardless of the target. A controlled alternative is a systematic expansion in $1/n$ for $\langle f(x) \rangle$, following averaging over various dataset draws from $d\mu$. This works OK already at moderate $n$, agrees with EK to good accuracy for larger $n$, and reduces to EK as $n \to \infty$. See Cohen, Malka & Ringel 2019, and older works by Malzahn & Opper 2001.

Further reading (a highly biased list of references):

https://arxiv.org/pdf/1806.07572.pdf
The NTK approach to GPs and DNNs.

https://arxiv.org/pdf/2004.01190.pdf
Finite-width perturbative corrections; see also Yaida 2020.

https://arxiv.org/pdf/2003.02218.pdf
High learning rate phase transition.

https://arxiv.org/pdf/1902.06720.pdf
How wide is wide enough for GPs; a very good recap of NTK.

https://arxiv.org/pdf/2106.04110.pdf
Our approach to strong finite-width (channel) effects.

https://arxiv.org/pdf/1803.00885.pdf
Empirical study of the energy landscape in DNNs.

https://journals.aps.org/prresearch/pdf/10.1103/PhysRevResearch.3.023034
Perturbative improvement of EK via field theory.

https://arxiv.org/pdf/1705.08741.pdf
Effect of batch size, suggesting GD + Noise is as good as SGD when trained sufficiently long.

https://arxiv.org/abs/1805.12076
A study of overparametrization, showing it doesn't damage test performance.

https://arxiv.org/pdf/2003.02237.pdf
State of the art in GPs, raising the prospect that they could maybe match DNN performance soon, at least on small datasets.
