
Ultrawide DNNs after training and GP regression

Here we show that wide DNNs, where the width $N$ is the largest scale in the problem, trained with a vanishing learning rate, weight decay, and white noise added to the gradients, are simply an efficient way of performing GP regression (GPR), with the kernel $K$ being that of the DNN at initialization.

The training algorithm is GD + Noise rather than standard SGD: full-batch gradient descent on a finite number of data points, with learning rate $dt$, weight decay, and added white noise:

$w_{t+dt} = w_t - \partial_w L(w_t)\,dt - \gamma w_t\,dt + \sigma\sqrt{dt}\,\epsilon_t$

where $L_{train}(w) = \sum_\mu (f_w(x_\mu) - y_\mu)^2$ is the train loss, $\gamma$ is the weight decay, and the $\epsilon_t$ are i.i.d. standard Gaussian random variables (the added noise), $\langle \epsilon_t \rangle = 0$, $\langle \epsilon_t \epsilon_{t'} \rangle = \delta_{t,t'}$.
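A minimal numpy sketch of one such update step (the function names and the toy quadratic loss are my own, for illustration):

```python
import numpy as np

def gd_noise_step(w, grad_loss, dt, gamma, sigma, rng):
    """One GD + Noise step: full-batch gradient descent with weight
    decay gamma and white Gaussian noise of strength sigma."""
    eps = rng.standard_normal(w.shape)  # <eps> = 0, unit variance
    return w - (grad_loss(w) + gamma * w) * dt + sigma * np.sqrt(dt) * eps

# usage with a hypothetical quadratic train loss L(w) = |w - 1|^2 / 2
rng = np.random.default_rng(0)
w = rng.standard_normal(5)
for _ in range(10_000):
    w = gd_noise_step(w, lambda v: v - 1.0, dt=1e-3, gamma=0.1,
                      sigma=0.3, rng=rng)
```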

In the limit $dt \to 0$, this discrete dynamics for GD + Noise converges to the stochastic differential equation (Langevin dynamics)

$\dot w(t) = -\partial_w \left[ L(w) + \tfrac{\gamma}{2}\|w\|^2 \right] + \sigma\,\xi(t), \qquad \langle \xi(t)\,\xi(t') \rangle = \delta(t - t')$

Compare this with Langevin's equation at zero mass (the overdamped limit):

$m\ddot x = -\partial_x V(x) - \eta \dot x + \sqrt{2\eta k_B T}\,\xi(t), \qquad m \to 0$

So essentially we can view the state of the DNN as the position of an overdamped particle in the potential $V(w) = L(w) + \tfrac{\gamma}{2}\|w\|^2$, at temperature $T = \sigma^2/2$ (taking $\eta = 1$, $k_B = 1$).

This motivates the following educated guess for the distribution predicted by the DNN at long times.

Equilibrium in weight space:

$P_{post}(w) \propto e^{-\left[ L(w) + \frac{\gamma}{2}\|w\|^2 \right]/T}$

(cf. Bayesian neural networks).
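As a quick numerical sanity check (a toy I am adding, not from the notes): for a quadratic loss $L(w) = \tfrac{k}{2}w^2$ the Gibbs measure above predicts $\mathrm{Var}(w) = T/(k+\gamma)$ with $T = \sigma^2/2$, and the discrete GD + Noise dynamics indeed reproduces it:

```python
import numpy as np

rng = np.random.default_rng(0)
k, gamma, sigma, dt = 1.0, 0.5, 0.7, 1e-3
T = sigma**2 / 2                      # temperature implied by the noise

w = np.zeros(10_000)                  # ensemble of independent 1-d "networks"
for _ in range(50_000):               # ~75 relaxation times 1/(k + gamma)
    eps = rng.standard_normal(w.shape)
    w += -(k + gamma) * w * dt + sigma * np.sqrt(dt) * eps

print(w.var())          # empirical variance, ~0.163
print(T / (k + gamma))  # Gibbs prediction: 0.1633...
```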

How long is long enough? How ergodic is it?

Viewing each data point as a constraint, DNNs in the overparametrized regime have far more degrees of freedom than constraints. As a rough analogy, imagine a spin system with far more spins than terms in the Hamiltonian: we don't expect spin-glass behavior.

Extremum points in high dimensions tend to be saddles, so we don't expect local minima (Dauphin et al. 2014). Empirical works find essentially no barriers in the energy landscape of overparametrized DNNs: the landscape is essentially a high-dimensional percolating zero-loss manifold, with peaks that are easy to drop from in between (Draxler et al. 2018).

E.g., a network with 13,286,336 parameters trained on 1,461,406 data points: roughly ten times more parameters than data points.

Example: 3 parameters, 1 data point. [Hand-drawn sketch in the original notes.]

In practice we find that $O(10{,}000)$ epochs at a very small learning rate are needed to equilibrate, though the details could be problem specific.

Moving to function space

Weight space in general is quite difficult to work with in the overparametrized limit: since many weight configurations can generate equivalent input-output maps, it is hard to expect any simple, non-multi-modal partition function in weight space. Let's try function space.

Changing variables from weights to functions, the equilibrium distribution induces

$P[f] \propto \int dw\, \delta[f - f_w]\, e^{-\left[ L(w) + \frac{\gamma}{2}\|w\|^2 \right]/T} = e^{-L[f]/T} \int dw\, \delta[f - f_w]\, e^{-\frac{\gamma}{2T}\|w\|^2}$

(up to normalization), where we used the fact that the loss depends on $w$ only through the function $f = f_w$. The remaining integral looks like the prior of a DNN at initialization: a random DNN with Gaussian weights of variance $T/\gamma$ is a GP with kernel $K$. Hence, up to normalization,

$P[f] \propto e^{-L[f]/T}\, e^{-\frac{1}{2}\int\!\!\int f\, K^{-1} f}$

(Ben-David & Ringel 2019).

This is a very appealing and generic theoretical result, the best one could have hoped for given the disorder:

- A quadratic (free) theory, analytically tractable.
- Visualization: a stochastic elastic manifold pinned at the data points $(x_\mu, y_\mu)$.
- DNN details enter through $K$, which controls the elasticity of the manifold. In the usual elastic setting $K^{-1} \propto -\nabla^2$; here it is a much more complicated elastic energy.

Roughly speaking, the DNN will extrapolate between the training points in the way which is most energy efficient, i.e. most probable according to its GP prior. This is also what GP regression does.
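Since this is exactly GP regression, a minimal numerical sketch may help (my illustration; the polynomial toy kernel and the use of `sigma2` as an effective observation noise are assumptions, not from the notes):

```python
import numpy as np

def gpr_mean(kernel, X_train, y_train, X_test, sigma2):
    """Posterior mean of GP regression: the function-space analogue of
    the ensemble-averaged, equilibrated DNN output."""
    K_tt = kernel(X_train, X_train) + sigma2 * np.eye(len(X_train))
    K_st = kernel(X_test, X_train)
    return K_st @ np.linalg.solve(K_tt, y_train)

# toy dot-product kernel standing in for the DNN's kernel K
kernel = lambda A, B: (1.0 + A @ B.T) ** 2

rng = np.random.default_rng(0)
X_train, X_test = rng.standard_normal((20, 3)), rng.standard_normal((5, 3))
y_train = np.sin(X_train[:, 0])
print(gpr_mean(kernel, X_train, y_train, X_test, sigma2=0.1))
```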

Making predictions

We now have sufficient analytical machinery to predict the average MSE test loss of ensembles of wide DNNs trained using a low learning rate and weight decay, without early stopping, i.e. trained until the train and validation scores equilibrate.

Exercise: Explain why all the green-marked requirements are necessary.

Needed ingredients:

- A data set.
- An assumed/given measure $d\mu$ from which $X$ is a typical draw.
- The kernel $K(x, x')$ associated with the DNN.
- The eigenvalue decomposition $\lambda_k, \phi_k(x)$ of $K$ w.r.t. $d\mu$.

None of these is trivial to get in real-world settings. Reasonable strategies are (a numerical route for the eigendecomposition is sketched after this list):

- Toy models with dot-product kernels: e.g. $K(x, y)$ being a function of $x \cdot y$, with $d\mu$ supported on some ellipsoid in input space. Our least real-world but most pedagogical and tractable choice.
- Using random DNNs as generators for $d\mu$ (hidden manifold models in input space).
- Modeling the data set via template models.
- Empirical modeling of $\lambda_k$ and $\phi_k(x)$, which typically follow a power law.
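For the eigendecomposition itself, one generic numerical route (my sketch, not part of the notes) is a Monte Carlo / Nystrom discretization: sample points from $d\mu$ and diagonalize the resulting Gram matrix.

```python
import numpy as np

def kernel_eig(kernel, mu_sampler, m=2000, seed=0):
    """Approximate eigenvalues lam_k and eigenfunctions phi_k (at the
    sample points) of the operator (Tf)(x) = int K(x,x') f(x') dmu(x')."""
    rng = np.random.default_rng(seed)
    X = mu_sampler(m, rng)                    # m draws from dmu
    lam, U = np.linalg.eigh(kernel(X, X) / m)  # discretized operator
    # descending order; sqrt(m) normalizes phi_k to unit norm w.r.t. dmu
    return lam[::-1], np.sqrt(m) * U[:, ::-1], X

def mu_sphere(m, rng, d=5):
    """Toy choice: dmu uniform on the unit sphere in d dimensions."""
    Z = rng.standard_normal((m, d))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

kernel = lambda A, B: (1.0 + A @ B.T) ** 2   # toy dot-product kernel
lam, phi, X = kernel_eig(kernel, mu_sphere)
print(lam[:5])   # leading kernel eigenvalues w.r.t. dmu
```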

The EK approximation

The posterior $P[f] \propto e^{-S[f]}$, with $Z = \int \mathcal{D}f\, e^{-S[f]}$, has the action

$S[f] = \frac{1}{2\sigma^2} \sum_{\mu=1}^{n} (f(x_\mu) - y_\mu)^2 + \frac{1}{2} \int dx \int dx'\, f(x)\, K^{-1}(x, x')\, f(x')$

The Equivalent Kernel (EK) approximation replaces the empirical sum over the $n$ training points by $n$ times an integral over the measure:

$\frac{1}{2\sigma^2} \sum_\mu (f(x_\mu) - y_\mu)^2 \;\to\; \frac{n}{2\sigma^2} \int d\mu(x)\, (f(x) - g(x))^2$

where $g(x)$ is the target function.

Next we decompose $f(x) = \sum_k f_k \phi_k(x)$ and $g(x) = \sum_k g_k \phi_k(x)$, and find

$S = \frac{1}{2} \sum_k \left[ \frac{f_k^2}{\lambda_k} + \frac{n}{\sigma^2}(f_k - g_k)^2 \right]$

Considering the test error $\int d\mu\, (g(x) - f(x))^2 = \sum_k (g_k - f_k)^2$, we find

$\left\langle \sum_k (g_k - f_k)^2 \right\rangle = \sum_k (g_k - \langle f_k \rangle)^2 + \sum_k \left\langle (f_k - \langle f_k \rangle)^2 \right\rangle = \text{bias} + \text{variance}$

Bias term: Since the action is quadratic, the average value is also the extremal value:

$\frac{\partial S}{\partial f_k} = 0 \;\Rightarrow\; \frac{f_k}{\lambda_k} + \frac{n}{\sigma^2}(f_k - g_k) = 0 \;\Rightarrow\; \langle f_k \rangle = g_k \frac{\lambda_k}{\lambda_k + \sigma^2/n}$

so that $g_k - \langle f_k \rangle = g_k \frac{\sigma^2/n}{\lambda_k + \sigma^2/n}$.

Features of $g$ with $\lambda_k \gg \sigma^2/n$ are learned well on average, whereas features with $\lambda_k \ll \sigma^2/n$ are projected out.

Variance term: Since the action is quadratic, the variance is independent of $g$:

$\left\langle (f_k - \langle f_k \rangle)^2 \right\rangle = \left( \frac{1}{\lambda_k} + \frac{n}{\sigma^2} \right)^{-1} = \frac{\lambda_k\,\sigma^2/n}{\lambda_k + \sigma^2/n}$

- Modes of $f$ with $\lambda_k \ll \sigma^2/n$ are essentially frozen and fixed to their average value.
- Modes of $f$ with $\lambda_k \gg \sigma^2/n$ are essentially the ones affected by the training data.
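Combining the bias and variance terms gives a computable learning-curve prediction. Below is a short sketch (my addition), assuming toy power-law models for $\lambda_k$ and $g_k$ in the spirit of the modeling strategies above:

```python
import numpy as np

def ek_test_error(lam, g, n, sigma2):
    """EK prediction for the average MSE test error:
    bias  sum_k g_k^2 (s / (lam_k + s))^2, with s = sigma2 / n,
    plus variance  sum_k lam_k s / (lam_k + s)."""
    s = sigma2 / n
    bias = np.sum(g**2 * (s / (lam + s))**2)
    variance = np.sum(lam * s / (lam + s))
    return bias + variance

# toy power-law model for the spectrum and the target coefficients
k = np.arange(1, 1001)
lam, g = 1.0 / k**2, 1.0 / k
for n in [10, 100, 1000, 10_000]:
    print(n, ek_test_error(lam, g, n, sigma2=0.1))  # error decays with n
```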

Corrections to the EK approach

The EK approximation as applied above is uncontrolled. It also gives the false impression that for $n \to \infty$ performance becomes perfect regardless of the target. A controlled alternative is a systematic expansion in $1/n$ for $\langle f(x) \rangle$, following averaging over various dataset draws from $d\mu$. This works OK already at moderate $n$, agrees with EK to good accuracy for larger $n$, and reduces to EK as $n \to \infty$. See Cohen, Malka & Ringel 2019, and older works by Malzahn & Opper 2001.

Further reading (a highly biased list of references):

https://arxiv.org/pdf/1806.07572.pdf
The NTK approach to GPs and DNNs.

https://arxiv.org/pdf/2004.01190.pdf
Finite-width perturbative corrections; see also Yaida 2020.

https://arxiv.org/pdf/2003.02218.pdf
High learning rate phase transition.

https://arxiv.org/pdf/1902.06720.pdf
How wide is wide enough for GPs; a very good recap of NTK.

https://arxiv.org/pdf/2106.04110.pdf
Our approach to strong finite-width (channel) effects.

https://arxiv.org/pdf/1803.00885.pdf
Empirical study of the energy landscape in DNNs.

https://journals.aps.org/prresearch/pdf/10.1103/PhysRevResearch.3.023034
Perturbative improvement of EK via field theory.

https://arxiv.org/pdf/1705.08741.pdf
Effect of batch size, suggesting GD + Noise is as good as SGD when trained sufficiently long.

https://arxiv.org/abs/1805.12076
A study of overparametrization, showing it doesn't damage test performance.

https://arxiv.org/pdf/2003.02237.pdf
State of the art in GPs, raising the prospect that they could maybe match DNN performance soon, at least on small datasets.
