
Classification with Neural Networks

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Linear Regression

DATASET
  inputs      outputs
  x1 = 1      y1 = 1
  x2 = 3      y2 = 2.2
  x3 = 2      y3 = 2
  x4 = 1.5    y4 = 1.9
  x5 = 4      y5 = 3.1

[Scatter plot of the dataset with a fitted line of slope w through the origin]

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 2

1-parameter linear regression

Assume that the data is formed by
    yi = w xi + noisei
where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²

So the conditional distribution of each output is
    P(y | w, x) = Normal(mean wx, variance σ²)

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 3

The datapoints are EVIDENCE about w. We want
    P(w | x1, x2, x3, … xn, y1, y2, … yn)
• You can use BAYES rule to work out a posterior distribution for w given the data.
• Or you could do Maximum Likelihood Estimation.

Maximum likelihood estimation of w

"For which value of w is this data most likely to have happened?"

⟺  For what w is  P(y1, y2, … yn | x1, x2, … xn, w)  maximized?

⟺  For what w is  ∏_{i=1..n} P(yi | w, xi)  maximized?

⟺  For what w is  ∏_{i=1..n} exp( −½ ((yi − w xi)/σ)² )  maximized?

⟺  For what w is  Σ_{i=1..n} −½ ((yi − w xi)/σ)²  maximized?

⟺  For what w is  Σ_{i=1..n} (yi − w xi)²  minimized?

Linear Regression

The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:

    Ε(w) = Σ_i ( yi − w xi )²
         = Σ_i yi²  −  ( 2 Σ_i xi yi ) w  +  ( Σ_i xi² ) w²

[Sketch: E(w) plotted against w, a parabola]

We want to minimize a quadratic function of w.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 7

Linear Regression

Easy to show the sum of squares is minimized when

    w = ( Σ_i xi yi ) / ( Σ_i xi² )

The maximum likelihood model is
    Out(x) = wx

We can use it for prediction.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 8
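As a quick illustration (not from the slides), here is a minimal NumPy sketch of this closed-form fit applied to the example dataset from Slide 2:

import numpy as np

# Dataset from the Linear Regression slide (Slide 2).
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum-likelihood slope for the model Out(x) = w*x under Gaussian noise:
#   w = sum(x_i * y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x ** 2)
print(f"estimated w = {w:.3f}")                 # slope of the fitted line
print(f"prediction at x = 2.5: {w * 2.5:.3f}")  # using the model for prediction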

Linear Regression (continued)

Same closed-form fit as above, with a Bayesian aside:

[Sketch: p(w), a probability distribution over w]

Note: In Bayesian stats you'd have ended up with a prob dist of w,
and predictions would have given a prob dist of expected output.
Max likelihood can give some kinds of confidence too.

Multivariate Regression

What if the inputs are vectors?

[2-d input example: a scatter of points in the (x1, x2) plane, each labeled with its output value]

Dataset has form
  x1  y1
  x2  y2
  x3  y3
  :   :
  xR  yR

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 10

Multivariate Regression

Write matrix X and vector Y thus:

      [ ….. x1 ….. ]   [ x11  x12  …  x1m ]        [ y1 ]
  X = [ ….. x2 ….. ] = [ x21  x22  …  x2m ]    Y = [ y2 ]
      [      ⋮      ]   [  ⋮     ⋮        ⋮  ]        [  ⋮ ]
      [ ….. xR ….. ]   [ xR1  xR2  …  xRm ]        [ yR ]

(there are R datapoints; each input has m components)

The linear regression model assumes a vector w such that
    Out(x) = wᵀx = w1 x[1] + w2 x[2] + … + wm x[m]

The max. likelihood w is  w = (XᵀX)⁻¹ (XᵀY)

IMPORTANT EXERCISE: PROVE IT !!!!!
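A minimal NumPy sketch (not part of the slides) of solving those normal equations; the dataset, its dimensions, and the true weights below are made up purely for illustration:

import numpy as np

# Hypothetical dataset: R datapoints, each input with m components.
rng = np.random.default_rng(0)
R, m = 50, 3
X = rng.normal(size=(R, m))                    # R x m input matrix
true_w = np.array([2.0, -1.0, 0.5])
Y = X @ true_w + 0.1 * rng.normal(size=R)      # outputs with a little noise

# Maximum-likelihood weights via the normal equations: w = (X^T X)^{-1} X^T Y.
# np.linalg.solve is used instead of an explicit inverse for numerical stability.
w = np.linalg.solve(X.T @ X, X.T @ Y)
print("estimated w:", np.round(w, 3))

# Prediction for a new input vector x: Out(x) = w^T x
x_new = np.array([1.0, 0.0, -1.0])
print("Out(x_new) =", float(w @ x_new))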

Multivariate Regression (con't)

XᵀX is an m x m matrix:       i,j'th elt is  Σ_{k=1..R} x_ki x_kj
XᵀY is an m-element vector:   i'th elt is    Σ_{k=1..R} x_ki y_k

We may expect linear data that does not go through the origin.
Statisticians and Neural Net Folks all agree on a simple obvious hack.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 14

The constant term

• The trick is to create a fake input "X0" that always takes the value 1.

  Before:          After:
  X1  X2   Y       X0  X1  X2   Y
   2   4  16        1   2   4  16
   3   4  17        1   3   4  17
   5   5  20        1   5   5  20

Before:  Y = w1 X1 + w2 X2
         …has to be a poor model

After:   Y = w0 X0 + w1 X1 + w2 X2
           = w0 + w1 X1 + w2 X2
         …has a fine constant term

In this example, you should be able to see the MLE w0, w1 and w2 by inspection.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 15
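A small NumPy sketch (not from the slides) of the hack, using the example table above:

import numpy as np

# Example table from the slide: two inputs and one output per row.
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# The trick: prepend a fake input X0 that is always 1, so w0 acts as the constant term.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares fit of Y = w0 + w1*X1 + w2*X2.
w, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
print("w0, w1, w2 =", np.round(w, 3))   # matches the values you can read off by inspection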

• Suppose you know the variance of the noise that was added to each datapoint.

  xi   yi   σi²
  ½    ½    4
  1    1    1
  2    1    1/4
  2    3    4

[Plot: the four datapoints drawn with error bars of width σ = 2, 1, ½, 2]

Assume  yi ~ N( w xi, σi² ).   What's the MLE estimate of w?

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 16

MLE estimation with varying noise

argmax_w  log p( y1, y2, …, yR | x1, x2, …, xR, σ1², σ2², …, σR², w )

  =  argmin_w  Σ_{i=1..R} ( yi − w xi )² / σi²
        (assuming i.i.d. and then plugging in the equation for a Gaussian and simplifying)

  =  the w such that  Σ_{i=1..R} xi ( yi − w xi ) / σi²  =  0
        (setting dLL/dw equal to zero)

  =  ( Σ_{i=1..R} xi yi / σi² ) / ( Σ_{i=1..R} xi² / σi² )
        (trivial algebra)

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 17
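A minimal sketch (not from the slides) of that weighted MLE on the example datapoints from the varying-noise slide:

import numpy as np

# Datapoints and their known per-point noise variances.
x      = np.array([0.5, 1.0, 2.0, 2.0])
y      = np.array([0.5, 1.0, 1.0, 3.0])
sigma2 = np.array([4.0, 1.0, 0.25, 4.0])

# Weighted MLE for Out(x) = w*x:  w = sum(x*y/sigma^2) / sum(x^2/sigma^2).
# Low-variance points count for more; high-variance points count for less.
w = np.sum(x * y / sigma2) / np.sum(x ** 2 / sigma2)
print(f"weighted MLE w = {w:.3f}")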

• We are asking to minimize the weighted sum of squares

    argmin_w  Σ_{i=1..R} ( yi − w xi )² / σi²

  where the weight for the i'th datapoint is 1/σi².

[Plot: the same datapoints and error bars over x = 0 … 3]

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 18

Weighted Multivariate Regression

Write W = diag(1/σ1, …, 1/σR). Then

  (WX)ᵀ(WX) is an m x m matrix:       i,j'th elt is  Σ_{k=1..R} x_ki x_kj / σk²
  (WX)ᵀ(WY) is an m-element vector:   i'th elt is    Σ_{k=1..R} x_ki yk / σk²

and the max. likelihood w solves  (WX)ᵀ(WX) w = (WX)ᵀ(WY).
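An illustrative NumPy sketch (synthetic data, not from the slides) of the weighted fit, using the diagonal weight matrix of 1/σk² directly:

import numpy as np

rng = np.random.default_rng(1)
R, m = 40, 2
X = rng.normal(size=(R, m))
sigma2 = rng.uniform(0.1, 2.0, size=R)          # known per-datapoint noise variances
Y = X @ np.array([1.5, -0.7]) + np.sqrt(sigma2) * rng.normal(size=R)

# Weighted normal equations: solve (X^T W X) w = X^T W Y with W = diag(1/sigma_k^2).
W = np.diag(1.0 / sigma2)
w = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
print("weighted MLE w:", np.round(w, 3))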

Non-linear Regression

• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

  xi   yi
  ½    ½
  1    2.5
  2    3
  3    2

[Plot of the four datapoints]

Assume  yi ~ N( √(w + xi), σ² ).   What's the MLE estimate of w?

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 20

Non-linear MLE estimation

argmax_w  log p( y1, y2, …, yR | x1, x2, …, xR, σ, w )

  =  argmin_w  Σ_{i=1..R} ( yi − √(w + xi) )²
        (assuming i.i.d. and then plugging in the equation for a Gaussian and simplifying)

  =  the w such that  Σ_{i=1..R} ( yi − √(w + xi) ) / √(w + xi)  =  0
        (setting dLL/dw equal to zero)

  =  … and now we're down the algebraic toilet: there is no neat closed form.
     So guess what we do?

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 22

Non-linear MLE estimation

We want the w such that  Σ_{i=1..R} ( yi − √(w + xi) ) / √(w + xi)  =  0.

Common (but not only) approach:
Numerical Solutions:
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg Marquardt
• Newton's Method

Also, special purpose statistical-optimization-specific tricks such as E.M. (See Gaussian Mixtures lecture for introduction.)

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 23
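Since there is no closed form, here is a small sketch (not from the slides) of the simplest numerical option, gradient descent, applied to this model and the example datapoints above; the starting point, learning rate, and iteration count are arbitrary choices:

import numpy as np

# Gradient descent on the non-linear model y_i ~ N(sqrt(w + x_i), sigma^2),
# using the example datapoints from the Non-linear Regression slide.
x = np.array([0.5, 1.0, 2.0, 3.0])
y = np.array([0.5, 2.5, 3.0, 2.0])

def dE_dw(w):
    # E(w) = sum_i (y_i - sqrt(w + x_i))^2
    # dE/dw = -sum_i (y_i - sqrt(w + x_i)) / sqrt(w + x_i)
    s = np.sqrt(w + x)
    return -np.sum((y - s) / s)

w, eta = 1.0, 0.05          # initial guess and learning rate (both arbitrary)
for _ in range(2000):
    w -= eta * dE_dw(w)
print(f"gradient-descent estimate of w: {w:.3f}")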


GRADIENT DESCENT

Suppose we have a scalar function  f(w): ℜ → ℜ.
We want to find a local minimum. Assume our current weight is w.

GRADIENT DESCENT RULE:    w ← w − η ∂f(w)/∂w

η is called the LEARNING RATE: a small positive number, e.g. η = 0.05
(recall Andrew's favorite default value for anything).

QUESTION: Justify the Gradient Descent Rule

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 25

Gradient descent in m dimensions

Given  f(w): ℜᵐ → ℜ

    ∇f(w) = [ ∂f(w)/∂w1, …, ∂f(w)/∂wm ]ᵀ    points in the direction of steepest ascent,
and |∇f(w)| is the gradient in that direction.

GRADIENT DESCENT RULE:    w ← w − η ∇f(w)

Equivalently    wj ← wj − η ∂f(w)/∂wj    …where wj is the jth weight
("just like a linear feedback system")

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 26
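A tiny sketch (not from the slides) of the rule in action on a toy function; the function, target point, learning rate, and iteration count are all illustrative choices:

import numpy as np

# Minimize the toy quadratic f(w) = ||w - target||^2 by gradient descent.
target = np.array([3.0, -2.0])

def grad_f(w):
    # gradient of ||w - target||^2 is 2*(w - target)
    return 2.0 * (w - target)

w = np.zeros(2)          # start from the origin
eta = 0.05               # Andrew's favorite default
for _ in range(200):
    w = w - eta * grad_f(w)     # w <- w - eta * grad f(w)
print("final w:", np.round(w, 4))   # converges toward the target [3, -2]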

What's all this got to do with Neural Nets, then, eh??

For supervised learning, neural nets are also models with vectors of w parameters in them. They are now called weights.

As before, we want to compute the weights to minimize the sum-of-squared residuals, which turns out, under the "Gaussian i.i.d. noise" assumption, to be max. likelihood.

Instead of explicitly solving for the max. likelihood weights, we use GRADIENT DESCENT to SEARCH for them.

"Why?" you ask, a querulous expression in your eyes.
"Aha!!" I reply: "We'll see later."

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 27

Linear Perceptrons

They are multivariate linear models:
    Out(x) = wᵀx

And training consists of minimizing the sum-of-squared residuals by gradient descent:

    Ε = Σ_k ( Out(xk) − yk )²  =  Σ_k ( wᵀxk − yk )²

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 28

Linear Perceptron Training Rule

    E = Σ_{k=1..R} ( yk − wᵀxk )²

Gradient descent tells us we should update w thusly if we wish to minimize E:

    wj ← wj − η ∂E/∂wj

So what's ∂E/∂wj ?

    ∂E/∂wj  =  ∂/∂wj Σ_{k=1..R} ( yk − wᵀxk )²
            =  Σ_{k=1..R} 2 ( yk − wᵀxk ) ∂/∂wj ( yk − wᵀxk )
            =  −2 Σ_{k=1..R} δk ∂/∂wj wᵀxk            …where… δk = yk − wᵀxk
            =  −2 Σ_{k=1..R} δk ∂/∂wj Σ_{i=1..m} wi xki
            =  −2 Σ_{k=1..R} δk xkj

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 29

Linear Perceptron Training Rule

    E = Σ_{k=1..R} ( yk − wᵀxk )²

Gradient descent tells us we should update w thusly if we wish to minimize E:

    wj ← wj − η ∂E/∂wj    where    ∂E/∂wj = −2 Σ_{k=1..R} δk xkj

so

    wj ← wj + 2η Σ_{k=1..R} δk xkj        …where… δk = yk − wᵀxk

We frequently neglect the 2 (meaning we halve the learning rate).

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 31

Putting it together, batch perceptron training:

1) Randomly initialize weights w1 w2 … wm
2) Append the constant input x0 = 1 to each datapoint (unless you don't want to go through the origin).
3) for i = 1 to R:   δi := yi − wᵀxi
4) for j = 1 to m:   wj ← wj + η Σ_{i=1..R} δi xij
5) If the weights are still changing, go back to 3.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 32
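A small sketch (not from the slides) of that batch training loop on synthetic data; the data, learning rate, and stopping tolerance are illustrative choices:

import numpy as np

rng = np.random.default_rng(2)
R, m = 30, 2
X = np.hstack([np.ones((R, 1)), rng.normal(size=(R, m))])   # step 2: constant input x0 = 1
Y = X @ np.array([0.5, 2.0, -1.0]) + 0.05 * rng.normal(size=R)

w = rng.normal(size=m + 1)        # step 1: random initial weights
eta = 0.01
for _ in range(10000):
    delta = Y - X @ w             # step 3: delta_i = y_i - w.x_i for every datapoint
    step = eta * X.T @ delta      # step 4: w_j += eta * sum_i delta_i * x_ij
    w += step
    if np.max(np.abs(step)) < 1e-8:   # step 5: stop once the weights stop changing
        break
print("learned weights:", np.round(w, 3))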

A RULE KNOWN BY MANY NAMES

    δi ← yi − wᵀxi
    wj ← wj + η δi xij

The LMS rule | The Widrow-Hoff rule | The delta rule | The adaline rule | Classical conditioning

If the input-output pairs arrive quickly, THEN don't bother remembering old ones. Just keep using new ones:

    observe (x, y)
    δ ← y − wᵀx
    ∀j  wj ← wj + η δ xj
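An online (streaming) sketch of the delta / LMS rule above, not from the slides; the stream of (x, y) pairs here is simulated and η is an arbitrary small choice:

import numpy as np

rng = np.random.default_rng(3)
true_w = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
eta = 0.02

for _ in range(5000):                  # pretend each iteration is a new arriving pair
    x = rng.normal(size=3)
    y = true_w @ x + 0.05 * rng.normal()
    delta = y - w @ x                  # delta = y - w.x
    w += eta * delta * x               # for all j: w_j += eta * delta * x_j
print("online LMS estimate:", np.round(w, 3))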

Gradient Descent vs Matrix Inversion

for Linear Perceptrons

GD Advantages (MI disadvantages):

• Biologically plausible

• With very very many attributes each iteration costs only O(mR). If

fewer than m iterations needed we’ve beaten Matrix Inversion

• More easily parallelizable (or implementable in wetware)?

GD Disadvantages (MI advantages):

• It’s moronic

• It’s essentially a slow implementation of a way to build the XTX matrix

and then solve a set of linear equations

• If m is small it's especially outrageous. If m is large then the direct

matrix inversion method gets fiddly but not impossible if you want to

be efficient.

• Hard to choose a good learning rate

• Matrix inversion takes predictable time. You can’t be sure when

gradient descent will stop.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 35


Gradient Descent vs Matrix Inversion for Linear Perceptrons (continued)

(Same pros and cons as above.) But we'll soon see that GD has an important extra trick up its sleeve.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 37


Perceptrons for Classification

What if all outputs are 0's or 1's (or −1's and +1's)?

We can do a linear fit.  [Blue = Out(x)]
Our prediction is 0 if out(x) ≤ ½, and 1 if out(x) > ½.  [Green = Classification]

WHAT'S THE BIG PROBLEM WITH THIS???

Classification with Perceptrons I

Don't minimize  Σ ( yi − wᵀxi )².
Minimize the number of misclassifications instead:

    Σ ( yi − Round(wᵀxi) )        [Assume outputs are +1 & −1, not +1 & 0]

where Round(x) = −1 if x < 0, and 1 if x ≥ 0.

New update rules:
  if (xi, yi) is correctly classed, don't change w
  if wrongly predicted as 1:    w ← w − xi
  if wrongly predicted as −1:   w ← w + xi

NOT OBVIOUS WHY THIS WORKS!!
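A sketch (not from the slides) of that update rule on a toy linearly separable problem with ±1 labels; the data, weights through a bias column, and number of passes are illustrative choices:

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
X = np.hstack([np.ones((100, 1)), X])              # constant input for a bias term
y = np.where(X @ np.array([0.2, 1.0, -1.5]) > 0, 1, -1)

w = np.zeros(3)
for _ in range(50):                                 # a few passes over the data
    for xi, yi in zip(X, y):
        pred = 1 if w @ xi >= 0 else -1
        if pred != yi:                              # only wrongly classified points change w
            w += yi * xi                            # +x if predicted -1, -x if predicted +1
print("perceptron weights:", np.round(w, 3))
print("training mistakes remaining:", int(np.sum(np.where(X @ w >= 0, 1, -1) != y)))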


Classification with Perceptrons II: Sigmoid Functions

[Sketch: a squashed fit that would classify much better, but is not a least-squares fit.]

SOLUTION:
Instead of    Out(x) = wᵀx
we'll use     Out(x) = g(wᵀx)
where  g(x): ℜ → (0,1)  is a squashing function.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 43

The Sigmoid

    g(h) = 1 / (1 + exp(−h))

Rotate the curve through 180° about the point (0, ½) and you get the same curve,
i.e. g(h) = 1 − g(−h).

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 44

The Sigmoid

    g(h) = 1 / (1 + exp(−h))

We choose w to minimize

    Σ_{i=1..R} [ yi − Out(xi) ]²  =  Σ_{i=1..R} [ yi − g(wᵀxi) ]²

Regions

[Diagram: points labeled 0 and 1 in the (X1, X2) plane, separated by a line]

    Out(x) = g( w1 x1 + w2 x2 + w0 )

Which region of the above diagram is classified with +1, and which with 0??

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 46

Gradient descent with sigmoid on a perceptron

First, notice that  g'(x) = g(x)(1 − g(x)).

Because  g(x) = 1/(1 + e⁻ˣ),

    g'(x) = e⁻ˣ / (1 + e⁻ˣ)²
          = (1 + e⁻ˣ − 1) / (1 + e⁻ˣ)²
          = 1/(1 + e⁻ˣ) − 1/(1 + e⁻ˣ)²
          = 1/(1 + e⁻ˣ) · ( 1 − 1/(1 + e⁻ˣ) )
          = g(x)(1 − g(x))

Now,  Out(x) = g( Σ_k wk xk ),  and

    Ε = Σ_i ( yi − g( Σ_k wk xik ) )²

    ∂Ε/∂wj = Σ_i 2 ( yi − g(Σ_k wk xik) ) · ( − ∂/∂wj g(Σ_k wk xik) )
           = Σ_i −2 ( yi − g(Σ_k wk xik) ) · g'(Σ_k wk xik) · ∂/∂wj Σ_k wk xik
           = Σ_i −2 δi g(net_i)(1 − g(net_i)) xij

where  net_i = Σ_k wk xik,   gi = g(net_i),   δi = yi − gi.

So the sigmoid perceptron update rule is

    wj ← wj + η Σ_{i=1..R} δi gi (1 − gi) xij

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 47

• … convergence is guaranteed
• … underconstrained problems
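A sketch (not from the slides) of the sigmoid-perceptron update rule trained on a toy 0/1 classification problem; the data, learning rate, and epoch count are illustrative choices:

import numpy as np

rng = np.random.default_rng(5)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])   # constant input + 2 features
y = (X @ np.array([-0.3, 2.0, -1.0]) > 0).astype(float)         # targets are 0 or 1

def g(h):                      # the sigmoid squashing function
    return 1.0 / (1.0 + np.exp(-h))

w = np.zeros(3)
eta = 0.01
for _ in range(5000):
    out = g(X @ w)                               # g_i = g(net_i)
    delta = y - out                              # delta_i = y_i - g_i
    w += eta * X.T @ (delta * out * (1 - out))   # w_j += eta * sum_i delta_i g_i (1-g_i) x_ij
print("weights:", np.round(w, 2))
print("training accuracy:", np.mean((g(X @ w) > 0.5) == (y > 0.5)))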

Perceptrons and Boolean Functions

If inputs are all 0's and 1's and outputs are all 0's and 1's…

• Can learn the function x1 ∨ x2.
• Can learn any conjunction of literals, e.g. x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5
    QUESTION: WHY?
• Can learn any disjunction of literals, e.g. x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5
• Can learn the majority function
    f(x1, x2, … xn) = 1 if n/2 or more of the xi's are 1
                      0 if fewer than n/2 of the xi's are 1
• The exclusive-or function f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ~x2) ∨ (~x1 ∧ x2): can a perceptron learn this?

Multilayer Networks

The class of functions representable by perceptrons is limited:

    Out(x) = g( wᵀx ) = g( Σ_j wj xj )

Use a wider representation!

    Out(x) = g( Σ_j Wj g( Σ_k wjk xk ) )

This is a nonlinear function
of a linear combination
of nonlinear functions
of linear combinations of inputs.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 51

NINPUTS = 2    NHIDDEN = 3

    v1 = g( Σ_{k=1..NINS} w1k xk )
    v2 = g( Σ_{k=1..NINS} w2k xk )
    v3 = g( Σ_{k=1..NINS} w3k xk )

    Out = g( Σ_{k=1..NHID} Wk vk )

[Diagram: inputs x1, x2 feed hidden units v1, v2, v3 through weights w11 … w32; the hidden units feed the output through weights w1, w2, w3]

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 52
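A minimal forward-pass sketch (not from the slides) of the 2-input, 3-hidden-unit, 1-output net above; the weight values are arbitrary illustrative numbers:

import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))        # the sigmoid squashing function

w = np.array([[ 0.5, -1.0],                # w[j, k]: input k -> hidden unit j
              [ 1.5,  0.3],
              [-0.7,  2.0]])
W = np.array([1.0, -2.0, 0.5])             # W[j]: hidden unit j -> output

def out(x):
    v = g(w @ x)                           # v_j = g(sum_k w_jk * x_k)
    return g(W @ v)                        # Out = g(sum_j W_j * v_j)

print(out(np.array([1.0, 0.5])))           # a single prediction in (0, 1)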

OTHER NEURAL NETS

[Diagram: a net with inputs 1, x1, x2, x3]  2-Hidden layers + Constant Term

[Diagram: inputs x1, x2 with "JUMP" CONNECTIONS straight to the output]

    Out = g( Σ_{k=1..NINS} w0k xk  +  Σ_{k=1..NHID} Wk vk )

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 53

Backpropagation

    Out(x) = g( Σ_j Wj g( Σ_k wjk xk ) )

Find a set of weights  {Wj}, {wjk}  to minimize

    Σ_i ( yi − Out(xi) )²

by gradient descent.

That's it!
That's the backpropagation algorithm.
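A bare-bones sketch (not from the slides) of backpropagation for the 1-hidden-layer net above, applied to the XOR problem, which no single perceptron can represent; the layer sizes, data encoding, learning rate, and epoch count are illustrative choices:

import numpy as np

rng = np.random.default_rng(7)

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

# XOR data, with a constant 1 appended to each input (the "constant term" trick).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
Y = np.array([0.0, 1.0, 1.0, 0.0])

n_hid = 4
w = rng.normal(scale=0.5, size=(n_hid, 3))       # input  -> hidden weights
W = rng.normal(scale=0.5, size=n_hid + 1)        # hidden -> output weights (incl. a bias unit)

def forward(x):
    v = np.concatenate(([1.0], g(w @ x)))        # hidden activations, with a constant unit
    return v, g(W @ v)

eta = 0.5
for _ in range(20000):
    for x, y in zip(X, Y):
        v, out = forward(x)
        d_out = (y - out) * out * (1 - out)        # backprop through the output sigmoid
        d_v = d_out * W[1:] * v[1:] * (1 - v[1:])  # backprop through the hidden sigmoids
        W += eta * d_out * v                       # gradient-descent weight updates
        w += eta * np.outer(d_v, x)

print("outputs:", np.round([forward(x)[1] for x in X], 2))
# With luck this approaches [0, 1, 1, 0]; as the next slide notes, convergence to a
# global minimum is not guaranteed, and a bad initialization can get stuck.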

Backpropagation Convergence

Convergence to a global minimum is not guaranteed.
• In practice, this is not a problem, apparently.
Tweaking to find the right number of hidden units, or a useful learning rate η, is more hassle, apparently.

Write down the Gradient Descent Rule: it turns out to be easier and computationally efficient to use lots of local variables with names like hj, ok, vj, neti, etc…

Choosing the learning rate η
• This is a subtle art.
• Too small: can take days instead of minutes to converge.
• Too large: diverges (MSE gets larger and larger while the weights increase and usually oscillate).
• Sometimes the "just right" value is hard to find.

Learning-rate problems

[Figure illustrating learning-rate problems. From Hertz, Krogh, and Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley, 1994.]

Momentum

Don't just change weights according to the current datapoint.
Re-use changes from earlier iterations.

Let ∆w(t) = the weight changes at time t.
Let  −η ∂Ε/∂w  be the change we would make with regular gradient descent.

Instead we use

    ∆w(t+1) = −η ∂Ε/∂w + α ∆w(t)        (α is the momentum parameter)
    w(t+1) = w(t) + ∆w(t+1)

Momentum damps oscillations.
A hack? Well, maybe.
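A small sketch (not from the slides) of gradient descent with momentum on a toy ill-conditioned quadratic, where plain gradient descent tends to oscillate; the function, η, and α are illustrative choices:

import numpy as np

def grad(w):
    # gradient of E(w) = 0.5 * (10*w[0]^2 + w[1]^2)
    return np.array([10.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
dw = np.zeros(2)            # delta_w(t), the previous weight change
eta, alpha = 0.1, 0.7       # learning rate and momentum parameter

for _ in range(100):
    dw = -eta * grad(w) + alpha * dw    # delta_w(t+1) = -eta*dE/dw + alpha*delta_w(t)
    w = w + dw                          # w(t+1) = w(t) + delta_w(t+1)
print("final w:", np.round(w, 6))       # heads toward the minimum at [0, 0]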

Momentum illustration

Newton's method

    E(w + h) = E(w) + hᵀ ∂E/∂w + ½ hᵀ (∂²E/∂w²) h + O(|h|³)

If  y = c + bᵀx − ½ xᵀA x  and A is SPD, then
    x_opt = A⁻¹b  is the value of x that maximizes y.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 60

Improving Simple Gradient Descent

Newton's method

    E(w + h) = E(w) + hᵀ ∂E/∂w + ½ hᵀ (∂²E/∂w²) h + O(|h|³)

This suggests the update

    w ← w − (∂²E/∂w²)⁻¹ ∂E/∂w

This should send us directly to the global minimum if the function is truly quadratic.
And it might get us close if it's locally quadraticish.

But (and it's a big but): that second derivative matrix can be expensive and fiddly to compute. And if we're not already in the quadratic bowl, we'll go nuts.
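A one-variable Newton's-method sketch (not from the slides) on a non-quadratic toy function; the function and starting point are illustrative, and starting where d²E/dw² is negative would illustrate the "we'll go nuts" caution above:

import numpy as np

def dE(w):                      # first derivative of E(w) = w^4 - 3*w^2 + w
    return 4 * w**3 - 6 * w + 1

def d2E(w):                     # second derivative
    return 12 * w**2 - 6

w = 2.0                         # start reasonably close to a local minimum
for _ in range(20):
    w = w - dE(w) / d2E(w)      # w <- w - (d2E/dw2)^{-1} * dE/dw
print(f"Newton's method lands at w = {w:.6f}, where dE/dw = {dE(w):.2e}")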

Improving Simple Gradient Descent

Conjugate Gradient

Another method which attempts to exploit the "local quadratic bowl" assumption, but does so while only needing to use ∂E/∂w and not ∂²E/∂w².

And it remains reasonably well behaved even when the quadratic bowl assumption is violated.

It's complicated, outside our scope, but it often works well. More details in Numerical Recipes in C.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 63

BEST GENERALIZATION

Intuitively, you want to use the smallest, simplest net that seems to fit the data.

2. Bayesian Methods Get it Right
3. Statistical Analysis explains what's going on
4. Cross-validation

(Discussed in the next lecture.)

What You Should Know

• How to implement multivariate least-squares linear regression.
• Derivation of least squares as the max. likelihood estimator of linear coefficients.
• The general gradient descent rule.
• Perceptrons
    → Linear output, least squares
    → Sigmoid output, least squares
• Multilayer nets
    → The idea behind back prop
    → Awareness of better minimization methods

APPLICATIONS

To Discuss:
• …(classifiers and regressors) be useful for?
• …nonlinear regression?
• Time series with recurrent nets
• Unsupervised learning (clustering, principal components, and non-linear versions thereof)
• Combinatorial optimization with Hopfield nets, Boltzmann Machines
• Evaluation function learning (in reinforcement learning)
