
Classification with Neural Networks

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Linear Regression

DATASET
inputs     outputs
x1 = 1     y1 = 1
x2 = 3     y2 = 2.2
x3 = 2     y3 = 2
x4 = 1.5   y4 = 1.9
x5 = 4     y5 = 3.1

[figure: the five datapoints plotted with a fitted line of slope w]

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 2


1-parameter linear regression

Assume that the data is formed by

yi = wxi + noisei

where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²

So y has a normal distribution with:
• mean wx
• variance σ²


P(y | w, x) = Normal(mean wx, var σ²)

We have a set of datapoints (x1,y1), (x2,y2), …, (xn,yn) which are EVIDENCE about w.

We want to infer w from the data:

P(w | x1, x2, x3, …, xn, y1, y2, …, yn)

• You can use BAYES rule to work out a posterior distribution for w given the data.
• Or you could do Maximum Likelihood Estimation.

Maximum likelihood estimation of w

Asks the question: "For which value of w is this data most likely to have happened?"

⇔ For what w is P(y1, y2, …, yn | x1, x2, …, xn, w) maximized?

⇔ For what w is ∏i=1..n P(yi | w, xi) maximized?

⇔ For what w is ∏i=1..n exp( −½ ((yi − wxi)/σ)² ) maximized?

⇔ For what w is ∑i=1..n −½ ((yi − wxi)/σ)² maximized?

⇔ For what w is ∑i=1..n (yi − wxi)² minimized?

Linear Regression

The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:

E(w) = ∑i ( yi − wxi )²
     = (∑ yi²) − (2 ∑ xiyi) w + (∑ xi²) w²

[figure: the error E(w) plotted against w, a parabola]

We want to minimize a quadratic function of w.

Linear Regression

Easy to show the sum of squares is minimized when

w = ∑ xiyi / ∑ xi²

The maximum likelihood model is Out(x) = wx. We can use it for prediction.

Note: In Bayesian stats you’d have ended up with a prob dist of w [figure: p(w) plotted against w], and predictions would have given a prob dist of expected output. Max likelihood can give some kinds of confidence too.
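As a concrete illustration (the slides themselves contain no code), here is a minimal Python sketch of the closed form w = ∑ xiyi / ∑ xi², applied to the example dataset from the start of the lecture:

```python
# Fit the one-parameter model Out(x) = w*x by maximum likelihood:
# w = sum(x_i * y_i) / sum(x_i^2). Data: the example dataset above.

xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def out(x):
    # Predict with the maximum-likelihood model.
    return w * x
```

Here w comes out to about 0.83, so out(2) predicts roughly 1.67.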

Multivariate Regression

What if the inputs are vectors?

[figure: a 2-d input example, datapoints in the (x1, x2) plane each labeled with its output value: 3, 4, 5, 6, 8, 10]

Dataset has form

x1  y1
x2  y2
x3  y3
 :   :
xR  yR


Multivariate Regression

Write matrix X and Y thus:

    [ ..x1.. ]   [ x11 x12 ... x1m ]       [ y1 ]
X = [ ..x2.. ] = [ x21 x22 ... x2m ]   Y = [ y2 ]
    [   :    ]   [  :           :  ]       [  : ]
    [ ..xR.. ]   [ xR1 xR2 ... xRm ]       [ yR ]

(there are R datapoints; each input has m components)

The linear regression model assumes a vector w such that

Out(x) = wᵀx = w1x[1] + w2x[2] + … + wmx[m]

The max. likelihood w is w = (XᵀX)⁻¹(XᵀY)

IMPORTANT EXERCISE: PROVE IT!!!!!
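A quick numerical sketch of w = (XᵀX)⁻¹(XᵀY), assuming NumPy is available; the dataset here is a made-up illustrative one, not from the slides:

```python
# Multivariate least squares: solve the normal equations (X^T X) w = X^T Y.
# Rows of X are the R datapoints; columns are the m input components.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
Y = X @ np.array([1.0, 2.0])            # targets generated by a known w

w = np.linalg.solve(X.T @ X, X.T @ Y)   # recovers [1.0, 2.0]
```

Solving the normal equations with `solve` rather than explicitly forming the inverse is the standard, numerically safer route.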


Multivariate Regression (con’t)

XᵀX is an m × m matrix: i,j’th elt is ∑k=1..R xki xkj
XᵀY is an m-element vector: i’th elt is ∑k=1..R xki yk

What about a constant term?

We may expect linear data that does not go through the origin. Statisticians and Neural Net Folks all agree on a simple obvious hack.


The constant term

• The trick is to create a fake input “X0” that always takes the value 1.

Before:          After:
X1 X2 Y          X0 X1 X2 Y
2  4  16         1  2  4  16
3  4  17         1  3  4  17
5  5  20         1  5  5  20

Y = w1X1 + w2X2          Y = w0X0 + w1X1 + w2X2
…has to be a poor model     = w0 + w1X1 + w2X2
                         …has a fine constant term

In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
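The trick is easy to mechanize: prepend a column of 1’s and reuse the ordinary least-squares machinery. A Python sketch using the table above (NumPy assumed):

```python
# Append X0 = 1 to every input so the fit acquires a constant term w0.
import numpy as np

X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

X0 = np.hstack([np.ones((X.shape[0], 1)), X])   # the fake "X0" column
w0, w1, w2 = np.linalg.solve(X0.T @ X0, X0.T @ Y)
```

By inspection (and from the solve) the MLE here is w0 = 10, w1 = 1, w2 = 1: each row satisfies Y = 10 + X1 + X2 exactly.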

Linear Regression with varying noise

• Suppose you know the variance of the noise that was added to each datapoint.

xi   yi   σi²
½    ½    4
1    1    1
2    1    ¼
2    3    4

[figure: the four datapoints plotted with error bars of size σ = 2, 1, ½, 2]

Assume yi ~ N(wxi, σi²).  What’s the MLE estimate of w?


MLE estimation with varying noise

argmax_w log p(y1, y2, …, yR | x1, x2, …, xR, σ1², σ2², …, σR², w)

= argmin_w ∑i=1..R ( yi − wxi )² / σi²        (assuming i.i.d. and then plugging in the equation for a Gaussian and simplifying)

= the w such that ∑i=1..R xi ( yi − wxi ) / σi² = 0        (setting dLL/dw equal to zero)

= ( ∑i=1..R xiyi/σi² ) / ( ∑i=1..R xi²/σi² )        (trivial algebra)
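The final expression is easy to compute directly. A Python sketch on the four datapoints of the previous slide:

```python
# Weighted MLE for Out(x) = w*x with per-point noise variance:
# w = sum(x_i*y_i / s2_i) / sum(x_i^2 / s2_i)

xs = [0.5, 1.0, 2.0, 2.0]
ys = [0.5, 1.0, 1.0, 3.0]
s2 = [4.0, 1.0, 0.25, 4.0]   # known noise variances sigma_i^2

num = sum(x * y / v for x, y, v in zip(xs, ys, s2))
den = sum(x * x / v for x, v in zip(xs, s2))
w = num / den
```

The low-noise point (xi = 2, σi² = ¼) dominates the fit, exactly as the 1/σi² weighting intends.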

This is a weighted regression

• We are asking to minimize the weighted sum of squares:

argmin_w ∑i=1..R ( yi − wxi )² / σi²

[figure: the same datapoints and error bars over x = 0 … 3]

where the weight for the i’th datapoint is 1/σi².


Weighted Multivariate Regression

The max. likelihood w is w = (WXᵀWX)⁻¹(WXᵀWY), where:

(WXᵀWX) is an m × m matrix: i,j’th elt is ∑k=1..R xki xkj / σk²
(WXᵀWY) is an m-element vector: i’th elt is ∑k=1..R xki yk / σk²

Non-linear Regression

• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

xi   yi
½    ½
1    2.5
2    3
3    2

[figure: the four datapoints plotted over y = 0 … 3]

Assume yi ~ N(√(w + xi), σ²).  What’s the MLE estimate of w?


Non-linear MLE estimation

argmax_w log p(y1, y2, …, yR | x1, x2, …, xR, σ, w)

= argmin_w ∑i=1..R ( yi − √(w + xi) )²        (assuming i.i.d. and then plugging in the equation for a Gaussian and simplifying)

= the w such that ∑i=1..R ( yi − √(w + xi) ) / √(w + xi) = 0        (setting dLL/dw equal to zero)

We’re down the algebraic toilet. So guess what we do?

Common (but not only) approach: Numerical Solutions:
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg Marquardt
• Newton’s Method

Also, special purpose statistical-optimization-specific tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).

GRADIENT DESCENT

Suppose we have a scalar function f(w): ℜ → ℜ.
We want to find a local minimum. Assume our current weight is w.

GRADIENT DESCENT RULE:  w ← w − η ∂f(w)/∂w

η is called the LEARNING RATE: a small positive number, e.g. η = 0.05 (recall Andrew’s favorite default value for anything).

QUESTION: Justify the Gradient Descent Rule.
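The rule is one line of code. A minimal Python sketch on a hypothetical function f(w) = (w − 3)², whose true minimum is at w = 3:

```python
# Gradient descent: w <- w - eta * df/dw, iterated.

def gradient_descent(df, w, eta=0.05, steps=500):
    for _ in range(steps):
        w = w - eta * df(w)
    return w

# f(w) = (w - 3)^2, so df/dw = 2*(w - 3)
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
```

With η = 0.05 each step multiplies the distance to the minimum by 0.9, so w_min converges to 3.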

Gradient Descent in “m” Dimensions

Given f(w): ℜᵐ → ℜ

∇f(w) = ( ∂f(w)/∂w1, …, ∂f(w)/∂wm )ᵀ  points in the direction of steepest ascent.
|∇f(w)| is the gradient in that direction.

GRADIENT DESCENT RULE:  w ← w − η ∇f(w)

Equivalently  wj ← wj − η ∂f(w)/∂wj   …where wj is the jth weight.

“just like a linear feedback system”


What’s all this got to do with Neural Nets, then, eh??

For supervised learning, neural nets are also models with vectors of w parameters in them. They are now called weights.

As before, we want to compute the weights to minimize the sum-of-squared residuals. Which turns out, under the “Gaussian i.i.d. noise” assumption, to be max. likelihood.

Instead of explicitly solving for max. likelihood weights, we use GRADIENT DESCENT to SEARCH for them.

“Why?” you ask, a querulous expression in your eyes.
“Aha!!” I reply: “We’ll see later.”

Linear Perceptrons

They are multivariate linear models:

Out(x) = wᵀx

And “training” consists of minimizing the sum-of-squared residuals by gradient descent:

E = ∑k ( Out(xk) − yk )² = ∑k ( wᵀxk − yk )²


Linear Perceptron Training Rule

E = ∑k=1..R ( yk − wᵀxk )²

Gradient descent tells us we should update w thusly if we wish to minimize E:

wj ← wj − η ∂E/∂wj

So what’s ∂E/∂wj?

∂E/∂wj = ∂/∂wj ∑k=1..R ( yk − wᵀxk )²

= ∑k=1..R 2 ( yk − wᵀxk ) ∂/∂wj ( yk − wᵀxk )

= −2 ∑k=1..R δk ∂/∂wj wᵀxk        …where δk = yk − wᵀxk

= −2 ∑k=1..R δk ∂/∂wj ∑i=1..m wi xki

= −2 ∑k=1..R δk xkj


Linear Perceptron Training Rule

E = ∑k=1..R ( yk − wᵀxk )²

Gradient descent tells us we should update w thusly if we wish to minimize E:

wj ← wj − η ∂E/∂wj      where ∂E/∂wj = −2 ∑k=1..R δk xkj

So the update is

wj ← wj + 2η ∑k=1..R δk xkj        …where δk = yk − wᵀxk

We frequently neglect the 2 (meaning we halve the learning rate).

The “Batch” perceptron algorithm

1) Randomly initialize weights w1, w2, …, wm
2) Get your dataset (append 1’s to the inputs if you don’t want to go through the origin)
3) for i = 1 to R:  δi := yi − wᵀxi
4) for j = 1 to m:  wj ← wj + η ∑i=1..R δi xij
5) if not converged, go back to 3
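The steps above can be sketched directly in Python (NumPy assumed; the dataset and iteration count are illustrative choices, not from the slides):

```python
# The batch perceptron loop: delta_i = y_i - w.x_i, then
# w_j <- w_j + eta * sum_i delta_i * x_ij, repeated.
import numpy as np

def batch_perceptron(X, y, eta=0.01, iters=2000):
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])      # step 1: random initialization
    for _ in range(iters):               # step 5: keep looping
        delta = y - X @ w                # step 3: residuals
        w = w + eta * (X.T @ delta)      # step 4: update every w_j
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = X @ np.array([1.0, 2.0])             # targets from a known w
w = batch_perceptron(X, y)               # converges toward [1.0, 2.0]
```

With this small η the iteration is a contraction, so the weights approach the same answer matrix inversion would give.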


A RULE KNOWN BY MANY NAMES

δi ← yi − wᵀxi
wj ← wj + η δi xij

The LMS rule
The Widrow Hoff rule
The delta rule
The adaline rule
Classical conditioning

If the data comes streaming in quickly, THEN don’t bother remembering old ones. Just keep using new ones:

observe (x, y)
δ ← y − wᵀx
∀j:  wj ← wj + η δ xj


Gradient Descent vs Matrix Inversion for Linear Perceptrons

GD Advantages (MI disadvantages):
• Biologically plausible
• With very very many attributes each iteration costs only O(mR). If fewer than m iterations are needed we’ve beaten Matrix Inversion
• More easily parallelizable (or implementable in wetware)?

GD Disadvantages (MI advantages):
• It’s moronic
• It’s essentially a slow implementation of a way to build the XᵀX matrix and then solve a set of linear equations
• If m is small it’s especially outrageous. If m is large then the direct matrix inversion method gets fiddly but not impossible if you want to be efficient
• Hard to choose a good learning rate
• Matrix inversion takes predictable time. You can’t be sure when gradient descent will stop

But we’ll soon see that GD has an important extra trick up its sleeve.

Perceptrons for Classification

What if all outputs are 0’s or 1’s?

[figure: example datasets with a linear fit (blue = Out(x)) and the resulting classification (green)]

We can do a linear fit. Our prediction is

0 if out(x) ≤ ½
1 if out(x) > ½

WHAT’S THE BIG PROBLEM WITH THIS???

Classification with Perceptrons I

Don’t minimize ∑ ( yi − wᵀxi )². Minimize the number of misclassifications instead:

∑ ( yi − Round(wᵀxi) )        [assume outputs are +1 & −1, not +1 & 0]

where Round(x) = −1 if x < 0, 1 if x ≥ 0.

The update rule becomes:
• if (xi, yi) correctly classed, don’t change
• if wrongly predicted as 1:  w ← w − xi
• if wrongly predicted as −1:  w ← w + xi

NOT OBVIOUS WHY THIS WORKS!!


Classification with Perceptrons II: Sigmoid Functions

[figure: a squashed, sigmoid-shaped fit to the data. This fit would classify much better, but it is not a least squares fit.]

SOLUTION: Instead of Out(x) = wᵀx we’ll use Out(x) = g(wᵀx), where g(x): ℜ → (0,1) is a squashing function.

The Sigmoid

g(h) = 1 / (1 + exp(−h))

Rotate the curve through 180° centered on (0, ½) and you get the same curve, i.e. g(h) = 1 − g(−h).


The Sigmoid

g(h) = 1 / (1 + exp(−h))

Now we choose w to minimize

∑i=1..R [ yi − Out(xi) ]² = ∑i=1..R [ yi − g(wᵀxi) ]²

Linear Perceptron Classification Regions

[figure: points labeled 0 and 1 in the (X1, X2) plane, separated by the decision boundary of Out(x) = g(w1x1 + w2x2 + w0)]

Which region of the above diagram is classified with +1, and which with 0??


Gradient descent with sigmoid on a perceptron

First, notice g′(x) = g(x)(1 − g(x)).

Because g(x) = 1/(1 + e⁻ˣ), so

g′(x) = e⁻ˣ / (1 + e⁻ˣ)² = ( 1/(1 + e⁻ˣ) ) · ( 1 − 1/(1 + e⁻ˣ) ) = g(x)(1 − g(x))

Now, with Out(x) = g( ∑k wk xk ):

E = ∑i ( yi − g( ∑k wk xik ) )²

∂E/∂wj = ∑i 2 ( yi − g( ∑k wk xik ) ) · ( −∂/∂wj g( ∑k wk xik ) )

= −2 ∑i ( yi − g( ∑k wk xik ) ) g′( ∑k wk xik ) ∂/∂wj ∑k wk xik

= −2 ∑i δi g(neti)(1 − g(neti)) xij

…where neti = ∑k wk xik,  gi = g(neti),  δi = yi − gi.

The sigmoid perceptron update rule:

wj ← wj + η ∑i=1..R δi gi (1 − gi) xij
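Both pieces of this slide, the identity g′ = g(1 − g) and the resulting update rule, can be checked numerically. A Python sketch (the AND-style dataset and the hyperparameters are illustrative choices, not from the slides):

```python
# Sigmoid perceptron trained with
#   w_j <- w_j + eta * sum_i delta_i * g_i * (1 - g_i) * x_ij
import math

def g(h):
    return 1.0 / (1.0 + math.exp(-h))

def train(X, y, eta=0.5, iters=10000):
    w = [0.0] * len(X[0])
    for _ in range(iters):
        gi = [g(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]  # g_i
        delta = [yi - gv for yi, gv in zip(y, gi)]                   # delta_i
        w = [w[j] + eta * sum(d * gv * (1 - gv) * xi[j]
                              for d, gv, xi in zip(delta, gi, X))
             for j in range(len(w))]
    return w

# Leading 1 in each input is the constant-term trick; targets are AND.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [0, 0, 0, 1]
w = train(X, y)
```

Since AND is linearly separable, the trained weights classify all four points correctly once Out(x) is thresholded at ½.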


Perceptrons and Boolean Functions

If inputs are all 0’s and 1’s and outputs are all 0’s and 1’s…

• Can learn the function x1 ∨ x2
• Can learn any conjunction of literals, e.g. x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5
  QUESTION: WHY?
• Can learn any disjunction of literals, e.g. x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5
• Can learn the majority function:
  f(x1, x2, …, xn) = 1 if n/2 or more of the xi’s are 1; 0 if less than n/2 of the xi’s are 1
• What about the exclusive or function?
  f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ~x2) ∨ (~x1 ∧ x2)


Multilayer Networks

The class of functions representable by perceptrons is limited:

Out(x) = g( wᵀx ) = g( ∑j wj xj )

Use a wider representation!

Out(x) = g( ∑j Wj g( ∑k wjk xk ) )

This is a nonlinear function (of a linear combination (of nonlinear functions (of linear combinations of inputs))).

A 1-hidden-layer net: NINPUTS = 2, NHIDDEN = 3

[figure: inputs x1, x2 feed hidden units v1, v2, v3 through weights w11 … w32; the hidden units feed the output through weights W1, W2, W3]

v1 = g( ∑k=1..NINS w1k xk )
v2 = g( ∑k=1..NINS w2k xk )
v3 = g( ∑k=1..NINS w3k xk )

Out = g( ∑k=1..NHID Wk vk )
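The forward pass of this 1-hidden-layer net is just two nested applications of g; the weight values in this Python sketch are arbitrary illustrative numbers:

```python
# v_j = g(sum_k w_jk * x_k);  Out = g(sum_j W_j * v_j)
import math

def g(h):
    return 1.0 / (1.0 + math.exp(-h))

def forward(x, w, W):
    v = [g(sum(wjk * xk for wjk, xk in zip(wj, x))) for wj in w]  # hidden layer
    return g(sum(Wj * vj for Wj, vj in zip(W, v)))                # output unit

w = [[0.5, -0.2],     # weights into v1
     [1.0, 1.0],      # weights into v2
     [-0.3, 0.8]]     # weights into v3
W = [1.0, -2.0, 0.5]  # weights into the output unit

y = forward([1.0, 2.0], w, W)
```

Because g squashes into (0,1), the output is always a number strictly between 0 and 1, whatever the weights.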


OTHER NEURAL NETS

[figure: a net with inputs 1, x1, x2, x3 and two hidden layers] 2-Hidden layers + Constant Term

“JUMP” CONNECTIONS

[figure: inputs x1, x2 connect both to the hidden layer and directly to the output]

Out = g( ∑k=1..NINS w0k xk + ∑k=1..NHID Wk vk )

Backpropagation

Out(x) = g( ∑j Wj g( ∑k wjk xk ) )

Find a set of weights {Wj}, {wjk} to minimize

∑i ( yi − Out(xi) )²

by gradient descent.

That’s it! That’s the backpropagation algorithm.


Backpropagation Convergence

Convergence to a global minimum is not guaranteed.
• In practice, this is not a problem, apparently.

Tweaking to find the right number of hidden units, or a useful learning rate η, is more hassle, apparently.

Implementing backprop: write down the Gradient Descent Rule. It turns out to be easier & computationally efficient to use lots of local variables with names like hj, ok, vj, neti etc…

Choosing the learning rate
• This is a subtle art.
• Too small: can take days instead of minutes to converge.
• Too large: diverges (MSE gets larger and larger while the weights increase and usually oscillate).
• Sometimes the “just right” value is hard to find.


Learning-rate problems

[figure: error curves for too-low, too-high, and well-chosen learning rates, from Hertz, Krogh & Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, 1994]

Momentum

Don’t just change weights according to the current datapoint. Re-use changes from earlier iterations.

Let ∆w(t) = weight changes at time t.
Let −η ∂Ε/∂w be the change we would make with regular gradient descent.

Instead we use

∆w(t+1) = −η ∂Ε/∂w + α ∆w(t)
w(t+1) = w(t) + ∆w(t+1)

α is the momentum parameter. Momentum damps oscillations.
A hack? Well, maybe.
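A Python sketch of the momentum update on a hypothetical one-dimensional bowl f(w) = (w − 3)², with α = 0.9 as an illustrative momentum parameter:

```python
# dw(t+1) = -eta * dE/dw + alpha * dw(t);  w(t+1) = w(t) + dw(t+1)

def momentum_descent(df, w, eta=0.05, alpha=0.9, steps=500):
    dw = 0.0
    for _ in range(steps):
        dw = -eta * df(w) + alpha * dw   # re-use the previous change
        w = w + dw
    return w

# f(w) = (w - 3)^2, so df/dw = 2*(w - 3)
w_min = momentum_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
```

The iterates overshoot and oscillate around w = 3 before settling, which is exactly the damped-oscillation behavior described above.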


Momentum illustration

Newton’s method

E(w + h) = E(w) + hᵀ ∂E/∂w + ½ hᵀ (∂²E/∂w²) h + O(|h|³)

Recall: if y = c + bᵀx − ½ xᵀAx, and A is SPD, then x_opt = A⁻¹b is the value of x that maximizes y.


Improving Simple Gradient Descent: Newton’s method

E(w + h) = E(w) + hᵀ ∂E/∂w + ½ hᵀ (∂²E/∂w²) h + O(|h|³)

Newton’s method update:

w ← w − (∂²E/∂w²)⁻¹ ∂E/∂w

This should send us directly to the global minimum if the function is truly quadratic. And it might get us close if it’s locally quadraticish.

BUT (and it’s a big but)… that second derivative matrix can be expensive and fiddly to compute. If we’re not already in the quadratic bowl, we’ll go nuts.

Improving Simple Gradient Descent: Conjugate Gradient

Another method which attempts to exploit the “local quadratic bowl” assumption, but does so while only needing to use ∂E/∂w and not ∂²E/∂w², and without going badly wrong when the quadratic bowl assumption is violated.

It’s complicated, outside our scope, but it often works well. More details in Numerical Recipes in C.

BEST GENERALIZATION

Intuitively, you want to use the smallest, simplest net that seems to fit the data.

How to formalize this intuition:
1. Don’t. Just use intuition
2. Bayesian Methods Get it Right
3. Statistical Analysis explains what’s going on
4. Cross-validation

Discussed in the next lecture.


What You Should Know

• How to implement multivariate least-squares linear regression.
• Derivation of least squares as max. likelihood estimator of linear coefficients
• The general gradient descent rule
• Perceptrons
   → Linear output, least squares
   → Sigmoid output, least squares
• Multilayer nets
   → The idea behind back prop
   → Awareness of better minimization methods


APPLICATIONS

To Discuss:
• What can linear regressors be useful for?
• What can nonlinear regression be useful for?
• Time series with recurrent nets
• Unsupervised learning (clustering, principal components and non-linear versions thereof)
• Combinatorial optimization with Hopfield nets, Boltzmann Machines
• Evaluation function learning (in reinforcement learning)

