
# Regression and Classification with Neural Networks

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University

www.cs.cmu.edu/~awm
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials .

## Linear Regression

DATASET

| inputs   | outputs  |
|----------|----------|
| x1 = 1   | y1 = 1   |
| x2 = 3   | y2 = 2.2 |
| x3 = 2   | y3 = 2   |
| x4 = 1.5 | y4 = 1.9 |
| x5 = 4   | y5 = 3.1 |

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.

Simplest case: Out(x) = wx for some unknown w.

Given the data, we can estimate w.

## 1-parameter linear regression

Assume that the data is formed by

    yi = w xi + noisei

where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²

So p(y|w,x) has a normal distribution with
• mean wx
• variance σ²

## Bayesian Linear Regression

P(y|w,x) = Normal(mean wx, var σ²)

We want to infer w from the data:
P(w | x1, x2, …, xn, y1, y2, …, yn)

• You can use BAYES rule to work out a posterior distribution for w given the data.
• Or you could do Maximum Likelihood Estimation.

Copyright © 2001, 2003, Andrew W. Moore. Neural Networks: Slide 4

## Maximum likelihood estimation of w

"For which value of w is this data most likely to have happened?"

⇔ For what w is $P(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n, w)$ maximized?

⇔ For what w is $\prod_{i=1}^{n} P(y_i \mid w, x_i)$ maximized?


For what w is $\prod_{i=1}^{n} P(y_i \mid w, x_i)$ maximized?

⇔ For what w is $\prod_{i=1}^{n} \exp\left(-\tfrac{1}{2}\left(\tfrac{y_i - w x_i}{\sigma}\right)^2\right)$ maximized?

⇔ For what w is $\sum_{i=1}^{n} -\tfrac{1}{2}\left(\tfrac{y_i - w x_i}{\sigma}\right)^2$ maximized?

⇔ For what w is $\sum_{i=1}^{n} (y_i - w x_i)^2$ minimized?

## Linear Regression

The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:

$$E(w) = \sum_i (y_i - w x_i)^2 = \sum_i y_i^2 - \left(2 \sum_i x_i y_i\right) w + \left(\sum_i x_i^2\right) w^2$$

We want to minimize a quadratic function of w. (Plotted against w, E(w) is a parabola.)

## Linear Regression

Easy to show the sum of squares is minimized when

$$w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$

The maximum likelihood model is Out(x) = wx.

We can use it for prediction.
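As a concrete check, here is a minimal sketch (not from the slides) that computes the closed-form MLE slope on the five-point dataset above:

```python
# 1-parameter MLE: w = sum(x_i * y_i) / sum(x_i^2), on the slides' dataset.
xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def out(x):
    """Maximum likelihood model Out(x) = w x."""
    return w * x

print(w)         # the MLE slope
print(out(2.0))  # prediction at x = 2
```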

## Linear Regression

(same slide as before, with a note)

Note: In Bayesian stats you'd have ended up with a prob dist of w (a curve p(w) over w), and predictions would have given a prob dist of expected output.

Often useful to know your confidence. Max likelihood can give some kinds of confidence too.


## Multivariate Regression

What if the inputs are vectors?

(figure: a 2-d input example; points in the (x1, x2) plane labeled with their output values 3, 4, 5, 6, 8, 10)

Dataset has form

    x1  y1
    x2  y2
    x3  y3
    ⋮   ⋮
    xR  yR

## Multivariate Regression

Write matrices X and Y thus:

$$X = \begin{bmatrix} \cdots x_1 \cdots \\ \cdots x_2 \cdots \\ \vdots \\ \cdots x_R \cdots \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ & \vdots & \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{bmatrix} \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{bmatrix}$$

(there are R datapoints; each input has m components)

The linear regression model assumes a vector w such that

    Out(x) = wᵀx = w1 x[1] + w2 x[2] + … + wm x[m]

The max. likelihood w is w = (XᵀX)⁻¹(XᵀY)

## Multivariate Regression

(same slide as before, with one addition)

IMPORTANT EXERCISE: PROVE IT !!!!!
(that the max. likelihood w is w = (XᵀX)⁻¹(XᵀY))
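A numeric spot-check rather than a proof: the sketch below (illustrative, using NumPy) generates noise-free data from a known weight vector and confirms that the formula recovers it:

```python
# Spot-check w = (X^T X)^{-1} (X^T Y): with noise-free outputs Y = X w_true,
# the normal equations should recover w_true up to floating point.
import numpy as np

rng = np.random.default_rng(0)
R, m = 20, 3                         # R datapoints, m input components
X = rng.normal(size=(R, m))
w_true = np.array([2.0, -1.0, 0.5])  # an arbitrary illustrative weight vector
Y = X @ w_true

# Solve (X^T X) w = X^T Y rather than forming the inverse explicitly.
w_mle = np.linalg.solve(X.T @ X, X.T @ Y)
print(w_mle)
```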

## Multivariate Regression (con't)

XᵀX is an m × m matrix: its (i,j)'th element is $\sum_{k=1}^{R} x_{ki} x_{kj}$

XᵀY is an m-element vector: its i'th element is $\sum_{k=1}^{R} x_{ki} y_k$

## What about a constant term?

We may expect linear data that does not go through the origin.

Statisticians and Neural Net Folks all agree on a simple obvious hack.

Can you guess??

## The constant term

• The trick is to create a fake input "X0" that always takes the value 1.

Before:

| X1 | X2 | Y  |
|----|----|----|
| 2  | 4  | 16 |
| 3  | 4  | 17 |
| 5  | 5  | 20 |

Y = w1X1 + w2X2 …has to be a poor model.

After:

| X0 | X1 | X2 | Y  |
|----|----|----|----|
| 1  | 2  | 4  | 16 |
| 1  | 3  | 4  | 17 |
| 1  | 5  | 5  | 20 |

Y = w0X0 + w1X1 + w2X2 = w0 + w1X1 + w2X2 …has a fine constant term. In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
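The trick is easy to verify on the slide's own table. This sketch (illustrative) prepends the X0 = 1 column and solves the normal equations; the three datapoints are fit exactly:

```python
# The fake-input trick: append X0 = 1 and fit Y = w0 X0 + w1 X1 + w2 X2
# with w = (X^T X)^{-1} (X^T Y), on the table from the slide.
import numpy as np

X = np.array([[2.0, 4.0], [3.0, 4.0], [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

X1 = np.hstack([np.ones((3, 1)), X])       # "After": prepend the constant input
w = np.linalg.solve(X1.T @ X1, X1.T @ Y)   # [w0, w1, w2]
print(w)
```

The answer you should also see by inspection: Y = 10 + X1 + X2.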

## Regression with varying noise

• Suppose you know the variance of the noise that was added to each datapoint:

| xi | yi | σi² |
|----|----|-----|
| ½  | ½  | 4   |
| 1  | 1  | 1   |
| 2  | 1  | ¼   |
| 2  | 3  | 4   |
| 3  | 2  | ¼   |

(figure: the five datapoints plotted for x = 0…3, y = 0…3, each marked with its noise level σ)

Assume $y_i \sim N(w x_i, \sigma_i^2)$. What's the MLE of w?

## MLE estimation with varying noise

$$\underset{w}{\operatorname{argmax}}\; \log p(y_1, y_2, \ldots, y_R \mid x_1, x_2, \ldots, x_R, \sigma_1^2, \sigma_2^2, \ldots, \sigma_R^2, w)$$

$$= \underset{w}{\operatorname{argmin}} \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}$$
(assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)

$$= \left( w \text{ such that } \sum_{i=1}^{R} \frac{x_i (y_i - w x_i)}{\sigma_i^2} = 0 \right)$$
(setting dLL/dw equal to zero)

$$= \frac{\sum_{i=1}^{R} x_i y_i / \sigma_i^2}{\sum_{i=1}^{R} x_i^2 / \sigma_i^2}$$
(trivial algebra)

## This is Weighted Regression

• We are asking to minimize the weighted sum of squares

$$\underset{w}{\operatorname{argmin}} \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}$$

(figure: the same five datapoints, each marked with its σ)

where the weight for the i'th datapoint is $\frac{1}{\sigma_i^2}$.
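Applying the closed form from the previous slide to the varying-noise table gives a one-liner (a minimal sketch, using the slides' numbers):

```python
# Weighted 1-parameter regression on the varying-noise table:
# w = sum(x_i y_i / sigma_i^2) / sum(x_i^2 / sigma_i^2).
xs  = [0.5, 1.0, 2.0, 2.0, 3.0]
ys  = [0.5, 1.0, 1.0, 3.0, 2.0]
var = [4.0, 1.0, 0.25, 4.0, 0.25]   # sigma_i^2 for each datapoint

num = sum(x * y / v for x, y, v in zip(xs, ys, var))
den = sum(x * x / v for x, v in zip(xs, var))
w = num / den
print(w)
```

Notice how the low-noise points (σ² = ¼) dominate the estimate.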

## Weighted Multivariate Regression

The max. likelihood w is w = (WXᵀWX)⁻¹(WXᵀWY)

(WXᵀWX) is an m × m matrix: its (i,j)'th element is $\sum_{k=1}^{R} \frac{x_{ki} x_{kj}}{\sigma_k^2}$

(WXᵀWY) is an m-element vector: its i'th element is $\sum_{k=1}^{R} \frac{x_{ki} y_k}{\sigma_k^2}$


## Non-linear Regression

• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

| xi | yi  |
|----|-----|
| ½  | ½   |
| 1  | 2.5 |
| 2  | 3   |
| 3  | 2   |
| 3  | 3   |

(figure: the five datapoints plotted for x = 0…3, y = 0…3)

Assume $y_i \sim N(\sqrt{w + x_i},\; \sigma^2)$. What's the MLE of w?

## Non-linear MLE estimation

$$\underset{w}{\operatorname{argmax}}\; \log p(y_1, y_2, \ldots, y_R \mid x_1, x_2, \ldots, x_R, \sigma, w)$$

$$= \underset{w}{\operatorname{argmin}} \sum_{i=1}^{R} \left( y_i - \sqrt{w + x_i} \right)^2$$
(assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)

$$= \left( w \text{ such that } \sum_{i=1}^{R} \frac{y_i - \sqrt{w + x_i}}{\sqrt{w + x_i}} = 0 \right)$$
(setting dLL/dw equal to zero)

## Non-linear MLE estimation

(same derivation as the previous slide)

We're down the algebraic toilet.

So guess what we do?

## Non-linear MLE estimation

Common (but not only) approach:

Numerical Solutions:
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg Marquardt
• Newton's Method

Also, special purpose statistical-optimization-specific tricks such as E.M. (See Gaussian Mixtures lecture for introduction.)

## Gradient Descent

Suppose we have a scalar function

    f(w): ℝ → ℝ

We want to find a local minimum. Assume our current weight is w.

GRADIENT DESCENT RULE: $w \leftarrow w - \eta \frac{\partial}{\partial w} f(w)$

η is called the LEARNING RATE. A small positive number, e.g. η = 0.05.

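The rule in action, as a minimal sketch on a function we can check by hand (the quadratic and its minimum are illustrative choices, not from the slides):

```python
# Gradient descent w <- w - eta * df/dw on f(w) = (w - 3)^2,
# whose minimum is at w = 3. eta = 0.05 is the slide's default.
def dfdw(w):
    return 2.0 * (w - 3.0)   # derivative of (w - 3)^2

eta = 0.05
w = 0.0
for _ in range(500):
    w = w - eta * dfdw(w)

print(w)   # close to 3
```

Each step multiplies the error (w − 3) by (1 − 2η), so convergence is geometric.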

## Gradient Descent

(same slide as before, with notes)

Recall Andrew's favorite default value for anything: η = 0.05.

QUESTION: Justify the Gradient Descent Rule.

## Gradient Descent in "m" Dimensions

Given $f(\mathbf{w}): \mathbb{R}^m \to \mathbb{R}$,

$$\nabla f(\mathbf{w}) = \begin{bmatrix} \frac{\partial}{\partial w_1} f(\mathbf{w}) \\ \vdots \\ \frac{\partial}{\partial w_m} f(\mathbf{w}) \end{bmatrix}$$

points in the direction of steepest ascent; $|\nabla f(\mathbf{w})|$ is the gradient in that direction.

GRADIENT DESCENT RULE: $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla f(\mathbf{w})$

Equivalently, $w_j \leftarrow w_j - \eta \frac{\partial}{\partial w_j} f(\mathbf{w})$ …where wj is the j'th weight.

"just like a linear feedback system"

## What's all this got to do with Neural Nets, then, eh??

For supervised learning, neural nets are also models with vectors of w parameters in them. They are now called weights.

As before, we want to compute the weights to minimize the sum-of-squared residuals.

Which turns out, under the "Gaussian i.i.d. noise" assumption, to be max. likelihood.

Instead of explicitly solving for the max. likelihood weights, we use GRADIENT DESCENT to SEARCH for them.

("Why?" you ask, a querulous expression in your eyes. "Aha!!" I reply: "We'll see later.")

## Linear Perceptrons

They are multivariate linear models:

    Out(x) = wᵀx

And "training" consists of minimizing the sum-of-squared residuals:

$$E = \sum_k \left( \mathrm{Out}(\mathbf{x}_k) - y_k \right)^2 = \sum_k \left( \mathbf{w}^\mathsf{T} \mathbf{x}_k - y_k \right)^2$$

QUESTION: Derive the perceptron training rule.

## Linear Perceptron Training Rule

$$E = \sum_{k=1}^{R} (y_k - \mathbf{w}^\mathsf{T} \mathbf{x}_k)^2$$

Gradient descent tells us we should update w thusly if we wish to minimize E:

$$w_j \leftarrow w_j - \eta \frac{\partial E}{\partial w_j}$$

So what's $\frac{\partial E}{\partial w_j}$?

## Linear Perceptron Training Rule

$$\frac{\partial E}{\partial w_j} = \sum_{k=1}^{R} \frac{\partial}{\partial w_j} (y_k - \mathbf{w}^\mathsf{T} \mathbf{x}_k)^2$$

$$= \sum_{k=1}^{R} 2 (y_k - \mathbf{w}^\mathsf{T} \mathbf{x}_k) \frac{\partial}{\partial w_j} (y_k - \mathbf{w}^\mathsf{T} \mathbf{x}_k)$$

$$= -2 \sum_{k=1}^{R} \delta_k \frac{\partial}{\partial w_j} \mathbf{w}^\mathsf{T} \mathbf{x}_k \qquad \text{…where } \delta_k = y_k - \mathbf{w}^\mathsf{T} \mathbf{x}_k$$

$$= -2 \sum_{k=1}^{R} \delta_k \frac{\partial}{\partial w_j} \sum_{i=1}^{m} w_i x_{ki} = -2 \sum_{k=1}^{R} \delta_k x_{kj}$$

## Linear Perceptron Training Rule

$$E = \sum_{k=1}^{R} (y_k - \mathbf{w}^\mathsf{T} \mathbf{x}_k)^2$$

Gradient descent tells us to update with $w_j \leftarrow w_j - \eta \frac{\partial E}{\partial w_j}$, and we showed $\frac{\partial E}{\partial w_j} = -2 \sum_{k=1}^{R} \delta_k x_{kj}$ …where $\delta_k = y_k - \mathbf{w}^\mathsf{T} \mathbf{x}_k$. So:

$$w_j \leftarrow w_j + 2 \eta \sum_{k=1}^{R} \delta_k x_{kj}$$

We frequently neglect the 2 (meaning we halve the learning rate).

## The "Batch" perceptron algorithm

1) Randomly initialize weights w1, w2, …, wm

2) Get your dataset (append 1's to the inputs if you don't want to go through the origin).

3) for i = 1 to R:  δi := yi − wᵀxi

4) for j = 1 to m:  $w_j \leftarrow w_j + \eta \sum_{i=1}^{R} \delta_i x_{ij}$

5) if $\sum_i \delta_i^2$ stops improving then stop. Else loop back to 3.
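The five steps above transcribe almost line for line into code. This sketch uses a fixed iteration count instead of the stopping test in step 5, and the example data and learning rate are illustrative:

```python
# The "batch" perceptron algorithm: repeatedly compute residuals delta_i
# and update every weight with the summed gradient.
import random

def batch_perceptron(xs, ys, eta=0.01, iters=2000):
    xs = [x + [1.0] for x in xs]                          # step 2: append 1's
    m = len(xs[0])
    random.seed(0)
    w = [random.uniform(-0.1, 0.1) for _ in range(m)]     # step 1
    for _ in range(iters):
        deltas = [y - sum(wj * xj for wj, xj in zip(w, x))  # step 3
                  for x, y in zip(xs, ys)]
        for j in range(m):                                  # step 4
            w[j] += eta * sum(d * x[j] for d, x in zip(deltas, xs))
    return w

# Noise-free data generated by y = 2 x1 - x2 + 5:
xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [2 * x1 - x2 + 5 for x1, x2 in xs]
w = batch_perceptron(xs, ys)
print(w)   # approximately [2, -1, 5]
```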

## A RULE KNOWN BY MANY NAMES

δi ← yi − wᵀxi
wj ← wj + η δi xij

• The LMS rule
• The Widrow Hoff rule
• The delta rule
• The adaline rule
• Classical conditioning

## If data is streaming in…

Input-output pairs (x, y) come streaming in very quickly. THEN don't bother remembering old ones. Just keep using new ones:

observe (x, y)
δ ← y − wᵀx
∀j: wj ← wj + η δ xj
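The online rule is even shorter in code. In this sketch the "stream" is simulated by cycling over a small noise-free dataset (illustrative, generated from y = 3x + 1 with a constant input appended):

```python
# Online (streaming) delta rule: observe one (x, y) pair at a time,
# compute delta = y - w.x, and nudge every weight by eta * delta * x_j.
import itertools

w = [0.0, 0.0]    # weights for [x, constant-input]
eta = 0.05
data = [([x, 1.0], 3.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

for x, y in itertools.islice(itertools.cycle(data), 5000):
    delta = y - sum(wj * xj for wj, xj in zip(w, x))
    for j in range(len(w)):
        w[j] += eta * delta * x[j]

print(w)   # approximately [3, 1]
```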

## Gradient Descent vs Matrix Inversion for Linear Perceptrons

GD Advantages (MI disadvantages):
• Biologically plausible
• With very very many attributes, each iteration costs only O(mR). If fewer than m iterations are needed, we've beaten matrix inversion
• More easily parallelizable (or implementable in wetware)?

GD Disadvantages (MI advantages):
• It's moronic
• It's essentially a slow implementation of a way to build the XᵀX matrix and then solve a set of linear equations
• If m is small it's especially outrageous. If m is large then the direct matrix inversion method gets fiddly but not impossible if you want to be efficient
• Hard to choose a good learning rate
• Matrix inversion takes predictable time. You can't be sure when gradient descent will converge


## Gradient Descent vs Matrix Inversion for Linear Perceptrons

(same slide as before, with a note)

But we'll see that GD has an important extra trick up its sleeve.

## Perceptrons for Classification

What if all outputs are 0's or 1's?

(figures: two example datasets with 0/1 outputs)

We can do a linear fit. Our prediction is

    0 if out(x) ≤ ½
    1 if out(x) > ½

WHAT'S THE BIG PROBLEM WITH THIS???

## Perceptrons for Classification

(same slide, now annotated: Blue = Out(x), Green = Classification)

## Classification with Perceptrons I

Don't minimize $\sum_i (y_i - \mathbf{w}^\mathsf{T} \mathbf{x}_i)^2$. Minimize the number of misclassifications instead: [Assume outputs are +1 & −1, not +1 & 0]

$$\sum_i \left( y_i - \mathrm{Round}(\mathbf{w}^\mathsf{T} \mathbf{x}_i) \right)^2$$

where Round(x) = −1 if x < 0, 1 if x ≥ 0.

NOTE: CUTE & NON OBVIOUS WHY THIS WORKS!!

The gradient descent rule can be changed to:

• if (xi, yi) correctly classed, don't change
• if wrongly predicted as 1:  w ← w − xi
• if wrongly predicted as −1: w ← w + xi
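The three-case rule above is the classic perceptron learning rule. A minimal sketch on a small, linearly separable dataset of our own invention (the points and labels are illustrative); on separable data the loop reaches zero mistakes:

```python
# Perceptron classification rule with +1/-1 labels:
# correctly classed -> no change; wrongly predicted +1 -> w <- w - x;
# wrongly predicted -1 -> w <- w + x.

def round_pm1(a):
    return -1 if a < 0 else 1

# Each x carries a trailing 1 as the constant input; labels y are +1 or -1.
data = [([2.0, 1.0, 1.0], 1), ([1.5, 2.0, 1.0], 1),
        ([-1.0, -0.5, 1.0], -1), ([-2.0, 1.0, 1.0], -1)]

w = [0.0, 0.0, 0.0]
for _ in range(100):                    # epochs; separable data halts early
    mistakes = 0
    for x, y in data:
        pred = round_pm1(sum(wj * xj for wj, xj in zip(w, x)))
        if pred == y:
            continue                    # correctly classed: don't change
        sign = 1 if y == 1 else -1      # add x for missed +1, subtract for missed -1
        w = [wj + sign * xj for wj, xj in zip(w, x)]
        mistakes += 1
    if mistakes == 0:
        break

errors = sum(round_pm1(sum(wj * xj for wj, xj in zip(w, x))) != y
             for x, y in data)
print(errors)   # 0
```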

## Classification with Perceptrons II: Sigmoid Functions

(figure: a least squares fit is useless here; a different, S-shaped fit would classify much better, but it is not a least squares fit)

SOLUTION:

We'll use Out(x) = g(wᵀx), where g(x): ℝ → (0,1) is a squashing function.

## The Sigmoid

$$g(h) = \frac{1}{1 + \exp(-h)}$$

Note that if you rotate this curve through 180° centered on (0, ½) you get the same curve.

i.e. g(h) = 1 − g(−h)

## The Sigmoid

$$g(h) = \frac{1}{1 + \exp(-h)}$$

Now we choose w to minimize

$$\sum_{i=1}^{R} \left[ y_i - \mathrm{Out}(\mathbf{x}_i) \right]^2 = \sum_{i=1}^{R} \left[ y_i - g(\mathbf{w}^\mathsf{T} \mathbf{x}_i) \right]^2$$

## Classification Regions

(figure: points labeled 0 and 1 in the (x1, x2) plane)

We'll use the model Out(x) = g(wᵀ(x, 1)) = g(w1x1 + w2x2 + w0).

Which region of the above diagram is classified with +1, and which with 0??

## Gradient descent with sigmoid on a perceptron

First, notice g′(x) = g(x)(1 − g(x)).

Because:

$$g(x) = \frac{1}{1 + e^{-x}} \quad\text{so}\quad g'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right) = g(x)(1 - g(x))$$

$$\mathrm{Out}(\mathbf{x}) = g\!\left( \sum_k w_k x_k \right)$$

$$E = \sum_i \left( y_i - g\!\left( \sum_k w_k x_{ik} \right) \right)^2$$

$$\frac{\partial E}{\partial w_j} = \sum_i 2 \left( y_i - g\!\left( \sum_k w_k x_{ik} \right) \right) \left( -\frac{\partial}{\partial w_j} g\!\left( \sum_k w_k x_{ik} \right) \right)$$

$$= \sum_i -2 \left( y_i - g\!\left( \sum_k w_k x_{ik} \right) \right) g'\!\left( \sum_k w_k x_{ik} \right) \frac{\partial}{\partial w_j} \sum_k w_k x_{ik}$$

$$= \sum_i -2\, \delta_i\, g(\mathrm{net}_i) (1 - g(\mathrm{net}_i))\, x_{ij}$$

where $\mathrm{net}_i = \sum_k w_k x_{ik}$, $g_i = g(\mathrm{net}_i)$ and $\delta_i = y_i - g_i$.

The sigmoid perceptron update rule:

$$w_j \leftarrow w_j + \eta \sum_{i=1}^{R} \delta_i\, g_i (1 - g_i)\, x_{ij}$$

• Even with sigmoid nonlinearity, correct convergence is guaranteed

• Stable behavior for overconstrained and underconstrained problems
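The sigmoid perceptron update rule derived above can be sketched directly. The tiny 1-d dataset (label 1 when x > 1, else 0, with a constant input) and the hyperparameters are illustrative choices, not from the slides:

```python
# Sigmoid perceptron training:
# w_j <- w_j + eta * sum_i delta_i * g_i * (1 - g_i) * x_ij,
# with g_i = g(w.x_i) and delta_i = y_i - g_i.
import math

def g(h):
    return 1.0 / (1.0 + math.exp(-h))

xs = [[-1.0, 1.0], [0.0, 1.0], [0.5, 1.0],
      [1.5, 1.0], [2.0, 1.0], [3.0, 1.0]]   # [x, constant-input]
ys = [0, 0, 0, 1, 1, 1]

w = [0.0, 0.0]
eta = 0.2
for _ in range(20000):
    gs = [g(sum(wj * xj for wj, xj in zip(w, x))) for x in xs]
    for j in range(len(w)):
        w[j] += eta * sum((y - gi) * gi * (1 - gi) * x[j]
                          for x, y, gi in zip(xs, ys, gs))

preds = [1 if g(sum(wj * xj for wj, xj in zip(w, x))) > 0.5 else 0 for x in xs]
print(preds)
```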

## Perceptrons and Boolean Functions

If inputs are all 0's and 1's and outputs are all 0's and 1's…

• Can learn the function x1 ∧ x2
• Can learn the function x1 ∨ x2
• Can learn any conjunction of literals, e.g. x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5

QUESTION: WHY?

## Perceptrons and Boolean Functions

• Can learn any disjunction of literals, e.g. x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5

• Can learn the majority function
  f(x1, x2, …, xn) = 1 if n/2 or more of the xi's are 1, and 0 if fewer than n/2 of the xi's are 1

• What about the exclusive or function?
  f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ~x2) ∨ (~x1 ∧ x2)

## Multilayer Networks

The class of functions representable by perceptrons is limited:

$$\mathrm{Out}(\mathbf{x}) = g(\mathbf{w}^\mathsf{T} \mathbf{x}) = g\!\left( \sum_j w_j x_j \right)$$

Use a wider representation!

$$\mathrm{Out}(\mathbf{x}) = g\!\left( \sum_j W_j\, g\!\left( \sum_k w_{jk} x_k \right) \right)$$

This is a nonlinear function of a linear combination of nonlinear functions of linear combinations of inputs.

## A 1-HIDDEN LAYER NET

NINPUTS = 2, NHIDDEN = 3

$$v_j = g\!\left( \sum_{k=1}^{N_{\mathrm{INS}}} w_{jk} x_k \right) \quad \text{for } j = 1, 2, 3$$

$$\mathrm{Out} = g\!\left( \sum_{k=1}^{N_{\mathrm{HID}}} W_k v_k \right)$$

(figure: x1 and x2 feed three hidden units via weights w11 … w32; the hidden outputs v1, v2, v3 feed the output unit via W1, W2, W3)
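The two formulas above are the whole forward pass. A minimal sketch with NINPUTS = 2 and NHIDDEN = 3; the weight values are arbitrary illustrative numbers:

```python
# Forward pass of the 1-hidden-layer net:
# v_j = g(sum_k w_jk x_k), then Out = g(sum_k W_k v_k).
import math

def g(h):
    return 1.0 / (1.0 + math.exp(-h))

def net_out(x, w_hidden, W_out):
    vs = [g(sum(wjk * xk for wjk, xk in zip(w_j, x))) for w_j in w_hidden]
    return g(sum(Wk * vk for Wk, vk in zip(W_out, vs)))

w_hidden = [[0.5, -0.3],    # w11, w12
            [1.0,  0.2],    # w21, w22
            [-0.7, 0.8]]    # w31, w32
W_out = [1.5, -2.0, 0.5]    # W1, W2, W3

y = net_out([1.0, 2.0], w_hidden, W_out)
print(y)   # a value in (0, 1), since g squashes
```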

## OTHER NEURAL NETS

(figure: a net with inputs 1, x1, x2, x3) 2-Hidden layers + Constant Term

"JUMP" CONNECTIONS:

$$\mathrm{Out} = g\!\left( \sum_{k=1}^{N_{\mathrm{INS}}} w_{0k} x_k + \sum_{k=1}^{N_{\mathrm{HID}}} W_k v_k \right)$$

## Backpropagation

$$\mathrm{Out}(\mathbf{x}) = g\!\left( \sum_j W_j\, g\!\left( \sum_k w_{jk} x_k \right) \right)$$

Find a set of weights {Wj}, {wjk} to minimize

$$\sum_i \left( y_i - \mathrm{Out}(\mathbf{x}_i) \right)^2$$

…by gradient descent.

That's it!
That's the backpropagation algorithm.
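Since backprop is just the chain rule applied to the sum-of-squares error, one honest way to check an implementation is against finite differences. A sketch for the 1-hidden-layer net; the data and weights are arbitrary illustrative values:

```python
# Analytic backprop gradients for E = sum_i (y_i - Out(x_i))^2,
# verified against a finite-difference estimate.
import math

def g(h):
    return 1.0 / (1.0 + math.exp(-h))

def forward(x, w, W):
    vs = [g(sum(wjk * xk for wjk, xk in zip(wj, x))) for wj in w]
    out_net = sum(Wk * vk for Wk, vk in zip(W, vs))
    return vs, g(out_net)

def gradients(xs, ys, w, W):
    dW = [0.0] * len(W)
    dw = [[0.0] * len(w[0]) for _ in w]
    for x, y in zip(xs, ys):
        vs, out = forward(x, w, W)
        d_out = -2.0 * (y - out) * out * (1 - out)   # dE/d(out_net)
        for j, vj in enumerate(vs):
            dW[j] += d_out * vj                      # dE/dW_j
            d_hid = d_out * W[j] * vj * (1 - vj)     # dE/d(net_j)
            for k, xk in enumerate(x):
                dw[j][k] += d_hid * xk               # dE/dw_jk
    return dw, dW

xs = [[0.0, 1.0], [1.0, 1.0]]
ys = [0.2, 0.9]
w = [[0.3, -0.1], [0.2, 0.4]]
W = [0.5, -0.6]
dw, dW = gradients(xs, ys, w, W)

def E(w, W):
    return sum((y - forward(x, w, W)[1]) ** 2 for x, y in zip(xs, ys))

# Finite-difference check on W[0]:
eps = 1e-6
num = (E(w, [W[0] + eps, W[1]]) - E(w, [W[0] - eps, W[1]])) / (2 * eps)
print(abs(num - dW[0]))   # tiny
```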

## Backpropagation Convergence

Convergence to a global minimum is not guaranteed.
• In practice, this is not a problem, apparently.

Tweaking to find the right number of hidden units, or a useful learning rate η, is more hassle, apparently.

IMPLEMENTING BACKPROP:
Differentiate the monster sum-square residual and write down the Gradient Descent Rule. It turns out to be easier & computationally efficient to use lots of local variables with names like hj, ok, vj, neti etc…

## Choosing the learning rate

• This is a subtle art.
• Too small: can take days instead of minutes to converge.
• Too large: diverges (MSE gets larger and larger while the weights increase and usually oscillate).
• Sometimes the "just right" value is hard to find.

## Learning-rate problems

(figures illustrating learning-rate problems, not reproduced here)

From J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation.

## Momentum

Don't just change weights according to the current datapoint. Re-use changes from earlier iterations.

Let Δw(t) = weight changes at time t. Let $-\eta \frac{\partial E}{\partial \mathbf{w}}$ be the change we would make with regular gradient descent. Instead we use

$$\Delta \mathbf{w}(t+1) = -\eta \frac{\partial E}{\partial \mathbf{w}} + \alpha\, \Delta \mathbf{w}(t)$$

$$\mathbf{w}(t+1) = \mathbf{w}(t) + \Delta \mathbf{w}(t+1)$$

where α is the momentum parameter.

Momentum damps oscillations. A hack? Well, maybe.
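The two update equations translate into a few lines. A sketch on a 1-d quadratic of our own choosing, run side by side with plain gradient descent (the function, η, α and step count are illustrative):

```python
# Momentum: dw(t+1) = -eta * dE/dw + alpha * dw(t); w(t+1) = w(t) + dw(t+1).
def dEdw(w):
    return 2.0 * (w - 2.0)   # derivative of E(w) = (w - 2)^2

eta, alpha, steps = 0.02, 0.9, 200

w_plain = 10.0               # plain gradient descent for comparison
for _ in range(steps):
    w_plain -= eta * dEdw(w_plain)

w_mom, dw = 10.0, 0.0        # gradient descent with momentum
for _ in range(steps):
    dw = -eta * dEdw(w_mom) + alpha * dw
    w_mom += dw

print(abs(w_plain - 2.0), abs(w_mom - 2.0))
```

With these settings both runs settle near the minimum at w = 2; the momentum run overshoots on the way in but its oscillations die out.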

## Momentum illustration

(figure not reproduced here)

## Newton's method

$$E(\mathbf{w} + \mathbf{h}) = E(\mathbf{w}) + \mathbf{h}^\mathsf{T} \frac{\partial E}{\partial \mathbf{w}} + \frac{1}{2} \mathbf{h}^\mathsf{T} \frac{\partial^2 E}{\partial \mathbf{w}^2} \mathbf{h} + O(|\mathbf{h}|^3)$$

If we neglect the O(h³) terms, this is a quadratic form.

If y = c + bᵀx − ½ xᵀAx, and if A is SPD, then xopt = A⁻¹b is the value of x that maximizes y.

## Newton's method

If we neglect the O(h³) terms in the Taylor expansion, E is a quadratic form, so jump straight to its minimum:

$$\mathbf{w} \leftarrow \mathbf{w} - \left( \frac{\partial^2 E}{\partial \mathbf{w}^2} \right)^{-1} \frac{\partial E}{\partial \mathbf{w}}$$

This should send us directly to the global minimum if the function is truly quadratic.

And it might get us close if it's locally quadraticish.
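The "directly to the minimum" claim is easy to see in one dimension, where the Hessian is just a number (the quadratic below is an illustrative choice):

```python
# One Newton step w <- w - (d2E/dw2)^{-1} * dE/dw on the quadratic
# E(w) = (w - 4)^2 + 7 lands exactly on the minimum, from any start.
def dE(w):
    return 2.0 * (w - 4.0)

def d2E(w):
    return 2.0

w = -13.0
w = w - dE(w) / d2E(w)   # a single Newton step
print(w)                 # 4.0, the global minimum
```

On a non-quadratic E the same step is iterated, and each step only approximates the minimizer.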


## Newton's method

(same update as before, with a caveat)

BUT (and it's a big but)…

That second derivative matrix can be expensive and fiddly to compute. And if we're not already near the minimum, where the quadratic approximation holds, the step can be a disaster.

## Conjugate Gradient

Another method which attempts to exploit the "local quadratic bowl" assumption, but does so while only needing to use $\frac{\partial E}{\partial \mathbf{w}}$ and not $\frac{\partial^2 E}{\partial \mathbf{w}^2}$.

It is also more stable than Newton's method if the local quadratic bowl assumption is violated.

It's complicated, outside our scope, but it often works well. More details in Numerical Recipes in C.

## BEST GENERALIZATION

Intuitively, you want to use the smallest, simplest net that seems to fit the data.

1. Don't. Just use intuition.
2. Bayesian Methods Get it Right.
3. Statistical Analysis explains what's going on.
4. Cross-validation. (Discussed in the next lecture.)

## What You Should Know

• How to implement multivariate least-squares linear regression.
• Derivation of least squares as the max. likelihood estimator of linear coefficients.
• The general gradient descent rule.

## What You Should Know

• Perceptrons
  → Linear output, least squares
  → Sigmoid output, least squares

• Multilayer nets
  → The idea behind back prop
  → Awareness of better minimization methods

## APPLICATIONS

To Discuss:

• What can neural nets (used as non-linear regressors) be useful for?

• What are the advantages of N. Nets for nonlinear regression?

## Other Uses of Neural Nets…

• Time series with recurrent nets
• Unsupervised learning (clustering, principal components, and non-linear versions thereof)
• Combinatorial optimization with Hopfield nets, Boltzmann Machines
• Evaluation function learning (in reinforcement learning)
