
Classification Problem

load fisheriris
p = meas';                    % 4x150 matrix of inputs
a = zeros(3,50); a(1,:) = 1;  % one-hot targets for class 1
b = zeros(3,50); b(2,:) = 1;  % one-hot targets for class 2
c = zeros(3,50); c(3,:) = 1;  % one-hot targets for class 3
t = [a, b, c];                % 3x150 matrix of targets
net = newff(p, t, 5);         % newer equivalents: feedforwardnet, patternnet, fitnet
net = train(net, p, t);
y = sim(net, p);

y = sim(net, p(:,150))        % output for the last sample
y =
   -0.0000
    0.0743
    0.9258
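The outputs are continuous, so to read off a class label you can take the largest of the three outputs for each sample. A minimal sketch, assuming the net, p, t and y defined above:

[~, predicted] = max(y, [], 1);       % index of the largest output = predicted class
[~, actual]    = max(t, [], 1);       % true class from the one-hot targets
accuracy = mean(predicted == actual)  % fraction of samples classified correctly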
Regression Problem
Create a 2-layer MLP to model the function h(x) = 0.02 (x² − 13x + 48) cos(2x).

First you sketch the function:

x = 0:0.1:9;
d = 0.02*polyval([1 -13 48],x).*cos(2*x);

plot(x, d)
hold
Add noise and plot again:
x = 0:0.1:9;
d = 0.02*polyval([1 -13 48],x).*cos(2*x) + 0.2*rand(size(x));   % add uniform noise
plot(x, d)
p is the input and d is the target. Create the structure, train it, and simulate it:

p = x; t = d;
net = newff(p, t, 3);    % 2-layer MLP with 3 hidden neurons
net = train(net, p, t);
o = sim(net, p);
plot(p, o)
y = sim(net, 9)          % response at a single input value

Inspect the trained network and its parameters:

net           % the network object
net.iw{1,1}   % input-to-hidden weights
net.lw{2,1}   % hidden-to-output weights
net.b{1}      % hidden-layer biases
net.b{2}      % output bias
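To judge the fit, it helps to overlay the noisy targets, the network output, and the noise-free function on one plot. A minimal sketch, assuming x, d and o from above:

figure
plot(x, d, 'k.'); hold on                              % noisy training targets
plot(x, o, 'r-')                                       % network output
plot(x, 0.02*polyval([1 -13 48],x).*cos(2*x), 'b--')   % underlying function h(x)
legend('noisy data', 'network output', 'h(x)')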
How is generalization possible?

Necessary conditions for good generalization:

1. The function you are trying to learn must be, in some sense,
smooth. In other words, a small change in the inputs should,
most of the time, produce a small change in the outputs.
2. The training cases must be a sufficiently large and
representative subset of the set of all cases that you want to
generalize to.
11 data points obtained by sampling h(x) at equal intervals of
x and adding random noise. Solid curve shows output of a
linear network.
Here we use a network which has more free parameters than
the earlier one. This network is more flexible, and the
approximation improves.
Here we use a network which has many more free parameters than
the earlier one. This complex network gives a perfect fit to the
training data, but gives a poor representation of the function.

We want neither a too-simple model nor a too-complex model.
Complexity can be controlled by controlling the number of free
parameters.
Regularization
Add a penalty to the error function to control the model
complexity. Assume many free parameters. The total error then
becomes

E_total = E + ν Ω

where Ω is called a regularization term (regularizer). The
parameter ν controls the extent to which Ω influences the form
of the solution.
In the figure the function (a function with a lot of flexibility)
has large oscillations, and hence has regions of large curvature.
We might therefore choose a regularization function which is large
for functions with large values of the second derivative, such as

Ω = ½ ∫ (d²y/dx²)² dx
Weight Decay
• Weight decay adds a penalty term to the error function. The usual
penalty is the sum of squared weights times a decay constant. Weight
decay is a subset of regularization methods. The penalty term in weight
decay, by definition, penalizes large weights.
• The weight decay penalty term causes the weights to converge to
smaller absolute values than they otherwise would. Large weights can
hurt generalization in two different ways. Excessively large weights
leading to hidden units can cause the output function to be too rough,
possibly with near discontinuities. Excessively large weights leading to
output units can cause wild outputs far beyond the range of the data if
the output activation function is not bounded to the same range as the
data.

Ω = ½ Σᵢ wᵢ²

where the sum runs over all weights and biases.
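In MATLAB the penalty can be computed directly from a trained network. A minimal sketch, assuming the regression net, p and t from above and a hand-picked decay constant nu (both are assumptions, not toolbox defaults):

nu    = 0.01;                          % decay constant (chosen by hand)
w     = getwb(net);                    % all weights and biases as one vector
Omega = 0.5 * sum(w.^2);               % weight-decay penalty
E     = perform(net, t, sim(net, p));  % data error (MSE by default)
Etot  = E + nu * Omega                 % regularized total error

Depending on the toolbox version, a similar effect can also be obtained through the performance function's regularization setting (e.g. net.performParam.regularization).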


Adding noise to improve generalization
(Jitter)

Heuristically, we might expect that the noise will 'smear out' each data
point and make it difficult for the network to fit individual data points
precisely, and hence will reduce over-fitting.
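A minimal sketch of the idea, assuming x and d from the regression example and a hand-picked noise level sigma: train on several jittered copies of the inputs rather than the raw data.

sigma = 0.05;                                            % input noise level (chosen by hand)
pj = repmat(x, 1, 10) + sigma*randn(1, 10*numel(x));     % 10 jittered copies of the inputs
tj = repmat(d, 1, 10);                                   % targets repeated to match
netj = newff(pj, tj, 8);
netj = train(netj, pj, tj);
plot(x, sim(netj, x))                                    % fit evaluated on the original inputs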

Early Stopping
Stop training when the error on an independent validation set starts
to increase, even though the training error is still decreasing.
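A minimal sketch of early stopping using the toolbox's built-in data division, assuming p and t from the regression example and a toolbox version that supports the divideFcn/divideParam properties; the split ratios are arbitrary choices:

net = newff(p, t, 10);
net.divideFcn = 'dividerand';              % split the samples at random
net.divideParam.trainRatio = 0.70;
net.divideParam.valRatio   = 0.15;         % validation set triggers early stopping
net.divideParam.testRatio  = 0.15;
[net, tr] = train(net, p, t);              % training stops when the validation error rises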
Evaluation Methods
• Various networks are trained by minimization of an appropriate
error function defined with respect to a training data set. The
performance of the networks is then compared by evaluating the
error function using an independent validation set, and the
network having the smallest error with respect to the validation
set is selected.
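A minimal sketch of this selection procedure, assuming p and t from the regression example; the 80/20 split and the candidate hidden-layer sizes are arbitrary choices:

n     = numel(p);
idx   = randperm(n);
trIdx = idx(1:round(0.8*n));               % training samples
vIdx  = idx(round(0.8*n)+1:end);           % validation samples
sizes  = [2 4 8 16];                       % candidate hidden-layer sizes
valErr = zeros(size(sizes));
for k = 1:numel(sizes)
    netk = newff(p(:,trIdx), t(:,trIdx), sizes(k));
    netk = train(netk, p(:,trIdx), t(:,trIdx));
    yv   = sim(netk, p(:,vIdx));
    valErr(k) = mean((yv - t(:,vIdx)).^2); % error on the validation set
end
[~, best] = min(valErr);                   % network with the smallest validation error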
