
Update the weights of the multi-layer network using the backpropagation
algorithm. The transfer functions of the neurons are tansig (hyperbolic tangent
sigmoid) functions. The target outputs are y2* = 1 and y3* = 0.5, and the
learning rate is 0.5.
Show that with the updated weights there is a reduction in the total error.

[Network diagram: one hidden neuron feeding two output neurons; the labels show the forward-pass values a1 ≈ 0.9993, the net inputs ≈ 2.998 and ≈ 1.9987 to the output neurons, and their outputs ≈ 0.9950 and ≈ 0.9639.]
a1 = 0.999329299739067 (hidden neuron output)
O1 = 0.99503486 (1st output neuron output)
O2 = 0.96393 (2nd output neuron output)
d2 = 4.918e-05 (delta of 1st output neuron)
d3 = -0.032862 (delta of 2nd output neuron)
d1 = -8.7935e-05 (delta of hidden neuron)
w2n = 3.000024574895913
w3n = 1.983579968283086
w1n = 3.999956032466642
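
The numbers above can be checked with a short script. This is a minimal sketch, assuming a single input x = 1 and initial weights w1 = 4 (input to hidden), w2 = 3 and w3 = 2 (hidden to the two outputs); these assumed values reproduce the quantities listed above.

x = 1;  w1 = 4;  w2 = 3;  w3 = 2;           % assumed input and initial weights
y2t = 1;  y3t = 0.5;  eta = 0.5;            % targets and learning rate
a1 = tanh(w1*x);                            % hidden output (tansig)
O1 = tanh(w2*a1);  O2 = tanh(w3*a1);        % output-layer outputs
E  = 0.5*((y2t-O1)^2 + (y3t-O2)^2);         % total error before the update
d2 = (y2t-O1)*(1-O1^2);                     % output deltas (tansig derivative)
d3 = (y3t-O2)*(1-O2^2);
d1 = (1-a1^2)*(w2*d2 + w3*d3);              % hidden delta (backpropagated)
w2n = w2 + eta*d2*a1;                       % weight updates
w3n = w3 + eta*d3*a1;
w1n = w1 + eta*d1*x;
a1n = tanh(w1n*x);                          % forward pass with updated weights
En  = 0.5*((y2t-tanh(w2n*a1n))^2 + (y3t-tanh(w3n*a1n))^2);
fprintf('E = %.6f, E_new = %.6f\n', E, En)  % E_new < E: total error is reduced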
Input normalization (Preprocessing)

• Use small random initial weights to avoid saturation (a standardization sketch follows this list).
• The connection weights from the inputs to a hidden unit determine the orientation of its hyperplane; the bias determines the distance of the hyperplane from the origin.
• If the data are not centered at the origin, the hyperplane may fail to pass through the data cloud.
• If all the inputs have a small coefficient of variation, it is quite possible that all the initial hyperplanes will miss the data entirely.
• To avoid saturation: if the bias terms are all small random numbers, all the decision surfaces will pass close to the origin, so if the data are not centered at the origin, the decision surfaces may not pass through the data points.
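
As referenced above, a minimal standardization sketch (zero mean, unit standard deviation per input), assuming the matrix X holds one sample per row; the sample values are hypothetical.

X = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2];   % hypothetical raw input samples
mu = mean(X, 1);                             % per-feature means
sd = std(X, 0, 1);                           % per-feature standard deviations
Xn = (X - mu) ./ sd;                         % centered, unit-variance inputs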
Consider an MLP with two inputs (X and Y) and 100 hidden units.
With the inputs normalized, it will be easy to learn a hyperplane passing through any part of the data region at any angle.
Curse of Dimensionality

Example: the Fisher Iris problem is a 3-class pattern recognition problem.
Assume that we take only one feature (x1), say sepal length.
If we are forced to work with a limited quantity of data, then increasing the
dimensionality of the space rapidly leads to the point where the data are very
sparse, in which case they provide a very poor representation of the mapping.
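
A rough illustration of why this happens: if each feature is divided into, say, 10 bins, the number of cells that must be populated grows exponentially with the number of features, so a fixed-size data set spreads ever more thinly. (The bin count of 10 is an assumption for illustration.)

d = 1:4;                 % number of features
cells = 10.^d            % cells to populate with 10 bins per feature: 10, 100, 1000, 10000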
Idea of PCA
• Reduce the dimensionality of a data set consisting of a large number of interrelated variables by linearly transforming the original data set into a new set of (usually fewer) uncorrelated variables, the PCs, while retaining as much as possible of the variation present in the original data set.
• A PC that captures more of the variation has more impact on the observations and is therefore intuitively more informative.

Eigenvectors of a matrix corresponding to distinct eigenvalues are linearly independent of each other.
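
A quick numerical check of this statement, using an assumed example matrix:

A = [2 1; 0 3];          % example matrix with distinct eigenvalues 2 and 3
[v, d] = eig(A);         % eigenvectors in the columns of v
rank(v)                  % full rank (2), so the eigenvectors are linearly independent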
Mean, Standard Deviation and Variance
The standard deviation is, informally, the average distance from the mean of the data set to a point; the variance is the square of the standard deviation.
Covariance
The Covariance Matrix

Covariance is always measured between two dimensions. If we have a data set
with more than two dimensions, there is more than one covariance that can be
calculated. For example, from a 3-dimensional data set (dimensions x, y, z)
you could calculate cov(x, y), cov(x, z) and cov(y, z).
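
A sketch of the covariance matrix for a 3-dimensional data set, with assumed random data:

X = randn(10, 3);   % hypothetical 10 samples with 3 features x, y, z
C = cov(X)          % 3x3 symmetric matrix: variances on the diagonal,
                    % cov(x,y), cov(x,z), cov(y,z) off the diagonal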
Mean of x = 1.81, mean of y = 1.91
[Plot: original data set; axis ticks from -1.5 to 1.5]
v(:,2)' = 0.67787  0.735    (principal eigenvector)
datared = (v(:,2)'*[xadj yadj]')'    (projection onto the first principal component)
datared =
   -0.8280
    1.7776
   -0.9922
   -0.2742
   -1.6759
   -0.9130
    0.0991
    1.1446
    0.4381
    1.2239
(v'*[xadj yadj]')'    (projection onto both principal components)
Step 1: Get some data
Step 2: Subtract the mean
Step 3: Calculate the covariance matrix
Step 4: Calculate the eigenvectors and eigenvalues
of the covariance matrix
Step 5: Choosing components and forming a
feature vector
Step 6: Deriving the new data set
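
A minimal sketch of these six steps on the two-dimensional example; the ten (x, y) points are assumed here (they are consistent with the means 1.81 and 1.91 quoted above).

x = [2.5 0.5 2.2 1.9 3.1 2.3 2 1 1.5 1.1]';    % Step 1: assumed data
y = [2.4 0.7 2.9 2.2 3.0 2.7 1.6 1.1 1.6 0.9]';
xadj = x - mean(x);  yadj = y - mean(y);        % Step 2: subtract the mean
c = cov([xadj yadj]);                           % Step 3: covariance matrix
[v, d] = eig(c);                                % Step 4: eigenvectors (columns of v)
                                                %         and eigenvalues (diag of d, ascending)
datared   = (v(:,2)'*[xadj yadj]')';            % Steps 5-6: keep only the PC with the
                                                % largest eigenvalue and project onto it
datatrans = (v'*[xadj yadj]')';                 % ...or project onto both PCs
% Note: the sign of an eigenvector is arbitrary, so the projected values may
% differ in sign from the listing above.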

dataorig = (v*datatrans')' + [xmean*ones(10,1) ymean*ones(10,1)]   % v (not v') inverts the projection, since v is orthogonal
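
If only the first principal component is kept, the same formula gives an approximate (lossy) reconstruction; a hypothetical check using the variables defined in the sketch above:

xmean = mean(x);  ymean = mean(y);
approx = (v(:,2)*datared')' + [xmean*ones(10,1) ymean*ones(10,1)];   % approximation of the original points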
A principal component is a set of variable weights that defines a projection
capturing the maximum amount of variation in a dataset while being orthogonal
(and therefore uncorrelated) to the previous principal components of the same
dataset.

The blue lines represent two consecutive principal components. Note that they
are orthogonal (at right angles) to each other.
