
Linear Method for Classification

Linear Regression
• A linear approach to model the relationship between a
  scalar response y (the dependent variable) and one or
  more predictor variables, a scalar x or a vector x (the
  independent variables)
• The response is modelled as a linear function of the
  input (one or more independent variables)
• The optimal coefficient vector w is given by

  \hat{w} = (X^T X)^{-1} X^T y

[Figure: response y plotted against input x, with the fitted regression line]
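A minimal NumPy sketch of this closed-form fit; the synthetic data and variable names below are placeholders, used purely for illustration:

```python
import numpy as np

# Least-squares fit via the normal equation w_hat = (X^T X)^(-1) X^T y.
# Synthetic one-predictor data, only to illustrate the computation.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)              # predictor (independent variable)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 50)     # noisy linear response (dependent variable)

X = np.column_stack([np.ones_like(x), x])    # augment with a column of ones for the bias w0
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y     # normal equation (np.linalg.lstsq is safer numerically)
print(w_hat)                                 # approximately [2.0, 3.0]
```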


Linear Method for Classification

• The boundary that separates the regions of the classes is
  linear
• The separating surface is a hyperplane
• A hyperplane that best fits the region of separation
  between the classes

[Figure: training examples of two classes in the (x1, x2) plane,
separated by a linear decision boundary]

• Discriminant function in 2-dimensional space, for
  x_n = [x_{n1}, x_{n2}]^T :

  f(x_n, w_1, w_2, w_0) = w_1 x_{n1} + w_2 x_{n2} + w_0

• The decision boundary is given by

  f(x_n, w_1, w_2, w_0) = w_1 x_{n1} + w_2 x_{n2} + w_0 = 0

  which can be rewritten as the straight line

  x_{n2} = -(w_1 / w_2) x_{n1} - (w_0 / w_2) = m x_{n1} + c

• Discriminant function in d-dimensional space:

  f(x_n, w) = \sum_{i=0}^{d} w_i x_{ni} = w^T \hat{x}_n

  where \hat{x}_n = [1, x_{n1}, ..., x_{nd}]^T and w = [w_0, w_1, ..., w_d]^T


Two Classes of Approaches for Linear Classification

1. Modeling a discriminating function:
   – For each class, a linear discriminant function f_i(x, w_i) is
     defined
   – Let C_1, C_2, ..., C_i, ..., C_M be the M classes
   – Let f_i(x, w_i) be the linear discriminant function for the ith
     class

     Class label for x = arg max_i f_i(x, w_i),   i = 1, 2, ..., M

   – Each discriminant function is defined independently of the
     other classes
   – Linear regression can be treated as a discriminant function:
     • Do the linear regression treating the dependent variable as
       an indicator variable
   – Logistic regression

Two Classes of Approaches for Linear Classification

2. Directly learn a discriminant function (hyperplane):
   – Classic method: the discriminant function between the
     classes is learnt

[Figure: two classes in the (x1, x2) plane with a learnt linear
decision boundary between them]

   – Perceptron (a linear discriminant function is learnt)
   – Support vector machine (SVM) (a linear discriminant
     function is learnt)
   – Neural networks (when the discriminant function is
     nonlinear)


Classification Using Linear Regression


• Given: training data D = \{x_n, y_n\}_{n=1}^{N}, x_n \in R^d and y_n \in R^M
  – x_n is the input vector (d independent variables)
  – There are M classes, represented by M indicator variables
  – y_n is the response vector (dependent variables), an
    M-dimensional binary vector, i.e. exactly one of the M values is 1
  – For N examples, X is the data matrix of size N x (d+1) and Y
    is the response matrix of size N x M
• Linear regression on the response vector:

  \hat{W} = (X^T X)^{-1} X^T Y

  – \hat{W} is of size (d+1) x M

    \hat{W} = [\hat{w}_1, \hat{w}_2, ..., \hat{w}_M]

  – Each column of \hat{W} contains the (d+1) coefficients
    corresponding to a class

Classification Using Linear Regression


• For any test example x, the discriminant value for class i is:

  f_i(x, \hat{w}_i) = \hat{w}_i^T \hat{x} = \sum_{j=0}^{d} \hat{w}_{ij} x_j,
  where \hat{x} = [1, x_1, ..., x_d]^T

  Class label for x = arg max_i f_i(x, \hat{w}_i),   i = 1, 2, ..., M
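A NumPy sketch of this procedure on synthetic two-class data; the data, class means and the test example below are placeholders, not taken from the lecture:

```python
import numpy as np

# Classification via linear regression on indicator variables:
# fit W_hat = (X^T X)^(-1) X^T Y with a one-hot response matrix Y,
# then classify by the largest discriminant value.
rng = np.random.default_rng(1)
X1 = rng.normal([110.0, 20.0], 5.0, size=(10, 2))    # class 1 examples (e.g. child: height, weight)
X2 = rng.normal([170.0, 70.0], 5.0, size=(10, 2))    # class 2 examples (e.g. adult)
X = np.vstack([X1, X2])
labels = np.array([0] * 10 + [1] * 10)

N, d, M = X.shape[0], X.shape[1], 2
Xa = np.column_stack([np.ones(N), X])                # N x (d+1) data matrix
Y = np.eye(M)[labels]                                # N x M indicator (one-hot) response matrix

W_hat = np.linalg.pinv(Xa.T @ Xa) @ Xa.T @ Y         # (d+1) x M, one coefficient column per class

def classify(x):
    """Discriminant values f_i(x, w_i) = w_i^T [1, x]; pick the class with the largest value."""
    f = np.concatenate([[1.0], x]) @ W_hat           # M discriminant values
    return int(np.argmax(f))

print(classify(np.array([165.0, 68.0])))             # expected: 1 (adult-like example)
```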


Illustration of Classification using Linear Regression

• Number of training examples (N) = 20
• Dimension of a training example = 2
• Number of classes: 2
• Each output variable is a 2-dimensional binary vector
• Classes: Child (C1) and Adult (C2)

[Figure: training examples plotted as weight (kg) versus height (cm)
for the Child and Adult classes]

Illustration of Classification using Linear Regression

• Training:  \hat{W} = (X^T X)^{-1} X^T Y
• X is the data matrix of size 20 x 3
• Y is the response matrix of size 20 x 2

  \hat{W} = [ \hat{w}_1  \hat{w}_2 ] = [  2.8897   -1.8897
                                         -0.0222    0.0222
                                          0.0122   -0.0122 ]

[Figure: the two discriminant functions f_1(x, \hat{w}_1) and f_2(x, \hat{w}_2)
plotted over the weight (kg) versus height (cm) plane]


Illustration of Classification using Linear Regression

• Test example:

[Figure: test example marked on the weight (kg) versus height (cm) plane,
along with f_1(x, \hat{w}_1) and f_2(x, \hat{w}_2)]

  f_1(x, \hat{w}_1) = 0.1842        f_2(x, \hat{w}_2) = 0.8158

• Class: Adult (C2)

Illustration of Classification using Linear Regression

• Test example:

[Figure: test example marked on the weight (kg) versus height (cm) plane,
along with f_1(x, \hat{w}_1) and f_2(x, \hat{w}_2)]

  f_1(x, \hat{w}_1) = 0.4639        f_2(x, \hat{w}_2) = 0.5361

• Class: Adult (C2)


Classification Using Linear Regression


• The dependent variable is categorical (an indicator variable)
• There are multiple outputs (multiple dependent variables)
• If the input x belongs to C_i, then y_i is 1
• The expected output for x should then be close to 1
• During linear regression for classification, we are trying to
  predict the expected output value
• In other words, we are trying to predict the probability of the
  class:

  E[y_i | x] = P(C_i | x)

• This is the ideal situation
• Linear regression gives some hope of achieving this
• The notion of predicting the probability of a class is captured
  more naturally by logistic regression

Logistic Regression


Logistic Regression
• Requirement: the discriminant function f_i(x, w_i) should give
  the probability of class C_i

  E[y_i | x] = P(C_i | x)

• Look for a suitable transformation of the probability and fit
  that instead
• Logit transformation:  log( P(x) / (1 - P(x)) )
• 2-class classification:
  – Class label: 0 or 1
  – P(x) is the probability that the class label is 1 given the
    input (probability of success)
  – 1 - P(x) is the probability that the class label is 0
    (probability of failure)

Logistic Regression
• Logit function: log of the odds function
• Odds function:  P(x) / (1 - P(x))
  – Probability of success divided by the probability of failure
• Fit a linear model to the logit function:

  log( P(x) / (1 - P(x)) ) = w_0 + w_1 x_1 + ... + w_d x_d = w^T \hat{x}

  where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T

  – For 1-dimensional (d = 1) space, x:

    log( P(x) / (1 - P(x)) ) = w_0 + w_1 x

    P(x) / (1 - P(x)) = e^{(w_0 + w_1 x)}

    P(x) = e^{(w_0 + w_1 x)} / (1 + e^{(w_0 + w_1 x)}) = 1 / (1 + e^{-(w_0 + w_1 x)})

Logistic Regression

  P(x) = 1 / (1 + e^{-(w_0 + w_1 x)})

• This function is a sigmoidal function, specifically called the
  logistic function
• Logistic function (in general, with parameter \beta):

  P(x) = 1 / (1 + e^{-\beta x})

[Figure: logistic functions P(x) versus x for \beta = 0.2, 0.3 and 0.6]


Logistic Regression
• Logit function: log of the odds function
• Odds function:  P(x) / (1 - P(x))
  – Probability of success divided by the probability of failure
• Fit a linear model to the logit function:
  – For d-dimensional space, x = [x_1, x_2, ..., x_d]^T:

    log( P(x) / (1 - P(x)) ) = w_0 + w_1 x_1 + ... + w_d x_d = w^T \hat{x}

    where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T

    P(x) / (1 - P(x)) = e^{(w^T \hat{x})}

    P(x) = e^{(w^T \hat{x})} / (1 + e^{(w^T \hat{x})}) = 1 / (1 + e^{-(w^T \hat{x})})

• Decision rule:
  – If P(x) >= 0.5 then x is assigned to C2
  – If P(x) < 0.5 then x is assigned to C1
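A small sketch of this decision rule; the weight vector is the one from the illustration later in these slides, while the test example itself is hypothetical:

```python
import numpy as np

# Logistic-regression decision rule: P(x) = 1 / (1 + exp(-w^T x_hat)),
# assign C2 if P(x) >= 0.5, else C1.
w = np.array([-378.2085, 2.2065, 1.8818])    # [w0, w1, w2] from the height/weight illustration

def predict_prob(x):
    x_hat = np.concatenate([[1.0], x])        # augmented input [1, x1, ..., xd]
    return 1.0 / (1.0 + np.exp(-(w @ x_hat)))

p = predict_prob(np.array([160.0, 55.0]))     # hypothetical test example (height cm, weight kg)
print(p, "C2 (Adult)" if p >= 0.5 else "C1 (Child)")
```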


Estimation of Parameters in Logistic Regression
• The criterion used to estimate the parameters is different from
  that of linear regression
• Optimize the likelihood of the data
• As the goal is to model the probability of the class, we
  maximize the likelihood of the data
• Maximum likelihood (ML) method of parameter estimation
• Given: training data D = \{x_n, y_n\}_{n=1}^{N}, x_n \in R^d and y_n \in \{0, 1\}
• The data of a class is represented by the parameter vector
  w = [w_0, w_1, ..., w_d]^T (parameters of the linear function)
• Unknown: w
• Likelihood of the nth example (Bernoulli distribution):

  P(x_n | w) = P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}

  – P(x_n) is the probability that x_n has label 1, and 1 - P(x_n) is
    the probability that it has label 0
• Total data likelihood:

  P(D | w) = \prod_{n=1}^{N} P(x_n | w) = \prod_{n=1}^{N} P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}

Estimation of Parameters in Logistic Regression
• Total data log likelihood:

  l(w) = ln P(D | w) = \sum_{n=1}^{N} [ y_n ln(P(x_n)) + (1 - y_n) ln(1 - P(x_n)) ]

• Choose the parameters for which the total data log likelihood
  is maximum:

  w_{ML} = arg max_w l(w)

• Cost function for the optimization:

  l(w) = \sum_{n=1}^{N} [ y_n ln(P(x_n)) + (1 - y_n) ln(1 - P(x_n)) ]

• Condition for optimality:

  \partial l(w) / \partial w = 0

• Unfortunately, solving this yields no closed-form expression
  for w
• Solution: gradient ascent method


Estimation of Parameters in Logistic Regression
• Gradient ascent method
• It is an iterative procedure
• We start with an initial value for w
• At each iteration:
  – Estimate the change in w
  – The change in w (\Delta w) is proportional to the slope
    (gradient) of the likelihood surface:

    \Delta w = \eta \partial l(w) / \partial w

  – Then, w is updated using \Delta w

[Figure: likelihood surface l(w) versus weight w, showing local
maxima and the global maximum]

• This indicates that we move along the positive slope of the
  likelihood surface, so the likelihood increases at each iteration

Estimation of Parameters in Logistic Regression – Gradient Ascent Method
• Given a training dataset, the goal is to maximize the likelihood
  function with respect to the parameters of the linear function
1. Initialize w
   • Evaluate the initial value of the log likelihood l(w)
2. Determine the change in w:  \Delta w = \eta \partial l(w) / \partial w
3. Update w:  w <- w + \Delta w
4. Evaluate the log likelihood and check for convergence of the
   log likelihood
   • If the convergence criterion is not satisfied, repeat steps 2
     to 4
• Convergence criterion: the difference between the log
  likelihoods of successive iterations falls below a threshold
  (e.g. threshold = 10^{-3})
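A NumPy sketch of steps 1–4 on synthetic two-class data. The gradient expression \partial l / \partial w = \sum_n (y_n - P(x_n)) \hat{x}_n used below follows from the log likelihood above; the data, learning rate and thresholds are placeholders:

```python
import numpy as np

# Maximum-likelihood estimation of logistic-regression parameters by gradient ascent.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([-1.0, -1.0], 0.5, (10, 2)),   # class 0 examples
               rng.normal([+1.0, +1.0], 0.5, (10, 2))])  # class 1 examples
y = np.array([0.0] * 10 + [1.0] * 10)
Xa = np.column_stack([np.ones(len(X)), X])               # augmented inputs [1, x1, x2]

def log_likelihood(w):
    p = 1.0 / (1.0 + np.exp(-Xa @ w))
    eps = 1e-12                                          # guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

w = np.zeros(Xa.shape[1])                                # step 1: initialize w
eta, prev_ll = 0.05, log_likelihood(w)
for _ in range(10000):
    p = 1.0 / (1.0 + np.exp(-Xa @ w))
    w = w + eta * Xa.T @ (y - p)                         # steps 2-3: delta_w = eta * dl/dw, then update
    ll = log_likelihood(w)
    if abs(ll - prev_ll) < 1e-3:                         # step 4: convergence of the log likelihood
        break
    prev_ll = ll

print(w)
```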


Illustration of Classification using Logistic Regression

• Number of training examples (N) = 20
• Dimension of a training example = 2
• The class label attribute is the 3rd dimension
• Classes:
  – Child (0)
  – Adult (1)

[Figure: training examples plotted as weight (kg) versus height (cm)]

Illustration of Classification using Logistic Regression
• Training:

  w = [ -378.2085, 2.2065, 1.8818 ]^T

[Figure: training data on the weight (kg) versus height (cm) plane with
the decision boundary f(x, w) = w^T \hat{x} = 0]


Illustration of Classification using Logistic Regression
• Test example:

[Figure: test example marked on the weight (kg) versus height (cm)
plane with the decision boundary]

  w^T \hat{x} = 47.9851        P(x) = 1 / (1 + e^{-(w^T \hat{x})}) \approx 1

• Class: Adult (C2)

Illustration of Classification using Logistic Regression
• Test example:

[Figure: test example marked on the weight (kg) versus height (cm)
plane with the decision boundary]

  w^T \hat{x} = 6.7771        P(x) = 1 / (1 + e^{-(w^T \hat{x})}) \approx 0.999

• Class: Adult (C2)


Logistic Regression
• Logistic regression is a linear classifier
• Logistic regression looks simple, but yields a very powerful
  classifier
• It is used not just for building classifiers, but also for
  sensitivity analysis
• Logistic regression is used to identify how each attribute
  contributes to the output
• That is, how important each attribute is for predicting the
  class label
• Perform logistic regression and observe w
• The value of each element of w indicates how much the
  corresponding attribute contributes to the output

Illustration of Sensitivity Analysis using Logistic Regression
• Training:

  w = [ -378.2085, 2.2065, 1.8818 ]^T = [ w_0, w_1, w_2 ]^T

[Figure: training data on the weight (kg) versus height (cm) plane with
the decision boundary]

• Both the attributes are equally important


Discriminative Learning Methods for Classification

Two Classes of Approaches for Linear Classification
1. Modeling a discriminating function:
   • Linear regression and logistic regression
2. Directly learn a discriminant function (hyperplane):
   – Classic method: the discriminant function between the
     classes is learnt

[Figure: two classes in the (x1, x2) plane with a learnt linear
decision boundary between them]

   – Perceptron (a linear discriminant function is learnt)
   – Neural networks (when the discriminant function is
     nonlinear)


Discriminative Learning Methods

• Learn the surface that best separates the regions of the
  classes
• Learning a discriminant function: learn a function that maps
  the input data to the output
• Linear discriminant function:

[Figure: two classes in the (x1, x2) plane separated by a linear
discriminant function]

Linear Discriminant Function

• Regions of two classes are separable by a linear surface
  (line, plane or hyperplane)
• 2-dimensional space: the decision boundary is a line specified by

  w_1 x_1 + w_2 x_2 + w_0 = 0
  x_2 = -(w_1 / w_2) x_1 - (w_0 / w_2)

• d-dimensional space: the decision surface is a hyperplane
  specified by

  w_d x_d + ... + w_2 x_2 + w_1 x_1 + w_0 = \sum_{i=0}^{d} w_i x_i = w^T \hat{x} = 0

  where w = [w_0, w_1, ..., w_d]^T and \hat{x} = [1, x_1, ..., x_d]^T

[Figure: two classes in the (x1, x2) plane separated by a line]


Discriminant Function of a Hyperplane

• The discriminant function of a hyperplane:

  g(x) = \sum_{i=1}^{d} w_i x_i + w_0 = w^T x + w_0

• For any point that lies on the hyperplane:

  g(x) = \sum_{i=1}^{d} w_i x_i + w_0 = w^T x + w_0 = 0

• Example:
  – Consider a straight line with equation x_2 + x_1 - 1 = 0
  – The discriminant function of the straight line is g(x) = x_2 + x_1 - 1
  – For the points (1,0) and (0,1), which lie on this straight line,
    g(x) = 0
  – For the point (0,0), g(x) = -1, i.e. the value of g(x) is negative
  – For the point (1,1), g(x) = +1, i.e. the value of g(x) is positive

[Figure: the line x_1 + x_2 - 1 = 0 in the (x1, x2) plane with the points
(0,0), (1,0), (0,1) and (1,1) marked]

Discriminant Function of a Hyperplane

  g(x) = x_1 + x_2 - 1

[Figure: the line x_1 + x_2 - 1 = 0 with its positive side (containing
(1,1)) and negative side (containing (0,0)) indicated]

• A hyperplane has a positive side and a negative side
  – For any point on the positive side, the value of the
    discriminant function g(x) is positive
  – For any point on the negative side, the value of the
    discriminant function g(x) is negative
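A short sketch that evaluates this discriminant function at the four points from the example above:

```python
import numpy as np

# Discriminant function g(x) = x1 + x2 - 1 and the side of the hyperplane each point falls on.
w, w0 = np.array([1.0, 1.0]), -1.0

def g(x):
    return w @ x + w0            # g(x) = w^T x + w0

for point in [(1, 0), (0, 1), (0, 0), (1, 1)]:
    value = g(np.array(point, dtype=float))
    side = "on the hyperplane" if value == 0 else ("positive side" if value > 0 else "negative side")
    print(point, value, side)    # (1,0)->0, (0,1)->0, (0,0)->-1, (1,1)->+1
```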


Perceptron Learning
• Given training data: D = \{x_n, y_n\}_{n=1}^{N}, x_n \in R^d and y_n \in \{-1, +1\}
• Goal: to estimate the parameter vector w = [w_0, w_1, ..., w_d]^T
  – such that the linear function (hyperplane) is placed between
    the training data of the two classes so that the training error
    (classification error) is minimum

[Figure: two possible separating lines in the (x1, x2) plane, with
w^T x_n + w_0 > 0 on the C2 side and w^T x_n + w_0 < 0 on the C1 side]

Perceptron Learning
• Given training data: D = \{x_n, y_n\}_{n=1}^{N}, x_n \in R^d and y_n \in \{-1, +1\}
1. Initialize w
2. Choose a training example x_n
3. Update w if x_n is misclassified:

   w <- w + \eta x_n,  if w^T x_n + w_0 <= 0 and x_n \in C2 (+1)
   w <- w - \eta x_n,  if w^T x_n + w_0 > 0 and x_n \in C1 (-1)

   – Here \eta is a positive learning rate parameter
   – Increment the misclassification count by 1
4. Repeat steps 2 and 3 till all the training examples have been
   presented
5. Repeat steps 2 to 4, resetting the misclassification count to 0,
   until the convergence criterion is satisfied
• Convergence criterion:
  – The total misclassification count is 0, OR
  – The total misclassification count is minimum (falls below a
    threshold)
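A NumPy sketch of steps 1–5 on synthetic, linearly separable data; the data, learning rate and epoch limit are placeholders:

```python
import numpy as np

# Perceptron learning with labels y_n in {-1, +1} (C1 = -1, C2 = +1).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([-2.0, -2.0], 0.5, (20, 2)),   # class C1
               rng.normal([+2.0, +2.0], 0.5, (20, 2))])  # class C2
y = np.array([-1] * 20 + [+1] * 20)
Xa = np.column_stack([np.ones(len(X)), X])               # augmented inputs, so w = [w0, w1, w2]

w = np.zeros(3)                                          # step 1: initialize w
eta = 0.1                                                # positive learning rate
for epoch in range(100):                                 # step 5: repeat over epochs
    misclassified = 0
    for xn, yn in zip(Xa, y):                            # steps 2-4: present every example once
        if yn * (w @ xn) <= 0:                           # misclassified (wrong side or on the boundary)
            w = w + eta * yn * xn                        # add eta*xn for C2 (+1), subtract for C1 (-1)
            misclassified += 1
    if misclassified == 0:                               # convergence: misclassification count is 0
        break

print(w, "epochs used:", epoch + 1)
```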


Perceptron Learning
• Training:

[Figure: training data of classes C1 and C2 in the (x1, x2) plane with
the learnt separating line]

• Test phase:
• Classification of a test pattern x using the weights w obtained
  by training the model:
  – If w^T x + w_0 > 0 then x is assigned to C2
  – If w^T x + w_0 <= 0 then x is assigned to C1

Discriminative Learning Methods for Classification:
Neural Networks


Human Brain

[Figure: image of the human brain]

Biological Neural Networks

• Several neurons are connected to one another to form a
  neural network or a layer of a neural network

[Figure: interconnected biological neurons]


Neuron with Threshold Logic Activation Function

[Diagram: neuron with inputs x_1, ..., x_d, weights w_1, ..., w_d, activation
value a = \sum_{i=1}^{d} w_i x_i + w_0, and output signal s = f(a)]

• Threshold logic activation function:

  s = f(a) = 1 if a >= 0
  s = f(a) = 0 if a < 0

• McCulloch-Pitts neuron [1]

[1] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent
in nervous activity," 1943.

Linearly Separable Classes – Perceptron Model
• Regions of two classes are separable by a linear surface (line,
  plane or hyperplane)
• A perceptron model that uses a single McCulloch-Pitts neuron
  can be trained using the perceptron learning algorithm [2]

• Decision surface in a 2-dimensional space is a line:

  w_1 x_1 + w_2 x_2 + w_0 = 0
  x_2 = -(w_1 / w_2) x_1 - (w_0 / w_2)

• Decision surface in a d-dimensional space is a hyperplane:

  \sum_{i=0}^{d} w_i x_i = w^T x + w_0 = 0,  with x_0 = 1

[Figure: two linearly separable classes in the (x1, x2) plane with the
decision line]

[2] A. G. Ivakhnenko and V. G. Lapa, Cybernetic Predicting Devices, 1965.


Perceptron Learning
• Perceptron learning rule for weight update:
  At step m, choose a training example x_n

  w <- w + \eta x_n,  if w^T x_n + w_0 <= 0 and x_n \in C2 (+1)
  w <- w - \eta x_n,  if w^T x_n + w_0 > 0 and x_n \in C1 (-1)

  Here \eta is the learning rate parameter.

• Test phase:
• Classification of a test pattern x using the weights w obtained
  by training the model:
  – If w^T x + w_0 > 0 then x is assigned to C2
  – If w^T x + w_0 <= 0 then x is assigned to C1

[Figure: two classes in the (x1, x2) plane with the learnt decision line]

Neuron with Threshold Logic Activation Function and
Perceptron Learning

[Diagram: neuron with inputs x_1, ..., x_d, weights w_1, ..., w_d, activation
value a = \sum_{i=1}^{d} w_i x_i + w_0, and output signal s = f(a)]


Hard Problems

• Nonlinearly separable classes

[Figure: two nonlinearly separable classes in the (x1, x2) plane]

Neuron with Continuous Activation Function

[Diagram: neuron with inputs x_1, ..., x_d, weights w_1, ..., w_d, activation
value a = \sum_{i=1}^{d} w_i x_i + w_0, and output signal s = f(a), where f is a
sigmoidal activation function]


Sigmoidal Activation Functions

• Logistic function:

  f(a) = 1 / (1 + e^{-\beta a})

  df(a)/da = \beta f(a) (1 - f(a))

[Figure: logistic functions f(a) versus a for \beta = 0.2, 0.3 and 0.6]

• Hyperbolic tangent function:

  f(a) = tanh(\beta a)

  df(a)/da = \beta (1 - f^2(a))
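A short numeric check of these two derivative identities against finite differences; the values of \beta and a below are arbitrary:

```python
import numpy as np

# Sigmoidal activations and their derivative identities:
# logistic: f'(a) = beta * f(a) * (1 - f(a)), tanh: f'(a) = beta * (1 - f(a)^2).
beta, a, h = 0.6, 0.8, 1e-6

logistic = lambda a: 1.0 / (1.0 + np.exp(-beta * a))
tanh_act = lambda a: np.tanh(beta * a)

# central finite-difference derivative versus the closed-form identity
print((logistic(a + h) - logistic(a - h)) / (2 * h), beta * logistic(a) * (1 - logistic(a)))
print((tanh_act(a + h) - tanh_act(a - h)) / (2 * h), beta * (1 - tanh_act(a) ** 2))
```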

Neuron with Continuous Activation Function

[Diagram: neuron with inputs x_1, ..., x_d, weights w_1, ..., w_d, activation
value a = \sum_{i=1}^{d} w_i x_i + w_0, and output signal s = f(a) = 1 / (1 + e^{-\beta a})]

• Given a set of N input–output pattern pairs (x_n, y_n),
  n = 1, 2, ..., N. Here y_n \in \{0, 1\} or y_n \in \{+1, -1\}
• Instantaneous error for the nth pattern:

  E_n = (1/2) (y_n - s)^2

• Change in weight at the mth iteration (gradient descent
  technique):

  \Delta w_i(m) = -\eta \partial E_n(m) / \partial w_i = \eta \delta_o x_i,
  where \delta_o = (y_n - s) df(a_n)/da_n

  w_i(m+1) = w_i(m) + \Delta w_i(m)
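A NumPy sketch of this single-neuron gradient-descent update on synthetic data, taking \beta = 1 and labels in {0, 1}; the data, learning rate and epoch count are placeholders:

```python
import numpy as np

# Single sigmoidal neuron trained with the update delta_w_i = eta * delta_o * x_i,
# where delta_o = (y_n - s) * df(a_n)/da_n.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([-2.0, -2.0], 0.5, (20, 2)),
               rng.normal([+2.0, +2.0], 0.5, (20, 2))])
y = np.array([0.0] * 20 + [1.0] * 20)
Xa = np.column_stack([np.ones(len(X)), X])          # augmented inputs [1, x1, x2]

f = lambda a: 1.0 / (1.0 + np.exp(-a))              # logistic activation with beta = 1
w, eta = np.zeros(3), 0.5

for epoch in range(200):                            # update after each pattern (pattern mode)
    for xn, yn in zip(Xa, y):
        a = w @ xn                                  # activation value
        s = f(a)                                    # output signal
        delta_o = (yn - s) * s * (1 - s)            # (y_n - s) * df(a)/da for the logistic function
        w = w + eta * delta_o * xn                  # delta_w_i = eta * delta_o * x_i

print(w)
```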


Illustration: Linearly Separable Data

[Diagram: single neuron with sigmoidal activation s = f(a) = 1 / (1 + e^{-\beta a}),
a = \sum_{i=1}^{d} w_i x_i + w_0, applied to linearly separable data]

Illustration: Non-Linearly Separable Data

[Diagram: single neuron with sigmoidal activation s = f(a) = 1 / (1 + e^{-\beta a}),
a = \sum_{i=1}^{d} w_i x_i + w_0, applied to nonlinearly separable data]



Feedforward Neural Networks



Multilayer Feedforward Neural Network

• Architecture of an MLFFNN:
  – Input layer: linear neurons
  – Hidden layers (1, 2 or more): sigmoidal neurons
  – Output layer:
    • Linear neurons (for function approximation tasks)
    • Sigmoidal neurons (for pattern classification tasks)

[Figure: MLFFNN with input layer nodes 1, ..., i, ..., d, hidden layer nodes
1, ..., j, ..., J, and output layer nodes 1, ..., k, ..., K producing outputs
s_1^o, ..., s_k^o, ..., s_K^o]

Multilayer Feedforward Neural Network

[Figure: MLFFNN with input node i (s_i = x_i), hidden node j (weight w_{ij}^h,
output s_j^h) and output node k (weight w_{jk}, output s_k^o)]

• Hidden layer:

  a_j^h = \sum_{i=1}^{d} w_{ij}^h s_i + w_{0j}^h,  where s_i = x_i
  s_j^h = f_j^h(a_j^h)

• Output layer:

  a_k^o = \sum_{j=1}^{J} w_{jk} s_j^h + w_{0k}
  s_k^o = f_k^o(a_k^o)

• Overall:

  s_k^o = f_k^o( \sum_{j=1}^{J} w_{jk} f_j^h( \sum_{i=1}^{d} w_{ij}^h s_i + w_{0j}^h ) + w_{0k} )


Backpropagation Learning
• Gradient descent method
• Error backpropagation algorithm
• Forward computation:
  – Inner product computation
  – Activation function computation
• Backward operation:
  – Error calculation and propagation
  – Modification of weights
• Given a set of N input–output pattern pairs (x_n, y_n), n = 1, 2, ..., N
  – y_n is a K-dimensional binary vector if the activation function is
    the logistic function: only one element is 1 and the rest are 0
  – y_n is a K-dimensional vector with one element equal to +1 and
    the rest equal to -1 when the activation function is the
    hyperbolic tangent function
• Instantaneous error for the nth pattern:

  E_n = (1/2) \sum_{k=1}^{K} (y_{nk} - s_k^o)^2

Backpropagation Learning (contd.)

[Figure: MLFFNN with input node i (s_i = x_i), hidden node j (weight w_{ij}^h,
output s_j^h) and output node k (weight w_{jk}, output s_k^o)]

• Change in weight at the output layer:

  \Delta w_{jk}(m) = -\eta \partial E(m) / \partial w_{jk} = \eta \delta_k^o s_j^h
  w_{jk}(m+1) = w_{jk}(m) + \Delta w_{jk}(m)

  where \delta_k^o = (y_k - s_k^o) df_k^o(a_k^o) / da_k^o

• Change in weight at the hidden layer:

  \Delta w_{ij}^h(m) = -\eta \partial E(m) / \partial w_{ij}^h = \eta \delta_j^h s_i
  w_{ij}^h(m+1) = w_{ij}^h(m) + \Delta w_{ij}^h(m)

  where \delta_j^h = ( \sum_{k=1}^{K} \delta_k^o w_{jk} ) df_j^h(a_j^h) / da_j^h


Mode of Presentation of Patterns

• Pattern mode (stochastic gradient descent method):
  • At the mth iteration, weights are updated after the
    presentation of each pattern:

    \Delta w_{jk}(m) = -\eta \partial E(m) / \partial w_{jk}

  • Epoch: presentation of all the patterns (all training examples)
    once

• Batch mode:
  • Weights are updated after the presentation of all the patterns
    once:

    \Delta w_{jk}(m) = -\eta \partial E_{av} / \partial w_{jk}

    where E_{av} = (1 / 2N) \sum_{n=1}^{N} \sum_{k=1}^{K} (y_{nk} - s_{nk}^o)^2

Illustration: Ring Data

[Figure: MLFFNN with 2 input nodes, 3 hidden nodes (weights w_{11}^h, ..., w_{23}^h)
and one output node s (weights w_1, w_2, w_3), applied to the ring data]


Illustration: X-OR Data

[Figure: MLFFNN with 2 input nodes, 3 hidden nodes (weights w_{11}^h, ..., w_{23}^h)
and one output node s (weights w_1, w_2, w_3), applied to the X-OR data]

Practical Considerations
• Stopping Criterion:
• Threshold on average error
• Threshold on average gradient
• Number of Weights:
• Depends on number of input nodes, output nodes,
hidden nodes and hidden layers
• Number of Hidden Nodes
• Cross-validation method
• Data Requirements
• Limitations:
• Slow convergence
• Local minima problem



Gradient Descent Method

[Figure: error surface (error versus weight) showing two local minima
and the global minimum]

Feedforward Neural Networks: Summary

• Perceptrons, with the threshold logic function as activation function,
  are suitable for pattern classification tasks that involve linearly
  separable classes
• Multilayer feedforward neural networks, with a sigmoidal function
  as activation function, are suitable for nonlinearly separable
  classes
  – Complexity of the model depends on
    • Dimension of the input pattern vector
    • Number of classes
    • Shapes of the decision surfaces to be formed
  – Architecture of the model is empirically determined
  – A large number of training examples is required when the
    complexity of the model is high
  – Local minima problem
• Multilayer feedforward neural network models are suitable for
  function approximation tasks as well
• A multilayer feedforward neural network with one or two hidden
  layers is now called a shallow network


Text Books
1. J. Han and M. Kamber, Data Mining: Concepts and
Techniques, Third Edition, Morgan Kaufmann Publishers,
2011.
2. S. Theodoridis and K. Koutroumbas, Pattern Recognition,
Academic Press, 2009.
3. C. M. Bishop, Pattern Recognition and Machine Learning,
Springer, 2006.
4. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall
of India, 1999.
5. Satish Kumar, Neural Networks - A Class Room Approach,
Second Edition, Tata McGraw-Hill, 2013.
6. S. Haykin, Neural Networks and Learning Machines, Prentice
Hall of India, 2010.

