Linear Regression
• Linear approach to model the relationship between a scalar response y (the dependent variable) and one or more predictor variables x (the independent variables)
• The response is modeled as a linear function of the input (one or more independent variables)
• Optimal coefficient vector w is given by
    w_hat = (X^T X)^{-1} X^T y
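A minimal numpy sketch of this closed-form least-squares solution; the toy data values below are illustrative, not from the slides:

```python
import numpy as np

# Toy data roughly following y = 2*x + 1 (values are illustrative).
# The leading column of ones absorbs the intercept w0.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])

# w_hat = (X^T X)^{-1} X^T y, solved without forming an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

In practice `np.linalg.lstsq` is the numerically safer route; it minimizes the same squared error and returns the same coefficients for full-rank X.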
[Figure: fitted regression line for y versus x]

16-11-2019

[Figure: two-class data in the (x1, x2) plane with a separating line]
• Discriminant function in 2-dimensional space:
    x_n = [x_n1, x_n2]^T
    f(x_n, w1, w2, w0) = w1 x_n1 + w2 x_n2 + w0
• On the decision boundary in 2-dimensional space:
    f(x_n, w1, w2, w0) = w1 x_n1 + w2 x_n2 + w0 = 0
• Discriminant function in 2-dimensional space, rewritten in slope-intercept form:
    x_n = [x_n1, x_n2]^T
    x_n2 = -(w1/w2) x_n1 - (w0/w2) = m x_n1 + c
[Figure: training data, weight (kg) versus height (cm)]

[Figure: two discriminant functions f1(x, w_hat_1) and f2(x, w_hat_2) fitted over the weight-height data]
[Figure: for a test point on the weight-height data, f1(x, w_hat_1) = 0.1842 and f2(x, w_hat_2) = 0.8158]
[Figure: for a test point near the boundary, f1(x, w_hat_1) = 0.4639 and f2(x, w_hat_2) = 0.5361]
Logistic Regression
• Requirement: The discriminant function f_i(x, w_i) should give the probability of class C_i:
    E[y_i | x] = P(y_i = C_i | x)
Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
  – Probability of success divided by the probability of failure
• Fit a linear model to the logit function:
    log[ P(x) / (1 - P(x)) ] = w0 + w1 x1 + ... + wd xd = w^T x_hat
  where w = [w0, w1, ..., wd]^T and x_hat = [1, x1, ..., xd]^T
Logistic Regression
• For a single input variable, the resulting model is the sigmoid
    P(x) = 1 / (1 + e^{-(w0 + w1 x)})

[Figure: sigmoid curve of P(x) versus x]
Logistic Regression
• Exponentiating both sides of the logit model gives the odds:
    P(x) / (1 - P(x)) = e^{w^T x_hat}
Logistic Regression
• Solving for P(x):
    P(x) = e^{w^T x_hat} / (1 + e^{w^T x_hat}) = 1 / (1 + e^{-w^T x_hat})
• Decision rule:
    If P(x) ≥ 0.5 then x is assigned to C2
    If P(x) < 0.5 then x is assigned to C1
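The sigmoid model and decision rule can be sketched in plain Python; the helper name `predict_class` and the toy weights in the usage note are illustrative, not from the slides:

```python
import math

def predict_class(w, x):
    """Logistic-regression decision rule: returns (P(x), class label).

    w includes the bias as w[0]; x_hat = [1, x1, ..., xd].
    Following the slide's convention: P(x) >= 0.5 -> C2, otherwise C1.
    """
    x_hat = [1.0] + list(x)
    a = sum(wi * xi for wi, xi in zip(w, x_hat))   # a = w^T x_hat
    p = 1.0 / (1.0 + math.exp(-a))                 # sigmoid of a
    return p, "C2" if p >= 0.5 else "C1"
```

For example, with w = [-1, 2] and x = [1], a = 1 and P(x) ≈ 0.731, so x goes to C2; with x = [0], a = -1 and P(x) ≈ 0.269, so x goes to C1.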
Estimation of Parameters in Logistic Regression
• The criterion used to estimate the parameters is different from that of linear regression
• Here, the likelihood of the data is optimized
• As the goal is to model the probability of a class, we maximize the likelihood of the data
• Maximum likelihood (ML) method of parameter estimation
• Given: training data D = {x_n, y_n}, n = 1, ..., N, with x_n ∈ R^d and y_n ∈ {1, 0}
• Data of a class is represented by the parameter vector w = [w0, w1, ..., wd]^T (parameters of the linear function)
• Unknown: w
• Likelihood of x_n:
    P(x_n | w) = P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}
• The per-pattern likelihood P(x_n | w) = P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)} is a Bernoulli distribution
• Total data likelihood:
    P(D | w) = Π_{n=1}^{N} P(x_n | w)
• Substituting the per-pattern likelihood:
    P(D | w) = Π_{n=1}^{N} P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}
Estimation of Parameters in Logistic Regression
• Total data log-likelihood:
    l(w) = ln P(D | w)
    l(w) = Σ_{n=1}^{N} [ y_n ln P(x_n) + (1 - y_n) ln(1 - P(x_n)) ]
Estimation of Parameters in Logistic Regression
• Gradient ascent method
• It is an iterative procedure
• We start with an initial value for w
• At each iteration:
  – Estimate the change in w
  – The change Δw is proportional to the slope (gradient) of the likelihood surface, ∂l(w)/∂w
  – Then w is updated using Δw
• This means we move along the positive slope of the likelihood surface, so the likelihood increases at each iteration

[Figure: likelihood surface l(w) versus weight w, showing local maxima and the global maximum]
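The iterative procedure above can be sketched with numpy, assuming the standard log-likelihood gradient dl/dw = Σ_n (y_n - P(x_n)) x_hat_n; the function name, learning rate, and toy data are illustrative choices, not from the slides:

```python
import numpy as np

def fit_logistic(X, y, eta=0.1, iters=2000):
    """Gradient ascent on the data log-likelihood l(w).

    X: (N, d) inputs; y: (N,) labels in {0, 1}.
    Delta w is proportional to the gradient (slope) of l(w),
    so each step moves along the positive slope of the surface.
    """
    X_hat = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1 for w0
    w = np.zeros(X_hat.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X_hat @ w))           # P(x_n) for all n
        w += eta * X_hat.T @ (y - p)                   # w <- w + eta * dl/dw
    return w
```

On a tiny 1-D example with labels 0, 0, 1, 1 at x = 0, 1, 2, 3, the learned w places the boundary between x = 1 and x = 2.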
[Figure: weight (kg) versus height (cm) data with the learned decision boundary f(x, w) = w^T x_hat = 0]
[Figure: for a point far from the boundary, w^T x_hat = 47.9851 and P(x) = 1 / (1 + e^{-w^T x_hat}) ≈ 1]

[Figure: for a point closer to the boundary, w^T x_hat = 6.7771 and P(x) = 1 / (1 + e^{-w^T x_hat}) ≈ 0.9]
Logistic Regression
• Logistic regression is a linear classifier
• Logistic regression looks simple, but yields a very powerful classifier
• It is used not just for building classifiers, but also for sensitivity analysis
• Logistic regression can identify how each attribute contributes to the output, i.e., how important each attribute is for predicting the class label
• Perform logistic regression and observe w
• The value of each element of w indicates how much the corresponding attribute contributes to the output
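One way to sketch this sensitivity analysis; the synthetic data, the standardization step, and the inline gradient-ascent fit are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (illustrative): attribute 1 drives the class, attribute 2 is noise.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)          # label depends only on attribute 1

# Standardize the attributes so their w-values are directly comparable.
X = (X - X.mean(axis=0)) / X.std(axis=0)
X_hat = np.hstack([np.ones((200, 1)), X])

# Simple gradient ascent on the log-likelihood to obtain w.
w = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X_hat @ w))
    w += 0.01 * X_hat.T @ (y - p)

# |w[1]| ends up much larger than |w[2]|: observing w reveals that
# attribute 1 contributes far more to the output than attribute 2.
```

Standardizing first matters: raw attributes on different scales would make the magnitudes of the w elements incomparable.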
[Figure: two-class data in the (x1, x2) plane]

– Perceptron (a linear discriminant function is learnt)
– Neural networks (when the discriminant function is nonlinear)
[Figure: example discriminant g(x) = x1 + x2 - 1 in the (x1, x2) plane]
Perceptron Learning
• Given training data D = {x_n, y_n}, n = 1, ..., N, with x_n ∈ R^d and y_n ∈ {1, -1}
• Goal: To estimate the parameter vector w = [w0, w1, ..., wd]^T
  – such that the linear function (hyperplane) is placed between the training data of the two classes so that the training error (classification error) is minimum

[Figure: two equivalent labellings of the same hyperplane: w^T x_n + w0 > 0 on the C2 side and w^T x_n + w0 < 0 on the C1 side]
Perceptron Learning
• Given training data D = {x_n, y_n}, n = 1, ..., N, with x_n ∈ R^d and y_n ∈ {1, -1}
1. Initialize w
2. Choose a training example x_n
3. Update w if x_n is misclassified:
    w = w + η x_n, if w^T x_n + w0 ≤ 0 and x_n ∈ C2 (y_n = +1)
    w = w - η x_n, if w^T x_n + w0 > 0 and x_n ∈ C1 (y_n = -1)
  – Here η is a positive learning-rate parameter
  – Increment the misclassification count by 1
4. Repeat steps 2 and 3 till all the training examples are presented
5. Reset the misclassification count to 0 and repeat steps 2 to 4 until the convergence criterion is satisfied
• Convergence criterion:
  – Total misclassification count is 0, OR
  – Total misclassification count is minimum (falls below a threshold)
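The learning procedure in steps 1-5 can be sketched as follows; the function name, epoch cap, and toy data in the usage note are illustrative, not from the slides:

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning: X is (N, d); y in {+1, -1} (+1 = C2, -1 = C1).

    Sweeps over the data, adding eta * y_n * x_hat_n whenever x_n is
    misclassified, and stops when an epoch produces zero
    misclassifications (the convergence criterion) or max_epochs is hit.
    """
    X_hat = np.hstack([np.ones((X.shape[0], 1)), X])  # w0 absorbed as w[0]
    w = np.zeros(X_hat.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, y_n in zip(X_hat, y):
            if y_n * (w @ x_n) <= 0:      # misclassified (or on the boundary)
                w += eta * y_n * x_n      # the two slide updates, combined
                errors += 1
        if errors == 0:                   # total misclassification count is 0
            break
    return w
```

The single update `w += eta * y_n * x_n` is exactly the pair of rules above: it adds η x_n when a C2 pattern (y_n = +1) is misclassified and subtracts η x_n when a C1 pattern (y_n = -1) is.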
Perceptron Learning
• Training:

[Figure: learned hyperplane separating C1 and C2 in the (x1, x2) plane]

• Test phase:
• Classification of a test pattern x using the weights w obtained by training the model:
  – If w^T x + w0 > 0 then x is assigned to C2
  – If w^T x + w0 ≤ 0 then x is assigned to C1
Neural Networks
Human Brain
[Figure: perceptron unit with inputs x1, ..., xd, weights w1, ..., wd, and bias w0]

Threshold logic activation function:
    s = 1 if a ≥ 0
    s = 0 if a < 0

Boundary for two inputs: w1 x1 + w2 x2 + w0 = 0, i.e.
    x2 = -(w1/w2) x1 - (w0/w2)

Decision surface in a d-dimensional space is a hyperplane:
    Σ_{i=0}^{d} w_i x_i = w^T x + w0 = 0 (with x0 = 1)

[2] A.G. Ivakhnenko and V.G. Lapa, Cybernetic Predicting Devices, 1965.
Perceptron Learning
• Perceptron learning rule for weight update:
  At step m, choose a training example x_n
    w = w + η x_n, if w^T x_n + w0 ≤ 0 and x_n ∈ C2 (y_n = +1)
    w = w - η x_n, if w^T x_n + w0 > 0 and x_n ∈ C1 (y_n = -1)
  Here η is the learning rate parameter.

[Figure: training patterns and the learned boundary in the (x1, x2) plane]

• Test phase:
• Classification of a test pattern x using the weights w obtained by training the model:
  – If w^T x + w0 > 0 then x is assigned to C2
  – If w^T x + w0 ≤ 0 then x is assigned to C1
Hard Problems

[Figure: two-class data in the (x1, x2) plane that is not linearly separable]
Logistic function:
    f(a) = 1 / (1 + e^{-βa})
    df(a)/da = β f(a) (1 - f(a))

[Figure: logistic curves for β = 0.2, 0.3, 0.6]

Hyperbolic tangent function:
    f(a) = tanh(βa)
    df(a)/da = β (1 - f^2(a))
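Both derivative identities can be checked numerically against a central difference; the value β = 0.5 and the test point a = 0.7 are arbitrary choices, not from the slides:

```python
import math

beta = 0.5   # slope parameter (the slide's plots use beta = 0.2, 0.3, 0.6)

def logistic(a):
    return 1.0 / (1.0 + math.exp(-beta * a))

def logistic_deriv(a):
    # identity: df/da = beta * f(a) * (1 - f(a))
    return beta * logistic(a) * (1.0 - logistic(a))

def tanh_act(a):
    return math.tanh(beta * a)

def tanh_deriv(a):
    # identity: df/da = beta * (1 - f(a)^2)
    return beta * (1.0 - tanh_act(a) ** 2)

# Central-difference check of both identities at a = 0.7.
h = 1e-6
num_logistic = (logistic(0.7 + h) - logistic(0.7 - h)) / (2 * h)
num_tanh = (tanh_act(0.7 + h) - tanh_act(0.7 - h)) / (2 * h)
```

These closed-form derivatives are what makes both activations convenient for backpropagation: each derivative is expressed in terms of the activation's own output.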
[Figure: sigmoid unit with inputs x1, ..., xd, weights w1, ..., wd, activation a = Σ_{i=1}^{d} w_i x_i + w0, and output s = f(a) = 1 / (1 + e^{-a})]

Sigmoidal activation function, with the output error term used in the weight update:
    δ_o = (y_n - s) df(a_n)/da_n
[Figure: feedforward network with input layer nodes i = 1, ..., d, hidden layer nodes j = 1, ..., J, and output layer nodes k = 1, ..., K]

Hidden-layer activation and output:
    a_j^h = Σ_{i=1}^{d} w_ji^h s_i + w_0j^h, where s_i = x_i
    s_j^h = f_j^h(a_j^h)

Output-layer activation and output:
    a_k^o = Σ_{j=1}^{J} w_jk s_j^h + w_0k
    s_k^o = f_k^o(a_k^o)

Overall network function:
    s_k^o = f_k^o( Σ_{j=1}^{J} w_jk f_j^h( Σ_{i=1}^{d} w_ji^h s_i + w_0j^h ) + w_0k )
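The forward computation through one hidden layer can be sketched in numpy, assuming the logistic activation at both layers; the function name and argument layout are illustrative:

```python
import numpy as np

def mlp_forward(x, Wh, w0h, Wo, w0o):
    """Forward computation for a one-hidden-layer network.

    x:   (d,) input, with s_i = x_i at the input layer
    Wh:  (J, d) hidden weights w_ji^h;  w0h: (J,) hidden biases w_0j^h
    Wo:  (K, J) output weights w_kj;    w0o: (K,) output biases w_0k
    Both layers use the logistic activation f(a) = 1 / (1 + e^{-a}).
    """
    f = lambda a: 1.0 / (1.0 + np.exp(-a))
    a_h = Wh @ x + w0h      # a_j^h = sum_i w_ji^h s_i + w_0j^h
    s_h = f(a_h)            # s_j^h = f_j^h(a_j^h)
    a_o = Wo @ s_h + w0o    # a_k^o = sum_j w_kj s_j^h + w_0k
    return f(a_o)           # s_k^o = f_k^o(a_k^o)
```

With all weights and biases zero, every activation is 0 and every output is f(0) = 0.5, which is a quick sanity check on the shapes.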
Backpropagation Learning
• Gradient descent method
• Error backpropagation algorithm
• Forward computation:
  – Inner-product computation
  – Activation function computation
• Backward operation:
  – Error calculation and propagation
  – Modification of weights
• Given a set of N input-output pattern pairs (x_n, y_n), n = 1, 2, ..., N
  – y_n is a K-dimensional binary vector when the activation function is the logistic function: only one element is 1 and the rest are 0
  – y_n is a K-dimensional vector with only one element 1 and the rest -1 when the activation function is the hyperbolic tangent function
• Instantaneous error for the nth pattern:
    E_n = (1/2) Σ_{k=1}^{K} (y_nk - s_k^o)^2
where δ_k^o = (y_k - s_k^o) df_k^o(a_k^o)/da_k^o is the error term for output node k
• Batch Mode:
• Weights are updated after the presentation of all the patterns once:
    Δw_jk(m) = -η ∂E_av/∂w_jk
    where E_av = (1/2N) Σ_{n=1}^{N} Σ_{k=1}^{K} (y_nk - s_nk^o)^2
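A batch-mode backpropagation sketch for a one-hidden-layer network with logistic activations at both layers; the initialization scale, learning rate, and toy data in the test are illustrative assumptions, not from the slides:

```python
import numpy as np

def train_batch(X, Y, J=3, eta=0.5, epochs=200, seed=0):
    """Batch-mode backpropagation for one hidden layer.

    Weights change only after all N patterns are presented:
    Delta w = -eta * dE_av/dw, with E_av the average squared error.
    Returns the per-epoch E_av values.
    """
    rng = np.random.default_rng(seed)
    d, K = X.shape[1], Y.shape[1]
    Wh = rng.normal(scale=0.5, size=(J, d)); bh = np.zeros(J)  # hidden layer
    Wo = rng.normal(scale=0.5, size=(K, J)); bo = np.zeros(K)  # output layer
    f = lambda a: 1.0 / (1.0 + np.exp(-a))                     # logistic
    errors = []
    for _ in range(epochs):
        # forward computation over the whole batch
        Sh = f(X @ Wh.T + bh)                  # (N, J) hidden outputs s_j^h
        So = f(Sh @ Wo.T + bo)                 # (N, K) outputs s_k^o
        errors.append(0.5 * np.mean(np.sum((Y - So) ** 2, axis=1)))  # E_av
        # backward operation: df/da = f(a)(1 - f(a)) for the logistic
        Do = (Y - So) * So * (1 - So)          # output deltas delta_k^o
        Dh = (Do @ Wo) * Sh * (1 - Sh)         # hidden deltas
        # batch updates: Delta w = -eta * dE_av/dw
        Wo += eta * Do.T @ Sh / len(X); bo += eta * Do.mean(axis=0)
        Wh += eta * Dh.T @ X / len(X);  bh += eta * Dh.mean(axis=0)
    return errors
```

Because the gradient is averaged over all N patterns before any weight moves, each epoch makes exactly one descent step on E_av, unlike pattern-mode updates which move after every example.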
[Figure: 2-3-1 network with inputs x1, x2, hidden weights w_11^h, w_21^h, w_12^h, w_22^h, w_13^h, w_23^h, output weights w1, w2, w3, and output s]
Practical Considerations
• Stopping criterion:
  – Threshold on average error
  – Threshold on average gradient
• Number of weights:
  – Depends on the number of input nodes, output nodes, hidden nodes and hidden layers
• Number of hidden nodes:
  – Chosen by the cross-validation method
• Data requirements
• Limitations:
  – Slow convergence
  – Local minima problem
[Figure: error surface over weight w, showing two local minima and the global minimum]
Text Books
1. J. Han and M. Kamber, Data Mining: Concepts and
Techniques, Third Edition, Morgan Kaufmann Publishers,
2011.
2. S. Theodoridis and K. Koutroumbas, Pattern Recognition,
Academic Press, 2009.
3. C. M. Bishop, Pattern Recognition and Machine Learning,
Springer, 2006.
4. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall
of India, 1999.
5. Satish Kumar, Neural Networks - A Class Room Approach,
Second Edition, Tata McGraw-Hill, 2013.
6. S. Haykin, Neural Networks and Learning Machines, Prentice
Hall of India, 2010.