Linear Regression
• Linear approach to model the relationship between a scalar response y (the dependent variable) and one or more predictor variables x (the independent variables)
• The response is modeled as a linear function of the input (one or more independent variables)
• Optimal coefficient vector w is given by
    w_hat = (X^T X)^{-1} X^T y
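A minimal numpy sketch of this closed-form least-squares solution; the toy data values below are illustrative, not from the slides:

```python
import numpy as np

# Toy data roughly following y = 2*x + 1 (values are illustrative).
# The leading column of ones absorbs the intercept w0.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])

# w_hat = (X^T X)^{-1} X^T y, solved without forming an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

In practice `np.linalg.lstsq` is the numerically safer route; it minimizes the same squared error and returns the same coefficients for full-rank X.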
[Figure: fitted regression line for y versus x]

16-11-2019

[Figure: two-class data in the (x1, x2) plane with a separating line]
• Discriminant function in 2-dimensional space:
    x_n = [x_n1, x_n2]^T
    f(x_n, w1, w2, w0) = w1 x_n1 + w2 x_n2 + w0
• On the decision boundary in 2-dimensional space:
    f(x_n, w1, w2, w0) = w1 x_n1 + w2 x_n2 + w0 = 0
• Discriminant function in 2-dimensional space, rewritten in slope-intercept form:
    x_n = [x_n1, x_n2]^T
    x_n2 = -(w1/w2) x_n1 - (w0/w2) = m x_n1 + c
[Figure: training data, weight (kg) versus height (cm)]

[Figure: two discriminant functions f1(x, w_hat_1) and f2(x, w_hat_2) fitted over the weight-height data]
[Figure: for a test point on the weight-height data, f1(x, w_hat_1) = 0.1842 and f2(x, w_hat_2) = 0.8158]
[Figure: for a test point near the boundary, f1(x, w_hat_1) = 0.4639 and f2(x, w_hat_2) = 0.5361]
Logistic Regression
• Requirement: The discriminant function f_i(x, w_i) should give the probability of class C_i:
    E[y_i | x] = P(y_i = C_i | x)
Logistic Regression
• Logit function: log of the odds function
• Odds function: P(x) / (1 - P(x))
  – Probability of success divided by the probability of failure
• Fit a linear model to the logit function:
    log[ P(x) / (1 - P(x)) ] = w0 + w1 x1 + ... + wd xd = w^T x_hat
  where w = [w0, w1, ..., wd]^T and x_hat = [1, x1, ..., xd]^T
Logistic Regression
• For a single input variable, the resulting model is the sigmoid
    P(x) = 1 / (1 + e^{-(w0 + w1 x)})

[Figure: sigmoid curve of P(x) versus x]
Logistic Regression
• Exponentiating both sides of the logit model gives the odds:
    P(x) / (1 - P(x)) = e^{w^T x_hat}
Logistic Regression
• Solving for P(x):
    P(x) = e^{w^T x_hat} / (1 + e^{w^T x_hat}) = 1 / (1 + e^{-w^T x_hat})
• Decision rule:
    If P(x) ≥ 0.5 then x is assigned to C2
    If P(x) < 0.5 then x is assigned to C1
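The sigmoid model and decision rule can be sketched in plain Python; the helper name `predict_class` and the toy weights in the usage note are illustrative, not from the slides:

```python
import math

def predict_class(w, x):
    """Logistic-regression decision rule: returns (P(x), class label).

    w includes the bias as w[0]; x_hat = [1, x1, ..., xd].
    Following the slide's convention: P(x) >= 0.5 -> C2, otherwise C1.
    """
    x_hat = [1.0] + list(x)
    a = sum(wi * xi for wi, xi in zip(w, x_hat))   # a = w^T x_hat
    p = 1.0 / (1.0 + math.exp(-a))                 # sigmoid of a
    return p, "C2" if p >= 0.5 else "C1"
```

For example, with w = [-1, 2] and x = [1], a = 1 and P(x) ≈ 0.731, so x goes to C2; with x = [0], a = -1 and P(x) ≈ 0.269, so x goes to C1.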
Estimation of Parameters in Logistic Regression
• The criterion used to estimate the parameters is different from that of linear regression
• Here, the likelihood of the data is optimized
• As the goal is to model the probability of a class, we maximize the likelihood of the data
• Maximum likelihood (ML) method of parameter estimation
• Given: training data D = {x_n, y_n}, n = 1, ..., N, with x_n ∈ R^d and y_n ∈ {1, 0}
• Data of a class is represented by the parameter vector w = [w0, w1, ..., wd]^T (parameters of the linear function)
• Unknown: w
• Likelihood of x_n:
    P(x_n | w) = P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}
• The per-pattern likelihood P(x_n | w) = P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)} is a Bernoulli distribution
• Total data likelihood:
    P(D | w) = Π_{n=1}^{N} P(x_n | w)
• Substituting the per-pattern likelihood:
    P(D | w) = Π_{n=1}^{N} P(x_n)^{y_n} (1 - P(x_n))^{(1 - y_n)}
Estimation of Parameters in Logistic Regression
• Total data log-likelihood:
    l(w) = ln P(D | w)
    l(w) = Σ_{n=1}^{N} [ y_n ln P(x_n) + (1 - y_n) ln(1 - P(x_n)) ]
Estimation of Parameters in Logistic Regression
• Gradient ascent method
• It is an iterative procedure
• We start with an initial value for w
• At each iteration:
  – Estimate the change in w
  – The change Δw is proportional to the slope (gradient) of the likelihood surface, ∂l(w)/∂w
  – Then w is updated using Δw
• This means we move along the positive slope of the likelihood surface, so the likelihood increases at each iteration

[Figure: likelihood surface l(w) versus weight w, showing local maxima and the global maximum]
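The iterative procedure above can be sketched with numpy, assuming the standard log-likelihood gradient dl/dw = Σ_n (y_n - P(x_n)) x_hat_n; the function name, learning rate, and toy data are illustrative choices, not from the slides:

```python
import numpy as np

def fit_logistic(X, y, eta=0.1, iters=2000):
    """Gradient ascent on the data log-likelihood l(w).

    X: (N, d) inputs; y: (N,) labels in {0, 1}.
    Delta w is proportional to the gradient (slope) of l(w),
    so each step moves along the positive slope of the surface.
    """
    X_hat = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1 for w0
    w = np.zeros(X_hat.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X_hat @ w))           # P(x_n) for all n
        w += eta * X_hat.T @ (y - p)                   # w <- w + eta * dl/dw
    return w
```

On a tiny 1-D example with labels 0, 0, 1, 1 at x = 0, 1, 2, 3, the learned w places the boundary between x = 1 and x = 2.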
[Figure: weight (kg) versus height (cm) data with the learned decision boundary f(x, w) = w^T x_hat = 0]
[Figure: for a point far from the boundary, w^T x_hat = 47.9851 and P(x) = 1 / (1 + e^{-w^T x_hat}) ≈ 1]

[Figure: for a point closer to the boundary, w^T x_hat = 6.7771 and P(x) = 1 / (1 + e^{-w^T x_hat}) ≈ 0.9]
Logistic Regression
• Logistic regression is a linear classifier
• Logistic regression looks simple, but yields a very powerful classifier
• It is used not just for building classifiers, but also for sensitivity analysis
• Logistic regression can identify how each attribute contributes to the output, i.e., how important each attribute is for predicting the class label
• Perform logistic regression and observe w
• The value of each element of w indicates how much the corresponding attribute contributes to the output
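One way to sketch this sensitivity analysis; the synthetic data, the standardization step, and the inline gradient-ascent fit are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (illustrative): attribute 1 drives the class, attribute 2 is noise.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)          # label depends only on attribute 1

# Standardize the attributes so their w-values are directly comparable.
X = (X - X.mean(axis=0)) / X.std(axis=0)
X_hat = np.hstack([np.ones((200, 1)), X])

# Simple gradient ascent on the log-likelihood to obtain w.
w = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X_hat @ w))
    w += 0.01 * X_hat.T @ (y - p)

# |w[1]| ends up much larger than |w[2]|: observing w reveals that
# attribute 1 contributes far more to the output than attribute 2.
```

Standardizing first matters: raw attributes on different scales would make the magnitudes of the w elements incomparable.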
[Figure: two-class data in the (x1, x2) plane]

– Perceptron (a linear discriminant function is learnt)
– Neural networks (when the discriminant function is nonlinear)
[Figure: example discriminant g(x) = x1 + x2 - 1 in the (x1, x2) plane]
Perceptron Learning
• Given training data D = {x_n, y_n}, n = 1, ..., N, with x_n ∈ R^d and y_n ∈ {1, -1}
• Goal: To estimate the parameter vector w = [w0, w1, ..., wd]^T
  – such that the linear function (hyperplane) is placed between the training data of the two classes so that the training error (classification error) is minimum

[Figure: two equivalent labellings of the same hyperplane: w^T x_n + w0 > 0 on the C2 side and w^T x_n + w0 < 0 on the C1 side]
Perceptron Learning
• Given training data D = {x_n, y_n}, n = 1, ..., N, with x_n ∈ R^d and y_n ∈ {1, -1}
1. Initialize w
2. Choose a training example x_n
3. Update w if x_n is misclassified:
    w = w + η x_n, if w^T x_n + w0 ≤ 0 and x_n ∈ C2 (y_n = +1)
    w = w - η x_n, if w^T x_n + w0 > 0 and x_n ∈ C1 (y_n = -1)
  – Here η is a positive learning-rate parameter
  – Increment the misclassification count by 1
4. Repeat steps 2 and 3 till all the training examples are presented
5. Reset the misclassification count to 0 and repeat steps 2 to 4 until the convergence criterion is satisfied
• Convergence criterion:
  – Total misclassification count is 0, OR
  – Total misclassification count is minimum (falls below a threshold)
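The learning procedure in steps 1-5 can be sketched as follows; the function name, epoch cap, and toy data in the usage note are illustrative, not from the slides:

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning: X is (N, d); y in {+1, -1} (+1 = C2, -1 = C1).

    Sweeps over the data, adding eta * y_n * x_hat_n whenever x_n is
    misclassified, and stops when an epoch produces zero
    misclassifications (the convergence criterion) or max_epochs is hit.
    """
    X_hat = np.hstack([np.ones((X.shape[0], 1)), X])  # w0 absorbed as w[0]
    w = np.zeros(X_hat.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, y_n in zip(X_hat, y):
            if y_n * (w @ x_n) <= 0:      # misclassified (or on the boundary)
                w += eta * y_n * x_n      # the two slide updates, combined
                errors += 1
        if errors == 0:                   # total misclassification count is 0
            break
    return w
```

The single update `w += eta * y_n * x_n` is exactly the pair of rules above: it adds η x_n when a C2 pattern (y_n = +1) is misclassified and subtracts η x_n when a C1 pattern (y_n = -1) is.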
Perceptron Learning
• Training:

[Figure: learned hyperplane separating C1 and C2 in the (x1, x2) plane]

• Test phase:
• Classification of a test pattern x using the weights w obtained by training the model:
  – If w^T x + w0 > 0 then x is assigned to C2
  – If w^T x + w0 ≤ 0 then x is assigned to C1
Neural Networks
Human Brain
[Figure: perceptron unit with inputs x1, ..., xd, weights w1, ..., wd, and bias w0]

Threshold logic activation function:
    s = 1 if a ≥ 0
    s = 0 if a < 0

Boundary for two inputs: w1 x1 + w2 x2 + w0 = 0, i.e.
    x2 = -(w1/w2) x1 - (w0/w2)

Decision surface in a d-dimensional space is a hyperplane:
    Σ_{i=0}^{d} w_i x_i = w^T x + w0 = 0 (with x0 = 1)

[2] A.G. Ivakhnenko and V.G. Lapa, Cybernetic Predicting Devices, 1965.
Perceptron Learning
• Perceptron learning rule for weight update:
  At step m, choose a training example x_n
    w = w + η x_n, if w^T x_n + w0 ≤ 0 and x_n ∈ C2 (y_n = +1)
    w = w - η x_n, if w^T x_n + w0 > 0 and x_n ∈ C1 (y_n = -1)
  Here η is the learning rate parameter.

[Figure: training patterns and the learned boundary in the (x1, x2) plane]

• Test phase:
• Classification of a test pattern x using the weights w obtained by training the model:
  – If w^T x + w0 > 0 then x is assigned to C2
  – If w^T x + w0 ≤ 0 then x is assigned to C1
Hard Problems

[Figure: two-class data in the (x1, x2) plane that is not linearly separable]
Logistic function:
    f(a) = 1 / (1 + e^{-βa})
    df(a)/da = β f(a) (1 - f(a))

[Figure: logistic curves for β = 0.2, 0.3, 0.6]

Hyperbolic tangent function:
    f(a) = tanh(βa)
    df(a)/da = β (1 - f^2(a))
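Both derivative identities can be checked numerically against a central difference; the value β = 0.5 and the test point a = 0.7 are arbitrary choices, not from the slides:

```python
import math

beta = 0.5   # slope parameter (the slide's plots use beta = 0.2, 0.3, 0.6)

def logistic(a):
    return 1.0 / (1.0 + math.exp(-beta * a))

def logistic_deriv(a):
    # identity: df/da = beta * f(a) * (1 - f(a))
    return beta * logistic(a) * (1.0 - logistic(a))

def tanh_act(a):
    return math.tanh(beta * a)

def tanh_deriv(a):
    # identity: df/da = beta * (1 - f(a)^2)
    return beta * (1.0 - tanh_act(a) ** 2)

# Central-difference check of both identities at a = 0.7.
h = 1e-6
num_logistic = (logistic(0.7 + h) - logistic(0.7 - h)) / (2 * h)
num_tanh = (tanh_act(0.7 + h) - tanh_act(0.7 - h)) / (2 * h)
```

These closed-form derivatives are what makes both activations convenient for backpropagation: each derivative is expressed in terms of the activation's own output.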
[Figure: sigmoid unit with inputs x1, ..., xd, weights w1, ..., wd, activation a = Σ_{i=1}^{d} w_i x_i + w0, and output s = f(a) = 1 / (1 + e^{-a})]

Sigmoidal activation function, with the output error term used in the weight update:
    δ_o = (y_n - s) df(a_n)/da_n
[Figure: feedforward network with input layer nodes i = 1, ..., d, hidden layer nodes j = 1, ..., J, and output layer nodes k = 1, ..., K]

Hidden-layer activation and output:
    a_j^h = Σ_{i=1}^{d} w_ji^h s_i + w_0j^h, where s_i = x_i
    s_j^h = f_j^h(a_j^h)

Output-layer activation and output:
    a_k^o = Σ_{j=1}^{J} w_jk s_j^h + w_0k
    s_k^o = f_k^o(a_k^o)

Overall network function:
    s_k^o = f_k^o( Σ_{j=1}^{J} w_jk f_j^h( Σ_{i=1}^{d} w_ji^h s_i + w_0j^h ) + w_0k )
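The forward computation through one hidden layer can be sketched in numpy, assuming the logistic activation at both layers; the function name and argument layout are illustrative:

```python
import numpy as np

def mlp_forward(x, Wh, w0h, Wo, w0o):
    """Forward computation for a one-hidden-layer network.

    x:   (d,) input, with s_i = x_i at the input layer
    Wh:  (J, d) hidden weights w_ji^h;  w0h: (J,) hidden biases w_0j^h
    Wo:  (K, J) output weights w_kj;    w0o: (K,) output biases w_0k
    Both layers use the logistic activation f(a) = 1 / (1 + e^{-a}).
    """
    f = lambda a: 1.0 / (1.0 + np.exp(-a))
    a_h = Wh @ x + w0h      # a_j^h = sum_i w_ji^h s_i + w_0j^h
    s_h = f(a_h)            # s_j^h = f_j^h(a_j^h)
    a_o = Wo @ s_h + w0o    # a_k^o = sum_j w_kj s_j^h + w_0k
    return f(a_o)           # s_k^o = f_k^o(a_k^o)
```

With all weights and biases zero, every activation is 0 and every output is f(0) = 0.5, which is a quick sanity check on the shapes.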
Backpropagation Learning
• Gradient descent method
• Error backpropagation algorithm
• Forward computation:
  – Inner-product computation
  – Activation function computation
• Backward operation:
  – Error calculation and propagation
  – Modification of weights
• Given a set of N input-output pattern pairs (x_n, y_n), n = 1, 2, ..., N
  – y_n is a K-dimensional binary vector when the activation function is the logistic function: only one element is 1 and the rest are 0
  – y_n is a K-dimensional vector with only one element 1 and the rest -1 when the activation function is the hyperbolic tangent function
• Instantaneous error for the nth pattern:
    E_n = (1/2) Σ_{k=1}^{K} (y_nk - s_k^o)^2
where δ_k^o = (y_k - s_k^o) df_k^o(a_k^o)/da_k^o is the error term for output node k
• Batch Mode:
• Weights are updated after the presentation of all the patterns once:
    Δw_jk(m) = -η ∂E_av/∂w_jk
    where E_av = (1/2N) Σ_{n=1}^{N} Σ_{k=1}^{K} (y_nk - s_nk^o)^2
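A batch-mode backpropagation sketch for a one-hidden-layer network with logistic activations at both layers; the initialization scale, learning rate, and toy data in the test are illustrative assumptions, not from the slides:

```python
import numpy as np

def train_batch(X, Y, J=3, eta=0.5, epochs=200, seed=0):
    """Batch-mode backpropagation for one hidden layer.

    Weights change only after all N patterns are presented:
    Delta w = -eta * dE_av/dw, with E_av the average squared error.
    Returns the per-epoch E_av values.
    """
    rng = np.random.default_rng(seed)
    d, K = X.shape[1], Y.shape[1]
    Wh = rng.normal(scale=0.5, size=(J, d)); bh = np.zeros(J)  # hidden layer
    Wo = rng.normal(scale=0.5, size=(K, J)); bo = np.zeros(K)  # output layer
    f = lambda a: 1.0 / (1.0 + np.exp(-a))                     # logistic
    errors = []
    for _ in range(epochs):
        # forward computation over the whole batch
        Sh = f(X @ Wh.T + bh)                  # (N, J) hidden outputs s_j^h
        So = f(Sh @ Wo.T + bo)                 # (N, K) outputs s_k^o
        errors.append(0.5 * np.mean(np.sum((Y - So) ** 2, axis=1)))  # E_av
        # backward operation: df/da = f(a)(1 - f(a)) for the logistic
        Do = (Y - So) * So * (1 - So)          # output deltas delta_k^o
        Dh = (Do @ Wo) * Sh * (1 - Sh)         # hidden deltas
        # batch updates: Delta w = -eta * dE_av/dw
        Wo += eta * Do.T @ Sh / len(X); bo += eta * Do.mean(axis=0)
        Wh += eta * Dh.T @ X / len(X);  bh += eta * Dh.mean(axis=0)
    return errors
```

Because the gradient is averaged over all N patterns before any weight moves, each epoch makes exactly one descent step on E_av, unlike pattern-mode updates which move after every example.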
[Figure: 2-3-1 network with inputs x1, x2, hidden weights w_11^h, w_21^h, w_12^h, w_22^h, w_13^h, w_23^h, output weights w1, w2, w3, and output s]
Practical Considerations
• Stopping criterion:
  – Threshold on average error
  – Threshold on average gradient
• Number of weights:
  – Depends on the number of input nodes, output nodes, hidden nodes and hidden layers
• Number of hidden nodes:
  – Chosen by the cross-validation method
• Data requirements
• Limitations:
  – Slow convergence
  – Local minima problem
[Figure: error surface over weight w, showing two local minima and the global minimum]
Text Books
1. J. Han and M. Kamber, Data Mining: Concepts and
Techniques, Third Edition, Morgan Kaufmann Publishers,
2011.
2. S. Theodoridis and K. Koutroumbas, Pattern Recognition,
Academic Press, 2009.
3. C. M. Bishop, Pattern Recognition and Machine Learning,
Springer, 2006.
4. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall
of India, 1999.
5. Satish Kumar, Neural Networks - A Class Room Approach,
Second Edition, Tata McGraw-Hill, 2013.
6. S. Haykin, Neural Networks and Learning Machines, Prentice
Hall of India, 2010.