

Methods of Knowledge Engineering


Classification - algorithms, logistic
regression - lecture 4
Adam Szmigielski
aszmigie@pjwstk.edu.pl
materials: ftp(public)://aszmigie/MIW English

Classification tasks

The classification task is to assign an example to one of a set of disjoint classes:

• binary classification - all data are divided into two classes,
• multi-class classification - there are more than two classes.

One versus Rest

One-versus-Rest (OvR) is used to extend binary classification to multi-class problems:

• We use one classifier per class,

• That class is treated as positive, and samples from all other classes as negative,

• To classify data, the n classifiers are applied and the class label with the highest certainty is assigned to the given sample (a minimal sketch follows below).
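A hedged sketch of the one-versus-rest strategy with scikit-learn; the Iris data set and the choice of base classifier are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)                    # 3 classes -> 3 binary classifiers
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(len(ovr.estimators_))                          # one binary classifier per class
print(ovr.predict(X[:5]))                            # label with the highest certainty wins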

Grouping

The grouping (clustering) task requires the selection of a metric (a distance measure between examples; a small example is sketched below).

Two problems are important to solve when dividing data into classes:

• the problem of the number of classes,
• the problem of discrimination - the division of the data into classes.
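An illustrative NumPy example of two common distance measures (the two points are arbitrary):

import numpy as np

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))   # L2 distance: 5.0
manhattan = np.sum(np.abs(a - b))           # L1 distance: 7.0
print(euclidean, manhattan)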

The choice of the number of classes K

• Proper selection of the parameter K is the basis for achieving a balance between overfitting and underfitting,
• The selected distance metric should match the features of the data set.

The k-nearest neighbors algorithm - a lazy learning model

KNN algorithm (a minimal sketch follows below):
1. Select a value of the parameter k and a distance metric.
2. Find the k nearest neighbors of the sample you want to classify.
3. Assign the class label by majority voting.
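A minimal NumPy sketch of these three steps, assuming Euclidean distance and majority voting (illustrative, not optimized):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # X_train: 2-D array of samples, y_train: NumPy array of class labels
    # Step 2: find the k nearest neighbors of the sample x (Euclidean distance)
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Step 3: assign the class label by majority voting
    return Counter(y_train[nearest]).most_common(1)[0][0]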

The KNN algorithm - an example for K = 5 neighbors

• The KNN algorithm belongs to the non-parametric models,

• The KNN algorithm does not learn a discriminative function from the training data, but instead tries to "remember" the entire set of samples.

The K-means algorithm

• Step 1: Select the number of clusters K,
• Step 2: Select any K points that will be the K centers of gravity (centroids),
• Step 3: For all points, calculate the distance to the K centers of gravity,
• Step 4: Assign each point to one of the K groups. A point belongs to the group whose center of gravity is nearest. If no point has changed its group, the algorithm ends,
• Step 5: For each of the K groups of points, calculate the center of gravity. Go to Step 3 (a minimal sketch of the whole procedure follows below).
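A minimal NumPy sketch of the procedure above; the initialization with randomly chosen data points and the stopping rule follow the steps, while the assumption that no group becomes empty is a simplification:

import numpy as np

def k_means(X, K, n_iter=100, seed=1):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), K, replace=False)]      # Steps 1-2: pick K starting centers
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 3: distances of every point to the K centers of gravity
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)                   # Step 4: nearest-center assignment
        if np.array_equal(new_labels, labels):
            break                                          # no point changed its group -> stop
        labels = new_labels
        # Step 5: recompute each center of gravity (assumes no group becomes empty)
        centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return centers, labels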

The 3-means algorithm - an example


Choice of 3 starting points - centroid squares

Allocation of points to classes



The 3-means algorithm - an example


Determination of centers of gravity

New allocation of points to classes



The 3-means algorithm - an example


Setting new centers of gravity

Allocation of points to classes and determination of the centers of gravity



KNN vs. K-means comparison

KNN algorithm
• a classification algorithm,
• finds the k data points nearest to the query point X and uses them to decide which class X belongs to,
• classifying a point does not change any class assignments,
• requires only the computation of distances; there is no separate training phase.
K-means algorithm
• a grouping (clustering) algorithm,
• uses the distances of the data points to the k centroids,
• centroids are not necessarily data points,
• updates the centroids after each iteration,
• must iterate over the data until the centroids stabilize.

Perceptron as a binary classifier

• We refer to two classes: 1 (positive class) and -1 (negative class),

• The perceptron calculates the weighted sum z = w1·x1 + ... + wm·xm = wT·x of the inputs x and the weights w, and then applies the activation function φ(z),

• The activation function φ used in the perceptron is a simple step function: it returns 1 if the threshold is exceeded and −1 otherwise,

• The threshold value can be implemented as an additional, fixed input (bias).

The Rosenblatt rule of perceptron learning

1. Initialize the weights to 0 or to small random values.

2. For each sample x^i do the following:
(a) Calculate the output value ŷ^i.
(b) Update the weights.
The weights are corrected by a factor proportional to the error:

Δw_j = η · (y^i − ŷ^i) · x^i_j

where η is the learning rate. The new value of the weight w_j after the correction for sample i is:

w_j ⇐ w_j + Δw_j

Way of learning the perceptron

• If the desired output value y^i for the i-th sample is the class label (i.e. -1 or 1) and the neuron responds with the value ŷ^i, then the response difference can be calculated,
• The weights should be corrected by a factor proportional to the error

Δw_j = η · (y^i − ŷ^i) · x^i_j

where η is the learning rate,
• The new value of the weight w_j after the correction for sample i is (see the sketch below):

w_j ⇐ w_j + Δw_j
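A minimal NumPy sketch of this update rule (step activation, labels −1/1; the learning rate, the epoch count and the assumption of a linearly separable data set are illustrative):

import numpy as np

def train_perceptron(X, y, eta=0.1, n_epochs=10):
    w = np.zeros(1 + X.shape[1])                  # w[0] plays the role of the bias input
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            # step activation: 1 if the weighted sum exceeds the threshold, -1 otherwise
            y_hat = np.where(np.dot(xi, w[1:]) + w[0] >= 0.0, 1, -1)
            delta = eta * (yi - y_hat)            # Δw = η (y − ŷ)
            w[1:] += delta * xi                   # update feature weights
            w[0] += delta                         # update bias
    return w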

Linear separation using the perceptron

• Convergence of the perceptron is guaranteed only if the two classes are linearly separable,
• If the two classes cannot be separated by a linear decision boundary, you can set a maximum number of epochs or a tolerance threshold for misclassifications.

Multi-class classification

• When the data are divided into many classes, several binary classifiers can be used,
• A separate classifier can be used for each class (one-versus-rest).

The problem of linear data separability

• The perceptron can only separate data linearly,

• If the data cannot be separated linearly, other techniques should be used, such as a transformation of the data or classification with a certain probability.

Support Vector Machine

• We can treat the SVM as an extension of the perceptron model,
• The optimization goal of the SVM model is to maximize the margin,
• The margin is defined as the distance between the separating hyperplane (the decision boundary) and the training samples closest to it (the so-called support vectors).

Regularization

• Variance measures how consistent the model's output for a given sample is when the model is trained multiple times,

• Bias is a measure of systematic error that is independent of randomness,

• For regularization to be carried out properly, all features must be brought to a uniform scale (e.g. by standardization),

• The most popular form of regularization is the so-called L2 regularization, also sometimes called weight decay:

(λ/2)·||w||² = (λ/2)·Σ_j w_j²     (1)

Support Vector Machine - problem definition

For the positive hyperplane we have w0 + wT·x_pos = 1 and for the negative one, respectively, w0 + wT·x_neg = −1. Subtracting the two equations gives:

wT·(x_pos − x_neg) = 2     (2)

We can normalize this by the length of the vector w,

||w|| = √(Σ_j w_j²)

After normalization, equation (2) becomes:

wT·(x_pos − x_neg) / ||w|| = 2 / ||w||

The left-hand side of this equation can be interpreted as the distance between the positive and the negative hyperplane, i.e. the margin we want to maximize:

wT·(x_pos − x_neg) / ||w|| ⇒ max

which is achieved by maximizing 2/||w||.

Instead of maximizing 2/||w||, we can minimize ||w|| (or its square ||w||²) and "soften" the hyperplane equations by introducing additional slack variables ζ^i:

wT·x^i ≥ 1 − ζ^i     when y^i = 1
wT·x^i ≤ −1 + ζ^i    when y^i = −1

The minimization objective

(1/2)·||w||² + C·Σ_i ζ^i ⇒ min

therefore has additional constraints that can be adjusted by changing the parameter C.
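In practice this optimization is handled by a library. A hedged sketch with scikit-learn's SVC, where the C argument plays the role of the parameter C above (the Iris data set and the value C=1.0 are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)        # features on a uniform scale
svm = SVC(kernel='linear', C=1.0)                # smaller C -> wider, "softer" margin
svm.fit(X_std, y)
print(svm.support_vectors_.shape)                # the support vectors that define the margin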

The result of the operation using SVM



Solving non-linear problems using the SVM kernel

• The SVM also gives you the ability to solve non-linear classification problems,

• In methods using kernel functions, the basic idea for dealing with linearly inseparable data is to create non-linear combinations of the original features and project them, with a mapping function φ, onto a higher-dimensional space in which they become linearly separable.

Interpretation of the operation of the SVM kernel functions

φ(x1, x2) = (z1, z2, z3) = (x1, x2, x1² + x2²)
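As an illustration of this mapping (the sample points below are an assumption, not taken from the lecture): points lying inside and outside a circle are not linearly separable in (x1, x2), but become separable along the new z3 axis:

import numpy as np

X = np.array([[0.1, 0.2], [0.2, -0.1],     # inside the circle (class 0)
              [1.5, 1.0], [-1.2, 1.3]])    # outside the circle (class 1)
z3 = X[:, 0] ** 2 + X[:, 1] ** 2           # z3 = x1^2 + x2^2
print(z3)                                  # a simple threshold on z3 separates the two classes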



SVM kernel functions for the XOR problem - example

A data set generated using the XOR gate and the decision boundary generated using the SVM kernel.
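A hedged sketch of such an experiment (the data generation and the values of γ and C are illustrative assumptions, not necessarily the settings used for the figure):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X_xor = rng.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0).astype(int)  # XOR-style labels

svm = SVC(kernel='rbf', gamma=0.10, C=10.0)
svm.fit(X_xor, y_xor)
print(svm.score(X_xor, y_xor))   # the non-linear decision boundary fits the XOR pattern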

Kernel of the Radial Basis Function for the Iris data

k(x^i, x^j) = exp(−||x^i − x^j||² / (2σ²)) ≈ exp(−γ·||x^i − x^j||²)

Kernel functions for different values of γ = 1/σ², small and large.
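A small sketch computing the kernel value for two concrete points (the points and the value of γ are arbitrary assumptions):

import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    # similarity decays with the squared distance between the samples
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

print(rbf_kernel(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # identical points -> 1.0
print(rbf_kernel(np.array([1.0, 0.0]), np.array([4.0, 4.0])))  # distant points -> close to 0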

Classification with probability - logistic regression

• The perceptron will only classify data that can be separated linearly,

• Otherwise, perceptron learning will never terminate; this can be prevented by limiting the number of epochs or by setting a tolerance on the classification accuracy,
• Logistic regression is an algorithm for linear, binary classification problems,
• Despite its name, logistic regression is a classification model, not a regression model,
• Classification by means of logistic regression is performed with a certain probability.

Logit function

• Odds ratio - the odds of a given event occurring can be expressed by the formula p/(1 − p), where p is the probability of the positive event,
• The logit function logit(p) is the logarithm of the odds ratio:

logit(p) = log(p / (1 − p))

• The logit function accepts input values ranging from 0 to 1 and converts them into values from the full range of real numbers,
• We can use the logit function to model a linear relationship between the feature values and the odds, expressed as a logarithm:

logit(p(y = 1|x)) = w0·x0 + w1·x1 + ... + wn·xn = wT·x

where p(y = 1|x) is the conditional probability that a given sample belongs to class 1, given its features x.

Logistic function (sigmoid)

• We are interested in predicting the probability that a sample belongs to a particular class, which is the inverse of the logit function,

• This inverse is the logistic function, also known as the sigmoid (s-shaped) function, φ(z) = 1 / (1 + e^(−z)) (sketched below),

• z is the weighted sum z = w0·x0 + w1·x1 + ... + wn·xn = wT·x.
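A minimal sketch of the sigmoid function; the probe values of z are arbitrary:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx. [0.0067, 0.5, 0.9933]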

Logistic regression model

• In the logistic regression model, the role of the activation function is played by the sigmoid function,

• The value of the sigmoid function is interpreted as the probability that a given sample belongs to class 1, φ(z) = p(y = 1|x, w), where x are the features of the sample and w are the weights.

Cost function of logistic regression

• In the perceptron, the cost function was the sum of squared errors

Σ_i (φ(z^i) − y^i)²

• The likelihood L(w) (for independent samples) is:

L(w) = p(y|x, w) = Π_i p(y^i|x^i, w) = Π_i φ(z^i)^(y^i) · (1 − φ(z^i))^(1−y^i)

• The logarithm of the likelihood is:

l(w) = log(L(w)) = Σ_i [y^i·log(φ(z^i)) + (1 − y^i)·log(1 − φ(z^i))]

• Logistic regression uses −l(w) as the cost function J(w); working with this sum of logarithms, instead of a product of probabilities, reduces the risk of numerical problems such as underflow.

Minimizing the cost function

• For all samples i, the cost is

J(w) = Σ_i [−y^i·log(φ(z^i)) − (1 − y^i)·log(1 − φ(z^i))]

• For a single sample y the cost is (illustrated in the sketch below):

J(w) = −y·log(φ(z)) − (1 − y)·log(1 − φ(z))
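A small sketch of how this single-sample cost behaves: for y = 1 the cost grows as φ(z) approaches 0, and for y = 0 it grows as φ(z) approaches 1 (the probe probabilities are arbitrary):

import numpy as np

phi = np.array([0.1, 0.5, 0.9])       # predicted probabilities φ(z)
cost_y1 = -np.log(phi)                # cost when the true label is y = 1
cost_y0 = -np.log(1 - phi)            # cost when the true label is y = 0
print(cost_y1)                        # large for φ(z)=0.1, small for φ(z)=0.9
print(cost_y0)                        # small for φ(z)=0.1, large for φ(z)=0.9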



Learning the logistic regression model

Learning the model consists in minimizing the cost function J(w).
• The derivative of the activation function φ with respect to z is:

∂φ(z)/∂z = ∂/∂z [1/(1 + e^(−z))] = e^(−z)/(1 + e^(−z))² = φ(z)·(1 − φ(z))

• For the weight w_j, the gradient of the log-likelihood l(w) is:

∂l(w)/∂w_j = [y·(1/φ(z)) − (1 − y)·(1/(1 − φ(z)))] · ∂φ(z)/∂w_j = ... = (y − φ(z))·x_j

• Since minimizing J(w) = −l(w) is equivalent to maximizing l(w), the effect of all samples i on the weight w_j after the correction is:

w_j ⇐ w_j + η·Σ_i (y^i − φ(z^i))·x^i_j = w_j + Δw_j

Implementation of Logistic Regression in Python

import numpy as np

class LogisticRegressionGD(object):
    """Logistic regression classifier trained with gradient descent."""

    def __init__(self, eta=0.05, n_iter=100, random_state=1):
        self.eta = eta                      # learning rate
        self.n_iter = n_iter                # number of passes over the training data
        self.random_state = random_state

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        # w_[0] is the bias unit, w_[1:] are the feature weights
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            net_input = self.net_input(X)
            output = self.activation(net_input)
            errors = (y - output)
            # gradient step: w <- w + eta * sum_i (y^i - phi(z^i)) * x^i
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            # logistic cost J(w) = sum_i [-y^i log(phi(z^i)) - (1 - y^i) log(1 - phi(z^i))]
            cost = (-y.dot(np.log(output)) - ((1 - y).dot(np.log(1 - output))))
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, z):
        # sigmoid function, clipped to avoid overflow in exp
        return 1. / (1. + np.exp(-np.clip(z, -250, 250)))

    def predict(self, X):
        return np.where(self.net_input(X) >= 0.0, 1, 0)
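A hedged usage sketch of the class above on a two-class subset of the Iris data (the feature selection, the standardization and the hyperparameter values are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_sub = X[y != 2][:, [0, 2]]                                 # two features, classes 0 and 1 only
y_sub = y[y != 2]
X_std = (X_sub - X_sub.mean(axis=0)) / X_sub.std(axis=0)     # standardized features

lrgd = LogisticRegressionGD(eta=0.05, n_iter=1000, random_state=1)
lrgd.fit(X_std, y_sub)
print(lrgd.predict(X_std[:5]))                               # predicted class labels (0 or 1)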

Regularization in logistic regression

• To perform regularization, it is enough to add an appropriate term to the cost function J(w), which will be used to shrink the weights:

J(w) = C·Σ_i [−y^i·log(φ(z^i)) − (1 − y^i)·log(1 − φ(z^i))] + (1/2)·||w||²

• The parameter C is the inverse of the parameter λ in formula (1): C = 1/λ,

• In the sklearn library the parameter C can be set directly:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1000.0, random_state=1)
lr.fit(X_train_std, y_train)

Control of the regularization strength

The weight coefficients shrink as the value of the parameter C decreases, i.e. as the regularization strength increases.
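A hedged sketch of such an experiment: LogisticRegression is trained for several values of C and the weight coefficients are collected (the data set and the range of C values are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

weights, params = [], []
for c in np.arange(-5, 5):
    lr = LogisticRegression(C=10.0 ** c, random_state=1, max_iter=1000)
    lr.fit(X_std, y)
    weights.append(lr.coef_[1])          # coefficients for one of the classes
    params.append(10.0 ** c)
print(np.array(weights))                 # the coefficients shrink as C decreases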
