

Methods of Knowledge Engineering


Classification - algorithms, logistic
regression - lecture 4
Adam Szmigielski
aszmigie@pjwstk.edu.pl
materials: ftp(public)://aszmigie/MIW English

Classification tasks

The classification task is to assign an example to one of a set of disjoint classes:

• binary classification - all data are divided into two classes,
• multi-class classification - there are more than two classes.

One versus Rest

One-versus-Rest (OvR) is used to extend binary classification to multi-class problems:

• We use one classifier per class,

• That class is treated as positive, and samples from all other classes as negative,

• To classify data, the n classifiers are applied and the class label with the highest certainty is assigned to the given sample (a minimal sketch follows below).
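A hedged sketch of the one-versus-rest strategy with scikit-learn; the Iris data set and the choice of base classifier are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)                    # 3 classes -> 3 binary classifiers
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(len(ovr.estimators_))                          # one binary classifier per class
print(ovr.predict(X[:5]))                            # label with the highest certainty wins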

Grouping

The grouping (clustering) task requires the selection of a metric (a distance measure between examples; a small example is sketched below).

Two problems are important to solve when dividing data into classes:

• the problem of the number of classes,
• the problem of discrimination - the division of the data into classes.
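An illustrative NumPy example of two common distance measures (the two points are arbitrary):

import numpy as np

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))   # L2 distance: 5.0
manhattan = np.sum(np.abs(a - b))           # L1 distance: 7.0
print(euclidean, manhattan)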

The choice of the number of classes K

• Proper selection of the parameter K is the basis for achieving a balance between overfitting and underfitting,
• The selected distance metric should match the features of the data set.

The k-nearest neighbors algorithm - a lazy learning model

KNN algorithm (a minimal sketch follows below):
1. Select a value of the parameter k and a distance metric.
2. Find the k nearest neighbors of the sample you want to classify.
3. Assign the class label by majority voting.
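A minimal NumPy sketch of these three steps, assuming Euclidean distance and majority voting (illustrative, not optimized):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # X_train: 2-D array of samples, y_train: NumPy array of class labels
    # Step 2: find the k nearest neighbors of the sample x (Euclidean distance)
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Step 3: assign the class label by majority voting
    return Counter(y_train[nearest]).most_common(1)[0][0]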

The KNN algorithm - an example for K = 5 neighbors

• The KNN algorithm belongs to the non-parametric models,

• The KNN algorithm does not learn a discriminative function from the training data, but instead tries to "remember" the entire set of samples.

The K-means algorithm

• Step 1: Select the number of clusters K,
• Step 2: Select any K points that will be the K centers of gravity (centroids),
• Step 3: For all points, calculate the distance to the K centers of gravity,
• Step 4: Assign each point to one of the K groups. A point belongs to the group whose center of gravity is nearest. If no point has changed its group, the algorithm ends,
• Step 5: For each of the K groups of points, calculate the center of gravity. Go to Step 3 (a minimal sketch of the whole procedure follows below).
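A minimal NumPy sketch of the procedure above; the initialization with randomly chosen data points and the stopping rule follow the steps, while the assumption that no group becomes empty is a simplification:

import numpy as np

def k_means(X, K, n_iter=100, seed=1):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), K, replace=False)]      # Steps 1-2: pick K starting centers
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 3: distances of every point to the K centers of gravity
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)                   # Step 4: nearest-center assignment
        if np.array_equal(new_labels, labels):
            break                                          # no point changed its group -> stop
        labels = new_labels
        # Step 5: recompute each center of gravity (assumes no group becomes empty)
        centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return centers, labels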

The 3-means algorithm - an example


Choice of 3 starting points - centroid squares

Allocation of points to classes



The 3-means algorithm - an example


Determination of centers of gravity

New allocation of points to classes



The 3-means algorithm - an example


Setting new centers of gravity

Allocation of points to classes and determination of the centers of gravity



KNN vs. K-means comparison

KNN algorithm
• a classification algorithm,
• finds the k data points nearest to the query point X and uses them to decide which class X belongs to,
• classifying a point does not change any class assignments,
• requires only the computation of distances; there is no separate training phase.
K-means algorithm
• a grouping (clustering) algorithm,
• uses the distances of the data points to the k centroids,
• centroids are not necessarily data points,
• updates the centroids after each iteration,
• must iterate over the data until the centroids stabilize.

Perceptron as a binary classifier

• We refer to two classes: 1 (positive class) and -1 (negative class),

• The perceptron calculates the weighted sum z = w1·x1 + ... + wm·xm = wT·x of the inputs x and the weights w, and then applies the activation function φ(z),

• The activation function φ used in the perceptron is a simple step function: it returns 1 if the threshold is exceeded and −1 otherwise,

• The threshold value can be implemented as an additional, fixed input (bias).

The Rosenblatt rule of perceptron learning

1. Initialize the weights to 0 or to small random values.

2. For each sample x^i do the following:
(a) Calculate the output value ŷ^i.
(b) Update the weights.
The weights are corrected by a factor proportional to the error:

Δw_j = η · (y^i − ŷ^i) · x^i_j

where η is the learning rate. The new value of the weight w_j after the correction for sample i is:

w_j ⇐ w_j + Δw_j

Way of learning the perceptron

• If the desired output value y^i for the i-th sample is the class label (i.e. -1 or 1) and the neuron responds with the value ŷ^i, then the response difference can be calculated,
• The weights should be corrected by a factor proportional to the error

Δw_j = η · (y^i − ŷ^i) · x^i_j

where η is the learning rate,
• The new value of the weight w_j after the correction for sample i is (see the sketch below):

w_j ⇐ w_j + Δw_j
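A minimal NumPy sketch of this update rule (step activation, labels −1/1; the learning rate, the epoch count and the assumption of a linearly separable data set are illustrative):

import numpy as np

def train_perceptron(X, y, eta=0.1, n_epochs=10):
    w = np.zeros(1 + X.shape[1])                  # w[0] plays the role of the bias input
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            # step activation: 1 if the weighted sum exceeds the threshold, -1 otherwise
            y_hat = np.where(np.dot(xi, w[1:]) + w[0] >= 0.0, 1, -1)
            delta = eta * (yi - y_hat)            # Δw = η (y − ŷ)
            w[1:] += delta * xi                   # update feature weights
            w[0] += delta                         # update bias
    return w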

Linear separation using the perceptron

• Convergence of the perceptron is guaranteed only if the two classes are linearly separable,
• If the two classes cannot be separated by a linear decision boundary, you can set a maximum number of epochs or a tolerance threshold for misclassifications.

Multi-class classification

• When the data are divided into many classes, several binary classifiers can be used,
• A separate classifier can be used for each class (one-versus-rest).

The problem of linear data separability

• The perceptron can only separate data linearly,

• If the data cannot be separated linearly, other techniques should be used, such as a transformation of the data or classification with a certain probability.

Support Vector Machine

• We can treat the SVM as an extension of the perceptron model,
• The optimization goal of the SVM model is to maximize the margin,
• The margin is defined as the distance between the separating hyperplane (the decision boundary) and the training samples closest to it (the so-called support vectors).

Regularization

• Variance measures how consistent the model's output for a given sample is when the model is trained multiple times,

• Bias is a measure of systematic error that is independent of randomness,

• For regularization to be carried out properly, all features must be brought to a uniform scale (e.g. by standardization),

• The most popular form of regularization is the so-called L2 regularization, also sometimes called weight decay:

(λ/2)·||w||² = (λ/2)·Σ_j w_j²     (1)

Support Vector Machine - problem definition

For the positive hyperplane we have w0 + wT·x_pos = 1 and for the negative one, respectively, w0 + wT·x_neg = −1. Subtracting the two equations gives:

wT·(x_pos − x_neg) = 2     (2)

We can normalize this by the length of the vector w,

||w|| = √(Σ_j w_j²)

After normalization, equation (2) becomes:

wT·(x_pos − x_neg) / ||w|| = 2 / ||w||

The left-hand side of this equation can be interpreted as the distance between the positive and the negative hyperplane, i.e. the margin we want to maximize:

wT·(x_pos − x_neg) / ||w|| ⇒ max

which is achieved by maximizing 2/||w||.

Instead of maximizing 2/||w||, we can minimize ||w|| (or its square ||w||²) and "soften" the hyperplane equations by introducing additional slack variables ζ^i:

wT·x^i ≥ 1 − ζ^i     when y^i = 1
wT·x^i ≤ −1 + ζ^i    when y^i = −1

The minimization objective

(1/2)·||w||² + C·Σ_i ζ^i ⇒ min

therefore has additional constraints that can be adjusted by changing the parameter C.
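In practice this optimization is handled by a library. A hedged sketch with scikit-learn's SVC, where the C argument plays the role of the parameter C above (the Iris data set and the value C=1.0 are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)        # features on a uniform scale
svm = SVC(kernel='linear', C=1.0)                # smaller C -> wider, "softer" margin
svm.fit(X_std, y)
print(svm.support_vectors_.shape)                # the support vectors that define the margin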

The result of the operation using SVM



Solving non-linear problems using the SVM kernel

• The SVM also gives you the ability to solve non-linear classification problems,

• In methods using kernel functions, the basic idea for dealing with linearly inseparable data is to create non-linear combinations of the original features and project them, with a mapping function φ, onto a higher-dimensional space in which they become linearly separable.

Interpretation of the operation of the SVM kernel functions

φ(x1, x2) = (z1, z2, z3) = (x1, x2, x1² + x2²)
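As an illustration of this mapping (the sample points below are an assumption, not taken from the lecture): points lying inside and outside a circle are not linearly separable in (x1, x2), but become separable along the new z3 axis:

import numpy as np

X = np.array([[0.1, 0.2], [0.2, -0.1],     # inside the circle (class 0)
              [1.5, 1.0], [-1.2, 1.3]])    # outside the circle (class 1)
z3 = X[:, 0] ** 2 + X[:, 1] ** 2           # z3 = x1^2 + x2^2
print(z3)                                  # a simple threshold on z3 separates the two classes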



SVM kernel functions for the XOR problem - example

A data set generated using the XOR gate and the decision boundary generated using the SVM kernel.
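A hedged sketch of such an experiment (the data generation and the values of γ and C are illustrative assumptions, not necessarily the settings used for the figure):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X_xor = rng.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0).astype(int)  # XOR-style labels

svm = SVC(kernel='rbf', gamma=0.10, C=10.0)
svm.fit(X_xor, y_xor)
print(svm.score(X_xor, y_xor))   # the non-linear decision boundary fits the XOR pattern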

Kernel of the Radial Basis Function for the Iris data

k(x^i, x^j) = exp(−||x^i − x^j||² / (2σ²)) ≈ exp(−γ·||x^i − x^j||²)

Kernel functions for different values of γ = 1/σ², small and large.
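A small sketch computing the kernel value for two concrete points (the points and the value of γ are arbitrary assumptions):

import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    # similarity decays with the squared distance between the samples
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

print(rbf_kernel(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # identical points -> 1.0
print(rbf_kernel(np.array([1.0, 0.0]), np.array([4.0, 4.0])))  # distant points -> close to 0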

Classification with probability - logistic regression

• The perceptron will only classify data that can be separated linearly,

• Otherwise, perceptron learning will never terminate; this can be prevented by limiting the number of epochs or by setting a tolerance on the classification accuracy,
• Logistic regression is an algorithm for linear, binary classification problems,
• Despite its name, logistic regression is a classification model, not a regression model,
• Classification by means of logistic regression is performed with a certain probability.

Logit function

• Odds ratio - the odds of a given event occurring can be expressed by the formula p/(1 − p), where p is the probability of the positive event,
• The logit function logit(p) is the logarithm of the odds ratio:

logit(p) = log(p / (1 − p))

• The logit function accepts input values ranging from 0 to 1 and converts them into values from the full range of real numbers,
• We can use the logit function to model a linear relationship between the feature values and the odds, expressed as a logarithm:

logit(p(y = 1|x)) = w0·x0 + w1·x1 + ... + wn·xn = wT·x

where p(y = 1|x) is the conditional probability that a given sample belongs to class 1, given its features x.

Logistic function (sigmoid)

• We are interested in predicting the probability that a sample belongs to a particular class, which is the inverse of the logit function,

• This inverse is the logistic function, also known as the sigmoid (s-shaped) function, φ(z) = 1 / (1 + e^(−z)) (sketched below),

• z is the weighted sum z = w0·x0 + w1·x1 + ... + wn·xn = wT·x.
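A minimal sketch of the sigmoid function; the probe values of z are arbitrary:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx. [0.0067, 0.5, 0.9933]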

Logistic regression model

• In the logistic regression model, the role of the activation function is played by the sigmoid function,

• The value of the sigmoid function is interpreted as the probability that a given sample belongs to class 1, φ(z) = p(y = 1|x, w), where x are the features of the sample and w are the weights.

Cost function of logistic regression

• In the perceptron, the cost function was the sum of squared errors

Σ_i (φ(z^i) − y^i)²

• The likelihood L(w) (for independent samples) is:

L(w) = p(y|x, w) = Π_i p(y^i|x^i, w) = Π_i φ(z^i)^(y^i) · (1 − φ(z^i))^(1−y^i)

• The logarithm of the likelihood is:

l(w) = log(L(w)) = Σ_i [y^i·log(φ(z^i)) + (1 − y^i)·log(1 − φ(z^i))]

• Logistic regression uses −l(w) as the cost function J(w); working with this sum of logarithms, instead of a product of probabilities, reduces the risk of numerical problems such as underflow.

Minimizing the cost function

• For all samples i, the cost is

J(w) = Σ_i [−y^i·log(φ(z^i)) − (1 − y^i)·log(1 − φ(z^i))]

• For a single sample y the cost is (illustrated in the sketch below):

J(w) = −y·log(φ(z)) − (1 − y)·log(1 − φ(z))
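A small sketch of how this single-sample cost behaves: for y = 1 the cost grows as φ(z) approaches 0, and for y = 0 it grows as φ(z) approaches 1 (the probe probabilities are arbitrary):

import numpy as np

phi = np.array([0.1, 0.5, 0.9])       # predicted probabilities φ(z)
cost_y1 = -np.log(phi)                # cost when the true label is y = 1
cost_y0 = -np.log(1 - phi)            # cost when the true label is y = 0
print(cost_y1)                        # large for φ(z)=0.1, small for φ(z)=0.9
print(cost_y0)                        # small for φ(z)=0.1, large for φ(z)=0.9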



Learning the logistic regression model

Learning the model consists in minimizing the cost function J(w).
• The derivative of the activation function φ with respect to z is:

∂φ(z)/∂z = ∂/∂z [1/(1 + e^(−z))] = e^(−z)/(1 + e^(−z))² = φ(z)·(1 − φ(z))

• For the weight w_j, the gradient of the log-likelihood l(w) is:

∂l(w)/∂w_j = [y·(1/φ(z)) − (1 − y)·(1/(1 − φ(z)))] · ∂φ(z)/∂w_j = ... = (y − φ(z))·x_j

• Since minimizing J(w) = −l(w) is equivalent to maximizing l(w), the effect of all samples i on the weight w_j after the correction is:

w_j ⇐ w_j + η·Σ_i (y^i − φ(z^i))·x^i_j = w_j + Δw_j

Implementation of Logistic Regression in Python

import numpy as np

class LogisticRegressionGD(object):
    """Logistic regression classifier trained with gradient descent."""

    def __init__(self, eta=0.05, n_iter=100, random_state=1):
        self.eta = eta                      # learning rate
        self.n_iter = n_iter                # number of passes over the training data
        self.random_state = random_state

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        # w_[0] is the bias unit, w_[1:] are the feature weights
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            net_input = self.net_input(X)
            output = self.activation(net_input)
            errors = (y - output)
            # gradient step: w <- w + eta * sum_i (y^i - phi(z^i)) * x^i
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            # logistic cost J(w) = sum_i [-y^i log(phi(z^i)) - (1 - y^i) log(1 - phi(z^i))]
            cost = (-y.dot(np.log(output)) - ((1 - y).dot(np.log(1 - output))))
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, z):
        # sigmoid function, clipped to avoid overflow in exp
        return 1. / (1. + np.exp(-np.clip(z, -250, 250)))

    def predict(self, X):
        return np.where(self.net_input(X) >= 0.0, 1, 0)
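A hedged usage sketch of the class above on a two-class subset of the Iris data (the feature selection, the standardization and the hyperparameter values are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_sub = X[y != 2][:, [0, 2]]                                 # two features, classes 0 and 1 only
y_sub = y[y != 2]
X_std = (X_sub - X_sub.mean(axis=0)) / X_sub.std(axis=0)     # standardized features

lrgd = LogisticRegressionGD(eta=0.05, n_iter=1000, random_state=1)
lrgd.fit(X_std, y_sub)
print(lrgd.predict(X_std[:5]))                               # predicted class labels (0 or 1)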

Regularization in logistic regression

• To perform regularization, it is enough to add an appropriate term to the cost function J(w), which will be used to shrink the weights:

J(w) = C·Σ_i [−y^i·log(φ(z^i)) − (1 − y^i)·log(1 − φ(z^i))] + (1/2)·||w||²

• The parameter C is the inverse of the parameter λ in formula (1): C = 1/λ,

• In the sklearn library the parameter C can be set directly:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1000.0, random_state=1)
lr.fit(X_train_std, y_train)

Control of the regularization strength

The weight coefficients shrink as the value of the parameter C decreases, i.e. as the regularization strength increases.
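A hedged sketch of such an experiment: LogisticRegression is trained for several values of C and the weight coefficients are collected (the data set and the range of C values are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

weights, params = [], []
for c in np.arange(-5, 5):
    lr = LogisticRegression(C=10.0 ** c, random_state=1, max_iter=1000)
    lr.fit(X_std, y)
    weights.append(lr.coef_[1])          # coefficients for one of the classes
    params.append(10.0 ** c)
print(np.array(weights))                 # the coefficients shrink as C decreases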
