
Unit-II

Supervised Machine Learning

VENKATARAMANAIAH G
Assistant Professor, Dept. of ECE

NBKR Institute of Science & Technology


Vidyanagar, S.P.S.R. Nellore
Introduction
The majority of practical machine learning uses supervised learning.
In supervised learning we have input variables (x) and an output variable (Y), and we use an algorithm to learn the mapping function from the input to the output, Y = f(X).
The goal is to approximate the mapping function so well that, when new input data (x) arrive, we can predict the output variables (Y) for that data.
Supervised machine learning techniques include linear and logistic regression, multi-class classification, decision trees and support vector machines.
Supervised learning requires that the data used to train the
algorithm is already labeled with correct answers.
For example, a classification algorithm will learn to
identify animals after being trained on a dataset of
images that are properly labeled with the species of the
animal and some identifying characteristics.
Supervised learning problems can be further grouped into Regression and Classification problems.
Both problems involve constructing a model that can predict the value of the dependent attribute from the input attribute variables.
The difference between the two tasks is that the dependent attribute is numerical for regression and categorical for classification.
Classification
The goal in classification is to take an input vector x and to assign it
to one of K discrete classes Ck where k = 1, . . . , K. In the most
common scenario, the classes are taken to be disjoint, so that each
input is assigned to one and only one class.
The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
In linear models for classification, the decision surfaces are linear functions of the input vector x and hence are defined by (D - 1)-dimensional hyperplanes within the D-dimensional input space.
 Data sets whose classes can be separated exactly by linear decision
surfaces are said to be linearly separable.
In a classification problem, there are various ways of using target
values to represent class labels.
For probabilistic models, the most convenient representation in the case of two-class problems is the binary representation, in which there is a single target variable t ∈ {0, 1} such that t = 1 represents class C1 and t = 0 represents class C2.
For K > 2 classes, it is convenient to use a 1-of-K coding scheme in
which t is a vector of length K such that if the class is Cj, then all
elements tk of t are zero except element tj, which takes the value 1.
For instance, if we have K = 5 classes, then a pattern from class 2 would be t = (0, 1, 0, 0, 0)^T.
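As a small illustration of this coding scheme (the helper function below and its NumPy implementation are a sketch, not part of the slides):

```python
import numpy as np

def one_of_k(labels, K):
    """Encode integer class labels 1..K as 1-of-K (one-hot) target vectors."""
    labels = np.asarray(labels)
    T = np.zeros((labels.size, K))
    T[np.arange(labels.size), labels - 1] = 1.0   # class j -> element t_j set to 1
    return T

# A pattern from class 2 among K = 5 classes
print(one_of_k([2], K=5))   # [[0. 1. 0. 0. 0.]]
```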
Discriminant Functions –
Two class classification problem
A discriminant is a function that takes an input vector x and
assigns it to one of K classes, denoted Ck.
The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector, so that

y(x) = w^T x + w0

where w is called a weight vector and w0 is a bias. The negative of the bias is sometimes called a threshold. An input vector x is assigned to class C1 if y(x) ≥ 0 and to class C2 otherwise.
The corresponding decision boundary is therefore defined by the relation y(x) = 0, which corresponds to a (D - 1)-dimensional hyperplane within the D-dimensional input space.
Consider two points xA and xB, both of which lie on the decision surface.
Because y(xA) = y(xB) = 0, we have w^T (xA - xB) = 0.
Hence the vector w is orthogonal to every vector lying within the decision surface, and so w determines the orientation of the decision surface.
Similarly, if x is a point on the decision surface, then y(x) = 0, and so the normal distance from the origin to the decision surface is given by

w^T x / ||w|| = -w0 / ||w||

Therefore the bias parameter w0 determines the location of the decision surface, as shown in the figure below.
For D = 2, the value of y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface.
To see this, consider an arbitrary point x and let x⊥ be its orthogonal projection onto the decision surface, so that

x = x⊥ + r w/||w||

Multiplying both sides of this result by w^T and adding w0, and using y(x) = w^T x + w0 together with y(x⊥) = 0, we get

r = y(x) / ||w||

For more compact notation we introduce a dummy input value x0 = 1 and define the augmented vectors w̃ = (w0, w) and x̃ = (x0, x), so that

y(x) = w̃^T x̃
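A minimal sketch of this two-class linear discriminant in NumPy; the weight vector, bias, and test point below are invented purely for illustration:

```python
import numpy as np

w = np.array([2.0, -1.0])   # weight vector (invented values)
w0 = 0.5                    # bias

def y(x):
    """Linear discriminant y(x) = w^T x + w0."""
    return w @ x + w0

x = np.array([1.0, 3.0])
print("class C1" if y(x) >= 0 else "class C2")

# Signed perpendicular distance of x from the decision surface: r = y(x) / ||w||
print("signed distance r =", y(x) / np.linalg.norm(w))
```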
Multiple classes
A K-class discriminant can be built by combining a number of two-class discriminant functions, but this leads to some serious difficulties.
Consider the use of K-1 classifiers each of which solves a two-
class problem of separating points in a particular class Ck from
points not in that class. This is known as a one-versus-the-rest
classifier.
The left-hand example in Figure below shows an example
involving three classes where this approach leads to regions of
input space that are ambiguously classified.
An alternative is to introduce K(K - 1)/2 binary discriminant
functions, one for every possible pair of classes.
This is known as a one-versus-one classifier. Each point is then
classified according to a majority vote amongst the
discriminant functions.
However, this too runs into the problem of ambiguous regions, as illustrated in the right-hand diagram of the same figure.
We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form

yk(x) = wk^T x + wk0

and then assigning a point x to class Ck if yk(x) > yj(x) for all j ≠ k. The decision boundary between class Ck and class Cj is therefore given by yk(x) = yj(x) and hence corresponds to a (D - 1)-dimensional hyperplane defined by

(wk - wj)^T x + (wk0 - wj0) = 0

This has the same form as the decision boundary for the two-class case.
The decision regions of such a discriminant are always singly connected and convex. To see this, consider two points xA and xB, both of which lie inside decision region Rk, as illustrated in the figure below. Any point x̂ that lies on the line connecting xA and xB can be expressed in the form

x̂ = λ xA + (1 - λ) xB

where 0 ≤ λ ≤ 1. From the linearity of the discriminant functions, it follows that

yk(x̂) = λ yk(xA) + (1 - λ) yk(xB)

Because both xA and xB lie inside Rk, it follows that yk(xA) > yj(xA) and yk(xB) > yj(xB) for all j ≠ k, and hence yk(x̂) > yj(x̂), so that x̂ also lies inside Rk. Thus Rk is singly connected and convex.
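A short sketch of the single K-class discriminant; the weights below are random placeholders, not learned values:

```python
import numpy as np

K, D = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(K, D))     # row k holds the weight vector w_k (placeholder values)
w0 = rng.normal(size=K)         # biases w_k0

def classify(x):
    """Assign x to the class k with the largest y_k(x) = w_k^T x + w_k0."""
    return int(np.argmax(W @ x + w0))

print("assigned class:", classify(np.array([0.5, -1.2])))
```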
Least squares for classification
Consider a general classification problem with K classes, with a 1-of-K binary
coding scheme for the target vector t.
Each class Ck is described by its own linear model, so that

yk(x) = wk^T x + wk0

where k = 1, . . . , K. We can conveniently group these together using vector notation, so that

y(x) = W̃^T x̃

where W̃ is a matrix whose kth column comprises the (D + 1)-dimensional vector w̃k = (wk0, wk^T)^T, and x̃ is the corresponding augmented input vector (1, x^T)^T with a dummy input x0 = 1.
A new input x is then assigned to the class for which the output yk(x) = w̃k^T x̃ is largest.
Here our aim is to determine the parameter matrix W̃ by minimizing a sum-of-squares error function.
Consider a training data set {xn, tn} where n = 1, . . . , N, and define a matrix T whose nth row is the vector tn^T, together with a matrix X̃ whose nth row is x̃n^T.
The sum-of-squares error function can then be written as

ED(W̃) = ½ Tr{ (X̃W̃ - T)^T (X̃W̃ - T) }

Setting the derivative with respect to W̃ to zero gives the solution

W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃† T

where X̃† is the pseudo-inverse of X̃. The discriminant function then takes the form

y(x) = W̃^T x̃ = T^T (X̃†)^T x̃
The least-squares approach gives an exact closed-form solution for the discriminant function parameters. However, even as a discriminant function, it suffers from some severe problems.
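Before turning to those problems, here is a minimal NumPy sketch of the closed-form solution above, on invented toy data (using `np.linalg.pinv` for the pseudo-inverse):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented toy data: two Gaussian blobs in D = 2 dimensions
N, D, K = 100, 2, 2
X = np.vstack([rng.normal([0, 0], 1.0, (N // 2, D)),
               rng.normal([3, 3], 1.0, (N // 2, D))])
labels = np.repeat([0, 1], N // 2)
T = np.eye(K)[labels]                         # N x K matrix of 1-of-K targets

X_tilde = np.hstack([np.ones((N, 1)), X])     # augmented inputs with dummy x0 = 1
W_tilde = np.linalg.pinv(X_tilde) @ T         # W = (X^T X)^{-1} X^T T

pred = np.argmax(X_tilde @ W_tilde, axis=1)   # assign each input to the largest output
print("training accuracy:", np.mean(pred == labels))
```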
.

Fig.1b
Fig.1a
It is seen that the additional data points in Fig. 1b produce a significant change in the location of the decision boundary, even though these points would be correctly classified by the original decision boundary in Fig. 1a.
The sum-of-squares error function penalizes predictions that are 'too correct', in the sense that they lie a long way on the correct side of the decision boundary.
However, the problems with least squares can be more severe than a simple lack of robustness, as shown in Fig. 2.
Fig. 2 shows a synthetic data set drawn from three classes in a two-dimensional input space (x1, x2), having the property that linear decision boundaries can give excellent separation between the classes, as shown in Fig. 2b.
However, the least-squares solution gives poor results, with only a small region of the input space assigned to the green class.
The failure of least squares should not surprise us when we
recall that it corresponds to maximum likelihood under the
assumption of a Gaussian conditional distribution, whereas binary
target vectors clearly have a distribution that is far from
Gaussian.
By adopting more appropriate probabilistic models, we shall obtain
classification techniques with much better properties than least
squares.

Fig. 2a: decision regions obtained by least squares; Fig. 2b: linear decision boundaries that separate the three classes well.
Fisher’s linear discriminant
One way to view a linear classification model is in terms of
dimensionality reduction.
Consider first the case of two classes, and suppose we take the D-dimensional input vector x and project it down to one dimension using

y = w^T x

We then place a threshold on y and classify points with y ≥ -w0 as class C1, and otherwise as class C2.
In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original D-dimensional space may become strongly overlapping in one dimension.
However, by adjusting the components of the weight vector w, we can select a projection that maximizes the class separation.
For a two-class problem in which there are N1 points of class C1 and N2 points of class C2, the mean vectors of the two classes are given by

m1 = (1/N1) Σ_{n ∈ C1} xn,    m2 = (1/N2) Σ_{n ∈ C2} xn

The simplest measure of the separation of the classes, when projected onto w, is the separation of the projected class means. This suggests that we might choose w so as to maximize

μ2 - μ1 = w^T (m2 - m1)

where μk = w^T mk is the mean of the projected data from class Ck.
To solve this problem, we could constrain w to have unit length, so that Σi wi^2 = 1.
Using a Lagrange multiplier to perform the constrained maximization, we then find that w ∝ (m2 - m1).
However, two classes that are well separated in the original two-dimensional space (x1, x2) can still have considerable overlap when projected onto the line joining their means.
This difficulty arises from the strongly non-diagonal covariances of the class distributions.
The idea proposed by Fisher is to maximize a function that
will give a large separation between the projected class
means while also giving a small variance within each class,
thereby minimizing the class overlap.
The projection y = w^T x transforms the set of labelled data points in x into a labelled set in the one-dimensional space y.
The within-class variance of the transformed data from class Ck is therefore given by

sk^2 = Σ_{n ∈ Ck} (yn - μk)^2,    where yn = w^T xn

and the total within-class variance for the whole data set is simply s1^2 + s2^2.
The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance:

J(w) = (μ2 - μ1)^2 / (s1^2 + s2^2) = (w^T SB w) / (w^T SW w)    ----(1)

where SB is the between-class covariance matrix, given by

SB = (m2 - m1)(m2 - m1)^T    ----(2)

and SW is the total within-class covariance matrix, given by

SW = Σ_{n ∈ C1} (xn - m1)(xn - m1)^T + Σ_{n ∈ C2} (xn - m2)(xn - m2)^T
Differentiating equation (1) with respect to w, we find that J(w) is maximized when

(w^T SB w) SW w = (w^T SW w) SB w    ----(3)

From equation (2) we see that SB w is always in the direction of (m2 - m1), so we can drop the scalar factors (w^T SB w) and (w^T SW w). Multiplying both sides of equation (3) by SW^{-1}, we get

w ∝ SW^{-1} (m2 - m1)

This is known as Fisher's linear discriminant.
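A compact sketch of Fisher's linear discriminant on invented two-class data with correlated features (NumPy only; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
cov = [[1.0, 0.8], [0.8, 1.0]]                    # strongly non-diagonal covariance
X1 = rng.multivariate_normal([0, 0], cov, 100)    # class C1 (invented)
X2 = rng.multivariate_normal([2, 2], cov, 100)    # class C2 (invented)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Total within-class covariance S_W
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher's direction: w proportional to S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

print("projection direction w:", w)
print("projected class means:", (X1 @ w).mean(), (X2 @ w).mean())
```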
The left plot shows samples from two classes (depicted in red and blue) along
with the histograms resulting from projection onto the line joining the class
means. Note that there is considerable class overlap in the projected space.
The right plot shows the corresponding projection based on the Fisher linear
discriminant, showing the greatly improved class separation
Linear Models for Regression
A regression problem is when the output variable is a real or
continuous value, such as “salary” or “weight”.
 Many different models can be used, the simplest is the linear
regression. It tries to fit data with the best hyper plane which
goes through the points.
Regression Analysis is a statistical process for estimating the
relationships between the dependent variables or criterion
variables and one or more independent variables or
predictors.
Regression analysis describes how the criterion variables change in relation to changes in selected predictors.
It estimates the conditional expectation of the criterion given the predictors, i.e. the average value of the dependent variable when the independent variables are varied.
Types of Regression
Linear regression
Logistic regression
Polynomial regression
Stepwise regression
Ridge regression
Lasso regression
ElasticNet regression
Linear regression
Linear regression is used for predictive analysis.
 Linear regression is a linear approach for modeling the
relationship between the criterion or the scalar response
and the multiple predictors or explanatory variables.
Linear regression focuses on the conditional probability distribution of the response given the values of the predictors. For linear regression, there is a danger of overfitting.
Given a dataset of variables (xi,yi) where xi is the
explanatory variable and yi is the dependent variable
that varies as xi does, the simplest model that could be
applied for the relation between two of them is a linear
one.
The simple linear regression model is as follows:

yi = α + β·xi + ϵi

where ϵi is the random component of the regression handling the residue, i.e. the gap between the estimated and actual value of the dependent parameter.
If Y is the estimated value of the dependent variable, it is determined by two parts:
1. The core parameter term α + β·xi, which is not random in nature. α is known as the constant term or the intercept (it is the y-intercept value of the regression line), and β is the coefficient term, or slope, of the regression line.
2. The random component ϵi.
Parameter Estimation- Ordinary
Least Squares
After hypothesizing that Y is linearly related to X, the next
step would be estimating the parameters α & β. While doing
this our main aim always remains in the core idea that Y
must be the best possible estimate of the real data.
Hence we use OLS (ordinary least squares) method to
estimate the parameters.
In this method, the main function used to estimate the parameters is the sum of squares of the errors in the estimate of Y, i.e. the sum of squares of the ϵi values:

Q = Σi (yi - α - β·xi)^2
The above equation is to be minimized to get the best possible estimates for our model, and that is done by equating the first partial derivatives of the above equation w.r.t. α and β to 0.
Solving the resulting system of equations for α and β leads to the following values:

β̂ = Σi (xi - x̄)(yi - ȳ) / Σi (xi - x̄)^2
α̂ = ȳ - β̂·x̄

where x̄ and ȳ are the sample means of x and y.
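These closed-form estimates translate directly into code; the x and y data below are invented to roughly follow y = 2 + 3x:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 50)      # invented data with noise

x_bar, y_bar = x.mean(), y.mean()

# beta = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2);  alpha = y_bar - beta * x_bar
beta = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
alpha = y_bar - beta * x_bar

print(f"alpha ~ {alpha:.2f}, beta ~ {beta:.2f}")   # should be close to 2 and 3
```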
Multiple regression model
 Multiple linear regression (MLR), also known simply as multiple
regression, is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable.
 The goal of multiple linear regression (MLR) is to model the
linear relationship between the explanatory (independent) variables
and response (dependent) variable
 Multiple regression is the extension of ordinary least-squares (OLS)
regression because it involves more than one explanatory variable
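As a brief sketch, an MLR model can be fitted with the same least-squares machinery, here via `np.linalg.lstsq` on invented data with two explanatory variables:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200
X = rng.normal(size=(N, 2))                          # two explanatory variables (invented)
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, N)

X_design = np.hstack([np.ones((N, 1)), X])           # prepend an intercept column
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)  # least-squares fit
print("intercept and coefficients:", coef)           # approx [1.0, 2.0, -0.5]
```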
Polynomial regression
Polynomial regression is used for curvilinear data. Polynomial
regression is fit with the method of least squares.
The goal of regression analysis is to model the expected value of a dependent variable y with respect to the independent variable x.
The basic equations for polynomial regression are

y = β0 + β1 x + β2 x^2 + ϵ
y = β0 + β1 x1 + β2 x2 + β11 x1^2 + β22 x2^2 + β12 x1 x2 + ϵ

which are the second-order polynomials in one and two variables, respectively.
The kth-order polynomial model in one variable is given by

y = β0 + β1 x + β2 x^2 + · · · + βk x^k + ϵ

If we set xj = x^j for j = 1, 2, . . . , k, then this becomes a multiple linear regression model in the k explanatory variables x1, x2, . . . , xk. So the linear regression model includes the polynomial regression model, and the techniques for fitting the linear regression model can also be used for fitting the polynomial regression model.

The model

y = β0 + β1 x + β2 x^2 + ϵ

is a polynomial regression model in one variable and is called a second-order model or quadratic model. The coefficients β1 and β2 are called the linear effect parameter and quadratic effect parameter, respectively.
The parameter β0 is the expected value of y when x = 0, and this interpretation holds provided the range of the data includes x = 0; if x = 0 is not included, then β0 has no such interpretation.

Polynomial models can be used to approximate a complex nonlinear relationship; in such a case the polynomial model acts as a Taylor series expansion of the unknown nonlinear function.
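Because the kth-order model is linear in its coefficients, it can be fitted with the same least-squares code applied to the powers 1, x, ..., x^k; a sketch on invented quadratic data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-2, 2, 60)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(0, 0.2, x.size)   # invented quadratic data

k = 2                                          # second-order (quadratic) model
X = np.vander(x, k + 1, increasing=True)       # design matrix with columns 1, x, x^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("beta0, beta1, beta2 ~", beta)           # approx [1.0, 0.5, -1.5]
```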
Logistic regression
Logistic regression is a supervised learning classification
algorithm used to predict the probability of a target variable.
The nature of target or dependent variable is dichotomous,
which means there would be only two possible classes.
The dependent variable is binary in nature having data coded
as either 1 (stands for success/yes) or 0 (stands for
failure/no).
Mathematically, a logistic regression model predicts P(Y=1)
as a function of X.
 It is one of the simplest ML algorithms that can be used for
various classification problems such as spam detection,
Diabetes prediction, cancer detection etc.
Types of Logistic Regression
Generally, logistic regression means binary logistic regression with a binary target variable, but there are two more categories of target variable that can be predicted by it. Based on the number of categories, logistic regression can be divided into the following types.
Binary or Binomial
In this kind of classification, the dependent variable has only two possible types, either 1 or 0. For example, these variables may represent success or failure, yes or no, win or loss, etc.
Multinomial
In such a kind of classification, dependent variable can have 3 or
more possible unordered types or the types having no quantitative
significance. For example, these variables may represent “Type A”
or “Type B” or “Type C”.
Ordinal
In such a kind of classification, dependent variable can have 3 or
more possible ordered types or the types having a quantitative
significance. For example, these variables may represent “poor”
or “good”, “very good”, “Excellent” and each category can have
the scores like 0,1,2,3.
Logistic Regression Assumptions
In case of binary logistic regression, the target variables must be
binary always and the desired outcome is represented by the
factor level 1.
There should not be any multi-collinearity in the model, which
means the independent variables must be independent of each
other .
We must include meaningful variables in our model.
We should choose a large sample size for logistic regression.
Binary Logistic Regression model
The simplest form of logistic regression is binary or binomial
logistic regression in which the target or dependent variable
can have only 2 possible types either 1 or 0.
It allows us to model a relationship between multiple predictor
variables and a binary/binomial target variable.
The logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. The logistic function is defined as

σ(z) = 1 / (1 + exp(-z))
The step from linear regression to logistic regression is fairly straightforward.
In the linear regression model, we model the relationship between outcome and features with a linear equation:

ŷ = β0 + β1 x1 + · · · + βp xp

For classification, we prefer probabilities between 0 and 1, so we wrap the right-hand side of the equation in the logistic function. This forces the output to take only values between 0 and 1:

P(y = 1) = 1 / (1 + exp(-(β0 + β1 x1 + · · · + βp xp)))

Let us revisit the tumor size example, but instead of the linear regression model we use the logistic regression model.
Classification works better with logistic regression and we can use


0.5 as a threshold in both cases. The inclusion of additional points
does not really affect the estimated curve.
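A minimal sketch of this squashing step; the coefficients below are invented for illustration (e.g. a single "tumor size" feature), not fitted to any data:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -4.0, 1.5          # invented coefficients for one feature

def p_class1(x):
    """P(y = 1 | x) = logistic(beta0 + beta1 * x)."""
    return logistic(beta0 + beta1 * x)

for size in [1.0, 2.5, 4.0]:
    p = p_class1(size)
    print(size, round(p, 3), "class 1" if p >= 0.5 else "class 0")   # 0.5 threshold
```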
Gradient Descent Algorithm – Introduction
Gradient descent is an optimization algorithm that's used
when training a machine learning model. It's based on a convex
function and tweaks its parameters iteratively to minimize a
given function to its local minimum.
Gradient Descent is an optimization algorithm for finding a
local minimum of a differentiable function. Gradient descent is
simply used to find the values of a function's parameters
(coefficients) that minimize a cost function as far as possible.
GRADIENT
 A gradient simply measures the change in all weights with regard to
the change in error. we can also think of a gradient as the slope of a
function. The higher the gradient, the steeper the slope and the faster
a model can learn. But if the slope is zero, the model stops learning.
 In mathematical terms, a gradient is a partial derivative with respect
to its inputs.
 Imagine a blindfolded man who wants to climb to the top of a hill
with the fewest steps along the way as possible.
 He might start climbing the hill by taking really big steps in the
steepest direction, which he can do as long as he is not close to the
top.
 As he comes closer to the top, however, his steps will get smaller and
smaller to avoid overshooting it. This process can be described
mathematically using the gradient.
Imagine the gradient as a vector that contains the direction of the steepest step the blindfolded man can take, and also how long that step should be. The image below illustrates our hill from a top-down view, and the red arrows are the steps of our climber. Think of the gradient in this context:
Note that the gradient from X0 to X1 is much longer than the one from X3 to X4. This is because the steepness (slope) of the hill, which determines the length of the vector, is smaller near the top. This matches the hill example, because the hill gets less steep the higher it is climbed; therefore a reduced gradient goes along with a reduced slope and a reduced step size for the hill climber.
HOW GRADIENT
DESCENT WORKS
Instead of climbing up a hill, think of gradient descent as hiking
down to the bottom of a valley.
This is a better analogy because it is a minimization algorithm that
minimizes a given function.
The equation below describes what gradient descent does: b is the next position of our climber, while a represents his current position. The minus sign refers to the minimization part of gradient descent, the gamma in the middle is a weighting factor (the learning rate), and the gradient term ∇f(a) gives the direction of steepest ascent, so subtracting it moves us in the direction of steepest descent:

b = a - γ ∇f(a)

So this formula basically tells us the next position we need to go to, which is in the direction of the steepest descent.
Example 2
Imagine you have a machine learning problem and want to train your algorithm with gradient descent to minimize your cost function J(w, b) and reach its local minimum by tweaking its parameters (w and b). The image below shows the horizontal axes representing the parameters (w and b), while the cost function J(w, b) is represented on the vertical axis; the cost surface is convex (bowl-shaped).

Here the model prediction is ŷ = wx + b, and the cost J(w, b) measures the error of this prediction on the training data (for example, the mean squared error).
We know we want to find the values of w and b that
correspond to the minimum of the cost function (marked with
the red arrow).
To start finding the right values we initialize w and b with
some random numbers.
Gradient descent then starts at that point (somewhere around
the top of our illustration), and it takes one step after another
in the steepest downside direction (i.e., from the top to the
bottom of the illustration) until it reaches the point where the
cost function is as small as possible.
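A short sketch of this procedure for the model ŷ = wx + b with a mean-squared-error cost J(w, b), on invented data; the learning rate `lr` plays the role of γ discussed in the next section:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 5, 100)
y = 4.0 * x + 1.0 + rng.normal(0, 0.5, 100)     # invented data: roughly y = 4x + 1

w, b = 0.0, 0.0                                 # initialize the parameters
lr = 0.01                                       # learning rate (gamma)

for step in range(2000):
    y_hat = w * x + b
    # Gradients of J(w, b) = mean((y_hat - y)^2) with respect to w and b
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    # Gradient descent update: move against the gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w ~ {w:.2f}, b ~ {b:.2f}")              # should approach 4 and 1
```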
IMPORTANCE OF THE LEARNING RATE
How big the steps are that gradient descent takes in the direction of the local minimum is determined by the learning rate, which figures out how fast or slow we will move towards the optimal weights.
For gradient descent to reach the local minimum, we must set the learning rate to an appropriate value, which is neither too low nor too high.
This is important because if the steps it takes are too big, it may never reach the local minimum, bouncing back and forth across the valley of the convex cost function (see the left image below). If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but that may take a long time (see the right image). So the learning rate should never be too high or too low.
Support Vector Machine (SVM)
Support vector machine is another simple algorithm that every
machine learning expert should have in his/her arsenal. Support
vector machine is highly preferred by many as it produces
significant accuracy with less computation power.
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N = the number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between data points of both classes.
Maximizing the margin distance provides some reinforcement so
that future data points can be classified with more confidence.
Linear Classifiers
A linear classifier has the form

f(x, w, b) = sign(w · x + b)

where x is the data (input) vector, w is the weight vector and b is the bias. The decision boundary is the line w · x + b = 0: points with w · x + b > 0 are assigned the label +1, and points with w · x + b < 0 the label -1.
How would you classify the data? Many different separating lines classify the training data correctly; any of these would be fine, but which is best?

Classifier Margin
Define the margin of a linear classifier as the width by which the decision boundary could be increased before hitting a data point.

Maximum Margin
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are those data points that the margin pushes up against.
Why Maximum Margin?
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors are important; the other training examples are ignorable.
3. Empirically it works very well.
How to calculate the distance from a point to a line?
The decision boundary is the line w · x + b = 0, where x is the input vector, w is the normal vector to the line, and b is a scale (bias) value.
In our two-dimensional case, w1·x1 + w2·x2 + b = 0, with w = (w1, w2) and x = (x1, x2).
The perpendicular distance from a point x to this line is |w · x + b| / ||w||.
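The perpendicular-distance formula can be checked numerically; the w, b, and point values below are arbitrary illustrations:

```python
import numpy as np

w = np.array([3.0, 4.0])    # normal vector to the line (arbitrary example)
b = -5.0                    # bias / scale value
x = np.array([2.0, 1.0])    # an arbitrary point

# Perpendicular distance from x to the line w . x + b = 0
distance = abs(w @ x + b) / np.linalg.norm(w)
print(distance)             # |3*2 + 4*1 - 5| / 5 = 1.0
```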
Linear SVM Mathematically
Let x+ be a point on the plus-plane w · x + b = +1 (the edge of the "predict class +1" zone) and x- the nearest point on the minus-plane w · x + b = -1 (the edge of the "predict class -1" zone), and let M be the margin width.
What we know:
w · x+ + b = +1
w · x- + b = -1
so w · (x+ - x-) = 2
and therefore the margin width is M = (x+ - x-) · w / ||w|| = 2 / ||w||.
Linear SVM Mathematically
Goal:
1) Correctly classify all training data: w · xi + b ≥ +1 if yi = +1 and w · xi + b ≤ -1 if yi = -1, which is the same as yi(w · xi + b) ≥ 1 for all i.
2) Maximize the margin M = 2 / ||w||, which is the same as minimizing ½ w^T w.
We can therefore formulate a quadratic optimization problem and solve for w and b:
Minimize Φ(w) = ½ w^T w
subject to yi(w · xi + b) ≥ 1 for all i.
Finding the Decision Boundary
Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi.
The decision boundary should classify all points correctly, i.e. yi(w^T xi + b) ≥ 1 for all i.
The decision boundary can be found by solving the following constrained optimization problem:
Minimize ½ ||w||^2 subject to yi(w^T xi + b) ≥ 1 for all i.
This is a constrained optimization problem; solving it requires some new tools.
Feel free to ignore the following several slides; what is important is the constrained optimization problem above.
Solving the Optimization Problem
Find w and b such that Φ(w) = ½ w^T w is minimized and, for all {(xi, yi)}: yi(w^T xi + b) ≥ 1.
We need to optimize a quadratic function subject to linear constraints.
Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.
The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint in the primal problem:
Find α1 … αN such that Q(α) = Σ αi - ½ ΣΣ αi αj yi yj xi^T xj is maximized and
(1) Σ αi yi = 0
(2) αi ≥ 0 for all αi
The Optimization Problem Solution
The solution has the form:
w = Σ αi yi xi    and    b = yk - w^T xk for any xk such that αk ≠ 0
Each non-zero αi indicates that the corresponding xi is a support vector.
The classifying function then has the form:
f(x) = Σ αi yi xi^T x + b
Notice that it relies on an inner product between the test point x and the support vectors xi; we will return to this later.
Also keep in mind that solving the optimization problem involved computing the inner products xi^T xj between all pairs of training points.
SVM – The Dual Formulation
The Optimization Problem
Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi. The decision boundary should classify all points correctly, giving the constrained optimization problem above, where ||w||^2 = w^T w.
Lagrangian of the Original Problem
The Lagrangian is

L(w, b, α) = ½ w^T w - Σi αi ( yi(w^T xi + b) - 1 )

where the αi ≥ 0 are Lagrangian multipliers.
Setting the gradient of L with respect to w and b to zero, we have

w = Σi αi yi xi    and    Σi αi yi = 0
The Dual Optimization Problem
We can transform the problem into its dual:

maximize W(α) = Σi αi - ½ Σi Σj αi αj yi yj xi^T xj
subject to αi ≥ 0 and Σi αi yi = 0

The αi's are new variables (the Lagrangian multipliers), and the data enter only through the dot products xi^T xj.
This is a convex quadratic programming (QP) problem, so the global maximum over the αi can always be found, and well-established tools exist for solving this optimization problem (e.g. CPLEX).
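In practice this QP is rarely solved by hand; as a sketch (assuming scikit-learn is available, on invented data), `SVC` with a linear kernel exposes the support vectors and the products αi·yi via `dual_coef_`:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
# Invented, linearly separable two-class data
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
               rng.normal([3, 3], 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6)      # very large C approximates a hard margin
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i:", clf.dual_coef_)            # non-zero only for support vectors
print("w =", clf.coef_, " b =", clf.intercept_)
```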
A Geometrical Interpretation
In the figure, the αi values are shown next to the data points of Class 1 and Class 2. Most points have αi = 0; only the support vectors have αi values different from zero (for example α8 = 0.6, α1 = 0.8, α6 = 1.4), and they hold up the separating plane.
Non-linearly Separable Problems
We allow an "error" ξi in classification; it is based on the output of the discriminant function w^T x + b, and Σi ξi approximates the number of misclassified samples.
The new objective function is

minimize ½ ||w||^2 + C Σi ξi
subject to yi(w^T xi + b) ≥ 1 - ξi and ξi ≥ 0 for all i

where C is a tradeoff parameter between error and margin; it is chosen by the user, and a large C means a higher penalty on errors.
The Optimization Problem
The dual of this problem is

maximize W(α) = Σi αi - ½ Σi Σj αi αj yi yj xi^T xj
subject to 0 ≤ αi ≤ C and Σi αi yi = 0

and w is again recovered as w = Σi αi yi xi.
The only difference from the linearly separable case is that there is now an upper bound C on the αi.
Once again, a QP solver can be used to find the αi efficiently.
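A brief sketch (again assuming scikit-learn, on invented overlapping data) of how the tradeoff parameter C behaves in the soft-margin formulation:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)
# Invented overlapping classes: not linearly separable
X = np.vstack([rng.normal([0, 0], 1.5, (50, 2)),
               rng.normal([2, 2], 1.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates errors (wider margin); large C penalizes errors heavily
    print(f"C={C:>6}: support vectors = {len(clf.support_vectors_)}, "
          f"train accuracy = {clf.score(X, y):.2f}")
```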
THANK YOU
