
FEM 2063 - Data Analytics

CHAPTER 6
Machine Learning Classification Methods

Overview
Machine Learning Classification Methods

➢k-Nearest Neighbours (kNN)

➢Support Vector Machines (SVM)

KNN

Tell me about your friends (who your neighbours are) and I will tell you who you are.
KNN

KNN – Different names

• K-Nearest Neighbours
• Memory-Based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Lazy Learning
KNN
What is it?

• A powerful classification algorithm widely used in pattern recognition today.

• k-Nearest Neighbours stores all available cases and classifies new cases based
  on a similarity measure (e.g. a distance function).

• Query points are categorized based on their distance to points in a training
  data set; various metrics can be used to measure the distance.

• A non-parametric, lazy learning algorithm (an instance-based learning method).
KNN

Classification Approach

• An object (a new instance) is classified by a majority vote of its
  neighbours' classes.

• The object is assigned to the most common class amongst its K nearest
  neighbours (measured by a distance function).
KNN
KNN
Distance Measure
KNN

Distance Measures for Continuous Variables

Distance Functions

• Euclidean: $\sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$

• Manhattan: $\sum_{i=1}^{k} |x_i - y_i|$

• Minkowski: $\left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}$
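As an illustrative aside (not from the original slides), these three distances could be computed in NumPy as follows; the vectors x and y are arbitrary toy values:

import numpy as np

x = np.array([3.0, 1.0, 2.0])
y = np.array([1.0, 4.0, 0.5])

# Euclidean: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan: sum of absolute differences
manhattan = np.sum(np.abs(x - y))

# Minkowski of order q (q = 1 gives Manhattan, q = 2 gives Euclidean)
q = 3
minkowski = np.sum(np.abs(x - y) ** q) ** (1.0 / q)

print(euclidean, manhattan, minkowski)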
KNN
Distance Between Neighbours

• Calculate the distance between a new example (E) and all examples in the
  training set.
• Euclidean distance between two examples.
– X = [x1,x2,x3,..,xn]
– Y = [y1,y2,y3,...,yn]
– The Euclidean distance between X and Y is defined as:
$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
KNN
K-Nearest Neighbour Algorithm
• All instances correspond to points in an n-dimensional feature space.
• Each instance is represented by a set of numerical attributes.
• The training data consist of a set of feature vectors, each with an associated class label.
• Classification is done by comparing the feature vector of a new example E with those of the training points:
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbours.
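A minimal NumPy sketch of this procedure (illustrative only; the function name and the toy 2-D data are not from the slides):

import numpy as np

def knn_classify(E, X_train, y_train, k=3):
    # Euclidean distance from the new example E to every training example
    distances = np.sqrt(np.sum((X_train - E) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbours' class labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([1.1, 0.9]), X_train, y_train, k=3))  # prints 0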
KNN
3-KNN: Example (1)
KNN
How to choose K?
• If K is too small, the classifier is sensitive to noise points.
• A larger K works better, but if K is too large the neighbourhood may include
  points from other classes.
• Rule of thumb: k ≈ √n, where n is the number of training examples.
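In practice, k is often chosen by cross-validation; a hedged scikit-learn sketch (the candidate values of k are arbitrary, and the toy data are reused from the example later in this chapter):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

x = np.array([3000, 1000, 2700, 1500, 1200, 1400, 4200, 580, 7390, 245]).reshape(-1, 1)
y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0])

# Try a few candidate values of k with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]}, cv=5)
search.fit(x, y)
print(search.best_params_)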


KNN
KNN

(a) 1-nearest neighbour (b) 2-nearest neighbour (c) 3-nearest neighbour

The k-nearest neighbours of a record x are the data points that have the
k smallest distances to x.
KNN
KNN Feature Weighting
• Scale each feature by its importance for classification

• Can use our prior knowledge about which features are more important
• Can learn the weights w_k using cross-validation
KNN

Feature Normalization
• The distance between neighbours can be dominated by attributes with
  relatively large values,
  e.g., the income of customers in our previous example.

• This arises when features are measured on different scales.

• It is therefore important to normalize such features,
  e.g., by mapping values to numbers between 0 and 1.
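A short sketch of min-max normalization with scikit-learn (the income and age values below are made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales: income and age
X = np.array([[52000.0, 25.0],
              [86000.0, 40.0],
              [31000.0, 33.0]])

# Map every feature to the range [0, 1] so no single feature dominates the distance
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)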
KNN
Nominal/Categorical Data

• Distance works naturally with numerical attributes.

• Binary categorical attributes can be encoded as 1 or 0 and compared with the
  Hamming distance:

$D_H = \sum_{i=1}^{k} |x_i - y_i|$

x = y ⇒ D = 0
x ≠ y ⇒ D = 1

X      Y       Distance
Male   Male    0
Male   Female  1
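A small sketch of the Hamming distance for categorical vectors (the extra "Blue" attribute is hypothetical, added only to show a longer vector):

import numpy as np

def hamming_distance(x, y):
    # Count the positions at which the two categorical vectors differ
    return int(np.sum(np.asarray(x) != np.asarray(y)))

print(hamming_distance(["Male"], ["Male"]))                    # 0
print(hamming_distance(["Male"], ["Female"]))                  # 1
print(hamming_distance(["Male", "Blue"], ["Female", "Blue"]))  # 1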
KNN
KNN Classification
KNN
KNN Classification – Distance
KNN
KNN Classification – Standardized Distance
KNN

Strengths of KNN

• Very simple and intuitive.

• Can be applied to the data from any distribution.

• Good classification if the number of samples is large enough.


KNN

Weaknesses of KNN

• Classifying a new example is slow: the distance from the new example to every
  training example must be calculated and compared.

• Choosing k may be tricky.

• A large number of samples is needed for good accuracy.


KNN
Example
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

x = np.array([3000, 1000, 2700, 1500, 1200, 1400, 4200, 580, 7390, 245]).reshape(-1, 1)
y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0])
xtest = np.array([1970, 2845, 4656, 5800, 900]).reshape(-1, 1)
ytest = np.array([0, 1, 1, 1, 1])

# define the model to be used
model = KNeighborsClassifier(n_neighbors=3)

# fit the (training) data
model.fit(x, y)

# predict labels for the test data
ypred = model.predict(xtest)
print(ypred)

Output: [0 1 1 1 0]
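Since ytest is defined in the example, the accuracy could also be checked; a small follow-up sketch (given the output shown above, this would print 0.8):

from sklearn.metrics import accuracy_score

# Fraction of test points whose predicted label matches ytest
print(accuracy_score(ytest, ypred))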
Overview
Machine Learning Classification Methods

➢k-Nearest Neighbours (kNN)

➢Support Vector Machines (SVM)

SVM
Introduction
Given N labelled empirical data:

$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N) \in X \times \{-1, +1\}$

where $X \subseteq \mathbb{R}^D$ is the set of input data and $y_i$ are the labels.

[Figure: two classes of points ($y_i = +1$ and $y_i = -1$) with class centres $c_1$ and $c_2$ in the domain X, plotted in the ($x_1$, $x_2$) plane.]
SVM

Introduction

[Figure: the same two classes with a separating hyperplane (the decision plane) and its unit normal vector $\mathbf{w}/\|\mathbf{w}\|$, plotted in the ($x_1$, $x_2$) plane over the domain X.]
SVM

Linear Classifiers

$f(\mathbf{x}, \mathbf{w}, b) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$

[Diagram: an input x is mapped by f to the estimate y_est. Data points are denoted +1 and −1; the region with $\mathbf{w} \cdot \mathbf{x} + b > 0$ is classified as +1 and the region with $\mathbf{w} \cdot \mathbf{x} + b < 0$ as −1.]

How would you classify this data?
SVM

Linear Classifiers

$f(\mathbf{x}, \mathbf{w}, b) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$

[Figure: another possible linear boundary separating the +1 and −1 points.]

How would you classify this data?
SVM

Linear Classifiers

$f(\mathbf{x}, \mathbf{w}, b) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$

[Figure: yet another possible linear boundary.]

How would you classify this data?
SVM

Linear Classifiers

$f(\mathbf{x}, \mathbf{w}, b) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$

[Figure: several candidate linear boundaries.]

Any of these would be fine...

...but which is best?


SVM

Linear Classifiers

$f(\mathbf{x}, \mathbf{w}, b) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$

[Figure: a poorly placed boundary; a point is misclassified into the +1 class.]

How would you classify this data?
SVM
Classifier Margin

$f(\mathbf{x}, \mathbf{w}, b) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$

Define the margin of a linear classifier as the width by which the boundary
could be increased before hitting a datapoint.
SVM
Maximum Margin

$f(\mathbf{x}, \mathbf{w}, b) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$

The maximum margin linear classifier is the linear classifier with the
maximum margin.

Support vectors are those datapoints that the margin pushes up against.

This is the simplest kind of SVM, called a linear SVM (LSVM).
SVM
Good Decision Boundary: the Margin Should Be Large
The decision boundary should be as far away from the data of both classes as
possible, i.e. we should maximize the margin m.

[Figure: Class 1 and Class 2 separated by a boundary with margin m.]
SVM

Training set is linearly separable

$H_2 = \{\mathbf{z} : \mathbf{w} \cdot \mathbf{z} - b = 1\}, \qquad H_1 = \{\mathbf{z} : \mathbf{w} \cdot \mathbf{z} - b = -1\}$

Take $\mathbf{z}_i \in H_i$, $i = 1, 2$. Then

$d(H_1, H_2) = \frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot (\mathbf{z}_2 - \mathbf{z}_1) = \frac{1}{\|\mathbf{w}\|}(\mathbf{w} \cdot \mathbf{z}_2 - \mathbf{w} \cdot \mathbf{z}_1) = \frac{1}{\|\mathbf{w}\|}\bigl(1 + b - (-1 + b)\bigr)$

$d(H_1, H_2) = \frac{2}{\|\mathbf{w}\|}$
SVM
Linear SVM

[Figure: the hyperplanes $\mathbf{w} \cdot \mathbf{x} + b = +1$, $\mathbf{w} \cdot \mathbf{x} + b = 0$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$, with normal direction $\mathbf{w}/\|\mathbf{w}\|$, in the ($x_1$, $x_2$) plane.]

Therefore, maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|^2$.
SVM
The Optimization Problem
• Let {x1, ..., xn} be our data set and let yi ∈ {1, −1} be the class label of xi.

• The decision boundary should classify all points correctly:
  $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$ for all $i$.

• This gives a constrained optimization problem:
  minimize $\tfrac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$ for all $i$.
SVM
The Optimization Problem
• Using Lagrange multipliers, the problem can be transformed into its dual
  (a sketch of the dual is given below).

• This is a quadratic programming (QP) problem: a global maximum of the
  multipliers $a_i$ can always be found.
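The dual itself is not reproduced in the extracted slide; a standard statement of it, written in LaTeX, is sketched here:

$$
\begin{aligned}
\max_{a}\quad & W(a) = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j) \\
\text{subject to}\quad & a_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{n} a_i y_i = 0 .
\end{aligned}
$$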

• w can be recovered by $\mathbf{w} = \sum_i a_i y_i \mathbf{x}_i$.
SVM
Characteristics of the Solution

• Many of the $a_i$ are zero:
  – w is a linear combination of a small number of data points
  – this is a sparse representation.

• The $\mathbf{x}_i$ with non-zero $a_i$ are called support vectors (SV).

• The decision boundary is determined only by the SV.
• Let $t_j$ ($j = 1, \ldots, s$) be the indices of the $s$ support vectors. We can write
  $\mathbf{w} = \sum_{j=1}^{s} a_{t_j} y_{t_j} \mathbf{x}_{t_j}$.
SVM
Characteristics of the Solution

For testing with a new data point z, compute

$f(\mathbf{z}) = \mathbf{w} \cdot \mathbf{z} + b = \sum_{j=1}^{s} a_{t_j} y_{t_j} (\mathbf{x}_{t_j} \cdot \mathbf{z}) + b$

and classify z as class 1 if the sum is positive, and class 2 otherwise.
SVM
A Geometrical Interpretation

[Figure: Class 1 and Class 2 points with their multipliers. The support vectors on the margin have non-zero multipliers ($a_1 = 0.8$, $a_6 = 1.4$, $a_8 = 0.6$); all other points have $a_i = 0$.]
SVM
If Nonlinear?
If the data are not linearly separable, map them to a higher-dimensional space
through a kernel function.
SVM

Examples of kernel functions

Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^2$

Radial Basis Function (RBF): $K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\frac{\|\mathbf{x}_j - \mathbf{x}_i\|^2}{2\sigma^2}}$

Laplace: $K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\frac{\|\mathbf{x}_j - \mathbf{x}_i\|}{\sigma}}$
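A hedged scikit-learn sketch of trying different kernels on the toy data used elsewhere in this chapter (the degree, coef0 and gamma settings are arbitrary choices, and a Laplace kernel is not built into SVC):

import numpy as np
from sklearn.svm import SVC

x = np.array([3000, 1000, 2700, 1500, 1200, 1400, 4200, 580, 7390, 245]).reshape(-1, 1)
y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0])

# Polynomial kernel of degree 2 (with the +1 offset via coef0) and an RBF kernel
poly_model = SVC(kernel="poly", degree=2, coef0=1.0)
rbf_model = SVC(kernel="rbf", gamma="scale")

for model in (poly_model, rbf_model):
    model.fit(x, y)
    print(model.kernel, model.predict(np.array([[1970], [4656]])))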
SVM
Dataset with noise
◼ Hard margin: require that all data points be correctly classified (no training error).

◼ What if the training set is noisy?

  - Solution 1: use very powerful kernels. But this leads to OVERFITTING!
SVM
Hard Margin vs. Soft Margin
◼ The hard-margin formulation:
Find w and b such that
Φ(w) = ½ wᵀw is minimized, and for all {(xi, yi)}:
yi (wᵀxi + b) ≥ 1

◼ The formulation incorporating slack variables (soft margin):

Find w and b such that
Φ(w) = ½ wᵀw + CΣξi is minimized, and for all {(xi, yi)}:
yi (wᵀxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i

◼ The parameter C can be viewed as a way to control overfitting.
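A small sketch of how C might be varied in scikit-learn (the toy data are the ones from the worked example; the C values are arbitrary):

import numpy as np
from sklearn.svm import SVC

x = np.array([3000, 1000, 2700, 1500, 1200, 1400, 4200, 580, 7390, 245]).reshape(-1, 1)
y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0])

# A small C tolerates more margin violations; a large C penalizes them heavily
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C)
    model.fit(x, y)
    print(C, model.n_support_)  # number of support vectors per class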


SVM
More than 2 classes?
The SVM as defined works for K = 2 classes. What do we do if we have K > 2
classes? Common strategies combine several binary SVMs, e.g. one-versus-rest
or one-versus-one (a sketch follows).
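As a hedged illustration: scikit-learn's SVC handles K > 2 classes internally (one-versus-one), while one-versus-rest can be made explicit with OneVsRestClassifier. The three-class toy data below are invented for this sketch:

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three well-separated classes; one binary SVM is trained per class against the rest
X = np.array([[1.0], [1.2], [5.0], [5.3], [9.0], [9.4]])
y = np.array([0, 0, 1, 1, 2, 2])

model = OneVsRestClassifier(SVC(kernel="linear"))
model.fit(X, y)
print(model.predict(np.array([[1.1], [5.1], [9.2]])))  # expected: [0 1 2]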
SVM

Strengths

• It has a unique solution

• It can deal with non-linearity


SVM

Weaknesses

• It is sensitive to noise: a relatively small number of mislabeled examples
  can dramatically decrease the performance.

• The choice of the kernel is not obvious!


SVM
Example
from sklearn.svm import SVC
import numpy as np

x = np.array([3000, 1000, 2700, 1500, 1200, 1400, 4200, 580, 7390, 245]).reshape(-1, 1)
y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0])
xtest = np.array([1970, 2845, 4656, 5800, 900]).reshape(-1, 1)
ytest = np.array([0, 1, 1, 1, 1])

# define the model to be used
model = SVC(kernel='linear')

# fit the (training) data
model.fit(x, y)

# predict labels for the test data
ypred = model.predict(xtest)
print(ypred)

Output: [0 1 1 1 0]
