
MACHINE LEARNING [IV Year II Sem]

UNIT-IV
Linear models: The least-squares method, The perceptron: a heuristic learning algorithm for
linear classifiers, Support vector machines, obtaining probabilities from linear classifiers, Going
beyond linearity with kernel methods.
Distance Based Models: Introduction, Neighbors and exemplars, Nearest Neighbours
classification, Distance Based Clustering, Hierarchical Clustering.

Linear regression is a linear model, i.e. a model that assumes a linear relationship between the
input variables (x) and the single output variable (y). More specifically, it assumes that y can be
calculated from a linear combination of the input variables (x).
 When there is a single input variable (x), the method is referred to as simple linear
regression.
 When there are multiple input variables, literature from statistics often refers to the
method as multiple linear regression.
 Different techniques can be used to prepare or train the linear regression equation from
data, the most common of which is called Ordinary Least Squares. It is common to
therefore refer to a model prepared this way as Ordinary Least Squares Linear Regression
or just Least Squares Regression.

Linear Regression Model Representation

 Linear regression is an attractive model because the representation is so simple.


 The representation is a linear equation that combines a specific set of input values (x), the
solution to which is the predicted output for that set of input values (y). As such, both the
input values (x) and the output value (y) are numeric.
 The linear equation assigns one scale factor to each input value or column, called a
coefficient and represented by the capital Greek letter Beta (B). One additional
coefficient is also added, giving the line an additional degree of freedom (e.g. moving up
and down on a two-dimensional plot) and is often called the intercept or the bias
coefficient.
 For example, in a simple regression problem (a single x and a single y), the form of the
model would be:
y = B0 + B1*x


 In higher dimensions when we have more than one input (x), the line is called a plane or
a hyper-plane. The representation therefore is the form of the equation and the specific
values used for the coefficients (e.g. B0 and B1 in the above example).
 It is common to talk about the complexity of a regression model like linear regression.
This refers to the number of coefficients used in the model.

 When a coefficient becomes zero, it effectively removes the influence of the input
variable on the model and therefore from the prediction made by the model (0 * x = 0).
This becomes relevant if you look at regularization methods that change the learning
algorithm to reduce the complexity of regression models by putting pressure on the
absolute size of the coefficients, driving some to zero.

 How to calculate Linear regression through Least Squares Method
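The calculation can be sketched in Python. Below is a minimal illustration of simple linear regression fitted by ordinary least squares; the sample x and y values are illustrative, not taken from the original worked example.

```python
# Simple linear regression by ordinary least squares:
# B1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2),  B0 = mean(y) - B1 * mean(x)

x = [1, 2, 3, 4, 5]          # illustrative input values
y = [2, 4, 5, 4, 5]          # illustrative output values

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x

print(f"y = {b0:.2f} + {b1:.2f}*x")   # fitted line: y = 2.20 + 0.60*x for this data
```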


 How to calculate R squared using regression Analysis
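A minimal sketch of the R squared calculation, continuing the illustrative data above; R² = 1 − (residual sum of squares) / (total sum of squares), where the fitted coefficients are the ones produced by the previous sketch.

```python
# R squared measures how much of the variance in y is explained by the regression line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6            # fitted line from the least-squares sketch above

mean_y = sum(y) / len(y)
predictions = [b0 + b1 * xi for xi in x]

ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))   # residual sum of squares
ss_tot = sum((yi - mean_y) ** 2 for yi in y)                     # total sum of squares

r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")   # 0.600 for this data
```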


 Calculation of standard error of the estimate using Regression Analysis
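A minimal sketch of the standard error of the estimate for simple linear regression, computed as sqrt(SS_res / (n − 2)); the data and fitted coefficients are the same illustrative values used above.

```python
import math

# Standard error of the estimate: sqrt(sum of squared residuals / (n - 2))
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ss_res = sum(r ** 2 for r in residuals)

std_error = math.sqrt(ss_res / (len(x) - 2))
print(f"standard error of the estimate = {std_error:.3f}")
```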


 The perceptron: a heuristic learning algorithm for linear classifiers


 In machine learning, the perceptron is an algorithm for supervised learning of binary
classifiers.
 A binary classifier is a function which can decide whether or not an input, represented by
a vector of numbers, belongs to some specific class.
 It is a type of linear classifier, i.e. a classification algorithm that makes its predictions
based on a linear predictor function combining a set of weights with the feature vector.
 Example:
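A minimal sketch of the perceptron learning rule in Python; the training data, learning rate (eta) and number of epochs below are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

# Perceptron learning rule: w <- w + eta * y * x whenever example (x, y) is misclassified.
def train_perceptron(X, y, eta=1.0, epochs=100):
    """X: (n_samples, n_features) array, y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])      # weight vector
    b = 0.0                       # bias (intercept)
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += eta * yi * xi
                b += eta * yi
                mistakes += 1
        if mistakes == 0:         # converged: all training examples correctly classified
            break
    return w, b

# Illustrative linearly separable data: class +1 above the line x1 + x2 = 3, class -1 below.
X = np.array([[1.0, 1.0], [2.0, 0.5], [2.5, 2.5], [3.0, 3.5]])
y = np.array([-1, -1, 1, 1])
w, b = train_perceptron(X, y)
print("weights:", w, "bias:", b)
```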


 Support Vector Machines


 Introduction to SVMs:
In machine learning, support vector machines (SVMs, also support vector networks) are
supervised learning models with associated learning algorithms that analyze data used for
classification and regression analysis.
 A Support Vector Machine (SVM) is a discriminative classifier formally defined by a
separating hyperplane. In other words, given labeled training data (supervised learning),
the algorithm outputs an optimal hyperplane which categorizes new examples.

What is Support Vector Machine?

 An SVM model is a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as wide as possible.
 In addition to performing linear classification, SVMs can efficiently perform a non-linear
classification, implicitly mapping their inputs into high-dimensional feature spaces.

What does SVM do?

Given a set of training examples, each marked as belonging to one or the other of two categories,
an SVM training algorithm builds a model that assigns new examples to one category or the
other, making it a non-probabilistic binary linear classifier.

Working of SVM

An SVM model is basically a representation of different classes in a hyperplane in
multidimensional space. The hyperplane will be generated in an iterative manner by SVM so that
the error can be minimized. The goal of SVM is to divide the datasets into classes to find a
maximum marginal hyperplane (MMH).


The following are important concepts in SVM −

 Support Vectors − Data points that are closest to the hyperplane are called support vectors.
The separating line will be defined with the help of these data points.

 Hyperplane − A decision plane or surface which separates a set of objects belonging to
different classes.

 Margin − The gap between the two lines drawn through the closest data points of
different classes. It can be calculated as the perpendicular distance from the line to the
support vectors. A large margin is considered a good margin and a small margin is
considered a bad margin.

The main goal of SVM is to divide the datasets into classes to find a maximum marginal
hyperplane (MMH), which can be done in the following two steps −

 First, SVM will generate hyperplanes iteratively that segregate the classes in the best way.

 Then, it will choose the hyperplane that separates the classes correctly.

SVM Kernels

In practice, the SVM algorithm is implemented with a kernel that transforms the input data space
into the required form. SVM uses a technique called the kernel trick, in which the kernel takes a
low-dimensional input space and transforms it into a higher-dimensional space. In simple words,
the kernel converts non-separable problems into separable problems by adding more dimensions.
This makes SVM more powerful, flexible and accurate. The following are some of the types of
kernels used by SVM.

Linear Kernel

It can be used as a dot product between any two observations. The formula of the linear kernel is
as below −

K(x, xi) = sum(x * xi)

From the above formula, we can see that the product between two vectors x and xi is the sum
of the products of each pair of input values.

Polynomial Kernel

It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input
spaces. Following is the formula for the polynomial kernel −


K(X, Xi) = (1 + sum(X * Xi))^d

Here d is the degree of the polynomial, which we need to specify manually in the learning algorithm.

Radial Basis Function (RBF) Kernel

The RBF kernel, mostly used in SVM classification, maps the input space into an infinite-dimensional
space. The following formula explains it mathematically −

K(x, xi) = exp(-gamma * sum((x − xi)^2))

Here, gamma ranges from 0 to 1. We need to manually specify it in the learning algorithm. A
good default value of gamma is 0.1.
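The three kernels above can be written directly as small Python functions; this is a minimal sketch using NumPy, with the degree and gamma values chosen only for illustration.

```python
import numpy as np

def linear_kernel(x, xi):
    # K(x, xi) = sum(x * xi), i.e. the dot product between the two vectors
    return np.dot(x, xi)

def polynomial_kernel(x, xi, d=2):
    # K(x, xi) = (1 + sum(x * xi))^d, where d is the degree of the polynomial
    return (1 + np.dot(x, xi)) ** d

def rbf_kernel(x, xi, gamma=0.1):
    # K(x, xi) = exp(-gamma * sum((x - xi)^2))
    return np.exp(-gamma * np.sum((x - xi) ** 2))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.5])
print(linear_kernel(a, b), polynomial_kernel(a, b), rbf_kernel(a, b))
```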

Just as SVM can be implemented for linearly separable data, it can also be implemented in Python
for data that is not linearly separable. This is done by using kernels, as shown in the sketch below.
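A minimal sketch using scikit-learn (assumed to be available) that fits SVM classifiers with the linear, polynomial and RBF kernels on a toy dataset that is not linearly separable; the dataset parameters and hyperparameter values are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data that is not linearly separable: one class forms a ring around the other.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=3, gamma=0.1)   # degree is only used by the 'poly' kernel
    clf.fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))
```

The RBF kernel usually separates this ring-shaped data best, since the linear kernel cannot draw a circular boundary in the input space.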

Pros and Cons of SVM Classifiers

Pros of SVM classifiers

SVM classifiers offer great accuracy and work well in high-dimensional spaces. SVM
classifiers use only a subset of the training points (the support vectors) and as a result require very little memory.

Cons of SVM classifiers

They have a high training time, so in practice they are not suitable for large datasets. Another
disadvantage is that SVM classifiers do not work well with overlapping classes.


 Obtaining probabilities from linear classifiers
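A standard way to obtain probabilities from a linear classifier is to pass its score (the signed distance from the decision boundary) through the logistic (sigmoid) function, so that large positive scores map to probabilities near 1 and large negative scores map to probabilities near 0. The sketch below assumes an already trained weight vector w and threshold t; the values are illustrative.

```python
import numpy as np

def predict_proba(x, w, t):
    """Convert the linear score w.x - t into an estimated probability of the positive class."""
    score = np.dot(w, x) - t              # signed distance from the decision boundary (up to scaling)
    return 1.0 / (1.0 + np.exp(-score))   # logistic (sigmoid) mapping into [0, 1]

w = np.array([1.5, -0.5])   # illustrative learned weights
t = 0.25                    # illustrative decision threshold
x = np.array([1.0, 2.0])
print(predict_proba(x, w, t))   # estimated probability that x belongs to the positive class
```

Logistic regression trains the weights so that these probabilities are calibrated; for other linear classifiers, the sigmoid output is only a heuristic estimate unless the scores are calibrated separately.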


 Going beyond linearity with kernel methods


 We then looked at the heuristic perceptron algorithm for linearly separable
data, and the support vector machine which finds the unique decision
boundary with maximum margin and which can be adapted to non-separable
data.
 In this section we show that these techniques can be adapted to learn non-linear
decision boundaries. The main idea is simple: transform the data non-linearly to a
feature space in which linear classification can be applied.
 It is customary to call the transformed space the feature space and the original space
the input space. The approach is thus to transform the training data to feature space
and learn a model there. In order to classify new data we transform it to feature space
as well and apply the model.
 Take the perceptron algorithm in dual form, for example. The algorithm is a simple
counting algorithm – the only operation that is somewhat involved is testing whether
example xi is correctly classified, by evaluating the decision value
sum_j(alpha_j * yj * (xj · xi)) and checking its sign against yi. Since this depends on the
training examples only through dot products, those dot products can be replaced by kernel
evaluations K(xj, xi), yielding a decision boundary that is non-linear in the original input space.
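A minimal sketch of the dual (kernelised) perceptron under these assumptions, using labels in {−1, +1} and an RBF kernel; the data and gamma value are illustrative.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron(X, y, kernel=rbf, epochs=100):
    """Dual perceptron: alpha[j] counts how often example j was misclassified during training."""
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            # decision value for example i: sum_j alpha_j * y_j * K(x_j, x_i)
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1       # count the mistake on example i
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(x_new, X, y, alpha, kernel=rbf):
    return np.sign(sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha, y, X)))

# XOR-like data: not linearly separable in the input space, separable with the RBF kernel.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1, -1, 1, 1])
alpha = kernel_perceptron(X, y)
print(predict(np.array([0.9, 0.1]), X, y, alpha))   # expected: 1.0
```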


 Distance Based Models:

A number of machine learning algorithms, supervised or unsupervised, use distance metrics to
understand the pattern of the input data in order to make data-based decisions. A good distance
metric helps significantly improve the performance of classification, clustering and information
retrieval.

Introduction

In many real world applications, we use Machine Learning algorithms for classifying or
recognizing images and for retrieving information through an Image’s content. For example -
Face recognition, Censored Images online, Retail Catalog, Recommendation Systems etc.

Choosing a good distance metric becomes really important here. The distance metric helps
algorithms to recognize similarities between the contents.

A distance metric uses a distance function, which provides a relationship measure between the
elements in the dataset.


Distance Function

Do you remember studying the Pythagorean theorem? If you do, then you might remember
calculating the distance between two data points using the theorem.

In order to calculate the distance between data points A and B, the Pythagorean theorem considers
the lengths of the x and y axes.

A distance function provides the distance between the elements of a set. If the distance is zero
then the elements are equivalent, else they are different from each other.

A distance function is nothing but a mathematical formula used by distance metrics. The distance
function can differ across different distance metrics.
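A minimal sketch of some common distance functions in Python; the Minkowski distance generalises both Manhattan (p = 1) and Euclidean (p = 2), and the sample points are illustrative.

```python
import math

def euclidean(a, b):
    # straight-line distance (Pythagorean theorem in n dimensions)
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    # sum of absolute coordinate differences (city-block distance)
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def minkowski(a, b, p=2):
    # generalisation: p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)

A, B = (1.0, 2.0), (4.0, 6.0)
print(euclidean(A, B))        # 5.0
print(manhattan(A, B))        # 7.0
print(minkowski(A, B, p=3))   # approximately 4.5
```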

o Distance-based models are the second class of Geometric models. Like Linear models,
distance-based models are based on the geometry of data.
o As the name implies, distance-based models work on the concept of distance. 


o In the context of machine learning, the concept of distance is not based merely on the
physical distance between two points. Instead, we could think of the distance between
two points in terms of the mode of transport between them.
o Travelling between two cities by plane covers less distance physically than by train,
because a plane is unrestricted. Similarly, in chess, the concept of distance depends on
the piece used – for example, a Bishop can move diagonally.
o Thus, depending on the entity and the mode of travel, the concept of distance can be
experienced differently.
o The distance metrics commonly used are Euclidean, Minkowski, Manhattan,
and Mahalanobis.

o Distance is applied through the concept of neighbours and exemplars.


 Neighbours are points in proximity with respect to the distance measure
expressed through exemplars.
 Exemplars are either centroids that find a centre of mass according to a chosen
distance metric or medoids that find the most centrally located data point.
 The most commonly used centroid is the arithmetic mean, which minimises
squared Euclidean distance to all other points.

Notes:

 The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean
position of all the points in the figure from the centroid point. This definition extends to
any object in n-dimensional space: its centroid is the mean position of all the points.


 Medoids are similar in concept to means or centroids. Medoids are most commonly used
on data when a mean or centroid cannot be defined. They are used in contexts where the
centroid is not representative of the dataset, such as in image data.
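A minimal sketch contrasting a centroid with a medoid on a small illustrative set of points.

```python
import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 2.0], [8.0, 8.0]])  # last point is an outlier

# Centroid: arithmetic mean, which minimises squared Euclidean distance to all points
centroid = points.mean(axis=0)

# Medoid: the actual data point with the smallest total distance to all other points
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)  # pairwise distances
medoid = points[dists.sum(axis=1).argmin()]

print("centroid:", centroid)   # not necessarily one of the data points
print("medoid:", medoid)       # always one of the data points
```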

Examples of distance-based models include the nearest-neighbour models, which use the
training data as exemplars – for example, in classification. The K-means clustering algorithm
also uses exemplars to create clusters of similar data points.

 Nearest Neighbours Classification


The K-nearest neighbours (KNN) algorithm is a type of supervised ML algorithm which can be used
for both classification as well as regression predictive problems. However, it is mainly used for
classification predictive problems in industry. The following two properties define KNN well −

 Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a
specialized training phase and uses all the data for training while classifying.

 Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm
because it doesn’t assume anything about the underlying data.

Working of KNN Algorithm

The K-nearest neighbours (KNN) algorithm uses ‘feature similarity’ to predict the values of new
data points, which means that a new data point will be assigned a value based on how closely it
matches the points in the training set. We can understand its working with the help of the
following steps −

Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN, we
must load the training as well as the test data.

Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider.
K can be any integer.

Step 3 − For each point in the test data do the following −

 3.1 − Calculate the distance between the test data and each row of the training data with the
help of any of the methods, namely Euclidean, Manhattan or Hamming distance. The most
commonly used method to calculate distance is Euclidean.

 3.2 − Now, based on the distance values, sort them in ascending order.

 3.3 − Next, it will choose the top K rows from the sorted array.


 3.4 − Now, it will assign a class to the test point based on the most frequent class of these
rows.

Step 4 – End
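The steps above can be sketched as a small KNN classifier in Python; the choice of Euclidean distance and the sample data are illustrative assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, test_point, k=3):
    # Steps 3.1-3.2: compute distances to every training row and sort ascending
    distances = sorted((euclidean(x, test_point), label) for x, label in zip(train_X, train_y))
    # Step 3.3: take the top K rows
    top_k = [label for _, label in distances[:k]]
    # Step 3.4: assign the most frequent class among the K neighbours
    return Counter(top_k).most_common(1)[0][0]

# Illustrative training data with two classes
train_X = [(20, 30), (25, 35), (30, 40), (70, 70), (65, 75), (80, 60)]
train_y = ["blue", "blue", "blue", "red", "red", "red"]

print(knn_predict(train_X, train_y, (60, 60), k=3))   # expected: "red"
```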

Example

The following is an example to understand the concept of K and working of KNN algorithm −

Suppose we have a dataset of points belonging to a blue class and a red class.

Now, we need to classify a new data point, shown as a black dot at point (60, 60), into the blue or
red class. We are assuming K = 3, i.e. the algorithm finds the three nearest data points.

Among the three nearest neighbours of the black dot, two of them lie in the red class; hence the
black dot will also be assigned to the red class.

Pros and Cons of KNN

Pros

 It is a very simple algorithm to understand and interpret.

 It is very useful for nonlinear data because there is no assumption about data in this
algorithm.

 It is a versatile algorithm as we can use it for classification as well as regression.

 It has relatively high accuracy but there are much better supervised learning models than
KNN.

Cons

 It is a computationally expensive algorithm because it stores all of the training data.

 It requires high memory storage compared to other supervised learning algorithms.

 Prediction is slow when the number of training examples (N) is large.

 It is very sensitive to the scale of the data as well as to irrelevant features.

Applications of KNN

The following are some of the areas in which KNN can be applied successfully −

Banking System

KNN can be used in a banking system to predict whether an individual is fit for loan approval, by
checking whether that individual has characteristics similar to those of defaulters.

Calculating Credit Ratings

KNN algorithms can be used to find an individual’s credit rating by comparing them with persons
having similar traits.


Politics

With the help of KNN algorithms, we can classify a potential voter into various classes like
“Will Vote”, “Will Not Vote”, “Will Vote for Party ‘Congress’” and “Will Vote for Party ‘BJP’”.

Other areas in which KNN algorithm can be used are Speech Recognition, Handwriting
Detection, Image Recognition and Video Recognition.

 Distance Based Clustering

 Clustering methods are among the most useful unsupervised ML methods. These
methods are used to find similarity as well as relationship patterns among data
samples, and then to cluster those samples into groups having similarity based on features.
 Clustering is important because it determines the intrinsic grouping among the present
unlabelled data. Clustering algorithms make some assumptions about data points to constitute
their similarity, and each assumption will construct different but equally valid clusters.
 For example, a clustering system groups similar kinds of data together into different clusters;
a minimal sketch of distance-based clustering with K-means is given below.
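A minimal sketch of K-means, the distance-based clustering algorithm mentioned earlier, using scikit-learn (assumed to be available); the data and the number of clusters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 2-D data forming two loose groups
X = np.array([[1, 2], [1.5, 1.8], [2, 2.2],
              [8, 8], [8.5, 9], [9, 8.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_)               # which cluster each point was assigned to
print("cluster centroids:", kmeans.cluster_centers_)   # the exemplars (centroids) found
```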

 Hierarchical clustering

Introduction to Hierarchical Clustering

 Hierarchical clustering is another unsupervised learning algorithm that is used to group
together the unlabelled data points having similar characteristics.
 Hierarchical clustering algorithms fall into the following two categories.
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data
point is treated as a single cluster, and then pairs of clusters are successively merged or
agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram
or tree structure.

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all
the data points are treated as one big cluster and the process of clustering involves dividing
(Top-down approach) the one big cluster into various small clusters.

Steps to Perform Agglomerative Hierarchical Clustering

We are going to explain the most used and important hierarchical clustering, i.e. agglomerative.
The steps to perform it are as follows −
 Step 1 − Treat each data point as a single cluster. Hence, we will be having, say, K clusters
at the start. The number of data points will also be K at the start.
 Step 2 − Now, in this step we need to form a bigger cluster by joining the two closest data points.
This will result in a total of K-1 clusters.
 Step 3 − Now, to form more clusters we need to join the two closest clusters. This will result
in a total of K-2 clusters.
 Step 4 − Now, repeat the above steps until only one big cluster remains, i.e. there are no
more clusters left to join.
 Step 5 − At last, after making one single big cluster, the dendrogram is used to divide it
into multiple clusters depending upon the problem.
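A minimal sketch of agglomerative (bottom-up) clustering using SciPy (assumed to be available); the data, the linkage method and the cut into two clusters are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D data forming two loose groups
X = np.array([[1, 2], [1.5, 1.8], [2, 2.2],
              [8, 8], [8.5, 9], [9, 8.5]])

# Bottom-up merging: each point starts as its own cluster, the closest pairs merge first
Z = linkage(X, method="single")        # the linkage matrix encodes the dendrogram

# Cut the dendrogram to obtain a chosen number of clusters (here, 2)
labels = fcluster(Z, t=2, criterion="maxclust")
print("cluster labels:", labels)
```

Cutting the dendrogram at a different level, or choosing a different linkage method, yields a different number or shape of clusters depending on the problem.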
